BACKGROUND


 The problems with FORTRAN 66 Hollerith data are well known.  The KARxxx
 routines largely removed them, but once Hollerith support is no longer
 available, FORTRAN 77 CHARACTER data will have to be used.

 In the view of the author, the definition of CHARACTER data in the 1977
 FORTRAN Standard was very poorly done, and has done significant harm to
 FORTRAN software portability.  This is a strong statement, and it bears
 some explanation.

 First of  all,  the  Hollerith  data type  is  dropped  from  the  1977
 Standard.  This  means  that a  very  large body  of  existing  FORTRAN
 software which  uses  character  data, even  in  an  at-present  widely
 portable fashion, may require extensive  changes to run with a  FORTRAN
 77 compiler, unless manufacturers can be pressed to continue support of
 character data stored in Hollerith constants and variables.

 The 1977  standard  prohibits  all storage  equivalencing,  either  via
 COMMON  and  EQUIVALENCE  statements,  or  by  FUNCTION  or  SUBROUTINE
 argument associations,  between CHARACTER  data and  all other  FORTRAN
 data types.  This is in sharp contrast to the usual lax implementations
 of FORTRAN for all other data types.  The prohibition was necessary to
 enable FORTRAN 77 to support character strings of indefinite length, so
 that declarations of the form

       SUBROUTINE A (B)
       CHARACTER B*(*)

 could be permitted, allowing CHARACTER variables to inherit a string
 length from a calling program.  This forces a compiler to generate code
 that passes to a called routine the address of a string descriptor
 containing size information as well as the actual address of the
 character data.  Also, on word-addressed machines, CHARACTER data may
 begin in the middle of a word, so storage equivalencing could be
 problematic.
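
 As a minimal sketch of the inherited-length mechanism just described,
 the fragment below passes strings of two different declared lengths to
 one routine, which reports what LEN() sees.  The names DEMO, SHOWLN, S5
 and S12 are invented for illustration, and the layout of list-directed
 output varies between compilers.

```fortran
      PROGRAM DEMO
C     Hypothetical driver: two strings of different declared length.
      CHARACTER S5*5, S12*12
      S5 = 'ABCDE'
      S12 = 'ABCDEFGHIJKL'
      CALL SHOWLN (S5)
      CALL SHOWLN (S12)
      END

      SUBROUTINE SHOWLN (B)
C     B inherits its length from the actual argument in the caller.
      CHARACTER B*(*)
      WRITE (*,*) 'LENGTH =', LEN(B)
      RETURN
      END
```

 With a conforming compiler the two calls should report lengths of 5 and
 12, which is only possible if a length travels with each argument.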

 Second, standardized library support of  character data in the form  of
 useful utility routines  is non-existent  in the  1977 Standard,  apart
 from the ICHAR and  CHAR functions for  converting between INTEGER  and
 CHARACTER form.

 Third, null character strings, that is, strings of zero length, are not
 permitted.  Null strings  are in  fact quite useful,  and indeed,  even
 necessary in some applications.  In particular, a null string cannot be
 simulated by any string of non-zero length.

 Fourth, the 1977  Standard does  not specify  the character  set to  be
 used.  The fact that many  manufacturers employ their private  versions
 of character sets, each with  its own special character repertoire  and
 collating sequence,  only continues  to perpetrate  additional  machine
 dependence upon FORTRAN users.

 Fifth, the 1977 Standard, in allowing declarations of the form
 CHARACTER*n, did not specify what minimum 'n' should be supported by a
 standard-conforming compiler.  One might hope that this would not be
 less than the number of characters that could reside in the host
 machine's (possibly virtual) address space.  At the least, one might
 expect that an assignment of the form "A='long string'" spanning the
 19 continuation lines allowed by the Standard would be accepted.

 Alas, few compilers permit even this much.  String length limits of
 128, 256, and 512 are common, and only a few compilers (e.g. ElXsi and
 DEC-20) set the limit at the machine address space size.
 Interestingly, the  1977 Standard  clearly  states that  a  CHARACTER*n
 argument passed to a subprogram can be legally received as an array  of
 n CHARACTER*1 values, and vice versa.  Since none of the compilers
 seems to put a limit on array sizes, it is odd that they do so on
 string lengths.  The reason, of course, is the peculiar requirement of
 the Standard that the LEN() function be able to return the declared
 length
 of its argument string; no such function is provided for obtaining  the
 declared  dimension  of  an  array.   Most  implementations   therefore
 represent a  CHARACTER variable  by a  string descriptor  containing  a
 length field and an address field,  and both of these have fixed  sizes
 allotted to them.   It seems foolish  that although most  architectures
 now require 24 or more bits for the  address field, only 7, 8, 9 or  16
 should be allocated for the length field to "save storage".

 Sixth, although  the 1977  Standard removed  many of  the  unreasonable
 restrictions on where expressions could be permitted in FORTRAN  source
 code, it  introduced a  new one  in the  form of  prohibiting taking  a
 substring of a constant or an expression!
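
 To make the restriction concrete: a form such as 'ABCDEF'(2:4) is not
 accepted, so the constant or expression must first be stored in a named
 variable.  A small sketch of that workaround follows; the names SUBSTR,
 T and S are chosen only for illustration.

```fortran
      PROGRAM SUBSTR
C     T and S are illustrative names, not taken from any library.
      CHARACTER T*6, S*3
C     Not permitted in FORTRAN 77:  S = 'ABCDEF'(2:4)
C     Workaround: copy the constant to a variable, then take the
C     substring of the variable.
      T = 'ABCDEF'
      S = T(2:4)
      WRITE (*,*) S
      END
```

 Here S receives the three characters BCD, at the cost of an extra
 variable whose length must be chosen in advance.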

 If one examines string support and typical use thereof in languages
 like PL/I and C, two characteristics become evident.  First of all,
 strings whose  length can  vary dynamically  (up to  some compile  time
 limit set by the user, not by the compiler) are supported, and the null
 string is legal.  Having varying  length strings without a null  string
 is like  having integers  without a  zero; how  else can  something  be
 initialized to empty?  FORTRAN 77, Pascal, Modula-2, and Ada all make
 the mistake of requiring fixed-length strings, and in Pascal and Ada,
 because of their strong typing, strings of different lengths have
 different types, and are therefore not conformant.

 Second, individual characters can be processed as equivalents of  small
 integers equal to their position in  the host character set.  Thus,  in
 C, one can  convert a lower  case letter  to upper case  by adding  the
 expression 'A'  -  'a',  without  having to  know  precisely  what  the
 equivalent integers are.  In addition, printable representations of
 commonly  used  non-printable  characters,  such  as  backspace,   tab,
 newline, carriage return, formfeed,  and so on,  are provided, so  that
 one can easily construct  strings which span  lines or contain  control
 characters.  The integer equivalents make  it possible to index  arrays
 by character  values, making  for efficient  lookup.  C  in  particular
 makes good use of this in its standard library for determining  whether
 characters are letters,  digits, printable, upper  case or lower  case,
 etc.
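
 For comparison, a rough FORTRAN 77 counterpart of the C case-conversion
 idiom has to go through ICHAR() and CHAR().  The function below is only
 a sketch under stated assumptions: the names UPCASE and CASES are
 hypothetical, lower-case letters lie outside the Standard's character
 set, and the arithmetic assumes that corresponding upper-case and
 lower-case letters differ by a constant offset, as they do in ASCII.

```fortran
      CHARACTER*1 FUNCTION UPCASE (C)
C     Hypothetical helper: fold one lower-case letter to upper case.
C     Assumes a constant offset between the two cases (true of ASCII).
C     LGE and LLE compare according to the ASCII collating sequence.
      CHARACTER*1 C
      UPCASE = C
      IF (LGE(C,'a') .AND. LLE(C,'z'))
     X    UPCASE = CHAR(ICHAR(C) + ICHAR('A') - ICHAR('a'))
      RETURN
      END

      PROGRAM CASES
C     Illustrative driver: a lower-case letter, an upper-case letter,
C     and a digit.
      CHARACTER*1 UPCASE
      EXTERNAL UPCASE
      WRITE (*,*) UPCASE('q'), UPCASE('Q'), UPCASE('7')
      END
```

 On an ASCII host this should print Q, Q and 7.  The C expression needs
 no function calls at all, which is the point of the comparison.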

 FORTRAN 77 has the ICHAR() function, which is supposed to return an
 integer ordinal greater than or equal to 0, representing the position
 of the argument character in the host character set.  However, the
 Standard
 defines only letters, digits, and thirteen special characters, a  total
 of only 49.  This means that a processor is free to implement  whatever
 it likes for arguments to ICHAR() which are not among these.  It  could
 even legally raise a fatal error in such a case.  Most  implementations
 do in fact return an integer  for all possible characters which can  be
 stored in the host CHARACTER storage unit, but the sign of the  integer
 is not guaranteed to be positive.

 On older architectures with 36 (IBM 70xx, Univac 1108), 48 (Burroughs),
 or 60 (CDC) bit words, 6-bit characters were common, and a whole number
 of them fit into each host word.  This permits only 64 different
 characters,
 which is not enough to have both letter cases.  The ISO/ASCII character
 set has 128  different character  values and can  represent both  upper
 case and lower case  letters.  On the 36-bit  DEC-10 and -20  machines,
 these are stored  as five 7-bit  characters per word,  with one  unused
 bit.  On the 36-bit Univac 11xx machines, newer compilers store four
 9-bit characters  per  word, with  the  two  high order  bits  of  each
 character unused.  Most  newer architectures  are based  on an  address
 unit of an 8-bit byte, or have a word size which is a multiple of  this
 (e.g. 64-bit Cray words).  The EBCDIC character set used by IBM has 256
 characters to make  complete use of  the byte storage  unit.  With  the
 ASCII character set, however, only  7 bits are required, and  something
 has to be done about  the extra bit in an  8-bit byte.  Prime makes  it
 one and treats the byte as an unsigned integer, so their ASCII ordinals
 go from  128 to  255 (a  violation of  the ANSI  and ISO  Standards,  I
 believe).  Other machines ignore it, and still others use it as a  sign
 bit.  In the latter case, ICHAR() can  return values 0 .. 127 when  the
 high bit is zero, then -128 .. -1 when it is set.
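
 Where ICHAR() may return negative values, one defensive approach is to
 fold the result into a non-negative range before using it as an array
 subscript.  The sketch below assumes an 8-bit character storage unit;
 the names ICHARP, TALLY, KOUNT and TEXT, and the constant 256, are
 assumptions made for illustration and are not part of the Standard.

```fortran
      INTEGER FUNCTION ICHARP (C)
C     Hypothetical wrapper: return ICHAR(C) folded into 0..255.
C     Assumes 8-bit characters.  On hosts where ICHAR() is already
C     non-negative the adjustment is never applied.
      CHARACTER*1 C
      ICHARP = ICHAR(C)
      IF (ICHARP .LT. 0) ICHARP = ICHARP + 256
      RETURN
      END

      PROGRAM TALLY
C     Count occurrences of each character in TEXT using a table
C     indexed by the folded ordinal.
      INTEGER ICHARP, KOUNT(0:255), I, J
      CHARACTER TEXT*11
      EXTERNAL ICHARP
      DO 10 I = 0, 255
         KOUNT(I) = 0
   10 CONTINUE
      TEXT = 'MISSISSIPPI'
      DO 20 I = 1, LEN(TEXT)
         J = ICHARP(TEXT(I:I))
         KOUNT(J) = KOUNT(J) + 1
   20 CONTINUE
      WRITE (*,*) 'S OCCURS', KOUNT(ICHARP('S')), ' TIMES'
      END
```

 The count reported for S should be 4.  On a machine such as the Prime,
 where the ordinals already run from 128 to 255, the table bounds still
 suffice, but the wrapper does nothing to make the ordinals agree with
 those of other hosts.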

 In summary then, one cannot be sure in FORTRAN 77 whether CHARACTER
 data can be used to access every bit in memory (it cannot on the
 DEC-10 and -20, or on any machine which ignores the high-order bits),
 or whether  ICHAR()  can  be  used  to  obtain  an  integer  which  can
 confidently be used as an array index.