Topic: standard for C/C++ representation of ASCII CR, LF


Author: Wojtek_L@yahoo.ca (Wojtek Lerch)
Date: Thu, 14 Mar 2002 01:59:57 GMT
Daniel Miller <daniel.miller@tellabs.com> wrote in message news:<3C8E57E1.1050708@tellabs.com>...
...
>    And building on Wojtek Lerch's analysis, the line-separator in the Mac
> culture is CR (alone), not CRLF as in the Microsoft/DEC culture, not LF as in
> the UNIX culture.  Thus to guarantee that Mac's CR satisfies the requirement
> that line-separators be represented strictly by \n alone, text mode translates
> CR to \n as required.  Since Mac culture does not consider LF to be part of the
> line-separator, a library author is left with two choices:
>    1) in text mode, drop the LF entirely, such as by Mac libraries considering
> CR (official), LF (foreign from UNIX), CRLF (foreign from Microsoft/DEC), and
> LFCR (oddity) as all mapping to \n on input in text mode.  Obviously on output
> \n would be converted to CR alone in Mac culture in text mode.
>    OR
>    2) in text mode, map the LF into some escaped-character-sequence which is not
> already used in text-mode.  Since CR is represented by \n due to the
> aforementioned text-file character-translation requirement (instead of \r), \r is
> unused.  Because LF was evicted from its natural \n home, and because \r is a
> vacant home, LF could conceivably be mapped to \r, so that CR versus LF remain
> distinguishable (instead of both mapped to \n) and so that no characters are
> dropped in text-mode.
>
>    Metrowerks evidently chose the second option.
>
>    It would be preferable for the C & C++ standards to specify which of the two
> options is permissible and which is forbidden.

I haven't checked the C++ standard, but the C standard already does
specify that both options are permissible.  Are you saying that you
would like the Standard to pick one and forbid the other (along with
any other possible choices)?  Why?  If you play by the rules, there's
no way for your code to notice which of the two options the
implementation happened to choose anyway.  Would you like the rules to
be changed to let you write any control characters to a text stream
and have it guaranteed that you'll get exactly the same characters
when you read them back?  Why don't you just use a binary stream
instead of a text stream?

Your two options are specific to the Mac OS.  They wouldn't make any
sense in some other OSes.  Imagine an OS where lines of a text file
are stored as fixed-length, space-padded records, and the file
contains no line separators at all -- do you want to make implementing
C for such an OS impossible?
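
To make the record-oriented case concrete, here is a minimal sketch in C
of what a text-mode layer on such an OS might do.  The record width
RECORD_LEN and the function names line_to_record and record_to_line are
made up for illustration; they are not taken from any real system or
library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record width; real record-oriented systems vary. */
#define RECORD_LEN 8

/* Store one line (everything before its '\n') as a space-padded record.
 * Returns 0 on success, -1 if the line does not fit in one record. */
int line_to_record(const char *line, char record[RECORD_LEN])
{
    assert(line != NULL && record != NULL);
    size_t len = strcspn(line, "\n");
    if (len > RECORD_LEN)
        return -1;
    memcpy(record, line, len);
    memset(record + len, ' ', RECORD_LEN - len);   /* pad with spaces */
    return 0;
}

/* Turn one record back into a C text line ending in '\n'.
 * out must have room for RECORD_LEN + 2 bytes.
 * Returns the line length, not counting the terminating '\0'. */
size_t record_to_line(const char record[RECORD_LEN], char *out)
{
    assert(record != NULL && out != NULL);
    size_t len = RECORD_LEN;
    while (len > 0 && record[len - 1] == ' ')      /* padding is not data */
        len--;
    memcpy(out, record, len);
    out[len] = '\n';                               /* synthesized, not stored */
    out[len + 1] = '\0';
    return len + 1;
}
```

In this sketch the '\n' never reaches the file, and a line written as
"ab  \n" comes back as "ab\n", because the reader cannot tell trailing
spaces from padding.  That is one way to see why the C standard leaves
spaces before a new-line implementation-defined.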

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Christopher Corbell <ccorbell@vernier.com>
Date: Mon, 11 Mar 2002 21:38:49 GMT
I have a question on what's standard and, if not standard,
what's commonly accepted implementation.

I was under the impression that the ASCII characters LF (0x0A)
and CR (0x0D) were, by standard or convention, mapped to the
C character-escapes '\n' and '\r' respectively.  I would also tend
to think that standard text-mode file io (in C with stdio
commands or in C++ with iostream) would preserve these
literal correlations. Perhaps the one implementation-defined
difference is that '\n' might get mapped to a system EOL for
output to a system console or file context where that
was important (e.g., '\n' on output could become CRLF in
DOS-based systems or CR in Mac systems).

The Metrowerks MacOS implementation goes a step further by
reversing the significance of CR and LF with regard to
the '\r' and '\n' escapes in i/o involving a file, but not
in any other context.  This means that if you open a FILE or
fstream in text mode, every '\r' will be output there as
an LF as well as every '\n' being output as CR, and the same
is true of input - the escape-to-ASCII-binary correspondences
are reversed.  They are not reversed however in memory, e.g.
with sprintf() to a buffer or with a std::stringstream.
I find this confusing and, if not non-standard, at least
worse than other implementations.

It turns out there is a workaround with MSL - if you open
the file in binary mode then '\n' always == LF and
'\r' always == CR.  This is what I wanted but in getting
this information I also got a lot of justification from
folks over there that this was the "standard" way to get
what I wanted, when I feel like this is more like a workaround
for a bug in their implementation.  After all, CR and LF
are 7-bit ASCII values and I'd expect to be able to use
them in a standard way with a file opened in text mode.

Is there any place in the C or C++ standard where the
correspondence of escapes to ASCII is explicitly
established, or else allowed to be implementation-defined
in certain contexts?  I did a few web searches for
things like "ASCII standard C FILE" and similar for
C++ fstream, and could not find any specific info
one way or the other.

However all programmers on my team feel that the
MSL implementation is more or less wrong.  If there
is no clear statement from the standard I wonder what
others think about it.  I know that the other two
compilers/libraries we use, which are often less
standards-compliant than Metrowerks, are both better
in this regard - you can count on a FILE/fstream opened
in text mode giving your program a '\r' when a CR
was found in the file and an LF when an LF was found
in the file, and on output the only substitution
that occurs is (if applicable) EOL-sequence for '\n'.

Thanks for info & discussion,
Chris






Author: Ron Natalie <ron@sensor.com>
Date: Mon, 11 Mar 2002 16:34:16 CST

Christopher Corbell wrote:

> I was under the impression that the ASCII characters LF (0x0A)
> and CR (0x0D) were, by standard or convention, mapped to the
> C character-escapes '\n' and '\r' respectively.

Well not quite.  C lines are terminated by \n alone.  It is up
to the stdio/streams code to map from whatever line termination
scheme that file uses (for instance CRLF on Windows).  This is
a curious artifact of UNIX (and model 37 teletypes) where NEWLINE
was a single character (coincidentally with the same value as LF).
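
As a rough illustration of that mapping on the output side, the sketch
below expands each '\n' into one of three external line terminators.
The enum eol_style and the function text_to_external are hypothetical
names, and the hex values assume an ASCII-based external format; an
actual stdio implementation is free to do this differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for the host's line-termination convention. */
enum eol_style { EOL_LF, EOL_CRLF, EOL_CR };

/* Copy text to out, replacing every '\n' with the external terminator.
 * out needs room for up to 2 * n bytes.  Returns the bytes written. */
size_t text_to_external(const char *text, size_t n,
                        enum eol_style style, char *out)
{
    assert(text != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (text[i] != '\n') {
            out[j++] = text[i];        /* ordinary character: copy as is */
            continue;
        }
        switch (style) {
        case EOL_LF:   out[j++] = 0x0A;                  break; /* UNIX */
        case EOL_CRLF: out[j++] = 0x0D; out[j++] = 0x0A; break; /* DOS  */
        case EOL_CR:   out[j++] = 0x0D;                  break; /* Mac  */
        }
    }
    return j;
}
```

The program only ever deals in '\n'; which bytes land in the file is the
library's business.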

> I would also tend
> to think that standard text-mode file io (in C with stdio
> commands or in C++ with iostream) would preserve these
> literal correlations.

No, not in text mode.   In binary mode, ASCII LF->\n and CR->\r
is the norm (still an implementation-defined thing).

> The Metrowerks MacOS implementation goes a step further by
> reversing the significance of CR and LF with regard to
> the '\r' and '\n' escapes in i/o involving a file, but not
> in any other context.  This means that if you open a FILE or
> fstream in text mode, every '\r' will be output there as
> an LF as well as every '\n' being output as CR, and the same
> is true of input

This is exactly how text mode is supposed to work.  The mac uses
CR for line termination.  The Stdio/stream library functions convert
that to \n.

> It turns out there is a workaround with MSL - if you open
> the file in binary mode then '\n' always == LF and
> '\r' always == CR.

Again, this is precisely what BINARY mode is supposed to
do (and why there exists such a mode).

> Is there any place in the C or C++ standard where the
> correspondence of escapes to ASCII is explicitly
> established, or else allowed to be implementation-defined

There is no place where ASCII is discussed.  However, the standard
does say that in text mode, whatever the NATIVE line termination
may be, it is converted to a single \n character (7.19.2/2
of the C standard for stdio; the C++ stream behavior is
defined in terms of this in 27.8).

> However all programmers on my team feel that the
> MSL implementation is more or less wrong.

Nope, Metrowerks is right.  You should expect text mode lines
to end with a single '\n' character regardless of the
implementation.






Author: gmt@CS.Arizona.EDU (Gregg Townsend)
Date: Mon, 11 Mar 2002 23:44:21 GMT
In article <3C8D309A.C98855CF@sensor.com>, Ron Natalie  <ron@sensor.com> wrote:
>
> ... C lines are terminated by \n alone. ... This is
> a curious artifact of UNIX (and model 37 teletypes) where NEWLINE
> was a single (coincidentally same value as LF) character.

Actually, this goes back to the ASCII standard (ANSI X3.4) which
optionally associates the LF character with a newline function and
specifies it as the character to use if you're just using a single
line-separator character.

---------------------------------------------------------------------------
Gregg Townsend         Staff Scientist      The University of Arizona
gmt@cs.arizona.edu     Computer Science     Tucson, Arizona, USA






Author: Wojtek_L@yahoo.ca (Wojtek Lerch)
Date: Tue, 12 Mar 2002 16:59:45 GMT
Christopher Corbell <ccorbell@vernier.com> wrote in message news:<3C8D2210.1AF61F4A@vernier.com>...
> I was under the impression that the ASCII characters LF (0x0A)
> and CR (0x0D) were, by standard or convention, mapped to the
> C character-escapes '\n' and '\r' respectively.  I would also tend
> to think that standard text-mode file io (in C with stdio
> commands or in C++ with iostream) would preserve these
> literal correlations.

It's true that on typical implementations, the values of '\n' and '\r'
are 0x0A and 0x0D, respectively.   The standard guarantees that these
values, just like any other byte values, are preserved when you write
them to a file and read them back, provided that you use the *binary*
mode.
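
A small sketch of that binary-mode guarantee, assuming tmpfile() is
usable on the host (it opens a temporary file as a binary stream).  The
function name binary_round_trip and the 256-byte limit are only for
illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write n bytes to a binary stream and read them back.
 * Returns 1 if the bytes match, 0 if they differ, -1 on I/O failure. */
int binary_round_trip(const unsigned char *data, size_t n)
{
    assert(data != NULL && n <= 256);
    unsigned char back[256];
    FILE *f = tmpfile();               /* opened in "wb+" (binary) mode */
    if (f == NULL)
        return -1;

    int result = -1;
    if (fwrite(data, 1, n, f) == n) {
        rewind(f);
        if (fread(back, 1, n, f) == n)
            result = (memcmp(data, back, n) == 0);
    }
    fclose(f);
    return result;
}
```

Feeding it a buffer containing '\r', '\n' and other control characters
should give 1 on a conforming implementation, whatever the host's text
conventions are.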

The text mode is a different story.  Its purpose is to let a C program
see a text file as consisting of lines terminated with a single '\n'
character, no matter how much that view differs from the format that
the underlying OS normally uses for storing text.  If you want a
guarantee that what you read is identical to what you wrote, you must
not use any control characters other than '\t' and '\n', and you must
make sure that your lines don't have any trailing white space.  In
particular, if you write a '\r' to a text stream, all bets are off...

In the Standard's own words (7.19.2, paragraph 2):

A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character.  Whether the last line requires a
terminating new-line character is implementation-defined.  Characters
may have to be added, altered, or deleted on input and output to
conform to differing conventions for representing text in the host
environment.  Thus, there need not be a one-to-one correspondence
between the characters in a stream and those in the external
representation. Data read in from a text stream will necessarily
compare equal to the data that were earlier written out to that stream
only if: the data consist only of printing characters and the control
characters horizontal tab and new-line; no new-line character is
immediately preceded by space characters; and the last character is a
new-line character.  Whether space characters that are written out
immediately before a new-line character appear when read in is
implementation-defined.
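
The three conditions in that paragraph can be turned into a check.  The
following is only a sketch: is_round_trip_safe_text is a made-up name,
isprint() is evaluated in whatever locale is current, and treating empty
data as safe is an assumption of this example rather than something the
quoted text says.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if data meets the conditions under which a text stream
 * must read back exactly what was written to it. */
bool is_round_trip_safe_text(const char *data, size_t n)
{
    assert(data != NULL || n == 0);
    if (n == 0)
        return true;                   /* assumption: nothing to lose */

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)data[i];
        if (c == '\n') {
            /* no new-line may be immediately preceded by a space */
            if (i > 0 && data[i - 1] == ' ')
                return false;
        } else if (c != '\t' && !isprint(c)) {
            /* only printing characters, tab and new-line are allowed */
            return false;
        }
    }
    /* the last character must be a new-line */
    return data[n - 1] == '\n';
}
```

Under this check "abc\n" passes, while "abc \n", "abc" and anything
containing '\r' do not.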






Author: larry.jones@sdrc.com
Date: Tue, 12 Mar 2002 17:38:27 GMT
Gregg Townsend <gmt@cs.arizona.edu> wrote:
>
> Actually, this goes back to the ASCII standard (ANSI X3.4) which
> optionally associates the LF character with a newline function and
> specifies it as the character to use if you're just using a single
> line-separator character.

The old ASCII standard.  The current version either deprecates that
usage or removes it entirely.

-Larry Jones

Years from now when I'm successful and happy, ...and he's in
prison... I hope I'm not too mature to gloat. -- Calvin






Author: Daniel Miller <daniel.miller@tellabs.com>
Date: Tue, 12 Mar 2002 20:08:20 GMT
   Christopher Corbell wrote:

> The Metrowerks MacOS implementation goes a step further by
> reversing the significance of CR and LF with regard to
> the '\r' and '\n' escapes in i/o involving a file, but not
> in any other context.  This means that if you open a FILE or
> fstream in text mode, every '\r' will be output there as
> an LF as well as every '\n' being output as CR, and the same
> is true of input - the escape-to-ASCII-binary correspondences
> are reversed.
>

Wojtek Lerch wrote:

> Christopher Corbell <ccorbell@vernier.com> wrote in message news:<3C8D2210.1AF61F4A@vernier.com>...
>
>>I was under the impression that the ASCII characters LF (0x0A)
>>and CR (0x0D) were, by standard or convention, mapped to the
>>C character-escapes '\n' and '\r' respectively.  I would also tend
>>to think that standard text-mode file io (in C with stdio
>>commands or in C++ with iostream) would preserve these
>>literal correlations.
>>
>
> It's true that on typical implementations, the values of '\n' and '\r'
> are 0x0A and 0x0D, respectively.   The standard guarantees that these
> values, just like any other byte values, are preserved when you write
> them to a file and read them back, provided that you use the *binary*
> mode.
>
> The text mode is a different story.  Its purpose is to let a C program
> see a text file as consisting of lines terminated with a single '\n'
> character, no matter how much that view differs from the format that
> the underlying OS normally uses for storing text.  If you want a
> guarantee that what you read is identical to what you wrote, you must
> not use any control characters other than '\t' and '\n', and you must
> make sure that your lines don't have any trailing white space.  In
> particular, if you write a '\r' to a text stream, all bets are off...
>
> In the Standard's own words (7.19.2, paragraph 2):
>
> A text stream is an ordered sequence of characters composed into
> lines, each line consisting of zero or more characters plus a
> terminating new-line character.  Whether the last line requires a
> terminating new-line character is implementation-defined.  Characters
> may have to be added, altered, or deleted on input and output to
> conform to differing conventions for representing text in the host
> environment.  Thus, there need not be a one-to-one correspondence
> between the characters in a stream and those in the external
> representation. Data read in from a text stream will necessarily
> compare equal to the data that were earlier written out to that stream
> only if: the data consist only of printing characters and the control
> characters horizontal tab and new-line; no new-line character is
> immediately preceded by space characters; and the last character is a
> new-line character.  Whether space characters that are written out
> immediately before a new-line character appear when read in is
> implementation-defined.

   And building on Wojtek Lerch's analysis, the line-separator in the Mac
culture is CR (alone), not CRLF as in the Microsoft/DEC culture, not LF as in
the UNIX culture.  Thus to guarantee that Mac's CR satisfies the requirement
that line-separators be represented strictly by \n alone, text mode translates
CR to \n as required.  Since Mac culture does not consider LF to be part of the
line-separator, a library author is left with two choices:
   1) in text mode, drop the LF entirely, such as by Mac libraries considering
CR (official), LF (foreign from UNIX), CRLF (foreign from Microsoft/DEC), and
LFCR (oddity) as all mapping to \n on input in text mode.  Obviously on output
\n would be converted to CR alone in Mac culture in text mode.
   OR
   2) in text mode, map the LF into some escaped-character-sequence which is not
already used in text-mode.  Since CR is represented by \n due to the
aforementioned text-file character-translation requirement (instead of \r), \r is
unused.  Because LF was evicted from its natural \n home, and because \r is a
vacant home, LF could conceivably be mapped to \r, so that CR versus LF remain
distinguishable (instead of both mapped to \n) and so that no characters are
dropped in text-mode.

   Metrowerks evidently chose the second option.

   It would be preferable for the C & C++ standards to specify which of the two
options is permissible and which is forbidden.  Hopefully the C & C++ standards-body
working groups would :-) make the same choice instead of divergent ones.
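
   For concreteness, the input side of the two options might look like the
sketch below.  The function names read_collapse_all and read_swap are invented
here, and nothing in it is taken from the actual MSL sources; it only
illustrates the two mappings described above for a CR-terminated host.

```c
#include <assert.h>
#include <stddef.h>

#define CR 0x0D   /* external carriage return */
#define LF 0x0A   /* external line feed */

/* Option 1: CR, LF, CRLF and LFCR each become a single '\n'.
 * out needs room for n bytes.  Returns the number of characters produced. */
size_t read_collapse_all(const char *ext, size_t n, char *out)
{
    assert(ext != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        char c = ext[i];
        if (c == CR || c == LF) {
            /* swallow the partner byte of a CRLF or LFCR pair */
            if (i + 1 < n && ext[i + 1] != c &&
                (ext[i + 1] == CR || ext[i + 1] == LF))
                i++;
            out[j++] = '\n';
        } else {
            out[j++] = c;
        }
    }
    return j;
}

/* Option 2: CR becomes '\n' and LF takes the vacant '\r', so the two stay
 * distinguishable and no characters are dropped.  Returns n. */
size_t read_swap(const char *ext, size_t n, char *out)
{
    assert(ext != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (ext[i] == CR)
            out[i] = '\n';
        else if (ext[i] == LF)
            out[i] = '\r';
        else
            out[i] = ext[i];
    }
    return n;
}
```

   Option 1 loses information (a lone LF and a CRLF pair look the same
afterwards), while option 2 is a one-to-one mapping that can be undone on
output.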
