Topic: Defect report: handling of extended source characters in string literals
Author: Martin Vejnár <avakar@volny.cz>
Date: Mon, 08 May 2006 11:55:02 -0500
[ Note: Forwarded to C++ Committee. -sdc ]
Consider the following code:
#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;
    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\á" << std::endl;
}
The first statement in main outputs characters "u00e1" preceded by a
backslash.
The Standard says:
[2.1 - Phases of translation, paragraph 1.1]
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. Trigraph sequences (2.3) are replaced by corresponding
single-character internal representations. Any source file character not
in the basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a universal-character-name
(i.e. using the \uXXXX notation), are handled equivalently.)
During this translation phase, the foreign character in the second
statement is replaced by a universal-character-name. The statement then
resembles the first and outputs one of the following:
\u00e1
\u00E1
\U000000e1
\U000000E1
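A minimal sketch makes the distinction concrete (assuming an
implementation whose execution character set can represent U+00E1):
escaping the backslash keeps the six characters of the escape text,
while a universal-character-name written directly denotes the single
character á. Under the phase 1 reading above, the second statement of
the program collapses into the first, escaped form.
    #include <iostream>
    int main()
    {
        // Six characters: '\', 'u', '0', '0', 'e', '1' - the escaped
        // backslash prevents a universal-character-name from forming.
        std::cout << "\\u00e1" << std::endl;
        // A universal-character-name written directly: the single
        // character U+00E1, converted to the execution character set.
        std::cout << "\u00e1" << std::endl;
    }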
C99 (at least in the draft I have available) avoids this problem: its
translation phase 1 does not replace extended characters with
universal-character-names, and it does not restrict the source character
set to 96 basic characters as C++ does.
--
Martin Vejnár
Author: "Greg Herlihy" <greghe@pacbell.net>
Date: Tue, 9 May 2006 11:10:34 CST
James Kanze wrote:
> Martin Vejnár wrote:
>
> > [ Note: Forwarded to C++ Committee. -sdc ]
>
> > Consider the following code:
>
> > #include <iostream>
> > int main()
> > {
> > std::cout << "\\u00e1" << std::endl;
>
> > // Following line contains Unicode character
> > // "latin small letter a with acute" (U+00E1)
> > std::cout << "\á" << std::endl;
> > }
>
> > The first statement in main outputs characters "u00e1"
> > preceded by a backslash.
>
> Which is perfectly legal, as the program has undefined
> behavior according to 2.1.3.2/3: "If the character following a
> backslash is not one of those specified, the behavior is
> undefined." As you point out, in this case, the \u00e1 is a
> single character at all points beyond translation phase 1.
The Standard states that in phase 1 of source file translation:
"Any source file character not in the basic source character set is
replaced by the universal-character-name that designates that
character."
Since the characters '\', 'u', '0', 'e' and '1' are all in the basic
source character set, they are not replaced in phase 1 - a point
reiterated later:
"Note: in translation phase 1, a universal-character-name is introduced
whenever an actual extended character is encountered in the source
text." 2.13.2/5.
Clearly á, and not \u00e1, is the "actual extended character", so
processing the string literal in the first line must wait until phase 5,
when:
"Each source character set member, escape sequence, or
universal-character-name in character literals and string literals is
converted to a the corresponding member of the execution character set"
At this stage the compiler translates the entire string literal,
\\u00e1. And since the two backslashes, \\, form a valid escape
sequence (for the backslash character itself), the string translates to
\u00e1. So the program output for the first statement is both correct
and well-defined by the Standard.
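This reading is easy to check with a sketch: after phase 5 the literal
"\\u00e1" should hold exactly six characters, because the two
backslashes collapse into one.
    #include <cstring>
    #include <iostream>
    int main()
    {
        // '\', 'u', '0', '0', 'e', '1' -> prints 6
        std::cout << std::strlen("\\u00e1") << std::endl;
    }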
Now the second line is more interesting. The character á is clearly not
in the basic source character set, so unlike the first line, this
character is replaced by \u00e1 in phase 1. And now the behavior of the
program does become undefined, since \á is not a valid escape sequence.
And in fact, gcc reports the invalid escape sequence as an error. Not
surprisingly, gcc accepts the first line as legal and outputs the string
"\u00e1" as expected.
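The phase 1 equivalence itself is easy to observe when the extended
character is not preceded by a backslash; a small sketch, assuming the
source file is saved in an encoding the compiler recognizes (e.g. UTF-8
with gcc):
    #include <cstring>
    #include <iostream>
    int main()
    {
        // The raw á is replaced by \u00e1 in phase 1, so both literals
        // are identical from then on; this prints 1.
        std::cout << (std::strcmp("\u00e1", "á") == 0) << std::endl;
    }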
Greg
---