Topic: Defect report: handling of extended source characters in string literals


Author: kanze.james@neuf.fr (James Kanze)
Date: Mon, 8 May 2006 22:49:58 GMT
Martin Vejnár wrote:

 > [ Note: Forwarded to C++ Committee.  -sdc ]

 > Consider the following code:

 >     #include <iostream>
 >     int main()
 >     {
 >         std::cout << "\\u00e1" << std::endl;

 >         // Following line contains Unicode character
 >         // "latin small letter a with acute" (U+00E1)
 >         std::cout << "\á" << std::endl;
 >     }

 > The first statement in main outputs characters "u00e1"
 > preceded by a backslash.

Which is perfectly legal, as the program has undefined
behavior according to §2.1.3.2/3: "If the character following a
backslash is not one of those specified, the behavior is
undefined."  As you point out, in this case, the \u00e1 is a
single character at all points beyond translation phase 1.

The correct way to output the sequence u00e1, preceded by a
backslash, is:

     std::cout << "\\" "u00e1" << std::endl ;

This works because adjacent string literals are not concatenated
until translation phase 6, well after phase 1, so neither literal
ever contains a backslash followed by u00e1.

Note that this is similar to the way a backslash preceding a
newline is handled.  In both cases, the backslash is removed in
a very early phase, before any of the usual escape sequences in
a string or a character constant are considered, and thus,
before it can be escaped itself.  Consider for example:

     std::cout << "\\
a" ;

This is a perfectly legal piece of code -- a somewhat obfuscated
way of outputting an audible signal.  It is NOT an illegal
string constant with a newline in it.
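
A small test program (my own sketch, not part of the original post)
makes the splice concrete.  Note that the continuation line must
start in column one, since any leading whitespace would become part
of the string:

     #include <cassert>
     #include <cstring>

     int main()
     {
         // Phase 2 deletes the second backslash together with the
         // new-line that follows it, so what remains in the literal
         // is the two-character escape sequence \a, the alert:
         assert(std::strcmp("\\
a", "\a") == 0);
     }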

 > The Standard says:
 > [2.1 - Phases of translation, paragraph 1.1]
 >     Physical source file characters are mapped, in an
 > implementation-defined manner, to the basic source character set
 > (introducing new-line characters for end-of-line indicators) if
 > necessary. Trigraph sequences (2.3) are replaced by corresponding
 > single-character internal representations. Any source file character not
 > in the basic source character set (2.2) is replaced by the
 > universal-character-name that designates that character. (An
 > implementation may use any internal encoding, so long as an actual
 > extended character encountered in the source file, and the same extended
 > character expressed in the source file as a universal-character-name
 > (i.e. using the \uXXXX notation), are handled equivalently.)

That's an interesting formulation.  Does it mean that all later
phases must see "\u00E1", even if e.g. the implementation uses
UTF-32 internally?  In that case, the behavior is well defined,
and must be that which you see.  My interpretation of §2.2/2
("The universal-character-name construct provides a way to name
other characters.") is that this is not the intent; that \u00E1
is a single character, and must be treated as such.

 > During this translation phase, the foreign character in the
 > second statement is replaced by a universal-character-name.
 > Such statement resembles the first and outputs one of the
 > following:

 >     \u00e1
 >     \u00E1
 >     \U000000e1
 >     \U000000E1

Or anything else -- you have undefined behavior.

What you don't have is an escape sequence "\\...".  The "\u00e1"
doesn't exist beyond the first phase of translation, and the
first \ is followed by a character that "is not one of those
specified".  As a quality of implementation issue, I would
expect an error from the compiler -- this is undefined
behavior which the compiler can easily detect.  (Except, of
course, in the unlikely event that the implementation has
defined this as an additional escape sequence.)

Of course, if the intent is for the universal character name to
behave as a sequence of 6 (or 10) characters in the later
translation phases -- and the description of phase one of the
translation can easily be interpreted this way -- then if 'á' is
understood by the implementation as being the same character as
\u00E1 (which would be the case if e.g. the implementation
accepted ISO 8859-1 as its input encoding), then it would have
to output one of the variants you indicate.  My own opinion is
that the text in §2.2 makes it clear that this is not the
intent; that \u00E1 should be treated as a single character, a
latin small letter a with acute.
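
Under that single-character reading, the way to put the accented
letter itself into the output is to write the UCN directly.  A
minimal sketch, assuming the execution character set can represent
U+00E1:

     #include <iostream>

     int main()
     {
         // Here \u00e1 denotes the single character U+00E1,
         // latin small letter a with acute:
         std::cout << "\u00e1" << std::endl;
     }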

 > C99 (at least in the draft I have available) avoids this
 > problem by not introducing any universal character names and
 > not restricting the (basic) source character set to 96
 > characters as C++ does.

The final C99 does contain universal character names, in almost
exactly the same language as the C++ standard.

-- 
James Kanze                                    kanze.james@neuf.fr
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34






Author: avakar@volny.cz (Martin Vejnár)
Date: Tue, 9 May 2006 14:49:33 GMT
James Kanze wrote:
> Martin Vejnár wrote:
>> Consider the following code:
>>
>>     #include <iostream>
>>     int main()
>>     {
>>         std::cout << "\\u00e1" << std::endl;
>>
>>         // Following line contains Unicode character
>>         // "latin small letter a with acute" (U+00E1)
>>         std::cout << "\á" << std::endl;
>>     }
>>
>> The first statement in main outputs characters "u00e1"
>> preceded by a backslash.
>
> Which is perfectly legal, as the program has undefined
> behavior according to §2.1.3.2/3: "If the character following a
> backslash is not one of those specified, the behavior is
> undefined."  As you point out, in this case, the \u00e1 is a
> single character at all points beyond translation phase 1.

On the contrary, I was trying to point out that after phase 1 is
complete, the *second* statement no longer contains letter 'á'.
Instead, that character is replaced by a universal character name.
So, after phase 1 is complete, the code looks like this:

     #include <iostream>
     int main()
     {
         std::cout << "\\u00e1" << std::endl;

         // Following line contains Unicode character
         // "latin small letter a with acute" (U+00E1)
         std::cout << "\\u00e1" << std::endl;
     }

I understand that what you're saying is probably the original intent
of [2.1/1.1].  However, the current wording of the paragraph in
question says something different.

> The correct way to output the sequence u00e1, preceded by a
> backslash, is:
>
>     std::cout << "\\" "u00e1" << std::endl ;

The grammar given for string-literal in [2.13.4] is pretty unambiguous
about this.  Just as "\\n" isn't a backslash followed by a new-line,
"\\u00e1" isn't a backslash followed by a universal-character-name.

Note that when tokenization begins, there cannot be "\á" in the
source, since after phase 1, all characters are from the basic source
character set as defined in [2.2/1].

> Note that this is similar to the way a backslash preceding a
> newline is handled.  In both cases, the backslash is removed in
> a very early phase, before any of the usual escape sequences in
> a string or a character constant are considered, and thus,
> before it can be escaped itself.  Consider for example:
>
>     std::cout << "\\
> a" ;
>
> This is a perfectly legal piece of code -- a somewhat obfuscated
> way of outputting an audible signal.  It is NOT an illegal
> string constant with a newline in it.

The line splicing has nothing to do with phase 1.  Phase 2 (line
splicing) removes backslash and newline pairs, while phase 1 replaces
foreign characters (by which I mean characters outside the basic
source character set) with their respective universal character names.
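
Schematically, using the strings from this thread (my own summary,
not Martin's):

     // Phase 1: extended characters are rewritten as UCNs:
     //     "\á"             becomes   "\\u00e1"
     // Phase 2: each backslash/new-line pair is spliced away:
     //     "\\<new-line>a"  becomes   "\a"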

>> The Standard says:
>> [2.1 - Phases of translation, paragraph 1.1]
>>     Physical source file characters are mapped, in an
>> implementation-defined manner, to the basic source character set
>> (introducing new-line characters for end-of-line indicators) if
>> necessary. Trigraph sequences (2.3) are replaced by corresponding
>> single-character internal representations. Any source file character not
>> in the basic source character set (2.2) is replaced by the
>> universal-character-name that designates that character. (An
>> implementation may use any internal encoding, so long as an actual
>> extended character encountered in the source file, and the same extended
>> character expressed in the source file as a universal-character-name
>> (i.e. using the \uXXXX notation), are handled equivalently.)
>
> That's an interesting formulation.  Does it mean that all later
> phases must see "\u00E1", even if e.g. the implementation uses
> UTF-32 internally?  In that case, the behavior is well defined,
> and must be that which you see.  My interpretation of §2.2/2
> ("The universal-character-name construct provides a way to name
> other characters.") is that this is not the intent; that \u00E1
> is a single character, and must be treated as such.

To answer your question: yes, I believe that the Standard says (but
shouldn't say) so.

Even if an implementation used UTF-32 as an internal representation,
the "as-if" rule applies.  Although I guess that the intent is to
allow implementations to use whatever internal encoding they want, the
wording of [2.1/1.1] effectively *prohibits* converting universal
character names to their respective characters - doing so would change
the meaning of the second statement and introduce undefined behavior.

> [snip]
>
> Of course, if the intent is for the universal character name to
> behave as a sequence of 6 (or 10) characters in the later
> translation phases -- and the description of phase one of the
> translation can easily be interpreted this way -- then if 'á' is
> understood by the implementation as being the same character as
> \u00E1 (which would be the case if e.g. the implementation
> accepted ISO 8859-1 as its input encoding), then it would have
> to output one of the variants you indicate.  My own opinion is
> that the text in §2.2 makes it clear that this is not the
> intent; that \u00E1 should be treated as a single character, a
> latin small letter a with acute.

The intent probably does not correspond to the wording of [2.1/1.1].
That's a good enough reason for a defect report, isn't it?

>> C99 (at least in the draft I have available) avoids this
>> problem by not introducing any universal character names and
>> not restricting the (basic) source character set to 96
>> characters as C++ does.
>
> The final C99 does contain universal character names, in almost
> exactly the same language as the C++ standard.

I cannot argue about that, since I don't have the final C99 available.
However, the draft says:

[C99 draft: 5.1.1.2/1.1]
     Physical source file multibyte characters are mapped to the
source character set (introducing new-line characters for end-of-line
indicators) if necessary.  Trigraph sequences are replaced by
corresponding single-character internal representations.

Note the difference between "basic source character set" in C++ and
"source character set" in C99.  Also note that no conversion of either
foreign characters or universal character names occurs.  Both foreign
characters and universal character names are then subject to the
grammar, which makes

     printf("\\u00e1");

output "u00e1" preceded by a backslash and

     printf("\=E1");

introduce undefined behavior.  That, I believe, is a correct and
intuitive approach and a possible solution to the problem.

-- 
Martin
