Topic: Universal character name question


Author: Michiel.Salters@tomtom.com
Date: Mon, 9 Oct 2006 09:55:58 CST
Alberto Ganesh Barbati wrote:
> "if the universal character name designates a character in the basic
> source character set, then the program is ill-formed."
>
> I wonder what is the rationale for this restriction. I mean, what's
> wrong in writing "\u0055niversal" instead of "Universal" (except
> obfuscation, of course)?

What about "\\u00750055niversal" ? "\u005c0055niversal"?
Under the current rules we don't have to consider escaped escape
characters.
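
Spelled out as a sketch (annotations mine):

    const char a[] = "\\u00750055niversal";
        // escaped backslash -- would a rescan re-form \u0055?
    // const char b[] = "\u005c0055niversal";
        // \u005c names '\' itself, so it is ill-formed today; but
        // without the restriction, would this spell "Universal"?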

HTH,
Michiel Salters

Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Mon, 9 Oct 2006 18:01:05 GMT
Michiel.Salters@tomtom.com wrote:
>
> What about "\\u00750055niversal" ? "\u005c0055niversal"?
> Under the current rules we don't have to consider escaped escape
> characters.

The second string is surely ill-formed because \u005c is in the basic
source character set. About the first one, I'm not sure... if I read the
standard correctly, conversion to the execution character set of escape
sequences and universal-character-names in string literals happens
simultaneously in translation phase 5. So the resulting literal should
be a backslash + "u00750055niversal".
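
A quick way to check that reading (a sketch, assuming an
implementation that performs phase 5 as described):

    #include <cassert>
    #include <cstring>

    int main()
    {
        // Escapes and UCNs are converted in a single pass, so the
        // escaped backslash cannot re-form a \u0055 escape:
        const char s[] = "\\u00750055niversal";
        assert(s[0] == '\\');
        assert(std::strcmp(s + 1, "u00750055niversal") == 0);
    }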

Ganesh

Author: "kanze" <kanze@gabi-soft.fr>
Date: Tue, 10 Oct 2006 09:31:37 CST
Alberto Ganesh Barbati wrote:
Michiel.Salters@tomtom.com wrote:

> > What about "\\u00750055niversal"? Or "\u005c0055niversal"?
> > Under the current rules we don't have to consider escaped escape
> > characters.

> The second string is surely ill-formed because \u005c is in
> the basic source character set.

But he gave those examples precisely to show why this should be the case.
"\u005c" is a backslash.  If you allowed universal character
names for characters in the basic character set, you'd have to
consider this string as equal to "Universal".

> About the first one, I'm not sure... if I read the standard
> correctly, conversion to the execution character set of escape
> sequences and universal-character-names in string literals
> happens simultaneously in translation phase 5. So the
> resulting literal should be a backslash + "u00750055niversal".

I think you're right there.

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Tue, 3 Oct 2006 17:09:04 GMT
Hi Everybody,

in §2.3/2 it is said:

"if the universal character name designates a character in the basic
source character set, then the program is ill-formed."

I wonder what is the rationale for this restriction. I mean, what's
wrong in writing "\u0055niversal" instead of "Universal" (except
obfuscation, of course)?

Doesn't this restriction make the use of universal character names
inherently non-portable? As I am not aware of the basic source character
set of every possible platform, whenever I use a universal character I
run the risk of designating some forbidden character on some platform
and so my code would be ill-formed there.

In Annex E there's a list of universal character names allowed in
identifiers. According to §2.3/2 I can have an identifier named \u00c0
but not \u0041, although the character U+0041 is listed as valid in the
annex. It seems a strange and gratuitous asymmetry to me.
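
Concretely (an illustrative sketch of the asymmetry):

    int \u00c0 = 0;     // OK: U+00C0 is outside the basic source
                        // character set and listed in Annex E
    // int \u0041 = 0;  // ill-formed: U+0041 designates 'A'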

Not that I'm going to use universal characters all over the place... ;-)
It's just a curiosity.

Ganesh

Author: kuyper@wizard.net
Date: Tue, 3 Oct 2006 22:45:56 CST
Alberto Ganesh Barbati wrote:
> Hi Everybody,
>
> in §2.3/2 it is said:
>
> "if the universal character name designates a character in the basic
> source character set, then the program is ill-formed."
>
> I wonder what is the rationale for this restriction. I mean, what's
> wrong in writing "\u0055niversal" instead of "Universal" (except
> obfuscation, of course)?
>
> Doesn't this restriction make the use of universal character names
> inherently non-portable? As I am not aware of the basic source character
> set of every possible platform,

You should be. It's a list of exactly 96 characters, which is by
definition the same on every conforming implementation of C++.
See 2.2p1 for the list. As a result, there's no danger of the
following:

> ... whenever I use a universal character I
> run the risk of designating some forbidden character on some platform
> and so my code would be ill-formed there.


Author: "Greg Herlihy" <greghe@pacbell.net>
Date: Wed, 4 Oct 2006 10:09:04 CST
Alberto Ganesh Barbati wrote:
> Hi Everybody,
>
> in §2.3/2 it is said:
>
> "if the universal character name designates a character in the basic
> source character set, then the program is ill-formed."
>
> I wonder what is the rationale for this restriction. I mean, what's
> wrong in writing "\u0055niversal" instead of "Universal" (except
> obfuscation, of course)?

Probably to keep identifier names canonical. After all, 512 different
ways to write "universal" as an identifier seems a needless
complication.
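
(The count follows because each of the nine letters of "universal"
could then be spelled either literally or as a universal character
name: 2^9 = 512 spellings of one identifier.)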

> Doesn't this restriction make the use of universal character names
> inherently non-portable? As I am not aware of the basic source character
> set of every possible platform, whenever I use a universal character I
> run the risk of designating some forbidden character on some platform
> and so my code would be ill-formed there.

No, there is no portability issue because the source character set (and
the set of allowed universal names in identifiers) is the same across
all implementations. So you'll never have a problem porting a source
file due to its use of universal character names.

The question is moot anyway, because the contents of a C++ source file
are completely non-portable to start with. The mapping of a source
file's contents to the source character set is
"implementation-defined." So the Standard doesn't offer much assistance
to get a set of ASCII source files to compile with an implementation
that expects characters in EBCDIC.

> In Annex E there's a list of universal characters names allowed in
> identifiers. According to §2.3/2 I can have an identifier named \u00c0
> but not \u0041, although the character U+0041 is listed as valid in the
> annex. It seems a strange and gratuitous asymmetry to me.

Annex E explicitly excludes the ranges U+0041 through U+005A and U+0061
through U+007A from the set of universal character names that may be
used in an identifier.

Greg


Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Wed, 4 Oct 2006 16:22:41 GMT
kuyper@wizard.net wrote:
> Alberto Ganesh Barbati wrote:
>> Hi Everybody,
>>
>> in §2.3/2 it is said:
>>
>> "if the universal character name designates a character in the basic
>> source character set, then the program is ill-formed."
>>
>> I wonder what is the rationale for this restriction. I mean, what's
>> wrong in writing "\u0055niversal" instead of "Universal" (except
>> obfuscation, of course)?
>>
>> Doesn't this restriction make the use of universal character names
>> inherently non-portable? As I am not aware of the basic source
>> character set of every possible platform,
>
> You should be. It's a list of exactly 96 characters, which is by
> definition exactly the same on every conforming implementation of C++.
> See 2.2p1 for the list. As a result, there's no danger of the
> following:

Ah, yes, sure. I had read 2.2p1 but I missed footnote 15 and probably
also got confused by the many character sets... Thanks.

However, this doesn't answer my original question: why is the
restriction there in the first place? Why can't I write \u0041 in my
source code?

Ganesh

Author: "kanze" <kanze@gabi-soft.fr>
Date: Wed, 4 Oct 2006 11:21:45 CST
Alberto Ganesh Barbati wrote:

> in §2.3/2 it is said:

> "if the universal character name designates a character in the
> basic source character set, then the program is ill-formed."

> I wonder what is the rationale for this restriction. I mean, what's
> wrong in writing "\u0055niversal" instead of "Universal" (except
> obfuscation, of course)?

I think that the intent is to allow several different internal
encodings.  In particular, to allow the implementation either
to keep the extended characters in the \u0055 format (in the
string), or to translate them into whatever character
corresponds in the native environment.  In the first case, the
string "\u0055" doesn't compare equal to "U".

For typical systems, which actually support an extended native
character set, this freedom doesn't buy much, since they already
have to handle the case that "å" and "\u00e5" must compare
equal.  The obvious solution is to maintain symbol names as 32
bit integers, with the universal character names replaced
internally by a single Unicode character.  But the standard
doesn't mandate this, and on an implementation with very limited
resources, it would be quite reasonable (and legal) to only
accept universal character names for extended characters, and
store the symbol names exactly as they were read.  Allowing code
to use "\u0055niversal" instead of "Universal" would require
special handling in such implementations.
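
As a rough illustration (my sketch of such a verbatim-storage
strategy, not of any real compiler):

    #include <iostream>
    #include <string>

    int main()
    {
        // A symbol table that keeps identifier spellings exactly
        // as read compares them byte-wise, so these are two
        // distinct symbols:
        std::string a = "Universal";
        std::string b = "\\u0055niversal";  // spelling kept verbatim
        std::cout << (a == b) << '\n';      // prints 0
    }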

> Doesn't this restriction make the use of universal character
> names inherently non-portable?

In some contexts, they are.  If you write
    std::cout << "\u00E5t\u00E5\n" ;
and output it to a device which only supports the 128 basic
ASCII characters, you're going to loose something.

> As I am not aware of the basic source character set of every
> possible platform,

It's defined in the standard.  See §2.2.

> whenever I use a universal character I run the risk of
> designating some forbidden character on some platform and so
> my code would be ill-formed there.

Less so than if you write a file using CRLF as line separators,
on Windows, and try to compile it with some implementations
under Unix.

> In Annex E there's a list of universal characters names allowed in
> identifiers. According to §2.3/2 I can have an identifier named \u00c0
> but not \u0041, although the character U+0041 is listed as valid in the
> annex. It seems a strange and gratuitous asymmetry to me.

U+0041 is not listed as a valid universal character name in
Annex E of the current draft.

Any of the 96 characters in the basic source character set
defined in §2.2 must appear literally.  In portable code, any
other character (including @ or $) must appear as a universal
character name.  Implementations are allowed (and even
encouraged, if I'm reading between the lines in §2.1/1
correctly) to support any other characters in the native
character set, but they must do so in a manner which makes them
indistinguishable from their universal character name
equivalent.  And if you copy the file to a system where the
character in question isn't in the native character set, then
there's no guarantee that your code will compile.
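
For example (a sketch, assuming an implementation whose native
character set includes the character à):

    int \u00e0 = 1;  // the identifier spelled with a UCN
    int* p = &à;     // the same identifier spelled natively;
                     // both must name the same variable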

The ideal solution, of course, is that the tools associated
with the compiler (editor, etc.) permit entry and display of
such characters as best they can, given the resources available
to them, but actually store the characters in the files as
universal character names.  But I don't know of any system which
does this, and it sort of goes against the basic philosophy of
Unix, in which text is text, and any program which handles text
can handle just about any text file.

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

