Topic: Unicode (was Exploding integers)


Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/13
fjh@murlibobo.cs.mu.OZ.AU (Fergus Henderson) writes:

|>  >Does it also mean that a statement like
|>  >
|>  >   a \3D b \2B c;
|>  >
|>  >is equivalent to
|>  >
|>  >   a = b + c;
|>  >
|>  >in a compiler that uses ASCII as its source character set?
|>
|>  Almost.  You would need to write it like this:
|>
|>   a \u003D b \u002B c;

Since I missed the fact that universal character names can occur in
identifiers, I'm hesitant to say this, but is it really the intent of
the standard that an entire program can be written exclusively with
universal character names, or only that they can be used in the places
explicitly authorized by the standard?  (There is, as far as I can see,
no specification as to how universal character names are to be processed
in this context.)

There is also an interesting sentence at the start of paragraph 2 in
2.2: "The universal-character-name construct provides a way to name
OTHER characters." (emphasis added).  In context, "other" can only mean
characters other than those in the basic source character set; taken
literally, the sentence means that "\u0061" is illegal, because it
corresponds to 'a', which is in the basic source character set.

Within string literals and character constants, the draft standard
states that the universal character name should be converted to the
correct character code in the execution character set.  This sounds to
me as if "\u0061" should work.  (Provided that the execution character
set contains an "a".  I can find no guarantee of this in the draft, but
no doubt I've overlooked something again.)

Concerning their use in symbols, the wording in [extendid] seems
slightly ambiguous to me: "This table is reproduced unchanged from
ISO/IEC PDTR 10176, [...] except that the ranges 0041-005a and 0061-007a
designate the upper and lower case English alphabets, which are part of
the basic source character set, and are not repeated in the table
below."  Are they not repeated because they represent characters in the
basic source character set, and so should be treated as such, or because
they represent characters in the basic source character set, and so
should not occur as universal character names?  (Also: regardless of the
interpretation, the universal character name for '_' is not mentioned,
and so presumably cannot occur in an identifier, although the
corresponding character in the basic character set can.  This would seem
to argue for the interpretation that \u0061 is simply illegal, at least
in an identifier.)

Finally, it's probably worth pointing out that the draft hedges its
words with regard to what linkers may or may not accept, which
probably means that such characters cannot be portably used in names of
objects with file scope.

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: fjh@murlibobo.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/02/13
James Kanze <james-albert.kanze@vx.cit.alcatel.fr> writes:

>fjh@murlibobo.cs.mu.OZ.AU (Fergus Henderson) writes:
>
>|>  >Does it also mean that a statement like
>|>  >
>|>  >   a \3D b \2B c;
>|>  >
>|>  >is equivalent to
>|>  >
>|>  >   a = b + c;
>|>  >
>|>  >in a compiler that uses ASCII as its source character set?
>|>
>|>  Almost.  You would need to write it like this:
>|>
>|>   a \u003D b \u002B c;

The above is not correct.  I misread the DWP.
In fact Universal Character Names (UCNs) can occur only
in identifiers and string or character literals.

>Since I missed the fact that universal character names can occur in
>identifiers, I'm hesitant to say this, but is it really the intent of
>the standard that an entire program can be written exclusively with
>universal character names, or only that they can be used in the places
>explicitly authorized by the standard?  (There is, as far as I can see,
>no specification as to how universal character names are to be processed
>in this context.)

As it happens I don't know anything about the intent of the committee
in this regard other than what I've read in the DWP.  And obviously
I didn't read it carefully enough before my previous post.
The draft is quite clear as it stands -- the lexical syntax rules mean
that UCNs can only be used in the places where they are explicitly
mentioned.

>Finally, it's probably worth pointing out that the draft hedges its
>words with regard to what linkers may or may not accept, which
>probably means that such characters cannot be portably used in names of
>objects with file scope.

Where does it hedge its words?

There's a footnote that makes it clear that the compiler is supposed
to mangle names with UCNs in them so that they are acceptable to
the linker:

 |   2.10  Identifiers                                           [lex.name]
 |
 | 1 An  identifier  is an arbitrarily long sequence of letters and digits.
 |   Each universal-character-name in an identifier shall designate a char-
 |   acter  whose encoding in ISO 10646 falls into one of the ranges speci-
 |   fied in _extendid_.  Upper- and lower-case letters are different.  All
 |   characters are significant.8)
 |
 |   _________________________
 |   8)  On  systems in which linkers cannot accept extended characters, an
 |   encoding of the universal-character-name may be used in forming  valid
 |   external identifiers.  For example, some otherwise unused character or
 |   sequence of characters may be used to encode the \u  in  a  universal-
 |   character-name.  Extended characters may produce a long external iden-
 |   tifier, but C++ does not place  a  translation  limit  on  significant
 |   characters  for  external  identifiers.  In C++, upper- and lower-case
 |   letters are considered different for all identifiers, including exter-
 |   nal identifiers.
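
To make the footnote concrete, here is a minimal sketch of the kind of
encoding it envisages.  The "_U_" marker is purely invented for
illustration; the draft prescribes no particular scheme:

#include <string>

// Hypothetical mangling: rewrite the \u or \U prefix of each
// universal-character-name as some otherwise unused sequence
// ("_U_" here), leaving the hex digits as-is, so that the linker
// sees only letters, digits, and underscores.
std::string mangle_for_linker(const std::string & id)
{
    std::string out;
    for (std::string::size_type i = 0; i < id.size(); ++i) {
        if (id[i] == '\\' && i + 1 < id.size()
            && (id[i + 1] == 'u' || id[i + 1] == 'U')) {
            out += "_U_";       // invented marker for the \u or \U prefix
            ++i;                // skip the 'u' or 'U'; hex digits follow
        } else {
            out += id[i];
        }
    }
    return out;                 // e.g. "\u5375" becomes "_U_5375"
}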

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.





Author: Branko Cibej <branko.cibej@hermes.si>
Date: 1997/02/15
Fergus Henderson wrote:
> >|>  Almost.  You would need to write it like this:
> >|>
> >|>     a \u003D b \u002B c;
>
> The above is not correct.  I misread the DWP.
> In fact Universal Character Names (UCNs) can occur only
> in identifiers and string or character literals.

What about comments? IMHO extended characters (converted to UCNs) are
much more useful in comments than in identifiers. Not allowing them
would lead to the absurd situation where comments in a conforming
program couldn't be "written in the same language" as the identifiers...
--
------------------------------------------------------------------------
Branko Cibej      HERMES SoftLab, Litijska 51, 1000 Ljubljana,  Slovenia
brane@hermes.si   phone: (++386 61) 186 53 49  fax: (++386 61) 186 52 70
------------------------------------------------------------------------





Author: blair@trojan.neta.com (Blair P Houghton)
Date: 1997/02/16
Branko Cibej  <brane@hermes.si> wrote:
>Fergus Henderson wrote:
>> In fact Universal Character Names (UCNs) can occur only
>> in identifiers and string or character literals.
>
>What about comments? IMHO extended characters (converted to UCNs) are
>much more useful in comments than in identifiers. Not allowing them
>would lead to the absurd situation where comments in a conforming
>program couldn't be "written in the same language" as the identifiers...

They're okay.

The DWP (2 Dec 1996) describes how to delimit a comment
(section 2.7 Comments [lex.comment]), but is silent on what
a comment means, and almost silent on what it may contain,
except to say that formfeeds and vertical tabs can't be
followed by anything but whitespace.

In fact, the syntax of comments is left out of the grammar
entirely.

You have to go into translation phase 3 (2.1 Phases of
Translation, [lex.phases]) to find that comments are
made of preprocessing tokens, and preprocessing tokens
(2.4 Preprocessing tokens [lex.pptoken]) include
all of the "normal" usages of UCNs, plus the nifty
hack "each non-white-space character that cannot be one of the
above."  "The above" referring to just about every kind of
lexical token the standard defines.  (Unfortunately, this
nifty hack leaves out stray quotation marks (see paragraph
2 of [lex.pptoken]), so that's another constraint on the
formation of comments).

I.e., UCNs are permitted in comments whether or not they
constitute parts of identifiers or string or character
literals.

Translation phase 3 then says each comment is replaced by a
single space.

Now that I think about it, all is not well.  The stray
quotation mark restriction forms something of a paradox.
A comment should be allowed to contain such things, IMO,
but apparently it may not.

    --Blair
      " "

P.S.  I also just noticed that the first section, 1.1 Scope,
amusingly covers both the scope of the standard and the
scope of names in the libraries...





Author: Stephen.Clamage@eng.sun.com (Steve Clamage)
Date: 1997/02/16
In article 1@trojan.neta.com, blair@trojan.neta.com (Blair P Houghton) writes:
>
>You have to go into translation phase 3 (2.1 Phases of
>Translation, [lex.phases]) to find that comments are
>made of preprocessing tokens,

No. The exact text is:
"3 The source file is decomposed into preprocessing tokens (2.4) and
sequences of white-space characters (including comments). A source file
shall not end in a partial preprocessing token or partial comment.  Each
comment is replaced by one space character."

Section 2.4 "Preprocessing tokens" further says:
"Preprocessing tokens can be separated by white space; this consists of
comments (2.7), or white-space characters (space, horizontal tab, new-line,
vertical tab, and formfeed), or both."

Thus, a comment is not a preprocessing token, and is not composed of
preprocessing tokens. Anything may appear between the start and end
delimiters of a comment, and the entire comment is considered white space.
(Of course, the discovery and elimination of comments does not occur
until phase 3 of translation, so the phase 1 and 2 rules still apply.)
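
For instance, because trigraph replacement (phase 1) and line splicing
(phase 2) happen before comments are recognized in phase 3, a trigraph
at the end of a // comment can quietly extend the comment.  A sketch,
assuming a compiler that implements trigraphs as the draft requires:

int x = 1;

int main()
{
    // is this comment finished ??/
    x = 2;          // spliced into the comment above!
    return x;       // so this returns 1, not 2
}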

---
Steve Clamage, stephen.clamage@eng.sun.com





Author: fjh@mundook.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/02/16
Branko Cibej <branko.cibej@hermes.si> writes:

>Fergus Henderson wrote:
>> >|>  Almost.  You would need to write it like this:
>> >|>
>> >|>     a \u003D b \u002B c;
>>
>> The above is not correct.  I misread the DWP.
>> In fact Universal Character Names (UCNs) can occur only
>> in identifiers and string or character literals.
>
>What about comments? IMHO extended characters (converted to UCNs) are
>much more useful in comments than in identifiers. Not allowing them
>would lead to the absurd situation where comments in a conforming
>program couldn't be "written in the same language" as the identifiers...

Oh, sure, UCNs are allowed in comments too.
(You can put just about anything in the basic character set
in comments, and UCNs are written in the basic character set.)

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.





Author: dhansen@btree.com (Dave Hansen)
Date: 1997/02/11
On 06 Feb 97 02:11:29 GMT, James Kanze
<james-albert.kanze@vx.cit.alcatel.fr> wrote:

>fjh@mundook.cs.mu.OZ.AU (Fergus Henderson) writes:
>
>|>  The draft C++ standard doesn't require implementations to allow ISO-10646
>|>  characters in identifiers.  It only requires them to support "\uNNNN"
>|>  and "\UNNNNNNNN" escapes in identifiers.  That should be very easy to
>|>  implement, I think -- you only need a small change to the lexer,
>|>  and a small change to the mangling/demangling algorithm.
>
>I think it's slightly more complicated than you suggest.  If I
>understand correctly, "\u0061" and "a" must be the same symbol.  And if
>I'm not mistaken, under the OS/390 which Ross Ridge mentioned (or did he
>really mean OS/360?), 'a' == 0xc1 (or something like that, I can't
>remember the exact encoding of EBCDIC, it's been so long ago).

OK, I'll bite.  Does this mean that a symbol like /u2B, which might
map to an alphabetic character in an encoding scheme such as FIELDATA,
but which maps to '+' in ASCII, would be a legal identifier in one
system and not another?

Does it also mean that a statement like

   a \3D b \2B c;

is equivalent to

   a = b + c;

in a compiler that uses ASCII as its source character set? (I would
expect not, but I've been surprised before!  :)

Regards,

                             -=Dave

I can barely speak for myself, so I certainly can't speak for B-Tree.







Author: fjh@murlibobo.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/02/11
dhansen@btree.com (Dave Hansen) writes:

>OK, I'll bite.  Does this mean that a symbol like /u2B, which might
>map to an alphabetic character in an encoding scheme such as FIELDATA,
>but which maps to '+' in ASCII, would be a legal identifier in one
>system and not another?

No, for several reasons.  Firstly, you need to use the backslash `\', not
the forward-slash `/'.  Secondly, for `\u', you must always provide
exactly four hexadecimal digits, so it would have to be `\u002B'.
Thirdly, the encoding scheme used for these escapes is always Unicode,
so `\u002B' always maps to `+' in the source character set, no matter
what source character set or what execution character set the
implementation uses.
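
The same point can be made with a literal: the value on the source side
of the mapping is always the ISO-10646 code, whatever encoding the
source file itself uses.  A sketch (assuming the execution character
set contains such a character at all, which is the separate problem
discussed elsewhere in this thread):

char e_acute = '\u00E9';    // LATIN SMALL LETTER E WITH ACUTE,
                            // converted to the execution character set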

>Does it also mean that a statement like
>
>   a \3D b \2B c;
>
>is equivalent to
>
>   a = b + c;
>
>in a compiler that uses ASCII as its source character set?

Almost.  You would need to write it like this:

 a \u003D b \u002B c;

>(I would expect not, but I've been surprised before!  :)

Indeed ;-)

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.





Author: jfieber@indiana.edu (John Fieber)
Date: 1997/02/11
In article <5dqb08$oe@palladium.transmeta.com>,
 hpa@transmeta.com (H. Peter Anvin) writes:
> Followup to:  <christian.bau-2701971014100001@christian-mac.isltd.insignia.com>
> By author:    christian.bau@isltd.insignia.com (Christian Bau)
>>    FFFF = Not A Character. A sequence of unicode codes can contain this
>> code, and it is not a character. I don't think it would be wise to use this
>> as an end of file indicator generally.
>>
>
> Why not?  It seems to be exactly the kind of thing this character was
> reserved for.  However, since most systems use int for getc() and the
> like, and int is nowadays usually 32 bits or more, using -1 for EOF
> and 0..0xffff for Unicode (0..0x7fffffff for full UCS-4) seems most
> reasonable.

To quote from the Unicode 2.0 book (2-13):

 "U+FFFF is reserved for private program use as a sentinal or other
 signal.  (Notice that U+FFFF is a 16-bit representation of -1 in a
 two's-compliment notation.)  Programs receiving this code are not
 required to interpret it in any way.  It is good practice, however,
 to recognize this code as a non-character value and to take
 appropriate action, such as indicating possible corruption of the
 text."

-john





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/10
jcoffin@taeus.com (Jerry Coffin) writes:

|>  In article <rf5k9oozx8r.fsf@vx.cit.alcatel.fr>, james-
|>  albert.kanze@vx.cit.alcatel.fr says...

|>  > |>  Of course, in C++, this basic issue becomes moot assuming one
|>  > |> uses iostreams instead of C style I/O, since indication of end-
|>  > |> of-file is done with a separate bit and detected with a separate
|>  > |> function instead of a special character.  (In fact, "EOF" isn't
|>  > |> listed in the index of the standard at all...)
|>  >
|>  > First, of course, C also has the function feof.  Furthermore, I
|>  > think that it is the only way of reliably detecting end of file with
|>  > getwc.  So, fundamentally, C and C++ do not differ that much.
|>
|>  I agree that the two are theoretically similar in that respect.
|>  However, a LOT of C programs depend on fgetc/getc/getchar returning
|>  EOF to signal the end of a file.  As such, if EOF ceases to be at
|>  least somewhat dependable to indicate that you've either reached the
|>  end of a file or had an error, a LOT of programs break in a hurry.

Correct.

|>  By contrast, in C++, if you no longer had a value the stream could
|>  return that was distinct from any value that could have been read from
|>  the stream, comparatively few programs would break.

Wrong.  Like C, people using C++ do both formatted (scanf, operator>>)
and unformatted (getc, istream::get/peek) input, according to the
application.  For example, a significant part of my program input
involves parsing ASN.1 types.  This is not supported directly by
operator>> (understandably).  In my older code, I used istream::get and
istream::peek, and in newer code, I access the streambuf directly.  In
both cases, end of file (or an input error) is signaled by returning
EOF, exactly as was the case in C.  If EOF disappears, 90% of my input
routines will break.
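
A minimal sketch of the pattern in question, in classic (pre-template)
iostream style, where end of file or error is signalled in-band:

#include <cstdio>       // for the EOF macro
#include <iostream>

long count_bytes(std::streambuf *sb)
{
    long n = 0;
    while (sb->sbumpc() != EOF)     // EOF is the in-band sentinel
        ++n;
    return n;
}

Under the draft's templated streambufs, the sentinel is spelled
char_traits<char>::eof() instead -- which is exactly the point.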

|>  > And C++ does have EOF.  Because the iostreams are templates,
|>  > however, it is spelled differently: char_traits< charT >::eof().  I
|>  > do hope that most implementations will provide EOF anyway for a
|>  > transitional period, with the same value that would be returned by
|>  > char_traits< char >::eof().  But can they, legally?
|>
|>  They certainly can (and must) include it in <cstdio>.  I don't see a
|>  way to include it in <iostream> at least in the std namespace.  And
I'm pretty sure they can't just put it in another namespace either.

If I remember correctly, the C standard *required* it to be a macro.
Since the C standard is included in the C++ standard by reference,
unless there are specific words in the C++ standard to remove this
requirement, it must also be a macro in C++.  So it cannot be put in a
namespace (or rather, as a macro, it ignores namespaces).

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: fjh@mundook.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/02/04
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:

>fjh@cs.mu.OZ.AU writes:
>
>>Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:
>>>Even depending on a compiler
>>>extension and naming it "Mq" would be preferable.
>>
>>But then your program wouldn't be portable.
>
>Is the C++ standard really guaranteeing "\u5373" to be portable?

Yes, that's the idea.

>Requiring all ISO-10646 characters to be allowed in identifiers is one
>hell of an imposition to be forced on implementations this late in the
>standardization process.

The draft C++ standard doesn't require implementations to allow ISO-10646
characters in identifiers.  It only requires them to support "\uNNNN"
and "\UNNNNNNNN" escapes in identifiers.  That should be very easy to
implement, I think -- you only need a small change to the lexer,
and a small change to the mangling/demangling algorithm.

>>Of course, if you want to make use of editors and other tools that
>>do not have special support for C++ extended character notation,
>>then it would make sense to use a non-portable platform-specific
>>representation of the source text.  However, you still want to
>>be able to port your programs.  Hence, even though you may use
>>a platform-specific source code representation most of the time during
>>development, you still need a portable format that you can convert it
>to for porting.
>
>No.  I do not need a portable format, I need a method of converting
>between character sets.  An EUC-JP encoded file isn't going to compile
>under OS/390 whether "\u5375", "Mq", or "tamago" is used.

If you take your EUC-JP encoded file, and use a file conversion utility
to convert it into the standardized "\u" encoding, then it should
compile fine on OS/390, presuming you have a conforming C++ compiler
on your OS/390 system and presuming that your program isn't non-portable
for some other reason.

>>>Heck, what advantage
>>>does it have over naming the variable "u5375"?
>>
>>Your C++-language-sensitive extended-character-set editor will display
>>it differently.
>
>Why couldn't this mythical editor display "u5375" the same way?

Because `u5375' is already a valid identifier.

>>There's no way of automatically converting programs containing
>>identifiers such as "u5375" into extended character sets or
>>vice versa.
>
>It's trivial to do,

No, it's impossible to do correctly.
The identifiers "u5375" and "\u5375" are different,
and if your conversion utility maps them onto the same identifier,
then your conversion utility is broken.
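
Concretely, under the draft's rules:

int u5375 = 0;      // an ordinary five-character identifier
int \u5375 = 1;     // a different identifier: the single extended
                    // character U+5375, written portably

A converter that turned "u5375" into the extended character would
silently merge two distinct variables.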

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/04
jcoffin@taeus.com (Jerry Coffin) writes:

|>  In article <rf5hgk7mleb.fsf@vx.cit.alcatel.fr>, james-
|>  albert.kanze@vx.cit.alcatel.fr says...
|>
|>  [ ... ]
|>
|>  > Just a quibble (and I'm skating on thin ice here, as I do not have
|>  > my copy of the C standard handy to verify), but I think that the C
|>  > standard only requires characters in the "basic source character
|>  > set" to be positive.  With my Sun compiler, for example, the source
|>  > character set is ISO 8859-1.  While all of the "basic source
|>  > character set" (the characters listed in the section on lexical
|>  > analysis) are positive, the "source character set" also includes
|>  > accented characters (which can be used in comments, character
|>  > constants, and string literals) which are not positive.  I am almost
|>  > sure that this is conforming C, and am not aware of anything in the
|>  > C++ draft which would make it non-conforming C++.  (I regularly use
|>  > the accented characters in both comments and string literals, and
|>  > I'd hate to think that I was using a non-conforming extension.)
|>
|>  Other than the _extremely_ minor quibble that the standard uses the
|>  phrase "required source character set" instead of "basic source
|>  character set", you've got this exactly correct abotu the C standard.
|>
|>  However, I can't find a matching requirement in the current draft of
|>  the C++ standard.

I couldn't either.  But then, I also missed the fact that a
universal-character-name can be part of an identifier, so that doesn't
mean much.  (On the other hand, I did do a grep for 'execution.*char',
and looked at all of the places which it found.)

|>  Section 5.2.1 lists the characters required to be present in both the
|>  source and execution character sets, and specifies that:
|>
|>   If any other characters are encountered in a source file
|>   (except in a character constant, a string literal, a header
|>   name, a comment, or a preprocessing token that is never
|>   converted to a token), the behavior is undefined.

This is the C standard.  The C++ standard talks about the "basic source
character set", in section 2.2.  I was unable to find even a guarantee
that all of the basic source characters are present in the execution
character set.  (The fact that phase 5 makes no mention of a possible
error in mapping the source set, including characters specified by
universal-character-names, to the execution set presumably doesn't mean
that this mapping cannot fail.  I would like to see a failed mapping
require a diagnostic, however, rather than have it be undefined
behavior.)

|>  This doesn't sound like what you're using is non-conforming to me.
|>
|>  > Again, it was my (possibly mistaken) impression that all of the
|>  > characters in the "basic source character set":
|>  >
|>  > 1. must be present in the execution character set, and
|>  >
|>  > 2. must have a positive representation in the execution character
|>  > set.
|>
|>  Sounds right to me.

I'd also like to see this requirement in the C++ standard.  As it now
stands, "isalpha( 'a' )" is potentially undefined behavior (since 'a'
may have a negative representation in the execution character set).
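
The usual workaround is the same as in C -- force the argument into the
range the <ctype.h> functions require:

#include <cctype>

// isalpha() requires an argument representable as unsigned char (or
// EOF); converting first avoids the potential undefined behavior when
// plain char is signed and the character is negative.
inline bool is_alpha_char(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}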

|>  > |>  An implementation could use ISO646 encoding with 8-bit chars
|>  > |> (signed  or unsigned) and meet the C and C++ requirements. Or it
|>  > |> could use Unicode with 16-bit chars and meet the requirements.
|>  >
|>  > Wouldn't the 16-bit char's have to be unsigned?  I believe that
|>  > Unicode contains characters larger than 0x7fff.
|>
|>  I don't think so - Unicode contains characters greater than 0x7fff,
|>  but it doesn't look like any of those characters is a member of the
|>  required source character set, so their being stored as negative
|>  values would be perfectly legal as far as the C standard cares.
|>  Whether it would still be considered legal Unicode or not I'm less
|>  certain.

Right.  Careless thinking on my part.

|>  > Would a 16 bit implementation (with 16 bit int's) which declared
|>  > wchar_t as unsigned short, and EOF as -1, be legal?
|>
|>  I think so; EOF is required to be different from any legal value of a
|>  char that could be returned from getc(), but I don't see a requirement
|>  that it be distinct from any value of wchar_t.
|>
|>  Of course, in C++, this basic issue becomes moot assuming one uses
|>  iostreams instead of C style I/O, since indication of end-of-file is
|>  done with a separate bit and detected with a separate function instead
|>  of a special character.  (In fact, "EOF" isn't listed in the index of
|>  the standard at all...)

First, of course, C also has the function feof.  Furthermore, I think
that it is the only way of reliably detecting end of file with getwc.
So, fundamentally, C and C++ do not differ that much.

And C++ does have EOF.  Because the iostreams are templates, however, it
is spelled differently: char_traits< charT >::eof().  I do hope that
most implementations will provide EOF anyway for a transitional period,
with the same value that would be returned by
char_traits< char >::eof().  But can they, legally?
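
Concretely, the question is whether an implementation may make the two
spellings agree:

#include <cstdio>   // the C EOF macro
#include <string>   // char_traits, in the draft library

// The draft does not (as far as I can see) guarantee this, although on
// common implementations char_traits<char>::eof() is simply EOF, i.e. -1.
bool eof_spellings_agree()
{
    return std::char_traits<char>::eof() == EOF;
}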

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/04
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:

|>  Is the C++ standard really guaranteeing "\u5373" to be portable?

Yes.

|>  Requiring all ISO-10646 characters to be allowed in identifiers is one
|>  hell of an imposition to be forced on implementations this late in the
|>  standardization process.

Not really, since the implementation can store it precisely in this form
in its symbol table.  There are a few gotchas, of course: the symbol
"a" must match the symbol "\u0061".  This should only be a problem for
the implementations not using ASCII internally, however.

It does require some additional checks, too, since not ALL ISO-10646
characters are legal.  (Presumably, those that are legal are those that
correspond to alpha characters in the respective alphabets.)

Now if they'd have also allowed them in numerical constants, that might
have posed some problems.

|>  (Yes, I have no clue what the standard says
|>  about this beyond what has been mentioned in this thread.)
|>
|>  >(I assume "Mq" is supposed to be a representation for the
|>  >extended character with hexadecimal code 5375; something seems
|>  >to have got lost in the translation.)
|>
|>  Yes.
|>
|>  >Of course, if you want to make use of editors and other tools that
|>  >do not have special support for C++ extended character notation,
|>  >then it would make sense to use a non-portable platform-specific
|>  >representation of the source text.  However, you still want to
|>  >be able to port your programs.  Hence, even though you may use
|>  >a platform-specific source code representation most of the time during
|>  >development, you still need a portable format that you can convert it
|>  >to for porting.
|>
|>  No.  I do not need a portable format, I need a method of converting
|>  between character sets.  An EUC-JP encoded file isn't going to compile
|>  under OS/390 whether "\u5375", "Mq", or "tamago" is used.  If I do want
|>  a standard intermediate representation to make the job converting between
|>  character sets easier, I already have it, it's called ISO-10646.

The whole point is that if the file consists only of characters from
the basic character set, and uses "\u5375", it WILL compile.

I understand your point about code translation.  If I wrote the file on
my Sun, for example, the basic character set will be encoded in ISO
8859-1 (which == ASCII for all of the characters in the basic character
set).  I will still need to translate it to compile it on an IBM
mainframe (which uses EBCDIC).  The difference is, of course, that the
translation ASCII->EBCDIC is pretty standard, whereas the other
translations aren't.  By using "\u5375", there will be no loss of
information.

On the other hand, I'm still up in the air as to what the IBM will do if
it encounters this sequence in a string literal.  It must map it to the
execution character set.  If the execution character set is EBCDIC,
there is going to be a problem.  I see several possible solutions:

1. The compiler must issue a diagnostic.  There goes portability.

2. It is undefined behavior.  There goes portability, but you don't know
it until the disk is reformatted.

3. It is required to work.  There goes any chance of getting a
conforming compiler on such systems.

4. The compiler is required to "leave it as it is".  The output display
will show 6 characters, starting with a "\u", but at least the program
will compile and run, and not do anything untoward.

My interpretation of the current draft is 2, which seems, all in all,
the worst choice possible.

|>  >>Heck, what advantage
|>  >>does it have over naming the variable "u5375"?
|>  >
|>  >Your C++-language-sensitive extended-character-set editor will display
|>  >it differently.
|>
|>  Why couldn't this mythical editor display "u5375" the same way?

Because then you couldn't distinguish between a variable using an
extended character and one that doesn't.  "\u5375" and "u5375" are
different variables.  The advantage of using the former is that local
editors will be able to display it in a meaningful form; they cannot
do this for the latter, because its meaningful form could be "u5375"
(and not some Chinese character).

|>  >There's no way of automatically converting programs containing
|>  >identifiers such as "u5375" into extended character sets or
|>  >vice versa.
|>
|>  It's trivial to do, but outside of things like this mythical editor
|>  there is no need or reason to.
|>
|>  >>I suppose this is just more trigraph stupidity isn't it?
|>  >
|>  >It makes sense to standardize a portable representation for
|>  >programs using extended character sets.
|>
|>  You haven't shown this.  Anyways, since I'm hearing the same sort of
|>  arguments, I'll take your response as a yes.

You're hearing the same sort of arguments, because they are valid
arguments.  At least for those of us who regularly work with extended
character sets.  (And I've got it easy, compared to some.  At least my
language uses the Roman alphabet.)

Whether the current proposal is the best solution, I don't know.  It is
A solution, however, or at least a partial one, and the problem IS
real, whether you've encountered it or not.

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: Darron.Shaffer@beasys.com (Darron Shaffer)
Date: 1997/02/05
In article <rf5iv48zw6i.fsf@vx.cit.alcatel.fr>, James Kanze
<james-albert.kanze@vx.cit.alcatel.fr> wrote:

>I understand your point about code translation.  If I wrote the file on
>my Sun, for example, the basic character set will be encoded in ISO
>8859-1 (which == ASCII for all of the characters in the basic character
>set).  I will still need to translate it to compile it on an IBM
>mainframe (which uses EBCDIC).  The difference is, of course, that the
>translation ASCII->EBCDIC is pretty standard, whereas the other
>translations aren't.  By using "\u5375", there will be no loss of
>information.

Which version of EBCDIC do you consider standard?  There are at least six.
Unfortunately characters like [] are at different code points in different
versions.


>
>On the other hand, I'm still up in the air as to what the IBM will do if
>it encounters this sequence in a string literal.  It must map it to the
>execution character set.  If the execution character set is EBCDIC,
>there is going to be a problem.  I see several possible solutions:
>
>1. The compiler must issue a diagnostic.  There goes portability.
>
>2. It is undefined behavior.  There goes portability, but you don't know
>it until the disk is reformatted.
>
>3. It is required to work.  There goes any chance of getting a
>conforming compiler on such systems.
>
>4. The compiler is required to "leave it as it is".  The output display
>will show 6 characters, starting with a "\u", but at least the program
>will compile and run, and not do anything untoward.
>
>My interpretation of the current draft is 2, which seems, all in all,
>the worst choice possible.

Yes, this is not good, though I would expect any compiler vendor to do
a better job than the "format the disk" approach.  It would be nice if
they were required to document what they do.

--
Darron Shaffer
Darron.Shaffer@beasys.com








Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/06
fjh@mundook.cs.mu.OZ.AU (Fergus Henderson) writes:

|>  The draft C++ standard doesn't require implementations to allow ISO-10646
|>  characters in identifiers.  It only requires them to support "\uNNNN"
|>  and "\UNNNNNNNN" escapes in identifiers.  That should be very easy to
|>  implement, I think -- you only need a small change to the lexer,
|>  and a small change to the mangling/demangling algorithm.

I think it's slightly more complicated than you suggest.  If I
understand correctly, "\u0061" and "a" must be the same symbol.  And if
I'm not mistaken, under the OS/390 which Ross Ridge mentioned (or did he
really mean OS/360?), 'a' == 0xc1 (or something like that, I can't
remember the exact encoding of EBCDIC, it's been so long ago).

|>  >No.  I do not need a portable format, I need a method of converting
|>  >between character sets.  An EUC-JP encoded file isn't going to compile
|>  >under OS/390 whether "\u5375", "Mq", or "tamago" is used.
|>
|>  If you take your EUC-JP encoded file, and use a file conversion utility
|>  to convert it into the standardized "\u" encoding, then it should
|>  compile fine on OS/390, presuming you have a conforming C++ compiler
|>  on your OS/390 system and presuming that your program isn't non-portable
|>  for some other reason.

Something both you and I missed: an EUC-JP encoded file is NOT legal
input for a conforming C++ compiler, or at least not portable input.
The whole point of the extension is to give the programmer a way of
doing this portably, without requiring the implementation to support
extended characters.

All of the characters in the basic character set appear somewhere in all
of the variants of EBCDIC (although '|' doesn't always appear in the
same place); the real advantage of the \u encoding is that it provides a
portable way to access extended characters, WITHOUT having characters
other than those in the basic character set in your source files.
Porting to an EBCDIC machine thus only requires translating the basic
character set, and not extended characters, which potentially aren't
even present in EBCDIC.

For total portability, you have to use trigraphs as well, as there are
code sets (national variants of ISO 646) which do not have '{', etc.  On
the other hand, the use of such code sets is diminishing rapidly, and of
course, there are code sets which C (and thus C++) never claimed to
support.  (To start with, of course, those that don't have both upper
and lower case:-).)  As I mentioned in another posting, trigraphs (with
correct compiler/editor support for them) would have been a Godsend in
1985; today, they are probably not that important.

Note that while all of the machines I currently have access to, or even
hear about, can display '{' on the screen, none of the PC's I see have
the character on the keyboard; it is replaced by more useful characters
like e-accent-aigu.  Of course, there is always a way of entering the
character, by pressing three keys simultaneously.  (At such times, I
really appreciate my training as a classical guitarist:-).)  I'm not
sure that trigraphs are a good solution, but at least, they do allow
inputting the program without exceptional finger gymnastics.  And of
course, with a compilation system that supported them correctly, they
would display as '{', supposing that the display supported the character.
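
For readers who have never seen them, a complete (if ugly) sketch using
trigraphs for the characters missing from some national variants of
ISO 646:

??=include <cstdio>                /* ??= stands for #          */

int main()
??<                                /* ??< and ??> stand for { } */
    int a??(2??) = ??< 1, 2 ??>;   /* ??( and ??) stand for [ ] */
    std::printf("%d\n", a??(0??) ??! a??(1??));   /* ??! is |   */
    return 0;
??>

With editor support of the kind described above, of course, one would
never actually see this spelling on the screen.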

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: fjh@murlibobo.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/02/06
James Kanze <james-albert.kanze@vx.cit.alcatel.fr> writes:

>fjh@mundook.cs.mu.OZ.AU (Fergus Henderson) writes:
>
>|>  The draft C++ standard doesn't require implementations to allow ISO-10646
>|>  characters in identifiers.  It only requires them to support "\uNNNN"
>|>  and "\UNNNNNNNN" escapes in identifiers.  That should be very easy to
>|>  implement, I think -- you only need a small change to the lexer,
>|>  and a small change to the mangling/demangling algorithm.
>
>I think it's slightly more complicated than you suggest.  If I
>understand correctly, "\u0061" and "a" must be the same symbol.  And if
>I'm not mistaken, under the OS/390 which Ross Ridge mentioned (or did he
>really mean OS/360?), 'a' == 0xc1 (or something like that, I can't
>remember the exact encoding of EBCDIC, it's been so long ago).

That's true, but I'd still call it a "small" change to the lexer.
The code to canonicalize an identifier is only about 100 lines or so.
(See the code enclosed below.)

>Something both you and I missed: an EUC-JP encoded file is NOT legal
>input for a conforming C++ compiler, or at least not portable input.

A conforming C++ compiler *is* allowed to accept input in EUC-JP
encoding, and (if I read the draft correctly) it doesn't have to issue
any diagnostics if the input contains characters outside of the basic
source character set.  Such source code is conforming, but not strictly
conforming (and as you say, not portable).

--------------------------------------------------

#include <stdlib.h>
#include <algorithm>

static unsigned long decode_hex(char *from, int len);
static bool is_basic_char(unsigned long unicode, char & basic_char);

void canonicalize(char *identifier)
    // Precondition: `identifier' is a syntactically valid C++ identifier.
    // Canonicalizes the identifier by replacing any `\u' or `\U' escapes
    // in it that name characters in the C++ basic source character set
    // with their unescaped equivalents.
{
    char *to = identifier;
    char *from = identifier;
    while (*from != '\0') {
        char next_char;
        if (from[0] == '\\' && from[1] == 'u' &&
            is_basic_char(decode_hex(from + 2, 4), next_char))
        {
            from += 2 + 4;      // skip over the \uNNNN escape
        } else if (from[0] == '\\' && from[1] == 'U' &&
            is_basic_char(decode_hex(from + 2, 8), next_char))
        {
            from += 2 + 8;      // skip over the \UNNNNNNNN escape
        } else {
            next_char = *from++;
        }
        *to++ = next_char;
    }
    *to = '\0';
}

static unsigned long decode_hex(char *str, int len)
    // Precondition: `str' points to a string of `len' hexadecimal digits,
    //               possibly followed by other stuff.
    // Returns the value of the hexadecimal number in the first `len'
    // characters of `str'.  Leaves `str' unchanged.
{
    char temp = str[len];
    str[len] = '\0';
    unsigned long n = strtoul(str, NULL, 16);
    str[len] = temp;
    return n;
}

// a table entry for one member of the basic source character set
struct Encoding {
    unsigned long unicode_id;
    char character;
};

static bool operator < (const Encoding & a, const Encoding & b)
{
    return a.unicode_id < b.unicode_id;
}

static bool is_basic_char(unsigned long unicode, char & basic_char)
    // If `unicode' is the ISO 10646 code of a member of the basic
    // source character set, sets `basic_char' to the corresponding
    // character and returns true; otherwise returns false.
{
    // the table must be sorted by unicode_id for the binary search below
    static const Encoding basic_chars[] = {
        { 0x41, 'A' },
        { 0x42, 'B' },
        // ...
        { 0x61, 'a' },
        { 0x62, 'b' },
        // ... etc.
    };
    const int num_basic_chars = sizeof(basic_chars)/sizeof(*basic_chars);

    // search the table, using binary search

    Encoding search_obj;
    search_obj.unicode_id = unicode;
    const Encoding *this_char = std::lower_bound(
        basic_chars, basic_chars + num_basic_chars, search_obj);
    if (this_char != basic_chars + num_basic_chars
        && this_char->unicode_id == unicode)
    {
        basic_char = this_char->character;
        return true;
    } else {
        return false;
    }
}
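
Usage, for concreteness (the doubled backslash in the string literal
hands the lexer a single one):

#include <cstdio>

int main()
{
    char id[] = "\\u0041bc";    // the identifier \u0041bc
    canonicalize(id);
    std::printf("%s\n", id);    // prints "Abc": 0041 is basic 'A'
    return 0;
}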

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/06
Darron.Shaffer@beasys.com (Darron Shaffer) writes:

|>  In article <rf5iv48zw6i.fsf@vx.cit.alcatel.fr>, James Kanze
|>  <james-albert.kanze@vx.cit.alcatel.fr> wrote:
|>
|>  >I understand your point about code translation.  If I wrote the file on
|>  >my Sun, for example, the basic character set will be encoded in ISO
|>  >8859-1 (which == ASCII for all of the characters in the basic character
|>  >set).  I will still need to translate it to compile it on an IBM
|>  >mainframe (which uses EBCDIC).  The difference is, of course, that the
|>  >translation ASCII->EBCDIC is pretty standard, whereas the other
|>  >translations aren't.  By using "\u5375", there will be no loss of
|>  >information.
|>
|>  Which version of EBCDIC do you consider standard?  There are at least six.
|>  Unfortunately characters like [] are at different code points in different
|>  versions.

Which ever one is being used on the machine in question.  (And of
course, EBCDIC is really just an example.)

|>  >On the other hand, I'm still up in the air as to what the IBM will do if
|>  >it encounters this sequence in a string literal.  It must map it to the
|>  >execution character set.  If the execution character set is EBCDIC,
|>  >there is going to be a problem.  I see several possible solutions:
|>  >
|>  >1. The compiler must issue a diagnostic.  There goes portability.
|>  >
|>  >2. It is undefined behavior.  There goes portability, but you don't know
|>  >it until the disk is reformatted.
|>  >
|>  >3. It is required to work.  There goes any chance of getting a
|>  >conforming compiler on such systems.
|>  >
|>  >4. The compiler is required to "leave it as it is".  The output display
|>  >will show 6 characters, starting with a "\u", but at least the program
|>  >will compile and run, and not do anything untoward.
|>  >
|>  >My interpretation of the current draft is 2, which seems, all in all,
|>  >the worst choice possible.
|>
|>  Yes, this is not good, though I would expect any compiler vendor to do
|>  a better job than the "format the disk" approach.  It would be nice if
|>  they were required to document what they do.

At the least.  From a user point of view, 3 is the best solution, but I
rather think that this would lead to too many problems for the
implementors (at least at present).  And it is probably not appropriate
for non-hosted environments, either.  I think that 4 might be adequate
for portable software, although it will result in some strange looking
output.  (On the other hand, if the output should contain kanji, and the
output device doesn't support kanji, then almost any solution will have
to result in strange looking output.)

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: jcoffin@taeus.com (Jerry Coffin)
Date: 1997/02/06
In article <rf5k9oozx8r.fsf@vx.cit.alcatel.fr>, james-
albert.kanze@vx.cit.alcatel.fr says...

[ ... ]

> I'd also like to see this requirement in the C++ standard.  As it
> now stands, "isalpha( 'a' )" is potentially undefined behavior
> (since 'a' may have a negative representation in the execution
> character set).

So it appears - and that certainly seems quite unacceptable.

> |>  Of course, in C++, this basic issue becomes moot assuming one
> |> uses iostreams instead of C style I/O, since indication of end-
> |> of-file is done with a separate bit and detected with a separate
> |> function instead of a special character.  (In fact, "EOF" isn't
> |> listed in the index of the standard at all...)
>
> First, of course, C also has the function feof.  Furthermore, I
> think that it is the only way of reliably detecting end of file with
> getwc.  So, fundamentally, C and C++ do not differ that much.

I agree that the two are theoretically similar in that respect.
However, a LOT of C programs depend on fgetc/getc/getchar returning
EOF to signal the end of a file.  As such, if EOF ceases to be at
least somewhat dependable to indicate that you've either reached the
end of a file or had an error, a LOT of programs break in a hurry.

By contrast, in C++, if you no longer had a value the stream could
return that was distinct from any value that could have been read from
the stream, comparatively few programs would break.

> And C++ does have EOF.  Because the iostreams are templates,
> however, it is spelled differently: char_traits< charT >::eof().  I
> do hope that most implementations will provide EOF anyway for a
> transitional period, with the same value that would be returned by
> char_traits< char >::eof().  But can they, legally?

They certainly can (and must) include it in <cstdio>.  I don't see a
way to include it in <iostream> at least in the std namespace.  And
I'm pretty sure they can't just put it in another namespace either.

--
    Later,
    Jerry.





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/28
fjh@mundook.cs.mu.OZ.AU (Fergus Henderson) writes:

|>  The really unfortunate things about trigraphs were that they broke
|>  existing code, and that it was quite easy to use trigraphs unintentially
|>  ("say what??!?").  I don't think either of these criticisms apply to
|>  the \u and \U notation.

Does anyone have any ideas as to how much existing code they broke?  One
of the reasons such an ugly solution was chosen was to reduce the risk
of accidentally changing something in existing code.

The really unfortunate thing about trigraphs is that systems never
really supported them correctly.  A C-aware editor could easily make the
necessary substitutions, so that you would never see them in your code,
and you would get the desired portability.  (This is, after all, exactly
what you are suggesting for \u.)

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: spoon.menright@cts.com (Mike Enright)
Date: 1997/01/29
stephen.clamage@eng.sun.com (Steve Clamage) wrote:

><...>
>Maybe, mainly because I don't think 0xffff is a valid Unicode
>character (I'm assuming 2's complement integers). I think you would
>have technical problems with that implementation, and I would not
>want to have to create that implementation nor use it.
>
>Maybe someone with a Unicode standard could jump in here and
>stem the tide of conjecture. :-)

My copy of Unicode 1.1 says U+FFFF is not a character code.


--
Mike Enright
menright@cts.com
http://www.users.cts.com/sd/m/menright/
Cardiff-by-the-Sea, California, USA





Author: jcoffin@taeus.com (Jerry Coffin)
Date: 1997/02/03
Raw View
In article <rf5hgk7mleb.fsf@vx.cit.alcatel.fr>,
james-albert.kanze@vx.cit.alcatel.fr says...

[ ... ]

> Just a quibble (and I'm skating on thin ice here, as I do not have
> my copy of the C standard handy to verify), but I think that the C
> standard only requires characters in the "basic source character
> set" to be positive.  With my Sun compiler, for example, the source
> character set is ISO 8859-1.  While all of the "basic source
> character set" (the characters listed in the section on lexical
> analysis) are positive, the "source character set" also includes
> accented characters (which can be used in comments, character
> constants, and string literals) which are not positive.  I am almost
> sure that this is conforming C, and am not aware of anything in the
> C++ draft which would make it non-conforming C++.  (I regularly use
> the accented characters in both comments and string literals, and
> I'd hate to think that I was using a non-conforming extension.)

Other than the _extremely_ minor quibble that the standard uses the
phrase "required source character set" instead of "basic source
character set", you've got this exactly correct abotu the C standard.

However, I can't find a matching requirement in the current draft of
the C++ standard.

Section 5.2.1 lists the characters required to be present in both the
source and execution character sets, and specifies that:

 If any other characters are encountered in a source file
 (except in a character constant, a string literal, a header
 name, a comment, or a preprocessing token that is never
 converted to a token), the behavior is undefined.

This doesn't sound to me as though what you're using is non-conforming.
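
Concretely, something like this should be fine - a sketch, assuming a
compiler whose source character set is ISO 8859-1, as with the Sun
compiler described above:

    #include <cstdio>

    int main()
    {
        // commentaire : préférences de l'utilisateur  (accents in a comment)
        std::printf("%s\n", "bientôt");   // ... and in a string literal
        return 0;
    }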

> Again, it was my (possibly mistaken) impression that all of the
> characters in the "basic source character set":
>
> 1. must be present in the execution character set, and
>
> 2. must have a positive representation in the execution character
> set.

Sounds right to me.

> |>  An implementation could use ISO646 encoding with 8-bit chars
> |> (signed  or unsigned) and meet the C and C++ requirements. Or it
> |> could use Unicode with 16-bit chars and meet the requirements.
>
> Wouldn't the 16-bit chars have to be unsigned?  I believe that
> Unicode contains characters larger than 0x7fff.

I don't think so - Unicode contains characters greater than 0x7fff,
but it doesn't look like any of those characters is a member of the
required source character set, so their being stored as negative
values would be perfectly legal as far as the C standard is concerned.
Whether it would still be considered legal Unicode I'm less certain.

> Would a 16-bit implementation (with 16-bit ints) which declared
> wchar_t as unsigned short, and EOF as -1, be legal?

I think so; EOF is required to be different from any legal value of a
char that could be returned from getc(), but I don't see a requirement
that it be distinct from any value of wchar_t.

Of course, in C++, this basic issue becomes moot assuming one uses
iostreams instead of C-style I/O, since indication of end-of-file is
done with a separate bit and detected with a separate function instead
of a special character.  (In fact, "EOF" isn't listed in the index of
the standard at all...)
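
A sketch of what that looks like in practice, assuming standard
iostreams:

    #include <iostream>

    int main()
    {
        int n, sum = 0;
        while (std::cin >> n)     // extraction fails at end of input
            sum += n;
        if (std::cin.eof())       // state queried by a function, not
            std::cout << "sum = " << sum << '\n';  // a reserved value
        else
            std::cerr << "input error before end of file\n";
        return 0;
    }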

--
    Later,
    Jerry.
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: Ross Ridge <rridge@calum.csclub.uwaterloo.ca>
Date: 1997/02/03
Raw View
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:
>I don't see how this could have useful semantics let alone why anyone
>would want it.  Why would anyone want to name a variable, for example,
>"\u5375" rather than "tamago" or "egg"?

fjh@mundook.cs.mu.OZ.AU writes:
>Perhaps because your C++-language-sensitive editor displays
>"\u5375" as the appropriate Japanese character.

In my dreams, sure.

>>Even depending on a compiler
>>extension and naming it "Mq" would be preferable.
>
>But then your program wouldn't be portable.

Is the C++ standard really guaranteeing "\u5375" to be portable?
Requiring all ISO-10646 characters to be allowed in identifiers is one
hell of an imposition to be forced on implementations this late in the
standardization process.  (Yes, I have no clue what the standard says
about this beyond what has been mentioned in this thread.)

>(I assume "Mq" is supposed to be a representation for the
>extended character with hexadecimal code 5375; something seems
>to have got lost in the translation.)

Yes.

>Of course, if you want to make use of editors and other tools that
>do not have special support for C++ extended character notation,
>then it would make sense to use a non-portable platform-specific
>representation of the source text.  However, you still want to
>be able to port your programs.  Hence, even though you may use
>a platform-specific source code representation most of the time during
>development, you still need a portable format that you can convert it
>to for porting.

No.  I do not need a portable format; I need a method of converting
between character sets.  An EUC-JP encoded file isn't going to compile
under OS/390 whether "\u5375", "Mq", or "tamago" is used.  If I do want
a standard intermediate representation to make the job of converting
between character sets easier, I already have one: it's called ISO-10646.

>>Heck, what advantage
>>does it have over naming the variable "u5375"?
>
>Your C++-language-sensitive extended-character-set editor will display
>it differently.

Why couldn't this mythical editor display "u5375" the same way?

>There's no way of automatically converting programs containing
>identifiers such as "u5375" into extended character sets or
>vice versa.

It's trivial to do, but outside of things like this mythical editor
there is no need or reason to.

>>I suppose this is just more trigraph stupidity isn't it?
>
>It makes sense to standardize a portable representation for
>programs using extended character sets.

You haven't shown this.  Anyway, since I'm hearing the same sort of
arguments, I'll take your response as a yes.

       Ross Ridge
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: stephen.clamage@eng.sun.com (Steve Clamage)
Date: 1997/01/25
Raw View
In article fsf@vx.cit.alcatel.fr, James Kanze
<james-albert.kanze@vx.cit.alcatel.fr> writes:
>stephen.clamage@Eng.Sun.COM (Steve Clamage) writes:
>|>
>|>  No. Ascii is not required by the C or C++ standard. The requirement is
>|>  that all members of the "source character set" (those required in writing
>|>  C++ programs) be represented in type char as positive values. Other
>|>  than that, the encoding is up to the implementation, as is the
>|>  signedness of type char.
>
>Just a quibble (and I'm skating on thin ice here, as I do not have my
>copy of the C standard handy to verify), but I think that the C standard
>only requires characters in the "basic source character set" to be
>positive.

Yes. I meant to say "basic source character set". As you noted, the
source and execution character sets are allowed to contain other
characters, but the standard mandates that they both include a
specified set of characters, called the "basic" character set.

>|>  The execution character set is allowed to have as many additional
>|>  characters as will fit in a char, and the encoding is entirely up to
>|>  the implementation. (The standard imposes no requirements except that the
>|>  details be documented.)
>
>Again, it was my (possibly mistaken) impression that all of the
>characters in the "basic source character set":
>
>1. must be present in the execution character set, and

That is what I meant by "additional characters".  There are a few
other constraints that I didn't mention, such as that the numerals '0'
through '9' must be represented by consecutive, increasing values.

>2. must have a positive representation in the execution character set.

Yes. My explanation above was sloppy (even after the addition of
"basic"). The requirements on representation of character sets apply
to the execution character set.

>|>  An implementation could use ISO646 encoding with 8-bit chars (signed
>|>  or unsigned) and meet the C and C++ requirements. Or it could use
>|>  Unicode with 16-bit chars and meet the requirements.
>
>Wouldn't the 16-bit chars have to be unsigned?  I believe that Unicode
>contains characters larger than 0x7fff.

I don't have a copy of the Unicode standard handy, but I think all
members of the C/C++ basic source character set have encodings
that are positive anyway. Additional characters are not required to
have positive char values.

>Would a 16-bit implementation (with 16-bit ints) which declared wchar_t
>as unsigned short, and EOF as -1, be legal?

Maybe, mainly because I don't think 0xffff is a valid Unicode
character (I'm assuming 2's complement integers). I think you would
have technical problems with that implementation, and I would not
want to have to create that implementation nor use it.

Maybe someone with a Unicode standard could jump in here and
stem the tide of conjecture. :-)
---
Steve Clamage, stephen.clamage@eng.sun.com
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: Ross Ridge <rridge@calum.csclub.uwaterloo.ca>
Date: 1997/01/26
Raw View
stephen.clamage@Eng.Sun.COM writes:
>The C++ draft standard was recently revised to allow "extended characters"
>(beyond the source character set) in identifiers and comments. In
>addition, it specifies a portable notation for extended characters.
>The notations \uNNNN and \UNNNNNNNN specify the character whose standard
>encoding is the hexadecimal characters NNNN or NNNNNNNN.

I don't see how this could have useful semantics, let alone why anyone
would want it.  Why would anyone want to name a variable, for example,
"\u5375" rather than "tamago" or "egg"?   Even depending on a compiler
extension and naming it "Mq" would be preferable.  Heck, what advantage
does it have over naming the variable "u5375"?

I suppose this is just more trigraph stupidity isn't it?

       Ross Ridge
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: fjh@mundook.cs.mu.OZ.AU (Fergus Henderson)
Date: 1997/01/27
Raw View
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:

>stephen.clamage@Eng.Sun.COM writes:
>>The C++ draft standard was recently revised to allow "extended characters"
>>(beyond the source character set) in identifiers and comments. In
>>addition, it specifies a portable notation for extended characters.
>>The notations \uNNNN and \UNNNNNNNN specify the character whose standard
>>encoding is the hexadecimal characters NNNN or NNNNNNNN.
>
>I don't see how this could have useful semantics let alone why anyone
>would want it.  Why would anyone want to name a variable, for example,
>"\u5375" rather than "tamago" or "egg"?

Perhaps because your C++-language-sensitive editor displays
"\u5375" as the appropriate Japanese character.

>Even depending on a compiler
>extension and naming it "Mq" would be preferable.

But then your program wouldn't be portable.

(I assume "Mq" is supposed to be a representation for the
extended character with hexadecimal code 5375; something seems
to have got lost in the translation.)

Of course, if you want to make use of editors and other tools that
do not have special support for C++ extended character notation,
then it would make sense to use a non-portable platform-specific
representation of the source text.  However, you still want to
be able to port your programs.  Hence, even though you may use
a platform-specific source code representation most of the time during
development, you still need a portable format that you can convert it
to for porting.

>Heck, what advantage
>does it have over naming the variable "u5375"?

Your C++-language-sensitive extended-character-set editor will display
it differently.

There's no way of automatically converting programs containing
identifiers such as "u5375" into extended character sets or
vice versa.

>I suppose this is just more trigraph stupidity isn't it?

It makes sense to standardize a portable representation for
programs using extended character sets.

The really unfortunate things about trigraphs were that they broke
existing code, and that it was quite easy to use trigraphs unintentionally
("say what??!?").  I don't think either of these criticisms applies to
the \u and \U notation.

--
Fergus Henderson <fjh@cs.mu.oz.au>   |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>   |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3         |     -- the last words of T. S. Garp.


[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: Christopher Eltschka <celtschk@physik.tu-muenchen.de>
Date: 1997/01/27
Raw View
Ross Ridge wrote:

[...]

> I don't see how this could have useful semantics let alone why anyone
> would want it.  Why would anyone want to name a variable, for example,
> "\u5375" rather than "tamago" or "egg"?   Even depending on a compiler
> extension and naming it "Mq" would be preferable.  Heck, what advantage
> does it have over naming the variable "u5375"?
>
> I suppose this is just more trigraph stupidity isn't it?

Well, as the C++ standard doesn't place any restrictions on your editor
(how could it :-) ), you could use an editor which displays an extended
character when it encounters a \u sequence, and produces a \u sequence
when you enter an extended character.  This way, you could work with a
readable program while having it stored in a (not so readable)
standard-conforming format with \u sequences, which is understood by
every compiler, even on systems that don't support extended character
sets.  (BTW, it would even allow transmission of such source code via
mail/news without worrying about stripped eighth bits.)

Of course, a Japanese program on computers without Japanese support
would look strange - but wouldn't it look even stranger if those
characters were inserted directly into variable names?
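
A minimal sketch of the conversion such an editor (or a standalone
filter) might perform on saving - assuming ISO 8859-1 input, whose 256
codes coincide with the first 256 codes of ISO 10646:

    #include <cstdio>

    // Copy 7-bit characters through unchanged; rewrite everything
    // else as a \u00XX universal character name.
    int main()
    {
        int c;
        while ((c = std::getchar()) != EOF) {
            if (c < 0x80)
                std::putchar(c);
            else
                std::printf("\\u%04X", c);
        }
        return 0;
    }

The output is pure 7-bit ASCII, which is exactly why the eighth-bit
problem above disappears.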


[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: christian.bau@isltd.insignia.com (Christian Bau)
Date: 1997/01/27
Raw View
In article <199701252302.PAA04004@taumet.eng.sun.com>,
stephen.clamage@eng.sun.com (Steve Clamage) responded to James Kanze:
> >Wouldn't the 16-bit chars have to be unsigned?  I believe that Unicode
> >contains characters larger than 0x7fff.
>
> I don't have a copy of the Unicode standard handy, but I think all
> members of the C/C++ basic source character set have encodings
> that are positive anyway. Additional characters are not required to
> have positive char values.

Actually, everything in the basic source character set has the same
codes as in ASCII or ISO 8859-1; all the values are between 32 and 126.

I don't think implementing char as 16-bit signed would give you a type
char that conforms to the Unicode standard.  There are tons of
characters above 0x8000, and I think they are supposed to compare
numerically greater than the lower characters.

> >Would a 16-bit implementation (with 16-bit ints) which declared wchar_t
> >as unsigned short, and EOF as -1, be legal?
>
> Maybe, mainly because I don't think 0xffff is a valid Unicode
> character (I'm assuming 2's complement integers). I think you would
> have technical problems with that implementation, and I would not
> want to have to create that implementation nor use it.

There are currently three codes in Unicode with special meaning (and
thirteen codes reserved to get a special meaning):

   FFFF = Not a character.  A sequence of Unicode codes can contain
this code, but it is not a character.  I don't think it would be wise
to use this as an end-of-file indicator generally.

   FFFD = Replacement character.  It indicates that someone tried to
translate from some other code to Unicode, and there was a character
that could not be translated to Unicode.

   FFFE = Wrong byte order indicator.  This one must _never_ appear in
any Unicode character sequence.  A file containing Unicode codes should
start with the code FEFF = zero-width no-break space (that is the
closest you can come to "nothing").  If you read a file in the wrong
byte order you get U+FFFE, so you know you made a mistake.  This cannot
be used as an end-of-file indicator.

As thirteen codes, FFF0 to FFFC, are reserved for special purposes, I
think the Unicode standard could define one of them to be an
end-of-file indicator.  However, I think it would be better if EOF were
a value that cannot possibly appear in a file of Unicode characters.
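
A sketch of how a reader might act on those codes, assuming a 16-bit
unsigned short (the classify() helper is my own invention):

    #include <cstdio>

    const unsigned short BOM         = 0xFEFF;  // zero-width no-break space
    const unsigned short SWAPPED_BOM = 0xFFFE;  // read in wrong byte order
    const unsigned short NOT_A_CHAR  = 0xFFFF;

    // Inspect the first 16-bit code of a stream claiming to be Unicode.
    const char* classify(unsigned short first)
    {
        if (first == BOM)         return "byte order confirmed";
        if (first == SWAPPED_BOM) return "swap the bytes of every code";
        if (first == NOT_A_CHAR)  return "not a character at all";
        return "no byte order mark; byte order unknown";
    }

    int main()
    {
        std::printf("%s\n", classify(0xFFFE));  // prints the swap advice
        return 0;
    }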
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/27
Raw View
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:

|>  stephen.clamage@Eng.Sun.COM writes:
|>  >The C++ draft standard was recently revised to allow "extended characters"
|>  >(beyond the source character set) in identifiers and comments. In
|>  >addition, it specifies a portable notation for extended characters.
|>  >The notations \uNNNN and \UNNNNNNNN specify the character whose standard
|>  >encoding is the hexadecimal characters NNNN or NNNNNNNN.
|>
|>  I don't see how this could have useful semantics let alone why anyone
|>  would want it.  Why would anyone want to name a variable, for example,
|>  "\u5375" rather than "tamago" or "egg"?   Even depending on a compiler
|>  extension and naming it "Mq" would be preferable.  Heck, what advantage
|>  does it have over naming the variable "u5375"?

Using an extended character in a variable name will result in a compiler
error in any case, since it cannot be lex'ed correctly.

On the other hand, if I am developing software on my machine for Japan
(and supposing I knew Japanese), I might find it useful to insert
Japanese characters into a string literal, even though my native
environment doesn't support them.  Extended characters would allow this
(albeit very painfully).

More important, perhaps: I develop my program on a Sun (ISO 8859-1) in
French, with French messages.  My compiler has no problem with the
accents, *BUT* their use is not portable.  For example, the binary code
of a-accent-grave will be different on other machines; this binary
code normally appears in my source files, which means that my messages
will be corrupted.  If I use the extended characters, however, instead
of the visible ones, I am guaranteed portability to any machine which
supports the characters used.
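
For example - a sketch; \u00E0 and \u00F4 are the ISO 10646 codes for
a-accent-grave and o-circumflex:

    #include <cstdio>

    int main()
    {
        // "a bientot" with its accents spelled portably; the compiler
        // maps each universal character name to whatever code the
        // execution character set uses for that character.
        std::printf("\u00E0 bient\u00F4t\n");
        return 0;
    }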

|>  I suppose this is just more trigraph stupidity isn't it?

I suppose you've never actually been involved in porting software to
other machines in other locales, have you?

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/23
Raw View
Alessandro Vesely <mc6192@mclink.it> writes:

|>  James Kanze wrote:
|>  > [...]  And the Unicode
|>  > Consortium's recommendation holds: use unsigned short (although most
|>  > implementations define wchar_t to be at least a short, even if they
|>  > don't support Unicode directly in a locale).
|>  >
|>  > This is, at least, my interpretation of the situation in C, and the
|>  > current situation in C++.  From statements by people involved in the
|>  > standardization of these parts of the library (Bill Plauger, Nathan
|>  > Myers), I judge that it is the *intent* of the C++ standards committee
|>  > that, where reasonable, standard C++ do somewhat better.  For various
|>  > reasons, I don't think that the C++ standard could or should impose
|>  > Unicode, or even support for a real wchar_t locale.
|>
|>  What reasons?  Would you mind explaining them [again], or giving us a pointer, please?

Basically, C and C++ are meant to be used on a wide range of platforms.
There are platforms on which ASCII (and Unicode) are not standard.
There are embedded platforms for which any extended character support is
inappropriate.

There is also the problem of the future.  At present, Unicode seems to
be the way to go, but twenty years ago, we thought that about ASCII (or
EBCDIC, depending on which platform you used).  Even now, people
involved in serious text processing (e.g.: the Omega project) are
encountering weaknesses in Unicode, and ISO 10646, at least initially,
was oriented toward 32-bit codes.  IMHO, we don't know enough about the
subject yet to fix it in stone.

This said, I do think that the standard could (and should?) require
numeric_limits< wchar_t >::max() to return at least 65535 (or should
that be 65534, with 65535 reserved for EOF?).
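
Such a requirement would at least be easy to test for; a minimal
sketch, assuming <limits> is available:

    #include <iostream>
    #include <limits>

    int main()
    {
        if (std::numeric_limits<wchar_t>::max() >= 65535)
            std::cout << "wchar_t can hold every 16-bit code\n";
        else
            std::cout << "wchar_t is too narrow for 16-bit codes\n";
        return 0;
    }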

|>  >  I do wonder,
|>  > however, if there couldn't be some sort of non-normative appendix with a
|>  > recommended practice (i.e.: define what a Unicode locale should look
|>  > like, for those implementations that want to support Unicode).
|>
|>  Do you mean that an implementation may claim to be standard C++<Unicode>,
|>  but is only required to be standard C++<US ASCII>?

Neither.  There is no such thing as standard C++<US ASCII>.  A standard
conforming implementation is only required to support a limited set of
characters (96, including tab, etc.), and may do so with any character
code it wishes.  (EBCDIC is probably the most frequent alternative to
ASCII).

The only place the standard in any way endorses a specific character
code is in universal-character-names; the code in question is 32-bit ISO
10646.  (I think that this code corresponds to Unicode for characters in
the range 0...0xffff.)  However, the fact that you write '\u0007' does
not mean that the compiler will use the constant value 7; it is required
to use whatever character corresponds to the character BEL in the
execution character set.  (Or generate an error if one doesn't exist?
The standard doesn't say what happens if the conversion to the execution
character set in phase 5 cannot take place.  I think that when not
otherwise specified, it would be undefined behavior, but any decent
implementation will issue a diagnostic.)

|>  BTW, a C++ program that uses Japanese characters to name some user defined
|>  chars, objects and/or literal values, will be a (well-formed) standard C++
|>  program in some sense? (And what about greek letters for variables? ;-)

This has been much discussed in comp.std.c.  The consensus is that this
results in an ill-formed token, and a diagnostic must be issued.  Once
the diagnostic has been issued, however, behavior is undefined, and the
compiler is free to go on and compile the program.

The undefined behavior following the diagnostic is important.  Normally,
in phase three, any character not in the basic source character set is
isolated in a token by itself.  Once the "undefined behavior" is
recognized, the compiler is free to ignore this requirement, and e.g.:
treat the unrecognized character as an alpha.

(Note that the above does NOT apply if the additional character is in a
comment, a character constant or a string literal.  In the latter two
cases, the character will simply be mapped to the execution set, often
using an identity mapping, and inserted as data into the generated
program.)

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: Alessandro Vesely <mc6192@mclink.it>
Date: 1997/01/22
Raw View
James Kanze wrote:
> [...]  And the Unicode
> Consortium's recommendation holds: use unsigned short (although most
> implementations define wchar_t to be at least a short, even if they
> don't support Unicode directly in a locale).
>
> This is, at least, my interpretation of the situation in C, and the
> current situation in C++.  From statements by people involved in the
> standardization of these parts of the library (Bill Plauger, Nathan
> Myers), I judge that it is the *intent* of the C++ standards committee
> that, where reasonable, standard C++ do somewhat better.  For various
> reasons, I don't think that the C++ standard could or should impose
> Unicode, or even support for a real wchar_t locale.

What reasons?  Would you mind explaining them [again], or giving us a pointer, please?

>  I do wonder,
> however, if there couldn't be some sort of non-normative appendix with a
> recommended practice (i.e.: define what a Unicode locale should look
> like, for those implementations that want to support Unicode).

Do you mean that an implementation may claim to be standard C++<Unicode>,
but is only required to be standard C++<US ASCII>?

BTW, a C++ program that uses Japanese characters to name some user defined
chars, objects and/or literal values, will be a (well-formed) standard C++
program in some sense? (And what about greek letters for variables? ;-)

Ciao
Ale
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]





Author: stephen.clamage@Eng.Sun.COM (Steve Clamage)
Date: 1997/01/23
Raw View
In article 4418@mclink.it, Alessandro Vesely <mc6192@mclink.it> writes:
>James Kanze wrote:
>>  I do wonder,
>> however, if there couldn't be some sort of non-normative appendix with a
>> recommended practice (i.e.: define what a Unicode locale should look
>> like, for those implementations that want to support Unicode).
>
>Do you mean that an implementation may claim to be standard C++<Unicode>,
>but is only required to be standard C++<US ASCII>?

No. Ascii is not required by the C or C++ standard. The requirement is
that all members of the "source character set" (those required in writing
C++ programs) be represented in type char as positive values. Other
than that, the encoding is up to the implementation, as is the
signedness of type char. The USASCII encoding fulfills those
requirements for type char, including 8-bit signed and unsigned char.
(That encoding is not required -- it is merely allowed. It is also popular.)

The execution character set is allowed to have as many additional
characters as will fit in a char, and the encoding is entirely up to
the implementation. (The standard imposes no requirements except that the
details be documented.)

An implementation could use ISO646 encoding with 8-bit chars (signed
or unsigned) and meet the C and C++ requirements. Or it could use
Unicode with 16-bit chars and meet the requirements.

A given implementation might have to fulfill compatibility requirements
apart from the requirements of the C and C++ standards, of course.


>BTW, a C++ program that uses Japanese characters to name some user defined
>chars, objects and/or literal values, will be a (well-formed) standard C++
>program in some sense? (And what about greek letters for variables? ;-)

The C++ draft standard was recently revised to allow "extended characters"
(beyond the source character set) in identifiers and comments. In
addition, it specifies a portable notation for extended characters.
The notations \uNNNN and \UNNNNNNNN specify the character whose standard
encoding is the hexadecimal characters NNNN or NNNNNNNN.
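
For example - a sketch, assuming an execution character set that
contains the character; \u0153 is the oe ligature, and the eight-digit
form names the same character by its full ISO 10646 code:

    #include <cstdio>

    int main()
    {
        std::printf("%s\n", "\u0153uf");      // "oeuf" -- egg
        std::printf("%s\n", "\U00000153uf");  // the same string
        return 0;
    }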
---
Steve Clamage, stephen.clamage@eng.sun.com
---
[ comp.std.c++ is moderated.  To submit articles: Try just posting with your
                newsreader.  If that fails, use mailto:std-c++@ncar.ucar.edu
  comp.std.c++ FAQ: http://reality.sgi.com/austern/std-c++/faq.html
  Moderation policy: http://reality.sgi.com/austern/std-c++/policy.html
  Comments? mailto:std-c++-request@ncar.ucar.edu
]