Topic: Unicode


Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/02/13
jfieber@indiana.edu (John Fieber) wrote:

> To quote from the Unicode 2.0 book (2-13):
>
>  "U+FFFF is reserved for private program use as a sentinel or other
>  signal.  (Notice that U+FFFF is a 16-bit representation of -1 in a
>  two's-complement notation.)  Programs receiving this code are not
>  required to interpret it in any way.  It is good practice, however,
>  to recognize this code as a non-character value and to take
>  appropriate action, such as indicating possible corruption of the
>  text."

At first glance, this seems to imply that U+FFFF should only be used to detect
data corruption and not for end-of-file.  However, quoting further [p.3-2]:

    3.1  Conformance Requirements

    Invalid Code Values

    C5  A process shall not interpret either U+FFFE or U+FFFF as an abstract
        character.

    C6  A process shall not interpret any unassigned code value as an
        abstract character.

    -  These clauses do not preclude the assignment of certain generic
       semantics that allow graceful behavior in the presence of code values
       that are outside a supported subset ...

Quoting from the first edition [p.123]:

    Special U+FFF0 -> U+FFFF

    ...

    U+FFFF.  The 16-bit unsigned hexadecimal value U+FFFF is *not* a Unicode
    character value, and can be used by an application as an error code or
    other non-character value.  The specific interpretation of U+FFFF is not
    defined by the Unicode standard, so it can be viewed as a kind of private-
    use non-character.

This means that U+FFFF is definitely not a valid Unicode character code; its
presence in a text file indicates an error because a file can never
contain such a code.  It also means that U+FFFF can indeed be used by a
conforming implementation to semantically represent something other than
a valid Unicode character; end-of-file, for instance.  Since C/C++ states
that EOF is a code that is not a valid character code, this works just fine
and conforms to both C/C++ and Unicode requirements.  The fact that U+FFFF
is -1 on two's-complement 16-bit machines and the fact that EOF has been
traditionally implemented as -1 is a nice little bonus.

You will also note that fgetc() returns EOF to indicate end-of-file and also
to indicate an error (such as file corruption).
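To make the parallel concrete, here's a minimal sketch of a hypothetical
16-bit "Unicode getc" that, like fgetc(), returns one sentinel for both
end-of-file and error.  The function name ugetc and the big-endian byte
order are my assumptions for illustration, not anything from the standards:

```c
#include <stdio.h>

#define UEOF 0xFFFFu   /* U+FFFF: not a valid Unicode character, so it
                          is free to serve as the EOF/error sentinel */

/* Read one 16-bit code from a big-endian byte stream; return UEOF on
   end-of-file or read error, just as fgetc() returns EOF for both. */
unsigned int ugetc(FILE *fp)
{
    int hi = fgetc(fp);
    int lo = fgetc(fp);

    if (hi == EOF || lo == EOF)        /* end-of-file or error */
        return UEOF;

    return ((unsigned int)hi << 8) | (unsigned int)lo;
}
```

A caller would simply loop with `while ((c = ugetc(fp)) != UEOF)`, exactly
as one loops on fgetc() against EOF today.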

The designers of Unicode actually put a great deal of thought into it
(being employees of Apple, DEC, HP, and IBM, among others).  They did not
overlook the most widespread systems development language around in their
design.

-- David R. Tribble, david.tribble@central.beasys.com --
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/02/18
> From: David R Tribble <david.tribble@central.beasys.com>
>   U+FFFF.  The 16-bit unsigned hexadecimal value U+FFFF is *not* a Unicode
>   character value, and can be used by an application as an error code or
>   other non-character value.  The specific interpretation of U+FFFF is not
>   defined by the Unicode standard, so it can be viewed as a kind of private-
>   use non-character.
>
> This means that U+FFFF is definitely not a valid Unicode character code; its
> presence in a text file indicates an error because a file can never
>---------------------------------------------------------------^^^^^
> contain such a code.  It also means that U+FFFF can indeed be used by a
> conforming implementation to semantically represent something other than
> a valid Unicode character; end-of-file, for instance.

>I was wondering how you can enforce that a file _never_ contains
>U+FFFF.  If you are just reading raw binary files and popping every
>byte-pair into the unicode value, you certainly could get
>U+FFFF (or U+FFFE, but I snipped that part).

By definition, if a Unicode text file contains undefined or invalid codes,
it's not conforming.  You state you're reading 'raw binary' files; such files
aren't text files.  If you mean you're reading 'Unicode text' files, then in
order to be proper Unicode files, they can't contain disallowed character
codes.  And you're free to flag any improper codes you encounter as errors.
Your program will be conforming but the data file won't be.

>You still ought to do the same thing that is done for characters:
>use a larger type (32 bit int) and reserve -1 for EOF and 0x0000 to 0xffff
>for Unicode.

Of course.  I've always preferred 32-bit systems over 16-bit systems
anyway.  Unicode on 32-bit systems doesn't seem to be a problem.
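The larger-type scheme suggested above might be sketched like this;
the names uchar32, WIDE_EOF, and wide_getc are illustrative, not from
any standard:

```c
#include <stdio.h>

typedef long uchar32;     /* at least 32 bits: holds all of 0x0000..0xFFFF */
#define WIDE_EOF (-1L)    /* cannot collide with any 16-bit code value */

/* Read one 16-bit code (big-endian pair assumed) into a wider type,
   so that even U+FFFF passes through as ordinary data and -1 is left
   free to mean end-of-file, as with fgetc() and int. */
uchar32 wide_getc(FILE *fp)
{
    int hi = fgetc(fp);
    int lo = fgetc(fp);

    if (hi == EOF || lo == EOF)
        return WIDE_EOF;

    return ((uchar32)hi << 8) | (uchar32)lo;
}
```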

>Many applications don't need the full binary
>possibilities and can use Nul-terminated C-style strings.  Other cases
>require binary data, and must resort to handling a length as well as a
>pointer.  In the same way, you could define a "textual" file which
>_never_ contains U+FFFF and then you would be free to use U+FFFF as an
>end-of-file indicator.

Precisely my point about 'conforming' Unicode text files.  How would your
program, which uses NUL-terminated strings to store text lines, handle a
'\x00' or U+0000 character in a text file?  It's the same problem as what
to do with a U+FFFF code in a supposedly proper text file.  The answer is
that a 'text' file with such codes in it is not a proper text file.

If you're going to treat a Unicode text file as a binary stream of 16-bit
codes, then, yes, you will probably want to handle U+FFFF and U+0000 as well
as other non-text codes.  If, instead, you're going to read a Unicode file
as a text file, then you will probably want to treat such non-text codes
as invalid, convert them to spaces, or ignore them entirely.
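The "convert them to spaces" option could look like the sketch below.
The particular set of codes treated as non-text here (U+0000, U+FFFE,
U+FFFF) is illustrative, not a definition from the Unicode standard:

```c
/* Map known non-text codes to U+0020 SPACE; pass everything else through. */
unsigned int filter_code(unsigned int c)
{
    switch (c) {
    case 0x0000u:          /* NUL */
    case 0xFFFEu:          /* byte-swapped byte-order mark */
    case 0xFFFFu:          /* non-character / sentinel */
        return 0x0020u;    /* U+0020 SPACE */
    default:
        return c;
    }
}
```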

-- David R. Tribble, david.tribble@central.beasys.com --





Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/02/11
fjh@mundook.cs.mu.OZ.AU (Fergus Henderson) wrote:

|>  The draft C++ standard doesn't require implementations to allow ISO-10646
|>  characters in identifiers.  It only requires them to support "\uNNNN"
|>  and "\UNNNNNNNN" escapes in identifiers.

Don't forget that the C++ draft also requires implementations to support
digraph tokens, which are:

    Digraph   Equivalent
      <%        {   Left  brace
      %>        }   Right brace
      <:        [   Left  bracket
      :>        ]   Right bracket
      %:        #   Hash
      %:%:      ##  Double hash

These appear to be easier to type (and read) than trigraphs.

    %:include <stdio.h>

    int main(int argc, const char **argv)
    <%
        printf("%s: Hello, world??/n", argv<:0:>);
    %>

(According to Rex Jaeschke, digraphs were invented for ISO amendments to ANSI
C and later borrowed for C++.)


-- David R. Tribble, david.tribble@central.beasys.com --








Author: Ross Ridge <rridge@calum.csclub.uwaterloo.ca>
Date: 1997/02/02
stephen.clamage@Eng.Sun.COM writes:
>The C++ draft standard was recently revised to allow "extended characters"
>(beyond the source character set) in identifiers and comments. In
>addition, it specifies a portable notation for extended characters.
>The notations \uNNNN and \UNNNNNNNN specify the character whose standard
>encoding is the hexadecimal characters NNNN or NNNNNNNN.

Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:
>I don't see how this could have useful semantics, let alone why anyone
>would want it.  Why would anyone want to name a variable, for example,
>"\u5375" rather than "tamago" or "egg"?   Even depending on a compiler
>extension and naming it "Mq" would be preferable.  Heck, what advantage
>does it have over naming the variable "u5375"?

James Kanze  <james-albert.kanze@vx.cit.alcatel.fr> wrote:
>Using an extended character in a variable name will result in a compiler
>error in any case, since it cannot be lex'ed correctly.

Are you saying Stephen Clamage is incorrect (or misleading) and "\uNNNN"
can't be used in identifiers?  Or just not variable names?  (Or do you
mean that characters from the extended source character set can't be
used directly in identifiers?)

>On the other hand, if I am developing software on my machine for Japan
>(and supposing I knew Japanese), I might find it useful to insert
>Japanese characters into a string literal, even though my native
>environment doesn't support them.  Extended characters would allow this
>(albeit very painfully).

Stephen Clamage said nothing about using "\uXXXX" in string literals,
my comments only refer to using "\uXXXX" in identifiers and comments.

>|>  I suppose this is just more trigraph stupidity isn't it?
>
>I suppose you've never actually been involved in porting software to
>other machines in other locales, have you?

Nope, never ported software to other locales.  (What the hell do you
mean by porting to other locales anyways?  Source code character set
translation?  Message database translation?  From the developer's point
of view this is all trivial.)

I did, however, spend two years maintaining a set of POSIX/XPG4-compliant
utilities that, amongst more common platforms and locales, also ran
on OS/390 and supported Japanese.  The code was more portable between
different machines and different locales than you've ever seen.  The C,
POSIX, and XPG4 standards provided a number of services that enabled this
level of locale portability.  Trigraphs weren't one of them.

       Ross Ridge





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/03
Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:

|>  stephen.clamage@Eng.Sun.COM writes:
|>  >The C++ draft standard was recently revised to allow "extended characters"
|>  >(beyond the source character set) in identifiers and comments. In
|>  >addition, it specifies a portable notation for extended characters.
|>  >The notations \uNNNN and \UNNNNNNNN specify the character whose standard
|>  >encoding is the hexadecimal characters NNNN or NNNNNNNN.
|>
|>  Ross Ridge <rridge@calum.csclub.uwaterloo.ca> writes:
|>  >I don't see how this could have useful semantics, let alone why anyone
|>  >would want it.  Why would anyone want to name a variable, for example,
|>  >"\u5375" rather than "tamago" or "egg"?   Even depending on a compiler
|>  >extension and naming it "Mq" would be preferable.  Heck, what advantage
|>  >does it have over naming the variable "u5375"?
|>
|>  James Kanze  <james-albert.kanze@vx.cit.alcatel.fr> wrote:
|>  >Using an extended character in a variable name will result in a compiler
|>  >error in any case, since it cannot be lex'ed correctly.
|>
|>  Are you saying Stephen Clamage is incorrect (or misleading) and "\uNNNN"
|>  can't be used in identifiers?  Or just not variable names?  (Or do you
|>  mean that characters from the extended source character set can't be
|>  used directly in identifiers?)

Steve is right (as usual).  I've double checked, and apparently, I
missed something in section 2.10: "Each universal-character-name in an
identifier shall designate a character whose encoding falls into one of
the ranges specified in Annex E."

I'm not sure I like this.  As far as I know, the specification of Unicode
is NOT complete.  Is this table exhaustive, or should it be automatically
extended whenever Unicode is extended?

And of course, whether a particular character is a letter (alpha) or not
may depend on the language.  (I'm thinking particularly of the raised
dot in Catalan, but I think the hyphen in Rheto-Romansch would also
qualify.)  This is not a problem with the isxxx functions, since they are
locale dependent.  It is certainly not intended (I hope) that the
meaning or even the legality of a program depend on the locale at the
moment of compilation.  (Actually, I think that the compiler is supposed
to compile the program as if it were running in locale "C".  Although
I'm not totally certain about this.)

Also, if I interpret this right, the universal-character-name is legal,
but not its equivalent in the extended source character set.  (I'm not
sure whether this is a bad thing or not.  Perhaps if the C standards
committee had said that programs not using trigraphs were illegal, the
compiler implementors would have given trigraphs some real support:-).)

(Note that my comments above are a first impression.  I've not actually
had time to study the consequences thoroughly.)

|>  >On the other hand, if I am developing software on my machine for Japan
|>  >(and supposing I knew Japanese), I might find it useful to insert
|>  >Japanese characters into a string literal, even though my native
|>  >environment doesn't support them.  Extended characters would allow this
|>  >(albeit very painfully).
|>
|>  Stephen Clamage said nothing about using "\uXXXX" in string literals,
|>  my comments only refer to using "\uXXXX" in identifiers and comments.

Well, I use a lot of "extended characters" in my comments.  (Of course,
I don't think of them as "extended".  Or didn't, until I saw what they
looked like on a machine not using ISO 8859-1.)
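For the string-literal case mentioned above, a \uNNNN escape lets one
embed such a character without typing it directly.  A minimal sketch,
assuming an execution wide-character set based on ISO 10646 (as on most
modern systems); \u5375 is the kanji for "egg" mentioned earlier in the
thread:

```c
#include <wchar.h>

/* One Japanese character embedded via a universal-character-name,
   usable even when the editing environment cannot display it. */
const wchar_t egg[] = L"\u5375";

/* Length in characters, excluding the terminating L'\0'. */
const size_t egg_len = sizeof egg / sizeof egg[0] - 1;
```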

|>  >|>  I suppose this is just more trigraph stupidity isn't it?
|>  >
|>  >I suppose you've never actually been involved in porting software to
|>  >other machines in other locales, have you?
|>
|>  Nope, never ported software to other locales.  (What the hell do you
|>  mean by porting to other locales anyways?  Source code character set
|>  translation?  Message database translation?  From the developer's point
|>  of view this is all trivial.)
|>
|>  I did, however, spend two years maintaining a set of POSIX/XPG4-compliant
|>  utilities that, amongst more common platforms and locales, also ran
|>  on OS/390 and supported Japanese.  The code was more portable between
|>  different machines and different locales than you've ever seen.  The C,
|>  POSIX, and XPG4 standards provided a number of services that enabled this
|>  level of locale portability.  Trigraphs weren't one of them.

I should have been more precise.  IMHO, the problem trigraphs originally
solved has practically ceased to exist.  It's been ages since I've seen
a machine without a '{'.  On the other hand, at the time the C standard
was being discussed, I was using one to develop programs in C.  I can
assure you that I would have welcomed trigraphs, if the compiler vendors
had really supported them, rather than just doing the minimum for
compliance, and otherwise pretending that they didn't exist.

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/01/28
[From Steve Clamage, talking about Unicode]:
> Maybe someone with a Unicode standard could jump in here and
> stem the tide of conjecture. :-)

I'd be happy to.  Excerpts are from:

    "The Unicode Standard, Version 2.0"
    The Unicode Consortium
    Addison-Wesley Developers Press
    ISBN 0-201-48345-9

> >|>  An implementation could use ISO646 encoding with 8-bit chars (signed
> >|>  or unsigned) and meet the C and C++ requirements. Or it could use
> >|>  Unicode with 16-bit chars and meet the requirements.
> >
> >Wouldn't the 16-bit char's have to be unsigned?  I believe that Unicode
> >contains characters larger than 0x7fff.
>
> I don't have a copy of the Unicode standard handy, but I think all
> members of the C/C++ basic source character set have encodings
> that are positive anyway. Additional characters are not required to
> have positive char values.

Correct.  Unicode is a 16-bit character encoding, covering all of the
codes from 0x0000 to 0xFFFF (represented by the notation U+0000 to
U+FFFF).  The first character block U+0000 to U+007F is identical to
ISO-646, a.k.a. 7-bit ASCII.  (Code U+0000, for example, is NUL.)  Codes
U+0080 to U+009F are unspecified, and U+00A0 to U+00FF are identical to
ISO-8859-1, a.k.a. 8-bit Latin-1 ASCII.  Codes U+0100 to U+0FFF encode
most remaining 8-bit character standards such as Hebrew, Arabic, Cyrillic,
etc.  U+3000 to U+D7FF encode Chinese, Japanese, and Korean characters.
U+FA00 to U+FFFD are used for other symbols and alternate character forms.
Unassigned codes are reserved for future character assignments.

Unicode characters are unsigned values.  Note that the transition from
U+7FFF to U+8000 is smack dab in the middle of the Chinese character range.
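The signedness point can be seen directly.  If 16-bit codes are held in
a *signed* 16-bit type, every code from U+8000 up, including much of the
Chinese range, compares as negative (the conversion below is
implementation-defined, but yields -32768 on the usual two's-complement
machines):

```c
/* The same code position, U+8000, in a signed vs. unsigned 16-bit type. */
short          signed_code   = (short)0x8000;  /* typically -32768 */
unsigned short unsigned_code = 0x8000u;        /* 32768: a valid code */
```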

> >Would a 16 bit implementation (with 16 bit int's) which declared wchar_t
> >as unsigned short, and EOF as -1, be legal?
>
> Maybe, mainly because I don't think 0xffff is a valid Unicode
> character (I'm assuming 2's complement integers). I think you would
> have technical problems with that implementation, and I would not
> want to have to create that implementation nor use it.

Yes.  By definition, U+FFFF is not a valid character; it can thus be
conveniently used for EOF.

There are a few other special Unicode codes.  U+FFFE is not a valid
character, but its byte-swapped twin U+FEFF is; the U+FEFF code is used
as a sentinel code that indicates that the Unicode text is stored in
little-endian form rather than the default (preferred) big-endian form.
Code U+0000 is identical to ASCII NUL, so it makes a convenient 'null
terminator' character for C/C++ wchar_t strings.
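The U+FEFF/U+FFFE byte-order logic described above might be sketched as
follows; the function and enumerator names are mine, not from the
standard:

```c
enum text_order { ORDER_BIG, ORDER_LITTLE, ORDER_NO_BOM };

/* Examine the first two bytes of a Unicode stream.  U+FEFF read in the
   expected (big-endian) order confirms that order; U+FFFE means the
   bytes are swapped, so every subsequent pair should be reversed.
   Anything else is taken as text with no byte-order mark. */
enum text_order detect_order(const unsigned char *buf)
{
    unsigned int first = ((unsigned int)buf[0] << 8) | buf[1];

    if (first == 0xFEFFu)
        return ORDER_BIG;       /* default (preferred) order */
    if (first == 0xFFFEu)
        return ORDER_LITTLE;    /* byte-swapped stream */
    return ORDER_NO_BOM;        /* assume big-endian by default */
}
```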

Hope this helps.

-- David R. Tribble, david.tribble@central.beasys.com --