Topic: Unicode vs. Char
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 21 Feb 2002 20:42:13 GMT
In article <3C72E1EC.90E8CFD1@acm.org>, Pete Becker <petebecker@acm.org> wrote:
>> Unicode is supposed to be more or less
>> compatible with ISO 10646, and the standard refers to this code set
>> elsewhere.
>
>Yes, as Hans pointed out, it is part of the description of universal
>character names. What threw me was the reference to "Unicode strings."
>There is no way to say 'the characters in this string are Unicode',
>although you can say 'this character is Unicode'.
One can write a wchar_t string literal (see sec. 2.13.4 in the standard)
by adding an "L", like in:
L"Hello World"
If there is now to be Unicode support, then one should probably have the
constructs
u"Hello World" // UTF-16
U"Hello World" // UTF-32
as well. (If the UTF-16 variation is needed at all.)
If say the C++ UTF-32 type is called "character", one would write code
something like
character* str = U"Hello World";
(A UTF-16 type, if needed, might be called say a "short character" then.)
One alternative, as Unicode is expected to be more frequent in the future,
might be to add a wholly new construct without the U, say `Hello World'
(or using proper Unicode left/right quotes for sources in Unicode).
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 26 Feb 2002 17:39:31 GMT
Pete Becker <petebecker@acm.org> wrote in message
news:<3C72E1EC.90E8CFD1@acm.org>...
> James Kanze wrote:
> > Unicode is supposed to be more or less compatible with ISO 10646,
> > and the standard refers to this code set elsewhere.
> Yes, as Hans pointed out, it is part of the description of universal
> character names. What threw me was the reference to "Unicode
> strings." There is no way to say 'the characters in this string are
> Unicode', although you can say 'this character is Unicode'.
You can't even say that. All you can say is that you want the code
(in whatever code set the implementation feels like giving you) that
corresponds to the Unicode character with the following code.
There are a number of reasons why the standard shouldn't require a
specific code set, even for wchar_t. On the other hand, Unicode/ISO
10646 are universal enough that one might like some sort of indication
concerning their support, something like is_iec559 in numeric_limits
for IEEE floating point.
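For comparison, a minimal sketch of the kind of query I mean, modelled on the
existing numeric_limits member (the is_iso10646 name mentioned in the comment
is purely hypothetical, not something any library provides):

#include <iostream>
#include <limits>

int main()
{
    // The existing facility for floating point:
    std::cout << "double is IEC 559 (IEEE 754): "
              << std::numeric_limits<double>::is_iec559 << '\n';

    // A hypothetical analogue for wide characters might read
    //     std::char_traits<wchar_t>::is_iso10646
    // so that portable code could test for ISO 10646 support at compile
    // time instead of guessing.
    return 0;
}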
--
James Kanze mailto:kanze@gabi-soft.de
Beratung in objektorientierter Datenverarbeitung --
-- Conseils en informatique orientée objet
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany, Tél.: +49 (0)69 19 86 27
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: "Robert Buck" <rbuck@mathworks.com>
Date: Tue, 26 Feb 2002 20:04:34 GMT
"Hans Aberg" <remove.haberg@matematik.su.se> wrote in message
news:remove.haberg-1502021236190001@du128-226.ppp.su-anst.tninet.se...
> In article <a4ggeo$8i5$1@news.mathworks.com>, "Robert Buck"
> <rbuck@mathworks.com> wrote:
[...]
> Wrongo: Many of the important math and technical characters are outside
>> the 0..2^16-1 range. In addition, a typesetting system like TeX may need
>> custom characters, also outside that range.
Hans, I'd be interested in understanding how you can defend this statement of
yours.
Grab your hard copy of the Unicode standard and turn to page 7-176; you do
have a copy, don't you? What do you see? Now turn to page 7-184. What do you
see there?
What I see is this. I see 241 mathematical symbols, and 122 other engineering
symbols that people frequently use. Coming from a multi-disciplinary
engineering and mathematics background, I do not see one that is missing. Now
mind you, I've been in the business world for many years now (as an engineer,
as well as one stint in sales), so I may have forgotten some of my discrete
math and how to solve sets of partial differential equations, but it is not
so distant a memory that I do not remember the basic points.
Which specific symbols that people frequently use, the ones you mention, are
not here? Can you cite any, and provide the specific code points, outside the
16-bit range that software _commonly_ uses? If you can clearly articulate this
point, and back it up with numbers stating what percentile of application
development would actually use such obscure code points, you may sway me. But
you are talking about the last percentile of the computing world; your
arguments make no sense.
[Hans again...]
> PC's are today sold with tens of GB, so that is just a percent of what
> fits on a computer. And memory has doubled every year recently.
>
> Memory isn't the problem, but whether compiler writers and users are
> burdened by it.
Unless one is developing a trivial application (akin to those developed by
university students), you _do_ need to worry about memory. In common
enterprise applications you deal with not just a few MB of data, but GB or TB.
I worked for an object database company for a number of years. They had
customers with multiple terabytes of data in their databases.
The amount of data you can load during an individual transaction is _not_
related to how much RAM you have, but rather to how much address space the
operating system you are running on provides you. This is a hard ceiling. For
instance, on Windows this is approximately 1.313 GB. UNIX machines are not
_too_ much better from an enterprise data perspective either.
Databases also, for instance, have overhead ranging from 3x to 50x depending
upon the characteristics of the data and the database's internal architecture.
Whenever you increase the size of character data code points, you increase the
overhead of data in databases. The smaller you make that representation, the
more data you can fit into your address space for one transaction. This means
you can batch more work into one transaction than would otherwise have been
possible. If you run out of address space, though, you get a hard error. These
are common problems that one needs to consider in the real world too, and not
just if you are a developer working on the internals of a complex system such
as an object database.
Now perhaps I am expecting too much for standardization committees to
compromise on a representation that is most beneficial to the developer
community out there. Or perhaps I naively expect paper-dolls (PhD students and
professors with no practical experience in the real world) to understand these
issues. But I do hope this. I do hope that whatever the standardization
committee does, it does with the specific intent to best serve the general
computing world. In any case, the decision will always be a compromise, and
some major software companies will always have to do things differently than
the way the committee decides is the norm.
Basic point: do what makes sense for most people, not what conforms to some
esoteric "ideal".
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 27 Feb 2002 13:15:29 GMT
In article <a5gnac$i5h$1@news.mathworks.com>, "Robert Buck"
<rbuck@mathworks.com> wrote:
>> Wrongo: Many of the important math and technical characters are outside
>> the 0..2^16-1 range. In addition, a typesetting system like TeX may need
>> custom characters, also outside that range.
>
>Hans, I'd be interested in understanding how you can defend this statement
>of yours.
I happened to work a bit on it. :-)
>Grab your hard-copy of the Unicode standard and turn to page 7-176 of the
>Unicode
>standard; you do have a copy don't you? What do you see? Turn to page 7-184.
>What do you see there?
I grabbed <http://www.unicode.org/Public/BETA/Unicode3.2/NamesList-3.2.0d4.txt>.
The Unicode 3.2 Beta is not a hard copy, nor is it paginated; though perhaps
it would be if you printed it out.
>What I see is this. I see 241 mathmetical symbols, and 122 other engineering
>symbols that people frequently use. Coming from a multi-disciplinary
>engineering and
>mathematics background, I do not see one that is missing.
Well, they have added some thousand more math symbols. This stuff is for the
use of the professionals, not for MS 16-bit.
> Now mind you, I've
>been in
>the business world for many years now (as an engineer, as well as one stint
>in sales),
>so I may have forgotten some of my discrete math, how to solve sets of
>partial
>differential equations, but it is not so distant a memory that I do not
>remember the basic points.
>
>To which specific symbols that people frequently use, that you mention, that
>are not here?
Some other links are
http://www.unicode.org/unicode/reports/tr25/ as it stands
http://www.w3.org/TR/MathML2/
where you can trace the online version for links to the current Unicode
character sets.
This stuff is not for use with elementary math, but takes the first small
steps towards what professionals might use.
Of course, one does not expect that this small beginning should suffice for
the future; it will have to be updated in various ways. The Unicode symbols
are a part of that more general picture.
>Can you cite any, as well as provide the specific code-points, that is
>outside the 16-bit range
>that software _commonly_ uses?
For example (citing the .txt mentioned above), the block starting at 1D400
contains the mathematical semantic styles.
> If you can clearly articulate this point, as
>well as back
>it up with numbers stating what percentile of application development would
>actually use
>such obscure code-points, you may sway me. But you are talking about the
>last percentile
>of the computing world, your arguments make no sense.
Of professional math, about 100%. (In the future, when it becomes available:
one expects, for example, a version of TeX to emerge that builds on these
Unicode symbols.)
>> PC's are today sold with tens of GB, so that is just a percent of what
>> fits on a computer. And memory has doubled every year recently.
>>
>> Memory isn't the problem, but whether compiler writers and users are
>> burdened by it.
>
>Unless one is developing a trivial application (akin to those developed by
>university students),
>you _do_ need to worry about memory. In common enterprise applications you
>deal with
>not just a few MB of data, but GB or TB.
You are speaking about the amount of memory for the output program, not the
amount taken up by the compiler's libraries. Clearly, the stuff that is not
needed is to be stripped out in the compilation process.
>The amount of data you can load during an individual transaction is _not_
>related to how
>much RAM you have, but rather how much address space the operating system
>you are
>operating on provides you. This is a hard ceiling. For instance, on Windows
>this is approx
>1.313 GB.
Of course, on some older types of OS's, the amount of runtime memory can
be limited.
But I figure that such old OS's must change or go out of business.
Also note that, as computers become more powerful, one can put more into
DLLs, only loading into the program what is necessary at every point in time.
This paradoxically makes programs even smaller (or rather, one can put even
more functionality into one's programs, which probably will make them larger).
>Databases also, for instance, have overhead ranging from 3x to 50x depending
>upon the characteristic of
>the data and database internals architecture. Whenever you increase the size
>of character data
>code-points, this increases the overhead of data in databases. The smaller
>you make
>that representation, the more data you can fit into your address space for
>one
>transaction.
I recall I mentioned that a program for general fixed-width character use
would end up with UTF-21 (or whatever) + alignment, while carefully
mentioning that special compression techniques should be used for special
types of applications.
The more one can foresee about what characters are to be used, the more
special compression techniques can be used, of course.
>Now perhaps I am expecting too much for standardization committees to
>compromise
>on a representation that is most beneficial to the developer community out
>there. Or
>perhaps I naively expect that paper-dolls (PhD students and professors with no
>practical experience in the real world) to understand these issues.
I think they have already understood it. :-)
> But I do
>hope this.
>I do hope that whatever the standardization committee does, that it does
>with specific
>intent to best serve the general computing world.
That is why one, for general use, ends up with UTF-21 (or whatever) + alignment.
> In any case, the decision
>will always
>be a compromise and major some software companies will always have to do
>things differently than the way the committee decides is the norm.
Right. Those specialty applications can be supported by the C++ standard
if they fall into a more general picture that can be specialized.
>Basic point: do what makes sense for most people, not what conforms to some
>esoteric "ideal".
The right way to reach out to many people is to find some general structures
that can be specialized: the general structures provide compact structures
whose correctness is easy for the experts to ensure. Those that do not need
the generalities can simply make use of the specializations only.
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 27 Feb 2002 16:57:37 GMT
In article <d6651fb6.0202220439.6f169c68@posting.google.com>,
kanze@gabi-soft.de (James Kanze) wrote:
>There are a number of reasons why the standard shouldn't require a
>specific code set, even for wchar_t. On the other hand, Unicode/ISO
>10646 are universal enough that one might like some sort of indication
>concerning their support, something like is_iec559 in numeric_limits
>for IEEE floating pointer.
I think you need (at least one) new character type for Unicode, given that
wchar_t may have specialty uses.
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 14 Feb 2002 18:07:24 GMT
In article <a4ecpf$oml@dispatch.concentric.net>, "Ken Shaw"
<ken@_NO_SPAM_compinnovations.com> wrote:
>One very good reason for using UTF-8 is that legacy software will work if
>fed UTF-8 while most other encodings (any that allow a byte in the stream to
>have a value of 0) will cause those systems to fail.
Note that the C++ standard does not require "char" to have 8 bits.
But what will break with just using UTF-8 is operator++ and such, which
will not step to the next character, but to the next (C/C++) byte.
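A minimal sketch of the difference (assuming 8-bit chars holding well-formed
UTF-8; the decode step is simplified and does no validation):

#include <cstddef>

// ++p on a char* steps one byte, which may land inside a multi-byte sequence.
// Stepping one *character* means inspecting the lead byte for the sequence
// length (simplified, no error checking):
std::size_t utf8_length(unsigned char lead)
{
    if (lead < 0x80)        return 1;  // ASCII
    if ((lead >> 5) == 0x6) return 2;  // 110xxxxx
    if ((lead >> 4) == 0xE) return 3;  // 1110xxxx
    return 4;                          // 11110xxx
}

const char* next_utf8_char(const char* p)
{
    return p + utf8_length((unsigned char)*p);
}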
>I would be all in favor of requiring wchar_t to be 32 bits and requiring
>that the standard library include the appropriate UTF-8 to UTF-32 codecvt
>specializations. This would satisfy quite a lot of i18n needs without adding
>all the complexities of the ICU to the standard library.
And it might be good not to impose such a requirement on wchar_t, as it may
already have other uses in compilers. (Some compilers already set it to 16
bits.)
Therefore, I think one should have a new type, perhaps named "unichar" or
"character" ("uchar" is often used as short for "unsigned char") which is
guaranteed to contain 32 bits internally.
For IO, one can add support for other encodings, like UTF-16, and UTF-8.
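A minimal sketch of what such a type could look like today, with no language
support at all (unsigned long is the simplest portable choice, since the
standard guarantees it at least 32 bits; the names unichar and unistring are
of course just the suggestion above):

#include <vector>

typedef unsigned long unichar;          // at least 32 bits everywhere
typedef std::vector<unichar> unistring; // plain sequence of code points

unistring make_sample()
{
    unistring s;
    s.push_back(0x48);       // 'H'
    s.push_back(0x69);       // 'i'
    s.push_back(0x1D400UL);  // a character outside the 16-bit range
    return s;
}

A real proposal would presumably also provide char_traits<unichar>, so that
std::basic_string<unichar> and the iostreams could work with it.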
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 14 Feb 2002 18:07:45 GMT
In article <3C6B0579.6FB164C6@acm.org>, Pete Becker <petebecker@acm.org> wrote:
>> C++ does have some "Unicode" strings, but it turns out that they do not
>> guarantee to produce any Unicode characters.
>The word "unicode" is used exactly once in the C++ standard, in
>22.2.1.5/1:
The segment I have in mind is verse 2.2:2:
2 The universal-character-name construct provides a way to name other
characters.
hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN is
that character whose character short name in ISO/IEC 10646 is NNNNNNNN;
the character designated by the universal-character-name \uNNNN is that
character whose character short name in ISO/IEC 10646 is 0000NNNN. If the
hexadecimal value for a universal character name is less than 0x20 or in
the range 0x7F-0x9F (inclusive), or if the universal character name
designates a character in the basic source character set, then the program
is ill-formed.
One can use this with strings, too (I did not find the exact quote).
Those that tried to use it did not find it useful for resolving the Unicode
problem they had, as the compilers are not required to produce anything
useful: They may, but most would not.
Or so I recall.
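For concreteness, a minimal example of what is and is not guaranteed here
(the comment spells out the gap):

// Universal-character-names are accepted in character and string literals:
const wchar_t* s = L"caf\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// The standard only says which ISO 10646 character is being *named*; the
// value that actually ends up in the wchar_t array is whatever the
// implementation's execution wide-character set uses, so s[3] need not be
// 0x00E9 on a given compiler.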
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: "Robert Buck" <rbuck@mathworks.com>
Date: Thu, 14 Feb 2002 18:07:54 GMT
Hi everyone,
To clarify, I would say my point is this:
When it comes to building real software, you will need to use some
higher-level concepts that inevitably are not in some "standard". Just because
something isn't in there doesn't mean it ought to be. [Let me speak more on
this later.] The crux of the matter is that the encoding you choose _is_
application specific, and it becomes very difficult to argue which encoding
ought to be in the standard.
Given that we are living in an increasingly global economy, software
languages need some manner of keeping pace with support for character data
encodings. Unicode [UTF-16 minus surrogates] seems the best bet for
general-purpose applications. Here you can stick with fixed-width 16-bit code
points. The only thing you give up with a fixed-width form of UTF-16 (minus
surrogates), over UTF-32, is support for archaic forms of Chinese and ancient
scripts. No worries for 99-44/100% of applications out there, I suspect. The
benefit is that you have a more compact form that yields less memory or disk
space overhead, and hence greater performance, for most applications. Remember
that with UTF-32 you are only using 21 bits of the fixed width of 32 bits
(1/3 of the storage remains unused).
Back to an earlier point, that of bringing higher-level concepts into the
standard. I leave, as a frame of reference, an illustration of why it is bad
to bring too much into a standard: Java. It's now over 100 MB. I like Java,
but dang, 100 MB just for a "Standard" package now. The issue of features,
which we are speaking of, is often orthogonal to a true language
specification. IMHO, keep the two distinct and separate. At first I had
thought to say in this letter "let's develop a C++ community portal and
standard higher-level library apart from ISO", but to go back to my original
point that "there is a time and place for everything under the sun"... I
would wonder if it would truly be successful. Would people just find that the
libraries were not specific enough for their application? An open question.
For those listening to this thread and wondering how to support Unicode in
their real application, look at ICU at IBM's web site.
Bob
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Fri, 15 Feb 2002 17:25:50 GMT
In article <a4ggeo$8i5$1@news.mathworks.com>, "Robert Buck"
<rbuck@mathworks.com> wrote:
>Given that we are living in an increasingly global economy, software languages
>need some manner of keeping pace with support for character data encodings.
>Unicode [UTF-16 minus surrogates] seems the best bet for general purpose
>applications. Here you can stick with fixed-width 16-bit code-points. The
>only thing you give up with a fixed-width form of UTF-16 (minus surrogates),
>over UTF-32, is support for archaic forms of Chinese and ancient scripts.
Wrongo: Many of the important math and technical characters are outside
the 0..2^16-1 range. In addition, a typesetting system like TeX may need
custom characters, also outside that range.
Thus, if one wants fixed-width characters, one ends up with UTF-32
(or UTF-24 + alignment).
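To put actual numbers on it, here is a minimal sketch (plain C++, nothing
assumed beyond <cstdio>) computing the UTF-16 surrogate pair for one such
character, U+1D400 MATHEMATICAL BOLD CAPITAL A; a fixed-width 16-bit unit
simply cannot hold this code point on its own:

#include <cstdio>

int main()
{
    unsigned long cp = 0x1D400UL;            // code point above the 16-bit range
    unsigned long v  = cp - 0x10000UL;       // 20-bit value to be split
    unsigned int hi = 0xD800u + (unsigned int)(v >> 10);    // high surrogate
    unsigned int lo = 0xDC00u + (unsigned int)(v & 0x3FFu); // low surrogate
    std::printf("U+%05lX -> UTF-16 %04X %04X\n", cp, hi, lo); // D835 DC00
    return 0;
}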
Earlier discussions in this group suggested it is only Microsoft compilers
that still believe UTF-16 is a good idea. Others, like Linux (or so I recall),
do it correctly, settling for UTF-32.
>Back to an earlier point, that of bringing in higher level concepts into the
>standard. I leave as a frame of reference, an illustration of why it is bad
>to bring in too much to a standard, this being Java. It's now over 100 MB. I
>like Java, but dang, 100 MB just for a "Standard" package now.
PC's are today sold with tens of GB, so that is just a percent of what
fits on a computer. And memory has doubled every year recently.
Memory isn't the problem, but whether compiler writers and users are
burdened by it.
I think that these new features will not be required, but one will have
some macro or such telling whether the compiler supports them. So it should
not cause any such burdens.
> The issue of
>features, which we are speaking of, is often orthogonal to a true language
>specification. IMHO, keep the two distinct and separate. At first I had
>thought to say in this letter "lets develop a C++ Community portal &
>standard higher-level library apart from ISO", but to go back to my original
>point that "there is a time and place for everything under the sun"...
I think this stuff is developed on many levels: I recall that the first
article in the thread "C++0x" last year said that one should avoid any
major additions to the C++ language. So most stuff will be put into new
C++ libraries. But one will only put in stuff that is "universal" in some
sense, so that it encourages portability. Then there are other, wholly
independent C++ library efforts, like boost.
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 19 Feb 2002 23:08:19 GMT
remove.haberg@matematik.su.se (Hans Aberg) wrote in message
news:<remove.haberg-1402021223330001@du130-226.ppp.su-anst.tninet.se>...
> In article <3C6B0579.6FB164C6@acm.org>, Pete Becker
> <petebecker@acm.org> wrote:
> >> C++ does have some "Unicode" strings, but it turns out that they
> >> do not guarantee to produce any Unicode characters.
> >The word "unicode" is used exactly once in the C++ standard, in
> >22.2.1.5/1:
You're nit-picking, Pete. Unicode is supposed to be more or less
compatible with ISO 10646, and the standard refers to this code set
elsewhere.
> The segment I have in my mind is verse 2.2:2:
> 2 The universal-character-name construct provides a way to name other
> characters.
> hex-quad:
> hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
> universal-character-name:
> \u hex-quad
> \U hex-quad hex-quad
> The character designated by the universal-character-name \UNNNNNNNN
> is that character whose character short name in ISO/IEC 10646 is
> NNNNNNNN; the character designated by the universal-character-name
> \uNNNN is that character whose character short name in ISO/IEC 10646
> is 0000NNNN. If the hexadecimal value for a universal character name
> is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the
> uni-versal character name designates a character in the basic source
> character set, then the program is ill-formed.
> One can use this with strings, too (I did not find the exact quote).
> Those that tried to use it, did not find it find useful for
> resolving the Unicode problem they had, as the compilers are not
> required to produce anything useful: They may, but most would not.
There are two separate problems here. First, I'm not sure that all
compilers today support universal character names. Secondly, even for
those that do, how they map universal character names to the runtime
character set is implementation defined.
--
James Kanze mailto:kanze@gabi-soft.de
Beratung in objektorientierter Datenverarbeitung --
-- Conseils en informatique orientée objet
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany, Tél.: +49 (0)69 19 86 27
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: Pete Becker <petebecker@acm.org>
Date: Wed, 20 Feb 2002 13:46:36 GMT
James Kanze wrote:
>
> remove.haberg@matematik.su.se (Hans Aberg) wrote in message
> news:<remove.haberg-1402021223330001@du130-226.ppp.su-anst.tninet.se>...
>
> > In article <3C6B0579.6FB164C6@acm.org>, Pete Becker
> > <petebecker@acm.org> wrote:
> > >> C++ does have some "Unicode" strings, but it turns out that they
> > >> do not guarantee to produce any Unicode characters.
>
> > >The word "unicode" is used exactly once in the C++ standard, in
> > >22.2.1.5/1:
>
> You're nit-picking, Pete.
No, just careless.
> Unicode is supposed to be more or less
> compatible with ISO 10646, and the standard refers to this code set
> elsewhere.
Yes, as Hans pointed out, it is part of the description of universal
character names. What threw me was the reference to "Unicode strings."
There is no way to say 'the characters in this string are Unicode',
although you can say 'this character is Unicode'.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 20 Feb 2002 13:42:26 GMT
In article <d6651fb6.0202180844.eb4a7c7@posting.google.com>,
kanze@gabi-soft.de (James Kanze) wrote:
>> The segment I have in my mind is verse 2.2:2:
...
>> Those that tried to use it, did not find it find useful for
>> resolving the Unicode problem they had, as the compilers are not
>> required to produce anything useful: They may, but most would not.
>There are two separate problems here. First, I'm not sure that all
>compilers today support universal character names. Secondly, even for
>those that do, how they map universal character names to the runtime
>character set is implementation defined.
Right. One wants a way to test that the compiler supports a useful
character map (such as Unicode).
Say, \u might refer to UTF-16 and/or \U to UTF-32 if the test is true.
Then those that write multi-compiler code can at least test whether a
particular compiler is useful.
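The closest existing handle I know of is the C99 macro __STDC_ISO_10646__,
which, if defined, says that wchar_t values are the ISO 10646 (Unicode) code
points; it is not required by C++98, but some compilers predefine it anyway.
A sketch of the kind of test meant (the typedef name is made up):

#if defined(__STDC_ISO_10646__)
    typedef wchar_t portable_unichar;       // wchar_t already holds code points
#else
    typedef unsigned long portable_unichar; // fall back to a 32-bit-capable type
#endif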
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 13 Feb 2002 17:41:32 GMT
In article <a4be43$pga$1@news.mathworks.com>, "Robert Buck"
<rbuck@mathworks.com> wrote:
>> Or one may view Unicode as a part of the distributed programming package
>> (see first article in thread "C++0x" last year): It turns out that the
>> current C++ Unicode support isn't portable, so that programmers that need
>> that feature end up writing out the names of the characters and their
>> Unicode values. This is extremely tedious, especially now when Unicode may
>> have hundreds of thousands of characters.
>
>Huh? Why would you think engineers would have to jump through hoops.
>Any engineer worth their weight will use a commercial library, such as ICU,
>that manages LE / BE issues for you, performs transcodings,
>transliterations,
>and word/sentence boundary analysis. No worries.
There was some guy in this group writing a multicompiler WWW browser
and/or server, who said he ended up explicitly writing out the names of
Unicode characters as identifiers, and their encoding numbers.
C++ does have some "Unicode" strings, but it turns out that they do not
guarantee to produce any Unicode characters.
Of course, if somebody writes a multiplatform library, working through all
the compilers in existence, that is one way around it. Another would be to
put it into the standard.
>> Otherwise, everybody expects Unicode is created by a consortium trying to
>> make all other character encodings unnecessary. There is room for
>> introducing user characters on top of the Unicode range (between 2^21 and
>> 2^24-1 I think). So if you have Unicode, for what reason would
>> you use another character encoding when communicating with people?
>> (Disregarding the fact that it might be useful with special compacted
>> encodings for special purposes -- but that would not be used in order to
>> guarantee open communications.)
>
>If you stuck to UTF-32 or UTF-16 _only_, how might you write a real world
>business application that aggregates data from disparate business systems
>over sockets? That is where encodings such as UTF-8 come into place,
>and where commercial Unicode libraries that have transcoders come into
>place.
I think the quote got out of context: UTF-32 will probably be the only one
to adhere to _internally_ in newly written programs that handle many
potentially different characters. Then hook in any codecvt for use with other
external encodings.
>Any time you deal with wire-protocols you are stuck with dealing with
>bytes, not 16-bit or even 32-bit units of information.
If there is no such protocol for UTF-32, perhaps there should be.
> Choose what makes
>sense for the application. Suppose you are getting data from VSAM/ISAM
>files,
>and that data is returned in EBCDIC format, what do you do? Tell the
>customer to upgrade to Unicode?
If the new program uses UTF-32 internally, hook onto a suitable
std::codecvt, I think.
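A minimal sketch of the hook-up meant here (EbcdicToWide is a hypothetical
user-written facet; the standard library does not ship one, and its do_in /
do_out overrides are omitted):

#include <fstream>
#include <locale>

// Hypothetical facet converting external EBCDIC bytes to the program's
// internal wide characters (a 32-bit wchar_t, say).
class EbcdicToWide : public std::codecvt<wchar_t, char, std::mbstate_t>
{
    // do_in, do_out, do_length, ... would be overridden here.
};

void read_legacy_file()
{
    std::locale withFacet(std::locale(), new EbcdicToWide); // locale owns facet
    std::wifstream in;
    in.imbue(withFacet);      // install before opening the file
    in.open("legacy.dat");
    // ... read wide characters from 'in'; conversion happens transparently ...
}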
>I worked for an object-database company on a pure XML database. As
>with many databases on the market, character data was actually stored in
>UTF-8. All operations also were performed on UTF-8.
When storing information in a file or a database, or transporting it over
the Internet, one may use some kind of compression scheme. UTF-8 might be
viewed as such a compression scheme relative to UTF-32.
But I do not see why one should use UTF-8 as the one and only compression
scheme: Perhaps other schemes will work much more efficiently.
> Only when a
>business system requested results in other encodings, perhaps from a
>personalization object manager, did it invoke the transcoders and
>provide the content in Shift-JIS or some other format. The database was
>_extremely_ fast, and is one example that helps dispense the myth that
>UTF-8,
>or other variable-width encodings, are slow.
The question is this: If UTF-8 is used in the C++ standard as the definition
of variable-width characters, then that causes a lot of problems, because
sometimes one will have to refer to the width 8, and sometimes to the
actual character width.
Therefore, I think that the standard will have to settle for only
single-width characters, and then it is going to be UTF-32 in practice as far
as Unicode is concerned, as that is the typical nearest alignment on
today's computers.
I do not myself care if there are some enthusiasts defining a standard for
using variable-width characters internally, but I would myself avoid that
approach.
I think that one must give some thought to the psychology that makes
people want to use variable-width characters internally, namely that a "char"
with 8 bits has in the past (like in the '80s) been a fairly large unit.
However, recently, memory has doubled every year, meaning that a switch
from "char" to UTF-32 will eat up about two years of memory development.
As a couple more years go by, using UTF-32 internally instead of an 8-bit
"char" in programs will probably have little significance, but the
_feeling_ that it is a waste remains.
> >> One model that comes to my mind in order to resolve this problem is to
> >> number all bits in the computer (instead of words), and specify the
> >> representation with respect to that. In this model, the high/low endian
> >> representations are two different encodings in this binary model.
> >
> >Eek. As convenient as this might be, it would also be terribly inefficient
> >as the basis for memory management.
>
>> Only if one tries to access those bits, but that is the same case as now:
>> Such a model would of course include an entity "machine word", and
>> computers would generally use that. One just refers to the bit model when
>> necessary in order to resolve the binary structure. It does not mean that
>> one has to convert to the bit model, if that is not necessary.
>
>Isn't this a prime example of where classes ought to be used? If one is
>reading bytes off the wire, or from a file, one may presume that you
>know, or have been told, what endianness and encoding the data is in.
>Given this, one could safely deduce that a class library could be written
>that properly decodes the data and performs necessary byte swapping
>if necessary. This is precisely what ICU does.
I think of it just as a logical model to pin down the binary representation
once and for all: If one knows what the representation is in one instance
relative to this model, it is logically possible to translate it to another
representation that is also specified relative to the model.
Then, in a computer, one would try to avoid attaching the information
explicitly to every piece of data and checking it individually, as that takes
up time and space. So one would want to have chunks of data where the
representation is known.
One such chunk is of course the program itself, which would act against
the model implemented by the compiler relative to the OS and the CPU
conventions. But when that program encounters a foreign piece of data, the
pieces needed for a conversion must be present somehow.
An example: It is as if we settled for UTF-32 as the logical binary
encoding (disregarding the high/low endian stuff for the course of this
example). Then if UTF-8 and UTF-16 are defined relative to UTF-32, it is
possible (as an optimization) to make direct translations between UTF-8
and UTF-16 despite the fact that the logical model passes through UTF-32. But
having translations back and forth to the UTF-32 model would ensure that
the translation between UTF-8 and UTF-16 is possible even in the case where
one has not bothered implementing the direct translation between them: The
translation model ensures that it is always possible, albeit slow.
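A minimal sketch of the composition meant here (the encoder and decoder are
greatly simplified, with no validation and no error handling; the point is
only that the UTF-32 pivot makes a direct UTF-8/UTF-16 routine unnecessary):

#include <vector>

typedef unsigned long  u32;   // holds a UTF-32 code point
typedef unsigned short u16;   // holds a UTF-16 code unit

// Very simplified UTF-8 decode of one code point; advances p past it.
u32 utf8_decode_one(const unsigned char*& p)
{
    unsigned char b = *p++;
    if (b < 0x80) return b;
    int n = ((b >> 5) == 0x6) ? 1 : ((b >> 4) == 0xE) ? 2 : 3; // continuation bytes
    u32 cp = b & (0x3F >> n);
    while (n--) cp = (cp << 6) | (*p++ & 0x3F);
    return cp;
}

// UTF-32 -> UTF-16: one code point becomes one or two code units.
void utf32_to_utf16(u32 cp, std::vector<u16>& out)
{
    if (cp < 0x10000UL) {
        out.push_back((u16)cp);
    } else {
        u32 v = cp - 0x10000UL;
        out.push_back((u16)(0xD800u + (v >> 10)));
        out.push_back((u16)(0xDC00u + (v & 0x3FFu)));
    }
}

// UTF-8 -> UTF-16 by way of the UTF-32 pivot; no direct table needed.
void utf8_to_utf16(const unsigned char* p, const unsigned char* end,
                   std::vector<u16>& out)
{
    while (p < end)
        utf32_to_utf16(utf8_decode_one(p), out);
}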
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: "Ken Shaw" <ken@_NO_SPAM_compinnovations.com>
Date: Wed, 13 Feb 2002 19:41:53 GMT
"Hans Aberg" <remove.haberg@matematik.su.se> wrote in message
news:remove.haberg-1302021239280001@du137-226.ppp.su-anst.tninet.se...
> In article <a4be43$pga$1@news.mathworks.com>, "Robert Buck"
> <rbuck@mathworks.com> wrote:
> >> Or one may view Unicode as a part of the distributed programming package
> >> (see first article in thread "C++0x" last year): It turns out that the
> >> current C++ Unicode support isn't portable, so that programmers that need
> >> that feature end up writing out the names of the characters and their
> >> Unicode values. This is extremely tedious, especially now when Unicode may
> >> have hundreds of thousands of characters.
> >
> >Huh? Why would you think engineers would have to jump through hoops.
> >Any engineer worth their weight will use a commercial library, such as ICU,
> >that manages LE / BE issues for you, performs transcodings, transliterations,
> >and word/sentence boundary analysis. No worries.
>
> There was some guy in this group writing a multicompiler WWW browser
> and/or server, who said he ended up explicitly writing out the names of
> Unicode characters as identifiers, and their encoding numbers.
>
> C++ does have some "Unicode" strings, but it turns out that they do not
> guarantee to produce any Unicode characters.
>
> Of course, if somebody writes a multiplatform library, working through all
> the compilers in existence, that is one way around it. Another would be to
> put it into the standard.
>
> >> Otherwise, everybody expects Unicode is created by a consortium trying to
> >> make all other character encodings unnecessary. There is room for
> >> introducing user characters on top of the Unicode range (between 2^21 and
> >> 2^24-1 I think). So if you have Unicode, for what reason would
> >> you use another character encoding when communicating with people?
> >> (Disregarding the fact that it might be useful with special compacted
> >> encodings for special purposes -- but that would not be used in order to
> >> guarantee open communications.)
> >
> >If you stuck to UTF-32 or UTF-16 _only_, how might you write a real world
> >business application that aggregates data from disparate business systems
> >over sockets? That is where encodings such as UTF-8 come into place,
> >and where commercial Unicode libraries that have transcoders come into
> >place.
>
> I think the quote got out of context: UTF-32 will probably be the only one
> to adhere to _internally_ in newly written programs that handle many
> potential different characters. Then hook any codecvt for use with other
> external encodings.
>
> >Any time you deal with wire-protocols you are stuck with dealing with
> >bytes, not 16-bit or even 32-bit units of information.
>
> If there is no such protocol for UTF-32, perhaps there should be.
>
> > Choose what makes
> >sense for the application. Suppose you are getting data from VSAM/ISAM
> >files,
> >and that data is returned in EBCDIC format, what do you do? Tell the
> >customer to upgrade to Unicode?
>
> If the new program uses UTF-32 internally, hook onto a suitable
> std::codecvt, I think.
>
> >I worked for an object-database company on a pure XML database. As
> >with many databases on the market, character data was actually stored in
> >UTF-8. All operations also were performed on UTF-8.
>
> When storing information like on a file or in a data base, or transporting
> over the Internet, one may use some kind of compressions scheme. UTF-8
> might be viewed as such a compressions scheme relative UTF-32.
>
> But I do not see why one should use UTF-8 as the one and only compression
> scheme: Perhaps other schemes will work much more efficiently.
>
One very good reason for using UTF-8 is that legacy software will work if
fed UTF-8 while most other encodings (any that allow a byte in the stream to
have a value of 0) will cause those systems to fail.
I would be all in favor of requiring wchar_t to be 32 bits and requiring
that the standard library include the appropriate UTF-8 to UTF-32 codecvt
specializations. This would satisfy quite a lot of i18n needs without adding
all the complexities of the ICU to the standard library.
Ken Shaw
--
The tree of liberty must be refreshed from time to time with the blood of
patriots and tyrants. It is its natural manure.
Thomas Jefferson, 1787
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: Pete Becker <petebecker@acm.org>
Date: Thu, 14 Feb 2002 02:09:57 GMT
Hans Aberg wrote:
>
> C++ does have some "Unicode" strings, but it turns out that they do not
> guarantee to produce any Unicode characters.
>
The word "unicode" is used exactly once in the C++ standard, in
22.2.1.5/1:
The class codecvt<internT,externT,stateT> is for use when
converting from one codeset to another, such as from wide
characters to multibyte characters, between wide character
encodings such as Unicode and EUC.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: "Robert Buck" <rbuck@mathworks.com>
Date: Wed, 13 Feb 2002 01:25:34 GMT
> Or one may view Unicode as a part of the distributed programming package
> (see first article in thread "C++0x" last year): It turns out that the
> current C++ Unicode support isn't portable, so that programmers that need
> that feature end up writing out the names of the characters and their
> Unicode values. This is extremely tedious, especially now when Unicode may
> have hundreds of thousands of characters.
Huh? Why would you think engineers would have to jump through hoops?
Any engineer worth their weight will use a commercial library, such as ICU,
that manages LE / BE issues for you, performs transcodings, transliterations,
and word/sentence boundary analysis. No worries.
> >Consider what happens if the Standard requires Unicode and then it turns out
> >that the Unicode design is fundamentally flawed. Or what if another
> >international character encoding comes into favor, and C++ is tied to
> >Unicode?
That's a lot of "what ifs". Especially since Unicode has broad acceptance in
the business world and technical world, a change is doubtful at best.
> Otherwise, everybody expects Unicode is created by a consortium trying to
> make all other character encodings unnecessary. There is room for
> introducing user characters on top of the Unicode range (between 2^21 and
> 2^24-1 I think). So if you have Unicode, for what reason would
> you use another character encoding when communicating with people?
> (Disregarding the fact that it might be useful with special compacted
> encodings for special purposes -- but that would not be used in order to
> guarantee open communications.)
If you stuck to UTF-32 or UTF-16 _only_, how might you write a real world
business application that aggregates data from disparate business systems
over sockets? That is where encodings such as UTF-8 come into place,
and where commercial Unicode libraries that have transcoders come into
place.
Any time you deal with wire-protocols you are stuck with dealing with
bytes, not 16-bit or even 32-bit units of information. Choose what makes
sense for the application. Suppose you are getting data from VSAM/ISAM
files,
and that data is returned in EBCDIC format, what do you do? Tell the
customer to upgrade to Unicode?
I worked for an object-database company on a pure XML database. As
with many databases on the market, character data was actually stored in
UTF-8. All operations also were performed on UTF-8. Only when a
business system requested results in other encodings, perhaps from a
personalization object manager, did it invoke the transcoders and
provide the content in Shift-JIS or some other format. The database was
_extremely_ fast, and is one example that helps dispel the myth that UTF-8,
or other variable-width encodings, are slow.
My point: there is a time and place for everything under the sun.
> >> One model that comes to my mind in order to resolve this problem is to
> >> number all bits in the computer (instead of words), and specify the
> >> representation with respect to that. In this model, the high/low endian
> >> representations are two different encodings in this binary model.
> >
> >Eek. As convenient as this might be, it would also be terribly inefficient
> >as the basis for memory management.
>
> Only if one tries to access those bits, but that is the same case as now:
> Such a model would of course include an entity "machine word", and
> computers would generally use that. One just refers to the bit model when
> necessary in order to resolve the binary structure. It does not mean that
> one has to convert to the bit model, if that is not necessary.
Isn't this a prime example of where classes ought to be used? If one is
reading bytes off the wire, or from a file, one may presume that one knows,
or has been told, what endianness and encoding the data is in.
Given this, one could safely deduce that a class library could be written
that properly decodes the data and performs byte swapping
if necessary. This is precisely what ICU does.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: 1 Feb 2002 22:20:49 GMT
[Please note that follow-ups are set to comp.std.c++.]
In article <OEZ48.66846$h31.3769928@e420r-atl1.usenetserver.com>, "Early
Ehlinger" <spamsink@spamblaster.org> wrote:
>A recent thread that started out discussing why basic_ofstream has no
>constructor taking a wchar_t* quickly changed into a discussion of
>whether wchar_t was defined by the Standard to be a Unicode character.
>From there, the discussion went to whether the Standard should make
>such a definition...
>Proposal:
...
>In C++0x and C0x as well, deprecate char, char* and any variant
>thereof. (!!!) In its place, we would add two new types, _character_t
>and _byte_t. All functions returning or accepting strings or
>characters would be deprecated as well, and replaced with versions
>returning/accepting _character_t.
These kinds of questions have been discussed in comp.std.c++. In brief (and
perhaps follow-ups will say that I am wrong):
The C/C++ standards are written so that one can never ensure, in a
compiler-independent manner, what the underlying binary (bit structure)
is. Clearly, if C++0x (see first article with this subject title in
comp.std.c++ last year) is now supposed to support distributed programming,
that attitude will not hold, as one then exchanges binary objects between
programs that may have been compiled with different compilers.
As for the C/C++ byte, the allocation "atom", it has in the past (in
the context of C) sometimes had, say, 9 bits on some platforms; my
compiler indicates a platform where it is 16 bits.
It is not possible to deprecate "char", because almost all code would break.
But one may think of introducing new types that do rely on a special type
of binary representation:
A byte type with exactly 8 bits, and say a type "unichar", encoding
probably UTF-32, with additional macros telling if the compiler can
support them. (Some say that the presence of these types may create
overheads they do not want to have.)
One should note that the Unicode character numbers do not rely on any
binary computer representation. So it might be possible to merely specify
that a unichar should be able to hold all Unicode characters, with UTF-32
and other translations being present. I think though that it is going to
be very complicated to use variable width characters internally in a
program (and slow, due to alignment cutoffs that the CPU will have to
perform), so that suggests one should use UTF-32 and nothing else.
There is also the high/low "endian" issue when dealing with Unicode
representations. This probably belongs to the distributed programming
chapter.
One model that comes to my mind in order to resolve this problem is to
number all bits in the computer (instead of words), and specify the
representation with respect to that. In this model, the high/low endian
representations are two different encodings in this binary model.
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
[ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
[ Note that the FAQ URL has changed! Please update your bookmarks. ]
Author: "Early Ehlinger" <earlye@yahoo.com>
Date: Sat, 2 Feb 2002 16:25:37 GMT
"Hans Aberg" <remove.haberg@matematik.su.se> wrote:
> These kinds of questions have been discussed in comp.std.c++. In brief (and
> perhaps follow-ups will say that I am wrong):
My apologies for beating a dead horse...
> It is not possible to deprecate "char", because almost all code would break.
Deprecate does not mean "remove." It means to warn of possible future
obsolescence. Therefore, deprecating char would not break any existing
code; it would merely warn people that char is considered to be a bad design
choice in light of the new types that would presumably be added. And before
it is mentioned, warn does not necessarily mean that the compiler would
issue warnings, just that the Standard would effectively say, "hey guys,
your old buddy char is on its death bed."
> But one may think of introducing new types that do rely on a special type
> of binary representation:
>
> A byte type with exactly 8 bits, and say a type "unichar", encoding
> probably UTF-32, with additional macros telling if the compiler can
> support them. (Some say that the presence of these types may create
> overheads they do not want to have.)
Certainly. Exactly what I was asking for, although what may prove better
would be a Standard Library extension similar to bit vector whereby the
developer could say something like this:
std::sized_int< 16 , std::big_endian > int16be = 32;
std::sized_int< 32 , std::little_endian > int32le = 123;
std::sized_unsigned_int< 64 , std::big_endian > int64be = 4097;
A vendor could provide specializations such that values described as being
std::big_endian effectively behave like built-in types of the same size on
big-endian processors, while their std::little_endian counterparts behave
like built-ins on little-endian processors.
Furthermore, the standard could be written such that the
big_endian/little_endian effects only matter when transmitting the object to
the "outside world"; in memory, the layout could be whatever is appropriate
to the platform.
As an example of where this could be useful, consider IP addresses, which
are 32-bit integers, transmitted in big-endian format. Today, you have to
be very careful when writing socket code to use htonX / ntohX the
appropriate number of times. A socket library based on std::sized_int could
simply do this:
typedef std::sized_int< 32 , std::big_endian > ip_address_t;
And the library could use whatever accessor/modifiers std::sized_int
provides to change the value of a specific address.
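A rough sketch of how the 32-bit big-endian case might be implemented, with
none of the proposed std:: machinery (the class name and interface here are
invented purely for illustration):

// The value is stored in a fixed external byte order and converted to the
// host's native representation on access, so the user never calls htonl/ntohl.
class uint32_be
{
    unsigned char bytes_[4];   // always big-endian in memory
public:
    uint32_be(unsigned long v = 0) { set(v); }
    void set(unsigned long v)
    {
        bytes_[0] = (unsigned char)(v >> 24);
        bytes_[1] = (unsigned char)(v >> 16);
        bytes_[2] = (unsigned char)(v >> 8);
        bytes_[3] = (unsigned char)(v);
    }
    unsigned long get() const
    {
        return ((unsigned long)bytes_[0] << 24) | ((unsigned long)bytes_[1] << 16)
             | ((unsigned long)bytes_[2] << 8)  |  (unsigned long)bytes_[3];
    }
    operator unsigned long() const { return get(); }  // reads like a built-in
};

// typedef uint32_be ip_address_t;   // usable roughly as in the example above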
Before it is mentioned, I'm well aware that all of this could be done today
without a change to the Standard (well, except for putting sized_int into
std::). I'm merely suggesting that this is one of those things that would
enhance the Standard Library and make distributed computing considerably
easier to do. Perhaps in my copious free time I will attempt to write these
templates and submit them to boost, unless somebody else (hopefully) beats
me to it. There seem to be some templates in boost today for selecting
appropriate types based on minimum bit-size requirements, but no apparent
luck as far as being able to specify a specific binary layout.
> One should note that the Unicode character numbers do not rely on any
> binary computer representation. So it might be possible to merely specify
> that a unichar should be able to hold all Unicode characters, with UTF-32
> and other translations being present. I think though that it is going to
> be very complicated to use variable width characters internally in a
> program (and slow, due to alignment cutoffs that the CPU will have to
> perform), so that suggests one should use UTF-32 and nothing else.
>
> There is also the high/low "endian" issue when dealing with Unicode
> representations. This probably belongs to the distributed programming
> chapter.
I think specifically requesting Unicode would probably be a mistake.
Consider how bad a choice it would have been for the Standard to require
that std::string have a refcounted implementation. It was quite wise to
write the Standard to allow refcounted std::string, while not requiring it.
Consider what happens if the Standard requires Unicode and then it turns out
that the Unicode design is fundamentally flawed. Or what if another
international character encoding comes into favor, and C++ is tied to
Unicode?
Note, though that a sized_X template could help in this regard too. You
could have something like this:
typedef
std::sized_char
< 16
, std::character_encoding< std::UTF_16 >
, std::big_endian >
char_utf16_be;
typedef
std::sized_char
< 32
, std::character_encoding< std::UTF_32 >
, std::little_endian >
char_utf32_le;
Again, a vendor would be free to use magic to make such sized_char values
effectively turn into built-in types.
> One model that comes to my mind in order to resolve this problem is to
> number all bits in the computer (instead of words), and specify the
> representation with respect to that. In this model, the high/low endian
> representations are two different encodings in this binary model.
Eek. As convenient as this might be, it would also be terribly inefficient
as the basis for memory management. Of course, there's no reason a library
couldn't include the ability to access individual bits, at least using the
virtual address space, say read( void* byte , int bit ), write ( void* byte
, int bit ). But that can be done without any library help using &, |, ^,
etc.
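For completeness, the &, |, ^ version is only a few lines (bit 0 is taken to
be the least significant bit of the addressed byte; the names mirror the
hypothetical read/write above):

bool read_bit(const void* byte, int bit)
{
    return ((*(const unsigned char*)byte >> bit) & 1u) != 0;
}

void write_bit(void* byte, int bit, bool value)
{
    unsigned char& b = *(unsigned char*)byte;
    if (value) b |= (unsigned char)(1u << bit);
    else       b &= (unsigned char)~(1u << bit);
}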
-- Early Ehlinger
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]
Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Sun, 3 Feb 2002 18:58:20 GMT
In article <0hU68.89357$h31.5940930@e420r-atl1.usenetserver.com>, "Early
Ehlinger" <earlye@yahoo.com> wrote:
>I think specifically requesting Unicode would probably be a mistake.
>Consider how bad a choice it would have been for the Standard to require
>that std::string have a refcounted implementation. It was quite wise to
>write the Standard to allow refcounted std::string, while not requiring it.
The difference is that Unicode is a well established standard, and ref
counts could be replaced by another type of conservative GC in order to
keep track of references.
Or one may view Unicode as a part of the distributed programming package
(see first article in thread "C++0x" last year): It turns out that the
current C++ Unicode support isn't portable, so that programmers that need
that feature end up writing out the names of the characters and their
Unicode values. This is extremely tedious, especially now when Unicode may
have hundreds of thousands of characters.
>Consider what happens if the Standard requires Unicode and then it turns out
>that the Unicode design is fundamentally flawed. Or what if another
>international character encoding comes into favor, and C++ is tied to
>Unicode?
Use wchar_t/char for that. As years go by, and one sees the usefulness of
them, one may add them to the C++ standard in another revision (if C++
still exists at that time).
Otherwise, everybody expects Unicode is created by a consortium trying to
make all other character encodings unnecessary. There is room for
introducing user characters on top of the Unicode range (between 2^21 and
2^24-1 I think). So if you have Unicode, for what reason would
you use another character encoding when communicating with people?
(Disregarding the fact that it might be useful with special compacted
encodings for special purposes -- but that would not be used in order to
guarantee open communications.)
>> One model that comes to my mind in order to resolve this problem is to
>> number all bits in the computer (instead of words), and specify the
>> representation with respect to that. In this model, the high/low endian
>> representations are two different encodings in this binary model.
>
>Eek. As convenient as this might be, it would also be terribly inefficient
>as the basis for memory management.
Only if one tries to access those bits, but that is the same case as now:
Such a model would of course include an entity "machine word", and
computers would generally use that. One just refers to the bit model when
necessary in order to resolve the binary structure. It does not mean that
one has to convert to the bit model, if that is not necessary.
Hans Aberg * Anti-spam: remove "remove." from email address.
* Email: Hans Aberg <remove.haberg@member.ams.org>
* Home Page: <http://www.matematik.su.se/~haberg/>
* AMS member listing: <http://www.ams.org/cml/>
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html ]