Topic: wchar & Unicode


Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Tue, 10 Apr 2001 15:49:40 GMT
Hans Aberg wrote:
>
> In article <3ACE162D.2ED8CF21@dollywood.itp.tuwien.ac.at>, Christopher
> Eltschka <celtschk@dollywood.itp.tuwien.ac.at> wrote:
> >> As for those 9-bit structures, aren't you confusing those with machine "words".
> >
> >No, the words on those machines are IIRC 36 bits.
>
> Since the personal computers sold now have 128 bits (like the Motorola G4), it
> sounds like these are anachronisms.

Those particular machines are anachronisms, but the freedom to have a
byte size other than 8 is a continuing issue. I suspect that the most
popular alternative now is 16 bits.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Martin von Loewis <loewis@informatik.hu-berlin.de>
Date: Tue, 10 Apr 2001 19:49:05 GMT
"James Kuyper Jr." <kuyper@wizard.net> writes:

> Those particular machines are anachronisms, but the freedom to have a
> byte size other than 8 is a continuing issue. I suspect that the most
> popular alternative now is 16 bits.

Which specific implementation that is actively used has 16-bit bytes?

Regards,
Martin

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Tue, 10 Apr 2001 19:49:27 GMT
In article <3AD2590B.1A928FE2@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>> >> As for those 9-bit structures [called bytes], aren't you confusing those with machine "words".
>> >
>> >No, the words on those machines are IIRC 36 bits.
>>
>> Since the personal computers sold now have 128 bits (like the Motorola G4), it
>> sounds like these are anachronisms.
>
>Those particular machines are anachronisms,

I recall that in the "Byte Magazine" (funnily enough) from the seventies,
memory organized into nine-bit units did occur. But that was well before the
current C++ standard.

> but the freedom to have a
>byte size other than 8 is a continuing issue. I suspect that the most
>popular alternative now is 16 bits.

The C++ standard (1.7:1) defines a byte as a contiguous sequence of bits,
large enough to contain "the basic execution character set members". In
5.3.3:1 one gets to know that a "char" is exactly a "byte" from the binary
point of view.

I could not find any definition of what the "the basic execution character
set" is.

One can note that the C++ definition of "byte" does not seem to have
anything to do with what others mean by it, because I figure that in all
computers a byte has nothing to do with an execution character set at all
but is merely a binary structure. I can't recall any CPU that knows
anything about characters; they merely process them like any other data.

So it seems that the C++ byte concept does not contribute anything at all
to describing the language and its implementation; possibly only confusion,
as the platform-specific byte concept might be entirely different from
the C++ variation.

It would be better if C++ only used "char"s, or some word other than
"byte", in describing the sizes of objects.

But what happens with C++ file IO on a platform where the "char" has 16
bits? Then it should write 16 bits at a time, right?

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Tue, 10 Apr 2001 22:34:18 GMT
Hans Aberg wrote:
>
> In article <3AD2590B.1A928FE2@wizard.net>, "James Kuyper Jr."
> <kuyper@wizard.net> wrote:
...
> I recall that in the "Byte Magazine" (funnily enough) from the seventies,
> memory organized into nine-bit units did occur. But that was well before the
> current C++ standard.

Yes - as I remember (not from personal experience) there was one
manufacturer that used 36-bit words. I've heard of those words being
subdivided in many different ways, including five 7-bit bytes plus an unused
padding bit. Of course, 7-bit bytes are not legal in C/C++, but I think
that was used before C was even invented. Four 9-bit bytes is the most
obvious way to implement C/C++ on such a machine.


> > but the freedom to have a
> >byte size other than 8 is a continuing issue. I suspect that the most
> >popular alternative now is 16 bits.
>
> The C++ standard (1.7:1) defines a byte as a contiguous sequence of bits,
> large enough to contain "the basic execution character set members". In
> 5.3.3:1 one gets to know that a "char" is exactly a "byte" from the binary
> point of view.

A byte is an amount of storage; char is a type that can be stored in
that amount of storage. It's a non-trivial distinction. unsigned char
and signed char are two other types that are also guaranteed to fit in a
byte, but they each are represented differently in that memory. It's
implementation-defined whether any other types, such as 'long', also fit
in a single byte, possibly with a different representation.
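
For illustration, a minimal sketch of that distinction; all three character
types occupy exactly one byte, while the limits printed (and CHAR_BIT
itself) are implementation-defined:

#include <iostream>
#include <climits>

int main()
{
    // sizeof measures in bytes; each character type occupies exactly one.
    std::cout << "sizeof(char)          = " << sizeof(char) << '\n'
              << "sizeof(signed char)   = " << sizeof(signed char) << '\n'
              << "sizeof(unsigned char) = " << sizeof(unsigned char) << '\n';

    // The byte itself is CHAR_BIT bits wide; the types stored in it
    // interpret those bits differently.
    std::cout << "CHAR_BIT  = " << CHAR_BIT << '\n'
              << "CHAR_MIN  = " << CHAR_MIN << ", CHAR_MAX = " << CHAR_MAX << '\n'
              << "SCHAR_MIN = " << SCHAR_MIN << ", SCHAR_MAX = " << SCHAR_MAX << '\n'
              << "UCHAR_MAX = " << UCHAR_MAX << '\n';
    return 0;
}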

> I could not find any definition of what the "the basic execution character
> set" is.

See section 2.2p3; the fact that "basic execution character set" is in
italics identifies this clause as defining that term.

...
> But what happens with C++ file IO on a platform where the "char" has 16
> bits? Then it should write 16 bits at a time, right?

Correct.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@stop.mail-abuse.org>
Date: Wed, 11 Apr 2001 13:59:50 GMT

"James Kuyper Jr." wrote:

>
> Yes - as I remember (not from personal experience) there was one
> manufacturer that used 36-bit words. I've heard of those words being
> subdivided in many different ways, including five 7-bit bytes plus an unused
> padding bit. Of course, 7-bit bytes are not legal in C/C++, but I think
> that was used before C was even invented. Four 9-bit bytes is the most
> obvious way to implement C/C++ on such a machine.

Yes, both the UNIVAC 1100 series and the DEC 10/20's (both of which borrowed
heavily from an earlier IBM system) had no fixed "byte" size, but a very
flexible "partial word" mechanism.

The native character set on the UNIVAC in the early days was a 6-bit
encoding called Fieldata.  There were no non-printing characters in
it.  The zero value was the @ character, referred to as the "master space"
in some of the documentation.

ASCII on those machines was variously implemented as five 7-bit characters
(with a bit left over) or four 9-bit ones.

>
> A byte is an amount of storage; char is a type that can be stored in
> that amount of storage. It's a non-trivial distinction.

Yes, and this is where the problem comes from.  Char has a schizophrenic
life:

It is, one, the smallest allocatable unit of storage, and, two, frequently
a type holding a native "character".

If your native character needs to go to 16 bits, you have two choices.
If you (as you suggested in a reply to a previous posting of mine) make
char == 16 bits, then you lose the ability to address 8-bit things in
C++.  So C++ gives you wchar_t to use instead, which is your second
choice.  However, the problem, as I have been pointing out, is that
C++ only trivially implements wchar_t in a few special cases (the
contents of streams and strings).  Everywhere else it defers to
char, and assumes that the implementation can make do either by reverting
to some 8-bit encoding, or some multibyte representation, for which there
is scant actual support anywhere in C++ (a few wretched conversion
routines leak over from C).  There's no provision for manipulating
multibyte data in a C++ way.

This clearly leads to more schizophrenia: the set of C++ interfaces that
deal with character strings is divided rigorously:
   streams and strings take wchar_t and won't handle mb encodings.
   every other interface (filenames, main args, exception what() strings, etc...)
   takes mb strings only and won't work with wchar_t.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 11 Apr 2001 14:00:03 GMT
In article <j47l0sr8ot.fsf@informatik.hu-berlin.de>, Martin von Loewis
<loewis@informatik.hu-berlin.de> wrote:
>Which specific implementation that is actively used does have bytes of
>16 bits?

I noticed that my compiler (Metrowerks Codewarrior Pro 5) has a line
  #ifdef __m56800__
  #define CHAR_BIT            16
but I do not know which platform this is.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 12 Apr 2001 21:23:01 GMT
In article <3AD38C4B.15B53E25@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>...Of course, 7-bit bytes are not legal in C/C++,

What prohibits 7-bit bytes in C++? -- It seems that it suffices that the
byte can hold the "basic execution character set", to which you gave the
quote:

>See section 2.2p3; ...

And the list of characters given there contains fewer than 128 characters,
thus fitting into a 7-bit byte.

>but I think
>that was used before C was even invented.

The first UNIX with C was made in 1973, well before I saw those ads with
9-bit bytes. (The first standard C came in 1983.)

>A byte is an amount of storage; char is a type that can be stored in
>that amount of storage. It's a non-trivial distinction.

In contexts other than C++, it seems: C++ has found a complicated way to
define a byte as being the underlying binary structure of a
char, which I think is not what most others would think of when speaking
of a byte.

Anyway, it does not affect the original issue, namely that C++ is
everywhere written so that it is impossible to specify the underlying
binary structure in its entirety in a compiler-independent way, which
screws things up when one tries to use C++ in connection with standards
that rely on such binary structures. -- Such as Unicode, and I figure,
distributed programming as well.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Fri, 13 Apr 2001 02:13:38 GMT
Hans Aberg wrote:
>
> In article <3AD38C4B.15B53E25@wizard.net>, "James Kuyper Jr."
> <kuyper@wizard.net> wrote:
> >...Of course, 7-bit bytes are not legal in C/C++,
>
> What prohibits 7-bit bytes in C++? -- It seems that it suffices that the
> byte can hold the "basic execution character set", to which you gave the
> quote:

See 5.3.3p1 - a char is required to fit in a single byte. If char is
signed, then CHAR_MAX must be at least 127, and CHAR_MIN can't be
greater than -127; If char is unsigned, then CHAR_MAX must be at least
255, and CHAR_MIN must be 0. A char must be able to store any value from
CHAR_MIN to CHAR_MAX, inclusive, as distinct values. Whether signed or
unsigned, that's too large a range of
distinct values to store in a 7-bit byte.
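
Those limits can also be checked at translation time; a small sketch using
<climits> (on a conforming implementation neither #error can ever fire,
which is the point):

#include <climits>

#if CHAR_BIT < 8
#error "CHAR_BIT below 8: not a conforming C/C++ implementation"
#endif

#if UCHAR_MAX < 255
#error "unsigned char cannot hold 0..255: not conforming either"
#endif

int main() { return 0; }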

> >but I think
> >that was used before C was even invented.
>
> The first UNIX with C was made in 1973, well before I saw those ads with
> 9-bit bytes. (The first standard C came in 1983.)

Well, Ron Natalie mentioned two groups of machines with that feature,
"the UNIVAC 1100 series and the DEC 10/20's". According to
<http://www.fourmilab.ch/documents/univac/>, the UNIVAC 1100 series
dates back at least to the 1960's, which is well before the invention of
C. I've had trouble finding a more precise date. The important point, of
course, is not how old they are, but whether enough such machines
remained in service to justify leaving the number of bits in a byte open
when C was first standardized. There most certainly were many such
machines at that time; by the time they ceased being common, platforms
with 16-bit bytes had already started appearing.

> >A byte is an amount of storage; char is a type that can be stored in
> >that amount of storage. It's a non-trivial distinction.
>
> In contexts other than C++, it seems: C++ has found a complicated way to

In conventional usage, a byte is exactly 8 bits of memory space. In the
C/C++ standards, it's an implementation-defined number of bits of memory
space, with some indirect lower limits on that number. Other than that
one numerical issue of size, the two concepts are identical.

> define a byte as being the underlying binary structure of a
> char,


Yes - and in exactly the same sense, it is true that a byte is the
underlying binary structure of an 'int', and a 'double', and a 'char*',
and a std::string. What's odd about that? There's no mystical C++
connection between a byte and a char, except that a byte is big enough
to hold a char, and unsigned char is defined to use/provide access to
every bit of the byte it's stored in.
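
A sketch of that last point: any object can be examined byte by byte
through unsigned char, which is what gives it the special role (the byte
values and their order are implementation-defined, so the output also
reveals the machine's endianness):

#include <iostream>
#include <iomanip>
#include <cstddef>

int main()
{
    unsigned int value = 0x12345678u;

    // View the object representation of 'value' as an array of
    // sizeof(value) unsigned chars.
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(&value);

    for (std::size_t i = 0; i != sizeof value; ++i)
        std::cout << "byte " << i << ": 0x"
                  << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<unsigned int>(bytes[i]) << '\n';
    return 0;
}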

> ... which I think is not what most others would think of when speaking
> of a byte.

What exactly is the difference you see between that concept and "what
most others would think of when speaking of a byte."?  I don't see any.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Fri, 13 Apr 2001 16:12:24 GMT
In article <3AD6548D.3CD8678B@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>> What prohibits 7-bit bytes in C++?
...
>See 5.3.3p1 - a char is required to fit in a single byte.

This just says that a char (forgetting about the character structure) is
exactly one byte. In fact, if one looks at the definition of "object
representation" in 3.9:4, it does not use bytes, but unsigned char's for
that.

> If char is
>signed, then CHAR_MAX must be at least 127, and CHAR_MIN can't be
>greater than -127; ...

Actually, according to the C standard, CHAR_BIT must be at least 8.

>Yes - and in exactly the same sense, it is true that a byte is the
>underlying binary structure of an 'int', and a 'double', and a 'char*',
>and a std::string. What's odd about that?

The underlying binary structure of an int is not a byte, because, for
example, on my implementation it is 4 bytes.

The odd thing is that, as matters stand today, the C++ definition of
"byte" is unconventional, as a byte today means exactly 8 bits and nothing
else.

What once was intended to avoid problems in C several decades ago (by
allowing other byte sizes) is now, in the current C++, an anachronism that
invites creating new problems.

> There's no mystical C++
>connection between a byte and a char, except that a byte is big enough
>to hold a char, and unsigned char is defined to use/provide access to
>every bit of the byte it's stored in.

As far as the "object representation" is concerned, a char and a byte are exactly the same.

Returning to the original question, one could not in the current C++
standard define a char to have 32 bits merely in order to be able to
hold a Unicode character, because then the C++ bytes would also be
32 bits, and all code that otherwise has nothing to do with characters
would have to be changed as well.

In addition, one will end up with a compiler with a byte concept that is
likely to be in disagreement with the byte concept of the rest of the
world, including the OS the compiler is implemented under.

>What exactly is the difference you see between that concept and "what
>most others would think of when speaking of a byte."?  I don't see any.

I think I explained that: A byte is a fixed binary structure (today, 8
bits) that has nothing to do with characters and their implementation.

In C++, on the other hand, a byte is whatever a char is (forgetting about
the character structure).

It means that if one changes a char in C++, the fixed concept of a byte
is also supposed to change, which is very odd indeed.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@stop.mail-abuse.org>
Date: Fri, 13 Apr 2001 16:16:40 GMT

Hans Aberg wrote:

>
> The first UNIX with C was made in 1973, well before I saw those ads with
> 9-bit bytes. (The first standard C came in 1983.)

And what standard would that have been (in 1983)?
>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Sat, 14 Apr 2001 12:30:53 GMT
In article <3AD7043A.41A779E0@stop.mail-abuse.org>, Ron Natalie
<ron@stop.mail-abuse.org> wrote:
>> The first UNIX with C was made in 1973, well before I saw those ads with
>> 9-bit bytes. (The first standard C came in 1983.)
>
>And what standard would that have been (in 1983)?

Sorry, it was the year the ANSI C standardization was started (according
to Kernighan & Ritchie, "The C Programming Language"). The work was finished in late 1988.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Mon, 2 Apr 2001 11:58:22 CST
no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt) writes:

|>  "James Kuyper Jr." <kuyper@wizard.net> wrote:

|>  > Volker Apelt wrote:
|>  > > What I was trying to say is, that c++ should define a convenient
|>  > > way to handle locale/codeset dependent strings.

|>  > You said "a convenient way". But which way?
|>  [ snip -- convenience is not agreed yet for encodings ]
|>  > Standardizing before that consensus has been reached is premature.

|>  Then let's find out what "convenience" means to us. Just one
|>  example.  string and wstring are prepared to handle fixed size
|>  character encodings only.  What about multibyte encodings and
|>  strings? (let's say UTF-8) The smallest codepoint is 8 bits wide,
|>  so, one could choose std::string as the base of implementation. But
|>  the iterators are not prepared for multibytes and all algorithms are
|>  based on them.

|>  So, convenience to me means to have an iterator that knows about the
|>  current encoding of the string it belongs to and is guaranteed to
|>  stop at codepoints instead of raw bytes (char) or raw wchar_t.

Convenience means having a character type wide enough that I don't need
to worry about multibyte characters in my code.

I'm not really happy about the interface to locale, but providing
codecvt, and requiring file IO to use it, was a significant step
forward.
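
As a small illustration of that last point (a sketch, assuming the user's
environment names a usable locale): a wide-character file stream converts
each wchar_t through the codecvt facet of its imbued locale as the
characters are written out.

#include <fstream>
#include <locale>

int main()
{
    std::wofstream out;

    // Imbue before opening: the locale's codecvt<wchar_t, char, mbstate_t>
    // facet translates the internal wchar_t sequence into the external
    // (narrow, possibly multibyte) file representation.
    out.imbue(std::locale(""));
    out.open("hello.txt");
    out << L"wide characters written through codecvt\n";
    return 0;
}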

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 2 Apr 2001 11:56:12 CST
In article <864rwhbydb.fsf@alex.gabi-soft.de>, James Kanze
<kanze@gabi-soft.de> wrote:
>|>  -- This reminds me that C++ does not appear to have a "save" command
>|>  for writing a file to disk. As file buffering is a common approach,
>|>  I think it would be prudent to add such a function.
>
>What is this save command supposed to do?  It sounds an awful lot like
>flush, from the little you say here.

By the way, I am not sure that a flush causes a write to disk. -- I think
that the behavior might be that it causes a write to the hard disk buffer,
which is written to the actual hard disk when the file is closed (for
performance reasons).

It would mean that in order to ensure a write to disk, one would need to
close and reopen the file, or to invoke a system specific command.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Mon, 2 Apr 2001 11:58:08 CST
remove.haberg@matematik.su.se (Hans Aberg) writes:

|>  But when Unicode becomes widespread, it seems likely that eventually
|>  32-bit words at least will replace the byte concept altogether:

Might I suggest that there is a bit of wishful thinking here.  There are
still channels on the Internet that aren't 8 bit clean; I don't think
that 32 bit words are going to take over the world anytime soon, and a
language or an OS which doesn't support legacy formats is doomed before
it starts.

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Mon, 2 Apr 2001 16:59:05 CST
Volker Apelt wrote:
...
> Then let's find out what "convenience" means to us. Just one example.
> string and wstring are prepared to handle fixed size character
> encodings only.
> What about multibyte encodings and strings? (let's say UTF-8)
> The smallest codepoint is 8 bits wide, so, one could choose std::string
> as the base of implementation. But the iterators are not prepared
> for multibytes and all algorithms are based on them.
>
> So, convenience to me means to have an iterator that knows about
> the current encoding of the string it belongs to and is guaranteed
> to stop at codepoints instead of raw bytes (char) or raw wchar_t.

I sympathize with your desire, but I think there may be a basic
conceptual flaw with this idea. For iterators, i++ is supposed to
advance you past the object *i. However, *i is supposed to return
iterator_traits<iterator>::value_type, while you're suggesting an
iterator that may move you across a different number of bytes, depending
upon the current state. The only way I can see to do that is to have
value_type be either a fixed-size character type such as wchar_t, or
some tricky class type that implicitly converts to a fixed-sized
character type under appropriate circumstances.

I suppose you could define a string whose iterators present a value_type
of wchar_t, but make use of codecvt on a purely internal multibyte
character string. However, that would leave the multibyte character
string almost completely hidden; I suspect that's not what you want.
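
A rough sketch of that second idea, assuming UTF-8 as the internal
multibyte encoding; the class name and the decoding are purely illustrative
(1- to 3-byte sequences only, no error checking), meant only to show the
shape of a read-only iterator whose value_type is a fixed-width character:

#include <iostream>
#include <cstddef>

// Read-only iterator over a UTF-8 buffer; operator* yields a whole
// code point as a wchar_t, operator++ steps over one encoded character.
class utf8_iterator
{
public:
    typedef wchar_t value_type;

    explicit utf8_iterator(const char* p) : p_(p) {}

    value_type operator*() const
    {
        unsigned char b0 = byte(0);
        if (b0 < 0x80)                            // 0xxxxxxx
            return b0;
        if ((b0 & 0xE0) == 0xC0)                  // 110xxxxx 10xxxxxx
            return ((b0 & 0x1F) << 6) | (byte(1) & 0x3F);
        // 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12)
             | ((byte(1) & 0x3F) << 6)
             |  (byte(2) & 0x3F);
    }

    utf8_iterator& operator++()                   // skip one code point
    {
        unsigned char b0 = byte(0);
        p_ += (b0 < 0x80) ? 1 : ((b0 & 0xE0) == 0xC0) ? 2 : 3;
        return *this;
    }

    bool operator!=(const utf8_iterator& other) const { return p_ != other.p_; }

private:
    unsigned char byte(std::size_t i) const
    { return static_cast<unsigned char>(p_[i]); }

    const char* p_;
};

int main()
{
    // "a", U+00E9, U+20AC encoded in UTF-8 (6 bytes before the terminator).
    const char text[] = "\x61\xC3\xA9\xE2\x82\xAC";
    utf8_iterator first(text), last(text + sizeof text - 1);
    for (; first != last; ++first)
        std::cout << "U+" << std::hex
                  << static_cast<unsigned long>(*first) << '\n';
    return 0;
}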

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Mon, 2 Apr 2001 21:59:08 CST
Ron Natalie wrote:
>
> "James Kuyper Jr." wrote:
> >
> > Ron Natalie wrote:
> > >
> > > "James Kuyper Jr." wrote:
> > ...
> > > I'm just mandating consistency in the wchar_t usage so that an implementation
> > > has the option to say, OK wchar_t can hold a UNICODE character.
> >
> > Done: the language already allows that. What it doesn't do is mandate
> > it. The failure to mandate it allows implementors whose customers have
> > no need for internationalization to use 'char' as wchar_t (possibly as a
> > compiler option).
>
> I don't think you understand me.

I have to agree.

> ... The language doesn't allow it.  It starts
> to do that by defining wchar_t and making limited use of it in some stream_bufs
> and basic_string, but there are still very important interfaces which cannot
> take things composed out of wchar_t's.

So? How does that prevent "an implementation [saying that] wchar_t can
hold a UNICODE character"?  The fact that there are a few important
interfaces that can't use them doesn't prevent them from holding UNICODE
characters. There's nothing in the standard preventing even 'char' from
holding UNICODE characters; there's certainly nothing preventing wchar_t
from doing so.

...
> saying is, the implementation is stuck trying to figure out how to make
> a non-US oriented system work with char* as the sole string type.

So wchar_t is what? A floating point type? There are things that char[]
can be used for that wchar_t[] can't be used for, but that doesn't
invalidate the fact that every single aspect of the library connected
with wchar_t treats it as a container for string characters, and that's a
significant fraction of the standard library. Whatever your real problem
with the standard is, it's not correctly described by attempting to deny
the stringiness of wchar_t[].


...
> > > All this presumes that there is an NTMBS string existent in the implementation
> > > that can stand for any NTWCS (string made up of wchar_t).   This is the
> >
> > If there isn't, that's because the decision was made (if only by failing
> > to decide) to allow it to be so. There's nothing in the C++ standard
> > that prevents an implementation from making that assumption valid. In
> > fact, the existence of the conversion utilities creates the naive
> > expectation that it's required to be valid (at least, until you read the
> > fine print).
> >

> Huh?   Yes there is.  If there is no such thing as a Wide string to Multibyte
> string conversion you:

I thought we were talking about the C++ standard library? Wide string to
multibyte conversion facilities are a built-in part of the library. Two
different flavors, in fact: wcstombs(), and std::codecvt.
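
For reference, a sketch of the first of those flavours; wcstombs() converts
according to the program's current C locale, so setlocale() is called
first, and error handling is kept minimal:

#include <cstdlib>   // std::wcstombs
#include <clocale>   // std::setlocale
#include <cstdio>
#include <cstddef>

int main()
{
    std::setlocale(LC_ALL, "");        // use the user's locale for the encoding

    const wchar_t* wide = L"wide text";
    char narrow[64];

    // Convert the wide string into the locale's multibyte encoding.
    std::size_t n = std::wcstombs(narrow, wide, sizeof narrow);
    if (n == static_cast<std::size_t>(-1))
        std::puts("conversion failed");
    else
        std::puts(narrow);
    return 0;
}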

> 1.  Can't deal with files, for fstream won't work.

Well, yes, of course fstream won't work. That's what wfstream is for. Of
course the file name itself can be a wide-character unicode string only
if 'char' is large enough to hold a unicode character, but the standard
doesn't prohibit that, either.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Tue, 3 Apr 2001 19:21:02 GMT
In article <86snjso3id.fsf@alex.gabi-soft.de>, James Kanze
<kanze@gabi-soft.de> wrote:
>|>  But when Unicode becomes widespread, it seems likely that eventually
>|>  32-bit words at least will replace the byte concept altogether:
>
>Might I suggest that there is a bit of wishful thinking here.  There are
>still channels on the Internet that aren't 8 bit clean; I don't think
>that 32 bit words are going to take over the world anytime soon, and a
>language or an OS which doesn't support legacy formats is doomed before
>it starts.

\begin{ironic}
But can't they simply write a codecvt?

-- It can't be more difficult than getting Unicode working with the
current C++ standard.
\end{ironic}

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Wed, 4 Apr 2001 19:45:52 GMT
remove.haberg@matematik.su.se (Hans Aberg) writes:

|>  In article <864rwhbydb.fsf@alex.gabi-soft.de>, James Kanze
|>  <kanze@gabi-soft.de> wrote:
|>  >|>  -- This reminds me that C++ does not appear to have a "save" command
|>  >|>  for writing a file to disk. As file buffering is a common approach,
|>  >|>  I think it would be prudent to add such a function.

|>  >What is this save command supposed to do?  It sounds an awful lot
|>  >like flush, from the little you say here.

|>  By the way, I am not sure that a flush causes a write to disk. -- I
|>  think that the behavior might be that it causes a write to the hard
|>  disk buffer, which is written to the actual hard disk when the file
|>  is closed (for performance reasons).

Quite true.  The standard doesn't say anything about what happens once
the buffer has left the C++ implementation.

|>  It would mean that in order to ensure a write to disk, one would
|>  need to close and reopen the file, or to invoke a system specific
|>  command.

Or perhaps even issue a sync command to the system, or some such.

This is, in fact, occasionally a problem.
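
A sketch of the gap, assuming a POSIX system: flush() only hands the
characters to the OS, and pushing them to the physical disk takes a
system-specific call that standard C++ does not provide.

#include <fstream>

#include <unistd.h>   // POSIX only, not standard C++

int main()
{
    std::ofstream out("data.txt");
    out << "important data\n";

    // Guarantees only that the data has left the C++ library's buffers;
    // the operating system may still be holding it in its own cache.
    out.flush();

    // System-specific step: ask the OS to write its buffers out.  sync()
    // is the coarse POSIX hammer; a per-file fsync() would need the file
    // descriptor, which a standard ofstream does not expose.
    sync();

    return 0;
}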

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt)
Date: Thu, 5 Apr 2001 16:06:33 GMT
"James Kuyper Jr." <kuyper@wizard.net> wrote:

> Volker Apelt wrote:
> ....
> > So, convenience to me means to have an iterator that knows about
> > the current encoding of the string it belongs to and is guaranteed
> > to stop at codepoints instead of raw bytes (char) or raw wchar_t.
>
> I sympathize with your desire, but I think there may be a basic
> conceptual flaw with this idea. For iterators, i++ is supposed to
> advance you past the object *i. However, *i is supposed to return
> iterator_traits<iterator>::value_type, while you're suggesting an
> iterator that may move you across a different number of bytes, depending
> upon the current state. The only way I can see to do that is to have
> value_type be either a fixed-size character type such as wchar_t, or
> some tricky class type that implicitly converts to a fixed-sized
> character type under appropriate circumstances.

That's why mbc's don't just work with the standard implementation
of string. :-)
The step size of a multibyte character iterator's (mbc_iterator) ++i and
the size of a raw char of the underlying implementation are not equal.
Even worse, the size of the next step over an mbc depends on that mbc.
++i is supposed to step over one or more objects
of charT, where charT is a type capable of holding the smallest
code point for that encoding, probably a byte (octet).
So the implementation of an mbc_iterator is not a simple pointer.
It must contain a reference to some encoding information and,
if the encoding requires it, some state information.

> [ about iterator_traits<iterator>::value_type and *i return value]

That depends on how we will use the mbc_iterator.
The only advantage of mbc's over fixed-width characters is
space consumption. Not converting to wide characters is an advantage
if the amount of mbc data read is huge and you don't need to change
the mbc string,
e.g. searching for a small pattern in a large mbc-encoded file. So one
could mmap the file, search over it and unmap it without copying
or converting.

I'd like to use a reduced set of non-changing operations on
mbc_iterator:
( i,j,k   are mbc_iterator's,  w is a wchar_t iterator, c is a char* )
 - ++i , i++,  i += unsigned,  point to the first byte of the next
   codepoint.
   maybe it should just throw, in case of partial or error.
 - testing for validity of the last ++i operation and current
   position.  !i,  i.convResult()
   (Did the last operation step over a complete mbc...?)
 - compare mbc's to mbc's  (*i==*k) (*i!=*k)
 - (?) compare mbc's to wchar_t and (?) char (*i==*w), (*i==*c)
 - use non-changing std algorithms like find, find_if on i pairs
   (rfind could be tricky)
It sounds much like an input_iterator. Maybe it requires specialized
templates for standard algorithms on mbc_iterators.
If --i is implementable, it is a 'bidirectional_input_iterator' (not in
std).

  Is a raw char good enough as
iterator_traits<mbc_iterator>::value_type?
That was my initial thought. But now I think it is better to make it a
read-only proxy that converts to different char sizes if necessary and
throws if it can't. That means the proxy knows about the position and
the encoding, too.
Or make it a proxy that is accessible to friend functions and methods
only, to restrict access and avoid invisible conversions.

But one could argue that concatenating two mbc strings is
efficient, too,
if they have the same encoding. But assigning to arbitrary positions
isn't required to be. Concatenation may require some additional bytes,
which are not present in the original strings, to adjust for different
shift states. That requires at least:
 - *i=*k, *i=*w, *i=*c, should be fast if i == end()
Then, iterator_traits<mbc_iterator>::value_type probably has to be
a complicated proxy type, if assignment is required.
( This is not what I need. )

> I suppose you could define a string whose iterators present a value_type
> of wchar_t, but make use of codecvt on a purely internal multibyte
> character string. However, that would leave the multibyte character
> string almost completely hidden; I suspect that's not what you want.

That sounds interesting, too, but could be a completely different iterator
type.
My initial idea was an implementation that allows access to the
raw representation and does as little work as possible (no conversion).
The proxy above could do lazy evaluation and cache a conversion result
for the last character.
But 'doing no conversion' may be an illusion.

I'm not sure what the correct type for value_type would be. I favour
the proxy implementation.
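
The stepping operation sketched above is close to what the C function
mbrtowc() (one of the routines that leak over from C) already does for the
locale's multibyte encoding: it consumes one complete multibyte character,
carries shift state in an mbstate_t, and reports how many bytes it used.
A rough sketch of the ++i step in those terms, with error handling reduced
to stopping:

#include <cwchar>    // std::mbrtowc, std::mbstate_t
#include <clocale>
#include <cstring>
#include <cstdio>
#include <cstddef>

int main()
{
    std::setlocale(LC_ALL, "");          // encoding of the multibyte text

    const char text[] = "plain text, or multibyte in the chosen locale";
    const char* p = text;
    const char* end = text + std::strlen(text);

    std::mbstate_t state = std::mbstate_t();
    while (p < end)
    {
        wchar_t wc;
        // Decode one complete multibyte character starting at p.
        std::size_t len = std::mbrtowc(&wc, p, end - p, &state);
        if (len == static_cast<std::size_t>(-1) ||
            len == static_cast<std::size_t>(-2))
            break;                       // invalid or incomplete sequence
        if (len == 0)
            len = 1;                     // an embedded null still advances

        std::printf("U+%04lX (%lu byte(s))\n",
                    static_cast<unsigned long>(wc),
                    static_cast<unsigned long>(len));
        p += len;                        // this is the "++i" of an mbc_iterator
    }
    return 0;
}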



---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Christopher Eltschka <celtschk@dollywood.itp.tuwien.ac.at>
Date: Thu, 5 Apr 2001 22:59:08 GMT
Raw View
Hans Aberg wrote:
>
> In article <99vr75$can$1@xmission.xmission.com>, (Rich)
> legalize+jeeves@mail.xmission.com wrote:
> >>It is stopping one in the sense that it is not _within_ the C++ standard
> >>possible to implement any underlying binary model.
> >
> >You lost me there.  I'm not even sure what you mean by "binary model".
> >You're talking about treating a bag of bits as a binary type for I/O,
> >right?  What's wrong with the read and write members of streams?
>
> The C++ standard is everywhere written so that one cannot specify what the
> actual underlying bits should be. It's not only what a char should be, but
> bit-fields, and such can be aligned as the compiler writer finds it
> useful.
>
> For example, from the view of the C++ standard, one can output a "char"
> but not an 8-bit byte, because there is no way to specify a structure with
> 8-bit alignment. Only if one knows that one's compiler implements a char as
> a byte can one output a byte. But it is perfectly legal to implement a
> "char" as something other than an 8-bit byte.

That is because there are machines where outputting an 8-bit byte
would be quite a hard task (even defining what it means to output
an 8-bit byte would be), just because on those machines a byte
_does not have 8 bits_.

Please give a portable definition of "output an 8-bit byte"
which could be used unchanged on a machine with 9-bit bytes.

>
> So it means that if one wants to output a binary structure, which then
> should be read by say the same program compiled with another compiler,
> there is no guarantee that the program will work, that is, from the point
> of view of the C++ standard.

What if the second compiler is on another computer, with incompatible
storage devices? Should the C++ standard demand every computer to
be able to read every storage device? With every file system on it?

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Fri, 6 Apr 2001 17:51:26 GMT
Raw View
In article <3ACCAD79.FF0DE057@dollywood.itp.tuwien.ac.at>, Christopher
Eltschka <celtschk@dollywood.itp.tuwien.ac.at> wrote:
>> For example, from the view of the C++ standard, one can output a "char"
>> but not an 8-bit byte, because there is no way to specify a structure with
>> 8-bit alignment. Only if one knows that one's compiler implements a char as
>> a byte can one output a byte. But it is perfectly legal to implement a
>> "char" as something other than an 8-bit byte.
>
>That is because there are machines where outputting an 8-bit byte
>would be quite a hard task (even defining what it means to output
>an 8-bit byte would be), just because on those machines a byte
>_does not have 8 bits_.
>
>Please give a portable definition of "output an 8-bit byte"
>which could be used unchanged on a machine with 9-bit bytes.

As far as I know, a "byte" is always 8-bit; I only use the wording "8-bit
byte" so that one should not confuse it with anything else.

As for those 9-bit structures, aren't you confusing those with machine "words".

The 9-bit word computers I recall from long ago used the extra bit for error
checking. So in effect, those words are 8-bit. But those computers did not
run C++; they were far too small.

What current computers do you have in your mind?

Otherwise, it is quite clear that with any bit specific standard, the
alignments may go off relative to the machine words, and that is going to
be slow.

>> So it means that if one wants to output a binary structure, which then
>> should be read by say the same program compiled with another compiler,
>> there is no guarantee that the program will work, that is, from the point
>> of view of the C++ standard.
>
>What if the second compiler is on another computer, with incompatible
>storage devices? Should the C++ standard demand every computer to
>be able to read every storage device? With every file system on it?

One simply writes a stream of bits to an IO-stream of some kind. Then one
wants to ensure that the bits are exactly the same when they are read by
that other computer.

It's like when you write a file, and port it over the Internet. If you
wrote it in English, you wouldn't want it to look like it was written in
Chinese at the other end.

The big hurdle I think is how the bits of, say, a byte are packed together
into a stream -- high or low first. Perhaps there are two choices, like
big/little endian.

My guess is that on OS's that know how to write files with bytes in them,
one would simply write to that format.

On OS's that do not support that, one would have a macro telling that.
Then one would get a compiler error. Those who try to compile the
program will know that it depends on a certain bit structure
being correct, and will need to fix that.

Or if the proposal is made so that one has definitions like say
  #define char byte
  #define wchar_t Uchar // 32-bit, or uchar = 16-bit.
then on compilers that do not support "byte", "uchar", or "Uchar", one
would still have char and wchar_t, and one could use them.

But one would still want to have macros that can tell about if bytes do
exist, so that one can write say
  #if !Uchar_exists
  #error "This program requires 32-bit Unicode characters".
  #endif

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Christopher Eltschka <celtschk@dollywood.itp.tuwien.ac.at>
Date: Fri, 6 Apr 2001 21:45:34 GMT
Raw View
Hans Aberg wrote:
>
> In article <3ACCAD79.FF0DE057@dollywood.itp.tuwien.ac.at>, Christopher
> Eltschka <celtschk@dollywood.itp.tuwien.ac.at> wrote:
> >> For example, from the view of the C++ standard, one can output a "char"
> >> but not an 8-bit byte, because there is no way to specify a structure with
> >> 8-bit alignment. Only if one knows that one's compiler implements a char as
> >> a byte can one output a byte. But it is perfectly legal to implement a
> >> "char" as something other than an 8-bit byte.
> >
> >That is because there are machines where outputting an 8-bit byte
> >would be quite a hard task (even defining what it means to output
> >an 8-bit byte would be), just because on those machines a byte
> >_does not have 8 bits_.
> >
> >Please give a portable definition of "output an 8-bit byte"
> >which could be used unchanged on a machine with 9-bit bytes.
>
> As far as I know, a "byte" is always 8-bit; I only use the wording "8-bit
> byte" so that one should not confuse it with anything else.

Well, fortunately knowledge can grow ;-)

>
> As for those 9-bit structures, aren't you confusing those with machine "words".

No, the words on those machines are IIRC 36 bits.

> The 9-bit word computers I recall from long ago used the extra bit for error
> checking. So in effect, those words are 8-bit. But those computers did not
> run C++; they were far too small.
>
> What current computers do you have in your mind?

Those which were mentioned in the C++ groups several
times; I'd have to do a google (ex-deja) search, but
I'm currently too lazy for that.

>
> Otherwise, it is quite clear that with any bit specific standard, the
> alignments may go off relative to the machine words, and that is going to
> be slow.

And that is why the C++ standard doesn't specify bit widths.

However on machines with 8-bit bytes, I doubt you'll find one C++
compiler where char is not 8 bits. Even if the standard doesn't
demand it.
Therefore if you are happy with portability to most machines
(all with 8 bit bytes), you can happily use char *as if* they
were guaranteed to be 8 bits - just because the market will
guarantee it (and the market is stronger than _any_ standard).

>
> >> So it means that if one wants to output a binary structure, which then
> >> should be read by say the same program compiled with another compiler,
> >> there is no guarantee that the program will work, that is, form the point
> >> of view of the C++ standard.
> >
> >What if the second compiler is on another computer, with incompatible
> >storage devices? Should the C++ standard demand every computer to
> >be able to read every storage device? With every file system on it?
>
> One simply writes a stream of bits to an IO-stream of some kind. Then one
> wants to ensure that the bits are exactly the same when they are read by
> that other computer.

But that's clearly outside of the C++ standard.

>
> It's like when you write a file, and port it over the Internet. If you
> wrote it in English, you wouldn't want it to look like it was written in
> Chinese at the other end.

Well, that's why there are *independent* standards (like ASCII,
TCP/IP, ...)

>
> The big hurdle I think is how the bits of, say, a byte are packed together
> into a stream -- high or low first. Perhaps there are two choices, like
> big/little endian.

In principle, yes. In practice, the OS (and usually even the hardware)
takes not individual bits, but full bytes. And the standards which
guarantee the right bit order are made elsewhere (e.g. by those
defining the disk storage format).

>
> My guess is that on OS's that know how to write files with bytes in them,
> one would simply write to that format.

Of course. That's what C++ does.

>
> On OS's that do not support that, one would have a macro telling that.

Besides the fact that I don't know an OS which doesn't allow
the I/O of bytes, I don't see why you should need a macro.
The C++ file handling is byte oriented.

> Then one would get a compiler error. Those who try to compile the
> program will know that it depends on a certain bit structure
> being correct, and will need to fix that.

For the char size, you have a macro: CHAR_BITS.
It's easy to write

#include <climits>

#if CHAR_BITS != 8
#error This program only works with 8 bit chars. Sorry.
#endif

>
> Or if the proposal is made so that one has definitions like say
>   #define char byte

The standard already tells you that 1 char == 1 byte.
It just doesn't tell you how large 1 byte is.

>   #define wchar_t Uchar // 32-bit, or uchar = 16-bit.
> then on compilers that do not support "byte", "uchar", or "Uchar", one
> would still have char and wchar_t, and one could use them.

You normally don't need a type with exactly 32 bits.
You need a type with _at least_ 32 bits. And long is such a type.

>
> But one would still want to have macros that can tell about if bytes do
> exist, so that one can write say
>   #if !Uchar_exists
>   #error "This program requires 32-bit Unicode characters".
>   #endif

Hmmm... do you want to check for *bytes*, or for *Uchar*?

But no program needs a 32 bit type to handle UCS-32.
You can just use long. It's guaranteed to be _at least_
32 bits.
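
Combining the two points, the external form must be exactly 32 bits while
the internal type only has to be at least 32 bits wide. A sketch that reads
UCS-4 code points from a big-endian byte stream into an unsigned long,
assuming CHAR_BIT == 8 for the external bytes (file name purely
illustrative):

#include <fstream>
#include <iostream>

// Assemble one 32-bit big-endian code point from four 8-bit bytes.
bool read_ucs4(std::istream& in, unsigned long& code_point)
{
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), 4))
        return false;
    code_point = (static_cast<unsigned long>(bytes[0]) << 24)
               | (static_cast<unsigned long>(bytes[1]) << 16)
               | (static_cast<unsigned long>(bytes[2]) << 8)
               |  static_cast<unsigned long>(bytes[3]);
    return true;
}

int main()
{
    std::ifstream in("text.ucs4", std::ios::binary);
    unsigned long cp;
    while (read_ucs4(in, cp))
        std::cout << "U+" << std::hex << cp << '\n';
    return 0;
}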

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 9 Apr 2001 15:29:50 CST
In article <3ACE162D.2ED8CF21@dollywood.itp.tuwien.ac.at>, Christopher
Eltschka <celtschk@dollywood.itp.tuwien.ac.at> wrote:
>> As for those 9-bit structures, aren't you confusing those with machine "words".
>
>No, the words on those machines are IIRC 36 bits.

Since the personal computers sold now have 128 bits (like the Motorola G4), it
sounds like these are anachronisms.

>> As far as I know, a "byte" is always 8-bit; I only use the wording "8-bit
>> byte" so that one should not confuse it with anything else.
>
>Well, fortunately knowledge can grow ;-)
...
>> The 9-bit word computers I recall a long ago, used the extra bit for error
>> checking. So in effect, those words are 8-bit. But those computers did not
>> run C++; they were far too small.
>>
>> What current computers do you have in your mind?
>
>Those which were mentioned in the C++ groups several
>times; I'd have to do a google (ex-deja) search, but
>I'm currently too lazy for that.

Well, then there is no point in discussing that, as the basis for allowing
knowledge to grow does not seem to exist. :-)

>> Otherwise, it is quite clear that with any bit specific standard, the
>> alignments may go off relative to the machine words, and that is going to
>> be slow.
>
>And that is why the C++ standard doesn't specify bit widths.

This is why the _current_ C++ standard does not discuss that.

The reason the issue was brought up in this thread was to express
wishes for a future C++ standard that addresses that issue.

Right now, C++ provides fast, but useless solutions.

>However on machines with 8-bit bytes, I doubt you'll find one C++
>compiler where char is not 8 bits. Even if the standard doesn't
>demand it.
>Therefore if you are happy with portability to most machines
>(all with 8 bit bytes), you can happily use char *as if* they
>were guaranteed to be 8 bits - just because the market will
>guarantee it (and the market is stronger than _any_ standard).

Yes, this is what has already been mentioned before in this thread.

But it does not work out as easily with Unicode.

So while fixing Unicode, one can just as well fix the byte issue,
paving the way for distributed programming, which I figure the C++
standard will have to address as well. Then one is not only speaking about
fixing bytes, but all binary structures that may be exchanged with other
programs.

>> One simply writes a stream of bits to an IO-stream of some kind. Then one
>> wants to ensure that the bits are exactly the same when they are read by
>> that other computer.
>
>But that's clearly outside of the C++ standard.

Right, that is why we are discussing it here.

>> It's like when you write a file, and port it over the Internet. If you
>> wrote it in English, you wouldn't want it to look like it was written in
>> Chinese at the other end.
>
>Well, that's why there are *independent* standards (like ASCII,
>TCP/IP, ...)

Yes, as there is a Unicode standard, independent of the C++ standard.

The problem with C++, which we are discussing in this thread, is that it
does not provide the suitable hooks onto those other standards.

>.. In practice, the OS (and usually even the hardware)
>takes not individual bits, but full bytes.

In practice, the OS handles words, not bytes.

>> My guess is that on OS's that know how to write files with bytes in them,
>> one would simply write to that format.
>
>Of course. That's what C++ does.

Nope. C++ handles chars, not bytes.

>> On OS's that do not support that, one would have a macro telling that.
>
>Besides the fact that I don't know an OS which doesn't allow
>the I/O of bytes, I don't see why you should need a macro.

You started the discussion by saying that there are OS's which use 9-bit
bytes, and hinting that they are incapable of handling 8-bit bytes.

If there are such OS's, significant in the context of C++ implementations,
one would need such a macro; otherwise not.

>The C++ file handling is byte oriented.

Nope, it is "char" oriented.

>For the char size, you have a macro: CHAR_BITS.

It is CHAR_BIT.

>It's easy to write
>
>#include <climits>
>
>#if CHAR_BITS != 8
>#error This program only works with 8 bit chars. Sorry.
>#endif

So then just use this one. The interesting thing is that my compiler
(Metrowerks CodeWarrior), which has implementations for several OS's, has the
lines
  #ifdef __m56800__
  #define CHAR_BIT            16
  ...
So CHAR_BIT is 16 on some computers, and it would be interesting to know
how that works when one tries to write an 8-bit byte.

>> Or if the proposal is made so that one has definitions like say
>>   #define char byte
>
>The standard already tells you that 1 char == 1 byte.
>It just doesn't tell you how large 1 byte is.

You probably misunderstood what I meant then: With such a type, a byte
would be 8 bits and nothing else.

>>   #define wchar_t Uchar // 32-bit, or uchar = 16-bit.
>> then on compilers that do not support "byte", "uchar", or "Uchar", one
>> would still have char and wchar_t, and one could use them.
>
>You normally don't need a type with exactly 32 bits.
>You need a type with _at least_ 32 bits. And long is such a type.

This is correct for the internal handling of the type _within the
program_, but not when doing IO to a file. Then it needs to be _exactly_
32 bit and nothing else.

>> But one would still want to have macros that can tell about if bytes do
>> exist, so that one can write say
>>   #if !Uchar_exists
>>   #error "This program requires 32-bit Unicode characters".
>>   #endif
>
>Hmmm... do you want to check for *bytes*, or for *Uchar*?

The discussion is originally only about Unicode, and the byte issue is
just a side question, because as you and others point out, char's usually
are 8-bit, and people do not experience any problems with that.

One problem was how to get C++ wchar_t "..." strings to produce correct
Unicode symbols.

The problems pop up when writing a program that should behave exactly the
same on a number of platforms, handling Unicode. -- See some of the
earlier posts in this thread. I was thinking about extensions of TeX.

>But no program needs a 32 bit type to handle UCS-32.
>You can just use long. It's guaranteed to be _at least_
>32 bits.

This is a good suggestion. I guess if one assumes that char's are 8-bit,
one can write a codecvt that converts Unicode characters into longs, and
use that instead of wchar_t.

But it seems still strange that one cannot use any C++ type intended for
characters to handle Unicode characters.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Wed, 28 Mar 2001 12:16:01 CST
Raw View
In article <99r788$m87$1@xmission.xmission.com>, (Rich)
legalize+jeeves@mail.xmission.com wrote:
>>It does not seem impossible to introduce a type binary<n> where n is a
>>positive integer, having exactly n bits. Then IO with the type binary<n>
>>in/out-puts exactly n bits. The implementation can optimize certain sizes,
>>if that is now n = 8 or whatever.
>
>Nothing's stopping you from writing such a class and using it yourself
>if you really think its that important.

The C++ standard is :-), since there is no part of it that gives room
for specifying the underlying binary model. This can only be done for
each compiler via extra knowledge of how it implements things (for
example, by knowing that a char is implemented as an 8-bit byte).

-- The discussion is about providing C++ with suitable hooks for
specifying the underlying binary model, and not really about this or that
class or feature of general convenience. The latter only show up as
different means to attain the first goal.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Matthew Austern <austern@research.att.com>
Date: Wed, 28 Mar 2001 12:16:03 CST
Raw View
Ron Natalie <ron@spamcop.net> writes:

> Rich wrote:
> >
> > Frankly if I needed to support opening files with Unicode names, then
> > I would just handle that myself, which is perfectly in the spirit of
> > the C++ library
>
> Good, explain to me what spirit you are going to use, short of writing
> your own stream_buf which replicates the standard one with everything except
> file open.

I certainly wouldn't rewrite std::streambuf.  I'd be more likely to do
one of these three things:

 - Find out how to map a sequence of chars to a Unicode name.
   (Remember, UTF-8 is a perfectly good Unicode encoding; a sketch of such
   a mapping follows after this list.) This mapping isn't likely to be
   portable to every OS, but then, Unicode filenames aren't portable to
   every OS anyway.
 - Look for an extension in my favorite vendor's library that would
   allow me to use OS-specific methods of file opening.  Most Unix
   implementations of the standard C++ library, for example, will
   allow me to pass an open file descriptor to std::basic_filebuf.
 - If necessary, write my own version of std::basic_filebuf.  It's
   not very difficult; the streambuf interface is, deliberately,
   simple.  std::basic_filebuf is very complicated because of its
   generality, but no one user needs all of that generality.  If
   you don't bother to implement the features you don't need, it'll
   probably just be a couple hundred lines.
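As a hedged sketch of the first option (the function name and the use of
a vector of unsigned long to hold UCS-4 code points are just choices made
for the example, not anything the standard mandates), one can encode the
wide name as UTF-8 and hand the resulting narrow string to the ordinary
constructor; whether the OS then interprets those bytes as intended is
platform-specific:

  #include <fstream>
  #include <string>
  #include <vector>

  // Encode a sequence of UCS-4 code points as UTF-8.  No validation of
  // surrogates or out-of-range values; purely illustrative.
  std::string to_utf8(const std::vector<unsigned long>& ucs4)
  {
      std::string out;
      for (std::vector<unsigned long>::size_type i = 0; i < ucs4.size(); ++i) {
          unsigned long c = ucs4[i];
          if (c < 0x80) {
              out += static_cast<char>(c);
          } else if (c < 0x800) {
              out += static_cast<char>(0xC0 | (c >> 6));
              out += static_cast<char>(0x80 | (c & 0x3F));
          } else if (c < 0x10000) {
              out += static_cast<char>(0xE0 | (c >> 12));
              out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
              out += static_cast<char>(0x80 | (c & 0x3F));
          } else {
              out += static_cast<char>(0xF0 | (c >> 18));
              out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
              out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
              out += static_cast<char>(0x80 | (c & 0x3F));
          }
      }
      return out;
  }

  // Usage: std::ifstream in(to_utf8(name).c_str());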

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt)
Date: Wed, 28 Mar 2001 12:45:07 CST
Raw View
"James Kuyper Jr." <kuyper@wizard.net> wrote:

> Ron Natalie wrote:
> >
> > Michiel Salters wrote:
> >
> > > >  How is that to be handled from within C++ or is the C++ standard
> > > >insufficient with respect to this?
> > >
> > > The C++ standard is insufficient. But since other standards are
> > > sufficient,
> > > C++ shouldn't redo that work.
> >
> > Yes, but unfortunately, the standard parts of C++ are incompatible with
> > any useful extension.  With the exception of a few places where
> > char_traits
> > <wchar_t> is defined as being required, by and large C++ ignores the fact
> > that all the world ain't ASCII.
>
> I'm curious: what aspects of C++ are incompatible with any useful
> extension? C++ doesn't guarantee the presence of anything useful with
> respect to Unicode. However, it would seem to allow a great deal. I don't
> see anything that would prevent an implementation from defining the wide
> character set, or even the narrow character set, as using the 16 or even
> the 32 bit Unicode encodings.

Is there any argument against defining a set of string classes
and stream buffers in the standard
like   utf8_string  utf16_string utf16le_string utf16be_string
       ucs4_string  ...
and have them convert from and to each other seamlessly?
(if possible, throwing an exception or doing a fake translation
otherwise, depending on a user supplied trait class)



--
Volker Apelt          Group of Prof. Dr. Ch. Griesinger
                      Johann Wolfgang Goethe Universitaet
                      Frankfurt am Main (Germany)
no_spam_va@org.chemie.uni-frankfurt.de  (use va@ instead of ...@ )

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Wed, 28 Mar 2001 13:04:35 CST
Raw View

Volker Apelt wrote:
>
>
> Is there any argument against defining a set of string classes
> and stream buffers in the standard
> like   utf8_string  utf16_string utf16le_string utf16be_string
>        ucs4_string  ...
> and have them convert from and to each other seamlessly?
> (if possible, throwing an exception or doing a fake translation
> otherwise, depending on a user supplied trait class)
>
The fact that the character set is ASCII, or UNICODE, or whatever
is an implementation specific thing and the language kind of ignores
it.  If the implementation wants to define those, it can do that.
That's not the problem I am having.

The problem I am having is that there are a few interfaces where
there is exactly one way to pass it string data and that is "char*"
(neglecting the fact that I felt that string would be more
appropriate).  My argument is that if you're going to require wchar_t
overloads for the data, it sure as hell would be a lot easier
to also require them for these interfaces.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Wed, 28 Mar 2001 17:01:55 CST
Raw View
[Please do not mail me a copy of your followup]

remove.haberg@matematik.su.se (Hans Aberg) spake the secret code
<remove.haberg-2803011105440001@sdu43-195.ppp.algonet.se> thusly:

>>Nothing's stopping you from writing such a class and using it yourself
>>if you really think its that important.
>
>The C++ standard is :-), in view of that there is no part of it giving
>room for specifying the underlying binary model. This can only be done for
>each compiler via extra knowledge on how it makes the implementation (for
>example, by knowing that a char is implemented as an 8-bit byte).

The C++ standard isn't stopping you, just making the task tedious and
difficult.  Frankly, the reason the standard is wiggly on this issue
is probably because the task is tedious, difficult and
compiler-dependent.  Would magic wording in the compiler make that
task any less tedious, difficult or compiler-dependent?  No, it just
shifts the burden from you to the poor guy who has to write the
compiler and support infrastructure.
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Wed, 28 Mar 2001 20:00:36 CST
Raw View
Volker Apelt wrote:
>
> "James Kuyper Jr." <kuyper@wizard.net> wrote:
...
> > I'm curious: what aspects of C++ are incompatible with any useful
> > extension? C++ doesn't guarantee the presence of anything useful with
> > respect to Unicode. However, it would seem to allow a great deal. I don't
> > see anything that would prevent an implementation from defining the wide
> > character set, or even the narrow character set, as using the 16 or even
> > the 32 bit Unicode encodings.
>
> Is there any argument against defining a set of string classes
> and stream buffers in the standard
> like   utf8_string  utf16_string utf16le_string utf16be_string
>        ucs4_string  ...
> and have them convert from and to each other seamlessly?

The current standard says nothing to prevent such types from existing
(as long as the names are changed to not interfere with the user's name
space). Therefore, I presume you're talking about changing the standard
to require them, not merely to allow them.

There is the same argument that applies to any requirement that you want
to add to the standard:  it's required. Regardless of what the
requirement is, there's someone somewhere who'd like to implement C++,
but they're certain (and possibly correctly so) that their users will
never care about that requirement. More importantly, once the standard
specifies how something must be done, it prevents exploration of
alternative ways of doing it. I don't get the impression that there is
sufficiently universal agreement about how i18n should be done, to
justify standardizing it just now, at least not at the language level.

I'm all in favor of i18n. However, in 25 years of programming I've never
had to write a single program which needed i18n support. There are niche
markets (and they're arguably HUGE niches, at least here in the U.S.)
where i18n support is completely worthless.

I'm not saying that these arguments should be decisive. I'm just saying
that to support a change like this, you should show that the need is so
wide-spread as to justify forcing people who don't share that need to
put up with the consequences (whatever they might be) of satisfying it.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt)
Date: Thu, 29 Mar 2001 17:08:49 GMT
Raw View
"James Kuyper Jr." <kuyper@wizard.net> wrote:

> Volker Apelt wrote:
> >
> > "James Kuyper Jr." <kuyper@wizard.net> wrote:
> ....
> > > I'm curious: what aspects of C++ are incompatible with any useful
> > > extension? C++ doesn't guarantee the presence of anything useful with
> > > respect to Unicode. However, it would seem to allow a great deal. I don't
> > > see anything that would prevent an implementation from defining the wide
> > > character set, or even the narrow character set, as using the 16 or even
> > > the 32 bit Unicode encodings.
> >
> > Is there any argument against defining a set of string classes
> > and stream buffers in the standard
> > like   utf8_string  utf16_string utf16le_string utf16be_string
> >        ucs4_string  ...
> > and have them convert from and to each other seamlessly?
>=20
> The current standard says nothing to prevent such types from existing
> (as long as the names are changed to not interfere with the user's name
> space). Therefore, I presume you're talking about changing the standard
> to require them, not merely to allow them.

Correct interpretation. And I'd like to see a discussion on the
requirements for a convenient locale/codeset aware string class.

> There is the same argument that applies to any requirement that you want
> to add to the standard:  it's required. Regardless of what the
> requirement is, there's someone somewhere who'd like to implement C++,
> but they're certain (and possibly correctly so) that their users will
> never care about that requirement. More importantly, once the standard
> specifies how something must be done, it prevents exploration of
> alternative ways of doing it. I don't get the impression that there is
> sufficiently universal agreement about how i18n should be done, to
> justify standardizing it just now, at least not at the language level.

No, a standard does not keep you from exploring other alternatives
if the standard way is inconvenient.
Just look at valarray and all the libraries for matrix operations.
It only guarantees there is at least one defined way of doing it
that is available everywhere.

> I'm all in favor of i18n. However, in 25 years of programming I've never
> had to write a single program which needed i18n support. There are niche
> markets (and they're arguably HUGE niches, at least here in the U.S.)
> where i18n support is completely worthless.

<sarcasm>
So, I misunderstood the intention of the C++ standard... it's
supposed to adapt C++ to US habits and stops when the US convenience
is satisfied. ;-)
I've seen those programs fail in the most trivial situations,
like text retrieval programs searching for words with üöä.
</sarcasm>
Maybe it is time to have a set of addenda like in Ada. That
permits the niches to leave the unneeded baggage out.

> I'm not saying that these arguments should be decisive. I'm just saying
> that to support a change like this, you should show that the need is so
> wide-spread as to justify forcing people who don't share that need to
> put up with the consequences (whatever they might be) of satisfying it.

What I was trying to say is that C++ should define a convenient
way to handle locale/codeset dependent strings.

--
Volker Apelt          Group of Prof. Dr. Ch. Griesinger
                      Johann Wolfgang Goethe Universität
                      Frankfurt am Main (Germany)
no_spam_va@org.chemie.uni-frankfurt.de  (use va@ instead of ...@ )

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 29 Mar 2001 17:07:21 GMT
Raw View
In article <99tqb7$jio$1@xmission.xmission.com>, (Rich)
legalize+jeeves@mail.xmission.com wrote:
>>>Nothing's stopping you from writing such a class and using it yourself
>>>if you really think its that important.
>>
>>The C++ standard is :-), in view of that there is no part of it giving
>>room for specifying the underlying binary model. This can only be done for
>>each compiler via extra knowledge on how it makes the implementation (for
>>example, by knowing that a char is implemented as an 8-bit byte).
>
>The C++ standard isn't stopping you, just making the task tedious and
>difficult.

It is stopping one in the sense that it is not _within_ the C++ standard
possible to implement any underlying binary model.

If the C++ compilers one is working with specifies how it implements
objects, it is possible to implement binary objects that way. But this is
still not done within the C++ standard.

From the practical point of view, it makes the code non-portable, which is
the reason one wants to have such hooks in C++.

>  Frankly, the reason the standard is wiggly on this issue
>is probably because the task is tedious, difficult and
>compiler-dependent.  Would magic wording in the compiler make that
>task any less tedious, difficult or compiler-dependent?  No, it just
>shifts the burden from you to the poor guy who has to write the
>compiler and support infrastructure.

This is just a generic argument, which can be used to prevent the addition
of any feature.

Clearly, it is not difficult to specify the underlying binary model once
one is already making a C++ compiler implementation, as without such a
specification the compiler couldn't work.

In addition, there are standards for providing exchange of binary objects
such as CORBA (and COM). Clearly, C++ will have to address the issue of
distributed programming somehow anyway.

So if C++ has some suitable hooks for specifying the underlying binary
model, it should not be difficult to implement it.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt)
Date: Thu, 29 Mar 2001 17:08:33 GMT
Raw View
Ron Natalie <ron@spamcop.net> wrote:
> Volker Apelt wrote:
> > Ron Natalie wrote:
> > > Michiel Salters wrote:
> > > > Hans Aberg says:
> > > > > How is that to be handled from within C++ or is the C++ standard
> > > > > insufficient with respect to this?
> > > >
> > > > The C++ standard is insufficient. But since other standards are
> > > > sufficient, C++ shouldn't redo that work.
> > >
> > > Yes, but unfortunately, the standard parts of C++ are incompatible with
> > > any useful extension.  With the exception of a few places where
> > > char_traits <wchar_t> is defined as being required, by and
> > > large C++ ignores the fact that all the world ain't ASCII.
> > >
> > > I'm curious: what aspects of C++ are incompatible with any useful
> > > extension? C++ doesn't guarantee the presence of anything useful with
> > > respect to Unicode. However, it would seem to allow a great deal. I don't
> > > see anything that would prevent an implementation from defining the wide
> > > character set, or even the narrow character set, as using the 16 or even
> > > the 32 bit Unicode encodings.
> >
> > Is there any argument against defining a set of string classes
> > and stream buffers in the standard
> > like   utf8_string  utf16_string utf16le_string utf16be_string
> >        ucs4_string  ...
> > and have them convert from and to each other seamlessly?
> > (if possible, throwing an exception or doing a fake translation
> > otherwise, depending on a user supplied trait class)
>
> The fact that the character set is ASCII, or UNICODE, or whatever
> is an implementation specific thing and the language kind of ignores
> it.  If the implementation wants to define those, it can do that.
> That's not the problem I am having.

Nothing in the definition of a Turing machine requires the definitions
made in the C++ standard. ... ;-)
Leaving too many things up to the implementation makes it inconvenient
for anything but the mainstream.

> The problem I am having is that there are a few interfaces where
> there is exactly one way to pass it string data and that is "char*"
> (neglecting the fact that I felt that string would be more
> appropriate).  My argument is that if you're going to require wchar_t
> overloads for the data, it sure as hell would be a lot easier
> to also require them for these interfaces.

Agreed, inventing and defining wchar_t and string and wstring
and  not defining the matching interfaces for stream constructors
and others makes no sense to me.
But the same argument applies to locale/codeset aware strings.
std::string already knows about locale and codeset. Why is there no iterator
in string that is guaranteed to stop at codepoints instead of
raw bytes (char) or raw wchar_t?  std::string should have been
named std::raw_string.

--
Volker Apelt          Group of Prof. Dr. Ch. Griesinger
                      Johann Wolfgang Goethe Universitaet
                      Frankfurt am Main (Germany)
no_spam_va@org.chemie.uni-frankfurt.de  (use va@ instead of ...@ )

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Thu, 29 Mar 2001 17:32:00 GMT
Raw View
[Please do not mail me a copy of your followup]

remove.haberg@matematik.su.se (Hans Aberg) spake the secret code
<remove.haberg-2903011116240001@du131-226.ppp.su-anst.tninet.se> thusly:

>>The C++ standard isn't stopping you, just making the task tedious and
>>difficult.
>
>It is stopping one in the sense that it is not _within_ the C++ standard
>possible to implement any underlying binary model.

You lost me there.  I'm not even sure what you mean by "binary model".
You're talking about treating a bag of bits as a binary type for I/O,
right?  What's wrong with the read and write members of streams?
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 29 Mar 2001 18:58:49 GMT
Raw View
In article <3AC29793.B621ABB5@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>> Is there any argument against defining a set of string classes
>> and stream buffers in the standard
>> like   utf8_string  utf16_string utf16le_string utf16be_string
>>        ucs4_string  ...
>> and have them convert from and to each other seamlessly?
...
>There is the same argument that applies to any requirement that you want
>to add to the standard:  it's required. ... once the standard
>specifies how something must be done, it prevents exploration of
>alternative ways of doing it.

There is an easy way around this, namely to have macro names that
indicate whether certain features are implemented.

Any multi-platform C++ source code can then check for what is available.

There are already such features: one can check for certain specifics of
how floats are implemented, etc.
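For example, a program that depends on particular float characteristics
can already refuse to build when they don't hold (this specific check is
just an illustration of the existing mechanism, not part of any proposal):

  #include <cfloat>

  #if FLT_RADIX != 2 || DBL_MANT_DIG < 53
  #error "This program assumes binary doubles with at least 53 mantissa bits."
  #endif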

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Thu, 29 Mar 2001 18:58:49 GMT
Raw View

"James Kuyper Jr." wrote:

>
> The current standard says nothing to prevent such types from existing
> (as long as the names are changed to not interfere with the user's name
> space). Therefore, I presume you're talking about changing the standard
> to require them, not merely to allow them.

The implementation can easily do that by prepending __ to them, such
as Microsoft's __int64 type.

>  I don't get the impression that there is
> sufficiently universal agreement about how i18n should be done, to
> justify standardizing it just now, at least not at the language level.

But that doesn't mean finishing the minimal implementation provided for
by the existing C++ standard shouldn't be done.  By your argument, they
shouldn't have bothered with it at all.

> I'm all in favor of i18n. However, in 25 years of programming I've never
> had to write a single program which needed i18n support. There are niche
> markets (and they're arguably HUGE niches, at least here in the U.S.)
> where i18n support is completely worthless.

Well, for 25 years of programming I never cared either, as a matter of fact
I worked for years in the military field and sending our stuff to a foreign
country wasn't even a possibility...until we got an alliance with the Japanese...

> I'm not saying that these arguments should be decisive. I'm just saying
> that to support a change like this, you should show that the need is so
> wide-spread as to justify forcing people who don't share that need to
> put up with the consequences (whatever they might be) of satisfying it.
>
What consequences?  All I'm asking for is the minimal wchar_t to be used
consistently throughout the standard.  As it is, it's by and large USELESS
on a true UNICODE (fixed size) machine.  The people working on the implementation
in the C and C++ standard seem to have been MBCS oriented.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Fri, 30 Mar 2001 02:03:43 GMT
Raw View
Ron Natalie wrote:
>
> "James Kuyper Jr." wrote:
...
> >  I don't get the impression that there is
> > sufficiently universal agreement about how i18n should be done, to
> > justify standardizing it just now, at least not at the language level.
>
> But that doesn't mean finishing the minimal implementation provided for
> by the existing C++ standard shouldn't be done.  By your argument, they
> shouldn't have bothered with it at all.

I've heard i18n specialists say precisely that - that C/C++ should never
have standardized wchar_t and the associated library functions, but
should have waited until a cleaner approach to the problem had evolved.

> > I'm all in favor of i18n. However, in 25 years of programming I've never
> > had to write a single program which needed i18n support. There are niche
> > markets (and they're arguably HUGE niches, at least here in the U.S.)
> > where i18n support is completely worthless.
>
> Well, for 25 years of programming I never cared either, as a matter of fact
> I worked for years in the military field and sending our stuff to a foreign
> country wasn't even a possibility...until we got an alliance with the Japanese...

It's a matter of what kind of application you write. The code I write is
primarily for bulk processing of satellite data (1 terabyte a day). The
only text strings that are related to the code I write are the error
messages and the file metadata. My code runs in just a few places, all
in the US, and everyone who needs to look at those error messages is a
US resident. The product files are used by scientists from around the
world, but they all seem to accept NASA's decision that the file
metadata must be all in English. Most of them can read English, at least
well enough to understand the metadata.
That's also not likely to change any time soon. The error messages and
file metadata are filtered through third-party libraries with char*
interfaces. They'd need a major re-design to handle wchar_t. I think
those libraries should have (optional) wchar_t interfaces, but I've no
influence over those decisions.

> > I'm not saying that these arguments should be decisive. I'm just saying
> > that to support a change like this, you should show that the need is so
> > wide-spread as to justify forcing people who don't share that need to
> > put up with the consequences (whatever they might be) of satisfying it.
> >
> What consequences?  All I'm asking for is the minimal wchar_t to be used

It's impossible to mandate unicode support without requiring the
addition of code to the standard library to implement that support. That
code requires time to develop, adding to both the costs and the delays
in producing new implementations. That code uses up space in the library
and (when not using shared libraries) in the executable. Any seamless
integration with the C++ standard I/O library functions would mean that
everyone who does I/O would pay at least part of the price of being able
to process Unicode, even if they have no need to process Unicode.

I'm not saying that the price isn't worth paying; I'm saying that
supporters of an alternative need to specify that alternative, and to
provide data in support of an argument that the price IS worth paying.

> consistently throughout the standard.  As it is, it's by and large USELESS
> on a true UNICODE (fixed size) machine.  The people working on the implementation
> in the C and C++ standard seem to have been MBCS oriented.

[Note: the term "MBCS" appears nowhere in the C++ standard. The acronym
it uses is "NTMBS".]

Could you explain what makes it useless? That's what I don't understand.
I could believe "inconvenient", but "USELESS" seems to be an
exaggeration. C++ has been written to interpret char[] as NTMBS in many
contexts, because that decision is backward compatible with a lot of
legacy code that treats char[] as holding a null-terminated ASCII
string.  That's not a matter of favoring NTMBS, it's simply facing
practical realities. C++ allows for wchar_t, and gives implementors the
freedom to make it big enough to hold fixed-size Unicode, and to provide
locale(s) where it is interpreted as such by the library's string
functions. Every aspect of the standard templates that involves strings
is provided with wchar_t specializations. That's hardly the most
convenient way imaginable, but neither is it useless.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Fri, 30 Mar 2001 19:32:51 GMT
Raw View

"James Kuyper Jr." wrote:
>
>
> > > I'm not saying that these arguments should be decisive. I'm just saying
> > > that to support a change like this, you should show that the need is so
> > > wide-spread as to justify forcing people who don't share that need to
> > > put up with the consequences (whatever they might be) of satisfying it.
> > >
> > What consequences?  All I'm asking for is the minimal wchar_t to be used
>
> It's impossible to mandate unicode support without requiring the
> addition of code to the standard library to implement that support.

The issue isn't UNICODE at all.  I'm not mandating UNICODE support,
I'm just mandating consistency in the wchar_t usage so that an implementation
has the option to say, OK wchar_t can hold a UNICODE character.

> That
> code requires time to develop, adding to both the costs and the delays
> in producing new implementations. That code uses up space in the library
> and (when not using shared libraries) in the executable. Any seamless
> integration with the C++ standard I/O library functions would mean that
> everyone who does I/O would pay at least part of the price of being able
> to process Unicode, even if they have no need to process Unicode.
>

Then they might as well have gotten rid of all wchar_t by your argument
(of course, never writing code that needs other than ASCII, you're a bit
prejudiced against it anyhow).  By your argument, half of the standard
library ought not to be there.  Too much work to get right for the people
who aren't going to use it anyhow.

The problem is, what is the point of putting in a half assed wchar_t
concept if it isn't going to be carried through the entire library?
The whole thing is ill-conceived from the days when just allowing all
256 values in a char sufficed for the Europeans and a half a dozen
MBCS systems solved the CJKV issues.

If you're going to promulgate wchar_t (a fixed size character type
"big enough") as a solution, then you ought to at least allow all
the interfaces that might need it (program arguments, filenames, etc...)
to use it.
>
> [Note: the term "MBCS" appears nowhere in the C++ standard. The acronym
> it uses is "NTMBS".]

NTMBS stands for "NULL TERMINATED MULTIBYTE STRING".  It's not synonymous
with MBCS: Multibyte Character Set.  They are not interchangeable terms,
any more than wchar_t is synonymous with NTWCS.
>
> Could you explain what makes it useless? That's what I don't understand.
> I could believe "inconvenient", but "USELESS" seems to be an
> exaggeration. C++ has been written to interpret char[] as NTMBS in many
> contexts, because that decision is backward compatible with a lot of
> legacy code that treats char[] as holding a null-terminated ASCII
> string.  That's not a matter of favoring NTMBS,

All this presumes that there is an NTMBS string existent in the implementation
that can stand for any NTWCS (string made up of wchar_t).   This is the
problem I am having.  That's a completely stupid assumption, and once
you get your head out of legacy UNIX implementations you find that it
is not true.  There should be wchar_t interfaces (overloads, etc...) in
the library places where NTMBS crutches are now.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Fri, 30 Mar 2001 19:57:28 GMT
Raw View
Volker Apelt wrote:
>
> "James Kuyper Jr." <kuyper@wizard.net> wrote:
...
> > I'm all in favor of i18n. However, in 25 years of programming I've never
> > had to write a single program which needed i18n support. There are niche
> > markets (and they're arguably HUGE niches, at least here in the U.S.)
> > where i18n support is completely worthless.
>
> <sarcasm>
> So, I misunderstood the intention of the C++ standard... it's
> supposed to adapt C++ to US habits and stops when the US convenience
> is satisfied. ;-)

That's not what I meant. I don't think that anything should be frozen
into a standard unless it is widely needed; a criterion that applies
equally well whether the people who don't need it are in the US or in
India or anywhere else. And before you point it out, I agree that the
current standard contains a number of features that don't meet this
criterion - for some odd reason, I wasn't consulted. :-)

> I've seen those programs fail in the most trivial situations,
> like text retrieval programs searching for words with üöä.

Certainly; any program may fail when applied to data outside its
designed domain, and any program may have been incorrectly designed for
a domain that is more restrictive than the one it's actually needed for.
That doesn't mean that there aren't programs which are perfectly
correctly designed for a more restricted domain than others.

...
> > I'm not saying that these arguments should be decisive. I'm just saying
> > that to support a change like this, you should show that the need is so
> > wide-spread as to justify forcing people who don't share that need to
> > put up with the consequences (whatever they might be) of satisfying it.
>
> What I was trying to say is that C++ should define a convenient
> way to handle locale/codeset dependent strings.

You said "a convenient way". But which way? If the time were ripe for
standardization on this issue, then there would be little question about
it - the "convenient way" would have a particular well-known name, and
there would be widespread agreement that it was the right way to do it.
Standardizing before that consensus has been reached is premature.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Fri, 30 Mar 2001 19:57:16 GMT
Raw View
In article <99vr75$can$1@xmission.xmission.com>, (Rich)
legalize+jeeves@mail.xmission.com wrote:
>>It is stopping one in the sense that it is not _within_ the C++ standard
>>possible to implement any underlying binary model.
>
>You lost me there.  I'm not even sure what you mean by "binary model".
>You're talking about treating a bag of bits as a binary type for I/O,
>right?  What's wrong with the read and write members of streams?

The C++ standard is written throughout so that one cannot specify what the
actual underlying bits should be. It's not only what a char should be:
bit-fields and the like can be aligned as the compiler writer finds
useful.

For example, from the view of the C++ standard, one can output a "char"
but not an 8-bit byte, because there is no way to specify a structure with
8-bit alignment. Only if one knows that one's compiler implements a char as
an 8-bit byte can one output such a byte. But it is perfectly legal to
implement a "char" as something other than an 8-bit byte.

So it means that if one wants to output a binary structure, which should
then be read by, say, the same program compiled with another compiler,
there is no guarantee that the program will work, that is, from the point
of view of the C++ standard.

So at least somewhere, one would want to have the ability to specify that
one produces a binary structure down to the very bits, which can be used
in communication with others.
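To make the concern concrete, here is a small invented fragment; writing
the object's raw representation bakes the compiler's choices into the
file, which is exactly what cannot be relied upon across implementations:

  #include <ostream>

  struct Record { long id; short flag; };

  // Non-portable: sizeof(long), padding between the members and the byte
  // order all vary between implementations, so the same program built with
  // another compiler may be unable to read this back.
  void write_raw(std::ostream& os, const Record& r)
  {
      os.write(reinterpret_cast<const char*>(&r), sizeof r);
  }

A portable program has to serialize field by field into some agreed
external format instead, which is the kind of specification the standard
currently leaves entirely outside the language.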

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Fri, 30 Mar 2001 21:41:48 GMT
Raw View
[Please do not mail me a copy of your followup]

Ron Natalie <ron@spamcop.net> spake the secret code
<3AC4AF4D.4A9B76AD@spamcop.net> thusly:

>The problem is, what is the point of putting in a half assed wchar_t
>concept if it isn't going to be carried through the entire library?

OK, clearly this is really, really bugging you.  Why don't you write
up a proposal to the standards committee to get it fixed?  Just
complaining about it here isn't going to fix the situation.
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Sat, 31 Mar 2001 01:30:18 GMT
Raw View
Ron Natalie wrote:
>
> "James Kuyper Jr." wrote:
...
> I'm just mandating consistency in the wchar_t usage so that an implementation
> has the option to say, OK wchar_t can hold a UNICODE character.

Done: the language already allows that. What it doesn't do is mandate
it. The failure to mandate it allows implementors whose customers have
no need for internationalization to use 'char' as wchar_t (possibly as a
compiler option).

> > That
> > code requires time to develop, adding to both the costs and the delays
> > in producing new implementations. That code uses up space in the library
> > and (when not using shared libraries) in the executable. Any seamless
> > integration with the C++ standard I/O library functions would mean that
> > everyone who does I/O would pay at least part of the price of being able
> > to process Unicode, even if they have no need to process Unicode.
> >
>
> Then they might as well have gotten rid of all wchar_t by your argument

Exactly. Good work! Of course, I'm not saying that there shouldn't be a
wchar_t, merely that wchar_t and the associated functions should
have remained unstandardized for a while longer.

> (of course never writing code that needs other than ASCII you're a bit
> prejudiced against it anyhow). ...

I'm not prejudiced against it; I want effective, easy to use
internationalization. I am, to the disgust of some of my more
chauvinistic friends, a one-worlder. I'm just skeptical about the
current C/C++ approach. My skepticism isn't even first-hand (since I
have no experience with it myself) - I'm merely accepting as valid the
skepticism of others who have much more experience in the matter.

> ...  By your argument, half of the standard
> library ought not to be there. ...

That's exactly what I've heard some experts say about the wide character
functions - that they were the wrong approach, and shouldn't be there.
As long as there seems to be disagreement between experts as to the
right approach, the one thing I'm certain of is that it's not ready for
standardization.

> > [Note: the term "MBCS" appears nowhere in the C++ standard. The acronym
> > it uses is "NTMBS".]
>
> NTMBS stands for "NULL TERMINATED MULTIBYTE STRING"  It's not synonymous
> with MBCS: Multibyte Character Set.  They are not interchangable terms
> any more that wchar_t is synonymous with NTWCS

I'm just saying that if the standard is to be accused of bias toward
MBCS, you'd think they would use the term at least once.

...
> All this presumes that there is an NTMBS string existent in the implementation
> that can stand for any NTWCS (string made up of wchar_t).   This is the

If there isn't, that's because the decision was made (if only by failing
to decide) to allow it to be so. There's nothing in the C++ standard
that prevents an implementation from making that assumption valid. In
fact, the existence of the conversion utilities creates the naive
expectation that it's required to be valid (at least, until you read the
fine print).

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt)
Date: Sun, 1 Apr 2001 12:09:12 GMT
Raw View
"James Kuyper Jr." <kuyper@wizard.net> wrote:

> Volker Apelt wrote:
> > What I was trying to say is, that c++ should define a convenient
> > way to handle locale/codeset dependent strings.
>
> You said "a convenient way". But which way?
[ snip -- convenience is not agreed yet for encodings ]
> Standardizing before that consensus has been reached is premature.

Then let's find out what "convenience" means to us. Just one example:
string and wstring are prepared to handle fixed size character
encodings only.
What about multibyte encodings and strings (let's say UTF-8)?
The smallest code unit is 8 bits wide, so one could choose std::string
as the base of the implementation. But the iterators are not prepared
for multibyte sequences, and all algorithms are based on them.

So, convenience to me means to have an iterator that knows about
the current encoding of the string it belongs to and is guaranteed
to stop at codepoints instead of raw bytes (char) or raw wchar_t.
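Just to make the request concrete, here is roughly the operation such an
iterator would have to wrap; a deliberately crude sketch with no
validation, and not a proposed interface:

  #include <string>

  // Decode the UTF-8 sequence starting at position i of s and advance i
  // past it.  Malformed input is not diagnosed.
  unsigned long next_code_point(const std::string& s, std::string::size_type& i)
  {
      unsigned char b = s[i++];
      if (b < 0x80)
          return b;                                   // single-byte sequence
      int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
      unsigned long cp = b & (0x3F >> extra);         // payload bits of the lead byte
      while (extra-- > 0 && i < s.size())
          cp = (cp << 6) | (s[i++] & 0x3F);           // six payload bits per trail byte
      return cp;
  }

A codepoint-aware iterator would essentially call this in its operator++.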


--
Volker Apelt          Group of Prof. Dr. Ch. Griesinger
                      Johann Wolfgang Goethe Universitaet
                      Frankfurt am Main (Germany)
no_spam_va@org.chemie.uni-frankfurt.de  (use va@ instead of ...@ )

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Sun, 1 Apr 2001 22:36:52 GMT
Raw View
In article <WZoWxsroeXVk-pn2-1ipi3kMhpb4x@APELT-PC.RZ.UNI-FRANKFURT.DE>,
no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt) wrote:

>"James Kuyper Jr." <kuyper@wizard.net> wrote:
...
>> You said "a convenient way". But which way?
>[ snip -- convenience is not agreed yet for encodings ]
>> Standardizing before that consensus has been reached is premature.
>
>Then let's find out what "convenience" means to us. Just one example.
>string and wstring are prepared to handle fixed size character
>encondings only.
>What about multibyte encodings and strings? (let's say UTF-8)
...
>So, convenience to me means to have an iterator that knows about
>the current encoding of the string it belongs to and is guaranteed
>to stop at codepoints instead of raw bytes (char) or raw wchar_t.

My own hunch is that variable-width characters are too complicated, and too
inefficient in terms of speed, for there to be any point in having them in
the C++ standard.

So if there should be any Unicode support, it should be for fixed size 32
and 16 bits. If one encounters encodings such as UTF-8 or UTF-16, convert
to an internal fixed width at input/output.

-- It will be simplest for any program to only treat one internal
character type.
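As an illustration of converting at the boundary (the function name and
the choice of unsigned short / unsigned long containers are assumptions
made for the sketch), a UTF-16 sequence with surrogate pairs can be
widened to fixed-size values like this:

  #include <vector>

  std::vector<unsigned long> utf16_to_ucs4(const std::vector<unsigned short>& in)
  {
      std::vector<unsigned long> out;
      for (std::vector<unsigned short>::size_type i = 0; i < in.size(); ++i) {
          unsigned long u = in[i];
          if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()
              && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
              // Combine a high/low surrogate pair into one code point.
              u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
              ++i;
          }
          out.push_back(u);
      }
      return out;
  }

The rest of the program then only ever sees one internal character type.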

As for the general discussion, there are two opposing schools, it seems:
the traditional C++ view, where everything is defined so that it is up to
the compiler implementation to decide how the binary bits should be
laid out, and the opposing view, which wants binary specifications
within C++ to be enabled.

The advantages of the traditional C++ view are that it allows for various
optimizations, and that it becomes easier for a compiler to conform to the
C++ standard. For example, if it is faster, a compiler could define a char
to have 32 bits, or if only ASCII characters are deemed necessary for a
compiler, it could define wchar_t to be, say, 7 bits.

The drawback with this view is of course when programs exchanging binary
data are compiled with different compilers, which could be the same
program compiled on several different platforms.

It seems to me inevitable to allow some kind of binary definitions
into C++. -- This can be done without making compiler implementation more
difficult by introducing macro names indicating whether the feature is
present.

This way one gets a hybrid, which largely keeps the old C++ strategy of
not specifying the underlying binary structure, but allows one to specify
the underlying binary structure in some specific cases. It is then possible
for the program to check whether this feature is present or not.

Clearly, in the cases where one does want to specify the underlying binary
structure, "performance" is a question of make or break: if the underlying
binary structures aren't correct, the program will break. So if the
feature is not present on the compiler, that is good to know, as
it means one does not end up with a program that appears to compile
correctly but will break when run.

So, as for "char" and "wchar_t", I think one will need to keep them as
they are: C++ compilers can implement them as they like. Then introduce
types of 1, 2, and 4 bytes which can be used with Unicode. There will be
some macros present telling whether these are available. Further, a compiler
could define char and wchar_t to be one of those if they are available,
and one would be able to know that this is so via some macros.

Then this addition will not bother those who never use anything other than
7-bit ASCII letters, because they do not need to change their compilers;
but it will also solve the problem for those who write a multilingual
program that will be compiled on many different platforms.
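A rough sketch of the kind of scheme being proposed; every name and macro
below is invented for illustration, and none of them are standard:

  #include <climits>

  #if CHAR_BIT == 8
    typedef unsigned char  uchar8;    // exactly 8 bits
    #define HAS_UCHAR8  1
  #else
    #define HAS_UCHAR8  0
  #endif

  #if USHRT_MAX == 0xFFFF
    typedef unsigned short uchar16;   // exactly 16 bits
    #define HAS_UCHAR16 1
  #else
    #define HAS_UCHAR16 0
  #endif

  #if UINT_MAX == 0xFFFFFFFF
    typedef unsigned int   uchar32;   // exactly 32 bits
    #define HAS_UCHAR32 1
  #elif ULONG_MAX == 0xFFFFFFFF
    typedef unsigned long  uchar32;
    #define HAS_UCHAR32 1
  #else
    #define HAS_UCHAR32 0
  #endif

  #if !HAS_UCHAR32
  #error "This program requires a 32-bit Unicode character type."
  #endif

A program that needs exact widths can then test the HAS_* macros, much as
the earlier Uchar_exists suggestion did.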

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 1 Apr 2001 23:14:23 GMT
Raw View
no_spam_va@org.chemie.uni-frankfurt.de (Volker Apelt) writes:

|>  <sarcasm>
|>  So, I misunderstood the intention of the C++ standard... it's
|>  supposed to adapt C++ to US habits and stops when the US convenience
|>  is satisfied. ;-)
|>  I've seen those programs fail in the most trivial situations,
|>  like text retrieval programs searching for words with üöä.
|>  </sarcasm>

One of my favorite tests is to pass the program a list of towns in the
Paris suburbs.  You'd be surprised how many programs stop at
Le-Haÿ-des-Roses.

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Pete Becker <petebecker@acm.org>
Date: Sun, 1 Apr 2001 23:14:02 GMT
Raw View
Volker Apelt wrote:
>
> So, convenience to me means to have an iterator that knows about
> the current encoding of the string it belongs to and is guaranteed
> to stop at codepoints instead of raw bytes (char) or raw wchar_t.
>

You can certainly write your own class to do this, but most of the world
is moving away from multibyte character manipulation inside programs,
because it is much easier to translate once from multibyte to wide
characters on input and translate once again on output than to translate
on the fly every time you need to do character manipulation.

--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 1 Apr 2001 23:14:29 GMT
Raw View
"James Kuyper Jr." <kuyper@wizard.net> writes:

|>  I'm all in favor of i18n. However, in 25 years of programming I've
|>  never had to write a single program which needed i18n support. There
|>  are niche markets (and they're arguably HUGE niches, at least here
|>  in the U.S.)  where i18n support is completely worthless.

There are some pretty important markets in most countries where
internationalization, if the word is taken literally, isn't relevant.
One of the things that pisses me off is that even when I'm writing for a
local market, I need something that the standard calls
"internationalization" (instead of, say, "unAmericanization").

Frankly, for a lot of my markets, just requiring that char (and not just
unsigned char) could handle up to 255 would go a long way to making life
simpler.

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@stop.mail-abuse.org>
Date: Mon, 2 Apr 2001 11:51:39 CST
Raw View

"James Kuyper Jr." wrote:
>
> Ron Natalie wrote:
> >
> > "James Kuyper Jr." wrote:
> ...
> > I'm just mandating consistency in the wchar_t usage so that an implementation
> > has the option to say, OK wchar_t can hold a UNICODE character.
>
> Done: the language already allows that. What it doesn't do is mandate
> it. The failure to mandate it allows implementors whose customers have
> no need for internationalization to use 'char' as wchar_t (possibly as a
> compiler option).

I don't think you understand me.  The language doesn't allow it.  It starts
to do that by defining wchar_t and making limited use of it in some stream_bufs
and basic_string, but there are still very important interfaces which cannot
take things composed out of wchar_t's.

> >
> > NTMBS stands for "NULL TERMINATED MULTIBYTE STRING"  It's not synonymous
> > with MBCS: Multibyte Character Set.  They are not interchangable terms
> > any more that wchar_t is synonymous with NTWCS
>
> I'm just saying that if the standard is to be accused of bias toward
> MBCS, you'd think they would use the term at least once.

There are words in the English language that must be used that don't appear
in the standard.  The standard is heavily biased towards MBCS, and the fact
that NTMBS strings are the only interface to many functions is just proof
of that.  Of course, the bias is that when you say NTMBS what you are really
saying is that the implementation is stuck trying to figure out how to make
a non-US oriented system work with char* as the sole string type.

>
> ...
> > All this presumes that there is a NTMBS string existant in the implementation
> > that can stand for any NTWCS (string made up of wchar_t).   This is the
>
> If there isn't, that's because the decision was made (if only by failing
> to decide) to allow it to be so. There's nothing in the C++ standard
> that prevents an implementation from making that assumption valid. In
> fact, the existence of the conversion utilities creates the naive
> expectation that it's required to be valid (at least, until you read the
> fine print).
>
Huh?   Yes there is.  If there is no such thing as a Wide string to Multibyte
string conversion you:

1.  Can't deal with files, for fstream won't work.
2.  Can't pass in arguments, because main doesn't have any provision to work,
etc...

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@sensor.com>
Date: Fri, 16 Mar 2001 11:02:26 GMT
Raw View

Hans Aberg wrote:

>   I wonder how this should work together with Unicode, which can be 16-bit
> or 32-bit*); from what I understand one can have a mixture of U16 & U32
> files on the same OS (operating system).

The C++ standard doesn't know Unicode from a hole in the ground.  wchar_t
is just a larger FIXED char size.

You'll just have to hope your implementation can deal with Unicode in
one of its fixed sizes (16 or 32).

The standard library is pretty useless with regard to internationalization
and Unicode.  While wchar_t's exist, they are barely adequate: none of the
useful system interfaces (like filenames or program arguments) are defined
for them in the language.

>
>   How is that to be handled from within C++ or is the C++ standard
> insufficient with respect to this?

All the implementations I've dealt with use the 16-bit encoding for
Unicode (i.e. wchar_t is unsigned short).

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Thu, 15 Mar 2001 23:20:07 GMT
Raw View
The C++ standard 3.9.1#5 says:
  Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.1.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.9) as one of the other
integral types, called its underlying type.

  I wonder how this should work together with Unicode, which can be 16-bit
or 32-bit*); from what I understand one can have a mixture of U16 & U32
files on the same OS (operating system).

  How is that to be handled from within C++ or is the C++ standard
insufficient with respect to this?

-- It appears to me that one would need a uchar type which is 16 bits and
a Uchar type which is 32 bits, and nothing else.

Or is the OS supposed to translate all files when opened as wfstreams, so
that they appear to be U32 (i.e. a wchar_t with at least 32 bits) from
within C++?

*) Or 31 bits, if one should be finicky. (Unicode actually only uses the
twenty-one lowest-order bits, but I think 31 bits are reserved for future and
private use.)

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Sat, 17 Mar 2001 10:39:47 GMT
Raw View
In article <3AB2C76E.AA9DC137@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>...I don't
>see anything that would prevent a [C++ compiler] implementation from
defining the wide
>character set, or even the narrow character set, as using the 16 or even
>the 32 bit Unicode encodings.

If there only were one Unicode, there would be no problem, as one could
define wchar_t to that. But there are 16- and 32-bit Unicode.

There is no problem in letting wchar_t be 32 bits, and using only that
internally, as 16-bit Unicode will always agree with the first 65536
characters of 32-bit Unicode. In this approach the OS/compiler library
would make the translation transparent. (It is probably best to have an OS
with file-specific information about the input encoding, because there are
a lot of 8-bit encodings that would require special mappings to Unicode.)

But one may still want to write a 16-bit Unicode file. The question is how
to ensure that from within a C++ program, if wchar_t is 32-bit.

The problem is the same as with C++ in other respects: sometimes one does
want to specify the underlying binary information, which is especially
true when reading and writing binary structures that are part of the
communication with other programs compiled with a different compiler.

Under such circumstances, C++ leaves you out in the cold.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Sat, 17 Mar 2001 10:40:19 GMT
Raw View
In article <slrn9b4s42.g58.qrczak@qrnik.zagroda>, qrczak@knm.org.pl
(Marcin 'Qrczak' Kowalczyk) wrote:
>Fri, 16 Mar 2001 19:43:25 GMT, Andrea Ferro <AndreaF@UrkaDVD.it> pisze:
>> And specifically if the implementation is to support all your
>> cases then wchar_t will be a signed or unsigned 32 bit integral
>> (that may be int, unsigned, long or unsigned long depending on
>> the implementation.

>No, in C++ wchar_t is a keyword which denotes a distinct type.
>In C wchar_t may be e.g. a typedef for int.

You are both right: according to 3.9.1#5, wchar_t is a distinct type,
but at the same time it says that it must have the same size, signedness,
and alignment requirements as one of the other integral types (called its
underlying type).

So even though wchar_t is not itself a signed or unsigned 32-bit integral
type, one must exist with the same properties.

But strictly speaking the implementation is allowed to use any number of
bits >= 31, as it is only required that all (hypothetical) Unicode
characters fit, plus that there be an underlying integral type with
identical properties.

So if short is 16 bits and int is 64 bits, wchar_t must be 64 bits
according to 3.9.1#5.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Sat, 17 Mar 2001 14:23:26 GMT
Raw View
Hans Aberg wrote:
>
> In article <3AB2C76E.AA9DC137@wizard.net>, "James Kuyper Jr."
> <kuyper@wizard.net> wrote:
> >...I don't
> >see anything that would prevent a [C++ compiler] implementation from
> defining the wide
> >character set, or even the narrow character set, as using the 16 or even
> >the 32 bit Unicode encodings.
>
> If there only were one Unicode, there would be no problem, as one could
> define wchar_t to that. But there are 16- and 32-bit Unicode.

So use 16 bit Unicode for char, and 32 bit Unicode for wchar_t. That
seems a fairly obvious mapping.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: kuehl@ramsen.informatik.uni-konstanz.de (Dietmar Kuehl)
Date: Sat, 17 Mar 2001 21:55:16 GMT
Raw View
Hi,
Hans Aberg (remove.haberg@matematik.su.se) wrote:
: There is no problem in letting wchar_t be 32 bits, and use only that
: internally as the Unicode 16 bits will always agree with the first 65536
: characters of 32 bit Unicode. In this approach the OS/compiler library
: would make the translation transparent.

There is a mechanism in the standard C++ library which makes this
translation transparent, namely the 'std::codecvt' facet: the standard
C++ library assumes that the underlying OS supports a mechanism to
read/write individual characters of type 'char', which are then
converted using an appropriate 'std::codecvt' object to the stream's
character type. How this conversion is done and what conversions are to
be supported (except a trivial conversion of 'char' to/from 'char' doing
nothing and one conversion which somehow converts 'char' to/from
'wchar_t') is not defined, however. On the other hand, a user can
implement and register his/her favorite conversion.
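
As a minimal usage sketch (the facet class below is only a placeholder I
made up; a real one would override do_out()/do_in() to produce the desired
external encoding):

    #include <cstddef>
    #include <cwchar>
    #include <fstream>
    #include <locale>

    // Placeholder facet: as written it just inherits the implementation's
    // default char <-> wchar_t conversion.
    class my_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
    {
    public:
        explicit my_codecvt(std::size_t refs = 0)
            : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
    };

    void write_wide(const wchar_t* text)
    {
        std::wofstream out;
        // Replace the stream's codecvt<wchar_t, char> facet before opening;
        // the file buffer then converts through it on every write.
        out.imbue(std::locale(out.getloc(), new my_codecvt));
        out.open("example.txt", std::ios::binary);
        out << text;
    }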
--
<mailto:dietmar_kuehl@yahoo.com> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 18 Mar 2001 12:32:49 CST
Raw View
Ron Natalie <ron@sensor.com> writes:

|>  >   How is that to be handled from within C++ or is the C++ standard
|>  > insufficient with respect to this?

|>  All the implementations I've dealt with use the 16 bit encoding for
|>  unicode (i.e. wchar_t is unsigned short).

Both g++ under Linux and Sun CC under Solaris use 32 bits (presumably
ISO 10646).

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 18 Mar 2001 18:32:24 GMT
Raw View
remove.haberg@matematik.su.se (Hans Aberg) writes:

    [...]
|>  Also, from the point of speed, there appears to be no particular
|>  gain in using less than 32 bits internally. For example, the latest
|>  Mac powerbook (G4 CPU) uses 128-bit words which are vectored into
|>  32-bit words. There appears to be no gain in speed in using 16-bit
|>  characters over 32-bit characters.

The wider the characters, the fewer fit in the cache, and the more often
you will get cache misses.  (But I really doubt that there are many
programs where this would make a difference.)

|>  So what one would expect is that there will be a mixture of 32-, 16-,
|>  and 8-bit files on the computer operating system.

|>  The problem is perhaps not reading those files, because then one
|>  could use a 32-bit wchar, and translate whatever is read into that
|>  format. But then one would need to write a file in a particular
|>  format, which could be 32, 16, or 8 bits. Then there appears to be
|>  at least one type missing in C++ (i.e., one needs at least two
|>  different wchar types).

Not really.  C++ *always* reads and writes "char" (which are typically 8
bits).  Files with 16 or 32 bit characters require a codecvt which
treats the file as if it were multibyte encoded, by reading the
characters one byte at a time.

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 19 Mar 2001 00:09:56 GMT
Raw View
In article <8666h79pk5.fsf@alex.gabi-soft.de>, James Kanze
<kanze@gabi-soft.de> wrote:
>|>  Also, from the point of speed, there appears to be no particular
>|>  gain in using less than 32 bits internally. For example, the latest
>|>  Mac powerbook (G4 CPU) uses 128-bit words which are vectored into
>|>  32-bit words. There appears to be no gain in speed in using 16-bit
>|>  characters over 32-bit characters.
>
>The wider the characters, the less fit in the cache, and the more often
>you will get cache misses.  (But I really doubt that there are many
>programs where this would make a difference.)

Well, it depends on whether the internal cache rounds words off to anything
shorter than 32 bits. -- Probably it doesn't, and then there is no gain there
either. Instead you get some extra time for conversions back and forth to
32 bits.

>Not really.  C++ *always* reads and writes "char" (which are typically 8
>bits).  Files with 16 or 32 bit characters require a codecvt which
>treats the file as if it were multibyte encoded, by reading the
>characters one byte at a time.

OK. Then for file writes on computers on which char is a byte, it suffices
to write a codecvt which writes 1-, 2-, or 4-byte files as desired.

Then, internally, simply use a wchar_t with at least 32 bits in it.

Internally, in the C++ program, one wouldn't want to juggle more than one
character type anyway.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "Andrea Ferro" <AndreaF@UrkaDVD.it>
Date: Mon, 19 Mar 2001 12:28:17 GMT
Raw View
"Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> wrote in message
news:slrn9b4s42.g58.qrczak@qrnik.zagroda...
Fri, 16 Mar 2001 19:43:25 GMT, Andrea Ferro <AndreaF@UrkaDVD.it> pisze:
>
>> And specifically if the implementation is to support all your
>> cases then wchar_t will be a signed or unsigned 32 bit integral
>> (that may be int, unsigned, long or unsigned long depending on
>> the implementation.
>
>No, in C++ wchar_t is a keyword which denotes a distinct type.
>In C wchar_t may be e.g. a typedef for int.

Right. I wanted to say it would be "like" one of those. It is distinct BUT must
match another integral in its signedness and size.

>Fri, 16 Mar 2001 11:02:26 GMT, Ron Natalie <ron@sensor.com> pisze:
>
>> All the implementations I've dealt with use the 16 bit encoding
>> for unicode (i.e. wchar_t is unsigned short).
>
>On Linux wchar_t has 32 bits.

I would not say it this way. Linux is a platform, not a C++ implementation. It
would be more correct to say "on the Linux version of compiler xyz ..."


Andrea Ferro

---------
Brainbench C++ Master. Scored higher than 97% of previous takers
Scores: Overall 4.46, Conceptual 5.0, Problem-Solving 5.0
More info http://www.brainbench.com/transcript.jsp?pid=2522556



---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 19 Mar 2001 18:05:03 GMT
Raw View
In article <3AB3742F.A1FA7A5D@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>So use 16 bit Unicode for char, and 32 bit Unicode for wchar_t. That
>seems a fairly obvious mapping.

It doesn't work, because one still wants to be able to write 8-bit files,
for ASCII, various encodings, and UTF8.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 19 Mar 2001 18:21:31 GMT
Raw View
In article <990jjt$oi$2@news.BelWue.DE>, dietmar_kuehl@yahoo.com wrote:
>There is a mechanism in the standard C++ library which makes this
>translation transparent, namely the 'std::codecvt' facet: The standard
>C++ library assumes that the underlying OS supports a mechanism to
>read/write individual characters of type 'char' which are then
>converted using an appropriate 'std::codecvt' object to translate it to
>the stream's character type.

This is good.

What remains is a mechanism that guarantees one can use 8-, 16-, and
32-bit characters.

Actually, I think C++ is in need of a new type binary<n>, where n is
the number of bits, that could eventually replace the other so-called
"integral types". Such a binary<n> would have operations such as
signed_multiplication and unsigned_multiplication, so that one could
then somehow define say short = { signed_, ... } etc.

One could then be sure of defining characters of a certain bit-width from
within C++.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Mon, 19 Mar 2001 18:20:57 GMT
Raw View
remove.haberg@matematik.su.se (Hans Aberg) writes:

|>  In article <3AB2C76E.AA9DC137@wizard.net>, "James Kuyper Jr."
|>  <kuyper@wizard.net> wrote:

|>  >...I don't see anything that would prevent a [C++ compiler]
|>  >implementation from defining the wide character set, or even the
|>  >narrow character set, as using the 16 or even the 32 bit Unicode
|>  >encodings.

|>  If there only were one Unicode, there would be no problem, as one
|>  could define wchar_t to that. But there are 16- and 32-bit Unicode.

More correctly: Unicode is compatible with ISO 10646, which defines two
character code sets: UCS-2 and UCS-4, with two- and four-byte
characters, respectively.  The Unicode consortium also defines several
multibyte encodings to handle this: UTF-16 (16 bits), UTF-16BE,
UTF-16LE, and UTF-8 (8 bits).

|>  There is no problem in letting wchar_t be 32 bits, and use only that
|>  internally as the Unicode 16 bits will always agree with the first
|>  65536 characters of 32 bit Unicode. In this approach the OS/compiler
|>  library would make the translation transparent. (It is probably best
|>  having an OS with file specific information about the input
|>  encoding, because there are a lot of 8-bit encodings that would
|>  require special mappings to Unicode.)

|>  But one may still want to write a 16-bit Unicode file. The question
|>  is how to ensure that from within a C++ program, if wchar_t is
|>  32-bit.

Simple: define the locale (codecvt) such that it uses UTF-16[BL]E as the
external code set.  C++ (and all systems I'm familiar with) reads and
writes bytes.  UTF-16[BL]E corresponds to UTF-16 (which in turn is
compatible with UCS-2 for the common characters), written either low
byte or high byte first.

The hooks for this are there in C++.  On the other hand, I don't know if
any implementation actually provides the necessary locales and/or facets
at present.
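
To make the idea concrete, the core of such a facet might look roughly like
the sketch below.  This is not a complete, conforming facet: only the
do_out() direction is shown, wchar_t is assumed to hold ISO 10646 code
points, and error handling is minimal.

    #include <locale>

    class utf16le_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
    {
    protected:
        result do_out(state_type&,
                      const wchar_t* from, const wchar_t* from_end,
                      const wchar_t*& from_next,
                      char* to, char* to_end, char*& to_next) const
        {
            from_next = from;
            to_next   = to;
            while (from_next != from_end) {
                unsigned long cp = static_cast<unsigned long>(*from_next);
                if (cp < 0x10000UL) {                    // one 16-bit unit
                    if (to_end - to_next < 2) return partial;
                    *to_next++ = static_cast<char>(cp & 0xFF);
                    *to_next++ = static_cast<char>((cp >> 8) & 0xFF);
                } else if (cp <= 0x10FFFFUL) {           // surrogate pair
                    if (to_end - to_next < 4) return partial;
                    unsigned long v  = cp - 0x10000UL;
                    unsigned long hi = 0xD800UL + (v >> 10);
                    unsigned long lo = 0xDC00UL + (v & 0x3FFUL);
                    *to_next++ = static_cast<char>(hi & 0xFF);
                    *to_next++ = static_cast<char>((hi >> 8) & 0xFF);
                    *to_next++ = static_cast<char>(lo & 0xFF);
                    *to_next++ = static_cast<char>((lo >> 8) & 0xFF);
                } else {
                    return error;                        // not representable
                }
                ++from_next;
            }
            return ok;
        }
        bool do_always_noconv() const throw() { return false; }
        int  do_encoding() const throw() { return 0; }   // variable width
        int  do_max_length() const throw() { return 4; }
    };

Installed in a stream's locale (via imbue), such a facet would give a
program with a 32-bit wchar_t UTF-16LE files without touching the rest of
the code.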

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@sensor.com>
Date: Mon, 19 Mar 2001 18:25:50 GMT
Raw View

"James Kuyper Jr." wrote:
>
>
> I'm curious: what aspects of C++ are incompatible with any useful
> extension? C++ doesn't guarantee the presence of anything useful with
> repect to Unicode. However, it would seem to allow a great deal. I don't
> see anything that would prevent an implementation from defining the wide
> character set, or even the narrow character set, as using the 16 or even
> the 32 bit Unicode encodings.

There isn't even wchar_t available for things like filenames and
the arguments to main, etc...

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 19 Mar 2001 19:32:21 GMT
Raw View
In article <86zoej8aac.fsf@alex.gabi-soft.de>, James Kanze
<kanze@gabi-soft.de> wrote:
>Simple, define the locale (codecvt) such that it uses UTF-16[BL]E as the
>external code set.  C++ (and all systems I'm familiar with) reads and
>writes bytes.

It seems that this will suffice for now, write a codecvt, and write bytes.

But if one uses, say, a 32-bit wchar_t and writes 32-bit files, then each
wchar_t will first be decomposed into bytes, and then pieced back together
into identical 32-bit words (assuming the OS can handle it). This is slow:
even though file operations are slow when reading/writing to disk, normally a
file is buffered, and thus one can make a lot of fast manipulations of a
buffered file (even though programmers usually avoid this).

So it seems that for the future, one will need a mechanism for reading and
writing wider characters than char. (And it will probably not be possible
to let char be more than a byte for backwards compatibility reasons.)

-- This reminds me that C++ does not appear to have a "save" command for
writing a file to disk. As file buffering is a common approach, I think it
would be prudent to add such a function.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 19 Mar 2001 21:19:05 GMT
Raw View
In article <3AB4EA9A.1C74520D@sensor.com>, Ron Natalie <ron@sensor.com> wrote:
>There isn't even wchar_t available for things like filenames and
>the arguments to main, etc...

MacOS X supports Unicode filenames, so I gather it must use a
UTF-style conversion to make it work, given that C++ does not support
wchar_t filenames.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "Dean Roddey" <droddey@charmedquark.com>
Date: Tue, 20 Mar 2001 07:54:26 CST
Raw View
I agree that the optimum scenario would be to use UTF-32 internally in the
program, translating as required to the underlying OS format. If you use
UTF-16, at first you feel it's great. But you very soon figure out that it's
no better than something like Shift-JIS or any other variable-width format. You
have all the same limitations, of having two totally different modes of
dealing with strings: one that works with 16-bit offsets, and one that
has to understand surrogates and such. This introduces a level of complexity
that's not worth it relative to the space saved compared to UTF-32. It's just
a minefield of bugs that cannot be dealt with other than with run-time
checks. I believe that MS made a mistake in choosing UTF-16 as its native
format, though it's obvious why they did.
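
(Just to illustrate the kind of extra step UTF-16 forces on the program:
recombining a surrogate pair into a single code point looks something like
the function below; the name and types are made up for the example.)

    // high must be in [0xD800, 0xDBFF], low in [0xDC00, 0xDFFF].
    unsigned long combine_surrogates(unsigned int high, unsigned int low)
    {
        return 0x10000UL
             + ((static_cast<unsigned long>(high) - 0xD800UL) << 10)
             +  (static_cast<unsigned long>(low)  - 0xDC00UL);
    }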

But there are always complexities. For instance, if you deal with XML, the
DOM officially only uses UTF-16. So you end up having to translate in and
out if that's not your native format.

The other problem with trying to implement something like that outside of
the compiler's support, i.e. having a library make all of this work, is that
L"foo" will create a native wide character string, which might not be what
your library thinks of as a wide string. You can argue that good programs
don't have hard-coded strings anyway, since they would be loaded. But as a
practical matter there are still lots of strings that the user never sees
and which would be wasteful to have to load dynamically.

I find myself stuck with this situation in my CIDLib stuff. I could easily
make my system consistently UTF-32, but the issue with L-prefixed strings
and characters would be a huge PITA.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"I'm not sure how I feel about ambivalence"


"Hans Aberg" <remove.haberg@matematik.su.se> wrote in message
news:remove.haberg-1603011310520001@du137-226.ppp.su-anst.tninet.se...
> The reason I started to think about this is a discussion in the LaTeX3
> group about a suitable successor to TeX (like Omega, etc):
>
> Then it turns out that 16-bit Unicode is not sufficient. In fact, most
> math characters lie outside the first Unicode 65536 character "plane". In
> addition, there must be room for additional user characters, as Unicode
> probably never will cover all that will be needed in typesetting.
>


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "Dean Roddey" <droddey@charmedquark.com>
Date: Tue, 20 Mar 2001 07:54:59 CST
Raw View
> There is no problem in letting wchar_t be 32 bits, and use only that
> internally as the Unicode 16 bits will always agree with the first 65536
> characters of 32 bit Unicode. In this approach the OS/compiler library

Unfortunately, that's not true either, technically. If the underlying OS
uses UTF-32/UCS-4, then most likely it does not ever expect to see surrogate
pairs. We had a huge argument about this with the Xerces C++ parser. I
originally wrote it in this way, but it really wouldn't work in a lot of
situations. Only if you can guarantee you'll never suck in any data that
uses surrogates can you be sure that this will work. Some programs will do
OK with that limitation, but many won't if they have to process files, or
incoming network data, or whatever...

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"I'm not sure how I feel about ambivalence"


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "Dean Roddey" <droddey@charmedquark.com>
Date: Tue, 20 Mar 2001 14:19:47 GMT
Raw View
It's not even necessarily the case that a particular compiler will use the
underlying system representation. For instance, I think that Borland's
compiler uses a 32-bit wchar_t (I assume to be consistent across platforms) on
NT, which uses UTF-16. So they will have to translate in and out.

It's all a mess once you get into it more than a wee bit. I'm the primary
author of the Xerces C++ XML parser, which does have to get into it deeply.
Once you throw in having to write portable code, it gets even
messier. Basically, you cannot assume anything.

And you also cannot even assume that wchar_t will hold any kind of Unicode
at all. On some platforms it doesn't, some HP variants being in that crowd
I think. So L"Boo" won't necessarily produce Unicode code points.

If you look at the XML parser, we do constants like (where XMLCh is a
typedef we can adjust by platform):

const XMLCh fgSomeString[] = { chLatin_A, chLatin_B, chNull };

since this is the only truly portable way, across all the platforms that
Xerces has to run on, to create hard-coded strings, of which there is a
considerable number in XML.
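
(For readers who haven't seen the headers: such constants are, roughly,
nothing more than named code points. The definitions below are illustrative
guesses, not copied from the Xerces sources.)

    typedef unsigned short XMLCh;       // adjusted per platform in practice

    const XMLCh chLatin_A = 0x41;       // 'A'
    const XMLCh chLatin_B = 0x42;       // 'B'
    const XMLCh chNull    = 0x00;

    const XMLCh fgSomeString[] = { chLatin_A, chLatin_B, chNull };  // "AB"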

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"I'm not sure how I feel about ambivalence"


"James Kanze" <kanze@gabi-soft.de> wrote in message
news:863dcb9p9y.fsf@alex.gabi-soft.de...
> Ron Natalie <ron@sensor.com> writes:
>
> |>  >   How is that to be handled from within C++ or is the C++ standard
> |>  > insufficient with respect to this?
>
> |>  All the implementations I've dealt with use the 16 bit encoding for
> |>  unicode (i.e. wchar_t is unsigned short).
>
> Both g++ under Linux and Sun CC under Solaris use 32 bits (presumably
> ISO 10646).
>


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Mon, 26 Mar 2001 20:02:07 GMT
Raw View
In article <864rwhbydb.fsf@alex.gabi-soft.de>, James Kanze
<kanze@gabi-soft.de> wrote:
>|>  But if one uses say 32-bit wchar_t, writing 32-bit files, then each
>|>  wchar_t will first decomposed into bytes, and then picked together
>|>  to identical 32-bit words (assuming the OS can handle it). This is
>|>  slow:
...
>All systems I know currently use byte oriented IO.  In general, if you
>want to write or read anything portably, you must read and write bytes.
>Even if the actual data is 32 bits wide, you need to control the byte
>order, for example.

It depends: if the OS supports 32-bit file buffers, then it would not be
convenient if the C++ requirement is to convert first into bytes. -- But
perhaps it is possible to write a codecvt which directly calls the system
routines in such a case.

As for the interface between the computer parallel bus and the actual
hard-disk, it may not make much difference, as it will be converted to
whatever standard is used.

>|>  So it seems that for the future, one will need a mechanism for
>|>  reading and writing wider characters than char. (And it will
>|>  probably not be possible to let char be more than a byte for
>|>  backwards compatibility reasons.)
>
>If we ever do get OS's which really support reads and writes of more
>than just bytes, we should give some thought as to how to support them.
>For the moment, I don't think it's an issue.  (For various reasons, I
>doubt that it ever will be an issue.)

MacOS X, which should be publicly released by now, already supports
Unicode files. Perhaps it goes over bytes.

But when Unicode becomes widespread, it seems likely that eventually
32-bit words at least will replace the byte concept altogether: there will
simply be no reason to fiddle around with 8-bit bytes anymore. (In fact 64-bit
bytes sound more reasonable, because one can fit an IEEE 64-bit float
into that. Come back in a year perhaps and we will know, given that
computer capacity doubles every year. :-) )

>|>  -- This reminds me that C++ does not appear to have a "save" command
>|>  for writing a file to disk. As file buffering is a common approach,
>|>  I think it would be prudent to add such a function.
>
>What is this save command supposed to do?  It sounds an awful lot like
>flush, from the little you say here.

Right, I just forgot about flush; it will do.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Mon, 26 Mar 2001 21:53:29 GMT
Raw View
[Please do not mail me a copy of your followup]

remove.haberg@matematik.su.se (Hans Aberg) spake the secret code
<remove.haberg-2603011859430001@du159-226.ppp.su-anst.tninet.se> thusly:

>But when Unicode becomes widespread, it seems likely that eventually
>32-bit words at least will replace the byte concept altogether: There will
>simply be no reason to fiddle around with 8-bit bytes anymore.

Have you done any graphics programming?  Dealing with 8-bit bytes is
common.
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Mon, 26 Mar 2001 22:16:00 GMT
Raw View
[Please do not mail me a copy of your followup]

Ron Natalie <ron@sensor.com> spake the secret code
<3AB173DC.184BC988@sensor.com> thusly:

>The standard library is pretty useless with regard to
>internationalization
>and unicode.

Where does a package like IBM's International Components for Unicode
fit into the picture?
<http://oss.software.ibm.com/developerworks/opensource/icu/project/>
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Tue, 27 Mar 2001 00:13:42 GMT
Raw View

Rich wrote:
>
> [Please do not mail me a copy of your followup]
>
> Ron Natalie <ron@sensor.com> spake the secret code
> <3AB173DC.184BC988@sensor.com> thusly:
>
> >The standard library is pretty useless with regard to
> >internationalization
> >and unicode.
>
> Where does a package like IBM's International Components for Unicode
> fit into the picture?
> <http://oss.software.ibm.com/developerworks/opensource/icu/project/>

This gives better UNICODE and localization support, but I don't
see how it addresses the system interface problem.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Tue, 27 Mar 2001 16:37:42 CST
Raw View
In article <99odm4$fqe$1@xmission.xmission.com>, (Rich)
legalize+jeeves@mail.xmission.com wrote:
>>But when Unicode becomes widespread, it seems likely that eventually
>>32-bit words at least will replace the byte concept altogether: There will
>>simply be no reason to fiddle around with 8-bit bytes anymore.
>
>Have you done any graphics programming?  Dealing with 8-bit bytes is
>common.

That might be a good example, even though if one is doing it in color, you
get three bytes packed together, which aligns to a 32-bit word. Also, I
recall that the Motorola G4 used in Macs vectors its 128-bit CPU word
into four 32-bit words, and part of the motivation for that was to simplify
graphics applications.

By the way, isn't "millions of colors" the emerging standard, which would
imply that one is using at least 10 bits for each color? If you want to
do float operations, then you would need to use at least 32-bit words.

To link it up with the discussion on how C++ handles file IO: evidently
C++ does not allow writing bytes or any specific binary type having a
certain fixed number of bits in it. -- One has to know for each compiler
what the underlying binary structure of a "char" is, and that is what C++
will read/write.

On the other hand, I recall Simula(?) has a feature where one can write
binary files (that is, bits).

It does not seem impossible to introduce a type binary<n>, where n is a
positive integer, having exactly n bits. Then IO with the type binary<n>
inputs/outputs exactly n bits. The implementation can optimize certain sizes,
whether that is n = 8 or whatever.
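
A rough sketch of what I have in mind, built on std::bitset just for
illustration (the name, the interface, and the output format are all my own
invention, not anything the standard provides):

    #include <bitset>
    #include <ostream>

    template <unsigned N>
    class binary
    {
        std::bitset<N> bits_;
    public:
        binary() {}
        explicit binary(unsigned long v) : bits_(v) {}

        unsigned long to_ulong() const { return bits_.to_ulong(); }

        // Emit exactly N bits, most significant bit first; the last byte
        // is padded with zero bits when N is not a multiple of 8.  A real
        // proposal would have to pin down how sub-byte units map onto the
        // external stream.
        void write(std::ostream& os) const
        {
            for (unsigned i = 0; i < N; i += 8) {
                unsigned char byte = 0;
                for (unsigned j = 0; j < 8 && i + j < N; ++j)
                    byte = static_cast<unsigned char>(
                        (byte << 1) | (bits_[N - 1 - (i + j)] ? 1 : 0));
                os.put(static_cast<char>(byte));
            }
        }
    };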

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Tue, 27 Mar 2001 18:38:43 CST
Raw View
[Please do not mail me a copy of your followup]

Ron Natalie <ron@spamcop.net> spake the secret code
<3ABFC814.1572C94E@spamcop.net> thusly:

>This gives better UNICODE and localization support, but I don't
>see how it addresses the system interface problem.

Frankly if I needed to support opening files with Unicode names, then
I would just handle that myself, which is perfectly in the spirit of
the C++ library and even the C library approach to handling input and
output.  Yes, yucky that I have to do it myself, but this was the case
for containers before STL.

Has anyone made a list of these char* issues and drafted anything to
the standards body?
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: legalize+jeeves@xmission.com (Rich)
Date: Tue, 27 Mar 2001 18:38:21 CST
Raw View
[Please do not mail me a copy of your followup]

remove.haberg@matematik.su.se (Hans Aberg) spake the secret code
<remove.haberg-2703011208380001@du187-226.ppp.su-anst.tninet.se> thusly:

>>Have you done any graphics programming?  Dealing with 8-bit bytes is
>>common.
>
>That might be a good example, even though if one is doing it color, you
>get three bytes packed together, which aligns to a 32-bit word. Also, I
>recall that the Motorola G4 used in Mac's vectors its 128-bit CPU word
>into 4 32-bit word, and part of the motivation for that was to simplify
>graphics applications.

There are plenty of places where 8-bit byte values are used: alpha
channels, stencil buffers, grayscale images, etc.  There are plenty of
places where 1-bit values are still used as well.

>By the way, isn't "millions of colors" the emerging standard, which
>implies that one is using at least 10 bits for each color.

Generally "millions of colors" implies 24-bit RGB colors with 8 bits
per channel.  So you still manipulate 8-bit quantities when you want
to diddle one channel and leave the others alone.
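
(In code, that channel-diddling is just masking and shifting; the packed
0x00RRGGBB layout below is only an assumption for the example.)

    // Replace the green channel of a packed 0x00RRGGBB pixel.
    unsigned long set_green(unsigned long pixel, unsigned char g)
    {
        return (pixel & 0xFF00FFUL) | (static_cast<unsigned long>(g) << 8);
    }

    // Read the green channel back out.
    unsigned char get_green(unsigned long pixel)
    {
        return static_cast<unsigned char>((pixel >> 8) & 0xFF);
    }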

>If you want to
>do float operations, then you would need to use at least 32-bits words.

People don't commonly use floating-point values for pixels, although
colors are often represented in floating-point for lighting
computations or high-dynamic-range applications.  Off the top of my
head, only raytracers use floating-point pixels, and maybe RenderMan,
since it targets non-real-time rendering where quality is the primary
concern.

>It does not seem impossible to introduce a type binary<n> where n is a
>positive integer, having exactly n bits. Then IO with the type binary<n>
>in/out-puts exactly n bits. The implementation can optimize certain sizes,
>if that is now n = 8 or whatever.

Nothing's stopping you from writing such a class and using it yourself
if you really think it's that important.
--
Ask me about my upcoming book on Direct3D from Addison-Wesley!
Direct3D Book <http://www.xmission.com/~legalize/book/>
         Home <http://www.xmission.com/~legalize/>
    Fractals! <http://www.xmission.com/~legalize/fractals/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Matthew Austern <austern@research.att.com>
Date: Tue, 27 Mar 2001 18:38:56 CST
Raw View
Ron Natalie <ron@spamcop.net> writes:

> > >The standard library is pretty useless with regard to
> > >internationalization
> > >and unicode.
> >
> > Where does a package like IBM's International Components for Unicode
> > fit into the picture?
> > <http://oss.software.ibm.com/developerworks/opensource/icu/project/>
>
> This gives better UNICODE and localization support, but I don't
> see how it addresses the system interface problem.

It doesn't.  The problem is that you made an extraordinarily broad
claim (that the standard library's i18n support was "pretty much
useless"), when you actually meant something very specific (that
std::basic_fstream's constructor takes a char* instead of a wchar_t*).

People are talking about ways in which the standard library is useful
and ways in which it can be extended to be even more useful.  That
still doesn't give you your basic_fstream::basic_fstream(const wchar_t*),
but that doesn't mean it isn't useful for other purposes and in other
ways.  People aren't necessarily responding to what you meant, but
they are responding to what you said.

For what it's worth, I do think it would be nice to think about
generalizing basic_fstream's constructor.  There are some interesting
issues there that are worth talking about.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Tue, 27 Mar 2001 23:15:31 CST
Raw View

Rich wrote:
>
> Frankly if I needed to support opening files with Unicode names, then
> I would just handle that myself, which is perfectly in the spirit of
> the C++ library

Good, explain to me what spirit you are going to use, short of writing
your own stream_buf which replicates the standard one with everything except
file open.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Tue, 20 Mar 2001 20:37:27 GMT
Raw View
In article <pjCt6.1650$_M4.217794@news.pacbell.net>, "Dean Roddey"
<droddey@charmedquark.com> wrote:
>> There is no problem in letting wchar_t be 32 bits, and use only that
>> internally as the Unicode 16 bits will always agree with the first 65536
>> characters of 32 bit Unicode. In this approach the OS/compiler library
>
>Unfortunately, that's not true either, technically. If the underlying OS
>uses UTF-32/UCS-4, then most likely it does not ever expect to see surrogate
>pairs. We had a huge argument about this with the Xerces C++ parser. I
>originally wrote it in this way. But it really wouldn't work in a lot of
>situations. Only if you can guarantee you'll never suck in any data that
>uses surrogates can you be sure that this will work. Some programs will do
>ok with that limitation, but many won't if they have to process files, or in
>coming network data or whatever..

I haven't tried it in practice -- I made the post in order to avoid
getting into the trouble that you seem to have already experienced.

But in my mind, I expect there to be a mixture of files in a host of
encodings: various 8-bit encodings, UTF-8, UTF-16, Unicode 16 & 32, etc.

One then only makes sure to translate it into an internal 32-bit wchar_t.
Either the OS recognizes this and provides the translation, or provides an
encoding attribute for each file which the program can use for appropriate
translation, or the program will have to figure it out otherwise.

Within C++, if the file is in a format which isn't already Unicode 32, one
can evidently write a codecvt for that. Have you tried that approach?

Otherwise, the problems you have experienced only seem to get one back to
square one: C++ totally ignores the fact that sometimes one needs to
specify the underlying binary (= bit) structure, especially for
communicating data between programs compiled using different C++
compilers.

The C++ standard was developed for a world that no longer exists, one in which
binary transfer between computers with incompatible underlying binary models
was rare.

Somehow, I think that one must take up this issue for the next revision of C++.

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 25 Mar 2001 17:41:35 GMT
Raw View
remove.haberg@matematik.su.se (Hans Aberg) writes:

|>  In article <86zoej8aac.fsf@alex.gabi-soft.de>, James Kanze
|>  <kanze@gabi-soft.de> wrote:
|>  >Simple, define the locale (codecvt) such that it uses UTF-16[BL]E as the
|>  >external code set.  C++ (and all systems I'm familiar with) reads and
|>  >writes bytes.

|>  It seems that this will suffice for now, write a codecvt, and write
|>  bytes.

|>  But if one uses say 32-bit wchar_t, writing 32-bit files, then each
|>  wchar_t will first decomposed into bytes, and then picked together
|>  to identical 32-bit words (assuming the OS can handle it). This is
|>  slow: Even though file operations are slow when reading/writing to
|>  disk, normally a file is buffered, and thus one can make a lot of
|>  fast manipulations of a buffered file (even though programmers
|>  usually avoids this).

All systems I know currently use byte-oriented IO.  In general, if you
want to write or read anything portably, you must read and write bytes.
Even if the actual data is 32 bits wide, you need to control the byte
order, for example.
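
Controlling the byte order just means emitting the value one octet at a
time in a chosen order.  A minimal sketch (little-endian here; the function
name is made up):

    #include <ostream>

    // Write a 32-bit value least-significant byte first, regardless of
    // how the host happens to store it in memory.
    void put_u32_le(std::ostream& os, unsigned long value)
    {
        os.put(static_cast<char>( value        & 0xFF));
        os.put(static_cast<char>((value >>  8) & 0xFF));
        os.put(static_cast<char>((value >> 16) & 0xFF));
        os.put(static_cast<char>((value >> 24) & 0xFF));
    }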

|>  So it seems that for the future, one will need a mechanism for
|>  reading and writing wider characters than char. (And it will
|>  probably not be possible to let char be more than a byte for
|>  backwards compatibility reasons.)

If we ever do get OS's which really support reads and writes of more
than just bytes, we should give some thought as to how to support them.
For the moment, I don't think it's an issue.  (For various reasons, I
doubt that it ever will be an issue.)

|>  -- This reminds me that C++ does not appear to have a "save" command
|>  for writing a file to disk. As file buffering is a common approach,
|>  I think it would be prudent to add such a function.

What is this save command supposed to do?  It sounds an awful lot like
flush, from the little you say here.

--
James Kanze                               mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(069)63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Michiel Salters<Michiel.Salters@cmg.nl>
Date: Fri, 16 Mar 2001 12:12:39 CST
Raw View
In article <remove.haberg-1503011129330001@du130-226.ppp.su-anst.tninet.se>,
Hans Aberg says...

>The C++ standard 3.9.1#5 says:
>  Type wchar_t is a distinct type whose values can represent distinct
>codes for all members of the largest extended character set specified
>among the supported locales (22.1.1). Type wchar_t shall have the same
>size, signedness, and alignment requirements (3.9) as one of the other
>integral types, called its underlying type.

>  I wonder how this should work together with Unicode, which can be 16-bit
>or 32-bit*); from what I understand one can have a mixture of U16 & U32
>files on the same OS (operating system).

wchar_t and Unicode should work together like char and ASCII, that is,
just as the platform likes. The operating system can do pretty much
anything it likes, and the compiler has to make some sense of it. Note the
part about "supported locales".

>  How is that to be handled from within C++ or is the C++ standard
>insufficient with respect to this?

The C++ standard is insufficient. But since other standards are sufficient,
C++ shouldn't redo that work.

>-- It appears to me that one would need a uchar type which is 16 bits and
>a Uchar type which is 32 bits, and nothing else.

Why? Some systems might not even have 16-bit types; that's not required
by the C++ standard.

>Or is the OS supposed to translate all files when opened as wfstreams, so
>that they appear to be U32 (i.e. a wchar_t with at least 32 bits) from
>within C++?

No. The full burden is placed on the C++ implementation. For all locales
that it supports there must be a wchar_t, and it must write them to disk.
Whether that is done in UTF-8 or any other local OS convention is not
dictated by the standard. If a C++ program can read its own output the
compiler has done ok.

HTH,
Michiel Salters

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: remove.haberg@matematik.su.se (Hans Aberg)
Date: Fri, 16 Mar 2001 18:46:30 GMT
Raw View
In article <3AB161D5.25B98F73@wizard.net>, "James Kuyper Jr."
<kuyper@wizard.net> wrote:
>>..Unicode, which can be 16-bit
>> or 32-bit*); from what I understand one can have a mixture of U16 & U32
>> files on the same OS (operating system).
>>
>>   How is that to be handled from within C++ or is the C++ standard
>> insufficient with respect to this?
>
>It is insufficient.
...
>That said, the standard at least allows an implementation to support a
>locale in which wchar_t is interpreted by the standard I/O functions
>using Unicode encoding. In fact, there's no reason why char itself
>couldn't be 32 bits, with a Unicode encoding.

In article <3AB173DC.184BC988@sensor.com>, Ron Natalie <ron@sensor.com> wrote:
>All the implementations I've dealt with use the 16 bit encoding for
>unicode
>(i.e. wchar_t is unsigned short).

The reason I started to think about this is a discussion in the LaTeX3
group about a suitable successor to TeX (like Omega, etc):

Then it turns out that 16-bit Unicode is not sufficient. In fact, most
math characters lie outside the first Unicode 65536 character "plane". In
addition, there must be room for additional user characters, as Unicode
probably never will cover all that will be needed in typesetting.

Also, from the point of speed, there appears to be no particular gain in
using less than 32 bits internally. For example, the latest Mac powerbook
(G4 CPU) uses 128-bit words which are vectored into 32-bit words. There
appears to be no gain in speed in using 16-bit characters over 32-bit
characters.

So one would expect that there will be a mixture of 32-, 16-, and 8-bit
files on the computer operating system.

The problem is perhaps not reading those files, because then one could use
a 32-bit wchar, and translate whatever is read into that format. But then
one would need to write a file in a particular format, which could be 32,
16, or 8 bits. Then there appears to be at least one type missing in C++
(i.e., one needs at least two different wchar types).

(There are encodings UTF-8 & UTF-16 for writing Unicode files in 8- & 16-bit
formats, but from within a C++ program, one would want to work with
fixed-size characters.)

  Hans Aberg      * Anti-spam: remove "remove." from email address.
                  * Email: Hans Aberg <remove.haberg@member.ams.org>
                  * Home Page: <http://www.matematik.su.se/~haberg/>
                  * AMS member listing: <http://www.ams.org/cml/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: Ron Natalie <ron@spamcop.net>
Date: Fri, 16 Mar 2001 19:01:45 GMT
Raw View

Michiel Salters wrote:

> >  How is that to be handled from within C++ or is the C++ standard
> >insufficient with respect to this?
>
> The C++ standard is insufficient. But since other standards are sufficient,
> C++ shouldn't redo that work.

Yes, but unfortunately, the standard parts of C++ are incompatible with
any useful extension.  With the exception of a few places where
char_traits<wchar_t> is defined as being required, by and large C++ ignores
the fact that all the world ain't ASCII.

>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "Andrea Ferro" <AndreaF@UrkaDVD.it>
Date: Fri, 16 Mar 2001 19:43:25 GMT
Raw View
"Hans Aberg" <remove.haberg@matematik.su.se> wrote in message
news:remove.haberg-1503011129330001@du130-226.ppp.su-anst.tninet.se...
> The C++ standard 3.9.1#5 says:
>   Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales (22.1.1). Type wchar_t shall have the same
> size, signedness, and alignment requirements (3.9) as one of the other
> integral types, called its underlying type.
>
>   I wonder how this should work together with Unicode, which can be 16-bit
> or 32-bit*); from what I understand one can have a mixture of U16 & U32
> files on the same OS (operative system).
>
>   How is that to be handled from within C++ or is the C++ standard
> insufficient with respect to this?
>
> -- It appears to me that one would need a uchar type which is 16 bits and
> a Uchar type which is 32 bits, and nothing else.
>
> Or is the OS supposed to translate all files when opened as wfstreams, so
> that they appear to be U32 (i.e. a wchar_t with at least 32 bits) from
> within C++?

AFAIK the OS has nothing to do with that; the implementation does. And
specifically, if the implementation is to support all your cases, then
wchar_t will be a signed or unsigned 32-bit integral type (whose underlying
type may be int, unsigned, long or unsigned long depending on the
implementation; it may even be short if the implementation has 32-bit
shorts, which is unlikely today). The transformation to wchar_t is done in
the library code that reads the stream, not in the OS.

Note that if you open a stream that is physically 8-bit with a wfstream, it
is already supposed to be translated to wchar_t; translating 16-bit Unicode
to 32-bit wchar_t is nowhere more difficult or different. How the
implementation knows whether the physical stream is 8, 16, 32 bits or mixed
is simply not a problem for the standard. It may be a magic header string
in the stream, it may be user-specified, it may be based on the file
extension (as on Windows), it may be extra information from the OS (as on
the Mac), or the stream may be something as complex as an HTTP connection,
physically arriving with headers, MIME coding and compression that all have
to be resolved just to present the pure data content in the input stream
itself.
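
As a rough sketch only (the file name is made up, and it assumes the
implementation's "" locale supplies a codecvt facet matching the file's
actual encoding), the conversion is driven by the stream's locale rather
than by the OS:

#include <fstream>
#include <locale>

int main()
{
    std::wifstream in;
    // The external byte sequence is converted to wchar_t by the
    // codecvt<wchar_t, char, mbstate_t> facet of the imbued locale.
    in.imbue(std::locale(""));     // the user's preferred locale
    in.open("data.txt");           // hypothetical file name
    wchar_t c;
    while (in.get(c)) {
        // ... process one fixed-size wide character at a time ...
    }
    return 0;
}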

Andrea Ferro

---------
Brainbench C++ Master. Scored higher than 97% of previous takers
Scores: Overall 4.46, Conceptual 5.0, Problem-Solving 5.0
More info http://www.brainbench.com/transcript.jsp?pid=2522556



---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: qrczak@knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 16 Mar 2001 23:22:58 GMT
Raw View
Fri, 16 Mar 2001 19:43:25 GMT, Andrea Ferro <AndreaF@UrkaDVD.it> writes:

> And specifically if the implementation is to support all your
> cases then wchar_t will be a signed or unsigned 32 bit integral
> (that may be int, unsigned, long or unsigned long depending on
> the implementation.

No, in C++ wchar_t is a keyword which denotes a distinct type.
In C wchar_t may be e.g. a typedef for int.
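
A small illustration (purely for exposition): in C++ the two functions
below are distinct overloads even on an implementation where wchar_t has
the same size and representation as unsigned short, which would be
impossible if wchar_t were merely a typedef for unsigned short:

#include <iostream>

void f(unsigned short) { std::cout << "unsigned short\n"; }
void f(wchar_t)        { std::cout << "wchar_t\n"; }  // distinct type, distinct overload

int main()
{
    f(L'x');   // selects the wchar_t overload
    return 0;
}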


Fri, 16 Mar 2001 11:02:26 GMT, Ron Natalie <ron@sensor.com> writes:

> All the implementations I've dealt with use the 16 bit encoding
> for unicode (i.e. wchar_t is unsigned short).

On Linux wchar_t has 32 bits.

--
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Sat, 17 Mar 2001 02:02:58 GMT
Raw View
Ron Natalie wrote:
>
> Michiel Salters wrote:
>
> > >  How is that to be handled from within C++ or is the C++ standard
> > >insufficient with respect to this?
> >
> > The C++ standard is insufficient. But since other standards are sufficient,
> > C++ shouldn't redo that work.
>
> Yes, but unfortunately the standard parts of C++ are incompatible with
> any useful extension. With the exception of a few places where
> char_traits<wchar_t> is defined as being required, by and large C++
> ignores the fact that all the world ain't ASCII.

I'm curious: what aspects of C++ are incompatible with any useful
extension? C++ doesn't guarantee the presence of anything useful with
respect to Unicode. However, it would seem to allow a great deal. I don't
see anything that would prevent an implementation from defining the wide
character set, or even the narrow character set, as using the 16-bit or
even the 32-bit Unicode encodings.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]





Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Fri, 16 Mar 2001 11:01:48 GMT
Raw View
Hans Aberg wrote:
>
> The C++ standard 3.9.1#5 says:
>   Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales (22.1.1). Type wchar_t shall have the same
> size, signedness, and alignment requirements (3.9) as one of the other
> integral types, called its underlying type.
>
>   I wonder how this should work together with Unicode, which can be 16-bit
> or 32-bit*); from what I understand one can have a mixture of U16 & U32
> files on the same OS (operative system).
>
>   How is that to be handled from within C++ or is the C++ standard
> insufficient with respect to this?

It is insufficient. A fully conforming implementation can use an 8-bit
char as wchar_t. The standard does not require Unicode support beyond
translation of UCNs, and it doesn't require that they be translated into
any particularly useful form. Identifiers that contain distinct UCNs must
be treated as distinct identifiers, but that can be done by keeping the
literal form of the UCN (translated to a fixed case) in the compiler's
internal representation of the identifier.

UCNs which occur in string and character literals that correspond to
encodings in the execution character set shall be translated to those
encodings; what happens to other characters is implementation-defined,
and need not be defined as anything useful.
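
For example (illustrative only; whether a character that is not in the
execution character set survives translation usefully is
implementation-defined):

// U+00E9 is LATIN SMALL LETTER E WITH ACUTE; what it becomes in the
// narrow literal depends on the execution character set.
const char    *narrow = "caf\u00E9";
const wchar_t *wide   = L"caf\u00E9";   // the same character in a wide string literal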

In other words, the Unicode support required by the standard is quite
minimal.

That said, the standard at least allows an implementation to support a
locale in which wchar_t is interpreted by the standard I/O functions
using Unicode encoding. In fact, there's no reason why char itself
couldn't be 32 bits, with a Unicode encoding.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.research.att.com/~austern/csc/faq.html                ]