Topic: Wide characters and narrow streams
Author: "Kristof Zelechovski" <giecrilj@stegny.2a.pl>
Date: Fri, 24 Nov 2006 12:55:52 CST
Uzytkownik "Jean-Marc Bourguet" <jm@bourguet.org> napisal w wiadomosci > I
started the thread to check if my understanding of the matter -- see the
> correct. The most shocking part is that in locale other than "C", narrow
> streams are useless: you may know if that it doesn't do a conversion (with
> codecvt<>::always_noconv()) but you don't know if it is because there is
> no
> need for one or because the underlying charset is wide.
>
Usually it is because the recoding process
can conceptually be split into two parts: decoding and re-encoding.
I admit it could be more efficient if it were done as if by the Unix tr tool,
but not all recodings can be implemented in such a way.
And the underlying character set is never wide.
The reason is that you cannot limit reading from text files to blocks;
you can always read just one byte.
What you may perceive as wide characters stored directly in the file
is actually wide characters converted to a UTF encoding
of the appropriate bit length and byte order.
(A single byte does not have any perceivable "direction";
however, a sequence of bytes does.)
This encoding may be simpler to decode,
but it is a narrow character encoding nevertheless,
and, surprise, it need not be supported by the locale system.
(This is the present deplorable state of Microsoft C++.)
Chris
---
Author: "James Kanze" <james.kanze@gmail.com>
Date: Wed, 22 Nov 2006 10:42:59 CST
"Krzysztof Zelechowski" wrote:
> Narrow character streams do not decode characters. And you
> cannot expect them to do it because it does not work that way.
> You cannot decode narrow character sequences.
I've wondered about this.  The standard requires all instances
of basic_filebuf, including basic_filebuf<char>, to use the
codecvt facet for code translation, at least in theory.  The standard
also has a requirement that
std::codecvt<char,char,mbstate_t>::do_always_noconv() return
true.  Does this mean that it is impossible to create a locale
with a facet std::codecvt<char,char,mbstate_t> where this isn't
true?  I don't think so; I can certainly create instances of
other standard facets to do what I want, regardless of what the
default version does.  But to tell the truth, I'm not sure.  I
find it very difficult to say what is and what is not allowed
when it comes to locale.
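For illustration, here is a minimal sketch of what installing such a facet
might look like; the facet and the file name are hypothetical, and the
do_in()/do_out() overrides that would perform a real translation are
elided.  Whether basic_filebuf<char> then honours it is exactly the open
question:

    #include <fstream>
    #include <locale>

    // Hypothetical facet derived from the standard char/char codecvt,
    // claiming that a real conversion is needed.
    class ConvertingCvt : public std::codecvt<char, char, std::mbstate_t>
    {
    protected:
        virtual bool do_always_noconv() const throw() { return false; }
        // do_in()/do_out() overrides performing the translation go here.
    };

    int main()
    {
        std::locale custom(std::locale(), new ConvertingCvt);
        std::ifstream in;
        in.imbue(custom);       // must happen before open()
        in.open("data.txt");    // hypothetical file name
    }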
> You can recode them, but it is a two-step process:
> decode to wide characters and encode to narrow characters, perhaps using a
> different locale.
I don't see where you get this requirement from.
> In theory you can tell what kind of encoding the locale uses by examining
> the character length.
I'm not sure I follow.  Length will only give you the length for
the currently available characters.  Do you mean encoding()
(which in fact does return the fixed length, if there is one,
and special values otherwise)?
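For what it's worth, a minimal sketch of querying that value; it assumes
only that the environment's default locale can be constructed with
std::locale(""):

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale loc("");    // the user's preferred locale
        typedef std::codecvt<wchar_t, char, std::mbstate_t> Cvt;
        int enc = std::use_facet<Cvt>(loc).encoding();
        // -1: state-dependent encoding, 0: variable width (e.g. UTF-8),
        // N > 0: fixed width of N external chars per internal character.
        std::cout << "encoding() = " << enc << '\n';
    }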
> If the character length is fixed, you can seek offsets in file buffers;
> if it is not, you can only seek positions.
According to the standard, you can only seek positions. Period.
At least in a text file.
[...]
> If your code contains narrow character literals or narrow string literals,
> it must be recompiled because you have to recode the source files.
If your code contains literals which contain characters other
than those in the basic character set, you're pretty much hosed
anyway, with regard to portability.
> If you open the source files under a locale that uses an encoding different
> from the original,
> your literals will probably be unreadable.
If you open the files, you're responsible for whatever happens.
The real question is what the compiler does with such files.
What assumptions does it make about the input encoding?
> No surprise that they do not make sense to your programme either.
> Therefore it is always safer to use wide character literals and wide
> character streams for all processing purposes.
> And if you recode single byte narrow characters, the source code will be
> ill-formed
> because it will contain multiple bytes between single quotes;
> your compiler may accept them silently (especially at Apple),
> but then you will get runtime misbehaviour
> because you cannot get such characters from the input stream.
I'm not sure what you are saying here. A narrow character
literal can definitely contain multibyte characters, e.g. if the
compiler uses UTF-8 as the default encoding. And there's also
no problem in reading such characters.
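A small sketch of the point, with the UTF-8 bytes of 'é' spelled out
explicitly so that it does not depend on the compiler's execution charset:

    #include <cstring>
    #include <iostream>

    int main()
    {
        // With a UTF-8 execution charset, the literal "é" would contain
        // exactly these two bytes.
        const char e_acute[] = "\xC3\xA9";
        std::cout << std::strlen(e_acute) << " bytes\n";  // prints: 2 bytes
        // Reading the same two bytes back through a narrow stream is
        // equally unproblematic: each byte arrives as one char.
    }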
> It is the core cause of your failure: narrow I/O streams operate on bytes,
> not on characters.
Which is a problem for most of locale (how do you use isupper on
a multibyte character), but not for the streams.
> And for a good reason: the UTF-8 encoding of a character can take up to 6
> bytes; that will not fit into an integer;
What does fitting into an integer have to do with anything?
> you would have to resort to using UTF-16 surrogates encoded into UTF-8 as
> separate characters;
> while such a double encoding is technically possible, it is cumbersome and
> weird.
I can't figure out what you are talking about here.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
---
Author: krixel@qed.pl ("Krzysztof Zelechowski")
Date: Wed, 22 Nov 2006 17:53:57 GMT
Uzytkownik "James Kanze" <james.kanze@gmail.com> napisal w wiadomosci
news:1164201538.184899.241920@f16g2000cwb.googlegroups.com...
> "Krzysztof Zelechowski" wrote:
>> Narrow character streams do not decode characters. And you
>> cannot expect them to do it because it does not work that way.
>> You cannot decode narrow character sequences.
>
> I've wondered about this. The standard requires all instances
> of basic_filebuf, including basic_filebuf<char>, to use the
> codecvt facet for code translation, at least in theory.  The standard
> also has a requirement that
> std::codecvt<char,char,mbstate_t>::do_always_noconv() return
> true.  Does this mean that it is impossible to create a locale
> with a facet std::codecvt<char,char,mbstate_t> where this isn't
> true? I don't think so; I can certainly create instances of
> other standard facets to do what I want, regardless of what the
> default version does. But to tell the truth, I'm not sure. I
> find it very difficult to say what is and what is not allowed
> when it comes to locale.
You definitely do not have it out of the box, which is the situation of the
OP.
>
>> You can recode them, but it is a two-step process:
>> decode to wide characters and encode to narrow characters, perhaps using
>> a
>> different locale.
>
> I don't see from where you get this requirement.
It is a possibility, not a requirement.
>
>> In theory you can tell what kind of encoding the locale uses by examining
>> the character length.
>
> I'm not sure I follow. Length will only give you the length for
> the currently available characters. Do you mean encoding()
> (which in fact does return the fixed length, if there is one
> and special values otherwise)?
I should have put it "the length of a character under the current
representation".
>
>> If the character length is fixed, you can seek offsets in file buffers;
>> if it is not, you can only seek positions.
>
> According to the standard, you can only seek positions. Period.
> At least in a text file.
>
According to the C++ Standard (the 2003 version), section 27.8.1.4:
  11 Effects: Let width denote a_codecvt.encoding().  If is_open() == false,
  or off != 0 && width <= 0, then the positioning operation fails.
  Otherwise, if way != basic_ios::cur or off != 0, and if the last
  operation was output, then update the output sequence and write any
  unshift sequence.  Next, seek to the new position: if width > 0, call
  std::fseek(file, width * off, whence), otherwise call
  std::fseek(file, 0, whence).
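In other words, with a variable-width encoding (width <= 0) only seeks to
previously obtained positions can succeed.  A minimal sketch of the
portable pattern, with a hypothetical file name:

    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream in("data.txt");
        std::string line;
        std::getline(in, line);
        std::ifstream::pos_type p = in.tellg();  // opaque saved position
        std::getline(in, line);
        in.seekg(p);  // seeking back to a saved position is the portable way
        // in.seekg(5, std::ios::cur) would only be valid if width > 0.
    }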
>> If you open the source files under a locale that uses an encoding
>> different
>> from the original,
>> your literals will probably be unreadable.
>
> If you open the files, you're responsible for whatever happens.
I meant, open it in a text viewer.
Thus, if the text viewer robs a bank or launches an intercontinental missile
after being told to render my code for viewing,
the implementor of the text viewer will be responsible, not me.  Sorry :-(
> The real question is what the compiler does with such files.
> What assumptions does it make about the input encoding?
in std::locale("") unless explicitly instructed otherwise
>
>> No surprise that they do not make sense to your programme either.
>> Therefore it is always safer to use wide character literals and wide
>> character streams for all processing purposes.
>> And if you recode single byte narrow characters, the source code will be
>> ill-formed
>> because it will contain multiple bytes between single quotes;
>> your compiler may accept them silently (especially at Apple),
>> but then you will get runtime misbehaviour
>> because you cannot get such characters from the input stream.
>
> I'm not sure what you are saying here. A narrow character
> literal can definitely contain multibyte characters, e.g. if the
> compiler uses UTF-8 as the default encoding. And there's also
> no problem in reading such characters.
I get warnings from gcc about a literal of the form 'ABCD'.
For an Apple compiler it is an integer value equivalent to
((('A' << CHAR_BIT | 'B') << CHAR_BIT | 'C') << CHAR_BIT) | 'D'.
gcc tells me it could be the other way round as well:
'A' | ('B' | ('C' | 'D' << CHAR_BIT) << CHAR_BIT) << CHAR_BIT.
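For concreteness, a sketch computing the first of those two orderings; the
value of a multicharacter literal is implementation-defined, so this
matches some compilers and not others:

    #include <climits>
    #include <iostream>

    int main()
    {
        // 'A' in the most significant byte, 'D' in the least:
        int v = ((('A' << CHAR_BIT | 'B') << CHAR_BIT | 'C') << CHAR_BIT)
                | 'D';
        std::cout << std::hex << v << '\n';  // 41424344 on an ASCII machine
    }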
As I already said, narrow characters are never recoded; the input encoding
is only relevant to wide character constants.
>
>> It is the core cause of your failure: narrow I/O streams operate on
>> bytes,
>> not on characters.
>
> Which is a problem for most of locale (how do you use isupper on
> a multibyte character), but not for the streams.
But it is a problem for users who want to extract some useful meaning
from what the streams read.
>
>> And for a good reason: the UTF-8 encoding of a character can take up to 6
>> bytes; that will not fit into an integer;
>
> What does fitting into an integer have to do with anything?
>
Therefore stream buffers are unable to treat narrow characters as atomic
entities.
I am not referring to the current state of affairs but to what could have
been done, but has not been, and why.
>> you would have to resort to using UTF-16 surrogates encoded into UTF-8 as
>> separate characters;
>> while such a double encoding is technically possible, it is cumbersome and
>> weird.
>
> I can't figure out what you are talking about here.
The encoding for a surrogate fits into an integer.
Chris
---
Author: jm@bourguet.org (Jean-Marc Bourguet)
Date: Wed, 22 Nov 2006 22:40:23 GMT
krixel@qed.pl ("Krzysztof Zelechowski") writes:
> User "James Kanze" <james.kanze@gmail.com> wrote in message
> news:1164201538.184899.241920@f16g2000cwb.googlegroups.com...
>> "Krzysztof Zelechowski" wrote:
>>> Narrow character streams do not decode characters. And you
>>> cannot expect them to do it because it does not work that way.
>>> You cannot decode narrow character sequences.
>>
>> I've wondered about this. The standard requires all instances
>> of basic_filebuf, including basic_filebuf<char>, to use the
>> codecvt facet for code translation, at least in theory.  The standard
>> also has a requirement that
>> std::codecvt<char,char,mbstate_t>::do_always_noconv() return
>> true.  Does this mean that it is impossible to create a locale
>> with a facet std::codecvt<char,char,mbstate_t> where this isn't
>> true? I don't think so; I can certainly create instances of
>> other standard facets to do what I want, regardless of what the
>> default version does. But to tell the truth, I'm not sure. I
>> find it very difficult to say what is and what is not allowed
>> when it comes to locale.
>
> You definitely do not have it out of the box,
My expectation was that we had it.  I still see nothing which
prevents a system-provided locale from doing a decoding for a narrow
stream, but I see nothing which mandates it -- an unfortunate situation.
Perhaps the fact that C IO seems unable to do such a decoding (there is an
mbstate_t explicitly associated with a wide stream; nothing is mentioned for
a narrow stream) is the explanation of the behavior I observed.
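For reference, C's per-stream orientation can at least be queried; a
minimal sketch (the file name is hypothetical):

    #include <cstdio>
    #include <cwchar>

    int main()
    {
        std::FILE* f = std::fopen("data.txt", "r");
        if (!f) return 1;
        // fwide(f, 0) merely queries: < 0 byte-oriented, > 0 wide-oriented,
        // 0 not yet decided.  Only a wide-oriented stream carries the
        // mbstate_t used for decoding; a byte-oriented one passes bytes
        // through untouched.
        std::printf("orientation: %d\n", std::fwide(f, 0));
        std::fclose(f);
    }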
> which is the situation of the OP.
My situation is under control.  Personal programs on personal data.
I started the thread to check if my understanding of the matter -- see the
end of <pxbzmaskcke.fsf@news.bourguet.org> for the rest of it -- is
correct.  The most shocking part is that in a locale other than "C", narrow
streams are useless: you may know that it doesn't do a conversion (with
codecvt<>::always_noconv()) but you don't know if it is because there is no
need for one or because the underlying charset is wide.
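A minimal sketch of that check, which shows how little the answer tells
you:

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale loc("");
        typedef std::codecvt<char, char, std::mbstate_t> NarrowCvt;
        bool noconv = std::use_facet<NarrowCvt>(loc).always_noconv();
        // true says only that nothing is converted -- not whether that is
        // because no conversion is needed, or because the bytes of a wide
        // charset are simply passed through undecoded.
        std::cout << std::boolalpha << noconv << '\n';
    }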
Yours,
--
Jean-Marc
---
Author: krixel@qed.pl ("Krzysztof Zelechowski")
Date: Tue, 21 Nov 2006 16:51:52 GMT
Narrow character streams do not decode characters.
And you cannot expect them to do it because it does not work that way.
You cannot decode narrow character sequences.
You can recode them, but it is a two-step process:
decode to wide characters and encode to narrow characters, perhaps using a
different locale.
In theory you can tell what kind of encoding the locale uses by examining
the character length.
If the character length is fixed, you can seek offsets in file buffers;
if it is not, you can only seek positions.
Note that not all implementations of the standard library are reliable:
Dinkumware, for example, tells me that the "C" locale has a variable-length
encoding.
Mr. Plauger is ready to say that he has the right to be fully agnostic;
perhaps he does, but such a doctrine-driven stance does not make much sense
to me as an end user.
If your code contains narrow character literals or narrow string literals,
it must be recompiled because you have to recode the source files.
If you open the source files under a locale that uses an encoding different
from the original,
your literals will probably be unreadable.
No surprise that they do not make sense to your programme either.
Therefore it is always safer to use wide character literals and wide
character streams for all processing purposes.
And if you recode single-byte narrow characters, the source code will be
ill-formed
because it will contain multiple bytes between single quotes;
your compiler may accept them silently (especially at Apple),
but then you will get runtime misbehaviour
because you cannot get such characters from the input stream.
It is the core cause of your failure: narrow I/O streams operate on bytes,
not on characters.
And for a good reason: the UTF-8 encoding of a character can take up to 6
bytes; that will not fit into an integer;
you would have to resort to using UTF-16 surrogates encoded into UTF-8 as
separate characters;
while such a double encoding is technically possible, it is cumbersome and
weird.
Chris
Uzytkownik "Jean-Marc Bourguet" <jm@bourguet.org> napisal w wiadomosci
news:pxbzmaskcke.fsf@news.bourguet.org...
It's true that I may be influenced by my context.  I have files in ISO
8859-1 and programs handling them using char.  New Unix versions tend to
assume UTF-8.  I wanted to change my default locale so that I don't have to
keep searching for how to change the default (GUIs make this more
complicated than "LC_ALL=fr_FR.ISO-8859-1; export LC_ALL" in your .profile).
As Unicode kept the encoding of ISO 8859-1, I assumed that my programs
would work under a UTF-8 locale without recompiling if their data files were
just converted from ISO 8859-1 to UTF-8 (they already set the global
locale).  It didn't work.  I started to read more about locale and found
not only that the little I thought I knew was probably false, but that
the situation seemed confused.
---
Author: jm@bourguet.org (Jean-Marc Bourguet)
Date: Mon, 13 Nov 2006 16:44:55 GMT
My understanding was that char was the type to be used when storing
characters when their code was small enough, and that wchar_t was to be
used in other cases.
I was, somewhat naively apparently, expecting that the conversion done from
the external representation to the internal one depended only on the locale
imbued on the stream and *not* on the width of the stream.  I.e., the
numeric values returned from calls to getc() would be the same, just that a
wide stream would be able to return all values for larger character sets,
and a narrow stream would return an error (badbit set) when the code was
outside the range representable in a char.
That is not what happens in the two implementations I've tried, which are
quite independent.  When I read from a narrow stream after having imbued a
UTF-8 locale, I just get the encoded representation.  When I read from a
wide stream in the same conditions, I get the decoded values.
Reading what I think are the relevant parts of the C++ standard, I see
nothing which mandates either behavior.  Did I miss something?
--
Jean-Marc
---
Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Mon, 13 Nov 2006 18:14:04 GMT
Jean-Marc Bourguet wrote:
> My understanding was that char was the type to be used when storing
> characters when their code was small enough, and that wchar_t was to be
> used in other cases.
>
> I was, somewhat naively apparently, expecting that the conversion done from
> the external representation to the internal one depended only on the locale
> imbued on the stream and *not* on the width of the stream.  I.e., the
> numeric values returned from calls to getc() would be the same, just that
> a wide stream would be able to return all values for larger character
> sets, and a narrow stream would return an error (badbit set) when the
> code was outside the range representable in a char.
>
> That is not what happens in the two implementations I've tried, which are
> quite independent.  When I read from a narrow stream after having imbued a
> UTF-8 locale, I just get the encoded representation.  When I read from a
> wide stream in the same conditions, I get the decoded values.
>
> Reading what I think are the relevant parts of the C++ standard, I see
> nothing which mandates either behavior.  Did I miss something?
>
First of all, only file streams do conversions. The conversion is
performed by a codecvt facet which matches the stream character type.
More precisely, the fstream/codecvt conspiracy provides conversion from
an external sequence (always represented as a sequence of chars) to an
internal sequence of either chars or wide chars. The C++ Standard does
not specify the behavior of any such conversion, except for the trivial
one, where exactly one external char is converted to one internal char
with the same value and vice versa.
So, about your UTF-8 locale, you're deep inside the
implementation-defined realm. Whatever the conversion is doing is ok
from the C++ Standard point of view.
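A minimal sketch of the wide-stream side of this; the locale name is
platform-specific and the file name hypothetical:

    #include <fstream>
    #include <locale>

    int main()
    {
        std::wifstream in;
        in.imbue(std::locale("en_US.UTF-8"));  // name varies by platform
        in.open("utf8.txt");
        wchar_t wc;
        while (in.get(wc)) {
            // wc holds a decoded character; on the implementations
            // discussed in this thread, a narrow ifstream on the same
            // file delivered the raw UTF-8 bytes instead.
        }
    }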
Ganesh
---
Author: "James Kanze" <james.kanze@gmail.com>
Date: Tue, 14 Nov 2006 11:16:18 CST
Jean-Marc Bourguet wrote:
> My understanding was that char was the type to be used when storing
> characters when their code was small enough, and that wchar_t was to be
> used in other cases.
Not really. As far as the standard is concerned, about all you
are guaranteed is that:
-- both the narrow character set and the wide character set
will contain all of the characters in the basic character
set (regardless of locale, I think), and
-- all external IO takes place over narrow characters.
There's no guarantee that wchar_t is larger than a char, for
example. And even less that it is Unicode, or anything else one
might expect.
In practice, most of the time I would expect that in most
locales (and in "C" locale), all of the characters in the basic
character set have the same numeric encoding in the two types,
but this is not guaranteed (and on an IBM mainframe, it might
even make sense for narrow characters to be EBCDIC, and wide
characters Unicode).
> I was, somewhat naively apparently, expecting that the
> conversion done from the external representation to the
> internal one depended only on the locale imbued on the stream
> and *not* on the width of the stream.
It depends on the facet, and different width streams use
different facets.
In practice, I don't quite see how it could be otherwise. The
conversion cannot be the same, since in one case, I either get
multibyte characters, or some characters are not representable,
and in the other, I get wide characters (which are in theory
never multibyte, even if in practice, surrogate characters may
appear).
I wonder if your idea isn't conditioned by the fact that you live
in an area where ISO 8859-1 is widespread, and the fact that all
of the characters in ISO 8859-1 have the same numeric encoding
as in Unicode.  Imagine, however, that you lived in eastern
Europe, and imbued an ISO 8859-2 locale.  What would you expect
if the file contained the character 0xC8 (a C with caron---the
first letter of Czech in Czech) when read as a wide character?
Surely not 0x00C8 (an 'È' in Unicode).
Actually, you don't even have to go as far afield as eastern
Europe.  How do you expect to handle the transition to
ISO 8859-15 (necessary for the Euro, and also, in France, for
the OE and oe ligatures)?  The Unicode representation for Euro
is 0x20AC, which isn't representable in a char on the machines I
generally work on.  Do you really expect some sort of error on
encountering a Euro character when reading a file encoded in
8859-15 with a narrow character stream?  The whole point of
ISO 8859-15 is that I don't need to use wide characters when
working in a western European environment.  (Or at least some
western European environments---I don't think it covers Catalan,
which is western European to me.)
> I.e., the numeric
> values returned from calls to getc() would be the same, just that a wide
> stream would be able to return all values for larger character sets, and a
> narrow stream would return an error (badbit set) when the code was outside
> the range representable in a char.
I'm not quite sure how that could be. The locale determines the
encoding, and is mainly relevant for wide characters (here). I
would expect that most locales (with the exception of exotic
locales like EBCDIC) would suppose that the internal encoding of
char corresponds to that of the locale. This is supported by
the fact that changing the locale also changes the behavior of
functions like isalpha. (isalpha( 0xBD ) should be false in an
ISO 8859-1 locale, but true in ISO 8859-15.)
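A minimal sketch of that difference, assuming both locale names are
installed (the exact names vary by platform):

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale latin1("en_US.ISO8859-1");
        std::locale latin9("en_US.ISO8859-15");
        char c = '\xBD';  // 1/2 sign in Latin-1, oe ligature in Latin-9
        std::cout << std::boolalpha
                  << std::isalpha(c, latin1) << ' '    // false
                  << std::isalpha(c, latin9) << '\n';  // true
    }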
In theory, the same thing may apply to wide characters, of
course, but the intent, I think, is that the wide character
encoding be pretty much locale independent.  Unless historical
considerations argue against it, I would expect that wchar_t be
a 32 bit type, using Unicode, regardless of the locale.  So that
isalpha(wchar_t) would in fact be locale independent.  The
result is, of course, that the character code translation must
then be locale dependent.  (In practice, historical
considerations intervene more often than not, and of the three
systems to which I have ready access---Solaris, Linux and
Windows---, only Linux does it this way.)
This philosophy is reflected, at least in Posix, in the naming
conventions of the locale, which reflect the name of the narrow
character encoding, and do not contain any component specifying
the wide character encoding (which thus must be assumed to be
locale independent).
> That is not what happens in the two implementations I've tried, which are
> quite independent.  When I read from a narrow stream after having imbued a
> UTF-8 locale, I just get the encoded representation.  When I read from a
> wide stream in the same conditions, I get the decoded values.
Which is what I would more or less expect.  The internal
encoding of char varies according to the locale---otherwise,
there would be no point in making isalpha(char) locale
dependent, and people living in places where ISO 8859-1 is not
appropriate (e.g. anywhere in the Euro zone, today) would be
pretty much screwed.
> Reading what I think are the relevant parts of the C++ standard, I see
> nothing which mandates either behavior.  Did I miss something?
Well, there's very little concerning locales (other than "C")
and wide characters that isn't implementation dependent, so
you're probably reading in the wrong place.  I think that the
intent, however, is what I just explained.  But you'll have to
check the implementation documentation each time to see what
they think.
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
---
Author: jm@bourguet.org (Jean-Marc Bourguet)
Date: Fri, 17 Nov 2006 03:08:44 GMT
"James Kanze" <james.kanze@gmail.com> writes:
> Jean-Marc Bourguet wrote:
> > My understanding was that char was the type to be used when storing
> > characters when their code was small enough, and that wchar_t was to be
> > used in other cases.
>
> Not really.  As far as the standard is concerned, about all you
> are guaranteed is that:
>
>   -- both the narrow character set and the wide character set
>      will contain all of the characters in the basic character
>      set (regardless of locale, I think), and
With positive codes.  And I think that the codes of characters in the basic
sets can't be locale specific.  (The standard says in 2.2/3 "The values of
the members of the execution character sets are implementation-defined, and
any additional members are locale-specific.")
> -- all external IO takes place over narrow characters.
>
> There's no guarantee that wchar_t is larger than a char, for
> example.  And even less that it is Unicode, or anything else one
> might expect.
>
> In practice, most of the time I would expect that in most
> locales (and in "C" locale),
My reading of 2.2 is that the codes of the characters in the basic
character set may not depend on the locale.
> all of the characters in the basic character set have the same numeric
> encoding in the two types, but this is not guaranteed (and on an IBM
> mainframe, it might even make sense for narrow characters to be EBCDIC,
> and wide characters Unicode).
I agree.  And I don't even have a clue about non-Unicode wide
representations (I know a little about some multibyte representations, but
I don't know how they are transformed into a wide representation).
> > I was, somewhat naively apparently, expecting that the conversion done
> > from the external representation to the internal one depended only on
> > the locale imbued on the stream and *not* on the width of the stream.
>
> It depends on the facet, and different width streams use different
> facets.
Facets are part of a locale...  I was assuming that facets for handling a
given encoding would behave essentially the same.
> In practice, I don't quite see how it could be otherwise. The
> conversion cannot be the same, since in one case, I either get
> multibyte characters, or some characters are not representable,
> and in the other, I get wide characters (which are in theory
> never multibyte, even if in practice, surrogate characters may
> appear).
If you consider combining characters, you still have multi-word characters.
And combining characters are quite old (in some national variants of
ISO 646, sequences of accent, backspace officially act as combining
characters).  And you have to handle combining characters to get a sensible
user-level behaviour.
> I wonder if your idea isn't conditioned by the fact that you live
> in an area where ISO 8859-1 is widespread, and the fact that all
> of the characters in ISO 8859-1 have the same numeric encoding
> as in Unicode.
It's true that I may be influenced by my context.  I have files in ISO
8859-1 and programs handling them using char.  New Unix versions tend to
assume UTF-8.  I wanted to change my default locale so that I don't have to
keep searching for how to change the default (GUIs make this more
complicated than "LC_ALL=fr_FR.ISO-8859-1; export LC_ALL" in your .profile).
As Unicode kept the encoding of ISO 8859-1, I assumed that my programs
would work under a UTF-8 locale without recompiling if their data files were
just converted from ISO 8859-1 to UTF-8 (they already set the global
locale).  It didn't work.  I started to read more about locale and found
not only that the little I thought I knew was probably false, but that
the situation seemed confused.
In C, narrow streams have to return the encoded form.  I haven't seen a way
to know if the charset in use is narrow.  So you can't do much with char:
you can't give an error message if the encoding is not narrow.  You can do
binary IO (if the stream has been opened with "b") and use the multibyte
functions, as sketched below.
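A minimal sketch of that fallback, with a hypothetical file name:

    #include <clocale>
    #include <cstdio>
    #include <cwchar>

    int main()
    {
        std::setlocale(LC_ALL, "");  // pick up the user's multibyte charset
        std::FILE* f = std::fopen("data.txt", "rb");  // "b": bytes untouched
        if (!f) return 1;
        char buf[16];
        std::size_t n = std::fread(buf, 1, sizeof buf, f);
        std::mbstate_t st = std::mbstate_t();
        const char* p = buf;
        wchar_t wc;
        while (p < buf + n) {
            std::size_t len = std::mbrtowc(&wc, p, buf + n - p, &st);
            if (len == (std::size_t)-1 || len == (std::size_t)-2)
                break;              // invalid or incomplete sequence
            if (len == 0) len = 1;  // decoded an embedded L'\0'
            p += len;               // wc now holds one decoded character
        }
        std::fclose(f);
    }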
In C++, I saw nothing either mandating or preventing a read from a narrow
stream to return the encoded form.  I saw nothing either mandating or
preventing a read from a narrow stream to return decoded characters, with a
run-time error if the character is not representable.  I saw no way to
know if the charset in use is narrow, and so whether it is possible to
safely use a narrow stream.  Again, chars are not really usable for textual
IO in robust programs.  We aren't even in the C situation, where you can use
a narrow stream for binary reading under a global locale which uses a
multibyte encoding, or use the multibyte functions.  I hope I'm wrong.
> Imagine, however, that you lived in eastern Europe, and imbued an ISO
> 8859-2 locale.  What would you expect if the file contained the character
> 0xC8 (a C with caron---the first letter of Czech in Czech) when read as a
> wide character?  Surely not 0x00C8 (an 'È' in Unicode).
No.  I still expect either the precomposed character for C caron, or a C
and a combining caron (if the internal encoding for wide characters is
Unicode).  If I'm wrong and you can't safely read a narrow encoding into a
wide stream, then you can't do robust text IO at all, except by doing
binary IO in the classic locale and reinterpreting the encoding yourself.
> Actually, you don't even have to go as far afield as eastern Europe.  How
> do you expect to handle the transition to ISO 8859-15 (necessary for the
> Euro, and also, in France, for the OE and oe ligatures)?  The Unicode
> representation for Euro is 0x20AC, which isn't representable in a char on
> the machines I generally work on.  Do you really expect some sort of
> error on encountering a Euro character when reading a file encoded in
> 8859-15 with a narrow character stream?
No.  I expected an error when reading a Euro character in a narrow stream
with a UTF-8 locale.
> The whole point of ISO 8859-15 is that I don't need to use wide
> characters when working in a western European environment.  (Or at least
> some western European environments---I don't think it covers Catalan,
> which is western European to me.)
>
> > I.e., the numeric values returned from calls to getc() would be the
> > same, just that a wide stream would be able to return all values for
> > larger character sets, and a narrow stream would return an error
> > (badbit set) when the code was outside the range representable in a
> > char.
>
> I'm not quite sure how that could be.  The locale determines the
> encoding, and is mainly relevant for wide characters (here).
Is it relevant at all for narrow characters?  I'd prefer the C situation,
where we can do at least some things with a narrow stream (binary IO, or IO
then converting ourselves with the multibyte functions), to the apparent
situation in C++, where we can't count on anything with a narrow stream
without making assumptions about the locale and without being able to check
that these assumptions hold.
> I would expect that most locales (with the exception of exotic locales
> like EBCDIC) would suppose that the internal encoding of char corresponds
> to that of the locale.
Well, my reading is that an EBCDIC locale in an otherwise ASCII environment
must do remapping in IO to work.  And that this is possible in C++ with a
narrow stream but not in C.
> This is supported by the fact that changing the
> locale also changes the behavior of functions like isalpha.  (isalpha(
> 0xBD ) should be false in an ISO 8859-1 locale, but true in ISO 8859-15.)
>
> In theory, the same thing may apply to wide characters, of
> course, but the intent, I think, is that the wide character
> encoding be pretty much locale independent.
I can see a Unix supporting a Japanese locale with an EUC external
encoding put into a traditional internal encoding (I don't have any idea of
what those are) as well as with a UTF-16 external encoding put into a
UTF-16 or UTF-32 internal form.
> Unless historical considerations argue against it, I would expect that
> wchar_t be a 32 bit type, using Unicode, regardless of the locale.  So
> that isalpha(wchar_t) would in fact be locale independent.  The result
> is, of course, that the character code translation must then be locale
> dependent.  (In practice, historical considerations intervene more often
> than not, and of the three systems to which I have ready
> access---Solaris, Linux and Windows---, only Linux does it this way.)
>
> This philosophy is reflected, at least in Posix, in the naming
> conventions of the locale, which reflect the name of the narrow
> character encoding, and do not contain any component specifying
> the wide character encoding (which thus must be assumed to be locale
> independent).
I wouldn't call UTF-8 a narrow character encoding.
> > That is not what happens in the two implementations I've tried, which
> > are quite independent.  When I read from a narrow stream after having
> > imbued a UTF-8 locale, I just get the encoded representation.  When I
> > read from a wide stream in the same conditions, I get the decoded
> > values.
>
> Which is what I would more or less expect.  The internal
> encoding of char varies according to the locale---otherwise,
> there would be no point in making isalpha(char) locale
> dependent, and people living in places where ISO 8859-1 is not
> appropriate (e.g. anywhere in the Euro zone, today) would be
> pretty much screwed.
You seem to assume even more constraints here than I did.
What I assumed (see below for my understanding of the situation) was:
- binary IO: do as few transformations on the transmitted bytes as possible
  in the given context; those transformations are locale independent.
- text IO:
  - '\n' instead of the platform conventions (CR, CR-LF, LF, record length
    -- sadly we don't know if '\n' is a line separator or a line terminator)
  - check end of file with platform conventions for text (^Z for instance)
  - decode the external encoding (which is the same for both narrow and
    wide streams) to an internal representation (which may be locale
    dependent -- hence the localization of character classification --
    and which may be different for narrow and wide chars), with an error
    if the encoding is meaningless or the decoded result is not available
    in the chosen representation (this last case not being possible for
    wide characters)
> > Reading what I think are the relevant parts of the C++ standard, I see
> > nothing which mandates either behavior.  Did I miss something?
>
> Well, there's very little concerning locales (other than "C")
> and wide characters that isn't implementation dependent, so
> you're probably reading in the wrong place.  I think that the
> intent, however, is what I just explained.  But you'll have to
> check the implementation documentation each time to see what
> they think.
There are so few constraints put on locale that I wonder what it is
possible to do with narrow IOStreams without assuming things that you can't
check about the imbued locale.  My current understanding:
- IOStream text IO:
  - '\n' instead of the platform conventions (CR, CR-LF, LF, record length
    -- sadly we don't know if '\n' is a line separator or a line terminator)
  - check end of file with platform conventions for text (^Z for instance)
  - perhaps other transformations?
  - wide stream: encode/decode the external encoding to an internal
    representation (which may be locale dependent -- hence the localization
    of character classification) with an error if the encoding is
    meaningless.
  - narrow stream: don't know if there is an encoding/decoding or not.
- IOStream binary IO:
  - we don't have the '\n', end-of-file and other(?) transformations.
  - we have the encoding/decoding... if there is one.
- C stream text IO:
  - '\n' instead of the platform conventions (CR, CR-LF, LF, record length
    -- sadly we don't know if '\n' is a line separator or a line terminator)
  - check end of file with platform conventions for text (^Z for instance)
  - perhaps other transformations?
  - narrow stream: no transformation, whatever the locale is
  - wide stream: encode/decode the external encoding (which is the same
    for both narrow and wide streams) to an internal representation (which
    may be locale dependent).
- C stream binary IO:
  - we don't have the '\n', end-of-file and other(?) transformations.
  - narrow stream: no transformation, whatever the locale is
  - wide stream: encode/decode the external encoding (which is the same
    for both narrow and wide streams) to an internal representation (which
    may be locale dependent).
Yours,
--
Jean-Marc
---