Topic: A few questions about internationalization of characters
Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: 19 Sep 2002 22:00:41 GMT
"Martin v. L wis" <loewis@informatik.hu-berlin.de> wrote in message
news:j48z1ytnvf.fsf@informatik.hu-berlin.de...
> pjp@dinkumware.com ("P.J. Plauger") writes:
> > The C++ Standard includes by reference the library portion of the C
> > Standard. The header <ctype.h> spells out the basic properties of the
> > is* classification functions.
>
> Correct, so I can find out what std::isalpha(int) means, in C++. That
> does not answer the question what table() does, does it?
It says a lot about it, because of the stated interconnected requirements.
Of course, if you don't want to see that, I'm sure you can avoid it.
> > They let you *add* elements from the extended character set when you
> > switch from the "C" locale, but you can't take away any of the
> > letters from the basic C character set.
>
> I see. It seems you imply that the classification of a character
> cannot change between locales, so that if 'A' is a letter in the "C"
> locale, it also must be a letter in all other locales (I agree it must
> be present, but this is a different matter).
I was speaking of the basic C character set, not all characters.
> AFAICT, the standard never spells out this property,
You have to read a lot of separate small statements, mostly in <ctype.h>,
to put together this picture, and I have done so on several occasions.
But as before, you can certainly avoid seeing this with a minimum of
effort.
> and it is common
> practice that classifications of characters *do* change across
> locales, e.g. \u00f6, on Linux, is a character in de_DE.ISO-8859-1,
> but not in ru_RU.UTF-8.
I never said otherwise. I confined my remarks to the basic C character
set.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Fri, 20 Sep 2002 15:14:03 +0000 (UTC)
"James Kanze" <kanze@gabi-soft.de> wrote in message
news:d6651fb6.0209190733.3ec3fc15@posting.google.com...
> pjp@dinkumware.com ("P.J. Plauger") wrote in message
> news:<3d88929a$0$13082$724ebb72@reader2.ash.ops.us.uu.net>...
>
> > The C++ Standard includes by reference the library portion of the C
> > Standard. The header <ctype.h> spells out the basic properties of the
> > is* classification functions. They let you *add* elements from the
> > extended character set when you switch from the "C" locale, but you
> > can't take away any of the letters from the basic C character set.
>
> The question is: can you move them?
I have said several times in other parts of this thread that the C Standard
doesn't contemplate the possibility that either 'x' or L'x' (code values
defined by the basic C character set) can change value when you switch
locales. It sort of suggests that elements of the *extended* character
set (all the rest) can change, but it doesn't spell out exactly how.
That's why I keep saying that you should avoid literals that use anything
but the basic C character set if you want to stay out of this no-man's
land.
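(To make that advice concrete, here is a minimal sketch; the string with the
\xE9 bytes stands for any literal that uses extended characters, and its
run-time meaning is exactly the locale-specific grey area described above.)

    #include <clocale>
    #include <cstdio>

    int main()
    {
        const char safe[]  = "total: ";       // basic C character set only:
                                              // meaning survives locale changes
        const char risky[] = "r\xE9sum\xE9";  // extended characters: meaning
                                              // after setlocale() is
                                              // locale-specific territory
        std::setlocale(LC_ALL, "");           // leave the "C" locale
        std::printf("%s\n", safe);
        std::printf("%s\n", risky);           // may or may not display as intended
        return 0;
    }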
> I can imagine that support for an EBCDIC locale could be quite useful on
> a machine which generally uses ASCII (or 8859-1, or 8859-15, or ...). I
> am almost certain that if a demand for it arises, library vendors will
> provide it, regardless of what the standard says. But I don't see how
> std::isalpha( 'a', locale( "us_US.ebcdic" ) ) can possibly be made to
> return true.
We do indeed provide support for EBCDIC in a couple of ways. The sanest
way is to pretend that EBCDIC is a multibyte encoding and ISO-8859-1
is the wide-character encoding. If you use this translation rule only
when reading and writing wide streams, you avoid conflicts with literals.
OTOH, you can always play fast and loose inside a C program and switch
to all sorts of interesting locales that break all sorts of rules.
Like writing a device or interrupt handler, you have to walk on eggs,
but you can do useful things.
> To be frank, I really don't know what the correct solution should be,
> much less what the standard says.
The C Standard doesn't say nearly enough, for a variety of reasons.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
---
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Wed, 18 Sep 2002 16:18:07 +0000 (UTC)
"Martin v. L wis" <loewis@informatik.hu-berlin.de> wrote in message
news:j4it15ja81.fsf@informatik.hu-berlin.de...
> > > - isalpha('A'), in C++, has the same meaning as it has in C; I *believe*
> > > that the meaning is, strictly speaking, unimplementable (or you can't
> > > support locales which differ in the numbering of the basic characters)
> >
> > You can believe it as hard as you'd like. It's still implementable in the
> > commonsense case where the basic C character set doesn't get renumbered.
>
> Ok, then what about isalpha('\u00F6')? Is it reasonable for an
> implementation to get this "right" (i.e. the character always denoting
> the same Unicode character, even if the locale changes)?
Assuming \u00F6 is representable as a single-byte character, I believe
this refers to a character not in the basic C character set. Therefore,
it is harder to make a case that the meaning will survive a change in
locale. (I choose my words carefully, because the C Standard does not
spell out at all clearly what changes can occur in character sets when
you change locales. I believe you can read the C Standard to require the
basic C character set to be uniform across all supported locales, both
as a single-byte and a wide character.)
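(A small sketch of what "surviving a locale change" means in practice, using
the wide-character variant to sidestep the signed-char issue. The glibc-style
locale name is an assumption, and an implementation that does not support the
UCN in a wide literal is free to behave differently.)

    #include <clocale>
    #include <cwctype>
    #include <cstdio>

    int main()
    {
        // In the "C" locale an extended character such as \u00F6 need not
        // be classified as a letter at all.
        std::setlocale(LC_ALL, "C");
        std::printf("C locale: %d\n", std::iswalpha(L'\u00F6') != 0);

        // After a locale switch the classification may change; the locale
        // name below is purely illustrative.
        if (std::setlocale(LC_ALL, "de_DE.ISO-8859-1"))
            std::printf("de_DE   : %d\n", std::iswalpha(L'\u00F6') != 0);
        return 0;
    }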
> > > - std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
> > > since it assumes a widened character.
> >
> > Most behavior involving locales is *locale specific*, not implementation
> > defined. In this case, however, all locales are required to categorize
> > 'A' as an alpha character.
>
> Can you elaborate where the C++ standard requires that? My
> understanding is that this is equivalent to
>
> use_facet< ctype<char> >(locale()).table()[(unsigned char)'A']
> & ctype_base::alpha
>
> where the contents of table seem to be specified nowhere (not even
> for the "C" locale).
The C++ Standard includes by reference the library portion of the C
Standard. The header <ctype.h> spells out the basic properties of the
is* classification functions. They let you *add* elements from the
extended character set when you switch from the "C" locale, but you
can't take away any of the letters from the basic C character set.
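(Read operationally, the claim is roughly that a test like the following can
never fire, whatever locale the environment happens to name; the assert
expresses the guarantee being asserted here, not a literal quotation from the
standard.)

    #include <cassert>
    #include <cctype>
    #include <clocale>

    int main()
    {
        static const char basic[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                    "abcdefghijklmnopqrstuvwxyz";
        std::setlocale(LC_ALL, "");   // whatever locale the environment selects
        for (const char *p = basic; *p != '\0'; ++p)
            assert(std::isalpha((unsigned char)*p));  // basic letters must
                                                      // remain letters
        return 0;
    }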
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
---
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Wed, 18 Sep 2002 22:59:21 +0000 (UTC)
"Martin v. L wis" <loewis@informatik.hu-berlin.de> wrote in message
news:j4admhj9i9.fsf@informatik.hu-berlin.de...
> > > If that is the recommendation, what good are universal character
> > > names?
> [...]
> > > It is ok to deviate from the expected behaviour if a character is not
> > > supported. It is also ok if the expected behaviour does not
> > > materialize if you fail to use the library correctly.
> >
> > Right. You've answered your own question.
>
> No. A UCN is useless if the implementation does not support it in the
> execution character set.
Not entirely useless. They give the compiler a chance to warn you that
you're using an unsupported character code.
> Many implementations *do* support many
> additional characters, so UCNs ought to work for those additional
> characters.
Maybe they *ought* to, for some applications, but they're not required to
by the C Standard.
> If they don't "work", there is either an error in the implementation,
> or an error in the language standard.
Or there are conflicting goals that are not necessarily met by all
conforming implementations.
> > > But I think there must be a completely portable way to output "funny"
> > > characters which are specified in the source code.
> >
> > If you mean portable across arbitrary changes in execution character set,
> > then dream on. We did discuss this in the C committee and decided it was
> > not a problem we were going to consider. (C++ has done nothing more.)
>
> I'm talking about run-time changes to the locale that are not
> arbitrary, but in a way that leaves certain extended characters
> available. I wonder how I can denote those characters (which I know
> will be present in all locales I'm going to use) in source code.
If you know that your implementation will support all the locales you're
going to use, then your implementation may well tell you how to denote
them portably across all such locales. There's no guaranteed way to do
so across all conforming implementations, however.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 19 Sep 2002 17:27:20 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") wrote in message
news:<3d88929a$0$13082$724ebb72@reader2.ash.ops.us.uu.net>...
> The C++ Standard includes by reference the library portion of the C
> Standard. The header <ctype.h> spells out the basic properties of the
> is* classification functions. They let you *add* elements from the
> extended character set when you switch from the "C" locale, but you
> can't take away any of the letters from the basic C character set.
The question is: can you move them?
I can imagine that support for an EBCDIC locale could be quite useful on
a machine which generally uses ASCII (or 8859-1, or 8859-15, or ...). I
am almost certain that if a demand for it arises, library vendors will
provide it, regardless of what the standard says. But I don't see how
std::isalpha( 'a', locale( "us_US.ebcdic" ) ) can possibly be made to
return true.
To be frank, I really don't know what the correct solution should be,
much less what the standard says.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Thu, 19 Sep 2002 17:32:50 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") writes:
> > Can you elaborate where the C++ standard requires that? My
> > understanding is that this is equivalent to
> >
> > use_facet< ctype<char> >(locale()).table()[(unsigned char)'A']
> > & ctype_base::alpha
> >
> > where the contents of table seem to be specified nowhere (not even
> > for the "C" locale).
>
> The C++ Standard includes by reference the library portion of the C
> Standard. The header <ctype.h> spells out the basic properties of the
> is* classification functions.
Correct, so I can find out what std::isalpha(int) means, in C++. That
does not answer the question what table() does, does it?
> They let you *add* elements from the extended character set when you
> switch from the "C" locale, but you can't take away any of the
> letters from the basic C character set.
I see. It seems you imply that the classification of a character
cannot change between locales, so that if 'A' is a letter in the "C"
locale, it also must be a letter in all other locales (I agree it must
be present, but this is a different matter).
AFAICT, the standard never spells out this property, and it is common
practice that classifications of characters *do* change across
locales, e.g. \u00f6, on Linux, is a character in de_DE.ISO-8859-1,
but not in ru_RU.UTF-8.
Regards,
Martin
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Mon, 16 Sep 2002 22:30:22 +0000 (UTC)
loewis@informatik.hu-berlin.de (Martin v. Löwis) wrote in message
news:<j4ptvfqe3h.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > What do you mean: is "isalpha( 'A' )" illegal? Or "std::isalpha(
> > 'A', std::locale() )"?
> None of those are illegal; they might be implementation defined. My
> head starts spinning, so I guess instead of reading and re-reading the
> text:
> - isalpha('A'), in C++, has the same meaning as it has in C; I *believe*
> that the meaning is, strictly speaking, unimplementable (or you can't
> support locales which differ in the numbering of the basic characters)
Before saying that the meaning is unimplementable, we have to figure out
what it is.
What is certain is that 1) the results of isalpha depend on the locale,
at least for characters outside of the basic character set, and that 2)
whether the invocation in question is guaranteed to return true depends
on how much liberty the standard gives an implementation with regards to
character set variation in a locale.
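(For point 1, a concrete illustration using the 0xBD example that comes up
later in the thread: that code point is FRACTION ONE HALF in ISO 8859-1 but
LATIN SMALL LIGATURE OE in ISO 8859-15, so the classification may legitimately
differ. The glibc-style locale names are assumptions.)

    #include <cctype>
    #include <clocale>
    #include <cstdio>

    int main()
    {
        if (std::setlocale(LC_ALL, "fr_FR.ISO-8859-1"))
            std::printf("8859-1 : %d\n", std::isalpha(0xBD) != 0);  // typically 0
        if (std::setlocale(LC_ALL, "fr_FR.ISO-8859-15"))
            std::printf("8859-15: %d\n", std::isalpha(0xBD) != 0);  // typically 1
        return 0;
    }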
> - std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
> since it assumes a widened character.
Where do you find that it assumes a widened character? The function is
a template; template type deduction returns a function which takes a
char (since the type of 'A' is char). So what is getting widened?
> > My impression is that at least some of the authors of the C standard
> > had some concrete experience with some forms of localization. At
> > least enough experience to know to stop when they didn't know what
> > they were doing. In C++, I'm not sure that even this is the case.
> I think the idea of "locale objects", and "facets" of those, was based
> on real experience with the C library, noting limitations (such as
> thread-unsafety, and the globalness of the locale). Whether, at that
> time, there was real experience with the proposed API, I cannot say.
Have you ever actually tried to write internationalized code using the
C++ library?
The C library has a number of limits. But they are explicit limits; you
can't define a new locale yourself, for example. The C++ library, in
the end, hardly offers more in real usability than the C library, at
considerably more complexity. The possibility of imbuing a locale in a
stream is the one real advantage I can see, but it comes at a much
higher price than need be.
> It appears that none of the experts working on the library actually
> looked carefully enough at the core languages; the core people seem to
> have copied the C text as-is.
> > If I understand you correctly, there is an incompatibility between C
> > and C++ as to the meaning of a string literal:-). Of course, all of
> > the C compilers actually implement the C++ semantics, so...
> Yes, this is my understanding. In that sense, the C++ definition seems
> "more right", even though it might be terrible to use (since you have
> to invoke operations to convert from the ECS to the locale's charset
> in many cases).
> > Which means that I actually have many "execution narrow character
> > sets", and the compiler chooses one to convert my string literals.
> > (From a practical point of view. This obviously isn't standardese.)
> A compiler is certainly entitled to operate that way. To be
> conforming, it has to document which locale it uses.
Or how it chooses the locale. (Presumably taking the local LC_CTYPE is
permitted.)
> > > > My point is that the compiler is required to accept 'A', and it is
> > > > required to translate the resulting character into an integral value
> > > > which represents 'A' in the execution character set. Since 'A' is
> > > > an upper case character, this would seem to guarantee the results of
> > > > calling isupper on the value.
> [...]
> > The first is just a reference to the C standard, where the function
> > takes an integer with values in the range 0..UCHAR_MAX. Since 'A'
> > is guaranteed to be positive (even as a char), the implicit
> > conversion to int is well defined, and guaranteed to give a value in
> > the required range. Thus, the call is legal.
> > The second is a template on the character type. You certainly
> > aren't going to tell me that I cannot use char as the character
> > type.
> Both calls are "legal" (i.e. well-formed, and not even causing
> undefined behaviour).
Calling isalpha in C with a signed character is a no-no, and results in
undefined behavior. Sun almost makes it work (the results are wrong only
for the one character whose value, as a signed char, collides with EOF);
with most other compilers, you get random results, and if there were ever
a C compiler with bounds checking, you would probably get a core dump.
This is a well known problem, and explains why most European programmers
systematically cast to unsigned char before any of the functions in
ctype.h.
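(The idiom in question, as a one-line sketch:)

    #include <cctype>

    // Force the argument into the 0..UCHAR_MAX range before calling the
    // <cctype> functions, so a plain char that happens to be signed and
    // negative cannot trigger undefined behavior.
    inline bool is_alpha_safe(char c)
    {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    }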
> At least the C++ version, AFAICT, causes implementation-defined
> behaviour - so while std::isalpha('A',locale()) might be well-formed,
> it still has implementation-defined behaviour.
> > In neither case is widen necessary, or even useful.
> I think ctype::widen is used to convert from the ECS to the locale's
> charset. That is necessary for both strings and wide strings (or is it
> narrow() that I'm talking about?)
I don't see where it would be necessary for normal char's. For wide
chars, of course, I can't use widen.
> > I don't think you should have this guarantee. But I ask myself,
> > what does converting this character from the basic source character
> > set to the basic execution character set mean? I think you'll agree
> > that 'A' is an upper case letter in the basic source character set.
> > Shouldn't it be an upper case letter in the basic execution
> > character set?
> Yes, it certainly should.
> > And if it is an upper case letter, shouldn't isupper return true?
> Not necessarily. isupper should operate according to the locale, and
> the ECS cannot change with the locale. So isupper is only useful when
> you convert the ECS character to the locale's character set.
You're telling me what I already know with regards to the problems of
implementing. I'm asking what the standard requires. (It wouldn't be
the first time that the standard required something impossible.)
> > (I actually suspect that this is a question for comp.std.c, and that
> > C++ should just follow, on the grounds that I see no justification
> > for incompatibility here.)
> Given that the C semantics is unimplementable (not necessarily for the
> basic character set, but certainly for UCNs), I don't think that C++
> should follow C - I'd rather expect C to "change" (perhaps clarifying
> would be sufficient).
> > On my machine, I get a value of 0xE1 again. As an int (C), this is
> > 225, which is in the range of 0...UCHAR_MAX, and so legal. As a
> > char (C++), this is -31, which is NOT in the range of 0...UCHAR_MAX,
> > and so the call to isupper results in undefined behavior.
> Why is that? isupper expects an int...
An int in the range 0...UCHAR_MAX or EOF. Anything else is undefined
behavior. See 7.4/1 in the C standard.
> > Because the value, in C++, is not in the range 0...UCHAR_MAX.
> Why does it have to be?
Because the C standard says it has to.
[...]
> > I was being a bit sarcastic. I know the advantages of UTF-8, but I
> > am sceptical that one solution is appropriate for everything.
> While C attempts the solution for everything, I don't think C++
> does. And for the "typical" application areas, I do think UTF-8 would
> be a reasonable compromise.
For most users, with modern systems (Windows or Unix), I agree.
> > The problem is that within my program, I either have to use wchar_t
> > everywhere (with excessive memory overhead -- eventually leading to
> > page faults and excessive runtime overhead), or I have to deal with
> > multibyte characters.
> I think that the "excessive memory overhead" of wchar_t is a red
> herring.
It's not just memory. If I write my locale data to a file using UTF-8,
the file is bigger than if I use ISO 8859-15. But globally, if you only
consider the mainstream Unix and Windows on PC, you're probably right.
(Although I don't know. I've seen customers keep surprising amounts of
data in memory. Multiplying the size by four could make a difference.
If only in the number of page faults, and thus the execution speed.)
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Tue, 17 Sep 2002 12:03:17 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> > - std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
> > since it assumes a widened character.
>
> Where do you find that it assumes a widened character? The function is
> a template; template type deduction returns a function which takes a
> char (since the type of 'A' is char). So what is getting widened?
std::isalpha expects 'charT' values. ctype<charT>::widen(char)
converts a char to a charT.
It is implicitly invoked in some cases (e.g. operator<< for
basic_ostream); in other cases, I think it must be invoked explicitly.
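(A sketch of that conversion chain for wchar_t; nothing here goes beyond the
facet interfaces themselves, and the widen step is the explicit invocation
being talked about.)

    #include <locale>

    int main()
    {
        std::locale loc;                        // copy of the current global locale
        const std::ctype<wchar_t>& ct =
            std::use_facet< std::ctype<wchar_t> >(loc);

        wchar_t wc = ct.widen('A');             // char -> charT conversion
        return std::isalpha(wc, loc) ? 0 : 1;   // classification on charT values
    }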
> Have you ever actually tried to write internationalized code using the
> C++ library?
Yes.
> Calling isalpha in C with a signed character is a no-no, and results in
> undefined behavior.
Right; I missed that.
> > I think ctype::widen is used to convert from the ECS to the locale's
> > charset. That is necessary for both strings and wide strings (or is it
> > narrow() that I'm talking about?)
>
> I don't see where it would be necessary for normal char's. For wide
> chars, of course, I can't use widen.
ctype::widen converts a char to the locale's charset. So if you want
the locale's notion of what is a letter, you first need to make sure
the character is encoded in the locale's codeset.
> > Not necessarily. isupper should operate according to the locale, and
> > the ECS cannot change with the locale. So isupper is only useful when
> > you convert the ECS character to the locale's character set.
>
> You're telling me what I already know with regards to the problems of
> implementing. I'm asking what the standard requires. (It wouldn't be
> the first time that the standard required something impossible.)
This is sort-of inferred. When writing to a stream, the C++ library
uses ctype::widen *even for char*. While the meaning of isalpha really
appears not to be specified at all (beyond that it is table() &
ctype::alpha), I assume that table() will also operate on the locale's
charset.
Regards,
Martin
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Wed, 18 Sep 2002 01:47:44 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") writes:
> > - isalpha('A'), in C++, has the same meaning as it has in C; I *believe*
> > that the meaning is, strictly speaking, unimplementable (or you can't
> > support locales which differ in the numbering of the basic characters)
>
> You can believe it as hard as you'd like. It's still implementable in the
> commonsense case where the basic C character set doesn't get renumbered.
Ok, then what about isalpha('\u00F6')? Is it reasonable for an
implementation to get this "right" (i.e. the character always denoting
the same Unicode character, even if the locale changes)?
> > - std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
> > since it assumes a widened character.
>
> Most behavior involving locales is *locale specific*, not implementation
> defined. In this case, however, all locales are required to categorize
> 'A' as an alpha character.
Can you elaborate where the C++ standard requires that? My
understanding is that this is equivalent to
use_facet< ctype<char> >(locale()).table()[(unsigned char)'A']
& ctype_base::alpha
where the contents of table seem to be specified nowhere (not even
for the "C" locale).
> And widening does not occur here, unless you mean the widening of a
> char to an int when passed as a function argument.
No, it doesn't take place. I meant that applications need to invoke
use_facet< ctype<char> >(locale()).widen('A')
before the character is suitable for being passed to isalpha, since
isalpha uses the locale's charset, not the execution character set.
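(Spelled out as a helper, the calling sequence being described looks something
like the following; whether the widen() step really is required is exactly the
point under discussion in this thread.)

    #include <locale>

    bool locale_says_alpha(char c, const std::locale& loc)
    {
        // Express the value in the locale's own character set first, then
        // ask the locale to classify it.
        char widened = std::use_facet< std::ctype<char> >(loc).widen(c);
        return std::isalpha(widened, loc);
    }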
> > No. Character literals are widened (or is it narrowed?) before being
> > written. That performs necessary conversions from the ECS to the
> > character set of the locale imbued on the stream
>
> The execution environment has no way of knowing that a particular integer
> value, or array of char, or array of wchar_t, originated as some form of
> character literal. It works with them all at face value; no implicit
> conversions occur.
Correct. That's why explicit widening might be needed.
Regards,
Martin
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Wed, 18 Sep 2002 01:47:50 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") writes:
> > If that is the recommendation, what good are universal character
> > names?
[...]
> > It is ok to deviate from the expected behaviour if a character is not
> > supported. It is also ok if the expected behaviour does not
> > materialize if you fail to use the library correctly.
>
> Right. You've answered your own question.
No. A UCN is useless if the implementation does not support it in the
execution character set. Many implementations *do* support many
additional characters, so UCNs ought to work for those additional
characters.
If they don't "work", there is either an error in the implementation,
or an error in the language standard.
> > But I think there must be a completely portable way to output "funny"
> > characters which are specified in the source code.
>
> If you mean portable across arbitrary changes in execution character set,
> then dream on. We did discuss this in the C committee and decided it was
> not a problem we were going to consider. (C++ has done nothing more.)
I'm talking about run-time changes to the locale that are not
arbitrary, but in a way that leaves certain extended characters
available. I wonder how I can denote those characters (which I know
will be present in all locales I'm going to use) in source code.
Regards,
Martin
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Fri, 13 Sep 2002 17:04:17 +0000 (UTC)
loewis@informatik.hu-berlin.de (Martin v. Löwis) wrote in
message news:<j4ofb44c6r.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > > Just as the standard says: behaviour that depends on local
> > > conventions of nationality, culture, and language.
> > So why don't they say that, instead of referring to "locale"? As
> > far as I can see, if your interpretation is correct, this is the
> > only use of locale in the standard which doesn't refer in some way
> > or another to the locale feature of the library.
> They don't refer to "locale". They use the term "locale-specific",
> which is defined in 1.3.7.
OK. I missed the definition.
> It is surprising that the term is used only once in the entire
> standard.
> > > Read your compiler's documentation. It is implementation-defined
> > > (but may have to follow local law).
> > The standard also makes certain requirements. The classification of
> > the character changes, for example.
> You can't use a character literal to classify it in a std::locale
> object; you first have to widen it with the ctype's facet.
What do you mean: is "isalpha( 'A' )" illegal? Or "std::isalpha( 'A',
std::locale() )"?
> > I would say rather that there is no compile-time locale in C++. The
> > behavior of std::isalpha( int ) can definitely be changed at runtime,
> > and the standard says that it depends on the current locale. (The
> > exact words are in ISO/IEC 9899:1990 7.4/2, which is included by
> > reference in the C++ standard.) So call it the current locale,
> > rather than the run-time locale. It doesn't change anything.
> There is a relationship between the global locale
> (std::locale::global()) and the C "current" locale, although I'm not
> sure what precisely this relationship is. If you invoke
> std::setlocale, the global locale doesn't change, right? But vice
> versa, it changes?
Maybe I should point one thing out: for various reasons, one of the
compilers I have to support in production code is g++ 2.95.2. Which
means that in real code, the locale experience I have is with C locales.
I experimented a little with std::locale with a version of g++ 3.0.x ;
they didn't work, so I let it drop.
And a long time ago, I did use std::locale in VC++. But only for a few
limited things.
> I think the relationship between the C locale, and std::locale is not
> well thought-out.
My impression is that there is very little involving locales, be it in C
or in C++, that is based on concrete experience. And I don't believe
much in "well though-out" that isn't based on such experience. No
matter how careful you are, there is always something you forget. In
both cases, we have the problem of the standard inventing, rather than
codifying existing practice.
My impression is that at least some of the authors of the C standard had
some concrete experience with some forms of localization. At least
enough experience to know to stop when they didn't know what they were
doing. In C++, I'm not sure that even this is the case.
> > Which leaves open the question: why does the standard speak of an
> > "execution character set" if there is no desire to reflect some sort
> > of idea of character?
> I think that notion is inherited from C.
Certainly that's where C++ got it. Which doesn't change the problem.
(I think that the aspects of the first couple of phases of translation
are meant to be almost 100% compatible with C.)
> > And what is "locale-specific" supposed to mean in 2.2?
> For C++, it's defined in 1.3.7. C defines it in the same way, but
> associates (implicitly) additional semantics, namely a relationship
> to the setlocale function.
> That, of course, is not well thought-out, either: In C, there appears
> to be a requirement that the contents of "\u20AC" might change
> whenever setlocale() is invoked - which is, of course, unrealistic
> (see J.4).
Right. I just assumed that the same thing held with C++.
If I understand you correctly, there is an incompatibility between C and
C++ as to the meaning of a string literal:-). Of course, all of the C
compilers actually implement the C++ semantics, so...
> Specifically, the locale of the execution environment changes with the
> setlocale (and so changes the execution character set), whereas the
> conversion to the execution character set happens during translation,
> thus in the translation environment (C99, 5.1.1).
> > Why doesn't the standard just say "implementation defined", if that
> > is what it means?
> To indicate the intent that implementations may need to take local
> conventions into account, and to give program authors a clue that they
> cannot rely on a specific implementation behaving the same way all the
> time.
> > What I mean is that if I write "isupper( 'A' )", I am guaranteed
> > that the results are true, regardless of the current locale.
> No. This is only guaranteed in the "C" locale (see C99, 7.4.1.10).
Which means that I actually have many "execution narrow character sets",
and the compiler chooses one to convert my string literals. (From a
practical point of view. This obviously isn't standardese.)
That actually sounds like a reasonable situation. I don't really expect
the compiler to be able to adjust character literals according to the
(changing) current global locale.
> > My point is that the compiler is required to accept 'A', and it is
> > required to translate the resulting character into an integral value
> > which represents 'A' in the execution character set. Since 'A' is
> > an upper case character, this would seem to guarantee the results of
> > calling isupper on the value.
> For C, I would consider this unimplementable. For C++, you have to
> ctype::widen() 'A' before you can pass it to ctype::isupper.
In my copy of the standard, there are two isupper functions. (I never
talked about ctype::isupper directly.)
The first is just a reference to the C standard, where the function
takes an integer with values in the range 0..UCHAR_MAX. Since 'A' is
guaranteed to be positive (even as a char), the implicit conversion to
int is well defined, and guaranteed to give a value in the required
range. Thus, the call is legal.
The second is a template on the character type. You certainly aren't
going to tell me that I cannot use char as the character type.
In neither case is widen necessary, or even useful.
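(The two calls, side by side; since 'A' is guaranteed to be non-negative, the
C-style call needs no cast here.)

    #include <cctype>
    #include <locale>

    int main()
    {
        bool c_style   = std::isupper('A') != 0;            // C function: int in
                                                            // 0..UCHAR_MAX or EOF
        bool cxx_style = std::isupper('A', std::locale());  // template: charT
                                                            // deduced as char
        return (c_style && cxx_style) ? 0 : 1;
    }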
> As I said above, you are then still not guaranteed that you get true.
I don't think you should have this guarantee. But I ask myself, what
does converting this character from the basic source character set to
the basic execution character set mean? I think you'll agree that 'A'
is an upper case letter in the basic source character set. Shouldn't
it be an upper case letter in the basic execution character set? (Or is
it just a QoI issue that the compiler doesn't give you '1'?) And if it
is an upper case letter, shouldn't isupper return true?
(I actually suspect that this is a question for comp.std.c, and that C++
should just follow, on the grounds that I see no justification for
incompatibility here.)
> > I have no similar guarantees for 'á', and in fact, on my machine,
> > with the normal settings I use, "isupper( 'á' )" has undefined
> > behavior (in C++, but not in C).
> Can you please elaborate? If you use 'á' literally as in this email
> message, an implementation-defined mapping to the SCS takes place in
> phase 1,
[I've replaced the original character, which somehow got lost.]
On my machine, the mapping gives the equivalent of '\u00E1', decimal
225.
> followed by a locale-specific mapping to the ECS in phase 5;
On my machine, I get a value of 0xE1 again. As an int (C), this is 225,
which is in the range of 0...UCHAR_MAX, and so legal. As a char (C++),
this is -31, which is NOT in the range of 0...UCHAR_MAX, and so the call
to isupper results in undefined behavior.
> the isupper call is also locale-specific. In C, and in C++ if you use
> the C function, in the "C" locale, you are guaranteed to get false.
> In all other cases, you get locale-specific or implementation-defined
> behaviour. Why would you ever get undefined behaviour?
Because the value, in C++, is not in the range 0...UCHAR_MAX.
> > > The standard does not guarantee this. It does not say that the ECS
> > > depends on the "runtime locale", because there is no "runtime
> > > locale" in C++.
> > Could you please explain to me what the "execution character set"
> > is.
> It's a set of characters, which may include multi-byte characters. It
> is the character set to which source characters are converted in phase
> 5. Beyond that, it is locale-specific (i.e. implementation-defined).
> The ECS is what the implementation uses to represent source code
> literals at run-time.
Sort of a circular definition:-). The standard says that in phase five,
characters are converted to the execution character set, and the
definition of the execution character set is the character set into
which characters are converted in phase 5.
But OK. That's really about par for the course for standardese. The
compiler must convert the characters into something, and we'll call that
something the execution character set.
> It is *not* the character set that you use for IO.
Which means that outputting string literals really only makes sense if
you are in locale "C", or imbue the stream with locale "C" before hand.
> > And what it means to translate 'A' into the "execution character
> > set", if whether 'A' is an upper case letter depends on the runtime
> > environment.
> Each implementation guarantees that the letter 'A' can be represented
> in the ECS. In C, this representation should that fputc will give a
> graphic representation of the letter A. In C++, a widening according
> to the locale imbued on the stream occurs before the 'A' is written to
> the stream.
You don't widen char's when outputting to an ostream. Or have I missed
something? (If you do widen char's to char's, there's yet another
function which is misnamed.)
> > > It should then issue warnings if a non-basic character is found in
> > > a string literal (unless the warning is silenced).
> > Any string literal, or just a narrow string literal?
> Only for narrow string literals. For wide string literals, no warning
> is needed if a UCN was used;
Really? Even if wchar_t is only eight bits wide? Or only 16 (as it is
under Windows), and the UCN is '\U00012345'?
> for other characters, it should make sure it really got the source
> encoding right.
By reading the documentation of the compiler, right?
> > Practically, I recognize that it may be necessary to assume a
> > different encoding in different included files. Which lead me to
> > suggest a #pragma, or a hidden file in the directory which contained
> > the file.
> Indeed. One may argue that non-basic characters should be avoided in
> header files, in which case getting the source encoding from the
> environment variable might be sufficient. This is where the warning
> becomes relevant: if there was a pragma, there is no need to warn.
One may even argue that non-basic characters should be avoided
generally, given the current level of support:-).
> > This is all a bit of brain-storming on my part. I have no
> > experience with compilers which actually do anything intelligent
> > when faced with characters outside of the basic character set.
> gcc uses the compilation-time C library to determine character
> boundaries; this is necessary to parse string literals correctly for
> encodings like iso-2022-jp, or gb2312 (since those encodings use the
> byte that represents the REVERSE SOLIDUS as the second byte of a
> multi-byte character).
> > > A reasonable approach would be to use UTF-8 for the ECS.
> > Reasonable for who?
> For code that works correctly in the presence of different "native
> runtime locales".
I was being a bit sarcastic. I know the advantages of UTF-8, but I am
sceptical that one solution is appropriate for everything.
> > Why should my program pay the penalty of larger characters,
> > conversion on input and output, and slightly larger files (accented
> > characters take two bytes in UTF-8, only one in 8859-15)? Why
> > shouldn't I be able to just use 8859-15 everywhere?
> Because then the run-time library can depend on this. The run-time
> library, in most implementations, uses the same algorithms independent
> of the locale. To output characters correctly, it must convert from
> the ECS to the output encoding (i.e. the one imbued on the output
> stream). This is only possible if you know what the ECS is, from
> compile time.
The problem is that within my program, I either have to use wchar_t
everywhere (with excessive memory overhead -- eventually leading to page
faults and excessive runtime overhead), or I have to deal with multibyte
characters.
> > > Application authors would need to imbue locale("") to all streams
> > > they want to use - but they should do that, anyway (if the stream
> > > is meant for output to the user).
> > More likely, they will set the global locale, once and for all.
> That won't help, as that won't change the locale imbued on std::cout.
> > The only reason to actually imbue an individual stream is when you
> > have to read files from a variety of sources, written with different
> > locales.
> Not true. If you have not imbued a locale on std::cout, you can only
> output characters available in the "C" locale (i.e. from the basic
> SCS).
Good point. I guess you systematically should imbue the standard
streams.
It's interesting to note that after the usual "setlocale" which is the
first function in main in just about every program I've written, the
semantics of std::cout are different than those of printf.
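(The difference shows up in the usual start-of-main boilerplate; a sketch only.
Note that neither the C call nor the C++ global locale retroactively changes
what std::cout was imbued with when it was constructed.)

    #include <clocale>
    #include <iostream>
    #include <locale>

    int main()
    {
        std::setlocale(LC_ALL, "");            // affects printf and <cctype>
        std::locale::global(std::locale(""));  // affects subsequently created
                                               // streams, and calls setlocale
        std::cout.imbue(std::locale(""));      // std::cout must be imbued
                                               // explicitly
        // ...
        return 0;
    }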
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Wed, 11 Sep 2002 16:38:41 +0000 (UTC)
loewis@informatik.hu-berlin.de (Martin v. Löwis) wrote in
message news:<j41y828b8z.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > > I used to think so, but I now think that this is a
> > > misinterpretation of the standard. 2.2/3 says that "any additional
> > > members are locale-specific". That does *not* mean that they
> > > depend on the locale. Instead, this refers to 1.3.7
> > And what does "locale-specific" mean, if not that they can depend on
> > the locale.
> Just as the standard says: behaviour that depends on local conventions
> of nationality, culture, and language.
So why don't they say that, instead of referring to "locale"? As far as
I can see, if your interpretation is correct, this is the only use of
locale in the standard which doesn't refer in some way or another to the
locale feature of the library.
> That *may* mean that, in PRC, the ECS is GB18030, since that is what
> the law says (I'm not sure that this specific law applies here). It
> does not (necessarily) mean that there must be a relationship to
> std::locale objects.
> > I don't disagree. But what does change, then, when I change the
> > locale (or the font).
> Read your compiler's documentation. It is implementation-defined (but
> may have to follow local law).
The standard also makes certain requirements. The classification of the
character changes, for example.
> > I'm trying to address a practical problem, and find out what the
> > standard has to say about it. I agree that once addressed, we have
> > the problem of how to implement it, or what the compiler should do
> > about it.
> The standard says it's implementation-defined, and implementations
> shall document their procedures.
> > The practical problem is simple. Supposing 8 bit char's, what is
> > the character in the implementation character set which corresponds
> > to 0xBD?
> That's implementation-defined (or, more precisely, locale-specific).
> > It is clear that if the runtime locale is one with 8859-1, isalpha
> > returns false, and if the runtime locale is one with 8859-15,
> > isalpha returns true.
> There is no runtime locale in C++.
I would say rather that there is no compile-time locale in C++. The
behavior of std::isalpha( int ) can definitely be changed at runtime, and
the standard says that it depends on the current locale. (The exact
words are in ISO/IEC 9899:1990 7.4/2, which is included by reference in
the C++ standard.) So call it the current locale, rather than the
run-time locale. It doesn't change anything.
> If you pick a locale object whose ctype facet acts that way, then -
> yes.
Or if the current locale acts that way, and you use a function in
<cctype>.
> > It is also clear that if the display font encodes 8859-1, I will see
> > a 1/2, and if the display font encodes 8859-15, I will see a oe.
> You cannot display the character directly. You have to write it to a
> stream (say, std::cout); that may involve a conversion.
I could also write it to a FILE*:-).
But I think I understand what you are saying. Basically, C++ doesn't
have a character type (something we all know), and that any conversion
between the internal numerical value stored in the internal integral
type (char or wchar_t) to or from characters occurs in basic_filebuf,
and depends on the locale imbued in that basic_filebuf. And that the
only valid (or usable) functions which consider the integral type as a
character are also locale dependent in some way: they either use the
current locale (my "runtime locale"), or they take a locale as a
parameter, or they are in a facet of a locale.
And that the conversion to the execution character set in phase 5 is
totally independent of all that; the conversion is simply between
integers, in an implementation-defined manner.
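(In code, the place where "character-ness" enters is the locale imbued on the
stream buffer, e.g. for a wide file stream; a minimal sketch, with locale("")
and the file name standing in for whatever is actually wanted.)

    #include <fstream>
    #include <locale>

    int main()
    {
        std::wofstream out;
        out.imbue(std::locale(""));   // the codecvt facet of this locale does
                                      // the wchar_t-to-external conversion
                                      // inside the basic_filebuf, at I/O time
        out.open("demo.txt");
        out << L"some text\n";
        return 0;
    }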
Which leaves open the question: why does the standard speak of an
"execution character set" if there is no desire to reflect some sort of
idea of character? And what is "locale-specific" supposed to mean in
2.2? Why doesn't the standard just say "implementation defined", if
that is what it means?
> That conversion will depend on the locale object that is imbued on the
> stream - not (only) on the execution character set.
> > In that sense, at least, the execution character set depends on the
> > locale.
> No, see above.
> > The question remaining is, of course, how does this interact with
> > regards to string and character literals, which will be converted to
> > some binary code at compile time.
> That question is answered in your compiler's documentation.
> > If your point is that making this translation depend on the runtime
> > locale is impossible, then I totally agree. But that doesn't negate
> > my point; it just means that the standard has imposed impossible
> > requirements on the implementation:-).
> No, it doesn't. It does not mandate that the execution character set
> is defined by the "runtime locale" - it does not even use the term
> "runtime locale".
It certainly "implies" that some "locale" is involved. Otherwise,
"implementation defined" is sufficient.
> Instead, it says that the ECS is locale-specific, which means that it
> is implementation-defined (with the clear intention that local
> convention may adjust the compiler's behaviour, in an
> implementation-defined way).
That's not the way I would interpret the sentence in question, but I
guess it is a legitimate interpretation.
> > - since the text in 2.2/3 says that "any ADDITIONAL members [of the
> > execution character set] are locale specific", it implies that
> > the basic members are NOT locale specific, and can be guaranteed
> > to be the same in all locales.
> What do you mean, 'all the same'. This is true for any character in
> the execution character set - it is the same as itself.
What I mean is that if I write "isupper( 'A' )", I am guaranteed that
the results are true, regardless of the current locale.
My point is that the compiler is required to accept 'A', and it is
required to translate the resulting character into an integral value
which represents 'A' in the execution character set. Since 'A' is an
upper case character, this would seem to guarantee the results of
calling isupper on the value. I have no similar guarantees for 'á', and
in fact, on my machine, with the normal settings I use, "isupper( 'á' )"
has undefined behavior (in C++, but not in C).
> > (Which is, IMHO, too strong a guarantee. I would like to see it
> > legal to provide an EBCDIC locale, for example, even when the
> > basic character set is based on ASCII.)
> The standard does not guarantee this. It does not say that the ECS
> depends on the "runtime locale", because there is no "runtime locale"
> in C++.
Could you please explain to me what the "execution character set" is.
And what it means to translate 'A' into the "execution character set",
if whether 'A' is an upper case letter depends on the runtime
environment. There is apparently something about the word character
that I don't understand.
> > Can the implementation document it as depending on the compile time
> > locale? I would think so.
> Certainly, yes.
> > Could the implementation document it as depending on a run time
> > locale determined by environment variables read at startup? I would
> > also think so, although this would mean that the actual
> > initialization of string and character literals didn't take place
> > until program start-up (before any dynamic initialization of
> > course).
> Unlikely, but possible.
> > What should a quality implementation do: just allow characters in
> > the basic character set (ensuring a maximum of portability), fix one
> > locale for all translations, and use it, or use the current locale
> > (as determined by environment variables, etc.) when compiling? Or
> > something else?
> I think it should fix an ECS-W to be independent of the locale; it
> should specify that ECS-W is some form of Unicode.
I totally agree. (As a quality of implementation issue, of course.)
> It should then issue warnings if a non-basic character is found in a
> string literal (unless the warning is silenced).
Any string literal, or just a narrow string literal? If the
implementation only supports one interpretation of the characters in
wchar_t (some form of Unicode), then there should be no problem with
non-basic characters in wide character literals.
Or do you want the warning anyway, on the grounds of possible (or
probable?) lack of portability.
> It should also allow to specify the encoding of source code files (for
> translation phase 1), and use that to convert wide string literals to
> Unicode - so that UCNs don't have to be used.
That's part of what started this discussion. I suggested several ways
to do this, starting with using an environment variable or a command
line option.
Practically, I recognize that it may be necessary to assume a different
encoding in different included files. Which lead me to suggest a
#pragma, or a hidden file in the directory which contained the file.
This is all a bit of brain-storming on my part. I have no experience
with compilers which actually do anything intelligent when faced with
characters outside of the basic character set.
> > But it doesn't say how the compiler can do this for the characters
> > in the extended execution character set, which are "locale
> > specific". (A better example might be '\u0178', which should be
> > 0xAF in 8859-14, but 0xBE in 8859-15.)
> A reasonable approach would be to use UTF-8 for the ECS.
Reasonable for who? It's very reasonable if all internal characters are
wchar_t (using Unicode). But not all programs need to be
internationalized. If I am writing a program to process French tax
returns, I certainly don't need to be able to handle Chinese characters,
nor anything else, for that matter. Why should my program pay the
penalty of larger characters, conversion on input and output, and
slightly larger files (accented characters take two bytes in UTF-8, only
one in 8859-15)? Why shouldn't I be able to just use 8859-15
everywhere?
> Application authors would need to imbue locale("") to all streams they
> want to use - but they should do that, anyway (if the stream is meant
> for output to the user).
More likely, they will set the global locale, once and for all.
The only reason to actually imbue an individual stream is when you have
to read files from a variety of sources, written with different locales.
(I don't really like the word "only" here, because I suspect that this
should be the case a lot more often than it is. But again, French tax
returns will not be generated in any locale other than fr_FR.)
> > I suspect that most compilers will simply truncate, which is the
> > equivalent of using a locale with 8859-1. Which, of course, should
> > please no one in the long run, because 8859-1 has, for all practical
> > purposes, been replaced with 8859-15 (if only for the Euro character
> > in most locales).
> I hope that 8859-1 is replaced with UTF-8 on Unix;
It is the closest to "one size fits all". But I'm not sure that it is
appropriate for everyone.
> on Windows, it has been replaced with CP 1252 a long time ago; both
> support the Euro character. I hope that 8859-15 won't be used widely.
I can see some uses for it, at least internally, for programs which only
have to work in one locale. And while I'm convinced that such programs
are really the exceptions, they do exist.
In the meantime, more and more programs have to deal with a variety of
different code sets. HTTP, and now I believe SMTP as well, provide
options for specifying the code set, for example.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Wed, 11 Sep 2002 18:01:25 +0000 (UTC) Raw View
loewis@informatik.hu-berlin.de (Martin v. ?iso-8859-1?q?L wis?) wrote in
message news:<j41y828b8z.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > > I used to think so, but I now think that this is a
> > > misinterpretation of the standard. 2.2/3 says that "any additional
> > > members are locale-specific". That does *not* mean that they
> > > depend on the locale. Instead, this refers to 1.3.7
> > And what does "locale-specific" mean, if not that they can depend on
> > the locale.
> Just as the standard says: behaviour that depends on local conventions
> of nationality, culture, and language.
So why don't they say that, instead of referring to "locale"? As far as
I can see, if your interpretation is correct, this is the only use of
locale in the standard which doesn't refer in some way or another to the
locale feature of the library.
> That *may* mean that, in PRC, the ECS is GB18030, since that is what
> the law says (I'm not sure that this specific law applies here). It
> does not (necessarily) mean that there must be a relationship to
> std::locale objects.
> > I don't disagree. But what does change, then, when I change the
> > locale (or the font).
> Read your compiler's documentation. It is implementation-defined (but
> may have to follow local law).
The standard also makes certain requirements. The classification of the
character changes, for example.
> > I'm trying to address a practical problem, and find out what the
> > standard has to say about it. I agree that once addressed, we have
> > the problem of how to implement it, or what the compiler should do
> > about it.
> The standard says it's implementation-defined, and implementations
> shall document their procedures.
> > The practical problem is simple. Supposing 8 bit char's, what is
> > the character in the implementation character set which corresponds
> > to 0xBD?
> That's implementation-defined (or, more precisely, locale-specific).
> > It is clear that if the runtime locale is one with 8859-1, isalpha
> > returns false, and if the runtime locale is one with 8859-15,
> > isalpha returns true.
> There is no runtime locale in C++.
I would say rather that there is no compile-time locale in C++. The
behavior of std::isalpha( int ) can definitly be changed at runtime, and
the standard says that it depends on the current locale. (The exact
words are in ISO/IEC 9899:1990 7.4/2, which is included by reference in
the C++ standard.) So call it the current locale, rather than the
run-time locale. It doesn't change anything.
> If you pick a locale object whose ctype facet acts that way, then -
> yes.
Or if the current locale acts that way, and you use a function in
<cctype>.
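Something like the following illustrates what I mean (just a sketch; the
byte value and the locale name are assumptions, and whether that locale is
available at all is platform-specific):

    #include <clocale>
    #include <cctype>
    #include <cstdio>

    int main()
    {
        unsigned char c = 0xBD;  // 1/2 in 8859-1, oe ligature in 8859-15
        std::printf("%d\n", std::isalpha(c));  // "C" locale: prints 0
        if (std::setlocale(LC_ALL, "fr_FR.ISO8859-15"))  // assumed name
            std::printf("%d\n", std::isalpha(c));  // now locale-specific,
                                                   // typically non-zero
        return 0;
    }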
> > It is also clear that if the display font encodes 8859-1, I will see
> > a 1/2, and if the display font encodes 8859-15, I will see a oe.
> You cannot display the character directly. You have to write it to a
> stream (say, std::cout); that may involve a conversion.
I could also write it to a FILE*:-).
But I think I understand what you are saying. Basically, C++ doesn't
have a character type (something we all know), and that any conversion
between the internal numerical value stored in the internal integral
type (char or wchar_t) to or from characters occurs in basic_filebuf,
and depends on the locale imbued in that basic_filebuf. And that the
only valid (or usable) functions which consider the integral type as a
character are also locale-dependent in some way: they either use the
current locale (my "runtime locale"), or they take a locale as a
parameter, or they are in a facet of a locale.
And that the conversion to the execution character set in phase 5 is
totally independent of all that; the conversion is simply between
integers, in an implementation-defined manner.
Which leaves open the question: why does the standard speak of an
"execution character set" if there is no desire to reflect some sort of
idea of character? And what is "locale-specific" supposed to mean in
2.2? Why doesn't the standard just say "implementation defined", if
that is what it means?
> That conversion will depend on the locale object that is imbued on the
> stream - not (only) on the execution character set.
> > In that sense, at least, the execution character set depends on the
> > locale.
> No, see above.
> > The question remaining is, of course, how does this interact with
> > regards to string and character literals, which will be converted to
> > some binary code at compile time.
> That question is answered in your compiler's documentation.
> > If your point is that making this translation depend on the runtime
> > locale is impossible, then I totally agree. But that doesn't negate
> > my point; it just means that the standard has imposed impossible
> > requirements on the implementation:-).
> No, it doesn't. It does not mandate that the execution character set
> is defined by the "runtime locale" - it does not even use the term
> "runtime locale".
It certainly "implies" that some "locale" is involved. Otherwise,
"implementation defined" is sufficient.
> Instead, it says that the ECS is locale-specific, which means that it
> is implementation-defined (with the clear intention that local
> convention may adjust the compiler's behaviour, in an
> implementation-defined way).
That's not the way I would interpret the sentence in question, but I
guess it is a legitimate interpretation.
> >  - since the text in 2.2/3 says that "any ADDITIONAL members [of the
> >    execution character set] are locale specific", it implies that
> > the basic members are NOT locale specific, and can be guaranteed
> > to be the same in all locales.
> What do you mean, 'all the same'. This is true for any character in
> the execution character set - it is the same as itself.
What I mean is that if I write "isupper( 'A' )", I am guaranteed that
the results are true, regardless of the current locale.
My point is that the compiler is required to accept 'A', and it is
required to translate the resulting character into an integral value
which represents 'A' in the execution character set. Since 'A' is an
upper case character, this would seem to guarantee the results of
calling isupper on the value. I have no similar guarantees for 'Á', and
in fact, on my machine, with the normal settings I use, "isupper( 'Á' )"
has undefined behavior (in C++, but not in C).
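To make that concrete (a sketch only -- whether the second call is even
well defined is exactly the point under discussion; 'Á' stands for a
non-basic character whose code is above 0x7F, and char is assumed signed):

    #include <cctype>

    int main()
    {
        int a = std::isupper('A');   // the guarantee I am arguing for
        int b = std::isupper('Á');   // 'Á' may be negative as a char, so
                                     // passing it to isupper is questionable
        return a && b;
    }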
> > (Which is, IMHO, too strong a guarantee. I would like to see it
> > legal to provide an EBCDIC locale, for example, even when the
> > basic character set is based on ASCII.)
> The standard does not guarantee this. It does not say that the ECS
> depends on the "runtime locale", because there is no "runtime locale"
> in C++.
Could you please explain to me what the "execution character set" is.
And what it means to translate 'A' into the "execution character set",
if whether 'A' is an upper case letter depends on the runtime
environment. There is apparently something about the word character
that I don't understand.
> > Can the implementation document it as depending on the compile time
> > locale? I would think so.
> Certainly, yes.
> > Could the implementation document it as depending on a run time
> > locale determined by environment variables read at startup? I would
> > also think so, although this would mean that the actual
> > initialization of string and character literals didn't take place
> > until program start-up (before any dynamic initialization of
> > course).
> Unlikely, but possible.
> > What should a quality implementation do: just allow characters in
> > the basic character set (ensuring a maximum of portability), fix one
> > locale for all translations, and use it, or use the current locale
> > (as determined by environment variables, etc.) when compiling? Or
> > something else?
> I think it should fix an ECS-W to be independent of the locale; it
> should specify that ECS-W is some form of Unicode.
I totally agree. (As a quality of implementation issue, of course.)
> It should then issue warnings if a non-basic character is found in a
> string literal (unless the warning is silenced).
Any string literal, or just a narrow string literal? If the
implementation only supports one interpretation of the characters in
wchar_t (some form of Unicode), then there should be no problem with
non-basic characters in wide character literals.
Or do you want the warning anyway, on the grounds of possible (or
probable?) lack of portability.
> It should also allow to specify the encoding of source code files (for
> translation phase 1), and use that to convert wide string literals to
> Unicode - so that UCNs don't have to be used.
That's part of what started this discussion. I suggested several ways
to do this, starting with using an environment variable or a command
line option.
Practically, I recognize that it may be necessary to assume a different
encoding in different included files. Which led me to suggest a
#pragma, or a hidden file in the directory which contained the file.
This is all a bit of brain-storming on my part. I have no experience
with compilers which actually do anything intelligent when faced with
characters outside of the basic character set.
> > But it doesn't say how the compiler can do this for the characters
> > in the extended execution character set, which are "locale
> > specific". (A better example might be '\u0178', which should be
> > 0xAF in 8859-14, but 0xBE in 8859-15.)
> A reasonable approach would be to use UTF-8 for the ECS.
Reasonable for who? It's very reasonable if all internal characters are
wchar_t (using Unicode). But not all programs need to be
internationalized. If I am writing a program to process French tax
returns, I certainly don't need to be able to handle Chinese characters,
nor anything else, for that matter. Why should my program pay the
penalty of larger characters, conversion on input and output, and
slightly larger files (accented characters take two bytes in UTF-8, only
one in 8859-15)? Why shouldn't I be able to just use 8859-15
everywhere?
> Application authors would need to imbue locale("") to all streams they
> want to use - but they should do that, anyway (if the stream is meant
> for output to the user).
More likely, they will set the global locale, once and for all.
The only reason to actually imbue an individual stream is when you have
to read files from a variety of sources, written with different locales.
(I don't really like the word "only" here, because I suspect that this
should be the case a lot more often than it is. But again, French tax
returns will not be generated in any locale other than fr_FR.)
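For what it's worth, a sketch of that "variety of sources" case (the file
names and locale names are assumptions; whether the named locales exist
depends on the platform):

    #include <fstream>
    #include <locale>

    int main()
    {
        std::wifstream de, fr;                       // wide streams, so the
        de.imbue(std::locale("de_DE.ISO8859-1"));    // imbued codecvt facet
        fr.imbue(std::locale("fr_FR.ISO8859-15"));   // converts on input
        de.open("bericht.txt");
        fr.open("declaration.txt");
        // ... read each file with the conventions it was written in
    }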
> > I suspect that most compilers will simply truncate, which is the
> > equivalent of using a locale with 8859-1. Which, of course, should
> > please no one in the long run, because 8859-1 has, for all practical
> > purposes, been replaced with 8859-15 (if only for the Euro character
> > in most locales).
> I hope that 8859-1 is replaced with UTF-8 on Unix;
It is the closest to "one size fits all". But I'm not sure that it is
appropriate for everyone.
> on Windows, it has been replaced with CP 1252 a long time ago; both
> support the Euro character. I hope that 8859-15 won't be used widely.
I can see some uses for it, at least internally, for programs which only
have to work in one locale. And while I'm convinced that such programs
are really the exceptions, they do exist.
In the meantime, more and more programs have to deal with a variety of
different code sets. HTTP, and now I believe SMTP as well, provide
options for specifying the code set, for example.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Author: gennaro_prota@yahoo.com (Gennaro Prota)
Date: Wed, 11 Sep 2002 18:03:11 +0000 (UTC)
On Tue, 10 Sep 2002 20:23:39 +0000 (UTC), kanze@gabi-soft.de (James
Kanze) wrote:
>gennaro_prota@yahoo.com (Gennaro Prota) wrote
>> And that was what I meant: yes, character-literals are preprocessing
>> tokens, but until phase 7 there is no "type"!
>
>That bit bothered me as well. Because in practice, character literals
>*must* have type, in order to determine which execution character set to
>use in phase 5.
Well, it's enough to distinguish them into narrow and wide ones. In
that sense, yes, they have a "type".
[...]
>> Since everything happens before the type char or wchar_t is assigned
>> to the expression, why do you say that L'\u20AC' is ok? (The warning
>> is not mandated, of course, but I don't understand why it is issued
>> for the narrow case only)
>
>Because there are two different execution character sets.
Aha. Now I see, thanks. I missed this. I would like to know why there
are two execution sets, however, but I think I have to study this part
better before asking any intelligent questions (if ever I will :-)).
Genny.
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Wed, 11 Sep 2002 18:36:28 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> > Just as the standard says: behaviour that depends on local conventions
> > of nationality, culture, and language.
>
> So why don't they say that, instead of referring to "locale"? As far as
> I can see, if your interpretation is correct, this is the only use of
> locale in the standard which doesn't refer in some way or another to the
> locale feature of the library.
They don't refer to "locale". They use the term "locale-specific",
which is defined in 1.3.7.
It is surprising that the term is used only once in the entire standard.
> > Read your compiler's documentation. It is implementation-defined (but
> > may have to follow local law).
>
> The standard also makes certain requirements. The classification of the
> character changes, for example.
You can't use a character literal to classify it in a std::locale
object; you first have to widen it with the ctype's facet.
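A sketch of what I mean (assuming you want to classify through a
std::locale object rather than through <cctype>):

    #include <locale>

    bool upper_in(char c, const std::locale& loc)
    {
        const std::ctype<wchar_t>& ct =
            std::use_facet< std::ctype<wchar_t> >(loc);
        wchar_t wc = ct.widen(c);                  // widen first,
        return ct.is(std::ctype_base::upper, wc);  // then classify
    }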
> I would say rather that there is no compile-time locale in C++. The
> behavior of std::isalpha( int ) can definitely be changed at runtime, and
> the standard says that it depends on the current locale. (The exact
> words are in ISO/IEC 9899:1990 §7.4/2, which is included by reference in
> the C++ standard.) So call it the current locale, rather than the
> run-time locale. It doesn't change anything.
There is a relationship between the global locale
(std::locale::global()) and the C "current" locale, although I'm not
sure what precisely this relationship is. If you invoke
std::setlocale, the global locale doesn't change, right? But vice
versa, it changes?
I think the relationship between the C locale, and std::locale is not
well thought-out.
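My understanding of the coupling, as a sketch (the locale name is an
assumption, and I may well be wrong on the details, which is rather the
point):

    #include <locale>
    #include <clocale>

    int main()
    {
        std::locale::global(std::locale("de_DE")); // a named locale, so this
                                                   // also calls setlocale()
        std::setlocale(LC_ALL, "fr_FR");           // but this does not change
                                                   // std::locale(), the C++
                                                   // global locale
    }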
> Which leaves open the question: why does the standard speak of an
> "execution character set" if there is no desire to reflect some sort of
> idea of character?
I think that notion is inherited from C.
> And what is "locale-specific" supposed to mean in =A72.2?
For C++, it's defined in 1.3.7. C defines it in the same way, but
associates (implicitly) additional semantics, namely a relation-ship
to the setlocale function.
That, of course, is not well thought-out, either: In C, there appears
to be a requirement that the contents of "\u20AC" might change
whenever setlocale() is invoked - which is, of course, unrealistic
(see J.4). Specifically, the locale of the execution environment
changes with the setlocale (and so changes the execution character
set), whereas the conversion to the execution character set happens
during translation, thus in the translation environment (C99, 5.1.1).
> Why doesn't the standard just say "implementation defined", if that
> is what it means?
To indicate the intent that implementations may need to take local
conventions into account, and to give program authors a clue that they
cannot rely on a specific implementation behaving the same way all the
time.
> What I mean is that if I write "isupper( 'A' )", I am guaranteed that
> the results are true, regardless of the current locale.
No. This is only guaranteed in the "C" locale (see C99, 7.4.1.10).
> My point is that the compiler is required to accept 'A', and it is
> required to translate the resulting character into an integral value
> which represents 'A' in the execution character set. Since 'A' is an
> upper case character, this would seem to guarantee the results of
> calling isupper on the value.
For C, I would consider this unimplementable. For C++, you have to
ctype::widen() 'A' before you can pass it to ctype::isupper. As I said
above, you are then still not guaranteed that you get true.
> I have no similar guarantees for 'Á', and in fact, on my machine,
> with the normal settings I use, "isupper( 'Á' )" has undefined
> behavior (in C++, but not in C).
Can you please elaborate? If you use Á literally as in this email
message, an implementation-defined mapping to the SCS takes place in
phase 1, followed by a locale-specific mapping to the ECS in phase 5;
the isupper call is also locale-specific. In C, and in C++ if you use
the C function, in the "C" locale, you are guaranteed to get false.
In all other cases, you get locale-specific or implementation-defined
behaviour. Why would you ever get undefined behaviour?
> > The standard does not guarantee this. It does not say that the ECS
> > depends on the "runtime locale", because there is no "runtime locale"
> > in C++.
>
> Could you please explain to me what the "execution character set" is.
It's a set of characters, which may include multi-byte characters. It
is the character set to which source characters are converted in phase
5. Beyond that, it is locale-specific (i.e. implementation-defined).
The ECS is what the implementation uses to represent source code
literals at run-time.
It is *not* the character set that you use for IO.
> And what it means to translate 'A' into the "execution character set",
> if whether 'A' is an upper case letter depends on the runtime
> environment.
Each implementation guarantees that the letter 'A' can be represented
in the ECS. In C, this representation should be such that fputc will give a
graphic representation of the letter A. In C++, a widening according
to the locale imbued on the stream occurs before the 'A' is written to
the stream.
> > It should then issue warnings if a non-basic character is found in a
> > string literal (unless the warning is silenced).
>
> Any string literal, or just a narrow string literal?
Only for narrow string literals. For wide string literals, no warning
is needed if a UCN was used; for other characters, it should make sure
it really got the source encoding right.
> Practically, I recognize that it may be necessary to assume a different
> encoding in different included files. Which led me to suggest a
> #pragma, or a hidden file in the directory which contained the file.
Indeed. One may argue that non-basic characters should be avoided in
header files, in which case getting the source encoding from the
environment variable might be sufficient. This is where the warning
becomes relevant: if there was a pragma, there is no need to warn.
> This is all a bit of brain-storming on my part. I have no experience
> with compilers which actually do anything intelligent when faced with
> characters outside of the basic character set.
gcc uses the compilation-time C library to determine character
boundaries; this is necessary to parse string literals correctly for
encodings like iso-2022-jp, or gb2312 (since those encodings use the
byte that represents the REVERSE SOLIDUS as the second byte of a
multi-byte character).
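A sketch of why boundary detection matters -- walking a multi-byte string
with mbrtowc so that a trail byte that happens to equal '\' is not taken
for an escape (this is only an illustration of the idea, not what gcc
actually does):

    #include <cwchar>
    #include <cstring>

    // count characters (not bytes) in a multi-byte string, using the
    // current locale's encoding to find the character boundaries
    std::size_t mb_length(const char* s)
    {
        std::mbstate_t state = std::mbstate_t();
        std::size_t count = 0;
        std::size_t left = std::strlen(s);
        while (left > 0) {
            std::size_t n = std::mbrtowc(0, s, left, &state);
            if (n == (std::size_t)-1 || n == (std::size_t)-2)
                break;              // invalid or incomplete sequence
            if (n == 0)
                n = 1;              // an embedded null byte
            s += n;
            left -= n;
            ++count;
        }
        return count;
    }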
> > A reasonable approach would be to use UTF-8 for the ECS.
>
> Reasonable for who?
For code that works correctly in the presence of different "native
runtime locales".
> Why should my program pay the penalty of larger characters,
> conversion on input and output, and slightly larger files (accented
> characters take two bytes in UTF-8, only one in 8859-15)? Why
> shouldn't I be able to just use 8859-15 everywhere?
Because then the run-time library can depend on this. The run-time
library, in most implementations, uses the same algorithms independent
of the locale. To output characters correctly, it must convert from
the ECS to the output encoding (i.e. the one imbued on the output
stream). This is only possible if you know what the ECS is, from
compile time.
> > Application authors would need to imbue locale("") to all streams they
> > want to use - but they should do that, anyway (if the stream is meant
> > for output to the user).
>
> More likely, they will set the global locale, once and for all.
That won't help, as that won't change the locale imbued on std::cout.
> The only reason to actually imbue an individual stream is when you have
> to read files from a variety of sources, written with different locales.
Not true. If you have not imbued a locale on std::cout, you can only
output characters available in the "C" locale (i.e. from the basic SCS).
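So the usual incantation becomes something like this (a sketch, assuming
the implementation knows a native locale under the name ""):

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale native("");        // the user's preferred locale
        std::locale::global(native);   // does not retroactively affect the
        std::cout.imbue(native);       // already-constructed standard
        std::wcout.imbue(native);      // streams, so imbue them explicitly
    }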
Regards,
Martin
Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: Sat, 14 Sep 2002 17:11:40 CST
"James Kanze" <kanze@gabi-soft.de> wrote in message
news:d6651fb6.0209130332.f1431a5@posting.google.com...
> > They don't refer to "locale". They use the term "locale-specific",
> > which is defined in 1.3.7.
>
> OK. I missed the definition.
>
> > It is surprising that the term is used only once in the entire
> > standard.
We wanted to make clear that a dependency on locale did *not* cause
``unportable'' behavior. Thus the targeted use of the term ``locale specific.''
> > I think the relationship between the C locale, and std::locale is not
> > well thought-out.
>
> My impression is that there is very little involving locales, be it in C
> or in C++, that is based on concrete experience. And I don't believe
> much in "well though-out" that isn't based on such experience. No
> matter how careful you are, there is always something you forget. In
> both cases, we have the problem of the standard inventing, rather than
> codifying existing practice.
Indeed.
> My impression is that at least some of the authors of the C standard had
> some concrete experience with some forms of localization. At least
> enough experience to know to stop when they didn't know what they were
> doing.
Indeed.
> In C++, I'm not sure that even this is the case.
No comment.
> Which means that I actually have many "execution narrow character sets",
> and the compiler chooses one to convert my string literals. (From a
> practical point of view. This obviously isn't standardese.)
But it's pretty accurate.
> I don't think you should have this guarantee. But I ask myself, what
> does converting this character from the basic source character set to
> the basic execution character set mean? I think you'll agree that 'A'
> is an upper class letter in the basic source character set. Shouldn't
> it be an upper case letter in the basic execution character set? (Or is
> it just a QoI issue that the compiler doesn't give you '1'?) And if it
> is an upper case letter, shouldn't isupper return true?
I believe the basic C character set is pretty tightly tied down, and
behaves as you would like. It's all those other characters...
> But OK. That's really about par for the course for standardese. The
> compiler must convert the characters into something, and we'll call that
> something the execution character set.
>
> > It is *not* the character set that you use for IO.
>
> Which means that outputting string literals really only makes sense if
> you are in locale "C", or imbue the stream with locale "C" before hand.
Or confine your string literals to just the basic C character set, if
the program is going to be changing execution character sets on you.
> > Indeed. One may argue that non-basic characters should be avoided in
> > header files, in which case getting the source encoding from the
> > environment variable might be sufficient. This is where the warning
> > becomes relevant: if there was a pragma, there is no need to warn.
>
> One may even argue that non-basic characters should be avoided
> generally, given the current level of support:-).
Yes.
> > > > Application authors would need to imbue locale("") to all streams
> > > > they want to use - but they should do that, anyway (if the stream
> > > > is meant for output to the user).
>
> > > More likely, they will set the global locale, once and for all.
>
> > That won't help, as that won't change the locale imbued on std::cout.
>
> > > The only reason to actually imbue an individual stream is when you
> > > have to read files from a variety of sources, written with different
> > > locales.
>
> > Not true. If you have not imbued a locale on std::cout, you can only
> > output characters available in the "C" locale (i.e. from the basic
> > SCS).
>
> Good point. I guess you systematically should imbue the standard
> streams.
>
> It's interesting to note that after the usual "setlocale" which is the
> first function in main in just about every program I've written, the
> semantics of std::cout are different than those of printf.
Yes. I argued at length for a ``transparent'' locale, one that lets the
underlying global locale shine through. That view didn't prevail in the
C++ Standard, but we've retained transparent locales as a conforming
extension in our Standard C++ library.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Mon, 16 Sep 2002 04:00:23 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> What do you mean: is "isalpha( 'A' )" illegal? Or "std::isalpha( 'A',
> std::locale() )"?
None of those are illegal; they might be implementation-defined. My
head starts spinning, so I guess instead of reading and re-reading the
text:
- isalpha('A'), in C++, has the same meaning as it has in C; I *believe*
that the meaning is, strictly speaking, unimplementable (or you can't
support locales which differ in the numbering of the basic characters)
- std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
since it assumes a widened character.
> My impression is that at least some of the authors of the C standard had
> some concrete experience with some forms of localization. At least
> enough experience to know to stop when they didn't know what they were
> doing. In C++, I'm not sure that even this is the case.
I think the idea of "locale objects", and "facets" of those, was based
on real experience with the C library, noting limitations (such as
thread-unsafety, and the globalness of the locale). Whether, at that
time, there was real experience with the proposed API, I cannot say.
It appears that none of the experts working on the library actually
looked carefully enough at the core language; the core people seem to
have copied the C text as-is.
> If I understand you correctly, there is an incompatibility between C and
> C++ as to the meaning of a string literal:-). Of course, all of the C
> compilers actually implement the C++ semantics, so...
Yes, this is my understanding. In that sense, the C++ definition seems
"more right", even though it might be terrible to use (since you have
to invoke operations to convert from the ECS to the locale's charset
in many cases).
> Which means that I actually have many "execution narrow character sets",
> and the compiler chooses one to convert my string literals. (From a
> practical point of view. This obviously isn't standardese.)
A compiler is certainly entitled to operate that way. To be
conforming, it has to document which locale it uses.
> > > My point is that the compiler is required to accept 'A', and it is
> > > required to translate the resulting character into an integral value
> > > which represents 'A' in the execution character set. Since 'A' is
> > > an upper case character, this would seem to guarantee the results of
> > > calling isupper on the value.
[...]
> The first is just a reference to the C standard, where the function
> takes an integer with values in the range 0..UCHAR_MAX. Since 'A' is
> guaranteed to be positive (even as a char), the implicit conversion to
> int is well defined, and guaranteed to give a value in the required
> range. Thus, the call is legal.
>
> The second is a template on the character type. You certainly aren't
> going to tell me that I cannot use char as the character type.
Both calls are "legal" (i.e. well-formed, and not even causing
undefined behaviour). At least the C++ version, AFAICT, causes
implementation-defined behaviour - so while std::isalpha('A',locale())
might be well-formed, it still has implementation-defined behaviour.
> In neither case is widen necessary, or even useful.
I think ctype::widen is used to convert from the ECS to the locale's
charset. That is necessary for both strings and wide strings (or is it
narrowing that I'm talking about?)
> I don't think you should have this guarantee. But I ask myself, what
> does converting this character from the basic source character set to
> the basic execution character set mean? I think you'll agree that 'A'
> is an upper case letter in the basic source character set. Shouldn't
> it be an upper case letter in the basic execution character set?
Yes, it certainly should.
> And if it is an upper case letter, shouldn't isupper return true?
Not necessarily. isupper should operate according to the locale, and
the ECS cannot change with the locale. So isupper is only useful when
you convert the ECS character to the locale's character set.
> (I actually suspect that this is a question for comp.std.c, and that C++
> should just follow, on the grounds that I see no justification for
> incompatibility here.)
Given that the C semantics is unimplementable (not necessarily for the
basic character set, but certainly for UCNs), I don't think that C++
should follow C - I'd rather expect C to "change" (perhaps clarifying
would be sufficient).
> On my machine, I get a value of 0xE1 again. As an int (C), this is 225,
> which is in the range of 0...UCHAR_MAX, and so legal. As a char (C++),
> this is -31, which is NOT in the range of 0...UCHAR_MAX, and so the call
> to isupper results in undefined behavior.
Why is that? isupper expects an int...
> Because the value, in C++, is not in the range 0...UCHAR_MAX.
Why does it have to be?
> Sort of a circular definition:-).
:-)
> > It is *not* the character set that you use for IO.
>
> Which means that outputting string literals really only makes sense if
> you are in locale "C", or imbue the stream with locale "C" before hand.
No. Character literals are widened (or is it narrowed?) before being
written. That performs necessary conversions from the ECS to the
character set of the locale imbued on the stream.
> You don't widen char's when outputting to an ostream. Or have I missed
> something? (If you do widen char's to char's, there's yet another
> function which is misnamed.)
Yes, you do widen char-to-char.
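The char-to-char widen I mean is the one in ctype<char>; a sketch:

    #include <iostream>
    #include <locale>

    int main()
    {
        const std::ctype<char>& ct =
            std::use_facet< std::ctype<char> >(std::cout.getloc());
        char nl = ct.widen('\n');   // essentially what std::endl does:
        std::cout.put(nl);          // os.put(os.widen('\n')), then flush
    }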
> Really? Even if wchar_t is only eight bits wide? Or only 16 (as it is
> under Windows), and the UCN is '\U00012345'?
I think we were talking about quality implementations here, and I
would consider an implementation where wchar_t is eight bits of low
quality. For using non-BMP characters in an implementation where
wchar_t is UCS-2 (or where the character is unassigned in the
> supported version of ISO 10646), I would indeed expect a warning - I
did not think of that case.
> > for other characters, it should make sure it really got the source
> > encoding right.
>
> By reading the documentation of the compiler, right?
This was a requirement on the compiler: it should refuse the
temptation to "guess", i.e. warn if the sole indication for a certain
source encoding is the user's environment variables. Command line
options or pragmas would silence the warning.
> I was being a bit sarcastic. I know the advantages of UTF-8, but I am
> sceptical that one solution is appropriate for everything.
While C attempts the solution for everything, I don't think C++
does. And for the "typical" application areas, I do think UTF-8 would
be a reasonable compromise.
> The problem is that within my program, I either have to use wchar_t
> everywhere (with excessive memory overhead -- eventually leading to page
> faults and excessive runtime overhead), or I have to deal with multibyte
> characters.
I think that the "excessive memory overhead" of wchar_t is a red
herring.
> It's interesting to note that after the usual "setlocale" which is the
> first function in main in just about every program I've written, the
> semantics of std::cout are different than those of printf.
That's indeed problematic. C++ does define certain interworking
between stdio and iostreams, by requiring certain synchronization - I
think it should also require synchronization of the C locale, the
global locale, and the locale imbued on the standard streams.
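The difference you describe shows up already in something as small as
this (a sketch; the locale name is an assumption, and the exact output is
of course locale-specific):

    #include <clocale>
    #include <cstdio>
    #include <iostream>

    int main()
    {
        std::setlocale(LC_ALL, "de_DE");  // the usual first line of main
        std::printf("%f\n", 1.5);         // follows the C locale: "1,500000"
        std::cout << 1.5 << '\n';         // still the classic locale: "1.5"
    }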
Regards,
Martin
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Mon, 16 Sep 2002 04:00:27 +0000 (UTC)
"P.J. Plauger" <pjp@dinkumware.com> writes:
> Or confine your string literals to just the basic C character set, if
> the program is going to be changing execution character sets on you.
If that is the recommendation, what good are universal character
names? People *want* to put non-ASCII in their source code, and expect
it to work on all implementations and platforms in the same way.
It is ok to deviate from the expected behaviour if a character is not
supported. It is also ok if the expected behaviour does not
materialize if you fail to use the library correctly.
But I think there must be a completely portable way to output "funny"
characters which are specified in the source code.
> Yes. I argued at length for a ``transparent'' locale, one that lets the
> underlying global locale shine through. That view didn't prevail in the
> C++ Standard, but we've retained transparent locales as a conforming
> extension in our Standard C++ library.
How do I refer to the transparent locale? Sounds like a useful extension.
Regards,
Martin
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Mon, 16 Sep 2002 17:15:49 +0000 (UTC)
"Martin v. L wis" <loewis@informatik.hu-berlin.de> wrote in message
news:j4lm63qdwa.fsf@informatik.hu-berlin.de...
> > Or confine your string literals to just the basic C character set, if
> > the program is going to be changing execution character sets on you.
>
> If that is the recommendation, what good are universal character
> names? People *want* to put non-ASCII in their source code, and expect
> it to work on all implementations and platforms in the same way.
People can *want* a lot of things. What they'll get if they go beyond
the basic C character set depends on whether the program sticks with
locales that are consistent with the execution character set imagined
by the compiler.
> It is ok to deviate from the expected behaviour if a character is not
> supported. It is also ok if the expected behaviour does not
> materialize if you fail to use the library correctly.
Right. You've answered your own question.
> But I think there must be a completely portable way to output "funny"
> characters which are specified in the source code.
If you mean portable across arbitrary changes in execution character set,
then dream on. We did discuss this in the C committee and decided it was
not a problem we were going to consider. (C++ has done nothing more.)
> > Yes. I argued at length for a ``transparent'' locale, one that lets the
> > underlying global locale shine through. That view didn't prevail in the
> > C++ Standard, but we've retained transparent locales as a conforming
> > extension in our Standard C++ library.
>
> How do I refer to the transparent locale? Sounds like a useful extension.
Call locale::empty(). It's described in our on-line C++ Library Reference
at our web site.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Mon, 16 Sep 2002 17:17:00 +0000 (UTC)
"Martin v. L wis" <loewis@informatik.hu-berlin.de> wrote in message
news:j4ptvfqe3h.fsf@informatik.hu-berlin.de...
> > What do you mean: is "isalpha( 'A' )" illegal? Or "std::isalpha( 'A',
> > std::locale() )"?
>
> None of those are illegal; they might be implementation-defined. My
> head starts spinning, so I guess instead of reading and re-reading the
> text:
> - isalpha('A'), in C++, has the same meaning as it has in C; I *believe*
> that the meaning is, strictly speaking, unimplementable (or you can't
> support locales which differ in the numbering of the basic characters)
You can believe it as hard as you'd like. It's still implementable in the
commonsense case where the basic C character set doesn't get renumbered.
> - std::isalpha('A', std::locale()) is, AFAICT, implementation-defined,
> since it assumes a widened character.
Most behavior involving locales is *locale specific*, not implementation
defined. In this case, however, all locales are required to categorize
'A' as an alpha character. And widening does not occur here, unless you
mean the widening of a char to an int when passed as a function argument.
> > My impression is that at least some of the authors of the C standard had
> > some concrete experience with some forms of localization. At least
> > enough experience to know to stop when they didn't know what they were
> > doing. In C++, I'm not sure that even this is the case.
>
> I think the idea of "locale objects", and "facets" of those, was based
> on real experience with the C library, noting limitations (such as
> thread-unsafety, and the globalness of the locale). Whether, at that
> time, there was real experience with the proposed API, I cannot say.
There wasn't. It was invented out of whole cloth and more or less debugged
on a draft-by-draft basis over the next several years. AFAIK, I was the
first (and, in some cases, only) person to implement every draft.
> It appears that none of the experts working on the library actually
> looked carefully enough at the core languages; the core people seem to
> have copied the C text as-is.
Some of us looked. Some of us didn't get our way.
> > If I understand you correctly, there is an incompatibility between C and
> > C++ as to the meaning of a string literal:-). Of course, all of the C
> > compilers actually implement the C++ semantics, so...
>
> Yes, this is my understanding. In that sense, the C++ definition seems
> "more right", even though it might be terrible to use (since you have
> to invoke operations to convert from the ECS to the locale's charset
> in many cases).
I'd say in rare cases. The vast majority of programs don't care about locales
and never touch them. Most of the locale-aware programs I know call
setlocale("") once at program startup to switch to the native locale,
which uses a character set reasonably congruent with the one presumed by
the compiler. A tiny handful switch among locales, with daringly different
character sets. These *might* have occasion to convert character and
string literals on the fly. Very carefully.
> > The first is just a reference to the C standard, where the function
> > takes an integer with values in the range 0..UCHAR_MAX. Since 'A' is
> > guaranteed to be positive (even as a char), the implicit conversion to
> > int is well defined, and guaranteed to give a value in the required
> > range. Thus, the call is legal.
> >
> > The second is a template on the character type. You certainly aren't
> > going to tell me that I cannot use char as the character type.
>
> Both calls are "legal" (i.e. well-formed, and not even causing
> undefined behaviour). At least the C++ version, AFAICT, causes
> implementation-defined behaviour - so while std::isalpha('A',locale())
> might be well-formed, it still has implementation-defined behaviour.
Locale specific, in some cases, as I explained above. But not in this case.
The return value is always true.
> > In neither case is widen necessary, or even useful.
>
> I think ctype::widen is used to convert from the ECS to the locale's
> charset. That is necessary for both strings and wide strings (or is it
> narrow what I'm talking about ?)
I think you're a bit confused about what can and must occur with all
this machinery. It is admittedly difficult to work through.
> > I don't think you should have this guarantee. But I ask myself, what
> > does converting this character from the basic source character set to
> > the basic execution character set mean? I think you'll agree that 'A'
> > is an upper case letter in the basic source character set. Shouldn't
> > it be an upper case letter in the basic execution character set?
>
> Yes, it certainly should.
>
> > And if it is an upper case letter, shouldn't isupper return true?
>
> Not necessarily. isupper should operate according to the locale, and
> the ECS cannot change with the locale. So isupper is only useful when
> you convert the ECS character to the locale's character set.
True in that sense, but if you switch to a locale that alters the meaning
of the basic C character set, you're skipping blithely through a swamp
anyway.
> > (I actually suspect that this is a question for comp.std.c, and that C++
> > should just follow, on the grounds that I see no justification for
> > incompatibility here.)
>
> Given that the C semantics is unimplementable (not necessarily for the
> basic character set, but certainly for UCNs), I don't think that C++
> should follow C - I'd rather expect C to "change" (perhaps clarifying
> would be sufficient).
Here's the clarification once again -- a locale should *never* change the
basic C character set. If you expect to switch to locales that alter the
definition of other characters, don't use them in literals. (See the
essay on Multibyte Conversions in the Code Conversions section of our
on-line Dinkum CoreX Library Reference for more background.)
> > On my machine, I get a value of 0xE1 again. As an int (C), this is 225,
> > which is in the range of 0...UCHAR_MAX, and so legal. As a char (C++),
> > this is -31, which is NOT in the range of 0...UCHAR_MAX, and so the call
> > to isupper results in undefined behavior.
>
> Why is that? isupper expects an int...
>
> > Because the value, in C++, is not in the range 0...UCHAR_MAX.
>
> Why does it have to be?
>
> > Sort of a circular definition:-).
Not circular, very direct. The is* functions, and several others in the
Standard C library, work with metacharacters represented as type int.
The valid values for a metacharacter are EOF and [0, UCHAR_MAX]. Been
that way for about 30 years now.
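Hence the usual defensive idiom when the argument starts life as a plain
char (a sketch):

    #include <cctype>

    bool is_upper_byte(char c)
    {
        // present the character to isupper as a value in 0...UCHAR_MAX,
        // even if plain char is signed and c happens to be "negative"
        return std::isupper(static_cast<unsigned char>(c)) != 0;
    }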
> > > It is *not* the character set that you use for IO.
> >
> > Which means that outputting string literals really only makes sense if
> > you are in locale "C", or imbue the stream with locale "C" before hand.
>
> No. Character literals are widened (or is it narrowed?) before being
> written. That performs necessary conversions from the ECS to the
> character set of the locale imbued on the stream
The execution environment has no way of knowing that a particular integer
value, or array of char, or array of wchar_t, originated as some form of
character literal. It works with them all at face value; no implicit
conversions occur.
> > You don't widen char's when outputting to an ostream. Or have I missed
> > something? (If you do widen char's to char's, there's yet another
> > function which is misnamed.)
>
> Yes, you do widen char-to-char.
No you don't.
> > Really? Even if wchar_t is only eight bits wide? Or only 16 (as it is
> > under Windows), and the UCN is '\U00012345'?
>
> I think we were talking about quality implementations here, and I
> would consider an implementation where wchar_t is eight bits of low
> quality.
And an embedded compiler might consider an implementation where wchar_t
is more than eight bits of low quality.
> For using non-BMP characters in an implementation where
> wchar_t is UCS-2 (or where the character is unassigned in the
> supported version of ISO 10646), I would indeed expect a warning - I
> did not think of that case.
>
> > > for other characters, it should make sure it really got the source
> > > encoding right.
> >
> > By reading the documentation of the compiler, right?
>
> This was a requirement on the compiler: it should refuse the
> temptation to "guess", i.e. warn if the sole indication for a certain
> source encoding is the user's environment variables. Command line
> options or pragmas would silence the warning.
I assume you're talking about compile-time warnings here. Some
situations might call for tight checks on valid character literal codes,
others might prefer transparent codes.
> > I was being a bit sarcastic. I know the advantages of UTF-8, but I am
> > sceptical that one solution is appropriate for everything.
>
> While C attempts the solution for everything, I don't think C++
> does. And for the "typical" application areas, I do think UTF-8 would
> be a reasonable compromise.
UTF-8 is often a reasonable compromise, but it shouldn't be mandated by
either the C or C++ Standard.
> > The problem is that within my program, I either have to use wchar_t
> > everywhere (with excessive memory overhead -- eventually leading to page
> > faults and excessive runtime overhead), or I have to deal with multibyte
> > characters.
>
> I think that the "excessive memory overhead" of wchar_t is a red
> herring.
I agree.
> > It's interesting to note that after the usual "setlocale" which is the
> > first function in main in just about every program I've written, the
> > semantics of std::cout are different than those of printf.
>
> That's indeed problematic. C++ does define certain interworking
> between stdio and iostreams, by requiring certain synchronization - I
> think it should also require synchronization of the C locale, the
> global locale, and the locale imbued on the standard streams.
That's generally wise, and what you get by default. But I don't think
C++ should require such behavior.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 10 Sep 2002 20:23:39 +0000 (UTC)
gennaro_prota@yahoo.com (Gennaro Prota) wrote in message
news:<k9kjnus5hgo3usf5bn0lp9711fva178bko@4ax.com>...
> On Thu, 5 Sep 2002 17:29:15 +0000 (UTC), kanze@gabi-soft.de (James
> Kanze) wrote:
>Martin v. Löwis wrote
>> (James Kanze) writes:
> >> > > But the problem is not whether the number 0x20AC fits in 8
>> > > (CHAR_BIT) bits. It's whether the euro sign *is* in the
> >> > > (presumed) run-time locale.
> >> > It is with regards to the error message you cite. '\u20AC' is an
> >> > expression with type char. On most machines, the results won't
> >> > fit in a char.
> >> I may be missing part of the discussion, but - why is that so? The
> >> standard does nowhere say that (int)'\u20AC' == 0x20AC; it may be
> >> that (int)'\u20AC' == 0x80 (say) - which fits into a char just
> >> fine.
> >Good point.
> >That's actually a bit what we were discussing. How the compiler
> >converts internal representation into the execution code set.
> >The context of this particular comment, however, was a complaint that
> >the Comeau compiler issued an error that the character constant
> >didn't fit in the statement:
> >     wchar_t euro = '\u20AC' ;
> >The poster I was responding to was surprised by the error, on the
> >grounds that wchar_t was (in this case) large enough to hold it.
> Ah! Now I see where the misunderstanding is! :-) Yes, this was my
> surprise when I first pointed out the warning. Actually I just forgot
> to write the L in front of the literal. Anyhow when you replied that
> the warning was *due* to the absence of the L, I thought better about
> it and said that it should be emitted *even* with the L! To avoid
> further confusion, let's reduce it to a simple expression statement:
> // This issues the warning
> int main() {
> '\u20AC';
> }
> // This doesn't
> int main() {
> L'\u20AC';
> }
> Now, once the compiler has assumed a given locale, the euro sign
> either is in that set or not. So, why doesn't the second program
> issue the warning? (I mean the warning: "UCN corresponds to a
> character not in the execution set"). I strongly suspect that the
> compiler just sees if it can stuff the number in a char/wchar_t, thus
> acting as if I wrote:
> char/wchar_t temp = 0x20AC;
> instead of seeing if the Unicode character U+20AC is in the execution
> set. Probably this is equivalent, because the assumed execution set is
> just ISO Latin-1 (U+0000...U+00FF range in Unicode) but the warning
> should be issued in both cases, I think.
There are two execution character sets: an execution wide character set,
and an execution narrow character set. Presumably, the euro sign is in
the wide character set, and not in the narrow character set.
> >In fact, I suspect that the compiler in question was assuming ISO
> >8859-1 as the execution character set. In this case, it doesn't fit,
> >but I agree that an warning message along the lines of "character
> >doesn't exist in the execution character set" would be better.
> This is what I tried to say. My English must really be horrible :-(
> This is e.g. from a previous post of mine:
> "But the problem is not whether the number 0x20AC
> fits in 8 (CHAR BIT) bits. It's whether the euro
> sign *is* in the (presumed) run-time locale."
> Unfortunately I've also made several typos in my posts, in particular
> writing "x20AC" instead of "u20AC" as you noticed.
> We seem to agree on this fact. So, why aren't you astonished that no
> warning is issued when wide literals are used?
Doubtlessly because the character is a member of the wide character set.
In practice, I suspect that the compiler is taking the easy way out, and
systematically supposing that the narrow character set is 8859-1, and
the wide character set is 10646, both for compilation and for execution.
This means that "conversion" to narrow is just taking the lower eight
bits (although a good compiler will check that the high order bits are
zero, and not only issue a warning, but also generate a specific
"unknown character" character, probably a '?').
This is a somewhat parochial approach, but it shouldn't cause problems
in North America.
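A sketch of that kind of narrowing (purely illustrative -- I am not
claiming any real compiler does exactly this):

    // map a universal character name to the narrow execution character
    // set, here assumed to be 8859-1: codes above 0xFF cannot be
    // represented and are replaced, with a diagnostic
    char narrow_ucn(unsigned long ucn, bool* lossy)
    {
        if (ucn > 0xFF) {
            *lossy = true;             // a warning would be issued here
            return '?';                // the "unknown character" character
        }
        *lossy = false;
        return static_cast<char>(ucn);
    }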
> > (The "doesn't fit" message would be quite appropriate for '\x20AC',
> >however.)
> >> > > What is the difference when I use L"\0x20AC" instead?
> >> > The type of L'\u20AC' (or L'\x20AC') is wchar t. No truncation
> >> > to char takes place.
> >> No truncation takes place in the char case, either: In phase 5, the
> >> UCN is converted into the execution character set - until then, it
> >> is preserved as-is.
> >Right.
> And that was what I meant: yes, character-literals are preprocessing
> tokens, but until phase 7 there is no "type"!
That bit bothered me as well. Because in practice, character literals
*must* have type, in order to determine which execution character set to
use in phase 5.
Conceptually, it is interesting to ask what an implementation should do
if the narrow character set is EBCDIC, and the wide ISO 10646. Because
in that case, even the codes of the basic characters are different in
wide characters and in narrow characters.
> Since everything happens before the type char or wchar_t is assigned
> to the expression, why do you say that L'\u20AC' is ok? (The warning
> is not mandated, of course, but I don't understand why it is issued
> for the narrow case only)
Because there are two different execution character sets.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Tue, 10 Sep 2002 20:26:23 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> > I used to think so, but I now think that this is a misinterpretation
> > of the standard. 2.2/3 says that "any additional members are
> > locale-specific". That does *not* mean that they depend on the
> > locale. Instead, this refers to 1.3.7
>
> And what does "locale-specific" mean, if not that they can depend on the
> locale.
Just as the standard says: behaviour that depends on local conventions
of nationality, culture, and language. That *may* mean that, in PRC,
the ECS is GB18030, since that is what the law says (I'm not sure that
this specific law applies here). It does not (necessarily) mean that
there must be a relationship to std::locale objects.
> I don't disagree. But what does change, then, when I change the locale
> (or the font).
Read your compiler's documentation. It is implementation-defined (but
may have to follow local law).
> I'm trying to address a practical problem, and find out what the
> standard has to say about it. I agree that once addressed, we have the
> problem of how to implement it, or what the compiler should do about it.
The standard says it's implementation-defined, and implementations
shall document their procedures.
> The practical problem is simple. Supposing 8 bit char's, what is the
> character in the implementation character set which corresponds to 0xBD?
That's implementation-defined (or, more precisely, locale-specific).
> It is clear that if the runtime locale is one with 8859-1, isalpha
> returns false, and if the runtime locale is one with 8859-15, isalpha
> returns true.
There is no runtime locale in C++. If you pick a locale object whose
ctype facet acts that way, then - yes.
> It is also clear that if the display font encodes 8859-1,
> I will see a 1/2, and if the display font encodes 8859-15, I will see a
> oe.
You cannot display the character directly. You have to write it to a
stream (say, std::cout); that may involve a conversion. That
conversion will depend on the locale object that is imbued on the
stream - not (only) on the execution character set.
> In that sense, at least, the execution character set depends on the
> locale.
No, see above.
> The question remaining is, of course, how does this interact with
> regards to string and character literals, which will be converted to
> some binary code at compile time.
That question is answered in your compiler's documentation.
> If your point is that making this translation depend on the runtime
> locale is impossible, then I totally agree. But that doesn't negate my
> point; it just means that the standard has imposed impossible
> requirements on the implementation:-).
No, it doesn't. It does not mandate that the execution character set
is defined by the "runtime locale" - it does not even use the term
"runtime locale". Instead, it says that the ECS is locale-specific,
which means that it is implementation-defined (with the clear
intention that local convention may adjust the compiler's behaviour,
in an implementation-defined way).
>  - since the text in §2.2/3 says that "any ADDITIONAL members [of the
>    execution character set] are locale specific", it implies that the
>    basic members are NOT locale specific, and can be guaranteed to be
>    the same in all locales.
What do you mean, 'all the same'. This is true for any character in
the execution character set - it is the same as itself.
> (Which is, IMHO, too strong a guarantee.
> I would like to see it legal to provide an EBCDIC locale, for
> example, even when the basic character set is based on ASCII.)
The standard does not guarantee this. It does not say that the ECS
depends on the "runtime locale", because there is no "runtime locale"
in C++.
> Can the implementation document it as depending on the compile time
> locale? I would think so.
Certainly, yes.
> Could the implementation document it as depending on a run time
> locale determined by environment variables read at startup? I would
> also think so, although this would mean that the actual
> initialization of string and character literals didn't take place
> until program start-up (before any dynamic initialization of
> course).
Unlikely, but possible.
> What should a quality implementation do: just allow characters in the
> basic character set (ensuring a maximum of portability), fix one locale
> for all translations, and use it, or use the current locale (as
> determined by environment variables, etc.) when compiling? Or something
> else?
I think it should fix an ECS-W to be independent of the locale; it
should specify that ECS-W is some form of Unicode. It should then
issue warnings if a non-basic character is found in a string literal
(unless the warning is silenced). It should also allow to specify the
encoding of source code files (for translation phase 1), and use that
to convert wide string literals to Unicode - so that UCNs don't have
to be used.
> But it doesn't say how the compiler can do this for the characters
> in the extended execution character set, which are "locale
> specific". (A better example might be '\u0178', which should be
> 0xAF in 8859-14, but 0xBE in 8859-15.)
A reasonable approach would be to use UTF-8 for the ECS. Application
authors would need to imbue locale("") to all streams they want to use
- but they should do that, anyway (if the stream is meant for output
to the user).
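A minimal sketch of that approach (an illustration only; it assumes the
environment names a UTF-8 locale that locale("") can pick up, and whether
wcout really converts through the imbued codecvt at this point is
implementation-specific):

#include <iostream>
#include <locale>

int main()
{
    // Pick up the user's preferred locale from the environment
    // (LC_ALL / LC_CTYPE on POSIX systems) and imbue it on the wide
    // stream before writing any wide output.
    std::wcout.imbue(std::locale(""));

    // The wide literal lives in ECS-W (assumed to be Unicode here);
    // the imbued locale decides the external byte sequence.
    std::wcout << L"100 \u20AC\n";
    return 0;
}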
> I suspect that most compilers will simply truncate, which is the
> equivalent of using a locale with 8859-1. Which, of course, should
> please no one in the long run, because 8859-1 has, for all practical
> purposes, been replaced with 8859-15 (if only for the Euro character in
> most locales).
I hope that 8859-1 is replaced with UTF-8 on Unix; on Windows, it has
been replaced with CP 1252 a long time ago; both support the Euro
character. I hope that 8859-15 won't be used widely.
Regards,
Martin
---
Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 10 Sep 2002 21:10:07 +0000 (UTC)
James Kanze wrote:
> And what does "locale-specific" mean, if not that they can depend on the
> locale.
It means that it depends on details of the locale in which
the compiler runs, not on the locale in which the translated
program runs.
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Wed, 11 Sep 2002 02:55:02 +0000 (UTC)
gennaro_prota@yahoo.com (Gennaro Prota) writes:
> // This issues the warning
> int main() {
> '\u20AC';
> }
>
> // This doesn't
> int main() {
> L'\u20AC';
> }
>
> Now, once the compiler has assumed a given locale, the euro sign
> either is in that set or not.
No. C++ recognizes *two* execution character sets: the execution
character set (ECS), and the wide execution character set (ECS-W). It
may well be that the euro sign is in one but not the other.
> Probably this is equivalent, because the assumed execution set is
> just ISO Latin-1 (U+0000...U+00FF range in Unicode) but the warning
> should be issued in both cases, I think.
No. Most likely, the ECS is Latin-1, and the ECS-W is UCS-2 (or
perhaps UCS-4).
> We seem to agree on this fact. So, why aren't you astonished that no
> warning is issued when wide literals are used?
Well, I don't know why JK is not surprised - I'm not because I
*expect* an implementation to support more characters in ECS-W than in
ECS. In specific implementations, that may not be the case (in
particular, if they chose UTF-8 as ECS and UCS-2 as ECS-W); in
general, restricting ECS-W to, say, Latin-1 misses the whole point of
wchar_t.
Regards,
Martin
---
Author: gennaro_prota@yahoo.com (Gennaro Prota)
Date: Mon, 9 Sep 2002 19:01:03 +0000 (UTC)
On Thu, 5 Sep 2002 17:29:15 +0000 (UTC), kanze@gabi-soft.de (James
Kanze) wrote:
>Martin v. Löwis wrote
>> James Kanze) writes:
>
>> > > But the problem is not whether the number 0x20AC fits in 8
>> > > (CHAR_BIT) bits. It's whether the euro sign *is* in the (presumed)
>> > > run-time locale.
>
>> > It is with regards to the error message you cite. '\u20AC' is an
>> > expression with type char. On most machines, the results won't fit
>> > in a char.
>
>> I may be missing part of the discussion, but - why is that so? The
>> standard does nowhere say that (int)'\u20AC' == 0x20AC; it may be that
>> (int)'\u20AC' == 0x80 (say) - which fits into a char just fine.
>
>Good point.
>
>That's actually a bit what we were discussing. How the compiler
>converts internal representation into the execution code set.
>
>The context of this particular comment, however, was a complaint that
>the Comeau compiler issued an error that the character constant didn't
>fit in the statement:
>
> wchar_t euro = '\u20AC' ;
>
>The poster I was responding to was surprised by the error, on the
>grounds that wchar_t was (in this case) large enough to hold it.
Ah! Now I see where is the misunderstanding! :-) Yes, this was my
surprise when I first pointed out the warning. Actually I just forgot
to write the L in front of the literal. Anyhow when you replied that
the warning was *due* to the absence of the L, I thought better about
it and said that it should be emitted *even* with the L! To avoid
further confusion, let's reduce it to a simple expression statement:
// This issues the warning
int main() {
'\u20AC';
}
// This doesn't
int main() {
L'\u20AC';
}
Now, once the compiler has assumed a given locale, the euro sign
either is in that set or not. So, why doesn't the second program issue
the warning? (I mean the warning: "UCN corresponds to a character not
in the execution set"). I strongly suspect that the compiler just sees
if it can stuff the number in a char/wchar_t, thus acting as if I
wrote:
char/wchar_t temp = 0x20AC;
instead of seeing if the Unicode character U+20AC is in the execution
set. Probably this is equivalent, because the assumed execution set is
just ISO Latin-1 (U+0000...U+00FF range in Unicode) but the warning
should be issued in both cases, I think.
>In fact, I suspect that the compiler in question was assuming ISO 8859-1
>as the execution character set. In this case, it doesn't fit, but I
>agree that a warning message along the lines of "character doesn't
>exist in the execution character set" would be better.
This is what I tried to say. My English must really be horrible :-(
This is e.g. from a previous post of mine:
"But the problem is not whether the number 0x20AC
fits in 8 (CHAR_BIT) bits. It's whether the euro
sign *is* in the (presumed) run-time locale."
Unfortunately I've also made several typos in my posts, in particular
writing "x20AC" instead of "u20AC" as you noticed.
We seem to agree on this fact. So, why aren't you astonished that no
warning is issued when wide literals are used?
> (The "doesn't
>fit" message would be quite appropriate for '\x20AC', however.)
>
>> > > What is the difference when I use L"\0x20AC" instead?
>
>> > The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to
>> > char takes place.
>
>> No truncation takes place in the char case, either: In phase 5, the
>> UCN is converted into the execution character set - until then, it is
>> preserved as-is.
>
>Right.
And that was what I meant: yes, character-literals are preprocessing
tokens, but until phase 7 there is no "type"! Since everything
happens before the type char or wchar_t is assigned to the expression,
why do you say that L'\u20AC' is ok? (The warning is not mandated, of
course, but I don't understand why it is issued for the narrow case
only)
> [...] The real problem is that
>(probably) there is no corresponding character in what the compiler is
>supposing is the execution character set.
Yeah!
Genny.
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 10 Sep 2002 08:42:57 +0000 (UTC)
loewis@informatik.hu-berlin.de (Martin v. Löwis) wrote in message
news:<j4heh333w0.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > Right. I was very careless in my language. The real problem is
> > that (probably) there is no corresponding character in what the
> > compiler is supposing is the execution character set. (In fact, as
> > I have tried to point out, the execution character set is determined
> > at run-time, according to the locale.)
> I used to think so, but I now think that this is a misinterpretation
> of the standard. 2.2/3 says that "any additional members are
> locale-specific". That does *not* mean that they depend on the
> locale. Instead, this refers to 1.3.7
And what does "locale-specific" mean, if not that they can depend on the
locale.
> # 1.3.7 [defns.locale.specific] locale-specific behavior
> # behavior that depends on local conventions of nationality, culture,
> # and language that each implementation shall document.
> So this is just a synonym for "implementation-defined".
> Having the execution character set depend on the locale is both
> meaningless and unimplementable:
> - there is not a single locale that it could depend on: would the
> locale be locale(), locale(""), or locale::classic() (aka
> locale("C"))?
> - if the value of a character string literal depends on the
> runtime-locale, the compiler would need to arrange that the
> character string literal changes every time when "the locale"
> changes. That contradicts the requirement that the address of a
> string literal must not change.
I don't disagree. But what does change, then, when I change the locale
(or the font)?
I'm trying to address a practical problem, and find out what the
standard has to say about it. I agree that once addressed, we have the
problem of how to implement it, or what the compiler should do about it.
The practical problem is simple. Supposing 8 bit char's, what is the
character in the implementation character set which corresponds to 0xBD?
It is clear that if the runtime locale is one with 8859-1, isalpha
returns false, and if the runtime locale is one with 8859-15, isalpha
returns true. It is also clear that if the display font encodes 8859-1,
I will see a 1/2, and if the display font encodes 8859-15, I will see a
oe.
In that sense, at least, the execution character set depends on the
locale.
The question remaining is, of course, how does this interact with
regards to string and character literals, which will be converted to
some binary code at compile time.
If your point is that making this translation depend on the runtime
locale is impossible, then I totally agree. But that doesn't negate my
point; it just means that the standard has imposed impossible
requirements on the implementation:-).
More realistically, I suspect that the standard is intentionally a little
bit vague here, to allow a maximum of freedom to the implementor.
Perhaps with the justification that we don't yet have enough experience
to really know what the right answer should be. What I think seems to
be guaranteed is:
- the characters in the basic source character set (plus a few control
characters) are guaranteed to be present in the execution character
set, and
- since the text in 2.2/3 says that "any ADDITIONAL members [of the
execution character set] are locale specific", it implies that the
basic members are NOT locale specific, and can be guaranteed to be
the same in all locales. (Which is, IMHO, too strong a guarantee.
I would like to see it legal to provide an EBCDIC locale, for
example, even when the basic character set is based on ASCII.)
Given this, it seems reasonable to say that you can portably use
characters in the basic character set with reliable semantics. Beyond
that, you depend on the implementation.
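For example, the distinction looks roughly like this (an illustration
only; the bytes produced for the second literal are implementation-defined):

// Portable: every character is in the basic source/execution character
// set, which is guaranteed to be present in every locale.
const char prompt[] = "Enter amount: ";

// Implementation-dependent: the oe ligature and the euro sign belong to
// the *extended* execution character set, so whatever byte values they
// map to (if any) are locale-specific.
const char risky[] = "c\u0153ur: 100 \u20AC";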
> > Correct. But you would perhaps agree with me that a warning would
> > be nice. (And that 0xAC would also be legal, i.e. the
> > implementation-defined encoding is generated by truncating the
> > character. From a QoI point of view, certainly not very desirable,
> > but I think it is legal.)
> I agree: that is conforming *if* the implementation documents it that
> way.
Can the implementation document it as depending on the compile time
locale? I would think so. Could the implementation document it as
depending on a run time locale determined by environment variables read
at startup? I would also think so, although this would mean that the
actual initialization of string and character literals didn't take place
until program start-up (before any dynamic initialization of course).
What should a quality implementation do: just allow characters in the
basic character set (ensuring a maximum of portability), fix one locale
for all translations, and use it, or use the current locale (as
determined by environment variables, etc.) when compiling? Or something
else?
Most compilers today just read bytes, and don't do anything special with
them. If the compiler locale and the runtime locale are compatible,
everything works. If not, I don't get what I expect. But the presence
of UCN changes things somewhat: if I write '\u00E9', the standard
guarantees that I get a 'é', supposing that this character exists in the
execution character set. But it doesn't say how the compiler can do
this for the characters in the extended execution character set, which
are "locale specific". (A better example might be '\u0178', which
should be 0xAF in 8859-14, but 0xBE in 8859-15.)
I suspect that most compilers will simply truncate, which is the
equivalent of using a locale with 8859-1. Which, of course, should
please no one in the long run, because 8859-1 has, for all practical
purposes, been replaced with 8859-15 (if only for the Euro character in
most locales).
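A runtime analogue of that difference can be seen through the ctype
facet (a sketch only: it assumes a Unicode-based wide execution
character set, and the locale names are guesses that may not exist on
a given system):

#include <cstdio>
#include <locale>
#include <stdexcept>

// What byte does L'\u0178' (LATIN CAPITAL LETTER Y WITH DIAERESIS)
// narrow to under a given single-byte locale?
static void show(const char* name)
{
    try {
        std::locale loc(name);
        const std::ctype<wchar_t>& ct =
            std::use_facet< std::ctype<wchar_t> >(loc);
        char c = ct.narrow(L'\u0178', '?');   // '?' if no narrow equivalent
        std::printf("%s: 0x%02X\n", name, (unsigned)(unsigned char)c);
    } catch (std::runtime_error&) {
        std::printf("%s: locale not available\n", name);
    }
}

int main()
{
    show("en_GB.ISO8859-14");   // 0xAF expected, if the locale exists
    show("fr_FR.ISO8859-15");   // 0xBE expected, if the locale exists
    return 0;
}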
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Fri, 6 Sep 2002 15:35:47 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> Right. I was very careless in my language. The real problem is that
> (probably) there is no corresponding character in what the compiler is
> supposing is the execution character set. (In fact, as I have tried to
> point out, the execution character set is determined at run-time,
> according to the locale.)
I used to think so, but I now think that this is a misinterpretation
of the standard. 2.2/3 says that "any additional members are
locale-specific". That does *not* mean that they depend on the
locale. Instead, this refers to 1.3.7
# 1.3.7 [defns.locale.specific] locale-specific behavior
# behavior that depends on local conventions of nationality, culture,
# and language that each implementation shall document.
So this is just a synonym for "implementation-defined".
Having the execution character set depend on the locale is both
meaningless and unimplementable:
- there is not a single locale that it could depend on: would the
locale be locale(), locale(""), or locale::classic() (aka
locale("C"))?
- if the value of a character string literal depends on the
runtime-locale, the compiler would need to arrange that the
character string literal changes every time when "the locale"
changes. That contradicts the requirement that the address of a
string literal must not change.
> Correct. But you would perhaps agree with me that a warning would be
> nice. (And that 0xAC would also be legal, i.e. the
> implementation-defined encoding is generated by truncating the
> character. From a QoI point of view, certainly not very desirable, but
> I think it is legal.)
I agree: that is conforming *if* the implementation documents it that
way.
Regards,
Martin
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Fri, 6 Sep 2002 15:35:52 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> Leave the basic character set, and the issue is decidedly less
> clear, since the character codes DO really change according to the
> execution locale (and/or the fonts used).
Again, the "execution locale" is a notion of your imagination only -
not of C++.
Regards,
Martin
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 3 Sep 2002 17:49:52 +0000 (UTC)
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<co5vmugoqrq1coa85njq0or442fie1qsn9@4ax.com>...
[...]
> >> But only if that character is not in the execution set (2.13.2/5)
> >Only if that character is not in the *basic* execution set. You're
> >right that the standard doesn't say this. But the actual execution
> >character set (at least for narrow characters) is locale dependant,
> >and the locale cannot be known until execution.
> This is where I don't follow your reasoning. The compiler can *assume*
> what is the execution set or can be told what it is (e.g. by a switch)
Actually, according to the standard, the compiler can do just about
anything it wants with regards to the conversion.
My assertion (or at least, what I meant to assert) wasn't that you can
never count on such, only that you can't count on it in portable code.
The next compiler might not be running in the same environment.
> [...]
> >> Just out of curiosity I've tried the
> >> const wchar_t euro_sign = '\u20AC';
> >> above with Comeau 4.3.0.1 online and it emits a "warning: character
> >> value is out of range" though I've seen with a little code snippet
> >> that wchar_t is 32 bits wide. Even \u0100 and \U00000100 are "out
> >> of range".
> >As they should be. Try :
> > wchar_t const euro_sign = L'\u20AC' ;
> >The type of a single character narrow character constant is char.
> >'\u0100' is out of range for a char (at least on most machines). The
> >fact that you later implicitly convert this char to wchar_t is
> >irrelevant.
> But the problem is not whether the number 0x20AC fits in 8 (CHAR_BIT)
> bits. It's whether the euro sign *is* in the (presumed) run-time
> locale.
It is with regards to the error message you cite. '\u20AC' is an
expression with type char. On most machines, the results won't fit in a
char.
What is used to initialize the variable is the result of converting this
char to a wchar_t. Results which *must* be in the range
CHAR_MIN...CHAR_MAX (supposing that a wchar_t can represent all of the
values in that range).
> In this case if the run-time locale is e.g. ISO 8859-15 then the code
> is simply 0xA4.
OK. Was the run-time locale ISO 8859-15? In that case, and if this is
the conversion the compiler was using, the results should fit. They
won't be what you are expecting, I suspect. Since the wchar_t will
contain 0xA4, which isn't the Euro sign in ISO 10646 or Unicode.
> What is the difference when I use L"\0x20AC" instead?
The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to char
takes place.
> (To say it differently: the UCN remains as is in phase 1 and then is
> mapped to the execution set in phase 5. That's what I understand from
> 2.1/1)
The problem is that your UCN is in a narrow character constant. Which
has type char. So it first gets interpreted as an eight bit char, and
it is only the results of this interpretation which get assigned to the
wchar_t.
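In code, the distinction is simply this (an illustration, with the
caveats discussed above):

// Narrow character literal: type char. The euro sign must first be
// mapped into the narrow execution character set; only that char value
// is then widened to wchar_t, so the Unicode code point is lost.
wchar_t w1 = '\u20AC';

// Wide character literal: type wchar_t. The mapping is into the wide
// execution character set, where the euro sign is far more likely to
// have a representation.
wchar_t w2 = L'\u20AC';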
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: loewis@informatik.hu-berlin.de (Martin v. Löwis)
Date: Wed, 4 Sep 2002 11:34:41 +0000 (UTC)
kanze@gabi-soft.de (James Kanze) writes:
> > But the problem is not whether the number 0x20AC fits in 8 (CHAR_BIT)
> > bits. It's whether the euro sign *is* in the (presumed) run-time
> > locale.
>
> It is with regards to the error message you cite. '\u20AC' is an
> expression with type char. On most machines, the results won't fit in a
> char.
I may be missing part of the discussion, but - why is that so? The
standard does nowhere say that (int)'\u20AC' == 0x20AC; it may be that
(int)'\u20AC' == 0x80 (say) - which fits into a char just fine.
> > What is the difference when I use L"\0x20AC" instead?
>
> The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to char
> takes place.
No truncation takes place in the char case, either: In phase 5, the
UCN is converted into the execution character set - until then, it is
preserved as-is.
If the character is not supported in the execution character set, it
is converted to an implementation-defined encoding (e.g. '?'); it is
*not* truncated.
Regards,
Martin
---
Author: gennaro_prota@yahoo.com (Gennaro Prota)
Date: Thu, 5 Sep 2002 00:36:38 +0000 (UTC)
On Tue, 3 Sep 2002 17:49:52 +0000 (UTC), kanze@gabi-soft.de (James
Kanze) wrote:
>Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
>news:<co5vmugoqrq1coa85njq0or442fie1qsn9@4ax.com>...
>
> [...]
>> >> But only if that character is not in the execution set (2.13.2/5)
>
>> >Only if that character is not in the *basic* execution set. You're
>> >right that the standard doesn't say this. But the actual execution
>> >character set (at least for narrow characters) is locale dependant,
>> >and the locale cannot be known until execution.
>
>> This is where I don't follow your reasoning. The compiler can *assume*
>> what is the execution set or can be told what it is (e.g. by a switch)
>
>Actually, according to the standard, the compiler can do just about
>anything it wants with regards to the conversion.
>
>My assertion (or at least, what I meant to assert) wasn't that you can
>never count on such; on that you can't count on it in portable code.
>The next compiler might not be running in the same environment.
But why do you say that?
"Each source character set member, escape sequence, or
universal-character-name in character literals and string literals is
converted to a member of the execution character set"
Note that it says *the* execution character set. Of course this cannot
be interpreted as saying that if one changes the locale in run-time
the program adapt immediately, because the mapping is made "once for
all", but at least it says that everything should work if a given
execution set is chosen (usually by the compiler vendor) to do phase 5
and you run the program with that character set (And, agreed, this
doesn't guarantee against the possibility that your next compiler will
run in a different environment. It doesn't guarantee either that the
execution set is not assumed to be identical to the basic source set)
[...]
>The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to char
>takes place.
>
>> (To say it differently: the UCN remains as is in phase 1 and then is
>> mapped to the execution set in phase 5. That's what I understand from
>> 2.1/1)
>
>The problem is that your UCN is in a narrow character constant. Which
>has type char. So it first gets interpreted as an eight bit char, and
>it is only the results of this interpretation which get assigned to the
>wchar_t.
But I was much before the point where '\x20AC' is recognized to be a
char-literal. Doesn't this belong to the syntactical analysis? That is
phase 7. And we are talking of something that is presumed to happen
before the stupid sequence of characters is recognized to be something
with a certain structure by the compiler's mind.
I understand that the euro sign is not in the run-time locale used
(assumed) by Comeau, so in that sense the character is "out of range"
(it is not in the set). But I don't understand why you justify the
error message with the fact that I use a narrow character literal.
When I write (with the single quote characters):
'\x20AC'
phase 1 maps the characters ', \, x, 2, 0, A, C and ' to the internal
encoding, or not? Then, phase 5 "sees" that those characters
constitute a UCN contained in a char-literal and thus converts it to
the execution character set encoding. No?
Genny.
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 5 Sep 2002 16:41:00 +0000 (UTC)
gennaro_prota@yahoo.com (Gennaro Prota) wrote in message
news:<86vcnu4s40l4i8q1unfrt9dub4p7vuqfld@4ax.com>...
> On Tue, 3 Sep 2002 17:49:52 +0000 (UTC), kanze@gabi-soft.de (James
> Kanze) wrote:
> >Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
> >news:<co5vmugoqrq1coa85njq0or442fie1qsn9@4ax.com>...
> > [...]
> >> >> But only if that character is not in the execution set (2.13.2/5)
> >> >Only if that character is not in the *basic* execution set.
> >> >You're right that the standard doesn't say this. But the actual
> >> >execution character set (at least for narrow characters) is locale
> >> >dependant, and the locale cannot be known until execution.
> >> This is where I don't follow your reasoning. The compiler can
> >> *assume* what is the execution set or can be told what it is
> >> (e.g. by a switch)
> >Actually, according to the standard, the compiler can do just about
> >anything it wants with regards to the conversion.
> >My assertion (or at least, what I meant to assert) wasn't that you
> >can never count on such, only that you can't count on it in portable
> >code. The next compiler might not be running in the same
> >environment.
> But why do you say that?
> "Each source character set member, escape sequence, or
> universal-character-name in character literals and string literals is
> converted to a member of the execution character set"
> Note that it says *the* execution character set. Of course this cannot
> be interpreted as saying that if one changes the locale in run-time
> the program adapt immediately, because the mapping is made "once for
> all", but at least it says that everything should work if a given
> execution set is chosen (usually by the compiler vendor) to do phase 5
> and you run the program with that character set (And, agreed, this
> doesn't guarantee against the possibility that your next compiler will
> run in a different environment. It doesn't guarantee either that the
> execution set is not assumed to be identical to the basic source set)
But what you put in parentheses is exactly what I am saying. I think
that legally, a compiler could switch the execution character set
between ASCII and EBCDIC according to the locale. But I know of no
compiler which does this. If you write 'a', the compiler will translate
it to an 'a' in what it assumes is the execution character set. This
character is guaranteed to be present in all execution character sets,
and in practice, I don't know of a case where its representation will
change with the locale, so you are pretty safe (unless there is a locale
EBCDIC, and your compiler has assumed ASCII). Leave the basic character
set, and the issue is decidedly less clear, since the character codes DO
really change according to the execution locale (and/or the fonts used).
> [...]
> >The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to
> >char takes place.
> >> (To say it differently: the UCN remains as is in phase 1 and then
> >> is mapped to the execution set in phase 5. That's what I understand
> >> from 2.1/1)
> >The problem is that your UCN is in a narrow character constant.
> >Which has type char. So it first gets interpreted as an eight bit
> >char, and it is only the results of this interpretation which get
> >assigned to the wchar_t.
> But I was much before the point where '\x20AC' is recognized to be a
> char-literal. Doesn't this belong to the syntactical analysis? That is
> phase 7.
No. Character literals are preprocessing tokens. They are recognized
in phase 3, and converted into the execution character codes in phase 5.
But it doesn't really matter. The type of a character literal is char.
> And we are talking of something that is presumed to happen before the
> stupid sequence of characters is recognized to be something with a
> certain structure by the compiler's mind. I understand that the euro
> sign is not in the run-time locale used (assumed) by Comeau, so in
> that sense the character is "out of range" (it is not in the set). But
> I don't understand why you justify the error message with the fact
> that I use a narrow character literal. When I write (with the single
> quote characters):
> '\x20AC'
> phase 1 maps the characters ', \, x, 2, 0, A, C and ' to the internal
> encoding, or not? Then, phase 5 "sees" that those characters
> constitute a UCN contained in a char-literal and thus converts it to
> the execution character set encoding. No?
Yes and no.
First, in this posting, you seem to be confusing \x and \u, which are
two radically different things (as Martin Löwis's posting reminds us).
The sequence '\x20AC' is NOT a UCN, but a narrow character with that
value. Any compiler worth its salt will complain (unless char has 14 or
more bits). And it remains \x20AC until tokenization (phase 3). The
sequence '\u20AC' is a UCN, and is converted into the internal
representation for UCN's in phase one; during tokenization in phase 3,
the compiler "sees" three characters, a ', a UCN, and a '. In phase 5,
this UCN is converted into the execution character set. If the compiler
assumes ISO 8859-1 as the execution character set, there is no
equivalent character. What the compiler then does is implementation
defined, but I would certainly hope for a warning.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 5 Sep 2002 17:29:15 +0000 (UTC)
loewis@informatik.hu-berlin.de (Martin v. Löwis) wrote in message
news:<j4admyfa11.fsf@informatik.hu-berlin.de>...
> kanze@gabi-soft.de (James Kanze) writes:
> > > But the problem is not whether the number 0x20AC fits in 8
> > > (CHAR_BIT) bits. It's whether the euro sign *is* in the (presumed)
> > > run-time locale.
> > It is with regards to the error message you cite. '\u20AC' is an
> > expression with type char. On most machines, the results won't fit
> > in a char.
> I may be missing part of the discussion, but - why is that so? The
> standard does nowhere say that (int)'\u20AC' == 0x20AC; it may be that
> (int)'\u20AC' == 0x80 (say) - which fits into a char just fine.
Good point.
That's actually a bit what we were discussing. How the compiler
converts internal representation into the execution code set.
The context of this particular comment, however, was a complaint that
the Comeau compiler issued an error that the character constant didn't
fit in the statement:
wchar_t euro = '\u20AC' ;
The poster I was responding to was surprised by the error, on the
grounds that wchar_t was (in this case) large enough to hold it.
In fact, I suspect that the compiler in question was assuming ISO 8859-1
as the execution character set. In this case, it doesn't fit, but I
agree that a warning message along the lines of "character doesn't
exist in the execution character set" would be better. (The "doesn't
fit" message would be quite appropriate for '\x20AC', however.)
> > > What is the difference when I use L"\0x20AC" instead?
> > The type of L'\u20AC' (or L'\x20AC') is wchar_t. No truncation to
> > char takes place.
> No truncation takes place in the char case, either: In phase 5, the
> UCN is converted into the execution character set - until then, it is
> preserved as-is.
Right. I was very careless in my language. The real problem is that
(probably) there is no corresponding character in what the compiler is
supposing is the execution character set. (In fact, as I have tried to
point out, the execution character set is determined at run-time,
according to the locale.)
> If the character is not supported in the execution character set, it
> is converted to an implementation-defined encoding (e.g. '?'); it is
> *not* truncated.
Correct. But you would perhaps agree with me that a warning would be
nice. (And that 0xAC would also be legal, i.e. the
implementation-defined encoding is generated by truncating the
character. From a QoI point of view, certainly not very desirable, but
I think it is legal.)
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Tue, 27 Aug 2002 12:38:24 CST
On Mon, 26 Aug 2002 17:20:26 GMT, kanze@gabi-soft.de (James Kanze)
wrote:
>Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
>news:<09vhmu00bb3lkic5queg9t9s25m3p4ckq0@4ax.com>...
>> On 23 Aug 2002 16:15:05 GMT, kanze@gabi-soft.de (James Kanze) wrote:
>
>> > But what you say is true for any reasonable implementation, I think:
>> >after phase 1, the compiler has a fixed, internal encoding which is
>> >independant of the locale.
>
>> My emphasis was on the fact that phase 1 involves the codeset only,
>> not that the codeset is required only in phase 1. But this isn't that
>> important for our discussion.
>
>> >One possible other case is when generating narrow character literals,
>> >to convert the internal encoding into the narrow character encoding.
>> >Thus, '\u0153' might generate a different 8 bit code according to the
>> >locale.
>
>> Uh?? When is a character-literal "generated"?
>
>Phase 5.
Ah! That's the usual issue on which you and me clash every now and
then! :-) I make a strict distinction between the representation
(char-literal or string-literal) and the 'representee' (character or
array of characters). So to me a char-literal is never "generated"
(ok, with some macro tricks, and giving a particular meaning to
"generated" then... :-))
> The question is: what should '\u0153' generate. The response is
>0xBD in a locale using 8859-15. The response is implementation defined
>in 8859-1.
>
>A better example might be '\u0178': in 8859-14, this is 0xAF, in 8859-15
>0xBE.
>
>Of course, the compiler really should use the locale active at
>execution for this:-).
Yes. This is a different problem from the one we have discussed so
far. It can appear harder at first but is IMHO simpler because it's
reasonable that you want to specify one run-time locale per executable
(program image) so a simple command line switch like
/RuntimeLocale:it_IT.ISO_8859_1
should be enough.
[Now that we have ideas to implement a correct phase 1 and a flexible
phase 5 I think what remains is the easy part. Just 20/30 years of
work and we will have written our own compiler! :-)]
>> Anyhow it seems to me that he has still the right to support one
>> locale only.
>
>The only required locale is the locale "C". Which is only required to
>contain the characters in the basic character set. Such a compiler
>could convert any UCN not in the basic character set into a '?' (or
>whatever) in character and string literals.
But only if that character is not in the execution set (2.13.2/5)
[...]
>
>> This sounds like the above: where do C++ compilers have to generate
>> literals???
>
>Phase 5.
>
>The significant aspect is that the literals are generated in the
>execution character set, and not in the source character set. How the
>compiler is supposed to do this if the execution character set is locale
>dependant (as it is under Unix) is left as an excercise for the reader.
>Realistically, I can only see two options: use the current user locale
>when compiling, and always use locale "C". All in all, I suspect that
>the latter is the better choice: it does mean that you cannot
>practically use accented letters in character and in string literals,
>but at least, you should get the same results as I do, even if we
>normally use different locales.
Gulp! If I have a compiler that wants ASCII source files but I want to
display an euro sign at run-time I would like an option (like the one
I mentioned above) to specify the run-time locale, even if I represent
that euro sign with a UCN:
const wchar_t euro_sign = '\u20AC';
How could I display that sign otherwise?
[...]
>
>> static Target do_cast(Source arg)
>> {
>> typedef boost::remove_pointer<Source>::type char_type;
>
>> if( std::char_traits<char_type>::eq(arg[0], char_type()) == true
>> || std::char_traits<char_type>::eq(arg[1], char_type()) == false)
>> throw bad_lexical_cast();
>
>> return arg[0];
>> }
>
>I'd write:
>
> static Target do_cast( Source arg )
> {
> typedef boost::remove_pointer<Source>::type char_type;
>
> if ( std::char_traits< char_type >::eq( arg[ 0 ], char_type())
> || ! ...
>
>No need for the comparison with true.
>
>More significantly, I imagine that the function would be a member of a
>class which already has a typedef for the traits (which, of course,
>won't necessarily be an std::char_traits<>).
No. The whole class is:
template<class Target, class Source>
struct pointer_to_char_base
{
    static Target do_cast(Source arg)
    {
        ...
    }
};

with the rest of the implementation being (basically):

template<class Target, class Source>
Target lexical_cast(Source arg)
{
    return detail::lexical_cast_impl<Target, Source>::do_cast(arg);
}

template<>
struct lexical_cast_impl<char, char *> :
    pointer_to_char_base<char, char *> {};
Anyhow, you should feel a certain sense of guilt: boost users will cry
forever for such a function and you've discouraged me from proposing!
:-)
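To make the intended behaviour concrete, a hypothetical usage sketch
(based on the do_cast check shown earlier, not on actual Boost code):

char a = lexical_cast<char>("A");    // yields 'A'
char b = lexical_cast<char>("AB");   // throws bad_lexical_cast: second char not null
char c = lexical_cast<char>("");     // throws bad_lexical_cast: empty string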
> So we're down to:
>
> if ( traits_type::eq( arg[ 0 ], char_type() )
> || ! traits_type::eq( arg[ 1 ], char_type() ) )
With the additional typedef (and additional code, that in this case is
in boost::remove_pointer) to get the base type of the pointer. Of
course that base type is obtained via template argument deduction when
you use the function. As I said however it's just a little snippet
that *may* be useful in certain situations.
[...]
>> I hope so! In practice it's very simple: it's enough to ask them what
>> they had in mind, isn't it? (That's a joke, of course, and there's no
>> intent of being disrespectful on my part. Anyhow, there's some truth
>> in it: the question is "What possible implementations were
>> considered?").
>
>Well, most of the implementors are also committee members. I guess they
>knew what they intended, although sometimes I'm not sure:-).
>
Most? I knew that many of them are, but I didn't believe they are the
major part of the committee. Are they?
[...]
>
>> If you port the project to another compiler that only understands ISO
>> Latin-5 you must transcode the files only once, then use your editor
>> to look at those Latin-5 files as Latin-9 files.
>
>If the editor only generates characters in the basic source character
>set, you should be home free on all machines that understand ASCII. And
>given the prevalence of ASCII, you can be pretty sure that any machine
>which uses something else will have the necessary utilities to
>transcode.
Yes, but our super-editor could also be configured to use trigraphs
and alternative tokens. Or use UCNs anyway for {, }, ~, #, etc... BTW,
this would automatically make the generated files candidates for the IOC++CC :-)
>
>> [...]
>> >I should have been clearer. The intent of wchar_t is that it should be
>> >the same in all locales. The standard doesn't require it, however, and
>> >an implementation can make wchar_t exactly the same as char_t.
>
>> >[...]
>
>> >Again, I'm talking about the intent. The intent is that wchar_t be
>> >locale independant. If wchar_t changes with the locale, or requires
>> >multibyte char's, then why bother. It doesn't bring any advantages
>> >over straight char.
>
>> I've assumed this to mean that changing the locale shouldn't change
>> the encoding of a given wide character. I'm astonished that this is
>> not required by the standard (though I admit that doing it otherwise
>> would IMHO be an explicit fraud)
>
>Changing the locales doesn't change the numeric value of any variable.
>If you use a different locale, it MIGHT change the interpretation.
>
>I believe that the intent is that where reasonably possible, the
>interpretation of a wchar_t will be the same in all locales. The
>standard doesn't really require much of wchar_t. Or for that matter,
>it doesn't require much in terms of different locales.
>In the end, it is a question of market presure and quality of
>implementation; I don't know of any compiler on a Unix box or on Windows
>which uses an 8 bit wchar_t, although the standard certainly allows it.
>
Just out of curiosity I've tried the
const wchar_t euro_sign = '\u20AC';
above with Comeau 4.3.0.1 online and it emits a "warning: character
value is out of range" though I've seen with a little code snippet
that wchar_t is 32 bits wide. Even \u0100 and \U00000100 are "out of
range".
>
>> [...]
>
>> Well, I'm under the impression that, in this case, what they missed is
>> how deceptive is the "character type genericity" of _M_copy_to_string.
>> Considering how widespread in the world is the use of STLport however
>> I'm astonished that this hasn't been corrected.
>
>The STLport is widely used, but how many people are really using
>std::basic_string with types other than char or wchar_t? Regardless of
>the standard library used. I know that both the STLport people and the
>Dinkumware people set very high standards for quality, but their first
>concern is obviously the parts of the library that people really use.
Maybe it's just STLport's bitset that isn't used so much (I've noticed
other non-conformities: for instance the shift operators have
undefined behavior when pos>size()). In other parts of the library,
by contrast, I've seen that widen and narrow are usually used where
appropriate.
Genny.
---
Author: kanze@gabi-soft.de (James Kanze)
Date: 28 Aug 2002 17:50:25 GMT
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<128nmu0jopgsdj6bkpfjodlp0qc3begfi2@4ax.com>...
> On Mon, 26 Aug 2002 17:20:26 GMT, kanze@gabi-soft.de (James Kanze)
> wrote:
> >Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
> >news:<09vhmu00bb3lkic5queg9t9s25m3p4ckq0@4ax.com>...
> >> On 23 Aug 2002 16:15:05 GMT, kanze@gabi-soft.de (James Kanze) wrote:
> >> > But what you say is true for any reasonable implementation, I
> >> >think: after phase 1, the compiler has a fixed, internal encoding
> >> >which is independant of the locale.
> >> My emphasis was on the fact that phase 1 involves the codeset only,
> >> not that the codeset is required only in phase 1. But this isn't
> >> that important for our discussion.
> >> >One possible other case is when generating narrow character
> >> >literals, to convert the internal encoding into the narrow
> >> >character encoding. Thus, '\u0153' might generate a different 8
> >> >bit code according to the locale.
> >> Uh?? When is a character-literal "generated"?
> >Phase 5.
> Ah! That's the usual issue on which you and me clash every now and
> then! :-) I make a strict distinction between the representation
> (char-literal or string-literal) and the 'representee' (character or
> array of characters). So to me a char-literal is never "generated"
> (ok, with some macro tricks, and giving a particular meaning to
> "generated" then... :-))
I'm not concerned about what you call it. The fact is that the compiler
must do some sort of code conversion here, and that all code conversion
is locale dependant.
> > The question is: what should '\u0153' generate. The response is 0xBD
> >in a locale using 8859-15. The response is implementation defined in
> >8859-1.
> >A better example might be '\u0178': in 8859-14, this is 0xAF, in
> >8859-15 0xBE.
> >Of course, the compiler really should use the locale active at
> >execution for this:-).
> Yes. This is a different problem from the one we have discussed so
> far. It can appear harder at first but is IMHO simpler because it's
> reasonable that you want to specify one run-time locale per executable
> (program image) so a simple command line switch like
> /RuntimeLocale:it_IT.ISO_8859_1
> should be enough.
On my machine, the runtime locale depends on environment variables.
Different users will have different runtime locales, even when running
the same program. There is no way I can specify this at compile time.
> [Now that we have ideas to implement a correct phase 1 and a flexible
> phase 5 I think what remains is the easy part. Just 20/30 years of
> work and we will have written our own compiler! :-)]
I still don't really know how to implement phase 5. The usual solution
today is to use 8 bit characters internally, and just pass them
through. Which basically means that the runtime locale is supposed to
be the same as the locale in which I edited the text (which is often
false), and which doesn't make any provision for UCNs.
> >> Anyhow it seems to me that he has still the right to support one
> >> locale only.
> >The only required locale is the locale "C". Which is only required
> >to contain the characters in the basic character set. Such a
> >compiler could convert any UCN not in the basic character set into a
> >'?' (or whatever) in character and string literals.
> But only if that character is not in the execution set (2.13.2/5)
Only if that character is not in the *basic* execution set. You're
right that the standard doesn't say this. But the actual execution
character set (at least for narrow characters) is locale dependant, and
the locale cannot be known until execution. So 2.1./5 can only refer
to the basic execution character set. (Or maybe not even that, since
there is no explicit requirement that the characters in the basic
execution set not be locale dependant. But in practice, I don't think
we'll ever see a case where they are.)
> [...]
> >> This sounds like the above: where do C++ compilers have to generate
> >> literals???
> >Phase 5.
> >The significant aspect is that the literals are generated in the
> >execution character set, and not in the source character set. How
> >the compiler is supposed to do this if the execution character set is
> >locale dependant (as it is under Unix) is left as an excercise for
> >the reader. Realistically, I can only see two options: use the
> >current user locale when compiling, and always use locale "C". All
> >in all, I suspect that the latter is the better choice: it does mean
> >that you cannot practically use accented letters in character and in
> >string literals, but at least, you should get the same results as I
> >do, even if we normally use different locales.
> Gulp! If I have a compiler that wants ASCII source files but I want to
> display an euro sign at run-time I would like an option (like the one
> I mentioned above) to specify the run-time locale, even if I represent
> that euro sign with a UCN:
> const wchar_t euro_sign = '\u20AC';
> How could I display that sign otherwise?
With great difficulty. If you have a graphic screen, you might try
using graphic primitives to draw it:-).
Seriously, how do you display that sign if the machine your code runs on
doesn't have it in any of its fonts ? (Note once again the interaction
between locale, particularly codecvt, and fonts.)
The problem isn't trivial, and it won't go away until all machines
support ISO 10646.
> [...]
> >> I hope so! In practice it's very simple: it's enough to ask them
> >> what they had in mind, isn't it? (That's a joke, of course, and
> >> there's no intent of being disrespectful on my part. Anyhow,
> >> there's some truth in it: the question is "What possible
> >> implementations were considered?").
> >Well, most of the implementors are also committee members. I guess
> >they knew what they intended, although sometimes I'm not sure:-).
> Most? I knew that many of them are, but I didn't believe they are the
> major part of the committee. Are they?
I can't name an implementor off hand who isn't a member.
There are other members, of course. Globally, I'd guess that
implementors represent something between 10% and 25% of the members. If
we are talking about the really active members, who attend all, or most,
of the meetings, the percentage which are implementors is considerably
higher.
But my point is the opposite: there aren't implementors (or very few)
who aren't members.
[...]
> Just out of curiosity I've tried the
> const wchar_t euro_sign = '\u20AC';
> above with Comeau 4.3.0.1 online and it emits a "warning: character
> value is out of range" though I've seen with a little code snippet
> that wchar_t is 32 bits wide. Even \u0100 and \U00000100 are "out of
> range".
As they should be. Try :
wchar_t const euro_sign = L'\u20AC' ;
The type of a single character narrow character constant is char.
'\u0100' is out of range for a char (at least on most machines). The
fact that you later implicitly convert this char to wchar_t is
irrelevant.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
Author: Allan_W@my-dejanews.com (Allan W)
Date: Wed, 28 Aug 2002 16:27:27 CST
Gennaro Prota <gennaro_prota@yahoo.com> wrote
> On Mon, 26 Aug 2002 17:20:26 GMT, kanze@gabi-soft.de (James Kanze)
> wrote:
> >The significant aspect is that the literals are generated in the
> >execution character set, and not in the source character set. How the
> >compiler is supposed to do this if the execution character set is locale
> >dependant (as it is under Unix) is left as an excercise for the reader.
> >Realistically, I can only see two options: use the current user locale
> >when compiling, and always use locale "C". All in all, I suspect that
> >the latter is the better choice: it does mean that you cannot
> >practically use accented letters in character and in string literals,
> >but at least, you should get the same results as I do, even if we
> >normally use different locales.
>
> Gulp! If I have a compiler that wants ASCII source files but I want to
> display an euro sign at run-time I would like an option (like the one
> I mentioned above) to specify the run-time locale, even if I represent
> that euro sign with a UCN:
>
> const wchar_t euro_sign = '\u20AC';
>
>
> How could I display that sign otherwise?
If you know the character code that you will need at runtime, use it
at runtime!
const wchar_t euro_sign = (wchar_t)0x20AC;
No?
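Spelled out a little more (a sketch; it assumes that wchar_t values are
ISO 10646/Unicode code points, which the standard does not require):

// Spell the code point out once; nothing here depends on how the
// compiler maps a UCN in a narrow character literal.
const wchar_t euro_sign = static_cast<wchar_t>(0x20AC);   // U+20AC EURO SIGN

// Whether it actually displays still depends on the runtime locale
// (codecvt) and on the fonts available, as discussed elsewhere.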
---
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: 30 Aug 2002 17:05:15 GMT
On 28 Aug 2002 17:50:25 GMT, kanze@gabi-soft.de (James Kanze) wrote:
>Gennaro Prota <gennaro_prota@yahoo.com> wrote
>> >> Uh?? When is a character-literal "generated"?
>
>> >Phase 5.
>
>> Ah! That's the usual issue on which you and me clash every now and
>> then! :-) I make a strict distinction between the representation
>> (char-literal or string-literal) and the 'representee' (character or
>> array of characters). So to me a char-literal is never "generated"
>> (ok, with some macro tricks, and giving a particular meaning to
>> "generated" then... :-))
>
>I'm not concerned about what you call it. The fact is that the compiler
>must do some sort of code conversion here, and that all code conversion
>is locale dependant.
The remark about the macro trick and term "generated" in that context
was a joke as you may have noted from the smiley. I was just
explaining why I didn't understand your assertion. Anyhow if "what you
call it" is referred to literals then it's not me: to the standard
literals are expressions and not entities. And while we are at it, I
would say that it is not just a terminology issue. It's an important
distinction between a representation (symbol) and what it represents,
of fundamental importance every time you talk about the representation
system (the language). In linguistics for instance, where F. de
Saussure's terminology ('signifiant' and 'signifié') is common.
>> > The question is: what should '\u0153' generate. The response is 0xBD
>> >in a locale using 8859-15. The response is implementation defined in
>> >8859-1.
>
>> >A better example might be '\u0178': in 8859-14, this is 0xAF, in
>> >8859-15 0xBE.
>
>> >Of course, the compiler really should use the locale active at
>> >execution for this:-).
>
>> Yes. This is a different problem from the one we have discussed so
>> far. It can appear harder at first but is IMHO simpler because it's
>> reasonable that you want to specify one run-time locale per executable
>> (program image) so a simple command line switch like
>
>> /RuntimeLocale:it_IT.ISO_8859_1
>
>> should be enough.
>
>On my machine, the runtime locale depends on environment variables.
>Different users will have different runtime locales, even when running
>the same program. There is no way I can specify this at compile time.
Of course. I didn't claim that the above allows you to change the
locale at run-time and have your program to adapt immediately. What
can a phase of compilation do? As far as I see, phase 5 can only
assume that a given run-time locale is used. The switch above tells
which one (allowing it to be different from the source files' locale),
but of course your
cout << 'à'; // a with accent (just in case you
// can't display it :-)
will only work if you choose the same locale at run-time. If you want
your program to respond to a change of the locale at run-time you have
to adopt a different solution (I could imagine e.g. using DLLs
compiled with different switches that are selected according to the
locale, but of course I do not claim that a solution is provided by
the compiler writer)
[...]
>> >> Anyhow it seems to me that he has still the right to support one
>> >> locale only.
>
>> >The only required locale is the locale "C". Which is only required
>> >to contain the characters in the basic character set. Such a
>> >compiler could convert any UCN not in the basic character set into a
>> >'?' (or whatever) in character and string literals.
>
>> But only if that character is not in the execution set (2.13.2/5)
>
>Only if that character is not in the *basic* execution set. You're
>right that the standard doesn't say this. But the actual execution
>character set (at least for narrow characters) is locale dependant, and
>the locale cannot be known until execution.
This is where I don't follow your reasoning. The compiler can *assume*
what is the execution set or can be told what it is (e.g. by a switch)
[...]
>> >Well, most of the implementors are also committee members. I guess
>> >they knew what they intended, although sometimes I'm not sure:-).
>
>> Most? I knew that many of them are, but I didn't believe they are the
>> major part of the committee. Are they?
>
Sorry, I must have been asleep when replying to this (I read it as:
"most of the members are implementors")
[...]
>> Just out of curiosity I've tried the
>
>> const wchar_t euro_sign = '\u20AC';
>
>> above with Comeau 4.3.0.1 online and it emits a "warning: character
>> value is out of range" though I've seen with a little code snippet
>> that wchar_t is 32 bits wide. Even \u0100 and \U00000100 are "out of
>> range".
>
>As they should be. Try :
>
> wchar_t const euro_sign = L'\u20AC' ;
>
>The type of a single character narrow character constant is char.
>'\u0100' is out of range for a char (at least on most machines). The
>fact that you later implicitly convert this char to wchar_t is
>irrelevant.
But the problem is not whether the number 0x20AC fits in 8 (CHAR_BIT)
bits. It's whether the euro sign *is* in the (presumed) run-time
locale. In this case, if the run-time locale is e.g. ISO 8859-15, then
the code is simply 0xA4. What is the difference when I use L'\u20AC'
instead?
(To say it differently: the UCN remains as-is in phase 1 and is then
mapped to the execution set in phase 5. That's what I understand from
2.1/1.)
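For concreteness, here is the wide spelling as a complete (if trivial)
translation unit -- just a sketch, assuming a wchar_t wide enough to hold
ISO 10646 code points, so that the UCN maps to the code point itself:

    int main()
    {
        wchar_t const euro_sign = L'\u20AC';    // U+20AC EURO SIGN
        return euro_sign == 0x20AC ? 0 : 1;     // 0 under that assumption
    }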
Genny.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Thu, 22 Aug 2002 20:43:46 GMT Raw View
I thought it would have been too scattered to reply on a point-by-point
basis, so I've tried a little summary.
- We agree that being forced to use a non-built-in type as the character
type is not a limitation, because in practice there's no built-in type
that could be unambiguously used for that purpose.
- We also agree that from the compiler perspective only the character
set part of the locale plays a role for character mapping in phase 1.
- It's usual, however, that because of issues related e.g. to fonts,
files belonging to the same project are edited with different
charsets. Thus, there's the problem of telling the compiler which
encoding/charset must be used for each file. Transcoding is impossible
(or at least a transcoding "on the fly" would be possible, but
impractical) because the files must be seen simultaneously from
different environments.
- Statically converting (the character denoted by) a character
literal, e.g. 'a' or '1', to a generic charT is impossible. The null
character is an exception because charT() is guaranteed to be the
charT equivalent of '\0' with any locale. Thus, something like this is
ok (I'm about to propose it for boost's utilities)
template <class charT, class Traits>
inline bool is_null_char(const charT & c) {
return Traits::eq(c, charT());
}
template <class charT>
inline bool is_null_char(const charT & c) {
return is_null_char<charT, std::char_traits<charT> >(c);
}
[In practice one would almost always use the second template; the
first is there just in case... The purpose is simply to generalize
conditions involving the null character like
if (p[0] == '\0' || p[1] == '\0')
while keeping a certain readability.
Of course we could use just one template if default template arguments
for function templates were allowed - that's DR 226]
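Just to show the intended use, here is a self-contained sketch (it repeats
the two templates so it compiles on its own; the calls are only
illustrative):

    #include <string>      // std::char_traits

    template <class charT, class Traits>
    inline bool is_null_char(const charT & c)
    { return Traits::eq(c, charT()); }

    template <class charT>
    inline bool is_null_char(const charT & c)
    { return is_null_char<charT, std::char_traits<charT> >(c); }

    void example()
    {
        const char *    p = "a";
        const wchar_t * q = L"";
        bool b1 = is_null_char(p[1]);   // true: the terminating '\0'
        bool b2 = is_null_char(q[0]);   // true: the terminating L'\0'
        bool b3 = is_null_char(p[0]);   // false: 'a'
        (void)b1; (void)b2; (void)b3;
    }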
On Thu, 22 Aug 2002 15:48:12 GMT, kanze@gabi-soft.de (James Kanze)
wrote:
>
>> >Yes. Suppose I work using ISO 8859-15, and enter an oe (single
>> >character) in my editor. This is code 0xBD. I transfer the file to
>> >you, and your environment is set up for ISO 8859-1, in which 0xBD is
>> >a decimal 1/2. Supposing some form of ISO 10646 or Unicode for
>> >wchar_t (which seems likely), I expect this character to be converted
>> >to '\u0153' in phase one; with your default locale, it will be
>> >converted to '\u00BD', which is something else entirely.
>
>> I see. Anyhow I wasn't thinking of a situation where you transfer the
>> source code to another compiler.
>
>You probably don't have to transfer anything. Just change the value of
>the environment variable LC_CTYPE (under Unix, anyway).
>
>Now imagine that the value of this variable is something with 8859_1 in
>the Window with the compiler, and 8859_15 in the Window with the editor.
>You insert an oe in the editor, which is an alphabetic character, legal
>in a variable name, for example. The compiler sees 1/2, which is not
>legal in a variable name, and reports an error. Which, of course, you
>cannot see in your editor.
Yes, but that's because the two windows use different charsets. If
they were the same the compiler would have seen oe. Or are you saying
the contrary?
>
>The problem isn't trivial, and I know of no good solution. Basically,
>in order to get things right, the compiler would have to know how the
>editor displays characters, and this is impossible -- it varies from one
>editor to the next, and for most editors, according to environment.
>Under Unix, both LC_CTYPE and the selected font play a role, and both
>can be changed on a Window by Window basis.
>
>> Of course this adds another level of complexity, but maybe it can be
>> solved by translating the files in advance, i.e. before submitting
>> them to the new compiler.
>
>That problem has always been with us. Even if you only use the basic
>character set. If a machine uses EBCDIC, and your files are in ASCII,
>code translation is necessary.
Ok.
[...]
>The gag is, of course, that for tools like editors to work properly, the
>encoding suffix of the locale MUST correspond to the font encoding of
>the fonts used. It's not unreasonable for a user to use different fonts
>in an editor window than in a shell window (from which the compiler is
>invoked), and if those fonts have different encodings, the user must
>have set LC_CTYPE differently for the other software to work.
>This is not some abstract problem. For historical reasons, in western
>Europe, ISO 8859-1 has become the quasi-universal codeset. However,
>several important characters for French are missing from it, and it
>doesn't have the Euro symbol (which didn't exist when it was
>standardized). For this reason, there is a shift to ISO 8859-15, which
>corrects this problems. But things don't happen overnight, and there
>are a number of fonts on my machines which only support 8859-1. The
>result is that I use 8859-15 when I can, and 8859-1 when I have to,
>because of a lack of support in the font. And even a simple ls will
>display different filenames according to the window in which it is
>invoked.
Ok. But is this a problem that the committee wanted to address? I
understand that it is a big problem in practice, but I suspect that
they just decided to leave compiler implementers the option to choose
a solution, for instance the one using a hidden file that you propose
(or not providing a solution at all).
>> Of course the code at hand
>
>> std::locale loc;
>> char narrow[] = "0123456789";
>> wchar_t wide[10];
>> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
>> wide);
>
>> would translate in this case between ISO latin1 and whatever encoding
>> is part of loc.
>
>The narrow character encoding is *always* part of the locale. The wide
>character encoding is supposed constant.
In the code above? Maybe you say this because you stick to the fact
that the locale is default-constructed? Actually I'm missing your point,
because the default constructor returns the global locale, which can
be changed with a call to
static locale global(const locale& loc);
Why do you say that the wide character encoding is supposed to be
constant? (I begin to suspect that I should be very careful to
distinguish between a charset and its encoding, is it so?)
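For reference, the snippet quoted above, written out as a complete
translation unit (a sketch only; it just uses the global locale, as in
the original):

    #include <locale>

    int main()
    {
        std::locale loc;                    // a copy of the global locale
        char    narrow[] = "0123456789";
        wchar_t wide[10];
        std::use_facet< std::ctype<wchar_t> >(loc)
            .widen(narrow, narrow + 10, wide);
        return 0;
    }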
>In general, any conversion to string, implicit or explicit, is an
>error. The correct way to handle this is to provide an operator>>.
>(This would also allow a manipulator to select between the use of 0 and
>1, and the use of the first character of numpunct::falsename() and
>numpunct::truename() -- I can easily think of cases where I want a
>string tfft...)
>
>(More accurately: any conversion to string should be based on outputting
>to operator>>. I can quite understand the desire to have something
>simple, along the lines of boost::lexical_cast, even if it prevents the
>use of all of the formatting possibilities.)
Yes. One can decide to give up formatting, but should at least handle
the character type correctly. Otherwise it would be better to use
std::string directly:
template <typename Block, typename Allocator>
void to_string(const dynamic_bitset<Block, Allocator>& b,
std::string& s)
{
s.assign(b.size(), '0');
for (std::size_t i = 0; i < b.size(); ++i)
if (b.test(i))
s[b.size() - 1 - i] = '1';
}
>> Ah! Maybe you didn't notice that 'Alloc' is different from
>> 'Allocator'. Anyhow, yes, there's a typo in the original code which in
>> practice is not parameterized on the string allocator but on its char
>> traits. Why didn't the compiler warn about this? :-)
>
>Probably because no one has ever used any allocator execpt the standard
>allocator:-). If Alloc and the string allocator are the same, the code
>is fine.
I'm not sure that we understood each other: I was referring to the
fact that the code I pasted, i.e.
template <typename Block, typename Allocator, typename CharT, typename
Alloc> void to_string(const dynamic_bitset<Block, Allocator>& b,
std::basic_string<CharT, Alloc>& s)
uses the name Alloc for the second parameter of basic_string<>. But
the second parameter of basic_string is not the string allocator, it's
the traits type. This means that to_string is not parameterized at all
on the string allocator and the wrong name is used for what should
have been simply named 'Traits' or 'traitsT' (As you can see from the
code, dynamic_bitset<> also can use a user-defined allocator). I guess
the intent was to write
template <typename Block, typename BitsetAllocator, typename CharT,
typename Traits, typename StringAllocator>
void to_string(const dynamic_bitset<Block, BitsetAllocator>& b,
std::basic_string<CharT, Traits, StringAllocator>& s)
instead.
>> >Of course, there is absolutely nothing to guarantee that it works
>> >correctly even if it compiles. If my system uses EBCDIC for char,
>> >and ISO 10646 for wchar_t, the conversion to a wstring will result in
>> >a sequence of '? and '?: probably not what was intended either.
>
>> Yes. The code above should solve at least this problem.
>
>Painfully.
Well, that was something! :-) Obviously in my example the pain was not
that terrible because I just needed the two characters '0' and '1',
but you are absolutely right that passing through an ostringstream is
a much better solution, for the reasons you have explained.
It's curious that I have seen exactly the contrary in a lot of code.
For instance this is part of the implementation of std::bitset in the
STLport:
template <class _CharT, class _Traits, size_t _Nb>
basic_ostream<_CharT, _Traits>& _STLP_CALL
operator<<(basic_ostream<_CharT, _Traits>& __os,
const bitset<_Nb>& __x)
{
basic_string<_CharT, _Traits> __tmp;
__x._M_copy_to_string(__tmp); // <-- !!!
return __os << __tmp;
}
Genny.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: 23 Aug 2002 16:15:05 GMT Raw View
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<1uhamuca7va60anq1hu8tg2dv187hr3kik@4ax.com>...
> I thought it would have been too dispersive to reply on a point by
> point basis, so I've tried a little summary.
> - We agree that being forced to use a non-builtin as a character type
> is not a limitation because in practice there's no built-in type that
> could be unambiguously used for that purpose.
It depends on why you want a new character type, but for general purpose
text, yes.
> - We also agree that from the compiler perspective only the character
> set part of the locale plays a role for character mapping in phase 1.
Formally, implementation-defined. I think most compilers today never
use the locale (but most don't implement UCNs either, nor allow
characters not in the basic character set anywhere but in comments or
string or character literals). But what you say is true for any
reasonable implementation, I think: after phase 1, the compiler has a
fixed, internal encoding which is independent of the locale.
One other possible case is when generating narrow character literals, to
convert the internal encoding into the narrow character encoding. Thus,
'\u0153' might generate a different 8-bit code according to the locale.
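In code, the point is simply this (a sketch; the narrow value is whatever
the implementation's execution character set assigns):

    // '\u0153' is LATIN SMALL LIGATURE OE. Its narrow value depends on
    // the narrow execution character set:
    char const oe = '\u0153';   // 0xBD if that set is ISO 8859-15;
                                // implementation-defined under ISO 8859-1,
                                // which has no oe ligature at all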
> - It's usual however that, because of issues related e.g. to fonts,
> files belonging to the same project are edited with different
> charsets. Thus, there's the problem of telling the compiler which
> encoding/charset must be used for each file. Transcoding is impossible
> (or at least a transcoding "on the fly" would be possible, but
> impractical) because the files must be seen simultaneously from
> different environments.
Correct. My impression is that today, most compilers ignore the
problem. They don't (yet) support UCNs, and internally, they work with
bytes -- if you use an extended character in a string literal, for
example, they just stuff the byte into the literal without any
transcoding.
The one possible exception is when it comes to generating wide character
literals. It's possible that some implementations use the locale to
determine how to convert the narrow characters that they use internally
into the wide characters.
> - Statically converting (the character denoted by) a character
> literal, e.g. 'a' or '1', to a generic charT is impossible. The null
> character is an exception because charT() is guaranteed to be the
> charT equivalent of '\0' with any locale.
The null character is not an exception, because '\0' has type char, and
there is not necessarily a conversion from char to charT. The difference
with regard to the null character is that we know how to write it for
any type: charT(). And for types other than char and wchar_t, this
is the only character which we know how to write portably.
> Thus, something like this is ok (I'm about to propose it for boost's
> utilities)
> template <class charT, class Traits>
> inline bool is_null_char(const charT & c) {
> return Traits::eq(c, charT());
> }
> template <class charT>
> inline bool is_null_char(const charT & c) {
> return is_null_char<charT, std::char_traits<charT> >(c);
> }
> [In practice one would almost always use the second template; the
> first is there just in case...
In practice, I suspect that the first template would be more useful. If
I'm defining my own character type, there's a good chance that I've
defined my own traits type as well.
> The purpose is simply to generalize conditions involving the null
> character like
> if (p[0] == '\0' || p[1] == '\0')
> while keeping a certain readability.
And "Traits::eq( p[ 0 ], charT() )" isn't readable?
> Of course we could use just one template if default template arguments
> for function templates were allowed - that's DR 226]
> On Thu, 22 Aug 2002 15:48:12 GMT, kanze@gabi-soft.de (James Kanze)
> wrote:
> >> >Yes. Suppose I work using ISO 8859-15, and enter an oe (single
> >> >character) in my editor. This is code 0xBD. I transfer the file to
> >> >you, and your environment is set up for ISO 8859-1, in which 0xBD
> >> >is a decimal 1/2. Supposing some form of ISO 10646 or Unicode for
> >> >wchar_t (which seems likely), I expect this character to be
> >> >converted to '\u0153' in phase one; with your default locale, it
> >> >will be converted to '\u00BD', which is something else entirely.
> >> I see. Anyhow I wasn't thinking of a situation where you transfer
> >> the source code to another compiler.
> >You probably don't have to transfer anything. Just change the value
> >of the environment variable LC_CTYPE (under Unix, anyway).
> >Now imagine that the value of this variable is something with 8859_1
> >in the Window with the compiler, and 8859_15 in the Window with the
> >editor. You insert an oe in the editor, which is an alphabetic
> >character, legal in a variable name, for example. The compiler sees
> >1/2, which is not legal in a variable name, and reports an error.
> >Which, of course, you cannot see in your editor.
> Yes, but that's because the two windows use different charsets. If
> they were the same the compiler would have seen oe. Or are you saying
> the contrary?
No. But in real life, we all use several Windows. And I can imagine
contexts where it would be likely that different windows use different
character sets -- in fact, I use a different character set (with
different encodings) in my web browser and in my editor.
> [...]
> >The gag is, of course, that for tools like editors to work properly,
> >the encoding suffix of the locale MUST correspond to the font
> >encoding of the fonts used. It's not unreasonable for a user to use
> >different fonts in an editor window than in a shell window (from
> >which the compiler is invoked), and if those fonts have different
> >encodings, the user must have set LC_CTYPE differently for the other
> >software to work.
> >This is not some abstract problem. For historical reasons, in
> >western Europe, ISO 8859-1 has become the quasi-universal codeset.
> >However, several important characters for French are missing from it,
> >and it doesn't have the Euro symbol (which didn't exist when it was
> >standardized). For this reason, there is a shift to ISO 8859-15,
> >which corrects this problems. But things don't happen overnight, and
> >there are a number of fonts on my machines which only support 8859-1.
> >The result is that I use 8859-15 when I can, and 8859-1 when I have
> >to, because of a lack of support in the font. And even a simple ls
> >will display different filenames according to the window in which it
> >is invoked.
> Ok. But is this a problem that the committee wanted to address?
Not directly. But they must keep it in mind, so that there will exist
reasonable implementation-dependent solutions for what they do specify.
> I understand that it is a big problem in practice, but I suspect that
> they just decided to leave compiler implementers the option to choose
> a solution, for instance the one using a hidden file that you propose
> (or not providing a solution at all)
Formally, the committee doesn't address these problems. Practically,
the intent is that whatever is standardized can be implemented, and that
the implementation can be what we would consider good: easy to use,
etc. Thus, they have to keep possible solutions in mind, even for
implementation-defined issues.
While I'm at it, I might mention that using the hidden file isn't
without problems either. Unless the name of the file (and the name of
the codesets) is the same for all implementations, you still have a
portability problem. For that matter, if my code is in jisx0212, and
your implementation doesn't support it, we still have a problem.
> >> Of course the code at hand
> >> std::locale loc;
> >> char narrow[] = "0123456789";
> >> wchar_t wide[10];
> >> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
> >> wide);
> >> would translate in this case between ISO latin1 and whatever
> >> encoding is part of loc.
> >The narrow character encoding is *always* part of the locale. The
> >wide character encoding is supposed constant.
I should have been clearer. The intent of wchar_t is that it should be
the same in all locales. The standard doesn't require it, however, and
an implementation can make wchar_t exactly the same as char.
> In the code above? Maybe you say this because you stick to the fact
> that the locale is default constructed? Actually I miss your point
> because the default constructor returns the global locale, which can
> be changed with a call to
> static locale global(const locale& loc);
> Why you say that the wide character encoding is supposed constant? (I
> begin to suspect that I should be very careful to distinguish between
> a charset and its encoding, is it so?)
Again, I'm talking about the intent. The intent is that wchar_t be
locale-independent. If wchar_t changes with the locale, or requires
multibyte chars, then why bother? It doesn't bring any advantages over
straight char.
Practically, I would say that the only "correct" quality implementation
of wchar_t would be a 32-bit signed integer with ISO 10646 encoding. I
can't see any other reasonable alternatives. (The type actually only
requires 22 bits, but on modern machines, the only alternatives with at
least 22 bits are 32 bits and 64 bits.) For some limited uses, an
implementation with a 16-bit wchar_t might be acceptable too.
> >In general, any conversion to string, implicit or explicit, is an
> >error. The correct way to handle this is to provide an operator>>.
> >(This would also allow a manipulator to select between the use of 0
> >and 1, and the use of the first character of numpunct::falsename()
> >and numpunct::truename() -- I can easily think of cases where I want
> >a string tfft...)
> >(More accurately: any conversion to string should be based on
> >outputting to operator>>. I can quite understand the desire to have
> >something simple, along the lines of boost::lexical_cast, even if it
> >prevents the use of all of the formatting possibilities.)
> Yes. One can decide give up formatting, but at least should handle
> correctly the character type. Otherwise it would be better to use
> std::string directly:
> template <typename Block, typename Allocator>
> void to_string(const dynamic_bitset<Block, Allocator>& b,
> std::string& s)
> {
> s.assign(b.size(), '0');
> for (std::size_t i = 0; i < b.size(); ++i)
> if (b.test(i))
> s[b.size() - 1 - i] = '1';
> }
Right. This is probably adequate for 99.9% of all uses. Even in programs
where the normal string type is wstring (or some other user-defined
type), narrow character strings are going to exist, and must be
handled. One more case where this is necessary won't make any real
difference.
But having experienced the universal toString function in Java, I can
only conclude that it is a mistake. Conversion to string is, for most
types, formatting, and should be done by the standard formatting
conventions: operator<<. In this case, the presence of a to_string
function is, IMHO, a design error.
> >> Ah! Maybe you didn't notice that 'Alloc' is different from
> >> 'Allocator'. Anyhow, yes, there's a typo in the original code which
> >> in practice is not parameterized on the string allocator but on its
> >> char traits. Why didn't the compiler warn about this? :-)
> >Probably because no one has ever used any allocator execpt the
> >standard allocator:-). If Alloc and the string allocator are the
> >same, the code is fine.
> I'm not sure that we understood each other: I was referring to the
> fact that the code I pasted, i.e.
> template <typename Block, typename Allocator, typename CharT, typename
> Alloc> void to_string(const dynamic_bitset<Block, Allocator>& b,
> std::basic_string<CharT, Alloc>& s)
> uses the name Alloc for the second parameter of basic_string<>. But
> the second parameter of basic_string is not the string allocator, it's
> the traits type.
That's what I understood. It is almost certainly an error in the code.
But since no one has ever used the code with anything other than the
standard allocators everywhere (so that both Alloc and Allocator have
the type std::allocator), the error has never been noticed.
> This means that to_string is not parameterized at all on the string
> allocator and the wrong name is used for what should have been simply
> named 'Traits' or 'traitsT' (As you can see from the code,
> dynamic_bitset<> also can use a user-defined allocator). I guess the
> intent was to write
> template <typename Block, typename BitsetAllocator, typename CharT,
> typename Traits, typename StringAllocator>
> void to_string(const dynamic_bitset<Block, BitsetAllocator>& b,
> std::basic_string<CharT, Traits, StringAllocator>& s)
> instead.
Actually, I'd see something more on the order of:
 template< typename Block, typename Allocator, typename String >
void
to_string( dynamic_bitset< Block, Allocator > const& b,
String& s )
{
 typedef typename String::value_type char_type ;
typedef typename String::traits_type char_traits ;
// ...
}
The problem is one of how much information you need, and where to get
it. Ideally, from the point of view of generic programming, you would
use an iterator type, and pass an iterator for the output. In practice,
however, when handling text, iterators don't have enough information;
you need to pick up a char_traits from somewhere. This suggests using a
string type. Nowhere near as generic, but at least you have the
necessary information to do the job. And for formatting, as here, even
char_traits lacks necessary information: you need at least a basic_ios
from somewhere. (Actually, I would argue that for any serious
formatting, even basic_ios lacks enough information: should the '0' and
the '1' be in Latin, Arabic, Devanagari or some other script? But
basic_ios contains a locale, so it should be easy in the future to add a
formatting option to select Latin or locale-specific digits, etc.)
> >> >Of course, there is absolutely nothing to guarantee that it works
> >> >correctly even if it compiles. If my system uses EBCDIC for char,
> >> >and ISO 10646 for wchar_t, the conversion to a wstring will result
> >> >in a sequence of '? and '?: probably not what was intended either.
> >> Yes. The code above should solve at least this problem.
> >Painfully.
> Well, that was something! :-) Obviously in my example the pain was not
> that terrible because I just needed the two characters '0' and '1',
> but you are absolutely right that passing through an ostringstream is
> a much better solution, for the reasons you have explained.
Note that in this case, the ostream would have handled the digits
correctly if I just output << 0 and << 1. No need for character
constants at all, unless I wanted to support "tfft...".
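Something along these lines would be enough (a hypothetical free
function, not the actual dynamic_bitset inserter):

    #include <cstddef>
    #include <ostream>

    // The stream's own num_put facet chooses how the digits are
    // rendered; no '0'/'1' character constants are needed.
    template <class Bitset, class CharT, class Traits>
    std::basic_ostream<CharT, Traits> &
    print_bits(std::basic_ostream<CharT, Traits> & os, const Bitset & b)
    {
        for (std::size_t i = b.size(); i > 0; --i)
            os << (b.test(i - 1) ? 1 : 0);
        return os;
    }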
> It's curious that I have seen exactly the contrary in a lot of code.
That's because people don't really understand the importance of
separation of concerns.
I'll admit that I have only recently come to realize the importance of
this. It was pointed out to me by a colleague at my last job, Adam
Wilkshire; without him, I'd probably still be generating strings as
well.
This is, in fact, one case which simply cries for a free function:
neither the class itself nor std::string should be encumbered with
knowledge of the other. For generic programming to work, it is
important that the name of this free function be standardized. As it
happens, it is standardized: the standard name for formatting to text
representation is operator<<, with the destination as the first
parameter, and what is to be formatted as the second.
Obviously, how something is formatted may depend on the type of the
destination, and a BER encoding will look quite different from text. In
the end, we have a need for double dispatch, and while C++ doesn't
support dynamic double dispatch, function overloading and generic
programming provide a static equivalent which is often quite acceptable.
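A trivial sketch of the convention, with a hypothetical class (the point
is only that the formatting lives in a free operator<<, so neither the
class nor the string type needs to know about the other):

    #include <ostream>

    struct Point { int x, y; };     // hypothetical class

    template <class CharT, class Traits>
    std::basic_ostream<CharT, Traits> &
    operator<<(std::basic_ostream<CharT, Traits> & os, Point const & p)
    {
        // destination first, value to be formatted second;
        // ' ' is in the basic character set, and the stream widens it
        return os << p.x << ' ' << p.y;
    }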
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Mon, 26 Aug 2002 00:27:17 GMT Raw View
On 23 Aug 2002 16:15:05 GMT, kanze@gabi-soft.de (James Kanze) wrote:
> But what you say is true for any reasonable
>implementation, I think: after phase 1, the compiler has a fixed,
>internal encoding which is independant of the locale.
My emphasis was on the fact that phase 1 involves the codeset only,
not that the codeset is required only in phase 1. But this isn't that
important for our discussion.
>One possible other case is when generating narrow character literals, to
>convert the internal encoding into the narrow character encoding. Thus,
>'\u0153' might generate a different 8 bit code according to the locale.
Uh?? When is a character-literal "generated"?
>> - It's usual however that, because of issues related e.g. to fonts,
>> files belonging to the same project are edited with different
>> charsets. Thus, there's the problem of telling the compiler which
>> encoding/charset must be used for each file. Transcoding is impossible
>> (or at least a transcoding "on the fly" would be possible, but
>> impractical) because the files must be seen simultaneously from
>> different environments.
>
>Correct. My impression is that today, most compilers ignore the
>problem. They don't (yet) support UCN's, and internally, the work with
>bytes -- if you use an extended character in a string literal, for
>example, they just stuff the byte into the literal without any
>transcoding.
Well, of course doing phase 1 mapping correctly doesn't imply you have
to support UCNs. Anyhow, if the compiler implementer decides to support
UCNs he will presumably use Unicode internally and so will be, so to
speak, more "willing" to support many locales for source files. Anyhow
it seems to me that he still has the right to support one locale only.
It's up to the user to have suitable external tools (see below).
>The one possible exception is when it comes to generating wide character
>literals. It's possible that some implementations use the locale to
>determine how to convert the narrow characters that it uses internally
>into the wide characters.
This sounds like the above: where do C++ compilers have to generate
literals???
>> - Statically converting (the character denoted by) a character
>> literal, e.g. 'a' or '1', to a generic charT is impossible. The null
>> character is an exception because charT() is guaranteed to be the
>> charT equivalent of '\0' with any locale.
>
>The null character is not an exception, because '\0' has type char, and
>there is not necessarily a conversion char to charT. The difference
>with regards to the null character is that we know how to write it for
>any type: charT(). And that for types other than char and wchar_t, this
>is the only character which we know how to write portably.
That's what I meant (certainly it was not my intent to say that
charT() is a literal! :-)). But you said it better (as usual).
>> Thus, something like this is ok (I'm about to propose it for boost's
>> utilities)
>
>> template <class charT, class Traits>
>> inline bool is_null_char(const charT & c) {
>> return Traits::eq(c, charT());
>> }
>
>> template <class charT>
>> inline bool is_null_char(const charT & c) {
>> return is_null_char<charT, std::char_traits<charT> >(c);
>> }
>
>> [In practice one would almost always use the second template; the
>> first is there just in case...
>
>In practice, I suspect that the first template would be more useful. If
>I'm defining my own character type, there's a good chance that I've
>defined my own traits type as well.
Yes, I said it! The first is the most useful, but everybody would use
the second :-)
>> The purpose is simply to generalize conditions involving the null
>> character like
>
>> if (p[0] == '\0' || p[1] == '\0')
>
>> while keeping a certain readability.
>
>And "Traits::eq( p[ 0 ], charT() )" isn't readable?
Well, readability depends a lot on the context. I'm not claiming that
is_null_char() is clearer in every situation. Here's a real life
example where it could be useful. It's from the new boost lexical_cast
(which will presumably be in the next release). This new version
solves problems with strings containing white spaces, supports wide
characters, and also handles separately cases where no stream is
needed, for instance when converting from string to char or from array
of chars to char. In the latter case the core implementation is
simply:
static Target do_cast(Source arg)
{
if(arg[0]==0 || arg[1]!=0)
throw bad_lexical_cast();
return arg[0];
}
Now, the above is ok when the source is an array of char or wchar_t
but if you want it to work also when converting from array of charTs
(in the context of the do_cast function arg becomes a pointer via
array-to-pointer conversion, of course) to charT you can't rely on the
existence of an operator==(), as you pointed out. Thus the code
becomes e.g.:
static Target do_cast(Source arg)
{
typedef boost::remove_pointer<Source>::type char_type;
if( std::char_traits<char_type>::eq(arg[0], char_type()) == true
|| std::char_traits<char_type>::eq(arg[1], char_type()) == false)
throw bad_lexical_cast();
return arg[0];
}
IMHO (and I emphasize the expression IMHO, because these are religious
issues :-)) it is much less readable than e.g.:
static Target do_cast(Source arg) {
if (is_null_char(arg[0]) || !is_null_char(arg[1]))
throw bad_lexical_cast();
return arg[0];
}
Of course
static Target do_cast(Source arg) {
typedef boost::remove_pointer<Source>::type charT;
typedef std::char_traits<charT> Tr;
if ( Tr::eq(arg[0], charT() ) || !Tr::eq( arg[1], charT() ) )
throw bad_lexical_cast();
return arg[0];
}
is quite readable as well, but introducing typedefs only to shorten
expressions that involve some type names can be tedious in the long
run, don't you agree?
Ah, I can read your mind: you are thinking that the traits type should
be a generic Traits and not std::char_traits :-) But the above was
only to illustrate the "readability" issue!
>> [...]
>> >The gag is, of course, that for tools like editors to work properly,
>> >the encoding suffix of the locale MUST correspond to the font
>> >encoding of the fonts used. It's not unreasonable for a user to use
>> >different fonts in an editor window than in a shell window (from
>> >which the compiler is invoked), and if those fonts have different
>> >encodings, the user must have set LC_CTYPE differently for the other
>> >software to work.
>
>> >This is not some abstract problem. For historical reasons, in
>> >western Europe, ISO 8859-1 has become the quasi-universal codeset.
>> >However, several important characters for French are missing from it,
>> >and it doesn't have the Euro symbol (which didn't exist when it was
>> >standardized). For this reason, there is a shift to ISO 8859-15,
>> >which corrects this problems. But things don't happen overnight, and
>> >there are a number of fonts on my machines which only support 8859-1.
>> >The result is that I use 8859-15 when I can, and 8859-1 when I have
>> >to, because of a lack of support in the font. And even a simple ls
>> >will display different filenames according to the window in which it
>> >is invoked.
>
>> Ok. But is this a problem that the committee wanted to address?
>
>Not directly. But they must keep it in mind, so that there will exist
>reasonable implementation dependant solutions for what they do specify.
>
>> I understand that it is a big problem in practice, but I suspect that
>> they just decided to leave compiler implementers the option to choose
>> a solution, for instance the one using a hidden file that you propose
>> (or not providing a solution at all)
>
>Formally, the committee doesn't address these problems. Practically,
>the intent is that whatever is standardized can be implemented, and that
>the implementation can be what we would consider good: easy to use,
>etc. Thus, they have to keep possible solutions in mind, even for
>implementation defined issues.
I hope so! In practice it's very simple: it's enough to ask them what
they had in mind, isn't it? (That's a joke, of course, and there's no
intent of being disrespectful on my part. Anyhow, there's some truth
in it: the question is "What possible implementations were
considered?").
If I had to guess I'd say that they wanted to put the burden on
external tools, not on the compiler. Imagine an environment (editor)
with the following capabilities:
- it can read a file written with a given charset and transcode it on
the fly for displaying in another charset. In practice, the actual
encoding in which you would store the file is the encoding, say ISO
Latin-1, that your compiler understands. Now, when you read that file
from your ISO Latin-9 window the editor translates it on the fly and
lets it appear as a genuine ISO Latin-9 file. If you type in your oe
character (0xBD) it's up to the editor, when saving the file, to write
it as
\u0153
The saved file will be an ISO Latin-1 file with UCNs for all the
characters that are in Latin-9 but not in Latin-1.
On the contrary, if you open that ISO Latin-1 file within, say, an ASCII
window you have to accept seeing your oe and your <a with acute
accent> as e.g. UCNs even from the editor (translation on the fly
again).
With such a tool, the compiler is free to understand only ISO Latin-1
(provided that it supports UCNs) and the handling of non-Latin-1
characters is transparent to the user.
If you port the project to another compiler that only understands ISO
Latin-5 you must transcode the files only once, then use your editor
to look at those Latin-5 files as Latin-9 files.
[...]
>I should have been clearer. The intent of wchar_t is that it should be
>the same in all locales. The standard doesn't require it, however, and
>an implementation can make wchar_t exactly the same as char.
>
>[...]
>
>Again, I'm talking about the intent. The intent is that wchar_t be
>locale independant. If wchar_t changes with the locale, or requires
>multibyte char's, then why bother. It doesn't bring any advantages over
>straight char.
I've assumed this to mean that changing the locale shouldn't change
the encoding of a given wide character. I'm astonished that this is
not required by the standard (though I admit that doing it otherwise
would IMHO be an explicit fraud)
>
>Practically, I would say that the only "correct" quality implementation
>of wchar_t would be a 32 bit signed integer with ISO 10646 encoding. I
>can't seen any other reasonable alternatives.
Why signed?
[...]
>> Yes. One can decide give up formatting, but at least should handle
>> correctly the character type. Otherwise it would be better to use
>> std::string directly:
>
>> template <typename Block, typename Allocator>
>> void to_string(const dynamic_bitset<Block, Allocator>& b,
>> std::string& s)
>> {
>> s.assign(b.size(), '0');
>> for (std::size_t i = 0; i < b.size(); ++i)
>> if (b.test(i))
>> s[b.size() - 1 - i] = '1';
>> }
>
>Right. This is probably adequat for 99.9% of all use. Even in programs
>where the normal string type is wstring (or some other user defined
>type), narrow character strings are going to exist, and must be
>handled. One more case where this is necessary won't make any real
>difference.
Anyhow your solution below (taking advantage of basic_string typedefs)
is better.
>
>But having experienced the universal toString function in Java, I can
>only conclude that it is a mistake. Conversion to string is, for most
>types, formatting, and should be done by the standard formatting
>conventions : operator<<. In this case, the presence of a to_string
>function is, IMHO, a design error.
The fact that it is a member function is certainly a design error.
Anyhow its presence per se IMHO isn't, at least if it is meant as a
simple shortcut for
- default-construct a stringstream strm
- output to it
- return strm.str()
In practice, something for when you don't need locale genericity or
formatting options (on the model of lexical_cast).
[...]
>>I guess the intent was to write
>
>> template <typename Block, typename BitsetAllocator, typename CharT,
>> typename Traits, typename StringAllocator>
>> void to_string(const dynamic_bitset<Block, BitsetAllocator>& b,
>> std::basic_string<CharT, Traits, StringAllocator>& s)
>> instead.
>
>Actually, I'd see something more on the order of:
>
> template< typename Block, typename Allocator, typename String >
> void
> to_string( dynamic_bitset< Block, Allocator > const& b,
> String& s )
> {
> typedef typename String::value_type char_type ;
> typedef typename String::traits_type char_traits ;
> // ...
> }
>
Yes!! :-)
template< typename Block, typename Allocator, typename String >
void
to_string( dynamic_bitset< Block, Allocator > const& b,
String& s )
{
typedef typename String::value_type char_type ;
typedef typename String::traits_type char_traits ;
std::basic_ostringstream<char_type, char_traits> strm;
strm << b;
s = strm.str();
}
>The problem is one of how much information you need, and where to get
>it. Ideally, from the point of view of generic programming, you would
>use an iterator type, and pass a iterator for the output. In practice,
>however, when handling text, iterators don't have enough information;
>you need to pick up a char_traits from somewhere. This suggests using a
>string type. No where near as generic, but at least you have the
>necessary information to do the job.
Good. The second version is no less general than the first one, and
obtains via typedefs the types that the compiler must deduce with the
first form. In the end it looks IMHO cleaner (and is also more
digestible to some buggy compilers).
[...]
>> Obviously in my example the pain was not
>> that terrible because I just needed the two characters '0' and '1',
>> but you are absolutely right that passing through an ostringstream is
>> a much better solution, for the reasons you have explained.
>
>Note that in this case, the ostream would have handled the digits
>correctly if I just output << 0 and << 1. No need for character
>constants at all, unless I wanted to support "tfft...".
Good point. I didn't think of it. In theory I would expect the
insertion of a charT to be faster than the insertion of an int, but
this is unlikely to be a problem in practice, and maybe isn't true
either.
As to the possibility of "tfft..", as I said, I would leave it outside
the to_string function. Whoever wants the options should use the stream
and a suitable manipulator (which I'll enjoy writing as soon as I have
some spare time! :-))
>> It's curious that I have seen exactly the contrary in a lot of code.
>
>That's because people don't really understand the importance of
>separation of concerns.
Well, I'm under the impression that, in this case, what they missed is
how deceptive the "character type genericity" of _M_copy_to_string is.
Considering how widespread the use of STLport is in the world, however,
I'm astonished that this hasn't been corrected.
Genny.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Mon, 26 Aug 2002 17:20:26 GMT Raw View
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<09vhmu00bb3lkic5queg9t9s25m3p4ckq0@4ax.com>...
> On 23 Aug 2002 16:15:05 GMT, kanze@gabi-soft.de (James Kanze) wrote:
> > But what you say is true for any reasonable implementation, I think:
> >after phase 1, the compiler has a fixed, internal encoding which is
> >independant of the locale.
> My emphasis was on the fact that phase 1 involves the codeset only,
> not that the codeset is required only in phase 1. But this isn't that
> important for our discussion.
> >One possible other case is when generating narrow character literals,
> >to convert the internal encoding into the narrow character encoding.
> >Thus, '\u0153' might generate a different 8 bit code according to the
> >locale.
> Uh?? When is a character-literal "generated"?
Phase 5. The question is: what should '\u0153' generate? The response is
0xBD in a locale using 8859-15. The response is implementation-defined
in 8859-1.
A better example might be '\u0178': in 8859-14, this is 0xAF, in 8859-15
0xBE.
Of course, the compiler really should use the locale active at
execution for this:-).
> >> - It's usual however that, because of issues related e.g. to fonts,
> >> files belonging to the same project are edited with different
> >> charsets. Thus, there's the problem of telling the compiler which
> >> encoding/charset must be used for each file. Transcoding is
> >> impossible (or at least a transcoding "on the fly" would be
> >> possible, but impractical) because the files must be seen
> >> simultaneously from different environments.
> >Correct. My impression is that today, most compilers ignore the
> >problem. They don't (yet) support UCN's, and internally, the work
> >with bytes -- if you use an extended character in a string literal,
> >for example, they just stuff the byte into the literal without any
> >transcoding.
> Well, of course doing phase 1 mapping correctly doesn't imply you have
> to support UCNs,
A C++ compiler (like a Java compiler or a C99 compiler) MUST support
UCNs. If it doesn't, it isn't a C++/Java/C99 compiler. There are no
optional sections in the standard.
> anyhow if the compiler implementer decides to support UCNs he will
> presumably use Unicode internally and so will be, to say, more
> "willing" to support many locales for source files.
This is what I would presume. Offhand, I would think that the most
reasonable way to support UCNs would be to use ISO 10646 internally. At
least for a totally new compiler; for an existing compiler, the change
of the data type for a string would propagate through the entire
compiler, and might cause added work elsewhere. It is just possible
that it would be easier to make the lexer UCN-aware, and leave the
compound characters compound.
> Anyhow it seems to me that he has still the right to support one
> locale only.
The only required locale is the locale "C", which is only required to
contain the characters in the basic character set. Such a compiler
could convert any UCN not in the basic character set into a '?' (or
whatever) in character and string literals. (There is nothing to
require wchar_t to be larger than a char, either.) Such a compiler must
still distinguish between e.g. "l\u00e8ve" and "lev\u00e9": these are
two different variable names, and must be treated as such.
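For instance (my example, assuming a compiler that supports UCNs in
identifiers, as the standard requires):

    int l\u00e8ve = 1;   // the identifier "lève"
    int lev\u00e9 = 2;   // the identifier "levé" -- a distinct name,
                         // even if neither accented letter exists in
                         // the narrow execution character set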
> It's up to the user to have suitable external tools (see below)
> >The one possible exception is when it comes to generating wide
> >character literals. It's possible that some implementations use the
> >locale to determine how to convert the narrow characters that it uses
> >internally into the wide characters.
> This sounds like the above: where do C++ compilers have to generate
> literals???
Phase 5.
The significant aspect is that the literals are generated in the
execution character set, and not in the source character set. How the
compiler is supposed to do this if the execution character set is
locale-dependent (as it is under Unix) is left as an exercise for the
reader.
Realistically, I can only see two options: use the current user locale
when compiling, or always use locale "C". All in all, I suspect that
the latter is the better choice: it does mean that you cannot
practically use accented letters in character and in string literals,
but at least, you should get the same results as I do, even if we
normally use different locales.
[...]
> >And "Traits::eq( p[ 0 ], charT() )" isn't readable?
> Well, readability depends a lot on the context. I'm not claiming that
> is_null_char() is clearer in every situation. Here's a real life
> example where it could be useful. It's from the new boost lexical_cast
> (which will presumably be in the next release). This new version
> solves problems with strings containing white spaces, supports wide
> characters, and also handles separately cases where no stream is
> needed, for instance when converting from string to char or from array
> of chars to char. In the latter case the core implementation is
> simply:
> static Target do_cast(Source arg)
> {
> if(arg[0]==0 || arg[1]!=0)
> throw bad_lexical_cast();
>
> return arg[0];
> }
> Now, the above is ok when the source is an array of char or wchar_t
> but if you want it to work also when converting from array of charTs
> (in the context of the do_cast function arg becomes a pointer via
> array-to-pointer conversion, of course) to charT you can't rely on the
> existence of an operator==(), as you pointed out. Thus the code
> becomes e.g.:
> static Target do_cast(Source arg)
> {
> typedef boost::remove_pointer<Source>::type char_type;
> if( std::char_traits<char_type>::eq(arg[0], char_type()) == true
> || std::char_traits<char_type>::eq(arg[1], char_type()) == false)
> throw bad_lexical_cast();
> return arg[0];
> }
I'd write:
static Target do_cast( Source arg )
{
typedef boost::remove_pointer<Source>::type char_type;
if ( std::char_traits< char_type >::eq( arg[ 0 ], char_type())
|| ! ...
No need for the comparison with true.
More significantly, I imagine that the function would be a member of a
class which already has a typedef for the traits (which, of course,
won't necessarily be an std::char_traits<>). So we're down to:
if ( traits_type::eq( arg[ 0 ], char_type() )
    || ! traits_type::eq( arg[ 1 ], char_type() ) )
> >> [...]
> >> >The gag is, of course, that for tools like editors to work
> >> >properly, the encoding suffix of the locale MUST correspond to the
> >> >font encoding of the fonts used. It's not unreasonable for a user
> >> >to use different fonts in an editor window than in a shell window
> >> >(from which the compiler is invoked), and if those fonts have
> >> >different encodings, the user must have set LC_CTYPE differently
> >> >for the other software to work.
> >> >This is not some abstract problem. For historical reasons, in
> >> >western Europe, ISO 8859-1 has become the quasi-universal codeset.
> >> >However, several important characters for French are missing from
> >> >it, and it doesn't have the Euro symbol (which didn't exist when
> >> >it was standardized). For this reason, there is a shift to ISO
> >> >8859-15, which corrects this problems. But things don't happen
> >> >overnight, and there are a number of fonts on my machines which
> >> >only support 8859-1. The result is that I use 8859-15 when I can,
> >> >and 8859-1 when I have to, because of a lack of support in the
> >> >font. And even a simple ls will display different filenames
> >> >according to the window in which it is invoked.
> >> Ok. But is this a problem that the committee wanted to address?
> >Not directly. But they must keep it in mind, so that there will
> >exist reasonable implementation dependant solutions for what they do
> >specify.
> >> I understand that it is a big problem in practice, but I suspect
> >> that they just decided to leave compiler implementers the option to
> >> choose a solution, for instance the one using a hidden file that
> >> you propose (or not providing a solution at all)
> >Formally, the committee doesn't address these problems. Practically,
> >the intent is that whatever is standardized can be implemented, and
> >that the implementation can be what we would consider good: easy to
> >use, etc. Thus, they have to keep possible solutions in mind, even
> >for implementation defined issues.
> I hope so! In practice it's very simple: it's enough to ask them what
> they had in mind, isn't it? (That's a joke, of course, and there's no
> intent of being disrespectful on my part. Anyhow, there's some truth
> in it: the question is "What possible implementations were
> considered?").
Well, most of the implementors are also committee members. I guess they
knew what they intended, although sometimes I'm not sure:-).
> If I had to guess I'd say that they wanted to put the burden on
> external tools, not on the compiler. Imagine an environment (editor)
> with the following capabilities:
> - it can read a file written with a given charset and transcode it on
> the fly for displaying in another charset. In practice, the actual
> encoding in which you would store the file is the encoding, say ISO
> Latin-1, that your compiler understands. Now, when you read that file
> from your ISO Latin-9 window the editor translates it on the fly and
> let it appear as a genuine ISO Latin-9 file. If you type in your oe
> character (0xBD) it's up to the editor, when saving the file, to write
> it as
> \u0153
> The saved file will be an ISO Latin-1 file with UCNs for all the
> characters that are in Latin-9 but not in Latin-1.
I believe that this was possibly intended, yes. That files would
actually never contain anything but universal character names, and the
editors would take care of the displaying.
This was, of course, the intent with trigraphs:-).
> On the contrary if you open that ISO Latin-1 file within, say, a ASCII
> window you have to accept to see your oe and your <a with acute
> accent> as e.g. UCNs even from the editor (translation on the fly
> again).
> With such a tool, the compiler is free to understand only ISO Latin-1
> (provided that it supports UCNs) and the handling of non Latin-1
> characters is transparent to the user.
Agreed.
> If you port the project to another compiler that only understands ISO
> Latin-5 you must transcode the files only once, then use your editor
> to look at those Latin-5 files as Latin-9 files.
If the editor only generates characters in the basic source character
set, you should be home free on all machines that understand ASCII. And
given the prevalence of ASCII, you can be pretty sure that any machine
which uses something else will have the necessary utilities to
transcode.
> [...]
> >I should have been clearer. The intent of wchar_t is that it should be
> >the same in all locales. The standard doesn't require it, however, and
> >an implementation can make wchar_t exactly the same as char.
> >[...]
> >Again, I'm talking about the intent. The intent is that wchar_t be
> >locale independant. If wchar_t changes with the locale, or requires
> >multibyte char's, then why bother. It doesn't bring any advantages
> >over straight char.
> I've assumed this to mean that changing the locale shouldn't change
> the encoding of a given wide character. I'm astonished that this is
> not required by the standard (though I admit that doing it otherwise
> would IMHO be an explicit fraud)
Changing the locale doesn't change the numeric value of any variable.
If you use a different locale, it MIGHT change the interpretation.
I believe that the intent is that where reasonably possible, the
interpretation of a wchar_t will be the same in all locales. The
standard doesn't really require much of wchar_t. Or, for that matter,
it doesn't require much in terms of different locales.
In the end, it is a question of market pressure and quality of
implementation; I don't know of any compiler on a Unix box or on Windows
which uses an 8-bit wchar_t, although the standard certainly allows it.
> >Practically, I would say that the only "correct" quality
> >implementation of wchar_t would be a 32 bit signed integer with ISO
> >10646 encoding. I can't seen any other reasonable alternatives.
> Why signed?
Historical reasons. Formally, the only way to check for EOF is to use
the various functions in char_traits. Practically, C required EOF to be
negative, and there is doubtlessly stupid code around which tests for
less than zero. (There's probably even a lot of code which tests for
-1. I've never seen a machine where it was defined differently,
although the standards, both C and Posix, allow any negative value.)
Since making customers look like fools isn't a particularly good
marketing technique, I prefer to use -1, like everyone else.
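For the record, the formally portable test goes through char_traits
rather than comparing with -1; a sketch:

    #include <iostream>
    #include <string>       // std::char_traits

    int main()
    {
        typedef std::char_traits<wchar_t> traits;
        traits::int_type c;
        // eq_int_type/eof instead of "c < 0" or "c == -1"
        while (!traits::eq_int_type(c = std::wcin.get(), traits::eof()))
            std::wcout.put(traits::to_char_type(c));
        return 0;
    }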
> [...]
> Well, I'm under the impression that, in this case, what they missed is
> how deceptive is the "character type genericity" of _M_copy_to_string.
> Considering how widespread in the world is the use of STLport however
> I'm astonished that this hasn't been corrected.
The STLport is widely used, but how many people are really using
std::basic_string with types other than char or wchar_t? Regardless of
the standard library used. I know that both the STLport people and the
Dinkumware people set very high standards for quality, but their first
concern is obviously the parts of the library that people really use.
Should they fix the problem we've just pointed out (which prevents the
library from being conforming, even though it probably doesn't affect a
single user), or should they add support for file descriptors (attach,
etc.) to filebuf (which isn't in the standard, but is probably useful to
at least 10-15% of the users)?
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: 19 Aug 2002 18:00:09 GMT Raw View
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<c8iklucc6pk25uavgjsb0rj8tvrvjqrbep@4ax.com>...
> On Mon, 12 Aug 2002 17:29:32 GMT, kanze@gabi-soft.de (James Kanze)
> wrote:
> >Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
> >news:<4j98luch9j63jdjfbef3u4jmdrgc3qh53r@4ax.com>...
> >> The original question was whether, given that c is a
> >> wchar t, the following are equivalent
> >> c==0, c=='\0', c==L'\0', c==wchar t()
> > [...]
> >The important text is in §3.9.1/5:
> > Type wchar_t is a distinct type whose values can represent distinct
> > codes for all members of the largest extended character set specified
> > among the supported locales. Type wchar_t shall have the same size,
> > signedness and alignment requirements as one of the other integral
> > types, called its underlying type.
> >And the fact that wchar_t is an integral type, see §3.9.1/7 (which
> >also requires integral types to use a binary representation). Given
> >that it is an integral type, and that 0 must be representable in it,
> >§4.7 should pose no problems.
> Yes. I went through these too. What I was missing to 'close the
> circle' was a direct guarantee that the value of L'\0' is zero.
Hmmm. I hadn't thought of that possibility. But I suppose that
§2.13.2/4 should apply: "The escape \000 consists of the backslash
followed by one, two, or three octal digits that are taken to specify
the value of the desired character." A priori, I don't see anything
relevant to converting from the basic character set to any other. After
all, '\033' results in a character with the value 27, regardless of the
character set used. Why should '\0' be different? And the text doesn't
speak of any encoding for wchar_t either, so I cannot imagine that it
does not apply directly to L'\0'.
> (I say 'direct' guarantee because an indirect guarantee exists: it
> follows from the fact that wchar_t() and L'\0' must represent the same
> character, and wchar_t() has the value zero). I think I've found it in
> 2.2/3.
> >> [snip]
> >> >Note too that while the standard says that you can use any POD type
> >> >you can't specialize char_traits or any of the classes you might need
> >> >in the locale section except for user defined types. [...]
> >> Is this a problem?
> >It is if you want to do IO on the type. File IO uses codecvt, and
> >numeric IO uses numpunct (indirectly, via num_get and num_put) and
> >codecvt (and ctype?). The standard library doesn't require any
> >non-specialized versions of these, so YOU have to provide them. And
> >about the only way I can think of to provide them is to specialize
> >the standard templates (but I'm not 100% sure about this), which you
> >can only do on a user defined type.
> I meant: is it a problem that you can't use a built-in type?
Well, it is a lot more difficult. Since users expect to be able to do
things like '0' + digit, etc. (and the standard library may be one of
your users here), you probably have to overload all of the basic
operators. While std::string uses char_traits<>::eq, et al., I'm not
sure that this is required for e.g. std::num_get or std::num_put.
In what I started, I created a function which returned the value of the
character interpreted as a digit, based on tables generated
automatically from the Unicode data file. In this way, input would work
correctly even if the user input Devanagari or Arabic digits. For
output, I had considered adding a parameter (using
std::ios_base::xalloc) to specify the character for '0'; as far as I can
tell, all of the digit sequences in ISO 10646 are contiguous.
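A minimal sketch of what such a function might look like (the table below is
only illustrative and covers a few blocks; the real one was generated from the
Unicode data file):
#include <cstddef>
// returns the value of c interpreted as a decimal digit, or -1
int digitValue(wchar_t c)
{
    // each entry is the code of the DIGIT ZERO of one contiguous block
    static wchar_t const zeros[] = {
        0x0030,     // Latin
        0x0660,     // Arabic-Indic
        0x0966,     // Devanagari
    };
    for (std::size_t i = 0; i != sizeof zeros / sizeof zeros[0]; ++i) {
        if (c >= zeros[i] && c <= zeros[i] + 9) {
            return static_cast<int>(c - zeros[i]);
        }
    }
    return -1;      // not a digit in any of the listed blocks
}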
Of course, this still leaves open the question of whether the language
prints numbers high digit first (as in Europe and America), or low digit
first (as in Arabic).
> >> >[...]
> >> >> - as far as I understand it I should use widen() even when a
> >> >> character is directly represented with a literal in the text of
> >> >> the program (there is no generic literal, is there?). In fact,
> >> >> the already mentioned TC++SL has the following examples on page
> >> >> 717:
> >> >> std::locale loc;
> >> >> char narrow[] = "0123456789";
> >> >> wchar_t wide[10];
> >> >> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow,
> >> >> narrow+10, wide);
> >> >I'm not sure anyone really knows what you should do, even in C,
> >> >and even when charT is wchar_t. I think we're still experimenting
> >> >some in this domain. (I know I am, anyway.)
> >> >There are wide character literals: L"0123456789". The problem is
> >> >how these are mapped may depend on the locale at compile time.
> >> The above *does* depend on the locale loc. Are you alluding to some
> >> other hidden problem?
> >The code using the facet depends on the run-time locale; the
> >conversion of wide character literals depends on the compile-time
> >locale. That's a big difference if you are developing code in
> >Italy, and it is to run in Japan.
> >Practically, I don't think it will make a difference if you stick to
> >the basic character set (although I'm just guessing).
> Sticking to the basic character set, I think I don't understand you
> correctly here: maybe you refer to the character mapping in phase 1?
Yes. Suppose I work using ISO 8859-15, and enter an oe (single
character) in my editor. This is code 0xBD. I transfer the file to
you, and your environment is set up for ISO 8859-1, in which 0xBD is a
decimal 1/2. Supposing some form of ISO 10646 or Unicode for wchar_t
(which seems likely), I expect this character to be converted to
'\u0153' in phase one; with your default locale, it will be converted to
'\u00BD', which is something else entirely.
And what should the poor compiler writer do? Currently, I think most
simply use the current locale (at least under Unix), or perhaps even
refuse code with such characters (forcing the locale to "C"). But it
isn't hard to imagine a compiler option to specify the code. However,
the problem is that the compiler should use one code for your .cc file,
and a different code for my .h which your .cc includes. Other than some
sort of pragma, or a naming convention, I don't know what a compiler can
do. Off hand, some sort of hidden file in the directory in which the
compiler finds the file seems like a good idea too.
> Yes, I see a problem if, for instance, I will run the program on a
> machine with a character set different from the machine where I
> compiled the code, but is that an issue of "locales"? (My idea is that
> character mapping is the substitution of a character in the text of
> the program with an integer value; the value depends on the internal
> encoding used by the compiler (EBCDIC, ISO Latin-1, etc.) but not,
> strictly speaking, on the locale. I guess it may depend on the locale
> only for characters not in the basic set. Where am I wrong?)
The question is how the compiler interprets what it reads in phase 1.
Does it use locale or not? And if not, how does it handle extended
characters? I'll admit that I've not done any real testing in this
regard. I have my ideas about how a compiler *should* handle it, but
the machines I have access to at the moment don't have any locales of
interest installed, so I cannot really test anything.
> >If you're stuck with narrow char's, you have to widen them. I'd try
> >and declare constants for all of the char's I could, however, and use
> >them when possible.
> >Note that your expression use_facet<ctype<charT> >(loc).widen('A') is
> >likely to result in a bad_cast exception if you haven't installed the
> >additional facet.
> What has initially made me ask these questions is the code of the new
> boost::dynamic bitset. Let's consider, for instance, the function
> template <typename Block, typename Allocator, typename CharT, typename
> Alloc> void to string(const dynamic bitset<Block, Allocator>& b,
> std::basic string<CharT, Alloc>& s)
> {
> s.assign(b.size(), '0');
> for (std::size t i = 0; i < b.size(); ++i)
> if (b.test(i))
> s[b.size() - 1 - i] = '1';
> }
> The digits zero and one are certainly in the basic (execution) set.
> But how should I represent them generically?
Certainly not the way they've done it here. (Although they can be
excused for the error. I doubt that there are many people who have any
real experience with std::basic_string other than for char or wchar_t.)
To begin with, '0' has type char, and there isn't the slightest reason
to suppose that there is a conversion (implicit or explicit) from char
to charT. Other than ctype::widen or codecvt::in, I don't think that
there is any way of getting from char to charT.
Generally, I think it should give a compiler error. Unless, maybe,
you've got a very strange allocator, which typedef's size_type to char,
and arranges to use char as a pointer as well, and the implementation
has typedef'ed iterator to pointer -- in that case, both parameters to
assign have the same type, and you end up in the templated function with
two iterators.
Of course, there is absolutely nothing to guarantee that it works
correctly even if it compiles. If my system uses EBCDIC for char, and
ISO 10646 for wchar_t, the conversion to a wstring will result in a
sequence of 'ð' and 'ñ': probably not what was intended either.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Wed, 21 Aug 2002 22:51:23 GMT Raw View
On 19 Aug 2002 18:00:09 GMT, kanze@gabi-soft.de (James Kanze) wrote:
>Gennaro Prota <gennaro_prota@yahoo.com> wrote
>> What I was missing to 'close the
>> circle' was a direct guarantee that the value of L'\0' is zero.
>
>Hmmm. I hadn't thought of that possibility.
Do you mean: "I hadn't thought you were missing something so
obvious?". Yes, I was! :-)
> But I suppose that
>§2.13.2/4 should apply: "The escape \000 consists of the backslash
>followed by one, two, or three octal digits that are taken to specify
>the value of the desired character."
Yeah. It simply didn't occur to me to regard '\0' as a special form of
'\000' (I tend to think of it as a special syntax to represent the
string terminator, and not as a way to state the exact value of the
character. It's similar to the fact that char * p = 0 doesn't mean I
assign the address zero to p. Anyhow for characters, as you note,
things are different and '\0' means that the value is exactly zero)
Thanks for pointing that out! :-)
[snip]
>> I meant: is it a problem that you can't use a built-in type?
>
>Well, it is a lot more difficult. Since users expect to be able to do
>things like '0' + digit, etc. (and the standard library may be one of
>your users here), you probably have to overload all of the basic
>operators. While std::string uses char_traits<>::eq, et al., I'm not
>sure that this is required for e.g. std::num_get or std::num_put.
Yes, but what built-in type would you use? If you chose, say, short
int, how would you manage to output a character as a character and not
as a number?
[...]
>
>> >> >> - as far as I understand it I should use widen() even when a
>> >> >> character is directly represented with a literal in the text of
>> >> >> the program (there is no generic literal, is there?). In fact,
>> >> >> the already mentioned TC++SL has the following examples on page
>> >> >> 717:
>
>> >> >> std::locale loc;
>> >> >> char narrow[] = "0123456789";
>> >> >> wchar_t wide[10];
>
>> >> >> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow,
>> >> >> narrow+10, wide);
>
>> >> >I'm not sure anyone really knows what you should do, even in C,
>> >> >and even when charT is wchar_t. I think we're still experimenting
>> >> >some in this domain. (I know I am, anyway.)
>
>> >> >There are wide character literals: L"0123456789". The problem is
>> >> >how these are mapped may depend on the locale at compile time.
>
>> >> The above *does* depend on the locale loc. Are you alluding to some
>> >> other hidden problem?
>
>> >The code using the facet depends on the run-time locale; the
>> >conversion of wide character literals depends on the compile-time
>> >locale. That's a big difference if you are developing code in
>> >Italy, and it is to run in Japan.
>
>> >Practically, I don't think it will make a difference if you stick to
>> >the basic character set (although I'm just guessing).
>
>> Sticking to the basic character set, I think I don't understand you
>> correctly here: maybe you refer to the character mapping in phase 1?
>
>Yes. Suppose I work using ISO 8859-15, and enter an oe (single
>character) in my editor. This is code 0xBD. I transfer the file to
>you, and your environment is set up for ISO 8859-1, in which 0xBD is a
>decimal 1/2. Supposing some form of ISO 10646 or Unicode for wchar_t
>(which seems likely), I expect this character to be converted to
>'\u0153' in phase one; with your default locale, it will be converted to
>'\u00BD', which is something else entirely.
I see. Anyhow I wasn't thinking of a situation where you transfer the
source code to another compiler. Of course this adds another level of
complexity, but maybe it can be solved by translating the files in
advance, i.e. before submitting them to the new compiler.
Back to the main issue, my idea (likely erroneous) was that the
compiler doesn't use locales to translate source code characters. Or,
more exactly, it uses only the character encoding (which is usually
part of the locale). If I use for instance ISO Latin-1 then comparison
of a-with-grave accent and a-with-acute accent should always yield
false in compile-time because they have different codes (E0 and E1)
even though in some cases they would be "considered equal" in run-time
through the collate<> facet. Anyhow, given that the encoding is the
same, it doesn't matter if the compile-time locale is, say,
it_IT.ISO_latin1 or it_CH.ISO_latin1.
Of course the code at hand
std::locale loc;
char narrow[] = "0123456789";
wchar_t wide[10];
std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
wide);
would translate in this case between ISO_latin1 and whatever encoding
is part of loc.
In your example if you give me your executable and I have O.S.
run-time support for ISO 8859-15 I should be able to get the correct
output from the program. If you give me the sources and my compiler
doesn't support ISO 8859-15 I have to transcode first (assuming, of
course, that any character you use has a corresponding character in my
set). After that, assuming again that I have the correct run-time
support, the executable that I create this way should have the same
behavior of the one you gave me.
Anyhow I'm just guessing because I don't have any experience with
these sorts of things. I just wanted to be sure I understood you. I
suppose you were referring to the condition "assuming that any
character of the source set has a corresponding character in the
destination set", weren't you? Yes, I don't think I would manage to
fit hiragana and katakana into ISO Latin-1, even with transcoding! :-)
>And what should the poor compiler writer do? Currently, I think most
>simply use the current locale (at least under Unix), or perhaps even
>refuse code with such characters (forcing the locale to "C"). But it
>isn't hard to imagine a compiler option to specify the code. However,
>the problem is that the compiler should use one code for your .cc file,
>and a different code for my .h which your .cc includes. Other than some
>sort of pragma, or a naming convention, I don't know what a compiler can
>do. Off hand, some sort of hidden file in the directory in which the
>compiler finds the file seems like a good idea too.
Yes, I like the idea too.
>> Yes, I see a problem if, for instance, I will run the program on a
>> machine with a character set different from the machine where I
>> compiled the code, but is that an issue of "locales"? (My idea is that
>> character mapping is the substitution of a character in the text of
>> the program with an integer value; the value depends on the internal
>> encoding used by the compiler (EBCDIC, ISO Latin-1, etc.) but not,
>> strictly speaking, on the locale. I guess it may depend on the locale
>> only for characters not in the basic set. Where am I wrong?)
>
>The question is how the compiler interprets what it reads in phase 1.
>Does it use locale or not? And if not, how does it handle extended
>characters.
As I said, I was under the impression that it uses only the character
set, not the whole locale.
> [snip]
>> What has initially made me ask these questions is the code of the new
>> boost::dynamic_bitset. Let's consider, for instance, the function
>
>> template <typename Block, typename Allocator, typename CharT, typename
>> Alloc> void to_string(const dynamic_bitset<Block, Allocator>& b,
>> std::basic_string<CharT, Alloc>& s)
>> {
>> s.assign(b.size(), '0');
>> for (std::size_t i = 0; i < b.size(); ++i)
>> if (b.test(i))
>> s[b.size() - 1 - i] = '1';
>> }
>
>> The digits zero and one are certainly in the basic (execution) set.
>> But how should I represent them generically?
>
>Certainly not the way they've done it here. (Although they can be
>excused for the error. I doubt that there are many people who have any
>real experience with std::basic_string other than for char or wchar_t.)
Oh, I excuse them! :-) The issue is how to solve the problem. In short
we have a language-defined way to represent the digits 'one' and
> 'zero' both for char and wchar_t (the literals), but of course no
> corresponding representation for generic character types: it's up to
> the person who defined charT to know how to represent those digits as
charT instances. The only solution that comes to my mind is therefore
something like this:
// Not compiled. Expect syntax errors.
//
template <typename T>
struct digit_traits;
template <>
struct digit_traits<char> {
inline static char zero() { return '0'; }
inline static char one () { return '1'; }
};
template <>
struct digit_traits<wchar_t> {
inline static wchar_t zero() { return L'0'; }
inline static wchar_t one () { return L'1'; }
};
and implement "conversions" from and to basic_string as follows:
template <typename Block, typename Allocator, typename CharT, typename
Traits, typename Alloc>
void to_string(const dynamic_bitset<Block, Allocator>& b,
std::basic_string<CharT, Traits, Alloc>& s)
{
s.assign(b.size(), digit_traits<CharT>::zero());
for (std::size_t i = 0; i < b.size(); ++i)
if (b.test(i))
Traits::assign (s[b.size() - 1 - i],
digit_traits<CharT>::one());
}
template <typename charT, typename Traits, typename Alloc>
void from_string(const std::basic_string<charT, Traits, Alloc> & s)
{
reset();
size_type const len = s.length();
for (size_type i = 0; i < len; ++i) {
if (Traits::eq(s[len - i - 1], digit_traits<charT>::one()))
set_(i);
}
}
Of course this requires the user to specialize digit_traits for his
own charT type (presumably in the namespace boost). What about this
implementation?
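Purely as an illustration of that requirement (everything here is
hypothetical: it assumes digit_traits ends up in namespace boost, as
suggested, and that the user's charT is a POD carrying an ISO 10646 code
value):
namespace boost {
    template <typename T> struct digit_traits;    // primary template, as above
}
struct MyChar {                                   // hypothetical user-defined charT (a POD)
    unsigned long code;
};
namespace boost {
    template <>
    struct digit_traits<MyChar> {
        static MyChar zero() { MyChar c = { 0x30 }; return c; }   // U+0030 DIGIT ZERO
        static MyChar one () { MyChar c = { 0x31 }; return c; }   // U+0031 DIGIT ONE
    };
}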
>To begin with, '0' has type char, and there isn't the slightest reason
>to suppose that there is a conversion (implicit or explicit) from char
>to charT. Other than ctype::widen or codecvt::in, I don't think that
>there is any way of getting from char to charT.
>
>Generally, I think it should give a compiler error. Unless, maybe,
>you've got a very strange allocator, which typedef's size_type to char,
>and arranges to use char as a pointer as well, and the implementation
>has typedef'ed iterator to pointer -- in that case, both parameters to
>assign have the same type, and you end up in the templated function with
>two iterators.
Ah! Maybe you didn't notice that 'Alloc' is different from
'Allocator'. Anyhow, yes, there's a typo in the original code: in
practice it is not parameterized on the string's allocator but on its
char_traits. Why didn't the compiler warn about this? :-)
>Of course, there is absolutely nothing to guarantee that it works
>correctly even if it compiles. If my system uses EBCDIC for char, and
>ISO 10646 for wchar_t, the conversion to a wstring will result in a
>sequence of 'ð' and 'ñ': probably not what was intended either.
Yes. The code above should solve at least this problem.
Genny.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 22 Aug 2002 15:48:12 GMT Raw View
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<dvi5mu8gts64rq9uftkhk4iu7jiath04c6@4ax.com>...
> On 19 Aug 2002 18:00:09 GMT, kanze@gabi-soft.de (James Kanze) wrote:
> [snip]
> >> I meant: is it a problem that you can't use a built-in type?
> >Well, it is a lot more difficult. Since users expect to be able to
> >do things like '0' + digit, etc. (and the standard library may be one
> >of your users here), you probably have to overload all of the basic
> >operators. While std::string uses char_traits<>::eq, et al., I'm not
> >sure that this is required for e.g. std::num_get or std::num_put.
> Yes, but what built-in type would you use? If you chose, say, short
> int, how would you manage to output a character as a character and not
> as a number?
Excellent point. Given:
std::basic_ostream< short > dest ;
short x ;
I think that "dest << x" is ambiguous. On one hand, you have the
non-member functions operator<<( ..., charT ), and on the other, the
member function operator<<( short ).
> [...]
> >Yes. Suppose I work using ISO 8859-15, and enter an oe (single
> >character) in my editor. This is code 0xBD. I transfer the file to
> >you, and your environment is set up for ISO 8859-1, in which 0xBD is
> >a decimal 1/2. Supposing some form of ISO 10646 or Unicode for
> >wchar_t (which seems likely), I expect this character to be converted to
> >'\u0153' in phase one; with your default locale, it will be converted
> >to '\u00BD', which is something else entirely.
> I see. Anyhow I wasn't thinking of a situation where you transfer the
> source code to another compiler.
You probably don't have to transfer anything. Just change the value of
the environment variable LC_CTYPE (under Unix, anyway).
Now imagine that the value of this variable is something with 8859_1 in
the Window with the compiler, and 8859_15 in the Window with the editor.
You insert an oe in the editor, which is an alphabetic character, legal
in a variable name, for example. The compiler sees 1/2, which is not
legal in a variable name, and reports an error. Which, of course, you
cannot see in your editor.
The problem isn't trivial, and I know of no good solution. Basically,
in order to get things right, the compiler would have to know how the
editor displays characters, and this is impossible -- it varies from one
editor to the next, and for most editors, according to environment.
Under Unix, both LC_CTYPE and the selected font play a role, and both
can be changed on a Window by Window basis.
> Of course this adds another level of complexity, but maybe it can be
> solved by translating the files in advance, i.e. before submitting
> them to the new compiler.
That problem has always been with us. Even if you only use the basic
character set. If a machine uses EBCDIC, and your files are in ASCII,
code translation is necessary.
The use of two different machines in my example was misleading. The
problem isn't different machines, or different systems. The problem is
different environments.
> Back to the main issue, my idea (likely erroneous) was that the
> compiler doesn't use locales to translate source code characters.
The question is what a compiler should do. Most current compilers don't
do anything, and either barf or give random results at anything not
plain ASCII. The standard, however, says that they have to support full
ISO 10646. But it doesn't really say how.
In this case, if compilers have been slow to follow, it isn't so much a
problem of implementation (as with e.g. export); it is a problem of
knowing what to implement. The standard leaves the question rather
open, and there is little or no precedent to refer to for quality of
implementation.
A simple question: EDG and Comeau claim full ISO compliance (modulo
bugs, of course). What do they do?
> Or, more exactly, it uses only the character encoding (which is
> usually part of the locale).
For better or for worse (mainly for worse, I fear), the character
encoding is embedded in the ctype part of the locale.
> If I use for instance ISO Latin-1 then comparison of a-with-grave
> accent and a-with-acute accent should always yield false in
> compile-time because they have different codes (E0 and E1) even though
> in some cases they would be "considered equal" in run-time through the
> collate<> facet. Anyhow, given that the encoding is the same, it
> doesn't matter if the compile-time locale is, say, it_IT.ISO_latin1 or
> it_CH.ISO_latin1.
The question isn't collate. The problem is with codecvt, or
ctype::widen. Or whatever the compiler uses internally as an
equivalent. The locales it_IT.iso_8859_1 and fr_FR.iso_8859_1 should
give the same results, but fr_FR.iso_8859_1 and fr_FR.iso_8859_15 won't.
The gag is, of course, that for tools like editors to work properly, the
encoding suffix of the locale MUST correspond to the font encoding of
the fonts used. It's not unreasonable for a user to use different fonts
in an editor window than in a shell window (from which the compiler is
invoked), and if those fonts have different encodings, the user must
have set LC_CTYPE differently for the other software to work.
This is not some abstract problem. For historical reasons, in western
Europe, ISO 8859-1 has become the quasi-universal codeset. However,
several important characters for French are missing from it, and it
doesn't have the Euro symbol (which didn't exist when it was
standardized). For this reason, there is a shift to ISO 8859-15, which
corrects these problems. But things don't happen overnight, and there
are a number of fonts on my machines which only support 8859-1. The
result is that I use 8859-15 when I can, and 8859-1 when I have to,
because of a lack of support in the font. And even a simple ls will
display different filenames according to the window in which it is
invoked.
> Of course the code at hand
> std::locale loc;
> char narrow[] = "0123456789";
> wchar_t wide[10];
> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
> wide);
> would translate in this case between ISO_latin1 and whatever encoding
> is part of loc.
The narrow character encoding is *always* part of the locale. The wide
character encoding is supposed to be constant.
> In your example if you give me your executable and I have O.S.
> run-time support for ISO 8859-15 I should be able to get the correct
> output from the program.
If you set up your environment to use the same character encoding as I
do. Which may mean changing fonts. I have something like 40 font
families available for 8859-1, and about five for 8859-15. And if
things are going to work in a reasonable fashion, the character encoding
specified by the LC_CTYPE variable MUST be the same as that of the font
used for displaying.
> If you give me the sources and my compiler doesn't support ISO 8859-15
> I have to transcode first (assuming, of course, that any character you
> use has a corresponding character in my set). After that, assuming
> again that I have the correct run-time support, the executable that I
> create this way should have the same behavior of the one you gave me.
Back in the days when a machine only supported one encoding, this was
exactly how we proceeded. Today, the Sparc I am on has font support for
10 different ISO 8859 variants, plus three JIS variants, and quite a few
others. Not to mention UTF-8. It didn't use to be a problem to
transcode a file when transfering it from one machine to the next, but
what happens when two people look at the same file simultaneously in
different environments?
> Anyhow I'm just guessing because I don't have any experience with
> these sorts of things. I just wanted to be sure I understood you. I
> suppose you were referring to the condition "assuming that any
> character of the source set has a corresponding character in the
> destination set", weren't you? Yes, I don't think I would manage to
> fit hiragana and katakana into ISO Latin-1, even with transcoding! :-)
That's why the standard pretty much requires a compiler to use ISO 10646
internally:-). (What the standard actually requires is that the
implementation process UCN's correctly, see 2.1/1. By far the easiest
way to do this, however, is to simply encode everything as a single
character of at least 21 bits.)
[...]
> >The question is how the compiler interprets what it reads in phase 1.
> >Does it use locale or not? And if not, how does it handle extended
> >characters.
> As I said, I was under the impression that it uses only the character
> set, not the whole locale.
It uses only the character set, which is specified by the locale. It's
not the only thing specified by the locale, of course, but it is part of
the locale.
> > [snip]
> >> What has initially made me ask these questions is the code of the
> >> new boost::dynamic bitset. Let's consider, for instance, the
> >> function
> >> template <typename Block, typename Allocator, typename CharT, typename
> >> Alloc> void to_string(const dynamic_bitset<Block, Allocator>& b,
> >> std::basic_string<CharT, Alloc>& s)
> >> {
> >> s.assign(b.size(), '0');
> >> for (std::size_t i = 0; i < b.size(); ++i)
> >> if (b.test(i))
> >> s[b.size() - 1 - i] = '1';
> >> }
> >> The digits zero and one are certainly in the basic (execution) set.
> >> But how should I represent them generically?
> >Certainly not the way they've done it here. (Although they can be
> >excused for the error. I doubt that there are many people who have
> >any real experience with std::basic_string other than for char or
> >wchar_t.)
> Oh, I excuse them! :-) The issue is how to solve the problem. In short
> we have a language-defined way to represent the digits 'one' and
> 'zero' both for char and wchar_t (the literals), but of course no
> corresponding representation for generic character types: it's up to
> the person who defined charT to know how to represent those digits as
> charT instances.
std::ctype< CharT >::widen( '0' ) and std::ctype< CharT >::widen( '1' ).
It's obviously locale dependent, and there is the question of which
locale to use when converting to string. The answer, of course, is not
to convert to string. The standard formatting class is ostream, not
string, and an ostream has a locale which you can access. If the user
wants a string, that is what ostringstream is there for. And he can
also set up whichever locale he desires.
> The only solution that comes to my mind is therefore something like
> this:
> // Not compiled. Expect syntax errors.
> //
> template <typename T>
> struct digit_traits;
> template <>
> struct digit_traits<char> {
> inline static char zero() { return '0'; }
> inline static char one () { return '1'; }
> };
I considered the idea of traits, and rejected it. Templates resolve at
compile time, and the locale isn't known until runtime, and may change
dynamically.
> Of course this requires the user to specialize digit_traits for his
> own charT type (presumably in the namespace boost). What about this
> implementation?
Rejected, for the reason above.
In general, any conversion to string, implicit or explicit, is an
error. The correct way to handle this is to provide an operator<<.
(This would also allow a manipulator to select between the use of 0 and
1, and the use of the first character of numpunct::falsename() and
numpunct::truename() -- I can easily think of cases where I want a
string tfft...)
(More accurately: any conversion to string should be based on outputting
via operator<<. I can quite understand the desire to have something
simple, along the lines of boost::lexical_cast, even if it prevents the
use of all of the formatting possibilities.)
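A rough sketch of what such an inserter might look like (not code from the
post; it assumes a bitset-like class with size() and test(), and leaves out
the manipulator that switches to falsename()/truename()):
#include <cstddef>
#include <locale>
#include <ostream>
// In real code, Bits would be the concrete bitset class, not a free
// template parameter; it is left open here only to keep the sketch short.
template <typename CharT, typename Traits, typename Bits>
std::basic_ostream<CharT, Traits>&
operator<<(std::basic_ostream<CharT, Traits>& dest, Bits const& b)
{
    std::ctype<CharT> const& ct =
        std::use_facet<std::ctype<CharT> >(dest.getloc());
    CharT const zero = ct.widen('0');
    CharT const one  = ct.widen('1');
    for (std::size_t i = b.size(); i != 0; --i) {
        dest << (b.test(i - 1) ? one : zero);   // high bit first
    }
    return dest;
}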
> >To begin with, '0' has type char, and there isn't the slightest
> >reason to suppose that there is a conversion (implicit or explicit)
> >from char to charT. Other than ctype::widen or codecvt::in, I don't
> >think that there is any way of getting from char to charT.
> >Generally, I think it should give a compiler error. Unless, maybe,
> >you've got a very strange allocator, which typedef's size_type to char,
> >and arranges to use char as a pointer as well, and the implementation
> >has typedef'ed iterator to pointer -- in that case, both parameters to
> >assign have the same type, and you end up in the templated function with
> >two iterators.
> Ah! Maybe you didn't notice that 'Alloc' is different from
> 'Allocator'. Anyhow, yes, there's a typo in the original code: in
> practice it is not parameterized on the string's allocator but on its
> char_traits. Why didn't the compiler warn about this? :-)
Probably because no one has ever used any allocator except the standard
allocator:-). If Alloc and the string allocator are the same, the code
is fine.
> >Of course, there is absolutely nothing to guarantee that it works
> >correctly even if it compiles. If my system uses EBCDIC for char,
> >and ISO 10646 for wchar t, the conversion to a wstring will result in
> >a sequence of 'ð' and 'ñ': probably not what was intended either.
> Yes. The code above should solve at least this problem.
Painfully.
For toString, I would simply output to an ostringstream, and use str().
For the output, I would obtain the locale from the destination stream,
get its ctype facet, and use widen. Or toupper or tolower (according to
ios_base::uppercase) of the first character from falsename or truename
(according to ios_base::boolalpha?).
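And a minimal sketch of that toString, building on the inserter above (again
hypothetical, and it assumes that char_traits and the needed facets exist for
CharT):
#include <sstream>
#include <string>
template <typename CharT, typename Bits>
std::basic_string<CharT> toString(Bits const& b)
{
    std::basic_ostringstream<CharT> dest;
    // dest.imbue(someLocale);    // the caller could select the locale here
    dest << b;                    // uses the operator<< sketched above
    return dest.str();
}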
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.de (James Kanze)
Date: Mon, 12 Aug 2002 17:29:32 GMT Raw View
Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
news:<4j98luch9j63jdjfbef3u4jmdrgc3qh53r@4ax.com>...
> You wrote:
> >Gennaro Prota <gennaro_prota@yahoo.com> wrote
> >>Does this mean that the following:
> >> a) void f(charT c) {
> >> if ( c == 0)
> >> ...
> >> }
> >> b) void f(charT c) {
> >> if ( c == '\0')
> >> ...
> >> }
> >> c) void f(charT c) {
> >> if ( c == charT())
> >> ...
> >> }
> >> are always equivalent?
> >No. To begin with, there is no guarantee that == is defined on
> >charT, so there is no guarantee that any of the three expressions are
> >legal.
> Aaarggghhh!! This question is certainly a candidate to be the most
> stupid ever asked on comp.std.c++ :-(.
Not at all. It's easy to miss, and what is more natural than comparing
characters for equality?
> Actually I had in mind only
> char and wchar_t. The original question was whether, given that c is a
> wchar_t, the following are equivalent
> c==0, c=='\0', c==L'\0', c==wchar_t()
> I'm not sure. (BTW, the fact that a wchar_t can have padding bits is
> irrelevant, isn't it? I also think that value-initialization cannot
> result in a trap representation but, again, I'm not sure).
wchar_t is an integral type, and so must behave like one. That means
that 0 is 0, and that wchar_t() creates a temporary object initialized
with 0. How you write the 0 doesn't matter.
> 5.2.3/2 says wchar_t() is value-initialized (I'm also considering
> DR178). Then 8.5/5 says this means "the value of 0 (zero) converted to
> wchar_t". Since it talks about "conversion" must I assume it means the
> int 0 converted to wchar_t? If so, this brings me to 4.7/3 and there
> I'm lost.
The important text is in §3.9.1/5:
Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales. Type wchar_t shall have the same size,
signedness and alignment requirements as one of the other integral
types, called its underlying type.
And the fact that wchar_t is an integral type, see §3.9.1/7 (which also
requires integral types to use a binary representation). Given that it
is an integral type, and that 0 must be representable in it, §4.7 should
pose no problems.
> [snip]
> >Note too that while the standard says that you can use any POD type
> >you can't specialize char_traits or any of the classes you might need
> >in the locale section except for user defined types. You can manage
> >with std::basic_string by providing your own char_traits, rather than
> >specializing the existing one, but if you want to do any IO, you'll
> >need ctype and probably numpunct, and you can't replace those with
> >your own class (unless I've misunderstood something -- chapter 22 is
> >about the most difficult text to understand that I've ever
> >encountered).
> Is this a problem?
It is if you want to do IO on the type. File IO uses codecvt, and
numeric IO uses numpunct (indirectly, via num_get and num_put) and
codecvt (and ctype?). The standard library doesn't require any
non-specialized versions of these, so YOU have to provide them. And
about the only way I can think of to provide them is to specialize the
standard templates (but I'm not 100% sure about this), which you can
only do on a user defined type.
> >[...]
> >> - as far as I understand it I should use widen() even when a character
> >> is directly represented with a literal in the text of the program
> >> (there is no generic literal, is there?). In fact, the already
> >> mentioned TC++SL has the following examples on page 717:
> >> std::locale loc;
> >> char narrow[] = "0123456789";
> >> wchar_t wide[10];
> >> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
> >> wide);
> >I'm not sure anyone really knows what you should do, even in C, and
> >even when charT is wchar_t. I think we're still experimenting some
> >in this domain. (I know I am, anyway.)
> >There are wide character literals: L"0123456789". The problem is how
> >these are mapped may depend on the locale at compile time.
> The above *does* depend on the locale loc. Are you alluding to some
> other hidden problem?
The code using the facet depends on the run-time locale; the conversion
of wide character literals depends on the compile-time locale. That's a
big difference if you are developing code in Italy, and it is to run in
Japan.
Practically, I don't think it will make a difference if you stick to the
basic character set (although I'm just guessing). If you want to be
sure, however, any extended characters should be in UCN (except that a
lot of compilers don't support it yet).
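For instance (a hypothetical snippet, not from the post), an extended
character such as the oe ligature can be spelled as a UCN, which keeps the
source independent of whatever encoding the editor happens to use:
// LATIN SMALL LIGATURE OE and VULGAR FRACTION ONE HALF, written as UCNs
wchar_t const oe   = L'\u0153';
wchar_t const half = L'\u00BD';
wchar_t const word[] = L"\u0153uvre";   // "oeuvre", independent of source encoding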
> >> Anyhow I usually see code like this:
> >> template <class charT, class traits, class Allocator>
> >> void f(basic_string<charT, traits, Allocator> & str) {
> >> str.assign (32, 'A'); // <-- note the literal.
> >> // type, of course, is char
> >>
> >> }
> >> How can this be correct?
> >It isn't. It will work for std::string and std::wstring, but probably
> >not for any other instantiations (given the restrictions on
> >specializing templates in std::). And since charT must be a POD, it
> >can't have a constructor which takes a char, which might be used to
> >make it work.
> Yes. I needed something similar to the above in my current code
> (initializing a basic_string with all '0', i.e. the digit zero,
> characters). Of course, I never dreamed of writing str.assign(32, '0')
> but I didn't know what to put in place of '0'. So I consulted TC++SL
> and wrote:
> str.assign (32, use_facet<ctype<charT> >(loc).widen('A') );
> It looked quite unfamiliar however! :-) Thus, I went to inspect the
> code of a few libraries and I saw that nobody uses something like the
> above (which I'm not sure is correct either)
When I defined my ISO10646 character type (a struct), I provided it with
static members for the basic characters, e.g.:
static Character const CAPITAL_LETTER_A ;
and so on. (The code was generated automatically by a small AWK script
from the Unicode data files.) So I can write:
str.assign( 32, ISO10646::CAPITAL_LETTER_A ) ;
Not the most beautiful thing in the world, but it still beat your
version:-).
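A hypothetical reconstruction of what such a type might have looked like (the
real code isn't shown in the post; it was generated by the AWK script from
the Unicode data files):
struct ISO10646
{
    unsigned long code;                   // ISO 10646 code point; a POD, so no constructors
    static ISO10646 const CAPITAL_LETTER_A;
    static ISO10646 const DIGIT_ZERO;
    // ... one named constant per character of the basic character set
};
ISO10646 const ISO10646::CAPITAL_LETTER_A = { 0x41 };
ISO10646 const ISO10646::DIGIT_ZERO       = { 0x30 };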
> >> - question similar to the above for comparisons: I normally see things
> >> like this:
> >> template <class charT, class traits, class Allocator>
> >> bool g(basic_string<charT, traits, Allocator> const & str) {
> >> return traits::eq(str[0], 'a');
> >> }
> >> Is it ok?
> >No. Presumably, traits::eq takes two charT. Since you cannot
> >normally convert a char to a charT (even explicitly), the code will
> >only compile in special cases.
> What would be the correct way? To widen the char 'a' before
> comparison? Or are you saying that there's no correct way?
If you're stuck with narrow char's, you have to widen them. I'd try and
declare constants for all of the char's I could, however, and use them
when possible.
Note that your expression use_facet<ctype<charT> >(loc).widen('A') is
likely to result in a bad_cast exception if you haven't installed the
additional facet.
I had been playing with this before I got a new contract. Now, I don't
have time to continue, but my estimate is that it would take someone of
my level about six man months to create everything that is needed for a
fully standard conforming new character type. (This is doubtlessly due
to the fact that I don't know locale that well, but except for Nathan
Meyrs or Bill Plauger, who does?) Given that in six man months, I could
easily implement a new string class and an new corresponding io
hierarchy, I really wonder what all the templating is there for.
--
James Kanze mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Wed, 14 Aug 2002 21:29:14 GMT Raw View
On Mon, 12 Aug 2002 17:29:32 GMT, kanze@gabi-soft.de (James Kanze)
wrote:
>Gennaro Prota <gennaro_prota@yahoo.com> wrote in message
>news:<4j98luch9j63jdjfbef3u4jmdrgc3qh53r@4ax.com>...
>> The original question was whether, given that c is a
>> wchar_t, the following are equivalent
>
>> c==0, c=='\0', c==L'\0', c==wchar_t()
> [...]
>The important text is in §3.9.1/5:
>
> Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales. Type wchar_t shall have the same size,
> signedness and alignment requirements as one of the other integral
> types, called its underlying type.
>
>And the fact that wchar_t is an integral type, see §3.9.1/7 (which also
>requires integral types to use a binary representation). Given that it
>is an integral type, and that 0 must be representable in it, §4.7 should
>pose no problems.
>
Yes. I went through these too. What I was missing to 'close the
circle' was a direct guarantee that the value of L'\0' is zero. (I say
'direct' guarantee because an indirect guarantee exists: it follows
from the fact that wchar_t() and L'\0' must represent the same
character, and wchar_t() has the value zero). I think I've found it in
2.2/3.
>> [snip]
>> >Note too that while the standard says that you can use any POD type
>> >you can't specialize char_traits or any of the classes you might need
>> >in the locale section except for user defined types. [...]
>
>> Is this a problem?
>
>It is if you want to do IO on the type. File IO uses codecvt, and
>numeric IO uses numpunct (indirectly, via num_get and num_put) and
>codecvt (and ctype?). The standard library doesn't require any
>non-specialized versions of these, so YOU have to provide them. And
>about the only way I can think of to provide them is to specialize the
>standard templates (but I'm not 100% sure about this), which you can
>only do on a user defined type.
>
I meant: is it a problem that you can't use a built-in type?
>> >[...]
>
>> >> - as far as I understand it I should use widen() even when a character
>> >> is directly represented with a literal in the text of the program
>> >> (there is no generic literal, is there?). In fact, the already
>> >> mentioned TC++SL has the following examples on page 717:
>
>> >> std::locale loc;
>> >> char narrow[] = "0123456789";
>> >> wchar_t wide[10];
>
>> >> std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
>> >> wide);
>
>> >I'm not sure anyone really knows what you should do, even in C, and
>> >even when charT is wchar_t. I think we're still experimenting some
>> >in this domain. (I know I am, anyway.)
>
>> >There are wide character literals: L"0123456789". The problem is how
>> >these are mapped may depend on the locale at compile time.
>
>> The above *does* depend on the locale loc. Are you alluding to some
>> other hidden problem?
>
>The code using the facet depends on the run-time locale; the conversion
>of wide character literals depends on the compile-time locale. That's a
>big difference if you are developing code in Italy, and it is to run in
>Japan.
>
>Practically, I don't think it will make a difference if you stick to the
>basic character set (although I'm just guessing).
Sticking to the basic character set, I think I don't understand you
correctly here: maybe you refer to the character mapping in phase 1?
Yes, I see a problem if, for instance, I will run the program on a
machine with a character set different from the machine where I
compiled the code, but is that an issue of "locales"? (My idea is that
character mapping is the substitution of a character in the text of
the program with an integer value; the value depends on the internal
encoding used by the compiler (EBCDIC, ISO Latin-1, etc.) but not,
strictly speaking, on the locale. I guess it may depend on the locale
only for characters not in the basic set. Where am I wrong?)
>If you're stuck with narrow char's, you have to widen them. I'd try and
>declare constants for all of the char's I could, however, and use them
>when possible.
>
>Note that your expression use_facet<ctype<charT> >(loc).widen('A') is
>likely to result in a bad_cast exception if you haven't installed the
>additional facet.
What has initially made me ask these questions is the code of the new
boost::dynamic_bitset. Let's consider, for instance, the function
template <typename Block, typename Allocator, typename CharT, typename
Alloc> void to_string(const dynamic_bitset<Block, Allocator>& b,
std::basic_string<CharT, Alloc>& s)
{
s.assign(b.size(), '0');
for (std::size_t i = 0; i < b.size(); ++i)
if (b.test(i))
s[b.size() - 1 - i] = '1';
}
The digits zero and one are certainly in the basic (execution) set.
But how should I represent them generically?
Genny.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]