Thread

Topic: ASCII std::ctype::to_lower/std::ctype::to_upper

Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Sun, 8 Oct 2006 18:14:24 GMT

"James Kanze" <kanze.james@neuf.fr> wrote in message
news:1160311789.369695.54680@b28g2000cwb.googlegroups.com...

> .....
>> In practice, most (not all, RogueWave has a library you can
>> buy as an addon) C++ locale subsystems just pass their
>> operations to the underlying operating systems.
>
> Not under Unix, at least, since the underlying OS hasn't the
> slightest knowledge of what case is.  Many libraries do map the
> C++ functions to the C functions, in some way or another.
>
>> Some, like Dinkumware on Microsoft, don't do anything at all,
>> and if you want localization, you have to use something else
>> (perhaps the vastly superior I18N facilities that Microsoft
>> provides)
>
> That rather surprises me with regards to Dinkumware.

Because it's not true. (My earlier response was censored.)

>             Or is the
> underlying problem simply that Microsoft doesn't support
> different locales on their systems.  (Which also surprises me;
> there are very definitely tabs in the configuration to choose
> different locale characteristics, and when I'm logged into my
> Windows machine, it has different behavior than when my children
> log in.)

Microsoft supports a rich assortment of locales, all of which
are available within C++.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "Lance Diduck" <lancediduck@nyc.rr.com>
Date: Sun, 8 Oct 2006 17:45:49 CST

"P.J. Plauger" wrote:
> "James Kanze" <kanze.james@neuf.fr> wrote in message
> news:1160311789.369695.54680@b28g2000cwb.googlegroups.com...
>
> >
> > That rather surprises me with regards to Dinkumware.
>
> Because it's not true. (My earlier response was censored.)
Oops, I'm sorry - what I meant to say is that not all the
functionality is supported, for example message catalogs (maybe it is
now?). But yes, the internationalization support using DinkumWare on
Microsoft is excellent, as it is on IBM. It did come out as if I had
said otherwise.
What I meant to say is that locale support is very platform specific,
and making a portable solution usually requires addon libraries;
DinkumWare offers such a library as well. I don't know of a portable
solution for message catalogs, however -- it may exist, I just don't
know about it.
I tried rolling my own portable message catalog one time, and it is
very difficult.
> "James Kanze" <kanze.james@neuf.fr> wrote in message
>Not under Unix, at least, since the underlying OS hasn't the
>slightest knowledge of what case is.
On most UNIXs you can install locales different than C, at an OS level.
Indeed my understanding is that Mr Plauger designed the POSIX locales
mechanism used on UNIX (or at least the forerunner to it -- See the
book "Standard C Library"). These locales are installed as part of the
OS -- and it is this subsystem that is usually wrapped by the C++
stuff. Thereby in a manner the OS does understand case.
And if your UNIX has an event driven IO service, then you will find
that it is probably fully internationalized.

>And while I don't know what most languages now do, using Unicode
>isn't an answer to the upper/lower question, since Unicode
>recognizes three cases, not two.
The point that I failed to convey is that the Unicode Consortium is
on the case of answering the question fully. We, the C++ community as a
whole, are not equipped to tackle the question. "Caseness" should be
left up to them to define -- as well as other I18N issues. This is what
most languages nowadays do. Certainly C++ should be able to handle any
encoding, but the current strategy of "no encoding at all" just really
puts off most developers now approaching the language, esp. those not in
Western Europe or North America.

The byte-stream orientation of the IO is precisely why I18N is so
difficult in C and C++. In other IO models, where you get the entire
IO chunked at one time, it is far easier. When there is a standard
"C++ event IO" that competes with IOStreams, things will be
different. But for now, the std::string interface is tied to
this "one byte per character" model, so it is poorly equipped to
handle ANY encoding that requires something different. That leaves out
Unicode (in any form) and the majority of non-European or non-North
American encoding schemes.


Author: "Kristof Zelechovski" <giecrilj@stegny.2a.pl>
Date: Mon, 9 Oct 2006 09:19:11 CST

"P.J. Plauger" <pjp@dinkumware.com> wrote in message
news:Gqudnf_H7PPJqbTYnZ2dnUVZ_qSdnZ2d@giganews.com...
> "James Kanze" <kanze.james@neuf.fr> wrote in message
> news:1160311789.369695.54680@b28g2000cwb.googlegroups.com...
>
>> .....
>>> In practice, most (not all, RogueWave has a library you can
>>> buy as an addon) C++ locale subsystems just pass their
>>> operations to the underlying operating systems.
>>
>> Not under Unix, at least, since the underlying OS hasn't the
>> slightest knowledge of what case is.  Many libraries do map the
>> C++ functions to the C functions, in some way or another.
>>
>>> Some, like Dinkumware on Microsoft, don't do anything at all,
>>> and if you want localization, you have to use something else
>>> (perhaps the vastly superior I18N facilities that Microsoft
>>> provides)
>>
>> That rather surprises me with regards to Dinkumware.
>
> Because it's not true. (My earlier response was censored.)
>

Except for std::time_get, which is limited to a small set of automatically
recognized formats (Polish date formats are not recognized because we use
a full stop as the separator; additionally, there is a defect in the
Standard Library, which assumes that the order of date elements is the same
for the long and the short format, which has not been the case for Polish
since about 1980), and std::messages, which is implemented to always fail
(at least in the version bundled with Microsoft Visual C++).
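
One possible workaround for the full-stop date format, sketched under the assumption of a C++11 library that provides std::get_time (which did not exist when this thread was written). The helper name parse_dotted_date is hypothetical. It states the field order and the separator explicitly instead of relying on the formats that std::time_get recognizes on its own.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parse "DD.MM.YYYY" using an explicit format string.
// Returns false if the text does not match the format.
bool parse_dotted_date(const std::string& text, std::tm& out)
{
    std::istringstream in(text);
    in.imbue(std::locale::classic());   // only numeric fields, no month names
    out = std::tm();                    // zero-initialize all fields
    in >> std::get_time(&out, "%d.%m.%Y");
    return !in.fail();
}
```

Calling parse_dotted_date("09.10.2006", t) fills t.tm_mday, t.tm_mon (zero-based) and t.tm_year (years since 1900). The long and the short format would each need their own format string, which sidesteps the same-order assumption mentioned above.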

Chris



Author: "kanze" <kanze@gabi-soft.fr>
Date: Mon, 9 Oct 2006 11:19:30 CST

Lance Diduck wrote:
> "P.J. Plauger" wrote:
> > "James Kanze" <kanze.james@neuf.fr> wrote in message
> > news:1160311789.369695.54680@b28g2000cwb.googlegroups.com...

> > "James Kanze" <kanze.james@neuf.fr> wrote in message

> >Not under Unix, at least, since the underlying OS hasn't the
> >slightest knowledge of what case is.

> On most UNIXs you can install locales different than C, at an
> OS level.

On most Unixes (and on Windows), you can also install text
processors, bookkeeping programs, and a lot of other things.
That doesn't make them part of the OS.  About the only direct
support the operating system provides is creating a standard
place for the locales (albeit different from one system to the
next), and in some ways, that does more harm than good; I can't
install a locale for my own use without having root privileges.

Another thing that Unix does is make ISO C the official
language, so implicitly, you have any support that ISO requires.

And finally, there is X, if you consider it part of Unix (although
there are non-Unix systems using X, and Unix systems which don't have
it); X, of course, manages the fonts used to display text in
windows (which also has a lot to do with locales, even if the
standards never mention it).

> Indeed my understanding is that Mr Plauger designed the POSIX
> locales mechanism used on UNIX (or at least the forerunner to
> it -- See the book "Standard C Library").

Not Posix, but Open Systems.  Or X/Open, as it was called back
then.  But not everything defined by X/Open can be considered as
directly OS related.

> These locales are installed as part of the OS -- and it is
> this subsystem that is usually wrapped by the C++ stuff.
> Thereby in a manner the OS does understand case.

Yes and no.  The locale stuff is pretty much layered on top of
Unix per se.  Unix provides the file system, for example, but
Unix file systems don't worry about whether the filename is ISO
8859-1, or UTF-8.  I've actually tried it: do an ls on a
directory with accented characters, in two different windows,
using a different font encoding in each window, and the names of
the files change.

> And if your UNIX has an event driven IO service, then you will
> find that it is probably fully internationalized.

In what way?  (I'm not familiar with event driven IO, but I just
can't see Unix dealing with anything more than byte sequences,
and a few special characters, e.g. '/' in file names.)

> >And while I don't know what most languages now do, using
> >Unicode isn't an answer to the upper/lower question, since
> >Unicode recognizes three cases, not two.

> The point that I failed to convey is that the Unicode
> Consortium is on the case of answering the question fully.

They're trying to address it, yes.  And they're one of the best
sources of information around.

> We, the C++ community as a whole, are not equipped to tackle
> the question.

I think we are, as much as any other language.

That individual programmers, or in some cases, compilers, aren't
is another issue.

> "Caseness" should be left up to them to define -- as
> well as other I18N issues. This is what most languages
> nowadays do.

You keep saying "most languages", but I don't know of any which
do more than C++.

> Certainly C++ should be able to handle any encoding, but the
> current strategy of "no encoding at all" just really puts off
> most developers now approaching the language, esp those not in
> Western Europe or North America.

> The byte stream oriented-ness of the IO is precisely why I18N
> is so difficult in C and C++.

I'd say that the byte stream oriented-ness is what makes UTF-8
support possible.  At any rate, I don't think you can avoid it.
The external world is byte oriented.  Text coming in over the
Internet is always in the form of a byte stream.  I don't
understand what you want to do differently.
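
A minimal sketch of why a plain byte stream is enough to carry UTF-8: every continuation byte has the bit pattern 10xxxxxx, so code-point boundaries can be found from the bytes alone. The function name count_code_points is hypothetical, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 byte sequence by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80)   // lead byte or ASCII byte
            ++n;
    }
    return n;
}
```

For the five bytes "\xC3\xA9t\xC3\xA9" (the French word for "summer" in UTF-8), std::string::size() reports 5 while count_code_points reports 3. The string class stores the bytes without trouble; what it lacks is any knowledge of the encoding.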

> In other IO models, whereby you get the entire IO chunked at
> one time, then it is far easier.

Which other IO models?  The other models I know are much more
restrictive, and almost none work with multibyte encodings.

> When there is a standard "C++ event IO" standard that competes
> with the IOStreams, then things will be different. But for now,
> the std::string interface is tied into this "one byte per
> character" model, so it is poorly equipped to handle ANY
> encoding that requires something different.

The "intent" of the current standard is that you use wstring
when I18n is required, and that while it is only a QoI issue, it
is expected that this be Unicode (probably UTF-32), so that
multi-byte encodings are not necessary.

> That leaves out Unicode (in any form) and the majority of any
> non-European or non-North American encoding schemes.

What's the problem with wstring?  If that's what you need.

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: Mon, 9 Oct 2006 11:49:29 CST

"Lance Diduck" <lancediduck@nyc.rr.com> wrote in message
news:1160347012.888857.300680@i42g2000cwa.googlegroups.com...

> "P.J. Plauger" wrote:
>> "James Kanze" <kanze.james@neuf.fr> wrote in message
>> news:1160311789.369695.54680@b28g2000cwb.googlegroups.com...
>>
>> >
>> > That rather surprises me with regards to Dinkumware.
>>
>> Because it's not true. (My earlier response was censored.)
>
> Oops, I'm sorry - what I meant to say is that not all the
> functionality is supported, for example message catalogs (maybe it is
> now?).

No, it's still just rudimentary. I've yet to get guidance on
how to do a more thorough job (that people will actually use).

>       But yes the internationalization support using DinkumWare on
> Microsoft is excellent. As well as IBM. But it did come out that I said
> otherwise.
> What I meant to say is that locale support is very platform specific,
> and making a portable solution usually requires addon libraries, and
> DinkumWare as well offers such a library. I don't know of a portable
> solution for message catalogs however -- it may exist, I just don't know
> about it.

I don't either.

> I tried rolling my own portable message catalog one time, and it is
> very difficult.

Agreed.

>> "James Kanze" <kanze.james@neuf.fr> wrote in message
>>Not under Unix, at least, since the underlying OS hasn't the
>>slightest knowledge of what case is.
>
> On most UNIXs you can install locales different than C, at an OS level.
> Indeed my understanding is that Mr Plauger designed the POSIX locales
> mechanism used on UNIX (or at least the forerunner to it -- See the
> book "Standard C Library").

No, it was a parallel invention. I wish I had been aware of the
Posix work when I wrote my book -- it would have better matched
the Posix spec.

>                            These locales are installed as part of the
> OS -- and it is this subsystem that is usually wrapped by the C++
> stuff. Thereby in a manner the OS does understand case.
> And if your UNIX has an event driven IO service, then you will find
> that it is probably fully internationalized.
>
>>And while I don't know what most languages now do, using Unicode
>>isn't an answer to the upper/lower question, since Unicode
>>recognizes three cases, not two.
>
> The point that I failed to convey is that the Unicode Consortium is
> on the case of answering the question fully. We, the C++ community as a
> whole, are not equipped to tackle the question. "Caseness" should be
> left up to them to define -- as well as other I18N issues. This is what
> most languages nowadays do. Certainly C++ should be able to handle any
> encoding, but the current strategy of "no encoding at all" just really
> puts off most developers now approaching the language, esp those not in
> Western Europe or North America.

Both C and C++ have recently adopted more Unicode-specific
mechanisms.

> The byte stream oriented-ness of the IO is precisely why I18N is so
> difficult in C and C++. In other IO models, whereby you get the entire
> IO chunked at one time, then it is far easier. When there is a standard
> "C++ event IO" standard that competes with the IOStreams, then things
> will be different.

But even with Unicode, there are *lots* of different file encodings
you might choose. At least the C++ machinery, for all its complexity,
gives you a wide set of choices. See the codecvt facets documented
in our online reference at our web site.
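
As an illustration of the codecvt machinery, here is a minimal sketch that converts external bytes to wide characters through whatever std::codecvt<wchar_t, char, std::mbstate_t> facet a given locale carries. The function name widen_with_codecvt is hypothetical, and the sketch only uses the standard facet interface, not any vendor-specific facet.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts external bytes to wide characters using the codecvt facet of
// the given locale. Which encoding that means depends on the locale.
std::wstring widen_with_codecvt(const std::string& bytes,
                                const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    if (bytes.empty())
        return std::wstring();

    const Facet& cvt = std::use_facet<Facet>(loc);
    std::mbstate_t state = std::mbstate_t();
    std::wstring out(bytes.size(), L'\0');   // never more wide chars than bytes

    const char* from_next = 0;
    wchar_t* to_next = 0;
    Facet::result r = cvt.in(state,
                             bytes.data(), bytes.data() + bytes.size(), from_next,
                             &out[0], &out[0] + out.size(), to_next);
    if (r == Facet::error)
        throw std::runtime_error("byte sequence not valid in this locale");
    if (r == Facet::noconv)                  // facet declined to convert
        return std::wstring(bytes.begin(), bytes.end());

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Swapping in a locale with a different codecvt facet changes the file encoding without touching the calling code, which is the flexibility referred to above. A stream does the same thing internally when a locale is imbued on a wide file stream.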

>                   But for now, the std::string interface is tied into
> this "one byte per character"  model, so it is poorly equipped to
> handle ANY encoding that requires something different. That leaves out
> Unicode (in any form) and the majority of any non-European or non-North
> American encoding schemes.

Well, there is also wstring.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com



Author: "kanze" <kanze@gabi-soft.fr>
Date: Thu, 5 Oct 2006 09:49:13 CST

Jean-Marc Bourguet wrote:
> "kanze" <kanze@gabi-soft.fr> writes:

> > Under Solaris, the "C" locale does correspond to strict
> > ASCII, and declares all characters outside the range 0...127
> > invalid.  Which, of course, doesn't correspond at all to the
> > reality outside of the program.  Which doesn't depend on the
> > locale, but specific font which is active, and its encoding.
> > But the standard certainly doesn't require such,

> Well, a quick reading of POSIX makes me think that POSIX does.
> And Solaris aims for POSIX compliance.

Does Posix require ASCII?  I wasn't aware of it, but it wouldn't
surprise me.

What is certain is that C does specify, rather strictly, the
behavior of the functions in <ctype.h>, in the "C" locale.  And
although the C++ standard says nothing, as far as I can see, I
think it would be rather disconcerting if "islower( ch )" gave
different results than "std::islower( ch, std::locale() )", at
least if one of the pre-defined locales was the global locale
(and supposing that both C and C++ have the same global locale
set).  From a quality of implementation point of view, I do
expect that the C++ ctype functions behave in "C" locale exactly
as the C ones.
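
The expectation can be checked on a given implementation with a small sketch like the following. The function name islower_agrees_for_ascii is hypothetical, and the check is restricted to the codes 0..127, where the C standard pins down the "C" locale behavior.

```cpp
#include <cctype>
#include <locale>

// Returns true if the C function and the C++ locale-based function agree
// on islower for every ASCII code (0..127). The C++ side uses the classic
// "C" locale; the C side uses whatever setlocale last selected, which is
// "C" at program start.
bool islower_agrees_for_ascii()
{
    const std::locale& c_loc = std::locale::classic();
    for (int code = 0; code < 128; ++code) {
        bool c_result   = std::islower(code) != 0;                      // <cctype>
        bool cpp_result = std::islower(static_cast<char>(code), c_loc); // <locale>
        if (c_result != cpp_result)
            return false;
    }
    return true;
}
```

Outside 0..127 the C function needs its argument converted to unsigned char first, since passing a negative char value is undefined behavior.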

None of which really solves the problem I raised, of course.  I
don't know what, if anything, Posix says about it, but the
Solaris systems I have access to don't normally use ASCII as the
font encoding; it's more often ISO 8859-1 or ISO 8859-15.  You
can pretend that they're ASCII, but display invalid characters
in a funny way.  But when I output a character to the screen,
and see 'é', I normally expect islower to return true.  And if
I've started the xterm with an option to use Zapf Dingbats,
presumably, islower should return false for all characters:-).

Except that that last statement isn't really a joke.  It touches
the crux of the problem: islower presumably says something about
how the character is perceived, but there is in fact no possible
way for it to know.  If we're careful, and ensure that our
environment is coherent, it works.  But that carefulness is out
of the hands of the programmer; you have to count on your users
here.  (And sometimes, it's even out of their hands.  I've got
two printers on my machine at home, one of which uses ISO
8859-1, and the other ISO 8859-15, and the only way I can make
things coherent is by avoiding characters that aren't common to
both.)
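
A minimal sketch of the dependency described above: the same byte is classified differently depending on which locale the program is told to use, and nothing ties that locale to what the display actually shows. The function name byte_is_lower is hypothetical; locale names other than "C" are platform specific and may not be installed.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Classifies a single byte as lower case under the named locale.
// Falls back to the classic "C" locale if the name is not installed
// (std::locale's constructor throws std::runtime_error in that case).
bool byte_is_lower(unsigned char byte, const std::string& locale_name)
{
    std::locale loc = std::locale::classic();
    try {
        loc = std::locale(locale_name.c_str());
    } catch (const std::runtime_error&) {
        // Locale not available on this system; keep "C".
    }
    return std::islower(static_cast<char>(byte), loc);
}
```

Byte 0xE9 is the letter e with an acute accent in ISO 8859-1. With the "C" locale, byte_is_lower(0xE9, "C") is expected to be false on the usual implementations, while a French ISO 8859-1 locale, if one is installed under some platform-specific name, would be expected to report true.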

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Author: "kanze" <kanze@gabi-soft.fr>
Date: Thu, 5 Oct 2006 10:08:41 CST

Alberto Ganesh Barbati wrote:
> Kristof Zelechovski wrote:
> > "kanze" <kanze@gabi-soft.fr> wrote in message
> > news:1159867506.222269.192330@h48g2000cwc.googlegroups.com...
> >> The specification for the functions seems clear enough.  In
> >> practice, of course, they're not very useful, since upper
> >> to lower mapping (or vice versa) is not a bijection in real
> >> life, at least not for any character set useful for human
> >> languages.  Of course, if you are using them for case
> >> insensitive comparison of key words, and your key words
> >> only use characters in the basic character set, they work
> >> fine.  But don't expect to use them on languages like
> >> French or German and get anything useful.

> > German has the ß anomaly, but what about French?

> For example, some typographic legacy rules in France state
> that accents are omitted on capital words. According to those
> rules, the uppercase of the letter à (U+00E0) would be A and
> not À (U+00C0), thus the mapping would not be a bijection.
> Although the trend is changing in favor of the more modern
> "bijective" conversion, there may be locales implementing
> these legacy rules.

No longer really about C++, but...

The legacy is to keep the accents on capitals.  It's a long-standing
tradition, still adhered to by conservative typesetting
organizations, such as the Imprimerie nationale.  You
can't get accents on capitals with a French typewriter keyboard,
however, and the tendency over the last, say, 50 years has been
to not use them.

In the computer world, those who (like myself) use TeX or LaTeX,
generally use them; those who use Word don't.  (I'm tempted to
say that those who care about the quality of their output use
them, those who don't give a shit don't.  But that would reveal
my personal bias---both with regards to accents on capitals, and
with regards to typesetting tools:-).)

And as you correctly say, any particular place you work might
impose one rule or the other, so you have to be able to handle
both (presumably with a different locale---although the Posix
locale naming conventions don't foresee such a case).
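
One way to get "a different locale" for the no-accents rule is to replace the ctype facet. The following is a minimal sketch, assuming ISO 8859-1 text and handling only a-grave (0xE0); the class name unaccented_capitals and the function make_unaccented_locale are hypothetical.

```cpp
#include <locale>

// Hypothetical facet for ISO 8859-1 text that drops the accent when
// upper-casing a-grave (0xE0), as under the "no accents on capitals"
// rule. Everything else is delegated to the default std::ctype<char>.
class unaccented_capitals : public std::ctype<char>
{
protected:
    using std::ctype<char>::do_toupper;   // keep the range overload visible

    char do_toupper(char c) const override
    {
        if (static_cast<unsigned char>(c) == 0xE0)   // a-grave in ISO 8859-1
            return 'A';
        return std::ctype<char>::do_toupper(c);
    }
};

// Builds a locale that is "C" except for the replaced ctype facet.
// The locale takes ownership of the facet and deletes it when done.
inline std::locale make_unaccented_locale()
{
    return std::locale(std::locale::classic(), new unaccented_capitals);
}
```

With this locale, std::toupper applied to byte 0xE0 and to 'a' both yield 'A', so the mapping is visibly not a bijection. A complete facet would also cover the other accented letters and the range overload of do_toupper.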

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Author: "kanze" <kanze@gabi-soft.fr>
Date: Thu, 5 Oct 2006 10:09:04 CST

Alberto Ganesh Barbati wrote:
> kanze wrote:
> > Alberto Ganesh Barbati wrote:

> >> In the "C" locale, to_lower, resp. to_upper, return its
> >> argument for all characters except the 26 latin uppercase,
> >> resp. lowercase, letters, as described in 7.4.2 of the C
> >> Standard (not the C++ standard!). Otherwise it's locale
> >> dependent.

> > Yes.  For C, and not for C++.  (In practice, however, I can't
> > imagine an implementation daring to have the two be different.)

> I thought this part of the C library was included in the C++
> library without latitude for differences. In particular,
> functions to_lower and to_upper (from <cctype>) should have
> the well-defined meaning I reported above. Moreover 22.1.1.5/5
> seems to guarantee that the ctype facets provided by
> std::locale("C") should also behave the same way. Am I missing
> something?

I suppose if you interpret the word "classic" in a certain way.
I don't see anything concerning std::locale that makes any
direct reference to what the C library might do, except that if
you set a named locale, the library is required to call the C
function setlocale with the same name.

From a QoI point of view, of course, yes.  I expect two locales
with the same name (including the one named "C") to have exactly
the same behavior in both libraries, both in the functions in
<ctype.h>/<cctype> and in those in <locale>.  I don't find such
a requirement in the standard, however, and in the past, I've
had more than a few cases where it wasn't true, at least with
g++.  (But I suspect that the authors of g++ considered those
cases a bug, or rather, a "not yet fully implemented" feature.)
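
Whether the classic locale's facet changes exactly the 26 Latin letters can be checked on a given library with a sketch like this one. The function name count_lowered_ascii is hypothetical, and the check covers only the codes 0..127.

```cpp
#include <locale>

// Counts the ASCII codes (0..127) that the classic locale's ctype<char>
// facet changes when converting to lower case.
int count_lowered_ascii()
{
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    int changed = 0;
    for (int code = 0; code < 128; ++code) {
        char c = static_cast<char>(code);
        if (ct.tolower(c) != c)
            ++changed;
    }
    return changed;
}
```

A library that follows the C rules for the "C" locale reports 26 here. Any other value would be an instance of the C and C++ sides disagreeing.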

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



Author: jm@bourguet.org (Jean-Marc Bourguet)
Date: Thu, 5 Oct 2006 19:21:42 GMT

"kanze" <kanze@gabi-soft.fr> writes:

> Jean-Marc Bourguet wrote:
>> "kanze" <kanze@gabi-soft.fr> writes:
>
>> > Under Solaris, the "C" locale does correspond to strict
>> > ASCII, and declares all characters outside the range 0...127
>> > invalid.  Which, of course, doesn't correspond at all to the
>> > reality outside of the program.  Which doesn't depend on the
>> > locale, but specific font which is active, and its encoding.
>> > But the standard certainly doesn't require such,
>
>> Well, a quick reading of POSIX makes me think that POSIX does.
>> And Solaris aims for POSIX compliance.
>
> Does Posix require ASCII?  I wasn't aware of it, but it wouldn't
> surprise me.

I don't know that standard well enough to be authoritative.  I think that
for all practical purposes it does.  From my understanding, the characters
which have to be present in the POSIX locale are those from ASCII, and they
must collate in the ASCII order, but I'm not sure whether using an EBCDIC
code page containing the required characters and ensuring the correct
collation order for strcoll is valid (i.e. I'm not sure that strcmp must
give the same result as strcoll in the POSIX locale).
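
The open question can at least be probed empirically on a given system. Below is a minimal sketch; the function names sign and collation_matches_byte_order are hypothetical, and a passing check on one platform says nothing about what the specification requires.

```cpp
#include <clocale>   // std::setlocale, for the caller
#include <cstring>

// Reduces a comparison result to -1, 0 or +1.
inline int sign(int v) { return (v > 0) - (v < 0); }

// Returns true if strcoll and strcmp order the two strings the same way
// under the currently selected C locale.
bool collation_matches_byte_order(const char* a, const char* b)
{
    return sign(std::strcoll(a, b)) == sign(std::strcmp(a, b));
}
```

A caller would first select the locale with std::setlocale(LC_COLLATE, "C") or "POSIX" and then compare pairs such as "a" and "B", where byte order and a dictionary-style collation would differ.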

Yours,

--
Jean-Marc


Author: "Lance Diduck" <lancediduck@nyc.rr.com>
Date: Sat, 7 Oct 2006 15:35:48 CST

kanze wrote:
>But when I output a character to the screen,
> and see 'é', I normally expect islower to return true.  And if
> I've started the xterm with an option to use Zapf Dingbats,
> presumably, islower should return false for all characters:-).

This is why the standard does not -- and cannot -- make a rule for how
ctype "lower" and "upper" functionality should work. "Case" is more
about I/O than about how the language operates.
All the standard could address is the 93 (or so) ASCII characters it
uses to represent the identifiers and other syntactical parts of the
language itself.
Most languages now just say "we are using Unicode internally," and the
Unicode Consortium maintains the requirements for what operations such
as "upper" and "lower" should do. But consider what "lower" and "upper"
would do with scripts such as Arabic, which has no block letters at
all, and everything is cursive. What would ctype do now?

In practice, most (not all, RogueWave has a library you can buy as an
addon) C++ locale subsystems just pass their operations to the
underlying operating systems. Some, like Dinkumware on Microsoft, don't
do anything at all, and if you want localization, you have to use
something else (perhaps the vastly superior I18N facilities that
Microsoft provides)

I18N and L10N support in standard C++ is downright awful. The C++
community in 2006, as this thread attests, STILL tries to give each
individual byte brains enough to decide how it should be displayed on a
character mode terminal, as if the entire universe still used TTY
devices. The rest of the world has long abandoned this approach, and
the smallest handhelds have graphical IO devices. You cannot make each
individual char (or even wchar_t) smart enough to handle its own
formatting, even on a character mode display. The "string" is the
smallest unit that can possibly do this, and even then this string has
to be aware of the encoding used, and in sync with the IO device.
However, C++ views "string" as "array of bytes," where "character" and
"byte" are synonymous. In almost any other language, a "character" is a
"string" of length 1, irrespective of how many bytes actually make up
that string -- and that is determined by the encoding used.
This is why std::string cannot process Unicode in any meaningful way.
Nor can IOStreams, which still sees the world in the same way Doug
McIlroy saw the world in 1964, with all individual programs communicating
with each other via pipes and streams of ASCII text. This approach
works great for programming chores, but that is the extent of it.


Author: "James Kanze" <kanze.james@neuf.fr>
Date: Sun, 8 Oct 2006 10:33:14 CST

Lance Diduck wrote:
> kanze wrote:
> >But when I output a character to the screen,
> > and see 'é', I normally expect islower to return true.  And if
> > I've started the xterm with an option to use Zapf Dingbats,
> > presumably, islower should return false for all characters:-).

> This is why the standard does not -- and cannot -- make a rule
> for how ctype "lower" and "upper" functionality should work.
> "Case" is more about I/O rather than how the language
> operates.

Case is more than just I/O.  But it doesn't really have a
significant meaning for isolated characters, and it is at least
partially related to the graphic representation.  (Should the
lower case characters in a small caps font be considered "lower
case"?)

About all one can say for certain is that the problem isn't
simple.

> All the standard could address is the 93 (or so) ASCII
> characters it uses to represent the identifiers and other
> syntactical parts of the language itself.

> Most languages now just say "we are using Unicode internally,"
> and the Unicode consortium maintains the requirements for what
> operations such as "upper" and "lower" should do. But consider
> what "lower" and "upper" would do with scripts such as Arabic,
> which has no block letters at all, and everything is cursive.
> What would ctype do now?

From a certain point of view, Arabic has four cases.  Except
that they aren't called cases.

And while I don't know what most languages now do, using Unicode
isn't an answer to the upper/lower question, since Unicode
recognizes three cases, not two.  (Unicode is also very explicit
that case translation must take place at the string level, not
at the character level, and that it isn't a bijection.)
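
A minimal sketch of what string-level conversion means in practice, assuming a C++11 compiler for std::u32string. The function name upper_sketch is hypothetical, and it handles only ASCII letters and U+00DF (sharp s), whose full uppercase mapping in Unicode is the two letters "SS"; a real implementation would follow the Unicode case-mapping tables.

```cpp
#include <string>

// Illustrative only: upper-cases ASCII letters and expands U+00DF
// (sharp s) to "SS". The result can be longer than the input, which a
// character-to-character function such as toupper cannot express.
std::u32string upper_sketch(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00DF')
            out += U"SS";
        else if (c >= U'a' && c <= U'z')
            out += static_cast<char32_t>(c - U'a' + U'A');
        else
            out += c;
    }
    return out;
}
```

Both U"stra\u00DFe" and U"strasse" come out as U"STRASSE", so the mapping cannot be inverted. The third case that Unicode recognizes is titlecase, used for example by U+01C5, which sits between the uppercase U+01C4 and the lowercase U+01C6.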

> In practice, most (not all, RogueWave has a library you can
> buy as an addon) C++ locale subsystems just pass their
> operations to the underlying operating systems.

Not under Unix, at least, since the underlying OS hasn't the
slightest knowledge of what case is.  Many libraries do map the
C++ functions to the C functions, in some way or another.

> Some, like Dinkumware on Microsoft, don't do anything at all,
> and if you want localization, you have to use something else
> (perhaps the vastly superior I18N facilities that Microsoft
> provides)

That rather surprises me with regards to Dinkumware.  Or is the
underlying problem simply that Microsoft doesn't support
different locales on their systems?  (Which also surprises me;
there are very definitely tabs in the configuration to choose
different locale characteristics, and when I'm logged into my
Windows machine, it has different behavior than when my children
log in.)

> I18N and L10N support in standard C++ is downright awful.

The locale library is pretty incomprehensible, I agree.  On the
other hand, I don't know of any other language which has more
comprehensive support.

> The C++ community in 2006, as this thread attests, STILL tries
> to give each individual byte brains enough to decide how it
> should be displayed on a character mode terminal, as if the
> entire universe still used TTY devices.

I don't think so.  It does recognize that regardless of what you
do in your program, the external world is very byte oriented.

> The rest of the world has long abandoned this approach, and
> the smallest handhelds have graphical IO devices. You cannot
> make each individual char (or even wchar_t) smart enough to
> handle its own formatting, even on a character mode display.
> The "string" is the smallest unit that can possibly do this,
> and even then this string has to be aware of the encoding
> used, and in sync with the IO device.

And what language supports that?  And how?  I typically write my
bytes to a file; how can the system, at any level, possibly know
anything about the encoding used on the device I will ultimately
use to view the file?  For that matter, it isn't rare for me to
view it on two different devices, a terminal window, and if the
data turns out to be too complex, a printer.  And those devices
don't always, or even usually, support the same character
encoding.

> However, C++ views "string" as "array of bytes," where "character" and
> "byte" are synonymous. In almost any other language, a "character" is a
> "string" of length 1, irrespective of how many bytes actually make up
> that string -- and that is determined by the encoding used.

Again, what languages are these?  Certainly not Java, nor Ada.

> This is why std::string cannot process Unicode in any meaningful way.

The reason std::string cannot process Unicode is, fundamentally,
because it doesn't know whether it is Unicode or not.  On the
systems I usually work on, Unicode isn't supported at the system
level.
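To illustrate the point, here is a small sketch in which the caller,
not std::string, supplies the knowledge of the encoding. The helper
name utf8_length is hypothetical, and it assumes the bytes are
well-formed UTF-8; std::string itself only counts bytes.

```cpp
#include <cstddef>
#include <string>

// Counts code points, ASSUMING the bytes are well-formed UTF-8.
// std::string::size() would report the number of bytes instead.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)  // skip continuation bytes (10xxxxxx)
            ++n;
    }
    return n;
}
```

The same bytes interpreted as ISO 8859-1 would give a different count,
which is exactly the information the string does not carry.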

> Nor can IOStreams, which still sees the world in the same way
> Doug McIlroy saw the world in 1964, with all individual
> programs communicating with each other via pipes and streams
> of ASCII text. This approach works great for programming
> chores, but that is the extent of it.

I don't think you quite understand the impact of locales in C++.

--
James


Author: "kanze" <kanze@gabi-soft.fr>
Date: Tue, 3 Oct 2006 08:57:17 CST Raw View

abadura wrote:
> I write code which implements ASCII (and other codes as well) to be
> usable with the STD.

>    Naturally I read the Standard (the appropriate part) a few times, and
> while writing I still consult the text. But I came across a strange
> thing which I cannot decide what to do with... Suppose I have a
> derived class of std::ctype<char> or anything like this to be used for
> ASCII code.
>    ASCII uses only 128 values while the "char" type seems to have to be (I
> could not find it as well) at least 8 bits long. So it can store at least
> 256 values. And even if it does not have to, it can, and I think on most
> if not all PCs it actually has 8 bits.

char is required to be at least 8 bits.  In practice, 8, 9 and
32 bits seem to be frequent.

>    The question is what should the to_lower or to_upper function return for
> values larger than 127, which are not valid ASCII values?

By definition, to_lower returns the value passed into it unless
is_upper is true.  And since is_upper can only be true for a
valid character code, that should answer your question.
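A minimal sketch of that rule, checked against the classic locale. Note
that the actual members of std::ctype are spelled is(), tolower() and
toupper(); the function name tolower_is_identity_off_upper is
hypothetical, and the loop assumes an 8-bit char.

```cpp
#include <locale>

// Returns true if, in the classic ("C") locale, tolower() leaves every
// character unchanged unless the facet classifies it as upper case.
bool tolower_is_identity_off_upper() {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    for (int i = 0; i < 256; ++i) {  // assumes 8-bit char
        const char c = static_cast<char>(static_cast<unsigned char>(i));
        if (!ct.is(std::ctype_base::upper, c) && ct.tolower(c) != c)
            return false;
    }
    return true;
}
```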

> Should it return the passed value just as it does for any
> non-character? Or what?

>    The "type" (masks like "printable", "upper", "digit") of a character (as
> I checked in Cygwin and MS VS .NET 2005) in the "classic locale" for
> characters not in the [0;127] range is "0", which means that they have no
> attribute at all (not even "ctrl").

That's a characteristic of invalid character encodings.  Perhaps
ctype should add a mask, valid, which would be the OR of all
existing masks.
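Such a mask can already be approximated in user code. In the sketch
below, valid_mask and is_valid_char are hypothetical names; the only
library facts relied on are the standard ctype_base constants and that
is() reports whether any of the requested bits is set.

```cpp
#include <locale>

// Hypothetical "valid" mask: the OR of every standard classification bit.
const std::ctype_base::mask valid_mask =
    std::ctype_base::space | std::ctype_base::print | std::ctype_base::cntrl |
    std::ctype_base::upper | std::ctype_base::lower | std::ctype_base::alpha |
    std::ctype_base::digit | std::ctype_base::punct | std::ctype_base::xdigit;

// True if the locale's ctype facet assigns the character any class at all.
bool is_valid_char(char c, const std::locale& loc = std::locale::classic()) {
    return std::use_facet<std::ctype<char> >(loc).is(valid_mask, c);
}
```

Whether a character above 127 is reported as invalid depends on the
locale and the implementation.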

> But what with those "to_lower" and
> "to_upper"?
>    I think that expression:

> The first form returns the corresponding upper-case character
> if it is known to exist, or its argument if not. The second
> form returns high.

> (extract from specification of to_upper) orders to return
> passed value if it is not in range. This is however not that
> clear for me because n think is a character which is not a
> letter and another is a character which is not character at
> all.

>    What do you think?

The specification for the functions seems clear enough.  In
practice, of course, they're not very useful, since upper to
lower mapping (or vice versa) is not a bijection in real life,
at least not for any character set useful for human languages.
Of course, if you are using them for case insensitive comparison
of key words, and your key words only use characters in the
basic character set, they work fine.  But don't expect to use
them on languages like French or German and get anything useful.
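The one use that does work can be sketched as follows. The function
name keyword_equal is hypothetical, and the comparison is only meant
for key words drawn from the basic character set.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Case-insensitive comparison for key words in the basic character set,
// using the classic locale's ctype facet one character at a time.
bool keyword_equal(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return false;
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ct.tolower(a[i]) != ct.tolower(b[i]))
            return false;
    }
    return true;
}
```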

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Tue, 3 Oct 2006 15:05:51 GMT Raw View

abadura ha scritto:
>    ASCII uses only 128 values while the "char" type seems to have to be (I
> could not find it as well) at least 8 bits long. So it can store at least
> 256 values. And even if it does not have to, it can, and I think on most
> if not all PCs it actually has 8 bits.

It's true that most PC implementations have 8-bit chars, but most of them
also make char a signed type, so, strictly speaking, they don't have
characters with values greater than 127.

>    The question is what should the to_lower or to_upper function return for
> values larger than 127, which are not valid ASCII values? Should it
> return the passed value just as it does for any non-character? Or what?

The interpretation of character values outside the ASCII range is locale
dependent. What is the lower case of character '\xb7'? It's '\x85' in
some codepage (DOS Western Europe) and '\xd8' in another codepage (DOS
Eastern Europe) and could just be '\xb7' (in Unicode, as it's a
punctuation character there). If I'm not mistaken, the "C" locale does
not provide an interpretation of characters with values larger than 127.
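The three answers above can be written down as a tiny lookup. The enum
Codepage and the function lower_of_b7 are hypothetical names, and the
byte values are simply the ones quoted in the paragraph above.

```cpp
// Hypothetical tags for the three interpretations mentioned above.
enum class Codepage { dos_western_europe, dos_eastern_europe, unicode };

// Lower-case counterpart of byte 0xB7 under each interpretation.
unsigned char lower_of_b7(Codepage cp) {
    switch (cp) {
        case Codepage::dos_western_europe: return 0x85;
        case Codepage::dos_eastern_europe: return 0xD8;
        case Codepage::unicode:            return 0xB7;  // punctuation: unchanged
    }
    return 0xB7;  // unreachable for valid enum values
}
```

A ctype facet for each codepage would have to encode a different table,
which is why the answer cannot be given independently of the locale.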

>    The "type" (masks like "printable", "upper", "digit") of a character (as
> I checked in Cygwin and MS VS .NET 2005) in the "classic locale" for
> characters not in the [0;127] range is "0", which means that they have no
> attribute at all (not even "ctrl"). But what with those "to_lower" and
> "to_upper"?
>    I think that expression:
>
> The first form returns the corresponding upper-case character if it is
> known to exist, or its argument if not. The second form returns high.

In the "C" locale, to_lower, resp. to_upper, return its argument for all
characters except the 26 latin uppercase, resp. lowercase, letters, as
described in 7.4.2 of the C Standard (not the C++ standard!). Otherwise
it's locale dependent.
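A quick way to check this with the <cctype> functions (spelled toupper
and tolower there) is to count how many character values are changed.
The function name count_changed_by_toupper is hypothetical, and the
sketch assumes the global C locale is still "C", as it is at program
startup.

```cpp
#include <cctype>
#include <climits>

// Counts the unsigned char values that std::toupper actually changes.
// Assumes the global C locale is "C" (no prior setlocale call).
int count_changed_by_toupper() {
    int changed = 0;
    for (int c = 0; c <= UCHAR_MAX; ++c) {
        if (std::toupper(c) != c)
            ++changed;
    }
    return changed;
}
```

In the "C" locale only the 26 lowercase Latin letters should be
counted; after a setlocale call the result may differ.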

> (extract from specification of to_upper) orders to return passed value
> if it is not in range. This is however not that clear for me because n
> think is a character which is not a letter and another is a character
> which is not character at all.

I'm sorry, but I really could not understand this sentence...

>    What do you think?

You have just scratched the surface of the locale issue which can be
incredibly complex. I strongly suggest you find some good book on the
subject before digging deeper.

Regards,

Ganesh


Author: giecrilj@stegny.2a.pl ("Kristof Zelechovski")
Date: Wed, 4 Oct 2006 15:04:16 GMT Raw View

"kanze" <kanze@gabi-soft.fr> wrote in message
news:1159867506.222269.192330@h48g2000cwc.googlegroups.com...
> The specification for the functions seems clear enough.  In
> practice, of course, they're not very useful, since upper to
> lower mapping (or vice versa) is not a bijection in real life,
> at least not for any character set useful for human languages.
> Of course, if you are using them for case insensitive comparison
> of key words, and your key words only use characters in the
> basic character set, they work fine.  But don't expect to use
> them on languages like French or German and get anything useful.

German has the ß anomaly, but what about French?




Author: "kanze" <kanze@gabi-soft.fr>
Date: Wed, 4 Oct 2006 10:03:40 CST Raw View

Alberto Ganesh Barbati wrote:
> abadura ha scritto:
> >    ASCII uses only 128 values while the "char" type seems to have to be (I
> > could not find it as well) at least 8 bits long. So it can store at least
> > 256 values. And even if it does not have to, it can, and I think on most
> > if not all PCs it actually has 8 bits.

> It's true that most PC implementations have 8-bit chars, but
> most of them also make char a signed type, so, strictly
> speaking, they don't have characters with values greater than
> 127.

On the other hand, they have characters with values less than 0,
which is worse:-).  In ASCII, of course, it doesn't matter,
because these are invalid characters, however you look at it.
But I don't know of an implementation today which still uses
ASCII; they all use 8 bit character sets.  With the result that
the character specified as having the value 229 actually has the
value -27.
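The usual precaution is to go through unsigned char before indexing a
table or calling a <cctype> function, since passing a negative value
other than EOF to those functions is undefined behaviour. A minimal
sketch, with hypothetical helper names char_index and is_upper_safe:

```cpp
#include <cctype>

// Index into a 0..UCHAR_MAX table, whether plain char is signed or not.
unsigned char_index(char c) {
    return static_cast<unsigned char>(c);
}

// Converts to unsigned char first so that a negative char never reaches
// std::isupper directly.
bool is_upper_safe(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}
```

On a platform with a signed 8-bit char, the character stored as -27
yields index 229 again.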

> >    The question is what should the to_lower or to_upper function return for
> > values larger than 127, which are not valid ASCII values? Should it
> > return the passed value just as it does for any non-character? Or what?

> The interpretation of character values outside the ASCII range is locale
> dependent.

By definition, the contents of a facet is locale dependent.
That's what they're there for.

> What is the lower case of character '\xb7'? It's '\x85' in
> some codepage (DOS Western Europe) and '\xd8' in another codepage (DOS
> Eastern Europe) and could just be '\xb7' (in Unicode, as it's a
> punctuation character there). If I'm not mistaken, the "C" locale does
> not provide an interpretation of characters with values larger than 127.

The standard only specifies the minimum set of characters
necessary, and the behavior of this set.  Under Solaris, the "C"
locale does correspond to strict ASCII, and declares all
characters outside the range 0...127 invalid.  Which, of course,
doesn't correspond at all to the reality outside of the program.
Which doesn't depend on the locale, but on the specific font which is
active, and its encoding.  But the standard certainly doesn't
require such, and an implementation could very easily use ISO
8859-1 as the encoding for locale "C"; I'm not sure, but I
rather suspect that even locale "C" could depend on some
external factors.  (Again, under Solaris, it is normally loaded
dynamically.  Solaris hasn't gone that route, but if the system
developers decided that there would be an English version of the
OS, and a French version, and a German version, etc., they could
also provide a different shared object file with each version, so
that the default behavior would depend on the version.)

C defines a few additional constraints, and while every
implementation I know adds characters to the basic execution
character set (e.g. '@' or '$'), in C, these characters cannot
be considered alphabetic nor digits in locale "C".

> >    The "type" (masks like "printable", "upper", "digit") of a character (as
> > I checked in Cygwin and MS VS .NET 2005) in the "classic locale" for
> > characters not in the [0;127] range is "0", which means that they have no
> > attribute at all (not even "ctrl"). But what with those "to_lower" and
> > "to_upper"?
> >    I think that expression:

> > The first form returns the corresponding upper-case character if it is
> > known to exist, or its argument if not. The second form returns high.

> In the "C" locale, to_lower, resp. to_upper, return its argument for all
> characters except the 26 latin uppercase, resp. lowercase, letters, as
> described in 7.4.2 of the C Standard (not the C++ standard!). Otherwise
> it's locale dependent.

Yes.  For C, and not for C++.  (In practice, however, I can't
imagine an implementation daring to have the two be different.)

> > (extract from specification of to_upper) orders to return passed value
> > if it is not in range. This is however not that clear for me because n
> > think is a character which is not a letter and another is a character
> > which is not character at all.

> I'm sorry, but I really could not understand this sentence...

> >    What do you think?

> You have just scratched the surface of the locale issue which can be
> incredibly complex. I strongly suggest you find some good book on the
> subject before digging deeper.

And if he finds one, that he post a reference to it here,
because quite frankly, I don't know of one.  (There is some
discussion of the issues in Plauger's book on the C standard
library, but that's about all I know of.)

The issue is particularly complicated because it generally
involves a lot of programs outside your control.  If you're
outputting to an xterm, for example, whether the user sees
something he considers an upper case letter or not has nothing
to do with what ctype might say about it; it depends entirely on
the encoding of the font that xterm is using.  (I think most
Linux distributions will use a font with UTF-8 encoding by
default.  Which means that you can't use ctype at all.)
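A small sketch of what goes wrong, assuming the text is UTF-8. The
function name upper_bytes is hypothetical; it applies the classic
locale's ctype facet to every byte, which is all that facet can do.

```cpp
#include <locale>
#include <string>

// Upper-cases a string one byte at a time with the classic ctype facet.
// With UTF-8 input, only the ASCII letters are affected.
std::string upper_bytes(std::string s) {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    if (!s.empty())
        ct.toupper(&s[0], &s[0] + s.size());
    return s;
}
```

For "café" encoded as UTF-8, the two bytes of the é come back
unchanged, so the user sees "CAFé" rather than "CAFÉ".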

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Author: jm@bourguet.org (Jean-Marc Bourguet)
Date: Wed, 4 Oct 2006 15:43:16 GMT Raw View

"kanze" <kanze@gabi-soft.fr> writes:

> Under Solaris, the "C" locale does correspond to strict ASCII, and
> declares all characters outside the range 0...127 invalid.  Which, of
> course, doesn't correspond at all to the reality outside of the program.
> Which doesn't depend on the locale, but on the specific font which is active,
> and its encoding.  But the standard certainly doesn't require such,

Well, a quick reading of POSIX makes me think that POSIX does.  And Solaris
aims for POSIX compliance.

--
Jean-Marc


Author: Alberto Ganesh Barbati <AlbertoBarbati@libero.it>
Date: Wed, 4 Oct 2006 14:16:47 CST Raw View

kanze ha scritto:
> Alberto Ganesh Barbati wrote:
>
>> In the "C" locale, to_lower, resp. to_upper, return its argument for all
>> characters except the 26 latin uppercase, resp. lowercase, letters, as
>> described in 7.4.2 of the C Standard (not the C++ standard!). Otherwise
>> it's locale dependent.
>
> Yes.  For C, and not for C++.  (In practice, however, I can't
> imagine an implementation daring to have the two be different.)
>

I thought this part of the C library was included in the C++ library
without latitude for differences. In particular, functions to_lower and
to_upper (from <cctype>) should have the well-defined meaning I reported
above. Moreover 22.1.1.5/5 seems to guarantee that the ctype facets
provided by std::locale("C") should also behave the same way. Am I missing
something?
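The expectation can at least be tested on a given implementation. The
sketch below compares the <cctype> functions with the ctype facet of
std::locale::classic() over the ASCII range only; the function name
c_and_cpp_agree_on_ascii is hypothetical, and it assumes the global C
locale is still "C".

```cpp
#include <cctype>
#include <locale>

// Compares <cctype> toupper/tolower with the classic locale's ctype facet
// for the values 0..127. Assumes the global C locale is "C".
bool c_and_cpp_agree_on_ascii() {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    for (int i = 0; i < 128; ++i) {
        const char c = static_cast<char>(i);
        if (ct.toupper(c) != static_cast<char>(std::toupper(i)))
            return false;
        if (ct.tolower(c) != static_cast<char>(std::tolower(i)))
            return false;
    }
    return true;
}
```

A passing check says nothing about what the standard requires; it only
shows what one implementation does.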

Ganesh


Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Wed, 4 Oct 2006 22:24:24 GMT Raw View

Kristof Zelechovski ha scritto:
> "kanze" <kanze@gabi-soft.fr> wrote in message
> news:1159867506.222269.192330@h48g2000cwc.googlegroups.com...
>> The specification for the functions seems clear enough.  In
>> practice, of course, they're not very useful, since upper to
>> lower mapping (or vice versa) is not a bijection in real life,
>> at least not for any character set useful for human languages.
>> Of course, if you are using them for case insensitive comparison
>> of key words, and your key words only use characters in the
>> basic character set, they work fine.  But don't expect to use
>> them on languages like French or German and get anything useful.
>
> German has the ß anomaly, but what about French?

For example, some legacy typographic rules in France state that accents
are omitted on capital letters. According to those rules, the uppercase of
the letter à (U+00E0) would be A and not À (U+00C0), thus the mapping
would not be a bijection. Although the trend is changing in favor of the
more modern "bijective" conversion, there may be locales implementing
these legacy rules.
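A minimal sketch of such a legacy rule, to show where the bijection
breaks. The function name legacy_french_upper is hypothetical and it
handles only à and the ASCII letters; it is not taken from any real
locale.

```cpp
// Toy "legacy French" upper-casing on single code points: the accent is
// dropped, so two different lower-case letters share one upper-case letter.
char32_t legacy_french_upper(char32_t c) {
    if (c == U'\u00E0')                 // a with grave loses its accent
        return U'A';
    if (c >= U'a' && c <= U'z')         // plain ASCII letters
        return static_cast<char32_t>(c - U'a' + U'A');
    return c;                           // everything else is unchanged
}
```

Since both a and à map to A, no lower-casing function can recover the
original letter from the result.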

Ganesh


Author: "abadura" <abadura@o2.pl>
Date: Mon, 2 Oct 2006 11:28:37 CST Raw View

   I write code which implements ASCII (and other codes as well) to be
usable with the STD.

   Naturally I read the Standard (the appropriate part) a few times, and
while writing I still consult the text. But I came across a strange
thing which I cannot decide what to do with... Suppose I have a
derived class of std::ctype<char> or anything like this to be used for
ASCII code.
   ASCII uses only 128 values while the "char" type seems to have to be (I
could not find it as well) at least 8 bits long. So it can store at least
256 values. And even if it does not have to, it can, and I think on most
if not all PCs it actually has 8 bits.
   The question is what should the to_lower or to_upper function return for
values larger than 127, which are not valid ASCII values? Should it
return the passed value just as it does for any non-character? Or what?
   The "type" (masks like "printable", "upper", "digit") of a character (as
I checked in Cygwin and MS VS .NET 2005) in the "classic locale" for
characters not in the [0;127] range is "0", which means that they have no
attribute at all (not even "ctrl"). But what with those "to_lower" and
"to_upper"?
   I think that expression:

The first form returns the corresponding upper-case character if it is
known to exist, or its argument if not. The second form returns high.

(extract from specification of to_upper) orders to return passed value
if it is not in range. This is however not that clear for me because n
think is a character which is not a letter and another is a character
which is not character at all.
   What do you think?

   Adam Badura
