Thread

Topic: ANSI/C++ string class not correct for non English languages ?????

Author: jkanze@otelo.ibmmail.com
Date: 1998/03/11

In article <35058106.2681@central.beasys.com>,
  david.tribble@noSPAM.central.beasys.com wrote:
>
> I, David R Tribble <david.tribble@central.beasys.com> wrote:
> >> It should behave correctly, but it doesn't.  Consider an
> >> implementation with signed chars that defines EOF to be -1:
> >>
> >>     #include <stdio.h>
> >>
> >>     if (isprint('\xFF')) ...  // True, FF = Latin-1 y+diaeresis
> >>     if (isprint(EOF)) ...     // False, EOF is not printable
>
> Mea culpa, that should be:
>     #include <ctype.h>
>
> Achim Gratz wrote:
> > That would be a bug in your program, since EOF is not a char, but an
> > integer in this case.  It must be distinguishable from any char and
> > thus it cannot be a char itself, signed or not.
>
> No, my program is quite legal and bug-free.  In C, '\xFF' and EOF
> are both integer constants.  In C++, '\xFF' will convert to an
> 'int' constant, either through sign-extension (if 'char' is signed)
> or zero-extension (if 'char' is unsigned).

Correct so far.
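
A minimal sketch of the conversion described above, assuming an 8-bit
char and two's complement representation.  The helper names are made
up for illustration; only the casts themselves come from the language.

```cpp
#include <climits>

// Widen a plain char to int the way the language does implicitly.
// With a signed char the bit pattern 0xFF widens to -1 (sign-extension);
// with an unsigned char it widens to 255 (zero-extension).
inline int widen_plain(char c) { return c; }

// Widen through unsigned char, so the result is always in 0..UCHAR_MAX
// whatever the signedness of plain char.
inline int widen_as_code(char c) { return static_cast<unsigned char>(c); }

// True when plain char is signed on this implementation.
inline bool plain_char_is_signed() { return CHAR_MIN < 0; }
```

On an implementation where plain char is signed, widen_plain('\xFF')
has the same value as an EOF defined as -1, while
widen_as_code('\xFF') does not.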

> The C++ library is based on the C library, for which isprint() et al
> take an 'int' argument.  EOF is a perfectly legal argument.
> See below.

PARTS of the C++ library are based on the C library.  The functions in
<locale> are NOT.

> I also wrote:
> >> Consider an
> >> implementation with signed chars that defines EOF to be -1:
>
> jkanze@otelo.ibmmail.com wrote:
> > Irrelevant: EOF is NOT a legal input for the is.... functions in
> > locale.
>
> I beg to differ, quoting from the C89 standard, upon which the
> C++ library is based:

Your quote concerns the functions in <ctype.h> (or <cctype>), not the
functions in <locale>.  The header files <ctype.h> and <cctype> are
mainly present for C compatibility, I would imagine.  At any rate,
C++ code should be using <locale>, except that most compilers don't
support it yet.

> > Here, you are probably getting the old isxxx functions from ctype.h.
> > The new ones *require* a locale as the second argument.  It would be
> > nice if the second argument could default to the global locale, but
> > as someone (Nathan Myers?) pointed out, the ambiguities with the
> > older functions would create more problems than it was worth.
>
> Perhaps the situation is better in the C++ library.  Is EOF not
> allowed as an argument to isprint() in the C++ library?

Maybe.  As is to be expected :-), the new functions are templates over
the type of the character.  Of course, they only make sense for types
over which they are specialized, but "isprint( 2.4 , locale() )" is
not forbidden (although its meaning would be implementation
defined).

An implementation is required to provide specializations only for char
and wchar_t, so using one of the functions with any other type is
unportable.  I've not been able to find a concrete statement concerning
legal input values, but without any stated restrictions, I would suppose
any value the instantiation type can hold would be legal and defined.
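
For reference, a small sketch of calling the two-argument form from
<locale> for the two character types mentioned above.  The wrapper
name is invented for the example, and locale::classic() (the "C"
locale) is chosen only so that the result does not depend on the
environment.

```cpp
#include <locale>

// Classify a narrow character with the <locale> template.
// The second argument is mandatory: there is no default locale.
inline bool printable_in_c_locale(char c) {
    return std::isprint(c, std::locale::classic());
}

// Same call, instantiated for wchar_t.
inline bool printable_in_c_locale(wchar_t c) {
    return std::isprint(c, std::locale::classic());
}
```

Note that the argument is a character type, not an int, so there is no
room in the interface for a separate EOF value.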

And to your question: if chars are signed, and EOF is defined as -1 (an int),
the value EOF will fit in a char, and so can be passed to the new functions.
In this context, however, its value will not be interpreted as EOF, but
as whatever character happens to have this code (or bit pattern, since
most implementations do put ISO 8859-1 on a signed eight bit char, and
interpret the char's using the underlying bit pattern, rather than the
numeric value).
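
A sketch of that effect, assuming EOF is -1 and char has eight bits
(both typical, neither guaranteed).  The function names are
hypothetical.

```cpp
#include <climits>
#include <cstdio>
#include <locale>

// Once EOF has been narrowed to char, it is just the character whose
// code happens to match; nothing marks it as "end of file" any more.
inline bool eof_narrows_to_ff() {
    return static_cast<char>(EOF) == '\xFF';
}

// The <locale> function therefore classifies the narrowed EOF exactly
// as it classifies '\xFF'.
inline bool same_classification() {
    const std::locale& loc = std::locale::classic();
    return std::isprint(static_cast<char>(EOF), loc)
        == std::isprint('\xFF', loc);
}
```

Under the stated assumptions both functions return true, whether plain
char is signed or unsigned.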

> But there's still the inherent problem, on systems where 'char' is
> signed and EOF is -1, that '\xFF' == EOF is true.  Character '\xFF'
> is a valid Latin-1 character code, and EOF is a code that's supposed
> to be distinct from all other valid character codes, so they should
> compare unequal.  This can be fixed by assuming that plain 'char' is
> unsigned, and/or by defining EOF to be something like -256.

As in C, all of the functions in C++ which can return an EOF return
an int, with values between 0 and UCHAR_MAX, or EOF.

Technically, you're correct, but in practice, it isn't a problem.
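
To illustrate why it isn't a problem in practice, here is a sketch
using an istringstream, assuming an 8-bit char.  The typedef and
function names are mine; get(), to_int_type() and eof() are the
standard ones.

```cpp
#include <sstream>
#include <string>

typedef std::char_traits<char> traits;

// Return what get() yields for the first character of s:
// a value in 0..UCHAR_MAX, or traits::eof() if s is empty.
inline int first_stream_value(const std::string& s) {
    std::istringstream in(s);
    return in.get();
}

// True only for the end-of-file indicator, never for a character
// that was actually read.
inline bool is_eof(int v) {
    return traits::eq_int_type(v, traits::eof());
}
```

As long as the result is kept in an int and only narrowed to char
after the EOF test, the character 0xFF and end of file stay distinct.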

> >> I've suggested a couple of times in the past that plain 'char' should
> >> be unsigned in order to simplify (i.e., to correct) such things, but
> >> to no avail.
>
> > I've suggested this too.  The problem is that there really is a large
> > body of code that uses char (and not signed char) for small signed
> > int's.
> > Such code isn't portable; you can even argue that it isn't correct.
> > But I can well understand a vendor not wanting to break it.
>
> If it really is incorrect code, then breaking it is acceptable for
> the new language standard - the new standards never claimed they
> wouldn't break bad code.  And this kind of code can be fixed, by
> substituting 'signed char'.

The standard can say what it likes, but if you require char to be unsigned,
then every vendor who currently has a signed char will continue to
support it, probably by default.  Which means that compiler flags will
affect the semantics of your program.

Frankly, I'm not sure what the correct solution is.  For me, the problem
is simple.  They can and should declare that char must be unsigned.
It won't break any of my code, and if it does, I'll be glad of it; it
will make me fix an error I wasn't aware of.  Not all programmers have
the same attitude as I do, here, and vendors are in business to sell
as many compilers as they can, not to be right at all costs.  If conforming
to the (new) standard costs sales, they won't conform, or they will only
conform with special flags that nobody uses.  In the end, the results
will probably be worse, even for me, than the current situation.

> > There are also machines on which an unsigned char is slower than a
> > signed one.
>
> Indeed.  But if you agree that 'char' should not be signed, then
> efficiency is not a reasonable argument for keeping it signed.  The
> question is, should 'char' mean "character code" or "small integer"?
> (Java has a nice answer to this question.)

It's a nice answer for a new language which doesn't have to worry about
backwards compatibility or existing code, although I wonder how good it
will seem when they realize that 16 bits isn't enough.  And I think that
most people agree that failing to distinguish between "small integer"
and "character" is probably an error in language design.  But it's too
late to change any of this in C or C++ today.

--
James Kanze    +33 (0)1 39 23 84 71    mailto: kanze@gabi-soft.fr
        +49 (0)69 66 45 33 10    mailto: jkanze@otelo.ibmmail.com
GABI Software, 22 rue Jacques-Lemercier, 78000 Versailles, France
Conseils en informatique orientée objet --
              -- Beratung in objektorientierter Datenverarbeitung

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/   Now offering spam-free web-based newsreading
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]

Author: jkanze@otelo.ibmmail.com
Date: 1998/03/11

In article <clnujasdc.fsf@ite127.inf.tu-dresden.de>,
  Achim Gratz <gratz@ite.inf.tu-dresden.de> wrote:
>
> David R Tribble <david.tribble@central.beasys.com> writes:
>
> > It should behave correctly, but it doesn't.  Consider an
> > implementation with signed chars that defines EOF to be -1:
> >
> >     #include <stdio.h>
> >
> >     if (isprint('\xFF')) ...    // True, FF = Latin-1 y+diaeresis (ÿ)
> >     if (isprint(EOF)) ...       // False, EOF is not printable
>
> That would be a bug in your program, since EOF is not a char, but an
> integer in this case.  It must be distinguishable from any char and
> thus it cannot be a char itself, signed or not.

It's not a bug in his program, since his program invokes the isprint
function declared in <cctype>, which is defined for all values between
0 and UCHAR_MAX inclusive, plus EOF.
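
A short sketch of staying inside that domain when the value starts out
as a plain char.  The wrapper names are hypothetical; the cast through
unsigned char is the usual idiom.

```cpp
#include <cctype>
#include <cstdio>

// Classify a plain char with the <cctype> function.  The cast maps a
// possibly negative char into 0..UCHAR_MAX, the range the function is
// defined for.
inline bool c_isprint(char c) {
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}

// Classify a value that is already an int, as returned by getc() and
// friends.  The caller must pass EOF or a value in 0..UCHAR_MAX.
inline bool c_isprint_int(int v) {
    return std::isprint(v) != 0;
}
```

Passing a negative char other than EOF straight to the one-argument
isprint is outside its defined domain, which is why the cast matters
on implementations with a signed char.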

It's just not relevant to my comments, which concerned the isxxx
functions in <locale>.  These are not defined for EOF, but for all
possible values of the instantiation type.  (I suppose.  I would feel
happier if I could find a sentence in the standard which said this
explicitly.)

--
James Kanze    +33 (0)1 39 23 84 71    mailto: kanze@gabi-soft.fr
        +49 (0)69 66 45 33 10    mailto: jkanze@otelo.ibmmail.com
GABI Software, 22 rue Jacques-Lemercier, 78000 Versailles, France
Conseils en informatique orientée objet --
              -- Beratung in objektorientierter Datenverarbeitung


Author: James Kuyper <kuyper@wizard.net>
Date: 1998/03/09

Jerry Coffin wrote:
>
> In article <35002C69.3BF853D4@acm.org>, petebecker@acm.org says...
> > Jerry Coffin wrote:
> > >
> > >
> > > As we speak, it's getting close to 8 years since the C standard was
> > > ratified, but the majority of C compilers seem to have limited (or
> > > nonexistent) support for any but one locale: the C locale that's been
> > > required since the beginning.  The problem here isn't the standard,
> > > it's the vendors...
> >
> > Well, maybe. But remember that vendors respond to customers. So maybe
> > the problem is lack of demand.
>
> I'll bow to your superior personal knowledge of that.  I have to agree
> that if demand was strong, the vendors would undoubtedly meet it.  I
> think to some extent, it's a bit of a chicken and egg problem: if one
> vendor were to come out with really good locale support, demand for it
> would likely grow.  (Hint hint...)

Well, the SGI compiler we use at my company supports the following
locales:
C             en_AU         fr_BE         nl            sv
cs            en_CA         fr_CA         nl_BE         tr
da            en_US         fr_CH         no            zh_CN.ugb
de            es            is            pl            zh_TW.uncs
de_AT         es_AR         it            pt
de_CH         es_MX         it_CH         pt_BR
el            fi            ja_JP.EUC     ru
en            fr                          sk

In addition, the man pages for virtually every IRIX 6.2 system utility
indicate support of locales, based upon environment variables with the
same names as the setlocale() macros. I don't know how well locales are
actually supported; my work just barely requires support for English, so
I haven't had to use the other locales. Still, if a single vendor with
decent support were all that was needed, that problem has already been
solved.

I think one of the key barriers to greater use of locales is the
difficulty of writing portable interoperable C code that makes use of
them. The fact that so many aspects of locales are implementation
defined means that most code which uses them must contain significant
sections which need to be re-written for each implementation. This
problem could be greatly reduced by:

Requiring an implementation to document through a macro (WC_LOCALE?)
defined in the locale.h the name of the locale used to convert wide
character constants to the execution wide character set.

Adding a library function

 const char *getlocale(unsigned int n);

which retrieves a pointer to the name of the n'th supported locale,
returning NULL if and only if n > N, for some implementation-defined
value of N.

Mandating that if the name of any of the properties recognised by
towctrans() is a locale name, then that property implements a
transformation from the current locale to the specified locale.

Finally, the current standard requires the development of an
implementation-defined conversion from UCN's to the execution wide
character set, to be used in translation phase 5. It seems to me that it
wouldn't require much additional work by the implementor if the standard
required that the same conversion be made available to programs through
a function such as

 wchar_t UCNtowc(uint_least32_t ucn);

That conversion needn't be invertible, but a semi-inverse wctoUCN() can
and should be made available, satisfying the requirement that

 UCNtowc(wctoUCN(a))==a

where 'a' is any legal wide character constant. The return type of
wctoUCN could be smaller than uint_least32_t, depending on the value of
the largest UCN with a non-trivial conversion to wchar_t. An
implementation which makes wchar_t identical with char could make do
with 'unsigned char' as the return type.

"string" oriented versions of these functions would be even more useful.
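
To make the getlocale() proposal above concrete, here is a rough
sketch of how it might look and be used.  Nothing here exists in any
standard library: the function is the proposed one, placed in a
made-up namespace, and the table of names is invented for the example
(a real implementation would list the locales it actually ships).

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical list of supported locale names.
const char* const kLocaleNames[] = { "C", "en_US", "fr_CH", "de_AT" };

// The implementation-defined N of the proposal: the largest valid index.
const unsigned int kLastIndex =
    sizeof kLocaleNames / sizeof kLocaleNames[0] - 1;

// Proposed function: name of the n'th supported locale,
// or NULL if and only if n > N.
inline const char* getlocale(unsigned int n) {
    return n > kLastIndex ? NULL : kLocaleNames[n];
}

// Portable enumeration: walk the indices until NULL comes back.
inline unsigned int count_locales() {
    unsigned int n = 0;
    while (getlocale(n) != NULL) {
        ++n;
    }
    return n;
}

}  // namespace sketch
```

A program could pass each returned name to setlocale() to discover
which locales it can rely on, without hard-coding
implementation-specific names.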