Topic: signedness of plain char


Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/02/13
d96-mst@nada.kth.se (Mikael Ståldal) writes:

|>  In article <rf5sp3kizry.fsf@vx.cit.alcatel.fr>,
|>  James Kanze <james-albert.kanze@vx.cit.alcatel.fr> wrote:
|>
|>  >In practice, it is impossible to declare string literals as type
|>  >"unsigned char".  I've generally found that the solution consists in
|>  >using char (and not unsigned char), and casting to "unsigned char" for
|>  >the isxxx functions.
|>  >
|>  >I also design my own code so that it can handle signed char's.  Thus, an
|>  >array designed to be indexed by char's will be declared:
|>  >
|>  >    int                 a[ CHAR_MAX - CHAR_MIN + 1 ] ;
|>  >
|>  >and accessed by:
|>  >
|>  >    a[ i - CHAR_MIN ]
|>
|>  I have hoped that the C++ standard would define a nice and useful high
|>  level language. But if this kind of ugly kludge is necessary to get
|>  conforming programs I'm not sure it will be :-(

Attention:

I should have specified that this is what I have done in the past, in
C.  In C++, the problem has not yet come up, but I'd surely use a
special class, with overloaded operator[].  (Typically, operator[] would
be overloaded to handle all char types, plain, signed and unsigned,
correctly.)
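
A minimal sketch of what such a class might look like.  The name
CharIndexedArray and the decision to normalise every index through
unsigned char are assumptions made for illustration; nothing in this
thread specifies them.  The self-check also assumes 8-bit bytes.

```cpp
#include <cassert>
#include <climits>

// Hypothetical class: an array with one slot per byte value, indexable by
// any of the three character types.
template <typename T>
class CharIndexedArray {
public:
    CharIndexedArray() : slots_() {}   // value-initialise every slot

    // Every overload normalises its argument to the range 0...UCHAR_MAX.
    T& operator[](char c)          { return slots_[static_cast<unsigned char>(c)]; }
    T& operator[](signed char c)   { return slots_[static_cast<unsigned char>(c)]; }
    T& operator[](unsigned char c) { return slots_[c]; }

private:
    T slots_[UCHAR_MAX + 1];
};

// Illustrative check: three spellings of byte 0xF1 reach the same slot.
inline void check_char_indexed_array() {
    CharIndexedArray<int> a;
    a['\xF1'] = 7;                                     // plain char
    assert(a[static_cast<unsigned char>(0xF1)] == 7);  // unsigned char
    assert(a[static_cast<signed char>(-15)] == 7);     // signed char
    assert(a['a'] == 0);                               // untouched slot
}
```

With this design, indexing with a plain int (for example a[65]) is
ambiguous among the three overloads and does not compile, so the caller
has to say which character type is meant.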

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: d96-mst@nada.kth.se (Mikael Ståldal)
Date: 1997/02/10
In article <2.2.32.19970128001102.0030757c@central.beasys.com>,
David R Tribble <david.tribble@central.beasys.com> wrote:

>In practice, the only way to guarantee correct, portable behavior is to
>declare all characters and strings as 'unsigned char' and to live with the
>fact that all the standard library functions (like strlen()) require a
>cast to 'char *' (since none of the library functions accept 'unsigned
>char *' args).  (Overloaded functions will probably help this situation
>for C++, but not for poor old C.)

This is really bad.

C++ considers plain char, signed char and unsigned char
as three different types when overloading. That is definitely a good
thing, but if plain char can't be used this feature would be useless.
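
A short sketch of the overloading rule described above.  The names
CharKind and kind_of are hypothetical; they exist only to show which
overload is selected.

```cpp
#include <cassert>

enum CharKind { PlainChar, SignedChar, UnsignedChar };

// Three separate overloads.  None is a redeclaration of another, because
// plain char is a distinct type even though it shares its representation
// with one of the other two.
inline CharKind kind_of(char)          { return PlainChar; }
inline CharKind kind_of(signed char)   { return SignedChar; }
inline CharKind kind_of(unsigned char) { return UnsignedChar; }

// Illustrative check: a character literal has type plain char.
inline void check_kind_of() {
    assert(kind_of('a') == PlainChar);
    assert(kind_of(static_cast<signed char>('a')) == SignedChar);
    assert(kind_of(static_cast<unsigned char>('a')) == UnsignedChar);
}
```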

Require plain char to be unsigned!





Author: d96-mst@nada.kth.se (Mikael Ståldal)
Date: 1997/02/11
In article <rf5sp3kizry.fsf@vx.cit.alcatel.fr>,
James Kanze <james-albert.kanze@vx.cit.alcatel.fr> wrote:

>In practice, it is impossible to declare string literals as type
>"unsigned char".  I've generally found that the solution consists in
>using char (and not unsigned char), and casting to "unsigned char" for
>the isxxx functions.
>
>I also design my own code so that it can handle signed char's.  Thus, an
>array designed to be indexed by char's will be declared:
>
>    int                 a[ CHAR_MAX - CHAR_MIN + 1 ] ;
>
>and accessed by:
>
>    a[ i - CHAR_MIN ]

I have hoped that the C++ standard would define a nice and useful high
level language. But if this kind of ugly kludge is necessary to get
conforming programs I'm not sure it will be :-(







Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/29
David R Tribble <david.tribble@central.beasys.com> writes:

|>  In practice, the only way to guarantee correct, portable behavior is to
|>  declare all characters and strings as 'unsigned char' and to live with the
|>  fact that all the standard library functions (like strlen()) require a
|>  cast to 'char *' (since none of the library functions accept 'unsigned
|>  char *' args).  (Overloaded functions will probably help this situation
|>  for C++, but not for poor old C.)

In practice, it is impossible to declare string literals as type
"unsigned char".  I've generally found that the solution consists in
using char (and not unsigned char), and casting to "unsigned char" for
the isxxx functions.

I also design my own code so that it can handle signed char's.  Thus, an
array designed to be indexed by char's will be declared:

    int                 a[ CHAR_MAX - CHAR_MIN + 1 ] ;

and accessed by:

    a[ i - CHAR_MIN ]

In C, this would be encapsulated in a macro, in C++ in a class.
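
As an illustration of the CHAR_MIN offset, here is a minimal sketch that
counts character frequencies in a string.  The names kCharRange,
char_slot and count_chars are hypothetical, chosen for this example only.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of distinct plain char values, whatever the signedness of char.
const std::size_t kCharRange = CHAR_MAX - CHAR_MIN + 1;

// Map a plain char onto 0...kCharRange-1.  The subtraction is done in int,
// so a negative char value cannot produce a negative index.
inline std::size_t char_slot(char c) {
    return static_cast<std::size_t>(c - CHAR_MIN);
}

// Count how often each char value occurs in a NUL-terminated string.
inline void count_chars(const char* s, int (&counts)[kCharRange]) {
    assert(s != 0);
    for (std::size_t i = 0; i < kCharRange; ++i) counts[i] = 0;
    for (; *s != '\0'; ++s) ++counts[char_slot(*s)];
}
```

A caller would declare "int counts[kCharRange];", call
count_chars(text, counts), and read a single entry back with
counts[char_slot(c)].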

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/27
David R Tribble <david.tribble@central.beasys.com> writes:

|>  > |>  How can you accomplish that task in a conforming manner now?
|>
|>  > Use <locale>, and not <cctype>.  Presumably, the isalpha, etc. functions
|>  > in <locale> are defined over all possible values of charT.
|>
|>  No, <ctype> defines all the isxxx() functions, but <locale> does not.

So which ones are missing?  In my copy of the draft, there is a
one-to-one correspondence between the isxxx functions in <locale> and
those defined in <ctype.h> in the C standard.

|>  It certainly doesn't redefine the functions; they are already supposed
|>  to be aware of the LC_CTYPE locale settings.

It overloads them.

Using the LC_CTYPE locale could be painful in cases where different
strings were encoded in different locales (almost a necessity when
working with 8 bit char's in an international environment).
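
A minimal sketch of the <locale> approach, in which the locale is passed
with each call instead of being taken from the global LC_CTYPE setting.
The wrapper name is_alpha_in is hypothetical.  Only the classic "C"
locale is used here, because the names of other locales are platform
specific.

```cpp
#include <cassert>
#include <locale>

// Hypothetical wrapper: classify c according to an explicitly supplied
// locale, leaving the global LC_CTYPE setting out of the picture.
template <typename CharT>
bool is_alpha_in(CharT c, const std::locale& loc) {
    return std::isalpha(c, loc);   // two-argument template from <locale>
}

// Illustrative check using the classic "C" locale.
inline void check_is_alpha_in() {
    const std::locale& c_loc = std::locale::classic();
    assert(is_alpha_in('a', c_loc));
    assert(!is_alpha_in('1', c_loc));
    assert(is_alpha_in(L'a', c_loc));   // the same call works for wchar_t
}
```

Strings in different encodings could each be classified with their own
std::locale object, if the platform provides suitable locales.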

|>  > Note that a good implementation could make <cctype> work even with 8 bit
|>  > signed characters.  This would entail defining EOF as -129, and
|>  > accepting character values in the range of -128...-1 (as well as
|>  > 128...255).  According to the standard, accessing one of the ctype
|>  > functions with a value other than 0...UCHAR_MAX or EOF is undefined
|>  > behavior, so an implementation is free to do something intelligent.
|>
|>  This solution is fine and conforming.
|>
|>  > The problem here is that there are probably more (broken) programs
|>  > assuming EOF == -1 than there are assuming plain char is signed.  So the
|>  > same motivations for making plain char signed (on a new architecture)
|>  > tend to force EOF to -1.
|>
|>  Then those programs wouldn't be conforming.  This shouldn't be a concern
|>  to the standard.

But it would be a concern to the vendor.  As I've said before, proving
that the customer is an idiot is not generally considered a good
marketing ploy.

|>  > For a real hack, that would, however, work in the real world, and not
|>  > expose existing programs as broken: leave EOF as -1, define isalpha, and
|>  > etc. over the range -128...255, with -1 acting as EOF, rather than 0xff,
|>  > but all of the other negative values "as if" they were positive.  In ISO
|>  > 8859-1, 0xff corresponds to a y with two dots over it.  Testing this
|>  > character as a plain char would give wrong results (because it would be
|>  > indistinguishable from EOF), but since this character isn't used in any
|>  > known language, the "bug" is probably tolerable (especially if
|>  > documented).
|>
|>  ISO-8859-1 (also known as Latin-1 ASCII) and Unicode define character code
|>  0x00FF to be 'Latin small letter y with diaeresis', which is used in French.

Really?  In which word?  I've been told that it occasionally occurs in
proper names in medieval French, but I've never seen it (and note well
my address), and it certainly doesn't occur in modern French.

|>  (The two little dots are called an umlaut, so the letter could also be
|>  called y-umlaut).  It wouldn't be in the standard ISO or Unicode character
|>  sets if it wasn't used in some language.

I can see you have little actual experience in how ISO standardization
really works:-).  (If you really want to discuss the utility of
y-diaeresis, drop in on comp.fonts.  Some of the people there are real
experts in typesetting foreign languages.)

|>  Allowing it to be treated as EOF,
|>  in effect ignoring it as a valid printable alphabetic character, is a mistake.

Formally, yes, and I'd prefer to avoid it.  (Even useless characters may
occur, if only by error, in text input.)

On the other hand, I'd rather have a ctype that worked for all of the
characters except one, which I never use anyway, than a ctype that only
works for half of the characters, and fails for many characters that I
use every day.  And I fear that commercial pressure will prevent me from
getting anything more, UNLESS it is mandated by the standard.
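
To make the trade-off concrete, here is a rough sketch of the hack under
discussion.  It is an illustration only, not any vendor's library: the
names are hypothetical, a range test stands in for a real lookup table,
the Latin-1 letter set is simplified, and 8-bit bytes, an
ASCII-compatible character set and EOF == -1 are all assumed.

```cpp
#include <cassert>

const int kEof = -1;   // assumption: the hack keeps EOF at -1

// Simplified stand-in for a real Latin-1 table: ASCII letters plus
// 0xC0...0xFF, except the multiplication (0xD7) and division (0xF7) signs.
inline bool latin1_is_alpha(unsigned char u) {
    if ((u >= 'A' && u <= 'Z') || (u >= 'a' && u <= 'z')) return true;
    return u >= 0xC0 && u != 0xD7 && u != 0xF7;
}

// Hypothetical isalpha accepting -128...255.  -1 is reserved for EOF; any
// other negative value is treated as the unsigned char with the same bits.
inline bool hack_isalpha(int c) {
    assert(c >= -128 && c <= 255);
    if (c == kEof) return false;   // also swallows 0xFF held in a signed char
    if (c < 0) c += 256;           // for example, -15 becomes 241 (0xF1)
    return latin1_is_alpha(static_cast<unsigned char>(c));
}
```

Here hack_isalpha(-15) and hack_isalpha(0xF1) agree, while 0xFF is
recognised only when it arrives as 255: the value -1 is always taken to
be EOF, which is the one character the scheme gives up.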

|>  The ISO-8859-2, -3, and -4 (covering Czech, German, Hungarian, Polish,
|>  Rumanian, Croatian, Slovak, Slovene, Esperanto, Galician, Maltese, Turkish,
|>  Estonian, Latvian, and Lithuanian) define code 0xFF to be 'Dot above'
|>  (Unicode U+02D9, I believe).  ISO-8859-5 (Cyrillic covering Bulgarian,
|>  Byelorussian, Macedonian, Russian, Serbian and Ukrainian) defines 0xFF as
|>  'Cyrillic small letter dzhe' (Unicode U+045F).  ISO-8859-9 (Icelandic
|>  replacement for Latin-1) also defines 0xFF as y-umlaut.  ISO-8859-10
|>  (adding Inuit, Greenlandic, Sami, and Lappish to Latin-4) defines 0xFF as
|>  'Latin small letter kra' (Unicode U+0138).

You're right, of course.  The problem is that people using these
languages are not buying a lot of computers (yet, at least).

In practice, too, in all of these areas, German or English is equally
necessary.  All of which should lead to the use of Unicode; <ctype.h>
defines no support for Unicode (but <locale> does).

|>  Any solution to this problem must not be US-centric.  That's why I think
|>  the best solution is to mandate that 'plain char' is 'unsigned char'.

Technically, you're right.  Practically, the vendors' first priority is
selling their product.  They will conform to a standard only because
they see this as a sales argument.  Most implementations initially chose
plain character as signed to avoid breaking (already broken) code which
assumed that plain char was signed (although even K&R I explicitly said
otherwise).

In practice, although it may not be perfect, I think that the use of
character traits and locale in C++ is more than adequate.  Or will be,
once all of the compilers support them.

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --





Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/01/28
Andrew Koenig <ark@research.att.com> wrote:
> A quick check of the C standard reveals that isalpha is required
> to return a well-defined value for any int you might possibly give
> it as an argument.

I hate to disagree with you (because you are probably the only mortal who
fully comprehends the ANSI C++ standard), but no cookie.  The isalpha()
functions et al are not required to handle every int value.  Quoth the
standard:

    7.3  Character handling <ctype.h>
      The header <ctype.h> declares several functions useful for testing
    and mapping characters.  In all cases the argument is an int, the value
    of which shall be representable as an 'unsigned char' or shall be equal
    to the value of the macro EOF.  If the argument has any other value,
    the behavior is undefined.

On implementations that treat 'plain char' as 'signed char', some valid
character values (of an extended character set) will be negative.  For
example, the character '\xF1' (which is the Latin-1 Spanish 'enyay'
character) has the value 0xF..FF1, or -15, on such implementations; the
value -15 is not representable as an unsigned char.  When passed to
isalpha() et al, such characters will be treated as 'undefined' argument
values.

Thus this code given by James Kanze, modified slightly, will result in
undefined behavior:

    char  c = '\xF1';           /* Spanish 'enyay' */
    if (isalpha(c)) ...         /* Undefined behavior */

We are forced to do something different to make it work as expected and
be portable:

    if (isalpha((unsigned char) c)) ... /* Defined behavior */
Or:
    unsigned char  c = '\xF1';  /* Spanish 'enyay' */
    if (isalpha(c)) ...         /* Defined behavior */
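
The cast can be packaged once in a small helper.  This is a minimal
sketch; the names is_alpha and ctype_arg are hypothetical, and the value
shown for '\xF1' assumes 8-bit bytes.

```cpp
#include <cassert>
#include <cctype>

// Convert to unsigned char first, so the int that reaches std::isalpha is
// always in 0...UCHAR_MAX, whatever the signedness of plain char.
inline bool is_alpha(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// The value actually passed on: 241 (0xF1) for '\xF1', whether plain char
// stores it as 241 or as -15.
inline int ctype_arg(char c) {
    return static_cast<unsigned char>(c);
}

// Illustrative check.
inline void check_is_alpha() {
    assert(ctype_arg('\xF1') == 0xF1);
    assert(is_alpha('a'));
    assert(!is_alpha('1'));
}
```

Whether is_alpha('\xF1') returns true still depends on the current C
locale; the cast only guarantees that the call is well defined.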

>|> That is a VERY good reason for making plain chars unsigned by default.

Precisely.  There really isn't any good reason to allow 'plain char' to
mean 'signed char' anyway.  (If you're really concerned about performance
on your PDP-11, and you're going to use only 7-bit characters, then go
ahead and declare all your characters as 'signed char'.)

> C++ inherits the behavior of isalpha from C.

Indeed it does.

In practice, the only way to guarantee correct, portable behavior is to
declare all characters and strings as 'unsigned char' and to live with the
fact that all the standard library functions (like strlen()) require a
cast to 'char *' (since none of the library functions accept 'unsigned
char *' args).  (Overloaded functions will probably help this situation
for C++, but not for poor old C.)

-- David R. Tribble, david.tribble@central.beasys.com --





Author: James Kanze <james-albert.kanze@vx.cit.alcatel.fr>
Date: 1997/01/28
James Kanze <james-albert.kanze@vx.cit.alcatel.fr> writes:

|>  All of which should lead to the use of Unicode; <ctype.h>
|>  defines no support for Unicode (but <locale> does).

It occurs to me that this sentence is highly misleading.  What I meant
to say was that if an implementation defines Unicode as the extended
character set used in wchar_t, then the implementation is required to
support it in the isxxx functions defined in <locale>.

I expect that in the immediate future, this will be the path most
quality implementations will take (with the exception, perhaps, of those
targeting specialized markets such as embedded systems).

--
James Kanze      home:     kanze@gabi-soft.fr        +33 (0)1 39 55 85 62
                 office:   kanze@vx.cit.alcatel.fr   +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
     -- Conseils en informatique industrielle --







Author: David R Tribble <david.tribble@central.beasys.com>
Date: 1997/01/24
> |>  How can you accomplish that task in a conforming manner now?

> Use <locale>, and not <cctype>.  Presumably, the isalpha, etc. functions
> in <locale> are defined over all possible values of charT.

No, <ctype> defines all the isxxx() functions, but <locale> does not.
It certainly doesn't redefine the functions; they are already supposed
to be aware of the LC_CTYPE locale settings.

> Note that a good implementation could make <cctype> work even with 8 bit
> signed characters.  This would entail defining EOF as -129, and
> accepting character values in the range of -128...-1 (as well as
> 128...255).  According to the standard, accessing one of the ctype
> functions with a value other than 0...UCHAR_MAX or EOF is undefined
> behavior, so an implementation is free to do something intelligent.

This solution is fine and conforming.

> The problem here is that there are probably more (broken) programs
> assuming EOF == -1 than there are assuming plain char is signed.  So the
> same motivations for making plain char signed (on a new architecture)
> tend to force EOF to -1.

Then those programs wouldn't be conforming.  This shouldn't be a concern
to the standard.

> For a real hack, that would, however, work in the real world, and not
> expose existing programs as broken: leave EOF as -1, define isalpha, and
> etc. over the range -128...255, with -1 acting as EOF, rather than 0xff,
> but all of the other negative values "as if" they were positive.  In ISO
> 8859-1, 0xff corresponds to a y with two dots over it.  Testing this
> character as a plain char would give wrong results (because it would be
> indistinguishable from EOF), but since this character isn't used in any
> known language, the "bug" is probably tolerable (especially if
> documented).

ISO-8859-1 (also known as Latin-1 ASCII) and Unicode define character code
0x00FF to be 'Latin small letter y with diaeresis', which is used in French.
(The two little dots are called an umlaut, so the letter could also be
called y-umlaut).  It wouldn't be in the standard ISO or Unicode character
sets if it wasn't used in some language.  Allowing it to be treated as EOF,
in effect ignoring it as a valid printable alphabetic character, is a mistake.

The ISO-8859-2, -3, and -4 (covering Czech, German, Hungarian, Polish,
Rumanian, Croatian, Slovak, Slovene, Esperanto, Galician, Maltese, Turkish,
Estonian, Latvian, and Lithuanian) define code 0xFF to be 'Dot above'
(Unicode U+02D9, I believe).  ISO-8859-5 (Cyrillic covering Bulgarian,
Byelorussian, Macedonian, Russian, Serbian and Ukrainian) defines 0xFF as
'Cyrillic small letter dzhe' (Unicode U+045F).  ISO-8859-9 (Icelandic
replacement for Latin-1) also defines 0xFF as y-umlaut.  ISO-8859-10
(adding Inuit, Greenlandic, Sami, and Lappish to Latin-4) defines 0xFF as
'Latin small letter kra' (Unicode U+0138).

Any solution to this problem must not be US-centric.  That's why I think
the best solution is to mandate that 'plain char' is 'unsigned char'.