Topic: Portability of numeric types
Author: muysvasovic@applelink.apple.com (Jean-Denis Muys-Vasovic)
Date: 15 Oct 91 19:40:08 GMT
In article <3802@lulea.telesoft.se>, jbn@lulea.telesoft.se (Johan Bengtsson) writes:
>
> jimad@microsoft.UUCP (Jim ADCOCK) writes:
> >
> > If one does not assume that chars are signed nor unsigned, and restricts
> > oneself to using the low seven bits, you should do pretty well in that
> > regard.
>
> I happen to be one of the many million people in the world
> whose language uses a larger alphabet than the ASCII 7-bit
> code (actually most/all non-english/american languages).
>
> _Do_ support the eighth bit, especially when programming
> specifically for machines that have 8-bit character sets.
>
[...]
>
I agree that supporting 7-bit characters is not enough. I also contend
that supporting 8-bit characters is not enough. Forthcoming standards
for character representation include both ISO 10646 (32 bits) and Unicode
(16 bits). I believe that ISO recently agreed to more or less adopt Unicode.
This means that shortly (how shortly?) we'll have to deal with (at least)
16-bit characters.
How will the transition take place for the C family of languages? Will we keep
the standard type char to mean what *everybody* assumes: a single byte, and add
a new (unichar?) type?
What is the safest way to behave now? Especially when you want a byte, what
do you declare? A char?
My point is that 8-bit characters are already obsolete. I would like to get
the net's opinion on how to deal with that, in a portable way of course.
disclaimer: I don't want to discuss the validity/desirability of either ISO or
Unicode. I'm also aware of locale-related issues. So the scope of my request
is indeed very narrow.
Regards and thanks to all.
Jean-Denis
PS: I crossposted to comp.XX groups that seemed relevant.
Author: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
Date: 16 Oct 91 23:57:00 GMT
muysvasovic@applelink.apple.com (Jean-Denis Muys-Vasovic) writes:
>
> My point is that 8-bit characters are already obsolete. I would like to
> get the net's opinion on how to deal with that, in a portable way of course.
>
> How will the transition take place for the C family of languages?
> Will we keep the standard type char to mean what *everybody* assumes:
> a single byte, and add a new (unichar?) type?
We discussed this recently at a Unicode developer's conference. It was
generally agreed that EUC (from AT&T's MNLS, now on SVr4) was not a good
solution. EUC involves a whole set of wide character library routines,
plus routines to convert variable multibyte file code into 32-bit wchar_t
process code, and back again. Wide character library routines have not
been standardized into POSIX or XPG.
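For the conversion step itself, ANSI C does at least provide mbstowcs() and
wcstombs(); here is a minimal sketch of turning multibyte file code into
wchar_t process code, assuming the user's locale describes the multibyte
encoding in use:

#include <stdio.h>
#include <stdlib.h>     // mbstowcs
#include <locale.h>     // setlocale

int main()
{
    char filecode[] = "example text";  // multibyte "file code", e.g. read from disk
    wchar_t process[64];               // fixed-width "process code"
    size_t n;

    setlocale(LC_CTYPE, "");           // pick up the user's multibyte encoding

    // convert variable-width multibyte text into wchar_t process code
    n = mbstowcs(process, filecode, 64);
    if (n == (size_t)-1) {
        fprintf(stderr, "invalid multibyte sequence\n");
        return 1;
    }
    printf("converted %lu wide characters\n", (unsigned long)n);
    return 0;
}

What is missing, as noted above, is a standardized set of routines for
actually working on the wchar_t strings once they are in process code.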
Somebody suggested, and it seems reasonable to me, that 16-bit Unicode
could be treated as:
typedef unsigned short uchar;
Then all the string handling functions could be recompiled after replacing
"char" with "uchar". This should work fine, although there's always the
thorny problem of how to make mixed environments containing both char and
uchar files work correctly and conveniently.
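As a rough sketch of that recompile-the-string-functions idea, here is what
a couple of them might look like after the substitution (ustrlen and ustrcpy
are hypothetical names, not existing library routines):

#include <stddef.h>     // size_t

typedef unsigned short uchar;   // 16-bit Unicode code unit

// strlen() with "char" replaced by "uchar"
size_t ustrlen(const uchar *s)
{
    const uchar *p = s;
    while (*p)
        ++p;
    return (size_t)(p - s);
}

// strcpy() with "char" replaced by "uchar"
uchar *ustrcpy(uchar *dst, const uchar *src)
{
    uchar *d = dst;
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}

Nothing here depends on the code units being characters at all; they are
just 16-bit integral values terminated by a zero.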
Probably char should continue to mean byte, for backward compatibility.
Author: jimad@microsoft.UUCP (Jim ADCOCK)
Date: 16 Oct 91 19:45:36 GMT
In article <58635@apple.Apple.COM> muysvasovic@applelink.apple.com (Jean-Denis Muys-Vasovic) writes:
|In article <3802@lulea.telesoft.se>, jbn@lulea.telesoft.se (Johan Bengtsson) writes:
|>
|> jimad@microsoft.UUCP (Jim ADCOCK) writes:
|> >
|> > If one does not assume that chars are signed nor unsigned, and restricts
|> > oneself to using the low seven bits, you should do pretty well in that
|> > regard.
|>
|> I happen to be one of the many million people in the world
|> whose language uses a larger alphabet than the ASCII 7-bit
|> code (actually most/all non-english/american languages).
|>
|> _Do_ support the eighth bit, especially when programming
|> specifically for machines that have 8-bit character sets.
I think you are confusing issues of code sets and 8th-bit masking
with issues of writing portable code.
My statement was that *if* one uses chars as *integral* values, and only
uses the low 7 bits of a char, then one has a reasonably portable
use of that char.
Conversely, if one is "forced" to program using chars not as integral
values, but rather as an implied enumerated type, with the implied
enumerated values drawn from one or another national character set,
then I'd think it's obvious that one has entered the non-portable domain.
Which isn't necessarily "good" nor "bad" -- it just isn't very portable --
as anyone who has been forced to address the issues of nationalizing
software well knows. It just shows how long some of us have been
programming in "C" when we fail to realize that a char used to store
a character constant isn't being used in an integral manner at all, but
rather as some kind of enumerated type. Consider sort-order issues, for
instance.
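As a tiny sketch of the sort-order point, compare a raw char-value
comparison with a locale-aware one (the ISO 8859-1 string and the "fr_FR"
locale name are only examples; what is installed varies by system):

#include <stdio.h>
#include <string.h>     // strcmp, strcoll
#include <locale.h>     // setlocale

int main()
{
    // "\351" is e-acute in ISO 8859-1: by raw code value it sorts after 'z',
    // but a locale-aware comparison should place "étude" before "zebra".
    const char *a = "\351tude";
    const char *b = "zebra";

    setlocale(LC_COLLATE, "fr_FR");  // example locale name; availability varies

    printf("strcmp : %d\n", strcmp(a, b));   // compares raw (unsigned) char values
    printf("strcoll: %d\n", strcoll(a, b));  // compares according to the locale
    return 0;
}

The raw comparison puts the accented word after "zebra"; the collating
comparison should not.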
I agree with the notion that software designed for international usage
ought to be heading towards Unicode and wide chars. But even with
Unicode and wide chars one has still only scratched the surface of
international portability, not to mention multilingual capability.
Author: UweKloss@lionbbs.han.de (Uwe Kloss)
Date: 19 Oct 91 08:04:35 GMT
In <58635@apple.Apple.COM>, Jean-Denis Muys-Vasovic writes:
> In article <3802@lulea.telesoft.se>, jbn@lulea.telesoft.se (Johan Bengtsson) writes:
> >
> > jimad@microsoft.UUCP (Jim ADCOCK) writes:
> > >
> > > If one does not assume that chars are signed nor unsigned, and restricts
> > > oneself to using the low seven bits, you should do pretty well in that
> > > regard.
> >
> > I happen to be one of the many million people in the world
> > whose language uses a larger alphabet than the ASCII 7-bit
> > code (actually most/all non-english/american languages).
Me too!
> > _Do_ support the eighth bit, especially when programming
> > specifically for machines that have 8-bit character sets.
Applause!
I understand that there are two different ways in which types get used:
    application specific (char, int, float, ... )
    system specific (bit, byte, word, ... )
In the first case data is (should be) used in a consistent way:
'char' really means char, 'int' really means int, and so on.
But there are applications where the second case applies,
where one accesses BYTES, not 'char's!
(This holds as long as bytes are the basic unit of your
hardware, which means that addresses are BYTE-addresses!)
Up to now this means:
typedef unsigned char byte;
Any reasonable standard should take care of that!
Because, if char will be 16 bits wide in the future,
HOW DO I ACCESS MY DEAR BYTES?
There has to be a standard way to do this, at least as
long as ADDRESSES are BYTE-addresses!
Uwe
--
Uwe Kloss UUCP: UweKloss@lionbbs.han.de
Fasanenstrasse 65 BTX: 0531336908 0001
3300 Braunschweig FIDO: not connected
Author: jimad@microsoft.UUCP (Jim ADCOCK)
Date: 21 Oct 91 23:06:54 GMT
In article <A0bdgmec@lionbbs.han.de> UweKloss@lionbbs.han.de writes:
| typedef unsigned char byte;
|
|Any reasonable standard should take care of that!
|Because, if char will be 16 bits wide in future,
|
| HOW DO I ACCESS MY DEAR BYTES?
|
|There has to be a standard way to do this, at least as
|long as ADDRESSES are BYTE-addresses!
Over the long long term, if programmers around the world move to
16-bit character representations such as Unicode, so that the
whole world can exchange data successfully, then perhaps hardware
manufacturers *might* start designing computers where 16-bits is the
smallest conveniently accessible chunk of memory. And maybe computers
that already don't conveniently support access to 8-bit chunks of memory
won't continue to "fake it" in software.
But until such a time, I think you can rest assured that calling an
unsigned char a byte will work on many, many systems -- just as long
as you treat that unsigned char as a byte and not a character.
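For instance, a minimal sketch of that byte-not-character usage -- the
dump routine is purely illustrative:

#include <stdio.h>
#include <stddef.h>

typedef unsigned char byte;     // the byte type Uwe asked for

// dump the storage of any object, one byte per address --
// this treats memory strictly as bytes, never as characters
void dump_bytes(const void *obj, size_t len)
{
    const byte *p = (const byte *)obj;
    size_t i;
    for (i = 0; i < len; ++i)
        printf("%02x ", p[i]);
    printf("\n");
}

int main()
{
    long x = 0x12345678L;
    dump_bytes(&x, sizeof x);   // byte order depends on the machine
    return 0;
}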
Over the intermediate time frame, what many people are talking about is
using a type called "wchar_t", which is defined in terms of some
other integral type. This is similar to how "size_t" is defined in terms
of some other integral type. Neither is a unique type of its own.
For example, some systems might in effect do a:
typedef unsigned short wchar_t;
Those systems might choose to use Unicode -- rather than ASCII, EBCDIC, or
something else -- to define wide character constants:
wchar_t wideA = L'A'; // wideA = 0x0041
wchar_t* wideABC = L"ABC"; // points at 0x0041, 0x0042, 0x0043, 0x0000
-- the real advantage of Unicode obviously is not in representing the
traditional ASCII character set, but in being able to simply and cleanly
represent all the world's commonly used characters. For example, Unicode
0x3291 is the circled CJK ideograph STOCK meaning, in loose terms,
"incorporated." Thus, on a Unicode-compatible system, doing something like:
wchar_t stock = (wchar_t)0x3291;
unicodeOut << stock;
should result in the STOCK ideograph being displayed on the Unicode-compatible
console.
This of course points out that ideally you'd want wchar_t to be its own
type, so that:
cout << stock;
would also result in an ideograph being displayed, and not the unsigned
decimal numerical constant equivalent to 0x3291. So it would seem that
the present definition of wchar_t is less than desirable.
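One way to get the character-style behavior with the language as it stands,
since wchar_t is only a typedef, is to wrap the representation in a small
class so that overload resolution can pick a separate inserter. A rough
sketch follows; the WideChar class and its "U+..." output stub are purely
illustrative, not an existing facility:

#include <iostream>     // 1991-era code would spell this <iostream.h>
using std::ostream;
using std::cout;
using std::hex;
using std::dec;

// a distinct wrapper type: overload resolution can now pick a
// character-style output path instead of the integer one
class WideChar {
public:
    WideChar(unsigned short c) : code(c) { }
    unsigned short value() const { return code; }
private:
    unsigned short code;
};

ostream& operator<<(ostream& os, WideChar c)
{
    // hypothetical: a real Unicode-aware stream would transcode and draw
    // the glyph here; this stub just shows that the overload is selected
    return os << "U+" << hex << c.value() << dec;
}

int main()
{
    WideChar stock = 0x3291;    // circled CJK ideograph STOCK
    cout << stock << "\n";      // prints U+3291, not the decimal 12945
    return 0;
}

A built-in, distinct wchar_t type would make a wrapper like this
unnecessary.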