Thread

Topic: Guarantees about the encoding of 'A'

Author: kristian <kristian.spangsege@gmail.com>
Date: Wed, 5 Jan 2011 13:35:13 CST

Am I right that the C++ standard (C++03) does not guarantee any of the
following:

1) The multi-byte character encoding of the glyph 'A' in the classic C
locale has a value equal to 65.

2) The wide character encoding of the glyph 'A' in the classic C
locale has a value equal to 65.

3) The multi-byte and wide character encodings of the glyph 'A' in the
classic C locale have equal values.


Note: By "the glyph 'A'" I mean the Latin capital letter 'A', and not
some other glyph that looks like 'A'. It is the 'A' that occurs in the
basic source character set.

Note: The multi-byte character encoding of 'A' is guaranteed by the
C++ standard to use a single byte, so we can talk meaningfully about
the value of the encoding.


--
[ comp.std.c++ is moderated.  To submit articles, try posting with your ]
[ newsreader.  If that fails, use mailto:std-cpp-submit@vandevoorde.com ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "Bo Persson" <bop@gmb.dk>
Date: Thu, 6 Jan 2011 10:58:42 CST

kristian wrote:
> Am I right that the C++ standard (C++03) does not guarantee any of
> the following:
>
> 1) The multi-byte character encoding of the glyph 'A' in the
> classic C locale has a value equal to 65.
>
> 2) The wide character encoding of the glyph 'A' in the classic C
> locale has a value equal to 65.
>
> 3) The multi-byte and wide character encodings of the glyph 'A' in
> the classic C locale have equal values.

The language standard says almost nothing about the encoding of
characters, because it basically has to follow the underlying machine
representation. (It does require a few things, such as the digits '0'
through '9' having consecutive values, but nothing about the value of
'A'.) For example, on an IBM mainframe the narrow character set is
likely to use an EBCDIC encoding.

So, you are right - the language standard contains no guarantees about
the representation of 'A'.

>
> Note: By "the glyph 'A'" I mean the Latin capital letter 'A', and
> not some other glyph that looks like 'A'. It is the 'A' that occurs
> in the basic source character set.
>
> Note: The multi-byte character encoding of 'A' is guaranteed by the
> C++ standard to use a single byte, so we can talk meaningfully
> about the value of the encoding.

The narrow character type (char) is one byte wide in C++, by
definition. However, the size of this byte isn't fixed at 8 bits!
(Look for CHAR_BIT in <climits> for the implementation-defined byte
size; it is at least 8.)

Bo Persson


Author: kristian <kristian.spangsege@gmail.com>
Date: Tue, 11 Jan 2011 13:14:47 CST

Thanks a million for the answer! Now I can move on in my quest to
understand the ideas behind the character encoding aspects of the
standard.

Do you also happen to know whether the standard requires:

1) That the (multi-byte) encoding of 'A' in the narrow execution
character set of the "C" locale has the same value as the encoding of
'A' in the wide execution character set of the "C" locale?

2) That the (multi-byte) encoding of 'A' is the same across the narrow
execution character sets of all locales available for a specific
implementation?

3) That the encoding of 'A' is the same across the wide execution
character sets of all locales available for a specific implementation?

(I'm trying to phrase my questions very precisely - I hope it doesn't
obfuscate the meaning too much)


Kristian



Author: "Bo Persson" <bop@gmb.dk>
Date: Tue, 11 Jan 2011 15:20:42 CST

kristian wrote:
> Thanks a million for the answer! Now I can move on in my quest to
> understand the ideas behind the character encoding aspects of the
> standard.
>
> Do you also happen to know whether the standard requires:
>
> 1) That the (multi-byte) encoding of 'A' in the narrow execution
> character set of the "C" locale has the same value as the encoding
> of 'A' in the wide execution character set of the "C" locale?

Both the character sets and the locales are implementation-defined, so
there is no general answer.

If the narrow character set is ASCII and the wide character set is some
kind of Unicode encoding, I believe the values will match. If the narrow
character set is EBCDIC and the wide one is Unicode, they will not.

>
> 2) That the (multi-byte) encoding of 'A' is the same across the
> narrow execution character sets of all locales available for a
> specific implementation?
>
> 3) That the encoding of 'A' is the same across the wide execution
> character sets of all locales available for a specific
> implementation?

Which locales are present, and what they look like, is all
implementation-defined (as are the sizes of the narrow and wide
character types), so we just don't know.


Bo Persson


