Topic: No type for UTF-8 in WG21 proposal N2018
Author: sure@europa.com
Date: Thu, 21 Dec 2006 14:47:12 CST
I'm curious why the most current Unicode support paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html
includes type definitions for UTF-16 and UTF-32 (_Char16_t and
_Char32_t respectively), but no such definition for UTF-8.
I'm a Unicode newbie, so maybe the answer seems obvious to everyone
else. Personally, I'm working on an application with lots of legacy
ASCII support. We're looking at migrating the application to Unicode
piecemeal, moving a few interfaces at a time. I think it's a great
idea to have specific types for UTF-16 and UTF-32. I'm just puzzled
why UTF-8 was excluded from the type naming. It seems to me, perhaps
naively, that it would be very useful to be able to distinguish between
legacy ASCII interfaces and UTF-8 interfaces by type.
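For what it's worth, here is a rough sketch of the kind of distinction I
have in mind, done today with a thin wrapper type so that overload
resolution keeps UTF-8 data apart from the legacy char interfaces. The
names (utf8_string, send) are purely illustrative, not anything from the
paper:

    // Hypothetical sketch: wrap UTF-8 data in a distinct type so the
    // compiler separates it from legacy char-based ASCII interfaces.
    #include <iostream>
    #include <string>

    struct utf8_string {
        std::string bytes;   // UTF-8 code units, stored in plain char
    };

    void send(const std::string& ascii) { std::cout << "ASCII: " << ascii << '\n'; }
    void send(const utf8_string& u8)    { std::cout << "UTF-8: " << u8.bytes << '\n'; }

    int main() {
        std::string legacy = "plain ASCII";
        utf8_string modern = { "caf\xC3\xA9" };   // U+00E9 encoded as 0xC3 0xA9
        send(legacy);    // picks the ASCII overload
        send(modern);    // picks the UTF-8 overload
        return 0;
    }

A dedicated UTF-8 type in the language would give the same separation
without every project inventing its own wrapper.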
The UTF-8 encoding looks useful for a number of reasons. For example,
the writeup at
http://www.cl.cam.ac.uk/~mgk25/unicode.html
seems very persuasive.
Thanks for your thoughts.
Scott Schurr
Author: ben-public-nospam@decadentplace.org.uk (Ben Hutchings)
Date: Sun, 24 Dec 2006 17:44:21 GMT
On 2006-12-21, sure@europa.com <sure@europa.com> wrote:
> I'm curious why the most current Unicode support paper,
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html
>
> includes type definitions for UTF-16 and UTF-32 (_Char16_t and
> _Char32_t respectively), but no such definition for UTF-8.
These new types are intended for UTF-16 and UTF-32, but they are not
required to be used for those encodings. char16_t could, for example, be
used for the UCS-2 encoding or for one of the JIS encodings (not Shift-JIS!).
They fill the gaps that otherwise would exist in the following table:
Min.     Character code   Signed integer   Unsigned integer
width    type             type             type
8-bit    char             int_least8_t     uint_least8_t
                          signed char      unsigned char
16-bit   char16_t         int_least16_t    uint_least16_t
                          signed short     unsigned short
32-bit   char32_t         int_least32_t    uint_least32_t
                          signed long      unsigned long
As you can see, UTF-8 is already covered by char.
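To make that concrete (my own illustration, not something from N2018):
UTF-8 code units fit in plain char, so existing char-based containers such
as std::string already serve as UTF-8 containers, and the multi-byte
structure is easy to walk by hand:

    // Rough illustration: UTF-8 stored in std::string.  Code points are
    // counted by skipping continuation bytes (bit pattern 10xxxxxx).
    #include <cstddef>
    #include <iostream>
    #include <string>

    std::size_t count_code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        for (std::string::size_type i = 0; i < utf8.size(); ++i)
            if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
                ++n;
        return n;
    }

    int main()
    {
        std::string s = "caf\xC3\xA9";          // "café": U+00E9 is 0xC3 0xA9
        std::cout << s.size() << " bytes, "     // prints 5
                  << count_code_points(s) << " code points\n";   // prints 4
        return 0;
    }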
<snip>
> It seems to me, perhaps naively, that it would be very useful to be
> able to distinguish between legacy ASCII interfaces and UTF-8
> interfaces by type.
It might be, but there are a huge number of 7- and 8-bit encodings that
can end up in char strings. So why should the language single out UTF-8,
distinguishing only between UTF-8 and not-UTF-8?
Ben.
--
Ben Hutchings
Every program is either trivial or else contains at least one bug