Topic: No type for UTF-8 in WG21 proposal N2018


Author: sure@europa.com
Date: Thu, 21 Dec 2006 14:47:12 CST
I'm curious why the most current Unicode support paper,

  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html

includes type definitions for UTF-16 and UTF-32 (_Char16_t and
_Char32_t respectively), but no such definition for UTF-8.

I'm a Unicode newbie, so maybe the answer seems obvious to everyone
else.  Personally, I'm working on an application with lots of legacy
ASCII support.  We're looking at migrating the application to Unicode
piecemeal, moving a few interfaces at a time.  I think it's a great
idea to have specific types for UTF-16 and UTF-32.  I'm just puzzled
why UTF-8 was excluded from the type naming.  It seems to me, perhaps
naively, that it would be very useful to be able to distinguish between
legacy ASCII interfaces and UTF-8 interfaces by type.
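To make this concrete, here is a rough sketch of what I have in mind
(my own code, nothing from N2018, and utf8_string is just a name I made
up for illustration).  A thin wrapper class lets the encoding show up
in an interface's type, so a plain string can't be passed where UTF-8
is expected without an explicit conversion:

  #include <iostream>
  #include <string>

  // Hypothetical wrapper: a std::string whose contents are promised
  // to be UTF-8.
  struct utf8_string {
      explicit utf8_string(const std::string& bytes) : data(bytes) {}
      std::string data;   // UTF-8 code units, one per char
  };

  // Legacy interface: expects plain ASCII in a std::string.
  void show_ascii(const std::string& s) { std::cout << s << '\n'; }

  // Migrated interface: the parameter type documents the encoding.
  void show_utf8(const utf8_string& s) { std::cout << s.data << '\n'; }

  int main() {
      show_ascii("plain ASCII");
      show_utf8(utf8_string("caf\xC3\xA9"));  // "café" as UTF-8 bytes
      // show_utf8("caf\xC3\xA9");   // error: no implicit conversion
      return 0;
  }

I would expect a dedicated character type in the language to make this
kind of distinction cheaper and more uniform than a hand-rolled wrapper.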

The UTF-8 encoding looks useful for a number of reasons.  For example,
the writeup at

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

seems very persuasive.

Thanks for your thoughts.

Scott Schurr



Author: ben-public-nospam@decadentplace.org.uk (Ben Hutchings)
Date: Sun, 24 Dec 2006 17:44:21 GMT
On 2006-12-21, sure@europa.com <sure@europa.com> wrote:
> I'm curious why the most current Unicode support paper,
>
>   http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html
>
> includes type definitions for UTF-16 and UTF-32 (_Char16_t and
> _Char32_t respectively), but no such definition for UTF-8.

These new types are intended for use with UTF-16 and UTF-32, but they
are not required to be used for those encodings.  char16_t could, for
example, be used for UCS-2 or for one of the JIS encodings (not Shift-JIS!).
They fill the gaps that otherwise would exist in the following table:

Min.    Character code  Signed integer  Unsigned integer
width   type            type            type

8-bit   char            int_least8_t    uint_least8_t
                        signed char     unsigned char
16-bit  char16_t        int_least16_t   uint_least16_t
                        signed short    unsigned short
32-bit  char32_t        int_least32_t   uint_least32_t
                        signed long     unsigned long

As you can see, UTF-8 is already covered by char.
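To illustrate (this assumes a compiler that already provides the
proposed types; with today's compilers you would have to typedef them
to suitable integer types), here is the same character, U+00E9 LATIN
SMALL LETTER E WITH ACUTE, stored in each row's character code type:

  // U+00E9 in each Unicode encoding form.
  const char     e_utf8[]  = { '\xC3', '\xA9', 0 };  // UTF-8: two code units
  const char16_t e_utf16[] = { 0x00E9, 0 };          // UTF-16: one code unit
  const char32_t e_utf32[] = { 0x00E9, 0 };          // UTF-32: one code unit

  int main() { return 0; }

The first line needs nothing new from the core language: any char array
or std::string can hold UTF-8 code units today.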

<snip>
> It seems to me, perhaps naively, that it would be very useful to be
> able to distinguish between legacy ASCII interfaces and UTF-8
> interfaces by type.

It might be, but there are a huge number of 7- and 8-bit encodings that
can appear in char strings.  So why should the language single out only
the distinction between UTF-8 and not-UTF-8?

Ben.

--
Ben Hutchings
Every program is either trivial or else contains at least one bug
