Thread

Topic: Issues about N2401 (Code Conversion Facets)

Author: Alberto Ganesh Barbati <AlbertoBarbati@libero.it>
Date: Wed, 19 Sep 2007 21:09:22 CST

Hi Everybody,

(for reference, this is about N2401
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2401.htm)

-- Issue #1: Maxcode

The Maxcode template parameter has, in practice, only two reasonable
values: 0xffff (for applications supporting the BMP only) and
0x10ffff (for applications supporting the entire Unicode range). It's
very hard to believe that an application would use any other value for
Maxcode. Is there really an intent to support other values?
If not, why not enforce that by using an enum instead of an unsigned long,
such as:

enum codecvt_maxcode { codecvt_bmp, codecvt_full };

template<class Elem,
 codecvt_maxcode Maxcode = codecvt_full,
 codecvt_mode Mode = (codecvt_mode)0>
 class codecvt_utf8 {...};

Alternatively, we could merge the two enums and get rid of one parameter:

enum codecvt_mode {
 restrict_to_bmp = 8,
 consume_header = 4,
 generate_header = 2,
 little_endian = 1};

template<class Elem,
 codecvt_mode Mode = (codecvt_mode)0>
 class codecvt_utf8 {...};
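
A minimal sketch of how the merged enum could drive the code point limit, assuming C++11 constexpr. The namespace merged, the operator| overload and the helper max_code are hypothetical names introduced only for illustration; none of them come from N2401.

```cpp
#include <cassert>

namespace merged {

// Merged mode flags, copied from the suggestion above.
enum codecvt_mode {
    restrict_to_bmp = 8,
    consume_header  = 4,
    generate_header = 2,
    little_endian   = 1
};

// Hypothetical helper: combine flags without an explicit cast at the call site.
constexpr codecvt_mode operator|(codecvt_mode a, codecvt_mode b) {
    return static_cast<codecvt_mode>(static_cast<int>(a) | static_cast<int>(b));
}

// Hypothetical helper: largest code point a facet with this mode would accept.
constexpr unsigned long max_code(codecvt_mode mode) {
    return (mode & restrict_to_bmp) ? 0xffffUL : 0x10ffffUL;
}

// Without restrict_to_bmp the whole Unicode range is allowed.
static_assert(max_code(static_cast<codecvt_mode>(0)) == 0x10ffffUL,
              "full range by default");
// With restrict_to_bmp the limit drops to the BMP, whatever else is set.
static_assert(max_code(restrict_to_bmp | little_endian) == 0xffffUL,
              "BMP only");

} // namespace merged
```

Note that this design can only express the two limits named above; any other Maxcode value would no longer be representable.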

-- Issue #2: endianness

The choice of big endian as the default is arbitrary and is
going to be very confusing for everyone working on little endian
machines. I can think of three possible ways to overcome that:

1) add a new enumerator that specifies the native endianness and use that
as the default:

enum codecvt_mode {
 restrict_to_bmp = 8,
 consume_header = 4,
 generate_header = 2,
 little_endian = 1,
 native_endianness = /* implementation defined: either 0 or 1 */
};

template<class Elem,
 codecvt_mode Mode = native_endianness>
 class codecvt_utf8 {...};

This solution has the drawback that the user might forget to add
native_endianness when specifying another enumerator, for example
consume_header.
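
The pitfall can be sketched as follows. This is only an illustration under stated assumptions: native_endianness is fixed to 1 (that is, the sketch pretends the platform is little endian), and facet_sketch is a hypothetical stand-in for the real facet that merely records which byte order was selected.

```cpp
#include <cassert>

namespace option1 {

enum codecvt_mode {
    restrict_to_bmp = 8,
    consume_header = 4,
    generate_header = 2,
    little_endian = 1,
    native_endianness = 1  // assumption: pretend the platform is little endian
};

// Hypothetical stand-in for the facet; it only exposes the selected byte order.
template <class Elem, codecvt_mode Mode = native_endianness>
struct facet_sketch {
    static constexpr bool is_little_endian = (Mode & little_endian) != 0;
};

// Default argument: native byte order, as intended.
typedef facet_sketch<wchar_t> defaulted;
// Naming one flag replaces the whole default, so the native bit is lost.
typedef facet_sketch<wchar_t, consume_header> forgot_native;
// The user has to remember to OR the native flag back in.
typedef facet_sketch<
    wchar_t,
    static_cast<codecvt_mode>(consume_header | native_endianness)> explicit_native;

static_assert(defaulted::is_little_endian, "default follows the platform");
static_assert(!forgot_native::is_little_endian, "silently falls back to big endian");
static_assert(explicit_native::is_little_endian, "native flag restored by hand");

} // namespace option1
```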

2) provide enum values for both endiannesses, with the default matching
the platform's native endianness:

enum codecvt_mode {
 restrict_to_bmp = 8,
 consume_header = 4,
 generate_header = 2,
 big_endian = /* implementation defined: either 0 or 1 */,
 little_endian = /* implementation defined: either 1 or 0 */
};

The only drawback of this solution is that having the same symbol
codecvt_utf8<T,0> refer to either big or little endianness might be
a problem in libraries.

3) add a template parameter:

enum codecvt_mode {
 restrict_to_bmp = 8,
 consume_header = 4,
 generate_header = 2
};

enum codecvt_endianness {
 little_endian,
 big_endian,
 native = /* implementation defined: either little_endian or big_endian */
};

template<class Elem,
 codecvt_mode Mode = (codecvt_mode)0,
 codecvt_endianness Endian = native>
 class codecvt_utf8 {...};

This solution has none of the previous drawbacks but has... ehr...
one more parameter.
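
A rough sketch of option 3, to show that another Mode flag no longer disturbs the byte order default. The way native is chosen here is an assumption: it relies on the __BYTE_ORDER__ predefined macro of GCC and Clang and falls back to little endian elsewhere, whereas a real implementation would simply know its target. facet_sketch and runtime_is_little_endian are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstring>

namespace option3 {

enum codecvt_mode {
    restrict_to_bmp = 8,
    consume_header = 4,
    generate_header = 2
};

enum codecvt_endianness {
    little_endian,
    big_endian,
    // Assumption: compiler-specific detection, little endian as fallback.
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    native = big_endian
#else
    native = little_endian
#endif
};

// Hypothetical stand-in for the facet; it only records its parameters.
template <class Elem,
          codecvt_mode Mode = static_cast<codecvt_mode>(0),
          codecvt_endianness Endian = native>
struct facet_sketch {
    static constexpr codecvt_mode mode = Mode;
    static constexpr codecvt_endianness endian = Endian;
};

// Specifying a mode flag leaves the endianness default untouched.
static_assert(facet_sketch<wchar_t, consume_header>::endian == native,
              "mode and byte order are independent");

// Run-time probe, useful to cross-check the compile-time choice of native.
inline bool runtime_is_little_endian() {
    const unsigned int one = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &one, 1);
    return first_byte == 1;
}

} // namespace option3
```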

-- Issue #3: UTF-8 encoding clarification

The paper states the intent to provide support for Unicode, but when
describing UTF-8 encoding it refers to UCS2 and UCS4, which are encoding
forms that are part of ISO 10646 and are *not* part of Unicode. This is
not just nit-picking, because ISO 10646 defines UTF-8 in a slightly
different way than Unicode, so it's not clear which of the two
definitions the paper is referring to. For example, in Unicode the
so-called "non-shortest" sequences, as well as all sequences that would
refer to surrogate code points or to non-characters, are invalid UTF-8
sequences, while they are valid in ISO 10646. What exactly is the
intent of the paper? This point is very important, IMHO. Mishandling
non-shortest forms is considered a security issue (see
http://unicode.org/reports/tr36/), so the library should at least handle
those, but I would suggest we do it right and support the whole Unicode
semantics.
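
To make the difference concrete, here is a minimal sketch of a strict validity check. It is not taken from N2401 or from any implementation; is_strict_utf8 is a hypothetical name. The sketch rejects non-shortest forms, encoded surrogates, values above 0x10ffff and the 5- and 6-byte forms; it makes no attempt to deal with non-characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace utf8_sketch {

// Returns true if s consists only of shortest-form UTF-8 sequences that
// encode code points in [0, 0x10FFFF], excluding the surrogate range.
inline bool is_strict_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;       // total bytes in this sequence
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point that needs len bytes
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or 5/6-byte lead byte

        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // non-shortest form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += len;
    }
    return true;
}

} // namespace utf8_sketch
```

For instance, the two-byte sequence C0 80 (a non-shortest encoding of U+0000) and the three-byte sequence ED A0 80 (the surrogate U+D800) are both rejected by this check, while a decoder that only follows the bit layout would accept them.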

Just my two eurocents,

Ganesh

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: pjp@plauger.com ("P.J. Plauger")
Date: Thu, 20 Sep 2007 14:40:27 GMT

"Alberto Ganesh Barbati" <AlbertoBarbati@libero.it> wrote in message
news:HjiIi.118527$U01.966046@twister1.libero.it...
> Hi Everybody,
>
> (for reference, this is about N2401
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2401.htm)
>
> -- Issue #1: Maxcode
>
> The Maxcode template parameter has practically only two reasonable
> values, that are 0xffff (for applications supporting the BMP only) and
> 0x10ffff (for applications supporting the entire Unicode range). It's
> very hard to believe that an application would use any other value for
> Maxcode. Is there really the intent to provide support for other values?

Yes. We've had occasion to use 0x7fffffff, and even 0xffffffff [sic].

> If not, why not enforce that, using an enum instead of an unsigned long,
> such as:
>
> enum codecvt_maxcode { codecvt_bmp, codecvt_full };
>
> template<class Elem,
> codecvt_maxcode Maxcode = codecvt_full,
> codecvt_mode Mode = (codecvt_mode)0>
> class codecvt_utf8 {...};
>
> Alternatively, we could merge the two enums and get rid of one parameter:
>
> enum codecvt_mode {
> restrict_to_bmp = 8,
> consume_header = 4,
> generate_header = 2,
> little_endian = 1};
>
> template<class Elem,
> codecvt_mode Mode = (codecvt_mode)0>
> class codecvt_utf8 {...};

All interesting redesigns. I've proposed codifying existing practice,
however revolutionary that may appear these days.

> -- Issue #2: endianness
>
> The choice to make big endianness as the default is arbitrary and is
> going to be very confusing for all people working on little endian
> machines.

And the choice of little endianness as the default would be arbitrary
and might be confusing to people working on big endian machines.

>             I can think of three possible suggestions to overcome that:
>
> 1) add a new enumerator that specify the native endianness and use that
> as the default:
>
> enum codecvt_mode {
> restrict_to_bmp = 8,
> consume_header = 4,
> generate_header = 2,
> little_endian = 1,
> native_endianness = /* implementation defined: either 0 or 1 */
> };
>
> template<class Elem,
> codecvt_mode Mode = native_endianness>
> class codecvt_utf8 {...};
>
> This solution has the inconvenient that the user might forget to add
> native_endianness when specifying another enumerator, for example
> consume_header.
>
> 2) provide two enum values for both endianness with the default matching
> the platform's native endianness:
>
> enum codecvt_mode {
> restrict_to_bmp = 8,
> consume_header = 4,
> generate_header = 2,
> big_endian = /* implementation defined: either 0 or 1 */,
> little_endian = /* implementation defined: either 1 or 0 */
> };
>
> This solution has the only inconvenience that having the same symbol
> codecvt_utf8<T,0> refer to either the big or little endianness might be
> a problem in libraries.
>
> 3) add a template parameter:
>
> enum codecvt_mode {
> restrict_to_bmp = 8,
> consume_header = 4,
> generate_header = 2,
> };
>
> enum codecvt_endianness {
> little_endian,
> big_endian,
> native = /* implementation defined: either little_endian or big_endian */
> };
>
> template<class Elem,
> codecvt_mode Mode = (codecvt_mode)0,
> codecvt_endianness Endian = native>
> class codecvt_utf8 {...};
>
> This solution has none of the previous inconveniences but has... ehr...
> one more parameter.

All interesting redesigns. I've proposed codifying existing practice,
however revolutionary that may appear these days.

> -- Issue #3: UTF-8 encoding clarification
>
> The paper states the intent to provide support for Unicode, but when
> describing UTF-8 encoding it refers to UCS2 and UCS4 which are encoding
> forms that are part of ISO 10646 and are *not* part of Unicode. This is
> not just nit-picking, because ISO 10646 defines UTF-8 in a slightly
> different way than Unicode, so it's not clear which of the two
> definitions is the paper referring to. For example, in Unicode the
> so-called "non-shortest" sequences, as well as all sequences that would
> refer to surrogate code points or to non-characters are invalid UTF-8
> sequences, while they are valid in ISO 10646. Which is exactly the
> intent of the paper? This point is very important, IMHO. Mis-handling
> non-shortest forms is considered a security issue (see
> http://unicode.org/reports/tr36/) so the library should at least handle
> those, but I would suggest we do it right and support the whole Unicode
> semantic.

I actually favor the ISO 10646 formalism, and implicitly did so in this
proposal (and the implementation on which it's based). I'll raise the
issue next week about changing the terms, to UTF-16 and UTF-32
I assume, but I think that an ISO committee should favor ISO
standards.

As for the security issue, and its purported fix in Unicode, I observe
that more computing sins are committed these days in the name of
improving security, without necessarily achieving it, than for most
other reasons, including blind stupidity. (With apologies to Bill Wulf.)

> Just my two eurocents,

About USD 0.028 these days (sigh).

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


Author: AlbertoBarbati@libero.it (Alberto Ganesh Barbati)
Date: Thu, 20 Sep 2007 22:16:43 GMT

P.J. Plauger wrote:
> "Alberto Ganesh Barbati" <AlbertoBarbati@libero.it> wrote in message
> news:HjiIi.118527$U01.966046@twister1.libero.it...
>> Hi Everybody,
>>
>> (for reference, this is about N2401
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2401.htm)
>>
>> -- Issue #1: Maxcode
>>
>> The Maxcode template parameter has practically only two reasonable
>> values, that are 0xffff (for applications supporting the BMP only) and
>> 0x10ffff (for applications supporting the entire Unicode range). It's
>> very hard to believe that an application would use any other value for
>> Maxcode. Is there really the intent to provide support for other values?
>
> Yes. We've had occasion to use 0x7fffffff, and even 0xffffffff [sic].

I understand. However, we would no longer be speaking of Unicode in
that case... (see below...)

>> -- Issue #2: endianness
>>
>> The choice to make big endianness as the default is arbitrary and is
>> going to be very confusing for all people working on little endian
>> machines.
>
> And the choice of little endianness as the default would be arbitrary
> and might be confusing to people working on big endian machines.

Of course. That's why a concept of "native" endianness might need to be
introduced.

>> -- Issue #3: UTF-8 encoding clarification
>>
>> The paper states the intent to provide support for Unicode, but when
>> describing UTF-8 encoding it refers to UCS2 and UCS4 which are encoding
>> forms that are part of ISO 10646 and are *not* part of Unicode. This is
>> not just nit-picking, because ISO 10646 defines UTF-8 in a slightly
>> different way than Unicode, so it's not clear which of the two
>> definitions is the paper referring to. For example, in Unicode the
>> so-called "non-shortest" sequences, as well as all sequences that would
>> refer to surrogate code points or to non-characters are invalid UTF-8
>> sequences, while they are valid in ISO 10646. Which is exactly the
>> intent of the paper? This point is very important, IMHO. Mis-handling
>> non-shortest forms is considered a security issue (see
>> http://unicode.org/reports/tr36/) so the library should at least handle
>> those, but I would suggest we do it right and support the whole Unicode
>> semantic.
>
> I actually favor the ISO 10646 formalism, and implicitly did so in this
> proposal (and the implementation on which it's based). I'll raise the
> issue next week about changing the terms, to UTF-16 and UTF-32
> I assume, but I think that an ISO committee should favor ISO
> standards.

I understand. The committee in its "call for proposal" for TR2
explicitly mentioned Unicode; however, I see a rationale for supporting
ISO 10646 instead. I believe this is a major point to clarify, because
if we speak of Unicode then we should respect its semantics as much as
possible. People will expect that. If we don't want to support all the
semantics, or if we are just going to support ISO 10646 because it's an
ISO standard, we can (and should, IMHO) just say so explicitly and
avoid referring to Unicode completely. There is already so much
confusion about the two standards...

Just my opinion,

Ganesh
