Thread

Topic: UTF-8 literals with C++0x

Author: Mathias Gaunard <loufoque@gmail.com>
Date: Tue, 20 Jul 2010 20:25:41 CST

On Jul 19, 7:24 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
> Greetings.
>
> I have tried to understand the changes that C++0x makes with regard to
> UTF-8 support and the u8"" string literals. Somehow it strikes me that
> the u8 string literals are not all that helpful.

On the contrary, they are quite helpful. Regular string literals are
encoded in the narrow locale character set, while UTF-8 string
literals are encoded in UTF-8.

The compiler will convert whatever you input in the source from the
source character set to the narrow locale character set for narrow
string literals, to the wide locale character set for wide string
literals, to UTF-8 for UTF-8 literals, to UTF-16 for UTF-16 literals,
and to UTF-32 for UTF-32 literals.

For example, on Windows with MSVC and a UTF-8 source character set,
your UTF-8 string literals will be in UTF-8, your wide string literals
will be in UTF-16, and your narrow string literals will be in ANSI,
all being written the same way in the source code: directly in UTF-8,
without escape sequences.
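
To make the conversion concrete, here is a minimal sketch that counts the
code units each kind of literal holds for the single character U+00DC. The
helper names code_units and check_literal_sizes are made up for this
illustration. Only the three Unicode results are fixed; the narrow and wide
ones depend on the implementation's execution character sets.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00DC (LATIN CAPITAL LETTER U WITH DIAERESIS) in each literal form.
// The \u escape keeps this file pure ASCII, so the result does not
// depend on the source character set.
void check_literal_sizes() {
    assert(code_units(u8"\u00DC") == 2);  // UTF-8: bytes C3 9C
    assert(code_units(u"\u00DC") == 1);   // UTF-16: one unit, 00DC
    assert(code_units(U"\u00DC") == 1);   // UTF-32: one unit, 000000DC

    // Implementation-defined: typically 1 with an ISO-8859-1 or
    // Windows-1252 execution character set, 2 with a UTF-8 one.
    std::size_t narrow = code_units("\u00DC");
    std::size_t wide = code_units(L"\u00DC");
    (void)narrow;
    (void)wide;
}
```

The same character therefore occupies a different number of units depending
only on the prefix in front of the literal.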

--
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: "Martin B." <0xCDCDCDCD@gmx.at>
Date: Wed, 21 Jul 2010 11:40:30 CST

Mathias Gaunard wrote:

> On Jul 19, 7:24 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
>
>> Greetings.
>>
>> I have tried to understand the changes that C++0x makes with regard to
>> UTF-8 support and the u8"" string literals. Somehow it strikes me that
>> the u8 string literals are not all that helpful.
>>
>
> On the contrary, they are quite helpful. Regular string literals are
> encoded in the narrow locale character set, while UTF-8 string
> literals are encoded in UTF-8.
>
>
How is this "narrow locale character set" defined? (I quite frankly do not
understand the standard/FCD when it's talking about these character sets.)
Is the implementation free to choose whatever it wants?

> The compiler will convert whatever you input in the source from the
> source character set to the narrow locale character set for narrow
> string literals, to the wide locale character set for wide string
> literals, to UTF-8 for UTF-8 literals, to UTF-16 for UTF-16 literals,
> and to UTF-32 for UTF-32 literals.
>
>
And this was exactly my point. The compiler will, but it appears to me that
with narrow vs. UTF-8 there's no way to prevent typos in a western European
setting!?

> For example, on Windows with MSVC and an UTF-8 source character set
> your UTF-8 string literals will be in UTF-8, your wide string literals
> will be in UTF-16, and your narrow string literals will be in ANSI,
>

That confused me a bit. Didn't know that "ANSI" refers to Windows-1252 :-)

> all being written the same in direct UTF-8 without escape sequences in
> the source code.
>
>
And this is exactly the problem!
The developers *WILL* forget time after time to put u8 in front of the
string literals. Then at runtime you will have a Windows-1252 encoded
"array of n const char" which is invalid UTF-8.
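
The failure described here can be shown with a small validity check. This
is only a sketch: is_valid_utf8 and check_validator are hypothetical names,
and the bytes are spelled with \x escapes so that the example does not
depend on the source or execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, surrogates and values
// above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long min = 0;
        if (lead < 0x80) { ++i; continue; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;                      // stray continuation or invalid lead
        if (n - i < len) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += len;
    }
    return true;
}

void check_validator() {
    // What a forgotten u8 prefix yields with a Windows-1252 execution set:
    assert(!is_valid_utf8("ab\xDC" "efgh"));
    // What the u8 literal is required to contain for the same text:
    assert(is_valid_utf8("ab\xC3\x9C" "efgh"));
}
```

Both arguments have the same static type, so only a runtime check like this
one can tell them apart.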

Without the possibility to distinguish UTF-8 at the type level the u8"" is
just a little bit of syntactic sugar.

cheers,
Martin


Author: Mathias Gaunard <loufoque@gmail.com>
Date: Thu, 22 Jul 2010 12:34:45 CST

On Jul 21, 6:40 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:

> How is this "narrow locale character set" defined? (I quite frankly do not
> understand the standard/FCD when it's talking about these character sets.)
> Is the implementation free to choose whatever it wants?

I'm not certain this is the right term, but yes, it can be any character
set as long as it fulfills the basic criteria of containing all
"basic" characters and that 0-9 is contiguous. It doesn't even have to
be ASCII-compatible.
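
A short sketch of what that guarantee buys in practice (the function names
are made up for this illustration): digit arithmetic is portable, while the
same trick on letters is not, because nothing requires letters to be
contiguous.

```cpp
#include <cassert>

// Portable: '0'..'9' are guaranteed to be contiguous and increasing.
constexpr int digit_value(char c) { return c - '0'; }
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// NOT portable: letters carry no such guarantee. In EBCDIC, for
// instance, 'j' - 'i' is 8 rather than 1, so (c - 'a') is not an
// alphabet index there.

void check_digits() {
    assert(digit_value('7') == 7);
    assert(is_decimal_digit('0') && is_decimal_digit('9'));
    assert(!is_decimal_digit('a'));
}
```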

> That confused me a bit. Didn't know that "ANSI" refers to Windows-1252 :-)

It can be any code page. Windows-1252 is just the default value on
western copies of Windows.


Author: Jean-Marc Bourguet <jm@bourguet.org>
Date: Thu, 22 Jul 2010 12:38:51 CST

"Martin B." <0xCDCDCDCD@gmx.at> writes:

> Mathias Gaunard wrote:
>
>> On Jul 19, 7:24 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
>>
>>> Greetings.
>>>
>>> I have tried to understand the changes that C++0x makes with regard to
>>> UTF-8 support and the u8"" string literals. Somehow it strikes me that
>>> the u8 string literals are not all that helpful.
>>>
>>
>> On the contrary, they are quite helpful. Regular string literals are
>> encoded in the narrow locale character set, while UTF-8 string
>> literals are encoded in UTF-8.
>>
>>
> How is this "narrow locale character set" defined? (I quite frankly do not
> understand the standard/FCD when it's talking about these character sets.)

If we start with the Unicode model of character codes:

* Abstract Character Repertoire: the set of characters to be encoded, for
 example, some alphabet or symbol set

* Coded Character Set: a mapping from an abstract character repertoire to a
 set of nonnegative integers (called code points).

* Character Encoding Form: a mapping from a set of nonnegative integers
 that are elements of a CCS to a set of sequences of particular code units
 of some specified width, such as 32-bit integers

what the C++ standard calls a character set is a Character Encoding Form and
not a coded character set (though in places there could be some ambiguity).
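
To illustrate the difference between a code point and its encoding forms,
here is a sketch that maps one code point to UTF-8 and to UTF-16 code
units. The function names are hypothetical, and both functions assume the
caller passes a valid Unicode scalar value (not a surrogate, not above
U+10FFFF).

```cpp
#include <cassert>
#include <string>

// Encoding form with 8-bit code units. Precondition: cp is a valid
// Unicode scalar value.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encoding form with 16-bit code units. Same precondition.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // 20 bits remain, split into two surrogates
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}

void check_encoding_forms() {
    // Same code point U+00DC, two different unit sequences.
    assert(encode_utf8(0xDC) == "\xC3\x9C");
    assert(encode_utf16(0xDC).size() == 1);
}
```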

There is one Coded Character Set per locale in C++, with potentially two
Character Encoding Forms: the wide one, in which the code point is simply
put in a wchar_t as is, and the narrow one, which can be more complex
(state-dependent, with several units per character) and uses char as its
code unit.  There are some other constraints: code points for digits are
consecutive, and characters in the basic character set are represented as
a single char in the default shift state in the narrow encoding.  For
characters which are represented as a single char, I don't remember if in
C++ there must be a correspondence between the code point and the value of
the single char.  In C they must be identical unless a predefined macro
(__STDC_MB_MIGHT_NEQ_WC__ IIRC) is defined, and POSIX doesn't allow that.
AFAIK, this is mainly used on IBM mainframes to provide locales where the
narrow encoding is based on EBCDIC and the wide one is Unicode.

String literals are encoded in an implementation-defined character set
(for instance, gcc lets you specify it with the -fexec-charset and
-fwide-exec-charset options).
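
One way to observe the effect of such options is to dump the bytes a
narrow literal ends up with. This is a sketch only: hex_bytes is a made-up
helper, and the per-option results mentioned in the comments are what I
would expect from gcc, not something verified in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats the bytes of a narrow string as upper-case hex, e.g. "61 62".
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

void check_hex_bytes() {
    // Explicit byte, independent of any character set option.
    assert(hex_bytes("\xDC") == "DC");
    // hex_bytes("\u00DC") is the interesting call: it should give
    // "C3 9C" with a UTF-8 execution character set, and "DC" when
    // compiled with something like -fexec-charset=ISO-8859-1.
}
```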

Yours,

--
Jean-Marc


Author: James Kanze <james.kanze@gmail.com>
Date: Fri, 23 Jul 2010 12:55:20 CST

On Jul 22, 7:38 pm, Jean-Marc Bourguet <j...@bourguet.org> wrote:
> "Martin B." <0xCDCDC...@gmx.at> writes:

      [...]
> There is one Coded Character Set per locale in C++ with potentially two
> Character Encoding Forms: the wide one in which the code point is simply
> put in a wchar_t as it, the narrow one which can be more complex (state
> dependant and use of several units per character) and use char as code
> unit.

Not directly related to the original question, but I don't think
the standard guarantees that the code point will be simply put
into a wchar_t as is, and at least two important implementations
using Unicode (IBM's C++ compiler under AIX and Microsoft's VC++)
support surrogates in wchar_t, a more complex encoding than just
"as is".

--
James Kanze


Author: "Martin B." <0xCDCDCDCD@gmx.at>
Date: Fri, 23 Jul 2010 12:56:40 CST

Mathias Gaunard wrote:

> On Jul 21, 6:40 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
>
>> How is this "narrow locale character set" defined? [...]
>> Is the implementation free to choose whatever it wants?
>
> I'm not certain this is right term, but yes, it can be any character
> set as long as it fulfills the basic criteria [...]

Thanks for the update.

As to my snipped arguments:
Now what say you to the problem that C++0x defines two different string
literals ("" and u8"") that both will result in runtime strings of the same
static type but with different narrow character sets?
Yes I know that it's also possible at the moment, but at the moment it isn't
possible through a simple typo as it is with u8 literals!
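
The "same static type" point can be checked mechanically. A minimal sketch
follows; the helper names are made up, and foo_utf8 is a stand-in for any
API that expects UTF-8. Compilers in C++20 mode, where u8 literals are
based on char8_t, report a different answer, hence the feature-test macro.

```cpp
#include <cassert>
#include <type_traits>

// True when "" and u8"" literals have the same static type.
constexpr bool narrow_and_u8_share_type() {
    return std::is_same<decltype("ab"), decltype(u8"ab")>::value;
}

// Stand-in for an API documented as taking UTF-8.
inline bool foo_utf8(const char* text) { return text != nullptr; }

void check_same_type() {
#ifdef __cpp_char8_t
    assert(!narrow_and_u8_share_type());  // C++20 and later
#else
    assert(narrow_and_u8_share_type());   // C++0x drafts through C++17
#endif
    // Accepted without any diagnostic, although the byte DC on its own
    // is not valid UTF-8:
    assert(foo_utf8("ab\xDC" "efgh"));
}
```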

cheers,
Martin


Author: Jean-Marc Bourguet <jm@bourguet.org>
Date: Sun, 25 Jul 2010 00:18:53 CST

James Kanze <james.kanze@gmail.com> writes:

> On Jul 22, 7:38 pm, Jean-Marc Bourguet <j...@bourguet.org> wrote:
>> "Martin B." <0xCDCDC...@gmx.at> writes:
>
>       [...]
>> There is one Coded Character Set per locale in C++ with potentially two
>> Character Encoding Forms: the wide one in which the code point is simply
>> put in a wchar_t as it, the narrow one which can be more complex (state
>> dependant and use of several units per character) and use char as code
>> unit.
>
> Not directly related to the original question, but I don't think
> the standard guarantees that the code point will be simple put
> into a wchar_t as is, and at least two important implementations
> (IBM's C++ compiler under AIX and Microsoft's VC++) using
> Unicode support surrogates in wchar_t, a more complex encoding
> than just "as is".

How do you interpret 2.13.4/6 and 3.9.1/5 if you allow more than one
wchar_t per code point?

3.9.1/5
  Type wchar_t is a distinct type whose values can represent distinct
  codes for all members of the largest extended character set specified
  among the supported locales (22.1.1). Type wchar_t shall have the same
  size, signedness, and alignment requirements (3.9) as one of the other
  integral types, called its underlying type.

2.13.4/6
  ... The size of a wide string literal is the total number of escape
  sequences, universal-character-names, and other characters, plus one for
  the terminating L'\0'.  The size of a narrow string literal is the total
  number of escape sequences and other characters, plus at least one for
  the multibyte encoding of each universal-character-name, plus one for
  the terminating '\0'.

2.14.5/14, the equivalent in N3032 is:
  In a narrow string literal, a universal-character-name may map to more
  than one char element due to multibyte encoding. The size of a char32_t
  or wide string literal is the total number of escape sequences,
  universal-character-names, and other characters, plus one for the
  terminating U'\0' or L'\0'. The size of a char16_t string literal is the
  total number of escape sequences, universal-character-names, and other
  characters, plus one for each character requiring a surrogate pair, plus
  one for the terminating u'\0'. [ Note: The size of a char16_t string
  literal is the number of code units, not the number of characters.
  -- end note ] Within char32_t and char16_t literals, any
  universal-character-names shall be within the range 0x0 to 0x10FFFF.
  The size of a narrow string literal is the total number of escape
  sequences and other characters, plus at least one for the multibyte
  encoding of each universal-character-name, plus one for the
  terminating '\0'.

Note that wide strings are handled in the same way as char32_t ones, not
as char16_t ones.
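
The size rules quoted above can be checked directly with a character
outside the BMP. A sketch (array_size and check_literal_array_sizes are my
own names; the wide result is left open on purpose, since it is exactly the
implementation-specific point under discussion):

```cpp
#include <cassert>
#include <cstddef>

// Size of a string literal as an array, terminator included.
template <typename CharT, std::size_t N>
constexpr std::size_t array_size(const CharT (&)[N]) { return N; }

void check_literal_array_sizes() {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
    assert(array_size(U"\U0001D11E") == 2);  // one character + U'\0'
    assert(array_size(u"\U0001D11E") == 3);  // surrogate pair + u'\0'

    // By the quoted wording a wide literal should behave like the
    // char32_t one (2). An implementation with a 16-bit wchar_t that
    // stores surrogates would give 3 instead.
    const std::size_t wide = array_size(L"\U0001D11E");
    assert(wide == 2 || wide == 3);
}
```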

Not formally relevant, even less so for C++, but the rationale for the MSE
explains that this amendment to the C standard was introduced to allow
manipulating charsets needing a multibyte encoding as easily as those
having one byte per character, just by replacing char with wchar_t, int
with wint_t and narrow function calls with their wide equivalents.

So my understanding is that using a 16 bit wchar_t you have to limit
yourself to the BMP.

Yours,

--
Jean-Marc


Author: James Kanze <james.kanze@gmail.com>
Date: Mon, 26 Jul 2010 12:09:51 CST

On Jul 25, 7:18 am, Jean-Marc Bourguet <j...@bourguet.org> wrote:
> James Kanze <james.ka...@gmail.com> writes:
> > On Jul 22, 7:38 pm, Jean-Marc Bourguet <j...@bourguet.org> wrote:
> >> "Martin B." <0xCDCDC...@gmx.at> writes:

> >       [...]
> >> There is one Coded Character Set per locale in C++ with potentially two
> >> Character Encoding Forms: the wide one in which the code point is simply
> >> put in a wchar_t as it, the narrow one which can be more complex (state
> >> dependant and use of several units per character) and use char as code
> >> unit.

> > Not directly related to the original question, but I don't think
> > the standard guarantees that the code point will be simple put
> > into a wchar_t as is, and at least two important implementations
> > (IBM's C++ compiler under AIX and Microsoft's VC++) using
> > Unicode support surrogates in wchar_t, a more complex encoding
> > than just "as is".

> How do you interpret 2.13.4/6 and 3.9.1/5 if you allow more than one
> wchar_t per code point?

I don't:-).

I didn't verify anything in the standard.  I know that the
intent of wchar_t was to have only one wchar_t per code point.
I also know that Microsoft and IBM (AIX) compilers use a 16 bit
wchar_t, with full Unicode (including surrogates).  It looks
as though those paragraphs are about as relevant as export with
regard to actual implementations:-).

In practice, I'm not sure it makes any difference.  Even when
wchar_t is 32 bits, the two-wchar_t sequence \U000000E5,
\U00000301 is a single character.  And in a Swiss German locale,
toupper on the sequence \U000000E1, \U00000308 should return
\U000000C1, \U000000C5 (or \U000000C1, \U000000E5 if converting
to title case).  In other words, just applying the toupper
function in ctype<wchar_t> doesn't work any better with wchar_t
than with char.

I think that maybe the standard does need some improved wording
with regards to character handling, in general, but I'm not sure
what.  (I also think it very understandable that some
improvements might be desirable.  This is a domain where we
really don't know any right answers yet, and what we think are
the best answers is constantly evolving.)

      [...]
> Not formally relevant, even more so for C++, but the rational
> for the MSE which explains that the goal of that amendment to
> the C standard explain that it is introduced to allow to
> manipulate charsets needing a multibyte encoding as easily as
> those having a one byte per character, just replacing char by
> wchar_t, int by wint_t and narrow function calls by their
> equivalent wide one.

I'm aware of the rationale, and I think that the C standard did
an exceptionally good job, given our knowledge at the time.
That was more than 20 years ago, however, and our knowledge has
progressed.

> So my understanding is that using a 16 bit wchar_t you have to limit
> yourself to the BMP.

Or not be conforming (or not support Unicode).  I think that the
implementations using a 16 bit wchar_t and Unicode are probably
not conforming, given the passages you quote.  Realistically, I
think this non-conformance comes under the same heading as a
lack of support for export: it's intentional, and they have no
intention of changing.  In this case, unlike export, I rather
agree with their choice, or at least, understand the motivations
behind it.

--
James Kanze


Author: Timothy Madden <terminatorul@gmail.com>
Date: Mon, 26 Jul 2010 12:12:15 CST

Martin B. wrote:
>
> Mathias Gaunard wrote:
>
>> On Jul 21, 6:40 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
>>
>>> How is this "narrow locale character set" defined? [...]
>>> Is the implementation free to choose whatever it wants?
>>
>> I'm not certain this is right term, but yes, it can be any character
>> set as long as it fulfills the basic criteria [...]
>
> Thanks for the update.
>
> As to my snipped arguments:
> Now what say you to the problem that C++0x defines two different string
> literals ("" and u8"") that both will result in runtime strings of the same
> static type but with different narrow character sets?
> Yes I know that it's also possible at the moment, but at the moment it isn't
> possible through a simple typo as it is with u8 literals!

I think the question is: if we have different string literals for
strings with different character sets (execution narrow-character set
and UTF-8), should we also have different types for strings with
different character sets?

Or to put it a different way: if the UTF-8 charset deserves a special
syntax for UTF-8 string literals, does it deserve a special char type
too?

I think the intent in the language is that the character set for a
given string is an application-level issue, not a language issue. The
language just provides the type char * (yes the language also requires
the implementation to define an /execution narrow-character set/, but
it is not like I can see a hard requirement that each and every char *
in a program must by all means have that charset associated; it is
more like the /execution narrow-character set/ is what is assumed by
default when the string is output).

Now, with C++0x and u8, I see that this intent (a char * has whatever
charset the application thinks it has) gains an exception: the u8 string
literals have a specific charset associated, UTF-8, known and now
supported by the language.

It is true that UTF-8 is a special charset, rather important in
today's world, and it is true that it deserves this special treatment
from the language.

The question is: does it further deserve a special character type just for it?

Some people (most people?) might find a new character type for UTF-8
strings, like utf8_char (which would be the same as char, but
distinct) a big change in the language, as adding a new type is in
general a big change in the language.
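
As a rough idea of what such a type would give, here is a sketch that
emulates it in user code with a scoped enumeration. The names utf8_char and
encoding_of are hypothetical and nothing here is part of the C++0x draft;
C++20 later added a char8_t type along these lines.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical: same representation as unsigned char, distinct identity.
enum class utf8_char : unsigned char {};

static_assert(sizeof(utf8_char) == 1, "same size as char");
static_assert(!std::is_same<utf8_char, char>::value, "but a distinct type");
static_assert(!std::is_convertible<const char*, const utf8_char*>::value,
              "plain and UTF-8 strings do not mix silently");

// Overloads can now dispatch on the encoding.
inline const char* encoding_of(const char*) { return "execution narrow"; }
inline const char* encoding_of(const utf8_char*) { return "UTF-8"; }

void check_utf8_char() {
    // U+00DC as UTF-8 (C3 9C), followed by a terminator.
    const utf8_char text[] = { utf8_char(0xC3), utf8_char(0x9C), utf8_char(0) };
    assert(std::strcmp(encoding_of(text), "UTF-8") == 0);
    assert(std::strcmp(encoding_of("abc"), "execution narrow") == 0);
}
```

The emulation shows the type-level benefit only; literal syntax, library
specializations and conversions are the open issues.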

Personally I would not find this utf8_char that big a change, but
there are the issues of:
  - should the language also provide a conversion from char * strings
     to utf8_char * strings (not necessarily the reverse)?
  - should a codecvt facet be provided, to allow utf8_char * strings to
     be read from stdin and written to stdout?
  - should utf8_char functions and specializations be provided for all
     other library features? Like having std::string, std::utf8_string
     and std::wstring specializations of std::basic_string
  - should a utf8_char type also be added to the C language?

The nice and easy answer would be to not provide any conversions or
support, but I would not hurry to say this is also the good answer.
I, for one, would love to see this further considered, and I would say
UTF-8 is special enough to receive this special treatment in the
language.

Thank you,
Timothy Madden


Author: Mathias Gaunard <loufoque@gmail.com>
Date: Tue, 27 Jul 2010 17:07:13 CST

On Jul 26, 7:09 pm, James Kanze <james.ka...@gmail.com> wrote:

> I didn't verify anything in the standard.  I know that the
> intent of wchar_t was to have only one wchar_t per code point.
> I also know that Microsoft and IBM (AIX) compilers use a 16 bit
> wchar_t, with full Unicode (including surrogates).  It looks
> that those paragraphs are about as relevant as export with
> regards to actual implementations:-).
>
> In practice, I'm not sure it makes any difference.

I can give you one.
The way codecvt<wchar_t, char, mbstate_t> is used, at least by several
conforming implementations (including the one provided with MSVC),
cannot support a variable-width encoding on the wchar_t side, unless
you use hacks.



Author: Jean-Marc Bourguet <jm@bourguet.org>
Date: Tue, 27 Jul 2010 17:08:49 CST

James Kanze <james.kanze@gmail.com> writes:

> In practice, I'm not sure it makes any difference.  Even when
> wchar_t is 32 bits, the two wchar_t sequence \U000000E5,
> \U00000301 is a single character.  And in a SWiss German local,
> to upper on the sequence \U000000E1, \U00000308 should return
> \U000000C1. \U000000C5 (or \U000000C1, \U000000E5 if converting
> to title case).  In other words, just applying the toupper
> function in ctype<wchar_t> doesn't work any better with wchar_t
> than with char.

I agree that it makes little difference in practice.  The complication
comes from combining characters.  The difference between a character
encoded in multiple units and a sequence of combining characters is more
a question of definition than of practical effects.
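
A small sketch of the two situations side by side (check_combining is a
made-up name). In both cases one user-perceived character occupies two
elements of the string, whether the cause is a combining mark or a
surrogate pair.

```cpp
#include <cassert>
#include <string>

void check_combining() {
    // LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE, two spellings:
    const std::u32string precomposed = U"\u01FB";
    const std::u32string combining = U"\u00E5\u0301";
    assert(precomposed.size() == 1);
    assert(combining.size() == 2);     // one character, two code points
    assert(precomposed != combining);  // no normalization is applied

    // A non-BMP character in UTF-16: one code point, two code units.
    const std::u16string clef = u"\U0001D11E";
    assert(clef.size() == 2);
}
```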

...

> Or not be conforming (or not support Unicode).

To conform formally, they just have to pretend that they aren't using
Unicode, but something very similar in which surrogates are true
characters (and not unassigned code points reserved so that encoding
forms can use them) which have combining properties and some conditions on
their use.

Yours,

--
Jean-Marc


Author: Mathias Gaunard <loufoque@gmail.com>
Date: Thu, 29 Jul 2010 12:40:26 CST

On Jul 26, 7:09 pm, James Kanze <james.ka...@gmail.com> wrote:
> On Jul 25, 7:18 am, Jean-Marc Bourguet <j...@bourguet.org> wrote:
>
> > James Kanze <james.ka...@gmail.com> writes:
> > > On Jul 22, 7:38 pm, Jean-Marc Bourguet <j...@bourguet.org> wrote:
> > >> "Martin B." <0xCDCDC...@gmx.at> writes:
> > >       [...]
> > >> There is one Coded Character Set per locale in C++ with potentially two
> > >> Character Encoding Forms: the wide one in which the code point is simply
> > >> put in a wchar_t as it, the narrow one which can be more complex (state
> > >> dependant and use of several units per character) and use char as code
> > >> unit.
> > > Not directly related to the original question, but I don't think
> > > the standard guarantees that the code point will be simple put
> > > into a wchar_t as is, and at least two important implementations
> > > (IBM's C++ compiler under AIX and Microsoft's VC++) using
> > > Unicode support surrogates in wchar_t, a more complex encoding
> > > than just "as is".
> > How do you interpret 2.13.4/6 and 3.9.1/5 if you allow more than one
> > wchar_t per code point?
>
> I don't:-).
>
> I didn't verify anything in the standard.  I know that the
> intent of wchar_t was to have only one wchar_t per code point.
> I also know that Microsoft and IBM (AIX) compilers use a 16 bit
> wchar_t, with full Unicode (including surrogates).  It looks
> that those paragraphs are about as relevant as export with
> regards to actual implementations:-).
>
> In practice, I'm not sure it makes any difference.  Even when
> wchar_t is 32 bits, the two wchar_t sequence \U000000E5,
> \U00000301 is a single character.  And in a SWiss German local,
> to upper on the sequence \U000000E1, \U00000308 should return
> \U000000C1. \U000000C5 (or \U000000C1, \U000000E5 if converting
> to title case).  In other words, just applying the toupper
> function in ctype<wchar_t> doesn't work any better with wchar_t
> than with char.

I find your message a bit confusing.

00E5 0301 and 01FB are two different ways to express the character
LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE.
Its uppercase version is LATIN CAPITAL LETTER A WITH RING ABOVE AND
ACUTE, which is 01FA or 00C5 0301, that works fine with ctype whatever
the representation used.

So what you're saying is that 00E1 0308 (latin small letter a with
diaeresis and acute, which has no precomposed form), in Swiss German, gets
uppercased to 00C5 0301?
I can't find any such rule in the Unicode standard.

Anyway, the case where ctype doesn't work is something like 00DF which
gets uppercased to 0053 0053.
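
The limitation is visible in the signature alone: the per-character
toupper returns exactly one wchar_t. Below is a sketch of that, together
with a string-to-string helper that special-cases U+00DF. The helper
to_upper_expanding is hypothetical and handles only this one expansion; it
is not a general Unicode case mapping.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <type_traits>

// One wchar_t in, one wchar_t out: there is no room for "SS".
static_assert(
    std::is_same<decltype(std::toupper(L'a', std::locale())), wchar_t>::value,
    "toupper maps a single character to a single character");

// Hypothetical helper: uppercases a string, expanding U+00DF to "SS".
std::wstring to_upper_expanding(const std::wstring& in, const std::locale& loc) {
    std::wstring out;
    for (wchar_t c : in) {
        if (c == L'\u00DF') {
            out += L"SS";  // the case ctype cannot express
        } else {
            out += std::toupper(c, loc);
        }
    }
    return out;
}

void check_toupper() {
    const std::locale loc = std::locale::classic();
    assert(std::toupper(L'a', loc) == L'A');
    assert(to_upper_expanding(L"stra\u00DFe", loc) == L"STRASSE");
}
```

The result is longer than the input, which is why the mapping has to work
on strings rather than on individual characters.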



Author: "Martin B." <0xCDCDCDCD@gmx.at>
Date: Mon, 19 Jul 2010 12:24:08 CST

Greetings.

I have tried to understand the changes that C++0x makes with regard to
UTF-8 support and the u8"" string literals. Somehow it strikes me that
the u8 string literals are not all that helpful.

* UTF-8 Strings are not distinguishable at runtime. Both "" and u8""
have a type of "array of n const char" (n3092, p.27) - What is the
rationale for not providing a distinct type for u8 string literals?

* Given that (I assume) many implementations already allow string
literals such as:
 const char* encStr = "ab" "\xDC" "efgh";
 // written as "abÜefgh" in an ISO-8859-1 encoded cpp file
where 0xDC represents a code point above 127 and, given a western
European environment, represents U+00DC (Ü), this string literal will
(most likely?) be encoded as bytes 0x61-0x62-0xDC-0x65-... at runtime,
the interpretation of which depends on the locale used by the program.

* Now, given a literal such as:
 const char* utf8Str = u8"ab" "\u00DC" "efgh";
 // written as "abÜefgh" in an ISO-8859-1 encoded cpp file
the resulting runtime bytes are required to be the valid UTF-8
encoding of the above string.

The problem I see here is: How can a conforming implementation assist
the programmer in not inadvertently mixing UTF-8 strings and
Byte-Strings? That is:
 ...
 void foo_utf8(const char* valid_utf8_str);
 ...
 foo_utf8(u8"abÜefgh"); // OK
 ...
 foo_utf8("abÜefgh"); // Oops - invalid UTF-8 sequence.
                      // We simply forgot the prefix
                      // But it seems the compiler cannot
                      // tell us.
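
One user-level workaround is to give UTF-8 text its own wrapper type, so
that a bare literal no longer converts silently. This is a sketch only:
Utf8Text is a hypothetical class, and it merely records the caller's claim
that the bytes are UTF-8 without validating them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper for bytes the caller vouches for as UTF-8.
class Utf8Text {
public:
    // explicit: a plain "..." literal does not convert implicitly.
    explicit Utf8Text(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// The API now states its expectation in the parameter type.
std::size_t foo_utf8(const Utf8Text& text) { return text.bytes().size(); }

static_assert(!std::is_convertible<const char*, Utf8Text>::value,
              "a bare literal is rejected at compile time");

void check_wrapper() {
    // The UTF-8 bytes are spelled out so the example does not depend
    // on the type of u8 literals in the language mode in use.
    assert(foo_utf8(Utf8Text("ab\xC3\x9C" "efgh")) == 8);
    // foo_utf8("ab\xDC" "efgh");  // would not compile
}
```

A forgotten prefix inside the Utf8Text(...) call would still go unnoticed,
so this narrows the problem rather than solving it.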

I have seen some libraries[*] use the convention of "unsigned char"
for UTF-8 encoded strings (as opposed to plain "char" for normal
strings).
Can a conforming implementation provide means so that a u8"" sequence
has a different type from a "" sequence?

I hope my questions/points make some sense.

cheers,
Martin

[*] notably the libxml2 library, which is C, but still.

p.s.: Note that I found only one other recent topic with regard to
Unicode + C++0x on this group:
http://groups.google.com/group/comp.std.c++/browse_frm/thread/e0971751579edd88/e325034396938d0e

