Topic: C++0x string conversions, etc.


Author: Scott Meyers <smeyers@aristeia.com>
Date: Thu, 27 Aug 2009 16:45:34 CST
In draft C++0x (N2914), 22.4.1.4/3 says (in part):

   codecvt<char, char, mbstate_t> implements a degenerate conversion; it
   does not convert at all.

This is also part of C++03.  What is the point of requiring the presence of
a no-op conversion?

22.4.1.4/3 continues:

   The specialization codecvt<char16_t, char, mbstate_t> converts between
   the UTF-16 and UTF-8 encoding schemes

However, 22.5/6 says

   For the facet codecvt_utf8_utf16:
   - The facet shall convert between UTF-8 multibyte sequences and UTF-16
     (one or two 16-bit codes) within the program

This seems to say that we have two codecvt instantiations that convert
between UTF-8 and UTF-16 string representations.  Why do we need both?

Now, everything I know about ISO 10646/Unicode/UTF-n/UCS-n, etc., I got
from reading pages at Wikipedia in the last hour or so, so pardon me if
this is a silly question, but:

- If I have a pointer p of type char16_t* that points to a string encoded
   using UTF-16 (a multibyte format) and I say ++p, does p move forward a
   single (multibyte) character or a fixed number of machine bytes?  In
   other words, can a compiler generate a fixed increment for the value of p
   (as it would be able to do with a char* pointer), or must the value of
   the increment be determined at runtime based on the character p points
   to?

Thanks,

Scott



--
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]





Author: Mathias Gaunard <loufoque@gmail.com>
Date: Fri, 28 Aug 2009 11:09:47 CST
On 28 Aug, 00:45, Scott Meyers <smey...@aristeia.com> wrote:

> - If I have a pointer p of type char16_t* that points to a string encoded
>    using UTF-16 (a multibyte format) and I say ++p, does p move forward a
>    single (multibyte) character or a fixed number of machine bytes?  In
>    other words, can a compiler generate a fixed increment for the value of p
>    (as it would be able to do with a char* pointer), or must the value of
>    the increment be determined at runtime based on the character p points
>    to?

UTF-16 is an encoding for the Unicode character set that encodes a code
point as one or two 16-bit code units.
Pointers to char16_t don't behave any differently than pointers to
anything else.
If you increment a pointer to a T, the memory address held in the
pointer is increased by sizeof(T) bytes.
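
A minimal sketch of that rule, assuming a C++0x/C++11 compiler that
provides char16_t. The helper name step_in_bytes is hypothetical and is
used here only for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: reports how many bytes separate p from p + 1.
// The result depends only on the pointee type, never on the value stored
// at *p, so the compiler can emit a fixed increment.
inline std::ptrdiff_t step_in_bytes(const char16_t* p) {
    assert(p != nullptr);
    const char16_t* next = p + 1;
    return reinterpret_cast<const char*>(next) -
           reinterpret_cast<const char*>(p);
}
```

Calling it on the first half of a surrogate pair and on an ordinary BMP
character should give the same answer, sizeof(char16_t), in both cases.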

So that means that if you increment a pointer to a char16_t, you are
now at the next UTF-16 code unit, not necessarily at the next code
point.
If you want smart iteration of UTF-8 or UTF-16 ranges, you need a
conversion facility defined in terms of range adapters, not codecvt
facets.
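
To make the code unit versus code point distinction concrete, here is a
rough sketch of code-point-wise stepping over a UTF-16 buffer. It is not a
range adapter and not part of any library; next_code_point and
count_code_points are hypothetical names, and the handling of ill-formed
input (an unpaired surrogate is returned as-is) is an assumption made to
keep the example short:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decodes the code point that starts at p and
// advances p past it (by one or two code units).
// Precondition: [p, end) is a non-empty range.
inline char32_t next_code_point(const char16_t*& p, const char16_t* end) {
    assert(p != end);
    char32_t unit = *p++;
    // A high surrogate followed by a low surrogate encodes one code point
    // outside the Basic Multilingual Plane.
    if (unit >= 0xD800 && unit <= 0xDBFF && p != end) {
        char32_t low = *p;
        if (low >= 0xDC00 && low <= 0xDFFF) {
            ++p;
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return unit;  // BMP character, or an unpaired surrogate passed through
}

// Hypothetical helper: counts code points, not code units, in [first, last).
inline std::size_t count_code_points(const char16_t* first,
                                     const char16_t* last) {
    std::size_t n = 0;
    while (first != last) {
        next_code_point(first, last);
        ++n;
    }
    return n;
}
```

For a buffer holding 'A', the surrogate pair 0xD83D 0xDE00 (U+1F600), and
U+00E9, plain ++p visits four code units while this loop visits three code
points.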







Author: David Abrahams <dave@boostpro.com>
Date: Sat, 29 Aug 2009 21:57:36 CST
on Fri Aug 28 2009, Mathias Gaunard <loufoque-AT-gmail.com> wrote:

> If you want smart iteration of UTF-8 or UTF-16 ranges, you need a
> conversion facility defined in terms of range adapters, not codecvt
> facets.

Hi Scott,

You might look at the unofficial components in
boost/regex/pending/unicode_iterator.hpp, namely u16_to_u32_iterator and
u8_to_u32_iterator.

Cheers,

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
