Topic: C++0x string conversions, etc.
Author: Scott Meyers <smeyers@aristeia.com>
Date: Thu, 27 Aug 2009 16:45:34 CST
In draft C++0x (N2914), 22.4.1.4/3 says (in part):
codecvt<char, char, mbstate_t> implements a degenerate conversion; it
does not convert at all.
This is also part of C++03. What is the point of requiring the presence of
a no-op conversion?
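For reference, the degenerate behaviour can be observed directly. The following is a minimal sketch, not taken from the draft: the function names char_codecvt_is_noop and try_convert are hypothetical, and it assumes the facet is fetched from the classic "C" locale. It only shows that the facet reports itself as non-converting and that in() returns noconv.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper: asks the classic locale's
// codecvt<char, char, mbstate_t> whether it ever converts.
bool char_codecvt_is_noop()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());
    return cvt.always_noconv();
}

// Hypothetical helper: calls in() on a short buffer and returns the
// result code reported by the facet.
std::codecvt_base::result try_convert()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    const char src[] = "abc";
    const char* src_next = 0;
    char dst[4] = {};
    char* dst_next = 0;

    return cvt.in(state, src, src + 3, src_next, dst, dst + 3, dst_next);
}
```

With a conforming library, char_codecvt_is_noop() should yield true and try_convert() should yield std::codecvt_base::noconv.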
22.4.1.4/3 continues:
The specialization codecvt<char16_t, char, mbstate_t> converts between
the UTF-16 and UTF-8 encoding schemes
However, 22.5/6 says:
For the facet codecvt_utf8_utf16:
- The facet shall convert between UTF-8 multibyte sequences and UTF-16
(one or two 16-bit codes) within the program
This seems to say that we have two codecvt instantiations that convert
between UTF-8 and UTF-16 string representations. Why do we need both?
Now, everything I know about ISO 10646/Unicode/UTF-n/UCS-n, etc., I got
from reading pages at Wikipedia in the last hour or so, so pardon me if
this is a silly question, but:
- If I have a pointer p of type char16_t* that points to a string encoded
using UTF-16 (a multibyte format) and I say ++p, does p move forward a
single (multibyte) character or a fixed number of machine bytes? In
other words, can a compiler generate a fixed increment for the value of p
(as it would be able to do with a char* pointer), or must the value of
the increment be determined at runtime based on the character p points
to?
Thanks,
Scott
--
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]
Author: Mathias Gaunard <loufoque@gmail.com>
Date: Fri, 28 Aug 2009 11:09:47 CST
On 28 Aug, 00:45, Scott Meyers <smey...@aristeia.com> wrote:
> - If I have a pointer p of type char16_t* that points to a string encoded
> using UTF-16 (a multibyte format) and I say ++p, does p move forward a
> single (multibyte) character or a fixed number of machine bytes? In
> other words, can a compiler generate a fixed increment for the value of p
> (as it would be able to do with a char* pointer), or must the value of
> the increment be determined at runtime based on the character p points
> to?
UTF-16 is an encoding for the Unicode character set that encodes a code
point as one or two 16-bit code units.
Pointers to char16_t don't behave any differently than pointers to
anything else.
If you increment a pointer to a T, the memory address held in the
pointer is increased by sizeof(T) bytes.
So that means that if you increment a pointer to a char16_t, you are
now at the next UTF-16 code unit, not necessarily at the next code
point.
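To make the distinction concrete, here is a small sketch. The helper names (bytes_per_increment, is_low_surrogate, count_code_points) are made up for illustration, and the counting function assumes well-formed UTF-16 input. It shows that one ++ always covers sizeof(char16_t) bytes, while the number of code points can be smaller than the number of code units.

```cpp
#include <cstddef>

// Hypothetical helper: byte distance covered by a single ++ on a
// pointer to char16_t. This is fixed at compile time.
std::size_t bytes_per_increment(const char16_t* p)
{
    const char16_t* q = p;
    ++q;
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(q) -
        reinterpret_cast<const char*>(p));
}

// True if u is a low (trailing) surrogate, i.e. the second code unit
// of a two-unit code point.
bool is_low_surrogate(char16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Counts code points in [first, last), assuming well-formed UTF-16:
// every code unit that is not a low surrogate starts a code point.
std::size_t count_code_points(const char16_t* first, const char16_t* last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
        if (!is_low_surrogate(*first))
            ++n;
    return n;
}
```

For the three code units 0x0061, 0xD834, 0xDD1E ('a' followed by U+1D11E), the pointer has to be incremented three times to walk the array, yet count_code_points reports two code points.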
If you want smart iteration of UTF-8 or UTF-16 ranges, you need a
conversion facility defined in terms of range adapters, not codecvt
facets.
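As a rough idea of what such code-point-wise iteration has to do for UTF-16, here is a hand-rolled sketch. It is not a range adapter and not part of any library: next_code_point is a hypothetical name, the input is assumed to be a non-empty range of code units, and an unpaired surrogate is simply returned unchanged rather than reported as an error.

```cpp
#include <cassert>

// Hypothetical helper: decodes the code point that starts at p and
// advances p past it (by one or two code units).
// Precondition: p != end.
char32_t next_code_point(const char16_t*& p, const char16_t* end)
{
    assert(p != end);
    const char32_t lead = *p++;

    // A high surrogate followed by a low surrogate forms one code
    // point in the range U+10000..U+10FFFF.
    if (lead >= 0xD800 && lead <= 0xDBFF && p != end) {
        const char32_t trail = *p;
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++p;
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        }
    }

    // Single code unit (or an unpaired surrogate, passed through as-is).
    return lead;
}
```

Calling it in a loop until p == end visits one code point per call, so the size of each step is decided at run time by the data, not by the pointer type.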
Author: David Abrahams <dave@boostpro.com>
Date: Sat, 29 Aug 2009 21:57:36 CST
on Fri Aug 28 2009, Mathias Gaunard <loufoque-AT-gmail.com> wrote:
> If you want smart iteration of UTF-8 or UTF-16 ranges, you need a
> conversion facility defined in terms of range adapters, not codecvt
> facets.
Hi Scott,
You might look at the unofficial components in
boost/regex/pending/unicode_iterator.hpp, namely u16_to_u32_iterator and
u8_to_u32_iterator.
Cheers,
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com