Topic: std::string vs. Unicode UTF-8


Author: "Lance Diduck" <lancediduck@nyc.rr.com>
Date: Thu, 6 Oct 2005 00:22:19 CST
This was a great overview. Thanks!
> I think the current string classes and codecvt functionality in the language
> is pretty decent (I would have preferred if wchar_t had been nailed to 32
> bits, or even 16 bits...
Of the four platforms that I regularly code for, two use a 32-bit and
two a 16-bit definition of wchar_t. Across those, two are big endian
(AIX and Solaris), and two are little endian (Linux and Microsoft) (I haven't
researched Cygwin, which would be interesting to see). This is four
different encodings. Any comparisons involving literals are suspect,
not to mention "binary support."
Message catalogs help -- and the diversity there is off topic, but is
far, far more non-standard and uneven than wchar_t support.
Given that most localization is done in a GUI framework rather than
through IOstreams, it would help if automatic invocation of codecvt
were placed in something like stringstream. But as it is, codecvt is only
invoked automatically in things that don't write to memory. And except
perhaps for CGI calls, there is little demand for "console mode"
internationalized applications.
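
For in-memory buffers the facet has to be driven by hand. The following is only a minimal sketch of that, assuming the standard std::codecvt<wchar_t, char, std::mbstate_t> facet of whatever locale is passed in; the function name widen_in_memory is made up for illustration, and the result depends entirely on the locale's encoding.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow in-memory buffer to wide characters
// by calling the locale's codecvt facet directly, which is the step that file
// streams perform automatically and string streams do not.
std::wstring widen_in_memory(const std::string& bytes, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (bytes.empty()) return std::wstring();

    // One wide character per input byte is an upper bound on the output size.
    std::wstring out(bytes.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("codecvt::in could not convert the whole buffer");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With std::locale::classic() this only handles plain ASCII reliably; a named locale with UTF-8 support would be needed for anything more, and whether such a locale exists is platform-specific.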

>
> I'd love to see the functionality of the IBM ICU libraries
> <http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
> not a fan of the ICU C++ interface (as I mentioned above - I don't see a
> need for a new string class,
The ICU C++ string uses -- and I'm not kidding -- "bogus semantics."
http://icu.sourceforge.net/apiref/icu4c/classUnicodeString.html#a82 You
check the validity of your string by calling the isBogus
method. Additionally, every ICU class inherits from UMemory, and you can
only change the heap manager by redefining this base class and
redeploying the library.
The ICU looks like a port from Java, and has a very Java feel to it. I
believe it is a great starting point though.

Other than string literals, and the lack of character iterators, the
main problem with the C++ string and Unicode is the compare function.
To get a true comparison one would really use the locale compare
function, mapped to some normalization and collation algorithm, and not
string compare, which is more or less memcmp. The interface for string
compare can only compare using the number of bytes in the smaller of
the strings to be compared -- so even if you did manage somehow to cram
normalization into a char_traits class, the traits::compare interface
requires truncating the larger of the two strings.
This works great for backward compatibility, though.
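
As a rough sketch of the difference between the two interfaces: the first function below spells out what std::string::compare effectively does (traits::compare over the common prefix, then a tie-break on length), and the second hands both complete ranges to the locale's collate facet. Both function names are invented for this illustration, and no normalization is performed anywhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>

// Spelled-out equivalent of std::string::compare: traits::compare only ever
// sees min(a.size(), b.size()) characters; the length breaks ties afterwards.
int prefix_then_length_compare(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const int r = std::char_traits<char>::compare(a.data(), b.data(), n);
    if (r != 0) return r;
    if (a.size() < b.size()) return -1;
    if (a.size() > b.size()) return 1;
    return 0;
}

// Locale-based comparison: the collate facet receives both full ranges, so a
// locale is free to apply its own collation rules to them.
int locale_compare(const std::string& a, const std::string& b,
                   const std::locale& loc) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Note that the byte-wise version treats a precomposed "é" (bytes C3 A9 in UTF-8) and "e" followed by a combining acute accent (65 CC 81) as different strings, which is exactly the kind of result a normalizing collation would be expected to avoid.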


>
> Beyond that, I'd like to work towards a standard markup -
But wouldn't that depend on the renderer? But adoption of XSL-FO may be
a good start. However, RIM devices etc. would barely be able to fit such
a renderer.





---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: sparent@adobe.com (Sean Parent)
Date: Thu, 6 Oct 2005 14:21:36 GMT


in article 1128564183.576203.88670@g49g2000cwa.googlegroups.com, Lance
Diduck at lancediduck@nyc.rr.com wrote on 10/5/05 11:22 PM:

>>
>> Beyond that, I'd like to work towards a standard markup -
> But wouldn't that depend on the renderer? But adoption of XSL-FO may be
> a goos start. However, RIM devices etc would barely be able to fit such
> a renderer.

I should have clarified - I'm not looking at markup for rendering intents
(that's a separate but important issue) but rather for semantic intents -
marking substrings with their language, gender, plurality, and locale as
well as alternates (alternate languages, alternate forms such as
formal/casual). These are important attributes for string processing. More
RDF than XSL-FO.

Sean






Author: sf.bone@btinternet.com (Simon Bone)
Date: Fri, 7 Oct 2005 03:20:51 GMT
On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:


> What might be worthwhile is to require some actual support for Unicode.
> I'm not sure it's a good idea to impose such a requirement; there's a
> real advantage to giving implementors the freedom to not support
> Unicode if they know that their particular customer base has no need
> for it. However, such a requirement would at least guarantee some
> benefit to some users, which requiring wchar_t to be at least 16 bits
> would NOT do.
>

Like the freedom not to implement export because no-one in their customer
base needs it? ;-)

I think standard Unicode support would be more widely appreciated than
export. If some vendors continue to decide not to quite finish their
implementations, so what? The world has not stopped turning while we wait
for more C++ 98 implementations to become strictly complete. I also expect
most C++ implementors would provide Unicode support following the
standard, if it was included.

Simon Bone






Author: kuyper@wizard.net
Date: 7 Oct 2005 06:20:01 GMT
Simon Bone wrote:
> On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:
>
>
> > What might be worthwhile is to require some actual support for Unicode.
> > I'm not sure it's a good idea to impose such a requirement; there's a
> > real advantage to giving implementors the freedom to not support
> > Unicode if they know that their particular customer base has no need
> > for it. However, such a requirement would at least guarantee some
> > benefit to some users, which requiring wchar_t to be at least 16 bits
> > would NOT do.
> >
>
> Like the freedom not to implement export because no-one in their customer
> base needs it? ;-)

Not really. The freedom to not implement export exists because
customers don't insist that an implementation be fully conforming in
that regard. The freedom to provide a trivial implementation of wide
characters is available because the standard is quite deliberately
designed to allow even a fully conforming implementation to provide
such an implementation. Those freedoms seem quite different to me.

> I think standard Unicode support would be more widely appreciated than
> export. ...

Perhaps; I can't speak for anyone but myself. Personally, in my current
job I have absolutely no need for Unicode support, or even support for
any encoding other than US ASCII, nor for any locale other than the "C"
locale. On the other hand, I'd love to be able to use "export". I'm not
opposed to supporting other locales, it just isn't relevant on my
current job.






Author: sf.bone@btinternet.com (Simon Bone)
Date: Sat, 8 Oct 2005 02:11:39 GMT
On Fri, 07 Oct 2005 06:20:01 +0000, kuyper wrote:

> Simon Bone wrote:
>> On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:
>>
>>
>> > What might be worthwhile is to require some actual support for Unicode.
>> > I'm not sure it's a good idea to impose such a requirement; there's a
>> > real advantage to giving implementors the freedom to not support
>> > Unicode if they know that their particular customer base has no need
>> > for it. However, such a requirement would at least guarantee some
>> > benefit to some users, which requiring wchar_t to be at least 16 bits
>> > would NOT do.
>> >
>>
>> Like the freedom not to implement export because no-one in their customer
>> base needs it? ;-)
>
> Not really. The freedom to not implement export exists because
> customers don't insist that an implementation be fully conforming in
> that regard. The freedom to provide a trivial implementation of wide
> characters is available because the standard is quite deliberatly
> designed to allow even a fully conforming implementation to provide
> such an implementation. Those freedoms seem quite different to me.
>

I didn't intend to compare the C++98 export requirements to the C++98
wide character requirements, but rather to some hypothetical C++0x Unicode
requirements.

At the moment, implementors have a freedom in how they implement wide
character support that in practice seems to make writing portable programs
that handle plain text more difficult than it needs to be. Adding a
requirement to support Unicode directly would help IMO.

>> I think standard Unicode support would be more widely appreciated than
>> export. ...
>
> Perhaps; I can't speak for anyone but myself. Personally, in my current
> job I have absolutely no need for Unicode support, or even support for
> any encoding other than US ASCII, nor for any locale other than the "C"
> locale. On the other hand, I'd love to be able to use "export". I'm not
> opposed to supporting other locales, it just isn't relevant on my
> current job.
>

I find support for extended characters in most of the software I use, even
if not in all I write. Unicode really has become very widespread - enough
to be considered as portable as US ASCII ever was. So I would like support
guaranteed by the standard.

And for what it's worth, I think I'd like to be able to use export too. I'm
not trying to argue for losing that (or even the hope of that), but
rather for increased requirements in plain text handling facilities. I
think a standard Unicode library would be widely enough implemented to
displace most of the various libraries currently used. It is enough of a
hassle to pass Unicode data around between different codebases now to be
worth fixing this.

I feel a lot of C++ code right now is probably using one or another
library to solve the need to use Unicode. Moving to a future where most
code needing this support uses a single, well specified interface would be
a big improvement.

If a particular implementor sees their customers as not needing this, no
doubt they will ship without it, regardless of what the standard says.
This could well happen for some compilers targeting embedded systems; and
that is not a change. Lots of implementations have rough edges and when a
particular C++ codebase is ported, problems are often found. It doesn't
mean we should give up any hope of a useful standard. Rather, it helps us
work out who is to blame or at least where the extra work should be
targeted (at improving the compiler or changing the codebase).

Simon Bone






Author: dietmar_kuehl@yahoo.com (Dietmar Kuehl)
Date: Wed, 28 Sep 2005 03:14:14 GMT
Pete Becker wrote:
> That's unfortunate, since it's exactly what wchar_t and wstring were
> designed for. What is your objection to them?

Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that it used 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.
--
<mailto:dietmar_kuehl@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence






Author: petebecker@acm.org (Pete Becker)
Date: Wed, 28 Sep 2005 13:42:40 GMT
Dietmar Kuehl wrote:
> Pete Becker wrote:
>
>>That's unfortunate, since it's exactly what wchar_t and wstring were
>>designed for. What is your objection to them?
>
>
> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
> was still pretending that they use 16-bit characters and that each
> Unicode character consists of a single 16-bit character. Neither of
> these two properties holds: Unicode is [currently] a 20-bit encoding
> and a Unicode character can consist of multiple such 20-bit entities
> for combining characters.

Well, true, but wchar_t can certainly be large enough to hold 20 bits.
And the claim from the Unicode folks is that that's all you need.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)






Author: jonathan@doves.demon.co.uk (Jonathan Coxhead)
Date: Sat, 1 Oct 2005 04:41:06 GMT
Pete Becker wrote:
> Dietmar Kuehl wrote:
>
>> Pete Becker wrote:
>>
>>> That's unfortunate, since it's exactly what wchar_t and wstring were
>>> designed for. What is your objection to them?
>>
>>
>>
>> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
>> was still pretending that they use 16-bit characters and that each
>> Unicode character consists of a single 16-bit character. Neither of
>> these two properties holds: Unicode is [currently] a 20-bit encoding
>> and a Unicode character can consist of multiple such 20-bit entities
>> for combining characters.
>
>
> Well, true, but wchar_t can certainly be large enough to hold 20 bits.
> And the claim from the Unicode folks is that that's all you need.

    Actually, you need 21 bits. There are 0x11 planes with 0x10000 characters in
each, so 0x110000 characters. This space is completely flat, though it has
holes. Or, you can use UTF-16, where a character is encoded as 1 or 2 16-bit
values, so in C counts as neither a wide-character encoding nor a multibyte
encoding. (It might be a "multishort" encoding, if such a thing existed.) Or you
can use UTF-8, which is a true multibyte encoding. The translation between these
representations is purely algorithmic.

    Anyway, 20 bits: not enough.
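
To make the "purely algorithmic" translation concrete, here is a minimal sketch that encodes a single code point as UTF-16 and as UTF-8. The function names are invented for this illustration; the bit layouts are the standard ones for the two encoding forms.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// A Unicode scalar value: anything in U+0000..U+10FFFF except the surrogate
// range U+D800..U+DFFF (one of the "holes" in the flat space).
bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// UTF-16: one 16-bit unit for U+0000..U+FFFF, otherwise a surrogate pair
// carrying the 20 bits of (cp - 0x10000).
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (!is_scalar_value(cp)) throw std::invalid_argument("not a scalar value");
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// UTF-8: one to four bytes depending on the magnitude of the code point.
std::string to_utf8(std::uint32_t cp) {
    if (!is_scalar_value(cp)) throw std::invalid_argument("not a scalar value");
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The largest code point, 0x10FFFF, is greater than 2^20 - 1 = 0xFFFFF, which is why 20 bits fall short; it comes out as the pair DBFF DFFF in UTF-16 and as the four bytes F4 8F BF BF in UTF-8.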






Author: "kanze" <kanze@gabi-soft.fr>
Date: Fri, 30 Sep 2005 23:41:35 CST
Pete Becker wrote:
> Dietmar Kuehl wrote:
> > Pete Becker wrote:

> >>That's unfortunate, since it's exactly what wchar_t and
> >>wstring were designed for. What is your objection to them?

> > Well, 'wchar_t' and 'wstring' were designed at a time when
> > Unicode was still pretending that they use 16-bit characters
> > and that each Unicode character consists of a single 16-bit
> > character. Neither of these two properties holds: Unicode is
> > [currently] a 20-bit encoding and a Unicode character can
> > consist of multiple such 20-bit entities for combining
> > characters.

(If you have 20 or more bits, there's no need for the combining
characters; they're only present to allow representing character
codes larger than 0xFFFF as two 16 bit characters.)

> Well, true, but wchar_t can certainly be large enough to hold
> 20 bits.  And the claim from the Unicode folks is that that's
> all you need.

I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway.  Given that, vendors have defined
wchar_t in a variety of ways.  And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode.  (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34







Author: rjk@greenend.org.uk (Richard Kettlewell)
Date: Sat, 1 Oct 2005 17:01:46 GMT
"kanze" <kanze@gabi-soft.fr> writes:
> (If you have 20 or more bits, there's no need for the combining
> characters; there only present to allow representing character codes
> larger than 0xFFFF as two 16 bit characters.)

I believe you are thinking of surrogates, rather than combining
characters, here.  The need (or otherwise) for the latter is
independent of representation.

--
http://www.greenend.org.uk/rjk/






Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Sat, 1 Oct 2005 17:02:14 GMT
"kanze" <kanze@gabi-soft.fr> wrote in message
news:1127985061.409082.75870@g49g2000cwa.googlegroups.com...

> I think the point is that when wchar_t was introduced, it wasn't
> obvious that Unicode was the solution, and Unicode at the time
> was only 16 bits anyway.  Given that, vendors have defined
> wchar_t in a variety of ways.  And given that vendors want to
> support their existing code bases, that really won't change,
> regardless of what the standard says.
>
> Given this, there is definite value in leaving wchar_t as it is
> (which is pretty unusable in portable code), and defining a new
> type which is guaranteed to be Unicode.  (This is, I believe,
> the route C is taking; there's probably some value in remaining
> C compatible here as well.)

Right, there's a (non-normative) Technical Report that defines
16- and 32-bit character types independent of wchar_t. We'll
be shipping it as part of our next release, along with a slew
of code conversions you can use with these new types.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com







Author: "kanze" <kanze@gabi-soft.fr>
Date: 4 Oct 2005 04:00:36 GMT
Richard Kettlewell wrote:
> "kanze" <kanze@gabi-soft.fr> writes:
> > (If you have 20 or more bits, there's no need for the
> > combining characters; there only present to allow
> > representing character codes larger than 0xFFFF as two 16
> > bit characters.)

> I believe you are thinking of surrogates, rather than
> combining characters, here.  The need (or otherwise) for the
> latter is independent of representation.

I was definitely talking about surrogates.  And it is possible to
represent any Unicode character in UTF-32 without the use of
surrogates; they are only necessary in UTF-16.

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34







Author: squell@alumina.nl (Marc Schoolderman)
Date: Wed, 5 Oct 2005 04:27:51 GMT
kanze wrote:

> I was definitly talking about surrogates.  And it is possible to
> represent any Unicode character in UTF-32 without the use of
> surrogates; they are only necessary in UTF-16.

To strengthen this a bit, surrogates aren't even allowed in encodings
other than UTF-16 since they occupy invalid code points. Proper UTF-8
decoders explicitly include checks to prevent them.
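
A minimal sketch of such a check, under the assumption that a strict decoder should also reject overlong forms and values above U+10FFFF (the function name decode_utf8 is invented for this illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 into code points. Returns false on ill-formed input:
// bad lead or continuation bytes, truncation, overlong forms, values above
// U+10FFFF, and encoded surrogates (U+D800..U+DFFF).
bool decode_utf8(const std::string& in, std::vector<std::uint32_t>& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;     // number of continuation bytes expected
        std::uint32_t minimum = 0; // smallest value this length may encode

        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; minimum = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }

        if (in.size() - i <= extra) return false;  // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minimum) return false;                   // overlong form
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate

        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

For instance, the bytes ED A0 80 would decode arithmetically to U+D800, so the surrogate test is what causes them to be refused.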

~Marc.






Author: sparent@adobe.com (Sean Parent)
Date: Wed, 5 Oct 2005 04:27:19 GMT
A few comments on this thread -

Unicode has been 21 bits since its inception, at least it was 21 bits by
the time Unicode 1.0 came out - (I worked with Eric Mader, Dave Opstad, and
Mark Davis at Apple <http://www.unicode.org/history/>). Although I've heard
grumblings that people would like to extend it to include pages for more
dead languages.

UCS-2 is a subset of Unicode that fits in 16 bits without double word
encoding. It is part of ISO 10646, which also defines UCS-4, which for all
practical purposes is the same encoding as UTF-32 (there's a document on the
relationship on the unicode.org site). UTF-16 and UTF-32 both have endian
variants.

Operations such as "the number of characters in a string" have very little
meaning - there is no direct relationship between characters and glyphs,
there are combining characters (not the same as a multi-byte or word
encoding). Even if defined as the number of Unicode code points in a string,
it isn't particularly interesting.
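
A small sketch of how the possible "lengths" diverge, assuming well-formed UTF-8 input (the function name is invented for this illustration):

```cpp
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by counting every byte that is not
// a continuation byte (continuation bytes have the form 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The string "e" followed by U+0301 (combining acute accent) is 3 bytes and 2 code points in UTF-8, yet a reader sees one character; the precomposed U+00E9 is 2 bytes and 1 code point for the same visible result. Neither count says how many glyphs will be drawn.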

Operations such as string catenation, sub-string searching, upper-case to
lower-case conversion, and collation are all non-trivial on a Unicode string
regardless of the encoding.

I think the current string classes and codecvt functionality in the language
is pretty decent (I would have preferred if wchar_t had been nailed to 32
bits, or even 16 bits... But that will be somewhat addressed). I'd like to
see the complexity of the current string classes specified - and I think a
lightweight copy (constant time) is needed - but I think move semantics will
address this. I also think it would be good to mark strings with their
encoding because it is too easy to end up with Mojibake
<http://en.wikipedia.org/wiki/Mojibake> but I don't think this requires a
whole new string class (I honestly don't think there is such a thing as a
one-size-fits-all string class).

I'd love to see the functionality of the IBM ICU libraries
<http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
not a fan of the ICU C++ interface (as I mentioned above - I don't see a
need for a new string class). I'd like ICU rethought as generic algorithms
that work regardless of the string representation.

Beyond that, I'd like to work towards a standard markup - strings require
more information than just their encoding to really be handled properly. You
need to know which sections of a string are in which language (which can't
be determined completely from the characters used) - items such as gender,
plurality, and formal forms all play a part in doing proper operations such
as replacements. The ASL xstring glossary library is a step in this
direction <http://opensource.adobe.com/group__asl__xstring.html>

--
Sean Parent
Sr. Engineering Manager
Software Technology Lab
Adobe Systems Incorporated
sparent@adobe.com






Author: usenet-nospam@nmhq.net (Niklas Matthies)
Date: Wed, 5 Oct 2005 04:27:31 GMT
On 2005-10-04 04:00, kanze wrote:
:
> I was definitly talking about surrogates.  And it is possible to
> represent any Unicode character in UTF-32 without the use of
> surrogates;

It's even necessary, because surrogate code points outside of UTF-16
are non-conformant and cause the corresponding byte or code point
sequences to be ill-formed.

-- Niklas Matthies






Author: kuyper@wizard.net
Date: Tue, 4 Oct 2005 23:33:31 CST
kanze wrote:
> Richard Kettlewell wrote:
> > "kanze" <kanze@gabi-soft.fr> writes:
> > > (If you have 20 or more bits, there's no need for the
> > > combining characters; there only present to allow
> > > representing character codes larger than 0xFFFF as two 16
> > > bit characters.)
>
> > I believe you are thinking of surrogates, rather than
> > combining characters, here.  The need (or otherwise) for the
> > latter is independent of representation.
>
> I was definitly talking about surrogates.  And it is possible to
> represent any Unicode character in UTF-32 without the use of
> surrogates; they are only necessary in UTF-16.

As the Unicode documents themselves point out, what a reader would
consider to be a single character is often represented in Unicode as
the combination of several unicode characters. Can an implementation
use UTF-32 encoding for wchar_t, and meet all of the requirements of
the C standard with respect to wchar_t, when combined characters are
involved? I think you can meet those requirements only by interpreting
every reference in the C standard to a wide "character" as referring to
a "unicode character" rather than as referring to what end users would
consider a character.

If search_string ends with an uncombined character, and target_string
contains the exact same sequence of wchar_t values followed by one or
more combining characters, I believe that wcsstr(target_string,
search_string) is supposed to report a match. That strikes me as
problematic.
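
A minimal sketch of that situation, assuming an implementation whose wchar_t holds Unicode code points for the characters used here (the helper name is invented; std::wcsstr takes the string being searched first and the string being looked for second):

```cpp
#include <cwchar>

// Returns true if needle occurs in haystack as a raw sequence of wchar_t
// values. std::wcsstr knows nothing about combining characters.
bool contains_code_units(const wchar_t* haystack, const wchar_t* needle) {
    return std::wcsstr(haystack, needle) != nullptr;
}
```

Searching L"cafe\u0301" (the letters c, a, f, e followed by a combining acute accent) for L"cafe" reports a match, although a reader would say the text reads "café" and does not contain "cafe". Conversely, searching the precomposed L"caf\u00E9" for L"cafe" reports no match.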






Author: kuyper@wizard.net
Date: Thu, 6 Oct 2005 00:20:59 CST
Sean Parent wrote:
.
> I think the current string classes and codecvt functionality in the language
> is pretty decent (I would have preferred if wchar_t had been nailed to 32
> bits, or even 16 bits... But that will be somewhat addressed). I'd like to

Requiring wchar_t to have more than 8 bits is pointless in itself. If
an implementor would have chosen to make wchar_t 8 bits without that
requirement, forcing the implementor to use 16 bits will merely
encourage definition of a 16-bit type that contains the same range of
values as his 8 bit type would have had. In the process, you'll be
making his implementation marginally more complicated and inefficient.

What might be worthwhile is to require some actual support for Unicode.
I'm not sure it's a good idea to impose such a requirement; there's a
real advantage to giving implementors the freedom to not support
Unicode if they know that their particular customer base has no need
for it. However, such a requirement would at least guarantee some
benefit to some users, which requiring wchar_t to be at least 16 bits
would NOT do.






Author: kuyper@wizard.net
Date: Fri, 16 Sep 2005 21:06:06 CST
Marc Schoolderman wrote:
> P.J. Plauger wrote:
.
> >>I agree specific multibyte/external encodings (such as UTF8 or UTF16)
> >>should not be enshrined, however I see less reason why wchar_t can still
> >>be something other than a representation of ISO10646.
> > When you say ISO 10646, do you mean UCS-2, UCS-4, or UTF-16?
>
> I mean similar to what C99 requires for __STDC_ISO_10646__.

"__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of type wchar_t are
the coded representations of the characters defined by ISO/IEC 10646,
along with all amendments and technical corrigenda as of the specified
year and month."

I don't have a copy of ISO/IEC 10646. Does it specify UCS-2, UCS-4, or
UTF-16? I was under the impression that they were all allowed.






Author: squell@alumina.nl (Marc Schoolderman)
Date: Sat, 17 Sep 2005 17:10:24 GMT
kuyper@wizard.net wrote:

> "__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t are
> the coded representations of the characters defined by ISO/IEC 10646,
> along with all amendments and technical corrigenda as of the specified
> year and month."
> I don't have a copy of ISO/IEC 10646. Does it specify UCS-2, UCS-4, or
> UTF-16? I was under the impression that they were all allowed.

I don't have a copy either (working mostly from the information that
Markus Kuhn and Unicode.org put online). But I think this is confusing
representation with external encoding.

   http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

Intuitively, I think __STDC_ISO_10646__ requires a wchar_t to simply
contain a value that is associated with a code point in the Unicode
character set. So wchar_t can simply be any 16-bit to 32-bit integer.
UTF-16 is disqualified for wchar_t since it would mean some wchar_t values
would not be holding a coded representation of a character. I don't see
how you could convert, for example, a UTF-8 string to UTF-16 wchar_t's by
using functions such as mbtowc or vice versa.
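
A program can at least ask the implementation what it provides. The sketch below is only an illustration: the struct and function names are invented, and whether __STDC_ISO_10646__ is visible in C++ mode at all depends on the implementation.

```cpp
#include <climits>
#include <limits>

// Hypothetical summary of what the implementation says about wchar_t.
struct WcharInfo {
    unsigned bits;              // storage width of wchar_t in bits
    bool holds_any_code_point;  // can represent values up to 0x10FFFF
    bool claims_iso10646;       // __STDC_ISO_10646__ is defined
    long iso10646_version;      // its yyyymm value, or 0 if undefined
};

WcharInfo describe_wchar() {
    WcharInfo info;
    info.bits = static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
    info.holds_any_code_point =
        static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
        >= 0x10FFFFul;
#ifdef __STDC_ISO_10646__
    info.claims_iso10646 = true;
    info.iso10646_version = static_cast<long>(__STDC_ISO_10646__);
#else
    info.claims_iso10646 = false;
    info.iso10646_version = 0;
#endif
    return info;
}
```

On an implementation with a 16-bit wchar_t, holds_any_code_point comes out false, which is the case where a single wchar_t cannot be a coded representation of every character.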

~Marc.






Author: invalid@bigfoot.com (Bob Hairgrove)
Date: Sat, 17 Sep 2005 17:10:35 GMT
On 17 Sep 2005 02:10:01 GMT, John Nagle <nagle@animats.com> wrote:

>Marc Schoolderman wrote:
>
>> Wolfgang Draxinger wrote:
>>
>>> The question is, wouldn't it be logical to make std::string
>>> Unicode aware in the next STL version? I18N is an important
>>> topic nowadays and I simply see no logical reason to keep
>>> std::string as limited as it is nowadays. Of course there is
>>> also the wchar_t variant, but actually I don't like that.
>
>    Perhaps a subclass of std::string, such as "std::string_utf8",
>would be appropriate.
>
>    John Nagle
>    Animats

It's not a good idea to derive another class from std::string because
it has no virtual destructor. OTOH, it's perfectly all right to use
implementation inheritance (i.e. private inheritance) or a private
std::string member to hold the data, but then none of the users of the
derived class would have access to std::string's interface unless you
wrapped each of the functions in your derived class.
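
The private-member variant could look roughly like the sketch below. The class name utf8_string and the choice of which members to forward are assumptions made for illustration only; nothing here is a proposed or existing standard class. Because the type is distinct from std::string, a function signature also says whether it expects UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical UTF-8 string type: holds a std::string privately and
// forwards only operations that stay meaningful for UTF-8 text.
class utf8_string
{
public:
    utf8_string() {}
    explicit utf8_string(const std::string& bytes) : bytes_(bytes) {}

    // Forwarded unchanged: these work on bytes and are encoding-neutral.
    bool empty() const { return bytes_.empty(); }
    const char* c_str() const { return bytes_.c_str(); }
    std::size_t size_in_bytes() const { return bytes_.size(); }

    // Concatenating two well-formed UTF-8 strings stays well-formed.
    utf8_string& operator+=(const utf8_string& rhs)
    {
        bytes_ += rhs.bytes_;
        return *this;
    }

    // Read-only access for callers that need the raw bytes.
    const std::string& bytes() const { return bytes_; }

private:
    // A member, not a public base: nobody can delete this object
    // through a std::string pointer.
    std::string bytes_;
};
```

Note that this sketch does not validate its input; a real class would have to decide what to do with ill-formed byte sequences.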

[Shouldn't this be in a FAQ somewhere? Maybe it is, OTOH...]

--
Bob Hairgrove
NoSpamPlease@Home.com

---





Author: lancediduck@nyc.rr.com
Date: Sat, 17 Sep 2005 12:11:32 CST
Raw View
UTF-8 is already in iostream. On just about any platform, when you set
your locale to something with "utf8" support, your library's codecvt
facet will likely convert the UTF-8 to whatever wide char type your
platform supports,
which on some platforms is 16 bits, and on others is 32.

But the main trouble that C++ programmers have with unicode is that
they still want to use it just like arrays of ASCII encoded characters
that you can send to a console command line. That won't work. At the
very least, Unicode assumes that it will be displayed on a graphical
terminal. And there is certainly no "one to one  correspondence"
between the characters rendered by the device and what you see encoded
in your Unicode string.
And don't even ask about Unicode regular expressions or "equality
comparison". Consider JavaScript:
var a='Hello';
var b=' World!';
if ((a+b) == 'Hello World!')
The conditional expression really means "treat each string as a
sequence of UTF-16 code units, assume both are already in the same
Unicode normalization form (the language does not normalize them for
you), and then compare code unit by code unit, returning true if they
all match."

Just like ASCII is not a better way of doing Morse Code, Unicode is not
a better ASCII, but something way different.

---





Author: "Bob Bell" <belvis@pacbell.net>
Date: 17 Sep 2005 20:10:12 GMT
Raw View
John Nagle wrote:
> Marc Schoolderman wrote:
>
> > Wolfgang Draxinger wrote:
> >
> >> The question is, wouldn't it be logical to make std::string
> >> Unicode aware in the next STL version? I18N is an important
> >> topic nowadays and I simply see no logical reason to keep
> >> std::string as limited as it is nowadays. Of course there is
> >> also the wchar_t variant, but actually I don't like that.
>
>     Perhaps a subclass of std::string, such as "std::string_utf8",
> would be appropriate.

This strikes me as a pretty bad idea.

void F(const std::string& str)
{
   // is str utf8 or not?
}

Bob

---





Author: peter.koch.larsen@gmail.com
Date: 14 Sep 2005 05:00:07 GMT
Raw View
Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string; however, doing so has some implications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Correct. Also you can't print it or anything else.

>
> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class Glib::ustring
> <http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

Ok.

>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version?
It already is - using e.g. wchar_t.
> I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays.
It is not limited.
>Of course there is
> also the wchar_t variant, but actually I don't like that.

So you'd like to have Unicode support. And you realize you already have
it. But you don't like it. Why?
>
> Wolfgang Draxinger
> --
>
/Peter

---





Author: invalid@bigfoot.com (Bob Hairgrove)
Date: Wed, 14 Sep 2005 05:00:01 GMT
Raw View
On Tue, 13 Sep 2005 04:20:30 GMT, wdraxinger@darkstargames.de
(Wolfgang Draxinger) wrote:

>I understand that it is perfectly possible to store UTF-8 strings
>in a std::string; however, doing so has some implications.
>E.g. you can't count the number of characters with length() or
>size(). Instead one has to iterate through the string, parse all
>UTF-8 multibyte sequences and count each sequence as one character.

Not only that, but substr(), operator[] etc. pose equally
"interesting" problems.

>To address this problem the GTKmm bindings for the GTK+ toolkit
>have implemented their own string class Glib::ustring
><http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.
>
>The question is, wouldn't it be logical to make std::string
>Unicode aware in the next STL version? I18N is an important
>topic nowadays and I simply see no logical reason to keep
>std::string as limited as it is nowadays. Of course there is
>also the wchar_t variant, but actually I don't like that.
>
>Wolfgang Draxinger

People use std::string in many different ways. You can even store
binary data with embedded null characters in it. I don't know for
sure, but I believe there are already proposals in front of the C++
standards committee for what you suggest. In the meantime, it might
make more sense to use a third-party UTF-8 string class if that is
what you mainly use it for. IBM has released the ICU library as open
source, for example, and it is widely used these days.

--
Bob Hairgrove
NoSpamPlease@Home.com

---





Author: moc.liamtoh@hgnohneb.read.backward ("benben")
Date: Wed, 14 Sep 2005 05:33:07 GMT
Raw View
"Wolfgang Draxinger" <wdraxinger@darkstargames.de> wrote in message
news:q2egv2-lf.ln1@darkstargames.dnsalias.net...
>I understand that it is perfectly possible to store UTF-8 strings
> in a std::string; however, doing so has some implications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.
>
> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class Glib::ustring
> <http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.
>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.
>
> Wolfgang Draxinger

That's why people have std::wstring :)

Ben


---





Author: petebecker@acm.org (Pete Becker)
Date: Wed, 14 Sep 2005 05:36:05 GMT
Raw View
Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string; however, doing so has some implications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Yup. That's what happens when you use the wrong tool.

>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays.

There's much more to internationalization than Unicode. Requiring
std::string to be Unicode aware (presumably that means UTF-8 aware)
would impose implementation overhead that's not needed for the kinds of
things it was designed for, like the various ISO 8859 code sets. In
general, neither string nor wstring knows anything about multi-character
encodings. That's for efficiency. Do the translation on input and output.


> Of course there is
> also the wchar_t variant, but actually I don't like that.
>

That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

---





Author: "Old Wolf" <oldwolf@inspire.net.nz>
Date: 14 Sep 2005 05:40:13 GMT
Raw View
Marc Schoolderman wrote:
> Wolfgang Draxinger wrote:
>
>> The question is, wouldn't it be logical to make std::string
>> Unicode aware in the next STL version? I18N is an important
>> topic nowadays and I simply see no logical reason to keep
>> std::string as limited as it is nowadays. Of course there is
>> also the wchar_t variant, but actually I don't like that.

It's hard to predict the future; if UTF-8 fell out of favour,
then this would look pretty stupid. IMHO, UTF-16 is going to
become the de-facto standard for Unicode encoding, due to
Microsoft and Sun enshrining it.

> - Some implementations don't have <cwchar> or <wchar.h> (from C94),
> and also don't define std::wstring for that reason. I think I
> encountered this problem with OpenBSD and DJGPP.

I had it with GCC 3.4 too (I also have GLIBC 2.1, which was compiled
without wide-character support).

> The second problem has to do with non-conforming implementations,
> so that can't be helped, but the first one should be addressed IMHO.
> The ideal solution would probably be to force wchar_t to always
> represent Unicode code points, but I'm not sure if that's possible.

I think 'ideal' is in the eye of the beholder. You and I may think
that UCS-4 (ie. 32-bit wchar_t) is the ideal solution, but apparently
there are more people who prefer UTF-16, as evinced by Visual C++
and Java, both of which cater to very large developer bases.

---





Author: "msalters" <Michiel.Salters@logicacmg.com>
Date: 14 Sep 2005 05:40:37 GMT
Raw View
Wolfgang Draxinger schreef:

> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string; however, doing so has some implications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Usually correct, but not always. A char is a byte in C++, but
a byte might not be an octet. UTF-8 is of course octet-based.

> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.

wchar_t isn't always Unicode, either. There's a proposal to add an
extra Unicode char type, and that will probably include std::ustring.

However, that is probably a 20+bit type. Unicode itself assigns
numbers to characters, and the numbers have exceeded 65536.
UTF-x means Unicode Transformation Format - x. These formats
map each number to one or more x-bit values. E.g. UTF-8 maps
the number of each unicode character to an octet sequence,
with the additional property that the 0 byte isn't used for
anything but number 0.
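
As an illustration of that mapping, here is a minimal sketch of a code point to UTF-8 encoder. The name to_utf8 is made up for this example. Every octet of a multi-octet sequence has its high bit set, so a zero byte can only come from code point 0.

```cpp
#include <string>

// Encode one code point as UTF-8. Returns an empty string for values that
// are not Unicode scalar values (surrogates or anything above U+10FFFF).
std::string to_utf8(unsigned long cp)
{
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;
    if (cp < 0x80) {                       // 1 octet:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 octets: 11110xxx and three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+20AC becomes the three octets E2 82 AC, and U+1D11E becomes the four octets F0 9D 84 9E.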

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

HTH,
Michiel Salters

---





Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Wed, 14 Sep 2005 14:26:30 GMT
Raw View
"Old Wolf" <oldwolf@inspire.net.nz> wrote in message
news:1126666528.526227.303710@o13g2000cwo.googlegroups.com...

>> - Some implementations don't have <cwchar> or <wchar.h> (from C94),
>> and also don't define std::wstring for that reason. I think I
>> encountered this problem with OpenBSD and DJGPP.
>
> I had it with GCC 3.4 too (I also have GLIBC 2.1, which was compiled
> without wide-character support).

Yep, everyone complains about export being unsupported by all
but one implementation, but here is a much more common failure.
Only a few implementations make a serious attempt to provide
all the C95 stuff mandated by the C++ Standard. That becomes
important when you start working with large character sets.
(BTW, Dinkumware provides full support in this area, and has
for the past decade.)

>> The second problem has to do with non-conforming implementations,
>> so that can't be helped, but the first one should be addressed IMHO.
>> The ideal solution would probably be to force wchar_t to always
>> represent Unicode code points, but I'm not sure if that's possible.
>
> I think 'ideal' is in the eye of the beholder. You and I may think
> that UCS-4 (ie. 32-bit wchar_t) is the ideal solution, but apparently
> there are more people who prefer UTF-16, as evinced by Visual C++
> and Java, both of which cater to very large developer bases.

I see uses for a whole collection of Unicode encodings, so I think
the C and C++ Standards got it right in decoupling encodings from
their respective languages. But you then need an add-on library
to finish the job. (See, for example, our CoreX library.)

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---





Author: "kanze" <kanze@gabi-soft.fr>
Date: 14 Sep 2005 14:40:14 GMT
Raw View
Old Wolf wrote:
> Marc Schoolderman wrote:
> > Wolfgang Draxinger wrote:

> >> The question is, wouldn't it be logical to make std::string
> >> Unicode aware in the next STL version? I18N is an important
> >> topic nowadays and I simply see no logical reason to keep
> >> std::string as limited as it is nowadays. Of course there
> >> is also the wchar_t variant, but actually I don't like
> >> that.

> It's hard to predict the future; if UTF-8 fell out of favour,
> then this would look pretty stupid. IMHO, UTF-16 is going to
> become the de-facto standard for Unicode encoding, due to
> Microsoft and Sun enshrining it.

The Internet uses 8 bit bytes, and that isn't likely to change
anytime soon.  And wchar_t is 32 bits under Solaris, so what's
this about Sun enshrining UTF-16, other than in Java.

For better or for worse, I don't think we're going to see a
de-facto standard for anything concerning character encoding,
anytime soon.

> I think 'ideal' is in the eye of the beholder. You and I may
> think that UCS-4 (ie. 32-bit wchar_t) is the ideal solution,
> but apparently there are more people who prefer UTF-16, as
> evinced by Visual C++ and Java, both of which cater to very
> large developer bases.

The traditional position of C/C++ is that character encodings
are an implementation issue, and that the standard doesn't say
anything either way.  I think that this is slowly changing, and
that some optional support for Unicode is creeping into C (much
like the optional support for IEEE floating point).  I find it
reasonable to define such support for all three Unicode
representations: UTF-8, UTF-16 and UTF-32.  Of course, since it
would be optional, you aren't guaranteed of any one particular
one, if that's what you want to use:-).  (The current proposals
in C seem to only concern UTF-16 and UTF-32.  It's also not
clear after a quick reading whether they support surrogates or
not.)

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


---





Author: "kanze" <kanze@gabi-soft.fr>
Date: 14 Sep 2005 14:40:21 GMT
Raw View
msalters wrote:
> Wolfgang Draxinger schreef:

    [...]
> However, that is probably a 20+bit type.  Unicode itself
> assigns numbers to characters, and the numbers have exceeded
> 65536.  UTF-x means Unicode Transformation Format - x.  These
> formats map each number to one or more x-bit values.
> E.g. UTF-8 maps the number of each unicode character to an
> octet sequence, with the additional property that the 0 byte
> isn't used for anything but number 0.

It has a lot more additional properties than that.  Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.
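
That self-describing property can be sketched as a classifier that looks at one byte only. The enum and function names below are invented for this example.

```cpp
// What a single byte can be inside UTF-8 text.
enum utf8_byte_kind { single_byte, lead_byte, continuation_byte, invalid_byte };

// Classify one byte without looking at its neighbours.
utf8_byte_kind classify(unsigned char b)
{
    if (b < 0x80)               return single_byte;        // 0xxxxxxx
    if ((b & 0xC0) == 0x80)     return continuation_byte;  // 10xxxxxx
    if (b >= 0xC2 && b <= 0xF4) return lead_byte;          // starts a 2, 3 or 4 byte sequence
    return invalid_byte;        // 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8
}
```

A scanner that lands in the middle of a buffer can use this to step backwards or forwards to the nearest sequence start.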

> Now, these formats are intended for data transfer and not data
> processing.  That in turn means UTF-8 should go somewhere in
> <iostream>, if it's added.

I don't know where you find that these formats are intended just
for data transfer.  Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32.  For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats.  (It's also required in some cases.)

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


---





Author: Marc Schoolderman <squell@alumina.nl>
Date: Wed, 14 Sep 2005 12:57:55 CST
Raw View
Old Wolf wrote:
> I think 'ideal' is in the eye of the beholder. You and I may think
> that UCS-4 (ie. 32-bit wchar_t) is the ideal solution, but apparently
> there are more people who prefer UTF-16, as evinced by Visual C++
> and Java, both of which cater to very large developer bases.

If they use UTF16, would this actually "fit" the intended use of a
std::wstring, due to UTF16's use of surrogate pairs? Or do they simply
use UCS2?

P.J. Plauger wrote:
> I see uses for a whole collection of Unicode encodings, so I think
> the C and C++ Standards got it right in decoupling encodings from
> their respective languages. But you then need an add-on library
> to finish the job. (See, for example, our CoreX library.)

I agree specific multibyte/external encodings (such as UTF8 or UTF16)
should not be enshrined, however I see less reason why wchar_t can still
be something other than a representation of ISO10646. Especially
considering it *is* used for the \uXXXX escape sequences.

I think it's reasonable to expect that L'\u0131' == 0x0131, and afaik
that's not guaranteed right now.
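
A program can at least test what its own implementation promises. The sketch below is only an assumption about how one might probe this; the function name is hypothetical. It checks the sample character only when the implementation defines __STDC_ISO_10646__, and reports "nothing to check" otherwise.

```cpp
// Returns true when the implementation either makes no ISO 10646 promise
// (macro undefined) or keeps that promise for this one sample character.
bool iso10646_promise_holds()
{
#ifdef __STDC_ISO_10646__
    return L'\u0131' == 0x0131;   // wchar_t value equals the code point
#else
    return true;                  // nothing promised, so nothing to check
#endif
}
```

This says nothing about code points that do not fit in wchar_t on platforms where it is 16 bits wide.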

~Marc.

---





Author: usenet-nospam@nmhq.net (Niklas Matthies)
Date: Wed, 14 Sep 2005 23:22:14 GMT
Raw View
On 2005-09-14 18:57, Marc Schoolderman wrote:
> Old Wolf wrote:
>> I think 'ideal' is in the eye of the beholder. You and I may think
>> that UCS-4 (ie. 32-bit wchar_t) is the ideal solution, but apparently
>> there are more people who prefer UTF-16, as evinced by Visual C++
>> and Java, both of which cater to very large developer bases.
>
> If they use UTF16, would this actually "fit" the intended use of a
> std::wstring, due to UTF16's use of surrogate pairs? Or do they
> simply use UCS2?

No, they use UTF-16. For applications that don't care, pretending that
it's merely UCS-2 is just more practical than pretending that UTF-8 is
an 8-bit encoding (which doesn't work for anything non-ASCII).

-- Niklas Matthies

---





Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Wed, 14 Sep 2005 23:22:47 GMT
Raw View
"Marc Schoolderman" <squell@alumina.nl> wrote in message
news:43284358$0$721$5fc3050@dreader2.news.tiscali.nl...

> Old Wolf wrote:
>> I think 'ideal' is in the eye of the beholder. You and I may think
>> that UCS-4 (ie. 32-bit wchar_t) is the ideal solution, but apparently
>> there are more people who prefer UTF-16, as evinced by Visual C++
>> and Java, both of which cater to very large developer bases.
>
> If they use UTF16, would this actually "fit" the intended use of a
> std::wstring, due to UTF16's use of surrogate pairs? Or do they simply use
> UCS2?

No, UTF-16 does not meet the requirements intended for a wide-character
encoding. But since people use it anyway...

> P.J. Plauger wrote:
>> I see uses for a whole collection of Unicode encodings, so I think
>> the C and C++ Standards got it right in decoupling encodings from
>> their respective languages. But you then need an add-on library
>> to finish the job. (See, for example, our CoreX library.)
>
> I agree specific multibyte/external encodings (such as UTF8 or UTF16)
> should not be enshrined, however I see less reason why wchar_t can still
> be something other than a representation of ISO10646.

When you say ISO 10646, do you mean UCS-2, UCS-4, or UTF-16?

>                                                     Especially considering
> it *is* used for the \uXXXX escape sequences.
> I think it's reasonable to expect that L'\u0131' == 0x0131, and afaik
> that's not guaranteed right now.

Maybe you don't see a need for other encodings, but quite a few other
codes have been used and are still in use around the world --
particularly in cultures that really *need* support for large character
sets.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---





Author: squell@alumina.nl (Marc Schoolderman)
Date: Fri, 16 Sep 2005 02:26:37 GMT
Raw View
P.J. Plauger wrote:

>>>I see uses for a whole collection of Unicode encodings, so I think
>>>the C and C++ Standards got it right in decoupling encodings from
>>>their respective languages. But you then need an add-on library
>>>to finish the job. (See, for example, our CoreX library.)
>>I agree specific multibyte/external encodings (such as UTF8 or UTF16)
>>should not be enshrined, however I see less reason why wchar_t can still
>>be something other than a representation of ISO10646.
> When you say ISO 10646, do you mean UCS-2, UCS-4, or UTF-16?

I mean similar to what C99 requires for __STDC_ISO_10646__.

>>I think it's reasonable to expect that L'\u0131' == 0x0131, and afaik
>>that's not guaranteed right now.
> Maybe you don't see a need for other encodings, but quite a few other
> codes have been used and are still in use around the world --
> particularly in cultures that really *need* support for large character
> sets.

But afaik, that's largely for historical reasons, because they needed
wide character sets before there was ISO 10646. I can imagine people
will probably object to requiring a conversion to Unicode wchar_t's by
functions such as mbsrtowcs() on grounds of compatibility or efficiency,
but it would make things easier.

~Marc

---





Author: "msalters" <Michiel.Salters@logicacmg.com>
Date: Thu, 15 Sep 2005 21:23:05 CST
Raw View
kanze schreef:

> msalters wrote:
> > Wolfgang Draxinger schreef:
>
>     [...]
> > However, that is probably a 20+bit type.  Unicode itself
> > assigns numbers to characters, and the numbers have exceeded
> > 65536.  UTF-x means Unicode Transformation Format - x.  These
> > formats map each number to one or more x-bit values.
> > E.g. UTF-8 maps the number of each unicode character to an
> > octet sequence, with the additional property that the 0 byte
> > isn't used for anything but number 0.
>
> It has a lot more additional properties than that.  Like the
> fact that you can immediately tell whether a byte is a single
> byte character, the first byte of a multibyte sequence, or a
> following byte in a multibyte sequence, without looking beyond
> just that byte.

Yep, that makes scanning through a byte sequence a lot easier.
However, that's not very important for std::string. .substr()
can't do anything useful with it. For .c_str(), the non-null
property is important.
Of course, for an utf8string type, these additional properties
make implementations a lot easier. UTF8 is quite a good encoding
actually.
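
As a rough sketch of what such a utf8string type could build on, the helpers below take a substring by code point index instead of byte index. All names are hypothetical, and the code assumes its input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Byte offset of the code point with index n, or s.size() if the string
// has fewer code points. Assumes s holds well-formed UTF-8.
std::size_t byte_offset(const std::string& s, std::size_t n)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (n == 0) return i;   // found the start of code point n
            --n;
        }
    }
    return s.size();
}

// Substring of count code points starting at code point index first.
std::string utf8_substr(const std::string& s, std::size_t first, std::size_t count)
{
    const std::size_t b = byte_offset(s, first);
    const std::size_t e = byte_offset(s, first + count);
    return s.substr(b, e - b);
}
```

On the bytes of "na\xC3\xAFve", plain substr(2, 1) returns the lone lead byte C3, while utf8_substr with the same arguments returns the complete two-byte sequence. Each call is linear in the string length, which is the price of indexing by code point.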

> > Now, these formats are intended for data transfer and not data
> > processing.  That in turn means UTF-8 should go somewhere in
> > <iostream>, if it's added.
>
> I don't know where you find that these formats are intended just
> for data transfer.  Depending on what the code is doing (and the
> text it has to deal with), the ideal solution may be UTF-8,
> UTF-16 or UTF-32.  For most of what I do, UTF-8 would be more
> appropriate, including internally, than any of the other
> formats.  (It's also required in some cases.)

Getting a substring, uppercasing, finding characters, replacing
characters: all common string operations, but non-trivial in UTF8.
Saving to file, sending over TCP/IP, or to mobile devices: all
common I/O operations, and UTF8 makes it easy.

Regards,
Michiel Salters

---





Author: John Nagle <nagle@animats.com>
Date: 17 Sep 2005 02:10:01 GMT
Raw View
Marc Schoolderman wrote:

> Wolfgang Draxinger wrote:
>
>> The question is, wouldn't it be logical to make std::string
>> Unicode aware in the next STL version? I18N is an important
>> topic nowadays and I simply see no logical reason to keep
>> std::string as limited as it is nowadays. Of course there is
>> also the wchar_t variant, but actually I don't like that.

    Perhaps a subclass of std::string, such as "std::string_utf8",
would be appropriate.

    John Nagle
    Animats

---





Author: "kanze" <kanze@gabi-soft.fr>
Date: Fri, 16 Sep 2005 21:05:20 CST
Raw View
msalters wrote:
> kanze schreef:

    [...]
> > I don't know where you find that these formats are intended
> > just for data transfer.  Depending on what the code is doing
> > (and the text it has to deal with), the ideal solution may
> > be UTF-8, UTF-16 or UTF-32.  For most of what I do, UTF-8
> > would be more appropriate, including internally, than any of
> > the other formats.  (It's also required in some cases.)

> Getting a substring, uppercasing, finding characters,
> replacing characters: all common string operations, but
> non-trivial in UTF8.

I said "for most of what I do".  Comparing for equality, using
as keys in std::set or an unordered_set, for example.  UTF-8
works fine, and because it uses less memory, it will result in
better overall performance (fewer cache misses, less paging,
etc.).
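
A small sketch of the std::set case, with a made-up helper name and sample keys written as escapes so the source stays ASCII. The keys are compared byte for byte, so this only treats strings as equal when their encoded bytes are identical; it does not recognize canonically equivalent but differently composed text.

```cpp
#include <set>
#include <string>

// Build a set of UTF-8 encoded keys using std::string's ordinary ordering.
std::set<std::string> make_names()
{
    std::set<std::string> names;
    names.insert("z");               // U+007A
    names.insert("\xC3\xA9");        // U+00E9
    names.insert("\xE2\x82\xAC");    // U+20AC
    names.insert("\xC3\xA9");        // same bytes again, so not inserted twice
    return names;
}
```

On implementations where std::string compares chars as unsigned values (which C++11 requires of char_traits<char>), the resulting order also matches code point order: U+007A, U+00E9, U+20AC.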

In other cases, I've been dealing with binary input, with
embedded UTF-8 strings.  Which means that I cannot translate
directly on input, only once I've parsed the binary structure
enough to know where the strings are located.  In the last
application, the strings were just user names and passwords --
again, no processing which wouldn't work just fine in UTF-8.

Imagine a C++ compiler.  The only place UTF-8 might cause some
added difficulty is when scanning a symbol -- and even there, I
can imagine some fairly simple solutions.  For all of the
rest... the critical delimiters can all be easily recognized in
UTF-8, and of course, once past scanning, we're talking about
symbol table management, and perhaps concatenation (to generate
mangled names), but they're both easily done in UTF-8.  All in
all, I think a C++ compiler would be a good example of an
application where using UTF-8 as the internal encoding would
make sense.
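
The delimiter point can be sketched as follows: bytes below 0x80 never occur inside a multibyte UTF-8 sequence, so an ordinary byte search for an ASCII delimiter cannot split a character. The function name split_ascii is invented for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split UTF-8 text on an ASCII delimiter using plain byte search.
std::vector<std::string> split_ascii(const std::string& s, char delim)
{
    // Only safe for ASCII delimiters; other bytes may sit inside a sequence.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));      // last (or only) piece
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;                           // skip the delimiter
    }
    return parts;
}
```

Each returned piece is still well-formed UTF-8 whenever the input was.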

> Saving to file, sending over TCP/IP, or to mobile devices: all
> common I/O operations, and UTF8 makes it easy.

The external world is byte oriented.  That's for sure.  UTF-8
(or some other 8 bit format) is definitely required for external
use.  But there are numerous cases where UTF-8 is also a good
choice for internal use as well; why bother with the conversions
and the added memory overhead if it doesn't buy you anything?

--
James Kanze                                           GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34







Author: wdraxinger@darkstargames.de (Wolfgang Draxinger)
Date: Tue, 13 Sep 2005 04:20:30 GMT
Raw View
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.
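
The iteration described above can be sketched as follows, assuming the
input is well-formed UTF-8. The name count_code_points is hypothetical.
It counts code points, which is not always the same as characters a user
would perceive (combining marks are separate code points).

```cpp
#include <cstddef>
#include <string>

// Counts code points in text assumed to be well-formed UTF-8.
// Each code point has exactly one byte that is not a continuation byte,
// so counting those bytes gives the number of code points.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80)   // skip continuation bytes (10xxxxxx)
            ++n;
    }
    return n;
}
```

For a string holding two ASCII letters followed by two two-byte
characters, size() reports 6 while this function reports 4.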

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger
--






Author: squell@alumina.nl (Marc Schoolderman)
Date: Tue, 13 Sep 2005 21:51:13 GMT
Raw View
Wolfgang Draxinger wrote:

> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.

I think the rationale behind C++ (and actually, C as well) is that
UTF8 is an 'external format' and you would always manipulate strings in
the 'internal format'. So to manipulate Unicode, you would have to use
'wstring' instead. I think GTK+ uses UTF8 in order to save them from
duplicating their API by adding wide-character functions.

Read: http://gcc.gnu.org/ml/libstdc++/1999-q2/msg00182.html

A problem, which I've butted my head against, is that:

- There is no guaranteed conversion of any multibyte encoding (not just
UTF8) to UCS2/UCS4 in C++'s <locale>. C99's __STDC_ISO_10646__ doesn't
help much either.

- Some implementations don't have <cwchar> or <wchar.h> (from C94), and
also don't define std::wstring for that reason. I think I encountered
this problem with OpenBSD and DJGPP.

The second problem has to do with non-conforming implementations, so
that can't be helped, but the first one should be addressed IMHO. The
ideal solution would probably be to force wchar_t to always represent
Unicode code points, but I'm not sure if that's possible.
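
Since <locale> guarantees no such conversion, one workaround is to write
the decoder by hand. The following is only a sketch under stated
assumptions: the name utf8_to_utf32 is made up for illustration, and it
uses char32_t and std::u32string from C++11 (which postdates this
thread) to sidestep the varying width of wchar_t.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 code points.
// Throws std::runtime_error on truncated or malformed input.
inline std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append six payload bits
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const char32_t min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Throwing on bad input is a design choice of this sketch; substituting
U+FFFD for each malformed sequence is a common alternative.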

~Marc.






Author: niels@dybdahl.dk ("Niels Dybdahl")
Date: Wed, 14 Sep 2005 04:57:50 GMT
Raw View
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.

It is much easier to handle Unicode strings with wchar_t internally and
there is much less confusion about whether the string is ANSI or UTF8
encoded. So I have started using wchar_t wherever I can and I only use UTF8
for external communication.
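
The outbound half of that approach (wide characters inside, UTF-8 at the
boundary) could look roughly like the sketch below. The names
append_utf8 and utf32_to_utf8 are hypothetical, and the code assumes the
internal strings hold one Unicode code point per element, which is why
it uses std::u32string (C++11) rather than wstring: wchar_t is only 16
bits wide on some platforms.

```cpp
#include <stdexcept>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to 'out'.
// Throws std::invalid_argument for surrogates and values above U+10FFFF.
inline void append_utf8(std::string& out, char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx + 2 continuation
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encodes a whole string of code points as UTF-8 for output.
inline std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (std::u32string::size_type i = 0; i < in.size(); ++i)
        append_utf8(out, in[i]);
    return out;
}
```

The result can be written to a file or socket as plain bytes.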

Niels Dybdahl







Author: john_andronicus@hotmail.com (John Harrison)
Date: Wed, 14 Sep 2005 04:58:02 GMT
Raw View
Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so has some implications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.
>
> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class Glib::ustring
> <http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.
>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.
>
> Wolfgang Draxinger

UTF-8 is only an encoding; why do you think strings internal to the
program should be represented as UTF-8? It makes more sense to me to
translate to or from UTF-8 when you input or output strings from your
program. C++ already has the framework in place for that.

john
