Topic: Unicode in C++


Author: glen stark <g.a.stark@gmail.com>
Date: Tue, 11 Aug 2015 08:31:25 -0700 (PDT)

Hi everyone.

I've been thinking a great deal about Unicode support in C++, and I think
it's very important that we make progress on this issue in C++17.  I would
like to share my thoughts on the subject, even if they are somewhat naive,
hoping that I will either learn enough to make a meaningful contribution
or spur a conversation that leads someone else to do so.

Currently in C++, I think the best-practice approach is to use std::string
to store a UTF-8 encoded string, and to use locales and external tools like
Boost::locale and utf8cpp where necessary to fill in the gaps.  In
particular, Boost::locale seems to offer a valuable prototype for providing
Unicode support in a modern C++ way.  I think any good solution will extend
and simplify this approach rather than replace it.  I also think a good
solution for this use case would extend to UTF-16 and UTF-32 in obvious
ways, and even allow support for other encodings (although I wouldn't
propose adding these encodings to the standard).  I have some thoughts I
would like to share, to see if it is worth pursuing them further.  If there
are problems in the details, please let me know, but above all let me know
whether the overall idea is worth pursuing.

When working with Unicode text, it seems to me one must distinguish between
the following three behaviors:
   1.  composed-character awareness:  a single display character might be
composed of multiple code points, e.g. an A followed by a combining
ring-above is a single display character.
   2.  multi-char_t awareness:  both UTF-8 and UTF-16 might use multiple
char_t's for a single codepoint.
   3.  char_t awareness:  the current standard.
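
As a rough illustration of how levels 2 and 3 differ for the same text, the sketch below counts code units and code points in a UTF-8 string. The helper name `count_code_points` is hypothetical and not from the thread, and the code assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

void demo_levels() {
    // 'A' followed by U+030A COMBINING RING ABOVE, written as raw UTF-8 bytes.
    const std::string a_ring = "A\xCC\x8A";
    assert(a_ring.size() == 3);              // level 3: three char code units
    assert(count_code_points(a_ring) == 2);  // level 2: two code points
    // Level 1 would report a single display character; that requires
    // Unicode segmentation data and is not attempted here.
}
```

Level 1 is deliberately left out: finding display-character boundaries needs Unicode property tables, which is the part the rest of the thread argues about.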


Now let's consider modifying the existing standard to support all three of
these cases.  Currently, 3 is the default behavior, and if I want to get
std::regex, std::sort etc. to play nicely with 1 or 2, I must pass locale
information.  This should remain the case, so as not to break backward
compatibility.  However, I believe 2 and 3 to be primarily of interest to
library developers, and 1 (composed-character awareness) to be the most
common use case for application developers.  Thus we need to provide 3 for
compatibility, 2 and 3 for library development, and 1 for most use cases.

Now, I assume it would be uncontroversial to create a set of algorithms
that allow us to iterate over the various Unicode encodings, and to extend
locale to provide access to them, similarly to how we currently access
collate() for comparison.
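
One possible shape for such an algorithm is a function that walks a UTF-8 string and yields code points. This is only a sketch: `decode_utf8` is a hypothetical name, it is not tied to std::locale in any way, and it assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 string into a sequence of code points.
// Assumption: the input is well-formed UTF-8. There are no checks for
// stray continuation bytes, overlong forms, or surrogate values.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

void demo_decode() {
    const std::vector<char32_t> cps = decode_utf8("A\xCC\x8A");
    assert(cps.size() == 2);
    assert(cps[0] == U'A');
    assert(cps[1] == 0x030A);  // COMBINING RING ABOVE
}
```

A standard facility would more likely expose this as an iterator or range than as a function returning a vector; the vector just keeps the sketch short.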

One consistent option for extending std::string would be to have the
std::string methods accept a locale parameter, so that std::begin(locale)
would provide a locale/Unicode-aware iterator, and
std::string::substr(pos, len) would use that iterator to get len... what?
Codepoints?  Display characters?  The same goes for size()...  We could
introduce a locale_awareness parameter too, and then we'd have
std::substr(locale, LA_DISPLAY, pos, len), but I think that would lead to
excessively verbose code.


What if std::basic_string were to have 2 additional, optional template
parameters:
   1. locale-awareness : display, codepoint, none.
   2. locale.

Under this scheme, existing code would continue to work as it currently
works.  If I have code that really only cares about display characters (a
common case), I can create a string:


// Proposed syntax; this does not compile against the current standard.
typedef std::string<locale::display> u8_string;
u8_string foo;
// do stuff.
foo.size();          // gets the locale-aware iterator from the global locale
                     // and uses it to return the number of display chars.
foo.substr(pos, 3);  // three 'letters' of the string, starting at pos.
std::vector<u8_string> sortme;
// populate the vector.
std::sort(sortme.begin(), sortme.end());  // use the global locale to sort.
std::sort(sortme.begin(), sortme.end(), another_locale);
                     // sort according to another_locale's rules.

static_cast<std::string>(foo).size();  // get the number of bytes.


What does everyone think, is this worth pursuing?

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.


.


Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 11 Aug 2015 09:13:58 -0700
On Tuesday 11 August 2015 08:31:25 glen stark wrote:
> When working with unicode text, it seems to me one must distinguish between
> the following three behaviors:
>    1.  composed-character awareness:  a single display character might be
> composed of multiple code points, eg an A followed a ring-above is a single
> display character.
>    2.  multi-char_t awareness:  both utf-8 and utf-16 might use multiple
> char_t's for a single codepoint.
>    3.  char_t awareness:  the current standard.
>
>
> Now let's consider modifying the existing standard to support all three of
> these cases.  Currently, 3 is the default behavior, and if I want to get
> std::regex, std::sort etc to play nice with 1 or 2, I must pass locale
> information.  This should remain the case, so as not to break backward
> compatibility.  I believe 2 & 3 to be primarily of interest for library
> developers however and 1 (composed-character awareness) to be the most
> common use case for application developers.  Thus we need to provide 3 for
> compatibility, 2&3 for library development, and 1 for most use cases.

I disagree with your assessment.

15 years of Unicode experience in Qt (QString) have shown that very, very few
people require support for composed-character awareness. The one that shows up
most commonly is OS X filenames, because for some unfathomable reason Apple
decided that they should be NFD.

Most applications only need to know that a character is a character, whether
it is zero-, single- or double-width. The most common scenario is actually 2 &
3 and you can remove the difficulty completely by using UCS-4. The number of
uses that require multibyte awareness is so small that UTF-16 is usually more
efficient for most uses.
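
The trade-off between UTF-16 and UCS-4 can be seen with the standard string types alone. The snippet below is only an illustration of the point about code units per code point, not a statement about how QString behaves.

```cpp
#include <cassert>
#include <string>

void demo_fixed_width() {
    // U+1F600 lies outside the Basic Multilingual Plane.
    const std::u16string utf16 = u"\U0001F600";
    const std::u32string utf32 = U"\U0001F600";
    assert(utf16.size() == 2);  // one code point, two code units (a surrogate pair)
    assert(utf32.size() == 1);  // one code point, one code unit

    // For text inside the BMP the two encodings agree on length.
    assert(std::u16string(u"\u00C5").size() == 1);
    assert(std::u32string(U"\u00C5").size() == 1);
}
```

With std::u32string, indexing and size() operate on whole code points, which is what removes the distinction between cases 2 and 3.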

Code that needs to deal with character width is usually that which does text
shaping. In most cases, that implies using font metrics to convert a text
string to a width in pixels, not a width in spacing (ex).

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 11 Aug 2015 11:15:44 -0700 (PDT)

I have to agree with Thiago's assessment that this is really the wrong way
to go. Indeed, you don't seem to fully understand the various elements at
play in Unicode, which leads to the problem.

You correctly break Unicode strings down into three forms (the encoded
sequence, the codepoint sequence, and the grapheme cluster sequence), but
you don't use Unicode terminology for them. You also don't deal with
normalization with regard to comparison, which is *absolutely crucial*
when attempting to compare two Unicode strings.
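
A small sketch of why normalization matters for comparison: the same visible character can be stored as one code point or as two, and a plain code point comparison treats the two spellings as different strings. Only standard string types are used; no normalization is performed here.

```cpp
#include <cassert>
#include <string>

void demo_normalization_mismatch() {
    const std::u32string precomposed = U"\u00C5";   // one code point (NFC form)
    const std::u32string decomposed  = U"A\u030A";  // base letter + combining ring (NFD form)

    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);
    // Both render as the same character, yet they do not compare equal.
    assert(precomposed != decomposed);
}
```

Making these compare equal requires bringing both strings to the same normalization form first, which in turn needs Unicode decomposition data.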

There was already an attempt to standardize Unicode use in C++
<http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3572.html>, but
unfortunately it did not progress very far. It was a solid approach, and
one that should be followed up on. And while there are many functional ways
of devising a Unicode string, trying to shoehorn Unicode functionality into
std::basic_string is absolutely the wrong way to do it.

Also, any improvements on it should recognize the importance of ranges,
particularly Unicode codepoint iterator ranges.


.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Tue, 11 Aug 2015 15:56:46 -0400
On 2015-08-11 12:13, Thiago Macieira wrote:
> On Tuesday 11 August 2015 08:31:25 glen stark wrote:
>> When working with unicode text, it seems to me one must distinguish between
>> the following three behaviors:
>>    1.  composed-character awareness:  a single display character might be
>> composed of multiple code points, eg an A followed a ring-above is a single
>> display character.
>>    2.  multi-char_t awareness:  both utf-8 and utf-16 might use multiple
>> char_t's for a single codepoint.
>>    3.  char_t awareness:  the current standard.
>>
>> Now let's consider modifying the existing standard to support all three of
>> these cases.  Currently, 3 is the default behavior, and if I want to get
>> std::regex, std::sort etc to play nice with 1 or 2, I must pass locale
>> information.  This should remain the case, so as not to break backward
>> compatibility.  I believe 2 & 3 to be primarily of interest for library
>> developers however and 1 (composed-character awareness) to be the most
>> common use case for application developers.  Thus we need to provide 3 for
>> compatibility, 2&3 for library development, and 1 for most use cases.
>
> I disagree with your assessment.
>
> 15 years of Unicode experience in Qt (QString) have shown that very, very few
> people require support for composed-character awareness.

Just because support isn't "required" doesn't mean they aren't doing it
wrong :-).

Let's say I have the string "can͂on", composed of six codepoints, and I
want to take or remove the first three "letters", or the last three. How
many programs get that correct and don't either transform the 'n͂' to a
'n', or leave a dangling combining codepoint? (How many actually
understand, or have even thought about, the difference between (1) and (2)?)

It's likely that in many cases, splitting is happening at known and
non-combining codepoints, and it's just luck that things usually turn
out okay.
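
The failure mode with the six-codepoint example string can be reproduced with std::u32string, where each element is one code point. This is a sketch of the problem only; it uses U+0342 as the combining mark because that is the code point in the example string.

```cpp
#include <cassert>
#include <string>

void demo_naive_truncation() {
    // "can" + U+0342 + "on": six code points, five user-perceived letters.
    const std::u32string word = U"can\u0342on";
    assert(word.size() == 6);

    const std::u32string head = word.substr(0, 3);  // intended: the first three letters
    const std::u32string tail = word.substr(3);     // intended: everything after them

    assert(head == U"can");        // the mark has been stripped from the 'n'
    assert(tail[0] == U'\u0342');  // the rest starts with a dangling combining mark
}
```

Counting in code points is therefore not enough to cut at a letter boundary, even in a fixed-width encoding.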

On the same note, how do text editors deal with these sorts of issues?
Presumably single "characters" represented by two codepoints should
still be manipulated (caret movement, selection, deletion) as single
characters, no? At least, this sort of thing is what I recall as always
coming up when people start talking about pathological cases of
character counting.

> Most applications only need to know that a character is a character, whether
> it is zero-, single- or double-width.

How would you have a zero-width character? (We are talking about the number
of bytes in the representation, are we not? Render width is a whole other
kettle of fish and is only loosely related to what I understood Glen to
be talking about with (1). The issue as I understood it is differentiating
between what a user considers a "character", and codepoints, e.g.
<U00C5> 'Å' vs. <U0041>+<U030A> 'Å'. And FYI, QTextEdit handles that
rather poorly ;-).)

--
Matthew


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 11 Aug 2015 15:44:33 -0700 (PDT)

On Tuesday, August 11, 2015 at 3:56:58 PM UTC-4, Matthew Woehlke wrote:
>
> On 2015-08-11 12:13, Thiago Macieira wrote:
> > On Tuesday 11 August 2015 08:31:25 glen stark wrote:
> >> When working with unicode text, it seems to me one must distinguish
> >> between the following three behaviors:
> >>    1.  composed-character awareness:  a single display character might
> >> be composed of multiple code points, eg an A followed a ring-above is a
> >> single display character.
> >>    2.  multi-char_t awareness:  both utf-8 and utf-16 might use
> >> multiple char_t's for a single codepoint.
> >>    3.  char_t awareness:  the current standard.
> >>
> >> Now let's consider modifying the existing standard to support all three
> >> of these cases.  Currently, 3 is the default behavior, and if I want to
> >> get std::regex, std::sort etc to play nice with 1 or 2, I must pass locale
> >> information.  This should remain the case, so as not to break backward
> >> compatibility.  I believe 2 & 3 to be primarily of interest for library
> >> developers however and 1 (composed-character awareness) to be the most
> >> common use case for application developers.  Thus we need to provide 3
> >> for compatibility, 2&3 for library development, and 1 for most use cases.
> >
> > I disagree with your assessment.
> >
> > 15 years of Unicode experience in Qt (QString) have shown that very,
> > very few people require support for composed-character awareness.
>
> Just because support isn't "required" doesn't mean they aren't doing it
> wrong :-).
>
> Let's say I have the string "can͂on", composed of six codepoints, and I
> want to take or remove the first three "letters", or the last three. How
> many programs get that correct and don't either transform the 'n͂' to a
> 'n', or leave a dangling combining codepoint? (How many actually
> understand, or have even thought about, the difference between (1) and
> (2)?)
>
> It's likely that in many cases, splitting is happening at known and
> non-combining codepoints, and it's just luck that things usually turn
> out okay.
>
> On the same note, how do text editors deal with these sorts of issues?
> Presumably single "characters" represented by two codepoints should
> still be manipulated (caret movement, selection, deletion) as single
> characters, no? At least, this sort of thing is what I recall as always
> coming up when people start talking about pathological cases of
> character counting.
>
>

Yes, it is important to be able to access a Unicode-encoded codepoint
sequence by grapheme clusters (Unicode nerd-speak for "visible character").

What is *not* important to do is actually *count* the number of grapheme
clusters in a Unicode sequence. Or more to the point, if you have a Unicode
string, it should not have a `grapheme_cluster_length` function on it.

When doing text layout, you don't count the grapheme clusters themselves
(for any non-fixed-width text). You count the width of each grapheme
cluster (or more accurately, you do a whole bunch of gymnastics to allow
for kerning pairs and other such things). That's not something that
`grapheme_cluster_length` will be useful for.

If you need to delete the first three grapheme clusters in a
Unicode-encoded string, you can do this very easily. You get a grapheme
cluster iterator to the beginning, increment it three times, convert it back
to a codepoint iterator, and erase everything before it from the string
using the available iterator-based erasure functions. Or if you're feeling
fancy, you could do some range-based cuteness.
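
The iterator-based deletion described above might look roughly like the following. Everything here is an assumption made for illustration: there is no standard grapheme cluster iterator, the names `extends_cluster`, `next_cluster` and `erase_leading_clusters` are hypothetical, and the boundary rule is a gross simplification of real Unicode text segmentation.

```cpp
#include <cassert>
#include <string>

// Gross simplification, for illustration only: treat code points in the
// Combining Diacritical Marks block (U+0300..U+036F) as extending the
// preceding cluster. Real segmentation needs Unicode property data.
bool extends_cluster(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Advance a code point iterator to the start of the next (simplified) cluster.
std::u32string::const_iterator next_cluster(std::u32string::const_iterator it,
                                            std::u32string::const_iterator end) {
    if (it == end) return end;
    ++it;                                            // consume the base code point
    while (it != end && extends_cluster(*it)) ++it;  // and any trailing marks
    return it;
}

// Erase the first n clusters by walking iterators, never integer offsets
// into the string.
void erase_leading_clusters(std::u32string& s, int n) {
    auto it = s.cbegin();
    for (int i = 0; i < n; ++i) it = next_cluster(it, s.cend());
    s.erase(s.cbegin(), it);
}

void demo_erase_clusters() {
    std::u32string word = U"can\u0342on";
    erase_leading_clusters(word, 3);  // removes 'c', 'a', and 'n' + U+0342
    assert(word == U"on");
}
```

The caller never has to decide whether "3" means code units, code points, or clusters as an index; the cluster walk produces an ordinary code point iterator that the string's existing erase overload accepts.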

You don't want to rely on integer indices. It's *way* too easy for a user
to not know whether an index should be in code units, points, or clusters.
To make it easy, the string should have a purely iterator-based interface.

I would go so far as to say that it shouldn't have an operator[] overload
on it. And if it does have to have one, it should provide access to *code
units*, not codepoints or grapheme clusters. And this would only be for
doing low-level kinds of string manipulation, for performance purposes
(like basic_string::c_str and the like).

If you try to make operator[] act on codepoints or grapheme clusters, then
you'd be playing in vector<bool> territory, having to deal with
references-that-are-not-really-references and such.

That's bad.

Not to mention that the iterator-based interface would be faster. And with
ranges, allow for lazy evaluation. And all sorts of other cool stuff.


.


Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 11 Aug 2015 22:40:02 -0700
On Tuesday 11 August 2015 15:56:46 Matthew Woehlke wrote:
> > 15 years of Unicode experience in Qt (QString) have shown that very, very
> > few people require support for composed-character awareness.
>
> Just because support isn't "required" doesn't mean they aren't doing it
> wrong :-).

No, I really meant what I said. Most use-cases don't require knowing where a
grapheme cluster ends.

> Let's say I have the string "can=CD=82on", composed of six codepoints, an=
d I
> want to take or remove the first three "letters", or the last three. How
> many programs get that correct and don't either transform the 'n=CD=82' t=
o a
> 'n', or leave a dangling combining codepoint? (How many actually
> understand, or have even thought about, the difference between (1) and (2=
)?)

That's a contrived use-case. String manipulation usually works by finding=
=20
boundaries with regular separators. Doing splitting by number of characters=
 or=20
bytes only makes sense to fit in fixed-size buffers, which we can usually a=
void=20
in C++.

> It's likely that in many cases, splitting is happening at known and
> non-combining codepoints, and it's just luck that things usually turn
> out okay.

Right.

> On the same note, how do text editors deal with these sorts of issues?
> Presumably single "characters" represented by two codepoints should
> still be manipulated (caret movement, selection, deletion) as single
> characters, no? At least, this sort of thing is what I recall as always
> coming up when people start talking about pathological cases of
> character counting.

If you're using Qt, you'd use QTextBoundaryFinder with a boundary type of
QTextBoundaryFinder::Grapheme. As you can see from the Qt source code, in
order to do that, you need the entire Unicode database (or at least a
substantial portion of it).

I'm sure ICU has a similar API.

> > Most applications only need to know that a character is a character,
> > whether it is zero-, single- or double-width.
>
> How would you have a zero-width character? (We are talking about number
> of bytes representation, are we not?

No, we are talking about width, not byte or character or codepoint count. The
letter "x" has a width of 1 ex (hence the name); the letter "ｘ" (U+FF58
FULLWIDTH LATIN SMALL LETTER X) has twice that width, whereas U+FEFF (ZERO
WIDTH NO-BREAK SPACE) has width zero. Each of those is one codepoint.

Taking your example from below, the string "Å" has:
 - width: 1 ex
 - codepoint count: 2 (U+0041 U+030A)
 - UCS-4: 2 entries, 8 bytes
 - UTF-16: 2 entries, 4 bytes
 - UTF-8: 3 bytes

We usually confuse codepoint with character, but there are certain Unicode
codepoints that are not characters.
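
The counts listed above can be checked with the standard string types. This is only a sketch: the helper names (ring_a_utf8, byte_count and so on) are made up for illustration, and it assumes the usual 2-byte char16_t and 4-byte char32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "A" followed by U+030A COMBINING RING ABOVE, in three encodings.
// The UTF-8 bytes are spelled out so the example does not depend on the
// encoding of the source file.
inline std::string    ring_a_utf8()  { return "A\xCC\x8A"; }
inline std::u16string ring_a_utf16() { return u"A\u030A"; }
inline std::u32string ring_a_utf32() { return U"A\u030A"; }

// Storage in bytes = number of code units * size of one code unit.
template <class String>
std::size_t byte_count(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}
```

Note that size() reports code units in every case. None of the three types knows that the two codepoints form one displayed character.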

> Render width is a whole other
> kettle of fish and is only loosely related to what I understood Glen to
> be talking about with (1). The issue as I understood is differentiating
> between what a user considers a "character", and codepoints, e.g.
> <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit handles that
> rather poorly ;-).)

Not really. It's intentional that you can backspace a combining character and
remove it. It would be wrong if you could press the left or right arrows and
it would stop between the "A"  and the COMBINING RING ABOVE.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.


Author: glen stark <g.a.stark@gmail.com>
Date: Wed, 12 Aug 2015 14:02:35 +0200
Raw View
--047d7b86cf50950f6e051d1bfe15
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

> No, I really meant what I said. Most use-cases don't require knowing
> where a grapheme cluster ends.

I'm sorry, but I think you're obviously wrong in this regard.  In
every single instance where I have ever had to manipulate a string, I
have been interested in manipulating what I think of as displayable
characters, each of which might be expressed as a single codepoint or
as a grapheme cluster in Unicode.  I'd like a better word for that concept
-- a grapheme cluster is a specific Unicode representation, but what
I'm thinking of is more of a human concept -- let's call it a symbol.
Maybe the symbol I want to manipulate is represented as a grapheme
cluster, maybe it's represented as a single codepoint.  Most of the
time I don't want to know, I just want to find, manipulate, compare,
and sort the symbol.  That's true for most of the coders I know
working in C or C++.

Unless I'm developing a low level library, I'm not interested in being
aware of codepoints.  If I'm processing a data feed from an alpenhut
in Switzerland, and she sends me the string "Die Alpenglühn is heute
fröhlich", I want to look for "Alpenglühn", and not code twice -- once
for the case that she wrote [U+0075 U+0308], and once for the case
where she used U+00FC.  Same thing when I'm sorting, and doing a
regex.  With those latter two I can get the behavior I want using
locales, and I'd like to extend that behavior elsewhere.
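
To make the "Alpenglühn" problem concrete, here is a minimal sketch of why a plain byte search misses the decomposed spelling. compose_u_umlaut and contains_normalized are hypothetical helpers that handle exactly one pair of codepoints. Real normalization (NFC) needs the Unicode data tables, for example through ICU or Boost.Locale.

```cpp
#include <cassert>
#include <string>

// UTF-8 bytes for the two spellings of "ü".
const std::string kPrecomposed = "\xC3\xBC";   // U+00FC
const std::string kDecomposed  = "u\xCC\x88";  // U+0075 U+0308

// Hypothetical helper: rewrite every decomposed "ü" to the precomposed form.
// This covers one mapping only; it is not a general normalizer.
inline std::string compose_u_umlaut(std::string s) {
    std::string::size_type pos = 0;
    while ((pos = s.find(kDecomposed, pos)) != std::string::npos) {
        s.replace(pos, kDecomposed.size(), kPrecomposed);
        pos += kPrecomposed.size();
    }
    return s;
}

// Search that ignores which of the two spellings either side used.
inline bool contains_normalized(const std::string& haystack,
                                const std::string& needle) {
    return compose_u_umlaut(haystack).find(compose_u_umlaut(needle))
           != std::string::npos;
}
```

With a feed that spells the word with U+0075 U+0308 and a needle that uses U+00FC, std::string::find returns npos, while contains_normalized reports a match.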

I acknowledge that there is a use case where one needs to iterate
over, and compare at a codepoint level -- how else would we implement
the libraries that provide the symbol tools we usually want after all?
There will be other cases as well, but in general we will want to work
with symbols.  When those symbols can be encoded in degenerate ways,
we'd like to be able to ignore that.  So I think a good solution
provides the ability to work at a codepoint level and at a
symbol-level, where the fact that the symbol may be implemented as a
grapheme cluster can be safely ignored.

>> Let's say I have the string "can͂on", composed of six codepoints,
>> and I want to take or remove the first three "letters", or the last
>> three. How many programs get that correct and don't either
>> transform the 'n͂' to a 'n', or leave a dangling combining
>> codepoint? (How many actually understand, or have even thought
>> about, the difference between (1) and (2)?)

> That's a contrived use-case. String manipulation usually works by
> finding boundaries with regular separators. Doing splitting by
> number of characters or bytes only makes sense to fit in fixed-size
> buffers, which we can usually avoid in C++.

That's absolutely not a contrived case.  In my last project I had to
output internationalized text to an LCD terminal with a fixed width
and font size.  Most of the text processing involved truncating or
abbreviating strings to a fixed width representation.
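
A rough sketch of that kind of truncation on UTF-8, applied to the "can͂on" example: keep the first n base characters together with any combining marks attached to them. The function names are invented for illustration. It assumes well-formed UTF-8 and treats only U+0300..U+036F as combining, whereas real grapheme segmentation follows the Unicode text segmentation rules and needs the character database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte b.
// Assumes well-formed input; no validation is attempted.
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Simplification: only U+0300..U+036F (Combining Diacritical Marks) count
// as attaching to the previous character. In UTF-8 that range is the
// two-byte sequences CC 80 .. CD AF.
inline bool is_combining_mark(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// Keep the first n base characters plus the marks that follow them.
inline std::string take_symbols(const std::string& s, std::size_t n) {
    std::size_t i = 0, taken = 0;
    while (i < s.size()) {
        if (!is_combining_mark(s, i)) {
            if (taken == n) break;  // the next base character is one too many
            ++taken;
        }
        i += sequence_length(static_cast<unsigned char>(s[i]));
    }
    return s.substr(0, i);
}
```

substr(0, 3) on the same bytes returns plain "can" and silently drops the mark. Note that this counts symbols, not display columns: fullwidth and zero-width characters would need separate handling on a fixed-width display.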

More importantly than that, you're ignoring the monstrous amount of
legacy code which currently works with std::strings, which one would
like to make compliant with Unicode.  Making this code compliant with
Unicode in general means making it work like it used to work with ASCII
or latin1, latin2...

In the ASCII or latin1 world, when one worked with a string and did
substr(pos,n), += n, etc, one rightly expected to advance n symbols.
Think of how much time it would save a developer if:


   // make me a locale aware string f, and populate it with data
   auto iter = f.begin(); iter += 3;

advanced three letters!  We would save lifetimes of developer time.
It also hides the complexity of manipulating symbol equivalencies in a
library, which is how it should be.
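
For comparison, this is roughly what advancing by three has to do today on a UTF-8 std::string, and only at the codepoint level. advance_code_points is a hypothetical free function, not a proposed interface, and it assumes well-formed UTF-8. Advancing by symbols would additionally need grapheme cluster rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (10xxxxxx), which never start a codepoint.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Advance `it` by up to n codepoints, stopping early at `end`.
inline std::string::const_iterator
advance_code_points(std::string::const_iterator it,
                    std::string::const_iterator end,
                    std::size_t n) {
    while (n > 0 && it != end) {
        ++it;                                            // the lead byte
        while (it != end && is_continuation(*it)) ++it;  // its continuation bytes
        --n;
    }
    return it;
}
```

On the bytes of "fröhlich" with a precomposed ö, begin() + 3 lands in the middle of the ö, while advance_code_points(begin(), end(), 3) lands on the 'h'.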

>> Render width is a whole other kettle of fish and is only loosely
>> related to what I understood Glen to be talking about with (1). The
>> issue as I understood is differentiating between what a user
>> considers a "character", and codepoints, e.g.  <U00C5> 'Å'
>> vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit handles that rather
>> poorly ;-).)

I'm glad that someone at least understood what my goals are.  Clearly
I have to work on expressing them more clearly.  While it's true that
from the perspective of typesetting, my concept of size() is
meaningless, there's a universe of use cases where you just want to
know how many display characters a string needs to occupy: fixed width
terminal output, small displays on embedded devices, etc.

I am quite convinced a good solution to unicode should provide the
same result for the same set of "letters" or "symbols", regardless of
whether those symbols are encoded as single codepoints or grapheme
clusters.  As I understand it, this need is already partially
supported by std::regex and the std::algorithms when used together
with locales, and I would like to see this extended in a generic way.




--047d7b86cf50950f6e051d1bfe15--

.


Author: glen stark <g.a.stark@gmail.com>
Date: Wed, 12 Aug 2015 14:32:30 +0200
Raw View
--f46d044289e6921ef5051d1c69eb
Content-Type: text/plain; charset=UTF-8

Thanks for the input.

Fair point about not using Unicode terminology.  That's partly because I
haven't mastered the vocabulary yet, so help me out when you can.

n3572 failed because the standards committee is against the idea of adding
a new string class.  I agree with that decision.  I think a better approach
is to provide a set of encoding aware iterators and operators that allow us
to use the existing string class in a way that provides both comfortable
Unicode support as well as being consistent with existing C++ design and
solutions.  My little food-for-thought post is aimed at exploring a way to
make these tools comfortable for implementers, and for those updating or
maintaining legacy code.

I didn't explicitly discuss normalization, just as I didn't discuss a lot of
other things, as I was focusing on usability and design, and I thought the
normalization strategy was somehow implicit, or could be safely ignored for
an initial discussion.  Apparently I was mistaken there, mea culpa.

As I understand it, and I am here to learn so please correct me if I'm
wrong -- if I do std::sort(vec.begin(), vec.end(), locale), I get locale
aware sorting, and if the locale is implemented to provide correct
normalization, I get the correctly normalized sorting.  Same thing happens
when I pass a locale to regex.  I find that terrific.
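
For reference, a minimal sketch of the sorting call. std::locale has an operator() that compares two strings through the locale's std::collate facet, so a locale object can be handed to std::sort as the comparator. sort_with_locale is just an illustrative wrapper. Whether a given named locale exists, and whether its collation does any normalization, is left to the platform; the standard does not promise it.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Sort strings with the collation rules of `loc`.
// std::sort copies the locale and calls loc(a, b) for each comparison.
inline void sort_with_locale(std::vector<std::string>& v,
                             const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

Passing std::locale::classic() gives the plain "C" ordering. A named locale such as std::locale("de_DE.UTF-8") can be used where the system provides it; its constructor throws std::runtime_error otherwise.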

Now I'd like to be able to do stuff like this, without breaking old code,
in a way that lets me handle all three use cases (char_t processing,
codepoint processing, symbol processing -- i.e. treat grapheme clusters as
a single 'symbol').
    // make me a std::string <...> foo with unicode encoding.
    foo.substr(pos,3);   // gets me three symbols starting at pos,
                         // regardless of whether they are 'multi-byte'

In the current state of the standard, the way to get Unicode compliance
(regardless of encoding) is to imbue the locale into a std::regex, or to
provide the locale to a std algorithm (e.g. std::sort(foo.begin(),
foo.end(), locale)); normalization is then taken care of for you.
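
The regex half of that looks roughly like this. make_regex is an illustrative helper. Two caveats: imbue() has to happen before the pattern is assigned, and std::regex over char still matches code units, so this alone does not make it treat UTF-8 sequences or grapheme clusters as single characters.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Build a regex that uses `loc` for character classification and collation.
// After imbue() the regex matches nothing until a pattern is assigned,
// so the pattern is assigned afterwards.
inline std::regex make_regex(const std::string& pattern,
                             const std::locale& loc) {
    std::regex re;
    re.imbue(loc);
    re.assign(pattern);
    return re;
}
```

For example, make_regex("[[:alpha:]]+", std::locale::classic()) matches "Alpen" but not "Alpen1".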

I find this an entirely acceptable approach, but I foresee a lot of
redundant, repetitive providing of locales everywhere.
Wouldn't it be great if we could tell the compiler -- hey, always use the
locale for this string.

My thought was to make this a part of the string type -- you've seen my
initial thoughts. char_traits might be the right place to put the
information, I don't know.

If the technical details could be solved, I think the usability would be
excellent.  Since the standard already provides a mechanism for locale and
encoding aware comparison (locale()) I took it as given that this would
hold for my proposal.  Since the locale would be a property of the string
type, we could take advantage of type checking to guarantee good behavior.

   std::string < std::string;  // just like now.
   std::string<de_utf8> < std::string<en_utf8>;  // compiler error, incompatible types.
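
The type-checking idea can be sketched without touching std::string itself. Everything here is hypothetical: tagged_string stands in for the imagined std::string<de_utf8>, the tags carry no behavior, and the comparison is byte-wise where a real design would consult the locale's collation.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical locale tags; the names mirror the ones used above.
struct de_utf8 {};
struct en_utf8 {};

// A string whose type records which locale its contents belong to.
template <class LocaleTag>
struct tagged_string {
    std::string bytes;
};

// Ordering is only defined between strings that carry the same tag.
template <class LocaleTag>
bool operator<(const tagged_string<LocaleTag>& a,
               const tagged_string<LocaleTag>& b) {
    return a.bytes < b.bytes;
}

// Detects at compile time whether `A < B` is a valid expression.
template <class A, class B, class = void>
struct is_less_comparable : std::false_type {};

template <class A, class B>
struct is_less_comparable<
    A, B,
    decltype(void(std::declval<const A&>() < std::declval<const B&>()))>
    : std::true_type {};
```

Comparing a tagged_string<de_utf8> with a tagged_string<en_utf8> then fails to compile, because template deduction finds no single LocaleTag for both operands.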


On Tue, Aug 11, 2015 at 8:15 PM, Nicol Bolas <jmckesson@gmail.com> wrote:

> I have to agree with Thiago's assessment that this is really the wrong way
> to go. Indeed, you don't seem to fully understand the various elements at
> play in Unicode, which leads to the problem.
>
> You correctly break down Unicode strings into the correct 3 forms (the
> encoded sequence, the codepoint sequence, and the grapheme cluster
> sequence), but you don't use Unicode terminology for them. You also don't
> deal with normalization with regard to comparison, which is *absolutely
> crucial* when attempting to compare two Unicode strings.
>
> There was already an attempt to standardize Unicode use in C++
> <http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3572.html>, but
> unfortunately it did not progress very far. It was a solid approach, and
> one that should be followed up on. And while there are many functional ways
> of devising a Unicode string, trying to shoehorn Unicode functionality into
> std::basic_string is absolutely the wrong way to do it.
>
> Also, any improvements on it should recognize the importance of ranges,
> particularly Unicode codepoint iterator ranges.

--f46d044289e6921ef5051d1c69eb--

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Wed, 12 Aug 2015 09:38:30 -0400
Raw View
On 2015-08-12 01:40, Thiago Macieira wrote:
> On Tuesday 11 August 2015 15:56:46 Matthew Woehlke wrote:
>> On the same note, how do text editors deal with these sorts of issues?
>> Presumably single "characters" represented by two codepoints should
>> still be manipulated (caret movement, selection, deletion) as single
>> characters, no? At least, this sort of thing is what I recall as always
>> coming up when people start talking about pathological cases of
>> character counting.
>
> If you're using Qt, you'd use QTextBoundaryFinder with a boundary type of
> QTextBoundaryFinder::Grapheme. As you can see from the Qt source code, in
> order to do that, you need the entire Unicode database (or at least a
> substantial portion of it).

Right.

> I'm sure ICU has a similar API.

Yes, and IIRC, the above was given as a reason why it may not be
desirable for STL to play in this space. We're possibly better off with
everyone using a library like ICU.

Or, maybe, what we could do is standardize an API and leave it to
vendors to keep the database up to date. Possibly by STL being a wrapper
over a platform specific library (e.g. ICU).

>> How would you have a zero-width character? (We are talking about number
>> of bytes representation, are we not?
>
> No, we are talking about width, not byte or character or codepoint count.

Okay, I don't know where you got that from. I agree that needs font
metrics to solve and isn't something we should be looking at at this
time. I also didn't read anything in Glen's original e-mail that dealt
with render width, so I don't know why we're even discussing it.

>> e.g. <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit
>> handles that rather poorly ;-).)
>
> Not really. It's intentional that you can backspace a combining character and
> remove it. It would be wrong if you could press the left or right arrows and
> it would stop between the "A"  and the COMBINING RING ABOVE.

Did you actually *TRY* it? Note that the CRA occurs *before* the letter
with which it combines. Thunderbird doesn't render it correctly the
other way around, and at least Gecko and QtWebkit (and even KHTML) also
produce a Å in my original order. Qt itself however... hmm, well, seems
to behave reasonably except for combining it with the "'" instead.

Anyway, the point was that text editing operations need to be aware of
these issues... if you have the caret adjacent to a combined pair of
codepoints and press delete/backspace, it should either erase BOTH, or
erase only the combining codepoint, whether that's the one adjacent to
the caret or not. So in at least one case, knowledge that a combining
codepoint is present is required for reasonable behavior.
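
A rough sketch of the "erase BOTH" option, assuming the text is held as a sequence of code points and that only U+0300..U+036F count as combining (a real editor would need the Unicode database mentioned above). The names `is_combining` and `backspace_cluster` are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: only the Combining Diacritical Marks block is "combining".
// Real code would look this up in the Unicode Character Database.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Erase the whole user-visible character that ends at `caret` (an index in
// code points): any trailing combining marks plus the base they attach to.
// Returns the new caret position.
std::size_t backspace_cluster(std::u32string& text, std::size_t caret) {
    assert(caret <= text.size());
    if (caret == 0) return 0;
    std::size_t start = caret - 1;
    // Walk left over combining marks until the base code point is reached.
    while (start > 0 && is_combining(text[start])) --start;
    text.erase(start, caret - start);
    return start;
}
```

The other behavior (erase only the combining code point) would drop the loop and erase a single element; either way the editor has to know that a combining code point is there.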

--
Matthew

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Wed, 12 Aug 2015 09:49:50 -0400
Raw View
On 2015-08-12 08:02, glen stark wrote:
> While it's true that from the perspective of typesetting, my concept
> of size() is meaningless, there's a universe of use cases where you
> just want to know how many display characters a string needs to
> occupy: fixed width terminal output, small displays on embedded
> devices, etc.

Umm... okay, those *are* actually cases where you are talking about
render width. Yes, even in case of "fixed width" fonts. You still have
zero- and double-width characters in those cases for which you need to
account. I'm inclined to agree with Nicol; I've yet to see a convincing
use case for counting graphemes. (Not "counting" in the sense of being
able to manipulate iterators, but a count_graphemes() function.)
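
To make the zero- and double-width point concrete, here is a minimal sketch of counting terminal columns rather than graphemes. The table is hand-written and deliberately incomplete (an assumption for illustration; real code should use the East Asian Width data from the Unicode Character Database), and `column_width` / `columns_needed` are hypothetical names:

```cpp
#include <cassert>
#include <string>

// Assumption: a tiny, incomplete width table for illustration only.
int column_width(char32_t cp) {
    assert(cp <= 0x10FFFF);                      // must be a Unicode code point
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  // combining diacritical marks
    if (cp == 0x200B) return 0;                  // ZERO WIDTH SPACE
    if ((cp >= 0x1100 && cp <= 0x115F) ||        // Hangul Jamo initial consonants
        (cp >= 0x4E00 && cp <= 0x9FFF) ||        // CJK Unified Ideographs
        (cp >= 0xFF01 && cp <= 0xFF60))          // fullwidth forms
        return 2;
    return 1;
}

// Sum of per-code-point widths: the number of fixed-width cells needed.
int columns_needed(const std::u32string& text) {
    int total = 0;
    for (char32_t cp : text) total += column_width(cp);
    return total;
}
```

With this table, two ideographs need four cells while an "A" followed by a combining ring needs one, which is why neither a code point count nor a grapheme count gives the answer directly.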

--
Matthew

.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 12 Aug 2015 06:50:26 -0700 (PDT)
Raw View
------=_Part_213_1397894043.1439387426359
Content-Type: multipart/alternative;
 boundary="----=_Part_214_2121759883.1439387426359"

------=_Part_214_2121759883.1439387426359
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Wednesday, August 12, 2015 at 8:02:38 AM UTC-4, glen stark wrote:
>
>
> > No, I really meant what I said. Most use-cases don't require knowing
> > where a grapheme cluster ends.
>
> I'm sorry, but I think you're obviously wrong in this regard.  In
> every single instance where I have ever had to manipulate a string, I
> have been interested in manipulating what I think of as a displayable
> characters, which might be expressed as a single codepoint, or a
> grapheme cluster in Unicode.  I'd like a better word for that concept
> -- a grapheme cluster is a specific unicode representation, but what
> I'm thinking of is more of a human concept -- let's call it a symbol.
> Maybe the symbol I want to manipulate is represented as a grapheme
> cluster, maybe it's represented as a single codepoint.
>

A single codepoint is a grapheme cluster (provided that it is not followed
by a combining character). A cluster may consist of just one codepoint.

> Most of the
> time I don't want to know, I just want to find, manipulate, compare,
> and sort the symbol.  That's true for most of the coders I know
> working C or C++.
>
> Unless I'm developing a low level library, I'm not interested in being
> aware of codepoints.  If I'm processing a data feed from an alpenhut
> in Switzerland, and she sends me the string "Die Alpenglühn is heute
> fröhlich", I want to look for "Alpenglühn", and not code twice -- once
> for the case that she wrote [U+0075 U+0308], and once for the case
> where she used U+00FC.  Same thing when I'm sorting, and doing a
> regex.  With those latter 2 I can get the behavior I want using
> locales, and I'd like to extend that behavior elsewhere.
>

Comparison, sorting, and regex searching all require that the input strings
follow the same Unicode normalization form
<https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization>.
Otherwise, comparison just isn't viable.

The whole point of Unicode normalization is so that you *can* compare
strings by codepoints. When normalized in a particular form, all equivalent
representations of a string reduce to a single codepoint sequence. So if two
sequences are unequal, you *know* that they don't represent the same string.

So you don't need to "code twice"; just normalize the string when you
receive it from the user.
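
A toy sketch of "normalize, then compare by codepoints". The three-entry table below is only a stand-in for NFC (an assumption for illustration; real normalization needs the full Unicode composition data from a library such as ICU, Qt, or Boost.Locale), and `compose_pair`, `toy_compose`, and `same_text` are hypothetical names:

```cpp
#include <cassert>
#include <string>

// Assumption: a three-entry stand-in for the real composition table.
char32_t compose_pair(char32_t base, char32_t mark) {
    if (base == U'u' && mark == 0x0308) return 0x00FC;  // u + diaeresis
    if (base == U'o' && mark == 0x0308) return 0x00F6;  // o + diaeresis
    if (base == U'A' && mark == 0x030A) return 0x00C5;  // A + ring above
    return 0;                                           // no composition known
}

// Fold each known base+mark pair into its precomposed code point.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        char32_t composed = out.empty() ? 0 : compose_pair(out.back(), cp);
        if (composed != 0) out.back() = composed;
        else out.push_back(cp);
    }
    assert(out.size() <= in.size());  // composing never lengthens the text
    return out;
}

// After both sides are in the same form, plain equality is enough.
bool same_text(const std::u32string& a, const std::u32string& b) {
    return toy_compose(a) == toy_compose(b);
}
```

With this, "Alpenglühn" spelled with U+0075 U+0308 and with U+00FC compare equal through `same_text`, even though the raw codepoint sequences differ.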

Ordering strings (sorting) is rather more involved, as such comparisons
require specific codepoint-by-codepoint knowledge. But even there,
normalization is a necessary first step.

> >> Let's say I have the string "can͂on", composed of six codepoints,
> >> and I want to take or remove the first three "letters", or the last
> >> three. How many programs get that correct and don't either
> >> transform the 'n͂' to a 'n', or leave a dangling combining
> >> codepoint? (How many actually understand, or have even thought
> >> about, the difference between (1) and (2)?)
>
> > That's a contrived use-case. String manipulation usually works by
> > finding boundaries with regular separators. Doing splitting by
> > number of characters or bytes only makes sense to fit in fixed-size
> > buffers, which we can usually avoid in C++.
>
> That's absolutely not a contrived case.  In my last project I had to
> output internationalized text to an lcd terminal with a fixed width
> and font size.  Most of the text processing involved truncating or
> abbreviating strings to a fixed width representation.
>

Yes, there are cases where counting the number of grapheme clusters is
useful. However, those cases are not so common that we need a specific
function of a Unicode string class to do that.

> More importantly than that, you're ignoring the monstrous amount of
> legacy code which currently works with std::strings, which one would
> like to make compliant with unicode.  Making this code compliant with
> unicode in general means making it work like it used to work with ASCII
> or latin1, latin2...
>
> In the ASCII or latin1 world, when one worked with a string and did
> substr(pos,n), +=n, etc, one rightly expected to advance n symbols.
> Think of how much time it would save a developer if:
>
>
>    // make me a locale aware string f, and populate it with data
>    auto iter = f.begin(); iter += 3;
>
> advanced three letters!  We would save lifetimes of developer time.
> It also hides the complexity of manipulating symbol equivalencies in a
> library, which is how it should be.
>

Or you could just do this:

u8string str = ...;
auto rng = grapheme_cluster_range(str);
rng = rng.advance_begin(3);

That makes it much more clear what you've done. You take a unicode string,
get a range of grapheme clusters, and advance the front by 3.

What's the problem? Indeed, you could even specialize
`grapheme_cluster_range` to take `std::basic_string`, though for
`std::string`, you'd need an extra parameter to differentiate between
char-as-narrow and char-as-UTF8-codeunit.
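
For a feel of what "advance the front by 3 clusters" involves underneath, here is a minimal sketch over a `std::string` holding UTF-8. It is not the range type above (which does not exist in the standard); `decode`, `extends_cluster`, and `skip_clusters` are hypothetical names. It assumes valid UTF-8 and treats only U+0300..U+036F as cluster-extending, whereas the real boundary rules need the Unicode database:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at byte `pos`; `len` receives its length.
// Assumption: `s` is valid UTF-8 and `pos` is on a code point boundary.
char32_t decode(const std::string& s, std::size_t pos, std::size_t& len) {
    unsigned char b = static_cast<unsigned char>(s[pos]);
    len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    assert(pos + len <= s.size());  // valid input never runs past the end
    char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return cp;
}

// Assumption: only combining diacritical marks extend a cluster.
bool extends_cluster(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Byte offset just past the first `n` (simplified) grapheme clusters.
std::size_t skip_clusters(const std::string& s, std::size_t n) {
    std::size_t pos = 0;
    while (n > 0 && pos < s.size()) {
        std::size_t len = 0;
        decode(s, pos, len);  // the base code point of this cluster
        pos += len;
        // Absorb any combining marks that follow the base.
        while (pos < s.size() && extends_cluster(decode(s, pos, len)))
            pos += len;
        --n;
    }
    return pos;
}
```

Calling `s.substr(skip_clusters(s, 3))` on the "can͂on" example drops "can͂" and keeps "on", without leaving a dangling combining codepoint.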

> >> Render width is a whole other kettle of fish and is only loosely
> >> related to what I understood Glen to be talking about with (1). The
> >> issue as I understood is differentiating between what a user
> >> considers a "character", and codepoints, e.g.  <U00C5> 'Å'
> >> vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit handles that rather
> >> poorly ;-).)
>
> I'm glad that someone at least understood what my goals are.  Clearly
> I have to work on expressing them more clearly.  While it's true that
> from the perspective of typesetting, my concept of size() is
> meaningless, there's a universe of use cases where you just want to
> know how many display characters a string needs to occupy: fixed width
> terminal output, small displays on embedded devices, etc.
>

Fixed-width output is not sufficiently important that we need a whole
function just for that. Any variable-width renderer can also handle
fixed-width.

As for "small displays"... if it's a variable-width font, counting grapheme
clusters is not helpful. So again, you're talking about fixed-width only;
it's the exact same use case as before.


------=_Part_214_2121759883.1439387426359--
------=_Part_213_1397894043.1439387426359--

.


Author: Thiago Macieira <thiago@macieira.org>
Date: Wed, 12 Aug 2015 07:16:24 -0700
Raw View
On Wednesday 12 August 2015 14:02:35 glen stark wrote:
> > No, I really meant what I said. Most use-cases don't require knowing
> > where a grapheme cluster ends.
>
> I'm sorry, but I think you're obviously wrong in this regard.  In

I don't think I am.

> every single instance where I have ever had to manipulate a string, I
> have been interested in manipulating what I think of as a displayable
> characters, which might be expressed as a single codepoint, or a
> grapheme cluster in Unicode.  I'd like a better word for that concept
> -- a grapheme cluster is a specific unicode representation, but what
> I'm thinking of is more of a human concept -- let's call it a symbol.
> Maybe the symbol I want to manipulate is represented as a grapheme
> cluster, maybe it's represented as a single codepoint.  Most of the
> time I don't want to know, I just want to find, manipulate, compare,
> and sort the symbol.  That's true for most of the coders I know
> working C or C++.

That doesn't match my experience. Most uses of strings that I do involve
simple manipulations, like splitting on slashes or finding the last slash and
copying the trailing part after that (i.e., the basename operation on a
filename).
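
The basename case can be sketched at the code unit level, with no Unicode awareness at all. This assumes the path is UTF-8 (or plain ASCII); `basename_of` is a made-up name chosen to avoid clashing with the C library's `basename`:

```cpp
#include <cassert>
#include <string>

// Return the part of `path` after the last '/', or all of it if there is none.
// Safe on UTF-8 input: the byte 0x2F never occurs inside a multi-byte
// sequence, so a code unit search cannot cut a code point in half.
std::string basename_of(const std::string& path) {
    std::string::size_type slash = path.rfind('/');
    std::string name =
        slash == std::string::npos ? path : path.substr(slash + 1);
    assert(name.find('/') == std::string::npos);  // nothing left to split
    return name;
}
```

Combining marks in the last component stay attached to their base letters, because the split only ever happens next to an ASCII separator.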

The conclusion is simple: there are use-cases for everything and we need to
support all of them. Given C++'s "don't pay for what you don't need", I'd say
we need simple code unit iteration as well as grapheme cluster iteration, word
iteration, paragraph iteration, etc.

> Unless I'm developing a low level library, I'm not interested in being
> aware of codepoints.  If I'm processing a data feed from an alpenhut
> in Switzerland, and she sends me the string "Die Alpenglühn is heute
> fröhlich", I want to look for "Alpenglühn", and not code twice -- once
> for the case that she wrote [U+0075 U+0308], and once for the case
> where she used U+00FC.  Same thing when I'm sorting, and doing a
> regex.  With those latter 2 I can get the behavior I want using
> locales, and I'd like to extend that behavior elsewhere.

You just convert your string to NFC or to NFD and then compare.

If you're using QString:
 if (str.normalized(QString::NormalizationForm_C) == "Alpenglühn")

> I acknowledge that there is a use case where one needs to iterate
> over, and compare at a codepoint level -- how else would we implement
> the libraries that provide the symbol tools we usually want after all?
> There will be other cases as well, but in general we will want to work
> with symbols.  When those symbols can be encoded in degenerate ways,
> we'd like to be able to ignore that.  So I think a good solution
> provides the ability to work at a codepoint level and at a
> symbol-level, where the fact that the symbol may be implemented as a
> grapheme cluster can be safely ignored.

Where symbol can also be "word", "sentence" and "line".

>
> >> Let's say I have the string "can͂on", composed of six codepoints,
> >> and I want to take or remove the first three "letters", or the last
> >> three. How many programs get that correct and don't either
> >> transform the 'n͂' to a 'n', or leave a dangling combining
> >> codepoint? (How many actually understand, or have even thought
> >> about, the difference between (1) and (2)?)
> >
> > That's a contrived use-case. String manipulation usually works by
> > finding boundaries with regular separators. Doing splitting by
> > number of characters or bytes only makes sense to fit in fixed-size
> > buffers, which we can usually avoid in C++.
>
> That's absolutely not a contrived case.  In my last project I had to
> output internationalized text to an lcd terminal with a fixed width
> and font size.  Most of the text processing involved truncating or
> abbreviating strings to a fixed width representation.

You said "font". That's a whole other domain...

If you're using Qt, that's QFontMetrics.

> More importantly than that, you're ignoring the monstrous amount of
> legacy code which currently works with std::strings, which one would
> like to make compliant with unicode.  Making this code compliant with
> unicode in general means making it work like it used to work with ASCII
> or latin1, latin2...

That code is broken already. It shouldn't be using std::string.

std::string is irreparably broken when it comes to Unicode. At best, it can be
used as storage. If you want to do more Unicode manipulation, you use another
class on top. Or better yet, use std::u16string or std::u32string.

> In the ASCII or latin1 world, when one worked with a string and did
> substr(pos,n), +=n, etc, one rightly expected to advance n symbols.
> Think of how much time it would save a developer if:
>
>
>    // make me a locale aware string f, and populate it with data
>    auto iter = f.begin(); iter += 3;
>
> advanced three letters!  We would save lifetimes of developer time.
> It also hides the complexity of manipulating symbol equivalencies in a
> library, which is how it should be.

That can be done, but it's not the regular std::string::iterator.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358

.


Author: Thiago Macieira <thiago@macieira.org>
Date: Wed, 12 Aug 2015 07:21:12 -0700
Raw View
On Wednesday 12 August 2015 09:38:30 Matthew Woehlke wrote:
> >> e.g. <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit
> >> handles that rather poorly ;-).)
> >
> > Not really. It's intentional that you can backspace a combining character
> > and remove it. It would be wrong if you could press the left or right
> > arrows and it would stop between the "A" and the COMBINING RING ABOVE.
>
> Did you actually *TRY* it?

Yes.

> Note that the CRA occurs *before* the letter
> with which it combines.

Uh... no, it doesn't. It occurs after the letter it combines with. Moreover,
the order of combining characters themselves is defined, so there's exactly one
permutation possible for a given grapheme cluster (in each normalisation).

> Thunderbird doesn't render it correctly the
> other way around, and at least Gecko and QtWebkit (and even KHTML) also
> produce a Å in my original order. Qt itself however... hmm, well, seems
> to behave reasonably except for combining it with the "'" instead.

> Anyway, the point was that text editing operations need to be aware of
> these issues... if you have the caret adjacent to a combined pair of
> codepoints and press delete/backspace, it should either erase BOTH, or
> erase only the combining codepoint, whether that's the one adjacent to
> the caret or not. So in at least one case, knowledge that a combining
> codepoint is present is required for reasonable behavior.

Try with the characters in the right order first.


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 12 Aug 2015 07:52:14 -0700 (PDT)
Raw View
------=_Part_566_604678610.1439391134512
Content-Type: multipart/alternative;
 boundary="----=_Part_567_152199246.1439391134512"

------=_Part_567_152199246.1439391134512
Content-Type: text/plain; charset=UTF-8

On Wednesday, August 12, 2015 at 8:32:32 AM UTC-4, glen stark wrote:
>
> Thanks for the input.
>
> Fair point about not using Unicode terminology.  That's partly because I
> haven't mastered the vocabulary yet, so help me out when you can.
>
> n3572 failed because the standards committee is against the idea of adding
> a new string class.
>

Is there any evidence for that? I know the specific string class was a
point of contention, but as I understood it, the idea was that some people
wanted a single string class whose encoding was either fixed or made
irrelevant, so that everyone could communicate with a single type rather
than a template with different people using different encodings.

Also, N3572 had a *lot* more than a Unicode string class in it.

Furthermore, your idea is very much against that. If the standards
committee was indeed against the idea of having a new Unicode-aware string
type, they would be just as much against giving std::basic_string
Unicode-aware facilities, along the same reasoning. Especially since any
attempt to do so would have to be a breaking change.


> I didn't explicitly discuss normalization, just as I didn't discuss a lot
> of other things, as I was focusing on usability and design, and I thought the
> normalization strategy was somehow implicit, or could be safely ignored for
> an initial discussion.    Apparently I was mistaken there, mea culpa.
>
> As I understand it, and I am here to learn so please correct me if I'm
> wrong -- if I do std::sort(vec_of_strings, locale), I get locale aware
> sorting, and if the locale is implemented to provide correct normalization,
> I get the correctly normalized sorting.  Same thing happens when I pass a
> locale to regex.  I find that terrific.
>

I have never fully understood how std::locale works in any real way, but
I'm pretty sure (based on reading about Boost.Locale) that std::locale
can't do something complex like Unicode normalization. And even if it can,
std::locale is pretty slow in general, what with all of those indirect
and/or virtual calls. And doing normalization for every string comparison
is *incredibly* slow.

It's best to do normalization when you receive the string, so that you
don't have to deal with it in the majority of your code.
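
As a toy illustration of normalizing once on input, the sketch below composes a single pair only. The function compose_ring_above is a made-up stand-in and is nowhere near a real normalizer, which would be driven by the Unicode character database (for example through a library such as ICU or Boost.Locale). It only shows why later comparisons become cheap once the text has been brought into one form.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a real normalizer. It knows one rule only: 'A' or 'a'
// followed by U+030A COMBINING RING ABOVE composes to U+00C5 or U+00E5.
std::u32string compose_ring_above(const std::u32string& in) {
    const char32_t ring_above = 0x030A;
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (c == ring_above && !out.empty() && (out.back() == U'A' || out.back() == U'a')) {
            // Replace the base letter with its precomposed form.
            out.back() = (out.back() == U'A') ? char32_t{0x00C5} : char32_t{0x00E5};
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Usage sketch: normalize once at the boundary, then compare with plain ==.
void check_compose_ring_above() {
    const std::u32string decomposed  = {0x0041, 0x030A, 0x0061};  // A, ring above, a
    const std::u32string precomposed = {0x00C5, 0x0061};          // Å, a
    assert(decomposed != precomposed);                      // raw sequences differ
    assert(compose_ring_above(decomposed) == precomposed);  // equal after composing
}
```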

> Now I'd like to be able to do stuff like this, without breaking old code,
> in a way that lets me handle all three use cases (char_t processing,
> codepoint processing, symbol processing -- i.e. treat grapheme clusters as
> a single 'symbol').
>     // make me a std::string <...> foo with unicode encoding.
>     foo.substr(pos,3);   // gets me the last three symbols, regardless of
>                          // if they are 'multi-byte'
>

But that will break old code, because you're taking old code that used to
do one thing and making it do something else.

Furthermore, what if the user really does want, in this particular case, to
chop off the last 3 code units? Or codepoints? You're making an assumption
that your use case is the only, or just primary, use case.

Not everyone operates at the grapheme cluster level. And not everybody does
so all the time, or even predominantly.
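
A small sketch of the difference between the two units, assuming the string holds valid UTF-8. Both function names are hypothetical. The first helper does what existing std::string code gets today; the second shows what a code-point-based variant would return for the same call, which is exactly the behaviour change at issue.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Drops the last n code units (bytes), the unit std::string works in today.
std::string drop_last_code_units(const std::string& s, std::size_t n) {
    return s.substr(0, s.size() - (n < s.size() ? n : s.size()));
}

// Drops the last n code points, assuming s holds valid UTF-8.
std::string drop_last_code_points(const std::string& s, std::size_t n) {
    std::size_t end = s.size();
    while (n > 0 && end > 0) {
        --end;  // step back onto the last byte of the code point
        // Continuation bytes look like 10xxxxxx; walk back to the lead byte.
        while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
            --end;
        }
        --n;
    }
    return s.substr(0, end);
}

// Usage sketch on "café" (the é is the two bytes C3 A9).
void check_drop_last() {
    const std::string cafe = "caf\xC3\xA9";
    assert(drop_last_code_units(cafe, 1) == "caf\xC3");  // dangling lead byte
    assert(drop_last_code_points(cafe, 1) == "caf");     // whole code point removed
}
```

Neither result is wrong in itself; which one a caller wants depends on whether it is managing a byte buffer or text.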

> If the technical details could be solved, I think the usability would be
> excellent.  Since the standard already provides a mechanism for locale and
> encoding aware comparison (locale()) I took it as given that this would
> hold for my proposal.  Since the locale would be a property of the string
> type, we could take advantage of typechecking to guarantee good behavior.
>
>    std::string < std::string  ; // just like now.
>    std::string<de_utf8> < std::string<en_utf8>  // compiler err,
>                                                 // incompatible types.
>

Putting the *language* in the string type is fundamentally broken.
Remember, the type is static: how would the user be able to change
languages at *runtime*?

It's one thing to hard-code the encoding; that makes sense, as it changes
everything about how you interpret the bytes of the string. The language
being part of the type does not make sense. You might need the language for
certain operations (like sorting or other specialized comparisons), but
even that needs to be a parameter that can be runtime defined, not
compile-time defined.
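
For comparison, the existing standard library already takes the language as a run-time value: a std::locale object can be passed to std::sort as the comparator, because std::locale::operator() compares two strings through the locale's std::collate facet. The sketch below uses hypothetical helper names. Which locale names are available depends on the platform, so it falls back to the classic "C" locale. It says nothing about Unicode normalization, which std::locale does not provide.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Looks up a locale by name at run time. The set of valid names is
// platform-specific, so fall back to the classic "C" locale on failure.
std::locale locale_or_classic(const std::string& name) {
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Sorts with the collation rules of loc; the locale itself is the comparator.
void sort_with_locale(std::vector<std::string>& words, const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
}

// Usage sketch: the "C" locale orders by code unit value, so 'C' sorts first.
void check_sort_with_locale() {
    std::vector<std::string> words = {"b", "a", "C"};
    sort_with_locale(words, locale_or_classic("C"));
    assert((words == std::vector<std::string>{"C", "a", "b"}));
}
```

The string type stays the same whichever locale is chosen; only the argument changes.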

Not to mention the fact that you're *changing* std::basic_string in a
completely incompatible way, breaking basically everyone's code.

------=_Part_567_152199246.1439391134512--
------=_Part_566_604678610.1439391134512--

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Wed, 12 Aug 2015 11:04:04 -0400
Raw View
On 2015-08-12 10:21, Thiago Macieira wrote:
> On Wednesday 12 August 2015 09:38:30 Matthew Woehlke wrote:
>>>> e.g. <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit
>>>> handles that rather poorly ;-).)
>>>
>>> Not really. It's intentional that you can backspace a combining character
>>> and remove it. It would be wrong if you could press the left or right
>>> arrows and it would stop between the "A" and the COMBINING RING ABOVE.
>>
>> Note that the CRA occurs *before* the letter with which it
>> combines.
>
> Uh... no, it doesn't.

It does. I mean, whether or not it is *supposed* to, that is the order
that actually occurs in the above quoted text. Please actually READ what
I wrote, in particular "<U030A>+<U0041>".

It may be that Qt is the only renderer that both renders <U0041>+<U030A>
correctly (Thunderbird does not¹) and renders <U030A>+<U0041>
consistently (never as 'Å').

(¹ e.g. <U0041><U030A><U0061> is rendered as "Aå", not "Åa".)

If nothing else, I submit this (i.e. inconsistent behavior across
several rendering libraries) as another example why proper Unicode
handling is hard :-).

--
Matthew


.


Author: Thiago Macieira <thiago@macieira.org>
Date: Wed, 12 Aug 2015 08:08:48 -0700
Raw View
On Wednesday 12 August 2015 11:04:04 Matthew Woehlke wrote:
> On 2015-08-12 10:21, Thiago Macieira wrote:
> > On Wednesday 12 August 2015 09:38:30 Matthew Woehlke wrote:
> >>>> e.g. <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit
> >>>> handles that rather poorly ;-).)
> >>>
> >>> Not really. It's intentional that you can backspace a combining
> >>> character and remove it. It would be wrong if you could press the
> >>> left or right arrows and it would stop between the "A" and the
> >>> COMBINING RING ABOVE.
> >>
> >> Note that the CRA occurs *before* the letter with which it
> >> combines.
> >
> > Uh... no, it doesn't.
>
> It does. I mean, whether or not it is *supposed* to, that is the order
> that actually occurs in the above quoted text. Please actually READ what
> I wrote, in particular "<U030A>+<U0041>".

That's QUOTATION MARK with RING ABOVE, followed by A, followed by QUOTATION
MARK.

>
> It may be that Qt is the only renderer that both renders <U0041>+<U030A>
> correctly (Thunderbird does not¹) and renders <U030A>+<U0041>
> consistently (never as 'Å').
>
> (¹ e.g. <U0041><U030A><U0061> is rendered as "Aå", not "Åa".)
>
> If nothing else, I submit this (i.e. inconsistent behavior across
> several rendering libraries) as another example why proper Unicode
> handling is hard :-).

Submit a bug report to Gecko.


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 12 Aug 2015 08:25:22 -0700 (PDT)
Raw View
------=_Part_5286_2017157094.1439393122444
Content-Type: multipart/alternative;
 boundary="----=_Part_5287_1430726123.1439393122444"

------=_Part_5287_1430726123.1439393122444
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Wednesday, August 12, 2015 at 11:04:15 AM UTC-4, Matthew Woehlke wrote:
>
> On 2015-08-12 10:21, Thiago Macieira wrote:
> > On Wednesday 12 August 2015 09:38:30 Matthew Woehlke wrote:
> >>>> e.g. <U00C5> 'Å' vs. <U030A>+<U0041> '̊A'. And FYI, QTextEdit
> >>>> handles that rather poorly ;-).)
> >>>
> >>> Not really. It's intentional that you can backspace a combining
> >>> character and remove it. It would be wrong if you could press the
> >>> left or right arrows and it would stop between the "A" and the
> >>> COMBINING RING ABOVE.
> >>
> >> Note that the CRA occurs *before* the letter with which it
> >> combines.
> >
> > Uh... no, it doesn't.
>
> It does. I mean, whether or not it is *supposed* to, that is the order
> that actually occurs in the above quoted text. Please actually READ what
> I wrote, in particular "<U030A>+<U0041>".
>
> It may be that Qt is the only renderer that both renders <U0041>+<U030A>
> correctly (Thunderbird does not¹) and renders <U030A>+<U0041>
> consistently (never as 'Å').
>

There is no "may" here. The Unicode specification is quite clear: combining
characters *always* modify the first non-combining codepoint that came
before them in a sequence. A codepoint sequence that starts with a
combining character is broken.

Unicode handling may be "hard", but a good grapheme cluster iterator would
easily resolve these problems.
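
As a hedged sketch of how cluster awareness helps the backspace case discussed above, the function below removes the last base code point together with any combining marks that follow it. It implements the "erase both" behaviour, which is one of the two options mentioned earlier in the thread. The names are hypothetical, and the combining test is a deliberate simplification covering only U+0300..U+036F; real segmentation follows Unicode UAX #29.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplification: only the Combining Diacritical Marks block is recognised.
inline bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Erases the last cluster: any trailing combining marks plus the base code
// point they modify. A string made up only of combining marks (a defective
// sequence) is erased entirely.
void erase_last_cluster(std::u32string& text) {
    std::size_t start = text.size();
    while (start > 0 && is_combining_mark(text[start - 1])) {
        --start;  // skip back over the marks
    }
    if (start > 0) {
        --start;  // include the base code point
    }
    text.erase(start);
}

// Usage sketch: one backspace removes the A and its ring together.
void check_erase_last_cluster() {
    std::u32string text = {0x0061, 0x0041, 0x030A};  // a, A, ring above
    erase_last_cluster(text);
    assert(text == std::u32string(1, U'a'));
}
```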

------=_Part_5287_1430726123.1439393122444--
------=_Part_5286_2017157094.1439393122444--

.


Author: Fabio Fracassi <f.fracassi@gmx.net>
Date: Wed, 12 Aug 2015 17:41:05 +0200
Raw View
This is a multi-part message in MIME format.
--------------020007000306030602020308
Content-Type: text/plain; charset=UTF-8; format=flowed



On 12.08.2015 16:52, Nicol Bolas wrote:
> On Wednesday, August 12, 2015 at 8:32:32 AM UTC-4, glen stark wrote:
>
>     Thanks for the input.
>
>     Fair point about not using Unicode terminology.  That's partly
>     because I haven't mastered the vocabulary yet, so help me out when
>     you can.
>
>     n3572 failed because the standards committee is against the idea
>     of adding a new string class.
>
>
> Is there any evidence for that? I know the specific string class was a
> point of contention, but as I understood it, the idea was that some
> people wanted a single string class whose encoding was either fixed or
> made irrelevant, so that everyone could communicate with a single type
> rather than a template with different people using different encodings.
>

My impression is that for the committee a string is spelled std::string.
I just re-checked the meeting notes from the N3572 discussion, which
support this.
A single strictly superior string type might be accepted (*very*
unlikely), but it would need a seriously well-argued motivation to
overcome the heavy opposition it will face.
A family of string types (templated or not) like the one N3572 proposed
will not stand a chance.

> Also, N3572 had a /lot/ more than a Unicode string class in it.

There was interest in seeing the algorithms presented as an independent
and container-neutral facility, but I do not remember that there was a
follow-up paper.

>
> Furthermore, your idea is very much against that. If the standards
> committee was indeed against the idea of having a new Unicode-aware
> string type, they would be just as much against giving
> std::basic_string Unicode-aware facilities, along the same reasoning.
> Especially since any attempt to do so would have to be a breaking change.

Yes, any Unicode facilities will have to be introduced as separate
algorithms that can work with std::string,
std::experimental::string_view, std::array<char, ...>, etc ...

A good proposal will probably want to take the current work on ranges
into account as this will be a natural fit for string processing, e.g.
(imaginary pseudo code, not any proposed syntax):

std::string utf8_str = utf8::nfc(get_data_from_user());
std::string start = take(3, grapheme_cluster_range(utf8_str));
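
One way to read the pseudo code above as a container-neutral algorithm is sketched here. This is an assumption-laden illustration, not a proposed interface: the name take_clusters is made up, the input is taken to be already decoded to UTF-32, and the combining test covers only U+0300..U+036F instead of the full UAX #29 rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplification: only the Combining Diacritical Marks block is recognised.
inline bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Returns an iterator just past the first n clusters in [first, last).
// Works with any forward range of char32_t: std::u32string, arrays, vectors.
template <typename ForwardIt>
ForwardIt take_clusters(ForwardIt first, ForwardIt last, std::size_t n) {
    while (n > 0 && first != last) {
        ++first;  // the cluster's first code point
        while (first != last && is_combining_mark(*first)) {
            ++first;  // marks that attach to it
        }
        --n;
    }
    return first;
}

// Usage sketch: the first two clusters span three code points.
void check_take_clusters() {
    const std::u32string text = {0x0041, 0x030A, 0x0062, 0x0063};  // A + ring, b, c
    const auto end = take_clusters(text.begin(), text.end(), 2);
    assert(std::u32string(text.begin(), end) == (std::u32string{0x0041, 0x030A, 0x0062}));
}
```

Because the function only needs iterators, the same code serves a std::u32string, a plain array, or a view, without adding a new string type.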

Best regards

Fabio


--------------020007000306030602020308--

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Wed, 12 Aug 2015 11:44:57 -0400
Raw View
On 2015-08-12 11:25, Nicol Bolas wrote:
> On Wednesday, August 12, 2015 at 11:04:15 AM UTC-4, Matthew Woehlke wrote:
>> It may be that Qt is the only renderer that both renders <U0041>+<U030A>
>> correctly (Thunderbird does not¹) and renders <U030A>+<U0041>
>> consistently (never as 'Å').
>
> There is no "may" here.

You're using that word in a completely different sense than I did.

I observed the behavior of several text renderers (GTK, Qt, Gecko,
QtWebkit, KHTML). Only Qt seems to have "good" behavior. Since this is
an incomplete sampling, I cannot state confidently that Qt is the only
one that is "good"; hence "may" to express a possibility rather than a
certainty.

You are using it in an "is permissible" sense w.r.t. the Unicode
specification, which is a different data set.

(Actually, on further poking, GTK may be "good" also, and possibly
cairo, if that's not equivalent to GTK. What's confusing is that
Thunderbird's mail display - which I guess must use gecko, even in plain
text mode - behaves differently from other widgets within Thunderbird
which are presumably straight GTK. Also very strange is that behavior
differs between Firefox and Thunderbird.)

> The Unicode specification is quite clear: combining
> characters *always* modify the first non-combining codepoint that came
> before them in a sequence. A codepoint sequence that starts with a
> combining character is broken.

It's... "interesting" that so many different renderers (Gecko, KHTML,
QtWebkit) "support" the opposite order. In at least one case, even
exclusively... (Also that the ones that do all seem to be web browser
renderers... I wonder if there is a connection there?)

--
Matthew


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 12 Aug 2015 09:32:23 -0700 (PDT)
Raw View
On Wednesday, August 12, 2015 at 11:41:44 AM UTC-4, Fabio Fracassi wrote:
>
> On 12.08.2015 16:52, Nicol Bolas wrote:
>
> On Wednesday, August 12, 2015 at 8:32:32 AM UTC-4, glen stark wrote:
>>
>> Thanks for the input.
>>
>> Fair point about not using Unicode terminology.  That's partly because I
>> haven't mastered the vocabulary yet, so help me out when you can.
>>
>> n3572 failed because the standards committee is against the idea of
>> adding a new string class.
>>
>
> Is there any evidence for that? I know the specific string class was a
> point of contention, but as I understood it, the idea was that some people
> wanted a single string class whose encoding was either fixed or made
> irrelevant, so that everyone could communicate with a single type rather
> than a template with different people using different encodings.
>
>
> My impression is that for the committee a string is spelled std::string. I
> just re-checked the meeting notes from the N3572 discussion, which support
> this.
>

It's ironic: the belief that std::string is the one-true-string-class is
exactly what keeps UnicodeString, QString, Platform::String, and
innumerable other C++ string types in business. So believing in it only
proves how false it is.

I would hope that the committee would stop spending so much time thinking
about what "the committee" believes and spend a bit more time looking at
what the reality actually is.

That's not to say that Unicode facilities should be built to require the
use of a particular string class of course. C++ has so many extant string
types (not to mention special-case needs like fixed-length strings and
such) that we need to allow people to deal with strings whose sources come
from anywhere. And the primary interface between code that doesn't modify
containers should be through ranges/views/whatever.

But it would be nice to have a string type that had Unicode facilities as
part of its actual interface, not merely as bolted-on functions. And
std::basic_string is absolutely not the way to do it. It has way too many
functions for code-unit manipulation, which should not be the primary
interface when poking at an encoded string.

It's just too easy to screw up.
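
A minimal sketch of the kind of slip meant here, assuming the string holds well-formed UTF-8. The helper names `is_continuation_byte` and `splits_code_point` are made up for illustration and are not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: true if cutting `s` at code-unit offset `pos` would
// land inside a multi-byte sequence (assumes `s` is well-formed UTF-8).
inline bool splits_code_point(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    return pos < s.size() && is_continuation_byte(s[pos]);
}
```

For a string such as "\xC3\x85" "ngstrom" (U+00C5 followed by seven ASCII letters), size() reports 9 code units, and substr(1) compiles and runs without complaint but starts with a stray continuation byte. Nothing in the basic_string interface flags that.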


.


Author: Magnus Fromreide <magfr@lysator.liu.se>
Date: Wed, 12 Aug 2015 18:32:27 +0200
Raw View
On Wed, Aug 12, 2015 at 11:44:57AM -0400, Matthew Woehlke wrote:
> On 2015-08-12 11:25, Nicol Bolas wrote:
>
> It's... "interesting" that so many different renderers (Gecko, KHTML,
> QtWebkit) "support" the opposite order. In at least one case, even
> exclusively... (Also that the ones that do all seem to be web browser
> renderers... I wonder if there is a connection there?)

I wouldn't be surprised at all if there is a connection. Web browsers
usually try really hard to make sense out of worthless crap input in
order to display the current web.

/MF


.


Author: glen stark <g.a.stark@gmail.com>
Date: Wed, 12 Aug 2015 23:54:09 +0200
Raw View
"Especially since any attempt to do so [giving std::basic_string Unicode
awareness] would be a breaking change"

I fully admit my initial post was amateurish and imprecise.  A major goal
in making it was to get some practice discussing these things.  I apologize
for any frustration or wasted time this causes other participants --
clearly I failed to communicate my core idea: I believe it is possible
to give std::basic_string Unicode awareness without introducing a breaking
change.  I have to ask everyone to be a little ambiguity-tolerant, as I
haven't thought it all the way through, but I believe the core idea is not
a bad one -- despite the perhaps justifiably critical reception it has
received so far.

I guess I'm starting with the assumption that we will get a set of
algorithms that can operate on std::strings, and I think that's a good
idea.  I'm hoping strongly (and I am willing to help in any way I'm
capable) that we will get a set of Unicode-aware iterators over
std::basic_strings.  I personally remain of the opinion that a good
grapheme-cluster iterator would be of tremendous use to developers, and
would simplify a large set of Unicode handling, but given how the
discussion is going so far, I feel the need to put some more thought into
that.

Let's say we get that -- I think there's a way to offer that
functionality to users of std::strings that provides a natural way to get
consistent Unicode and locale handling for std::basic_strings, and will
make adding Unicode support to legacy code bases vastly simpler than the
current situation (or the situation we would have with algorithms and
iterators alone).


std::basic_string is currently a template accepting 3 template parameters:
CharT, Traits, and Allocator, where Traits and Allocator default to
std::char_traits<CharT> and std::allocator<CharT>, and in practice traits
and allocators are only occasionally (dare I say rarely) provided.

We could introduce an additional, optional template parameter --
locale_traits -- or perhaps modify or extend char_traits, to provide
locale-aware handling of strings.  Given that char_traits provides eq, lt,
length, find, to_char_type, to_int_type, eq_int_type, and especially
compare ("lexicographically compares two character sequences"), I think a
good argument could be made to extend char_traits to 1) provide iterators,
and 2) provide locale-aware comparison.
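
For reference, here is a sketch of how char_traits can be customized today, in the style of the well-known case-insensitive string idiom. The names `ci_char_traits` and `ci_string` are illustrative rather than standard, and the folding is ASCII-only:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Illustrative traits class (not part of the standard library): comparison
// and search ignore ASCII case; everything else is inherited unchanged.
struct ci_char_traits : std::char_traits<char> {
    static unsigned char fold(char c) {
        return static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
};

using ci_string = std::basic_string<char, ci_char_traits>;
```

With this, ci_string("Hello") == ci_string("hELLO") holds. Note that the traits only influence comparison and searching: size() and at() still count and index CharT units, so code-point or grapheme-cluster indexing would go beyond what this mechanism does at present.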


The default template parameter, and any existing char_traits wouldn't be
affected -- you still get your previous handling.  Let's say I could do
something like the following:

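// Hypothetical syntax: de_utf8, codepoint and grapheme_cluster are placeholder names.
using ct_cp = char_traits<de_utf8, codepoint>;         // codepoint-aware char_traits
std::basic_string<char, ct_cp> foo;
using ct_gc = char_traits<de_utf8, grapheme_cluster>;  // grapheme-cluster-aware char_traits
std::basic_string<char, ct_gc> bar;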

char_traits (or locale_traits) could now provide unicode aware
handling of comparison, sorting, etc.   If the string methods, let's
consider at(), substr(), and size() for example, are aware of these
traits (and default to original behavior in their absence), we get
zero-change and std::strings remain std::strings, std::wstrings remain
std::wstrings, etc.  Nothing breaks, nothing changes.

But my new strings, foo and bar, will use the new trait-provided
support.  I personally think this is consistent with how char_traits is
currently used.  foo would now be codepoint aware, and Thiago gets a
std::basic_string that works like he thinks is reasonable, and the
iterators don't have to know any more than codepoints -- pay for what
you use, after all.  foo.size() gives you the number of codepoints,
foo.substr(pos, 3) gives you 3 codepoints after pos, and foo.at(n) gives
you the n'th codepoint.
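
To make concrete what a codepoint-aware size() and at(n) would have to compute over UTF-8 storage, here is a rough sketch. It assumes well-formed UTF-8, and the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in `s`, assuming `s` holds well-formed UTF-8: every
// byte that is not a continuation byte (10xxxxxx) starts a code point.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Byte offset at which the n-th code point (0-based) starts. A
// codepoint-aware at(n) would need this linear scan on every call.
inline std::size_t code_point_offset(const std::string& s, std::size_t n) {
    assert(n < count_code_points(s));
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();  // not reached when the precondition holds
}
```

Both functions walk the code units from the start, so indexing by code point is linear in the length of the string rather than constant time.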

bar.at(n) would give you the n'th grapheme cluster, or what Unicode
Standard Annex #29 calls "user-perceived characters", which would make
all my legacy code Unicode friendly and meaningful, if grossly
inefficient.  bar.size() would give me the number of
user-perceived characters, and bar.substr(pos, 3) will match a
three-character file extension, regardless of how it's been normalized.

If we made it possible to provide a default-locale placeholder, we
could create a locale-aware string that changes behavior appropriately
when the global locale is changed.

This would allow Unicode/locale awareness to propagate everywhere a
string might be used, like the key of a map.  Since the locale
awareness is part of the type of the string, type safety ensures that
we don't do something stupid like u8str==latin1str; or we can
user-implement that comparison and take advantage of overloading to get
the behavior that we want.

While the discussion has been hugely educational for me, and I see
there are a lot of points I need to learn more on, I'm still of the
opinion that extending std::string in such a way would provide:
  - backward compatibility.
  - a tidy and compact interface for common locale/Unicode-aware operations.
  - a tidy, compact and consistent interface for providing efficient
    (e.g. codepoint or byte level) operations.
  - the ability to change locale behavior cheaply -- I could do
    static_cast<byte_level_string_type>(grapheme_cluster_aware_string)
    cheaply, and use type-checking to ensure only meaningful
    conversions occur.
  - a cheap (in terms of work needed) path to Unicode
    awareness for legacy code.  First get it to work (cheap), then fix
    performance problems as necessary.

Sure, you could write really stupid code like:
for (int i = 0; i < u8_grapheme_cluster_aware_string.size(); ++i) {
    cout << u8_grapheme_cluster_aware_string[i];
}
but a lot of C++ lets you do stupid stuff like that.


Does nobody else think this would be useful?


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 12 Aug 2015 19:35:08 -0700 (PDT)
Raw View
On Wednesday, August 12, 2015 at 5:55:01 PM UTC-4, glen stark wrote:
>
> "Especially since any attempt to do so [giving std::basic_string Unicode
> awareness] would be a breaking change"
>
> I fully admit my initial post was amateurish and imprecise.  A major goal
> in making it was to get some practice discussing these things.  I apologize
> for any frustration or wasted time this causes other participants --
> clearly I failed to communicate my core idea:    I believe it is possible
> to give std::basic_string unicode awareness without introducing a breaking
> change.  I have to ask everyone to be a little ambiguity tolerant, as I
> haven't thought it all the way through, but I believe the core idea is not
> a bad one -- despite the perhaps justifiably critical reception it has
> received so far.
>

OK, I mentioned this before, but it bears repeating: your idea is
*impossible*. Allow me to explain, using an example you provided:

> for (int i = 0; i < u8_grapheme_cluster_aware_string.size(); ++i) {
>     cout << u8_grapheme_cluster_aware_string[i];
> }
>

Given the name of this object, I'm going to assume that the actual bytes it
stores are a contiguous array of UTF-8 code units. OK.

So... what does operator[] actually return?

Recall that a grapheme cluster is a sequence of one or more codepoints. But
the string does not *contain* codepoints; it contains UTF-8 encoded code
units. So whatever operator[] returns, it cannot be a direct reference to
any object stored by the string.

So it will have to return some kind of facade object, one that can be
implicitly converted to/from a "grapheme cluster". When converting to a
grapheme cluster, it does the decoding of the UTF-8 data and writes out a
bunch of codepoints. When converting from a grapheme cluster, it reads the
sequence of codepoints and encodes them as UTF-8, possibly expanding or
shrinking the string in the process.

This must *also* be true of any iterators that the object returns.
Dereferencing the iterators doesn't return a C++ reference; it returns a
facade object that pretends to be a reference.

However, the rules of STL containers *forbid this*. All
iterator_t::reference types must be actual, C++ language references to an
actual object contained in the sequence; otherwise, it is not an STL
iterator. All container::reference types must be an actual C++ language
reference to the value_type; otherwise, it is not an STL container.

That's why std::vector<bool> is such a problem. As the old saying goes,
"the only problem with vector<bool> is that it is not a vector and it does
not contain bools." The same goes here: `u8_grapheme_cluster_aware_string`
is not a string (a contiguous array of data), and it does not contain
grapheme clusters.
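
The vector<bool> comparison can be made concrete with a small sketch. The function name `write_through_auto_copy` is made up for illustration. The underlying rule, as worded in the standard of the time, is that forward iterators and stronger must have `reference` be a true reference to the value type:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// vector<int> hands out real references to the stored elements...
static_assert(std::is_same<std::vector<int>::reference, int&>::value,
              "vector<int>::reference is int&");

// ...while vector<bool> hands out a proxy class that only behaves like one.
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool>::reference is a proxy, not bool&");

// Generic code that expects `auto x = c[0]` to make an independent copy.
// Returns true if writing to that "copy" changed the container's element.
template <typename Container>
bool write_through_auto_copy(Container c) {
    assert(!c.empty());
    const auto before = static_cast<typename Container::value_type>(c[0]);
    auto copy = c[0];  // a value for vector<int>, a proxy for vector<bool>
    copy = !before;    // assigns through the proxy, if it is one
    return static_cast<typename Container::value_type>(c[0]) != before;
}
```

Called with a vector<int> the function returns false; called with a vector<bool> it returns true, because the "copy" is a proxy that still refers into the container. A string whose operator[] returned a grapheme-cluster facade would surprise generic code in the same way.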

If it's not an STL container (let's ignore the fact that `basic_string` is
already technically not an STL container) and it doesn't provide STL
iterators, then you can't pass iterators to algorithms that expect STL
iterators or containers. You can't pass this "container" to any generic
function that expects an STL container.

That's not a "bad idea"; it's fundamentally broken.


> I'm hoping strongly (and I am willing to help in any way I'm capable) that
> we will get a set of Unicode-aware iterators over std::basic_strings.  I
> personally remain of the opinion that a good grapheme-cluster iterator would
> be of tremendous use to developers, and simplify a large set of unicode
> handling, but given how the discussion is going so far, I feel the need to
> put some more thought into that.
>

No, the iterator idea is fine. It's the "let's break basic_string's
interface to force it to use them" part that's the problem.

> std::basic_string is currently a template accepting 3 template parameters:
> CharT, Traits, and Allocator, where Traits and Allocator default to
> std::char_traits<CharT> and std::allocator<CharT>, and in practice traits
> and allocators are only occasionally (dare I say rarely) provided.
>

>
....
>
> This would allow Unicode/locale awareness to propagate everywhere a
> string might be used, like the key of a map.  Since the locale
> awareness is part of the type of the string, type safety ensures that
> we don't do something stupid like u8str==latin1str, or user-implement
> it and take advantage of overloading to get the behavior that we want.
>

Ignoring the above point about how this is not possible, I would say that
it's not *desirable*. You're putting too many concepts under one roof.
You're sticking storage, Unicode encoding, *language*, and even comparisons
all under the string's *type*.

That's just too much responsibility for one object. Comparison requires
normalization, but you refuse to expose that. Languages should be a runtime
factor (at *best* part of a string's data just like the text, yet that
doesn't account for multi-language strings), but you make language a
compile-time construct.

Why do I want a German string to be a different type from a Spanish string?
Now yes, the language is important for defining lexicographical ordering.
But that's the *only* operation where that is important. So just have a
comparison function that takes parameters defining the languages of the
strings in question. Or, if you absolutely must use codecvt or locales or
somesuch, imbue the comparison with the locale in question.
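
As a sketch of what "a comparison that takes the language as a parameter" can look like with what the standard library already has: the function name `collates_before` is illustrative, while std::collate is the standard facet.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Orders two strings by the collation rules of a locale chosen at run time,
// so the language never becomes part of the string's type.
inline bool collates_before(const std::string& a, const std::string& b,
                            const std::locale& loc) {
    assert(std::has_facet<std::collate<char>>(loc));
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

std::locale also has an operator() that performs this comparison, so a locale object can be passed directly as the comparator to std::sort. Which named locales exist (for example "de_DE.UTF-8") depends on the platform, and constructing an unavailable one throws std::runtime_error.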

> While the discussion has been hugely educational for me, and I see
> there are a lot of points I need to learn more on, I'm still of the
> opinion that extending std::string in such a way would provide:
>   - backward compatibility.
>   - a tidy and compact interface for common locale/Unicode-aware
> operations.
>   - a tidy, compact and consistent interface for providing efficient
>     (e.g. codepoint or byte level) operations.
>

It's based on locales and codecvt; these are not things I would consider
"efficient". Also, see below for "codepoint or byte level operations":


>   - the ability to change locale behavior cheaply -- I could do
>     static_cast<byte_level_string_type>(grapheme_cluster_aware_string)
>     cheaply, and use type-checking to ensure only meaningful
>     conversions occur.
>

Um, no. That conversion will *require* copying the string, unless you
start doing copy-on-write gymnastics, which the committee rightfully
jettisoned; good luck getting that back through the committee. The
absolute best you might do is move the string. And you can't move from a
`const&`, which is a common way of passing strings around.

That's not a "cheap" or "clean" way to handle "codepoint or byte level
operations".
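
A small sketch of why the conversion is not free with today's basic_string. Here `tagged_traits` is a stand-in for whatever "grapheme-cluster-aware" traits class one imagines; it is not a real or proposed type:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Stand-in for a hypothetical "aware" traits class: it behaves exactly like
// std::char_traits<char> but makes the string a distinct type.
struct tagged_traits : std::char_traits<char> {};
using tagged_string = std::basic_string<char, tagged_traits>;

// The two instantiations are unrelated types...
static_assert(!std::is_same<tagged_string, std::string>::value,
              "different traits give a different string type");

// ...so crossing from one to the other means building a new string from the
// code units, which copies them into separately owned storage.
inline std::string to_byte_string(const tagged_string& s) {
    std::string out(s.data(), s.size());
    assert(out.size() == s.size());
    return out;
}
```

The copy is linear in the length of the string, and the result owns its own buffer, so later changes to one string are not seen by the other.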


>   - a cheap (in terms of work needed) path to Unicode
>     awareness for legacy code.  First get it to work (cheap), then fix
>     performance problems as necessary.
>

Given the above, I fail to see how you can "fix performance problems"
without using a different string type. At which point, you may as well have
used that to begin with.


> Does nobody else think this would be useful?
>

Even if all of the problems I've mentioned so far went away... it's still
the wrong way to do it on a conceptual level.

You're forcing the interface into one specific interpretation, when users
need to be able to use *multiple* interpretations. Users sometimes need to
insert codepoints, and sometimes they need to insert grapheme clusters.
They shouldn't have to switch entire types and copy strings just to do that.

It's just not the right way to handle it.

At the end of the day, if the committee is going to be obstinate in their
refusal to accept that std::basic_string shouldn't be the only string in
the standard, then the only viable solution is one that treats
`basic_string` as exactly what it is (as far as Unicode is concerned): a
contiguous sequence of *code units*.

You should have the ability to look at it as a bidirectional range of
codepoints in some encoding. You should have the ability to look at it as a
bidirectional range of grapheme clusters. But at the end of the day, the
type should not try to hide the fact that it's a sequence of code units.

So you should be able to get appropriate iterators/ranges. And there should
be generic iterators/ranges for user-defined string types. But at no point
should the main interface of std::basic_string hide what it is or try to
pretend that it is something else.
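
To sketch what "look at it as a range of codepoints" could mean without touching basic_string itself: the class below is illustrative and not a proposed interface. It assumes well-formed UTF-8, does no validation, and is forward-only for brevity:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative, non-standard read-only view: walks a std::string of
// well-formed UTF-8 and yields code points by value. The string itself
// stays what it is, a contiguous sequence of code units.
class code_point_view {
public:
    class iterator {
    public:
        iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}

        // Decodes the code point that starts at the current code unit.
        char32_t operator*() const {
            const unsigned char lead = byte(pos_);
            const std::size_t len = length(lead);
            // Payload bits of the lead byte, then 6 bits per continuation byte.
            char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
            for (std::size_t i = 1; i < len; ++i) {
                cp = (cp << 6) | (byte(pos_ + i) & 0x3F);
            }
            return cp;
        }
        iterator& operator++() { pos_ += length(byte(pos_)); return *this; }
        bool operator==(const iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        unsigned char byte(std::size_t i) const {
            assert(i < s_->size());
            return static_cast<unsigned char>((*s_)[i]);
        }
        // Sequence length implied by the lead byte (1 to 4 code units).
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead & 0xE0) == 0xC0) return 2;
            if ((lead & 0xF0) == 0xE0) return 3;
            return 4;
        }
        const std::string* s_;
        std::size_t pos_;
    };

    // The view does not own the string; the string must outlive the view.
    explicit code_point_view(const std::string& s) : s_(&s) {}
    iterator begin() const { return iterator(*s_, 0); }
    iterator end() const { return iterator(*s_, s_->size()); }

private:
    const std::string* s_;
};
```

A range-for loop over code_point_view(s) visits one char32_t per code point, while s.size() and s[i] keep reporting code units. A grapheme-cluster view could be layered on top in the same way, using the segmentation rules of UAX #29.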


------=_Part_7479_2109595498.1439433308208--
------=_Part_7478_1447907665.1439433308208--

.


Author: Jens Maurer <Jens.Maurer@gmx.net>
Date: Thu, 13 Aug 2015 10:16:36 +0200
Raw View
On 08/12/2015 11:54 PM, glen stark wrote:
> I guess I'm starting with the assumption that we will get a set of
> algorithms that can operate on std::strings, and I think that's a
> good idea.

Great.

>  I'm hoping strongly (and I am willing to help in any way
> I'm capable) that we will get a set of unicode aware iterators over
> std::basic_strings.  I'm personally remain of the opinion that good
> grapheme-cluster iterator would be of tremendous use to developers,
> and simplify a large set of unicode handling, but given how the
> discussion is going so far, I feel the need to put some more thought
> into that.
>
> Let's say we get that --

Let's stop right here.  I won't repeat the arguments why modifying
std::string's behavior to get access to these kinds of iterators doesn't
fit with std::string's contract, but let's get the fundamentals done
before discussing the lipstick part (nicer access to the facilities).

If you wish to spend some productive time, work on the points above.
I'd suggest keeping three levels of hierarchy clearly separated:

 (1) code point <-> code unit  (no tables or config necessary)
We already have codecvt for some of that, but I admit a bidirectional
read-only utf8_iterator and a utf16_iterator (adapting a bidirectional
iterator over code units and returning code points) might come in handy.
(I thought this was on the table at some point, but it doesn't seem to
be in the current C++ draft.)

 (2) code point <-> grapheme cluster / normalization
This covers various normalization forms and requires (parts of)
the Unicode tables to determine which character is a combining mark
and how to expand "LATIN SMALL LETTER A WITH DIAERESIS" (U+00E4)
into "a" + some combining mark.

 (3) comparison / sorting
This requires locales of some sort, and interacts with all of the above.


It's important for me not to pay for stuff I don't use.  So if all I need
is (1), I don't want to have a large table for (2) in my (embedded) program.
If all I need is (1) and (2), I don't want runtime locale facilities in
my program.  (For example, if I simply want to use a Unicode string as
the key of a std::map, all I need is (2) [normalization], and I can use
lexicographical comparison on the code units.  I don't actually need (3)
for that.)

Jens


.


Author: Miro Knejp <miro.knejp@gmail.com>
Date: Thu, 13 Aug 2015 15:11:49 +0200
Raw View
> On 13 Aug 2015, at 04:35 , Nicol Bolas <jmckesson@gmail.com> wrote:
> (…)
> However, the rules of STL containers forbid this. All iterator_t::reference
> types must be actual, C++ language references to an actual object contained
> in the sequence; otherwise, it is not an STL iterator. All
> container::reference types must be an actual C++ language reference to the
> value_type; otherwise, it is not an STL container.

This is a topic that is being explored by Eric Niebler and he has a series
of blogs about it
( http://ericniebler.com/2015/01/28/to-be-or-not-to-be-an-iterator/ ).
True, it is a requirement under the current rules, but it could be changed
in the future and he is discussing the adaptations that would be necessary
to properly support proxy iterators in algorithms/ranges.

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Thu, 13 Aug 2015 09:53:44 -0400
Raw View
On 2015-08-12 22:35, Nicol Bolas wrote:
> Why do I want a german string to be a different type from a spanish string?
> Now yes, the language is important for defining lexicographical ordering.
> But that's the *only* operation where that is important. So just have a
> comparison function that takes parameters defining the languages of the
> strings in question.

I would say even that is insufficient; "natural" ordering is insanely
complicated, if not impossible (AI-complete¹), for a computer to get
correct. I have yet, for example, to see an algorithm that correctly
sorts "1 dog", "100 cats", "12F cow" and "1AB pig". Never mind truly
pathological cases like "The Brown Cow" and "Quick Brown Fox, The". Even
worse, the "correct" order can vary by context.

Just don't even *try* to build lexicographical support into this. The
issues Nicol mentioned are just the tip of the iceberg. Realizing that
means there is no need to encode language into the type (which is not
going to work anyway) and removes a major implementation difficulty.

(¹ https://en.wikipedia.org/wiki/AI-complete ...and no, I'm *NOT*
kidding about that. See in particular the "world knowledge" requirement
and note that machine translation is given as an example.)

--
Matthew


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Thu, 13 Aug 2015 08:28:06 -0700 (PDT)
Raw View
On Thursday, August 13, 2015 at 9:54:06 AM UTC-4, Matthew Woehlke wrote:
>
> On 2015-08-12 22:35, Nicol Bolas wrote:
> > Why do I want a german string to be a different type from a spanish
> > string? Now yes, the language is important for defining lexicographical
> > ordering. But that's the *only* operation where that is important. So
> > just have a comparison function that takes parameters defining the
> > languages of the strings in question.
>
> I would say even that is insufficient; "natural" ordering is insanely
> complicated, if not impossible (AI-complete¹), for a computer to get
> correct. I have yet, for example, to see an algorithm that correctly
> sorts "1 dog", "100 cats", "12F cow" and "1AB pig". Never mind truly
> pathological cases like "The Brown Cow" and "Quick Brown Fox, The". Even
> worse, the "correct" can vary by context.
>

Those go beyond mere lexicographical (i.e., alphabetic) ordering; they dip
into natural language processing. The Unicode standard defines a system for
ordering strings <http://unicode.org/reports/tr10/>. That's the most that
any Unicode collation system should provide.

The levels specified by that document are what the C++ standard should
implement. This would rely on normalization, which requires the Unicode
codepoint tables (the collation algorithms themselves require the tables
as well). And Jens brings up an excellent point that this kind of stuff
*really* needs to be opt-in.

.


Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Thu, 13 Aug 2015 12:30:30 -0400
Raw View
On 2015-08-13 11:28, Nicol Bolas wrote:
> On Thursday, August 13, 2015 at 9:54:06 AM UTC-4, Matthew Woehlke wrote:
>> On 2015-08-12 22:35, Nicol Bolas wrote:
>>> Now yes, the language is important for defining lexicographical
>>> ordering. But that's the *only* operation where that is
>>> important. So just have a comparison function that takes
>>> parameters defining the languages of the strings in question.
>>
>> I would say even that is insufficient; "natural" ordering is insanely
>> complicated, if not impossible (AI-complete¹), for a computer to get
>> correct. I have yet, for example, to see an algorithm that correctly
>> sorts "1 dog", "100 cats", "12F cow" and "1AB pig". Never mind truly
>> pathological cases like "The Brown Cow" and "Quick Brown Fox, The". Even
>> worse, the "correct" can vary by context.
>
> Those go beyond mere lexicographical (ie: alphabetic) ordering [...]
> The Unicode standard defines a system for ordering strings
> <http://unicode.org/reports/tr10/>.

...which, I note, still points out that sort order is *context
dependent*. That's really only slightly less convoluted than full
natural sorting, and requires a non-trivial amount of information be
supplied for the sort beyond the string contents. (And frankly, I'd be
surprised if it doesn't run into some of the same pitfalls as full
natural sorting.)

I stand by what I said; there should be One True Ordering for strings
using the default comparison operators (e.g. what would apply to
std::map absent a user provided comparator).

If you need more sophisticated sorting, provide a replacement comparator
or use a specialized sorting algorithm tailored for user presentation.
Don't try to encode that sort of mess into the type itself; that way
lies madness...

(I'm not saying I'm rabidly opposed to the standard specifying such sort
functions, but that's orthogonal to the original issue.)

--
Matthew


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Thu, 13 Aug 2015 10:35:06 -0700 (PDT)
Raw View
------=_Part_472_1597060505.1439487307162
Content-Type: multipart/alternative;
 boundary="----=_Part_473_1749313943.1439487307162"

------=_Part_473_1749313943.1439487307162
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Thursday, August 13, 2015 at 12:30:53 PM UTC-4, Matthew Woehlke wrote:
>
> On 2015-08-13 11:28, Nicol Bolas wrote:
> > On Thursday, August 13, 2015 at 9:54:06 AM UTC-4, Matthew Woehlke wrote:
> >> On 2015-08-12 22:35, Nicol Bolas wrote:
> >>> Now yes, the language is important for defining lexicographical
> >>> ordering. But that's the *only* operation where that is
> >>> important. So just have a comparison function that takes
> >>> parameters defining the languages of the strings in question.
> >>
> >> I would say even that is insufficient; "natural" ordering is insanely
> >> complicated, if not impossible (AI-complete¹), for a computer to get
> >> correct. I have yet, for example, to see an algorithm that correctly
> >> sorts "1 dog", "100 cats", "12F cow" and "1AB pig". Never mind truly
> >> pathological cases like "The Brown Cow" and "Quick Brown Fox, The".
> >> Even worse, the "correct" can vary by context.
> >
> > Those go beyond mere lexicographical (ie: alphabetic) ordering [...]
> > The Unicode standard defines a system for ordering strings
> > <http://unicode.org/reports/tr10/>.
>
> ...which, I note, still points out that sort order is *context
> dependent*.

Oh, I didn't mean to suggest that it's not; it very much is. I wanted to
make sure we understood that Unicode defines specific sorting algorithms
that are a lot less complex than the natural language stuff you were
talking about.

> That's really only slightly less convoluted than full
> natural sorting, and requires a non-trivial amount of information be
> supplied for the sort beyond the string contents.

While it does require lots of information, outside of the specific language
in question (which is only necessary for certain fields), it requires only
static data. This is certainly significant (as I mentioned, it's important
that it be explicitly opt-in, so it can't be the default comparison), but
if you need Unicode ordering, you need Unicode ordering.

> (And frankly, I'd be
> surprised if it doesn't run into some of the same pitfalls as full
> natural sorting.)
>

It does not. Just look at the algorithm; it doesn't require AI or
exceedingly complex logic. It just requires codepoint tables and which
language to use. It's all very algorithmic in the end.

> I stand by what I said; there should be One True Ordering for strings
> using the default comparison operators (e.g. what would apply to
> std::map absent a user provided comparator).
>

If all you need is *an* ordering, rather than a human-visible ordering,
then Unicode normalization provides this even with code-unit comparisons.
As long as all input strings are normalized according to the proper
normalization form (form D if I recall correctly), two strings will only
code-unit-compare differently if they represent different strings.

In short, garbage in, garbage out. If you do everything correctly, the way
things work now gives you the right answer.


------=_Part_473_1749313943.1439487307162--
------=_Part_472_1597060505.1439487307162--

.


Author: Thiago Macieira <thiago@macieira.org>
Date: Thu, 13 Aug 2015 16:25:32 -0700
Raw View
On Wednesday 12 August 2015 23:54:09 glen stark wrote:
> I believe it is possible
> to give std::basic_string unicode awareness without introducing a breaking
> change.

And then you say:

> We could introduce an additional, optional template parameter --

That's a breaking change.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.