Thread

Topic: Unicode support in the Standard Library

Author: Guy Davidson <guy@hatcat.com>
Date: Thu, 8 May 2014 05:59:09 -0700 (PDT)


I am very keen to see Unicode support in C++17.  At the ACCU conference
<http://accu.org/index.php/conferences/accu_conference_2014> I was
encouraged by Nico Josuttis and Kevlin Henney to put a proposal together.
I knocked up a naive interface and an implementation and then I checked the
proposal mechanism.  I discovered Beman Dawes' paper on string
interoperability
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html> and
Mark Boyall's submission
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3572.html>;
gratifyingly, they were very similar to mine (I even used the name
encoded_string and templated it over an encoder and an allocator).  Mark has
advised me that he is no longer pursuing the matter, and Beman's paper
doesn't consider a string class per se.  I have an interface and a partial
implementation that is considerably lighter than ICU 53.1
<http://icu-project.org/apiref/icu4c/index.html>: what should I do next?

Cheers,
Guy

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.


.

Author: Daniel Krügler <daniel.kruegler@gmail.com>
Date: Thu, 8 May 2014 15:07:00 +0200

2014-05-08 14:59 GMT+02:00 Guy Davidson <guy@hatcat.com>:
> I am very keen to see Unicode support in C++17.  At the ACCU conference I
> was encouraged by Nico Josuttis and Kevlin Henney to put a proposal
> together.  I knocked up a naive interface and an implementation and then I
> checked the proposal mechanism.  I discovered Beman Dawes paper on string
> interoperability and Mark Boyall's submission; gratifyingly, they were very
> similar to mine (I even used the name encoded_string and templated it over
> an encoder and an allocator).  Mark has advised me that he is no longer
> pursuing the matter, and Beman's paper doesn't consider a string class per
> se.  I have an interface and a partial implementation that is considerably
> lighter than ICU 53.1: what should I do next?

I would strongly encourage you to write a proposal for this, addressed to
the Library Evolution Group as the target.

Please take a look at

http://isocpp.org/std/submit-a-proposal

for the general procedure for doing that.

If after reading that you still have any further questions please send
a short informal email to the lwgchair address mentioned here:

http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html

The next mailing deadline is mentioned here:

http://www.open-std.org/jtc1/sc22/wg21/

"The deadline for the next mailing is 2014-05-23"

Thanks,

- Daniel


.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Thu, 8 May 2014 14:30:46 +0100

I will give the feedback I gave before: don't create another string class!
Instead, create the necessary algorithms to deal with Unicode. In my
opinion the actual encoding/decoding business is covered by the
std::codecvt<...> facet, although it may be worth explicitly defining
instances of these facets for the various Unicode encodings. There are, of
course, plenty of other algorithms in Unicode which are reasonable to
expose. Given that people like to process UTF-8 and UTF-16, it may be
reasonable to also have encoding-aware algorithms for string operations.

Unless someone provides a really strong argument for another string class,
I will strongly argue against adding another representation for strings! (I
can see a place for an immutable string class but that's entirely
different).
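
As a rough illustration of the "algorithms, not a new string class" idea, here is a minimal sketch of a transcoding algorithm written in the style of the standard algorithms. The name utf8_to_utf32 is hypothetical and does not come from any paper mentioned in this thread; the sketch assumes well-formed UTF-8 input and does no error reporting, which a real facility would need.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical algorithm (illustration only): reads UTF-8 code units from
// [first, last) and writes one char32_t per code point to `out`.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
template <class InputIt, class OutputIt>
OutputIt utf8_to_utf32(InputIt first, InputIt last, OutputIt out)
{
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first++);
        char32_t cp;
        int trailing;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        trailing = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
        else                  { cp = lead & 0x07; trailing = 3; }
        // Each continuation byte contributes its low six bits.
        for (; trailing > 0 && first != last; --trailing) {
            cp = (cp << 6) | (static_cast<unsigned char>(*first++) & 0x3F);
        }
        *out++ = cp;
    }
    return out;
}

// Usage: decode "naive" spelled with U+00EF (bytes C3 AF) into five code points.
inline std::u32string decode_naive_example()
{
    const std::string utf8 = "na" "\xC3\xAF" "ve";
    std::u32string utf32;
    utf8_to_utf32(utf8.begin(), utf8.end(), std::back_inserter(utf32));
    assert(utf32.size() == 5 && utf32[2] == 0xEF);
    return utf32;
}
```

Because the function only sees iterators, the source can be a std::string, a std::vector<char> or any other sequence of bytes, and no new string type is involved.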


.

Author: Daniel Krügler <daniel.kruegler@gmail.com>
Date: Thu, 8 May 2014 15:33:37 +0200

2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietmar.kuehl@gmail.com>:
> I will give the feedback I gave before: don't create another string
> class! Instead, create the necessary algorithms to deal with Unicode. In
> my opinion the actual encoding/decoding business is covered by the
> std::codecvt<...> facet although it may be worth explicitly defining
> instances of these facets for the various Unicode encodings. There are,
> of course, plenty of other algorithms in Unicode which are reasonable to
> expose. Given that people like to process UTF8 and UTF16 it may be
> reasonable to also have encoding aware algorithms for string operations.
>
> Unless someone provides a really strong argument for another string
> class, I will strongly argue against adding another representation for
> strings! (I can see a place for an immutable string class but that's
> entirely different).
>

I would like to add that I completely agree with Dietmar.

- Daniel


.

Author: Guy Davidson <guy@hatcat.com>
Date: Thu, 8 May 2014 07:07:39 -0700 (PDT)


Thanks for your feedback, I'm sorry you have to repeat it.  The main thing
that has driven me to create an encoded_string class is the provision of
iterators (although I'm not a fan of the std::codecvt interface, and the
std::basic_string interface is a bit rich).  If I have a string made up of
variable width characters, as UTF-8 often yields, I have little
opportunity to use the standard algorithms in any meaningful way.  I have
to take a copy of the buffer of std::basic_string, use my own iterator on
it, then reassign the string from the buffer.  This isn't very efficient,
nor does it promote clear code.

Now that I think of it, I suppose another approach might be to modify
std::char_traits and declare the iterator in there, then modify
std::basic_string and define the iterators in terms of the char_traits
iterators.
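
To make the iterator point concrete, here is a minimal sketch of a read-only adaptor that walks the bytes of an existing std::string one code point at a time, so the standard algorithms can run over code points without copying the buffer. The names utf8_iterator and contains_code_point are hypothetical and are not taken from the interface described in this thread; the sketch assumes well-formed UTF-8.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor (illustration only). Dereferencing decodes the code
// point at the current position; incrementing skips a whole encoded sequence.
// Assumption: the underlying bytes are well-formed UTF-8.
class utf8_iterator {
public:
    // Values are computed on the fly, so this is only an input iterator.
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;

    utf8_iterator() = default;
    explicit utf8_iterator(std::string::const_iterator pos) : pos_(pos) {}

    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const int len = length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
        return cp;
    }
    utf8_iterator& operator++() {
        pos_ += length(static_cast<unsigned char>(*pos_));
        return *this;
    }
    utf8_iterator operator++(int) { utf8_iterator old = *this; ++*this; return old; }
    friend bool operator==(const utf8_iterator& a, const utf8_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_iterator& a, const utf8_iterator& b) { return !(a == b); }

private:
    // Length in bytes of the sequence introduced by `lead`.
    static int length(unsigned char lead) {
        return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
    std::string::const_iterator pos_{};
};

// Usage: search for a code point with std::find, directly over the string's bytes.
inline bool contains_code_point(const std::string& s, char32_t cp)
{
    const utf8_iterator last(s.end());
    return std::find(utf8_iterator(s.begin()), last, cp) != last;
}
```

Searching for 0xE2 in a string containing the euro sign (bytes E2 82 AC) finds nothing here, whereas a byte-wise std::find over the same std::string would report a match on the lead byte.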


.

Author: Daniel Krügler <daniel.kruegler@gmail.com>
Date: Thu, 8 May 2014 16:43:06 +0200

2014-05-08 16:07 GMT+02:00 Guy Davidson <guy@hatcat.com>:
> Now that I think of it, I suppose another approach might be to modify
> std::char_traits and declare the iterator in there,

You don't mean char_traits, do you? If yes, I don't see how char
traits are related to iterators.

> then modify
> std::basic_string and define the iterators in terms of the char_traits
> iterators.

I would like to suggest an alternative approach: It is not necessary
to add further member functions to basic_string. Instead you could
provide range-based access functions that are free functions. This
would also allow (but not require) providing these Unicode functions
in a separate header.
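
A minimal sketch of what such a range-based free function might look like follows. The name utf8_length is hypothetical; the function assumes the range holds well-formed UTF-8 code units of type char and simply counts every byte that is not a continuation byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical free function (illustration only): number of code points in a
// range of UTF-8 code units. Continuation bytes have the bit pattern 10xxxxxx,
// so every other byte starts a new code point.
// Assumption: the range contains well-formed UTF-8.
template <class Range>
std::size_t utf8_length(const Range& r)
{
    return static_cast<std::size_t>(
        std::count_if(std::begin(r), std::end(r), [](char c) {
            return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
        }));
}

// Usage: the same function serves a std::string and a std::vector<char>.
inline void utf8_length_example()
{
    const std::string name = "Kr" "\xC3\xBC" "gler";  // 8 bytes, 7 code points
    const std::vector<char> bytes(name.begin(), name.end());
    assert(utf8_length(name) == 7);
    assert(utf8_length(bytes) == 7);
}
```

Nothing is added to basic_string, and the function could live in a separate header.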

- Daniel


.

Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Thu, 8 May 2014 17:57:12 +0300

On 8 May 2014 17:43, Daniel Krügler <daniel.kruegler@gmail.com> wrote:
>> then modify
>> std::basic_string and define the iterators in terms of the char_traits
>> iterators.
>
> I would like to suggest an alternative approach: It is not necessary
> to add further member functions to basic_string. Instead you could
> provide range-based access functions that are free functions. This
> would also allow (but not require) to provide these Unicode functions
> in a separate header.


It would also allow running Unicode algorithms on a vector<char>, I guess.
Or a string_view. Or an iostream.


.

Author: Guy Davidson <guy@hatcat.com>
Date: Thu, 8 May 2014 08:06:40 -0700 (PDT)


I do indeed mean char_traits: my idea is to extend std::char_traits to
include iteration information about the character type, and then modify
std::basic_string to infer its iterator types from its char_traits template
parameter.

The fundamental problem here is that the basic_string class currently only
accommodates fixed width encodings.  UTF-8 and UTF-16 are variable width
encodings (UTF-32, by contrast, is fixed width).  std::u16string is only
fit for representing elements from the Basic Multilingual Plane; it is not
fit for UTF-16.  I'm not sure how to introduce variable width encoded
strings within std::basic_string, hence the introduction of a separate
string class.  The square bracket operator becomes problematic if the width
of a character can be from one to four bytes.  I shall think on.
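
The square bracket concern can be seen in a short sketch using only existing standard types. A std::u16string can store a surrogate pair, but size() and operator[] count and return code units rather than characters; the function name code_unit_demo is just an illustrative choice.

```cpp
#include <cassert>
#include <string>

// Illustration only: one code point outside the Basic Multilingual Plane,
// stored in the existing standard string types.
inline void code_unit_demo()
{
    const std::u16string utf16 = u"\U0001F600";         // U+1F600 as UTF-16
    const std::string    utf8  = "\xF0\x9F\x98\x80";    // U+1F600 as UTF-8 bytes

    // One character, but two UTF-16 code units: a surrogate pair.
    assert(utf16.size() == 2);
    assert(utf16[0] == 0xD83D);  // high surrogate, not a character on its own
    assert(utf16[1] == 0xDE00);  // low surrogate

    // The same character occupies four UTF-8 code units.
    assert(utf8.size() == 4);
    assert(static_cast<unsigned char>(utf8[0]) == 0xF0);  // lead byte only
}
```

Indexing therefore stays constant time, but an index addresses a code unit, which is where the mismatch with "character" arises for both UTF-8 and UTF-16.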

On Thursday, 8 May 2014 15:43:06 UTC+1, Daniel Kr=C3=BCgler wrote:
>
> 2014-05-08 16:07 GMT+02:00 Guy Davidson <g...@hatcat.com <javascript:>>:=
=20
> > Now that I think of it, I suppose another approach might be to modify=
=20
> > std::char_traits and declare the iterator in there,=20
>
> You don't mean char_traits, do you? If yes, I don't see how char=20
> traits are related to iterators.=20
>
> > then modify=20
> > std::basic_string and define the iterators in terms of the char_traits=
=20
> > iterators.=20
>
> I would like to suggest an alternative approach: It is not necessary=20
> to add further member functions to basic_string. Instead you could=20
> provide range-based access functions that are free functions. This=20
> would also allow (but not require) to provide these Unicode functions=20
> in a separate header.=20
>
> - Daniel=20
>
> > On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Kr=C3=BCgler wrote:=20
> >>=20
> >> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:=20
> >> > I will give the feedback I gave before: don't create another string=
=20
> >> > class! Instead, create the necessary algorithms to deal with Unicode=
..=20
> In my=20
> >> > opinion the actual encoding/decoding business is covered by the=20
> >> > std::codecvt<...> facet although it may be worth explicitly defining=
=20
> >> > instances of these facets for the various Unicode encodings. There=
=20
> are, of=20
> >> > course, plenty of other algorithms in Unicode which are reasonable t=
o=20
> >> > expose. Given that people like to process UTF8 and UTF16 it may be=
=20
> >> > reasonable to also have encoding aware algorithms for string=20
> operations.=20
> >> >=20
> >> > Unless soneone provides a really strong argument for another string=
=20
> >> > class, I will strongly argue against adding another representation=
=20
> for=20
> >> > strings! (I can see a place for an immutable string class but that's=
=20
> >> > entirely different).=20
> >> >=20
> >>=20
> >> I would like to add that I completely agree with Dietmar.=20
> >>=20
> >> - Daniel=20
> >=20
> > --=20
> >=20
> > ---=20
> > You received this message because you are subscribed to the Google=20
> Groups=20
> > "ISO C++ Standard - Future Proposals" group.=20
> > To unsubscribe from this group and stop receiving emails from it, send=
=20
> an=20
> > email to std-proposal...@isocpp.org <javascript:>.=20
> > To post to this group, send email to std-pr...@isocpp.org <javascript:>=
..=20
>
> > Visit this group at=20
> > http://groups.google.com/a/isocpp.org/group/std-proposals/.=20
>
>
>
> --=20
>
> ________________________________=20
> SavedURI :Show URLShow URLSavedURI :=20
> SavedURI :Hide URLHide URLSavedURI :=20
>
> https://mail.google.com/_/scs/mail-static/_/js/k=3Dgmail.main.de.LEt2fN4i=
lLE.O/m=3Dm_i,t,it/am=3DOCMOBiHj9kJxhnelj6j997_NLil29vVAOBGeBBRgJwD-m_0_8B_=
AD-qOEw/rt=3Dh/d=3D1/rs=3DAItRSTODy9wv1JKZMABIG3Ak8ViC4kuOWA?random=3D13957=
70800154https://mail.google.com/_/scs/mail-static/_/js/k=3Dgmail.main.de.LE=
t2fN4ilLE.O/m=3Dm_i,t,it/am=3DOCMOBiHj9kJxhnelj6j997_NLil29vVAOBGeBBRgJwD-m=
_0_8B_AD-qOEw/rt=3Dh/d=3D1/rs=3DAItRSTODy9wv1JKZMABIG3Ak8ViC4kuOWA?random=
=3D1395770800154=20
> ________________________________=20
>

--=20

---=20
You received this message because you are subscribed to the Google Groups "=
ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposa=
ls/.

------=_Part_510_17795026.1399561600909
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">I do indeed mean char_traits: my idea is to extend std::ch=
ar_traits to include iteration information about the character type, and th=
en modify std::basic_string to infer its iterator types from its char_trait=
 template parameter.<div><br></div><div>The fundamental problem here is tha=
t the basic_string class currently only accommodates fixed width encoding. =
&nbsp;UTF-8 and UTF-16 are variable width encodings (UTF-32 is also fixed w=
idth). &nbsp;std::u16string is only fit for representing elements from the =
Basic Multilingual Plane, it is not fit for UTF-16. &nbsp;I'm not sure how =
to introduce variable width encoded strings within std::basic_string, hence=
 the introduction of a separate string class. &nbsp;The square bracket oper=
ator becomes problematic if the width of a character can be from one to fou=
r bytes. &nbsp;I shall think on.</div><br>On Thursday, 8 May 2014 15:43:06 =
UTC+1, Daniel Kr=C3=BCgler  wrote:<blockquote class=3D"gmail_quote" style=
=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: =
1ex;">2014-05-08 16:07 GMT+02:00 Guy Davidson &lt;<a href=3D"javascript:" t=
arget=3D"_blank" gdf-obfuscated-mailto=3D"slkRtUHmcfMJ" onmousedown=3D"this=
..href=3D'javascript:';return true;" onclick=3D"this.href=3D'javascript:';re=
turn true;">g...@hatcat.com</a>&gt;:
<br>&gt; Now that I think of it, I suppose another approach might be to mod=
ify
<br>&gt; std::char_traits and declare the iterator in there,
<br>
<br>You don't mean char_traits, do you? If yes, I don't see how char
<br>traits are related to iterators.
<br>
<br>&gt; then modify
<br>&gt; std::basic_string and define the iterators in terms of the char_tr=
aits
<br>&gt; iterators.
<br>
<br>I would like to suggest an alternative approach: It is not necessary
<br>to add further member functions to basic_string. Instead you could
<br>provide range-based access functions that are free functions. This
<br>would also allow (but not require) to provide these Unicode functions
<br>in a separate header.
<br>
<br>- Daniel
<br>
<br>&gt; On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Kr=C3=BCgler wrote:
<br>&gt;&gt;
<br>&gt;&gt; 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl &lt;<a>dietma...@gmai=
l.com</a>&gt;:
<br>&gt;&gt; &gt; I will give the feedback I gave before: don't create anot=
her string
<br>&gt;&gt; &gt; class! Instead, create the necessary algorithms to deal w=
ith Unicode. In my
<br>&gt;&gt; &gt; opinion the actual encoding/decoding business is covered =
by the
<br>&gt;&gt; &gt; std::codecvt&lt;...&gt; facet although it may be worth ex=
plicitly defining
<br>&gt;&gt; &gt; instances of these facets for the various Unicode encodin=
gs. There are, of
<br>&gt;&gt; &gt; course, plenty of other algorithms in Unicode which are r=
easonable to
<br>&gt;&gt; &gt; expose. Given that people like to process UTF8 and UTF16 =
it may be
<br>&gt;&gt; &gt; reasonable to also have encoding aware algorithms for str=
ing operations.
<br>&gt;&gt; &gt;
<br>&gt;&gt; &gt; Unless soneone provides a really strong argument for anot=
her string
<br>&gt;&gt; &gt; class, I will strongly argue against adding another repre=
sentation for
<br>&gt;&gt; &gt; strings! (I can see a place for an immutable string class=
 but that's
<br>&gt;&gt; &gt; entirely different).
<br>&gt;&gt; &gt;
<br>&gt;&gt;
<br>&gt;&gt; I would like to add that I completely agree with Dietmar.
<br>&gt;&gt;
<br>&gt;&gt; - Daniel
<br>&gt;
<br>&gt; --
<br>&gt;
<br>&gt; ---
<br>&gt; You received this message because you are subscribed to the Google=
 Groups
<br>&gt; "ISO C++ Standard - Future Proposals" group.
<br>&gt; To unsubscribe from this group and stop receiving emails from it, =
send an
<br>&gt; email to <a href=3D"javascript:" target=3D"_blank" gdf-obfuscated-=
mailto=3D"slkRtUHmcfMJ" onmousedown=3D"this.href=3D'javascript:';return tru=
e;" onclick=3D"this.href=3D'javascript:';return true;">std-proposal...@<wbr=
>isocpp.org</a>.
<br>&gt; To post to this group, send email to <a href=3D"javascript:" targe=
t=3D"_blank" gdf-obfuscated-mailto=3D"slkRtUHmcfMJ" onmousedown=3D"this.hre=
f=3D'javascript:';return true;" onclick=3D"this.href=3D'javascript:';return=
 true;">std-pr...@isocpp.org</a>.
<br>&gt; Visit this group at
<br>&gt; <a href=3D"http://groups.google.com/a/isocpp.org/group/std-proposa=
ls/" target=3D"_blank" onmousedown=3D"this.href=3D'http://groups.google.com=
/a/isocpp.org/group/std-proposals/';return true;" onclick=3D"this.href=3D'h=
ttp://groups.google.com/a/isocpp.org/group/std-proposals/';return true;">ht=
tp://groups.google.com/a/<wbr>isocpp.org/group/std-<wbr>proposals/</a>.
<br>
<br>
<br>
<br>--=20
<br>
<br>______________________________<wbr>__
<br>SavedURI :Show URLShow URLSavedURI :
<br>SavedURI :Hide URLHide URLSavedURI :
<br>______________________________<wbr>__
<br></blockquote></div>


------=_Part_510_17795026.1399561600909--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 08:09:10 -0700 (PDT) Raw View

------=_Part_65_17424960.1399561750884
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

The thing is, for most of the time the encoding of the string is not
relevant - copying, appending, exact comparison all work on the raw data
and don't care about the encoding. The only time the encoding matters is
during I/O or when performing lexicographic operations, and generally
within the same application, the same encoding will be used (almost)
everywhere.

I think you're on the right track with the suggestion about modifying
std::basic_string to have (either through char_traits or a defaulted
template parameter) an encoding, which defaults to something backward
compatible, but so that the default can be overridden easily enough with
an alias or something, but I don't think you should modify the iterators -
instead there should be global functions
"lexicographic_begin"/"lexicographic_end" or just "lbegin/lend" which will
either get the encoding from the container passed in, or get the encoding
from an additional parameter. These iterators would always have the same
element type which must be capable of representing any unicode character.

For example, this would work for any encoding, or even if input and result
had different encodings:
string input = "<some string encoded with the default encoding for std::string>";
string result;
std::transform(lbegin(input), lend(input), lback_inserter(result), to_upper);

Here lbegin and lend would infer their encoding from input, while
lback_inserter would infer it from result. The element type iterated over
would be a unicode character.
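
A minimal sketch of how such free functions might look, assuming the
encoding is simply fixed to UTF-8 for std::string rather than inferred from
a traits or template parameter. The names utf8_iterator, utf8_back_inserter
and toy_to_upper are hypothetical and not part of any proposal; input is
assumed to be well-formed, and the case mapping only covers ASCII and
Latin-1 letters.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical read-only iterator that decodes UTF-8 stored in a std::string.
// Assumes well-formed input; a real design would need an error policy.
class utf8_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_iterator(std::string::const_iterator pos) : pos_(pos) {}

    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const int len = sequence_length(lead);
        // Keep the payload bits of the lead byte, then append the low six
        // bits of each continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3Fu);
        return cp;
    }
    utf8_iterator& operator++() {
        pos_ += sequence_length(static_cast<unsigned char>(*pos_));
        return *this;
    }
    utf8_iterator operator++(int) { utf8_iterator old = *this; ++*this; return old; }
    friend bool operator==(const utf8_iterator& a, const utf8_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_iterator& a, const utf8_iterator& b) { return !(a == b); }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }
    std::string::const_iterator pos_;
};

// Hypothetical output iterator: encodes each assigned code point as UTF-8
// and appends the bytes to the target string.
class utf8_back_inserter {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_back_inserter(std::string& out) : out_(&out) {}

    utf8_back_inserter& operator=(char32_t cp) {
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
        return *this;
    }
    utf8_back_inserter& operator*() { return *this; }
    utf8_back_inserter& operator++() { return *this; }
    utf8_back_inserter operator++(int) { return *this; }

private:
    void put(char32_t byte) { out_->push_back(static_cast<char>(byte)); }
    std::string* out_;
};

inline utf8_iterator lbegin(const std::string& s) { return utf8_iterator(s.begin()); }
inline utf8_iterator lend(const std::string& s) { return utf8_iterator(s.end()); }
inline utf8_back_inserter lback_inserter(std::string& s) { return utf8_back_inserter(s); }

// Toy case mapping: ASCII and Latin-1 lowercase letters only.
inline char32_t toy_to_upper(char32_t cp) {
    const bool ascii = cp >= U'a' && cp <= U'z';
    const bool latin1 = cp >= 0xE0 && cp <= 0xFE && cp != 0xF7;
    return (ascii || latin1) ? cp - 0x20 : cp;
}
```

With these in place, the std::transform call above compiles unchanged once
to_upper is replaced by toy_to_upper, and the plain begin()/end() of the
string still iterate over the underlying bytes.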

On Thursday, 8 May 2014 15:07:39 UTC+1, Guy Davidson wrote:
>
> Thanks for your feedback, I'm sorry you have to repeat it.  The main thing
> that has driven me to creating an encoded_string class is provision of
> iterators (although I'm not a fan of the std::codecvt interface and the
> std::basic_string interface is a bit rich).  If I have a string made up of
> variable width characters, as UTF-8 often yields, you have little
> opportunity to use the standard algorithms in any meaningful way.  I have
> to take a copy of the buffer of std::basic_string, use my own iterator on
> it, then reassign the string from the buffer.  This isn't very efficient,
> nor does it promote clear code.
>
> Now that I think of it, I suppose another approach might be to modify
> std::char_traits and declare the iterator in there, then modify
> std::basic_string and define the iterators in terms of the char_traits
> iterators.
>
> On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>>
>> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>> > I will give the feedback I gave before: don't create another string
>> > class! Instead, create the necessary algorithms to deal with Unicode. In
>> > my opinion the actual encoding/decoding business is covered by the
>> > std::codecvt<...> facet although it may be worth explicitly defining
>> > instances of these facets for the various Unicode encodings. There are,
>> > of course, plenty of other algorithms in Unicode which are reasonable to
>> > expose. Given that people like to process UTF8 and UTF16 it may be
>> > reasonable to also have encoding aware algorithms for string operations.
>> >
>> > Unless someone provides a really strong argument for another string
>> > class, I will strongly argue against adding another representation for
>> > strings! (I can see a place for an immutable string class but that's
>> > entirely different).
>> >
>>
>> I would like to add that I completely agree with Dietmar.
>>
>> - Daniel
>>
>


------=_Part_65_17424960.1399561750884--

.

Author: Guy Davidson <guy@hatcat.com>
Date: Thu, 8 May 2014 08:13:08 -0700 (PDT) Raw View

------=_Part_525_16436918.1399561988883
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Oooh, I like that...  The square bracket operator remains ambiguous though:
it stops meaning nth character and now only means nth byte in the sequence.
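
A small sketch of that ambiguity, assuming the std::string holds well-formed
UTF-8. The helper names count_code_points and sample are hypothetical and
only serve the illustration.

```cpp
#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the string holds well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80) ++n;
    return n;
}

// "naive" written with a diaeresis on the i: that letter (U+00EF) occupies
// the two bytes 0xC3 0xAF, so the string has five characters in six bytes.
inline std::string sample() { return "na\xC3\xAFve"; }
```

Here sample().size() reports bytes rather than characters, and sample()[3]
is the second byte of the accented letter, not the 'v' a reader counting
characters would expect at index 3.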

On Thursday, 8 May 2014 16:09:10 UTC+1, Diggory Blake wrote:
>
> The thing is, for most of the time the encoding of the string is not
> relevant - copying, appending, exact comparison all work on the raw data
> and don't care about the encoding. The only time the encoding matters is
> during I/O or when performing lexicographic operations, and generally
> within the same application, the same encoding will be used (almost)
> everywhere.
>
> I think you're on the right track with the suggestion about modifying
> std::basic_string to have (either through char_traits or a defaulted
> template parameter) an encoding, which defaults to something backward
> compatible, but so that the default can be overridden easily enough with
> an alias or something, but I don't think you should modify the iterators -
> instead there should be global functions
> "lexicographic_begin"/"lexicographic_end" or just "lbegin/lend" which will
> either get the encoding from the container passed in, or get the encoding
> from an additional parameter. These iterators would always have the same
> element type which must be capable of representing any unicode character.
>
> For example, this would work for any encoding, or even if input and result
> had different encodings:
> string input = "<some string encoded with the default encoding for std::string>";
> string result;
> std::transform(lbegin(input), lend(input), lback_inserter(result), to_upper);
>
> Here lbegin and lend would infer their encoding from input, while
> lback_inserter would infer it from result. The element type iterated over
> would be a unicode character.
>
> On Thursday, 8 May 2014 15:07:39 UTC+1, Guy Davidson wrote:
>>
>> Thanks for your feedback, I'm sorry you have to repeat it.  The main
>> thing that has driven me to creating an encoded_string class is provision
>> of iterators (although I'm not a fan of the std::codecvt interface and the
>> std::basic_string interface is a bit rich).  If I have a string made up of
>> variable width characters, as UTF-8 often yields, you have little
>> opportunity to use the standard algorithms in any meaningful way.  I have
>> to take a copy of the buffer of std::basic_string, use my own iterator on
>> it, then reassign the string from the buffer.  This isn't very efficient,
>> nor does it promote clear code.
>>
>> Now that I think of it, I suppose another approach might be to modify
>> std::char_traits and declare the iterator in there, then modify
>> std::basic_string and define the iterators in terms of the char_traits
>> iterators.
>>
>> On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>>>
>>> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>>> > I will give the feedback I gave before: don't create another string
>>> > class! Instead, create the necessary algorithms to deal with Unicode. In
>>> > my opinion the actual encoding/decoding business is covered by the
>>> > std::codecvt<...> facet although it may be worth explicitly defining
>>> > instances of these facets for the various Unicode encodings. There are,
>>> > of course, plenty of other algorithms in Unicode which are reasonable to
>>> > expose. Given that people like to process UTF8 and UTF16 it may be
>>> > reasonable to also have encoding aware algorithms for string operations.
>>> >
>>> > Unless someone provides a really strong argument for another string
>>> > class, I will strongly argue against adding another representation for
>>> > strings! (I can see a place for an immutable string class but that's
>>> > entirely different).
>>> >
>>>
>>> I would like to add that I completely agree with Dietmar.
>>>
>>> - Daniel
>>>
>>


------=_Part_525_16436918.1399561988883--

.

Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Thu, 8 May 2014 18:13:30 +0300 Raw View

On 8 May 2014 18:06, Guy Davidson <guy@hatcat.com> wrote:
> The fundamental problem here is that the basic_string class currently only
> accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
> encodings (UTF-32 is also fixed width).  std::u16string is only fit for


basic_string accommodates variable width encodings fine. What it doesn't provide
is access to the actual characters rather than raw bytes, and adding access to
characters should not require changing basic_string, or char_traits.
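
As a rough illustration of this point, the sketch below stores a character
outside the Basic Multilingual Plane in an unmodified std::u16string and
reaches the characters through a free function. The name code_points is
hypothetical, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Free function: decodes UTF-16 code units into code points without any
// change to basic_string or char_traits. Assumes well-formed input, i.e.
// every high surrogate is followed by a low surrogate.
inline std::vector<char32_t> code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.size()) {
            // Combine the high surrogate with the low surrogate that follows.
            const char32_t low = s[++i];
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(unit);
    }
    return out;
}
```

A string such as u"a\U0001F600b" holds four code units but three characters;
the container stores the surrogate pair without complaint, and only the
decoding step needs to know about UTF-16.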


.

Author: Guy Davidson <guy@hatcat.com>
Date: Thu, 8 May 2014 08:16:39 -0700 (PDT) Raw View

------=_Part_537_24594337.1399562199692
Content-Type: text/plain; charset=UTF-8

Agreed.

On Thursday, 8 May 2014 16:13:30 UTC+1, Ville Voutilainen wrote:
>
> On 8 May 2014 18:06, Guy Davidson <g...@hatcat.com> wrote:
> > The fundamental problem here is that the basic_string class currently only
> > accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
> > encodings (UTF-32 is also fixed width).  std::u16string is only fit for
>
>
> basic_string accommodates variable width encodings fine. What it doesn't provide
> is access to the actual characters rather than raw bytes, and adding access to
> characters should not require changing basic_string, or char_traits.
>


------=_Part_537_24594337.1399562199692--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 08:17:26 -0700 (PDT) Raw View

------=_Part_443_19864020.1399562246720
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

We shouldn't have a class to represent variable width encodings, especially
not one with operator square brackets: it will give the illusion that
accessing by character index is an efficient operation, which it can't be
in a variable width encoding. Instead the string class should just carry
along information about the encoding so that when needed it's easy to
iterate over logical characters, but the default should still be to iterate
over the underlying data.
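
To make the cost concrete, here is a minimal sketch of what indexing by
character has to do for UTF-8 held in a std::string. The function name
code_point_offset is hypothetical, and well-formed input is assumed.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset at which the n-th code point (0-based) starts,
// or std::string::npos if the string has fewer than n + 1 code points.
// Assumes well-formed UTF-8. The loop visits every byte before the target,
// so the cost grows with n instead of being constant as it is for s[n].
inline std::size_t code_point_offset(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) == 0x80) continue;  // continuation byte, not a start
        if (seen == n) return i;
        ++seen;
    }
    return std::string::npos;
}
```

Hiding this scan behind operator[] would make a linear walk look like a
constant-time lookup, which is the illusion described above; an explicit
function or iterator keeps the cost visible at the call site.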

On Thursday, 8 May 2014 16:06:40 UTC+1, Guy Davidson wrote:
>
> I do indeed mean char_traits: my idea is to extend std::char_traits to
> include iteration information about the character type, and then modify
> std::basic_string to infer its iterator types from its char_trait template
> parameter.
>
> The fundamental problem here is that the basic_string class currently only
> accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
> encodings (UTF-32 is also fixed width).  std::u16string is only fit for
> representing elements from the Basic Multilingual Plane, it is not fit for
> UTF-16.  I'm not sure how to introduce variable width encoded strings
> within std::basic_string, hence the introduction of a separate string
> class.  The square bracket operator becomes problematic if the width of a
> character can be from one to four bytes.  I shall think on.
>
> On Thursday, 8 May 2014 15:43:06 UTC+1, Daniel Krügler wrote:
>>
>> 2014-05-08 16:07 GMT+02:00 Guy Davidson <g...@hatcat.com>:
>> > Now that I think of it, I suppose another approach might be to modify
>> > std::char_traits and declare the iterator in there,
>>
>> You don't mean char_traits, do you? If yes, I don't see how char
>> traits are related to iterators.
>>
>> > then modify
>> > std::basic_string and define the iterators in terms of the char_traits
>> > iterators.
>>
>> I would like to suggest an alternative approach: It is not necessary
>> to add further member functions to basic_string. Instead you could
>> provide range-based access functions that are free functions. This
>> would also allow (but not require) to provide these Unicode functions
>> in a separate header.
>>
>> - Daniel
>>
>> > On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>> >>
>> >> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>> >> > I will give the feedback I gave before: don't create another string
>> >> > class! Instead, create the necessary algorithms to deal with
>> >> > Unicode. In my opinion the actual encoding/decoding business is
>> >> > covered by the std::codecvt<...> facet although it may be worth
>> >> > explicitly defining instances of these facets for the various
>> >> > Unicode encodings. There are, of course, plenty of other algorithms
>> >> > in Unicode which are reasonable to expose. Given that people like to
>> >> > process UTF8 and UTF16 it may be reasonable to also have encoding
>> >> > aware algorithms for string operations.
>> >> >
>> >> > Unless someone provides a really strong argument for another string
>> >> > class, I will strongly argue against adding another representation
>> >> > for strings! (I can see a place for an immutable string class but
>> >> > that's entirely different).
>> >> >
>> >>
>> >> I would like to add that I completely agree with Dietmar.
>> >>
>> >> - Daniel
>

--=20

---=20
You received this message because you are subscribed to the Google Groups "=
ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposa=
ls/.

------=_Part_443_19864020.1399562246720--

.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Thu, 8 May 2014 16:40:55 +0100 Raw View

--Apple-Mail-12C82AF3-ECF0-4564-AE99-74764C234E09
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Note that I would also object to creating a new string class by way of
changing any of its traits! The problem is that creating a new string type
of any form just doesn't help:

 1. Strings are vocabulary types and having multiple ways to represent
    them causes incompatibilities as everybody will use a different one.
 2. Many existing components already use some string representation, if
    nothing else string literals, which contain Unicode encoded characters
    (that's a Bad Idea but that ship has sailed).
 3. Any representation of strings using a multi-word encoding ("byte"
    doesn't quite cut it as std::wstring is multi-word) has problems when
    mutating them. Putting the sequence into a class doesn't make these go
    away (although a new class could drop the requirement for a contiguous
    representation which could help). ... and Unicode characters are
    multi-code-point, i.e., char32_t doesn't make this problem go away,
    either.

That is, the reality is that strings will use a Unicode encoding and need
to be treated reasonably. To avoid incompatibilities due to different
encodings it is absolutely crucial that internal to a program all Unicode
strings for a given character type use the same encoding! If this invariant
is not maintained there is a huge problem. The choice of encoding is
already set [for each implementation] by the encoding chosen for string
literals! That implies, however, that there is an encoding conversion only
when converting between strings with different character types or when
externalising/internalising characters.

A set of Unicode aware algorithms would work on a suitable abstraction,
probably iterators or ranges. The need to copy bytes around when using
algorithms mutating characters in place will arise with all representations.
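
[Editorial illustration] The idea of Unicode-aware algorithms working on
top of an unmodified string can be sketched roughly as follows. This is
only a minimal sketch, assuming the std::string holds UTF-8; the names
next_code_point and code_points are hypothetical and are not part of the
standard library or of any proposal in this thread. Malformed input is
mapped to U+FFFD, and overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a standard function): decodes the code point that
// starts at s[i] and advances i past it. On malformed or truncated input it
// returns U+FFFD and advances by a single byte.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char lead = byte(i);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)              { len = 1; cp = lead; }         // 0xxxxxxx
    else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
    else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
    else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }  // 11110xxx
    else { ++i; return 0xFFFD; }                  // stray continuation byte
    if (i + len > s.size()) { ++i; return 0xFFFD; }  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (byte(i + k) & 0x3F);   // append 6 payload bits
    }
    i += len;
    return cp;
}

// Free function over a plain std::string: neither basic_string nor
// char_traits is changed, the string is simply read as UTF-8.
inline std::vector<char32_t> code_points(const std::string& utf8) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf8.size();)
        out.push_back(next_code_point(utf8, i));
    return out;
}
```

With this, code_points("h\xC3\xA9llo") reads a six-byte string as five
code points, while the string itself still exposes its raw bytes through
the usual basic_string interface.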

> On 8 May 2014, at 15:07, Guy Davidson <guy@hatcat.com> wrote:
>
> Thanks for your feedback, I'm sorry you have to repeat it.  The main
> thing that has driven me to creating an encoded_string class is provision
> of iterators (although I'm not a fan of the std::codecvt interface and the
> std::basic_string interface is a bit rich).  If I have a string made up of
> variable width characters, as UTF-8 often yields, I have little
> opportunity to use the standard algorithms in any meaningful way.  I have
> to take a copy of the buffer of std::basic_string, use my own iterator on
> it, then reassign the string from the buffer.  This isn't very efficient,
> nor does it promote clear code.
>
> Now that I think of it, I suppose another approach might be to modify
> std::char_traits and declare the iterator in there, then modify
> std::basic_string and define the iterators in terms of the char_traits
> iterators.
>
>> On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>> > I will give the feedback I gave before: don't create another string
>> > class! Instead, create the necessary algorithms to deal with Unicode.
>> > In my opinion the actual encoding/decoding business is covered by the
>> > std::codecvt<...> facet although it may be worth explicitly defining
>> > instances of these facets for the various Unicode encodings. There
>> > are, of course, plenty of other algorithms in Unicode which are
>> > reasonable to expose. Given that people like to process UTF8 and UTF16
>> > it may be reasonable to also have encoding aware algorithms for string
>> > operations.
>> >
>> > Unless someone provides a really strong argument for another string
>> > class, I will strongly argue against adding another representation for
>> > strings! (I can see a place for an immutable string class but that's
>> > entirely different).
>> >
>>
>> I would like to add that I completely agree with Dietmar.
>>
>> - Daniel

--Apple-Mail-12C82AF3-ECF0-4564-AE99-74764C234E09--

.

Author: Farid Mehrabi <farid.mehrabi@gmail.com>
Date: Thu, 8 May 2014 21:31:22 +0430 Raw View

--f46d043891ff71f92a04f8e66f9b
Content-Type: text/plain; charset=UTF-8

'value_type' et al. are characteristics of the 'allocator' more than of
'char_traits', so I suggest that you create some new 'allocator' traits and
alias a new breed of strings from 'basic_string' with a modified 'allocator'.

regards,
FM.


2014-05-08 19:46 GMT+04:30 Guy Davidson <guy@hatcat.com>:

> Agreed.
>
> On Thursday, 8 May 2014 16:13:30 UTC+1, Ville Voutilainen wrote:
>
>> On 8 May 2014 18:06, Guy Davidson <g...@hatcat.com> wrote:
>> > The fundamental problem here is that the basic_string class currently
>> only
>> > accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
>> > encodings (UTF-32 is also fixed width).  std::u16string is only fit for
>>
>>
>> basic_string accommodates variable width encodings fine. What it doesn't
>> provide
>> is access to the actual characters rather than raw bytes, and adding
>> access to characters should not require changing basic_string or
>> char_traits.
>>



--
how am I supposed to end the twisted road of  your hair in the dark night??
unless the candle of your face does turn a lamp up on my way!!!

--f46d043891ff71f92a04f8e66f9b--

.

Author: Farid Mehrabi <farid.mehrabi@gmail.com>
Date: Thu, 8 May 2014 21:59:33 +0430 Raw View

--f46d044402f4339d8504f8e6d402
Content-Type: text/plain; charset=UTF-8

I had forgotten how painful it is to keep variable-length objects in a
contiguous region of memory. An array of references to the actual
characters is not efficient, so the ultimate solution is to store each
character in an element large enough to hold the longest possible encoded
value; that is, fixed-size characters.
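
[Editorial illustration] The trade-off of fixed-size elements can be made
concrete with a small sketch. It assumes std::u32string (one char32_t per
code point) as the fixed-size representation; payload_bytes and
decomposed_e_acute are hypothetical helpers introduced only for this
example. It also shows the caveat raised earlier in the thread: one
displayed character may still span several code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the elements of any basic_string
// (the terminator and allocator overhead are not counted).
template <class String>
std::size_t payload_bytes(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single
// accented letter, yet it is stored as two fixed-size char32_t elements.
inline std::u32string decomposed_e_acute() {
    return std::u32string(U"e\u0301");
}
```

For plain ASCII text the fixed-size form costs sizeof(char32_t) bytes per
character where UTF-8 needs one, and indexing by element still does not
mean indexing by displayed character.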

regards,
FM.


2014-05-08 21:31 GMT+04:30 Farid Mehrabi <farid.mehrabi@gmail.com>:

> 'value_type' et al are more of the 'allocator' characteristics than
> 'char_traits'; So I suggest that you create some new 'allocator' traits and
>  alias a new breed of strings from 'basic_strings' with modified
> 'allocator'.
>
> regards,
> FM.
>
>
> 2014-05-08 19:46 GMT+04:30 Guy Davidson <guy@hatcat.com>:
>
>> Agreed.
>>
>> On Thursday, 8 May 2014 16:13:30 UTC+1, Ville Voutilainen wrote:
>>
>>> On 8 May 2014 18:06, Guy Davidson <g...@hatcat.com> wrote:
>>> > The fundamental problem here is that the basic_string class currently
>>> only
>>> > accommodates fixed width encoding.  UTF-8 and UTF-16 are variable
>>> width
>>> > encodings (UTF-32 is also fixed width).  std::u16string is only fit
>>> for
>>>
>>>
>>> basic_string accommodates variable width encodings fine. What it doesn't
>>> provide
>>> is access to the actual characters rather than raw bytes, and adding
>>> access to characters should not require changing basic_string or
>>> char_traits.
>>>


--
how am I supposed to end the twisted road of  your hair in the dark night??
unless the candle of your face does turn a lamp up on my way!!!

--f46d044402f4339d8504f8e6d402--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 11:38:45 -0700 (PDT) Raw View

------=_Part_590_17055578.1399574325624
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Thursday, 8 May 2014 16:40:55 UTC+1, Dietmar Kühl wrote:
>
> Note that I would also object to creating a new string class by way of
> changing any of its traits! The problem is that creating a new string type
> of any form just doesn't help:
>
>  1. Strings are vocabulary types and having multiple ways to represent
> them causes incompatibilities as everybody will use a different one.
>

As long as we have only one possible instantiation of basic_string per
encoding (not counting allocators) this is not a problem. If programs mix
strings with the same element type but different encodings, that's already
an error, and it is better to catch it at compile time than at runtime.

>  2. Many existing components already use some string representation, if
> nothing else string literals, which contain Unicode encoded characters
> (that's a Bad Idea but that ship has sailed).
>

String literals would convert to a basic_string of the equivalent encoding,
so I'm not sure how that's an issue.

>  3. Any representation of strings using a multi-word encoding ("byte"
> doesn't quite cut it as std::wstring is multi-word) has problems when
> mutating them. Putting the sequence into a class doesn't make these go away
> (although a new class could drop the requirement for a contiguous
> representation which could help). ... and Unicode characters are
> multi-code-point, i.e., char32_t doesn't make this problem go away, either.
>

This is over-simplifying; there are several kinds of operation:
- Those which work on the element type (e.g. append, copy, exact comparison,
etc.)
These do not care what the encoding is.

- Those which work on code points (e.g. case conversion, string splitting,
find/replace, the vast majority of useful general purpose operations)
These need to know the encoding, but do not need to worry about
multi-code-point problems - char32_t DOES make these operations simple.
Mutating in place at this level is, as you say, impossible, but providing an
iterator abstraction for both reading and insertion at the level of code
points is essential.

- Those which work on glyphs (e.g. rendering)
These are specialised cases which most applications will not have to worry
about, and for those that do, being able to work directly on code points
will simplify the task considerably.
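
To make the element/code-point distinction concrete, here is a minimal
sketch of reading code points out of a UTF-8 std::string. It assumes the
input is valid UTF-8 and does no error handling; next_code_point and
to_code_points are hypothetical names used only for illustration, not part
of any proposal in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the UTF-8 sequence starting at s[i] and
// advance i past it. Assumes s holds valid UTF-8 (no validation is done).
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    // The lead byte tells us how many continuation bytes follow.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
    char32_t cp = lead & (0x3F >> extra);

    // Each continuation byte contributes its low 6 bits.
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: decode a whole UTF-8 string into code points.
std::u32string to_code_points(const std::string& utf8) {
    std::u32string out;
    for (std::size_t i = 0; i < utf8.size();) {
        out.push_back(next_code_point(utf8, i));
    }
    return out;
}
```

With this, s.size() still answers the element-level question (how many
bytes), while to_code_points(s).size() answers the code-point-level one.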


> That is, the reality is that strings will use a Unicode encoding and need
> to be treated reasonably. To avoid incompatibilities due to different
> encodings it is absolutely crucial that internal to a program all Unicode
> strings for a given character type use the same encoding! If this invariant
> is not maintained there is a huge problem. The choice of encoding is
> already set [for each implementation] by the encoding chosen for string
> literals! That implies, however, that there is an encoding conversion only
> when converting between strings with different character types or when
> externalising/internalising characters.
>
> A set of Unicode aware algorithms would work on a suitable abstraction,
> probably iterators or ranges. The need to copy bytes around when using
> algorithms mutating characters in place will arise with all representations.
>
> On 8 May 2014, at 15:07, Guy Davidson <g...@hatcat.com> wrote:
>
> Thanks for your feedback, I'm sorry you have to repeat it.  The main thing
> that has driven me to creating an encoded_string class is provision of
> iterators (although I'm not a fan of the std::codecvt interface and the
> std::basic_string interface is a bit rich).  If I have a string made up of
> variable width characters, as UTF-8 often yields, I have little
> opportunity to use the standard algorithms in any meaningful way.  I have
> to take a copy of the buffer of std::basic_string, use my own iterator on
> it, then reassign the string from the buffer.  This isn't very efficient,
> nor does it promote clear code.
>
> Now that I think of it, I suppose another approach might be to modify
> std::char_traits and declare the iterator in there, then modify
> std::basic_string and define the iterators in terms of the char_traits
> iterators.
>
> On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>>
>> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>> > I will give the feedback I gave before: don't create another string
>> class! Instead, create the necessary algorithms to deal with Unicode. In my
>> opinion the actual encoding/decoding business is covered by the
>> std::codecvt<...> facet although it may be worth explicitly defining
>> instances of these facets for the various Unicode encodings. There are, of
>> course, plenty of other algorithms in Unicode which are reasonable to
>> expose. Given that people like to process UTF8 and UTF16 it may be
>> reasonable to also have encoding aware algorithms for string operations.
>> >
>> > Unless someone provides a really strong argument for another string
>> class, I will strongly argue against adding another representation for
>> strings! (I can see a place for an immutable string class but that's
>> entirely different).
>> >
>>
>> I would like to add that I completely agree with Dietmar.
>>
>> - Daniel
>>
------=_Part_590_17055578.1399574325624--

.

Author: Matthew Woehlke <mw_triad@users.sourceforge.net>
Date: Thu, 08 May 2014 14:44:30 -0400 Raw View

On 2014-05-08 11:13, Guy Davidson wrote:
> On Thursday, 8 May 2014 16:09:10 UTC+1, Diggory Blake wrote:
>> The thing is, for most of the time the encoding of the string is not
>> relevant - copying, appending, exact comparison all work on the raw data
>> and don't care about the encoding. The only time the encoding matters is
>> during I/O or when performing lexicographic operations, and generally
>> within the same application, the same encoding will be used (almost)
>> everywhere.

Please pardon me showing my unicode ignorance here, but are even search
operations a problem? That is, if I search for e.g. 'á', can it ever
match a second (or third or...) byte of some other multi-byte character?

I want to say 'no', in which case even search (or substring compare)
operations would not need to care about encoding... basically, only n'th
character operations and iterating over characters. (With the caveat
that the result you get back is a byte index and not a character index.)
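
A minimal sketch of that idea, assuming both the text and the search term
are complete, valid UTF-8: the search itself is a plain byte-wise
std::string::find, and the byte index it returns can be turned into a
character (code point) index afterwards by counting non-continuation bytes.
find_utf8 and code_point_index are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Byte-wise search with no knowledge of the encoding. Returns a byte
// offset, or std::string::npos if the needle is absent. Assumes both
// arguments are complete, valid UTF-8.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}

// Hypothetical helper: number of code points that start before byte_pos.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern begins a new code point.
std::size_t code_point_index(const std::string& utf8, std::size_t byte_pos) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos && i < utf8.size(); ++i) {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the text "étá" (bytes C3 A9 74 C3 A1), searching for 'á' (C3 A1) yields
byte index 3, which corresponds to character index 2. The guarantee only
holds for a complete needle: searching for the lone byte A1 would match in
the middle of 'á'.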

>> I think you're on the right track with the suggestion about modifying
>> std::basic_string to have (either through char_traits or a defaulted
>> template parameter) an encoding, which defaults to something backward
>> compatible, but so that the default can be overridden easily enough with an
>> alias or something, but I don't think you should modify the iterators -
>> instead there should be global functions
>> "lexicographic_begin"/"lexicographic_end" or just "lbegin/lend" which will
>> either get the encoding from the container passed in, or get the encoding
>> from an additional parameter. These iterators would always have the same
>> element type which must be capable of representing any unicode character.
>
> Oooh, I like that...  The square bracket operator remains ambiguous though:
> it stops meaning nth character and now only means nth byte in the sequence.

That's probably just going to be an issue. On the other hand, the above
suggests also adding 'l[ex[icographic]_]at' to get the n'th character.
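
As a rough sketch of what such a function could look like for UTF-8 (the
name lex_at, its signature, and the choice to return the character's bytes
as a std::string are all assumptions for illustration; input is assumed to
be valid UTF-8):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical lex_at: return the bytes of the n'th code point (0-based).
// Throws std::out_of_range if the string has n or fewer code points.
std::string lex_at(const std::string& utf8, std::size_t n) {
    std::size_t i = 0;
    // Walk forward until we reach the lead byte of the n'th code point.
    for (; i < utf8.size(); ++i) {
        if (!is_continuation(utf8[i])) {
            if (n == 0) break;
            --n;
        }
    }
    if (i == utf8.size()) throw std::out_of_range("lex_at: index past end");

    // The code point extends over any continuation bytes that follow.
    std::size_t j = i + 1;
    while (j < utf8.size() && is_continuation(utf8[j])) ++j;
    return utf8.substr(i, j - i);
}
```

Unlike operator[], which stays O(1) and byte-indexed, this has to scan from
the start of the string, so it costs time linear in the index.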

--
Matthew


.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 11:49:22 -0700 (PDT) Raw View

------=_Part_658_25824120.1399574962702
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Thursday, 8 May 2014 19:44:30 UTC+1, Matthew Woehlke wrote:

>
> Please pardon me showing my unicode ignorance here, but are even search
> operations a problem? That is, if I search for e.g. 'á', can it ever
> match a second (or third or...) byte of some other multi-byte character?
>
> I want to say 'no', in which case even search (or substring compare)
> operations would not need to care about encoding... basically, only n'th
> character operations and iterating over characters. (With the caveat
> that the result you get back is a byte index and not a character index.)
>

It depends on the encoding - UTF-8, for example, ensures that the
representation of one complete code point can never occur as part of another.
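
A small counterexample from an encoding without that property, as a sketch:
in Shift_JIS the character U+8868 is the byte pair 0x95 0x5C, and 0x5C on
its own is the backslash, so a byte-wise search for a backslash reports a
match inside the two-byte character. In UTF-8 the same character is
E8 A1 A8 and no false match is possible. find_byte and the two constants
are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <string>

// Byte-wise search that knows nothing about the encoding of `bytes`.
std::size_t find_byte(const std::string& bytes, char needle) {
    return bytes.find(needle);
}

// U+8868 in Shift_JIS: lead byte 0x95, trail byte 0x5C (same value as '\\').
const std::string hyou_shift_jis = "\x95\x5C";

// U+8868 in UTF-8: every byte of a multi-byte sequence is >= 0x80.
const std::string hyou_utf8 = "\xE8\xA1\xA8";
```

Here find_byte(hyou_shift_jis, '\\') returns 1, a false match on the trail
byte, while find_byte(hyou_utf8, '\\') returns std::string::npos.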

------=_Part_658_25824120.1399574962702--

.

Author: Matthew Woehlke <mw_triad@users.sourceforge.net>
Date: Thu, 08 May 2014 15:06:00 -0400 Raw View

On 2014-05-08 14:49, Diggory Blake wrote:
> On Thursday, 8 May 2014 19:44:30 UTC+1, Matthew Woehlke wrote:
>> Please pardon me showing my unicode ignorance here, but are even search
>> operations a problem? That is, if I search for e.g. 'á', can it ever
>> match a second (or third or...) byte of some other multi-byte character?
>>
>> I want to say 'no', in which case even search (or substring compare)
>> operations would not need to care about encoding... basically, only n'th
>> character operations and iterating over characters. (With the caveat
>> that the result you get back is a byte index and not a character index.)
>
> It depends on the encoding - utf8 for example ensures that the
> representation of one complete code-point can never occur as part of another

Er, yes, I should specify UTF encodings when I say that.

IMHO at this point non-UTF encodings just need to die. I'd be opposed to
adding support for non-UTF multi-byte encodings (except for conversion
to/from UTF)... not that I get a vote :-).

--
Matthew


.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 12:41:17 -0700 (PDT) Raw View

------=_Part_652_15991220.1399578078001
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Regardless of whether support for non-UTF encodings is in the standard
library, search/replace still needs to depend on the encoding, in case
other encodings are implemented by the user or in the future. An
'is_self_synchronising' trait on encodings might be a useful feature to
enable certain optimisations - search/replace could check for this and, if
it's present, skip the decoding step.
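
Roughly what I have in mind, as a sketch only: the encoding tag types, the
char_length hook and find_encoded are hypothetical names, and the Shift_JIS
lead-byte ranges are a simplification. The trait selects between a plain
byte-wise search and one that steps through the text a character at a time.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical encoding tags.
struct utf8_encoding {};
struct shift_jis_encoding {};

// The trait: false unless an encoding opts in.
template <class Encoding>
struct is_self_synchronising : std::false_type {};
template <>
struct is_self_synchronising<utf8_encoding> : std::true_type {};

// Hypothetical per-encoding hook: byte length of the character whose first
// byte is `lead`. Simplified: treats 0x81-0x9F and 0xE0-0xFC as lead bytes.
inline std::size_t char_length(shift_jis_encoding, unsigned char lead) {
    const bool two_byte =
        (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xFC);
    return two_byte ? 2 : 1;
}

// Fast path: self-synchronising, so a byte-wise search cannot mis-align.
template <class Encoding>
std::size_t find_impl(const std::string& haystack, const std::string& needle,
                      std::true_type) {
    return haystack.find(needle);
}

// Slow path: only test positions where a character actually starts.
template <class Encoding>
std::size_t find_impl(const std::string& haystack, const std::string& needle,
                      std::false_type) {
    for (std::size_t i = 0; i < haystack.size();
         i += char_length(Encoding{}, static_cast<unsigned char>(haystack[i]))) {
        if (haystack.compare(i, needle.size(), needle) == 0) return i;
    }
    return std::string::npos;
}

// Dispatch on the trait at compile time.
template <class Encoding>
std::size_t find_encoded(const std::string& haystack, const std::string& needle) {
    return find_impl<Encoding>(haystack, needle, is_self_synchronising<Encoding>{});
}
```

For the Shift_JIS bytes 95 5C 5C, find_encoded<shift_jis_encoding> finds the
backslash at byte 2, skipping the 5C that is the trail byte of the first
character, whereas a naive find would report byte 1.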

On Thursday, 8 May 2014 20:06:00 UTC+1, Matthew Woehlke wrote:
>
> On 2014-05-08 14:49, Diggory Blake wrote:
> > On Thursday, 8 May 2014 19:44:30 UTC+1, Matthew Woehlke wrote:
> >> Please pardon me showing my unicode ignorance here, but are even search
> >> operations a problem? That is, if I search for e.g. 'á', can it ever
> >> match a second (or third or...) byte of some other multi-byte character?
> >>
> >> I want to say 'no', in which case even search (or substring compare)
> >> operations would not need to care about encoding... basically, only n'th
> >> character operations and iterating over characters. (With the caveat
> >> that the result you get back is a byte index and not a character index.)
> >
> > It depends on the encoding - utf8 for example ensures that the
> > representation of one complete code-point can never occur as part of
> > another
>
> Er, yes, I should specify UTF encodings when I say that.
>
> IMHO at this point non-UTF encodings just need to die. I'd be opposed to
> adding support for non-UTF multi-byte encodings (except for conversion
> to/from UTF)... not that I get a vote :-).
>
> --
> Matthew
>
>

------=_Part_652_15991220.1399578078001--

.

Author: Tony V E <tvaneerd@gmail.com>
Date: Thu, 8 May 2014 16:11:37 -0400 Raw View

--001a113373029a81a804f8e916e5
Content-Type: text/plain; charset=UTF-8

On Thu, May 8, 2014 at 3:41 PM, Diggory Blake <diggsey@googlemail.com> wrote:

> Regardless of whether support for non-UTF encodings are in the standard
> library, search/replace still needs to depend on the encoding, in case
> other encodings are implemented by the user or in the future. An
> 'is_self_synchronising' trait on encodings might be a useful feature to
> enable certain optimisations - search/replace could check for this and if
> it's present not have to bother with the decoding step.
>
>
Or only support self-synchronizing encodings - i.e. no need for a trait.  Any
new/future encodings that aren't self-synchronizing are either stupid or
too advanced for us to consider until they actually exist.

Tony

--001a113373029a81a804f8e916e5--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 13:28:54 -0700 (PDT) Raw View

------=_Part_474_27942338.1399580934737
Content-Type: text/plain; charset=UTF-8

It seems very presumptuous to assume that there could never be a need to
use a non-self-synchronising encoding, especially when supporting one costs
nothing more than adding a trait. I can think of plenty of reasons
offhand why it would be useful - backwards compatibility, interfacing with
software or even hardware using such encodings, etc.

On Thursday, 8 May 2014 21:11:37 UTC+1, Tony V E wrote:
>
>
>
>
> On Thu, May 8, 2014 at 3:41 PM, Diggory Blake <dig...@googlemail.com>
> wrote:
>
>> Regardless of whether support for non-UTF encodings are in the standard
>> library, search/replace still needs to depend on the encoding, in case
>> other encodings are implemented by the user or in the future. An
>> 'is_self_synchronising' trait on encodings might be a useful feature to
>> enable certain optimisations - search/replace could check for this and if
>> it's present not have to bother with the decoding step.
>>
>>
> Or only support self-synchronizing encodings - ie no need for a trait.
> Any new/future encodings that aren't self-synchronizing are either stupid
> or too advanced for us to consider until they actually exist.
>
> Tony
>


------=_Part_474_27942338.1399580934737
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">It seems very presumptuous to assume that there could neve=
r be a need to use a non-self-synchronising encoding, especially when it do=
esn't cost anything more than adding a trait? I can think of plenty of reas=
ons off hand why it would be useful - backwards compatibility, interfacing =
with software or even hardware using such encodings, etc.<br><br>On Thursda=
y, 8 May 2014 21:11:37 UTC+1, Tony V E  wrote:<blockquote class=3D"gmail_qu=
ote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padd=
ing-left: 1ex;"><div dir=3D"ltr"><br><div><br><br><div class=3D"gmail_quote=
">On Thu, May 8, 2014 at 3:41 PM, Diggory Blake <span dir=3D"ltr">&lt;<a hr=
ef=3D"javascript:" target=3D"_blank" gdf-obfuscated-mailto=3D"-04djV2EqMEJ"=
 onmousedown=3D"this.href=3D'javascript:';return true;" onclick=3D"this.hre=
f=3D'javascript:';return true;">dig...@googlemail.com</a>&gt;</span> wrote:=
<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex"><div dir=3D"ltr">Regardless of whether suppo=
rt for non-UTF encodings are in the standard library, search/replace still =
needs to depend on the encoding, in case other encodings are implemented by=
 the user or in the future. An 'is_self_synchronising' trait on encodings m=
ight be a useful feature to enable certain optimisations - search/replace c=
ould check for this and if it's present not have to bother with the decodin=
g step.</div>
<br></blockquote><div><br></div><div>Or only support self-synchronizing enc=
odings - ie no need for a trait.&nbsp; Any new/future encodings that aren't=
 self-synchronizing are either stupid or too advanced for us to consider un=
til they actually exist. <br>
</div></div><br></div><div>Tony<br></div></div>
</blockquote></div>

<p></p>

-- <br />
<br />
--- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals&quot; group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/">http://groups.google.com/a/isocpp.org/group/std-proposals/<=
/a>.<br />

------=_Part_474_27942338.1399580934737--

.

Author: Tony V E <tvaneerd@gmail.com>
Date: Thu, 8 May 2014 16:38:40 -0400 Raw View

--001a11336c2664dbfe04f8e97784
Content-Type: text/plain; charset=UTF-8

On Thu, May 8, 2014 at 4:28 PM, Diggory Blake <diggsey@googlemail.com> wrote:

> It seems very presumptuous to assume that there could never be a need to
> use a non-self-synchronising encoding, especially when it doesn't cost
> anything more than adding a trait? I can think of plenty of reasons off
> hand why it would be useful - backwards compatibility, interfacing with
> software or even hardware using such encodings, etc.
>
>
It is not just a trait; it means implementing two code paths, one for
encodings that have the trait and one for those that don't. It also means
additional language in the standard (if you added up the salaries of the
Committee's volunteers, the costs would be enormous). Added cost needs to
show real benefits, not just maybe benefits.

So sure, if you have solid examples.

--001a11336c2664dbfe04f8e97784--

.

Author: Philipp Maximilian Stephani <p.stephani2@gmail.com>
Date: Thu, 08 May 2014 20:43:07 +0000 Raw View

--047d7b6dc67a54689904f8e98750
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

In general there is no need to explicitly take variable-width encodings
into account, provided all users are aware that the "characters" of
basic_string are in fact code units and not code points, scalar values, or
grapheme clusters. This is an ugliness that all languages have to live with
(except Perl), but it's more important to keep backward compatibility and
APIs intact. A new string class is practically impossible, as it would
introduce a split between gazillions of existing APIs and the "new" stuff,
which would then be mostly ignored. As ugly as it may be, I guess we have to
live with std::string. I'd also suggest not relying on char_traits; that
concept seems rather flawed and awkward. A UTF-8-decoding range view/adapter
sounds much nicer and less intrusive.

Apart from that, I'd love to see stuff that goes far beyond simple
decoding/encoding: Unicode algorithms (canonicalization, text segmentation,
regexes...), access to character properties, etc. When discussing Unicode, I
feel people focus too much on encoding: that's an important part, but still
only a tiny part of Unicode.
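
As a rough illustration of the range view/adapter idea, the sketch below
wraps a std::string and yields code points while leaving the string itself
untouched. The names utf8_view and count_code_points are made up for this
example, and it assumes the input is well-formed UTF-8; a real design would
have to say what happens on malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical adapter (not a standard facility): views a std::string that
// holds well-formed UTF-8 as a sequence of code points.
class utf8_view {
public:
    class iterator {
    public:
        explicit iterator(const char* p) : p_(p) {}

        // Decode the code point starting at the current code unit.
        char32_t operator*() const {
            const unsigned char lead = static_cast<unsigned char>(*p_);
            const std::size_t n = length(lead);
            if (n == 1) return lead;
            char32_t cp = lead & (0x7F >> n);  // payload bits of the lead byte
            for (std::size_t i = 1; i < n; ++i) {
                // Each continuation byte contributes its low six bits.
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            }
            return cp;
        }

        // Advance by one encoded character, not by one code unit.
        iterator& operator++() {
            p_ += length(static_cast<unsigned char>(*p_));
            return *this;
        }

        bool operator!=(const iterator& other) const { return p_ != other.p_; }

    private:
        // Number of code units in the sequence introduced by this lead byte.
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead >> 5) == 0x06) return 2;
            if ((lead >> 4) == 0x0E) return 3;
            return 4;
        }

        const char* p_;
    };

    explicit utf8_view(const std::string& s)
        : first_(s.data()), last_(s.data() + s.size()) {}

    iterator begin() const { return iterator(first_); }
    iterator end() const { return iterator(last_); }

private:
    const char* first_;
    const char* last_;
};

// size() keeps counting code units; this counts code points instead.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char32_t cp : utf8_view(s)) {
        (void)cp;
        ++n;
    }
    return n;
}
```

For the ten-byte string "a\xC3\xBC\xE2\x82\xAC\xF0\x9F\x98\x80", size() is
10 while count_code_points gives 4 (U+0061, U+00FC, U+20AC, U+1F600).
Existing code keeps seeing code units; only code that asks for the view
sees code points.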

On Thu May 08 2014 at 17:06:42, Guy Davidson <guy@hatcat.com> wrote:

> I do indeed mean char_traits: my idea is to extend std::char_traits to
> include iteration information about the character type, and then modify
> std::basic_string to infer its iterator types from its char_traits template
> parameter.
>
> The fundamental problem here is that the basic_string class currently only
> accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
> encodings (only UTF-32 is fixed width).  std::u16string is only fit for
> representing elements from the Basic Multilingual Plane, it is not fit for
> UTF-16.  I'm not sure how to introduce variable width encoded strings
> within std::basic_string, hence the introduction of a separate string
> class.  The square bracket operator becomes problematic if the width of a
> character can be from one to four bytes.  I shall think on.
>
> On Thursday, 8 May 2014 15:43:06 UTC+1, Daniel Krügler wrote:
>>
>> 2014-05-08 16:07 GMT+02:00 Guy Davidson <g...@hatcat.com>:
>>
>> > Now that I think of it, I suppose another approach might be to modify
>> > std::char_traits and declare the iterator in there,
>>
>> You don't mean char_traits, do you? If yes, I don't see how char
>> traits are related to iterators.
>>
>> > then modify
>> > std::basic_string and define the iterators in terms of the char_traits
>> > iterators.
>>
>> I would like to suggest an alternative approach: It is not necessary
>> to add further member functions to basic_string. Instead you could
>> provide range-based access functions that are free functions. This
>> would also allow (but not require) to provide these Unicode functions
>> in a separate header.
>>
>> - Daniel
>>
>> > On Thursday, 8 May 2014 14:33:37 UTC+1, Daniel Krügler wrote:
>> >>
>> >> 2014-05-08 15:30 GMT+02:00 Dietmar Kuehl <dietma...@gmail.com>:
>> >> > I will give the feedback I gave before: don't create another string
>> >> > class! Instead, create the necessary algorithms to deal with Unicode.
>> >> > In my opinion the actual encoding/decoding business is covered by the
>> >> > std::codecvt<...> facet although it may be worth explicitly defining
>> >> > instances of these facets for the various Unicode encodings. There
>> >> > are, of course, plenty of other algorithms in Unicode which are
>> >> > reasonable to expose. Given that people like to process UTF8 and
>> >> > UTF16 it may be reasonable to also have encoding aware algorithms for
>> >> > string operations.
>> >> >
>> >> > Unless someone provides a really strong argument for another string
>> >> > class, I will strongly argue against adding another representation
>> >> > for strings! (I can see a place for an immutable string class but
>> >> > that's entirely different).
>> >> >
>> >>
>> >> I would like to add that I completely agree with Dietmar.
>> >>
>> >> - Daniel

--047d7b6dc67a54689904f8e98750--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 14:30:20 -0700 (PDT) Raw View

------=_Part_222_18634906.1399584620574
Content-Type: text/plain; charset=UTF-8

How about the following:

basic_string is completely unchanged, except that it contains an additional
typedef:
typedef typename traits_type::encoding encoding;

The current system is to have a unique type to represent each element of
each encoding (i.e. "char" is different from both "uint8_t" and "int8_t",
and "char16_t" is different from "uint16_t", "int16_t" and even "wchar_t").

To be consistent with this, user-defined encodings would supply their own
element type, e.g.:
struct utf_ebcdic_char {
    uint8_t m_value;
    /* Relevant operators and conversions here */
};

In addition the new encoding would specialise char_traits<utf_ebcdic_char>
accordingly. Now, basic_string<utf_ebcdic_char> would represent strings
encoded in utf_ebcdic.

All "std::string"s would essentially become utf8 strings (because 'char' is
the element type for utf8), but it would have absolutely no effect on
existing code, which can only operate on the elements of the string.

New code however would be able to make use of new range views/adapters for
encoding and decoding unicode code points, and unless explicitly
overridden, these would infer the encoding from the new typedef in
basic_string.

Problems?
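
A rough sketch of how the inference could look from the outside. Since
std::char_traits can't be modified in an example, a separate trait stands in
for the proposed traits_type::encoding member; encoding_of, the tag types
and encoding_name_of are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical encoding tags.
struct utf8_encoding {};
struct utf16_encoding {};
struct utf32_encoding {};
struct utf_ebcdic_encoding {};

// User-defined element type, as described above.
struct utf_ebcdic_char {
    std::uint8_t m_value;
};

// Stand-in for the proposed char_traits<CharT>::encoding member: maps each
// element type to the encoding it implies. The primary template is left
// undefined so that unknown element types fail to compile.
template <class CharT> struct encoding_of;
template <> struct encoding_of<char> { typedef utf8_encoding type; };
template <> struct encoding_of<char16_t> { typedef utf16_encoding type; };
template <> struct encoding_of<char32_t> { typedef utf32_encoding type; };
template <> struct encoding_of<utf_ebcdic_char> {
    typedef utf_ebcdic_encoding type;
};

// One overload per encoding; an adapter would select its decoder like this.
inline const char* encoding_name(utf8_encoding) { return "UTF-8"; }
inline const char* encoding_name(utf16_encoding) { return "UTF-16"; }
inline const char* encoding_name(utf32_encoding) { return "UTF-32"; }
inline const char* encoding_name(utf_ebcdic_encoding) { return "UTF-EBCDIC"; }

// New code infers the encoding from the string type; the string class and
// any existing code using it are unchanged.
template <class CharT, class Traits, class Alloc>
const char* encoding_name_of(const std::basic_string<CharT, Traits, Alloc>&) {
    return encoding_name(typename encoding_of<CharT>::type());
}
```

Here encoding_name_of(std::string("abc")) yields "UTF-8" and
encoding_name_of(std::u16string()) yields "UTF-16". One consequence worth
noting is that every std::string is then treated as UTF-8, whatever bytes
it actually holds.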

On Thursday, 8 May 2014 21:43:07 UTC+1, Philipp Stephani wrote:
>
> In general there is no need to explicitly take variable-width encodings
> into account, provided all users are aware that the "characters" of
> basic_string are in fact code units and not code points/scalar
> values/grapheme clusters. This is an ugliness that all languages have to
> live with (except Perl), but it's more important to keep backward
> compatibility and APIs intact. A new string class is practically impossible
> as it would introduce a split between gazillions of existing APIs and the
> "new" stuff, which would then be mostly ignored. As ugly as it may be, I
> guess we have to live with std::string. I'd also suggest not to rely on
> char_traits, that concept seems rather flawed and awkward. A UTF-8-decoding
> range view/adapter sounds much nicer and less intrusive. Apart from that,
> I'd love to see stuff that goes far beyond simple decoding/encoding:
> Unicode algorithms (canonicalization, text segmentation, regexes...),
> access to character properties, etc. When discussing Unicode, I feel people
> focus too much on encoding: that's an important part, but still only a tiny
> part of Unicode.
>
>

------=_Part_222_18634906.1399584620574--

.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Thu, 8 May 2014 23:28:53 +0100 Raw View

--Apple-Mail-2FECCA83-3F0D-4C03-A592-445411E943A2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Sadly, Unicode isn't that simple: there are multiple representations of
characters which are equivalent. For example, the u-umlaut (ü) in my name
can be represented as a single precomposed u-umlaut or as a u followed by a
combining diaeresis. To match a u you'll need to check whether the
following character is a combining character or not. To do so you need to
understand the encoding.

Unicode had promised to make that simple by having three fundamental design
rules:

- every character is represented by one code point; that went overboard and
  there are now combining characters
- every string has a unique representation as a sequence of code points;
  that's not true, as certain sequences of characters can be written in
  more than one order
- each code point uses 16 bits; well, code points now need up to 21 bits

If they had stuck with their original goals things would be nearly as
simple as you'd think. As is, they are not.
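
The u-umlaut case can be shown in a few lines. This is only a sketch: the
function names are invented for the example, and find_plain_u recognises
just one combining mark (U+0308), whereas real code would have to consult
Unicode character properties or normalise both strings first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "ü" as one code point: U+00FC, UTF-8 bytes C3 BC.
inline std::string precomposed_u_umlaut() { return "\xC3\xBC"; }

// "ü" as two code points: U+0075 'u' followed by U+0308 COMBINING
// DIAERESIS, UTF-8 bytes 75 CC 88.
inline std::string decomposed_u_umlaut() { return "u\xCC\x88"; }

// Finds the first 'u' that is not followed by U+0308.
// Simplification: only this one combining mark is checked.
inline std::size_t find_plain_u(const std::string& utf8) {
    const std::string diaeresis = "\xCC\x88";
    std::size_t pos = utf8.find('u');
    while (pos != std::string::npos &&
           utf8.compare(pos + 1, diaeresis.size(), diaeresis) == 0) {
        pos = utf8.find('u', pos + 1);  // skip a 'u' that is really a "ü"
    }
    return pos;
}
```

The two spellings of "ü" compare unequal byte for byte (two bytes against
three), and a plain find('u') hits the decomposed form even though the user
sees no bare "u" there. That holds even in UTF-8, so self-synchronisation
alone doesn't make searching correct.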

> On 8 May 2014, at 19:49, Diggory Blake <diggsey@googlemail.com> wrote:
>
>> On Thursday, 8 May 2014 19:44:30 UTC+1, Matthew Woehlke wrote:
>>
>> Please pardon me showing my unicode ignorance here, but are even search
>> operations a problem? That is, if I search for e.g. 'á', can it ever
>> match a second (or third or...) byte of some other multi-byte character?
>>
>> I want to say 'no', in which case even search (or substring compare)
>> operations would not need to care about encoding... basically, only n'th
>> character operations and iterating over characters. (With the caveat
>> that the result you get back is a byte index and not a character index.)
>
> It depends on the encoding - utf8 for example ensures that the
> representation of one complete code-point can never occur as part of another

--Apple-Mail-2FECCA83-3F0D-4C03-A592-445411E943A2--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 15:40:53 -0700 (PDT) Raw View

------=_Part_9_6838830.1399588853222
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

But for the purposes of algorithms such as find/replace, combining
characters can be treated separately - if a caller needs canonically
equivalent sequences to match, it is simple enough for the caller to
ensure that the input strings are normalised first. Unicode defines
precisely how strings are normalised (the normalisation forms NFC, NFD,
NFKC and NFKD), and this can be implemented in the standard library. Also,
we are now at the level of Unicode code points: at this level the rules
for combining characters are the same regardless of the original encoding,
so whatever rules are used to handle or not handle combining characters
will be exactly the same for all encodings.
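
To make the two points above concrete, here is a minimal sketch, assuming
the strings hold well-formed UTF-8 in a plain `std::string`. The names
`byte_find`, `precomposed_u_umlaut` and `decomposed_u_umlaut` are
hypothetical and exist only for illustration. It shows that a byte-level
search works for identical code point sequences, but that the precomposed
and decomposed spellings of "ü" do not match each other unless the caller
normalises first (normalisation itself is not shown, since the standard
library does not provide it).

```cpp
#include <cstddef>
#include <string>

// UTF-8 bytes are written out explicitly so the example does not depend
// on the encoding of the source file.
const std::string precomposed_u_umlaut = "\xC3\xBC";   // U+00FC
const std::string decomposed_u_umlaut  = "u\xCC\x88";  // U+0075 U+0308

// Plain byte-level search. Returns the byte offset of the first match,
// or std::string::npos when the needle does not occur.
std::size_t byte_find(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

Searching "K\xC3\xBChl" for the precomposed form finds it at byte offset 1.
Searching the decomposed spelling "Ku\xCC\x88hl" for the precomposed form
finds nothing, while searching it for a plain "u" does match at offset 1,
which is exactly the case a caller would have to rule out by normalising or
by checking for a following combining mark.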

On Thursday, 8 May 2014 23:28:53 UTC+1, Dietmar Kühl wrote:
>
> Sadly, Unicode isn't as simple: there are multiple representations of
> characters which are equivalent. For example the u-umlaut (ü) in my name
> can be represented as u-umlaut or as a u and a combining character
> dieresis. To match a u you'll need to understand whether the preceding (I
> think) character is a combining character or not. To do so you need to
> understand the encoding.
>
> Unicode had promised to make that simple by having three fundamental
> design rules:
>
> - every character is represented by one code point; that went overboard
> and there are now combining characters
> - every string has a unique representation as a sequence of code points;
> that's not true as you can choose certain orders of characters
> - each code point uses 16 bits; well, last time I looked they were at 20
> bits
>
> If they had stuck with their original goals things would be nearly as
> simple as you'd think. As is, they are not.
>
> On 8 May 2014, at 19:49, Diggory Blake <dig...@googlemail.com> wrote:
>
> On Thursday, 8 May 2014 19:44:30 UTC+1, Matthew Woehlke wrote:
>
>>
>> Please pardon me showing my unicode ignorance here, but are even search
>> operations a problem? That is, if I search for e.g. 'á', can it ever
>> match a second (or third or...) byte of some other multi-byte character?
>>
>> I want to say 'no', in which case even search (or substring compare)
>> operations would not need to care about encoding... basically, only n'th
>> character operations and iterating over characters. (With the caveat
>> that the result you get back is a byte index and not a character index.)
>>
>
> It depends on the encoding - utf8 for example ensures that the
> representation of one complete code-point can never occur as part of
> another

------=_Part_9_6838830.1399588853222--

.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Fri, 9 May 2014 01:49:24 +0100 Raw View

OK, when I made the comment quoted below I was at work typing on a mobile device, hence the message didn't contain all the parts which seem to be necessary to this discussion. I read all the other contributions and they are mostly going off into an area I consider the entirely wrong direction, so I'll pretend they were not made and continue from this earlier point of the discussion. So let me put the arguments for the general design together. Note, however, that I'm not going to write a proposal or make a promise to review proposals made by others. However, the arguments below will guide my arguments in the committee (unless someone makes good arguments that they are wrong).

Step 1: External vs. Internal Encoding

When processing strings there is always an encoding involved. In its simplest form, it is a single-byte, fixed-width encoding like, e.g., ASCII. It doesn't matter whether the characters are internal to a program, i.e., they are stored in memory by the program, or external to a program, i.e., they are in a file, in a buffer just read into a program, etc.: there is an encoding. However, it is important to realise that there is only *one* internal encoding, and strings shall be converted from whatever external encoding into the internal encoding upon reading and converted from the internal encoding to the external encoding upon writing! Dealing with multiple internal encodings [for the same character type] is neither necessary nor helpful. OK, it may be necessary if there are potential internal encodings which don't cover the same set of characters. Well, that was the case for single-byte, fixed-width encodings: for example, the different choices of ISO-Latin-n covered different characters. However, we are talking about Unicode processing and, despite all its failures, Unicode covers the full range of [human] characters (yes, Klingon characters were removed from Unicode; as far as I can tell to make space for a comprehensive set of characters for turds).

With C++ there is a slight and somewhat annoying complication in that C++ has multiple character types with different widths. That is, different character types will use different internal encodings. Since the strings will have different types there isn't much danger of accidental interference, although there is some danger as individual character types happily convert between each other. It is worth noting that in the context of Unicode processing the different *character* values actually do **not** represent *characters*! For most character types one value cannot represent a character as there aren't enough bits in the value (with char32_t being the exception; well, at least, for now). Instead, a character value (e.g. an individual `char` or `wchar_t`) actually represents only a part of a character. Thinking mostly in terms of strings of `char` representing something like UTF-8 sequences, I think of the values stored in the character types as *bytes* although a `wchar_t` is, at least, made up of two bytes (ICU calls what I call bytes *singletons*). That is, strings of the fundamental character types are made up of bytes where one or more bytes form an actual character. Actually, multiple bytes make up a *code point* which is used to represent a character, but I will use character and code point interchangeably.
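
A small sketch may help to illustrate the distinction drawn above between stored values (which the Unicode standard calls *code units*) and code points. It assumes that `char` strings hold UTF-8 (spelled out as explicit bytes) and that `char16_t` and `char32_t` literals are UTF-16 and UTF-32 respectively. The helper `code_units` and the variable names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Number of stored values (code units) in a string of any character type.
template <class CharT>
std::size_t code_units(const std::basic_string<CharT>& s) {
    return s.size();
}

// U+00FC (u with diaeresis), one code point in three representations.
const std::string    u_umlaut8  = "\xC3\xBC";  // UTF-8 bytes written out
const std::u16string u_umlaut16 = u"\u00FC";
const std::u32string u_umlaut32 = U"\u00FC";

// U+1F600, a code point outside the Basic Multilingual Plane.
const std::string    smiley8  = "\xF0\x9F\x98\x80";
const std::u16string smiley16 = u"\U0001F600";  // stored as a surrogate pair
const std::u32string smiley32 = U"\U0001F600";
```

Under these assumptions `code_units` yields 2, 1 and 1 for the three spellings of U+00FC, and 4, 2 and 1 for U+1F600: only the `char32_t` strings have one stored value per code point.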

Since C++ defines string literals for all of these character types (right now I'm not sure if there are really `signed char` and `unsigned char` literals but I will simply assume that these will use the same encoding as `char`) and these string literals may contain Unicode characters, the internal encoding for each character type is actually already chosen by the compiler (I think the compilers are free to choose the encoding, though). That is, **all** internal processing of Unicode strings will use the character type's specific internal encoding! There is no need to think about different encodings [for a specific character type] internal to the program. This invariant is extremely important and makes thinking about string processing possible. It also means we can ignore the concept of encoding entirely for the internal processing of characters! This invariant is somewhat similar to the approach in physics of choosing the units such that all the fundamental constants, e.g., the speed of light in vacuum (c), become 1: the resulting math for the complicated formulas becomes viable (still incomprehensible to me but they look a huge amount simpler).

Why do encodings still matter? Well, outside of the program there are arbitrary and often fairly odd encodings entirely outside the control of the program (although there is a hope that in the long term there will be just a few more or less reasonable encodings left). However, dealing with these external encodings can be centralised. In fact, dealing with external encodings already *is* centralised: this is what the `std::codecvt<InternT, ExternT, StateT>` facets are for. That is possibly not an ideal interface and we *may* consider creating a different interface, but the conversion between an external encoding and the internal encoding is entirely orthogonal to a library processing Unicode.
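
The idea of converting once at the program boundary can be sketched as follows. This is only an illustration under stated assumptions: the external encoding is taken to be UTF-8, the internal one UTF-32 in a `std::u32string`, and `decode_utf8` is a hypothetical hand-written helper, not the `std::codecvt` interface mentioned above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary conversion: external UTF-8 bytes to internal
// UTF-32. Throws on invalid lead bytes, bad continuation bytes and
// truncated input. Overlong forms and surrogates are not rejected here;
// a production decoder would have to reject them as well.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra >= bytes.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Code past this boundary only ever sees the internal representation, so none of it needs to know which external encoding the data arrived in.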

Conveniently, the above means that for a Unicode library we can *entirely* ignore any encoding issues except, of course, the fact that the bytes used in strings represent characters according to some encoding. What that encoding is, the programmer doesn't need to care about, as the Unicode algorithms will correctly interpret the bytes to process characters.

I have never used ICU directly but from a cursory look at their interfaces I *think* the ICU design makes the same assumption: once strings are inside a program, their encoding is known. Actually, ICU further simplifies its view of the world by having just one character type, not 6 (`char`, `unsigned char`, `signed char`, `wchar_t`, `char16_t`, and `char32_t`). I don't think C++ has this luxury.

Step 2: Vocabulary Types

With encodings out of the way, let's now turn the focus to string types: how many should we have? Since strings are a fundamental abstraction which is used in many interfaces which need to communicate between different components, the obvious answer is "one!" If there is more than one, some components will choose to use different string types than other components. Sadly, string literals and the standard library string class have different types, meaning that there are already two string types - for each of the 6 choices of character types! While it would be borderline viable to unify the string literals and `std::basic_string<cT>` using a suitable `std::basic_string_view<cT>`, a similar approach isn't possible for the different character types. To make matters worse, there are at least two popular choices for the commonly used character types (`wchar_t` on Windows and `char` everywhere else). This situation is sufficiently bad and we should refrain from worsening the problem by introducing another string type: strings are fundamental vocabulary and choice on the fundamental vocabulary is inherently bad.

To some extent there may be room for a different decision: instead of supporting all character types, it may be reasonable to support exactly one character type for the purpose of a Unicode library. Whenever an interface asks for or provides a string of a different type, the strings would need to be transformed, which probably involves a change of the used encoding. Having just one string type to deal with is a huge advantage and, as far as I can tell, everybody always uses just one string type (either `std::string` or `std::wstring`). Since the standard would make a choice, the choice could deliberately be to use `std::ustring`, which could be mandated to be neither `std::string` nor `std::wstring`, drawing attention to the fact that `std::ustring` is guaranteed to use a Unicode encoding, i.e., the individual bytes (singletons) inside a `std::ustring` do **not** represent a character but merely part of a character. This assumption is often made for `std::string` and `std::wstring`, too, but there is a lot of code out there which treats `std::string` and `std::wstring` as if they contained characters rather than bytes/singletons.

Although the use of just one string class, i.e., a particular instantiation of `std::basic_string<cT>` with a suitable character type `cT`, sounds great, I doubt that it will get enough support: either the `std::string` world will be upset or the `std::wstring` world or both. I honestly believe that we would do ourselves and future generations of programmers a **tremendous** favour if we could agree on The One string type (which might even use a polymorphic allocator to remove the potential desire to vary allocation policies by changing the type; I'm personally on the fence on that one but that is itself a bigger discussion which I'm not going to have now). Yes, I realise that using just one string type would be disruptive now, but instead of taking out a huge credit on the future we would remove that technical debt.

Step 3: What would a Unicode library look like?

First of all, even if there is no encoding to be dealt with and no new string class, there are plenty of operations which are non-trivial, partly due to the way Unicode is designed:

- character-aware string processing: locating, extracting, changing, etc. individual characters
- determining the number of characters, splitting strings after a certain number of characters
- comparing strings: aside from Unicode strings not necessarily being normalised, ordering strings in a form usable for humans is non-trivial: even trivial representations like the European letter-based strings are ordered differently depending on the language context, e.g., where `ll` goes in Spanish or other European languages. This is nothing compared to ordering strings with Chinese or Japanese characters.
- have a look at the operations in ICU to get more topics to be addressed.
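
As an illustration of the second item in the list above, counting characters and splitting after a given number of them could be sketched like this for UTF-8 held in a `std::string`. The function names are hypothetical, the input is assumed to be well-formed, and "character" here means code point, which is not the same as a user-perceived character once combining marks are involved.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of code points in a well-formed UTF-8 string: every byte that
// is not a continuation byte starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8)
        if (!is_continuation(c)) ++n;
    return n;
}

// Byte offset at which the n-th code point (0-based) starts, or
// utf8.size() if the string has n or fewer code points. The result is
// a safe position at which to split the string.
std::size_t offset_of_code_point(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (is_continuation(utf8[i])) continue;
        if (seen == n) return i;
        ++seen;
    }
    return utf8.size();
}
```

For the five bytes "K\xC3\xBChl" this counts four code points, and `utf8.substr(0, offset_of_code_point(utf8, 2))` keeps the first two of them without cutting the two-byte sequence in half.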

The next question then becomes how these algorithms should deal with strings. The most likely abstraction is to have the algorithms operate in terms of character iterators, possibly limiting the set of iterator types the algorithms would operate on. I'm not sure if that is needed but I could imagine that it may be undesirable to have all these algorithms be templates defined in headers and instantiated wherever needed.

However, just processing STL-like iterators is probably not enough because sequences of characters may change the number of bytes used to encode them when applying even trivial transformations! That is, these algorithms may need to operate in terms of a richer abstraction than STL iterators. I haven't tried to create the corresponding abstractions.

This is what I think is important in this domain off the top of my head. There is probably more but for now, I think the summary is:

- deal with only one encoding [per character type]
- do not create another string type; if absolutely necessary to have a type, create something viewing an underlying string type
- the algorithms operating on Unicode strings are the important aspect
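
The second summary point, a type that merely views an underlying string, could look roughly like the following sketch. `utf8_view` and `to_code_points` are hypothetical names, the viewed `std::string` is assumed to hold well-formed UTF-8 and to outlive the view, and only the minimum needed for a range-based `for` loop is provided.

```cpp
#include <cstddef>
#include <string>

// Hypothetical non-owning view that presents the code points of a
// well-formed UTF-8 std::string without adding a new owning string type.
class utf8_view {
public:
    class iterator {
    public:
        iterator(const std::string* s, std::size_t pos) : s_(s), pos_(pos) {}

        // Decode the code point starting at the current byte offset.
        char32_t operator*() const {
            const unsigned char lead = byte(pos_);
            const std::size_t len = length(lead);
            char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (byte(pos_ + k) & 0x3F);
            return cp;
        }
        // Advance by one code point, i.e. by one to four bytes.
        iterator& operator++() { pos_ += length(byte(pos_)); return *this; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }
        std::size_t byte_offset() const { return pos_; }

    private:
        unsigned char byte(std::size_t i) const {
            return static_cast<unsigned char>((*s_)[i]);
        }
        // Sequence length implied by the lead byte (input assumed valid).
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead & 0xE0) == 0xC0) return 2;
            if ((lead & 0xF0) == 0xE0) return 3;
            return 4;
        }
        const std::string* s_;
        std::size_t pos_;
    };

    explicit utf8_view(const std::string& s) : s_(&s) {}
    iterator begin() const { return iterator(s_, 0); }
    iterator end() const { return iterator(s_, s_->size()); }

private:
    const std::string* s_;
};

// Usage: collect the code points of a UTF-8 string.
std::u32string to_code_points(const std::string& s) {
    std::u32string out;
    for (char32_t cp : utf8_view(s)) out.push_back(cp);
    return out;
}
```

The string keeps its ordinary type and can still be passed to any interface taking `std::string`; the view only changes how an algorithm walks over it. A read-only view like this does not address transformations that change the encoded length, which is the richer abstraction left open above.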

> On 8 May 2014, at 15:30, Dietmar Kuehl <dietmar.kuehl@gmail.com> wrote:
> I will give the feedback I gave before: don't create another string class! Instead, create the necessary algorithms to deal with Unicode. In my opinion the actual encoding/decoding business is covered by the std::codecvt<...> facet although it may be worth explicitly defining instances of these facets for the various Unicode encodings. There are, of course, plenty of other algorithms in Unicode which are reasonable to expose. Given that people like to process UTF8 and UTF16 it may be reasonable to also have encoding aware algorithms for string operations.
>
> Unless someone provides a really strong argument for another string class, I will strongly argue against adding another representation for strings! (I can see a place for an immutable string class but that's entirely different).


.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 8 May 2014 18:40:30 -0700 (PDT) Raw View

------=_Part_902_25792550.1399599630498
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Friday, 9 May 2014 01:49:24 UTC+1, Dietmar Kühl wrote:
>
> OK, when I made the comment quoted below I was at work typing on a mobile
> device, hence the message didn't contain all the parts which seem to be
> necessary to this discussion. I read all the other contributions and they
> are mostly going off into an area I consider the entirely wrong direction,
> so I'll pretend they were not made and continue from this earlier point of
> the discussion. So let me put the arguments for the general design
> together. Note, however, that I'm not going to write a proposal or make a
> promise to review proposals made by others. However, the arguments below
> will guide my arguments in the committee (unless someone makes good
> arguments that they are wrong).
>
> Step 1: External vs. Internal Encoding
>
> When processing strings there is always an encoding involved. In its
> simplest form, it is a single-byte, fixed-width encoding like, e.g.,
> ASCII. It doesn't matter whether the characters are internal to a program,
> i.e., they are stored in memory by the program, or external to a program,
> i.e., they are in a file, in a buffer just read into a program, etc.:
> there is an encoding. However, it is important to realise that there is
> only *one* internal encoding, and strings shall be converted from whatever
> external encoding into the internal encoding upon reading and converted
> from the internal encoding to the external encoding upon writing! Dealing
> with multiple internal encodings [for the same character type] is neither
> necessary nor helpful. OK, it may be necessary if there are potential
> internal encodings which don't cover the same set of characters. Well,
> that was the case for single-byte, fixed-width encodings: for example, the
> different choices of ISO-Latin-n covered different characters. However, we
> are talking about Unicode processing and, despite all its failures,
> Unicode covers the full range of [human] characters (yes, Klingon
> characters were removed from Unicode; as far as I can tell to make space
> for a comprehensive set of characters for turds).
>

You're forgetting that C++ code may have to interoperate with other code
which uses a different internal encoding. For example, if I want to call
MessageBoxW on Windows, I need to pass in a string which is UTF-16 encoded.
If I'm going to be calling such a function a lot I should be able to store
that string internally in UTF-16 so I don't have to convert every time.
Furthermore, the type system should prevent me from inadvertently assigning
non-UTF-16-encoded strings to it and vice versa. This can be done without
modifying the basic_string class as per my previous suggestion. You're
right that 'char', 'char16_t' and 'char32_t' might not use the UTF
encodings, so I would amend my suggestion such that
"char_traits<char/char16_t/char32_t>::encoding" would be the
implementation-specific encoding used by string literals of the same type,
rather than necessarily UTF-8/16/32.
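
One possible reading of this suggestion, sketched under explicit assumptions: since user code cannot add a member to `std::char_traits`, the sketch derives a traits class that carries an `encoding` tag and instantiates `std::basic_string` with it. The names `utf16_encoding`, `utf16_traits`, `utf16_string`, `to_utf16_string` and `message_length` are all hypothetical, and `message_length` merely stands in for an API such as MessageBoxW.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical tag naming an encoding.
struct utf16_encoding {};

// Traits class that behaves like std::char_traits<char16_t> but also
// records the encoding, in the spirit of "char_traits<...>::encoding".
struct utf16_traits : std::char_traits<char16_t> {
    using encoding = utf16_encoding;
};

// basic_string itself is unchanged; a different traits argument simply
// yields a distinct string type.
using utf16_string = std::basic_string<char16_t, utf16_traits>;

// Crossing between the untagged and the tagged type has to be spelled
// out, so it cannot happen by accident. The caller vouches that the
// data really is UTF-16.
utf16_string to_utf16_string(const std::u16string& s) {
    return utf16_string(s.data(), s.size());
}

// Stand-in for an API that requires UTF-16 input.
std::size_t message_length(const utf16_string& text) {
    return text.size();  // a real implementation would call the OS here
}
```

With these definitions a `std::u16string` does not convert implicitly to `utf16_string`, so passing one to `message_length` is a compile-time error until `to_utf16_string` is called. String literals still initialise either type, so the sketch only guards against mixing string objects, not against mislabelled literals.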

> > On 8 May 2014, at 15:30, Dietmar Kuehl <dietma...@gmail.com> wrote:
> > I will give the feedback I gave before: don't create another string
> > class! Instead, create the necessary algorithms to deal with Unicode. In
> > my opinion the actual encoding/decoding business is covered by the
> > std::codecvt<...> facet although it may be worth explicitly defining
> > instances of these facets for the various Unicode encodings. There are,
> > of course, plenty of other algorithms in Unicode which are reasonable to
> > expose. Given that people like to process UTF8 and UTF16 it may be
> > reasonable to also have encoding aware algorithms for string operations.
> >
> > Unless someone provides a really strong argument for another string
> > class, I will strongly argue against adding another representation for
> > strings! (I can see a place for an immutable string class but that's
> > entirely different).


------=_Part_902_25792550.1399599630498
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><br>On Friday, 9 May 2014 01:49:24 UTC+1, Dietmar K=C3=
=BChl  wrote:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-le=
ft: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">OK, when I made t=
he comment quoted below I was at work typing on a mobile device, hence, the=
 message didn't contain all the parts which seem to be necessary to this di=
scussion. I read all the other contributions and they are mostly going of i=
nto an area I consider the entirely wrong direction so I'll pretend they we=
re not made and continue from this earlier point of the discussion. So let =
me put the arguments for the general design together. Note, however, that I=
'm not going to write a proposal or make a promise to review proposals made=
 by other. However, the arguments below will guide my arguments in the comm=
ittee (unless someone makes good arguments that they are wrong.
<br>
<br>Step 1: External vs. Internal Encoding
<br>
<br>When processing strings there is always an encoding involved. In its si=
mplest form, it is a singly byte, fixed width encoding like, e.g., ASCII. I=
t doesn't matter whether the characters are internal to a program, i.e., th=
ey are stored in memory by the program or they are external to a program, i=
..e., they are in a file, in a buffer just read into a program, etc.: the is=
 an encoding. However, it is important to realise that there is only *one* =
internal encoding and string shall be converted from whatever external enco=
ding into the internal encoding upon reading and converted from the interna=
l encoding to the external encoding upon writing! Dealing with multiple int=
ernal encodings [for the same character type] is neither necessary nor help=
ful. OK, it may be necessary if there are potential internal encodings whic=
h don't cover the same set of characters. Well, i was the case for single b=
yte fixed width encodings: for example, the different choices of ISO-Latin-=
n covered different characters. However, we are talking about Unicode proce=
ssing and despite all its failures Unicode covers the full range of [human]=
 characters (yes, Klingon characters were removed from Unicode; as far as I=
 can tell to make space for a comprehensive set of characters for turds).
<br></blockquote><div><br>You're forgetting that C++ code may have to inter=
operate with other code which uses a different internal encoding. For examp=
le, if I want to call MessageBoxW on windows, I need to pass in a string wh=
ich is utf16 encoded. If I'm going to be calling such a method a lot I shou=
ld be able to store that string internally in utf16 so I don't have to conv=
ert every time. Furthermore, the type system should prevent me from inadver=
tently assigning non-utf16-encoded strings to it and vice-versa. This can b=
e done without modifying the basic_string class as per my previous suggesti=
on. You're right that 'char', 'char16_t' and 'char32_t' might not use the u=
tf encodings, so I would amend my suggestion such that "char_traits&lt;char=
/char16_t/char32_t&gt;::encoding" would be the implementation specific enco=
ding used by string literals of the same type, rather than necessarily utf8=
/16/32.<br><br></div><blockquote class=3D"gmail_quote" style=3D"margin: 0;m=
argin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">
<br>&gt; On 8 May 2014, at 15:30, Dietmar Kuehl &lt;<a href=3D"javascript:"=
 target=3D"_blank" gdf-obfuscated-mailto=3D"1vdOz7_9qJYJ" onmousedown=3D"th=
is.href=3D'javascript:';return true;" onclick=3D"this.href=3D'javascript:';=
return true;">dietma...@gmail.com</a>&gt; wrote:
<br>&gt; I will give the feedback I gave before: don't create another strin=
g class! Instead, create the necessary algorithms to deal with Unicode. In =
my opinion the actual encoding/decoding business is covered by the std::cod=
ecvt&lt;...&gt; facet although it may be worth explicitly defining instance=
s of these facets for the various Unicode encodings. There are, of course, =
plenty of other algorithms in Unicode which are reasonable to expose. Given=
 that people like to process UTF8 and UTF16 it may be reasonable to also ha=
ve encoding aware algorithms for string operations.
<br>&gt;=20
<br>&gt; Unless soneone provides a really strong argument for another strin=
g class, I will strongly argue against adding another representation for st=
rings! (I can see a place for an immutable string class but that's entirely=
 different).
<br>
<br></blockquote></div>


------=_Part_902_25792550.1399599630498--

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Thu, 08 May 2014 19:27:09 -0700 Raw View

On Thu, 8 May 2014, at 12:41:17, Diggory Blake wrote:
> Regardless of whether support for non-UTF encodings is in the standard
> library, search/replace still needs to depend on the encoding, in case
> other encodings are implemented by the user or in the future. An
> 'is_self_synchronising' trait on encodings might be a useful feature to
> enable certain optimisations - search/replace could check for this and if
> it's present not have to bother with the decoding step.

Use the hammer here:

Don't add any of the above. If someone wants to do any kind of parsing or
transformation of the legacy-encoded string, they should first convert it to
one of the UTF encodings.

What's more, I don't think the standard should support that conversion, unless
it's the "locale encoding". The standard should support only an opaque "locale
encoding" conversion function. It does already (mbstowcs), but it could be
made a little nicer (e.g., QString::fromLocal8Bit).
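
A minimal sketch of what a "nicer" opaque locale-encoding conversion might look like. The name `from_locale_encoding` is hypothetical and not part of any standard or Qt API; the sketch uses `std::mbsrtowcs`, the restartable sibling of `mbstowcs`, because its length-query mode is specified by the C standard. It assumes the input contains no embedded NUL characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a string in the current C locale's multibyte
// encoding to a wide string. The input is treated as a C string, so the
// conversion stops at the first embedded NUL.
std::wstring from_locale_encoding(const std::string& in) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();
    // With a null destination, mbsrtowcs only reports the required length.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");
    std::wstring out(len, L'\0');
    if (len == 0) return out;
    src = in.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(&out[0], &src, len, &state);  // second pass: convert
    return out;
}
```

The caller never names an encoding: whatever the active locale uses is what gets decoded, which is the "opaque" property asked for above.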

Any other encodings should be left to ICU and other libraries.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Thu, 08 May 2014 19:32:58 -0700 Raw View

On Thu, 8 May 2014, at 23:28:53, Dietmar Kuehl wrote:
> Unicode had promised to make that simple by having three fundamental design
> rules:
>
> - every character is represented by one code point; that went overboard and
> there are now combining characters

Combining characters are required to get some of the more exotic combinations.
Why shouldn't I add an acute above and a comma below to s (ș́)? There's ś and
there's ş, but there's no pre-combined codepoint for both diacritics.

> - every string has a unique representation as a sequence of code points;
> that's not true as you can choose certain orders of characters

Not really. Every precombined character has exactly two representations: the
NFC and the NFD. Whenever the combining characters are used, there's an exact
order in which they should appear.

You may construct them in the wrong order, but then they are non-canonical. A
good text editor should fix it.
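
To make the two representations concrete, here is a small sketch using "é" as the example character (my choice, not one from the thread). The variable and function names are illustrative only. It shows that the NFC and NFD forms are different code point sequences, so a plain comparison treats them as different strings unless something normalises them first.

```cpp
#include <cassert>
#include <string>

// U+00E9 (precomposed e-acute): the NFC form, one code point.
const std::u32string e_acute_nfc = U"\u00E9";
// U+0065 followed by U+0301 (combining acute accent): the NFD form.
const std::u32string e_acute_nfd = U"e\u0301";

// Plain code-point comparison; no normalisation is performed.
bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```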

> - each code point uses 16 bits; well, last time I looked they were at 20
> bits

That's not a problem. UTF-16 can still represent all of them.
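
For reference, a sketch of how UTF-16 covers code points beyond 16 bits, using surrogate pairs. `encode_utf16` is a hypothetical helper name; the arithmetic is the standard UTF-16 encoding form.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: 20 bits split across two surrogates.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low
    }
    return out;
}
```

For example, U+1F600 becomes the two code units 0xD83D 0xDE00.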

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Thu, 08 May 2014 19:42:38 -0700 Raw View

On Fri, 9 May 2014, at 01:49:24, Dietmar Kuehl wrote:
> - character-aware string processing: locating, extracting, changing, etc.
> individual characters

Easy: force people to convert to UTF-32 / UCS-4 before doing any of the above.

> - determining the number of characters, splitting strings after a certain
> number of characters,

Number of characters or number of codepoints? For codepoints, it's easy:
require UTF-32 / UCS-4 again. It's easy with UTF-8 and UTF-16 too, just not as
easy.
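
A sketch of the "not as easy" UTF-8 case: counting code points by skipping continuation bytes. `count_code_points` is a hypothetical name, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in valid UTF-8 by counting every
// byte that is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Note that this counts code points, not user-perceived characters: "s" followed by two combining marks is three code points but displays as one character.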

If you really mean the width of the string, this requires loading part of
UnicodeData.txt to find out whether a given codepoint is zero-, single- or
double-width.

> - comparing strings: aside
> from Unicode strings not necessarily being normalised, ordering strings in
> a form usable for humans is non-trivial: even trivial representation like
> the European letter-based strings are ordered differently depending on the
> language context, e.g., where `ll` goes in the Spanish or other European
> languages. This is nothing compared to ordering strings with Chinese or
> Japanese characters.

Oh, no, not in the standard. This requires loading the entire CLDR to get it
right. Please leave this to ICU.

Trust me, one of the biggest complaints we've had in Qt 5 since we began
requiring ICU is that the icudata library is too big (18 MB). Imagine
forcing every C++ runtime to have that.

> - have a look at the operations in ICU to get more topics to be addressed.

Please don't.

Instead, let's focus on what we want, not on what we could copy from ICU
(hint: let's not copy ICU).

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Thu, 8 May 2014 20:18:23 -0700 Raw View

On Thu, May 8, 2014 at 7:42 PM, Thiago Macieira <thiago@macieira.org> wrote:
> On Fri, 9 May 2014, at 01:49:24, Dietmar Kuehl wrote:
>> - comparing strings: aside
>> from Unicode strings not necessarily being normalised, ordering strings in
>> a form usable for humans is non-trivial: even trivial representation like
>> the European letter-based strings are ordered differently depending on the
>> language context, e.g., where `ll` goes in the Spanish or other European
>> languages. This is nothing compared to ordering strings with Chinese or
>> Japanese characters.
>
> Oh, no, not in the standard. This requires loading the entire CLDR to get it
> right. Please leave this to ICU.
>
> Trust me, one of the biggest complaints we've had in Qt 5 since we began
> requiring ICU is that the icudata library is too big (18 MB). Imagine
> forcing every C++ runtime to have that.

ICU doesn't require the whole 18MB data file if you're not using the
data. See http://userguide.icu-project.org/icudata#TOC-Customizing-ICU-s-Data-Library.

>> - have a look at the operations in ICU to get more topics to be addressed.
>
> Please don't.
>
> Instead, let's focus on what we want, not on what we could copy from ICU
> (hint: let's not copy ICU).

If we don't base C++ Unicode support on the most successful Unicode
library, we shouldn't have C++ Unicode support at all. We just don't
have the expertise or volunteer time in this committee to do a better
job than the domain experts.

It would totally make sense to do a subset or try to find more C++-y
interfaces where ICU is more Java-esque, but the core arrangement of
the operations really needs to stick to what ICU's designers are
comfortable with.

Jeffrey


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Thu, 08 May 2014 22:16:30 -0700 Raw View

On Thu, 8 May 2014, at 20:18:23, 'Jeffrey Yasskin' via ISO C++ Standard -
Future Proposals wrote:
> > Trust me, one of the biggest complaints we've had in Qt 5 since we began
> > requiring ICU is that the icudata library is too big (18 MB). Imagine
> > forcing every C++ runtime to have that.
>
> ICU doesn't require the whole 18MB data file if you're not using the
> data. See
> http://userguide.icu-project.org/icudata#TOC-Customizing-ICU-s-Data-Library

The problem is correctly guessing what environments your users will use the
application in.

If you're writing an application, you can think of customising the ICU data,
especially if it's a commercial application and you know you're not selling in
190 countries. (Of course, your application might fail for the person who has
their OS configured to a different country than the country they live in.)

However, for a library or -- worse -- a C++ runtime, you can't make that
choice. You must assume all possible cases. Hence, all 18 MB is required.

> >> - have a look at the operations in ICU to get more topics to be
> >> addressed.
> >
> > Please don't.
> >
> > Instead, let's focus on what we want, not on what we could copy from ICU
> > (hint: let's not copy ICU).
>
> If we don't base C++ Unicode support on the most successful Unicode
> library, we shouldn't have C++ Unicode support at all. We just don't
> have the expertise or volunteer time in this committee to do a better
> job than the domain experts.

That's not what I'm saying.

I'm saying "let's determine what use-cases we need solved" and then we can
look at how ICU has solved them. I'm asking that we don't go opening ICU
headers and think "that's nifty, we should have it in the standard".

Personally, I think Unicode support in the C++ standard should limit itself to
encoding conversions between the UTF codecs, Latin 1 and the opaque "system
encoding", so basic I/O is possible. More than that, it should be *easy* to
convert between std::string (with both UTF-8 and system locale encodings),
std::u16string, std::u32string and std::wstring. Any other codecs must be
optional.
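
As an illustration of how small such a UTF-to-UTF conversion is, here is a sketch of UTF-32 to UTF-8. The function name `to_utf8` is hypothetical, not a proposed or existing standard interface; the bit layout is the standard UTF-8 encoding form.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes a UTF-32 string as UTF-8, rejecting surrogates
// and values beyond U+10FFFF.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {             // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                               // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

No data tables are involved, which is what separates this kind of conversion from collation or character properties.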

And I'd say that locale-specific collation and Unicode character properties
should not be in the standard, due to the size of the Unicode data and because
this data keeps changing.

> It would totally make sense to do a subset or try to find more C++-y
> interfaces where ICU is more Java-esque, but the core arrangement of
> the operations really needs to stick to what ICU's designers are
> comfortable with.

I agree.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Fri, 9 May 2014 07:19:49 +0100 Raw View

> On 9 May 2014, at 04:42, Thiago Macieira <thiago@macieira.org> wrote:
>
>> On Fri, 9 May 2014, at 01:49:24, Dietmar Kuehl wrote:
>> - character-aware string processing: locating, extracting, changing, etc.
>> individual characters
>
> Easy: force people to convert to UTF-32 / UCS-4 before doing any of the above.

I fail to see how `std::string` will use UTF-32 encoding. Moreover, the encoding will already be chosen (for each character type), and the encoding used for `char` is actually much more likely to be UTF-8 than UTF-32, given that `char` is probably 8 bits. Even then, changing characters which potentially use combining characters and a non-canonical representation makes that interesting (admittedly, I silently extended the meaning of "characters" into something which is aware of these combinations): while I think it is essential to turn the external encoding into the character type's internal encoding, I don't think it is reasonable to also normalise the strings in this process (it may be reasonable, though: as I said above, I haven't used an actual system using Unicode).

>> - determining the number of characters, splitting strings after a certain
>> number of characters,
>
> Number of characters or number of codepoints? For codepoints, it's easy:
> require UTF-32 / UCS-4 again. It's easy with UTF-8 and UTF-16 too, just not as
> easy.

I'm not stating that it is complicated. However, you can't impose a requirement on the encoding (see above). Also, I'm just pointing out that there are operations beyond encoding conversions and that these add value for users who don't implement that support every time they need to process Unicode characters ... and even when UTF-32/UCS-4 is used, dealing with characters in a combining character-aware fashion isn't entirely trivial. In any case, it is more complicated than `str.substr(s, n)` to get a substring with `n` characters starting after `s` characters: `substr()` would happily tear bytes for one code point apart or ignore that combining characters need to travel with the character they are combined with.
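
A sketch of the tearing problem, assuming `char` strings hold valid UTF-8. `utf8_prefix` is a hypothetical helper that cuts on code point boundaries; it still does nothing about combining sequences, which is the second half of the point above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the first n code points of a valid UTF-8
// string without splitting a multi-byte sequence.
std::string utf8_prefix(const std::string& s, std::size_t n) {
    std::size_t seen = 0;  // code points accepted so far
    std::size_t i = 0;     // byte index
    for (; i < s.size(); ++i) {
        const bool starts_code_point =
            (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == n) break;  // the next code point would be one too many
            ++seen;
        }
    }
    return s.substr(0, i);
}
```

With the UTF-8 bytes for "naïve", `substr(0, 3)` ends in a dangling lead byte, whereas `utf8_prefix(..., 3)` keeps both bytes of "ï".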

> If you really mean the width of the string, this requires loading part of
> UnicodeData.txt to find out whether a given codepoint is zero-, single- or
> double-width.

I'm not really doing string processing, so I don't know if that is needed or not. All I'm doing here is stating what sort of algorithms may constitute part of a Unicode library.

>> - comparing strings: aside
>> from Unicode strings not necessarily being normalised, ordering strings in
>> a form usable for humans is non-trivial: even trivial representation like
>> the European letter-based strings are ordered differently depending on the
>> language context, e.g., where `ll` goes in the Spanish or other European
>> languages. This is nothing compared to ordering strings with Chinese or
>> Japanese characters.
>
> Oh, no, not in the standard. This requires loading the entire CLDR to get it
> right. Please leave this to ICU.

I'm not proposing to have a Unicode library but I would be very much surprised if anybody interested in processing Unicode would consider Unicode supported if the Unicode "support" doesn't provide a human-friendly ordering of strings.
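
For context, the standard library already has a locale-dependent ordering hook: a `std::locale` object can be used directly as a comparator, and it compares through that locale's `std::collate` facet. The sketch below only shows the mechanism; `sort_with_locale` is a hypothetical wrapper, and which named locales exist and how well they order Unicode text is implementation-specific.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical wrapper: sorts words using the collation rules of `loc`.
// std::locale::operator() compares two strings via the collate facet.
void sort_with_locale(std::vector<std::string>& words, const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
}
```

With the classic "C" locale this is plain code-unit order, which is the opposite of human-friendly; a language-specific locale would have to be supplied by the platform.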

> Trust me, one of the biggest complaints we've had in Qt 5 since we began
> requiring ICU is that the icudata library is too big (18 MB). Imagine
> forcing every C++ runtime to have that.

We wouldn't force the inclusion of that data on every C++ user. It would be forced only on those users using Unicode support or, possibly, even just a subset of these users, e.g., when they use certain Unicode operations. I'm pretty sure that anything considered to be Unicode support will require some data files.

>> - have a look at the operations in ICU to get more topics to be addressed.
>
> Please don't.
>
> Instead, let's focus on what we want, not on what we could copy from ICU
> (hint: let's not copy ICU).

Sure. I'm not saying that we have to do everything ICU does. I'm saying that there are other operations in ICU which may be relevant for a minimal viable Unicode support. That said, I wouldn't be surprised if we'll find that we covered more than ICU once we formed the union of what is considered the minimal viable support for Unicode.


.

Author: Dietmar Kuehl <dietmar.kuehl@gmail.com>
Date: Fri, 9 May 2014 07:19:59 +0100 Raw View

--Apple-Mail-6868961F-566F-4D81-87AA-CCE0014FFFA0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable


> On 9 May 2014, at 03:40, Diggory Blake <diggsey@googlemail.com> wrote:
>> On Friday, 9 May 2014 01:49:24 UTC+1, Dietmar Kühl wrote:
>> OK, when I made the comment quoted below I was at work typing on a mobile device, hence, the message didn't contain all the parts which seem to be necessary to this discussion. I read all the other contributions and they are mostly going off into an area I consider the entirely wrong direction so I'll pretend they were not made and continue from this earlier point of the discussion. So let me put the arguments for the general design together. Note, however, that I'm not going to write a proposal or make a promise to review proposals made by others. However, the arguments below will guide my arguments in the committee (unless someone makes good arguments that they are wrong).
>>
>> Step 1: External vs. Internal Encoding
>>
>> When processing strings there is always an encoding involved. In its simplest form, it is a single-byte, fixed-width encoding like, e.g., ASCII. It doesn't matter whether the characters are internal to a program, i.e., they are stored in memory by the program, or they are external to a program, i.e., they are in a file, in a buffer just read into a program, etc.: there is an encoding. However, it is important to realise that there is only *one* internal encoding and strings shall be converted from whatever external encoding into the internal encoding upon reading and converted from the internal encoding to the external encoding upon writing! Dealing with multiple internal encodings [for the same character type] is neither necessary nor helpful. OK, it may be necessary if there are potential internal encodings which don't cover the same set of characters. Well, it was the case for single-byte fixed-width encodings: for example, the different choices of ISO-Latin-n covered different characters. However, we are talking about Unicode processing and despite all its failures Unicode covers the full range of [human] characters (yes, Klingon characters were removed from Unicode; as far as I can tell to make space for a comprehensive set of characters for turds).
>
> You're forgetting that C++ code may have to interoperate with other code which uses a different internal encoding. For example, if I want to call MessageBoxW on Windows, I need to pass in a string which is utf16 encoded. If I'm going to be calling such a method a lot I should be able to store that string internally in utf16 so I don't have to convert every time. Furthermore, the type system should prevent me from inadvertently assigning non-utf16-encoded strings to it and vice-versa. This can be done without modifying the basic_string class as per my previous suggestion. You're right that 'char', 'char16_t' and 'char32_t' might not use the utf encodings, so I would amend my suggestion such that "char_traits<char/char16_t/char32_t>::encoding" would be the implementation specific encoding used by string literals of the same type, rather than necessarily utf8/16/32.

I'm not "forgetting" interoperation with other components. However, I do claim that programming with strings becomes impossible if different components start choosing different encodings! If you are on a system where your vendor provided MessageBoxW, you had better hope that your friendly vendor has chosen to make the internal encoding used for C++ identical to that of MessageBoxW (for strings using the suitable character type).
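
For readers following the quoted suggestion, here is one way an `encoding` member on the traits class could make differently encoded strings distinct types without touching `basic_string`. All names here (`tagged_char_traits`, `utf8_encoding`, `latin1_encoding` and the aliases) are hypothetical; this is a sketch of the idea under discussion, not anyone's actual proposal text, and it is exactly the "different components choosing different encodings" situation the reply above warns about.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical encoding tags.
struct utf8_encoding {};
struct latin1_encoding {};

// Hypothetical traits: behaves like std::char_traits<char> but also records
// an encoding, so each encoding yields a distinct basic_string type.
template <class Encoding>
struct tagged_char_traits : std::char_traits<char> {
    using encoding = Encoding;
};

using utf8_string = std::basic_string<char, tagged_char_traits<utf8_encoding>>;
using latin1_string =
    std::basic_string<char, tagged_char_traits<latin1_encoding>>;

// Mixing encodings is rejected at compile time.
static_assert(!std::is_assignable<utf8_string&, const latin1_string&>::value,
              "latin1 must not be assignable to utf8");
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "latin1 must not convert to utf8");
```

The price is that such strings are also not interchangeable with plain `std::string`, so every boundary needs an explicit conversion.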

--Apple-Mail-6868961F-566F-4D81-87AA-CCE0014FFFA0--

.

Author: Jean-Marc Bourguet <jm.bourguet@gmail.com>
Date: Thu, 8 May 2014 23:25:12 -0700 (PDT) Raw View

------=_Part_979_18953459.1399616712698
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Thursday, 8 May 2014 17:13:08 UTC+2, Guy Davidson wrote:
>
> Oooh, I like that...  The square bracket operator remains ambiguous
> though: it stops meaning nth character and now only means nth byte in the
> sequence.
>

It has never meant the nth character but the nth encoding unit (for a
potentially stateful encoding). wchar_t was introduced in time for
C90 (I don't know if it was an invention of the standardization process or
not) to have a character type holding one code point per encoding unit.
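
A small sketch of the "encoding unit" point, using U+1F600 as an arbitrary example code point outside the Basic Multilingual Plane. The variable names are illustrative only. The same single code point occupies a different number of units in each string type, and `operator[]` indexes units.

```cpp
#include <cassert>
#include <string>

// One code point, U+1F600, in three encodings.
const std::string    face_utf8  = "\xF0\x9F\x98\x80";  // 4 code units (bytes)
const std::u16string face_utf16 = u"\U0001F600";       // 2 code units
const std::u32string face_utf32 = U"\U0001F600";       // 1 code unit
```

So `face_utf16[0]` is the high surrogate 0xD83D, not a character, while `face_utf32[0]` is the code point itself.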

Yours,

-- 
Jean-Marc

------=_Part_979_18953459.1399616712698--

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Fri, 09 May 2014 00:00:02 -0700 Raw View

On Fri, 9 May 2014, at 07:19:49, Dietmar Kuehl wrote:
> > On 9 May 2014, at 04:42, Thiago Macieira <thiago@macieira.org> wrote:
> >> On Fri, 9 May 2014, at 01:49:24, Dietmar Kuehl wrote:
> >> - character-aware string processing: locating, extracting, changing, etc.
> >> individual characters
> >
> > Easy: force people to convert to UTF-32 / UCS-4 before doing any of the
> > above.
> I fail to see how `std::string` will use UTF-32 encoding.

It won't. Force people to use std::u32string to do those things. Convert to
UTF-32, do your work, then convert back if necessary.

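A minimal sketch of the round trip described above, assuming the std::string holds well-formed UTF-8. The helper names (utf8_to_utf32, utf32_to_utf8, replace_nth) are made up for illustration and are not part of the standard library; a production decoder would also have to reject overlong, truncated and surrogate sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32. Assumes well-formed input (illustrative only).
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // The lead byte determines how many bytes the sequence occupies.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= in.size());  // precondition: no truncated sequence
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode UTF-32 back into UTF-8.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // precondition: inside the Unicode code space
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// "Do your work" in UTF-32: replace the code point at index n, then convert back.
std::string replace_nth(const std::string& utf8, std::size_t n, char32_t cp) {
    std::u32string s = utf8_to_utf32(utf8);
    if (n < s.size()) s[n] = cp;
    return utf32_to_utf8(s);
}
```

With this, replace_nth("na\xC3\xAFve", 2, U'i') indexes by code point rather than by byte, at the cost of two full conversions per call.
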
Better yet, just keep everything in one of the two UTF basic_string classes.

> >> - comparing strings: aside
> >> from Unicode strings not necessarily being normalised, ordering strings
> >> in
> >> a form usable for humans is non-trivial: even trivial representation like
> >> the European letter-based strings are ordered differently depending on
> >> the
> >> language context, e.g., where `ll` goes in the Spanish or other European
> >> languages. This is nothing compared to ordering strings with Chinese or
> >> Japanese characters.
> >
> > Oh, no, not in the standard. This requires loading the entire CLDR to get
> > it right. Please leave this to ICU.
>
> I'm not proposing to have a Unicode library but I would be very much
> surprised if anybody interested in processing Unicode would consider
> Unicode supported if the Unicode "support" doesn't provide a human friendly
> ordering of strings.

Correct. But this is best left to ICU since it's incredibly complex to do,
requires a lot of data and it changes all the time with Unicode updates.

> > Trust me, one of the biggest complaints we've had in Qt 5 since we began
> > requiring ICU is that the icudata library is too big (18 MB). Imagine
> > forcing every C++ runtime to have that.
>
> We wouldn't force the inclusion of that data into every C++ user. It would
> be forced only on those users using Unicode support or, possibly, even just
> a subset of these users, e.g., when they use certain Unicode operations.
> I'm pretty sure that anything considered to be Unicode support will require
> some data files.

Right. But my problem here is not whether the user wants it or not. If they
want it, they need a Unicode library.

My problem is putting this into the C++ Standard Library. The libraries that
come with the compilers aren't updated often and developers & companies update
them even less frequently. Unicode support -- especially the Unicode data,
like timezone data -- needs to be easily upgraded by the developer.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Fri, 9 May 2014 00:25:25 -0700 (PDT) Raw View

------=_Part_6_14780978.1399620326272
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Friday, 9 May 2014 07:19:59 UTC+1, Dietmar Kühl wrote:
>
>
> I'm not "forgetting" interoperation with other components. However, I do
> claim that programming with string becomes impossible if different
> components start choosing different encodings! If you are on a system where
> your vendor provided MessageBoxW you'll better hope that your friendly
> vendor has chosen to make the internal encoding used for C++ identical to
> that of MessageBoxW (for strings using the suitable character type).
>

That's really not helpful, "hoping" doesn't solve problems... You can't
just impose such arbitrary constraints on everyone who's going to use it.
Programming with strings in different encodings is not impossible, as long
as the type system handles it correctly, and in my experience it's actually
been much easier when I've been able to do that. I'm not saying the
standard library has to implement any encodings other than the existing
ones, but it should at least be extensible.

It's already the case that each char type is associated with an encoding:
'char' -> implementation defined encoding for "literals"
'char16_t' -> implementation defined encoding for u"literals"
'char32_t' -> implementation defined encoding for U"literals"

What is the downside of adding the following to the standard library?
1) Classes for each implementation defined encoding listed above, which
provide a Unicode code-point abstraction (these would just provide methods
to encode/decode said encoding, I'm not talking about separate string
classes)
2) Allow the programmer to extend the list, for example by adding
'utf_ebcdic_char' -> UTF-EBCDIC encoding
3) (Optional) Add built-in implementations for UTF encodings. If the
implementation defined encodings are already UTF encodings, then the
char_type for, say, utf16 would be typedef'd to char16_t.

Different encodings therefore use different specialisations of
"basic_string", and no new string types are added (it's already the case
that the programmer can specialise basic_string with their own element type
and define custom char_traits for it, the only difference is that the
association with an encoding is made explicit, where previously it already
existed but was implicit).

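To make points 1) and 2) concrete, here is a rough sketch of what such an association could look like. Everything in it is hypothetical: encoding_for and count_code_points are names invented for this illustration, and the char specialisation simply assumes the narrow encoding is well-formed UTF-8, which the standard does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical primary template mapping a character type to its encoding.
// Programmers extend the list by specialising it for their own char types.
template <class CharT> struct encoding_for;

// Assumption for this sketch: char strings hold well-formed UTF-8.
template <> struct encoding_for<char> {
    // Decode the code point starting at s[i] and advance i past it.
    static char32_t decode(const std::string& s, std::size_t& i) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= s.size());  // precondition: no truncated sequence
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        i += len;
        return cp;
    }
};

// UTF-32: every element already is a code point.
template <> struct encoding_for<char32_t> {
    static char32_t decode(const std::u32string& s, std::size_t& i) { return s[i++]; }
};

// Works for any basic_string whose character type has an encoding_for.
template <class CharT>
std::size_t count_code_points(const std::basic_string<CharT>& s) {
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        encoding_for<CharT>::decode(s, i);
        ++n;
    }
    return n;
}
```

No new string type appears here: std::string and std::u32string are used as they are, and only the association between character type and encoding is spelled out.
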
------=_Part_6_14780978.1399620326272--

.

Author: Guy Davidson <guy@hatcat.com>
Date: Fri, 9 May 2014 03:21:51 -0700 (PDT) Raw View

------=_Part_78_23250278.1399630911611
Content-Type: text/plain; charset=UTF-8

I can't see how to add access to characters via indexing and iteration
without changing basic_string or char_traits.  This is why I have an
alternative string class.  The use-case I have is that I have A LOT of
indexing to do, and less than 5% of my strings are variable-width-encoded:
most are fixed-width-encoded, which means finding the 100th character is
very cheap.  However, at run time I have to decide which indexing to use,
even though I know at compile time which is being chosen: UI input is
variable width, database keys are fixed width, for example.  This is
expensive, defeating branch-prediction, so I created a new string type
which also contained the encoding scheme.  Now my function templates can
choose the performant iteration mechanism at compile time.

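For illustration only, the compile-time selection described above could be sketched with tag dispatch as follows. The encoding tags and function names are invented here and are not the encoded_string interface; the variable-width branch assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical encoding tags carrying a compile-time "fixed width" flag.
struct ascii_encoding { using fixed_width = std::true_type; };
struct utf8_encoding  { using fixed_width = std::false_type; };

// Fixed-width case: the nth character starts at byte n, so the lookup is O(1).
inline std::size_t offset_impl(const std::string& s, std::size_t n, std::true_type) {
    assert(n <= s.size());  // precondition: index within the string
    return n;
}

// Variable-width case: walk the bytes and count only lead bytes,
// skipping UTF-8 continuation bytes of the form 10xxxxxx.
inline std::size_t offset_impl(const std::string& s, std::size_t n, std::false_type) {
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        const bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead && n-- == 0) break;
    }
    return i;  // s.size() if the string has fewer than n + 1 characters
}

// The overload is selected at compile time from the encoding tag,
// so no run-time branch on the encoding is needed.
template <class Encoding>
std::size_t offset_of(const std::string& s, std::size_t n) {
    return offset_impl(s, n, typename Encoding::fixed_width{});
}
```

A call such as offset_of<utf8_encoding>(text, 3) pays for the linear walk, while offset_of<ascii_encoding>(key, 3) compiles down to returning the index.
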
I originally offered my encoded_string class on this thread which is
strikingly similar to others I've since come across: this problem has been
solved many times, it seems, making it a candidate for standard
consideration.  I can see that offering a new string class is a bad idea
and I'm disinclined to pursue that.  In fact, I think that offering this
under the Unicode banner is a bad idea also, since Unicode is a much bigger
problem than this: however, making std::basic_string encoding-aware does
open up opportunities for Unicode functionality.

Compile-time knowledge of encoding is valuable for performance as evidenced
by my use-case.  I can see the wisdom in keeping that out of
std::basic_string, but if not in std::char_traits, where should I keep that
information?

On Thursday, 8 May 2014 16:16:39 UTC+1, Guy Davidson wrote:
>
> Agreed.
>
> On Thursday, 8 May 2014 16:13:30 UTC+1, Ville Voutilainen wrote:
>>
>> On 8 May 2014 18:06, Guy Davidson <g...@hatcat.com> wrote:
>> > The fundamental problem here is that the basic_string class currently
>> only
>> > accommodates fixed width encoding.  UTF-8 and UTF-16 are variable width
>> > encodings (UTF-32 is also fixed width).  std::u16string is only fit for
>>
>>
>> basic_string accommodates variable width encodings fine. What it doesn't
>> provide is access to the actual characters rather than raw bytes, and
>> adding access to characters should not require changing basic_string or
>> char_traits.
>>
>

------=_Part_78_23250278.1399630911611--

.

Author: pecholt@gmail.com
Date: Thu, 15 May 2014 02:03:38 -0700 (PDT) Raw View

------=_Part_132_29922195.1400144618755
Content-Type: text/plain; charset=UTF-8

I guess if neither basic_string nor char_traits can be modified your
previous example would look like this:

std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
               lback_inserter(result), to_upper);

This makes the interfaces slightly more complicated. I would personally
consider it too verbose already because the internal encoding is most likely
the same in all parts of a program/DLL but it has to be repeated on every line.

I am not sure if it would be feasible to provide a default encoding value so
one doesn't need to repeat it everywhere (e.g. on Windows the default value
would be utf8 for std::string and utf16 for std::wstring. On Linux
std::wstring would default to utf32).

------=_Part_132_29922195.1400144618755--

.

Author: pecholt@gmail.com
Date: Thu, 15 May 2014 02:10:47 -0700 (PDT) Raw View

------=_Part_6528_10951930.1400145047200
Content-Type: text/plain; charset=UTF-8

I guess if neither basic_string nor char_traits can be modified your
previous example would look like this:

std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
               lback_inserter<utf8>(result), to_upper);

This makes the interfaces slightly more complicated. I would personally
consider it too verbose already because the internal encoding is most likely
the same in all parts of a program/DLL but it has to be repeated on every
line.

I am not sure if it would be feasible to provide a default encoding value for
all algorithms so one doesn't need to repeat it everywhere (e.g. on Windows
the default value would be utf8 for std::string and utf16 for std::wstring. On
Linux std::wstring would default to utf32).

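One way such a default could be expressed, purely as a sketch: a trait keyed on the character type and its width. The names default_encoding and default_encoding_t and the tags utf8, utf16 and utf32 are hypothetical, and the mapping encodes the platform assumptions from the paragraph above rather than anything the standard promises.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical encoding tags.
struct utf8 {};
struct utf16 {};
struct utf32 {};

// Hypothetical trait: the encoding assumed when the caller names none.
// The second parameter lets wchar_t pick a default from its width.
template <class CharT, std::size_t Width = sizeof(CharT)>
struct default_encoding;

template <> struct default_encoding<char, 1>    { using type = utf8; };
template <> struct default_encoding<wchar_t, 2> { using type = utf16; };  // e.g. Windows
template <> struct default_encoding<wchar_t, 4> { using type = utf32; };  // e.g. Linux

template <class CharT>
using default_encoding_t = typename default_encoding<CharT>::type;
```

An algorithm template could then take the encoding as a template parameter defaulted to default_encoding_t<CharT>, so that it only has to be written out when it differs from the platform default.
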
------=_Part_6528_10951930.1400145047200--

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Thu, 15 May 2014 09:20:30 -0700 Raw View

On Thu, May 15, 2014 at 2:10 AM,  <pecholt@gmail.com> wrote:
> I guess if neither basic_string nor char_traits can be modified your
> previous example would look like this:
>
> std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
> lback_inserter<utf8>(result), to_upper);

This is why we shouldn't be trying to design Unicode support in the
C++ group. Amateurs tend to think that Unicode-aware algorithms can
run a codepoint at a time, while they almost always have to run a
whole string at a time. For example,
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt lists a
bunch of case conversions where the context of the character matters.

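A small sketch of one mapping listed in SpecialCasing.txt: U+00DF (ß) upper-cases to the two code points "SS", which a function with a one-in, one-out signature cannot express. The function names are made up for this illustration, only ASCII and ß are handled, and the context-sensitive entries (such as Greek final sigma) are left out.

```cpp
#include <cassert>
#include <string>

// Per-code-point upper-casing: the signature forces one output per input.
// Only ASCII letters are mapped; everything else passes through unchanged.
char32_t to_upper_simple(char32_t cp) {
    return (cp >= U'a' && cp <= U'z') ? cp - (U'a' - U'A') : cp;
}

// Whole-string upper-casing may change the length of the string.
std::u32string to_upper_full(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // precondition: inside the Unicode code space
        if (cp == U'\u00DF')
            out += U"SS";        // one code point becomes two
        else
            out.push_back(to_upper_simple(cp));
    }
    return out;
}
```

Feeding to_upper_simple to std::transform leaves the ß in place, while to_upper_full turns the six code points of "straße" into the seven of "STRASSE".
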
If you want to become an expert over the next couple years and then
bring that knowledge back to C++, that'd be great, but the place to do
that is in the ICU development community, not here where we don't know
enough to correct your mistakes.


.

Author: Jean-Marc Bourguet <jm.bourguet@gmail.com>
Date: Thu, 15 May 2014 09:49:33 -0700 (PDT) Raw View

------=_Part_7836_22650711.1400172573627
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Thursday, 15 May 2014 02:10:47 UTC-7, pec...@gmail.com wrote:
>
>
>
> I am not sure if it would be feasible to provide default encoding value
> for all algorithms so one doesn't need to repeat it everywhere (e.g. on
> windows default value would be utf8 for std::string and utf16 for
> std::wstring. On linux std::wstring would default to utf32)
>

In a conforming implementation, UTF-16 can't be the encoding for wstring,
which by design of wchar_t contains code points (UCS-2 can).

Yours,

--
Jean-Marc

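To illustrate the distinction, a short sketch (to_utf16 is a name chosen for this example, not a library function): in UTF-16, code points above U+FFFF occupy two 16-bit code units, so a 16-bit element type cannot hold one code point per element for the whole Unicode range.

```cpp
#include <cassert>
#include <string>

// Encode a single code point as UTF-16 code units.
std::u16string to_utf16(char32_t cp) {
    // Precondition: cp is a Unicode scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000)
        return { static_cast<char16_t>(cp) };              // BMP: one code unit
    cp -= 0x10000;
    return { static_cast<char16_t>(0xD800 + (cp >> 10)),   // high surrogate
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) };  // low surrogate
}
```

U+20AC fits in one unit, whereas U+1F600 comes out as the pair 0xD83D 0xDE00.
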
------=_Part_7836_22650711.1400172573627--

.

Author: Farid Mehrabi <farid.mehrabi@gmail.com>
Date: Thu, 15 May 2014 21:42:37 +0430 Raw View

--f46d044402f494b48804f9736831
Content-Type: text/plain; charset=UTF-8

2014-05-09 14:51 GMT+04:30 Guy Davidson <guy@hatcat.com>:

> Compile-time knowledge of encoding is valuable for performance as
> evidenced by my use-case.  I can see the wisdom in keeping that out of
> std::basic_string, but if not in std::char_traits, where should I keep that
> information?
>

With the current implementation of std, the allocator is another optional
traits class fed to basic_string (in harmony with other STL containers); as
with many other STL containers, iterator, value_type and the like are defined
in the allocator. If a good combination of char_traits and allocator, capable
of efficiently handling mixed-width characters, can't be established, then a
new family of string classes needs to be designed.

regards,
FM.

--
how am I supposed to end the twisted road of  your hair in the dark night??
unless the candle of your face does turn a lamp up on my way!!!

--f46d044402f494b48804f9736831--

.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 15 May 2014 11:23:13 -0700 (PDT) Raw View

------=_Part_5693_9795535.1400178193657
Content-Type: text/plain; charset=UTF-8

On Thursday, 15 May 2014 17:20:30 UTC+1, Jeffrey Yasskin wrote:
>
> On Thu, May 15, 2014 at 2:10 AM, <pec...@gmail.com> wrote:
> > I guess if neither basic_string nor char_traits can be modified your
> > previous example would look like this:
> >
> > std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
> > lback_inserter<utf8>(result), to_upper);
>
> This is why we shouldn't be trying to design Unicode support in the
> C++ group. Amateurs tend to think that unicode-aware algorithms can
> run a codepoint at a time, while they almost always have to run a
> whole string at a time. For example,
> ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt lists a
> bunch of case conversions where the context of the character matters.
>

That's a bad example, but the point is that whatever algorithms are
provided should be implemented in terms of code-points: yes, many
algorithms can't work on a single code-point at a time, but they should
still use code-points as their atomic unit. Otherwise you end up in the
same situation as the C runtime library, with 20 different versions of
every function, one for each specific case, instead of a single generic
one. The algorithm for case conversion can then be written once and work
properly regardless of the encoding used. Ultimately every string encoding
is just a description of how to turn bytes into code-points and back again,
so code-points give a common ground where all encodings are equal.
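
A minimal sketch of the "written once, in terms of code-points" idea, under stated assumptions: the encoding tags `utf16` and `utf32`, their `decode` members and `count_code_point` are hypothetical names invented for this illustration, not part of any proposal in this thread, and the decoders assume well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding tags. Each exposes decode(), which reads one code
// point starting at position i and advances i past it. Input is assumed to
// be well-formed; a real decoder would also have to reject malformed data.
struct utf16 {
    using string_type = std::u16string;
    static char32_t decode(const string_type& s, std::size_t& i) {
        char32_t unit = s[i++];
        if (unit >= 0xD800 && unit <= 0xDBFF && i < s.size()) {
            // High surrogate: combine with the low surrogate that follows.
            char32_t low = s[i++];
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
        return unit;
    }
};

struct utf32 {
    using string_type = std::u32string;
    static char32_t decode(const string_type& s, std::size_t& i) {
        return s[i++];
    }
};

// The algorithm itself is written once, purely in terms of code points.
template <class Encoding>
std::size_t count_code_point(const typename Encoding::string_type& s,
                             char32_t wanted) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();) {
        if (Encoding::decode(s, i) == wanted) ++n;
    }
    return n;
}
```

Calling `count_code_point<utf16>(text, cp)` and `count_code_point<utf32>(text, cp)` runs the same loop over both encodings; only the decoding step differs.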

------=_Part_5693_9795535.1400178193657--

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Thu, 15 May 2014 11:27:17 -0700 Raw View

On Thu, May 15, 2014 at 11:23 AM, Diggory Blake <diggsey@googlemail.com> wrote:
> On Thursday, 15 May 2014 17:20:30 UTC+1, Jeffrey Yasskin wrote:
>>
>> On Thu, May 15, 2014 at 2:10 AM,  <pec...@gmail.com> wrote:
>> > I guess if neither basic_string nor char_traits can be modified your
>> > previous example would look like this:
>> >
>> > std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
>> > lback_inserter<utf8>(result), to_upper);
>>
>> This is why we shouldn't be trying to design Unicode support in the
>> C++ group. Amateurs tend to think that unicode-aware algorithms can
>> run a codepoint at a time, while they almost always have to run a
>> whole string at a time. For example,
>> ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt lists a
>> bunch of case conversions where the context of the character matters.
>
>
> That's a bad example, but the point is that whatever algorithms are provided
> should be implemented in terms of code-points: yes, many algorithms can't
> work on a single code-point at a time, but they should still use code-points
> as their atomic unit. Otherwise you end up in the same situation as C
> runtime library with 20 different versions of every function for each
> specific case instead of one generic one. The algorithm for case conversion
> can then be written once and work properly regardless of the encoding used.
> Ultimately every string encoding is just a description of how to turn bytes
> into code-points and back again, so code-points give a common ground where
> all encodings are equal.

That also sounds good, but it turns out to be wrong again. The ICU
folks (Dick Sites in particular) have gotten significant speedups by
writing their algorithms using state machines directly on top of
encoded utf-8 and utf-16 data. You convert your data to either utf-8
or utf-16 when it comes into your system, and then you run algorithms
on the single encoding you use. It's a fool's errand to keep lots of
encodings inside your system.
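
To illustrate what "a state machine directly on top of encoded utf-8" can mean, here is a small C++17 sketch. It is not ICU's code and makes no claim about how ICU is implemented; the function name `count_utf8_code_points` is made up for this example. The only state is the number of continuation bytes still expected, and no code point is ever materialised.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Counts code points by walking the raw UTF-8 bytes. Returns nullopt if the
// byte structure is malformed. Simplification: overlong forms and encoded
// surrogates are not rejected here, which a full validator would have to do.
std::optional<std::size_t> count_utf8_code_points(std::string_view bytes) {
    int pending = 0;        // continuation bytes still expected
    std::size_t count = 0;  // complete sequences seen so far
    for (unsigned char b : bytes) {
        if (pending > 0) {
            if ((b & 0xC0) != 0x80) return std::nullopt;  // need 10xxxxxx
            if (--pending == 0) ++count;
        } else if (b < 0x80) {
            ++count;                      // single-byte (ASCII) sequence
        } else if ((b & 0xE0) == 0xC0) {
            pending = 1;                  // lead byte of a 2-byte sequence
        } else if ((b & 0xF0) == 0xE0) {
            pending = 2;                  // lead byte of a 3-byte sequence
        } else if ((b & 0xF8) == 0xF0) {
            pending = 3;                  // lead byte of a 4-byte sequence
        } else {
            return std::nullopt;          // stray continuation or invalid lead
        }
    }
    if (pending != 0) return std::nullopt;  // input ended mid-sequence
    return count;
}
```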


.

Author: Diggory Blake <diggsey@googlemail.com>
Date: Thu, 15 May 2014 12:02:40 -0700 (PDT) Raw View

------=_Part_9_29971111.1400180560357
Content-Type: text/plain; charset=UTF-8

That may be so, but it's still better to specify the more general version
in the standard - it's trivial for an implementation to specialise it to
make it extra fast on UTF-8 or UTF-16, but it's impossible to go the other
way. It's also less effort: regardless of how many specialisations of an
algorithm there are, the standard just has to describe a single generic
version which behaves exactly as the Unicode standard specifies.

Even if you use only one encoding throughout your program, what if it
happens to not be the one which all the unicode algorithms were written for?
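
A rough sketch of the "generic specification, specialised implementation" arrangement described above. All names (`utf8`, `utf16`, `skip`, `generic_length`, `code_point_length`) are hypothetical and exist only for this illustration; well-formed input is assumed throughout.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

struct utf8 {
    using string_type = std::string;
    // Advance i past one well-formed UTF-8 sequence.
    static void skip(const string_type& s, std::size_t& i) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        i += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
};

struct utf16 {
    using string_type = std::u16string;
    // Advance i past one code unit, or two for a surrogate pair.
    static void skip(const string_type& s, std::size_t& i) {
        char16_t unit = s[i];
        i += (unit >= 0xD800 && unit <= 0xDBFF) ? 2 : 1;
    }
};

// Generic version: the behaviour a standard would only need to specify once.
template <class Encoding>
std::size_t generic_length(const typename Encoding::string_type& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) Encoding::skip(s, i);
    return n;
}

// Entry point: falls back to the generic version unless specialised.
template <class Encoding>
std::size_t code_point_length(const typename Encoding::string_type& s) {
    return generic_length<Encoding>(s);
}

// A fast path an implementation might add for UTF-8: every byte that is not
// a continuation byte (10xxxxxx) starts exactly one code point.
template <>
std::size_t code_point_length<utf8>(const std::string& s) {
    return static_cast<std::size_t>(std::count_if(
        s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; }));
}
```

For well-formed input the specialisation and `generic_length<utf8>` return the same answer, which is the property the generic description would pin down.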

On Thursday, 15 May 2014 19:27:17 UTC+1, Jeffrey Yasskin wrote:
>
> On Thu, May 15, 2014 at 11:23 AM, Diggory Blake <dig...@googlemail.com> wrote:
> > On Thursday, 15 May 2014 17:20:30 UTC+1, Jeffrey Yasskin wrote:
> >>
> >> On Thu, May 15, 2014 at 2:10 AM,  <pec...@gmail.com> wrote:
> >> > I guess if neither basic_string nor char_traits can be modified your
> >> > previous example would look like this:
> >> >
> >> > std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
> >> > lback_inserter<utf8>(result), to_upper);
> >>
> >> This is why we shouldn't be trying to design Unicode support in the
> >> C++ group. Amateurs tend to think that unicode-aware algorithms can
> >> run a codepoint at a time, while they almost always have to run a
> >> whole string at a time. For example,
> >> ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt lists a
> >> bunch of case conversions where the context of the character matters.
> >
> >
> > That's a bad example, but the point is that whatever algorithms are provided
> > should be implemented in terms of code-points: yes, many algorithms can't
> > work on a single code-point at a time, but they should still use code-points
> > as their atomic unit. Otherwise you end up in the same situation as C
> > runtime library with 20 different versions of every function for each
> > specific case instead of one generic one. The algorithm for case conversion
> > can then be written once and work properly regardless of the encoding used.
> > Ultimately every string encoding is just a description of how to turn bytes
> > into code-points and back again, so code-points give a common ground where
> > all encodings are equal.
>
> That also sounds good, but it turns out to be wrong again. The ICU
> folks (Dick Sites in particular) have gotten significant speedups by
> writing their algorithms using state machines directly on top of
> encoded utf-8 and utf-16 data. You convert your data to either utf-8
> or utf-16 when it comes into your system, and then you run algorithms
> on the single encoding you use. It's a fool's errand to keep lots of
> encodings inside your system.
>

------=_Part_9_29971111.1400180560357--

.

Author: Guy Davidson <guy@hatcat.com>
Date: Fri, 16 May 2014 03:14:58 -0700 (PDT) Raw View

------=_Part_277_21431550.1400235298744
Content-Type: text/plain; charset=UTF-8

I'm inclined to agree.  I started this thread a week ago, I've found it
very stimulating and it's opened my eyes to the complexities of Unicode and
the (well-taken) caution of the standardisation body.  I think I shall
pursue that expertise as it's an interest of mine, and see what happens.  I
shan't submit anything this time around.

On Thursday, 15 May 2014 17:20:30 UTC+1, Jeffrey Yasskin wrote:
>
> On Thu, May 15, 2014 at 2:10 AM, <pec...@gmail.com> wrote:
> > I guess if neither basic_string nor char_traits can be modified your
> > previous example would look like this:
> >
> > std::transform(lbegin_from<utf8>(input), lend_from<utf8>(input),
> > lback_inserter<utf8>(result), to_upper);
>
> This is why we shouldn't be trying to design Unicode support in the
> C++ group. Amateurs tend to think that unicode-aware algorithms can
> run a codepoint at a time, while they almost always have to run a
> whole string at a time. For example,
> ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt lists a
> bunch of case conversions where the context of the character matters.
>
> If you want to become an expert over the next couple years and then
> bring that knowledge back to C++, that'd be great, but the place to do
> that is in the ICU development community, not here where we don't know
> enough to correct your mistakes.
>

------=_Part_277_21431550.1400235298744--

.

Author: phdofthehouse@gmail.com
Date: Fri, 14 Nov 2014 14:23:11 -0800 (PST) Raw View

------=_Part_1178_1272319592.1416003791299
Content-Type: text/plain; charset=UTF-8

As a slight question, has anyone touched Encoding/Decoding issues on text
recently? And perhaps less importantly, has anyone given any thought to
unicode in C++? I want to share ideas with a few individuals on what I've
been hacking away at...

------=_Part_1178_1272319592.1416003791299--

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Fri, 14 Nov 2014 15:10:51 -0800 Raw View

On Friday 14 November 2014 14:23:11 phdofthehouse@gmail.com wrote:
> As a slight question, has anyone touched Encoding/Decoding issues on text
> recently? And perhaps less importantly, has anyone given any thought to
> unicode in C++? I want to share ideas with a few individuals on what I've
> been hacking away at...

Just share your ideas.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: glen stark <g.a.stark@gmail.com>
Date: Fri, 14 Aug 2015 02:41:41 -0700 (PDT) Raw View

------=_Part_330_1802199779.1439545301435
Content-Type: multipart/alternative;
 boundary="----=_Part_331_1083330414.1439545301435"

------=_Part_331_1083330414.1439545301435
Content-Type: text/plain; charset=UTF-8



On Thursday, May 8, 2014 at 5:09:10 PM UTC+2, Diggory Blake wrote:
>
> The thing is, for most of the time the encoding of the string is not
> relevant - copying, appending, exact comparison all work on the raw data
> and don't care about the encoding. The only time the encoding matters is
> during I/O or when performing lexicographic operations, and generally
> within the same application, the same encoding will be used (almost)
> everywhere.
>
While it's probably true that within most applications the same encoding
will be used (almost) everywhere, that wouldn't be true for libraries.
For library support, and for Unicode-aware components of legacy
applications, I think it would be helpful to know, when you're handed a
std::basic_string<char>, whether that's ASCII, latin1, or utf-8.

Would it be reasonable to add encoding information to char_traits?  This
would allow a library developer who has to support both legacy code and
utf-8 compliant code to use type information to ensure correct behavior,
and to prevent accidental operations which are meaningless across encodings.
If it is reasonable, it would take some thought: ideally the standard would
explicitly provide encodings that are likely to remain in use for the next
decade or two (latin1 comes to mind), and it should be possible to create
custom encodings.
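
One way the idea could be sketched, purely as an assumption about what "encoding information in char_traits" might look like: `tagged_char_traits`, `tagged_string`, the tag types and `to_utf8` below are hypothetical names, not anything in the standard or in a proposal mentioned in this thread.

```cpp
#include <string>
#include <type_traits>

// Hypothetical encoding tags; user code could add its own for custom encodings.
struct ascii {};
struct latin1 {};
struct utf8 {};

// Behaves exactly like std::char_traits<char>, but records the encoding in
// the type, so strings with different encodings become distinct types.
template <class Encoding>
struct tagged_char_traits : std::char_traits<char> {
    using encoding = Encoding;
};

template <class Encoding>
using tagged_string = std::basic_string<char, tagged_char_traits<Encoding>>;

// Crossing encodings requires an explicit conversion. Latin-1 maps byte
// value N to U+00NN, so each character becomes one or two UTF-8 bytes.
tagged_string<utf8> to_utf8(const tagged_string<latin1>& in) {
    tagged_string<utf8> out;
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(ch);
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this arrangement, assigning a `tagged_string<latin1>` to a `tagged_string<utf8>` or appending one to the other fails to compile, while copying, appending and exact comparison within one encoding work as they do for std::string. The cost is that such strings are no longer std::string, which is the interoperability concern raised earlier in the thread.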

------=_Part_331_1083330414.1439545301435--
------=_Part_330_1802199779.1439545301435--

.

Author: phdofthehouse@gmail.com
Date: Fri, 14 Aug 2015 06:34:06 -0700 (PDT) Raw View

------=_Part_607_3659362.1439559246605
Content-Type: multipart/alternative;
 boundary="----=_Part_608_1263359736.1439559246606"

------=_Part_608_1263359736.1439559246606
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

It took me a while to get back to this. What I have here is essentially a
watered-down version of ogonek <https://github.com/rmartinho/ogonek/>; it
works very well for my tastes. The reason I had to water it down was
that Microsoft's compiler couldn't handle ogonek and routinely ICE'd or
just outright exploded.

The code lives here (github) <https://github.com/ThePhD/Furrovine.Heart>,
and the relevant implementation starts with the implementation of a decoding
iterator
<https://github.com/ThePhD/Furrovine.Heart/blob/master/include/Furrovine%2B%2B/Text/decode_iterator.hpp>.
There is an encoding iterator as well, but I haven't finished improving it
so that you can do things like change codepoints in place if the backing
iterator returns a `codepoint&` versus a `codepoint`. The implementation is
entirely header-only (we don't use UCD information for just
encoding/decoding) and covers ASCII, utf8, utf16, and utf32 at this point.
I haven't had to deal with older databases or latin-1 serving entities
yet, so I haven't added that encoding/decoding interface. The fun starts
with the overarching container-adaptor class, named text_view_base
<https://github.com/ThePhD/Furrovine.Heart/blob/master/include/Furrovine%2B%2B/text_view_base.hpp>,
where I essentially reimplement large chunks of std::string based on
codepoints rather than codeunits, and using iterators rather than pointers
and sizes.
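
For readers who have not seen one, a decoding iterator of this general kind can be sketched as follows. This is a simplified illustration, not the Furrovine or ogonek implementation: `utf8_decode_iterator` and `decode_utf8` are hypothetical names, only UTF-8 is handled, and well-formed input is assumed.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Input iterator that walks UTF-8 code units and yields code points.
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // by value: no stored code point to refer to

    explicit utf8_decode_iterator(const char* p) : p_(p) {}

    char32_t operator*() const {
        unsigned char lead = static_cast<unsigned char>(*p_);
        int len = length(lead);
        // Keep the payload bits of the lead byte, then append 6 bits per
        // continuation byte.
        char32_t cp = len == 1 ? lead : lead & (0x7F >> len);
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }
    utf8_decode_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator old = *this;
        ++*this;
        return old;
    }
    friend bool operator==(utf8_decode_iterator a, utf8_decode_iterator b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(utf8_decode_iterator a, utf8_decode_iterator b) {
        return a.p_ != b.p_;
    }

private:
    // Sequence length implied by the lead byte.
    static int length(unsigned char lead) {
        return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
    const char* p_;
};

// Convenience wrapper: decode a whole UTF-8 string into code points.
std::u32string decode_utf8(const std::string& s) {
    return std::u32string(utf8_decode_iterator(s.data()),
                          utf8_decode_iterator(s.data() + s.size()));
}
```

Because `operator*` returns a `char32_t` by value, `*it = some_codepoint` cannot write through to the underlying bytes.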

I have not implemented title case comparison, normalization, and a few of
those things yet because I believe that requires the UCD, and this is meant
not to have any separately compiled bits in it: I'm going to work on
normalization one day™
<https://github.com/ThePhD/Furrovine.Heart/blob/master/include/Furrovine%2B%2B/Text/normalization.hpp>.

The heavy lifting comes in getting something that'll always spit out
codepoints. The problem: we violate a pretty standard iterator expectation,
namely that returned values can directly manipulate the underlying stream
with `*it = some_codepoint`. Doing this is impossible with the current
implementation, and we would need to have iterators which have knowledge of
their containers and possibly perform changes to the container itself that
*change the underlying sequence if the newly encoded codepoint has more code
units than what it is replacing*. An iterator that invalidates all other
iterators (including itself) with this kind of behavior is... not a good
iterator. I cannot think of a way to do this. The most I have considered
doing is a hack that allows the change if and only if some_codepoint has the
same code-unit count as the codepoint it replaces. Everything else
would have to throw to be correct, or assert in debug mode and then just
tell the user they're shit-out-of-luck if they trigger that kind of bad
behavior in release.
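
The same-width hack can be sketched roughly like this for UTF-8 stored in a std::string. The names `encode_utf8` and `replace_same_width` are invented for this illustration and are not taken from the linked repository; the sketch assumes well-formed text and a valid replacement code point.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode one code point as UTF-8 (assumes cp is a valid scalar value).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Overwrite the code point that starts at byte offset pos, but only when
// the replacement occupies the same number of code units. Otherwise throw
// rather than resize the string and invalidate other iterators into it.
void replace_same_width(std::string& text, std::size_t pos, char32_t cp) {
    unsigned char lead = static_cast<unsigned char>(text.at(pos));
    std::size_t old_len =
        lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    std::string bytes = encode_utf8(cp);
    if (bytes.size() != old_len)
        throw std::length_error("replacement has a different code-unit count");
    text.replace(pos, old_len, bytes);
}
```

Replacing U+00E9 with U+00E8 succeeds because both occupy two code units; replacing it with plain `e` throws and leaves the string untouched.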

The implementation is not perfect, but I've been using it with great
success in a number of projects. When I don't need the container's view on
the text, I can call `textview.storage()` to get the original thing out of
it and inspect that directly. Care has been taken that text_view_base truly
only requires something that can have `begin` and `end` called on it. The
more beefy class that offers a more `std::basic_string`-like experience
with a bunch of useful and additional member functions (and some still
missing because I haven't yet used them) is text_base
<https://github.com/ThePhD/Furrovine.Heart/blob/master/include/Furrovine%2B%2B/text_base.hpp>.

This is... all I know about the encoding/decoding stuff. It looks fairly
non-trivial, and there are a number of decisions to be made, especially
whether an iterator that just returns a `code_point` by value is useful
and acceptable to the facilities that want to work with it. In most cases
you do not want to change text in place without explicitly calling member
functions meant for that, so I think this is an alright trade...

Thoughts?

On Thursday, May 8, 2014 at 8:59:09 AM UTC-4, Guy Davidson wrote:
>
> I am very keen to see Unicode support in C++17.  At the ACCU conference
> <http://accu.org/index.php/conferences/accu_conference_2014> I was
> encouraged by Nico Josuttis and Kevlin Henney to put a proposal together.
> I knocked up a naive interface and an implementation and then I checked
> the proposal mechanism.  I discovered Beman Dawes paper on string
> interoperability
> <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html> and
> Mark Boyall's submission
> <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3572.html>;
> gratifyingly, they were very similar to mine (I even used the name
> encoded_string and templated it over an encoder and an allocator).  Mark
> has advised me that he is no longer pursuing the matter, and Beman's paper
> doesn't consider a string class per se.  I have an interface and a partial
> implementation that is considerably lighter than ICU 53.1
> <http://icu-project.org/apiref/icu4c/index.html>: what should I do next?
>
> Cheers,
> Guy
>


------=_Part_608_1263359736.1439559246606--
------=_Part_607_3659362.1439559246605--

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Fri, 14 Aug 2015 09:42:14 -0700 Raw View

On Friday 14 August 2015 02:41:41 glen stark wrote:
> Would it be reasonable to add encoding information to char_traits?  This
> would allow a library developer that has to support both legacy code and
> utf-8 compliant code to use type information to ensure correct behavior,
> and prevent accidental operations which are meaningless across encodings.
>  If it is reasonable, it would take some thought:  ideally the standard
> explicitly provide encodings that are likely to remain in use for the next
> decade or two (latin1  comes to mind), and it should be possible to create
> custom encodings.

Adding it to char_traits implies making it a static decision -- that is,
one made at compile time, and when the library is compiled, not the
application.
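
To make the point concrete, here is a minimal sketch of what an encoding carried in the traits type might look like. Everything in it (`utf8_encoding`, `latin1_encoding`, `encoded_char_traits`, the two aliases, `cafe`) is hypothetical and not part of any proposal or of the standard library. The encoding becomes part of the string's type, so it is fixed wherever that type is spelled out.

```cpp
#include <string>
#include <type_traits>

// Hypothetical encoding tags; nothing like this exists in the standard.
struct utf8_encoding {};
struct latin1_encoding {};

// Hypothetical traits class: std::char_traits<char> plus an encoding tag.
template <class Encoding>
struct encoded_char_traits : std::char_traits<char> {
    using encoding = Encoding;
};

using utf8_string = std::basic_string<char, encoded_char_traits<utf8_encoding>>;
using latin1_string = std::basic_string<char, encoded_char_traits<latin1_encoding>>;

// The aliases are unrelated types, so accidental mixing fails to compile...
static_assert(!std::is_same<utf8_string, latin1_string>::value,
              "different encodings give different string types");
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "no implicit conversion between encodings");

// ...but the encoding is baked into every signature at compile time.
inline utf8_string cafe() { return utf8_string("caf\xC3\xA9"); }
```

A library that exposes `utf8_string` in its interface has made the choice for every caller; an application cannot pick a different encoding at run time without converting.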

I would personally welcome forcing everything to be Unicode and dropping
support for anything else. If you need to read a file or a socket stream that
isn't Unicode, you should convert it to Unicode before placing it in the
string class. That implies having a holder type for arbitrary binary data,
which could be std::vector<char>.

But that is not going to happen, sorry.

In Qt, the class for Unicode strings is called QString and the class for
arbitrary-encoded byte data is QByteArray.
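
As a rough illustration of the "convert before placing it in the string class" idea above, assuming the raw input is Latin-1 held in a `std::vector<char>` and that UTF-8 in a plain `std::string` stands in for the Unicode string type (the function name `latin1_to_utf8` is made up for this sketch):

```cpp
#include <string>
#include <vector>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8. Every Latin-1 byte maps to
// the code point with the same numeric value, so bytes below 0x80 are copied
// unchanged and the rest become a two-byte UTF-8 sequence.
inline std::string latin1_to_utf8(const std::vector<char>& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}
```

The byte holder stays encoding-agnostic; only the result of the conversion is treated as text.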
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Fri, 14 Aug 2015 09:54:01 -0700 Raw View

On Friday 14 August 2015 06:34:06 phdofthehouse@gmail.com wrote:
> This is... all I know about the encoding/decoding stuff. It looks fairly
> non-trivial and there's a number of decisions that need to be made,
> especially with respect to if having an iterator that just returns a value
> `code_point` is useful & acceptable to any facilities which want to work
> with them. In most cases, you do not want to be in-place changing the text
> without explicitly using certain member functions to enable those changes,
> so I think this is an alright trade...
>
> Thoughts?

There are dozens of files there, most of which don't seem related to Unicode at
all.

I also looked at your UTF-8 encoder/decoder and it seems like:
 * it is subject to an overlong sequence attack
 * it uses non-inline data in the decoder (I thought you said that it was
   header-only)
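
For reference, a minimal sketch of the kind of check that closes the overlong hole. This is not the library's code; the function name and signature are made up for illustration. The idea is that each sequence length has a smallest code point it may legally encode, and anything below that is rejected.

```cpp
#include <cstddef>

// Decodes one UTF-8 sequence from [p, p + n). On success returns true and
// sets cp and len. Returns false for truncated input, bad continuation
// bytes, overlong forms, surrogates and values above U+10FFFF. The output
// parameters are unspecified when the function returns false.
inline bool decode_utf8_strict(const unsigned char* p, std::size_t n,
                               char32_t& cp, std::size_t& len) {
    if (n == 0) return false;
    const unsigned char lead = p[0];
    char32_t min_value = 0;  // smallest code point allowed for this length
    if (lead < 0x80) {
        cp = lead;
        len = 1;
        return true;
    } else if ((lead & 0xE0) == 0xC0) {
        len = 2; cp = lead & 0x1F; min_value = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        len = 3; cp = lead & 0x0F; min_value = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        len = 4; cp = lead & 0x07; min_value = 0x10000;
    } else {
        return false;  // continuation byte or invalid lead byte
    }
    if (n < len) return false;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_value) return false;                // overlong encoding
    if (cp > 0x10FFFF) return false;                 // outside Unicode
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    return true;
}
```

With this check the two-byte form `C0 AF`, an overlong encoding of `/`, is rejected instead of being decoded to U+002F.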

I didn't look further.

Anyway, what exactly is the objective here? Why did you post the link to this
library? If you're trying to implement the proposals you'd linked to, can you
also point to example code?

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


.