Topic: Unicode support


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Fri, 9 Nov 2012 21:46:20 -0800 (PST)

I have a new addendum that I'd like you to consider.

Your initial proposal insisted upon removing the encoding from the string
and letting the platform have a standard encoding. You've moved away from
that, opting for a more traditional template approach. And that's good, but
there is something that's been lost.

It's now a lot harder to write code that doesn't care about the string's
encoding.

One of the great things about algorithms is that they don't care about
where the container comes from or if it's even a container at all. What
they care about is that the iterator behave as expected. Indeed,
Boost.Range has ranges/iterators that don't iterate over an actual
container, such as boost::irange
<http://www.boost.org/doc/libs/1_52_0/libs/range/doc/html/range/reference/ranges/irange.html>
(an excellent class that works exceedingly well with range-based for).

This abstraction works because algorithms are templates. And one thing we
learned in the 13 years between C++98 and C++11 is this: sometimes, you
just can't use templates.

Sometimes, you need to use non-template code. A lot of the time, in fact,
you need to use non-template code. There will be times when you need to
find runtime means to achieve this kind of polymorphism.

This is why boost::any, boost::bind, and boost::function exist (among
others); two of the three have been standardized, with the third on its
way.

The thing about this unicode::string (or whatever we're naming it these
days) is that all of its encoding instantiations have a lot of things in
common. The whole purpose of the string is to store an encoding of Unicode.
It allows you to store Unicode codepoints in a container in a specific
encoded format.

But the *access* to them is all the same. You have codepoint
iterators/ranges. You use codepoint iterators/ranges with the special
grapheme/spacing/etc iterators. And so forth.

If you want to do a regex search on the string, do you really care about
its encoding? If you want to copy the string into an uppercase version of
itself, do you really care about the encoding of the original? And so
forth. There are a *lot* of operations where you simply won't care about
the encoding. Where you're simply using the string as though it were an
immutable sequence of char32_t.

This is ripe for some kind of polymorphism. I want to be able to write a
function that will have (const) access to a Unicode string, but without
caring what the encoding will be. Oh, we can use a template, but see above:
sometimes, we can't. Or just don't really want to.

Boost.Range has an any_range
<http://www.boost.org/doc/libs/1_52_0/libs/range/doc/html/range/reference/ranges/any_range.html>
class that uses type erasure to allow the use of any kind of range (over a
specific type T). That would make a great tool for this.

However, if you look at the definition, it's a template with a bunch of
different template parameters on it. All of those parameters will be the
same for each unicode::string.

Therefore, I propose unicode::any_string_ref. It can be a typedef of
any_range (filling in the template parameters), but I would prefer that it
be a living type that could have specialized members and so forth. It only
provides immutable access, but you can do things like searches and so
forth. You can also trim off leading/trailing characters; this only removes
them from the range, not the actual storage (the unicode::string).

The name combines the "any" from any_range with the string_ref type that
we're (presumably) adding to std::basic_string. It works in a similar
manner, providing const access to the string and allowing for things like
trimming and so forth.

You can implicitly convert from a unicode::string into one of these. I'm
not entirely sure we want implicit conversion into unicode::string from one
of these, but we at least want explicit conversion available.

Obviously, constructing a unicode::string from an any_string_ref will be
slower than a regular unicode::string copy (especially if the source
happens to use the same encoding). This is actually a motivating case for
making this our own type instead of strictly relying on any_range. It might
be possible to put some private backdoors into the system and access the
original type. I'm not an expert on type erasure, so I have no idea how
difficult that would be, but it should at least be possible.
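
To make the idea concrete, here is a minimal sketch of what such a
type-erased view could look like. Everything in it is hypothetical: the
`sketch` namespace, the two storage types, and the `for_each_codepoint`
member are stand-ins I made up for illustration, not the proposal's actual
interface, and it uses a hand-rolled std::function wrapper rather than
Boost's any_range. The point is only that `count_codepoint` is an ordinary
non-template function that works regardless of the encoding behind it.

```cpp
#include <cstddef>
#include <functional>
#include <string>

namespace sketch {

using codepoint_visitor = std::function<void(char32_t)>;

// Hypothetical UTF-32 storage: one code unit per code point.
struct utf32_string {
    std::u32string units;
    void for_each_codepoint(const codepoint_visitor& visit) const {
        for (char32_t cp : units) visit(cp);
    }
};

// Hypothetical UTF-16 storage. Assumes well-formed input: a high
// surrogate is always followed by a low surrogate.
struct utf16_string {
    std::u16string units;
    void for_each_codepoint(const codepoint_visitor& visit) const {
        for (std::size_t i = 0; i < units.size(); ++i) {
            char32_t cp = units[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size()) {
                const char32_t low = units[++i];
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            }
            visit(cp);
        }
    }
};

// Read-only, non-owning view. The encoding-specific decoding is hidden
// behind a std::function, so users of this type never see the encoding.
class any_string_ref {
public:
    // Implicit on purpose: any string type converts to the view.
    // The view only refers to `s`, so `s` must outlive it.
    template <class String>
    any_string_ref(const String& s)
        : walk_([&s](const codepoint_visitor& v) { s.for_each_codepoint(v); }) {}

    void for_each_codepoint(const codepoint_visitor& v) const { walk_(v); }

private:
    std::function<void(const codepoint_visitor&)> walk_;
};

// An ordinary (non-template) function that does not care about encoding.
inline std::size_t count_codepoint(any_string_ref text, char32_t needle) {
    std::size_t n = 0;
    text.for_each_codepoint([&](char32_t cp) { if (cp == needle) ++n; });
    return n;
}

} // namespace sketch
```

A callback-based walk keeps the sketch short; a real any_string_ref would
presumably expose codepoint iterators, and the per-codepoint indirect call
is exactly the overhead that the "private backdoor" idea would try to avoid.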


.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 10 Nov 2012 06:49:57 -0800 (PST)

Currently, you can match a regex of

basic_regex<char32_t>

However, I'm not sure that this is specified to respect the Unicode
definition of, for example, letter, word, etc, and I'm not sure that the
regex_traits class supports that. This might imply having to design another
regex API to deal with Unicode regular expressions. In addition, I'm
concerned about the std::locale and how that's going to be used in
conjunction with Unicode. I'll be the first to admit that I've got limited
experience with std::locale, but the existing facets do not define anything
to do with char32_t.

Man. And the I/O routines, too. This is going to be one hell of a proposal.


.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 10 Nov 2012 09:44:56 -0800 (PST)

On Saturday, November 10, 2012 6:49:57 AM UTC-8, DeadMG wrote:
>
> Currently, you can match a regex of
>
> basic_regex<char32_t>
>
> However, I'm not sure that this is specified to respect the Unicode
> definition of, for example, letter, word, etc, and I'm not sure that the
> regex_traits class supports that. This might imply having to design another
> regex API to deal with Unicode regular expressions. In addition, I'm
> concerned about the std::locale and how that's going to be used in
> conjunction with Unicode. I'll be the first to admit that I've got limited
> experience with std::locale, but the existing facets do not define anything
> to do with char32_t.
>
> Man. And the I/O routines, too. This is going to be one hell of a proposal.
>

What did you expect? You're basically wanting to port about 80% of ICU into
the standard library.

Boost.Locale should probably be looked at, as it has dealt with many of
these issues.


.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 10 Nov 2012 13:57:24 -0800 (PST)

If 80% of ICU is what's needed to allow decent Unicode support, then that's
what I'll propose.



.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 10 Nov 2012 21:54:57 -0800 (PST)

Another thing that's needed: `std::to_string` support for basic numeric
conversions. I thought of it because it was mentioned in another thread.
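
As a rough illustration of what I mean, here is a sketch of an integer
overload. The name `to_u32string` and the `sketch` namespace are made up
for this example; nothing like it exists in the standard library today. It
leans on std::to_string and then widens the result, which is safe because
the integer overloads only ever produce '-' and the decimal digits.

```cpp
#include <string>

namespace sketch {

// Hypothetical helper: formats an integer as a UTF-32 string.
inline std::u32string to_u32string(long long value) {
    const std::string narrow = std::to_string(value);
    std::u32string out;
    out.reserve(narrow.size());
    for (char c : narrow) {
        if (c == '-') {
            out.push_back(U'-');
        } else {
            // The decimal digits are guaranteed to be contiguous, so this
            // mapping does not depend on the narrow execution character set.
            out.push_back(static_cast<char32_t>(U'0' + (c - '0')));
        }
    }
    return out;
}

} // namespace sketch
```

Floating-point and locale-aware formatting are a different matter and are
deliberately left out of this sketch.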

You might want to put this up on Google Docs, to make it easier for people
to see and follow the proposal as you change it and assemble it.


.


Author: Beman Dawes <bdawes@acm.org>
Date: Mon, 12 Nov 2012 16:43:15 -0500
On Fri, Nov 9, 2012 at 4:11 PM,  <rick@longbowgames.com> wrote:
> n3398 tentatively suggests a char8_t type, and stops just shy of suggesting
> that the u8 literal be fixed. I don't know if that's too radical a change,
> but it sure would be nice if u8 was fixed in C++14.

n3398 is mine, and I run hot and cold on char8_t. There is no need for
a new type; char8_t can just be a typedef to unsigned char. Because of
the special relationship between char and unsigned char, a lot of
stuff just works. Presumably there would also be a u8string typedef to
basic_string<unsigned char>.  u8 literals aren't going to change; they
are needed both to preserve existing code and because of std::string
and const char* use cases. A uu8 literal marker or maybe a cast-like
function is the only thing I've been able to come up with to deal with
u8 literals.

But that's a lot of mechanism to introduce for platforms that already
treat narrow strings as UTF-8. So I'm brainstorming ideas to be able
to transition C++ to UTF-8 as the default narrow string encoding in a
way that preserves existing code on platforms where UTF-8 isn't the
current default encoding. I don't think that is a practical
possibility for the core language today, because it would require a
mandated option which we have always tried to avoid. But a Technical
Specification (TS) would not suffer from that limitation, so might be
much more practical.

I'm finding the current discussion interesting, and appreciate hearing
about other attempts to bring C++ further into the Unicode world.

--Beman

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 12 Nov 2012 14:42:13 -0800 (PST)

I think there's nothing wrong with saying "u8 literals are char8_t[], but
they can also implicitly decay into char[] if and only if overload
resolution with them as char8_t[] fails." I mean, it might be fun to
implement, but from a hypothetical standpoint, it should be fine.

I'm sorry about the delay for the next version; I've been a bit busy. I
should have another one up, say, tomorrow.


.


Author: Aleksandar Fabijanic <aleks.fabijanic@gmail.com>
Date: Mon, 12 Nov 2012 17:08:01 -0600
On Mon, Nov 12, 2012 at 3:43 PM, Beman Dawes <bdawes@acm.org> wrote:

> But that's a lot of mechanism to introduce for platforms that already
> treat narrow strings as UTF-8. So I'm brainstorming ideas to be able
> to transition C++ to UTF-8 as the default narrow string encoding in a
> way that preserves existing code on platforms where UTF-8 isn't the
> current default encoding. I don't think that is a practical
> possibility for the core language today, because it would require a
> mandated option which we have always tried to avoid. But a Technical
> Specification (TS) would not suffer from that limitation, so might be
> much more practical.
>
> I'm finding the current discussion interesting, and appreciate hearing
> about other attempts to bring C++ further into the Unicode world.

There is no magic wand; cross-platform Unicode is a nightmare, and there is
a price to pay for it.
POCO's practical approach is to sacrifice performance for consistency:
all interfaces are UTF-8 std::string by default.
On platforms "where UTF-8 isn't the current default encoding", for
Unicode builds we bite the bullet and wrap Unicode-sensitive API
calls.

E.g., see in https://github.com/pocoproject/poco/blob/master/Foundation/src/Environment_WIN32U.cpp:

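std::string EnvironmentImpl::getImpl(const std::string& name)
{
    // Convert the UTF-8 name to UTF-16 for the wide Win32 API.
    std::wstring uname;
    UnicodeConverter::toUTF16(name, uname);
    // First call: query the required size in wchar_t, including the
    // terminating null.
    DWORD len = GetEnvironmentVariableW(uname.c_str(), 0, 0);
    if (len == 0) throw NotFoundException(name);
    Buffer<wchar_t> buffer(len);
    // Second call: fetch the value itself.
    GetEnvironmentVariableW(uname.c_str(), buffer.begin(), len);
    // Convert back to UTF-8, dropping the terminating null.
    std::string result;
    UnicodeConverter::toUTF8(buffer.begin(), len - 1, result);
    return result;
}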
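
For readers without a Windows box handy, the conversion step that such a
wrapper relies on can be sketched portably. The following is not POCO's
UnicodeConverter; `append_utf8` and `utf16_to_utf8` are hypothetical names
for a minimal UTF-16 to UTF-8 transcoder, and substituting U+FFFD for
unpaired surrogates is just one possible choice.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

// Appends the UTF-8 encoding of one code point (assumed <= U+10FFFF).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Transcodes UTF-16 to UTF-8, combining surrogate pairs on the way.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            // Unpaired surrogate: substitute U+FFFD (a choice made here).
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}

} // namespace sketch
```

Every wrapped call pays for a conversion like this in each direction, which
is the performance being traded for a consistent UTF-8 interface.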

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 24 Nov 2012 03:54:26 -0800 (PST)

Right, the next version is here (finally...). The main things left to do
are:

- Debate the exact level of regex support. I think that Level 1 isn't
  quite enough, but Level 2 is a bit far; even ICU isn't fully Level 2
  compliant. Perhaps something a bit more fine-grained.
- Deal with locales. Right now, I've just used std::locale, but it's not
  really good enough.
- String formatting and parsing functions. The IOstream ones aren't
  defined as dealing with Unicode (and they rather suck), and there are
  quite a few requests for overhauling this.
- Error handling. I'm not too happy with just "throw an exception" (apart
  from the lack of a specification for the exception object, which is
  easily fixed). Python, for example, was exposed to a DoS attack through
  not properly dealing with this exception. I think the "replacement
  codepoint" and "simply drop bad codepoints" strategies also have their
  place. We may want to offer the user more flexibility in how they deal
  with problems in the input data.
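
To show what I mean by flexibility in error handling, here is a sketch of a
UTF-8 decoder that takes the strategy as a parameter. The `on_error` enum
and `decode_utf8` are hypothetical names for illustration and are not taken
from the attached document; how many replacement characters a malformed
sequence should produce is also a choice made here, not a settled matter.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sketch {

// The three strategies discussed: throw, substitute U+FFFD, or drop.
enum class on_error { throw_exception, replace, skip };

inline std::u32string decode_utf8(const std::string& in, on_error policy) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;  // continuation bytes expected
        char32_t cp = 0;
        char32_t min = 0;       // smallest value allowed for this length
        bool ok = true;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else                            { ok = false; }

        std::size_t next = i + 1;
        for (std::size_t k = 0; ok && k < extra; ++k) {
            if (next >= in.size()) { ok = false; break; }  // truncated
            const unsigned char c = static_cast<unsigned char>(in[next]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not a continuation
            cp = (cp << 6) | (c & 0x3F);
            ++next;
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
        } else if (policy == on_error::throw_exception) {
            throw std::invalid_argument("malformed UTF-8");
        } else if (policy == on_error::replace) {
            out.push_back(0xFFFD);
        }
        // on_error::skip: emit nothing for the malformed sequence.
        i = next;  // always advances by at least one byte
    }
    return out;
}

} // namespace sketch
```

With this shape, the caller that reads untrusted input can pick `replace`
or `skip` and never has to remember to catch anything.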

------=_Part_1475_612033.1353758066435
Content-Type: text/html; charset=UTF-8; name=Unicode.html
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=Unicode.html
X-Attachment-Id: 02607d39-94b6-4bc7-aef0-9be6a269b875

<!DOCTYPE html>
<html lang="en">
<body>
    <p>Document working version three. The primary todos are string formatting and parsing, and locales.</p>
    <br />
    <p>Document Number: Currently not allocated.</p>
    <p>Date: 2012-11-05</p>
    <p>Project: Programming Language C++, Library Working Group</p>
    <p>Reply-to: wolfeinstein@gmail.com</p>
<h1>Strings Proposal</h1>
<h2>Introduction</h2>
<p>The purpose of this document is to propose new interfaces to support Unicode text, where the existing interfaces are quite deficient. </p>
<h2>Motivation and Scope</h2>
<p>This proposal is primarily motivated by two problems. The first is the overwhelming number of string types- both primitive, Standard and third-party. This mess of text types makes it impossible to reliably
    hold string data. The second is the poor support for Unicode within the C++ Standard library. Unicode is a complex topic, where correctness depends on the implementation of complex algorithms by the user.
    This is only exacerbated by the problem of multiple string encodings, and poor conversion interfaces, which is why C++ is awash with third-party string types. This problem is made even worse by the existence of
    unrelated types that need to hold string data- for example, exceptions. The existing exception hierarchy is of significantly limited usefulness, as it cannot hold Unicode exception data. This proposal aims to
    solve both these problems by offering freestanding algorithms and a fresh string class which constitutes significant support for Unicode.
</p>
<p>It is intended to support all programmers from top to bottom, as string handling tasks are tasks universal to all programs. It is based on the existing practice shown in the more recent additions to the Standard
    library and Modern C++ design in general- templates instead of inheritance, function objects, and freestanding algorithms and iterators.
</p>
<p>It is not currently in use and a reference implementation is still under construction. However, there are numerous implementations of the various subcomponents, such as Unicode algorithms and formatting routines.
</p>
<h2>Impact on the Standard</h2>
<p>The primary impact on the Standard is the deprecation of existing components. There are no additional language or library features required.
</p>
<h2>Design Decisions</h2>
<p>The primary design decision taken here is to give one universal definition of a string- a range of Unicode codepoints. This decision was taken because it allows free-standing algorithms, and an interface
    that fits well with the rest of the Standard library. It also allows the string interface to be significantly simplified compared to the previous iteration. In addition, support for previous encodings is
    effectively for legacy only- no new code is expected to be written that will use non-Unicode encodings to actually represent data, only to store and convert to/from Unicode. Thus, there is no more sense in
    providing for other encodings beyond this functionality. In addition, the string class is considerably less generic than may be expected- especially lacking an allocator. This is intentional. The motivation
    is to simplify string handling, and the intended result is that there is one, and only one, string type that a library vendor need concern themselves with. An allocator or other abstractions would unfortunately
    defeat this goal.
</p>
<p>Unicode validation failure throwing an exception is well known to be a limited solution in many cases. This part of the API is due for additional consideration, as this is only a first draft. In addition,
    because of the potential for O(n) assignment, it was decided that the only kind of iterator offered over a string should be immutable, as in many cases the operation would boil down to inserting a variable
    size range. This could be prohibitively expensive. In addition, the choice of an rvalue makes it significantly simpler to offer iterators, as they can decode on the fly to codepoints from their choice of
    encoding. Aside from this, however, the string was designed to be a familiar container, offering the minimal set of functions required to manipulate the sequence of codepoints.
</p>
<p>Another problem is posed by UTF-8. As u8 literals do not have a distinct type, it's almost impossible to handle them correctly and as cleanly as the other literal types. There are other proposals for introducing
    char8_t and fixing UTF-8 literals, and introducing std::u8string, but this proposal does not assume they are accepted. It would, however, be of significant benefit.
</p>
<p>Finally, the std namespace is becoming very overloaded. It was decided that it would be best to split the components into subnamespaces. This not only aids with the organization of the library as a whole,
    but also provides a clear difference between old and new components.
</p>
<h2>Technical Specification</h2>
<p>Currently, to avoid ambiguity, the specification is given as a series of declarations in C++11.</p>
<p>Where a type is taken by either rvalue reference or const reference, it's legal for implementations to provide only one overload that takes that type by value.</p>
<p>For iterators, usually only the iterator category and return value of operator* are specified, as the full specification of an iterator involves a lot of plumbing. If requested, these
specifications can be expanded to the full definition.</p>
<p>In header &lt;unicode&gt;</p>
<pre>namespace std {
    namespace unicode {
        enum class normal_form {
            NFC,
            NFD,
            NFKC,
            NFKD
        };</pre>
<p>The unicode_string class is templated based on an encoding parameter. This is a traits-style class implemented for each encoding. The required members are:</p>

<pre>    typedef unspecified codeunit;</pre>

<p>The codeunit typedef is for the individual unit of storage for this specific encoding. This would be char16_t for UTF-16, char for narrow encoding, etc. </p>

<pre>    template&lt;typename codeunit_iterator&gt; using codepoint_iterator = unspecified;
    template&lt;typename codeunit_iterator&gt; using validating_codepoint_iterator = unspecified;</pre>
<p>A pair of iterator adaptors which view the original codeunit range as Unicode codepoints. The validating version will throw if the codeunits are not valid or do not result in valid Unicode. The adaptors have the
    same iterator category as the input type, except if that category is random access, in which case they only need be bidirectional.</p>

<pre>    template&lt;typename codepoint_iterator&gt; using codeunit_iterator = unspecified;
    template&lt;typename codepoint_iterator&gt; using validating_codeunit_iterator = unspecified;</pre>
<p>A pair of iterator adaptors which view the original range of Unicode codepoints as code units. The validating version will throw if the codepoints are not valid or cannot be expressed in the destination encoding.</p>

<pre>    template&lt;typename foreign_encoding, typename foreign_codeunit_iterator&gt; using conversion_iterator = unspecified;
    template&lt;typename foreign_encoding, typename foreign_codeunit_iterator&gt; using validating_conversion_iterator = unspecified;</pre>
<p>Views a range of codeunits in the foreign encoding as a range of code units in this encoding. A reasonable implementation for any foreign encoding is to simply convert to Unicode codepoints and then back to
    the current encoding. The validating iterator shall ensure that all foreign data is suitable for representation in this encoding.</p>
<p>An implementation shall provide at least the following encodings:
</p>
<pre>        typedef unspecified UTF8;
        typedef unspecified UTF16;
        typedef unspecified UTF32;
        typedef unspecified wide;
        typedef unspecified narrow;
</pre>
<p>The narrow encoding is the encoding used for narrow string literals, such as "hello". The wide encoding is the encoding used for wide string literals, such as L"hello". An implementation
    has no obligation to make these separate types if one of the wide or narrow encodings, or both, is already a Unicode encoding.
</p>

<p>The string class is a container of Unicode codepoints. Because the freestanding algorithms operate on ranges of Unicode codepoints, any container of Unicode codepoints may be
    used, but this class is provided as the minimal useful container.
</p>
<pre>        template&lt;typename encoding, typename allocator = std::allocator&lt;typename encoding::codeunit&gt;&gt; class unicode_string {
        public:
            unicode_string();
            template&lt;typename other_encoding, typename other_alloc&gt;
            unicode_string(const unicode_string&lt;other_encoding, other_alloc&gt;&);
            unicode_string(unicode_string&&);

            unicode_string(const char*);</pre>

<p>When the unicode_string interface deals with a const char* or std::string, it will assume narrow encoding, not UTF-8. A constructor which can take an encoding is available for UTF-8 const char*. When the
    unicode_string class takes input from an external source, it will validate that it is well-formed Unicode. If not, an exception shall be thrown.</p>

<pre>            unicode_string(const char*, encoding);

            unicode_string(const wchar_t*);
            unicode_string(const char16_t*);
            unicode_string(const char32_t*);
            template&lt;typename T, typename Traits, typename Allocator&gt; unicode_string(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&);</pre>
<p>The assumed encoding is based on the input type T- it is the same as that for const T*. </p>
<pre>
            template&lt;typename T, typename Traits, typename Allocator&gt; unicode_string(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&, encoding);

            template&lt;typename Iterator> unicode_string(Iterator, Iterator);</pre>

<p>The Iterator type, from a pair of which the string may be constructed, must be at least an input iterator over Unicode codepoints. This means operator* must yield at least an rvalue which is implicitly convertible
    to char32_t.
</p>

<pre>            template&lt;typename Iterator> void assign(Iterator, Iterator) &;</pre>

<p>The requirements on the Iterator type here are the same as those on the constructor.</p>

<pre>            void assign(const unicode_string&) &;
            void assign(unicode_string&&) &;

            template&lt;typename other_encoding, typename other_alloc&gt;
            string operator+(const unicode_string&lt;other_encoding, other_alloc&gt;&) const;
            string operator+(unicode_string&&) const;
            string operator+(const char*) const;
            string operator+(const wchar_t*) const;
            string operator+(const char16_t*) const;
            string operator+(const char32_t*) const;
            template&lt;typename T, typename Traits, typename Allocator&gt; string
            operator+(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&) const;

            template&lt;typename other_encoding, typename other_alloc&gt;
            unicode_string& operator+=(const unicode_string&lt;other_encoding, other_alloc&gt;&) &;
            unicode_string& operator+=(unicode_string&&) &;
            unicode_string& operator+=(const char*) &;
            unicode_string& operator+=(const wchar_t*) &;
            unicode_string& operator+=(const char16_t*) &;
            unicode_string& operator+=(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt; unicode_string&
            operator+=(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&);

            unicode_string& operator=(const unicode_string&) &;
            unicode_string& operator=(unicode_string&&) &;
            unicode_string& operator=(const char*) &;
            unicode_string& operator=(const wchar_t*) &;
            unicode_string& operator=(const char16_t*) &;
            unicode_string& operator=(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt; unicode_string&
            operator=(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&);

            iterator begin() &;
            const_iterator begin() const &;
            const_iterator cbegin() const &;
            iterator end() &;
            const_iterator end() const &;
            const_iterator cend() const &;</pre>

<p>The iterator and const_iterator types are bidirectional iterators of Unicode codepoints. operator* returns a char32_t rvalue, which is the codepoint at that position. The invalidation
semantics of iterators shall be those of std::vector.</p>

<pre>            void clear() &;
            bool empty() const;

            iterator erase(const_iterator where) &;
            iterator erase(const_iterator first, const_iterator last) &;

            void swap(unicode_string&);

            char32_t front() const;
            char32_t back() const;

            iterator insert(const_iterator where, char32_t codepoint);
            template&lt;typename InputIterator&gt; iterator insert(const_iterator where, InputIterator begin, InputIterator end);

            void pop_back();
            void push_back(char32_t);

            void normalize(normal_form);</pre>

<p>Performs an in-place normalization of the string's contents to the requested form by delegating to the freestanding algorithm.</p>

<pre>            const typename encoding::codeunit* null_terminated() const;</pre>

<p>Returns the contents of the unicode_string as a null-terminated buffer. This pointer shall be valid for as long as the unicode_string is not mutated or destroyed.
</p>

<pre>        };

        using string = unicode_string&lt;implementation-defined default&gt;;</pre>
<p>The implementation-defined default encoding must be capable of storing losslessly all Unicode codepoints.</p>

<pre>        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator&lt;(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator&lt;=(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator==(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator>(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator>=(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator!=(const unicode_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const unicode_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);</pre>

<p>These comparison operators behave as if the data in the lhs and the rhs was passed to the respective freestanding algorithm.</p>

<pre>        template&lt;typename Iterator, typename OutIt> void convert(Iterator begin, Iterator end, OutIt out, encoding src, encoding dst);</pre>

<p>Converts the input range of code units in the src encoding into the dst encoding. The output iterator receives the result of the operation.</p>

<pre>        template&lt;typename Iterator> std::pair&lt;grapheme_iterator&lt;Iterator>, grapheme_iterator&lt;Iterator>>
        graphemes(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;word_iterator&lt;Iterator>, word_iterator&lt;Iterator>>
        words(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;line_iterator&lt;Iterator>, line_iterator&lt;Iterator>>
        lines(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;sentence_iterator&lt;Iterator>, sentence_iterator&lt;Iterator>>
        sentences(Iterator begin, Iterator end);</pre>

<p>All four iterator types- grapheme_iterator, word_iterator, line_iterator, and sentence_iterator- implement the respective Unicode Standard boundary analysis algorithms. The Line algorithm is defined in <a href="http://www.unicode.org/reports/tr14/">UAX #14</a>
    and the other three in <a href="http://www.unicode.org/reports/tr29/">UAX #29</a>. The input iterators are at least bidirectional iterators of Unicode codepoints. The boundary
    iterators all return from operator*() a pair of the base iterator type, where the first value marks the beginning of the range, and the second marks the end, of the region. The first element of the return
    value of the four functions is the beginning and the second is the end.</p>

<pre>        template&lt;typename First, typename Second> bool less(First begin1, First end1, Second begin2, Second end2, std::locale);
        template&lt;typename First, typename Second> bool less(First begin1, First end1, Second begin2, Second end2);
        template&lt;typename First, typename Second> bool less_or_equal(First begin1, First end1, Second begin2, Second end2, std::locale);
        template&lt;typename First, typename Second> bool less_or_equal(First begin1, First end1, Second begin2, Second end2);
        template&lt;typename First, typename Second> bool greater(First begin1, First end1, Second begin2, Second end2, std::locale);
        template&lt;typename First, typename Second> bool greater(First begin1, First end1, Second begin2, Second end2);
        template&lt;typename First, typename Second> bool greater_or_equal(First begin1, First end1, Second begin2, Second end2, std::locale);
        template&lt;typename First, typename Second> bool greater_or_equal(First begin1, First end1, Second begin2, Second end2);
        template&lt;typename First, typename Second> bool equal(First begin1, First end1, Second begin2, Second end2);
        template&lt;typename First, typename Second> bool not_equal(First begin1, First end1, Second begin2, Second end2);</pre>

<p>These six algorithms implement Unicode comparison functionality. Equivalence is defined as equivalence when normalized, with either NFC or NFD. Collation requires a locale- overloads which do not have one as a
    parameter shall use the global locale. All iterator ranges shall be forward iterators of Unicode codepoints.
</p>

<pre>        template&lt;typename Iterator, typename Out> void normalize(Iterator begin, Iterator end, Out out, normal_form);</pre>

<p>Implements normalization of the forward range over Unicode codepoints, with the output provided to the output iterator. The normal_form argument indicates which normal form is requested.</p>

<pre>        template&lt;typename encoding, typename allocator&gt; std::istream&
        operator>>(std::istream&, unicode_string&lt;encoding, allocator&gt;&);
        template&lt;typename encoding, typename allocator&gt; std::wistream&
        operator>>(std::wistream&, unicode_string&lt;encoding, allocator&gt;&);</pre>

<p>Reads until the next whitespace, as operator>>(std::istream&, std::string&) does. Shall convert the data in the stream to Unicode, so whitespace shall include Unicode whitespaces.</p>

<pre>         template&lt;typename encoding, typename allocator&gt; std::ostream&
         operator&lt;&lt;(std::ostream&, unicode_string&lt;encoding, allocator&gt;&);
         template&lt;typename encoding, typename allocator&gt; std::wostream&
         operator&lt;&lt;(std::wostream&, unicode_string&lt;encoding, allocator&gt;&);</pre>

<p>Writes the contents of the string to the stream. Shall perform an encoding conversion to narrow encoding and wide encoding when necessary.</p>

<pre>        struct dec {
            dec();
            dec(const dec&);
            dec(dec&&);
            dec(std::locale l);
            std::locale get_locale();
         };
         struct hex {};
         struct bin {};
         struct oct {};
</pre>
<p>For all primitive integer types I:</p>
<pre>        template&lt;typename base = dec&gt; string to_string(I, base = base{});</pre>
<p>This function shall format the integer of type I as a string, in the base provided. If I is <code>bool</code>, then bin shall represent 0 or 1, and any other choice shall result in <code>true</code> or
    <code>false</code>.</p>
<pre>        enum class codepoint_category {
            letter_uppercase,
            letter_lowercase,
            letter_titlecase,
            letter_modifier,
            letter_other,
            mark_non_spacing,
            mark_spacing_combining,
            mark_enclosing,
            number_decimal_digit,
            number_letter,
            number_other,
            punctuation_connector,
            punctuation_dash,
            punctuation_open,
            punctuation_close,
            punctuation_initial,
            punctuation_final,
            punctuation_other,
            symbol_math,
            symbol_currency,
            symbol_modifier,
            symbol_other,
            separator_space,
            separator_line,
            separator_paragraph,
            other_control,
            other_format,
            other_surrogate,
            other_private_use,
            other_not_assigned,
        };
        enum class bidi_category {
            AL, AN,
            B, BN,
            CS,
            EN, ES, ET,
            L, LRE, LRO,
            NSM,
            ON,
            PDF,
            R, RLE, RLO,
            S,
            WS,
        };
        enum class category_joining_class {
            U, C, T, D, L, R,
        };
        enum class category_joining_group {
            Ain, Alaph, Alef, Alef_Maqsurah,
            Beh, Beth, Burushaski_Yeh_Barree,
            Dal, Dalath_Rish, E,
            Farsi_Yeh, Fe, Feh, Final_Semkath,
            Gaf, Gamal,
            Hah, Hamza_On_Heh_Goal, He,
            Heh, Heh_Goal, Heth,
            Kaf, Kaph, Khaph, Knotted_Heh,
            Lam, Lamadh, Meem, Mim,
            No_Joining_Group, Noon, Nun, Nya,
            Pe, Qaf, Qaph, Reh, Reversed_Pe,
            Rohingya_Yeh,
            Sad, Sadhe, Seen, Semkath, Shin,
            Swash_Kaf, Syriac_Waw, Tah, Taw,
            Teh_Marbuta, Teh_Marbuta_Goal, Teth, Waw, Yeh,
            Yeh_Barree, Yeh_With_Tail, Yudh,
            Yudh_He, Zain, Zhain,
        };
        enum class script_type {
            Arab, Armi, Armn, Avst,
            Bali, Bamu, Batk, Beng, Bopo, Brah, Brai, Bugi, Buhd,
            Cakm, Cans, Cari, Cham, Cher, Copt, Cprt,
            Cyrl,
            Deva, Dsrt,
            Egyp, Ethi,
            Geor, Glag, Goth, Grek, Gujr, Guru,
            Hang, Hani, Hano, Hebr, Hira, Hrkt,
            Ital,
            Java,
            Kali, Kana, Khar, Khmr, Knda, Kthi,
            Lana, Laoo, Latn, Lepc, Limb, Linb, Lisu, Lyci,
            Lydi,
            Mand, Merc, Mero, Mlym, Mong, Mtei, Mymr,
            Nkoo,
            Ogam, Olck, Orkh, Orya, Osma,
            Phag, Phli, Phnx, Plrd, Prti,
            Qaai,
            Rjng, Runr,
            Samr, Sarb, Saur, Shaw, Shrd,  Sinh, Sora, Sund, Sylo, Syrc,
            Tagb, Takr, Tale, Talu, Taml, Tavt, Telu, Tfng,
            Tglg, Thaa, Thai, Tibt,
            Ugar,
            Vaii,
            Xpeo, Xsux,
            Yiii,
            Zinh, Zyyy, Zzzz,
        };
        enum class block_name {
            Aegean_Numbers, Alchemical, Alphabetic_PF, Ancient_Greek_Music, Ancient_Greek_Numbers,
            Ancient_Symbols, Arabic, Arabic_Ext_A, Arabic_Math, Arabic_PF_A, Arabic_PF_B, Arabic_Sup,
            Armenian, Arrows, ASCII, Avestan, Balinese, Bamum, Bamum_Sup, Batak, Bengali, Block_Elements,
            Bopomofo, Bopomofo_Ext, Box_Drawing, Brahmi, Braille, Buginese, Buhid, Byzantine_Music,
            Carian, Chakma, Cham, Cherokee, CJK, CJK_Compat, CJK_Compat_Forms, CJK_Compat_Ideographs,
            CJK_Compat_Ideographs_Sup, CJK_Ext_A, CJK_Ext_B, CJK_Ext_C, CJK_Ext_D, CJK_Radicals_Sup,
            CJK_Strokes, CJK_Symbols, Compat_Jamo, Control_Pictures, Coptic, Counting_Rod, Cuneiform,
            Cuneiform_Numbers, Currency_Symbols, Cypriot_Syllabary, Cyrillic, Cyrillic_Ext_A, Cyrillic_Ext_B,
            Cyrillic_Sup, Deseret, Devanagari, Devanagari_Ext, Diacriticals, Diacriticals_For_Symbols,
            Diacriticals_Sup, Dingbats, Domino, Egyptian_Hieroglyphs, Emoticons, Enclosed_Alphanum,
            Enclosed_Alphanum_Sup, Enclosed_CJK, Enclosed_Ideographic_Sup, Ethiopic, Ethiopic_Ext,
            Ethiopic_Ext_A, Ethiopic_Sup, Geometric_Shapes, Georgian, Georgian_Sup, Glagolitic, Gothic, Greek,
            Greek_Ext, Gujarati, Gurmukhi, Half_And_Full_Forms, Half_Marks, Hangul, Hanunoo, Hebrew,
            High_PU_Surrogates, High_Surrogates, Hiragana, IDC, Imperial_Aramaic, Indic_Number_Forms,
            Inscriptional_Pahlavi, Inscriptional_Parthian, IPA_Ext, Jamo, Jamo_Ext_A, Jamo_Ext_B, Javanese,
            Kaithi, Kana_Sup, Kanbun, Kangxi, Kannada, Katakana, Katakana_Ext, Kayah_Li, Kharoshthi, Khmer,
            Khmer_Symbols, Lao, Latin_1_Sup, Latin_Ext_A, Latin_Ext_Additional, Latin_Ext_B, Latin_Ext_C,
            Latin_Ext_D, Lepcha, Letterlike_Symbols, Limbu, Linear_B_Ideograms, Linear_B_Syllabary, Lisu,
            Low_Surrogates, Lycian, Lydian, Mahjong, Malayalam, Mandaic, Math_Alphanum, Math_Operators,
            Meetei_Mayek, Meetei_Mayek_Ext, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Misc_Arrows,
            Misc_Math_Symbols_A, Misc_Math_Symbols_B, Misc_Pictographs, Misc_Symbols, Misc_Technical,
            Modifier_Letters, Modifier_Tone_Letters, Mongolian, Music, Myanmar, Myanmar_Ext_A, NB,
            New_Tai_Lue, NKo, Number_Forms, OCR, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Old_South_Arabian,
            Old_Turkic, Oriya, Osmanya, Phags_Pa, Phaistos, Phoenician, Phonetic_Ext, Phonetic_Ext_Sup,
            Playing_Cards, PUA, Punctuation, Rejang, Rumi, Runic, Samaritan, Saurashtra, Sharada, Shavian,
            Sinhala, Small_Forms, Sora_Sompeng, Specials, Sundanese, Sundanese_Sup, Sup_Arrows_A, Sup_Arrows_B,
            Sup_Math_Operators, Sup_PUA_A, Sup_PUA_B, Sup_Punctuation, Super_And_Sub, Syloti_Nagri, Syriac,
            Tagalog, Tagbanwa, Tags, Tai_Le, Tai_Tham, Tai_Viet, Tai_Xuan_Jing, Takri, Tamil, Telugu, Thaana,
            Thai, Tibetan, Tifinagh, Transport_And_Map, UCAS, UCAS_Ext, Ugaritic, Vai, Vedic_Ext,
            Vertical_Forms, VS, VS_Sup, Yi_Radicals, Yi_Syllables, Yijing,
        };
        enum class version {
            v1_1,
            v2_0, v2_1,
            v3_0, v3_1, v3_2,
            v4_0, v4_1,
            v5_0, v5_1, v5_2,
            v6_0, v6_1, v6_2,
            unassigned = 0xFF,
        };
        struct codepoint_properties {
            codepoint_category category;
            block_name block;
            version age;
            bidi_category bidi_type;
            category_joining_class joining_class;
            category_joining_group joining_group;
            script_type script;
            bool control;
            bool digit;
            bool letter;
            bool lower;
            bool number;
            bool punctuation;
            bool separator;
            bool symbol;
            bool upper;
            bool whitespace;
        };
        codepoint_properties properties(char32_t);</pre>
<p>Returns the properties of any given codepoint. These properties are defined by the Unicode Standard, not here.</p>
<pre>        template&lt;typename Iterator, typename Out&gt; void to_upper(Iterator begin, Iterator end, Out out);
        template&lt;typename Iterator, typename Out&gt; void to_lower(Iterator begin, Iterator end, Out out);</pre>
<p>Performs a case conversion for the given series of Unicode codepoints. Places the output in the Out iterator.</p>
<pre>        using regex = std::basic_regex&lt;char32_t, implementation-defined&gt;;</pre>
<p>A regular expression type suitable for matching Unicode codepoints. The traits must support <a href="http://www.unicode.org/reports/tr18/">UTS-18</a> to at least Level 2.</p>
<pre>    }
}</pre>
<h2>Acknowledgements</h2>
<p>R. Martinho Fernandes gave significant assistance when dealing with some of the ins and outs of Unicode.</p>

</body>
</html>
------=_Part_1475_612033.1353758066435--

.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 24 Nov 2012 10:06:04 -0800 (PST)
Raw View
------=_Part_207_19275869.1353780364064
Content-Type: text/plain; charset=ISO-8859-1

On Saturday, November 24, 2012 3:54:26 AM UTC-8, DeadMG wrote:
>
> Right, the next version is here (finally...). The main things left to do
> are:
>
> - Debate the exact level of regex support. I think that Level 1 isn't quite
>   enough, but Level 2 is a bit far; even ICU isn't fully Level 2 compliant.
>   Perhaps something a bit more fine-grained.
> - Deal with locales. Right now, I've just used std::locale, but it's not
>   really good enough.
> - String formatting and parsing functions. The IOstream ones aren't defined
>   as dealing with Unicode (and they rather suck), and there are quite a few
>   requests for overhauling this.
> - Error handling. I'm not too happy with just "Throw an exception" (aside
>   from the lack of specification for the exception object, that's easily
>   done). Python, for example, was exposed to a DoS attack through not
>   properly dealing with this exception. I think that perhaps the replacement
>   codepoint or simply drop bad codepoints strategies also have their place.
>   May want to offer the user more flexibility with how they deal with
>   problems in the input data.
>

You might want to rewrite the Design Decisions section, since it states
things like "especially lacking an allocator" which is no longer true and
hasn't been true for several iterations.

I have a problem with the stream output routines though. Right now (and
therefore, for a good long while), people write UTF-8 data out via
std::ostream. If I have a UTF-8 string, I want to be able to write it to an
ostream anyway. I don't care that "narrow" encoding may not be UTF-8. I
asked for UTF-8 encoding in my string, and I want to be able to write that.

The same goes for UTF-16 and UTF-32, though that could be solved like this:

template<typename encoding, typename allocator>
std::basic_ostream<typename encoding::char_type, typename encoding::traits_type>&
operator<<(std::basic_ostream<typename encoding::char_type, typename encoding::traits_type>&,
           const unicode_string<encoding, allocator>&);

Of course, this requires that you create a stream with the proper encoding.
If you want to convert, you'll have to do it manually, since you can't
partially specialize functions.
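
To make that concrete, here is a minimal sketch of such an operator. Everything in it is an assumption made for illustration: `unicode_string` is a stand-in that merely wraps a `std::basic_string` of code units, and `utf8_encoding`, `wide_encoding` and `code_units()` are hypothetical names, not part of the proposal. The wide case uses `wchar_t` because the standard library only guarantees stream facets for `char` and `wchar_t`.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical encoding tags: each names its code-unit type and traits.
struct utf8_encoding {
    using char_type = char;
    using traits_type = std::char_traits<char>;
};
struct wide_encoding {  // stand-in for a wider encoding
    using char_type = wchar_t;
    using traits_type = std::char_traits<wchar_t>;
};

// Stand-in for the proposed unicode_string: just a buffer of code units.
template <typename Encoding,
          typename Allocator = std::allocator<typename Encoding::char_type>>
class unicode_string {
public:
    using storage_type =
        std::basic_string<typename Encoding::char_type,
                          typename Encoding::traits_type, Allocator>;

    explicit unicode_string(storage_type units) : units_(std::move(units)) {}
    const storage_type& code_units() const { return units_; }

private:
    storage_type units_;
};

// Writes the stored code units unchanged; no transcoding takes place.
// Encoding is deduced from the string argument, so the stream's character
// type has to match the encoding's char_type.
template <typename Encoding, typename Allocator>
std::basic_ostream<typename Encoding::char_type, typename Encoding::traits_type>&
operator<<(std::basic_ostream<typename Encoding::char_type,
                              typename Encoding::traits_type>& os,
           const unicode_string<Encoding, Allocator>& s) {
    const auto& units = s.code_units();
    return os.write(units.data(), static_cast<std::streamsize>(units.size()));
}
```

Trying to stream a `unicode_string<wide_encoding>` into a `std::ostream` simply fails to compile with this sketch, which is the "convert manually" case described above.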

A solution needs to exist for UTF-8 as well, which will work with
`std::basic_ostream<char, char_traits<char>>` *without* converting to
"narrow" encoding. I don't know how to go about doing that, since you can't
partially specialize functions and we want full allocator support.
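
One possible direction, offered only as a sketch: function templates cannot be partially specialized, but they can be overloaded, and an overload can fix the encoding while leaving the allocator as a template parameter. The code below assumes UTF-8 code units are stored as `char`; `utf8_encoding`, the stand-in `unicode_string` and `code_units()` are hypothetical names for illustration.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

struct utf8_encoding {};  // hypothetical encoding tag

// Stand-in for the proposed class; this sketch stores char code units only.
template <typename Encoding, typename Allocator = std::allocator<char>>
class unicode_string {
public:
    using storage_type =
        std::basic_string<char, std::char_traits<char>, Allocator>;

    explicit unicode_string(storage_type units) : units_(std::move(units)) {}
    const storage_type& code_units() const { return units_; }

private:
    storage_type units_;
};

// An overload, not a specialization: the encoding is fixed to UTF-8 while
// the allocator remains a template parameter. The bytes go to the narrow
// stream as they are, without conversion to the "narrow" encoding.
template <typename Allocator>
std::ostream& operator<<(std::ostream& os,
                         const unicode_string<utf8_encoding, Allocator>& s) {
    const auto& units = s.code_units();
    return os.write(units.data(), static_cast<std::streamsize>(units.size()));
}
```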

Also, you forgot to take the string by `const&` in the output APIs.

Stream input should have similar behavior; if I have a UTF-8 file, I need
to be able to read that into a UTF-8 string without having to create a
basic_string and then copying it into a unicode_string.

--




------=_Part_207_19275869.1353780364064--

.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 24 Nov 2012 13:52:43 -0800 (PST)
Raw View
------=_Part_1_28942169.1353793963291
Content-Type: text/plain; charset=ISO-8859-1



On Saturday, November 24, 2012 10:06:04 AM UTC-8, Nicol Bolas wrote:
>
> On Saturday, November 24, 2012 3:54:26 AM UTC-8, DeadMG wrote:
>>
>> Right, the next version is here (finally...). The main things left to do
>> are:
>>
>> - Debate the exact level of regex support. I think that Level 1 isn't quite
>>   enough, but Level 2 is a bit far; even ICU isn't fully Level 2 compliant.
>>   Perhaps something a bit more fine-grained.
>> - Deal with locales. Right now, I've just used std::locale, but it's not
>>   really good enough.
>> - String formatting and parsing functions. The IOstream ones aren't defined
>>   as dealing with Unicode (and they rather suck), and there are quite a few
>>   requests for overhauling this.
>> - Error handling. I'm not too happy with just "Throw an exception" (aside
>>   from the lack of specification for the exception object, that's easily
>>   done). Python, for example, was exposed to a DoS attack through not
>>   properly dealing with this exception. I think that perhaps the replacement
>>   codepoint or simply drop bad codepoints strategies also have their place.
>>   May want to offer the user more flexibility with how they deal with
>>   problems in the input data.
>>
>
> You might want to rewrite the Design Decisions section, since it states
> things like "especially lacking an allocator" which is no longer true and
> hasn't been true for several iterations.
>
> I have a problem with the stream output routines though. Right now (and
> therefore, for a good long while), people write UTF-8 data out via
> std::ostream. If I have a UTF-8 string, I want to be able to write it to an
> ostream anyway. I don't care that "narrow" encoding may not be UTF-8. I
> asked for UTF-8 encoding in my string, and I want to be able to write that.
>
> The same goes for UTF-16 and UTF-32, though that could be solved like this:
>
> template<typename encoding, typename allocator>
> std::basic_ostream<typename encoding::char_type, typename encoding::traits_type>&
> operator<<(std::basic_ostream<typename encoding::char_type, typename encoding::traits_type>&,
>            const unicode_string<encoding, allocator>&);
>
> Of course, this requires that you create a stream with the proper
> encoding. If you want to convert, you'll have to do it manually, since you
> can't partially specialize functions.
>
> A solution needs to exist for UTF-8 as well, which will work with
> `std::basic_ostream<char, char_traits<char>>` *without* converting to
> "narrow" encoding. I don't know how to go about doing that, since you can't
> partially specialize functions and we want full allocator support.
>
> Also, you forgot to take the string by `const&` in the output APIs.
>
> Stream input should have similar behavior; if I have a UTF-8 file, I need
> to be able to read that into a UTF-8 string without having to create a
> basic_string and then copying it into a unicode_string.
>

Some other issues I noticed:

*The lack of inserts/assigns.*

I can create a unicode_string from a sequence of encoded code-units using
the constructor. But I can't use `assign` to do the same thing, nor can I
use `insert`. The `assign` part isn't so bad (move assignment accomplishes
the same thing), but the `insert` part is.

It's not even clear what the iterator range insert *does*. Are the given
iterators a char32_t range? Assuming they are, we need a way to provide a
range of data that matches the actual encoding of the string. Preferably,
for every constructor overload, there should be a matching insert overload.

I don't want to have to do this:

unicode_string ustr = ...;
const char *str = GetUtf8String();
ustr.insert(ustr.end(), UTF8::const_iterator(str),
            UTF8::const_iterator(str + std::char_traits<char>::length(str)));

A simple, *common* operation like this shouldn't require anything more than
this:

unicode_string ustr = ...;
const char *str = GetUtf8String();
ustr.insert(ustr.end(), str, UTF8());

Even inserting another unicode_string requires a lot of effort, as you have
to get the begin and end iterators. Also, if you're inserting a
unicode_string of the exact same encoding type, you should *never* have to
convert it to char32_t and back. It should be a much faster operation; a
memory allocation and a series of memcpys.

In short: Needs more `insert`.
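
As a rough sketch of what those overloads could look like: the class below is a stand-in for `unicode_string<UTF8>` that stores its code units in a `std::string`. The names `utf8_string` and `utf8_tag`, and the use of code-unit offsets as positions, are assumptions for illustration only. A real implementation would also validate the input and keep positions on codepoint boundaries.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

struct utf8_tag {};  // hypothetical tag meaning "the input is already UTF-8"

// Stand-in for unicode_string<UTF8>; positions are code-unit offsets here.
class utf8_string {
public:
    utf8_string() = default;

    // Insert null-terminated UTF-8 code units at a code-unit offset.
    void insert(std::size_t pos, const char* str, utf8_tag) {
        units_.insert(pos, str, std::strlen(str));
    }

    // Same-encoding insert: copies code units directly and never
    // round-trips through char32_t.
    void insert(std::size_t pos, const utf8_string& other) {
        units_.insert(pos, other.units_);
    }

    std::size_t codeunit_count() const { return units_.size(); }
    const char* codeunit_data() const { return units_.data(); }

private:
    std::string units_;
};
```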

*The lack of a code-unit count.*

I know that "length" is a real problem when dealing with Unicode strings.
But there will be times when we need one specific length: the length of the
encoded range.

The `null_terminated` function (why not `data` like all other containers
that provide this?) provides access to the underlying sequence of code
units. This is good, as it allows interoperation with APIs that take those
code units. The problem is that null termination is not sufficient.

Let's say I want to pass a UTF-8 string to Lua. Lua strings are
null-terminated in the same way that basic_strings are, but they *also*
have a size (just like basic_string). Which means that Lua's string can
have embedded null characters, just like a unicode_string<UTF8>.

The problem is that I can't actually give Lua a string with embedded nulls
because unicode_string doesn't provide a way to figure out how big the
array from `null_terminated` will be. Without that, you have to guess by
finding the null terminator, which can be wrong.

So there needs to be this function:

size_type codeunit_count() const;

It's named this way to make certain that nobody mistakes it for a "length"
or anything of the kind. It's clearly getting the number of codeunits. And
this should be O(1) in complexity.

Indeed, we could rename `null_terminated` to match:

const encoding::codeunit_type* codeunit_data() const;

That way, it looks like a proper pair.
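
The embedded-null problem is easy to reproduce with a plain `std::string` standing in for the code-unit buffer. The helper names below are made up for this illustration. An API such as Lua's `lua_pushlstring`, which takes a pointer and a length, needs the real count rather than the guess.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Five code units with an embedded null in the middle.
inline std::string make_units() { return std::string("ab\0cd", 5); }

// What a consumer computes if all it has is the pointer and it has to
// scan for the terminator.
inline std::size_t guessed_length(const std::string& units) {
    return std::strlen(units.c_str());
}

// What an O(1) codeunit_count() would report instead.
inline std::size_t actual_length(const std::string& units) {
    return units.size();
}
```

The scan stops at the first null and reports 2, while the buffer really holds 5 code units.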

And since it's topical, there should be some way to create a
basic_string_ref directly from the interface. That way, we won't have to
make the string NULL-terminated. It could look like this:

basic_string_ref<encoding::codeunit_type, encoding::codeunit_traits>
codeunit_string_ref() const;
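
A sketch of the idea, using C++17's `std::string_view` (which grew out of the string_ref proposal) in place of `basic_string_ref`. The `utf8_string` class is a hypothetical stand-in that stores `char` code units.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Stand-in for unicode_string<UTF8>.
class utf8_string {
public:
    explicit utf8_string(std::string units) : units_(std::move(units)) {}

    // Non-owning view over the code units. It carries its own length, so
    // no terminator is needed; it stays valid only while the string is
    // neither modified nor destroyed.
    std::string_view codeunit_string_ref() const {
        return std::string_view(units_.data(), units_.size());
    }

private:
    std::string units_;
};
```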

*Lack of sized constructors.*

This goes back to the last point. Just as Lua allows the user to provide
strings with embedded null characters, it *provides* strings with embedded
null characters.

None of the constructors take an array + size; they only use
null-terminated arrays. We need more overloads to support arrays of a given
size. Appropriate `assign` and `insert` overloads should be added too.
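
To illustrate the difference, here is a small sketch with both constructor forms. `utf8_string` is again a hypothetical stand-in backed by `std::string`; a real implementation would validate the code units as well. The sized form is what you would need for data coming from something like Lua's `lua_tolstring`, which hands back a pointer together with a length.

```cpp
#include <cstddef>
#include <string>

// Stand-in for unicode_string<UTF8>.
class utf8_string {
public:
    // Null-terminated form: stops at the first null code unit.
    explicit utf8_string(const char* str) : units_(str) {}

    // Sized form: keeps every code unit, embedded nulls included.
    utf8_string(const char* str, std::size_t count) : units_(str, count) {}

    std::size_t codeunit_count() const { return units_.size(); }

private:
    std::string units_;
};
```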

*Lack of memory controls.*

std::vector and std::basic_string expose a capacity, which gives you some
guarantees about memory allocations. unicode_string, due to variable-sized
encodings, can't give quite the same guarantees when you're inserting
char32_t ranges.

But that doesn't mean it can't provide something.

Just as I suggested a codeunit_count, there should also be a
codeunit_capacity and codeunit_reserve. Again, they are named so that you
know they work in terms of codeunits. By reserving an appropriate amount of
memory, you can transfer properly-encoded codeunit sequences into the
string without allocating more memory.

This goes hand in hand with inserting codeunit ranges. By reserving memory
beforehand, you can build unicode_strings while keeping their memory access
pattern regular.
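
A minimal sketch of what that could look like, assuming a stand-in class that keeps its code units in a `std::vector<char>` (chosen here because vector guarantees no reallocation while the size stays within the reserved capacity). `append_codeunits` is a hypothetical name for the codeunit-range insert.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for unicode_string<UTF8> with codeunit-based memory controls.
class utf8_string {
public:
    void codeunit_reserve(std::size_t n) { units_.reserve(n); }
    std::size_t codeunit_capacity() const { return units_.capacity(); }
    std::size_t codeunit_count() const { return units_.size(); }
    const char* codeunit_data() const { return units_.data(); }

    // Appends already-encoded code units; allocates only if the reserved
    // capacity is exceeded.
    void append_codeunits(const char* first, std::size_t n) {
        units_.insert(units_.end(), first, first + n);
    }

private:
    std::vector<char> units_;
};
```

After `codeunit_reserve(64)`, appending 60 code units in several steps leaves `codeunit_data()` pointing at the same buffer.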

--




------=_Part_1_28942169.1353793963291
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<br><br>On Saturday, November 24, 2012 10:06:04 AM UTC-8, Nicol Bolas wrote=
:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;bo=
rder-left: 1px #ccc solid;padding-left: 1ex;">On Saturday, November 24, 201=
2 3:54:26 AM UTC-8, DeadMG wrote:<blockquote class=3D"gmail_quote" style=3D=
"margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex">Ri=
ght, the next version is here (finally...). The main things left to do are:=
<div><br></div><div>Debate the exact level of regex support. I think that L=
evel 1 isn't quite enough, but Level 2 is a bit far- even ICU isn't fully L=
evel 2 compliant. Perhaps something a bit more fine-grained.</div><div>Deal=
 with locales. Right now, I've just used std::locale, but it's not really g=
ood enough.</div><div>String formatting and parsing functions. The IOstream=
 ones aren't defined as dealing with Unicode (and they rather suck), and th=
ere are quite a few requests for overhauling this.</div><div>Error handling=
.. I'm not too happy with just "Throw an exception" (aside from the lack of =
specification for the exception object, that's easily done). Python for exa=
mple was exposed to a DOS attack through not properly dealing with this exc=
eption. I think that perhaps the replacement codepoint or simply drop bad c=
odepoints strategies also have their place. May want to offer the user more=
 flexibility with how they deal with problems in the input data.</div></blo=
ckquote><div><br>You might want to rewrite the Design Decisions=20
section, since it states things like "especially lacking an allocator"=20
which is no longer true and hasn't been true for several iterations.<br><br=
>I have a problem with the stream output routines though. Right now (and th=
erefore, for a good long while), people write UTF-8 data out via std::ostre=
am. If I have a UTF-8 string, I want to be able to write it to an ostream a=
nyway. I don't care that "narrow" encoding may not be UTF-8. I asked for UT=
F-8 encoding in my string, and I want to be able to write that.<br><br>The =
same goes for UTF-16 and UTF-32, though that could be solved like this:<br>=
<br><div style=3D"background-color:rgb(250,250,250);border-color:rgb(187,18=
7,187);border-style:solid;border-width:1px;word-wrap:break-word"><code><div=
><span style=3D"color:#008">template</span><span style=3D"color:#660">&lt;<=
/span><span style=3D"color:#008">typename</span><span style=3D"color:#000">=
 encoding</span><span style=3D"color:#660">,</span><span style=3D"color:#00=
0"> </span><span style=3D"color:#008">typename</span><span style=3D"color:#=
000"> allocator</span><span style=3D"color:#660">&gt;</span><span style=3D"=
color:#000"> std</span><span style=3D"color:#660">::</span><span style=3D"c=
olor:#000">basic_ostream</span><span style=3D"color:#660">&lt;</span><span =
style=3D"color:#000">encoding</span><span style=3D"color:#660">::</span><sp=
an style=3D"color:#000">c<wbr>har_type</span><span style=3D"color:#660">,</=
span><span style=3D"color:#000"> encoding</span><span style=3D"color:#660">=
::</span><span style=3D"color:#000">traits_type</span><span style=3D"color:=
#660">&gt;&amp;</span><span style=3D"color:#000"> </span><span style=3D"col=
or:#008">operator</span><span style=3D"color:#660">&lt;&lt;(</span><span st=
yle=3D"color:#000">std</span><span style=3D"color:#660">::</span><span styl=
e=3D"color:#000">ostream</span><span style=3D"color:#660"><code><span style=
=3D"color:#660">&lt;</span><span style=3D"color:#000">encodi<wbr>ng</span><=
span style=3D"color:#660">::</span><span style=3D"color:#000">char_type</sp=
an><span style=3D"color:#660">,</span><span style=3D"color:#000"> encoding<=
/span><span style=3D"color:#660">::</span><span style=3D"color:#000">traits=
_type</span><span style=3D"color:#660">&gt;</span><span style=3D"color:#660=
"></span></code>&amp;,</span><span style=3D"color:#000"> unicode_string</sp=
an><span style=3D"color:#660">&lt;</span><span style=3D"color:#000">encodin=
g</span><span style=3D"color:#660">,</span><span style=3D"color:#000"> allo=
cator</span><span style=3D"color:#660">&gt;&amp;);</span><span style=3D"col=
or:#000"><br></span></div></code></div><br>Of course, this requires that yo=
u create a stream with the proper encoding. If you want to convert, you'll =
have to do it manually, since you can't partially specialize functions.<br>=
<br>A solution needs to exist for UTF-8 as well, which will work with `std:=
:basic_ostream&lt;char, char_traits&lt;char&gt;&gt;` <i>without</i> convert=
ing to "narrow" encoding. I don't know how to go about doing that, since yo=
u can't partially specialize functions and we want full allocator support.<=
br><br>Also, you forgot to take the string by `const&amp;` in the output AP=
Is.<br><br>Stream input should have similar behavior; if I have a UTF-8 fil=
e, I need to be able to read that into a UTF-8 string without having to cre=
ate a basic_string and then copying it into a unicode_string.<br></div></bl=
ockquote><div><br>Some other issues I noticed:<br><br><b>The lack of insert=
s/assigns.</b><br><br>I can create a unicode_string from a sequence of enco=
ded code-units using the constructor. But I can't use `assign` to do the sa=
me thing, nor can I use `insert`. The `assign` part isn't so bad (move assi=
gnment accomplishes the same thing), but the `insert` part is.<br><br>It's =
not even clear what the iterator range insert <i>does</i>. Are the given it=
erators a char32_t range? Assuming they are, we need a way to provide a ran=
ge of data that matches the actual encoding of the string. Preferably, for =
every constructor overload, there should be a matching insert overload.<br>=
<br>I don't want to have to do this:<br><br><div class=3D"prettyprint" styl=
e=3D"background-color: rgb(250, 250, 250); border-color: rgb(187, 187, 187)=
; border-style: solid; border-width: 1px; word-wrap: break-word;"><code cla=
ss=3D"prettyprint"><div class=3D"subprettyprint"><span style=3D"color: #000=
;" class=3D"styled-by-prettify">unicode_string ustr </span><span style=3D"c=
olor: #660;" class=3D"styled-by-prettify">=3D</span><span style=3D"color: #=
000;" class=3D"styled-by-prettify"> </span><span style=3D"color: #660;" cla=
ss=3D"styled-by-prettify">...;</span><span style=3D"color: #000;" class=3D"=
styled-by-prettify"><br></span><span style=3D"color: #008;" class=3D"styled=
-by-prettify">const</span><span style=3D"color: #000;" class=3D"styled-by-p=
rettify"> </span><span style=3D"color: #008;" class=3D"styled-by-prettify">=
char</span><span style=3D"color: #000;" class=3D"styled-by-prettify"> </spa=
n><span style=3D"color: #660;" class=3D"styled-by-prettify">*</span><span s=
tyle=3D"color: #000;" class=3D"styled-by-prettify">str </span><span style=
=3D"color: #660;" class=3D"styled-by-prettify">=3D</span><span style=3D"col=
or: #000;" class=3D"styled-by-prettify"> </span><span style=3D"color: #606;=
" class=3D"styled-by-prettify">GetUtf8String</span><span style=3D"color: #6=
60;" class=3D"styled-by-prettify">();</span><span style=3D"color: #000;" cl=
ass=3D"styled-by-prettify"><br>ustr</span><span style=3D"color: #660;" clas=
s=3D"styled-by-prettify">.</span><span style=3D"color: #000;" class=3D"styl=
ed-by-prettify">insert</span><span style=3D"color: #660;" class=3D"styled-b=
y-prettify">(</span><span style=3D"color: #000;" class=3D"styled-by-prettif=
y">ustr</span><span style=3D"color: #660;" class=3D"styled-by-prettify">.</=
span><span style=3D"color: #000;" class=3D"styled-by-prettify">back</span><=
span style=3D"color: #660;" class=3D"styled-by-prettify">(),</span><span st=
yle=3D"color: #000;" class=3D"styled-by-prettify"> UTF8</span><span style=
=3D"color: #660;" class=3D"styled-by-prettify">::</span><span style=3D"colo=
r: #000;" class=3D"styled-by-prettify">const_iterator</span><span style=3D"=
color: #660;" class=3D"styled-by-prettify">(</span><span style=3D"color: #0=
00;" class=3D"styled-by-prettify">str</span><span style=3D"color: #660;" cl=
ass=3D"styled-by-prettify">),</span><span style=3D"color: #000;" class=3D"s=
tyled-by-prettify"> UTF8</span><span style=3D"color: #660;" class=3D"styled=
-by-prettify">::</span><span style=3D"color: #000;" class=3D"styled-by-pret=
tify">const_iterator</span><span style=3D"color: #660;" class=3D"styled-by-=
prettify">(</span><span style=3D"color: #000;" class=3D"styled-by-prettify"=
>str </span><span style=3D"color: #660;" class=3D"styled-by-prettify">+</sp=
an><span style=3D"color: #000;" class=3D"styled-by-prettify"> std</span><sp=
an style=3D"color: #660;" class=3D"styled-by-prettify">::</span><span style=
=3D"color: #000;" class=3D"styled-by-prettify">char_traits</span><span styl=
e=3D"color: #080;" class=3D"styled-by-prettify">&lt;char&gt;</span><span st=
yle=3D"color: #660;" class=3D"styled-by-prettify">::</span><span style=3D"c=
olor: #000;" class=3D"styled-by-prettify">length</span><span style=3D"color=
: #660;" class=3D"styled-by-prettify">(</span><span style=3D"color: #000;" =
class=3D"styled-by-prettify">str</span><span style=3D"color: #660;" class=
=3D"styled-by-prettify">)));</span><span style=3D"color: #000;" class=3D"st=
yled-by-prettify"><br></span></div></code></div><br>A simple, <i>common</i>=
 operation like this shouldn't require anything more than this:<br><br><div=
 class=3D"prettyprint" style=3D"background-color: rgb(250, 250, 250); borde=
r-color: rgb(187, 187, 187); border-style: solid; border-width: 1px; word-w=
rap: break-word;"><code class=3D"prettyprint"><div class=3D"subprettyprint"=
><span style=3D"color: #000;" class=3D"styled-by-prettify">unicode_string u=
str </span><span style=3D"color: #660;" class=3D"styled-by-prettify">=3D</s=
pan><span style=3D"color: #000;" class=3D"styled-by-prettify"> </span><span=
 style=3D"color: #660;" class=3D"styled-by-prettify">...;</span><span style=
=3D"color: #000;" class=3D"styled-by-prettify"><br></span><span style=3D"co=
lor: #008;" class=3D"styled-by-prettify">const</span><span style=3D"color: =
#000;" class=3D"styled-by-prettify"> </span><span style=3D"color: #008;" cl=
ass=3D"styled-by-prettify">char</span><span style=3D"color: #000;" class=3D=
"styled-by-prettify"> </span><span style=3D"color: #660;" class=3D"styled-b=
y-prettify">*</span><span style=3D"color: #000;" class=3D"styled-by-prettif=
y">str </span><span style=3D"color: #660;" class=3D"styled-by-prettify">=3D=
</span><span style=3D"color: #000;" class=3D"styled-by-prettify"> </span><s=
pan style=3D"color: #606;" class=3D"styled-by-prettify">GetUtf8String</span=
><span style=3D"color: #660;" class=3D"styled-by-prettify">();</span><span =
style=3D"color: #000;" class=3D"styled-by-prettify"><br>ustr</span><span st=
yle=3D"color: #660;" class=3D"styled-by-prettify">.</span><span style=3D"co=
lor: #000;" class=3D"styled-by-prettify">insert</span><span style=3D"color:=
 #660;" class=3D"styled-by-prettify">(</span><span style=3D"color: #000;" c=
lass=3D"styled-by-prettify">ustr</span><span style=3D"color: #660;" class=
=3D"styled-by-prettify">.</span><span style=3D"color: #000;" class=3D"style=
d-by-prettify">back</span><span style=3D"color: #660;" class=3D"styled-by-p=
rettify">(),</span><span style=3D"color: #000;" class=3D"styled-by-prettify=
"> str</span><span style=3D"color: #660;" class=3D"styled-by-prettify">,</s=
pan><span style=3D"color: #000;" class=3D"styled-by-prettify"> UTF8</span><=
span style=3D"color: #660;" class=3D"styled-by-prettify">());</span></div><=
/code></div><br>Even inserting another unicode_string requires a lot of eff=
ort, as you have to get the begin and end iterators. Also, if you're insert=
ing a unicode_string of the exact same encoding type, you should <i>never</=
i> have to convert it to char32_t and back. It should be a much faster oper=
ation; a memory allocation and a series of memcpys.<br><br>In short: Needs =
more `insert`.<br><br><b>The lack of a code-unit count.</b><br><br>I know t=
hat "length" is a real problem when dealing with Unicode strings. But there=
 will be times when we need one specific length: the length of the encoded =
range.<br><br>The `null_terminated` function (why not `data` like all other=
 containers that provide this?) provides access to the underlying sequence =
of code units. This is good, as it allows interoperation with APIs that tak=
e those code units. The problem is that null termination is not sufficient.=
<br><br>Let's say I want to pass a UTF-8 string to Lua. Lua strings are nul=
l-terminated in the same way that basic_string's are, but they <i>also</i> =
have a size (just like basic_string). Which means that Lua's string can hav=
e embedded null characters, just like a unicode_string&lt;UTF8&gt;.<br><br>=
The problem is that I can't actually give Lua a string with embedded nulls =
because unicode_string doesn't provide a way to figure out how big the arra=
y from `null_terminated` will be. Without that, you have to guess by findin=
g the null terminator, which can be wrong.<br><br>So there needs to be this=
 function:<br><br><div class=3D"prettyprint" style=3D"background-color: rgb=
(250, 250, 250); border-color: rgb(187, 187, 187); border-style: solid; bor=
der-width: 1px; word-wrap: break-word;"><code class=3D"prettyprint"><div cl=
ass=3D"subprettyprint"><span style=3D"color: #000;" class=3D"styled-by-pret=
tify">size_type codeunit_count</span><span style=3D"color: #660;" class=3D"=
styled-by-prettify">();</span><span style=3D"color: #000;" class=3D"styled-=
by-prettify"><br></span></div></code></div><br>It's named this way to make =
certain that nobody mistakes it for a "length" or anything of the kind. It'=
s clearly getting the number of codeunits. And this should be O(1) in compl=
exity.<br><br>Indeed, we could rename `null_terminated` to match:<br><br><d=
iv class=3D"prettyprint" style=3D"background-color: rgb(250, 250, 250); bor=
der-color: rgb(187, 187, 187); border-style: solid; border-width: 1px; word=
-wrap: break-word;"><code class=3D"prettyprint"><div class=3D"subprettyprin=
t"><span style=3D"color: #008;" class=3D"styled-by-prettify">const</span><s=
pan style=3D"color: #000;" class=3D"styled-by-prettify"> encoding</span><sp=
an style=3D"color: #660;" class=3D"styled-by-prettify">::</span><span style=
=3D"color: #000;" class=3D"styled-by-prettify">codeunit_type</span><span st=
yle=3D"color: #660;" class=3D"styled-by-prettify">*</span><span style=3D"co=
lor: #000;" class=3D"styled-by-prettify"> codeunit_data</span><span style=
=3D"color: #660;" class=3D"styled-by-prettify">();</span></div></code></div=
><br>That way, it looks like a proper pair.<br><br>And since it's topical, =
there should be some way to create a basic_string_ref directly from the int=
erface. That way, we won't have to make the string NULL-terminated. It coul=
d look like this:<br><br><div class=3D"prettyprint" style=3D"background-col=
or: rgb(250, 250, 250); border-color: rgb(187, 187, 187); border-style: sol=
id; border-width: 1px; word-wrap: break-word;"><code class=3D"prettyprint">=
<div class=3D"subprettyprint"><span style=3D"color: #000;" class=3D"styled-=
by-prettify">basic_string_ref</span><span style=3D"color: #660;" class=3D"s=
tyled-by-prettify">&lt;</span><span style=3D"color: #000;" class=3D"styled-=
by-prettify">encoding</span><span style=3D"color: #660;" class=3D"styled-by=
-prettify">::</span><span style=3D"color: #000;" class=3D"styled-by-prettif=
y">codeunit_type</span><span style=3D"color: #660;" class=3D"styled-by-pret=
tify">,</span><span style=3D"color: #000;" class=3D"styled-by-prettify"> en=
coding</span><span style=3D"color: #660;" class=3D"styled-by-prettify">::</=
span><span style=3D"color: #000;" class=3D"styled-by-prettify">codeunit_tra=
its</span><span style=3D"color: #660;" class=3D"styled-by-prettify">&gt;</s=
pan><span style=3D"color: #000;" class=3D"styled-by-prettify"> codeunit_str=
ing_ref</span><span style=3D"color: #660;" class=3D"styled-by-prettify">()<=
/span><span style=3D"color: #000;" class=3D"styled-by-prettify"> </span><sp=
an style=3D"color: #008;" class=3D"styled-by-prettify">const</span><span st=
yle=3D"color: #660;" class=3D"styled-by-prettify">;</span></div></code></di=
v><br><b>Lack of sized constructors.</b><br><br>This goes back to the last =
point. Just as Lua allows the user to provide strings with embedded null ch=
aracters, it <i>provides</i> strings with embedded null characters.<br><br>=
None of the constructors take an array + size; they only use null-terminate=
d arrays. We need more overloads to support arrays of a given size. Appropr=
iate `assign` and `insert` overloads should be added too.<br><br><b>Lack of=
 memory controls.</b><br><br>std::vector and std::basic_string expose a cap=
acity, which gives you some guarantees about memory allocations. unicode_st=
ring, due to variable-sized encodings, can't give quite the same guarantees=
 when you're inserting char32_t ranges.<br><br>But that doesn't mean it can=
't provide something.<br><br>Just as I suggested a codeunit_count, there sh=
ould also be a codeunit_capacity and codeunit_reserve. Again, they are name=
d so that you know they work in terms of codeunits. By reserving an appropr=
iate amount of memory, you can transfer properly-encoded codeunit sequences=
 into the string without allocating more memory.<br><br>This goes hand in h=
and with inserting codeunit ranges. By reserving memory beforehand, you can=
 build unicode_strings while keeping their memory access pattern regular.<b=
r></div>


------=_Part_1_28942169.1353793963291--

.


Author: rick@longbowgames.com
Date: Sat, 24 Nov 2012 15:05:45 -0800 (PST)
------=_Part_1623_18606193.1353798345327
Content-Type: text/plain; charset=ISO-8859-1

A few thoughts on the typedefs:

std::unicode::string and std::unicode::regex seem problematic. You would
get a naming conflict if somebody does this:

using namespace std;
using namespace std::unicode;
string s;

Also, why not give a typedef for the implementation-defined default
encoding?

Finally, it would be nice if there were a shorter name than
std::unicode::string.

My suggestion for all this would be:

namespace std
{
  namespace unicode
  {
    namespace encoding
    {

      typedef unspecified UTF8;
      typedef unspecified UTF16;
      typedef unspecified UTF32;
      typedef unspecified wide;
      typedef unspecified narrow;


      typedef unspecified system; // default encoding; one of the above
    }
  }

  typedef unicode::unicode_string<unicode::encoding::system> ustring;
  typedef unicode::regex<unicode::encoding::system> uregex;
}

--




------=_Part_1623_18606193.1353798345327--

.


Author: =?UTF-8?Q?Micha=C5=82_Dominiak?= <griwes@griwes.info>
Date: Sun, 25 Nov 2012 02:36:45 -0800 (PST)
------=_Part_39_31919110.1353839805102
Content-Type: text/plain; charset=ISO-8859-1

That's the exact reason why `using namespace` is discouraged. And that's
the whole point of namespaces - to have different types with the same name,
but in different namespaces. No-one sane would write both `using namespace
std;` and `using namespace std::unicode;` in the same file. It would probably
be `using namespace std; namespace u = unicode;` or something similar.
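
A minimal sketch of the alias approach described above. `std::unicode` does not exist, so a hypothetical stand-in namespace `demo::unicode` with a placeholder `string` type is assumed here; only the name-lookup behaviour is the point.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for the proposed std::unicode namespace.
namespace demo {
namespace unicode {
struct string {
    std::u32string codepoints;  // placeholder storage, not the proposed layout
};
}  // namespace unicode
}  // namespace demo

// A short alias instead of a second using-directive.
namespace u = demo::unicode;

// Returns true when the two same-named types are distinct and both reachable.
inline bool names_are_distinct() {
    using namespace std;
    // Unqualified `string` still means std::string, because demo::unicode
    // was never pulled in wholesale; u::string names the other type.
    return !is_same<string, u::string>::value;
}
```

Had the function also said `using namespace demo::unicode;`, the unqualified `string` would be ambiguous, which is the conflict rick describes.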

On Sunday, 25 November 2012 00:05:45 UTC+1, ri...@longbowgames.com wrote:
>
> A few thoughts on the typedefs:
>
> std::unicode::string and std::unicode::regex seem problematic. You would
> get a naming conflict if somebody does this:
>
> using namespace std;
> using namespace std::unicode;
> string s;
>
> Also, why not give a typedef for the implementation-defined default
> encoding?
>
> Finally, it would be nice if there's was a shorter name than
> std::unicode::string.
>
> My suggestion for all this would be:
>
> namespace std
> {
>   namespace unicode
>   {
>     namespace encoding
>     {
>
>       typedef unspecified UTF8;
>       typedef unspecified UTF16;
>       typedef unspecified UTF32;
>       typedef unspecified wide;
>       typedef unspecified narrow;
>
>
>       typedef unspecified system; // default encoding; one of the above
>     }
>   }
>
>   typedef unicode::unicode_string<unicode::encoding::system> ustring;
>   typedef unicode::regex<unicode::encoding::system> uregex;
> }
>

--




------=_Part_39_31919110.1353839805102--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 25 Nov 2012 04:14:50 -0800 (PST)
------=_Part_995_5277078.1353845690422
Content-Type: text/plain; charset=ISO-8859-1

Alright, let me knock these off/put them on.

That's what namespaces are *for*. You shouldn't be using them everywhere
like that.

Typedef for impl-defined encoding? Sure.

Assign, I don't really care about, I only added it for compatibility with
stuff like std::vector, and I don't see the need to support more than the
bare minimum (it's a bad function anyway with move semantics). So I'm not
sure about that.

But inserts and constructors from other encodings and from unicode_strings,
I get. However, I'm not feeling compelled by an array + size constructor,
considering that with constructors from other encodings you could just use
the iterator constructor instead, like (arr, arr + getsize(), enc()), which
I don't consider to be a significant detriment over (arr, getsize(), enc()).
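
As an illustration of the two spellings being compared, here is a small sketch that uses `std::string` as a stand-in for unicode_string (whose constructors are not final). The array name `raw` and the helper functions are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Five code units with an embedded null in the middle.
static const char raw[] = {'a', 'b', '\0', 'c', 'd'};
static const std::size_t raw_size = sizeof(raw);

// Null-terminated constructor: stops at the first '\0'.
inline std::string from_c_string() { return std::string(raw); }

// Iterator-pair constructor: the (arr, arr + size) spelling.
inline std::string from_iterators() { return std::string(raw, raw + raw_size); }

// Pointer-plus-size constructor: the (arr, size) spelling.
inline std::string from_sized_array() { return std::string(raw, raw_size); }
```

Both sized spellings keep the embedded null; only the null-terminated form loses the tail of the data.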

> why not `data` like all other containers that provide this?


A long story involving various intermediary versions that no longer applies.

> the length of the encoded range


I agree. I had in mind that it would support embedded NULLs and other such
things, but you're right in that I did not properly consider the interface
for passing this data in and out of the unicode_string without using
iterators for every use.

As for capacity() and reserve(), I dislike them, but I guess they are
necessary.
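
A rough sketch of what a codeunit-level reserve buys, again with `std::string` standing in for the code unit storage. The function name is hypothetical, and the no-reallocation behaviour is what common implementations do rather than something this thread specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends `count` copies of a two-code-unit UTF-8 sequence (U+00E9) to a
// string whose storage was reserved up front. Returns true if the buffer
// was not reallocated while appending.
inline bool append_without_reallocating(std::size_t count) {
    const char e_acute[] = {'\xC3', '\xA9'};
    std::string units;  // stand-in for the code unit storage
    units.reserve(count * sizeof(e_acute));
    const std::size_t capacity_before = units.capacity();
    const char* data_before = units.data();
    for (std::size_t i = 0; i < count; ++i)
        units.append(e_acute, sizeof(e_acute));
    return units.capacity() == capacity_before && units.data() == data_before;
}
```

The reservation is counted in code units, not codepoints, which is why the suggested names carry the codeunit_ prefix.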

--




------=_Part_995_5277078.1353845690422
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Aright, let me knock these off/put them on.<div><br></div><div>That's what =
namespaces are <i>for</i>. You shouldn't be using them everywhere like that=
..</div><div><br></div><div>Typedef for impl-defined encoding? Sure.</div><d=
iv><br></div><div>Assign, I don't really care about, I only added it for co=
mpatibility with stuff like std::vector, and I don't see the need to suppor=
t more than the bare minimum (it's a bad function anyway with move semantic=
s). So I'm not sure about that.</div><div><br></div><div>But inserts and co=
nstructors from other encodings and from unicode_strings, I get. However, I=
'm not feeling compelled by an array + size constructor, considering that w=
ith constructors from other encodings you could just use the iterator const=
ructor instead, like (arr, arr + getsize(), enc()), which I don't consider =
to be a significant detriment over (arr, getsize(), enc()).</div><div><br><=
/div><blockquote class=3D"gmail_quote" style=3D"margin: 0px 0px 0px 0.8ex; =
border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-=
style: solid; padding-left: 1ex;">why not `data` like all other containers =
that provide this?</blockquote><div><br></div><div>A long story involving v=
arious intermediary versions that no longer applies.</div><div><br></div><b=
lockquote class=3D"gmail_quote" style=3D"margin: 0px 0px 0px 0.8ex; border-=
left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: =
solid; padding-left: 1ex;">the length of the encoded range</blockquote><div=
><br></div><div>I agree. I had in mind that it would support embedded NULLs=
 and other such things, but you're right in that I did not properly conside=
r the interface for passing this data in and out of the unicode_string with=
out using iterators for every use.&nbsp;</div><div>&nbsp;</div><div>As for =
capacity() and reserve(), I dislike them, but I guess they are necessary.</=
div>


------=_Part_995_5277078.1353845690422--

.


Author: rick@longbowgames.com
Date: Sun, 25 Nov 2012 05:13:03 -0800 (PST)
------=_Part_107_20864717.1353849183555
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On Sunday, November 25, 2012 5:36:45 AM UTC-5, Michał Dominiak wrote:
>
> That's the exact reason why `using namespace` is discouraged. And that's
> the whole point of namespaces - to have different types with same name, but
> in different namespaces. No-one sane would write both `using namespace
> std;` and `using namespace std::unicode;` in same file. It would probably
> be `using namespace std; namespace u = unicode;` or something similar.
>

As somebody who writes libraries, I agree with you. But as somebody who has
tutored students, and as somebody who writes applications, I wholeheartedly
disagree.

using namespace std is one of the first things students learn, and using
namespace unicode isn't exactly a huge leap of logic after that, especially
if the unicode namespace fills up with more functionality in the future.
And the reason using namespace is discouraged is because it leaks out of
header files; it's perfectly kosher in source files, and will most likely
be perfectly kosher in modules.

The point of namespaces is to avoid collisions *between libraries that
don't know about each other*. We're talking about the standard library
here, and nobody is going to use std::unicode::string without using some
other part of namespace std. In actual fact, the namespace is completely
unnecessary here. It only serves to add a level of organization, but it
does so in a way that's at odds with existing practice in the standard
library, so it's arguable whether the namespace is a good idea at all.

I know this is a small point, but what do you really gain by overloading
the name?

--




------=_Part_107_20864717.1353849183555--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 25 Nov 2012 06:38:05 -0800 (PST)
------=_Part_135_16424249.1353854285577
Content-Type: text/plain; charset=ISO-8859-1

Now, you can certainly forget any intelligent behaviour happening with
UTF-8, ever, unless it's fixed, because there's no way in hell that I can
design and specify intelligent behaviour for a thing I can't know about.

Ultimately, I don't see the point of having ustring instead of namespace u
= std::unicode; u::string. You're just going back to pseudo-namespaces.
It's a string, and I'm calling it a string, and it needs a namespace. It
wouldn't be such a big deal if the Standard was already reasonably
namespaced. I do like the encoding namespace though.

In addition, I noticed that many of the freestanding algorithms had the
same issue as unicode_string, in that they were not very flexible. I have
created a number of additional overloads to simplify their use- for
example, you can now call

std::string s = "Some unicode in UTF-8";
s = std::unicode::to_upper(s);

And I also added a trait which will map from T (char, char16_t, char32_t,
wchar_t) to encoding, to default the encoding for such functions, and the
iterator-based functions, so that ideally you should have to pass an
explicit encoding much less often.
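
One way such a trait could look, as a sketch only. The tag types in namespace `sketch` and the helper `deduce_encoding` are hypothetical stand-ins, not the proposal's actual encoding classes.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical tag types standing in for the proposal's encoding classes.
struct narrow {};
struct wide {};
struct utf16 {};
struct utf32 {};

// Primary template is left undefined, so unknown code unit types fail to compile.
template <typename Char> struct encoding_of_impl;
template <> struct encoding_of_impl<char>     { typedef narrow type; };
template <> struct encoding_of_impl<wchar_t>  { typedef wide type; };
template <> struct encoding_of_impl<char16_t> { typedef utf16 type; };
template <> struct encoding_of_impl<char32_t> { typedef utf32 type; };

// Decay first, so `const char&` (e.g. from dereferencing an iterator)
// maps the same way as plain `char`.
template <typename T>
using encoding_of =
    typename encoding_of_impl<typename std::decay<T>::type>::type;

// A function that defaults its encoding from the code unit type.
template <typename Char, typename Encoding = encoding_of<Char>>
Encoding deduce_encoding(const Char*) { return Encoding(); }

}  // namespace sketch
```

With this shape a caller only names an encoding explicitly when the default is wrong, for instance UTF-8 data held in `char`.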

Also, holy crap, so many template parameters....

--




------=_Part_135_16424249.1353854285577--

.


Author: rick@longbowgames.com
Date: Sun, 25 Nov 2012 07:10:56 -0800 (PST)
------=_Part_147_23127226.1353856256176
Content-Type: text/plain; charset=ISO-8859-1

namespace u = std::unicode is a fairly poor solution. You can't use it in
header files without leaking the 'u' namespace, and you're mandating
additional boilerplate at the top of almost every source file.

From where I'm sitting, you're the one suggesting a change from the
existing naming scheme, and you've never given a convincing reason why,
other than that 'pseudo-namespaces' are supposedly evil.

Anyway, I won't harp on this anymore. It's an issue that the committee can
sort out.

--




------=_Part_147_23127226.1353856256176--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Thu, 6 Dec 2012 07:54:53 -0800 (PST)
------=_Part_248_23213961.1354809293475
Content-Type: multipart/alternative;
 boundary="----=_Part_249_19676025.1354809293475"

------=_Part_249_19676025.1354809293475
Content-Type: text/plain; charset=ISO-8859-1

Here's the next version up. Have at it, ye dastardly critics.


--




------=_Part_249_19676025.1354809293475--
------=_Part_248_23213961.1354809293475
Content-Type: text/html; charset=UTF-8; name=stringprop.html
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=stringprop.html
X-Attachment-Id: ea0afc33-cbd8-493b-95d6-d80a2c754629



<!DOCTYPE html>
<html lang=3D"en">
<head>
    <title>Unicode Support</title> =20
</head>
<body>

    <p>Document working version four. The primary todos are string formatti=
ng and parsing, and locales.</p>
    <br />
    <p>Document Number: Currently not allocated.</p>
    <p>Date: 2012-11-05</p>
    <p>Project: Programming Language C++, Library Working Group</p>
    <p>Reply-to: wolfeinstein@gmail.com</p>
<h1>Strings Proposal</h1>
<h2>Introduction</h2>
<p>The purpose of this document is to propose new interfaces to support Uni=
code text, where the existing interfaces are quite deficient. </p>
<h2>Motivation and Scope</h2>
<p>This proposal is primarily motivated by two problems. The first is the o=
verwhelming number of string types- both primitive, Standard and third-part=
y. This mess of text types makes it impossible to reliably=20
    hold string data. The second is the poor support for Unicode within the=
 C++ Standard library. Unicode is a complex topic, where correctness depend=
s on the implementation of complex algorithms by the user.=20
    This is only exacerbated by the problem of multiple string encodings, a=
nd poor conversion interfaces, which is why C++ is awash with third-party s=
tring types. This problem is made even worse by the existence of=20
    unrelated types that need to hold string data- for example, exceptions.=
 The existing exception hierarchy is of significantly limited usefulness, a=
s it cannot hold Unicode exception data. This proposal aims to=20
    solve both these problems by offering freestanding algorithms and a fre=
sh string class which constitutes significant support for Unicode.
</p>
<p>It is intended to support all programmers from top to bottom, as string =
handling tasks are tasks universal to all programs. It is based on the exis=
ting practice shown in the more recent additions to the Standard=20
    library and Modern C++ design in general- templates instead of inherita=
nce, function objects, and freestanding algorithms and iterators.
</p>
<p>It is not currently in use and a reference implementation is still under=
 construction. However, there are numerous implementations of the various s=
ubcomponents, such as Unicode algorithms and formatting routines.=20
</p>
<h2>Impact on the Standard</h2>
<p>The primary impact on the Standard is the deprecation of existing compon=
ents. There are no additional language or library features required.
</p>
<h2>Design Decisions</h2>
<p>The primary design decision taken here is to give one universal definiti=
on of a string- a range of Unicode codepoints. This decision was taken beca=
use it allows free-standing algorithms,=20
    and an interface that fits well with the rest of the Standard library. =
It also allows the string interface to be significantly simplified compared=
 to the previous iteration. In addition,=20
    the library provides one single string type, best suited for each platf=
orm. This string type is intended to meet the requirements of, for example,=
 the filesystem TS for storing paths.
</p>
<p>Unicode validation failure throwing an exception is well known to be a l=
imited solution in many cases. This part of the API is due for additional c=
onsideration, as this is only a first draft. In addition,=20
    because of the potential for O(n) assignment, it was decided that the o=
nly kind of iterator offered over a string should be immutable, as in many =
cases the operation would boil down to inserting a variable=20
    size range. This could be prohibitively expensive. In addition, the cho=
ice of an rvalue makes it significantly simpler to offer iterators, as they=
 can decode on the fly to codepoints from their choice of=20
    encoding. Aside from this, however, the string was designed to be a fam=
iliar container, offering the minimal set of functions required to manipula=
te the sequence of codepoints.
</p>
<p>Another problem is posed by UTF-8. As u8 literals do not have a distinct=
 type, it's almost impossible to handle them correctly and as cleanly as th=
e other literal types. There are other proposals for introducing=20
    char8_t and fixing UTF-8 literals, and introducing std::u8string, but t=
his proposal does not assume they are accepted. It would, however, be of si=
gnificant benefit.
</p>
<p>Finally, the std namespace is becoming very overloaded. It was decided t=
hat it would be best to split the components into subnamespaces. This not o=
nly aids with the organization of the library as a whole,
    but also provides a clear difference between old and new components.
</p>
<h2>Technical Specification</h2>
<p>Currently, to avoid ambiguity, the specification is given as a series of=
 declarations in C++11.</p>
<p>Where a type is taken by either rvalue reference or const reference, it'=
s legal for implementations to provide only one overload that takes that ty=
pe by value.</p>
<p>For iterators, usually only the iterator category and return value of op=
erator* are specified, as the full specification of an iterator involves a =
lot of plumbing. If requested, these=20
specifications can be expanded to the full definition.</p>
<p>In header &lt;unicode&gt;</p>
<pre>namespace std {
    namespace unicode {
        enum class normal_form {
            NFC,
            NFD,
            NFKC,
            NFKD
        };</pre>
<p>The encoded_string class is templated based on an encoding parameter. Th=
is is a traits-style class implemented for each encoding. The required memb=
ers are:</p>

<pre>    typedef unspecified codeunit;</pre>

<p>The codeunit typedef is for the individual unit of storage for this spec=
ific encoding. This would be char16_t for UTF-16, char for narrow encoding,=
 etc. </p>

<pre>    template&lt;typename codeunit_iterator&gt; using codepoint_iterato=
r =3D unspecified;
    template&lt;typename codeunit_iterator&gt; using validating_codepoint_i=
terator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original codeunit range as Un=
icode codepoints. The validating version will throw if the codeunits are no=
t valid or do not result in valid Unicode. The adaptors have the=20
    same iterator category as the input type, except if that category is ra=
ndom access, in which case they only need be bidirectional.</p>

<pre>    template&lt;typename codepoint_iterator&gt; using codeunit_iterato=
r =3D unspecified;
    template&lt;typename codepoint_iterator&gt; using validating_codeunit_i=
terator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original range of Unicode cod=
epoints as code units. The validating version will throw if the codepoints =
are not valid or cannot be expressed in the destination encoding.</p>

<pre>    template&lt;typename foreign_encoding, typename foreign_codeunit_i=
terator&gt; using conversion_iterator =3D unspecified;
    template&lt;typename foreign_encoding, typename foreign_codeunit_iterat=
or&gt; using validating_conversion_iterator =3D unspecified;</pre>
<p>Views a range of codeunits in the foreign encoding as a range of code un=
its in this encoding. A reasonable implementation for any foreign encoding =
is to simply convert to Unicode codepoints and then back to=20
    the current encoding. The validating iterator shall ensure that all for=
eign data is suitable for representation in this encoding.</p>
<p>An implementation shall provide at least the following encodings:
</p>
<pre>        namespace encoding {   =20
            typedef unspecified utf8;
            typedef unspecified utf16;
            typedef unspecified utf32;
            typedef unspecified wide;
            typedef unspecified narrow;
            typedef unspecified system;
        }
</pre>
<p>The narrow encoding is the encoding used for narrow string literals, suc=
h as "hello". The wide encoding is the encoding used for wide string litera=
ls such as L"hello". An implementation=20
    has no obligation to make these separate types if one of the wide or na=
rrow encodings, or both, is already a Unicode encoding. The system encoding=
 is an implementation-defined default=20
    which shall be the encoding best used for interoperation with platform =
APIs, especially operating system APIs, such as UTF16 on Windows and UTF8 o=
n Unix.
</p>
<pre>        template&lt;typename Char&gt; using encoding_of =3D implementa=
tion-defined;
</pre>
<p>The encoding_of template returns the assumed encoding of a basic_string =
whose value_type is the decayed Char. This shall be narrow for char, wide f=
or wchar_t, utf16 for char16_t, and utf32 for char32_t.</p>

<p>The string class is a container of Unicode codepoints. The treatment of =
the freestanding algorithms as a range of Unicode codepoints means that any=
 container of Unicode codepoints may be=20
    used, but this class is provided as the minimal useful container. It ma=
y contain embedded null characters.
</p>
<pre>        template&lt;typename encoding, typename allocator =3D std::all=
ocator&lt;typename encoding::codeunit&gt;&gt; class encoded_string {
        public:
            encoded_string();
            template&lt;typename other_encoding, typename other_alloc&gt;=
=20
            encoded_string(const encoded_string&lt;other_encoding, other_al=
loc&gt;&);
            encoded_string(encoded_string&&);
           =20
            encoded_string(const char*);</pre>
           =20
<p>When the encoded_string interface deals with a const char* or std::strin=
g, it will assume narrow encoding, not UTF-8. A constructor which can take =
an encoding is available for UTF-8 const char*. When the=20
    encoded_string class takes input from an external source, it will valid=
ate that it is well-formed Unicode. If not, an exception shall be thrown.</=
p>

<pre>            encoded_string(const char*, encoding);

            encoded_string(const wchar_t*);
            encoded_string(const char16_t*);
            encoded_string(const char32_t*);
            template&lt;typename T, typename Traits, typename Allocator, ty=
pename Encoding =3D encoding_of&lt;T&gt;&gt;=20
            encoded_string(const std::basic_string_ref&lt;T, Traits, Alloca=
tor&gt;&, Encoding e =3D Encoding());
            template&lt;typename Iterator, typename Encoding =3D encoding_o=
f&lt;decltype(*std::declval&lt;Iterator&gt;())&gt;&gt;=20
            encoded_string(Iterator, Iterator, Encoding e =3D Encoding());<=
/pre>

<p>The Iterator type, a pair of which the string may be constructed from, s=
hall be at least an input iterator over codeunits of the given encoding. I=
f the decayed type=20
    of dereferencing an iterator is not Encoding::codeunit,=
 then compilation shall fail.
</p>

<pre>            template&lt;typename Iterator, typename Encoding =3D encod=
ing_of&lt;decltype(*std::declval&lt;Iterator&gt;())&gt;&gt;=20
            void assign(Iterator, Iterator, Encoding e =3D Encoding()) &;</=
pre>

<p>The requirements on the Iterator type here are the same as those on the =
constructor.</p>

<pre>            void assign(const encoded_string&) &;
            void assign(encoded_string&&) &;

            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string operator+(const encoded_string&lt;other_encoding, other_alloc&gt;&) const;
            encoded_string operator+(encoded_string&&) const;
            encoded_string operator+(const char*) const;
            encoded_string operator+(const wchar_t*) const;
            encoded_string operator+(const char16_t*) const;
            encoded_string operator+(const char32_t*) const;
            template&lt;typename T, typename Traits, typename Allocator&gt;
            encoded_string operator+(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&) const;
           =20
            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string& operator+=3D(const encoded_string&lt;other_enco=
ding, other_alloc&gt;&) &;
            encoded_string& operator+=3D(encoded_string&&) &;
            encoded_string& operator+=3D(const char*) &;
            encoded_string& operator+=3D(const wchar_t*) &;
            encoded_string& operator+=3D(const char16_t*) &;
            encoded_string& operator+=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;
            encoded_string& operator+=(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&) &;

            encoded_string& operator=3D(const encoded_string&) &;
            encoded_string& operator=3D(encoded_string&&) &;
            encoded_string& operator=3D(const char*) &;
            encoded_string& operator=3D(const wchar_t*) &;
            encoded_string& operator=3D(const char16_t*) &;
            encoded_string& operator=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;
            encoded_string& operator=(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&) &;

            iterator begin() &;
            const_iterator begin() const &;
            const_iterator cbegin() const &;
            iterator end() &;
            const_iterator end() const &;
            const_iterator cend() const &;</pre>

<p>The iterator and const_iterator types are bidirectional iterators of Unicode codepoints. operator* returns a char32_t rvalue, which is the codepoint at that position. The invalidation semantics of iterators shall be those of std::vector.</p>

<pre>            void clear() &;
            bool empty() const;
           =20
            iterator erase(const_iterator where) &;
            iterator erase(const_iterator first, const_iterator last) &;

            void swap(encoded_string&);

            char32_t front() const;
            char32_t back() const;
           =20
            iterator insert(const_iterator where, char32_t codepoint);
            template&lt;typename InputIterator, typename Encoding = encoding_of&lt;decltype(*std::declval&lt;InputIterator&gt;())&gt;&gt;
            iterator insert(const_iterator where, InputIterator begin, InputIterator end, Encoding e = Encoding());
            template&lt;typename other_encoding, typename other_alloc&gt;
            iterator insert(const_iterator where, const encoded_string&lt;other_encoding, other_alloc&gt;&);
            template&lt;typename T, typename Traits, typename Alloc, typename Encoding = encoding_of&lt;T&gt;&gt;
            iterator insert(const_iterator where, const basic_string&lt;T, Traits, Alloc&gt;&, Encoding e = Encoding());

            void pop_back();
            void push_back(char32_t);

            void normalize(normal_form);</pre>

<p>Performs an in-place normalization of the string's contents to the reque=
sted form.</p>

<pre>            const typename encoding::codeunit* codeunit_data() const;
            std::size_t codeunit_size() const;</pre>

<p>codeunit_data returns the contents of the encoded_string as a null-terminated buffer. This pointer shall remain valid for as long as the encoded_string is neither mutated nor destroyed. The codeunit_size function shall return the size of this buffer in code units, excluding the null terminator.</p>
<pre>            void codeunit_reserve(std::size_t size);
            std::size_t codeunit_capacity() const;
</pre>

<pre>        };

        using string = encoded_string&lt;encoding::system, implementation-defined default&gt;;

        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;=20
        bool operator<(const encoded_string&lt;lhs_encoding, lhs_allocator&=
gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator&lt;=(const encoded_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator=3D=3D(const encoded_string&lt;lhs_encoding, lhs_alloc=
ator&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs=
);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator>(const encoded_string&lt;lhs_encoding, lhs_allocator&=
gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator>=3D(const encoded_string&lt;lhs_encoding, lhs_allocat=
or&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator!=3D(const encoded_string&lt;lhs_encoding, lhs_allocat=
or&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);=
</pre>

<p>These comparison operators behave as if the data in lhs and rhs were passed to the corresponding freestanding algorithm.</p>

<pre>        template&lt;typename Iterator, typename OutIt> void convert(It=
erator begin, Iterator end, OutIt out, encoding src, encoding dst);</pre>

<p>Converts the input range, a range of code units in the src encoding, into the dst encoding. The converted code units are written to the output iterator.</p>
       =20
<pre>        template&lt;typename Iterator> std::pair&lt;grapheme_iterator&=
lt;Iterator>, grapheme_iterator&lt;Iterator>>=20
        graphemes(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;word_iterator&lt;Iterat=
or>, word_iterator&lt;Iterator>>
        words(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;line_iterator&lt;Iterat=
or>, line_iterator&lt;Iterator>>
        lines(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;sentence_iterator&lt;It=
erator>, sentence_iterator&lt;Iterator>>
        sentences(Iterator begin, Iterator end);</pre>

<p>All four iterator types (grapheme_iterator, word_iterator, line_iterator, and sentence_iterator) implement the respective Unicode Standard boundary analysis algorithms. The line algorithm is defined in UAX #14 (http://www.unicode.org/reports/tr14/) and the other three in UAX #29 (http://www.unicode.org/reports/tr29/). The input iterators must be at least bidirectional iterators of Unicode codepoints. Dereferencing a boundary iterator with operator*() returns a pair of the base iterator type, where the first value marks the beginning of the region and the second marks its end. Likewise, the first element of the pair returned by each of the four functions is the beginning of the range of boundaries and the second is the end. Each iterator is assumed to be in encoding_of&lt;decltype(*std::declval&lt;Iterator&gt;())&gt;.</p>

<pre>        template&lt;typename First, typename Second&gt; bool less(First begin1, First end1, Second begin2, Second end2, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc, typename rhs_alloc&gt;
        bool less(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale = std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc, typename rhsChar, typename rhsTraits, typename rhsAlloc,
        typename lhsEncoding = encoding_of&lt;lhsChar&gt;, typename rhsEncoding = encoding_of&lt;rhsChar&gt;&gt;
        bool less(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;& rhs,
        lhsEncoding le = lhsEncoding(), rhsEncoding re = rhsEncoding(), std::locale = std::locale());

        template&lt;typename First, typename Second&gt; bool less_or_equal(First begin1, First end1, Second begin2, Second end2, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc, typename rhs_alloc&gt;
        bool less_or_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale = std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc, typename rhsChar, typename rhsTraits, typename rhsAlloc,
        typename lhsEncoding = encoding_of&lt;lhsChar&gt;, typename rhsEncoding = encoding_of&lt;rhsChar&gt;&gt;
        bool less_or_equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;& rhs,
        lhsEncoding le = lhsEncoding(), rhsEncoding re = rhsEncoding(), std::locale = std::locale());

        template&lt;typename First, typename Second&gt; bool greater(First begin1, First end1, Second begin2, Second end2, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool greater(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const=
 encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale =3D std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool greater(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;=
& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second&gt; bool greater_or_equal(First begin1, First end1, Second begin2, Second end2, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool greater_or_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt=
;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale =3D std::l=
ocale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool greater_or_equal(const basic_string&lt;lhsChar, lhsTraits, lhs=
Alloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second&gt; bool equal(First begin1, First end1, Second begin2, Second end2);

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const e=
ncoded_string&lt;rhs_enc, rhs_alloc&gt;&);

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& =
lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding())=
;

        template&lt;typename First, typename Second&gt; bool not_equal(First begin1, First end1, Second begin2, Second end2);

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool not_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, con=
st encoded_string&lt;rhs_enc, rhs_alloc&gt;&);

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool not_equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&g=
t;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding())=
;</pre>

<p>These six algorithms implement Unicode comparison functionality. Two strings are equal if they are equivalent after normalization to either NFC or NFD. Collation requires a locale; overloads that do not take one as a parameter shall use the global locale. All iterator ranges shall be ranges of forward iterators over Unicode codepoints.
</p>

<pre>        template&lt;typename Iterator, typename Out> void normalize(It=
erator begin, Iterator end, Out out, normal_form);</pre>

<p>Implements normalization of the forward range over Unicode codepoints, with the output provided to the output iterator. The normal_form argument indicates which normal form is requested.</p>

<pre>        template&lt;typename encoding, typename allocator&gt; std::ist=
ream&=20
        operator>>(std::istream&, encoded_string&lt;encoding, allocator&gt;=
&);
        template&lt;typename encoding, typename allocator&gt; std::wistream=
&=20
        operator>>(std::wistream&, encoded_string&lt;encoding, allocator&gt=
;&);</pre>

<p>Reads until the next whitespace, like operator>>(std::istream&, std::string&). The data in the stream shall be converted to Unicode, so the whitespace that ends the read includes Unicode whitespace.</p>

<pre>         template&lt;typename encoding, typename allocator&gt; std::os=
tream&=20
         operator&lt;&lt;(std::ostream&, const encoded_string&lt;encoding, =
allocator&gt;&);
         template&lt;typename encoding, typename allocator&gt; std::wostrea=
m&=20
         operator&lt;&lt;(std::wostream&, const encoded_string&lt;encoding,=
 allocator&gt;&);</pre>

<p>Writes the contents of the string to the stream. Shall perform an encoding conversion to narrow encoding or wide encoding when necessary.</p>

<pre>        struct dec {
            dec();
            dec(const dec&);
            dec(dec&&);
            dec(std::locale l);
            std::locale get_locale();
         };
         struct hex {};
         struct bin {};
         struct oct {};
</pre>
<p>For all primitive integer types I:</p>
<pre>        template&lt;typename base =3D dec&gt; string to_string(I, base=
 =3D base{});</pre>
<p>This function shall format the integer of type I as a string in the given base. If I is <code>bool</code>, then bin shall produce 0 or 1, and any other base shall produce <code>true</code> or <code>false</code>.</p>
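<p>To make the tag-based interface above concrete, the following is a minimal sketch of how such a to_string could be written. It is placed in a hypothetical namespace sketch, returns std::string rather than the proposed string type, and ignores the locale carried by dec; all of these are simplifying assumptions, as is the helper radix_of.</p>

```cpp
#include <algorithm>
#include <string>
#include <type_traits>

namespace sketch {  // hypothetical namespace, to avoid clashing with std::dec/std::hex

// Base tags, mirroring the declarations above (the locale in dec is omitted).
struct dec {};
struct hex {};
struct bin {};
struct oct {};

// Hypothetical helper mapping a base tag to its radix.
template <typename Base> struct radix_of;
template <> struct radix_of<dec> { static constexpr unsigned value = 10; };
template <> struct radix_of<hex> { static constexpr unsigned value = 16; };
template <> struct radix_of<bin> { static constexpr unsigned value = 2; };
template <> struct radix_of<oct> { static constexpr unsigned value = 8; };

// Integer overload: enabled for every integral type except bool.
template <typename I, typename Base = dec>
typename std::enable_if<std::is_integral<I>::value && !std::is_same<I, bool>::value,
                        std::string>::type
to_string(I value, Base = Base{}) {
    using U = typename std::make_unsigned<I>::type;
    const unsigned radix = radix_of<Base>::value;
    const bool negative = value < I(0);
    // Work on the unsigned magnitude so the most negative value is handled.
    U magnitude = negative ? static_cast<U>(U(0) - static_cast<U>(value))
                           : static_cast<U>(value);
    std::string digits;
    do {
        digits.push_back("0123456789abcdef"[magnitude % radix]);
        magnitude = static_cast<U>(magnitude / radix);
    } while (magnitude != 0);
    if (negative) digits.push_back('-');
    std::reverse(digits.begin(), digits.end());
    return digits;
}

// bool overload: bin yields "0"/"1", any other base yields "true"/"false".
template <typename Base = dec>
std::string to_string(bool value, Base = Base{}) {
    if (std::is_same<Base, bin>::value) {
        return value ? "1" : "0";
    }
    return value ? "true" : "false";
}

}  // namespace sketch
```

<p>For example, sketch::to_string(255, sketch::hex{}) produces "ff" in this sketch. The proposal does not say whether hexadecimal digits are upper or lower case, so lower case is an arbitrary choice here.</p>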
<pre>        enum class codepoint_category {
            letter_uppercase,
            letter_lowercase,
            letter_titlecase,
            letter_modifier,
            letter_other,
            mark_non_spacing,
            mark_spacing_combining,
            mark_enclosing,
            number_decimal_digit,
            number_letter,
            number_other,
            punctuation_connector,
            punctuation_dash,
            punctuation_open,
            punctuation_close,
            punctuation_initial,
            punctuation_final,
            punctuation_other,
            symbol_math,
            symbol_currency,
            symbol_modifier,
            symbol_other,
            separator_space,
            separator_line,
            separator_paragraph,
            other_control,
            other_format,
            other_surrogate,
            other_private_use,
            other_not_assigned,
        };
        enum class bidi_category {
            AL, AN,
            B, BN,
            CS,
            EN, ES, ET,
            L, LRE, LRO,
            NSM,
            ON,
            PDF,
            R, RLE, RLO,
            S,
            WS,
        };
        enum class category_joining_class {
            U, C, T, D, L, R,
        };
        enum class category_joining_group {
            Ain, Alaph, Alef, Alef_Maqsurah,
            Beh, Beth, Burushaski_Yeh_Barree,
            Dal, Dalath_Rish, E,
            Farsi_Yeh, Fe, Feh, Final_Semkath,
            Gaf, Gamal,
            Hah, Hamza_On_Heh_Goal, He,
            Heh, Heh_Goal, Heth,
            Kaf, Kaph, Khaph, Knotted_Heh,
            Lam, Lamadh, Meem, Mim,
            No_Joining_Group, Noon, Nun, Nya,
            Pe, Qaf, Qaph, Reh, Reversed_Pe,
            Rohingya_Yeh,
            Sad, Sadhe, Seen, Semkath, Shin,
            Swash_Kaf, Syriac_Waw, Tah, Taw,
            Teh_Marbuta, Teh_Marbuta_Goal, Teth, Waw, Yeh,
            Yeh_Barree, Yeh_With_Tail, Yudh,
            Yudh_He, Zain, Zhain,
        };
        enum class script_type {
            Arab, Armi, Armn, Avst,
            Bali, Bamu, Batk, Beng, Bopo, Brah, Brai, Bugi, Buhd,
            Cakm, Cans, Cari, Cham, Cher, Copt, Cprt,
            Cyrl,
            Deva, Dsrt,
            Egyp, Ethi,
            Geor, Glag, Goth, Grek, Gujr, Guru,
            Hang, Hani, Hano, Hebr, Hira, Hrkt,
            Ital,
            Java,
            Kali, Kana, Khar, Khmr, Knda, Kthi,
            Lana, Laoo, Latn, Lepc, Limb, Linb, Lisu, Lyci,
            Lydi,
            Mand, Merc, Mero, Mlym, Mong, Mtei, Mymr,
            Nkoo,
            Ogam, Olck, Orkh, Orya, Osma,
            Phag, Phli, Phnx, Plrd, Prti,
            Qaai,
            Rjng, Runr,
            Samr, Sarb, Saur, Shaw, Shrd,  Sinh, Sora, Sund, Sylo, Syrc,
            Tagb, Takr, Tale, Talu, Taml, Tavt, Telu, Tfng,
            Tglg, Thaa, Thai, Tibt,
            Ugar,
            Vaii,
            Xpeo, Xsux,
            Yiii,
            Zinh, Zyyy, Zzzz,
        };
        enum class block_name {
            Aegean_Numbers, Alchemical, Alphabetic_PF, Ancient_Greek_Music,=
 Ancient_Greek_Numbers,
            Ancient_Symbols, Arabic, Arabic_Ext_A, Arabic_Math, Arabic_PF_A=
, Arabic_PF_B, Arabic_Sup,
            Armenian, Arrows, ASCII, Avestan, Balinese, Bamum, Bamum_Sup, B=
atak, Bengali, Block_Elements,
            Bopomofo, Bopomofo_Ext, Box_Drawing, Brahmi, Braille, Buginese,=
 Buhid, Byzantine_Music,
            Carian, Chakma, Cham, Cherokee, CJK, CJK_Compat, CJK_Compat_For=
ms, CJK_Compat_Ideographs,
            CJK_Compat_Ideographs_Sup, CJK_Ext_A, CJK_Ext_B, CJK_Ext_C, CJK=
_Ext_D, CJK_Radicals_Sup,
            CJK_Strokes, CJK_Symbols, Compat_Jamo, Control_Pictures, Coptic=
, Counting_Rod, Cuneiform,
            Cuneiform_Numbers, Currency_Symbols, Cypriot_Syllabary, Cyrilli=
c, Cyrillic_Ext_A, Cyrillic_Ext_B,
            Cyrillic_Sup, Deseret, Devanagari, Devanagari_Ext, Diacriticals=
, Diacriticals_For_Symbols,
            Diacriticals_Sup, Dingbats, Domino, Egyptian_Hieroglyphs, Emoti=
cons, Enclosed_Alphanum,
            Enclosed_Alphanum_Sup, Enclosed_CJK, Enclosed_Ideographic_Sup, =
Ethiopic, Ethiopic_Ext,
            Ethiopic_Ext_A, Ethiopic_Sup, Geometric_Shapes, Georgian, Georg=
ian_Sup, Glagolitic, Gothic, Greek,
            Greek_Ext, Gujarati, Gurmukhi, Half_And_Full_Forms, Half_Marks,=
 Hangul, Hanunoo, Hebrew,
            High_PU_Surrogates, High_Surrogates, Hiragana, IDC, Imperial_Ar=
amaic, Indic_Number_Forms,
            Inscriptional_Pahlavi, Inscriptional_Parthian, IPA_Ext, Jamo, J=
amo_Ext_A, Jamo_Ext_B, Javanese,
            Kaithi, Kana_Sup, Kanbun, Kangxi, Kannada, Katakana, Katakana_E=
xt, Kayah_Li, Kharoshthi, Khmer,
            Khmer_Symbols, Lao, Latin_1_Sup, Latin_Ext_A, Latin_Ext_Additio=
nal, Latin_Ext_B, Latin_Ext_C,
            Latin_Ext_D, Lepcha, Letterlike_Symbols, Limbu, Linear_B_Ideogr=
ams, Linear_B_Syllabary, Lisu,
            Low_Surrogates, Lycian, Lydian, Mahjong, Malayalam, Mandaic, Ma=
th_Alphanum, Math_Operators,
            Meetei_Mayek, Meetei_Mayek_Ext, Meroitic_Cursive, Meroitic_Hier=
oglyphs, Miao, Misc_Arrows,
            Misc_Math_Symbols_A, Misc_Math_Symbols_B, Misc_Pictographs, Mis=
c_Symbols, Misc_Technical,
            Modifier_Letters, Modifier_Tone_Letters, Mongolian, Music, Myan=
mar, Myanmar_Ext_A, NB,
            New_Tai_Lue, NKo, Number_Forms, OCR, Ogham, Ol_Chiki, Old_Itali=
c, Old_Persian, Old_South_Arabian,
            Old_Turkic, Oriya, Osmanya, Phags_Pa, Phaistos, Phoenician, Pho=
netic_Ext, Phonetic_Ext_Sup,
            Playing_Cards, PUA, Punctuation, Rejang, Rumi, Runic, Samaritan=
, Saurashtra, Sharada, Shavian,
            Sinhala, Small_Forms, Sora_Sompeng, Specials, Sundanese, Sundan=
ese_Sup, Sup_Arrows_A, Sup_Arrows_B,
            Sup_Math_Operators, Sup_PUA_A, Sup_PUA_B, Sup_Punctuation, Supe=
r_And_Sub, Syloti_Nagri, Syriac,
            Tagalog, Tagbanwa, Tags, Tai_Le, Tai_Tham, Tai_Viet, Tai_Xuan_J=
ing, Takri, Tamil, Telugu, Thaana,
            Thai, Tibetan, Tifinagh, Transport_And_Map, UCAS, UCAS_Ext, Uga=
ritic, Vai, Vedic_Ext,
            Vertical_Forms, VS, VS_Sup, Yi_Radicals, Yi_Syllables, Yijing,
        };
        enum class version {
            v1_1,
            v2_0, v2_1,
            v3_0, v3_1, v3_2,
            v4_0, v4_1,
            v5_0, v5_1, v5_2,
            v6_0, v6_1, v6_2,
            unassigned =3D 0xFF,
        };
        struct codepoint_properties {
            codepoint_category category;
            block_name block;
            version age;
            bidi_category bidi_type;
            category_joining_class joining_class;
            category_joining_group joining_group;
            script_type script;
            bool control;
            bool digit;
            bool letter;
            bool lower;
            bool number;
            bool punctuation;
            bool separator;
            bool symbol;
            bool upper;
            bool whitespace;
        };
        codepoint_properties properties(char32_t);</pre>
<p>Returns the properties of any given codepoint. These properties are defined by the Unicode Standard, not here.</p>
<pre>        template&lt;typename Iterator, typename Out, typename Encoding=
 =3D utf32&gt; void to_upper(Iterator begin, Iterator end, Out out, Encodin=
g e =3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_upper(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_upper(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
utf32&gt; void to_lower(Iterator begin, Iterator end, Out out, Encoding e =
=3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_lower(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_lower(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
utf32&gt; void to_title(Iterator begin, Iterator end, Out out, Encoding e =
=3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_title(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_title(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());
</pre>
<p>Performs a case conversion on the given sequence of Unicode codepoints. The output iterator shall receive its data in the same encoding as the input iterator provides.
</p>
<pre>        using regex = std::basic_regex&lt;char32_t, implementation-defined&gt;;</pre>
<p>A regular expression type suitable for matching Unicode codepoints. The traits must support <a href="http://www.unicode.org/reports/tr18/">UTS-18</a> to at least Level 2.</p>
<pre>    }
}</pre>
<h2>Acknowledgements</h2>
<p>R. Martinho Fernandes gave significant assistance with some of the ins and outs of Unicode.</p>
   =20
</body>
</html>
------=_Part_248_23213961.1354809293475--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 8 Dec 2012 02:29:35 -0800 (PST)
Raw View
------=_Part_1816_26406021.1354962575104
Content-Type: multipart/alternative;
 boundary="----=_Part_1817_15080147.1354962575105"

------=_Part_1817_15080147.1354962575105
Content-Type: text/plain; charset=ISO-8859-1

Here is the next version.

--




------=_Part_1817_15080147.1354962575105--
------=_Part_1816_26406021.1354962575104
Content-Type: text/html; charset=UTF-8; name=stringprop.html
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=stringprop.html
X-Attachment-Id: 44c09331-57d6-48a2-8ec5-57938eae5077



<!DOCTYPE html>
<html lang=3D"en">
<head>
    <title>Unicode Support</title> =20
</head>
<body>

    <p>Document working version four. The primary todos are string formatting and parsing, and locales.</p>
    <br />
    <p>Document Number: Currently not allocated.</p>
    <p>Date: 2012-11-05</p>
    <p>Project: Programming Language C++, Library Working Group</p>
    <p>Reply-to: wolfeinstein@gmail.com</p>
<h1>Strings Proposal</h1>
<h2>Introduction</h2>
<p>The purpose of this document is to propose new interfaces to support Unicode text, where the existing interfaces are quite deficient.</p>
<h2>Motivation and Scope</h2>
<p>This proposal is primarily motivated by two problems. The first is the overwhelming number of string types: primitive, Standard, and third-party. This mess of text types makes it impossible to hold string data reliably. The second is the poor support for Unicode within the C++ Standard library. Unicode is a complex topic, and correctness currently depends on users implementing complex algorithms themselves. This is exacerbated by the multiple string encodings and poor conversion interfaces, which is why C++ is awash with third-party string types. The problem is made even worse by unrelated types that need to hold string data, such as exceptions. The existing exception hierarchy is of significantly limited usefulness, as it cannot hold Unicode exception data. This proposal aims to solve both problems by offering freestanding algorithms and a fresh string class that together provide significant support for Unicode.
</p>
<p>It is intended to support all programmers from top to bottom, as string handling tasks are universal to all programs. It is based on the existing practice shown in the more recent additions to the Standard library and in modern C++ design in general: templates instead of inheritance, function objects, and freestanding algorithms and iterators.
</p>
<p>It is not currently in use and a reference implementation is still under construction. However, there are numerous implementations of the various subcomponents, such as Unicode algorithms and formatting routines.
</p>
<h2>Impact on the Standard</h2>
<p>The primary impact on the Standard is the deprecation of existing components. There are no additional language or library features required.
</p>
<h2>Design Decisions</h2>
<p>The primary design decision taken here is to give one universal definition of a string: a range of Unicode codepoints. This decision was taken because it allows freestanding algorithms and an interface that fits well with the rest of the Standard library. It also allows the string interface to be significantly simplified compared to the previous iteration. In addition, the library provides one single string type, best suited for each platform. This string type is intended to meet the requirements of, for example, the filesystem TS for storing paths.
</p>
<p>Throwing an exception on Unicode validation failure is well known to be a limited solution in many cases. This part of the API is due for additional consideration, as this is only a first draft. In addition, because assignment through an iterator could be O(n), it was decided that the only kind of iterator offered over a string should be immutable: in many cases the operation would boil down to inserting a variable-size range, which could be prohibitively expensive. Returning an rvalue from operator* also makes it significantly simpler to offer iterators, as they can decode on the fly to codepoints from their choice of encoding. Aside from this, however, the string was designed to be a familiar container, offering the minimal set of functions required to manipulate the sequence of codepoints.
</p>
<p>Another problem is posed by UTF-8. As u8 literals do not have a distinct type, it's almost impossible to handle them correctly and as cleanly as the other literal types. There are other proposals for introducing char8_t and fixing UTF-8 literals, and for introducing std::u8string, but this proposal does not assume they are accepted. Their acceptance would, however, be of significant benefit.
</p>
<p>Finally, the std namespace is becoming very overloaded. It was decided that it would be best to split the components into subnamespaces. This not only aids with the organization of the library as a whole, but also provides a clear difference between old and new components.
</p>
<h2>Technical Specification</h2>
<p>Currently, to avoid ambiguity, the specification is given as a series of declarations in C++11.</p>
<p>Where a type is taken by either rvalue reference or const reference, it's legal for implementations to provide only one overload that takes that type by value.</p>
<p>For iterators, usually only the iterator category and return value of operator* are specified, as the full specification of an iterator involves a lot of plumbing. If requested, these specifications can be expanded to the full definition.</p>
<p>In header &lt;unicode&gt;</p>
<pre>namespace std {
    namespace unicode {
        enum class normal_form {
            NFC,
            NFD,
            NFKC,
            NFKD
        };</pre>
<p>The encoded_string class is templated on an encoding parameter. This is a traits-style class implemented for each encoding. The required members are:</p>

<pre>    typedef unspecified codeunit;</pre>

<p>The codeunit typedef is for the individual unit of storage for this specific encoding. This would be char16_t for UTF-16, char for narrow encoding, etc.</p>

<pre>    template&lt;typename codeunit_iterator&gt; using codepoint_iterato=
r =3D unspecified;
    template&lt;typename codeunit_iterator&gt; using validating_codepoint_i=
terator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original codeunit range as Unicode codepoints. The validating version will throw if the codeunits are not valid or do not result in valid Unicode. The adaptors have the
    same iterator category as the input type, except if that category is random access, in which case they need only be bidirectional.</p>
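<p>As a rough illustration of the work a validating adaptor has to do on each step, the sketch below decodes one codepoint from a range of UTF-8 codeunits and throws on malformed input. The names decode_next_utf8 and decode_utf8 are hypothetical and not part of this proposal; a real adaptor would wrap this logic behind an iterator interface.</p>

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the proposal): decodes one codepoint from a
// UTF-8 codeunit range and advances `it` past it. Throws on malformed input,
// which is the behaviour described for the validating adaptor.
template <typename CodeunitIterator>
char32_t decode_next_utf8(CodeunitIterator& it, CodeunitIterator end) {
    if (it == end) throw std::out_of_range("empty codeunit range");
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single codeunit (ASCII)

    int extra = 0;        // continuation codeunits expected
    char32_t cp = 0;      // codepoint bits accumulated so far
    char32_t lowest = 0;  // smallest value legal for this length (overlong check)
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; lowest = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; lowest = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; lowest = 0x10000; }
    else throw std::invalid_argument("invalid UTF-8 lead codeunit");

    for (int i = 0; i < extra; ++i) {
        if (it == end) throw std::invalid_argument("truncated UTF-8 sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation codeunit");
        cp = (cp << 6) | (c & 0x3F);
        ++it;
    }
    // Reject overlong forms, values beyond U+10FFFF, and surrogates.
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a valid Unicode scalar value");
    return cp;
}

// Convenience wrapper: decodes a whole narrow string assumed to hold UTF-8.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::string::const_iterator it = s.begin();
    while (it != s.end()) out.push_back(decode_next_utf8(it, s.end()));
    return out;
}
```

<p>A non-validating adaptor could skip the final range checks and the continuation checks, trading safety for speed on input that is already known to be well-formed.</p>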

<pre>    template&lt;typename codepoint_iterator&gt; using codeunit_iterato=
r =3D unspecified;
    template&lt;typename codepoint_iterator&gt; using validating_codeunit_i=
terator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original range of Unicode codepoints as code units. The validating version will throw if the codepoints are not valid or cannot be expressed in the destination.</p>

<pre>    template&lt;typename foreign_encoding, typename foreign_codeunit_i=
terator&gt; using conversion_iterator =3D unspecified;
    template&lt;typename foreign_encoding, typename foreign_codeunit_iterat=
or&gt; using validating_conversion_iterator =3D unspecified;</pre>
<p>Views a range of codeunits in the foreign encoding as a range of code un=
its in this encoding. A reasonable implementation for any foreign encoding =
is to simply convert to Unicode codepoints and then back to=20
    the current encoding. The validating iterator shall ensure that all for=
eign data is suitable for representation in this encoding.</p>
<p>An implementation shall provide at least the following encodings:
</p>
<pre>        namespace encoding {   =20
            typedef unspecified utf8;
            typedef unspecified utf16;
            typedef unspecified utf32;
            typedef unspecified wide;
            typedef unspecified narrow;
            typedef unspecified system;
        }
</pre>
<p>The narrow encoding is the encoding used for narrow string literals, such as "hello". The wide encoding is the encoding used for wide string literals, such as L"hello". An implementation
    has no obligation to make these separate types if one of the wide or narrow encodings, or both, is already a Unicode encoding. The system encoding is an implementation-defined default
    which shall be the encoding best used for interoperation with platform APIs, especially operating system APIs, such as UTF-16 on Windows and UTF-8 on Unix.
</p>
<pre>        template&lt;typename Char&gt; using encoding_of =3D implementa=
tion-defined;
</pre>
<p>The encoding_of template returns the assumed encoding of a basic_string whose value_type is decayed Char. This shall be narrow for char, wide for wchar_t, utf16 for char16_t, and utf32 for char32_t.</p>
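<p>One possible shape for such a mapping is a small trait that decays its argument before looking up the encoding. This is only a sketch: the tag types and the encoding_of_impl helper below are placeholders invented for illustration, and live in a hypothetical sketch namespace.</p>

```cpp
#include <type_traits>

namespace sketch {
// Hypothetical empty tag types standing in for the proposal's encodings.
struct narrow {};
struct wide {};
struct utf16 {};
struct utf32 {};

// Primary template is left undefined, so unsupported character types
// fail to compile instead of silently picking an encoding.
template <typename Char> struct encoding_of_impl;
template <> struct encoding_of_impl<char>     { typedef narrow type; };
template <> struct encoding_of_impl<wchar_t>  { typedef wide type; };
template <> struct encoding_of_impl<char16_t> { typedef utf16 type; };
template <> struct encoding_of_impl<char32_t> { typedef utf32 type; };

// Decay first, so that const char16_t& and char16_t map to the same encoding.
template <typename Char>
using encoding_of =
    typename encoding_of_impl<typename std::decay<Char>::type>::type;
}  // namespace sketch

static_assert(std::is_same<sketch::encoding_of<char>, sketch::narrow>::value,
              "char maps to the narrow encoding");
static_assert(std::is_same<sketch::encoding_of<const char16_t&>, sketch::utf16>::value,
              "references and cv-qualifiers are stripped before lookup");
```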

<p>The string class is a container of Unicode codepoints. The treatment of =
the freestanding algorithms as a range of Unicode codepoints means that any=
 container of Unicode codepoints may be=20
    used, but this class is provided as the minimal useful container. It ma=
y contain embedded null characters.
</p>
<pre>        template&lt;typename encoding, typename allocator =3D std::all=
ocator&lt;encoding::codeunit&gt;&gt; class encoded_string {
        public:
            encoded_string();
            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string(const encoded_string&lt;other_encoding, other_alloc&gt;&);
            encoded_string(encoded_string&&);
           =20
            encoded_string(const char*);</pre>
           =20
<p>When the encoded_string interface deals with a const char* or std::strin=
g, it will assume narrow encoding, not UTF-8. A constructor which can take =
an encoding is available for UTF-8 const char*. When the=20
    encoded_string class takes input from an external source, it will valid=
ate that it is well-formed Unicode. If not, an exception shall be thrown.</=
p>

<pre>            encoded_string(const char*, encoding);

            encoded_string(const wchar_t*);
            encoded_string(const char16_t*);
            encoded_string(const char32_t*);
            template&lt;typename T, typename Traits, typename Allocator, ty=
pename Encoding =3D encoding_of&lt;T&gt;&gt;=20
            encoded_string(const std::basic_string_ref&lt;T, Traits, Alloca=
tor&gt;&, Encoding e =3D Encoding());
            template&lt;typename Iterator, typename Encoding =3D encoding_o=
f&lt;decltype(*std::declval&lt;Iterator&gt;)&gt;>=20
            encoded_string(Iterator, Iterator, Encoding e =3D Encoding());<=
/pre>

<p>The Iterator type, a pair of which the string may be constructed from, shall be at least an input iterator over codeunits of that encoding. If the decayed return type
    of dereferencing an iterator is not Encoding::codeunit, then compilation shall fail.
</p>
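<p>The compile-time check described above could be expressed roughly as follows. This is a sketch under assumptions: yielded_codeunit and copy_codeunits are hypothetical names, and a std::vector stands in for the string's storage.</p>

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical trait: the type an iterator yields when dereferenced, with
// reference and cv-qualifiers stripped by std::decay.
template <typename Iterator>
using yielded_codeunit =
    typename std::decay<decltype(*std::declval<Iterator>())>::type;

// Dereferencing a container iterator yields a reference, so the raw decltype
// is not the codeunit type itself; decaying recovers it.
static_assert(std::is_same<decltype(*std::declval<std::u16string::const_iterator>()),
                           const char16_t&>::value,
              "raw decltype of a dereference is a reference type");
static_assert(std::is_same<yielded_codeunit<std::u16string::const_iterator>,
                           char16_t>::value,
              "decay recovers the codeunit type");

// Hypothetical stand-in for the iterator-pair constructor: compilation fails
// if the iterators do not yield the expected codeunit type.
template <typename Codeunit, typename Iterator>
std::vector<Codeunit> copy_codeunits(Iterator first, Iterator last) {
    static_assert(std::is_same<yielded_codeunit<Iterator>, Codeunit>::value,
                  "iterator does not yield this encoding's codeunit type");
    return std::vector<Codeunit>(first, last);
}
```

<p>With this in place, copy_codeunits&lt;char16_t&gt; accepts iterators into a std::u16string, while passing iterators into a std::string is rejected at compile time.</p>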

<pre>            template&lt;typename Iterator, typename Encoding =3D encod=
ing_of&lt;decltype(*std::declval&lt;Iterator&gt;)&gt;>=20
            void assign(Iterator, Iterator, Encoding e =3D Encoding()) &;</=
pre>

<p>The requirements on the Iterator type here are the same as those on the =
constructor.</p>

<pre>            void assign(encoded_string&) &;
            void assign(encoded_string&&) &;

            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string operator+(const encoded_string&lt;other_encoding, other_alloc&gt;&) const;
            encoded_string operator+(encoded_string&&) const;
            encoded_string operator+(const char*) const;
            encoded_string operator+(const wchar_t*) const;
            encoded_string operator+(const char16_t*) const;
            encoded_string operator+(const char32_t*) const;
            template&lt;typename T, typename Traits, typename Allocator&gt;
            encoded_string operator+(const std::basic_string_ref&lt;T, Traits, Allocator&gt;&) const;
           =20
            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string& operator+=3D(const encoded_string&lt;other_enco=
ding, other_alloc&gt;&) &;
            encoded_string& operator+=3D(encoded_string&&) &;
            encoded_string& operator+=3D(const char*) &;
            encoded_string& operator+=3D(const wchar_t*) &;
            encoded_string& operator+=3D(const char16_t*) &;
            encoded_string& operator+=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;=
=20
            encoded_string& operator+=3D(const std::basic_string_ref&lt;T, =
Traits, Allocator&gt;&);

            encoded_string& operator=3D(const encoded_string&) &;
            encoded_string& operator=3D(encoded_string&&) &;
            encoded_string& operator=3D(const char*) &;
            encoded_string& operator=3D(const wchar_t*) &;
            encoded_string& operator=3D(const char16_t*) &;
            encoded_string& operator=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;=
=20
            encoded_string& operator=3D(const std::basic_string_ref&lt;T, T=
raits, Allocator&gt;&);

            iterator begin() &;
            const_iterator begin() const &;
            const_iterator cbegin() const &;
            iterator end() &;
            const_iterator end() const &;
            const_iterator cend() const &;</pre>

<p>The iterator and const_iterator types are bidirectional iterators of Uni=
code codepoints. operator* returns a char32_t rvalue, which is the codepoin=
t at that position. The invalidation=20
semantics of iterators shall be those of std::vector.</p>

<pre>            void clear() &;
            bool empty() const;
           =20
            iterator erase(const_iterator where) &;
            iterator erase(const_iterator first, const_iterator last) &;

            void swap(encoded_string&);

            char32_t front() const;
            char32_t back() const;
           =20
            iterator insert(const_iterator where, char32_t codepoint);
            template&lt;typename InputIterator, typename Encoding = encoding_of&lt;decltype(*std::declval&lt;InputIterator&gt;)&gt;&gt;
            iterator insert(const_iterator where, InputIterator begin, InputIterator end, Encoding e = Encoding());
            template&lt;typename other_encoding, typename other_alloc&gt;
            iterator insert(const_iterator where, const encoded_string&lt;other_encoding, other_alloc&gt;&);
            template&lt;typename T, typename Traits, typename Alloc, typename Encoding = encoding_of&lt;T&gt;&gt;
            iterator insert(const_iterator where, const basic_string&lt;T, Traits, Alloc&gt;&, Encoding e = Encoding());

            void pop_back();
            void push_back(char32_t);

            void normalize(normal_form);</pre>

<p>Performs an in-place normalization of the string's contents to the reque=
sted form.</p>

<pre>            const encoding::codeunit* codeunit_data() const;
            std::size_t codeunit_size() const;</pre>

<p>codeunit_data returns the contents of the encoded_string as a null-terminated buffer. This pointer shall be valid for as long as the encoded_string is not mutated or destroyed. The codeunit_size
    function shall return the size of this buffer, excluding the null terminator.</p>
<pre>            void codeunit_reserve(std::size_t size);
            std::size_t codeunit_capacity() const;
</pre>

<pre>        };

        using string = encoded_string&lt;encoding::system, implementation-defined default&gt;;

        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;=20
        bool operator<(const encoded_string&lt;lhs_encoding, lhs_allocator&=
gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename rhs_encoding, typename rhs_allocator&gt;
        bool operator&lt;=(const encoded_string&lt;lhs_encoding, lhs_allocator&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator=3D=3D(const encoded_string&lt;lhs_encoding, lhs_alloc=
ator&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs=
);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator>(const encoded_string&lt;lhs_encoding, lhs_allocator&=
gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator>=3D(const encoded_string&lt;lhs_encoding, lhs_allocat=
or&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);
        template&lt;typename lhs_encoding, typename lhs_allocator, typename=
 rhs_encoding, typename rhs_allocator&gt;
        bool operator!=3D(const encoded_string&lt;lhs_encoding, lhs_allocat=
or&gt;& lhs, const encoded_string&lt;rhs_encoding, rhs_allocator&gt;& rhs);=
</pre>

<p>These comparison operators behave as if the data in the lhs and the rhs were passed to the respective freestanding algorithm.</p>

<pre>        template&lt;typename Iterator, typename OutIt> void convert(It=
erator begin, Iterator end, OutIt out, encoding src, encoding dst);</pre>

<p>Converts the input range of code units in the src encoding into the dst encoding. The output iterator receives the result of the operation.</p>
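<p>As a concrete illustration of converting through codepoints, the sketch below turns UTF-16 code units into UTF-8 code units by decoding each codepoint and re-encoding it. The function names are hypothetical, the encodings are hard-coded rather than selected by src and dst parameters, and output is written as plain char.</p>

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <string>

// Decode one codepoint from UTF-16 code units, advancing `it`.
// Precondition: it != end. Throws on an unpaired surrogate.
template <typename It>
char32_t decode_utf16(It& it, It end) {
    const char32_t hi = static_cast<char16_t>(*it++);
    if (hi < 0xD800 || hi > 0xDFFF) return hi;  // not a surrogate
    if (hi > 0xDBFF || it == end) throw std::invalid_argument("unpaired surrogate");
    const char32_t lo = static_cast<char16_t>(*it);
    if (lo < 0xDC00 || lo > 0xDFFF) throw std::invalid_argument("unpaired surrogate");
    ++it;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// Encode one codepoint as UTF-8 code units written to `out`.
template <typename Out>
Out encode_utf8(char32_t cp, Out out) {
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Conversion pivots through codepoints: decode from the source encoding,
// then encode into the destination encoding.
template <typename It, typename Out>
Out utf16_to_utf8(It begin, It end, Out out) {
    while (begin != end) out = encode_utf8(decode_utf16(begin, end), out);
    return out;
}
```

<p>A generic convert would dispatch on the encoding arguments to pick the decode and encode steps; the pivot through char32_t stays the same.</p>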
       =20
<pre>        template&lt;typename Iterator> std::pair&lt;grapheme_iterator&=
lt;Iterator>, grapheme_iterator&lt;Iterator>>=20
        graphemes(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;word_iterator&lt;Iterat=
or>, word_iterator&lt;Iterator>>
        words(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;line_iterator&lt;Iterat=
or>, line_iterator&lt;Iterator>>
        lines(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;sentence_iterator&lt;It=
erator>, sentence_iterator&lt;Iterator>>
        sentences(Iterator begin, Iterator end);</pre>

<p>All four iterator types (grapheme_iterator, word_iterator, line_iterator, and sentence_iterator) implement the respective Unicode Standard boundary analysis algorithms. The Line algorithm is defined in UAX #14
    (http://www.unicode.org/reports/tr14/) and the other three in UAX #29 (http://www.unicode.org/reports/tr29/). The input iterators are at least bidirectional iterators of Unicode codepoints. The boundary
    iterators all return from operator*() a pair of the base iterator type, where the first value marks the beginning of the region and the second marks its end. The first element of the return
    value of the four functions is the beginning iterator and the second is the end iterator. Each iterator is assumed to be in encoding_of&lt;decltype(*std::declval&lt;Iterator&gt;)&gt;.</p>

<pre>        template&lt;typename First, typename Second&gt; bool less(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc, typename rhs_alloc&gt;
        bool less(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale = std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc, typename rhsChar, typename rhsTraits, typename rhsAlloc,
        typename lhsEncoding = encoding_of&lt;lhsChar&gt;, typename rhsEncoding = encoding_of&lt;rhsChar&gt;&gt;
        bool less(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,
        lhsEncoding le = lhsEncoding(), rhsEncoding re = rhsEncoding(), std::locale = std::locale());

        template&lt;typename First, typename Second&gt; bool less_or_equal(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc, typename rhs_alloc&gt;
        bool less_or_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale = std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc, typename rhsChar, typename rhsTraits, typename rhsAlloc,
        typename lhsEncoding = encoding_of&lt;lhsChar&gt;, typename rhsEncoding = encoding_of&lt;rhsChar&gt;&gt;
        bool less_or_equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,
        lhsEncoding le = lhsEncoding(), rhsEncoding re = rhsEncoding(), std::locale = std::locale());

        template&lt;typename First, typename Second&gt; bool greater(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool greater(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const=
 encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale =3D std::locale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool greater(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;=
& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second&gt; bool greater_or_equal(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end, std::locale = std::locale());

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool greater_or_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt=
;&, const encoded_string&lt;rhs_enc, rhs_alloc&gt;&, std::locale =3D std::l=
ocale());

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool greater_or_equal(const basic_string&lt;lhsChar, lhsTraits, lhs=
Alloc&gt;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second&gt; bool equal(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end);

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, const e=
ncoded_string&lt;rhs_enc, rhs_alloc&gt;&);

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&gt;& =
lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding())=
;

        template&lt;typename First, typename Second&gt; bool not_equal(First lhs_begin, First lhs_end, Second rhs_begin, Second rhs_end);

        template&lt;typename lhs_enc, typename rhs_enc, typename lhs_alloc,=
 typename rhs_alloc&gt;=20
        bool not_equal(const encoded_string&lt;lhs_enc, lhs_alloc&gt;&, con=
st encoded_string&lt;rhs_enc, rhs_alloc&gt;&);

        template&lt;typename lhsChar, typename lhsTraits, typename lhsAlloc=
, typename rhsChar, typename rhsTraits, typename rhsAlloc,=20
        typename lhsEncoding =3D encoding_of&lt;lhsChar&gt;, typename rhsEn=
coding =3D encoding_of&lt;rhsChar&gt;&gt;
        bool not_equal(const basic_string&lt;lhsChar, lhsTraits, lhsAlloc&g=
t;& lhs, const basic_string&lt;rhsChar, rhsTraits, rhsAlloc&gt;&,=20
        lhsEncoding le =3D lhsEncoding(), rhsEncoding re =3D rhsEncoding())=
;</pre>

<p>These six algorithms implement Unicode comparison functionality. Equivalence is defined as equivalence when normalized, with either NFC or NFD. Collation requires a locale; overloads which do not have one as a
    parameter shall use the global locale. All iterator ranges shall be forward iterators of Unicode codepoints.
</p>

<pre>        template&lt;typename Iterator, typename Out> void normalize(It=
erator begin, Iterator end, Out out, normal_form);</pre>

<p>Implements normalization of the forward range over Unicode codepoints, w=
ith the output provided to the output iterator. The normal_form argument in=
dicates which normal form is requested.</p>

<pre>        template&lt;typename encoding, typename allocator&gt; std::ist=
ream&=20
        operator>>(std::istream&, encoded_string&lt;encoding, allocator&gt;=
&);
        template&lt;typename encoding, typename allocator&gt; std::wistream=
&=20
        operator>>(std::wistream&, encoded_string&lt;encoding, allocator&gt=
;&);</pre>

<p>Reads until the next whitespace, as operator>>(std::istream&, std::string&) does. Shall convert the data in the stream to Unicode, so whitespace shall include Unicode whitespace.</p>

<pre>         template&lt;typename encoding, typename allocator&gt; std::os=
tream&=20
         operator&lt;&lt;(std::ostream&, const encoded_string&lt;encoding, =
allocator&gt;&);
         template&lt;typename encoding, typename allocator&gt; std::wostrea=
m&=20
         operator&lt;&lt;(std::wostream&, const encoded_string&lt;encoding,=
 allocator&gt;&);</pre>

<p>Writes the contents of the string to the stream. Shall perform an encodi=
ng conversion to narrow encoding and wide encoding when necessary.</p>

<pre>        struct dec {
            dec();
            dec(const dec&);
            dec(dec&&);
            dec(std::locale l);
            std::locale get_locale();
        };
        struct hex {};
        struct bin {};
        struct oct {};
</pre>
<p>For all primitive integer types I:</p>
<pre>        template&lt;typename base =3D dec&gt; string to_string(I, base=
 =3D base{});</pre>
<p>This function shall format the integer of type I as a string, in the bas=
e provided. If I is <code>bool</code>, then bin shall represent 0 or 1, and=
 any other choice shall result in <code>true</code> or=20
    <code>false</code>.</p>
<pre>        enum class codepoint_category {
            letter_uppercase,
            letter_lowercase,
            letter_titlecase,
            letter_modifier,
            letter_other,
            mark_non_spacing,
            mark_spacing_combining,
            mark_enclosing,
            number_decimal_digit,
            number_letter,
            number_other,
            punctuation_connector,
            punctuation_dash,
            punctuation_open,
            punctuation_close,
            punctuation_initial,
            punctuation_final,
            punctuation_other,
            symbol_math,
            symbol_currency,
            symbol_modifier,
            symbol_other,
            separator_space,
            separator_line,
            separator_paragraph,
            other_control,
            other_format,
            other_surrogate,
            other_private_use,
            other_not_assigned,
        };
        enum class bidi_category {
            AL, AN,
            B, BN,
            CS,
            EN, ES, ET,
            L, LRE, LRO,
            NSM,
            ON,
            PDF,
            R, RLE, RLO,
            S,
            WS,
        };
        enum class category_joining_class {
            U, C, T, D, L, R,
        };
        enum class category_joining_group {
            Ain, Alaph, Alef, Alef_Maqsurah,
            Beh, Beth, Burushaski_Yeh_Barree,
            Dal, Dalath_Rish, E,
            Farsi_Yeh, Fe, Feh, Final_Semkath,
            Gaf, Gamal,
            Hah, Hamza_On_Heh_Goal, He,
            Heh, Heh_Goal, Heth,
            Kaf, Kaph, Khaph, Knotted_Heh,
            Lam, Lamadh, Meem, Mim,
            No_Joining_Group, Noon, Nun, Nya,
            Pe, Qaf, Qaph, Reh, Reversed_Pe,
            Rohingya_Yeh,
            Sad, Sadhe, Seen, Semkath, Shin,
            Swash_Kaf, Syriac_Waw, Tah, Taw,
            Teh_Marbuta, Teh_Marbuta_Goal, Teth, Waw, Yeh,
            Yeh_Barree, Yeh_With_Tail, Yudh,
            Yudh_He, Zain, Zhain,
        };
        enum class script_type {
            Arab, Armi, Armn, Avst,
            Bali, Bamu, Batk, Beng, Bopo, Brah, Brai, Bugi, Buhd,
            Cakm, Cans, Cari, Cham, Cher, Copt, Cprt,
            Cyrl,
            Deva, Dsrt,
            Egyp, Ethi,
            Geor, Glag, Goth, Grek, Gujr, Guru,
            Hang, Hani, Hano, Hebr, Hira, Hrkt,
            Ital,
            Java,
            Kali, Kana, Khar, Khmr, Knda, Kthi,
            Lana, Laoo, Latn, Lepc, Limb, Linb, Lisu, Lyci,
            Lydi,
            Mand, Merc, Mero, Mlym, Mong, Mtei, Mymr,
            Nkoo,
            Ogam, Olck, Orkh, Orya, Osma,
            Phag, Phli, Phnx, Plrd, Prti,
            Qaai,
            Rjng, Runr,
            Samr, Sarb, Saur, Shaw, Shrd,  Sinh, Sora, Sund, Sylo, Syrc,
            Tagb, Takr, Tale, Talu, Taml, Tavt, Telu, Tfng,
            Tglg, Thaa, Thai, Tibt,
            Ugar,
            Vaii,
            Xpeo, Xsux,
            Yiii,
            Zinh, Zyyy, Zzzz,
        };
        enum class block_name {
            Aegean_Numbers, Alchemical, Alphabetic_PF, Ancient_Greek_Music,=
 Ancient_Greek_Numbers,
            Ancient_Symbols, Arabic, Arabic_Ext_A, Arabic_Math, Arabic_PF_A=
, Arabic_PF_B, Arabic_Sup,
            Armenian, Arrows, ASCII, Avestan, Balinese, Bamum, Bamum_Sup, B=
atak, Bengali, Block_Elements,
            Bopomofo, Bopomofo_Ext, Box_Drawing, Brahmi, Braille, Buginese,=
 Buhid, Byzantine_Music,
            Carian, Chakma, Cham, Cherokee, CJK, CJK_Compat, CJK_Compat_For=
ms, CJK_Compat_Ideographs,
            CJK_Compat_Ideographs_Sup, CJK_Ext_A, CJK_Ext_B, CJK_Ext_C, CJK=
_Ext_D, CJK_Radicals_Sup,
            CJK_Strokes, CJK_Symbols, Compat_Jamo, Control_Pictures, Coptic=
, Counting_Rod, Cuneiform,
            Cuneiform_Numbers, Currency_Symbols, Cypriot_Syllabary, Cyrilli=
c, Cyrillic_Ext_A, Cyrillic_Ext_B,
            Cyrillic_Sup, Deseret, Devanagari, Devanagari_Ext, Diacriticals=
, Diacriticals_For_Symbols,
            Diacriticals_Sup, Dingbats, Domino, Egyptian_Hieroglyphs, Emoti=
cons, Enclosed_Alphanum,
            Enclosed_Alphanum_Sup, Enclosed_CJK, Enclosed_Ideographic_Sup, =
Ethiopic, Ethiopic_Ext,
            Ethiopic_Ext_A, Ethiopic_Sup, Geometric_Shapes, Georgian, Georg=
ian_Sup, Glagolitic, Gothic, Greek,
            Greek_Ext, Gujarati, Gurmukhi, Half_And_Full_Forms, Half_Marks,=
 Hangul, Hanunoo, Hebrew,
            High_PU_Surrogates, High_Surrogates, Hiragana, IDC, Imperial_Ar=
amaic, Indic_Number_Forms,
            Inscriptional_Pahlavi, Inscriptional_Parthian, IPA_Ext, Jamo, J=
amo_Ext_A, Jamo_Ext_B, Javanese,
            Kaithi, Kana_Sup, Kanbun, Kangxi, Kannada, Katakana, Katakana_E=
xt, Kayah_Li, Kharoshthi, Khmer,
            Khmer_Symbols, Lao, Latin_1_Sup, Latin_Ext_A, Latin_Ext_Additio=
nal, Latin_Ext_B, Latin_Ext_C,
            Latin_Ext_D, Lepcha, Letterlike_Symbols, Limbu, Linear_B_Ideogr=
ams, Linear_B_Syllabary, Lisu,
            Low_Surrogates, Lycian, Lydian, Mahjong, Malayalam, Mandaic, Ma=
th_Alphanum, Math_Operators,
            Meetei_Mayek, Meetei_Mayek_Ext, Meroitic_Cursive, Meroitic_Hier=
oglyphs, Miao, Misc_Arrows,
            Misc_Math_Symbols_A, Misc_Math_Symbols_B, Misc_Pictographs, Mis=
c_Symbols, Misc_Technical,
            Modifier_Letters, Modifier_Tone_Letters, Mongolian, Music, Myan=
mar, Myanmar_Ext_A, NB,
            New_Tai_Lue, NKo, Number_Forms, OCR, Ogham, Ol_Chiki, Old_Itali=
c, Old_Persian, Old_South_Arabian,
            Old_Turkic, Oriya, Osmanya, Phags_Pa, Phaistos, Phoenician, Pho=
netic_Ext, Phonetic_Ext_Sup,
            Playing_Cards, PUA, Punctuation, Rejang, Rumi, Runic, Samaritan=
, Saurashtra, Sharada, Shavian,
            Sinhala, Small_Forms, Sora_Sompeng, Specials, Sundanese, Sundan=
ese_Sup, Sup_Arrows_A, Sup_Arrows_B,
            Sup_Math_Operators, Sup_PUA_A, Sup_PUA_B, Sup_Punctuation, Supe=
r_And_Sub, Syloti_Nagri, Syriac,
            Tagalog, Tagbanwa, Tags, Tai_Le, Tai_Tham, Tai_Viet, Tai_Xuan_J=
ing, Takri, Tamil, Telugu, Thaana,
            Thai, Tibetan, Tifinagh, Transport_And_Map, UCAS, UCAS_Ext, Uga=
ritic, Vai, Vedic_Ext,
            Vertical_Forms, VS, VS_Sup, Yi_Radicals, Yi_Syllables, Yijing,
        };
        enum class version {
            v1_1,
            v2_0, v2_1,
            v3_0, v3_1, v3_2,
            v4_0, v4_1,
            v5_0, v5_1, v5_2,
            v6_0, v6_1, v6_2,
            unassigned =3D 0xFF,
        };
        struct codepoint_properties {
            codepoint_category category;
            block_name block;
            version age;
            bidi_category bidi_type;
            category_joining_class joining_class;
            category_joining_group joining_group;
            script_type script;
            bool control;
            bool digit;
            bool letter;
            bool lower;
            bool number;
            bool punctuation;
            bool separator;
            bool symbol;
            bool upper;
            bool whitespace;
        };
        codepoint_properties properties(char32_t);</pre>
<p>Returns the properties of any given codepoint. These properties are defi=
ned by the Unicode Standard, not here.</p>
<pre>        template&lt;typename Iterator, typename Out, typename Encoding=
 =3D utf32&gt; void to_upper(Iterator begin, Iterator end, Out out, Encodin=
g e =3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_upper(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_upper(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
utf32&gt; void to_lower(Iterator begin, Iterator end, Out out, Encoding e =
=3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_lower(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_lower(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
utf32&gt; void to_title(Iterator begin, Iterator end, Out out, Encoding e =
=3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_title(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_title(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());
</pre>
<p>Performs a case conversion for the given series of Unicode codepoints. T=
he output iterator shall be the same encoding as the input iterator.
</p>
<pre>        using regex = std::basic_regex&lt;char32_t, implementation-defined&gt;;</pre>
<p>A regular expression type suitable for matching Unicode codepoints. The =
traits must support <a href=3D"http://www.unicode.org/reports/tr18/">UTS-18=
</a> to at least Level 2.</p>
<pre>    }
}</pre>
<h2>Acknowledgements</h2>
<p>R. Martinho Fernandes gave significant assistance when dealing with some of the ins and outs of Unicode.</p>
   =20
</body>
</html>
------=_Part_1816_26406021.1354962575104--

.


Author: =?ISO-8859-1?Q?Daniel_Kr=FCgler?= <daniel.kruegler@gmail.com>
Date: Sat, 8 Dec 2012 23:42:03 +0100
Raw View
--089e013a11a27c0a5a04d05f0a48
Content-Type: text/plain; charset=ISO-8859-1

2012/12/8 DeadMG <wolfeinstein@gmail.com>

> Here is the next version.

The proposal definitely looks interesting, but I'm giving some stylistic
and content-based recommendations:

1) "Where a type is taken by either rvalue reference or const reference,
it's legal for implementations to provide only one overload that takes that
type by value."

This is not very helpful, IMO: We have no corresponding policy in other
places in the library. Why is this so important for the proposal?

2) Stylistic recommendation: Since this is a Library proposal, it would be
much easier to read if it would use the normal style of the library. In
particular template parameters use an initial upper case character.

3) Enum class normal_form uses all-upper-case enumerator values which is
also exceptional for the library.
Similar for bidi_category and category_joining_class.

4) Iterator design

a) You might want to consider a policy-based design in regard to error
management. This would reduce the number of different names for basically
the same thing (X_iterator and validating_X_iterator), it would also allow
for more fine-grained error handling strategies.

b) Your design seems to intend bidirectional iterator where operator*
returns an rvalue. This violates the requirements of a forward iterator
which must return an lvalue (I certainly agree that this is a problem of
the current iterator requirements, but you need to explain in the text
that this design is based on a change to the iterator requirements).

"The iterator and const_iterator types are bidirectional iterators of
Unicode codepoints. operator* returns a char32_t rvalue,"

c) It is not clear to me what the value type of conversion_iterator would
be.
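
A minimal sketch of what point 4a could look like, assuming the error
strategy becomes a template parameter of a single adaptor instead of a
separate validating_X_iterator type. The names checked_codepoint_iterator,
throw_on_error, replace_on_error, is_scalar_value and sanitize are made up
for this illustration and are not part of the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical policies: each one decides what an invalid codepoint becomes.
struct throw_on_error {
    static char32_t handle(char32_t) {
        throw std::runtime_error("invalid Unicode codepoint");
    }
};
struct replace_on_error {
    static char32_t handle(char32_t) {
        return char32_t(0xFFFD);  // U+FFFD REPLACEMENT CHARACTER
    }
};

// A codepoint is accepted if it is in range and not a surrogate.
inline bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFFu && !(c >= 0xD800u && c <= 0xDFFFu);
}

// One adaptor for every strategy; the policy is a template parameter.
template<typename Iterator, typename ErrorPolicy = throw_on_error>
class checked_codepoint_iterator {
    Iterator it_;
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;  // operator* returns by value

    explicit checked_codepoint_iterator(Iterator it) : it_(it) {}

    char32_t operator*() const {
        const char32_t c = *it_;
        return is_scalar_value(c) ? c : ErrorPolicy::handle(c);
    }
    checked_codepoint_iterator& operator++() { ++it_; return *this; }
    checked_codepoint_iterator operator++(int) {
        checked_codepoint_iterator old(*this);
        ++it_;
        return old;
    }
    friend bool operator==(const checked_codepoint_iterator& a,
                           const checked_codepoint_iterator& b) {
        return a.it_ == b.it_;
    }
    friend bool operator!=(const checked_codepoint_iterator& a,
                           const checked_codepoint_iterator& b) {
        return !(a == b);
    }
};

// Copies the input, replacing invalid codepoints instead of throwing.
inline std::vector<char32_t> sanitize(const std::vector<char32_t>& in) {
    typedef checked_codepoint_iterator<
        std::vector<char32_t>::const_iterator, replace_on_error> lenient;
    lenient first(in.begin()), last(in.end());
    return std::vector<char32_t>(first, last);
}
```

With the default policy the same adaptor throws on the first invalid
codepoint, so a caller picks the strategy without learning a second name.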

5) "The encoding_of template returns the assumed encoding of a basic_string
whose value_type is decayed Char. This shall be narrow for char, wide for
wchar_t, UTF16 for char16_t, and utf32 for char32_t."

The encoding_of template depends on some character type, so I don't
understand why the text refers to a basic_string. It seems as if
encoding_of can be used independent from basic_string instantiations.

6) "std::allocator<encoding::codeunit>>": Presumably this should be
"std::allocator<typename encoding::codeunit>"

7) In

template<typename Iterator, typename Encoding =
encoding_of<decltype(*std::declval<Iterator>)>>
            encoded_string(Iterator, Iterator, Encoding e = Encoding());

or

template<typename InputIterator, typename Encoding =
encoding_of<decltype(*std::declval<InputIterator>)>>
            iterator insert(const_iterator where, InputIterator begin,
InputIterator end, Encoding e = Encoding());

a problematic part is the decltype() part, because it seems to depend on
Iterator's operator* returning an rvalue. If the return type would be a
reference type I would assume that encoding_of cannot handle that
correctly. Why doesn't the proposal rely on
iterator_traits<Iterator>::value_type here, which seems much more
reasonable than the reference type?

The above declaration is also inconsistent with "If the decayed return type
of dereferencing an iterator is [..]"
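
To make the difference in point 7 concrete, here is a small sketch that
compares the type produced by decltype on a dereference with the type
reported by iterator_traits. The alias names deref_type, decayed_deref_type
and value_type_of are hypothetical helpers for this illustration only.

```cpp
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

// What the declarations in point 7 inspect: the type of the expression *it.
template<typename Iterator>
using deref_type = decltype(*std::declval<Iterator>());

// Decaying that type strips the reference and any cv-qualifiers.
template<typename Iterator>
using decayed_deref_type =
    typename std::decay<deref_type<Iterator>>::type;

// The alternative: ask iterator_traits for the value type directly.
template<typename Iterator>
using value_type_of =
    typename std::iterator_traits<Iterator>::value_type;

// For ordinary iterators, dereferencing yields a reference type...
static_assert(std::is_same<deref_type<const char16_t*>,
                           const char16_t&>::value,
              "pointer dereference is an lvalue");
static_assert(std::is_same<deref_type<std::vector<char32_t>::iterator>,
                           char32_t&>::value,
              "container iterators return references");

// ...so the character type is only recovered after decaying,
// or by going through iterator_traits.
static_assert(std::is_same<decayed_deref_type<const char16_t*>,
                           char16_t>::value,
              "decay recovers the character type");
static_assert(std::is_same<value_type_of<const char16_t*>,
                           char16_t>::value,
              "iterator_traits gives the same answer");
```

Either route works for these iterators; the undecayed decltype alone would
hand a reference type to encoding_of.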

8) Function template

 template<typename Iterator, typename OutIt> void convert(Iterator
begin, Iterator end, OutIt out, encoding src, encoding dst);


should return OutIt (not void), otherwise it would very much reduce the
value for pure output iterators, because
you cannot use it for further output. Same problem for

template<typename Iterator, typename Out> void normalize(Iterator
begin, Iterator end, Out out, normal_form);

and for the to_upper, to_lower, and to_title templates at the end.
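
A short sketch of why the returned output iterator matters in point 8.
ascii_to_upper is a made-up, ASCII-only stand-in for the proposal's
to_upper, and upper_two_ranges is a hypothetical caller; neither is part
of the proposal.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for to_upper: uppercases ASCII letters only and
// returns the position just past the last element written.
template<typename Iterator, typename OutIt>
OutIt ascii_to_upper(Iterator begin, Iterator end, OutIt out) {
    for (; begin != end; ++begin) {
        const char32_t c = *begin;
        *out++ = (c >= U'a' && c <= U'z')
                     ? char32_t(c - (U'a' - U'A'))
                     : c;
    }
    return out;
}

// Writes two converted ranges back to back into one buffer. Without the
// return value the caller could not know where the second call must start.
inline std::u32string upper_two_ranges() {
    const std::u32string first = U"abc";
    const std::u32string second = U"def";
    char32_t buffer[6];
    char32_t* out = buffer;
    out = ascii_to_upper(first.begin(), first.end(), out);
    out = ascii_to_upper(second.begin(), second.end(), out);
    return std::u32string(buffer, out);
}
```

This mirrors how std::copy and std::transform report their end position.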

9) I'm severely missing a handful of useful typedefs in template
encoded_string, such as

allocator_type
iterator
const_iterator

10) IMO it would be useful to have some function object types (class
templates), that take an encoded_string as value_type and which effectively
invoke the free function templates specifying an ordering (less, greater,
etc.). This is just an idea.

- Daniel

--




--089e013a11a27c0a5a04d05f0a48--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 8 Dec 2012 16:07:07 -0800 (PST)
Raw View
------=_Part_172_19827225.1355011628026
Content-Type: multipart/alternative;
 boundary="----=_Part_173_7114941.1355011628026"

------=_Part_173_7114941.1355011628026
Content-Type: text/plain; charset=ISO-8859-1

1) My fault; that's not supposed to be there. Fixed.

2) They should do, I'll give it another once-over. I have tried to emulate
the style as far as I can. Listing this as fixed for now.

3) Those are the names given by the Unicode Consortium. I could lowercase
them, I guess, but ultimately, either you're inconsistent with Unicode or
you're inconsistent with the C++ library.

4) a) I have already considered this. Primarily, throwing an exception
isn't a good enough error handling strategy in many cases; for example,
Python was DOSed by abusing exactly this technique. However, I have not yet
decided exactly what I want to replace it with.
    b) It was my understanding that the iterator requirements had already
been relaxed to permit rvalues. If not, it will be irritating but doable:
each iterator would just have to hold the value that it refers to. The
semantics of referring or pointing to that value would be somewhat broken,
though, so I'm not sure that approach can work. If it cannot, then the
regular iterators cannot serve many useful purposes, starting with this
one, without adaptation.
    c) It would be encoding::codeunit, or const encoding::codeunit&- i.e.,
you could simply consider it as
codeunit_iterator<foreign_codeunit_encoding::codepoint_iterator<foreign_codeunit_iterator>>,
except that some cross-encoding conversions may be easier to implement than
that. It's an optimization opportunity, not new functionality.
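
The fallback described in 4b can be sketched as follows, assuming
well-formed UTF-16 input. utf16_codepoint_iterator and count_codepoints are
hypothetical names for this illustration; the proposal does not specify
such a class.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical iterator over well-formed UTF-16. It stores the decoded
// codepoint so that operator* can return a reference (an lvalue).
class utf16_codepoint_iterator {
    const char16_t* pos_;   // first code unit of the current codepoint
    const char16_t* next_;  // first code unit of the following codepoint
    const char16_t* end_;   // end of the underlying code unit range
    char32_t current_;      // cached decoded value

    void decode() {
        next_ = pos_;
        current_ = 0;
        if (pos_ == end_) return;
        const char32_t lead = *next_++;
        if (lead >= 0xD800u && lead <= 0xDBFFu && next_ != end_) {
            const char32_t trail = *next_++;
            current_ = 0x10000u + ((lead - 0xD800u) << 10)
                                + (trail - 0xDC00u);
        } else {
            current_ = lead;
        }
    }
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef const char32_t& reference;

    utf16_codepoint_iterator() : pos_(0), next_(0), end_(0), current_(0) {}
    utf16_codepoint_iterator(const char16_t* pos, const char16_t* end)
        : pos_(pos), next_(pos), end_(end), current_(0) { decode(); }

    reference operator*() const { return current_; }
    pointer operator->() const { return &current_; }
    utf16_codepoint_iterator& operator++() {
        pos_ = next_;
        decode();
        return *this;
    }
    utf16_codepoint_iterator operator++(int) {
        utf16_codepoint_iterator old(*this);
        ++*this;
        return old;
    }
    friend bool operator==(const utf16_codepoint_iterator& a,
                           const utf16_codepoint_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf16_codepoint_iterator& a,
                           const utf16_codepoint_iterator& b) {
        return !(a == b);
    }
};

// Counts codepoints by walking the adaptor over the code units.
inline std::size_t count_codepoints(const std::u16string& s) {
    const char16_t* b = s.data();
    const char16_t* e = b + s.size();
    return static_cast<std::size_t>(std::distance(
        utf16_codepoint_iterator(b, e), utf16_codepoint_iterator(e, e)));
}
```

The reference returned by operator* points into the iterator itself, so it
stops being meaningful once that iterator is incremented or destroyed, and
two equal iterators refer to two different objects. That is the "somewhat
broken" part.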

5) Originally, it took a basic_string, but no longer does. Fixed.

6) Fixed.

7) Actually, the encoding_of wording specifically requires that it be the
encoding of *decayed* Char, so it would be fine given char&. However, I
have already refactored this into encoding_of_iterator. As for
iterator_traits, I see no reason to depend on it when I can avoid doing
so. I have clarified the wording of the behaviour of encoding_of. Fixed.

8) Fixed. I also added some convenience overloads for basic_string and
encoded_string for those functions. In addition, the convert function was
written against a much older version of the draft and had a completely
broken interface.

9) Hm. I implicitly meant to include them, but it's not explicitly stated.
Some of them introduce questions as to exactly what they should be, and if
it really can be supported, as iterators which return rvalues do not map
well to having to provide a reference or pointer type.

10) You mean like the existing std::less and std::greater? This would
principally be useful for basic_string, not encoded_string. However, you
have made me realize that there is no way to compare a basic_string with
an encoded_string directly, although the generic algorithms can cope with
this perfectly well. Nor did I propose extending basic_string to deal with
encoded_string. An encoded_string can already be constructed from a
basic_string, but this would involve a copy and be unnecessarily slow if
you're just looking to compare them. I also appear to have duplicated the
operator== and called the duplicate operator=.
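
As a rough sketch of the copy-free comparison mentioned in 10, the generic
algorithms can compare any two codepoint ranges directly. Here
std::u32string and std::vector<char32_t> stand in for basic_string and
encoded_string, and the names codepoints_less and codepoints_equal are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Orders two codepoint ranges by codepoint value, element by element,
// without constructing a temporary string of either type.
template<typename LhsRange, typename RhsRange>
bool codepoints_less(const LhsRange& lhs, const RhsRange& rhs) {
    return std::lexicographical_compare(std::begin(lhs), std::end(lhs),
                                        std::begin(rhs), std::end(rhs));
}

// Two ranges are equal when neither orders before the other.
template<typename LhsRange, typename RhsRange>
bool codepoints_equal(const LhsRange& lhs, const RhsRange& rhs) {
    return !codepoints_less(lhs, rhs) && !codepoints_less(rhs, lhs);
}
```

This is a plain codepoint-value ordering, not Unicode collation, and it
does not normalize either side first.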

I have also changed the regular expression support to support matching
other encodings. I don't believe this will be a significant implementation
burden, as they can simply delegate to the previously-required char32_t
variety for any encoding, but there's an opportunity for something
faster/smaller/whatever for non-UTF-32 encodings.

I have attached the revised version.

--




------=_Part_173_7114941.1355011628026--
------=_Part_172_19827225.1355011628026
Content-Type: text/html; charset=UTF-8; name=stringprop.html
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=stringprop.html
X-Attachment-Id: 7d8080a6-3fef-4b51-955f-dca8be2f9613

=EF=BB=BF<!DOCTYPE html>
<html lang=3D"en">
<head>
    <title>Coding Puppy</title> =20
</head>
<body>
    <p>Document working version five. The primary todos are string formatti=
ng and parsing, and locales.</p>
    <br />
    <p>Document Number: Currently not allocated.</p>
    <p>Date: 2012-11-05</p>
    <p>Project: Programming Language C++, Library Working Group</p>
    <p>Reply-to: wolfeinstein@gmail.com</p>
<h1>Strings Proposal</h1>
<h2>Introduction</h2>
<p>The purpose of this document is to propose new interfaces to support Uni=
code text, where the existing interfaces are quite deficient. </p>
<h2>Motivation and Scope</h2>
<p>This proposal is primarily motivated by two problems. The first is the o=
verwhelming number of string types- both primitive, Standard and third-part=
y. This mess of text types makes it impossible to reliably=20
    hold string data. The second is the poor support for Unicode within the=
 C++ Standard library. Unicode is a complex topic, where correctness depend=
s on the implementation of complex algorithms by the user.=20
    This is only exacerbated by the problem of multiple string encodings, a=
nd poor conversion interfaces, which is why C++ is awash with third-party s=
tring types. This problem is made even worse by the existence of=20
    unrelated types that need to hold string data- for example, exceptions.=
 The existing exception hierarchy is of significantly limited usefulness, a=
s it cannot hold Unicode exception data. This proposal aims to=20
    solve both these problems by offering freestanding algorithms and a fre=
sh string class which constitutes significant support for Unicode.
</p>
<p>It is intended to support all programmers from top to bottom, as string =
handling tasks are tasks universal to all programs. It is based on the exis=
ting practice shown in the more recent additions to the Standard=20
    library and Modern C++ design in general- templates instead of inherita=
nce, function objects, and freestanding algorithms and iterators.
</p>
<p>It is not currently in use and a reference implementation is still under=
 construction. However, there are numerous implementations of the various s=
ubcomponents, such as Unicode algorithms and formatting routines.=20
</p>
<h2>Impact on the Standard</h2>
<p>The primary impact on the Standard is the deprecation of existing compon=
ents. There are no additional language or library features required.
</p>
<h2>Design Decisions</h2>
<p>The primary design decision taken here is to give one universal definiti=
on of a string- a range of Unicode codepoints. This decision was taken beca=
use it allows free-standing algorithms,=20
    and an interface that fits well with the rest of the Standard library. =
It also allows the string interface to be significantly simplified compared=
 to the previous iteration. In addition,=20
    the library provides one single string type, best suited for each platf=
orm. This string type is intended to meet the requirements of, for example,=
 the filesystem TS for storing paths.
</p>
<p>Unicode validation failure throwing an exception is well known to be a l=
imited solution in many cases. This part of the API is due for additional c=
onsideration, as this is only a first draft. In addition,=20
    because of the potential for O(n) assignment, it was decided that the o=
nly kind of iterator offered over a string should be immutable, as in many =
cases the operation would boil down to inserting a variable=20
    size range. This could be prohibitively expensive. In addition, the cho=
ice of an rvalue makes it significantly simpler to offer iterators, as they=
 can decode on the fly to codepoints from their choice of=20
    encoding. Aside from this, however, the string was designed to be a fam=
iliar container, offering the minimal set of functions required to manipula=
te the sequence of codepoints.
</p>
<p>Another problem is posed by UTF-8. As u8 literals do not have a distinct=
 type, it's almost impossible to handle them correctly and as cleanly as th=
e other literal types. There are other proposals for introducing=20
    char8_t and fixing UTF-8 literals, and introducing std::u8string, but t=
his proposal does not assume they are accepted. It would, however, be of si=
gnificant benefit.
</p>
<p>Finally, the std namespace is becoming very overloaded. It was decided t=
hat it would be best to split the components into subnamespaces. This not o=
nly aids with the organization of the library as a whole,
    but also provides a clear difference between old and new components.
</p>
<h2>Technical Specification</h2>
<p>Currently, to avoid ambiguity, the specification is given as a series of=
 declarations in C++11.</p>
<p>For iterators, usually only the iterator category and return value of op=
erator* are specified, as the full specification of an iterator involves a =
lot of plumbing. If requested, these=20
specifications can be expanded to the full definition.</p>
<p>In header &lt;unicode&gt;</p>
<pre>namespace std {
    namespace unicode {
        enum class normal_form {
            nfc,
            nfd,
            nfkc,
            nfkd
        };</pre>
<p>The encoded_string class is templated based on an encoding parameter. Th=
is is a traits-style class implemented for each encoding. The required memb=
ers are:</p>

<pre>    typedef unspecified codeunit;</pre>

<p>The codeunit typedef is for the individual unit of storage for this spec=
ific encoding. This would be char16_t for UTF-16, char for narrow encoding,=
 etc. </p>

<pre>    template&lt;typename CodeunitIterator&gt; using codepoint_iterator=
 =3D unspecified;
    template&lt;typename CodeunitIterator&gt; using validating_codepoint_it=
erator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original codeunit range as Un=
icode codepoints. The validating version will throw if the codeunits are no=
t valid or do not result in valid Unicode. The adaptors have the=20
    same iterator category as the input type, except if that category is ra=
ndom, in which case they only need be bidirectional.</p>

<pre>    template&lt;typename CodepointIterator&gt; using codeunit_iterator=
 =3D unspecified;
    template&lt;typename CodepointIterator&gt; using validating_codeunit_it=
erator =3D unspecified;</pre>
<p>A pair of iterator adaptors which view the original range of Unicode cod=
epoints as code units. The validating version will throw if the codepoints =
are not valid or cannot be expressed in the destination.</p>

<pre>    template&lt;typename ForeignEncoding, typename ForeignCodeunitIter=
ator&gt; using conversion_iterator =3D unspecified;
    template&lt;typename ForeignEncoding, typename ForeignCodeunitIterator=
&gt; using validating_conversion_iterator =3D unspecified;</pre>
<p>Views a range of codeunits in the foreign encoding as a range of code un=
its in this encoding. A reasonable implementation for any foreign encoding =
is to simply convert to Unicode codepoints and then back to=20
    the current encoding. The validating iterator shall ensure that all for=
eign data is suitable for representation in this encoding.</p>
<p>An implementation shall provide at least the following encodings:
</p>
<pre>        namespace encoding {   =20
            typedef unspecified utf8;
            typedef unspecified utf16;
            typedef unspecified utf32;
            typedef unspecified wide;
            typedef unspecified narrow;
            typedef unspecified system;
        }
</pre>
<p>The narrow encoding is the encoding used for narrow string literals, suc=
h as "hello". The wide encoding is the encoding used for wide string litera=
ls, such as L"hello". An implementation=20
    has no obligation to make these separate types if one of the wide or na=
rrow encodings, or both, is already a Unicode encoding. The system encoding=
 is an implementation-defined default=20
    which shall be the encoding best used for interoperation with platform =
APIs, especially operating system APIs, such as UTF-16 on Windows and UTF-8=
 on Unix. The implementation may provide arbitrary
    additional encodings.
</p>
<pre>        template&lt;typename Char&gt; using encoding_of =3D implementa=
tion-defined;
</pre>
<p>The encoding_of template returns the assumed encoding of a string whose =
codeunit type is std::decay&lt;Char&gt;::type. This shall be narrow where t=
he decayed type is char, wide for wchar_t, utf16 for char16_t, and=20
    utf32 for char32_t.</p>
<pre>        template&lt;typename Iterator&gt; using encoding_of_iterator =
=3D encoding_of&lt;decltype(*std::declval&lt;Iterator&gt;())&gt;;</pre>

<p>The string class is a container of Unicode codepoints. The treatment of =
the freestanding algorithms as a range of Unicode codepoints means that any=
 container of Unicode codepoints may be=20
    used, but this class is provided as the minimal useful container. It ma=
y contain embedded null characters.
</p>
<pre>        template&lt;typename Encoding, typename Allocator =3D std::all=
ocator&lt;typename Encoding::codeunit&gt;&gt; class encoded_string {
        public:
            encoded_string();
            template&lt;typename OtherEncoding, typename OtherAlloc&gt;=20
            encoded_string(const encoded_string&lt;OtherEncoding, OtherAllo=
c&gt;&);
            encoded_string(encoded_string&&);
           =20
            encoded_string(const char*);</pre>
           =20
<p>When the encoded_string interface deals with a const char* or std::strin=
g, it will assume narrow encoding, not UTF-8. A constructor which can take =
an encoding is available for UTF-8 const char*. When the=20
    encoded_string class takes input from an external source, it will valid=
ate that it is well-formed Unicode. If not, an exception shall be thrown.</=
p>

<pre>            template&lt;typename Encoding&gt; encoded_string(const cha=
r*, Encoding =3D Encoding());
            encoded_string(const wchar_t*);
            encoded_string(const char16_t*);
            encoded_string(const char32_t*);
            template&lt;typename T, typename Traits, typename Allocator, ty=
pename Encoding =3D encoding_of&lt;T&gt;&gt;=20
            encoded_string(const std::basic_string&lt;T, Traits, Allocator&=
gt;&, Encoding e =3D Encoding());
            template&lt;typename Iterator, typename Encoding =3D encoding_o=
f_iterator&lt;Iterator&gt;&gt;=20
            encoded_string(Iterator, Iterator, Encoding e =3D Encoding());

            using iterator =3D implementation_defined;
            using const_iterator =3D implementation_defined;
            using allocator_type =3D implementation_defined;
            using size_type =3D  implementation_defined;
            using value_type =3D char32_t;
           =20
            template&lt;typename Iterator, typename Encoding =3D encoding_o=
f_iterator&lt;Iterator&gt;&gt;=20
            void assign(Iterator, Iterator, Encoding e =3D Encoding()) &;
            void assign(encoded_string&) &;
            void assign(encoded_string&&) &;

            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string operator+(const encoded_string&lt;other_encoding=
, other_alloc&gt;&) const;
            encoded_string operator+(encoded_string&&) const;
            encoded_string operator+(const char*) const;
            encoded_string operator+(const wchar_t*) const;
            encoded_string operator+(const char16_t*) const;
            encoded_string operator+(const char32_t*) const;
            template&lt;typename T, typename Traits, typename Allocator&gt;=
=20
            encoded_string operator+(const std::basic_string&lt;T, Traits, =
Allocator&gt;&) const;
           =20
            template&lt;typename other_encoding, typename other_alloc&gt;
            encoded_string& operator+=3D(const encoded_string&lt;other_enco=
ding, other_alloc&gt;&) &;
            encoded_string& operator+=3D(encoded_string&&) &;
            encoded_string& operator+=3D(const char*) &;
            encoded_string& operator+=3D(const wchar_t*) &;
            encoded_string& operator+=3D(const char16_t*) &;
            encoded_string& operator+=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;=
=20
            encoded_string& operator+=3D(const std::basic_string&lt;T, Trai=
ts, Allocator&gt;&);

            encoded_string& operator=3D(const encoded_string&) &;
            encoded_string& operator=3D(encoded_string&&) &;
            encoded_string& operator=3D(const char*) &;
            encoded_string& operator=3D(const wchar_t*) &;
            encoded_string& operator=3D(const char16_t*) &;
            encoded_string& operator=3D(const char32_t*) &;
            template&lt;typename T, typename Traits, typename Allocator&gt;=
=20
            encoded_string& operator=3D(const std::basic_string&lt;T, Trait=
s, Allocator&gt;&);

            iterator begin() &;
            const_iterator begin() const &;
            const_iterator cbegin() const &;
            iterator end() &;
            const_iterator end() const &;
            const_iterator cend() const &;</pre>

<p>The iterator and const_iterator types are bidirectional iterators of Uni=
code codepoints. operator* returns a char32_t rvalue, which is the codepoin=
t at that position. The invalidation=20
semantics of iterators shall be those of std::vector.</p>

<pre>            void clear() &;
            bool empty() const;
           =20
            iterator erase(const_iterator where) &;
            iterator erase(const_iterator first, const_iterator last) &;

            void swap(encoded_string&);

            char32_t front() const;
            char32_t back() const;
           =20
            iterator insert(const_iterator where, char32_t codepoint);
            template&lt;typename InputIterator, typename Encoding =3D encod=
ing_of_iterator&lt;InputIterator&gt;&gt;=20
            iterator insert(const_iterator where, InputIterator begin, Inpu=
tIterator end, Encoding e =3D Encoding());
            template&lt;typename OtherEncoding, typename OtherAlloc&gt;=20
            iterator insert(const_iterator where, const encoded_string&lt;O=
therEncoding, OtherAlloc&gt;&);
            template&lt;typename T, typename Traits, typename Alloc, typena=
me Encoding =3D encoding_of&lt;T&gt;&gt;=20
            iterator insert(const_iterator where, const basic_string&lt;T, =
Traits, Alloc&gt;&, Encoding e =3D Encoding());

            void pop_back();
            void push_back(char32_t);

            void normalize(normal_form);</pre>

<p>Performs an in-place normalization of the string's contents to the reque=
sted form.</p>

<pre>            const typename Encoding::codeunit* codeunit_data() const;
            std::size_t codeunit_size() const;</pre>

<p>codeunit_data returns the contents of the encoded_string as a null-termi=
nated buffer. This pointer shall be valid for as long as the encoded_string=
 is not mutated or destroyed. The codeunit_size=20
    function shall return the size of this buffer, except for the null term=
inator.</p>
<pre>            void codeunit_reserve(std::size_t size);
            std::size_t codeunit_capacity() const;
</pre>
=20
<pre>        };

        using string =3D encoded_string&lt;encoding::system, implementation=
-defined default&gt;

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator<(const encoded_string&lt;LHSEncoding, LHSAllocator&gt=
;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator<(const basic_string&lt;T, Traits, Alloc&gt;& lhs, con=
st encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator<(const encoded_string&lt;Encoding, EncAlloc&gt;& lhs,=
 const basic_string&lt;T, Traits, Alloc&gt;& rhs);

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator=3D=3D(const encoded_string&lt;LHSEncoding, LHSAllocat=
or&gt;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator=3D=3D(const basic_string&lt;T, Traits, Alloc&gt;& lhs=
, const encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator=3D=3D(const encoded_string&lt;Encoding, EncAlloc&gt;&=
 lhs, const basic_string&lt;T, Traits, Alloc&gt;& rhs);

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator<=3D(const encoded_string&lt;LHSEncoding, LHSAllocator=
&gt;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator<=3D(const basic_string&lt;T, Traits, Alloc&gt;& lhs, =
const encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator<=3D(const encoded_string&lt;Encoding, EncAlloc&gt;& l=
hs, const basic_string&lt;T, Traits, Alloc&gt;& rhs);

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator>(const encoded_string&lt;LHSEncoding, LHSAllocator&gt=
;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator>(const basic_string&lt;T, Traits, Alloc&gt;& lhs, con=
st encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator>(const encoded_string&lt;Encoding, EncAlloc&gt;& lhs,=
 const basic_string&lt;T, Traits, Alloc&gt;& rhs);

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator>=3D(const encoded_string&lt;LHSEncoding, LHSAllocator=
&gt;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator>=3D(const basic_string&lt;T, Traits, Alloc&gt;& lhs, =
const encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator>=3D(const encoded_string&lt;Encoding, EncAlloc&gt;& l=
hs, const basic_string&lt;T, Traits, Alloc&gt;& rhs);

        template&lt;typename LHSEncoding, typename LHSAllocator, typename R=
HSEncoding, typename RHSAllocator&gt;=20
        bool operator!=3D(const encoded_string&lt;LHSEncoding, LHSAllocator=
&gt;& lhs, const encoded_string&lt;RHSEncoding, RHSAllocator&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator!=3D(const basic_string&lt;T, Traits, Alloc&gt;& lhs, =
const encoded_string&lt;Encoding, EncAlloc&gt;& rhs);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding, typename EncAlloc&gt;=20
        bool operator!=3D(const encoded_string&lt;Encoding, EncAlloc&gt;& l=
hs, const basic_string&lt;T, Traits, Alloc&gt;& rhs);</pre>

<p>These comparison operators behave as if the data in the lhs and the rhs =
was passed to the respective freestanding algorithm.</p>

<pre>        template&lt;typename Encoding, typename Alloc, typename DestEn=
coding, typename DestAlloc =3D Alloc&gt;
        encoded_string&lt;DestEncoding, DestAlloc&gt; convert(const encoded=
_string&lt;Encoding, Alloc&gt;&, DestEncoding =3D DestEncoding(), DestAlloc=
 =3D DestAlloc());
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding =3D encoding_of&lt;T&gt;, typename DestEncoding, typename DestAlloc=
 =3D Alloc&gt;
        basic_string&lt;typename DestEncoding::codeunit, std::char_traits&l=
t;typename DestEncoding::codeunit&gt;, DestAlloc&gt;
        convert(const basic_string&lt;T, Traits, Alloc&gt;&, DestEncoding =
=3D DestEncoding(), DestAlloc =3D DestAlloc());
        template&lt;typename DestEncoding, typename Iterator, typename OutI=
t, typename Encoding =3D encoding_of_iterator&lt;Iterator&gt;&gt;
        OutIt convert(Iterator begin, Iterator end, OutIt out);</pre>

<p>Converts the input range of code units from the source encoding (Encodin=
g) to the destination encoding (DestEncoding). The output iterator receives=
 the result of the operation. Returns out.</p>
       =20
<pre>        template&lt;typename Iterator> std::pair&lt;grapheme_iterator&=
lt;Iterator>, grapheme_iterator&lt;Iterator>>=20
        graphemes(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;word_iterator&lt;Iterat=
or>, word_iterator&lt;Iterator>>
        words(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;line_iterator&lt;Iterat=
or>, line_iterator&lt;Iterator>>
        lines(Iterator begin, Iterator end);
        template&lt;typename Iterator> std::pair&lt;sentence_iterator&lt;It=
erator>, sentence_iterator&lt;Iterator>>
        sentences(Iterator begin, Iterator end);</pre>

<p>All four iterator types (grapheme_iterator, word_iterator, line_iterator=
, and sentence_iterator) implement the respective Unicode Standard boundary=
 analysis algorithms. The line algorithm is defined in UAX #14=20
    (http://www.unicode.org/reports/tr14/) and the other three in UAX #29 (=
http://www.unicode.org/reports/tr29/). The input iterators shall be at leas=
t bidirectional iterators over Unicode codepoints. For each boundary iterat=
or, operator*() returns a pair of the base iterator type whose first member=
 marks the beginning of the region and whose second member marks its end. E=
ach of the four functions returns a pair whose first element is the begin i=
terator and whose second is the end iterator.</p>

<pre>        template&lt;typename First, typename Second> bool less(First b=
egin, First end, Second begin, Second end, std::locale =3D std::locale());
   =20
        template&lt;typename LHSEnc, typename RHSEnc, typename LHSAlloc, ty=
pename RHSAlloc&gt;=20
        bool less(const encoded_string&lt;LHSEnc, LHSAlloc&gt;&, const enco=
ded_string&lt;RHSEnc, RHSAlloc&gt;&, std::locale =3D std::locale());

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool less(const basic_string&lt;LHSChar, LHSTraits, LHSAlloc&gt;& =
lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second> bool less_or_equal(Fir=
st begin, First end, Second begin, Second end, std::locale =3D std::locale(=
));
   =20
        template&lt;typename LHSEnc, typename RHSEnc, typename lhs_alloc, t=
ypename RHSAlloc&gt;=20
        bool less_or_equal(const encoded_string&lt;LHSEnc, lhs_alloc&gt;&, =
const encoded_string&lt;RHSEnc, RHSAlloc&gt;&, std::locale =3D std::locale(=
));

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool less_or_equal(const basic_string&lt;LHSChar, LHSTraits, LHSAll=
oc&gt;& lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second> bool greater(First beg=
in, First end, Second begin, Second end, std::locale =3D std::locale());

        template&lt;typename LHSEnc, typename RHSEnc, typename lhs_alloc, t=
ypename RHSAlloc&gt;=20
        bool greater(const encoded_string&lt;LHSEnc, lhs_alloc&gt;&, const =
encoded_string&lt;RHSEnc, RHSAlloc&gt;&, std::locale =3D std::locale());

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool greater(const basic_string&lt;LHSChar, LHSTraits, LHSAlloc&gt;=
& lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second> bool greater_or_equal(=
First begin, First end, Second begin, Second end, std::locale =3D std::loca=
le());

        template&lt;typename LHSEnc, typename RHSEnc, typename lhs_alloc, t=
ypename RHSAlloc&gt;=20
        bool greater_or_equal(const encoded_string&lt;LHSEnc, lhs_alloc&gt;=
&, const encoded_string&lt;RHSEnc, RHSAlloc&gt;&, std::locale =3D std::loca=
le());

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool greater_or_equal(const basic_string&lt;LHSChar, LHSTraits, LHS=
Alloc&gt;& lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding(),=
 std::locale =3D std::locale());

        template&lt;typename First, typename Second> bool equal(First begin=
, First end, Second begin, Second end);

        template&lt;typename LHSEnc, typename RHSEnc, typename lhs_alloc, t=
ypename RHSAlloc&gt;=20
        bool equal(const encoded_string&lt;LHSEnc, lhs_alloc&gt;&, const en=
coded_string&lt;RHSEnc, RHSAlloc&gt;&);

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool equal(const basic_string&lt;LHSChar, LHSTraits, LHSAlloc&gt;& =
lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding())=
;

        template&lt;typename First, typename Second> bool not_equal(First b=
egin, First end, Second begin, Second end);

        template&lt;typename LHSEnc, typename RHSEnc, typename lhs_alloc, t=
ypename RHSAlloc&gt;=20
        bool not_equal(const encoded_string&lt;LHSEnc, lhs_alloc&gt;&, cons=
t encoded_string&lt;RHSEnc, RHSAlloc&gt;&);

        template&lt;typename LHSChar, typename LHSTraits, typename LHSAlloc=
, typename RHSChar, typename RHSTraits, typename RHSAlloc,=20
        typename LHSEncoding =3D encoding_of&lt;LHSChar&gt;, typename RHSEn=
coding =3D encoding_of&lt;RHSChar&gt;&gt;
        bool not_equal(const basic_string&lt;LHSChar, LHSTraits, LHSAlloc&g=
t;& lhs, const basic_string&lt;RHSChar, RHSTraits, RHSAlloc&gt;&,=20
        LHSEncoding le =3D LHSEncoding(), RHSEncoding re =3D RHSEncoding())=
;</pre>

<p>These six algorithms implement Unicode comparison functionality. Two seq=
uences are equivalent if they compare equal after both are normalized to th=
e same form, either NFC or NFD. Collation requires a locale; overloads that=
 do not take one as a parameter shall use the global locale. All iterator r=
anges shall be forward iterators over Unicode codepoints.
</p>

<pre>        template&lt;typename Iterator, typename Out> Out normalize(Ite=
rator begin, Iterator end, Out out, normal_form);
        template&lt;typename T, typename Traits, typename Alloc, typename E=
ncoding =3D encoding_of&lt;T&gt;&gt;=20
        basic_string&lt;T, Traits, Alloc&gt; normalize(basic_string&lt;T, T=
raits, Alloc&gt;);
        template&lt;typename Encoding, typename Alloc&gt;=20
        encoded_string&lt;Encoding, Alloc&gt; normalize(encoded_string&lt;E=
ncoding, Alloc&gt;);</pre>

<p>Implements normalization of the forward range over Unicode codepoints, w=
ith the output provided to the output iterator. The normal_form argument in=
dicates which normal form is requested. Returns out.</p>

<pre>        template&lt;typename Encoding, typename Allocator&gt; std::ist=
ream&=20
        operator>>(std::istream&, encoded_string&lt;Encoding, Allocator&gt;=
&);
        template&lt;typename Encoding, typename Allocator&gt; std::wistream=
&=20
        operator>>(std::wistream&, encoded_string&lt;Encoding, Allocator&gt=
;&);</pre>

<p>Reads until the next whitespace, as operator>>(std::istream&, std::strin=
g&) does. Shall convert the data in the stream to Unicode, so whitespace sh=
all include Unicode whitespace characters.</p>

<pre>         template&lt;typename Encoding, typename Allocator&gt; std::os=
tream&=20
         operator&lt;&lt;(std::ostream&, const encoded_string&lt;Encoding, =
Allocator&gt;&);
         template&lt;typename Encoding, typename Allocator&gt; std::wostrea=
m&=20
         operator&lt;&lt;(std::wostream&, const encoded_string&lt;Encoding,=
 Allocator&gt;&);</pre>

<p>Writes the contents of the string to the stream. Shall perform an encodi=
ng conversion to narrow encoding and wide encoding when necessary.</p>

<pre>        struct dec {
            dec();
            dec(const dec&);
            dec(dec&&);
            dec(std::locale l);
            std::locale get_locale();
        };
        struct hex {};
        struct bin {};
        struct oct {};
</pre>
<p>For all primitive integer types I:</p>
<pre>        template&lt;typename Base =3D dec&gt; string to_string(I, Base=
 =3D Base{});</pre>
<p>This function shall format the integer of type I as a string, in the bas=
e provided. If I is <code>bool</code>, then bin shall represent 0 or 1, and=
 any other choice shall result in <code>true</code> or=20
    <code>false</code>.</p>
<pre>        enum class codepoint_category {
            letter_uppercase,
            letter_lowercase,
            letter_titlecase,
            letter_modifier,
            letter_other,
            mark_non_spacing,
            mark_spacing_combining,
            mark_enclosing,
            number_decimal_digit,
            number_letter,
            number_other,
            punctuation_connector,
            punctuation_dash,
            punctuation_open,
            punctuation_close,
            punctuation_initial,
            punctuation_final,
            punctuation_other,
            symbol_math,
            symbol_currency,
            symbol_modifier,
            symbol_other,
            separator_space,
            separator_line,
            separator_paragraph,
            other_control,
            other_format,
            other_surrogate,
            other_private_use,
            other_not_assigned,
        };
        enum class bidi_category {
            AL, AN,
            B, BN,
            CS,
            EN, ES, ET,
            L, LRE, LRO,
            NSM,
            ON,
            PDF,
            R, RLE, RLO,
            S,
            WS,
        };
        enum class category_joining_class {
            U, C, T, D, L, R,
        };
        enum class category_joining_group {
            Ain, Alaph, Alef, Alef_Maqsurah,
            Beh, Beth, Burushaski_Yeh_Barree,
            Dal, Dalath_Rish, E,
            Farsi_Yeh, Fe, Feh, Final_Semkath,
            Gaf, Gamal,
            Hah, Hamza_On_Heh_Goal, He,
            Heh, Heh_Goal, Heth,
            Kaf, Kaph, Khaph, Knotted_Heh,
            Lam, Lamadh, Meem, Mim,
            No_Joining_Group, Noon, Nun, Nya,
            Pe, Qaf, Qaph, Reh, Reversed_Pe,
            Rohingya_Yeh,
            Sad, Sadhe, Seen, Semkath, Shin,
            Swash_Kaf, Syriac_Waw, Tah, Taw,
            Teh_Marbuta, Teh_Marbuta_Goal, Teth, Waw, Yeh,
            Yeh_Barree, Yeh_With_Tail, Yudh,
            Yudh_He, Zain, Zhain,
        };
        enum class script_type {
            Arab, Armi, Armn, Avst,
            Bali, Bamu, Batk, Beng, Bopo, Brah, Brai, Bugi, Buhd,
            Cakm, Cans, Cari, Cham, Cher, Copt, Cprt,
            Cyrl,
            Deva, Dsrt,
            Egyp, Ethi,
            Geor, Glag, Goth, Grek, Gujr, Guru,
            Hang, Hani, Hano, Hebr, Hira, Hrkt,
            Ital,
            Java,
            Kali, Kana, Khar, Khmr, Knda, Kthi,
            Lana, Laoo, Latn, Lepc, Limb, Linb, Lisu, Lyci,
            Lydi,
            Mand, Merc, Mero, Mlym, Mong, Mtei, Mymr,
            Nkoo,
            Ogam, Olck, Orkh, Orya, Osma,
            Phag, Phli, Phnx, Plrd, Prti,
            Qaai,
            Rjng, Runr,
            Samr, Sarb, Saur, Shaw, Shrd,  Sinh, Sora, Sund, Sylo, Syrc,
            Tagb, Takr, Tale, Talu, Taml, Tavt, Telu, Tfng,
            Tglg, Thaa, Thai, Tibt,
            Ugar,
            Vaii,
            Xpeo, Xsux,
            Yiii,
            Zinh, Zyyy, Zzzz,
        };
        enum class block_name {
            Aegean_Numbers, Alchemical, Alphabetic_PF, Ancient_Greek_Music,=
 Ancient_Greek_Numbers,
            Ancient_Symbols, Arabic, Arabic_Ext_A, Arabic_Math, Arabic_PF_A=
, Arabic_PF_B, Arabic_Sup,
            Armenian, Arrows, ASCII, Avestan, Balinese, Bamum, Bamum_Sup, B=
atak, Bengali, Block_Elements,
            Bopomofo, Bopomofo_Ext, Box_Drawing, Brahmi, Braille, Buginese,=
 Buhid, Byzantine_Music,
            Carian, Chakma, Cham, Cherokee, CJK, CJK_Compat, CJK_Compat_For=
ms, CJK_Compat_Ideographs,
            CJK_Compat_Ideographs_Sup, CJK_Ext_A, CJK_Ext_B, CJK_Ext_C, CJK=
_Ext_D, CJK_Radicals_Sup,
            CJK_Strokes, CJK_Symbols, Compat_Jamo, Control_Pictures, Coptic=
, Counting_Rod, Cuneiform,
            Cuneiform_Numbers, Currency_Symbols, Cypriot_Syllabary, Cyrilli=
c, Cyrillic_Ext_A, Cyrillic_Ext_B,
            Cyrillic_Sup, Deseret, Devanagari, Devanagari_Ext, Diacriticals=
, Diacriticals_For_Symbols,
            Diacriticals_Sup, Dingbats, Domino, Egyptian_Hieroglyphs, Emoti=
cons, Enclosed_Alphanum,
            Enclosed_Alphanum_Sup, Enclosed_CJK, Enclosed_Ideographic_Sup, =
Ethiopic, Ethiopic_Ext,
            Ethiopic_Ext_A, Ethiopic_Sup, Geometric_Shapes, Georgian, Georg=
ian_Sup, Glagolitic, Gothic, Greek,
            Greek_Ext, Gujarati, Gurmukhi, Half_And_Full_Forms, Half_Marks,=
 Hangul, Hanunoo, Hebrew,
            High_PU_Surrogates, High_Surrogates, Hiragana, IDC, Imperial_Ar=
amaic, Indic_Number_Forms,
            Inscriptional_Pahlavi, Inscriptional_Parthian, IPA_Ext, Jamo, J=
amo_Ext_A, Jamo_Ext_B, Javanese,
            Kaithi, Kana_Sup, Kanbun, Kangxi, Kannada, Katakana, Katakana_E=
xt, Kayah_Li, Kharoshthi, Khmer,
            Khmer_Symbols, Lao, Latin_1_Sup, Latin_Ext_A, Latin_Ext_Additio=
nal, Latin_Ext_B, Latin_Ext_C,
            Latin_Ext_D, Lepcha, Letterlike_Symbols, Limbu, Linear_B_Ideogr=
ams, Linear_B_Syllabary, Lisu,
            Low_Surrogates, Lycian, Lydian, Mahjong, Malayalam, Mandaic, Ma=
th_Alphanum, Math_Operators,
            Meetei_Mayek, Meetei_Mayek_Ext, Meroitic_Cursive, Meroitic_Hier=
oglyphs, Miao, Misc_Arrows,
            Misc_Math_Symbols_A, Misc_Math_Symbols_B, Misc_Pictographs, Mis=
c_Symbols, Misc_Technical,
            Modifier_Letters, Modifier_Tone_Letters, Mongolian, Music, Myan=
mar, Myanmar_Ext_A, NB,
            New_Tai_Lue, NKo, Number_Forms, OCR, Ogham, Ol_Chiki, Old_Itali=
c, Old_Persian, Old_South_Arabian,
            Old_Turkic, Oriya, Osmanya, Phags_Pa, Phaistos, Phoenician, Pho=
netic_Ext, Phonetic_Ext_Sup,
            Playing_Cards, PUA, Punctuation, Rejang, Rumi, Runic, Samaritan=
, Saurashtra, Sharada, Shavian,
            Sinhala, Small_Forms, Sora_Sompeng, Specials, Sundanese, Sundan=
ese_Sup, Sup_Arrows_A, Sup_Arrows_B,
            Sup_Math_Operators, Sup_PUA_A, Sup_PUA_B, Sup_Punctuation, Supe=
r_And_Sub, Syloti_Nagri, Syriac,
            Tagalog, Tagbanwa, Tags, Tai_Le, Tai_Tham, Tai_Viet, Tai_Xuan_J=
ing, Takri, Tamil, Telugu, Thaana,
            Thai, Tibetan, Tifinagh, Transport_And_Map, UCAS, UCAS_Ext, Uga=
ritic, Vai, Vedic_Ext,
            Vertical_Forms, VS, VS_Sup, Yi_Radicals, Yi_Syllables, Yijing,
        };
        enum class version {
            v1_1,
            v2_0, v2_1,
            v3_0, v3_1, v3_2,
            v4_0, v4_1,
            v5_0, v5_1, v5_2,
            v6_0, v6_1, v6_2,
            unassigned =3D 0xFF,
        };
        struct codepoint_properties {
            codepoint_category category;
            block_name block;
            version age;
            bidi_category bidi_type;
            category_joining_class joining_class;
            category_joining_group joining_group;
            script_type script;
            bool control;
            bool digit;
            bool letter;
            bool lower;
            bool number;
            bool punctuation;
            bool separator;
            bool symbol;
            bool upper;
            bool whitespace;
        };
        codepoint_properties properties(char32_t);</pre>
<p>Returns the properties of any given codepoint. These properties are defi=
ned by the Unicode Standard, not here.</p>
<pre>        template&lt;typename Iterator, typename Out, typename Encoding=
 =3D encoding_of_iterator&lt;Iterator&gt;&gt; Out to_upper(Iterator begin, =
Iterator end, Out out, Encoding e =3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_upper(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_upper(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
encoding_of_iterator&lt;Iterator&gt;&gt; Out to_lower(Iterator begin, Itera=
tor end, Out out, Encoding e =3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_lower(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_lower(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());

        template&lt;typename Iterator, typename Out, typename Encoding =3D =
encoding_of_iterator&lt;Iterator&gt;&gt; Out to_title(Iterator begin, Itera=
tor end, Out out, Encoding e =3D Encoding());
        template&lt;typename Encoding, typename Allocator&gt; encoded_strin=
g&lt;Encoding, Allocator&gt; to_title(const encoded_string&lt;Encoding, All=
ocator&gt;&);   =20
        template&lt;typename Char, typename Traits, typename Alloc, typenam=
e Encoding =3D utf32&gt;
        std::basic_string&lt;Char, Traits, Alloc&gt; to_title(const std::ba=
sic_string&lt;Char, Traits, Alloc&gt;&, Encoding e =3D Encoding());
</pre>
<p>Performs a case conversion for the given series of Unicode codepoints. T=
he output iterator shall be the same encoding as the input iterator. Return=
s the output iterator.
</p>
<pre>        template&lt;typename Encoding&gt; using encoded_regex =3D std:=
:basic_regex&lt;typename Encoding::codeunit, implementation-defined&gt;;</p=
re>
<p>A regular expression type suitable for matching Unicode which is encoded=
 in the specified encoding. The traits must support <a href=3D"http://www.u=
nicode.org/reports/tr18/">UTS-18</a> to at least Level 2.</p>
<pre>    }
}</pre>
<h2>Acknowledgements</h2>
<p>R. Martinho Fernandes gave significant assistance with some of the ins a=
nd outs of Unicode.</p>
   =20
</body>
</html>
------=_Part_172_19827225.1355011628026--

.


Author: Daniel Krügler <daniel.kruegler@gmail.com>
Date: Sun, 9 Dec 2012 14:28:08 +0100
Raw View
--f46d043890936756f304d06b6b6b
Content-Type: text/plain; charset=ISO-8859-1

2012/12/9 DeadMG <wolfeinstein@gmail.com>

> 4)
>
[..]

I just noticed a problem, not yet mentioned, with the following overloaded
members of encoded_string:

template<typename other_encoding, typename other_alloc>
            encoded_string& operator+(const
encoded_string<other_encoding, other_alloc>&) const;
            encoded_string& operator+(encoded_string&&) const;
            encoded_string& operator+(const char*) const;
            encoded_string& operator+(const wchar_t*) const;
            encoded_string& operator+(const char16_t*) const;
            encoded_string& operator+(const char32_t*) const;
            template<typename T, typename Traits, typename Allocator>
            encoded_string& operator+(const std::basic_string<T,
Traits, Allocator>&) const;

I guess the intended return type is encoded_string and not encoded_string&;
otherwise I don't understand why they are const functions. If my
interpretation is correct, I suggest presenting them as free function
templates instead, which also allows for symmetric forms.
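
A minimal sketch of this suggestion follows. It does not use the proposal's encoded_string; toy_string, utf8_tag and ascii_tag are hypothetical stand-ins, and the concatenation simply appends bytes where a real implementation would transcode. It only illustrates non-member operator+ templates that return by value and accept a character pointer on either side.

```cpp
#include <cassert>
#include <string>

struct utf8_tag {};   // hypothetical encoding tags, for illustration only
struct ascii_tag {};

// Hypothetical stand-in for encoded_string: it just wraps a std::string.
template <typename Encoding>
class toy_string {
public:
    toy_string() = default;
    explicit toy_string(const char* s) : data_(s) { assert(s != nullptr); }
    void append(const std::string& s) { data_ += s; }
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Free function templates return a new object by value, never a reference.
template <typename E1, typename E2>
toy_string<E1> operator+(const toy_string<E1>& lhs, const toy_string<E2>& rhs) {
    toy_string<E1> result(lhs);
    result.append(rhs.str());   // a real implementation would transcode here
    return result;
}

template <typename E>
toy_string<E> operator+(const toy_string<E>& lhs, const char* rhs) {
    toy_string<E> result(lhs);
    result.append(rhs);
    return result;
}

// The symmetric form. Template argument deduction does not consider
// implicit conversions, so this needs its own overload.
template <typename E>
toy_string<E> operator+(const char* lhs, const toy_string<E>& rhs) {
    toy_string<E> result(lhs);
    result.append(rhs.str());
    return result;
}
```

With these overloads, expressions such as a + b, a + "!" and "<" + a all compile and yield a new toy_string, which a const member function returning encoded_string& could not express sensibly.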

Further in regard to

"These comparison operators behave as if the data in the lhs and the rhs
was passed to the respective freestanding algorithm."

I suggest making it clearer that you mean

"These comparison operators behave as if the codeunit_data in the lhs and
the rhs was passed to the respective freestanding algorithm."

This is still somewhat fuzzy. It might describe your intention better
if you specified the effects as equivalent to passing
codeunit_iterator<encoded_string::const_iterator> to the freestanding
algorithms. I guess you mean std::equal and std::lexicographical_compare;
this should be spelled out.
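
As an illustration of what "spelled out" could look like, here is a sketch under the assumption that the intended algorithms are std::equal and std::lexicographical_compare applied to the code-unit ranges. toy_utf16 is a hypothetical stand-in that exposes only its code units; it is not the proposed encoded_string.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in: a string that only exposes its UTF-16 code units.
struct toy_utf16 {
    std::u16string units;
};

// operator== expressed as std::equal over the two code-unit ranges.
inline bool operator==(const toy_utf16& lhs, const toy_utf16& rhs) {
    return lhs.units.size() == rhs.units.size() &&
           std::equal(lhs.units.begin(), lhs.units.end(), rhs.units.begin());
}

// operator< expressed as std::lexicographical_compare over code units.
inline bool operator<(const toy_utf16& lhs, const toy_utf16& rhs) {
    return std::lexicographical_compare(lhs.units.begin(), lhs.units.end(),
                                        rhs.units.begin(), rhs.units.end());
}

// Small usage example.
inline void demo_code_unit_comparison() {
    const toy_utf16 a{u"abc"};
    const toy_utf16 b{u"abd"};
    assert(a == a);
    assert(a < b);
}
```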


     b) It was my understanding that the iterator requirements had already
> been relaxed to permit rvalues.
>

An input iterator (but only that one) can return an rvalue from operator*.


> If not, it will be irritating, but doable- each iterator would just have
> to hold a value that would be the referred-to value.
>

I would suggest choosing this direction for the moment, because it has no
real negative impact on the iterator layout (just a 32-bit value) and it is
consistent with existing library iterators such as istream_iterator.
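
To make the idea concrete, here is a sketch of an iterator that stores the decoded value and returns a reference to it. Everything in it is an assumption made for illustration: the name toy_codepoint_iterator, the choice of UTF-8 as the underlying encoding, and the absence of any validation (the input must be well-formed). The forward iterator tag reflects the assumption being discussed, not a settled conclusion, since the returned reference is tied to the iterator object itself.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical code point iterator over well-formed UTF-8 (no validation).
class toy_codepoint_iterator {
public:
    // Assumption: tagged as forward; the thread debates whether a stored
    // value is enough to meet the forward iterator requirements.
    using iterator_category = std::forward_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = const char32_t&;

    toy_codepoint_iterator() = default;
    toy_codepoint_iterator(const char* pos, const char* end)
        : pos_(pos), end_(end) {
        assert(pos_ <= end_);
        decode();
    }

    // Returns an lvalue: a reference to the value cached in the iterator.
    reference operator*() const { return value_; }
    pointer operator->() const { return &value_; }

    toy_codepoint_iterator& operator++() {
        pos_ += length_;
        decode();
        return *this;
    }
    toy_codepoint_iterator operator++(int) {
        toy_codepoint_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const toy_codepoint_iterator& a,
                           const toy_codepoint_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const toy_codepoint_iterator& a,
                           const toy_codepoint_iterator& b) {
        return !(a == b);
    }

private:
    // Decodes the sequence starting at pos_ into the cached value.
    void decode() {
        length_ = 0;
        value_ = 0;
        if (pos_ == end_) return;
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        length_ = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        value_ = length_ == 1 ? lead : lead & (0xFF >> (length_ + 1));
        for (std::size_t i = 1; i < length_; ++i)
            value_ = (value_ << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
    }

    const char* pos_ = nullptr;
    const char* end_ = nullptr;
    std::size_t length_ = 0;   // code units in the current sequence
    char32_t value_ = 0;       // the cached 32-bit value
};
```

The cost is one char32_t (plus, in this sketch, a length) per iterator. The reference returned by operator* stays valid only while that iterator object is alive and has not been incremented.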

Furthermore, it would be no problem to change the return type of operator*
once the iterator requirements have been redesigned to allow a forward
iterator (or beyond) to return rvalues.

I'm recommending that, because a proposal that relies on things that are
not currently accepted is much harder to accept.

I strongly recommend pointing out in your proposal paper what your
*intention* is (e.g. returning rvalues from iterators). This helps later
when arguing in favour of fixing things that are currently not directly
available.

Of course, you can still decide to return rvalues in your proposal. But you
have to make it clear that this has the effect that these iterators don't
strictly satisfy the forward/bidirectional iterator requirements.


> The semantics of referring or pointing to it would be somewhat broken,
> though- I'm not sure if that approach can work. If that's not possible,
> then the regular iterators cannot serve many useful purposes, but starting
> with this one, without adaptation.
>

That is fine, but please make their special status with regard to the
iterator requirements clear.


>     c) It would be encoding::codeunit, or const encoding::codeunit&- i.e.,
> you could simply consider it as
> codeunit_iterator<foreign_codeunit_encoding::codepoint_iterator<foreign_codeunit_iterator>>,
> except that some cross-encoding conversions may be easier to implement than
> that. It's an optimization opportunity, not new functionality.
>

Thanks. I guess you meant encoding::codeunit, because the value type of an
iterator must be an object type. (I prefer to speak of value types first,
because reference types, i.e. the return type of operator*, are much more
sensitive with regard to the actual value access.)


> 7) Actually, the encoding_of wording specifically requires that it be the
> encoding of *decayed* Char, so it would be fine given char&. However, I
> have already refactored this into encoding_of_iterator. As for
> iterator_traits, I see no reason to depend on it when I can not depend on
> it. I have clarified the wording of the behaviour of encoding_of. Fixed.
>

I still think that it would be clearer to make encoding_of dependent on
a value type, and not on a reference type. This becomes especially
important if the iterator returns a proxy type (which is neither a
reference type nor the actual value type). You cannot solve this problem
via std::decay. There is also a semantic difference in whether the
template parameter of encoding_of is considered to be a value type
(that is, no cv-qualifier and always an object type) or an iterator
reference type (which can be a real reference, a value type, or even
a proxy type that is not directly useful in that context).
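
The proxy problem can be shown with a standard example, std::vector<bool>, whose iterator dereferences to a proxy class. The second half is a sketch only: toy_encoding_of, toy_encoding_of_iterator and the tag types are hypothetical names, not the proposal's encoding_of, and the mapping from character type to encoding is assumed for illustration.

```cpp
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

using bit_iterator = std::vector<bool>::iterator;

// What operator* returns: a proxy class, neither bool nor a real reference.
using deref_type = decltype(*std::declval<bit_iterator>());
// What the iterator declares as its value type.
using declared_value_type = std::iterator_traits<bit_iterator>::value_type;

static_assert(std::is_same<declared_value_type, bool>::value,
              "the value type is the plain object type");
static_assert(!std::is_same<std::decay<deref_type>::type, bool>::value,
              "decaying the proxy does not recover bool");

// Hypothetical encoding tags and a trait keyed on the value type.
struct toy_utf16 {};
struct toy_utf32 {};

template <typename T> struct toy_encoding_of;
template <> struct toy_encoding_of<char16_t> { using type = toy_utf16; };
template <> struct toy_encoding_of<char32_t> { using type = toy_utf32; };

// The iterator form goes through iterator_traits, so cv-qualifiers,
// references and proxies never reach toy_encoding_of.
template <typename Iterator>
using toy_encoding_of_iterator = typename toy_encoding_of<
    typename std::iterator_traits<Iterator>::value_type>::type;

static_assert(std::is_same<toy_encoding_of_iterator<const char32_t*>,
                           toy_utf32>::value,
              "const char32_t* maps to the UTF-32 tag");
```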

9) Hm. I implicitly meant to include them, but it's not explicitly stated.
> Some of them introduce questions as to exactly what they should be, and if
> it really can be supported, as iterators which return rvalues do not map
> well to having to provide a reference or pointer type.
>

It would be helpful if you at least added to the synopsis of encoded_string
something like

.... // The usual container typedefs

but it would be still clearer if these would be expressed in full form.
Especially of interest is whether the typedefs reference and
const_reference exist and which definition these would have, because the
API seems not to return any lvalues. This gives encoded_string a very
different view compared to containers, so if this is by design, it should
be clearly expressed that the type does intentionally not satisfy the
optional container requirements (even though it provides members front and
back, for example).



> 10) You mean like, the existing std::less and std::greater?
>

Yes.


> This would principally be useful for basic_string, not encoded_string.
>

Yep, I agree. I realize now that the value type of encoded_string
iterators is the codepoint, not the codeunit, so algorithms acting on
encoded_string iterators would compare codepoints. But then I don't
understand why the free comparison function overloads (<, ==, ...) do not
also compare codepoints. This becomes relevant if you have sequences of
encoded_string values that are passed to algorithms, e.g. std::equal or
std::lexicographical_compare. It seems odd that these functions would compare
the encoded_string values by code units instead of code points.
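
The difference is observable in UTF-16, where surrogate code units (0xD800 to 0xDFFF) sort below BMP code units such as 0xFF5E even though they encode larger code points. The sketch below assumes well-formed UTF-16 input; decode_utf16, less_by_code_units and less_by_code_points are illustrative helper names, not part of the proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Decodes well-formed UTF-16 into code points (assumption: no error handling).
inline std::u32string decode_utf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            const char32_t low = in[++i];
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(u);
    }
    return out;
}

inline bool less_by_code_units(const std::u16string& a, const std::u16string& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

inline bool less_by_code_points(const std::u16string& a, const std::u16string& b) {
    const std::u32string x = decode_utf16(a);
    const std::u32string y = decode_utf16(b);
    return std::lexicographical_compare(x.begin(), x.end(), y.begin(), y.end());
}

// U+FF5E is one code unit; U+10000 is the surrogate pair D800 DC00.
inline void demo_ordering_mismatch() {
    const std::u16string bmp(1, char16_t(0xFF5E));
    const std::u16string astral{char16_t(0xD800), char16_t(0xDC00)};
    assert(less_by_code_units(astral, bmp));    // 0xD800 < 0xFF5E
    assert(less_by_code_points(bmp, astral));   // U+FF5E < U+10000
}
```

So the same two strings order one way by code unit and the other way by code point, which is why the choice needs to be stated explicitly.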


I have also changed the regular expression support to support matching
> other encodings. I don't believe this will be a significant implementation
> burden, as they can simply delegate to the previously-required char32_t
> variety for any encoding, but there's an opportunity for something
> faster/smaller/whatever for non-UTF-32 encodings.
>

I don't understand why encoded_regex does not allow user code to provide
its own allocator.

I suggest separating out your idea of an implementation-defined iterator.
This idea is not really related to your proposal and could be applied to all
containers.

> I have attached the revised version.
>

One further comment in regard to the IO inserters and extractors:

template<typename Encoding, typename Allocator> std::istream&
    operator>>(std::istream&, encoded_string<Encoding, Allocator>&);
template<typename Encoding, typename Allocator> std::wistream&
    operator>>(std::wistream&, encoded_string<Encoding, Allocator>&);

template<typename Encoding, typename Allocator> std::ostream&
    operator<<(std::ostream&, const encoded_string<Encoding, Allocator>&);
template<typename Encoding, typename Allocator> std::wostream&
    operator<<(std::wostream&, const encoded_string<Encoding, Allocator>&);

I think these should be presented as the usual templates over
basic_istream and basic_ostream, because your semantic specification of
the encoding conversion can cope with that. It would also match the
way the library specification describes all existing inserters
and extractors.
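
As a rough sketch of that shape (not the proposal's actual wording), a
single inserter template over basic_ostream can serve std::ostream,
std::wostream and any other stream instantiation. The type toy_string is a
hypothetical stand-in for encoded_string, and the conversion is only a
placeholder: ASCII passes through widen() and everything else becomes '?',
where a real implementation would transcode to the stream's encoding.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for encoded_string: it simply stores code points.
struct toy_string {
    std::u32string code_points;
};

// One template instead of separate std::ostream / std::wostream overloads.
template <typename CharT, typename Traits>
std::basic_ostream<CharT, Traits>&
operator<<(std::basic_ostream<CharT, Traits>& os, const toy_string& s) {
    for (char32_t cp : s.code_points) {
        // Placeholder conversion, see the assumptions stated above.
        os.put(cp < 0x80 ? os.widen(static_cast<char>(cp)) : os.widen('?'));
    }
    return os;
}

// Usage: the same template serves narrow and wide streams.
inline void check_both_stream_kinds() {
    std::ostringstream narrow;
    narrow << toy_string{U"abc"};
    assert(narrow.str() == "abc");

    std::wostringstream wide;
    wide << toy_string{U"abc"};
    assert(wide.str() == L"abc");
}
```

The extractor would follow the same pattern with basic_istream.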

- Daniel

--f46d043890936756f304d06b6b6b--

.


Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 9 Dec 2012 07:11:12 -0800 (PST)
Raw View
------=_Part_108_24104369.1355065872840
Content-Type: text/plain; charset=ISO-8859-1

Yes, that is a glitch; they are meant to return a value. As for member vs freestanding,
I have considered this, but in general, there simply need to be more
overloads for basic_string interoperation. Changing the existing overloads
to freestanding would simply entail about a billion copies when performing
something like std::string() + std::unicode::string(). Also, no operators
were provided that could compare an encoded_string and a C-string in any
encoding.

No, of course I do not mean std::equal and std::lexicographical_compare. I
mean the Unicode-aware freestanding comparison functions which are defined
right here in this proposal. A comparison of codeunit_data() with a generic
algorithm would be completely meaningless. I have altered the wording to
make it clear that they delegate to the Unicode freestanding algorithms. In
addition, with the new comparison operator overloads, most of those
freestanding function overloads no longer need to exist.

I had not considered the potential for iterator proxy types. I have changed
encoding_of_iterator to use iterator_traits<Iterator>::value_type. Beyond
that, there's really not much more I can do.
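
A minimal sketch of why value_type is the more robust choice, assuming
nothing about the proposal beyond what is said above. The trait
code_unit_of_iterator is a hypothetical stand-in for the step inside
encoding_of_iterator that picks the type to inspect, and
std::vector<bool>::iterator serves as a readily available proxy iterator:

```cpp
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical helper: select the type whose encoding is to be looked up.
template <typename Iterator>
struct code_unit_of_iterator {
    using type = typename std::iterator_traits<Iterator>::value_type;
};

// Plain pointers: value_type is already stripped of const.
static_assert(
    std::is_same<code_unit_of_iterator<const char16_t*>::type, char16_t>::value,
    "value_type of const char16_t* is char16_t");

// Proxy iterator: operator* yields a proxy class, which std::decay cannot
// turn into bool, while value_type still reports bool.
using proxy_iterator = std::vector<bool>::iterator;
static_assert(
    std::is_same<code_unit_of_iterator<proxy_iterator>::type, bool>::value,
    "value_type sees through the proxy");
static_assert(
    !std::is_same<
        std::decay<std::iterator_traits<proxy_iterator>::reference>::type,
        bool>::value,
    "decaying the reference type does not recover the value type");
```

All checks are compile-time, so the snippet compiling is the whole test.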

encoded_regex does not take an allocator because, according to the
references I can find, basic_regex does not take one either.

------=_Part_108_24104369.1355065872840--

.