Topic: Committee feedback on N3572
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 20 Apr 2013 05:21:20 -0700 (PDT)
The LEWG liked the approach of offering new generic Unicode algorithms.
They did not like the encoded_string class and especially did not like the
flexibility of its encoding and that it exposed the encoding to the user.
The LEWG's recommendation was to split the string and algorithms into two
papers, and force the user into using one implementation-defined encoding,
and simply provide conversion to and from the existing string mechanisms.
Then it will be easy to pass the algorithms separately through the
Committee.
I intend to add to this thread with a draft of both revisions later.
--
---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/?hl=en.
.
Author: Olaf van der Spek <olafvdspek@gmail.com>
Date: Sat, 20 Apr 2013 06:38:22 -0700 (PDT)
> Committee feedback on N3572
It would be handy to include the title of the paper in the subject.
.
Author: Michał Dominiak <griwes@griwes.info>
Date: Sat, 20 Apr 2013 07:08:35 -0700 (PDT)
Wait a second, are you telling us that they want unicode strings to have
one specific encoding a user has totally no control over, and that writing
an application using, say, both UTF-8 and UTF-32 would not be possible? If
I understood what you wrote correctly, that would render the entire
proposal quite useless, and there would be no point in working on it.
If I misunderstood something, what would that "forcing the user into using
one implementation-defined encoding" mean?
On Saturday, 20 April 2013 14:21:20 UTC+2, DeadMG wrote:
>
> The LEWG liked the approach of offering new generic Unicode algorithms.
> They did not like the encoded_string class and especially did not like the
> flexibility of its encoding and that it exposed the encoding to the user.
> The LEWG's recommendation was to split the string and algorithms into two
> papers, and force the user into using one implementation-defined encoding,
> and simply provide conversion to and from the existing string mechanisms.
> Then it will be easy to pass the algorithms separately through the
> Committee.
>
> I intend to add to this thread with a draft of both revisions later.
>
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 20 Apr 2013 08:31:56 -0700 (PDT)
>
> Wait a second, are you telling us that they want unicode strings to have
> one specific encoding a user has totally no control over, and that writing
> an application using, say, both UTF-8 and UTF-32 would not be possible?
Give or take. The approach is that other interfaces deal in arbitrary
encodings, and at the interface boundaries, you convert to/from the
implementation-defined encoding. You, specifically, would not be able to
choose or even observe (the interface is carefully designed for this) the
encoding. This is quite similar to what other languages already do, except
that the encoding is implementation-defined rather than fixed by the language.
Frankly, I intend to give the Committee what they want.
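For readers who want something concrete, here is a minimal sketch of the boundary-conversion model described above. It is not taken from N3572 or any revision of it: the class name `opaque_ustring` and its members are hypothetical, the UTF-32 representation is just one choice an implementation could make, and the decoder assumes well-formed UTF-8 with no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical type: the name and members are assumptions for illustration.
class opaque_ustring {
public:
    // Boundary in: the caller names the encoding of the incoming data.
    // Assumes well-formed UTF-8; a real type would validate it.
    static opaque_ustring from_utf8(const std::string& bytes) {
        std::u32string cps;
        std::size_t i = 0;
        while (i < bytes.size()) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            // Number of continuation bytes implied by the lead byte.
            std::size_t extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
            char32_t cp = extra == 0 ? b : (b & (0x3F >> extra));
            for (std::size_t k = 1; k <= extra && i + k < bytes.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
            cps.push_back(cp);
            i += extra + 1;
        }
        return opaque_ustring(std::move(cps));
    }
    static opaque_ustring from_utf32(std::u32string cps) {
        return opaque_ustring(std::move(cps));
    }

    // Boundary out: the caller asks for the encoding it needs.
    std::string to_utf8() const {
        std::string out;
        for (char32_t c : rep_) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }
    std::u32string to_utf32() const { return rep_; }

private:
    explicit opaque_ustring(std::u32string rep) : rep_(std::move(rep)) {}
    // This sketch happens to store UTF-32, but no public member reveals that,
    // so an implementation could swap in UTF-8 or UTF-16 without breaking users.
    std::u32string rep_;
};
```

The point of the sketch is only that every public member either takes or returns a named encoding; nothing exposes a pointer or iterator into `rep_`.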
.
Author: Jeffrey Yasskin <jyasskin@google.com>
Date: Sat, 20 Apr 2013 18:05:05 +0200
The idea we wanted for a data type was that users could use a single
type to represent unicode characters, rather than a separate type for
each external encoding. This asserts that it's a good idea to
transcode data as it enters or exits the system and use a specific
encoding inside the system, rather than propagating variable encodings
throughout. There are a couple options for implementing that:
1) Take Python 3's approach of representing strings in UTF-32, with an
optimization for strings with no code points above 255 (store in 1
byte each) and another optimization for strings with no code points
above 65535 (store in 2 bytes each). Optionally, if the unicode string
is a cord/rope type, then each segment can have the optimization
applied independently.
2) Store the string in UTF-8 or UTF-16 depending on the platform (or
another encoding for less common platforms), and provide, say,
"array_view<const char> as_utf8(std::vector<char>& storage);" and
"array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
accessors: these would copy the data if it's stored in the other
encoding, or return a reference to the internal storage if it's
already in the desired encoding. Implementations would be free to
define other accessors, but I suspect these are all the standard
needs.
Option 1 has the benefit of allowing random-access iterators, or at
least indexing, which the ICU folks I spoke to thought would be
useful. Option 2 has the benefit of allowing some, maybe most,
external interactions without copying.
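To make the accessor semantics of option 2 concrete, here is a rough sketch for a platform whose internal encoding is UTF-8. Everything in it is an assumption for illustration: the class name `platform_ustring` is hypothetical, `std::string_view` and `std::u16string_view` (C++17) stand in for `array_view<const T>`, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical type illustrating option 2 with UTF-8 as the internal encoding.
class platform_ustring {
public:
    // Assumes `utf8` is well-formed; a real type would validate here.
    explicit platform_ustring(std::string utf8) : rep_(std::move(utf8)) {}

    // Requested encoding matches the internal one: return a view of the
    // internal buffer and leave `storage` untouched (no copy).
    std::string_view as_utf8(std::vector<char>& /*storage*/) const {
        return std::string_view(rep_.data(), rep_.size());
    }

    // Requested encoding differs: transcode into `storage` and return a
    // view of `storage` instead.
    std::u16string_view as_utf16(std::vector<char16_t>& storage) const {
        storage.clear();
        std::size_t i = 0;
        while (i < rep_.size()) {
            unsigned char b = static_cast<unsigned char>(rep_[i]);
            std::size_t extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
            char32_t cp = extra == 0 ? b : (b & (0x3F >> extra));
            for (std::size_t k = 1; k <= extra && i + k < rep_.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(rep_[i + k]) & 0x3F);
            i += extra + 1;
            if (cp < 0x10000) {
                storage.push_back(static_cast<char16_t>(cp));
            } else {
                // Code points above U+FFFF become a surrogate pair.
                cp -= 0x10000;
                storage.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                storage.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return std::u16string_view(storage.data(), storage.size());
    }

private:
    std::string rep_;
};
```

The caller always passes a scratch buffer, but only pays for a copy when the requested encoding differs from the one the platform chose. The returned view is valid only while the string and the scratch buffer are both alive and unmodified.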
Regardless, I expect it'll be easier to get an algorithms library
through than a string type, especially for algorithms that are
justified by appearing in both a Unicode Report and ICU. Also be sure
to synchronize with other existing papers, for example
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html,
which is a response to the comments at
http://wiki.edg.com/twiki/bin/view/Wg21portland2012/LibraryWorkingGroup#Afternoon_AN2.
Another thing the ICU folks I spoke to said was that it would be
useful for efficiency to allow users to pass a maximum code point to
each algorithm. Some implementations can run much faster if they know
their whole input is under U+100 than if they have to handle
everything up to U+10FFFF. Many users won't have a maximum code point
to pass, but, for example, the string class mentioned in option (1)
has to store it anyway, and can benefit from the ability to pass it.
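As an illustration of the maximum-code-point idea, here is a small sketch. The function names are hypothetical and not from ICU or any proposal; the white-space sets are the code points carrying the Unicode White_Space property, split at U+100. The first two helpers show what a string like the one in option (1) would track, and the third shows an algorithm taking the bound as an optional hint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Largest code point in the sequence, or 0 for an empty one.
inline char32_t max_code_point(const std::u32string& s) {
    return s.empty() ? U'\0' : *std::max_element(s.begin(), s.end());
}

// Bytes per code point that the option (1) representation would need.
inline std::size_t storage_width(char32_t max_cp) {
    if (max_cp <= 0xFF) return 1;
    if (max_cp <= 0xFFFF) return 2;
    return 4;
}

// White_Space code points below U+100.
inline bool is_space_below_100(char32_t c) {
    return (c >= 0x09 && c <= 0x0D) || c == 0x20 || c == 0x85 || c == 0xA0;
}

// All White_Space code points.
inline bool is_space_full(char32_t c) {
    return is_space_below_100(c) || c == 0x1680 ||
           (c >= 0x2000 && c <= 0x200A) || c == 0x2028 || c == 0x2029 ||
           c == 0x202F || c == 0x205F || c == 0x3000;
}

// Hypothetical algorithm taking a caller-supplied upper bound on the input.
// Callers with no bound to offer use the default, U+10FFFF.
inline std::size_t count_white_space(const std::u32string& s,
                                     char32_t max_cp = 0x10FFFF) {
    if (max_cp < 0x100) {
        // Fast path: only the handful of checks that can match below U+100.
        return static_cast<std::size_t>(
            std::count_if(s.begin(), s.end(), is_space_below_100));
    }
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), is_space_full));
}
```

The gain in this toy case is small; the same dispatch would matter more for algorithms whose full-range version needs large property tables.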
HTH,
Jeffrey
On Sat, Apr 20, 2013 at 4:08 PM, Michał Dominiak <griwes@griwes.info> wrote:
> Wait a second, are you telling us that they want unicode strings to have one
> specific encoding a user has totally no control over, and that writing an
> application using, say, both UTF-8 and UTF-32 would not be possible? If I
> understood what you wrote correctly, that would render the entire proposal
> quite useless, and there would be no point in working on it.
>
> If I misunderstood something, what would that "forcing the user into using
> one implementation-defined encoding" mean?
>
>
> On Saturday, 20 April 2013 14:21:20 UTC+2, DeadMG wrote:
>>
>> The LEWG liked the approach of offering new generic Unicode algorithms.
>> They did not like the encoded_string class and especially did not like the
>> flexibility of its encoding and that it exposed the encoding to the user.
>> The LEWG's recommendation was to split the string and algorithms into two
>> papers, and force the user into using one implementation-defined encoding,
>> and simply provide conversion to and from the existing string mechanisms.
>> Then it will be easy to pass the algorithms separately through the
>> Committee.
>>
>> I intend to add to this thread with a draft of both revisions later.
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 20 Apr 2013 09:54:34 -0700 (PDT)
On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>
> The idea we wanted for a data type was that users could use a single
> type to represent unicode characters, rather than a separate type for
> each external encoding. This asserts that it's a good idea to
> transcode data as it enters or exits the system and use a specific
> encoding inside the system, rather than propagating variable encodings
> throughout.
By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
should just have had a single `u` type that would convert it into *some*
Unicode representation, depending on the platform.
It is a good idea to "transcode data as it enters or exits the system," but
only as part of the use of a *known* encoding of string data. If 95% of my
data coming into the system is UTF-8, I shouldn't have to transcode it to
UTF-16 or whatever that this string type wants to use. I should be using
UTF-8 internally, because that's what most of my data is. Having a
dedicated string type that can use UTF-8 is a big part of that. Without
having a specific type for that, I have no recourse other than
`vector<unsigned char>` if I actually want a UTF-8 string.
I'm guessing the counter-argument will be that you could just do `using
utf8_string = vector<unsigned char>;`. But that doesn't work, because it's
*still* a `vector<unsigned char>`. It will behave no differently than any
other `vector<unsigned char>`.
If we had support for strong typedefs, then I might be OK with that. But
otherwise, we need an actual type which can be different from `vector`, so
that it doesn't accidentally participate in overload resolution with it. We
need a type that I can pass to iostreams and have it understand what I'm
doing.
We need a real *type* for strings of a known, specific encoding.
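A short sketch of the difference between an alias and a distinct type, for anyone who wants to see it compile. The names `utf8_alias`, `utf8_string` and `describe` are made up for this example; the wrapper struct is just the simplest possible distinct type, not a proposed design.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// An alias introduces a new name, not a new type.
using utf8_alias = std::vector<unsigned char>;
static_assert(std::is_same<utf8_alias, std::vector<unsigned char>>::value,
              "the alias is the very same type as the vector");

// Hypothetical wrapper: a genuinely distinct type holding the same bytes.
struct utf8_string {
    std::vector<unsigned char> bytes;
};
static_assert(!std::is_same<utf8_string, std::vector<unsigned char>>::value,
              "the wrapper is a different type");

// Overloads can tell the wrapper apart from a plain byte vector, but a
// separate overload for utf8_alias would be a redefinition of the first one.
inline std::string describe(const std::vector<unsigned char>&) {
    return "raw bytes";
}
inline std::string describe(const utf8_string&) {
    return "UTF-8 text";
}
```

Calling `describe` with a `utf8_alias` selects the raw-bytes overload, which is exactly the accidental participation in overload resolution complained about above; only the wrapper reaches the UTF-8 overload.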
I don't mind having a single type for paths, because paths are a very
specialized case. They're not just arbitrary strings; they're strings with
a purpose. That purpose being interfacing with the host filesystem.
Therefore, the encoding will be whatever is most efficient for the
platform. That's fine.
But that's the only platform-facing interface that C++ deals with. There is
no reason to take away from users the knowledge of how a string is encoded.
Really, what's the point of a Unicode string type if you don't even know
how it's encoded? What can you do with it? You can't give it to some API to
use, because right now, every existing C++ API that uses Unicode in any way
expects a *specific* encoding of Unicode, or provides several options for
encodings. So all of those APIs are either completely unusable or you have
to copy and convert the string type.
Why should I waste performance doing a pointless copy, when I gave the
Unicode string UTF-8 encoding, and the API I want to hand it to uses UTF-8
encoding?
This kind of string is very internally locked. It's a needless performance
hole for anyone who doesn't already use it.
*C++ is not Python.* Stop trying to turn it into a low-rent version of
Python. We don't use C++ because it's easy; we use it because it is
*powerful*. We shouldn't throw away power just to allow slightly easier
usage. We don't need a one-size-fits-all Unicode string. Give us *choices*.
It's sad that the C++ standards committee of 2013 doesn't see the simple
wisdom in doing what the C++ standards committee of 1998 did in having
`basic_string` be a template based on a character type.
> There are a couple options for implementing that:
>
> 1) Take Python 3's approach of representing strings in UTF-32, with an
> optimization for strings with no code points above 255 (store in 1
> byte each) and another optimization for strings with no code points
> above 65535 (store in 2 bytes each). Optionally, if the unicode string
> is a cord/rope type, then each segment can have the optimization
> applied independently.
>
> 2) Store the string in UTF-8 or UTF-16 depending on the platform (or
> another encoding for less common platforms), and provide, say,
> "array_view<const char> as_utf8(std::vector<char>& storage);" and
> "array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
> accessors: these would copy the data if it's stored in the other
> encoding, or return a reference to the internal storage if it's
> already in the desired encoding. Implementations would be free to
> define other accessors, but I suspect these are all the standard
> needs.
>
> Option 1 has the benefit of allowing random-access iterators, or at
> least indexing, which the ICU folks I spoke to thought would be
> useful. Option 2 has the benefit of allowing some, maybe most,
> external interactions without copying.
>
These two options represent two entirely different classes with entirely
different internal data representations and entirely different performance
characteristics. We shouldn't allow any one class to be able to be
implemented in such a widely varying way.
It'd be like if we had a single map type which could be a hashtable, a
sorted vector, or a tree. How could anyone use it and know what they're
getting? Why wouldn't you want separate classes with separate
implementations of mapping, for separate circumstances?
.
Author: Jeffrey Yasskin <jyasskin@google.com>
Date: Sat, 20 Apr 2013 18:57:22 +0200
On Sat, Apr 20, 2013 at 6:54 PM, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>>
>> The idea we wanted for a data type was that users could use a single
>> type to represent unicode characters, rather than a separate type for
>> each external encoding. This asserts that it's a good idea to
>> transcode data as it enters or exits the system and use a specific
>> encoding inside the system, rather than propagating variable encodings
>> throughout.
>
>
> By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> should just have had a single `u` type that would convert it into some
> Unicode representation, depending on the platform.
>
> It is a good idea to "transcode data as it enters or exits the system," but
> only as part of the use of a known encoding of string data. If 95% of my
> data coming into the system is UTF-8, I shouldn't have to transcode it to
> UTF-16 or whatever that this string type wants to use. I should be using
> UTF-8 internally, because that's what most of my data is. Having a dedicated
> string type that can use UTF-8 is a big part of that. Without having a
> specific type for that, I have no recourse other than `vector<unsigned
> char>` if I actually want a UTF-8 string.
>
> I'm guessing the counter-argument will be that you could just do `using
> utf8_string = vector<unsigned char>;`. But that doesn't work, because it's
> still a `vector<unsigned char>`. It will behave no differently than any
> other `vector<unsigned char>`.
>
> If we had support for strong typedefs, then I might be OK with that. But
> otherwise, we need an actual type which can be different from `vector`, so
> that it doesn't accidentally participate in overload resolution with it. We
> need a type that I can pass to iostreams and have it understand what I'm
> doing.
>
> We need a real type for strings of a known, specific encoding.
>
> I don't mind having a single type for paths, because paths are a very
> specialized case. They're not just arbitrary strings; they're strings with a
> purpose. That purpose being interfacing with the host filesystem. Therefore,
> the encoding will be whatever is most efficient for the platform. That's
> fine.
>
> But that's the only platform-facing interface that C++ deals with. There is
> no reason to take away from users the knowledge of how a string is encoded.
>
> Really, what's the point of a Unicode string type if you don't even know how
> it's encoded? What can you do with it? You can't give it to some API to use,
> because right now, every existing C++ API that uses Unicode in any way
> expects a specific encoding of Unicode, or provides several options for
> encodings. So all of those APIs are either completely unusable or you have
> to copy and convert the string type.
That's why the algorithms library should come first.
> Why should I waste performance doing a pointless copy, when I gave the
> Unicode string UTF-8 encoding, and the API I want to hand it to uses UTF-8
> encoding?
>
> This kind of string is very internally locked. It's a needless performance
> hole for anyone who doesn't already use it.
>
> C++ is not Python. Stop trying to turn it into a low-rent version of Python.
> We don't use C++ because it's easy; we use it because it is powerful. We
> shouldn't throw away power just to allow slightly easier usage. We don't
> need a one-size-fits-all Unicode string. Give us choices.
>
> It's sad that the C++ standards committee of 2013 doesn't see the simple
> wisdom in doing what the C++ standards committee of 1998 did in having
> `basic_string` be a template based on a character type.
>
>>
>> There are a couple options for implementing that:
>>
>> 1) Take Python 3's approach of representing strings in UTF-32, with an
>> optimization for strings with no code points above 255 (store in 1
>> byte each) and another optimization for strings with no code points
>> above 65535 (store in 2 bytes each). Optionally, if the unicode string
>> is a cord/rope type, then each segment can have the optimization
>> applied independently.
>>
>> 2) Store the string in UTF-8 or UTF-16 depending on the platform (or
>> another encoding for less common platforms), and provide, say,
>> "array_view<const char> as_utf8(std::vector<char>& storage);" and
>> "array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
>> accessors: these would copy the data if it's stored in the other
>> encoding, or return a reference to the internal storage if it's
>> already in the desired encoding. Implementations would be free to
>> define other accessors, but I suspect these are all the standard
>> needs.
>>
>> Option 1 has the benefit of allowing random-access iterators, or at
>> least indexing, which the ICU folks I spoke to thought would be
>> useful. Option 2 has the benefit of allowing some, maybe most,
>> external interactions without copying.
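As an illustration of Option 2 above, here is a minimal sketch of the two accessors. It assumes the implementation happens to store UTF-8, so `as_utf8` returns a view of the internal buffer and `as_utf16` transcodes into the caller's storage. The names `unicode_string` and `array_view` are hypothetical stand-ins, and the decoder assumes well-formed input; this is not an interface from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the array_view in the quote: a non-owning
// pointer + length pair.
template <typename T>
struct array_view {
    T* data;
    std::size_t size;
};

// Hypothetical string type that happens to store UTF-8 internally.
class unicode_string {
public:
    explicit unicode_string(std::string utf8) : bytes_(std::move(utf8)) {}

    // Requested encoding matches the stored one: no copy, storage untouched.
    array_view<const char> as_utf8(std::vector<char>& /*storage*/) const {
        return {bytes_.data(), bytes_.size()};
    }

    // Requested encoding differs: transcode into the caller's storage.
    // Assumes bytes_ holds well-formed UTF-8.
    array_view<const char16_t> as_utf16(std::vector<char16_t>& storage) const {
        storage.clear();
        std::size_t i = 0;
        while (i < bytes_.size()) {
            const unsigned char lead = static_cast<unsigned char>(bytes_[i]);
            const int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
            for (int k = 1; k < len; ++k) {
                cp = (cp << 6) | (static_cast<unsigned char>(bytes_[i + k]) & 0x3F);
            }
            i += len;
            if (cp < 0x10000) {
                storage.push_back(static_cast<char16_t>(cp));
            } else {
                // Code points above the BMP become a surrogate pair.
                cp -= 0x10000;
                storage.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                storage.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return {storage.data(), storage.size()};
    }

private:
    std::string bytes_;
};
```

The cost asymmetry is the point under debate: one accessor is free and the other is linear in the string length, and the caller cannot tell which is which without knowing the stored encoding.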
>
>
> These two options represent two entirely different classes with entirely
> different internal data representations and entirely different performance
> characteristics. We shouldn't allow any one class to be able to be
> implemented in such a widely varying way.
>
> It'd be like if we had a single map type which could be a hashtable, a
> sorted vector, or a tree. How could anyone use it and know what they're
> getting? Why wouldn't you want separate classes with separate
> implementations of mapping, for separate circumstances?
>
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 20 Apr 2013 10:11:57 -0700 (PDT)
Raw View
------=_Part_616_3929754.1366477917211
Content-Type: text/plain; charset=ISO-8859-1
On Saturday, April 20, 2013 9:57:22 AM UTC-7, Jeffrey Yasskin wrote:
>
> On Sat, Apr 20, 2013 at 6:54 PM, Nicol Bolas <jmck...@gmail.com>
> wrote:
> > On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
> >>
> >> The idea we wanted for a data type was that users could use a single
> >> type to represent unicode characters, rather than a separate type for
> >> each external encoding. This asserts that it's a good idea to
> >> transcode data as it enters or exits the system and use a specific
> >> encoding inside the system, rather than propagating variable encodings
> >> throughout.
> >
> >
> > By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> > should just have had a single `u` type that would convert it into some
> > Unicode representation, depending on the platform.
> >
> > It is a good idea to "transcode data as it enters or exits the system,"
> > but only as part of the use of a known encoding of string data. If 95% of
> > my data coming into the system is UTF-8, I shouldn't have to transcode it
> > to UTF-16 or whatever that this string type wants to use. I should be using
> > UTF-8 internally, because that's what most of my data is. Having a
> > dedicated string type that can use UTF-8 is a big part of that. Without
> > having a specific type for that, I have no recourse other than
> > `vector<unsigned char>` if I actually want a UTF-8 string.
> >
> > I'm guessing the counter-argument will be that you could just do `using
> > utf8_string = vector<unsigned char>;`. But that doesn't work, because it's
> > still a `vector<unsigned char>`. It will behave no differently than any
> > other `vector<unsigned char>`.
> >
> > If we had support for strong typedefs, then I might be OK with that. But
> > otherwise, we need an actual type which can be different from `vector`, so
> > that it doesn't accidentally participate in overload resolution with it. We
> > need a type that I can pass to iostreams and have it understand what I'm
> > doing.
> >
> > We need a real type for strings of a known, specific encoding.
> >
> > I don't mind having a single type for paths, because paths are a very
> > specialized case. They're not just arbitrary strings; they're strings with
> > a purpose. That purpose being interfacing with the host filesystem.
> > Therefore, the encoding will be whatever is most efficient for the
> > platform. That's fine.
> >
> > But that's the only platform-facing interface that C++ deals with. There is
> > no reason to take away from users the knowledge of how a string is encoded.
> >
> > Really, what's the point of a Unicode string type if you don't even know
> > how it's encoded? What can you do with it? You can't give it to some API to
> > use, because right now, every existing C++ API that uses Unicode in any way
> > expects a specific encoding of Unicode, or provides several options for
> > encodings. So all of those APIs are either completely unusable or you have
> > to copy and convert the string type.
>
> That's why the algorithms library should come first.
>
No, it shouldn't.
Algorithms are a vital tool for actually doing stuff with Unicode text. But
they aren't *everything*. Unicode algorithms without the string type are
like STL algorithms without the STL containers: a fine and useful idea
certainly, but there's *clearly* something missing.
Right now, what we have are dozens of string types littered across dozens
of different products, all taking their own specific encodings of Unicode.
If we don't provide a string type that *could* replace all of those, then
there's no possibility for that type to ever actually do so. Yes, it's
highly unlikely that it would. But it *certainly* won't happen if we don't
provide one.
We need a type that encapsulates the rules of Unicode. We need a type that
can concatenate, subdivide, and do all of the other things we need for
Unicode strings. We should not be *encouraging* the continued mass of
string types by providing all of the tools to make one, but then not
actually making that type.
That's why I would prefer that this proposal *not* be divided. It
effectively holds the useful algorithms hostage, forcing the committee to
either not have the algorithms, or to actually put in the work in getting a
solid Unicode string type together. By dividing it, you make it easy to
pass one while letting the other languish.
Algorithms are important, yes. But so is an actual Unicode string type.
They are both necessary and essential parts of a solid Unicode system.
This notion that the committee seems to have of just getting some of the
way to the goal is the easiest way to fail to achieve that goal. If you
only take half-steps, you'll never get where you're going.
------=_Part_616_3929754.1366477917211--
.
Author: Daniel Krügler <daniel.kruegler@gmail.com>
Date: Sat, 20 Apr 2013 21:12:24 +0200
Raw View
2013/4/20 Nicol Bolas <jmckesson@gmail.com>:
> On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>>
>> The idea we wanted for a data type was that users could use a single
>> type to represent unicode characters, rather than a separate type for
>> each external encoding. This asserts that it's a good idea to
>> transcode data as it enters or exits the system and use a specific
>> encoding inside the system, rather than propagating variable encodings
>> throughout.
>
> By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> should just have had a single `u` type that would convert it into some
> Unicode representation, depending on the platform.
I agree that this would have been the most natural thing if the new
character types had been there from the beginning. But many (most?)
code bases use quite different containers for such character types,
e.g. wchar_t or short, depending on the purpose. In addition, you
often have to respect the API of a useful third-party library that
expects some such character-like type, and it would be quite annoying if
I had to pay the conversion costs just because the standard
restricts me to a single type. This does not mean that user code is
encouraged to use types other than the new ones.
> It is a good idea to "transcode data as it enters or exits the system," but
> only as part of the use of a known encoding of string data. If 95% of my
> data coming into the system is UTF-8, I shouldn't have to transcode it to
> UTF-16 or whatever that this string type wants to use. I should be using
> UTF-8 internally, because that's what most of my data is. Having a dedicated
> string type that can use UTF-8 is a big part of that. Without having a
> specific type for that, I have no recourse other than `vector<unsigned
> char>` if I actually want a UTF-8 string.
Sure. And I would recommend using the ones intended for what you need.
But this does not mean that the library itself should only accept the
new character types. This would IMO strongly reduce the acceptance of
these functions. Usually the library and the language don't try to
enforce a particular idiom, so as to keep the functionality of broader
interest.
I think it makes a lot of sense to start with algorithms here and
then (possibly) consider a stronger character type, if there is some
convincing desire for one.
> If we had support for strong typedefs, then I might be OK with that.
I don't think that we should make such library design decisions
*depend* on some core language feature. I would express it
the other way around and say: *If* strong typedefs existed, the need
for a specific encoding-aware type would decrease considerably.
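To make the alias-versus-type point concrete, here is a small sketch. The names `utf8_alias`, `utf8_string` and `describe` are hypothetical and chosen only for illustration: a `using` alias names the same type as `vector<unsigned char>`, while even a trivial wrapper struct is a distinct type that overload resolution can tell apart.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// A plain alias introduces no new type.
using utf8_alias = std::vector<unsigned char>;

// Hypothetical wrapper: a distinct type that merely owns the same bytes.
struct utf8_string {
    std::vector<unsigned char> bytes;
};

// Overloads can distinguish the wrapper from a raw byte vector,
// but an alias always selects the vector overload.
inline std::string describe(const std::vector<unsigned char>&) { return "raw bytes"; }
inline std::string describe(const utf8_string&) { return "UTF-8 text"; }

static_assert(std::is_same<utf8_alias, std::vector<unsigned char>>::value,
              "the alias is the very same type");
static_assert(!std::is_same<utf8_string, std::vector<unsigned char>>::value,
              "the wrapper is a different type");
```

A strong typedef, if the language had one, would give the second behaviour without hand-writing the wrapper.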
> We need a real type for strings of a known, specific encoding.
I'm not denying this, but I also don't see that the two decisions are
dependent on each other.
> It's sad that the C++ standards committee of 2013 doesn't see the simple
> wisdom in doing what the C++ standards committee of 1998 did in having
> `basic_string` be a template based on a character type.
Please get this right: I don't think that there is a fundamental
"no-interest-in-this" position; it is just that the current
interest is much stronger in an algorithm library. The committee usually
prefers to start with an often asked-for subset of useful
functionality, because there is always some risk and a lot of work in
integrating things into the library. The initial step often makes it obvious
that natural interactions with other parts of the library exist or
would be very desirable, which would cause further API adaptations here
and there.
- Daniel
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 20 Apr 2013 15:23:09 -0700 (PDT)
Raw View
------=_Part_786_26257888.1366496589449
Content-Type: text/plain; charset=ISO-8859-1
On Saturday, April 20, 2013 12:12:24 PM UTC-7, Daniel Krügler wrote:
>
> 2013/4/20 Nicol Bolas <jmck...@gmail.com>:
> > On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
> >>
> >> The idea we wanted for a data type was that users could use a single
> >> type to represent unicode characters, rather than a separate type for
> >> each external encoding. This asserts that it's a good idea to
> >> transcode data as it enters or exits the system and use a specific
> >> encoding inside the system, rather than propagating variable encodings
> >> throughout.
> >
> > By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> > should just have had a single `u` type that would convert it into some
> > Unicode representation, depending on the platform.
>
> I agree that this would be the most natural thing, when the new
> character types had been introduced from begin with.
There seems to be a misunderstanding. I was posting an absurdity for the
purpose of holding the idea of a "one string fits all" solution to *ridicule*
via analogy.
Using a platform-defined encoding for general strings is a bad idea,
whether it's in a string literal or a string class.
The idea that we shouldn't be able to declare literals in whatever Unicode
encoding we desire, that we should just accept some platform-specific
default is just... wrong. It's not natural and it's highly unnecessary;
it's performance killing, because the actual result depends entirely on the
platform, and the user has no way to change it. Switching from one platform
to another can degrade performance through a lot of pointless re-encoding.
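For reference, a short sketch of what the three literal prefixes give you today: the same text, written three ways, yields a different and fully predictable number of code units on every platform. The helper `code_units` is a hypothetical name introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 (e with acute accent) lies inside the BMP.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8: two bytes");
static_assert(code_units(u"\u00E9") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u00E9") == 1, "UTF-32: one code unit");

// U+1F600 lies outside the BMP.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four bytes");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

With a single platform-chosen literal encoding, none of these counts could be asserted portably.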
------=_Part_786_26257888.1366496589449--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 20 Apr 2013 15:24:19 -0700 (PDT)
Raw View
------=_Part_470_16769680.1366496659791
Content-Type: text/plain; charset=ISO-8859-1
I think that there could be a middle ground to be found here. I was
reviewing the original paper, and the only place that the encoding
parameter was actually *used* in the interface was to specify the return
value for C string interoperation. If, instead, I changed that so you could
request a C string of any encoding type from any encoding (so for example
c_str() was a template), as is supported by the original traits design,
then that would make a polymorphic encoding possible, and the stored
encoding would also be non-observable, except perhaps in the complexity of
requesting a C string.
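A rough sketch of what that could look like, under several assumptions: the traits structs `utf8`, `utf16` and `utf32` are hypothetical and not the traits from the paper, the stored encoding is arbitrarily taken to be UTF-32, and `c_str` returns an owning `basic_string` rather than a raw pointer to keep the example short.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding traits: a code unit type plus an encoder for one
// code point.
struct utf8 {
    using char_type = char;
    static void encode(char32_t cp, std::string& out) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

struct utf16 {
    using char_type = char16_t;
    static void encode(char32_t cp, std::u16string& out) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
};

struct utf32 {
    using char_type = char32_t;
    static void encode(char32_t cp, std::u32string& out) { out += cp; }
};

class unicode_string {
public:
    explicit unicode_string(std::u32string code_points)
        : code_points_(std::move(code_points)) {}

    // The caller names the encoding wanted; the stored one never appears
    // in the interface.
    template <typename Encoding>
    std::basic_string<typename Encoding::char_type> c_str() const {
        std::basic_string<typename Encoding::char_type> out;
        for (char32_t cp : code_points_) Encoding::encode(cp, out);
        return out;
    }

private:
    std::u32string code_points_;  // private detail; could be any encoding
};
```

In this sketch every request copies; an implementation that wanted the no-copy case would have to return a view when the requested and stored encodings match, which is exactly where the stored encoding becomes observable through complexity.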
------=_Part_470_16769680.1366496659791--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sun, 21 Apr 2013 05:55:00 -0700 (PDT)
Raw View
------=_Part_1299_10908844.1366548900158
Content-Type: text/plain; charset=ISO-8859-1
I think there are several issues here:
(1) platform encoding;
(2) application encoding;
(3) external file encoding.
Depending on its needs, the application encoding should mainly provide the
following representations:
-- char for ASCII (UTF-8 for codes <= 127);
-- UTF-16 for Unicode (95%);
-- UTF-32 for Unicode (rare cases).
UTF-8 is not practical for internal representation for characters with
codes over 127: string comparison won't work.
The library should cope with conversions between all these encodings:
platform<-> application, file<-> application.
If I were writing full Unicode support I would use UTF-32 for application
encoding, although it is a bit expensive.
On Saturday, April 20, 2013 11:24:19 PM UTC+1, DeadMG wrote:
> I think that there could be a middle ground to be found here. I was
> reviewing the original paper, and the only place that the encoding
> parameter was actually *used* in the interface was to specify the return
> value for C string interoperation. If, instead, I changed that so you could
> request a C string of any encoding type from any encoding (so for example
> c_str() was a template), as is supported by the original traits design,
> then that would make a polymorphic encoding possible- and also the stored
> encoding would be non-observable, except perhaps in the complexity of
> requesting a C string.
------=_Part_1299_10908844.1366548900158--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 21 Apr 2013 06:30:46 -0700 (PDT)
Raw View
------=_Part_1406_32449354.1366551046780
Content-Type: text/plain; charset=ISO-8859-1
Unicode string comparison works fine with UTF-8. You cannot use
basic_string::operator== or strcmp on *any* Unicode encoding unless you're
effectively only storing ASCII. Even then, I'm not sure it's really valid.
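A minimal illustration of why code-unit equality is not Unicode equality in any encoding: the same user-visible "é" can be written as one precomposed code point or as "e" plus a combining accent. The two forms are canonically equivalent, yet `operator==` reports them as different in UTF-8 and in UTF-32 alike. The variable names are illustrative only.

```cpp
#include <cassert>
#include <string>

// "é" as the single precomposed code point U+00E9.
const std::string    precomposed_utf8  = "\xC3\xA9";
const std::u32string precomposed_utf32 = U"\u00E9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT.
const std::string    decomposed_utf8  = "e\xCC\x81";
const std::u32string decomposed_utf32 = U"e\u0301";

// Compares code units only; knows nothing about canonical equivalence.
template <typename String>
bool same_code_units(const String& a, const String& b) { return a == b; }
```

Treating the two spellings as equal requires normalizing both sides first, which is the kind of work an algorithms library would do regardless of the storage encoding.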
------=_Part_1406_32449354.1366551046780--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sun, 21 Apr 2013 14:54:41 +0100
Raw View
--001a11c20b8237b34404dadf4b68
Content-Type: text/plain; charset=ISO-8859-1
(1) In an application, surely you'd like to see one string element per
Unicode character, not several (unless you don't care and pass to the
system APIs).
(2) As for comparison, of course there are language specific issues,
especially with accent characters (like in French) and some letters (like
"yo" in Russian), but at least comparison
by Unicode code should work, which is not perfect; there are
culture-specific issues.
(3) There is also probably an issue between the program text encoding
(which is often UTF-8 and can be UTF-16) and the application encoding.
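
Point (1) can be illustrated with a minimal sketch. The helper name
count_code_points is made up for this example, and it assumes the input is
well-formed UTF-8 (no validation is attempted). It shows that the number of
string elements in a UTF-8 std::string differs from the number of code points:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a well-formed UTF-8
// string by skipping continuation bytes (bit pattern 10xxxxxx).
// No validation is performed; malformed input gives a meaningless count.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the Russian letter "yo" (U+0451), encoded in UTF-8 as the two bytes
0xD1 0x91, std::string::size() reports 2 elements while the helper reports
1 code point.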
--001a11c20b8237b34404dadf4b68--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 21 Apr 2013 07:21:19 -0700 (PDT)
Raw View
------=_Part_1435_23021433.1366554080227
Content-Type: text/plain; charset=ISO-8859-1
>
> comparison by Unicode code should work, which is not perfect; there are
> culture-specific issues.
It doesn't. The only correct comparison is to use a Unicode-specific
normalizing comparison operation. You cannot do *any* comparison with
un-normalized data.
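
A minimal sketch of why code point comparison alone is not enough. The
function name code_point_equal is made up for illustration; it does only what
basic_string::operator== does. The two strings below are canonically
equivalent (both render as "e" with an acute accent) yet compare unequal:

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (its NFC form).
const std::u32string precomposed = U"\u00E9";
// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT (its NFD form).
const std::u32string decomposed = U"e\u0301";

// Hypothetical helper: plain element-by-element code point comparison,
// with no normalization step.
bool code_point_equal(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

A comparison that respects canonical equivalence would first bring both
inputs to the same normal form, which this sketch deliberately does not do.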
------=_Part_1435_23021433.1366554080227--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 21 Apr 2013 09:02:26 -0700 (PDT)
Raw View
------=_Part_1406_8769531.1366560146130
Content-Type: multipart/alternative;
boundary="----=_Part_1407_18451695.1366560146131"
------=_Part_1407_18451695.1366560146131
Content-Type: text/plain; charset=ISO-8859-1
I have attached a draft of a new revision. I think that this new version
should address at least some of the concerns about the previous variant. I
have looked at Beman's paper N3398; this paper almost entirely
supersedes it and will address all of the issues once basic_string is
adapted to feature encoded_string compatibility. I have also tuned the
algorithms interface and removed the case_insensitive stuff.
------=_Part_1407_18451695.1366560146131--
------=_Part_1406_8769531.1366560146130
Content-Type: text/html; charset=UTF-8; name=unicode.html
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=unicode.html
X-Attachment-Id: 922ebab4-08f8-471e-b66a-94654f9a95de
Content-ID: <922ebab4-08f8-471e-b66a-94654f9a95de>
<!DOCTYPE html>
<html lang="en">
<body>
<p>Document Number: Dnnnn</p>
<p>Date: 2012-11-05</p>
<p>Project: Programming Language C++, Library Working Group</p>
<p>Reply-to: wolfeinstein@gmail.com</p>
<h1>Strings Proposal</h1>
<h2>Introduction</h2>
<p>The purpose of this document is to propose new interfaces to support Unicode text, where the
existing interfaces are quite deficient. This document revises
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3572.html">N3572</a>, which was
discussed at Bristol. The feedback from the LEWG was that the paper should propose a single Unicode
string class; a template was not considered advantageous. In addition, the LEWG requested some small
changes to the algorithms.
</p>
<h2>Motivation and Scope</h2>
<p>This proposal is primarily motivated by two problems. The first is the overwhelming number of
string types: primitive, Standard and third-party. This mess of text types makes it impossible to
reliably hold string data. The second is the poor support for Unicode within the C++ Standard
library. Unicode is a complex topic, where correctness depends on the implementation of complex
algorithms by the user. This is only exacerbated by the problem of multiple string encodings and
poor conversion interfaces, which is why C++ is awash with third-party string types. This problem is
made even worse by the existence of unrelated types that need to hold string data, for example,
exceptions. The existing exception hierarchy is of significantly limited usefulness, as it cannot
hold Unicode exception data. This proposal aims to solve both these problems by offering
freestanding algorithms and a fresh string class which constitutes significant support for Unicode.
Throughout, Unicode means version 6.2, the most recent finalized version.
</p>
<p>This proposal is not currently in use and a reference implementation is still under construction.
However, there are numerous implementations of the various subcomponents, such as Unicode algorithms
and formatting routines. ICU implements virtually all of the proposed functionality and then some.
</p>
<h2>Impact on the Standard</h2>
<p>This paper currently depends on
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3525.pdf">N3525</a> for allocator
support, or some future revision. The usage of a polymorphic allocator is taken from that paper.
</p>
<h2>Design Decisions</h2>
<p>It was decided to make the string polymorphic in both encoding and allocator, to make it more
flexible and more useful as a vocabulary type. This more closely resembles what other languages do
for string types. The new scheme permits an implementation to dynamically change the string's
encoding in response to its contents.
</p>
<p>The algorithms were changed so that instead of outputting to an output iterator, they would
lazily evaluate in-place if possible. This is fairly difficult for the implementor, as many Unicode
algorithms need to know their end to function correctly. However, it provides better performance in
many cases and better ease-of-use for users, as they do not need to keep around many temporary
buffers.
</p>
<p>Another problem is posed by UTF-8. As u8 literals do not have a distinct type, it's almost
impossible to handle them correctly. There are other proposals for introducing char8_t and fixing
UTF-8 literals, and introducing std::u8string, but this proposal does not assume they are accepted.
Their acceptance would, however, be of significant benefit.
</p>
<p>Finally, the std namespace is becoming very overloaded. It was decided that it would be best to
split the components into subnamespaces. This not only aids with the organization of the library as
a whole, but also provides a clear difference between old and new components.
</p>
<h2>Technical Specification</h2>
<p>Currently, to avoid ambiguity, the specification is given as a series of declarations in
C++11.</p>
<p>For iterators, usually only the iterator category and return value of operator* are specified,
as the full specification of an iterator involves a lot of plumbing. If requested, these
specifications can be expanded to the full definition.</p>
<p>In header <unicode></p>
<pre>namespace std {
namespace unicode {
enum class normal_form {
nfc,
nfd,
nfkc,
nfkd
};
namespace policies {
struct throw_exception {};
template<char32_t> struct replacement_character {};
struct undefined_behaviour {};
struct discard {};
};
</pre>
<p>These policies define what happens when the encoding iterators encounter bad input. If the
throw_exception policy is specified, an exception shall be thrown of type std::runtime_error. If
replacement_character is specified, then the codepoint specified as the replacement character shall
be the replacement output. When converting from codepoints to codeunits, the encoding shall specify
a replacement character, and ignore the template parameter. The algorithm to determine how many
replacement characters are issued is part of the Unicode Standard. If the undefined_behaviour policy
is specified, then no validation shall take place, and if the input sequence is bad, then the
behaviour is undefined. If discard is specified, then bad input shall be silently discarded.
</p>
<p>An implementation shall provide at least the following encodings:
</p>
<pre>    enum encoding {
        utf8,
        utf16,
        utf32,
        wide,
        narrow,
        system
    };
</pre>
<p>The narrow encoding is the encoding used for narrow string literals, such as "hello". The wide
encoding is the encoding used for wide string literals, such as L"hello". An implementation has an
obligation to make each encoding a separate type, even if they represent the same logical encoding.
This is to permit overloading or specialization in portable code. The system encoding is an
implementation-defined default which shall be the encoding best used for interoperation with
platform APIs, especially operating system APIs, such as UTF-16 on Windows and UTF-8 on Unix. The
implementation may provide arbitrary additional encodings.
</p>
<pre>    template<typename Char> using encoding_of = implementation-defined;
</pre>
<p>The encoding_of template returns the assumed encoding of a string whose codeunit type is
<code>std::decay<Char>::type</code>. This shall be narrow where the decayed type is char, wide for
wchar_t, utf16 for char16_t, and utf32 for char32_t.</p>
<pre>    template<typename Iterator> using encoding_of_iterator = encoding_of<typename std::iterator_traits<Iterator>::value_type>;</pre>
<p>The string class is a container of Unicode codepoints. The treatment of the freestanding
algorithms as a range of Unicode codepoints means that any container of Unicode codepoints may be
used, but this class is provided as the minimal useful container. It may contain embedded null
characters.
</p>
<pre>    class encoded_string {
    public:
        using allocator_type = std::polyalloc::memory_resource*;
        using iterator = implementation_defined;
        using const_iterator = implementation_defined;
        using size_type = implementation_defined;
        using value_type = char32_t;</pre>
<p>Draft notes: There are probably loads of other typedefs and functions like crbegin and
criterator that I've forgotten. Kindly leave feedback mentioning these.</p>
<pre>        encoded_string(allocator_type alloc = nullptr);
        encoded_string(const encoded_string&, allocator_type alloc = nullptr);
        encoded_string(encoded_string&&);

        encoded_string(const char*, allocator_type alloc = nullptr);
        encoded_string(const wchar_t*, allocator_type alloc = nullptr);
        encoded_string(const char16_t*, allocator_type alloc = nullptr);
        encoded_string(const char32_t*, allocator_type alloc = nullptr);
        template<typename T, typename Traits, typename Allocator>
        encoded_string(const std::basic_string<T, Traits, Allocator>&, allocator_type alloc = nullptr);
        template<typename T, typename Traits, typename Allocator>
        encoded_string(std::basic_string<T, Traits, Allocator>&&);
        template<typename Iterator>
        encoded_string(Iterator, Iterator, allocator_type alloc = nullptr);

        template<typename Iterator>
        void assign(Iterator, Iterator) &;
        void assign(encoded_string&) &;
        void assign(encoded_string&&) &;
        encoded_string operator+(const encoded_string&) const;
        encoded_string operator+(encoded_string&&) const;
        encoded_string operator+(const char*) const;
        encoded_string operator+(const wchar_t*) const;
        encoded_string operator+(const char16_t*) const;
        encoded_string operator+(const char32_t*) const;
        template<typename T, typename Traits, typename Allocator>
        encoded_string operator+(const std::basic_string<T, Traits, Allocator>&) const;

        encoded_string& operator+=(const encoded_string&) &;
        encoded_string& operator+=(encoded_string&&) &;
        encoded_string& operator+=(const char*) &;
        encoded_string& operator+=(const wchar_t*) &;
        encoded_string& operator+=(const char16_t*) &;
        encoded_string& operator+=(const char32_t*) &;
        template<typename T, typename Traits, typename Allocator>
        encoded_string& operator+=(const std::basic_string<T, Traits, Allocator>&);
        encoded_string& operator=(const encoded_string&) &;
        encoded_string& operator=(encoded_string&&) &;
        encoded_string& operator=(const char*) &;
        encoded_string& operator=(const wchar_t*) &;
        encoded_string& operator=(const char16_t*) &;
        encoded_string& operator=(const char32_t*) &;
        template<typename T, typename Traits, typename Allocator>
        encoded_string& operator=(const std::basic_string<T, Traits, Allocator>&);
        allocator_type get_allocator() const;
        iterator begin() &;
        const_iterator begin() const &;
        const_iterator cbegin() const &;
        iterator end() &;
        const_iterator end() const &;
        const_iterator cend() const &;</pre>
<p>The iterator and const_iterator types are bidirectional iterators of Unicode codepoints. The
value_type is char32_t. The invalidation semantics of iterators shall be those of std::string.
Particularly, it is explicitly legal for iterators to refer to values inside the encoded_string
value itself, and thus move or swap may invalidate iterators.</p>
<pre>        void clear() &;
        bool empty() const;

        iterator erase(const_iterator where) &;
        iterator erase(const_iterator first, const_iterator last) &;
        void swap(encoded_string&) &;
        char32_t front() const;
        char32_t back() const;

        iterator insert(const_iterator where, char32_t codepoint) &;
        template<typename InputIterator>
        iterator insert(const_iterator where, InputIterator begin, InputIterator end) &;
        iterator insert(const_iterator where, const encoded_string&) &;
        iterator insert(const_iterator where, encoded_string&&) &;
        template<typename T, typename Traits, typename Alloc>
        iterator insert(const_iterator where, const basic_string<T, Traits, Alloc>&) &;
        void pop_back() &;
        void push_back(char32_t) &;
        void normalize(normal_form) &;
        void set_encoding(encoding) &;
        template<encoding> unspecified c_str() const;</pre>
<p>Each encoded_string has an internal encoding. If the provided encoding is the same as the
internal encoding, then the c_str() operation must complete in O(1). Else, it may be O(n). The
return value of the c_str() function is move-only. It may or may not own distinct resources. The
return value shall have an implicit conversion to const T*, where T is the character type of that
encoding. This pointer shall point to a null-terminated buffer containing the string's contents. It
shall also have a size() method that shall return the size of the contiguous buffer pointed to by
the result of the implicit conversion operator. This size shall include the null terminator. The
implementation does not have to have any internal encoding if the user did not request one through
set_encoding. The implementation has no obligation to propagate the user-requested encoding through
copies, although it does for moves, and it does not have to take on the encoding of input from
external sources.
</p>
<pre>    };
    bool operator<(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator<(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator<(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);

    bool operator==(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator==(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator==(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);

    bool operator<=(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator<=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator<=(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);

    bool operator>(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator>(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator>(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);

    bool operator>=(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator>=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator>=(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);

    bool operator!=(const encoded_string& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator!=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string& rhs);
    template<typename T, typename Traits, typename Alloc>
    bool operator!=(const encoded_string& lhs, const basic_string<T, Traits, Alloc>& rhs);</pre>
<p>For all primitive character types C in char, wchar_t, char16_t, and char32_t,</p>
<pre>    bool operator<(const encoded_string& lhs, const C* rhs);
    bool operator<(const C* lhs, const encoded_string& rhs);
    bool operator==(const encoded_string& lhs, const C* rhs);
    bool operator==(const C* lhs, const encoded_string& rhs);
    bool operator<=(const encoded_string& lhs, const C* rhs);
    bool operator<=(const C* lhs, const encoded_string& rhs);
    bool operator!=(const encoded_string& lhs, const C* rhs);
    bool operator!=(const C* lhs, const encoded_string& rhs);
    bool operator>(const encoded_string& lhs, const C* rhs);
    bool operator>(const C* lhs, const encoded_string& rhs);
    bool operator>=(const encoded_string& lhs, const C* rhs);
    bool operator>=(const C* lhs, const encoded_string& rhs);
</pre>
<p>These comparison operators behave as if the data in the lhs and the rhs were passed to the
respective iterator-based Unicode freestanding algorithm.</p>

<pre>    template<typename First, typename Second>
    bool less(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF, std::locale = std::locale());
    template<typename First, typename Second>
    bool less_or_equal(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF, std::locale = std::locale());
    template<typename First, typename Second>
    bool greater(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF, std::locale = std::locale());
    template<typename First, typename Second>
    bool greater_or_equal(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF, std::locale = std::locale());
    template<typename First, typename Second>
    bool equal(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF);
    template<typename First, typename Second>
    bool not_equal(First begin1, First end1, Second begin2, Second end2, char32_t max_codepoint = 0x10FFFF);</pre>
<p>These six algorithms implement Unicode comparison functionality on the Unicode codepoints
provided in the passed encodings. Canonical equivalence and collation are defined by the Unicode
Consortium. The comparison is performed at L3 or greater. Compatibility equivalence is not
permitted.</p>
<pre>    template<typename Iterator> std::pair<unspecified, unspecified>
    extended_grapheme_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
    template<typename Iterator> std::pair<unspecified, unspecified>
    word_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
    template<typename Iterator> std::pair<unspecified, unspecified>
    line_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
    template<typename Iterator> std::pair<unspecified, unspecified>
    sentence_boundaries(Iterator begin, Iterator end, std::locale = std::locale());</pre>
<p>The line boundary algorithm is defined in UAX #14 (http://www.unicode.org/reports/tr14/) and the
other three in UAX #29 (http://www.unicode.org/reports/tr29/). The input iterators are at least
forward iterators of Unicode codepoints. The boundary iterators all have a value_type which is
Iterator. This iterator is the position of the boundary.</p>
<pre>    template<typename Iterator> std::pair<unspecified, unspecified> normalize(Iterator begin, Iterator end, normal_form);
    template<typename T, typename Traits, typename Alloc>
    basic_string<T, Traits, Alloc> normalize(basic_string<T, Traits, Alloc>, normal_form);
    encoded_string normalize(const encoded_string&, normal_form);
    encoded_string normalize(encoded_string&&, normal_form);</pre>
<p>Implements normalization of the forward range over Unicode codepoints. The normal_form argument
indicates which normal form is requested.</p>
<pre>    template<typename encoding, typename Iterator> std::pair<unspecified, unspecified> encoding_convert(Iterator begin, Iterator end);
</pre>
<p>Implements encoding conversion. The source encoding is that indicated by the Iterator's
value_type.</p>
<pre>    template<typename Iterator, typename Policy> std::pair<unspecified, unspecified> validate(Iterator begin, Iterator end, Policy p);</pre>
<p>Implements encoding validation. When validation fails, the Policy dictates to the implementation
what action to take.</p>
<pre>    template<typename Char, typename CharT> std::basic_istream<Char, CharT>&
    operator>>(std::basic_istream<Char, CharT>&, encoded_string&);</pre>
<p>Reads until the next whitespace, as operator>>(std::istream&, std::string&) does. Shall perform
an encoding conversion if necessary.</p>
<pre>    template<typename Char, typename CharT> std::basic_ostream<Char, CharT>&
    operator<<(std::basic_ostream<Char, CharT>&, const encoded_string&);</pre>
<p>Writes the contents of the string to the stream. Shall perform an encoding conversion as
necessary.</p>
<pre> struct codepoint_properties {
string name;
bool is_numeric;
bool is_digit;
int digit_value;
bool is_decimal;
int decimal_value;
bool is_uppercase;
string uppercase;
bool is_lowercase;
string lowercase;
bool is_titlecase;
string titlecase;
bool is_alphabetic;
bool is_white_space;
bool is_control;
bool is_hex_digit;
bool is_ascii_hex_digit;
bool is_letter;
bool is_punctuation;
bool is_separator;
bool is_symbol;
bool is_quotation_mark;
bool is_dash;
bool is_diacritic;
bool is_mathematical;
bool is_ideographic;
bool is_defined;
bool is_noncharacter;
};
const codepoint_properties& properties(char32_t);</pre>
<p>Returns the properties of any given codepoint. These properties are defined by the Unicode
Standard, not here.</p>
<pre>    template<typename Iterator> std::pair<unspecified, unspecified> to_upper(Iterator begin, Iterator end);
    encoded_string to_upper(const encoded_string&);
    encoded_string to_upper(encoded_string&&);
    template<typename Char, typename Traits, typename Alloc>
    std::basic_string<Char, Traits, Alloc> to_upper(const std::basic_string<Char, Traits, Alloc>&, std::locale = std::locale());
    template<typename Iterator> std::pair<unspecified, unspecified> to_lower(Iterator begin, Iterator end);
    encoded_string to_lower(const encoded_string&);
    encoded_string to_lower(encoded_string&&);
    template<typename Char, typename Traits, typename Alloc>
    std::basic_string<Char, Traits, Alloc> to_lower(const std::basic_string<Char, Traits, Alloc>&, std::locale = std::locale());
    template<typename Iterator> std::pair<unspecified, unspecified> to_title(Iterator begin, Iterator end);
    encoded_string to_title(const encoded_string&);
    encoded_string to_title(encoded_string&&);
    template<typename Char, typename Traits, typename Alloc>
    std::basic_string<Char, Traits, Alloc> to_title(const std::basic_string<Char, Traits, Alloc>&, std::locale = std::locale());
    using encoded_regex = std::basic_regex<char32_t, implementation-defined>;</pre>
<p>A regular expression type suitable for matching Unicode which is encoded in the specified
encoding. The traits must support <a href="http://www.unicode.org/reports/tr18/">UTS-18</a> to at
least Level 2.</p>
<pre>
    template<typename Iterator> std::size_t hash(Iterator begin, Iterator end);
}
template<> struct hash<unicode::encoded_string> {
    std::size_t operator()(const unicode::encoded_string&) const;
};
}
</pre>
<p>Provides hashing functions for Unicode. As with any other Unicode function, two sequences which
are canonically equivalent must produce the same result. The specialization of std::hash shall
provide hashing functionality for any encoded_string. The free function unicode::hash shall be
available to hash any Unicode sequence, which is a pair of input iterators of codepoints.
</p>
<h2>Acknowledgements</h2>
<p>R. Martinho Fernandes gave significant assistance when dealing with some of the ins and outs of
Unicode.</p>
<h2>Revision History</h2>
<p>Changed encoded_string from template in encoding and allocator to non-template in both
parameters.</p>
<p>Changed algorithms to return a pair of iterators which may be lazily-evaluating, rather than
output parameters.</p>
</body>
</html>
------=_Part_1406_8769531.1366560146130--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 21 Apr 2013 16:21:09 -0700 (PDT)
Raw View
------=_Part_10_26641802.1366586469921
Content-Type: text/plain; charset=ISO-8859-1
On Saturday, April 20, 2013 3:24:19 PM UTC-7, DeadMG wrote:
>
> I think that there could be a middle ground to be found here. I was
> reviewing the original paper, and the only place that the encoding
> parameter was actually *used* in the interface was to specify the return
> value for C string interoperation. If, instead, I changed that so you could
> request a C string of any encoding type from any encoding (so for example
> c_str() was a template), as is supported by the original traits design,
> then that would make a polymorphic encoding possible- and also the stored
> encoding would be non-observable, except perhaps in the complexity of
> requesting a C string.
That's not a middle ground; that's not a compromise.
My position is *not* that I should be able to fetch a sequence of
characters in an arbitrary encoding. My position is that I should have *complete
control* over the encoding of the string.
If I'm using UTF-8 strings, I should not *have* to copy that string just to
pass it to a C API that takes a `const char*` of UTF-8 characters. The only
way I can guarantee that is if my Unicode string type is in UTF-8. And if I
don't have control over the encoding, then the string type is *worthless* to me.
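
As a rough sketch of the zero-copy guarantee described above: `utf8_string`
and `c_api_utf8_length` below are hypothetical names invented for this
example (they are not part of the proposal or of any library), with
`std::strlen` standing in for a C API that expects NUL-terminated UTF-8.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for a C API taking NUL-terminated UTF-8 text.
std::size_t c_api_utf8_length(const char* utf8) {
    return std::strlen(utf8);
}

// Hypothetical string type whose encoding is fixed to UTF-8 at compile time.
// It performs no validation; the caller promises the bytes are UTF-8.
class utf8_string {
public:
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {}

    // O(1) and allocation-free: returns a pointer into the string's own
    // storage, because the stored encoding is already the requested one.
    const char* c_str() const noexcept { return bytes_.c_str(); }
    const char* data() const noexcept { return bytes_.data(); }

private:
    std::string bytes_;
};
```

Because the encoding is part of the type, `c_str()` can always hand out the
internal buffer directly; a string with an implementation-chosen encoding can
only promise that when its internal encoding happens to match.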
There's no middle ground here. One side says, "I want the encoding to be
implementation-defined." The other side says, "I want direct control over
the encoding." You can't provide both without using type erasure and other
needlessly performance-damaging techniques.
And that's the problem with your attempted compromise proposal. There's no
way to implement that without type-erasure. And there's no way to do that
without making every iterator access and other command slower than it needs
to be.
Also, in your proposal, you forgot to add constructors that can take
encodings. Not every `const char*` is a narrow encoded string. And your
proposal doesn't represent a string that can work with *any* encoding. If I
want to work with UTF-EBCDIC or GB18030 or whatever, why shouldn't I?
Lastly, I wouldn't suggest calling the function to get the internal data
`c_str`. That has certain expectations about not copying data. It should
probably be `str`, since that function (at least in
`std::stringstream::str`) is expected to copy data.
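
The copying behaviour of `str` can be checked with a small sketch (the
function name `str_returns_copy` is made up for this example):

```cpp
#include <sstream>
#include <string>

// Returns true if mutating the result of str() leaves the stream's
// contents untouched, i.e. str() handed back an independent copy.
bool str_returns_copy() {
    std::ostringstream out;
    out << "abc";
    std::string copy = out.str();  // str() returns std::string by value
    copy[0] = 'X';                 // modify only the copy
    return out.str() == "abc";
}
```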
------=_Part_10_26641802.1366586469921
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Saturday, April 20, 2013 3:24:19 PM UTC-7, DeadMG wrote:<blockquote clas=
s=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #c=
cc solid;padding-left: 1ex;">I think that there could be a middle ground to=
be found here. I was reviewing the original paper, and the only place that=
the encoding parameter was actually <i>used</i> in the interface was =
to specify the return value for C string interoperation. If, instead, I cha=
nged that so you could request a C string of any encoding type from any enc=
oding (so for example c_str() was a template), as is supported by the origi=
nal traits design, then that would make a polymorphic encoding possible- an=
d also the stored encoding would be non-observable, except perhaps in the c=
omplexity of requesting a C string.</blockquote><div><br>That's not a middl=
e ground; that's not a compromise.<br><br>My position is <i>not</i> that I =
should be able to fetch a sequence of characters in an arbitrary encoding. =
My position is that I should have <i>complete control</i> over the encoding=
of the string.<br><br>If I'm using UTF-8 strings, I should not <i>have</i>=
to copy that string just to pass it to a C API that takes a `const char*` =
of UTF-8 characters. The only way I can guarantee that is if my Unicode str=
ing type is in UTF-8. And if I don't have control over the encoding, then t=
he string type is <i>worthless</i> to me.<br><br>There's no middle ground h=
ere. One side says, "I want the encoding to be implementation-defined." The=
other side says, "I want direct control over the encoding." You can't prov=
ide both without using type erasure and other needlessly performance-damagi=
ng techniques.<br><br>And that's the problem with your attempted compromise=
proposal. There's no way to implement that without type-erasure. And there=
's no way to do that without making every iterator access and other command=
slower than it needs to be.<br><br>Also, in your proposal, you forgot to a=
dd constructors that can take encodings. Not every `const char*` is a narro=
w encoded string. And your proposal doesn't represent a string that can wor=
k with <i>any</i> encoding. If I want to work with UTF-EBCDIC or GB18030 or=
whatever, why shouldn't I?<br><br>Lastly, I wouldn't suggest calling the f=
unction to get the internal data `c_str`. That has certain expectations abo=
ut not copying data. It should probably be `str`, since that function (at l=
east in `std::stringstream::str`) is expected to copy data.<br></div>
<p></p>
-- <br />
<br />
--- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals" group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.<br />
To post to this group, send email to std-proposals@isocpp.org.<br />
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/?hl=3Den">http://groups.google.com/a/isocpp.org/group/std-pro=
posals/?hl=3Den</a>.<br />
<br />
<br />
------=_Part_10_26641802.1366586469921--
.
Author: Jeffrey Yasskin <jyasskin@google.com>
Date: Mon, 22 Apr 2013 17:02:24 -0700
Raw View
Thanks for the update!
* The paper would be easier to read if it were divided into sections.
* I still suggest writing separate papers on encoded_string and the
processing algorithms. I'll skip encoded_string for now.
* If this paper is intended to supersede N3398, the paper should say
which of Beman's interfaces are covered by which of your interfaces.
* In the free comparison functions:
  * You should link to the relevant standard. I assume that's
  http://www.unicode.org/reports/tr10/? What's the interface of the
  matching ICU functions?
  * "The comparison is performed at L3 or greater" means that it's
  implementation-defined which level is actually used? Why is that the
  right decision?
  * How is the locale argument used?
  * If I have an implementation that can compare UTF-8 directly, faster
  than converting to code points and comparing those, how do I use it to
  implement this interface?
  * IIRC, it's possible to convert Unicode strings into sort keys for a
  particular collation order, and then compare those keys byte-wise,
  which can dramatically speed up sorting. Is that supported in your
  interface? If not, is it just a V2 feature, or do you think it's
  unnecessary?
* In the boundary finders: What's the meaning of the return type? How
do these compare to the equivalent ICU algorithms?
* normalize: Same question about UTF-8 input and output. What's the
return value of the range version?
* encoding_convert says that "The source encoding is that indicated by
the Iterator's value_type." There are more encodings than that. It
might make sense to handle normalization as part of conversion, since
both need to happen on entry to the system. This also needs to deal
with an output encoding.
* validate() is underspecified, and code points are probably the wrong
level at which to call it. For example, UTF-8 can be invalid, and the
iterators need to catch that.
* Where do codepoint_properties come from? Link each one to part of
the Unicode standard. Why is a reference to a big struct the right
interface?
* What sort of data size is needed to implement this? Can the data be
shared with an ICU installation? Can implementations for constrained
environments omit chunks of data at the programmer's discretion?
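On the sort-key question above: the standard library already has a hook of this shape in `std::collate::transform`, so a minimal sketch of the idea can be written without the proposed interface. This is only an illustration under that assumption; the function name `sort_by_collation_key` is made up, and `std::collate` is locale-based collation, not the UTS #10 algorithm the paper targets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Sort strings by precomputed collation keys instead of running the
// collation routine inside every comparison.
inline std::vector<std::string> sort_by_collation_key(std::vector<std::string> words,
                                                      const std::locale& loc) {
    typedef std::pair<std::string, std::string> keyed_word;  // (sort key, original)
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);

    std::vector<keyed_word> keyed;
    keyed.reserve(words.size());
    for (std::size_t i = 0; i < words.size(); ++i) {
        const std::string& w = words[i];
        // transform() returns a key whose plain lexicographic order matches
        // what coll.compare() would report for the original strings.
        keyed.push_back(keyed_word(coll.transform(w.data(), w.data() + w.size()), w));
    }

    // Only the keys are compared here; the locale is not consulted again.
    std::sort(keyed.begin(), keyed.end(),
              [](const keyed_word& a, const keyed_word& b) { return a.first < b.first; });

    for (std::size_t i = 0; i < words.size(); ++i) words[i] = keyed[i].second;
    return words;
}
```

Each key is computed once per string, and the sort then does cheap key comparisons, which is where the speed-up for large inputs would come from.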
Btw, which Shift-JIS characters were you saying didn't exist in Unicode?
On Sun, Apr 21, 2013 at 9:02 AM, DeadMG <wolfeinstein@gmail.com> wrote:
> I have attached a draft of a new revision. I think that this new version
> should address at least some of the concerns about the previous variant. I
> have looked at Beman's paper n3398, and this paper almost entirely
> supersedes that and will address all of the issues once basic_string is
> adapted to feature encoded_string compatibility. I have also tuned the
> algorithms interface and removed the case_insensitive stuff.
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Tue, 23 Apr 2013 14:53:44 +0200
Raw View
--047d7b6d8896eb4aa604db06acd9
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Tue, Apr 23, 2013 at 2:02 AM, Jeffrey Yasskin <jyasskin@google.com> wrote:
<snip>
> * "The comparison is performed at L3 or greater" means that it's
> implementation-defined which level is actually used? Why is that the
> right decision?
>
There isn't much of a decision to be made here anyway. The only thing that
can be effectively required of an implementation is a minimum level. If an
implementation using Ln is conforming, any implementation that uses Ln+1 is
conforming as well. That means the "or greater" part in the text is
actually redundant, since requiring L3 does not forbid L4 implementations
(and there is no reason that I can think of to forbid them).
Choosing L3 as the minimum requirement stems from the conformance
requirement C2 in UTS#10: "A conformant implementation shall support at
least three levels of collation." I discussed this with the author when he
was drafting the original proposal in January, and the intent is that op<
should support the highest level of collation available to the
implementation, i.e. it provides the strictest sorting order available.
Lower levels of collation (less strict orders) could be provided using the
generic algorithms.
<snip>
> * What sort of data size is needed to implement this?
A few megabytes (~6-8 MB in my experiment) to support all non-tailored
(i.e. locale-independent) algorithms. Locale data requires more and the
exact amount depends on the number of locales supported.
> Can the data be
> shared with an ICU installation?
I see no reason an implementation could not use the ICU data and the CLDR
if so desired and available for the target platform.
> Can implementations for constrained
> environments omit chunks of data at the programmer's discretion?
>
Maybe. I expect that all locale-specific data can be omitted, or maybe all
but one or two locales. Omitting more data would require making support for
some algorithms optional.
<snip>
Kind regards,
Martinho
--047d7b6d8896eb4aa604db06acd9--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Tue, 23 Apr 2013 15:34:43 +0200
Raw View
--0015175774b488816004db073fdf
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Tue, Apr 23, 2013 at 2:53 PM, Martinho Fernandes <
martinho.fernandes@gmail.com> wrote:
> On Tue, Apr 23, 2013 at 2:02 AM, Jeffrey Yasskin <jyasskin@google.com> wrote:
> <snip>
>
>> * "The comparison is performed at L3 or greater" means that it's
>> implementation-defined which level is actually used? Why is that the
>> right decision?
>>
>
> There isn't much of a decision to be made here anyway. The only thing that
> can be effectively required of an implementation is a minimum level. If an
> implementation using Ln is conforming, any implementation that uses Ln+1 is
> conforming as well. That means the "or greater" part in the text is
> actually redundant, since requiring L3 does not forbid L4 implementations
> (and there is no reason that I can think of to forbid them).
>
Wait, there is a difference indeed...
The results from sorting according to Ln+1 are always also sorted according
to Ln. However, C++ uses op< to treat non-comparability as equivalence in
some places. An L4 implementation would yield a different notion of
"equivalence" in those cases. I don't know how important that is though.
Kind regards,
Martinho
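The point about equivalence can be shown with a small stand-in. The sketch below does not implement UTS #10 levels: it assumes an ASCII case-insensitive comparison as the "weaker" level and adds a case tie-break as the "stricter" level, and all names are hypothetical. It only illustrates how `std::set` derives equivalence from the comparator as `!(a < b) && !(b < a)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Stand-in for a weaker collation level: ASCII case is ignored.
struct less_ignoring_case {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Stand-in for a stricter level: same order, but case breaks ties.
struct less_with_case_tiebreak {
    bool operator()(const std::string& a, const std::string& b) const {
        less_ignoring_case weak;
        if (weak(a, b)) return true;
        if (weak(b, a)) return false;
        return a < b;  // only reached when the weaker level sees a tie
    }
};

// Number of elements std::set keeps, i.e. the number of equivalence classes
// under the given comparator.
template <class Less>
std::size_t distinct_under(const std::vector<std::string>& words) {
    std::set<std::string, Less> unique(words.begin(), words.end());
    return unique.size();
}
```

For the input `"abc"`, `"ABC"`, `"abd"`, the weaker comparator treats the first two as equivalent while the stricter one keeps all three apart, even though a sequence sorted with the stricter comparator is still sorted under the weaker one.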
--0015175774b488816004db073fdf--
.
Author: FrankHB1989 <frankhb1989@gmail.com>
Date: Wed, 24 Apr 2013 02:02:59 -0700 (PDT)
Raw View
------=_Part_940_24158779.1366794179964
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, April 21, 2013 at 12:54:34 AM UTC+8, Nicol Bolas wrote:
>
> *C++ is not Python.* Stop trying to turn it into a low-rent version of
> Python. We don't use C++ because it's easy; we use it because it is
> *powerful*. We shouldn't throw away power just to allow slightly easier
> usage. We don't need a one-size-fits-all Unicode string. Give us
> *choices*.
>
>
I think we need both an encoding-aware string and a non-encoding-aware
string, eventually. The latter is not only for convenience, but also for the
confidence of "not caring which encoding to use" in some contexts. Throwing
either of them away and forcing users to use the other seems to be less
*powerful*. Give us *choices*.
------=_Part_940_24158779.1366794179964--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 03:46:35 -0700 (PDT)
Raw View
------=_Part_4087_237530.1366800395944
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Wednesday, April 24, 2013 2:02:59 AM UTC-7, FrankHB1989 wrote:
>
>
>
> On Sunday, April 21, 2013 at 12:54:34 AM UTC+8, Nicol Bolas wrote:
>
>>
>> *C++ is not Python.* Stop trying to turn it into a low-rent version of
>> Python. We don't use C++ because it's easy; we use it because it is
>> *powerful*. We shouldn't throw away power just to allow slightly easier
>> usage. We don't need a one-size-fits-all Unicode string. Give us
>> *choices*.
>>
>>
> I think we need both an encoding-aware string and a non-encoding-aware
> string, eventually. The latter is not only for convenience, but also for the
> confidence of "not caring which encoding to use" in some contexts. Throwing
> either of them away and forcing users to use the other seems to be less
> *powerful*. Give us *choices*.
>
If you don't care what encoding a string uses, then you can just use an
encoding-aware string anyway. All of them should be inter-convertible
(though it should require explicit conversion). And they
should all be buildable from raw data (an iterator range and the encoding
of that range). So all you need to do is pick one and you're fine.
I don't see why anyone would need a string that is *explicitly* unaware of
its encoding. What does that gain you?
------=_Part_4087_237530.1366800395944--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 24 Apr 2013 12:52:06 +0200
Raw View
--047d7b6d878acce58e04db191719
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmckesson@gmail.com> wrote:
>
> I don't see why anyone would need a string that is *explicitly* unaware
> of its encoding. What does that gain you?
>
Don't we have that as std::basic_string already, anyway?
Kind regards,
Martinho
--047d7b6d878acce58e04db191719--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 14:00:16 +0300
Raw View
--089e01229c30f88ff804db19347a
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 13:52, Martinho Fernandes <martinho.fernandes@gmail.com> wrote:
> On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmckesson@gmail.com> wrote:
>
>>
>> I don't see why anyone would need a string that is *explicitly* unaware
>> of its encoding. What does that gain you?
>>
>
> Don't we have that as std::basic_string already, anyway?
>
>
>
We do. And the reason why we need it is that it can be used for conveying the
bits across module boundaries
without caring what the encoded strings in whatever modules are, which is
sometimes useful. Kind of like
having an ip_address type that can hold either ip4_address or
ip6_address...
--089e01229c30f88ff804db19347a--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 04:15:12 -0700 (PDT)
Raw View
------=_Part_1489_27187799.1366802112532
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 4:00:16 AM UTC-7, Ville Voutilainen wrote:
>
> On 24 April 2013 13:52, Martinho Fernandes <martinho....@gmail.com> wrote:
>
>> On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmck...@gmail.com> wrote:
>>
>>>
>>> I don't see why anyone would need a string that is *explicitly* unaware
>>> of its encoding. What does that gain you?
>>>
>>
>> Don't we have that as std::basic_string already, anyway?
>>
>>
>>
> We do. And the reason why we need it that it can be used for conveying the
> bits across module boundaries
> without caring what the encoded strings in whatever modules are, which is
> sometimes useful. Kind of like
> having an ip_address type that can hold either ip4_address or
> ip6_address...
>
I'm not sure I understand the analogy. An `any_ip_address` class wouldn't
be about crossing "module boundaries". It would be about being able to
access an Internet resource through one address or the other, as
needed. The choice of IPv4 or IPv6 is sometimes not up to the application
at all; it comes from what site the user wants to access. If they enter an
IPv4 address, you need to be able to use it like an IPv4 address, and
likewise for IPv6.
This matters because there is no direct conversion from IPv4 to IPv6.
Mapping an IPv4 address to IPv6 is something that can cause problems, so in
many cases, it's best to just access IPv4 addresses through IPv4, rather
than through IPv6.
That is not the case for Unicode. All full Unicode encodings are
cross-convertible, with no loss of information. So no particular encoding
is *functionally* better or worse.
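To make the "no loss of information" point concrete, here is a minimal sketch of a round trip between UTF-32 and UTF-8. The function names `utf32_to_utf8` and `utf8_to_utf32` are made up for this illustration and are not part of N3572 or the standard library; the decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode a sequence of code points (UTF-32) as UTF-8.
// Assumes every input value is a valid Unicode scalar value.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode well-formed UTF-8 back into code points. No validation is done.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        for (int k = 0; k < extra; ++k) {
            assert(i < in.size());
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out += cp;
    }
    return out;
}
```

Converting a string to UTF-8 and back yields the same code points, whether they need one, two, three or four bytes in UTF-8.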
If a module is giving you a Unicode string in an encoding you don't care
about, and you likewise don't care to store that string in any particular
encoding, then what exactly does it matter if you pick a specific encoding?
What have you lost?
The only thing I can think of is that you lose the possibility of avoiding
copying the string. If your chosen encoding and the module's don't match,
then a conversion is needed. Whereas a non-denominational string would be
polymorphic, and thus be whatever encoding it was created with. Moving such
a string around is therefore possible with some assurance of not
cross-converting.
But if we're talking about "conveying the bits across module boundaries"
(where "bits" doesn't mean "C++ object", so we're talking serialization of
some form), then you really do need to know what those bits are and how
they're encoded. A non-denominational string isn't going to help there,
since both the source and the destination need to agree. That means an
explicit protocol needs to be established.
And if "conveying the bits across module boundaries" isn't talking about
serialization, what's wrong with just passing C++ types? You know what
string type the module uses, so just use the string type it uses, and
everyone's fine.
------=_Part_1489_27187799.1366802112532--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 14:35:23 +0300
Raw View
--089e015369149093a804db19b21a
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 14:15, Nicol Bolas <jmckesson@gmail.com> wrote:
> And if "conveying the bits across module boundaries" isn't talking about
> serialization, what's wrong with just passing C++ types? You know what
> string type the module uses, so just use the string type it uses, and
> everyone's fine.
>
>
>
>
The issue is that there may be multiple modules using different encodings,
and the mediating module wants to use a single common type. That's the
analogy with an any_address as well. It should become obvious if you try it
out: the mediating part will have an explosion in the amount of types it
needs to deal with, which is not the case if it can use a common type.
--089e015369149093a804db19b21a--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 06:01:26 -0700 (PDT)
Raw View
------=_Part_2277_10232141.1366808486875
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 4:35:23 AM UTC-7, Ville Voutilainen wrote:
>
>
>
>
> On 24 April 2013 14:15, Nicol Bolas <jmck...@gmail.com> wrote:
>
>> And if "conveying the bits across module boundaries" isn't talking about
>> serialization, what's wrong with just passing C++ types? You know what
>> string type the module uses, so just use the string type it uses, and
>> everyone's fine.
>>
>>
>>
>>
> The issue is that there may be multiple modules using different encodings,
> and the mediating module wants
> to use a single common type. That's the analogy with an any_address as
> well. It should become obvious
> if you try it out, the mediating part will have an explosion in the amount
> of types it needs to deal with,
> which is not the case if it can use a common type.
>
How? This is what the mediating module would look like:
unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);
Whatever Unicode encoding `module_a::get_some_string` returns, `str` will
always be UTF-8 encoded. It will simply transcode the return value.
Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be
transcoded into it as needed.
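For illustration only, here is a rough sketch of how such a `unicode_string` could behave. Everything in it is an assumption made for the example rather than the interface from N3572: the encoding tags `utf8` and `utf32` with their `encode`/`decode` members, the converting constructor, and the two toy modules. Input is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags: each names its code unit type and converts
// to and from a sequence of code points. No validation is performed.
struct utf32 {
    using unit = char32_t;
    static std::u32string decode(const std::u32string& s) { return s; }
    static std::u32string encode(const std::u32string& cps) { return cps; }
};

struct utf8 {
    using unit = char;
    static std::u32string decode(const std::string& s) {
        std::u32string cps;
        for (std::size_t i = 0; i < s.size();) {
            unsigned char b = static_cast<unsigned char>(s[i++]);
            int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
            char32_t cp = extra == 0 ? b : (b & (0x3F >> extra));
            while (extra-- > 0 && i < s.size())
                cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
            cps += cp;
        }
        return cps;
    }
    static std::string encode(const std::u32string& cps) {
        static const unsigned char lead[] = {0x00, 0xC0, 0xE0, 0xF0};
        std::string s;
        for (char32_t cp : cps) {
            if (cp < 0x80) { s += static_cast<char>(cp); continue; }
            int extra = cp < 0x800 ? 1 : cp < 0x10000 ? 2 : 3;
            s += static_cast<char>(lead[extra] | (cp >> (6 * extra)));
            for (int k = extra - 1; k >= 0; --k)
                s += static_cast<char>(0x80 | ((cp >> (6 * k)) & 0x3F));
        }
        return s;
    }
};

template <class Enc>
class unicode_string {
public:
    using storage = std::basic_string<typename Enc::unit>;
    unicode_string() = default;
    explicit unicode_string(storage units) : units_(std::move(units)) {}
    // Converting constructor: a string in any other encoding is transcoded.
    template <class Other>
    unicode_string(const unicode_string<Other>& other)
        : units_(Enc::encode(Other::decode(other.units()))) {}
    const storage& units() const { return units_; }
private:
    storage units_;
};

namespace module_a {  // hypothetical producer that happens to use UTF-32
unicode_string<utf32> get_some_string() {
    return unicode_string<utf32>(U"na\u00EFve \u20AC");
}
}
namespace module_b {  // hypothetical consumer that also wants UTF-32
std::size_t use_some_string(const unicode_string<utf32>& s) {
    return s.units().size();  // number of code points
}
}

// The mediating code stores UTF-8 regardless of what its neighbours use.
std::size_t mediate() {
    unicode_string<utf8> str = module_a::get_some_string();  // transcode in
    return module_b::use_some_string(str);                   // transcode out
}
```

In this sketch the mediating code names exactly one string type; the conversions at both ends are implicit.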
So where exactly is the "explosion in the amount of types" that you're
concerned about?
------=_Part_2277_10232141.1366808486875--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 16:06:52 +0300
Raw View
--001a11c2ab98bb3cf104db1af9c1
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 16:01, Nicol Bolas <jmckesson@gmail.com> wrote:
> How? This is what the mediating module would look like:
>
> unicode_string<utf8> str = module_a::get_some_string(...);
> module_b::use_some_string(..., str);
>
>
I don't think that's what it would look like. It's using
unicode_string<utf8> there, potentially unicode_string<something_else>
elsewhere, which is the explosion of types I mentioned.
> Whatever Unicode encoding `module_a::get_some_string` returns, `str` will
> always be UTF-8 encoded. It will simply transcode the return value.
> Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be
> transcoded into it as needed.
>
That would assume that the transcoding cost is ok for the mediating part. I
don't think that's the general case.
--001a11c2ab98bb3cf104db1af9c1--
.
Author: Klaim - Joël Lamotte <mjklaim@gmail.com>
Date: Wed, 24 Apr 2013 15:37:03 +0200
Raw View
--089e01536d6cb03bf004db1b65de
Content-Type: text/plain; charset=ISO-8859-1
On Wed, Apr 24, 2013 at 3:06 PM, Ville Voutilainen <
ville.voutilainen@gmail.com> wrote:
> That would assume that the transcoding cost is ok for the mediating part.
> I don't think that's the general case.
I'm failing to see how the transcoding cost can be avoided if two modules
force the user to work with specific and different encodings.
Joel Lamotte
--089e01536d6cb03bf004db1b65de--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 06:38:57 -0700 (PDT)
Raw View
------=_Part_85_5750078.1366810737492
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
>
> On 24 April 2013 16:01, Nicol Bolas <jmck...@gmail.com> wrote:
>
>> How? This is what the mediating module would look like:
>>
>> unicode_string<utf8> str = module_a::get_some_string(...);
>> module_b::use_some_string(..., str);
>>
>>
> I don't think that's what it would look like. It's using
> unicode_string<utf8> there, potentially unicode_string<something_else>
> elsewhere, which is the explosion of types I mentioned.
>
I'm still not clear on the problem. They're all inter-compatible; so what
if they use UTF8 in some places and UTF16 in others? It won't break
anything; they'll just get degraded performance due to user error.
Why should the standard be responsible for people who can't settle on a
convention?
>> Whatever Unicode encoding `module_a::get_some_string` returns, `str` will
>> always be UTF-8 encoded. It will simply transcode the return value.
>> Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be
>> transcoded into it as needed.
>>
>
> That would assume that the transcoding cost is ok for the mediating part.
> I don't think that's the general case.
>
The only way for transcoding to be avoided is if none of the modules
between the producing module and the consuming one do *anything* with the
string that requires a *specific* encoding. They must treat the string as a
sequence of codepoints that are properly Unicode formatted and arranged. So
the consuming module can't write it to a file, to a stream (not without
some serious upgrades to iostream to start taking codepoint sequences),
send it across the internet, or do any number of other things that need the
actual encoding.
There are quite a few operations that don't do any of those. But all of the
user-facing APIs will need a specific encoding. That's why applications
tend to just pick an encoding and stick with it. They pick whatever their
user-facing APIs use and just go with that.
The general rubric for C++ is (and should be): you accept whatever, convert
it ASAP into your standard encoding, do any manipulation in that encoding,
and then convert it if some specific API needs a different encoding. This
is how it *must* be, because the entire C++ world is not going to suddenly
switch to our new Unicode string. There will still be many APIs that only
accept a specific encoding.
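A small sketch of that rubric, under assumptions chosen purely for the example: the boundary API speaks UTF-16, the application's internal standard is UTF-32, and the "manipulation" is a reversal by code point. The names `from_utf16`, `to_utf16` and `reverse_text` are hypothetical, and unpaired surrogates are passed through unchanged rather than rejected.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Boundary in: decode UTF-16 into code points, combining surrogate pairs.
std::u32string from_utf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out += u;
    }
    return out;
}

// Boundary out: encode code points as UTF-16.
std::u16string to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Accept whatever the API hands over, convert at once, work in the
// internal encoding, and convert back only when the API needs it.
std::u16string reverse_text(const std::u16string& api_input) {
    std::u32string text = from_utf16(api_input);  // convert ASAP
    std::reverse(text.begin(), text.end());       // manipulate internally
    return to_utf16(text);                        // convert at the boundary
}
```

Reversing the UTF-16 code units directly would split a surrogate pair; doing the work in the internal encoding avoids that without the caller ever seeing it.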
I don't see the need for the standard to support an alternate way of using
Unicode strings.
As for the performance issue, I don't see how you can make a
performance-based case for `any_unicode_string` at all.
`any_unicode_string` will have significantly degraded access performance,
since it will have to use type erasure to store and access the actual data.
Remember: all of the truly useful stuff to do with Unicode strings in C++
comes from iterator-based algorithms, not members of the string class
itself. And `any_unicode_string` will have to use type-erased iterators;
every `++` or `*` operation will have to go through type erasure, thus
degrading performance. On *every use* of these, you will effectively get
the overhead of a type-erased call.
Take the example code I gave before. Even if it does two separate transcode
sequences to get the string from Module A to B, that's still likely to be a
win performance-wise over `any_unicode_string` if Module B does multiple
passes over the data. So if Module B is doing actual work with the string,
you win performance-wise.
I suppose you could make versions of the algorithms that are members of the
string, in which case the type erasure happens once per operation.
Or the algorithms could be specialized on
`any_unicode_string::iterator`.
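A rough sketch of the difference, with heavy simplifications that are assumptions of the example and not anything proposed in the thread: element access by index stands in for the iterator's `++` and `*`, the erased storage is treated as one code point per stored unit (true for UTF-32, and for UTF-16 restricted to the BMP), and the names `concept_t`, `model_t`, `count` and `count_external` are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type-erased string: the concrete storage is hidden behind
// a virtual interface.
class any_unicode_string {
    struct concept_t {
        virtual ~concept_t() = default;
        virtual std::size_t size() const = 0;
        // Per-element access: one virtual call per code point.
        virtual char32_t code_point_at(std::size_t i) const = 0;
        // Whole-algorithm entry point: one virtual call per operation.
        virtual std::size_t count(char32_t cp) const = 0;
    };

    template <class Str>
    struct model_t final : concept_t {
        explicit model_t(Str s) : data(std::move(s)) {}
        std::size_t size() const override { return data.size(); }
        char32_t code_point_at(std::size_t i) const override {
            return static_cast<char32_t>(data[i]);
        }
        std::size_t count(char32_t cp) const override {
            std::size_t n = 0;
            // Runs on the concrete type, so no dispatch inside the loop.
            for (auto u : data) n += (static_cast<char32_t>(u) == cp);
            return n;
        }
        Str data;
    };

    std::shared_ptr<const concept_t> impl_;

public:
    template <class Str>
    explicit any_unicode_string(Str s)
        : impl_(std::make_shared<model_t<Str>>(std::move(s))) {}

    std::size_t size() const { return impl_->size(); }
    char32_t operator[](std::size_t i) const { return impl_->code_point_at(i); }
    std::size_t count(char32_t cp) const { return impl_->count(cp); }
};

// Generic algorithm written against the erased interface: it pays for a
// virtual call on every element it touches.
std::size_t count_external(const any_unicode_string& s, char32_t cp) {
    std::size_t n = 0;
    for (std::size_t i = 0, e = s.size(); i < e; ++i) n += (s[i] == cp);
    return n;
}
```

Both routes give the same answer; the member version crosses the type-erasure boundary once, while the external one crosses it once per code point.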
--
---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/?hl=en.
------=_Part_85_5750078.1366810737492
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:<blo=
ckquote class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-=
left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr"><div><div class=
=3D"gmail_quote">On 24 April 2013 16:01, Nicol Bolas <span dir=3D"ltr"><=
<a href=3D"javascript:" target=3D"_blank" gdf-obfuscated-mailto=3D"f-F-3ef4=
3pYJ">jmck...@gmail.com</a>></span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">How? This is what the mediating module would=
look like:<br><div><br><div style=3D"background-color:rgb(250,250,250);bor=
der-color:rgb(187,187,187);border-style:solid;border-width:1px;word-wrap:br=
eak-word">
<code><div><span>unicode_string</span><span style=3D"color:#080"><utf8&g=
t;</span><span> str </span><span style=3D"color:#660">=3D</span><span> modu=
le_a</span><span style=3D"color:#660">::</span><span>get_some_string</span>=
<span style=3D"color:#660">(...)<wbr>;</span><span><br>
module_b</span><span style=3D"color:#660">::</span><span>use_some_string</s=
pan><span style=3D"color:#660">(...,</span><span> str</span><span style=3D"=
color:#660">);</span><span><br></span></div></code></div><br>
</div></blockquote><div><br></div><div>I don't think that's what it would l=
ook like. It's using unicode_string<utf8> there, potentially unicode_=
string<something_else><br></div><div>elsewhere, which is the explosio=
n of types I mentioned.<br></div></div></div></div></blockquote><div><br>I'=
m still not clear on the problem. They're all inter-compatible; so what if =
they use UTF8 in some places and UTF16 in others? It won't break anything; =
they'll just get degraded performance due to user error.<br><br>Why should =
the standard be responsible for people who can't settle on a convention?<br=
><br></div><blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-left=
: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr"><d=
iv><div class=3D"gmail_quote"><div>
</div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-l=
eft:1px #ccc solid;padding-left:1ex"><div>Whatever Unicode encoding `module=
_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will s=
imply transcode the return value. Whatever Unicode encoding `module_b::use_=
some_string` takes, `str` will be transcoded into it as needed.<br>
</div></blockquote><div><br></div><div>That would assume that the transcodi=
ng cost is ok for the mediating part. I don't think that's the general case=
..<br></div></div></div></div></blockquote><div><br>The only way for transco=
ding to be avoided is if none of the modules between the producing module a=
nd the consuming one do <i>anything</i> with the string that requires a <i>=
specific </i>encoding. It must treat the string as a sequence of codepoints=
that are properly Unicode formatted and arranged. So the consuming module =
can't write it to a file, to a stream (not without some serious upgrades to=
iostream to start taking codepoint sequences), send it across the internet=
, or any number of other processes that need the actual encoding.<br><br>Th=
ere are quite a few operations that don't do any of those. But all of the u=
ser-facing APIs will need a specific encoding. That's why applications tend=
to just pick an encoding and stick with it. They pick whatever their user-=
facing APIs use and just go with that.<br><br>The general rubric for C++ is=
(and should be): you accept whatever, convert it ASAP into your standard e=
ncoding, do any manipulation in that encoding, and then convert it if some =
specific API needs a different encoding. This is how it <i>must</i> be, bec=
ause the entire C++ world is not going to suddenly switch to our new Unicod=
e string. There will still be many APIs that only accept a specific encodin=
g.<br><br>I don't see the need for the standard to support an alternate way=
of using Unicode strings.<br><br>As for the performance issue, I don't see=
how you can make a performance-based case for `any_unicode_string` at all.=
`any_unicode_string` will have significantly degraded access performance, =
since it will have to use type erasure to store and access the actual data.=
<br><br>Remember: all of the truly useful stuff to do with Unicode strings =
in C++ comes from iterator-based algorithms, not members of the string clas=
s itself. And `any_unicode_string` will have to use type-erased iterators; =
every `++` or `*` operation will have to go through type erasure, thus degr=
ading performance. On <i>every use</i> of these, you will effectively get t=
he overhead of a type-erased call.<br><br>Take the example code I gave befo=
re. Even if it does two separate transcode=20
sequences to get the string from Module A to B, that's still likely to be a=
win performance-wise over `any_unicode_string` if Module B does multiple p=
asses over the data. So if Module B is doing actual work with the string, y=
ou win performance-wise.<br><br>I suppose you could make versions of the al=
gorithms that are members of the string, in which case the type-erasure par=
t happens once for the operation. Or that the algorithms could be specializ=
ed on `any_unicode_string::iterator`.<br></div>
------=_Part_85_5750078.1366810737492--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 17:03:55 +0300
Raw View
--089e01229c30cdc63204db1bc5a7
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On 24 April 2013 16:37, Klaim - Joël Lamotte <mjklaim@gmail.com> wrote:
>
> On Wed, Apr 24, 2013 at 3:06 PM, Ville Voutilainen <
> ville.voutilainen@gmail.com> wrote:
>
>> That would assume that the transcoding cost is ok for the mediating part.
>> I don't think that's the general case.
>
>
> I'm failing to see how the transcoding cost can be avoided if two modules
> force the user to work with specific and different encodings?
>
>
>
The point isn't avoiding the cost, but being able to choose where it's
paid.
--089e01229c30cdc63204db1bc5a7--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 17:09:45 +0300
Raw View
--047d7b2e0887adb44704db1bdaed
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 16:38, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
>
>> I don't think that's what it would look like. It's using
>> unicode_string<utf8> there, potentially unicode_string<something_else>
>> elsewhere, which is the explosion of types I mentioned.
>>
>
> I'm still not clear on the problem. They're all inter-compatible; so what
> if they use UTF8 in some places and UTF16 in others? It won't break
> anything; they'll just get degraded performance due to user error.
>
Yes, "they'll". The question is who's "them", and where.
>
> Why should the standard be responsible for people who can't settle on a
> convention?
>
In order to be useful?
> The only way for transcoding to be avoided is if none of the modules
> between the producing module and the consuming one do *anything* with the
> string that requires a *specific* encoding. It must treat the string as a
> sequence of codepoints that are properly Unicode formatted and arranged. So
> the consuming module can't write it to a file, to a stream (not without
> some serious upgrades to iostream to start taking codepoint sequences),
> send it across the internet, or any number of other processes that need the
> actual encoding.
>
Mostly correct, although the write-to-file/stream aren't quite that
clear-cut.
>
> The general rubric for C++ is (and should be): you accept whatever,
> convert it ASAP into your standard encoding, do any manipulation in that
> encoding, and then convert it if some specific API needs a different
> encoding. This is how it *must* be, because the entire C++ world is not
> going to suddenly switch to our new Unicode string. There will still be
> many APIs that only accept a specific encoding.
>
> I don't see the need for the standard to support an alternate way of using
> Unicode strings.
>
The "accept whatever" includes accepting a general unicode type. Do we
suddenly agree completely? Note that
it's just an idea, we might not end up having such a general type.
>
> As for the performance issue, I don't see how you can make a
> performance-based case for `any_unicode_string` at all.
> `any_unicode_string` will have significantly degraded access performance,
> since it will have to use type erasure to store and access the actual data.
>
You're reaching too far and too early into implementation details if you
think it *has to* use erasure.
--047d7b2e0887adb44704db1bdaed--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 09:23:53 -0700 (PDT)
Raw View
------=_Part_4227_7508509.1366820633217
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 7:09:45 AM UTC-7, Ville Voutilainen wrote:
>
> On 24 April 2013 16:38, Nicol Bolas <jmck...@gmail.com> wrote:
>
>> On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
>>
>>> I don't think that's what it would look like. It's using
>>> unicode_string<utf8> there, potentially unicode_string<something_else>
>>> elsewhere, which is the explosion of types I mentioned.
>>>
>>
>> I'm still not clear on the problem. They're all inter-compatible; so what
>> if they use UTF8 in some places and UTF16 in others? It won't break
>> anything; they'll just get degraded performance due to user error.
>>
>
> Yes, "they'll". The question is who's "them", and where.
>
I'm not sure what you're getting at here. "Where" will be when they
transcode strings. Transcoding can't be *hidden* in this system, because
it's right there in the type system. Any time you try to copy/move a
`unicode_string<A>` into a `unicode_string<B>`, you get a transcode. This
is not difficult to track down.
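To make that concrete, here is a minimal sketch of an encoding-tagged string. Everything in it is an assumption for illustration: the tag types `utf8` and `utf16`, the `transcode_count` helper, and the choice to store code points rather than encoded code units are hypothetical and are not taken from N3572. The sketch also assumes the cross-encoding constructor is `explicit`, which is one possible design and stricter than "any copy/move transcodes".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; illustrative only, not the N3572 interface.
struct utf8 {};
struct utf16 {};

// Counter so the sketch can show exactly when a transcode happens.
inline std::size_t& transcode_count() {
    static std::size_t count = 0;
    return count;
}

template <class Encoding>
class unicode_string {
public:
    explicit unicode_string(std::u32string code_points)
        : code_points_(std::move(code_points)) {}

    // Converting constructor: the only way to cross encodings.
    // A real implementation would re-encode the stored code units here;
    // this sketch keeps code points and only records that the work happened.
    template <class Other>
    explicit unicode_string(const unicode_string<Other>& other)
        : code_points_(other.code_points()) {
        ++transcode_count();
    }

    const std::u32string& code_points() const { return code_points_; }

private:
    std::u32string code_points_;
};

// Usage: same-encoding copies never transcode; crossing encodings has to be
// spelled out in the source, so it is easy to find.
inline void example() {
    const std::size_t before = transcode_count();
    unicode_string<utf8> a(U"text");
    unicode_string<utf8> b = a;      // plain copy, no transcode
    unicode_string<utf16> c(a);      // explicit transcode
    // unicode_string<utf16> d = a;  // would not compile: constructor is explicit
    assert(transcode_count() == before + 1);
    assert(b.code_points() == c.code_points());
}
```

With this shape, searching for constructions of `unicode_string<B>` from a `unicode_string<A>` finds every place the cost is paid.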
As for who "them" is, it would be people who can't keep conventions
straight, or who can't use a typedef properly. I.e., not very many C++
programmers.
>> Why should the standard be responsible for people who can't settle on a
>> convention?
>>
>
> In order to be useful?
>
By that logic, we should also have garbage collection. Because it's
"useful".
C++ simply doesn't do things this way. It doesn't tend to have types that
could cover anything of a general category, with substantial implementation
differences based on construction. And when it tries that, it generally
works out poorly (see iostreams and its needlessly awful performance).
>> The general rubric for C++ is (and should be): you accept whatever, convert
>> it ASAP into your standard encoding, do any manipulation in that encoding,
>> and then convert it if some specific API needs a different encoding. This
>> is how it *must* be, because the entire C++ world is not going to
>> suddenly switch to our new Unicode string. There will still be many APIs
>> that only accept a specific encoding.
>>
>> I don't see the need for the standard to support an alternate way of
>> using Unicode strings.
>>
>
> The "accept whatever" includes accepting a general unicode type. Do we
> suddenly agree completely?
>
By "accept whatever", I mean "I call an API that returns a Unicode string
of a particular encoding." I write my code against "whatever" encoding that
the API uses, transcoding it to my required, internal encoding. It's not "a
specific API could return any arbitrary Unicode encoding," which is what
you're asking for.
>> As for the performance issue, I don't see how you can make a
>> performance-based case for `any_unicode_string` at all.
>> `any_unicode_string` will have significantly degraded access performance,
>> since it will have to use type erasure to store and access the actual data.
>>
>
> You're reaching too far and too early into implementation details if you
> think it *has to* use erasure.
>
Whether it's type erasure or something else, this iterator access is *not*
going to be a simple pointer access. Each call to `++` or `*` is going to
have to do a lot more work than a specific encoder's iterator. It's going
to have to figure out which encoding the type actually is, then call an
appropriate function based on that.
And type erasure is the only way to make this work for arbitrary,
potentially user-defined encodings. Again, there's no reason I shouldn't be
able to use EBCDIC or whatever, so long as it is a proper Unicode encoding.
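As a rough illustration of where that per-operation cost comes from, here is a sketch of a type-erased code point cursor. The interface `code_point_cursor` and both concrete cursors are hypothetical names invented for this example; the UTF-8 cursor assumes well-formed input and does no validation. It shows one way such erasure could look, not how `any_unicode_string` would have to be built.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical erased interface: every iterator step is a virtual call.
struct code_point_cursor {
    virtual ~code_point_cursor() = default;
    virtual char32_t current() const = 0;  // what operator* would forward to
    virtual void advance() = 0;            // what operator++ would forward to
    virtual bool at_end() const = 0;
};

// Cursor over UTF-32 storage: one code unit per code point.
class utf32_cursor final : public code_point_cursor {
public:
    explicit utf32_cursor(const std::u32string& s) : s_(s) {}
    char32_t current() const override { return s_[pos_]; }
    void advance() override { ++pos_; }
    bool at_end() const override { return pos_ >= s_.size(); }

private:
    const std::u32string& s_;
    std::size_t pos_ = 0;
};

// Cursor over UTF-8 storage (well-formed input assumed, no validation).
class utf8_cursor final : public code_point_cursor {
public:
    explicit utf8_cursor(const std::string& s) : s_(s) {}

    char32_t current() const override {
        const unsigned char lead = static_cast<unsigned char>(s_[pos_]);
        const std::size_t len = length(lead);
        // Strip the length-marker bits from the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (std::size_t i = 1; i < len; ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s_[pos_ + i]) & 0x3Fu);
        }
        return cp;
    }
    void advance() override { pos_ += length(static_cast<unsigned char>(s_[pos_])); }
    bool at_end() const override { return pos_ >= s_.size(); }

private:
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }
    const std::string& s_;
    std::size_t pos_ = 0;
};

// An algorithm written against the erased interface: the caller does not know
// the encoding, and each loop iteration pays for two virtual calls.
inline std::size_t count_code_points(code_point_cursor& cursor) {
    std::size_t n = 0;
    for (; !cursor.at_end(); cursor.advance()) ++n;
    return n;
}
```

An iterator for one known encoding can inline `current()` and `advance()`; the erased version cannot, which is the overhead being argued about. A user-defined encoding would plug in as one more class derived from `code_point_cursor`.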
------=_Part_4227_7508509.1366820633217--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 19:55:43 +0300
Raw View
--089e01229c3038051f04db1e2c76
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 19:23, Nicol Bolas <jmckesson@gmail.com> wrote:
>
>>> I'm still not clear on the problem. They're all inter-compatible; so what
>>> if they use UTF8 in some places and UTF16 in others? It won't break
>>> anything; they'll just get degraded performance due to user error.
>>>
>>
>> Yes, "they'll". The question is who's "them", and where.
>>
>
> I'm not sure what you're getting at here. "Where" will be when they
> transcode strings. Transcoding can't be *hidden* in this system,
>
Precisely. And mediating layers don't need/want to do that. The transcoding
can be done in places where and when it needs
to be done.
>
> Why should the standard be responsible for people who can't settle on a
>>> convention?
>>>
>>
>> In order to be useful?
>>
>
> By that logic, we should also have garbage collection. Because it's
> "useful".
>
That's completely beside the point. Having a common type for multiple
different encoded strings has
nothing to do with things like garbage collection.
>
> C++ simply doesn't do things this way. It doesn't tend to have types that
> could cover anything of a general category, with
>
Oh, like std::exception and exception_ptr? shared_ptr? function? The
forthcoming polymorphic allocators?
>
>> You're reaching too far and too early into implementation details if you
>> think it *has to* use erasure.
>>
>
> Whether it's type erasure or something else, this iterator access is *not*
> going to be a simple pointer access. Each call to `++` or `*` is going to
> have to do a lot more work than a specific encoder's iterator. It's going
> to have to figure out which encoding the type actually is, then call an
> appropriate function based on that.
>
That's not how it needs to be done. And such a common type doesn't
necessarily need to do encoding-specific
traversal at all, it may well be sufficient for such type to allow blasting
the raw bits into various sinks, or
just convey the type between different subsystems.
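One possible reading of such a common type, sketched under loud assumptions: a value that carries raw code units plus a runtime tag naming their encoding, which a mediating layer forwards untouched and only the final consumer transcodes. The names `encoding`, `encoded_blob`, `forward_unchanged`, and `as_utf8` are hypothetical, only UTF-8 and little-endian UTF-32 are handled, and input is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical runtime tag; byte order is part of the tag on purpose.
enum class encoding { utf8, utf32le };

// Raw code units exactly as produced, plus the tag that gives them meaning.
struct encoded_blob {
    encoding enc;
    std::string bytes;
};

// Mediating layer: no traversal and no transcode, it only passes the value on.
inline encoded_blob forward_unchanged(encoded_blob blob) { return blob; }

// Appends one code point as UTF-8 (a valid scalar value is assumed).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Consumer that needs UTF-8: the transcoding cost is paid here, and only
// when the tag says the bytes are in some other encoding.
inline std::string as_utf8(const encoded_blob& blob) {
    if (blob.enc == encoding::utf8) return blob.bytes;
    assert(blob.bytes.size() % 4 == 0);
    std::string out;
    for (std::size_t i = 0; i < blob.bytes.size(); i += 4) {
        char32_t cp = 0;
        for (std::size_t b = 0; b < 4; ++b) {
            const unsigned char unit = static_cast<unsigned char>(blob.bytes[i + b]);
            cp |= static_cast<char32_t>(unit) << (8 * b);  // little-endian
        }
        append_utf8(out, cp);
    }
    return out;
}
```

In this reading the raw bits are only meaningful together with the tag, so a sink would have to either accept the tag or ask for a specific encoding as `as_utf8` does.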
--089e01229c3038051f04db1e2c76--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 11:08:57 -0700 (PDT)
Raw View
------=_Part_206_11874174.1366826937134
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 9:55:43 AM UTC-7, Ville Voutilainen wrote:
>
> On 24 April 2013 19:23, Nicol Bolas <jmck...@gmail.com> wrote:
>
>>> You're reaching too far and too early into implementation details if you
>>> think it *has to* use erasure.
>>>
>>
>> Whether it's type erasure or something else, this iterator access is *not*
>> going to be a simple pointer access. Each call to `++` or `*` is going
>> to have to do a lot more work than a specific encoder's iterator. It's
>> going to have to figure out which encoding the type actually is, then call
>> an appropriate function based on that.
>>
>
> That's not how it needs to be done. And such a common type doesn't
> necessarily need to do encoding-specific
> traversal at all, it may well be sufficient for such type to allow
> blasting the raw bits into various sinks, or
> just convey the type between different subsystems.
>
OK, clearly before this conversation can proceed any further, there needs
to be some definition of what *exactly* we're talking about.
When I hear the word "string", I generally think of an object that contains
an ordered sequence of characters, which can be accessed in some way and
probably have basic sequencing operations performed on them. If it's
mutable, insertion, deletion, and such can be used. If it's not mutable,
then there should probably be some APIs that do copy-insertion/deletion
(creating a new string that is the result of inserting/deleting).
If all you're talking about is some memory object which cannot be useful
until it is transferred into some other object, that's not a "string" by
any definition I'm aware of. That's not even an iterator range.
So what *exactly* are you arguing we should have? A string or something
else?
Furthermore, what good is "blasting the raw bits into various sinks"?
Ignoring the fact that there's no such thing as a "sink", the class is
designed so that the user explicitly doesn't know the actual encoding of
the data. So the "raw bits" themselves are completely meaningless to
anyone. And let's not forget endian conversion issues on top of that.
I cannot imagine what use it would be to take a string of an unknown
encoding and send its "raw bits" somewhere. At least, what use that would
be compared to taking an actual, specific encoding.
------=_Part_206_11874174.1366826937134
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Wednesday, April 24, 2013 9:55:43 AM UTC-7, Ville Voutilainen wrote:<blo=
ckquote class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-=
left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr"><div><div>On 24 A=
pril 2013 19:23, Nicol Bolas <span dir=3D"ltr"><<a href=3D"javascript:" =
target=3D"_blank" gdf-obfuscated-mailto=3D"_ndg4Unc8xsJ">jmck...@gmail.com<=
/a>></span> wrote:<br></div></div></div></blockquote><blockquote class=
=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #cc=
c solid;padding-left: 1ex;"><div dir=3D"ltr"><div><div class=3D"gmail_quote=
"><div></div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;b=
order-left:1px #ccc solid;padding-left:1ex"><blockquote class=3D"gmail_quot=
e" style=3D"margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-l=
eft:1ex">
<div dir=3D"ltr"><div><div class=3D"gmail_quote"><div>You're reaching too f=
ar and too early into implementation details if you think it *has to* use e=
rasure.<br></div></div></div></div></blockquote><div><br>Whether it's type =
erasure or something else, this iterator access is <i>not</i> going to be a=
simple pointer access. Each call to `++` or `*` is going to have to do a l=
ot more work than a specific encoder's iterator. It's going to have to figu=
re out which encoding the type actually is, then call an appropriate functi=
on based on that.<br>
</div></blockquote><div><br></div><div>That's not how it needs to be done. =
And such a common type doesn't necessarily need to do encoding-specific<br>=
traversal at all, it may well be sufficient for such type to allow blasting=
the raw bits into various sinks, or<br>
just convey the type between different subsystems.<br></div></div></div></d=
iv></blockquote><div><br>OK, clearly before this conversation can proceed a=
ny further, there needs to be some definition of what <i>exactly</i> we're =
talking about.<br><br>When I hear the word "string", I generally think of a=
n object that contains an ordered sequence of characters, which can be acce=
ssed in some way and probably have basic sequencing operations performed on=
them. If it's mutable, insertion, deletion, and such can be used. If it's =
not mutable, then there should probably be some APIs that do copy-insertion=
/delection (creating a new string that is the result of inserting/deleting)=
..<br><br>If all you're talking about is some memory object which cannot be =
useful until it is transferred into some other object, that's not a "string=
" by any definition I'm aware of. That's not even an iterator range.<br><br=
>So what <i>exactly</i> are you arguing we should have? A string or somethi=
ng else?<br><br>Furthermore, what good is "blasting the raw bits into vario=
us sinks"? Ignoring the fact that there's no such thing as a "sink", the cl=
ass is designed so that the user explicitly doesn't know the actual encodin=
g of the data. So the "raw bits" themselves are completely meaningless to a=
nyone. And let's not forget endian conversion issues on top of that.<br><br=
>I cannot imagine what use it would be to take a string of an unknown encod=
ing and send its "raw bits" somewhere. At least, what use that would be com=
pared to taking an actual, specific encoding.<br></div>
<p></p>
-- <br />
<br />
--- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals" group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.<br />
To post to this group, send email to std-proposals@isocpp.org.<br />
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/?hl=3Den">http://groups.google.com/a/isocpp.org/group/std-pro=
posals/?hl=3Den</a>.<br />
<br />
<br />
------=_Part_206_11874174.1366826937134--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 21:15:11 +0300
Raw View
--089e01536914688b5f04db1f4809
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 21:08, Nicol Bolas <jmckesson@gmail.com> wrote:
> If all you're talking about is some memory object which cannot be useful
> until it is transferred into some other object, that's not a "string" by
> any definition I'm aware of. That's not even an iterator range.
>
I'm talking about a type from which you can do a conversion to a more
specific type.
> Furthermore, what good is "blasting the raw bits into various sinks"?
> Ignoring the fact that there's no such thing as a "sink", the
>
You never dump data into debug files? Into error streams? You never memcpy
anything anywhere? You never send raw data with out-of-band information about
the encoding so that it can be decoded on the receiving side?
I find your trouble understanding the uses for such a type odd. But I don't
need to convince you that such types are useful.
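As a hedged sketch of the kind of type being described (not anything taken from N3572), one could imagine a byte buffer that carries its encoding out of band, can be copied into a byte sink untouched, and offers a conversion to a more specific type. Every name below (`encoding_tag`, `tagged_bytes`, `dump_raw`, `to_utf8`) is hypothetical, and only the trivial conversion case is handled.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical tag describing how the bytes should be interpreted.
enum class encoding_tag { utf8, utf16le, utf16be, utf32le, utf32be };

// Hypothetical "common" string type: opaque bytes plus out-of-band encoding.
struct tagged_bytes {
    encoding_tag encoding;
    std::vector<unsigned char> bytes;
};

// Copy the raw bytes into any sink that supports push_back (a log buffer,
// a network packet, and so on). No decoding happens here.
template <class Sink>
void dump_raw(const tagged_bytes& s, Sink& sink) {
    for (unsigned char b : s.bytes) sink.push_back(b);
}

// Conversion to a more specific type. Only the case where the bytes are
// already UTF-8 is handled; transcoding from other encodings is left out.
inline std::optional<std::string> to_utf8(const tagged_bytes& s) {
    if (s.encoding != encoding_tag::utf8) return std::nullopt;
    return std::string(s.bytes.begin(), s.bytes.end());
}
```

The receiver of `dump_raw` output needs the `encoding_tag` delivered separately to make sense of the bytes, which is the out-of-band information mentioned above.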
--089e01536914688b5f04db1f4809--
.
Author: Jeffrey Yasskin <jyasskin@google.com>
Date: Wed, 24 Apr 2013 11:41:14 -0700
Raw View
On Wed, Apr 24, 2013 at 11:08 AM, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Wednesday, April 24, 2013 9:55:43 AM UTC-7, Ville Voutilainen wrote:
>>
>> On 24 April 2013 19:23, Nicol Bolas <jmck...@gmail.com> wrote:
>>>>
>>>> You're reaching too far and too early into implementation details if you
>>>> think it *has to* use erasure.
>>>
>>>
>>> Whether it's type erasure or something else, this iterator access is not
>>> going to be a simple pointer access. Each call to `++` or `*` is going to
>>> have to do a lot more work than a specific encoder's iterator. It's going to
>>> have to figure out which encoding the type actually is, then call an
>>> appropriate function based on that.
>>
>>
>> That's not how it needs to be done. And such a common type doesn't
>> necessarily need to do encoding-specific
>> traversal at all, it may well be sufficient for such type to allow
>> blasting the raw bits into various sinks, or
>> just convey the type between different subsystems.
>
>
> OK, clearly before this conversation can proceed any further, there needs to
> be some definition of what exactly we're talking about.
>
> When I hear the word "string", I generally think of an object that contains
> an ordered sequence of characters, which can be accessed in some way and
> probably have basic sequencing operations performed on them. If it's
> mutable, insertion, deletion, and such can be used. If it's not mutable,
> then there should probably be some APIs that do copy-insertion/deletion
> (creating a new string that is the result of inserting/deleting).
>
> If all you're talking about is some memory object which cannot be useful
> until it is transferred into some other object, that's not a "string" by any
> definition I'm aware of. That's not even an iterator range.
>
> So what exactly are you arguing we should have? A string or something else?
>
> Furthermore, what good is "blasting the raw bits into various sinks"?
> Ignoring the fact that there's no such thing as a "sink", the class is
> designed so that the user explicitly doesn't know the actual encoding of the
> data. So the "raw bits" themselves are completely meaningless to anyone. And
> let's not forget endian conversion issues on top of that.
>
> I cannot imagine what use it would be to take a string of an unknown
> encoding and send its "raw bits" somewhere. At least, what use that would be
> compared to taking an actual, specific encoding.
I'd classify the options into two general categories:
1) A unicode string class that presents its contents as a sequence of
code points, without exposing its clients to the sequence of bytes
that underlie these code points. This could be the python-style object
I've been suggesting or could be an object that presents a
bidirectional iterator that converts on the fly.
2) An "encoded" string class that presents its contents as a sequence
of bytes along with a description of the encoding that should be used
to interpret those bytes, probably along with an iterator that can
convert from each encoding.
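As a rough illustration of the "converts on the fly" idea in option 1, the step that such an iterator would perform on each dereference can be sketched as a function that decodes one code point from UTF-8 storage. This assumes UTF-8 as the underlying bytes purely for the example; `decoded` and `decode_utf8_at` are hypothetical names, and the input is assumed to be well-formed.

```cpp
#include <cstddef>
#include <string_view>

// Result of decoding one code point: its value and how many bytes it used.
struct decoded {
    char32_t code_point;
    std::size_t length;
};

// Decode the code point starting at byte offset `pos` of UTF-8 text.
// Assumes `s` is well-formed UTF-8 and `pos` is on a code point boundary;
// a production decoder would also validate continuation bytes and reject
// overlong forms and surrogates.
inline decoded decode_utf8_at(std::string_view s, std::size_t pos) {
    const auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(s[pos + i]);
    };
    const unsigned char lead = byte(0);
    if (lead < 0x80) {                      // 0xxxxxxx: one byte
        return {static_cast<char32_t>(lead), 1};
    }
    if ((lead & 0xE0) == 0xC0) {            // 110xxxxx: two bytes
        return {static_cast<char32_t>(((lead & 0x1F) << 6) | (byte(1) & 0x3F)), 2};
    }
    if ((lead & 0xF0) == 0xE0) {            // 1110xxxx: three bytes
        return {static_cast<char32_t>(((lead & 0x0F) << 12) |
                                      ((byte(1) & 0x3F) << 6) |
                                      (byte(2) & 0x3F)), 3};
    }
    // 11110xxx: four bytes
    return {static_cast<char32_t>(((lead & 0x07) << 18) |
                                  ((byte(1) & 0x3F) << 12) |
                                  ((byte(2) & 0x3F) << 6) |
                                  (byte(3) & 0x3F)), 4};
}
```

An iterator for option 1 would hide `pos` and `length` from the client and expose only `code_point`; an option 2 class would expose the bytes and leave a call like this to the user.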
Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)
What *is* pretty obvious is that we need ways to convert byte
sequences from one encoding to another (what
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html
addresses) and ways to run the various Unicode algorithms over both
byte and codepoint sequences. (We need byte sequence support even if
we eventually pick class (1) in order to support users who want to
stick another encoding into a vector<char>.) I'm hoping we can get the
algorithms into TS2 before we need to firmly decide on the less
obvious questions.
Jeffrey
.
Author: Tony V E <tvaneerd@gmail.com>
Date: Wed, 24 Apr 2013 15:41:21 -0400
Raw View
--047d7b34405a8ab64f04db207c72
Content-Type: text/plain; charset=ISO-8859-1
On Wed, Apr 24, 2013 at 2:41 PM, Jeffrey Yasskin <jyasskin@google.com> wrote:
>
> I'd classify the options into two general categories:
>
> 1) A unicode string class that presents its contents as a sequence of
> code points, without exposing its clients to the sequence of bytes
> that underlie these code points. This could be the python-style object
> I've been suggesting or could be an object that presents a
> bidirectional iterator that converts on the fly.
>
> 2) An "encoded" string class that presents its contents as a sequence
> of bytes along with a description of the encoding that should be used
> to interpret those bytes, probably along with an iterator that can
> convert from each encoding.
>
> Neither of these is wrong, but we only want to standardize one, and
> it's not totally obvious which is better. (If it's totally obvious to
> you, that probably means you're not considering enough viewpoints.)
>
>
Let me attempt to claim (playing _somewhat_ devil's advocate) that we want
class 1, implemented via UTF8, thus getting a specific case of 2 as well.
That is, not just a class 1 that may or may not be UTF8, but one defined to be
UTF8, so that you can rely on the bytes if you want or need to.
Reasons:
- UTF8 can work with things like strcpy(), so lots of code just works
(although "just works" can sometimes be considered harmful if it wasn't
expected)
- UTF8 is size efficient
- UTF8 is not *too* iterator inefficient as you never need to go more than
a few bytes left or right to find the start of a code point (ie you don't
need to go to the beginning of the string, and you can tell if a byte is in
the middle of a code point or not). Of course, with an iterator, you
should never be in the middle of a codepoint anyhow.
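The resynchronization property in the last bullet can be sketched as follows. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so from any byte offset the start of the enclosing code point is at most three bytes back. The function names are hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// True for bytes of the form 10xxxxxx, which never start a code point.
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Given any byte offset into well-formed UTF-8, return the offset of the
// first byte of the code point that contains it. For well-formed input the
// loop steps back at most three times.
inline std::size_t code_point_start(std::string_view s, std::size_t pos) {
    while (pos > 0 && is_continuation_byte(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

For example, in the five bytes of "a", U+20AC, "z", offsets 1, 2 and 3 all map back to offset 1, where the three-byte sequence for U+20AC begins.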
Downsides:
- Windows uses UTF16. That's Windows' fault. UTF16 is the worst of both
worlds (still requires multibyte sequences, yet takes up too much space).
I'd be OK with functions that convert to other encodings, but I think UTF8
should be the default and the focus.
Tony
--047d7b34405a8ab64f04db207c72--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Wed, 24 Apr 2013 12:47:12 -0700 (PDT)
Raw View
------=_Part_5_1934875.1366832832184
Content-Type: text/plain; charset=ISO-8859-1
It's far more than just Windows. It's Java, .NET, and every Windows-focused
application. You cannot just dump the other encodings. It is nothing
Windows-specific, it's simple compatibility. If you have an existing UTF-16
application that interoperates with a bunch of other UTF-16 applications,
there's absolutely no reason whatsoever to go to UTF-8. Your other points
also equally apply to UTF-16 in the relevant ecosystems. There is nothing
special about UTF-8.
------=_Part_5_1934875.1366832832184--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Wed, 24 Apr 2013 22:57:21 +0300
Raw View
--089e01294802c0421b04db20b507
Content-Type: text/plain; charset=ISO-8859-1
On 24 April 2013 22:47, DeadMG <wolfeinstein@gmail.com> wrote:
> It's far more than just Windows. It's Java, .NET, and every
> Windows-focused application. You cannot just dump the other encodings. It
> is nothing Windows-specific, it's simple compatibility. If you have an
> existing UTF-16 application that interoperates
>
>
Sure, but do remember that not standardizing them isn't the same as dumping
them. It's unlikely that we'll ever standardize libicu, so we need to consider
what can be reasonably done.
--089e01294802c0421b04db20b507--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Wed, 24 Apr 2013 12:59:49 -0700 (PDT)
Raw View
------=_Part_18_5652743.1366833589942
Content-Type: text/plain; charset=ISO-8859-1
Standardizing a Unicode string as UTF-8 and then only ever using that in
new Standard interfaces would be dumping the other encodings.
I can see the argument for a Pythonic mystery encoding, maybe. But there's
no way I'd prevent any implementer from setting that encoding to UTF-16 on
Windows and Jeffrey would want to be able to set his to UTF-32 with some
storage magic and so on and so forth. Option 1 vs Option 2 is a debate, but
"UTF-8 everywhere" is not even a question. I will never propose such a
thing.
------=_Part_18_5652743.1366833589942--
.
Author: Tony V E <tvaneerd@gmail.com>
Date: Wed, 24 Apr 2013 16:03:01 -0400
Raw View
--e89a8f83a63d05e15204db20ca49
Content-Type: text/plain; charset=ISO-8859-1
You only need the other encodings on the edges of your app.
Think long term. Some day down the road Windows and Java won't exist,
and/or they'll have seen the error of their ways and converted to UTF8.
On Wed, Apr 24, 2013 at 3:59 PM, DeadMG <wolfeinstein@gmail.com> wrote:
> Standardizing a Unicode string as UTF-8 and then only ever using that in
> new Standard interfaces would be dumping the other encodings.
>
> I can see the argument for a Pythonic mystery encoding, maybe. But there's
> no way I'd prevent any implementer from setting that encoding to UTF-16 on
> Windows and Jeffrey would want to be able to set his to UTF-32 with some
> storage magic and so on and so forth. Option 1 vs Option 2 is a debate, but
> "UTF-8 everywhere" is not even a question. I will never propose such a
> thing.
--e89a8f83a63d05e15204db20ca49--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Wed, 24 Apr 2013 13:31:22 -0700 (PDT)
Raw View
------=_Part_23_20064960.1366835482442
Content-Type: text/plain; charset=ISO-8859-1
Since I have clearly stated that I am not going to propose that, if you
want to, then work on your own proposal. In either case, kindly stop
wasting my time.
------=_Part_23_20064960.1366835482442--
.
Author: Jeffrey Yasskin <jyasskin@google.com>
Date: Wed, 24 Apr 2013 13:31:25 -0700
Raw View
On Wed, Apr 24, 2013 at 12:59 PM, DeadMG <wolfeinstein@gmail.com> wrote:
> Standardizing a Unicode string as UTF-8 and then only ever using that in new
> Standard interfaces would be dumping the other encodings.
>
> I can see the argument for a Pythonic mystery encoding, maybe. But there's
> no way I'd prevent any implementer from setting that encoding to UTF-16 on
> Windows and Jeffrey would want to be able to set his to UTF-32 with some
> storage magic and so on and so forth.
(Disclaimer: I haven't checked with our ICU folks or the other C++
folks at Google, so the following is just an educated guess.)
FWIW, I don't think we would want to be able to set the unistring's
internal encoding to UTF-32 if that wasn't the default. The reason to
use a Python3-style encoding would be to allow random access, but that
has to be part of the interface to be useful. I think we'd prefer to
live with a UTF-8 string and no random access rather than using a
non-standard random access interface.
Don't interpret that as an argument _for_ the "UTF-8 everywhere"
option (on which I'm neutral); I'm just trying not to be an example
against it. :)
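For concreteness, here is a rough sketch of what a Python3-style representation could look like: every code point is stored at one fixed width (1, 2 or 4 bytes), chosen from the largest code point in the string, which is what makes random access O(1). The name `flex_string` and its members are made up for illustration; this is not the interface from N3572 or from any existing library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: fixed-width storage whose width depends on content.
class flex_string {
public:
    explicit flex_string(const std::u32string& cps) : size_(cps.size()) {
        char32_t largest = 0;
        for (char32_t cp : cps) if (cp > largest) largest = cp;
        // Pick the narrowest width that can hold every code point.
        width_ = largest <= 0xFF ? 1 : largest <= 0xFFFF ? 2 : 4;
        bytes_.resize(size_ * width_);
        for (std::size_t i = 0; i < size_; ++i)
            for (std::size_t b = 0; b < width_; ++b)  // little-endian bytes
                bytes_[i * width_ + b] =
                    static_cast<std::uint8_t>(cps[i] >> (8 * b));
    }

    std::size_t size() const { return size_; }    // number of code points
    std::size_t width() const { return width_; }  // bytes per code point

    // O(1) random access: the offset is a single multiplication.
    char32_t operator[](std::size_t i) const {
        char32_t cp = 0;
        for (std::size_t b = 0; b < width_; ++b)
            cp |= static_cast<char32_t>(bytes_[i * width_ + b]) << (8 * b);
        return cp;
    }

private:
    std::size_t size_;
    std::size_t width_;
    std::vector<std::uint8_t> bytes_;
};
```

The point of the sketch is that `operator[]` only makes sense if it is part of the public interface; hiding the same storage behind a UTF-8-style iterator would give up the one thing this layout buys.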
> Option 1 vs Option 2 is a debate, but
> "UTF-8 everywhere" is not even a question. I will never propose such a
> thing.
.
Author: Zhihao Yuan <lichray@gmail.com>
Date: Wed, 24 Apr 2013 16:48:06 -0400
On Wed, Apr 24, 2013 at 3:41 PM, Tony V E <tvaneerd@gmail.com> wrote:
> - UTF8 is size efficient
You joke. UTF-8 uses 3 bytes to encode Asian characters, while any Asian
language-specific encoding needs only 2 bytes. More interestingly, GB18030,
as a full Unicode implementation, can encode any CJK character in 2
bytes. UTF-8 sucks.
> - UTF8 is not *too* iterator inefficient as you never need to go more than a
> few bytes left or right to find the start of a code point (ie you don't need
> to go to the beginning of the string, and you can tell if a byte is in the
> middle of a code point or not). Of course, with an iterator, you should
> never be in the middle of a codepoint anyhow.
AFAIK, it's the slowest.
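For reference, the resynchronization step being discussed can be sketched as below, assuming well-formed UTF-8 input. Continuation bytes always have the bit pattern 10xxxxxx, so stepping left until a byte without that pattern is found takes at most three steps. The function names are illustrative only.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which only occur inside a sequence.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Moves `pos` left to the first byte of the code point that contains it.
// Assumes `s` is well-formed UTF-8 and `pos < s.size()`.
inline std::size_t code_point_start(const std::string& s, std::size_t pos) {
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

This only shows that the search is bounded; it says nothing about how the per-code-point cost compares with UTF-16 or UTF-32 iteration.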
> Downsides
> - Windows uses UTF16. That's Windows' fault. UTF16 is the worst of both
> worlds (still requires multibyte sequences, yet takes up too much space).
UTF-16 balances the space usage, and it's very fast. Mixing up the concepts
of bytes and strings is C's big fault.
> I'd be OK with functions that convert to other encodings, but I think UTF8
> should be the default and the focus.
Absolutely not.
--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://4bsd.biz/
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 20:08:33 -0700 (PDT)
------=_Part_295_31047644.1366859313830
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 1:03:01 PM UTC-7, Tony V E wrote:
>
> You only need the other encodings on the edges of your app.
>
> Think long term. Some day down the road Windows and Java won't exist,
> and/or they'll have seen the error of their ways and converted to UTF8.
>
... what? That's what you're banking on? That Windows and Java will vanish
into the aether some 10+ years down the line? (Neither of them is going
to revamp its APIs just because some people prefer UTF-8.)
We shouldn't make decisions based on events that *might eventually* happen.
We should make decisions based on good knowledge.
Also, Windows and Java aren't the only ones that use UTF-16. Qt does too, in
its QString class.
I'm personally in favor of UTF-8 over all other encodings. However, that is
simply not *realistic*.
------=_Part_295_31047644.1366859313830--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 20:11:31 -0700 (PDT)
------=_Part_199_24404490.1366859491612
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 1:48:06 PM UTC-7, Zhihao Yuan wrote:
>
> On Wed, Apr 24, 2013 at 3:41 PM, Tony V E <tvan...@gmail.com> wrote:
> > - UTF8 is size efficient
>
> You joke. UTF-8 uses 3 bytes to encode Asian characters, while any Asian
> language-specific encoding needs only 2 bytes. More interestingly, GB18030,
> as a full Unicode implementation, can encode any CJK character in 2
> bytes. UTF-8 sucks.
>
Some would disagree on that point <http://utf8everywhere.org/>. I'll let
those arguments (and the data they use to support it) speak for themselves.
Note that this doesn't mean we shouldn't support UTF-16 just as fully as we
do UTF-8. Choice of Unicode encoding is the right of every C++ programmer.
Even if they choose the wrong one ;)
------=_Part_199_24404490.1366859491612--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 24 Apr 2013 20:41:24 -0700 (PDT)
------=_Part_210_17090579.1366861284615
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, April 24, 2013 11:41:14 AM UTC-7, Jeffrey Yasskin wrote:
>
> On Wed, Apr 24, 2013 at 11:08 AM, Nicol Bolas <jmck...@gmail.com> wrote:
> > On Wednesday, April 24, 2013 9:55:43 AM UTC-7, Ville Voutilainen wrote:
> >>
> >> On 24 April 2013 19:23, Nicol Bolas <jmck...@gmail.com> wrote:
> >>>>
> >>>> You're reaching too far and too early into implementation details if
> >>>> you think it *has to* use erasure.
> >>>
> >>>
> >>> Whether it's type erasure or something else, this iterator access is
> >>> not going to be a simple pointer access. Each call to `++` or `*` is
> >>> going to have to do a lot more work than a specific encoder's iterator.
> >>> It's going to have to figure out which encoding the type actually is,
> >>> then call an appropriate function based on that.
> >>
> >>
> >> That's not how it needs to be done. And such a common type doesn't
> >> necessarily need to do encoding-specific
> >> traversal at all, it may well be sufficient for such type to allow
> >> blasting the raw bits into various sinks, or
> >> just convey the type between different subsystems.
> >
> >
> > OK, clearly before this conversation can proceed any further, there
> > needs to be some definition of what exactly we're talking about.
> >
> > When I hear the word "string", I generally think of an object that
> > contains an ordered sequence of characters, which can be accessed in
> > some way and probably have basic sequencing operations performed on
> > them. If it's mutable, insertion, deletion, and such can be used. If
> > it's not mutable, then there should probably be some APIs that do
> > copy-insertion/deletion (creating a new string that is the result of
> > inserting/deleting).
> >
> > If all you're talking about is some memory object which cannot be useful
> > until it is transferred into some other object, that's not a "string" by
> > any definition I'm aware of. That's not even an iterator range.
> >
> > So what exactly are you arguing we should have? A string or something
> > else?
> >
> > Furthermore, what good is "blasting the raw bits into various sinks"?
> > Ignoring the fact that there's no such thing as a "sink", the class is
> > designed so that the user explicitly doesn't know the actual encoding of
> > the data. So the "raw bits" themselves are completely meaningless to
> > anyone. And let's not forget endian conversion issues on top of that.
> >
> > I cannot imagine what use it would be to take a string of an unknown
> > encoding and send its "raw bits" somewhere. At least, what use that
> > would be compared to taking an actual, specific encoding.
>
> I'd classify the options into two general categories:
>
> 1) A unicode string class that presents its contents as a sequence of
> code points, without exposing its clients to the sequence of bytes
> that underlie these code points. This could be the python-style object
> I've been suggesting or could be an object that presents a
> bidirectional iterator that converts on the fly.
>
> 2) An "encoded" string class that presents its contents as a sequence
> of bytes along with a description of the encoding that should be used
> to interpret those bytes, probably along with an iterator that can
> convert from each encoding.
>
#2 is not what people actually need from such a string. What people need is
a string that does all of the following:

1: Has an explicit encoding, such that you can hand it a block of data in
that encoding and no transcoding will take place. This also means that I
can get (const) access to the string's data as an array of code units, for
passing to legacy APIs. The encoding should be flexible, so that users can
provide their own encodings for things that we don't provide them for (much
like allocators).

2: *Guarantees* the encoding. No operations on this string will cause its
data to be encoded wrongly. Any attempt to pass improperly encoded data
will throw an exception.

3: Guarantees Unicode. The Unicode spec has rules about what codepoints can
appear where. The string should abide by those rules and fail at any
operation that would violate them.

4: Works as a proper codepoint range. I should not have to *copy* my string
(again) or fumble about with out-of-class iterators. This means all of our
algorithms will work on them naturally. All forward-facing iterators will
be codepoint iterators; you don't get (direct) iterator access to the
code units, nor do you get operator[].

5: Supports transcoding. It can take arbitrarily encoded data and convert it
to its given encoding.

In short, it should be a sequence of Unicode codepoints, where the encoding
is directly exposed to the user, so that they can more easily interface
with other APIs that don't use this string type. And that's the main reason
why we need the encoding to be directly exposed: because only the user of
the type knows what encoding their eventual destination uses. Therefore,
only the user of the object can know whether they want to match it or not.

If we need some kind of generic `Unicode codepoint range` class that could
work with any encoding transparently, we can have that. But it would not
own the actual storage; it would be like a `string_ref/view`.

The actual storage should always have an actual, forward-facing encoding.
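To make requirements 1, 2 and 4 a bit more tangible, here is a minimal sketch of a string templated on an encoding policy. It is only an illustration under my own assumptions: the `utf16` policy, the `decode` signature and the member names are hypothetical, and this is not the interface from N3572. Requirement 3 (Unicode rules beyond well-formedness) and requirement 5 (transcoding) are left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical encoding policy for UTF-16: names the code unit type and
// decodes one code point, throwing on an unpaired surrogate.
struct utf16 {
    using code_unit = char16_t;
    // Decodes the code point starting at s[i] and advances i past it.
    static char32_t decode(const std::u16string& s, std::size_t& i) {
        char32_t u = s[i++];
        if (u < 0xD800 || u > 0xDFFF) return u;   // BMP: a single unit
        if (u >= 0xDC00 || i == s.size())         // lone low or truncated
            throw std::invalid_argument("unpaired surrogate");
        char32_t lo = s[i++];
        if (lo < 0xDC00 || lo > 0xDFFF)
            throw std::invalid_argument("unpaired surrogate");
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
};

template <class Encoding>
class encoded_string {
public:
    using unit_string = std::basic_string<typename Encoding::code_unit>;

    // Adopts data already in Encoding without transcoding, validating once.
    explicit encoded_string(unit_string units) : units_(std::move(units)) {
        for (std::size_t i = 0; i < units_.size();)
            Encoding::decode(units_, i);          // throws if ill-formed
    }

    // Read-only code unit access, e.g. for passing to legacy APIs.
    const typename Encoding::code_unit* data() const { return units_.data(); }
    std::size_t size_in_units() const { return units_.size(); }

    // Forward iterator over code points; no code unit iterator, no operator[].
    class iterator {
    public:
        iterator(const unit_string* s, std::size_t pos) : s_(s), pos_(pos) {}
        char32_t operator*() const {
            std::size_t i = pos_;
            return Encoding::decode(*s_, i);
        }
        iterator& operator++() { Encoding::decode(*s_, pos_); return *this; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }
    private:
        const unit_string* s_;
        std::size_t pos_;
    };
    iterator begin() const { return iterator(&units_, 0); }
    iterator end() const { return iterator(&units_, units_.size()); }

private:
    unit_string units_;
};
```

With this shape, `encoded_string<utf16>` can hand `data()` straight to a UTF-16 API, while a range-for loop over the string sees code points only. A user-supplied policy with the same two members would plug in the same way, much as an allocator does.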
> Neither of these is wrong, but we only want to standardize one, and
> it's not totally obvious which is better. (If it's totally obvious to
> you, that probably means you're not considering enough viewpoints.)
>
Or it means that we're looking at how Unicode works in the real world of
C++. This "any_unicode_string" has not been written into any C++-facing
library that supports Unicode that I'm aware of. (Python's string type is
not C++-facing, though obviously C++ Python modules can use it.) By
contrast, we have Qt's QString, MFC's CString, wxWidgets' wxString, ICU's
UnicodeString, and many other Unicode strings, *all of which* use a specific
Unicode encoding. None of the major libraries out there have adopted a class
anything like your Option #1.

The only upgrade from all of those types (besides getting rid of the stupid
stuff in some of them) that we're asking for is the ability to template the
type on a Unicode encoding, just as we allow the character type of
`basic_string` to be a template parameter.

So I would say that it is obvious which is better: the kind that is in use
in *millions* of lines of actual C++ code. Not the kind that only exists in
Python.

Standard practice has weighed in on this issue. Why should we go against
standard practice for Unicode string types?
------=_Part_210_17090579.1366861284615--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Wed, 24 Apr 2013 22:33:36 -0700 (PDT)
------=_Part_250_11106590.1366868016981
Content-Type: text/plain; charset=ISO-8859-1
What I would also add is that the original paper is slightly defective. I
had intended for the encoding to not show up in the interface at all, but
it did for c_str(). I should have changed that so that you could request a
C-string of any encoding from an encoded_string of any encoding (only
guaranteeing O(1) when the encodings match). Then, if you do

    auto str = f();
    // use str

then the encoding of str is irrelevant unless you need to know.
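A rough sketch of what such a c_str() could look like, purely as an illustration: the string below happens to store UTF-32, `c_str<char32_t>()` borrows the internal buffer in O(1), and `c_str<char>()` transcodes to UTF-8 into a buffer owned by the returned object. The names `unistring`, `c_str_holder` and `append_utf8` are invented for this example and do not appear in the paper.

```cpp
#include <string>
#include <utility>

// Appends the UTF-8 encoding of `cp` (assumed to be a valid scalar value).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Either borrows the string's own buffer (encodings match) or owns a
// freshly transcoded copy.
template <class CharT>
class c_str_holder {
public:
    explicit c_str_holder(const CharT* borrowed) : ptr_(borrowed) {}
    explicit c_str_holder(std::basic_string<CharT> owned)
        : owned_(std::move(owned)), ptr_(nullptr) {}
    const CharT* get() const { return ptr_ ? ptr_ : owned_.c_str(); }
    bool borrowed() const { return ptr_ != nullptr; }
private:
    std::basic_string<CharT> owned_;
    const CharT* ptr_;
};

class unistring {
public:
    explicit unistring(std::u32string s) : data_(std::move(s)) {}
    // Request a C-string in the encoding implied by CharT.
    template <class CharT> c_str_holder<CharT> c_str() const;
private:
    std::u32string data_;  // internal encoding, invisible to most callers
};

// Matching encoding: O(1), no copy.
template <>
inline c_str_holder<char32_t> unistring::c_str<char32_t>() const {
    return c_str_holder<char32_t>(data_.c_str());
}

// Different encoding: O(n) transcode to UTF-8.
template <>
inline c_str_holder<char> unistring::c_str<char>() const {
    std::string out;
    for (char32_t cp : data_) append_utf8(out, cp);
    return c_str_holder<char>(std::move(out));
}
```

Code that only passes the string around never names an encoding; only the call site that talks to an external API picks one, and it pays for a conversion only when the encodings differ.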
------=_Part_250_11106590.1366868016981--
.
Author: Lawrence Crowl <crowl@googlers.com>
Date: Thu, 25 Apr 2013 14:10:04 -0700
On 4/24/13, Tony V E <tvaneerd@gmail.com> wrote:
> On Apr 24, 2013 Jeffrey Yasskin <jyasskin@google.com> wrote:
> > I'd classify the options into two general categories:
> >
> > 1) A unicode string class that presents its contents as a
> > sequence of code points, without exposing its clients to the
> > sequence of bytes that underlie these code points. This could
> > be the python-style object I've been suggesting or could be an
> > object that presents a bidirectional iterator that converts on
> > the fly.
> >
> > 2) An "encoded" string class that presents its contents as
> > a sequence of bytes along with a description of the encoding
> > that should be used to interpret those bytes, probably along
> > with an iterator that can convert from each encoding.
> >
> > Neither of these is wrong, but we only want to standardize one,
> > and it's not totally obvious which is better. (If it's totally
> > obvious to you, that probably means you're not considering
> > enough viewpoints.)
>
> Let me attempt to claim (_somewhat_ devil's advocate) that we want
> class 1, with implementation via UTF8, thus getting a specific
> case of 2 as well. ie not just 1 that may or may not be UTF8,
> but define that it must be UTF8 so that you can rely on the bytes
> if you want or need to.
>
> Reasons:
>
> - UTF8 can work with things like strcpy(), so lots of code just
> works (although "just works" can sometimes be considered harmful
> if it wasn't expected)
It "just works" by a combination of design and accident. However,
strlen returns the number of bytes, not the number of code points, so it
fails to return the right data. Repurposing functions because of accidents
is not the path to clear code.
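A small sketch of the mismatch, assuming well-formed, NUL-terminated UTF-8: strlen counts bytes, whereas counting code points means skipping the continuation bytes (10xxxxxx). The helper name `utf8_code_points` is made up for this example.

```cpp
#include <cstddef>
#include <cstring>

// Counts code points in a NUL-terminated, well-formed UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
inline std::size_t utf8_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    return n;
}
```

For the five-letter word "naïve" encoded as UTF-8, std::strlen reports 6 because the "ï" occupies two bytes, while the helper reports 5.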
> - UTF8 is size efficient
Efficiency depends on your corpus. UTF8 is most space efficient
for Latin scripts. For some European or Middle Eastern scripts,
UTF8 and UTF16 are space equivalent. For East Asian scripts,
UTF16 is most efficient.
On systems with 12-bit (e.g. PDP-8) or 24-bit words, UTF12 is most
space efficient.
For scripts outside the basic plane, UTF16 and UTF32 are
space-equivalent, but UTF32 is more time efficient. Likewise, for
South and East Asian scripts, UTF12 and UTF24 are space-equivalent,
but UTF24 is more time efficient.
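The per-code-point byte counts behind the UTF8/UTF16/UTF32 comparison can be written down directly. This is just a sketch with illustrative function names, and each function assumes its argument is a valid Unicode scalar value.

```cpp
#include <cstddef>

// Bytes needed to encode one Unicode scalar value in each encoding form.
inline std::size_t utf8_size(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
inline std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one unit in the BMP, else a pair
}
inline std::size_t utf32_size(char32_t) {
    return 4;
}
```

So U+0041 costs 1/2/4 bytes, U+00E9 costs 2/2/4, U+4E2D costs 3/2/4, and anything outside the basic plane costs 4/4/4, which is where the "depends on your corpus" conclusion comes from.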
> - UTF8 is not *too* iterator inefficient as you never need to go
> more than a few bytes left or right to find the start of a code
> point (ie you don't need to go to the beginning of the string, and
> you can tell if a byte is in the middle of a code point or not).
> Of course, with an iterator, you should never be in the middle
> of a codepoint anyhow.
UTF32 has the fastest iterator performance. It can matter, because
it is decision-less, which makes it viable for use in vector units.
UTF16 is somewhat harder. UTF8 is much harder.
> Downsides
> - Windows uses UTF16. That's Windows' fault. UTF16 is the worst
> of both worlds (still requires multibyte sequences, yet takes up
> too much space).
You're forgetting that UTF8 requires more validation. There are
lots of byte sequences that do not map to code points, so what do
you do with them?
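To illustrate how much checking that involves, here is a minimal validator sketch. It rejects stray continuation bytes, truncated sequences, overlong forms, encoded surrogates and values above U+10FFFF. The function name is my own, and what to do on failure (throw, replace, pass through) is exactly the open policy question.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, min;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                      // stray 10xxxxxx or 11111xxx
        if (n - i < len) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

Every one of those branches is a decision that UTF32 iteration does not have to make per element, although UTF32 still needs the surrogate and range checks if it is to be validated.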
> I'd be OK with functions that convert to other encodings, but I
> think UTF8 should be the default and the focus.
It seems to me that striving for one type is not likely to work,
given disparate uses and existing legacy files. I also think we
are likely to need 'unvalidated UTF8' and 'validated UTF8' types
in the mix as well.
Even so, we need a 'vocabulary' type, and I think it should adapt
its representation to the needs of its content. Doing so would
probably result in the least overall pain.
--
Lawrence Crowl
.
Author: cornedbee@google.com
Date: Fri, 26 Apr 2013 02:45:21 -0700 (PDT)
------=_Part_1072_21272625.1366969521540
Content-Type: text/plain; charset=ISO-8859-1
On Thursday, April 25, 2013 11:10:04 PM UTC+2, Lawrence Crowl wrote:
>
> On 4/24/13, Tony V E <tvan...@gmail.com> wrote:
> > - UTF8 is not *too* iterator inefficient as you never need to go
> > more than a few bytes left or right to find the start of a code
> > point (ie you don't need to go to the beginning of the string, and
> > you can tell if a byte is in the middle of a code point or not).
> > Of course, with an iterator, you should never be in the middle
> > of a codepoint anyhow.
>
> UTF32 has the fastest iterator performance. It can matter, because
> it is decision-less, which makes it viable for use in vector units.
> UTF16 is somewhat harder. UTF8 is much harder.
>
On the other hand, for UTF8 and western scripts, you can fit 4 times as
much text into the L1 cache as with UTF32. That may be quite a significant
gain.
Sebastian
--
------=_Part_1072_21272625.1366969521540--
.
Author: FrankHB1989 <frankhb1989@gmail.com>
Date: Fri, 26 Apr 2013 12:35:32 -0700 (PDT)
------=_Part_1867_4935299.1367004932742
Content-Type: text/plain; charset=UTF-8
On Wednesday, April 24, 2013 at 6:46:35 PM UTC+8, Nicol Bolas wrote:
> If you don't care what encoding a string uses, then you can just use an
> encoding-aware string anyway. All of them should be inter-convertible
> between each other (though it should require explicit conversion). And they
> should all be buildable from raw data (an iterator range and the encoding
> of that range). So all you need to do is pick one and you're fine.
>
Yes I can. And I *have to*.
>
> I don't see why anyone would need a string that is *explicitly* unaware
> of its encoding. What does that gain you?
>
Firstly and conceptually, the encoding should not be exposed through the
interface of a "pure" string, namely a sequence of *characters*. If the
encoding is mandated in your mind, you are actually talking about a
sequence of *code points*, not just a string.
Secondly, leaving the encoding unspecified can allow better optimization of
transcoding for string operands with different encodings. In most cases the
implementation knows better than users how transcoding algorithms perform
with a specific intermediate encoding, and it can transcode only when
that is really necessary.
P.S. std::basic_string is still too strict to be the proper abstraction. Do
we have something like traits?
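A minimal sketch of what such an encoding-agnostic interface could look like. The class name `text` and its members are hypothetical, not taken from N3572 or any existing library. The UTF-32 storage is just the simplest choice for the example, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical illustration: the interface never mentions the stored encoding.
class text {
public:
    // Build from code points; how they are stored is an implementation detail.
    explicit text(std::u32string code_points)
        : storage_(std::move(code_points)) {}

    // Number of code points, independent of any encoding.
    std::size_t length() const { return storage_.size(); }

    // Explicit conversion at the boundary; transcoding happens only here.
    std::string to_utf8() const {
        std::string out;
        for (char32_t cp : storage_) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

private:
    std::u32string storage_;  // assumption: UTF-32 chosen only for simplicity
};
```

Since callers only see `length()` and the explicit `to_utf8()`, an implementation would be free to switch the internal representation without breaking them.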
------=_Part_1867_4935299.1367004932742--
.
Author: FrankHB1989 <frankhb1989@gmail.com>
Date: Fri, 26 Apr 2013 12:48:05 -0700 (PDT)
------=_Part_310_30495234.1367005685484
Content-Type: text/plain; charset=UTF-8
On Thursday, April 25, 2013 at 4:48:06 AM UTC+8, Zhihao Yuan wrote:
>
> UTF-16 balances the space usage, and it's very fast. To mix the concept
> of bytes and string is C's big fault.
>
UTF-16 is variable-length. In general, it is fast only if you throw away
surrogate pairs, i.e. code points outside the BMP. (And thus it becomes UCS-2.)
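To illustrate the variable-length point, here is a rough sketch of stepping through UTF-16 one code point at a time. The function name `next_code_point` is made up for this example, and an unpaired surrogate is simply returned as-is, which a real implementation would probably treat as an error.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Precondition: i < s.size(). Unpaired surrogates are returned unchanged.
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    char32_t unit = s[i++];
    bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
    if (is_high && i < s.size()) {
        char32_t low = s[i];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return unit;  // BMP code point (or an unpaired surrogate)
}
```

The branch on the surrogate range is the per-element decision that UCS-2 avoids by refusing to represent anything outside the BMP.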
------=_Part_310_30495234.1367005685484--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sun, 28 Apr 2013 12:09:16 -0700 (PDT)
------=_Part_3187_5978602.1367176156358
Content-Type: text/plain; charset=ISO-8859-1
I agree with Lawrence on that: UTF32 is more efficient for representing
general Unicode characters.
I think the issue here is that it is difficult to resolve the following two
issues:
(1) to select a preferable encoding (for reading from a file, system
representation and exchange);
(2) to select a common string format for internal representation (arrays of
characters that we can easily compare).
The reason for the second point is that Unicode itself defines four
normalization forms (http://unicode.org/reports/tr15/#Examples):
NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
ligatures and other compatibility characters.
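A small sketch of why the normalization form matters for comparison: the same visible character "é" has different code point sequences in NFC and NFD, so an element-wise comparison of the two fails. The names below are made up for the example, and no normalization is actually performed; standard C++ has no facility for that, so a library such as ICU would be needed.

```cpp
#include <string>

// "é" in NFC: the single precomposed code point U+00E9.
const std::u32string e_acute_nfc = U"\u00E9";
// "é" in NFD: base letter U+0065 followed by combining acute accent U+0301.
const std::u32string e_acute_nfd = U"e\u0301";

// Plain element-wise comparison, as for any array of code points.
inline bool same_code_points(const std::u32string& a,
                             const std::u32string& b) {
    return a == b;
}
```

Here `same_code_points(e_acute_nfc, e_acute_nfd)` is false even though both strings render identically, so both sides have to be brought to one agreed form before "simple" comparison means anything.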
Point (2) is to create strings for easy access of elements and comparison.
Comparison is an issue: even French words have a special way of
comparison based on accented characters.
Other languages have their own specific ways of comparing words, and there
may be more than one way of doing so. I think this issue can be left aside.
My suggestion would be to have two basic forms of representation:
(1) encoded strings;
(2) simple strings of characters (char8, char16 and char32).
An implementation should provide
(a) some forms of encoding for encoded strings;
(b) some conversions between encoded strings and simple strings (of char8,
char16 and char32);
(c) in addition to standard comparison of simple strings (like arrays of
elements), there should be conversion routines for
various languages.
The user should be able to use these encodings, conversions and
comparisons, and should be able to provide their own.
There is also the GB18030 standard (for Chinese characters), which is different
from Unicode.
Mikhail.
On Thursday, April 25, 2013 10:10:04 PM UTC+1, Lawrence Crowl wrote:
> On 4/24/13, Tony V E <tvan...@gmail.com <javascript:>> wrote:
> > On Apr 24, 2013 Jeffrey Yasskin <jyas...@google.com <javascript:>>
> wrote:
> > > I'd classify the options into two general categories:
> > >
> > > 1) A unicode string class that presents its contents as a
> > > sequence of code points, without exposing its clients to the
> > > sequence of bytes that underlie these code points. This could
> > > be the python-style object I've been suggesting or could be an
> > > object that presents a bidirectional iterator that converts on
> > > the fly.
> > >
> > > 2) An "encoded" string class that presents its contents as
> > > a sequence of bytes along with a description of the encoding
> > > that should be used to interpret those bytes, probably along
> > > with an iterator that can convert from each encoding.
> > >
> > > Neither of these is wrong, but we only want to standardize one,
> > > and it's not totally obvious which is better. (If it's totally
> > > obvious to you, that probably means you're not considering
> > > enough viewpoints.)
> >
> > Let me attempt to claim (_somewhat_ devil's advocate) that we want
> > class 1, with implementation via UTF8, thus getting a specific
> > case of 2 as well. ie not just 1 that may or may not be UTF8,
> > but define that it must be UTF8 so that you can rely on the bytes
> > if you want or need to.
> >
> > Reasons:
> >
> > - UTF8 can work with things like strcpy(), so lots of code just
> > works (although "just works" can sometimes be considered harmful
> > if it wasn't expected)
>
> It "just works" by a combination of design and accident. However,
> strlen fails to return the right data. Repurposing functions
> because of accidents is not the path to clear code.
>
> > - UTF8 is size efficient
>
> Efficiency depends on your corpus. UTF8 is most space efficient
> for Latin scripts. For some European or Middle Eastern scripts,
> UTF8 and UTF16 are space equivalent. For East Asian scripts,
> UTF16 is most efficient.
>
> On systems with 12-bit (e.g. PDP-8) or 24-bit words, UTF12 is most
> space efficient.
>
> For scripts outside the basic plane, UTF16 and UTF32 are
> space-equivalent, but UTF32 is more time efficient. Likewise,
> UTF12 and UTF24 are space equivalent, but UTF24 is more time
> efficient for South and East Asian scripts.
>
> > - UTF8 is not *too* iterator inefficient as you never need to go
> > more than a few bytes left or right to find the start of a code
> > point (ie you don't need to go to the beginning of the string, and
> > you can tell if a byte is in the middle of a code point or not).
> > Of course, with an iterator, you should never be in the middle
> > of a codepoint anyhow.
>
> UTF32 has the fastest iterator performance. It can matter, because
> it is decision-less, which makes it viable for use in vector units.
> UTF16 is somewhat harder. UTF8 is much harder.
>
> > Downsides
> > - Windows uses UTF16. That's Windows' fault. UTF16 is the worst
> > of both worlds (still requires multibyte sequences, yet takes up
> > too much space).
>
> You're forgetting that UTF8 requires more validation. There are
> lots of byte sequences that do not map to code points, so what do
> you do with them?
>
> > I'd be OK with functions that convert to other encodings, but I
> > think UTF8 should be the default and the focus.
>
> It seems to me that striving for one type is not likely to work,
> given disparate uses and existing legacy files. I also think we
> are likely to need 'unvalidated UTF8' and 'validated UTF8' types
> in the mix as well.
>
> Even so, we need a 'vocabulary' type, and I think it should adapt
> its representation to the needs of its content. Doing so would
> probably result in the least overall pain.
>
> --
> Lawrence Crowl
>
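Regarding the validation point in the quoted message, here is a rough sketch of what checking UTF8 involves. The function name `is_valid_utf8` is made up, and the rules applied are the usual well-formedness ones: correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF. It only answers yes or no; what to do with a bad sequence is exactly the open question.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_for_len[len]) return false;         // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

A 'validated UTF8' type could run a check like this once at construction, while an 'unvalidated UTF8' type would skip it and leave the question to each consumer.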
------=_Part_3187_5978602.1367176156358--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sun, 28 Apr 2013 12:15:08 -0700 (PDT)
------=_Part_122_1816025.1367176509371
Content-Type: text/plain; charset=ISO-8859-1
I agree with Lawrence on that: UTF32 is more efficient for representing
general Unicode characters.
I think the issue here is that it is difficult to resolve the following two
issues:
(1) to select a preferable encoding (for reading from a file, system
representation and exchange);
(2) to select a common string format for internal representation (arrays of
characters that we can easily compare).
The reason for the second point is that Unicode itself defines four
normalization forms (http://unicode.org/reports/tr15/#Examples):
NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
ligatures and other compatibility characters.
Point (2) is to create strings for easy access of elements and comparison.
Comparison is an issue: even French words have a special way of
comparison based on accented characters.
Other languages have their own specific ways of comparing words, and there
may be more than one way of doing so. I think this issue can be left aside.
My suggestion would be to have two basic forms of representation:
(1) encoded strings;
(2) simple strings of characters (char8, char16 and char32).
An implementation should provide
(a) some forms of encoding for encoded strings;
(b) some conversions between encoded strings and simple strings (of char8,
char16 and char32);
(c) in addition to standard comparison of simple strings (like arrays of
elements), there should be conversion routines for
various languages.
The user should be able to use these encodings, conversions and
comparisons, and should be able to provide their own.
There is also the GB18030 standard (for Chinese characters), which is different
from Unicode.
Mikhail.
On Thursday, April 25, 2013 10:10:04 PM UTC+1, Lawrence Crowl wrote:
> On 4/24/13, Tony V E <tvan...@gmail.com <javascript:>> wrote:
> > On Apr 24, 2013 Jeffrey Yasskin <jyas...@google.com <javascript:>>
> wrote:
> > > I'd classify the options into two general categories:
> > >
> > > 1) A unicode string class that presents its contents as a
> > > sequence of code points, without exposing its clients to the
> > > sequence of bytes that underlie these code points. This could
> > > be the python-style object I've been suggesting or could be an
> > > object that presents a bidirectional iterator that converts on
> > > the fly.
> > >
> > > 2) An "encoded" string class that presents its contents as
> > > a sequence of bytes along with a description of the encoding
> > > that should be used to interpret those bytes, probably along
> > > with an iterator that can convert from each encoding.
> > >
> > > Neither of these is wrong, but we only want to standardize one,
> > > and it's not totally obvious which is better. (If it's totally
> > > obvious to you, that probably means you're not considering
> > > enough viewpoints.)
> >
> > Let me attempt to claim (_somewhat_ devil's advocate) that we want
> > class 1, with implementation via UTF8, thus getting a specific
> > case of 2 as well. ie not just 1 that may or may not be UTF8,
> > but define that it must be UTF8 so that you can rely on the bytes
> > if you want or need to.
> >
> > Reasons:
> >
> > - UTF8 can work with things like strcpy(), so lots of code just
> > works (although "just works" can sometimes be considered harmful
> > if it wasn't expected)
>
> It "just works" by a combination of design and accident. However,
> strlen fails to return the right data. Repurposing functions
> because of accidents is not the path to clear code.
>
> > - UTF8 is size efficient
>
> Efficiency depends on your corpus. UTF8 is most space efficient
> for Latin scripts. For some European or Middle Eastern scripts,
> UTF8 and UTF16 are space equivalent. For East Asian scripts,
> UTF16 is most efficient.
>
> On systems with 12-bit (e.g. PDP-8) or 24-bit words, UTF12 is most
> space efficient.
>
> For scripts outside the basic plane, UTF16 and UTF32 are
> space-equivalent, but UTF32 is more time efficient. Likewise,
> UTF12 and UTF24 are space equivalent, but UTF24 is more time
> efficient for South and East Asian scripts.
>
> > - UTF8 is not *too* iterator inefficient as you never need to go
> > more than a few bytes left or right to find the start of a code
> > point (ie you don't need to go to the beginning of the string, and
> > you can tell if a byte is in the middle of a code point or not).
> > Of course, with an iterator, you should never be in the middle
> > of a codepoint anyhow.
>
> UTF32 has the fastest iterator performance. It can matter, because
> it is decision-less, which makes it viable for use in vector units.
> UTF16 is somewhat harder. UTF8 is much harder.
>
> > Downsides
> > - Windows uses UTF16. That's Windows' fault. UTF16 is the worst
> > of both worlds (still requires multibyte sequences, yet takes up
> > too much space).
>
> You're forgetting that UTF8 requires more validation. There are
> lots of byte sequences that do not map to code points, so what do
> you do with them?
>
> > I'd be OK with functions that convert to other encodings, but I
> > think UTF8 should be the default and the focus.
>
> It seems to me that striving for one type is not likely to work,
> given disparate uses and existing legacy files. I also think we
> are likely to need 'unvalidated UTF8' and 'validated UTF8' types
> in the mix as well.
>
> Even so, we need a 'vocabulary' type, and I think it should adapt
> its representation to the needs of its content. Doing so would
> probably result in the least overall pain.
>
> --
> Lawrence Crowl
>
------=_Part_122_1816025.1367176509371
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
<div>I agree with Lawrence on that: UTF32 is more efficient for representin=
g general Unicode characters.<br></div><div>I think the issue here is that =
it is difficult to resolve the following two issues:<br>(1) to select a pre=
ferable encoding (for reading from a file, system representation and exchan=
ge);<br>(2) to select a common string format for internal representation (a=
rrays of characters that we can easily compare).</div><div> </div><div=
>The reason for the second point is that the Unicode itself propose 4 diffe=
rent types of representation <a href=3D"http://unicode.org/reports/tr15/#Ex=
amples">http://unicode.org/reports/tr15/#Examples</a>: <br>NFD, NFC, NFKD a=
nd NFKC. I personally prefer NFKC, but then you lose ligatures and other ch=
aracter types.</div><div>Point (2) is to create strings for easy access of =
elements and comparison. The comparison is an issue: even French words have=
special way of comparison based on accented characters. </div><div>Other l=
anguages have their own specific ways of comparing words and there may be m=
ore than one way of doing so. I think this issue can be left.</div><div>&nb=
sp;</div><div>My suggesting would be to have two basic forms of representat=
ion:<br>(1) encoded strings;</div><div>(2) simple strings of characters (ch=
ar8, char16 and char32).</div><div> An implementation should provide <=
/div><div>(a) some forms of encoding for encoded strings;</div><div>(b) som=
e conversions between encoded strings and simple strings (of char8, char16 =
and char32);</div><div>(c) in addition to standard comparison of simple str=
ings (like arrays of elements), there should be conversion routines for</di=
v><div>various languages.</div><div> </div><div>The user should be abl=
e to use these encodings, conversions and comparisons, and should be able t=
o provide their own.</div><div> </div><div>There is also GB18030 Stand=
ard (for Chinese characters), which is different from Unicode. </div><div><=
br>Mikhail.<br></div><div> </div><div><br>On Thursday, April 25, 2013 =
10:10:04 PM UTC+1, Lawrence Crowl wrote:</div><blockquote class=3D"gmail_qu=
ote" style=3D"margin: 0px 0px 0px 0.8ex; padding-left: 1ex; border-left-col=
or: rgb(204, 204, 204); border-left-width: 1px; border-left-style: solid;">=
On 4/24/13, Tony V E <<a href=3D"javascript:" target=3D"_blank" gdf-obfu=
scated-mailto=3D"TjVcqgkbPuMJ">tvan...@gmail.com</a>> wrote:
<br>> On Apr 24, 2013 Jeffrey Yasskin <<a href=3D"javascript:" target=
=3D"_blank" gdf-obfuscated-mailto=3D"TjVcqgkbPuMJ">jyas...@google.com</a>&g=
t; wrote:
<br>> > I'd classify the options into two general categories:
<br>> >
<br>> > 1) A unicode string class that presents its contents as a
<br>> > sequence of code points, without exposing its clients to the
<br>> > sequence of bytes that underlie these code points. This could
<br>> > be the python-style object I've been suggesting or could be a=
n
<br>> > object that presents a bidirectional iterator that converts o=
n
<br>> > the fly.
<br>> >
<br>> > 2) An "encoded" string class that presents its contents as
<br>> > a sequence of bytes along with a description of the encoding
<br>> > that should be used to interpret those bytes, probably along
<br>> > with an iterator that can convert from each encoding.
<br>> >
<br>> > Neither of these is wrong, but we only want to standardize on=
e,
<br>> > and it's not totally obvious which is better. (If it's totall=
y
<br>> > obvious to you, that probably means you're not considering
<br>> > enough viewpoints.)
<br>>
<br>> Let me attempt to claim (_somewhat_ devil's advocate) that we want
<br>> class 1, with implementation via UTF8, thus getting a specific
<br>> case of 2 as well. ie not just 1 that may or may not be UTF8=
,
<br>> but define that it must be UTF8 so that you can rely on the bytes
<br>> if you want or need to.
<br>>
<br>> Reasons:
<br>>
<br>> - UTF8 can work with things like strcpy(), so lots of code just
<br>> works (although "just works" can sometimes be considered harmful
<br>> if it wasn't expected)
<br>
<br>It "just works" by a combination of design and accident. However,
<br>strlen fails to return the right data. Repurposing functions
<br>because of accidents is not the path to clear code.
<br>
<br>> - UTF8 is size efficient
<br>
<br>Efficiency depends on your corpus. UTF8 is most space efficient
<br>for Latin scripts. For some European or Middle Eastern scripts,
<br>UTF8 and UTF16 are space equivalent. For East Asian scripts,
<br>UTF16 is most efficient.
<br>
<br>On systems with 12-bit (e.g. PDP-8) or 24-bit words, UTF12 is most
<br>space efficient.
<br>
<br>For scripts outside the basic plane, UTF16 and UTF32 are
<br>space-equivalent, but UTF32 is more time efficient. Likewise,
<br>UTF12 and UTF24 are space equivalent, but UTF24 is more time
<br>efficient for South and East Asian scripts.
<br>
<br>> - UTF8 is not *too* iterator inefficient as you never need to go
<br>> more than a few bytes left or right to find the start of a code
<br>> point (ie you don't need to go to the beginning of the string, and
<br>> you can tell if a byte is in the middle of a code point or not).
<br>> Of course, with an iterator, you should never be in the middle
<br>> of a codepoint anyhow.
<br>
<br>UTF32 has the fastest iterator performance. It can matter, becaus=
e
<br>it is decision-less, which makes it viable for use in vector units.
<br>UTF16 is somewhat harder. UTF8 is much harder.
<br>
<br>> Downsides
<br>> - Windows uses UTF16. That's Windows' fault. UTF16 is =
the worst
<br>> of both worlds (still requires multibyte sequences, yet takes up
<br>> too much space).
<br>
<br>You're forgetting that UTF8 requires more validation. There are
<br>lots of byte sequences that do not map to code points, so what do
<br>you do with them?
<br>
<br>> I'd be OK with functions that convert to other encodings, but I
<br>> think UTF8 should be the default and the focus.
<br>
<br>It seems to me that striving for one type is not likely to work,
<br>given disparate uses and existing legacy files. I also think we
<br>are likely to need 'unvalidated UTF8' and 'validated UTF8' types
<br>in the mix as well.
<br>
<br>Even so, we need a 'vocabulary' type, and I think it should adapt
<br>its representation to the needs of its content. Doing so would
<br>probably result in the least overall pain.
<br>
<br>--=20
<br>Lawrence Crowl
<br></blockquote>
------=_Part_122_1816025.1367176509371--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 28 Apr 2013 18:13:36 -0700 (PDT)
Raw View
------=_Part_89_23092156.1367198016966
Content-Type: text/plain; charset=ISO-8859-1
On Sunday, April 28, 2013 12:15:08 PM UTC-7, Mikhail Semenov wrote:
>
> I agree with Lawrence on that: UTF32 is more efficient for representing
> general Unicode characters.
> I think the issue here is that it is difficult to resolve the following
> two issues:
> (1) to select a preferable encoding (for reading from a file, system
> representation and exchange);
> (2) to select a common string format for internal representation (arrays
> of characters that we can easily compare).
>
> The reason for the second point is that the Unicode itself propose 4
> different types of representation
> http://unicode.org/reports/tr15/#Examples:
> NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
> ligatures and other character types.
> Point (2) is to create strings for easy access of elements and comparison.
> The comparison is an issue: even French words have special way of
> comparison based on accented characters.
> Other languages have their own specific ways of comparing words and there
> may be more than one way of doing so. I think this issue can be left.
>
> My suggesting would be to have two basic forms of representation:
> (1) encoded strings;
> (2) simple strings of characters (char8, char16 and char32).
>
If we have "encoded strings" (presumably allowing for arbitrary encodings),
why would we need "simple strings"? Isn't `basic_string` a "simple string"?
> An implementation should provide
> (a) some forms of encoding for encoded strings;
> (b) some conversions between encoded strings and simple strings (of char8,
> char16 and char32);
> (c) in addition to standard comparison of simple strings (like arrays of
> elements), there should be conversion routines for
> various languages.
>
What kind of conversions are you talking about? We already have Unicode
normalization via algorithms. So it's not clear what kind of language-based
conversions you're looking for.
------=_Part_89_23092156.1367198016966--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Mon, 29 Apr 2013 09:20:50 +0100
Raw View
--001a11c361fc08c20c04db7b901b
Content-Type: text/plain; charset=ISO-8859-1
Nicol,
I thought I had made it clear. For example, the elements of a UTF-8 string
are bytes: an element does not represent a character (unless the string is
pure ASCII). You can convert it to a string of char32 so that each element
really represents one Unicode character. On the other hand, if you are only
interested in the main coding plane, a string of char16 will be enough. And
if you only use European languages, a string of char8 will be fine. In
UTF-8, on the other hand, each Unicode character can be coded by 1, 2, 3 or
4 bytes.
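To illustrate the conversion I mean, here is a minimal sketch. The helper
name utf8_to_utf32 and its error handling are my own assumptions; it is not
taken from N3572 or from any existing library. It turns the variable-length
byte sequences into fixed-size char32_t elements:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): decode UTF-8 bytes into UTF-32.
// Throws std::invalid_argument on malformed, overlong or out-of-range input.
std::u32string utf8_to_utf32(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of this length.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range sequence");
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After the conversion, element i of the result is always the i-th code
point, which is not true of the original byte string.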
In .NET, Microsoft uses 2-byte characters because in most applications it's
enough to use only the main Unicode plane, which covers most characters of
most languages.
You cannot easily use UTF-8 strings to manipulate, for example, Chinese
characters: each character is represented by several bytes in UTF-8.
Mikhail.
On 29 April 2013 02:13, Nicol Bolas <jmckesson@gmail.com> wrote:
>
>
> On Sunday, April 28, 2013 12:15:08 PM UTC-7, Mikhail Semenov wrote:
>>
>> I agree with Lawrence on that: UTF32 is more efficient for representing
>> general Unicode characters.
>> I think the issue here is that it is difficult to resolve the following
>> two issues:
>> (1) to select a preferable encoding (for reading from a file, system
>> representation and exchange);
>> (2) to select a common string format for internal representation (arrays
>> of characters that we can easily compare).
>>
>> The reason for the second point is that the Unicode itself propose 4
>> different types of representation
>> http://unicode.org/reports/tr15/#Examples:
>> NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
>> ligatures and other character types.
>> Point (2) is to create strings for easy access of elements and
>> comparison. The comparison is an issue: even French words have special way
>> of comparison based on accented characters.
>> Other languages have their own specific ways of comparing words and there
>> may be more than one way of doing so. I think this issue can be left.
>>
>> My suggesting would be to have two basic forms of representation:
>> (1) encoded strings;
>> (2) simple strings of characters (char8, char16 and char32).
>>
>
> If we have "encoded strings" (presumably allowing for arbitrary
> encodings), why would we need "simple strings"? Isn't `basic_string` a
> "simple string"?
>
>> An implementation should provide
>> (a) some forms of encoding for encoded strings;
>> (b) some conversions between encoded strings and simple strings (of
>> char8, char16 and char32);
>> (c) in addition to standard comparison of simple strings (like arrays of
>> elements), there should be conversion routines for
>> various languages.
>>
>
> What kind of conversions are you talking about? We already have Unicode
> normalization via algorithms. So it's not clear what kind of language-based
> conversions you're looking for.
>
--001a11c361fc08c20c04db7b901b--
.
Author: Giovanni Piero Deretta <gpderetta@gmail.com>
Date: Mon, 29 Apr 2013 02:41:13 -0700 (PDT)
Raw View
------=_Part_5_12695127.1367228473254
Content-Type: text/plain; charset=ISO-8859-1
On Monday, April 29, 2013 9:20:50 AM UTC+1, Mikhail Semenov wrote:
>
> Nicol,
>
> I thought I made it clear. For example, elements of UTF-8 are bytes: each
> element does not represent a character (unless it is an ASCII string) you
> can convert it to
> string of char32 so that each character really represent one Unicode
> character.
>
None of the Unicode encodings maps a code unit to a character. At most
(with UTF-32) you can map a code unit to a code point. But a code point,
because of combining characters, is still not necessarily what would be
considered a character (whose definition is often application-specific).
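A small sketch of the difference, under a loud simplification: only the
Combining Diacritical Marks block (U+0300..U+036F) is treated as combining
here, whereas real segmentation follows the rules of UAX #29. The function
names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as a combining mark.
// A real implementation would use the Unicode grapheme cluster rules.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts code points that start a new (approximate) user-perceived character.
std::size_t approx_character_count(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_mark(cp)) ++n;
    }
    return n;
}
```

For the sequence U+0065 U+0301 ("e" followed by a combining acute accent),
the UTF-32 string has two elements, but this count is one.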
> On the other hand, if you are only interested in the main coding plane:
> string of char16 will be enough. And if you only use European languages:
> string of char8 will be fine. In UTF-8 on the other hand, each Unicode
> character can be coded by 1, 2 ,3 ... bytes.
>
So what? As long as you are restricting yourself to a subset of Unicode, a
string of bytes is enough to represent ASCII. And with UTF-16 even European
characters can be represented with multiple code units, when using
combining accents for example.
>
>
> In .NET, Microsoft uses 2-byte characters because in most applications
> it's enough to use only the main Unicode plane, which covers most
> characters of most languages.
>
.NET uses full UTF-16 and certainly doesn't assume only the basic plane.
Some functions may assume it, but they are marked as such.
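As a sketch of why the basic plane cannot be assumed: a code point above
U+FFFF takes two UTF-16 code units (a surrogate pair). The helper name
encode_utf16 is hypothetical; it is neither a .NET nor a standard library
function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-16.
std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

U+4E2D yields one code unit, while U+1F600 yields the pair 0xD83D 0xDE00,
so indexing a UTF-16 string by code unit is not indexing by code point.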
>
> Yo cannot use UTF-8 strings, for example, to easily mainipulate, for
> example, Chinese charcaters: each character is represented by several bytes
> in UTF-8.
>
For most string manipulations you would use high-level algorithms anyway,
so character-level access is often not really necessary. And when you need
it, for example for parsing protocols (which are invariably specified as
using a byte-level encoding, usually UTF-8), you can still do many
character-level operations in UTF-8 because many interesting Unicode code
points (' ', '\r', '\n') map to a single code unit.
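For instance, splitting on '\n' can be done directly on the bytes, because
in UTF-8 every byte below 0x80 is an ASCII character and never part of a
multi-byte sequence. A minimal sketch (split_utf8_lines is a hypothetical
name, and the input is assumed to be valid UTF-8):

```cpp
#include <string>
#include <vector>

// Splits UTF-8 text on '\n' without decoding it. This is safe because the
// byte 0x0A can only ever encode U+000A in UTF-8.
std::vector<std::string> split_utf8_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type nl = text.find('\n', start);
        if (nl == std::string::npos) {
            lines.push_back(text.substr(start));  // last (or only) line
            break;
        }
        lines.push_back(text.substr(start, nl - start));
        start = nl + 1;
    }
    return lines;
}
```

Each returned line is still valid UTF-8, since no multi-byte sequence can
be cut in half by this search.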
-- gpd
------=_Part_5_12695127.1367228473254--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Mon, 29 Apr 2013 11:58:07 +0200
Raw View
--047d7bdc9d38f2134a04db7ceb70
Content-Type: text/plain; charset=ISO-8859-1
On Sun, Apr 28, 2013 at 9:09 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
>
>
> The reason for the second point is that the Unicode itself propose 4
> different types of representation
> http://unicode.org/reports/tr15/#Examples:
> NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
> ligatures and other character types.
>
No, no, no, no, and no. WTF. Compatibility normalization is a destructive
process! Suggesting that encoded_string should use it is too limiting.
And for that matter, NFD is also destructive (U+0387 GREEK ANO TELEIA
mistakenly decomposes to U+00B7 MIDDLE DOT); and since the first step of
NFC is the same as NFD, that makes it destructive as well. I would prefer
not having any automatic normalization performed.
Different normal forms lend themselves to different use cases. I say let
the user choose.
> Point (2) is to create strings for easy access of elements and comparison.
> The comparison is an issue: even French words have special way of
> comparison based on accented characters.
> Other languages have their own specific ways of comparing words and there
> may be more than one way of doing so. I think this issue can be left.
>
That would be done with locales, I would expect.
> There is also GB18030 Standard (for Chinese characters), which is different
> from Unicode.
>
In what way is it different? Unicode defines a character set, and as far as
I know, GB18030 is yet another encoding form for that character set.
Martinho
--047d7bdc9d38f2134a04db7ceb70--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Mon, 29 Apr 2013 12:58:33 +0100
Raw View
--047d7b2e0887a5d10a04db7e9a26
Content-Type: text/plain; charset=ISO-8859-1
Sorry, Gentlemen.
I think no one is listening to what I am saying.
Speaking of Unicode: yes, the user can choose.
This is the point. You've got to convert from an encoded sequence to an
array.
The conversion is the user's choice.
Encoded sequence -> array (which is a string of char8, char16 or char32).
When you convert to an array you may choose NFD, NFC, NFKD or NFKC, or
just ASCII, or whatever you like.
This conversion can be provided by an implementation or by the user.
When we obtain this string (array) we can get access to single characters
(code points), whatever they are.
It's the user's choice what elements to use: char8, char16 or char32.
But the point is that each element has a fixed size!
Now, after processing, you can convert back or to another encoding:
string -> encoded sequence.
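As a rough sketch of that last step, assuming the array is a string of
char32 and the target encoding is UTF-8 (utf32_to_utf8 is a hypothetical
name, not something from the proposal):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode fixed-size UTF-32 elements back into UTF-8.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {               // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {             // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                               // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The processing in between works on fixed-size elements; only this final
step produces variable-length sequences again.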
Mikhail.
On 29 April 2013 10:58, Martinho Fernandes <martinho.fernandes@gmail.com> wrote:
> On Sun, Apr 28, 2013 at 9:09 PM, Mikhail Semenov <
> mikhailsemenov1957@gmail.com> wrote:
>
>>
>>
>> The reason for the second point is that the Unicode itself propose 4
>> different types of representation
>> http://unicode.org/reports/tr15/#Examples:
>> NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose
>> ligatures and other character types.
>>
>
> No, no, no, no, and no. WTF. Compatibility normalization is a destructive
> process! Suggesting that as what encoded_string uses is too limiting.
>
> And for that matter, NFD is also destructive (U+387 GREEK ANO TELEIA
> mistakenly decomposes to U+00B7 MIDDLE DOT); and since the first step of
> NFC is the same as NFD, that makes it destructive as well. I would prefer
> not having any automatic normalization performed.
>
> Different normal forms lend themselves to different uses cases. I say let
> the user choose.
>
>
>
>> Point (2) is to create strings for easy access of elements and
>> comparison. The comparison is an issue: even French words have special way
>> of comparison based on accented characters.
>> Other languages have their own specific ways of comparing words and there
>> may be more than one way of doing so. I think this issue can be left.
>>
>
> That would be done with locales, I would expect.
>
>> There is also GB18030 Standard (for Chinese characters), which is
>> different from Unicode.
>>
>
> In what way is it different? Unicode defines a character set, and as far
> as I know, GB18030 is yet another encoding form for that character set.
>
> Martinho
>
--047d7b2e0887a5d10a04db7e9a26--
.
Author: Ville Voutilainen <ville.voutilainen@gmail.com>
Date: Mon, 29 Apr 2013 15:03:25 +0300
Raw View
--047d7b2e41a40b7bb904db7eac5e
Content-Type: text/plain; charset=ISO-8859-1
On 29 April 2013 14:58, Mikhail Semenov <mikhailsemenov1957@gmail.com> wrote:
> Sorry, Gentlemen.
> I think no-one is listening to what I am saying.
Did you read what signore Deretta wrote? If not, go read it again, it
should be illuminating.
--047d7b2e41a40b7bb904db7eac5e--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Mon, 29 Apr 2013 13:28:21 +0100
Raw View
--089e01229c303364d904db7f0512
Content-Type: text/plain; charset=ISO-8859-1
GB18030 has more code points (1,587,600) than Unicode, but not all of
them are used.
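For reference, the 1,587,600 figure matches the number of four-byte sequences GB18030 can form, assuming the usual layout in which bytes 1 and 3 lie in 0x81..0xFE and bytes 2 and 4 lie in 0x30..0x39. The arithmetic, next to Unicode's code space of 0x110000 code points, can be written as a small compile-time check (the constant names are made up for illustration):

```cpp
// Assumed GB18030 four-byte layout: bytes 1 and 3 in 0x81..0xFE,
// bytes 2 and 4 in 0x30..0x39.
constexpr long lead_values = 0xFE - 0x81 + 1;   // 126 possible values
constexpr long digit_values = 0x39 - 0x30 + 1;  // 10 possible values

constexpr long gb18030_four_byte_sequences =
    lead_values * digit_values * lead_values * digit_values;

// Unicode code space: U+0000 through U+10FFFF.
constexpr long unicode_code_points = 0x10FFFF + 1;

static_assert(gb18030_four_byte_sequences == 1587600, "126 * 10 * 126 * 10");
static_assert(unicode_code_points == 1114112, "17 planes of 65,536 code points");
static_assert(gb18030_four_byte_sequences > unicode_code_points,
              "the four-byte range alone is larger than the Unicode code space");
```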
--089e01229c303364d904db7f0512--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Mon, 29 Apr 2013 14:53:22 +0100
Raw View
--001a11c361fc451bf704db803536
Content-Type: text/plain; charset=ISO-8859-1
Ville,
I read it again. But I disagree with high-level manipulation of characters
that does not use arrays. I would hate to manipulate, for instance, Chinese
strings directly as UTF-8 encoded strings; the same applies to Russian. I need
one element per code point.
UTF-8 is very good for files, but not for string manipulation (unless, of
course, you use only ASCII, i.e. values < 128).
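The difference in element counts can be checked at compile time. The sketch below uses universal character names so that the encoding of the source file does not matter; the helper name length is hypothetical. Note that char16 elements are not one-per-code-point either once a character lies outside the main plane.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal,
// not counting the terminating null.
template <class Char, std::size_t N>
constexpr std::size_t length(const Char (&)[N]) { return N - 1; }

// Two Chinese characters, U+4E2D U+6587.
static_assert(length(u8"\u4E2D\u6587") == 6, "UTF-8: three code units each");
static_assert(length(u"\u4E2D\u6587") == 2, "UTF-16: one code unit each");
static_assert(length(U"\u4E2D\u6587") == 2, "UTF-32: one element per code point");

// Three Russian letters, U+043F U+0440 U+0438.
static_assert(length(u8"\u043F\u0440\u0438") == 6, "UTF-8: two code units each");
static_assert(length(U"\u043F\u0440\u0438") == 3, "UTF-32: one element per code point");

// Plain ASCII: UTF-8 is already one element per code point.
static_assert(length(u8"abc") == 3, "UTF-8: one code unit each");

// One code point outside the main plane, U+1F600.
static_assert(length(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(length(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(length(U"\U0001F600") == 1, "UTF-32: one element");
```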
Regards,
Mikhail.
--001a11c361fc451bf704db803536--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Mon, 29 Apr 2013 16:13:30 +0200
Raw View
--20cf305e1fdb4a343004db807df9
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Apr 29, 2013 at 3:53 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
> Ville,
>
> I read it again. But I disagree with high-level manipulation of
> characters that does not use arrays. I would hate to manipulate, for
> instance, Chinese strings directly as UTF-8 encoded strings; the same
> applies to Russian. I need one element per code point.
> UTF-8 is very good for files, but not for string manipulation (unless, of
> course, you use only ASCII, i.e. values < 128).
>
> Regards,
> Mikhail.
>
From what I gathered, while there are some disagreements about how such a
thing should be achieved, I believe most people in this discussion agree
with one point: no one should have to manipulate
UTF-8/UTF-16/UTF-32/ASCII/Windows-1252/GB18030/Big-5/whatever directly as
code units except in very rare and very special circumstances. Most of the
uses for getting such raw access to data involve interoperation, either
with legacy code or with external systems.
That said, I don't know what kind of manipulations you are concerned with
here; lack of a common ground with respect to that may be the source of
some misunderstandings.
--20cf305e1fdb4a343004db807df9--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 29 Apr 2013 08:21:08 -0700 (PDT)
Raw View
------=_Part_3351_28812420.1367248868369
Content-Type: text/plain; charset=ISO-8859-1
On Monday, April 29, 2013 1:20:50 AM UTC-7, Mikhail Semenov wrote:
>
> Nicol,
>
> I thought I made it clear. For example, the elements of UTF-8 are bytes:
> each element does not represent a character (unless it is an ASCII string).
> You can convert it to a string of char32 so that each element really
> represents one Unicode character. On the other hand, if you are only
> interested in the main coding plane, a string of char16 will be enough. And
> if you only use European languages, a string of char8 will be fine. In
> UTF-8, on the other hand, each Unicode character can be coded by
> 1, 2, 3, ... bytes.
>
> In .NET, Microsoft uses 2-byte characters because in most applications
> it's enough to use only the main Unicode plane, which covers most
> characters of most languages.
>
> You cannot use UTF-8 strings to easily manipulate, for example, Chinese
> characters: each character is represented by several bytes in UTF-8.
>
Yes, that's why we want a string class that *makes it* easy to manipulate
arbitrary codepoint sequences in an arbitrary, specified encoding. The
whole point is to have a string class, with an explicit encoding parameter,
which allows you to manipulate it as a codepoint sequence, while still
having basic access to the encoded data as an array of code units.
We already have the basic tools to be able to do that: specialized
iterators for various encodings, which output codepoints, where ++ and --
will move along the encoded array properly. All we need is to aggregate
these into a storage object, template that object on an encoding type
(which is what provides the iterators), and add some basic operations.
At which point, I can use a UTF-8 string just as easily as a UTF-32 one
in any Unicode operation, from searching for a codepoint sequence, to
normalizing it, to anything.
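A rough sketch of that idea, under stated assumptions: the iterator below walks well-formed UTF-8 one codepoint at a time (no validation), and the class template is parameterized on an encoding type that supplies the storage and the iterator. All names here (utf8_codepoint_iterator, utf8, utf32, and the members of encoded_string) are hypothetical and are not taken from N3572.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical bidirectional iterator over well-formed UTF-8. It relies on
// the byte at the end position being readable, as with std::string.
class utf8_codepoint_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_codepoint_iterator(const char* p) : p_(p) {}

    // Decode the codepoint that starts at the current lead byte.
    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*p_);
        if (lead < 0x80) return lead;
        const int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);
        for (int i = 1; i <= extra; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }
    // Step forward past the lead byte and any continuation bytes (10xxxxxx).
    utf8_codepoint_iterator& operator++() {
        do { ++p_; } while (is_continuation(*p_));
        return *this;
    }
    // Step back to the previous lead byte.
    utf8_codepoint_iterator& operator--() {
        do { --p_; } while (is_continuation(*p_));
        return *this;
    }
    bool operator==(const utf8_codepoint_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf8_codepoint_iterator& o) const { return p_ != o.p_; }

private:
    static bool is_continuation(char c) {
        return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
    }
    const char* p_;
};

// Encoding types: each names its code unit storage and codepoint iterator.
struct utf8 {
    using storage = std::string;
    using iterator = utf8_codepoint_iterator;
};
struct utf32 {
    using storage = std::u32string;
    using iterator = const char32_t*;  // already one element per codepoint
};

// Storage object templated on the encoding; iterates as codepoints while
// still exposing the encoded data as code units.
template <class Encoding>
class encoded_string {
public:
    explicit encoded_string(typename Encoding::storage units)
        : units_(std::move(units)) {}

    typename Encoding::iterator begin() const {
        return static_cast<typename Encoding::iterator>(units_.data());
    }
    typename Encoding::iterator end() const {
        return static_cast<typename Encoding::iterator>(units_.data() + units_.size());
    }
    const typename Encoding::storage& code_units() const { return units_; }

private:
    typename Encoding::storage units_;
};
```

With this, an algorithm written against codepoint iterators (std::equal, std::distance, a search) runs unchanged on encoded_string<utf8> and encoded_string<utf32>.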
------=_Part_3351_28812420.1367248868369--
.
Author: Tony V E <tvaneerd@gmail.com>
Date: Wed, 1 May 2013 19:59:59 -0400
Raw View
--089e013d10065ed11904dbb0ea3e
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Apr 29, 2013 at 11:21 AM, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Monday, April 29, 2013 1:20:50 AM UTC-7, Mikhail Semenov wrote:
>>
>> Nicol,
>>
>> I thought I made it clear. For example, the elements of UTF-8 are bytes:
>> each element does not represent a character (unless it is an ASCII string).
>> You can convert it to a string of char32 so that each element really
>> represents one Unicode character. On the other hand, if you are only
>> interested in the main coding plane, a string of char16 will be enough. And
>> if you only use European languages, a string of char8 will be fine. In
>> UTF-8, on the other hand, each Unicode character can be coded by
>> 1, 2, 3, ... bytes.
>>
>> In .NET, Microsoft uses 2-byte characters because in most applications
>> it's enough to use only the main Unicode plane, which covers most
>> characters of most languages.
>>
>> You cannot use UTF-8 strings to easily manipulate, for example, Chinese
>> characters: each character is represented by several bytes in UTF-8.
>>
>
> Yes, that's why we want a string class that *makes it* easy to manipulate
> arbitrary codepoint sequences in an arbitrary, specified encoding. The
> whole point is to have a string class, with an explicit encoding parameter,
> which allows you to manipulate it as a codepoint sequence, while still
> having basic access to the encoded data as an array of code units.
>
> We already have the basic tools to be able to do that: specialized
> iterators for various encodings, which output codepoints, where ++ and --
> will move along the encoded array properly. All we need is to aggregate
> these into a storage object, template that object on an encoding type
> (which is what provides the iterators), and add some basic operations.
>
> At which point, I can use a UTF-8 string just as easily as a UTF-32 one
> in any Unicode operation, from searching for a codepoint sequence, to
> normalizing it, to anything.
>
> --
>
>
>
Do we want the option of an encoding that changes at runtime (ie per
string, or even as the string changes)?

I can see

string<encoding_dontcare> str = str_from_somewhere;
f1(f2(f3(f4(str))));

I don't want "encoding_dontcare" to mean that on Windows it is UTF16, and
Linux UTF8, I want "dontcare" to mean whatever is given to it. In that
way, if each function along the way doesn't care, there is a chance that no
re-encoding ever happens. Whatever encoding str_from_somewhere was, that
is the encoding (internally) returned from f1.

ie I think we want both encoding_platform and encoding_flexible (as a still
not good, but better name).

Or do I need to write all my functions as templates?
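
One way to picture the "whatever is given to it" idea is a string that carries its encoding as a runtime tag next to the raw bytes. The following is only a rough sketch under that assumption: `encoding_id`, `flexible_string` and `pass_through` are hypothetical names made up for illustration, and none of them come from N3572.

```cpp
#include <string>
#include <utility>

// Hypothetical runtime tag describing how the stored bytes are to be read.
enum class encoding_id { utf8, utf16, utf32 };

// Hypothetical "flexible" string: raw bytes plus a tag chosen at runtime.
class flexible_string {
public:
    flexible_string(encoding_id enc, std::string bytes)
        : enc_(enc), bytes_(std::move(bytes)) {}

    encoding_id encoding() const { return enc_; }
    const std::string& bytes() const { return bytes_; }

private:
    encoding_id enc_;
    std::string bytes_;
};

// A function that "doesn't care" about the encoding: it hands the string on
// untouched, so the tag and the bytes survive without any re-encoding.
inline flexible_string pass_through(flexible_string s) { return s; }
```

With this shape, a chain such as `pass_through(pass_through(str))` returns the same encoding it was given, and none of the functions has to be a template. The price is that any code that actually reads code points has to branch on the tag at runtime.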
--089e013d10065ed11904dbb0ea3e--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 1 May 2013 18:30:20 -0700 (PDT)
------=_Part_6212_5579483.1367458220422
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 1, 2013 4:59:59 PM UTC-7, Tony V E wrote:
> Do we want the option of an encoding that changes at runtime (ie per
> string, or even as the string changes)?
>
> I can see
>
> string<encoding_dontcare> str = str_from_somewhere;
> f1(f2(f3(f4(str))));
>
>
> I don't want "encoding_dontcare" to mean that on Windows it is UTF16, and
> Linux UTF8, I want "dontcare" to mean whatever is given to it. In that
> way, if each function along the way doesn't care, there is a chance that no
> re-encoding ever happens. Whatever encoding str_from_somewhere was, that
> is the encoding (internally) returned from f1.
>
> ie I think we want both encoding_platform and encoding_flexible (as a
> still not good, but better name).
>
> Or do I need to write all my functions as templates?
>

There's no (good) way to implement "encoding_flexible" as a template
parameter to a string type that expects a specific, fixed encoding. Not
without creating a whole new specialization of that type that has a
different interface, which is really little different from just creating a
new class type.

Personally, I don't like `encoding_platform` at all, as this assumes that
the platform-specific encoding is: 1) a good idea to use and 2) somehow
specific to that platform.

The problem with a string type that can handle *any* encoding (even
user-defined ones) is that such a string would necessarily have to use
type-erasure to access the data in that string. Iterators for such a type
will be slower to use because every access pays the indirection overhead.

In short, you're trading potentially less transcoding for always slower
*use* of the string.

I'm not saying that we shouldn't have an `any_encoded_string`. I'm saying
that we need to also have a `fixed_encoded_string` (with a forward-facing
encoding that can be user-provided), and the two types cannot be the same
type.
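
To make the type-erasure cost concrete, here is a rough sketch of what an any-encoded string might look like. Only the name `any_encoded_string` comes from the message above; the `decoder` interface, the two decoders and `code_point_count` are assumptions made up for illustration, and the UTF-8 decoder assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type-erased decoder: every code point costs a virtual call.
struct decoder {
    virtual ~decoder() = default;
    // Decode the code point that starts at pos and advance pos past it.
    virtual char32_t next(const std::string& bytes, std::size_t& pos) const = 0;
};

// Latin-1: one byte per code point.
struct latin1_decoder : decoder {
    char32_t next(const std::string& bytes, std::size_t& pos) const override {
        return static_cast<unsigned char>(bytes[pos++]);
    }
};

// UTF-8: one to four bytes per code point (well-formed input assumed).
struct utf8_decoder : decoder {
    char32_t next(const std::string& bytes, std::size_t& pos) const override {
        unsigned char lead = static_cast<unsigned char>(bytes[pos++]);
        if (lead < 0x80) return lead;
        int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[pos++]) & 0x3F);
        return cp;
    }
};

// The string only knows its encoding through the erased decoder.
class any_encoded_string {
public:
    any_encoded_string(std::shared_ptr<const decoder> dec, std::string bytes)
        : dec_(std::move(dec)), bytes_(std::move(bytes)) {}

    // Walks the string one code point at a time through the virtual interface.
    std::size_t code_point_count() const {
        std::size_t count = 0;
        std::size_t pos = 0;
        while (pos < bytes_.size()) {
            dec_->next(bytes_, pos);
            ++count;
        }
        return count;
    }

private:
    std::shared_ptr<const decoder> dec_;
    std::string bytes_;
};
```

The same bytes can be wrapped with either decoder without transcoding, but each step of the iteration goes through `decoder::next`, which the compiler generally cannot inline. A string templated on its encoding would resolve that call statically.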
------=_Part_6212_5579483.1367458220422--
.
Author: Lawrence Crowl <crowl@googlers.com>
Date: Tue, 7 May 2013 11:50:21 -0700
On 5/1/13, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Wednesday, May 1, 2013 4:59:59 PM UTC-7, Tony V E wrote:
> > I don't want "encoding_dontcare" to mean that on Windows it
> > is UTF16, and Linux UTF8, I want "dontcare" to mean whatever
> > is given to it. In that way, if each function along the way
> > doesn't care, there is a chance that no re-encoding ever happens.
> > Whatever encoding str_from_somewhere was, that is the encoding
> > (internally) returned from f1. ie I think we want both
> > encoding_platform and encoding_flexible (as a still not good,
> > but better name).
If you mean use the given encoding, you should infer the type
from the object given to it. In the case of variables, use auto.
In the case of functions, use a template parameter.
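
As a rough illustration of that advice, the sketch below deduces the encoding from the argument, so the string that comes out has the same type as the one that went in. The tag types `utf8` and `utf16`, the `encoded_string` template and `f1` are hypothetical stand-ins, not the interface proposed in N3572.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical encoding tags; each names the code unit type it stores.
struct utf8  { using code_unit = char; };
struct utf16 { using code_unit = char16_t; };

// Hypothetical string whose encoding is fixed at compile time.
template <class Encoding>
class encoded_string {
public:
    using storage = std::basic_string<typename Encoding::code_unit>;

    explicit encoded_string(storage units) : units_(std::move(units)) {}
    const storage& code_units() const { return units_; }

private:
    storage units_;
};

// The encoding is a deduced template parameter: whatever encoding goes in
// comes out again, and no transcoding is needed along the way.
template <class Encoding>
encoded_string<Encoding> f1(encoded_string<Encoding> s) { return s; }
```

A caller can then write `auto out = f1(f1(in));` and `out` has exactly the type of `in`, whichever encoding that happens to be.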
> > Or do I need to write all my functions as templates?
If you want flexibility without run-time overhead, yes.
> There's no (good) way to implement "encoding_flexible" as a
> template parameter to a string type that expects a specific,
> fixed encoding. Not without creating a whole new specialization of
> that type that has a different interface, which is really little
> different from just creating a new class type.
>
> Personally, I don't like `encoding_platform` at all, as this
> assumes that the platform-specific encoding is: 1) a good idea
> to use and 2) somehow specific to that platform.
>
> The problem with a string type that can handle *any* encoding (even
> user-defined ones) is that such a string would necessarily have
> to use type-erasure to access the data in that string. Iterators
> for such a type will be slower to use because of the overhead.
>
> In short, you're trading potentially less transcoding for always
> slower *use* of the string.
An intermediate approach is to use function templates as above,
and then permit implicit transcoding conversions where necessary.
> I'm not saying that we shouldn't have an `any_encoded_string`.
> I'm saying that we need to also have a `fixed_encoded_string`
> (with a forward-facing encoding that can be user-provided),
> and the two types cannot be the same type.
They can be different specializations of the same template though.
--
Lawrence Crowl
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Tue, 7 May 2013 12:50:57 -0700 (PDT)
------=_Part_707_23844618.1367956257259
Content-Type: text/plain; charset=ISO-8859-1
I think there should be a base class for encoding:

template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual std::basic_string<EncodingElement>
    encode(std::basic_string<CharType> str) = 0;
    virtual std::basic_string<CharType>
    decode(std::basic_string<EncodingElement> str) = 0;
};

Then particular encoding classes can be implemented:

class encoding_utf8_char32: public encoding<char, char32_t>
{
    ....
};

class encoding_utf8_char16: public encoding<char, char16_t>
{
    ....
};

class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
    ....
};

class encoding_GB18030_char32: public encoding<char, char32_t>
{
    ....
};

Inside the program, the encoded strings should be decoded when necessary.
Such an approach makes it possible to use various encodings in one program.
The system encoding will be just one of them.
------=_Part_707_23844618.1367956257259--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Tue, 7 May 2013 20:54:29 +0100
--089e01229c30720e9d04dc262f41
Content-Type: text/plain; charset=ISO-8859-1
I think there should be a base class for encoding:

template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual std::basic_string<EncodingElement>
    encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType>
    decode(const std::basic_string<EncodingElement>& str) = 0;
};

Then particular encoding classes can be implemented:

class encoding_utf8_char32: public encoding<char, char32_t>
{
    ....
};

class encoding_utf8_char16: public encoding<char, char16_t>
{
    ....
};

class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
    ....
};

class encoding_GB18030_char32: public encoding<char, char32_t>
{
    ....
};

Inside the program, the encoded strings should be decoded when necessary.
Such an approach makes it possible to use various encodings in one program.
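
For illustration only, one of the elided class bodies could be filled in roughly as follows. This is a sketch under stated assumptions, not the code the message had in mind: the virtual destructor is an addition, the conversion does no validation (overlong forms, surrogates and truncated input are not rejected), and the base template is repeated only so that the example stands on its own.

```cpp
#include <cstddef>
#include <string>

template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual ~encoding() = default;  // added here; not in the interface above
    virtual std::basic_string<EncodingElement>
    encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType>
    decode(const std::basic_string<EncodingElement>& str) = 0;
};

// UTF-8 bytes on the encoded side, one char32_t per code point when decoded.
class encoding_utf8_char32 : public encoding<char, char32_t>
{
public:
    std::string encode(const std::u32string& str) override
    {
        std::string out;
        for (char32_t cp : str) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    std::u32string decode(const std::string& str) override
    {
        std::u32string out;
        std::size_t i = 0;
        while (i < str.size()) {
            unsigned char lead = static_cast<unsigned char>(str[i++]);
            // Number of continuation bytes implied by the lead byte.
            int extra = lead < 0x80 ? 0 : lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
            char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
            while (extra-- > 0 && i < str.size())
                cp = (cp << 6) | (static_cast<unsigned char>(str[i++]) & 0x3F);
            out += cp;
        }
        return out;
    }
};
```

Note that this design converts whole strings at a time: every `decode` call allocates a new `std::u32string`, which is the copying cost the iterator-based approach discussed earlier in the thread tries to avoid.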
--089e01229c30720e9d04dc262f41--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 7 May 2013 20:06:31 -0700 (PDT)
------=_Part_1245_15834932.1367982391442
Content-Type: text/plain; charset=ISO-8859-1
On Tuesday, May 7, 2013 11:50:21 AM UTC-7, Lawrence Crowl wrote:
>
> On 5/1/13, Nicol Bolas <jmck...@gmail.com> wrote:
> > I'm not saying that we shouldn't have an `any_encoded_string`.
> > I'm saying that we need to also have a `fixed_encoded_string`
> > (with a forward-facing encoding that can be user-provided),
> > and the two types cannot be the same type.
>
> They can be different specializations of the same template though.
>
Why? They need different interfaces; in particular, the fixed-encoded
string needs a function to return an array of code-units, so that you can
use them with C APIs that take that encoding. That's not really possible
with the any-encoded string, because the type it returns could be anything,
rather than a single, fixed type. The any-encoded string probably should
also have APIs that will internally convert the string to a specific
encoding (still using the any-encoded API), without doing a copy to a new
string object.
Again, I point to `vector<bool>`; specializations that have different APIs
should not be specializations.
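
A minimal sketch of why the two interfaces diverge, assuming C++11. Every name here (`utf8_encoding`, `utf16_encoding`, `fixed_encoded_string`, `any_encoded_string`, `c_str`, `code_point_at`) is hypothetical and not taken from N3572, and the any-encoded type is simplified to UTF-32 storage purely for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; each one only fixes the code-unit type.
struct utf8_encoding  { using code_unit = char; };
struct utf16_encoding { using code_unit = char16_t; };

// Hypothetical fixed-encoding string: the encoding is part of the type,
// so it can expose a null-terminated array of code units for C APIs.
template <typename Encoding>
class fixed_encoded_string {
public:
    using code_unit = typename Encoding::code_unit;

    explicit fixed_encoded_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    const code_unit* c_str() const noexcept { return units_.c_str(); }
    std::size_t size_in_code_units() const noexcept { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};

// Hypothetical any-encoded string: the storage encoding is a run-time detail
// (simplified here to UTF-32), so the interface only talks about code points
// and has no c_str()-like member with a single fixed return type.
class any_encoded_string {
public:
    explicit any_encoded_string(std::u32string code_points)
        : code_points_(std::move(code_points)) {}

    std::size_t size_in_code_points() const noexcept { return code_points_.size(); }
    char32_t code_point_at(std::size_t i) const { return code_points_.at(i); }

private:
    std::u32string code_points_;
};
```

The fixed type can answer "give me a `const char16_t*`" at compile time; the any-encoded type has no single answer to that question, which is the interface mismatch described above.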
------=_Part_1245_15834932.1367982391442--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 08:48:37 +0100
Raw View
--089e013a076864d3dd04dc302910
Content-Type: text/plain; charset=ISO-8859-1
Do we really need a one-size-fits-all encoding, or shall we deal with the
typical cases used to cover most languages? Besides, there is a case for the
end-of-line as well:
you can easily identify it by one encoded element (depending on the size
of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
That makes it easier to split the initial text into lines.
I don't think it is worth covering various "packed" encodings and those
used for encryption.
On 8 May 2013 04:06, Nicol Bolas <jmckesson@gmail.com> wrote:
> On Tuesday, May 7, 2013 11:50:21 AM UTC-7, Lawrence Crowl wrote:
>
>> On 5/1/13, Nicol Bolas <jmck...@gmail.com> wrote:
>> > I'm not saying that we shouldn't have an `any_encoded_string`.
>> > I'm saying that we need to also have a `fixed_encoded_string`
>> > (with a forward-facing encoding that can be user-provided),
>> > and the two types cannot be the same type.
>>
>> They can be different specializations of the same template though.
>>
>
> Why? They need different interfaces; in particular, the fixed-encoded
> string needs a function to return an array of code-units, so that you can
> use them with C APIs that take that encoding. That's not really possible
> with the any-encoded string, because the type it returns could be anything,
> rather than a single, fixed type. The any-encoded string probably should
> also have APIs that will internally convert the string to a specific
> encoding (still using the any-encoded API), without doing a copy to a new
> string object.
>
> Again, I point to `vector<bool>`; specializations that have different APIs
> should not be specializations.
>
--089e013a076864d3dd04dc302910--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 08:49:48 +0100
Raw View
--001a11c2029a9ce20604dc302dea
Content-Type: text/plain; charset=ISO-8859-1
Sorry, I meant 0xA for the end-of-line.
On 8 May 2013 08:48, Mikhail Semenov <mikhailsemenov1957@gmail.com> wrote:
> Do we really need a one-size-fits-all encoding, or shall we deal with the
> typical cases used to cover most languages? Besides, there is a case for the
> end-of-line as well:
> you can easily identify it by one encoded element (depending on the size
> of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
> That makes it easier to split the initial text into lines.
>
> I don't think it is worth covering various "packed" encodings and those
> used for encryption.
> On 8 May 2013 04:06, Nicol Bolas <jmckesson@gmail.com> wrote:
>
>> On Tuesday, May 7, 2013 11:50:21 AM UTC-7, Lawrence Crowl wrote:
>>
>>> On 5/1/13, Nicol Bolas <jmck...@gmail.com> wrote:
>>> > I'm not saying that we shouldn't have an `any_encoded_string`.
>>> > I'm saying that we need to also have a `fixed_encoded_string`
>>> > (with a forward-facing encoding that can be user-provided),
>>> > and the two types cannot be the same type.
>>>
>>> They can be different specializations of the same template though.
>>>
>>
>> Why? They need different interfaces; in particular, the fixed-encoded
>> string needs a function to return an array of code-units, so that you can
>> use them with C APIs that take that encoding. That's not really possible
>> with the any-encoded string, because the type it returns could be anything,
>> rather than a single, fixed type. The any-encoded string probably should
>> also have APIs that will internally convert the string to a specific
>> encoding (still using the any-encoded API), without doing a copy to a new
>> string object.
>>
>> Again, I point to `vector<bool>`; specializations that have different
>> APIs should not be specializations.
>>
--001a11c2029a9ce20604dc302dea--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 8 May 2013 03:57:06 -0700 (PDT)
Raw View
------=_Part_2724_14074735.1368010626978
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 12:48:37 AM UTC-7, Mikhail Semenov wrote:
>
> Do we really need a one-size-fits-all encoding, or shall we deal with the
> typical cases used to cover most languages?
>
Considering that an encoding is just a specialized set of iterators, a few
basic functions, and a couple of typedefs, I see no reason why we should
explicitly limit this string type to only certain encodings. If the user
wants to use UTF-7 as an encoding, we shouldn't prevent them from being
able to do so with the fixed encoding string type. This would allow them to
more easily utilize the transcoding and other machinery that such a string
will have.
Except for the most specialized needs or legacy code, nobody should have a
reason to use some other string type for a sequence of Unicode codepoints.
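
To make "a few basic functions and a couple of typedefs" concrete, here is a rough sketch assuming C++11. The names `utf8`, `utf16`, `code_unit`, `encode` and `encode_all` are made up for illustration and are not the interface proposed in N3572. The sketch does no validation: surrogates and values above U+10FFFF are not rejected.

```cpp
#include <iterator>
#include <string>

// Hypothetical encoding description: one typedef plus one basic function.
struct utf8 {
    using code_unit = char;

    // Writes the UTF-8 code units for one code point to `out`.
    template <typename Out>
    static Out encode(char32_t cp, Out out) {
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }
};

struct utf16 {
    using code_unit = char16_t;

    // Writes one code unit, or a surrogate pair for code points above U+FFFF.
    template <typename Out>
    static Out encode(char32_t cp, Out out) {
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 | (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
        return out;
    }
};

// Generic machinery written once against the encoding description.
template <typename Encoding>
std::basic_string<typename Encoding::code_unit>
encode_all(const std::u32string& code_points) {
    std::basic_string<typename Encoding::code_unit> result;
    for (char32_t cp : code_points) {
        Encoding::encode(cp, std::back_inserter(result));
    }
    return result;
}
```

A user-provided encoding (UTF-7, say) would only need to supply the same two members to reuse `encode_all` and anything else written against this shape.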
------=_Part_2724_14074735.1368010626978--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Wed, 8 May 2013 04:48:25 -0700 (PDT)
Raw View
------=_Part_1080_14467734.1368013705156
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
> Do we really need a one-size-fits-all encoding, or shall we deal with the
> typical cases used to cover most languages? Besides, there is a case for the
> end-of-line as well:
> you can easily identify it by one encoded element (depending on the size
> of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
> That makes it easier to split the initial text into lines.
>
No, no, it doesn't. I seriously have to question how much you know what we
are even talking about here. The Unicode Standard provides a line-break
algorithm and they provide it for a reason, and that reason is "Split on
"\n"" doesn't work.
------=_Part_1080_14467734.1368013705156--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 13:08:43 +0100
Raw View
--001a11c2697e923a0804dc33cb3c
Content-Type: text/plain; charset=ISO-8859-1
There are several issues here:
(1) Yes, we should allow UTF-7, UTF-8, UTF-16 (both endiannesses),
UTF-32 and GB18030 (GBK is a subset of this);
but they have one thing in common: the end-of-line can be easily
identified without decoding.
(2) It is much easier to consider different types for an "encoded element"
and a "string char"; for example, in UTF-8 the "encoded element" is a char
(byte), but the decoded string will be a string of char, char16_t or
char32_t, depending on the requirement. It is not convenient to deal with a
UTF-8 string as a string of char if you want to use other languages (Greek,
Chinese, etc.). It's easier to convert from a string of char to a string of
char16_t and deal with a string type. Potentially, and I stress potentially,
it is possible to create a whole class that deals with the encoding, but
then it will be a new string class;
if you don't provide proper conversion to an array it will be inefficient:
imagine you've got a several-page document and you'd like to replace
2-byte elements with 3-byte ones (say, you use UTF-8); it will be very,
very inefficient.
(3) It is possible to create encode() and decode() functions that allow
move, which would let us embrace both worlds (for UTF-7 you'll just pass
the string through without any changes).
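
Point (2) can be sketched roughly as follows, assuming C++11. `decode_utf8` is a hypothetical helper, not an existing standard function. It checks lead and continuation bytes but, to stay short, does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 code units into one char32_t element per code point.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07;
            extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (bytes.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = static_cast<char32_t>((cp << 6) | (c & 0x3F));
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After decoding, replacing a character that took two bytes in UTF-8 (such as Greek alpha) with one that takes three (such as a CJK ideograph) is a single element assignment, with no shifting of the rest of the buffer.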
---------- Forwarded message ----------
From: Nicol Bolas <jmckesson@gmail.com>
Date: 8 May 2013 11:57
Subject: Re: [std-proposals] Re: Committee feedback on N3572
To: std-proposals@isocpp.org
On Wednesday, May 8, 2013 12:48:37 AM UTC-7, Mikhail Semenov wrote:
>
> Do we really need a one-size-fits-all encoding, or shall we deal with the
> typical cases used to cover most languages?
>
Considering that an encoding is just a specialized set of iterators, a few
basic functions, and a couple of typedefs, I see no reason why we should
explicitly limit this string type to only certain encodings. If the user
wants to use UTF-7 as an encoding, we shouldn't prevent them from being
able to do so with the fixed encoding string type. This would allow them to
more easily utilize the transcoding and other machinery that such a string
will have.
Except for the most specialized needs or legacy code, nobody should have a
reason to use some other string type for a sequence of Unicode codepoints.
--001a11c2697e923a0804dc33cb3c--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 13:21:49 +0100
Raw View
--001a11c356f470cdf004dc33fa18
Content-Type: text/plain; charset=ISO-8859-1
I am not speaking about how to do line-breaking of text without
end-of-lines, but about the fact that for most of the available encodings
(not all of them) the end-of-line can be easily identified, but you need to
know which encoding is used in the text in question.
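
A small sketch of that narrower claim, assuming C++11 and text held as code units of the right width in native byte order. `count_line_feeds` is a hypothetical helper. It looks only for U+000A and relies on the fact that in UTF-8, UTF-16 and UTF-32 a code unit equal to 0x0A never occurs as part of another character.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts U+000A by comparing whole code units; no decoding is performed.
template <typename CharT>
std::size_t count_line_feeds(const std::basic_string<CharT>& code_units) {
    return static_cast<std::size_t>(
        std::count(code_units.begin(), code_units.end(),
                   static_cast<CharT>(0x0A)));
}
```

The need to know the encoding shows up as soon as the unit width is wrong: the UTF-16LE bytes 0A 01 0A 00 encode U+010A followed by one line feed, yet a byte-wise scan of the same data reports two 0x0A values.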
On 8 May 2013 12:48, DeadMG <wolfeinstein@gmail.com> wrote:
> On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
>
>> Do we really need a one-size-fits-all encoding, or shall we deal with the
>> typical cases used to cover most languages? Besides, there is a case for the
>> end-of-line as well:
>> you can easily identify it by one encoded element (depending on the size
>> of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
>> That makes it easier to split the initial text into lines.
>>
>
> No, no, it doesn't. I seriously have to question how much you know what we
> are even talking about here. The Unicode Standard provides a line-break
> algorithm and they provide it for a reason, and that reason is "Split on
> "\n"" doesn't work.
>
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/?hl=3Den">http://groups.google.com/a/isocpp.org/group/std-pro=
posals/?hl=3Den</a>.<br />
<br />
<br />
--001a11c356f470cdf004dc33fa18--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 8 May 2013 05:28:06 -0700 (PDT)
Raw View
------=_Part_309_31360493.1368016086807
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 5:21:49 AM UTC-7, Mikhail Semenov wrote:
>
> On 8 May 2013 12:48, DeadMG <wolfei...@gmail.com> wrote:
>
>> On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
>>
>>> Do we really need a one-size-fits-all encoding, or shall we deal with the
>>> typical cases used to cover most languages? Besides, there is a case for
>>> the end-of-line as well:
>>> you can easily identify it by one encoded element (depending on the
>>> size of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
>>> That makes it easier to split the initial text into lines.
>>>
>>
>> No, no, it doesn't. I seriously have to question how much you know what
>> we are even talking about here. The Unicode Standard provides a line-break
>> algorithm and they provide it for a reason, and that reason is "Split on
>> "\n"" doesn't work.
>>
>>
>>
> I am not speaking about how to do line-breaking of text without
> end-of-lines, but about the fact that for most available encodings (not all
> of them), the end-of-line can be easily identified, but you need to know what
> encoding is used in the text in question.
>
>
Um, yes. To understand *any* string of text, you need to know what encoding
it is in. That includes the EOL character, but it also includes *every other
character*. So why are you singling out EOL as something special?
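To make the dependence on encoding concrete, here is a minimal sketch that is not from the thread. The `Encoding` enum and `find_line_feeds` are hypothetical names chosen for illustration. It scans whole code units for the line feed, which is U+000A (decimal 10), and shows that the answer changes with the encoding the caller assumes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags, for illustration only.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE };

// Returns the byte offsets at which a line feed (U+000A) starts,
// reading the input one code unit at a time in the stated encoding.
std::vector<std::size_t> find_line_feeds(const std::vector<std::uint8_t>& bytes,
                                         Encoding enc) {
    std::vector<std::size_t> hits;
    const std::size_t unit =
        enc == Encoding::Utf8 ? 1 : (enc == Encoding::Utf32LE ? 4 : 2);
    for (std::size_t i = 0; i + unit <= bytes.size(); i += unit) {
        std::uint32_t value = 0;
        for (std::size_t k = 0; k < unit; ++k) {
            // Big-endian: the first byte is the most significant one.
            // Little-endian: byte k carries weight 8 * k.
            const std::size_t shift =
                enc == Encoding::Utf16BE ? 8 * (unit - 1 - k) : 8 * k;
            value |= static_cast<std::uint32_t>(bytes[i + k]) << shift;
        }
        if (value == 0x0A) hits.push_back(i);
    }
    return hits;
}
```

For the bytes `0A 01 0A 00`, which are U+010A followed by a line feed in UTF-16LE, the scan finds one line feed at offset 2 when told the text is UTF-16LE, but two (offsets 0 and 2) when it is wrongly told the text is UTF-8.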
------=_Part_309_31360493.1368016086807--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 8 May 2013 05:31:41 -0700 (PDT)
Raw View
------=_Part_1865_14512787.1368016301404
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 5:08:43 AM UTC-7, Mikhail Semenov wrote:
>
> There are several issues here:
> (1) Yes, we should allow UTF-7, UTF-8, UTF-16 (in both endiannesses),
> UTF-32, GB18030 (GBK is a subset of this);
> but they have one thing in common: the end-of-line can be easily
> identified without decoding.
>
What does the "end-of-line" character have to do with *anything*? Who cares
about how easy or not easy it is to identify the EOL character?
> (2) It is much easier to consider different types for an "encoded element"
> and "string char"; for example, in UTF-8, the "encoded element" is a char
> (byte), but the decoded string will be a string of char, char16_t or
> char32_t depending on the requirement.
>
>
Um, no it won't.
A Unicode encoding specifies a mapping between a sequence of code units
(where each code unit has some particular bit-depth, as specified by the
encoding) and a sequence of codepoints, where each codepoint is a Unicode
codepoint of 21 bits in size (which can be stored in larger types for
convenience).
A UTF-8-encoded sequence of code units can be decoded to a sequence of
codepoints, which can then be re-encoded into a sequence of code units of
some other Unicode encoding. But a Unicode encoding can only be *decoded*
into a sequence of codepoints. Not UTF-16, UTF-7, or any other encoding.
Just codepoints.
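The two-step path described above (decode to codepoints, then re-encode) can be sketched as follows. This is an illustration only: `decode_utf8` and `encode_utf16` are hypothetical helper names, not part of N3572 or the standard library, and the decoder is deliberately simplified.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode UTF-8 code units into codepoints. Simplified sketch: it rejects
// truncated input and bad continuation bytes, but does not check for
// overlong forms or surrogate values as a conforming decoder must.
std::vector<char32_t> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)               { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06)  { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E)  { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E)  { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size()) throw std::invalid_argument("truncated UTF-8");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c >> 6) != 0x02) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode codepoints as UTF-16 code units, using a surrogate pair for
// anything above U+FFFF.
std::u16string encode_utf16(const std::vector<char32_t>& cps) {
    std::u16string out;
    for (char32_t cp : cps) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

The only thing the two functions share is the `std::vector<char32_t>` of codepoints in the middle; neither needs to know about the other encoding.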
> It is not convenient to deal with a UTF-8 string
> as a string of char
>
Nobody's suggesting that a UTF-8 string be treated "as a string of char",
*regardless* of what language it is. Well, outside of using C APIs that only
take `char*`s. We're suggesting that any Unicode encoding be treated as a
series of codepoints, with operations like insertion, removal, and so forth
on codepoint-based boundaries and iterators.
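As a rough illustration of what "removal on codepoint-based boundaries" means for UTF-8 storage, consider the sketch below. The function names are hypothetical and the code assumes well-formed UTF-8; it is not the interface proposed in the paper.

```cpp
#include <cstddef>
#include <string>

// True if the byte starts a codepoint, i.e. it is not a 10xxxxxx
// continuation byte.
bool is_code_point_start(unsigned char byte) { return (byte & 0xC0) != 0x80; }

// Byte offset of the codepoint with the given index, or s.size() if the
// index is past the end. Assumes `s` holds well-formed UTF-8.
std::size_t offset_of_code_point(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_code_point_start(static_cast<unsigned char>(s[i]))) {
            if (seen == index) return i;
            ++seen;
        }
    }
    return s.size();
}

// Erase one codepoint, addressed by codepoint index rather than byte index.
void erase_code_point(std::string& s, std::size_t index) {
    const std::size_t first = offset_of_code_point(s, index);
    const std::size_t last = offset_of_code_point(s, index + 1);
    s.erase(first, last - first);
}
```

Erasing index 1 from the UTF-8 text for "a", U+00E9, "b" removes both bytes of U+00E9 and leaves "ab", whereas a byte-indexed `erase(1, 1)` would leave a stray continuation byte behind.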
------=_Part_1865_14512787.1368016301404--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 13:49:13 +0100
Raw View
--e89a8f6473f563970204dc345c44
Content-Type: text/plain; charset=ISO-8859-1
OK. The point I was making is that if you have UTF-32, UTF-16, GB18030 or
UTF-8 (UTF-7 if you like), you can easily find end-of-lines in the text and
then decode it line by line. That's all. You don't have to do it, if you've
got a solid block of text without EOLs. If you've got a Visual C++, Intel
or a GCC compiler, you will easily find EOLs without decoding the text.
Of course, you may invent an encoding where you have to decode all the
previous characters before you hit the EOL: it's easy to do so.
To be honest, I don't like the reference to C: it's looking backwards.
--e89a8f6473f563970204dc345c44--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 8 May 2013 15:22:05 +0200
Raw View
--047d7bdc1220f4a6d304dc34d127
Content-Type: text/plain; charset=ISO-8859-1
On Wed, May 8, 2013 at 2:49 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
> you can easily find end-of-lines in the text and then decode it line by
> line.
>
Unless you plan on decoding only randomly accessed lines, I see little
benefit in not having to decode the text so that you can decode it
immediately afterwards.
> To be honest, I don't like reference to C: it's looking backwards.
>
You may pretend that reality does not exist all you want. It won't make it
disappear.
--047d7bdc1220f4a6d304dc34d127--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 14:49:12 +0100
Raw View
--089e01177507e9c98c04dc3532aa
Content-Type: text/plain; charset=ISO-8859-1
I apologise for my brusque comment.
The reality is that if somebody is using UTF-16 or UTF-32, it's just easier
to use them as they are with char16_t and char32_t and probably without any
decoding.
In this case, why should I be talking about a string of char? That's all. I
think a lot of people are speaking about UTF-8, which obviously is a string
of char (or an array of bytes, if you wish).
On 8 May 2013 14:22, Martinho Fernandes <martinho.fernandes@gmail.com> wrote:
> On Wed, May 8, 2013 at 2:49 PM, Mikhail Semenov <
> mikhailsemenov1957@gmail.com> wrote:
>
>> you can easily find end-of-lines in the text and then decode it line by
>> line.
>>
>
> Unless you plan on decoding only randomly accessed lines, I see little
> benefit in not having to decode the text so that you can decode it
> immediately afterwards.
>
>> To be honest, I don't like reference to C: it's looking backwards.
>>
>
> You may pretend that reality does not exist all you want. It won't make it
> disappear.
--089e01177507e9c98c04dc3532aa--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 8 May 2013 16:38:38 +0200
Raw View
--000e0cd5f5bab45cc304dc35e3f8
Content-Type: text/plain; charset=ISO-8859-1
On Wed, May 8, 2013 at 3:49 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
> The reality is that if somebody is using UTF-16 or UTF-32, it's just
> easier to use them as they are with char16_t and char32_t and probably
> without any decoding.
And I believe the burden of proof is on you that it is easier to manipulate
such strings for some reason. I hope your reasoning does not involve
pretending UTF-16 is UCS-2.
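The UCS-2 trap mentioned here is the assumption that one 16-bit code unit is always one code point. A minimal sketch of why that fails (the function name `count_code_points` is hypothetical, and the input is assumed to be well-formed UTF-16):

```cpp
#include <cstddef>
#include <string>

// Number of code points in well-formed UTF-16: every code unit counts
// except low (trailing) surrogates, which only complete a pair that a
// high surrogate started.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        const bool is_low_surrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_low_surrogate) ++n;
    }
    return n;
}
```

For `u"a\U0001F600"`, `size()` reports 3 code units while the function reports 2 code points, so indexing a `char16_t` sequence "as it is" can land in the middle of a surrogate pair.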
> In this case, why should I be talking about a string of char? That's all.
> I think a lot of people are speaking about UTF-8, which obviously is a string
> of char (or an array of bytes, if you wish).
>
As I said before, from what I gather, a lot of people here are speaking
about strings that abstract the encoding away. With their ideal interface
the user does not see the encoding getting in their way: all such strings,
regardless of encoding, provide the same interface that treats code points,
not 8-bit bytes, not 16-bit words, not 32-bit words, as the basic unit of
text.
In normal usage of such strings *there is no encoding*. This is in the same
vein as using primitive types like int: when you use int there is no
endianness, there is no two's complement; there are only numbers. The
language gives you operations that are completely agnostic of the
underlying representation. It doesn't make sense to ask whether + for ints
is little endian or big endian: it operates on numbers, not ordered
sequences of bytes.
Sometimes, particularly when crossing interoperation boundaries, it is
important to go beyond the numbers, and have some control over the
representation, like when you send numbers across the network with
things like htonl() and ntohl().
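The integer analogy can be sketched in portable C++ as follows. The helper names are hypothetical stand-ins for what htonl() and ntohl() do, written with shifts so that no platform header is needed. Ordinary arithmetic never mentions byte order; only the boundary functions do.

```cpp
#include <array>
#include <cstdint>

// Serialize to network byte order (big-endian) using only arithmetic on
// the value, so the result is the same on any host.
std::array<std::uint8_t, 4> to_network_bytes(std::uint32_t value) {
    return {static_cast<std::uint8_t>(value >> 24),
            static_cast<std::uint8_t>(value >> 16),
            static_cast<std::uint8_t>(value >> 8),
            static_cast<std::uint8_t>(value)};
}

// Rebuild the value from its big-endian byte representation.
std::uint32_t from_network_bytes(const std::array<std::uint8_t, 4>& bytes) {
    return (std::uint32_t{bytes[0]} << 24) | (std::uint32_t{bytes[1]} << 16) |
           (std::uint32_t{bytes[2]} << 8) | std::uint32_t{bytes[3]};
}
```

Adding `0x01020300u + 0x04u` yields the same number on every machine; the byte layout `01 02 03 04` only becomes visible once `to_network_bytes` is called at the boundary.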
Nicol (please, correct me if I am wrong) wants to have the same ability for
handling text: normal operations on such ideal strings are such that
talking about encoding when related to them does not even make sense; and
yet you don't discard the possibility of picking specific representations
for crossing boundaries.
You appear to keep insisting on processing strings based on their raw code
unit form. If that is all you want I don't even know why you are wasting
your time here. You don't need anything new from C++ if you want to deal
with code unit sequences directly. std::string, std::u16string, and
std::u32string (or maybe std::wstring as well if you are into that) are
pretty much that: sequence containers of code units.
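A small sketch of that last point, using the same two code points (U+00E9 and U+1F600) in each container. The variable names are illustrative, and the UTF-8 bytes are spelled out by hand so the example does not depend on how a given compiler treats `u8` literals.

```cpp
#include <string>

// The same two code points stored in three encodings. In every case
// size() reports code units, never code points.
const std::string    utf8  = "\xC3\xA9\xF0\x9F\x98\x80";  // 6 UTF-8 code units
const std::u16string utf16 = u"\u00E9\U0001F600";         // 3 UTF-16 code units
const std::u32string utf32 = U"\u00E9\U0001F600";         // 2 UTF-32 code units
```

Only for the UTF-32 container does the code unit count happen to equal the code point count.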
--000e0cd5f5bab45cc304dc35e3f8
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr"><br><div class=3D"gmail_extra"><div class=3D"gmail_quote">=
On Wed, May 8, 2013 at 3:49 PM, Mikhail Semenov <span dir=3D"ltr"><<a hr=
ef=3D"mailto:mikhailsemenov1957@gmail.com" target=3D"_blank">mikhailsemenov=
1957@gmail.com</a>></span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-=
left:1px solid rgb(204,204,204);padding-left:1ex">The reality is that if so=
mebody is using UTF-16 or UTF-32, it's just easier to use them as they =
are with char16_t and char32_t and probably without any decoding.
</blockquote><div><br></div><div>And I believe the burden of proof is on yo=
u that it is easier to manipulate such strings for some reason. I hope your=
reasoning does not involve pretending UTF-16 is UCS-2.<br>=A0</div><blockq=
uote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1p=
x solid rgb(204,204,204);padding-left:1ex">
<div>In this case, why should I be talking about a string of char? That'=
;s all. I think a lot of people a speaking about UTF-8, which obviously is =
a string of char (or an array<br></div></blockquote><blockquote class=3D"gm=
ail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,=
204,204);padding-left:1ex">
<div>of bytes, if you wish).</div></blockquote><div class=3D"h5"><br>As I s=
aid before, from what I gather, a lot of people here are speaking about str=
ings that abstract the encoding away. With their ideal interface the user d=
oes not see the encoding getting in their way: all such strings, regardless=
of encoding, provide the same interface that treats code points, not 8-bit=
bytes, not 16-bit words, not 32-bit words, as the basic unit of text.<br>
<br>In normal usage of such strings <b>there is no encoding</b>. This is in=
the same vein of using primitive types like int: when you use int there is=
no endianness, there is no two's complement; there are only numbers. T=
he language gives you operations that are completely agnostic of the underl=
ying representation. It doesn't make sense to ask whether + for ints is=
little endian or big endian: it operates on numbers, not ordered sequences=
of bytes.<br>
<br>Sometimes, particularly when crossing interoperation boundaries, it is =
important to go beyond the numbers, and have some control over the represen=
tation, like when you sending numbers across the network with things like h=
tonl() and ntohl().<br>
<br>Nicol (please, correct me if I am wrong) wants to have the same ability=
for handling text: normal operations on such ideal strings are such that t=
alking about encoding when related to them does not even make sense; and ye=
t you don't discard the possibility of picking specific representations=
for crossing boundaries.<br>
<br>You appear to keep insisting on processing strings based on their raw c=
ode unit form. If that is all you want I don't even know why you are wa=
sting your time here. You don't need anything new from C++ if you want =
to deal with code unit sequences directly. std::string, std::u16string, and=
std::u32string (or maybe std::wstring as well if you are into that) are pr=
etty much that: sequence containers of code units.<br>
</div></div><br></div></div>
<p></p>
-- <br />
<br />
--- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals" group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.<br />
To post to this group, send email to std-proposals@isocpp.org.<br />
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/?hl=3Den">http://groups.google.com/a/isocpp.org/group/std-pro=
posals/?hl=3Den</a>.<br />
<br />
<br />
--000e0cd5f5bab45cc304dc35e3f8--
.
Author: Zhihao Yuan <lichray@gmail.com>
Date: Wed, 8 May 2013 11:06:36 -0400
Raw View
On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
<martinho.fernandes@gmail.com> wrote:
> In normal usage of such strings there is no encoding. This is in the same
> vein of using primitive types like int: when you use int there is no
> endianness, there is no two's complement; there are only numbers. The
> language gives you operations that are completely agnostic of the underlying
> representation. It doesn't make sense to ask whether + for ints is little
> endian or big endian: it operates on numbers, not ordered sequences of
> bytes.
Yes, that might be the final answer. We need a class type, namely 'unicode'
or whatever. Its representation is totally implementation-defined. A library
can choose UTF-8, UTF-16, UTF-32, GB18030, UTF-EBCDIC, homemade, anything.
But when you do `s[n]`, you get an object of type 'codepoint'.
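A rough sketch of what such a class might look like follows. Everything in it is an assumption made for illustration: the names unicode and codepoint come from the message, but the constructor, the UTF-16 storage and the linear-time indexing are arbitrary choices, and the input is assumed to contain only valid scalar values (no surrogates).

```cpp
#include <cstddef>
#include <string>

using codepoint = char32_t;  // hypothetical alias for the 'codepoint' type

class unicode {
public:
    // Built from code points; callers never learn how they are stored.
    explicit unicode(const std::u32string& cps) {
        for (codepoint cp : cps) {
            if (cp < 0x10000) {
                units_ += static_cast<char16_t>(cp);
            } else {
                const char32_t v = cp - 0x10000;  // split into a surrogate pair
                units_ += static_cast<char16_t>(0xD800 + (v >> 10));
                units_ += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
            }
            ++count_;
        }
    }

    std::size_t size() const { return count_; }  // length in code points

    // Returns the n-th code point; O(n) with this storage. Requires n < size().
    codepoint operator[](std::size_t n) const {
        std::size_t i = 0;
        for (; n > 0; --n) i += is_high(units_[i]) ? 2 : 1;
        const char16_t u = units_[i];
        if (!is_high(u)) return u;
        return 0x10000 + ((codepoint{u} - 0xD800) << 10) +
               (codepoint{units_[i + 1]} - 0xDC00);
    }

private:
    static bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    std::u16string units_;   // hidden representation; UTF-16 is just one option
    std::size_t count_ = 0;  // number of code points stored
};
```

Swapping the private storage for UTF-8 or UTF-32 would change the cost of s[n] but not the interface, which is the property being argued for.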
--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://4bsd.biz/
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 16:09:01 +0100
Raw View
--089e0149529e5c5d4604dc3650b0
Content-Type: text/plain; charset=ISO-8859-1
I think you are mistaking me for someone else. I just want a good
interface for converting between various encodings and to be able to deal
with the various files that use them.
I actually do not like "behind the scenes" code: it can interfere with
the transfer of data. Let the user decide when to encode and decode strings.
--089e0149529e5c5d4604dc3650b0--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 8 May 2013 17:42:09 +0200
Raw View
--001a11c1e05adffe3f04dc36c68d
Content-Type: text/plain; charset=ISO-8859-1
On Wed, May 8, 2013 at 5:06 PM, Zhihao Yuan <lichray@gmail.com> wrote:
> On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
> <martinho.fernandes@gmail.com> wrote:
> > In normal usage of such strings there is no encoding. This is in the same
> > vein of using primitive types like int: when you use int there is no
> > endianness, there is no two's complement; there are only numbers. The
> > language gives you operations that are completely agnostic of the
> > underlying representation. It doesn't make sense to ask whether + for
> > ints is little endian or big endian: it operates on numbers, not ordered
> > sequences of bytes.
>
> Yes, that might be the final answer. We need a class type, namely
> 'unicode' or whatever. Its representation is totally
> implementation-defined.
Yes, and the original point of contention here was about that "totally
implementation-defined" bit, which the committee seemed to prefer.
I don't agree with it. It might have been a good choice in a greenfield
environment, but I think the existing ecosystem is too fractured to make
that the best option. I agree with Nicol that we should allow the user to
decide what underlying representation will be the cheapest for their
purposes. If I need to interop with environments that expect ENCODINGX all
the time, I would appreciate having the option of not paying any price for
transcoding on those boundaries. (This goes back to the "don't pay for what
you don't use" mantra.)
And FWIW, I don't understand why you cannot have both if you really want
to. Consider the following.
template <typename Encoding>
class generic_unicode_string;
using implementation_defined_unicode_string =
    generic_unicode_string<implementation_defined_encoding>;
What drawbacks would this approach have?
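For readers who want to see the shape of that design, here is a minimal sketch under stated assumptions. The policy structs utf32_encoding and utf16_encoding, their count() member, and the length() and code_units() accessors are all hypothetical; choosing UTF-16 as implementation_defined_encoding is likewise only a stand-in for whatever an implementation would pick.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding policies: a code unit type plus a code point counter.
struct utf32_encoding {
    using code_unit = char32_t;
    static std::size_t count(const std::basic_string<code_unit>& s) { return s.size(); }
};

struct utf16_encoding {
    using code_unit = char16_t;
    // Assumes well-formed UTF-16: count every unit that is not a low surrogate.
    static std::size_t count(const std::basic_string<code_unit>& s) {
        std::size_t n = 0;
        for (code_unit u : s)
            if (u < 0xDC00 || u > 0xDFFF) ++n;
        return n;
    }
};

template <typename Encoding>
class generic_unicode_string {
public:
    using storage = std::basic_string<typename Encoding::code_unit>;
    explicit generic_unicode_string(storage units) : units_(std::move(units)) {}
    // Length in code points, the same answer whatever the encoding.
    std::size_t length() const { return Encoding::count(units_); }
    // Raw code units, handed out without transcoding for interop boundaries.
    const storage& code_units() const { return units_; }
private:
    storage units_;
};

// Stand-in for the implementation's choice (an assumption of this sketch).
using implementation_defined_encoding = utf16_encoding;
using implementation_defined_unicode_string =
    generic_unicode_string<implementation_defined_encoding>;
```

A user who always talks to a UTF-32 API can name generic_unicode_string<utf32_encoding> directly, while everyone else uses the alias and never mentions an encoding.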
--001a11c1e05adffe3f04dc36c68d--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 17:14:09 +0100
Raw View
--001a11c356f454601e04dc373932
Content-Type: text/plain; charset=ISO-8859-1
With the encoding interface I mentioned earlier, some of the classes can be
defined as implementation-defined encodings:
#include <string>
// EncodingElement is the code unit of the encoded form;
// CharType is the character type of the decoded form.
template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual ~encoding() = default;
    virtual std::basic_string<EncodingElement> encode(const
        std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const
        std::basic_string<EncodingElement>& str) = 0;
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
    // ...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
    // ...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
    // ...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
    // ...
};
etc...
The only point is that there can be several of them. For example, UTF-8
can be converted to char, char16_t or char32_t depending on what the user
prefers.
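As an illustration of how one of these classes might be filled in, here is a sketch of encoding_utf8_char32. The class bodies were elided in the message, so everything inside is an assumption: the template header on encoding is inferred from how it is used, and the codec assumes well-formed input (no validation of surrogates, overlong forms or truncated sequences).

```cpp
#include <cstddef>
#include <string>

// Base interface, with the template parameters inferred from its usage.
template <class EncodingElement, class CharType>
class encoding {
public:
    virtual ~encoding() = default;
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;
};

class encoding_utf8_char32 : public encoding<char, char32_t> {
public:
    // Code points in, UTF-8 code units out.
    std::string encode(const std::u32string& str) override {
        std::string out;
        for (char32_t cp : str) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    // UTF-8 code units in, code points out.
    std::u32string decode(const std::string& str) override {
        std::u32string out;
        std::size_t i = 0;
        while (i < str.size()) {
            const unsigned char b0 = static_cast<unsigned char>(str[i]);
            // Sequence length from the lead byte.
            const std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
            char32_t cp = (len == 1) ? b0 : (b0 & (0xFFu >> (len + 1)));
            for (std::size_t k = 1; k < len && i + k < str.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(str[i + k]) & 0x3Fu);
            out += cp;
            i += len;
        }
        return out;
    }
};
```

With this shape the user decides explicitly when transcoding happens: nothing is converted until encode() or decode() is called.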
--001a11c356f454601e04dc373932--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 17:13:35 +0100
Raw View
--089e0118413e4ec65f04dc37379a
Content-Type: text/plain; charset=ISO-8859-1
With the encoding interface I mentioned earlier, some of the classes can be
defined as implementation-defined encodings:
#include <string>
// EncodingElement is the code unit of the encoded form;
// CharType is the character type of the decoded form.
template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual ~encoding() = default;
    virtual std::basic_string<EncodingElement> encode(const
        std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const
        std::basic_string<EncodingElement>& str) = 0;
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
    // ...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
    // ...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
    // ...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
    // ...
};
etc...
The only point is that there can be several of them. For example, UTF-8
can be converted to char, char16_t or char32_t depending on what the user
prefers.
On 8 May 2013 16:42, Martinho Fernandes <martinho.fernandes@gmail.com>wrote:
> On Wed, May 8, 2013 at 5:06 PM, Zhihao Yuan <lichray@gmail.com> wrote:
>
>> On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
>> <martinho.fernandes@gmail.com> wrote:
>> > In normal usage of such strings there is no encoding. This is in the
>> same
>> > vein of using primitive types like int: when you use int there is no
>> > endianness, there is no two's complement; there are only numbers. The
>> > language gives you operations that are completely agnostic of the
>> underlying
>> > representation. It doesn't make sense to ask whether + for ints is
>> little
>> > endian or big endian: it operates on numbers, not ordered sequences of
>> > bytes.
>>
>> Yes, that might be the final answer. We need a class type, namely
>> 'unicode'
>> or whatever. Its representation is totally implementation-defined.
>
>
> Yes, and the original point of contention here was about that "totally
> implementation-defined" bit, which the committee seemed to prefer.
>
> I don't agree with it. It might have been a good choice in a green
> environment, but I think the existing ecosystem is too fractured to make
> that the best option. I agree with Nicol that we should allow the user to
> decided what underlying representation will be the cheapest for their
> purposes. If I need to interop with environments that expect ENCODINGX all
> the time, I would appreciate having the option of not paying any price for
> transcoding on those boundaries. (This goes back to the "don't pay for what
> you don't use mantra".)
>
> And FWIW, I don't understand why you cannot have both if you really want
> to. Consider the following.
>
> template <typename Encoding>
> class generic_unicode_string;
>
> using implementation_defined_unicode_string =
> generic_unicode_string<implementation_defined_encoding>;
>
> What drawbacks would this approach have?
>
--089e0118413e4ec65f04dc37379a--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 17:19:20 +0100
Raw View
--001a11c2697ed9fe9504dc374b8a
Content-Type: text/plain; charset=ISO-8859-1
When I mentioned the following classes, I meant that some of them could be
treated as implementation-defined conversions.
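// EncodingElement: code unit type of the encoded form; CharType: decoded character type.
template <typename EncodingElement, typename CharType>
class encoding
{
public:
    virtual ~encoding() = default;
    virtual std::basic_string<EncodingElement> encode(const
        std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const
        std::basic_string<EncodingElement>& str) = 0;
};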
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
....
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
....
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
....
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
....
};
For example, if UTF-8 is the implementation-defined encoding, then at least
the following three conversions should exist: to a string of char, to a string
of char16_t (maybe even ignoring surrogates), or to a string of char32_t,
depending on what the user wants.
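A minimal sketch of how one such conversion class might look, assuming the base class is a template over the two element types (which the derived class declarations imply). The bodies below are illustrative only: they assume well-formed input, do no validation, and are not part of any proposal.

```cpp
#include <cstddef>
#include <string>

// Hypothetical interface following the class sketched in the message:
// EncodingElement is the encoded code unit type, CharType the decoded one.
template <typename EncodingElement, typename CharType>
class encoding
{
public:
    virtual ~encoding() = default;
    virtual std::basic_string<EncodingElement>
        encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType>
        decode(const std::basic_string<EncodingElement>& str) = 0;
};

// UTF-8 bytes <-> UTF-32 code points. Input is assumed to be well formed.
class encoding_utf8_char32 : public encoding<char, char32_t>
{
public:
    std::string encode(const std::u32string& str) override
    {
        std::string out;
        for (char32_t c : str) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    std::u32string decode(const std::string& str) override
    {
        std::u32string out;
        std::size_t i = 0;
        while (i < str.size()) {
            const unsigned char lead = static_cast<unsigned char>(str[i]);
            // Number of continuation bytes implied by the lead byte.
            const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
            char32_t c = extra == 0 ? static_cast<char32_t>(lead)
                                    : static_cast<char32_t>(lead & (0x3F >> extra));
            for (std::size_t k = 1; k <= extra && i + k < str.size(); ++k)
                c = static_cast<char32_t>((c << 6) | (static_cast<unsigned char>(str[i + k]) & 0x3Fu));
            out += c;
            i += extra + 1;
        }
        return out;
    }
};
```

With this sketch, encoding a four-code-point string that mixes 1-, 2-, 3- and 4-byte characters yields ten bytes, and decoding those bytes gives back the original code points.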
On 8 May 2013 16:42, Martinho Fernandes <martinho.fernandes@gmail.com> wrote:
> On Wed, May 8, 2013 at 5:06 PM, Zhihao Yuan <lichray@gmail.com> wrote:
>
>> On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
>> <martinho.fernandes@gmail.com> wrote:
>> > In normal usage of such strings there is no encoding. This is in the
>> same
>> > vein of using primitive types like int: when you use int there is no
>> > endianness, there is no two's complement; there are only numbers. The
>> > language gives you operations that are completely agnostic of the
>> underlying
>> > representation. It doesn't make sense to ask whether + for ints is
>> little
>> > endian or big endian: it operates on numbers, not ordered sequences of
>> > bytes.
>>
>> Yes, that might be the final answer. We need a class type, namely
>> 'unicode'
>> or whatever. Its representation is totally implementation-defined.
>
>
> Yes, and the original point of contention here was about that "totally
> implementation-defined" bit, which the committee seemed to prefer.
>
> I don't agree with it. It might have been a good choice in a green
> environment, but I think the existing ecosystem is too fractured to make
> that the best option. I agree with Nicol that we should allow the user to
> decide what underlying representation will be the cheapest for their
> purposes. If I need to interop with environments that expect ENCODINGX all
> the time, I would appreciate having the option of not paying any price for
> transcoding on those boundaries. (This goes back to the "don't pay for what
> you don't use" mantra.)
>
> And FWIW, I don't understand why you cannot have both if you really want
> to. Consider the following.
>
> template <typename Encoding>
> class generic_unicode_string;
>
> using implementation_defined_unicode_string =
> generic_unicode_string<implementation_defined_encoding>;
>
> What drawbacks would this approach have?
>
--001a11c2697ed9fe9504dc374b8a--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 8 May 2013 09:27:13 -0700 (PDT)
Raw View
------=_Part_4088_19023098.1368030433486
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 9:14:09 AM UTC-7, Mikhail Semenov wrote:
>
> When I mentioned the following encoding, some of the classes can be
> defined as implementation defined encoding:
> class encoding
> {
> public:
> virtual std::basic_string<EncodingElement> encode(const
> std::basic_string<CharType>& str) = 0;
> virtual std::basic_string<CharType> decode(const
> std::basic_string<EncodingElement>& str) = 0;
> };
>
First, we won't be using inheritance. It can't do the things we need to do.
For example "EncodingElement" is a type that *changes* based on the
encoding. Which you can't do with virtual functions. You also can't
specialize iterators, which is important since most of the algorithms work
on codepoint iterators. Oh, and there's no reason to throw performance away
on virtual function overhead.
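To make the first point concrete, here is a rough sketch of the compile-time alternative: each encoding is a tag type that names its own code unit type, and the string is a template over that tag. All names here (`utf8`, `utf16`, `utf32`, `encoded_string`, `max_units`) are made up for illustration and are not taken from N3572.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical encoding tags; each one carries its own code unit type.
struct utf8  { using code_unit = char;     static constexpr std::size_t max_units = 4; };
struct utf16 { using code_unit = char16_t; static constexpr std::size_t max_units = 2; };
struct utf32 { using code_unit = char32_t; static constexpr std::size_t max_units = 1; };

// Hypothetical string type: the storage type is picked at compile time from
// the encoding, which a single virtual function signature cannot express.
template <typename Encoding>
class encoded_string
{
public:
    using code_unit = typename Encoding::code_unit;
    using storage = std::basic_string<code_unit>;

    explicit encoded_string(storage units) : units_(std::move(units)) {}

    const storage& code_units() const { return units_; }

    // Upper bound on the code units needed to store n code points.
    static constexpr std::size_t max_code_units(std::size_t n)
    {
        return n * Encoding::max_units;
    }

private:
    storage units_;
};

// The code unit type really does change with the encoding, with no virtual calls.
static_assert(std::is_same<encoded_string<utf16>::code_unit, char16_t>::value,
              "code unit type follows the encoding");
```

Because everything is resolved at compile time, iterators and algorithms can also be specialized per encoding, which is the other thing a virtual interface makes awkward.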
Second, it won't be using basic_string. The *entire point* of an encoded
string is that you treat it like a sequence of codepoints. *Nothing* in the
`basic_string` API can handle that. It must be a new type.
So pretty much everything about this suggestion is a bad idea.
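As an illustration of "a sequence of codepoints", the following is a small sketch of a read-only view that walks UTF-8 bytes one code point at a time. The class name `utf8_codepoint_view` is hypothetical, the input is assumed to be well-formed UTF-8, and no validation is attempted.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical view over a UTF-8 byte string; it does not own the string.
class utf8_codepoint_view
{
public:
    class iterator
    {
    public:
        iterator(const std::string* s, std::size_t pos) : s_(s), pos_(pos) {}

        // Decode the code point that starts at the current byte offset.
        char32_t operator*() const
        {
            const unsigned char lead = byte(pos_);
            const std::size_t n = length(lead);
            char32_t c = (n == 1) ? static_cast<char32_t>(lead)
                                  : static_cast<char32_t>(lead & (0xFF >> (n + 1)));
            for (std::size_t k = 1; k < n && pos_ + k < s_->size(); ++k)
                c = static_cast<char32_t>((c << 6) | (byte(pos_ + k) & 0x3Fu));
            return c;
        }

        // Step over one whole code point, never past the end of the string.
        iterator& operator++()
        {
            pos_ = std::min(pos_ + length(byte(pos_)), s_->size());
            return *this;
        }

        bool operator!=(const iterator& other) const { return pos_ != other.pos_; }

    private:
        unsigned char byte(std::size_t i) const
        {
            return static_cast<unsigned char>((*s_)[i]);
        }

        // Sequence length implied by the lead byte.
        static std::size_t length(unsigned char lead)
        {
            return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        }

        const std::string* s_;
        std::size_t pos_;
    };

    explicit utf8_codepoint_view(const std::string& s) : s_(&s) {}
    iterator begin() const { return iterator(s_, 0); }
    iterator end() const { return iterator(s_, s_->size()); }

private:
    const std::string* s_;
};
```

Iterating a six-byte string holding "a", U+03C0 and U+20AC with a range-based for loop over this view visits three code points, whereas `std::string::size()` reports six.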
------=_Part_4088_19023098.1368030433486--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 17:37:43 +0100
Raw View
--f46d04426c589acb0b04dc378dec
Content-Type: text/plain; charset=ISO-8859-1
(1) Are you saying that the Committee is happy with the idea of an Encoded
String class?
(2) My proposal was to use the encoding class only for conversion. You can
encode and decode the whole text in one go.
(3) I was thinking about having iterators as well in a different setting
(without encoding/decoding). They would also be able to crawl through a string
(say, a UTF-8 string), but they won't work if you'd like to replace, say, a
2-byte code with a 3-byte one inside a string: too inefficient!
On 8 May 2013 17:27, Nicol Bolas <jmckesson@gmail.com> wrote:
>
>
> On Wednesday, May 8, 2013 9:14:09 AM UTC-7, Mikhail Semenov wrote:
>>
>> When I mentioned the following encoding, some of the classes can be
>> defined as implementation defined encoding:
>> class encoding
>> {
>> public:
>> virtual std::basic_string<EncodingElement> encode(const
>> std::basic_string<CharType>& str) = 0;
>> virtual std::basic_string<CharType> decode(const
>> std::basic_string<EncodingElement>& str) = 0;
>> };
>>
>
> First, we won't be using inheritance. It can't do the things we need to
> do. For example "EncodingElement" is a type that *changes* based on the
> encoding. Which you can't do with virtual functions. You also can't
> specialize iterators, which is important since most of the algorithms work
> on codepoint iterators. Oh, and there's no reason to throw performance away
> on virtual function overhead.
>
> Second, it won't be using basic_string. The *entire point* of an encoded
> string is that you treat it like a sequence of codepoints. *Nothing* in
> the `basic_string` API can handle that. It must be a new type.
>
> So pretty much everything about this suggestion is a bad idea.
>
--f46d04426c589acb0b04dc378dec--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 8 May 2013 18:39:14 +0200
Raw View
--e89a8f923b3604bcca04dc3793d8
Content-Type: text/plain; charset=ISO-8859-1
On Wed, May 8, 2013 at 5:09 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
> I think you are mistaking me for someone else.
>
I apologize if that is the case. I got the idea from statements like "It is
not convenient to deal with a UTF-8 string as a string of char" and
similar. *No one* here is arguing for dealing with a UTF-8 string as a
string of char, so I cannot really understand why anyone would keep arguing
about it.
> I just want a good interface for converting between various encodings and
> to be able to deal with various files that use them.
>
But other people seem to want more than encoding conversions. Unicode is
not encodings and I believe encodings should be the least important thing
of all. Please note all the generic algorithms to handle text that were
included in the proposal. And FWIW, C++11 already has encoding conversions
for the UTF encodings in it.
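For reference, the C++11 conversions in question are presumably `std::wstring_convert` together with the `<codecvt>` facets. A minimal sketch follows; the helper names `utf8_to_utf32` and `utf32_to_utf8` are made up here, and note that these facilities were later deprecated in C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-32 via the C++11 library facilities.
// from_bytes throws std::range_error on malformed input.
inline std::u32string utf8_to_utf32(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// UTF-32 -> UTF-8, the reverse direction.
inline std::string utf32_to_utf8(const std::u32string& text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}
```

This covers only transcoding between UTF forms; it offers none of the text algorithms the proposal is about.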
--e89a8f923b3604bcca04dc3793d8--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Wed, 8 May 2013 18:44:28 +0200
Raw View
--047d7bb04e36ba737104dc37a5a3
Content-Type: text/plain; charset=UTF-8
On Wed, May 8, 2013 at 6:37 PM, Mikhail Semenov <
mikhailsemenov1957@gmail.com> wrote:
> (3) I was thinking about having iterators as well in a different setting
> (without encoding/decoding). They would also be able to crawl through a string
> (say, a UTF-8 string), but they won't work if you'd like to replace, say, a
> 2-byte code with a 3-byte one inside a string: too inefficient!
>
Is this a relevant use case? What about wanting to replace "Århus"
with "Москва"?
Or even in an ASCII string, where every character is a single byte, what
about wanting to replace "Paris" with "Moskva"?
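The point can be seen with plain `std::string`: replacing a substring with one of a different length already forces the string to grow or shrink, with no Unicode involved at all. A small sketch, where `replace_first` is a hypothetical helper name:

```cpp
#include <string>

// Replace the first occurrence of `from` in `text` with `to`.
// std::string::replace grows or shrinks the string as needed,
// whatever the two lengths are. Returns false if `from` is not found.
inline bool replace_first(std::string& text, const std::string& from, const std::string& to)
{
    const std::string::size_type pos = text.find(from);
    if (pos == std::string::npos)
        return false;
    text.replace(pos, from.size(), to);
    return true;
}
```

Replacing "Paris" (5 bytes) with "Moskva" (6 bytes) shifts the tail of the string by one byte, the same kind of work as swapping a 2-byte UTF-8 sequence for a 3-byte one.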
--047d7bb04e36ba737104dc37a5a3--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Wed, 8 May 2013 10:23:09 -0700 (PDT)
Raw View
------=_Part_391_2575964.1368033789656
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
It depends on what you want to do. I meant something like this:

for (char32_t& x : a_utf_string)
{
    if (x == U'a')
    {
        x = U'π';
    }
}
You could allow such manipulations even though they are inefficient. But is it
worth it? We would have to resize the array.

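A minimal sketch of what that replacement costs on a UTF-8 buffer, assuming a plain std::string that holds valid UTF-8. The helper name replace_a_with_pi is made up for illustration and is not part of N3572. Because 'a' is one byte and 'π' is two, the sketch builds a new string rather than writing through a reference:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from N3572): returns a copy of a valid UTF-8
// string in which every ASCII 'a' is replaced by U+03C0 (pi), which UTF-8
// encodes as the two bytes 0xCF 0x80.
std::string replace_a_with_pi(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());      // lower bound only; the result may be longer
    for (char c : utf8) {
        if (c == 'a') {
            out += "\xCF\x80";     // one byte in, two bytes out
        } else {
            out += c;              // every other byte is copied unchanged
        }
    }
    return out;
}
```

Scanning byte by byte is safe here only because the byte 0x61 never occurs inside a multi-byte UTF-8 sequence; replacing a non-ASCII codepoint would need a real decoder.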
On Wednesday, May 8, 2013 5:44:28 PM UTC+1, R. Martinho Fernandes wrote:
> On Wed, May 8, 2013 at 6:37 PM, Mikhail Semenov <mikhailse...@gmail.com>
> wrote:
>
>> (3) I was thinking about having iterators as well in a different setting
>> (without encoding/decoding). They are also meant to crawl through a string
>> (say, a UTF-8 string), but they won't work if you'd like to replace, say, a
>> 2-byte code with a 3-byte one inside a string: too inefficient!
>>
>
> Is this a relevant use case? What about wanting to replace "Århus" with
> "Москва"? Or even in an ASCII string, where every character is a single
> byte, what about wanting to replace "Paris" with "Moskva"?
>
------=_Part_391_2575964.1368033789656--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 8 May 2013 19:09:19 -0700 (PDT)
Raw View
------=_Part_2201_8277766.1368065359373
Content-Type: text/plain; charset=ISO-8859-1
On Wednesday, May 8, 2013 9:37:43 AM UTC-7, Mikhail Semenov wrote:
>
> (1) Are you saying that the Committee is happy with the idea of an Encoded
> String class?
>
> (2) My proposal was to use the encoding class only for conversion. You can
> encode and decode the whole text in one go.
>
... and? Look at the proposal; it already has transcoding support for "the
whole text in one go".
> (3) I was thinking about having iterators as well in a different setting
> (without encoding/decoding). They are also meant to crawl through a string
> (say, a UTF-8 string), but they won't work if you'd like to replace, say, a
> 2-byte code with a 3-byte one inside a string: too inefficient!
>
Codepoint iterators would only provide value access to codepoints. You
can't set a codepoint via a codepoint iterator. The only encoding where
setting a codepoint by iterator would ever work (without the container)
would be UTF-32.
Encoded string would have the ability to insert codepoints or codepoint
ranges into explicit locations in the string (locations denoted by
codepoint iterators).
Really, just look at the proposal sometime. It's got all this stuff in
there, fairly well specified.
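
To illustrate the "value access only" point, here is a rough sketch of the decoding step such an iterator would perform on UTF-8. It is not the N3572 interface: the function name next_codepoint is hypothetical, and it assumes the input is already valid UTF-8 (no error handling). The codepoint comes back by value, so there is nothing in the buffer to assign through:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the N3572 interface: decodes the codepoint that
// starts at byte offset `pos` of a valid UTF-8 string and advances `pos`
// past it. The result is a value, not a reference into the string.
char32_t next_codepoint(const std::string& utf8, std::size_t& pos)
{
    assert(pos < utf8.size());
    const unsigned char lead = static_cast<unsigned char>(utf8[pos]);
    std::size_t len = 1;
    char32_t cp = lead;                              // ASCII case
    if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
    else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
    assert(pos + len <= utf8.size());
    for (std::size_t i = 1; i < len; ++i) {
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(utf8[pos + i]) & 0x3F);
    }
    pos += len;
    return cp;
}
```

Writing a different codepoint back at the same position may need a different number of bytes, which is why that operation has to go through the container.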
------=_Part_2201_8277766.1368065359373--
.
Author: Lawrence Crowl <crowl@googlers.com>
Date: Thu, 9 May 2013 11:39:13 -0700
Raw View
On 5/8/13, Nicol Bolas <jmckesson@gmail.com> wrote:
> On May 8, 2013, Mikhail Semenov wrote:
> > (1) Are you saying that the Committee is happy with the idea of
> > an Encoded String class?
> >
> > (2) My proposal was to use the encoding class only for
> > conversion. You can encode and decode the whole text in one go.
>
> ... and? Look at the proposal; it already has transcoding support
> for "the whole text in one go".
>
> > (3) I was thinking about having iterators as well in a different
> > setting (without encoding/decoding). They are also meant to crawl
> > through a string (say, a UTF-8 string), but they won't work if
> > you'd like to replace, say, a 2-byte code with a 3-byte one
> > inside a string: too inefficient!
>
> Codepoint iterators would only provide value access to
> codepoints. You can't set a codepoint via a codepoint iterator. The
> only encoding where setting a codepoint by iterator would ever work
> (without the container) would be UTF-32.
I think an output iterator appending to the string would handle
codepoints just fine in any encoding. Indeed, that work must
effectively be done by any transcoder. We might as well make the
primitive available.
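
A minimal sketch of such an appending output iterator, assuming UTF-8 as the target encoding and a plain std::string as the storage. The class name utf8_append_iterator is hypothetical and not taken from N3572; surrogate values are not rejected in this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical output iterator (name and design are assumptions): every
// codepoint assigned through it is encoded as UTF-8 and appended to the
// target string.
class utf8_append_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_append_iterator(std::string& target) : target_(&target) {}

    utf8_append_iterator& operator=(char32_t cp)
    {
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            target_->push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            target_->push_back(static_cast<char>(0xC0 | (cp >> 6)));
            target_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            target_->push_back(static_cast<char>(0xE0 | (cp >> 12)));
            target_->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            target_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            target_->push_back(static_cast<char>(0xF0 | (cp >> 18)));
            target_->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            target_->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            target_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        return *this;
    }

    // Dereference and increment are no-ops, as with std::back_insert_iterator.
    utf8_append_iterator& operator*() { return *this; }
    utf8_append_iterator& operator++() { return *this; }
    utf8_append_iterator operator++(int) { return *this; }

private:
    std::string* target_;
};
```

An object of this type can be handed to std::copy or any other algorithm that writes a sequence of char32_t values, which is the same work a transcoder's output side has to do.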
> Encoded string would have the ability to insert codepoints or
> codepoint ranges into explicit locations in the string (locations
> denoted by codepoint iterators).
>
> Really, just look at the proposal sometime. It's got all this
> stuff in there, fairly well specified.
--
Lawrence Crowl
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Thu, 9 May 2013 18:16:18 -0700 (PDT)
Raw View
------=_Part_306_4937667.1368148578525
Content-Type: text/plain; charset=ISO-8859-1
On Thursday, May 9, 2013 11:39:13 AM UTC-7, Lawrence Crowl wrote:
>
> On 5/8/13, Nicol Bolas <jmck...@gmail.com> wrote:
> > On May 8, 2013, Mikhail Semenov wrote:
> > > (1) Are you saying that the Committee is happy with the idea of
> > > an Encoded String class?
> > >
> > > (2) My proposal was to use the encoding class only for
> > > conversion. You can encode and decode the whole text in one go.
> >
> > ... and? Look at the proposal; it already has transcoding support
> > for "the whole text in one go".
> >
> > > (3) I was thinking about having iterators as well in a different
> > > setting (without encoding/decoding). They are also meant to crawl
> > > through a string (say, a UTF-8 string), but they won't work if
> > > you'd like to replace, say, a 2-byte code with a 3-byte one
> > > inside a string: too inefficient!
> >
> > Codepoint iterators would only provide value access to
> > codepoints. You can't set a codepoint via a codepoint iterator. The
> > only encoding where setting a codepoint by iterator would ever work
> > (without the container) would be UTF-32.
>
> I think an output iterator appending to the string would handle
> codepoints just fine in any encoding. Indeed, that work must
> effectively be done by any transcoder. We might as well make the
> primitive available.
>
Those are iterators based on *containers*, not ranges. I'm talking about
doing something like std::for_each and modifying the codepoints in-situ.
That's not reasonable with pure iterator logic.
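
For contrast, a small sketch of the one case where in-place modification through plain iterators does work, namely UTF-32 held in a std::u32string. The function name is made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Works only because every element of a UTF-32 string is exactly one
// codepoint: overwriting an element never changes the sequence length.
void replace_a_with_pi_in_place(std::u32string& text)
{
    std::for_each(text.begin(), text.end(), [](char32_t& cp) {
        if (cp == U'a') {
            cp = U'\u03C0';   // GREEK SMALL LETTER PI
        }
    });
}
```

With UTF-8 or UTF-16 the same loop body could need to grow or shrink the underlying storage, which a pair of iterators alone cannot do.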
------=_Part_306_4937667.1368148578525--
.
Author: Tony V E <tvaneerd@gmail.com>
Date: Fri, 10 May 2013 03:58:41 -0000
Raw View
--001a11c378e4d1a5f004dc552edb
Content-Type: text/plain; charset=ISO-8859-1
You could use a proxy-based iterator, I suppose. Obviously, that has trade-offs.
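
A rough sketch of what the proxy could look like, assuming UTF-8 in a std::string. The names encode_utf8 and codepoint_proxy are hypothetical, not from N3572, and surrogates are not rejected. The proxy remembers where a codepoint lives and splices the new encoding into the string on assignment:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encodes one codepoint as a UTF-8 byte sequence.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical proxy "reference" to the codepoint occupying `length` bytes
// at byte `offset` of a UTF-8 string.
class codepoint_proxy {
public:
    codepoint_proxy(std::string& text, std::size_t offset, std::size_t length)
        : text_(&text), offset_(offset), length_(length) {}

    // Re-encodes the new codepoint and splices it over the old bytes.
    // This may shift every later byte and may reallocate the string.
    codepoint_proxy& operator=(char32_t cp)
    {
        const std::string encoded = encode_utf8(cp);
        text_->replace(offset_, length_, encoded);
        length_ = encoded.size();
        return *this;
    }

private:
    std::string* text_;
    std::size_t offset_;
    std::size_t length_;
};
```

The trade-offs show up directly: each assignment can cost time linear in the string length, and any other proxy or iterator that stored a byte offset past the edited position is stale afterwards.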
Sent from my portable Analytical Engine
--001a11c378e4d1a5f004dc552edb--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Fri, 10 May 2013 12:28:04 +0200
Raw View
--089e013a1bb2509d7604dc5a9f85
Content-Type: text/plain; charset=ISO-8859-1
On Thu, May 9, 2013 at 8:39 PM, Lawrence Crowl <crowl@googlers.com> wrote:
> I think an output iterator appending to the string would handle
> codepoints just fine in any encoding. Indeed, that work must
> effectively be done by any transcoder. We might as well make the
> primitive available.
>
Wouldn't that simply be std::back_inserter?
--089e013a1bb2509d7604dc5a9f85--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sat, 11 May 2013 09:06:00 -0700 (PDT)
Raw View
------=_Part_1625_9871991.1368288360440
Content-Type: text/plain; charset=ISO-8859-1
Lawrence,
Could you tell me, please, what is the situation with 3398?
Has it been approved?
Mikhail.
------=_Part_1625_9871991.1368288360440--
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 11 May 2013 12:08:58 -0700 (PDT)
Raw View
------=_Part_2295_5284578.1368299338650
Content-Type: text/plain; charset=ISO-8859-1
No. I am not sure precisely what feedback Beman received from it, but it
was not approved, and no successor (unless you consider N3572 itself a
successor) was presented at Bristol.
------=_Part_2295_5284578.1368299338650--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Sun, 12 May 2013 03:31:47 -0700 (PDT)
Raw View
------=_Part_96_14413000.1368354707588
Content-Type: text/plain; charset=ISO-8859-1
Thanks for the information.
On Saturday, May 11, 2013 8:08:58 PM UTC+1, DeadMG wrote:
> No. I am not sure precisely what feedback Beman received from it, but it
> was not approved, and no successor (unless you consider N3572 itself a
> successor) was presented at Bristol.
------=_Part_96_14413000.1368354707588--
.
Author: Lawrence Crowl <crowl@googlers.com>
Date: Mon, 13 May 2013 11:42:30 -0700
Raw View
On 5/11/13, Mikhail Semenov <mikhailsemenov1957@gmail.com> wrote:
> Could you tell me, please, what is the situation with 3398?
You mean N3398 "String Interoperation Library Adapting Standard
Library Strings and I/O to a Unicode World"?
> Has it been approved?
I see no evidence that it was discussed at Bristol. So I presume
not. Alisdair Meredith is the better person to ask.
--
Lawrence Crowl
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 13 May 2013 15:46:02 -0700 (PDT)
Raw View
------=_Part_849_1205370.1368485162554
Content-Type: text/plain; charset=ISO-8859-1
It was discussed at a previous meeting.
------=_Part_849_1205370.1368485162554--
.
Author: Mikhail Semenov <mikhailsemenov1957@gmail.com>
Date: Tue, 14 May 2013 09:38:30 +0100
Raw View
--089e01229c30ccc13d04dca98e99
Content-Type: text/plain; charset=ISO-8859-1
Could you tell me about any feedback from that discussion on N3398, please?
On 13 May 2013 23:46, DeadMG <wolfeinstein@gmail.com> wrote:
> It was discussed at a previous meeting.
--089e01229c30ccc13d04dca98e99--
.
Author: matthieu.monrocq@gmail.com
Date: Thu, 5 Sep 2013 11:47:58 -0700 (PDT)
Raw View
------=_Part_25_30731013.1378406878540
Content-Type: text/plain; charset=ISO-8859-1
Hello,
I just saw this proposal thanks to Reddit and I must admit that I am
concerned about the idea of exposing the encoding as a template parameter.
I believe we can all agree that a std::unicode_string which would preserve
the original encoding of the string it was initialized with and still allow
manipulations would have value. If not, let me give you an example.
Suppose that I write a simple text editor: I load the text, do a couple of
things to it, and then write it back with its original encoding. With an
encoded_string I unfortunately have to pick an encoding of choice and will
incur the wrath of all users who (unfortunately) chose to use another
encoding, because my function is slow with those.
Note that the same issue occurs when writing a browser (encoding of the HTML
page), writing an XML tool (encoding of the XML message), interacting with the
filesystem (Linux prefers UTF-8, Windows prefers UTF-16, ...), etc.
The issue I foresee with template <typename Encoding> class
encoded_string; is that instead of having to convert between
QString and CString when interacting with two different libraries, I will
have to convert between std::encoded_string<utf8> (ah, a Linux aficionado)
and std::encoded_string<utf16le> (ah, a Windows/Java aficionado). The fact
that the conversions are automatic is great, but there are still
conversions all over the place, which is a sad performance issue.
And of course, it means that the (future) filesystem module will have to be
templated too, because there is no reason I should pay for conversions to
an arbitrary encoding if I already passed the string in the right encoding
for this particular platform...
Now, I admit that Nicol Bolas has a good point that type erasure is both
undesirable and yet very much necessary if we wish a std::unicode_string to
support any kind of encoding.
I would, however, stress the last point: is it necessary for it to support
*any* kind of encoding? Or, at least, is it necessary to support *any* kind
of encoding *all equally efficiently*?
I could see two ways out of general type erasure here:
- Use only a fixed set of encodings, typically I would see UTF-8, UTF-16
(LE/BE), UTF-32 (LE/BE) and maybe a couple others
- If necessary, add a type-erased variant for other encodings
And when dealing with an algorithm, you would have to specialize it for
std::unicode_string if you wish to avoid (most) runtime overhead:
    template <typename BackwardInserter>
    void copy(std::unicode_string const& us, BackwardInserter bi) {
        switch (us.encoding()) {
        case std::encoding::utf8:
            std::copy(utf8_begin(us), utf8_end(us), bi);
            break;
        case std::encoding::utf16le:
            std::copy(utf16le_begin(us), utf16le_end(us), bi);
            break;
        ...
        case std::encoding::other:
            std::copy(begin(us), end(us), bi);  // uses type erasure, sorry folks.
            break;
        }
    }
You pay a little for the initial switch, but then it soon hands over the
issue to a specialized iterator so the cost of the *single* switch is
probably dwarfed by the actual iteration anyway.
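As a rough illustration of the single-switch idea above, here is a minimal, self-contained sketch. Everything in it is hypothetical: unicode_string, encoding, decode_utf8, decode_utf32 and copy_code_points are made-up names, not part of N3572 or any library. Only two encodings are handled, and the decoders assume well-formed input (UTF-32 in host byte order).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical runtime tag; a real design would list more encodings.
enum class encoding { utf8, utf32 };

// Hypothetical string that remembers which encoding its bytes are in.
struct unicode_string {
    encoding enc;
    std::string bytes;  // raw code units; host byte order for UTF-32
};

// Specialized decoder: UTF-8 bytes to code points.
// Assumes well-formed input; performs no validation.
template <typename Out>
void decode_utf8(const std::string& s, Out out) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t len =
            lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        *out++ = cp;
        i += len;
    }
}

// Specialized decoder: UTF-32 code units to code points.
template <typename Out>
void decode_utf32(const std::string& s, Out out) {
    assert(s.size() % sizeof(char32_t) == 0);
    for (std::size_t i = 0; i < s.size(); i += sizeof(char32_t)) {
        char32_t cp;
        std::memcpy(&cp, s.data() + i, sizeof cp);
        *out++ = cp;
    }
}

// The encoding is inspected once; the loop that follows is specialized.
template <typename Out>
void copy_code_points(const unicode_string& us, Out out) {
    switch (us.enc) {
    case encoding::utf8:
        decode_utf8(us.bytes, out);
        break;
    case encoding::utf32:
        decode_utf32(us.bytes, out);
        break;
    }
}
```

The branch on the encoding happens once per call to copy_code_points, not once per code point, which is the property the paragraph above relies on.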
I realize that I obviously do not have the depth and experience that DeadMG
or R. Martinho Fernandes have, so my suggestion could be completely off base
(in which case please just ignore it). However, I believe this design could
strike an interesting balance between the "Python String" and the "Template
String".
-- Matthieu
------=_Part_25_30731013.1378406878540--
.
Author: cornedbee@google.com
Date: Fri, 6 Sep 2013 02:54:13 -0700 (PDT)
Raw View
------=_Part_4217_19370668.1378461253606
Content-Type: text/plain; charset=ISO-8859-1
On Thursday, September 5, 2013 8:47:58 PM UTC+2, matthieu...@gmail.com
wrote:
>
> I believe we can all agree that a std::unicode_string which would preserve
> the original encoding of the string it was initialized with and still allow
> manipulations would have value.
>
Bad start. The entire thread is about people not agreeing on this.
> If not, let me give you an example.
>
> Suppose that I write a simple text editor: I load the text, do a couple
> things about it, and then write it back with its original encoding. With an
> encoded_string I unfortunately have to pick an encoding of choice and will
> incur the wrath of all users who (unfortunately) chose to use another
> encoding because my function is slow with those.
>
This is how probably every single text editor in existence works.
Everything written in Java or C# works with UTF-16 internally, no matter
the external encoding. Emacs uses UTF-8. Vim, as far as I can tell, can use
any narrow encoding, but defaults to UTF-8. I haven't seen anyone complain
yet.
>
> Note that the same issue occur when writing a browser (encoding of HTML
> page),
>
Browsers all use fixed encodings internally. Most of them use UTF-16,
because the JavaScript string interface assumes it. Mozilla has classes for
narrow strings as well as wide strings, but not for hybrids.
> writing an XML tool (encoding of XML message),
>
Same here.
> interacting with the filesystem (Linux prefers utf-8, Windows prefers
> utf-16, ...) etc...
>
The old Boost.Filesystem used a template parameter. The new one
automatically converts to the platform-preferred encoding, but it does not
preserve the original encoding.
> The issue I foresee with template <typename Encoding> class
> encoded_string<Encoding>; is that instead of having to convert between
> QtString and CString when interacting with two different libraries, I will
> have to convert between std::encoded_string<utf8> (ah, a Linux aficionado)
> and std::encoded_string<utf16le> (ah, a Windows/Java aficionado) and...
> well the fact that the conversions are automatic is great but there are
> still conversions all over the place which is a sad performance issue.
>
But at least you can see the performance issue in the types. With a hybrid
string, the performance issue is completely hidden:
hybrid_string s = library1::get_string() + library2::get_string();
If the two libraries use different encodings, one string will have to be
converted. And I don't even get to know which one!
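To make the "visible in the types" point concrete, here is a small sketch in which the encoding is part of the string type. The names (utf8, utf16, encoded_string, to_utf8, append_utf8) are assumptions for illustration only and do not reproduce the interface proposed in N3572. Concatenation is defined only for matching encodings, so mixing two encodings does not compile until the caller writes the conversion explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tags carrying their code unit type.
struct utf8  { using unit = char; };
struct utf16 { using unit = char16_t; };

// Hypothetical string whose encoding is part of its type.
template <typename Encoding>
struct encoded_string {
    std::basic_string<typename Encoding::unit> units;
};

// Concatenation exists only when both operands share an encoding.
template <typename E>
encoded_string<E> operator+(const encoded_string<E>& a,
                            const encoded_string<E>& b) {
    return {a.units + b.units};
}

// Append one code point as UTF-8 (assumes a valid scalar value).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// The conversion cost is spelled out at the call site.
// Assumes well-formed UTF-16 (every high surrogate is followed by a low one).
inline encoded_string<utf8> to_utf8(const encoded_string<utf16>& s) {
    encoded_string<utf8> out;
    for (std::size_t i = 0; i < s.units.size(); ++i) {
        char32_t cp = s.units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            assert(i + 1 < s.units.size());
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s.units[++i] - 0xDC00);
        }
        append_utf8(out.units, cp);
    }
    return out;
}

// encoded_string<utf8> a; encoded_string<utf16> b;
// a + b;            // does not compile: no operator+ for mixed encodings
// a + to_utf8(b);   // compiles, and the conversion is visible in the source
```

With a runtime-tagged hybrid string, the equivalent of `a + b` would compile and silently pick one operand to transcode.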
> And of course, it means that the (future) filesystem module will have to
> be templated too, because there is no reason I should pay for conversions
> to an arbitrary encoding if I already passed the string in the right
> encoding for this particular platform...
>
I don't understand what you're trying to say here.
> I could see two ways out of general type erasure here:
>
> - Use only a fixed set of encodings, typically I would see UTF-8, UTF-16
> (LE/BE), UTF-32 (LE/BE) and maybe a couple others
>
You really don't save much by having a fixed subset. You get a slightly
cheaper selection-on-every-operation mode, but it's still
selection-on-every-operation. Full type erasure doesn't stop you from
optimizing a few cases.
------=_Part_4217_19370668.1378461253606--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Fri, 6 Sep 2013 13:04:58 +0200
Raw View
On Thu, Sep 5, 2013 at 8:47 PM, <matthieu.monrocq@gmail.com> wrote:
> Suppose that I write a simple text editor: I load the text, do a couple
> things about it, and then write it back with its original encoding. With an
> encoded_string I unfortunately have to pick an encoding of choice and will
> incur the wrath of all users who (unfortunately) chose to use another
> encoding because my function is slow with those.
I believe most text manipulation applications in the wild use a
sandwich approach (like what Ned Batchelder advocates for Python here:
http://nedbatchelder.com/text/unipain/unipain.html#35). External data
has whatever encoding it comes with/needs to be written with, and
internal data is always in the same encoding. There's a transcoding
cost whenever you cross the boundary with the external world, but
there is no cost in the actual text manipulations. And since we are
talking about an application to manipulate text...
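A minimal sketch of that sandwich shape, under stated assumptions: the external input is ISO-8859-1, the internal representation is a sequence of code points (std::u32string), and the output is UTF-8. The function names are invented for this example, and the "manipulation" in the middle is deliberately trivial.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Boundary in: ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string text;
    for (unsigned char b : bytes) text += static_cast<char32_t>(b);
    return text;
}

// Boundary out: encode code points as UTF-8.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);  // sketch assumes valid scalar values
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Internal work: only ever sees the single internal representation.
std::u32string strip_spaces(std::u32string text) {
    text.erase(std::remove(text.begin(), text.end(), U' '), text.end());
    return text;
}

// The sandwich: decode at the edge, manipulate inside, encode at the edge.
std::string latin1_to_utf8_without_spaces(const std::string& input) {
    return encode_utf8(strip_spaces(decode_latin1(input)));
}
```

Transcoding happens exactly twice per document, however many manipulations run in between.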
> I would, however, stress the last point: is it necessary for it to support
> *any* kind of encoding. Or, at least, is it necessary to support *any* kind
> of encoding *all equally efficiently* ?
Out of the box? I would settle for any set that includes UTF-8, UTF-16
and UTF-32. Certainly not a set that includes UCS-2 but not UTF-16
(codecvts, I am looking at you). There are a few other important
encodings around (windows1252, latin15, gb18030) but we don't need to
require a zillion of them in the stdlib. *However*, most of these
encodings share very similar characteristics which makes it very hard
to design an interface that would work with one but not the other
(there are some oddballs like Shift-JIS, but most of them share the
same fundamental ideas). I'd say it is impossible to do so
accidentally. So, a small important subset with required support? Yes.
Freedom for extension by implementations and users? Yes.
Mit freundlichen Grüßen,
Martinho
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Fri, 6 Sep 2013 04:16:51 -0700 (PDT)
Raw View
------=_Part_205_32882629.1378466211145
Content-Type: text/plain; charset=ISO-8859-1
>
> I believe we can all agree that a std::unicode_string which would preserve
> the original encoding of the string it was initialized with and still allow
> manipulations would have value.
Yes, the proposal includes that. It's called "auto".
> Suppose that I write a simple text editor: I load the text, do a couple
> things about it, and then write it back with its original encoding. With an
> encoded_string I unfortunately have to pick an encoding of choice and will
> incur the wrath of all users who (unfortunately) chose to use another
> encoding because my function is slow with those.
Use a template. Alternatively, use the System encoding; that's what it's
for.
> Note that the same issue occur when writing a browser (encoding of HTML
> page), writing an XML tool (encoding of XML message), interacting with the
> filesystem (Linux prefers utf-8, Windows prefers utf-16, ...) etc...
Use the System encoding. Its entire purpose is to provide a
system-appropriate default that makes sense to use when interoperating
with the system APIs.
Ultimately, people don't write applications that deal with arbitrary
encodings. They write applications against one fixed encoding and convert
everything else to that. For a generic function, using a template where
possible and the System encoding otherwise is not unreasonable.
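The two options mentioned here (a template, or a system-appropriate default) might be sketched as follows. This is an illustration under assumptions, not the interface from the proposal: system_encoding is a hypothetical alias, and its platform mapping simply follows the "Windows prefers UTF-16, Linux prefers UTF-8" remark made earlier in the thread. The other names are likewise invented.

```cpp
#include <cassert>
#include <string>

// Hypothetical encoding tags carrying their code unit type.
struct utf8  { using unit = char; };
struct utf16 { using unit = char16_t; };

// Hypothetical stand-in for a "System" encoding, chosen at compile time.
#if defined(_WIN32)
using system_encoding = utf16;
#else
using system_encoding = utf8;
#endif

template <typename Encoding>
struct encoded_string {
    std::basic_string<typename Encoding::unit> units;
};

// Option 1: a template. Accepts any encoded_string without transcoding.
// A UTF-8 or UTF-16 string is pure ASCII exactly when every code unit
// is below 0x80.
template <typename Encoding>
bool is_ascii(const encoded_string<Encoding>& s) {
    for (auto u : s.units)
        if (static_cast<char32_t>(u) >= 0x80) return false;
    return true;
}

// Option 2: a non-template boundary that commits to the system encoding.
// Callers holding another encoding must convert before calling.
bool is_ascii_path(const encoded_string<system_encoding>& path) {
    assert(!path.units.empty());  // this sketch treats an empty path as a caller error
    return is_ascii(path);
}
```

Code that can stay generic uses the first form; code that must present a single concrete signature, such as a compiled library entry point, uses the second.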
------=_Part_205_32882629.1378466211145--
.
Author: Matthieu Monrocq <matthieu.monrocq@gmail.com>
Date: Fri, 6 Sep 2013 19:15:32 +0200
Raw View
--001a11c33fe6a2c23e04e5ba2f22
Content-Type: text/plain; charset=ISO-8859-1
On Fri, Sep 6, 2013 at 11:54 AM, <cornedbee@google.com> wrote:
>
>
> On Thursday, September 5, 2013 8:47:58 PM UTC+2, matthieu...@gmail.com wrote:
>>
>> I believe we can all agree that a std::unicode_string which would
>> preserve the original encoding of the string it was initialized with and
>> still allow manipulations would have value.
>>
>
> Bad start. The entire thread is about people not agreeing on this.
>
>
>> If not, let me give you an example.
>>
>> Suppose that I write a simple text editor: I load the text, do a couple
>> things about it, and then write it back with its original encoding. With an
>> encoded_string I unfortunately have to pick an encoding of choice and will
>> incur the wrath of all users who (unfortunately) chose to use another
>> encoding because my function is slow with those.
>>
>
> This is how probably every single text editor in existence works.
> Everything written in Java or C# works with UTF-16 internally, no matter
> the external encoding. Emacs uses UTF-8. Vim, as far as I can tell, can use
> any narrow encoding, but defaults to UTF-8. I haven't seen anyone complain
> yet.
>
>
>>
>> Note that the same issue occur when writing a browser (encoding of HTML
>> page),
>>
>
> Browsers all use fixed encodings internally. Most of them use UTF-16,
> because the JavaScript string interface assumes it. Mozilla has classes for
> narrow strings as well as wide strings, but not for hybrids.
>
>
>> writing an XML tool (encoding of XML message),
>>
>
> Same here.
>
>
>> interacting with the filesystem (Linux prefers utf-8, Windows prefers
>> utf-16, ...) etc...
>>
>
> The old Boost.Filesystem used a template parameter. The new one
> automatically converts to the platform-preferred encoding, but it does not
> preserve the original encoding.
>
But what would the interface look like (with encoded_string)? I guess they
could detect the platform at compile time and use an #ifdef switch to pick
the encoding to present in the interface?
>
>> The issue I foresee with template <typename Encoding> class
>> encoded_string<Encoding>; is that instead of having to convert between
>> QtString and CString when interacting with two different libraries, I will
>> have to convert between std::encoded_string<utf8> (ah, a Linux aficionado)
>> and std::encoded_string<utf16le> (ah, a Windows/Java aficionado) and...
>> well the fact that the conversions are automatic is great but there are
>> still conversions all over the place which is a sad performance issue.
>>
>
> But at least you can see the performance issue in the types. With a hybrid
> string, the performance issue is completely hidden:
>
> hybrid_string s = library1::get_string() + library2::get_string();
>
> If the two libraries use different encodings, one string will have to be
> converted. And I don't even get to know which one!
>
Actually, my point is that with fixed encodings each library has to pick
one, which results in potential performance penalties whenever there is a
mismatch, whilst with an encoding-agnostic type all libraries can settle on a
single type (in the interface) and no conversion will occur unless truly
necessary (such as when interacting with a UTF-16 filesystem).
So you are right that conversions may occur implicitly, but since they are
delayed until the last possible moment you never have to pay for useless
conversions.
The only way I see for encoded_string to make such a promise is to have every
library be templated on the encoding.
What I have in mind, specifically, is a sequence of calls such as:
1. A::Magic(encoded_string<utf8>&)
2. B::Magic(encoded_string<utf16le>&)
3. A::Next(encoded_string<utf8>&)
Note that those libraries may not really care about the encoding; you just
forced them to pick one explicitly (or to always use pure-template code or N
overloads... which is not practical).
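To make the three-call sequence concrete, here is a minimal sketch. It is not code from N3572 or from any library named in this thread. It assumes a toy `encoded_string` that stores code points whatever its tag says, so that "transcoding" is only a copy plus a bump of a hypothetical `conversion_count`; the namespaces `A` and `B` and the function `run_sequence` are likewise made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding tags; the names follow the thread, not a real library.
struct utf8 {};
struct utf16le {};

// Counts cross-encoding conversions so that their cost becomes observable.
static int conversion_count = 0;

// Toy stand-in for encoded_string<Encoding>: it stores code points whatever
// the tag says, so a "conversion" is a copy plus a counter increment.
template <typename Encoding>
class encoded_string {
public:
    explicit encoded_string(std::u32string code_points)
        : code_points_(std::move(code_points)) {}

    // Converting constructor: a real implementation would transcode here.
    template <typename Other>
    encoded_string(const encoded_string<Other>& other)
        : code_points_(other.code_points()) { ++conversion_count; }

    const std::u32string& code_points() const { return code_points_; }

private:
    std::u32string code_points_;
};

// Two libraries that each had to commit to one encoding in their interface.
namespace A {
inline void Magic(encoded_string<utf8>&) {}
inline void Next(encoded_string<utf8>&) {}
}
namespace B {
inline void Magic(encoded_string<utf16le>&) {}
}

// Runs the A -> B -> A sequence and reports how many conversions it needed.
inline int run_sequence() {
    conversion_count = 0;
    encoded_string<utf8> text(U"hello");
    A::Magic(text);                      // encodings match: no conversion
    encoded_string<utf16le> wide(text);  // conversion 1: utf8 -> utf16le
    B::Magic(wide);
    text = encoded_string<utf8>(wide);   // conversion 2: utf16le -> utf8
    A::Next(text);
    assert(conversion_count == 2);
    return conversion_count;
}
```

In this sketch neither library inspects the code units, yet the caller still pays for two conversions, which is the cost the paragraph above is concerned with.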
>
>
>> And of course, it means that the (future) filesystem module will have to
>> be templated too, because there is no reason I should pay for conversions
>> to an arbitrary encoding if I already passed the string in the right
>> encoding for this particular platform...
>>
>
> I don't understand what you're trying to say here.
>
>
>> I could see two ways out of general type erasure here:
>>
>> - Use only a fixed set of encodings, typically I would see UTF-8, UTF-16
>> (LE/BE), UTF-32 (LE/BE) and maybe a couple others
>>
>
> You really don't save much by having a fixed subset. You get a slightly
> cheaper selection-on-every-operation mode, but it's still
> selection-on-every-operation. Full type erasure doesn't stop you from
> optimizing a few cases.
>
It is indeed selection-on-every-operation, though I would say:
- It does not matter if the operation is lengthy.
- It does not matter either if the operation is infrequent.
- If there are successive calls to operations, the selection can probably
be done only once (hoisting the invariant check out of the loop, ...).
I am not sure it would be a bottleneck in practice... but if it matters to
you, nothing prevents you from using a `string_view<Encoding>` to avoid it.
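As a rough illustration of the points above, the sketch below selects the encoding once per operation and then runs the whole loop in an encoding-specific kernel. Everything in it is an assumption made for the example: the `hybrid_string` layout, the two-entry `encoding` enum and the `count_code_points` overloads are hypothetical, and the overloads taking `std::string` and `std::u16string` merely play the role that a statically typed `string_view<Encoding>` would play. Input is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical fixed set of encodings for the hybrid type.
enum class encoding { utf8, utf16 };

// Hypothetical hybrid string: the code units live in whichever buffer
// matches `enc`; the other buffer stays empty.
struct hybrid_string {
    encoding enc;
    std::string narrow;   // used when enc == encoding::utf8
    std::u16string wide;  // used when enc == encoding::utf16
};

// Encoding-specific kernels: no runtime selection inside the loops.
inline std::size_t count_code_points(const std::string& utf8_units) {
    std::size_t n = 0;
    for (unsigned char c : utf8_units) n += (c & 0xC0) != 0x80;  // skip continuation bytes
    return n;
}
inline std::size_t count_code_points(const std::u16string& utf16_units) {
    std::size_t n = 0;
    for (char16_t c : utf16_units) n += (c & 0xFC00) != 0xDC00;  // skip low surrogates
    return n;
}

// The selection happens once per call; its cost is amortised over the loop.
inline std::size_t count_code_points(const hybrid_string& s) {
    switch (s.enc) {
        case encoding::utf8:  return count_code_points(s.narrow);
        case encoding::utf16: return count_code_points(s.wide);
    }
    return 0;  // unreachable for valid enumerators
}

// Small self-check using a two-byte UTF-8 sequence and a surrogate pair.
inline void hybrid_demo() {
    hybrid_string a{encoding::utf8, "h\xC3\xA9llo", u""};
    hybrid_string b{encoding::utf16, "", u"\U0001F600!"};
    assert(count_code_points(a) == 5);
    assert(count_code_points(b) == 2);
}
```

A caller that already knows the encoding can call the `std::string` or `std::u16string` overload directly and skip the `switch` altogether.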
--001a11c33fe6a2c23e04e5ba2f22--
.
Author: Martinho Fernandes <martinho.fernandes@gmail.com>
Date: Sat, 7 Sep 2013 04:10:28 +0200
Raw View
On Fri, Sep 6, 2013 at 7:15 PM, Matthieu Monrocq
<matthieu.monrocq@gmail.com> wrote:
> But what would the interface look like (with encoded_string) ? I guess they
> could detect the platform at compile-time and use an #ifdef switch to pick
> the encoding to present in the interface ?
It would look like `encoded_string<system>`, I guess. The `system`
encoding is provided by the implementation as the appropriate typedef
(just like `size_t` and so on).
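A minimal sketch of what such a typedef could look like follows. It is an assumption for illustration only, not the proposal's wording: the namespace `sketch`, the tags, the `code_unit` member and the function `path_units` are all hypothetical, and the `_WIN32` test stands in for whatever mechanism an implementation would really use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Hypothetical encoding tags, each naming its code unit type.
struct utf8    { using code_unit = char; };
struct utf16le { using code_unit = char16_t; };

// The implementation, not the user, picks the alias, much as it picks size_t.
#if defined(_WIN32)
using system = utf16le;
#else
using system = utf8;
#endif

// Toy string that only stores code units of the tagged encoding.
template <typename Encoding>
struct encoded_string {
    std::basic_string<typename Encoding::code_unit> units;
};

// A library interface written once, against the platform-preferred encoding.
inline std::size_t path_units(const encoded_string<system>& path) {
    return path.units.size();
}

// The same user code compiles unchanged on either kind of platform.
inline std::size_t demo() {
    encoded_string<system> path;
    path.units.push_back('/');
    assert(path_units(path) == 1);
    return path_units(path);
}

}  // namespace sketch
```

The conditional compilation lives in one place inside the implementation, so user code never needs its own #ifdef.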
> Actually, my point is that with fixed encodings, each library has to pick
> one, thus resulting in potential performance penalties when there is a
> mismatch whilst with a encoding-agnostic type all libraries can settle on a
> single type (in the interface) and no conversion will occur unless truly
> necessary (such as interaction with an UTF-16 filesystem).
That ship has sailed. The libraries did not settle on this. The C++
ecosystem is already fractured. This would be a nice idea for a greenfield
environment, but C and C++ libraries already exist that made their
choices. You may keep the same type on the interfaces, but the
conversion will still need to be there.
> The only way I see encoded_string to make such a promise is to have every
> library be templated on the encoding.
To be honest, I don't see what prevents us from having a type that
erases the encoding built on the ones that don't erase it. I do that
in my prototype, but it makes defining some semantics a bit
complicated, like the `auto c = a + b` example given above: if a and b
erase different encodings, what does c become? Anything the library
picks here would be arbitrary, so, as in all such cases, my prototype
picks none (one of the design goals is to make operations automated if
and only if that does not reduce flexibility or functionality).
In my prototype there is no op+, only a variadic `concat` function.
That function uses a bunch of TMP to compute the right type *or not*.
If all strings have the same encoding you get a string with the same
encoding back. If they don't and you didn't specify an explicit
encoding, you get a compilation error. If you give an explicit
encoding like `auto c = concat<utf8>(a, b)` you always get that
encoding. The type-erased form doesn't have a known encoding, so
concatenating two such type-erased strings would be ill-formed without an
explicit encoding, even if you know they are the same. One of my
design goals is to be conservative about what compiles (if it's not safe,
or explicitly acknowledged as unsafe, it shouldn't compile), so this is
consistent with that. It is a bit more complex than this because I have
an extra customisable hook on my string class, and I give special
treatment to the mixed cases (type-erased + non-erased), but that's
basically it.
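For readers who want to see the shape of such a `concat`, here is a minimal sketch of the three statically typed cases (same encoding, mixed without an explicit encoding, explicit encoding). It is not the prototype's code: the `deduce` tag, `all_same`, `concat_result` and `append_all` are names invented for this example, the toy `encoded_string` stores code points so that no real transcoding is needed, and the type-erased and customisation-hook cases are left out.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct utf8 {};
struct utf16 {};
struct deduce {};  // hypothetical default meaning "no explicit encoding given"

// Toy string: code points whatever the tag, so the focus stays on the types.
template <typename Encoding>
struct encoded_string {
    std::u32string code_points;
};

// all_same<T, Ts...>::value is true when every Ts is the same type as T.
template <typename T, typename... Ts>
struct all_same : std::true_type {};
template <typename T, typename U, typename... Ts>
struct all_same<T, U, Ts...>
    : std::integral_constant<bool, std::is_same<T, U>::value &&
                                       all_same<T, Ts...>::value> {};

// An explicit target always wins.
template <typename Target, typename First, typename... Rest>
struct concat_result { using type = Target; };

// Without a target, the encodings must agree or compilation stops here.
template <typename First, typename... Rest>
struct concat_result<deduce, First, Rest...> {
    static_assert(all_same<First, Rest...>::value,
                  "mixed encodings: name the result, e.g. concat<utf8>(a, b)");
    using type = First;
};

// Appends each argument's code points in order.
template <typename Out>
void append_all(Out&) {}
template <typename Out, typename First, typename... Rest>
void append_all(Out& out, const First& first, const Rest&... rest) {
    out.code_points += first.code_points;  // a real version would transcode
    append_all(out, rest...);
}

template <typename Target = deduce, typename First, typename... Rest>
encoded_string<typename concat_result<Target, First, Rest...>::type>
concat(const encoded_string<First>& first, const encoded_string<Rest>&... rest) {
    encoded_string<typename concat_result<Target, First, Rest...>::type> out;
    append_all(out, first, rest...);
    return out;
}

inline bool concat_demo() {
    encoded_string<utf8> a{U"ab"};
    encoded_string<utf8> b{U"cd"};
    encoded_string<utf16> c{U"ef"};

    auto same = concat(a, b);  // all utf8 in, utf8 out
    static_assert(std::is_same<decltype(same), encoded_string<utf8>>::value,
                  "same encoding in, same encoding out");

    auto forced = concat<utf16>(a, c);  // explicit encoding always wins
    static_assert(std::is_same<decltype(forced), encoded_string<utf16>>::value,
                  "explicit encoding is used");

    // auto bad = concat(a, c);  // rejected by the static_assert above
    return same.code_points == U"abcd" && forced.code_points == U"abef";
}
```

Uncommenting the `bad` line should turn the mixed-encoding call into a compile-time error, which is the conservative behaviour described above.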
.
Author: DeadMG <wolfeinstein@gmail.com>
Date: Sat, 7 Sep 2013 07:25:57 -0700 (PDT)
Raw View
------=_Part_98_15574924.1378563957041
Content-Type: text/plain; charset=ISO-8859-1
>
> Note that those libraries may not really care about the encoding, you just
> forced them to pick one explicitly (or always use pure-template code/N
> overloads... which is not practical).
If they need to pick one, and they don't want to use a template, then they
use the system encoding. That is what it is for.
------=_Part_98_15574924.1378563957041--
.