Thread

Topic: A proposal to fix UTF-8 string literals

Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 25 Nov 2012 08:02:00 -0800 (PST)


This proposal constitutes a breaking change.

Currently, there is no char8_t and no std::u8string, and the type of
u8"hello" is const char[N]. This, I believe, is a fundamental flaw in
the Standard's handling of UTF-8, rendering it almost entirely unusable. In
addition, it renders virtually all of the existing narrow encoding
infrastructure unusable in portable code.

UTF-8 and narrow encodings are not interoperable, especially where the
narrow encoding involves non-ASCII characters from the codepage system.
However, they share a common type. This means that given

void f(const char* p);

The implementer of f cannot assume that p is either UTF-8 *or* narrow
encoded, if he wishes to write portable code. This means that not only is
UTF-8 rendered unusable, but the existing narrow encodings too: if a
UTF-8 literal is passed but treated as a locale-based character
encoding, the result will be gibberish. About the only thing he can rely
on is the basic character set, and/or that virtually all current encodings
are supersets of ASCII.
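
To make the ambiguity concrete, here is a minimal sketch. The function names are hypothetical and not from any library, and ISO-8859-1 is only assumed as an example of a single-byte narrow encoding. The same two bytes are one character under UTF-8 and two characters under the code page, and nothing in the type const char* says which reading is intended.

```cpp
#include <cassert>
#include <cstddef>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes p points to well-formed, NUL-terminated UTF-8.
std::size_t count_as_utf8(const char* p) {
    std::size_t n = 0;
    for (; *p != '\0'; ++p) {
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Counts characters assuming a single-byte code page (ISO-8859-1 is the
// assumed example), where every byte is exactly one character.
std::size_t count_as_single_byte(const char* p) {
    std::size_t n = 0;
    while (p[n] != '\0') ++n;
    return n;
}

// The bytes 0xC3 0xA9 encode U+00E9 in UTF-8, but are two separate
// characters in ISO-8859-1. Both readings accept the same const char*.
void check_ambiguity() {
    const char* bytes = "\xC3\xA9";
    assert(count_as_utf8(bytes) == 1);
    assert(count_as_single_byte(bytes) == 2);
}
```

Whichever of the two functions f resembles, a caller holding the other kind of string gets a wrong answer without any diagnostic.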

Conceivably, he could add a parameter, perhaps from an enumeration,
indicating what encoding the string is in. But this is not a very usable
idea. Primarily, it is not type safe, and secondly, about ten minutes
later, there will be ten thousand different enumerations for encodings,
none of which will interoperate with each other. Thirdly, some functions,
particularly converting constructors and virtually all operators, and all
legacy code, cannot add additional parameters.

Realistically, the burden of maintaining existing code will mean that
nobody, ever, can use UTF-8, as it is impossible to deal with it safely,
unless you are only using implementations which already used UTF-8 as the
narrow encoding, in which case, there was no point in using a UTF-8 literal
in the first place.

The fix for this is simple: add char8_t as a distinct overloadable type,
similar to char16_t and char32_t, change u8 literals and raw literals to be
of this type, and introduce std::u8string and related typedefs and
specializations to match std::u16string.
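
As a rough sketch of what a distinct overloadable type buys, the following uses a hypothetical stand-in named utf8_unit rather than the proposed char8_t itself, so that it compiles on implementations that have no such type. The encoding enumeration and the two overloads of f are likewise illustrative assumptions.

```cpp
#include <cassert>

// Hypothetical stand-in for the proposed char8_t: a distinct one-byte type
// that does not implicitly convert to or from char.
enum class utf8_unit : unsigned char {};

// Illustrative tag reporting which overload was selected.
enum class encoding { narrow, utf8 };

// With two distinct pointer types, overload resolution selects the encoding;
// passing the wrong kind of string to a single-overload function would be a
// compile-time error instead of silent gibberish.
encoding f(const char*) { return encoding::narrow; }
encoding f(const utf8_unit*) { return encoding::utf8; }

void check_overloads() {
    const char narrow_text[] = "hello";
    const utf8_unit utf8_text[] = {static_cast<utf8_unit>(0xC3),
                                   static_cast<utf8_unit>(0xA9),
                                   static_cast<utf8_unit>(0)};
    assert(f(narrow_text) == encoding::narrow);
    assert(f(utf8_text) == encoding::utf8);
}
```

Under the proposal, u8 literals would produce the distinct type directly, so the explicit array initialization above would not be needed.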

--





.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 25 Nov 2012 08:45:23 -0800 (PST)


On Sunday, November 25, 2012 8:02:00 AM UTC-8, DeadMG wrote:
>
> This proposal constitutes a breaking change.
>
> Currently, there is no char8_t and no std::u8string, and the type of
> u8"hello" is a const char[N]. This, I believe, to be a fundamental flaw in
> the Standard's handling of UTF-8, rendering it almost entirely unusable. In
> *addition, *it renders virtually all of the existing narrow encoding
> infrastructure unusable in portable code.
>
> UTF-8 and narrow encodings are not interoperable, especially where the
> narrow encoding involves non-ASCII characters from the codepage system.
> However, they share a common type. This means that given
>
> void f(const char* p);
>
> The implementer of f cannot assume that p is either UTF-8 *or* narrow
> encoded, if he wishes to write portable code.
>

I think "portable" is the wrong word. It's not *type checked*. This is
portable so long as people pass the right data. Garbage in, garbage out; if
your API requires a properly formatted UTF-8 string, and the user doesn't
provide that... boom. It's user error, not a platform-specific issue.

What you're wanting is a *type* that will be self-documenting, so that they
don't have to look up in API documentation whether a function that takes a
`const char *` is taking a UTF-8 string or something else.

> This means that not only is UTF-8 rendered unusable, but the existing
> narrow encodings too, as if a UTF-8 literal was passed but is treated as a
> locale-based character encoding, it will be gibberish.
>

And yet, UTF-8 is widely used today. So how "unusable" can it really be?

> About the only thing he can rely on is the basic character set, and/or
> that virtually all current encodings are supersets of ASCII.
>
> Conceivably, he could add a parameter, perhaps from an enumeration,
> indicating what encoding the string is in. But this is not a very usable
> idea. Primarily, it is not type safe, and secondly, about ten minutes
> later, there will be ten thousand different enumerations for encodings,
> none of which will interoperate with each other. Thirdly, some functions,
> particularly converting constructors and virtually all operators, and all
> legacy code, cannot add additional parameters.
>

Small point: technically, legacy code can't add new type overloads either.

> Realistically, the burden of maintaining existing code will mean that
> nobody, ever, can use UTF-8, as it is impossible to deal with it safely,
> unless you are only using implementations which already used UTF-8 as the
> narrow encoding, in which case, there was no point in using a UTF-8 literal
> in the first place.
>

But people already do use UTF-8. It's probably the biggest argument against
char8_t (besides, you know, the breaking change).

There are innumerable interfaces which already exist that take `char*` that
*require* UTF-8 strings. These APIs are in wide, common usage. The vast
majority of those interfaces are not going to be upgraded or changed. Which
means that a `u8"string"` literal now has to be *cast* into a `char*` just
to use them with such APIs.

> The fix for this is simple: add char8_t as a distinct overloadable type,
> similar to char16_t and char32_t, change u8 literals and raw literals to be
> of this type, and introduce std::u8string and related typedefs and
> specializations to match std::u16string.
>

--


.

Author: Beman Dawes <bdawes@acm.org>
Date: Sun, 25 Nov 2012 11:56:43 -0500

On Sun, Nov 25, 2012 at 11:02 AM, DeadMG <wolfeinstein@gmail.com> wrote:
> This proposal constitutes a breaking change.
> ...

Remember, the committee won't even consider your "proposal" unless it
appears in a numbered committee document (or issues list) that appears
in a committee mailing. That's the only practical way to manage the
one to two hundred proposals and issues that the committee has to deal
with every meeting.

--Beman

--

.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Sun, 25 Nov 2012 09:09:34 -0800 (PST)



>
> I think "portable" is the wrong word. It's not *type checked*. This is
> portable so long as people pass the right data. Garbage in, garbage out; if
> your API requires a properly formatted UTF-8 string, and the user doesn't
> provide that... boom. It's user error, not a platform-specific issue.


I disagree. It's a platform-specific issue because some platforms already
use UTF-8 as their narrow encoding, in which case there is no problem. But
the capacity for f to silently take completely the wrong thing is not
something that should be ignored or permitted. The Standard could also
define every Standard API to take a void pointer and then just throw up
its hands and say "Well, your problem now." But it doesn't, because safety
is really very important, and not being able to catch what are effectively
type errors is a very bad thing.

>> This means that not only is UTF-8 rendered unusable, but the existing
>> narrow encodings too, as if a UTF-8 literal was passed but is treated as a
>> locale-based character encoding, it will be gibberish.
>
>
> And yet, UTF-8 is widely used today. So how "unusable" can it really be?
>

Unusable enough that virtually all UTF-8 interfaces exist primarily on
implementations where the narrow encoding is already UTF-8. Given that it's
impossible to write a portable one with a safe interface in the face of
UTF-8 string literals, the only safe options are to ban either narrow
or UTF-8 string literals, or to be non-portable.

>> About the only thing he can rely on is the basic character set, and/or that
>> virtually all current encodings are supersets of ASCII.
>>
>> Conceivably, he could add a parameter, perhaps from an enumeration,
>> indicating what encoding the string is in. But this is not a very usable
>> idea. Primarily, it is not type safe, and secondly, about ten minutes
>> later, there will be ten thousand different enumerations for encodings,
>> none of which will interoperate with each other. Thirdly, some functions,
>> particularly converting constructors and virtually all operators, and all
>> legacy code, cannot add additional parameters.
>>
>
> Small point: technically, legacy code can't add new type overloads either.
>

>> Realistically, the burden of maintaining existing code will mean that
>> nobody, ever, can use UTF-8, as it is impossible to deal with it safely,
>> unless you are only using implementations which already used UTF-8 as the
>> narrow encoding, in which case, there was no point in using a UTF-8 literal
>> in the first place.
>>
>
> But people already do use UTF-8. It's probably the biggest argument
> against char8_t (besides, you know, the breaking change).
>
> There are innumerable interfaces which already exist that take `char*`
> that *require* UTF-8 strings. These APIs are in wide, common usage. The
> vast majority of those interfaces are not going to be upgraded or changed.
> Which means that a `u8"string"` literal now has to be *cast* into a
> `char*` just to use them with such APIs.
>

Quite true. All of those interfaces originate from before UTF-8 literals
and did not have to deal with the problem. Those innumerable interfaces
almost exclusively exist on platforms where the narrow encoding is UTF-8
anyway, in which case there is no reason to use a UTF-8 literal; just use
a regular literal. In addition, it's a lot simpler and safer to add a
char8_t wrapper that casts back to char for some legacy code than to permit
all sorts of completely unsafe calls everywhere and render it *impossible*
to write safe interfaces. After all, there are innumerable interfaces that
expect char* but don't modify their strings, but we have const char
literals anyway. This is no different. Non-type-safety *should* be an
exception and it *should* be something you have to explicitly take care of.
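
A minimal sketch of such a wrapper, under stated assumptions: utf8_unit is again a hypothetical stand-in for char8_t, legacy_utf8_length stands for any existing API that is documented to require UTF-8 in a const char*, and as_legacy_chars is an invented name for the explicit escape hatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the proposed char8_t.
enum class utf8_unit : unsigned char {};

// Hypothetical legacy API: takes const char* but is documented to require
// UTF-8. Here it simply returns the length in bytes.
std::size_t legacy_utf8_length(const char* s) { return std::strlen(s); }

// The single, explicit place where type safety is given up. Reading the
// bytes through const char* is permitted because char may alias any object.
inline const char* as_legacy_chars(const utf8_unit* p) {
    return reinterpret_cast<const char*>(p);
}

void check_wrapper() {
    const utf8_unit text[] = {static_cast<utf8_unit>(0xC3),
                              static_cast<utf8_unit>(0xA9),
                              static_cast<utf8_unit>(0)};
    // legacy_utf8_length(text) would not compile; the cast must be spelled out.
    assert(legacy_utf8_length(as_legacy_chars(text)) == 2);
}
```

The unsafe step is then visible and searchable at each call site, in the same way a const_cast is for legacy interfaces that take non-const char*.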

I know about the proposal mechanism. It's a ... proposed proposal, as it
were. I just wanted to put that there to ensure that it's clear that I know
I'm talking about a breaking change.

--





.

Author: Olaf van der Spek <olafvdspek@gmail.com>
Date: Mon, 26 Nov 2012 05:53:37 -0800 (PST)


On Sunday, 25 November 2012 17:02:00 UTC+1, DeadMG wrote:

> This proposal constitutes a breaking change.

How about first proposing char8_t, u8string, etc. and maybe a new literal
type (u8b, u8new, u8temp, whatever), and then doing the breaking change in
a second proposal?

--





.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 26 Nov 2012 08:12:12 -0800 (PST)


On Monday, November 26, 2012 5:53:37 AM UTC-8, Olaf van der Spek wrote:
>
> On Sunday, 25 November 2012 17:02:00 UTC+1, DeadMG wrote:
>
>> This proposal constitutes a breaking change.
>
> How about first proposing char8_t, u8string etc and maybe a new literal
> type (u8b, u8new, u8temp whatever) and then in a second proposal doing the
> breaking change?
>

The problem is that the breaking change is the generation of char8_t from
`u8`. If you proposed some alternate prefix to `u8` that was not a breaking
change and it were accepted, why would they bother with the second
proposal? By accepting the first, they're saying that they don't mind the
odd syntax of having two UTF-8 literal prefixes.

It's better to have a single proposal that puts forth the idea, with a
possible alternative that isn't a breaking change mentioned in the paper.

--


.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 26 Nov 2012 08:23:41 -0800 (PST)


Because no non-breaking change can solve the problem, which is that there
is no way to write a safe, portable function which deals in narrow-encoded
or UTF-8-encoded code units. std::u8string is just a minor footnote for
consistency; it's really quite irrelevant.

As long as u8"" as const char[] exists, portable safety for these two
encodings will be impossible.

--





.

Author: Olaf van der Spek <olafvdspek@gmail.com>
Date: Mon, 26 Nov 2012 17:52:17 +0100

On Mon, Nov 26, 2012 at 5:23 PM, DeadMG <wolfeinstein@gmail.com> wrote:
> Because no non-breaking change can solve the problem, which is that there is
> no way to write a safe, portable function which deals in narrow encoded or
> UTF-8 encoded codeunits.

The only breaking part is u8"", right? If that's deprecated and a new
proper u8b"" is available, you're half-way to solving it.

> std::u8string is just a minor footnote for
> consistency- it's really quite irrelevant.
>
> As long as u8"" as const char[] exists, portable safety for these two
> encodings will be impossible.

Not all strings come from literals.

--
Olaf

--




.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 26 Nov 2012 11:11:57 -0800 (PST)


Half-way is no-way. Either you can make the guarantee or you can't- there's
little half about anything.

> Not all strings come from literals.
>

That is fairly immaterial, really. A string that reached you from another
source but originated as a literal is just as broken. More to the point,
you can't stop it coming from a literal, so you still have to deal with
them.

If you cannot know the encoding of text, then it will be impossible to deal
with that text.

--





.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 26 Nov 2012 12:32:49 -0800 (PST)




On Monday, November 26, 2012 11:11:57 AM UTC-8, DeadMG wrote:
>
> Half-way is no-way. Either you can make the guarantee or you can't-
> there's little half about anything.
>
> Not all strings come from literals.
>>
>
> Is fairly immaterial, really. A string from another source that came from
> a literal is just as broken. More to the point, you can't stop it coming
> from a literal, so you still have to deal with them.
>
> If you cannot know the encoding of text, then it will be impossible to
> deal with that text.
>

You don't need a type to know the encoding of the string. It's nice,
granted, but it's hardly *necessary*. There are many APIs that don't do
type-safe checking of arguments to make sure you haven't passed improper
values. This is common and standard for many things. You don't see
interfaces that only accept positive integers use a type to prevent you
from passing negative integers. That's no different from this.

Garbage in, garbage out. I don't see this as a particularly onerous problem
that requires a breaking change. As long as there is a `char8_t` type, does
it matter if it *implicitly* converts to a `char`?

--





------=_Part_208_4863701.1353961969833--

.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 26 Nov 2012 13:51:21 -0800 (PST) Raw View

------=_Part_955_31988097.1353966681856
Content-Type: text/plain; charset=ISO-8859-1

It's quite different. Firstly, you can check the integer case at run-time.
You can just slap in a quick assert and fail if somebody passes a
non-positive integer. Secondly, we need two totally different overloads for
char and char8_t, which is the root of the problem. It's not a question of
bad run-time values, it's a question of needing the correct type. Oh, and
thirdly, I do see quite a bit of use of unsigned integers in this fashion.

Violating the type system should be explicit. If you had a function that
takes char and not char8_t and you try to pass a UTF-8 literal, then the
compilation *should* fail. An implicit conversion would be a start, I
guess, and would solve some of the immediate issues like not being able to
overload, but it would still be bad.
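
To make the overloading point concrete, here is a minimal sketch. It is an illustration only, not proposed wording: `utf8_unit` is a hypothetical stand-in for char8_t, and `require_positive` and `encoding_of` are made-up names.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the proposed char8_t: a distinct 8-bit type
// with no implicit conversion to or from char.
enum class utf8_unit : unsigned char {};

// A value constraint such as "must be positive" can be checked at run-time.
int require_positive(int n) {
    assert(n > 0);
    return n;
}

// No assert can recover which encoding a const char* holds, but with a
// distinct type overload resolution picks the right function at compile time.
std::string encoding_of(const char*) { return "narrow"; }
std::string encoding_of(const utf8_unit*) { return "utf-8"; }
```

With both overloads in scope, `encoding_of("abc")` selects the narrow version, while passing a `const utf8_unit*` selects the UTF-8 one. Passing the wrong pointer type to a function that has only one of the two overloads would not compile.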

--





------=_Part_955_31988097.1353966681856--

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 26 Nov 2012 14:19:45 -0800 (PST) Raw View

------=_Part_335_6888274.1353968385724
Content-Type: text/plain; charset=ISO-8859-1



On Monday, November 26, 2012 1:51:22 PM UTC-8, DeadMG wrote:
>
> It's quite different. Firstly, you can check this at run-time. You can
> just slap in a quick assert and if somebody passes a non-positive integer,
> fail. Secondly, we need two totally different overloads for char and
> char8_t- which is the root of the problem. It's not a question of bad
> run-time values, it's a question of needing the correct type. Oh, and
> thirdly, I do see quite a bit of use of unsigned integers in this fashion.


That's not a good idea. The rules of C++ require that this is not a
compiler error:

void foo(unsigned int val);

foo(-45);

The best you get is a warning, which is non-binding.
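
As a rough illustration of what that call actually does, here is a small sketch. `echo` is a hypothetical counterpart of `foo` that returns its argument so the converted value can be inspected.

```cpp
#include <cassert>
#include <climits>

// Hypothetical counterpart of foo above that returns what it received.
unsigned int echo(unsigned int val) { return val; }

void negative_argument_demo() {
    // Compiles without error: -45 is converted modulo 2^N, where N is the
    // number of bits in unsigned int, so it arrives as UINT_MAX - 44.
    assert(echo(-45) == UINT_MAX - 44u);
}
```

The unsigned parameter type documents intent, but it does not reject the negative argument.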

> Violating the type system should be explicit. If you had a function that
> takes char and not char8_t and you try to pass a UTF-8 literal, then the
> compilation *should* fail.
>

Why? If the function accepts UTF-8 strings via char* (like most code that
accepts UTF-8 strings), there's no reason for compilation to fail. You're
doing something perfectly valid and legitimate. And there's certainly no
reason for compilation that *used* to work to fail after a compiler upgrade.

> An implicit conversion would be a start, I guess, and solve some of the
> immediate issues like not being able to overload, but it would still be bad.
>

That's about the best you're going to get, unless you can provide evidence
that a breaking change won't affect a lot of code.

At least compilers can offer warnings, much like the int->unsigned int case.

--





------=_Part_335_6888274.1353968385724--

.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Mon, 26 Nov 2012 15:14:49 -0800 (PST) Raw View

------=_Part_803_25198155.1353971689809
Content-Type: text/plain; charset=ISO-8859-1


>
> You're doing something perfectly valid and legitimate.


But that's the problem. It's not really valid and legitimate at all. The
only safety you have is that on your most likely implementation, narrow
strings are in fact UTF-8. That's fine for people on those implementations,
but the Standard has to cater to all platforms. People who use such APIs
can keep relying on the same thing they did before: the fact that their
implementation defines narrow strings to be UTF-8. There's no actual reason
for them to use UTF-8 literals in the first place, so they are the least
likely to be affected by any potential change.

A portable UTF-8 string is not char, it is char8_t. People who used
non-portable stuff before can just keep right on doing that in the future.

--





------=_Part_803_25198155.1353971689809--

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 26 Nov 2012 16:37:48 -0800 (PST) Raw View

------=_Part_385_5278084.1353976668171
Content-Type: text/plain; charset=ISO-8859-1



On Monday, November 26, 2012 3:14:50 PM UTC-8, DeadMG wrote:
>> You're doing something perfectly valid and legitimate.
>
> But that's the problem. It's not really valid and legitimate at all. The
> only safety you have is that on your most likely implementation, narrow
> strings are in fact UTF-8.
>

I think you kinda missed what you said. You said, "If you had a function
that takes char and not char8_t and you try to pass a UTF-8 literal, then
the compilation should fail." My point is that there are (many) functions
that take char* and *accept UTF-8 strings*. Those functions will fail if
the strings are *not* UTF-8.

My point is that whether the function takes narrow-encodings or UTF-8 is up
to the *implementation* of the function, not the argument's *type*. That is
how C++ has functioned since UTF-8 came into being, and it's not going to
suddenly stop just because there are char8_t*'s around.

If you pass a non-UTF-8 string to a function that takes UTF-8 strings, you
get broken behavior. Whether it's a narrow string, Latin-1 or anything
else; it's not a valid UTF-8 string, so you get brokenness.

What you're suggesting is that now, every time I want to provide a string
to, for example, LibXML2, I have to do a cast. *For no reason whatsoever*.
LibXML2's interface only works with UTF-8 strings; it breaks on anything
else. What you're suggesting only provides safety if everyone in the world
magically converts to this char8_t the instant it becomes available.


> That's fine for them, but the Standard has to cater to all platforms.
> Those people who use such APIs can just use the same thing they did before-
> the fact that their implementation defines narrow strings to be UTF-8.
> There's no actual reason for them to use UTF-8 literals in the first place,
> so they are the least likely to be affected by any potential change.
>

No actual reason? What about, I don't know, *getting a UTF-8 literal?* See?

const char *utf8_literal =
    u8"This is a UTF-8 literal with a random \u00B1 symbol in it";

It has consistent results across all platforms (that support the syntax).
If you pass that to an API that takes UTF-8 strings, it will work. If you
pass that to an API that doesn't take UTF-8 strings, it won't.
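
To show what "consistent results" means in bytes, here is a small sketch. `byte_at` is a hypothetical helper, written as a template so that it compiles whether the literal's element type is char (as it is today) or a distinct char8_t-like type.

```cpp
#include <cassert>
#include <cstddef>

// Returns the code unit at index i of a string literal as an unsigned value.
// Hypothetical helper; templated on the element type so it does not depend
// on whether u8 literals are arrays of char or of some other 8-bit type.
template <typename CharT, std::size_t N>
unsigned int byte_at(const CharT (&literal)[N], std::size_t i) {
    assert(i < N);
    return static_cast<unsigned char>(literal[i]);
}
```

For `u8"\u00B1"`, `byte_at` yields 0xC2 at index 0, 0xB1 at index 1, and the terminating 0 at index 2, whatever the platform's narrow encoding is. A plain `"\u00B1"` makes no such promise.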

You're so wedded to type-safety that you can't see that everything can work
just fine without it. Yes, it isn't typed, so you won't get a compiler
error for doing the wrong thing. And that's unfortunate. But that's *far*
from saying that there's no point in creating UTF-8 literals.

That's like saying that you need a different type for 2D vectors just
because some are used with bottom-left conventions while others are used
with top-left conventions. It's ridiculous.


> A portable UTF-8 string is not char, it is char8_t. People who used
> non-portable stuff before can just keep right on doing that in the future.
>

--





------=_Part_385_5278084.1353976668171--

.

Author: stackmachine@hotmail.com
Date: Tue, 27 Nov 2012 02:36:39 -0800 (PST) Raw View

------=_Part_1052_6666425.1354012599777
Content-Type: text/plain; charset=ISO-8859-1

Why not just do something similar to the conversion from string literals to
char*?
That is:

   - Introduce the type char8_t, and let literals with a u8 prefix be of type
   char8_t const[N]
   - Allow implicit conversions to char const*, but deprecate them

That way you don't break existing code, and you can remove the conversion
later.
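
A library-level approximation of the idea, as a sketch only: the real change would be in the core language, and `utf8_ptr` and `as_narrow` are hypothetical names invented for this illustration. It assumes a compiler that understands the `[[deprecated]]` attribute.

```cpp
#include <cassert>

// Hypothetical stand-in for "pointer to char8_t const".
class utf8_ptr {
public:
    explicit utf8_ptr(const char* bytes) : bytes_(bytes) {
        assert(bytes != nullptr);
    }

    // Explicit path: the spelling that would survive once the implicit
    // conversion is removed.
    const char* as_narrow() const { return bytes_; }

    // Implicit path: existing code keeps compiling, but each use can be
    // diagnosed as deprecated.
    [[deprecated("implicit UTF-8 to narrow conversion")]]
    operator const char*() const { return bytes_; }

private:
    const char* bytes_;
};
```

Code that passes a `utf8_ptr` where a `const char*` is expected still builds, with a deprecation warning pointing at the call site. Code that calls `as_narrow()` builds silently.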

--





------=_Part_1052_6666425.1354012599777--

.

Author: DeadMG <wolfeinstein@gmail.com>
Date: Tue, 27 Nov 2012 03:26:32 -0800 (PST) Raw View

------=_Part_268_22908911.1354015592428
Content-Type: text/plain; charset=ISO-8859-1


>
> That is how C++ has functioned since UTF-8 came into being


No. That's how *some implementations* have done it since UTF-8 came into
being. Not all of them. Anyone who called LibXML2 with a string literal on
Windows would get a nasty surprise.

> What you're suggesting is that now, every time I want to provide a string
> to, for example, LibXML2, I have to do a cast.


No. If you're on a platform that has UTF-8 narrow literals, then *just use
a narrow literal, exactly like before*. If you're not, then cast, and go
shout at the authors of LibXML2 (and did I mention the major compiler which
doesn't use UTF-8 narrow literals also doesn't support UTF-8 literals?).

> No actual reason? What about, I don't know, *getting a UTF-8 literal?* See?


The only platforms that currently support it already have UTF-8 narrow
literals, and there's no reason to use a UTF-8 literal over a narrow
literal when they both have the same encoding, because they also both have
the same type. That is, they are exactly the same in every respect, except
that for one, you put a u8 in front.

--





------=_Part_268_22908911.1354015592428--

.

Author: rick@longbowgames.com
Date: Tue, 27 Nov 2012 08:51:41 -0800 (PST) Raw View

------=_Part_1082_31432323.1354035101211
Content-Type: text/plain; charset=ISO-8859-1

On Tuesday, November 27, 2012 6:26:32 AM UTC-5, DeadMG wrote:
>
> If you're not, then cast, and go shout at the authors of LibXML2

LibXML2 is a C library. You're unlikely to convince them to accept char8_t.

--


------=_Part_1082_31432323.1354035101211--

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 27 Nov 2012 09:26:23 -0800 (PST) Raw View

------=_Part_9_815582.1354037183443
Content-Type: text/plain; charset=ISO-8859-1



On Tuesday, November 27, 2012 3:26:32 AM UTC-8, DeadMG wrote:
>> That is how C++ has functioned since UTF-8 came into being
>
> No. That's how *some implementations* have done it since UTF-8 came into
> being. Not all of them. Anyone who called LibXML2 with a string literal on
> Windows would get a nasty surprise.
>

I'll assume that by "on Windows" you mean "on Visual Studio" (GCC on
Windows uses UTF-8 narrow characters). First, VS's narrow character set
supports ASCII, and ASCII text is by definition valid UTF-8. So there will
be no "nasty surprise" waiting unless they use characters outside of ASCII.
Second, there are ways to feed VS UTF-8 formatted files and have it accept
literals that are formatted as such.

Third and most importantly, if you are expecting platform-neutral behavior
from a platform-specific construct (i.e., the encoding of a narrow literal),
then you *deserve* a "nasty surprise" when you cross platforms. Fourth, see
below:


>> What you're suggesting is that now, every time I want to provide a string
>> to, for example, LibXML2, I have to do a cast.
>
>
> No. If you're on a platform that has UTF-8 narrow literals, then *just
> use a narrow literal, exactly like before*.
>

That's patently ridiculous. The whole point of `u8` is to have a literal
that will be encoded in UTF-8 *across all platforms.* That way, you won't
have to care what your platform-specific encoding is. This will always
work, regardless of platform:

auto writer = xmlNewTextWriterFilename(u8"Some UTF-8 literal", 0);

I should not have to do this:

auto writer = xmlNewTextWriterFilename((const char*)u8"Some UTF-8 literal",
                                       0);
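
If such a cast did become necessary, it could at least be confined to one place. A rough sketch under stated assumptions: `c_api_utf8_length` is a hypothetical stand-in for a C function documented to take UTF-8 through `const char*` (it is not a LibXML2 function), and `as_c_string` is a made-up helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a C API that expects UTF-8 in a const char*.
std::size_t c_api_utf8_length(const char* utf8) { return std::strlen(utf8); }

// Hypothetical helper: views any 8-bit code-unit string as const char*.
// It compiles whether u8 literals are arrays of char or of a distinct type.
template <typename CharT>
const char* as_c_string(const CharT* text) {
    static_assert(sizeof(CharT) == 1, "only 8-bit code units are supported");
    assert(text != nullptr);
    return reinterpret_cast<const char*>(text);
}
```

A call then reads `c_api_utf8_length(as_c_string(u8"Some UTF-8 literal"))`. That is still more typing than passing the literal directly, which is the objection being made here.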


> If you're not, then cast, and go shout at the authors of LibXML2 (and did
> I mention the major compiler which doesn't use UTF-8 narrow literals also
> doesn't support UTF-8 literals?).
>

>> No actual reason? What about, I don't know, *getting a UTF-8 literal?*
>> See?
>
>
> The only platforms that currently support it already have UTF-8 narrow
> literals,
>

Unless you have knowledge that Microsoft *never* intends to support UTF-8
literals, the fact that they don't support them *yet* is irrelevant. They
will support them eventually, and when they do, it would be good to be able
to actually *use* them to get UTF-8 text in a platform-neutral way.

> and there's no reason to use a UTF-8 literal over a narrow literal when
> they both have the same encoding, because they also both have the same
> type- i.e., they are exactly the same in every respect, except that for
> one, you put a u8 in front.
>

--




------=_Part_9_815582.1354037183443
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<br><br>On Tuesday, November 27, 2012 3:26:32 AM UTC-8, DeadMG wrote:<block=
quote class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-le=
ft: 1px #ccc solid;padding-left: 1ex;"><blockquote class=3D"gmail_quote" st=
yle=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb=
(204,204,204);border-left-style:solid;padding-left:1ex">That is how C++ has=
 functioned since UTF-8 came into being</blockquote><div><br></div><div>No.=
 That's how <i>some implementations</i>&nbsp;have done it since UTF-8 came =
into being. Not all of them. Anyone who called LibXML2 with a string litera=
l on Windows would get a nasty surprise.</div></blockquote><div><br>I'll as=
sume that by "on Windows" you mean "on Visual Studio" (GCC on Windows uses =
UTF-8 narrow characters). Well first of all, VS's narrow character set supp=
orts ASCII, which is by definition UTF-8. So there will be no "nasty surpri=
se" waiting unless they use characters outside of ASCII. Secondly, there ar=
e ways to feed VS UTF-8 formatted files and have it accept literals that ar=
e formatted as such.<br><br>Third and most importantly, if you are expectin=
g platform-neutral behavior on a platform-specific construct (ie: the encod=
ing of a narrow literal), then you <i>deserve</i> a "nasty surprise" when y=
ou cross platforms. Fourth, see below:<br><br></div><blockquote class=3D"gm=
ail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc soli=
d;padding-left: 1ex;"><div><br></div><blockquote class=3D"gmail_quote" styl=
e=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(2=
04,204,204);border-left-style:solid;padding-left:1ex">What you're suggestin=
g is that now, every time I want to provide a string to, for example, LibXM=
L2, I have to do a cast.</blockquote><div><br></div><div>No. If you're on a=
 platform that has UTF-8 narrow literals, then <i>just use a narrow literal=
, exactly like before</i>.</div></blockquote><div><br>That's patently ridic=
ulous. The whole point of `u8` is to have a literal that will be encoded in=
 UTF-8 <i>across all platforms.</i> That way, you won't have to care what y=
our platform-specific encoding is. This will always work, regardless of pla=
tform:<br><br><div class=3D"prettyprint" style=3D"background-color: rgb(250=
, 250, 250); border-color: rgb(187, 187, 187); border-style: solid; border-=
width: 1px; word-wrap: break-word;"><code class=3D"prettyprint"><div class=
=3D"subprettyprint"><span style=3D"color: #008;" class=3D"styled-by-prettif=
y">auto</span><span style=3D"color: #000;" class=3D"styled-by-prettify"> wr=
iter </span><span style=3D"color: #660;" class=3D"styled-by-prettify">=3D</=
span><span style=3D"color: #000;" class=3D"styled-by-prettify"> xmlNewTextW=
riterFilename</span><span style=3D"color: #660;" class=3D"styled-by-prettif=
y">(</span><span style=3D"color: #000;" class=3D"styled-by-prettify">u8</sp=
an><span style=3D"color: #080;" class=3D"styled-by-prettify">"Some UTF-8 li=
teral"</span><span style=3D"color: #660;" class=3D"styled-by-prettify">,</s=
pan><span style=3D"color: #000;" class=3D"styled-by-prettify"> </span><span=
 style=3D"color: #066;" class=3D"styled-by-prettify">0</span><span style=3D=
"color: #660;" class=3D"styled-by-prettify">);</span></div></code></div><br=
>I should not have to do this:<br><br><div class=3D"prettyprint" style=3D"b=
ackground-color: rgb(250, 250, 250); border-color: rgb(187, 187, 187); bord=
er-style: solid; border-width: 1px; word-wrap: break-word;"><code class=3D"=
prettyprint"><div class=3D"subprettyprint"><span style=3D"color: #008;" cla=
ss=3D"styled-by-prettify">auto</span><span style=3D"color: #000;" class=3D"=
styled-by-prettify"> writer </span><span style=3D"color: #660;" class=3D"st=
yled-by-prettify">=3D</span><span style=3D"color: #000;" class=3D"styled-by=
-prettify"> xmlNewTextWriterFilename</span><span style=3D"color: #660;" cla=
ss=3D"styled-by-prettify">((</span><span style=3D"color: #008;" class=3D"st=
yled-by-prettify">const</span><span style=3D"color: #000;" class=3D"styled-=
by-prettify"> </span><span style=3D"color: #008;" class=3D"styled-by-pretti=
fy">char</span><span style=3D"color: #660;" class=3D"styled-by-prettify">*)=
</span><span style=3D"color: #000;" class=3D"styled-by-prettify">u8</span><=
span style=3D"color: #080;" class=3D"styled-by-prettify">"Some UTF-8 litera=
l"</span><span style=3D"color: #660;" class=3D"styled-by-prettify">,</span>=
<span style=3D"color: #000;" class=3D"styled-by-prettify"> </span><span sty=
le=3D"color: #066;" class=3D"styled-by-prettify">0</span><span style=3D"col=
or: #660;" class=3D"styled-by-prettify">);</span></div></code></div>&nbsp;<=
/div><blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8e=
x;border-left: 1px #ccc solid;padding-left: 1ex;"><div> If you're not, then=
 cast, and go shout at the authors of LibXML2 (and did I mention the major =
compiler which doesn't use UTF-8 narrow literals also doesn't support UTF-8=
 literals?).<br></div></blockquote><blockquote class=3D"gmail_quote" style=
=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: =
1ex;"><div><br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px =
0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);bord=
er-left-style:solid;padding-left:1ex">No actual reason? What about, I don't=
 know,&nbsp;<i>getting a UTF-8 literal?</i>&nbsp;See?</blockquote><div><br>=
</div><div>The only platforms that currently support it already have UTF-8 =
narrow literals,</div></blockquote><div><br>Unless you have knowledge that =
Microsoft <i>never</i> intends to support UTF-8 literals, the fact that the=
y don't support them <i>yet</i> is irrelevant. They will support them event=
ually, and when they do, it would be good to be able to actually <i>use</i>=
 it to get UTF-8 literals in a platform-neutral way.<br><br></div><blockquo=
te class=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left:=
 1px #ccc solid;padding-left: 1ex;"><div> and there's no reason to use a UT=
F-8 literal over a narrow literal when they both have the same encoding, be=
cause they also both have the same type- i.e., they are exactly the same in=
 every respect, except that for one, you put a u8 in front.</div></blockquo=
te>

------=_Part_9_815582.1354037183443--

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 27 Nov 2012 09:29:36 -0800 (PST)

------=_Part_115_30299606.1354037376052
Content-Type: text/plain; charset=ISO-8859-1



On Tuesday, November 27, 2012 2:36:40 AM UTC-8, stackm...@hotmail.com wrote:
>
> Why not just do something similar to conversions from string literals to
> char*?
> That is:
>
>    - Introduce the type char8_t, let literals with a u8 prefix be of type
>    char8_t const[N]
>    - Allow implicit conversions to char const*, but deprecate them
>
> That way you don't break existing code, and you can remove it later.
>
Why would we want to remove it later?

------=_Part_115_30299606.1354037376052--

.

Author: stackmachine@hotmail.com
Date: Tue, 27 Nov 2012 23:33:04 -0800 (PST)

------=_Part_42_24685502.1354087984802
Content-Type: text/plain; charset=ISO-8859-1

On Tuesday, 27 November 2012 18:29:36 UTC+1, Nicol Bolas wrote:
>
>
>
> On Tuesday, November 27, 2012 2:36:40 AM UTC-8, stackm...@hotmail.com wrote:
>>
>> Why not just do something similar to conversions from string literals to
>> char*?
>> That is:
>>
>>    - Introduce the type char8_t, let literals with a u8 prefix be of
>>    type char8_t const[N]
>>    - Allow implicit conversions to char const*, but deprecate them
>>
>> That way you don't break existing code, and you can remove it later.
>>
> Why would we want to remove it later?
>

It sounded like people here think it was a mistake in the first place.
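
To make the suggestion concrete, here is a rough sketch of how the transition
could look. It is only an illustration, not proposed wording: utf8_literal,
the _u8 suffix, legacy_length and describe are all hypothetical names, and the
wrapper class merely stands in for "char8_t const[N]". It also assumes the
literal content is ASCII, so the bytes are the same in any ASCII-compatible
narrow encoding.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for "char8_t const[N]": a distinct type wrapping
// UTF-8 bytes, so overloads can tell UTF-8 apart from narrow strings.
class utf8_literal {
public:
    utf8_literal(const char* bytes, std::size_t size)
        : bytes_(bytes), size_(size) {}

    const char* bytes() const { return bytes_; }
    std::size_t size() const { return size_; }

    // Transitional implicit conversion to char const*. This is the part the
    // suggestion would deprecate now and remove in a later standard.
    operator const char*() const { return bytes_; }

private:
    const char* bytes_;
    std::size_t size_;
};

// Hypothetical suffix standing in for the u8 prefix (assumes ASCII content).
inline utf8_literal operator""_u8(const char* s, std::size_t n) {
    return utf8_literal(s, n);
}

// Legacy interface that only knows char const*; still callable for now.
inline std::size_t legacy_length(const char* p) { return std::strlen(p); }

// New code can overload on the distinct type and know the encoding.
inline const char* describe(const char*) { return "narrow"; }
inline const char* describe(utf8_literal) { return "utf-8"; }
```

With this in place, legacy_length("abc"_u8) still compiles through the
conversion, describe("abc"_u8) picks the UTF-8 overload, and describe("abc")
picks the narrow one. Once the conversion is removed, only the legacy call
stops compiling.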

------=_Part_42_24685502.1354087984802--

.