Topic: Distinct type of array elements in UTF-8 string


Author: me@maxtruxa.com
Date: Sat, 6 Jun 2015 06:31:10 -0700 (PDT)
Raw View
------=_Part_133_1111759971.1433597470557
Content-Type: multipart/alternative;
 boundary="----=_Part_134_94620050.1433597470557"

------=_Part_134_94620050.1433597470557
Content-Type: text/plain; charset=UTF-8

This was originally posted to std-discussion (
https://groups.google.com/a/isocpp.org/d/topic/std-discussion/ZxDeh7RkhKU/discussion)
but it was suggested that std-proposals would be more appropriate.

References:
https://groups.google.com/a/isocpp.org/d/topic/std-discussion/jGr2bZXWntc/discussion
https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion

Since the discussions referenced above didn't come to any useful
conclusion (at least I don't see one) and have been inactive for some time
now, I would like to revisit the pros and cons of adding a distinct code
unit type for UTF-8 string literals.
These other discussions also went somewhat off-topic when they began
discussing pros/cons of the various Unicode CEFs/CESes for APIs and
transport/storage of string data. Needless to say, an additional
verification/encoding detection step is needed when e.g. reading text
from a file.
This discussion is *purely* about string literals and their practical usage
in code.

Since C++11 there are the following character/string literal types:
- char / char* for narrow execution character set *OR* UTF-8
- wchar_t / wchar_t* for wide execution character set
- char16_t / char16_t* for UTF-16
- char32_t / char32_t* for UTF-32
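
To make the list above concrete, here is a small sketch that checks the
code unit type of each literal kind at compile time. The alias code_unit_t
is a hypothetical helper introduced only for this illustration. The last
check assumes a compiler mode in which u8 literals are arrays of char, as
described in this post, so it is guarded by the __cpp_char8_t feature-test
macro.

```cpp
#include <type_traits>

// Hypothetical helper: strips the reference, the array extent and the
// cv-qualifiers from the type of a string literal, leaving its code unit type.
template <typename Literal>
using code_unit_t = typename std::remove_cv<
    typename std::remove_extent<
        typename std::remove_reference<Literal>::type>::type>::type;

static_assert(std::is_same<code_unit_t<decltype("")>, char>::value, "narrow");
static_assert(std::is_same<code_unit_t<decltype(L"")>, wchar_t>::value, "wide");
static_assert(std::is_same<code_unit_t<decltype(u"")>, char16_t>::value, "UTF-16");
static_assert(std::is_same<code_unit_t<decltype(U"")>, char32_t>::value, "UTF-32");

#ifndef __cpp_char8_t
// Assumption: u8 literals are arrays of const char, which makes them
// indistinguishable from narrow literals at the type level.
static_assert(std::is_same<decltype(u8""), decltype("")>::value,
              "UTF-8 and narrow literals share one type");
#endif
```

Only the u8 prefix fails to produce a type of its own, which is the root of
both problems discussed in this post.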

As it currently stands there exist some problems with UTF-8 string literals
that are not present with UTF-16/UTF-32 string literals.

a) String literals encoded using the narrow execution charset and string
literals encoded using UTF-8 can be mixed (by accident).
b) [Inherited from a)] Overload resolution dependent on the (implicit)
encoding type is impossible.


Regarding a):

Pointers to char16_t and char32_t (and therefore UTF-16/UTF-32 string
literals) can't be mixed with pointers to wchar_t/int16_t/uint16_t/int32_t/
uint32_t by accident. Doing so would require explicit casting via
reinterpret_cast (not static_cast).
Sadly, the same can't be said for UTF-8 string literals (see the code
example below).

I would argue that code that mixes strings with different encodings (for
example passing a UTF-8 encoded string to an API that expects a string
encoded using the execution charset) is inherently broken and needs to be
fixed.
Too bad this can't be checked at compile time currently.
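
A minimal sketch of this asymmetry, expressed with std::is_convertible.
The function narrow_length is a hypothetical API that expects text in the
narrow execution charset, and accidental_mix is likewise made up for the
example. The accidental mix only compiles under the assumption that u8
literals are arrays of char, hence the feature-test guard.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// UTF-16/UTF-32 code unit pointers do not convert implicitly to pointers to
// other character or integer types, even ones of the same size.
static_assert(!std::is_convertible<const char16_t*, const wchar_t*>::value, "");
static_assert(!std::is_convertible<const char16_t*, const std::uint16_t*>::value, "");
static_assert(!std::is_convertible<const char32_t*, const std::uint32_t*>::value, "");

// Hypothetical API that expects text in the narrow execution charset.
inline std::size_t narrow_length(const char* text) { return std::strlen(text); }

#ifndef __cpp_char8_t
// Assumption: u8 literals have type const char[N]. The call compiles without
// any diagnostic although the callee expects a different encoding.
inline std::size_t accidental_mix() {
    return narrow_length(u8"\u00e4");  // U+00E4 is two UTF-8 code units
}
#endif
```

Nothing in the signature of narrow_length can reject the UTF-8 argument,
whereas passing u"..." or U"..." to it would be a compile-time error.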


Regarding b):

At the moment there is no way to differentiate between u8"" (UTF-8) and "" (narrow
execution charset) at compile time (and even at runtime it is quite hard to
do so correctly, if not impossible).

See for example the following code:

#include <iostream>

void f(char const*) { std::cout << "narrow\n"; }
void f(wchar_t const*) { std::cout << "wide\n"; }
//void f(??? const*) { std::cout << "UTF-8\n"; } // No distinct type for UTF-8 string literals.
void f(char16_t const*) { std::cout << "UTF-16\n"; }
void f(char32_t const*) { std::cout << "UTF-32\n"; }

int main() {
    f("");
    f(L"");
    f(u8""); // How are we supposed to invoke the UTF-8 overload?
    f(u"");
    f(U"");
    return 0;
}

[online demonstration using Ideone <https://ideone.com/piIuyn>]

To invoke a UTF-8 aware overload the only possibility right now is to try
to detect the encoding at runtime and dispatch manually like so:

void f_narrow(char const*) { std::cout << "narrow\n"; }
void f_utf8(char const*) { std::cout << "UTF-8\n"; }

// is_utf8: Imaginary function that returns true if the passed-in string is
// UTF-8 encoded or false otherwise.
bool is_utf8(char const*);

void f(char const* x) {
    if (is_utf8(x))
        f_utf8(x);
    else
        f_narrow(x);
}

Obviously this approach is ugly and slow.
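
For illustration, the imaginary is_utf8 could be approximated by a
well-formedness check like the sketch below (the name is_well_formed_utf8
is hypothetical and not part of any library). It also shows why runtime
detection cannot be reliable: a pure ASCII string is well-formed UTF-8 too,
assuming an ASCII-compatible narrow execution charset, so a positive result
says nothing about the encoding the caller intended.

```cpp
#include <cstdint>

// Hypothetical helper: returns true if the NUL-terminated byte string is
// well-formed UTF-8 (no overlong forms, surrogates or values above U+10FFFF).
inline bool is_well_formed_utf8(const char* text) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    while (*p != 0) {
        const unsigned char lead = *p++;
        if (lead < 0x80) continue;  // single-byte (ASCII) sequence

        int trailing = 0;             // number of continuation bytes expected
        std::uint32_t code_point = 0; // value decoded so far
        std::uint32_t smallest = 0;   // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { trailing = 1; code_point = lead & 0x1F; smallest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { trailing = 2; code_point = lead & 0x0F; smallest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { trailing = 3; code_point = lead & 0x07; smallest = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        for (int i = 0; i < trailing; ++i) {
            // The NUL terminator fails this test as well, so truncated
            // sequences are rejected without reading past the end.
            if ((*p & 0xC0) != 0x80) return false;
            code_point = (code_point << 6) | (*p++ & 0x3Fu);
        }
        if (code_point < smallest) return false;                       // overlong
        if (code_point > 0x10FFFF) return false;                       // out of range
        if (code_point >= 0xD800 && code_point <= 0xDFFF) return false; // surrogate
    }
    return true;
}
```

Every call to f would have to scan the whole string this way, and the
answer for strings such as "plain ASCII" is still ambiguous.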

char16_t and char32_t on the other hand work quite well, since they are
distinct types which can be used for overload resolution.


A possible solution:

Adding a new type (e.g. char8_t to be consistent with the other type names)
and specifying that UTF-8 string literals use this type would solve these
problems and make the behavior consistent with UTF-16/UTF-32 string
literals.
In a perfect world this new type would *not* be implicitly convertible to
char, to prevent a). On the other hand such an incompatible change would
break existing code that relies on std::is_same<decltype(""),
decltype(u8"")>::value == true. To prevent breaking said code an implicit
conversion could be defined, leaving a) in its current state but at least
fixing b).
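
To give a rough idea of what a distinct code unit type buys for overload
resolution, here is a sketch that uses a scoped enumeration as a stand-in.
The name char8 and the array utf8_hi are hypothetical, and since no literal
prefix can produce this type, the code units are spelled out by hand. The
overloads return strings instead of printing so the result is easy to check.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for the proposed code unit type. A scoped enumeration
// is a distinct type and does not convert implicitly to or from char.
enum class char8 : unsigned char {};

inline std::string f(const char*)     { return "narrow"; }
inline std::string f(const char8*)    { return "UTF-8"; }
inline std::string f(const char16_t*) { return "UTF-16"; }
inline std::string f(const char32_t*) { return "UTF-32"; }

// The code units of "hi" (0x68, 0x69) followed by a terminating zero.
const char8 utf8_hi[] = {
    static_cast<char8>(0x68), static_cast<char8>(0x69), char8()
};

// Pointers to the stand-in type cannot be passed where const char* is expected.
static_assert(!std::is_convertible<const char8*, const char*>::value,
              "no accidental mixing with narrow strings");
```

With such a type, f(utf8_hi) selects the UTF-8 overload at compile time,
just like f(u"hi") and f(U"hi") select theirs.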

Any thoughts?

-- Max Truxa


Addendum:

As David Krauss pointed out in the old thread at std-discussion, one more
inherent problem with the usage of char as UTF-8 code unit is that it may
be signed (the signedness of char is implementation-defined), while
char16_t/char32_t are always unsigned.
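
A short sketch of what this means in practice. The helper is_non_ascii is
hypothetical; the point is only that code handling UTF-8 in char has to
convert to unsigned char before comparing code unit values.

```cpp
#include <limits>

// The UTF-16/UTF-32 code unit types are guaranteed to be unsigned.
static_assert(!std::numeric_limits<char16_t>::is_signed, "char16_t is unsigned");
static_assert(!std::numeric_limits<char32_t>::is_signed, "char32_t is unsigned");
// std::numeric_limits<char>::is_signed is implementation-defined, so there is
// nothing to assert for plain char.

// Hypothetical helper: true if the code unit belongs to a multi-byte UTF-8
// sequence. A naive `c >= 0x80` is always false where plain char is a signed
// 8-bit type, because such code units are then stored as negative values.
inline bool is_non_ascii(char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}
```

An unsigned code unit type would make the cast unnecessary.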

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.

------=_Part_134_94620050.1433597470557--
------=_Part_133_1111759971.1433597470557--

.


Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 6 Jun 2015 10:19:57 -0700 (PDT)
Raw View
------=_Part_749_1728798221.1433611197189
Content-Type: multipart/alternative;
 boundary="----=_Part_750_457956437.1433611197189"

------=_Part_750_457956437.1433611197189
Content-Type: text/plain; charset=UTF-8

The big problem is that this is currently legal:

const char *u8str = u8"Stuff";

And it *must* remain legal in the future, otherwise you break the world. So
whatever u8"" returns, it must be implicitly convertible to `const char*`.

But we also want this implicit conversion to fail:

const char8_t *u8str = u8"Stuff";
const char *str = u8str;

So whatever u8"" returns, it *must not* be `const char8_t *`. It must be
some *other* construct that can be converted to one or the other.
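
One way to picture such a construct is a small proxy type with two
conversion functions. This is only a sketch under stated assumptions: the
names char8, u8_literal, stuff_units and stuff are hypothetical, char8 is a
scoped enumeration standing in for a distinct code unit type, and the code
units of "Stuff" are written out by hand because no literal can produce the
proxy.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a distinct UTF-8 code unit type.
enum class char8 : unsigned char {};

// Hypothetical "other construct" that a u8 literal could evaluate to.
class u8_literal {
public:
    explicit constexpr u8_literal(const char8* units) : units_(units) {}

    // New, type-safe view of the code units.
    constexpr operator const char8*() const { return units_; }

    // Legacy view, so that `const char* s = <u8 literal>;` keeps compiling.
    operator const char*() const {
        return reinterpret_cast<const char*>(units_);
    }

private:
    const char8* units_;
};

static_assert(std::is_convertible<u8_literal, const char*>::value, "legacy OK");
static_assert(std::is_convertible<u8_literal, const char8*>::value, "new OK");
static_assert(!std::is_convertible<const char8*, const char*>::value,
              "a char8 pointer variable does not convert to const char*");

// The code units of "Stuff" plus a terminating zero.
const char8 stuff_units[] = {
    static_cast<char8>(0x53), static_cast<char8>(0x74), static_cast<char8>(0x75),
    static_cast<char8>(0x66), static_cast<char8>(0x66), char8()
};
const u8_literal stuff(stuff_units);
```

Note that a call to an overload set taking both `const char*` and
`const char8*` would be ambiguous for such a proxy, so this only sketches
the conversion requirement, not a complete design.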

------=_Part_750_457956437.1433611197189--
------=_Part_749_1728798221.1433611197189--

.


Author: David Krauss <potswa@mac.com>
Date: Sun, 07 Jun 2015 10:51:09 +0800
Raw View
--Apple-Mail=_CA2E7910-9E15-4B8D-BEFE-05B3B44378DF
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8


> On 2015–06–07, at 1:19 AM, Nicol Bolas <jmckesson@gmail.com> wrote:
>
> The big problem is that this is currently legal:
>
> const char *u8str = u8"Stuff";
>
> And it must remain legal in the future, otherwise you break the world. So
> whatever u8"" returns, it must be implicitly convertible to `const char*`.
>
> But we also want this implicit conversion to fail:
>
> const char8_t *u8str = u8"Stuff";
> const char *str = u8str;
>
> So whatever u8"" returns, it must not be `const char8_t *`. It must be
> some other construct that can be converted to one or the other.

The situation is similar to that of C++03 string literals, which needed to
appear to be non-const in an immediate context. I think the solution is
likewise: convert literals differently from variables, but deprecate the
difference.


--Apple-Mail=_CA2E7910-9E15-4B8D-BEFE-05B3B44378DF
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

<html><head><meta http-equiv=3D"Content-Type" content=3D"text/html charset=
=3Dutf-8"></head><body style=3D"word-wrap: break-word; -webkit-nbsp-mode: s=
pace; -webkit-line-break: after-white-space;" class=3D""><br class=3D""><di=
v><blockquote type=3D"cite" class=3D""><div class=3D"">On 2015=E2=80=9306=
=E2=80=9307, at 1:19 AM, Nicol Bolas &lt;<a href=3D"mailto:jmckesson@gmail.=
com" class=3D"">jmckesson@gmail.com</a>&gt; wrote:</div><br class=3D"Apple-=
interchange-newline"><div class=3D""><div dir=3D"ltr" class=3D"">The big pr=
oblem is that this is currently legal:<br class=3D""><br class=3D""><div cl=
ass=3D"prettyprint" style=3D"background-color: rgb(250, 250, 250); border-c=
olor: rgb(187, 187, 187); border-style: solid; border-width: 1px; word-wrap=
: break-word;"><code class=3D"prettyprint"><span style=3D"color: #008;" cla=
ss=3D"styled-by-prettify">const</span> <span style=3D"color: #008;" class=
=3D"styled-by-prettify">char</span> <span style=3D"color: #660;" class=3D"s=
tyled-by-prettify">*</span>u8str <span style=3D"color: #660;" class=3D"styl=
ed-by-prettify">=3D</span> u8<span style=3D"color: #080;" class=3D"styled-b=
y-prettify">"Stuff"</span><span style=3D"color: #660;" class=3D"styled-by-p=
rettify">;</span><br class=3D""></code></div><br class=3D"">And it <i class=
=3D"">must</i> remain legal in the future, otherwise you break the world. S=
o whatever u8"" returns, it must be implicitly convertible to `const char*`=
..<br class=3D""><br class=3D"">But we also want this implicit conversion to=
 fail:<br class=3D""><br class=3D""><div class=3D"prettyprint" style=3D"bac=
kground-color: rgb(250, 250, 250); border-color: rgb(187, 187, 187); border=
-style: solid; border-width: 1px; word-wrap: break-word;"><code class=3D"pr=
ettyprint"><span style=3D"color: #008;" class=3D"styled-by-prettify">const<=
/span> char8_t <span style=3D"color: #660;" class=3D"styled-by-prettify">*<=
/span>u8str <span style=3D"color: #660;" class=3D"styled-by-prettify">=3D</=
span> u8<span style=3D"color: #080;" class=3D"styled-by-prettify">"Stuff"</=
span><span style=3D"color: #660;" class=3D"styled-by-prettify">;</span><br =
class=3D""><span style=3D"color: #008;" class=3D"styled-by-prettify">const<=
/span> <span style=3D"color: #008;" class=3D"styled-by-prettify">char</span=
> <span style=3D"color: #660;" class=3D"styled-by-prettify">*</span>str <sp=
an style=3D"color: #660;" class=3D"styled-by-prettify">=3D</span> u8Str<spa=
n style=3D"color: #660;" class=3D"styled-by-prettify">;</span></code></div>=
<br class=3D"">So whatever u8"" returns, it <i class=3D"">must not</i> be `=
const char8_t *`. It must be some <i class=3D"">other</i> construct that ca=
n be converted to one or the other.<br class=3D""></div></div></blockquote>=
<br class=3D""></div><div>The situation is similar to that of C++03 string =
literals, which needed to appear to be non-const in an immediate context. I=
 think the solution is likewise: convert literals differently from variable=
s, but deprecate the difference.</div></body></html>

<p></p>

-- <br />
<br />
--- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals&quot; group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
Visit this group at <a href=3D"http://groups.google.com/a/isocpp.org/group/=
std-proposals/">http://groups.google.com/a/isocpp.org/group/std-proposals/<=
/a>.<br />

--Apple-Mail=_CA2E7910-9E15-4B8D-BEFE-05B3B44378DF--

.


Author: Roman Perepelitsa <roman.perepelitsa@gmail.com>
Date: Sun, 7 Jun 2015 08:53:33 +0200
Raw View
--001a11c3c6a20f2f9c0517e7fd8e
Content-Type: text/plain; charset=UTF-8

On Sun, Jun 7, 2015 at 4:51 AM, David Krauss <potswa@mac.com> wrote:

>
> On 2015–06–07, at 1:19 AM, Nicol Bolas <jmckesson@gmail.com> wrote:
>
> The big problem is that this is currently legal:
>
> const char *u8str = u8"Stuff";
>
> And it *must* remain legal in the future, otherwise you break the world.
> So whatever u8"" returns, it must be implicitly convertible to `const
> char*`.
>
> But we also want this implicit conversion to fail:
>
> const char8_t *u8str = u8"Stuff";
> const char *str = u8str;
>
> So whatever u8"" returns, it *must not* be `const char8_t *`. It must be
> some *other* construct that can be converted to one or the other.
>
>
> The situation is similar to that of C++03 string literals, which needed to
> appear to be non-const in an immediate context. I think the solution is
> likewise: convert literals differently from variables, but deprecate the
> difference.
>

It's also similar to the behavior of literal zero: it converts to int and
to T*, while int and T* don't convert to each other. It prefers conversion
to int over pointer. u8 literals can be handled the same way: they'll
convert to const char8_t* and const char*, preferring the former, while
const char8_t* and const char* won't convert to each other.
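
The literal-zero behavior described above can be checked with a small sketch like the following. The overload set `which` and the variable `p` are names invented for this illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Two overloads: one taking int, one taking a pointer.
std::string which(int)         { return "int"; }
std::string which(const char*) { return "pointer"; }

// The literal 0 can initialize a pointer (it is a null pointer constant) ...
const char* p = 0;

// ... but int and pointer types do not convert to each other in general.
static_assert(!std::is_convertible<int, const char*>::value,
              "an int variable does not convert to a pointer");
static_assert(!std::is_convertible<const char*, int>::value,
              "a pointer does not convert to int");

// which(0) selects the int overload: 0 is an exact match for int, while
// binding it to const char* needs a pointer conversion.
```

Under the proposed analogy, a u8 literal would play the role of `0`, `const char8_t*` the role of `int` (the preferred target), and `const char*` the role of the pointer.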

Roman Perepelitsa.

--001a11c3c6a20f2f9c0517e7fd8e--

.


Author: me@maxtruxa.com
Date: Sun, 7 Jun 2015 05:52:04 -0700 (PDT)
Raw View
------=_Part_515_1493153244.1433681524682
Content-Type: multipart/alternative;
 boundary="----=_Part_516_2080156713.1433681524682"

------=_Part_516_2080156713.1433681524682
Content-Type: text/plain; charset=UTF-8

On Sunday, June 7, 2015 at 8:53:56 AM UTC+2, Roman Perepelitsa wrote:
>
> It's also similar to the behavior of literal zero: it converts to int and
> to T*, while int and T* don't convert to each other. It prefers conversion
> to int over pointer. u8 literals can be handled the same way: they'll
> convert to const char8_t* and const char*, preferring the former, while
> const char8_t* and const char* won't convert to each other.
>

One special case that requires some extra thought is this:
auto x = u8"";

Is the type of x now char8_t const* or char const*? (Taking the conversion
preference you mentioned into account it would probably be char8_t const*.)

char8_t const* would break old code that did something like this (which is
legal ATM):
void f(char const*);
auto x = u8"";
f(x); // bad: "x" is now of type "char8_t const*".

char const* on the other hand would make it impossible to utilize auto in
future code:
void f(char8_t const*);
auto x = u8"";
f(x); // bad: "x" is still of type "char const*".

I don't know whether the committee has reached a consensus on what kinds
of code-breaking changes are acceptable for future standards, but
apparently code breakage is not off the table completely. Keeping things
compatible is obviously desirable but should not prevent defective behavior
from being fixed. I think the current, pretty much unusable definition of
UTF-8 string literals is a really good candidate for an incompatible change.
Furthermore, a clean cut would remove the inconsistency between u8/u/U (which
is currently yet another unintuitive corner case in C++ that has to be
learned).
At first I intuitively expected a problem like this to be a candidate for a
DR, but the ISO definition says otherwise:

> A standard has a defect if and only if something is underspecified [...]
> or contains a contradiction [...].

What could be considered defective even under this definition is the use
of char (which is not guaranteed to be 8 bits) for UTF-8 code units
(which are exactly 8 bits).

------=_Part_516_2080156713.1433681524682--
------=_Part_515_1493153244.1433681524682--

.


Author: David Krauss <potswa@mac.com>
Date: Sun, 07 Jun 2015 21:34:06 +0800
Raw View
--Apple-Mail=_8E69C884-6FB9-41A3-AE98-C3F5CEA3A02E
Content-Type: text/plain; charset=UTF-8


> On 2015–06–07, at 8:52 PM, me@maxtruxa.com wrote:
>
> One special case that requires some extra thought is this:
> auto x = u8"";
>
> Is the type of x now char8_t const* or char const*? (Taking the
> conversion preference you mentioned into account it would probably be
> char8_t const*.)

Since char may already be unsigned, depending on the platform, a program
shouldn’t assume it’s signed anyway.

Unless there’s an argument for it, we don’t want a new char8_t type,
either. There are already three separate types: char, unsigned char, and
signed char.

The real problem is that unsigned char * doesn’t convert to char *, which
is all that std::string knows about. Looking at the big picture, when you
want to decode UTF-8 from a std::string, you have to manually cast to
unsigned char at some point.

There’s really no good reason that char should have negative values in
this day and age, and programs that depend on it are unportable, but due to
a majority of ABIs doing it that way, we’re pretty well stuck with it.

Perhaps there should be implicit conversions between char * and
[un]signed char *, but not between signed char * and unsigned char *
directly? … :(
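
The manual cast mentioned above typically shows up when classifying bytes. A minimal sketch, assuming a hypothetical helper named `sequence_length` (not from any library), could look like this:

```cpp
#include <cassert>

// Returns the length in bytes of the UTF-8 sequence introduced by `lead`,
// or 0 if `lead` cannot start a sequence.
inline int sequence_length(char lead) {
    // Wherever plain char is signed, bytes >= 0x80 are negative values, so
    // the range comparisons below only work after converting to unsigned
    // char.
    const unsigned char b = static_cast<unsigned char>(lead);
    if (b < 0x80) return 1;                // ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // two-byte lead
    if (b >= 0xE0 && b <= 0xEF) return 3;  // three-byte lead
    if (b >= 0xF0 && b <= 0xF4) return 4;  // four-byte lead
    return 0;  // continuation byte (0x80..0xBF) or invalid lead byte
}
```

A byte taken straight out of a `std::string` can be passed in as is; the cast happens once inside the helper rather than at every call site.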

--Apple-Mail=_8E69C884-6FB9-41A3-AE98-C3F5CEA3A02E--

.


Author: me@maxtruxa.com
Date: Sun, 7 Jun 2015 08:23:10 -0700 (PDT)
Raw View
------=_Part_88_742265051.1433690590341
Content-Type: multipart/alternative;
 boundary="----=_Part_89_516206734.1433690590341"

------=_Part_89_516206734.1433690590341
Content-Type: text/plain; charset=UTF-8

If there is no argument for char8_t, then why are char16_t and char32_t not
just typedefs of uint_least16_t and uint_least32_t respectively? Why do
they exist at all?

N3337 $3.9.1 Fundamental types:

> Types char16_t and char32_t denote distinct types with the same size,
> signedness, and alignment as uint_least16_t and uint_least32_t,
> respectively *[...]*.


I just can't see where this inconsistency (char16_t, char32_t with
(possibly) different properties than char; but no matching char8_t) came
from in the first place.
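
What the distinct types buy in practice can be sketched as follows. The overload set `encoding_of` is a name invented for this illustration; the checks against `uint_least16_t`/`uint_least32_t` restate the quoted wording.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Same size as the uint_leastN_t types, but distinct types.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is not a typedef");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t is not a typedef");
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
              "same size as uint_least16_t");

// Because the types are distinct, the encoding can select an overload.
std::string encoding_of(const char16_t*) { return "UTF-16"; }
std::string encoding_of(const char32_t*) { return "UTF-32"; }

// A char-based string carries no such information: this overload cannot
// tell the narrow execution charset from UTF-8.
std::string encoding_of(const char*) { return "narrow or UTF-8"; }
```

With typedefs instead of distinct types, the first two overloads would collide with any existing overloads on the underlying integer types.
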
The real problem now is what I pointed out in my first post: You
can't detect whether you are working with a narrow or UTF-8 encoded string,
which in turn makes it impossible to use UTF-8 string literals in a type
safe manner. In theory, you now have to annotate every single char const*/
std::string in every API with whether it is expected to be encoded using
the narrow execution charset or UTF-8 (or whether the algorithm in question
doesn't care). Even single chars could now be problematic because for UTF-8
a single char that is > 0x7F is ill-formed. In practice, many algorithms
will work with UTF-8 strings as they did with "legacy" strings, but some
others will corrupt strings because they are not aware of multibyte
sequences. This whole mess could have been avoided by simply treating UTF-8
string literals in a consistent manner with UTF-16/UTF-32 string literals.
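
A concrete case of an algorithm that is not aware of multibyte sequences is truncation by byte count. The sketch below uses a hypothetical function `truncate_utf8` and assumes its input is already well-formed UTF-8; it is not a validator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Truncates `s` to at most `max_bytes` bytes without cutting a multibyte
// sequence in half. Assumes `s` holds well-formed UTF-8.
inline std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the first dropped byte is a continuation byte (10xxxxxx), the cut
    // would land inside a sequence, so back up to the start of it.
    while (end > 0 &&
           (static_cast<unsigned char>(s[end]) & 0xC0u) == 0x80u) {
        --end;
    }
    return s.substr(0, end);
}
```

For the three-byte string "h\xC3\xA9" ("hé"), a plain `substr(0, 2)` keeps the lead byte 0xC3 without its continuation byte, which is ill-formed, whereas the function above backs up and returns "h".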

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.

------=_Part_89_516206734.1433690590341
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div>If there is no argument for <font face=3D"courier new=
, monospace">char8_t</font>, then why<font face=3D"arial, sans-serif">&nbsp=
;are&nbsp;</font><font face=3D"courier new, monospace">char16_t</font>&nbsp=
;and&nbsp;<font face=3D"courier new, monospace">char32_t</font><font face=
=3D"arial, sans-serif">&nbsp;not just typedefs of </font><font face=3D"cour=
ier new, monospace">uint_least16_t</font><font face=3D"arial, sans-serif"> =
and </font><font face=3D"courier new, monospace">uint_least32_t</font><font=
 face=3D"arial, sans-serif"> respectively?</font>&nbsp;Why do they exist at=
 all<font face=3D"arial, sans-serif">?</font><br></div><div><font face=3D"a=
rial, sans-serif"><br></font><div><font face=3D"arial, sans-serif">N3337 $3=
..9.1 Fundamental types:</font></div><blockquote class=3D"gmail_quote" style=
=3D"margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: r=
gb(204, 204, 204); border-left-style: solid; padding-left: 1ex;">Types <fon=
t face=3D"courier new, monospace">char16_t</font> and <font face=3D"courier=
 new, monospace">char32_t</font> denote distinct types with the same size, =
signedness, and alignment as
<font face=3D"courier new, monospace">uint_least16_t</font> and <font face=
=3D"courier new, monospace">uint_least32_t</font>, respectively <i>[...]</i=
>.</blockquote><div><br></div><div>I just can't see where this inconsistenc=
y (<font face=3D"courier new, monospace">char16_t</font>, <font face=3D"cou=
rier new, monospace">char32_t</font> with (possibly) different properties t=
han <font face=3D"courier new, monospace">char</font>; but no matching&nbsp=
;<font face=3D"courier new, monospace">char8_t</font>) came from in the fir=
st place.</div></div><div>The real problem there is now is what I pointed o=
ut in my first post: You can't detect whether you are working with a narrow=
 or UTF-8 encoded string, which in turn makes it impossible to use UTF-8 st=
ring literals in a type safe manner. In theory, you now have to annotate ev=
ery single <font face=3D"courier new, monospace">char const*</font>/<span s=
tyle=3D"font-family: 'courier new', monospace;">std::string</span>&nbsp;in =
every API with whether it is expected to be encoded using the narrow execut=
ion charset or UTF-8 (or whether the algorithm in question doesn't care). E=
ven single <font face=3D"courier new, monospace">char</font>s could now be =
problematic because for UTF-8 a single <font face=3D"courier new, monospace=
">char</font><font face=3D"arial, sans-serif"> that is </font><font face=3D=
"courier new, monospace">&gt; 0x7F&nbsp;</font>is ill-formed. In practice, =
many algorithms will work with UTF-8 strings as they did with "legacy" stri=
ngs, but some others will corrupt strings because they are not aware of mul=
tibyte sequences. This whole mess could have been avoided by simply treatin=
g UTF-8 string literals in a consistent manner with UTF-16/UTF-32 string li=
terals.</div></div>


------=_Part_89_516206734.1433690590341--
------=_Part_88_742265051.1433690590341--

.


Author: Bo Persson <bop@gmb.dk>
Date: Mon, 08 Jun 2015 17:49:34 +0200
Raw View
On 2015-06-07 17:23, me@maxtruxa.com wrote:
> If there is no argument for char8_t, then why are char16_t and
> char32_t not just typedefs of uint_least16_t and
> uint_least32_t respectively? Why do they exist at all?

There have been arguments for a char8_t type, just not strong enough to
convince the committee.

The argument against it is that we already have three 8-bit character
types - char, signed char, and unsigned char - two of which have
identical representation. That's way too many already.

Adding another type char8_t, with a representation identical to at least
one of the existing types, didn't look like an improvement.

Having 6 character types, some of which are identical, in a language is
gross. Why add a 7th?
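
To make the "distinct type, identical representation" point concrete, here
is a minimal sketch (not from any proposal; char_mirrors is a hypothetical
helper name used only for illustration). It assumes nothing beyond a
C++11 compiler:

```cpp
#include <string>
#include <type_traits>

// char, signed char and unsigned char are three pairwise distinct types,
// even though all of them occupy exactly one byte.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
static_assert(!std::is_same<signed char, unsigned char>::value, "distinct types");
static_assert(sizeof(char) == 1 && sizeof(signed char) == 1 &&
              sizeof(unsigned char) == 1, "all one byte");

// Hypothetical helper (illustration only): names the type whose signedness
// plain char shares on this implementation. The answer is
// implementation-defined, so no particular result is assumed here.
inline std::string char_mirrors() {
    return std::is_signed<char>::value ? "signed char" : "unsigned char";
}
```

Which of the two char_mirrors() reports depends on the platform and
compiler flags; either way a hypothetical char8_t would share its
representation with one of the existing three.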


Bo Persson



>
> N3337 $3.9.1 Fundamental types:
>
>     Types char16_t and char32_t denote distinct types with the same
>     size, signedness, and alignment as uint_least16_t and
>     uint_least32_t, respectively /[...]/.
>
>
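
As a rough illustration of what the quoted "distinct types" wording buys
in practice, the sketch below overloads on the code unit type. encoding_of
is a hypothetical name invented for this example, not an existing API:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): reports which encoding the
// element type of a string implies.
inline std::string encoding_of(const char*)     { return "narrow or UTF-8"; }
inline std::string encoding_of(const wchar_t*)  { return "wide"; }
inline std::string encoding_of(const char16_t*) { return "UTF-16"; }
inline std::string encoding_of(const char32_t*) { return "UTF-32"; }

// Well-formed only because char16_t is a distinct type. If it were a
// typedef of uint_least16_t, this would redefine the overload above.
inline std::string encoding_of(const std::uint_least16_t*) {
    return "raw 16-bit integers";
}
```

With these overloads, encoding_of(u"x") and encoding_of(U"x") pick the
UTF-16 and UTF-32 versions. Under the C++11/14 rules discussed in this
thread, "x" and u8"x" both have element type const char, so they land on
the same "narrow or UTF-8" overload and cannot be told apart.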



.