Thread

Topic: Comments on P0372R0, A type for utf-8 data

Author: Tom Honermann <tom@honermann.net>
Date: Tue, 14 Jun 2016 22:55:36 -0400

First, thank you for writing this paper!  It has been on my todo list to
write such a proposal, but alas...

I spoke with Richard Smith about such a proposal in Jacksonville and he
mentioned a further justification for supporting a char8_t type -
optimization.  Today, compilers are limited in optimizing code involving
char and unsigned char glvalues because these types are allowed to alias
objects of other types (C++14 3.10 [basic.lval] p10).  If a char8_t type
were to be added that adhered to strict aliasing, then compilers could
more aggressively optimize code involving it.  I think this may be a
benefit worth adding to the paper.
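
A minimal sketch of the optimization argument, under stated assumptions: `utf8_unit` is a hypothetical stand-in for the proposed char8_t (a distinct type with an unsigned char underlying type and no special aliasing permission), and the function names are made up for illustration.  Both loops do the same work; the difference is only in what the compiler is allowed to assume about `*len`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the proposed char8_t: a distinct type with an
// unsigned char underlying type and no aliasing exemption.
enum class utf8_unit : unsigned char {};

// unsigned char may alias any object, so every store through dst could in
// principle modify *len. The compiler has to assume that and re-read *len.
inline void clear_bytes(unsigned char* dst, const std::size_t* len) {
    for (std::size_t i = 0; i < *len; ++i) {
        dst[i] = 0;
    }
}

// A store through utf8_unit* cannot legally modify a std::size_t object, so
// an optimizer is free to read *len once before the loop.
inline void clear_units(utf8_unit* dst, const std::size_t* len) {
    for (std::size_t i = 0; i < *len; ++i) {
        dst[i] = utf8_unit();
    }
}

// The observable behavior is the same for both; only the optimizer's
// freedom differs.
inline bool aliasing_demo() {
    unsigned char bytes[4] = {1, 2, 3, 4};
    utf8_unit units[4] = {static_cast<utf8_unit>(1), static_cast<utf8_unit>(2),
                          static_cast<utf8_unit>(3), static_cast<utf8_unit>(4)};
    const std::size_t n = 4;
    clear_bytes(bytes, &n);
    clear_units(units, &n);
    for (std::size_t i = 0; i < n; ++i) {
        assert(bytes[i] == 0);
        assert(units[i] == utf8_unit());
    }
    return true;
}
```

Whether a given compiler actually hoists the load in the second loop depends on the implementation and optimization level; the sketch only shows where the permission comes from.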

Tom.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/5760C3A8.5060804%40honermann.net.

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 14 Jun 2016 20:32:22 -0700 (PDT)

On Tuesday, June 14, 2016 at 10:55:38 PM UTC-4, Tom Honermann wrote:
>
> First, thank you for writing this paper!  It has been on my todo list to
> write such a proposal, but alas...
>
> I spoke with Richard Smith about such a proposal in Jacksonville and he
> mentioned a further justification for supporting a char8_t type -
> optimization.  Today, compilers are limited in optimizing code involving
> char and unsigned char glvalues because these types are allowed to alias
> objects of other types (C++14 3.10 [basic.lval] p10).  If a char8_t type
> were to be added that adhered to strict aliasing, then compilers could
> more aggressively optimize code involving it.  I think this may be a
> benefit worth adding to the paper.
>

I'm quite certain that the proposal makes this illegal:

const char8_t *str = "Some String";

`char8_t` is meant for UTF-8 strings *only*. And most people's strings are
narrow character strings; on specific platforms, this may work out to being
UTF-8, but there is no guarantee of that. We need to differentiate between
narrow character strings and UTF-8 encoded strings at the type level.

The last thing we want is to encourage people to do this:

auto str = (const char8_t *)"Some String";

If people start trying to do casts like that to take advantage of more
aggressive optimizations, then we'll be right back where we were before: we
won't know if a string *really is* UTF-8 or not.

Solving the "char as byte array and string" problem is important. But we
shouldn't suggest that `char8_t` constitutes such a solution.
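
To make the type-level distinction concrete, here is a small sketch. It assumes a hypothetical stand-in type `utf8_unit` in place of the proposed `char8_t`, and the `encoding_of` overloads are invented for illustration. It shows that a narrow string does not convert implicitly to the distinct type, and that a cast changes which overload is chosen without checking the encoding at all.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the proposed char8_t.
enum class utf8_unit : unsigned char {};

// A narrow string pointer does not convert implicitly to the distinct type.
static_assert(!std::is_convertible<const char*, const utf8_unit*>::value,
              "narrow strings must not silently become UTF-8 strings");

// Overloads can now tell the two kinds of string apart by type alone.
inline const char* encoding_of(const char*) { return "narrow (unknown encoding)"; }
inline const char* encoding_of(const utf8_unit*) { return "UTF-8"; }

inline void overload_demo() {
    const char* narrow = "Some String";
    assert(std::strcmp(encoding_of(narrow), "narrow (unknown encoding)") == 0);

    // The cast compiles and selects the UTF-8 overload, but nothing has
    // verified that the bytes are UTF-8. The pointer is not dereferenced here.
    const utf8_unit* forced = reinterpret_cast<const utf8_unit*>(narrow);
    assert(std::strcmp(encoding_of(forced), "UTF-8") == 0);
}
```

The second half is the abuse case: once such casts become common, the type no longer tells you anything about the encoding.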

.

Author: Tom Honermann <tom@honermann.net>
Date: Tue, 14 Jun 2016 23:44:17 -0400

On 06/14/2016 11:32 PM, Nicol Bolas wrote:
> On Tuesday, June 14, 2016 at 10:55:38 PM UTC-4, Tom Honermann wrote:
>
>     First, thank you for writing this paper!  It has been on my todo
>     list to
>     write such a proposal, but alas...
>
>     I spoke with Richard Smith about such a proposal in Jacksonville
>     and he
>     mentioned a further justification for supporting a char8_t type -
>     optimization.  Today, compilers are limited in optimizing code
>     involving
>     char and unsigned char glvalues because these types are allowed to
>     alias
>     objects of other types (C++14 3.10 [basic.lval] p10).  If a
>     char8_t type
>     were to be added that adhered to strict aliasing, then compilers
>     could
>     more aggressively optimize code involving it.  I think this may be a
>     benefit worth adding to the paper.
>
>
> I'm quite certain that the proposal makes this illegal:
>
> const char8_t *str = "Some String";

I would hope so.

> `char8_t` is meant for UTF-8 strings /only/. And most people's strings
> are narrow character strings; on specific platforms, this may work out
> to being UTF-8, but there is no guarantee of that. We need to
> differentiate between narrow character strings and UTF-8 encoded
> strings at the type level.
>
> The last thing we want is to encourage people to do this:
>
> auto str = (const char8_t *)"Some String";

I agree.

> If people start trying to do casts like that to take advantage of more
> aggressive optimizations, then we'll be right back where we were
> before: we won't know if a string /really is/ UTF-8 or not.
>
> Solving the "char as byte array and string" problem is important. But
> we shouldn't suggest that `char8_t` constitutes such a solution.

I don't think the ability to abuse a feature should be sufficient
justification to not add it.  I did not intend to suggest that char8_t
be used to circumvent existing aliasing rules.  Rather, that giving it
strict aliasing behavior would enable optimizations for UTF-8 data.
That could potentially provide some motivation towards using UTF-8
strings in preference to narrow strings.

Tom.

.

Author: Tom Honermann <tom@honermann.net>
Date: Wed, 15 Jun 2016 11:07:19 -0400

On 6/14/2016 10:55 PM, Tom Honermann wrote:
> First, thank you for writing this paper!  It has been on my todo list
> to write such a proposal, but alas...
>
> I spoke with Richard Smith about such a proposal in Jacksonville and
> he mentioned a further justification for supporting a char8_t type -
> optimization.  Today, compilers are limited in optimizing code
> involving char and unsigned char glvalues because these types are
> allowed to alias objects of other types (C++14 3.10 [basic.lval]
> p10).  If a char8_t type were to be added that adhered to strict
> aliasing, then compilers could more aggressively optimize code
> involving it.  I think this may be a benefit worth adding to the paper.

I'd also like to propose that the implicit conversion from u8"" to const
char[] and u8'x' to char be introduced as deprecated features that can
be removed in a future standard.
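
A sketch of how one might observe the conversion in question, using only standard type traits.  The names `u8_element` and `u8_literal_converts_to_narrow` are invented for illustration.  The assumption here is that an implementation with a distinct UTF-8 character type advertises it through the `__cpp_char8_t` feature-test macro; without it, a u8 literal is an array of const char and converts to `const char*` like any narrow literal.

```cpp
#include <type_traits>

// Element type of a u8 string literal, with reference and const stripped.
using u8_element =
    std::remove_cv<std::remove_reference<decltype(u8"x"[0])>::type>::type;

// Does a u8 string literal convert implicitly to a narrow string pointer?
constexpr bool u8_literal_converts_to_narrow =
    std::is_convertible<decltype(u8"x"), const char*>::value;

#if defined(__cpp_char8_t)
// With a distinct UTF-8 character type, the implicit conversion is gone.
static_assert(!u8_literal_converts_to_narrow,
              "u8 literals should not convert to const char*");
#else
// Without one, u8 literals are plain char arrays and convert freely.
static_assert(std::is_same<u8_element, char>::value,
              "u8 literal elements are char");
static_assert(u8_literal_converts_to_narrow,
              "u8 literals convert to const char*");
#endif
```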

Is there any implementation experience?  Any chance that patches to gcc
or Clang exist?  If so, I would be interested in experimenting with them.

Tom.

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Wed, 15 Jun 2016 08:56:32 -0700 (PDT)

On Tuesday, June 14, 2016 at 11:44:19 PM UTC-4, Tom Honermann wrote:
>
> On 06/14/2016 11:32 PM, Nicol Bolas wrote:
>
> On Tuesday, June 14, 2016 at 10:55:38 PM UTC-4, Tom Honermann wrote:
>>
>> First, thank you for writing this paper!  It has been on my todo list to
>> write such a proposal, but alas...
>>
>> I spoke with Richard Smith about such a proposal in Jacksonville and he
>> mentioned a further justification for supporting a char8_t type -
>> optimization.  Today, compilers are limited in optimizing code involving
>> char and unsigned char glvalues because these types are allowed to alias
>> objects of other types (C++14 3.10 [basic.lval] p10).  If a char8_t type
>> were to be added that adhered to strict aliasing, then compilers could
>> more aggressively optimize code involving it.  I think this may be a
>> benefit worth adding to the paper.
>>
>
> I'm quite certain that the proposal makes this illegal:
>
> const char8_t *str = "Some String";
>
>
> I would hope so.
>
> `char8_t` is meant for UTF-8 strings *only*. And most people's strings
> are narrow character strings; on specific platforms, this may work out to
> being UTF-8, but there is no guarantee of that. We need to differentiate
> between narrow character strings and UTF-8 encoded strings at the type
> level.
>
> The last thing we want is to encourage people to do this:
>
> auto str = (const char8_t *)"Some String";
>
>
> I agree.
>
> If people start trying to do casts like that to take advantage of more
> aggressive optimizations, then we'll be right back where we were before: we
> won't know if a string *really is* UTF-8 or not.
>
> Solving the "char as byte array and string" problem is important. But we
> shouldn't suggest that `char8_t` constitutes such a solution.
>
>
> I don't think the ability to abuse a feature should be sufficient
> justification to not add it.  I did not intend to suggest that char8_t be
> used to circumvent existing aliasing rules.  Rather, that giving it strict
> aliasing behavior would enable optimizations for UTF-8 data.  That could
> potentially provide some motivation towards using UTF-8 strings in
> preference to narrow strings.
>

Right, but it already has that. `char8_t`, based on the "unique, unsigned
type" statement in the proposal, is a different type from `char` and
`unsigned char`. It has the same value representation as those two, but the
way strict aliasing is defined already does not allow `char8_t*` to alias
with other types. Just as it doesn't allow `char16_t*` or `char32_t*` to do
so. The same goes for enums that use `char` as their underlying type;
arrays of them are not `char*`s to the strict aliasing rules.

The strict aliasing rules do not care what the underlying type of something
is.
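
As a rough illustration of that rule, assuming a hypothetical enum `utf8_unit` as a stand-in for `char8_t` (the helper names are also invented): reading an object's bytes through `unsigned char` is permitted, but the same read through a distinct type with that underlying type is not, so the sketch copies the byte instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for char8_t: same size and underlying type as
// unsigned char, but a distinct type as far as aliasing is concerned.
enum class utf8_unit : unsigned char {};

static_assert(std::is_same<std::underlying_type<utf8_unit>::type,
                           unsigned char>::value,
              "underlying type is unsigned char");
static_assert(sizeof(utf8_unit) == sizeof(unsigned char), "same size");

// Permitted: unsigned char may be used to examine any object's bytes.
inline unsigned char first_byte(const std::uint32_t& v) {
    return *reinterpret_cast<const unsigned char*>(&v);
}

// Reading v through a const utf8_unit* would violate the aliasing rules,
// regardless of the underlying type. Copying the byte is the safe route.
inline utf8_unit first_unit(const std::uint32_t& v) {
    utf8_unit u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}
```

For example, with `std::uint32_t v = 0x01010101u`, both helpers yield the value 1 on any byte order.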

What I'm saying is that we shouldn't *advertise* this as a selling point of
the feature. It shouldn't be listed in the motivation section, for example.
Otherwise you will encourage people to abuse it.

.

Author: Tom Honermann <tom@honermann.net>
Date: Wed, 15 Jun 2016 17:21:15 -0400

On 6/15/2016 11:56 AM, Nicol Bolas wrote:
>
>>     If people start trying to do casts like that to take advantage of
>>     more aggressive optimizations, then we'll be right back where we
>>     were before: we won't know if a string /really is/ UTF-8 or not.
>>
>>     Solving the "char as byte array and string" problem is important.
>>     But we shouldn't suggest that `char8_t` constitutes such a solution.
>
>     I don't think the ability to abuse a feature should be sufficient
>     justification to not add it.  I did not intend to suggest that
>     char8_t be used to circumvent existing aliasing rules.  Rather,
>     that giving it strict aliasing behavior would enable optimizations
>     for UTF-8 data.  That could potentially provide some motivation
>     towards using UTF-8 strings in preference to narrow strings.
>
>
> Right, but it already has that. `char8_t`, based on the "unique,
> unsigned type" statement in the proposal, is a different type from
> `char` and `unsigned char`. It has the same value representation as
> those two, but the way strict aliasing is defined already does not
> allow `char8_t*` to alias with other types. Just as it doesn't allow
> `char16_t*` or `char32_t*` to do so. The same goes for enums that use
> `char` as their underlying type; arrays of them are not `char*`s to
> the strict aliasing rules.

Until we have wording or the proposal states otherwise, we don't know
what we have.  I agree that changes particular to aliasing would have to
be made to the standard if the type was intended not to follow strict
aliasing rules.

> The strict aliasing rules do not care what the underlying type of
> something is.
>
> What I'm saying is that we shouldn't /advertise/ this as a selling
> point of the feature. It shouldn't be listed in the motivation
> section, for example. Otherwise you will encourage people to abuse it.

Uh oh, I think the cat is out of the bag...

Name a feature that people haven't figured out how to abuse.  This isn't
any different.  Listing the potential benefit and facilitating
discussion on the potential for abuse strikes me as a better approach
than "shhh".

Tom.

.

Author: Michael Spencer <bigcheesegs@gmail.com>
Date: Fri, 17 Jun 2016 13:18:41 -0700

On Tue, Jun 14, 2016 at 7:55 PM, Tom Honermann <tom@honermann.net> wrote:
> First, thank you for writing this paper!  It has been on my todo list to
> write such a proposal, but alas...
>
> I spoke with Richard Smith about such a proposal in Jacksonville and he
> mentioned a further justification for supporting a char8_t type -
> optimization.  Today, compilers are limited in optimizing code involving
> char and unsigned char glvalues because these types are allowed to alias
> objects of other types (C++14 3.10 [basic.lval] p10).  If a char8_t type
> were to be added that adhered to strict aliasing, then compilers could more
> aggressively optimize code involving it.  I think this may be a benefit
> worth adding to the paper.
>
> Tom.

I also had a quite similar conversation with Richard :).

We did consider covering aliasing in the paper, but in the end we felt
that it detracted from the core message of C++ needing a type for
utf-8. The aliasing properties are indeed useful for optimization, but
just adding new distinct types is a bad solution to the general
aliasing problem.

- Michael Spencer

.