Thread

Topic: Distinct type of array elements in UTF-8

Author: Richard Smith <richard@metafoo.co.uk>
Date: Mon, 8 Jun 2015 11:08:36 -0700

On Mon, Jun 8, 2015 at 8:49 AM, Bo Persson <bop@gmb.dk> wrote:

> On 2015-06-07 17:23, me@maxtruxa.com wrote:
>
>> If there is no argument for char8_t, then why are char16_t and
>> char32_t not just typedefs of uint_least16_t and
>> uint_least32_t respectively? Why do they exist at all?
>>
>
> There have been arguments for a char8_t type, just not strong enough to
> convince the committee.
>
> The argument against it is that we already have three 8-bit character
> types - char, signed char, and unsigned char - two of which have identical
> representation. That's way too many already.
>
> Adding another type char8_t, with a representation identical to at least
> one of the existing types, didn't look like an improvement.
>
> Having 6 character types, some of which are identical, in a language is
> gross. Why add a 7th?
>

Because 'char' and 'unsigned char' are underlying storage types and can
alias anything. This frequently results in slow code being generated for
string manipulation algorithms. (Using a signed type would be completely
inappropriate for UTF-8, but nonetheless this is what we usually get today
because u8 string literals have type char[N], and char is signed by default
on many implementations.)

Having a type that represents a single byte but that cannot alias every
other type would be very valuable for high-performance string algorithms.
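
As a rough illustration of the two properties mentioned above (this sketch is not part of the original message, and the helper names nonzero_bytes and is_continuation are made up for the example): unsigned char may be used to inspect the bytes of any object, and UTF-8 byte tests are safest when done on an unsigned value, since plain char may be signed.

```cpp
#include <cstddef>

// Counts the nonzero bytes in the object representation of obj.
// Reading any object through unsigned char is permitted by the aliasing
// rules, which is exactly why the compiler must assume a char store can
// modify anything.
template <typename T>
std::size_t nonzero_bytes(const T& obj) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&obj);
    std::size_t n = 0;
    for (std::size_t i = 0; i != sizeof(T); ++i)
        if (bytes[i] != 0) ++n;
    return n;
}

// Returns true if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
// The conversion to unsigned char keeps the test independent of whether
// plain char is signed on this implementation.
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}
```

For instance, nonzero_bytes(1u) is 1 regardless of byte order, and is_continuation('\x80') holds whether or not char is signed.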

> Bo Persson
>
>> N3337 §3.9.1 Fundamental types:
>>
>>     Types char16_t and char32_t denote distinct types with the same
>>     size, signedness, and alignment as uint_least16_t and
>>     uint_least32_t, respectively /[...]/.
>>
>>
>>
>
> --
>
> --- You received this message because you are subscribed to the Google
> Groups "ISO C++ Standard - Future Proposals" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to std-proposals+unsubscribe@isocpp.org.
> To post to this group, send email to std-proposals@isocpp.org.
> Visit this group at
> http://groups.google.com/a/isocpp.org/group/std-proposals/.
>



.

Author: Andrey Semashev <andrey.semashev@gmail.com>
Date: Mon, 08 Jun 2015 21:37:04 +0300

On Monday 08 June 2015 11:08:36 Richard Smith wrote:
> On Mon, Jun 8, 2015 at 8:49 AM, Bo Persson <bop@gmb.dk> wrote:
> >
> > Having 6 character types, some of which are identical, in a language is
> > gross. Why add a 7th?
>
> Because 'char' and 'unsigned char' are underlying storage types and can
> alias anything. This frequently results in slow code being generated for
> string manipulation algorithms. (Using a signed type would be completely
> inappropriate for UTF-8, but nonetheless this is what we usually get today
> because u8 string literals have type char[N], and char is signed by default
> on many implementations.)
>
> Having a type that represents a single byte but that cannot alias every
> other type would be very valuable for high-performance string algorithms.

Could you provide an example of optimization that is not possible now?


.

Author: Richard Smith <richard@metafoo.co.uk>
Date: Mon, 8 Jun 2015 15:18:23 -0700

On Mon, Jun 8, 2015 at 11:37 AM, Andrey Semashev <andrey.semashev@gmail.com>
wrote:

> On Monday 08 June 2015 11:08:36 Richard Smith wrote:
> > On Mon, Jun 8, 2015 at 8:49 AM, Bo Persson <bop@gmb.dk> wrote:
> > >
> > > Having 6 character types, some of which are identical, in a language is
> > > gross. Why add a 7th?
> >
> > Because 'char' and 'unsigned char' are underlying storage types and can
> > alias anything. This frequently results in slow code being generated for
> > string manipulation algorithms. (Using a signed type would be completely
> > inappropriate for UTF-8, but nonetheless this is what we usually get today
> > because u8 string literals have type char[N], and char is signed by default
> > on many implementations.)
> >
> > Having a type that represents a single byte but that cannot alias every
> > other type would be very valuable for high-performance string algorithms.
>
> Could you provide an example of optimization that is not possible now?


struct X {
  char *p;
  int length;
};
void zero(X *x) {
  for (int i = 0; i != x->length; ++i) x->p[i] = 0;
}

... cannot be turned into a memset, and reloads the length on every
iteration, because x->p might point at x->length. Try compiling the above
with your favourite optimizing compiler, then replacing the 'char' with
'short'.
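
For comparison, here is a sketch of a common workaround with today's types. It is not from the original message, and the name zero_hoisted is invented for the example. Copying the members into locals whose addresses are never taken removes the possible aliasing, so a compiler is free to keep the length in a register and may recognise the loop as a memset. Note that this changes behaviour in the (pathological) case where x->p really does point into *x.

```cpp
struct X {
    char *p;
    int length;
};

// Original form: as far as the compiler can tell, each store through x->p
// might modify x->length (or x->p itself), so both have to be reloaded on
// every iteration.
void zero(X *x) {
    for (int i = 0; i != x->length; ++i) x->p[i] = 0;
}

// Workaround sketch: read the members once. The locals p and n cannot be
// modified by a store through a char pointer, because their addresses are
// never taken.
void zero_hoisted(X *x) {
    char *p = x->p;
    const int n = x->length;
    for (int i = 0; i != n; ++i) p[i] = 0;
}
```

This only helps where the author remembers to do it by hand; a byte-sized type that does not alias everything would make the straightforward loop eligible for the same treatment.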


.

Author: Nevin Liber <nevin@eviloverlord.com>
Date: Mon, 8 Jun 2015 17:23:04 -0500

On 8 June 2015 at 17:18, Richard Smith <richard@metafoo.co.uk> wrote:

> struct X {
>   char *p;
>   int length;
> };
> void zero(X *x) {
>   for (int i = 0; i != x->length; ++i) x->p[i] = 0;
> }
>
> ... cannot be turned into a memset, and reloads the length on every
> iteration, because x->p might point at x->length. Try compiling the above
> with your favourite optimizing compiler, then replacing the 'char' with
> 'short'.
>

Shouldn't replacing 'char' with 'signed char' also invoke the memset
optimization? It doesn't seem to under either clang 3.6 or gcc 5.1.
--
 Nevin ":-)" Liber  <mailto:nevin@eviloverlord.com>  (847) 691-1404


.

Author: Richard Smith <richard@metafoo.co.uk>
Date: Mon, 8 Jun 2015 15:29:15 -0700

On Mon, Jun 8, 2015 at 3:23 PM, Nevin Liber <nevin@eviloverlord.com> wrote:

> On 8 June 2015 at 17:18, Richard Smith <richard@metafoo.co.uk> wrote:
>
>> struct X {
>>   char *p;
>>   int length;
>> };
>> void zero(X *x) {
>>   for (int i = 0; i != x->length; ++i) x->p[i] = 0;
>> }
>>
>> ... cannot be turned into a memset, and reloads the length on every
>> iteration, because x->p might point at x->length. Try compiling the above
>> with your favourite optimizing compiler, then replacing the 'char' with
>> 'short'.
>>
>
> Shouldn't replacing 'char' with 'signed char' also invoke the memset
> optimization? It doesn't seem to under either clang 3.6 or gcc 5.1.
>

Yes, it should. This is slightly annoying to get right, since 'signed char'
can alias 'unsigned char' (which can in turn alias anything), so I'm not too
surprised that both compilers miss this trick.

However, both of them optimize the code if you use this instead:

  enum char8_t : char {};

(and replace the "0" with "char8_t()").
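
Spelled out in full, the variant described above might look like the following sketch. The names utf8_unit and U8Buffer are placeholders chosen for this example; the message uses char8_t, which has since become a keyword in C++20 and so can no longer be declared by user code there.

```cpp
// A distinct one-byte type: an enumeration whose underlying type is char.
// It has the size and representation of char, but it is not one of the
// character types that are allowed to alias arbitrary objects.
enum utf8_unit : char {};

static_assert(sizeof(utf8_unit) == 1, "same size as char");

struct U8Buffer {
    utf8_unit *p;
    int length;
};

// Same loop as before. A store through a utf8_unit* cannot legally modify
// an int, so the compiler does not have to reload x->length on each
// iteration.
void zero(U8Buffer *x) {
    for (int i = 0; i != x->length; ++i) x->p[i] = utf8_unit();
}
```

Whether a given compiler actually emits a memset for this is worth checking in the generated assembly rather than assuming.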

--


.