Topic: iswalpha and locales
Author: Renji <asorenji@gmail.com>
Date: Sat, 25 Jun 2016 01:48:29 -0700 (PDT)
As I can see on en.cppreference.com
<http://en.cppreference.com/w/cpp/string/wide/iswalpha>, iswalpha returns
true for "any alphabetic character specific to the current *locale*".
Because of this, on Debian iswalpha(L'か') (か is a hiragana character)
returns false if I do not call std::setlocale before iswalpha. And
iswalpha(L'か') returns true if I call std::setlocale with *any* valid
locale name (tested with en_US.UTF-8, ru_RU.UTF-8 and even C.UTF-8).
This makes no sense. I am working with *Uni*code characters, so why do I
need locales? And if the locale contains some important data, why does
the name of the locale not matter?
Proposal: iswalpha must not depend on any locale.
PS Sorry if my English is bad.
--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/5057e854-b2ea-4c39-81c7-367bc3e54080%40isocpp.org.
.
Author: Bo Persson <bop@gmb.dk>
Date: Sat, 25 Jun 2016 13:24:06 +0200
On 2016-06-25 10:48, Renji wrote:
> As I can see on en.cppreference.com
> <http://en.cppreference.com/w/cpp/string/wide/iswalpha>, iswalpha returns
> true for "any alphabetic character specific to the current *locale*".
> Because of this, on Debian iswalpha(L'か') (か is a hiragana character)
> returns false if I do not call std::setlocale before iswalpha. And
> iswalpha(L'か') returns true if I call std::setlocale with *any* valid
> locale name (tested with en_US.UTF-8, ru_RU.UTF-8 and even C.UTF-8).
> This makes no sense. I am working with *Uni*code characters, so why do I
> need locales? And if the locale contains some important data, why does
> the name of the locale not matter?
>
> Proposal: iswalpha must not depend on any locale.
Don't know about Japanese, but in countries using Latin alphabets the
number of characters in the national alphabet varies.
Whether characters like åäöüÿïâéÀËÏ count as alphabetic *very much*
depends on the chosen locale.
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sat, 25 Jun 2016 10:34:47 -0700
On Saturday, 25 June 2016 01:48:29 PDT Renji wrote:
> As I can see on en.cppreference.com
> <http://en.cppreference.com/w/cpp/string/wide/iswalpha>, iswalpha returns
> true for "any alphabetic character specific to the current *locale*".
> Because of this, on Debian iswalpha(L'か') (か is a hiragana character)
> returns false if I do not call std::setlocale before iswalpha. And
> iswalpha(L'か') returns true if I call std::setlocale with *any* valid
> locale name (tested with en_US.UTF-8, ru_RU.UTF-8 and even C.UTF-8).
> This makes no sense. I am working with *Uni*code characters, so why do I
> need locales? And if the locale contains some important data, why does
> the name of the locale not matter?
>
> Proposal: iswalpha must not depend on any locale.
That's an implementation detail, not a C standard issue.
The reason is that until you call setlocale(), the runtime in your C++
Standard Library implementation knows only of the US-ASCII C locale, in
which the only alphabetic characters are those in US-ASCII. That is, the
default locale is "C.ANSI_X3.4-1968".
The moment you set the locale to something Unicode, then it knows about the
entire Unicode range.
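The behaviour described above can be checked with a short test. What follows is a minimal sketch assuming C++17, not code from this thread: `is_alpha_in_locale` is a hypothetical helper name, and the answer for non-ASCII characters depends on the C library and on which locales are installed.

```cpp
#include <clocale>
#include <cwctype>
#include <optional>
#include <string>

// Hypothetical helper: classify `c` with std::iswalpha while the global C
// locale (LC_CTYPE only) is temporarily set to `name`. Returns std::nullopt
// if the requested locale is not available on this system.
std::optional<bool> is_alpha_in_locale(const char* name, wchar_t c)
{
    // Remember the current LC_CTYPE setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, name) == nullptr)
        return std::nullopt;

    const bool result = std::iswalpha(static_cast<std::wint_t>(c)) != 0;
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

On the Debian system from the first post one would expect the helper to report false for L'か' under "C" and true under "C.UTF-8", but neither result is guaranteed by the standard. Note that std::setlocale changes process-wide state, so a helper like this is not safe to call from several threads at once.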
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Aso Renji <asorenji@gmail.com>
Date: Sat, 25 Jun 2016 21:19:07 +0300
Thiago Macieira <thiago@macieira.org> wrote on Sat, 25 Jun 2016 20:34:47 +0300:
> The reason is that until you call setlocale(), the runtime in your C++
> Standard Library implementation knows only of the US-ASCII C locale
In other words, iswalpha accepts a Unicode character (wint_t) but
recognizes only ASCII characters. No, this is a C standard issue. If you say
"I can work with Unicode", you must support all Unicode characters
(Unicode includes all national alphabets). If you say "I do not support
Unicode characters", you must not accept Unicode characters.
Or maybe wint_t is not Unicode? In that case, do you know of any other
character encoding with wide (not multi-byte) characters?
.
Author: Aso Renji <asorenji@gmail.com>
Date: Sat, 25 Jun 2016 21:25:02 +0300
Bo Persson <bop@gmb.dk> wrote on Sat, 25 Jun 2016 14:24:06 +0300:
> Don't know about Japanese, but in countries using Latin alphabets the
> number of characters in the national alphabet varies.
Unicode contains ALL national alphabets. Therefore the number of characters
in Unicode does NOT vary. And isWalpha works with Unicode characters.
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sat, 25 Jun 2016 11:39:10 -0700
On Saturday, 25 June 2016 21:19:07 PDT Aso Renji wrote:
> Thiago Macieira <thiago@macieira.org> wrote on Sat, 25 Jun 2016 20:34:47 +0300:
> > The reason is that until you call setlocale(), the runtime in your C++
> > Standard Library implementation knows only of the US-ASCII C locale
>
> In other words, iswalpha accepts a Unicode character (wint_t) but
> recognizes only ASCII characters. No, this is a C standard issue. If you say
> "I can work with Unicode", you must support all Unicode characters
> (Unicode includes all national alphabets). If you say "I do not support
> Unicode characters", you must not accept Unicode characters.
>
> Or maybe wint_t is not Unicode? In that case, do you know of any other
> character encoding with wide (not multi-byte) characters?
wchar_t and wint_t are not required to be Unicode. The call to setlocale()
changes the encoding they use.
You should use char16_t and char32_t to be sure to have Unicode.
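To illustrate the difference: an implementation may define the macro `__STDC_ISO_10646__` to promise that wchar_t values are ISO 10646 code points, while a char32_t character literal has the code point as its value regardless of locale. A minimal sketch follows; `wchar_is_iso10646` and `hiragana_ka` are names made up for this example.

```cpp
// True if the implementation promises that wchar_t values are ISO 10646
// (Unicode) code points. The macro is optional, so `false` only means
// "no promise was made", not "wchar_t is definitely not Unicode".
constexpr bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// A char32_t literal holding a single character has that character's code
// point as its value. U+304B is HIRAGANA LETTER KA.
constexpr char32_t hiragana_ka = U'\u304B';
static_assert(hiragana_ka == 0x304B, "char32_t holds the code point itself");
```

The static_assert is evaluated at compile time, so no call to setlocale() can influence it.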
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Aso Renji <asorenji@gmail.com>
Date: Sat, 25 Jun 2016 21:50:32 +0300
Thiago Macieira <thiago@macieira.org> wrote on Sat, 25 Jun 2016 21:39:10 +0300:
> You should use char16_t and char32_t to be sure to have Unicode.
But an iswalpha that takes a char16_t argument does not exist. In that case
I change my proposal to adding versions of isalpha for UTF-16/UTF-32
characters. For example, isu16alpha.
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 25 Jun 2016 12:06:43 -0700 (PDT)
On Saturday, June 25, 2016 at 2:25:01 PM UTC-4, Aso Renji wrote:
>
> Bo Persson <b...@gmb.dk> wrote on Sat, 25 Jun 2016 14:24:06 +0300:
>
> > Don't know about Japanese, but in countries using Latin alphabets the
> > number of characters in the national alphabet varies.
> Unicode contains ALL national alphabets. Therefore the number of characters
> in Unicode does NOT vary. And isWalpha works with Unicode characters.
>
First, the number of Unicode codepoints does vary. Newer versions add new
valid Unicode codepoints. There is a fixed upper limit of course, but there
are large blocks of that limit which are unallocated.
Second, `iswalpha` does not work with Unicode. It works with wide
characters, which *may be Unicode*. They also may not. It depends on your
implementation, and your implementation may depend on your locale.
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 25 Jun 2016 12:09:46 -0700 (PDT)
On Saturday, June 25, 2016 at 2:50:32 PM UTC-4, Aso Renji wrote:
>
> Thiago Macieira <thi...@macieira.org> wrote on Sat, 25 Jun 2016 21:39:10 +0300:
>
> > You should use char16_t and char32_t to be sure to have Unicode.
> But an iswalpha that takes a char16_t argument does not exist. In that case
> I change my proposal to adding versions of isalpha for UTF-16/UTF-32
> characters. For example, isu16alpha.
>
Here's the problem with that. Unicode has very complex rules about what
constitutes an "alphabetic" character. Indeed, it has a large list; for
every codepoint it defines, it says if that codepoint is alphabetic or not.
So too for many other questions like `isupper` and so forth.
That's a lot of tables to be including. I believe that the table can be
compacted via clever programming to be of reasonable length. But that would
require a good implementation to prove it.
Also, such functions cannot be applied to a `char16_t`, because a
`char16_t` is not a Unicode codepoint. It is only a UTF-16 code unit, which
may be a valid codepoint by itself or may be one half of a surrogate pair
that encodes a valid codepoint, in accord with the UTF-16 encoding rules.
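To make the code unit versus codepoint distinction concrete, here is a minimal sketch, assuming C++17, of how a codepoint is recovered from UTF-16. `decode_utf16_at` is a hypothetical helper written for this example, not a standard library function.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: decode the code point that starts at s[i] in a
// UTF-16 string. Returns std::nullopt if i is out of range or if s[i] is
// an unpaired surrogate.
std::optional<char32_t> decode_utf16_at(std::u16string_view s, std::size_t i)
{
    if (i >= s.size())
        return std::nullopt;

    const char16_t u = s[i];
    if (u < 0xD800 || u > 0xDFFF)
        return static_cast<char32_t>(u);  // one code unit is the code point

    // A high surrogate must be followed by a low surrogate.
    if (u <= 0xDBFF && i + 1 < s.size()) {
        const char16_t lo = s[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            const char32_t hi10 = static_cast<char32_t>(u) - 0xD800;
            const char32_t lo10 = static_cast<char32_t>(lo) - 0xDC00;
            return static_cast<char32_t>(0x10000 + (hi10 << 10) + lo10);
        }
    }
    return std::nullopt;  // unpaired surrogate
}
```

A classification function for UTF-16 text would therefore need to look at up to two code units, which is why a signature taking a single `char16_t` cannot cover codepoints above U+FFFF.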
.
Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Sat, 25 Jun 2016 22:52:25 +0300
On Sat, Jun 25, 2016 at 11:48 AM, Renji <asorenji@gmail.com> wrote:
> As I can see on en.cppreference.com, iswalpha returns true for "any
> alphabetic character specific to the current locale". Because of this, on
> Debian iswalpha(L'か') (か is a hiragana character) returns false if I do
> not call std::setlocale before iswalpha. And iswalpha(L'か') returns true
> if I call std::setlocale with any valid locale name (tested with
> en_US.UTF-8, ru_RU.UTF-8 and even C.UTF-8). This makes no sense. I am
> working with Unicode characters, so why do I need locales? And if the
> locale contains some important data, why does the name of the locale not
> matter?
>
> Proposal: iswalpha must not depend on any locale.
> PS Sorry if my English is bad.
I think Thiago has it: when wchar_t and iswalpha were designed,
Unicode hadn't won yet, so they're designed to work with non-Unicode
encodings. I don't know of any truly double-byte encodings other than
UTF-16, but wchar_t could be 4 bytes and encode GB-18030 instead of
UTF-32.
The committee is interested in improving our Unicode support, and
we're actively looking at proposals to do so, including
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0353r0.html
and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0244r1.html.
I think we'd also welcome a set of predicate functions that assume
their input is in one of the UTF formats and check for the unicode
character class. You may have to figure something out for
multi-code-unit characters, or limit the proposal to char32_t.
Jeffrey (Library Evolution chair)
.
Author: asorenji@gmail.com
Date: Sat, 25 Jun 2016 13:00:53 -0700 (PDT)
On Saturday, 25 June 2016 at 22:09:47 UTC+3, Nicol Bolas wrote:
>
> That's a lot of tables to be including. I believe that the table can be
> compacted via clever programming to be of reasonable length. But that would
> require a good implementation to prove it.
>
bool isu32alpha(wint_t code)
{
    std::setlocale(LC_ALL,"any_unicode_locale");//implementation defined
    return iswalpha(code);
}
I have already tested this (see the first post), and it works very well.
Or:
bool isu32alpha(wint_t code)
{
    // table holds one bit per code point
    return code<=0x10FFFF ? (table[code/8]&(1<<(code%8)))!=0 : false;
}
A table of 0x110000 bits (136 KiB) is a reasonable length. At least if you
keep this table in a separate cpp file.
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sat, 25 Jun 2016 17:04:31 -0700 (PDT)
On Saturday, June 25, 2016 at 4:00:54 PM UTC-4, asor...@gmail.com wrote:
>
> On Saturday, 25 June 2016 at 22:09:47 UTC+3, Nicol Bolas wrote:
>>
>> That's a lot of tables to be including. I believe that the table can be
>> compacted via clever programming to be of reasonable length. But that
>> would require a good implementation to prove it.
>>
> bool isu32alpha(wint_t code)
> {
>     std::setlocale(LC_ALL,"any_unicode_locale");//implementation defined
>     return iswalpha(code);
> }
> I have already tested this (see the first post), and it works very well.
>
Implementation-dependent code is *implementation dependent*. So you've
proved nothing.
> Or:
> bool isu32alpha(wint_t code)
> {
>     // table holds one bit per code point
>     return code<=0x10FFFF ? (table[code/8]&(1<<(code%8)))!=0 : false;
> }
> A table of 0x110000 bits (136 KiB) is a reasonable length. At least if you
> keep this table in a separate cpp file.
>
... Have you looked at the Unicode tables? They are not small. Like I said,
there are ways to make them smaller (and your code makes them far bigger
than necessary, since only ~15% of the codepoint range is assigned). But
there is no proof-of-concept that shows that it won't bloat executables by
100KB.
Also, there is no guarantee that `wint_t` can store a Unicode codepoint, so
that API isn't reasonable. If you're serious about Unicode support, you
need to focus on the types that actually store Unicode encodings.
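One common way to shrink such a table is to store sorted ranges of codepoints and binary-search them instead of keeping one bit per codepoint. The following is only a sketch of that idea, not a proof-of-concept for real Unicode data: the three ranges are toy data chosen for illustration, and `CodePointRange`, `toy_alpha_ranges` and `is_alpha_toy` are hypothetical names.

```cpp
#include <algorithm>
#include <array>

// One inclusive range of code points that count as alphabetic.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Toy data, sorted by `first` and non-overlapping. A real table would be
// generated from the Unicode Character Database and be far longer.
constexpr std::array<CodePointRange, 3> toy_alpha_ranges{{
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x3041, 0x3096},  // hiragana letters
}};

// Binary search: find the first range whose `last` is not below `cp`,
// then check that `cp` is not below that range's `first`.
bool is_alpha_toy(char32_t cp)
{
    const auto it = std::lower_bound(
        toy_alpha_ranges.begin(), toy_alpha_ranges.end(), cp,
        [](const CodePointRange& r, char32_t value) { return r.last < value; });
    return it != toy_alpha_ranges.end() && it->first <= cp;
}

// For comparison, the flat one-bit-per-code-point table needs
// 0x110000 / 8 = 139264 bytes (136 KiB).
static_assert(0x110000 / 8 == 139264, "size of the flat bitmap in bytes");
```

The trade-off is a lookup that costs O(log n) comparisons rather than a single indexed load; how large n is for the real alphabetic property is exactly what a proof-of-concept would have to show.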
--=20
You received this message because you are subscribed to the Google Groups "=
ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp=
..org/d/msgid/std-proposals/84d2b9e6-a29b-4c79-8646-2f00dbc5087b%40isocpp.or=
g.
------=_Part_2925_1187302083.1466899471133
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr">On Saturday, June 25, 2016 at 4:00:54 PM UTC-4, asor...@gm=
ail.com wrote:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin-l=
eft: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"ltr"=
>=D1=81=D1=83=D0=B1=D0=B1=D0=BE=D1=82=D0=B0, 25 =D0=B8=D1=8E=D0=BD=D1=8F 20=
16 =D0=B3., 22:09:47 UTC+3 =D0=BF=D0=BE=D0=BB=D1=8C=D0=B7=D0=BE=D0=B2=D0=B0=
=D1=82=D0=B5=D0=BB=D1=8C Nicol Bolas =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=
=D0=BB:<blockquote class=3D"gmail_quote" style=3D"margin:0;margin-left:0.8e=
x;border-left:1px #ccc solid;padding-left:1ex"><div dir=3D"ltr">That's =
a lot of tables to be including. I believe that the table can be compacted =
via clever programming to be of reasonable length. But that would require a=
good implementation to prove it.<br></div></blockquote><div><div>bool isu3=
2alpha(wint_t code)<br>{</div><div>=C2=A0 =C2=A0 std::setlocale(LC_ALL,&quo=
t;any_<wbr>unicode_locale");//<wbr>implementation defined</div><div>=
=C2=A0 =C2=A0 return iswalpha(code);<br>}</div><div>I'm already test th=
is (see first post), it work very well.</div></div></div></blockquote><div>=
<br>Implementation-dependent code is <i>implementation dependent</i>. So yo=
u've proved nothing.<br><br></div><blockquote class=3D"gmail_quote" sty=
le=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left=
: 1ex;"><div dir=3D"ltr"><div><div>Or:</div></div><div>bool isu32alpha(wint=
_t code)=C2=A0</div><div>{</div><div>=C2=A0 =C2=A0=C2=A0return code<=3D0=
x10FFFF?table[code/8]&(<wbr>1<<(code%8))]:false;<br>}</div><div>0=
x110000 bites table - reasonable length. At least if you keep this table in=
separate cpp file.<br></div></div></blockquote><div><br>... Have you looke=
d at the Unicode tables? They are not small. Like I said, there are ways to=
make them smaller (and your code makes them far bigger than necessary, sin=
ce only ~15% of the codepoint range is assigned). But there is no proof-of-=
concept that shows that it won't bloat executables by 100KB.<br><br>Als=
o, there is no guarantee that `wint_t` can store a Unicode codepoint, so th=
at API isn't reasonable. If you're serious about Unicode support, y=
ou need to focus on the types that actually store Unicode encodings.<br></d=
iv></div>
------=_Part_2925_1187302083.1466899471133--
------=_Part_2924_151977685.1466899471132--
.
Author: asorenji@gmail.com
Date: Sat, 25 Jun 2016 20:29:50 -0700 (PDT)
Raw View
------=_Part_3236_51026223.1466911791000
Content-Type: multipart/alternative;
boundary="----=_Part_3237_1608851807.1466911791001"
------=_Part_3237_1608851807.1466911791001
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, 26 June 2016 at 3:04:31 UTC+3, Nicol Bolas wrote:
>
> ... Have you looked at the Unicode tables? They are not small. Like I
> said, there are ways to make them smaller (and your code makes them far
> bigger than necessary, since only ~15% of the codepoint range is assigned).
> But there is no proof-of-concept that shows that it won't bloat executables
> by 100KB.
>
Yes, there are ways to make them smaller. But there is no way to make them
*faster*. Sacrifice time to save 100KB? We no longer load programs from
floppy disks; we have gigabytes of RAM and at least gigabytes of disk space.
In the modern world, 100KB is an insignificant price to pay for speed. In any
case, on the desktop, executables get their C function implementations from
the libc6-dev package on Linux or from User32.dll on Windows. If Linux or
Windows grows by 100KB, you won't even notice.
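To make the size-versus-speed trade-off in this exchange concrete, here is a minimal sketch of a two-stage lookup table, one common way to shrink a per-code-point bitmap while keeping lookups constant time. It is only an illustration under stated assumptions: `TwoStageTable`, `build_demo_table` and `kDemoRanges` are hypothetical names, and the four ranges are a tiny hand-picked sample, not real Unicode property data.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical demo data: a few letter ranges, NOT the real Unicode tables.
struct Range { char32_t first, last; };
constexpr Range kDemoRanges[] = {
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x0410, 0x044F},  // basic Cyrillic letters
    {0x3041, 0x3096},  // Hiragana letters
};

using Block = std::array<std::uint8_t, 32>;  // 256 code points, one bit each

struct TwoStageTable {
    std::vector<std::uint16_t> index;  // stage 1: block id per 256 code points
    std::vector<Block> blocks;         // stage 2: deduplicated bit blocks

    // Two array reads and a shift: constant time, no searching.
    bool test(char32_t cp) const {
        if (cp > 0x10FFFF) return false;
        const Block& b = blocks[index[cp >> 8]];
        return ((b[(cp & 0xFF) >> 3] >> (cp & 7)) & 1u) != 0;
    }

    std::size_t bytes() const {
        return index.size() * sizeof(std::uint16_t) + blocks.size() * sizeof(Block);
    }
};

// Builds the table at run time for the demo; a real library would
// presumably generate it offline and ship it as constant data.
TwoStageTable build_demo_table() {
    TwoStageTable t;
    std::map<Block, std::uint16_t> seen;  // identical blocks are stored once
    for (char32_t base = 0; base <= 0x10FFFF; base += 0x100) {
        Block b{};
        for (const Range& r : kDemoRanges) {
            const char32_t lo = std::max(r.first, base);
            const char32_t hi = std::min<char32_t>(r.last, base + 0xFF);
            for (char32_t cp = lo; cp <= hi; ++cp)
                b[(cp & 0xFF) >> 3] |= static_cast<std::uint8_t>(1u << (cp & 7));
        }
        auto it = seen.find(b);
        if (it == seen.end()) {
            it = seen.emplace(b, static_cast<std::uint16_t>(t.blocks.size())).first;
            t.blocks.push_back(b);
        }
        t.index.push_back(it->second);
    }
    return t;
}
```

With this sample data the table needs 4352 index entries and 4 distinct blocks, 8832 bytes in total, against 139264 bytes for a flat bitmap of the whole code space. Real property data would produce more distinct blocks, so the actual saving would have to be measured.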
PS: a *single* std::regex bloats executables by 100KB
<http://stackoverflow.com/questions/28931088/huge-program-size-c-with-stdregex>.
I think we must remove std::regex from the C++11 standard library.
> Also, there is no guarantee that `wint_t` can store a Unicode codepoint,
> so that API isn't reasonable. If you're serious about Unicode support, you
> need to focus on the types that actually store Unicode encodings.
>
Okay, char32_t.
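The point about `wint_t` versus `char32_t` can be checked at compile time. The sketch below is an illustration only: `kMaxCodePoint`, `kWideCharHoldsAnyCodePoint` and `is_scalar_value` are hypothetical names, not part of any proposal in this thread. It relies on the fact that `char32_t` has at least 32 bits, while the width of `wchar_t` is implementation-defined (for example 16 bits on Windows).

```cpp
#include <limits>

// Highest code point in the Unicode code space.
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// char32_t is at least 32 bits wide, so this holds on every conforming
// implementation.
static_assert(std::numeric_limits<char32_t>::max() >= kMaxCodePoint,
              "char32_t must be able to hold every code point");

// wchar_t gives no such guarantee; this constant is false where wchar_t
// is a 16-bit type.
constexpr bool kWideCharHoldsAnyCodePoint =
    static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;

// True for code points that are valid Unicode scalar values, that is,
// inside the code space and outside the UTF-16 surrogate range.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= kMaxCodePoint && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

A predicate taking `char32_t` therefore has a well-defined domain on every platform, which is the property the `wint_t` version lacks.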
------=_Part_3237_1608851807.1466911791001--
------=_Part_3236_51026223.1466911791000--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sat, 25 Jun 2016 21:07:51 -0700
Raw View
On Saturday, 25 June 2016 20:29:50 PDT asorenji@gmail.com wrote:
> Yes, there are ways to make them smaller. But there is no way to make them
> *faster*. Sacrifice time to save 100KB? We no longer load programs from
> floppy disks; we have gigabytes of RAM and at least gigabytes of disk space.
No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?
Another reason is that often the library gets supplied with the executable.
Once we load the entire set of Unicode tables for all attributes, it may be
well over a megabyte. In fact, the ICU data table is an 18 MB library. That
means your small 100-line C++ program is at least that big.
I don't mean to discourage you. I do think we need some more Unicode support
in the standard library, like better conversion functions from the current
locale to UTF-16 and 32. Qt has provided that for 15 years, so why not the
standard library?
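For reference, C++11 already has a low-level building block for this kind of conversion: `std::mbrtoc32` from `<cuchar>`, which decodes one character of the current locale's multibyte encoding into a `char32_t`. The helper below is a rough sketch of what using it looks like. `locale_to_utf32` is a hypothetical name, error handling is reduced to "stop at the first bad sequence", and non-ASCII input only works after the program has selected a suitable locale with `std::setlocale`.

```cpp
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <string>

// Converts a NUL-terminated string in the current locale's multibyte
// encoding to UTF-32, stopping at the first invalid or incomplete sequence.
std::u32string locale_to_utf32(const char* s) {
    std::u32string out;
    std::mbstate_t state{};  // zero-initialized: initial conversion state
    const char* p = s;
    const char* end = s + std::strlen(s);
    while (p < end) {
        char32_t c = 0;
        const std::size_t n =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) ||  // invalid sequence
            n == static_cast<std::size_t>(-2)) {  // incomplete sequence
            break;
        }
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3)) {
            continue;  // character came from earlier input, no bytes consumed
        }
        p += (n == 0 ? 1 : n);  // n == 0 means a NUL character was decoded
    }
    return out;
}
```

Having to write this loop by hand for every conversion is arguably an example of the gap described above, compared with a single call in a library such as Qt.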
> > Also, there is no guarantee that `wint_t` can store a Unicode codepoint,
> > so that API isn't reasonable. If you're serious about Unicode support, you
> > need to focus on the types that actually store Unicode encodings.
>
> Okay, char32_t.
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Sat, 25 Jun 2016 22:39:16 -0700 (PDT)
Raw View
------=_Part_3024_1183186347.1466919556788
Content-Type: multipart/alternative;
boundary="----=_Part_3025_633136772.1466919556788"
------=_Part_3025_633136772.1466919556788
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, 26 June 2016 at 7:07:57 UTC+3, Thiago Macieira wrote:
>
> No, we don't. There are many modern microcontroller-class CPUs with less
> than 1 MB of flash and around a hundred kilobytes of RAM, or less. Why
> shouldn't we program them with C++?
>
Because microcontroller-class CPUs can't provide hardware support for
std::thread, std::mutex, etc. Also, I don't think that support is present
in microcontroller operating systems (if the microcontroller has one at all).
Yes, std::thread and iswalpha are different things. But if you write a program
for a microcontroller, you should be prepared for some restrictions and
forget about full C++11 support.
> Another reason is that often the library gets supplied with the executable.
> Once we load the entire set of Unicode tables for all attributes, it may be
> well over a megabyte. In fact, the ICU data table is an 18 MB library. That
> means your small 100-line C++ program is at least that big.
>
In that case my program grows by 18 MB only if I use isu32alpha. Otherwise a
smart compiler can remove the code of the unused function. But if I do use
isu32alpha, I need proper Unicode support. I don't need a defective
wide-character function that can't work with (wide) character codes of 128
and above. If the price of proper support is 18 MB, so be it.
------=_Part_3025_633136772.1466919556788--
------=_Part_3024_1183186347.1466919556788--
.
Author: Patrice Roy <patricer@gmail.com>
Date: Sun, 26 Jun 2016 01:52:14 -0400
Raw View
--001a1142d2227a2c39053628016e
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
18MB means we get away from many platforms where we are today. It's 9 times
the size of my whole current project (with debug info in!) which runs on
embedded devices. Please remember that we want Unicode support, but not at
the cost of not being able to support target platforms which are very much
alive today.
I'll be happy to see good Unicode support proposals at committee meetings,
but I'd strongly advise that they be aware of such issues.
--001a1142d2227a2c39053628016e--
.
Author: asorenji@gmail.com
Date: Sat, 25 Jun 2016 23:04:59 -0700 (PDT)
Raw View
------=_Part_328_1438948542.1466921099459
Content-Type: multipart/alternative;
boundary="----=_Part_329_1097752576.1466921099459"
------=_Part_329_1097752576.1466921099459
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, 26 June 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
>
> 18MB means we get away from many platforms where we are today. It's 9
> times the size of my whole current project (with debug info in!) which runs
> on embedded devices. Please remember that we want Unicode support, but not
> at the cost of not being able to support target platforms which are very
> much alive today.
>
Unicode uses the code space from 0 to 0x10FFFF. Therefore we need at most
0x110000 bits (136 KiB) for a single predicate function. 18 MB is enough for
more than a hundred predicates. I don't think you really need that many.
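The arithmetic behind that estimate, and the flat one-bit-per-code-point lookup it assumes, can be written out as follows. This is a minimal sketch, not the code posted earlier in the thread: `FlatPredicate`, `kCodeSpace` and `kBitmapBytes` are hypothetical names, and the bitmap starts out empty instead of being filled from real Unicode data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCodeSpace = 0x110000;          // code points 0..0x10FFFF
constexpr std::size_t kBitmapBytes = kCodeSpace / 8;  // one bit per code point

// 0x110000 bits = 139264 bytes = 136 KiB per binary predicate.
static_assert(kBitmapBytes == 139264, "136 KiB per predicate");
// 18 MiB holds 135 such bitmaps.
static_assert((18u * 1024 * 1024) / kBitmapBytes == 135, "about 135 predicates");

// Flat bitmap covering the whole code space: constant-time lookup,
// no compression at all.
class FlatPredicate {
public:
    FlatPredicate() : bits_(kBitmapBytes, 0) {}

    void set(char32_t cp) {
        if (cp < kCodeSpace)
            bits_[cp / 8] |= static_cast<std::uint8_t>(1u << (cp % 8));
    }

    bool test(char32_t cp) const {
        return cp < kCodeSpace && ((bits_[cp / 8] >> (cp % 8)) & 1u) != 0;
    }

private:
    std::vector<std::uint8_t> bits_;
};
```

This only covers properties that are a single yes/no answer per code point; anything richer needs more than one bit per entry.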
------=_Part_329_1097752576.1466921099459--
------=_Part_328_1438948542.1466921099459--
.
Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Sun, 26 Jun 2016 11:28:13 +0300
Raw View
On Sun, Jun 26, 2016 at 7:07 AM, Thiago Macieira <thiago@macieira.org> wrote:
> On Saturday, 25 June 2016 20:29:50 PDT asorenji@gmail.com wrote:
>> Yes, there are ways to make them smaller. But there is no way to make them
>> *faster*. Sacrifice time to save 100KB? We no longer load programs from
>> floppy disks; we have gigabytes of RAM and at least gigabytes of disk space.
>
> No, we don't. There are many modern microcontroller-class CPUs with less than
> 1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
> we program them with C++?
>
> Another reason is that often the library gets supplied with the executable.
> Once we load the entire set of Unicode tables for all attributes, it may be
> well over a megabyte. In fact, the ICU data table is an 18 MB library. That
> means your small 100-line C++ program is at least that big.
ICU does have ways to subset its data tables to include only the parts
you use. The proposal author should probably validate that to show us
what kinds of subsets are already possible, but it'll also be possible
to add new subsets if the C++ library wants to make finer-grained
distinctions.
Standard libraries targeting microcontrollers may need ways to only
ship subsets of Unicode according to user assertions about which
characters they actually use, but I'm pretty confident it's possible
to make the size reasonable.
Jeffrey
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 06:08:44 -0700 (PDT)
Raw View
------=_Part_163_695688364.1466946524785
Content-Type: multipart/alternative;
boundary="----=_Part_164_1008336187.1466946524785"
------=_Part_164_1008336187.1466946524785
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
>
> On Sunday, 26 June 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
>>
>> 18MB means we get away from many platforms where we are today. It's 9
>> times the size of my whole current project (with debug info in!) which runs
>> on embedded devices. Please remember that we want Unicode support, but not
>> at the cost of not being able to support target platforms which are very
>> much alive today.
>>
> Unicode uses the code space from 0 to 0x10FFFF. Therefore we need at most
> 0x110000 bits for a single predicate function. 18 MB is enough for a hundred
> predicates. I don't think you really need that many.
>
Again, let's forget that most of that range is not actually assigned and
therefore takes up 0 bits.
Not all of the properties in the Unicode tables are *binary*. Indeed, most
are not. Case-conversion, for example, cannot be binary. It has to specify
how you go from codepoint X to one or more codepoints YZW. For each
codepoint. That cannot take up a single bit per codepoint.
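To illustrate why such a property cannot be a one-bit table, here is a small sketch of a full case mapping entry. The names `FullUpper`, `kSpecialUpper` and `to_upper_demo` are hypothetical, and the table holds only two sample entries plus an ASCII fallback; the two mappings themselves (U+00DF to "SS" and U+FB03 to "FFI") are taken from Unicode's special casing rules.

```cpp
#include <string>

// One entry maps a single code point to up to three code points.
struct FullUpper {
    char32_t from;
    char32_t to[3];
    unsigned char len;  // number of code points actually used in `to`
};

// Tiny sample, not the full Unicode special-casing data.
constexpr FullUpper kSpecialUpper[] = {
    {0x00DF, {0x0053, 0x0053, 0x0000}, 2},  // sharp s -> "SS"
    {0xFB03, {0x0046, 0x0046, 0x0049}, 3},  // ffi ligature -> "FFI"
};

// Returns the uppercase form of one code point as a sequence.
// Only the sample entries and ASCII letters are handled in this demo;
// everything else is returned unchanged.
std::u32string to_upper_demo(char32_t cp) {
    for (const FullUpper& m : kSpecialUpper) {
        if (m.from == cp) return std::u32string(m.to, m.len);
    }
    if (cp >= U'a' && cp <= U'z') {
        return std::u32string(1, static_cast<char32_t>(cp - (U'a' - U'A')));
    }
    return std::u32string(1, cp);
}
```

Each entry here costs well over a hundred bits instead of one, which is why the per-predicate bitmap estimate does not carry over to the whole set of Unicode tables.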
That being said, I firmly believe that 18MB is much larger than the Unicode
tables *need* to be, and that there must be clever ways to make that table
much smaller (on the order of hundreds of kilobytes rather than megabytes).
But as of yet, I have not undertaken the task of *proving* that, so that
doesn't mean much.
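One simple compaction along those lines is to store only the assigned ranges, sorted, and binary-search them, so unassigned code points cost nothing. The sketch below assumes hypothetical names (`CodeRange`, `kSampleLetters`, `in_sample_letters`) and a four-range sample instead of real Unicode data; it shows the technique, not a measured result.

```cpp
#include <algorithm>
#include <iterator>

struct CodeRange { char32_t first, last; };  // inclusive bounds

// Sorted, non-overlapping sample ranges (NOT the full Unicode data).
constexpr CodeRange kSampleLetters[] = {
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x0410, 0x044F},  // basic Cyrillic letters
    {0x3041, 0x3096},  // Hiragana letters
};

// Binary search for the last range starting at or before cp, then
// check whether cp falls inside it. Cost is O(log n) in the number of
// ranges, and code points outside every range take no table space.
bool in_sample_letters(char32_t cp) {
    const CodeRange* begin = std::begin(kSampleLetters);
    const CodeRange* end = std::end(kSampleLetters);
    const CodeRange* it = std::upper_bound(
        begin, end, cp,
        [](char32_t value, const CodeRange& r) { return value < r.first; });
    return it != begin && cp <= (it - 1)->last;
}
```

Four ranges take 32 bytes here. How many ranges a real property needs, and whether the logarithmic lookup is fast enough, is the kind of thing the proof-of-concept asked for in this thread would have to show.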
And even if I'm wrong, I bet we can provide certain very useful features
that require less than the full Unicode table space. For example, I'd bet
that the non-compatibility Unicode normalization forms require much less
table space than the compatibility ones. I'd bet that grapheme cluster
iteration requires much less table space than case conversion.
Again, not proven. But it'd certainly be a worthy research project. If
Unicode normalization functions only cost 15KB of executable room, that
might be a reasonable tradeoff. Obviously, if you don't call them at all,
you should get zero increase.
------=_Part_164_1008336187.1466946524785
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<p></p>
-- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals" group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
To view this discussion on the web visit <a href=3D"https://groups.google.c=
om/a/isocpp.org/d/msgid/std-proposals/4d975edd-1c9c-468d-b852-28e2a2108931%=
40isocpp.org?utm_medium=3Demail&utm_source=3Dfooter">https://groups.google.=
com/a/isocpp.org/d/msgid/std-proposals/4d975edd-1c9c-468d-b852-28e2a2108931=
%40isocpp.org</a>.<br />
------=_Part_164_1008336187.1466946524785--
------=_Part_163_695688364.1466946524785--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 06:11:31 -0700 (PDT)
------=_Part_190_360242826.1466946692128
Content-Type: multipart/alternative;
boundary="----=_Part_191_1023888791.1466946692128"
------=_Part_191_1023888791.1466946692128
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 12:07:57 AM UTC-4, Thiago Macieira wrote:
>
> On Saturday, 25 June 2016 20:29:50 PDT asor...@gmail.com wrote:
> > Yes, there are ways to make them smaller. But there is no way to make
> > them *faster*. Sacrifice time to save 100KB? We don't load programs
> > from floppy anymore, we have gigabytes of RAM and at least gigabytes
> > of disk space. In
>
> No, we don't. There are many modern microcontroller-class CPUs with less
> than 1 MB of flash and around a hundred kilobytes of RAM, or less. Why
> shouldn't we program them with C++?
>
To be fair, microcontroller code probably isn't doing Unicode case-conversion or string comparisons. So as long as the mere *presence* of such functions in the standard library doesn't cause bloat (and why would it?), I don't think that case would be a problem.
------=_Part_191_1023888791.1466946692128--
------=_Part_190_360242826.1466946692128--
.
Author: asorenji@gmail.com
Date: Sun, 26 Jun 2016 07:03:01 -0700 (PDT)
------=_Part_3375_370609748.1466949781613
Content-Type: multipart/alternative;
boundary="----=_Part_3376_1237938054.1466949781613"
------=_Part_3376_1237938054.1466949781613
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 11:28:35 UTC+3, Jeffrey Yasskin wrote:
>
> ICU does have ways to subset its data tables to include only the parts
> you use. The proposal author should probably validate that to show us
> what kinds of subsets are already possible, but it'll also be possible
> to add new subsets if the C++ library wants to make finer-grained
> distinctions.
>
Okay, let's use the subset model. Unicode has 273 blocks, 271792 code points in total <https://en.wikipedia.org/wiki/Plane_(Unicode)#Overview>. We can write something like this:

#include <algorithm>
#include <cstddef>
#include <cstdint>

struct u32_subset
{
    int32_t lower_bound, size;   // first code point of the block, block length
    const uint8_t* table;        // one bit per code point in the block
    // A block is "less than" a code point when it ends before that code point.
    bool operator<(int32_t code) const { return lower_bound + size <= code; }
};

bool isu32alpha(int32_t code)
{
    static const u32_subset unicode_blocks[273] = {/*some large table*/};

    // First block that does not end before `code`.
    const u32_subset* subset = std::lower_bound(unicode_blocks, unicode_blocks + 273, code);
    if (subset == unicode_blocks + 273 || code < subset->lower_bound)
        return false;            // code lies past the last block or in a gap
    const std::size_t offset = code - subset->lower_bound;
    return (subset->table[offset / 8] & (1 << (offset & 7))) != 0;
}

Each u32_subset needs 4 bytes for lower_bound, 4 bytes for size, and 8 bytes for the table pointer: 273*16=4368 bytes in total.
All 273 tables need 271792 bits, or 33974 bytes in total.
4368+33974=38342 bytes in total.
Is 38 KB still so big that your 4+ GB desktop can't afford it?
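The arithmetic above can be checked at compile time. This is only a sketch: the constant names are made up for illustration, and the 16-byte descriptor assumes a 64-bit target where a pointer takes 8 bytes.

```cpp
#include <cstddef>

// Figures taken from the message above; names are hypothetical.
constexpr std::size_t block_count      = 273;
constexpr std::size_t code_point_count = 271792;
constexpr std::size_t descriptor_bytes = 4 + 4 + 8; // lower_bound + size + pointer

constexpr std::size_t index_bytes  = block_count * descriptor_bytes;
constexpr std::size_t bitmap_bytes = (code_point_count + 7) / 8; // one bit each
constexpr std::size_t total_bytes  = index_bytes + bitmap_bytes;

static_assert(index_bytes == 4368, "273 descriptors of 16 bytes each");
static_assert(bitmap_bytes == 33974, "271792 bits packed into bytes");
static_assert(total_bytes == 38342, "about 38 KB for one predicate");
```

Note that this covers a single predicate; each further binary property would need its own set of bit tables, although the block descriptors could be shared.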
------=_Part_3376_1237938054.1466949781613--
------=_Part_3375_370609748.1466949781613--
.
Author: asorenji@gmail.com
Date: Sun, 26 Jun 2016 07:03:49 -0700 (PDT)
------=_Part_160_1301924111.1466949829655
Content-Type: multipart/alternative;
boundary="----=_Part_161_325493963.1466949829656"
------=_Part_161_325493963.1466949829656
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 16:08:45 UTC+3, Nicol Bolas wrote:
>
> On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
>>
>> On Sunday, June 26, 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
>>>
>>> 18MB means we get away from many platforms where we are today. It's 9
>>> times the size of my whole current project (with debug info in!) which
>>> runs on embedded devices. Please remember that we want Unicode support,
>>> but not at the cost of not being able to support target platforms which
>>> are very much alive today.
>>>
>> Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at most
>> 0x110000 bits for a single predicate function. 18 MB is enough for a
>> hundred predicates. I don't think you really need so many.
>>
>
> Again, let's forget that most of that range is not actually assigned and
> therefore takes up 0 bits.
>
> Not all of the properties in the Unicode tables are *binary*. Indeed,
> most are not. Case-conversion, for example, cannot be binary. It has to
> specify how you go from codepoint X to one or more codepoints YZW. For
> each codepoint. That cannot take up a single bit per codepoint.
>
Let's concentrate on predicate functions (those returning a bool). I don't believe that isalpha or isupper have this sort of problem, although we should decide what "isupper" means in languages without upper- and lower-case characters, Japanese for example.

> That being said, I firmly believe that 18MB is much larger than the
> Unicode tables *need* to be. That there must be clever ways to make that
> table much smaller (on the order of hundreds of kilobytes rather than
> megabytes). But as of yet, I have not undertaken the task of *proving*
> that, so that doesn't mean much.
>
In my answer to Jeffrey Yasskin, I compressed the tables to 38 KB. But the price of this is a binary search with about eight (log2(273)) comparisons, and so eight branches that are hard on the branch prediction unit. At least on the desktop, I'd prefer to avoid this.
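For comparison, the flat alternative mentioned earlier (0x110000 bits per predicate) avoids the search entirely. A minimal sketch, assuming one `std::bitset` per predicate; `flat_predicate`, `test_flat` and `make_demo_table` are hypothetical names, and the demo table marks a single code point rather than real Unicode data.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// One bit for every code point in the Unicode codespace, 0 to 0x10FFFF.
constexpr std::size_t codespace_size = 0x110000;
using flat_predicate = std::bitset<codespace_size>;

// 0x110000 bits is 139264 bytes, i.e. 136 KiB per predicate.
static_assert(codespace_size / 8 == 139264, "size of one flat table in bytes");

// No search: a range check followed by one indexed bit test.
inline bool test_flat(const flat_predicate& table, std::int32_t code)
{
    return code >= 0 && static_cast<std::size_t>(code) < codespace_size
        && table[static_cast<std::size_t>(code)];
}

// Hypothetical demo table: only U+304B (hiragana KA) is marked alphabetic.
inline flat_predicate make_demo_table()
{
    flat_predicate table;
    table.set(0x304B);
    return table;
}
```

The cost is roughly 136 KiB per predicate against about 38 KB for the block-based layout, so this is the time-for-space trade being argued about in this thread.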
------=_Part_161_325493963.1466949829656--
------=_Part_160_1301924111.1466949829655--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 08:12:51 -0700 (PDT)
------=_Part_3692_2020040332.1466953971433
Content-Type: multipart/alternative;
boundary="----=_Part_3693_1607879778.1466953971433"
------=_Part_3693_1607879778.1466953971433
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 10:03:02 AM UTC-4, asor...@gmail.com wrote:
>
> On Sunday, June 26, 2016 at 11:28:35 UTC+3, Jeffrey Yasskin wrote:
>>
>> ICU does have ways to subset its data tables to include only the parts
>> you use. The proposal author should probably validate that to show us
>> what kinds of subsets are already possible, but it'll also be possible
>> to add new subsets if the C++ library wants to make finer-grained
>> distinctions.
>>
> Okay, let's use the subset model. Unicode has 273 blocks, 271792 code
> points in total <https://en.wikipedia.org/wiki/Plane_(Unicode)#Overview>.
> We can write something like this:
>
> #include <algorithm>
> #include <cstddef>
> #include <cstdint>
>
> struct u32_subset
> {
>     int32_t lower_bound, size;   // first code point of the block, block length
>     const uint8_t* table;        // one bit per code point in the block
>     // A block is "less than" a code point when it ends before that code point.
>     bool operator<(int32_t code) const { return lower_bound + size <= code; }
> };
>
> bool isu32alpha(int32_t code)
> {
>     static const u32_subset unicode_blocks[273] = {/*some large table*/};
>
>     // First block that does not end before `code`.
>     const u32_subset* subset = std::lower_bound(unicode_blocks, unicode_blocks + 273, code);
>     if (subset == unicode_blocks + 273 || code < subset->lower_bound)
>         return false;            // code lies past the last block or in a gap
>     const std::size_t offset = code - subset->lower_bound;
>     return (subset->table[offset / 8] & (1 << (offset & 7))) != 0;
> }
>
> Each u32_subset needs 4 bytes for lower_bound, 4 bytes for size, and 8
> bytes for the table pointer: 273*16=4368 bytes in total.
> All 273 tables need 271792 bits, or 33974 bytes in total.
> 4368+33974=38342 bytes in total.
> Is 38 KB still so big that your 4+ GB desktop can't afford it?
>
1: The world of C++ is greater than the world of "4+GB desktop" computers.

2: Why are you using `lower_bound` for a table? That's like making an `unordered_map`, then using a range-for loop to search for an item by its key. You should be able to get the exact block index for a Unicode codepoint with some simple mathematics.

3: 38KB is still *way* too big for this information.
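One common way to find the right table entry by arithmetic rather than by searching is a two-stage table with fixed-size pages; this may not be exactly the scheme meant in point 2. A minimal sketch: `two_stage_table`, `build_two_stage` and `demo_is_alpha` are hypothetical names, the 256-code-point page size is an arbitrary choice, and the demo predicate stands in for real Unicode data.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Two-stage lookup: code >> 8 selects a page, code & 0xFF selects a bit in it.
constexpr std::size_t page_bits  = 256;
constexpr std::size_t page_count = 0x110000 / page_bits; // 0x1100 pages
using page = std::array<std::uint8_t, page_bits / 8>;    // 32 bytes per page

struct two_stage_table
{
    std::vector<std::uint16_t> page_index; // one entry per 256-code-point page
    std::vector<page> pages;               // distinct pages only

    bool test(std::int32_t code) const
    {
        if (code < 0 || code >= 0x110000) return false;
        const page& p = pages[page_index[static_cast<std::size_t>(code) >> 8]];
        return (p[(code & 0xFF) >> 3] >> (code & 7)) & 1;
    }
};

// Builds the table from any predicate; identical pages are stored once.
template <class Predicate>
two_stage_table build_two_stage(Predicate is_member)
{
    two_stage_table t;
    std::map<page, std::uint16_t> seen;
    for (std::size_t pg = 0; pg < page_count; ++pg) {
        page bits{};
        for (std::size_t i = 0; i < page_bits; ++i)
            if (is_member(static_cast<std::int32_t>(pg * page_bits + i)))
                bits[i >> 3] |= static_cast<std::uint8_t>(1u << (i & 7));
        auto found = seen.find(bits);
        if (found == seen.end()) {
            found = seen.emplace(bits, static_cast<std::uint16_t>(t.pages.size())).first;
            t.pages.push_back(bits);
        }
        t.page_index.push_back(found->second);
    }
    return t;
}

// Hypothetical stand-in for real Unicode data: ASCII letters plus hiragana.
inline bool demo_is_alpha(std::int32_t c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= 0x3041 && c <= 0x3096);
}
```

A lookup is two array reads and a shift, with no search. Because unassigned ranges collapse into one shared all-zero page, the demo table needs only three distinct pages; how small a table built from the real property data would be is not something this sketch establishes.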
------=_Part_3693_1607879778.1466953971433--
------=_Part_3692_2020040332.1466953971433--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 08:14:09 -0700 (PDT)
------=_Part_3647_45149536.1466954049530
Content-Type: multipart/alternative;
boundary="----=_Part_3648_1790059593.1466954049530"
------=_Part_3648_1790059593.1466954049530
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 10:03:50 AM UTC-4, asor...@gmail.com wrote:
>
> On Sunday, June 26, 2016 at 16:08:45 UTC+3, Nicol Bolas wrote:
>>
>> On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
>>>
>>> On Sunday, June 26, 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
>>>>
>>>> 18MB means we get away from many platforms where we are today. It's 9
>>>> times the size of my whole current project (with debug info in!)
>>>> which runs on embedded devices. Please remember that we want Unicode
>>>> support, but not at the cost of not being able to support target
>>>> platforms which are very much alive today.
>>>>
>>> Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at
>>> most 0x110000 bits for a single predicate function. 18 MB is enough
>>> for a hundred predicates. I don't think you really need so many.
>>>
>>
>> Again, let's forget that most of that range is not actually assigned and
>> therefore takes up 0 bits.
>>
>> Not all of the properties in the Unicode tables are *binary*. Indeed,
>> most are not. Case-conversion, for example, cannot be binary. It has to
>> specify how you go from codepoint X to one or more codepoints YZW. For
>> each codepoint. That cannot take up a single bit per codepoint.
>>
> Let's concentrate on predicate functions (those returning a bool). I
> don't believe that isalpha or isupper have this sort of problem, although
> we should decide what "isupper" means in languages without upper- and
> lower-case characters, Japanese for example.
>
Before you can talk about what things Unicode should have, you should probably look at how Unicode *currently works*. Unicode already has an answer for what the case properties of CJK ideograms are. The only thing any prospective C++ API for Unicode should do is provide Unicode's answers for questions that Unicode has answers to.

>> That being said, I firmly believe that 18MB is much larger than the
>> Unicode tables *need* to be. That there must be clever ways to make that
>> table much smaller (on the order of hundreds of kilobytes rather than
>> megabytes). But as of yet, I have not undertaken the task of *proving*
>> that, so that doesn't mean much.
>>
> In my answer to Jeffrey Yasskin, I compressed the tables to 38 KB.
>
No, you did not. You compressed *a single table*. That's not "tables".
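To illustrate why the other tables do not fit the one-bit-per-codepoint model, here is a sketch of a non-binary property: full upper-case mapping, where one code point can map to several. The struct layout and the `append_upper` helper are hypothetical; the four entries are well-known mappings from Unicode's SpecialCasing data, and a real table would be far larger.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// One entry of a full upper-case mapping: a code point maps to one to three
// code points, so a single bit per code point cannot represent it.
struct case_mapping
{
    char32_t from;
    char32_t to[3];
    unsigned char length;
};

// Sorted by `from` so that a binary search can be used.
constexpr case_mapping upper_special[] = {
    {0x00DF, {0x0053, 0x0053, 0x0000}, 2}, // sharp s      -> "SS"
    {0xFB00, {0x0046, 0x0046, 0x0000}, 2}, // ff ligature  -> "FF"
    {0xFB01, {0x0046, 0x0049, 0x0000}, 2}, // fi ligature  -> "FI"
    {0xFB03, {0x0046, 0x0046, 0x0049}, 3}, // ffi ligature -> "FFI"
};

// Hypothetical helper: appends the upper-case form of `c` to `out`.
// Code points without an entry here are copied unchanged in this sketch.
inline void append_upper(char32_t c, std::u32string& out)
{
    const auto* it = std::lower_bound(
        std::begin(upper_special), std::end(upper_special), c,
        [](const case_mapping& m, char32_t value) { return m.from < value; });
    if (it != std::end(upper_special) && it->from == c)
        out.append(it->to, it->length);
    else
        out.push_back(c);
}
```

Each entry here takes 20 bytes on a typical target, against one bit for a predicate, which is why the 38 KB figure for a single predicate says little about the size of the full set of tables.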
------=_Part_3648_1790059593.1466954049530--
------=_Part_3647_45149536.1466954049530--
.
Author: asorenji@gmail.com
Date: Sun, 26 Jun 2016 09:44:20 -0700 (PDT)
------=_Part_166_1434653002.1466959460790
Content-Type: multipart/alternative;
boundary="----=_Part_167_1889017719.1466959460791"
------=_Part_167_1889017719.1466959460791
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, 26 June 2016 at 18:12:51 UTC+3, Nicol Bolas wrote:
>
>
> 2: Why are you using `lower_bound` for a table? That's like making an
> `unordered_map`, then using a range-for loop to search for an item by its
> key. You should be able to get the exact block index for a Unicode
> codepoint with some simple mathematics.
>
Because Unicode blocks have arbitrary sizes
<https://en.wikipedia.org/wiki/Unicode_blocks>. I can use Unicode planes,
which have a fixed size, and simple mathematics instead. But then you will
start complaining: "Oh, six Unicode planes contain 65536*6=393216 characters!
You want to spend 393216 bits, or 49 KB? That's a very, very large amount of
memory!"
You can get high speed with simple code, or you can get minimum memory
usage. But you can't get both at the same time.
>
> 3: 38KB is still *way* too big for this information.
>
Only if you know a more economical method. If no more economical method
exists, then 38KB is a reasonable size.
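As a rough illustration of the trade-off argued over in this message, the sketch below puts a binary search over a sorted range table next to direct bit indexing. The `Range` struct, the four-entry `kAlphaRanges` table and all function names are assumptions made up for this example; a real table would be generated from the Unicode Character Database and would be far longer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical, heavily truncated table of alphabetic ranges (inclusive),
// sorted by code point and non-overlapping.
struct Range {
    char32_t first;
    char32_t last;
};

constexpr Range kAlphaRanges[] = {
    {0x0041, 0x005A},  // A..Z
    {0x0061, 0x007A},  // a..z
    {0x0410, 0x044F},  // Cyrillic capital A .. small ya
    {0x3041, 0x3096},  // Hiragana letters
};

// Compact, O(log n): find the first range whose end is not below c,
// then check that c is not below that range's start.
inline bool is_alpha_ranges(char32_t c) {
    const Range* it = std::lower_bound(
        std::begin(kAlphaRanges), std::end(kAlphaRanges), c,
        [](const Range& r, char32_t value) { return r.last < value; });
    return it != std::end(kAlphaRanges) && it->first <= c;
}

// Direct indexing, O(1): one bit per code point in a flat bitmap.
inline bool test_bit(const std::uint8_t* bitmap, std::size_t index) {
    return ((bitmap[index / 8] >> (index % 8)) & 1u) != 0;
}

// Size of a flat bitmap for six planes of 65536 code points each.
constexpr std::size_t kFlatBits = 6 * 0x10000;
static_assert(kFlatBits == 393216, "six planes of 65536 code points");
static_assert(kFlatBits / 8 == 49152, "49152 bytes, roughly 49 KB");

// Example checks; call from main() in a test build.
inline void check_range_examples() {
    assert(is_alpha_ranges(0x304B));   // hiragana KA lies inside the last range
    assert(!is_alpha_ranges(0x3000));  // ideographic space is in no range
}
```

The range table only stores entries for runs that are actually alphabetic, which is why it stays small; the bitmap pays one bit for every code point, assigned or not, in exchange for a lookup with no search.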
------=_Part_167_1889017719.1466959460791--
------=_Part_166_1434653002.1466959460790--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sun, 26 Jun 2016 10:14:38 -0700
On Saturday, 25 June 2016 22:39:16 PDT asorenji@gmail.com wrote:
> Because microcontroller-class CPUs can't provide hardware support for
> std::thread, std::mutex, etc. Also, I don't think this support is present
> in microcontroller OSes (if the microcontroller has one at all).
> Yes, std::thread and iswalpha are different things. But if you write a
> program for a microcontroller, you should be prepared for some restrictions
> and forget about full C++11 support.
I think you know as much about microcontroller OSes as you know about
microcontrollers themselves.
I've been working with the people developing Zephyr OS and there they have
fibers and protothreads. There's no reason those primitives couldn't be
supported, if needed.
And let me repeat: I do want some more Unicode support. Just remember the C++
rule of not paying for the cost of things you're not using. Unfortunately,
Unicode character properties are one of those things that cost a lot, just
like the rest of the locale databases and timezones.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sun, 26 Jun 2016 10:24:14 -0700
On Sunday, 26 June 2016 06:08:44 PDT Nicol Bolas wrote:
> That being said, I firmly believe that 18MB is much larger than the Unicode
> tables *need* to be. That there must be clever ways to make that table much
> smaller (on the order of hundreds of kilobytes rather than megabytes). But
> as of yet, I have not undertaken the task of *proving* that, so that
> doesn't mean much.
>
> And even if I'm wrong, I bet we can provide certain very useful features
> that require less than the full Unicode table space. For example, I'd bet
> that the non-compatibility Unicode normalization forms require much less
> table space than the compatibility ones. I'd bet that grapheme cluster
> iteration requires much less table space than case conversion.
ICU comes with a tool to select which properties and which locales to include
in your data pack. It's just not an easy tool to use and I personally know of
no one that has successfully deployed the data file with it.
Un-selecting entries from the "lines" of the database is often a short-sighted
decision. You may think "my application will not be run in Thailand"
and thus remove Thai grapheme support along with its locale
information. But then you may get a Thai customer calling your application's
support and they may not even be in Thailand.
Un-selecting "columns" would be safer, but you often don't know which
properties your application needs. You might think, like you said above, that
you don't need the compatibility normalisations, only to find out that
Internationalised Domain Names do need NFKC.
Also, ICU 57.1 isn't 18 MB:
$ v -h /usr/share/icu/57.1/icudt57l.dat
-rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sun, 26 Jun 2016 10:29:48 -0700
On Sunday, 26 June 2016 06:11:31 PDT Nicol Bolas wrote:
> To be fair, microcontroller code probably isn't doing Unicode
> case-conversion or string comparisons. So as long as the mere *presence* of
> such functions in the standard library doesn't cause bloat (and why would
> it?), I don't think that case would be a problem.
Hopefully, but people often don't remember that when creating their Internet of
Things protocols. Trust me, we're running into that in the Open Connectivity
Foundation: many things dictated by IEEE and the CoRE initiative are case
insensitive and since data is often encoded in UTF-8, the logical
conclusion...
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 12:51:09 -0700 (PDT)
------=_Part_3378_1942953347.1466970669532
Content-Type: multipart/alternative;
boundary="----=_Part_3379_1314286255.1466970669533"
------=_Part_3379_1314286255.1466970669533
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, June 26, 2016 at 12:44:21 PM UTC-4, asor...@gmail.com wrote:
>
> On Sunday, 26 June 2016 at 18:12:51 UTC+3, Nicol Bolas wrote:
>>
>>
>> 2: Why are you using `lower_bound` for a table? That's like making an
>> `unordered_map`, then using a range-for loop to search for an item by its
>> key. You should be able to get the exact block index for a Unicode
>> codepoint with some simple mathematics.
>>
> Because Unicode blocks have arbitrary sizes
> <https://en.wikipedia.org/wiki/Unicode_blocks>.
>
Why divide it up by Unicode's arbitrary blocks to begin with? That makes
searching require a lot of conditional branching and cache misses.
I would start by dividing it into Unicode planes, and dividing each plane
into X regions of Y codepoints apiece. Different properties, depending on
the distribution of attributes, could have different region sizes. The idea
being that you choose region sizes based on the specific distribution within a
plane. The overall goal being that you can jump from a codepoint directly
to the specific set of codepoints from which to fetch the exact value.
If all of the codepoints in a region share the same value, you can instead
jump to a function that returns that default value, rather than fetching it
from a table of duplicate entries. Or rather more to the point, each region
within a plane is implemented as a *function*, which may use a table to
generate the return value, return a single default, employ run-length
encoding, or any number of other tricks to reduce the overall size of the
compiled binary.
All of which would be much faster than a binary search.
>> 3: 38KB is still *way* too big for this information.
>>
> Only if you know a more economical method. If no more economical method
> exists, then 38KB is a reasonable size.
>
OK, allow me to say what I mean a different way. Having that information
available at all is not worth 38KB to most users. The number of people who
truly *need* to know if a codepoint is an alphabetic character or not is
far less than the number of people who need, for example, Unicode
normalization. Or grapheme cluster iteration. And so forth.
`isalpha` is not worth the cost. And no, normalization doesn't use that
property. Even Unicode collation doesn't use that property, and that's
about the most property-laden operation that Unicode offers.
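One possible reading of the region scheme described in this message is a two-stage lookup: shift the codepoint to get a region index, then let a per-region tag decide whether to return a constant or consult a bitmap. The sketch below is only that, a sketch under assumptions: the region size of 256 codepoints, the names `stage1`, `region0_bits` and `is_alpha_two_stage`, and the toy data (only ASCII letters and one CJK region are filled in) are all invented for illustration. It also flattens planes and regions into one index and uses a `switch` where the message suggests a function per region.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRegionShift = 8;  // assumed: 256 codepoints per region
constexpr std::size_t kRegionSize = std::size_t{1} << kRegionShift;
constexpr std::size_t kRegionCount = 0x110000 >> kRegionShift;  // 4352 regions

// How a region answers the query.
enum RegionKind : std::uint8_t { kAllFalse = 0, kAllTrue = 1, kBitmap = 2 };

using Stage1 = std::array<std::uint8_t, kRegionCount>;
using Bitmap = std::array<std::uint8_t, kRegionSize / 8>;

// Toy first-stage table: one kind per region. Only two regions are filled in.
inline const Stage1& stage1() {
    static const Stage1 table = []() -> Stage1 {
        Stage1 t{};          // every region defaults to kAllFalse
        t[0x00] = kBitmap;   // U+0000..U+00FF: mixed content
        t[0x4E] = kAllTrue;  // U+4E00..U+4EFF: CJK ideographs only
        return t;
    }();
    return table;
}

// Toy bitmap for region 0; only ASCII letters are marked.
inline const Bitmap& region0_bits() {
    static const Bitmap bits = []() -> Bitmap {
        Bitmap b{};
        for (unsigned c = 'A'; c <= 'Z'; ++c)
            b[c / 8] = static_cast<std::uint8_t>(b[c / 8] | (1u << (c % 8)));
        for (unsigned c = 'a'; c <= 'z'; ++c)
            b[c / 8] = static_cast<std::uint8_t>(b[c / 8] | (1u << (c % 8)));
        return b;
    }();
    return bits;
}

// No search: one shift picks the region, its kind picks the strategy.
inline bool is_alpha_two_stage(char32_t c) {
    if (c > 0x10FFFF) return false;
    switch (stage1()[c >> kRegionShift]) {
        case kAllTrue:
            return true;
        case kBitmap: {
            const std::size_t i = c & (kRegionSize - 1);
            return ((region0_bits()[i / 8] >> (i % 8)) & 1u) != 0;
        }
        default:
            return false;
    }
}

// Example checks; call from main() in a test build.
inline void check_two_stage_examples() {
    assert(is_alpha_two_stage(U'a'));
    assert(is_alpha_two_stage(0x4E2D));  // inside the all-true CJK region
    assert(!is_alpha_two_stage(U'1'));
}
```

With more than one mixed region, the first stage would have to store an index into a pool of bitmaps rather than a bare tag; identical bitmaps could then be shared between regions, which is where most of the space saving in such tables tends to come from.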
------=_Part_3379_1314286255.1466970669533--
------=_Part_3378_1942953347.1466970669532--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Sun, 26 Jun 2016 12:53:01 -0700 (PDT)
------=_Part_375_1600764210.1466970781683
Content-Type: multipart/alternative;
boundary="----=_Part_376_1237769351.1466970781683"
------=_Part_376_1237769351.1466970781683
Content-Type: text/plain; charset=UTF-8
On Sunday, June 26, 2016 at 1:24:18 PM UTC-4, Thiago Macieira wrote:
>
> On Sunday, 26 June 2016 06:08:44 PDT Nicol Bolas wrote:
> > That being said, I firmly believe that 18MB is much larger than the Unicode
> > tables *need* to be. That there must be clever ways to make that table much
> > smaller (on the order of hundreds of kilobytes rather than megabytes). But
> > as of yet, I have not undertaken the task of *proving* that, so that
> > doesn't mean much.
> >
> > And even if I'm wrong, I bet we can provide certain very useful features
> > that require less than the full Unicode table space. For example, I'd bet
> > that the non-compatibility Unicode normalization forms require much less
> > table space than the compatibility ones. I'd bet that grapheme cluster
> > iteration requires much less table space than case conversion.
>
> ICU comes with a tool to select which properties and which locales to include
> in your data pack. It's just not an easy tool to use and I personally know of
> no one that has successfully deployed the data file with it.
>
> Un-selecting entries from the "lines" of the database is often a short-sighted
> decision. You may think "my application will not be run in Thailand"
> and thus remove Thai grapheme support along with its locale
> information. But then you may get a Thai customer calling your application's
> support and they may not even be in Thailand.
>
> Un-selecting "columns" would be safer, but you often don't know which
> properties your application needs. You might think, like you said above, that
> you don't need the compatibility normalisations, only to find out that
> Internationalised Domain Names do need NFKC.
>
Well, removing specific properties is a compile-time decision, since those
functions simply don't exist. Remember: we're not talking about what to
"remove" necessarily; we're talking about what should be *added* to the
standard library. And if we don't add the compatibility normalization forms
(because we deem them to be too costly), then whatever IDN needs is
irrelevant; the standard library simply doesn't support it.
The general idea I'm trying to get to is that some operations are worth
spending X amount of memory, and some operations are not. We should try to
ascertain how much memory each Unicode operation that uses Unicode
properties costs, so that we can determine which ones are worth supporting
and which ones aren't.
> Also, ICU 57.1 isn't 18 MB:
>
> $ v -h /usr/share/icu/57.1/icudt57l.dat
> -rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat
>
They store it as a file to be directly included? No run-length encoding of
series of elements that contain the same value or anything?
------=_Part_376_1237769351.1466970781683--
------=_Part_375_1600764210.1466970781683--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sun, 26 Jun 2016 13:02:19 -0700
On Sunday, 26 June 2016 12:53:01 PDT Nicol Bolas wrote:
> > Also, ICU 57.1 isn't 18 MB:
> >
> > $ v -h /usr/share/icu/57.1/icudt57l.dat
> > -rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat
>
> They store it as a file to be directly included? No run-length encoding of
> series of elements that contain the same value or anything?
They store it as a binary file (the L is for "little-endian") and they apply
compression to it, as far as I know.
It's just that ICU is more than just the Unicode property tables. It's ALL the
Unicode tables, plus CLDR, plus the IANA/Olson timezone database, plus
whatever else I haven't found out yet.
Qt has a portion of the Unicode tables and CLDR too. We do compress the
Unicode tables, but just for QString & QChar usage, they're several hundred kB
already.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Sun, 26 Jun 2016 13:42:01 -0700 (PDT)
------=_Part_243_579434340.1466973721281
Content-Type: multipart/alternative;
boundary="----=_Part_244_1312212338.1466973721281"
------=_Part_244_1312212338.1466973721281
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Sunday, 26 June 2016 at 22:51:09 UTC+3, Nicol Bolas wrote:
>
>
> Why divide it up by Unicode's arbitrary blocks to begin with? That makes
> searching require a lot of conditional branching and cache misses.
>
Because that excludes unassigned codepoints from the table. After that, yes,
we can use more complicated compression, but at the price of more complicated
and slower code.
>
> OK, allow me to say what I mean a different way. Having that information
> available at all is not worth 38KB to most users. The number of people who
> truly *need* to know if a codepoint is an alphabetic character or not is
> far less than the number of people who need, for example, Unicode
> normalization. Or grapheme cluster iteration. And so forth.
>
If you are happy with a minimal
return (c>='a' && c<='z') || (c>='A' && c<='Z');
implementation, then just use isalpha instead. If you need more, that has
a price.
.
Author: Tony V E <tvaneerd@gmail.com>
Date: Sun, 26 Jun 2016 17:53:14 -0400
Raw View
> If you happy with minimal return (c>='a' && c<='z') || (c>='A' && c<='Z');
> implementation, ...

That ("of course") doesn't work with EBCDIC. :-)

My real point being that none of this stuff is as simple as it seems. And
even when we know it's complicated, it is still more complicated than that.

As suggested by others, if we layer or slice functionality correctly, we can
get close* to 'don't use what you don't pay for', but there is a ton of work
to get there.

[*for sufficiently large values of 'close']

There is a reason ICU is huge. I hope no one is suggesting we could get all
the functionality at 1/10 the price.

Sent from my BlackBerry portable Babbage Device
.
Author: asorenji@gmail.com
Date: Sun, 26 Jun 2016 15:26:55 -0700 (PDT)
Raw View
On Monday, 27 June 2016 at 0:53:17 UTC+3, Tony V E wrote:
>
> As suggested by others, if we layer or slice functionality correctly, we
> can get close* to 'don't use what you don't pay for', but there is a ton
> of work to get there.
>
Okay, how about this?

    template<int64_t flags>
    bool isu32alpha(char32_t code)
    {
        return (flags & EN_LANGUAGE_SUPPORT ? en_isu32alpha(code) : false) ||
               (flags & RU_LANGUAGE_SUPPORT ? ru_isu32alpha(code) : false) ||
               (flags & JA_LANGUAGE_SUPPORT ? ja_isu32alpha(code) : false) ||
               ....
    }

    // default implementation
    bool isu32alpha(char32_t code)
    {
        // code supporting all national alphabets
    }

By default you get support for all national alphabets. That is the best
choice on a desktop, where Unicode support is part of the OS. But if you need
only partial support, you can select which alphabets you need. This is a
compile-time choice, so a smart compiler can remove all unused functions.
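To make the idea above concrete, here is a minimal sketch in C++. The flag constants and the three per-script predicates are hypothetical stand-ins (they cover only a few hard-coded ranges, not real Unicode data), and the trailing "...." of the original is left out; the point is only to show how a compile-time mask can switch individual checks on or off.

```cpp
#include <cstdint>

// Hypothetical flag values; the names follow the message above.
constexpr std::uint64_t EN_LANGUAGE_SUPPORT = 1u << 0;
constexpr std::uint64_t RU_LANGUAGE_SUPPORT = 1u << 1;
constexpr std::uint64_t JA_LANGUAGE_SUPPORT = 1u << 2;

// Toy predicates (assumptions for illustration, not full alphabets).
constexpr bool en_isu32alpha(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}
constexpr bool ru_isu32alpha(char32_t c) {
    // U+0410..U+044F plus the letters U+0401 and U+0451.
    return (c >= 0x0410 && c <= 0x044F) || c == 0x0401 || c == 0x0451;
}
constexpr bool ja_isu32alpha(char32_t c) {
    // Hiragana letters U+3041..U+3096 only.
    return c >= 0x3041 && c <= 0x3096;
}

// Each test is guarded by a constant expression, so for a given Flags value
// the disabled alternatives are known to be false at compile time.
template <std::uint64_t Flags>
constexpr bool isu32alpha(char32_t c) {
    return ((Flags & EN_LANGUAGE_SUPPORT) != 0 && en_isu32alpha(c)) ||
           ((Flags & RU_LANGUAGE_SUPPORT) != 0 && ru_isu32alpha(c)) ||
           ((Flags & JA_LANGUAGE_SUPPORT) != 0 && ja_isu32alpha(c));
}
```

With this sketch, isu32alpha<EN_LANGUAGE_SUPPORT>(c) only ever consults the English predicate, while isu32alpha<EN_LANGUAGE_SUPPORT | JA_LANGUAGE_SUPPORT>(c) also accepts hiragana. Whether the unused tables actually disappear from the binary depends on the compiler and linker.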
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Sun, 26 Jun 2016 21:22:14 -0700
Raw View
On Sunday, 26 June 2016 15:26:55 PDT asorenji@gmail.com wrote:
> But if you need only partial
> support, you can set what alphabets you need. This is compile-time choice,
> so smart compiler can remove all unused functions.
See my email when I said that developers (and product managers) are often
wrong about what locales they'll need in their applications.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 10:30:41 -0400
Raw View
On 2016-06-25 15:06, Nicol Bolas wrote:
> On Saturday, June 25, 2016 at 2:25:01 PM UTC-4, Aso Renji wrote:
>> On 2016-06-25 07:24, Bo Persson wrote:
>>> Don't know about Japanese, but in countries using Latin alphabets the
>>> number of characters in the national alphabet varies.
>> Unicode contains ALL national alphabets. Therefore the number of characters
>> in Unicode does NOT vary. And isWalpha works with Unicode characters.
>
> First, the number of Unicode codepoints does vary. Newer versions add new
> valid Unicode codepoints. There is a fixed upper limit of course, but there
> are large blocks of that limit which are unallocated.
It doesn't vary as a function of the human-language portion of a locale.
(At least, I would hope not on any sane implementation.)
--
Matthew
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 11:11:08 -0400
Raw View
On 2016-06-25 20:04, Nicol Bolas wrote:
> ... Have you looked at the Unicode tables? They are not small. Like I said,
> there are ways to make them smaller (and your code makes them far bigger
> than necessary, since only ~15% of the codepoint range is assigned). But
> there is no proof-of-concept that shows that it won't bloat executables by
> 100KB.
Huh? Why on earth would you bake the tables into the executable ROM? Any
sane implementation is going to store them in shared memory.
For grins, I wrote a simple test program that calls 'iswalpha' on its
argument... it is 8618 bytes. (For comparison, I wrote a program that
does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
bytes. Hardly 100 KiB.)
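The test program itself is not shown in the message. As a rough idea of what such a test could look like, here is a sketch; the helper name first_char_is_alpha is made up for this example, and it assumes the argument is encoded in the narrow encoding of the current locale.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>
#include <cwctype>

// Hypothetical helper: decode the first multibyte character of `arg` with
// the current C locale and ask iswalpha() about it.
bool first_char_is_alpha(const char* arg) {
    assert(arg != nullptr);
    std::mbstate_t state{};
    wchar_t wc = L'\0';
    const std::size_t n = std::mbrtowc(&wc, arg, std::strlen(arg), &state);
    if (n == 0 || n == static_cast<std::size_t>(-1) ||
        n == static_cast<std::size_t>(-2)) {
        return false;  // empty input, invalid sequence, or incomplete sequence
    }
    return std::iswalpha(static_cast<std::wint_t>(wc)) != 0;
}
```

A main function would typically call std::setlocale(LC_ALL, "") first and then pass argv[1] to the helper. Without the setlocale call the program runs in the "C" locale, which is the situation described at the start of this thread.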
--
Matthew
.
Author: Jean-Marc Bourguet <jm.bourguet@gmail.com>
Date: Mon, 27 Jun 2016 08:18:42 -0700 (PDT)
Raw View
On Saturday, 25 June 2016 at 20:19:06 UTC+2, Aso Renji wrote:
>
> Or, maybe wint_t is not unicode? In this case you know any other character
> encoding with wide (not multi-byte) characters?
>
The encoding used for wchar_t is locale-specific. The encoding model of C
and C++ is that a locale has one charset and three encodings for that
charset:
- a narrow encoding using char as the encoding unit, which can have shift
state and be multi-byte;
- a wide encoding using wchar_t, which may not have shift state and may not
use several wchar_t to represent a code point;
- an external encoding, which has fewer restrictions (the best known is that
end of line may be represented by something other than a single character)
and which is observable by looking at the difference between binary and text
file I/O.
If my understanding is correct, you can have a locale using UTF-8 as narrow
encoding, UTF-32 as wide encoding, and UTF-16 with a BOM at the start and
CR-LF as line separator as external encoding.
I'm pretty sure (I don't have access to that hardware/software combination
anymore) that I've used systems which had available at the same time:
- the ASCII charset, where char and wchar_t use the code point value directly
- ISO-8859-X charsets, where char and wchar_t used the code point value
directly (note that this is different from the Linux behavior, which is to
use the code point value for char and the Unicode code point value for
wchar_t; note also that this is the case I wanted to check if I was
remembering correctly)
- CJK charsets, with char being a serialized form of EUC and wchar_t
grouping the number of chars needed for the character (EUC is a way to
encode a subset of ISO 2022 streams without using shift state, with a
maximum of 4 bytes per character)
- the Unicode charset, where char is UTF-8 and wchar_t is the code point.
Most language/region pairs had variant locales for several charsets (for
instance Latin-1, Latin-9 and Unicode for the French locales, each of which
would have used a different wide encoding).
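One way to observe the narrow-to-wide mapping described above is to decode the same bytes under different locales and compare the resulting wchar_t values. The sketch below is only an illustration: the function name wide_value_in_locale is invented here, and which locale names exist depends entirely on the system.

```cpp
#include <cassert>
#include <clocale>
#include <cstring>
#include <cwchar>
#include <optional>
#include <string>

// Decode the first multibyte character of `bytes` under the named locale.
// Returns std::nullopt if the locale cannot be selected or the bytes do not
// form a complete, non-null character in that locale's narrow encoding.
std::optional<wchar_t> wide_value_in_locale(const char* locale_name,
                                            const char* bytes) {
    assert(locale_name != nullptr && bytes != nullptr);

    // Remember the current LC_CTYPE setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return std::nullopt;
    }

    std::mbstate_t state{};
    wchar_t wc = L'\0';
    const std::size_t n = std::mbrtowc(&wc, bytes, std::strlen(bytes), &state);
    std::setlocale(LC_CTYPE, saved.c_str());

    if (n == 0 || n == static_cast<std::size_t>(-1) ||
        n == static_cast<std::size_t>(-2)) {
        return std::nullopt;
    }
    return wc;
}
```

On a system that has both a Latin-1 and a UTF-8 variant of a locale installed, one could pass each name with the bytes appropriate to its narrow encoding and compare the returned values. Per the description above, Linux would be expected to give the Unicode code point in both cases, while other systems may not.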
Yours
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 11:27:15 -0400
Raw View
On 2016-06-27 11:11, Matthew Woehlke wrote:
> On 2016-06-25 20:04, Nicol Bolas wrote:
>> ... Have you looked at the Unicode tables? They are not small. Like I said,
>> there are ways to make them smaller (and your code makes them far bigger
>> than necessary, since only ~15% of the codepoint range is assigned). But
>> there is no proof-of-concept that shows that it won't bloat executables by
>> 100KB.
>
> Huh? Why on earth would you bake the tables into the executable ROM? Any
> sane implementation is going to store them in shared memory.
>
> For grins, I wrote a simple test program that calls 'iswalpha' on its
> argument... it is 8618 bytes. (For comparison, I wrote a program that
> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
> bytes. Hardly 100 KiB.)
Okay, clarification... yes, you need to store the data *somewhere*. So I
guess you are talking specifically about OS-less embedded platforms
where the executable - including statically linked standard library - is
possibly the only thing on the device.
I'd argue that Unicode support should not be required for freestanding
implementations. (How many people on tiny embedded systems are dealing
with Unicode, anyway?) That seems to solve the problem neatly...
--
Matthew
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 11:56:43 -0400
Raw View
On 2016-06-27 11:27, Matthew Woehlke wrote:
> On 2016-06-27 11:11, Matthew Woehlke wrote:
>> On 2016-06-25 20:04, Nicol Bolas wrote:
>>> ... Have you looked at the Unicode tables? They are not small. Like I said,
>>> there are ways to make them smaller (and your code makes them far bigger
>>> than necessary, since only ~15% of the codepoint range is assigned). But
>>> there is no proof-of-concept that shows that it won't bloat executables by
>>> 100KB.
>>
>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>> sane implementation is going to store them in shared memory.
>>
>> For grins, I wrote a simple test program that calls 'iswalpha' on its
>> argument... it is 8618 bytes. (For comparison, I wrote a program that
>> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
>> bytes. Hardly 100 KiB.)
>
> Okay, clarification... yes, you need to store the data *somewhere*. So I
> guess you are talking specifically about OS-less embedded platforms
> where the executable - including statically linked standard library - is
> possibly the only thing on the device.
>
> I'd argue that Unicode support should not be required for freestanding
> implementations. (How many people on tiny embedded systems are dealing
> with Unicode, anyway?) That seems to solve the problem neatly...
BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
metadata e.g. the table size).
Others:
iswalpha: 3612
iswblank: 68
iswgraph: 3588
iswlower: 4580
iswprint: 3572
iswpunct: 2020
iswspace: 76
iswupper: 4428
iswxdigit: 28
total: 21972
This is just the data table sizes, of course, not including code. Some,
obviously, may be better implemented as pure functions. Also, in
fairness, the code to use these requires a lower_bound (i.e. non-trivial
branching).
(Because the program I used to generate these is overly simplified, I
believe the numbers are all 4 bytes larger than is actually needed.
Also, no iswdigit, as that is specified as true for exactly '0'-'9',
which is trivial to implement statically as two compares and a Boolean and.)
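The exact table encoding behind these numbers is not given. One common compact layout, assumed here purely for illustration, is a sorted array of code points at which the property flips between false and true; a lookup is then a single binary search. The table below holds only three sample ranges and is not real iswalpha data, and this sketch uses std::upper_bound rather than the lower_bound mentioned above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <iterator>

// Sorted code points where the property toggles. Entries at even indices
// start a range, entries at odd indices end it (exclusive).
// Sample data only: A-Z, a-z, and the hiragana letters U+3041..U+3096.
constexpr std::array<char32_t, 6> kAlphaToggles = {
    0x0041, 0x005B, 0x0061, 0x007B, 0x3041, 0x3097};

bool in_toggle_table(char32_t c) {
    assert(std::is_sorted(kAlphaToggles.begin(), kAlphaToggles.end()));
    // Count how many toggle points are <= c; an odd count means c lies
    // inside one of the ranges.
    const auto it =
        std::upper_bound(kAlphaToggles.begin(), kAlphaToggles.end(), c);
    return (std::distance(kAlphaToggles.begin(), it) % 2) == 1;
}
```

The binary search is the "non-trivial branching" cost: each lookup takes a number of comparisons logarithmic in the table size, in exchange for storing only the range boundaries instead of one bit per code point.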
--
Matthew
.
Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Mon, 27 Jun 2016 19:34:08 +0200
Raw View
On Sun, Jun 26, 2016 at 5:03 PM, <asorenji@gmail.com> wrote:
>
>
> On Sunday, 26 June 2016 at 11:28:35 UTC+3, Jeffrey Yasskin wrote:
>>
>>
>> ICU does have ways to subset its data tables to include only the parts
>> you use. The proposal author should probably validate that to show us
>> what kinds of subsets are already possible, but it'll also be possible
>> to add new subsets if the C++ library wants to make finer-grained
>> distinctions.
>>
> Okay, let's use the subset model. Unicode uses 273 blocks, 271792 codes
> in total. We can write something like this:
> struct u32_subset
> {
>     int32_t lower_bound,size;
>     const int8_t*table;
>     bool operator<(int32_t code)const{return lower_bound<code;}
> };
> bool isu32alpha(int32_t code)
> {
>     static const u32_subset unicode_blocks[273]={/*some large table*/};
>
>     const u32_subset&subset=*std::lower_bound(unicode_blocks,unicode_blocks+273,code);
>     size_t offset=code-subset.lower_bound;
>     return offset<subset.size?subset.table[offset/8]&(1<<(offset&7)):false;
> }
>
> u32_subset requires 4 bytes for lower_bound, 4 bytes for size, and 8 bytes
> for the table pointer: 273*16=4368 bytes total.
> All 273 tables require 271792 bits, or 33974 bytes total.
> 4368+33974=38342 bytes total.
> 38 KB is still very big, and your 4+ GB desktop can't afford this?
Look into what ICU actually does to implement functions like u_isalpha
(http://icu-project.org/apiref/icu4c/uchar_8h.html#a86cc4f937e33bcea3772c6faf3e293c1).
I'm guessing it's similar to the encoding Matthew Woehlke used in his
most recent message, but whoever proposes these functions should be
able to answer questions about the most common existing practice.
> Let's concentrate on the predicate functions (those returning a bool). I don't
> believe that isalpha or isupper have this sort of problem. Although we
> should decide what "isupper" means in languages without upper- and lowercase
> characters, Japanese for example.
To elaborate on Nicol's point about Unicode already defining this,
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt says:
304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;
meaning 'か' is Lo, which is the "Letter, other" category:
http://www.unicode.org/reports/tr44/#General_Category_Values.
http://www.unicode.org/reports/tr18/#Compatibility_Properties says
"Letter, other" isn't counted as Uppercase, so it'd return false from
isupper().
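As a rough illustration of reading that record, here is a minimal sketch that splits a UnicodeData.txt line on ';' and pulls out the General Category (the third field). The helper names split_fields, general_category and is_upper_by_category are made up for this example, and treating only "Lu" as uppercase is a simplification: the Uppercase property that TR18 refers to also covers characters with Other_Uppercase.

```cpp
#include <string>
#include <vector>

// Split one UnicodeData.txt record into its semicolon-separated fields.
std::vector<std::string> split_fields(const std::string& record) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = record.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(record.substr(start));
            return fields;
        }
        fields.push_back(record.substr(start, end - start));
        start = end + 1;
    }
}

// Field 0 is the code point, field 1 the name, field 2 the General Category.
std::string general_category(const std::string& record) {
    std::vector<std::string> fields = split_fields(record);
    return fields.size() > 2 ? fields[2] : std::string();
}

// Simplification (assumption of this sketch): only "Lu" counts as uppercase.
bool is_upper_by_category(const std::string& record) {
    return general_category(record) == "Lu";
}
```

Feeding it the HIRAGANA LETTER KA record quoted above yields the category "Lo", so the uppercase test is false for it.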
Jeffrey
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 27 Jun 2016 10:54:17 -0700 (PDT)
Raw View
On Monday, June 27, 2016 at 11:27:29 AM UTC-4, Matthew Woehlke wrote:
>
> On 2016-06-27 11:11, Matthew Woehlke wrote:
> > On 2016-06-25 20:04, Nicol Bolas wrote:
> >> ... Have you looked at the Unicode tables? They are not small. Like I said,
> >> there are ways to make them smaller (and your code makes them far bigger
> >> than necessary, since only ~15% of the codepoint range is assigned). But
> >> there is no proof-of-concept that shows that it won't bloat executables by
> >> 100KB.
> >
> > Huh? Why on earth would you bake the tables into the executable ROM? Any
> > sane implementation is going to store them in shared memory.
> >
> > For grins, I wrote a simple test program that calls 'iswalpha' on its
> > argument... it is 8618 bytes. (For comparison, I wrote a program that
> > does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
> > bytes. Hardly 100 KiB.)
>
> Okay, clarification... yes, you need to store the data *somewhere*. So I
> guess you are talking specifically about OS-less embedded platforms
> where the executable - including statically linked standard library - is
> possibly the only thing on the device.
>
It's not just that. If an application statically links to the standard
library, there's no reason for it to be loading the Unicode table at
runtime.
> I'd argue that Unicode support should not be required for freestanding
> implementations. (How many people on tiny embedded systems are dealing
> with Unicode, anyway?) That seems to solve the problem neatly...
>
Well, the standard already has such requirements. Freestanding
implementations can omit most of the standard library, providing only
support for most of Chapter 18, the type traits from 20.10, and the atomics.
If even <string> isn't a requirement, I see no reason why Unicode
operations would be.
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 27 Jun 2016 10:56:56 -0700 (PDT)
Raw View
On Monday, June 27, 2016 at 11:56:54 AM UTC-4, Matthew Woehlke wrote:
>
> On 2016-06-27 11:27, Matthew Woehlke wrote:
> > On 2016-06-27 11:11, Matthew Woehlke wrote:
> >> On 2016-06-25 20:04, Nicol Bolas wrote:
> >>> ... Have you looked at the Unicode tables? They are not small. Like I said,
> >>> there are ways to make them smaller (and your code makes them far bigger
> >>> than necessary, since only ~15% of the codepoint range is assigned). But
> >>> there is no proof-of-concept that shows that it won't bloat executables by
> >>> 100KB.
> >>
> >> Huh? Why on earth would you bake the tables into the executable ROM? Any
> >> sane implementation is going to store them in shared memory.
> >>
> >> For grins, I wrote a simple test program that calls 'iswalpha' on its
> >> argument... it is 8618 bytes. (For comparison, I wrote a program that
> >> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
> >> bytes. Hardly 100 KiB.)
> >
> > Okay, clarification... yes, you need to store the data *somewhere*. So I
> > guess you are talking specifically about OS-less embedded platforms
> > where the executable - including statically linked standard library - is
> > possibly the only thing on the device.
> >
> > I'd argue that Unicode support should not be required for freestanding
> > implementations. (How many people on tiny embedded systems are dealing
> > with Unicode, anyway?) That seems to solve the problem neatly...
>
> BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
> metadata e.g. the table size).
>
What steps have you taken to determine if your `iswalpha` implementation is
actually returning whether a Unicode codepoint is an alphabetic character?
Because I highly doubt it's really doing that.
.
Author: Aso Renji <asorenji@gmail.com>
Date: Mon, 27 Jun 2016 21:39:29 +0300
Raw View
'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
<std-proposals@isocpp.org> wrote on Mon, 27 Jun 2016 20:34:08 +0300:
> Look into what ICU actually does to implement functions like u_isalpha
> (http://icu-project.org/apiref/icu4c/uchar_8h.html#a86cc4f937e33bcea3772c6faf3e293c1).
(Sarcasm on) It's very simple and fast code. (Sarcasm off) It is an endless
chain of defines with some complicated conversions (see
_UTRIE2_INDEX_FROM_CP). I guess this makes the tables more compact (the file
containing them, uchar_props_data.h, is 304 KB). But it does not make them
faster.
source/common/uchar.c
U_CAPI UBool U_EXPORT2
u_isalpha(UChar32 c) {
    uint32_t props;
    GET_PROPS(c, props);
    return (UBool)((CAT_MASK(props)&U_GC_L_MASK)!=0);
}
source/common/utrie2.h
#define GET_PROPS(c, result) ((result)=UTRIE2_GET16(&propsTrie, c));
#define UTRIE2_GET16(trie, c) _UTRIE2_GET((trie), index, (trie)->indexLength, (c))
#define _UTRIE2_GET(trie, data, asciiOffset, c) \
    (trie)->data[_UTRIE2_INDEX_FROM_CP(trie, asciiOffset, c)]
#define _UTRIE2_INDEX_FROM_CP(trie, asciiOffset, c) \
    ((uint32_t)(c)<0xd800 ? \
        _UTRIE2_INDEX_RAW(0, (trie)->index, c) : \
        (uint32_t)(c)<=0xffff ? \
            _UTRIE2_INDEX_RAW( \
                (c)<=0xdbff ? UTRIE2_LSCP_INDEX_2_OFFSET-(0xd800>>UTRIE2_SHIFT_2) : 0, \
                (trie)->index, c) : \
            (uint32_t)(c)>0x10ffff ? \
                (asciiOffset)+UTRIE2_BAD_UTF8_DATA_OFFSET : \
                (c)>=(trie)->highStart ? \
                    (trie)->highValueIndex : \
                    _UTRIE2_INDEX_FROM_SUPP((trie)->index, c))
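For anyone who finds the macro chain hard to follow, the underlying idea can be shown with a much simpler two-stage table. This is only a sketch of the general technique, not ICU's actual UTrie2 layout: the names TwoStageTable and build_table and the 256-entry block size are assumptions made for the example. The high bits of the code point select a block through an index array, the low bits select an entry inside that block, and identical blocks are stored only once, which is where the compaction comes from.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

const std::uint32_t kBlockBits = 8;                // 256 entries per block (assumed)
const std::uint32_t kBlockSize = 1u << kBlockBits;

struct TwoStageTable {
    std::vector<std::uint16_t> index;  // one entry per block of code points
    std::vector<std::uint8_t> data;    // deduplicated blocks, stored back to back

    std::uint8_t get(std::uint32_t cp) const {
        if (cp > 0x10FFFF) return 0;
        std::size_t block = index[cp >> kBlockBits];
        return data[block * kBlockSize + (cp & (kBlockSize - 1))];
    }
};

// Build the table from any function mapping a code point to a property byte.
template <class Prop>
TwoStageTable build_table(Prop prop) {
    TwoStageTable t;
    std::map<std::vector<std::uint8_t>, std::uint16_t> seen;  // block contents -> block number
    for (std::uint32_t base = 0; base <= 0x10FFFF; base += kBlockSize) {
        std::vector<std::uint8_t> block(kBlockSize);
        for (std::uint32_t i = 0; i < kBlockSize; ++i)
            block[i] = prop(base + i);
        auto it = seen.find(block);
        if (it == seen.end()) {
            // First time this block content appears: append it to the data array.
            std::uint16_t id = static_cast<std::uint16_t>(seen.size());
            it = seen.insert(std::make_pair(block, id)).first;
            t.data.insert(t.data.end(), block.begin(), block.end());
        }
        t.index.push_back(it->second);
    }
    return t;
}
```

A lookup costs two array reads and no branches beyond the range check. The price is the index array, plus whatever blocks fail to deduplicate.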
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 15:57:54 -0400
Raw View
On 2016-06-27 13:56, Nicol Bolas wrote:
> On Monday, June 27, 2016 at 11:56:54 AM UTC-4, Matthew Woehlke wrote:
>> BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
>> metadata e.g. the table size).
>
> What steps have you taken to determine if your `iswalpha` implementation is
> actually returning whether a Unicode codepoint is an alphabetic character?
> Because I highly doubt it's really doing that.
I didn't actually write the function; I wrote a little program to
compute the data tables. Said program generates them by... invoking
iswalpha. I therefore submit that the assumption of accuracy is not
unreasonable :-). (And yes, I called setlocale first.)
I take it you haven't figured out how I achieved the reported sizes?
In effect, I consider the data table to be a bitmap (one bit per code
point) of whether a particular attribute applies or not, which is then
compressed by RLE, only I record the offset of the next run rather than
the run length. (Storing the value per run is not necessary, since it
can only alternate between two possible values.) To determine if a
particular code point has the attribute, one therefore uses the code
point to do a binary lower bound search into the array; the attribute
state is the resulting array index modulo 2.
In actuality, my program iterates over possible code points, checks if
the attribute state for the current code point differs from the
attribute state for the previous code point, and if so dumps an index
record. Counting the number of outputted records and multiplying by the
size of the index gives the number of bytes needed to store the table.
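The scheme described above can be sketched roughly as follows. This is one possible reading of it, not the actual program: build_runs and has_attribute are hypothetical names, the state before code point 0 is assumed to be "false", and the lookup uses std::upper_bound and takes the parity of the number of run starts at or before the code point.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk the code points and record each one where the predicate's value changes.
// The state before code point 0 is assumed to be "false".
template <class Pred>
std::vector<std::uint32_t> build_runs(Pred pred, std::uint32_t last = 0x10FFFF) {
    std::vector<std::uint32_t> starts;
    bool state = false;
    for (std::uint32_t cp = 0; cp <= last; ++cp) {
        bool now = pred(cp) != 0;
        if (now != state) {
            starts.push_back(cp);
            state = now;
        }
    }
    return starts;
}

// Count the run starts at or before cp: an odd count means the attribute is
// set, an even count means it is not.
bool has_attribute(const std::vector<std::uint32_t>& starts, std::uint32_t cp) {
    std::size_t n = static_cast<std::size_t>(
        std::upper_bound(starts.begin(), starts.end(), cp) - starts.begin());
    return (n % 2) == 1;
}
```

To reproduce the kind of measurement described here, one could pass a predicate that forwards to std::iswalpha after calling std::setlocale, and then compute starts.size() * sizeof(std::uint32_t). That assumes wchar_t on the platform can hold every code point. A table of 3612 bytes corresponds to 903 four-byte entries.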
Disclaimer: The data is for code points 0 - 0x10FFFF. And, obviously
(given the above), the values are correct for the behavior of the
various functions on my system, which may be out of date.
And, in fact, the previously mentioned sizes could in theory be reduced
by an additional 25% by packing the indices into three bytes each rather
than four (a little more, even, if packed into 21 *bits* each), but the
further performance hit from doing so might not be worthwhile.
The flip side of course is that those 3612 bytes are ONLY useful for telling
you if a code point "is alphabetic". They're useless for case folding,
normalization, etc. You're maximizing the space optimization of a
specific use case (iswalpha) at the expense of everything else.
--
Matthew
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Mon, 27 Jun 2016 16:19:00 -0400
Raw View
On 2016-06-27 15:57, Matthew Woehlke wrote:
> On Monday, June 27, 2016 at 11:56:54 AM UTC-4, Matthew Woehlke wrote:
>> BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
>> metadata e.g. the table size).
>
> In effect, I consider the data table to be a bitmap (one bit per code
> point) of whether a particular attribute applies or not, which is then
> compressed by RLE, only I record the offset of the next run rather than
> the run length. (Storing the value per run is not necessary, since it
> can only alternate between two possible values.) To determine if a
> particular code point has the attribute, one therefore uses the code
> point to do a binary lower bound search into the array; the attribute
> state is the resulting array index modulo 2.
>
> And, in fact, the previously mentioned sizes could in theory be reduced
> by an additional 25% by packing the indices into three bytes each rather
> than four (a little more, even, if packed into 21 *bits* each), but the
> further performance hit from doing so might not be worthwhile.
Okay, after that and some more thinking... I can get the tables for
*all* attributes down to about 9232 bytes. This is by assuming that I
can encode the code point index into 24 bits and the character class (at
least, bits needed to derive the values for the various isw????
functions) into another 8 bits, so that I still have 4 bytes per run
(and thus don't trash performance by unaligned reads).
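A minimal sketch of that packed layout, under stated assumptions: the class bits kAlpha, kUpper and kLower and the helpers pack_run and classes_of are invented for this example, and the table is assumed to be sorted by run start. Each 32-bit entry holds the run's first code point in the top 24 bits and the class bits in the low 8 bits, so a single binary search answers every isw???? query at once.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical class bits; a real table would define one per function it serves.
const std::uint8_t kAlpha = 1u << 0;
const std::uint8_t kUpper = 1u << 1;
const std::uint8_t kLower = 1u << 2;

// Run start in the top 24 bits, class bits in the low 8 bits.
inline std::uint32_t pack_run(std::uint32_t start, std::uint8_t classes) {
    return (start << 8) | classes;
}

// Return the class bits of the run containing cp (0 if cp precedes every run).
std::uint8_t classes_of(const std::vector<std::uint32_t>& runs, std::uint32_t cp) {
    if (cp > 0x10FFFF || runs.empty()) return 0;
    // (cp << 8) | 0xFF is >= every entry whose run start is <= cp,
    // and < every entry whose run start is > cp.
    std::uint32_t key = (cp << 8) | 0xFFu;
    std::vector<std::uint32_t>::const_iterator it =
        std::upper_bound(runs.begin(), runs.end(), key);
    if (it == runs.begin()) return 0;
    return static_cast<std::uint8_t>(*(it - 1) & 0xFFu);
}
```

Because the run start occupies the high bits, the packed entries sort in the same order as the run starts, and the search works on the raw 32-bit values with aligned reads.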
Of course, this is a further trade-off of space versus performance,
since now every function has to deal with that entire array, compared to
the straight bitmap approach where some had very, very small tables. A
real implementation based on this method would likely want to separate
iswblank/iswspace from the rest; these need only 144 bytes, which is
likely a worthwhile trade-off as it reduces the number of compares
needed by about 7-8, vs. only about 2 for the others.
I'm also assuming independent implementations of iswdigit and iswxdigit
as these are specified being true only for very limited characters (0-9
and 0-9A-Fa-f, respectively).
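For illustration, such table-free implementations might look like the sketch below. The names is_digit and is_xdigit are placeholders for this example, not the standard functions.

```cpp
// Placeholder names; the standard functions are iswdigit and iswxdigit.
inline bool is_digit(char32_t c) {
    return c >= U'0' && c <= U'9';  // two compares and a Boolean and
}

inline bool is_xdigit(char32_t c) {
    return is_digit(c) || (c >= U'A' && c <= U'F') || (c >= U'a' && c <= U'f');
}
```

Neither function needs a data table or a locale, which is why they are left out of the size totals.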
--
Matthew
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 27 Jun 2016 20:39:12 -0700
Raw View
On Monday, 27 June 2016 21:39:29 PDT Aso Renji wrote:
> (Sarcasm on) It's very simple and fast code. (Sarcasm off) It is an endless
> chain of defines with some complicated conversions (see
> _UTRIE2_INDEX_FROM_CP). I guess this makes the tables more compact (the file
> containing them, uchar_props_data.h, is 304 KB). But it does not make them
> faster.
But the same tables support the other Unicode properties. There's a good
chance that code that tests if a given codepoint is an uppercase character
will check elsewhere if it's lowercase, or alphabetic.
That would mean you'd pay a higher up-front cost, but then no more.
Of course, all of this is QoI.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 27 Jun 2016 20:44:59 -0700
Raw View
On Monday, 27 June 2016 11:11:08 PDT Matthew Woehlke wrote:
> Huh? Why on earth would you bake the tables into the executable ROM? Any
> sane implementation is going to store them in shared memory.
Huh... and what do you think executable ROM is? Shared memory. Since it's ROM,
it can't be changed, which means it can be shared between multiple instances
of the same program (if the OS supports shared memory in the first place).
> I'd argue that Unicode support should not be required for freestanding
> implementations. (How many people on tiny embedded systems are dealing
> with Unicode, anyway?) That seems to solve the problem neatly...
You'd be surprised. As I said in another email, if you combine "data is
encoded in UTF-8" (like JSON) with "case-insensitive data", you suddenly
require Unicode tables.
It's very likely the protocols that require this are poorly designed if they
are meant to be implemented in small OSes and in hardware/firmware, but they
exist.
FYI, there are Unicode tables inside the Linux kernel.
- VFAT is case-insensitive.
- VFAT stores filenames in UTF-16.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Mon, 27 Jun 2016 21:29:35 -0700 (PDT)
Raw View
On Tuesday, June 28, 2016 at 6:39:16 UTC+3, Thiago Macieira wrote:
>
> But the same tables support the other unicode properties. There's a good
> chance that code that tests if a given codepoint is an uppercase character
> will check elsewhere if it's lowercase, or alphabetic.
>
The STL supports only 12 properties
<http://www.cplusplus.com/reference/cwctype/iswctype/>. 12 properties for 6
planes of 65536 code points each require 12*6*65536 bits in total, which is
about half a megabyte (589824 bytes). That is not so big if you store it in a
shared library. Yes, it's bigger than the ICU tables, but the code for this
table is far simpler and faster.
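const int isalpha_offset = 0;
const int isdigit_offset = 1;
//...
const int isspace_offset = 11;
const int max_offset = 12;  // number of property bits stored per code point

// One bit table per plane; assumed here to have 17 entries, with a null
// pointer for every plane that carries no data.
extern const int8_t* const plane_table[17];

bool u_isctype(char32_t c, int property)
{
    if (c > 0x10FFFF) return false;
    const int8_t* plane = plane_table[c / 65536];
    uint32_t bit = (c % 65536) * max_offset + property;
    return plane ? (plane[bit / 8] & (1 << (bit & 7))) != 0 : false;
}
bool u_isalpha(char32_t c) { return u_isctype(c, isalpha_offset); }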
.
Author: asorenji@gmail.com
Date: Mon, 27 Jun 2016 21:33:34 -0700 (PDT)
Raw View
On Tuesday, June 28, 2016 at 7:29:35 UTC+3, asor...@gmail.com wrote:
>
> const int8_t*plane=[c/65536];
>
Oops, of course that should be plane_table[c/65536];
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 27 Jun 2016 21:58:26 -0700 (PDT)
Raw View
On Tuesday, June 28, 2016 at 12:29:35 AM UTC-4, asor...@gmail.com wrote:
>
> On Tuesday, 28 June 2016, 6:39:16 UTC+3, Thiago Macieira wrote:
>>
>> But the same tables support the other unicode properties. There's a good
>> chance that code that tests if a given codepoint is an uppercase character
>> will check elsewhere if it's lowercase, or alphabetic.
>>
>
If you're talking about cache coherency... I'm not sure how likely that is.
Generally speaking, most Unicode operations only test a small set of
properties. Normalization (the non-compatibility forms) only cares about
whether it's a combining character and what its composed/decomposed form
is. Case conversion would test the current case of a codepoint (series) and
what its converted form is. It wouldn't be testing if it's lowercase then
uppercase. And so on.
Collation is probably the one where cross-property coherency is likely to
be of the greatest value.
> STL support only 12 properties
> <http://www.cplusplus.com/reference/cwctype/iswctype/>.
>
If the goal is to get *meaningful* Unicode support into the C++ standard,
then it does not matter what properties the standard library currently
supports.
Meaningful Unicode support means having sufficient data to perform Unicode
*operations*. Those 12 properties that the C++ standard library provides
are completely useless for Unicode operations. These operations require
properties, but they require *Unicode* properties.
We shouldn't be trying to make Unicode fit within C/C++'s garbage
interface. We should be trying to improve C++'s interface to match Unicode. Or
rather, *replace* C++'s terrible interface with one that matches Unicode.
------=_Part_127_2123501772.1467089906406--
------=_Part_126_1410433061.1467089906406--
.
Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 27 Jun 2016 22:02:06 -0700 (PDT)
Raw View
------=_Part_25_1518837918.1467090126158
Content-Type: multipart/alternative;
boundary="----=_Part_26_170616054.1467090126158"
------=_Part_26_170616054.1467090126158
Content-Type: text/plain; charset=UTF-8
On Monday, June 27, 2016 at 3:58:10 PM UTC-4, Matthew Woehlke wrote:
> The flip side of course is that those 3612 are ONLY useful for telling
> you if a code point "is alphabetic". They're useless for case folding,
> normalization, etc. You're maximizing the space optimization of a
> specific use case (iswalpha) at the expense of everything else.
>
Right. Which is why I'm confused as to why you created this compression
algorithm, then applied it to the least important and useful character
property data (and from a source of dubious quality at that, rather than
from the actual property tables).
To me, the key question for Unicode support in the standard (when it comes
to table size) is this: what is the actual memory cost for these Unicode
features:
1: Normalization, for each form.
2: Grapheme cluster iteration.
3: Text segmentation.
4: Case conversion.
5: Collation.
Your compression scheme might be a good tool, but you're aiming it in the
wrong direction. Though I find myself concerned about your use of a
log(n)-based algorithm.
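For context, a log(n) lookup usually means a binary search over sorted code point ranges. The sketch below only illustrates that general technique and is not the compression scheme discussed in this thread. The names CodePointRange, kLetterRanges and is_letter_by_ranges are made up for the example, and the table holds just four sample ranges rather than data generated from the real Unicode property tables.

```cpp
#include <algorithm>
#include <array>

// Inclusive code point range [first, last].
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Toy table (an assumption for illustration): sorted by `first`, with no
// overlapping ranges. A real table would be generated from Unicode data.
constexpr std::array<CodePointRange, 4> kLetterRanges{{
    {U'A', U'Z'},
    {U'a', U'z'},
    {0x0410, 0x044F},  // basic Cyrillic letters
    {0x3041, 0x3096},  // Hiragana letters
}};

// O(log n) in the number of ranges: find the first range whose `last` is
// not below cp, then check that cp is not below that range's `first`.
bool is_letter_by_ranges(char32_t cp) {
    auto it = std::lower_bound(
        kLetterRanges.begin(), kLetterRanges.end(), cp,
        [](const CodePointRange& r, char32_t value) { return r.last < value; });
    return it != kLetterRanges.end() && it->first <= cp;
}
```

The cost grows with the logarithm of the number of ranges, which is the property being questioned here; the storage is two code points per range.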
------=_Part_26_170616054.1467090126158--
------=_Part_25_1518837918.1467090126158--
.
Author: asorenji@gmail.com
Date: Mon, 27 Jun 2016 22:46:52 -0700 (PDT)
Raw View
------=_Part_2656_1832367775.1467092812484
Content-Type: multipart/alternative;
boundary="----=_Part_2657_80160971.1467092812491"
------=_Part_2657_80160971.1467092812491
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Tuesday, 28 June 2016, 7:58:26 UTC+3, Nicol Bolas wrote:
>
>
> Meaningful Unicode support means having sufficient data to perform Unicode
> *operations*. Those 12 properties that the C++ standard library provides
> are completely useless for Unicode operations. These operations require
> properties, but they require *Unicode* properties.
>
Why useless? iswdigit = Decimal_Number property, iswalpha = Letter property,
iswblank = Space_Separator property. Yes, Unicode also gives more specific
properties. But I can live without the Titlecase_Letter property if I just
need to split text into words.
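The word-splitting use case can be sketched with the existing <cwctype> interface. This is only a minimal sketch: the helper name split_words is made up for the example, and which non-ASCII characters std::iswalpha accepts depends on the current C locale, as reported at the start of this thread.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Splits text into maximal runs of characters that std::iswalpha accepts.
// The result depends on the current C locale (LC_CTYPE): without a prior
// std::setlocale call, non-ASCII letters may not be recognised.
std::vector<std::wstring> split_words(const std::wstring& text) {
    std::vector<std::wstring> words;
    std::wstring current;
    for (wchar_t ch : text) {
        if (std::iswalpha(static_cast<std::wint_t>(ch))) {
            current.push_back(ch);
        } else if (!current.empty()) {
            // A non-letter ends the word collected so far.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

A program that wants non-ASCII letters classified would typically call std::setlocale(LC_ALL, "") or name a UTF-8 locale before calling split_words. Note also that wchar_t is only 16 bits wide on some platforms, so this approach cannot classify code points outside the BMP there.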
>
> We shouldn't be trying to make Unicode fit within C/C++'s garbage
> interface. We should be trying to improve C++'s interface to match Unicode.
> Or rather, *replace* C++'s terrible interface with one that matches Unicode.
>
For now I'm only trying to make the C/C++ wide character interface work
with wide characters (characters with codes of 128 and above).
------=_Part_2657_80160971.1467092812491--
------=_Part_2656_1832367775.1467092812484--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 27 Jun 2016 22:48:50 -0700
Raw View
On Monday, 27 June 2016 22:46:52 PDT, asorenji@gmail.com wrote:
> But I can live without the Titlecase_Letter property if I just
> need to split text into words.
You can. Can you say the same for everyone?
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: FrankHB1989 <frankhb1989@gmail.com>
Date: Mon, 27 Jun 2016 22:54:01 -0700 (PDT)
Raw View
------=_Part_1999_874038022.1467093242053
Content-Type: multipart/alternative;
boundary="----=_Part_2000_1672089249.1467093242053"
------=_Part_2000_1672089249.1467093242053
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Not all properties are needed at all times. This is true for everyone.
On Tuesday, June 28, 2016 at 1:48:54 PM UTC+8, Thiago Macieira wrote:
>
> On Monday, 27 June 2016 22:46:52 PDT, asor...@gmail.com wrote:
> > But I can live without the Titlecase_Letter property if I just
> > need to split text into words.
>
> You can. Can you say the same for everyone?
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel Open Source Technology Center
>
------=_Part_2000_1672089249.1467093242053--
------=_Part_1999_874038022.1467093242053--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 27 Jun 2016 22:55:49 -0700
Raw View
On Monday, 27 June 2016 22:54:01 PDT, FrankHB1989 wrote:
> Not all properties are needed at all times. This is true for everyone.
We're talking about API. Are you able to say NO ONE needs that API?
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Mon, 27 Jun 2016 22:57:14 -0700 (PDT)
Raw View
------=_Part_1417_1130956854.1467093434694
Content-Type: multipart/alternative;
boundary="----=_Part_1418_1222964461.1467093434694"
------=_Part_1418_1222964461.1467093434694
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Tuesday, 28 June 2016, 8:48:54 UTC+3, Thiago Macieira wrote:
>
>
> You can. Can you say the same for everyone?
>
For everyone else, we can later add new properties to the generic iswctype
<http://www.cplusplus.com/reference/cwctype/iswctype/> function. For now I
need at least these 12 properties to work for Unicode, not just ASCII
characters.
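For reference, the generic interface mentioned here looks a class up by name with std::wctype and tests a character against it with std::iswctype. The sketch below wraps the two calls in a hypothetical helper, has_class, which is not part of any library. The 12 class names the standard requires are "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper" and "xdigit"; whether an implementation accepts any further names is up to the current locale.

```cpp
#include <cwctype>

// Returns true if `ch` belongs to the named character class in the current
// C locale. An unknown class name yields a zero descriptor from
// std::wctype, which is reported here as "not a member".
bool has_class(wchar_t ch, const char* class_name) {
    const std::wctype_t desc = std::wctype(class_name);
    if (desc == std::wctype_t()) {
        return false;
    }
    return std::iswctype(static_cast<std::wint_t>(ch), desc) != 0;
}
```

For example, has_class(L'a', "alpha") is equivalent to std::iswalpha(L'a') != 0, so extending the set of properties this way would not need a new function per property.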
------=_Part_1418_1222964461.1467093434694--
------=_Part_1417_1130956854.1467093434694--
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 00:21:22 -0700
Raw View
On Monday, 27 June 2016 22:57:14 PDT, asorenji@gmail.com wrote:
> On Tuesday, 28 June 2016, 8:48:54 UTC+3, Thiago Macieira wrote:
> > You can. Can you say the same for everyone?
>
> For everyone else, we can later add new properties to the generic iswctype
> <http://www.cplusplus.com/reference/cwctype/iswctype/> function. For now I
> need at least these 12 properties to work for Unicode, not just ASCII
> characters.
We're not talking about "now". We're talking about standardising something for
2020. We have the time to do it right.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Tue, 28 Jun 2016 00:43:38 -0700 (PDT)
Raw View
------=_Part_5114_829734544.1467099818354
Content-Type: multipart/alternative;
boundary="----=_Part_5115_7188301.1467099818354"
------=_Part_5115_7188301.1467099818354
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On Tuesday, 28 June 2016, 10:21:26 UTC+3, Thiago Macieira wrote:
>
>
> We're not talking about "now". We're talking about standardising something
> for 2020. We have the time to do it right.
>
Okay, then we can create a set of functions with standard names of the form
isu32UNICODE_PROPERTY_NAME, without any complexity guarantee, so that an
implementation can use any method it likes to reduce memory consumption or
to get faster code.
------=_Part_5115_7188301.1467099818354--
------=_Part_5114_829734544.1467099818354--
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Tue, 28 Jun 2016 10:30:54 -0400
Raw View
On 2016-06-27 23:44, Thiago Macieira wrote:
> On Monday, 27 June 2016 11:11:08 PDT, Matthew Woehlke wrote:
>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>> sane implementation is going to store them in shared memory.
>
> Huh... and what do you think executable ROM is? Shared memory. Since it's ROM,
> it can't be changed, which means it can be shared between multiple instances
> of the same program (if the OS supports shared memory in the first place)
...which is much less efficient than sharing across *different* programs.
> FYI, there are Unicode tables inside the Linux kernel.
> - VFAT is case-insensitive.
> - VFAT stores filenames in UTF-16.
"In the kernel" was on my list of "reasonable places to put such tables"
:-). Or even in NVRAM (i.e. firmware), if you're talking about an
embedded platform.
--
Matthew
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 08:01:08 -0700
Raw View
On Tuesday, 28 June 2016 00:43:38 PDT, asorenji@gmail.com wrote:
> On Tuesday, 28 June 2016, 10:21:26 UTC+3, Thiago Macieira wrote:
> > We're not talking about "now". We're talking about standardising something
> > for 2020. We have the time to do it right.
>
> Okay, then we can create a set of functions with standard names of the form
> isu32UNICODE_PROPERTY_NAME, without any complexity guarantee, so that an
> implementation can use any method it likes to reduce memory consumption or
> to get faster code.
Why can't we require O(1) complexity?
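One common way to get O(1) property lookup is a two-stage table: the high bits of the code point pick a block and the low bits pick a bit inside it. The sketch below is only an illustration under stated assumptions; the class name TwoStageTable, the 256-code-point block size and the way the table is built from ranges are all choices made for this example, not something proposed in the thread.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Two-stage boolean property table: index_[cp >> 8] selects a 256-bit
// block and the low 8 bits of cp select one bit inside that block. The
// lookup cost does not depend on how many ranges were loaded.
class TwoStageTable {
public:
    using Range = std::pair<char32_t, char32_t>;  // inclusive [first, last]

    explicit TwoStageTable(const std::vector<Range>& ranges) {
        index_.fill(0);
        blocks_.emplace_back();  // block 0 stays all-false and is shared
        for (const Range& r : ranges) {
            for (char32_t cp = r.first; cp <= r.second && cp <= kMaxCodePoint; ++cp) {
                std::uint16_t& slot = index_[cp >> kBlockBits];
                if (slot == 0) {  // first set bit in this block: allocate it
                    blocks_.emplace_back();
                    slot = static_cast<std::uint16_t>(blocks_.size() - 1);
                }
                blocks_[slot].set(cp & kBlockMask);
            }
        }
    }

    // Two array accesses, regardless of the table contents.
    bool contains(char32_t cp) const {
        if (cp > kMaxCodePoint) {
            return false;
        }
        return blocks_[index_[cp >> kBlockBits]].test(cp & kBlockMask);
    }

private:
    static constexpr char32_t kMaxCodePoint = 0x10FFFF;
    static constexpr unsigned kBlockBits = 8;
    static constexpr char32_t kBlockMask = (1u << kBlockBits) - 1;
    static constexpr std::size_t kBlockCount = (0x10FFFFu + 1) >> 8;

    std::array<std::uint16_t, kBlockCount> index_;
    std::vector<std::bitset<256>> blocks_;
};
```

In this sketch the first stage has 4352 entries and every block that contains at least one set bit costs 32 bytes, so the memory cost depends on how the property is spread across blocks. That is the trade-off against a range list searched in log(n) time.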
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 08:03:33 -0700
Raw View
On Tuesday, 28 June 2016 10:30:54 PDT, Matthew Woehlke wrote:
> On 2016-06-27 23:44, Thiago Macieira wrote:
> > On Monday, 27 June 2016 11:11:08 PDT, Matthew Woehlke wrote:
> >> Huh? Why on earth would you bake the tables into the executable ROM? Any
> >> sane implementation is going to store them in shared memory.
> >
> > Huh... and what do you think executable ROM is? Shared memory. Since it's
> > ROM, it can't be changed, which means it can be shared between multiple
> > instances of the same program (if the OS supports shared memory in the
> > first place)
>
> ...which is much less efficient than sharing across *different* programs.
That is included. Executable ROM (i.e., .text sections) will be shared across
multiple invocations, even of different programs, if they use the same
sections of ROM.
> > FYI, there are Unicode tables inside the Linux kernel.
> > - VFAT is case-insensitive.
> > - VFAT stores filenames in UTF-16.
>
> "In the kernel" was on my list of "reasonable places to put such tables"
> :-). Or even in NVRAM (i.e. firmware), if you're talking about an
> embedded platform.
Well, even since the 1980s, for me any "ROM" is actually some kind of PROM,
like Flash memory, NVRAM, etc.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: asorenji@gmail.com
Date: Tue, 28 Jun 2016 09:15:43 -0700 (PDT)
Raw View
On Tuesday, 28 June 2016, 18:01:17 UTC+3, Thiago Macieira wrote:
>
> Why can't we require O(1) complexity?
>

O(1) complexity is no guarantee of fast code. Code that is *always* slow
also has O(1) complexity. On the other hand, some users have a
microcontroller with 1M of memory. If we give them fast O(1) code, that
code will eat a significant amount of this 1M, especially if it must
support a lot of properties.
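
To put rough numbers on the memory concern, here is a back-of-the-envelope
sketch of what the most naive O(1) layout (a flat, uncompressed table over
all of Unicode) would cost. The constant names are made up for illustration,
"1M" is read as 1 MiB, and packing up to 64 binary properties into one
64-bit word per code point is an assumption, not something proposed in this
thread.

```cpp
#include <cstddef>
#include <cstdint>

// Unicode code space: U+0000 .. U+10FFFF.
constexpr std::size_t kCodePoints = 0x110000;

// Flat bitmap for ONE binary property: one bit per code point.
constexpr std::size_t kFlatBytesOneProperty = kCodePoints / 8;

// Flat table with one 64-bit word per code point (room for 64 properties).
constexpr std::size_t kFlatBytesPackedWord = kCodePoints * sizeof(std::uint64_t);

// Assumed size of the microcontroller's memory ("1M" taken as 1 MiB).
constexpr std::size_t kMicrocontrollerBytes = 1024 * 1024;

static_assert(kFlatBytesOneProperty == 136 * 1024,
              "a single flat bitmap is 136 KiB");
static_assert(kFlatBytesPackedWord > 8 * kMicrocontrollerBytes,
              "the packed flat table is about 8.5 MiB");
```

A single flat bitmap already takes roughly 13% of 1 MiB, and the packed
flat table does not fit at all, which is why any realistic O(1) scheme has
to compress the table somehow.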
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 09:36:35 -0700
Raw View
On Tuesday, 28 June 2016 09:15:43 PDT asorenji@gmail.com wrote:
> On Tuesday, 28 June 2016, 18:01:17 UTC+3, Thiago Macieira wrote:
> > Why can't we require O(1) complexity?
>
> O(1) complexity is no guarantee of fast code. Code that is *always* slow
> also has O(1) complexity. On the other hand, some users have a
> microcontroller with 1M of memory. If we give them fast O(1) code, that
> code will eat a significant amount of this 1M, especially if it must
> support a lot of properties.

Given some of the operations that may be constructed with those property
queries, O(1) complexity is probably a far better option.

Not sure the standard should require it.
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Tue, 28 Jun 2016 12:42:57 -0400
Raw View
On 2016-06-28 11:01, Thiago Macieira wrote:
> On Tuesday, 28 June 2016 00:43:38 PDT asorenji@gmail.com wrote:
>> On Tuesday, 28 June 2016, 10:21:26 UTC+3, Thiago Macieira wrote:
>>> We're not talking about "now". We're talking about standardising
>>> something for 2020. We have the time to do it right.
>>
>> Okay, then we can create a set of functions with standard
>> isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so
>> an implementation can use any method to reduce memory consumption, or
>> any method to get faster code.
>
> Why can't we require O(1) complexity?

...because it limits the ways in which the algorithms might be
implemented. See for example my (notional) implementation using only
about 9 KiB; I achieve that using RLE which makes the functions O(logN)
for N = the size of the data table. (Which, okay, since that's constant
and not a function of the input, you could maybe argue *is* O(1), but...)

As we've discussed to death, the implementation almost certainly
involves a size/speed trade-off. If you require favoring speed, you are
likely also requiring an implementation that must have a very large data
table.

Leaving it open-ended allows implementations that must run on very
constrained systems to decide what is an acceptable trade-off.
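
One plausible reading of "RLE with O(log N) lookup" is a sorted table of
run boundaries searched with a binary search. The sketch below shows that
idea only; it is not the notional 9 KiB implementation mentioned above, and
`kRunStarts` and `toy_is_uppercase` are hypothetical names with a tiny
hand-written table rather than real Unicode data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Toy data, NOT the real Unicode table. Each entry is the first code point
// of a run; runs alternate "property off" / "property on", starting with
// "off" at U+0000.
constexpr char32_t kRunStarts[] = {
    0x0000,  // off
    0x0041,  // on : 'A'..'Z'
    0x005B,  // off
    0x00C0,  // on : U+00C0..U+00D6
    0x00D7,  // off: U+00D7 (multiplication sign)
    0x00D8,  // on : U+00D8..U+00DE
    0x00DF,  // off: everything from U+00DF upwards
};

// Binary search over the run boundaries: O(log N) in the number of runs.
bool toy_is_uppercase(char32_t c) {
    // First run start strictly greater than c; c lies in the run before it.
    const char32_t* next =
        std::upper_bound(std::begin(kRunStarts), std::end(kRunStarts), c);
    assert(next != std::begin(kRunStarts));  // holds because kRunStarts[0] == 0
    const std::size_t run =
        static_cast<std::size_t>(next - std::begin(kRunStarts)) - 1;
    return run % 2 == 1;  // odd-numbered runs have the property
}
```

The table costs four bytes per run no matter how long the runs are, which
is where the size saving comes from; the price is a handful of comparisons
per lookup instead of a fixed number of array reads.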
-- 
Matthew
.
Author: Matthew Woehlke <mwoehlke.floss@gmail.com>
Date: Tue, 28 Jun 2016 12:49:57 -0400
Raw View
On 2016-06-28 11:03, Thiago Macieira wrote:
> On Tuesday, 28 June 2016 10:30:54 PDT Matthew Woehlke wrote:
>> On 2016-06-27 23:44, Thiago Macieira wrote:
>>> On Monday, 27 June 2016 11:11:08 PDT Matthew Woehlke wrote:
>>>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>>>> sane implementation is going to store them in shared memory.
>>>
>>> Huh... and what do you think executable ROM is? Shared memory. Since it's
>>> ROM, it can't be changed, which means it can be shared between multiple
>>> instances of the same program (if the OS supports shared memory in the
>>> first place)
>>
>> ...which is much less efficient than sharing across *different* programs.
>
> That is included. Executable ROM (i.e., .text sections) will be shared
> across multiple invocations, even of different programs, if they use the
> same sections of ROM.

Are we talking about the same thing? I'm talking about the tables being
embedded as inline static data in the .exe (or equivalent), *not* a
shared library. I don't think that can be shared, as that would imply a)
that the data exactly starts and ends at page boundaries, and b) the OS
can share memory between different .exe's when they happen to have pages
with identical content. (Do OS's actually do that?)

Possible clarification: by "ROM", above, I'm talking about e.g. the
.rodata section. Not "ROM" in the hardware sense.
-- 
Matthew
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 10:12:46 -0700
Raw View
On Tuesday, 28 June 2016 12:42:57 PDT Matthew Woehlke wrote:
> > Why can't we require O(1) complexity?
>
> ...because it limits the ways in which the algorithms might be
> implemented. See for example my (notional) implementation using only
> about 9 KiB; I achieve that using RLE which makes the functions O(logN)
> for N = the size of the data table. (Which, okay, since that's constant
> and not a function of the input, you could maybe argue *is* O(1), but...)
>
> As we've discussed to death, the implementation almost certainly
> involves a size/speed trade-off. If you require favoring speed, you are
> likely also requiring an implementation that must have a very large data
> table.
>
> Leaving it open-ended allows implementations that must run on very
> constrained systems to decide what is an acceptable trade-off.

The standard does specify complexity requirements for certain algorithms
and containers. Given that the most likely use of properties on characters
is to loop over a string, you have to remember that a non-O(1) property
lookup will mean the entire loop will have a higher complexity than O(n).
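
The loop in question looks roughly like the sketch below, where the
property lookup is a parameter so that its cost is easy to reason about
separately. `count_with_property` and `toy_is_ascii_upper` are hypothetical
names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Calls has_property exactly once per code point, so the total cost is
// n * (cost of one lookup): O(n) with a constant-time lookup, O(n log m)
// with a binary search over a table of m entries.
template <typename Pred>
std::size_t count_with_property(const std::u32string& text, Pred has_property) {
    std::size_t count = 0;
    for (char32_t c : text) {
        if (has_property(c)) {
            ++count;
        }
    }
    return count;
}

// Stand-in constant-time lookup that only knows ASCII capitals.
inline bool toy_is_ascii_upper(char32_t c) {
    return c >= U'A' && c <= U'Z';
}
```

For example, `count_with_property(U"Hello World", toy_is_ascii_upper)`
visits 11 code points and performs 11 lookups.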
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 28 Jun 2016 10:20:23 -0700
Raw View
On Tuesday, 28 June 2016 12:49:57 PDT Matthew Woehlke wrote:
> > That is included. Executable ROM (i.e., .text sections) will be shared
> > across multiple invocations, even of different programs, if they use the
> > same sections of ROM.
>
> Are we talking about the same thing? I'm talking about the tables being
> embedded as inline static data in the .exe (or equivalent), *not* a
> shared library.

I wasn't considering that. I thought you meant data in a library of some sort.

> I don't think that can be shared, as that would imply a)
> that the data exactly starts and ends at page boundaries, and b) the OS
> can share memory between different .exe's when they happen to have pages
> with identical content. (Do OS's actually do that?)

Starting data at page boundaries is quite easy.

Sharing pages even if they don't come from the same file is possible (look up
the Linux "kernel samepage merging" feature), but it's more efficient if they
come from the same file. Hence ICU having a file with the data.
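
As a rough illustration of starting data at a page boundary from within
C++, the sketch below over-aligns a constant table with `alignas`. The
4096-byte page size and the names `PropertyTable` and `kPropertyTable` are
assumptions made for this example, and over-aligned objects with static
storage are only conditionally supported by the standard, although the
mainstream compilers accept them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed page size; the real value is platform-specific.
constexpr std::size_t kAssumedPageSize = 4096;

// A table that starts on a page boundary and, because sizeof is always a
// multiple of the alignment, also ends on one.
struct alignas(kAssumedPageSize) PropertyTable {
    unsigned char bytes[2 * kAssumedPageSize];
};

static_assert(sizeof(PropertyTable) % kAssumedPageSize == 0,
              "table occupies whole pages");

// Zero-filled placeholder; a real table would be filled by a generator.
const PropertyTable kPropertyTable = {};

bool starts_on_page_boundary(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAssumedPageSize == 0;
}
```

This only controls placement inside one executable image. Whether the
resulting pages end up shared with other programs still depends on the
operating system, as discussed above.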
> Possible clarification: by "ROM", above, I'm talking about e.g. the
> .rodata section. Not "ROM" in the hardware sense.
Off-topic:
Strictly speaking, you're talking about the ELF read-only segment(s) (not
sections). There are multiple sections in those segments: at the very least
.text, in addition to .rodata.
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
.
Author: Jean-Marc Bourguet <jm.bourguet@gmail.com>
Date: Tue, 28 Jun 2016 13:48:02 -0700 (PDT)
Raw View
On Tuesday, 28 June 2016 18:43:15 UTC+2, Matthew Woehlke wrote:
>
> On 2016-06-28 11:01, Thiago Macieira wrote:
> > On Tuesday, 28 June 2016 00:43:38 PDT asor...@gmail.com wrote:
> >> On Tuesday, 28 June 2016, 10:21:26 UTC+3, Thiago Macieira wrote:
> >>> We're not talking about "now". We're talking about standardising
> >>> something for 2020. We have the time to do it right.
> >>
> >> Okay, then we can create a set of functions with standard
> >> isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so
> >> an implementation can use any method to reduce memory consumption, or
> >> any method to get faster code.
> >
> > Why can't we require O(1) complexity?
>
> ...because it limits the ways in which the algorithms might be
> implemented. See for example my (notional) implementation using only
> about 9 KiB; I achieve that using RLE which makes the functions O(logN)
> for N = the size of the data table. (Which, okay, since that's constant
> and not a function of the input, you could maybe argue *is* O(1), but...)
>

It is possible in 10 KiB to store the 59 binary properties of Unicode for
the basic plane in such a way that they are accessible in O(1).

Access function is something like:

    // The BinaryProperties* tables and block sizes are defined elsewhere
    // (not shown); needs <cstddef> and <cstdint>.
    uint64_t get_value1(char32_t c) {
        size_t i = c;
        // Split the code point into one offset per table level.
        size_t d1 = i % BinaryPropertiesBlockSizeL1;
        i /= BinaryPropertiesBlockSizeL1;
        size_t d2 = i % BinaryPropertiesBlockSizeL2;
        i /= BinaryPropertiesBlockSizeL2;
        size_t d3 = i % BinaryPropertiesBlockSizeL3;
        i /= BinaryPropertiesBlockSizeL3;
        // Walk from the top-level index down to the packed value.
        i = BinaryPropertiesStartL3[i];
        i = BinaryPropertiesStartL2[i+d3];
        i = BinaryPropertiesStartL1[i+d2];
        return BinaryPropertiesValues[i+d1];
    } // get_value1

    bool is_uppercase(char32_t c) {
        // Bit 8 of the packed word is the uppercase flag in this layout.
        return bool((get_value1(c) >> 8) & 1);
    } // is_uppercase

(IIRC, that's a compression scheme suggested somewhere in the Unicode
documentation; my implementation of the compression part is a stupid greedy
one.)

Thus I don't see the point of allowing O(log N) for the sake of small size.

For speed, I'd just ensure that / and % can be done with masks and shifts,
and then I wouldn't be that surprised if cache pressure were the most
important factor for anything but benchmarks.
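
To make the block-table idea concrete, here is a much-reduced, runnable
sketch: two levels instead of the several levels in the access function
earlier in this message, one byte per code point instead of a packed
64-bit word, and a toy property instead of real Unicode data. All names
(`toy_property`, `TwoStageTable`, `build_table`) are hypothetical, and the
block size is a power of two so that division and modulo become a shift
and a mask.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy property standing in for real Unicode data:
// ASCII 'A'..'Z' plus Greek capitals U+0391..U+03A9 (U+03A2 is unassigned).
bool toy_property(char32_t c) {
    return (c >= U'A' && c <= U'Z') ||
           (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2);
}

constexpr unsigned kShift = 8;                      // 256 code points per block
constexpr std::uint32_t kBlockSize = 1u << kShift;
constexpr std::uint32_t kMask = kBlockSize - 1;
constexpr std::uint32_t kLimit = 0x10000;           // basic plane only

struct TwoStageTable {
    std::vector<std::uint16_t> block_start;  // stage 1: offset of each block in values
    std::vector<std::uint8_t> values;        // stage 2: deduplicated block contents

    // Two array reads, one shift, one mask: constant time.
    bool lookup(char32_t c) const {
        if (c >= kLimit) return false;       // outside the toy table
        return values[block_start[c >> kShift] + (c & kMask)] != 0;
    }
};

// Greedy construction: store a block only if no identical block is stored yet.
TwoStageTable build_table() {
    TwoStageTable table;
    std::map<std::vector<std::uint8_t>, std::uint16_t> seen;
    for (std::uint32_t base = 0; base < kLimit; base += kBlockSize) {
        std::vector<std::uint8_t> block(kBlockSize);
        for (std::uint32_t d = 0; d < kBlockSize; ++d) {
            block[d] = toy_property(static_cast<char32_t>(base + d)) ? 1 : 0;
        }
        auto it = seen.find(block);
        if (it == seen.end()) {
            assert(table.values.size() <= 0xFFFF);  // offset must fit in uint16_t
            const auto start = static_cast<std::uint16_t>(table.values.size());
            table.values.insert(table.values.end(), block.begin(), block.end());
            it = seen.emplace(std::move(block), start).first;
        }
        table.block_start.push_back(it->second);
    }
    return table;
}
```

With this toy property only three distinct blocks survive deduplication,
so the table needs 768 value bytes plus 512 index bytes instead of 65,536
bytes for a flat byte-per-code-point array. Real Unicode data is far less
regular, so the actual saving has to be measured rather than assumed.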
Yours,
-- 
Jean-Marc
.
Author: Arthur O'Dwyer <arthur.j.odwyer@gmail.com>
Date: Tue, 28 Jun 2016 16:44:15 -0700 (PDT)
Raw View
On Tuesday, June 28, 2016 at 1:48:03 PM UTC-7, Jean-Marc Bourguet wrote:
> On Tuesday, 28 June 2016 18:43:15 UTC+2, Matthew Woehlke wrote:
>> On 2016-06-28 11:01, Thiago Macieira wrote:
>>> On Tuesday, 28 June 2016 00:43:38 PDT asor...@gmail.com wrote:
>>>>
>>>> Without any complexity guarantee, so an implementation can use any
>>>> method to reduce memory consumption, or any method to get faster code.
>>>
>>> Why can't we require O(1) complexity?
>>
>> ...because it limits the ways in which the algorithms might be
>> implemented. See for example my (notional) implementation using only
>> about 9 KiB; I achieve that using RLE which makes the functions O(logN)
>> for N = the size of the data table. (Which, okay, since that's constant
>> and not a function of the input, you could maybe argue *is* O(1), but...)
>
> It is possible in 10 KiB to store the 59 binary properties of Unicode for
> the basic plane in such a way that they are accessible in O(1). [...]
> I don't see the point of allowing O(log N) for the sake of small size.

You guys should stop talking about big-O notation; you're talking past each
other.
In this subthread, asorenji's original point was that the Standard
shouldn't specify implementation details of the semi-proposed isu32xxx
functions.
He mistakenly used the term "complexity guarantee" to describe this idea.
Thiago correctly <http://i.imgur.com/gRk1uZm.gif> pointed out that the
Standard could safely require O(1) complexity, because duh.
Matthew for some reason reverse-nitpicked the nitpick, despite agreeing
that Thiago was technically correct about the meaning of "O(1)".
And so on...

Just drop it, and let the other subthreads resume with discussion of the
actual feature space. Let's be reasonable and assume that implementors will
always do what's best for their particular platform.

And in future, when discussing computational complexity, remember to define
your terms! Any algorithm is O(N) for suitable definitions of N, and any
algorithm is O(1) for *other* definitions of N. (In this case, the only
halfway sane definition of "N" is "the size of Unicode"; unfortunately
that's not a fully sane definition, because the size of Unicode is a global
constant.)

–Arthur
.
Author: FrankHB1989 <frankhb1989@gmail.com>
Date: Tue, 28 Jun 2016 21:15:25 -0700 (PDT)
Raw View
On Tuesday, 28 June 2016, 1:55:53 PM UTC+8, Thiago Macieira wrote:
>
> On Monday, 27 June 2016 22:54:01 PDT FrankHB1989 wrote:
> > Not all properties are needed at any time. This is true to everyone.
>
> We're talking about API. Are you able to say NO ONE needs that API?
>

When someone needs some API, that does not mean everything has to be made
available at the same time. Although full Unicode support is good, the lack
of the full set of Unicode APIs, or the cost of supporting it, should not
be a reason to make a specific API unusable or too difficult or subtle to
use, unless that is technically unavoidable. I don't see that `iswalpha` is
such a case.

BTW, about the OP's original question: since ISO C states explicitly that
"the sets of characters tested for by the `iswalpha` function" are
locale-specific, and nothing about Unicode is specified, I don't see any
actual problem in the specification. I also think the wide-character API
should never be specific to Unicode. That is, the related interface should
be Unicode-agnostic; only implementations need to care about how to make it
work better with current or future Unicode versions. But the discussion has
strayed too far from this topic. If a Unicode API is needed, it is better
to create a new topic.
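
The locale dependence described above can be observed with a few lines of
code. This is only a sketch: `iswalpha_under_locale` is a hypothetical
helper, and which locale names are installed and what they report for
non-ASCII characters vary from platform to platform, so no result is
claimed here for the hiragana example from the original post.

```cpp
#include <cassert>
#include <clocale>
#include <cwctype>

// Returns 1 or 0 for the iswalpha answer, or -1 if the named locale is not
// available. Note that this changes the process-wide LC_CTYPE locale.
int iswalpha_under_locale(wchar_t c, const char* locale_name) {
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return -1;
    }
    return std::iswalpha(static_cast<std::wint_t>(c)) != 0 ? 1 : 0;
}
```

Calling `iswalpha_under_locale(L'\u304B', "C")` and then
`iswalpha_under_locale(L'\u304B', "en_US.UTF-8")` shows whether a given
system answers differently for the same character under the two locales;
in the "C" locale only the basic Latin letters are guaranteed to count as
alphabetic.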
-- <br />
You received this message because you are subscribed to the Google Groups &=
quot;ISO C++ Standard - Future Proposals" group.<br />
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to <a href=3D"mailto:std-proposals+unsubscribe@isocpp.org">std-proposa=
ls+unsubscribe@isocpp.org</a>.<br />
To post to this group, send email to <a href=3D"mailto:std-proposals@isocpp=
..org">std-proposals@isocpp.org</a>.<br />
To view this discussion on the web visit <a href=3D"https://groups.google.c=
om/a/isocpp.org/d/msgid/std-proposals/33eb3a74-dd91-416d-b923-1e985682532f%=
40isocpp.org?utm_medium=3Demail&utm_source=3Dfooter">https://groups.google.=
com/a/isocpp.org/d/msgid/std-proposals/33eb3a74-dd91-416d-b923-1e985682532f=
%40isocpp.org</a>.<br />
------=_Part_3_1623596635.1467173725921--
------=_Part_2_1624826700.1467173725921--
.