Thread

Topic: Text_view: A C++ concepts and range based

Author: Zhihao Yuan <zy@miator.net>
Date: Mon, 8 Feb 2016 01:21:37 -0600 Raw View

On Mon, Feb 8, 2016 at 12:27 AM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
>
> Text_view avoids introducing another string type.  Instead, it provides
> facilities for constructing a view over any range, view, or container
> that holds a code unit sequence; the view associates an encoding with
> the code unit sequence and provides iterators that decode the sequence
> and produce code point values.  The value type of the iterator type is a
> character type that associates the code point value with a character set.
>

This approach could be useful, but for long-term use, it would be
more beneficial to decode the external encoding only once and
use an internal representation inside the application.

> An example taken from the overview follows.  Note that \u00F8 (LATIN
> SMALL LETTER O WITH STROKE) is encoded as UTF-8 using two code units
> (\xC3\xB8), but iterator based enumeration sees just the single code point.
>
> using CT = utf8_encoding::character_type;
> auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
> auto it = tv.begin();
> assert(*it++ == CT{0x004A}); // 'J'
> assert(*it++ == CT{0x00F8}); // 'ø'
> assert(*it++ == CT{0x0065}); // 'e'

I know CT is char32_t, but IMHO you overly generalized
this part.  Making this not configurable works for me.

> I see this library as a very small, but fundamental step towards
> improving support for Unicode within the standard library.  Thank you
> for any feedback!
>

At least it's simpler and/or more flexible than the existing codecvt,
wbuffer_convert, wstring_convert.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at https://groups.google.com/a/isocpp.org/group/std-proposals/.

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Mon, 8 Feb 2016 14:53:40 +0000 Raw View

On 2/8/2016 2:21 AM, Zhihao Yuan wrote:
> On Mon, Feb 8, 2016 at 12:27 AM, Tom Honermann
> <Thomas.Honermann@synopsys.com> wrote:
>>
>> Text_view avoids introducing another string type.  Instead, it provides
>> facilities for constructing a view over any range, view, or container
>> that holds a code unit sequence; the view associates an encoding with
>> the code unit sequence and provides iterators that decode the sequence
>> and produce code point values.  The value type of the iterator type is a
>> character type that associates the code point value with a character set.
>>
>
> This approach could be useful, but for long-term use, it would be
> more beneficial to decode the external encoding only once and
> use an internal representation inside the application.

I very much agree, though I think this is an orthogonal concern.
Applications still require the ability to work with an encoded string in
the internal encoding as either a sequence of code unit values or a
sequence of code point values depending on context.  Text_view provides
support for the latter while remaining agnostic to the choice of
internal encoding.  An application that uses UTF-8 as its internal
encoding might use text_view in this manner:

using internal_encoding = utf8_encoding;
using CT = internal_encoding::character_type;
...
std::string s = get_text_in_internal_encoding();
auto tv = make_text_view<internal_encoding>(s);
auto it = std::find(tv.begin(), tv.end(), CT{'\u00F8'});
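
For readers without a concepts-enabled compiler, the idea of searching by code point rather than by code unit can be approximated with a small standalone sketch. This is not text_view's implementation: `decode_utf8` and `contains_code_point` are hypothetical helpers, decoding is eager rather than lazy, plain char32_t stands in for the character type, and the input is assumed to be well-formed UTF-8.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of text_view: eagerly decodes UTF-8 into
// code points. Precondition: the input is well-formed UTF-8.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        assert(i + extra < s.size());  // sequence must not be truncated
        // Payload bits carried by the lead byte: 7, 5, 4, or 3 of them.
        char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: searches the decoded code points, so a multi-byte
// character is matched as one unit.
bool contains_code_point(const std::string& utf8, char32_t cp) {
    const std::vector<char32_t> cps = decode_utf8(utf8);
    return std::find(cps.begin(), cps.end(), cp) != cps.end();
}
```

With this sketch, the four code units of "J\xC3\xB8rg" decode to three... rather, the five code units decode to four code points, and a search for 0xF8 matches the two-byte sequence as a single element.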

>> An example taken from the overview follows.  Note that \u00F8 (LATIN
>> SMALL LETTER O WITH STROKE) is encoded as UTF-8 using two code units
>> (\xC3\xB8), but iterator based enumeration sees just the single code point.
>>
>> using CT = utf8_encoding::character_type;
>> auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
>> auto it = tv.begin();
>> assert(*it++ == CT{0x004A}); // 'J'
>> assert(*it++ == CT{0x00F8}); // 'ø'
>> assert(*it++ == CT{0x0065}); // 'e'
>
> I know CT is char32_t, but IMHO you overly generalized
> this part.  Making this not configurable works for me.

CT isn't char32_t; it is character<unicode_character_set>.

C++ doesn't provide a type that meets my criteria for a character type.
char, wchar_t, char16_t, and char32_t are code unit types that are
sometimes used as code point types.  They don't qualify as character
types because they do not have an explicit or implicit associated
character set to give meaning to their values.  text_view introduces a
character class template that makes the association with a character set
explicit.  This enables (or will enable with future work) transcoding
between character sets as would be needed to convert text in external
encodings to an internal encoding.
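
The distinction between a code point type and a character type can be sketched in a few lines. The following is an illustration only, not the library's definition: the names `character` and `unicode_character_set` come from the message above, but their bodies here are assumptions, and `latin1_character_set` is a hypothetical second tag added for contrast.

```cpp
#include <cassert>
#include <type_traits>

// Tag types naming character sets. The bodies are assumptions made for
// this sketch; latin1_character_set is hypothetical.
struct unicode_character_set {};
struct latin1_character_set {};

// A character is a code point whose character set is part of its type.
template <class CharacterSet>
struct character {
    using character_set_type = CharacterSet;
    char32_t code_point;
};

// Only characters of the same character set compare; mixing sets fails to
// compile because template argument deduction finds conflicting types.
template <class CharacterSet>
constexpr bool operator==(character<CharacterSet> a,
                          character<CharacterSet> b) {
    return a.code_point == b.code_point;
}

// The same numeric value yields distinct types under different sets.
static_assert(!std::is_same<character<unicode_character_set>,
                            character<latin1_character_set>>::value,
              "character set is part of the type");
```

Under these assumptions, comparing a `character<unicode_character_set>` with a `character<latin1_character_set>` is rejected at compile time, which is the kind of information a bare char32_t cannot carry.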

>> I see this library as a very small, but fundamental step towards
>> improving support for Unicode within the standard library.  Thank you
>> for any feedback!
>>
>
> At least it's simpler and/or more flexible than the existing codecvt,
> wbuffer_convert, wstring_convert.

Thank you, I interpret that as meaning I haven't completely failed :)

Tom.

.

Author: Zhihao Yuan <zy@miator.net>
Date: Mon, 8 Feb 2016 10:08:30 -0600 Raw View

On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
>
> CT isn't char32_t; it is character<unicode_character_set>.

I read your source code so I know it is :)

> C++ doesn't provide a type that meets my criteria for a character type.
>   char, wchar_t, char16_t, and char32_t are code unit types that are
> sometimes used as code point types.  They don't qualify as character
> types because they do not have an explicit or implicit associated
> character set to give meaning to their values.

What I'm looking for is something that can simplify further development of
Unicode-based collate processing.  Considering char32_t is already the
default type, I think I can live with that.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Mon, 8 Feb 2016 09:34:56 -0800 (PST) Raw View

I like the general idea of the feature: an iterator/range approach to
dealing with Unicode encodings. One which is relatively extensible for
specialized encoding schemes.

My biggest issue is this: why is it based on *concepts*?

There's nothing here that really requires or benefits from it. Sure, it's
nice in the sense that it's easier to keep people from doing some wrong
things. But concepts offer no protection from the most pernicious of
problems: the user believing a string has an encoding when it in fact has a
different one.

I don't know what concepts are doing here, besides restricting my compiler
options so that I can't actually use it.

Also, you have support for little endian and big endian UTF-16/32 ranges.
And you have support for native-endian ones. But what about conversions?
What if you have read a file that you know is little endian, and you want
to convert it to native-endian, but you don't really know what
native-endian is? Your way would seem to require going through a whole
UTF-16->codepoint->UTF-16 conversion step, when what you really want is
just some byte-swapping.
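
A minimal sketch of the byte-level conversion described here, assuming the file contents are available as raw bytes. `utf16le_to_native` is a hypothetical helper, not something text_view provides. Assembling each code unit with shifts means the code never needs to know the native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: reads UTF-16LE bytes and produces native-endian
// code units. The shift places the second byte in the high half on any
// host, so no endianness query is required. Surrogate pairs pass through
// untouched because no decoding to code points takes place.
std::u16string utf16le_to_native(const unsigned char* bytes,
                                 std::size_t size) {
    assert(size % 2 == 0);  // UTF-16 input is a whole number of code units
    std::u16string out;
    out.reserve(size / 2);
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        out.push_back(
            static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

The loop does one shift and one OR per code unit, with none of the branching that a UTF-16 to code point to UTF-16 round trip would involve.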

Another issue has to do with more Unicode operations: collation, etc.
Obviously, your current version isn't intended to handle these. But my
concern is that, by using these decoding iterators, you make it difficult
to write optimized code for doing such transformations.

For example, consider a Unicode normalizing algorithm. If the algorithm
knows that it's dealing with UTF-8, then it can optimize how it looks at
codepoints. Nothing in ASCII requires changes during decomposition. So if a
UTF-8-based normalization algorithm sees a code unit with a 0 in the high
bit, it can just write that value out as is and move on. By decoding it to
UTF-32, then re-encoding it as UTF-8, you have to do a lot of conditional
tests which you know won't be met.

I imagine UTF-16-based normalizers can have similar optimizations.

Obviously, there should be an algorithm that handles any codepoint
sequence, so that it can handle other forms of encodings. But if you make
the algorithms use codepoint ranges, then this becomes difficult. Compilers
might be able to inline and optimize everything in specific cases. But they
might not.
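
The ASCII fast path described above can be sketched roughly as follows. This is only an illustration of the control flow: `process_utf8` and `slow_path` are hypothetical names, and the slow path is a stand-in that copies each multi-byte sequence unchanged where a real normalizer would decode and decompose it. A composing form would also have to look at what follows an ASCII character, since a combining mark can attach to an ASCII base.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical slow path: a stand-in for real per-code-point work. It
// copies one multi-byte sequence unchanged and returns the number of
// bytes it consumed.
std::size_t slow_path(const std::string& in, std::size_t i,
                      std::string& out) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    assert(lead >= 0x80);  // ASCII never reaches the slow path
    std::size_t len = lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    if (len > in.size() - i) len = in.size() - i;  // tolerate truncation
    out.append(in, i, len);
    return len;
}

// Copies ASCII code units straight through and hands everything else to
// the slow path. Returns how many sequences took the slow path.
std::size_t process_utf8(const std::string& in, std::string& out) {
    std::size_t slow = 0;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out.push_back(in[i]);  // high bit clear: write as is, move on
            ++i;
        } else {
            i += slow_path(in, i, out);
            ++slow;
        }
    }
    return slow;
}
```

For mostly-ASCII input, nearly every iteration takes the single comparison in the fast branch, which is the saving at stake when the same work is routed through a decode and re-encode step.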

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Mon, 8 Feb 2016 09:46:37 -0800 Raw View

On Mon, Feb 8, 2016 at 6:53 AM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
> On 2/8/2016 2:21 AM, Zhihao Yuan wrote:
>> On Mon, Feb 8, 2016 at 12:27 AM, Tom Honermann
>> <Thomas.Honermann@synopsys.com> wrote:
>>> An example taken from the overview follows.  Note that \u00F8 (LATIN
>>> SMALL LETTER O WITH STROKE) is encoded as UTF-8 using two code units
>>> (\xC3\xB8), but iterator based enumeration sees just the single code point.
>>>
>>> using CT = utf8_encoding::character_type;
>>> auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
>>> auto it = tv.begin();
>>> assert(*it++ == CT{0x004A}); // 'J'
>>> assert(*it++ == CT{0x00F8}); // 'ø'
>>> assert(*it++ == CT{0x0065}); // 'e'
>>
>> I know CT is char32_t, but IMHO you overly generalized
>> this part.  Making this not configurable works for me.
>
> CT isn't char32_t; it is character<unicode_character_set>.
>
> C++ doesn't provide a type that meets my criteria for a character type.
>   char, wchar_t, char16_t, and char32_t are code unit types that are
> sometimes used as code point types.  They don't qualify as character
> types because they do not have an explicit or implicit associated
> character set to give meaning to their values.

[lex.ccon]p2 says:

"A character literal that begins with the letter U, such as U’z’, is a
character literal of type char32_t. The value of a char32_t literal
containing a single c-char is equal to its ISO 10646 code point
value."

This doesn't specify that char32_t will always be used to represent
characters in the ISO 10646 character set (we can't, really, in the
language clauses), but it's close enough that I'm not interested in
the standard library supporting other uses. If you want to propose
wording in one of the library clauses to say that explicitly, I think
the only objection you might get is that it's so obvious it doesn't
need saying. :)
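
The quoted rule can be checked directly with a few compile-time assertions. This is a small sketch; the constant names are hypothetical. Note that it only shows what the literal guarantees: the type char32_t by itself says nothing about values that did not come from a U-prefixed literal.

```cpp
#include <cassert>
#include <type_traits>

// A U-prefixed character literal has type char32_t, and its value is the
// ISO 10646 code point of the character it names.
static_assert(std::is_same<decltype(U'z'), char32_t>::value,
              "U'z' has type char32_t");
static_assert(U'z' == 0x7A, "LATIN SMALL LETTER Z is U+007A");
static_assert(U'\u00F8' == 0xF8,
              "LATIN SMALL LETTER O WITH STROKE is U+00F8");

// Hypothetical constants, usable wherever a code point value is needed.
constexpr char32_t kLatinSmallZ = U'z';
constexpr char32_t kLatinSmallOWithStroke = U'\u00F8';
```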

Jeffrey

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Mon, 8 Feb 2016 20:05:25 +0000 Raw View

On 2/8/2016 11:08 AM, Zhihao Yuan wrote:
> On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann
> <Thomas.Honermann@synopsys.com> wrote:
>>
>> CT isn't char32_t; it is character<unicode_character_set>.
>
> I read your source code so I know it is :)
>
>> C++ doesn't provide a type that meets my criteria for a character type.
>>    char, wchar_t, char16_t, and char32_t are code unit types that are
>> sometimes used as code point types.  They don't qualify as character
>> types because they do not have an explicit or implicit associated
>> character set to give meaning to their values.
>
> What I'm looking for is something that can simplify further development of
> Unicode-based collate processing.  Considering char32_t is already the
> default type, I think I can live with that.

I'd like to get a better understanding of your thoughts here.  I think
you are suggesting that the value type of the text iterators should
always be char32_t.  If so, I'm guessing that your view reflects one (or
both) of these positions:

1) Values of type char32_t (are expected to) always semantically refer
to the character with the corresponding code point value as defined by
Unicode.  In other words, char32_t always carries an implied association
with the Unicode character set.

2) The char32_t type suffices as the code point and character type for
any encoding and there is no need to track an associated character set.
  Applications that do require tracking an associated character set
would be on their own to do so.

For case #1, this implies that dereferencing a text iterator necessarily
involves transcoding to Unicode for non-Unicode encodings (potentially
the encodings used for ordinary and wide string literals).

For case #2, this implies that all transcoding operations will require
the application to explicitly specify the associated character sets; the
example below will not be possible.

One of my goals is to enable transcoding with code like the following.
This lazily transcodes a string from some external encoding to the
internal encoding.  Note that character set transcoding occurs within
the call to std::copy based on the value types (character types) of the
iterators.  If the character types don't have reliably correct
associated character sets, then transcoding will fail to do the right thing.

std::string in = get_a_string_with_some_external_encoding();
std::string out;
std::back_insert_iterator<std::string> out_it{out};
auto tv_in = make_text_view<some_external_encoding>(in);
auto tv_out = make_otext_iterator<internal_encoding>(out_it);
std::copy(tv_in.begin(), tv_in.end(), tv_out);
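
The shape of this std::copy based transcoding can be tried with standard C++ alone. The sketch below is a stand-in, not text_view's otext_iterator: `utf8_encoding_iterator`, `make_utf8_encoder`, and `to_utf8` are hypothetical names, the source is assumed to already be a sequence of Unicode code points, and no character set conversion takes place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical output iterator adaptor: every code point assigned through
// it is written to the wrapped iterator as UTF-8 code units.
template <class OutIt>
class utf8_encoding_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_encoding_iterator(OutIt out) : out_(out) {}

    utf8_encoding_iterator& operator=(char32_t cp) {
        assert(cp <= 0x10FFFF);  // outside the Unicode code space otherwise
        if (cp < 0x80) {
            *out_++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out_++ = static_cast<char>(0xC0 | (cp >> 6));
            *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out_++ = static_cast<char>(0xE0 | (cp >> 12));
            *out_++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out_++ = static_cast<char>(0xF0 | (cp >> 18));
            *out_++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out_++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        return *this;
    }

    // Dereference and increment are no-ops, as with std::back_insert_iterator.
    utf8_encoding_iterator& operator*() { return *this; }
    utf8_encoding_iterator& operator++() { return *this; }
    utf8_encoding_iterator operator++(int) { return *this; }

private:
    OutIt out_;
};

template <class OutIt>
utf8_encoding_iterator<OutIt> make_utf8_encoder(OutIt out) {
    return utf8_encoding_iterator<OutIt>(out);
}

// Encoding happens inside std::copy, one code point at a time.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    std::copy(in.begin(), in.end(), make_utf8_encoder(std::back_inserter(out)));
    return out;
}
```

What this sketch cannot do is exactly the point made above: because its input is bare char32_t, it has to assume the values are Unicode code points, whereas iterators whose value types name a character set could select the right conversion from the types alone.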

Another goal is to support ISO-2022 encodings (support for these
encodings would be, at best, optional in the standard).  These encodings
support escape sequences that allow switching between character sets, so
a code point type doesn't suffice to identify a character set (the
any_character_set class and corresponding character specialization exist
for this purpose).  These can be supported by #1, but only at the cost
of aggressive transcoding.

Tom.

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Mon, 8 Feb 2016 12:26:59 -0800 Raw View

On Mon, Feb 8, 2016 at 12:05 PM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
> On 2/8/2016 11:08 AM, Zhihao Yuan wrote:
>> On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann
>> <Thomas.Honermann@synopsys.com> wrote:
>>>
>>> CT isn't char32_t; it is character<unicode_character_set>.
>>
>> I read your source code so I know it is :)
>>
>>> C++ doesn't provide a type that meets my criteria for a character type.
>>>    char, wchar_t, char16_t, and char32_t are code unit types that are
>>> sometimes used as code point types.  They don't qualify as character
>>> types because they do not have an explicit or implicit associated
>>> character set to give meaning to their values.
>>
>> What I'm looking for is something that can simplify further development of
>> Unicode-based collate processing.  Considering char32_t is already the
>> default type, I think I can live with that.
>
> I'd like to get a better understanding of your thoughts here.  I think
> you are suggesting that the value type of the text iterators should
> always be char32_t.  If so, I'm guessing that your view reflects one (or
> both) of these positions:
>
> 1) Values of type char32_t (are expected to) always semantically refer
> to the character with the corresponding code point value as defined by
> Unicode.  In other words, char32_t always carries an implied association
> with the Unicode character set.
>
> 2) The char32_t type suffices as the code point and character type for
> any encoding and there is no need to track an associated character set.
>   Applications that do require tracking an associated character set
> would be on their own to do so.
>
> For case #1, this implies that dereferencing a text iterator necessarily
> involves transcoding to Unicode for non-Unicode encodings (potentially
> the encodings used for ordinary and wide string literals).
>
> For case #2, this implies that all transcoding operations will require
> the application to explicitly specify the associated character sets; the
> example below will not be possible.
>
> One of my goals is to enable transcoding with code like the following.
> This lazily transcodes a string from some external encoding to the
> internal encoding.  Note that character set transcoding occurs within
> the call to std::copy based on the value types (character types) of the
> iterators.  If the character types don't have reliably correct
> associated character sets, then transcoding will fail to do the right thing.
>
> std::string in = get_a_string_with_some_external_encoding();
> std::string out;
> std::back_insert_iterator<std::string> out_it{out};
> auto tv_in = make_text_view<some_external_encoding>(in);
> auto tv_out = make_otext_iterator<internal_encoding>(out_it);
> std::copy(tv_in.begin(), tv_in.end(), tv_out);
>
> Another goal is to support ISO-2022 encodings (support for these
> encodings would be, at best, optional in the standard).  These encodings
> support escape sequences that allow switching between character sets, so
> a code point type doesn't suffice to identify a character set (the
> any_character_set class and corresponding character specialization exist
> for this purpose).  These can be supported by #1, but only at the cost
> of aggressive transcoding.

I like option (1). If someone has a character set with 4-byte code
points that's not Unicode, they should build their own struct to
represent the code points, and not re-use char32_t. It does need to be
possible to efficiently transcode, e.g., Shift-JIS to UTF-8 without an
intermediate step through char32_t, but that's true whether or not
char32_t is assumed to represent Unicode code points.

Jeffrey

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Mon, 8 Feb 2016 20:24:12 +0000 Raw View

On 2/8/2016 12:47 PM, 'Jeffrey Yasskin' via ISO C++ Standard - Future
Proposals wrote:
> On Mon, Feb 8, 2016 at 6:53 AM, Tom Honermann
> <Thomas.Honermann@synopsys.com> wrote:
>> C++ doesn't provide a type that meets my criteria for a character type.
>>    char, wchar_t, char16_t, and char32_t are code unit types that are
>> sometimes used as code point types.  They don't qualify as character
>> types because they do not have an explicit or implicit associated
>> character set to give meaning to their values.
>
> [lex.ccon]p2 says:
>
> "A character literal that begins with the letter U, such as U’z’, is a
> character literal of type char32_t. The value of a char32_t literal
> containing a single c-char is equal to its ISO 10646 code point
> value."
>
> This doesn't specify that char32_t will always be used to represent
> characters in the ISO 10646 character set (we can't, really, in the
> language clauses), but it's close enough that I'm not interested in
> the standard library supporting other uses. If you want to propose
> wording in one of the library clauses to say that explicitly, I think
> the only objection you might get is that it's so obvious it doesn't
> need saying. :)

Likewise, [lex.ccon]p1 states:

"An ordinary character literal that contains a single c-char has type
char, with value equal to the numerical value of the encoding of the
c-char in the execution character set."

I don't read this as implying that all values of type 'char' represent a
character from the implementation defined execution character set (in
some cases it can't, hence multicharacter literals have type 'int').
The important context in both quotes is that the value is derived from a
character literal.  From my view point, there is a loss of information
here; the literal has an implied encoding that is not reflected by the
type of the literal.

I wasn't around for the discussions of the proposals that added char16_t
and char32_t.  I wouldn't be surprised if the intention was that
char32_t imply Unicode code points in general.

Regardless, I'm not sure it is relevant whether char32_t has an implied
associated character set.  char32_t use is enshrined as a code unit and
code point type, and Text_view only uses it as such.

Tom.

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Mon, 8 Feb 2016 12:51:55 -0800 Raw View

On Mon, Feb 8, 2016 at 12:24 PM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
> On 2/8/2016 12:47 PM, 'Jeffrey Yasskin' via ISO C++ Standard - Future
> Proposals wrote:
>> On Mon, Feb 8, 2016 at 6:53 AM, Tom Honermann
>> <Thomas.Honermann@synopsys.com> wrote:
>>> C++ doesn't provide a type that meets my criteria for a character type.
>>>    char, wchar_t, char16_t, and char32_t are code unit types that are
>>> sometimes used as code point types.  They don't qualify as character
>>> types because they do not have an explicit or implicit associated
>>> character set to give meaning to their values.
>>
>> [lex.ccon]p2 says:
>>
>> "A character literal that begins with the letter U, such as U’z’, is a
>> character literal of type char32_t. The value of a char32_t literal
>> containing a single c-char is equal to its ISO 10646 code point
>> value."
>>
>> This doesn't specify that char32_t will always be used to represent
>> characters in the ISO 10646 character set (we can't, really, in the
>> language clauses), but it's close enough that I'm not interested in
>> the standard library supporting other uses. If you want to propose
>> wording in one of the library clauses to say that explicitly, I think
>> the only objection you might get is that it's so obvious it doesn't
>> need saying. :)
>
> Likewise, [lex.ccon]p1 states:
>
> "An ordinary character literal that contains a single c-char has type
> char, with value equal to the numerical value of the encoding of the
> c-char in the execution character set."
>
> I don't read this as implying that all values of type 'char' represent a
> character from the implementation defined execution character set (in
> some cases it can't, hence multicharacter literals have type 'int').
> The important context in both quotes is that the value is derived from a
> character literal.  From my view point, there is a loss of information
> here; the literal has an implied encoding that is not reflected by the
> type of the literal.
>
> I wasn't around for the discussions of the proposals that added char16_t
> and char32_t.  I wouldn't be surprised if the intention was that
> char32_t imply Unicode code points in general.

See http://open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018 for some
history, although history doesn't determine what we should do now.

> Regardless, I'm not sure it is relevant whether char32_t has an implied
> associated character set.  char32_t use is enshrined as a code unit and
> code point type, and Text_view only uses it as such.
>
> Tom.

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Mon, 8 Feb 2016 21:02:40 +0000 Raw View

On 2/8/2016 12:35 PM, Nicol Bolas wrote:
> I like the general idea of the feature: an iterator/range approach to
> dealing with Unicode encodings. One which is relatively extensible for
> specialized encoding schemes.

Good, thanks, that is exactly the intent.

> My biggest issue is this: why is it based on /concepts/?

Because I like concepts :)

From a practical perspective, I found concepts to be *extremely*
helpful in getting this prototype together.  I got to skip all the
minutiae around the use of std::enable_if.  The definition of
itext_iterator_category_selector in text_iterator.hpp reflects the
simplifications that concepts offered.  Expert std::enable_if users
could probably write the equivalent of those class specializations in no
time, but I'm not such an expert.

> There's nothing here that really requires or benefits from it.

That is probably true for users of the library.  If someone were to fork
text_view and port it to use the facilities available in Eric Niebler's
excellent range-v3 library [1], I think that would be very cool.

> Sure, it's nice in the sense that it's easier to keep people from doing
> some wrong things. But concepts offer no protection from the most
> pernicious of problems: the user believing a string has an encoding when
> it in fact has a different one.

That indeed is a significant problem.  One of the things I'm
dissatisfied with is that an encoding cannot be inferred from string and
character literals.  I'd love to make that a possibility.  I think
adding a char8_t as has been previously proposed and the facilities
proposed in N4121 would suffice to solve this problem.

> I don't know what concepts are doing here, besides restricting my
> compiler options so that I can't actually use it.

Yeah, sorry about that.  But, this is std-proposals.  You weren't really
expecting code you could just drop into your favorite production code
line, were you?  :)

> Also, you have support for little endian and big endian UTF-16/32
> ranges. And you have support for native-endian ones. But what about
> conversions? What if you have read a file that you know is little
> endian, and you want to convert it to native-endian, but you don't
> really know what native-endian is? Your way would seem to require going
> through a whole UTF-16->codepoint->UTF-16 conversion step, when what you
> really want is just some byte-swapping.

I haven't taken on transcoding yet, though that is a logical next step
for the library (aside from adding support for additional encodings).
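As a rough illustration of the byte-swapping case raised above: when the
input is known to be little-endian UTF-16, code units can be assembled from
bytes with shifts, which gives native-endian values without ever asking what
the native byte order is.  This is only a sketch and not part of text_view;
the function name is hypothetical, and a trailing odd byte is simply ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assemble UTF-16 code units from a little-endian byte sequence.
// Shifting bytes into place is independent of the host's byte order,
// so the result is native-endian on any platform.
// Hypothetical helper: a trailing odd byte is ignored in this sketch.
std::u16string utf16le_bytes_to_native(const std::vector<std::uint8_t>& bytes) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Low byte comes first in little-endian order.
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

No code points are decoded here, so surrogate pairs pass through untouched.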

> Another issue has to do with more Unicode operations: collation, etc.
> Obviously, your current version isn't intended to handle these. But my
> concern is that, by using these decoding iterators, you make it
> difficult to write optimized code for doing such transformations.
>
> For example, consider a Unicode normalizing algorithm. If the algorithm
> knows that it's dealing with UTF-8, then it can optimize how it looks at
> codepoints. Nothing in ASCII requires changes during decomposition. So
> if a UTF-8-based normalization algorithm sees a code unit with a 0 in
> the high bit, it can just write that value out as is and move on. By
> decoding it to UTF-32, then re-encoding it as UTF-8, you have to do a
> lot of conditional tests which you know won't be met.

Proper Unicode collation and normalization does require examining more
than one code point at a time, so I agree that use of decoding iterators
is not likely to be part of a good solution for those. But, I don't
follow how use of decoding iterators leads us into a trap.

Since these features are specific to Unicode, I had envisioned that
separate facilities would be added to enable them.  For example, perhaps
a normalizing iterator interface could provide for a per-encoding
optimized experience.

> I imagine UTF-16-based normalizers can have similar optimizations.
>
> Obviously, there should be an algorithm that handles any codepoint
> sequence, so that it can handle other forms of encodings. But if you
> make the algorithms use codepoint ranges, then this becomes difficult.
> Compilers might be able to inline and optimize everything in specific
> cases. But they might not.

I agree.  If you think of any specific suggestions for how to enable
these optimizations, I'd very much like to hear them.  I had intended to
try and identify ways of achieving such optimizations when I get around
to working on these areas (patches welcome!)

Tom.

[1]: https://github.com/ericniebler/range-v3


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 08 Feb 2016 13:07:16 -0800 Raw View

On Monday, 8 February 2016, at 09:34:56 PST, Nicol Bolas wrote:
> For example, consider a Unicode normalizing algorithm. If the algorithm
> knows that it's dealing with UTF-8, then it can optimize how it looks at
> codepoints. Nothing in ASCII requires changes during decomposition. So if a
> UTF-8-based normalization algorithm sees a code unit with a 0 in the high
> bit, it can just write that value out as is and move on. By decoding it to
> UTF-32, then re-encoding it as UTF-8, you have to do a lot of conditional
> tests which you know won't be met.
>
> I imagine UTF-16-based normalizers can have similar optimizations.

Indeed, that's exactly what we do in Qt (qstring.cpp, qutfcodec.cpp). A lot of
the code has a fast-path for strings or ranges only containing US-ASCII
characters.

For example, the algorithm to convert from UTF-8 to UTF-16 is:

1) load 16 bytes into SIMD register
2) test the high bits of every byte
3.a) if they were all zero, simply do a zero-extension and save 32 bytes
3.b) if they weren't, find the first bit set and do the UTF-8 decoding from
    there
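The steps above can be sketched in portable scalar code.  This is not Qt's
implementation: it tests 8 bytes at a time with a 64-bit word instead of 16
bytes in a SIMD register, it falls back to decoding one code point at a time
rather than locating the first set bit, and it assumes the input is
well-formed UTF-8 (no validation).  All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Decode one code point starting at s[i] and advance i past it.
// Assumes well-formed UTF-8; this sketch performs no validation.
char32_t decode_one_valid_utf8(const std::string& s, std::size_t& i) {
    const auto b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Append a code point as one UTF-16 code unit or as a surrogate pair.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        // Fast path: if the next 8 bytes all have a clear high bit, they
        // are ASCII and can be zero-extended without any decoding.
        if (in.size() - i >= 8) {
            std::uint64_t word;
            std::memcpy(&word, in.data() + i, 8);
            if ((word & 0x8080808080808080ULL) == 0) {
                for (int k = 0; k < 8; ++k) {
                    out.push_back(static_cast<char16_t>(
                        static_cast<unsigned char>(in[i + k])));
                }
                i += 8;
                continue;
            }
        }
        // Slow path: decode a single code point, then retry the fast path.
        append_utf16(out, decode_one_valid_utf8(in, i));
    }
    return out;
}
```

The high-bit test works the same way regardless of host byte order, because
it only asks whether any byte in the word has its top bit set.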

Another example is the IDNA encoding operation "ToASCII": if the entire string
is US-ASCII, no decoding to UCS-4 and subsequent encoding to Punycode is
required.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center


.

Author: Zhihao Yuan <zy@miator.net>
Date: Mon, 8 Feb 2016 16:00:09 -0600 Raw View

On Mon, Feb 8, 2016 at 2:05 PM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
>
>> What I'm looking for is something can simplify further development of
>> Unicode-based collate processing.  Considering char32_t is already the
>> default type, I think I can live with that.
>
> I'd like to get a better understanding of your thoughts here.  I think
> you are suggesting that the value type of the text iterators should
> always be char32_t.  If so, I'm guessing that your view reflects one (or
> both) of these positions:
>
> 1) Values of type char32_t (are expected to) always semantically refer
> to the character with the corresponding code point value as defined by
> Unicode.  In other words, char32_t always carries an implied association
> with the Unicode character set.
>

Yes.  What I'm expecting is portable code point values.
Unicode gives you portable code point values, and char32_t
(and char16_t in limited range) should be assumed to contain
Unicode code points.

> 2) The char32_t type suffices as the code point and character type for
> any encoding and there is no need to track an associated character set.
>   Applications that do require tracking an associated character set
> would be on their own to do so.
>

It can be made true for Unicode representations, e.g. UTF-8, UTF-16,
UTF-32 if you want, plus GB18030 if you care.  Other locale-dependent
encodings can go through the locale machinery.  For the ones which can't
even be used in locales, the C++ standard may be too thin.

> For case #1, this implies that dereferencing a text iterator necessarily
> involves transcoding to Unicode for non-Unicode encodings (potentially
> the encodings used for ordinary and wide string literals).
>

Your current implementation does not support non-Unicode
encodings.  You conditionally supported wchar_t when it
contains Unicode code points.  The C++ execution encoding
is not required to be ASCII; if you assumed it to be ASCII, you
are doing it wrong.

> std::string in = get_a_string_with_some_external_encoding();
> std::string out;
> std::back_insert_iterator<std::string> out_it{out};
> auto tv_in = make_text_view<some_external_encoding>(in);
> auto tv_out = make_otext_iterator<internal_encoding>(out_it);
> std::copy(tv_in.begin(), tv_in.end(), tv_out);
>

If your internal encoding is an 8-bit narrow encoding, there is
no conversion needed, since no conversion can be
done -- if it differs from the external encoding, they are mostly
of different charsets.

Otherwise, as far as I can see, the best you can do is to
make use of locale information, but iostream/stdio do this
sufficiently well.

> Another goal is to support ISO-2022 encodings (support for these
> encodings would be, at best, optional in the standard).  These encodings
> support escape sequences that allow switching between character sets, so
> a code point type doesn't suffice to identify a character set (the
> any_character_set class and corresponding character specialization exist
> for this purpose).  These can be supported by #1, but only at the cost
> of aggressive transcoding.

Some of the iso-2022-compatible encodings can individually be
used as locales, thus you can individually decode them into
wchar_t, portably.  But iso-2022 itself is never actually supported,
for example, in libstdc++.  This encoding allows some uses
that Unicode doesn't care about, like switching fonts, going through
information channel restrictions, etc., and it loses its value if
you put it under a Unicode-centric (or any code-point centric)
design.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd


.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 04:35:32 +0000 Raw View

On 2/8/2016 3:27 PM, 'Jeffrey Yasskin' via ISO C++ Standard - Future
Proposals wrote:
> On Mon, Feb 8, 2016 at 12:05 PM, Tom Honermann
> <Thomas.Honermann@synopsys.com> wrote:
>> On 2/8/2016 11:08 AM, Zhihao Yuan wrote:
>>> On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann
>>> <Thomas.Honermann@synopsys.com> wrote:
>>>>
>>>> CT isn't char32_t; it is character<unicode_character_set>.
>>>
>>> I read your source code so I know it is :)
>>>
>>>> C++ doesn't provide a type that meets my criteria for a character type.
>>>>     char, wchar_t, char16_t, and char32_t are code unit types that are
>>>> sometimes used as code point types.  They don't qualify as character
>>>> types because they do not have an explicit or implicit associated
>>>> character set to give meaning to their values.
>>>
>>> What I'm looking for is something can simplify further development of
>>> Unicode-based collate processing.  Considering char32_t is already the
>>> default type, I think I can live with that.
>>
>> I'd like to get a better understanding of your thoughts here.  I think
>> you are suggesting that the value type of the text iterators should
>> always be char32_t.  If so, I'm guessing that your view reflects one (or
>> both) of these positions:
>>
>> 1) Values of type char32_t (are expected to) always semantically refer
>> to the character with the corresponding code point value as defined by
>> Unicode.  In other words, char32_t always carries an implied association
>> with the Unicode character set.
....
> I like option (1). If someone has a character set with 4-byte code
> points that's not Unicode, they should build their own struct to
> represent the code points, and not re-use char32_t. It does need to be
> possible to efficiently transcode, e.g., Shift-JIS to UTF-8 without an
> intermediate step through char32_t, but that's true whether or not
> char32_t is assumed to represent Unicode code points.

It sounds like your preference is for a character to be represented by a
code point type that has an implied associated character set and that,
for Unicode encodings, the code point type be char32_t.

One of my goals for this library is to ensure that the facilities
provided are usable for all five of the (implementation defined)
encodings that the standard states must be provided (*).  If the table
below is filled in as follows, then types are needed for X and Y that
enable inferring their associated character sets to achieve these goals.

+----------+-----------+------------+---------------+
| Encoding | Code unit | Code point | Char set      |
+----------+-----------+------------+---------------+
| Ordinary | char      | X          | char_set_t<X> |
| Wide     | wchar_t   | Y          | char_set_t<Y> |
| UTF-8    | char      | char32_t   | Unicode       |
| UTF-16   | char16_t  | char32_t   | Unicode       |
| UTF-32   | char32_t  | char32_t   | Unicode       |
+----------+-----------+------------+---------------+

In other words, something like this would be needed:

class X {
public:  // the alias must be accessible to __char_set_t_helper
   using character_set_type = ...;
   // member functions to mimic integral types
};
class Y {
public:
   using character_set_type = ...;
   // member functions to mimic integral types
};

template<typename T>
struct __char_set_t_helper {
   using type = typename T::character_set_type;
};
template<>
struct __char_set_t_helper<char32_t> {
   using type = unicode_character_set;
};
template<typename T>
using char_set_t = typename __char_set_t_helper<T>::type;
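For illustration, here is a compilable variant of the trait above with the
elided parts filled in by placeholders.  The names
hypothetical_ordinary_character_set and ordinary_code_point are invented for
this sketch, and the helper is placed in a detail namespace instead of using
a reserved double-underscore name; only unicode_character_set is taken from
the discussion.

```cpp
#include <cassert>
#include <type_traits>

struct unicode_character_set {};                // named in this thread
struct hypothetical_ordinary_character_set {};  // placeholder for the sketch

// A class-type code point that names its character set via a member alias.
struct ordinary_code_point {
    using character_set_type = hypothetical_ordinary_character_set;
    unsigned char value;
};

namespace detail {
// Primary template: ask the code point type for its character set.
template <typename T>
struct char_set_helper {
    using type = typename T::character_set_type;
};
// char32_t has no member alias, so it is mapped to Unicode explicitly.
template <>
struct char_set_helper<char32_t> {
    using type = unicode_character_set;
};
}  // namespace detail

template <typename T>
using char_set_t = typename detail::char_set_helper<T>::type;

static_assert(std::is_same<char_set_t<char32_t>,
                           unicode_character_set>::value,
              "char32_t is assumed to carry Unicode code points");
static_assert(std::is_same<char_set_t<ordinary_code_point>,
                           hypothetical_ordinary_character_set>::value,
              "class types report their own character set");
```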

My question is, what is gained by using char32_t as the character type
for the Unicode encodings over a class?  The class that text_view
currently provides is pretty much useless as is, but that is just
because it is a work in progress.  What benefit is obtained by using a
fundamental type here when fundamental types aren't available to be used
for other encodings?  It seems to me that use of distinct types for code
units vs characters would be beneficial for type safety reasons.

Tom.

(*): Yes, I know that the standard specifies that the encoding used for
ordinary and wide string functions is governed by run-time locale
settings.  At compile-time, there is an encoding that is used to encode
ordinary and wide string and character literals to code unit sequences;
this is the encoding I'm referring to here.


.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 05:00:31 +0000 Raw View

On 2/8/2016 5:00 PM, Zhihao Yuan wrote:
> On Mon, Feb 8, 2016 at 2:05 PM, Tom Honermann
> <Thomas.Honermann@synopsys.com> wrote:
>>
>>> What I'm looking for is something can simplify further development of
>>> Unicode-based collate processing.  Considering char32_t is already the
>>> default type, I think I can live with that.
>>
>> I'd like to get a better understanding of your thoughts here.  I think
>> you are suggesting that the value type of the text iterators should
>> always be char32_t.  If so, I'm guessing that your view reflects one (or
>> both) of these positions:
>>
>> 1) Values of type char32_t (are expected to) always semantically refer
>> to the character with the corresponding code point value as defined by
>> Unicode.  In other words, char32_t always carries an implied association
>> with the Unicode character set.
>>
>
> Yes.  What I'm expecting is portable code point values.
> Unicode gives you portable code point values, and char32_t
> (and char16_t in limited range) should be assumed to contain
> Unicode code points.

I don't disagree with this.  But I'll pose the same question I did in
responding to Jeffrey: what is gained by using char32_t as the character
type for the Unicode encodings over a class type?  Would a type that is
distinct from the type used for code units not be beneficial for type
safety reasons?
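To make the type safety question concrete, here is a minimal sketch of a
class-type character.  It is not the actual interface of text_view's
character template; the member names and the explicit constructor are
assumptions chosen for the example.  The point is only that a UTF-32 code
unit and a character no longer convert to each other silently.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

struct unicode_character_set {};

// Hypothetical character type: a code point tagged with its character set.
template <typename CharacterSet>
class character {
public:
    using character_set_type = CharacterSet;

    // Explicit, so a bare code unit never becomes a character by accident.
    constexpr explicit character(std::uint_least32_t cp) : cp_(cp) {}

    constexpr std::uint_least32_t code_point() const { return cp_; }

    friend constexpr bool operator==(character a, character b) {
        return a.cp_ == b.cp_;
    }
    friend constexpr bool operator!=(character a, character b) {
        return !(a == b);
    }

private:
    std::uint_least32_t cp_;
};

using unicode_char = character<unicode_character_set>;

// Code units and characters are distinct types with no implicit conversion
// in either direction; construction must be spelled out.
static_assert(!std::is_convertible<char32_t, unicode_char>::value, "");
static_assert(!std::is_convertible<unicode_char, char32_t>::value, "");
static_assert(std::is_constructible<unicode_char, char32_t>::value, "");
```

With plain char32_t as the character type, the first two assertions could
not hold, since the code unit type and the character type would be the same.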

>> 2) The char32_t type suffices as the code point and character type for
>> any encoding and there is no need to track an associated character set.
>>    Applications that do require tracking an associated character set
>> would be on their own to do so.
>>
>
> It can be made true for Unicode representations, e.g. UTF-8, UTF-16,
> UTF-32 if you want, plus GB18030 if you care.  Others locale-depended
> encoding can go locale machinery.  The ones which can't even be used
> in locales, the C++ standard may be too thin for them.

My experience has been that locale machinery doesn't suffice due to
round tripping issues.

>> For case #1, this implies that dereferencing a text iterator necessarily
>> involves transcoding to Unicode for non-Unicode encodings (potentially
>> the encodings used for ordinary and wide string literals).
>>
>
> Your current implementation does not support non-Unicode
> encodings.  You conditionally supported wchar_t when it
> contains Unicode code points.  C++ execution encoding
> is not required to be ASCII, if you assumed it to be ASCII you
> are doing it wrong.

The implementation is a work in progress.  Adding support for
non-Unicode encodings is on my todo list.

The support for wchar_t is not conditional.  What is conditional is the
definition of the 'iso_10646_wide_character_encoding' encoding.  The
(compile-time) implementation defined encoding used for wide string and
character literals is identified by the
'execution_wide_character_encoding' type alias.

And yes, assuming ASCII would be doing it wrong :)

>> std::string in = get_a_string_with_some_external_encoding();
>> std::string out;
>> std::back_insert_iterator<std::string> out_it{out};
>> auto tv_in = make_text_view<some_external_encoding>(in);
>> auto tv_out = make_otext_iterator<internal_encoding>(out_it);
>> std::copy(tv_in.begin(), tv_in.end(), tv_out);
>>
>
> If your internal encoding is 8-bit narrow encoding, there is
> no conversion needed, since there is no conversion can be
> done -- if it differs from external encoding, they are mostly
> of different charsets.
>
> Otherwise, as far as I can see, the best you can do is to
> make use of locale information, but iostream/stdio do this
> sufficiently well.

Thanks for these points.  In the documentation, I clearly need to differentiate
between the compile-time encodings used for ordinary and wide strings vs
the locale dependent encodings used at run-time.

>> Another goal is to support ISO-2022 encodings (support for these
>> encodings would be, at best, optional in the standard).  These encodings
>> support escape sequences that allow switching between character sets, so
>> a code point type doesn't suffice to identify a character set (the
>> any_character_set class and corresponding character specialization exist
>> for this purpose).  These can be supported by #1, but only at the cost
>> of aggressive transcoding.
>
> Some of the iso-2022-compatible encodings can individually be
> used as locale, thus you can individually decode them into
> wchar_t, portably.  But iso-2022 itself is never actually supported,
> for example, in libstdc++.  This encoding allows some uses
> that Unicode doesn't care about, like switching fonts, going through
> information channel restrictions, etc., and it loses its value if
> you put it under a Unicode-centric (or any code-point centric)
> design.

If you look closely at the encoding classes, you'll see that they have
support for encoding state transitions.  "state transitions" might not
be the best terminology to describe these sequences, but the same
technique can be used to encode non-code-point code unit sequences.  The
iterators provided by the view produced by make_text_view()
transparently skip non-code-point encoding code unit sequences.  This
does imply information loss the same as occurs when transcoding such an
ISO-2022 encoded text to Unicode.  otext_iterator supports encoding of
state transitions, though the applicable state transitions are obviously
encoding dependent.
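A toy example may help show what skipping non-code-point sequences means.
The encoding below is invented for this sketch and is far simpler than
ISO-2022: byte 0x0E switches to character set 1, byte 0x0F switches back to
set 0, and every other byte is a code point in the active set.  The decoder
consumes the shift bytes without producing a character, and each produced
character records the set that was active.  None of these names come from
text_view.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A code point together with the (toy) character set it belongs to.
struct tagged_character {
    int character_set_id;
    std::uint8_t code_point;
};

// Decode the invented shift encoding described above.
std::vector<tagged_character> decode_toy_shift_encoding(
        const std::string& code_units) {
    constexpr unsigned char shift_out = 0x0E;  // switch to set 1
    constexpr unsigned char shift_in = 0x0F;   // switch back to set 0
    int active_set = 0;
    std::vector<tagged_character> out;
    for (char c : code_units) {
        const auto unit = static_cast<unsigned char>(c);
        // State transitions change the decoder state but yield no character.
        if (unit == shift_out) { active_set = 1; continue; }
        if (unit == shift_in) { active_set = 0; continue; }
        out.push_back({active_set, unit});
    }
    return out;
}
```

The same byte value can therefore come out as two different characters,
distinguished only by the tag, which is the situation a single code point
type cannot express.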

Tom.


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 08 Feb 2016 21:32:04 -0800 Raw View

On Tuesday, 9 February 2016 05:00:31 PST Tom Honermann wrote:
> > Yes.  What I'm expecting is portable code point values.
> > Unicode gives you portable code point values, and char32_t
> > (and char16_t in limited range) should be assumed to contain
> > Unicode code points.
>
> I don't disagree with this.  But I'll pose the same question I did in
> responding to Jeffrey: what is gained by using char32_t as the character
> type for the Unicode encodings over a class type?  Would a type that is
> distinct from the type used for code units not be beneficial for type
> safety reasons?

If char32_t didn't exist, there wouldn't be a difference. Take QChar and
char16_t: they're equivalent.

But if you ask "should we add a class knowing that char32_t exists", I'd
answer no.

> >> For case #1, this implies that dereferencing a text iterator necessarily
> >> involves transcoding to Unicode for non-Unicode encodings (potentially
> >> the encodings used for ordinary and wide string literals).
> >
> > Your current implementation does not support non-Unicode
> > encodings.  You conditionally supported wchar_t when it
> > contains Unicode code points.  C++ execution encoding
> > is not required to be ASCII, if you assumed it to be ASCII you
> > are doing it wrong.
>
> The implementation is a work in progress.  Adding support for
> non-Unicode encodings is on my todo list.

I'd require US-ASCII, Latin 1 (ISO-8859-1) and the "System" encoding as
mandatory. All others should be optional, probably by way of using ICU's
unicode/ucnv.h functionality.

"System" is whatever passes for the current locale in the system. On Windows
systems, that's the mis-named "ANSI" encoding, like CP 1252. That should match
the narrow character execution charset.

Strictly speaking, you'll also need a wide-character "System", but I know of
no system where wchar_t is not either UTF-16 or UCS-4, so it will be just an
alias.

> If you look closely at the encoding classes, you'll see that they have
> support for encoding state transitions.  "state transitions" might not
> be the best terminology to describe these sequences, but the same
> technique can be used to encode non-code-point code unit sequences.  The
> iterators provided by the view produced by make_text_view()
> transparently skip non-code-point encoding code unit sequences.  This
> does imply information loss the same as occurs when transcoding such an
> ISO-2022 encoded text to Unicode.  otext_iterator supports encoding of
> state transitions, though the applicable state transitions are obviously
> encoding dependent.

Make sure they work for streaming data (i.e., incomplete). One buffer chunk
might end in the middle of a multi-byte sequence or in a shifted state. The
state needs to be transferred when moving on to the next chunk.
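A minimal sketch of what carrying state across chunks could look like for
UTF-8.  This is not how text_view works; the class and member names are
hypothetical, and the input is assumed to be well-formed (no validation or
error reporting).

```cpp
#include <cassert>
#include <string>

// Decoder object that remembers an unfinished multi-byte sequence
// between calls, so a chunk may end anywhere.
class streaming_utf8_decoder {
public:
    // Decode one chunk, appending every completed code point to `out`.
    // A sequence cut off at the end of the chunk is kept in the decoder
    // and completed by the next call.
    void feed(const std::string& chunk, std::u32string& out) {
        for (char c : chunk) {
            const auto b = static_cast<unsigned char>(c);
            if (remaining_ == 0) {
                if (b < 0x80) {
                    out.push_back(b);  // ASCII: complete on its own
                } else if (b >= 0xF0) {
                    partial_ = b & 0x07; remaining_ = 3;
                } else if (b >= 0xE0) {
                    partial_ = b & 0x0F; remaining_ = 2;
                } else {
                    partial_ = b & 0x1F; remaining_ = 1;
                }
            } else {
                // Continuation byte: add six more payload bits.
                partial_ = (partial_ << 6) | (b & 0x3F);
                if (--remaining_ == 0) out.push_back(partial_);
            }
        }
    }

    // True when the last chunk ended in the middle of a sequence.
    bool mid_sequence() const { return remaining_ != 0; }

private:
    char32_t partial_ = 0;  // bits collected so far
    int remaining_ = 0;     // continuation bytes still expected
};
```

For a stateful encoding, the shift state would have to live in the same
object alongside the partial sequence.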

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center


.

Author: Zhihao Yuan <zy@miator.net>
Date: Mon, 8 Feb 2016 23:51:51 -0600 Raw View

On Mon, Feb 8, 2016 at 11:00 PM, Tom Honermann
<Thomas.Honermann@synopsys.com> wrote:
>
>> Yes.  What I'm expecting is portable code point values.
>> Unicode gives you portable code point values, and char32_t
>> (and char16_t in limited range) should be assumed to contain
>> Unicode code points.
>
> I don't disagree with this.  But I'll pose the same question I did in
> responding to Jeffrey: what is gained by using char32_t as the character
> type for the Unicode encodings over a class type?  Would a type that is
> distinct from the type used for code units not be beneficial for type
> safety reasons?
>

If you want to use some wrapper types, it might gain or lose some
usability; I can't say much about it at this point.  If you want to
use some other types, int32_t can allow you, in a standard-conforming
way, to alias ICU's UChar32.  Just saying; don't take this too seriously.
I care more about the numeric code point values.

>
> My experience has been that locale machinery doesn't suffice due to
> round tripping issues.
>
> [...]
>
> The support for wchar_t is not conditional.  What is conditional is the
> definition of the 'iso_10646_wide_character_encoding' encoding.  The
> (compile-time) implementation defined encoding used for wide string and
> character literals is identified by the
> 'execution_wide_character_encoding' type alias.
>
> And yes, assuming ASCII would be doing it wrong :)
>
> [...]
>
> Thanks for these points.  In doc, I clearly need to differentiate
> between the compile-time encodings used for ordinary and wide strings vs
> the locale dependent encodings used at run-time.
>

Discussion about these went here:

  https://github.com/tahonermann/text_view/issues/11

>> This encoding allows some uses
>> that Unicode doesn't care about, like switching fonts, going through
>> information channel restrictions, etc., and it loses its value if
>> you put it under a Unicode-centric (or any code-point centric)
>> design.
>
> If you look closely at the encoding classes, you'll see that they have
> support for encoding state transitions.  "state transitions" might not
> be the best terminology to describe these sequences, but the same
> technique can be used to encode non-code-point code unit sequences.  The
> iterators provided by the view produced by make_text_view()
> transparently skip non-code-point encoding code unit sequences.  This
> does imply information loss the same as occurs when transcoding such an
> ISO-2022 encoded text to Unicode.  otext_iterator supports encoding of
> state transitions, though the applicable state transitions are obviously
> encoding dependent.

You didn't get my point... Can your facility produce two kinds of
code point values (which may differ in type or tag, and may even overlap
in values) from one external encoding?  Unicode is a universal code
point design: the same character existing in two different languages
cannot be distinguished, but in iso-2022 it can.  For example, a system
can assign different glyphs to the same character in two languages
(if you translate them into Unicode).  Unicode tries to address this
via variation sequences, but this works on the character level, while
iso-2022's solution works on the code point level, since there was
just no glyph unification being made.

Anyway, this is not interesting.  People who want to support it
(in the standard) need sufficient motivation to work on it.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd


.

Author: Zhihao Yuan <zy@miator.net>
Date: Tue, 9 Feb 2016 00:25:28 -0600 Raw View

On Mon, Feb 8, 2016 at 11:32 PM, Thiago Macieira <thiago@macieira.org> wrote:
>
> I'd require US-ASCII, Latin 1 (ISO-8859-1) and the "System" encoding as
> mandatory. All others should be optional, probably by way of using ICU's
> unicode/ucnv.h functionality.
>

Neutral on ASCII, against iso-8859-1 -- implementation-defined
is where it should go.

> "System" is whatever passes for the current locale in the system. On Windows
> systems, that's the mis-named "ANSI" encoding, like CP 1252. That should match
> the narrow character execution charset.
>
> Strictly speaking, you'll also need a wide-character "System", but I know of
> no system where wchar_t is not either UTF-16 or UCS-4, so it will be just an
> alias.

As a FreeBSD user, I have to say... no.

We use each locale's encoding's documented, official code points in
wchar_t.

NetBSD uses Japanese standards' code points for Japanese,
UCS-4 for others.

And of course, IBM doesn't agree with you either, since it doesn't
agree with everything...

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd


.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 06:29:51 +0000 Raw View

On 2/9/2016 12:32 AM, Thiago Macieira wrote:
> On Tuesday, 9 February 2016 05:00:31 PST Tom Honermann wrote:
> I'd require US-ASCII, Latin 1 (ISO-8859-1) and the "System" encoding as
> mandatory. All others should be optional, probably by way of using ICU's
> unicode/ucnv.h functionality.
>
> "System" is whatever passes for the current locale in the system. On Windows
> systems, that's the mis-named "ANSI" encoding, like CP 1252. That should match
> the narrow character execution charset.

Sounds right.

> Strictly speaking, you'll also need a wide-character "System",

Also sounds right.

> but I know of
> no system where wchar_t is not either UTF-16 or UCS-4, so it will be just an
> alias.

z/OS XL C++ uses EBCDIC by default, complete with support for multibyte
SI/SO sequences.  Compiler options and pragma directives can be used to
select alternative encodings.

>> If you look closely at the encoding classes, you'll see that they have
>> support for encoding state transitions.  "state transitions" might not
>> be the best terminology to describe these sequences, but the same
>> technique can be used to encode non-code-point code unit sequences.  The
>> iterators provided by the view produced by make_text_view()
>> transparently skip non-code-point encoding code unit sequences.  This
>> does imply information loss the same as occurs when transcoding such an
>> ISO-2022 encoded text to Unicode.  otext_iterator supports encoding of
>> state transitions, though the applicable state transitions are obviously
>> encoding dependent.
>
> Make sure they work for streaming data (i.e., incomplete). One buffer chunk
> might end in the middle of a multi-byte sequence or in a shifted state. The
> state needs to be transferred when moving on to the next chunk.

Support for this is not currently present, at least not in a way that
anyone would want to use.  In underflow scenarios, an exception is
thrown.  The decode can then be restarted with an adjusted end iterator
(likely implying invalidation of prior iterators), but for input
iterators, code units will have been lost.  I have given this a little
thought.  Since the interface is all based on iterators, one possibility
is that buffering can be provided by iterator adapters operating over
the streaming code unit sequence.  This wouldn't work in situations
where callers need control to return in order to fetch additional data
though.

There is a trade off between supporting suspending decode in the middle
of a code unit sequence and supporting input iterators; if iteration of
the code unit sequence isn't restartable, then additional state must be
maintained and this will bloat the size of code point iterators and
affect their performance.

Additionally, suspending decode in the middle of a code unit sequence
doesn't work for iterators - a dereferenced iterator needs to either
provide a code point or throw an exception.

Tom.

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 06:48:08 +0000 Raw View

On 2/9/2016 12:51 AM, Zhihao Yuan wrote:
>>> This encoding allows some uses
>>> that Unicode doesn't care about, like switching fonts, going through
>>> information channel restrictions, etc., and it loses its value if
>>> you put it under a Unicode-centric (or any code-point centric)
>>> design.
>>
>> If you look closely at the encoding classes, you'll see that they have
>> support for encoding state transitions.  "state transitions" might not
>> be the best terminology to describe these sequences, but the same
>> technique can be used to encode non-code-point code unit sequences.  The
>> iterators provided by the view produced by make_text_view()
>> transparently skip non-code-point encoding code unit sequences.  This
>> does imply information loss the same as occurs when transcoding such an
>> ISO-2022 encoded text to Unicode.  otext_iterator supports encoding of
>> state transitions, though the applicable state transitions are obviously
>> encoding dependent.
>
> You didn't get my point... Can your facility produce two kinds of
> code point values (which may differ in type or tag, and may even overlap
> in values) from one external encoding?

Yes.  This is the reason for the any_character_set and the
character<any_character_set> specialization.  The character class
primary template identifies an associated character set via an
associated type (the template argument).  The
character<any_character_set> specialization identifies an associated
character set via a data member.  Encodings that support switching between
character sets declare their associated character set as the
any_character_set, and then set the character set ID (see
get_character_set_id()) on the returned character appropriately.

Though these facilities exist, I haven't yet written a codec that proves
they work adequately.
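
A minimal sketch of the distinction described above, assuming simplified
stand-in declarations: the member names, the constructor signatures and the
integer character_set_id type are illustrative guesses, not the actual
Text_view definitions.  The primary template fixes the character set in the
type, while the any_character_set specialization carries an ID at run time.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins; the real Text_view declarations may differ.
struct unicode_character_set {};
struct any_character_set {};
using character_set_id = int;  // hypothetical ID type, assumed for this sketch

// Primary template: the character set is identified by the template argument.
template<typename CharacterSet>
class character {
public:
    using character_set_type = CharacterSet;
    explicit character(std::uint32_t cp) : code_point_(cp) {}
    std::uint32_t get_code_point() const { return code_point_; }
private:
    std::uint32_t code_point_;
};

// Specialization: the character set is identified by a data member.
template<>
class character<any_character_set> {
public:
    using character_set_type = any_character_set;
    character(character_set_id id, std::uint32_t cp) : id_(id), code_point_(cp) {}
    character_set_id get_character_set_id() const { return id_; }
    std::uint32_t get_code_point() const { return code_point_; }
private:
    character_set_id id_;
    std::uint32_t code_point_;
};

// Equal only if both the character set ID and the code point value match.
inline bool operator==(const character<any_character_set>& a,
                       const character<any_character_set>& b) {
    return a.get_character_set_id() == b.get_character_set_id()
        && a.get_code_point() == b.get_code_point();
}

// Usage: the same code point value tagged with two different character sets.
inline void any_character_set_example() {
    character<any_character_set> a(1, 0x41), b(2, 0x41);
    assert(!(a == b));
}
```

With this shape, an encoding that switches character sets mid-stream can
return overlapping code point values that still compare unequal.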

> Unicode is a universal code
> point design: the same character existing in two different languages
> cannot be distinguished, but with iso-2022 it can.  For example, a system
> can assign different glyphs to the same character in two languages
> (if you translate them into Unicode).  Unicode tries to address this
> via variation sequences, but this works on character level, while
> iso-2022's solution works on code point level, since there was
> just no glyph unification being made.
>
> Anyway, this is not interesting.  People who want to support it
> (in the standard) need sufficient motivation to work on it.

I agree; I can't imagine anyone wants to see ISO-2022 support required.
I only want to ensure that the interface can support it should someone
be so motivated as to provide an implementation.

Tom.

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 08 Feb 2016 23:02:25 -0800 Raw View

On Tuesday, 9 February 2016 00:25:28 PST Zhihao Yuan wrote:
> On Mon, Feb 8, 2016 at 11:32 PM, Thiago Macieira <thiago@macieira.org> wrote:
> > I'd require US-ASCII, Latin 1 (ISO-8859-1) and the "System" encoding as
> > mandatory. All others should be optional, probably by way of using ICU's
> > unicode/ucnv.h functionality.
>
> Neutral on ASCII, against iso-8859-1 -- implementation-defined
> is where it should go.

Latin1 is useful because it's a strict 1:1 mapping of the first 256 UCS-4
code points to 8 bits, so transforming from Latin 1 to UTF-16 or UCS-4 is
extremely fast. Any "from US-ASCII" code could just as well be "from Latin 1".

The conversion "To Latin1" is also very easy if you define that the output is
undefined if the source contained non-Latin1 content. It's also the exact same
code for "To US-ASCII" if you add the same constraint to it.

Converting with error checking to either Latin1 or to US-ASCII is usually
equally difficult, though it varies depending on whether the platform has
signed or unsigned saturation.
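
To make the 1:1 mapping concrete, here is a minimal scalar sketch (no SIMD),
assuming hypothetical helper names latin1_to_utf16 and
utf16_to_latin1_unchecked.  Widening each byte is the whole conversion in one
direction; the unchecked reverse direction simply truncates, so its output is
only meaningful when every code unit is at most 0xFF.

```cpp
#include <cassert>
#include <string>

// Latin-1 -> UTF-16: each byte value is the code point, so widening suffices.
inline std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}

// UTF-16 -> Latin-1 without error checking: code units above 0xFF are
// silently truncated, so the result is garbage for non-Latin-1 input.
inline std::string utf16_to_latin1_unchecked(const std::u16string& in) {
    std::string out;
    out.reserve(in.size());
    for (char16_t u : in)
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    return out;
}

// Usage: a round trip through UTF-16 preserves Latin-1 text.
inline void latin1_example() {
    const std::string latin1 = "caf\xE9";  // 0xE9 is e-acute in Latin-1
    assert(utf16_to_latin1_unchecked(latin1_to_utf16(latin1)) == latin1);
}
```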

> > "System" is whatever passes for the current locale in the system. On
> > Windows systems, that's the mis-named "ANSI" encoding, like CP 1252. That
> > should match the narrow character execution charset.
> >
> > Strictly speaking, you'll also need a wide-character "System", but I know
> > of no system where wchar_t is not either UTF-16 or UCS-4, so it will be
> > just an alias.
>
> As a FreeBSD user, I have to say... no.
>
> We use each locale's encoding's documented, official code points in
> wchar_t.
>=20
> NetBSD uses Japanese standards' code points for Japanese,
> UCS-4 for others.

Interesting, I didn't know that. But note how the compiler doesn't agree with
that and always assumes UCS-4 for the wide character execution charset (and
for that matter, UTF-8 for the source character set):

$ uname -sr
FreeBSD 10.2-RELEASE-p9
$ export LC_ALL=cs_CS
$ cat test.cpp
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
 setlocale(LC_ALL, "");
 wprintf(L"à\n");
}
$ clang test.cpp
$ ./a.out | od -tx1
0000000 e0 0a
0000002
$ LC_ALL=el_GR ./a.out | od -tx1
0000000 e0 0a
0000002

The compiler converted "à" to wchar_t(0x00E0), which the C library in the
Czech locale interprets as "ŕ".

You've got Mojibake.

> And of course, IBM doesn't agree with you either, since it doesn't
> agree with everything...

Yeah, no comment.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 08 Feb 2016 23:08:52 -0800 Raw View

On Tuesday, 9 February 2016 06:29:51 PST Tom Honermann wrote:
> > Make sure they work for streaming data (i.e., incomplete). One buffer
> > chunk might end in the middle of a multi-byte sequence or in a shifted
> > state. The state needs to be transferred when moving on to the next chunk.
>
> Support for this is not currently present, at least not in a way that
> anyone would want to use.  In underflow scenarios, an exception is
> thrown.

That doesn't sound like an exceptional scenario to me, so I would advise
against using exceptions for this.

> The decode can then be restarted with an adjusted end iterator
> (likely implying invalidation of prior iterators), but for input
> iterators, code units will have been lost.  I have given this a little
> thought.  Since the interface is all based on iterators, one possibility
> is that buffering can be provided by iterator adapters operating over
> the streaming code unit sequence.  This wouldn't work in situations
> where callers need control to return in order to fetch additional data
> though.

I would say that you should make the iterator stateful.

The next problem will be to make sure that the iterator can compare to the
"end" iterator.

> There is a trade off between supporting suspending decode in the middle
> of a code unit sequence and supporting input iterators; if iteration of
> the code unit sequence isn't restartable, then additional state must be
> maintained and this will bloat the size of code point iterators and
> affect their performance.

True. Streaming is probably more important.

Use-case: decoding a large file. You don't want to load the entire file to
memory, but instead read it in chunks.

For encodings with no shift state, you can get away with this by having a
minimum buffer size which guarantees that the current character being
processed is terminated. For example, for UTF-8 that would be 4 bytes. This
complicates the iteration, since the loop can no longer simply run while
 it != end
but must instead stop as soon as
 it + (last_chunk ? 0 : minimum_chunk_size) >= end

But for encodings with shift state, there is no minimum size, as the state can
remain indefinitely. On the other hand, the state that needs to be kept is
small.
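
A minimal sketch of the stateful alternative, assuming a hypothetical
Utf8ChunkDecoder class that is not part of Text_view.  It keeps the partially
decoded code point and the number of outstanding continuation bytes between
calls, so a chunk may end in the middle of a sequence without an exception and
without a minimum chunk size.  For brevity it does not reject overlong forms
or encoded surrogates; malformed input becomes U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 chunk by chunk, carrying an incomplete
// sequence over to the next call to feed().
class Utf8ChunkDecoder {
public:
    // Decode one chunk, appending every completed code point to `out`.
    void feed(const unsigned char* data, std::size_t size, std::u32string& out) {
        assert(data != nullptr || size == 0);
        for (std::size_t i = 0; i < size; ++i) {
            const unsigned char b = data[i];
            if (pending_ > 0) {
                if ((b & 0xC0) == 0x80) {          // continuation byte
                    cp_ = (cp_ << 6) | (b & 0x3F);
                    if (--pending_ == 0) out.push_back(cp_);
                    continue;
                }
                out.push_back(U'\uFFFD');          // sequence was cut short
                pending_ = 0;                      // reinterpret b as a lead byte
            }
            if (b < 0x80)                { out.push_back(b); }
            else if ((b & 0xE0) == 0xC0) { cp_ = b & 0x1F; pending_ = 1; }
            else if ((b & 0xF0) == 0xE0) { cp_ = b & 0x0F; pending_ = 2; }
            else if ((b & 0xF8) == 0xF0) { cp_ = b & 0x07; pending_ = 3; }
            else                         { out.push_back(U'\uFFFD'); }
        }
    }

    // True if the input seen so far ends in the middle of a sequence.
    bool incomplete() const { return pending_ > 0; }

private:
    char32_t cp_ = 0;   // code point bits accumulated so far
    int pending_ = 0;   // continuation bytes still expected
};
```

The caller leaves its read loop normally when a chunk is exhausted, fetches
more data, and calls feed() again; incomplete() only matters after the last
chunk, where a true result means the input was truncated.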

> Additionally, suspending decode in the middle of a code unit sequence
> doesn't work for iterators - a dereferenced iterator needs to either
> provide a code point or throw an exception.

Agreed. You shouldn't try to decode that iterator. You need to go out of the
loop, without an exception, to get more data.

.

Author: Zhihao Yuan <zy@miator.net>
Date: Tue, 9 Feb 2016 01:30:26 -0600 Raw View

On Tue, Feb 9, 2016 at 1:02 AM, Thiago Macieira <thiago@macieira.org> wrote:
>
> Converting with error checking to either Latin1 or to US-ASCII is usually as
> difficult, though it varies depending on whether the platform has signed or
> unsigned saturation.
>

But ASCII is 7-bit: when decoding goes wrong, a library can immediately
spot the issue. Latin1 is 8-bit, so the library can't tell that anything is
wrong, and this causes a large amount of badly encoded text in many systems,
many applications, many...

>>
>> As a FreeBSD user, I have to say... no.
>>
>> We use each locale's encoding's documented, official code points in
>> wchar_t.
>>
>> NetBSD uses Japanese standards' code points for Japanese,
>> UCS-4 for others.
>
> Interesting, I didn't know that. But note how the compiler doesn't agree with
> that and always assumes UCS-4 for the wide character execution charset (and
> for that matter, UTF-8 for the source character set):
>
> $ uname -sr
> FreeBSD 10.2-RELEASE-p9
> $ export LC_ALL=cs_CS
>
> [...]
>
> The compiler converted "à" to wchar_t(0x00E0), which the C library in the
> Czech locale interprets as "ŕ".
>
> You've got Mojibake.

Yes.  On these systems, multibyte encodings cannot go into
"" or L"".  Both FreeBSD and the C/C++ standards understand
the locale-dependent nature of wchar_t, but GCC needs to
fulfill its customers' demands.

It's not quite a limitation, because practical applications
load multilingual text from external sources.

C++11 Unicode literals can portably solve this.

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Mon, 08 Feb 2016 23:55:18 -0800 Raw View

On Tuesday, 9 February 2016 01:30:26 PST Zhihao Yuan wrote:
> On Tue, Feb 9, 2016 at 1:02 AM, Thiago Macieira <thiago@macieira.org> wrote:
> > Converting with error checking to either Latin1 or to US-ASCII is usually
> > as difficult, though it varies depending on whether the platform has
> > signed or unsigned saturation.
>
> But ASCII is 7-bit, when decoding goes wrong, a library can immediately
> spot this issue; Latin1 is 8-bit, library doesn't know what is wrong, and
> this cause a large amount of bad encodings in many systems, many
> applications, many...

I either don't understand what you meant or you didn't get what I meant.

If you simply discarded bits 8 through 31 of the char32_t, you might still end
up with valid US-ASCII data that wasn't US-ASCII in the first place. Example:

 char32_t str[] = { 0x161 };

You can't simply discard the high bits, since the result would be 0x61 and
that's 'a' in US-ASCII.

To do the error-checking without special processor instructions, you'd write:

 for ( ; *in; ++in, ++out)
  if (*in > char32_t(0x7f))
   signal_error();
  else
   *out = *in;

The same code for Latin1 would replace 0x7f with 0xff.

The one that is faster will depend on which processor instructions are
available for SIMD processing.
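
A self-contained version of that scalar loop might look like the sketch below.
The function name narrow_checked and the bool return used in place of
signal_error() are assumptions made for illustration; the limit parameter
selects US-ASCII (0x7F) or Latin1 (0xFF).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: narrows UCS-4 to single bytes, failing on the first
// code point above `limit` (0x7F for US-ASCII, 0xFF for Latin1).
// On failure `out` holds only the characters converted so far.
inline bool narrow_checked(const std::u32string& in, char32_t limit,
                           std::string& out) {
    assert(limit <= 0xFF);  // the result must fit in one byte
    out.clear();
    for (char32_t c : in) {
        if (c > limit)
            return false;   // plays the role of signal_error()
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return true;
}

// Usage: U+0161 is rejected under both limits instead of turning into 'a'.
inline void narrow_checked_example() {
    std::string s;
    assert(!narrow_checked(U"\u0161", 0x7F, s));
    assert(!narrow_checked(U"\u0161", 0xFF, s));
}
```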

> Yes.  On these systems, multibyte encodings cannot go into
> "" or L"".  Both FreeBSD and the C/C++ standards understand
> the locale-dependent nature of wchar_t, but GCC needs to
> fulfill its customer's demands.

Neither the C nor the C++ standard understands the nature of locales in source
code. This was a particularly important complaint of mine about Unicode
strings in C++11: the committee stopped just short of making them completely
useful.

You cannot write:
 u"á"

because you don't know how the compiler will interpret the bytes in your
source file that make up that 'á' character.

I understand the difficulties in getting the required changes through, but it's
still a serious shortcoming.

> It's not quite a limitation because practical applications
> loads multilingual stuff from external sources.

It is, because quite a lot of them assume Unicode in wchar_t.

Qt does.

> C++11 Unicode literals can portability solve this.

No, they can't. As discussed above, you cannot write:

 static const char16_t str[] = u"Résumé";

You could write:

 static const char16_t str[] = u"R\u00e9sum\u00e9";

but that's ugly, unwieldy and very error-prone. (Did I get the codepoint
right? I didn't look it up) And it gets worse for non-Latin scripts:

 static const char16_t str[] =
  u"\u0393\u03b5\u03b9\u03ac \u03c3\u03bf\u03c5 "
  "\u039a\u03cc\u03c3\u03bc\u03b5";

And then try printing that char16_t literal... there's nothing in stdio.h and
if you use <iostream>

 std::cout << str << std::endl;

it prints for me:

0x400d78

Another missing feature that contributes to C++11 Unicode strings being just
short of useful.
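
One workaround is to transcode by hand before printing.  The sketch below
assumes a hypothetical helper named utf16_to_utf8 (nothing the standard
library provides under that name) and assumes the terminal expects UTF-8;
unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical helper: converts UTF-16 to UTF-8 so the text can be written
// to a narrow stream. Unpaired surrogates become U+FFFD.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {           // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                  // consume the low surrogate
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {    // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage: prints the text rather than the address of the array.
inline void print_utf16(const std::u16string& s) {
    std::cout << utf16_to_utf8(s) << '\n';
}

inline void utf16_example() {
    assert(utf16_to_utf8(u"R\u00e9sum\u00e9") == "R\xC3\xA9sum\xC3\xA9");
}
```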

.

Author: Mathias Gaunard <mathias@gaunard.com>
Date: Tue, 9 Feb 2016 09:34:03 +0000 Raw View

On Mon, Feb 8, 2016 at 6:27 AM, Tom Honermann <Thomas.Honermann@synopsys.com> wrote:

>
> Text_view avoids introducing another string type.  Instead, it provides
> facilities for constructing a view over any range, view, or container
> that holds a code unit sequence; the view associates an encoding with
> the code unit sequence and provides iterators that decode the sequence
> and produce code point values.  The value type of the iterator type is a
> character type that associates the code point value with a character set.
>

I haven't had the time to go through your code yet, but I have three
questions:
 - How do you handle validation: do you validate on construction or trust
   the programmer? Can you take shortcuts in your conversion if you statically
   know the encoding?
 - Have you considered expanding this to normalization forms?
 - How can you have a single "utf-8 text view" type that covers both text
   that was originally utf-8 and text that was converted?

.

Author: "'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals" <std-proposals@isocpp.org>
Date: Tue, 9 Feb 2016 10:03:58 -0800 Raw View

On Feb 8, 2016 8:35 PM, "Tom Honermann" <Thomas.Honermann@synopsys.com> wrote:
>
> On 2/8/2016 3:27 PM, 'Jeffrey Yasskin' via ISO C++ Standard - Future
> Proposals wrote:
> > On Mon, Feb 8, 2016 at 12:05 PM, Tom Honermann
> > <Thomas.Honermann@synopsys.com> wrote:
> >> On 2/8/2016 11:08 AM, Zhihao Yuan wrote:
> >>> On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann
> >>> <Thomas.Honermann@synopsys.com> wrote:
> >>>>
> >>>> CT isn't char32_t; it is character<unicode_character_set>.
> >>>
> >>> I read your source code so I know it is :)
> >>>
> >>>> C++ doesn't provide a type that meets my criteria for a character type.
> >>>>     char, wchar_t, char16_t, and char32_t are code unit types that are
> >>>> sometimes used as code point types.  They don't qualify as character
> >>>> types because they do not have an explicit or implicit associated
> >>>> character set to give meaning to their values.
> >>>
> >>> What I'm looking for is something that can simplify further development
> >>> of Unicode-based collate processing.  Considering char32_t is already the
> >>> default type, I think I can live with that.
> >>
> >> I'd like to get a better understanding of your thoughts here.  I think
> >> you are suggesting that the value type of the text iterators should
> >> always be char32_t.  If so, I'm guessing that your view reflects one (or
> >> both) of these positions:
> >>
> >> 1) Values of type char32_t (are expected to) always semantically refer
> >> to the character with the corresponding code point value as defined by
> >> Unicode.  In other words, char32_t always carries an implied association
> >> with the Unicode character set.
> ...
> > I like option (1). If someone has a character set with 4-byte code
> > points that's not Unicode, they should build their own struct to
> > represent the code points, and not re-use char32_t. It does need to be
> > possible to efficiently transcode, e.g., Shift-JIS to UTF-8 without an
> > intermediate step through char32_t, but that's true whether or not
> > char32_t is assumed to represent Unicode code points.
>
> It sounds like your preference is for a character to be represented by a
> code point type that has an implied associated character set and that,
> for Unicode encodings, the code point type be char32_t.
>
> One of my goals for this library is to ensure that the facilities
> provided are usable for all five of the (implementation defined)
> encodings that the standard states must be provided (*).  If the table
> below is filled in as follows, then types are needed for X and Y that
> enable inferring their associated character sets to achieve these goals.
>
> +----------+-----------+------------+---------------+
> | Encoding | Code unit | Code point | Char set      |
> +----------+-----------+------------+---------------+
> | Ordinary | char      | X          | char_set_t<X> |
> | Wide     | wchar_t   | Y          | char_set_t<Y> |
> | UTF-8    | char      | char32_t   | Unicode       |
> | UTF-16   | char16_t  | char32_t   | Unicode       |
> | UTF-32   | char32_t  | char32_t   | Unicode       |
> +----------+-----------+------------+---------------+
>
> In other words, something like this would be needed:
>
> class X {
> public:
>    using character_set_type = ...;
>    // member functions to mimic integral types
> };
> class Y {
> public:
>    using character_set_type = ...;
>    // member functions to mimic integral types
> };
>
> template<typename T>
> struct __char_set_t_helper {
>    using type = typename T::character_set_type;
> };
> template<>
> struct __char_set_t_helper<char32_t> {
>    using type = unicode_character_set;
> };
> template<typename T>
> using char_set_t = typename __char_set_t_helper<T>::type;
>
> My question is, what is gained by using char32_t as the character type
> for the Unicode encodings over a class?  The class that text_view
> currently provides is pretty much useless as is, but that is just
> because it is a work in progress.  What benefit is obtained by using a
> fundamental type here when fundamental types aren't available to be used
> for other encodings?  It seems to me that use of distinct types for code
> units vs characters would be beneficial for type safety reasons.
>
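
For reference, a compilable sketch of the quoted trait.  The character set
tags ordinary_character_set and wide_character_set, and the contents of X and
Y, are placeholders invented for this example; the helper lives in a detail
namespace because names containing a double underscore are reserved for the
implementation.

```cpp
#include <type_traits>

// Placeholder character set tags, assumed for illustration only.
struct unicode_character_set {};
struct ordinary_character_set {};
struct wide_character_set {};

// Code point types for the ordinary and wide encodings. The nested alias
// must be publicly accessible for the trait below to see it.
struct X {
    using character_set_type = ordinary_character_set;
    unsigned char value;
};
struct Y {
    using character_set_type = wide_character_set;
    wchar_t value;
};

namespace detail {
// Class types name their character set through a nested alias.
template<typename T>
struct char_set_helper {
    using type = typename T::character_set_type;
};
// char32_t cannot carry a nested alias, so it is mapped to Unicode here.
template<>
struct char_set_helper<char32_t> {
    using type = unicode_character_set;
};
}  // namespace detail

template<typename T>
using char_set_t = typename detail::char_set_helper<T>::type;

static_assert(std::is_same<char_set_t<char32_t>, unicode_character_set>::value,
              "char32_t implies the Unicode character set");
```

Note that char_set_t<char16_t> or char_set_t<char> fails to compile under this
sketch, which matches the premise that only char32_t implies a character set.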

My current hypothesis is that we can use Unicode as the decoded form of
every other encoding, and so get away with char32_t as the only decoded
type. What's an example encoding that falsifies my hypothesis?

That said, I'm sympathetic to statically catching mix-ups between the
internal decoded form and the external form, even if the external form is
UTF-32. On the third hand, 1) all external forms are bytes, since they
could be in various endiannesses, and we wind up with UTF-8 being the only
ambiguous internal encoding, as usual; and 2) the guideline I've learned is
that you generally don't want to decode to code points when handling
international text, so even though we do need to provide that interface, we
shouldn't aim for it to be the one people generally use.

Jeffrey

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at https://groups.google.com/a/isocpp.org/group/std-proposals/.

--047d7b10c8bb6d9dd2052b5a25bc
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><p dir=3D"ltr">On Feb 8, 2016 8:35 PM, &quot;Tom Honermann=
&quot; &lt;<a href=3D"mailto:Thomas.Honermann@synopsys.com" target=3D"_blan=
k" class=3D"cremed">Thomas.Honermann@synopsys.com</a>&gt; wrote:<br>
&gt;<br>
&gt; On 2/8/2016 3:27 PM, &#39;Jeffrey Yasskin&#39; via ISO C++ Standard - =
Future<br>
&gt; Proposals wrote:<br>
&gt; &gt; On Mon, Feb 8, 2016 at 12:05 PM, Tom Honermann<br>
&gt; &gt; &lt;<a href=3D"mailto:Thomas.Honermann@synopsys.com" target=3D"_b=
lank" class=3D"cremed">Thomas.Honermann@synopsys.com</a>&gt; wrote:<br>
&gt; &gt;&gt; On 2/8/2016 11:08 AM, Zhihao Yuan wrote:<br>
&gt; &gt;&gt;&gt; On Mon, Feb 8, 2016 at 8:53 AM, Tom Honermann<br>
&gt; &gt;&gt;&gt; &lt;<a href=3D"mailto:Thomas.Honermann@synopsys.com" targ=
et=3D"_blank" class=3D"cremed">Thomas.Honermann@synopsys.com</a>&gt; wrote:=
<br>
&gt; &gt;&gt;&gt;&gt;<br>
&gt; &gt;&gt;&gt;&gt; CT isn&#39;t char32_t; it is character&lt;unicode_cha=
racter_set&gt;.<br>
&gt; &gt;&gt;&gt;<br>
&gt; &gt;&gt;&gt; I read your source code so I know it is :)<br>
&gt; &gt;&gt;&gt;<br>
&gt; &gt;&gt;&gt;&gt; C++ doesn&#39;t provide a type that meets my criteria=
 for a character type.<br>
&gt; &gt;&gt;&gt;&gt;=C2=A0 =C2=A0 =C2=A0char, wchar_t, char16_t, and char3=
2_t are code unit types that are<br>
&gt; &gt;&gt;&gt;&gt; sometimes used as code point types.=C2=A0 They don&#3=
9;t qualify as character<br>
&gt; &gt;&gt;&gt;&gt; types because they do not have an explicit or implici=
t associated<br>
&gt; &gt;&gt;&gt;&gt; character set to give meaning to their values.<br>
&gt; &gt;&gt;&gt;<br>
&gt; &gt;&gt;&gt; What I&#39;m looking for is something can simplify furthe=
r development of<br>
&gt; &gt;&gt;&gt; Unicode-based collate processing.=C2=A0 Considering char3=
2_t is already the<br>
&gt; &gt;&gt;&gt; default type, I think I can live with that.<br>
&gt; &gt;&gt;<br>
&gt; &gt;&gt; I&#39;d like to get a better understanding of your thoughts h=
ere.=C2=A0 I think<br>
&gt; &gt;&gt; you are suggesting that the value type of the text iterators =
should<br>
&gt; &gt;&gt; always be char32_t.=C2=A0 If so, I&#39;m guessing that your v=
iew reflects one (or<br>
&gt; &gt;&gt; both) of these positions:<br>
&gt; &gt;&gt;<br>
&gt; &gt;&gt; 1) Values of type char32_t (are expected to) always semantica=
lly refer<br>
&gt; &gt;&gt; to the character with the corresponding code point value as d=
efined by<br>
&gt; &gt;&gt; Unicode.=C2=A0 In other words, char32_t always caries an impl=
ied association<br>
&gt; &gt;&gt; with the Unicode character set.<br>
&gt; ...<br>
&gt; &gt; I like option (1). If someone has a character set with 4-byte cod=
e<br>
&gt; &gt; points that&#39;s not Unicode, they should build their own struct=
 to<br>
&gt; &gt; represent the code points, and not re-use char32_t. It does need =
to be<br>
&gt; &gt; possible to efficiently transcode, e.g., Shift-JIS to UTF-8 witho=
ut an<br>
&gt; &gt; intermediate step through char32_t, but that&#39;s true whether o=
r not<br>
&gt; &gt; char32_t is assumed to represent Unicode code points.<br>
&gt;<br>
&gt; It sounds like your preference is for a character to be represented by=
 a<br>
&gt; code point type that has an implied associated character set and that,=
<br>
&gt; for Unicode encodings, the code point type be char32_t.<br>
&gt;<br>
&gt; One of my goals for this library is to ensure that the facilities<br>
&gt; provided are usable for all five of the (implementation defined)<br>
&gt; encodings that the standard states must be provided (*).=C2=A0 If the =
table<br>
&gt; below is filled in as follows, then types are needed for X and Y that<=
br>
&gt; enable inferring their associated character sets to achieve these goal=
s.<br>
&gt;<br>
&gt; +----------+-----------+------------+---------------+<br>
&gt; | Encoding | Code unit | Code point | Char set=C2=A0 =C2=A0 =C2=A0 |<b=
r>
&gt; +----------+-----------+------------+---------------+<br>
&gt; | Ordinary | char=C2=A0 =C2=A0 =C2=A0 | X=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 | char_set_t&lt;X&gt; |<br>
&gt; | Wide=C2=A0 =C2=A0 =C2=A0| wchar_t=C2=A0 =C2=A0| Y=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 | char_set_t&lt;Y&gt; |<br>
&gt; | UTF-8=C2=A0 =C2=A0 | char=C2=A0 =C2=A0 =C2=A0 | char32_t=C2=A0 =C2=
=A0| Unicode=C2=A0 =C2=A0 =C2=A0 =C2=A0|<br>
&gt; | UTF-16=C2=A0 =C2=A0| char16_t=C2=A0 | char32_t=C2=A0 =C2=A0| Unicode=
=C2=A0 =C2=A0 =C2=A0 =C2=A0|<br>
&gt; | UTF-32=C2=A0 =C2=A0| char32_t=C2=A0 | char32_t=C2=A0 =C2=A0| Unicode=
=C2=A0 =C2=A0 =C2=A0 =C2=A0|<br>
&gt; +----------+-----------+------------+---------------+<br>
&gt;<br>
&gt; In other words, something like this would be needed:<br>
&gt;<br>
&gt; class X {<br>
&gt; =C2=A0 =C2=A0using character_set_type =3D ...;<br>
&gt; =C2=A0 =C2=A0// member functions to mimic integral types<br>
&gt; };<br>
&gt; class Y {<br>
&gt; =C2=A0 using character_set_type =3D ...;<br>
&gt; =C2=A0 =C2=A0// member functions to mimic integral types<br>
&gt; };<br>
&gt;<br>
&gt; template&lt;typename T&gt;<br>
&gt; struct __char_set_t_helper {<br>
&gt; =C2=A0 =C2=A0using type =3D typename T::character_set_type;<br>
&gt; };<br>
&gt; template&lt;&gt;<br>
&gt; struct __char_set_t_helper&lt;char32_t&gt; {<br>
&gt; =C2=A0 =C2=A0using type =3D unicode_character_set;<br>
&gt; };<br>
&gt; template&lt;typename T&gt;<br>
&gt; using char_set_t =3D typename __char_set_t_helper&lt;T&gt;::type;<br>
&gt;<br>
&gt; My question is, what is gained by using char32_t as the character type=
<br>
&gt; for the Unicode encodings over a class?=C2=A0 The class that text_view=
<br>
&gt; currently provides is pretty much useless as is, but that is just<br>
&gt; because it is a work in progress.=C2=A0 What benefit is obtained by us=
ing a<br>
&gt; fundamental type here when fundamental types aren&#39;t available to b=
e used<br>
&gt; for other encodings?=C2=A0 It seems to me that use of distinct types f=
or code<br>
&gt; units vs characters would be beneficial for type safety reasons.<br>
&gt;<br></p>
<p dir="ltr">My current hypothesis is that we can use Unicode as the decoded form of every other encoding, and so get away with char32_t as the only decoded type. What&#39;s an example encoding that falsifies my hypothesis?</p>
<p dir="ltr">That said, I&#39;m sympathetic to statically catching mix-ups between the internal decoded form and the external form, even if the external form is UTF-32. On the third hand, 1) all external forms are bytes, since they could be in various endiannesses, and we wind up with UTF-8 being the only ambiguous internal encoding, as usual; and 2) the guideline I&#39;ve learned is that you generally don&#39;t want to decode to code points when handling international text, so even though we do need to provide that interface, we shouldn&#39;t aim for it to be the one people generally use.</p><p>Jeffrey</p>
</div>

--047d7b10c8bb6d9dd2052b5a25bc--

.

Author: Zhihao Yuan <zy@miator.net>
Date: Tue, 9 Feb 2016 12:17:53 -0600 Raw View

On Tue, Feb 9, 2016 at 1:55 AM, Thiago Macieira <thiago@macieira.org> wrote:
>> But ASCII is 7-bit, when decoding goes wrong, a library can immediately
>> spot this issue; Latin1 is 8-bit, library doesn't know what is wrong, and
>> this cause a large amount of bad encodings in many systems, many
>> applications, many...
>
> I either don't understand what you meant or you didn't get what I meant.
>
> If you simply discarded bits 8 through 31 of the char32_t, you might still end
> up with valid US-ASCII data that wasn't US-ASCII in the first place. Example:
>
>         char32_t str[] = { 0x161 };
>

"Decode" means translate a representation into code points.
Try this in Python:

>>> u'à'.encode('utf-8').decode('latin-1')
u'\xc3\xa0'

   ^^^ 2 characters, bad result

>>> u'à'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
0: ordinal not in range(128)

   ^^^ Error spotted.
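
The same contrast can be sketched in C++. This is only an illustration: `decode_latin1` and `decode_ascii` are hypothetical helper names, not part of text_view or the standard library. Because Latin-1 assigns a character to every byte value, decoding UTF-8 bytes as Latin-1 silently succeeds, whereas a strict ASCII decoder can reject the first byte above 0x7F.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: Latin-1 maps each byte 0x00..0xFF to the code point
// with the same value, so this decode can never fail, even on UTF-8 input.
std::u32string decode_latin1(std::string_view bytes) {
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: US-ASCII only defines 0x00..0x7F, so any byte with
// the high bit set is reported as an error (empty optional).
std::optional<std::u32string> decode_ascii(std::string_view bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        if (b > 0x7F)
            return std::nullopt;
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}
```

Feeding the two UTF-8 code units of 'à' (0xC3 0xA0) to `decode_latin1` yields two code points, while `decode_ascii` returns an empty optional.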

>
> Neither the C nor the C++ standard understand the nature of locales on source
> code. This was a particularly important complaint of mine about Unicode
> strings in C++11: the committee stopped just short of making them completely
> useful.
>
> You cannot write:
>         u"á"
>
> because you don't know how the compiler will interpret the bytes in your
> source file that make up that 'á' character.
>

If you have a compiler that doesn't support UTF-8 source
files, you can in theory write a wrapper that translates your
source files into Unicode escape sequences and feeds them to the
compiler on the fly.  Just saying.

>> It's not quite a limitation because practical applications
>> loads multilingual stuff from external sources.
>
> It is because quite a lot of them assume Unicode in wchar_t.
>
> Qt does.
>

Many projects are aware of __STDC_ISO_10646__.
For Qt, given that it uses its own character type
QChar, I suspect that QString's
(to|from)(WCharArray|StdWString) methods are rarely
requested on UNIX systems.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 19:34:43 +0000 Raw View

On 2/9/2016 1:04 PM, 'Jeffrey Yasskin' via ISO C++ Standard - Future
Proposals wrote:
> On Feb 8, 2016 8:35 PM, "Tom Honermann" <Thomas.Honermann@synopsys.com>
> wrote:
>  > My question is, what is gained by using char32_t as the character type
>  > for the Unicode encodings over a class?  The class that text_view
>  > currently provides is pretty much useless as is, but that is just
>  > because it is a work in progress.  What benefit is obtained by using a
>  > fundamental type here when fundamental types aren't available to be used
>  > for other encodings?  It seems to me that use of distinct types for code
>  > units vs characters would be beneficial for type safety reasons.
>  >
>
> My current hypothesis is that we can use Unicode as the decoded form of
> every other encoding, and so get away with char32_t as the only decoded
> type. What's an example encoding that falsifies my hypothesis?

Shift-JIS defines code points that do not round-trip through Unicode.

http://support.microsoft.com/kb/170559

Additionally, decoding all code points to Unicode would necessarily
involve transcoding between character sets.  I believe that would
seriously impact performance for non-Unicode encodings (when using the
code point iterator interface).

> That said, I'm sympathetic to statically catching mix-ups between the
> internal decoded form and the external form, even if the external form
> is UTF-32. On the third hand, 1) all external forms are bytes, since
> they could be in various endiannesses, and we wind up with UTF-8 being
> the only ambiguous internal encoding, as usual;

Yes, all externally encoded text is consumed as a byte sequence at some
level, but that isn't how we generally work with these encodings.
std::u16string stores a byte sequence, but we generally work with it as
a sequence of char16_t code units.  Endianness issues fall somewhat
below what we usually think of as the encoding level (Unicode
differentiates these as encoding schemes (byte oriented) as opposed to
encoding forms (code unit oriented)).

I think you meant that UTF-8 is the only unambiguous internal encoding.
While true, I'm not sure that is relevant.  Different encodings get
chosen as internal encodings for a host of reasons.

> and 2) the guideline
> I've learned is that you generally don't want to decode to code points
> when handling international text, so even though we do need to provide
> that interface, we shouldn't aim for it to be the one people generally use.

It depends on what one is trying to do.  Consider the Shift-JIS encoding
of 浬 (U+6D6C).  This is encoded as 0x8A 0x5C.  Shift-JIS is a multibyte
encoding that is almost ASCII compatible.  The code unit sequence 0x5C
encodes the ASCII '\' character.  But note that 0x5C appears as the
second byte of the encoding for U+6D6C.  If one were to naively split
such a string based on 0x5C code units, they would split a multibyte
code unit sequence.  Code point awareness is required to perform such a
split correctly.
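
The difference can be illustrated with a small sketch. It is not text_view code: `is_sjis_lead`, `split_naive` and `split_sjis` are hypothetical names, and the lead-byte ranges 0x81..0x9F and 0xE0..0xFC are an assumption about the Shift-JIS variant in use. The naive version treats every 0x5C byte as a separator; the sequence-aware version skips the trail byte that follows a lead byte.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Assumption: two-byte Shift-JIS sequences start with 0x81..0x9F or 0xE0..0xFC.
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive split: every 0x5C byte is treated as a separator.
std::vector<std::string> split_naive(std::string_view s) {
    std::vector<std::string> parts(1);
    for (char c : s) {
        if (c == '\x5C')
            parts.emplace_back();
        else
            parts.back().push_back(c);
    }
    return parts;
}

// Sequence-aware split: the byte after a lead byte is a trail byte and is
// copied through without being interpreted as a separator.
std::vector<std::string> split_sjis(std::string_view s) {
    std::vector<std::string> parts(1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b) && i + 1 < s.size()) {
            parts.back().append(s.substr(i, 2));  // lead byte + trail byte
            ++i;
        } else if (b == 0x5C) {
            parts.emplace_back();
        } else {
            parts.back().push_back(s[i]);
        }
    }
    return parts;
}
```

For the byte string `a`, 0x5C, `b`, 0x8A, 0x5C, `c`, the naive split produces three pieces and cuts the two-byte sequence in half, while the sequence-aware split produces two pieces and keeps 0x8A 0x5C together.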

Tom.

.

Author: Nicol Bolas <jmckesson@gmail.com>
Date: Tue, 9 Feb 2016 12:49:33 -0800 (PST) Raw View

On Tuesday, February 9, 2016 at 12:00:36 AM UTC-5, Tom Honermann wrote:
>
> On 2/8/2016 5:00 PM, Zhihao Yuan wrote:
> > On Mon, Feb 8, 2016 at 2:05 PM, Tom Honermann
> > <Thomas.H...@synopsys.com> wrote:
> >>
> >>> What I'm looking for is something can simplify further development of
> >>> Unicode-based collate processing.  Considering char32_t is already the
> >>> default type, I think I can live with that.
> >>
> >> I'd like to get a better understanding of your thoughts here.  I think
> >> you are suggesting that the value type of the text iterators should
> >> always be char32_t.  If so, I'm guessing that your view reflects one (or
> >> both) of these positions:
> >>
> >> 1) Values of type char32_t (are expected to) always semantically refer
> >> to the character with the corresponding code point value as defined by
> >> Unicode.  In other words, char32_t always caries an implied association
> >> with the Unicode character set.
> >>
> >
> > Yes.  What I'm expecting is portable code point values.
> > Unicode gives you portable code point values, and char32_t
> > (and char16_t in limited range) should be assumed to contain
> > Unicode code points.
>
> I don't disagree with this.  But I'll pose the same question I did in
> responding to Jeffrey: what is gained by using char32_t as the character
> type for the Unicode encodings over a class type?  Would a type that is
> distinct from the type used for code units not be beneficial for type
> safety reasons?
>

It allows `U"This is a Literal String"` to be a sequence of Unicode
codepoints via a literal string. You can't do that with a class type. Or at
least, not without also using a user-defined literal.

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Tue, 9 Feb 2016 22:05:02 +0000 Raw View

On 2/9/2016 2:09 AM, Thiago Macieira wrote:
> On Tuesday, 9 February 2016 06:29:51 PST Tom Honermann wrote:
>>> Make sure they work for streaming data (i.e., incomplete). One buffer
>>> chunk
>>> might end in the middle of a multi-byte sequence or in a shifted state.
>>> The
>>> state needs to be transferred when moving on to the next chunk.
>>
>> Support for this is not currently present, at least not in a way that
>> anyone would want to use.  In underflow scenarios, an exception is
>> thrown.
>
> That doesn't sound like an exceptional scenario to me, so I would advise
> against using exceptions for this.

Agreed.  I had already filed an issue to consider allowing the low level
encode() and decode() functions to return std::expected instead of
throwing, but I'm not sure I like that option either.  Perhaps
std::error_code would be more appropriate.  Suggestions and opinions
appreciated.
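
One shape a non-throwing interface could take is sketched below. This is not text_view's API: `decode_status`, `decode_result` and `decode_one_utf8` are hypothetical names, and a plain status enum stands in for std::expected or std::error_code. The point of the sketch is that underflow (the input ended mid-sequence) is reported separately from malformed input, so a caller can fetch more data instead of handling an exception.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical status codes for a single decode step.
enum class decode_status { ok, underflow, invalid };

struct decode_result {
    decode_status status;
    char32_t code_point;   // meaningful only when status == ok
    std::size_t consumed;  // code units consumed when status == ok
};

// Decodes one code point from the front of a UTF-8 code unit sequence.
decode_result decode_one_utf8(std::string_view in) {
    if (in.empty())
        return {decode_status::underflow, 0, 0};
    unsigned char b0 = static_cast<unsigned char>(in[0]);
    std::size_t len;
    char32_t cp;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return {decode_status::invalid, 0, 0};  // stray continuation or bad lead byte

    for (std::size_t i = 1; i < len; ++i) {
        if (i >= in.size())
            return {decode_status::underflow, 0, 0};  // need more input
        unsigned char b = static_cast<unsigned char>(in[i]);
        if ((b & 0xC0) != 0x80)
            return {decode_status::invalid, 0, 0};
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {decode_status::invalid, 0, 0};
    return {decode_status::ok, cp, len};
}
```

A caller would loop while the status is `ok`, stop and wait for more input on `underflow`, and apply whatever error policy it prefers on `invalid`.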

https://github.com/tahonermann/text_view/issues/7

>> The decode can then be restarted with an adjusted end iterator
>> (likely implying invalidation of prior iterators), but for input
>> iterators, code units will have been lost.  I have given this a little
>> thought.  Since the interface is all based on iterators, one possibility
>> is that buffering can be provided by iterator adapters operating over
>> the streaming code unit sequence.  This wouldn't work in situations
>> where callers need control to return in order to fetch additional data
>> though.
>
> I would say that you should make the iterator stateful.

Iterators are already stateful for encodings that require tracking state
to decode code unit sequences (shift state, etc...).  I presume you are
suggesting that additional state be added to store partially read code
unit sequences.  More on this below...

> The next problem will be to make sure that the iterator can compare to the
> "end" iterator.

As long as it is ok for iterator advancement to block (and I'm not sure
how it could not be), then the current implementation addresses this.
Text_view's iterators are greedy, like istream_iterator.  Non-default
construction immediately decodes and advances the underlying code unit
iterator.  This is specifically done so that trailing non-code point
encoding code units are consumed so that matches against the end
iterator can be performed.

>> There is a trade off between supporting suspending decode in the middle
>> of a code unit sequence and supporting input iterators; if iteration of
>> the code unit sequence isn't restartable, then additional state must be
>> maintained and this will bloat the size of code point iterators and
>> affect their performance.
>
> True. Streaming is probably more important.
>
> Use-case: decoding a large file. You don't want to load the entire file to
> memory, but instead read it in chunks.

Does ifstream not already accommodate that use case?

std::ifstream ifs = ...;
std::istreambuf_iterator<char> in{ifs};
std::istreambuf_iterator<char> end;
auto tv = make_text_view<utf8_encoding>(in, end);

>> Additionally, suspending decode in the middle of a code unit sequence
>> doesn't work for iterators - a dereferenced iterator needs to either
>> provide a code point or throw an exception.
>
> Agreed. You shouldn't try to decode that iterator. You need to go out of the
> loop, without an exception, to get more data.

Are you envisioning a scenario like this:

using encoding = utf8_encoding;
auto state = encoding::initial_state();
std::string b;  // declared outside the loop so the loop condition can see it
do {
   b = block_and_get_more_data();
   auto tv = make_text_view<utf8_encoding>(state, begin(b), end(b));
   auto tv_it = begin(tv);
   while (tv_it != end(tv))
     ...;
   state = tv_it;  // Trailing state is in end iterator, preserve it
                   // to seed state for the next iteration.
} while(!b.empty());

This works with the current implementation so long as the buffers fall
on a code unit sequence boundary.  Preserving byte state would allow
this to work in general.  I think this is achievable with a rather
modest increase to the size of the saved state (probably just 32 bits
for most encodings).
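
To make the idea of preserving byte state concrete, here is a minimal sketch of a resumable UTF-8 decoder. It is independent of text_view: `utf8_state` and `decode_chunk` are hypothetical names, and validation of continuation bytes, overlong forms and surrogates is deliberately omitted to keep the state handling visible.

```cpp
#include <string>
#include <string_view>

// State carried between chunks: the partially assembled code point and the
// number of continuation bytes still expected.
struct utf8_state {
    char32_t partial = 0;
    unsigned remaining = 0;
};

// Appends every code point completed within `chunk` to `out`. An incomplete
// trailing sequence stays in `st` and is finished by the next call.
// Assumption: input is well-formed apart from being split at arbitrary bytes.
void decode_chunk(utf8_state& st, std::string_view chunk, std::u32string& out) {
    for (unsigned char b : chunk) {
        if (st.remaining > 0) {
            st.partial = (st.partial << 6) | (b & 0x3F);
            if (--st.remaining == 0)
                out.push_back(st.partial);
        } else if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0) {
            st.partial = b & 0x1F; st.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st.partial = b & 0x0F; st.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st.partial = b & 0x07; st.remaining = 3;
        } else {
            out.push_back(0xFFFD);  // stray continuation or invalid lead byte
        }
    }
}
```

Calling `decode_chunk` first with `J`, 0xC3 and then with 0xB8, `r`, `g` yields the four code points of "Jørg", even though the chunk boundary falls in the middle of the two-byte sequence.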

Tom.

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 09 Feb 2016 15:37:40 -0800 Raw View

On Tuesday, 9 February 2016 12:17:53 PST Zhihao Yuan wrote:
> "Decode" means translate a representation into code points.
>
> Try this in Python:
> >>> u'à'.encode('utf-8').decode('latin-1')
>
> u'\xc3\xa0'
>
>    ^^^ 2 characters, bad result
>
> >>> u'à'.encode('utf-8').decode('ascii')
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 0: ordinal not in range(128)
>
>    ^^^ Error spotted.

I don't see how that is relevant. You started with a mistake, so you get
nonsense output (GIGO).

I'm not against US-ASCII. I'm saying that Latin1 is useful because it's
simple.

> > Neither the C nor the C++ standard understand the nature of locales on
> > source code. This was a particularly important complaint of mine about
> > Unicode strings in C++11: the committee stopped just short of making them
> > completely useful.
> >
> > You cannot write:
> >         u"á"
> >
> > because you don't know how the compiler will interpret the bytes in your
> > source file that make up that 'á' character.
>
> If you have a compiler which doesn't support UTF-8 source
> files, theoretically you can write a wrapper to translate your
> source files into Unicode escape sequences and feed to the
> compiler on the fly.  Just saying.

Yeah, no thanks. It would have been better if the standard fixed the issue, so
I could share files with my colleagues.

For Qt, we declared that source code is UTF-8. But since the literals don't
compile with Visual Studio, we cannot use non-escaped text.

> >> It's not quite a limitation because practical applications
> >> loads multilingual stuff from external sources.
> >
> > It is because quite a lot of them assume Unicode in wchar_t.
> >
> > Qt does.
>
> Many projects are aware of __STDC_ISO_10646__.
> For Qt, given that it uses its own character type
> QChar, I suspect that QString's
> (to|from)(WCharArray|StdWString) methods are rarely
> requested on UNIX systems.

Because most people don't use wchar_t outside of Windows. On Windows, QString
and wchar_t have the same raw data format, so you can just do a
reinterpret_cast.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Tue, 09 Feb 2016 15:55:06 -0800 Raw View

On Tuesday, 9 February 2016 22:05:02 PST Tom Honermann wrote:
> > The next problem will be to make sure that the iterator can compare to the
> > "end" iterator.
>
> As long as it is ok for iterator advancement to block (and I'm not sure
> how it could not be), then the current implementation addresses this.
> Text_view's iterators are greedy, like istream_iterator.  Non-default
> construction immediately decodes and advances the underlying code unit
> iterator.  This is specifically done so that trailing non-code point
> encoding code units are consumed so that matches against the end
> iterator can be performed.

Blocking is never a good idea. And I don't see how you could do it. What is it
going to block on?

When I implemented QStringIterator, which is similar to what you're trying to
do, I chose to use Java-style iterators instead of C++ Standard Library ones.
The simple difference? The iterator knows the end.

So you can write:

 while (it.hasNext()) {
  char32_t c = it.next();
  use(c);
 }

The error handling in the loop above is that next() silently replaces an
invalid sequence with the replacement character (U+FFFD). I suppose you could
use std::expected for your code.

Calling next() when hasNext() has returned false is undefined behaviour.

Note that QStringIterator, as the name says, operates on a QString, so there's
no "more data". The decoder is stateless.
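
A Java-style cursor of this kind can be sketched in a few lines of standard C++. This is not QStringIterator: `utf16_cursor`, `has_next` and `next` are hypothetical names chosen for the illustration, and the class works on a std::u16string_view rather than a QString. Unpaired surrogates are replaced with U+FFFD, following the policy described above.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical Java-style cursor over UTF-16 code units. It knows where the
// data ends, so no separate end iterator is needed.
class utf16_cursor {
public:
    explicit utf16_cursor(std::u16string_view s) : s_(s) {}

    bool has_next() const { return pos_ < s_.size(); }

    // Precondition: has_next() is true.
    char32_t next() {
        char16_t u = s_[pos_++];
        if (u >= 0xD800 && u <= 0xDBFF) {  // high surrogate
            if (pos_ < s_.size() && s_[pos_] >= 0xDC00 && s_[pos_] <= 0xDFFF) {
                char16_t lo = s_[pos_++];
                return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                               + (char32_t(lo) - 0xDC00);
            }
            return 0xFFFD;                 // unpaired high surrogate
        }
        if (u >= 0xDC00 && u <= 0xDFFF)
            return 0xFFFD;                 // unpaired low surrogate
        return u;
    }

private:
    std::u16string_view s_;
    std::size_t pos_ = 0;
};
```

Usage mirrors the loop above: `utf16_cursor it{text}; while (it.has_next()) use(it.next());`. The viewed string must outlive the cursor.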

> >> There is a trade off between supporting suspending decode in the middle
> >> of a code unit sequence and supporting input iterators; if iteration of
> >> the code unit sequence isn't restartable, then additional state must be
> >> maintained and this will bloat the size of code point iterators and
> >> affect their performance.
> >
> > True. Streaming is probably more important.
> >
> > Use-case: decoding a large file. You don't want to load the entire file to
> > memory, but instead read it in chunks.
>
> Does ifstream not already accommodate that use case?

ifstream is never a good reference for me. The only good thing about iostreams
for me are cout and cerr. cin, fstream, stringstream, etc., are overkill and
complex, so they never enter my projects.

Anyway, I was thinking of a much lower-level operation, such as when you're
given a chunk of data from some third-party API and you need to decode.
Blocking isn't possible because you need to go back to the event loop to get
more data. And yet you need to retain the state.

> > Agreed. You shouldn't try to decode that iterator. You need to go out of
> > the loop, without an exception, to get more data.
>
> Are you envisioning a scenario like this:
>
> using encoding = utf8_encoding;
> auto state = encoding::initial_state();
> std::string b;  // declared outside the loop so the loop condition can see it
> do {
>    b = block_and_get_more_data();
>    auto tv = make_text_view<utf8_encoding>(state, begin(b), end(b));
>    auto tv_it = begin(tv);
>    while (tv_it != end(tv))
>      ...;
>    state = tv_it;  // Trailing state is in end iterator, preserve it
>                    // to seed state for the next iteration.
> } while(!b.empty());

Something like that, only more complex in a real application.

> This works with the current implementation so long as the buffers fall
> on a code unit sequence boundary.  Preserving byte state would allow
> this to work in general.  I think this is achievable with a rather
> modest increase to the size of the saved state (probably just 32-bits
> for most encodings).

For UTF-8, the minimum would be 19 bits (15 bits of decoded data, plus 2 bits
to indicate how many more bytes are expected and 2 more bits to indicate how
many in total).
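
The arithmetic behind that figure: the longest incomplete sequence is a four-byte one with three bytes already read, which contributes 3 + 6 + 6 = 15 payload bits, and 15 + 2 + 2 = 19. A possible layout is sketched below; `utf8_partial_state` and its field names are hypothetical, and since bit-field layout is implementation-defined the struct is only expected, not guaranteed, to fit in 32 bits.

```cpp
#include <cstdint>

// Hypothetical packed state for a UTF-8 decoder suspended mid-sequence.
struct utf8_partial_state {
    std::uint32_t bits      : 15;  // payload bits decoded so far (at most 3 + 6 + 6)
    std::uint32_t remaining : 2;   // continuation bytes still expected (0..3)
    std::uint32_t total     : 2;   // sequence length minus one (0..3)
};
```

Storing the total length as "length minus one" is what lets the values 1 through 4 fit in two bits.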

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Wed, 10 Feb 2016 07:29:33 +0000 Raw View

On 2/9/2016 4:34 AM, Mathias Gaunard wrote:
> On Mon, Feb 8, 2016 at 6:27 AM, Tom Honermann
> <Thomas.Honermann@synopsys.com <mailto:Thomas.Honermann@synopsys.com>>
> wrote:
>
>
>     Text_view avoids introducing another string type.  Instead, it provides
>     facilities for constructing a view over any range, view, or container
>     that holds a code unit sequence; the view associates an encoding with
>     the code unit sequence and provides iterators that decode the sequence
>     and produce code point values.  The value type of the iterator type is a
>     character type that associates the code point value with a character
>     set.
>
>
> I haven't had the time to go through your code yet, but I have two
> questions:
>   - how do you handle validation, do you validate on construction or
> trust the programmer? Can you take shortcuts in your conversion if you
> statically know the encoding?

Valiwhatnow?  :)

Little validation is currently implemented.  Exceptions are thrown for
lone surrogates and other invalid UTF code unit sequences as they are
decoded, but that is about it.  Exceptions are probably not the best way
to handle this.  At one point, I had a design in which code point
iterators had a template argument that named a callable object type.  I
ended up punting on that because dealing with propagating the callable
object was slowing me down.  All the callable object could really do is
throw an exception or substitute a different code point.  It might be
worth revisiting this design; the issues are similar to issues that
arise for (stateful) allocators.

I haven't implemented meaningful transcoding support yet.  Conversions
between encodings that use the same character set work though:

auto tv = make_text_view<utf32_encoding>(U"\u00F8");
std::ostreambuf_iterator<char> utf8_cu_it{std::cout};
auto utf8_cp_it = make_otext_iterator<utf8_encoding>(utf8_cu_it);
std::copy(begin(tv), end(tv), utf8_cp_it);

The above manner of transcoding would be the slow way.  I do envision
adding a transcode() function that could dispatch to optimized functions
for transcoding between specific encodings.  I haven't started on such
work yet though.
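
For comparison, a direct UTF-32 to UTF-8 conversion without iterator adapters might look like the sketch below. It only illustrates the kind of specialised routine a transcode() function could dispatch to; `append_utf8` and `utf32_to_utf8` are hypothetical names, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Converts a whole UTF-32 string to UTF-8.
std::string utf32_to_utf8(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in)
        append_utf8(out, cp);
    return out;
}
```

With this sketch, `utf32_to_utf8(U"\u00F8")` produces the two code units 0xC3 0xB8, matching the example from the overview.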

>   - have you considered expanding this to normalization forms?

It is on my todo list, but I haven't done any design work yet.  I have
had some initial naive thoughts towards providing normalization
iterators that wrap code point iterators.  I'm not sure how viable this
approach really is though; especially if there would be a requirement to
support suspending decomposition in order to support streaming of code
point sequences similar to what Thiago has suggested for decoding.

>   - how can you have a single "utf-8 text view" type that covers both
> text that was originally utf-8 and text that was converted?

I'm not clear on what you are asking here.  The basic_text_view template
is parameterized on the encoding type and a view type (a non-owning
range type; note that some of the terms used in the code are outdated as
language has evolved with the ranges TS; I need to fix that).  A
basic_text_view created by make_text_view() will use a view type
currently named 'bounded_iterable' [1].  That is parameterized on the
iterator and sentinel pair that define the underlying code unit range.

So, a single basic_text_view specialization for UTF-8 will cover any
UTF-8 code unit sequence that is accessible with compatible iterators.

Tom.

[1]:
https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/bounded_iterable.hpp

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Wed, 10 Feb 2016 07:46:04 +0000 Raw View

On 2/9/2016 3:49 PM, Nicol Bolas wrote:
>     I don't disagree with this.  But I'll pose the same question I did in
>     responding to Jeffrey: what is gained by using char32_t as the
>     character
>     type for the Unicode encodings over a class type?  Would a type that is
>     distinct from the type used for code units not be beneficial for type
>     safety reasons?
>
>
> It allows `U"This is a Literal String"` to be a sequence of Unicode
> codepoints via a literal string. You can't do that with a class type. Or
> at least, not without also using a user-defined literal.

By definition, UTF-32 is a sequence of Unicode code points, so we have
that today.  Perhaps you are suggesting that a string literal should
suffice by itself to satisfy a TextView like concept?  I had originally
hoped to be able to support that, but found it not to be feasible to do
so, and support a generic interface for arbitrary encodings.  The
sticking points are mostly related to tracking encoding state (note that
state will be present for any encoding that has code unit sequences
extending across multiple bytes if Thiago's request to support stream
buffered data is implemented).
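
Nicol's remark about user-defined literals can be sketched as follows. This is purely illustrative and not part of text_view: `code_point` and the `_cps` suffix are hypothetical. The literal operator turns a `U"..."` literal into a sequence of a distinct class type, so decoded characters can no longer be mixed up with raw char32_t code units by accident.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical strong type for a decoded Unicode code point.
struct code_point {
    char32_t value;
    friend bool operator==(code_point a, code_point b) { return a.value == b.value; }
};

// User-defined literal: U"..."_cps yields code_point objects instead of
// raw char32_t code units.
std::vector<code_point> operator""_cps(const char32_t* s, std::size_t n) {
    std::vector<code_point> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(code_point{s[i]});
    return out;
}
```

For example, `auto text = U"J\u00F8"_cps;` produces two `code_point` values. This does not address the encoding state issue Tom raises; it only shows that a class type and literal syntax are not mutually exclusive.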

Tom.

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Wed, 10 Feb 2016 08:11:56 +0000 Raw View

On 2/9/2016 6:55 PM, Thiago Macieira wrote:
> On ter=C3=A7a-feira, 9 de fevereiro de 2016 22:05:02 PST Tom Honermann wr=
ote:
>>> The next problem will be to make sure that the iterator can compare to =
the
>>> "end" iterator.
>>
>> As long as it is ok for iterator advancement to block (and I'm not sure
>> how it could not be), then the current implementation addresses this.
>> Text_view's iterators are greedy, like istream_iterator.  Non-default
>> construction immediately decodes and advances the underlying code unit
>> iterator.  This is specifically done so that trailing non-code point
>> encoding code units are consumed so that matches against the end
>> iterator can be performed.
>
> Blocking is never a good idea. And I don't see how you could do it. What =
is it
> going to block on.

I agree blocking is never a good idea.  I was assuming blocking by an=20
underlying code unit iterator that is awaiting data.  This was in the=20
context of underflow encountered in the midst of advancing the iterator=20
since, once at that point, blocking until a value is available to return=20
or throwing is the only option.  I understand now this can be avoided by=20
having the iterator compare equally with the end iterator if sufficient=20
data to decode the next code point is not available (implying buffering=20
in the state object the iterator holds of course).

> When I implemented QStringIterator, which is similar to what you're trying to
> do, I chose to use Java-style iterators instead of C++ Standard Library ones.
> The simple difference? The iterator knows the end.
>
> So you can write:
>
>  while (it.hasNext()) {
>   char32_t c = it.next();
>   use(c);
>  }

itext_iterator holds a reference to the underlying code unit range, so
it also knows the end.

> The error checking above is that next() silently replaces an invalid decoding
> with the replacement character (U+FFFD). I suppose you could use std::expected
> for your code.

As I mentioned in my reply to Mathias, at one point I was working on a
design that would allow this behavior to be configurable.  It still
isn't clear to me how beneficial that flexibility would be.
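
For reference, a Java-style iterator with silent U+FFFD substitution might look roughly like the following.  This is only a sketch under assumptions: the class name and members are made up for illustration, it works on UTF-16 held in a std::u16string_view, and it is neither QStringIterator's nor text_view's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical Java-style code point iterator over UTF-16 text.
class utf16_code_point_iterator {
public:
    explicit utf16_code_point_iterator(std::u16string_view text)
        : text_(text) {}

    // True while at least one code unit remains.
    bool has_next() const { return pos_ < text_.size(); }

    // Decodes one code point; a lone surrogate yields U+FFFD.
    // Precondition: has_next() is true.
    char32_t next() {
        assert(has_next());
        const char16_t first = text_[pos_++];
        if (first < 0xD800 || first > 0xDFFF)
            return first;                          // not a surrogate
        if (first <= 0xDBFF && pos_ < text_.size()) {
            const char16_t second = text_[pos_];
            if (second >= 0xDC00 && second <= 0xDFFF) {
                ++pos_;                            // consume the low surrogate
                return 0x10000 + ((char32_t(first) - 0xD800) << 10)
                               + (char32_t(second) - 0xDC00);
            }
        }
        return 0xFFFD;                             // lone surrogate
    }

private:
    std::u16string_view text_;
    std::size_t pos_ = 0;
};
```

Because the iterator carries its own end position, the caller's loop is just `while (it.has_next()) use(it.next());` and no separate end iterator is needed.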

> Calling next() when hasNext() has returned false is undefined behaviour.

As it should be :)

> Note that QStringIterator, as the name says, operates on a QString, so there's
> no "more data". The decoder is stateless.
>
>>>> There is a trade off between supporting suspending decode in the middle
>>>> of a code unit sequence and supporting input iterators; if iteration of
>>>> the code unit sequence isn't restartable, then additional state must be
>>>> maintained and this will bloat the size of code point iterators and
>>>> affect their performance.
>>>
>>> True. Streaming is probably more important.
>>>
>>> Use-case: decoding a large file. You don't want to load the entire file to
>>> memory, but instead read it in chunks.
>>
>> Does ifstream not already accommodate that use case?
>
> ifstream is never a good reference for me. The only good thing about iostreams
> for me are cout and cerr. cin, fstream, stringstream, etc., are overkills and
> complex, so they never enter my projects.

Fair enough, but I think the point stands that this kind of scenario can
be addressed by a lower level buffered iterator abstraction.

> Anyway, I was thinking of a much lower-level operation, such as when you're
> given a chunk of data from some third-party API and you need to decode.
> Blocking isn't possible because you need to go back to the event loop to get
> more data. And yet you need to retain the state.

Understood.

>>> Agreed. You shouldn't try to decode that iterator. You need to go out of
>>> the loop, without an exception, to get more data.
>>
>> Are you envisioning a scenario like this:
>>
>> using encoding = utf8_encoding;
>> auto state = encoding::initial_state();
>> do {
>>     std::string b = block_and_get_more_data();
>>     auto tv = make_text_view<utf8_encoding>(state, begin(b), end(b));
>>     auto tv_it = begin(tv);
>>     while (tv_it != end(tv))
>>       ...;
>>     state = tv_it;  // Trailing state is in end iterator, preserve it
>>                     // to seed state for the next iteration.
>> } while(!b.empty());
>
> Something like that, only more complex in a real application.

Of course.

The only concern that I have about the above is that it leaves open the
possibility for trailing code units (e.g., garbage at the end of the
encoded text) to go unnoticed.  In a non-buffering scenario, an iterator
might silently compare equal to end even though there are code units
remaining.  The developer might care about these, or they might not.
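
To make the concern concrete, here is a minimal sketch of a chunk-at-a-time UTF-8 decoder that keeps its state between calls and can be asked at the very end whether code units were left over.  Everything here is an assumption for illustration: the type and member names are hypothetical, nothing is taken from text_view, and overlong forms and encoded surrogates are not rejected in order to keep it short.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical incremental UTF-8 decoder (illustrative names only).
struct utf8_chunk_decoder {
    char32_t partial = 0;   // bits collected for the pending code point
    int      needed  = 0;   // continuation bytes still expected

    // Appends every code point completed by `chunk` to `out`.  A sequence
    // cut off by the end of the chunk stays in the state, unreported.
    void feed(std::string_view chunk, std::u32string& out) {
        for (unsigned char byte : chunk) {
            if (needed > 0) {
                if ((byte & 0xC0) == 0x80) {
                    partial = (partial << 6) | (byte & 0x3F);
                    if (--needed == 0) out.push_back(partial);
                    continue;
                }
                out.push_back(0xFFFD);   // pending sequence was interrupted
                needed = 0;              // re-examine this byte as a lead byte
            }
            if (byte < 0x80)                { out.push_back(byte); }
            else if ((byte & 0xE0) == 0xC0) { partial = byte & 0x1F; needed = 1; }
            else if ((byte & 0xF0) == 0xE0) { partial = byte & 0x0F; needed = 2; }
            else if ((byte & 0xF8) == 0xF0) { partial = byte & 0x07; needed = 3; }
            else                            { out.push_back(0xFFFD); }
        }
    }

    // Call once after the last chunk: false means trailing code units
    // never formed a complete code point.
    bool finished_cleanly() const { return needed == 0; }
};
```

With something of this shape the caller returns to the event loop between calls to feed(), and an explicit finished_cleanly() check after the final chunk is what keeps trailing garbage from going unnoticed.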

Tom.


.

Author: mats.taraldsvik@gmail.com
Date: Wed, 10 Feb 2016 05:28:02 -0800 (PST) Raw View

Thanks for working on this!

My use case is to transcode from a legacy encoding/character set (equal to
latin-6 except for two characters), and output to Unicode (mostly UTF-8 or
UTF-16) as a std::string or std::wstring.

Is this a supported use case with the text_view library?

It could ease migration from old encodings/character sets for legacy
applications.

Mats

On Monday, 8 February 2016 07:27:12 UTC+1, Tom Honermann wrote:
>
> I am planning to submit a paper for the Jacksonville mailing (submission
> deadline this Friday!) discussing a library I've been developing that
> provides code point enumeration support for modern and legacy character
> encodings.  I will be attending the Jacksonville meeting and hope to
> present the paper there.  The intent of this email is to request some
> early feedback to help guide writing the paper and to prepare myself to
> address concerns raised.
>
> The library is named text_view and is available at
> https://github.com/tahonermann/text_view
>
> The readme file found there provides a short overview, feature
> description, terminology description, list of supported character
> encodings, and a specification of the interface.  The readme file is
> still rough and lacking in prose to describe many of the classes.  I
> plan to improve it soon, but am hopeful that it suffices to at least
> provide a sense of the library and how to use it.  Contributions welcome!
>
> Text_view avoids introducing another string type.  Instead, it provides
> facilities for constructing a view over any range, view, or container
> that holds a code unit sequence; the view associates an encoding with
> the code unit sequence and provides iterators that decode the sequence
> and produce code point values.  The value type of the iterator type is a
> character type that associates the code point value with a character set.
>
> An example taken from the overview follows.  Note that \u00F8 (LATIN
> SMALL LETTER O WITH STROKE) is encoded as UTF-8 using two code units
> (\xC3\xB8), but iterator based enumeration sees just the single code point.
>
> using CT = utf8_encoding::character_type;
> auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
> auto it = tv.begin();
> assert(*it++ == CT{0x004A}); // 'J'
> assert(*it++ == CT{0x00F8}); // 'ø'
> assert(*it++ == CT{0x0065}); // 'e'
>
> Please see the readme file at [1] for more examples and details.
>
> I see this library as a very small, but fundamental step towards
> improving support for Unicode within the standard library.  Thank you
> for any feedback!
>
> Tom.
>
> [1]: Text_view: A C++ Concepts based character encoding and code point
>       enumeration library
>       https://github.com/tahonermann/text_view

.

Author: Thiago Macieira <thiago@macieira.org>
Date: Wed, 10 Feb 2016 10:35:31 -0800 Raw View

On Wednesday, 10 February 2016 07:46:04 PST Tom Honermann wrote:
> > It allows `U"This is a Literal String"` to be a sequence of Unicode
> > codepoints via a literal string. You can't do that with a class type. Or
> > at least, not without also using a user-defined literal.
>
> By definition, UTF-32 is a sequence of Unicode code points, so we have
> that today.  Perhaps you are suggesting that a string literal should
> suffice by itself to satisfy a TextView like concept?

He's saying that char32_t has the advantage of compiler support, whereas a
different type wouldn't.

Take QChar, for a concrete example. It's UTF-16, so it's technically
equivalent to char16_t. But u"This is a string" produces a char16_t[] literal,
not a QChar[] one.

And I don't see how to create a compile-time static for a QChar literal
without those UDLs with template strings, which were removed from C++11, have
never come back, and are declared to be the most interesting feature of UDLs.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center


.

Author: Thiago Macieira <thiago@macieira.org>
Date: Wed, 10 Feb 2016 10:46:47 -0800 Raw View

On Wednesday, 10 February 2016 08:11:56 PST Tom Honermann wrote:
> > The error checking above is that next() silently replaces an invalid
> > decoding with the replacement character (U+FFFD). I suppose you could use
> > std::expected for your code.
>
> As I mentioned in my reply to Mathias, at one point I was working on a
> design that would allow this behavior to be configurable.  It still
> isn't clear to me how beneficial that flexibility would be.

Well, for those of us who will not touch exceptions with a 6-foot pole, your
code changes from "useless academic toy" to "useful framework".

Using std::expected lets you have the best of all worlds:

- want to have exceptions? Just use the value and it will throw if it's in the
wrong state
- want to have a silent replacement? .value_or(0xfffd)
- want to check errors without exceptions? there's a function
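
As a rough illustration of those three uses, the sketch below uses a small stand-in class that models only the part of the std::expected interface mentioned here (value(), value_or(), has_value()); a real design would presumably use std::expected itself.  The class, the error enumeration and the decode function are all hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <stdexcept>

enum class decode_error { none, invalid_sequence };

// Minimal stand-in for an expected-like result (illustrative only).
class decode_result {
public:
    static decode_result success(char32_t cp) { return {cp, decode_error::none}; }
    static decode_result failure(decode_error e) { return {0, e}; }

    // Error check without exceptions.
    bool has_value() const { return error_ == decode_error::none; }
    decode_error error() const { return error_; }

    // Exception-based use: throws when no code point is held.
    char32_t value() const {
        if (!has_value()) throw std::runtime_error("decode error");
        return value_;
    }

    // Silent replacement.
    char32_t value_or(char32_t fallback) const {
        return has_value() ? value_ : fallback;
    }

private:
    decode_result(char32_t v, decode_error e) : value_(v), error_(e) {}
    char32_t value_;
    decode_error error_;
};

// Hypothetical decoder: combines a UTF-16 surrogate pair into a code point.
inline decode_result decode_surrogate_pair(char16_t high, char16_t low) {
    const bool high_ok = high >= 0xD800 && high <= 0xDBFF;
    const bool low_ok  = low  >= 0xDC00 && low  <= 0xDFFF;
    if (!high_ok || !low_ok)
        return decode_result::failure(decode_error::invalid_sequence);
    return decode_result::success(
        0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00));
}
```

The caller picks the behaviour at the call site: `r.value()` for exceptions, `r.value_or(0xFFFD)` for silent replacement, or `if (!r.has_value())` for explicit checking.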

> > ifstream is never a good reference for me. The only good thing about
> > iostreams for me are cout and cerr. cin, fstream, stringstream, etc., are
> > overkills and complex, so they never enter my projects.
>
> Fair enough, but I think the point stands that this kind of scenario can
> be addressed by a lower level buffered iterator abstraction.

That would be buffering over my buffer. That's more memory allocated and
possibly introducing delays like networking bufferbloat (see
<https://en.wikipedia.org/wiki/Bufferbloat>)

> The only concern that I have about the above is that it leaves open the
> possibility for trailing code units (e.g., garbage at the end of the
> encoded text) to go unnoticed.  In a non-buffering scenario, an iterator
> might silently compare to end even though there are code units
> remaining.  The developer might care about these, or they might not.

Indeed. That reminds me of the qstring.cpp function convertCase.  Before
looping over the actual data, it does:

    // this avoids out of bounds check in the loop
    while (e != p && e[-1].isHighSurrogate())
        --e;

This probably means that iterators are the wrong tool for this job, at least
the way that the Standard Library understands iterators to be.
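
A standalone version of that trimming step, written against std::u16string_view rather than QChar, could look like this.  It is a sketch of the same idea and not the Qt code; the function name is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the length of the longest prefix of `text` that does not end in
// a high (leading) surrogate, so a decoding loop over that prefix never has
// to look past the end for a missing low surrogate.
inline std::size_t complete_prefix_length(std::u16string_view text) {
    std::size_t end = text.size();
    while (end != 0 && text[end - 1] >= 0xD800 && text[end - 1] <= 0xDBFF)
        --end;
    return end;
}
```

Whatever is cut off by this step still has to be reported or carried over by the caller, which is the bookkeeping an iterator interface tends to hide.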

We've already talked about how they are stateful and know about the end
position. Now we need to be sure that they are properly disposed of, by
checking the saved state after the last character.

Please also add to your calculations the fact that you'll need to have ucnv or
iconv_t objects behind the scenes. There's a whole side of object management.

This is really a class with an invariant to be protected, not an iterator.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center


.

Author: Sean Middleditch <sean.middleditch@gmail.com>
Date: Fri, 12 Feb 2016 14:05:36 -0800 (PST) Raw View

On Tuesday, February 9, 2016 at 11:29:37 PM UTC-8, Tom Honermann wrote:
>
> On 2/9/2016 4:34 AM, Mathias Gaunard wrote:
> > On Mon, Feb 8, 2016 at 6:27 AM, Tom Honermann
> > <Thomas.H...@synopsys.com> wrote:
> >
> >     Text_view avoids introducing another string type.  Instead, it provides
> >     facilities for constructing a view over any range, view, or container
> >     that holds a code unit sequence; the view associates an encoding with
> >     the code unit sequence and provides iterators that decode the sequence
> >     and produce code point values.  The value type of the iterator type is a
> >     character type that associates the code point value with a character
> >     set.
> >
> > I haven't had the time to go through your code yet, but I have two
> > questions:
> >   - how do you handle validation, do you validate on construction or
> > trust the programmer? Can you take shortcuts in your conversion if you
> > statically know the encoding?
>
> Valiwhatnow?  :)
>
> Little validation is currently implemented.  Exceptions are thrown for
> lone surrogates and other invalid UTF code unit sequences as they are
> decoded, but that is about it.  Exceptions are probably not the best way

Absolutely be sure to add error_code variants, at the very least.

Exceptions are disabled in a number of key C++-using industries as a matter
of course and should not be relied upon as the sole mechanism for
reporting error conditions, as you're pretty much guaranteeing that your
shiny new language facility will be useless text to some of the largest
industries using C++ (games, embedded, real-time, kernels, etc.).

You don't have to abandon exceptions, but just make sure they're not the
only way to report errors.
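
A sketch of what paired error_code and throwing variants could look like, in the style std::filesystem uses for its overloads.  The enumeration, category and function names are assumptions made up for this example and are not part of text_view.

```cpp
#include <cassert>
#include <string>
#include <system_error>

// Hypothetical error enumeration and category.
enum class decode_errc { invalid_code_unit_sequence = 1 };

class decode_category_impl final : public std::error_category {
public:
    const char* name() const noexcept override { return "decode"; }
    std::string message(int) const override {
        return "invalid code unit sequence";
    }
};

inline const std::error_category& decode_category() noexcept {
    static const decode_category_impl instance{};
    return instance;
}

// Non-throwing variant: reports failure through `ec` and returns U+FFFD.
inline char32_t combine_surrogates(char16_t high, char16_t low,
                                   std::error_code& ec) noexcept {
    const bool well_formed = high >= 0xD800 && high <= 0xDBFF &&
                             low  >= 0xDC00 && low  <= 0xDFFF;
    if (!well_formed) {
        ec.assign(static_cast<int>(decode_errc::invalid_code_unit_sequence),
                  decode_category());
        return 0xFFFD;
    }
    ec.clear();
    return 0x10000 + ((char32_t(high) - 0xD800) << 10)
                   + (char32_t(low) - 0xDC00);
}

// Throwing variant, layered on the non-throwing one.
inline char32_t combine_surrogates(char16_t high, char16_t low) {
    std::error_code ec;
    const char32_t cp = combine_surrogates(high, low, ec);
    if (ec) throw std::system_error(ec);
    return cp;
}
```

Code built with exceptions disabled would call only the three-argument overload; the two-argument one is a thin convenience wrapper.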

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Sat, 13 Feb 2016 04:40:14 +0000 Raw View

On 2/10/2016 8:28 AM, mats.taraldsvik@gmail.com wrote:
> Thanks for working on this!

You're welcome :)

> My use case is to transcode from a legacy encoding/character set (equal
> to latin-6 except for two characters), and output to unicode (mostly
> utf-8 or utf-16) as a std::string or std::wstring.
>
> Is this a supported use case with the text_view library?

It is intended to be, yes.  But not all the pieces are in place yet.  In
particular, interfaces for transcoding between character sets and
defining character maps have not been defined yet.

Once in place, transcoding from one encoding to another could be
accomplished with code like this:

std::string in = get_a_string_in_legacy_encoding();
std::string out;
std::back_insert_iterator<std::string> out_it{out};
auto tv_in = make_text_view<legacy_encoding>(in);
auto tv_out = make_otext_iterator<utf8_encoding>(out_it);
std::copy(tv_in.begin(), tv_in.end(), tv_out);

This works today if the two encodings use the same character set and
will work in general once the interfaces for transcoding between
character sets and defining character maps are in place.  However, use
of std::copy() doesn't provide for optimized transcoding, so I suspect
there will be (should be) additional interfaces to enable optimizations.
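
Until those interfaces exist, the character map part of the problem can be sketched by hand for a single-byte legacy encoding.  This is only an illustration under assumptions: the map below is the Latin-1 identity mapping with one made-up override standing in for the "two characters" that differ, and none of the function names come from text_view.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of `cp` (assumed to be a valid scalar value).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical character map: byte value -> Unicode code point.
inline std::array<char32_t, 256> make_legacy_map() {
    std::array<char32_t, 256> map{};
    for (std::size_t i = 0; i < map.size(); ++i)
        map[i] = static_cast<char32_t>(i);   // Latin-1 identity mapping
    map[0xA4] = 0x20AC;                      // made-up override: EURO SIGN
    return map;
}

// Transcodes by decoding each byte to a code point, then encoding as UTF-8.
inline std::string legacy_to_utf8(std::string_view in,
                                  const std::array<char32_t, 256>& map) {
    std::string out;
    for (unsigned char byte : in) append_utf8(out, map[byte]);
    return out;
}
```

The loop has the same decode-then-encode shape as the std::copy() call above, with a code point as the pivot between the two encodings.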

> It could ease migration from old encodings/character sets for legacy
> applications....

That is one of the goals.

Tom.


.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Sat, 13 Feb 2016 05:23:27 +0000 Raw View

On 2/10/2016 1:47 PM, Thiago Macieira wrote:
> On Wednesday, 10 February 2016 08:11:56 PST Tom Honermann wrote:
>>> The error checking above is that next() silently replaces an invalid
>>> decoding with the replacement character (U+FFFD). I suppose you could use
>>> std::expected for your code.
>>
>> As I mentioned in my reply to Mathias, at one point I was working on a
>> design that would allow this behavior to be configurable.  It still
>> isn't clear to me how beneficial that flexibility would be.
>
> Well, for those of us who will not touch exceptions with a 6-foot pole, your
> code changes from "useless academic toy" to "useful framework".

Understood.  I didn't intend for exceptions to be the only error
handling mechanism; I just haven't gotten around to doing something
about it yet.

> Using std::expected allows to have the best of all worlds:
>
> - want to have exceptions? Just use the value and it will throw if it's in the
> wrong state
> - want to have a silent replacement? .value_or(0xfffd)
> - want to check errors without exceptions? there's a function

Fitting std::expected into the low-level encoding and decoding
interfaces should be straightforward.  I don't see a way to utilize it
within the iterator interface though.  Using std::expected as the
value_type of an iterator would be ... interesting at best.

I've been thinking of supporting something like the following (perhaps
made more general, with member functions that receive the relevant
underlying code unit iterators and can return substitution characters or
throw custom exceptions).

struct my_text_view_policy {
   static const X on_underflow = save_state;      // or throw_exception
   static const X on_invalid = subst_replacement; // or throw_exception
};

std::string s = ...;
auto tv = make_text_view<some_encoding, my_text_view_policy>(s);
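
One way the X placeholder could be filled in, shown as a self-contained sketch.  The enumeration, the two policy structs and the decode function are hypothetical names for illustration only, and just the on_invalid half of the policy is modelled.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the placeholder type X above.
enum class error_action { substitute_replacement, throw_exception };

struct lenient_policy {
    static constexpr error_action on_invalid =
        error_action::substitute_replacement;
};

struct strict_policy {
    static constexpr error_action on_invalid = error_action::throw_exception;
};

// Decodes a single UTF-16 code unit that is expected to stand alone; the
// policy decides what happens when it is a surrogate.
template <class Policy>
char32_t decode_single_unit(char16_t unit) {
    const bool is_surrogate = unit >= 0xD800 && unit <= 0xDFFF;
    if (!is_surrogate)
        return unit;
    if (Policy::on_invalid == error_action::throw_exception)
        throw std::runtime_error("lone surrogate");
    return 0xFFFD;   // substitute the replacement character
}
```

Since the policy is a template parameter, the choice is fixed at compile time and the iterator's value_type stays a plain character type.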

Thoughts on this approach appreciated.

>>> ifstream is never a good reference for me. The only good thing about
>>> iostreams for me are cout and cerr. cin, fstream, stringstream, etc., are
>>> overkills and complex, so they never enter my projects.
>>
>> Fair enough, but I think the point stands that this kind of scenario can
>> be addressed by a lower level buffered iterator abstraction.
>
> That would be buffering over my buffer. That's more memory allocated and
> possibly introducing delays like networking bufferbloat (see
> <https://en.wikipedia.org/wiki/Bufferbloat>)

The solution I'm envisioning wouldn't entail double buffering.  I'm just
suggesting that the buffer management be left to the underlying code
unit iterators.  I've implemented a solution in the past that used a
sliding window buffer.

>> The only concern that I have about the above is that it leaves open the
>> possibility for trailing code units (e.g., garbage at the end of the
>> encoded text) to go unnoticed.  In a non-buffering scenario, an iterator
>> might silently compare to end even though there are code units
>> remaining.  The developer might care about these, or they might not.
>
> Indeed. That reminds me of the qstring.cpp function convertCase.  Before
> looping over the actual data, it does:
>
>      // this avoids out of bounds check in the loop
>      while (e != p && e[-1].isHighSurrogate())
>          --e;
>
> This probably means that iterators are the wrong tool for this job, at least
> the way that the Standard Library understands iterators to be.

That may be, though I think the iterator approach will remain useful for
many use cases.  For other cases, coding to the lower level interfaces
is an option.

> We've already talked about how they are stateful and know about the end
> position. Now we need to be sure that they are properly disposed of, by
> checking the saved state after the last character.
>
> Please also add to your calculations the fact that you'll need to have ucnv or
> iconv_t objects behind the scenes. There's a whole side of object management.
>
> This is really a class with an invariant to be protected, not an iterator.

Assuming implementation is provided by either ICU or iconv, yes.
Iterators are allowed to reference the range object from which they were
created, so management can be consolidated there. For views, this would
imply additional shared resource management, perhaps managed via
std::shared_ptr.

Tom.


.

Author: w.j.evers@versatel.nl
Date: Sat, 13 Feb 2016 11:40:16 -0800 (PST) Raw View

On Wednesday, 10 February 2016 19:46:56 UTC+1 Thiago Macieira wrote:

> Well, for those of us who will not touch exceptions with a 6-foot pole, your
> code changes from "useless academic toy" to "useful framework".

That sounds traumatized.  What's the problem?

There's nothing academic about using exceptions; I've used them
successfully in real-world applications for years.  Exceptions have been
in the language since somewhere before 1998. If your framework still can't
handle them, then I suggest you consider that to be a deficiency of that
framework. Criticizing a proposal for using exceptions is just backwards.

Wil

.

Author: Tom Honermann <Thomas.Honermann@synopsys.com>
Date: Sat, 13 Feb 2016 21:01:16 +0000 Raw View

On 2/13/2016 2:40 PM, w.j.evers@versatel.nl wrote:
>
>
> On Wednesday, 10 February 2016 at 19:46:56 UTC+1, Thiago Macieira wrote:
>
>     Well, for those of us who will not touch exceptions with a 6-foot
>     pole, your code changes from "useless academic toy" to "useful
>     framework".
>
>
> That sounds traumatized.  What's the problem?
>
> There's nothing academic about using exceptions; I've used them
> successfully in real-world applications for years.   Exceptions have
> been in the language since somewhere before 1998. If your framework
> still can't handle them, then I suggest you consider that to be a
> deficiency of that framework. Criticizing a proposal for using
> exceptions is just backwards.

Let's please not have this discussion as part of this thread.  If you
search the std-proposals list, you'll find prior threads discussing this
topic.  There was one in June of last year titled "Thoughts on
Exceptions, Expected, and Error Handling" that should provide some insight.

The reality is that use of exceptions is not acceptable to a significant
portion of the C++ community at present.  I acknowledge that general
acceptance of this library will require error handling alternatives to
exceptions.  I had already been thinking about this; I just haven't had
time to explore the options yet.

Tom.
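
For illustration only, one possible alternative to throwing is to report
decode errors through a return value.  The sketch below is hypothetical and
is not part of Text_view: the names decode_status, decode_result and
decode_one_utf8 are invented for this example, and it assumes the input is a
plain array of UTF-8 code units.

```cpp
#include <cstddef>

// Hypothetical status codes; not taken from Text_view.
enum class decode_status { ok, truncated, invalid };

// Hypothetical result type returned instead of throwing.
struct decode_result {
    decode_status status;   // outcome of the decode attempt
    char32_t code_point;    // meaningful only when status == ok
    std::size_t consumed;   // code units consumed when status == ok
};

// Decode one code point from a UTF-8 code unit sequence without throwing.
decode_result decode_one_utf8(const unsigned char* p, std::size_t n) noexcept {
    if (n == 0) return {decode_status::truncated, 0, 0};

    const unsigned char b0 = p[0];
    std::size_t len;
    char32_t cp;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return {decode_status::invalid, 0, 0};  // stray continuation or bad lead

    if (n < len) return {decode_status::truncated, 0, 0};

    // Fold in the continuation code units, each of the form 10xxxxxx.
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return {decode_status::invalid, 0, 0};
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {decode_status::invalid, 0, 0};

    return {decode_status::ok, cp, len};
}
```

With this shape, the caller checks result.status after each call and decides
locally how to recover (skip, substitute a replacement character, or stop),
so no exception support is needed.  Whether a status code, an expected-like
type, or a policy parameter fits the library best is left open here.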

.