Thread

Topic: Perl-style hex escapes in strings

Author: Myriachan <myriachan@gmail.com>
Date: Thu, 17 Jul 2014 19:08:39 -0700 (PDT)

I've always liked the way that Perl implements hex escapes in strings for
supporting Unicode.  With C++'s \x escape, any following letters that happen
to be hex digits are absorbed into the escape, so the string has to be split
with quotation marks to end it.  Perl's hex escape format solves this.
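
To illustrate the problem in C++ terms, here is a minimal sketch. The names split_literal and check_split_literal are made up for this example, and the remark about compilers rejecting the unsplit form describes typical behavior rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Writing "\xD5ab" would make the escape swallow 'a' and 'b' (value 0xD5AB),
// which compilers typically reject as out of range for char.
// Splitting the literal ends the escape after two digits.
constexpr char split_literal[] = "\xD5" "ab";

// sizeof counts the terminating '\0': one escaped char, 'a', 'b', terminator.
static_assert(sizeof(split_literal) == 4, "expected 3 characters plus terminator");

// Runtime check that the letters survived as ordinary characters.
inline void check_split_literal() {
    assert(static_cast<unsigned char>(split_literal[0]) == 0xD5);
    assert(split_literal[1] == 'a' && split_literal[2] == 'b');
}
```

With a braced form such as \x{D5}, the closing brace would end the escape and no split would be needed.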

Perl's hex escape format is as follows: \x{hhh...}   for some number of h's
>= 1.  The braces are literal: they are used to surround the sequence.  The
hexadecimal number is treated as a Unicode code point number; values higher
than 0xFFFF are considered to be outside the Basic Multilingual Plane.
Perl considers \x{} to be the same as \x{0}, but it seems more appropriate
for a C++ implementation to consider an empty hex string escape ill-formed.

An alternative method of solving the same problem is Python's escape
sequence format:

\uhhhh = 16-bit
\Uhhhhhhhh = 32-bit

Python requires exactly four hex digits after \u and exactly eight after \U;
otherwise the escape is ill-formed.  If \U is used and the underlying format
is 16-bit, a code point inside the BMP is narrowed to a single 16-bit
character, and one outside the BMP becomes a surrogate pair.  Two consecutive
\u escapes can spell out the two UTF-16 surrogates of a pair; if the
underlying format is 32-bit, the pair is converted to a single 32-bit code
point.
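
The surrogate-pair arithmetic referred to here can be sketched in C++ as follows. This is only an illustration of the standard UTF-16 mapping; the helper names to_surrogates and from_surrogates are hypothetical and not taken from Python or any library.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above U+FFFF into a UTF-16
// surrogate pair (high, low).
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // only supplementary code points
    const std::uint32_t v = cp - 0x10000;     // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),      // top 10 bits
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };  // low 10 bits
}

// Hypothetical inverse: combine a surrogate pair into one 32-bit code point.
inline std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1F600 maps to the pair 0xD83D, 0xDE00, and the inverse function recovers 0x1F600 from that pair.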

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+unsubscribe@isocpp.org.
To post to this group, send email to std-proposals@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.


.

Author: David Krauss <potswa@gmail.com>
Date: Fri, 18 Jul 2014 10:29:15 +0800

On 2014-07-18, at 10:08 AM, Myriachan <myriachan@gmail.com> wrote:

> I've always liked the way that Perl implements hex escapes in strings for
> supporting Unicode.  They are a better solution to the problem that
> following letters that happen to be hex digits have to be separated by
> quotation marks.  Perl's hex escape format solves this.
>
> Perl's hex escape format is as follows: \x{hhh...}   for some number of h's
> >= 1.  The braces are literal: they are used to surround the sequence.  The
> hexadecimal number is treated as a Unicode code point number; values higher
> than 0xFFFF are considered to be outside the Basic Multilingual Plane.
> Perl considers \x{} to be the same as \x{0}, but it seems more appropriate
> for a C++ implementation to consider an empty hex string escape ill-formed.
>
> An alternative method of solving the same problem is Python's escape
> sequence format:
>
> \uhhhh = 16-bit
> \Uhhhhhhhh = 32-bit

The \u and \U escapes have been part of C++ since 1998. They probably
predate adoption by Python.

However, \x and \u in C++ do completely different things. \x specifies a raw
value of the character type, while \u names a Unicode code point, which may
be re-encoded to other values (such as the surrogate pair mapping you
mention).
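
A small sketch of that difference, assuming a C++11 or later compiler; the variable names and the check_escapes function are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// \U names a code point; the compiler encodes it for the literal's type.
constexpr char16_t utf16[] = u"\U0001F600";  // UTF-16: becomes a surrogate pair
constexpr char32_t utf32[] = U"\U0001F600";  // UTF-32: one 32-bit code unit

// \x supplies raw code units, with no re-encoding by the compiler.
constexpr char16_t raw16[] = u"\xD83D\xDE00";

static_assert(sizeof(utf16) / sizeof(utf16[0]) == 3, "two code units plus terminator");
static_assert(sizeof(utf32) / sizeof(utf32[0]) == 2, "one code unit plus terminator");

// Runtime check: the hand-written raw units match what \U produced.
inline void check_escapes() {
    assert(utf16[0] == raw16[0] && utf16[1] == raw16[1]);
    assert(utf32[0] == 0x1F600);
}
```

The raw16 form only yields the same text because the surrogate values were worked out by hand, which is the sort of magic number the \u and \U forms avoid.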

The \x{} syntax is safer, but is it worth the implementation effort? Any
codebase using \x for Unicode should migrate to \u posthaste.

IMHO, hex escape sequences are just magic numbers. The best solution isn't
to add quote-breaks to terminate such sequences, but to confine them to
macros.

#define STRINGY_OPCODE "\xD5"
char const * data_sequence = "hello" STRINGY_OPCODE "world";


.

Author: Myriachan <myriachan@gmail.com>
Date: Thu, 17 Jul 2014 20:01:02 -0700 (PDT)

> This is already part of C++ since 1998. It probably predates adoption by
> Python.
>
> However, \x and \u in C++ do completely different things. One is for
> character-type raw values, the other is for Unicode codepoints which may be
> re-encoded to other values (such as the surrogate pair mapping you mention).
>

How did I miss this?  I'm sorry...  Never mind.  >.<


.