Topic: Possible mistake in N2442 - string literals


Author: Greg Herlihy <greghe@mac.com>
Date: Sat, 14 Feb 2009 15:03:24 CST
On Feb 12, 10:11 am, anon <a...@nowhere.it> wrote:
> Daphne Pfister wrote:
> > Universal-character-names are allowed. Which means that any raw data already
> > needs to be adjusted before being inserted verbatim already. Seems like
> > \u005C should work as an alternate to escaping backslashes.
>
> Oh, great! Thanks for clarifying this
>
> So, both for end-of-line slash and for the \+u sequence we will be able to
> use \u005C ? So that would become \u005C<newline> for slash+newline and
> \u005Cu for \+u , correct?

Apparently not. A C++ program that contains a universal escape
sequence corresponding to a character in the basic source character
set (such as 0x5C, the backslash character) is ill-formed - according
to both the current C++ Standard (see §2.2/2) and the latest draft C++
Standard.

Greg



--
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]





Author: anon <anon@nowhere.it>
Date: Sun, 15 Feb 2009 12:04:57 CST
Greg Herlihy wrote:
> Apparently not. A C++ program that contains a universal escape
> sequence corresponding to a character in the basic source character
> set (such as 0x5C, the backslash character) is ill-formed - according
> to both the current C++ Standard (see §2.2/2) and the latest draft C++
> Standard.

Oh Sh#t
I hope it's not true
I read this in the paragraph you mentioned (draft SC22-N-4411.pdf):

"Additionally, if the hexadecimal value for a universal-character-name
outside a character or string literal corresponds to a control character
(...) or to a character in the basic source character set, the program
is ill-formed"

but it says "outside a character or string literal"...

What do you think?







Author: usenet@stegropa.de (Stefan Große Pawig)
Date: Sun, 15 Feb 2009 19:45:14 CST
Greg Herlihy <greghe@mac.com> writes:
> On Feb 12, 10:11 am, anon <a...@nowhere.it> wrote:
>> So, both for end-of-line slash and for the \+u sequence we will be able to
>> use \u005C ? So that would become \u005C<newline> for slash+newline and
>> \u005Cu for \+u , correct?
>
> Apparently not. A C++ program that contains a universal escape
> sequence corresponding to a character in the basic source character
> set (such as 0x5C, the backslash character) is ill-formed - according
> to both the current C++ Standard (see §2.2/2) and the latest draft C++
> Standard.

Well, kind of.  The latest draft (n2800) says at the end of 2.2/2:
---
[...] Additionally, if the hexadecimal value for a
universal-character-name outside a character or string literal
corresponds to a control character (in either of the ranges 0x00–0x1F
or 0x7F–0x9F, both inclusive) or to a character in the basic source
character set, the program is ill-formed.
---

Note the part that says "outside a character or string literal", which
is new compared to C++98.

Since the OP asked about (raw) string literals, he should be fine with
the \u005C construct.

   Kind regards,
     Stefan






Author: anon <anon@nowhere.it>
Date: Sat, 7 Feb 2009 11:06:36 CST
Hi there,
I was reading N2442
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
and it seems to me there is a mistake.

It says:
---
[Note: A source-file new-line in a raw string-literal results in a
new-line in the resulting execution string-literal, unless preceded by a
backslash.
---

The problem is that, this way, you cannot enter a raw literal containing a
backslash followed by a new-line, something you definitely might want to
do. Are we supposed to check every string literal we put in (which might
be binary and might even be very long) to see whether the forbidden
sequence \+\n occurs??
And what if our string literal actually does contain \+\n? What
alternatives do we have for putting it into the source file? Should we
revert to normal strings and escape all of it by hand? It might be
kilobyte-sized or megabyte-sized, and escaping it might be
inappropriate: it might need to stay human-readable or might need to
stay raw binary...

Also consider that the \+\n sequence can easily exist, it actually might
be quite common: consider we could use the raw string literal to embed
code of another programming language into a C++ string literal (e.g. C++
inside C++ for code generators)

So this really seems a mistake to me.

Furthermore it does not seem to be possible to fix the standard later
without breaking old code, and the break would be silent.

Do you agree with me?
Should I forward these concerns to someone? To whom?

Thank you
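
[Editorial illustration, not part of the original post.] The concern above
is about the N2442 draft wording. In the C++11 standard as later
published, the raw string delimiter uses parentheses instead of square
brackets, and the phase 1 and phase 2 transformations (trigraphs,
universal-character-names, line splicing) are reverted between the quotes
of a raw string literal. The sketch below assumes a compiler that follows
those published rules; the function names are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Assumes published C++11 raw string rules, not the N2442 draft syntax.
// The backslash and the new-line that follows it both stay in the string.
inline std::string raw_backslash_newline() {
    return R"(a\
b)";
}

// A backslash followed by u is kept as six ordinary characters.
inline std::string raw_backslash_u() {
    return R"(\u00e1)";
}

inline void check_published_raw_strings() {
    assert(raw_backslash_newline().size() == 4);  // a, backslash, new-line, b
    assert(raw_backslash_u().size() == 6);
}
```

Calling check_published_raw_strings() from main is enough to try it. A
compiler implementing only the draft wording discussed in this thread
would be expected to behave differently.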






Author: James Kuyper <jameskuyper@verizon.net>
Date: Sun, 8 Feb 2009 09:27:49 CST
anon wrote:
> Hi there,
> I was reading the n2442
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
> it seems to me there is a mistake
>
> It says:
> ---
> [Note: A source-file new-line in a raw string-literal results in a
> new-line in the resulting execution string-literal, unless preceded by a
> backslash.
> ---
>
> Problem is, like this you cannot enter any raw literal having a
> backslash followed by a new line,  ...

Why not? What prevents you from using "\\\n"?

> ... something you definitely might want to
> do. Are we supposed to check every string literal we put in (which might
> be binary and might be even very long) checking if the forbidden
> sequence \+\n exists??

Yes. A very long binary string should not be placed in source code
except through use of a code generator which can easily be written to
check for such things. Source code is intended for storing program
logic, not storing program data; you can store program data there, but
to store large amounts of it there is using the wrong tool for the job;
it's like using a screwdriver as a shovel. If the blade of the
screwdriver is big enough, it can be made to serve as a very tiny
shovel, but it's still better to use an actual shovel when there's large
amounts of stuff to be moved.






Author: anon <anon@nowhere.it>
Date: Sun, 8 Feb 2009 19:27:30 CST
James Kuyper wrote:
>> Problem is, like this you cannot enter any raw literal having a
>> backslash followed by a new line,  ...
>
> Why not? What prevents you from using "\\\n"?

The fact that this is a raw string and I cannot use a \ to escape another \

Or at least this is what I see from the definition of r-char at the link
above: there is no "escape sequence" entry. Or am I mistaken?

Anyway, this sucks imho. If this is a raw string it should really be raw.
If the standard went such a long way to allow a custom header/terminator
for the raw string, the reason should have been to allow ANY
character sequence in it except for the terminator sequence (which is
customizable).
What is the point of a raw string if we still have to escape it?

Actually I now see another problem in the r-char definition: apparently
we cannot enter the forbidden sequence \u or \U in a raw string!? How
are we supposed to enter this sequence then, revert to standard string??

> Yes. A very long binary string should not be placed in source code
> except through use of a code generator which can easily be written to
> check for such things. Source code is intended for storing program
> logic, not storing program data; you can store program data there, but
> to store large amounts of it there is using the wrong tool for the job;

Do we have another way to assign unescaped binary data to some variable
in standard C++?






Author: James Kuyper <jameskuyper@verizon.net>
Date: Mon, 9 Feb 2009 16:52:32 CST
anon wrote:
> James Kuyper wrote:
>>> Problem is, like this you cannot enter any raw literal having a
>>> backslash followed by a new line,  ...
>>
>> Why not? What prevents you from using "\\\n"?
>
> The fact that this is a raw string and I cannot use a \ to escape another \
>
> Or at least this is what I see from the definition of r-char at the link
> above: there is no "escape sequence" entry. Or am I mistaken?

I'm sorry, I was unaware that there had been a proposal that creates a
new concept of a "raw" literal. I saw your link, but just assumed it was
a link to the latest draft version of the standard. I assumed that the
short sentence you cited from that document was all that I needed to
know about; and since it sounded very similar to existing wording in the
current text, I didn't think I needed to follow that link to answer your
question. My apologies for a series of bad assumptions.






Author: "Bo Persson" <bop@gmb.dk>
Date: Mon, 9 Feb 2009 16:54:44 CST
anon wrote:
> Hi there,
> I was reading the n2442
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
> it seems to me there is a mistake
>
> It says:
> ---
> [Note: A source-file new-line in a raw string-literal results in a
> new-line in the resulting execution string-literal, unless preceded
> by a backslash.
> ---
>
> Problem is, like this you cannot enter any raw literal having a
> backslash followed by a new line, something you definitely might
> want to do. Are we supposed to check every string literal we put in
> (which might be binary and might be even very long) checking if the
> forbidden sequence \+\n exists??

The problem, from a language point of view, is that the merging of
source lines (ending in a backslash) is performed by the preprocessor
*before* forming tokens. When it realizes that there are string
literals in the source, the newline has already been removed.
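
[Editorial illustration, not part of the original post.] The splicing
described here can be observed with an ordinary (non-raw) string literal
in any C++ version. This is a minimal sketch; the function names are
hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Phase 2 deletes the backslash and the new-line that follows it,
// before the string literal is recognized as a token.
inline const char* spliced_literal() {
    return "abc\
def";
}

// To get both characters into the string, escape sequences are needed.
inline const char* escaped_literal() {
    return "abc\\\ndef";
}

inline void check_splicing() {
    assert(std::strcmp(spliced_literal(), "abcdef") == 0);
    assert(std::strlen(escaped_literal()) == 8);  // abc, backslash, new-line, def
}
```

Calling check_splicing() from main runs both checks. The continuation
line has to start in column one, because any indentation would become
part of the string.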

> And what if our string literal does actually contain \+\n, what
> alternatives do we have to put it into the source file? should we
> revert to normal strings and escape all of it by hand? It might be
> kilobyte-sized or megabyte-sized, also escaping it might be
> inappropriate: it might need to stay human readable or might need to
> stay raw binary...

Put it in a data file and read it in at runtime?
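
[Editorial illustration, not part of the original post.] One way the
data-file suggestion might look in practice is sketched below. The helper
name read_all_bytes is hypothetical, and the file is opened in binary mode
on the assumption that the payload must arrive byte for byte.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the contents of the file unchanged, so
// backslashes, new-lines and \u sequences need no escaping at all.
inline std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The trade-off is that the data is no longer part of the executable and
has to be shipped and located alongside it.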


Bo Persson








Author: anon <anon@nowhere.it>
Date: Tue, 10 Feb 2009 16:23:41 CST
Bo Persson wrote:
> The problem, from a language point of view, is that the merging of
> source lines (ending in a backslash) is performed by the preprocessor
> *before* forming tokens. When it realizes that there are string
> literals in the source, the newline has already been removed.

Ahh, I understand now why it was made like this.

It's a pity but... I just realized that we can use \+\+newline+newline
in order to put the sequence \+newline into the string, am I correct?

However, as I wrote in my other post, I have now seen a more serious
problem: it seems we cannot enter the sequences \+U or \+u in the raw
string (see the definition of r-char).
Why on earth did they forbid this? It is unlikely to have a workaround!?






Author: "Daphne Pfister" <pfister@nortel.com>
Date: Tue, 10 Feb 2009 21:59:14 CST
"anon" <anon@nowhere.it> wrote in message
news:498d44ca$0$35421$892e0abb@auth.newsreader.octanews.com...
> And what if our string literal does actually contain \+\n, what
> alternatives do we have to put it into the source file? should we revert
> to normal strings and escape all of it by hand? It might be
> kilobyte-sized or megabyte-sized, also escaping it might be
> inappropriate: it might need to stay human readable or might need to
> stay raw binary...

From reading this bit in the grammar:
r-char:
   any member of the source character set, except, (1), a backslash \
   followed by a u or U, or, (2), a right square bracket ] followed by
   the initial d-char-sequence (which may be empty) followed by a
   double quote ".
   universal-character-name

Universal-character-names are allowed. Which means that any raw data already
needs to be adjusted before being inserted verbatim already. Seems like
\u005C should work as an alternate to escaping backslashes.

Daphne









Author: Martin Vejnár <avakar@volny.cz>
Date: Wed, 11 Feb 2009 12:26:52 CST
anon wrote:
> it seems we cannot enter the sequences \+U or \+u in the raw
> string (see the definition of r-char).
> Why on earth did they forbid this? It is unlikely to have a workaround!?

The inclusion of \u handling even in raw strings is unavoidable, at
least until the standard starts to treat international characters
differently. Right now, they are transformed into \u sequences, which in
turn must be interpreted in string literals.

In short, if you had r"\u00e1" equal to "\\u00e1", then they would both
be equal to r"á" as well. I suppose the current wording is the lesser of
two evils.
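
[Editorial illustration, not part of the original post.] The distinction
drawn here between one character named by a universal-character-name and
six separate characters can be checked with ordinary (non-raw) literals.
The sketch assumes a C++11 compiler for char16_t literals; code_units and
the check function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Number of char16_t code units in a literal, excluding the terminator.
template <std::size_t N>
std::size_t code_units(const char16_t (&)[N]) {
    return N - 1;
}

inline void check_ucn_in_ordinary_literals() {
    // A universal-character-name names a single character, U+00E1.
    assert(code_units(u"\u00e1") == 1);
    assert(u"\u00e1"[0] == 0x00E1);
    // An escaped backslash keeps all six characters separate.
    assert(code_units(u"\\u00e1") == 6);
    assert(u"\\u00e1"[0] == u'\\');
}
```

Calling check_ucn_in_ordinary_literals() from main runs the checks.
Using char16_t keeps the result independent of the narrow execution
character set.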
--
Martin







Author: Beman Dawes <bgdawes@gmail.com>
Date: Wed, 11 Feb 2009 12:32:07 CST
On Feb 10, 5:23 pm, anon <a...@nowhere.it> wrote:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2800.pdf is
the document to be looking at; it is the latest copy of the working
paper. There may have been changes since the document you referenced
in your original post.

> However, as I wrote in my other post, I have now seen a more serious
> problem: it seems we cannot enter the sequences \+U or \+u in the raw
> string (see the definition of r-char).
> Why on earth did they forbid this? It is unlikely to have a workaround!?

The raw string literal feature was carefully structured so that there
is no change in the meaning and interactions of the traditional phases
of translation. A change to universal character name processing would
break far too much existing code. See 2.1 Phases of translation.

--Beman Dawes






Author: anon <anon@nowhere.it>
Date: Thu, 12 Feb 2009 12:11:27 CST
Daphne Pfister wrote:
> Universal-character-names are allowed. Which means that any raw data already
> needs to be adjusted before being inserted verbatim already. Seems like
> \u005C should work as an alternate to escaping backslashes.

Oh, great! Thanks for clarifying this

So, both for end-of-line slash and for the \+u sequence we will be able to
use \u005C ? So that would become \u005C<newline> for slash+newline and
\u005Cu for \+u , correct?

Thank you
