Topic: header-name tokens
Author: Christopher Eltschka <celtschk@web.de>
Date: Sat, 20 Apr 2002 10:10:52 GMT
Pete Becker <petebecker@acm.org> writes:
> Ron Natalie wrote:
> >
> > There is a tacit assumption that there is
> > always a multibyte (using char) conversion that you can move any
> > wchar_t based string to.
>
> I hope not. <g> wcstombs is allowed to fail, so there may not be such a
> conversion. And the result can depend on the current locale, so there
> may be more than one such conversion. The issue, both for C and for C++,
> is what the required behavior should be when someone attempts to open a
> file using a wide character name. Many OS's do not support anything
> larger than 8-bit characters in file names; even on those that do,
> wchar_t isn't required to be the same size as the larger character type.
> As an application writer, what heuristics do you use to produce file
> names that are reasonably likely to be portable? 6.1, no names that
> differ in case only, letters and numbers only, ...
Now C has no "file character" type, nor a "file name" type. On many
operating systems it's quite easy to hand the OS a string which isn't
a valid file name. Every DOS C/C++ compiler will compile
fopen("invalid.name", "r");
although the name violates the 8.3 rule, and every DOS C/C++ compiler
will also compile
fopen("a b+c.x*y", "r");
although neither space nor + nor * is allowed in DOS filenames.
Therefore if you have a system which supports only 8-bit characters,
but want to support 16-bit names, the rules are easy:
The letter L'x' is translated into the letter 'x'. If a wide character
has no narrow version (i.e. the wide string has a letter which you
couldn't have in a narrow string), then the file name is just as wrong
as my two examples were on DOS. There's not really a difference here.
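For instance, such a mapping might look like the sketch below. This is
only an illustration: wide_to_narrow_name is a made-up helper, and
wctob merely stands in for whatever implementation-defined "narrow
version" mapping is meant above.
// Sketch: map a wide file name to a narrow one, rejecting names that
// contain a character with no narrow version -- just as invalid as
// "a b+c.x*y" was on DOS.
#include <cstdio>   // EOF
#include <cwchar>   // std::wctob
#include <string>

bool wide_to_narrow_name(const std::wstring& wide, std::string& narrow)
{
    narrow.clear();
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        int c = std::wctob(std::wint_t(wide[i]));  // 'x' for L'x', if any
        if (c == EOF)
            return false;  // no narrow version: not a valid file name
        narrow += char(c);
    }
    return true;
}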
The same is true for implementations on platforms with wide character
filenames: Just as the normal 8 bit filenames are translated from the
narrow execution character set into the corresponding file name wide
character set, the characters from the wide execution character set
would be translated into characters of the wide file name character
set. In both cases, characters which are in the execution character
set, but not in the file name character set, are errors.
Note that locales don't come into play here, since the narrow and wide
string literals already define preferred encodings of the C or C++
implementation, and the encoding of the file names is not under
implementation control anyway.
BTW, even portable programs may use filenames which don't work on some
systems: not every file name is hard-coded into a string literal, and
I wouldn't like a portable program to accept only 6.1 filenames on
Unix in its configuration files, just because some strange system may
have such a limit.
Author: James Kanze <kanze@gabi-soft.de>
Date: Sun, 21 Apr 2002 02:07:42 GMT
Christopher Eltschka <celtschk@web.de> writes:
|> Pete Becker <petebecker@acm.org> writes:
|> Therefore if you have a system which supports only 8-bit characters,
|> but want to support 16-bit names, the rules are easy:
|> The letter L'x' is translated into the letter 'x'. If a wide
|> character has no narrow version (i.e. the wide string has a letter
|> which you couldn't have in a narrow string), then the file name is
|> just as wrong as my two examples were on DOS. There's not really a
|> difference here.
And what happens if the translation is locale dependent, as it usually
is? Do you refuse it, and if not, which locale do you use?
[...]
|> Note that locales don't come into play here, since the narrow and
|> wide string literals already define preferred encodings of the C or
|> C++ implementation, and the encoding of the file names is not under
|> implementation control anyway.
I don't quite understand this. How do the narrow and the wide string
literals already define any preferred encoding? To begin with, of
course, you cannot suppose that the encodings on the target machine and
on the machine doing the compilation are in any way related. But even
without that -- I don't see where there is really any relationship
defined between wide character string literals and narrow character
ones.
One of the points I made in the standards committee was that file names
that *look* the same should refer to the same file. This depends on the
fonts active at any given time (and not the locale). I don't think it
is an absolute requirement, but I do think that it is something which
requires thought -- in France, two different code sets are widely used:
ISO 8859-1 and ISO 8859-15, and the characters I see with emacs depend
on the font I have configured for emacs. My user may have configured
something else.
--
James Kanze mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
Author: Christopher Eltschka <celtschk@web.de>
Date: Wed, 24 Apr 2002 00:45:11 GMT
James Kanze <kanze@gabi-soft.de> writes:
> Christopher Eltschka <celtschk@web.de> writes:
>
> |> Pete Becker <petebecker@acm.org> writes:
>
> |> Therefore if you have a system which supports only 8-bit characters,
> |> but want to support 16-bit names, the rules are easy:
>
> |> The letter L'x' is translated into the letter 'x'. If a wide
> |> character has no narrow version (i.e. the wide string has a letter
> |> which you couldn't have in a narrow string), then the file name is
> |> just as wrong as my two examples were on DOS. There's not really a
> |> difference here.
>
> And what happens if the translation is locale dependent, as it usually
> is? Do you refuse it, and if not, which locale do you use?
If the translation is locale dependent, then of course the wide
characters should use the same locale as the narrow ones. But are
there really any platforms where the interpretation of file names is
locale dependent?
>
> [...]
> |> Note that locales don't come into play here, since the narrow and
> |> wide string literals already define preferred encodings of the C or
> |> C++ implementation, and the encoding of the file names is not under
> |> implementation control anyway.
>
> I don't quite understand this. How do the narrow and the wide string
> literals already define any preferred encoding?
It's quite simple:
If, in your program, you write 'a', then the compiler stores into its
memory some bit pattern which represents the letter a. The same holds
for any character which is part of both the source character set
(which _may_ be locale dependent, but then it depends on the locale in
effect while running the compiler, not on the locale while running the
program) and the execution character set (which for exactly this
reason has to be determined at compile time): it will result in the
corresponding character in the execution character set.
Or, said differently: the implied char encoding is the encoding in
which the code of the letter 'a' is the number the compiler generates
when translating the character literal 'a'. The same is true for wide
characters and L'a'. And the same is also true for more esoteric
characters like \u0950: if you write L'\u0950' and the compiler
translates this into a number, then that number is by definition the
implied wide character encoding of U+0950 DEVANAGARI OM (assuming the
execution wide character set contains such a character, of course).
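For concreteness, the implied encodings can be observed directly; the
exact numbers printed are simply whatever the implementation's
execution character sets assign (a minimal sketch):
// The values the compiler gave these literals are, by definition, the
// characters' codes in the (narrow and wide) execution character sets.
#include <cstdio>

int main()
{
    std::printf("'a'       -> %d\n", 'a');               // 97 on ASCII-based sets
    std::printf("L'a'      -> %ld\n", long(L'a'));
    std::printf("L'\\u0950' -> %ld\n", long(L'\u0950')); // U+0950, if representable
    return 0;
}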
> To begin with, of
> course, you cannot suppose that the encodings on the target machine and
> on the machine doing the compilation are in any way related.
To translate character constants and string literals, a cross
compiler had better know the character encoding of the destination
machine.
> But even
> without that -- I don't see where there is really any relationship
> defined between wide character string literals and narrow character
> ones.
If '\uxxxx' translates to one number and L'\uxxxx' translates to
another number, then the first number is the narrow character
representation of the same character which is represented as a wide
character by the second number.
>
> One of the points I made in the standards committee was that file names
> that *look* the same should refer to the same file.
So you'd say that the file "\u0391.txt" and the file "A.txt" should be
the same if the current font uses the same glyph for uppercase Alpha
and uppercase A, and different otherwise?
> This depends on the
> fonts active at any given time (and not the locale). I don't think it
> is an absolute requirement, but I do think that it is something which
> requires thought -- in France, two different code sets are widely used:
> ISO 8859-1 and ISO 8859-15, and the characters I see with emacs depends
> on the font I have configured for emacs. My user may have configured
> something else.
And how should the compiler know which character set you displayed in
Emacs when you wrote the code?
Well, Emacs could convert all non-ASCII characters into the
corresponding \u form, or you could tell the compiler via a switch or
pragma which encoding your source file uses.
But then, this is still completely independent of the encoding of the
characters *in* the program. If I cross-compiled a program for an
EBCDIC system on an ASCII system, and I used the string literal "abc",
I'd certainly expect the compiler _not_ to copy the character codes
literally from the (ASCII) source file into the (to be executed in an
EBCDIC environment) object code, but to do a translation from ASCII to
EBCDIC. For sure, if you read the binary into Emacs on your ASCII
system, you'd not see your "abc" at that place, unless you happen to
have an EBCDIC encoding in Emacs.
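A toy illustration of that translation step, assuming the usual EBCDIC
code points for a, b and c (a real cross compiler would of course use
a complete table; to_ebcdic is a made-up stand-in):
#include <cstdio>

// Stand-in for the compiler's ASCII-to-EBCDIC translation table.
unsigned char to_ebcdic(unsigned char ascii)
{
    switch (ascii) {
        case 'a': return 0x81;  // EBCDIC 'a'
        case 'b': return 0x82;  // EBCDIC 'b'
        case 'c': return 0x83;  // EBCDIC 'c'
        default:  return 0x3F;  // EBCDIC SUB for anything unmapped
    }
}

int main()
{
    const char* src = "abc";    // as written in the ASCII source file
    for (const char* p = src; *p; ++p)   // what lands in the object code
        std::printf("%02X ", (unsigned)to_ebcdic((unsigned char)*p));
    std::printf("\n");          // prints: 81 82 83
    return 0;
}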
Author: James Kanze <kanze@alex.gabi-soft.de>
Date: Wed, 24 Apr 2002 15:31:59 GMT
Christopher Eltschka <celtschk@web.de> writes:
|> James Kanze <kanze@gabi-soft.de> writes:
|> > Christopher Eltschka <celtschk@web.de> writes:
|> > |> Pete Becker <petebecker@acm.org> writes:
|> > |> Therefore if you have a system which supports only 8-bit
|> > |> characters, but want to support 16-bit names, the rules are
|> > |> easy:
|> > |> The letter L'x' is translated into the letter 'x'. If a wide
|> > |> character has no narrow version (i.e. the wide string has a
|> > |> letter which you couldn't have in a narrow string), then the
|> > |> file name is just as wrong as my two examples were on
|> > |> DOS. There's not really a difference here.
|> > And what happens if the translation is locale dependent, as it
|> > usually is? Do you refuse it, and if not, which locale do you
|> > use?
|> If the translation is locale dependent, then of course the wide
|> characters should use the same locale as the narrow ones. But are
|> there really any platforms where the interpretation of file names
|> is locale dependent?
Solaris. HP/UX. AIX. Probably others.
I have file systems mounted on both a Linux machine and a Solaris
machine. If I do an ls on the different machines, I see different
filenames for the same file.
Saying the interpretation is locale specific is really only a half
truth. It is true that ls on Unix machines (but not Linux) uses the
locale when displaying filenames. But only to a limited degree (to
replace characters where isprint is false with a ?), and only when
outputting to an interactive device. (Thus, under Linux, "ls" and "ls
| cat" give different results.) They also "interpret" the file name
according to the current font encoding, so that if I select a
different font, the name may contain different characters.
|> > [...]
|> > |> Note that locales don't come into play here, since the
|> > |> narrow and wide string literals already define preferred
|> > |> encodings of the C or C++ implementation, and the encoding
|> > |> of the file names is not under implementation control
|> > |> anyway.
|> > I don't quite understand this. How do the narrow and the wide
|> > string literals already define any preferred encoding?
|> It's quite simple:
|> If, in your program, you write 'a', then the compiler stores into
|> its memory some bit pattern which represents the letter a. The
|> same holds for any character which is part of both the source
|> character set (which _may_ be locale dependent, but then it
|> depends on the locale in effect while running the compiler, not on
|> the locale while running the program) and the execution character
|> set (which for exactly this reason has to be determined at compile
|> time): it will result in the corresponding character in the
|> execution character set.
|> Or, said differently: the implied char encoding is the encoding in
|> which the code of the letter 'a' is the number the compiler
|> generates when translating the character literal 'a'. The same is
|> true for wide characters and L'a'. And the same is also true for
|> more esoteric characters like \u0950: if you write L'\u0950' and
|> the compiler translates this into a number, then that number is by
|> definition the implied wide character encoding of U+0950
|> DEVANAGARI OM (assuming the execution wide character set contains
|> such a character, of course).
That seems to be the most reasonable suggestion so far. Still, I
think that the problem is less evident than you suggest. The problem
is, as always, that a program should act intuitively -- having worked
in the field, I very well understand the results of ls, above. But
try to explain to a normal user why, to remove the file whose name is
displayed as "?t?", he actually has to do some funny manipulations to
enter the name "été".
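The behaviour in question is easy to model (a sketch; it assumes the
ISO 8859-1 byte string for "été" and that the locale's isprint is what
drives the replacement, roughly as ls does when writing to a terminal):
#include <cctype>
#include <clocale>
#include <cstdio>

int main()
{
    std::setlocale(LC_CTYPE, "");     // use the user's locale
    const char name[] = "\xE9t\xE9";  // "été" in ISO 8859-1
    for (const char* p = name; *p; ++p) {
        unsigned char c = (unsigned char)*p;
        std::putchar(std::isprint(c) ? c : '?');  // non-printable -> '?'
    }
    std::putchar('\n');               // "été" or "?t?", depending on locale
    return 0;
}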
Quite frankly, I don't know what the correct solution is. And I don't
think that I will know until someone actually implements something,
and we get some experience with it. And I don't want us to
standardize the wrong thing.
|> > To begin with, of course, you cannot suppose that the encodings
|> > on the target machine and on the machine doing the compilation
|> > are in any way related.
|> To translate character constants and string literals, a cross
|> compiler had better know the character encoding of the
|> destination machine.
|> > But even without that -- I don't see where there is really any
|> > relationship defined between wide character string literals and
|> > narrow character ones.
|> If '\uxxxx' translates to one number and L'\uxxxx' translates to
|> another number, then the first number is the narrow character
|> representation of the same character which is represented as a
|> wide character by the second number.
|> > One of the points I made in the standards committee was that
|> > file names that *look* the same should refer to the same file.
|> So you'd say that the file "\u0391.txt" and the file "A.txt"
|> should be the same if the current font uses the same glyph for
|> uppercase Alpha and uppercase A, and different otherwise?
Good point. To start with, it will be necessary to define what is
meant by "looks like". A Greek capital letter alpha certainly looks
like a capital A to me.
Still, I'd think I could explain to a naïve user that the Greek
capital letter alpha and the Latin capital letter A are two different
letters, even if they look the same. Whereas I'm a lot less certain
about the problem with é being displayed as a ?.
|> > This depends on the fonts active at any given time (and not the
|> > locale). I don't think it is an absolute requirement, but I do
|> > think that it is something which requires thought -- in France,
|> > two different code sets are widely used: ISO 8859-1 and ISO
|> > 8859-15, and the characters I see with emacs depends on the font
|> > I have configured for emacs. My user may have configured
|> > something else.
|> And how should the compiler know which character set you displayed
|> in Emacs when you wrote the code?
Exactly my point:-).
I know the problems. I don't know any real solutions. Yet, at least.
|> Well, Emacs could convert all non-ASCII characters into the
|> corresponding \u form, or you could tell the compiler via a
|> switch or pragma which encoding your source file uses.
Both:-).
A good program editor for C++ should generate portably readable source
files, which IMHO means that any character outside of the basic
character set should be represented in the \u form, and even the
characters such as [ and ] should be stored as trigraphs. In so far
as the installed fonts allow it, of course, the characters should be
displayed as normal characters. (IMHO, in a good program development
environment, all of the fonts will support full Unicode. But I'm not
holding my breath.)
A good compiler should be able to accept input in whatever encoding it
is written. Since it is typically impossible for the compiler to
determine the code set automatically (except maybe by means of a
pragma), a compile line option is necessary. This still doesn't solve
all of the problems, however -- I normally work in an ISO 8859-1
environment, and would specify this as the option for input. But what
happens if I happen to download an include file from the Czech
Republic? In my code, a character encoded as 0xf8 should be
interpreted as \u00f8, but in the include file, as \u0159. The only
real solution I see here is editors which save in \u format, but
display correctly. And translation programs which convert downloaded
files to \u format according to the given code set. (UTF-8 might be
an alternative as well, provided you had a utility to convert it to
\u format for export.)
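Such a translation program is almost trivial for ISO 8859-1, since its
bytes coincide with the first 256 Unicode code points; other code sets
would need a real mapping table (a sketch, reading stdin to stdout):
#include <cstdio>

int main()
{
    int c;
    while ((c = std::getchar()) != EOF) {
        if (c < 0x80)
            std::putchar(c);            // basic (ASCII) characters pass through
        else
            std::printf("\\u%04x", c);  // Latin-1 0xf8 becomes \u00f8
    }
    return 0;
}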
|> But then, this is still completely independent of the encoding
|> of the characters *in* the program. If I cross-compiled a program
|> for an EBCDIC system on an ASCII system, and I used the string
|> literal "abc", I'd certainly expect the compiler _not_ to copy the
|> character codes literally from the (ASCII) source file into the
|> (to be executed in an EBCDIC environment) object code, but to do a
|> translation from ASCII to EBCDIC. For sure, if you read the
|> binary into Emacs on your ASCII system, you'd not see your "abc"
|> at that place, unless you happen to have an EBCDIC encoding in
|> Emacs.
Agreed.
--
James Kanze mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
Author: James Kanze <kanze@gabi-soft.de>
Date: Tue, 9 Apr 2002 16:03:22 GMT
whatiscpp@yahoo.com (John the newbie) writes:
|> I have just learned that header-names are different tokens from
|> string-literals. What is the rationale for this decision?
Header names have to conform to implementation defined specifications.
String literals can be any string.
--
James Kanze mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
Author: Pete Becker <petebecker@acm.org>
Date: Tue, 9 Apr 2002 19:24:48 GMT
Daniel Miller wrote:
>
> (Of course this prompts the question "Why aren't wide-characters
> permitted in filenames for operating systems which can accept
> wide-character filesystem names, such as the Unicode support present
> in multiple modern operating systems?", which I cannot answer.)
If the Japanese were to ask for such a feature it would have a much
higher likelihood of being added. They're the experts in this area, and
they're not asking.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Author: Pete Becker <petebecker@acm.org>
Date: Tue, 9 Apr 2002 19:25:17 GMT
Ron Natalie wrote:
>
> There is a tacit assumption that there is
> always a multibyte (using char) conversion that you can move any
> wchar_t based string to.
I hope not. <g> wcstombs is allowed to fail, so there may not be such a
conversion. And the result can depend on the current locale, so there
may be more than one such conversion. The issue, both for C and for C++,
is what the required behavior should be when someone attempts to open a
file using a wide character name. Many OS's do not support anything
larger than 8-bit characters in file names; even on those that do,
wchar_t isn't required to be the same size as the larger character type.
As an application writer, what heuristics do you use to produce file
names that are reasonably likely to be portable? 6.1, no names that
differ in case only, letters and numbers only, ...
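To illustrate both points -- outright failure and locale dependence --
here is a sketch (whether the conversion actually fails in the "C"
locale is itself implementation-specific, and the file name is
hypothetical):
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    const wchar_t wname[] = L"r\u00e9sum\u00e9.txt";  // hypothetical name
    char narrow[64];

    std::setlocale(LC_ALL, "C");   // é typically has no narrow form here
    if (std::wcstombs(narrow, wname, sizeof narrow) == (std::size_t)-1)
        std::puts("no multibyte conversion in this locale");

    std::setlocale(LC_ALL, "");    // the user's locale may say otherwise
    if (std::wcstombs(narrow, wname, sizeof narrow) != (std::size_t)-1)
        std::printf("converted: %s\n", narrow);
    return 0;
}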
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Tue, 9 Apr 2002 22:42:58 GMT
With reference to this distinct tokenization, I think there's actually
a serious defect in the C++ standard; I pointed it out on this group
almost a year ago, but nobody seemed to take it seriously. At the
moment I really don't have the time to make a clear description but
the main issue is
where is it stated that header-name tokens are formed only in the
context of #include directives?
2.8p1 says header name preprocessing tokens shall only appear in
#include directives, but it seems to me this doesn't impose a lexing
constraint: it simply states that the program would be otherwise
ill-formed. Anyhow, even interpreting it as "prep-tokens are formed
only within #includes", how can the example of 16.2p8 be legal???
Genny.
Author: James Kanze <kanze@gabi-soft.de>
Date: Wed, 10 Apr 2002 15:00:41 GMT
Gennaro Prota <gennaro_prota@yahoo.com> writes:
|> With reference to this distinct tokenization, I think there's
|> actually a serious defect in the C++ standard; I pointed it out on
|> this group almost a year ago, but nobody seemed to take it
|> seriously. At the moment I really don't have the time to make a
|> clear description but the main issue is
|> where is it stated that header-name tokens are formed only in the
|> context of #include directives?
It's not stated explicitly, I suspect. However, how the input stream is
broken up into tokens is defined, and something like <stdio.h> is five
tokens anywhere except after a #include.
The real problem, if there is one, is when the form #include SYMBOL is
used, where SYMBOL is a preprocessing token corresponding to a macro. I
generally use the # and the ## operators to end up with something like
"someDir/someFile.h". This has worked on every compiler I've tried it
on, but I'm not sure it is supposed to. In particular, the definition
of # says that the result is a string literal; despite appearances, the
sequence I need is NOT a string literal, but a q-character-sequence
between " delimiters. Interpreted strictly, after an #include,
"stdio.h" is three tokens, a '"' delimiter, a q-char-sequence and a '"'
delimiter. (I think that the intent is for what I am doing to be legal;
if not, I'm not sure how you are supposed to use the third form of the
#include directive.)
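Concretely, the technique looks something like this (the macro names
are mine, not from any library; whether the stringized result formally
qualifies as a header-name is exactly the doubt raised above):
// Build "someDir/someFile.h" at preprocessing time and include it.
#define STRINGIZE_(x) #x
#define STRINGIZE(x)  STRINGIZE_(x)    // extra level so MY_DIR expands first
#define MY_DIR someDir
#include STRINGIZE(MY_DIR/someFile.h)  // becomes #include "someDir/someFile.h"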
--
James Kanze mailto:kanze@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
Author: James Dennett <jdennett@acm.org>
Date: Wed, 10 Apr 2002 17:30:10 GMT
James Kanze wrote:
> Gennaro Prota <gennaro_prota@yahoo.com> writes:
>
> |> With reference to this distinct tokenization, I think there's
> |> actually a serious defect in the C++ standard; I pointed it out on
> |> this group almost a year ago, but nobody seemed to take it
> |> seriously. At the moment I really don't have the time to make a
> |> clear description but the main issue is
>
> |> where is it stated that header-name tokens are formed only in the
> |> context of #include directives?
>
> It's not stated explicitly, I suspect. However, how the input stream is
> broken up into tokens is defined, and something like <stdio.h> is five
> tokens anywhere except after a #include.
>
> The real problem, if there is one, is when the form #include SYMBOL is
> used, where SYMBOL is a preprocessing token corresponding to a macro. I
> generally use the # and the ## operators to end up with something like
> "someDir/someFile.h". This has worked on every compiler I've tried it
> on, but I'm not sure it is supposed to. In particular, the definition
> of # says that the result is a string literal; despite appearances, the
> sequence I need is NOT a string literal, but a q-character-sequence
> between " delimiters. Interpreted strictly, after an #include,
> "stdio.h" is three tokens, a '"' delimiter, a q-char-sequence and a '"'
> delimiter. (I think that the intent is for what I am doing to be legal;
> if not, I'm not sure how you are supposed to use the third form of the
> #include directive.)
It's well defined. The standard says that when handling a #include
directive, you first check if it's in the <name> or "name" form.
If it is not, you do normal tokenization, then macro replacement,
then stringify the result (where handling of whitespace is implementation
defined IIRC). After that it must be in one of the other two forms,
or the program is ill-formed.
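In other words, directives like the following are well defined (the
header names here are hypothetical; each directive is re-examined
after macro replacement, essentially the example of 16.2p8):
#define HDR_ANGLE <cstdio>      // expands to the <...> form
#define HDR_QUOTE "myheader.h"  // expands to the "..." form
#include HDR_ANGLE
#include HDR_QUOTE              // matching neither form would be ill-formed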
--
James Dennett <jdennett@acm.org>
Author: Gennaro Prota <gennaro_prota@yahoo.com>
Date: Wed, 10 Apr 2002 22:53:13 GMT
On Wed, 10 Apr 2002 15:00:41 GMT, James Kanze <kanze@gabi-soft.de>
wrote:
>It's not stated explicitly, I suspect. However, how the input stream is
>broken up into tokens is defined, and something like <stdio.h> is five
>tokens anywhere except after a #include.
This is of course the intent, but it's not spelled out in the obvious
place where it should be, i.e. 2.8p1 which simply states the so-called
maximal munch rule: "If the input stream has been parsed into
preprocessing tokens up to a given character, the next preprocessing
token is the longest sequence of characters that could constitute a
preprocessing token, even if that would cause further lexical analysis
to fail".
Taken verbatim, this would mean that
if (a<3 and b>5)
yields a header-name token (the sequence <3 and b> would lex as one).
>The real problem, if there is one, is when the form #include SYMBOL is
>used, where SYMBOL is a preprocessing token corresponding to a macro. I
>generally use the # and the ## operators to end up with something like
>"someDir/someFile.h". This has worked on every compiler I've tried it
>on, but I'm not sure it is supposed to. In particular, the definition
>of # says that the result is a string literal; despite appearances, the
>sequence I need is NOT a string literal, but a q-character-sequence
>between " delimiters.
This is another problem (*the* problem for the quoted form of
header-name tokens): the only way I can make something like
#define NAME "file.h"
#include NAME
work is to "ignore" (for the purpose of, and at the moment of,
#include execution) the fact that what follows the #include itself is
a string-literal, and to consider it in its pre-tokenized nature as a
character sequence. This, again, seems the intent to me (see also the
note to 16.2p4), but as far as I know it's not written anywhere :(
Genny.
Author: whatiscpp@yahoo.com (John the newbie)
Date: Tue, 9 Apr 2002 00:15:57 GMT
Hi everybody,
I have just learned that header-names are different tokens from string-literals.
What is the rationale for this decision?
Author: "James Kuyper Jr." <kuyper@wizard.net>
Date: Tue, 9 Apr 2002 14:39:38 GMT
John the newbie wrote:
>
> Hi everybody,
>
> I have just learned that header-names are different tokens from string-literals.
> What is the rationale for this decision?
Header names must generally meet an operating system's requirements for
file names, and on systems that support something comparable to Unix
directories, they generally support header names containing whatever
syntax is natural on that system for indicating a directory. A given
operating system may have syntax for directory/file names that is
inconvenient to write as a C string literal. A well-known example,
already well established by the time the first C standard was
approved, is PC/DOS, and it's plausible that it was one of the
motivating examples for this decision. \ is used as a directory
delimiter in DOS. If header names were required to be parsed as string
literals, then
#include "\tab\return\newline\000octal\formfeed.h"
/* 123 456789 0123456 789012 34567890123 */
would almost certainly not do what was intended. As indicated by the
number bar below it, that string literal is only 33 bytes long, contains
not a single backslash character, violates the 8-character limit on
filenames that was part of DOS at the time the first standard was
approved, and constitutes two separate null-terminated strings. It would
have to be rewritten as:
#include "\\tab\\return\\newline\\000octal\\formfeed.h"
in order to work properly, which is not exactly convenient. Many (most?)
implementations of C on DOS machines recognise / as a synonym for \ in
header names, but that's hardly a natural feature to use for a
programmer who works exclusively on DOS machines, and I suspect that
most of them are unaware of it.
This isn't just a DOS issue; I believe that other operating systems have
similar problems; however, DOS is the only one where I know the details.
Author: Francis Glassborow <francis.glassborow@ntlworld.com>
Date: Tue, 9 Apr 2002 15:04:28 GMT
In article <102a8848.0204061426.1fd20a11@posting.google.com>, John the
newbie <whatiscpp@yahoo.com> writes
>Hi everybody,
>
>I have just learned that header-names are different tokens from string-literals.
>What is the rationale for this decision?
Well, string literals have to exist in some form within your program.
Header names are entirely for the benefit of the compiler and can be
mapped any way the implementation chooses. Standard headers do not
have to have files corresponding to them, and user-written headers do
not have to correspond to identically named files. That is not just
theory. I know of
a platform where
#include "myheader.h"
results in a file 'myheader' being read in from a subdirectory h.
--
Francis Glassborow ACCU
64 Southfield Rd
Oxford OX4 1PA +44(0)1865 246490
All opinions are mine and do not represent those of any organisation
Author: Daniel Miller <daniel.miller@tellabs.com>
Date: Tue, 9 Apr 2002 15:08:27 GMT
John the newbie wrote:
> Hi everybody,
>
> I have just learned that header-names are different tokens from string-literals.
> What is the rationale for this decision?
<stdio.h> and "stdio.h" are permitted to be a header-name in #include.
"stdio.h" and L"stdio.h" are permitted to be string-literals.
Although the "quoted" form appears in both, the <angle-bracket> form appears
only in header-name and the L"wide-character" form appears only in
string-literal. Thus header-name & string-literal are quite different
syntactically.
(Of course this prompts the question "Why aren't wide-characters permitted
in filenames for operating systems which can accept wide-character
filesystem names, such as the Unicode support present in multiple modern
operating systems?", which I cannot answer. But even this hypothetical
expansion of header-name to also permit L<chineseWord.h> and
L"japaneseWord.h" would still leave the <angle-bracket> difference between
header-name and string-literal.)
Author: Ron Natalie <ron@sensor.com>
Date: Tue, 9 Apr 2002 15:46:21 GMT
The problem is that C and C++ have a split personality when it comes to
string internationalization. There is a tacit assumption that there is
always a multibyte (using char) conversion that you can move any
wchar_t based string to. Filenames, command line arguments, and various
other "system" interfaces can only take char*'s while C++ iostreams and
strings don't really properly handle multibyte strings (at least stdio
streams try to some extent).
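For instance, even the wide-character file streams in the current
standard take only narrow names (a minimal sketch):
#include <fstream>

int main()
{
    std::wofstream out;
    out.open("name.txt");   // const char* only: no open(const wchar_t*)
    out << L"wide data\n";  // wide characters inside, narrow name outside
    return 0;
}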