Topic: troubled by std::wifstream::open(const char*)
Author: oldwolf@inspire.net.nz (Old Wolf)
Date: Mon, 15 Dec 2003 23:03:21 +0000 (UTC)
> My advice: If you need to write internationalized applications, avoid
> iostreams. They just cause trouble.
There seem to be two stages of conversion problems in the case of
ifstream::open(), taking a given character width and encoding [XX]:
1) converting from where the filename was input/stored, to [XX];
2) converting from [XX] to the native API format.
It would avoid a lot of conversion issues if ifstream::open() took
the filename in native API format.
Some iostreams vendors provide ifstream::open(int fd)
(and filebuf::open(), and fstream constructors etc) which
associates a filebuf/stream with an already-open C runtime handle.
So you can store your filenames in native OS format and open
them with the native OS open function, avoiding both conversion
issues totally. The ANSI C function _open_osfhandle() can be used
to convert a native OS handle to a C runtime handle.
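For concreteness, here is a rough sketch of that technique. It assumes (a) the
Win32 CreateFileW API, (b) the Microsoft CRT function _open_osfhandle() --
which, as the follow-up below points out, is a vendor extension rather than
ANSI C -- and (c) a non-standard vendor constructor ifstream(int fd) that
attaches a stream to an already-open C runtime descriptor. None of this is
guaranteed by the standard.

    #include <windows.h>   // CreateFileW, HANDLE
    #include <io.h>        // _open_osfhandle (Microsoft CRT extension)
    #include <fcntl.h>     // _O_RDONLY
    #include <cstdint>     // std::intptr_t
    #include <fstream>

    void read_wide_named_file(const wchar_t* name)
    {
        // Open with the native, wide-character OS call.
        HANDLE h = CreateFileW(name, GENERIC_READ, FILE_SHARE_READ, 0,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        if (h == INVALID_HANDLE_VALUE)
            return;

        // Wrap the OS handle in a C runtime file descriptor.
        int fd = _open_osfhandle(reinterpret_cast<std::intptr_t>(h), _O_RDONLY);

        // Attach a stream to the descriptor -- vendor extension, not standard.
        std::ifstream in(fd);
        // ... read from 'in' as usual; destroying it closes fd (and thus h).
    }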
Not only does this mean you can use iostreams with internationalized
filenames, you can also use them with sockets, pipes, comm ports, and
other devices, with OS-specific open() parameters.
At first this might seem like a loss of portability, since you are
including OS-specific code, but if you are going to do file I/O
without the Standard library then you are being non-portable anyway.
Is there a good reason why this form of ifstream::open() etc. could
not be added to the Standard for iostreams?
Author: kuyper@wizard.net (James Kuyper)
Date: Tue, 16 Dec 2003 18:33:37 +0000 (UTC)
oldwolf@inspire.net.nz (Old Wolf) wrote in message news:<843a4f78.0312141932.5c4b6a4d@posting.google.com>...
..
> issues totally. The ANSI C function _open_osfhandle() can be used
I'm not sure what you're trying to say here. The function
_open_osfhandle() most definitely is not part of the ANSI C standard
library.
Author: Chuck McDevitt <Chuck_McDevitt@c-o-m-c-a-s-t.net>
Date: Sun, 7 Dec 2003 16:27:46 +0000 (UTC)
"Seagull Manager" <seagull.manager@nospamthanksbecauseisayso.demon.co.uk>
wrote in message news:bqfmlt$64r$1$830fa79d@news.demon.co.uk...
>
> Maybe this is something everyone knows about already, and considers
> completely boring, but it's been bugging me, so here goes:
>
> The standard has it that fstreams, even wide fstreams, can only be opened
> using 8-bit byte strings for filenames. If a filename contains, say,
> non-western letters in the name, it may be impossible to open it using an
> fstream without some kind of shim, e.g.:
>
I agree 100% that this is wrong. The C++ language supports wchar_t. In
standard iostreams, if I write wchar_t strings to an iostream, there is a
codecvt that can handle converting my wchar_t strings to char strings.
Given that, why can't the same apply to file names?
It would be easy enough to allow for wchar_t string names for files and
paths, with the caveat that the conversion to the underlying file system's
names and paths is OS dependent.
This would be far more portable than how it is today. Today, if you have a
Unicode wchar_t string containing a file name, you need to code into your
application the knowledge of how to convert that to the OS's underlying
format.
For example, if I have a Japanese file name, I need to know, based on my OS:
1) if Japanese Windows, convert to Windows CP 932 (the OS will convert
   back from CP 932 to Unicode);
2) if Japanese Unix, convert to EUC;
3) if Unicode Unix/Linux, convert to UTF-8;
4) if an IBM mainframe, convert to something totally different;
5) if non-Japanese Windows, avoid iostreams and open the file directly
   from the Unicode name, because there is no char * string that I can
   convert the Japanese name to that will work, even though the
   underlying filesystem can handle the name just fine.
Ugh. (A minimal sketch of one such conversion shim follows.)
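A minimal sketch of the kind of shim this forces, for the Unix-style cases
above: convert the wide name to the current locale's narrow multibyte encoding
with wcstombs() and hope that matches what the filesystem expects. The helper
name is made up for illustration; when a character has no representation in
the locale's encoding, the conversion simply fails (case 5).

    #include <clocale>   // setlocale
    #include <cstdlib>   // wcstombs, MB_CUR_MAX
    #include <fstream>
    #include <string>

    bool open_wide_name(std::ifstream& in, const std::wstring& wname)
    {
        std::setlocale(LC_ALL, "");                    // pick up the user's locale
        std::string narrow(wname.size() * MB_CUR_MAX + 1, '\0');
        std::size_t n = std::wcstombs(&narrow[0], wname.c_str(), narrow.size());
        if (n == static_cast<std::size_t>(-1))
            return false;                              // unconvertible character
        narrow.resize(n);
        in.open(narrow.c_str());                       // narrow name, locale encoding
        return in.is_open();
    }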
My advice: If you need to write internationalized applications, avoid
iostreams. They just cause trouble.
Author: James Kanze <kanze@alex.gabi-soft.fr>
Date: Sun, 7 Dec 2003 16:27:46 +0000 (UTC)
Eugene Gershnik <gershnik@nospam.hotmail.com> writes:
|> Jonathan Turkanis wrote:
|> >> "Seagull Manager"
|> >> <seagull.manager@nospamthanksbecauseisayso.demon.co.uk> wrote in
|> >> message
|> >> The standard has it that fstreams, even wide fstreams, can only
|> >> be opened using 8-bit byte strings for filenames. If a filename
|> >> contains, say, non-western letters in the name, it may be
|> >> impossible to open it using an fstream without some kind of
|> >> shim, e.g.:
|> > For the main question, the Boost.Filesystem FAQ gives a good
|> > answer. See: http://www.boost.org/libs/filesystem/doc/faq.htm (Why
|> > aren't wide-character names supported?)
|> Hmmm. On my OS the native character type is wchar_t. Conversions to
|> char could be ambiguous or sometimes impossible (for example when
|> the file comes from another system and its name contains characters
|> from a language for which mine doesn't have a conversion table
|> installed). So on my system the standard fstream simply isn't usable
|> not to speak about portable :-(
This sounds like an impedance mismatch between your compiler and your
system.
C++ considers that the system has a single character type. It makes no
assumptions about this type, except that it is a single type. Any time
information is passed to or from the system, this is the type that is
used. Thus, filebuf requires that the name of the file have this type,
and that all reads and writes to and from the file are done in this
type. This type is called "char". And I repeat, not only must all
filenames be char[], all transfers to and from the system take place as
transfers of char -- a wfstream translates multi-byte sequences into
wide characters.
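A small illustration of that model, assuming the user's environment locale
("") describes the file's multibyte encoding: the filename is always a char*,
the file is read as a sequence of char, and the wifstream's
codecvt<wchar_t, char, mbstate_t> facet widens the bytes into wchar_t on input.

    #include <fstream>
    #include <locale>
    #include <string>

    void read_as_wide(const char* narrow_name)    // the filename is char*, always
    {
        std::wifstream in;
        in.imbue(std::locale(""));                // codecvt: multibyte bytes -> wchar_t
        in.open(narrow_name);

        std::wstring line;
        while (std::getline(in, line)) {
            // 'line' holds wide characters decoded from the narrow-char file
        }
    }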
By intention, a wide character stream will not use a multibyte encoding.
If there is no possibility of multibyte encodings in a narrow character
stream, there is no difference between the two, and there is no reason
to use narrow character streams. (This is not the case for any OS that
I know, however: most modern OS's use 8 bit characters, with multi-byte
encodings, usually UTF-8, being used for international characters. The
one exception is Windows, which uses 16 bit characters with the
multibyte encoding UTF-16.)
So far, so good. About the only problem I see is that C and C++ require
that the narrow character type be the smallest addressable type
supported by the implementation. Which does cause a problem on Windows;
for many reasons, it is desirable to have 8 bit addressable types in C
and C++, even though the logical implementation would be to use 16 bit
narrow characters and 32 bit wide characters.
Regretfully, the only solution I can propose is to use the 8 bit
interface to Windows (which I think uses UTF-8, which is about as
portable as you are going to get for international characters).
--
James Kanze mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet /
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France +33 1 41 89 80 93
Author: Seagull Manager <seagull.manager@nospamthanksbecauseisayso.demon.co.uk>
Date: Mon, 8 Dec 2003 22:54:09 +0000 (UTC)
"Eugene Gershnik" <gershnik@nospam.hotmail.com> wrote in message
news:xeidnSurMr4nulOiRTvUqQ@speakeasy.net...
>
>
> Hmmm. On my OS the native character type is wchar_t.
Absolutely (although to be pedantic, it is what your compiler vendor
conveniently defines as a wchar_t).
> Conversions to char could be ambiguous or sometimes impossible (for
> example when the file comes from another system and its name contains
> characters from a language for which mine doesn't have a conversion
> table installed). So on my system the standard fstream simply isn't
> usable not to speak about portable :-(
Well, if the file has arrived from another file system, and its name has
somehow reached you without being correctly decoded into a wide character
representation that your code understands, then you've probably got a
problem regardless of what programming language or library you're using. But
that, of course, is a problem with all networked data, not just file names.
But you are quite right, by one simple omission in the standard, fstream has
been rendered next to useless to anyone who wants to write robust code on
systems that support wide character file names.
> Eugene
void Bruce_Attah(L" $B%U%!%$%%!&%M!<%` (B");
Author: Eugene Gershnik <gershnik@nospam.hotmail.com>
Date: Mon, 8 Dec 2003 22:54:10 +0000 (UTC)
James Kanze wrote:
> Eugene Gershnik <gershnik@nospam.hotmail.com> writes:
>>> Hmmm. On my OS the native character type is wchar_t. Conversions
>>> to char could be ambiguous or sometimes impossible (for example
>>> when the file comes from another system and its name contains
>>> characters from a language for which mine doesn't have a
>>> conversion table installed). So on my system the standard
>>> fstream simply isn't usable not to speak about portable :-(
>
> This sounds like an impedance mismatch between your compiler and
> your system.
>
> C++ considers that the system has a single character type. It
> makes no assumptions about this type, except that it is a single
> type. Any time information is passed to or from the system, this
> is the type that is used. Thus, filebuf requires that the name of
> the file have this type, and that all reads and writes to and from
> the file are done in this type. This type is called "char". And I
> repeat, not only must all filenames be char[], all transfers to and
> from the system take place as transfers of char -- a wfstream
> translates multi-byte sequences into wide characters.
I see the logic of this but unfortunately this makes abstract C++
incompatible with the existing OS interface of my system. The system is
Windows, as you probably guessed. The basic OS interface on this system has 2
distinct character types, mapped to char and wchar_t by all C++
implementations. For example the API used to open a file accepts wchar_t*
while an API to obtain the host IP from its name accepts char*. There is a
compatibility layer that allows me to open a file by passing it a char*, thus
allowing an application to live in a single character type world.
Unfortunately usage of this layer makes I18N quite complicated.
>
> By intention, a wide character stream will not use a multibyte
> encoding. If there is no possibility of multibyte encodings in a
> narrow character stream, there is no difference between the two,
> and there is no reason to use narrow character streams. (This is
> not the case for any OS that I know, however: most modern OS's use
> 8 bit characters, with multi-byte encodings, usually UTF-8, being
> used for international characters. The one exception is Windows,
> which uses 16 bit characters with the multibyte encoding UTF-16.)
It is fixed width UCS-16 I think.
>
> So far, so good. About the only problem I see is that C and C++
> require that the narrow character type be the smallest addressable
> type supported by the implementation. Which does cause a problem
> on Windows; for many reasons, it is desirable to have 8 bit
> addressable types in C and C++, even though the logical
> implementation would be to use 16 bit narrow characters and 32 bit
> wide characters.
>
> Regretfully, the only solution I can propose is to use the 8 bit
> interface to Windows (which I think uses UTF-8, which is about as
> portable as you are going to get for international characters).
The problem is it isn't UTF-8. When using the 8-bit interface to Windows the
character strings are encoded in a language dependent manner. Each machine
contains a collection of conversion tables from each of the supported
languages' encodings to and from UCS-16. When you invoke an 8-bit wrapper over
a UCS-16 function one of these tables is used to convert the strings. Which
table is going to be used is determined by a mechanism similar to C locales.
The net result of this is huge problems with I18N portability. There are ways
to overcome the problems but they require heroic efforts from the application.
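A sketch of what that 8-bit wrapper layer effectively does: convert with the
system default codepage (CP_ACP in Win32 terms). The lpUsedDefaultChar output
is the tell-tale for a lossy conversion, i.e. a character in the string that
the codepage cannot represent. (The helper name is invented for illustration.)

    #include <windows.h>
    #include <string>

    std::string to_system_codepage(const std::wstring& wide, bool& lossy)
    {
        BOOL used_default = FALSE;
        int len = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                      NULL, 0, NULL, NULL);      // size query
        if (len == 0) {
            lossy = true;
            return std::string();
        }
        std::string narrow(len, '\0');
        WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                            &narrow[0], len, NULL, &used_default);
        lossy = (used_default != FALSE);
        narrow.resize(len - 1);             // drop the terminating '\0'
        return narrow;
    }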
Using wide-character APIs solves most of these problems and it fits quite
easily into C++. I can write the whole application using wchar_t only if I
don't use any of the few APIs that require chars (only some socket functions
do this). The perpetual problems are fstream::open, exception::what() and the
ctors of the <stdexcept> classes. I can see how the standard library on my
machine could provide extensions to solve most of the problems. After all we
have _wfopen(const wchar_t *filename, const wchar_t *mode) in the same
library. So from my POV this is entirely a library QoI issue.
The thing I don't agree with is the argument that having only
fstream::open(char *) somehow makes the library and applications that are
using it more portable.
Eugene
Author: Seagull Manager <seagull.manager@nospamthanksbecauseisayso.demon.co.uk>
Date: Mon, 8 Dec 2003 22:54:10 +0000 (UTC)
"Hyman Rosen" <hyrosen@mail.com> wrote in message
news:1070390934.850586@master.nyc.kbcfp.com...
>
> There's nothing to stop a vendor from offering an extension
> which allows open on wide strings.
That just means that (a) few vendors will offer such an extension
(especially as OS vendors who also sell compilers may have an incentive to
encourage their customers to write non-portable code, which is most
effectively done by steering programmers to proprietary libraries and away
from standard ones - a weakness in the standard is a boon to vendors who
choose such a strategy), so that (b) if I can get hold of an implementation
that does offer such an extension, and I use it, my code will probably no
longer be portable between compilers, and (c) the implementation will be
underused, and therefore perhaps undertested.
Yet opening files is fundamental, and all common operating systems (even OSs
for mobile devices) support Unicode (or at least the 16-bit subset of
Unicode that is UCS-2) file names. Furthermore, many users around the world
routinely save their files using non-8-bit file names. Therefore,
std::fstream is not fully functional on any common operating system. It is
possible that on some implementations of std::fstream, there is no reliable
workaround available. In that case, if one ships a product that relies on
std::fstream to open files, it is pretty much guaranteed that one will at
some stage receive a support call saying "Someone sent me this important
file, and I can't open it with your app", and fixing the problem will
potentially be hugely expensive (perhaps involving the rewriting of all or
most code that uses iostreams).
It would have been a relatively minor matter for the standard to define the
spec of a function that opens files using wchar_t, but it failed to do so -
I would argue that myopia or parochialism must have had something to do
with this. It seems to me that there should be an update of the C++ standard
to fix issues like this as soon as possible, then at a later date more
ambitious amendments can be made. Without such measures, I suspect that C++
will very likely be marginalized over the next few years.
wchar_t Bruce_Attah(char);
Author: alfps@start.no (Alf P. Steinbach)
Date: Tue, 9 Dec 2003 15:52:09 +0000 (UTC)
On Mon, 8 Dec 2003 22:54:10 +0000 (UTC), Seagull Manager <seagull.manager@nospamthanksbecauseisayso.demon.co.uk> wrote:
>Therefore,
>std::fstream is not fully functional on any common operating system.
<rant>
Well, that's not just a character representation issue, but also one of performance,
and of simplicity, and of general exception safety (most of all the safety of typical
_client code_), and much else, e.g. the related silliness where you can't write a
pure 'cat' program without using e.g. Posix functions. In short, the i/o part of the
library, including the locale support, messages, etc., sucks. It's useful for small
test programs and exercises. It's a crying shame, because originally that part wasn't
at all bad. But it's been abstracted to death, as if templatization could make it STL.
</rant>
Having concluded that, what is the remedy? Simple, one should use something else.
And perhaps someday someone will come up with the STL of file and stream handling.
I absolutely don't think it's a good idea to make the standard streams _apparently_
useful by tacking on the most desperately needed real-world functionality.
Author: kanze@gabi-soft.fr
Date: Tue, 9 Dec 2003 22:47:56 +0000 (UTC)
Eugene Gershnik <gershnik@nospam.hotmail.com> wrote in message
news:<X8SdnegW5ekpP06iRTvUrg@speakeasy.net>...
> James Kanze wrote:
> > Eugene Gershnik <gershnik@nospam.hotmail.com> writes:
> >>> Hmmm. On my OS the native character type is wchar_t. Conversions
> >>> to char could be ambiguous or sometimes impossible (for example
> >>> when the file comes from another system and its name contains
> >>> characters from a language for which mine doesn't have a
> >>> conversion table installed). So on my system the standard fstream
> >>> simply isn't usable not to speak about portable :-(
> > This sounds like an impedance mismatch between your compiler and
> > your system.
> > C++ considers that the system has a single character type. It makes
> > no assumptions about this type, except that it is a single type.
> > Any time information is passed to or from the system, this is the
> > type that is used. Thus, filebuf requires that the name of the file
> > have this type, and that all reads and writes to and from the file
> > are done in this type. This type is called "char". And I repeat,
> > not only must all filenames be char[], all transfers to and from the
> > system take place as transfers of char -- a wfstream translates
> > multi-byte sequences into wide characters.
> I see the logic of this but unfortunately this makes abstract C++
> incompatible with the existing OS interface of my system. The system
> is Windows as you probably guessed. The basic OS interface on this
> system has 2 distinct character types mapped to char and wchar_t by
> all C++ implementations. For example the API used to open a file
> accepts wchar_t* while an API to obtain the host IP from its name
> accepts char*. There is a compatibility layer that allows me to open
> a file by passing it a char* thus allowing an application to live in a
> single character type world. Unfortunately usage of this layer makes
> I18N quite complicated.
I rather suspected it was Windows:-). While the people at Microsoft
were certainly trying to do the right thing, I'm not sure that I'm happy
with the results. As you say, it is a fact of life that you need 8 bit
characters, since that is, for better or worse, what is used everywhere,
and in these days of connectivity, you have to be compatible with
everywhere. Thus, the interface to obtain the host IP uses char. The
attempt to use 16 bit characters for all internal interfaces is
laudable. But in practice, I have the feeling that it creates more
problems than it solves. To begin with, it requires that you, the
programmer, deal with two different character types. Forget the C++
problem for a moment, and consider that you have no problem using the
wide character interfaces in your application. You read a filename
from a file (as wide characters), and pass it to the API to open the
file. Excellent. You read a URL from a file (as wide characters), and
pass it to the API. Oops.
I'm not sure what the best solution should be, either from the system
point of view, or from the point of view of C++. What is happening,
however, best or not (and it doesn't seem that bad a priori) is that
everything external to the program is moving (albeit very slowly) to
UTF-8. So an 8 bit, converting interface, makes sense.
But none of this helps with your practical problem.
> > By intention, a wide character stream will not use a multibyte
> > encoding. If there is no possibility of multibyte encodings in a
> > narrow character stream, there is no difference between the two, and
> > there is no reason to use narrow character streams. (This is not
> > the case for any OS that I know, however: most modern OS's use 8 bit
> > characters, with multi-byte encodings, usually UTF-8, being used for
> > international characters. The one exception is Windows, which uses
> > 16 bit characters with the multibyte encoding UTF-16.)
> It is fixed width UCS-16 I think.
You cannot encode all of the world's characters in UCS-16. UCS-16
worked well through Unicode 3.0, but since then, you need UTF-something
if you want full Unicode support with less than 21 bits. I thought that
I had heard that Windows does support the surrogate pairs in Unicode
(which means implicitly multi-byte characters). Note that for Unicode
characters in the ranges U+0000..U+D7FF and U+E000..U+FFFF (which
includes all of the characters in Unicode 3.0), there is no difference
between the two.
In practice, Unicode was 16 bits through 3.0. Institutions which fixed
wide character width too early got caught. (Microsoft and IBM are the
big ones, and Java as a language.)
> > So far, so good. About the only problem I see is that C and C++
> > require that the narrow character type be the smallest addressable
> > type supported by the implementation. Which does cause a problem on
> > Windows; for many reasons, it is desirable to have 8 bit addressable
> > types in C and C++, even though the logical implementation would be
> > to use 16 bit narrow characters and 32 bit wide characters.
> > Regretfully, the only solution I can propose is to use the 8 bit
> > interface to Windows (which I think uses UTF-8, which is about as
> > portable as you are going to get for international characters).
> The problem is it isn't UTF-8. When using 8-bit interface to Windows
> the character strings are encoded in a language dependent manner.
> Each machine contains a collection of conversion tables from each of
> supported languages encodings to and from UCS-16. When you invoke an
> 8-bit wrapper over UCS-16 function one of these tables is used to
> convert the strings. Which table is going to be used is determined by
> a mechanism similar to C locales. The net result of this are huge
> problems with I18N portability.
Where are these locales decided? Can your program influence them? And
are there locales which use UTF-8?
Note that as soon as you leave the world of straight US ASCII, you have
the portability problems anyway. Locale names aren't standardized, the
presence of any useful locales isn't guaranteed, the interaction between
the C++ locales and the C locale isn't well specified and there are
still a lot of compilers out there which don't have a working version of
<locale> (nor wistream or wostream, for that matter).
> There are ways to overcome the problems but they require heroic
> efforts from the application. Using wide-character APIs solves most
> of these problems and it fits quite easily into C++.
One of the reasons why it isn't in C++ is that most OS's don't have a
wide character interface. And practically, most never will; as I said,
the current trend is toward using UTF-8 and sticking with 8 bits.
In the mean time, it would be perfectly legal for a library
implementation to treat the 8 bit char's as UTF-8, convert them
systematically into UTF-16 and use the wide character interface. I
don't know if any library actually does this, however, and there may be
various historical problems which mean that it isn't possible.
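For what it's worth, here is a sketch of what such an implementation's
filebuf::open(const char*) could do internally on Windows: treat the incoming
narrow name as UTF-8, widen it with MultiByteToWideChar(CP_UTF8, ...), and
open through the CRT's wide-name extension _wfopen(). This is purely
illustrative; no shipping library is claimed to work this way.

    #include <windows.h>   // MultiByteToWideChar, CP_UTF8
    #include <cstdio>      // FILE; _wfopen is a Microsoft CRT extension
    #include <string>

    FILE* open_utf8_name(const char* utf8_name)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, NULL, 0);
        if (len == 0)
            return NULL;                        // not valid UTF-8
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, &wide[0], len);
        return _wfopen(wide.c_str(), L"rb");    // open by wide name
    }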
> I can write the whole application using wchar_t only if I don't use
> any of the few APIs that require chars (only some socket functions do
> this). The perpetual problems are fstream::open, exception::what() and
> ctors of <stdexcept> classes.
In the case of the exceptions, there are very strong reasons why the
standard library cannot offer two versions. And since char is still
more widespread than wchar_t...
> I can see how the standard library on my machine could provide
> extensions to solve most of the problems. After all we have
> _wfopen(const wchar_t *filename, const wchar_t *mode) in the same
> library. So from my POV this is entirely library QoI issue. The thing
> I don't agree with is the argument that having only
> fstream::open(char*) somehow makes the library and applications that
> are using it more portable.
Nothing using anything other than US ASCII is anywhere near portable.
Simply because different systems have different ways of handling the
differences. Offering an fstream::open(wchar_t*) doesn't help
portability if the OS's on which one is building don't support it.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet / http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
Author: Eugene Gershnik <gershnik@nospam.hotmail.com>
Date: Wed, 10 Dec 2003 21:13:55 +0000 (UTC)
kanze@gabi-soft.fr wrote:
> Eugene Gershnik <gershnik@nospam.hotmail.com> wrote in message
> news:<X8SdnegW5ekpP06iRTvUrg@speakeasy.net>...
>> James Kanze wrote:
>>> C++ considers that the system has a single character type. It
>>> makes no assumptions about this type, except that it is a single
>>> type. Any time information is passed to or from the system, this
>>> is the type that is used. Thus, filebuf requires that the name
>>> of the file have this type, and that all reads and writes to and
>>> from the file are done in this type. This type is called "char".
>>> And I repeat, not only must all filenames be char[], all
>>> transfers to and from the system take place as transfers of char
>>> -- a wfstream translates multi-byte sequences into wide
>>> characters.
>
>> I see the logic of this but unfortunately this makes abstract C++
>> incompatible with the existing OS interface of my system. The
>> system is Windows as you probably guessed. The basic OS interface
>> on this system has 2 distinct character types mapped to char and
>> wchar_t by all C++ implementations. For example the API used to
>> open a file accepts wchar_t* while an API to obtain the host IP
>> from its name accepts char*. There is a compatibility layer that
>> allows me to open a file by passing it a char* thus allowing an
>> application to live in a single character type world.
>> Unfortunately usage of this layer makes I18N quite complicated.
>
> I rather suspected it was Windows:-). While the people at Microsoft
> were certainly trying to do the right thing, I'm not sure that I'm
> happy with the results. As you say, it is a fact of life that you
> need 8 bit characters, since that is, for better or worse, what is
> used everywhere, and in these days of connectivity, you have to be
> compatible with everywhere. Thus, the interface to obtain the host
> IP uses char.
Actually the Winsock char * functions are a relic of backward compatibility
with Berkeley sockets. The newer non-backward compatible interfaces are
using wchar_t * and I fully expect that eventually it will be possible to
have all the Windows application code dealing with Unicode only.
> The attempt to use 16 bit characters for all
> internal interfaces is laudable. But in practice, I have the
> feeling that it creates more problems that it solves. To begin
> with, it requires that you, the programmer, deal with two different
> character types. Forget the C++ problem for a moment, and consider
> that you have no problem using the wide character interfaces in
> your application. You read a filename from a file (as wide
> characters), and pass it to the API to open the file. Excellent.
> You read a URL from a file (as wide characters), and pass it to the
> API. Oops.
True. But again I believe this is temporary. As time goes on there are fewer
and fewer places where one must use 8-bit chars.
>
> I'm not sure what the best solution should be, either from the
> system point of view, or from the point of view of C++. What is
> happening, however, best or not (and it doesn't seem that bad a
> priori) is that everything external to the program is moving
> (albeit very slowly) to UTF-8. So an 8 bit, converting interface,
> makes sense.
For the external data I see this trend too but not for internal APIs. Win32,
Java and .NET are pretty much settled on fixed width Unicode
representations. The only major system that still uses chars is Unix. It
makes perfect sense to me. Why would an OS open_file() use UTF-8? Assuming
the internal data is already in fixed-width encoding why incur another two
translations?
>>> The one exception is
>>> Windows, which uses 16 bit characters with the multibyte encoding
>>> UTF-16.)
>
>> It is fixed width UCS-16 I think.
>
> You cannot encode all of the world's characters in UCS-16. UCS-16
> worked well through Unicode 3.0, but since then, you need
> UTF-something if you want full Unicode support with less than 21
> bits. I thought that I had heard that Windows does support the
> surrogate pairs in Unicode (which means implicitly multi-byte
> characters). Note that for Unicode characters in the ranges
> U+0000..U+D7FF and U+E000..U+FFFF (which includes all of the
> characters in Unicode 3.0), there is no difference between the two.
Win32 has some support for surrogates but it is quite limited and varies
with OS flavor. Though surrogates are a problem from a theoretical point of
view I believe most programs that don't engage heavily in text manipulations
can get away with "each character is 16 bit" approach. This is of course
unfortunate but still better than the situation with JIS or EUC where you
cannot do anything without thinking about multi-byte issues.
> In practice, Unicode was 16 bits through 3.0. Institutions which
> fixed wide character width too early got caught. (Microsoft and
> IBM are the big ones, and Java as a language.)
I am not sure the change wasn't a big mistake on the part of the Unicode group.
One of the big "selling points" of Unicode was that you don't have to deal
with lead byte/trail byte issues anymore. Do you know if anyone actually
uses the extended characters?
>>> Regretfully, the only solution I can propose is to use the 8 bit
>>> interface to Windows (which I think uses UTF-8, which is about as
>>> portable as you are going to get for international characters).
>
>> The problem is it isn't UTF-8. When using 8-bit interface to
>> Windows the character strings are encoded in a language dependent
>> manner. Each machine contains a collection of conversion tables
>> from each of supported languages encodings to and from UCS-16.
>> When you invoke an 8-bit wrapper over UCS-16 function one of these
>> tables is used to convert the strings. Which table is going to be
>> used is determined by a mechanism similar to C locales. The net
>> result of this are huge problems with I18N portability.
>
> Where are these locales decided? Can your program influence them?
> And are there locales which use UTF-8?
Short answers: in the context of filenames, it's a system-wide setting; not
without affecting the whole system; and no.
Long answers:
There is a general, application controllable, per-thread locale mechanism in
Win32 but unfortunately only a tiny part of it is used for automatic API
parameter conversion. The locale used for this purpose is always the "system
default locale" defined in system configuration data (you can change it in
Control Panel). If the program needs to do anything smarter it must convert
parameters manually using the full locale mechanism.
A general Win32 locale is set on a per-thread basis. Initially a default
process locale is determined by the user's preferences set in the Control Panel.
An application is free to modify current thread's locale at any time. There
are various facilities for enumeration of available locales, their features
etc. Each locale is associated with things like date and time formats,
calendars and other localizable stuff. The Unicode/multibyte translation
tables (Codepages) are associated with locales but are kept somewhat
separate. In fact it is possible to do a string conversion with an arbitrary
installed code page regardless of the current locale. The whole system is
rather powerful and works quite well. Interestingly, on newer versions of the OS
there are locales that do not have any multi-byte support. They are entirely
Unicode only meaning that char * applications will not be able to work with
non-US strings at all.
There aren't any UTF-8 locales but there are UTF-8 and UTF-7 codepages.
Thus, it is possible to manually convert strings from and to these formats
but not let the system automatically do it.
What always irritates me is that with the VC implementation of the standard
library the Win32 locales and the C and C++ locales are completely ignorant of
each other. I cannot even reliably map a Win32 locale identifier to a string
that can be passed to the std::locale constructor. Yet another good C++ facility
made useless by QoI.
> Note that as soon as you leave the world of straight US ASCII, you
> have the portability problems anyway. Locale names aren't
> standardized, the presence of any useful locales isn't guaranteed,
> the interaction between the C++ locales and the C locale isn't well
> specified and there are still a lot of compilers out there which
> don't have a working version of <locale> (nor wistream or wostream,
> for that matter).
100% true. But then I'd expect the implementations to step in and provide the
necessary bindings to the existing facilities of their OS. For example I don't
care about standard locale names. Users of my software would expect to see
the names as they are in the rest of the system, _not_ some C++ standard
names. So all I really care about is being able to construct a C/C++ locale
from my system locale names.
There are also compilers out there that do not have _any_ iostreams. For
example eMbedded VC4 (roughly equivalent to desktop VC6 in the rest of its
features) doesn't have iostreams at all (which angers me a lot since I like
things like lexical_cast). I of course don't know the exact reasons for this
omission but my guess would be that implementing the full functionality on a
char *-challenged system would be too expensive (both in time and library
size).
>
>> There are ways to overcome the problems but they require heroic
>> efforts from the application. Using wide-character APIs solves
>> most of these problems and it fits quite easily into C++.
>
> One of the reasons why it isn't in C++ is that most OS's don't have
> a wide character interface. And practically, most never will; as I
> said, the current trend is toward using UTF-8 and sticking with 8
> bits.
A counter-example would be Windows CE (Pocket PC, Smartphone and its other
variations). This system actually almost doesn't have any char * support :-)
>
> In the mean time, it would be perfectly legal for a library
> implementation to treat the 8 bit char's as UTF-8, convert them
> systematically into UTF-16 and use the wide character interface. I
> don't know if any library actually does this, however, and there
> may be various historical problems which mean that it isn't
> possible.
It will break all applications localized the old way (using char * multibyte
strings).
>
>> I can write the whole application using wchar_t only if I don't use
>> any of the few APIs that require chars (only some socket functions
>> do this). The perpetual problems are fstream::open,
>> exception::what() and ctors of <stdexcept> classes.
>
> In the case of the exceptions, there are very strong reasons why the
> standard library cannot offer two versions. And since char is still
> more widespread than wchar_t...
Out of curiosity what are these reasons?
>
>> I can see how the standard library on my machine could provide
>> extensions to solve most of the problems. After all we have
>> _wfopen(const wchar_t *filename, const wchar_t *mode) in the same
>> library. So from my POV this is entirely library QoI issue. The
>> thing I don't agree with is the argument that having only
>> fstream::open(char*) somehow makes the library and applications
>> that are using it more portable.
>
> Nothing using anything other than US ASCII is anywhere near
> portable. Simply because different systems have different ways of
> handling the differences. Offering an fstream::open(wchar_t*)
> doesn't help portability if the OS's on which one is building don't
> support it.
But the same is true for fstream::open(char *) on Windows CE for example.
All that could be said is that fstream::open(char *) could be more easily
supported on more systems _today_. I wonder why not have something like:
1. An implementation MUST provide at least one of the following overloads:
open(char *), open(wchar_t*).
2. An implementation MUST provide both overloads if the underlying system
allows files to be opened using both wide and narrow names.
3. An implementation MAY provide both overloads if the underlying system
allows only one form of the name. In this case the algorithm used for
filename conversion is implementation defined. One possibility would be
using the current stream locale to perform the conversion (a sketch of
such a conversion follows).
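As a rough sketch of option 3, here is what such a conversion might look like
on a narrow-name-only system, using the codecvt facet of a given locale. The
helper and its signature are invented for illustration, not proposed wording.

    #include <fstream>
    #include <locale>
    #include <string>
    #include <cwchar>    // std::mbstate_t

    bool open_wide(std::ifstream& in, const wchar_t* wname, const std::locale& loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
        const cvt_t& cvt = std::use_facet<cvt_t>(loc);

        std::wstring wide(wname);
        std::string narrow(wide.size() * cvt.max_length() + 1, '\0');

        std::mbstate_t state = std::mbstate_t();
        const wchar_t* from_next;
        char* to_next;
        cvt_t::result r = cvt.out(state,
                                  wide.data(), wide.data() + wide.size(), from_next,
                                  &narrow[0], &narrow[0] + narrow.size(), to_next);
        if (r == cvt_t::error || r == cvt_t::partial)
            return false;                 // name not representable in this locale
        narrow.resize(to_next - &narrow[0]);

        in.open(narrow.c_str());          // fall back to the narrow overload
        return in.is_open();
    }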
Eugene
Author: Matthew Collett <m.collett@auckland.ac.nz>
Date: Thu, 11 Dec 2003 21:39:45 +0000 (UTC)
In article <AqOdnULSxsXVTEuiRTvUrg@speakeasy.net>,
Eugene Gershnik <gershnik@nospam.hotmail.com> wrote:
> >> James Kanze wrote:
> >>> C++ considers that the system has a single character type. It
> >>> makes no assumptions about this type, except that it is a single
> >>> type. Any time information is passed to or from the system, this
> >>> is the type that is used. Thus, filebuf requires that the name
> >>> of the file have this type, and that all reads and writes to and
> >>> from the file are done in this type. This type is called "char".
> >>> And I repeat, not only must all filenames be char[], all
> >>> transfers to and from the system take place as transfers of char
> >>> -- a wfstream translates multi-byte sequences into wide
> >>> characters.
[snip]
> I wonder why not have something like
>
> 1. An implementation MUST provide at least one of the following overloads:
> open(char *), open(wchar_t*)
> 2. An implementation MUST provide both overloads if the underlying system
> allows files to be opened using both wide and narrow names.
> 3. An implementation MAY provide both overloads if the underlying system
> allows only one form of the name. In this case the algorithm used for
> filename conversion is implementation defined. One possibility would be using
> current stream locale to perform the conversion.
>
> Eugene
A corollary of James Kanze's observation is that the wchar_t type is in
principle an internal choice of the C++ implementation, independent of
any properties of the OS. Which in turn means that it's not clear what
it means to talk about "the underlying system allow[ing] files to be
opened using [...] wide [..] names". There is though a real
distinction between a "wide character" (fixed width) encoding, and a
"multibyte" (variable width) one - and UTF-16 is a multibyte encoding
with 16 bit bytes, not a wide character one.
I'm presently using Mac OS X. The implementation of C++ has 8 bit char,
and 32 bit wchar_t (allowing the use of a true wide character encoding
such as UCS-4), but the OS stores and manipulates filenames using
UTF-16. It is the job of the C/C++ library implementation to cope with
the mismatch of character types, which it does just fine by treating the
char* parameter to open as UTF-8. There might be a case for char being
changed to 16 bit to match the OS, but I would definitely consider it a
retrograde step if wchar_t was. (Perhaps I've overstated the case
slightly, since the OS itself does also provide a UTF-8 aware API,
easing the library's task, but this does not alter the principle.)
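On such an implementation the narrow interface really is enough for
international names; a non-ASCII filename can be opened simply by passing its
UTF-8 bytes, e.g. (assuming the filesystem and library behave as just
described):

    #include <fstream>

    int main()
    {
        // "résumé.txt" spelled out as UTF-8 bytes, so the source file's own
        // encoding does not matter.
        std::ifstream in("r\xC3\xA9sum\xC3\xA9.txt");
        return in.is_open() ? 0 : 1;
    }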
Best wishes,
Matthew Collett
--
Those who assert that the mathematical sciences have nothing to say
about the good or the beautiful are mistaken. -- Aristotle
Author: kanze@gabi-soft.fr
Date: Thu, 11 Dec 2003 21:39:45 +0000 (UTC)
Eugene Gershnik <gershnik@nospam.hotmail.com> wrote in message
news:<AqOdnULSxsXVTEuiRTvUrg@speakeasy.net>...
> kanze@gabi-soft.fr wrote:
> > The attempt to use 16 bit characters for all internal interfaces is
> > laudable. But in practice, I have the feeling that it creates more
> > problems than it solves. To begin with, it requires that you, the
> > programmer, deal with two different character types. Forget the C++
> > problem for a moment, and consider that you have no problem using
> > the wide character interfaces in your application. You read a
> > filename from a file (as wide characters), and pass it to the API to
> > open the file. Excellent. You read a URL from a file (as wide
> > characters), and pass it to the API. Oops.
> True. But again I believe this is temporary. As time goes on there
> are fewer and fewer places where one must use 8-bit chars.
> > I'm not sure what the best solution should be, either from the
> > system point of view, or from the point of view of C++. What is
> > happening, however, best or not (and it doesn't seem that bad a
> > priori) is that everything external to the program is moving (albeit
> > very slowly) to UTF-8. So an 8 bit, converting interface, makes
> > sense.
> For the external data I see this trend too but not for internal
> APIs. Win32, Java and .NET are pretty much settled on fixed width
> Unicode representations. The only major system that still uses chars
> is Unix. It makes perfect sense to me. Why would an OS open_file() use
> UTF-8? Assuming the internal data is already in fixed-width encoding
> why incur another two translations?
The real question is rather: why would a system move to 16 bit chars?
Networking imposes 8 bit chars, and networking is becoming more and more
prevalent.
> >>> The one exception is Windows, which uses 16 bit characters with
> >>> the multibyte encoding UTF-16.)
> >> It is fixed width UCS-16 I think.
> > You cannot encode all of the world's characters in UCS-16. UCS-16
> > worked well through Unicode 3.0, but since then, you need
> > UTF-something if you want full Unicode support with less than 21
> > bits. I thought that I had heard that Windows does support the
> > surrogate pairs in Unicode (which means implicitly multi-byte
> > characters). Note that for Unicode characters in the ranges
> > U+0000..U+D7FF and U+E000..U+FFFF (which includes all of the
> > characters in Unicode 3.0), there is no difference between the two.
> Win32 has some support for surrogates but it is quite limited and
> varies with OS flavor. Though surrogates are a problem from a
> theoretical point of view I believe most programs that don't engage
> heavily in text manipulations can get away with "each character is 16
> bit" approach. This is of course unfortunate but still better than the
> situation with JIS or EUC where you cannot do anything without
> thinking about multi-byte issues.
Agreed. For that matter, most programs which don't engage heavily in
text manipulations can get away with UTF-8 -- with a little bit of help
from libraries, they should still be able to consider that each
character is a byte.
My impression of UTF-16 is that it combines the worst of both worlds.
It's only marginally better than UTF-8 with regards to multi-byte
issues, and it requires a lot more memory (although not as much as UCS-4,
I'll admit). But that's a technical opinion, based on today's
facts, and it doesn't take into account various political aspects nor
backwards compatibility problems.
> > In practice, Unicode was 16 bits through 3.0. Institutions which
> > fixed wide character width too early got caught. (Microsoft and IBM
> > are the big ones, and Java as a language.)
> I am not sure the change wasn't a big mistake on the part of Unicode
> group. One of the big "selling points" of Unicode was that you don't
> have to deal with lead byte/trail byte issues anymore. Do you know if
> anyone actually uses the extended characters?
Only one, but then I work in western Europe, not in China. From what he
told me, you cannot write Chinese correctly without them. The
additional mathematical symbols seem useful too.
Someone requested them, presumably for a reason. I don't think that
Unicode would have broken backwards compatibility in this way just to be
able to support the Shavian phonetic alphabet; but claiming to be an
international alphabet while not fully supporting Chinese and Japanese would
have been a bit too much.
> >>> Regretfully, the only solution I can propose is to use the 8 bit
> >>> interface to Windows (which I think uses UTF-8, which is about as
> >>> portable as you are going to get for international characters).
> >> The problem is it isn't UTF-8. When using 8-bit interface to
> >> Windows the character strings are encoded in a language dependent
> >> manner. Each machine contains a collection of conversion tables
> >> from each of supported languages encodings to and from UCS-16.
> >> When you invoke an 8-bit wrapper over UCS-16 function one of these
> >> tables is used to convert the strings. Which table is going to be
> >> used is determined by a mechanism similar to C locales. The net
> >> result of this are huge problems with I18N portability.
> > Where are these locales decided? Can your program influence them?
> > And are there locales which use UTF-8?
> Short answers: In the context of filenames it's a system-wide
> setting. Not without affecting the whole system. No
I'm not sure I like it, but I can understand it. Under Solaris or
Linux, the interpretation of characters, including characters in
filenames, depends on some environment variables set in the process
which reads them. You do ls on the same directory in two different
windows, and you get totally different results. Not an ideal situation
either.
It's the sort of situation where no matter what you do, it's wrong.
And of course, you don't want user defined settings affecting how you
interpret names on networked drives -- say when you open a remotely
mounted file hosted on a Unix (or other 8 bit) system. (That's an
interesting problem in itself. As far as a Unix kernel is concerned, a
filename is just a string of bytes, with no meaning whatsoever. How it
is interpreted is entirely up to the application. So you can easily
have UTF-8 filenames, ISO 8859-1 filenames and shift JIS filenames in
the same directory. As I said, no matter what you do, it's wrong.)
[...]
> > Note that as soon as you leave the world of straight US ASCII, you
> > have the portability problems anyway. Locale names aren't
> > standardized, the presence of any useful locales isn't guaranteed,
> > the interaction between the C++ locales and the C locale isn't well
> > specified and there are still a lot of compilers out there which
> > don't have a working version of <locale> (nor wistream or wostream,
> > for that matter).
> 100% true. But then I'd expect the implementations to step in and
> provide necessary bindings to existing facilities of their OS. For
> example I don't care about standard locale names. Users of my software
> would expect to see the names as they are in the rest of the system
> _not_ some C++ standard names. So all I really care about is being able to
> construct a C/C++ locale from my system locale names.
Users really shouldn't be aware of locale names at all. They choose
their location and language with combo boxes, and the system does the
rest.
> There are also compilers out there that do not have _any_
> iostreams. For example eMbedded VC4 (roughly equivalent to desktop VC6
> in the rest of its features) doesn't have iostreams at all (which
> angers me a lot since I like things like lexical_cast). I of course
> don't know the exact reasons for this omission but my guess would be
> that implementing the full functionality on a char *-challenged system
> would be too expensive (both in time and library size).
> >> There are ways to overcome the problems but they require heroic
> >> efforts from the application. Using wide-character APIs solves
> >> most of these problems and it fits quite easily into C++.
> > One of the reasons why it isn't in C++ is that most OS's don't have
> > a wide character interface. And practically, most never will; as I
> > said, the current trend is toward using UTF-8 and sticking with 8
> > bits.
> A counter-example would be Windows CE (Pocket PC, Smartphone and its
> other variations). This system actually almost doesn't have any char *
> support :-)
It's interesting to find all of these systems moving to 16 bit
characters, now that it has been established that 16 bits isn't
enough:-).
FWIW: I don't think that adding support for wchar_t filenames will be
sufficient.
> > In the mean time, it would be perfectly legal for a library
> > implementation to treat the 8 bit char's as UTF-8, convert them
> > systematically into UTF-16 and use the wide character interface. I
> > don't know if any library actually does this, however, and there may
> > be various historical problems which mean that it isn't possible.
> It will break all applications localized the old way (using char *
> multibyte strings).
> >> I can write the whole application using wchar_t only if I don't use
> >> any of the few APIs that require chars (only some socket functions
> >> do this). The perpetual problems are fstream::open,
> >> exception::what() and ctors of <stdexcept> classes.
> > In the case of the exceptions, there are very strong reasons why the
> > standard library cannot offer two versions. And since char is still
> > more widespread than wchar_t...
> Out of curiosity what are these reasons?
The character type is part of the type system. And you can't have two
different types for bad_alloc -- one where what() returns a char*, and
another where it returns a wchar_t*.
> >> I can see how the standard library on my machine could provide
> >> extensions to solve most of the problems. After all we have
> >> _wfopen(const wchar_t *filename, const wchar_t *mode) in the same
> >> library. So from my POV this is entirely library QoI issue. The
> >> thing I don't agree with is the argument that having only
> >> fstream::open(char*) somehow makes the library and applications
> >> that are using it more portable.
> > Nothing using anything other than US ASCII is anywhere near
> > portable. Simply because different systems have different ways of
> > handling the differences. Offering an fstream::open(wchar_t*)
> > doesn't help portability if the OS's on which one is building don't
> > support it.
> But the same is true for fstream::open(char *) on Windows CE for
> example.
Well, the char* people were there first:-).
I think that the real solution involves separating the system character
type from bytes. And from wchar_t, for historical reasons.
(Historically, wchar_t has been used for larger than system character
types.)
At the IO level, I think we need at least three distinct types:
- a wide character stream, guaranteed to be single byte encoding,
maybe even guaranteed to be Unicode (UCS 4),
- narrow character streams, both 8 and 16 bits, with application
determined code translation on input and output, and
- pure binary IO, either byte oriented or system character oriented
(basically, the type you use for the buffer you pass to the system
level read or write).
On most modern systems, the first two should be implemented in terms of
the third. Or at least, that's what I would have thought.
> All that could be said is fstream::open(char *) could be more easily
> supported on more systems _today_. I wonder why not have something
> like
> 1. An implementation MUST provide at least one of the following overloads:
> open(char *), open(wchar_t*)
> 2. An implementation MUST provide both overloads if the underlying system
> allows files to be opened using both wide and narrow names.
> 3. An implementation MAY provide both overloads if the underlying system
> allows only one form of the name. In this case the algorithm used for
> filename conversion is implementation defined. One possibility would be using
> current stream locale to perform the conversion.
Long term, I think that we'll have to provide both. But defining the
semantics when one isn't present in the OS isn't trivial -- consider
what I said about having filenames with different encodings in the same
directory under Unix.
The answer to the problem (UTF-8) has been well known for years -- ever
since Pike published the first information on plan 9. Within the last
couple of years, the trend in Unix systems, at least, has been to say
that that is the direction we should take, and that maybe one day (long
after I'm retired), it will actually be implemented. In the meantime, a
number of systems seem to have gone other ways.
It's going to be a problem. At present, most enterprise critical data
for large companies is on Unix machines or mainframes, and I don't see
that changing anytime soon -- PC's may have the computational power, but
they don't yet have the IO bandwidth or the reliability. The current
situation is that anytime you leave 7 bit US ASCII on a Unix machine,
you're in the realm of experimentation. And you don't even get that
with the mainframes, since you have to convert to EBCDIC. In the
meantime, the network is shuffling everything around as UTF-8, and
Windows and the handhelds are UTF-16 (LE or BE? -- one of the nice
things about UTF-8 is that you don't have to worry about byte order).
And of course, with every transcoding, something gets lost.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
Author: kanze@gabi-soft.fr
Date: Sat, 13 Dec 2003 02:12:36 +0000 (UTC) Raw View
Matthew Collett <m.collett@auckland.ac.nz> wrote in message
news:<m.collett-BDBDBF.14560911122003@lust.ihug.co.nz>...
> A corollary of James Kanze's observation is that the wchar_t type is
> in principle an internal choice of the C++ implementation, independent
> of any properties of the OS. Which in turn means that it's not clear
> what it means to talk about "the underlying system allow[ing] files to
> be opened using [...] wide [..] names". There is though a real
> distinction between a "wide character" (fixed width) encoding, and a
> "multibyte" (variable width) one - and UTF-16 is a multibyte encoding
> with 16 bit bytes, not a wide character one.
I think you've put your finger on the problem. We need a type which
supports true, single byte character encoding. Perhaps not everywhere,
but anyone doing intensive text manipulation needs it. Practically
speaking, that type must have at least 21 bits. We also need a type
which is 8 bits, or at least, that we can pretend is 8 bits. Not
everyone needs it, but at the very least, you need it when you are
implementing networking software, or any low level protocols (not
necessarily text) -- for better or for worse, the C/C++ standard
requires that this type be called "char". Finally, we need a type to
interface with the OS when the OS uses text.
Historically, the last two were identical. If OSs start using something
other than 8 bits for characters in their text interfaces, we have a
problem. Even if they don't, we sort of have a problem. Ideally, I'd like
to be able to read files regardless of where they come from, and regardless
of how they are encoded. This means that at least at the IO level, I have
to handle both 8 and 16 bit multibyte characters, translating them into my
internal code.
Finally, of course, some applications involve large quantities of text,
but no text manipulation other than storing, or possibly linear
searching. For such applications, UTF-8 is indicated -- it can encode
anything you can encode with UCS-4 (which requires 21 bits, which practical
considerations round up to 32 bits), it typically takes up a LOT less space
(depending on what kind of text you are handling, of course), and never
takes up more space. UTF-16 would also be a candidate: it too would never
take more space than UCS-4, and would typically take a LOT less. I don't
know how it globally compares with UTF-8, but for languages using the Roman
alphabet, even with accents, it will take close to twice as much space as
UTF-8.
Note that in some cases, more space also means slower execution, because
of more page faults, etc.
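A quick worked example of the size difference (an illustration added for
concreteness, assuming the usual encodings of the four-character string
"café"):

    UTF-8 : 'c' 'a' 'f' 0xC3 0xA9                          ->  5 bytes
    UTF-16: 0x0063 0x0061 0x0066 0x00E9                    ->  8 bytes
    UCS-4 : 0x00000063 0x00000061 0x00000066 0x000000E9    -> 16 bytes

For Roman-alphabet text, UTF-16 is thus close to twice the size of UTF-8,
and UCS-4 four times.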
> I'm presently using Mac OS X. The implementation of C++ has 8 bit
> char, and 32 bit wchar_t (allowing the use of a true wide character
> encoding such as UCS-4), but the OS stores and manipulates filenames
> using UTF-16.
It does! I thought that Mac OS was Unix. Unix requires 8 bits. (OK,
formally, Unix doesn't say anything about what is used internally, but
what's the point of using 16 bits internally if you are required to use 8
at the ABI?)
> It is the job of the C/C++ library implementation to cope with the
> mismatch of character types, which it does just fine by treating the
> char* parameter to open as UTF-8.
This sounds like a very intelligent solution.
> There might be a case for char being changed to 16 bit to match the
> OS, but I would definitely consider it a retrograde step if wchar_t
> was. (Perhaps I've overstated the case slightly, since the OS itself
> does also provide a UTF-8 aware API, easing the library's task, but
> this does not alter the principle.)
For practical reasons, you must support an eight bit type if at all
possible. And C/C++ require that this type be called char/unsigned
char/signed char. (No other type is allowed to be smaller than 16 bits,
and no other type can be smaller than a char.)
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
Author: Eugene Gershnik <gershnik@nospam.hotmail.com>
Date: Sat, 13 Dec 2003 02:12:36 +0000 (UTC) Raw View
Matthew Collett wrote:
> In article <AqOdnULSxsXVTEuiRTvUrg@speakeasy.net>,
> Eugene Gershnik <gershnik@nospam.hotmail.com> wrote:
>
>>>> James Kanze wrote:
>>>>> C++ considers that the system has a single character type. It
>>>>> makes no assumptions about this type, except that it is a single
>>>>> type. Any time information is passed to or from the system, this
>>>>> is the type that is used. Thus, filebuf requires that the name
>>>>> of the file have this type, and that all reads and writes to and
>>>>> from the file are done in this type. This type is called
>>>>> "char".
>>>>> And I repeat, not only must all filenames be char[], all
>>>>> transfers to and from the system take place as transfers of char
>>>>> -- a wfstream translates multi-byte sequences into wide
>>>>> characters.
>
> [snip]
>
>> I wonder why not have something like
>>
>> 1. An implementation MUST provide at least one of the following
>> overloads: open(char *), open(wchar_t*)
>> 2. An implementation MUST provide both overloads if the underlying
>> system allows files to be opened using both wide and narrow names.
>> 3. An impementation MAY provide both overloads if the underlying
>> system allows only one form of the name. In this case the
>> algorithm used for filena me conversion is implementation defined.
>> One possibility would be using current stream locale to perform
>> the conversion.
>>
>> Eugene
>
> A corollary of James Kanze's observation is that the wchar_t type is
> in principle an internal choice of the C++ implementation, independent
> of any properties of the OS. Which in turn means that it's not clear
> what it means to talk about "the underlying system allow[ing] files to
> be opened using [...] wide [..] names".
Let's have an ad-hoc definition. If an OS allows a file to be opened using
a system call that accepts the file name as (const) wchar_t *, I'd call it
"wide enabled". The narrow case can be defined similarly. Note that I avoid
the issue of what is encoded, and how, in the memory pointed to by char *
and wchar_t *. All that matters is the _form_ of the input.
> There is though a real distinction between a "wide character" (fixed
> width) encoding, and a "multibyte" (variable width) one - and UTF-16 is
> a multibyte encoding with 16 bit bytes, not a wide character one.
>
> I'm presently using Mac OS X. The implementation of C++ has 8 bit char,
> and 32 bit wchar_t (allowing the use of a true wide character encoding
> such as UCS-4), but the OS stores and manipulates filenames using UTF-16.
What is the signature of the syscall/API you use to open a file? Does it
accept char* or wchar_t*? My guess is that it is char*, since I believe Mac
OS X is Unix-like, but I may be wrong. If I am right, then by the ad-hoc
definition above your system is only "narrow enabled".
> It is the job of the C/C++ library implementation to cope with the
> mismatch of character types, which it does just fine by treating the
> char* parameter to open as UTF-8. There might be a case for char being
> changed to 16 bit to match the OS, but I would definitely consider it a
> retrograde step if wchar_t was. (Perhaps I've overstated the case
> slightly, since the OS itself does also provide a UTF-8 aware API,
> easing the library's task, but this does not alter the principle.)
Well, under the rules I posted your library will _have to_ provide only the
char * overload (since this is the only version exposed by the OS). If your
library vendor so desired, he could also provide a wchar_t * one and convert
from UCS-4 to UTF-8 internally. This would _help portability_, since instead
of
std::wstring wname = get_file_name_from_somewhere();
std::string name = convert_from_ucs4_to_utf8(wname);
my_stream.open(name.c_str(), ...);
which is portable only to other Unix systems, you would have
std::wstring wname = get_file_name_from_somewhere();
my_stream.open(wname.c_str(), ...);
which would be portable to any "wide enabled" system and to those "narrow
enabled" ones where the library provides a wchar_t overload.
Eugene
Author: Eugene Gershnik <gershnik@nospam.hotmail.com>
Date: Sat, 13 Dec 2003 02:12:36 +0000 (UTC) Raw View
kanze@gabi-soft.fr wrote:
> Eugene Gershnik <gershnik@nospam.hotmail.com> wrote in message
> news:<AqOdnULSxsXVTEuiRTvUrg@speakeasy.net>...
>> kanze@gabi-soft.fr wrote:
>> For the external data I see this trend too but not for internal
>> APIs. Win32, Java and .NET are pretty much settled on fixed width
>> Unicode representations. The only major system that still uses chars
>> is Unix. It makes perfect sense to me. Why would an OS open_file()
>> use UTF-8? Assuming the internal data is already in fixed-width
>> encoding, why incur another two translations?
>
> The real question is rather, why would a system move to 16 bit chars?
> Networking imposes 8 bit chars. Networking is becoming more and more
> prevalent.
But otherwise you would have the nightmare of converting (say from UCS-4 to
UTF-8) every time you need to change the content of a listbox. Or are you
saying that the application should store strings in UTF-8 too? This is
possible even on Windows, and I am actually currently working on a project
that went down this path (for Unix compatibility). The problem is that all
the string related parts of the C and C++ standard libraries would have to
be modified in this case. We need two versions of strlen() (one for byte
length and another for character length), and the same goes for
string::length() etc. Also, various smart string searching and sorting
algorithms will probably have to be rewritten.
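A minimal sketch of the byte-length/character-length distinction
(utf8_charlen() is a hypothetical helper, not an existing library function):

#include <cstddef>
#include <cstring>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx); strlen() still gives the byte length.
std::size_t utf8_charlen(const char* s)
{
    std::size_t n = 0;
    for (; *s; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// For the UTF-8 encoding of "café":
//   std::strlen("caf\xC3\xA9")   == 5   (bytes)
//   utf8_charlen("caf\xC3\xA9")  == 4   (characters)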
As for networking, it also uses more and more XML, which is rarely
manipulated in string form. For an XML-speaking application it doesn't
really matter what encoding is used to serialize the documents. If the
grand vision of applications communicating only by exchanging XML documents
ever comes true, the network character encoding will become entirely
irrelevant for most programmers.
>
>>>>> The one exception is Windows, which uses 16 bit characters with
>>>>> the multibyte encoding UTF-16.)
>
>>>> It is fixed width UCS-2, I think.
>
>>> You cannot encode all of the world's characters in UCS-2. UCS-2
>>> worked well through Unicode 3.0, but since then, you need
>>> UTF-something if you want full Unicode support with less than 21
>>> bits. I thought that I had heard that Windows does support the
>>> surrogate pairs in Unicode (which means implicitly multi-byte
>>> characters). Note that for Unicode characters in the ranges
>>> U+0000..U+D7FF and U+E000..U+FFFF (which includes all of the
>>> characters in Unicode 3.0), there is no difference between the
>>> two.
>
>> Win32 has some support for surrogates but it is quite limited and
>> varies with OS flavor. Though surrogates are a problem from a
>> theoretical point of view, I believe most programs that don't engage
>> heavily in text manipulations can get away with the "each character is
>> 16 bits" approach. This is of course unfortunate but still better than
>> the situation with JIS or EUC, where you cannot do anything without
>> thinking about multi-byte issues.
>
> Agreed. For that matter, most programs which don't engage heavily
> in text manipulations can get away with UTF-8 -- with a little bit
> of help from libraries, they should still be able to consider
> that each character is a byte.
This is true. I am currently working on such a project. The problem is that
the "danger zone" is bigger.
>>> In practice, Unicode was 16 bits through 3.0. Institutions which
>>> fixed wide character width too early got caught. (Microsoft and
>>> IBM
>>> are the big ones, and Java as a language.)
>
>> I am not sure the change wasn't a big mistake on the part of the
>> Unicode group. One of the big "selling points" of Unicode was that you
>> don't have to deal with lead byte/trail byte issues anymore. Do you
>> know if anyone actually uses the extended characters?
>
> Only one, but then I work in western Europe, not in China. From what
> he told me, you cannot write Chinese correctly without them.
> The additional mathematical symbols seem useful too.
>
> Someone requested them. Presumably for a reason. I don't think
> that Unicode would have broken backwards compatibility in this
> way just to be able to support the Shavian phonetic alphabet,
> but claiming to be an international alphabet, but not
> supporting Chinese and Japanese, is a bit too much.
<useless_rant>
I don't know much about Unicode's evolution, but couldn't they count from
the beginning? I remember reading in the 90s statements like "65535
characters is enough to encode all the world's writing systems and then
some". Then at version 4(?) of the standard they suddenly discovered it was
not enough, so they opted to invalidate all the design decisions OS and
language designers had made over a decade.
</useless_rant>
>>> Where are these locales decided? Can your program influence them?
>>> And are there locales which use UTF-8?
>
>> Short answers: In the context of filenames it's a system-wide
>> setting. Not without affecting the whole system. No
>
> I'm not sure I like it, but I can understand it. Under Solaris or
> Linux, the interpretation of characters, including characters in
> filenames, depends on some environment variables set in the process
> which reads them. You do ls on the same directory in two different
> windows, and you get totally different results. Not an ideal
> situation either.
Well, on Windows too, what you will actually _see_ in Explorer or another
file management UI will depend on the current user's locale and font
settings.
>
> It's the sort of situation where no matter what you do, it's wrong.
>
> And of course, you don't want user defined settings affecting how
> you interpret names on networked drives -- say when you open a
> remotely mounted file hosted on a Unix (or other 8 bit) system.
Remote drives are a nightmare anyway. Trying to see the contents of a Unix
directory from Windows results in very interesting phenomena when the files
differ only by case, for example ;-) And what happens to security
information is entirely up to the wildest imagination of the remote
filesystem protocol/application. Different encodings do not make the
problem much worse.
>
> [...]
>>> Note that as soon as you leave the world of straight US ASCII, you
>>> have the portability problems anyway. Locale names aren't
>>> standardized, the presence of any useful locales isn't guaranteed,
>>> the interaction between the C++ locales and the C locale isn't
>>> well specified and there are still a lot of compilers out there
>>> which
>>> don't have a working version of <locale> (nor wistream or
>>> wostream,
>>> for that matter).
>
>> 100% true. But then I'd expect the implementations to step in and
>> provide the necessary bindings to existing facilities of their OS.
>> For example I don't care about standard locale names. Users of my
>> software would expect to see the names as they are in the rest of
>> the system, _not_ some C++ standard names. So all I really care
>> about is being able to construct C/C++ locales from my system
>> locale names.
>
> Users really shouldn't be aware of locale names at all. They choose
> their location and language with combo boxes, and the system does the
> rest.
Sure. What I mean is that to show the name to the user, I'd ask the system
for the UI name of locale 42. When I call some OS localization routine I
will again use 42. So C++ locales should be constructible from that same
42, or provide a convenient mapping between the names they use and my OS
ids. Whether these names are standard or not doesn't bother me as long as I
can use all of my ids. The names and identification of locales on another
OS will surely be different, so standardization will not lead to
portability.
>>> One of the reasons why it isn't in C++ is that most OS's don't
>>> have a wide character interface. And practically,
>>> most never will; as I said, the current trend is toward
>>> using UTF-8 and sticking with 8 bits.
>
>> A counter-example would be Windows CE (Pocket PC, Smartphone and
>> its other variations). This system actually has almost no
>> char * support :-)
>
> It's interesting to find all of these systems moving to 16 bit
> characters, now that it has been established that 16 bits isn't
> enough:-).
>
> FWIW: I don't think that adding support for wchar_t filenames will
> be sufficient.
For what?
>>> In the case of the exceptions, there are very strong reasons why
>>> the standard library cannot offer two versions. And since char
>>> is still
>>> more widespread than wchar_t...
>
>> Out of curiosity what are these reasons?
>
> The character type is part of the type system. And you can't have
> two different types for bad_alloc -- one where what() returns a
> char*, and another where it returns a wchar_t*.
Hmmm. How about adding
virtual const wchar_t * wwhat() const throw() = 0;
to std::exception? A suitable semantics that plays nicely with all systems
could no doubt be invented. For example, bad_cast would return valid strings
from both what() and wwhat(). The standard exceptions would be enhanced with
an additional ctor and would return "what they were given"; the other
overload would return NULL. A custom class derived from std::exception could
return a valid string from both, or from either one. But I am dreaming of
course.
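A minimal sketch of the idea outside the standard library (wwhat() is
purely hypothetical and not part of any standard; the semantics follow the
post, so the overload that was not supplied returns NULL):

#include <exception>
#include <cstddef>

// A hypothetical wide-aware exception: constructed from a wide message,
// it returns it from wwhat() and NULL from what(), per the proposal above.
class wide_error : public std::exception {
public:
    explicit wide_error(const wchar_t* msg) : wmsg_(msg) {}
    virtual const char* what() const throw() { return NULL; }
    virtual const wchar_t* wwhat() const throw() { return wmsg_; }
private:
    const wchar_t* wmsg_;
};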
>>>> I can see how the standard library on my machine could provide
>>>> extensions to solve most of the problems. After all we have
>>>> _wfopen(const wchar_t *filename, const wchar_t *mode) in the same
>>>> library. So from my POV this is entirely library QoI issue. The
>>>> thing I don't agree with is the argument that having only
>>>> fstream::open(char*) somehow makes the library and applications
>>>> that are using it more portable.
>
>>> Nothing using anything other than US ASCII is anywhere near
>>> portable. Simply because different systems have different ways of
>>> handling the differences. Offering an fstream::open(wchar_t*)
>>> doesn't help portability if the OS's on which one is building
>>> don't support it.
>
>> But the same is true for fstream::open(char *) on Windows CE for
>> example.
>
> Well, the char* people were there first:-).
>
> I think that the real solution involves separating the system character
> type from bytes. And from wchar_t, for historical reasons.
> (Historically, wchar_t has been used for types larger than the system
> character type.)
>
> At the IO level, I think we need at least three distinct types:
>
> - a wide character stream, guaranteed to be single byte encoding,
> maybe even guaranteed to be Unicode (UCS 4),
>
> - narrow character streams, both 8 and 16 bits, with application
> determined code translation on input and output, and
>
> - pure binary IO, either byte oriented or system character
> oriented (basically, the type you use for the buffer you pass
> to the system level read or write).
>
> On most modern systems, the first two should be implemented in terms
> of the third. Or at least, that's what I would have thought.
>
If such a library were available I would dump iostreams ASAP ;-)
But I think these issues are orthogonal to the filenames. The filenames are
just one instance of a bigger problem: the format of string identifiers
passed to the underlying system. If a C++ library encapsulates OS
facilities, it has to accept identifiers in the same form as the OS
requires.
>> All that could be said is fstream::open(char *) could be more
>> easily supported on more systems _today_. I wonder why not have
>> something
>> like
>
>> 1. An implementation MUST provide at least one of the following
>> overloads: open(char *), open(wchar_t*)
>> 2. An implementation MUST provide both overloads if the underlying
>> system allows files to be opened using both wide and narrow names.
>> 3. An implementation MAY provide both overloads if the underlying
>> system allows only one form of the name. In this case the
>> algorithm used for filename conversion is implementation defined.
>> One possibility would be using the current stream locale to perform
>> the conversion.
>
> Long term, I think that we'll have to provide both. But defining
> the semantics when one isn't present in the OS isn't trivial --
> consider what I said about having filenames with different encodings
> in the same directory under Unix.
That's why always having both isn't a good idea. It doesn't make sense to
force a wchar_t*-only system to invent ways of converting char*, and vice
versa. Only the systems that allow both forms at the OS level should expose
both overloads from fstream.
>
> The answer to the problem (UTF-8) has been well known for years --
> ever since Pike published the first information on Plan 9. Within
> the last couple of years, the trend in Unix systems, at least, has
> been to say that that is the direction we should take, and that
> maybe one day (long after I'm retired), it will actually be
> implemented. In the meantime, a number of systems seem to have gone
> other ways.
>
> It's going to be a problem. At present, most enterprise critical
> data for large companies is on Unix machines or mainframes, and I
> don't see that changing anytime soon -- PC's may have the
> computational power, but they don't yet have the IO bandwidth or
> the reliability. The current situation is that anytime you leave
> 7 bit US ASCII on a Unix machine, you're in the realm of
> experimentation. And you don't even get that with the mainframes,
> since you have to convert to EBCDIC. In the meantime, the network
> is shuffling everything around as UTF-8, and Windows and the
> handhelds are UTF-16 (LE or BE? -- one of the nice things about
> UTF-8 is that you don't have to worry about byte order).
> And of course, with every transcoding, something gets lost.
This is an excellent argument for why it isn't possible to provide portable
filename semantics, or to attempt any conversions on them, in the C++
standard library. All that can be said about any OS that has files and
filenames is that its interface is either open([const] char *) or
open([const] wchar_t *) or both. The content and meaning of the pointed-to
data is non-portable and irrelevant. The C++ library shouldn't attempt to do
anything with the content, but rather pass the pointer to the OS as is.
Which in turn calls for one or both overloads of fstream::open.
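A minimal sketch of that rule (all names below are hypothetical; a real
implementation would forward to whatever open calls its platform actually
provides):

// Hypothetical OS entry points -- stand-ins for the platform's own
// narrow and wide file-open calls.
extern "C" int os_open_narrow(const char* name, int flags);
extern "C" int os_open_wide(const wchar_t* name, int flags);

class native_filebuf {
public:
    native_filebuf() : fd_(-1) {}

    // Present only if the OS is "narrow enabled"; the bytes are passed
    // through untouched, with no interpretation or conversion.
    bool open(const char* name)
    { return (fd_ = os_open_narrow(name, 0)) >= 0; }

    // Present only if the OS is "wide enabled"; likewise passed as is.
    bool open(const wchar_t* name)
    { return (fd_ = os_open_wide(name, 0)) >= 0; }

private:
    int fd_;
};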
Eugene
Author: Seagull Manager <seagull.manager@nospamthanksbecauseisayso.demon.co.uk>
Date: Tue, 2 Dec 2003 17:06:16 +0000 (UTC) Raw View
Maybe this is something everyone knows about already, and considers completely
boring, but it's been bugging me, so here goes:
The standard has it that fstreams, even wide fstreams, can only be opened
using 8-bit byte strings for filenames. If a filename contains, say,
non-western letters in the name, it may be impossible to open it using an
fstream without some kind of shim, e.g.:
template<typename unicode_string>
const std::string
convert_wide_string_to_something_understood_by_fstreams(const unicode_string&)
{
    // do something OS-dependent, such as encode in utf-8, or look up an
    // alternative file name
}
//...
std::wifstream my_file;
//...
my_file.open(convert_wide_string_to_something_understood_by_fstreams(
    gujerati_or_hangul_or_kanji_maybe_file_name).c_str());
If I want my application to use fstreams and at the same time be useful
internationally, I'll have to shim all my calls to "open" this way. This
strikes me as a little clunky (even if I don't use the ridiculously long
names I've used here for the discussion).
Someone suggested, when the C++ standard was in the draft stages, that the
fstream template classes be modified to have constructors and open() methods
that took template<CharType> parameters, which would have put the shim onus
on the library implementer, but the idea was kicked into the long grass
(a.k.a. "Future"). That was about five years ago.
Right now, most common operating systems have support for Unicode filenames
built in, and computer use outside the ISO-8859-15* region continues to
boom. Could it be a sign that the standard is ripe for update now? I've
heard it suggested that the next revision of the standard will happen around
2008. Will that be too late for the language? I fear it might be. Niggles
like the one I've just mentioned could turn programmers away.
::Bruce_Attah;
(*aside: ISO-8859-15, a.k.a. "Latin-9", is probably the character set we
should all be using by default for Western languages; it is an update of
ISO-8859-1 that renders the latter essentially obsolete - I wonder why so
few of us use it?)
Author: Jonathan Turkanis <technews@kangaroologic.com>
Date: Wed, 3 Dec 2003 16:05:58 +0000 (UTC) Raw View
"Seagull Manager" <seagull.manager@nospamthanksbecauseisayso.demon.co.uk>
wrote in message
>
> The standard has it that fstreams, even wide fstreams, can only be opened
> using 8-bit byte strings for filenames. If a filename contains, say,
> non-western letters in the name, it may be impossible to open it using an
> fstream without some kind of shim, e.g.:
>
First, the type of characters which can appear in a file name is separate
from the type of data the file contains. E.g., a file whose name contains
unusual characters could contain pure binary data, and a text file using a
multibyte encoding could have a pure ASCII name. So the question applies as
much to narrow-character streams as it does to wide-character streams.
For the main question, the Boost.Filesystem FAQ gives a good answer. See:
http://www.boost.org/libs/filesystem/doc/faq.htm (Why aren't wide-character
names supported?)
Jonathan
Author: Hyman Rosen <hyrosen@mail.com>
Date: Wed, 3 Dec 2003 16:05:58 +0000 (UTC) Raw View
There's nothing to stop a vendor from offering an extension
which allows open on wide strings.
Note that the wideness of the file contents is completely
orthogonal to the wideness of the file name.
Author: Eugene Gershnik <gershnik@nospam.hotmail.com>
Date: Thu, 4 Dec 2003 15:55:38 +0000 (UTC) Raw View
Jonathan Turkanis wrote:
>> "Seagull Manager"
>> <seagull.manager@nospamthanksbecauseisayso.demon.co.uk> wrote in
>> message
>> The standard has it that fstreams, even wide fstreams, can only
>> be opened using 8-bit byte strings for filenames. If a filename
>> contains, say, non-western letters in the name, it may be
>> impossible to open it using an fstream without some kind of
>> shim, e.g.:
> For the main question, the Boost.Filesystem FAQ gives a good
> answer. See: http://www.boost.org/libs/filesystem/doc/faq.htm (Why
> aren't wide-character names supported?)
>
Hmmm. On my OS the native character type is wchar_t. Conversions to char
can be ambiguous or sometimes impossible (for example, when the file comes
from another system and its name contains characters from a language for
which mine doesn't have a conversion table installed). So on my system the
standard fstream simply isn't usable, never mind portable :-(
Eugene