Topic: Once again a Plea for proper International Character support


Author: hyrosen@mail.com (Hyman Rosen)
Date: Sun, 20 Oct 2002 20:38:17 +0000 (UTC)
James Kanze wrote:
> Maybe.  To be truthful, I'd like to know what Java really does.

Sun will let you download the source code, so you could check.
If I recall correctly, you had to fax in some agreement first,
but it was open to anyone.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: hyrosen@mail.com (Hyman Rosen)
Date: Thu, 17 Oct 2002 19:13:01 +0000 (UTC)
James Kanze wrote:
 > I wouldn't be surprised if Java didn't systematically convert the
 > filename to narrow characters even under Windows.

I tried the sample Java notepad application in Windows.
I gave it a filename full of foreign characters, including
Chinese, and it faithfully created a file with that name,
at least as displayed by Explorer.

So it looks like Java has taken the same hacker's approach that
lots of us here are advocating - do it wide if you can, and do
something arbitrary and possibly appropriate if you can't.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Mon, 14 Oct 2002 20:28:51 +0000 (UTC)
Peter Dimov wrote:
> Odd example. On non-VMS, if you make a file named a.txt and someone
> modifies it, when you open a.txt you get the modified file. The
> "condition of mine" doesn't guarantee that file contents are stable,
> just that you are able to open the same file.

In VMS, "a.txt;1" (I think that's the syntax) continues to exist
and is the same file as originally created. Subsequent versions
are new files. Anyway, if we're going to get into an argument over
the semantics of being "the same file", that's all the more reason
for not requiring that the meaning of a wide filename be "the same
file" each time the program is run.

Here's a simpler argument. If I specify a name like "joe.txt", its
meaning depends upon another external context, namely the current
directory when the program is executed (at least on UNIX and Windows).
Why should it be OK to depend on the current directory for augmenting
the meaning of a filename, but not on the current language?






Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 15 Oct 2002 19:38:32 +0000 (UTC)
James Kanze wrote:
> Let's put it differently: it is perfectly acceptable to document that a
> program will execute in a certain directory, or even that it must be
> started in that directory.

It is less than perfectly acceptable, but it is common.

> Would you consider it acceptable that the program must execute in a
> certain language environment?

Yes.

> And if so, why bother with internationalization at all?

Because programs don't have to require a specific language environment,
but they can do so if they want to. Just as it's better for a program
to be runnable from any directory, it's better if it's runnable in any
language environment.

Also, because you can have a program which adapts itself to any
language the user would like to use, but then creates its data
files using that language environment, so that subsequently the
same language environment must be used in order to process those
files.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Mon, 14 Oct 2002 15:06:39 +0000 (UTC)
Philip Guenther wrote:
> Which of those possibilities, if any, do you feel that the standard
> should disallow and consider non-compliant?

None. It's not up to the C or C++ standard to dictate to the
operating system what its filenames mean. If the OS has some
filenames which can only be opened as narrow chars, and some
that can only be opened as wide chars, then your application
will have to deal with it. I don't know why you would expect
otherwise.

 > please suggest how I should present to the user, in
 > documentation and interface, the distinction between
 > the two filename APIs.

In the same way you currently document their choice of whether
the filename they pick should be opened in binary or text mode.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Thu, 10 Oct 2002 21:30:52 +0000 (UTC)
P.J. Plauger wrote:

> Depends on how you do it. If you produce an International Standard that
> says, here's how you can *try* to open a file with a wide-character name,
> but no implementation is obliged to succeed for *any* request, and every
> implementation is at liberty to do something different -- that, my friend,
> is wishy washy.


This argument makes absolutely no sense. The only thing C++ says about
filenames is that they're opened "as if" by std::fopen, whose definition
comes from C. The only thing C says about the name string passed to fopen
is that fopen opens the file whose name is that string.

So we define a std::wfopen whose first argument is a const wchar_t *.
In exactly the same way as fopen, this function is defined to open the
file whose name is that string. We declare open methods for file streams
and buffers which take wide string names, and define them to open the
files "as if" by std::wfopen.

The semantics are exactly the same. I fail to see why we need any more
comprehensive definition for wfopen than for fopen.

Apparently some people are in a total tizzy over the fact that there
may be certain pairs of string and wide string of which both members
name the same file, and feel that this must somehow be reflected in the
language specification. I have no idea why this should be the case.
It is equally true that there may be certain pairs of plain strings of
which both members name the same file (e.g., "joe.txt" and "./joe.txt",
or any pair of linked names in UNIX), but neither C nor C++ talks about
this situation.
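
The claim that two different plain strings can name one file is easy to
check. What follows is only a minimal sketch, assuming a C++17 library
with std::filesystem (which postdates this thread). The helper names
names_same_file and demo_same_file and the scratch filename are made up
for illustration.

```cpp
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper: true if both names currently resolve to the same
// file. Any error (for example, neither name exists) counts as "no".
bool names_same_file(const std::string& a, const std::string& b)
{
    std::error_code ec;
    const bool same = std::filesystem::equivalent(a, b, ec);
    return !ec && same;
}

// Creates a scratch file in the current directory, compares two different
// spellings of its name, and removes the file again.
bool demo_same_file()
{
    const std::string plain = "joe_demo.txt";
    const std::string dotted = "./joe_demo.txt";
    {
        std::ofstream out(plain.c_str());
        out << "x";
    }
    // The strings differ, yet both should name the file just created.
    const bool same = plain != dotted && names_same_file(plain, dotted);
    std::remove(plain.c_str());
    return same;
}
```

Whether two names are equivalent is answered here by asking the file
system at run time, not by anything in the language definition, which is
the situation described above.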






Author: hyrosen@mail.com (Hyman Rosen)
Date: Fri, 11 Oct 2002 17:25:02 +0000 (UTC)
Peter Dimov wrote:
> On the other hand, a context-independent mapping ensures that if you
> open the file today using wchar_t sequence X, you will be able to open
> the same file tomorrow using that same sequence X (assuming the file
> hasn't been deleted or moved of course.)

I don't understand why this is the programming language's problem to
solve, though. And in any case, this condition of yours doesn't hold
for char filenames, so why should it hold for wide names?

In VMS files have version numbers. When you modify a file, the old
version is kept, and a new copy is created with a higher version
number. When you open a file with no version specified, you get the
one with the highest version. So if I open "a.txt" today, and then
someone changes it, and I open "a.txt" tomorrow, I am not getting
the same file, even though that same file still exists (and is
accessible by explicitly giving the version).






Author: kuyper@wizard.net ("James Kuyper Jr.")
Date: Fri, 11 Oct 2002 17:32:42 +0000 (UTC)
Edward Diener wrote:
> "James Kanze" <kanze@gabi-soft.de> wrote in message
> news:d6651fb6.0210100630.7c9adf55@posting.google.com...
....
>>The C++ standards committee is made up of representatives of the
>>national bodies.  If you cannot persuade at least one member to support
>>your proposal, how do you expect to persuade the organization as a
>>whole.
>
>
> By creating a proposal which others would find acceptable because the idea
> as proposed ( and possibly the implementation of that idea ) behind it is a
> good one.

Well, if you can in fact write a proposal that would be accepted by the
committee as a whole, then you can show it to one member, and convince
him with it. So, what's the problem?

> I find it sad, not personally but intellectually, that the C++ language has
> devolved into a situation where politics takes precedence over creativity. I
> do not believe that that was what Bjarne Stroustrup had in mind when he
> created the language.

Of course not - he was designing a language; the committee is writing a
standard for a language. Those are quite different (but closely related)
activities. Creativity is very important in language design, and a
serious mistake in language standardization.







Author: hyrosen@mail.com (Hyman Rosen)
Date: Fri, 11 Oct 2002 18:53:11 +0000 (UTC)
James Kanze wrote:
> Can you guarantee me that if wide character filenames are adopted, Sun,
> g++, Dinkumware and the STLport will implement them with the same
> semantics, so that the different libraries will be interchangeable?
> If not, we have a problem which must be addressed.

Easy enough. Java runs on all of these platforms, and Java strings
and thus filenames are wide. So C++ implementors will adopt whatever
convention Java uses.

This illustrates more clearly than ever that the rules of association
between wide and narrow filenames belong to the platform ABI, not to
the programming language.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Fri, 11 Oct 2002 19:26:13 +0000 (UTC)
James Kanze wrote:
> If wide character filenames are adopted, the newer versions of the
> libraries will implement them.  But what will they implement?  Will the
> g++ library still have the same semantics as the Sun library?  Could I
> replace one of the libraries with a library from Dinkumware or the
> STLport with no change in semantics?

You can get Java systems from more than one maker
on UNIX (e.g., Sun and IBM). Java strings and thus
filenames are wide. We can just adopt whatever
solution Java used to avoid the worry you state.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 8 Oct 2002 21:41:25 +0000 (UTC)
James Kanze wrote:
>> Specifically, how it is to be implemented for operating systems that
>> do not support wide character file names.
>
> That is the big question.  I don't think it is a killer problem, but to
> date, no one has really decided to address it.

Since wcsrtombs is part of standard C and C++,
I really don't see the problem here. We have a
standard way to convert a wide string to a
multi-byte string, so if the OS can't open a
wide-named file directly, convert the name
using this function, and use that name instead.
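
The fallback described above might look roughly like this. It is a
sketch only, not a proposed specification: the names narrow_filename and
wfopen_fallback are hypothetical, and the conversion depends on whatever
C locale is in effect when the call is made (a program would normally
call std::setlocale(LC_ALL, "") first to pick up the user's locale).

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide filename to a multibyte string using
// the current C locale. Returns false if some character cannot be converted.
bool narrow_filename(const wchar_t* wide_name, std::string& out)
{
    assert(wide_name != nullptr);

    // First pass: a null destination asks only for the required length
    // (in bytes, not counting the terminating null).
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide_name;
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return false;

    // Second pass: perform the conversion into a buffer of that size.
    std::vector<char> buf(len + 1);
    src = wide_name;
    state = std::mbstate_t();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    out.assign(buf.data(), len);
    return true;
}

// Hypothetical fallback for systems that can open only char names:
// convert the name and, if that worked, hand it to std::fopen.
std::FILE* wfopen_fallback(const wchar_t* wide_name, const char* mode)
{
    std::string narrow;
    if (!narrow_filename(wide_name, narrow))
        return nullptr;
    return std::fopen(narrow.c_str(), mode);
}
```

A failed conversion is reported the same way as any other failure to
open, by returning a null pointer, so callers need no new error path.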






Author: hyrosen@mail.com (Hyman Rosen)
Date: Wed, 9 Oct 2002 18:48:17 +0000 (UTC)
Peter Dimov wrote:
> The perfect mapping has some pretty obvious properties:
> * Equal wchar_t sequences map to equal byte sequences, independent of
> OS/runtime/compiler state.

Why should this be the case? I would think the locale in effect
would dictate the conversion from wide-char to multi-byte sequence.






Author: hyrosen@mail.com (Hyman Rosen)
Date: Wed, 9 Oct 2002 22:22:20 +0000 (UTC)
James Kanze wrote:
> I'm sure that most people would like to see this work.  I'm far from
> sure that we have any consensus on the semantics, however, if the OS
> doesn't support wide character filenames directly.

I noted in passing in a different message that universal character names
are legal in #include directives, so implementations must already deal
with those in some fashion. What do they do? Can that be a guide to
how they should deal with wchar_t filenames?






Author: hyrosen@mail.com (Hyman Rosen)
Date: Thu, 10 Oct 2002 18:20:31 +0000 (UTC)
Peter Dimov wrote:

> Because, if the forward mapping is state-dependent, the reverse
> mapping will be state-dependent, too. Therefore, in order to decode a
> filename, you will need not only the byte sequence, but the state used
> at the time the forward mapping was performed, which is one of the
> main disadvantages of char[] based names (to interpret them you need
> to know the code page).


Why is that a problem? Or rather, why is it more of a problem
than any other locale-dependent conversions? Whether or not a
char filename "decodes" into a wide-char filename, and how it
does this, obviously has absolutely nothing to do with C++,
but instead depends entirely on the operating system.

The best C++ has to offer in this regard is to run the filename
back through mbstowcs, to reverse the original wcsrtombs. Then
it's up to the locale implementation to deal with this. If the
locale converts unambiguously, you'll have what you want. If it
doesn't, then you'll have to carry around the extra information
somehow.
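
A rough way to find out whether the current locale "converts
unambiguously" for a given name is to push it through both conversions
and compare. This is a sketch under the same assumptions as the text:
the function name survives_round_trip is hypothetical, and the answer
depends entirely on the C locale in effect at the time of the call.

```cpp
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical check: does wide_name come back unchanged after
// wcsrtombs followed by mbstowcs in the current C locale?
bool survives_round_trip(const wchar_t* wide_name)
{
    assert(wide_name != nullptr);
    const std::size_t failed = static_cast<std::size_t>(-1);

    // Forward: wide -> multibyte. A null destination only measures.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide_name;
    const std::size_t nbytes = std::wcsrtombs(nullptr, &src, 0, &state);
    if (nbytes == failed)
        return false;
    std::vector<char> narrow(nbytes + 1);
    src = wide_name;
    state = std::mbstate_t();
    std::wcsrtombs(narrow.data(), &src, narrow.size(), &state);

    // Reverse: multibyte -> wide. The result never has more wide
    // characters than the input has bytes, so this buffer is large enough.
    std::vector<wchar_t> wide(nbytes + 1);
    const std::size_t nwide =
        std::mbstowcs(wide.data(), narrow.data(), wide.size());
    if (nwide == failed)
        return false;

    return std::wstring(wide.data(), nwide) == wide_name;
}
```

When this returns false, the byte sequence alone is not enough to
recover the wide name, and the extra information has to be kept
elsewhere.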






Author: hyrosen@mail.com (Hyman Rosen)
Date: Mon, 7 Oct 2002 23:12:30 +0000 (UTC)
Beman Dawes wrote:
> "Implementation defined" won't fly; that's just an "illusion of
> portability" without any underlying reality. The implementors say they
> have no idea what semantics they would be defining.

Why not? All filenames offer an "illusion of portability".
What are the semantics of opening "c:\\tmp\\b.txt"? I can
do that successfully on both Windows and UNIX, but the
semantics are hardly the same!

For implementations where it's truly meaningless, the
implementors can provide a meaningless implementation.
A perfectly reasonable default for systems that can
open only char * file names is to convert the name using
wcsrtombs, and open a file using that string if the
conversion succeeded.

By the way, implementors must *already* deal with this
issue, because file names in #include directives can
contain universal character names!
