Topic: 27.8.1 underlaid character type is char
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Tue, 7 Jun 2005 16:14:21 GMT Raw View
User <kanze@gabi-soft.fr> wrote in message
news:1118070685.189749.227540@o13g2000cwo.googlegroups.com...
> "Krzysztof Żelechowski" wrote:
>> The default encoding should fulfill the following condition:
>> if I binary wfilebuf::sputn(wchar_t const str[], length) and
>> then fopen the file for binary input and fread(wchar_t *buff,
>> sizeof *buff, length, file) it back, then I should have
>> memcmp(buff, str, length * sizeof *buff) == 0. It does not
>> seem very complicated, does it?
>
> Actually, it does. I don't understand the why of this request,
> and of course, it is dangerously underspecified: what if I imbue
> the output stream with a JIS encoded locale, but set the global
locale to, say UTF-16LE?
The default encoding is what you get when no locale is imbued explicitly,
i.e. it refers to the "C" locale. My requirement applies to this default
locale only in order to mimic the behaviour of filebuf:
if I binary filebuf::sputn(char const str[], length) and then fopen(...) and
fread(...), I have memcmp(...) == 0.
Why not the same for wfilebuf?
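For concreteness, a minimal sketch of the narrow-character round trip
being asked for, assuming binary mode and the "C" locale (file name
illustrative; error checking omitted):

    #include <cassert>
    #include <cstdio>
    #include <cstring>
    #include <fstream>

    int main()
    {
        char const str[] = "payload";
        std::size_t const length = sizeof str;

        // Write the bytes through a filebuf opened in binary mode.
        std::filebuf fb;
        fb.open("data.bin", std::ios::out | std::ios::binary);
        fb.sputn(str, length);
        fb.close();

        // Read them back with stdio and compare.
        char buff[sizeof str];
        std::FILE* file = std::fopen("data.bin", "rb");
        std::fread(buff, sizeof *buff, length, file);
        std::fclose(file);
        assert(std::memcmp(buff, str, length * sizeof *buff) == 0);
        return 0;
    }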
> The only way you could possibly make
> this sort of guarantee is by removing all support for different
> encodings; i.e. by removing important functionality which is
> currently present in both C and C++.
>
Notwithstanding what I said, I personally think that binary files need not
support different encodings. If you want it to be portable, do it in text
mode.
> If the C global locale and the imbued locale are the same, and
> all of the characters are representable in the character code
> implied by the locale, I would expect this to work today.
It does not. What you get when you fread it back is char *, not wchar_t *.
Chris
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.fr
Date: Wed, 8 Jun 2005 10:04:17 CST Raw View
"Krzysztof Zelechowski" wrote:
> User <kanze@gabi-soft.fr> wrote in message
> news:1118070685.189749.227540@o13g2000cwo.googlegroups.com...
> > "Krzysztof Żelechowski" wrote:
> >> The default encoding should fulfill the following
> >> condition: if I binary wfilebuf::sputn(wchar_t const str[],
> >> length) and then fopen the file for binary input and
> >> fread(wchar_t *buff, sizeof *buff, length, file) it back,
> >> then I should have memcmp(buff, str, length * sizeof *buff)
> >> == 0. It does not seem very complicated, does it?
> > Actually, it does. I don't understand the why of this
> > request, and of course, it is dangerously underspecified:
> > what if I imbue the output stream with a JIS encoded locale,
> > but set the global locale to, say UTF-16LE?
> The default encoding is what you get when no locale is imbued
> explicitly, i.e. it refers to the "C" locale.
That's *not* what you get when no locale is imbued explicitly.
You get the global locale at the time the object was
constructed. And where I live and work, this is rarely "C".
(Most programs start with a line:
std::locale::global( std::locale( "" ) ) ;
which installs a user-dependent locale.)
> My requirement applies to this default locale only in order to
> mimic the behaviour of filebuf:
> if I binary filebuf::sputn(char const str[], length) and then
> fopen(...) and fread(...), I have memcmp(...) == 0.
> Why not the same for wfilebuf?
Why? Or perhaps more pertinently, what would that mean? What
would reading bytes into wchar_t mean?
Comparing the results of filebuf::sputn() and fread/fwrite makes
some sense. Comparing fread/fwrite with wfilebuf is like
comparing apples to oranges. It makes no sense to write
wchar_t, read char, and expect them to compare identical. (It
also makes no sense to do binary IO on wfilebuf, but that is
another question.)
> > The only way you could possibly make this sort of guarantee
> > is by removing all support for different encodings; i.e. by
> > removing important functionality which is currently present
> > in both C and C++.
> Notwithstanding what I said, I personally think that binary
> files need not support different encodings. If you want it to
> be portable, do it in text mode.
Well, we agree here, at least. IMHO, binary means (or should
mean) binary, not binary translated as if it were some character
encoding.
> > If the C global locale and the imbued locale are the same,
> > and all of the characters are representable in the character
> > code implied by the locale, I would expect this to work
> > today.
> It does not. What you get when you fread it back is char *,
> not wchar_t *.
Good point. So the data isn't comparable, and using memcmp on
it is irrelevant.
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Author: kuyper@wizard.net
Date: Sun, 29 May 2005 18:48:48 CST Raw View
I apologize if this comes through as a duplicate response. I sent one
out a few days ago, but it hasn't shown up yet on any newsserver I have
access to (not even the one I used to post the message).
"Krzysztof Zelechowski" wrote:
> User <kuyper@wizard.net> wrote in message
> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
.
> > However, that doesn't mean you have to go back to fwrite() and fread()
> > to handle files containing actual wide characters. The
> > std::ostream::write() and std::istream::read() functions provide just
> > as much capability for reading and writing wide character files as the
> > stdio routines.
> >
>
> If I could wofstream::write wchar_t without codecvt::outing, I could also
> wfilebuf::sputn it because the former invokes the latter.
I was quite specific; I was referring to ostream::write(), not
wofstream::write().
> I have checked
> that I cannot; my text gets codecvt::outed no matter what.
Yes, but ostream is a typedef for basic_ostream<char>, which means it
uses codecvt<char,char,mbstate_t>, which according to 22.2.1.5p3 "...
implements a degenerate conversion; it does not convert at all".
> ... And indeed,
> fwrite does not wctomb only if it is created as binary (whereas
Well yes, I was thinking in terms of binary files.
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Mon, 30 May 2005 05:41:40 GMT Raw View
James Daughtry <mordock32@hotmail.com> wrote:
>> Class basic_filebuf does not allow storing wide characters to files.
> Where does the standard imply this? Last I checked, both the stream
> layer and the streambuf layer pay close attention to supporting wide
> characters all across the board.
You must have overlooked the title. It is
27.8.1 underlaid character type is char
And that means that files contain plain characters, not wide characters.
I agree that a streambuf supports wide characters but, being effectively
an abstract base class, it just fails on every operation. Which is
quite correct for an abstract base object. Now, a wfilebuf (thanks to
Mr Barbati for correcting me) is what I address here. The situation is
that its operations depend on the character set selected in the locale
component. In order for it to be able to store your output in an
external file, all your characters must be supported by the selected
locale.
This is not an issue in a localized application, but it is a problem in
an internationalized one. And it is a problem for me because I want to
map the file I have produced to the memory and use it as a wchar_t *,
not as a char *.
Christopher
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Mon, 30 May 2005 05:45:55 GMT Raw View
"Bo Persson" <bop@gmb.dk> wrote:
>
> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelandet
> news:d6tf81$ma3$1@sklad.atcom.net.pl...
>> Class basic_filebuf does not allow storing wide characters to files.
>> This way the flexibility and power of Unicode is killed because we
>> cannot store arbitrary wide characters in a file.
>
> Of course you can. You just have to convert a wide character sequence
> into a byte sequence. The standard doesn't put restrictions on how many
> bytes are needed for each wide character.
Do I have to convert it myself? I feel that such functionality should
be standard.
Chris
Author: bop@gmb.dk ("Bo Persson")
Date: Mon, 30 May 2005 16:39:24 GMT Raw View
"Christopher Conrade Zseleghovski" <kkz@duch.mimuw.edu.pl> skrev i
meddelandet news:d7de1d$rfi$1@achot.icm.edu.pl...
> "Bo Persson" <bop@gmb.dk> wrote:
>>
>> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelandet
>> news:d6tf81$ma3$1@sklad.atcom.net.pl...
>>> Class basic_filebuf does not allow storing wide characters to files.
>>> This way the flexibility and power of Unicode is killed because we
>>> cannot store arbitrary wide characters in a file.
>>
>> Of course you can. You just have to convert a wide character sequence
>> into a byte sequence. The standard doesn't put restrictions on how
>> many
>> bytes are needed for each wide character.
>
> Do I have to convert it myself?
The conversion is performed by the std::codecvt facet.
If you have specific requirements, you might have to choose a locale
containing the proper conversions, or add your own codecvt.
>I feel that such functionality should
> be standard.
The standard only contains the requirements that *all* implementations
must fulfill. Therefore, you cannot be sure what locales are available,
or what they are called. C++ can be implemented for an OS where the file
system doesn't support Unicode files (or any files, for that matter).
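For illustration, a minimal C++98-style sketch of the shape such a
user-supplied codecvt facet might take. raw_codecvt is a hypothetical
name, and only the output direction is shown, copying the bytes of
each wchar_t unchanged:

    #include <cstddef>
    #include <locale>

    class raw_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
    protected:
        virtual result do_out(std::mbstate_t&,
                              wchar_t const* from, wchar_t const* from_end,
                              wchar_t const*& from_next,
                              char* to, char* to_end, char*& to_next) const
        {
            // Emit the raw bytes of each wchar_t, stopping when the
            // output buffer cannot hold another complete element.
            while (from != from_end && to + sizeof(wchar_t) <= to_end) {
                char const* bytes = reinterpret_cast<char const*>(from);
                for (std::size_t i = 0; i != sizeof(wchar_t); ++i)
                    *to++ = bytes[i];
                ++from;
            }
            from_next = from;
            to_next = to;
            return from == from_end ? ok : partial;
        }
        virtual bool do_always_noconv() const throw() { return false; }
    };

Such a facet would be installed with
stream.imbue(std::locale(std::locale(), new raw_codecvt)) before the
file is opened.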
Bo Persson
Author: hyrosen@mail.com (Hyman Rosen)
Date: Mon, 30 May 2005 16:38:14 GMT Raw View
Christopher Conrade Zseleghovski wrote:
> Do I have to convert it myself? I feel that such functionality should
> be standard.
Have you read the standard? The wide filebufs convert the characters
using the imbued codecvt facet.
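A minimal sketch of that, assuming a named locale is available (the
locale name is illustrative and platform-dependent):

    #include <fstream>
    #include <locale>

    int main()
    {
        std::wofstream out;
        // std::locale throws if the name is unknown on this system.
        out.imbue(std::locale("ja_JP.eucJP"));
        out.open("text.txt");
        out << L"wide text";   // converted by the imbued codecvt facet
        return 0;
    }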
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Mon, 30 May 2005 16:39:03 GMT Raw View
"Christopher Conrade Zseleghovski" <kkz@duch.mimuw.edu.pl> wrote in message
news:d7de1d$rfi$1@achot.icm.edu.pl...
> "Bo Persson" <bop@gmb.dk> wrote:
>>
>> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelandet
>> news:d6tf81$ma3$1@sklad.atcom.net.pl...
>>> Class basic_filebuf does not allow storing wide characters to files.
>>> This way the flexibility and power of Unicode is killed because we
>>> cannot store arbitrary wide characters in a file.
>>
>> Of course you can. You just have to convert a wide character sequence
>> into a byte sequence. The standard doesn't put restrictions on how many
>> bytes are needed for each wide character.
>
> Do I have to convert it myself? I feel that such functionality should
> be standard.
Using what standard? C and C++ have traditionally been character-set
neutral. But even if you commit to Unicode, you have the choice of
writing a file as UTF-8, UTF-16, UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE,
etc. etc. I named these six as just the ones that various people
have characterized as "obvious" in postings over the past couple of
years. Still widely used are JIS, Shift JIS, and EUC for the Japanese
market and numerous others for equally important constituencies.
We provide a collection of *dozens* of codecvt facets, including
ones for all of the above, in our CoreX library. Even so, we still
get occasional requests for exotic variants not in that package.
I will, however, provide one interesting data point. We originally
shipped our libraries with a degenerate conversion rule -- if the
wide-character fit in a single byte, we output it as-is; otherwise
we reported a conversion failure. (Microsoft still offers this rule
by default, IIRC.) Needless to say, we got regular complaints about
this simplistic approach. So a few years ago, we began shipping our
C and C++ libraries with a different default. The exterior encoding
is UTF-8, the interior one either UCS-2 or UCS-4, depending on the
size of wchar_t. Has anyone praised us for this decision? Nope.
But more important, we've received *zero complaints* since we
made the change. From a support standpoint, we consider this the
highest of praise.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: yecril@bluebottle.com ("Krzysztof Żelechowski")
Date: Mon, 30 May 2005 21:06:22 GMT Raw View
U ytkownik ""P.J. Plauger"" <pjp@dinkumware.com> napisa w wiadomo ci
news:0IHA00MBJZFF3D@firewall.mci.com...
> "Christopher Conrade Zseleghovski" <kkz@duch.mimuw.edu.pl> wrote in
> message news:d7de1d$rfi$1@achot.icm.edu.pl...
>
>> "Bo Persson" <bop@gmb.dk> wrote:
>>>
>>> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelandet
>>> news:d6tf81$ma3$1@sklad.atcom.net.pl...
>>>> Class basic_filebuf does not allow storing wide characters to files.
>>>> This way the flexibility and power of Unicode is killed because we
>>>> cannot store arbitrary wide characters in a file.
>>>
>>> Of course you can. You just have to convert a wide character sequence
>>> into a byte sequence. The standard doesn't put restrictions on how many
>>> bytes are needed for each wide character.
>>
>> Do I have to convert it myself? I feel that such functionality should
>> be standard.
>
> Using what standard? C and C++ have traditionally been character-set
> neutral. But even if you commit to Unicode, you have the choice of
> writing a file as UTF-8, UTF-16, UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE,
> etc. etc.
The default encoding should fulfill the following condition: if I binary
wfilebuf::sputn(wchar_t const str[], length) and then fopen the file for
binary input and fread(wchar_t *buff, sizeof *buff, length, file) it back,
then I should have memcmp(buff, str, length * sizeof *buff) == 0. It does
not seem very complicated, does it?
Christopher
> I will, however, provide one interesting data point. We originally
> shipped our libraries with a degenerate conversion rule -- if the
> wide-character fit in a single byte, we output it as-is; otherwise
> we reported a conversion failure. (Microsoft still offers this rule
> by default, IIRC.)
Aren't your libraries bundled with Microsoft Visual C++? They do not output
UTF-8, neither by default nor on demand.
Chris
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Tue, 31 May 2005 16:55:41 GMT Raw View
Uzytkownik "James Daughtry" <mordock32@hotmail.com> napisal w wiadomosci
news:1116943821.423256.16010@f14g2000cwb.googlegroups.com...
>> Class basic_filebuf does not allow storing wide characters to files.
> Where does the standard imply this?
27.8.1 underlaid character type is char
> Last I checked, both the stream
> layer and the streambuf layer pay close attention to supporting wide
> characters all across the board.
>
It is possible to store a wide char from the programmer's point of view, but
only from a logical perspective. Wide characters are converted (more
precisely, codecvt::outed) to multibyte characters behind the scenes; if you
mmap the file back, you get char *, not wchar_t *. Besides, this process is
fragile and locale-dependent: you can easily end up with a string you cannot
store into the stream (it fails and sets errno to EILSEQ). And what are you
going to do then? "Sorry, cannot write the file, please choose a different
locale?" And what if the text your client wants to store is multiscript?
chris
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Tue, 31 May 2005 16:56:44 GMT Raw View
Sorry for top-posting, dunno how to fix it in my reader (M$OE_pl).
> (Note: it's very unfortunate that on several platforms wchar_t has 16
> bits. You need 21 bits to store a Unicode code point, so if you want to
> use 16-bit wchar_t either you restrict yourself to a subset of Unicode
> characters or you use UTF-16 internally).
I suggest using long wchar_t for this purpose. I think that wchar_t is not
a good identifier for a reserved word. I think we could do without it:
consider
typedef long char wchar_t;
typedef long long char lchar_t;
Ain't I smart?
Now to the issues.
1) You can do this in two steps: codecvt<lchar_t, char, mbstate_t>::in;
codecvt<lchar_t, wchar_t, mbstate_t>::out.
2) Pipes and sockets are not for the blind because they cannot be published.
It is natural that you should know the encoding in advance.
3) 27.8.1.4/19 seems really erratic. Suppose you have already read some
text in English and you are instructed that the text that follows is going
to be in French. This paragraph implies you could say "Wait a minute, I
have to translate the previous paragraphs to French". Total nonsense.
4) This is a minor issue. You really cannot optimize console input/output
because of hardware limitations.
Thanks for your extensive explanation.
Christopher
Uzytkownik "Alberto Barbati" <AlbertoBarbati@libero.it> napisal w wiadomosci
news:4PHke.24619$795.760921@twister1.libero.it...
When you read from the file, you just have to take one or more of these
8-bit quantities and assemble them to get the Unicode character. This
character (or more precisely, its code point) is kept *in memory* in
some type which is hopefully larger than a char, usually a wchar_t.
(Note: it's very unfortunate that on several platforms wchar_t has 16
bits. You need 21 bits to store a Unicode code point, so if you want to
use 16-bit wchar_t either you restrict yourself to a subset of Unicode
characters or you use UTF-16 internally).
That's where file stream objects come in. They provide an abstraction
above external representation (a sequence of bytes) and an internal
representation (a sequence of wchar_t or bigger integral type). The
actual conversion is done by the codecvt<> facet.
Those were the good news. Now the bad ones.
The codecvt is not that good in providing the conversion. Some time ago
I attempted writing a library of codecvt facets to provide all three
UTF conversions. All worked well enough on VC7.1 and I posted them onto
Boost. However, reviewers pointed out that the cooperation between
std::filebuf and codecvt facets is described in the C++ standard in such
a way that an implementation could have used my facets in different ways
from those I expected and that would have been a problem. I then decided
to think about it more deeply, but other tasks took precedence.
However, I see that the committee is now calling for TR2 proposals and
one of the subjects is... Unicode! Could this be the right time to
re-think about this issue?
IIRC, the main problems of the filebuf/codecvt contract were:
1) codecvt de facto implements an n-to-1 encoding (n external char, 1
internal element) while an n-to-m encoding would be necessary to
correctly implement the conversion UTF-8 (external) to UTF-16
(internal). LWG appears to be working hard on this (see, for example,
LWG issue #382).
2) 27.8.1.4/17 states that the facet cannot be changed unless the file
is positioned at the beginning. This requirement is a pain in the neck
for non-rewindable file-like objects (like pipe, sockets, etc.) but also
for regular (disk) files it's a problem because you have to decide the
encoding without reading any characters that may provide some kind of
signature. The alternative (using a runtime switch for each single
character) is going to be very inefficient.
3) 27.8.1.4/19 implies that the file might need to be rewound in some
way. It's not very clear how to implement this requirement for
non-rewindable file-like objects
4) some file-like objects must be read one character at a time, each
read operation potentially blocking, having the codecvt support this
case disallows some optimizations
Our options from here to TR2 include:
1) fix the filebuf/codecvt contract and stick to codecvt facets
2) create new buffer (and stream) objects as for example the
Boost.Iostreams's filtering/converting buffers
Are there other options? Which one has the most chances of providing the
best support for Unicode?
Alberto
Author: kuyper@wizard.net
Date: Tue, 31 May 2005 11:57:45 CST Raw View
"Krzysztof Zelechowski" wrote:
> User <kuyper@wizard.net> wrote in message
> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
.
> > However, that doesn't mean you have to go back to fwrite() and fread()
> > to handle files containing actual wide characters. The
> > std::ostream::write() and std::istream::read() functions provide just
> > as much capability for reading and writing wide character files as the
> > stdio routines.
> >
>
> If I could wofstream::write wchar_t without codecvt::outing, I could also
> wfilebuf::sputn it because the former invokes the latter. ...
I thought I was sufficiently specific: I said ostream, not wofstream.
Any POD type can be written as a string of bytes, and unless I've
missed something (a very distinct possibility), std::ostream::write()
performs such a write with precisely the same lack of conversion that
fwrite() does.
> I have checked that I cannot; my text gets codecvt::outed no matter what.
std::ostream is a typedef for basic_ostream<char>, and therefore has
charT==char and traits==char_traits<char>. Therefore it uses
codecvt<char,char,mbstate_t>, which "implements the degenerate
conversion: it does not convert at all." (22.2.1.5p3)
> ... And indeed, fwrite does not wctomb only if it is created as binary
Well, I was assuming the use of a file created as binary. I just didn't
think it was necessary to say so.
Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 31 May 2005 17:00:19 GMT Raw View
Christopher Conrade Zseleghovski wrote:
> And it is a problem for me because I want to map the file I have produced
> to the memory and use it as a wchar_t *, not as a char *.
Too bad. Most operating systems do not support such a thing. If you need
portability between different systems your case is especially hopeless
because of byte-order differences. But you can always create your own
filebuf that writes wchar_t's to the underlying file if you want.
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Tue, 31 May 2005 17:00:29 GMT Raw View
""Krzysztof =AFelechowski"" <yecril@bluebottle.com> wrote in message=20
news:d7fq7b$r7o$1@sklad.atcom.net.pl...
> U=BFytkownik ""P.J. Plauger"" <pjp@dinkumware.com> napisa=B3 w wiadomo=B6=
ci=20
> news:0IHA00MBJZFF3D@firewall.mci.com...
>> "Christopher Conrade Zseleghovski" <kkz@duch.mimuw.edu.pl> wrote in=20
>> message news:d7de1d$rfi$1@achot.icm.edu.pl...
>>
>>> "Bo Persson" <bop@gmb.dk> wrote:
>>>>
>>>> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelande=
t
>>>> news:d6tf81$ma3$1@sklad.atcom.net.pl...
>>>>> Class basic_filebuf does not allow storing wide characters to files.
>>>>> This way the flexibility and power of Unicode is killed because we
>>>>> cannot store arbitrary wide characters in a file.
>>>>
>>>> Of course you can. You just have to convert a wide character sequenc=
e
>>>> into a byte sequence. The standard doesn't put restrictions on how m=
any
>>>> bytes are needed for each wide character.
>>>
>>> Do I have to convert it myself? I feel that such functionality shoul=
d
>>> be standard.
>>
>> Using what standard? C and C++ have traditionally been character-set
>> neutral. But even if you commit to Unicode, you have the choice of
>> writing a file as UTF-8, UTF-16, UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE,
>> etc. etc.
>
> The default encoding should fulfill the following condition: if I bina=
ry=20
> wfilebuf::sputn(wchar_t const str[], length) and then fopen the file fo=
r=20
> binary input and fread(wchar_t *buff, sizeof *buff, length, file) it ba=
ck,=20
> then I should have memcmp(buff, str, length * sizeof *buff) =3D=3D 0. =
It does=20
> not seem very complicated, does it?
It's not complicated. It might even be desirable under many
circumstances. But it's *not* required by the C++ Standard.
> Christopher
>
>> I will, however, provide one interesting data point. We originally
>> shipped our libraries with a degenerate conversion rule -- if the
>> wide-character fit in a single byte, we output it as-is; otherwise
>> we reported a conversion failure. (Microsoft still offers this rule
>> by default, IIRC.)
>
> Aren't your libraries bundled with Microsoft Visual C++?
Not all of them, no.
> They do not output UTF-8, neither by default nor on demand.
No.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Tue, 31 May 2005 17:00:11 GMT Raw View
Hyman Rosen <hyrosen@mail.com> wrote:
> Krzysztof Zelechowski wrote:
>> If you fopen a file for binary, it does not recode Unicode characters to
>> multibyte; it stores and reads them in native format.
>> It is not the case with a fstreambuf: an fstreambuf ALWAYS recodes.
>
> That's not right. It's not the fopen or binary that matters,
> it's what function you use to read from the stream. If you
> read from a FILE* with fgetc then you get a byte. If you use
I use fread and I get converted bytes in text mode.
Chris
Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 31 May 2005 17:00:40 GMT Raw View
Krzysztof Żelechowski wrote:
> The default encoding should fulfill the following condition
Why? It seems completely useless. Very few people have files
encoded in this way, not least because of portability. A file
encoded as you propose does not encode byte ordering, for
example.
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Tue, 31 May 2005 21:00:14 GMT Raw View
"P.J. Plauger" <pjp@dinkumware.com> wrote:
> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> wrote in message
> news:d7fq7b$r7o$1@sklad.atcom.net.pl...
>
>> U?ytkownik ""P.J. Plauger"" <pjp@dinkumware.com> napisa? w wiadomo?ci
>> news:0IHA00MBJZFF3D@firewall.mci.com...
>>
>> The default encoding should fulfill the following condition: if I binary
>> wfilebuf::sputn(wchar_t const str[], length) and then fopen the file for
>> binary input and fread(wchar_t *buff, sizeof *buff, length, file) it back,
>> then I should have memcmp(buff, str, length * sizeof *buff) == 0. It does
>> not seem very complicated, does it?
>
> It's not complicated. It might even be desirable under many
> circumstances. But it's *not* required by the C++ Standard.
>
Let me suggest an amendment: it should be required by the standard. It
is not an excessive requirement to be able to put any data stream to a
stream buffer. All standard stream buffers *except* the
basic_streambuf, which is an incomplete object and hence consistently
fails, and the wfilebuf, which fails at random because of its bad
design, guarantee this provided system resources are not exhausted.
Christopher
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Tue, 31 May 2005 21:10:20 GMT Raw View
kuyper@wizard.net wrote:
> I apologize if this comes through as a duplicate response. I sent one
> out a few days ago, but it hasn't shown up yet on any newsserver I have
> access to (not even the one I used to post the message).
>
Thanks for reposting.
> "Krzysztof Zelechowski" wrote:
>> User <kuyper@wizard.net> wrote in message
>> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
> .
>> > However, that doesn't mean you have to go back to fwrite() and fread()
>> > to handle files containing actual wide characters. The
>> > std::ostream::write() and std::istream::read() functions provide just
>> > as much capability for reading and writing wide character files as the
>> > stdio routines.
>> >
>>
>> If I could wofstream::write wchar_t without codecvt::outing, I could also
>> wfilebuf::sputn it because the former invokes the latter.
>
> I was quite specific; I was referring to ostream::write(), not
> wofstream::write().
>
Oh I see. But I cannot ostream::write(wchar_t const []); there is no
such method.
Christopher
Author: kuyper@wizard.net
Date: Tue, 31 May 2005 22:39:32 CST Raw View
Christopher Conrade Zseleghovski wrote:
.
> > "Krzysztof Zelechowski" wrote:
> >> User <kuyper@wizard.net> wrote in message
> >> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
> > .
> >> > However, that doesn't mean you have to go back to fwrite() and fread()
> >> > to handle files containing actual wide characters. The
> >> > std::ostream::write() and std::istream::read() functions provide just
> >> > as much capability for reading and writing wide character files as the
> >> > stdio routines.
> >> >
> >>
> >> If I could wofstream::write wchar_t without codecvt::outing, I could also
> >> wfilebuf::sputn it because the former invokes the latter.
> >
> > I was quite specific; I was referring to ostream::write(), not
> > wofstream::write().
> >
>
> Oh I see. But I cannot ostream::write(wchar_t const []); there is no
> such method.
Agreed; you must cast to (char*) before writing. That produces
essentially the same result as using std::fwrite().
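A minimal sketch of that cast-based approach (the file name is
illustrative; error checking omitted):

    #include <fstream>

    int main()
    {
        wchar_t const str[] = L"payload";
        std::ofstream out("data.bin", std::ios::out | std::ios::binary);
        // basic_ostream<char>::write() copies the bytes through the
        // degenerate codecvt<char, char, mbstate_t>, i.e. unchanged.
        out.write(reinterpret_cast<char const*>(str), sizeof str);
        return 0;
    }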
Author: kkz@duch.mimuw.edu.pl (Christopher Conrade Zseleghovski)
Date: Thu, 2 Jun 2005 21:52:11 GMT Raw View
kuyper@wizard.net wrote:
> Christopher Conrade Zseleghovski wrote:
> .
>> > "Krzysztof Zelechowski" wrote:
>> >> User <kuyper@wizard.net> wrote in message
>> >> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
>> > .
>> >> > However, that doesn't mean you have to go back to fwrite() and fread()
>> >> > to handle files containing actual wide characters. The
>> >> > std::ostream::write() and std::istream::read() functions provide just
>> >> > as much capability for reading and writing wide character files as the
>> >> > stdio routines.
>> >> >
>> >>
>> >> If I could wofstream::write wchar_t without codecvt::outing, I could also
>> >> wfilebuf::sputn it because the former invokes the latter.
>> >
>> > I was quite specific; I was referring to ostream::write(), not
>> > wofstream::write().
>> >
>>
>> Oh I see. But I cannot ostream::write(wchar_t const []); there is no
>> such method.
>
> Agreed; you must cast to (char*) before writing. That produces
> essentially the same result as using std::fwrite().
>
Isn't this cast dirty? It seems cleaner to fwrite(void const *).
Chris
Author: kanze@gabi-soft.fr
Date: Mon, 6 Jun 2005 10:14:28 CST Raw View
"Krzysztof elechowski" wrote:
> U ytkownik ""P.J. Plauger"" <pjp@dinkumware.com> napisa w wiadomo ci
> news:0IHA00MBJZFF3D@firewall.mci.com...
> > "Christopher Conrade Zseleghovski" <kkz@duch.mimuw.edu.pl> wrote in
> > message news:d7de1d$rfi$1@achot.icm.edu.pl...
> >> "Bo Persson" <bop@gmb.dk> wrote:
> >>> ""Krzysztof ?elechowski"" <yecril@bluebottle.com> skrev i meddelandet
> >>> news:d6tf81$ma3$1@sklad.atcom.net.pl...
> >>>> Class basic_filebuf does not allow storing wide
> >>>> characters to files. This way the flexibility and power
> >>>> of Unicode is killed because we cannot store arbitrary
> >>>> wide characters in a file.
> >>> Of course you can. You just have to convert a wide
> >>> character sequence into a byte sequence. The standard
> >>> doesn't put restrictions on how many bytes are needed for
> >>> each wide character.
> >> Do I have to convert it myself? I feel that such
> >> functionality should be standard.
> > Using what standard? C and C++ have traditionally been
> > character-set neutral. But even if you commit to Unicode,
> > you have the choice of writing a file as UTF-8, UTF-16,
> > UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE, etc. etc.
> The default encoding should fulfill the following condition:
> if I binary wfilebuf::sputn(wchar_t const str[], length) and
> then fopen the file for binary input and fread(wchar_t *buff,
> sizeof *buff, length, file) it back, then I should have
> memcmp(buff, str, length * sizeof *buff) == 0. It does not
> seem very complicated, does it?
Actually, it does. I don't understand the why of this request,
and of course, it is dangerously underspecified: what if I imbue
the output stream with a JIS encoded locale, but set the global
locale to, say UTF-16LE? The only way you could possibly make
this sort of guarantee is by removing all support for different
encodings; i.e. by removing important functionality which is
currently present in both C and C++.
If the C global locale and the imbued locale are the same, and
all of the characters are representable in the character code
implied by the locale, I would expect this to work today. But
that's a lot weaker requirement than what you seem to be saying.
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Author: kanze@gabi-soft.fr
Date: 6 Jun 2005 16:20:11 GMT Raw View
"Krzysztof Zelechowski" wrote:
[...]
> 2) Pipes and sockets are not for the blind because they cannot
> be published. It is natural that you should know the encoding
> in advance.
But it is precisely when reading from a socket that you
typically don't. Using any of the Internet text protocols, the
header itself is in ASCII, but it can generally specify that the
body is in any encoding it wants. You only know the encoding
once you've parsed the header.
> 3) 27.8.1.4/19 seems really erratic. Suppose you have
> already read some text in English and you are instructed that
> the text that follows is going to be in French. This
> paragraph implies you could say "Wait a minute, I have to
> translate the previous paragraphs to French". Total nonsense.
That's exactly the point that was being made. It's a frequent
case: a file in HTML format will generally start with something
like:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
Only after reading the third line do you know the actual
encoding.
There are heuristics for determining whether you start reading
using ASCII (which for the characters in the first three lines,
should be the same as UTF-8, ISO 8859-1, or any one of a number
of other encodings), UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE.
To use those heuristics, however, you have to be able to read
the first four bytes as raw data.
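A minimal sketch of such a heuristic, checking for a byte order mark
among the first raw bytes (the UTF-32 cases are omitted for brevity;
the names are illustrative):

    #include <cstdio>

    enum Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

    Encoding sniff(std::FILE* file)
    {
        unsigned char b[4] = { 0, 0, 0, 0 };
        std::size_t n = std::fread(b, 1, 4, file);
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return Utf8;                       // EF BB BF: UTF-8 BOM
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return Utf16LE;                    // FF FE: UTF-16LE BOM
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Utf16BE;                    // FE FF: UTF-16BE BOM
        return Unknown;   // no BOM: fall back to content-based guessing
    }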
Not being able to change the encoding en route is a pretty
drastic limitation. (Another nice feature would be to adapt
automatically to the newline conventions used in the file.
Because, of course, a lot of people today use network mounted
files, and read and write the same file from several different
OS's.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Author: yecril@bluebottle.com ("Krzysztof Żelechowski")
Date: Tue, 24 May 2005 04:10:11 GMT Raw View
Class basic_filebuf does not allow storing wide characters to files. This
way the flexibility and power of Unicode is killed because we cannot store
arbitrary wide characters in a file. What sense does it make? It is a big
step backward. Or is the whole story about wchar_t limited to character
sets by design and the wide character methods have nothing to do with
Unicode? Do I have to fall back to stdio to support Unicode text?
Astonished,
Christopher
Author: "James Daughtry" <mordock32@hotmail.com>
Date: Tue, 24 May 2005 12:58:21 CST Raw View
> Class basic_filebuf does not allow storing wide characters to files.
Where does the standard imply this? Last I checked, both the stream
layer and the streambuf layer pay close attention to supporting wide
characters all across the board.
Author: hyrosen@mail.com (Hyman Rosen)
Date: Tue, 24 May 2005 17:56:42 GMT Raw View
Krzysztof Żelechowski wrote:
> Do I have to fall back to stdio to support Unicode text?
I'm curious. How does stdio support Unicode text?
In the vast majority of current systems, files are sequences
of chars, not sequences of "Unicode text". You can have a
basic_filebuf<wchar_t> and have it read and write encoded
Unicode in a variety of formats, I believe.
Author: bop@gmb.dk ("Bo Persson")
Date: Tue, 24 May 2005 20:31:12 GMT Raw View
""Krzysztof elechowski"" <yecril@bluebottle.com> skrev i meddelandet
news:d6tf81$ma3$1@sklad.atcom.net.pl...
> Class basic_filebuf does not allow storing wide characters to files.
> This way the flexibility and power of Unicode is killed because we
> cannot store arbitrary wide characters in a file.
Of course you can. You just have to convert a wide character sequence
into a byte sequence. The standard doesn't put restrictions on how many
bytes are needed for each wide character.
The C++ standard defines 'char' as an equivalent of 'byte'.
Bo Persson
Author: AlbertoBarbati@libero.it (Alberto Barbati)
Date: Tue, 24 May 2005 20:30:53 GMT Raw View
Krzysztof Żelechowski wrote:
> Class basic_filebuf does not allow storing wide characters to files. This
> way the flexibility and power of Unicode is killed because we cannot store
> arbitrary wide characters in a file. What sense does it make? It is a big
> step backward. Or is the whole story about wchar_t limited to character
> sets by design and the wide character methods have nothing to do with
> Unicode? Do I have to fall back to stdio to support Unicode text?
> Astonished,
> Christopher
You seem to be confusing things a bit.
An external file, that is, something physically stored on a disk or
whatever other medium, is modeled by C++ as a sequence of elements of
type char (see §27.8.1/2). I won't enter into the details of whether
this abstraction is good or bad, if it's there just for legacy with C
code, if it's necessary etc. Let's assume that it's a kind of god-given
requirement. Let's also assume (this assumption is arbitrary, but
otherwise the discussion will get immediately too gory) that type char
has 8 bits.
Does that disallow me to read/write a Unicode file? The answer is no.
Fact is that to *store* a Unicode file on a disk you need to choose an
encoding. The most commonly used encodings are either 8-bit based
(UTF-8, SCSU) or 16-bit based (UTF-16, UCS-2) or 32-bit based (UTF-32,
UCS-4). However, when you choose a 16-bit or 32-bit encoding, you also
need to choose the endianness, so eventually a Unicode file really is
just *stored* as a sequence of 8-bit quantities.
When you read from the file, you just have to take one or more of these
8-bit quantities and assemble them to get the Unicode character. This
character (or more precisely, its code point) is kept *in memory* in
some type which is hopefully larger than a char, usually a wchar_t.
(Note: it's very unfortunate that on several platforms wchar_t has 16
bits. You need 21 bits to store a Unicode code point, so if you want to
use 16-bit wchar_t either you restrict yourself to a subset of Unicode
characters or you use UTF-16 internally).
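As a sketch of that assembly step, decoding a single code point from a
well-formed UTF-8 sequence (no validation is performed):

    // Decode one code point from well-formed UTF-8; len receives the
    // number of bytes consumed.
    unsigned long decode_utf8(unsigned char const* p, int& len)
    {
        if (p[0] < 0x80) {                          // 1 byte: ASCII
            len = 1;
            return p[0];
        }
        if ((p[0] & 0xE0) == 0xC0) {                // 2 bytes
            len = 2;
            return ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3F);
        }
        if ((p[0] & 0xF0) == 0xE0) {                // 3 bytes
            len = 3;
            return ((p[0] & 0x0FUL) << 12)
                 | ((p[1] & 0x3FUL) << 6)
                 |  (p[2] & 0x3F);
        }
        len = 4;                                    // 4 bytes
        return ((p[0] & 0x07UL) << 18)
             | ((p[1] & 0x3FUL) << 12)
             | ((p[2] & 0x3FUL) << 6)
             |  (p[3] & 0x3F);
    }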
That's where file stream objects come in. They provide an abstraction
above external representation (a sequence of bytes) and an internal
representation (a sequence of wchar_t or bigger integral type). The
actual conversion is done by the codecvt<> facet.
Those were the good news. Now the bad ones.
The codecvt is not that good in providing the conversion. Some time ago
I attempted writing a library of codecvt facets to provide all three
UTF conversions. All worked well enough on VC7.1 and I posted them onto
Boost. However, reviewers pointed out that the cooperation between
std::filebuf and codecvt facets is described in the C++ standard in such
a way that an implementation could have used my facets in different ways
from those I expected and that would have been a problem. I then decided
to think about it more deeply, but other tasks took precedence.
However, I see that the committee is now calling for TR2 proposals and
one of the subjects is... Unicode! Could this be the right time to
re-think about this issue?
IIRC, the main problems of the filebuf/codecvt contract were:
1) codecvt de facto implements an n-to-1 encoding (n external char, 1
internal element) while an n-to-m encoding would be necessary to
correctly implement the conversion UTF-8 (external) to UTF-16
(internal). LWG appears to be working hard on this (see, for example,
LWG issue #382).
2) §27.8.1.4/17 states that the facet cannot be changed unless the file
is positioned at the beginning. This requirement is a pain in the neck
for non-rewindable file-like objects (like pipe, sockets, etc.) but also
for regular (disk) files it's a problem because you have to decide the
encoding without reading any characters that may provide some kind of
signature. The alternative (using a runtime switch for each single
character) is going to be very inefficient.
3) §27.8.1.4/19 implies that the file might need to be rewound in some
way. It's not very clear how to implement this requirement for
non-rewindable file-like objects.
4) some file-like objects must be read one character at a time, each
read operation potentially blocking, having the codecvt support this
case disallows some optimizations
Our options from here to TR2 include:
1) fix the filebuf/codecvt contract and stick to codecvt facets
2) create new buffer (and stream) objects as for example the
Boost.Iostreams's filtering/converting buffers
Are there other options? Which one has the most chances of providing the
best support for Unicode?
Alberto
Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: 25 May 2005 00:50:01 GMT Raw View
"Alberto Barbati" <AlbertoBarbati@libero.it> wrote in message
news:4PHke.24619$795.760921@twister1.libero.it...
> Our options from here to TR2 include:
> 1) fix the filebuf/codecvt contract and stick to codecvt facets
Indeed, that must be done, to improve codecvt portability
and to formally permit N-to-M conversions such as UTF-8/UTF-16.
See our CoreX library -- you can view the manual at our web
site. It provides dozens of codecvt facets that are portable
across all the popular Standard C++ libraries. But only the
Dinkumware library fully supports N-to-M.
> 2) create new buffer (and stream) objects as for example the
> Boost.Iostreams's filtering/converting buffers
We've proposed something pretty concrete, from our CoreX
library. The additions permit the use of *all* our codecvt
facets even on libraries whose native basic_filebuf is buggy
or incomplete.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Wed, 25 May 2005 15:40:49 GMT Raw View
Uzytkownik "Hyman Rosen" <hyrosen@mail.com> napisal w wiadomosci
news:G5Bke.18622$4d6.10646@trndny04...
Krzysztof elechowski wrote:
> Do I have to fall back to stdio to support Unicode text?
I'm curious. How does stdio support Unicode text?
In the vast majority of current systems, files are sequences
of chars, not sequences of "Unicode text". You can have a
basic_filebuf<wchar_t> and have it read and write encoded
Unicode in a variety of formats, I believe.
---
If you fopen a file for binary, it does not recode Unicode characters to
multibyte; it stores and reads them in native format.
It is not the case with a fstreambuf: an fstreambuf ALWAYS recodes.
Chris
Author: kuyper@wizard.net
Date: 25 May 2005 16:40:09 GMT Raw View
"Krzysztof elechowski" wrote:
> Class basic_filebuf does not allow storing wide characters to files. This
> way the flexibility and power of Unicode is killed because we cannot store
> arbitrary wide characters in a file. What sense does it make? It is a big
> step backward. Or is the whole story about wchar_t limited to character
> sets by design and the wide character methods have nothing to do with
> Unicode? Do I have to fall back to stdio to support Unicode text?
Using wchar_t to store Unicode characters is typical of the intended
use of wchar_t. However, the standard doesn't mandate any such use;
it's legal for wchar_t to be the same as char. For that matter, char
could be a 32 bit type used to store the execution character set with a
Unicode encoding. The standard doesn't say a lot, one way or another,
about the use of Unicode encodings.
I have no idea what you mean by "limited to character sets". The
definition of wchar_t is that it's big enough to hold any member of the
wide character set; what else did you expect of it that it doesn't
actually provide?
I believe you're referring to 27.8.1p2, which says "In a large
character set environment, multibyte character sequences are held in
files. In order to provide the contents of the file as wide character
sequences, wide oriented filebuf, namely wfilebuf, should convert wide
character sequences." It seems to me that this means that when reading
from the file, wfilebuf converts multibyte characters in the file into
wide characters in memory, and when writing to the file, it should
convert wide characters in memory to multibyte characters in the file.
However, that doesn't mean you have to go back to fwrite() and fread()
to handle files containing actual wide characters. The
std::ostream::write() and std::istream::read() functions provide just
as much capability for reading and writing wide character files as the
stdio routines.
Author: AlbertoBarbati@libero.it (Alberto Barbati)
Date: Thu, 26 May 2005 15:49:23 GMT Raw View
Krzysztof Zelechowski wrote:
> Uzytkownik "Hyman Rosen" <hyrosen@mail.com> napisal w wiadomosci=20
> news:G5Bke.18622$4d6.10646@trndny04...
> Krzysztof =AFelechowski wrote:
>=20
>>Do I have to fall back to stdio to support Unicode text?
>=20
>=20
> I'm curious. How does stdio support Unicode text?
>=20
> In the vast majority of current systems, files are sequences
> of chars, not sequences of "Unicode text". You can have a
> basic_filebuf<wchar_t> and have it read and write encoded
> Unicode in a variety of formats, I believe.
>=20
> ---
>=20
> If you fopen a file for binary, it does not recode Unicode characters t=
o=20
> multibyte; it stores and reads them in native format.
Sorry to be pedantic, but there is nothing like a "native format" for a
Unicode file. It's true that AFAIK Windows shows a preference for UCS-2
and Linux shows a preference for UTF-8, but other encodings are also
extensively used on both platforms, and in neither case is there an
"official" native format.
> It is not the case with a fstreambuf: an fstreambuf ALWAYS recodes.
(I guess you meant filebuf up there)
Well... with a filebuf you won't get very far, because it gives you
chars and chars cannot represent a Unicode character. A wfilebuf is a
lot better ;-) However, even a wfilebuf alone is not enough as what it
does is just a trivial decoding which, in most cases, simply maps 1-1
chars to wchar_t. If you want a little chance to actually have a
meaningful decoding you have to use both a wfilebuf *and* a suitable
codecvt facet.
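A minimal sketch of that pairing, using the standard codecvt_byname
facet (the locale name is illustrative and platform-dependent, and the
system locale must actually provide the conversion):

    #include <fstream>
    #include <locale>

    int main()
    {
        typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> cvt;
        std::wfilebuf buf;
        // Imbue a locale carrying the named conversion before opening.
        buf.pubimbue(std::locale(std::locale(), new cvt("en_US.UTF-8")));
        buf.open("text.txt", std::ios_base::out);
        buf.sputn(L"wide text", 9);   // converted by the imbued facet
        buf.close();
        return 0;
    }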
Alberto
Author: Hyman Rosen <hyrosen@mail.com>
Date: Thu, 26 May 2005 10:49:48 CST Raw View
Krzysztof Zelechowski wrote:
> If you fopen a file for binary, it does not recode Unicode characters to
> multibyte; it stores and reads them in native format.
> It is not the case with a fstreambuf: an fstreambuf ALWAYS recodes.
That's not right. It's not the fopen or binary that matters,
it's what function you use to read from the stream. If you
read from a FILE* with fgetc then you get a byte. If you use
the wide char readers, you get a converted wide char using
some encoding. Similarly, if you use a wchar_t filebuf then
you get multibyte encoding conversion and if you use a plain
char filebuf you don't.
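A minimal sketch of the stdio side of that distinction, setting the
orientation explicitly with fwide (error checking omitted):

    #include <cstdio>
    #include <cwchar>

    int main()
    {
        std::FILE* f = std::fopen("text.txt", "r");
        int byte = std::fgetc(f);          // one raw byte, no conversion
        std::fclose(f);

        f = std::fopen("text.txt", "r");
        std::fwide(f, 1);                  // make the stream wide-oriented
        std::wint_t wc = std::fgetwc(f);   // multibyte -> wide conversion
        std::fclose(f);

        (void)byte;
        (void)wc;
        return 0;
    }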
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Thu, 26 May 2005 15:54:18 GMT Raw View
User <kuyper@wizard.net> wrote in message
news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
> "Krzysztof Żelechowski" wrote:
>> Class basic_filebuf does not allow storing wide characters to files.
>> This
>> way the flexibility and power of Unicode is killed because we cannot
>> store
>> arbitrary wide characters in a file. What sense does it make? It is a
>> big
>> step backward. Or is the whole story about wchar_t limited to character
>> sets by design and the wide character methods have nothing to do with
>> Unicode? Do I have to fall back to stdio to support Unicode text?
>
> However, that doesn't mean you have to go back to fwrite() and fread()
> to handle files containing actual wide characters. The
> std::ostream::write() and std::istream::read() functions provide just
> as much capability for reading and writing wide character files as the
> stdio routines.
>
If I could wofstream::write wchar_t without codecvt::outing, I could also
wfilebuf::sputn it because the former invokes the latter. I have checked
that I cannot; my text gets codecvt::outed no matter what. And indeed,
fwrite does not wctomb only if it is created as binary (whereas
std::ios::binary has no effect on the encoding used).
But things are really worse than just that: there is no setting under which
wofstream could output the whole Unicode palette. It is always restricted
to some character set and if it encounters an extraneous wide character, it
fails and refuses to output anything more until it is cleared. Therefore,
there is no way for this process to be stable and reversible.
Chris
Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Thu, 26 May 2005 22:52:40 GMT Raw View
""Krzysztof Zelechowski"" <yecril@bluebottle.com> wrote in message=20
news:d73g6o$mfn$1@sklad.atcom.net.pl...
> Uzytkownik <kuyper@wizard.net> napisal w wiadomosci=20
> news:1117036475.010036.5360@g47g2000cwa.googlegroups.com...
>> "Krzysztof =AFelechowski" wrote:
>>> Class basic_filebuf does not allow storing wide characters to files.=20
>>> This
>>> way the flexibility and power of Unicode is killed because we cannot=20
>>> store
>>> arbitrary wide characters in a file. What sense does it make? It is=
a=20
>>> big
>>> step backward. Or is the whole story about wchar_t limited to charac=
ter
>>> sets by design and the wide character methods have nothing to do with
>>> Unicode? Do I have to fall back to stdio to support Unicode text?
>>
>> However, that doesn't mean you have to go back to fwrite() and fread()
>> to handle files containing actual wide characters. The
>> std::ostream::write() and std::istream::read() functions provide just
>> as much capability for reading and writing wide character files as the
>> stdio routines.
>>
>
> If I could wofstream::write wchar_t withoud codecvt::outing, I could al=
so=20
> wfilebuf::sputn it because the former invokes the latter. I have check=
ed=20
> that I cannot; my text gets codecvt::outed no matter what. And indeed,=
=20
> fwrite does not wctomb only if it is created as binary (whereas=20
> std::ios::binary has no effect on the encoding used).
> But things are really worse than just that: there is no setting under=20
> which wofstream could output the whole Unicode palette. It is always=20
> restricted to some character set and if it encounters an extraneous wid=
e=20
> character, it fails and refuses to output anything more until it is=20
> cleared. Therefore, there is no way for this process to be stable and=20
> reversible.
Not true. One of the codecvt facets in our CoreX library reads
and writes "extended UTF-8" that's 32-bit transparent. Stable and
reversible.
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Author: yecril@bluebottle.com ("Krzysztof Zelechowski")
Date: Sat, 28 May 2005 18:02:33 GMT Raw View
Sorry for top posting, my newsreader must really be out of order.
The native format is, by definition, what can be memory mapped back to a
wchar_t *. Thus, there is a difference between "native" and "official".
I really meant wfilebuf; I was being sloppy.
Now, a wfilebuf decodes characters according to its locale. It is not the
classic locale at my place in most cases, so Alberto's remark about 1-1 does
not apply. What troubles me is that I do not have a supporting codecvt facet
so that I can wfilebuf::sputn without causing it to fail when it encounters
e.g. an ellipsis (which is not supported under OEM locale) or a hyphen
(which no locale supports AFAIK). Such a codecvt facet should be in the
standard so that Mr Plauger cannot unbundle it and sell it separately for
$90 which my boss does not want to spend (skinflint!).
Thanks for your broad explanation in your other response.
Christopher
Uzytkownik "Alberto Barbati" <AlbertoBarbati@libero.it> napisal w wiadomosci
news:3Y1le.117536$IN.1994766@twister2.libero.it...
Krzysztof Zelechowski wrote:
> Uzytkownik "Hyman Rosen" <hyrosen@mail.com> napisal w wiadomosci
> news:G5Bke.18622$4d6.10646@trndny04...
> Krzysztof elechowski wrote:
>
>>Do I have to fall back to stdio to support Unicode text?
>
>
> I'm curious. How does stdio support Unicode text?
>
> In the vast majority of current systems, files are sequences
> of chars, not sequences of "Unicode text". You can have a
> basic_filebuf<wchar_t> and have it read and write encoded
> Unicode in a variety of formats, I believe.
>
> ---
>
> If you fopen a file for binary, it does not recode Unicode characters to
> multibyte; it stores and reads them in native format.
Sorry to be pedantic, but there is nothing like a "native format" for a
Unicode file. It's true that AFAIK Windows shows a preference for UCS-2
and Linux shows a preference for UTF-8, but other encodings are also
extensively used on both platforms, and in neither case is there an
"official" native format.
> It is not the case with a fstreambuf: an fstreambuf ALWAYS recodes.
(I guess you meant filebuf up there)
Well... with a filebuf you won't get very far, because it gives you
chars and chars cannot represent a Unicode character. A wfilebuf is a
lot better ;-) However, even a wfilebuf alone is not enough as what it
does is just a trivial decoding which, in most cases, simply maps 1-1
chars to wchar_t. If you want a little chance to actually have a
meaningful decoding you have to use both a wfilebuf *and* a suitable
codecvt facet.
Alberto