Thread

Topic: Reading and writing files using wifstream and wofstream

Author: wasti.redl@gmx.net
Date: Tue, 3 Mar 2009 10:14:57 CST

On Feb 26, 5:00 pm, Edward Diener <eldie...@tropicsoft.com> wrote:
> wasti.r...@gmx.net wrote:
> > This is not exactly correct. The relevant sections are:
>
> > 27.8.1 File Streams
> > "A file provides byte sequences. So the streambuf treats a file as the
> > external source/sink byte sequence. In a large character set
> > environment, multibyte character sequences are held in files. In order
> > to provide the contents of a file as wide character sequences, wide-
> > oriented filebuf, namely wfilebuf should convert wide character
> > sequences."
>
> My issue is with the term 'multibyte character sequences'.
>
> Does this mean that a wide character encoding must be converted to a
> multibyte encoding before writing to the file stream ? Or does it simply
> mean that the wide character encoding must be converted to a sequence of
> bytes before writing to the file stream ?
>

The latter. To the C++ standard, a multibyte encoding is any encoding
that uses a byte as the smallest unit and can have more than one unit
for a single character.
UTF-16 must therefore be considered as three possible encodings on
machines with 8-bit bytes: UTF-16LE, which splits each 16-bit unit into
two bytes in little-endian order, UTF-16BE, which uses big-endian
order, and plain UTF-16, whose byte order is indicated by a leading
byte order mark.
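
As a rough sketch of the point above (not code from the thread), the
same 16-bit code units can be turned into different byte sequences
depending on the byte order chosen. The names ByteOrder and to_bytes
are made up for illustration, and 8-bit bytes are assumed.

```cpp
#include <string>
#include <vector>

// Hypothetical helper type: which byte of each 16-bit unit comes first.
enum class ByteOrder { Little, Big };

// Serialize 16-bit code units into bytes in the requested order.
// If with_bom is true, U+FEFF is emitted first (in the same order),
// so that a reader can detect which order was used.
std::vector<unsigned char> to_bytes(const std::u16string& units,
                                    ByteOrder order,
                                    bool with_bom = false)
{
    std::vector<unsigned char> out;
    out.reserve((units.size() + (with_bom ? 1 : 0)) * 2);

    auto put = [&](char16_t u) {
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        const unsigned char hi = static_cast<unsigned char>((u >> 8) & 0xFF);
        if (order == ByteOrder::Little) {
            out.push_back(lo);
            out.push_back(hi);
        } else {
            out.push_back(hi);
            out.push_back(lo);
        }
    };

    if (with_bom) put(0xFEFF);
    for (char16_t u : units) put(u);
    return out;
}
```

For the two units of u"A\u20AC", little-endian order gives the bytes
41 00 AC 20 and big-endian order gives 00 41 20 AC. With the byte order
mark requested, FE FF (big-endian) or FF FE (little-endian) comes first.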

> >> Why was this specified in the C++ standard ?
>
> > Files are byte sequences. The standard *has* to specify how to convert
> > between wide character sequences and byte sequences.
>
> >> Does the same output and input processing occur with all wide
> >> character streams ? I can hardly believe that. If the file streams are
> >> the only IO stream where this was specified, why were they made the
> >> exception ?
>
> > Other streams don't necessarily deal with byte sequences as the
> > external sequence.
>
> Then the question is: why are files deemed byte sequences as their
> external sequence but other streams are not deemed byte sequences ? I
> think this is a mistake. Files dealt with as binary should be sequences
> of any data.

What else would they be but byte sequences? You can treat byte
sequences as any data, but you must specify the interpretation (e.g.
endianness). However, as long as files are bytewise readable as the
smallest unit, they're byte sequences.
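
To illustrate what "specifying the interpretation" means, here is a
minimal sketch that reads the same four bytes as a 32-bit value under
both byte orders. The function names read_u32_le and read_u32_be are
invented for this example, and 8-bit bytes are assumed.

```cpp
#include <cstdint>

// Interpret four bytes as an unsigned 32-bit value,
// least significant byte first (little-endian).
std::uint32_t read_u32_le(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Interpret the same four bytes most significant byte first (big-endian).
std::uint32_t read_u32_be(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}
```

The bytes 12 34 56 78 come out as 0x78563412 with read_u32_le and as
0x12345678 with read_u32_be. The file content is identical; only the
stated interpretation differs.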

I'm working on an I/O library that makes this explicit, but it's far
from publishable.

Sebastian


--
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@netlab.cs.rpi.edu]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html                      ]

Author: James Kanze <james.kanze@gmail.com>
Date: Wed, 4 Mar 2009 14:28:53 CST

On Feb 26, 5:00 pm, Edward Diener <eldie...@tropicsoft.com> wrote:
> wasti.r...@gmx.net wrote:
> > On Feb 20, 7:02 am, Edward Diener <eldie...@tropicsoft.com> wrote:
> >> I was told that the C++ standard requires that wide
> >> characters written to an output stream via wofstream be
> >> converted to a narrow character before being written and
> >> that characters being read in using wifstream be read as
> >> single characters and converted to a wide character after
> >> being read.

> >> Where in the C++ standard is this specified ?

> > This is not exactly correct. The relevant sections are:

> > 27.8.1 File Streams
> > "A file provides byte sequences. So the streambuf treats a
> > file as the external source/sink byte sequence. In a large
> > character set environment, multibyte character sequences are
> > held in files. In order to provide the contents of a file as
> > wide character sequences, wide- oriented filebuf, namely
> > wfilebuf should convert wide character sequences."

> My issue is with the term 'multibyte character sequences'.

> Does this mean that a wide character encoding must be
> converted to a multibyte encoding before writing to the file
> stream ? Or does it simply mean that the wide character
> encoding must be converted to a sequence of bytes before
> writing to the file stream ?

What's the difference, really?

> > 27.8.1.4p3 describes the semantics of underflow() and gives
> > an as-if implementation that uses the currently imbued
> > locale's codecvt facet to perform the conversion between
> > external and internal sequence. uflow () (p4) behaves the
> > same, overflow() too in reverse direction.

> > In other words, the streams perform exactly the conversion
> > dictated by the codecvt facet of their locale. By imbuing
> > the streams with a modified locale, you can get any external
> > encoding you want.

> I would have imagined that whatever the default locale is when
> using wofstream and wifstream that the result should be
> writing and reading wchar_ts as is.

Impossible on most systems.

> I understand that this is a compiler decision rather than
> anything dictated by the C++ standard.

> In Visual C++ 9, using wofstream to write wchar_ts converts
> each wchar_t to a multibyte encoding equivalent before writing
> it out.

The default locale could convert it to a multibyte encoding
where every character consists of two bytes, the first being the
low order byte of your wchar_t, and the second the high order
byte.

> >> Why was this specified in the C++ standard ?

> > Files are byte sequences. The standard *has* to specify how
> > to convert between wide character sequences and byte
> > sequences.

> >> Does the same output and input processing occur with all
> >> wide character streams ? I can hardly believe that. If the
> >> file streams are the only IO stream where this was
> >> specified, why were they made the exception ?

> > Other streams don't necessarily deal with byte sequences as
> > the external sequence.

> Then the question is: why are files deemed byte sequences as
> their external sequence but other streams are not deemed byte
> sequences?

Because that's the way most systems work.  Windows is an
exception, sort of, because it does support reading and writing
16 bit units to disks in some contexts.  It still requires a
byte sequence, however, if the disk is remote mounted, I think;
from what little I know, SMB and NFS only deal with byte
sequences.  And if the data is passing through just about any
other network protocol (HTTP, etc.), you need a byte sequence.

> I think this is a mistake. Files dealt with as binary should
> be sequences of any data.

There are certainly some things that could be improved.  The
concept of a "binary" stream doing code translation on input and
output is a bit strange to begin with; logically, a binary
stream would read and write blocks of raw memory (unsigned
char?), with nothing else intervening, and you wouldn't have
char or wchar_t binary streams.

--
James Kanze (GABI Software)             email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



Author: Edward Diener <eldiener@tropicsoft.com>
Date: Fri, 20 Feb 2009 00:02:13 CST

I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.

Where in the C++ standard is this specified ?

Why was this specified in the C++ standard ?

This was all great news to me. I had always assumed that the wofstream
and wifstream output and input wide characters respectively.

Does the same output and input processing occur with all wide
character streams ? I can hardly believe that. If the file streams are
the only IO stream where this was specified, why were they made the
exception ?


Author: wasti.redl@gmx.net
Date: Wed, 25 Feb 2009 10:01:06 CST

On Feb 20, 7:02 am, Edward Diener <eldie...@tropicsoft.com> wrote:
> I was told that the C++ standard requires that wide characters written
> to an output stream via wofstream be converted to a narrow character
> before being written and that characters being read in using wifstream
> be read as single characters and converted to a wide character after
> being read.
>
> Where in the C++ standard is this specified ?

This is not exactly correct. The relevant sections are:

27.8.1 File Streams
"A file provides byte sequences. So the streambuf treats a file as the
external source/sink byte sequence. In a large character set
environment, multibyte character sequences are held in files. In order
to provide the contents of a file as wide character sequences, wide-
oriented filebuf, namely wfilebuf should convert wide character
sequences."

27.8.1.4p3 describes the semantics of underflow() and gives an as-if
implementation that uses the currently imbued locale's codecvt facet
to perform the conversion between the external and internal sequences.
uflow() (p4) behaves the same way, and overflow() does the same in the
reverse direction.

In other words, the streams perform exactly the conversion dictated by
the codecvt facet of their locale. By imbuing the streams with a
modified locale, you can get any external encoding you want.
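
A minimal sketch of what imbuing a modified locale can look like. This
is an illustration, not code from the standard or from any library.
The facet two_byte_le and the helper write_wide are hypothetical names.
The facet writes each wchar_t as two bytes, low byte first, and assumes
every value fits in 16 bits (no surrogate handling).

```cpp
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet: external form is two bytes per wchar_t, low byte first.
class two_byte_le : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    // Internal (wchar_t) to external (bytes); used by overflow().
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_end - to_next < 2) return partial;   // output buffer full
            const unsigned long c = static_cast<unsigned long>(*from_next);
            if (c > 0xFFFF) return error;               // does not fit in 16 bits
            *to_next++ = static_cast<char>(c & 0xFF);
            *to_next++ = static_cast<char>((c >> 8) & 0xFF);
            ++from_next;
        }
        return ok;
    }

    // External (bytes) to internal (wchar_t); used by underflow().
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end,
                 const char*& from_next,
                 wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_end - from_next >= 2) {
            if (to_next == to_end) return partial;      // output buffer full
            const unsigned lo = static_cast<unsigned char>(from_next[0]);
            const unsigned hi = static_cast<unsigned char>(from_next[1]);
            *to_next++ = static_cast<wchar_t>(lo | (hi << 8));
            from_next += 2;
        }
        return from_next == from_end ? ok : partial;    // odd trailing byte
    }

    // The encoding is stateless, so there is nothing to unshift.
    result do_unshift(std::mbstate_t&, char* to, char*,
                      char*& to_next) const override
    {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 2; }      // fixed width
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 2; }

    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override
    {
        const std::size_t avail = static_cast<std::size_t>(from_end - from) / 2;
        return static_cast<int>(2 * (avail < max ? avail : max));
    }
};

// Write text through a wofstream whose locale carries the facet above.
bool write_wide(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening, so the file buffer uses the facet from the start.
    out.imbue(std::locale(std::locale::classic(), new two_byte_le));
    out.open(path.c_str(), std::ios::binary);
    out << text;
    out.close();
    return !out.fail();
}
```

With this facet in place, write_wide("out.bin", L"Hi") would be expected
to produce the four bytes 48 00 69 00. The locale takes ownership of the
facet created with new, so no explicit delete is needed.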

>
> Why was this specified in the C++ standard ?

Files are byte sequences. The standard *has* to specify how to convert
between wide character sequences and byte sequences.

> Does the same output and input processing occur with all wide
> character streams ? I can hardly believe that. If the file streams are
> the only IO stream where this was specified, why were they made the
> exception ?

Other streams don't necessarily deal with byte sequences as the
external sequence.

Sebastian


Author: Edward Diener <eldiener@tropicsoft.com>
Date: Thu, 26 Feb 2009 10:00:39 CST

wasti.redl@gmx.net wrote:
> On Feb 20, 7:02 am, Edward Diener <eldie...@tropicsoft.com> wrote:
>> I was told that the C++ standard requires that wide characters written
>> to an output stream via wofstream be converted to a narrow character
>> before being written and that characters being read in using wifstream
>> be read as single characters and converted to a wide character after
>> being read.
>>
>> Where in the C++ standard is this specified ?
>
> This is not exactly correct. The relevant sections are:
>
> 27.8.1 File Streams
> "A file provides byte sequences. So the streambuf treats a file as the
> external source/sink byte sequence. In a large character set
> environment, multibyte character sequences are held in files. In order
> to provide the contents of a file as wide character sequences, wide-
> oriented filebuf, namely wfilebuf should convert wide character
> sequences."

My issue is with the term 'multibyte character sequences'.

Does this mean that a wide character encoding must be converted to a
multibyte encoding before writing to the file stream ? Or does it simply
mean that the wide character encoding must be converted to a sequence of
bytes before writing to the file stream ?

>
> 27.8.1.4p3 describes the semantics of underflow() and gives an as-if
> implementation that uses the currently imbued locale's codecvt facet
> to perform the conversion between external and internal sequence. uflow
> () (p4) behaves the same, overflow() too in reverse direction.
>
> In other words, the streams perform exactly the conversion dictated by
> the codecvt facet of their locale. By imbuing the streams with a
> modified locale, you can get any external encoding you want.

I would have imagined that whatever the default locale is when using
wofstream and wifstream that the result should be writing and reading
wchar_ts as is. I understand that this is a compiler decision rather
than anything dictated by the C++ standard.

In Visual C++ 9, using wofstream to write wchar_ts converts each wchar_t
to a multibyte encoding equivalent before writing it out.

>
>> Why was this specified in the C++ standard ?
>
> Files are byte sequences. The standard *has* to specify how to convert
> between wide character sequences and byte sequences.
>
>> Does the same output and input processing occur with all wide
>> character streams ? I can hardly believe that. If the file streams are
>> the only IO stream where this was specified, why were they made the
>> exception ?
>
> Other streams don't necessarily deal with byte sequences as the
> external sequence.

Then the question is: why are files deemed byte sequences as their
external sequence but other streams are not deemed byte sequences ? I
think this is a mistake. Files dealt with as binary should be sequences
of any data.
