Topic: Unicode via wofstream: What am I doing wrong?


Author: "Ron Ruble" <raffles1@worldnet.att.net>
Date: 1999/03/18
P. J. Plauger just wrote about this subject in the "Standard
C++" column of the C/C++ Users Journal.

The long and short of it: many compiler vendors don't
write the high-order byte to a file opened for wide-character
output, and reject characters that have anything other than
0 in the high-order byte. The standard appears to allow this,
and MS chose to do it this way.
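
To make the failure concrete, here is a minimal sketch of that behaviour. It is a hypothetical model, not the actual library code: the function name narrow_until_failure is made up for illustration, and the assumption that output stops at the first unconvertible character is inferred from the hex dump in the original question, where nothing after 'B' reaches the file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of the narrowing described above, NOT the real
// library implementation: each wide character is written as one byte,
// and the first character with a nonzero high-order byte ends all output.
std::string narrow_until_failure(const std::wstring& text)
{
    std::string bytes;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        unsigned long code = static_cast<unsigned long>(text[i]);
        if (code > 0xFF)
            break;  // conversion error: the stream would go bad here
        bytes += static_cast<char>(static_cast<unsigned char>(code));
    }
    return bytes;
}
```

Fed the characters 65, 66, 267, 67, 68 from the question, this model yields just "AB": 267 is 0x010B, so its high-order byte is 0x01 and everything from that point on is lost.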

He writes on how to correct this. If you are only going to
read the characters back on the same platform, do the following.
(Note: this code is not guaranteed to write the same structure
as you saw in Word; it is only supposed to store the data in
an accurate form that can be read back the same way.)

You need the full article if you have any problems, but this
compiles okay:

#include <locale>   // codecvt, locale

using namespace std;
typedef codecvt<wchar_t, char, mbstate_t> Mybase;

    // CLASS Simple_codecvt

class Simple_codecvt : public Mybase
{
public:
    typedef wchar_t  _E;
    typedef char  _To;
    typedef mbstate_t _St;

    explicit Simple_codecvt(size_t _R = 0)
        : Mybase(_R) {}

protected:
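    // Returning noconv tells the file buffer that no conversion is needed.
    virtual result do_in(_St& _State,
        const _To *_F1, const _To *_L1, const _To *&_Mid1,
        _E *_F2, _E *_L2, _E *&_Mid2) const
        {return (noconv);}

    // The output limit must be _To*, not _E*, or this does not
    // override the base-class virtual.
    virtual result do_out(_St& _State,
        const _E *_F1, const _E *_L1, const _E *&_Mid1,
        _To *_F2, _To *_L2, _To *&_Mid2) const
        {return (noconv);}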

    virtual result do_unshift(_St& _State,
  _To *_F2, _To *_L2, _To *&_Mid2) const
 {return (noconv);}

    virtual int do_length(_St& _State, const _To *_F1,
  const _To *_L1, size_t _N2) const _THROW0()
 {return (_N2 < (size_t)(_L1 - _F1)
 ? _N2 : _L1 - _F1); }

    virtual bool do_always_noconv() const _THROW0()
 {return (true);}

    virtual int do_max_length() const _THROW0()
 {return (2);}

    virtual int do_encoding() const _THROW0()
 {return (2);}

};

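#include <fstream>
#include <iostream>  // cerr
#include <tchar.h>   // _tmain, TCHAR (VC++-specific)

int _tmain(int argc, TCHAR* argv[])
{
    const char *fname = "filename.txt"; // or whatever
    // _ADDFAC is the VC++ library's macro for adding a facet to a locale
    locale loc = _ADDFAC(locale::classic(), new Simple_codecvt);

    wofstream myostr;
    myostr.imbue(loc);  // imbue before opening the file
    myostr.open(fname, ios_base::binary);
    if (!myostr.is_open())
    {
        cerr << "can't write to " << fname << endl;
        return 1;
    }
    // write your wide characters to myostr here
    return 0;
}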

This is for private storage, where you know the file
contains UNICODE data that will be read on a processor with
the same byte ordering as the one it was created on. If you
want a truly portable file, you need a more complex code
conversion class, and you must also modify one of the VC++ headers.

The article tells how to do both. You'll need a copy; the web
site doesn't include the article.
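
If all you need is a file laid out like the Word example in the question (FF FE followed by low byte, high byte for each character), one way around the byte-order problem is to skip the wide stream and write the bytes yourself. The sketch below is not the article's method and does not use a codecvt facet at all. The function names are made up for illustration, and it assumes every character fits in 16 bits (no surrogate handling).

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Sketch only: encode each character as two bytes, low byte first,
// preceded by the byte-order mark FF FE. Assumes every character
// value is <= 0xFFFF; anything above that is truncated to 16 bits.
std::vector<unsigned char> to_utf16le_with_bom(const std::wstring& text)
{
    std::vector<unsigned char> bytes;
    bytes.push_back(0xFF);
    bytes.push_back(0xFE);
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        unsigned long code = static_cast<unsigned long>(text[i]) & 0xFFFFUL;
        bytes.push_back(static_cast<unsigned char>(code & 0xFF));         // low byte
        bytes.push_back(static_cast<unsigned char>((code >> 8) & 0xFF));  // high byte
    }
    return bytes;
}

// Write the encoded bytes through a narrow stream opened in binary
// mode, so no locale conversion or newline translation is applied.
bool write_utf16le_file(const char* fname, const std::wstring& text)
{
    std::vector<unsigned char> bytes = to_utf16le_with_bom(text);
    std::ofstream out(fname, std::ios_base::binary);
    if (!out)
        return false;
    out.write(reinterpret_cast<const char*>(&bytes[0]),
              static_cast<std::streamsize>(bytes.size()));
    return out.good();
}
```

Because the byte order is fixed in the code rather than taken from the machine, the result does not depend on the processor that wrote it. For the character 267 (0x010B) this produces the bytes 0B 01.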


John Zoch wrote in message <7cp2ef$oi1$1@news.jump.net>...
>I'm going nuts trying to figure this one out.  The output of the
>following code only puts 'A' and 'B' in the file unicode.txt, and they
>aren't wide.  Any ideas?
>
>Thanks for your help.
>
>-John
>
>---------------------------------------------------------------------------
>//Output from Hex editor of "unicode.txt"
>00000000 4142                                    AB
>
>Here's how I think it should look.
>//hex Output from a unicode text file generated with Microsoft Word.
>00000000 FFFE 5400 6800 6900 7300 2000 6900 7300 ..T.h.i.s. .i.s.
>00000010 2000 6100 2000 7400 6500 7300 7400 2E00  .a. .t.e.s.t...
>00000020 0D00 0A00                               ....
>
>
>//Microsoft Visual C++ 6.0 SP 2 code on a Windows NT 4.0 SP 4 box.
>#define _UNICODE
>#include <fstream>
>
>void main()
>{
> wchar_t unichar;
>
> std::wofstream fout;
> fout.open("unicode.txt");
>
> unichar = 65;  //ASCII 'A'
> fout << unichar;
>
> unichar = 66;  //ASCII 'B'
> fout << unichar;
>
> unichar = 267; //Some Unicode character.
> fout << unichar;
>
> unichar = 67;  //ASCII 'C'
> fout << unichar;
>
> unichar = 68;  //ASCII 'D'
> fout << unichar;
>
> fout.close();
>}


[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]






Author: Mark Levis <mlevis@bigfoot.com>
Date: 1999/03/19
John Zoch wrote:
>
> I'm going nuts trying to figure this one out.  The output of the
> following code only puts 'A' and 'B' in the file unicode.txt, and they aren't
> wide.  Any ideas?
>
> Thanks for your help.
>
> -John
> void main()

It should be int main(); the standard requires this!
> {
>  wchar_t unichar;
>
>  std::wofstream fout;
>  fout.open("unicode.txt");
>
[snip]
>  unichar = 267; //Some Unicode character.
>  fout << unichar;

Are you sure this is a valid character? A narrow character covers
0 - 255; a wide char may hold more, though.

Try asking unicode questions on microsoft.public.vc.*

[snip]






Author: "John Zoch" <jzoch@irsinc.com>
Date: 1999/03/18
I'm going nuts trying to figure this one out.  The output of the
following code only puts 'A' and 'B' in the file unicode.txt, and they aren't
wide.  Any ideas?

Thanks for your help.

-John

----------------------------------------------------------------------------
//Output from Hex editor of "unicode.txt"
00000000 4142                                    AB

Here's how I think it should look.
//hex Output from a unicode text file generated with Microsoft Word.
00000000 FFFE 5400 6800 6900 7300 2000 6900 7300 ..T.h.i.s. .i.s.
00000010 2000 6100 2000 7400 6500 7300 7400 2E00  .a. .t.e.s.t...
00000020 0D00 0A00                               ....


//Microsoft Visual C++ 6.0 SP 2 code on a Windows NT 4.0 SP 4 box.
#define _UNICODE
#include <fstream>

void main()
{
 wchar_t unichar;

 std::wofstream fout;
 fout.open("unicode.txt");

 unichar = 65;  //ASCII 'A'
 fout << unichar;

 unichar = 66;  //ASCII 'B'
 fout << unichar;

 unichar = 267; //Some Unicode character.
 fout << unichar;

 unichar = 67;  //ASCII 'C'
 fout << unichar;

 unichar = 68;  //ASCII 'D'
 fout << unichar;

 fout.close();
}





      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
