Thread

Topic: Using codecvt<>

Author: "James Keesey" <NEWSPOST@SANJOSE.VNET.IBM.COM>
Date: 1998/08/22


I'm starting to look at the internationalization features of the standard.
I began with wstring for working with nationalized strings which led to
"how do I get wstrings into/out of the system."  With some help from
J.Kanze I found codecvt<>.  So far, so good.

When I examined codecvt<>'s interface, the first thing I noticed was that
it did not support basic_string<>.  So my first question is why not?
Aren't we supposed to be leaving char[] and wchar_t[] in the past?

Having realized that I would have to roll my own, I set out to do so.  At
the bottom of this post is my first attempt at an object that converts from
one basic_string<> type to another.  It seems to work under VC++, but I
would appreciate any constructive comments or suggestions on improving it.

Assuming that I have it reasonably correct, this process has led me to
ask several questions:

1) Why is there no basic_string<> interface? (from above)

2) Why does it take so much code to do a simple conversion?  I appreciate
   having the power when I need it, but the codecvt<> interface seems too
   difficult for the common problems.

3) Why is there a codecvt<>.length(state, externT*, externT*, size_t) but
   not the inverse(converse?)?

4) Why isn't there a way to ask "given these characters, how many
   characters in the target type will it take to hold them?"  The length()
   function doesn't seem to exactly fit this purpose.

5) I only have CD2 and I was wondering if there were changes made to this
   section [lib.locale.codecvt] by the time the standard was finalized.

------------------------------ Code begin ------------------------------
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <memory>
#include <string>

//****************************************************************************
//
//  NOTE: internT and externT must be basic_string<> or types that implement
//        a similar interface.
//
//****************************************************************************
template <class internT, class externT, class stateT = std::mbstate_t>
class StringConverter
{
public:
    typedef
        std::codecvt<typename internT::value_type,
                     typename externT::value_type, stateT>
        convert_type;
//
// NOTE--VC++ 5.0 does not match CD2 and names the first two typedefs
// to_type and from_type, so use these lines there instead:
//  typedef typename convert_type::to_type   extern_type;
//  typedef typename convert_type::from_type intern_type;
    typedef typename convert_type::extern_type extern_type;
    typedef typename convert_type::intern_type intern_type;
    typedef typename convert_type::state_type  state_type;

    //*************************************************************************
    //
    //*************************************************************************
    StringConverter(int block_length = 100)
        : _blockLength(block_length), _conversion(0) { imbue(std::locale()); }

    //*************************************************************************
    //
    //*************************************************************************
    std::locale getloc() const { return _loc; }

    //*************************************************************************
    //
    //*************************************************************************
    std::locale imbue(const std::locale& loc)
    {
        std::locale ret = _loc;
        _loc = loc;
        // NOTE--VC++ 5.0 needs the non-standard three-argument form:
        //   std::use_facet<convert_type>(_loc, (convert_type*)0, true)
        _conversion = & std::use_facet<convert_type>(_loc);
        return ret;
    }

    //*************************************************************************
    //
    // This function takes a string in the external representation and
    // converts it to its internal representation.
    //
    //*************************************************************************
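    typename convert_type::result
    ex_to_in(const externT& fromStr, internT& toStr)
    {
        typename convert_type::result returnValue = convert_type::ok;

        toStr.erase();

        if (fromStr.size() == 0)
            return returnValue;

        // Value-initialize: mbstate_t need not be constructible from 0.
        state_type state = state_type();

        // in() takes the source pointers as pointers to const.
        const extern_type* from = fromStr.data();
        const extern_type* from_end = from + fromStr.size();
        const extern_type* from_next = from;

        // Let a string own the block; auto_ptr would call delete, not
        // delete[], on memory obtained from new[].
        internT to_buf(_blockLength, intern_type());
        intern_type* to = &to_buf[0];
        intern_type* to_end = to + _blockLength;
        intern_type* to_next = to;

        while (from != from_end) {
            typename convert_type::result res =
                _conversion->in(state,
                                from, from_end, from_next,
                                to,   to_end,   to_next);
            if (res == convert_type::noconv) {
                // Nothing was written: the two encodings are identical,
                // so pass the remaining input through unchanged.
                toStr.append(from, from_end);
                returnValue = res;
                break;
            }
            if (res == convert_type::error) {
                returnValue = res;
                break;
            }
            toStr.append(to, to_next - to);
            if (res == convert_type::partial &&
                ((from_next == from_end && to_next != to_end) ||
                 (from_next == from && to_next == to))) {
                // The input ends in an incomplete sequence, or no progress
                // was made; stop here rather than loop forever.
                returnValue = convert_type::partial;
                break;
            }
            from = from_next;
        }
        return returnValue;
    }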

    //*************************************************************************
    //
    // This function takes a string in the internal representation and
    // converts it to its external representation.
    //
    //*************************************************************************
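    typename convert_type::result
    in_to_ex(const internT& fromStr, externT& toStr)
    {
        typename convert_type::result returnValue = convert_type::ok;

        toStr.erase();

        if (fromStr.size() == 0)
            return returnValue;

        // Value-initialize: mbstate_t need not be constructible from 0.
        state_type state = state_type();

        // out() takes the source pointers as pointers to const.
        const intern_type* from = fromStr.data();
        const intern_type* from_end = from + fromStr.size();
        const intern_type* from_next = from;

        // Let a string own the block; auto_ptr would call delete, not
        // delete[], on memory obtained from new[].
        externT to_buf(_blockLength, extern_type());
        extern_type* to = &to_buf[0];
        extern_type* to_end = to + _blockLength;
        extern_type* to_next = to;

        while (from != from_end) {
            typename convert_type::result res =
                _conversion->out(state,
                                 from, from_end, from_next,
                                 to,   to_end,   to_next);
            if (res == convert_type::noconv) {
                // Nothing was written: the two encodings are identical,
                // so pass the remaining input through unchanged.
                toStr.append(from, from_end);
                returnValue = res;
                break;
            }
            if (res == convert_type::error) {
                returnValue = res;
                break;
            }
            toStr.append(to, to_next - to);
            if (res == convert_type::partial &&
                ((from_next == from_end && to_next != to_end) ||
                 (from_next == from && to_next == to))) {
                // No further progress is possible (for example, the block
                // is too small for one character); stop rather than loop.
                returnValue = convert_type::partial;
                break;
            }
            from = from_next;
        }
        // NOTE: a stateful encoding would also need unshift() here to
        // emit its final shift sequence.
        return returnValue;
    }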

private:
    int                 _blockLength;
    std::locale         _loc;
    const convert_type* _conversion;
};

------------------------------ Code ends  ------------------------------


--

James Keesey                              Internet: keesey@us.ibm.com
IBM Santa Teresa Labs
My opinions are not necessarily those of my employer.


[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]

Author: ncm@nospam.cantrip.org (Nathan Myers)
Date: 1998/08/24

James Keesey<NEWSPOST@SANJOSE.VNET.IBM.COM> wrote:
>
>When I examined codecvt<>'s interface, the first thing I noticed was that
>it did not support basic_string<>.  So my first question is why not?
>Aren't we supposed to be leaving char[] and wchar_t[] in the past?

codecvt<> is intended for use by basic_filebuf<>.  It's a low-level
facility, as are char* and wchar_t*, appropriately.  Of course it makes
sense to provide a higher-level interface, such as fstream.

The philosophy in the C++ library was that there would be no
"in-memory" representation of multibyte sequences; they would be
converted during i/o, with all in-memory representations of large
character set strings using wide characters.  It doesn't enforce
that, but doesn't try to make it easier to subvert it.

The CD2 version of codecvt differs substantially from the final
version.
--
Nathan Myers
ncm@nospam.cantrip.org  http://www.cantrip.org/

Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: 1998/08/24

James Keesey <NEWSPOST@SANJOSE.VNET.IBM.COM> wrote in article <1qkan28xddzs.fsf@keesey.us.ibm.com>...
> When I examined codecvt<>'s interface, the first thing I noticed was that
> it did not support basic_string<>.  So my first question is why not?
> Aren't we supposed to be leaving char[] and wchar_t[] in the past?

The sole defined use of template class codecvt in the Standard C++ library
is to convert between internal and external stream representations for
class basic_streambuf<E>. Internally, the stream looks like a sequence
of elements of type E. Externally, it looks like a sequence of elements of
type char. The mapping between the two is handled by a facet of class
codecvt<E, char, mbstate_t>. For this purpose, you primarily have to
work with an array of length 1 of type E and an array of some greater
length of type char. For the mapping to wchar_t, presumably the latter
length is never larger than MB_LEN_MAX.
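
A minimal sketch of that arrangement, assuming the final-standard interface
rather than CD2 or VC++ 5.0: one wide character in an array of length 1 goes
in, and up to MB_LEN_MAX chars come out. The helper name wide_char_to_bytes
is hypothetical and not part of any library.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: convert one wide character to its multibyte form
// using the codecvt facet of the given locale.
std::string wide_char_to_bytes(wchar_t wc, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
    const cvt_type& cvt = std::use_facet<cvt_type>(loc);

    std::mbstate_t state = std::mbstate_t();   // initial conversion state

    const wchar_t in[1] = { wc };              // array of length 1 of type E
    const wchar_t* in_next = in;
    char out[MB_LEN_MAX];                      // room for one multibyte char
    char* out_next = out;

    cvt_type::result res = cvt.out(state, in, in + 1, in_next,
                                   out, out + MB_LEN_MAX, out_next);
    if (res != cvt_type::ok)
        return std::string();                  // error, partial, or noconv
    return std::string(out, out_next);
}
```

Returning an empty string on failure is a simplification for the sketch; real
code would want to report the result code to the caller.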

I do, in fact, use a string with codecvt at a crucial place within
basic_filebuf::overflow. But you have to realize that all this machinery was
accepted into the draft C++ Standard as a paper tiger. We implementors got to
tweak the design only after it was basically frozen.

> Having realized that I would have to roll my own, I set out to do so.  At
> the bottom of this post is my first attempt at an object that converts from
> one basic_string<> type to another.  It seems to work under VC++, but I
> would appreciate any constructive comments or suggestions on improving it.

Very few environments currently support template class codecvt. If you got
it to work with VC++, you're off to a good start.

> Assuming that I have it reasonably correct, this process has led me to
> ask several questions:
>
> 1) Why is there no basic_string<> interface? (from above)

Occam's Razor or an oversight. Take your pick.

> 2) Why does it take so much code to do a simple conversion?  I appreciate
>    having the power when I need it, but the codecvt<> interface seems too
>    difficult for the common problems.

Uh huh.

> 3) Why is there a codecvt<>.length(state, externT*, externT*, size_t) but
>    not the inverse(converse?)?

It is conceived of as a 1-to-N conversion.

> 4) Why isn't there a way to ask "given these characters, how many
>    characters in the target type will it take to hold them?"  The length()
>    function doesn't seem to exactly fit this purpose.

I kinda thought that was its intent, but I confess that I don't use this function.
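
For reference, in the final standard length() reports how many external
chars would be consumed to produce at most max internal characters, so it
answers "how much input makes N characters" rather than "how many characters
will this input make". A small sketch of calling it, assuming the
final-standard signature; the helper name bytes_for_wide_chars is
hypothetical.

```cpp
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: number of bytes at the start of `bytes` that would
// be consumed to produce at most `max_wide` wide characters.
std::size_t bytes_for_wide_chars(const std::string& bytes,
                                 std::size_t max_wide,
                                 const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
    const cvt_type& cvt = std::use_facet<cvt_type>(loc);

    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    const char* first = bytes.data();

    // length() counts external (char) elements, not internal ones.
    int n = cvt.length(state, first, first + bytes.size(), max_wide);
    return n < 0 ? 0 : static_cast<std::size_t>(n);
}
```

With a single-byte encoding the two counts coincide; they differ only when
some characters take more than one byte.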

> 5) I only have CD2 and I was wondering if there were changes made to this
>    section [lib.locale.codecvt] by the time the standard was finalized.

My notes indicate that some new typedefs magically appeared, but I think
that happened by CD2. I also believe that do_length changed late in the
game, but I'm not sure whether that predates CD2. It always pays to read
the FDIS. I still get occasional small surprises.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com/hot_news.html



Author: "James Keesey" <NEWSPOST@SANJOSE.VNET.IBM.COM>
Date: 1998/08/24

>>>>> "Nathan Myers" == Nathan Myers <ncm@nospam.cantrip.org> writes:

    > James Keesey<NEWSPOST@SANJOSE.VNET.IBM.COM> wrote:
    >>  When I examined codecvt<>'s interface, the first thing I noticed
    >> was that it did not support basic_string<>.  So my first question is
    >> why not?  Aren't we supposed to be leaving char[] and wchar_t[] in
    >> the past?

    > codecvt<> is intended for use by basic_filebuf<>.

Is this documented?  And did you mean basic_filebuf<> or basic_streambuf<>?
If the former, then how do the others do conversion?

    > It's a low-level facility, as are char* and wchar_t*, appropriately.
    > Of course it makes sense to provide a higher-level interface, such as
    > fstream.

Ok, can you give a brief example of how to read a file with interspersed
binary data (e.g. ints) and multibyte strings where the strings end up as
wstrings?  In my case I will have the length of the string (part of the
binary data).

    > The philosophy in the C++ library was that there would be no
    > "in-memory" representation of multibyte sequences;

Is this philosophy documented somewhere for those of us trying to figure
out how to use the library?  I don't see anywhere that one should not use
string at all in NLS enabled programs.

    > they would be converted during i/o, with all in-memory
    > representations of large character set strings using wide characters.
    > It doesn't enforce that, but doesn't try to make it easier to subvert
    > it.

You say "subvert" but I'm trying to get everything into wstrings.

    > The CD2 version of codecvt differs substantially from the final
    > version.

Which doesn't help when I can't get the real thing yet.  BTW, any news as
to when?

Thanks,
James

--

James Keesey                              Internet: keesey@us.ibm.com
IBM Santa Teresa Labs
The opinions expressed may not be those of my employer.

Author: Matt Austern <austern@sgi.com>
Date: 1998/08/24

"James Keesey" <NEWSPOST@SANJOSE.VNET.IBM.COM> writes:

>     > codecvt<> is intended for use by basic_filebuf<>.
>
> Is this documented?  And did you mean basic_filebuf<> or basic_streambuf<>?
> If the former, then how do the others do conversion?

The only streambufs in the C++ standard are basic_filebuf<>,
basic_stringbuf<>, and strstreambuf (strstreambuf isn't a template,
and it's deprecated).  There's no need for basic_stringbuf<> or
strstreambuf to do any conversions from one representation to another,
and the standard makes it clear that they aren't supposed to do any
such conversions.

Section 27.8.1, meanwhile, makes it clear that the external
representation of a file is a sequence of bytes, and that wide-
oriented filebufs are supposed to use codecvt to convert between
wide characters and multibyte characters.

Author: "P.J. Plauger" <pjp@dinkumware.com>
Date: 1998/08/25

James Keesey <NEWSPOST@SANJOSE.VNET.IBM.COM> wrote in article <1qkak93y59xp.fsf@orac.stl.ibm.com>...
>     > codecvt<> is intended for use by basic_filebuf<>.
>
> Is this documented?  And did you mean basic_filebuf<> or basic_streambuf<>?
> If the former, then how do the others do conversion?

See my column, ``Standard C/C++: The Facet codecvt,'' C/C++ Users
Journal, November 1997. It discusses the intended use for codecvt and
shows the mechanics of using it.

> Ok, can you give a brief example of how to read a file with interspersed
> binary data (e.g. ints) and multibyte strings where the strings end up as
> wstrings?  In my case I will have the length of the string (part of the
> binary data).

The expectation is that the entire file has a uniform multibyte encoding.
For the kind of patchwork parsing you describe, see fwscanf in Standard
C (Amendment 1).

>     > The philosophy in the C++ library was that there would be no
>     > "in-memory" representation of multibyte sequences;
>
> Is this philosophy documented somewhere for those of us trying to figure
> out how to use the library?  I don't see anywhere that one should not use
> string at all in NLS enabled programs.

Standards tend to be weak on philosophy and long on nitty gritty details.

>     > they would be converted during i/o, with all in-memory
>     > representations of large character set strings using wide characters.
>     > It doesn't enforce that, but doesn't try to make it easier to subvert
>     > it.
>
> You say "subvert" but I'm trying to get everything into wstrings.

Yes, the idea is to read and write sequences of (typically) wide chars
internally, and leave it to the basic_ifstream/basic_ofstream objects
to do the conversions needed to match up to multibyte files in the
outside world. In fact, wcin/wcout/wcerr/wclog are predefined for your
wide-char comfort.
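
A rough sketch of that division of labor, assuming a conforming library: the
program handles only wstrings, and the wide file streams apply the imbued
locale's codecvt facet on the way in and out. The helper names
write_wide_line and read_wide_line are hypothetical.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: write one line of wide text. The wide filebuf
// converts it to a multibyte sequence using loc's codecvt facet.
bool write_wide_line(const char* path, const std::wstring& text,
                     const std::locale& loc)
{
    std::wofstream out;
    out.imbue(loc);          // select the codecvt facet before any I/O
    out.open(path);
    out << text << L'\n';
    out.close();             // flushes; sets failbit if that fails
    return !out.fail();
}

// Hypothetical helper: read the first line back as wide characters.
// Returns an empty string if the file cannot be read.
std::wstring read_wide_line(const char* path, const std::locale& loc)
{
    std::wifstream in;
    in.imbue(loc);           // must match the encoding used for writing
    in.open(path);
    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Imbuing before open() keeps the sketch on safe ground, since changing a
filebuf's locale after I/O has started is restricted.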

>     > The CD2 version of codecvt differs substantially from the final
>     > version.
>
> Which doesn't help when I can't get the real thing yet.  BTW, any news as
> to when?

You can get a precise and up-to-date reference to the Standard C++
Library at our web site. You can see a working version of codecvt in
the header <xlocale> of any VC++ V4.2 or V5.0. And if you look
around in house long enough, you'll find a version of that file that
matches the FDIS.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com/hot_news.html



Author: ncm@nospam.cantrip.org (Nathan Myers)
Date: 1998/08/25

(I'll address the questions not otherwise answered.)

 James Keesey<NEWSPOST@SANJOSE.VNET.IBM.COM> wrote:
>
>>>>>> "Nathan Myers" == Nathan Myers <ncm@nospam.cantrip.org> writes:
>    > It's a low-level facility, as are char* and wchar_t*, appropriately.
>    > Of course it makes sense to provide a higher-level interface, such as
>    > fstream.
>
>Ok, can you give a brief example of how to read a file with interspersed
>binary data (e.g. ints) and multibyte strings where the strings end up as
>wstrings?  In my case I will have the length of the string (part of the
>binary data).

You should read the file as a binary file, using filebuf directly,
into a memory buffer, and then apply codecvt<> operations to the
bytes you get.  Of course it's best to encapsulate these operations
in a higher-level interface.

>    > they would be converted during i/o, with all in-memory
>    > representations of large character set strings using wide characters.
>    > It doesn't enforce that, but doesn't try to make it easier to subvert
>    > it.
>
>You say "subvert" but I'm trying to get everything into wstrings.

What you are trying to do is reasonable, and that is the reason
that the low-level codecvt<> interface is exposed.  What you are
doing is complicated, though, so the standard doesn't package it
all up nicely for you.  The standard does contain a few examples
of the use of the codecvt<> members, in chapter 27.
--
Nathan Myers
ncm@nospam.cantrip.org  http://www.cantrip.org/


