Topic: codecvt::do_in not consuming external characters


Author: abarbati@iaanus.com (Alberto Barbati)
Date: Thu, 5 Dec 2002 23:50:30 +0000 (UTC)
Dietmar Kuehl wrote:
> My understanding is that essentially, the standard library assumes that
> internal characters are completely encoded in one character. In this thinking
> UTF-8, UTF-16, etc. are external encodings and are to be mapped to an encoding
> where each individual character is encoded by just one, well, character. Thus,
> when reading characters from an external encoding the N-to-1 encoding is used,
> when writing characters to an external encoding the 1-to-N encoding is
> used.

Put that way, things are much clearer to me. The concept of
"completely encoded" is the key. Thanks for pointing it out.

> For reasons which are beyond me, some popular encodings used internally
> do things differently and require multiple internal characters to
> represent one logical character (my understanding is that these are actually
> some form of composite glyphs). I think that it is possible to use the
> standard library on these beasts, too, but the user has to take care of not
> ripping the glyphs apart: There is no support whatsoever for this in the
> standard library.

In fact there are several useful encodings that require multiple
internal "units" to encode an abstract character, most notably UTF-16
and UTF-8. UTF-16 encodes a Unicode character with either one or two
16-bit units; UTF-8 encodes a character with one to four 8-bit units.

UTF-8 as an internal encoding is not very efficient and has limited
and/or very specific applicability, but it's difficult to live without
UTF-16, for at least these three reasons:

1) on several platforms, wchar_t is a 16-bit type, unable to hold a
"completely encoded" Unicode character.

2) UTF-16 is more memory-friendly than a "full" Unicode encoding, which
usually requires up to twice the memory.

3) OS-specific APIs may require UTF-16 (for example, on Windows, the
Unicode subsystem requires all strings to be passed as UTF-16).

> ??? Despite a different intention, I think the standard currently requires
> support for N-to-M mappings between external and internal characters (of
> course, assuming that the user provides a corresponding code conversion
> facet).

Because of a few peculiar properties of UTF-16 I was able to implement a
conversion facet whose do_in() always consumes at least one external
character and produces exactly one internal character. This should be
enough to let it work however the standard is interpreted.

In order to achieve this, the implementation relies on the fact that a
proper prefix of the external sequence encoding a "multiply encoded"
abstract character contains enough information to produce one part of
the internal encoding.
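
To illustrate (a toy sketch with made-up helper names, not the facet
itself): in a UTF-8 to UTF-16 conversion, the first three bytes of a
four-byte UTF-8 sequence already determine the high surrogate, and the
fourth byte then completes the low surrogate.

    #include <cassert>

    // bits 20..6 of the Unicode scalar value, recovered from bytes 1-3
    unsigned long utf8_prefix_bits(unsigned char b1, unsigned char b2,
                                   unsigned char b3)
    {
        return ((b1 & 0x07UL) << 12) | ((b2 & 0x3FUL) << 6) | (b3 & 0x3FUL);
    }

    // computable after only three external bytes have been consumed
    unsigned short high_surrogate(unsigned long prefix)
    {
        return (unsigned short)(0xD800 + ((prefix >> 4) - 0x40));
    }

    // needs the fourth external byte as well
    unsigned short low_surrogate(unsigned long prefix, unsigned char b4)
    {
        return (unsigned short)(0xDC00 + (((prefix & 0x0F) << 6) | (b4 & 0x3F)));
    }

    int main()
    {
        // U+1D11E (musical G clef) is F0 9D 84 9E in UTF-8 and the
        // surrogate pair D834 DD1E in UTF-16
        unsigned long p = utf8_prefix_bits(0xF0, 0x9D, 0x84);
        assert(high_surrogate(p) == 0xD834);      // after 3 bytes
        assert(low_surrogate(p, 0x9E) == 0xDD1E); // after the 4th
        return 0;
    }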

The implementation could be cleaner and more efficient if do_in() were
allowed to produce internal characters without consuming any external
character (my initial question), but efficiency is not the issue here.

The problem is that I think it's impossible to implement a facet with
UTF-8 internal encoding without allowing do_in() to do that.

In conclusion: I agree that the current wording of the standard is
consistent with the rationale you wrote at the beginning of this message
and that it implements it correctly (i.e., all 1-to-N encodings can be
read/written by a filebuf through a specific facet). However, I believe
that the wording is unclear about one detail (my original question) and
this leads to different interpretations: one possible interpretation de
facto disallows a class of M-to-N encodings.

I believe that a clarification, either in the positive or in the
negative, could be a good thing.

> ... and if you want to make sure that the next standard clears things up in
> the desired way, you should get involved in the standardization process in
> some form (well, posting here is a first step; providing words for resolving
> the pending code conversion facet defects is another one; filing new DRs or
> providing proposals for additions/cleanup are also reasonable approaches).

What should I do to be more involved? Would preparing a formal proposal
help?

Alberto Barbati



Author: dietmar_kuehl@yahoo.com (Dietmar Kuehl)
Date: Wed, 4 Dec 2002 19:59:27 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") wrote:
> "Alberto Barbati" <abarbati@iaanus.com> wrote in message news:3ddd6632_4@corp.newsgroups.com...
> > I would like to understand the rationale for ruling out N-to-M mappings.
> > Do you know where I could find it?
>
> I know of no rationale for this part of the C++ Standard. FWIW, it
> appears to be modeled after the multibyte/wide-character mapping
> introduced into the C90 Standard and elaborated upon with Amendment
> 1 to the C Standard in 1995. Those mappings were 1-to-N and N-to-1.

My understanding is that essentially, the standard library assumes that
internal characters are completely encoded in one character. In this thinking
UTF-8, UTF-16, etc. are external encodings and are to be mapped to an encoding
where each individual character is encoded by just one, well, character. Thus,
when reading characters from an external encoding the N-to-1 encoding is used,
when writing characters to an external encoding the 1-to-N encoding is
used.

For reasons which are beyond me, some popular encodings used internally
do things differently and require multiple internal characters to
represent one logical character (my understanding is that these are actually
some form of composite glyphs). I think that it is possible to use the
standard library on these beasts, too, but the user has to take care of not
ripping the glyphs apart: There is no support whatsoever for this in the
standard library.

> > > I agree. We do indeed supply a mapping from UTF-8 to UTF-16 as part of
> > > our CoreX Library. (The manual is available for perusal at our web site.)
> > > It is the one most likely to fail with a basic_filebuf other than our
> > > own.
> >
> > I'm puzzled. So, do you think this approach is (please select one):
> >
> > 1) valid under the current standard, even if it fails with most
> > implementations
>
> No.

??? Despite a different intention, I think the standard currently requires
support for N-to-M mappings between external and internal characters (of
course, assuming that the user provides a corresponding code conversion
facet).

> A real possibility is that for years to come you'll find very little
> uniformity among implementations of template class basic_filebuf.

... and if you want to make sure that the next standard clears things up in
the desired way, you should get involved in the standardization process in
some form (well, posting here is a first step; providing words for resolving
the pending code conversion facet defects is another one; filing new DRs or
providing proposals for additions/cleanup are also reasonable approaches).
--
<mailto:dietmar_kuehl@yahoo.com> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 28 Nov 2002 20:03:53 +0000 (UTC)
abarbati@iaanus.com (Alberto Barbati) wrote in message
news:<3de403f6_4@corp.newsgroups.com>...
> James Kanze wrote:
> >>I don't understand your objection. Code conversion *is not*
> >>intimately linked with reading from and writing to files.

> > So why does the standard link the two?  Why should all file IO use
> > code conversion, and none to/from a string, for example?

> I did not understand your objection and explained myself badly, I
> apologize. I thought that the fact of having two classes (the buffer
> and the codecvt facet) was enough to say that the two concepts are not
> strongly linked. In a sense it's true, but the interface between the
> two is so overspecified that in another sense they *are* linked.

Code translation and reading/writing from a file are two distinctly
different concerns.  They should thus be in separate classes.  As
Dietmar Kuehl has pointed out, using codecvt is not the simplest thing
around.  It offers a set of implementations, but the interface needs
work.

> However, I see no point in doing string translation. Why would I?

Who knows.  I see no point in forbidding it.  I also see no point in
requiring code translations in every case where I read or write from a
file.  (There is a thread in the French speaking newsgroup right now
because someone simply wants to read wchar_t using wifstream without any
translation, and doesn't see why he should require the complexity of
locale to do so.  Fundamentally, he's right.)

> A codecvt translation is interesting (and hard) because it's done
> "streamingly", that is using only a small portion of the data at a
> time, because the rest of the data can't be fetched efficiently or
> can't be fetched at all. For an in-memory string, it's silly to do a
> streaming conversion. It's more efficient to do all
> insertions/extractions and then apply the conversion in one single
> step to the result.

Unless, for example, the data are not created at one time, and the
"string" is read asynchronously from another thread (or you use a
strstreambuf to put it in shared memory, or...)

Perhaps the single most important element in the success of C++ is that
no one has tried to second guess the programmer; the option with the
most freedom has always been the one chosen.

> > Why, for that matter, should there even be a wfilebuf, since I can
> > only write files in terms of bytes?  The logical solution would be a
> > translating wstreambuf, which is a wstreambuf which uses a narrow
> > character streambuf for its source and sink, and does the
> > translating.

> A wfilebuf is essential in my opinion, because it allows wide
> characters on the internal side.

But it duplicates significant code from filebuf.  A better
solution would be to just have filebuf (with perhaps wfilebuf as an
extension for systems which support wide character files, i.e. where
there is a system read and write request which takes a wchar_t rather
than a char), which takes care of reading and writing the file, and a
translation filtering streambuf which converts between char's and
wchar_t's (using the imbued codecvt to do so).

Separation of concerns, and all that.

> Yet, filebufs have to have bytes on the external side because that's
> how the C I/O library (and probably the underlying OS, too) works.

That's how some underlying OS's work.  From what I understand, Windows
permits reads and writes using wchar_t buffers.

> For example, I have just finished implementing a UTF-8 facet to read
> Unicode files. You would imbue such facet on a wfstream.  You can't do
> that on narrow streams, because narrow characters can't hold a Unicode
> character. But it's good to have bytes on the external side, because
> UTF-8 is indeed defined in terms of bytes. If wfilebuf had already been
> translating wide characters I would have been in big trouble: how big
> is a wchar_t? 2 bytes, 4 bytes or even more? With what endian-ness are
> they read from the stream? Some characters in UTF-8 are encoded with 1
> or 3 bytes, how am I supposed to read those?

But this works even better if I use a filtering streambuf for the
conversion.  That way, when I implement my socket stream:

  - I don't have to worry about making it a template, just handling
    bytes is enough (and let's face it: the fewer templates, the better),
    and

  - I don't have to worry about interfacing to codecvt, or anything
    else, I just have to provide the code for what is new.

> The "logical solution" you are referring to is not at all logical. A
> wstreambuf that does wchar_t translation has to address the wchar_t
> size and endianness issues.

I never proposed any such thing.  What I propose is a basic_filebuf
which doesn't translate.  It reads charT, and it presents what it reads
literally at the interface.  (There are problems on historical grounds
here, with regards to mapping of '\n'.  But nothing worse than what we
already have in narrow character streams.)

Translation between wide and narrow characters is a separate concern,
which belongs in a separate, filtering streambuf.

--
James Kanze                           mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: abarbati@galactica.it (Alberto Barbati)
Date: Wed, 27 Nov 2002 00:02:57 +0000 (UTC)
Dietmar Kuehl wrote:
> BTW, concerning your initial question: I think a code conversion facet
> is just supposed to make "progress" with each call to 'do_in()' (or
> 'do_out()'). What "progress" means exactly is pretty vague but
> effectively it should do at least one of the following (multiple things
> being done at once are perfectly legal and will probably improve the
> performance):
>
> - consume at least one character
> - produce at least one character
> - produce an error
> - modify the state in some form such that eventually one of the others
>   happens
>
> That is, if there is at least one input character and one output
> character the function should produce at least one character or consume
> at least one character - eventually. As long as there is one input
> character and one output character and the function does not return
> 'error' the user of 'do_in()' or 'do_out()' can just call the function
> with the same arguments in a loop and the loop will eventually
> terminate.

Thanks, I think so too. I wish the LWG would consider this wording, as I
believe it is a giant leap towards clarifying the filebuf/codecvt
interaction. A few implementations would have to be modified to become
conformant, but I don't think the change would have a big impact.

Alberto Barbati



Author: abarbati@iaanus.com (Alberto Barbati)
Date: Wed, 27 Nov 2002 17:48:09 +0000 (UTC)
James Kanze wrote:
>>I don't understand your objection. Code conversion *is not* intimately
>>linked with reading from and writing to files.
> So why does the standard link the two?  Why should all file IO use code
> conversion, and none to/from a string, for example?

I did not understand your objection and explained myself badly, I
apologize. I thought that the fact of having two classes (the buffer and
the codecvt facet) was enough to say that the two concepts are not
strongly linked. In a sense it's true, but the interface between the two
is so overspecified that in another sense they *are* linked.

However, I see no point in doing string translation. Why would I? A
codecvt translation is interesting (and hard) because it's done
"streamingly", that is using only a small portion of the data at a time,
because the rest of the data can't be fetched efficiently or can't be
fetched at all. For an in-memory string, it's silly to do a streaming
conversion. It's more efficient to do all insertions/extractions and
then apply the conversion in one single step to the result.

> Why, for that matter, should there even be a wfilebuf, since I can only
> write files in terms of bytes?  The logical solution would be a
> translating wstreambuf, which is a wstreambuf which uses a narrow
> character streambuf for its source and sink, and does the translating.

A wfilebuf is essential in my opinion, because it allows wide characters
on the internal side. Yet, filebufs have to have bytes on the external
side because that's how the C I/O library (and probably the underlying
OS, too) works. For example, I have just finished implementing a UTF-8
facet to read Unicode files. You would imbue such facet on a wfstream.
You can't do that on narrow streams, because narrow characters can't
hold a Unicode character. But it's good to have bytes on the external
side, because UTF-8 is indeed defined in terms of bytes. If wfilebuf had
already been translating wide characters I would have been in big
trouble: how big is a wchar_t? 2 bytes, 4 bytes or even more? With what
endian-ness are they read from the stream? Some characters in UTF-8 are
encoded with 1 or 3 bytes, how am I supposed to read those?

The "logical solution" you are referring to is not at all logical. A
wstreambuf that does wchar_t translation has to address the wchar_t
size and endianness issues. And if you really want this translation,
it's so easy to provide a codecvt facet that implements it! You just
imbue it and that's it.

Anyway, we are slowly drifting off-topic. I'm not really interested in
this discussion. We can speak for months about what would be good to
have but we don't. Instead, I just would like to be able to
*effectively* use what we already have. It seems to me a much more
realistic target. That's why I posed a single, very precise question
(which you didn't answer, if I may add).

Alberto Barbati



Author: dietmar_kuehl@yahoo.com (Dietmar Kuehl)
Date: Mon, 25 Nov 2002 19:19:42 +0000 (UTC)
abarbati@iaanus.com (Alberto Barbati) wrote:
> James Kanze wrote:
> > Logically, wouldn't something like this be a better solution overall
> > anyway.  Why is code conversion intimately linked with reading and
> > writing to files?  Why do I have to duplicate functionality already in
> > the standard library if I need code conversion in my socket stream?
> > Wouldn't a better design be to have a filtering streambuf for code
> > conversion?  (I know, since I'm the one suggesting it, someone is bound
> > to make a remark about "when all you have is a hammer...":-).)
>
> I don't understand your objection. Code conversion *is not* intimately
> linked with reading from and writing to files. That's exactly the reason
> why there is a codecvt facet class in the locale, a component physically
> separated from class filebuf that encapsulates file access.

Well, yes, the code conversion *implementation* is decoupled from file buffers,
but if you have ever tried to implement 'std::basic_filebuf' you *know* that
in fact the *use* of these classes is pretty much limited to this class: It is
awfully hard to get it right, not to mention efficient. Of the standard
library implementers I asked about the hardest part to implement in the
standard library, all agreed that 'std::basic_filebuf' is a very good
candidate (with 'std::deque' and 'std::valarray' and family being other
candidates, the latter, however, due to its bad specification). I'm all in
favour of adding a class like the one mentioned by P.J. Plauger to the next
standard: It would give library implementers at least some guidance on how to
tackle this problem and would ease the implementation of wide character stream
buffers for users.

> If you write your own socket stream you are probably going to write a
> specific buffer class and not only you can, but you are encouraged to
> use the codecvt facet, thus avoiding code duplication and allowing user
> customization.

Well, for the socket you want to write a 'std::basic_streambuf<char>' which
assumes no need for code conversion. The converting stream buffer would
just use the 'char' stream buffer to obtain or send the converted characters.
Just give it a try and determine what is harder to write: A class handling
'char's or a template class handling all kinds of character types and doing
its own code conversion? If you find that the latter is not sufficiently
complex to make a serious difference, add seeking to your implementation and
make it efficient... (if this is still an easy task for you, standard library
vendors have to seriously fear your competition).

Note that neither James nor P.J. were talking about getting rid of the code
conversion facet. They were just talking about a standard class which eases
the use of the code conversion facets and which is used, e.g., to implement
file buffers. Effectively, this is a reasonable way to go anyway; it's just
that the class doing the code conversion is currently not exposed to users
and is thus named differently in the various standard library implementations.

BTW, concerning your initial question: I think a code conversion facet is just
supposed to make "progress" with each call to 'do_in()' (or 'do_out()'). What
"progress" means exactly is pretty vague but effectively it should do at least
one of the following (multiple things being done at once are perfectly legal
and will probably improve the performance):

- consume at least one character
- produce at least one character
- produce an error
- modify the state in some form such that eventually one of the others happens

That is, if there is at least one input character and one output character the
function should produce at least one character or consume at least one
character - eventually. As long as there is one input character and one output
character and the function does not return 'error' the user of 'do_in()' or
'do_out()' can just call the function with the same arguments in a loop and
the loop will eventually terminate.
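
A sketch of the calling convention this contract permits (an
illustration, not any particular library's filebuf): the caller can
simply retry 'in()' with the same buffers until it stops returning
'partial'.

    #include <locale>

    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;

    // convert as much of [from, from_end) as fits into [to, to_end)
    bool convert_some(const cvt_type& cvt, std::mbstate_t& state,
                      const char* from, const char* from_end,
                      wchar_t* to, wchar_t* to_end)
    {
        const char* from_next = from;
        wchar_t* to_next = to;
        while (from_next != from_end && to_next != to_end) {
            cvt_type::result r =
                cvt.in(state, from_next, from_end, from_next,
                       to_next, to_end, to_next);
            if (r == cvt_type::error)
                return false;
            if (r == cvt_type::ok || r == cvt_type::noconv)
                break;
            // r == partial: the facet consumed, produced, or advanced its
            // state; under the contract above, the loop must terminate.
        }
        return true;
    }
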
--
<mailto:dietmar_kuehl@yahoo.com> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: kanze@gabi-soft.de (James Kanze)
Date: Tue, 26 Nov 2002 03:51:20 +0000 (UTC)
abarbati@iaanus.com (Alberto Barbati) wrote in message
news:<3ddfc975_4@corp.newsgroups.com>...
> James Kanze wrote:
> > Logically, wouldn't something like this be a better solution overall
> > anyway. Why is code conversion intimately linked with reading and
> > writing to files? Why do I have to duplicate functionality already
> > in the standard library if I need code conversion in my socket
> > stream? Wouldn't a better design be to have a filtering streambuf
> > for code conversion? (I know, since I'm the one suggesting it, some
> > one is bound to make a remark about "when all you have is a
> > hammer...":-).)

> I don't understand your objection. Code conversion *is not* intimately
> linked with reading from and writing to files.

So why does the standard link the two?  Why should all file IO use code
conversion, and none to/from a string, for example?  Why, for that
matter, should there even be a wfilebuf, since I can only write files in
terms of bytes?  The logical solution would be a translating wstreambuf,
which is a wstreambuf which uses a narrow character streambuf for its
source and sink, and does the translating.

> That's exactly the reason why there is a codecvt facet class in the
> locale, a component physically separated from class filebuf that
> encapsulates file access.

> If you write your own socket stream you are probably going to write a
> specific buffer class and not only you can, but you are encouraged to
> use the codecvt facet, thus avoiding code duplication and allowing
> user customization.

Except that I have to duplicate all of the work of interfacing to this
codecvt.  It's ridiculous.

> Oppositely, if there was a filtering streambuf you would have linked
> buffering with conversion, the thing you said you want to avoid.

In what way?  I don't understand the objection.

--
James Kanze                           mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: abarbati@iaanus.com (Alberto Barbati)
Date: Sun, 24 Nov 2002 00:13:04 +0000 (UTC)
James Kanze wrote:
> Logically, wouldn't something like this be a better solution overall
> anyway.  Why is code conversion intimately linked with reading and
> writing to files?  Why do I have to duplicate functionality already in
> the standard library if I need code conversion in my socket stream?
> Wouldn't a better design be to have a filtering streambuf for code
> conversion?  (I know, since I'm the one suggesting it, someone is bound
> to make a remark about "when all you have is a hammer...":-).)

I don't understand your objection. Code conversion *is not* intimately
linked with reading from and writing to files. That's exactly the reason
why there is a codecvt facet class in the locale, a component physically
separated from class filebuf that encapsulates file access.

If you write your own socket stream you are probably going to write a
specific buffer class and not only you can, but you are encouraged to
use the codecvt facet, thus avoiding code duplication and allowing user
customization.

On the contrary, if there were a filtering streambuf you would have linked
buffering with conversion, the very thing you said you want to avoid.

I talked about filebuf in my message, just because filebuf is the only
class of the standard library that uses codecvt.

Alberto Barbati



Author: abarbati@iaanus.com (Alberto Barbati)
Date: Mon, 25 Nov 2002 19:03:23 +0000 (UTC)
> When translating one external character into two or more internal
> characters, things change. If you don't "consume" that external byte,
> the state (at least the publicly-visible state) of the input stream
> hasn't changed.

As correctly remarked by P.J. Plauger, that's why the pos_type object
returned by a seek or tell contains a copy of the conversion state. I
would add that that's also why the standard states that the result of an
absolute seek*() is undefined unless the target position has been
previously obtained by a successful tell*() on the same stream.
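
For illustration, the portable discipline looks like this (the file
name and the imbued facet are assumptions):

    #include <fstream>

    int main()
    {
        std::wifstream in("data.utf8"); // assume a suitable facet is imbued
        wchar_t c;
        in.get(c);
        std::wifstream::pos_type mark = in.tellg(); // byte offset + state
        while (in.get(c))               // read ahead...
            ;
        in.clear();
        in.seekg(mark);                 // well-defined: came from tellg()
        // in.seekg(std::wifstream::pos_type(3)); // undefined in general:
        //     an arbitrary offset may land inside a multibyte sequence
        return 0;
    }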

> I can't answer your question as well as the likes of P.J. Plauger or
> James Kanze already have. What I can do is throw in my $0.02 to agree
> with what you seem to be getting at -- the next version of the standard
> ought to be a lot more clear on this. It ought to specifically state
> that reading a character might not have any correlation at all to the
> external position. Furthermore, saving the external file position and
> then seeking back to that position can give undefined results, because
> there's no way to guarantee that you haven't seeked into the middle
> of a multibyte sequence.

You're missing the point: the standard is already clear enough about
seek and tell. My question is more about how a filebuf is supposed to
call the codecvt facet and what assumptions it can make. For example,
STLport assumes that if do_in() produces at least one internal
character, it must consume at least one external character. The .NET
STL implementation has an even stronger assumption: it assumes that if
do_in() produces exactly one internal character, it has to consume
*all* external characters that were given as input (an assumption that
is astonishingly strong, in my opinion). However, as the standard is
unclear about the subject, both approaches are now tolerated, thus
making life harder for library implementors.

If the specification of the filebuf/codecvt collaboration remains so
unclear or, worse, is made too restrictive, library implementors will
decide, as P.J. Plauger already does, to provide alternatives to this
mechanism. In such a situation, what would be the use of having a
standard?

Alberto Barbati



Author: abarbati@iaanus.com (Alberto Barbati)
Date: Wed, 20 Nov 2002 23:35:20 +0000 (UTC)
Hi Everybody,

about codecvt facets, I read LWG issues 76 and 382. I still have a doubt
that is probably implicit in issue 382, but I would like an expert's
opinion. The question is:

Short form:

can do_in() produce one internal character without consuming external
characters (i.e.: return partial with from_next == from) as a result of
the simple processing of the state parameter?

Long form:

I am writing a facet implementing UTF-16 conversion with surrogates.
Suppose that do_in() is called when the external sequence decodes to a
Unicode character outside the basic multilingual plane, that is, a
character with Unicode value U+10000 or above. Such a character must
produce two internal characters (the two surrogates). If there's enough
room in the internal buffer, there's no problem: I can just store the
two surrogates in the internal sequence. However, because of issue 76, I
know that I have to write the facet so that it can decode one character
at a time. In this case, the idea is to:

1) the first time do_in() is called (precondition: state == 0): consume
the required external sequence characters, store the first surrogate in
the internal sequence, store the second surrogate in the state argument.

2) the next time (precondition: state != 0): store state, that contains
the second surrogate, in the internal sequence and reset it to 0.

The second step does not consume external characters. (Note: surrogate
values are guaranteed to be != 0.)
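
To make the two steps concrete, here is a sketch of the same trick
applied to a simpler external encoding (UTF-32BE, four bytes per
abstract character, converted to UTF-16 internally). The class is
invented for illustration, and treating the mbstate_t as an integer
slot this way is not portable:

    #include <cstring>
    #include <locale>

    class utf32be_utf16_cvt
        : public std::codecvt<wchar_t, char, std::mbstate_t> {
    protected:
        result do_in(std::mbstate_t& state,
                     const char* from, const char* from_end,
                     const char*& from_next,
                     wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
        {
            from_next = from;
            to_next = to;
            unsigned pending = 0;  // nonzero <=> a low surrogate is banked
            std::memcpy(&pending, &state, sizeof pending);
            while (to_next != to_end) {
                if (pending != 0) {                // step 2: consume nothing,
                    *to_next++ = wchar_t(pending); // flush the low surrogate
                    pending = 0;
                    continue;
                }
                if (from_end - from_next < 4)  // need a whole external char
                    break;
                unsigned long u = 0;           // step 1: consume 4 bytes
                for (int i = 0; i < 4; ++i)
                    u = (u << 8) | (unsigned char)*from_next++;
                if (u < 0x10000UL)
                    *to_next++ = wchar_t(u);
                else {                         // needs a surrogate pair
                    *to_next++ = wchar_t(0xD800 + ((u - 0x10000UL) >> 10));
                    pending = (unsigned)(0xDC00 + ((u - 0x10000UL) & 0x3FF));
                }
            }
            std::memcpy(&state, &pending, sizeof pending);
            return pending == 0 && from_next == from_end ? ok : partial;
        }
        // the remaining virtuals are left to the base class for brevity
    };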

Is this approach allowed by the (possibly amended) standard? If yes,
should this approach be explicitly described as valid?

In fact, this approach is quite general and, if it's correct, would
allow any N-to-M conversion to be used with basic_filebuf without
violating issue 76.

Please notice that if, in an N-to-M mapping, the number of external
characters that make a "decodable" block is always greater than one, we
may decide to consume only some characters in step 1 and the rest in
step 2. However, this approach is cumbersome and rules out an entire
class of mappings (for example all 1-to-M mappings).

Of course, a similar argument may be raised for do_out() as well.

Thanks for your attention,

Alberto Barbati



Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Thu, 21 Nov 2002 02:16:10 +0000 (UTC)
"Alberto Barbati" <abarbati@iaanus.com> wrote in message news:3ddc1448_4@corp.newsgroups.com...

> about codecvt facets, I read LWG issues 76 and 382. I still have a doubt
> that is probably implicit in issue 382, but I would like an expert's
> opinion. The question is:
>
> Short form:
>
> can do_in() produce one internal character without consuming external
> characters (i.e.: return partial with from_next == from) as a result of
> the simple processing of the state parameter?

The C++ Standard is profoundly unclear on this and related topics.
The most widely used implementations (Dinkumware, Rogue Wave, STLport,
LibStdC++, SGI) are all over the map in what they permit. So even if
you believe you can read the C++ Standard the way you'd like, you won't
get uniform support in the real world.

> Long form:
>
> I am writing a facet implementing UTF-16 conversion with surrogates.
> Suppose that do_in() is called when the external sequence decodes to a
> Unicode character outside the basic multilingual plane, that is, a
> character with Unicode value above U+10000. Such a character must
> produce two internal characters (the two surrogates). If there's enough
> room in the internal buffer, there's no problem: I can just store the
> two surrogates in the internal sequence. However, because of issue 76, I
> know that I have to write the facet so that it can decode one character
> at a time. In this case, the idea is to:
>
> 1) the first time do_in() is called (precondition: state == 0): consume
> the required external sequence characters, store the first surrogate in
> the internal sequence, store the second surrogate in the state argument.
>
> 2) the next time (precondition: state != 0): store state, that contains
> the second surrogate, in the internal sequence and reset it to 0.
>
> The second step does not consume external characters. (note: surrogates
> values are guaranteed to be != 0).
>
> Is this approach allowed by the (possibly amended) standard? If yes,
> should this approach be explicitly described as valid?

Not really. The *intent* of the C++ Standard was to support 1-to-N and
N-to-1 mappings.

> In fact, this approach is quite general and, if it's correct, would
> allow any N-to-M conversion to be used with basic_filebuf without
> violating issue 76.

I agree. We do indeed supply a mapping from UTF-8 to UTF-16 as part of
our CoreX Library. (The manual is available for perusal at our web site.)
It is the one most likely to fail with a basic_filebuf other than our
own.

> Please notice that if, in an N-to-M mapping, the number of external
> characters that make a "decodable" block is always greater than one, we
> may decide to consume only some characters in step 1 and the rest in
> step 2. However, this approach is cumbersome and rules out an entire
> class of mappings (for example all 1-to-M mappings).

Yes, we do that. And it's cumbersome. And we still can't avoid all the
shortcomings of other libraries. So what we do is supply a template class
Dinkum::wbuffer_convert that looks like a wide stream, does all its I/O
through a basic_streambuf<char>, and is tolerant enough to work properly
with an N-to-M codecvt facet. We also have a template class
Dinkum::wstring_convert that lets you use these ambitious facets for
conversions between string and wstring objects. With these helpers, we
can offer all our code conversions for use with all existing Standard
C++ libraries.

But it ain't easy.
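
These helpers were later standardized in C++11, essentially unchanged,
as std::wbuffer_convert and std::wstring_convert. A minimal sketch of
the buffer-level one (the file name and the choice of facet are
assumptions):

    #include <codecvt>
    #include <fstream>
    #include <istream>
    #include <locale>

    int main()
    {
        std::filebuf bytes;            // any byte-level streambuf will do
        bytes.open("data.utf8", std::ios_base::in | std::ios_base::binary);
        // wide-character view over the byte buffer; the facet, not the
        // file buffer, does the (possibly N-to-M) conversion work
        std::wbuffer_convert<std::codecvt_utf8<wchar_t> > conv(&bytes);
        std::wistream in(&conv);
        wchar_t c;
        while (in.get(c))
            ;                          // process characters...
        return 0;
    }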

> Of course, a similar argument may be raised for do_out() as well.

Of course.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: kanze@gabi-soft.de (James Kanze)
Date: Thu, 21 Nov 2002 18:13:37 +0000 (UTC)
pjp@dinkumware.com ("P.J. Plauger") wrote in message
news:<3ddc3993$0$24776$724ebb72@reader2.ash.ops.us.uu.net>...
> "Alberto Barbati" <abarbati@iaanus.com> wrote in message
> news:3ddc1448_4@corp.newsgroups.com...

    [...]
> Yes, we do that. And it's cumbersome. And we still can't avoid all the
> shortcomings of other libraries. So what we do is supply a template
> class Dinkum::wbuffer_convert that looks like a wide stream, does all
> its I/O through a basic_streambuf<char>, and is tolerant enough to
> work properly with an N-to-M codecvt facet. We also have a template
> class Dinkum::wstring_convert that lets you use these ambitious
> facets for conversions between string and wstring objects. With these
> helpers, we can offer all our code conversions for use with all
> existing Standard C++ libraries.

Logically, wouldn't something like this be a better solution overall
anyway.  Why is code conversion intimately linked with reading and
writing to files?  Why do I have to duplicate functionality already in
the standard library if I need code conversion in my socket stream?
Wouldn't a better design be to have a filtering streambuf for code
conversion?  (I know, since I'm the one suggesting it, someone is bound
to make a remark about "when all you have is a hammer...":-).)
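
For what it's worth, here is a minimal input-only sketch of such a
filtering streambuf (the class name is invented; putback, output and
seeking are omitted):

    #include <cstring>
    #include <locale>
    #include <streambuf>

    class converting_wstreambuf : public std::wstreambuf {
    public:
        explicit converting_wstreambuf(std::streambuf* src)
            : src_(src), state_(), nbuf_(0) {}
    protected:
        int_type underflow()
        {
            typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
            const cvt_t& cvt = std::use_facet<cvt_t>(getloc());
            for (;;) {
                // try to decode one wide character from the bytes on hand
                wchar_t* to_next = &ch_;
                const char* from_next = buf_;
                cvt_t::result r = cvt.in(state_, buf_, buf_ + nbuf_,
                                         from_next, &ch_, &ch_ + 1, to_next);
                if (r == cvt_t::error)
                    return traits_type::eof();
                nbuf_ -= (std::size_t)(from_next - buf_); // keep leftovers
                std::memmove(buf_, from_next, nbuf_);
                if (to_next == &ch_ + 1) {                // got a character
                    setg(&ch_, &ch_, &ch_ + 1);
                    return traits_type::to_int_type(ch_);
                }
                if (nbuf_ == sizeof buf_)     // facet making no progress
                    return traits_type::eof();
                int b = src_->sbumpc();       // fetch one more external byte
                if (b == std::char_traits<char>::eof())
                    return traits_type::eof();
                buf_[nbuf_++] = (char)b;
            }
        }
    private:
        std::streambuf* src_;   // the byte-oriented source
        std::mbstate_t state_;
        char buf_[16];          // external bytes not yet consumed
        std::size_t nbuf_;
        wchar_t ch_;            // one-character get area
    };

A socket streambuf then stays a plain byte buffer; wide-character
access is just a converting_wstreambuf stacked on top of it.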

--
James Kanze                           mailto:jkanze@caicheuvreux.com
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: abarbati@iaanus.com (Alberto Barbati)
Date: Thu, 21 Nov 2002 23:35:52 +0000 (UTC)
P.J. Plauger wrote:
> "Alberto Barbati" <abarbati@iaanus.com> wrote in message news:3ddc1448_4@corp.newsgroups.com...
>
>>about codecvt facets, I read LWG issues 76 and 382. I still have a doubt
>>that is probably implicit in issue 382, but I would like an expert's
>>opinion. The question is:
>>
>>Short form:
>>
>>can do_in() produce one internal character without consuming external
>>characters (i.e.: return partial with from_next == from) as a result of
>>the simple processing of the state parameter?
>
>
> The C++ Standard is profoundly unclear on this and related topics.
> The most widely used implementations (Dinkumware, Rogue Wave, STLport,
> LibStdC++, SGI) are all over the map in what they permit. So even if
> you believe you can read the C++ Standard the way you'd like, you won't
> get uniform support in the real world.

I'm aware of that, unfortunately. I see that LWG issues 76 and 382 are
going in the direction of clarifying the topic, and that is very
important for the C++ community. If there is no definitive answer to my
question, at least this thread can serve to discuss possible solutions
constructively, I think.

>>The second step does not consume external characters. (note: surrogates
>>values are guaranteed to be != 0).
>>
>>Is this approach allowed by the (possibly amended) standard? If yes,
>>should this approach be explicitly described as valid?
>
> Not really. The *intent* of the C++ Standard was to support 1-to-N and
> N-to-1 mappings.

I would like to understand the rationale for ruling out N-to-M mappings.
Do you know where I could find it?

>>In fact, this approach is quite general and, if it's correct, would
>>allow any N-to-M conversion to be used with basic_filebuf without
>>violating issue 76.
>
> I agree. We do indeed supply a mapping from UTF-8 to UTF-16 as part of
> our CoreX Library. (The manual is available for perusal at our web site.)
> It is the one most likely to fail with a basic_filebuf other than our
> own.

I'm puzzled. So, do you think this approach is (please select one):

1) valid under the current standard, even if it fails with most
implementations

2) invalid under the current standard, but the LWG is going to amend the
standard to make it valid

3) invalid under the current standard and it may be submitted to the LWG
in order to consider it

4) invalid under both the current and any future revision of the
standard, though you disagree with the LWG

5) other... please specify.

>>Please notice that if, in an N-to-M mapping, the number of external
>>characters that make a "decodable" block is always greater than one, we
>>may decide to consume only some characters in step 1 and the rest in
>>step 2. However, this approach is cumbersome and rules out an entire
>>class of mappings (for example all 1-to-M mappings).
>
> Yes, we do that. And it's cumbersome. And we still can't avoid all the
> shortcomings of other libraries. So what we do is supply a template class
> Dinkum::wbuffer_convert that looks like a wide stream, does all its I/O
> through a basic_streambuf<char>, and is tolerant enough to work properly
> with an N-to-M codecvt facet. We also have a template class Dinkum::
> wstring_convert that lets you use these ambitious facets for conversions
> between string and wstring objects. With these helpers, we can offer all
> our code conversions for use with all existing Standard C++ libraries.

It's very fortunate that you raise this subject, because I am having
problems with the library shipped with .NET and you are the best source
about it. I would rather not get too platform-specific; we may continue
the discussion in private, if you are interested. The fact is that with
the .NET STL implementation, reading from a file with fstream/filebuf
fails at converting characters with both approaches (consuming N then 0,
or extracting N/2 then N/2). That's because in filebuf::uflow(), the
value of do_in()'s output parameter from_next is ignored.

Returning to the original subject, it's very interesting and valuable
that you, as a library implementor, provide alternative ways to
circumvent the underspecifications of the standard. However, I believe
it is in the interest of the C++ community to work together on removing
those underspecifications.

Alberto Barbati




Author: allan_w@my-dejanews.com (Allan W)
Date: Fri, 22 Nov 2002 02:04:56 +0000 (UTC)
abarbati@iaanus.com (Alberto Barbati) wrote
> about codecvt facets, I read LWG issues 76 and 382. I still have a doubt
> that is probably implicit in issue 382, but I would like an expert's
> opinion. The question is:
>
> Short form:
>
> can do_in() produce one internal character without consuming external
> characters (i.e.: return partial with from_next == from) as a result of
> the simple processing of the state parameter?

This issue looks familiar yet strange -- like looking directly at a
picture that you usually see only through a mirror.

We already know that I/O needs to do translations for text, and that
it isn't always a one-to-one correlation. Unix was (IIRC) the first
platform to roll "CR" and "LF" into one "Newline" character, and that
was a major influence on the C language. The first time someone ported
C to a non-Unix platform, they had to deal with
    printf("Hello!\n");
writing more than 7 bytes.

You've nailed the important difference, though. When the external size
is greater than the internal size, we still at least have distinct
positions. It may well be that your file position before the \n is 6
and your file position after the \n is 8 -- but at least they're
distinct.

When translating one external character into two or more internal
characters, things change. If you don't "consume" that external byte,
the state (at least the publicly-visible state) of the input stream
hasn't changed.

I can't answer your question as well as the likes of P.J. Plauger or
James Kanze already have. What I can do is throw in my $0.02 to agree
with what you seem to be getting at -- the next version of the standard
ought to be a lot more clear on this. It ought to specifically state
that reading a character might not have any correlation at all to the
external position. Furthermore, saving the external file position and
then seeking back to that position can give undefined results, because
there's no way to guarantee that you haven't seeked into the middle
of a multibyte sequence.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Fri, 22 Nov 2002 03:46:03 +0000 (UTC)
"Alberto Barbati" <abarbati@iaanus.com> wrote in message news:3ddd6632_4@corp.newsgroups.com...

> >>The second step does not consume external characters. (note: surrogates
> >>values are guaranteed to be != 0).
> >>
> >>Is this approach allowed by the (possibly amended) standard? If yes,
> >>should this approach be explicitly described as valid?
> >
> > Not really. The *intent* of the C++ Standard was to support 1-to-N and
> > N-to-1 mappings.
>
> I would like to understand the rationale for ruling out N-to-M mappings.
> Do you know where I could find it?

I know of no rationale for this part of the C++ Standard. FWIW, it
appears to be modeled after the multibyte/wide-character mapping
introduced into the C90 Standard and elaborated upon with Amendment
1 to the C Standard in 1995. Those mappings were 1-to-N and N-to-1.

> >>In fact, this approach is quite general and, if it's correct, would
> >>allow any N-to-M conversion to be used with basic_filebuf without
> >>violating issue 76.
> >
> > I agree. We do indeed supply a mapping from UTF-8 to UTF-16 as part of
> > our CoreX Library. (The manual is available for perusal at our web site.)
> > It is the one most likely to fail with a basic_filebuf other than our
> > own.
>
> I'm puzzled. So, do you think this approach is (please select one):
>
> 1) valid under the current standard, even if it fails with most
> implementations

No.

> 2) invalid under the current standard, but the LWG is going to amend the
> standard to make it valid

Dunno.

> 3) invalid under the current standard and it may be submitted to the LWG
> in order to consider it

I'm not sure of the state of the various DRs on codecvt, but I vaguely
recall volunteering to craft better wording. Still awaiting holy orders
from Matt Austern. I personally see no reason why codecvt shouldn't
support N-to-M, whatever the restrictions on wide-character encodings
may be.

> 4) invalid under both the current and any future revision of the
> standard, though you disagree with the LWG

Possibly.

> 5) other... please specify.

A real possibility is that for years to come you'll find very little
uniformity among implementations of template class basic_filebuf.
That's why we added our helper templates, to avoid the worst weaknesses
in this area.

> >>Please notice that if, in an N-to-M mapping, the number of external
> >>characters that make a "decodable" block is always greater than one, we
> >>may decide to consume only some characters in step 1 and the rest in
> >>step 2. However, this approach is cumbersome and rules out an entire
> >>class of mappings (for example all 1-to-M mappings).
> >
> > Yes, we do that. And it's cumbersome. And we still can't avoid all the
> > shortcomings of other libraries. So what we do is supply a template class
> > Dinkum::wbuffer_convert that looks like a wide stream, does all its I/O
> > through a basic_streambuf<char>, and is tolerant enough to work properly
> > with an N-to-M codecvt facet. We also have a template class Dinkum::
> > wstring_convert that lets you use these ambitious facets for conversions
> > between string and wstring objects. With these helpers, we can offer all
> > our code conversions for use with all existing Standard C++ libraries.
>
> It's very fortunate that you raise this subject, because I am having
> problems with the library shipped with .NET and you are the best source
> about it. I would not want to go too platform-specific, we may continue
> the discussion in private, if you are interested. Fact is that with the
> .NET STL implementation, reading from a file with fstream/filebuf fails
> at converting characters with both approaches (consuming N then 0 or
> extracting N/2 then N/2). That's because in filebuf::uflow(), the value
> of do_in()'s output parameter from_next is ignored.

It would take me a while to get up to speed on this. I can say that we've
tweaked basic_filebuf's interface to codecvt several times over the past
seven years. Even .NET is not our latest and best, but I think it's pretty
sturdy by industry standards. IIRC, our UTF-8 to UTF-16 converter works
fine with the basic_filebuf in .NET.

> Returning to the original subject, it's very interesting and valuable
> that you, as a library implementor, provides alternative ways to
> circumvent the underspecifications of the standard. However, I believe
> it is in the interest of the C++ community to work together on removing
> those underspecifications.

Agreed. But codecvt is but one of many spongy areas in C++ locales that
need cleaning up. It'll take a while.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]





Author: pjp@dinkumware.com ("P.J. Plauger")
Date: Fri, 22 Nov 2002 05:53:24 +0000 (UTC)
"Allan W" <allan_w@my-dejanews.com> wrote in message
news:7f2735a5.0211211756.17e8c185@posting.google.com...

> When translating one external character into two or more internal
> characters, things change. If you don't "consume" that external byte,
> the state (at least the publicly-visible state) of the input stream
> hasn't changed.

That's why you have to store the conversion state in the pos_type
object returned on a seek/tell.

> I can't answer your question as well as the likes of P.J. Plauger or
> James Kanze already have. What I can do is throw in my $0.02 to agree
> with what you seem to be getting at -- the next version of the standard
> ought to be a lot more clear on this. It ought to specifically state
> that reading a character might not have any correlation at all to the
> external position. Furthermore, saving the external file position and
> then seeking back to that position can give undefined results, because
> there's no way to guarantee that you haven't seeked into the middle
> of a multibyte sequence.

I think those general words are sufficient, and you can do a lot more
successful seeking than you might imagine, thanks to that stored
conversion state. (Still, I wouldn't stress most implementations in
this area.)

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]