Topic: UNICODE/ANSI neutral strings


Author: "Daniel Parker" <danielp@nospam.com>
Date: 1998/08/30
B. K. Oxley (binkley) at Work wrote in message
<6s6ev2$70f$1@ndnws01.ne.highway1.com>...

>Christopher Eltschka wrote in message
><35E555EB.F7D668B8@physik.tu-muenchen.de>...

>|The simplest approach in C++ would be to use the freedom of defining
>|wchar_t. That is, the compiler could implement a compiler switch to
>|set wchar_t to be equivalent to char, or to 16 bits.

>Please don't forget that wchar_t <> UNICODE character.  On several UNIX
>platforms, wchar_t is 32-bit, whereas UNICODE 2.0 characters are 16-bit
>(ignoring the issue of surrogates and the like).  This particular point
>is causing me and my company some agony, as we are trying to use UNICODE
>on all platforms, so we cannot rely on wstring, etc, but have to roll
>our own.  This is especially troublesome with the lower-level code in
>the standard library (see the thread on codecvt, for example).

Yikes.  I've never cared for Microsoft's approach of redefining the
character string when switching back and forth from UNICODE.  I'd rather
standardize on UNICODE strings with 16-bit characters, and coerce the output
to 8-bit characters when required.  I didn't think the standard library
provided great support for that to begin with, but I gather from this post
that it doesn't provide *any*, other than providing some code that we can
copy to base our own string class on.

--
Regards,
Daniel Parker danielp@no_spam.anabasis.com.





[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]






Author: Pete Becker <petebecker@acm.org>
Date: 1998/08/30
Daniel Parker wrote:
>
>
> Yikes.  I've never cared for Microsoft's approach of redefining the
> character string when switching back and forth from UNICODE.  I'd rather
> standardize on UNICODE strings with 16-bit characters, and coerce the output
> to 8-bit characters when required.  I didn't think the standard library
> provided great support for that to begin with, but I gather from this post
> that it doesn't provide *any*, other than providing some code that we can
> copy to base our own string class on.

It's not that bad. basic_string is a template. If you need to insist on
16-bit character strings, figure out which arithmetic type is 16 bits
wide on your platform and use basic_string<int16>.

--
Pete Becker
Dinkumware, Ltd.
http://www.dinkumware.com








Author: "B. K. Oxley (binkley) at Work" <binkley@eps.inso.com>
Date: 1998/08/31
Pete Becker wrote in message <35E9A94B.20285622@acm.org>...
|
|Daniel Parker wrote:
|>
|>
|> Yikes.  I've never cared for Microsoft's approach of redefining the
|> character string when switching back and forth from UNICODE.  I'd rather
|> standardize on UNICODE strings with 16-bit characters, and coerce the output
|> to 8-bit characters when required.  I didn't think the standard library
|> provided great support for that to begin with, but I gather from this post
|> that it doesn't provide *any*, other than providing some code that we can
|> copy to base our own string class on.
|
|It's not that bad. basic_string is a template. If you need to insist on
|16-bit character strings, figure out which arithmetic type is 16 bits
|wide on your platform and use basic_string<int16>.


It still isn't quite that simple.  For basic_string<>, yes, it is
fairly straightforward (but consider that char_traits<> has a
char_traits<char> specialization to call into your platform's
[presumably] optimized str* functions).  But for the more interesting
classes, such as those for I/O conversion or for character
classification, it is quite daunting, if fun.

Proper UNICODE support also entails several nasty,
hard-to-sweep-under-the-rug issues: surrogates (pairs of 16-bit units
used to represent characters outside the BMP [Basic Multilingual
Plane] -- regular UNICODE characters are a 16-bit subset of the full,
32-bit ISO 10646 character set); byte ordering, which is
platform-dependent ("well-behaved" UNICODE strings need to start with
a byte-order mark, a pair of "no-op" bytes used to distinguish byte
ordering [ugh!]); and encoding (7-bit vs. 8-bit [believe it or not!],
compression, UTF-8 vs. UTF-16).  The list goes on.  None of these are
addressed in the Standard C++ Library, nor should they be until C++
decides to take up UNICODE.  Contrast with Java, which, although
UNICODE-based, also hasn't dealt fully with all these issues, just
tackling the easy problems first (the 90% solution) and ignoring the
rest.

Here is a simple example: what do you do with accented lowercase vowels
in French when you try to uppercase them?  It depends on whether you are
speaking about French or Canadian French!  One of them keeps the
diacritical markings; the other drops them.  A central problem for proper
support, then, is locale issues.  Just try reading the locale sections
of the standard.  It is really geared towards supporting the standard
"C" library locale specifications, not locale issues in the broader
sense (although there is certainly machinery to extend the Standard C++
Library in this respect [facets, et al.], it is hardly implemented
universally).

--binkley









Author: jkanze@otelo.ibmmail.com
Date: 1998/09/02
In article <6se9bl$peg$1@ndnws01.ne.highway1.com>,
  "B. K. Oxley (binkley) at Work" <binkley@eps.inso.com> wrote:

> Here is a simple example:  what do you do with accented lowercase vowels
> in French when you try to uppercase them?  It depends on whether you are
> speaking about French or Canadian French!  One of them keeps the
> diacritical markings, the other drops them.

Are you sure of this?  The formal standard in French would say to keep
them, though in practice they are often dropped, usually in conjunction
with a font in which there is no graphical distinction.

Anyway, your point is correct; there is more to localization than would
be immediately apparent.  (Let's not forget that Unicode recognizes
three cases, and that languages like Arabic use four different
representations for most characters, in a context-dependent manner.)

> A central problem for proper
> support, then, is locale issues.  Just try reading the locale sections
> of the standard.  It is really geared towards supporting the standard
> "C" library locale specifications, not locale issues in the broader
> sense (although there is certainly machinery to extend the Standard C++
> Library in this respect [facets, et al], it is hardly implemented
> universally).

In practice, the C++ library has defined a mechanism which allows an
unbelievable degree of flexibility, but is too complex for the average
programmer, and it has ignored one of the most basic issues: that of
positionally dependent variable fields.  The result is that we are
forced to shift all of our national-language-dependent code into the
Java part of the application; C++ is just not usable.

--
James Kanze    +33 (0)1 39 23 84 71    mailto: kanze@gabi-soft.fr
        +49 (0)69 66 45 33 10    mailto: jkanze@otelo.ibmmail.com
GABI Software, 22 rue Jacques-Lemercier, 78000 Versailles, France
Conseils en informatique orientée objet --
              -- Beratung in objektorientierter Datenverarbeitung











Author: jcoffin@taeus.com (Jerry Coffin)
Date: 1998/08/26
In article <6rus06$fc4$1@nnrp1.dejanews.com>, jkanze@otelo.ibmmail.com
says...

[ ... ]

> Have you ever actually tried just changing the define on Windows code,
> to convert an 8 bit character code program to a Unicode one?  I don't
> think it will work unless you also have a lot of conditional compilations
> in your code as well.

Yes and no -- MS provides a header that includes a (fairly large)
number of conditionally compiled names for functions, that resolve to
one function if UNICODE is defined and another if UNICODE is NOT
defined.  E.g. _tcscpy becomes strcpy if UNICODE isn't defined and
wcscpy (IIRC) if it IS defined.

Regardless of the exact names, if you use the names they define, you
can indeed produce a program that will work correctly for either
narrow or wide characters, with no #ifdefs and such in your own
code.  OTOH, you ARE using their header, which is basically just a huge
collection of #ifdefs already done for you...

Personally, I don't particularly like this approach, but I haven't
come up with a version that's dramatically better either, so I guess
I'll put up with theirs until I do.  One major problem (at least for
them) is coming up with a method that's reasonably portable not only
for C and C++, but completely different languages like Visual BASIC...

--
    Later,
    Jerry.

The Universe is a figment of its own imagination.








Author: Christopher Eltschka <celtschk@physik.tu-muenchen.de>
Date: 1998/08/27
Jerry Coffin wrote:

> In article <6rus06$fc4$1@nnrp1.dejanews.com>, jkanze@otelo.ibmmail.com
> says...

> > Have you ever actually tried just changing the define on Windows code,
> > to convert an 8 bit character code program to a Unicode one?  I don't
> > think it will work unless you also have a lot of conditional compilations
> > in your code as well.

> Yes and no -- MS provides a header that includes a (fairly large)
> number of conditionally compiled names for functions, that resolve to
> one function if UNICODE is defined and another if UNICODE is NOT
> defined.  E.g. _tcscpy becomes strcpy if UNICODE isn't defined and
> wcscpy (IIRC) if it IS defined.

> Regardless of the exact names, if you use the names they define, you
> can indeed produce a program that will work correctly for either
> narrow or wide characters, with no #ifdef's, and such in your own
> code.  OTOH, you ARE using their header which is basically just a huge
> collection of #ifdef's already done for you...

> Personally, I don't particularly like this approach, but I haven't
> come up with a version that's dramatically better either, so I guess
> I'll put up with theirs until I do.  One major problem (at least for
> them) is coming up with a method that's reasonably portable not only
> for C and C++, but completely different languages like Visual BASIC...

The simplest approach in C++ would be to use the freedom of defining
wchar_t. That is, the compiler could implement a compiler switch to
set wchar_t to be equivalent to char, or to a 16-bit type. In addition, it
would link in different versions of the standard library where wchar_t
is treated as ASCII or UNICODE characters. This would be transparent
and would work with any standard C++ program using wchar_t (and not
only those written specifically for that particular implementation),
as long as the program itself doesn't make any assumptions about the
content/size of wchar_t.









Author: "B. K. Oxley (binkley) at Work" <binkley@eps.inso.com>
Date: 1998/08/28
Christopher Eltschka wrote in message
<35E555EB.F7D668B8@physik.tu-muenchen.de>...
|
|The simplest approach in C++ would be to use the freedom of defining
|wchar_t. That is, the compiler could implement a compiler switch to
|set wchar_t to be equivalent to char, or to a 16-bit type. In addition, it
|would link in different versions of the standard library where wchar_t
|is treated as ASCII or UNICODE characters. This would be transparent
|and would work with any standard C++ program using wchar_t (and not
|only those written specifically for that particular implementation),
|as long as the program itself doesn't make any assumptions about the
|content/size of wchar_t.


Please don't forget that wchar_t <> UNICODE character.  On several UNIX
platforms, wchar_t is 32-bit, whereas UNICODE 2.0 characters are 16-bit
(ignoring the issue of surrogates and the like).  This particular point
is causing me and my company some agony, as we are trying to use UNICODE
on all platforms, so we cannot rely on wstring, etc, but have to roll
our own.  This is especially troublesome with the lower-level code in
the standard library (see the thread on codecvt, for example).

--binkley










Author: AllanW@my-dejanews.com
Date: 1998/08/28

> > In article <6rus06$fc4$1@nnrp1.dejanews.com>, jkanze@otelo.ibmmail.com
> > says...
> > > Have you ever actually tried just changing the define on Windows code,
> > > to convert an 8 bit character code program to a Unicode one?  I don't
> > > think it will work unless you also have a lot of conditional compilations
> > > in your code as well.

> Jerry Coffin wrote:
> > Yes and no -- MS provides a header that includes a (fairly large)
> > number of conditionally compiled names for functions, that resolve to
> > one function if UNICODE is defined and another if UNICODE is NOT
> > defined.  E.g. _tcscpy becomes strcpy if UNICODE isn't defined and
> > wcscpy (IIRC) if it IS defined.
>
> > Regardless of the exact names, if you use the names they define, you
> > can indeed produce a program that will work correctly for either
> > narrow or wide characters, with no #ifdef's, and such in your own
> > code.  OTOH, you ARE using their header which is basically just a huge
> > collection of #ifdef's already done for you...
[snip]

In article <35E555EB.F7D668B8@physik.tu-muenchen.de>,
  Christopher Eltschka <celtschk@physik.tu-muenchen.de> wrote:
> The simplest approach in C++ would be to use the freedom of defining
> wchar_t. That is, the compiler could implement a compiler switch to
> set wchar_t to be equivalent to char, or to a 16-bit type. In addition, it
> would link in different versions of the standard library where wchar_t
> is treated as ASCII or UNICODE characters. This would be transparent
> and would work with any standard C++ program using wchar_t (and not
> only those written specifically for that particular implementation),
> as long as the program itself doesn't make any assumptions about the
> content/size of wchar_t.

But this is almost exactly what Microsoft does, except that its
defined type is called _TCHAR.  When compiled with UNICODE, this
is typedef'd to be wchar_t; otherwise it's the same as char. To
make this portable to other compilers, add this to some common
header file:
    #ifdef _MSC_VER
    #include <tchar.h>      // Let Microsoft define it if possible
    #elif defined(UNICODE)
    typedef wchar_t _TCHAR; // Using Unicode
    #else
    typedef char _TCHAR;    // Not using unicode
    #endif
Obviously, this assumes you've defined UNICODE if you're using
Unicode; if not, make the appropriate change.

--
AllanW@my-dejanews.com is a "Spam Magnet" -- never read.
Please reply in USENET only, sorry.









Author: "Venus Lee" <analinguica@hotmail.com>
Date: 1998/08/25
Why doesn't the C++ standard include UNICODE/ANSI-neutral classes? In Windows,
nearly everything, including the character data type, is typedefed such that
changing your app from ANSI to UNICODE or vice versa requires only a UNICODE
#define at the top of the source code.

Actually, to make such a string would be very easy:

#ifdef UNICODE
#define _tstring wstring
#else
#define _tstring string
#endif

but it would be nice if Standard C++ provided for such a case.

Thanks.

Venus Lee.









Author: jkanze@otelo.ibmmail.com
Date: 1998/08/25
In article <6rt5t4$e6t@news1.rad.net.id>,
  "Venus Lee" <analinguica@hotmail.com> wrote:
>
> Why doesn't C++ standard include UNICODE/ANSI neutral classes? In Windows,
> nearly everything including character data type is typedefed such that
> changing your app from ANSI to UNICODE or vice versa requires only UNICODE
> #definition at the top of the source code.
>
> Actually, to make such a string would be very easy:
>
> #ifdef UNICODE
> #define _tstring wstring
> #else
> #define _tstring string
> #endif
>
> but it would be nice if Standard C++ provides for such case.

Have you ever actually tried just changing the define on Windows code,
to convert an 8 bit character code program to a Unicode one?  I don't
think it will work unless you also have a lot of conditional compilation
in your code as well.

--
James Kanze    +33 (0)1 39 23 84 71    mailto: kanze@gabi-soft.fr
        +49 (0)69 66 45 33 10    mailto: jkanze@otelo.ibmmail.com
GABI Software, 22 rue Jacques-Lemercier, 78000 Versailles, France
Conseils en informatique orientée objet --
              -- Beratung in objektorientierter Datenverarbeitung


