Topic: ANSI C++ (ISO 8859-1 National Character Set FAQ)


Author: jim.fleming@bytes.com (Jim Fleming)
Date: 1995/04/19
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Please save this article for future reference. It will be needed
as part of the public comment period for the DRAFT ANSI Standard
for the C++ Programming Language.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

In article
<internationalization/iso-8859-1-charset_798149278@rtfm.mit.edu>,
mike@vlsivie.tuwien.ac.at says...
>
>Archive-name: internationalization/iso-8859-1-charset
>Posting-Frequency: monthly
>Version: 2.7
>
>
>                  ISO 8859-1  National Character Set FAQ
>
>DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
>THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
>OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
>
>Note: Most of this was tested on a Sun 10, running SunOS 4.1.* - other
>systems might differ slightly
>
>This FAQ discusses topics related to the use of ISO 8859-1 based 8 bit
>character sets. It discusses how to use European (Latin American)
>national character sets on UNIX-based systems and the Internet.
>
>If you need to use a character set other than ISO 8859-1, much of
>what is described here will be of interest to you.  However, you will
>need to find appropriate fonts for your character set (see section 17)
>and input mechanisms adapted to your language.
>
>
>
>1. Which coding should I use for accented characters?
>Use the internationally standardized ISO-8859-1 character set to type
>accented characters. This character set contains all characters
>necessary to type (West) European languages. This encoding is also the
>preferred encoding on the Internet.  ISO 8859-X character sets use the
>characters 0xa0 through 0xff to represent national characters, while
>the characters in the 0x20-0x7f range are those used in the US-ASCII
>(ISO 646) character set.  Thus, ASCII text is a proper subset of all
>ISO 8859-X character sets.
>
>The characters 0x80 through 0x9f are earmarked as extended control
>characters, and are not used for encoding characters.  These characters
>are not currently used to specify anything.  A practical reason for
>this is interoperability with 7 bit devices (or when the 8th bit gets
>stripped by faulty software).  If printable characters were assigned
>to this range, a device that lost the 8th bit would interpret them as
>control characters and could be put into an undefined state.
>(When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a
>wrong character is represented, but this cannot change the state of a
>terminal or other device.)
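
An aside on the paragraph above: the following is a minimal Python
sketch (an illustration only, not part of the FAQ) of what clearing
the 8th bit does to each half of the upper range.  The helper name
strip_high_bit is hypothetical.

```python
# Sketch: effect of stripping the 8th bit on ISO 8859-1 byte values.
# strip_high_bit is a hypothetical helper, not taken from any library.

def strip_high_bit(byte: int) -> int:
    """Clear bit 7, as a 7 bit device or faulty software would."""
    return byte & 0x7F

# 0x80-0x9f collapse onto the control codes 0x00-0x1f.
c1_results = [strip_high_bit(b) for b in range(0x80, 0xA0)]

# 0xa0-0xff collapse onto 0x20-0x7f (printable ASCII, plus DEL at 0x7f).
gr_results = [strip_high_bit(b) for b in range(0xA0, 0x100)]

if __name__ == "__main__":
    print(all(r < 0x20 for r in c1_results))
    print(all(0x20 <= r <= 0x7F for r in gr_results))
    # a-umlaut (0xe4) degrades to a wrong but ordinary letter
    print(chr(strip_high_bit(0xE4)))
```

Note that 0xff maps to 0x7f (DEL), so the claim that the upper half is
harmless holds for 0xa0 through 0xfe and depends on how a given device
treats DEL.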
>
>This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS
>is practically equivalent to ISO 8859-1) and (practically all) UNIX
>implementations.  MS-DOS normally uses a different character set and
>is not compatible with this character set. (It can, however, be
>translated to this format with various tools. See section 5.)
>
>Footnote: Supposedly, IBM code page 819 is fully ISO 8859-1 compliant.
>
>
>ISO 8859-1 supports the following languages:
>Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish,
>French, Galician, German, Icelandic, Irish, Italian, Norwegian,
>Portuguese, Spanish and Swedish.
>
>(It has been called to my attention that Albanian can be written with
>ISO 8859-1 also.  However, from a standards point of view, ISO 8859-2
>is the appropriate character set for Balkan countries.)
>
>ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
>several character sets:
>8859-1  Europe, Latin America
>8859-2  Eastern Europe
>8859-3  SE Europe/miscellaneous (Esperanto, Maltese, etc.)
>8859-4  Scandinavia/Baltic (mostly covered by 8859-1 also)
>8859-5  Cyrillic
>8859-6  Arabic
>8859-7  Greek
>8859-8  Hebrew
>8859-9  Latin5, same as 8859-1 except for Turkish instead of Icelandic
>8859-10 Latin6, for Lappish/Nordic/Eskimo languages
>
>Unicode is advantageous because one character set suffices to encode
>all the world's languages; however, very few programs (and even fewer
>operating systems) support wide characters. Thus, only 8 bit wide
>character sets (such as the ISO 8859-X) can be used with these
>systems.  Unfortunately, some programmers still insist on using the
>`spare' eighth bit for clever tricks, crippling these programs such
>that they can process only US-ASCII characters.
>
>
>Footnote: Some people have complained about missing characters,
>          e.g. French users about a missing 'oe'.  Note that oe is
>          not a character, but a ligature (a combination of two
>          characters for typographical purposes).  Ligatures are not
>          part of the ISO 8859-X standard.  (Although 'oe' used to
>          be in the draft 8859-1 standard before it was unmasked as
>          a `mere' ligature.)
>
>
>
>2. Getting your terminal to handle ISO characters.
>Terminal drivers normally do not pass 8 bit characters. To enable
>proper handling of ISO characters, add the following lines to your
>.cshrc:
>----------------------------------
>tty -s
>if ($status == 0) stty cs8 -istrip -parenb
>----------------------------------
>If you don't use csh, add equivalent code to your shell's start up
>file.
>
>Note that it is necessary to check whether your standard I/O streams
>are connected to a terminal. Only then should you reconfigure the
>terminal driver.  Note that tty checks stdin, but stty changes stdout.
>This is OK in normal code, but if the .cshrc is executed in a pipe,
>you may get spurious warnings :-(
>
>If you use the Bourne Shell or descendants (sh, ksh, bash,
>zsh), use this code in your startup (e.g. .profile) file:
>----------------------------------
>tty -s
>if [ $? = 0 ]; then
>        stty cs8 -istrip -parenb >&0
>fi
>----------------------------------
>
>Footnote: In the /bin/sh version, we redirect stdout to stdin, so both
>tty and stty operate on stdin.  This avoids the problem discussed for
>the /bin/csh version.  A possible workaround for csh users is to use
>the following code in .cshrc, which spawns a Bourne shell (/bin/sh) to
>handle the redirection:
>----------------------------------
>tty -s
>if ($status == 0) sh -c "stty cs8 -istrip -parenb >&0"
>----------------------------------
>
>
>
>3. Getting the locale setting right.
>For the ctype macros (and by extension, applications you are running
>on your system) to correctly identify accented characters, you
>may have to set the ctype locale to an ISO 8859-1 conforming
>configuration. On SunOS, this may be done by placing
>------------------------------------
>setenv LANG C
>setenv LC_CTYPE iso_8859_1
>------------------------------------
>in your .login script (if you use the csh). An equivalent statement
>will adjust the ctype locale for non-csh users.
>
>The process is the same for other operating systems, e.g. on HP/UX use
>'setenv LANG german.iso88591'; on IRIX 5.2 use 'setenv LANG de'; on
>Ultrix 4.3 use 'setenv LANG GER_DE.8859' and on OSF/1 use 'setenv LANG
>de_DE.88591'.  The examples given here are for German.  Other
>languages work too, depending on your operating system.  Check out
>'man setlocale' on your system for more information.
>
>*****If you can confirm or deny this, please let me know.*****
>Currently, each system vendor has his own set of locale names, which
>makes portability a bit problematic.  Supposedly there is some X/Open
>document specifying a
>
>        <language>_<country>.<character_encoding>
>
>syntax for environment variables specifying a locale, but I'm unable
>to confirm this.
>
>While many vendors now use the <language>_<country> encoding, there
>are many different encodings for languages and countries.
>
>Many vendors seem to use some derivative of this encoding:
>It looks as if <language> is the two-letter code for the language from
>ISO 639, and <country> is the two-letter code for the country from ISO
>3166, but I don't know of any standard specifying <character_encoding>.
>*****If you can confirm or deny this, please let me know.*****
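
To illustrate the naming scheme discussed above, here is a small
Python sketch that splits a name of the form
<language>_<country>.<character_encoding> into its parts.  It assumes
a two-letter lower-case language and a two-letter upper-case country,
as the FAQ guesses; the names parse_locale_name and LocaleName are
hypothetical, and vendor names that do not follow the scheme are
simply rejected.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str            # e.g. "de" (two-letter language code)
    country: str             # e.g. "DE" (two-letter country code)
    encoding: Optional[str]  # vendor-specific, e.g. "88591"; may be absent


# Assumed syntax: ll_CC optionally followed by ".encoding"
_PATTERN = re.compile(r"^([a-z]{2})_([A-Z]{2})(?:\.(.+))?$")


def parse_locale_name(name: str) -> Optional[LocaleName]:
    """Return the parts of a locale name, or None if it does not fit."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return LocaleName(match.group(1), match.group(2), match.group(3))


if __name__ == "__main__":
    for name in ("de_DE.88591", "de_DE.iso88591", "german.iso88591",
                 "GER_DE.8859", "de"):
        print(name, "->", parse_locale_name(name))
```

Of the vendor names quoted in this section, only the OSF/1 and HP-UX
10.0 forms fit the pattern, which is the portability problem the FAQ
describes.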
>
>
>Footnote on HP/UX systems:
>As of 10.0, you can use either german.iso88591 or de_DE.iso88591 (a
>name more in line with other vendors and developing standards for
>locale names).  For a complete listing of locale names, see the text
>file /usr/lib/nls/config.  Or, on HP-UX 10.0, execute 'locale -a'.
>This command will list all locales currently installed on your system.
>
>
>
>4. Selecting the right font under X11 for xterm (and other applications)
>To actually display accented characters, you need to select a font
>which contains bit maps for ISO 8859-1 characters in the
>correct character positions. The names of these fonts normally
>have the suffix "iso8859-1". Use the command
># xlsfonts
>to list the fonts available on your system. You can preview a
>particular font with the
># xfd -fn <fontname>
>command.
>
>Add the appropriate font selection to your ~/.Xdefaults file, e.g.:
>----------------------------------------------------------------------------
>XTerm*Font: -adobe-courier-medium-r-normal--18-180-75-75-m-110-iso8859-1
>Mosaic*XmLabel*fontList: -*-helvetica-bold-r-normal-*-14-*-*-*-*-*-iso8859-1
>----------------------------------------------------------------------------
>
>While X11 is further along than most system software when it comes to
>internationalization, it still contains many bugs.  A number of bug
>fixes can be found at URL http://www.dtek.chalmers.se:80/~maf/i18n/.
>
>Footnote: The X11R5 distribution has some fonts which are labeled as
>ISO fonts, but which contain only the US-ASCII characters.
>
>
>
>5. Translating between different international character sets.
>While ISO 8859-1 is an international standard, not everybody uses this
>encoding. Many computers use their own, vendor-specific character sets
>(most notably Microsoft for MS-DOS).  If you want to edit or view files
>written in a different encoding, you will have to translate them to an
>ISO 8859-1 based representation.
>
>There are several PD/free character set translators available on the
>Internet, the most notable being 'recode'.  recode is available from
>URL ftp://prep.ai.mit.edu/u2/emacs.  recode is covered by FSF
>copyright and is freely redistributable.
>
>The common form of the program call is:
>
>recode [OPTION]... [BEFORE]:[AFTER] [FILE]
>
>Each FILE will be read assuming it is coded with charset BEFORE, and
>will be recoded over itself to use the charset AFTER.  If no FILE is
>given, the program acts as a filter instead and recodes standard input
>to standard output.
>
>Some recodings are not reversible, so after you have converted the
>file (recode overwrites the original file with the new version!), you
>may never be able to reconstruct the original file.  A safer way of
>changing the encoding of a file is to use the filter mechanism of
>recode and invoke it as follows:
>
>recode [OPTION]... [BEFORE]:[AFTER] <[OLDFILE] >[NEWFILE]
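
For readers without recode at hand, the same non-destructive pattern
(read OLDFILE, write NEWFILE, leave the original alone) can be sketched
in Python using its built-in codecs.  This is not recode and does not
use recode's charset names; "cp437" is assumed here as an example of an
MS-DOS code page, and recode_bytes and recode_file are hypothetical
helper names.

```python
# Sketch of a filter-style conversion that never overwrites its input.
# Uses Python's built-in codecs, not the recode program.

def recode_bytes(data: bytes, before: str, after: str) -> bytes:
    """Decode data from charset `before` and re-encode it in `after`.

    Characters with no equivalent in `after` become "?", which is
    exactly the kind of irreversible step the FAQ warns about.
    """
    text = data.decode(before)
    return text.encode(after, errors="replace")


def recode_file(old_path: str, new_path: str,
                before: str, after: str) -> None:
    """Write a recoded copy of old_path to new_path."""
    with open(old_path, "rb") as src:
        data = src.read()
    with open(new_path, "wb") as dst:
        dst.write(recode_bytes(data, before, after))


if __name__ == "__main__":
    # 0x84 is a-umlaut in code page 437; it is 0xe4 in ISO 8859-1.
    print(recode_bytes(b"\x84", "cp437", "latin-1"))
    # 0xc4 is a box-drawing character in cp437; Latin-1 has none.
    print(recode_bytes(b"\xc4", "cp437", "latin-1"))
```

Because the original file is kept, a lossy conversion such as the
second one can still be redone with a different target charset.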
>
>Under SunOS, the dos2unix and unix2dos programs (distributed with
>SunOS) will translate between MS-DOS and ISO 8859-1 formats.
>
>It is somewhat more difficult to convert German, `Duden'-conformant
>Ersatzdarstellung (ä = ae, ß = sz (or not so conformant `ss') etc.)
>into the ISO 8859-1 character set.  The German dictionary available as
>URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/dicts/deutsch.tar.gz also
>contains a UNIX shell script which can handle all conversions except
>ones involving ß (German scharfes-s), as for `ss' this change is more
>complicated.
>
>A more sophisticated program to translate Duden Ersatzdarstellung to
>ISO 8859-1 is Gustaf Neumann's diac program (version 1.3 or later)
>which can translate all ASCII sequences to their respective ISO 8859-1
>character set representation.  'diac' is available in URL
>ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/diac.
>
>Translating ISO 8859-1 to ASCII can be performed with a little sed
>script according to your needs.  But be aware that
>* No one-to-one mapping between Latin 1 and ASCII strings is possible.
>* Text layout may be destroyed by multi-character substitutions,
>  especially in tables.
>* Different replacements may be in use for different languages,
>  so no single standard replacement table will make everyone happy.
>* Truncation or line wrapping might be necessary to fit textual data
>  into fields of fixed width.
>* Reversing this translation may be difficult or impossible.
>* You may be introducing ambiguities into your data.
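
The caveats listed above are easy to demonstrate.  The following
Python sketch uses a small German replacement table (an assumption
chosen for illustration; other languages would want other
replacements) and a hypothetical function latin1_to_ascii.

```python
# Sketch: a lossy Latin 1 -> ASCII transliteration with German rules.
# The table is illustrative and deliberately incomplete.

GERMAN_REPLACEMENTS = {
    "\xe4": "ae",  # a-umlaut
    "\xf6": "oe",  # o-umlaut
    "\xfc": "ue",  # u-umlaut
    "\xc4": "Ae",  # A-umlaut
    "\xd6": "Oe",  # O-umlaut
    "\xdc": "Ue",  # U-umlaut
    "\xdf": "ss",  # sharp s
}


def latin1_to_ascii(text: str, table=GERMAN_REPLACEMENTS) -> str:
    """Replace non-ASCII characters using table; unknown ones become '?'."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append(table.get(ch, "?"))
    return "".join(out)


if __name__ == "__main__":
    # Two different words collapse to the same ASCII string (ambiguity).
    print(latin1_to_ascii("Ma\xdfe"), latin1_to_ascii("Masse"))
    # The result is longer than the input (layout in tables may break).
    print(len("f\xfcr"), len(latin1_to_ascii("f\xfcr")))
```

Since both inputs in the first example produce the same output, no
reverse translation can tell them apart without a dictionary.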
>
>
>
>6. Printing accented characters.
>
>6.1 PostScript printers
>If you want to print accented characters on a PostScript printer, you
>may need a PS filter which can handle ISO characters.
>
>Our PostScript filter of choice is a2ps, recent versions of which can
>handle ISO 8859-1 characters with the -8 option.  a2ps V4.3 is
>available as URL ftp://imag.imag.fr/archive/postscript/a2ps.V4.3.tar.gz.
>
>If you use the pps PostScript filter, use the 'pps -ISO' option for
>pps to handle ISO 8859-1 characters properly.
>
>
>6.2 Other (non-PS) printers:
>If you want to print to non-PS printers, your success rate depends on
>the encoding the printer uses. Several alternatives are possible:
>
>* Your printer accepts ISO 8859-1:
>  You're lucky. No conversion is needed, just send your files to the
>  printer.
>
>
>* Your printer supports a PC-compatible font:
>  You can use the recode tool to translate from ISO 8859-1 to this
>  encoding. (If you are using a SunOS based computer, you can also use
>  the unix2dos utility which is part of the standard distribution.)
>  Just add the appropriate invocation as a built-in filter to your
>  printer driver.
>
>
>* Your printer uses a national ISO 646 variant (7 bit ASCII
>  with some special characters replaced by national characters):
>  You will have to use a translation tool; this tool would
>  then be installed in the printer driver and translate character
>  conventions before sending a file to the printer.  The recode
>  program supports many national ISO 646 norms.  (If you add support
>  for another one, please submit it to the maintainers of recode, so
>  that it can benefit everybody.)
>
>  Unfortunately, you will not be able to display all characters with
>  the built-in character set. Most printers have user-definable
>  bitmap characters, which you can use to print all ISO characters.
>  You just have to generate a bitmap for any particular character and
>  send this bitmap to the printer.  The syntax for these characters
>  varies, but a few conventions have gained wide acceptance
>  (e.g., many printers can process Epson-compatible escape sequences).
>
>
>* Your printer supports a strange format:
>  If your printer supports some other strange format (e.g. HP Roman8,
>  DEC MCS, Atari, NeXTStep, EBCDIC or what have you), you have to add a
>  filter which will translate ISO 8859-1 to this encoding before
>  sending your data to the printer.  'recode' supports many of these
>  character sets already.  If you have to write your own conversion
>  tool, consider this as a good starting base. (If you add support for
>  any new character sets, please submit your code changes to the
>  maintainers of recode).
>
>  If your printer supports DEC MCS, this is nearly equivalent to ISO
>  8859-1 (actually, it is a former ISO 8859-1 draft standard).  The
>  only characters which are missing are the Icelandic characters (eth
>  and thorn) at locations 0xD0, 0xF0, 0xDE and 0xFE, so the difference
>  is only a few characters.  You could probably get by with just
>  sending ISO 8859-1 to the printer.
>
>
>* Your printer supports ASCII only:
>  You have several options:
>  + If your printer supports user-defined characters, you can print all
>    ISO characters not supported by ASCII by sending the appropriate
>    bitmaps.  You will need a filter to convert ISO 8859-1 characters
>    to the appropriate bitmaps.  (A good starting point would be
>    recode.)
>  + Add a filter to the printer driver which will strip the accent
>    characters and just print the unaccented characters. (This
>    character set is supported by recode under the name `flat' ASCII.)
>  + Add a filter which will generate escape sequences (such as
>    " <BACKSPACE> a for Umlaut-a (ä), etc.) to be printed.  Recode
>    supports this encoding under the name `ascii-bs'.
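
The last option above (accent, backspace, base letter) can be sketched
in a few lines of Python.  This is an illustration of the idea only and
makes no claim to match the exact output of recode's `ascii-bs'
charset; the table and the name to_overstrike are hypothetical.

```python
# Sketch: print accented letters on an ASCII-only printer by
# overstriking, i.e. accent + BACKSPACE + base letter.

BACKSPACE = "\b"

# Illustrative subset: Latin 1 character -> (ASCII accent, base letter)
OVERSTRIKES = {
    "\xe4": ('"', "a"),  # a-umlaut
    "\xf6": ('"', "o"),  # o-umlaut
    "\xfc": ('"', "u"),  # u-umlaut
    "\xe9": ("'", "e"),  # e-acute
    "\xe8": ("`", "e"),  # e-grave
}


def to_overstrike(text: str) -> str:
    """Expand known accented characters; leave everything else as is."""
    out = []
    for ch in text:
        if ch in OVERSTRIKES:
            accent, base = OVERSTRIKES[ch]
            out.append(accent + BACKSPACE + base)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # repr() makes the embedded backspace visible as \x08
    print(repr(to_overstrike("K\xe4se")))
```

A real printer filter would also need a policy for characters that are
missing from the table, which this sketch passes through untouched.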
>
>Footnote: For more information on character translation and the
>'recode' tool, see section 5.
>
>
>
>7. TeX and ISO 8859-1
>If you want to write TeX without having to type {\"a}-style escape
>sequences, you can either get a TeX version configured to read 8-bit
>ISO characters, or you can translate between ISO and TeX codings.
>
>The latter is arduous if done by hand, but can be automated if you use
>emacs. If you use Emacs 19.23 or higher, simply add the following line
>to your .emacs startup file. This mode will perform the necessary
>translations for you automatically:
>------------------
>(require 'iso-cvt)
>------------------
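
For those not using emacs, the translation between ISO and TeX codings
described above amounts to a table lookup in each direction.  The
Python sketch below shows the idea with a handful of characters; it is
not the table used by iso-cvt, and the names TEX_ESCAPES, iso_to_tex
and tex_to_iso are hypothetical.

```python
# Sketch: translate between ISO 8859-1 characters and {\"a}-style
# TeX escape sequences.  The table is a small illustrative subset.

TEX_ESCAPES = {
    "\xe4": '{\\"a}',  # a-umlaut
    "\xf6": '{\\"o}',  # o-umlaut
    "\xfc": '{\\"u}',  # u-umlaut
    "\xdf": "{\\ss}",  # sharp s
    "\xe9": "{\\'e}",  # e-acute
    "\xe8": "{\\`e}",  # e-grave
}


def iso_to_tex(text: str) -> str:
    """Replace each known ISO character by its TeX escape sequence."""
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text)


def tex_to_iso(text: str) -> str:
    """Replace each known TeX escape sequence by its ISO character."""
    for ch, escape in TEX_ESCAPES.items():
        text = text.replace(escape, ch)
    return text


if __name__ == "__main__":
    print(iso_to_tex("sch\xf6n"))
    print(tex_to_iso(iso_to_tex("Stra\xdfe")) == "Stra\xdfe")
```

The reverse direction only recognizes the braced spelling used in the
table; TeX accepts other spellings of the same accent, which a complete
converter would have to handle.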
>
>If you are using pre-19.23 versions of emacs, get the "gm-lingo.el"
>lisp file via URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit.  Load
>gm-lingo from your .emacs startup file and this mode will perform the
>necessary translations for you automatically.
>
>If you want to configure TeX to read 8 bit characters, check out the
>configuration files available in URL
>ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit.
>
>In LaTeX 2.09 (or earlier), use the isolatin or isolatin1 styles to
>include support for ISO latin1 characters.  Use the following
>documentstyle definition:
>\documentstyle[isolatin]{article}
>
>isolatin.sty and isolatin1 are available from all CTAN servers and
>from URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit. (The isolatin1
>version on vlsivie is more complete than the one on CTAN servers.)
>
>There are several possibilities in LaTeX 2e to provide comprehensive
>support for 8 bit characters:
>
>The preferred method is to use the inputenc package with the latin1
>option.  Use the following package invocation to achieve this:
>\usepackage[latin1]{inputenc}
>
>The inputenc package should be the first package to be included in the
>document.  For a more detailed discussion, check out URL
>ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/latex2e.ps (in German).
>
>Alternatively, the styles used for earlier LaTeX versions (see above)
>can also be used with 2e.  To do this, use the commands:
>\documentclass{article}
>\usepackage{isolatin}
>
>
>You can also get the latex-mode to handle opening and closing quotes
>correctly for your language.  This can be achieved by defining the
>emacs variables 'tex-open-quote' and 'tex-close-quote'.  You can
>either set these variables in your ~/.emacs startup file or as a
>buffer-local variable in your TeX file if you want to define quotes on
>a per-file basis.
>
>For German TeX quotes, use:
>-----------
>(setq tex-open-quote "\"`")
>(setq tex-close-quote "'\"")
>-----------
>
>If you want to use French quotes (guillemets), use:
>-----------
>(setq tex-open-quote "«")
>(setq tex-close-quote "»")
>-----------
>
>BibTeX has some problems with 8 bit characters, esp. when they are
>used as keys.  BibTeX 1.0, when it eventually comes out (most likely
>some time in 1996), will support 8-bit characters.
>
>
>
>8. ISO 8859-1 and emacs
>Emacs 19 (as opposed to Emacs 18) can automatically handle 8 bit
>characters. (If you have a choice, upgrade to Emacs version 19.23,
>which has the most complete ISO support.)  Emacs 19 has extensive
>support for ISO 8859-1. If your display supports ISO 8859-1 encoded
>characters, add the following line to your .emacs startup file:
>-----------------------------
>(standard-display-european t)
>-----------------------------
>
>If you want to display ISO-8859-1 encoded files by using TeX-like
>escape sequences (e.g. if your terminal supports only ASCII
>characters), you should add the following line to your .emacs file
>(DON'T DO THIS IF YOUR TERMINAL SUPPORTS ISO OR SOME OTHER ENCODING OF
>NATIONAL CHARACTERS):
>--------------------
>(require 'iso-ascii)
>--------------------
>
>If your terminal supports a non-ISO 8859-1 encoding of national
>characters (e.g. 7 bit national variant ISO 646 character sets,
>aka. `national ASCII' variants), you should configure your own display
>table.  The standard emacs distribution contains a configuration
>(iso-swed.el) for terminals which have ASCII in the G0 set and a
>Swedish/Finnish version of ISO 646 in the G1 set.  If you want to
>create your own display table configuration, take a look at this
>sample configuration and at disp-table.el for available support
>functions.
>
>
>Emacs can also accept 8 bit ISO 8859-1 characters as input. These
>character codes might either come from a national keyboard (and
>driver) which generates ISO-compliant codes, or may have been entered
>by use of a COMPOSE-character mechanism.
>If you use such an input format, execute the following expression in
>your .emacs startup file to enable Emacs to understand them:
>-------------------------------------------------
>     (set-input-mode (car (current-input-mode))
>                     (nth 1 (current-input-mode))
>                     0)
>-------------------------------------------------
>
>In order to configure emacs to handle commands operating on words
>properly (such as `beginning of word', etc.), you should also add the
>following line to your .emacs startup file:
>-------------------------------
>(require 'iso-syntax)
>-------------------------------
>
>
>For further information on using ISO 8859-1 with emacs, also see the
>Emacs manual section on "European Display" (available as hypertext
>document by typing C-h i in emacs or as a printed version).
>
>
>If you need to edit text in a non-European language (Arabic, Chinese,
>Cyrillic-based languages, Ethiopic, Korean, Thai, Vietnamese, etc.),
>MULE (URL ftp://etlport.etl.go.jp/pub/mule) is a Multilingual
>Enhancement to GNU Emacs which supports these languages.
>
>
>
>9. Typing ISO with US-style keyboards.
>Many computer users use US-ASCII keyboards, which do not have keys for
>national characters.  You can use escape sequences to enter these
>characters.  For ASCII terminals (or PCs), check the documentation of
>your terminal for particulars.
>
>
>9.1 US-keyboards under X11
>Under X Windows, the COMPOSE multi-language support key can be used to
>enter accented characters.  Thus, when running X11 on a SunOS-based
>computer (or any other X11R4 or X11R5 server supporting COMPOSE
>characters), you can type three character sequences such as
>COMPOSE " a -> ä
>COMPOSE s s -> ß
>COMPOSE ` e -> è
>to type accented characters.
>
>Note that this COMPOSE capability has been removed as of X11R6,
>because it does not adequately support all the languages in the world.
>Instead, compose processing is supposed to be performed in the client
>using an `input method', a mechanism which has been available since
>X11R5.  (In the short term, this is a step backward for European
>users, as few clients support this type of processing at the moment.
>It is unfortunate that the X Consortium did not implement a mechanism
>which allows for a smoother transition.  Even the xterm terminal
>emulator supplied by the X Consortium itself does not yet support this
>mechanism!)
>
>Input methods are controlled by the locale environment variables (LANG
>and LC_xxx).  The values for these variables are (or at least should
>be made so by any sane vendor) equivalent to those expected by the
>ANSI/POSIX locale library.  For a list of possible settings see
>section 3.
>
>
>
>9.2 US-keyboards and emacs
>There are several modes to enter Umlaut characters under emacs when
>using a US-style keyboard.  One such mode is iso-transl, which is
>distributed with the standard emacs distribution.  This mode uses the
>Alt-key for entering diacritical marks (accents et al.).  An extended
>iso-transl mode (iso-transl+) which allows the definition of language
>specific short cuts is available as URL
>ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/iso-transl+.shar.  This file
>also includes sample configurations for the German and Spanish
>languages.
>
>An alternative to using Alt-sequences for entering diacritical marks
>is the use of `electric accents', such as used on old typewriters or
>under many MS Windows programs.  With this method, typing an accent
>character will place this accent on the next character entered.  One
>mode which supports this entry method is the iso-acc minor mode which
>comes with the standard e                                                                                    u