Topic: Universal character names -- I'm still confused
Author: usenet@nmhq.net (Niklas Matthies)
Date: Fri, 31 Oct 2003 17:29:15 +0000 (UTC)
On 2003-10-29 06:25, Jamie Allsop wrote:
> [snip]
>> I suppose you are not aware that Asian languages are actually most
>> commonly entered using roman letters on a QWERTY keyboard layout.
>> They don't seem to feel particularly uneasy about it. :)
>> With thousands (Japanese) to ten-thousands (Chinese) of glyphs, they
>> also don't really have much of a choice. The only practical method
>> is phonetic input, i.e. based on the pronunciation, for which roman
>> letters are quite adequate.
> [snip]
>
> No, roman letters are most definitely *not* quite adequate. Which is why
> there are numerous different phonetic input methods. And no, they are
> not mostly entered using qwerty and roman (in the sense you mean), and
> yes, they do feel uneasy about trying to use an inadequate system to
> input their native language. They do indeed generally use a keyboard
> with the same physical layout as a qwerty keyboard, probably with a US
> layout (which is different from the physical layout of my UK keyboard,
> and sufficiently different to make using a US keyboard a real pain),
> but not for roman input; the same keys can be used to choose phonetic
> symbols (which not everyone will memorise, just like not everyone
> memorises the qwerty layout and all the physical layout options based
> on it for different locations - I suppose that is why we still buy
> keyboards with letters on them). You are probably thinking of PinYin
> when you refer to roman input for (mainly) simplified Chinese as used
> in China. PinYin, while usable, is not suited to real speed typing and
> is also not an accurate enough mapping for everyday use (though in the
> absence of an alternative, it is better than nothing).
Actually, most of my knowledge comes from Japanese, which admittedly
biased my thinking here. My main point in the discussion, though, was
that Chinese characters are, by necessity, not entered directly, but
through a phonetic (or sometimes structural) code, with the effect
that entering a single Chinese character--as opposed to an English
word consisting of several characters--isn't generally easier just
because it's "just a single character".
Or, more generally speaking: The greater the character repertoire, the
more complicated it becomes on average to enter a character, since
keyboards and the number of fingers on one's hand don't grow with the
repertoire. Thus, while having a greater character repertoire can
augment expressivity, it also comes at a cost.
-- Niklas Matthies
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: metadata@sbcglobal.net (Matt Seitz)
Date: Fri, 31 Oct 2003 21:45:16 +0000 (UTC)
Niklas Matthies wrote:
> My main point in the discussion, though, was
> that Chinese characters are, by necessity, not entered directly, but
> through a phonetic (or sometimes structural) code, with the effect
> that entering a single Chinese character--as opposed to an English
> word consisting of several characters--isn't generally easier just
> because it's "just a single character".
Here is another way of thinking of it. One could just as easily say "English
words are, by necessity, not entered directly, but through a phonetic code (the
English alphabet)". So whether you are discussing Chinese or English, entering
a word requires multiple keystrokes.
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: usenet@nmhq.net (Niklas Matthies)
Date: Sat, 1 Nov 2003 08:26:15 +0000 (UTC)
On 2003-10-31 21:45, Matt Seitz wrote:
> Niklas Matthies wrote:
>> My main point in the discussion, though, was
>> that Chinese characters are, by necessity, not entered directly, but
>> through a phonetic (or sometimes structural) code, with the effect
>> that entering a single Chinese character--as opposed to an English
>> word consisting of several characters--isn't generally easier just
>> because it's "just a single character".
>
> Here is another way of thinking of it. One could just as easily say
> "English words are, by necessity, not entered directly, but through
> a phonetic code (the English alphabet)".
Well, with the difference that with Roman characters, the "phonetic
code" actually becomes part of the source code: the identifiers _are_
the "phonetic code", not the word the code represents. With Chinese
characters, by contrast, the phonetic code is merely an intermediate
representation during input and is not relevant to the source code
produced: the identifiers are the characters expressed through the
(usually ambiguous) phonetic code, neither the phonetic code itself
nor the word the characters represent. This means there is an extra
level of indirection during input.
Hence with Roman characters, identifiers are entered rather directly
(I'm saying "rather" because of shift keys and such), which is not the
case with Chinese characters.
> So whether you are discussing Chinese or English, entering a word
> requires multiple keystrokes.
Pretty close to my point. The fact that Chinese identifiers (or
mathematical-symbol identifiers) need fewer glyphs than English-word
identifiers generally doesn't mean that typing them is easier.
I'm not saying that this indicates that there's no benefit whatsoever
in using non-Roman-character identifiers. Not at all. I'm saying that
having a richer character repertoire comes at the expense of having a
more complicated input method, and that this is one fact that ought to
be considered in the cost-benefit analysis of actually using a richer
character repertoire.
If people are already switching keyboards because the position of
characters like '{' is too awkward on their original keyboard, then
there's IMHO a legitimate question of whether there'll really be wide
acceptance of entering characters that aren't generally available on
keyboards at all (say, the gamma symbol), as opposed to using
character sequences like 'gamma'.
One could argue that those who want such symbols can use an editor-
level translation mechanism (similar to WYSIWYG TeX editing, for
example), and that source code as seen by the compiler need not be
bothered with it. While this is not exactly my position, I think that
those who hold it have a point here.
-- Niklas Matthies
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Sat, 1 Nov 2003 23:28:59 +0000 (UTC)
Niklas Matthies wrote:
> On 2003-10-31 21:45, Matt Seitz wrote:
>
>>Niklas Matthies wrote:
>>
>>>My main point in the discussion, though, was
>>>that Chinese characters are, by necessity, not entered directly, but
>>>through a phonetic (or sometimes structural) code, with the effect
>>>that entering a single Chinese character--as opposed to an English
>>>word consisting of several characters--isn't generally easier just
>>>because it's "just a single character".
>>
>>Here is another way of thinking of it. One could just as easily say
>>"English words are, by necessity, not entered directly, but through
>>a phonetic code (the English alphabet)".
>
>
> Well, with the difference that with Roman characters, the "phonetic
> code" actually becomes part of the source code: the identifiers _are_
> the "phonetic code", not the word the code represents. With Chinese
> characters, by contrast, the phonetic code is merely an intermediate
> representation during input and is not relevant to the source code
> produced: the identifiers are the characters expressed through the
> (usually ambiguous) phonetic code, neither the phonetic code itself
> nor the word the characters represent. This means there is an extra
> level of indirection during input.
>
> Hence with Roman characters, identifiers are entered rather directly
> (I'm saying "rather" because of shift keys and such), which is not the
> case with Chinese characters.
Well, yes, but how relevant is that to the point in question? It is no
surprise that the keyboard as it exists today is best suited to enter
text based on the roman alphabet. That's what it was invented for. Hence
the "uneasiness" when entering non-roman text. The fact that the
keyboard dominates as a text-entering tool is another manifestation of
the current cultural predominance of "western" technology. It doesn't
have to be that way. There are other text-entering methods. Arabic and
Chinese are examples of "calligraphic" scripts that are better matched
to handwriting than to typing. Similarly, cuneiform script is better
matched to carving in clay. And isn't dictation the most natural
"phonetic" input method? I wouldn't be surprised if - with technological
advances in handwriting recognition technology - Chinese writers (and
programmers?) eventually found themselves more comfortable entering text
with a pen and a tablet instead of the keyboard.
Imagine entering source code with a pen and tablet instead of a
keyboard. What would your requirements for a programming language be
under such circumstances? For starters, entering a "gamma" would be a
piece of cake!
We're not there yet. C++ is keyboard oriented. But with UCNs we could
dip our toes in just a little bit.
>>So whether you are discussing Chinese or English, entering a word
>>requires multiple keystrokes.
>
>
> Pretty close to my point. The fact that Chinese identifiers (or
> mathematical-symbol identifiers) need fewer glyphs than English-word
> identifiers generally doesn't mean that typing them is easier.
I agree completely. Again, typing efficiency isn't high on my priority list.
> I'm not saying that this indicates that there's no benefit whatsoever
> in using non-Roman-character identifiers. Not at all. I'm saying that
> having a richer character repertoire comes at the expense of having a
> more complicated input method, and that this is one fact that ought to
> be considered in the cost-benefit analysis of actually using a richer
> character repertoire.
You're completely right. But this analysis should be done by the
programmer and not the language designer.
> If people are already switching keyboards because the position of
> characters like '{' is too awkward on their original keyboard, then
> there's IMHO a legitimate question of whether there'll really be wide
> acceptance of entering characters that aren't generally available on
> keyboards at all (say, the gamma symbol), as opposed to using
> character sequences like 'gamma'.
The acceptance depends on the cost-benefit analysis you mentioned above.
If I use special mathematical symbols in identifiers of a library that I
write, it (presumably) will have an effect on the acceptance of the
library. Maybe it is an overall positive effect, maybe a negative one.
People's opinions will differ, but at least I have the choice as a
developer. In the case of the curly brackets, I haven't got a choice as
long as I use C++ (unless I use trigraphs, which are a lot worse). Curly
brackets are a part of C++, so they belong to a different category than
identifiers written using special symbols.
> One could argue that those who want such symbols can use an editor-
> level translation mechanism (similar to WYSIWYG TeX editing, for
> example), and that source code as seen by the compiler need not be
> bothered with it. While this is not exactly my position, I think that
> those who hold it have a point here.
Doesn't the UCN notation define just such a translation mechanism? The
point is that a reasonable source code editor can put the right glyph on
the screen automatically, so that the mechanics are hidden from the
user. If I understand it correctly, UCNs are meant as a means to map
those special symbols to the character set a C++ compiler understands
naturally. This is important for transporting source code transparently
between systems. The programmer shouldn't be bothered with those
technicalities.
Cheers
Stefan
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.fr
Date: Sun, 19 Oct 2003 19:24:48 +0000 (UTC)
stefan_heinzmann@yahoo.com (Stefan Heinzmann) wrote in message
news:<bme4m3$dvi$05$1@news.t-online.com>...
> I'm just reading through the C standard book from Wiley in
> anticipation of the C++ standard book. I understand that C and C++ are
> supposed to handle universal character names in the same way. And I
> would appreciate if someone could explain to me how that is intended
> to be used in practice.
How they were intended to be used, or how things are actually working
out? For all intents and purposes, I think they are basically in the
same situation as trigraphs.
> For example, if I want to include German umlauts in identifiers, I
> actually want them to appear in print and on screen as the proper
> glyph and not as \u1234.
Of course you do. And I believe that this was also the intent of the
original proposal. That the development environment would understand
them, and treat them correctly.
That was also the original intent with regards to trigraphs. For the
moment, the actual support for this seems to be at about the same level
as it is for trigraphs as well.
The standard also says that an implementation may accept additional,
implementation defined characters, which it maps to the correct
universal character name. If you are working in an ISO 8859-1
environment, there is a good chance that if the compiler accepts
universal character names, it will actually handle ISO 8859-1
correctly. This isn't an optimal situation, since if you transfer your
programs to a non ISO 8859-1 area (say the Czech Republic or Poland),
then 1) they will look decidedly funny when displayed there, and
2) they may not even compile. (Actually, if all you use are the
umlauts, I don't think that compilation will be a problem, regardless of
which 8859-n codeset is installed locally.)
> This uglification may be appropriate for the compiler for parsing, but
> certainly not for human reading. Is my editor supposed to do the
> conversion?
I think that this was the intent.
> Or is the language implementation supposed to provide a preprocessor
> (pre-preprocessor) to convert non-ASCII characters to UCNs?
A compiler is supposed to do so, see 2.1/1. However, the standard says
nothing about how the compiler decides what codeset the source code
actually uses -- and none of the compilers I know actually document
anything about this either. So you're sort of stuck; g++ (3.3.1) seems
to refuse anything but straight ASCII, regardless of the externally set
locale.
> The standard may not mandate any particular way, but surely people
> must have an idea of what kind of support should be provided, or else
> what is the point of allowing UCNs in identifiers?
I'm not sure myself. I would expect compilers to accept files in many
different codesets, but I'm not too sure as to how this should be
handled; the codeset must depend on the file, and may vary between
include files in a single translation unit, which means that most
classical means of specifying this sort of stuff need extending. (A
global command line option is probably not too useful.)
For the moment, support seems to be about nil, and as you seem to have
noticed, universal character names are pretty much unusable today.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Sun, 19 Oct 2003 19:28:26 +0000 (UTC)
Hyman Rosen wrote:
> Stefan Heinzmann wrote:
>
>> The standard may not mandate any particular way, but surely people
>> must have an idea of what kind of support should be provided, or else
>> what is the point of allowing UCNs in identifiers?
>
>
> Read 2.1/1. The translation of physical source file characters
> to the source character set is implementation-defined. Source
> file characters outside of the basic set are translated to the
> universal character name equivalent. (Logically, of course. The
> implementation is free to represent things any way it wants.)
Let me quote the mentioned paragraph from my pdf copy of the C++ standard:
"Physical source file characters are mapped, in an
implementation-defined manner, to the basic source
character set (introducing newline characters for end-of-line
indicators) if necessary. Trigraph sequences (2.3) are replaced by
corresponding single-character internal representations. Any source file
character not in the basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a universal-character-name
(i.e. using the \uXXXX notation), are handled equivalently.)"
I'm not yet sure I understand this right. Does this mean that when I've
got an identifier with a german umlaut in it:
o The compiler has to map the umlaut to an internal representation that
is the same as the representation it would use if I had written the
umlaut in \uXXXX notation.
o The compiler is free to choose how to do the conversion.
o The compiler is free to choose an internal representation.
o The compiler is not allowed to ignore or otherwise choke on the
umlaut, as it has to do the mapping (or does the latitude go as far as
allowing the compiler to behave in any silly way it likes when
encountering a character that isn't in the base character set?)
> So it's up to your compiler vendor to decide what kind of source
> file encodings it understands, and up to your editor to decide
> how these encodings are displayed. The standard doesn't care, and
> chooses to explain how things work in terms of UCNs.
Well, MS Visual Studio 7.1 does display an umlaut in identifiers without
problems, but the compiler chokes on it. If I use the \uXXXX notation,
identifiers are accepted by the compiler, but when they occur in
a warning message, the \uXXXX is replaced by a question mark.
Is there some consensus what a "reasonable" level of support for UCNs
would be for a compiler? Is there an example of a compiler (and/or IDE)
that can serve as a role model?
Cheers
Stefan
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@alex.gabi-soft.fr (James Kanze)
Date: Mon, 20 Oct 2003 04:46:52 +0000 (UTC)
rridge@csclub.uwaterloo.ca (Ross Ridge) writes:
|> <kanze@gabi-soft.fr> wrote:
|> >In practice, UCN's are about as useful as trigraphs for writing
|> >readable, portable programs. Which is a shame, because they could
|> >be really useful -- C++ has done its part, and both C and Java have
|> >followed, which sounds pretty much like a de facto standard to me.
|> That makes it anything but a "de facto" standard. "In fact" there
|> is no standard, because "in fact" no one uses it, and "in fact" no
|> one who doesn't have to, like C/C++/Java third party tools, supports
|> it. It's a standard that only exists as words in a document.
I guess it depends on what you are talking about. It is a de facto
standard (among those specifying languages) to standardize international
characters in character names by means of UCN's. It is also a de facto
standard among those using the languages to ignore such, since there is
an apparent de facto standard among tool providers to do as little as
possible to support them.
This is a shame, because all things considered, UCN's aren't a bad
idea. Or wouldn't be, with correct support. (Of course, the same thing
could have been said about trigraphs in their time. Good ideas without
any good support don't get very far.)
|> And C++ took the idea from Java, ignoring the fact existing practice
|> in Java had already shown UCNs in identifiers to be as useful as
|> trigraphs for writing readable, portable programs.
I'm not sure. Trigraphs were present, as far as I can remember, in the
very first C++ drafts that I saw (around 1993), before Java appeared.
--
James Kanze mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France +33 1 41 89 80 93
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@alex.gabi-soft.fr (James Kanze)
Date: Mon, 20 Oct 2003 04:47:07 +0000 (UTC)
stefan_heinzmann@yahoo.com (Stefan Heinzmann) writes:
|> Hyman Rosen wrote:
|> > Stefan Heinzmann wrote:
|> >> The standard may not mandate any particular way, but surely
|> >> people must have an idea of what kind of support should be
|> >> provided, or else what is the point of allowing UCNs in
|> >> identifiers?
|> > Read 2.1/1. The translation of physical source file characters to
|> > the source character set is implementation-defined. Source file
|> > characters outside of the basic set are translated to the
|> > universal character name equivalent. (Logically, of course. The
|> > implementation is free to represent things any way it wants.)
|> Let me quote the mentioned paragraph from my pdf copy of the C++
|> standard: "Physical source file characters are mapped, in an
|> implementation-defined manner, to the basic source character set
|> (introducing newline characters for end-of-line indicators) if
|> necessary. Trigraph sequences (2.3) are replaced by corresponding
|> single-character internal representations. Any source file character
|> not in the basic source character set (2.2) is replaced by the
|> universal-character-name that designates that character. (An
|> implementation may use any internal encoding, so long as an actual
|> extended character encountered in the source file, and the same
|> extended character expressed in the source file as a
|> universal-character-name (i.e. using the \uXXXX notation), are
|> handled equivalently.)"
|> I'm not yet sure I understand this right. Does this mean that when
|> I've got an identifier with a german umlaut in it:
|> o The compiler has to map the umlaut to an internal representation
|> that is the same as the representation it would use if I had written
|> the umlaut in \uXXXX notation.
Yes. Except that the input encoding is implementation specific. All of
the compilers I know use US ASCII -- the US variant of ISO 646. So there
is no such thing as a German Umlaut in their input, no matter what you
see on the screen when viewing the file with other tools.
|> o The compiler is free how to do the conversion.
|> o The compiler is free to choose an internal representation
|> o The compiler is not allowed to ignore or otherwise choke on the
|> umlaut,
If the compiler "sees" an Umlaut in the input, it must treat as
specified. Neither Sun CC nor g++ are capable of seeing Umlauts in the
input.
Just out of curiosity, how do you expect the compiler to choose the
correct encoding for the file? From the local environment, so that it
interprets characters as you see them on the screen? If so, how do you
handle the case where the headers for one library are in UTF-8, and
those for another library are in ISO 8859-1? (This is not to excuse
laziness on the part of the compiler writers. It is just to point out
that the problem is perhaps not as simple as you think.)
|> as it has to do the mapping (or does the latitude go as far as
|> allowing the compiler to behave in any silly way it likes when
|> encountering a character that isn't in the base character set?)
|> > So it's up to your compiler vendor to decide what kind of source
|> > file encodings it understands, and up to your editor to decide how
|> > these encodings are displayed. The standard doesn't care, and
|> > chooses to explain how things work in terms of UCNs.
|> Well, MS Visual Studio 7.1 does display an umlaut in identifiers
|> without problems, but the compiler chokes on it. If I use the\uXXXX
|> notation, identifiers are accepted by the compiler, but when they're
|> occurring in a warning message, the \uXXXX is replaced by a question
|> mark.
Sounds like they're right up there with the Unix compilers :-).
|> Is there some consensus what a "reasonable" level of support for
|> UCNs would be for a compiler?
Not that I know of, and this might be the reason why compiler authors
are so hesitant to tackle the problem.
|> Is there an example of a compiler (and/or IDE) that can serve as a
|> role model?
--
James Kanze mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France +33 1 41 89 80 93
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: usenet@nmhq.net (Niklas Matthies)
Date: Mon, 20 Oct 2003 04:47:38 +0000 (UTC)
On 2003-10-19 19:24, kanze@gabi-soft.fr wrote:
> stefan_heinzmann@yahoo.com (Stefan Heinzmann) wrote in message
> news:<bme4m3$dvi$05$1@news.t-online.com>...
[...]
>> The standard may not mandate any particular way, but surely people
>> must have an idea of what kind of support should be provided, or else
>> what is the point of allowing UCNs in identifiers?
>
> I'm not sure myself. I would expect compilers to accept files in many
> different codesets, but I'm not too sure as to how this should be
> handled; the codeset must depend on the file, and may vary between
> include files in a single translation unit, which means that most
> classical means of specifying this sort of stuff need extending. (A
> global command line option is probably not too useful.)
>
> For the moment, support seems to be about nil, and as you seem to have
> noticed, universal character names are pretty much unusable today.
It should be rather trivial to have editors like Emacs or Vim apply
appropriate conversion filters upon reading and writing of source
files. I have seen this being done for HTML character entities,
which are not very different from UCNs, apart from the concrete
syntax.
Java IDEs also routinely support UTF-8 source files, so it shouldn't
be too difficult for C++ IDEs to do likewise.
And with regard to the source file inclusion problem, one solution
would be to use Apache-like filename patterns (e.g. "MyHeader.utf8.h").
My impression is that the lack of support in C++ tools today simply
comes from lack of demand from developers.
-- Niklas Matthies
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: kanze@gabi-soft.fr
Date: Mon, 20 Oct 2003 22:40:16 +0000 (UTC)
usenet@nmhq.net (Niklas Matthies) wrote in message
news:<slrnbp6jra.1kp3.usenet@nmhq.net>...
> On 2003-10-19 19:24, kanze@gabi-soft.fr wrote:
> > stefan_heinzmann@yahoo.com (Stefan Heinzmann) wrote in message
> > news:<bme4m3$dvi$05$1@news.t-online.com>...
> [...]
> >> The standard may not mandate any particular way, but surely people
> >> must have an idea of what kind of support should be provided, or
> >> else what is the point of allowing UCNs in identifiers?
> > I'm not sure myself. I would expect compilers to accept files in
> > many different codesets, but I'm not too sure as to how this should
> > be handled; the codeset must depend on the file, and may vary
> > between include files in a single translation unit, which means that
> > most classical means of specifying this sort of stuff need
> > extending. (A global command line option is probably not too
> > useful.)
> > For the moment, support seems to be about nil, and as you seem to
> > have noticed, universal character names are pretty much unusable
> > today.
> It should be rather trivial to have editors like Emacs or Vim apply
> appropriate conversion filters upon reading and writing of source
> files. I have seen this being done for HTML character entities, which
> are not very different from UCNs, apart from the concrete syntax.
I don't know about trivial, but it certainly should be possible. At
least, supposing that these editors either work internally with 32 bit
(or at least 21 bit) characters, or support full UTF-8. (From the
little I've scanned over in the emacs manual, it is limited to 19 bit
characters, and it seems to be using some special encoding of its own
internally.)
> Java IDEs also routinely support UTF-8 source files, so it shouldn't
> be too difficult for C++ IDEs to do likewise.
Yes, but who has a UTF-8 editor?
> And with regard to the the source file inclusion problem, one solution
> would be to use Apache-like filename patterns
> (e.g. "MyHeader.utf8.h").
One solution, perhaps, but certainly not the only one.
> My impression is that the lack of support in C++ tools today simply
> comes from lack of demand from developers.
Well, the start of this thread was precisely a poster who wanted it.
Demand is relative. When most of us are still having problems getting
templates to work correctly, such issues may seem secondary.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Tue, 21 Oct 2003 18:16:49 +0000 (UTC)
kanze@gabi-soft.fr wrote:
[...]
>>My impression is that the lack of support in C++ tools today simply
>>comes from lack of demand from developers.
>
>
> Well, the start of this thread was precisely a poster who wanted it.
>
> Demand is relative. When most of us are still having problems getting
> templates to work correctly, such issues may seem secondary.
I didn't say I wanted it. In fact, I'm still unsure whether I want it or
not; that certainly would depend on the amount of support I can expect
not just from a single compiler vendor, but across the industry. I
wanted to find out what the intentions of the authors of the holy
standard were and whether it was likely that this would come about. From
your answers I'm pessimistic.
Most projects I work with actually keep the source code in English, as a
lingua franca for computer science, because who knows who will read the
code...
But it always seemed like cultural imperialism to me to actually
*require* the use of English through choice (read: restriction) of
character set (and, by the way, also through reserved words). Why should
someone have to learn English before learning to program? (It sure is a
good idea to learn it, but should it be a requirement?)
And, programming in foreign languages aside, why should I not use the
uppercase greek Gamma symbol for the Gamma function? That's what all the
mathematicians do, anyway. We've got Unicode now, and virtually all
reasonable text processing systems support it. Yet in programming we're
still in ASCII times. Odd, isn't it?
There is a definite trend in C++ to use the language features to
implement domain specific languages. See the Spirit library in boost for
an example. Someone said once that library design is language design. Am
I the only one who thinks that restricting source code to the characters
available in plain ASCII is a hindrance for domain specific languages?
Mathematics seems to cope much better with this problem. Its notation
is far less language specific, because more operators and special
symbols are used. If only programming could be like that! Think about
all those extra operator symbols in Unicode...
Cheers
Stefan
---
Author: usenet@nmhq.net (Niklas Matthies)
Date: Fri, 24 Oct 2003 05:14:24 +0000 (UTC) Raw View
On 2003-10-22 15:50, Stefan Heinzmann wrote:
> Niklas Matthies wrote:
[...]
> But, seriously, the keyboard question is as much an argument in
> favour of having special symbols as it is against. That's what I
> wanted to hint at with the "cultural imperialism" argument. If you
> are programming, you're virtually forced to use a "roman" keyboard.
> If you're programming C++ (or C or Java or...) you're actually best
> off to use a US keyboard layout. I do, although I'm a German. The
> curly braces are just too awkward to reach on the German keyboard.
I happen to be a German too using US keyboards for that same reason. :)
> So your argument is quite right if you are already using the "right"
> keyboard. If not, you might see it from a different angle.
I don't really agree. Or at least, this wasn't my point.
Whatever keyboard layout you use, the keyboard will only have so many
keys, and humans only ten fingers with limited reach, and the average
human brain can only remember so many positions and multiple key
assignments. Hence there's an inherent limit on the number of symbols
that can be efficiently typed.
When looking at a few Unicode symbol tables, it should become clear
that only a rather restricted subset can fit onto an efficient-to-use
keyboard layout. This means that even when one has the freedom to use
arbitrary Unicode symbols in source code, only a small set will be
practical to use on any given keyboard layout. So the benefit of that
freedom will be somewhat limited. This doesn't mean that the freedom
should not be given, only that it might buy less than might have been
expected.
The second reason is that, in today's globalized programming world,
it makes a lot of sense for source code to be written in one common
language, so that every C++ programmer can readily understand any C++
source code, regardless of the nationality of the C++ programmer who
wrote it.
(Sometime around 1995, Microsoft tried to provide localized versions
of VBA to (presumably) make non-english programmers feel more
comfortable. I remember writing an Excel application using German
keywords then. They dropped this with the next version, probably
realizing that it caused more problems than it solved.)
For this second reason, I don't perceive my US keyboards as US
keyboards, but rather as C++ keyboards (or, really, programming
keyboards), and it makes sense to me that, just as you have to get
acquainted with a particular keyboard layout or input method when you
want to type French or Russian (or whatever), you have to do the same
thing when wanting to type source code in some programming language.
> Of course, keywords in a language are the next interesting topic.
> Think about how a Chinese might see the situation. Imagine this:
> Say, you're English and work with a Chinese programming language.
> Each keyword is a Chinese glyph, which you have to enter using your
> English keyboard. Wouldn't you feel somewhat uneasy about this?
I suppose you are not aware that Asian languages are actually most
commonly entered using roman letters on a QWERTY keyboard layout.
They don't seem to feel particularly uneasy about it. :)
With thousands (Japanese) to ten-thousands (Chinese) of glyphs, they
also don't really have much of a choice. The only practical method
is phonetic input, i.e. based on the pronunciation, for which roman
letters are quite adequate.
In turn, this has the effect that entering a chinese glyph takes up
to four keypresses, and even some more when there is no context to
disambiguate the many homonyms. So you see that using chinese glyphs
in a programming language probably would not significantly increase
typing efficiency, if at all. And any such possible increase could
also be gained with english keywords by using corresponding keyboard
macros. So what remains is whether one prefers to have source code to
look more like english or more like chinese. (Most often, it looks
like neither. It simply looks like source code.)
-- Niklas Matthies
---
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Fri, 24 Oct 2003 21:16:35 +0000 (UTC) Raw View
Niklas Matthies wrote:
> On 2003-10-22 15:50, Stefan Heinzmann wrote:
>
>>Niklas Matthies wrote:
>
> [...]
>
>>But, seriously, the keyboard question is as much an argument in
>>favour of having special symbols as it is against. That's what I
>>wanted to hint at with the "cultural imperialism" argument. If you
>>are programming, you're virtually forced to use a "roman" keyboard.
>>If you're programming C++ (or C or Java or...) you're actually best
>>off to use a US keyboard layout. I do, although I'm a German. The
>>curly braces are just too awkward to reach on the German keyboard.
>
>
> I happen to be a German too using US keyboards for that same reason. :)
Ah, I thought you might be. Grüß Dich, Niklas!
>>So your argument is quite right if you are already using the "right"
>>keyboard. If not, you might see it from a different angle.
>
>
> I don't really agree. Or at least, this wasn't my point.
>
> Whatever keyboard layout you use, the keyboard will only have so many
> keys, and humans only ten fingers with limited reach, and the average
> human brain can only remember so many positions and multiple key
> assignments. Hence there's an inherent limit on the number of symbols
> than can be efficiently typed.
>
> When looking at a few Unicode symbol tables, it should become clear
> that only a rather restricted subset can fit onto an efficient-to-use
> keyboard layout. This means that even when one has the freedom to use
> arbitrary Unicode symbols in source code, only a small set will be
> practical to use on any given keyboard layout. So the benefit of that
> freedom will be somewhat limited. This doesn't mean that the freedom
> should not be given, only that it might buy less than might have been
> expected.
I'm quite aware of that. However, I don't value typing efficiency very
highly. In fact, I know a couple of programmers who type a lot faster
than they think. Copy and paste in particular is much too easy in my
opinion. Source code editors should require the words 'copy' and 'paste'
to be typed in and pop up a confirmation dialog box before doing it ;-)
In another reply, James Kanze seems to think along the same lines.
Source code should be composed for easy reading, not easy writing. This
is also the aim of domain specific languages. They can be seen as
syntactic sugar, but they can help tremendously in communicating the
intent and/or essence of the problem solution, and that's the ultimate
aim of programming, isn't it? Don't we all appreciate C++ because of its
flexibility in doing this?
Allowing special symbols adds to that expressive power IMHO, and thus is
desirable. As with any sharp tool, it can hurt you. Once proper tool
support for UCNs is available, people will use them, and overuse them,
until there's enough experience for common guidelines. Maybe someday
there will be an 'Effective UCN' book, who knows ;-)
> The second reason is that, in todays globalized programming world,
> it makes a lot of sense for source code to be written in one common
> language, so that every C++ programmer can readily understand any C++
> source code, regardless of the nationality of the C++ programmer who
> wrote it.
There are numerous cases where this argument is not the most important
one. So that is a decision that should not be pre-mandated by the
programming language. But don't get me wrong, I do in fact think that
software should be written in English unless there's a good reason
against. I just want to decide myself and not have it imposed on me by
the programming language.
> (Sometime around 1995, Microsoft tried to provide localized versions
> of VBA to (presumably) make non-english programmers feel more
> comfortable. I remember writing an Excel application using German
> keywords then. They dropped this with the next version, probably
> realizing that it caused more problems than it solved.)
That remains an issue even today. I use OpenOffice, and their
spreadsheet uses German function names (SUMME, RUNDEN, etc.). Yes, it
causes problems when you want to move spreadsheets between different
language environments. But it would cause other problems if you forced
people to use English. After all, spreadsheets are for ordinary computer
users, and if you require them to understand English you're erecting a
cultural barrier.
We've both learnt English well enough to get along nicely, so we tend to
underestimate the difficulties people have when they haven't got as far
yet.
> For this second reason, I don't perceive my US keyboards as US
> keyboards, but rather as C++ keyboards (or, really, programming
> keyboards), and it makes sense to me that, just as you have to get
> acquainted with a particular keyboard layout or input method when you
> want to type French or Russian (or whatever), you have to do the same
> thing when wanting to type source code in some programming language.
So you're comfortable using one type of keyboard for programming and
another for typing prose? I'm not. I'm typing source code and prose with
the same keyboard, because I find it difficult to switch my mind between
different keyboards ("where's that @$%# ampersand this time?"). It took
me a while to find out that there's a US-International layout that
allows me to type German, French and Spanish prose without the need to
memorize a different keyboard layout for each.
>>Of course, keywords in a language are the next interesting topic.
>>Think about how a Chinese might see the situation. Imagine this:
>>Say, you're English and work with a Chinese programming language.
>>Each keyword is a Chinese glyph, which you have to enter using your
>>English keyboard. Wouldn't you feel somewhat uneasy about this?
>
>
> I suppose you are not aware that Asian languages are actually most
> commonly entered using roman letters on a QWERTY keyboard layout.
> They don't seem to feel particularly uneasy about it. :)
> With thousands (Japanese) to ten-thousands (Chinese) of glyphs, they
> also don't really have much of a choice. The only practical method
> is phonetic input, i.e. based on the pronunciation, for which roman
> letters are quite adequate.
I don't claim any knowledge here, but I was always under the impression
that there are several different entry methods, not all of them based on
phonetics. As far as I know the phonetics differ considerably between
different parts of China even, so any input method based on phonetics
would have to depend on the region.
> In turn, this has the effect that entering a chinese glyph takes up
> to four keypresses, and even some more when there is no context to
> disambiguate the many homonyms. So you see that using chinese glyphs
> in a programming language probably would not significantly increase
> typing efficiency, if at all. And any such possible increase could
> also be gained with english keywords by using corresponding keyboard
> macros. So what remains is whether one prefers to have source code to
> look more like english or more like chinese. (Most often, it looks
> like neither. It simply looks like source code.)
Fair enough. But as I said, typing efficiency is not high on my priority
list. Reading efficiency is. Chinese characters in source code would
probably decrease my reading efficiency, but that is because I don't
know any chinese. The gamma symbol would probably increase it, since I
had some education in math. It would be just like Chinese for someone
who doesn't know what a Gamma function is. But that's ok, since she/he
wouldn't understand it either if it were written with ASCII characters.
Moral: Use the symbols that are appropriate for the given problem
domain. And I expect from a programming language that it doesn't stand
in the way unnecessarily. C++ is pretty good in this respect already,
but it could be better still.
Cheers
Stefan
---
Author: jamieallsop@NO_SPAMdtsonline.co.uk (Jamie Allsop)
Date: Wed, 29 Oct 2003 06:25:00 +0000 (UTC) Raw View
[snip]
> I suppose you are not aware that Asian languages are actually most
> commonly entered using roman letters on a QWERTY keyboard layout.
> They don't seem to feel particularly uneasy about it. :)
> With thousands (Japanese) to ten-thousands (Chinese) of glyphs, they
> also don't really have much of a choice. The only practical method
> is phonetic input, i.e. based on the pronunciation, for which roman
> letters are quite adequate.
[snip]
No, roman letters are most definitely *not* quite adequate. Which is why
there are numerous different phonetic input methods. And no, they are
not mostly entered using qwerty and roman (in the sense you mean), and
yes, they do feel uneasy about trying to use an inadequate system to
input their native language. They do indeed generally use a keyboard
with the physical layout of a qwerty, probably with a US layout (which is
different from the physical layout of my UK keyboard, and sufficiently
different to make using a US keyboard a real pain), but not for roman
input; the same keys can be used to choose phonetic symbols (which not
everyone will memorise, just like not everyone memorises the qwerty
layout and all the physical layout options based on it for different
locations - I suppose that is why we still buy keyboards with letters on
them). You are probably thinking of PinYin when you refer to roman input
for (mainly) simplified Chinese as used in China. PinYin, while usable,
is not for real speed typing, and is also not an accurate enough mapping
for everyday use (though in the absence of an alternative, better than
nothing).
Jamie
(Hi Stefan :) )
---
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Thu, 16 Oct 2003 02:21:54 +0000 (UTC) Raw View
Hi all,
I'm just reading through the C standard book from Wiley in anticipation
of the C++ standard book. I understand that C and C++ are supposed to
handle universal character names in the same way. And I would appreciate
it if someone could explain to me how that is intended to be used in
practice.
For example, if I want to include German umlauts (such as ä or ü) in
identifiers, I actually want them to appear in print and on screen as
the proper glyph and not as \u1234. This uglification may be appropriate
for the compiler for parsing, but certainly not for human reading. Is my
editor supposed to do the conversion? Or is the language implementation
supposed to provide a preprocessor (pre-preprocessor) to convert
non-ASCII characters to UCNs?
The standard may not mandate any particular way, but surely people must
have an idea of what kind of support should be provided, or else what is
the point of allowing UCNs in identifiers?
Cheers
Stefan
---
Author: rridge@csclub.uwaterloo.ca (Ross Ridge)
Date: Thu, 16 Oct 2003 20:07:15 +0000 (UTC) Raw View
Stefan Heinzmann <stefan_heinzmann@yahoo.com> wrote:
>For example, if I want to include German umlauts (such as ä or ü) in
>identifiers, I actually want them to appear in print and on screen as
>the proper glyph and not as \u1234. This uglification may be appropriate
>for the compiler for parsing, but certainly not for human reading. Is my
>editor supposed to do the conversion?
Yes. Ask the people who made your editor to add this feature; they're
undoubtedly under the impression that no one actually wants it.
>Or is the language implementation supposed to provide a preprocessor
>(pre-preprocessor) to convert non-ASCII characters to UCNs?
If this was required then there really wouldn't be any point to UCNs
in identifiers would there? It would be simpler to just require the
implementation to accept German umlauts in identifiers as is.
>The standard may not mandate any particular way, but surely people must
>have an idea of what kind of support should be provided, or else what is
>the point of allowing UCNs in identifiers?
Java had it, so C++ had to have it too.
Ross Ridge
--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rridge@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/u/rridge/
db //
---
Author: hyrosen@mail.com (Hyman Rosen)
Date: Thu, 16 Oct 2003 20:07:37 +0000 (UTC) Raw View
Stefan Heinzmann wrote:
> The standard may not mandate any particular way, but surely people must
> have an idea of what kind of support should be provided, or else what is
> the point of allowing UCNs in identifiers?
Read 2.1/1. The translation of physical source file characters
to the source character set is implementation-defined. Source
file characters outside of the basic set are translated to the
universal character name equivalent. (Logically, of course. The
implementation is free to represent things any way it wants.)
So it's up to your compiler vendor to decide what kind of source
file encodings it understands, and up to your editor to decide
how these encodings are displayed. The standard doesn't care, and
chooses to explain how things work in terms of UCNs.
---
Author: do-not-spam-benh@bwsint.com (Ben Hutchings)
Date: Thu, 16 Oct 2003 20:07:53 +0000 (UTC) Raw View
In article <bme4m3$dvi$05$1@news.t-online.com>, Stefan Heinzmann wrote:
<snip>
> For example, if I want to include German umlauts (such as ä or ü) in
> identifiers, I actually want them to appear in print and on screen as
> the proper glyph and not as \u1234.
When are you expecting identifiers to appear in print or on screen? Are
you using an extension like __FUNCTION__ or an assert() implementation
that shows function names?
> This uglification may be appropriate
> for the compiler for parsing, but certainly not for human reading. Is my
> editor supposed to do the conversion? Or is the language implementation
> supposed to provide a preprocessor (pre-preprocessor) to convert
> non-ASCII characters to UCNs?
<snip>
The language implementation should convert UCNs in *literals* into the
corresponding characters in the execution character set at translation
time. If there is no corresponding character to a UCN, the conversion
is implementation defined. (References: 2.13.2/5, 2.13.4/5.) It may
be necessary to tell your implementation that the execution character
set is what you intend it to be and not, say, ASCII, which has no
accented letters.
---
Author: kuyper@wizard.net (James Kuyper)
Date: Fri, 17 Oct 2003 04:12:23 +0000 (UTC) Raw View
stefan_heinzmann@yahoo.com (Stefan Heinzmann) wrote in message news:<bme4m3$dvi$05$1@news.t-online.com>...
...
> The standard may not mandate any particular way, but surely people must
> have an idea of what kind of support should be provided, or else what is
> the point of allowing UCNs in identifiers?
It was anticipated that some editors would be written to automatically
handle UCN's internally, so the user wouldn't have to think about
them. I've no idea whether any such editors have actually been
written.
---
Author: kanze@gabi-soft.fr
Date: Sat, 18 Oct 2003 05:29:10 +0000 (UTC) Raw View
hyrosen@mail.com (Hyman Rosen) wrote in message
news:<1066312453.893441@master.nyc.kbcfp.com>...
> Stefan Heinzmann wrote:
> > The standard may not mandate any particular way, but surely people
> > must have an idea of what kind of support should be provided, or
> > else what is the point of allowing UCNs in identifiers?
> Read 2.1/1. The translation of physical source file characters to the
> source character set is implementation-defined. Source file characters
> outside of the basic set are translated to the universal character
> name equivalent. (Logically, of course. The implementation is free to
> represent things any way it wants.)
> So it's up to your compiler vendor to decide what kind of source file
> encodings it understands, and up to your editor to decide how these
> encodings are displayed. The standard doesn't care, and chooses to
> explain how things work in terms of UCNs.
Except that if you want to be sure someone else can compile your files,
they will physically contain \u00E4, rather than an ä. Which isn't
really very practical unless your other tools display this as an ä.
In practice, UCN's are about as useful as trigraphs for writing
readable, portable programs. Which is a shame, because they could be
really useful -- C++ has done its part, and both C and Java have
followed, which sounds pretty much like a de facto standard to me.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
---
Author: rridge@csclub.uwaterloo.ca (Ross Ridge)
Date: Sat, 18 Oct 2003 16:51:02 +0000 (UTC) Raw View
<kanze@gabi-soft.fr> wrote:
>In practice, UCN's are about as useful as trigraphs for writing
>readable, portable programs. Which is a shame, because they could be
>really useful -- C++ has done its part, and both C and Java have
>followed, which sounds pretty much like a de facto standard to me.
That makes it anything but a "de facto" standard. "In fact" there is
no standard, because "in fact" no one uses it, and "in fact" no one
who doesn't have to, like C/C++/Java third party tools, supports it.
It's a standard that only exists as words in a document.
And C++ took the idea from Java, ignoring the fact that existing practice
in Java had already shown UCNs in identifiers to be as useful as trigraphs
for writing readable, portable programs.
Ross Ridge
--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rridge@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/u/rridge/
db //
---
Author: kanze@gabi-soft.fr
Date: Sat, 18 Oct 2003 16:51:27 +0000 (UTC) Raw View
rridge@csclub.uwaterloo.ca (Ross Ridge) wrote in message
news:<bmlrk4$2b6$1@rumours.uwaterloo.ca>...
> >The standard may not mandate any particular way, but surely people
> >must have an idea of what kind of support should be provided, or else
> >what is the point of allowing UCNs in identifiers?
> Java had it, so C++ had to have it too.
Except that C++ had it (at least in a draft standard) before Java came
along. In this case, I think that it is more C++ had it, so Java had to
have it as well.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
---
Author: usenet@nmhq.net (Niklas Matthies)
Date: Tue, 21 Oct 2003 18:46:23 +0000 (UTC) Raw View
On 2003-10-20 22:40, kanze@gabi-soft.fr wrote:
> usenet@nmhq.net (Niklas Matthies) wrote:
[...]
>> It should be rather trivial to have editors like Emacs or Vim apply
>> appropriate conversion filters upon reading and writing of source
>> files. I have seen this being done for HTML character entities, which
>> are not very different from UCNs, apart from the concrete syntax.
>
> I don't know about trivial, but it certainly should be possible. At
> least, supposing that these editors either work internally with 32 bit
> (or at least 21 bit) characters, or support full UTF-8. (From the
> little I've scanned over in the emacs manual, it is limited to 19 bit
> characters, and it seems to be using some special encoding of its own
> internally.)
I don't know emacs too well, so you may be right that it has some
limitations. Vim supports full utf-8/ucs-4 (and many other encodings,
see :help encoding-names), where you can set internal encoding, file
encoding and terminal encoding separately, and for conversions
anything that iconv() supports.
>> Java IDEs also routinely support UTF-8 source files, so it shouldn't
>> be too difficult for C++ IDEs to do likewise.
>
> Yes, but who has a UTF-8 editor?
Within Sun ONE Studio 5 (and probably also 4) I can enter Japanese in
a source file and successfully load, store and compile it (for example
using identifiers consisting of Japanese characters), given that I
choose an appropriate encoding (like utf-8) in the options for source
file encoding (editor and compiler) and select a font that can display
those characters. I suppose that other IDEs like e.g. Eclipse can do
likewise.
And using a stand-alone editor like Vim, everyone has a utf-8 editor.
>> And with regard to the source file inclusion problem, one solution
>> would be to use Apache-like filename patterns
>> (e.g. "MyHeader.utf8.h").
>
> One solution, perhaps, but certainly not the only.
My intent was to mention one solution that is already in wide use for
a very similar (if not virtually the same) problem. I didn't mean to
imply that it is the only one.
[...]
> Demand is relative. When most of us are still having problems getting
> templates to work correctly, such issues may seem secondary.
I suppose so.
-- Niklas Matthies
---
Author: usenet@nmhq.net (Niklas Matthies)
Date: Wed, 22 Oct 2003 01:49:02 +0000 (UTC) Raw View
On 2003-10-21 18:16, Stefan Heinzmann wrote:
> kanze@gabi-soft.fr wrote:
[...]
> And, programming in foreign languages aside, why should I not use
> the uppercase greek Gamma symbol for the Gamma function?
One reason is keyboard input. In general, pressing some funky key
combination for the Gamma symbol is not faster than typing "gamma".
Moreover, to be efficient at all, it requires memorizing how to
enter these special symbols (and once you start, there will probably
be dozens).
I'm actually in favor of having a richer symbol set for identifiers,
but you have to consider the limitations. While I _can_ type (I'm
picking a Latin1-compatible example here and hope the charset survives
moderation)
cout « (¾·x²);
without problems (thanks to a <Compose> key), and it would certainly
read nicer than
cout << (0.75*(x*x));
the latter is decidedly more convenient to enter than the former
(unless you start heavily custom-mapping your keyboard).
-- Niklas Matthies
---
Author: stefan_heinzmann@yahoo.com (Stefan Heinzmann)
Date: Wed, 22 Oct 2003 15:50:54 +0000 (UTC) Raw View
Niklas Matthies wrote:
> On 2003-10-21 18:16, Stefan Heinzmann wrote:
[...]
>>And, programming in foreign languages aside, why should I not use
>>the uppercase greek Gamma symbol for the Gamma function?
>
>
> One reason is keyboard input. In general, pressing some funky key
> combination for the Gamma symbol is not faster than typing "gamma".
> Moreover, to be efficient at all, it requires memorizing how to
> enter these special symbols (and once you start, there will probably
> be dozens).
>
> I'm actually in favor of having a richer symbol set for identifiers,
> but you have to consider the limitations. While I _can_ type (I'm
> picking a Latin1-compatible example here and hope the charset survives
> moderation)
>
> cout « (¾·x²);
>
> without problems (thanks to a <Compose> key), and it would certainly
> read nicer than
>
> cout << (0.75*(x*x));
>
> the latter is decidedly more convenient to enter than the former
> (unless you start heavily custom-mapping your keyboard).
I'm assuming then that you don't use a greek or cyrillic keyboard :-)
But, seriously, the keyboard question is as much an argument in favour
of having special symbols as it is against. That's what I wanted to hint
at with the "cultural imperialism" argument. If you are programming,
you're virtually forced to use a "roman" keyboard. If you're programming
C++ (or C or Java or...) you're actually best off to use a US keyboard
layout. I do, although I'm a German. The curly braces are just too
awkward to reach on the German keyboard.
So your argument is quite right if you are already using the "right"
keyboard. If not, you might see it from a different angle.
Of course, keywords in a language are the next interesting topic. Think
about how a Chinese might see the situation. Imagine this: Say, you're
English and work with a Chinese programming language. Each keyword is a
Chinese glyph, which you have to enter using your English keyboard.
Wouldn't you feel somewhat uneasy about this?
I'm aware that this would lead much farther than just supporting UCNs in
C++ properly. I'm not advocating using chinese glyphs for C++ keywords.
But I do want to advocate some non-biased thinking about programming for
non-native English speakers. I also like the notion of domain specific
languages being implemented through C++ libraries, and I would like to
have the support for this improved in C++.
Cheers
Stefan
Author: kanze@gabi-soft.fr
Date: Wed, 22 Oct 2003 17:44:33 +0000 (UTC)
usenet@nmhq.net (Niklas Matthies) wrote in message
news:<slrnbpaqme.1gv2.usenet@nmhq.net>...
> On 2003-10-20 22:40, kanze@gabi-soft.fr wrote:
> > usenet@nmhq.net (Niklas Matthies) wrote:
> [...]
> >> It should be rather trivial to have editors like Emacs or Vim apply
> >> appropriate conversion filters upon reading and writing of source
> >> files. I have seen this being done for HTML character entities,
> >> which are not very different from UCNs, apart from the concrete
> >> syntax.
> > I don't know about trivial, but it certainly should be possible. At
> > least, supposing that these editors either work internally with 32
> > bit (or at least 21 bit) characters, or support full UTF-8. (From
> > the little I've scanned over in the emacs manual, it is limited to
> > 19 bit characters, and it seems to be using some special encoding of
> > its own internally.)
> I don't know emacs too well, so you may be right that it has some
> limitations. Vim supports full utf-8/ucs-4 (and many other encodings,
> see :help encoding-names), where you can set internal encoding, file
encoding and terminal encoding separately, and for conversions
> anything that iconv() supports.
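A minimal sketch of the Vim settings being described (the option names are as documented under :help encoding-names; the specific values are only illustrative):

```vim
" Internal representation: full UTF-8/UCS-4 support
set encoding=utf-8
" Encoding used when reading/writing the current buffer's file
setlocal fileencoding=utf-8
" Encoding of the terminal Vim is displayed in
set termencoding=latin1
" Conversions between these go through iconv() where available
```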
My impressions are just the result of a quick scan, and could be wrong
with regards to Emacs.
> >> Java IDEs also routinely support UTF-8 source files, so it
> >> shouldn't be too difficult for C++ IDEs to do likewise.
> > Yes, but who has a UTF-8 editor?
> Within Sun ONE Studio 5 (and probably also 4) I can enter Japanese in
> a source file and successfully load, store and compile it (for example
> using identifiers consisting of Japanese characters), given that I
> choose an appropriate encoding (like utf-8) in the options for source
> file encoding (editor and compiler) and select a font that can display
> those characters. I suppose that other IDEs like e.g. Eclipse can do
> likewise.
That's interesting, because one of the compilers I use to test things is
Sun CC 5.1 -- and it gave compiler errors along the lines of "illegal
character" when I fed it a source file with an accented character in it,
even though everything in my environment says I'm working with ISO
8859-1. Maybe it always assumes UTF-8.
I'd like to experiment more with it, but for some reason, they've
deinstalled most of the locales (and the non-8859-1 fonts) on our
machines.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
Author: kanze@gabi-soft.fr
Date: Thu, 23 Oct 2003 06:18:06 +0000 (UTC)
usenet@nmhq.net (Niklas Matthies) wrote in message
news:<slrnbpb3fe.1pkh.usenet@nmhq.net>...
> On 2003-10-21 18:16, Stefan Heinzmann wrote:
> > kanze@gabi-soft.fr wrote:
> [...]
> > And, programming in foreign languages aside, why should I not use
> > the uppercase greek Gamma symbol for the Gamma function?
> One reason is keyboard input. In general, pressing some funky key
> combination for the Gamma symbol is not faster than typing "gamma".
I hate to disillusion you, but a lot of us have to press some funky key
combination for things like { and } -- that hasn't stopped C++. One
more or less won't hurt us. And we don't need any funky key
combination for accented letters -- those are everyday keys on our
machines. On most of the keyboards I use, it's easier to get one of
those characters than a { or a }.
There are strong arguments for just using (American) English for
everything, but they don't apply everywhere. Bookkeeping laws are
different enough from country to country that a program for calculating
French taxes will NOT be used or maintained outside of France. And
while you can write French without the accents, the results are pretty
unreadable, not to mention the ambiguity of such words as "coût"
(cost -- not an unreasonable symbol name in a tax program) if you drop
the accents and do a "using namespace std" (or use a compiler like the
older g++, which aliases std:: to ::).
> Moreover, to be efficient at all, it requires memorizing how to enter
> these special symbols (and once you start, there will probably be
> dozens).
Well, I've known people to master Emacs, and there are a lot more than
just dozens of funky key combinations there :-).
> I'm actually in favor of having a richer symbol set for identifiers,
> but you have to consider the limitations. While I can type (I'm
> picking a Latin1-compatible example here and hope the charset survives
> moderation)
> cout ( x );
> without problems (thanks to a <Compose> key),
It didn't.
> and it would certainly read nicer than
> cout << (0.75*(x*x));
> the latter is decidedly more convenient to enter than the former
> (unless you start heavily custom-mapping your keyboard).
Well, you only enter the code once, whereas you read it thousands of
times. So you really should privilege the reader.
--
James Kanze GABI Software mailto:kanze@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16