Thread

Topic: Defect Report: TR1 regex case-insensitive character ranges are unimplementable as specified

Author: kanze@gabi-soft.fr
Date: 6 Jul 2005 16:10:05 GMT Raw View

Martin Bonner wrote:
> kanze@gabi-soft.fr wrote:
> > "John Maddock" wrote:

> > > Quick and dirty performance comparisons show that
> > > expressions such as "[X-\\x{fff0}]+" are indeed very slow
> > > to compile with ICU (about 200 times slower than a
> > > "normal" expression).
> > Just a small question: is there any real use for this sort
> > of expression?
> [snip]

> > And of course, if the functionality is of no possible
> > utility, why bother supporting it?

> The implication of the OP was that TR1 references the
> ECMAScript standard which says that it should be supported.

I'm not sure that we can simply follow ECMAScript blindly.
Traditionally, at least, the character encoding is not specified
by the C++ standard.  Given that, expressions like the one given
have no real meaning.  If ECMAScript specifies a specific
character encoding, then the expression has defined semantics
(even if they are still totally useless).

> In general, I don't like the idea of implementors saying "that
> bit of a standard is useless, I won't bother supporting it".

I totally agree there.  But as I understand it, this isn't yet
part of the standard.  So the question is, why should the
standard require it?

> On the other hand, this does seem a /particularly/ useless bit
> of standard, and supporting it has major implications for
> people doing useful things.

> Of course, if the C++ Standards Committee says "that bit of
> the standard is useless, don't bother supporting it", the
> situation is rather different ... and getting that response is
> pretty much the point of the DR.

For the moment, I think we are still talking about what should
be in the standard; I don't think that the committee can say
that something that is actually part of the standard is useless,
and shouldn't be supported.  A TR isn't (yet) the standard,
however, and a statement to the effect that such and such
functionality will not be required when the clauses become part
of the standard is perfectly valid.

--
James Kanze                                           GABI Software
Conseils en informatique orient   e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S   mard, 78210 St.-Cyr-l'   cole, France, +33 (0)1 30 23 00 34

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]

Author: boost.regex@virgin.net ("John Maddock")
Date: Fri, 1 Jul 2005 16:12:25 GMT Raw View

> A problem with TR1 regex is currently being discussed on the Boost
> developers list. It involves the handling of case-insensitive matching of
> character ranges such as [Z-a]. The proper behavior (according to the
> ECMAScript standard) is unimplementable given the current specification of
> the TR1 regex_traits<> class template. John Maddock, the author of the TR1
> regex proposal, agrees there is a problem. The full discussion can be
> found at http://lists.boost.org/boost/2005/06/28850.php (first message
> copied below). We don't have any recommendations as yet.

One small correction, I have since found that ICU's regex package does
implement this correctly, using a similar mechanism to the current
TR1.Regex.

Given an expression [c1-c2] that is compiled as case insensitive it:

Enumerates every character in the range c1 to c2 and converts it to it's
case folded equivalent.  That case folded character is then used a key to a
table of equivalence classes, and each member of the class is added to the
list of possible matches supported by the character-class.  This second step
isn't possible with our current traits class design, but isn't necessary if
the input text is also converted to a case-folded equivalent on the fly.

ICU applies similar brute force mechanisms to character classes such as
[[:lower:]] and [[:word:]], however these are at least cached, so the impact
is less noticeable in this case.

Quick and dirty performance comparisons show that expressions such as
"[X-\\x{fff0}]+" are indeed very slow to compile with ICU (about 200 times
slower than a "normal" expression).  For an application that uses a lot of
regexes this could have a noticeable performance impact.  ICU also has an
advantage in that it knows the range of valid characters codes: code points
outside that range are assumed not to require enumeration, as they can not
be part of any equivalence class.  I presume that if we want the TR1.Regex
to work with arbitrarily large character sets enumeration really does become
impractical.

Finally note that Unicode has:

Three cases (upper, lower and title).
One to many, and many to one case transformations.
Character that have context sensitive case translations - for example an
uppercase sigma has two different lowercase forms  - the form chosen depends
on context(is it end of a word or not), a caseless match for an upper case
sigma should match either of the lower case forms, which is why case folding
is often approximated by tolower(toupper(c)).

Probably we need some way to enumerate character equivalence classes,
including digraphs (either as a result or an input), and some way to tell
whether the next character pair is a valid digraph in the current locale.

Hoping this doesn't make this even more complex that it was already,

John.


---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]

Author: kanze@gabi-soft.fr
Date: 4 Jul 2005 17:50:35 GMT Raw View

"John Maddock" wrote:

> Quick and dirty performance comparisons show that expressions
> such as "[X-\\x{fff0}]+" are indeed very slow to compile with
> ICU (about 200 times slower than a "normal" expression).

Just a small question: is there any real use for this sort of
expression?  I know that when I implementation regular
expressions (about 20 years ago, in C), I simply imposed the
requirement that in a range, both of the characters must be pure
ASCII, and both must be lower, or both upper, or both digit, and
let it go at that.  Of course, back then, internationalization
was less an issue, and the possibility of EBCDIC more of one
(and we wanted [a-z] to do what was expected even with EBCDIC).
But still, is there ever a time when you can specify a range
which makes sense, and still covers different categories of
characters?

And of course, if the functionality is of no possible utility,
why bother supporting it?

--
James Kanze                                           GABI Software
Conseils en informatique orient   e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S   mard, 78210 St.-Cyr-l'   cole, France, +33 (0)1 30 23 00 34

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]

Author: "Martin Bonner" <martinfrompi@yahoo.co.uk>
Date: Tue, 5 Jul 2005 12:14:19 CST Raw View

kanze@gabi-soft.fr wrote:
> "John Maddock" wrote:
>
> > Quick and dirty performance comparisons show that expressions
> > such as "[X-\\x{fff0}]+" are indeed very slow to compile with
> > ICU (about 200 times slower than a "normal" expression).
>
> Just a small question: is there any real use for this sort of
> expression?
[snip]
> And of course, if the functionality is of no possible utility,
> why bother supporting it?

The implication of the OP was that TR1 references the ECMAScript
standard which says that it should be supported.

In general, I don't like the idea of implementors saying "that bit of a
standard is useless, I won't bother supporting it".  On the other hand,
this does seem a /particularly/ useless bit of standard, and supporting
it has major implications for people doing useful things.

Of course, if the C++ Standards Committee says "that bit of the
standard is useless, don't bother supporting it", the situation is
rather different ... and getting that response is pretty much the point
of the DR.

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]