Topic: Defect Report: TR1 regex case-insensitive character ranges are


Author: Eric Niebler <eric@boost-consulting.com>
Date: Fri, 1 Jul 2005 06:34:26 +0000 (UTC)
Raw View

[ Note: Delivered to C++ Committee. -sdc]


A problem with TR1 regex is currently being discussed on the Boost
developers list. It involves the handling of case-insensitive matching
of character ranges such as [Z-a]. The proper behavior (according to the
ECMAScript standard) is unimplementable given the current specification
of the TR1 regex_traits<> class template. John Maddock, the author of
the TR1 regex proposal, agrees there is a problem. The full discussion
can be found at http://lists.boost.org/boost/2005/06/28850.php (first
message copied below). We don't have any recommendations as yet.

-- Begin original message --

The situation of interest is described in the ECMAScript specification
(ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range
is significant in determining which characters belong to the range.
Thus, for example, the pattern /[E-F]/i matches only the letters E, F,
e, and f, while the pattern /[E-f]/i matches all upper and lower-case
ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a
case-insentitive match on a range such as [Z-a]. It should match z, Z,
a, A and the symbols [, \, ], ^, _, and `. This is not what happens with
Boost.Regex (it throws an exception from the regex constructor).

The tough pill to swallow is that, given the specification in TR1, I
don't think there is any effective way to handle this situation.
According to the spec, case-insensitivity is handled with
regex_traits<>::translate_nocase(CharT) -- two characters are equivalent
if they compare equal after both are sent through the translate_nocase
function. But I don't see any way of using this translation function to
make character ranges case-insensitive. Consider the difficulty of
detecting whether "z" is in the range [Z-a]. Applying the transformation
to "z" has no effect (it is essentially std::tolower). And we're not
allowed to apply the transformation to the ends of the range, because as
ECMA-262 says, "the case of the two ends of a range is significant."

So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix
is to redefine translate_nocase to return a string_type containing all
the characters that should compare equal to the specified character. But
this function is hard to implement for Unicode, and it doesn't play nice
with the existing ctype facet. What a mess!

-- End original message --


--
Eric Niebler
Boost Consulting
www.boost-consulting.com



[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]