Thread

Topic: C++0x Wish list (Wide Char bit maps)

Author: Charles Sanders <C.Sanders@BoM.GOV.AU>
Date: Tue, 11 Jun 2002 15:03:53 GMT Raw View

James Kanze wrote:
<snip>
> I have a number of cases where I represent sets of characters
> using a bit map.  It is obvious that a bit map is NOT an
> appropriate representation when the domain contains over a
> million elements -- each bit map would require more that 128 K
> bytes.  But what is the appropriate representation?  For the
> moment, I'm using a list of ranges.
<snip>
>
> Anyhow, if anyone has any ideas about other representations, I'm all
> ears.

 One representation that I saw used for land/sea masks,
which, like the character class bitmaps, tend to have groups
of similar bits near each other (ie the middle of the Pacific Ocean
is (nearly) all sea, was an array of bit maps. This was about 20-30
years ago (when memory was really scarce and 256k was a large
mainframe), and was done in Fortran.

 ie you  logically break up the large bit map in to equal
sized pieces, and have an array of bit maps of this size, and
a "pointer" array containing either the index into the bit map
array for the map pertaining to each piece, or special values
(not valid indices) indicating "all 1's" or "all 0's", then the
check can be made something like

 int index = map_num[ inputchar/BitsPerMap ];
 if( 0 <= index < NumberOfMaps )
 {
  return GetBit( map[index], inputchar%BitsPerMap );
 }
 else
 {
  return (index == ALL_BITS_TRUE);
 }

where map_num[] and map[] are suitably defined arrays and
GetBit( m, i ) extracts the ith bit of bitmap m, and can be
as simple as m&(1<<i) if the bitmap pieces will fit in an integral
type such as long or long long. ALL_BITS_TRUE and ALL_BITS_FALSE will
be suitably defined constants (not overlapping with 0...NumberOfMaps-1)
and NumberOfMaps will be the number of pieces of the entire, large,
bit map that have a mixture of 1 and 0 bits (ie are not all 1 and are
not all 0). This can be a big win in the amount of memory used
if there are long stretches of all 1's and/or all 0s, and does
not slow things down much compared with using a single, large
bit map.

 The optimum value for BitsPerMap will obviously depend on
the actual distribution of 1 and 0 bits, and will be different
for different character classes. Also, if it is possible to keep
the number of actual maps needed to 254 or less, the map indices
can be stored in an unsigned char, reducing memory usage even more.
Similarly, in some cases, some of the maps may happen to be equal
because of repeated patterns of bits, and it may be possible to share
the bit maps. It should also be possible to try various values of
BitsPerMap and find the optimum.

 This is similar to Ross Smith's suggestion.

Charles Sanders
C.Sanders@BoM.GOV.AU

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]

Author: kanze@gabi-soft.de (James Kanze)
Date: Wed, 12 Jun 2002 16:09:32 GMT Raw View

Charles Sanders <C.Sanders@BoM.GOV.AU> wrote in message
news:<3D05CCA4.D7E9CE59@BoM.GOV.AU>...
> James Kanze wrote:
> <snip>
> > I have a number of cases where I represent sets of characters
> > using a bit map.  It is obvious that a bit map is NOT an
> > appropriate representation when the domain contains over a
> > million elements -- each bit map would require more that 128 K
> > bytes.  But what is the appropriate representation?  For the
> > moment, I'm using a list of ranges.
>  <snip>

> > Anyhow, if anyone has any ideas about other representations, I'm all
> > ears.

>  One representation that I saw used for land/sea masks, which,
> like the character class bitmaps, tend to have groups of similar
> bits near each other (ie the middle of the Pacific Ocean is (nearly)
> all sea, was an array of bit maps.

One of the problems with deciding on a representation is that I don't
know, a priori, how the bits will be distributed.  Still, two frequent
cases probably predominate: a single bit (to match a single
character), or a predefined set of bits (e.g. [:alpha:]).  With
regards to the latter, I'll have to check to see whether there really
are large blocks.  I know that in some cases, capitals and small
letters alternate, which means that something like [:lower:] will have
every other bit set.  On the other hand, this should only affect
alphabets which have upper and lower case, which corresponds to a
small minority of all characters.

Presumably, some clever processing on the Unicode data file should be
able to calculate an optimal size for the predefined sets, and since
most of the others will be just a single characters, this might be the
best solution.

I had considered using something like this, but with dynamically
determined ranges.  The problem was always how to optimally determine
the ranges, especially after logical operations like union or
intersection.  (One potentially common case might be [_:alpha:], for
example -- all of the alpha characters, plus '_'.  Internally, when
parsing this, :alpha: leads to a static const, which gets or'ed into
the previous set consisting of the '_'.)  Using fixed size ranges
should make this somewhat easier.

    [...]
>  The optimum value for BitsPerMap will obviously depend on the
> actual distribution of 1 and 0 bits, and will be different for
> different character classes. Also, if it is possible to keep the
> number of actual maps needed to 254 or less, the map indices can be
> stored in an unsigned char, reducing memory usage even more.
> Similarly, in some cases, some of the maps may happen to be equal
> because of repeated patterns of bits, and it may be possible to
> share the bit maps. It should also be possible to try various values
> of BitsPerMap and find the optimum.

For starters, one might just try using 128 or 256.  I think that many
of the Unicode blocks are aligned at these values.

At any rate, some experimentation is necessary (and I don't currently
have time to do it).

--
James Kanze                                mailto:kanze@gabi-soft.de
Conseils en informatique orient   e objet/
                    Beratung in objektorientierter Datenverarbeitung
Ziegelh   ttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627

---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html                       ]