Topic: Hash table in STL


Author: kanze@gabi-soft.fr (J. Kanze)
Date: 1996/05/28
Raw View
In article <31A760DF.7F3A@cyberspy.com> Rich Paul
<linguist@cyberspy.com> writes:

|> James Kanze US/ESC 60/3/141 #40763 wrote:

(Just for the record, all of what I actually wrote got cut.  Which is
OK, but the above line might lead to confusion.)

|> I suppose that what I'd like to see in the ( next iteration ) of the
|> standard is  something like this:

|> template<class key>
|> unsigned hash ( const key &k )
|> {
|>  return k.hash();
|> };
|> unsigned hash ( unsigned u )
|> {
|>  return u;
|> };
|> unsigned hash ( int i )
|> {
|>  return (unsigned)i;
|> };
|> //etc, etc, etc

I actually do something like this.  All of the classes I write provide a
hash and a compare function; a template like the above causes the member
hash function to be invoked, and specializations on the builtin types
allow them to work.

Specialization is still needed for numerous cases, the most frequent
being RefCntPtr (where the hash is actually of the object pointed to).

|> template<class key, class value, unsigned size>
|> class hash_table
|> {
|>  static unsigned hash ( const key &k )
|>  {
|>   return hash(k)%size;
|>  };
|> };

|> This seems to be a bit of the best of worlds.  A class you don't define
|> can be hashed  any way you like, though of course you can only use it's
|> publics.  A class you define can have a public hash function.

More correctly, any type for which the hash template function has not
been specialized requires a member function "hash".  There is *NO*
default for class types.

Since my discussion is moving away from standardization issues, I've set
followups to comp.lang.c++.moderated.  If your response returns to
standardization issues, feel free to set it back.
--
James Kanze           (+33) 88 14 49 00          email: kanze@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs Bourgeois, 67000 Strasbourg, France
Conseils en informatique industrielle --
                            -- Beratung in industrieller Datenverarbeitung
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: jcoffin@taeus.com (Jerry Coffin)
Date: 1996/05/21
Raw View
In article <4nn9d9$llt@msil.sps.mot.com>, yairs@msil.sps.mot.com
says...
> Why are the STL associative containers (map, set, multimap,
> multiset) implemented as trees & not hash tables ?
> Hash table implementations can be much more efficient, and do not
> require element ordering.

The publicly available implementation of STL uses red-black trees, but
what's in the standard doesn't require a tree based implementation.

If you want to use a hash table for the implementation you simply have
to meet the requirements in the standard.  The two that initially seem
difficult are those for logarithmic computational complexity, and for
the ability to traverse a collection in sorted order.

The first can be met by using some sort of balanced tree to handle
collisions.  With typical keys that can be compared (which I believe
is a requirement for current [multi]sets and [multi]maps) imposing a
total ordering is generally fairly easy as well: simply compare first
the hash values, then if they're equal, return a comparison of the
keys themselves.

This leaves only one major hurdle: actually creating hash codes from
keys of arbitrary types.  I'm not sure there's a portable method of
doing this, but on most machines you can simply get the address and
size of the key and treat the key as an array of char's.

Ultimately, I'm not convinced that the current requirements prohibit
implementation as hash tables, though a hash table that meets the
requirements is likely to be quite different from one you'd write for
other purposes and situations.


[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: James Kanze US/ESC 60/3/141 #40763 <kanze@lts.sel.alcatel.de>
Date: 1996/05/22
Raw View
In article <MPLANET.31a152e8jcoffin9896a7@news.rmi.net>
jcoffin@taeus.com (Jerry Coffin) writes:

|> In article <4nn9d9$llt@msil.sps.mot.com>, yairs@msil.sps.mot.com
|> says...
|> > Why are the STL associative containers (map, set, multimap,
|> > multiset) implemented as trees & not hash tables ?
|> > Hash table implementations can be much more efficient, and do not
|> > require element ordering.

|> The publicly available implementation of STL uses red-black trees, but
|> what's in the standard doesn't require a tree based implementation.

|> If you want to use a hash table for the implementation you simply have
|> to meet the requirements in the standard.  The two that initially seem
|> difficult are those for logarithmic computational complexity, and for
|> the ability to traverse a collection in sorted order.

If you are trying to meet the exact requirements of map, what's the
point.  The motivation for wanting a hash table in the first place is
to get a different trade off: better average performance (constant
instead of logrithmic) but poorer worst case (linear instead of
logrithmic).

 [Moderator's note: hash tables with trees in the hash buckets
 have constant-time average performance and logarithmic
 worst case - i.e. the best of both worlds, asymptotically
 speaking.  Of course the constant factors are important too.  -fjh.]

|> The first can be met by using some sort of balanced tree to handle
|> collisions.  With typical keys that can be compared (which I believe
|> is a requirement for current [multi]sets and [multi]maps) imposing a
|> total ordering is generally fairly easy as well: simply compare first
|> the hash values, then if they're equal, return a comparison of the
|> keys themselves.

With the standard map class, the *user* defines the ordering; it is not
arbitrary (as is the case when comparing hash values).

|> This leaves only one major hurdle: actually creating hash codes from
|> keys of arbitrary types.  I'm not sure there's a portable method of
|> doing this, but on most machines you can simply get the address and
|> size of the key and treat the key as an array of char's.

Doesn't work on any machine I know of.  In almost all cases, certain
structures will contain padding with random values.  And of course,
-0.0 and +0.0 must hash to the same value, although they generally have
different bit patterns in memory.  Finally, you generally will want a
deep comparison, and so a deep hash.  Using a naive string class, for
example, two strings will always generate distinct hash codes (since
each string will consist of a simple pointer to its own memory), even
if the contents of the string are the same.

|> Ultimately, I'm not convinced that the current requirements prohibit
|> implementation as hash tables, though a hash table that meets the
|> requirements is likely to be quite different from one you'd write for
|> other purposes and situations.

There is nothing which prevents an STL compatible hash table, and in
fact, one is available at the HP site.  I do not believe that the
standard map class can be implemented using hash coding techniques,
however.

--
James Kanze         Tel.: (+33) 88 14 49 00        email: kanze@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils, itudes et rialisations en logiciel orienti objet --
                -- A la recherche d'une activiti dans une region francophone
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: jcoffin@taeus.com (Jerry Coffin)
Date: 1996/05/23
Raw View
In article <9605220924.AA21447@lts.sel.alcatel.de>,
<kanze@lts.sel.alcatel.de> says...
[ ... ]

> If you are trying to meet the exact requirements of map, what's the
> point.  The motivation for wanting a hash table in the first place
> is to get a different trade off: better average performance
> (constant instead of logrithmic) but poorer worst case (linear
> instead of logrithmic).
>
>  [Moderator's note: hash tables with trees in the hash buckets
>  have constant-time average performance and logarithmic
>  worst case - i.e. the best of both worlds, asymptotically
>  speaking.  Of course the constant factors are important too.  -
>      fjh.]

Exactly.  The average performance won't (normally) be as good as other
hash tables, but will still be better than a single tree.

> With the standard map class, the *user* defines the ordering; it is
> not arbitrary (as is the case when comparing hash values).

In that case I'm pretty sure you can't meet the requirements.  (I
should have looked them over more carefully before my previous reply.)

> Doesn't work on any machine I know of.  In almost all cases, certain
> structures will contain padding with random values.  And of course,
> -0.0 and +0.0 must hash to the same value, although they generally
> have different bit patterns in memory.

Though you don't state it directly, some might infer from the above
that the possibility of there being both + and - 0.0 is common.  This
is definitely not the case - in fact 1's complement machines are
comparatively unusual.

> Finally, you generally will want a deep comparison, and so a deep
> hash.  Using a naive string class, for example, two strings will
> always generate distinct hash codes (since each string will consist
> of a simple pointer to its own memory), even if the contents of the
> string are the same.

Yes, you're right: the hash function really does need knowledge of the
structure of the key.  I'm not sure what I was thinking when I said
otherwise, as it's clearly wrong and I know better.  (Or at least by
now I certainy _ought_ to know better.)

> There is nothing which prevents an STL compatible hash table, and in
> fact, one is available at the HP site.  I do not believe that the
> standard map class can be implemented using hash coding techniques,
> however.

I'm not sure exactly what "STL compatible" means, but it seems to be
somewhat like "conforming" WRT to a C program: such a broad term as to
be nearly meaningless.

There are (to my knowledge) at least two different freely available
hash table implementations meant for use with STL: one by Bob Fraley,
and another by Barriero (sp?) and Musser.  Bob Fraley works at HP, and
I believe his is the one available on HP's site.  I won't go into
detail here (since it has more to do with algorithms than the
standard) but I'd consider Musser and Barriero's implementation more
interesting and rigorous than Fraley's.  (This isn't really surprising
- if I'm not mistaken, Musser submitted the formal proposal for adding
unordered maps to facilitate implementation as hash tables, and the
implementation is somewhat later than Fraley's.)
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: James Kanze US/ESC 60/3/141 #40763 <kanze@lts.sel.alcatel.de>
Date: 1996/05/24
Raw View
In article <MPLANET.31a374a7jcoffin9896ba@news.rmi.net>
jcoffin@taeus.com (Jerry Coffin) writes:

|> In article <9605220924.AA21447@lts.sel.alcatel.de>,
|> <kanze@lts.sel.alcatel.de> says...

    [Concerning just hashing the raw bytes of an object...]
|> > Doesn't work on any machine I know of.  In almost all cases, certain
|> > structures will contain padding with random values.  And of course,
|> > -0.0 and +0.0 must hash to the same value, although they generally
|> > have different bit patterns in memory.

|> Though you don't state it directly, some might infer from the above
|> that the possibility of there being both + and - 0.0 is common.  This
|> is definitely not the case - in fact 1's complement machines are
|> comparatively unusual.

0.0 is a floating point value.  Most machines I know of use a signed
magnitude representation for floating point (IEEE, or IBM 360, for
example).  Anyway, it is just an example.  From the rest of your post, I
think that you accept my contention that there are cases where objects
with different bytes must compare equal, and thus hash to the same
value.

    [...]
|> > There is nothing which prevents an STL compatible hash table, and in
|> > fact, one is available at the HP site.  I do not believe that the
|> > standard map class can be implemented using hash coding techniques,
|> > however.

|> I'm not sure exactly what "STL compatible" means, but it seems to be
|> somewhat like "conforming" WRT to a C program: such a broad term as to
|> be nearly meaningless.

Not really.  More that a library, STL defines an interface
specification.  First and foremost, STL compatible means that the
container can be accessed through an interator type conforming to the
STL conventions.  (Note that STL compatible does *NOT* require a
container; it only requires the iterator.  Containers meeting certain
requirements, however, will want to offer compatible functions: an STL
compatible hash table would use "erase", for example, and not "remove".)

--
James Kanze         Tel.: (+33) 88 14 49 00        email: kanze@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils, itudes et rialisations en logiciel orienti objet --
                -- A la recherche d'une activiti dans une region francophone
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: wkdugan@ix.netcom.com (Bill Dugan)
Date: 1996/05/25
Raw View
James Kanze US/ESC 60/3/141 #40763 <kanze@lts.sel.alcatel.de> wrote:

snip

>If you are trying to meet the exact requirements of map, what's the
>point.  The motivation for wanting a hash table in the first place is
>to get a different trade off: better average performance (constant
>instead of logrithmic) but poorer worst case (linear instead of
>logrithmic).

> [Moderator's note: hash tables with trees in the hash buckets
> have constant-time average performance and logarithmic
> worst case - i.e. the best of both worlds, asymptotically
> speaking.  Of course the constant factors are important too.  -fjh.]

Or if you're willing to accept slow inserts, you can maintain
constant-time lookups by expanding and reorganizing the hash table
when needed.

I think the key point is that there are lots of performance trade-offs
here, and different applications may want different points in this
trade space. In many cases, until they've encountered real-world
usage, it won't be obvious exactly what performace trade is really
optimal. For this reason, it may be desirable to preserve the same
interface for containers with differing performance trades. This would
allow substitution of a container with different performance
characteristics without breaking anything.

snip



[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: Rich Paul <linguist@cyberspy.com>
Date: 1996/05/26
Raw View
James Kanze US/ESC 60/3/141 #40763 wrote:

I suppose that what I'd like to see in the ( next iteration ) of the standard is
something like this:

template<class key>
unsigned hash ( const key &k )
{
 return k.hash();
};
unsigned hash ( unsigned u )
{
 return u;
};
unsigned hash ( int i )
{
 return (unsigned)i;
};
//etc, etc, etc


template<class key, class value, unsigned size>
class hash_table
{
 static unsigned hash ( const key &k )
 {
  return hash(k)%size;
 };
};

This seems to be a bit of the best of worlds.  A class you don't define can be hashed
any way you like, though of course you can only use it's publics.  A class you define
can have a public hash function.

Now of course, it should be extended for functors, so that differant tables can hash
differant ways.  But the key is that the public hash should use all the bits of it's
return type ... let the mod in the hash table cut it down ... so you don't write
objects that anticipate the sizes of the hash tables they will be stored in.

what to do about floating point is difficult ... I'm sure this has been explored by
better minds than mine, so I'll leave that to them.

One more possibility ... what if a hash value was a (templated) class?  Just thought of
that ...
--
#include <legalbs/standarddisclaimer>
Rich Paul                |  If you like what I say, tell my
C++, OOD, OOA, OOP,      |  employer, but if you don't,
OOPs, I forgot one ...   |  don't blame them.  ;->


[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: yairs@msil.sps.mot.com (Yair Shmuely)
Date: 1996/05/19
Raw View
Why are the STL associative containers (map, set, multimap, multiset)
implemented as trees & not hash tables ?
Hash table implementations can be much more efficient, and do not require
element ordering.

Yair

 --------------------------------------------------------------
 Yair Shmuely                     phone: (972) 9-590205
 CAD Dept.                        fax  : (972) 9-562990
 Motorola Semiconductor Israel    email: yairs@msil.sps.mot.com
 1 Shenkar St, Herzelia, 46120, Israel
 --------------------------------------------------------------




[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]





Author: Mark Nelson <markn@airmail.net>
Date: 1996/05/20
Raw View
Yair Shmuely wrote:
>
> Why are the STL associative containers (map, set, multimap, multiset)
> implemented as trees & not hash tables ?
> Hash table implementations can be much more efficient, and do not require
> element ordering.

You can perhaps find some discussion of this from deja news.  Hash tables
have their good points and bad points, as do trees.  An attempt was made
to include a set of hashed containers in the STL, but I think it failed
because there wasn't sufficient time to evaluate it properly.

The HP STL ftp site has a copy of the hashed container proposal, along with
the source code that provided a preliminary implementation.

Mark Nelson
http://web2.airmail.net/markn
---
[ comp.std.c++ is moderated.  To submit articles: try just posting with      ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu         ]
[ FAQ:      http://reality.sgi.com/employees/austern_mti/std-c++/faq.html    ]
[ Policy:   http://reality.sgi.com/employees/austern_mti/std-c++/policy.html ]
[ Comments? mailto:std-c++-request@ncar.ucar.edu                             ]