Many applications require a dynamic set
that supports only the dictionary operations INSERT, SEARCH, and DELETE. For example, a compiler for a computer
language maintains a symbol table, in which the keys of elements are arbitrary
character strings that correspond to identifiers in the language. A hash table
is an effective data structure for implementing dictionaries. Although searching
for an element in a hash table can take as long as searching for an element in a
linked list--Θ(*n*) time in
the worst case--in practice, hashing performs extremely well. Under reasonable
assumptions, the expected time to search for an element in a hash table is
*O*(1).

Direct addressing is a simple technique that works well when the
universe *U* of keys is reasonably small. Suppose that an application needs
a dynamic set in which each element has a key drawn from the universe *U* =
{0, 1, . . . , *m* - 1}, where *m* is not too large. We shall assume that no
two elements have the same key.

To represent the dynamic set, we use an array, or *direct-address table*, *T*[0 . . *m* - 1], in which each position, or *slot*, corresponds to a key in the universe *U*.

The dictionary operations are trivial to implement.

DIRECT-ADDRESS-SEARCH(T, k)
  return T[k]

DIRECT-ADDRESS-INSERT(T, x)
  T[key[x]] ← x

DIRECT-ADDRESS-DELETE(T, x)
  T[key[x]] ← NIL

Each of these operations is fast: only *O*(1) time is required.
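These three procedures might be rendered in Python as follows; the Item class and the universe size *m* = 16 are illustrative assumptions, not part of the text:

```python
# A minimal sketch of a direct-address table; None plays the role of NIL.
class Item:
    def __init__(self, key, data):
        self.key = key
        self.data = data

class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m      # one slot per key in U = {0, ..., m-1}

    def search(self, k):
        return self.slots[k]         # O(1): index directly by the key

    def insert(self, x):
        self.slots[x.key] = x        # O(1)

    def delete(self, x):
        self.slots[x.key] = None     # O(1)

t = DirectAddressTable(16)
t.insert(Item(5, "five"))
assert t.search(5).data == "five"
t.delete(t.search(5))
assert t.search(5) is None
```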

For some applications, the elements in the dynamic set can be stored in the direct-address table itself. That is, rather than storing an element's key and satellite data in an object external to the direct-address table, with a pointer from a slot in the table to the object, we can store the object in the slot itself, thus saving space. Moreover, it is often unnecessary to store the key field of the object, since if we have the index of an object in the table, we have its key. If keys are not stored, however, we must have some way to tell if the slot is empty.

A *bit vector* is simply an array of bits (0's and 1's). A bit vector of length *m* takes much less space than an array of *m* pointers and suffices to represent a dynamic set of distinct elements with no satellite data: bit *k* is 1 if key *k* is in the set, and 0 otherwise.
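A sketch of such a bit-vector dictionary, using a single Python integer as the array of bits (an implementation convenience, not part of the text):

```python
# A dynamic set of keys drawn from {0, ..., m-1}, one bit per key.
class BitVectorSet:
    def __init__(self, m):
        self.m = m
        self.bits = 0                     # all m bits start at 0 (empty set)

    def insert(self, k):
        self.bits |= (1 << k)             # set bit k

    def delete(self, k):
        self.bits &= ~(1 << k)            # clear bit k

    def search(self, k):
        return (self.bits >> k) & 1 == 1  # test bit k

s = BitVectorSet(1024)
s.insert(3)
s.insert(900)
assert s.search(3) and s.search(900)
s.delete(3)
assert not s.search(3)
```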

The difficulty with
direct addressing is obvious: if the universe *U* is large, storing a table
*T* of size |*U*| may be impractical, or even impossible, given the
memory available on a typical computer. Furthermore, the set *K* of keys
*actually stored* may be so small relative to *U* that most of the
space allocated for *T* would be wasted.

With direct addressing, an element with key *k* is
stored in slot *k*. With hashing, this element is stored in slot
*h*(*k*); that is, a **hash function** *h* is used to
compute the slot from the key *k*. Here *h* maps the universe *U*
of keys into the slots of a **hash table** *T*[0 . . *m* - 1]:

h: U → {0, 1, . . . , m - 1} .

We say that an element with key *k* *hashes* to slot *h*(*k*); we also say that *h*(*k*) is the *hash value* of key *k*.

In *chaining*, we put all the elements that hash to the same slot in a linked list, as shown in Figure 12.3. Slot *j* contains a pointer to the head of the list of all stored elements that hash to *j*; if there are no such elements, slot *j* contains NIL.

CHAINED-HASH-INSERT(T, x)
  insert x at the head of list T[h(key[x])]

CHAINED-HASH-SEARCH(T, k)
  search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
  delete x from the list T[h(key[x])]

The worst-case running time for insertion is *O*(1). For searching, the
worst-case running time is proportional to the length of the list; we shall
analyze this more closely below. Deletion of an element *x* can be
accomplished in *O*(1) time if the lists are doubly linked. (If the lists
are singly linked, we must first find *x* in the list
*T*[*h*(*key*[*x*])], so that the *next* link of
*x*'s predecessor can be properly set to splice *x* out; in this case,
deletion and searching have essentially the same running time.)
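For concreteness, chaining might be sketched in Python as follows; the division-method hash and the use of Python lists in place of linked lists are implementation choices, not part of the text:

```python
# Hashing with chaining: each table slot holds a chain of (key, value) pairs.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def h(self, k):
        return k % self.m                         # assumed hash function

    def insert(self, k, v):
        self.table[self.h(k)].insert(0, (k, v))   # insert at head: O(1)

    def search(self, k):
        for key, v in self.table[self.h(k)]:      # scan only this chain
            if key == k:
                return v
        return None

    def delete(self, k):
        chain = self.table[self.h(k)]
        self.table[self.h(k)] = [(key, v) for key, v in chain if key != k]

t = ChainedHashTable(7)
t.insert(10, "a")
t.insert(17, "b")            # 10 and 17 both hash to slot 3: they chain
assert t.search(10) == "a" and t.search(17) == "b"
t.delete(10)
assert t.search(10) is None and t.search(17) == "b"
```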

How well does hashing with chaining perform? In particular, how long does it take to search for an element with a given key?

In this section, we discuss some issues regarding the design of good hash functions and then present three schemes for their creation: hashing by division, hashing by multiplication, and universal hashing.

A good hash function satisfies (approximately) the assumption of simple
uniform hashing: each key is equally likely to hash to any of the *m*
slots. More formally, let us assume that each key is drawn independently from *U* according to a probability distribution *P*; that is, *P*(*k*) is the probability that *k* is drawn. Then the assumption of simple uniform hashing is that

Σ_{k : h(k) = j} P(k) = 1/m    for j = 0, 1, . . . , m - 1 .        (12.1)

Unfortunately, it is generally not possible to check this condition, since
*P *is usually unknown.

Sometimes (rarely) we do know the distribution *P*. For example, suppose
the keys are known to be random real numbers *k* independently and
uniformly distributed in the range 0 ≤ *k* < 1. In this case, the
hash function

h(k) = ⌊km⌋

can be shown to satisfy equation (12.1).

In practice, heuristic techniques can be used to create a
hash function that is likely to perform well. Qualitative information about
*P* is sometimes useful in this design process. For example, consider a
compiler's symbol table, in which the keys are arbitrary character strings
representing identifiers in a program. It is common for closely related symbols,
such as pt and pts, to occur in the same program. A good hash function would
minimize the chance that such variants hash to the same slot.

Most hash functions assume that the universe of keys is the set **N** =
{0,1,2, . . .} of natural numbers. Thus, if the keys are not natural numbers, a
way must be found to interpret them as natural numbers. For example, a key that
is a character string can be interpreted as an integer expressed in suitable
radix notation. Thus, the identifier pt
might be interpreted as the pair of decimal integers (112,116), since p = 112 and t = 116 in the ASCII character set; then, expressed as a radix-128
integer, pt becomes (112 · 128) + 116 = 14452. It is usually
straightforward in any given application to devise some such simple method for
interpreting each key as a (possibly large) natural number. In what follows, we
shall assume that the keys are natural numbers.
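As a sketch, this radix-128 interpretation is a few lines of Python (`ord` gives the ASCII code of a character):

```python
# Interpret a character string as a radix-128 natural number,
# as in the "pt" example above.
def string_to_key(s):
    key = 0
    for ch in s:
        key = key * 128 + ord(ch)   # shift left one radix-128 digit, add next
    return key

assert string_to_key("pt") == 112 * 128 + 116   # = 14452
```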

In the *division method* for creating hash functions, we map a key *k* into one of *m* slots by taking the remainder of *k* divided by *m*. That is, the hash function is

h(k) = k mod m .

For example, if the hash table has size *m* = 12 and the key is *k*
= 100, then *h(k)* = 4. Since it requires only a single division operation,
hashing by division is quite fast.

When using the division method, we usually avoid certain values of *m*. For example, *m* should not be a power of 2, since if *m* = 2^{*p*}, then *h*(*k*) is just the *p* lowest-order bits of *k*. Unless we know that all low-order *p*-bit patterns are equally likely, it is better to make the hash function depend on all the bits of the key.
Good values for *m* are primes not too close to exact powers of 2. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly *n* = 2000 character strings, where a character
has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful
search, so we allocate a hash table of size *m* = 701. The number 701 is
chosen because it is a prime near 2000/3 ≈ 667 but not near any power of
2. Treating each key *k* as an integer, our hash function would be

h(k) = k mod 701 .

As a precautionary measure, we could check how evenly this hash function distributes sets of keys among the slots, where the keys are chosen from "real" data.
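Such a check might look as follows; the arithmetic-progression "keys" below are a synthetic stand-in for real data:

```python
# Hash a batch of keys with h(k) = k mod 701 and inspect how evenly
# the slots fill up.
m = 701

def h(k):
    return k % m

keys = list(range(0, 200000, 97))      # 2062 stand-in keys
counts = [0] * m
for k in keys:
    counts[h(k)] += 1

# Since 97 and 701 are coprime, these keys spread almost perfectly evenly:
assert sum(counts) == len(keys)
assert max(counts) - min(counts) <= 1
```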

The *multiplication method* for creating hash functions operates in two steps. First, we multiply the key *k* by a constant *A* in the range 0 < *A* < 1 and extract the fractional part of *kA*. Then, we multiply this value by *m* and take the floor of the result. In short, the hash function is

h(k) = ⌊m (kA mod 1)⌋ ,

where "*kA* mod 1" means the fractional part of *kA*, that is, *kA* - ⌊*kA*⌋.

An advantage of the multiplication method is that the value of *m* is not critical. We typically choose it to be a power of 2--*m* = 2^{*p*} for some integer *p*--since the function is then easy to implement on most computers.

Although this method works with any value of the constant *A*, it works
better with some values than with others. The optimal choice depends on the
characteristics of the data being hashed. Knuth [123] discusses the choice of
*A* in some detail and suggests that

A ≈ (√5 - 1)/2 = 0.6180339887 . . .                    (12.2)

is likely to work reasonably well.

As an example, if we have *k* = 123456, *m* = 10000, and *A*
as in equation (12.2), then

h(k) = ⌊10000 · (123456 · 0.61803 . . . mod 1)⌋

     = ⌊10000 · (76300.0041151 . . . mod 1)⌋

     = ⌊10000 · 0.0041151 . . .⌋

     = ⌊41.151 . . .⌋

     = 41 .
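The computation can be reproduced directly, using full floating-point precision for *A* rather than the truncated 0.61803:

```python
import math

# The multiplication method with Knuth's suggested constant A = (sqrt(5)-1)/2.
A = (math.sqrt(5) - 1) / 2               # 0.6180339887...

def h(k, m=10000):
    return math.floor(m * ((k * A) % 1.0))   # floor of m times frac(kA)

assert h(123456) == 41                   # matches the derivation above
```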

If a
malicious adversary chooses the keys to be hashed, then he can choose *n*
keys that all hash to the same slot, yielding an average retrieval time of Θ(*n*). Any fixed hash function
is vulnerable to this sort of worst-case behavior; the only effective way to
improve the situation is to choose the hash function *randomly* in a way
that is *independent* of the keys that are actually going to be stored.
This approach, called *universal hashing*, yields good performance
on the average, no matter what keys are chosen by the adversary.

For distinct keys *y* and *z*, let *c_{yz}* be a random variable that is 1 if *h*(*y*) = *h*(*z*) (a collision between *y* and *z*) and 0 otherwise. Since a hash function drawn at random from a universal class causes a given pair of distinct keys to collide with probability 1/*m*, we have

E[c_{yz}] = 1/m .

Let *C_x* be the total number of collisions involving key *x* in a hash table of size *m* containing *n* keys; then

E[C_x] = (n - 1)/m .

Since *n* ≤ *m*, we have E[*C_x*] < 1.

But how easy is it to design a universal class of hash functions? It is quite
easy, as a little number theory will help us prove. Let us choose our table size
*m* to be prime (as in the division method). We decompose a key *x*
into *r* + 1 bytes (i.e., characters, or fixed-width binary substrings), so that *x* = ⟨*x*_0, *x*_1, . . . , *x_r*⟩; the only requirement is that the maximum value of a byte should be less than *m*. Let *a* = ⟨*a*_0, *a*_1, . . . , *a_r*⟩ denote a sequence of *r* + 1 elements chosen randomly from the set {0, 1, . . . , *m* - 1}, and define a corresponding hash function by

h_a(x) = (Σ_{i=0}^{r} a_i x_i) mod m .                  (12.3)

With this definition,

ℋ = ∪_a {h_a}                                           (12.4)

has m^{r+1} members.

The class defined by equations (12.3) and (12.4) is a universal class of hash functions.
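A sketch of this construction in Python; the table size *m* = 257 and *r* = 3 are illustrative choices, and the decomposition uses base-*m* digits so that each "byte" is less than *m*:

```python
import random

# The universal class of equations (12.3) and (12.4): decompose a key
# into r + 1 base-m digits and hash with a random coefficient sequence a.
m = 257                                   # table size: a prime

def decompose(x, r=3):
    """Split key x into r + 1 digits base m (each digit is < m)."""
    digits = []
    for _ in range(r + 1):
        digits.append(x % m)
        x //= m
    return digits

def random_hash(r=3):
    """Draw one member h_a of the class at random."""
    a = [random.randrange(m) for _ in range(r + 1)]
    def h(x):
        return sum(ai * xi for ai, xi in zip(a, decompose(x, r))) % m
    return h

h = random_hash()
assert 0 <= h(123456789) < m
assert h(42) == h(42)        # a fixed member of the class is deterministic
```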

In *open addressing*, all elements are stored in the hash table itself. That is,
each table entry contains either an element of the dynamic set or NIL. When searching for an element, we
systematically examine table slots until the desired element is found or it is
clear that the element is not in the table. There are no lists and no elements
stored outside the table, as there are in chaining. Thus, in open addressing,
the hash table can "fill up" so that no further insertions can be made; the load
factor can never exceed 1.

To perform insertion
using open addressing, we successively examine, or *probe*, the hash table until we find an empty slot in which to put the key. Instead of being fixed in the order 0, 1, . . . , *m* - 1 (which requires Θ(*n*) search time), the sequence of positions probed depends upon the key being inserted. To determine which slots to probe, we extend the hash function to include the probe number (starting from 0) as a second input. Thus, the hash function becomes

h: U × {0, 1, . . . , m - 1} → {0, 1, . . . , m - 1} .

With open addressing, we require that for every key *k*, the **probe
sequence**

h(k, 0), h(k, 1), . . . , h(k, m - 1)

be a permutation of 0, 1, .
. . , *m* - 1, so that
every hash-table position is eventually considered as a slot for a new key as
the table fills up. In the following pseudocode, we assume that the elements in
the hash table *T* are keys with no satellite information; the key *k*
is identical to the element containing key *k*. Each slot contains either a
key or NIL (if the slot is empty).

HASH-INSERT(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3      if T[j] = NIL
4         then T[j] ← k
5              return j
6         else i ← i + 1
7  until i = m
8  error "hash table overflow"

The algorithm for searching for key *k* probes the same sequence of
slots that the insertion algorithm examined when key *k* was inserted.
Therefore, the search can terminate (unsuccessfully) when it finds an empty
slot, since *k* would have been inserted there and not later in its probe
sequence. (Note that this argument assumes that keys are not deleted from the
hash table.) The procedure HASH-SEARCH takes as input a hash table *T* and
a key *k*, returning *j* if slot *j* is found to contain key
*k*, or NIL if key *k* is
not present in table *T*.

HASH-SEARCH(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3      if T[j] = k
4         then return j
5      i ← i + 1
6  until T[j] = NIL or i = m
7  return NIL
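A direct transcription of these two procedures into Python; the probe function h(k, i) = (k mod m + i) mod m is an assumption for illustration (it is the linear probing described below):

```python
# Open addressing: all keys live in the table itself; None plays NIL.
m = 11
T = [None] * m

def h(k, i):
    return (k % m + i) % m

def hash_insert(T, k):
    for i in range(m):               # i = 0, 1, ..., m - 1
        j = h(k, i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k):
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:             # k would have been inserted here
            return None
    return None

hash_insert(T, 5)
hash_insert(T, 16)                   # collides: 16 mod 11 = 5, lands in slot 6
assert hash_search(T, 16) == 6
assert hash_search(T, 99) is None
```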

Deletion from an open-address hash table is difficult. When
we delete a key from slot *i*, we cannot simply mark that slot as empty by
storing NIL in it. Doing so might make it
impossible to retrieve any key *k* during whose insertion we had probed
slot *i* and found it occupied. One solution is to mark the slot by storing
in it the special value DELETED instead
of NIL. We would then modify the
procedure HASH-SEARCH so that it keeps on looking when it sees the value DELETED, while HASH-INSERT would treat
such a slot as if it were empty so that a new key can be inserted. When we do
this, though, the search times are no longer dependent on the load factor α, and for this reason chaining is
more commonly selected as a collision resolution technique when keys must be
deleted.
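A sketch of this scheme, with a unique Python object standing in for the special value DELETED; linear probing is assumed as the probe function for illustration:

```python
# Deletion with a DELETED sentinel: search skips it, insertion reuses it.
m = 11
DELETED = object()                     # unique sentinel, unequal to any key
T = [None] * m

def probe(k, i):
    return (k % m + i) % m             # linear probing, for illustration

def insert(T, k):
    for i in range(m):
        j = probe(k, i)
        if T[j] is None or T[j] is DELETED:   # treat DELETED as empty
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def search(T, k):
    for i in range(m):
        j = probe(k, i)
        if T[j] == k:
            return j
        if T[j] is None:               # keep looking past DELETED, stop at NIL
            return None
    return None

def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED                 # not None: later keys may probe past here

insert(T, 5)
insert(T, 16)                          # 16 probes past slot 5 into slot 6
delete(T, 5)                           # slot 5 becomes DELETED, not empty
assert search(T, 16) == 6              # search correctly skips the DELETED slot
insert(T, 27)                          # 27 mod 11 = 5: reuses the DELETED slot
assert search(T, 27) == 5
```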

Given an ordinary hash function *h*': *U* → {0, 1, . . . , *m* - 1}, the method of *linear probing* uses the hash function

h(k, i) = (h'(k) + i) mod m

for *i* = 0,1,...,*m* - 1. Given key *k*, the first slot
probed is *T*[*h*'(*k*)]. We next probe slot
*T*[*h*'(*k*) + 1], and so on up to slot *T*[*m* - 1].
Then we wrap around to slots *T*[0], *T*[1], . . . , until we finally
probe slot *T*[*h*'(*k*) - 1]. Since the initial probe position
determines the entire probe sequence, only *m* distinct probe sequences are
used with linear probing.

Linear probing is easy to implement, but it suffers from a problem known as *primary clustering*. Long runs of occupied slots build up, increasing the average search time. For example, if the table holds *n* = *m*/2 keys and every even-indexed slot is occupied while every odd-indexed slot is empty, the average unsuccessful search takes 1.5 probes; if instead the occupied slots are the first *m*/2 positions, the average number of probes grows to about *n*/4. Clusters tend to arise because an empty slot preceded by *i* full slots is the next one filled with probability (*i* + 1)/*m*, so long runs of occupied slots tend to get longer.

*Quadratic probing* uses a hash function of the form

h(k, i) = (h'(k) + c_1 i + c_2 i^2) mod m ,

where (as in linear probing) *h*' is an auxiliary hash function,
*c*_{1} and *c*_{2} ≠ 0 are auxiliary constants, and
*i* = 0, 1, . . . , *m* - 1. The initial position probed is
*T*[*h*'(*k*)]; later positions probed are offset by amounts that
depend in a quadratic manner on the probe number *i*. This method works
much better than linear probing, but to make full use of the hash table, the
values of *c*_{1}, *c*_{2}, and *m* are
constrained. Problem 12-4 shows one way to select these parameters. Also, if two
keys have the same initial probe position, then their probe sequences are the
same, since *h*(*k*_{1}, 0) = *h*(*k*_{2},
0) implies *h*(*k*_{1}, *i*) =
*h*(*k*_{2}, *i*). This leads to a milder form of
clustering, called **secondary clustering**. As in linear
probing, the initial probe determines the entire sequence, so only *m*
distinct probe sequences are used.

Double hashing is one of the best methods
available for open addressing because the permutations produced have many of the
characteristics of randomly chosen permutations. *Double hashing*
uses a hash function of the form

h(k, i) = (h_1(k) + i h_2(k)) mod m ,

where *h*_{1} and *h*_{2} are auxiliary hash
functions. The initial position probed is *T*[*h*_{1}
(*k*)]; successive probe positions are offset from previous positions by
the amount *h*_{2}(*k*), modulo *m*. Thus, unlike the
case of linear or quadratic probing, the probe sequence here depends in two ways
upon the key *k*, since the initial probe position, the offset, or both,
may vary. Figure 12.5 gives an example of insertion by double hashing.

The value *h*_{2}(*k*) must be relatively prime to the
hash-table size *m* for the entire hash table to be searched. Otherwise, if
*m* and *h*_{2}(*k*) have greatest common divisor
*d* > 1 for some key *k*, then a search for key *k* would
examine only (1/*d*)th of the hash table. (See Chapter 33.) A convenient
way to ensure this condition is to let *m* be a power of 2 and to design
*h*_{2} so that it always produces an odd number. Another way is to
let *m* be prime and to design *h*_{2} so that it always
returns a positive integer less than *m*. For example, we could choose
*m* prime and let

h_1(k) = k mod m ,

h_2(k) = 1 + (k mod m') ,

where *m*' is chosen to be slightly less than *m* (say, *m* -
1 or *m* - 2). For example, if *k* = 123456 and *m* = 701, we
have *h*_{1}(*k*) = 80 and *h*_{2}(*k*) =
257, so the first probe is to position 80, and then every 257th slot (modulo
*m*) is examined until the key is found or every slot is examined.
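The arithmetic of this example can be verified directly:

```python
# Double hashing with m = 701 (prime) and m' = 700, as in the text.
m = 701
m_prime = 700

def h1(k):
    return k % m

def h2(k):
    return 1 + (k % m_prime)        # always in 1..m', hence never zero

def probe(k, i):
    return (h1(k) + i * h2(k)) % m

k = 123456
assert h1(k) == 80 and h2(k) == 257
assert probe(k, 0) == 80            # first probe: position 80
assert probe(k, 1) == 337           # then every 257th slot, modulo m
```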

Double hashing represents an improvement over linear or quadratic probing in
that Θ(*m*^{2}) probe
sequences are used, rather than Θ(*m*), since each possible
(*h*_{1} (*k*), *h*_{2}(*k*)) pair yields a
distinct probe sequence, and as we vary the key, the initial probe position
*h*_{1}(*k*) and the offset *h*_{2}(*k*) may
vary independently. As a result, the performance of double hashing appears to be
very close to the performance of the "ideal" scheme of uniform hashing.

Our analysis of open addressing, like our analysis of
chaining, is expressed in terms of the load factor α of the hash table, as *n* and *m* go to infinity. Recall that if *n* elements are stored in a table with *m* slots, the average number of elements per slot is α = *n*/*m*. Of course, with open addressing, we have at most one element per slot, and thus *n* ≤ *m*, which implies α ≤ 1.

Assume uniform hashing and consider an unsuccessful search. Define

p_i = Pr {exactly i probes access occupied slots}

for *i* = 0, 1, 2, . . . . For *i* > *n*, we have *p_i* = 0, since at most *n* slots are occupied. Thus the expected number of probes is

1 + Σ_{i=0}^{∞} i p_i .                                 (12.6)

To evaluate equation (12.6), we define

q_i = Pr {at least i probes access occupied slots}

for *i* = 0, 1, 2, . . . . We can then use identity (6.28):

Σ_{i=0}^{∞} i p_i = Σ_{i=1}^{∞} q_i .

What is the value of *q_i* for *i* ≥ 1? The first probe accesses an occupied slot with probability *n*/*m*; thus,

q_1 = n/m .

With uniform hashing, a second probe, if necessary, is to one of the remaining *m* - 1 unprobed slots, *n* - 1 of which are occupied. We make a second probe only if the first probe accesses an occupied slot; thus,

q_2 = (n/m) ((n - 1)/(m - 1)) .

In general, the *i*th probe is made only if the first *i* - 1 probes access occupied slots, and the slot probed is equally likely to be any of the remaining *m* - *i* + 1 slots, *n* - *i* + 1 of which are occupied. Thus,

q_i = (n/m) ((n - 1)/(m - 1)) ··· ((n - i + 1)/(m - i + 1)) ≤ (n/m)^i = α^i

for *i* = 1, 2, . . . , *n*, since (*n* - *j*) / (*m* - *j*) ≤ *n*/*m* if *n* ≤ *m* and *j* ≥ 0. After *n* probes, all *n* occupied slots have been seen and will not be probed again, and thus *q_i* = 0 for *i* > *n*.

We are now ready to evaluate equation (12.6). Given the assumption that α < 1, the average number of probes in an unsuccessful search is

1 + Σ_{i=1}^{∞} q_i ≤ 1 + α + α^2 + α^3 + ··· = 1/(1 - α) .        (12.7)

Equation (12.7) has an intuitive interpretation: one probe is always made,
with probability approximately α a second probe is needed, with probability approximately α^2 a third probe is needed, and so on.

If α is a constant,
Theorem 12.5 predicts that an unsuccessful search runs in *O*(1) time. For
example, if the hash table is half full, the average number of probes in an
unsuccessful search is 1/(1 - .5) = 2. If it is 90 percent full, the average
number of probes is 1/(1 - .9) = 10.
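The bound can also be checked numerically. The following sketch (not part of the text's proof) idealizes uniform hashing as a fresh random probe permutation per search and compares the measured average against 1/(1 - α); the parameters are illustrative:

```python
import random

def avg_unsuccessful_probes(m, alpha, trials=2000, seed=1):
    rng = random.Random(seed)
    n = int(alpha * m)                           # number of occupied slots
    total = 0
    for _ in range(trials):
        occupied = set(rng.sample(range(m), n))
        # Probe slots in a random order until the first empty one.
        for probes, slot in enumerate(rng.sample(range(m), m), start=1):
            if slot not in occupied:
                break
        total += probes
    return total / trials

estimate = avg_unsuccessful_probes(m=101, alpha=0.5)
assert 1.0 < estimate <= 1 / (1 - 0.5) + 0.1     # theory predicts at most ~2
```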

Theorem 12.5 gives us the performance of the HASH-INSERT procedure almost immediately.

Computing the expected number of probes for a successful search requires a little more work.

for a bound on the expected number of probes in a successful search.

Write pseudocode for HASH-DELETE as outlined in the text, and modify HASH-INSERT and HASH-SEARCH to incorporate the special value DELETED.

The bound on the harmonic series can be improved to

H_n = Σ_{i=1}^{n} 1/i = ln n + γ + O(1/n) ,

where γ = 0.5772156649 . . . is known as *Euler's constant* and satisfies 0 < γ < 1. (See Knuth [121] for a derivation.) How does this improved approximation for the harmonic series affect the statement and proof of Theorem 12.7?

12-1 Longest-probe bound for hashing

A hash table of size *m* is used to
store *n* items, with *n* ≤ *m*/2. Open addressing is used
for collision resolution.

*c.* Show that Pr{*X* > 2 lg *n*} ≤ 1/*n*, where the random variable *X* denotes the length of the longest probe sequence over the *n* insertions.

*d.* Show that the expected length of the longest probe sequence is E[*X*] = *O*(lg *n*).

You
are asked to implement a dynamic set of *n* elements in which the keys are
numbers. The set is static (no INSERT or
DELETE operations), and the only
operation required is SEARCH. You are
given an arbitrary amount of time to preprocess the *n* elements so that
SEARCH operations run quickly.

12-3 Slot-size bound for chaining

Suppose that we have a
hash table with *n* slots, with collisions resolved by chaining, and
suppose that *n* keys are inserted into the table. Each key is equally
likely to be hashed to each slot. Let *M* be the maximum number of keys in
any slot after all the keys have been inserted. Your mission is to prove an
*O*(lg *n*/lg lg *n*) upper bound on *E*[*M*], the
expected value of *M*.

*a.* Argue that the probability *Q_k* that exactly *k* keys hash to a particular slot is given by

Q_k = (1/n)^k (1 - 1/n)^{n-k} C(n, k) ,

where C(*n*, *k*) denotes the binomial coefficient.
*c.* Use Stirling's approximation, equation (2.11), to show that

Conclude that E[*M*] = *O*(lg *n*/lg lg *n*).

Suppose that we are given a key *k*
to search for in a hash table with positions 0, 1, . . . , *m* - 1, and
suppose that we have a hash function *h* mapping the key space into the set
{0, 1, . . . , *m* - 1}. The search scheme is as follows.

1. Compute the value *i* ← *h*(*k*), and set *j* ← 0.

2. Probe position *i* for the key *k*. If you find it, or if this position is empty, terminate the search.

3. Set *j* ← (*j* + 1) mod *m* and *i* ← (*i* + *j*) mod *m*, and return to step 2.

Assume that *m* is a power of 2.

* b.* Prove that this algorithm examines every table position in
the worst case.

Let ℋ be a class of hash functions in which each *h* maps the universe *U* of keys to {0, 1, . . . , *m* - 1}. We say that ℋ is *k-universal* if, for every fixed sequence of *k* distinct keys ⟨*x*_1, *x*_2, . . . , *x_k*⟩ and for any *h* chosen at random from ℋ, the sequence ⟨*h*(*x*_1), *h*(*x*_2), . . . , *h*(*x_k*)⟩ is equally likely to be any of the *m^k* sequences of length *k* with elements drawn from {0, 1, . . . , *m* - 1}.
*a.* Show that if ℋ is 2-universal, then it is universal.

* b.* Show that the class defined in Section 12.3.3 is not
2-universal.

h_{a,b}(x) = ax + b ,

then ℋ is 2-universal.

Knuth [123] and Gonnet [90] are excellent references for the analysis of hashing algorithms. Knuth credits H. P. Luhn (1953) for inventing hash tables, along with the chaining method for resolving collisions. At about the same time, G. M. Amdahl originated the idea of open addressing.