Design and Analysis of Algorithms: Hash Tables *

Dictionaries

Dictionary ADT.

Operations associated with this data type allow:

(Source)

Typical uses:

Direct addressing and Hashing are two ways of implementing a dictionary. Are there others?

11.1 Direct-address tables
Direct-address operations

                    Direct-Address-Search(T, k)
                        return T[k]

                    Direct-Address-Insert(T, x)
                        T[x.key] = x

                    Direct-Address-Delete(T, x)
                        T[x.key] = NIL
                    

Quiz
  1. A good use for a direct-address table might be:
    1. All answers are fine
    2. Memoization
    3. Bingo
    4. Marking members of a set as present
  2. We can't use direct-address tables when
    1. there are a large number of (potential) entries
    2. all answers are fine
    3. we are programming the Sieve of Eratosthenes
    4. we are dealing with zipcodes
  3. What is direct addressing?
    1. Fewer keys than array positions
    2. Every key specifies a distinct array position
    3. Fewer array positions than keys
    4. None of the mentioned
Answers

1. a; 2. a; 3. b;

11.2 Hash tables
Basic Hashing
Introducing probability into an algorithm.

What happens to the usual assumptions?
Correctness: always, most of the time?
Termination: always, or almost always? What does "performance" mean if the running time/answer/even termination change from one run to the next?

Probability Basics

Reviewed in this document.

Simple uniform hashing

This employs chaining. Furthermore, we assume that the distribution of elements is uniform across hash table slots.

11.3 Hash functions
Choosing the right m for the division method

Consider the following hashing scheme:
h(k) = k mod m
m = 7
We convert a string into a hashable key by treating it as a base-8 number.
So 'abc', where a = 1, b = 2, and c = 3, is converted to a key as follows: 1 * 82 + 2 * 8 + 3 = 83.
In this hashing scheme, what do the strings 'cba' and 'bac' hash to?
Can you write a more general statement about a pattern we can detect here? Something along the lines of, "If the table size is 2P - 1, and strings are interpreted in radix 2P..."

Answer:
If h(k) = k mod m, where m = 2P − 1, and k is a character string interpreted in radix 2P, then all permutations of a given string will hash to the same value. So in the example above, 'abc', 'cba', and 'bac' all hash to the same value.

Proof:
Assumed (could be proven, but we won't do it here):

  1. (x + y) mod z == (x mod z + y mod z) mod z
    Example: (10 + 12) mod 7 == (10 mod 7 + 12 mod 7) mod 7
  2. (x * y) mod z == (x mod z) * (y mod z) mod z
    Example: (10 * 12) mod 7 == (10 mod 7) * (12 mod 7) mod 7
    (7 * 17 = 119)
  3. if x mod z == 1, then xn mod z == 1
    Example: 8 mod 7 == 1, and 82 mod 7 == 1
    This is a special case of 2!

So, we have:






  (By 1)

  (By 2)

  (By 3)

Universal hashing
11.4 Open addressing
11.5 Perfect hashing

We can get even better perfromance with a fixed hash table -- think of reserved words in a programming language, or the index of a CD -- by perfect hashing.
We proceed as in hashing with chaining, but then, instead of a linked list, each hash slot gets a hash table mj of size n2, where n is the number of elements expected to hash to slot j.
The probability of geetting a collision is much like the birthday problem: when the table size is the square of the expected number of entries, the probability of collisions is < 1/2. So we can just try hash functions until we find one that produces no collisions.

Source Code

Java
Ruby
C++
Python