Design and Analysis of Algorithms: Hash Tables *

Dictionaries

Dictionary ADT.

Operations associated with this data type allow:

the addition of a pair to the collection
the removal of a pair from the collection
the modification of an existing pair
the lookup of a value associated with a particular key

(Source)

Typical uses:

Symbol lookup in a programming language
Counting words in a book
Storing colors, with the name as the key and the numeric equivalent as the value. Then we can write set_text(colors["red"]).
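
As a minimal sketch of the four dictionary operations and two of the uses above, here is Python's built-in dict. The color values and the sample sentence are made up for illustration.

```python
from collections import Counter

# A dictionary mapping color names (keys) to numeric values.
colors = {"red": 0xFF0000, "green": 0x00FF00}

colors["blue"] = 0x0000FF   # addition of a pair
colors["red"] = 0xEE0000    # modification of an existing pair
del colors["green"]         # removal of a pair
red_value = colors["red"]   # lookup of the value for a key

# Counting words: each word is a key, its count is the value.
text = "the cat saw the dog and the bird"
word_counts = Counter(text.split())
```

Counter is a dict subclass, so word_counts["the"] is an ordinary dictionary lookup.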

Direct addressing and Hashing are two ways of implementing a dictionary. Are there others?

11.1 Direct-address tables

O(1) worst case time for lookup.
Uses:
- Memoization
- Bingo
- Sieve of Eratosthenes
- Mark zipcodes seen
Downside: wastes space. If you have no idea how many possible keys you need, direct addressing is not a good choice.
For instance, if your key is an arbitrary string!
Example code here.

Direct-address operations

                    Direct-Address-Search(T, k)
                        return T[k]

                    Direct-Address-Insert(T, x)
                        T[x.key] = x

                    Direct-Address-Delete(T, x)
                        T[x.key] = NIL
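
The three operations above can be sketched in Python roughly as follows. The class name DirectAddressTable, the universe_size parameter, and the zipcode example are assumptions made for illustration; the sketch stores a value at index key rather than an element x with an x.key field.

```python
class DirectAddressTable:
    """One array slot per possible key; keys are integers in range(universe_size)."""

    def __init__(self, universe_size):
        # None plays the role of NIL in the pseudocode.
        self.slots = [None] * universe_size

    def search(self, key):
        return self.slots[key]

    def insert(self, key, value):
        self.slots[key] = value

    def delete(self, key):
        self.slots[key] = None


# Marking zipcodes seen: five-digit zipcodes fit in a table of 100,000 slots.
seen = DirectAddressTable(100_000)
seen.insert(90210, True)
```

Every operation is a single array access, which is where the O(1) worst case comes from. The cost is the 100,000-slot array, allocated no matter how few zipcodes are actually marked.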

Quiz

A good use for a direct-address table might be:

All answers are fine
Memoization
Bingo
Marking members of a set as present

We can't use direct-address tables when

there are a large number of (potential) entries
all answers are fine
we are programming the Sieve of Eratosthenes
we are dealing with zipcodes

What is direct addressing?

Fewer keys than array positions
Every key specifies a distinct array position
Fewer array positions than keys
None of the mentioned

Answers

1. a; 2. a; 3. b;

11.2 Hash tables

Basic Hashing

O(1) average case time for lookup.
Universe of keys U mapped into slots of a hash table of size m by hash function h.
Because size(U) > m, collisions are always possible.
Imagine we hash by word length: 'mark' and 'beam' both hash to 4. (Stupid hash function, but it illustrates the idea.) We must resolve this collision somehow.
Resolve collisions by chaining:
Each slot holds a linked list of values.
Cryptographic hashing
Uses large hash values: SHA-1 produces 160-bit hashes. SHA-2 produces hashes of up to 512 bits.
Perceptual hashing
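
Returning to collision resolution by chaining: a minimal sketch in Python, using the word-length hash from the example above. The class name ChainedHashTable is hypothetical, and each chain is a Python list standing in for a linked list.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m=8, hash_fn=len):
        self.m = m
        self.hash_fn = hash_fn  # default: hash a word by its length
        self.slots = [[] for _ in range(m)]

    def _chain(self, key):
        return self.slots[self.hash_fn(key) % self.m]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]


table = ChainedHashTable()
table.insert("mark", 1)
table.insert("beam", 2)   # collides with "mark": both have length 4
```

After these two inserts, slot 4 holds a chain of two entries, and a search for either word walks that chain.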

Introducing probability into an algorithm.

What happens to the usual assumptions?
Correctness: always, most of the time?
Termination: always, or almost always? What does "performance" mean if the running time/answer/even termination change from one run to the next?

Probability Basics

Reviewed in this document.

Simple uniform hashing

This employs chaining. Furthermore, we assume that the distribution of elements is uniform across hash table slots.

Hash table T with m slots storing n elements.
Load factor: α = n / m
α is the average number of elements stored in a chain.
Our analysis is in terms of α, which can be less than, equal to, or greater than one.
Worst case is very bad:
All n keys hash to the same slot.
Worst case for searching becomes Θ(n) plus time to compute hash function.
We could have just used a linked list directly!
Average case:
Assuming any given element is equally likely to hash into any slot...
We get average case Θ(1 + α) time.
Unsuccessful search: the average chain length will be α. Thus, after finding the right slot with a hash function that runs in O(1) time, we will search α expected elements before giving up, giving us the above run time.
Successful search: The probability that a list will be searched is proportional to the number of items it contains. Nevertheless, the expected number of items examined is still Θ(1 + α).
This means that if our table size is roughly proportional to n, then we have n = O(m), and α = n / m, and so α = O(m) / m, and so α = O(1). And thus the whole search is O(1).
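
To make the load factor concrete, here is a small simulation sketch. It assumes simple uniform hashing by drawing a random slot for each of n elements; the function name chain_lengths, the seed, and the sizes n = 1000 and m = 500 are arbitrary choices for illustration.

```python
import random


def chain_lengths(n, m, seed=0):
    """Throw n elements into m slots uniformly at random; return each chain's length."""
    rng = random.Random(seed)
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths


n, m = 1000, 500
lengths = chain_lengths(n, m)
alpha = n / m                   # load factor
average = sum(lengths) / m      # average chain length, always equal to alpha
longest = max(lengths)          # an individual chain can be longer than alpha
```

The average chain length equals α by definition. What the uniformity assumption buys us is that no single chain is expected to be much longer than that.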

11.3 Hash functions

First, convert key to an integer.
E.g., we can interpret characters in a string by their ASCII values.
Then treat each value as a digit in a radix-128 integer.
Keys could be many other things besides ordinary strings.
E.g., genomes:
Multiplication method:
h(k) = ⌊m (k A mod 1)⌋, where 0 < A < 1.
Lots of special considerations on the best values for A: we have a suggestion that it should be about (√5 − 1) / 2, or 0.6180339887...
Division method:
h(k) = k mod P, where P is a suitably-chosen prime number.
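
The steps above can be sketched in Python as follows. The function names are hypothetical, and the multiplication method here uses floating-point arithmetic, which is an assumption that only holds up for keys small enough to keep precision.

```python
import math


def key_to_int(s, radix=128):
    """Interpret the characters of s as digits of a radix-128 integer."""
    k = 0
    for ch in s:
        k = k * radix + ord(ch)
    return k


def hash_division(k, p):
    """Division method: p should be a suitably chosen prime."""
    return k % p


A = (math.sqrt(5) - 1) / 2  # the suggested constant, about 0.6180339887


def hash_multiplication(k, m, a=A):
    """Multiplication method: floor(m * fractional part of k * a).

    Floating-point sketch: precision degrades for very large k.
    """
    return math.floor(m * ((k * a) % 1))
```

For example, hash_multiplication(123456, 2**14) takes the fractional part of 123456 * A, which is about 0.0041, and scales it into a table of 16384 slots.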

Choosing the right m for the division method

Consider the following hashing scheme:
h(k) = k mod m
m = 7
We convert a string into a hashable key by treating it as a base-8 number.
So 'abc', where a = 1, b = 2, and c = 3, is converted to a key as follows: 1 * 8² + 2 * 8 + 3 = 83.
In this hashing scheme, what do the strings 'cba' and 'bac' hash to?
Can you write a more general statement about a pattern we can detect here? Something along the lines of, "If the table size is 2^P - 1, and strings are interpreted in radix 2^P..."

Answer:
If h(k) = k mod m, where m = 2^P − 1, and k is a character string interpreted in radix 2^P, then all permutations of a given string will hash to the same value. So in the example above, 'abc', 'cba', and 'bac' all hash to the same value.
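
A quick check of this claim for the example scheme, as a sketch in Python. The helper names are hypothetical; the letter values a = 1, b = 2, c = 3 follow the example above.

```python
from itertools import permutations


def to_key(s, radix=8):
    """Interpret s as a base-8 number, with a = 1, b = 2, c = 3, ..."""
    k = 0
    for ch in s:
        k = k * radix + (ord(ch) - ord("a") + 1)
    return k


def h(k, m=7):
    return k % m


# Hash every ordering of the letters a, b, c.
hashes = {"".join(p): h(to_key("".join(p))) for p in permutations("abc")}
distinct_values = set(hashes.values())
```

All six orderings produce the same slot, so a table of size 7 cannot tell 'abc' from 'cba' or 'bac'.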

Proof:
Assumed (could be proven, but we won't do it here):

1. (x + y) mod z == (x mod z + y mod z) mod z
Example: (10 + 12) mod 7 == (10 mod 7 + 12 mod 7) mod 7
2. (x * y) mod z == ((x mod z) * (y mod z)) mod z
Example: (10 * 12) mod 7 == ((10 mod 7) * (12 mod 7)) mod 7
(Both sides equal 1, since 10 * 12 = 120 = 7 * 17 + 1 and 3 * 5 = 15 = 7 * 2 + 1.)
3. If x mod z == 1, then xⁿ mod z == 1
Example: 8 mod 7 == 1, and 8² mod 7 == 1

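So, writing the key as k = Σ d_i (2^P)^i, where d_i is the value of the character in position i, and taking m = 2^P − 1, we have:

  k mod m = (Σ (d_i (2^P)^i mod m)) mod m  (By 1)

  = (Σ ((d_i mod m) * ((2^P)^i mod m) mod m)) mod m  (By 2)

  = (Σ (d_i mod m)) mod m  (By 3, since 2^P mod (2^P − 1) == 1)

The final sum does not depend on the order of the characters, so every permutation of the string hashes to the same value.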

Universal hashing

Establish a family of hash functions.
Choose so that Prob[h(x) = h(y)] ≤ 1/m for any two distinct keys x and y, where m is the size of our hash table.
In other words, the hash functions have no more chance of collision than simply randomly choosing two slots between 1 and m.
Choose one at random each execution.
Tricky: what if we store hash values?
Good average case behavior
If a "bad" function handles some data once, a "good" one will handle it another time.
So a "bad" set of programming variable names one run will turn into a good set the next run.
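
As a sketch of what such a family can look like, the code below uses the well-known family h(k) = ((a k + b) mod p) mod m, with p a prime larger than any key, 1 ≤ a < p and 0 ≤ b < p. These notes do not name a specific family, so this choice and the function names are assumptions for illustration.

```python
import random


def make_universal_hash(m, p, rng=random):
    """Pick one function at random from the family ((a*k + b) mod p) mod m."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % m

    return h


def collision_count(x, y, m, p):
    """Count the functions in the family that send x and y to the same slot."""
    collisions = 0
    total = 0
    for a in range(1, p):
        for b in range(p):
            total += 1
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
                collisions += 1
    return collisions, total


h = make_universal_hash(m=6, p=17, rng=random.Random(42))
collisions, total = collision_count(8, 3, m=6, p=17)
```

Dividing collisions by total gives the probability that a randomly chosen member of the family collides on the keys 8 and 3. For a universal family that ratio is at most 1/m.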

11.4 Open addressing

All elements are stored directly in the table; no chaining.
Linear probing
Easy: just move along array indices!
Prone to clustering.
Why: once an area of the table begins to fill up, we are more likely to get collisions there.
Quadratic probing
Uses a hash function of the form:
h(k, i) = (h'(k) + c₁i + c₂i²) mod m
Prone to milder form of clustering.
Double hashing
Uses two hash functions to search array for key.
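Unsuccessful search: at most 1 / (1 - α) expected probes, assuming uniform hashing.
(Since at most one element can be in a slot, α ≤ 1; the bound requires α < 1.)
The expected number of probes is bounded by 1 + α + α² + α³ + α⁴ + ..., which sums to 1 / (1 - α).
Successful search: at most (1 / α) ln (1 / (1 - α)) expected probes.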
Source code here.
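
Independently of the linked source, here is a minimal sketch of open addressing in Python with two of the probe sequences above. The class and function names are hypothetical, and the second hash function 1 + (k mod (m − 1)) is one common choice, assumed here. Keys are integers, m is prime, and deletion is left out.

```python
def linear_probe(k, i, m):
    """Linear probing: start at k mod m and step one slot at a time."""
    return (k % m + i) % m


def double_hash_probe(k, i, m):
    """Double hashing: the step size comes from a second hash function."""
    return (k % m + i * (1 + k % (m - 1))) % m


class OpenAddressTable:
    """All keys live directly in the table; no chains."""

    def __init__(self, m=13, probe=linear_probe):
        self.m = m
        self.probe = probe
        self.slots = [None] * m

    def insert(self, key):
        for i in range(self.m):
            j = self.probe(key, i, self.m)
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for i in range(self.m):
            j = self.probe(key, i, self.m)
            if self.slots[j] is None:   # empty slot: key cannot be further along
                return None
            if self.slots[j] == key:
                return j
        return None


table = OpenAddressTable(13, double_hash_probe)
for key in [79, 69, 98, 72, 14, 50]:
    table.insert(key)
```

With linear probing, keys 1 and 14 both start at slot 1, so the second one lands in slot 2 and begins a cluster. With double hashing the two keys take different step sizes after the first collision.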

11.5 Perfect hashing

We can get even better performance with a fixed hash table -- think of reserved words in a programming language, or the index of a CD -- by perfect hashing.
We proceed as in hashing with chaining, but then, instead of a linked list, each hash slot j gets a secondary hash table of size m_j = n_j², where n_j is the number of elements that hash to slot j.
The probability of getting a collision is much like the birthday problem: when the table size is the square of the number of entries, the probability of any collision is < 1/2 for a hash function chosen at random from a universal family. So we can just try hash functions until we find one that produces no collisions.
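
The two-level scheme can be sketched in Python as below. The class name PerfectHashSet, the prime P = 2^31 − 1, and the hash family ((a k + b) mod P) mod m are assumptions made for illustration. Keys are assumed to be integers in [0, P), and the sketch does not check the total space used by the secondary tables.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Pick a random function ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


class PerfectHashSet:
    """Static set with collision-free lookup, built from a fixed list of keys."""

    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        keys = list(set(keys))
        self.m = max(1, len(keys))
        self.h = random_hash(self.m, rng)

        # First level: distribute the keys over m buckets, as in chaining.
        buckets = [[] for _ in range(self.m)]
        for k in keys:
            buckets[self.h(k)].append(k)

        # Second level: bucket j gets a table of size n_j squared.
        self.secondary = []
        for bucket in buckets:
            size = len(bucket) ** 2
            while True:  # retry hash functions until one has no collisions
                hj = random_hash(size, rng) if size else None
                slots = [None] * size
                collided = False
                for k in bucket:
                    i = hj(k)
                    if slots[i] is not None:
                        collided = True
                        break
                    slots[i] = k
                if not collided:
                    break
            self.secondary.append((hj, slots))

    def __contains__(self, k):
        hj, slots = self.secondary[self.h(k)]
        return bool(slots) and slots[hj(k)] == k


reserved = PerfectHashSet([10, 22, 37, 40, 52, 60, 70, 72, 75])
```

A lookup evaluates two hash functions and compares one slot, so it never walks a chain. Each retry loop succeeds with probability greater than 1/2, so only a few attempts per bucket are expected.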

Source Code

Java
Ruby
C++
Python