Discrete Probability

An Introduction to Discrete Probability
Finite Probability

An experiment is a procedure that yields one of a given set of possible outcomes. The sample space of the experiment is the set of possible outcomes. An event is a subset of the sample space.


If S is a finite nonempty sample space of equally likely outcomes, and E is an event, that is, a subset of S, then the probability of E is p(E) = |E| / |S|

Probabilities of Complements and Unions of Events

Let E be an event in a sample space S. The probability of the event E = SE, the complementary event to E, is given by p(E) = 1 − p(E).


What is the probability that when two dice are rolled, the sum of the numbers on the two dice is 7?


There are a total of 36 equally likely possible outcomes when two dice are rolled.There are six successful outcomes, namely, (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1), where the values of the first and second dice are represented by an ordered pair. Hence, the probability that a seven comes
up when two fair dice are rolled is 6/36 = 1/6.


Find the probability that a hand of five cards in poker contains four cards of one kind


By the product rule, the number of hands of five cards with four cards of one kind is the product of the number of ways to pick one kind, the number of ways to pick the four of this kind out of the four in the deck of this kind,and the number of ways to pick the fifth card.This is C(13, 1) C(4, 4) C(48, 1).By Example 11 in Section 6.3 there are C(52, 5) different hands of five cards. Hence, the probability that a hand contains four cards of one kind is (C(13, 1) C(4, 4) C(48, 1))/ C(52, 5) = (13 · 1 · 48) / 2,598,960 ≈ 0.00024.


Let E1 and E2 be events in the sample space S. Then p(E1E2) = p(E1) + p(E2) − p(E1E2).


What is the probability that a positive integer selected at random from the set of positive integers not exceeding 100 is divisible by either 2 or 5?


Let E1 be the event that the integer selected at random is divisible by 2, and let E2 be the event that
it is divisible by 5.

Then E1E2 is the event that it is divisible by either 2 or 5.

Also, E1 ∩ E2 is the event that it is divisible by both 2 and 5,
or equivalently, that it is divisible by 10.

Because |E1| = 50, |E2| = 20, and |E1 ∩ E2| = 10, it follows that p(E1 ∪ E2) = p(E1) + p(E2) − p(E1 ∩ E2)= 50/100 + 20/100 - 10/100 = 3/5.

Probabilistic Reasoning

Probabilistic reasoning is using logic and probability to handle uncertain situations.

An example of probabilistic reasoning is using past situations and statistics to predict an outcome.

It is used to define which of two events is more likely.Analyzing the probabilities of such events can be tricky. The below example describes a problem of this type.

The Monty Hall Problem Video
Probability Theory Overview

We defined probability of an Event E as the number of outcomes in E divided by total number of outcomes.This definition assumes that outcomes are equally likely.

However, many experiments have outcomes that are not equally likely. For instance, a coin may be biased so that it comes up heads twice as often as tails.

In this section we will study how to define probabilities where outcomes are not equally likely.

Assigning Probabilities

Let S be the sample space of an experiment with a finite or countable number of outcomes. We assign a probability p(s) to each outcome s.

We require that two conditions be met:

  1. 0 ≤ p(s) ≤ 1 for each sS
  2. sS p(s) = 1

The function p from the set of all outcomes of the sample space S is called a probability distribution.

The probability p(s) assigned to an outcome s should equal the limit of the number of times s occurs divided by the number of times the experiment is performed, as this number grows without bound.


Consider a set S with n elements, then, the probability of each element of S considering uniform distribution is 1/n.

Probabilities of Complements and Unions of Events

p(E) = 1 - p(E)


The probability of any event composed of a sequence of disjoint events is just the sum of the individual probabilities.


If E and F are events in the sample space S, then:
p(EF) = p(E) + p(F) − p(EF)

Conditional Probability


The conditional probability of event E given event F has occurred is defined as:

When the shark bites, with his teeth big...

Let's analyze with our shark example:
E = "bitten by shark" (imagine it's .0005)
F = "shark present" (imagine it's .001)
And EF = .0005
Therefore, p(E | F) = .5
Or, "Half the time, if there is a shark around, it bites someone."



Events E and F are independent if and only if p(EF) = p(E)p(F)

Notice how this was not the case with the shark attacks! Given a shark is around, the probability of a shark bite is much higher than p(E)p(F) (which was .0000005). This would not be true if F were, say, "UFO present."

Pairwise independence versus mutual independence:
The best way to understand this difference is to consider a situation like this, with two coin tosses:
Event A is both tosses are the same.
Event B is the first toss is heads.
Event C is the second toss is heads.
Notice that if B and C both occur, or both don't occur, A occured for sure. And if only one of B and C occured, A didn't occur for sure.
Nevertheless, any two pairs of these events, taken on their own, are independent.

Bernoulli Trials and the Binomial Distribution

Bernoulli trial: an experiment with only two possible outcomes. (The patient lives, or the patient dies.)
p + q = 1

Many problems can be solved by determining the probability of k successes when an experiment consists of n mutually independent Bernoulli trials. (Bernoulli trials are mutually independent if the conditional probability of success on any given trial is p, given any information whatsoever about the outcomes of the other trials.)


The probability of exactly k successes in n independent Bernoulli trials, with probability of success p and probability of failure q = 1 − p, is C(n, k)pkqnk

Example: Suppose that the probability that a 0 bit is generated is 0.9, that the probability that a 1 bit is generated is 0.1, and that bits are generated independently. What is the probability that exactly eight 0 bits are generated when 10 bits are generated?
Answer: C(10, 8)(0.9)8(0.1)2 = 0.1937102445

Random Variables


A random variable is a function from the sample space of an experiment to the set of real numbers. A random variable assigns a real number to each possible outcome.
Example: Let X(t) be the number of heads that appear in three coin tosses. Then, X(HHH) = 3, X(HHT) = 2, and so on.


The distribution of a random variable is the complete set of probabilities for each possible value of the variable.
Example: In the above case of three coin tosses, P(X = 3) = 1/8, P(X = 2) = 3/8, and so on.
We will study how to use random variables in section 7.4.

The Birthday Problem

How many people have to enter a party before it is likely (p > .5) that two of them have the same birthday?
Note that this resembles a pigeonhole problem, but with an important difference: we are not asking how many do we need to be certain two have the same birthday (answer: 367), but how many do we need for it to be likely?

We are going to solve this by solving the complement of the problem: we will solve for pn, the probability that they all have different birthdays, and then the probability two will have the same birthday will be 1 − pn. The first person obviously will not have "the same" birthday as anyone else. The probability the second person will have a different birthday is 365 / 366. For the third (assuming no match yet!) it is 364 / 366. And so on.

So our question becomes, at what point does 1 − pn become greater than .5? And the answer is the surprisingly small number 23!
All of this would be just a curiosity, except it is the same question as "How likely is it our hashing function will produce a collision after hashing n items?"
In hashing, we are mapping n keys to m slots, where n is usually much greater than m. (For instance, possible student names to 100 slots in a class roster.) We want to know if students are randomly mapped to a slot with probability 1/m, how likely is it that two students will map to the same slot? (Too many collisions slow down our hashing algorithm.)

Monte Carlo Algorithms

These are examples of probabilistic algorithms. Most of the time, we study algorithms that produce a definite answer. But probabilistic algorithms produce correct answers only with some likelihood. A Monte Carlo Algorithm runs repeated trials of some process until the answer it gives is "close enough" to certainty for the purposes of the person or persons employing the algorithm.

Example: For an integer that is not prime, it will pass Miller's test for fewer than n/4 bases b with 1 < b < n. We can generate a large number and try, say, k iterations of the algorithm, and if it passes each iteration, the that it is not prime are (1/4)k. For 30 iterations, this will give us a probability that n is composite of less than 1/1018.

The Probabilistic Method



Bayes' Theorem
Bayesian Spam Filter
Expected Value and Variance
Expected Values


The expected value of the random variable X on the sample space S, is E(X) = Σp(s) X(s)
This is also called the mean of the random variable.
Example: The expected value of 1 toss of a fair die is 7/2.


We can also get the above by summing up the probability-weighted values for X being equal to each outcome.
Example: Use two coin tosses to illustrate this.


The expected number of successes in n Bernoulli trials where the probability of success in each trial is p = np.
Example: The expected number of sixes in 36 dice rolls (where six is success and all other results failure) is six. (1/6 * 36)

Linearity of Expectations


EΣ(X1 + X2 + ... + Xn = Σ( E(X1) + E(X2) + ... + E(Xn))

Expected value in the hatcheck problem: The hatcheck employee loses all tickets and gives the hats back randomly. What is the expected number of people who get back the correct hat?

Average-Case Computational Complexity Overview

Average case computational complexity of an algorithm is amount of computational resourse used by algorithm,averaged over all possible inputs.It can also be interpreted as computing the expected value of a random variable.

Let us consider aj, j=1,2,....k to be all the outcomes of an experiment and X be a random variable,that defines number of operations used by the algorithm,when aj is an input

Let x1,x2...xk be the respective random variables pj is the probability to each possible input .Then the average case complexity of the algorithm is

The Geometric Distribution

Geometric distribution defines random variables with infinitely many outcomes


A random variable X has a geometric distribution with parameter p if p(X = k) = (1 − p)k−1 p for k = 1, 2, 3, . . ., where p is a real number with 0 ≤p ≤ 1.


If the random variable X has the geometric distribution with parameter p, then E(X) = 1/p.

Independent Random Variables

Two random variables X and Y are independent if: p(X = r1 and Y = r2) = p(X = r1) * p(Y = r2)


The variance of a random variable defines how widely a random variable is distributed


Let X be a random variable on a sample space S. The variance of X, denoted by V(X), is V(X) =∑ s∈S (X(s) − E(X))2 p(s).

That is, V (X) is the weighted average of the square of the deviation of X. The standard deviation of X, denoted σ(X), is defined to be



If X is a random variable on a sample space S, then V(X)= E(X2) − E(X)2.


If X is a random variable, V(X) = E((Xμ)2)


BIENAYMÉ’S FORMULA If X and Y are two independent random variables on a sample space S, then V (X + Y) = V (X) + V (Y). Furthermore, if Xi, i = 1, 2, . . . , n, with n a positive integer, are pairwise independent random variables on S, then V (X1 + X2 +· · ·+Xn) = V (X1) + V (X2)+· · ·+V (Xn).

Chebyshev's Inequality

It defines the likeliness of a random variable to take a value far from it's expected value


Let X be a random variable on a sample space S with probability function p. If r is a positive real number, then p(|X(s) − E(X)| ≥ r) ≤ V (X)/r2.