Sampling

A Population

When we sample, we are trying to determine something about a population from a subset of that population. But before we can determine that, we need to know what population we are dealing with. In terms of set theory, we need to know what universe we are dealing with.

Examples of populations:

All male smokers over 50.
The raccoons of North America.
Apple iPhones being produced in China.
Likely voters in the next U.S. presidential election.
Division I football players.
Stocks whose price has dropped more than 50% in the last month.
The hemlock trees of Pennsylvania.
Students at St. Joseph's College.

It may not be easy to determine whether a particular entity is in the population. Is someone who smokes one cigarette a month when he is out drinking a "smoker"? Is a raccoon that lives at the border of Panama and Colombia a "North American raccoon"? Is a phone with 30% of its components from China "produced in China"? Is someone who claims they are going to vote, but hasn't voted in 20 years, really a "likely" voter? Is someone taking only one class every year or two a "student at St. Joseph's College"?

Note that these decisions can be made in a biased way: if we want to exaggerate the dangers of smoking, we could count as "smokers" only people who smoke over two packs a day. On the other hand, if we want to minimize the dangers, we could include anyone who has smoked even a single cigarette in the last several decades.
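
The effect of the membership rule can be sketched with a small simulation. Everything in it is an assumption made for illustration: the data are synthetic, the names make_people and illness_rate are hypothetical, and the link between cigarettes per day and illness is invented rather than taken from any study. The only point is that the same people yield different answers depending on who is counted as a "smoker".

```python
import random

def make_people(n, seed=0):
    """Build a synthetic list of (cigarettes_per_day, is_ill) pairs.

    The habit levels and the risk formula are invented for illustration.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        cigs = rng.choice([0, 0, 0, 1, 5, 20, 40, 60])
        # Hypothetical assumption: risk of illness rises with consumption.
        is_ill = rng.random() < 0.05 + 0.005 * cigs
        people.append((cigs, is_ill))
    return people

def illness_rate(people, min_cigs_per_day):
    """Illness rate among those counted as smokers under the given cutoff."""
    group = [is_ill for cigs, is_ill in people if cigs >= min_cigs_per_day]
    return sum(group) / len(group)

people = make_people(20_000)

# Strict definition: at least two packs (40 cigarettes) a day.
strict = illness_rate(people, 40)
# Loose definition: anyone who smokes at all.
loose = illness_rate(people, 1)

print(f"illness rate, strict definition: {strict:.3f}")
print(f"illness rate, loose definition:  {loose:.3f}")
```

Under these assumptions the strict definition reports a noticeably higher illness rate among "smokers" than the loose one, even though nothing about the underlying people has changed.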

A Sample

At first, it might seem plausible that if we want to learn something about a population from a subset of that population, we should carefully construct that subset to closely mirror the actual population. So if, for instance, we want to sample the American electorate about an upcoming election, we might decide, "Well, we should construct our sample to include 45% Democratic voters, 40% Republican voters, 10% Libertarian voters, and 5% Green Party voters."

This approach is seriously wrong, as it begs the question of what the population is actually like. If we already know the composition of the population, then we do not need to sample. We could simply declare that "The vote will be 45% Democratic, 40% Republican, 10% Libertarian, and 5% Green." The only reason that we are sampling is that we do not know how the population as a whole will vote, and we are hoping that our sample will help us to understand how it will.

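Perhaps surprisingly, the best way to learn a population's characteristics from a sample is to make the sample as random as we can. But even that is fraught with difficulties: we must sample by some means, and the means itself may bias our sample. The classic example is the 1936 Literary Digest poll, which predicted that Alf Landon would defeat Franklin Roosevelt in the U.S. presidential election. The magazine drew its sample largely from telephone directories and automobile registrations, which over-represented wealthier voters, and Roosevelt won in a landslide.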
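
The difference between a random sample and a sample skewed by the means of collection can be sketched in a short simulation. The population here is synthetic, and every number in it is an assumption made for illustration: 40% of voters own a phone, and phone owners support a hypothetical candidate A less often than everyone else. The names make_population and share_for_a are likewise hypothetical.

```python
import random

def make_population(n, seed=0):
    """Build a synthetic electorate of (owns_phone, votes_for_a) pairs."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        owns_phone = rng.random() < 0.4
        # Hypothetical assumption: phone owners favor candidate A less often.
        p_a = 0.35 if owns_phone else 0.65
        votes_for_a = rng.random() < p_a
        population.append((owns_phone, votes_for_a))
    return population

def share_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(votes for _, votes in voters) / len(voters)

population = make_population(100_000)
rng = random.Random(1)

# Simple random sample: every voter is equally likely to be chosen.
random_sample = rng.sample(population, 1000)

# Sample drawn only from phone owners, as a phone survey would do.
phone_owners = [voter for voter in population if voter[0]]
phone_sample = rng.sample(phone_owners, 1000)

true_share = share_for_a(population)
random_estimate = share_for_a(random_sample)
phone_estimate = share_for_a(phone_sample)

print(f"true share for A:       {true_share:.3f}")
print(f"random-sample estimate: {random_estimate:.3f}")
print(f"phone-sample estimate:  {phone_estimate:.3f}")
```

Under these assumptions the random sample lands close to the true share, while the phone-only sample misses it by a wide margin no matter how many phone owners are polled, because the error comes from who can be reached rather than from the size of the sample.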
