Putting a proposition before a large population of voters can be expensive, so an organization wishing to do so would like reasonable assurance that a given proposition will pass. One approach is to survey a randomly chosen subset of voters and use the results to estimate the proposition's chances among the general population. The larger the survey and the larger the margin by which the proposition passes in the survey, the greater the chance that the proposition would pass in a general vote.
What the polling organization needs to know is how large a survey is required. The computation, as shown below, requires two inputs: the acceptable risk that the general vote fails and an estimate of the voter sentiment. The only assumption made is that the size of the general population is infinite.
The symbols will be defined as follows:
- s – number of people in the survey
- y – the number of people voting yes in the survey
- p – the intrinsic probability that any given person votes yes
- δ – the risk; i.e., the probability that, given the poll, the general vote still fails
Intuitively we expect y to be approximately p·s. The fundamental issue is that, for any given survey, y is likely to be only near this value. The range of possible y's needs to be taken into account in computing the risk, δ. The probability of the survey resulting in a particular y given s and p is written as P(y|s,p). The process of taking a random sampling of yes/no votes from a population results in the binomial distribution,
$$P(y \mid s,p) = \binom{s}{y}\, p^{y} (1-p)^{s-y} \tag{1}$$
After the survey is taken, the organization will know y, but p is still unknown. We must use Bayes' Rule,
$$P(p \mid s,y) = \frac{P(y \mid s,p)\,P(p)}{P(y)} \tag{2}$$
to compute P(p|s,y). In order to do this, we must assume a Bayesian prior, either an a priori P(p) or P(y). In either case the only reasonable assumption is a uniform (constant) distribution, either P(p) = 1, implying that all opinions on the vote are equally likely, or P(y) = 1/(s+1), implying that all possible voting results are equally probable. Actually either of these assumptions implies the other. We then have
$$P(p \mid s,y) = (s+1)\binom{s}{y}\, p^{y} (1-p)^{s-y} \tag{3}$$
This distribution is known as the Beta distribution, Beta(α,β), where α=y+1 and β=s–y+1.
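As a quick sanity check, the posterior can be evaluated numerically. The sketch below assumes the posterior density is (s+1)·C(s,y)·p^y·(1-p)^(s-y), that is, the Beta(y+1, s-y+1) density written with a binomial coefficient. The function names and the example survey (60 yes votes out of 100) are illustrative assumptions, not part of the original derivation.

```python
from math import comb


def posterior_density(p: float, s: int, y: int) -> float:
    """Density of p after y yes votes out of s, assuming a uniform prior."""
    return (s + 1) * comb(s, y) * p**y * (1.0 - p) ** (s - y)


def integrate(f, a: float, b: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


s, y = 100, 60  # hypothetical survey result
total = integrate(lambda p: posterior_density(p, s, y), 0.0, 1.0)
below_half = integrate(lambda p: posterior_density(p, s, y), 0.0, 0.5)

print(f"total probability ~ {total:.6f}")      # should be very close to 1
print(f"P(p < 1/2 | s, y) ~ {below_half:.4f}")  # posterior mass below one half
```

The midpoint rule is used only to keep the example free of third-party dependencies; any numerical integrator would serve.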
Now compute the risk of a failed vote. Since the population is infinite, a vote fails whenever p is less than ½, so the risk is the posterior probability that p lies below ½. The integral of the posterior from 0 up to a given probability is the cumulative distribution function,
$$F(p) = \int_0^{p} P(t \mid s,y)\,dt \tag{4}$$
In this case, this integral is the incomplete Beta function, or
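$$F(p) = (s+1)\binom{s}{y}\int_0^{p} t^{y}(1-t)^{s-y}\,dt = (s+1)\binom{s}{y}\,B(p;\,y+1,\,s-y+1) \tag{5}$$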
The desire is that the CDF be less than some small number δ for p = ½, or
$$\delta \ge (s+1)\binom{s}{y}\,B\!\left(\tfrac12;\,y+1,\,s-y+1\right) \tag{6}$$
Since s and y are integers, the Beta function can be used to replace the first two terms, resulting in
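$$\delta \ge \frac{B\!\left(\tfrac12;\,y+1,\,s-y+1\right)}{B(y+1,\,s-y+1)} = I_{1/2}(y+1,\,s-y+1) \tag{7}$$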
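This expression is the regularized incomplete beta function, I_x(a,b), which happens to have a series expansion that terminates when a and b are integers,
$$I_x(a,b) = \sum_{j=a}^{a+b-1} \binom{a+b-1}{j}\, x^{j} (1-x)^{a+b-1-j} \tag{8}$$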
Applying this to δ gives
$$\delta \ge \frac{1}{2^{s+1}} \sum_{j=y+1}^{s+1} \binom{s+1}{j} \tag{9}$$
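A minimal sketch of this sum, assuming the applied expansion takes the form 2^-(s+1) · Σ C(s+1, j) for j running from y+1 to s+1 (which is what the series gives with x = ½, a = y+1 and b = s-y+1). The function name `risk_tail_sum` is hypothetical; exact rational arithmetic is used so that no rounding enters the check.

```python
from fractions import Fraction
from math import comb


def risk_tail_sum(s: int, y: int) -> Fraction:
    """Exact value of 2^-(s+1) * sum_{j=y+1}^{s+1} C(s+1, j)."""
    if not 0 <= y <= s:
        raise ValueError("need 0 <= y <= s")
    n = s + 1
    tail = sum(comb(n, j) for j in range(y + 1, n + 1))
    return Fraction(tail, 2**n)


# With no data at all the prior is uniform, so the risk is one half.
print(risk_tail_sum(0, 0))
# One voter surveyed, one yes: the posterior density is 2p, so P(p < 1/2) = 1/4.
print(risk_tail_sum(1, 1))
# Hypothetical larger survey: 60 yes votes out of 100.
print(float(risk_tail_sum(100, 60)))
```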
This expression could potentially have a very large number of terms for large s. Using the identity
$$\sum_{j=0}^{n} \binom{n}{j} = 2^{n} \tag{10}$$
the above expression can be rearranged to arrive at
$$\delta \ge 1 - \frac{1}{2^{s+1}} \sum_{j=0}^{y} \binom{s+1}{j} \tag{11}$$
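If the sample has more yes than no votes, then y > s-y, and the symmetry of the binomial coefficients, C(s+1, j) = C(s+1, s+1-j), makes the two outer tails of the sum equal, so
$$\delta \ge \frac12 - \frac{1}{2^{s+2}} \sum_{j=s-y+1}^{y} \binom{s+1}{j} \tag{12}$$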
Although this expression has fewer terms for close votes, it does involve a subtraction of two close terms for small δ.
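The trade-off can be seen in a small experiment. The sketch below assumes the two forms are the upper-tail sum 2^-(s+1) · Σ_{j=y+1}^{s+1} C(s+1, j) and the central sum ½ - 2^-(s+2) · Σ_{j=s-y+1}^{y} C(s+1, j). All function names and the example numbers are illustrative. In exact arithmetic the two agree; in ordinary floating point the central form loses a very small δ entirely.

```python
from fractions import Fraction
from math import comb


def risk_tail_sum(s: int, y: int) -> Fraction:
    """Upper-tail form: 2^-(s+1) * sum_{j=y+1}^{s+1} C(s+1, j)."""
    n = s + 1
    return Fraction(sum(comb(n, j) for j in range(y + 1, n + 1)), 2**n)


def risk_central_sum(s: int, y: int) -> Fraction:
    """Central form: 1/2 - 2^-(s+2) * sum_{j=s-y+1}^{y} C(s+1, j)."""
    if not (0 <= y <= s and 2 * y > s):
        raise ValueError("requires a yes majority: s/2 < y <= s")
    n = s + 1
    central = sum(comb(n, j) for j in range(s - y + 1, y + 1))
    return Fraction(1, 2) - Fraction(central, 2 ** (n + 1))


def risk_central_sum_float(s: int, y: int) -> float:
    """Same central form, but with the final subtraction done in floating point."""
    n = s + 1
    central = sum(comb(n, j) for j in range(s - y + 1, y + 1))
    return 0.5 - central / 2 ** (n + 1)


s, y = 1000, 700  # hypothetical lopsided survey, so the risk is tiny
exact = risk_central_sum(s, y)
print(exact == risk_tail_sum(s, y))   # the two forms agree exactly
print(float(exact))                   # tiny but positive
print(risk_central_sum_float(s, y))   # the subtraction cancels in floating point
```

For small δ, then, the central form is best evaluated with exact integers or rationals, as here, or the tail form should be used instead.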
So how does a polling organization use this to find the survey size? First, guess at p and call this guess g. Then choose an arbitrary starting s. Set y to g·s, rounded up. Then use equation (9) or (12) to find δ. If δ is smaller than the desired risk, reduce s; if δ is too large, increase s. Repeat until s is the smallest size that meets the desired risk.
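The procedure can be sketched in a few lines. Instead of adjusting a starting guess up and down, this version simply scans s upward from 1 and returns the first size that meets the target, which sidesteps the fact that rounding y up makes δ slightly non-monotonic in s. The risk is assumed to be the tail sum 2^-(s+1) · Σ_{j=y+1}^{s+1} C(s+1, j); the names `risk`, `smallest_survey_size` and the cap `s_max` are hypothetical. The guess g and the target risk are passed as exact fractions so that the rounding of g·s is not disturbed by floating-point error.

```python
from fractions import Fraction
from math import ceil, comb


def risk(s: int, y: int) -> Fraction:
    """Assumed risk formula: 2^-(s+1) * sum_{j=y+1}^{s+1} C(s+1, j)."""
    n = s + 1
    return Fraction(sum(comb(n, j) for j in range(y + 1, n + 1)), 2**n)


def smallest_survey_size(g: Fraction, max_risk: Fraction, s_max: int = 5_000) -> int:
    """First s (scanning upward) for which y = ceil(g*s) gives risk <= max_risk."""
    if not Fraction(1, 2) < g <= 1:
        raise ValueError("the guess g must lie in (1/2, 1]")
    for s in range(1, s_max + 1):
        y = ceil(g * s)  # expected yes count, rounded up
        if risk(s, y) <= max_risk:
            return s
    raise ValueError("no survey size up to s_max meets the requested risk")


# Hypothetical inputs: guess that 60% favor the proposition, accept a 5% risk.
g, max_risk = Fraction(3, 5), Fraction(1, 20)
s = smallest_survey_size(g, max_risk)
print(s, ceil(g * s), float(risk(s, ceil(g * s))))
```

A linear scan is adequate for guesses well above one half; as g approaches ½ the required s grows quickly, and a doubling search followed by a local scan would be a reasonable refinement.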
One important caveat is to consider what happens if the survey takes place and the yes votes are fewer than half of the total. In this case the implication is that the proposition would fail in the general vote. The polling organization might decide that g, the estimated p, was too small, and the survey failed through bad luck. What the organization cannot do is repeat the survey without using different formulas.
Repeating the same survey with the same or larger s, or adding more voters to the existing survey, would need to take into account that the first survey failed. This could be done, but the mathematical formulas would be different from the computations above.
This computation was suggested to me by David Chaum, who wanted to know if a closed-form solution existed. See rsvoting.org for more information.