Computing Survey Size

Putting a proposition before a large population of voters can be expensive, so an organization wishing to do so would like reasonable assurance that the proposition will pass. One approach is to survey a randomly chosen subset of voters and use the results to estimate the proposition’s chances in the general population. The larger the survey and the wider the margin by which the proposition passes in the survey, the more likely the proposition is to pass in a general vote.

In a previous article, the required formulas were derived for an infinite voter population.

The symbols are defined as follows:

  • s – number of people in the survey
  • y – the number of people voting yes in the survey
  • p – the intrinsic probability that any given person votes yes
  • δ – the risk; i.e., the probability that, given the poll, the general vote still fails
  • g – the polling organization’s estimate for p

The result is



or equivalently



Note that the terms in the sum of (2) are symmetric about (s+1)/2. We then get






This gives a simple way to compute y given δ and s. Successive terms are added to the sum until the desired risk δ is met.
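As a rough illustration of this accumulate-until-the-risk-is-met idea, here is a minimal Python sketch. It assumes the risk has the binomial-tail form δ(s,y) = 2^−(s+1) · Σ C(s+1,k) for k = y+1 … s+1, which is what a uniform prior on p yields and which has the symmetry about (s+1)/2 noted above; that form, and the function names `risk` and `min_yes`, are assumptions made for the example rather than something taken from the derivation.

```python
from fractions import Fraction
from math import comb
from typing import Optional


def risk(s: int, y: int) -> float:
    """Assumed risk: 2^-(s+1) * sum of C(s+1, k) for k = y+1 .. s+1."""
    n = s + 1
    tail = sum(comb(n, k) for k in range(y + 1, n + 1))
    return tail / 2**n


def min_yes(s: int, delta: float) -> Optional[int]:
    """Smallest y with risk(s, y) <= delta, or None if even y = s is too risky.

    Terms are added from the top of the sum (y = s, s-1, ...) until the
    running total would exceed the allowed risk.
    """
    n = s + 1
    limit = Fraction(delta) * 2**n  # exact comparison, avoids float overflow
    total = 0
    best = None
    for y in range(s, -1, -1):
        total += comb(n, y + 1)  # lowering y by one adds one more term
        if total > limit:
            break
        best = y
    return best


if __name__ == "__main__":
    print(risk(10, 8))        # risk with 8 of 10 voting yes
    print(min_yes(10, 0.05))  # fewest yes votes for a 5% risk
```

Under the assumed formula, a survey of 10 needs at least 8 yes votes to keep the risk below 5%, since 8 yes votes give a risk of 67/2048 (about 3.3%) while 7 give 232/2048 (about 11.3%).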

Equation (2) can be quickly computed with some rearrangement. By pulling out the common terms of the factorials, equation (2) becomes






The same type of rearrangement can be applied to equations (3) and (4). The collection of terms in front of the summation is computed using lgamma(), a commonly available library function that returns the natural logarithm of the gamma function; working with logarithms keeps the large factorials from overflowing.
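The sketch below shows one way such a rearrangement might look in Python. It assumes the risk is the binomial tail 2^−(s+1) · Σ C(s+1,k) for k = y+1 … s+1 (an assumption, consistent with a uniform prior on p), pulls the leading term C(s+1,y+1)/2^(s+1) out in front, evaluates it with `math.lgamma`, and builds the remaining sum from successive term ratios. The function name `risk_lgamma` is hypothetical.

```python
from math import exp, lgamma, log


def risk_lgamma(s: int, y: int) -> float:
    """Assumed risk 2^-(s+1) * sum_{k=y+1}^{s+1} C(s+1, k), factored form."""
    n = s + 1
    k0 = y + 1
    if k0 > n:
        return 0.0  # empty sum
    # log of the factored-out leading term C(n, k0) / 2^n
    log_lead = (lgamma(n + 1) - lgamma(k0 + 1) - lgamma(n - k0 + 1)
                - n * log(2.0))
    # Remaining bracket: 1 + (n-k0)/(k0+1) + (n-k0)(n-k0-1)/((k0+1)(k0+2)) + ...
    bracket = 0.0
    term = 1.0
    for k in range(k0, n + 1):
        bracket += term
        term *= (n - k) / (k + 1)  # ratio C(n, k+1) / C(n, k)
    # Combine in log space so a tiny leading term does not underflow early.
    return exp(log_lead + log(bracket))


if __name__ == "__main__":
    print(risk_lgamma(10, 8))
```

When y is above s/2 the ratios are all below 1, so the bracket converges quickly and the loop could be cut short once a term becomes negligible.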

At this point we can easily compute δ(s,y) and y(s,δ). The polling organization, on the other hand, is interested in g, essentially an a priori estimate of y/s. If the pollster has a do-not-exceed target risk, then the conservative approach is to assume that more votes are required rather than fewer. Thus the risk is computed as δ(s,g) = δ(s, y=⌈g·s⌉), where ⌈ ⌉ is the ceiling function, which rounds up to the next integer. Similarly, the worst-case estimate g for a given survey size and do-not-exceed risk is g(s,δ) = ⌈y(s,δ)⌉/s.
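A minimal Python sketch of these two conversions follows. It assumes the risk has the binomial-tail form 2^−(s+1) · Σ C(s+1,k) for k = y+1 … s+1 (an assumption, matching a uniform prior on p), and the names `risk`, `risk_for_estimate` and `worst_case_estimate` are hypothetical. Exact fractions are used for g and δ because a floating-point product such as 0.55 × 100 can land slightly above the intended integer and push the ceiling up by one.

```python
from fractions import Fraction
from math import ceil
from typing import Optional


def risk(s: int, y: int) -> Fraction:
    """Assumed risk: 2^-(s+1) * sum of C(s+1, k) for k = y+1 .. s+1."""
    n = s + 1
    total, term = 0, 1                  # term starts at C(n, n)
    for k in range(n, y, -1):           # k = n, n-1, ..., y+1
        total += term
        term = term * k // (n - k + 1)  # C(n, k-1) from C(n, k), exact
    return Fraction(total, 2**n)


def risk_for_estimate(s: int, g: Fraction) -> Fraction:
    """delta(s, g) = delta(s, y = ceil(g * s))."""
    return risk(s, ceil(g * s))


def worst_case_estimate(s: int, delta: Fraction) -> Optional[Fraction]:
    """g(s, delta): smallest yes fraction y/s with risk <= delta (s >= 1)."""
    if s < 1:
        raise ValueError("survey size must be at least 1")
    for y in range(s + 1):              # risk falls as y rises
        if risk(s, y) <= delta:
            return Fraction(y, s)       # y is already an integer here
    return None                         # target unreachable at this size


if __name__ == "__main__":
    print(risk_for_estimate(10, Fraction(3, 4)))     # uses y = ceil(7.5) = 8
    print(worst_case_estimate(10, Fraction(1, 20)))
```

Because y is found as an integer in this sketch, the ceiling in g(s,δ) has no effect; it would matter only if y(s,δ) came from a continuous approximation.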

The final computation is s(g,δ). This can be done two ways. The first is to find the smallest s such that δ(s,g) < δ. Note that, because s enters twice in this formula (once directly and once through ⌈g·s⌉), there is no guarantee that δ(s,g) is monotonic in s. An alternative is to find the largest t such that δ(t,g) > δ, and then use s=t+1, so that no survey of size s or larger exceeds the target risk.

The second method is to find the smallest s such that g(s,δ) < g. As before, since s enters twice, g(s,δ) may not be monotonic. Similarly, an alternative is to find the largest t such that g(t,δ) > g, and then use s=t+1.
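The first method and its alternative can be sketched as two linear scans in Python. As an assumption, the risk is taken to be the binomial tail 2^−(s+1) · Σ C(s+1,k) for k = y+1 … s+1, and the alternative is read as "one more than the largest size that still exceeds the target". The names `smallest_survey` and `safe_survey` and the search bound `s_max` are hypothetical.

```python
from fractions import Fraction
from math import ceil
from typing import Optional


def risk(s: int, y: int) -> Fraction:
    """Assumed risk: 2^-(s+1) * sum of C(s+1, k) for k = y+1 .. s+1."""
    n = s + 1
    total, term = 0, 1                  # term starts at C(n, n)
    for k in range(n, y, -1):
        total += term
        term = term * k // (n - k + 1)  # C(n, k-1) from C(n, k), exact
    return Fraction(total, 2**n)


def smallest_survey(g: Fraction, delta: Fraction, s_max: int = 500) -> Optional[int]:
    """First s in 1..s_max with delta(s, g) < delta."""
    for s in range(1, s_max + 1):
        if risk(s, ceil(g * s)) < delta:
            return s
    return None


def safe_survey(g: Fraction, delta: Fraction, s_max: int = 500) -> Optional[int]:
    """t + 1, where t is the largest size in 1..s_max with delta(t, g) > delta."""
    for t in range(s_max, 0, -1):
        if risk(t, ceil(g * t)) > delta:
            # If the bound itself exceeds the target, nothing can be certified.
            return t + 1 if t < s_max else None
    return 1


if __name__ == "__main__":
    g, delta = Fraction(3, 4), Fraction(1, 20)
    print(smallest_survey(g, delta), safe_survey(g, delta))
```

Under the assumed formula, g = 3/4 and δ = 5% show the non-monotonic behavior directly: the first scan stops at s = 7, but sizes 8 and 9 exceed the target again, so the second scan returns s = 10. Neither scan uses an initial approximation, so both are only practical for modest survey sizes.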

Since finding s is an iterative process, having an initial approximation for s(g,δ) would be helpful. Unfortunately, most of the approximations for the summation in equation (1) do not result in an equation that is readily invertible.

This computation was suggested to me by David Chaum, who wanted to know if a closed-form solution existed.
