Chapter 8: Sampling Distributions

Sampling Distributions

Suppose that without doing any research or having any knowledge of the business, you naively decide to open a pizzeria. On the first day you sell 9 pizzas, and on the next three days you sell 4, 4, and 3 pizzas, respectively. After the fourth day, you are broke and forced to close the business.

Leaving the sad story aside, consider our population data: N=4 for the 4 days, with $x_1=9$, $x_2=4a$, $x_3=4b$, and $x_4=3$, where we denote the two fours as 4a and 4b to indicate that they come from different days.
Let us now list all possible samples of size 2 (with replacement), namely:
3-3, 3-4a, 3-4b, 3-9, 4a-3, 4a-4a, 4a-4b, 4a-9, 4b-3, 4b-4a, 4b-4b, 4b-9, 9-3, 9-4a, 9-4b, 9-9. For these 16 samples, let us consider the mean, median, standard deviation (sd), variance (var), and the proportion of odd numbers in the sample. Calculating these values for the 16 samples, we obtain:

[Figure: pizza_sample.tiff — the mean, median, sd, variance, and proportion of odd numbers for each of the 16 samples. An Excel file with these calculations accompanies the original page.]
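Since the table itself may not display here, the short Python sketch below (an addition for illustration) reproduces the calculation: it enumerates the 16 ordered samples of size 2 drawn with replacement and prints the five statistics. The sd and variance are computed as the sample (n-1) versions, which is an assumption, since the original table is not shown.

import itertools
import statistics

population = [3, 4, 4, 9]  # the four daily sales; the two 4s play the roles of 4a and 4b

# all 4 x 4 = 16 ordered samples of size 2, drawn with replacement
for sample in itertools.product(population, repeat=2):
    p_odd = sum(x % 2 == 1 for x in sample) / len(sample)
    print(sample,
          "mean =", statistics.mean(sample),
          "median =", statistics.median(sample),
          "sd = %.3f" % statistics.stdev(sample),      # sample sd (divides by n-1)
          "var = %.3f" % statistics.variance(sample),  # sample variance
          "p_odd =", p_odd)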

Central Limit Theorem

The central limit theorem (CLT) is a fundamental theorem in statistics: it is what makes inferential statistics possible. Suppose that we have a random procedure described by a random variable with a given distribution, a mean $\mu$, and a standard deviation $\sigma$. Assume also that the experiment can be repeated as many times as desired, so that you obtain random variables

(1)
\begin{align} X_1,X_2, X_3,\dots, X_n,\dots \end{align}

which we call i.i.d. r.v.'s, that is, independent, identically distributed random variables.

Then the sample average is given by the sum of the first $n$ such random variables divided by $n$:

(2)
\begin{align} \overline{X}_n = \frac{\sum_{i=1}^n X_i}{n} \end{align}

The CLT states that as $n\longrightarrow\infty$ (becomes large), the distribution of the sample average $\overline{X}_n$ becomes closer and closer to a normal distribution with mean $\mu$ and diminishing standard deviation $\sigma/\sqrt{n}$.

This can be denoted as follows:

(3)
\begin{align} \frac{\overline{X}_n -\mu}{\sigma/\sqrt{n}}\longrightarrow Z \sim N(0,1) \end{align}

The importance of this theorem cannot be stressed enough; suffice it to say that the result holds regardless of the underlying distribution of the random variable $X$ (provided its mean and standard deviation are finite).

Notice that the mean remains unchanged, while as $n$ increases the standard deviation of $\overline{X}_n$ diminishes as $1/\sqrt{n}$.
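As a concrete check (a worked example added here), take the pizza data from the start of the chapter. The population of the four daily sales has

\begin{align} \mu = \frac{9+4+4+3}{4} = 5, \qquad \sigma = \sqrt{\frac{(9-5)^2+(4-5)^2+(4-5)^2+(3-5)^2}{4}} = \sqrt{5.5} \approx 2.345 \end{align}

so the 16 sample means of size $n=2$ have mean 5 and standard deviation

\begin{align} \frac{\sigma}{\sqrt{2}} = \sqrt{2.75} \approx 1.658, \end{align}

which you can verify directly from the table of samples above. (For sampling with replacement, this $\sigma/\sqrt{n}$ relation holds exactly for every $n$, not only in the limit.)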

As the graph below shows, this means that the probability mass becomes concentrated around the mean.

[Figure: sampling_distros.png — three normal curves illustrating how the sampling distribution narrows as the sample size grows.]

In the graph above, if the blue curve represents the original sample size $n$, then the red curve corresponds to $2n$ and the yellow one to $4n$, i.e., a sample 4 times the original size.
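If the figure does not display, the following Python sketch reproduces it; the baseline sample size $n=2$ and the pizza-data values $\mu=5$, $\sigma=\sqrt{5.5}$ are illustrative choices, not part of the original figure.

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 5.0, 5.5 ** 0.5      # population mean and sd (pizza data, for illustration)
n = 2                            # baseline sample size (hypothetical choice)

def normal_pdf(x, m, s):
    # density of N(m, s^2)
    return np.exp(-((x - m) ** 2) / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
for k, color in [(1, "blue"), (2, "red"), (4, "yellow")]:
    s = sigma / np.sqrt(k * n)   # sd of the sample mean for sample size k*n
    plt.plot(x, normal_pdf(x, mu, s), color=color, label=f"sample size {k}n")
plt.axvline(mu, linestyle="--", color="gray")  # the mean stays fixed at mu
plt.legend()
plt.show()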

An excellent way to see this theorem in action is the applet at http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/

Click on begin in the top right of the site to get the Java applet running. Set the distribution at the top to be uniform U(0,32), and in the 3rd and 4th panels choose the mean with N=5 and N=20, respectively. Click the normal fit button in both panels. Click on animate in the second panel once or twice, then on the repetition counts 5, 1,000, and 10,000, and observe how the distribution of the mean $\overline{X}_n$ becomes closer and closer to a normal distribution. Also notice that, as the number of repetitions increases, the standard deviation for n=5 settles at about twice that for n=20, exactly as $\sqrt{20/5}=2$ predicts.
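If the Java applet no longer runs (modern browsers have dropped Java support), a short simulation reproduces the same experiment; the settings below mirror the ones above, with a uniform U(0,32) population and 10,000 repetitions:

import numpy as np

rng = np.random.default_rng(0)
reps = 10_000                              # repetitions, as in the applet

for n in (5, 20):
    # draw `reps` samples of size n from U(0, 32) and average each one
    means = rng.uniform(0, 32, size=(reps, n)).mean(axis=1)
    print(f"n={n:2d}: mean of sample means = {means.mean():.3f}, "
          f"sd = {means.std():.3f}, theory sigma/sqrt(n) = {32 / np.sqrt(12 * n):.3f}")

The printed sd for n=5 comes out at roughly twice the one for n=20, matching the $1/\sqrt{n}$ behavior, while both averages of the sample means sit near the population mean of 16.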

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License