## Statistical Significance of a Voter Survey

In a survey of 400 likely voters, 215 responded that they would vote for the incumbent and 185 responded that they would vote for the challenger. Let p denote the fraction of all likely voters who preferred the incumbent at the time of the survey, and let p̂ be the fraction of survey respondents who preferred the incumbent.

Give an estimate of p.

p̂ = the sample mean = 215 / 400 = 0.5375

This is also the estimate of the population mean, p.

Calculate the Standard Error (SE) of the estimate, p̂.

The responses are Bernoulli (each voter either prefers the incumbent or not), so the sample variance is:

sx2 = p̂ * (1 - p̂)

sx2 = 0.5375*(1-0.5375)

sx2 = 0.2486

So, the SE of the estimated mean is:

SE = sqrt(sx2 / n)

SE = sqrt(0.2486 / 400)

SE = 0.0249

What is the p-value for the test H0 : p = 0.5 vs. H1 : p != 0.5?

The first thing to note is that, because the alternative hypothesis is !=, this is a two-sided test.

Next, compute the t-statistic assuming the null hypothesis:

t = ((Sample Mean) – (Null Hypothesis Value)) / SE

t = (0.5375 – 0.5) / 0.0249

t = 1.506

The p-value for the two sided hypothesis is, therefore,

p = 2 * NORMDIST(-1.506, 0, 1, 1)

p = 0.1321

Note that we use the negative t-statistic in this calculation of area under the curve because we are interested in the area that is inconsistent with the null hypothesis.

What is the p-value for the test H0 : p = 0.5 vs. H1 : p > 0.5?

This time, we have a one-sided hypothesis test. So, the answer is half of what it was for the two sided test:

p = NORMDIST(-1.506, 0, 1, 1)

p = 0.066

The one-sided result differs from the two-sided result because in the one-sided test we only count evidence in one direction: sample fractions at least as far above 50% as the observed 53.75%. In the two-sided test, we also counted the opposite tail, sample fractions of 46.25% or below, since an outcome that extreme in either direction would be inconsistent with the null hypothesis that support is exactly 50%.

Do the survey results contain statistically significant evidence that the incumbent was ahead of the challenger at the time of the survey?

It depends on the chosen significance level. The one-sided p-value is 0.066, so at the 10% significance level there is statistically significant evidence that the incumbent is ahead, but at the conventional 5% level there is not.
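As a cross-check, the whole calculation above can be reproduced in Python rather than Excel; `NormalDist().cdf(z)` from the standard library plays the role of `NORMDIST(z, 0, 1, 1)`:

```python
from math import sqrt
from statistics import NormalDist

n = 400
p_hat = 215 / n                          # sample fraction, 0.5375
sample_var = p_hat * (1 - p_hat)         # Bernoulli sample variance
se = sqrt(sample_var / n)                # standard error of p_hat, ~0.0249
t = (p_hat - 0.5) / se                   # t-statistic under H0: p = 0.5

p_two_sided = 2 * NormalDist().cdf(-t)   # H1: p != 0.5
p_one_sided = NormalDist().cdf(-t)       # H1: p > 0.5
```

With the unrounded SE, the t-statistic is about 1.504 rather than 1.506, so the p-values come out a hair different from the rounded hand calculation (≈0.1325 two-sided, ≈0.066 one-sided).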

## Finding Probabilities Using the Central Limit Theorem – #2

In a population, μY = 100 and σY² = 43. In a random sample of size n = 64, what is Pr(101 < Ȳ < 103)?

The variance of the sample mean Ȳ is (σY² / n) = 43/64 = 0.671875

Therefore, the Standard Error (SE) = sqrt(0.671875) = 0.81968.

Normalizing this to a Standard Normal Distribution,

Z = ((103 – μY) / SE)

Z = ((103 – 100) / 0.81968) = 3.66

The Z value is the number of Standard Errors away from the mean that will yield the desired Ȳ value of 103.

This is a cumulative probability, since we are first finding the probability of Ȳ being < 103.

In EXCEL, the probability that Ȳ is < 103 is:

=NORMDIST(Z-Value, Mean of 0, Standard Deviation of 1, 1 for Cumulative)

=NORMDIST(3.66, 0, 1, 1)

=0.999874

Using the same logic for the probability of Ȳ being < 101,

Z = ((101 – μY) / SE)

Z = 1 / 0.81968 = 1.22

NORMDIST(1.22, 0, 1, 1) = 0.8888

The difference between the two, 0.999874 – 0.8888 = 0.1111, is the probability of Ȳ being between 101 and 103.
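The same arithmetic can be sketched in Python (standard library only), with `NormalDist().cdf` standing in for NORMDIST:

```python
from math import sqrt
from statistics import NormalDist

mu, var, n = 100, 43, 64
se = sqrt(var / n)                       # SE of the sample mean, ~0.8197
z_lo = (101 - mu) / se                   # ~1.22
z_hi = (103 - mu) / se                   # ~3.66

# Pr(101 < Ybar < 103): area between the two z-values
prob = NormalDist().cdf(z_hi) - NormalDist().cdf(z_lo)
```

`prob` comes out to about 0.111, matching the hand calculation.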

## Finding Probabilities Using the Central Limit Theorem

In a population, μY = 100 and σY² = 43. In a random sample of size n = 100, what is Pr(Ȳ < 101)?

The variance of the sample mean Ȳ is (σY² / n) = 43/100 = 0.43

Therefore, the Standard Error (SE) = sqrt(0.43) = 0.6557.

Normalizing this to a Standard Normal Distribution,

Z = ((101 – μY) / SE)

Z = ((101 – 100) / 0.6557) = 1.525

The Z value is the number of Standard Errors away from the mean that will yield the desired Ȳ value of 101.

This is a cumulative probability, since we are interested in the probability of Ȳ being < 101.

In EXCEL, the probability that Ȳ is < 101 is:

=NORMDIST(Z-Value, Mean of 0, Standard Deviation of 1, 1 for Cumulative)

=NORMDIST(1.525, 0, 1, 1)

=0.9364
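Since the CLT result is an approximation, it can also be sanity-checked by simulation. The population's shape is not specified in the problem, so the sketch below assumes a normal population with mean 100 and variance 43:

```python
import random
from math import sqrt

random.seed(0)                           # for reproducibility
mu, sigma, n, reps = 100, sqrt(43), 100, 10_000

hits = 0
for _ in range(reps):
    # draw one sample of size n and compute its mean
    ybar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if ybar < 101:
        hits += 1

prob = hits / reps                       # should land near 0.9364
```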

What is the difference between the sample average, Ȳ, and the population mean?

The sample average is the average of the samples taken from a population. The population mean is the average of the entire population. They are guaranteed to be the same if and only if the sample includes the entire population.

What is the difference between the estimator and the estimate?

The estimator is a function of the sample data, which is drawn randomly from the population; it is the rule used to make an educated guess about a population parameter such as the population mean. An example of an estimator is the function that produces Ȳ, the sample mean. For example, if n = 4 samples are taken from a population of 10 items, Ȳ = (1/4) * (X1 + X2 + X3 + X4), where X1 through X4 are the sample values. The population mean itself = (1/10) * (X1 + X2 + … + X10).

In this case, the estimator is the function (1/n) * (X1 + X2 + … + Xn), and the estimate is the particular value that function produces for a given sample.
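The distinction can be made concrete in code; the ten population values below are made up for illustration:

```python
def sample_mean(xs):
    """The estimator: a rule that maps any sample to a number."""
    return sum(xs) / len(xs)

population = [3, 7, 1, 9, 4, 6, 2, 8, 5, 10]
sample = [3, 7, 1, 9]                     # one particular random draw of n = 4

estimate = sample_mean(sample)            # the estimate produced by this sample
population_mean = sum(population) / len(population)   # the true parameter
```

A different draw of four values would give a different estimate, but the estimator (the function itself) stays the same.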

Assume that a population distribution has a mean of 10 and a variance of 16. Determine the mean and variance of Ȳ from an i.i.d. sample from this population for n = 10.

The mean of Ȳ is expected to be 10. The variance of Ȳ = (Population Variance / n) = 16/10 = 1.6. As the sample size n grows, the variance of the sample mean converges toward zero. For example, with n = 1000, the variance of Ȳ is 16/1000 = 0.016; by that point, we can be reasonably sure that the sample mean is very close to 10.

The fact that the variance of Ȳ approaches 0 as the sample size n grows, so that the sample mean approaches the population mean, is a consequence of the Law of Large Numbers.
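The shrinking variance is just σ²/n, which can be sketched directly using the numbers from the question above:

```python
pop_mean, pop_var = 10, 16

def var_of_sample_mean(n):
    """Variance of Ybar for an i.i.d. sample of size n: population variance / n."""
    return pop_var / n

for n in (10, 100, 1000):
    print(n, var_of_sample_mean(n))
```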

What role does the central limit theorem play in hypothesis testing in statistics? What role does it play in the construction of confidence intervals?

Due to the Central Limit Theorem, the distribution of the sample mean across repeated samples from a population approximates a Normal Distribution if the sample size is large enough. Because of this, confidence intervals can be constructed using standard errors around the sample mean: for example, (Sample Mean ± 1.96 × SE) gives an interval that contains the population mean with roughly 95% confidence.

How many samples are “enough?” It depends on how the population values are distributed, but a sample size of 30 is usually enough for the Normal approximation to be reasonable.
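Putting the pieces together, a 95% confidence interval is the sample mean plus or minus 1.96 standard errors. The ten data points below are made up for illustration:

```python
from math import sqrt

data = [9.2, 10.4, 11.1, 8.7, 10.9, 9.8, 10.2, 11.5, 9.5, 10.7]
n = len(data)

ybar = sum(data) / n                                  # sample mean
s2 = sum((x - ybar) ** 2 for x in data) / (n - 1)     # sample variance
se = sqrt(s2 / n)                                     # SE of the sample mean

ci = (ybar - 1.96 * se, ybar + 1.96 * se)             # 95% confidence interval
```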

What is the difference between a null hypothesis and an alternative hypothesis?

A null hypothesis is the default claim that is assumed true unless the data provide strong evidence against it. The alternative hypothesis is the claim that is accepted if the null hypothesis is rejected.

What is the “size of a test?”

The size of a test is the probability that a test rejects the null hypothesis even though the null hypothesis is really true.

What is the significance level?

A person chooses a significance level for a test (e.g., 1%, 5%, or 10%) before seeing the data; it is the largest probability of rejecting the null hypothesis when the null hypothesis is really true that the tester is willing to accept. The test rejects the null hypothesis when the p-value falls below the significance level.

What is the definition of “power?”

The power is the probability that a test rejects the null hypothesis when the alternative hypothesis is true.

What is the difference between a one sided and a two sided alternative hypothesis?

With a one-sided alternative hypothesis, the value of interest lies on only one side of the null value (> or <). Under the two-sided alternative hypothesis, the value of interest is anything not equal to the null value.

Why does a confidence interval contain more information than the result of a single hypothesis test?

The confidence interval contains every null-hypothesis value that the data would not reject at the corresponding significance level. So rather than reporting the accept/reject outcome for a single null value, it summarizes the results of a whole family of hypothesis tests at once.
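This relationship can be seen with the survey numbers from the first section: a two-sided 5% test rejects H0: p = p0 exactly when p0 falls outside the 95% confidence interval.

```python
from math import sqrt

p_hat, n = 0.5375, 400
se = sqrt(p_hat * (1 - p_hat) / n)            # SE of p_hat, ~0.0249
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)   # roughly (0.489, 0.586)

# 0.50 lies inside the interval, so H0: p = 0.5 is not rejected at the
# 5% level -- consistent with the two-sided p-value of about 0.13.
rejected_at_5_percent = not (ci[0] <= 0.5 <= ci[1])
```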

Why is the differences-of-means estimator, applied to data from a randomized controlled experiment, an estimator of the treatment effect?

The “treatment effect” is the causal effect in an experiment or quasi-experiment. The causal effect is the expected effect of a given treatment or intervention in an ideal randomized controlled experiment. An example of a “treatment effect” is the expected result of giving a drug to a population versus not giving it to an identical population. Another example is the expected result of giving fertilizer to plants versus not giving it to other plants growing in otherwise identical conditions.

The differences-in-means estimator is the difference between the mean of the treatment group (those that received the treatment) and the mean of the control group (those that did not). Remember, for such experiments to be meaningful, the treatment and control groups must be randomly assigned from the same population.
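A minimal sketch of the differences-in-means estimator; the outcome values for the two randomly assigned groups below are made up:

```python
treatment = [5.1, 6.3, 5.8, 6.0, 5.6]   # outcomes for units given the treatment
control = [4.2, 4.8, 4.5, 4.9, 4.1]     # outcomes for units given no treatment

def mean(xs):
    return sum(xs) / len(xs)

# the estimated average treatment effect
treatment_effect = mean(treatment) - mean(control)
```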

## Standard Error Definition

The standard error of an estimation is the estimated standard deviation of the error in the estimation. Specifically, it estimates the standard deviation of the difference between estimated values and the true values.

Notice that the true value of the standard deviation in a population is usually unknown, and the term standard error carries with it the idea that an estimate of this unknown quantity is being used. It also carries the idea that it measures, not the standard deviation of the estimate itself, but the standard deviation of the error in the estimate, and these can be very different. For the sample mean, $SE_{\bar{x}} = \frac{s}{\sqrt{n}}$

where

s is the sample standard deviation (i.e., the sample based estimate of the standard deviation of the population), and
n is the size (number of items) of the sample.
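The formula translates directly into code; `standard_error` below is a hypothetical helper name, not a library function:

```python
from math import sqrt

def standard_error(sample):
    """SE of the sample mean: s / sqrt(n), with s the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    return sqrt(s2 / n)
```

For example, `standard_error([2, 4, 4, 4, 5, 5, 7, 9])` is about 0.756.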

## Spring 2008 Books

MKT 402 – Albuquerque

There is one required book and two packets. The prices of the packets are not available yet. The book is:

Marketing Management (13th Edition) by Philip Kotler and Kevin Keller. The book is a recent update; the 12th Edition was published two years ago. From all indications, there are no significant differences between the two books. Both are 816 pages. So, the new edition likely corrects typos.

With that in mind, I’m going to get 12th Edition, international version. The cost on EBay is approximately \$55. Click here to see what current EBay auctions are.

The new version from Amazon is \$166.67. Here is the link.

GBA 412 – Groenevelt

Introduction to Econometrics (2nd Edition) by James Stock and Mark Watson appears to be the only requirement. The cost on Amazon is \$150.67 new or used starting at \$79 plus shipping. Here is the link.

The cost for the new international edition on Ebay is in the \$60-\$65 range, including shipping. Here is the link for that.