What conditions must be met for a confidence interval?

Confidence Intervals

Andrew F. Siegel, Michael R. Wagner, in Practical Business Statistics [Eighth Edition], 2022

Normal Distribution

Assumption 2 Required for the Confidence Interval

The quantity being measured is normally distributed.

The detailed theory behind the confidence interval is based on the assumption that the quantity being measured is normally distributed in the population. Such a simplifying assumption makes it possible to work out all of the equations to compute the critical t value [which has already been done for you]. Fortunately, in practice this requirement is much less rigid for two reasons.

First of all, you could never really tell whether or not the population is perfectly normal, because all you have is the sample with its randomness. In practice, therefore, you would look at a histogram of the data to see if the distribution is approximately normal, that is, not too skewed and with no extreme outliers.

Second, the central limit theorem often comes to the rescue. Because statistical inference is based primarily on the sample average, $\bar{X}$, what you need primarily is that the sampling distribution of $\bar{X}$ be approximately normal. The central limit theorem tells you that if n is large, $\bar{X}$ will be approximately normally distributed even if the individuals in the population [and the sample] are not.

Thus, the practical rule here may be summarized as follows:

Assumption 2 [in Practice]

Look at a histogram of the data. If it looks approximately normal, then you are OK [i.e., the confidence interval statement is approximately valid]. If the histogram is slightly skewed, then you are OK, provided the sample size is not too small. If the histogram is moderately skewed or has very few moderate outliers, then you are OK, provided the sample size is large. If the histogram is extremely skewed or has extreme outliers, then you may be in trouble.

For a binomial situation, the central limit theorem implies that the sample percentage p is approximately normally distributed when n is large [provided the population percentage is not too close to 0% or 100%, as was covered in Chapter 8]. This shows how the assumption of a normal distribution can be [approximately] satisfied for a binomial situation.

What can you do if the normal distribution assumption is not satisfied at all, because of, say, extreme skewness? One approach is to transform the data [perhaps with logarithms] to bring about a normal distribution; keep in mind that the resulting confidence interval would then be for the mean of the population logarithm values, which are more complicated to communicate. Another possibility is to use nonparametric methods, to be described in Chapter 16.

Example

Data Mining to Understand the Average Donation Amount

Consider the donations database with 20,000 entries on the companion site. The total amount given by these 20,000 people in response to the current mailing was $15,592.07, with 989 making a current donation and 19,011 not donating at this time. Thus, the average donation is $0.7796035, or about 78 cents per person. Certainly, the amount donated will vary according to the circumstances of a particular mailing. One source of variation is pure statistical variation, leading to the following question: If we were to send a mailing to a similar [but much larger] group of people that these 20,000 people represent [viewing these 20,000 as a random sample from the larger group], how much, on average, should we expect to receive from each person in the new mailing? An answer may be found using the confidence interval.

The standard deviation of the 20,000 donations is $4.2916438 and the standard error is $0.0303465, leading to a 95% confidence interval extending from $0.720122 to $0.839085. If we plan a new mailing to 500,000 people, then we would expect to receive donations totaling between $360,061 and $419,543 [obtained by multiplying the ends of the confidence interval by 500,000 people].
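The arithmetic behind these figures can be retraced in a few lines. The following is only a sketch: it takes the mean, standard deviation, and sample size quoted above as inputs, and it assumes that the critical t value for 19,999 degrees of freedom can be replaced by the standard normal quantile (the two differ only in the fourth decimal place), so the last digits may differ slightly from the published interval.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the text.
n = 20_000
mean = 0.7796035
sd = 4.2916438

# Standard error of the sample average.
se = sd / sqrt(n)

# Assumption: with 19,999 degrees of freedom the critical t value is
# essentially the standard normal quantile, so the normal is used here.
crit = NormalDist().inv_cdf(0.975)

lower = mean - crit * se
upper = mean + crit * se

# Scale the per-person interval up to the planned mailing of 500,000 people.
new_mailing = 500_000
total_lower = lower * new_mailing
total_upper = upper * new_mailing

print(f"standard error: {se:.7f}")
print(f"95% CI per person: {lower:.4f} to {upper:.4f}")
print(f"expected total: {total_lower:,.0f} to {total_upper:,.0f}")
```

Because the interval for the total is just the per-person interval multiplied by 500,000, any uncertainty in the mean is scaled up by the same factor.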

What about the assumptions for validity of this confidence interval from about 72 to 84 cents for the population mean donation amount? The first assumption requires that the data be a random sample from the population of interest, and this would be true, for example, if the 20,000 were initially selected randomly from the 500,000 as part of a pilot study to see if it would be worthwhile mailing to all 500,000 at this time. The second assumption requires that the quantity being measured be normally distributed; this assumption does not appear to be satisfied, as is seen from the very nonnormal histogram for the 20,000 donation amounts in Fig. 9.2.3. However, the confidence interval is OK in this case, even though the distribution of individual donations is very skewed, because the sample size is large enough to make the distribution of “averages of 20,000 donations” approximately normal. To show that the distribution of “averages of 20,000 donations” is approximately normal, Fig. 9.2.4 shows a histogram of the averages of 500 “bootstrap samples,” with each bootstrap sample of size 20,000 chosen by sampling with replacement from the database of 20,000 donation amounts. Even though the individual donation amounts are highly skewed, the averages of 20,000 donations are actually very close to a normal distribution because of the central limit theorem.

Fig. 9.2.3. A histogram of the 20,000 individual donation amounts shows a highly skewed and very nonnormal distribution. However, assumption 2 for validity of the confidence interval may still be satisfied because the sample average might be approximately normal.

Fig. 9.2.4. A histogram of averages of 20,000 donations shows that the average of 20,000 donations is very nearly normally distributed [because of the central limit theorem] even though individual donation amounts are highly skewed. In this case, 500 averages are shown, with each average chosen by random sampling [with replacement, according to the bootstrap technique] from the database of 20,000 donation amounts.
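The bootstrap idea behind Fig. 9.2.4 can be sketched as follows. The donations database itself is not reproduced here, so this example uses synthetic data as a stand-in: the 5% response rate, the exponential gift amounts, the smaller sample size of 5,000, and the 300 resamples are all illustrative assumptions, not values from the book. The `skewness` helper is likewise defined only for this sketch.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return fmean([((v - m) / s) ** 3 for v in values])

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the donations: most people give nothing,
# and a few give an amount drawn from an exponential distribution.
n = 5_000
donations = [
    random.expovariate(1 / 15) if random.random() < 0.05 else 0.0
    for _ in range(n)
]

# Bootstrap: resample n values with replacement and record each average.
boot_means = [fmean(random.choices(donations, k=n)) for _ in range(300)]

print(f"skewness of individual donations: {skewness(donations):.2f}")
print(f"skewness of bootstrap averages:   {skewness(boot_means):.2f}")
```

The individual amounts should show a large positive skewness while the resampled averages should show a value near zero, which is the pattern the two histograms illustrate.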


URL: //www.sciencedirect.com/science/article/pii/B9780128200254000099

Correlation and Regression

Andrew F. Siegel, Michael R. Wagner, in Practical Business Statistics [Eighth Edition], 2022

Confidence Intervals for Regression Coefficients

This material on confidence intervals should now be familiar. You take an estimator [such as b], its own personal standard error [such as Sb], and the critical t value [using n − 2 degrees of freedom for regression]. The two-sided confidence interval extends from b − tSb to b + tSb. The one-sided confidence interval claims either that the population slope, β, is at least b − tSb or that the population slope, β, is no more than b + tSb [using the one-sided t values, of course]. You may wish to reread the summary of Chapter 9 for a review of the basics of confidence intervals; the only difference here is that you are estimating a population relationship rather than just a population mean.

Similarly, inference for the population intercept term, α, is based on the estimator a and its standard error, Sa.

Confidence Intervals

For the population slope, β:

From b − tSb to b + tSb

For the population intercept, α:

From a − tSa to a + tSa

Example

Variable Costs of Production

For the production cost data, the estimated slope is b = 51.66, the standard error is Sb = 7.35, and the two-sided critical t value for n − 2 = 16 degrees of freedom is 2.119905 for a 95% confidence interval. The 95% confidence interval for β extends from 51.66 − [7.35][2.119905] = 36.08 to 51.66 + [7.35][2.119905] = 67.24. Your confidence interval statement is as follows:

We are 95% confident that the long-run [population] variable costs are somewhere between $36.08 and $67.24 per item produced.

As often happens, the confidence interval reminds you that the estimate [$51.66] is not nearly as exact as it appears. Viewing your data as a random sample from the population of production and cost experiences that might have happened under similar circumstances, you find that with just 18 weeks’ worth of data, there is substantial uncertainty in the variable cost.

A one-sided confidence interval provides a reasonable upper bound that you might use for budgeting purposes. This reflects the fact that you do not know what the variable costs really are; you just have an estimate of them. In this example, the one-sided critical t value is 1.745884, so your upper bound is 51.66 + [7.35][1.746] = 64.49. The one-sided confidence interval statement is as follows:

We are 95% confident that the long-run [population] variable costs are no greater than $64.49 per item produced.

Note that this bound [$64.49] is smaller than the upper bound of the two-sided interval [$67.24] because you are interested only in the upper side. That is, because you are not interested at all in the lower side [and will not take any error on that side into account], you can obtain an upper bound that is closer to the estimated value of $51.66.
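The two-sided and one-sided bounds in this example can be checked with a short calculation. This is a minimal sketch that takes the slope, standard error, and critical t values exactly as quoted in the text; the variable names are chosen here for illustration.

```python
# Values quoted in the production cost example.
b = 51.66               # estimated slope
s_b = 7.35              # standard error of the slope
t_two_sided = 2.119905  # 95% two-sided critical t, 16 degrees of freedom
t_one_sided = 1.745884  # 95% one-sided critical t, 16 degrees of freedom

# Two-sided interval: b - t*S_b to b + t*S_b.
two_sided = (b - t_two_sided * s_b, b + t_two_sided * s_b)

# One-sided upper bound: all of the 5% error is placed on one side,
# so the smaller critical value applies.
upper_bound = b + t_one_sided * s_b

print(f"two-sided 95% interval: {two_sided[0]:.2f} to {two_sided[1]:.2f}")
print(f"one-sided 95% upper bound: {upper_bound:.2f}")
```

The one-sided bound is tighter only because its critical value is smaller; the estimate and its standard error are the same in both calculations.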


URL: //www.sciencedirect.com/science/article/pii/B9780128200254000117

Hypothesis testing

Sheldon M. Ross, in Introduction to Probability and Statistics for Engineers and Scientists [Sixth Edition], 2021

Remarks

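[a] There is a direct analogy between confidence interval estimation and hypothesis testing. For instance, for a normal population having mean $\mu$ and known variance $\sigma^2$, we have shown in Section 7.3 that a $100(1-\alpha)$ percent confidence interval for $\mu$ is given by

$$\mu \in \left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

where $\bar{x}$ is the observed sample mean. More formally, the preceding confidence interval statement is equivalent to

$$P\left\{\mu \in \left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\right\} = 1 - \alpha$$

Hence, if $\mu = \mu_0$, then the probability that $\mu_0$ will fall in the interval

$$\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

is $1-\alpha$, implying that a significance level $\alpha$ test of $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ is to reject $H_0$ when

$$\mu_0 \notin \left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$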

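Similarly, since a $100(1-\alpha)$ percent one-sided confidence interval for $\mu$ is given by

$$\mu \in \left(\bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}},\ \infty\right)$$

it follows that an $\alpha$-level significance test of $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$ is to reject $H_0$ when $\mu_0 \notin \left(\bar{X} - z_{\alpha}\sigma/\sqrt{n},\ \infty\right)$, that is, when $\mu_0 \le \bar{X} - z_{\alpha}\sigma/\sqrt{n}$.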
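The correspondence between the two-sided interval and the two-sided test can be illustrated numerically. The sketch below assumes a known population standard deviation, as in the remark; the function names and the sample values (a mean of 10.5, σ = 2, n = 25) are hypothetical and chosen only for demonstration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for the mean, sigma known."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def reject_two_sided(xbar, sigma, n, mu0, alpha=0.05):
    """Level-alpha z test of H0: mu = mu0 against H1: mu != mu0."""
    z_stat = (xbar - mu0) / (sigma / sqrt(n))
    return abs(z_stat) > NormalDist().inv_cdf(1 - alpha / 2)

# Hypothetical sample summary, not taken from the text.
xbar, sigma, n = 10.5, 2.0, 25
low, high = z_interval(xbar, sigma, n)

# The test rejects mu0 exactly when mu0 lies outside the interval.
results = {}
for mu0 in (9.5, 10.0, 11.0, 11.5):
    inside = low < mu0 < high
    results[mu0] = (inside, reject_two_sided(xbar, sigma, n, mu0))
    print(f"mu0={mu0}: inside interval={inside}, reject H0={results[mu0][1]}")
```

For each candidate value of μ0, the "inside interval" and "reject H0" flags should be opposites, which is the equivalence stated in the remark.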
