What is the relationship between sample size and margin of error?

The margin of error refers to the 95% confidence interval of a poll, and provides a false sense of reliability of a poll since one out of twenty times the true value will lie outside the confidence interval.

From: Encyclopedia of Social Measurement, 2005

The idea of interval estimates

Stephen C. Loftus, in Basic Statistics with R, 2022

12.5 Creating confidence intervals

Generally speaking, our confidence intervals for our parameters will be of the form

Point Estimate ± Margin of Error

similar to the interval estimates we talked about above. While the exact form of our confidence intervals will vary depending on what parameter we are doing confidence intervals for, we can see the process of making confidence intervals and how they are connected to the probability of being “right.” Specifically, we will look at this through calculating our confidence interval for the population proportion p.

For our confidence interval to be “right” (1−α)100% of the time, we need the interval to have a lower bound and upper bound calculated from our sample such that

P(Lower Bound ≤ p ≤ Upper Bound) = 1 − α.

So, let us start with a distribution that we have worked with before: the standard Normal distribution. We could find a value z⁎ so that the probability that a standard Normal is between −z⁎ and z⁎ is 1−α, or

P(−z⁎ ≤ N(0,1) ≤ z⁎) = 1 − α.

We will call this z⁎ the critical value. Now, if we can find something that connects the population proportion p to the standard Normal distribution, we will be able to create an interval. If we recall, there is an important theorem that connects our sample proportion, population proportion, and the standard Normal: the central limit theorem. This states that the sample proportion will follow a Normal distribution assuming the sample size n is large enough, specifically

pˆ ∼ N(p, s.e.(pˆ)²).

As we saw in hypothesis testing, we can translate this to a standard Normal by subtracting off the mean p and dividing by the standard deviation s.e.(pˆ)

(pˆ − p)∕s.e.(pˆ) ∼ N(0,1).

So let us go back to our standard Normal probability, P(−z⁎ ≤ N(0,1) ≤ z⁎) = 1 − α. We can substitute (pˆ − p)∕s.e.(pˆ) for the N(0,1), since (pˆ − p)∕s.e.(pˆ) ∼ N(0,1). Our standard Normal probability then becomes

P(−z⁎ ≤ (pˆ − p)∕s.e.(pˆ) ≤ z⁎) = 1 − α.

Now, we have our population proportion p inside a probability statement directly connected to our 1−α probability. To find our lower and upper bounds—and, therefore, our interval—we need to solve for p inside the probability statement:

P(pˆ − z⁎ s.e.(pˆ) ≤ p ≤ pˆ + z⁎ s.e.(pˆ)) = 1 − α.

So, we now have an interval with a lower bound of pˆ − z⁎ s.e.(pˆ) and an upper bound of pˆ + z⁎ s.e.(pˆ). This interval is random, having been calculated from our data. Additionally, it will cover the true value of p—and thus be “right”—(1−α)100% of the time, which is the probability that we want. With this in mind, our (1−α)100% confidence interval for p will be

(pˆ − z⁎ s.e.(pˆ), pˆ + z⁎ s.e.(pˆ)).

This leaves three components to calculating our confidence interval: our sample proportion pˆ, our standard error s.e.(pˆ), and our critical value z⁎. We calculate our sample proportion from our sample data as previously described. For the moment, the standard error will be given. In later chapters, we will see the formula for the standard errors of our various sample statistics. This just leaves our critical value z⁎. In deriving our confidence interval, we said that we selected z⁎ based on how confident we wanted to be in our interval by choosing z⁎ such that

P(−z⁎ ≤ N(0,1) ≤ z⁎) = 1 − α.

Finding z⁎ based on this definition can be difficult, as it requires us to find two probabilities to get our answer. However, there is an easier way to find z⁎ based on the fact that Normal distributions are symmetric. It turns out that if P(−z⁎ ≤ N(0,1) ≤ z⁎) = 1 − α, then

P(N(0,1) ≤ z⁎) = 1 − α∕2.

This will allow us to look up only one probability to find our z⁎. Let us see this in practice. Say we wanted to calculate a 95% confidence interval for p. To find our value of α, we solve (1−α)100%=95% so that our α=0.05. In finding z⁎, this implies that

P(N(0,1) ≤ z⁎) = 1 − 0.05∕2 = 0.975.

In order to get this critical value we turn to technology, specifically R. To find the quantile of a Normal distribution associated with a specific probability, we turn to the qnorm function. The qnorm function takes in three arguments: the mean of the Normal distribution mean, the standard deviation of the Normal distribution sd, and the probability associated with that quantile p. The code to use this function is

qnorm(p, mean, sd)

And R will return the quantile z⁎ such that P(N(mean, sd) ≤ z⁎) = p. For our example—finding the z⁎ associated with a 95% confidence interval—we would use the code

qnorm(p=0.975, mean=0, sd=1)

And R will return a z⁎ value of 1.959964, which we can use to get our 95% confidence interval. Let us take a look at this in practice. In 1996, the United States General Social Survey took a survey of 96 people asking if they were satisfied with their job [36,37]. They found that 79 of them were satisfied, so the sample proportion would be pˆ = 79/96 = 0.8229. Say we wanted to create a 90% confidence interval (α = 0.1) for the true proportion of people satisfied with their job. The first thing is to find z⁎, such that

P(N(0,1) ≤ z⁎) = 1 − 0.1∕2 = 0.95.

Using R, we input the following code to get our value of z⁎ = 1.644854:

qnorm(p=0.95, mean=0, sd=1)

Our final part of the confidence interval is the standard error of pˆ, which in this case is s.e.(pˆ) = 0.039. Taking all this information, our 90% confidence interval for p will be

(0.8229 − 1.645 × 0.039, 0.8229 + 1.645 × 0.039) = (0.7587, 0.8871).
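The same interval can be reproduced outside of R with a short sketch; here in Python, using only the standard library (`proportion_ci` is my name for the helper, not the book's):

```python
from statistics import NormalDist

# Sketch of the book's interval: Point Estimate ± z* × s.e.
# (proportion_ci is a hypothetical helper name, not from the text.)
def proportion_ci(p_hat, se, conf=0.90):
    alpha = 1 - conf
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # plays the role of qnorm
    return p_hat - z_star * se, p_hat + z_star * se

p_hat = 79 / 96                          # 0.8229 from the GSS example
lower, upper = proportion_ci(p_hat, se=0.039, conf=0.90)
print(round(lower, 4), round(upper, 4))  # approximately 0.7588 0.8871
```

Small differences from the hand calculation in the text come from rounding pˆ and z⁎ before multiplying.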


URL: https://www.sciencedirect.com/science/article/pii/B9780128207888000249

Principles of Inference

Donna L. Mohr, ... Rudolf J. Freund, in Statistical Methods (Fourth Edition), 2022

3.3.1 Interpreting the Confidence Coefficient

We must emphasize that the confidence interval statement is not a standard probability statement. That is, we cannot say that with 0.95 probability μ lies between 7.792 and 7.988. Remember that μ is a fixed number, which by definition has no distribution. This true value of the parameter either is or is not in a particular interval, and we will likely never know which event has occurred for a particular sample. We can, however, state that 95% of the intervals constructed in this manner will contain the true value of μ.

Definition 3.13

The maximum error of estimation, also called the margin of error, is an indicator of the precision of an estimate and is defined as one-half the width of a confidence interval.

We can write the formula for the confidence limits on μ as y¯±E, where

E = zα∕2 σ∕√n

is one-half of the width of the (1−α) confidence interval. The quantity E can also be described as the farthest that μ may be from y¯ and still be in the confidence interval. This value is a measure of how “close” our estimate may be to the true value of the parameter. This bound on the error of estimation, E, is most often associated with a 95% confidence interval, but other confidence coefficients may be used. Incidentally, the “margin of error” often quoted in association with opinion polls is indeed E with an unstated 0.95 confidence level.

The formula for E illustrates for us the following relationships among E, α, n, and σ:

1.

If the confidence coefficient is increased (α decreased) and the sample size remains constant, the maximum error of estimation will increase (the confidence interval will be wider). In other words, the more confidence we require, the less precise a statement we can make, and vice versa.

2.

If the sample size is increased and the confidence coefficient remains constant, the maximum error of estimation will be decreased (the confidence interval will be narrower). In other words, by increasing the sample size we can increase precision without loss of confidence, or vice versa.

3.

Decreasing σ has the same effect as increasing the sample size. This may seem a useless statement, but it turns out that proper experimental design (Chapter 10) can often reduce the standard deviation.

Thus there are trade-offs in interval estimation just as there are in hypothesis testing. In this case we trade precision (narrower interval) for higher confidence. The only way to have more confidence without increasing the width (or vice versa) is to have a larger sample size.

Example 3.5

Factors Affecting Margin of Error

Suppose that a population mean is to be estimated from a sample of size 25 from a normal population with σ=5.0. Find the maximum error of estimation with confidence coefficients 0.95 and 0.99. What changes if n is increased to 100 while the confidence coefficient remains at 0.95?

Solution

1.

The maximum error of estimation of μ with confidence coefficient 0.95 is

E = 1.96(5∕√25) = 1.96.

2.

The maximum error of estimation of μ with confidence coefficient 0.99 is

E = 2.576(5∕√25) = 2.576.

3.

If n=100 then the maximum error of estimation of μ with confidence coefficient 0.95 is

E = 1.96(5∕√100) = 0.98.

Note that increasing n fourfold only halved E. The relationship of sample size to confidence intervals is discussed further in Section 3.4.
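The three maximum errors above can be checked numerically. A minimal Python sketch, assuming the formula E = zα∕2 σ∕√n (the helper name `max_error` is mine):

```python
from math import sqrt
from statistics import NormalDist

# E = z_(alpha/2) * sigma / sqrt(n); max_error is a hypothetical helper name.
def max_error(sigma, n, conf):
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / sqrt(n)

print(round(max_error(5.0, 25, 0.95), 2))   # 1.96
print(round(max_error(5.0, 25, 0.99), 3))   # 2.576
print(round(max_error(5.0, 100, 0.95), 2))  # 0.98
```

Note how the fourfold increase in n only halves E, since n enters through its square root.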


URL: https://www.sciencedirect.com/science/article/pii/B9780128230435000035

Confidence Intervals

George W. Burruss, Timothy M. Bray, in Encyclopedia of Social Measurement, 2005

Introduction

Reports of opinion polls, election returns, and survey results typically include mention of a margin of error of plus or minus some percentage. This margin of error is actually a confidence interval. A report may state the percentage of people favoring a bond package, for instance. So why is a margin of error also reported? Let us take a moment and understand why confidence intervals (margins of error) are necessary. Then we will address how to compute one.

Suppose a researcher reported that 35% of U.S. citizens favor capital punishment, with a 5% margin of error. More than likely, that researcher did not speak with every one of the more than 250 million U.S. citizens. Instead, he or she spoke with a random sample of those citizens, say, 200 of them. Of those 200, 70 (35%) favor capital punishment. Unfortunately, the researcher is not interested in the opinions of just 200 citizens; rather, the opinions of the entire U.S. population are of interest. Therein lies the problem: from the opinions of only 200 people, how to describe the opinions of the 250 million U.S. citizens? Based on responses to the researcher's questions, it is known that 35% of the sample supports the death penalty, but what are the chances that the sample was perfectly representative and that exactly 35% of the population supports the death penalty as well? The chances are not good. What if, even though the sample was completely random, it happened by the luck of the draw that the sample was composed of a lot of conservatives? That would be an example of sampling error. In that case, even though the true percentage of the U.S. population supporting capital punishment might be only 20%, the sample is biased toward conservatives, showing 35%.

The margin of error, or confidence interval, helps control for sampling error. The 5% margin of error indicates that, though 35% of the sample supports capital punishment, the true percentage of support in the U.S. population is around 35% and somewhere between 30 and 40%. This confidence interval is associated with a confidence level, which is usually expressed as a percentage, such as 95 or 99%. If, in this example, the researcher's 5% confidence interval corresponds to the 95% confidence level, that means that if an infinite number of random samples were taken, only 5% of the time would the true population parameter fail to be captured within the confidence intervals.
Though the population parameter is fixed, the confidence intervals fluctuate randomly around the population parameter across repeated samples.

Reporting of a confidence interval typically includes the point estimate, the interval, and the chosen confidence level. The confidence level is the percentage of samples that, taken repeatedly, would produce intervals containing the true population parameter—typically either 95 or 99%. For example, a 95% confidence level would have, in the long run, 95 out of 100 samples whose confidence intervals capture the true parameter. The most common point estimates are means, proportions, and sample differences; but other statistics, such as medians and regression coefficients, can have confidence intervals too.

In many disciplines, researchers express the uncertainty surrounding the point estimate as the result of a hypothesis test, rather than report the confidence interval. In such cases, researchers typically test what is called the null hypothesis. For instance, in considering differences between sentence lengths imposed by two judges, the null hypothesis (also known as the hypothesis of no difference) would state that there is no difference between the judges. The significance level (one minus the confidence level), in that case, represents the probability of rejecting the null hypothesis (saying there is really a difference) when in fact it is true (there really is no difference). The similarities between the hypothesis testing approach and the confidence interval approach are apparent. For instance, if the difference in average sentences imposed by two judges is 3 years, but the margin of error is plus or minus 3 years, the judges may, on average, sentence defendants to the same number of years. The null hypothesis—both judges give the same average sentence—cannot be rejected. Hypothesis testing and confidence intervals therefore use the same logic, but hypothesis testing reports a dichotomous result based on the chosen confidence level. Some medical journals require the reporting of confidence intervals because estimated differences between control and experimental groups that lie in the null region may still indicate important effects.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000608

Surveys

Kenneth A. Rasinski, in Encyclopedia of Social Measurement, 2005

Error and Bias in Surveys

Probability samples are the mainstay of modern survey research because they allow the researcher to control the margin of error. Probability samples, perfectly constructed and implemented, all other things being equal, produce unbiased estimates. However, rarely is it the case that construction and implementation are perfect, and when they are not the potential for bias arises. Bias is the difference between the population value that is supposed to be estimated and the population value that is actually estimated. There are two common sources of bias that are usually discussed with regard to sampling. One common source of bias emerges from the use of a sample frame that does not fully cover the units in the population (the undercoverage problem). The second common source is produced from the failure to obtain data from each of the selected elements in the sample (the nonresponse problem).

Samples are typically selected from lists of units that encompass the entire population (or by using an approximation technique if no such list is available). This list is called a sample frame, and it often only imperfectly represents the population. In part, this is because populations are dynamic, whereas sample frames are static. For example, a list of all the small businesses in the country that one might construct today could be out of date tomorrow if a new business started or an existing one ended. There are methods to update sample frames, but even in the best circumstances a sample frame is unlikely to be a complete listing of all the elements in a population. To the extent that a sample frame covers the population completely or nearly completely, there is hope for obtaining unbiased estimates of population values from a properly designed sample. To the extent that there is substantial undercoverage—that is, a substantial number of elements of the population are missing in the sample frame—there may be bias in the survey estimates.

One example of sample frame undercoverage is seen in telephone survey sampling. Techniques are available to obtain a good approximation of all the possible telephone numbers in a geographical area or even in the entire country. However, if telephone numbers are to be used as a sample frame, households without telephones will have no chance of being represented in a sample selected from that sample frame. The survey estimates will be of the population of households that have telephones. To the extent that this population differs from the population of households without telephones, the estimate will be biased. Undercoverage is a potential source of bias in survey estimates but it need not be a serious source. If the undercoverage is small, for example, in a medium to large community in which 98% of households have telephones, the amount of bias in information is likely to be small. If only 50% of households in an area have telephones, the bias in information collected from a telephone survey has the potential to be large, and telephone survey methodology would not be recommended. If it is the case that elements are not represented in the sample frame for random reasons, then undercoverage is unlikely to produce bias; however, this is usually not the case.

Another common failing leading to bias is the inability to collect information from some of the selected units. For example, in a confidential survey conducted in a corporation some employees selected into the sample may choose not to participate. The failure to collect information from units selected into the survey is expressed as the rate of nonresponse. The nonresponse rate is the number of individuals (or schools or other organizations) who refused to participate in the survey (or could not be located) divided by the total number of individuals (or schools/organizations) selected into the sample. The response rate is the number of completed surveys divided by the total number in the sample. High nonresponse rates can lead to significant bias in survey estimates. Therefore, it is important to know the response rate of the survey when evaluating the quality of its results.
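As a minimal sketch of the rate definitions above (the function names and the numbers in the example are mine, not the author's):

```python
# Response rate = completed surveys / total selected into the sample;
# nonresponse rate = (refusals + not located) / total selected.
def response_rate(completed, sampled):
    return completed / sampled

def nonresponse_rate(refused_or_not_located, sampled):
    return refused_or_not_located / sampled

# Hypothetical corporate survey: 800 employees selected, 640 completions,
# 160 who refused or could not be located.
print(response_rate(640, 800))     # 0.8
print(nonresponse_rate(160, 800))  # 0.2
```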


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000323

Confidence intervals for a single parameter

Stephen C. Loftus, in Basic Statistics with R, 2022

14.4.2 Practice problems

7.

Say that you wanted to estimate the true proportion of American adults who have experienced harassment online within a margin of error of m=0.1 with 90% confidence. A previous study [52] found that pˆ=0.41. What sample size will you need for your study?

8.

Say that you wanted to estimate the true proportion of people in Greece who completed their college education at a foreign college within a margin of error of m=0.05 and 99% confidence. What sample size will you need for your study?

9.

Suppose that you wanted to estimate the true proportion of high school students who took honors classes in high school within a margin of error of m =0.01 and 95% confidence. What sample size will you need for your study?


URL: https://www.sciencedirect.com/science/article/pii/B9780128207888000274

Statistical estimation

Kandethody M. Ramachandran, Chris P. Tsokos, in Mathematical Statistics with Applications in R (Third Edition), 2021

5.5.2.1 Margin of error and sample size

In real-world problems, the estimates of the proportion p are usually accompanied by a margin of error, rather than a CI. For example, in the news media, especially leading up to election time, we hear statements such as “The CNN/USA Today/Gallup poll of 818 registered voters taken on June 27–30 showed that if the election were held now, the president would beat his challenger 52% to 40%, with 8% undecided. The poll had a margin of error of plus or minus 4 percentage points.” What is this “margin of error”? According to the American Statistical Association, the margin of error is a common summary of sampling error that quantifies uncertainty about a survey result. Thus, the margin of error is nothing but a CI. The number quoted in the foregoing statement is half the maximum width of a 95% CI, expressed as a percentage.

Let b be the width of a 95% CI for the true proportion, p. Let pˆ=x/n be an estimate for p where x is the number of successes in n trials. Then,

b = [x∕n + 1.96√((x∕n)(1 − x∕n)∕n)] − [x∕n − 1.96√((x∕n)(1 − x∕n)∕n)] = 3.92√((x∕n)(1 − x∕n)∕n) ≤ 3.92√(1∕(4n)),

because (x∕n)(1 − (x∕n)) = pˆ(1 − pˆ) ≤ 1∕4.

Thus, the margin of error associated with pˆ=(x/n) is 100d%, where:

d = max b∕2 = 3.92√(1∕(4n))∕2 = 1.96∕(2√n).

From the foregoing derivation, it is clear that we can compute the margin of error for other values of α by replacing 1.96 with the corresponding value of zα/2.
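To illustrate, the conservative bound d = zα/2∕(2√n) can be evaluated for the 818-voter poll quoted above. A Python sketch (`margin_of_error` is my name for the helper; the real poll's ±4 points presumably also reflects rounding and design effects not in this formula):

```python
from math import sqrt
from statistics import NormalDist

# Conservative margin of error d = z_(alpha/2) / (2 * sqrt(n)),
# using p(1 - p) <= 1/4. A sketch, not the pollster's exact method.
def margin_of_error(n, conf=0.95):
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / (2 * sqrt(n))

print(round(100 * margin_of_error(818), 1))  # 3.4 (percentage points)
```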

A quick look at the formula for the CI for proportions reveals that a larger sample would yield a shorter interval (assuming other things being equal) and hence, a more precise estimate of p. The larger sample is costlier in terms of time, resources, and money, whereas samples that are too small may result in inaccurate inferences. Then, it becomes beneficial for finding out the minimum sample size required (thus less costly) to achieve a prescribed degree of precision (usually, the minimum degree of precision acceptable). We have seen that the large-sample (1 − α)100% CI for p is:

(pˆ − zα/2√(pˆ(1 − pˆ)∕n), pˆ + zα/2√(pˆ(1 − pˆ)∕n)).

Rewriting it, we have:

|pˆ − p| ≤ zα/2√(pˆ(1 − pˆ)∕n) = (zα/2∕√n)√(pˆ(1 − pˆ)),

which shows that, with probability (1 − α), the estimate is within zα/2√(pˆ(1 − pˆ)∕n) units of p. Because pˆ(1 − pˆ) ≤ 1∕4, for all values of pˆ, we can write the foregoing inequality as:

|pˆ − p| ≤ (zα/2∕√n)√(1∕4) = zα/2∕(2√n).

If we wish to estimate p at level (1 − α) to within d units of its true value, that is, |pˆ − p| ≤ d, the sample size must satisfy the condition zα/2∕(2√n) ≤ d, or

n ≥ zα/2²∕(4d²).

Thus, to estimate p at level (1 − α) to within d units of its true value, take the minimal sample size as n = zα/2²∕(4d²), and if this is not an integer, round up to the next integer.

Sometimes, we may have an initial estimate of the parameter p from a similar process or from a pilot study or simulation. In this case, we can use the following formula to compute the minimum required size of the sample to estimate p, at level (1 − α), to within d units:

n = zα/2² p˜(1 − p˜)∕d²

and, if this is not an integer, we round up to the next integer.

A similar derivation for calculation of sample size for estimation of the population mean μ at level (1 − α) with margin of error E is given by:

n = zα/2² σ²∕E²

and, if this is not an integer, rounding up to the next integer. This formula can be used only if we know the population standard deviation, σ. Although it is unlikely we will know σ when the population mean itself is not known, we may be able to determine σ from an earlier similar study or from a pilot study/simulation.

Example 5.5.5

A dendritic tree is a branched formation that originates from a nerve cell. To study brain development, researchers want to examine the brain tissues from adult guinea pigs. How many cells must the researchers select (randomly) so as to be 95% sure that the sample mean is within 3.4 cells of the population mean? Assume that a previous study has shown σ = 10 cells.

Solution

A 95% confidence corresponds to α = 0.05. Thus, from the normal table, zα/2 = z0.025 = 1.96. Given that E = 3.4 and σ = 10, and using the sample size formula, the required sample size n is:

n = zα/2² σ²∕E² = (1.96)²(10)²∕(3.4)² = 33.232.

Thus, take n = 34.

Example 5.5.6

Suppose that a local TV station in a city wants to conduct a survey to estimate support for the president's policies on the economy within 3% error with 95% confidence.

(a)

How many people should the station survey if they have no information on the support level?

(b)

Suppose they have an initial estimate that 70% of the people in the city support the economic policies of the president. How many people should the station survey?

Solution

Here α = 0.05, and thus zα/2 = 1.96. Also, d = 0.03.

(a)

With no information on p, we use the sample size formula:

n = zα/2²∕(4d²) = (1.96)²∕(4(0.03)²) = 1067.1.

Hence, the TV station must survey 1068 people.

(b)

Because p˜=0.7, the required sample size is calculated from:

n = zα/2² p˜(1 − p˜)∕d² = (1.96)²(0.70)(0.30)∕(0.03)² = 896.37.

Thus, the TV station must survey at least 897 people.
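Both examples can be reproduced with a short Python sketch (function names are mine; `ceil` handles the round-up-to-the-next-integer rule):

```python
from math import ceil
from statistics import NormalDist

# n for estimating a mean: n = z^2 * sigma^2 / E^2, rounded up.
def n_for_mean(sigma, E, conf=0.95):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sigma / E) ** 2)

# n for estimating a proportion: uses the conservative bound
# p(1 - p) <= 1/4 when no prior estimate p_tilde is available.
def n_for_proportion(d, p_tilde=None, conf=0.95):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    if p_tilde is None:
        return ceil(z ** 2 / (4 * d ** 2))
    return ceil(z ** 2 * p_tilde * (1 - p_tilde) / d ** 2)

print(n_for_mean(10, 3.4))           # 34   (Example 5.5.5)
print(n_for_proportion(0.03))        # 1068 (Example 5.5.6a)
print(n_for_proportion(0.03, 0.70))  # 897  (Example 5.5.6b)
```

Note how the prior estimate p˜ = 0.7 reduces the required sample from 1068 to 897, since 0.7 × 0.3 = 0.21 is below the conservative bound of 0.25.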

In practice, we should realize that one of the key factors of a good design is not sample size by itself; it is getting representative samples. Even if we have a very large sample size, if the sample is not representative of our target population, then sample size means nothing. Therefore, whenever possible, we should use random sampling procedures (or other appropriate sampling procedures) to ensure that our target population is properly represented.


URL: https://www.sciencedirect.com/science/article/pii/B9780128178157000051

Numerical Methods

Stephen Andrilli, David Hecker, in Elementary Linear Algebra (Fourth Edition), 2010

A Calculating Mindset

Although we have focused on many theoretical results in this book, computation is also an extremely important part of mathematics. Some mathematical problems that cannot be solved with perfect precision can be solved numerically to within a specified margin of error. For example, with large-degree polynomials, we may not always know the exact value of their roots, but there are many computational techniques that can be used to approximate these roots to any desired degree of accuracy.

In this chapter, we present several additional computational techniques that are useful in linear algebra. For example, in certain types of linear systems, two or more of the equations in the system are so close that it becomes more difficult to find the numerical solution because calculations are rounded at each step and roundoff errors can accumulate. To offset these problems, such techniques as partial pivoting and iterative methods are used that help to minimize such roundoff errors. An important iterative method for finding eigenvalues, known as the Power Method, is also explored.

Methods for decomposing (or factoring) a matrix into a product of two or more special types of matrices are very useful in numerical linear algebra for solving linear systems. In this chapter, three such methods are introduced: LDU Decomposition, QR Factorization, and Singular Value Decomposition. In particular, we will see that Singular Value Decomposition is especially helpful in reducing the amount of information that needs to be kept in storage in order to reproduce a given image to a desired degree of accuracy.


URL: https://www.sciencedirect.com/science/article/pii/B9780123747518000147

Multiple Regression

Donna L. Mohr, ... Rudolf J. Freund, in Statistical Methods (Fourth Edition), 2022

8.3.6 Inferences on the Response Variable

As in the case of simple linear regression, we may be interested in the precision of the estimated conditional mean as well as predicted values of the dependent variable (see Section 7.5). The formulas for obtaining the variances needed for these inferences are obtained from matrix expressions, and are discussed in Section 11.7. Most computer programs have provisions for computing confidence and prediction intervals and also for providing the associated standard errors. A computer output showing 95% confidence intervals is presented in Section 8.5. A word of caution: Some computer program documentation may not be clear on which interval (confidence on the conditional mean or prediction) is being produced, so read instructions carefully!

The point estimates for the mean at given values of the independent variables, or a new individual observation at those values, are both calculated in the same way. However, the margins of error for the second are much wider, as we will see in the following example.

Example 8.3

Snow Geese Departure Times Continued

First, we will select the values for the independent variables for which we wish to make predictions. For the sake of this example, suppose we select TEMP = 10, HUM = 100, LIGHT = 10, and CLOUD = 100. Then the point estimate for TIME is given by inserting these values into the estimated regression equation:

TIMEˆ = −52.994 + 0.9130(10) + 0.1425(100) + 2.5160(10) + 0.0922(100) = 4.77.

The margins of error are greatly different for the estimated mean (of many observations with these exact independent values) or for the prediction of a single observation. For example, the SAS System reports a 95% confidence interval for the mean as (−0.30, 9.84), corresponding to a margin of error of about 5.07. The prediction interval for an individual value is given as (−12.50, 22.03), a margin of error of about 17.27.

While the actual method for calculating these margins of error is deferred until Chapter 11, it is reasonably simple to understand why prediction intervals are so much wider. Since they are for a single individual, they are roughly about 2σˆ = 2√MSE. In this example, 2σˆ = 2√65.474 = 16.2, slightly smaller than the more accurate value of 17.27. For the mean of many observations, we expect that the margin of error will be more like 2σˆ∕√n. In this example, that gives 2√65.474∕√36 = 2.7, again smaller than the more accurate value but of the right magnitude. These rough calculations are meant to illustrate the reason we expect the prediction intervals to be much wider. When sample sizes are very large and the independent values are very near their sample means, these calculations are quite accurate. Generally, though, statistical software should be used to calculate the proper intervals using the methods discussed in Chapter 11.
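The rough calculations are easy to verify. A quick check in Python, using MSE = 65.474 and n = 36 from the example:

```python
from math import sqrt

mse, n = 65.474, 36

# Rough half-width of a prediction interval for one observation: 2 * sqrt(MSE)
print(round(2 * sqrt(mse), 1))      # 16.2

# Rough half-width of a confidence interval for the mean: 2 * sqrt(MSE / n)
print(round(2 * sqrt(mse / n), 1))  # 2.7
```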


URL: https://www.sciencedirect.com/science/article/pii/B9780128230435000084

Validity, Data Sources

Michael P. McDonald, in Encyclopedia of Social Measurement, 2005

Reliability of Scoring

Validity is often distinguished from reliability, the degree to which repeated scoring of a measure provides consistent values. Alwin provides a detailed treatment of reliability in the Encyclopedia of Social Measurement, but a brief discussion is warranted here because reliability is a necessary condition for validity. If repeated attempts to score the cases of a measure yield dramatically different results, then validity may be impossible to ascertain.

Virtually all social science data have measurement error. The naïve researcher accepts all data at face value, while the cynical researcher never fully trusts that any data source reports exactly the correct values for all cases. A classic example of measurement error is the statistical “margin of error” of polling percentages associated with random sampling of a population. Because of random selection of respondents, no two surveys will produce the same result, and no survey is guaranteed to represent the true value. Instead, margins of error are reported to indicate the range within which we are confident the true value lies. Measurement error extends well beyond surveys to any setting where human error in coding and classifying cases may occur. Some cases of a measure may even be missing entirely.

The reliability of a measure is determined by the size and bias of the measurement error. If measurement error is large, then the noise in the scoring of the measure may make any meaningful interpretation of the measure in relationship to the concept, or other concepts, impossible to determine. In the context of polling, the margin of error of a poll is inversely related to the sample size of the survey. If the sample size is small, then the margin of error will be large, so large that the true value could fall anywhere within such a wide range as to be virtually unknown. In a similar vein, if the poll result is within the margin of error of the quantity one wishes to determine, such as the winner of an election, then the outcome is said to be within the margin of error of the poll and impossible to determine with confidence.

Measurement error may be small but still have bias: systematic measurement error that may also invalidate a measure's usefulness. For example, in 1936, Literary Digest magazine conducted a poll of over two million respondents selected from telephone directories and automobile registrations. The poll predicted a landslide victory for Alf Landon, but the landslide that year went to Franklin D. Roosevelt. Even with two million respondents, the poll was unrepresentative of the population, since only the most affluent owned phones and cars at that time. Thus, even though the margin of error of this poll would have been much smaller than that of typical surveys, because of the extremely large sample size, the survey was still unreliable because it was biased.

Biases may appear in many guises. In polling, it is well known that respondents may lie to the interviewer to provide the socially acceptable response. For example, more respondents consistently report voting than official government figures indicate. Sociologists term the dual nature of bias that enters when humans observe human behavior the insider versus outsider biases. Sociologists who seek an insider perspective attempt to gain the trust of those they study by becoming active participants in the community under study. As an insider, the sociologist hopes that subjects will be more willing to share their true feelings with the researcher. However, the gap between researcher and subject may never be fully bridged, as subjects will always know that they are being studied. Furthermore, a researcher may never be able to achieve insider status at all, as when studying a group to which the researcher does not belong. In becoming an insider, the sociologist gains the trust of the study subjects but at the same time loses the perspective of detachment. As an outsider, the sociologist is able to see the larger picture and place observations into a broader, meaningful context. A careful balance of insider insight and outsider perspective is needed to produce meaningful sociological research.

Since polls rely on random sampling, a poll may, by sheer bad luck of random chance, draw an aberrant sample. The margin of error refers to the 95% confidence interval of a poll, and it can provide a false sense of reliability, since one out of twenty times the true value will lie outside the confidence interval. Over the course of a presidential election, many polls, far more than 20, are conducted. Unfortunately, the surprise poll, the one with a surprising result that may stem from a bad sample, is the one that receives the most attention. To turn statistics on its head, 1 out of every 20 statistical results may incorrectly reject the null hypothesis merely as a consequence of random chance.
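The one-in-twenty failure rate can be checked directly by simulation. The sketch below is illustrative only, with made-up poll parameters (true proportion 0.52, sample size 1,000): it repeatedly draws samples and counts how often the normal-approximation 95% interval misses the truth.

```python
import math
import random

random.seed(42)

def ci_covers_truth(p_true, n, z=1.96):
    """Draw one simulated poll of size n; check whether its 95% CI contains p_true."""
    p_hat = sum(random.random() < p_true for _ in range(n)) / n
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe <= p_true <= p_hat + moe

# Simulate 2000 polls; roughly 1 in 20 intervals should miss the true value.
trials = 2000
misses = sum(not ci_covers_truth(p_true=0.52, n=1000) for _ in range(trials))
print(f"miss rate: {misses / trials:.3f}")  # typically close to 0.05
```

The miss rate hovers near 5% no matter how large each individual poll is; a bigger sample narrows the interval but does not change how often the interval misses.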

Although polls provide classic examples of reliability issues, by no means is reliability an issue restricted to polling. For example, Ward chronicles how research into the retirement of U.S. Supreme Court justices was fatally flawed because the first study of retirement incorrectly reported the text of a 1937 law that set the age at which judges may retire with benefits. Subsequent researchers cited the original study, without checking the statute itself, and propagated the error. The lesson is that it is important to check the reliability of secondary sources.

In the course of data entry, some cases will inevitably be incorrectly scored through human error. These data entry errors may appear as outliers, atypical cases with values far from the others. Outliers are not necessarily the product of an error, but they often are, and they can severely distort observed relationships between measures. Researchers should carefully check the extreme and highly implausible values of a measure, as these cases are often the result of data entry errors. For example, in analyzing election data for a project, one election outcome showed an exact 50–50 split of the vote between two candidates. Since an exact tie is unlikely, checking the observation revealed that the vote total for one candidate had been incorrectly entered twice, once for each candidate. Some of the same techniques used to test the validity of a measure may also reveal outliers; for example, plotting two related measures against one another will reveal deviant cases. To improve the reliability of a measure, a careful researcher will check a completed data entry project for outliers and verify their validity. Just as importantly, data obtained from even the most respected outside sources should not be blindly accepted as error free, and should be similarly scrutinized.
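A simple automated screen can flag implausible values like the exact tie described above. A minimal sketch, using hypothetical vote totals:

```python
# Hypothetical election data: each row is (candidate_a_votes, candidate_b_votes).
results = [(5234, 4890), (6120, 6120), (4501, 5233)]

# Flag suspicious rows: an exact tie often signals a double-entry error.
suspects = [i for i, (a, b) in enumerate(results) if a == b]
print(suspects)  # [1] -- row 1 deserves a manual check against the source
```

The flagged row is not automatically discarded; it is checked against the original source, since some outliers are genuine.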

Data may also be missing, either randomly or with bias. Statisticians have developed methods to impute, or fill in, missing data and to incorporate the statistical error of imputation into results. Imputed values are statistical guesses at the true values, and thus inherently contain measurement error. The loss of reliability associated with imputation will be proportional to the number of missing cases and to the reliability of the imputation procedure, which may itself contain bias.
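As a concrete illustration, the simplest imputation strategy replaces each missing value with the mean of the observed cases. The scores below are hypothetical; note that the filled-in values are guesses and carry exactly the kind of measurement error described above.

```python
# Hypothetical scores with missing cases marked as None.
scores = [72, None, 85, 90, None, 78]

# Mean imputation: fill each missing value with the mean of observed cases.
observed = [x for x in scores if x is not None]
mean = sum(observed) / len(observed)
imputed = [x if x is not None else mean for x in scores]
print(imputed)  # [72, 81.25, 85, 90, 81.25, 78]
```

Mean imputation also shrinks the apparent variability of the measure, which is one reason more sophisticated procedures propagate the imputation error into the final results.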

Reliability is also an issue for statistical software. The world of pencil and paper does not translate directly into the binary arithmetic of computers. Small, silent measurement error is introduced when numbers, particularly fractions, are entered into statistical software. Even for a perfectly valid measure, if such a thing existed, these errors may propagate through statistical algorithms to produce wildly inaccurate results. An understanding of the limits of computer arithmetic can help researchers avoid these undesirable outcomes.
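Python, like most statistical software, stores fractions in binary floating point, so even entering a number as simple as 0.1 introduces a tiny silent error:

```python
import math

# Decimal fractions rarely have exact binary representations, so a tiny
# error creeps in the moment a value like 0.1 is entered.
a = 0.1 + 0.2
print(a == 0.3)       # False
print(abs(a - 0.3))   # about 5.5e-17

# Roundoff accumulates: summing 0.1 ten thousand times drifts from 1000.
total = sum(0.1 for _ in range(10000))
print(total == 1000.0)  # False on IEEE-754 doubles

# Compare with a tolerance instead of exact equality.
print(math.isclose(a, 0.3))  # True
```

Individually these errors are minuscule, but as the text notes, some algorithms amplify them; careful numerical code tests closeness rather than exact equality.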


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000463

Numerical Techniques

Stephen Andrilli, David Hecker, in Elementary Linear Algebra (Fifth Edition), 2016

A Calculating Mindset

In this chapter, we present several additional computational techniques that are widely used in numerical linear algebra. When performing calculations, exact solutions are not always possible because we often round our results at each step, allowing roundoff errors to accumulate. In such cases, we may only be able to obtain the desired numerical answer within a certain margin of error. In what follows, we will examine some iterative processes that can minimize such roundoff errors, in particular for solving systems of linear equations or for finding eigenvalues. We will also examine three methods (LDU Decomposition, QR Factorization, and Singular Value Decomposition) for factoring a matrix into a product of simpler matrices, techniques that are particularly useful for solving certain linear systems.
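As a small illustration of such an iterative process (not the book's own notation, and a deliberately tiny hypothetical system), the classic Jacobi method repeatedly refines a guess for a diagonally dominant linear system:

```python
# Jacobi iteration sketch for the hypothetical 2x2 system
#   4x +  y = 9
#    x + 3y = 7
# Each step solves the i-th equation for the i-th unknown, using the
# previous iterate's values for the other unknowns.
def jacobi(steps=50):
    x, y = 0.0, 0.0  # initial guess
    for _ in range(steps):
        # Tuple assignment updates both components simultaneously (true Jacobi).
        x, y = (9 - y) / 4, (7 - x) / 3
    return x, y

x, y = jacobi()
print(round(x, 6), round(y, 6))  # converges toward x = 20/11, y = 19/11
```

Because the coefficient matrix is diagonally dominant, each iteration shrinks the error by a fixed factor, so the method converges to machine precision rather than accumulating roundoff.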

Throughout this text, we have urged the use of a calculator or computer with appropriate software to perform tedious calculations once you have mastered a computational technique. The algorithms discussed in this chapter are especially facilitated by employing a calculator or computer to save time and decrease drudgery.


URL: https://www.sciencedirect.com/science/article/pii/B9780128008539000098

Which relationship between sample size and sampling error is correct?

In general, larger sample sizes decrease the sampling error; however, this decrease is not directly proportional. As a rough rule of thumb, you need to increase the sample size fourfold to halve the sampling error.
In the context of polling, the margin of error of a poll is inversely related to the sample size of the survey. If the sample size is small, the margin of error will be large, so large that the true value could fall anywhere within a wide range and is thus virtually unknown.
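For a sample proportion, the approximate 95% margin of error is z·√(p̂(1−p̂)/n) with z ≈ 1.96, which makes the fourfold rule of thumb easy to verify; the sample sizes below are illustrative:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Quadrupling the sample size halves the margin of error,
# because the error shrinks with the square root of n.
m1 = margin_of_error(0.5, 500)
m2 = margin_of_error(0.5, 2000)
print(f"n=500:  +/-{m1:.3f}")  # about +/-0.044
print(f"n=2000: +/-{m2:.3f}")  # about +/-0.022
print(f"ratio: {m1 / m2:.1f}")  # 2.0
```

Since the error scales as 1/√n, doubling the sample size only shrinks the margin by about 29%, which is why halving it requires four times the respondents.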

Does sample size increase or decrease margin of error?

The margin of error decreases as the sample size increases. By the Law of Large Numbers, the sample mean approaches the population mean as the sample size grows; correspondingly, the standard error, and with it the margin of error, shrinks in inverse proportion to the square root of the sample size.

What is the relationship between margin of error and confidence level?

The lower bound of the confidence interval is the observed score minus the margin of error; the upper bound is the observed score plus the margin of error. The width of the confidence interval is therefore twice the margin of error.
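This relationship is straightforward to express in code; the point estimate and margin of error below are hypothetical:

```python
def confidence_interval(point_estimate, margin_of_error):
    """Bounds are the observed score minus/plus the margin of error."""
    return (point_estimate - margin_of_error, point_estimate + margin_of_error)

# Hypothetical poll: 52% support with a +/-3 point margin of error.
lo, hi = confidence_interval(0.52, 0.03)
print(f"[{lo:.2f}, {hi:.2f}]")   # [0.49, 0.55]
print(f"width = {hi - lo:.2f}")  # 0.06, twice the margin of error
```

Note that the confidence level (here 95%) enters only through the size of the margin of error itself; a higher confidence level widens the interval.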