
Error Statistics

Deborah G. Mayo, Aris Spanos, in Philosophy of Statistics, 2011

2.4 Fallacies arising from overly sensitive tests

A common complaint concerning a statistically significant result is that for any discrepancy from the null, say γ ≥ 0, however small, one can find a large enough sample size n such that a test, with high probability, will yield a statistically significant result (for any p-value one wishes).

(#4) With large enough sample size even a trivially small discrepancy from the null can be detected.

A test can be so sensitive that a statistically significant difference from H0 only warrants inferring the presence of a relatively small discrepancy γ; a large enough sample size n will render the power POW(Tα; μ1=μ0 + γ) very high. To make things worse, many assume, fallaciously, that reaching statistical significance at a given level α is more evidence against the null the larger the sample size (n). (Early reports of this fallacy among psychology researchers are in Rosenthal and Gaito, 1963). Few fallacies more vividly show confusion about significance test reasoning. A correct understanding of testing logic would have nipped this fallacy in the bud 60 years ago. Utilizing the severity assessment one sees that an α-significant difference with n1 passes μ > μ1 less severely than with n2 where n1 > n2.

For a fixed type I error probability α, increasing the sample size decreases the type II error probability (power increases). Some argue that to balance the two error probabilities, the required α level for rejection should be decreased as n increases. Such rules of thumb are too tied to the idea that tests are to be specified and then put on automatic pilot without a reflective interpretation. The error statistical philosophy recommends moving away from all such recipes. The reflective interpretation that is needed drops out from the severity requirement: increasing the sample size does increase the test's sensitivity and this shows up in the “effect size” γ that one is entitled to infer at an adequate severity level. To quickly see this, consider figure 5.

Figure 5. Severity associated with inference μ > 0.2, d(x0) = 1.96, and different sample sizes n.

It portrays the severity curves for test Tα (σ = 2) with the same outcome d(x0) = 1.96 but based on different sample sizes (n = 50, n = 100, n = 1000), indicating that the severity for inferring μ > 0.2 decreases as n increases:

for n = 50: SEV(μ > 0.2) = .895; for n = 100: SEV(μ > 0.2) = .831; for n = 1000: SEV(μ > 0.2) = .115.
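These three values can be reproduced directly from the definition of severity for this test. Below is a minimal sketch (an addition to the excerpt, not the authors' code), assuming the one-sided Normal test Tα of H0: μ = 0 vs. H1: μ > 0 with known σ = 2, where SEV(μ > μ1) = Φ((x¯ − μ1)/(σ/√n)) and x¯ is the outcome corresponding to d(x0) = 1.96:

```python
# Minimal sketch (not from the chapter): reproduce the severity values above
# for the one-sided Normal test T_alpha (H0: mu = 0 vs. H1: mu > 0), sigma = 2,
# with the same observed d(x0) = 1.96 but different sample sizes n.
from math import sqrt
from scipy.stats import norm

sigma, d_obs, mu1 = 2.0, 1.96, 0.2

for n in (50, 100, 1000):
    se = sigma / sqrt(n)      # standard error of the sample mean
    xbar = d_obs * se         # outcome corresponding to d(x0) = 1.96
    sev = norm.cdf((xbar - mu1) / se)   # SEV(mu > 0.2) = P(d(X) <= d(x0); mu = 0.2)
    print(f"n = {n:4d}: SEV(mu > 0.2) = {sev:.3f}")

# Output: 0.895 for n = 50, 0.831 for n = 100, 0.115 for n = 1000.
```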

The facts underlying criticism #4 are also erroneously taken as grounding the claim:

“All nulls are false.”

This confuses the true claim that with large enough sample size, a test has power to detect any discrepancy from the null however small, with the false claim that all nulls are false.

The tendency to view tests as automatic recipes for rejection gives rise to another well-known canard:

(#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.

The charge is made by considering the highly artificial case of two point hypotheses such as μ = 0 vs. μ = .8. If the null is μ = 0 and the alternative is μ = .8, then x¯ = 0.4 (being 2σx from 0) “rejects” the null and declares there is evidence for .8. On the other hand, if the null is μ = .8 and the alternative is μ = 0, then observing x¯ = 0.4 now rejects .8 and finds evidence for 0. It appears that we get a different inference depending on how we label our hypotheses! Now the hypotheses in an N-P test must exhaust the space of parameter values, but even entertaining the two point hypotheses, the fallacy is easily exposed. Let us label the two cases:

Case 1: H0: μ = 0 vs. H1: μ = .8
Case 2: H0: μ = .8 vs. H1: μ = 0

In case 1, x¯ = 0.4 is indeed evidence of some discrepancy from 0 in the positive direction, but it is exceedingly poor evidence for a discrepancy as large as .8 (see figure 2). Even without the calculation that shows SEV(μ > .8) = .023, we know that SEV(μ > .4) is only .5, and so there are far weaker grounds for inferring an even larger discrepancy.

In case 2, the test is looking for discrepancies from the null (which is .8) in the negative direction. The outcome x¯=0.4 (d(x0)=−2.0) is evidence that μ ≤ .8 (since SEV (μ ≤ .8)=.977), but there are terrible grounds for inferring the alternative μ= 0!

In short, case 1 asks if the true μ exceeds 0, and x¯=.4 is good evidence of some such positive discrepancy (though poor evidence it is as large as .8); while case 2 asks if the true μ is less than .8, and again x¯=.4 is good evidence that it is. Both these claims are true. In neither case does the outcome provide evidence for the point alternative, .8 and 0 respectively. So it does not matter which is the null and which is the alternative, and criticism #5 is completely scotched.

Note further that in a proper test, the null and alternative hypotheses must exhaust the parameter space, and thus, “point-against-point” hypotheses are at best highly artificial, at worst, illegitimate. What matters for the current issue is that the error statistical tester never falls into the alleged inconsistency of inferences depending on which is the null and which is the alternative.
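As a quick check of the two severity values quoted in cases 1 and 2, here is a short sketch (added here, not from the chapter); it assumes the standard error σx = 0.2 implied by x¯ = 0.4 lying 2σx from 0:

```python
# Sketch (assumption: standard error = 0.2, so that xbar = 0.4 is 2 SE from 0).
from scipy.stats import norm

se, xbar, mu = 0.2, 0.4, 0.8

sev_case1 = norm.cdf((xbar - mu) / se)       # SEV(mu > 0.8)  = P(Xbar <= 0.4; mu = 0.8)
sev_case2 = 1 - norm.cdf((xbar - mu) / se)   # SEV(mu <= 0.8) = P(Xbar > 0.4; mu = 0.8)

print(f"SEV(mu > 0.8)  = {sev_case1:.3f}")   # 0.023
print(f"SEV(mu <= 0.8) = {sev_case2:.3f}")   # 0.977
```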

We now turn our attention to cases of statistically insignificant results. Overly high power is problematic in dealing with significant results, but with insignificant results, the concern is that the test is not powerful enough.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780444518620500058

Principles of Inference

Donna L. Mohr, ... Rudolf J. Freund, in Statistical Methods (Fourth Edition), 2022

3.5.1 Statistical Significance versus Practical Significance

The use of statistical hypothesis testing provides a powerful tool for decision making. In fact, there really is no other way to determine whether two or more population means differ based solely on the results of one sample or one experiment. However, a statistically significant result cannot be interpreted simply by itself. In fact, we can have a statistically significant result that has no practical implications, or we may not have a statistically significant result, yet useful information may be obtained from the data. For example, a market research survey of potential customers might find that a potential market exists for a particular product. The next question to be answered is whether this market is such that a reasonable expectation exists for making profit if the product is marketed in the area. That is, does the mere existence of a potential market guarantee a profit? Probably not. Further investigation must be done before recommending marketing of the product, especially if the marketing is expensive. The following examples are illustrations of the difference between statistical significance and practical significance.

Example 3.7

Defective Contact Lens

This is an example of a statistically significant result that is not practically significant.

In the January/February 1992 International Contact Lens Clinic publication, there is an article that presented the results of a clinical trial designed to determine the effect of defective disposable contact lenses on ocular integrity (Efron and Veys, 1992). The study involved 29 subjects, each of whom wore a defective lens in one eye and a nondefective one in the other eye. The design of the study was such that neither the research officer nor the subject was informed of which eye wore the defective lens. In particular, the study indicated that a significantly greater ocular response was observed in eyes wearing defective lenses in the form of corneal epithelial microcysts (among other results). The test had a p value of 0.04. Using a level of significance of 0.05, the conclusion would be that the defective lenses resulted in more microcysts being measured. The study reported a mean number of microcysts for the eyes wearing defective lenses as 3.3 and the mean for eyes wearing the nondefective lenses as 1.6. In an invited commentary following the article, Dr. Michel Guillon makes an interesting observation concerning the presence of microcysts. The commentary points out that the observation of fewer than 50 microcysts per eye requires no clinical action other than regular patient follow-up. The commentary further states that it is logical to conclude that an incidence of microcysts so much lower than the established guideline for action is not clinically significant. Thus, we have an example of the case where statistical significance exists but where there is no practical significance.

Example 3.8

Weight Loss

A major impetus for developing the statistical hypothesis test was to avoid jumping to conclusions simply on the basis of apparent results. Consequently, if some result is not statistically significant, the story usually ends. However, it is possible to have practical significance but not statistical significance. In a recent study of the effect of a certain diet on weight reduction, a random sample of 10 subjects was weighed, put on a diet for 2 weeks, and weighed again. The results are given in Table 3.2.

Table 3.2. Weight difference (in pounds).

Subject   Weight Before   Weight After   Difference (Before − After)
1         120             119            +1
2         131             130            +1
3         190             188            +2
4         185             183            +2
5         201             188            +13
6         121             119            +2
7         115             114            +1
8         145             144            +1
9         220             243            −23
10        190             188            +2

Solution

A hypothesis test comparing the mean weight before with the mean weight after (see Section 5.4 for the exact procedure for this test) would result in a p value of 0.21. Using a level of significance of 0.05 there would not be sufficient evidence to reject the null hypothesis and the conclusion would be that there is no significant loss in weight due to the diet. However, note that 9 of the 10 subjects lost weight! This means that the diet is probably effective in reducing weight, but perhaps does not take a lot of it off. Obviously, the observation that almost all the subjects did in fact lose weight does not take into account the amount of weight lost, which is what the hypothesis test did. So in effect, the fact that 9 of the 10 subjects lost weight (90%) really means that the proportion of subjects losing weight is high rather than that the mean weight loss differs from 0.

We can evaluate this phenomenon by calculating the probability that the results we observed occurred strictly due to chance, using the basic principles of probability of Chapter 2. That is, we can calculate the probability that 9 of the 10 differences in before and after weight are in fact positive if the diet does not affect the subjects' weight. If the sign of the difference is really due to chance, then the probability of an individual difference being positive would be 0.5 or 1/2. The probability of 9 of the 10 differences being positive would then be 10(0.5)(0.5)^9 = 0.0098, a very small value. Thus, it is highly unlikely that we could get 9 of the 10 differences positive due to chance, so there is something else causing the differences. That something must be the diet.
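For readers who want to verify the arithmetic, here is a one-line check (an added sketch, not part of the original example), using the binomial distribution with n = 10 and p = 0.5:

```python
# Sketch: chance of exactly 9 positive differences out of 10 if the diet has
# no effect (each difference equally likely to be positive or negative).
from scipy.stats import binom

p_exactly_9 = binom.pmf(9, 10, 0.5)          # C(10,9) * 0.5^9 * 0.5
print(f"P(exactly 9 of 10 positive) = {p_exactly_9:.6f}")   # 0.009766
```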

Note that although the results appear to be contradictory, we actually tested two different hypotheses. The first was a test comparing the mean weight before and after; thus, if there was a significant increase or decrease in the average weight, we would have rejected this hypothesis. On the other hand, the second analysis was really a hypothesis test to determine whether the probability of losing weight is really 0.5 or 1/2. We discuss this type of hypothesis test in the next chapter.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128230435000035

Empirical UX Evaluation: Preparation

Rex Hartson, Pardha Pyla, in The UX Book (Second Edition), 2019

23.6.2 Identify the Right Kinds of Participants

Formal Summative Evaluation

A formal, statistically rigorous summative (quantitative) empirical UX evaluation that produces statistically significant results (Section 21.1.5.1).

In formal summative evaluation, the process of selecting participants is referred to as “sampling,” but that term is not appropriate here because what we are doing has nothing to do with the implied statistical relationships and constraints. In fact, it’s quite the opposite. You’re trying to learn the most about your design with the smallest number of participants and with exactly the right selected (not random) participants. Look for participants who are “representative users,” that is, participants who match your target work role's user class descriptions and who are knowledgeable of the general target system domain. If you have multiple work roles and user classes, you should try to recruit participants representing each category. If you want to be certain your participants are representative, you can prepare a short written demographic survey to administer to participants to confirm that each one meets the requirements of your intended work activity role's user class characteristics.

In fact, participants must match the user class attributes in any UX targets they will help evaluate. So, for example, if initial usage is specified, you need participants unfamiliar with your design.

23.6.2.1 “Expert” participants

If you have a session calling for experienced usage, it’s obvious that you should recruit an expert user, someone who knows the system domain and knows your particular system. Expert users are good at thinking aloud to generate qualitative data. These expert users will understand the tasks and can tell you what they don’t like about the design. But you cannot necessarily depend on them to tell you how to make the design better.

Recruit a UX expert if you need a participant with broad UX knowledge and who can speak to design flaws in terms of design guidelines. As participants, these experts may not know the system domain as well and the tasks might not make as much sense to them, but they can analyze user experience, find subtle problems (e.g., small inconsistencies, poor use of color, confusing navigation), and offer suggestions for solutions.

Or you can consider recruiting a so-called “double expert,” a UX expert who also knows your system very well, perhaps the most valuable kind of participant.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128053423000230

UX Evaluation Methods and Techniques

Rex Hartson, Pardha Pyla, in The UX Book (Second Edition), 2019

21.1.5.2 Informal summative evaluation

An informal summative UX evaluation method is a quantitative summative UX evaluation method that is not statistically rigorous and does not produce statistically significant results. Informal summative evaluation is used in support of formative evaluation, as an engineering technique to help assess how well you are achieving good usability and UX.

Participant

A participant, or user participant, is a user, potential user, or user surrogate who helps evaluate UX designs for usability and user experience. These are the people who perform tasks and give feedback while we observe and measure. Because we wish to invite these volunteers to join our team and help us evaluate designs (i.e., we want them to participate), we use the term “participant” instead of “subject” (Section 21.1.3).

Informal summative evaluation is done without experimental controls, with smaller numbers of user participants, and with only summary descriptive statistics (such as average values). At the end of each iteration for a product version, the informal summative evaluation can be used as a kind of acceptance test to compare with our UX targets (Chapter 22) and help ensure that we meet our UX and business goals with the product design.

Table 21-1 highlights the differences between formal and informal summative UX evaluation methods.

Table 21-1. Some differences between formal and informal summative UX evaluation methods

Formal Summative UX Evaluation | Informal Summative UX Evaluation
Science | Engineering
Randomly chosen subjects/participants | Deliberately nonrandom participant selection to get most formative information
Concerned with having large enough sample size (number of subjects) | Deliberately uses relatively small number of participants
Uses rigorous and powerful statistical techniques | Deliberately simple, low-power statistical techniques (e.g., simple mean and, sometimes, standard deviation)
Results can be used to make claims about “truth” in a scientific sense | Results cannot be used to make claims, but are used to make engineering judgments
Relatively expensive and time consuming to perform | Relatively inexpensive and rapid to perform
Rigorous constraints on methods and procedures | Methods and procedures open to innovation and adaptation
Tends to yield “truth” about very specific scientific questions (A vs. B) | Can yield insight about broader range of questions regarding levels of UX achieved and the need for further improvement
Not used within a UX design process | Intended to be used within a UX design process in support of formative methods

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128053423000217

Naturalism and the Nature of Economic Evidence

Harold Kincaid, in Philosophy of Economics, 2012

2 Nonexperimental Evidence

The debates sketched above certainly show up in discussions of the force and proper role of nonexperimental evidence in economics. In this section I review some of the issues and argue for a naturalist inspired approach. I look at some standard econometric practices, particularly significance testing and data mining.

The logic of science ideal is heavily embodied in the widespread use and the common interpretation of significance tests in econometrics. There are both deep issues about the probability foundations of econometrics that are relevant here and more straightforward, if commonly missed, misrepresentations of what can be shown, and how, by significance testing. The naturalist stance throughout is that purely logical facts about probability have only a partial role and must be embedded in complex empirical arguments.

The most obvious overinterpretation of significance testing is that emphasized by McCloskey and others [McCloskey and Ziliak, 2004]. A statistically significant result may be an economically unimportant result; tiny correlations can be significant, and in large samples they always will be. McCloskey argues that the common practice is to focus on the size of p-values to the exclusion of the size of regression coefficients.

Another use of statistical significance that has more serious consequences and is overwhelmingly common in economics is using statistical significance to separate hypotheses into those that should be believed and those that should be rejected, and to rank believable hypotheses according to relative credibility (indicated by the phrase “highly significant,” which comes to “my p-value is smaller than yours”). This is a different issue from McCloskey's main complaint about ignoring effect size. This interpretation is firmly embedded in the practice of journals and econometric textbooks.

Why is this practice mistaken? For a conscientious frequentist like Mayo, it is a mistake because it does not report the result of a stringent test. Statistical significance tests tell us about the probability of rejecting the null when the null is in fact true; they tell us the false positive rate. But a stringent test not only rules out false positives but false negatives as well. The probability of a false negative is 1 minus the power of the test. Reporting a low false positive rate is entirely compatible with a test that has a very high false negative rate. However, the power of the statistical tests for economic hypotheses can be difficult to determine, because one needs credible information on possible effect sizes beforehand (another place frequentists seem to need priors). Most econometric studies, however, do not report power calculations. Introductory textbooks in econometrics [Barreto and Howland, 2006] can go without mentioning the concept; a standard advanced econometrics text provides one brief mention of power, relegated to an appendix [Greene, 2003]. Ziliak and McCloskey find that about 60% of articles in their sample from the American Economic Review do not mention power. So one is left with no measure of the false negative rate and thus still rather in the dark about what to believe when a hypothesis is rejected.
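To make the dependence on an assumed effect size concrete, here is a small illustrative sketch (an addition, not from the text): the power of a one-sided z-test at α = 0.05 for several hypothetical effect sizes measured in standard-error units. The particular effect sizes are arbitrary.

```python
# Illustrative sketch: power of a one-sided z-test at alpha = 0.05 for several
# assumed effect sizes (in standard-error units). Without such an assumed
# effect size, the false-negative rate (1 - power) cannot be assessed.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)          # one-sided critical value, about 1.645

for effect_in_se in (0.5, 1.0, 2.0, 3.0):
    power = 1 - norm.cdf(z_crit - effect_in_se)
    print(f"effect = {effect_in_se:.1f} SE: power = {power:.2f}, "
          f"false-negative rate = {1 - power:.2f}")
```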

Problems resulting from the lack of power analyses are compounded by the fact that significance tests also ignore the base rate or prior plausibility. Sometimes background knowledge can be so at odds with a result that is statistically significant that it is rational to remain dubious. This goes some way toward explaining economists' conflicted attitude toward econometrics. They are officially committed to the logic of science ideal in the form of decision by statistical significance. Yet they inevitably use their background beliefs to evaluate econometric results, perhaps sometimes dogmatically and no doubt sometimes legitimately, though the rhetoric of significance testing gives them no explicit way to do so.

A much deeper question about the statistical significance criterion concerns the probability foundations of econometric evidence. This is a topic that has gotten surprisingly little discussion. Statistical inferences are easiest to understand when they involve a chance set up [Hacking, 1965]. The two standard types of chance set ups invoked by statisticians are random samples from a population and random assignment of treatments. It is these chance set ups that allow us to draw inferences about the probability of seeing particular outcomes, given a maintained hypothesis. Current microeconometric studies that depend on random samples to collect survey data do have a foundation in a chance set up, and thus the probability foundations of their significance claims are clear. The problem, however, is that much econometric work involves neither a random sample nor randomization.

This lack of either tool in much economic statistics discouraged the use of inferential statistics until the “Probability Revolution” of Haavelmo [1944]. Haavelmo suggested that we treat a set of economic data as a random draw from a hypothetical population consisting of other realizations of the main economic variables along with their respective measurement errors and minor unknown causes. However, he gives no detailed account of what this entails nor of what evidence would show it valid. The profession adopted the metaphor and began using the full apparatus of modern statistics without much concern for whether there is a real chance set up to ground inferences. The practice continues unabated today.

One fairly drastic move made by some notable econometricians such as Leamer and commentators such as Keuzenkamp is to take a staunch antirealist position. Thus Leamer [Hendry et al., 1990] doubts that there is a true data generating process. Keuzenkamp [2000], after surveying many of the issues mentioned here, concludes that econometric methods are tools to be used, not truths to be believed. If the goal of econometrics is not to infer the true nature of the economic realm but only to give a perspicuous rendering of the data according to various formal criteria, then worries about the chance set up are irrelevant. Obviously this is surrendering the idea of an economic science that tells us about causes and possible policy options. It seems that when the logic of science ideal confronts difficulties in inferring successfully about the real world, the latter is being jettisoned in favor of the former.

The best defense given by those still interested in the real world probably comes from the practices of diagnostic testing in econometrics. The thought is that we can test to see if the data seem to be generated by a data generating process with a random component. So we look at the properties of the errors or residuals in the equations we estimate. If the errors are orthogonal to the variables and approximate a normal distribution, then we have evidence for a randomizing process. The work of Spanos [2000] and Hoover and Perez [1999], for example, can be seen as advocating a defense along these lines.
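The following short sketch (added here; the simulated data and library choices are assumptions, not from the chapter) illustrates the kind of residual diagnostics being described: estimate a simple regression, then test the residuals for normality and for lag-1 independence.

```python
# Illustrative sketch on simulated data: regress y on x, then run diagnostic
# tests on the residuals (normality and lag-1 independence).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=200)   # DGP with a random error term

result = stats.linregress(x, y)
residuals = y - (result.intercept + result.slope * x)

jb_stat, jb_p = stats.jarque_bera(residuals)                    # H0: residuals are normal
lag1_r, lag1_p = stats.pearsonr(residuals[:-1], residuals[1:])  # H0: no serial correlation

print(f"Jarque-Bera normality test: p = {jb_p:.3f}")
print(f"Lag-1 residual correlation: r = {lag1_r:.3f} (p = {lag1_p:.3f})")
```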

These issues are complicated and a real assessment would be a chapter in itself. But I can sketch some issues and naturalist themes. First, if tests of statistical significance on residuals are seen as decisive evidence that we have good reason to believe that we have a random draw from many different hypothetical realizations, then we confront all the problems about over interpreting significance tests. These tests have the same problem pointed out about using significance tests as an epistemic criterion. We do not have a grip on the prospects for error unless we have at least a power calculation and background knowledge about prior plausibility. So we do not know what to infer from a diagnostic test of this sort. Moreover, there is also the problem concerning what is the chance set up justifying this diagnostic test in the first place. Taken in frequentist terms, the test statistic must be some kind of random draw itself. So the problem seems to be pushed back one more step.

However, despite these problems, there is perhaps a way to take diagnostic testing as a valuable aid in justifying probability foundations if we are willing to make it one component in an overall empirical argument of the sort that naturalists think is essential. A significance test on residuals for normality or independence, for example, can be seen as telling us the probability of seeing the evidence in hand if it had been generated from a process with a random component. That does not ensure that the hypothesis was plausible to begin with, nor tell us what the prospects of false positives are, but it does give us evidence about p(E | H = randomly generated residuals). If that information is incorporated into an argument that provides these other components, then it can play an important role. In short, the significance test is not telling us that we have a random element in the data generating process; it is telling us what the data would look like if we did.

These issues have natural connections to debates over “data mining” and I want to turn to them next. A first point to note is that “data mining” is often left undefined. Let's thus begin by distinguishing the different activities that fall under this rubric:

Finding patterns in a given data set

This is the sense of the term used by the various journals and societies that actively and positively describe their aim as data mining. “Finding patterns” has to be carefully distinguished from the commonly used phrase “getting all the information from the data” where the latter is sufficiently broad to include inferences about causation and about larger populations. Finding patterns in a data set can be done without using hypothesis testing. It thus does not raise issues of accommodation and prediction nor the debates over probability between the Bayesians and frequentists.

Specification searches

Standard econometric practice involves running multiple regressions that drop or add variables based on statistical significance and other criteria. A final equation is thus produced that is claimed to be better on statistical grounds.

Diagnostic testing of statistical assumptions

Testing models against data often requires making probability assumptions, e.g. that the residuals are independently distributed. As Spanos [2000] argues, this should not be lumped with the specification searches described above — there is no variable dropping and adding based on tests of significance.

Senses 1 and 3, I would argue, are clearly unproblematic in principle (execution is always another issue). The first form is noninferential and thus uncontroversial. The third sense can be seen as an instance of the type of argument for ruling out chance that I defended above for the use of significance tests. Given this interpretation (rather than one where the results all by themselves are thought to confirm a hypothesis), this form of data mining is not only defensible but essential.

The chief compelling complaint about data mining concerns the difficulties of interpreting the frequentist statistics of a final model of a specification search. Such searches involve multiple significance tests. Because rejecting a true null at a p-value of .05 means that one in twenty times the null will be wrongly rejected, the multiple tests must be taken into account. For simple cases there are various ways to correct for such multiple tests whose reliability can be analytically verified; standard practice in biostatistics, for example, is to use the Bonferroni correction [1935], which in effect imposes a penalty for multiple testing in terms of the p-values required. As Leamer points out, it is generally the case that there are no such analytic results to make sense of the very complex multiple hypothesis testing that goes on in adding and dropping variables based on statistical significance: the probability of a type I error on repeated uses of the data mining procedure is unknown despite the fact that significance levels are reported.
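As a concrete illustration of the Bonferroni idea (an added sketch; the choice of 20 tests is arbitrary and the calculation assumes independent tests):

```python
# Sketch: with m independent tests at level alpha, the chance of at least one
# false rejection under true nulls grows quickly; Bonferroni holds each test
# to the stricter level alpha / m.
alpha, m = 0.05, 20

fw_uncorrected = 1 - (1 - alpha) ** m              # about 0.64
per_test_threshold = alpha / m                     # 0.0025
fw_corrected = 1 - (1 - per_test_threshold) ** m   # about 0.049

print(f"P(at least one false rejection), uncorrected: {fw_uncorrected:.2f}")
print(f"Bonferroni per-test threshold:                {per_test_threshold:.4f}")
print(f"P(at least one false rejection), corrected:   {fw_corrected:.3f}")
```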

Mayer [2000] has argued that the problems with data mining can best be solved by simply reporting all the specifications tried. However, fully describing the procedure used and the models tested does not solve the problem. We simply do not know what to make of the final significance numbers (nor the power values either on the rare occasions when they are given) even if we are given them all.

Hoover and Perez [1999] provide an empirical defense that might seem at first glance a way around this problem. Perhaps we do not need a frequentist interpretation of the test statistics if we can show on empirical grounds that specific specification search methods, e.g., Hendry's general-to-specific modeling, produce reliable results. Hoover, using Monte Carlo simulations to produce data where the true relationship is known, shows that various specification search strategies, particularly general-to-specific modeling, can do well in finding the right variables to include.

However, there is still reason to be skeptical. First, Hoover's simulations assume that the true model is in the set being tested (cf. [Granger and Timmermann, 2000]). That would seem not to be the case for many econometric analyses, where there are an enormous number of possible models because of the large number of possible variables and functional forms. There is no a priori reason this must always be the case, but once again, our inferences depend crucially on the background knowledge that allows us to make such judgments. These assumptions are particularly crucial when we want to get to the correct causal model, yet there is frequently no explicit causal model offered. Here is thus another case where the frequentist hope to eschew the use of priors will not work.

Moreover, Hoover's simulations beg important questions about the probabilistic foundations of the inferences. His simulations involve random sampling from a known distribution. Yet in practice distributions are not known, and we need to provide evidence that we have a random sample. These are apparently provided in Hoover's exercise by fiat, since the simulations assume random samples [Spanos, 2000].

However, the problems identified here with specification searches have their roots in frequentist assumptions, above all the assumption that we ought to base our beliefs solely on the long-run error characteristics of a test procedure. The Bayesians argue, rightly on my view, that one does not have to evaluate evidence in this fashion. They can grant that deciding what to believe on the basis of, say, repeated significance tests can lead to error. Yet they deny that one has to (and, more strongly and unnecessarily for the point I am making here, can coherently) make inferences in such a way. Likelihoods can be inferred using multiple different assumptions about the distribution of the errors, and a pdf calculated. The mistakes you would make if you based your beliefs solely on the long-term error rates of a repeated significance testing procedure are irrelevant to such calculations. Of course, Bayes' theorem is still doing little work here; all the force comes from providing an argument establishing which hypotheses should be considered and what they entail about the evidence.

So data mining can be defended. By frequentist standards, data mining in the form of specification searches cannot be defended. However, those standards ought to be rejected as a decisive criterion in favor of giving a complex argument. When Hoover and others defend specification searches on the empirical grounds that they can work, rather than on the grounds of their analytic asymptotic characteristics, they are implicitly providing one such argument.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780444516763500051

Analysis of Variance

B.M. King, in International Encyclopedia of Education (Third Edition), 2010

Interpreting a Significant F Value

Independent-groups ANOVA can be used with two samples, in which case F is the square of the t-statistic that compares the two sample means. A statistically significant result indicates that one population mean is either less than or greater than the other. What does it mean when we obtain a statistically significant value of F for three or more samples? In this case it tells us only that there is a difference among the populations. It does not tell us the manner in which they differ. For three groups, all three population means could be different from one another, or one could be greater than the other two, etc. To determine which means are significantly different from others, we normally use post hoc (a posteriori) comparisons. Some of the most commonly used tests are Duncan's multiple-range test, the Newman–Keuls test, Tukey's HSD test, and the Scheffé test. Duncan's test is the least conservative with regard to type I error and the Scheffé test is the most conservative. An explanation of these tests is beyond the scope of this article, but most textbooks will provide a full explanation of one or more of them. However, before you can use any of them you must first have obtained a significant value of F. In our example, all four post hoc tests would reveal that teaching method 2 is superior to the other two methods, which did not significantly differ from one another.
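The two-step logic described above (a significant F first, then post hoc pairwise comparisons) can be sketched as follows. This is an illustration only: the chapter's teaching-method data are not reproduced in this excerpt, so the three groups below are hypothetical, and the sketch assumes a recent SciPy that provides scipy.stats.tukey_hsd.

```python
# Hypothetical data for three teaching methods (illustration only).
from scipy.stats import f_oneway, tukey_hsd

method1 = [72, 75, 70, 68, 74, 71]
method2 = [85, 88, 83, 90, 86, 84]
method3 = [73, 70, 69, 75, 72, 71]

f_stat, p_value = f_oneway(method1, method2, method3)   # omnibus one-way ANOVA
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:                       # only then are post hoc comparisons run
    print(tukey_hsd(method1, method2, method3))   # pairwise Tukey HSD comparisons
# For very skewed data, scipy.stats.kruskal offers a nonparametric alternative.
```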

There are some underlying assumptions associated with the use of ANOVA. The first is that the populations from which the samples are drawn are normally distributed. Moderate departure from the normal bell-shaped curve does not greatly affect the outcome, especially with large-sized samples (Glass et al., 1972). However, results are much less accurate when populations of scores are very skewed or multimodal (Tomarken and Serlin, 1986), which is frequently the case in the behavioral sciences (Micceri, 1989). In this case, you should consider using the Kruskal–Wallis test, an assumption-freer (nonparametric) test for the independent-groups design (see King and Minium, 2008). This is especially true when using small samples. A second assumption is that of homogeneity of variance, that is, the variances in the populations from which samples are drawn are the same. However, this is a major problem only when variances differ considerably, and is less of a problem if you use samples that are of the same size (Milligan et al., 1987; Tomarken and Serlin, 1986).

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947013063

UX Evaluation: Reporting Results

Rex Hartson, Pardha Pyla, in The UX Book (Second Edition), 2019

27.2.2 Reporting Qualitative Results—The UX Problems

All UX practitioners should be able to write clear and effective reports about problems found but, in their “CUE-4” studies, Dumas, Molich, and Jeffries (2004) found that many cannot. They observed a large variation in reporting quality over teams of usability specialists, and that most reports were inadequate by their standards.

If you use rapid evaluation methods for data collection, it is especially important to communicate effectively about the analysis and results because this kind of data can otherwise be dismissed easily “as unreliable or inadequate to inform design decisions” (Nayak, Mrazek, and Smith, 1995). Even in empirical evaluation, though, the primary type of data from formative evaluation is qualitative, and raw qualitative data must be skillfully distilled and interpreted to avoid the impression of being too “soft” and subjective.

27.2.2.1 Common Industry Format for reporting

We don’t include formal summative evaluation in typical UX practice, but the US National Institute of Standards & Technology (NIST) did initially produce a Common Industry Format (CIF) for reporting formal summative UX evaluation results.

Formal Summative Evaluation

A formal, statistically rigorous summative (quantitative) empirical UX evaluation that produces statistically significant results (Section 21.1.5.1).

Following this initial effort, the group (under the direction of Mary Theofanos, Whitney Quesenbery, and others) organized two workshops in 2005 (Theofanos, Quesenbery, Snyder, Dayton, and Lewis, 2005) aimed at a CIF for formative evaluation reports (Quesenbery, 2005; Theofanos and Quesenbery, 2005).

In this work, they recognized that, because most evaluations conducted by usability practitioners are formative, there was a need for an extension of the original CIF project to identify best practices for reporting formative results. They concluded that requirements for content, format, presentation style, and level of detail depended heavily on the audience, the business context, and the evaluation techniques used.

While their working definition of “formative testing” was based on having representative users, here we use the slightly broader term “formative evaluation” to include usability inspections and other methods for collecting formative usability and user experience data.

Inspection (UX)

An analytical evaluation method in which a UX expert evaluates an interaction design by looking at it or trying it out, sometimes in the context of a set of abstracted design guidelines. Expert evaluators are both participant surrogates and observers, asking themselves questions about what would cause users problems and giving an expert opinion predicting UX problems (Section 25.4).

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128053423000278

Hypothesis Testing

Andrew F. Siegel, Michael R. Wagner, in Practical Business Statistics (Eighth Edition), 2022

Abstract

In this chapter, you will learn how hypothesis testing uses data to decide between two possibilities, often to distinguish structure from mere randomness as a helpful input to executive decisionmaking. We will define a hypothesis as any statement about the population; the data will help you decide which hypothesis to accept as true. There will be two hypotheses that play different roles: The null hypothesis represents the default, to be accepted in the absence of evidence against it; the research hypothesis has the burden of proof, requiring convincing evidence for its acceptance. Accepting the null hypothesis is a weak conclusion, whereas rejecting the null and accepting the research hypothesis is a strong conclusion and leads to a statistically significant result. Every hypothesis test can produce a p-value (using statistical software) that tells you how surprised you would be to learn that the null hypothesis had produced the data, with smaller p-values indicating more surprise and leading to significance. By convention, a result is statistically significant if p < 0.05, is highly significant if p < 0.01, is very highly significant if p < 0.001, and is not significant if p > 0.05. There are two types of errors that you might make when testing a hypothesis. The type I error is committed when the null hypothesis is true but you reject it and (wrongly) declare that your result is statistically significant; the probability of this error is controlled conventionally at the 5% level (but you may set this test level or significance level to be other values, such as 1%, 0.1%, or perhaps even 10%). The type II error is committed when the research hypothesis is true but you (wrongly) accept the null hypothesis instead and declare the result not to be significant; the probability of this error is not easily controlled.
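A small sketch of the reporting convention described in the abstract (the helper function below is ours, not the book's):

```python
# Map a p-value to the conventional significance wording given in the abstract.
def significance_label(p: float) -> str:
    if p < 0.001:
        return "very highly significant (p < 0.001)"
    if p < 0.01:
        return "highly significant (p < 0.01)"
    if p < 0.05:
        return "statistically significant (p < 0.05)"
    return "not significant (p > 0.05)"

for p in (0.0004, 0.008, 0.03, 0.21):
    print(f"p = {p}: {significance_label(p)}")
```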

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128200254000105

Clinical Development and Statistics

Joseph Tal, in Strategy and Statistics in Clinical Trials, 2011

Statistical Input

Many compounds do not go directly from Phase I into full-fledged trials like the one proposed. A smaller pilot study is probably more common and, under the circumstances, perhaps more advisable. In fact, you have no idea where the number 150 came from and suspect it had more to do with budgets and stock prices than with the development program's needs. Regardless, this is what you have been given and it is substantial. But “substantial” does not necessarily mean “sufficient,” the relationship between the two depending on the case at hand.

Numbers—as large or small as they may seem at first—cannot be evaluated without a context. A 9-year-old child selling lemonade in front of her family's garage might feel that taking in $30 on a single Sunday makes her the class tycoon. But offer her the same in a toy store and she might complain of underfunding (and, frighteningly, might use these very words).

Be that as it may, this is what you have and you must make the best of it. Still, you are not going to take the numbers proposed as set in stone, and one of your first questions is whether they will provide your development program with the information needed. Specifically, will this study produce enough data for making an informed decision on taking CTC-11 into the next level of testing?

The statistician's role here is central. He will likely begin with straightforward power analyses, which here relate to calculations determining the number of subjects needed for demonstrating the drug's efficacy.1

In a future chapter we will deal with power analysis in greater detail. For the moment let us point out that to do these analyses a statistician needs several pieces of information. The most important of these is an estimate of the drug's effect size relative to Control. For example, stating that CTC-11 is superior to Control by about 10% is saying that the drug's effect size is about 10%.2
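To show the kind of calculation involved, here is a hedged sketch of a standard two-proportion sample-size formula. The response rates (Control 50%, CTC-11 60%, treating the "about 10%" as a 10-percentage-point difference), the two-sided α of 0.05, and the 80% power are all illustrative assumptions; the chapter supplies none of these numbers.

```python
# Illustrative sample-size calculation for comparing two proportions.
# All input numbers are assumptions, not values from the chapter.
from scipy.stats import norm

p_control, p_drug = 0.50, 0.60      # hypothetical response rates
alpha, power = 0.05, 0.80

z_a = norm.ppf(1 - alpha / 2)       # about 1.96
z_b = norm.ppf(power)               # about 0.84

variance = p_control * (1 - p_control) + p_drug * (1 - p_drug)
n_per_arm = (z_a + z_b) ** 2 * variance / (p_control - p_drug) ** 2

print(f"Subjects needed per arm: {n_per_arm:.0f}")   # ~385 per arm under these assumptions
```

Under these assumptions the proposed 150 subjects (75 per arm) would fall well short, which is exactly the kind of gap a power analysis is meant to expose.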

The statistician will get these estimates from clinicians and others in the organization. But he should also review results obtained to date within the Company and read some scientific publications on the subject. To do this he will need assistance from life-scientists, without whose help he will have difficulty extracting the required information from medical publications.

This is but one example of professionals from different fields needing to interact in trial planning. In this book I will note many more. So while statisticians need not have deep knowledge of biology or chemistry or medicine, they should be sufficiently conversant in these disciplines to conduct intelligent discussions with those who are. And the same goes for life-scientists, who would do well to be conversant in statistics.

Once acquired, the statistician will incorporate this information into his power analyses. These will yield sample sizes that will be more useful than those proposed primarily on the basis of financial considerations. If management's proposal and the power analyses produce very different sample sizes, you will (alas) have another opportunity for multidisciplinary interaction.

A Difficulty Within a Problem

You have asked the statistician to compute the required sample size that will ensure your trial is a success: the number of subjects that will provide sufficient information for making future decisions on CTC-11. The statistician, in turn, has asked you for information; he has requested that you estimate the effect of the drug relative to Control. On the face of it, this is a silly request. After all, you are planning to conduct a trial precisely to discover this effect, so how can you be expected to know it before conducting your trial? To tell the truth, you cannot know it. But you can come up with an intelligent guess and have no choice but to do so. Indeed, the need to estimate an effect size in order to plan a trial whose purpose is to estimate that effect size arises often. We shall deal with it later, but for the moment let me assure you it is not as problematic as it sounds.

When determining sample size, the statistician will do well to talk with physicians and marketing personnel regarding the kind of CTC-11 efficacy needed for the drug to sell. Incorporating this information into power analyses will provide the Company with data on how valuable (or not) trials of varying sizes are likely to be from the standpoint of assessing market need.

The statistician should also expand his exploration to alternative study designs—not just the initially proposed six-month study of 150 subjects in two arms. Some of these designs will require fewer resources, while others will require more. He might, for example, examine a scenario where the larger trial is replaced with a smaller pilot study of 10 to 30 subjects. This sort of study could provide a more realistic estimate of the drug's effect in humans—an estimate that is now lacking. Once the pilot study is done, there will be more reliable information for planning the larger trial.

The larger the trial, the more informative the data obtained from it. But, as Goldilocks demonstrated years ago, strength does not necessarily reside in numbers; if a smaller trial can provide us with the required information, we should prefer it to a larger one. Conversely, if the larger study has little potential to provide the required data and an even larger trial is needed, you would do well to forgo the former and request more resources.

So a small pilot may be just what the statistician ordered. But this pilot will come at a price: A two-stage approach—a pilot and subsequent, larger trial—will slow down the development process. Moreover, given the fixed budget, any pilot will come at the expense of resources earmarked for the second stage. Here too there is more than one option. For example, you can design a standalone pilot and reassess development strategy after its completion. Alternatively, you can design the larger study with an early stopping point for interim analysis—an early check of the results. Once interim results are in, the information can be used to modify the remainder of the trial if needed.

These two approaches—one that specifies two studies and another that implies a single, two-stage study—can have very different implications for the Company. They differ in costs, logistics, time, flexibility, and numerous other parameters. The choice between them should be considered carefully.

For the moment let us simply state that the statistician's role is central when discussing trial sample size—the number of subjects that should be recruited for it. At the same time, it is very important for those requesting sample size estimates to actively involve statisticians in discussions dealing with a wider range of topics as well—for example, the drug's potential clinical effects and alternatives to the initially proposed design. And given that it takes at least two to trial, it is critical that the statistician be open-minded enough to step out of his equation-laden armor and become cognizant of these issues.

In sum, the fact that a relatively large sample size has been proposed for this early trial does not necessarily imply that it will provide the information needed. Together with your colleagues in R&D, logistics, statistics, and elsewhere, you should discuss all realistic alternatives: there can be two trials instead of one, a single two-stage trial, trials with more or fewer than two arms, a longer trial or a shorter one, and so on.

Now all this may seem a bit complicated, and it can be. At the same time you should keep in mind that because your budget is limited, the universe of possibilities is restricted as well; covering all, or nearly all, study design possibilities given fixed resources is definitely doable.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123869098000015

Hypothesis Testing: Concept and Practice

ROBERT H. RIFFENBURGH, in Statistics in Medicine (Second Edition), 2006

5.3. TWO POLICIES OF TESTING

A BIT OF HISTORY

During the early development of statistics, calculation was a major problem. It was done by pen in Western cultures and by abacus (7-bead rows) or soroban (5-bead rows) in Eastern cultures. Later, hand-crank calculators and then electric ones were used. Probability tables were produced for selected values with great effort, the bulk being done in the 1930s by hundreds of women given work by the U.S. Works Progress Administration during the Great Depression. It was not practical for a statistician to calculate a p-value for each test, so the philosophy became to make the decision of acceptance or rejection of the null hypothesis on the basis of whether the p-value was bigger or smaller than the chosen α (e.g., p ≥ 0.05 vs. p < 0.05) without evaluating the p-value itself. The investigator (and reader of published studies) then had to not reject H0 if p were not less than α (result not statistically significant) and reject it if p were less (result statistically significant).

CALCULATION BY COMPUTER PROVIDES A NEW OPTION

With the advent of computers, calculation of even very involved probabilities became fast and accurate. It is now possible to calculate the exact p-value, for example, p = 0.12 or p = 0.02. The user now has the option to make a decision and interpretation on the exact error risk arising from a test.

CONTRASTING TWO APPROACHES

The later philosophy has not necessarily become dominant, especially in medicine. The two philosophies have generated some dissension among statisticians. Advocates of the older approach hold that sample distributions only approximate the probability distributions and that exactly calculated p-values are not accurate anyway; the best we can do is select a “significant” or “not significant” choice. Advocates of the newer approach—and these must include the renowned Sir Ronald Fisher in the 1930s—hold that the accuracy limitation is outweighed by the advantages of knowing the p-value. The action we take about the test result may be based on whether a not-significant result suggests a most unlikely difference (perhaps p = 0.80) or is borderline and suggests further investigation (perhaps p = 0.08) and, similarly, whether a significant result is close to the decision of having happened by chance (perhaps p = 0.04) or leaves little doubt in the reader's mind (perhaps p = 0.004).

OTHER FACTORS MUST BE CONSIDERED

The preceding comments are not meant to imply that a decision based on a test result depends solely on a p-value, the post hoc estimate of α. The post hoc estimate of β, the risk of concluding there is no difference when in fact there is one, is also germane. And certainly the sample size and the clinical difference being tested must enter into the interpretation. Indeed, the clinical difference is often the most influential of the values used in a test equation. The comments on the interpretation of p-values relative to one another do hold for adequate sample sizes and realistic clinical differences.

SELECT THE APPROACH THAT SEEMS MOST SENSIBLE TO YOU

Inasmuch as the controversy is not yet settled, users may select the philosophy they prefer. I tend toward the newer approach.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780120887705500447
