After getting a general sense of the construct validity of the instrument, we sometimes wish to know whether the instrument is able to measure change over time. This matters only if you are going to measure and make claims about a change in score (e.g., that a change of 50 points was observed). If your goal is simply to describe an outcome, such as the level of pain after a treatment, you may have enough information at step 3. In the original OMERACT Filter, this was the “discrimination” component of the information need in Filter 2. This is expanded and contextualized to answer the question “Can the instrument discriminate well enough to detect the change you need to be able to detect?” The hallmark of the ability to measure change in a group is twofold: first, do the scores remain the same when the target concept has not changed over a period of time (test-retest reliability), and second, when the concept does change, does the score on the instrument change as well (responsiveness, or sensitivity to the target construct of change)?

Test-Retest Reliability

Test-retest reliability requires two administrations of the instrument during a period in which no change in the target concept has occurred. As a reader of reliability studies, you should feel convinced that no change in the target (e.g., pain, function, or disease activity) would have occurred in these patients in this situation.14,96,97 Often, people conducting studies of test-retest reliability will establish a clinical situation in which no change should have occurred, or they use an external anchor (e.g., a question about whether the patient's pain is the same as last time) to find patients who have not changed. As with interobserver reliability, the ICC is the preferred statistic for continuous scores, and weighted kappa, its equivalent, for categorical scores.95 The cutoffs are the same, and a coefficient can be converted into a “minimal detectable change,”98 MDC = 1.96 × s × √(2(1 − r)), where s is the standard deviation and r is the test-retest reliability (ICC).85,98 Ninety-five percent of people who are stable will have change scores less than this value; hence, a change greater than this is unlikely to occur in a stable patient. This becomes a lower boundary of meaningful change: anything below that boundary could simply be day-to-day fluctuation in scores.
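
As a rough illustration of how the minimal detectable change follows from this formula, the sketch below (in Python, using hypothetical values for the standard deviation and ICC rather than figures from any cited study) computes MDC95 for a scale:

```python
import math

def minimal_detectable_change(sd: float, icc: float, z: float = 1.96) -> float:
    """MDC = z * SD * sqrt(2 * (1 - r)), with r the test-retest reliability (ICC)."""
    return z * sd * math.sqrt(2.0 * (1.0 - icc))

# Hypothetical example: a 0-100 pain scale with SD = 12 points and ICC = 0.85.
mdc95 = minimal_detectable_change(sd=12.0, icc=0.85)
print(f"MDC95 = {mdc95:.1f} points")  # about 12.9; smaller changes may be day-to-day noise
```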

Responsiveness is perhaps best thought of as longitudinal construct validity. Like construct validity, it depends on an a priori theoretical relationship in which the attribute changes over time (e.g., change in pain with time). We all too often focus on the amount of change recorded, rather than on how well the change in the instrument's scores matches the type or amount of change that has actually occurred and was expected in that testing situation. A large change is not useful if we were expecting a small one; it only suggests error and noise. The amount of change expected in a study of responsiveness should be carefully described and should be directly associated with the intended application (i.e., measurement need).99 If the goal is to detect change in a clinical trial, then it is important to assess the instrument's ability to detect the difference in change between treatment and control groups. If the goal is to detect change in a cohort, it might be more useful to examine change in a single group, perhaps in a treatment of known efficacy that is close to your intended application (e.g., hip replacement) or in people who rated themselves as improved on an external anchor (e.g., a global index of change). Responsiveness is summarized with statistics of signal (change) over noise (error), such as the standardized response mean (mean change/standard deviation of change), the t statistic (mean change/standard error), or the effect size (mean change/standard deviation at baseline).95 These statistics have little meaning on their own without an a priori construct of change (expected magnitude and direction).
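
As a sketch of how these signal-over-noise summaries are computed, the example below uses simulated paired scores (hypothetical numbers, not data from any cited study) to contrast the standardized response mean, paired t statistic, and effect size for one group measured before and after treatment:

```python
import numpy as np

def responsiveness_stats(baseline: np.ndarray, follow_up: np.ndarray) -> dict:
    """Signal (mean change) over noise summaries for one group of paired scores."""
    change = follow_up - baseline
    n = len(change)
    return {
        "SRM": change.mean() / change.std(ddof=1),               # mean change / SD of change
        "t": change.mean() / (change.std(ddof=1) / np.sqrt(n)),  # mean change / SE of change
        "ES": change.mean() / baseline.std(ddof=1),              # mean change / SD at baseline
    }

# Simulated 0-100 pain scores for 50 patients; negative change means improvement.
rng = np.random.default_rng(0)
baseline = rng.normal(60, 12, size=50)
follow_up = baseline - rng.normal(15, 10, size=50)
print(responsiveness_stats(baseline, follow_up))
```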

When the target application is quantifying the relative change between treatment and control groups, such as in a clinical trial, the scale must be able to discriminate at that level and in groups similar to those in the target application. The effect size statistics described above can be adapted for relative change between two groups,72,100 or a standardized mean difference can be calculated.
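
A minimal sketch of the between-group version, assuming simple change scores are available for hypothetical treatment and control groups, is a standardized mean difference using the pooled standard deviation of change:

```python
import numpy as np

def standardized_mean_difference(change_tx: np.ndarray, change_ctrl: np.ndarray) -> float:
    """Difference in mean change between groups, divided by the pooled SD of change."""
    n1, n2 = len(change_tx), len(change_ctrl)
    pooled_var = ((n1 - 1) * change_tx.var(ddof=1) +
                  (n2 - 1) * change_ctrl.var(ddof=1)) / (n1 + n2 - 2)
    return (change_tx.mean() - change_ctrl.mean()) / np.sqrt(pooled_var)
```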

Deyo also described the correlational approach (correlating change on the instrument with another indicator of change) as a direct parallel to cross-sectional construct validity.101 He also suggested a receiver operating characteristic (ROC) curve approach, in which various change scores are evaluated against an external gold standard of whether the person has changed. This offers information on the sensitivity and specificity of different change scores for application to individuals, as well as the area under the curve as a summary of discrimination between changed and unchanged groups.101 All of these approaches depend on the external anchor of change having occurred, and establishing this is an important part of the study. Regardless of the approach, the numeric summaries of responsiveness, such as effect sizes or areas under the curve, should correspond to the type of change (magnitude and direction) expected in the a priori theory. A large effect size or area under the curve does not mean an instrument can be stamped as “responsive”; it should correspond with the change anticipated in the study, small or large. Comparisons of effect sizes are helpful when different instruments are compared in the same study, as was done by Buchbinder72 or Verhoeven100 in early IA patients, but caution should be used when comparing instruments across studies. Responsiveness is a highly contextualized property, and the same instrument may not be responsive in another situation (early vs. late disease, OA vs. IA, 2-week vs. 6-month follow-up).87
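
The ROC approach can be sketched with toy numbers: change scores are treated like a diagnostic test for improvement as judged by an external anchor (e.g., a global rating of change). The data below are invented purely for illustration; scikit-learn's roc_auc_score and roc_curve do the bookkeeping:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical change scores and anchor-based labels (1 = improved, 0 = unchanged).
change = np.array([2, 5, 1, 8, 12, 0, 4, 15, 6, 10])
improved = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])

print("AUC =", roc_auc_score(improved, change))      # discrimination: changed vs. unchanged
fpr, tpr, thresholds = roc_curve(improved, change)   # sensitivity/specificity per cut point
for thr, sens, spec in zip(thresholds, tpr, 1 - fpr):
    print(f"change >= {thr}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```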


URL: https://www.sciencedirect.com/science/article/pii/B9780323316965000334

Projective Assessment of Children and Adolescents

M. Jain, ... K. Kuehnle, in Reference Module in Neuroscience and Biobehavioral Psychology, 2017

Psychometric Foundations

A test-retest reliability study performed on 175 college students at an interval of 3 to 9 months revealed correlations ranging from 0.30 to 0.77 for the six scoring categories (Bernard, 1949). Interscorer reliability has been found to be high: Clarke et al. (1947) reported approximately 85% agreement between two independent scorers on the P–F test. Lindzey (1954) conducted a validity study using the Thematic Apperception Test (TAT) as the criterion measure and compared the scores on the two tests. He also compared P–F test scores between groups who were very high and very low in prejudice toward minorities. He found that scores on the TAT and the P–F test did not correlate on extrapunitiveness or intropunitiveness, and that no difference was found between the high- and low-prejudice groups.


URL: https://www.sciencedirect.com/science/article/pii/B9780128093245050586

Testing the Scapulothoracic Joint

Todd S. Ellenbecker MS, PT, SCS, OCS, CSCS, in Clinical Examination of the Shoulder, 2004

Test-Retest Reliability

Kibler (1998) performed a test-retest reliability investigation to assess both intratester and intertester reliability. Intraclass correlation coefficients (ICC) were between 0.84 and 0.88 for intratester reliability, with similar coefficients reported in all three positions of testing. Intertester reliability coefficients ranged from 0.77 to 0.85. These reliability coefficients indicate acceptable levels of reproducibility for the use of this clinical test (Portney & Watkins, 1993).

Additional studies have independently evaluated the Kibler LSST. Gibson et al (1995) reported intratester reliability of 0.81 to 0.94 and intertester reliability of 0.18 to 0.92. T'Jonck et al (1996) reported similar ICCs for intratester reliability (0.69 to 0.96) and ICCs for intertester reliability ranging between 0.72 and 0.90. In addition to the reliability coefficients reported, Gibson et al (1995) and T'Jonck et al (1996) identified lower intratester and intertester correlation coefficients with Kibler position 3. All of the researchers acknowledge the increased difficulty in palpating the inferior angle of the scapula in position 3 because of the greater contraction of the muscles surrounding the scapula itself (Kibler, 1998a; Gibson et al, 1995; T'Jonck et al, 1996).

Odem et al (2001) published a test-retest reliability study that conflicted with earlier studies of the Kibler LSST. The reliability research by Kibler (1998a), Gibson et al (1995), and T'Jonck et al (1996) all tested the distances between the inferior angle of the scapula and the vertebral spinous process. Odem et al (2001) tested the actual bilateral difference in subjects and found lower test-retest reliability coefficients ranging from 0.52 to 0.80 for intratester conditions and 0.43 to 0.79 for intertester conditions. They concluded that the Kibler test had compromised reliability and that caution should be used in interpretation of test results. This information is in contrast to the other reliability studies on the LSST.


URL: https://www.sciencedirect.com/science/article/pii/B9780721698076500060

Test–Retest Reliability

Chong Ho Yu, in Encyclopedia of Social Measurement, 2005

Sample Issues

It is highly recommended that researchers estimating test–retest reliability recruit more subjects than they need, because when multiple tests are given to the same group of subjects, some are likely to drop out before a later session. Besides sample size, the quality of the sample is also important to test–retest reliability studies, and quality is tied to sample representativeness. If a researcher designs a test for clinical groups, it is essential that the test–retest reliability information be obtained from those particular groups. For example, patients with schizophrenia are said to be difficult to test; their mental state is expected to be unstable, and thus the use of normal subjects is sometimes necessary. However, in a clinical setting, a reliability estimate obtained from a normal sample for a test designed for use with abnormal samples may be problematic. Take the Drug Use History Form (DUHF) as another example. The DUHF is designed to track usage of drugs among drug users, and its test–retest reliability information is crucial for clinicians carrying out treatment-effectiveness studies. However, due to the physical and emotional weaknesses of drug users, self-report data from drug users may be scarce, and hence nondrug users sometimes participate in test–retest reliability studies of the DUHF. As a result, contamination by nondrug users artificially inflates the test–retest reliability coefficients of the DUHF. The problem of sample representativeness can also be found in widely used diagnostic tests in education. Reports of the test–retest reliability of Reading Disabilities (RD) tests have been questioned by certain researchers, because individuals who experience difficulty with reading may exhibit a limited range of reading performance; multiple measures of people who could not read at all would not yield a meaningful test–retest result. To counteract this shortcoming, it has been suggested that measures of RD be based upon samples who have acquired basic reading skills through intervention.

Another controversial aspect of sample representativeness is the use of convenience sampling rather than true random sampling. Traub criticized the fact that the “canon” of sample representativeness has been overlooked by many social scientists, because reliability studies are very often conducted on convenience samples consisting of subjects who are readily accessible to the experimenter. Traub doubted whether reliability estimates based upon such “grab-bag” samples could be generalized beyond the sample itself. While Traub's criticism is true to some extent, other researchers have argued that true random sampling is an idealization: it is extremely difficult, if not impossible, to randomly draw samples from the target population to which the inference is made. For example, if the target population is drug users in the United States, subjects should ideally be randomly selected from drug users in all 50 states, but in practice this goal may be very difficult to accomplish. Some have suggested that broader generalization from convenience samples is still possible when different reliability studies using the same instrument are carried out in different places and meta-analytic techniques are then employed to synthesize the results.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000943

Evaluation of Nutrition Interventions

ALAN R. KRISTAL, JESSIE A. SATIA, in Nutrition in the Prevention and Treatment of Disease, 2001

B. Reliability

Two types of reliability are important for evaluating intervention trials. Test–retest reliability measures agreement between multiple assessments; in practice, this means that a measure taken on one day should be strongly correlated with a measure taken on another day. Although no measure has perfect reliability, measures of daily nutrient intake or specific dietary behavior have particularly low reliability because of the variability in the amounts and types of foods people eat from day to day. This type of variability, termed intra-individual or within-person variability, makes even a perfect assessment of a single day's diet not very informative for evaluating whether a person has changed their usual diet in response to an intervention.
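
A small simulation (hypothetical numbers, not values from the chapter) illustrates why large within-person variability makes a single day's intake a weak measure of usual diet: even when each person's usual intake is perfectly stable, the correlation between two one-day measures is attenuated toward the ratio of between-person variance to total variance.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

usual = rng.normal(70, 15, size=n)   # hypothetical "usual" daily fat intake (g), stable per person
within_sd = 25                       # day-to-day (within-person) fluctuation

day_1 = usual + rng.normal(0, within_sd, size=n)
day_2 = usual + rng.normal(0, within_sd, size=n)

# Expected test-retest correlation ≈ 15**2 / (15**2 + 25**2) ≈ 0.26
print(np.corrcoef(day_1, day_2)[0, 1])
```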

A second and entirely different type of reliability, which is relevant primarily to measures of psychosocial factors, is internal consistency reliability. Most psychosocial factors cannot be assessed directly (e.g., social support for eating low-fat foods), and they are generally measured using a set of items that, taken together, characterize the construct (e.g., “How much support do you get from your co-workers to select healthy foods from the cafeteria at lunch?”). Cronbach's alpha, which ranges from 0 to 1, is a measure of how well the mean of the scale items measures an underlying construct. High internal consistency reliability is a function of two factors: the average correlation among items in the scale and the number of items in the scale. Most scientists suggest a minimum of 0.7 for internal consistency; however, this ignores the practical problem that lengthy scales are not feasible in applied evaluations. When scales are restricted to three or four items, a Cronbach's alpha of 0.50 is quite satisfactory.
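
The sketch below (with a hypothetical items-by-respondents matrix) shows how Cronbach's alpha is typically computed, and why short scales rarely reach 0.7: for standardized items, alpha = k·r̄ / (1 + (k − 1)·r̄), so a 3-item scale with an average inter-item correlation of 0.25 yields an alpha of about 0.50.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Alpha as a function of scale length for a fixed average inter-item correlation (0.25):
r_bar = 0.25
for k in (3, 4, 8, 12):
    print(k, round(k * r_bar / (1 + (k - 1) * r_bar), 2))   # 3 -> 0.5, 8 -> 0.73, 12 -> 0.8
```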


URL: https://www.sciencedirect.com/science/article/pii/B9780121931551500118

Assessment

Tayla T.C. Lee, ... Cassidy L. Tennity, in Comprehensive Clinical Psychology (Second Edition), 2022

4.11.3.4.1 Reliability

Data reported in Ben-Porath and Tellegen (2020b) indicate that test-retest reliabilities of the substantive scales, from a subset of 275 individuals (Mdn interval = 8 days), were generally greater than 0.80. Median internal consistencies for men in the normative sample were 0.83, 0.78, 0.72, and 0.79 for the H-O, RC, SP, and PSY-5 scales, respectively. These values were similar for normative sample women (median α = 0.82, 0.77, 0.72, and 0.78 for the H-O, RC, SP, and PSY-5 scales, respectively). Internal consistency values were generally higher in a mental health outpatient sample, with median internal consistencies for all MMPI-3 scale families approximating 0.80 for both men and women. Across the normative and outpatient samples, scales assessing constructs that likely had restricted variability (e.g., psychotic phenomena) or fewer items (e.g., SP scales) tended to have lower reliability. However, standard errors of measurement were generally small and consistent. Overall, these data support that MMPI-3 scores are likely to be sufficiently precise to provide meaningful assessment of substantive constructs.
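
As a sketch of how the standard error of measurement links a reliability coefficient to score precision (hypothetical values, not MMPI-3 figures):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical T-score metric (SD = 10) with reliability 0.80:
s = sem(sd=10.0, reliability=0.80)
print(f"SEM = {s:.1f}; 95% band around an observed T score of 65: "
      f"{65 - 1.96 * s:.1f} to {65 + 1.96 * s:.1f}")
```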


URL: https://www.sciencedirect.com/science/article/pii/B9780128186978001813

Interest Inventories

S.H. Carson, in Encyclopedia of Creativity (Second Edition), 2011

Validity and Reliability of Vocation-Based Interest Inventories

Most popular vocational interest inventories available today appear to have acceptable validity and test–retest reliability according to published studies.

Because one of the main purposes of interest inventories is to assist in career counseling, the predictive validity of the tests is important. While there is variability across instruments, the most popular inventories have shown good predictive validity. Studies of popular inventories indicate a concordance of 50–75% between current occupations and occupations predicted by interest inventories 12–19 years earlier.

A second purpose of interest inventories is to predict future job satisfaction. One premise of the inventories is that when individual interests are matched with occupational demands, career satisfaction should improve. Several studies have investigated job satisfaction in terms of interest-occupation compatibility as measured by interest inventories. The results have been mixed. While some studies show only small associations between job satisfaction and interest-occupation congruence, other studies show that congruence is an efficient predictor of job satisfaction when between-occupation sources of variance are controlled.

A number of studies have also examined validity and reliability for interest inventories across racial and ethnic groups. The SII has shown validity in Chinese groups and in Icelandic college-age groups. There is evidence for validity in college-educated groups of African Americans, Latinos, Native Americans, and Asian Americans as well. However, it is unclear whether the validity of the inventories extends to minority groups with lower education levels.

Most research indicates that it is preferable to use interest inventories that have been developed specifically with the group to be tested in mind. To that end, several interest inventories have also been developed for specialized groups, including picture-based inventories for groups who are illiterate, developmentally disabled or who have limited familiarity with English.

A quantitative analysis of 66 longitudinal studies conducted by K. S. Douglas Low and colleagues in 2005 examined the stability of interests across time as measured by popular vocational interest inventories. This review found that vocational interests remain relatively stable over time and, in fact, show even greater stability than measures of personality traits between early adolescence and early adulthood. According to this review, stability of interests rises steadily from early adolescence, peaks in early adulthood, and declines somewhat between ages 29 and 40. Of the six categories of interests in the Holland classification system, Realistic and Artistic interests showed the greatest stability over time. When the interest classification system developed by Kuder was examined, Artistic interests again showed the greatest stability, followed by Mechanical, Musical, and Scientific interests. The bulk of the research indicates that areas of interest, whether described by the Holland or the Kuder classification system, are stable, trait-like features within individuals.
