Reliability and Validity, Part I

I. Overview
II. Scale Development Issues
     A. The Domain Sampling Model
     B. Content Validity
     C. The domain sampling model and the interpretation of test scores
     D. Face Validity
III. How Reliable is the Scale?
     A. Theory of Measurement Error
     B. Reliability Estimates
     C. Reliability Standards
     D. Standard error of measurement
     E. Regression Towards the Mean
     F. Interrater reliability
IV. Diagnostic Utility

Reliability and Validity, Part II

References
Footnotes  

 

I.  Overview

The goal of this set of notes is to explore issues of reliability and validity as they apply to psychological measurement.  The approach will be to examine these issues through a particular scale, the PTSD-Interview (PTSD-I; Watson, Juba, Manifold, Kucala, & Anderson, 1991).  The issues to be discussed include:
(a) How would you go about developing a scale to measure posttraumatic stress disorder?
(b) What items would you include in your scale and how would you determine the content validity of the scale?
(c) How would you determine the reliability of the scale?
(d) How would you determine the validity of the scale?

This web page will focus on the first three issues.  A companion web page will look at the validity question.


II. Scale Development Issues

A. The Domain Sampling Model

The first two questions posed in the overview, "How would you go about developing a scale to measure posttraumatic stress disorder?" and "What items would you include in your scale and how would you determine the content validity of the scale?", can be answered by first defining the domain of possible items that are relevant for the scale.  In the case of PTSD, the relevant domain of items is specified by the DSM-IV diagnostic criteria for PTSD.

See handout: DSM-IV Diagnostic Criteria for PTSD

How would you specify the domain for content quizzes in general psychology, or for a personality test of extraversion?


B. Content Validity

Content validity asks the question, "Do the items on the scale adequately sample the domain of interest?"

If you are developing a test to diagnose PTSD then the test must adequately reflect all of the DSM diagnostic criteria.

See handout: The PTSD-I Scale

Do the items on the PTSD-I scale adequately reflect the DSM-IV criteria?  How might you come to a quantitative decision about the content validity of the scale?

The DSM criteria are all or nothing. Participants respond to the PTSD-I items on 7-point scales with these anchors:

1 = No / Never
2 = Very little / Very rarely
3 = A little / Sometimes
4 = Somewhat / Commonly
5 = Quite a bit / Often
6 = Very much / Very often
7 = Extremely / Always

How do you determine the minimum score needed to meet the all-or-nothing criteria of the DSM?  Is a 3 ("A little" / "Sometimes") enough to meet a criterion?  Is a 7 ("Extremely" / "Always") necessary?  If you make the criteria too strict, you will underdiagnose PTSD.  On the other hand, if you make the criteria too lenient, you will overdiagnose PTSD.

One approach would be to go back to the DSM criteria to see if they give any guidance.  In this instance, are the DSM-IV guidelines helpful in determining the minimum score that should be used to indicate that a specific criterion was met?

An empirical approach is to examine diagnostic utility indices at each possible cutting score. You can choose a cut score for the items that maximizes sensitivity, maximizes specificity, or maximizes efficiency. Watson et al. (1991) determined that a score of 4 or greater on an item indicated that the DSM-III-R criterion was met.  They chose that score because it produced an optimal sensitivity/specificity balance.
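As a minimal sketch of that empirical approach (the item scores, diagnoses, and function name below are hypothetical, not Watson et al.'s data), you could scan every candidate cut score and compute the utility indices at each one:

```python
# Hypothetical illustration: choose an item cut score by scanning
# sensitivity, specificity, and efficiency at each candidate cutoff.

item_scores = [2, 5, 4, 3, 6, 7, 1, 4, 5, 3, 6, 2]   # 1-7 ratings (hypothetical)
has_disorder = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0]  # "true" diagnosis (hypothetical)

def utility_at_cut(scores, truth, cut):
    """Classify 'meets criterion' as score >= cut, then compute the indices."""
    tp = sum(1 for s, t in zip(scores, truth) if s >= cut and t == 1)
    fn = sum(1 for s, t in zip(scores, truth) if s < cut and t == 1)
    tn = sum(1 for s, t in zip(scores, truth) if s < cut and t == 0)
    fp = sum(1 for s, t in zip(scores, truth) if s >= cut and t == 0)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(scores)

for cut in range(2, 8):
    se, sp, ef = utility_at_cut(item_scores, has_disorder, cut)
    print(f"cut >= {cut}: sensitivity={se:.2f} specificity={sp:.2f} efficiency={ef:.2f}")
```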

What are the advantages and disadvantages of using this type of continuous scale rather than an all-or-nothing (yes or no) response scale?

If you are developing a content test for general psychology, how would you determine if the test had adequate content validity?

C. The domain sampling model and the interpretation of test scores.

For content tests, the proportion correct on the test is taken as an estimate of the proportion correct that would have been obtained if every item in the domain had been included on the test.


D. Face Validity

Face validity refers to the issue of whether or not the items are measuring what they appear, on the face of it, to measure.

Does the PTSD-I have face validity?

Does the MMPI have face validity?


III. How Reliable is the Scale?

Reliability is the degree of consistency of the measure.  Does the measure give you the same results every time it is used?

A. Theory of Measurement Error

An observed score, x, has two components: the true score, t, and measurement error, e.

x = t + e

Measurement error can be either random or systematic.  Systematic error is present each time the measure is given (e.g., questions that consistently measure some other domain, or response biases such as the tendency to agree with all items on the test).  Systematic error poses a validity problem: you are not measuring what you intend to measure.

Reliability is defined as the proportion of true variance over the obtained variance.

reliability = true variance / obtained variance
              = true variance / (true variance + error variance)

A reliability coefficient of .85 indicates that 85% of the variance in the test scores depends upon the true variance of the trait being measured, and 15% depends on error variance.  Note that you do not square the value of the reliability coefficient in order to find the amount of true score variance.
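To make that variance decomposition concrete, here is a small simulation sketch (not part of the original notes; the population values are hypothetical): obtained scores are generated as x = t + e, and reliability is estimated as the ratio of true variance to obtained variance.

```python
# Simulate x = t + e and verify that reliability = true variance / obtained variance.
import random

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(100_000)]
obtained = [t + random.gauss(0, 5) for t in true_scores]   # add random error

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Expected: 15^2 / (15^2 + 5^2) = 225 / 250 = .90
print(f"reliability = {variance(true_scores) / variance(obtained):.3f}")
```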

B. Reliability Estimates

The type of reliability estimate depends on whether a single measure or different forms of the measure are used, and on whether the measure is given at one testing session or at two testing sessions.  (The homogeneity formulas below are implemented in the code sketch that follows this list.)

Single measure, one testing session: HOMOGENEITY

  • Cronbach's coefficient alpha, r11:

        r11 = (n * rii) / (1 + (n - 1) * rii)

    where n is the number of items in the test, and rii is the average correlation between all of the test items.  See also the computational formula for Cronbach's alpha (footnote 1) and the graphic representation of reliability as a function of n and rii.

  • Split-half reliability, rtt:

        rtt = (2 * rhh) / (1 + rhh)

    where rhh is the correlation between the half-tests.

  • Error variance is due to content sampling and content heterogeneity.  Low homogeneity indices may indicate that the test measures more than one domain.

Single measure, two testing sessions: TEST-RETEST RELIABILITY (STABILITY)

  • Measured as the correlation between the same test given at different times.
  • Error variance is due to time sampling and content sampling.

Different forms of the measure, one testing session: EQUIVALENCE

  • Measured as the correlation between different forms of the test given at the same time.
  • Error variance is due to content sampling.

Different forms of the measure, two testing sessions: STABILITY AND EQUIVALENCE

  • Measured as the correlation between different forms of the test given at different times.
  • Error variance is due to content sampling and time sampling.
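As a minimal sketch (the function names and example values are mine, not from the notes), the two homogeneity formulas above translate directly into Python; the loop also illustrates the point of the graphic: reliability grows with both the number of items and the average inter-item correlation.

```python
# Internal-consistency estimates from the formulas above.

def standardized_alpha(n_items: int, avg_inter_item_r: float) -> float:
    """Alpha from the number of items (n) and the average inter-item correlation (rii)."""
    return (n_items * avg_inter_item_r) / (1 + (n_items - 1) * avg_inter_item_r)

def split_half(r_halves: float) -> float:
    """Spearman-Brown corrected split-half reliability from rhh."""
    return (2 * r_halves) / (1 + r_halves)

for n in (5, 10, 20):
    print(f"n = {n:2d}, rii = .30 -> alpha = {standardized_alpha(n, .30):.2f}")
print(f"rhh = .80 -> rtt = {split_half(.80):.2f}")
```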

C. Reliability Standards

A good rule of thumb for reliability is that if the test is going to be used to make decisions about people's lives (e.g., the test is used as a diagnostic tool that will determine treatment, hospitalization, or promotion), then the minimum acceptable coefficient alpha is .90.

This rule of thumb can be substantially relaxed if the test is going to be used for research purposes only.

Here are the reliability estimates for the PTSD-I and the Posttraumatic Stress Diagnostic Scale (PDS; Foa, 1995).  The PTSD-I is based on the DSM-III-R.  The PDS is based on the DSM-IV.

 

Reliability Estimates for the PTSD-I and the PDS

Single measure

  HOMOGENEITY (alpha), measured at one testing session:
      PTSD-I: .92 [1]; .87 to .93 [2][3]
      PDS: .92 [6]

  TEST-RETEST RELIABILITY, measured at two testing sessions:
      PTSD-I: 1 week = .95 [1]; 90 days = .76 [2][4] and .91 [2][5]
      PDS: 10-22 days = .83 [7]

Different forms of the measure

  EQUIVALENCE, and STABILITY AND EQUIVALENCE:
      Different forms are not available for the PTSD-I or the PDS.

[1] Participants: 31 Vietnam veteran inpatients; 24 diagnosed with PTSD (Watson et al., 1991, study 1).
[2] Participants: 77 noncombat, nonhospitalized individuals suffering from traumatic memories; 37 diagnosed with PTSD (Wilson, Tinker, Becker, & Gillette, 1995).
[3] Measured at 4 different times during the study.
[4] Delayed treatment condition participants, n = 39.
[5] Immediate treatment condition participants, n = 37.
[6] n = 248.
[7] n = 110.

 

D.  Standard error of measurement

The standard error of measurement is the standard deviation of the error scores. If you know the standard error of measurement you can determine the confidence interval around any true score or the confidence interval of a predicted true score given an obtained score.  The formula for the standard error of measurement is

SEmeas = SD * sqrt(1 - r11)

where SD = the standard deviation of the measure, and
r11 = the reliability (typically coefficient alpha) of the measure.

For example, given that the SD of a test is 15.00 and the reliability is .90, then the standard error of measurement would be

SEmeas = 15.00 * sqrt(1 - .90) = 15.00 * sqrt(.10) = 15.00 * .316 = 4.74

If the same test had a reliability of .60, then the standard error of measurement would be

SEmeas = 15.00 * sqrt(1 - .60) = 15.00 * sqrt(.40) = 15.00 * .632 = 9.48.

95% confidence interval around a true score. You can use the SEmeas to build a 95% confidence interval around the true score.  For a given true score, t, 95% of the obtained scores will fall between t ± 1.96*SEmeas.   In the example of the test with a standard deviation of 15.00 and a reliability of .90, for a given true score of 100, the 95% confidence interval of the obtained scores would be 100 ± 1.96*4.74. That is, the 95% confidence interval would range from 90.71 to 109.29.
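As a quick sketch of those two calculations (the helper names are mine), in Python:

```python
import math

def se_meas(sd: float, r11: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1 - r11)

def ci_for_true_score(t: float, sd: float, r11: float):
    """95% interval for obtained scores around a given true score t."""
    half_width = 1.96 * se_meas(sd, r11)
    return t - half_width, t + half_width

print(f"SEmeas (SD=15, r11=.90) = {se_meas(15.0, .90):.2f}")   # 4.74
print(f"SEmeas (SD=15, r11=.60) = {se_meas(15.0, .60):.2f}")   # 9.49
lo, hi = ci_for_true_score(100, 15.0, .90)
print(f"95% CI around t = 100: {lo:.2f} to {hi:.2f}")          # 90.71 to 109.29
```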

Predicting the true score from an obtained score.  You can use information about the reliability of a measure to predict the true score from an obtained score. The predicted true score, t',  is found as

t' = r11 * x1

where t' is the estimated true deviation score,
x1 is the deviation score (x1 = X - M) obtained on test 1, and
r11 is the reliability coefficient.

 

A set of obtained scores, X, for a hypothetical test that has a reliability of .90 is shown in Table 1. The mean of the scores is 20.00 and the standard deviation is 6.06.

The deviation scores, X1, are computed by subtracting the mean (20.0) from each obtained score, X1 = X - M.  An obtained score (or raw score) of  12 on this test is equivalent to a deviation score of -8.00. An obtained score of 25 is equivalent to a deviation score of 5.00.

The estimated true deviation scores, t', are computed by multiplying the deviation score by the reliability, t' = r11*X1.   If a person obtained a score of 12 on this test, then the estimated true deviation score would be -7.20. In terms of the original scale the estimated true score would be M + t', or 12.80 (20.00 + (-7.20) = 12.80). If a person obtained a score of 25 on the test, the estimated true deviation score would be 4.50.  In terms of the original scale, the estimated true score would be 20.00 + 4.50, or 24.50.  Note that whenever the reliability of the test is less than 1.00, the estimated true score is always closer to the mean than the obtained score.

The standard error of measurement, 1.92 (shown at the bottom of Table 1), was found by multiplying the standard deviation, 6.06, by the square root of 1 minus the reliability coefficient: SEmeas = 6.06 * sqrt(1 - .90) = 1.92.

Confidence intervals are constructed around each estimated true score.  The 95% confidence interval around the estimated true deviation score of -7.20 ranges from -10.96 to -3.44.

Recall that the reliability coefficient can be interpreted as the percent of obtained score variance that is true score variance.  In this example the reliability is .90, so the true score variance should be 90% of the obtained score variance.

The deviation true scores and deviation confidence interval scores can be converted back to the original scale by adding the deviation score to the mean of the scale.
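The following Python sketch reproduces the calculations behind Table 1 below (the variable names are mine; tiny rounding differences from the table are possible because the table rounds SD to 6.06 before computing SEmeas):

```python
import math

scores = [11, 12, 17, 18, 19, 20, 21, 25, 28, 29]
r11 = .90

mean = sum(scores) / len(scores)                                           # 20.00
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (len(scores) - 1))   # 6.06
se_meas = sd * math.sqrt(1 - r11)                                          # 1.92

print("  X      X1      t'    lower   upper")
for x in scores:
    x1 = x - mean                  # deviation score
    t = r11 * x1                   # estimated true deviation score
    lower = t - 1.96 * se_meas     # 95% CI bounds around t'
    upper = t + 1.96 * se_meas
    print(f"{x:3d} {x1:7.2f} {t:7.2f} {lower:7.2f} {upper:7.2f}")
```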

 
Table 1. Estimated true deviation scores, t', and the upper and lower bounds of the 95% confidence interval (95% CI) when the reliability (r11) of the test = .90

          Obtained   Deviation   True deviation   95% CI (deviation scores)
          score      score [1]   score [2]        Lower bound [3]   Upper bound [4]
          X          X1          t'

   1      11         -9.00       -8.10            -11.86            -4.34
   2      12         -8.00       -7.20            -10.96            -3.44
   3      17         -3.00       -2.70             -6.46             1.06
   4      18         -2.00       -1.80             -5.56             1.96
   5      19         -1.00       -0.90             -4.66             2.86
   6      20           .00        0.00             -3.76             3.76
   7      21          1.00        0.90             -2.86             4.66
   8      25          5.00        4.50              0.74             8.26
   9      28          8.00        7.20              3.44            10.96
  10      29          9.00        8.10              4.34            11.86

  Mean    20.00       0.00        0.00             -3.76             3.76
  SD       6.06       6.06        5.45              5.45             5.45

  SEmeas = 1.9163
[1] X1 = X - M
       = X - 20.00

[2] t' = r11*X1
       = .90*X1

[3] 95% CI lower bound = t' - 1.96*SEmeas
                       = t' - 1.96*1.9163
                       = t' - 3.76

[4] 95% CI upper bound = t' + 1.96*SEmeas
                       = t' + 1.96*1.9163
                       = t' + 3.76

 

The numeric data presented in Table 1 (r11 = .90) are shown in graphic form in Figure 1. Obtained scores are shown on the x-axis and true scores are shown on the y-axis. The red diagonal line represents a set of scores that are perfectly reliable.  If the scores are perfectly reliable, then the true score is equal to the obtained score.

The green lines represent the estimated true scores when the reliability of the scale is .90 (r11 = .90).  The center green line is the predicted true score; the outer green lines represent the upper and lower bounds of the 95% confidence interval for the predicted true scores.

Note that the 95% confidence interval is built around the estimated true scores rather than around the obtained scores. 

Figure 1.  The relationship between obtained scores (x-axis) and true scores (y-axis) for r11 = 1.00 (red line) and for r11 = .90 (green lines).

 

The estimated true scores and 95% confidence intervals are presented in the animated graphic (Figure 2) for the following reliabilities: 1.00, .95, .90, .80, .70, .60, .50, .40, .30, .20, and .10.

The red diagonal line indicates a test with perfect reliability; the true score equals the obtained score. You can use that diagonal red line as a comparison when viewing the true scores and confidence intervals at other levels of reliability.

As you watch the graphic notice that as the reliability of the test decreases the estimated true scores become closer and closer to the mean of the set of scores. What would be the estimated true score, t', if the reliability of the test were 0?

Also notice that the 95% confidence interval gets wider and wider as the reliability of the test decreases. 

Also notice that because the 95% confidence interval is built around the estimated true score, the confidence interval is not symmetric around the obtained score. The degree of asymmetry becomes greater as the reliability decreases and as you go from a score near the mean to a score that is distant from the mean.

Figure 2.  The relationship between obtained scores (x-axis) and true scores (y-axis) at various scale reliabilities.

Introductory level measurement books typically say that the confidence interval for an obtained score can be constructed around that obtained score rather than around the estimated true score. That is technically incorrect, but the confidence interval so constructed will not be too far off as long as the reliability of the test is high.

E. Regression Towards the Mean

The above discussion of true scores and their confidence intervals provides the statistical basis for what is known as the regression toward the mean effect.   The regression toward the mean effect occurs when people are selected for a study because they have an extreme score on some test.  For example, children are selected for a special reading class because they score low on a reading test, or adults are selected for a treatment outcome study because they score high on a PTSD diagnostic test.   If those same people are retested using the same instrument, then we would expect the new set of scores to be closer to the true scores, which are closer to the mean.   That is, children selected because of low reading scores should get higher reading scores, and adults selected because of high PTSD scores should get lower PTSD scores.   This is expected to happen in the absence of any treatment.

One common sense explanation of this effect begins with the expectation that measurement errors will be random.  For someone who has an extreme score, it is likely that the errors happened to converge in a manner that artificially inflated (or deflated) the score on that particular test given at that particular time.  It is unlikely that the random errors will converge in the same manner on another testing, so that person's score should be closer to the true score on subsequent testings.

The magnitude of the regression toward the mean effect depends upon two things: (a) the reliability of the test and (b) the extremity of the selection scores.  The magnitude of the effect will increase as the reliability of the test decreases, and it will increase as the selection scores become more extreme.  Check your understanding of this by going back to the graphs showing the relationship between obtained and true scores at different levels of reliability and extremity, or by experimenting with the simulation sketch below.
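A minimal simulation sketch of the effect (all values are hypothetical and the names are mine): select the lowest scorers at time 1, then retest everyone with fresh random error and no treatment at all.

```python
# Simulate regression toward the mean: select extreme scorers, then retest.
import random

random.seed(2)
n = 10_000
true_scores = [random.gauss(100, 12) for _ in range(n)]
test1 = [t + random.gauss(0, 8) for t in true_scores]   # first testing
test2 = [t + random.gauss(0, 8) for t in true_scores]   # retest, new random errors

selected = [i for i in range(n) if test1[i] < 85]       # e.g., poor readers at time 1
mean1 = sum(test1[i] for i in selected) / len(selected)
mean2 = sum(test2[i] for i in selected) / len(selected)

# With no treatment, the selected group's retest mean moves back toward 100.
print(f"selected group, time 1 mean = {mean1:.1f}")
print(f"selected group, time 2 mean = {mean2:.1f}")
```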

F. Interrater reliability

Interrater reliability is concerned with the consistency between the judgments of 2 or more raters. 

In the past, interrater reliability was measured by having 2 (or more) people make decisions (e.g., meets PTSD diagnostic criteria) or ratings across a number of cases and then finding the correlation between those sets of decisions or ratings. That method gives an overestimate of the interrater reliability, so it is rarely used.

One of the more commonly used measures of interrater reliability is kappa.  Kappa takes into account the level of agreement expected by chance between judges or raters.  For that reason it is considered to be a more appropriate measure of interrater reliability.   For a discussion of kappa and how to compute it using SPSS, see Crosstabs: Kappa.
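As a sketch, Cohen's kappa for two raters making a yes/no diagnosis can be computed from a 2x2 agreement table (the counts below are hypothetical):

```python
# Cohen's kappa for two raters (hypothetical counts).
# Rows: rater A (yes, no); columns: rater B (yes, no).
table = [[20, 5],
         [4, 71]]

total = sum(sum(row) for row in table)
observed = (table[0][0] + table[1][1]) / total       # proportion of agreement

# Agreement expected by chance, from the marginal proportions.
a_yes = sum(table[0]) / total
b_yes = (table[0][0] + table[1][0]) / total
expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)

kappa = (observed - expected) / (1 - expected)
print(f"observed = {observed:.2f}, expected = {expected:.2f}, kappa = {kappa:.2f}")
```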

The interrater reliability indices for the PTSD-I and the PDS are shown below.

Interrater Reliability for the PTSD-I and PDS

  PTSD-I: kappa = .61 [1]
  PTSD-I: kappa > .84 [2]
  PDS: kappa = .77 [3]

[1] Participants: 31 Vietnam veteran inpatients; 24 diagnosed with PTSD (Watson et al., 1991, study 1; 2 raters).
[2] Participants: 80 noncombat, nonhospitalized individuals suffering from traumatic memories; 37 diagnosed with PTSD; 3 raters judged all participants (Wilson, Tinker, Becker, & Gillette, 1995).
[3] Participants: 110 individuals recruited from treatment and research centers (Foa, 1995).

In the Watson et al. (1991) and Wilson et al. (1995) studies, disagreements in diagnoses were resolved by a discussion among the raters.


IV.  Diagnostic Utility

Diagnostic utility refers to the extent to which the measure correctly identifies individuals who meet or do not meet diagnostic criteria.  There are three common measures of diagnostic utility.

  1. Sensitivity - the probability that those with the diagnosis will be correctly identified by the test as meeting the diagnostic criteria.
  2. Specificity - the probability that those without the diagnosis will be correctly identified by the test as not meeting the diagnostic criteria.
  3. Efficiency - the overall probability of correctly classifying both those with the diagnosis and those without the diagnosis.

Sensitivity, specificity, and efficiency are reported as percentages or as decimal numbers ranging from 0 to 1.0.

Consider the following crosstabulation table, where each cell would be filled in with the number of people in that cell.

                                        "True" Diagnosis
  Test Classification         Has the Disorder               Does not have the Disorder
  Meets criteria              Sensitivity = column percent   (Type ? error) [2]
  Does not meet criteria      (Type ? error) [2]             Specificity = column percent

  Efficiency = (n in sensitivity cell + n in specificity cell) / n total

And the following distribution of hypothetical scores:

                                        "True" Diagnosis
  Test Classification         Has the Disorder   Does not have the Disorder   Total
  Meets criteria              n = 20             n = 10                       n = 30
  Does not meet criteria      n = 5              n = 65                       n = 70
  Total                       n = 25             n = 75                       n = 100

  Sensitivity = 20/25 = .80
  Specificity = 65/75 = .87
  Efficiency = (20 + 65) / 100 = .85
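A short sketch that verifies those three indices from the four cell counts above:

```python
# Diagnostic utility indices from the hypothetical 2x2 table above.
tp, fp = 20, 10   # meets criteria: has disorder / does not
fn, tn = 5, 65    # does not meet criteria: has disorder / does not

sensitivity = tp / (tp + fn)                   # 20/25  = .80
specificity = tn / (tn + fp)                   # 65/75  = .87
efficiency = (tp + tn) / (tp + fp + fn + tn)   # 85/100 = .85
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, "
      f"efficiency = {efficiency:.2f}")
```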

The utility indices are dependent upon the leniency or strictness of the test in diagnosing individuals. For example, if the test were very strict and classified only 10 of the individuals as meeting the criteria, rather than the 30 shown above, then the sensitivity, specificity and efficiency indices would be different.

The utility indices are also dependent upon how you determined the "true" diagnosis. How would you determine the "True" diagnosis for an individual?

Watson et al. (1991) used the DIS PTSD scale (Robins & Helzer, 1985) as the "gold standard" for diagnosing PTSD. The DIS PTSD scale is a widely used scale that has good utility scores for clinical populations (but not for nonclinical populations). The participants (study 2) were 61 Vietnam veterans, 27 of whom were diagnosed with PTSD using the DIS PTSD scale.  Foa (1995) used the Structured Clinical Interview for the DSM-III-R (SCID; Williams et al., 1992) as the "gold standard." The participants were 248 individuals recruited from treatment and research centers. The utility scores are shown below.

Utility Indices for the PTSD-I (from Watson et al., 1991, study 2) and the PDS (from Foa, 1995)

                PTSD-I   PDS
  Sensitivity   .89      .82
  Specificity   .94      .77
  Efficiency    .92      .79

References

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Foa, E. B. (1995). Posttraumatic Stress Diagnostic Scale. Minneapolis: National Computer Systems.

Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version III-A).  St. Louis, MO: Washington University, Department of Psychiatry.

Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991). The PTSD interview: Rationale, description, reliability, and concurrent validity of a DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.

Williams, J. B. W., Gibbon, M., First, M. B., Spitzer, R. L., Davies, M., Borus, J., Howes, M. J., Kane, J., Pope, H. G., Rounsaville, B., & Wittchen, H.-U. (1992). The Structured Clinical Interview for DSM-III-R (SCID). Archives of General Psychiatry, 49, 630-636.

Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using the PTSD-I as an outcome measure. Poster presented at the annual meeting of the International Society for Traumatic Stress Studies, Chicago, IL.


Footnotes

1. The computational formula for Cronbach's alpha uses the standard deviations of each of the items and the standard deviation of the test as a whole rather than the average intercorrelation between all of the items:

alpha = (n / (n - 1)) * (1 - (sum of the SDi^2) / SDt^2)

where n = the number of items in the test,
sum of the SDi^2 = the sum of the squared standard deviations of each of the items (i.e., the sum of the item variances), and
SDt^2 = the squared standard deviation of the total test scores (i.e., the variance of the test).

Or, you could use a computer program to compute alpha. In SPSS for Windows, Cronbach's alpha can be found under:

Statistics
      Scale
           Reliability analysis ...
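As a sketch of the computational formula itself (the item data below are hypothetical), in Python:

```python
# Cronbach's alpha from an items-by-persons score matrix (hypothetical data).
# alpha = (n / (n - 1)) * (1 - sum(item variances) / variance(total scores))

items = [                      # each row = one item's scores across 6 people
    [3, 4, 5, 2, 4, 5],
    [2, 4, 4, 3, 5, 5],
    [3, 5, 4, 2, 4, 4],
    [4, 4, 5, 3, 5, 5],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n = len(items)
totals = [sum(person) for person in zip(*items)]   # each person's total score
alpha = (n / (n - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))
print(f"alpha = {alpha:.2f}")   # about .90 for these data
```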

 

2. Type I error = rejecting the null hypothesis when it is true.
     Type II error = failing to reject the null hypothesis when it is false.

     In this instance the null hypothesis is that the person does not meet the diagnostic criteria.

 

 

