I. Overview
II. Scale Development Issues
   A. The Domain Sampling Model
   B. Content Validity
   C. The Domain Sampling Model and the Interpretation of Test Scores
   D. Face Validity
III. How Reliable Is the Scale?
   A. Theory of Measurement Error
   B. Reliability Estimates
   C. Reliability Standards
   D. Standard Error of Measurement
   E. Regression Toward the Mean
   F. Interrater Reliability
IV. Diagnostic Utility
The goal of this set of notes is to explore issues of reliability and validity as they apply to psychological measurement. The approach will be to examine these issues through a particular scale, the PTSD Interview (PTSD-I; Watson, Juba, Manifold, Kucala, & Anderson, 1991). The issues to be discussed include:
(a) How would you go about developing a scale to measure posttraumatic stress disorder?
(b) What items would you include in your scale and how would you determine the content validity of the scale?
(c) How would you determine the reliability of the scale?
(d) How would you determine the validity of the scale?
This web page will focus on the first three issues. A companion web page will address the validity question.
The first two questions posed in the overview, "How would you go about developing a scale to measure posttraumatic stress disorder?" and "What items would you include in your scale and how would you determine the content validity of the scale?" can be answered by first defining the domain of possible items that are relevant for the scale. In the case of PTSD, the relevant domain of items is specified by the DSM-IV diagnostic criteria for PTSD.
See handout: DSM-IV Diagnostic Criteria for PTSD
How would you specify the domain for content quizzes in general psychology? For a personality test of extraversion?
Content validity asks the question, "Do the items on the scale adequately sample the domain of interest?"
If you are developing a test to diagnose PTSD then the test must adequately reflect all of the DSM diagnostic criteria.
See handout: The PTSD-I Scale
Do the items on the PTSD-I scale adequately reflect the DSM-IV criteria? How might you come to a quantitative decision about the content validity of the scale?
The DSM criteria are all or nothing. Participants respond to the PTSD-I items on 7-point scales:

1 = No / Never
2 = Very Little / Very Rarely
3 = A Little / Sometimes
4 = Somewhat / Commonly
5 = Quite a Bit / Often
6 = Very Much / Very Often
7 = Extremely / Always
How do you determine the minimum score you need to fit the all-or-nothing criteria of the DSM? Is a 3 ("A little" / "Sometimes") enough to meet the criterion? Is a 7 ("Extremely" / "Always") necessary to meet the criterion? If you make the criteria too strict, you will underdiagnose PTSD. On the other hand, if you make the criteria too lenient, you will overdiagnose PTSD.
One approach would be to go back to the DSM criteria to see if they give any guidance. In this instance, are the DSM-IV guidelines helpful in determining the minimum score that should be used to indicate that a specific criterion was met?
An empirical approach is to examine diagnostic utility indices at each possible cutting score. You can choose a cut score for the items that will maximize sensitivity, maximize specificity, or maximize efficiency. Watson et al. (1991) determined that a score of 4 or greater on an item indicated that the corresponding DSM-III-R criterion was met. They chose that score because it produced an optimal sensitivity/specificity balance.
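The cut-score search described above can be sketched in a few lines of code. Everything below is hypothetical and for illustration only: the item scores, the "gold standard" diagnoses, and the helper name `utility` are invented, not Watson et al.'s data.

```python
# Sketch: choose an item cut score by examining utility indices at each cut.
# Scores and diagnoses below are hypothetical illustration data.

def utility(scores, diagnosed, cut):
    """Classify as 'meets criterion' when score >= cut; return utility indices."""
    tp = sum(1 for s, d in zip(scores, diagnosed) if s >= cut and d)
    fn = sum(1 for s, d in zip(scores, diagnosed) if s < cut and d)
    tn = sum(1 for s, d in zip(scores, diagnosed) if s < cut and not d)
    fp = sum(1 for s, d in zip(scores, diagnosed) if s >= cut and not d)
    sensitivity = tp / (tp + fn)          # hit rate among true cases
    specificity = tn / (tn + fp)          # correct-rejection rate among non-cases
    efficiency = (tp + tn) / len(scores)  # overall proportion correct
    return sensitivity, specificity, efficiency

# Hypothetical 7-point item responses and gold-standard diagnoses
scores    = [1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 1, 2, 3, 3, 4, 5]
diagnosed = [False]*6 + [True]*4 + [False]*4 + [True]*2

for cut in range(1, 8):
    sens, spec, eff = utility(scores, diagnosed, cut)
    print(f"cut >= {cut}: sensitivity={sens:.2f} specificity={spec:.2f} efficiency={eff:.2f}")
```

Scanning the printed indices across all possible cuts is exactly the kind of search that leads to a "4 or greater" decision rule when that cut gives the best sensitivity/specificity balance.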
What are the advantages and disadvantages of using this type of continuous scale rather than an all-or-nothing (yes or no) response scale?
If you are developing a content test for general psychology, how would you determine if the test had adequate content validity?
For content tests, the proportion correct on the test is taken as an estimate of the proportion correct that would have been obtained if every item in the domain had been included on the test.
Face validity refers to the issue of whether or not the items are measuring what they appear, on the face of it, to measure.
Does the PTSD-I have face validity?
Does the MMPI have face validity?
Reliability is the degree of consistency of the measure. Does the measure give you the same results every time it is used?
An observed score, x, has two components: the true score, t, and measurement error, e.
x = t + e 
Measurement error can be either random or systematic. Systematic error is present each time the measure is given (e.g., questions that consistently measure some other domain, or response biases such as the tendency to agree with all items on the test). Systematic error poses a validity problem: you are not measuring what you intend to measure.
Reliability is defined as the proportion of true variance over the obtained variance.
reliability = true variance / obtained variance = true variance / (true variance + error variance) 
A reliability coefficient of .85 indicates that 85% of the variance in the test scores depends upon the true variance of the trait being measured, and 15% depends on error variance. Note that you do not square the value of the reliability coefficient in order to find the amount of true score variance.
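The variance decomposition behind this definition can be illustrated with a quick simulation (all numbers hypothetical): generate true scores, add random error to form obtained scores, and check that var(t)/var(x) matches the theoretical reliability.

```python
# Simulate x = t + e and verify that reliability = true variance / obtained variance.
import random

random.seed(1)
n = 100_000
true_sd, error_sd = 10.0, 5.0                       # hypothetical trait and error SDs
t = [random.gauss(50, true_sd) for _ in range(n)]   # true scores
x = [ti + random.gauss(0, error_sd) for ti in t]    # obtained scores = true + error

def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

reliability = var(t) / var(x)
# Theoretical value: 100 / (100 + 25) = .80
print(round(reliability, 2))
```

Note that the reliability comes out near .80 directly as a ratio of variances; as the text says, no squaring is involved.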
                                  Measured at one testing session    Measured at two testing sessions
Single measure                    HOMOGENEITY                        TEST-RETEST RELIABILITY
Different forms of the measure    EQUIVALENCE                        STABILITY AND EQUIVALENCE
A good rule of thumb for reliability is that if the test is going to be used to make decisions about people's lives (e.g., the test is used as a diagnostic tool that will determine treatment, hospitalization, or promotion), then the minimum acceptable coefficient alpha is .90.
This rule of thumb can be substantially relaxed if the test is going to be used for research purposes only.
Here are the reliability estimates for the PTSD-I and the Posttraumatic Stress Diagnostic Scale (PDS; Foa, 1995). The PTSD-I is based on the DSM-III-R. The PDS is based on the DSM-IV.
                    Measured at one testing session    Measured at two testing sessions
Single measure      HOMOGENEITY (alpha)                TEST-RETEST RELIABILITY
  PTSD-I            .92^{1}; .87 to .93^{2,3}          1 week = .95^{1}; 90 days = .76^{2,4}, .91^{2,5}
  PDS               .92^{6}                            10-22 days = .83^{7}

Different forms of the measure (EQUIVALENCE; STABILITY AND EQUIVALENCE): different forms are not available for the PTSD-I.

^{1} Participants: 31 Vietnam veteran inpatients; 24 diagnosed with PTSD (Watson et al., 1991, study 1). ^{2} Participants: 77 noncombat, nonhospitalized individuals suffering from traumatic memories; 37 diagnosed with PTSD (Wilson, Tinker, Becker, & Gillette, 1995). ^{3} Measured at 4 different times during the study. ^{4} Delayed treatment condition participants, n = 39. ^{5} Immediate treatment condition participants, n = 37. ^{6} n = 248. ^{7} n = 110.
The standard error of measurement is the standard deviation of the error scores. If you know the standard error of measurement you can determine the confidence interval around any true score or the confidence interval of a predicted true score given an obtained score. The formula for the standard error of measurement is

SE_{meas} = SD * √(1 - r_{11})

where SD = the standard deviation of the measure, and r_{11} = the reliability (typically coefficient alpha) of the measure.
For example, given that the SD of a test is 15.00 and the reliability is .90, then the standard error of measurement would be
SE_{meas} = 15.00 * √(1 - .90) = 15.00 * √(.10) = 15.00 * .316 = 4.74
If the same test had a reliability of .60, then the standard error of measurement would be
SE_{meas} = 15.00 * √(1 - .60) = 15.00 * √(.40) = 15.00 * .632 = 9.48.
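The two worked examples can be reproduced with a short helper; a minimal sketch of the SE_{meas} formula above (the function name `se_meas` is mine):

```python
# Standard error of measurement: SE_meas = SD * sqrt(1 - r11)
import math

def se_meas(sd, r11):
    """SD of the measure times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r11)

print(round(se_meas(15.00, .90), 2))  # 4.74
print(round(se_meas(15.00, .60), 2))  # 9.49 (the text's 9.48 reflects rounding sqrt(.40) to .632)
```

Notice how quickly measurement precision degrades: dropping the reliability from .90 to .60 doubles the standard error of measurement.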
95% confidence interval around a true score. You can use the SE_{meas} to build a 95% confidence interval around the true score. For a given true score, t, 95% of the obtained scores will fall within t ± 1.96*SE_{meas}. In the example of the test with a standard deviation of 15.00 and a reliability of .90, for a given true score of 100, the 95% confidence interval of the obtained scores would be 100 ± 1.96*4.74. That is, the 95% confidence interval would range between 90.71 and 109.29.
Predicting the true score from an obtained score. You can use information about the reliability of a measure to predict the true score from an obtained score. The predicted true score, t', is found as
t' = r_{11}x_{1}, where t' is the estimated true deviation score, x_{1} is the deviation score (x_{1} = X - M) obtained on test 1, and r_{11} is the reliability coefficient.
A set of obtained scores, X, for a hypothetical test with a reliability of .90 is shown in Table 1. The mean of the scores is 20.0; the standard deviation is 6.06.
The deviation scores, x_{1}, are computed by subtracting the mean (20.0) from each obtained score, x_{1} = X - M. An obtained score (or raw score) of 12 on this test is equivalent to a deviation score of -8.00. An obtained score of 25 is equivalent to a deviation score of 5.00. The estimated true deviation scores, t', are computed by multiplying the deviation score by the reliability, t' = r_{11}*x_{1}. If a person obtained a score of 12 on this test, the estimated true deviation score would be -7.20. In terms of the original scale the estimated true score would be M + t', or 12.80 (20.00 + (-7.20) = 12.80). If a person obtained a score of 25 on the test, the estimated true deviation score would be 4.50. In terms of the original scale, the estimated true score would be 20.00 + 4.50, or 24.50. Note that whenever the reliability of the test is less than 1.00, the estimated true score is always closer to the mean than the obtained score.

The standard error of measurement, 1.91 (shown at the bottom of the true scores column), was found by multiplying the standard deviation, 6.06, by the square root of 1 minus the reliability coefficient, SE_{meas} = 6.06 * √(1 - .90). Confidence intervals are constructed around each estimated true score. The 95% confidence interval around the estimated true deviation score of -7.20 ranges from -10.94 to -3.46. Recall that the reliability coefficient can be interpreted as the percentage of obtained score variance that is true score variance. In this example the reliability is .90, so the true score variance should be 90% of the obtained score variance. The deviation true scores and the deviation confidence interval limits can be converted back to the original scale by adding the mean of the scale.
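The worked example above can be written as a short computation (values taken from the example; the helper name `estimate_true` is mine):

```python
# Estimated true score and 95% CI around it, following the worked example
# (mean = 20.0, SD = 6.06, reliability = .90).
import math

def estimate_true(obtained, mean, sd, r11):
    dev = obtained - mean                 # deviation score x1 = X - M
    t_dev = r11 * dev                     # estimated true deviation score t' = r11 * x1
    se = sd * math.sqrt(1 - r11)          # standard error of measurement
    lo, hi = t_dev - 1.96 * se, t_dev + 1.96 * se
    # Convert the estimate and CI limits back to the original scale by adding the mean
    return mean + t_dev, mean + lo, mean + hi

t_est, lo, hi = estimate_true(12, 20.0, 6.06, .90)
print(round(t_est, 2))  # 12.8
```

Because the confidence interval is built around the estimated true score, not the obtained score of 12, the interval is not centered on 12.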

The numeric data presented in the previous table (r_{11} = .90) are shown in graphic form in the figure at the right. Obtained scores are shown on the x-axis and true scores are shown on the y-axis. The red diagonal line represents a set of scores that are perfectly reliable. If the scores are perfectly reliable, then the true score is equal to the obtained score.
The green lines represent the estimated true scores when the reliability of the scale is .90 (r_{11} = .90). The center green line is the predicted true score; the outer green lines represent the upper and lower bounds of the 95% confidence interval for the predicted true scores. Note that the 95% confidence interval is built around the estimated true scores rather than around the obtained scores.
Figure 1. The relationship between obtained scores (x-axis) and true scores (y-axis) for r_{11} = 1.00 (red line) and for r_{11} = .90 (green lines).
The estimated true scores and 95% confidence intervals are presented in the animated graphic (Figure 2) for the following reliabilities: 1.00, .95, .90, .80, .70, .60, .50, .40, .30, .20, and .10.
The red diagonal line indicates a test with perfect reliability: the true score is the obtained score. You can use that diagonal red line as a comparison when viewing the true scores and confidence intervals at other levels of reliability. As you watch the graphic, notice that as the reliability of the test decreases, the estimated true scores become closer and closer to the mean of the set of scores. What would be the estimated true score, t', if the reliability of the test were 0? Also notice that the 95% confidence interval gets wider and wider as the reliability of the test decreases. Finally, notice that because the 95% confidence interval is built around the estimated true score, the confidence interval is not symmetric around the obtained score. The degree of asymmetry becomes greater as the reliability decreases and as you go from a score near the mean to a score that is distant from the mean.
Figure 2. The relationship between obtained scores (x-axis) and true scores (y-axis) at various scale reliabilities.
Introductory-level measurement books typically say that the confidence interval for an obtained score can be constructed around that obtained score rather than around the estimated true score. They are technically incorrect, but the confidence interval so constructed will not be too far off as long as the reliability of the test is high.
The above discussion of true scores and their confidence intervals provides the statistical basis for what is known as the regression toward the mean effect. The regression towards the mean effect occurs when people are selected for a study because they have an extreme score on some test. For example, children are selected for a special reading class because they score low on a reading test, or adults are selected for a treatment outcome study because they score high on a PTSD diagnostic test. If those same people are retested using the same instrument, then we would expect that the new set of scores would be closer to the true score which is closer to the mean. That is, children selected because of low reading scores should get higher reading scores. Adults selected because of high PTSD scores should get lower PTSD scores. This is expected to happen in the absence of any treatment.
One common sense explanation of this effect begins with the expectation that measurement errors will be random. For someone who has an extreme score, it is assumed that the errors for that testing were not random. It is likely that the errors all happened to converge in a manner that they artificially inflated the score on that particular test given at that particular time. It is unlikely that the random errors will happen in the same manner on another testing so that person's score should be closer to the true score on subsequent testings.
The magnitude of the regression towards the mean effect depends upon two things: (a) the reliability of the test and (b) the extremity of the selection scores. The magnitude of the regression towards the mean effect will increase as the reliability of the test decreases. The regression towards the mean effect will increase as the selection scores become more extreme. Check out your understanding of this by going back and looking at the graphs showing the relationship between obtained and true scores at different levels of reliability and different levels of extremity.
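A small simulation (hypothetical population parameters) makes the effect concrete: select the top scorers at time 1, retest them with fresh random error, and the selected group's mean moves back toward the population mean with no treatment at all.

```python
# Simulate regression toward the mean: select extreme scorers at time 1,
# retest with new random error, and compare the group means (no treatment).
import random

random.seed(2)
n, mean, true_sd, error_sd = 10_000, 50.0, 8.0, 6.0  # reliability = 64/100 = .64
t  = [random.gauss(mean, true_sd) for _ in range(n)]
x1 = [ti + random.gauss(0, error_sd) for ti in t]    # time 1 obtained scores
x2 = [ti + random.gauss(0, error_sd) for ti in t]    # same true scores, new errors

# Select the people with the highest time-1 scores (top ~5%)
cut = sorted(x1)[int(0.95 * n)]
selected = [i for i in range(n) if x1[i] >= cut]

m1 = sum(x1[i] for i in selected) / len(selected)
m2 = sum(x2[i] for i in selected) / len(selected)
print(f"time 1 mean = {m1:.1f}, time 2 mean = {m2:.1f}")  # time 2 is closer to 50
```

Rerunning the sketch with a lower reliability (larger `error_sd`) or a more extreme selection cut makes the time-1/time-2 gap larger, which is exactly the two-factor dependence described above.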
Interrater reliability is concerned with the consistency between the judgments of two or more raters.
In the past, interrater reliability was measured by having two (or more) people make decisions (e.g., meets PTSD diagnostic criteria) or ratings across a number of cases and then finding the correlation between those sets of decisions or ratings. That method overestimates interrater reliability, so it is now rarely used.
One of the more commonly used measures of interrater reliability is kappa. Kappa takes into account the expected level of agreement between judges or raters. For that reason it is considered to be a more appropriate measure of interrater reliability. For a discussion of kappa and how to compute it using SPSS, see Crosstabs: Kappa.
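Outside SPSS, Cohen's kappa for two raters can be computed directly from its definition, (observed agreement - expected agreement) / (1 - expected agreement). The ratings below are hypothetical illustration data.

```python
# Cohen's kappa for two raters' diagnostic decisions (hypothetical ratings).
def cohens_kappa(rater1, rater2):
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    observed = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # Chance-expected agreement: product of each rater's marginal proportions
    expected = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

r1 = ["PTSD", "PTSD", "no", "no", "PTSD", "no", "no", "PTSD", "no", "no"]
r2 = ["PTSD", "PTSD", "no", "no", "no",   "no", "no", "PTSD", "no", "PTSD"]
print(round(cohens_kappa(r1, r2), 2))  # 0.58
```

Here the raters agree on 8 of 10 cases (.80 raw agreement), but kappa is only .58 because roughly half that agreement would be expected by chance alone.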
The interrater reliability indices for the PTSD-I and the PDS are shown below.
Interrater Reliability for the PTSD-I and PDS

PTSD-I: kappa = .61^{1}
PTSD-I: kappa > .84^{2}
PDS: kappa = .77^{3}

^{1} Participants: 31 Vietnam veteran inpatients; 24 diagnosed with PTSD (Watson et al., 1991, study 1; 2 raters). ^{2} Participants: 80 noncombat, nonhospitalized individuals suffering from traumatic memories; 37 diagnosed with PTSD; 3 raters judged all participants (Wilson, Tinker, Becker, & Gillette, 1995). ^{3} Participants: 110 individuals recruited from treatment and research centers.
In both of those studies, disagreements in diagnoses were resolved by a discussion among the raters.
Diagnostic utility refers to the extent to which the measure correctly identifies individuals who meet or do not meet diagnostic criteria. There are three common measures of diagnostic utility: sensitivity, specificity, and efficiency. They are reported as percentages or as decimal numbers ranging from 0 to 1.0.
Consider the following crosstabulation table where the cells could be filled in with the number of people in that cell.
"True" Diagnosis  

Test Classification  Has the Disorder  Does not have the Disorder  
Meets Criteria  Sensitivity = column percent  (Type ? error)^{2}  
Does not Meet Criteria  (Type ? error)^{2}  Specificity = column percent  
Efficiency = (n_{sensitivity cell }+ n_{specifity cell} ) / n_{total} 
And the following distribution of hypothetical scores:

                          "True" Diagnosis
Test Classification       Has the Disorder    Does not have the Disorder    Total
Meets Criteria            n = 20              n = 10                        n = 30
Does not Meet Criteria    n = 5               n = 65                        n = 70
Total                     n = 25              n = 75                        n = 100

Sensitivity = 20/25 = .80
Specificity = 65/75 = .87
Efficiency = (20 + 65) / 100 = .85
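The three indices can be computed from any such 2×2 table. A minimal sketch using the cell counts from the hypothetical table above (the function name `utility_indices` is mine):

```python
# Diagnostic utility indices from a 2x2 classification table.
def utility_indices(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # column percent among those with the disorder
    specificity = tn / (tn + fp)   # column percent among those without the disorder
    efficiency  = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, efficiency

# Cell counts from the hypothetical table: 20 true positives, 10 false positives,
# 5 false negatives, 65 true negatives
sens, spec, eff = utility_indices(tp=20, fp=10, fn=5, tn=65)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} efficiency={eff:.2f}")
# sensitivity=0.80 specificity=0.87 efficiency=0.85
```

Changing the cell counts to reflect a stricter test (fewer people classified as meeting criteria) will shift all three indices, which is the dependence on test leniency discussed below.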
The utility indices are dependent upon the leniency or strictness of the test in diagnosing individuals. For example, if the test were very strict and classified only 10 of the individuals as meeting the criteria, rather than the 30 shown above, then the sensitivity, specificity and efficiency indices would be different.
The utility indices are also dependent upon how you determined the "true" diagnosis. How would you determine the "True" diagnosis for an individual?
Watson et al. (1991) used the DIS PTSD scale (Robins & Helzer, 1985) as the "gold standard" for diagnosing PTSD. The DIS PTSD scale is a widely used scale that has good utility scores for clinical populations (but not for nonclinical populations). The participants (study 2) were 61 Vietnam veterans, 27 of whom were diagnosed with PTSD using the DIS PTSD scale. Foa (1995) used the Structured Clinical Interview for the DSM-III-R (Williams et al., 1992) as the "gold standard." The participants were 248 individuals recruited from treatment and research centers. The utility scores are shown below.
Utility Indices for the PTSD-I (from Watson et al., 1991, study 2) and the PDS (from Foa, 1995)

              PTSD-I    PDS
Sensitivity   .89       .82
Specificity   .94       .77
Efficiency    .92       .79
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Foa, E. B. (1995). Posttraumatic Stress Diagnostic Scale. Minneapolis: National Computer Systems.
Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version III-A). St. Louis, MO: Washington University, Department of Psychiatry.
Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991). The PTSD interview: Rationale, description, reliability, and concurrent validity of a DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.
Williams, J. B. W., Gibbon, M., First, M. B., Spitzer, R. L., Davies, M., Borus, J., Howes, M. J., Kane, J., Pope, H. G., Rounsaville, B., & Wittchen, H.-U. (1992). The Structured Clinical Interview for DSM-III-R (SCID). Archives of General Psychiatry, 49, 630-636.
Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using the PTSD-I as an outcome measure. Poster presented at the annual meeting of the International Society for Traumatic Stress Studies, Chicago, IL.
1. The computational formula for Cronbach's alpha uses the variances of each of the items and the variance of the test as a whole rather than the average intercorrelation between all of the items:

alpha = (n / (n - 1)) * (1 - Σσ_{i}² / σ_{x}²)

where n = the number of items in the test, Σσ_{i}² is the sum of the squared standard deviations of each of the items (i.e., the sum of the item variances), and σ_{x}² is the squared standard deviation of the total test scores (i.e., the variance of the test).
Or, you could use a computer program to compute alpha. In SPSS for Windows, Cronbach's alpha can be found under:
Statistics
Scale
Reliability analysis ...
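Alternatively, alpha can be computed directly from the computational formula in note 1. A minimal sketch with hypothetical item data (the helper names `variance` and `cronbach_alpha` are mine):

```python
# Cronbach's alpha from the computational formula:
# alpha = (n / (n - 1)) * (1 - sum(item variances) / variance(total scores))
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same respondents."""
    n = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # each respondent's total score
    item_var_sum = sum(variance(item) for item in items)
    return (n / (n - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical responses: 3 items, 5 respondents
items = [[3, 4, 5, 2, 4],
         [2, 4, 5, 1, 3],
         [3, 5, 4, 2, 4]]
print(round(cronbach_alpha(items), 2))
```

For these made-up data the items track each other closely, so alpha comes out high; uncorrelated items would drive the total-score variance down toward the sum of the item variances and alpha toward 0.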
2. Type I error = rejecting the null hypothesis when it is true.
Type II error = failing to reject the null hypothesis when it is false.
In this instance the null hypothesis is that the person does not meet the diagnostic criteria.
revised 02/24/00 © Lee A. Becker, 1999