Statistical significance relates to the question of whether or not the results of a statistical test meet an accepted criterion level. In psychology this level is typically p < .05. The criterion of p < .05 was chosen to limit the possibility of a Type I error, finding a significant difference when one does not exist. It does not protect us from Type II error, failure to find a difference when the difference does exist. As you know, Type II error is related to the issue of the power of the statistical test.
Statistical significance is a function of many factors including:
The magnitude of the effect
The sample size
The reliability of the effect (i.e., is the treatment equally effective for all participants?)
The reliability of the measurement instrument
It is always tempting to think of the magnitude of the significance test (wow, the significance level is at p < .0005!) as being only a function of the first factor, the magnitude of the effect, while ignoring the possibility that the small p value may be due to a large sample size, very little variability in the response to the treatment, a measurement instrument that is very reliable, and various combinations of those other factors.
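To make the point concrete, here is a minimal sketch showing how the same effect magnitude can give very different p values as the sample size grows. It assumes a two-group comparison with equal group sizes and a known standard deviation (a z test approximation); the function name and the effect size of 0.3 are illustrative choices, not values from these notes.

```python
import math

def two_sided_p_from_d(d, n_per_group):
    """Two-sided p value for a two-group z test.

    d is the standardized mean difference (effect magnitude) and
    n_per_group is the number of participants in each group.
    Assumes equal group sizes and a known standard deviation.
    """
    z = d * math.sqrt(n_per_group / 2.0)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same effect magnitude, increasing sample size: p shrinks steadily.
for n in (20, 80, 320):
    print(n, two_sided_p_from_d(0.3, n))
```

With the effect held constant, only the sample size changes the p value here, which is why a very small p by itself says little about the magnitude of the effect.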
Isn't it interesting that our intuition tells us that we should be interested in the magnitude of the effect rather than (or at least in addition to) the significance level? I believe we come to a natural, but erroneous, interpretation of statistical significance as a measure of the effect magnitude because we intuitively know that somehow it is the magnitude of the effect that is fundamentally important.
Proponents of meta-analysis have been interested in the measurement of the magnitude of a treatment effect for many years. They have wanted to compare the size of the treatment effect across many different studies. The problem was how to compare treatment effects across different studies when the traditionally reported statistical significance tests were contaminated by all those other factors. They have settled on a measure of the magnitude of a treatment effect that controls for the sample size, one of the major contributors to the significance level statistic. That measure is called the effect size. Effect sizes were discussed in a previous set of lecture notes (see the effect size notes).
More recently clinical psychologists have taken the next step and asked the question, is the magnitude of change produced by the treatment clinically significant? In the next two sections I will describe the computation and interpretation of the effect sizes and then discuss how clinical significance is defined and some of the ways that are being proposed to measure clinical significance.
A very strict definition of clinical significance is when "once troubled and disordered clients are now, after treatment, not distinguishable from a meaningful and representative nondisturbed reference group" (Kendall & Grove, 1988, p. 148). In addition it has been argued that the change due to treatment must be reliable (Jacobson & Truax, 1991; Ogles, Lambert, & Masters, 1996). Wilson, Becker, and Tinker (1997) report clinical significance both at the level of the group, by comparing the treatment group as a whole with the normal comparison group, and at the level of the individual participants.
In order to determine clinical significance you must operationally define each of the terms in the definition. How can you define a client as "troubled and disordered?" When is a client "not distinguishable from a ...nondisturbed reference group?" How can you define "reliable change?"
Finding Normative Data. The definition by Kendall and Grove (1988) implies that you need to find relevant data on a normative group of undisturbed persons. Normative data for disturbed and nondisturbed groups are typically reported in the test manuals for clinical scales. For example, Wilson, Becker, and Tinker (1995; 1997) used normative data reported by the test manuals for the State-Trait Anxiety Inventory (STAI; Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983) and for the Symptom Check List (SCL-90-R; Derogatis, 1992). There is no test manual for the Impact of Events Scale (IES; Horowitz, Wilner, & Alvarez, 1979). We called Dr. Horowitz to see if he had any suggestions for how to obtain normative data and he referred us to a review paper (Horowitz, Field, & Classen, 1993). That review reported means and standard deviations for several groups of traumatized individuals before and after treatment, and for some nontraumatized control groups. From those data we created a composite score for nontraumatized individuals.^{1} For some scales you will not be able to find any normative data. For example, we were not able to come up with satisfactory normative data for the Subjective Units of Disturbance Scale (SUDS; Wolpe, 1990), so we did not report clinical significance for that scale.
What is "normal?" There is beginning to be a consensus that "normal" can be defined as ±1 SD from the mean of the nondisturbed reference group (also called the normative group). That is, if the mean of the treated group falls within ±1 SD of the mean of the normative group then the treated group is indistinguishable from the normative group. At the level of the individual the consensus is that the score that is 1 SD above the mean of the normative group is a reasonable cutoff score. An individual who falls at or below this cutoff score is viewed as having a successful outcome (they are "cured").
What is "troubled and disordered?" Once you take the position that ±1 SD from the mean of the nondisturbed reference group is normal, then any scores beyond +1 SD are disturbed. Scores beyond +2 standard deviations are even more disturbed.
So, a clear demonstration of clinical significance would be to take a group of clients who score, say, beyond +2 SDs of the normative group prior to treatment and move them to within ±1 SD of the mean of that group. The research implication of this definition is that you want to select people who are clearly disturbed to be in the clinical outcome study. If the mean of your untreated group is at, say, +1.2 SDs above the normative mean, the change due to treatment probably is not going to be viewed as clinically significant.
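The normative comparison can be sketched in a few lines. The example below converts a raw score to a z-score relative to the normative group and applies the +1 SD and +2 SD boundaries. It assumes a scale on which higher scores mean more disturbance; the function names and the normative mean of 10 and SD of 5 are hypothetical, chosen only for illustration.

```python
def normative_z(raw_score, norm_mean, norm_sd):
    """Express a raw score in SD units of the normative group."""
    return (raw_score - norm_mean) / norm_sd

def cutoff_score(norm_mean, norm_sd, k=1.0):
    """Raw score lying k SDs above the normative mean (k = 1 is the usual cutoff)."""
    return norm_mean + k * norm_sd

def describe(raw_score, norm_mean, norm_sd):
    """Label an individual score using the +1 SD and +2 SD boundaries."""
    z = normative_z(raw_score, norm_mean, norm_sd)
    if z <= 1.0:
        return "within normal range"   # at or below the +1 SD cutoff
    if z <= 2.0:
        return "disturbed"             # between +1 SD and +2 SD
    return "more disturbed"            # beyond +2 SD

# Hypothetical normative group: mean = 10, SD = 5, so the cutoff is 15.
NORM_MEAN, NORM_SD = 10.0, 5.0
print(cutoff_score(NORM_MEAN, NORM_SD))
for score in (12, 17, 23):
    print(score, normative_z(score, NORM_MEAN, NORM_SD), describe(score, NORM_MEAN, NORM_SD))
```

The same z-score conversion applies to a group mean when the question is whether the treated group as a whole falls within the normal range.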
Some people have argued that Kendall and Grove's (1988) definition is too stringent. Is your clinical work a failure if there is substantial improvement in your client but he or she has not yet passed the +1 SD cutoff score? One approach to this issue is to make finer gradations along the outcome scale. For example, the scores on the Beck Depression Inventory (BDI) are categorized as falling into one of four groups: normal, mildly depressed, moderately depressed, and severely depressed. Clinically significant improvement could be defined as movement from one category (severe) to another (moderate) without having to move all the way to the normal range.
Graphic Representation of Clinical Significance
Clinical significance can be graphically represented by superimposing normative group information on a graph showing pretreatment and posttreatment means. The data in this example are from a treatment outcome study that used immediate and delayed EMDR treatment for psychological trauma (Wilson, Becker, & Tinker, 1995). The effects for two dependent variables are shown in Figure 1: IES Intrusion (top) and the SCL-90-R Depression subscale (bottom).

The left vertical axes show the raw scores for each dependent variable. The right vertical axes show the conversion of the raw scores to normative group z-scores. The area within ±1 SD of the mean of the normative group is filled with a diagonal crosshatch. The area from +1 SD to +2 SD from the mean of the normative group is filled with a vertical crosshatch. Measurement times are shown on the horizontal axis, T1 through T5. The error bars for each mean are the 95% confidence intervals for that mean based on each mean's standard error. They are appropriate for comparing differences between groups; they are too conservative for within-group comparisons.
The IES Intrusion scores at pretreatment (T1) were beyond +1 SD of the mean of the normative group (both group means fell within the vertical crosshatch area). At T2 the mean of the immediate treatment group had moved to just below the mean (Z = 0) of the normative group. This would be defined as a clinically significant change because those clients moved from beyond +1 SD of the normative group's mean to within ±1 SD. The mean of the delayed treatment group remained beyond +1 SD of the normative group's mean at T2. After treatment (T3) the mean of the delayed treatment group also moved close to the mean of the normative group. There was no deterioration of the treatment effect at the 3-month follow-up (T4 and T5).
The SCL-90-R Depression scores for both treatment groups were beyond +2 SD of the mean of the normative group at pretreatment (T1). At T2 the mean of the immediate treatment group was near the +1 SD cutoff score while there was no change for the delayed treatment group. After treatment the mean of the delayed treatment group was also near the +1 SD cutoff score. These data suggest that the effects of the treatment were incomplete with respect to depression. One would like the scores to be closer to the mean of the normative group. There was no deterioration of the treatment effect at the 3-month follow-up (T4 and T5).
There is a problem with a categorical approach to defining significant change, such as movement between the BDI severity categories. What if an individual were just above the cutoff for the "severe" category prior to treatment and moved to just below that cutoff after treatment? Would your intuition be that there was a clinically significant change?
A reliable change can be defined in terms of the reliability of the measurement instrument (Jacobson & Truax, 1991; Ogles, Lambert, & Masters, 1996). Reliability has to do with the consistency of the measurement: to what extent are the scores the same from one administration to the next? A very reliable instrument produces nearly identical scores each time the instrument is used. If you measure a table with a ruler, you nearly always get exactly the same distance. Psychological instruments are not as reliable as the physical measurement of distance with a ruler. When we measure a person again using the same psychological scale we typically do not get exactly the same score. The standard deviation of the errors in a set of scores that are due to the unreliability of the scale is called the standard error of measurement (see also the reliability notes). Scales that are highly reliable will have a small standard error of measurement. If you know the reliability of the scale (typically measured as Cronbach's alpha) and the standard deviation of the raw scores on that scale you can find the expected standard deviation of the error scores. The formula for the standard error of measurement is:
SE_{meas} = SD \sqrt{1 - r_{11}}
where SD = the standard deviation of the measure, and r_{11} = the reliability (typically coefficient alpha) of the measure.
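As a worked example, the standard error of measurement can be computed with the usual formula, SD times the square root of (1 - reliability). This is a minimal sketch; the function name and the values SD = 10 and reliability = .91 are hypothetical, not taken from any scale discussed in these notes.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SE_meas = SD * sqrt(1 - r11), with r11 the scale's reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical scale: SD = 10, coefficient alpha = .91 gives SE_meas of about 3.
print(standard_error_of_measurement(10.0, 0.91))
# A less reliable scale with the same SD has a larger SE_meas.
print(standard_error_of_measurement(10.0, 0.75))
```

A perfectly reliable scale (reliability of 1) would have an SE_meas of 0, and a completely unreliable one (reliability of 0) would have an SE_meas equal to the SD of the raw scores.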
A Reliable Change Index (RCI) is computed by dividing the difference between the pretreatment and posttreatment scores by the standard error of the difference between the two scores. If the absolute value of the RCI is greater than 1.96, then the difference is reliable; a change of that magnitude would not be expected due to the unreliability of the measure alone. Conversely, if the absolute value of the RCI is 1.96 or less, then the change is not considered to be reliable; it could have occurred just due to the unreliability of the measure.
RCI = (posttest - pretest) / SE_{meas}
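The RCI computation can be sketched as follows. By default the sketch divides by SE_{meas}, as in the formula printed in these notes. Jacobson and Truax (1991) divide instead by the standard error of the difference, sqrt(2 * SE_{meas}^2), so that variant is offered as an option. The function names, the flag name, and the example scores are hypothetical.

```python
import math

def se_meas(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1.0 - reliability)

def reliable_change_index(pretest, posttest, sd, reliability,
                          use_difference_se=False):
    """RCI = (posttest - pretest) / standard error.

    use_difference_se=False divides by SE_meas (the formula in these notes).
    use_difference_se=True divides by sqrt(2) * SE_meas, the standard error
    of the difference used by Jacobson and Truax (1991).
    """
    se = se_meas(sd, reliability)
    if use_difference_se:
        se = math.sqrt(2.0) * se
    return (posttest - pretest) / se

def is_reliable(rci, criterion=1.96):
    """A change is reliable when |RCI| exceeds the criterion."""
    return abs(rci) > criterion

# Hypothetical client on a symptom scale (SD = 10, reliability = .91).
rci = reliable_change_index(pretest=30, posttest=20, sd=10.0, reliability=0.91)
print(rci, is_reliable(rci))   # negative RCI: the score dropped
small = reliable_change_index(pretest=30, posttest=26, sd=10.0, reliability=0.91)
print(small, is_reliable(small))
```

On a symptom scale where lower scores are better, a reliable improvement shows up as an RCI below -1.96 and a reliable deterioration as an RCI above +1.96. The stricter standard error of the difference makes it harder for a given change to count as reliable.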
Graphic Representation of Reliable Change Index Data
Another way to look at RCI is to set up 95% confidence bounds around a change score of zero and display the results graphically. The clinical significance data shown in Figure 1 represents group data. It displays information about whether there were clinically significant changes for the treatment groups as a whole. The reliable change index data shown in Figure 2 represents individual data. It shows whether there were significant changes at the level of the individual.
Reliable change index data are shown in Figure 2 for the IES Intrusion scale (top) and the SCL-90-R Global Severity Index (GSI; bottom). The graphs show individual data points for participants who were diagnosed with PTSD (closed circles) or partial PTSD (open circles). Partial PTSD participants suffered from some PTSD symptomatology, but they did not meet the full DSM-IV criteria.

The horizontal axes show pretreatment scores; the vertical axes show the 15-month follow-up scores. The horizontal line represents the +1 SD normative-group cutoff score. Scores below the cutoff score are considered to be within the normal range of scores.
The diagonal line from the lower left to the upper right is the line of no change. Data points that fall on the diagonal line are the same at both pretest and at the 15-month follow-up. Data points in the upper left triangle are higher at follow-up than at pretest, that is, they have deteriorated from pretest to follow-up. Data points in the lower right triangle are lower at follow-up than at pretest, that is, they have improved from pretest to follow-up. The dotted lines to the left and right of the diagonal line represent the reliable change index band, set at ±1.96 standard errors of measurement around the line of no change. Individual scores within the RCI band have not shown reliable change while scores outside of the RCI band have shown reliable change.
Inspection of the IES Intrusion data (top graph) shows that the data cluster on the improvement side of the line of no change and that most of the participants fall below the +1 SD cutoff score. There were no participants who were harmed by the treatment, that is, no one showed a reliable increase in intrusion scores from pretest to follow-up.
The GSI scores (bottom graph) are also clustered on the improvement side of the line of no change. There are a few more individuals who fall above the +1 SD cutoff score than there are in the intrusion data. There was one individual who appeared to show reliable deterioration from pretest to follow-up. There appears to be no difference between the PTSD and partial PTSD participants; the open and closed circles appear to be randomly dispersed. Note that the no-change zone is wider for the IES Intrusion score than for the GSI score.
Statistical summaries of the data shown in Figure 2 are displayed in Table 1.
Measure | Reliable deterioration, n (%) | Uncertain change, n (%) | Reliable improvement - not recovered, n (%) | Reliable improvement - recovered, n (%) | Moved from above cutoff score at pretest to below cutoff score at follow-up, n (%)
IES Intrusion (n = 63) | 0 (0) | 24 (38) | 0 (0) | 39 (62) | 31 of 33 (94)
SCL-90-R GSI (n = 66) | 1 (2) | 19 (29) | 9 (14) | 37 (56) | 34 of 51 (67)
Reliable deterioration cases are those in the upper left triangle, outside of the band of no reliable change. Uncertain change cases are those participants within the band of no reliable change. Reliable improvement - not recovered cases are individuals to the right of the band of no reliable change and above the +1 SD cutoff score. Reliable improvement - recovered cases are individuals to the right of the band of no reliable change and below the +1 SD cutoff score. The percent who moved from above the cutoff score at pretest to below the cutoff score at follow-up is self-explanatory.
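The four categories can be assigned mechanically from each participant's pretest score, follow-up score, the standard error of measurement, and the normative cutoff. The sketch below assumes a scale on which higher scores mean more disturbance and counts a score exactly at the cutoff as recovered. The function names, the SE_meas of 3, the cutoff of 15, and the five score pairs are all hypothetical, not data from the study.

```python
from collections import Counter

def classify_change(pretest, followup, se_meas, cutoff, criterion=1.96):
    """Assign one participant to one of the four outcome categories."""
    rci = (followup - pretest) / se_meas
    if rci > criterion:
        return "reliable deterioration"
    if rci >= -criterion:
        return "uncertain change"
    if followup <= cutoff:
        return "reliable improvement - recovered"
    return "reliable improvement - not recovered"

def percent_moved_below_cutoff(pairs, cutoff):
    """Percent of those above the cutoff at pretest who are at or below it at follow-up."""
    above = [(pre, post) for pre, post in pairs if pre > cutoff]
    if not above:
        return None
    moved = sum(1 for _, post in above if post <= cutoff)
    return 100.0 * moved / len(above)

# Hypothetical (pretest, follow-up) pairs; SE_meas and cutoff are made up.
SE_MEAS, CUTOFF = 3.0, 15.0
pairs = [(30, 10), (30, 22), (20, 18), (12, 20), (28, 14)]

counts = Counter(classify_change(pre, post, SE_MEAS, CUTOFF) for pre, post in pairs)
for category, n in counts.items():
    print(category, n, round(100.0 * n / len(pairs)))
print(percent_moved_below_cutoff(pairs, CUTOFF))
```

Dividing each count by the number of participants gives the percentages reported alongside the n values in a summary table of this kind.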
Follette, W. C., & Callaghan, G. M. (1996). The importance of the principle of clinical significance - defining significant to whom and for what purpose: A response to Tingey, Lambert, Burlingame, and Hansen. Psychotherapy Research, 6, 133-143.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19. (Note: this is the original reference for the RCI.)
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.
Martinovich, Z., Saunders, S., & Howard, K. I. (1996). Some comments on "assessing clinical significance." Psychotherapy Research, 6, 124-132.
Ogles, B. M., Lambert, M. J., & Masters, K. S. (1996). Assessing outcome in clinical practice. Boston: Allyn and Bacon. (Note: see pages 82-90.)
Tingey, R. C., Lambert, M. J., Burlingame, G. M., & Hansen, N. B. (1996a). Assessing clinical significance: Proposed extensions to method. Psychotherapy Research, 6, 109-123.
Tingey, R. C., Lambert, M. J., Burlingame, G. M., & Hansen, N. B. (1996b). Clinically significant change: Practical indicators for evaluating psychotherapy outcome. Psychotherapy Research, 6, 144-153.
Wilson, S. A., Becker, L. A., & Tinker, R. H. (1995). Eye movement desensitization and reprocessing (EMDR) treatment for psychologically traumatized individuals. Journal of Consulting and Clinical Psychology, 63, 928-937.
Wilson, S. A., Becker, L. A., & Tinker, R. H. (1997). Fifteen-month follow-up of eye movement desensitization and reprocessing (EMDR) treatment for PTSD and psychological trauma. Journal of Consulting and Clinical Psychology, 65, 1047-1056.
^{1} "The specific studies were selected as normative because Horowitz et al. (1993) considered them to be representative of recovery after an acute response (20 female survivors of a tornado 68 weeks after the event) or after a successful clinical intervention (35 stress clinic patients 66 weeks after the event) or representative of people who had little stress response at the time of the event or several months later (19 male survivors of a tornado 68 weeks after the event, 37 nonpatient controls for the stress clinic patients 66 weeks after the event, and 15 plane crash rescue workers 82 weeks after the event). The unweighted average of the means and standard deviations for these groups were used as estimates of the normative population means and standard deviations." (Wilson, Becker, & Tinker, 1995, p. 934)
© Lee A. Becker, 1997. Revised 04/06/99