Difference between revisions of "Interrater reliability"
(in from wp) 

Revision as of 17:42, 16 November 2006
Interrater reliability, or interrater agreement, is the measurement of agreement between raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the metrics given to human judges, for example by determining whether a particular scale is appropriate for measuring a particular variable.
A number of statistics can be used to determine interrater reliability, and different statistics are appropriate for different types of measurement. They include the joint probability of agreement, Cohen's kappa and the related Fleiss' kappa, interrater correlation, and intraclass correlation.
The joint probability of agreement is probably the simplest and least robust measure. It is the number of times the raters assign the same rating (e.g. 1, 2, ... 5) to an item, divided by the total number of items rated. This assumes, however, that the data are entirely nominal. Another problem with this statistic is that it does not take into account that agreement may happen solely by chance.
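As a rough illustration, the joint probability of agreement for two raters can be sketched as the share of items on which their ratings coincide. The function name `percent_agreement` and the ratings below are hypothetical and chosen only for this example.

```python
def percent_agreement(ratings_a, ratings_b):
    """Fraction of items on which two raters give the identical rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same items")
    if not ratings_a:
        raise ValueError("at least one rated item is required")
    # Ratings are compared for equality only, i.e. treated as nominal labels.
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


# Hypothetical ratings of five items on a 1-5 scale.
rater_a = [1, 2, 3, 4, 5]
rater_b = [1, 2, 3, 5, 4]
agreement = percent_agreement(rater_a, rater_b)
print(agreement)  # the raters match on 3 of the 5 items
```

Because only exact matches count, a rating of 4 against 5 is penalised exactly as much as 1 against 5, which is what treating the data as nominal means in practice.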
Kappa statistics
 Main articles: Cohen's kappa, Fleiss' kappa
Cohen's kappa (Cohen, 1960), which works for two raters, and Fleiss' kappa (Fleiss, 1971), an adaptation that works for any fixed number of raters, are statistics that also take into account the amount of agreement that could be expected to occur through chance. They suffer from the same problem as the joint probability of agreement in that they treat the data as nominal and assume no underlying connection between the scores.
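A minimal sketch of Cohen's kappa for two raters may make the chance correction concrete. It uses the usual form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's category frequencies. The function name and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items with identical ratings.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories (a missing category counts as zero).
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical yes/no judgements on four items.
rater_a = ["yes", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "no"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)
```

In this example the raters agree on three of four items (p_o = 0.75), but half of that agreement would be expected by chance (p_e = 0.5), so kappa is 0.5 rather than 0.75.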
Correlation coefficients
 Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient
For interrater correlation, either Pearson's correlation coefficient or Spearman's correlation coefficient can be used to measure pairwise correlation between raters, and the mean can then be taken to give an average level of agreement for the group. The mean of Spearman's coefficient has been used to measure interjudge correlation. However, neither Spearman's nor Pearson's coefficient takes into account the magnitude of the differences between scores. For example, if Judge B scores each of four segments exactly one point higher than Judge A does, the correlation coefficient is 1, indicating perfect correlation, even though the judges do not completely agree.
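The limitation can be checked numerically. The sketch below computes Pearson's coefficient by hand for two judges whose scores are hypothetical (they are not taken from the source): Judge B is always exactly one point above Judge A.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / sqrt(var_x * var_y)


# Hypothetical scores for four segments; Judge B is one point higher each time.
judge_a = [1, 2, 1, 3]
judge_b = [2, 3, 2, 4]

r = pearson_r(judge_a, judge_b)
exact_matches = sum(a == b for a, b in zip(judge_a, judge_b))
print(r, exact_matches)
```

The correlation comes out as 1 although the two judges never assign the same score to any segment, because a constant offset leaves the deviations from each judge's own mean unchanged.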
Intraclass correlation coefficient
Another way of performing reliability testing is to use the intraclass correlation coefficient (ICC) (Shrout and Fleiss, 1979). This is defined as "the proportion of variance of an observation due to between-subject variability in the true scores" (Everitt, 1996). The range of the ICC is, as with the other correlation coefficients, between -1.0 and 1.0. The ICC will be high when there is little variation between the scores given to each segment by the raters, e.g. if all raters give the same or similar scores to each of the segments. The ICC is an improvement over Pearson's and Spearman's coefficients, as it takes into account the differences, or variance, between the ratings for individual segments, along with the correlation between raters.
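The ICC exists in several forms, and the text above does not say which one is meant. As an assumption for illustration, the sketch below uses the one-way random-effects form, often written ICC(1,1), from Shrout and Fleiss (1979): (BMS - WMS) / (BMS + (k - 1) WMS), where BMS and WMS are the between-segment and within-segment mean squares and k is the number of raters. The function name and the ratings are hypothetical.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of rows, one row per segment, each row holding
    the scores given to that segment by the k raters.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two segments")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("every segment needs the same number (>= 2) of ratings")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-segment mean square.
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-segment mean square (disagreement between raters on a segment).
    wms = sum(
        (score - m) ** 2 for row, m in zip(ratings, row_means) for score in row
    ) / (n * (k - 1))
    denominator = bms + (k - 1) * wms
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (bms - wms) / denominator


# Hypothetical scores: each row is one segment, columns are Judge A and Judge B.
offset_ratings = [[1, 2], [2, 3], [1, 2], [3, 4]]
identical_ratings = [[1, 1], [2, 2], [3, 3]]

icc_offset = icc_one_way(offset_ratings)
icc_identical = icc_one_way(identical_ratings)
print(icc_offset, icc_identical)
```

With identical scores the within-segment mean square is zero and the ICC is 1. With Judge B always one point above Judge A, the pairwise Pearson correlation would still be 1, but the ICC drops to 4/7 (about 0.57) because the constant disagreement contributes within-segment variance.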
Notes
 Cohen, J. (1960) "A coefficient for agreement for nominal scales" in Educational and Psychological Measurement. Vol. 20, pp. 37-46
 Fleiss, J. L. (1971) "Measuring nominal scale agreement among many raters" in Psychological Bulletin. Vol. 76, No. 5, pp. 378-382
 Shrout, P. and Fleiss, J. L. (1979) "Intraclass correlation: uses in assessing rater reliability" in Psychological Bulletin. Vol. 86, pp. 420-428
 Everitt, B. (1996) Making Sense of Statistics in Psychology (Oxford: Oxford University Press) ISBN 0198523661
Further reading
 Gwet, K. (2001) Handbook of Inter-Rater Reliability (Gaithersburg: StatAxis Publishing) ISBN 0970806205