
Agreement, the F-Measure, and Reliability in Information Retrieval

George Hripcsak, Adam S. Rothschild
DOI: http://dx.doi.org/10.1197/jamia.M1733. Pages 296-298. First published online: 1 May 2005


Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the κ statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among experts and that κ approaches these measures as the number of negative cases grows large. Positive specific agreement—or the equivalent F-measure—may be an appropriate way to quantify interrater reliability and therefore to assess the reliability of a gold standard in these studies.
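
The equivalence stated above can be checked numerically for a single pair of raters. The sketch below is illustrative only and is not taken from the article: it assumes a 2x2 table in which a is the number of items both raters marked positive, b and c are the items marked positive by only the first or only the second rater, and d is the number of items both marked negative. The function names and the example counts are hypothetical. One rater is treated as the gold standard when computing precision and recall; because the F-measure is symmetric in b and c, the choice does not matter.

```python
def positive_specific_agreement(a, b, c):
    """Positive specific agreement: 2a / (2a + b + c). Does not use d."""
    return 2 * a / (2 * a + b + c)


def f_measure(a, b, c):
    """Balanced F-measure, treating rater 1 as the gold standard."""
    precision = a / (a + c)  # rater 2's positives that rater 1 also marked
    recall = a / (a + b)     # rater 1's positives that rater 2 also marked
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table; requires the negative count d."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    a, b, c = 40, 10, 5  # hypothetical counts, chosen only for illustration
    print("positive specific agreement:", positive_specific_agreement(a, b, c))
    print("F-measure:", f_measure(a, b, c))
    # Kappa depends on d; watch it move toward the two values above.
    for d in (0, 100, 10_000, 1_000_000):
        print("d =", d, "kappa =", cohen_kappa(a, b, c, d))
```

Algebraically, substituting the precision and recall expressions into the F-measure gives 2a / (2a + b + c), which is the positive specific agreement. Kappa can be written as 2(ad - bc) / ((a + b)(b + d) + (a + c)(c + d)), and dividing numerator and denominator by d shows that it tends to the same quantity as d grows.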
