
NAEP Technical Documentation

Mathematics Interrater Agreement

A random subsample of the mathematics responses for each constructed-response item is scored by a second individual to obtain statistics on interrater agreement. For assessments with a large sample size (national and state samples), 5 percent of responses are scored a second time. For assessments with smaller sample sizes (national samples), 25 percent of responses are scored a second time. This interrater agreement information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters.
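The subsampling step can be illustrated with a minimal sketch. The source does not describe how the random subsample is actually drawn, so the function name `select_for_second_scoring`, the use of integer response IDs, and the rounding rule are all assumptions made for illustration; only the 5 percent and 25 percent rates come from the text above.

```python
import random


def select_for_second_scoring(response_ids, rate, seed=None):
    """Return a random subsample of response IDs to be scored a second time.

    Hypothetical helper: `rate` is the fraction of responses to rescore
    (0.05 or 0.25 in the text above). The rounding rule is an assumption.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in the interval (0, 1]")
    ids = list(response_ids)
    rng = random.Random(seed)       # seeded so the draw is reproducible
    k = round(len(ids) * rate)      # number of responses to rescore
    return sorted(rng.sample(ids, k))  # sampling without replacement


# Illustrative sizes only; these are not actual NAEP sample sizes.
large = select_for_second_scoring(range(2000), 0.05, seed=1)
small = select_for_second_scoring(range(400), 0.25, seed=1)
print(len(large), len(small))  # 100 100
```

Sampling without replacement ensures that no response is selected twice for rescoring within the same draw.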

Agreement reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject-area coordinator. Printed copies are reviewed daily by the lead scoring staff. In addition to the immediate feedback provided by online agreement reports, each scoring supervisor can use the backreading tool to review the actual responses scored by a rater. In this way, the scoring supervisor can monitor each rater closely and correct scoring difficulties almost immediately.

During the scoring of an item, scoring supervisors monitor progress using an interrater agreement tool. This display tool functions in either of two modes:

  • displaying information about all first scores versus all second scores (mode one); or 
  • displaying all scores given by an individual rater versus the scores assigned by the other raters (mode two).

The information is displayed as a matrix with first scores in rows and second scores in columns (mode one), or with an individual rater's scores in rows and all other raters' scores in columns (mode two). Results may be reviewed for either individual raters or the team as a whole. In this format, instances of exact agreement fall along the diagonal of the matrix. Each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second scores and the overall percentage of agreement on the item. Because the interrater agreement reports are cumulative, a printed copy of the agreement report for each item is made periodically and compared with previously generated reports. Scoring staff members save printed copies of all final agreement reports and archive them with the training sets.
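The two display modes can be sketched as a simple cross-tabulation. This is an illustration under stated assumptions, not the NAEP scoring system itself: the record layout `(first_rater, first_score, second_rater, second_score)`, the function names, the sample data, and the choice to express each cell's percentage relative to the total number of second-scored responses are all hypothetical.

```python
from collections import Counter


def agreement_matrix(pairs, score_levels):
    """Cross-tabulate (row_score, column_score) pairs.

    Returns (cells, total, percent_exact), where cells[(row, col)] is a
    (count, percent_of_total) tuple. Exact agreement lies on the diagonal,
    that is, in the cells where row == col.
    """
    counts = Counter(pairs)
    total = len(pairs)
    cells = {}
    for row in score_levels:
        for col in score_levels:
            n = counts[(row, col)]
            cells[(row, col)] = (n, 100.0 * n / total if total else 0.0)
    exact = sum(counts[(s, s)] for s in score_levels)
    percent_exact = 100.0 * exact / total if total else 0.0
    return cells, total, percent_exact


def pairs_for_rater(records, rater):
    """Mode two: one rater's scores (rows) versus the other raters' (columns)."""
    pairs = []
    for first_rater, first_score, second_rater, second_score in records:
        if first_rater == rater:
            pairs.append((first_score, second_score))
        elif second_rater == rater:
            pairs.append((second_score, first_score))
    return pairs


# Made-up double-scored responses for a single item with score levels 0-3.
records = [
    ("A", 2, "B", 2),
    ("A", 1, "C", 1),
    ("B", 3, "A", 2),
    ("C", 0, "B", 0),
    ("B", 1, "C", 2),
]
levels = [0, 1, 2, 3]

# Mode one: all first scores (rows) versus all second scores (columns).
mode_one = [(fs, ss) for _, fs, _, ss in records]
cells, total, percent_exact = agreement_matrix(mode_one, levels)
print(total, percent_exact)  # 5 60.0

# Mode two: rater A's scores versus the scores assigned by the other raters.
cells_a, total_a, percent_a = agreement_matrix(pairs_for_rater(records, "A"), levels)
print(total_a, round(percent_a, 1))  # 3 66.7
```

Because both modes reduce to a list of (row score, column score) pairs, the same tabulation serves the team-level view and the individual-rater view.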

Links to scoring statistics for constructed-response items, mathematics assessments: Various years, 2000–2017
Year                     Item-by-item interrater agreement   Interrater agreement ranges   Number of constructed-response items
2017                     X                                   X                             X
2015                     X                                   X                             X
2013                     X                                   X                             X
2012 (long-term trend)   X                                   X                             X
2011                     X                                   X                             X
2009                     X                                   X                             X
2008 (long-term trend)   X                                   X                             X
2007                     X                                   X                             X
2005                     X                                   X                             X
2004 (long-term trend)   X                                   X                             X
2003                     X                                   X                             X
2000                     X                                   X                             X
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), various years, 2000–2017 Mathematics Assessments.

Last updated 04 May 2022 (SK)