
NAEP Technical Documentation: Within-Year Interrater Agreement


Monitoring within-year interrater agreement is accomplished by re-routing some responses to be scored a second time. For all items, the scoring system selects a subset of the current-year student responses for second scoring. Scorers cannot distinguish responses being scored a second time from responses being scored for the first time. The first and second scores for this subset of responses are analyzed to determine the within-year agreement. The agreement statistics can be obtained by the scoring supervisor at any point during scoring, and within-year interrater agreement is closely monitored to ensure the quality of the scoring.
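As an illustration of the statistic being monitored, the minimal sketch below computes percent exact agreement from paired first and second scores. It is not part of the NAEP scoring system; the function name and the example scores are hypothetical.

```python
# Minimal sketch (not NAEP's production system): percent exact agreement
# between first and second scores for the subset of double-scored responses.
def percent_exact_agreement(first_scores, second_scores):
    """Return the percentage of responses where both scorers assigned the same score."""
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("Score lists must be the same nonzero length.")
    matches = sum(1 for a, b in zip(first_scores, second_scores) if a == b)
    return 100.0 * matches / len(first_scores)

# Hypothetical example: 8 double-scored responses on a 3-point (0-2) item.
first = [0, 1, 2, 1, 0, 2, 1, 2]
second = [0, 1, 2, 2, 0, 2, 1, 1]
print(f"Exact agreement: {percent_exact_agreement(first, second):.1f}%")  # 75.0%
```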

Through 2009, NAEP used the following target standards for within-year agreement:

  • items scored on 2-point scales: 85 percent exact agreement;
  • items scored on 3-point scales: 80 percent exact agreement;
  • items scored on 4-point and 5-point scales: 75 percent exact agreement; and
  • items scored on 6-point scales: 60 percent exact agreement.

Starting in 2010 and continuing forward, NAEP uses a two-tier flagging system, with flags determined separately for each subject. Items with slightly low interrater reliability (IRR) are yellow flagged to indicate mild concern; items of greater concern are red flagged. The flag thresholds were determined from historical scoring data for each subject. A red flag indicates an uncharacteristically low IRR given historical data. The red flag is intended to be set at the 5th percentile of historical IRRs for a particular subject, grade, and score category, meaning that the IRR falls within the bottom 5 percent of a representative distribution of IRRs. The yellow flag is intended to be set at the 20th percentile of historical IRRs for a subject, grade, and score category. The word ‘intended’ is used because sufficient historical data to set robust flags were not available for all subjects and score categories; in those cases, additional consultation with subject-area specialists was used to set the flags. NAEP continues to evaluate the flagging targets and to make updates when appropriate.
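The percentile rule can be illustrated with a short sketch that derives yellow and red cut points from a set of historical IRRs. The data and function shown here are hypothetical, and, as noted above, actual thresholds also reflect consultation with subject-area specialists where historical data were thin.

```python
# Illustrative sketch (assumed, not the NAEP implementation): setting yellow and
# red flag thresholds from a distribution of historical IRRs for one
# subject, grade, and score category.
import numpy as np

def flag_thresholds(historical_irrs):
    """Red flag at the 5th percentile, yellow flag at the 20th percentile."""
    red = np.percentile(historical_irrs, 5)
    yellow = np.percentile(historical_irrs, 20)
    return yellow, red

# Hypothetical historical exact-agreement rates (percent) for one category.
history = [88, 90, 85, 92, 87, 91, 89, 86, 93, 90]
yellow, red = flag_thresholds(history)
print(f"Yellow flag below {yellow:.0f}%, red flag below {red:.0f}%")
```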

Target standards are as follows:

Current percent exact agreement target standards for yellow flag, by item point scale and subject area

Subject | 2-point scale | 3-point scale | 4-point scale | 5-point or more scale
Arts | 91% | 81% | 76% | 79%
Civics | 80% | 80% | † | †
Economics | 90% | 85% | 80% | 75%
Geography | 95% | 93% | 85% | †
Mathematics | 97% | 94% | 91% | 91%
Reading | 87% | 82% | 77% | 77%
Science | 92% | 87% | 86% | 81%
Technology and engineering literacy (TEL) | 90% | 85% | 80% | †
U.S. history | 87% | 82% | 80% | †
Writing | † | † | 70% | 61%
† Not applicable.
Current percent exact agreement target standards for red flag, by item point scale and subject area

Subject | 2-point scale | 3-point scale | 4-point scale | 5-point or more scale
Arts | 85% | 75% | 74% | 77%
Civics | 80% | 75% | † | †
Economics | 85% | 80% | 75% | 70%
Geography | 92% | 85% | 75% | †
Mathematics | 94% | 92% | 90% | 90%
Reading | 85% | 80% | 75% | 75%
Science | 89% | 84% | 83% | 78%
Technology and engineering literacy (TEL) | 85% | 80% | 75% | †
U.S. history | 85% | 80% | 77% | †
Writing | † | † | 70% | 57%
† Not applicable.
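As a usage illustration, the sketch below checks an item's observed exact-agreement rate against the yellow and red targets for its subject and point scale. The dictionaries hold only a few example cells copied from the tables above, and the function name is hypothetical.

```python
# Illustrative check (not NAEP code) of an item's observed exact agreement
# against the yellow and red targets for its subject and point scale.
# Targets are copied from the tables above for a few example cells.
YELLOW = {("Reading", 3): 82, ("Mathematics", 2): 97, ("Science", 4): 86}
RED = {("Reading", 3): 80, ("Mathematics", 2): 94, ("Science", 4): 83}

def flag_item(subject, scale_points, observed_agreement):
    """Return 'red', 'yellow', or 'ok' for an observed exact-agreement percentage."""
    if observed_agreement < RED[(subject, scale_points)]:
        return "red"
    if observed_agreement < YELLOW[(subject, scale_points)]:
        return "yellow"
    return "ok"

print(flag_item("Reading", 3, 81))  # yellow: below 82 but not below 80
print(flag_item("Reading", 3, 79))  # red: below 80
```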

Scoring staff also need to be alert for downward changes in the within-year agreement for an item. For example, if first and second scores were in exact agreement 90 percent of the time in the morning (or on day 1 of scoring) and the rate of exact agreement declined to 82 percent in the afternoon (or on day 2 of scoring), a problem may exist even if the overall within-year agreement remains above the minimum standard. Backreading and calibration are tools used to monitor and correct declines in within-year agreement. If within-year agreement rates fall below the indicated standards for an item and it is believed this was primarily a result of inconsistent scoring, the item may be rescored. Decisions about the rescoring of items are made by test development staff and psychometricians in consultation with scoring staff and content coordinators.
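A minimal sketch of this kind of session-to-session comparison is shown below. The 5-point decline cutoff and the function name are arbitrary illustrative choices, not NAEP rules.

```python
# Hypothetical monitoring sketch: compare exact agreement between two scoring
# sessions (e.g., morning vs. afternoon) and report a decline, even when the
# overall rate is still above the target standard.
def session_decline(morning_rate, afternoon_rate, target, max_drop=5.0):
    """Return a warning message if agreement dropped notably between sessions."""
    drop = morning_rate - afternoon_rate
    if drop >= max_drop:
        status = "still above" if afternoon_rate >= target else "now below"
        return f"Agreement fell {drop:.0f} points and is {status} the {target:.0f}% target."
    return "No notable decline."

print(session_decline(90.0, 82.0, 80.0))  # flags the 8-point drop from the example above
```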

For more information on the estimation of reliability based on interrater agreement, see Constructed-Response Interrater Reliability.


Last updated 02 November 2022 (SK)