During the scoring of student responses, some responses to constructed-response items are rescored by a second rater for one of two reasons:

- to obtain reliability statistics for items rescored for current-year rater consistency (within-year reliability); and
- to obtain reliability statistics for items from the previous assessment that are rescored during the current assessment (cross-year reliability).
The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, and Cohen's Kappa (Cohen 1968). These measures are summarized in Kaplan and Johnson (1992) and Abedi (1996), and each has advantages and disadvantages in different situations. Agreement percentages vary considerably across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas for a complex six-point writing constructed-response item, an agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percent of the prior-year interrater agreement for two- and three-point items, and within 10 percent for four- to six-point items. For more information about subject-specific targets for interrater agreement, refer to the TDW Scoring section.
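As a rough illustration of the exact-agreement calculation and the trend-year comparison described above, the sketch below computes the percentage of responses on which two raters assigned identical scores and flags an item whose agreement differs from the prior-year figure by more than 8 percent. This is a minimal Python sketch, not NAEP production code; the function name, example scores, and prior-year value are hypothetical.

```python
# Minimal sketch (not NAEP production code): percent exact agreement between
# two raters who each scored the same set of constructed-response items.
# The variable names and example scores below are hypothetical.

def exact_agreement_pct(first_scores, second_scores):
    """Percentage of responses on which the two raters assigned identical scores."""
    if len(first_scores) != len(second_scores):
        raise ValueError("Both raters must score the same set of responses.")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)

# Hypothetical scores for a three-point item (0-2) scored by two raters.
rater_1 = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]
rater_2 = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0]

within_year = exact_agreement_pct(rater_1, rater_2)  # 80.0 for these data
prior_year = 85.0  # hypothetical agreement reported for the prior assessment

# For a two- or three-point item, the trend-year agreement should fall
# within 8 percent of the prior-year interrater agreement.
flag_for_review = abs(within_year - prior_year) > 8.0
print(f"exact agreement = {within_year:.1f}%, flag for review = {flag_for_review}")
```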
Cohen's Kappa quantifies the reliability between groups of scorers while accounting for agreement due to chance. Kappa statistics should be higher than 0.7 for two- and three-point items and higher than 0.6 for four- to six-point items. Items whose reliability statistics fall below these criteria prompt an investigation into the low rater agreement. These criteria were developed with NAEP's unique assessment design in mind. For every NAEP assessment, the percentage of exact agreement is provided for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items.
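The chance correction that distinguishes Cohen's Kappa from the raw agreement percentage can be made concrete with a short sketch: observed agreement is compared with the agreement expected if each rater assigned scores independently according to their own marginal score distribution. Again, this is only an illustrative Python sketch with hypothetical scores, not NAEP's scoring software; statistical packages offer equivalent routines (e.g., sklearn.metrics.cohen_kappa_score in scikit-learn).

```python
# Minimal sketch (not NAEP production code): Cohen's Kappa for two raters,
# i.e., exact agreement corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(first_scores, second_scores):
    n = len(first_scores)
    observed = sum(a == b for a, b in zip(first_scores, second_scores)) / n
    # Chance agreement: product of the two raters' marginal score proportions,
    # summed over all score categories.
    first_marginals = Counter(first_scores)
    second_marginals = Counter(second_scores)
    categories = set(first_marginals) | set(second_marginals)
    expected = sum(
        (first_marginals[c] / n) * (second_marginals[c] / n) for c in categories
    )
    return (observed - expected) / (1.0 - expected)

# Hypothetical scores for a three-point item (0-2) scored by two raters.
rater_1 = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]
rater_2 = [0, 1, 2, 2, 1, 0, 2, 1, 2, 0]

kappa = cohens_kappa(rater_1, rater_2)  # about 0.85 for these data
# For a two- or three-point item, a Kappa above 0.7 would meet the criterion.
print(f"kappa = {kappa:.2f}, meets 0.7 criterion: {kappa > 0.7}")
```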
Subject | Year | Within-year reliability: Grade 4 | Within-year reliability: Grade 8 | Within-year reliability: Grade 12 | Cross-year reliability: Grade 4 | Cross-year reliability: Grade 8 | Cross-year reliability: Grade 12
---|---|---|---|---|---|---|---
Arts - Music | 2016 | † | R3 | † | † | R3 | †
Arts - Music | 2008 | † | R3 | † | † | — | †
Arts - Visual arts | 2016 | † | R3 | † | † | R3 | †
Arts - Visual arts | 2008 | † | R3 | † | † | — | †
Civics | 2018 | † | R3 | † | † | R3 | †
Civics | 2014 | † | R3 | † | † | R3 | †
Civics | 2010 | R3 | R3 | R3 | R3 | R3 | R3
Civics | 2006 | R3 | R3 | R3 | R3 | R3 | R3
Economics | 2012 | † | † | R3 | † | † | R3
Economics | 2006 | † | † | R3 | † | † | †
Geography | 2018 | † | R3 | † | † | R3 | †
Geography | 2014 | † | R3 | † | † | R3 | †
Geography | 2010 | R3 | R3 | R3 | R3 | R3 | R3
Geography | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2017 | R3 | R3 | † | † | † | †
Mathematics | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2013 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2011 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2009 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2007 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2003 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2017 | R3 | R3 | † | † | † | †
Reading | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2013 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2011 | R3 | R3 | † | R3 | R3 | †
Reading | 2009 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2007 | R3 | R3 | † | R3 | R3 | †
Reading | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2003 | R3 | R3 | † | R3 | R3 | †
Reading | 2002 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2000 | R3 | † | † | R3 | † | †
Science | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Science | 2011 | † | R3 | † | † | R3 | †
Science | 2009 | R3 | R3 | R3 | — | — | —
Science | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Science | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Technology and engineering literacy (TEL) | 2018 | † | R3 | † | † | R3 | †
Technology and engineering literacy (TEL) | 2014 | † | R3 | † | † | — | †
U.S. history | 2018 | † | R3 | † | † | R3 | †
U.S. history | 2014 | † | R3 | † | † | R3 | †
U.S. history | 2010 | R3 | R3 | R3 | R3 | R3 | R3
U.S. history | 2006 | R3 | R3 | R3 | R3 | R3 | R3
U.S. history | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Writing | 2011 | — | R3 | R3 | — | — | —
Writing | 2007 | † | R3 | R3 | † | R3 | R3
Writing | 2002 | R3 | R3 | R3 | R3 | R3 | R3
Subject | Year | Within-year reliability: Age 9 | Within-year reliability: Age 13 | Within-year reliability: Age 17 | Cross-year reliability: Age 9 | Cross-year reliability: Age 13 | Cross-year reliability: Age 17
---|---|---|---|---|---|---|---
Mathematics long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
Reading long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
† Not applicable.
NOTE: R2 is the non-accommodated reporting sample; R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation, along with other sampled students, including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students; it is the only reporting sample used in NAEP after 2001. The R2 sample was used as the bridge sample in the 2004 bridge studies to examine the comparability of scoring based on an assessment sample similar to those used for long-term trend (LTT) assessments in 2001 and earlier years.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Mathematics and Reading Long-Term Trend Assessments.