During the scoring of student responses, some responses to constructed-response items are rescored by a second rater for one of two reasons:

- to obtain reliability statistics for items rescored for current-year rater consistency (within-year reliability); and
- to obtain reliability statistics for items from the previous assessment that are rescored during the current assessment (cross-year reliability).
The statistics calculated for both of these purposes are the percentage of exact agreement, the intraclass correlation, and Cohen's Kappa (Cohen 1968). These measures are summarized in Kaplan and Johnson (1992) and Abedi (1996), and each has advantages and disadvantages in different situations. Agreement percentages vary considerably across items: on a simple two-point mathematics item, agreement should approach 100 percent, whereas for a complex six-point writing constructed-response item, an agreement of 60 percent would be considered acceptable. The trend-year agreement percentage should approximate the interrater agreement from the prior NAEP administration: within 8 percent of the prior-year interrater agreement for two- and three-point items, and within 10 percent for four- to six-point items. For more information about subject-specific targets for interrater agreement, refer to the TDW Scoring section.
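As a rough illustration of the exact-agreement calculation and the trend-year comparison described above, the sketch below computes the percentage of responses on which two raters assigned identical scores and flags an item whose agreement differs from the prior-year figure by more than 8 percent. This is a minimal Python sketch, not NAEP production code; the function name, example scores, and prior-year value are hypothetical.

```python
# Minimal sketch (not NAEP production code): percent exact agreement between
# two raters who each scored the same set of constructed-response items.
# The variable names and example scores below are hypothetical.

def exact_agreement_pct(first_scores, second_scores):
    """Percentage of responses on which the two raters assigned identical scores."""
    if len(first_scores) != len(second_scores):
        raise ValueError("Both raters must score the same set of responses.")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)

# Hypothetical scores for a three-point item (0-2) scored by two raters.
rater_1 = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]
rater_2 = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0]

within_year = exact_agreement_pct(rater_1, rater_2)  # 80.0 for these data
prior_year = 85.0  # hypothetical agreement reported for the prior assessment

# For a two- or three-point item, the trend-year agreement should fall
# within 8 percent of the prior-year interrater agreement.
flag_for_review = abs(within_year - prior_year) > 8.0
print(f"exact agreement = {within_year:.1f}%, flag for review = {flag_for_review}")
```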
Cohen's Kappa quantifies the reliability between groups of scorers while accounting for agreement due to chance. Kappa statistics should be higher than 0.7 for two- and three-point items and higher than 0.6 for four- to six-point items. Items whose reliability statistics fall below these criteria prompt an investigation into the low rater agreement. These criteria were developed with NAEP's unique assessment design in mind. For every NAEP assessment, the percentage of exact agreement is provided for all constructed-response items, Cohen's Kappa for dichotomously scored constructed-response items, and the intraclass correlation for polytomously scored constructed-response items.
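The chance correction that distinguishes Cohen's Kappa from the raw agreement percentage can be made concrete with a short sketch: observed agreement is compared with the agreement expected if each rater assigned scores independently according to their own marginal score distribution. Again, this is only an illustrative Python sketch with hypothetical scores, not NAEP's scoring software; statistical packages offer equivalent routines (e.g., sklearn.metrics.cohen_kappa_score in scikit-learn).

```python
# Minimal sketch (not NAEP production code): Cohen's Kappa for two raters,
# i.e., exact agreement corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(first_scores, second_scores):
    n = len(first_scores)
    observed = sum(a == b for a, b in zip(first_scores, second_scores)) / n
    # Chance agreement: product of the two raters' marginal score proportions,
    # summed over all score categories.
    first_marginals = Counter(first_scores)
    second_marginals = Counter(second_scores)
    categories = set(first_marginals) | set(second_marginals)
    expected = sum(
        (first_marginals[c] / n) * (second_marginals[c] / n) for c in categories
    )
    return (observed - expected) / (1.0 - expected)

# Hypothetical scores for a three-point item (0-2) scored by two raters.
rater_1 = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]
rater_2 = [0, 1, 2, 2, 1, 0, 2, 1, 2, 0]

kappa = cohens_kappa(rater_1, rater_2)  # about 0.85 for these data
# For a two- or three-point item, a Kappa above 0.7 would meet the criterion.
print(f"kappa = {kappa:.2f}, meets 0.7 criterion: {kappa > 0.7}")
```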
Subject | Year | Within-year reliability: Grade 4 | Within-year reliability: Grade 8 | Within-year reliability: Grade 12 | Cross-year reliability: Grade 4 | Cross-year reliability: Grade 8 | Cross-year reliability: Grade 12
---|---|---|---|---|---|---|---
Arts - Music | 2016 | † | R3 | † | † | R3 | †
Arts - Music | 2008 | † | R3 | † | † | — | †
Arts - Visual arts | 2016 | † | R3 | † | † | R3 | †
Arts - Visual arts | 2008 | † | R3 | † | † | — | †
Civics | 2018 | † | R3 | † | † | R3 | †
Civics | 2014 | † | R3 | † | † | R3 | †
Civics | 2010 | R3 | R3 | R3 | R3 | R3 | R3
Civics | 2006 | R3 | R3 | R3 | R3 | R3 | R3
Economics | 2012 | † | † | R3 | † | † | R3
Economics | 2006 | † | † | R3 | † | † | †
Geography | 2018 | † | R3 | † | † | R3 | †
Geography | 2014 | † | R3 | † | † | R3 | †
Geography | 2010 | R3 | R3 | R3 | R3 | R3 | R3
Geography | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2017 | R3 | R3 | † | † | † | †
Mathematics | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2013 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2011 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2009 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2007 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics | 2003 | R3 | R3 | † | R3 | R3 | †
Mathematics | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2017 | R3 | R3 | † | † | † | †
Reading | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2013 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2011 | R3 | R3 | † | R3 | R3 | †
Reading | 2009 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2007 | R3 | R3 | † | R3 | R3 | †
Reading | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2003 | R3 | R3 | † | R3 | R3 | †
Reading | 2002 | R3 | R3 | R3 | R3 | R3 | R3
Reading | 2000 | R3 | † | † | R3 | † | †
Science | 2015 | R3 | R3 | R3 | R3 | R3 | R3
Science | 2011 | † | R3 | † | † | R3 | †
Science | 2009 | R3 | R3 | R3 | — | — | —
Science | 2005 | R3 | R3 | R3 | R3 | R3 | R3
Science | 2000 | R3 | R3 | R3 | R3 | R3 | R3
Technology and engineering literacy (TEL) | 2018 | † | R3 | † | † | R3 | †
Technology and engineering literacy (TEL) | 2014 | † | R3 | † | † | — | †
U.S. history | 2018 | † | R3 | † | † | R3 | †
U.S. history | 2014 | † | R3 | † | † | R3 | †
U.S. history | 2010 | R3 | R3 | R3 | R3 | R3 | R3
U.S. history | 2006 | R3 | R3 | R3 | R3 | R3 | R3
U.S. history | 2001 | R3 | R3 | R3 | R3 | R3 | R3
Writing | 2011 | — | R3 | R3 | — | — | —
Writing | 2007 | † | R3 | R3 | † | R3 | R3
Writing | 2002 | R3 | R3 | R3 | R3 | R3 | R3
Subject | Year | Within-year reliability: Age 9 | Within-year reliability: Age 13 | Within-year reliability: Age 17 | Cross-year reliability: Age 9 | Cross-year reliability: Age 13 | Cross-year reliability: Age 17
---|---|---|---|---|---|---|---
Mathematics long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Mathematics long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
Reading long-term trend | 2012 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend | 2008 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend | 2004 | R3 | R3 | R3 | R3 | R3 | R3
Reading long-term trend bridge | 2004 | R2 | R2 | R2 | † | † | †
† Not applicable.
NOTE: R2 is the non-accommodated reporting sample; R3 is the accommodated reporting sample. If sampled students are classified as students with disabilities (SD) or English learners (EL), and school officials, using NAEP guidelines, determine that they can meaningfully participate in the NAEP assessment with accommodation, those students are included in the NAEP assessment with accommodation, along with other sampled students, including SD/EL students who do not need accommodations. The R3 sample is more inclusive than the R2 sample and excludes a smaller proportion of sampled students; it is the only reporting sample used in NAEP after 2001. The R2 sample was used as the bridge sample in the 2004 bridge studies to examine the comparability of scoring based on an assessment sample similar to those used for long-term trend (LTT) assessments in 2001 and earlier years.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2004, 2008, and 2012 Mathematics and Reading Long-Term Trend Assessments.