NAEP Technical Documentation: t statistics

A goal in scoring is consistency in the scores assigned to the same responses by different raters, whether within the same year or across different assessment years. Statistical flags are used to identify items for which scoring is not consistent. To support this, the scoring system calculates t statistics that compare the scores assigned to item responses that have been rescored at different points in the scoring process.

To calculate a t statistic, the scoring supervisor executes a command in the report window and is prompted for the item, the application (the purpose for which the t statistic is being computed), and the scoring group to which the item is assigned. The system then displays the results, which can be printed. The results are based only on responses that have two scores, both of them on task. The display shows the number of scores compared, the number of scores with exact agreement, the percent of scores with exact agreement, the mean of the scores assigned during the scoring process for previous assessment years, the mean of the currently assigned scores, the mean difference, the variance of the mean difference, the standard error of the mean difference, and the estimate of the t statistic. The formulas used are as follows (a code sketch pulling them together appears after the list):

  • Dbar = Mean Score 2 - Mean Score 1, where

    Dbar is the mean difference,
    Mean Score 1 is the mean of all scores assigned by the first rater, and
    Mean Score 2 is the mean of all scores assigned by the second rater.

  • DiffDbarsq = ((Score 2 - Score 1) - Dbar)^2, where

    DiffDbarsq is calculated for each score comparison.

  • VarDbar = (sum(DiffDbarsq))/(N-1), where

    VarDbar is the variance of the mean difference.

  • SEDbar = SQRT (VarDbar/N), where

    SEDbar is the standard error of the mean difference, and
    N is the number of responses with two scores assigned by two different raters.

  • Percent Exact Agreement = 100 × (number of responses with identical scores)/(total number of double-scored responses being compared), where

    Exact Agreement is a response with identical scores assigned by two different raters.

  • T = Dbar/SEDbar
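
The sketch below pulls these formulas together in Python. It is illustrative only: the function name and input format are assumptions, not part of the NAEP scoring system. It presumes two parallel lists holding the first and second scores assigned to the same responses.

    import math

    def paired_t_statistic(scores1, scores2):
        # Hypothetical helper, not part of the NAEP system: computes the
        # t statistic and Percent Exact Agreement defined above from two
        # parallel lists of scores assigned to the same responses by two
        # different raters (scores are ordered categories 0 .. k-1, as
        # noted below).
        n = len(scores1)
        if n != len(scores2) or n < 2:
            raise ValueError("need two equal-length lists with at least two scores")

        diffs = [s2 - s1 for s1, s2 in zip(scores1, scores2)]

        # Dbar = Mean Score 2 - Mean Score 1, i.e., the mean difference.
        dbar = sum(diffs) / n

        # VarDbar = sum(DiffDbarsq) / (N - 1).
        var_dbar = sum((d - dbar) ** 2 for d in diffs) / (n - 1)

        # SEDbar = SQRT(VarDbar / N).
        se_dbar = math.sqrt(var_dbar / n)

        # Percent Exact Agreement: identical scores / double-scored responses.
        pct_exact = 100 * sum(s1 == s2 for s1, s2 in zip(scores1, scores2)) / n

        # T = Dbar / SEDbar (raises ZeroDivisionError if every difference
        # is identical, since the variance is then zero).
        return dbar / se_dbar, pct_exact

For instance, paired_t_statistic([0, 1, 2, 1, 0], [0, 2, 2, 1, 1]) returns t ≈ 1.63 and 60 percent exact agreement, so that item would be flagged under the ±1.5 criterion described below.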

For purposes of calculation, the possible scores for a response to an item are treated as ordered categories, numbered from 0 up to one less than the number of score categories for the item.

The estimate of the t statistic is acceptable if it falls within the range of -1.5 to 1.5. The range of ±1.5 was selected because a single criterion was required for all items, regardless of the number of responses with scores being compared. As the number of compared responses grows large, a criterion of 1.5 means that about 15 percent of the differences would be judged not acceptable by the test when they should have been acceptable. If the estimate of the t statistic fell outside that range, raters were asked to stop scoring so that the situation could be assessed by the trainer and the scoring supervisor. Scoring resumed only after the trainer and scoring supervisor had determined a plan of action to rectify the differences in scores. Usually, responses to the item were discussed with the raters, or the raters were retrained, before scoring continued.
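
Under the large-sample normal approximation implicit in that statement, the chance that a consistently scored item is flagged anyway is the two-sided tail probability of a standard normal beyond ±1.5. A one-line check (a sketch, not part of the NAEP system) puts this at roughly 13 percent, in line with the figure quoted above:

    import math

    # Two-sided tail probability of a standard normal beyond +/-1.5:
    # P(|Z| > 1.5) = erfc(1.5 / sqrt(2))
    p_flagged = math.erfc(1.5 / math.sqrt(2))
    print(f"{100 * p_flagged:.1f}% of consistent items flagged by chance")  # 13.4%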


Last updated 21 February 2009 (RF)
