NAEP Technical Documentation

Results of NAEP Differential Item Functioning (DIF) Analysis for the Writing Main Assessment in 2007

In standard differential item functioning (DIF) analyses such as Mantel-Haenszel (Mantel 1963) and SIBTEST (Shealy and Stout 1993), it is well established that a moderately long matching test is required for the procedures to be valid (i.e., to identify DIF in items unconfounded by other irrelevant factors [Donoghue, Holland, and Thayer 1993]). In the 2002 and 2007 writing assessments, the booklets contained two 25-minute blocks with one writing item per block. Therefore, each student had at most two responses to six-category prompts. This was too little information for the test statistics associated with the Mantel or SIBTEST procedures to function effectively.

In the writing assessment, the standardization method of Dorans and Kulick (1986) was used instead to produce descriptive statistics. The matching variable was the total score on the booklet. As in other NAEP DIF analyses, the statistics were computed based on pooled booklet matching; that is, the results were accumulated over the booklets in which a given item appeared (e.g., Allen and Donoghue 1996). This analysis was accomplished using the standard NAEP DIF program NDIF. The statistic of interest appears under the label SMD for "standardized mean DIF." (First, differences in the mean item score between the two comparison groups were calculated for each level of the booklet score. Then, the standardized mean DIF for the item was the average of these differences divided by their standard deviation.)
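The per-level comparison described above can be illustrated with a minimal Python sketch. It is not the NDIF program, whose internals are not given here. The function name `standardized_mean_difference`, the record layout, and the group labels are hypothetical. The sketch also assumes the weighting commonly used with the standardization method, in which each score level's difference is weighted by the number of focal-group students at that level; the text above says only that the differences are averaged. Pooling across booklets and sampling weights are left out.

```python
from collections import defaultdict


def standardized_mean_difference(records, focal="focal", reference="reference"):
    """Focal-weighted average of per-level mean item score differences.

    records: iterable of (group, matching_score, item_score) tuples.
    The group labels and the record layout are illustrative assumptions.
    """
    # Collect item scores at each matching-score level, separately per group.
    by_level = defaultdict(lambda: {focal: [], reference: []})
    for group, level, item_score in records:
        if group in (focal, reference):
            by_level[level][group].append(item_score)

    weighted_sum = 0.0
    n_focal_used = 0
    for groups in by_level.values():
        f_scores, r_scores = groups[focal], groups[reference]
        if not f_scores or not r_scores:
            continue  # no comparison is possible at this level
        # Difference in mean item score (focal minus reference) at this level.
        diff = sum(f_scores) / len(f_scores) - sum(r_scores) / len(r_scores)
        # Assumed weighting: number of focal-group students at this level.
        weighted_sum += len(f_scores) * diff
        n_focal_used += len(f_scores)

    if n_focal_used == 0:
        raise ValueError("no score level contains both groups")
    return weighted_sum / n_focal_used
```

A negative value means that focal-group students scored lower on the item, on average, than reference-group students with the same booklet score.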

Significance testing was not performed, due to the low reliability of the matching variable. Instead, the standardized mean difference values were used descriptively to identify those items that demonstrated the most evidence of DIF. A rough criterion used in the past to describe DIF for polytomous items has been to form the ratio of the SMD to the item's standard deviation and to flag any item with a ratio of at least 0.25. In the writing data, no items approached that level.
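The flagging rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysts' procedure: the function name `flag_item` is hypothetical, the item's standard deviation is taken as the population standard deviation of all students' item scores, and the ratio is compared with 0.25 in absolute value so that DIF in either direction is flagged.

```python
import statistics


def flag_item(smd, item_scores, threshold=0.25):
    """Return (ratio, flagged) for one item.

    smd: standardized mean difference already computed for the item.
    item_scores: item scores for all students (assumed: both groups combined).
    """
    sd = statistics.pstdev(item_scores)  # assumed choice of standard deviation
    if sd == 0:
        raise ValueError("item scores have zero variance")
    ratio = smd / sd
    # Assumption: the criterion applies to the magnitude of the ratio.
    return ratio, abs(ratio) >= threshold
```

For example, with scores spread evenly over a six-category prompt (standard deviation of about 1.71), an SMD of 0.1 gives a ratio of about 0.06, well below the threshold.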

Using this criterion, data analysts found no 2002 or 2007 writing items indicating DIF.

Last updated 23 June 2010 (GF)
