
NAEP Technical Documentation

The Response Data File

The response data files contain all of the collected and derived school- and student-based data variables in a form appropriate for analysis. Following are descriptions of the file organization and the definition of special response codes.

File Organization

     Student Data

Each student data file in the 1990 and later NAEP data products contains the response data and sample weights for the students within a single sampling frame of an assessment component. In general, a sampling frame is defined by subject area, age/grade cohort, and type of administration. For example, the 1990 assessments had three components: national, state, and long-term trend.

The national component assessed mathematics, science, and writing for three age/grade cohorts. In addition, one part of the mathematics assessment was conducted under different administration conditions than the others, requiring a separate sampling frame and a separate data file at each cohort. The 1990 national data product therefore contained 12 student data files: four at each of the three cohorts.

The state component assessed mathematics in only one grade/age cohort, but was administered in 40 participating jurisdictions, each with its own sampling frame. Therefore, the 1990 state data product contained 40 student data files, all with identical structures or formats. This arrangement encouraged proper analysis of these data, one jurisdiction at a time, yet permitted combination of two or more files for the purposes of comparison among jurisdictions.

The long-term trend component assessed reading and writing in one sampling frame and mathematics and science in another. The administration of the mathematics and science assessments required separate sessions for each instrument, further subdividing the sampling frame. The 1990 long-term trend data product contained four student data files for two age/grade cohorts and three for the other.
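Because the state files share one layout, stacking two or more of them for a comparison among jurisdictions is mechanical. The following Python sketch is illustrative only. It assumes the files have already been exported to CSV with a header row; the function name `stack_jurisdictions`, the added `JURISDICTION` column, and the sample column names are hypothetical and are not part of the NAEP data products.

```python
import csv
import io


def stack_jurisdictions(files):
    """Stack identically structured student files into one list of rows.

    `files` maps a jurisdiction label to an open text file (assumed CSV).
    Each output row is tagged with the jurisdiction it came from.
    """
    combined = []
    expected_fields = None
    for jurisdiction, handle in files.items():
        reader = csv.DictReader(handle)
        if expected_fields is None:
            expected_fields = reader.fieldnames
        elif reader.fieldnames != expected_fields:
            # The files are only safe to combine if their layouts match.
            raise ValueError(f"{jurisdiction}: layout differs from the first file")
        for row in reader:
            row["JURISDICTION"] = jurisdiction
            combined.append(row)
    return combined


# Hypothetical two-jurisdiction example with made-up values.
file_a = io.StringIO("STUDENT_ID,WEIGHT\n1,10.5\n2,8.0\n")
file_b = io.StringIO("STUDENT_ID,WEIGHT\n1,12.0\n")
rows = stack_jurisdictions({"A": file_a, "B": file_b})
```

Keeping the jurisdiction label on every row preserves the option of analyzing one jurisdiction at a time after the files are combined.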

Beginning with the 2002 assessment, a combined sample of public schools was selected for both the state and national assessments. Therefore, the national sample was a subset of this combined sample of students assessed in each participating state, plus an additional sample from the states that did not participate in the state assessment. At grade 12, the sample was chosen using a stratified two-stage design that involved samples of students from selected public and nonpublic schools across the country.

     Excluded Student Data

From 1990 through 1994, a special questionnaire was filled out by a school administrator or teacher for each student who was considered unable to participate in the assessment due to a disability or limited English proficiency. In the national and long-term trend components, the data for these students were pooled across subject areas and their weights adjusted to represent that subpopulation when combined with any of the assessed student files within the cohort. In 1996, a new questionnaire was administered for all students with disabilities (SD) or limited-English-proficient (LEP) students whether they participated in the assessment or not. In the 1996 and later data products, the excluded students are combined with the assessed students and the data from the SD/LEP questionnaire are appended for the excluded students and the appropriate assessed students.

     Teacher Data

Teachers of students participating in NAEP complete a questionnaire concerning themselves and their teaching practices. The purpose of these data is to report teacher characteristics as related to the outcomes (item responses) of the students they teach. The teacher data are appended to the data collected for their students participating in NAEP assessments.

     School Data

The schools in the NAEP assessments are part of the stratified sampling frame. Any data collected at the school level can be reported, with appropriate weighting, at the school level as well as at the student level. Within a given cohort, the sampling frames for subject areas can overlap, and many schools participate in two or more subject area assessments. There are different sets of sample weights for each sampling frame. For purposes of economy in the analysis of NAEP data, there is only one school data file for each age/grade cohort in the national and long-term trend components. Each school data file can properly link with all of the student data files in that component and cohort for the purpose of relating student outcomes to school characteristics.
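Linking a school file to a student file amounts to a many-to-one lookup on a shared school identifier. The sketch below is a minimal illustration under stated assumptions: the key name `SCHOOL_ID`, the `SCH_` prefix, and the function `attach_school_data` are hypothetical, and records are represented as plain dictionaries rather than in any NAEP file format.

```python
def attach_school_data(students, schools, key="SCHOOL_ID"):
    """Append school variables to each student record.

    School variables are prefixed with 'SCH_' so that a school-level
    weight never overwrites a student-level weight of the same name.
    """
    school_by_id = {school[key]: school for school in schools}
    linked = []
    for student in students:
        school = school_by_id.get(student[key])
        if school is None:
            raise KeyError(f"no school record for {key}={student[key]!r}")
        merged = dict(student)
        for name, value in school.items():
            if name != key:
                merged[f"SCH_{name}"] = value
        linked.append(merged)
    return linked


# Hypothetical records with made-up values.
schools = [{"SCHOOL_ID": "001", "WEIGHT": 3.0, "ENROLLMENT": 450}]
students = [
    {"STUDENT_ID": "A1", "SCHOOL_ID": "001", "WEIGHT": 25.0},
    {"STUDENT_ID": "A2", "SCHOOL_ID": "001", "WEIGHT": 27.5},
]
linked = attach_school_data(students, schools)
```

The prefix is a design choice prompted by the text above: because each sampling frame has its own set of weights, student and school weights need to stay distinguishable after the link.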

Data Variables

The variables on all data files are grouped and arranged in the following order:

  1. identification information,
  2. weights and sampling information,
  3. derived variables,
  4. scale scores (where applicable), and
  5. response data.

The identification information is obtained from the front covers of the instruments or from the administration rosters. School or student names are not part of the data that can be matched. The weight data include sample descriptors, selection probabilities, nonresponse adjustments, and replicate weights for the estimation of sampling error. The scale scores are derived from student responses to the NAEP cognitive assessment items and typically expressed on scales of 0–500 (geography, mathematics, reading, and U.S. history) or 0–300 (civics, science, and writing). The derived data include sample descriptions from other sources and variables that are derived from the response data for use in analysis or reporting.
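As an illustration of how replicate weights can be used to estimate sampling error, the sketch below recomputes a weighted mean once per replicate weight and accumulates the squared differences from the full-sample estimate. This page does not state the variance formula, so the jackknife-style sum of squared differences used here is an assumption, as are the function names and the toy values.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_standard_error(values, full_weights, replicate_weights):
    """Return (estimate, standard error) from full and replicate weights.

    Assumed formula: variance = sum over replicates of
    (replicate estimate - full-sample estimate) ** 2.
    """
    estimate = weighted_mean(values, full_weights)
    variance = sum(
        (weighted_mean(values, weights) - estimate) ** 2
        for weights in replicate_weights
    )
    return estimate, math.sqrt(variance)


# Toy data: two students and two replicate weight sets (made-up values).
scores = [200.0, 300.0]
full = [1.0, 1.0]
replicates = [[2.0, 0.0], [0.0, 2.0]]
estimate, std_error = replicate_standard_error(scores, full, replicates)
```

The same pattern applies to any statistic, not only a mean: compute it once with the full-sample weight and once with each replicate weight.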

On the student data files, the response data are arranged in the following order:

  1. background questions,
  2. cognitive item responses,
  3. teacher questions (where collected), and
  4. SD/LEP questions (1996 and later).

The background data include responses to general and subject-related questions. The item-response data within each block of questions are left in their order of presentation. The responses to cognitive blocks that are not present in a given booklet are left blank, signifying a condition of "missing by design."
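When reading item responses, a blank field needs to be kept distinct from any numeric code, since it means the item was never presented to the student. A minimal sketch, assuming each response arrives as a fixed-width text field (the function names are hypothetical):

```python
def parse_item_field(raw):
    """Convert one raw response field to an int, or None if blank.

    A blank field means the item's block was not in the student's
    booklet ("missing by design"), so it is not treated as a response.
    """
    text = raw.strip()
    if text == "":
        return None
    return int(text)


def administered_responses(raw_fields):
    """Keep only the responses to items that were actually presented."""
    parsed = (parse_item_field(field) for field in raw_fields)
    return [value for value in parsed if value is not None]


# Made-up record: the third and fourth items were not in the booklet.
fields = ["3", "1", " ", "  ", "8"]
presented = administered_responses(fields)
```

Returning `None` rather than a number keeps by-design blanks from being counted in denominators when response percentages are computed.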

Data Definition

Nearly all of the data variables in the NAEP data products are coded in numeric format to facilitate analysis and reporting in statistical packages and procedural languages. Each numeric variable on the data files is classified as either continuous or discrete. The continuous variables include the weights, scale scores, identification codes, and questionnaire responses where counts or percentages were requested. The discrete variables include those items for which each numeric value corresponds to a response or score category. The designation of "discrete" also includes those derived variables to which numeric classification categories are assigned.

The NAEP database contains special codes to indicate certain response conditions: "I don't know" responses, multiple responses, omitted responses, not-reached responses, and unresolvable responses, which include out-of-range responses and responses that were missing due to errors in printing or processing. The scoring guides for the constructed-response items include additional special codes for ratings of "illegible," "off task," or nonrateable by the scorers. All of these codes were converted to a consistent numeric format. The following convention is used in the designation of these codes:

Special response codes for assessments: 2000, 2001, 2002, and 2003

Code (width=1)   Code (width=2)   Definition
5                55               Illegible (constructed-response items)
6                66               Off-task (constructed-response items)
7                77               I don't know (multiple-choice items)
8                88               Omitted
9                99               Not reached
0                00               Multiple response (multiple-choice items)
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP).

For multiple-choice items that have seven or more valid response options, or the "I don't know" response, and for those constructed-response items whose scoring guides have five or more categories, these data fields are expanded to accommodate the valid response values and the special codes. In these cases, the special codes are "extended" to fill the output data field: The "I don't know" and nonrateable codes are extended from 7 to 77, the omitted response codes are extended from 8 to 88, and so on.
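The coding convention in the table can be applied mechanically once the field width and item type are known. The sketch below is one possible reading of the table, not an official decoder: the function name, the `"mc"` and `"cr"` item-type labels, and the return format are hypothetical. It follows the table strictly, so codes 5 and 6 are treated as special only for constructed-response items and codes 7 and 0 only for multiple-choice items.

```python
# Special codes from the table: digit -> (definition, item types it applies to).
SPECIAL_CODES = {
    5: ("Illegible", {"cr"}),
    6: ("Off-task", {"cr"}),
    7: ("I don't know", {"mc"}),
    8: ("Omitted", {"mc", "cr"}),
    9: ("Not reached", {"mc", "cr"}),
    0: ("Multiple response", {"mc"}),
}


def classify_response(raw, width, item_type):
    """Classify a raw field as blank, a special code, or a valid value.

    `width` is the field width (1 or 2); `item_type` is "mc" or "cr".
    In a width-2 field a special code is its digit repeated (7 -> 77).
    """
    if width not in (1, 2):
        raise ValueError("only field widths 1 and 2 are covered by the table")
    text = raw.strip()
    if text == "":
        return ("blank", None)
    value = int(text)
    if width == 1:
        digit = value
    elif value % 11 == 0:
        digit = value // 11  # 55 -> 5, 77 -> 7, 00 -> 0
    else:
        digit = None  # e.g. 07 or 12: an ordinary two-digit value
    if digit in SPECIAL_CODES:
        label, applies_to = SPECIAL_CODES[digit]
        if item_type in applies_to:
            return ("special", label)
    return ("valid", value)


narrow = classify_response("7", 1, "mc")
wide = classify_response("77", 2, "mc")
seventh_option = classify_response("07", 2, "mc")
```

In the expanded field, a stored 7 is read as the seventh valid response option, while 77 carries the "I don't know" meaning that 7 has in a one-character field.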


Last updated 06 March 2009 (RF)
