Design/Feasibility Team

Report to the National Assessment Governing Board

July 1, 1996

Robert Forsyth, University of Iowa
Ronald Hambleton, University of Massachusetts
Robert Linn, University of Colorado
Robert Mislevy, Educational Testing Service
Wendy Yen, CTB/McGraw-Hill

For their helpful discussions and useful information, we wish to express our gratitude to the National Assessment Governing Board and staff, the National Center for Education Statistics, Educational Testing Service, and Mathtech. Keith Rust and Ben King provided invaluable input on sampling considerations.

 

Contents

 

Executive Summary

The redesign of the National Assessment of Educational Progress (NAEP) has been the focus of extensive deliberations by the National Assessment Governing Board (NAGB) during the past year. As part of those deliberations, NAGB has developed a paper called "Themes and Issues" in which the Board has identified some critical objectives of NAEP and recommended a number of characteristics to be achieved in the redesign. The Design Feasibility Team (DFT) was formed by NAGB to lay out technical implications for the design, analysis, and reporting of NAEP that are implicit in the Board’s Themes and Issues.

The report of the DFT is intended to provide a bridge between the general desired characteristics and priorities for NAEP expressed in the Themes and Issues, on the one hand, and the specifics of a Request for Proposals (RFP) and the actual detailed designs developed by respondents to the RFP, on the other. The report does not detail a specific design. Indeed, the DFT believes it would be presumptuous to do so at this stage and potentially counterproductive because there are many possible approaches to achieving the various objectives of the redesign. Rather than rushing to premature closure on the details of design, the report lays out the trade-offs that need to be evaluated in considering alternative designs and identifies criteria that may be used in judging the quality of alternatives in terms of the objectives of NAGB’s Themes and Issues.

Before turning to a discussion of the specifics of the Themes and Issues and their implications for design, analysis, and reporting, it is important to understand the context in which NAEP operates. Thus, after a brief introduction, Section 2 of the DFT report begins with a discussion of the key features of NAEP (e.g., a large-scale, nationally-representative, cross-sectional survey using standardized tasks not directly tied to specific instruction or curriculum, conducted in a brief period of time). Those characteristics make NAEP good for some purposes (e.g., monitoring trends) but not others (e.g., making causal inferences). It is argued that a key objective in rethinking NAEP is to focus resources on the range of missions that a survey with its evidentiary characteristics does well, and to minimize its use for missions it does not.

Section 3 of the report addresses management and administration issues that impact the cost, dependability, and timeliness of NAEP. Several key ideas are identified that may be used as criteria in judging alternative designs. Notable among these are the need to think in terms of global optimization of the whole NAEP system rather than only local optimization of each component, the concept of the critical path, and the need to monitor variation in the results due to factors such as sampling error, changes in assessment tasks, and changes in procedures. Keeping the activities that are on the critical path to a minimum is seen as key to achieving the goals of simplification and faster reporting. Four implications of the discussion of management and administration issues are identified: (1) overarching priorities need to be specified to keep local optimization from subverting the larger goal of global optimization; (2) a modular design that identifies a "core NAEP" for tracking trends and rapid reporting as well as modules for special purposes is needed for an acceptably simple and efficient critical path while maintaining the richness of assessment expected of NAEP; (3) phased analysis and reporting is needed; and (4) changes need to be phased in.

Although a desirable narrowing of demands on NAEP can be achieved by restricting attention to the kinds of information that can be provided effectively by a large-scale survey such as NAEP, there is still considerable leeway in setting priorities. It is not possible to "have it all;" trade-offs must be made. Some of the trade-offs that need to be weighed in the redesign are discussed in Section 4 of the DFT report. The Themes and Issues provide general guidance that will be helpful in setting priorities and evaluating the necessary trade-offs. The DFT discussion of the Themes and Issues should provide some additional basis for evaluating those trade-offs.

In planning the redesign it is useful to have an understanding of how we got to where we are in NAEP. Toward this end, Section 5 of the DFT report provides a brief, selective history of NAEP, thus setting the stage for the main section of the report—the detailed discussion of the Themes and Issues in Section 6.

Trade-offs are elaborated in Section 6 and approaches to meeting the objectives of the Board’s Themes and Issues are discussed. For example, trade-offs among three approaches to combining state and national samples are evaluated. Three variations of an approach to reporting, called marketbasket reporting, that the DFT believes will help in meeting several of NAGB’s key objectives are elaborated. Some potential simplifications in analyses are identified, especially with regard to the core NAEP to be used for rapid reporting and tracking trends. It is noted, however, that complex analyses (done once) do not, in and of themselves, preclude rapid turnaround. Indeed, the bottlenecks appear to be not so much complex analyses as time spent in (1) report review and revision and (2) rework, both of which can be reduced through system redesign.

The final section of the report sketches a feasible configuration for NAEP that incorporates the objectives specified in the Themes and Issues. This configuration includes a modular design, the use of a marketbasket for reporting, phased analysis and release of reports, and previously-proven analyses on the critical path for the core NAEP results. It is not the only configuration possible, but it is presented as an example of a design that appears to address NAGB’s Themes and Issues. The team hopes that NAGB will find the discussions of technical issues and design tradeoffs of some assistance in their consideration of alternative configurations.

Design/Feasibility Team
Mission Statement

 

The Design/Feasibility Team shall provide advice on the technical feasibility, the necessary components, and costs of implementing a National Assessment of Educational Progress based on the policy themes and ideas being drafted by the National Assessment Governing Board. Specifically, the charge to the design/feasibility team is a threefold one.

First, using the Board’s themes and ideas as articulated in the preliminary policy paper and Board policies now in place, the design/feasibility team should identify the necessary components of a design that would fully embody such themes and ideas. The question to be addressed is: "How can these policy directions best be operationalized in a large-scale assessment?" The necessary components so identified may form the bases for the specifications of the Request For Proposals to potential contractors. In developing the necessary components the Design/Feasibility Team may want to propose various options. If this is the case, priorities and trade-offs among options would also be identified.

Second, the Design/Feasibility Team should examine the necessary components of the resulting design for both intended and unintended consequences. The focus here is to ask the question, "If all these moving parts are put in motion, what will be the effect?" The design/feasibility team should plan to provide empirical evidence where possible to support their conclusions. The design/feasibility team will be advised by one or more financial consultants.

Third, the charge to the Design/Feasibility Team is to identify those areas in the design which appear not to be feasible for the National Assessment operation over the next 5-year and 10-year periods, or those which might result in ultimate deleterious effects on the NAEP program.

The Design/Feasibility Team shall complete its report no later than June 30, 1996. A status report shall be made to the Board at its May meeting, and the final recommendations will be presented to the Board in the form of a Design/Feasibility Team Report.

1.0 Introduction And Overview

 

The role of the Design/Feasibility Team (DFT) is to lay out technical implications for NAEP design, analysis, and reporting that are implicit in the National Assessment Governing Board "Themes & Issues" (Table 1-1 gives its key points). We will sketch a configuration that moves NAEP in the directions outlined therein. It would be presumptuous for us to detail a specific design, since other effective ideas may not yet have surfaced, or even been conceived. Such ideas, and the wherewithal to craft them into a detailed plan, will emerge through competition for the contracts or supporting grants, as multiple organizations devote substantial time and talent to win the project. But we, the DFT members, have had the opportunity to experience firsthand what has worked well and what has not in several large-scale assessments, including NAEP itself; and we have gained some insights into why this may be so. We will point out trade-offs and implications that are not always apparent on the surface, and highlight issues that will have to be addressed in any specific design proposal. We will sketch design components which, in concert, can move NAEP in the directions that the NAGB Themes and Issues propose.

We envisage a core national assessment, administered on a predictable schedule, which focuses on those things that a large-scale, cross-sectional, nationally-representative survey can, by its nature, do well. For this core, analysis and reporting can be accomplished more quickly, efficiently, and reliably than under the current NAEP configuration. Modular design would facilitate integrating this core with other NAEP components, such as state assessments, new and more varied tasks, and auxiliary information such as teacher surveys—but none of these would appear on the critical path to initial time-series reports. This modularity would facilitate the use of NAEP linkages with extra-NAEP studies that provide kinds of information that large-scale cross-sectional surveys cannot. These would include longitudinal surveys, program evaluations, state and local testing programs, and research studies of classroom practices and student learning. Changes in the core would be phased in over multiple time points, as their worth and feasibility are demonstrated and interest proves enduring.

The next two sections of this report cut across the particulars of NAEP designs, no matter how purposes and tradeoffs are resolved.

Section 2 addresses the missions for which large-scale assessments like NAEP, by their very nature, are and are not well-suited. The key idea is to focus NAEP’s efforts on what it can do well. We will note in passing, though, that leeway remains within these possibilities for where to focus attention—that is, for specifying the purposes that have the highest priorities. Tradeoffs arise because different purposes are better served by different assessment configurations.

Section 3 addresses management and administration issues that impact the cost, reliability, and timeliness of NAEP. The key idea is organizing NAEP activities to eliminate bottlenecks and inefficiencies.

Sections 4 and 5 provide additional background specific to NAEP. Section 4 discusses the issue of design tradeoffs, and Section 5 reviews how some of these tradeoffs have been decided over the years in NAEP, as reflected in design elements and their expected and unexpected consequences.

Sections 6 and 7 address redesign issues directly. Building on the preceding sections and on experience with NAEP and other large-scale assessments, Section 6 comments individually on the NAGB Themes and Issues in greater detail. Section 7 sketches a feasible configuration for NAEP that incorporates the Themes and Issues. Alternatives that reflect tradeoffs among competing purposes and values are noted.

Table 1-1
NAGB Themes and Issues’ Objectives, Sub-Objectives, and Recommendations

OBJECTIVE 1: To measure national and state progress toward the third National Education Goal and provide timely, fair, and accurate data about student achievement at the national level, among the states, and in comparison with other nations.

A. Test all subjects specified by Congress: reading, writing, mathematics, science, history, geography, civics, the arts, foreign language, and economics.

o The National Assessment should be conducted annually;

o Reading, writing, mathematics, and science should be given priority, with testing in these subjects conducted according to a publicly released 10-year schedule adopted by the NAGB;

o History, geography, the arts, civics, foreign language, and economics also should be tested on a reliable basis according to a publicly released schedule adopted by NAGB.

B. Vary the amount of detail in testing and in reporting results.

o National Assessment testing and reporting should vary, using standard report cards most frequently and comprehensive reporting in selected subjects about every ten years;

o National Assessment results should be timely, with the goal being to release results within 6 months of the completion of testing.

C. Simplify the National Assessment design.

o Options should be identified to simplify the design of the National Assessment and reduce reliance on conditioning, plausible values, and imputation to estimate group scores.

D. Simplify the way the National Assessment reports trends in student achievement.

o A carefully planned transition should be developed to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program;

o As a part of the transition, NAGB will review the tests now used to monitor long-term trends in reading, writing, mathematics, and science to determine whether and how they might be used now that new tests and performance standards have been developed during the 1990’s for the main National Assessment. NAGB will decide how to continue the present long-term trend assessments, how often they would be used, and how the results would be reported.

E. Use performance standards to report whether student achievement is "good enough."

o The National Assessment should continue to report student achievement results based on performance standards.

F. Use international comparisons.

o NAEP test frameworks, test specifications, achievement levels, and data interpretations should take into account, where feasible, curricula, standards, and student performance in other nations;

o The National Assessment should promote "linking" studies with international assessments.

G. Emphasize reporting for grades 4, 8, and 12.

o The National Assessment should continue to test in and report results for grades 4, 8, and 12; however, in selected subjects, one or more of these grades may not be tested;

o Age-based testing and reporting should continue only to the extent necessary for international comparisons and for long-term trends, should NAGB decide to continue long-term trends in their current form;

o Grade 12 results should be accompanied by clear, highlighted statements about school and student participation, student motivation, and cautions, where appropriate, about interpreting 12th grade achievement results;

o The National Assessment should work to improve school and student participation rates and student motivations at grade 12.

H. National Assessment results for states.

o National Assessment state-level assessments should be conducted on a reliable, predictable schedule according to a 10-year plan adopted by NAGB;

o Reading, writing, mathematics, and science at grades 4 and 8 should be given priority for state-level testing;

o Testing in other subjects and at grade 12 should be permitted at state option and cost;

o Where possible, national results should be estimated from state samples in order to reduce burden on states, increase efficiency, and save costs.

I. Use innovations in measurement and reporting.

o The National Assessment should assess the merits of advances related to technology and the measurement and reporting of student achievement;

o Where warranted, the National Assessment should implement such advances in order to reduce costs and/or improve test administration, measurement, and reporting.

OBJECTIVE 2: To develop, through a national consensus, sound assessments to measure what students know and can do as well as what students should know and be able to do.

A. Keep test frameworks and specifications stable.

o Test frameworks and test specifications developed for the National Assessment generally should remain stable for at least ten years;

o To ensure that trend results can be reported, the pool of test questions developed in each subject for the NAEP should provide a stable measure of student performance for at least ten years;

o In rare circumstances, such as where significant changes in curricula have occurred, the Governing Board may consider making changes to test frameworks and specifications before ten years have elapsed;

o In developing new test frameworks and specifications, or in making major alterations to approved frameworks and specifications, the cost of the resulting assessment should be estimated. The Governing Board will consider the effect of that cost on the ability to test other subjects before approving a proposed test framework and/or specifications.

B. Use an appropriate mix of multiple-choice and ‘performance’ questions.

o Both multiple-choice and performance items should continue to be used in the NAEP;

o In developing new test frameworks, specifications, and questions, decisions about the appropriate mix of multiple-choice and performance items should take into account the nature of the subject, the range of skills to be assessed, and cost.

OBJECTIVE 3: To help states and others link their assessments with the National Assessment and use National Assessment data to improve educational performance.

o The National Assessment should develop policies, practices, and procedures that enable states, school districts, and others who want to do so at their own cost, to conduct studies to link their test results to the National Assessment;

o The National Assessment should be designed so that others may access and use National Assessment test data and background information;

o The National Assessment should employ safeguards to protect the integrity of the National Assessment program, prevent misuse of data, and ensure the privacy of individual test takers.

2.0 What NAEP Can and Cannot Do

 

Historically, NAEP has exhibited the following characteristics: It is a large-scale, nationally-representative, cross-sectional survey. It is timed, standardized, and, from the students’, teachers’, and school administrators’ points of view, low-stakes. It is not directly connected with students’ instruction, in that the tasks they are administered have not been selected in light of what they have been working on in their classes, and students receive no feedback on how they have done. Developing the NAEP content framework is a national consensus process. Since the early 1990’s, it is also the case that developing achievement level standards is a national consensus process. This section discusses the kinds of missions that an assessment possessing these properties is well-suited to support, describes the kinds it is not well-suited to support, no matter how elegantly designed and skillfully executed, and notes some broad implications for NAEP design. Additional discussion related to specific Themes and Issues will appear in Section 6.

2.1 Missions for Which NAEP Is Well-Suited

NAEP is virtually unique as a time series of large, nationally-representative samples of students. As such, it provides information about both the status of achievement in the nation as a whole at each time point, and about changes over time. It has thus played a central role in debates about the trends in American achievement over time (e.g., Koretz, 1992a). Other widely-used tests cannot serve this function. The Scholastic Assessment Test (SAT) or American College Test (ACT) cannot play this role, despite the large number of students involved, because their student samples are self-selected. Commercial achievement tests used by states and districts are not directly comparable. Longitudinal studies such as the National Educational Longitudinal Studies (NELS) and High School and Beyond (HSB) do not collect comparable information about successive cohorts of students at regular intervals. Thus, as the National Academy of Education (1993) has argued, NAEP’s capability to track student achievement over time is one of its unique and most precious features.

Of course, tracking trends in achievement over time requires gathering comparable data about students’ achievement. A tension is thus introduced between, on the one hand, maximizing measurement of change by comparing performance on a given collection of tasks in one assessment with performance on the same collection in succeeding years; and, on the other hand, revising the collection of tasks in succeeding years in order to reflect changing belief about what knowledge and skills are important to assess.

Also notable is the national notice NAEP can command for its methodologies and the focus of its attention (e.g., through achievement level reporting). NAEP has historically been a source of innovation for assessment methodology, and a forum for discussion about the topics and skills schooling should address. Moreover, the form and content of a national assessment communicate volumes of information in and of themselves to many audiences, before any data about students are even collected. Indeed, a core NAEP mission is "to develop, through a national consensus, sound assessments to measure what students know and can do as well [as] what students should know and be able to do." Thus, it is critically important for the National Assessment to reflect the breadth and richness of valued content and processes. NAEP is thus in a position to contribute strongly to national discussion about what is important for students to learn, and to establish frameworks for these discussions that could extend to other studies and other purposes beyond NAEP itself. One of the missions of NAEP is "to help states and others link their assessments with the National Assessment," and there are useful ways NAEP can foster these connections. The designs described in this document move NAEP further in that direction (subject to the caveats pointed out in the following section).

NAEP also collects information about students in the form of demographic, personal, and instructional background variables. Student-level information comes from self reports and from associated NAEP teacher and principal surveys, and some school-level information comes from census data. Associations between these variables and levels of performance can thus be estimated, and are routinely calculated and reported. To the extent that these variables are defined and measured in the same way in successive assessments, trends in these associations can be monitored.

2.2 Missions for Which NAEP Is Not As Well-Suited

Perhaps the most notable limitations of NAEP in terms of the kinds of inferences it can support stem from its being cross-sectional (as opposed to longitudinal) and observational (as opposed to experimental, or random-assignment). Being cross-sectional means that a given student is observed at only a single point in time. It is therefore impossible to estimate growth curves or patterns over time at the level of individual students, or to estimate associations between growth curves and student background variables. Such inferences are possible only in longitudinal studies such as NELS and HSB.

Even these longitudinal studies, if they merely track students in existing schools with existing instruction, fail to provide definitive evidence about causes or determinants of achievement. Inferences of this type require comparisons of comparable students under different conditions—a requirement that can rarely be strongly supported in cross-sectional observational surveys such as NAEP. This means that the associations between performance and student background variables, though they might be suggestive and useful for follow-up, are insufficient for concluding that those background variables caused higher or lower performance.

As an example of the kind of inferential error that can result from observational studies, we point to the following counterintuitive association in the 1992 NAEP reading assessment (Mullis, Campbell, & Farstrup, 1993). The question is whether extra instruction in reading helps children read better. Of course, we respond. Yet in that study, the amount of reading instruction fourth-graders received was negatively correlated with their performance on the reading tasks:

 

 

Time Spent in Reading Instruction    30-45 Minutes    60 Minutes    90 Minutes or More
Average Proficiency                       220             219               216

 

With a negative correlation (r = -.1) between reading performance and time spent in reading instruction, it appears that increasing reading instruction decreases reading performance! But the average difference among students in the population who received various amounts of reading instruction—the ‘prima facie’ effect—doesn’t necessarily estimate the average causal effect of reading instruction on performance, because factors that may influence instructional time or reading performance are not taken into account in the comparison (Holland & Rubin, 1987). The NAEP report explains that the negative relationship in this example makes sense when we remember that (a) students who get extra help are usually students who seem to need extra help, and (b) students who seem to need extra help usually have low test scores. Other prima facie effects, ones we would readily interpret as causal because they conform to our expectations, can be just as wrong for similar reasons.
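To make the confounding mechanism concrete, the following minimal simulation sketch (in Python, with entirely invented numbers; it is not NAEP data and not the analysis NAEP uses) shows how a positive causal effect of instruction time can coexist with a negative observed association when weaker readers are assigned more instruction.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Hypothetical prior reading skill: the unmeasured confounder.
    prior_skill = rng.normal(0.0, 1.0, n)

    # Assumption: weaker readers tend to be assigned more instruction time.
    instruction = 3.0 - 1.5 * prior_skill + rng.normal(0.0, 1.0, n)

    # Assumed data-generating model: instruction HELPS (+2 points per hour),
    # but prior skill matters far more for the observed score.
    score = 220.0 + 2.0 * instruction + 15.0 * prior_skill + rng.normal(0.0, 5.0, n)

    # The observed ("prima facie") association is clearly negative ...
    print(np.corrcoef(instruction, score)[0, 1])       # roughly -0.7 here

    # ... yet adjusting for the confounder recovers the positive causal
    # coefficient of about +2 assumed above.
    X = np.column_stack([np.ones(n), instruction, prior_skill])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    print(beta[1])                                      # roughly 2.0

Because NAEP cannot observe or randomize the analogue of the confounder, no amount of reanalysis of the survey data alone can settle the causal question.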

Other offsetting features of NAEP include limitations with respect to (a) motivation, (b) reliance upon survey data, and (c) constraints on students’ time and the connection of assessment tasks with their instructional experiences. We will discuss motivation in Section 6.1G. As for reliance on survey data, we point out that teacher and pupil reports of instructional practice are notoriously dubious. Soliciting background data from the students themselves is quite economical, compared to ascertaining information such as home characteristics from actual observation or record searches. But especially with younger students, the trade-off is accuracy:

Some indicator systems have relied on student reports for information on background factors. … A[n] … analysis of the quality of responses in the High School and Beyond study provided … sobering results. Correlation coefficients between sophomores’ and parents’ reports of background variables ranged from very low to quite high—for example, .21 for the presence of a specific place to study in the home; .35 for the presence of an encyclopedia in the home (an item used in the NAEP as well); .44 for mother’s occupation; .50 for family income; .56 for whether the family owns or rents its residence; .81 for mother’s education; and .87 for father’s education (Fetters, Stowe, & Owings, 1984). (Koretz, 1992b, pp. 17-18)

As for constraints on students’ time in assessment and lack of connection with their instructional programs, we must recognize that what we can learn about students from the NAEP cognitive tasks is limited in its scope. That is, there are kinds of learning we want our students to accomplish, but about which NAEP cannot provide direct evidence. For example, NAEP is not well suited to support inferences about how well students perform on tasks that extend over time, that involve the use of resources beyond the NAEP setting, or that directly address skills and concepts on which the student has been specifically working. In these senses, NAEP tends to underestimate what students can do (Kane, 1996). Conversely, NAEP can overestimate the capabilities of students who do well on its limited palette of tasks but fare poorly in the context of the classroom. These facts hold implications both for achievement-level reporting and for the view of domains of NAEP tasks as representations of domains of learning (see Section 6.1E on standard-level setting).

A related mission for which NAEP is not well-suited is as a measurement tool for high stakes state or local accountability. While there is much consensus around the country in terms of what should be taught, there are also serious differences, with perspectives ranging from the most conservative to the most avant-garde. These differences produce intense scrutiny of any assessments used for high stakes evaluations. The National Assessment is vulnerable to attack if it is seen as a federal test implemented to support a federal curriculum. While the low-stakes nature of NAEP has contributed to participation and motivation problems, the same low stakes have also been a key contributor to its longevity, support, and usefulness.

This leads naturally to the mission of linking NAEP with the assessments of states and others. It is critical to NAEP’s credibility that the limitations of what can and cannot be accomplished with such links be acknowledged. NAEP frameworks will rarely match any given state’s frameworks, and NAEP assessment forms will rarely be parallel with state assessment forms. Student and administrator motivations are very different on the NAEP and local assessments. All of these differences produce uncertainty (‘error’) in linking state assessments to the National Assessment (Linn & Kiplinger, 1994; Ercikan, in press). Some states may wish to use the link to assess how their students would do on NAEP in years or grades where NAEP is not administered. Others may wish to use the link to estimate how the nation would do on a state assessment, estimating national norms for it. But the state assessment cannot be a "stand in" for NAEP, or vice versa. The changes over grades and years that states are concerned about assessing will often be smaller than the linking errors.

The bottom line for assessments like the current NAEP is that they can provide excellent information about the status of a limited number and nature of indicators of WHAT students do, and establish frameworks for public discussion of educational progress and policy—but limited information on WHY (i.e., the determinants of their performance, which is what policy-makers are really interested in), HOW (i.e., what educators in content areas and educational and cognitive psychologists are really interested in), or UNDER WHAT CONDITIONS (another thing educational and cognitive psychologists see as important). Different ways of gathering information are much better suited to providing information about these aspects of student learning, including longitudinal studies, laboratory research, in-depth cognitive studies of smaller numbers of individual students, controlled field trials, and careful observational studies of classroom processes.

2.3 Implications

A key objective in rethinking NAEP is to focus resources on the range of missions that a survey with its evidentiary characteristics does well, and to minimize its use for missions it does not. If it is deemed important at a national level to obtain information that NAEP is ill suited to provide, we should not attempt to stretch NAEP to do so (necessarily poorly). Rather, we should conceive of an informational system in which NAEP is but one component; a system in which complementary and interconnected research efforts of various kinds are each designed to do well the kinds of things they can do, and do not waste time and money doing things they cannot do well. This would argue for a simpler and more compact National Assessment which effectively indicates the status and trend of key indicators, routinely gathering information about selected background variables as well, but not professing to answer causal questions about trends or to explain the cognition underlying performances. Instead, NAEP should be designed to be easy to ‘plug into’ alternative projects and ways of gathering data that are well designed for other purposes. Examples of complementary studies that could include National Assessment indicators among their own data-gathering are program evaluations, classroom observations, cognitive research studies, protocol analyses of large-scale assessment tasks, longitudinal surveys such as NELS, and studies that include in-depth information on the background and instructional practices of students.

 

3.0 Management Principles

Many of the problems that have plagued NAEP over the years, including anomalies, errors, high costs, and extended time lines, can be diminished by applying familiar management principles from business and industry (e.g., Deming, 1982). They apply no matter what configuration of design, analysis, and reporting is ultimately decided upon for NAEP. They concern how complex systems, with multiple steps and many actors, are structured. The following sections present the relevant concepts, and illustrate how they apply in NAEP.

3.1 Local vs. Global Optimization

How do we improve quality and productivity? "‘By everyone doing his best’?," asked W. Edwards Deming; "Five words—and it is wrong. … You have to know what to do. You have to know what to do, then do your best. Sure we need everybody’s best—everybody working together with a common aim. And knowing something about how to achieve it" (Walton, 1986, p. 32). The concept of ‘local optimization’ is ‘everyone doing his best’ but with a limited understanding of how their work fits into the system as a whole. The criteria that seem important to each contributor may do a good job of balancing tradeoffs that are visible to each of them, in accordance with priorities as they see them—yet when brought together, the contribution of one group can block or delay contributions of others. The resulting system, even if locally optimal everywhere, can be globally suboptimal. Some examples from NAEP:

  • For the 1986 Reading Assessment, test developers made slight revisions to NAEP tasks from previous years in order to improve their comprehensibility or grammar. They were better items. They became worthless for gauging change over time, however, since the small changes in performance these minor revisions caused (e.g., percents-correct from, say, 65% to 68%) often exceeded the amount of change in population performance over a two-year period (Beaton & Zwick, 1990). The course now followed is less optimal locally, but more desirable globally: Use unchanged items in unchanged blocks for trend analyses; administer and score these blocks of items in just the same way as in the previous cycle; and treat any revised items, even ‘slightly revised’ items, as new items.

  • A major revision of a content-area framework produces new task specifications and many new tasks—tasks that reflect the latest thinking in the field and the most up-to-date research. And what could be better than putting these new tasks, interpreted through the new framework, into place immediately? But without having administered these tasks in their final configuration before the operational test administration, we cannot know which ones will provide useful data from students, or whether we will be able to link results from the new framework to the previous framework. We can only find out after we actually have data—when we must carry out analyses with uncertain and unpredictable results, when errors or unexpected complications may call for the invention of new analytic techniques, when unforeseen glitches may require expensive and time-consuming rework, and when interagency decisions among alternative analysis and reporting options must be wrangled out. A more globally optimal course would be to include the new tasks under the new framework in a first assessment cycle jointly with established items under the previously established framework, but to report the initial results relative to the established framework only. The unpredictable and exploratory analyses required for the new components would be carried out more deliberately. Alternative procedures would be compared more thoughtfully, or invented if necessary—without the need for untested patches rushed into place to meet reporting deadlines. What works and what doesn’t could be determined. Results of these analyses could be released in more detailed reports that introduce reporting under the new framework and show its relationship to the previous framework. Preparations could be made for faster and more stable initial reporting under the new framework in the next assessment cycle.

3.2 The Critical Path

The PERT chart is a popular management tool for understanding the interrelationships among tasks in a large project. It shows which tasks depend on others, and which can be carried out in parallel. Importantly, it describes the chain of tasks, each depending on the previous, which absolutely must be carried out for the project to be completed; this is the critical path. Carrying out the tasks in the critical path determines the minimal amount of time required to complete the project. This concept is a key to cutting the time to report NAEP results: No task should appear on this critical path between data collection and reporting if it can be done before or if it is not essential for the report.
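To illustrate the concept (with invented task names and durations, not an actual NAEP schedule), the short Python sketch below computes the length of the critical path through a small dependency graph; only tasks on that path constrain the earliest possible reporting date.

    from functools import lru_cache

    # Invented tasks: name -> (duration in weeks, prerequisite tasks).
    tasks = {
        "collect_data":  (8,  []),
        "score_items":   (6,  ["collect_data"]),
        "final_weights": (10, ["collect_data"]),
        "core_scaling":  (3,  ["score_items"]),
        "teacher_merge": (7,  ["collect_data"]),   # not needed for the first report
        "first_report":  (3,  ["core_scaling", "final_weights"]),
    }

    @lru_cache(maxsize=None)
    def earliest_finish(name: str) -> int:
        """Longest duration-weighted path from the start through this task."""
        duration, prereqs = tasks[name]
        return duration + max((earliest_finish(p) for p in prereqs), default=0)

    # 21 weeks, via collect_data -> final_weights -> first_report. Note that
    # teacher_merge never enters the calculation, because first_report does
    # not depend on it; removing a task from the critical path is what
    # shortens (or at least stabilizes) time to report.
    print(earliest_finish("first_report"))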

There appear to be a few tasks currently on the NAEP critical path that can be moved ahead without incurring any tradeoffs whatsoever. Others involve tradeoffs, but ones for which the disadvantages appear to be overwhelmed by the advantages of speed and efficiency. For example:

  • The current NAEP analysis configuration requires, for a given subject area, analyses that involve all tasks in all scales, final sampling weights, and all background variables, including the matching and merging of teacher survey data. It is nearly true that everything must be done before anything can be reported; almost every element in the data lies on the critical path to the first report.

  • Major analytic decisions such as whether double-length writing blocks can be incorporated into the Writing scale, or whether to carry out cross-grade or within-grade scaling, are often slated to be made only after the data are in, and have been analyzed. This requires parallel development of alternative analytic procedures, honed to the point that they are sufficiently reliable to be employed in production. It requires time- and staff-consuming analyses and interagency decisions to make final determinations.

3.3 Decreasing Returns/Negative Returns

Most people are familiar with the principle of decreasing returns. In test theory, for example, three item responses provide more information about a student than two responses, but the increase is not as great as the gain from two responses over one. The Spearman-Brown formula allows us to approximate these decreasing gains. However, the increment in testing time from two items to three is just as great as the increase from one to two. At some point, the added items do not provide enough additional information to justify their cost. We will see several examples of this principle at work in NAEP, and it enters into deciding among design tradeoffs.
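The Spearman-Brown prophecy formula makes the pattern explicit. The brief sketch below (assuming, purely for illustration, a single-item reliability of .30) shows each added item buying a smaller gain in reliability even though each adds the same testing time.

    def spearman_brown(rho_1: float, k: int) -> float:
        """Reliability of a test lengthened to k parallel items,
        given single-item reliability rho_1."""
        return k * rho_1 / (1 + (k - 1) * rho_1)

    rho_1 = 0.30   # assumed single-item reliability, for illustration only
    for k in range(1, 6):
        print(k, round(spearman_brown(rho_1, k), 3))
    # Prints approximately: 1 0.30, 2 0.46, 3 0.56, 4 0.63, 5 0.68;
    # each increment is smaller than the one before, while cost grows linearly.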

The lesser-known phenomenon of negative returns also arises frequently in NAEP. To continue the test theory example, when increasing test length begins to influence students’ performance because of fatigue, frustration, or lack of cooperation, the Spearman-Brown predictions of decreasing returns are no longer correct. Costs are linearly higher, but the information gained can actually be less than it would have been with fewer items. This situation arises in NAEP as a consequence of motivation, logistical limitations, and attempts to address inferences that large-scale surveys are not, by their nature, suited to support. Some examples:

 

  • A short constructed response task has the potential to tell us more about a student’s thinking than a multiple-choice item. A longer constructed response task might tell even more about some students, but nothing at all about those who decide not to bother responding to it. Omit rates for grades 8 and 12 in 1994 Geography, for example, averaged less than 1% for multiple-choice tasks and about 5% for short constructed response tasks—but up to 40% for tasks in which students were asked to provide extensive responses. Since these omissions are self-selected, even if 10,000 students do respond assiduously, there is less information about average performance in the population than from a random sample of just 100 students who were all engaged in the task.

  • It has been well-known, since at least the days of the Coleman Report (Coleman et al., 1966), that students’ home experiences have substantial impact on their school achievement. The self-reported information about students that NAEP routinely gathers is affordable, but of varying quality. Recent explorations into whether to attempt to obtain more accurate information about students’ home experiences and socio-economic status (SES) have (probably wisely) recommended against using either more detailed census data or parent surveys in the main assessment. Costs would rise substantially, and public resistance could increase to such a degree as to erode cooperation. And, because NAEP is a survey rather than an experiment, it would still not be possible to unravel the comparative effects of schooling and background experiences.

3.4 Operational Definitions

Educators can agree unanimously that we need to help students "improve their math skills," but disagree vehemently about just how to appraise students’ skills. Their conceptions of mathematical skills diverge as they move from generalities to the classroom. They employ the language and concepts of alternative perspectives on how mathematics is taught, how it is learned, and about which topics and skills are important. The disparate assessments they have in mind all provide evidence about students’ competence—but each from a particular point of view of that competence, how it is evidenced, and how much to value different aspects of it.

Several levels of abstraction might be conceived for thinking or talking about student achievement, but it is an actual specific assessment that a student ultimately encounters. "Test specifications" identify what a particular assessment should comprise: The kinds and numbers of tasks, the way it will be carried out, and the processes by which observations will be summarized and reported. This level of specification determines an operational definition of competence. Deming (1982) describes how similar processes are routinely required in industry, law, and medicine:

 

Does pollution mean, for example, carbon monoxide in sufficient concentration to cause sickness in 3 breaths, or does one mean carbon monoxide in sufficient concentration to cause sickness when breathed continuously over a period of 5 days? In either case, how is the effect going to be recognized? By what procedure is the presence of carbon monoxide to be detected? What is the diagnosis or criterion for poisoning? Men? Animals? If men, how will they be selected? How many? How many in the sample must satisfy the criteria for poisoning from carbon monoxide in order that we may declare the air to be unsafe for a few breaths, or for a steady diet?

Operational definitions are necessary for economy and reliability. Without an operational definition, unemployment, pollution, safety of goods and of apparatus, effectiveness (as of a drug), side-effects, duration of dosage before side-effects become apparent (as examples), have no meaning unless defined in statistical terms. Without an operational definition, investigations on a problem will be costly and ineffective, almost certain to lead to endless bickering and controversy. (pp. 286-287)

For practical work, stakeholders agree on one or more operational definitions to track the more abstractly defined matters in which they are interested. The U.S. Food and Drug Administration, for example, works with an operational definition for "acceptable frozen broccoli" that includes ‘less than 272 aphids per pound’—obviously a consensually defined quantity. Different operational definitions, equally defensible, can lead to somewhat different results—but only after they have been specified can accurate estimation, or discourse based on the matter, proceed.

In NAEP, an operational definition of proficiency in a subject area is determined jointly by the subject-area framework, test specifications, administration procedures, and scaling/reporting procedures. Even a seemingly minor decision about whether to ignore omitted responses or to count them wrong is part of the definition. Any change in any of these components changes the operational definition of the proficiency, and has the potential to affect results more than changes in what students actually know and can do.

Operational definitions come into play in NAEP in several other places, such as sampling frames, background variables, exclusion rules for testing students, and, importantly, achievement levels. This last instance is discussed in Section 6.1E.

3.5 Variation in Systems

At the heart of Deming’s revolutionary approach to quality control was an understanding of variation in a system. Any system exhibits variation. Even an established system under what Deming called ‘statistical control’ exhibits a certain amount of variation. Resources are squandered when attention is focused on variation within these limits. One way that resources are effectively used is identifying and resolving ‘special’ causes of variation that lie outside the natural variation of a system—"putting out fires", or, in NAEP, "resolving anomalies." Statistical ideas help distinguish special causes from the natural variation of a system. In industry, typical ‘control limits’ for zeroing in on outliers are three standard deviations beyond average results. While putting out fires is an effective use of resources, it does not improve a system. Only changing the system can do that. The second way to use resources effectively is to change the system so as to improve its product—and, almost always, to reduce the amount of variation in the system. These principles are relevant to the NAEP redesign, for decreasing reporting time and improving the accuracy of trend results.
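A minimal sketch of this logic, with invented numbers standing in for a recurring NAEP summary statistic, is given below; only results falling outside roughly three standard deviations of the established process would be treated as candidate special causes worth investigating.

    import statistics

    # Invented baseline results from an established, stable process.
    baseline = [248.1, 249.0, 247.6, 248.4, 249.3, 248.0, 248.7, 248.9]
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)            # natural variation of the system
    lower, upper = center - 3 * sigma, center + 3 * sigma

    # New results: inside the limits -> common variation (leave the system alone);
    # outside the limits -> a possible special cause worth investigating.
    for value in [248.2, 249.6, 252.9]:
        status = ("investigate: possible special cause"
                  if not (lower <= value <= upper) else "common variation")
        print(value, status)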

3.5.1 Reporting time

Figures 3-1 a) and b) present HYPOTHETICAL illustrations of two reporting systems. The top panel suggests time-to-release of main reports under the current main-NAEP configuration, which includes revisions, changes, new procedures, and reporting decisions (such as standard setting, how to handle scaling, and what results to report and how to report them). The figure is notional, partly because calendar time to reports depends on which reports are given priority. In this panel, the average time is higher than desired, although some reports are ready fairly quickly. The variation, moreover, is very wide, due in large part to unforeseeable needs for rework arising from unstable or new portions of the assessment or attendant processes, under a configuration in which almost everything must be analyzed and resolved before anything is reported. Simply exhorting everyone to do better does little to bring average reporting time down, since the wide variation in the system, as configured, leads predictably to some reporting times above the desired target. Focusing resources on the specific incidents that led an assessment to come in behind schedule is wasted effort if the underlying cause is an untested change, an inherently unstable variable, or a survey that requires complex file-matching—if the next assessment cycle will again include new untested changes, inherently unstable variables, and surveys that require complex file-matching.

 

Figure 3-1
Hypothetical Distributions of Time-to-Report in Two Assessment Systems

The bottom panel illustrates some important observations we have made about the process of reporting long-term trends. Long-term trend reporting, neglected in the main NAEP activities, has coincidentally become a stable process. Very few changes at all enter into test designs, administration, or analysis (although reporting has sometimes been extensive, as trend reports sometimes have much interpretation and contextualization). The time necessary to prepare the basic data for reporting is not only shorter, but exhibits far less variation. This first feature is the bottom line, of course, and we will be exploring ways to achieve it in a redesigned standard NAEP. The second feature, reduced variation, is important not just for the predictability and the reliability of the system, but because it permits quicker and more accurate detection of true ‘special causes’. That is, if variation due to controllable nuisance effects is decreased, true anomalies are faster and easier to detect and resolve.

3.5.2 Accuracy of trend results

Deming, as a statistician, appreciated both the value of statistical models for gauging uncertainty and their limitations. A limitation of model-based estimates of uncertainty (i.e., standard errors) is that they depend on the model. To the degree that the model is wrong or incomplete, the reported standard errors are wrong—usually too small, because they do not include important sources of variation in the results. This is important in NAEP in the following way.

NAEP results may be called ‘reading proficiency’ or ‘math performance’ in a rather generic or global use of the term, but what they really are, are summaries of observations (which we believe have something to do with students’ knowledge and skills) collected in specific ways under specific conditions. Literally hundreds of specifics are involved, everything from definitions of the population frames, sampling procedures, color of ink, and weights, to item specifications, timing, administration, analytic procedures, and training procedures for scorers. These specifics, and changes to them, have three important implications:

 

  • Every single one of these specifics affects the level of the outcome to some degree.

  • Some of these features, when changed, can have greater impact than the true target of inference, namely, change of student proficiency over time.

  • Changing several features, even if each is seemingly minor, can also have greater impact than the change of student proficiency over time.

Error variance from some of these effects can be handled with statistical models—student sampling and item sampling, in particular. These sources of variance are all that show up in reported standard errors. Score variations due to other feature changes usually are not estimated, except when a change is deemed sufficiently suspect to merit dual administration under old and new conditions, and an attempt made to adjust for the average effect of the change. This means that the variation associated with seemingly small changes is present in results, but not in the standard errors for them.
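A small variance sketch (all numbers invented) makes the consequence concrete: if the reported standard error reflects only student and item sampling, while unmodeled procedural changes contribute comparable variance, then the uncertainty in results, and especially in a cross-cycle trend, is understated.

    import math

    # Invented variance components for a reported mean scale score.
    var_student_sampling = 1.0 ** 2    # reflected in reported standard errors
    var_item_sampling    = 0.6 ** 2    # reflected in reported standard errors
    var_procedure_change = 1.2 ** 2    # NOT reflected (context, timing, scoring shifts)

    se_reported = math.sqrt(var_student_sampling + var_item_sampling)
    se_actual   = math.sqrt(var_student_sampling + var_item_sampling
                            + var_procedure_change)
    print(round(se_reported, 2), round(se_actual, 2))    # 1.17 vs. 1.67

    # For a difference between two independent assessment cycles, both
    # components enter twice, so the gap carries over to trend comparisons.
    print(round(math.sqrt(2) * se_reported, 2))          # about 1.65
    print(round(math.sqrt(2) * se_actual, 2))            # about 2.37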

Two untoward consequences result from this underestimation of the variability in results. First, distortions result in planning the sampling design. For example, there was uncertainty present in the 1986 design due to changing item context that was as large as uncertainty due to student sampling(1) (Mislevy, 1990). The huge expense of securing large random samples of students is wasted if locally desirable changes in design and procedure add variance back into the results.

Second, while the current student- and item-based standard errors are not too bad for within-assessment comparisons (because conditions are constant within that assessment), they more seriously underestimate standard errors for trends because of changes across assessment cycles. Setting control limits in relation to the underestimated standard errors guarantees that false alarms will be set off on a regular basis. Too many observations will be identified as suspicious. This triggers a search for ‘the mistake’, a special cause of variation, when there is no special cause; the result is just another draw from the natural variance of a noisy system. If one wants accurate reports and the current system is not accurate enough, continually chasing a few real and many false signals of anomalies cannot solve the problem. The real solution requires honest estimates of the actual uncertainty in the existing system, and then changing the system so it is less noisy.

The impact of variations in design options, and the consequent generalizability of inferences drawn from NAEP data, can and should be examined empirically by the use of generalizability studies. These studies should be done as part of the planning process, and should not be on the critical path to main reports. In such studies, different versions of an assessment are developed that vary in controlled ways. For example, test forms may be developed that contain different items but that are designed to be parallel in terms of the number and type of items and their measurement properties. Or forms may be created that vary more systematically, such as in their proportions of constructed-response, performance, and multiple-choice items. The variation in results across forms provides important information about how much error can be expected from changes in the assessment. This information makes it clearer how generalizable are conclusions drawn from a particular assessment design.
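As one simple sketch of how such a study might quantify form-to-form error (invented numbers; an operational generalizability study would fit a fuller variance-components model), the snippet below backs out a between-form variance component from the mean scores of randomly equivalent groups given nominally parallel forms.

    import statistics

    # Invented mean scale scores from randomly equivalent student groups,
    # each administered a different but nominally parallel form.
    form_means = {"A": 249.8, "B": 248.6, "C": 250.9, "D": 247.9}

    # Assumed student-sampling variance of each group mean (e.g., a standard
    # error of about 1.0 scale-score points).
    var_within = 1.0 ** 2

    # One-way random-effects logic: the variance among form means, minus the
    # part expected from student sampling alone, estimates the component of
    # variance attributable to the choice of form.
    var_among = statistics.variance(list(form_means.values()))
    var_between_forms = max(var_among - var_within, 0.0)
    print(round(var_between_forms, 2), "squared scale-score points due to form")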

3.6 Implications and Approaches

How can we apply these management concepts to achieve the desiderata of the Themes and Issues? Our discussion of the Themes and Issues and our design sketch make use of four ideas:

  • Setting Priorities
  • Modular Design
  • Phased Analysis & Reporting
  • Phased-In Change

Setting Priorities. The way to circumvent local optimization is to specify overarching priorities. This makes it possible to create an assessment design and an analysis plan that will override some locally desirable alternatives. If everything is especially important, then nothing is especially important. For example, striking gains in speed and stability of results can be produced by initially focusing energies on the aspects of the data that are most important and least problematic. However, it must be recognized that this prioritization inevitably means that certain analyses or variables that are most important to some NAEP constituents are not on the critical path to initial results.

Modular Design. The idea here is to design NAEP in terms of distinguishable modules, perhaps the most important of which ("a core NAEP") supports trend comparisons over time and consists of elements which are important, stable, and (comparatively) easy to analyze and report. These core modules could be embedded in other NAEP activities (in particular, state NAEP), and in non-NAEP studies. Other elements of NAEP could be spiraled into the main NAEP administration, but would not appear on the critical path to initial reports. These could include, for example, teacher surveys, experimental and more extensive tasks, long-term trend blocks, and blocks of tasks being readied to appear on the critical path in the next assessment cycle.

Phased Analysis & Reporting. As large as NAEP is, it is dwarfed by the census that the Census Bureau carries out every ten years. Yet the Census Bureau reports its first results six months after the data are in—as required by law. How can they do this? They do not report every possible result in every conceivable form. They report the most important results in the most straightforward way, then continue, over the next ten years, to analyze, to refine, to report, and to release further analyses in priority order. The analyses required for these results are not on the critical path to the initial report. NAEP has moved in this direction recently with its First Look reports. In Section 6.1D we discuss how even quicker initial reports could be accomplished.

Phased-In Change. In every administration of NAEP assessments, some aspects of the data collection have been essentially unchanged from the previous administration, others are changed only modestly, and others are quite different. We see time and again that the chances of problems (some remediable, others not) increase accordingly. For example, long-term trend assessments are essentially unchanged from one administration to the next, and, not surprisingly, they exhibit far fewer problems than main NAEP. Many things that could go wrong in an assessment have been discovered (often the hard way), worked through, and are avoided in successive administrations. It is largely known what the data will look like when they arrive, and what to do, and how to do it. Many of these advantages could be built into a core NAEP, while relaxing some of the incidental constraints that also characterize the long-term trend assessments. Open-ended tasks, which are not part of the current long-term assessments, could be included in the mix of tasks. New blocks could be introduced, as long as (a) they were not included in the standard results the first time they are used, and (b) they were very similar to blocks already in the mix in terms of structure, difficulty, content, and format balance. More consequential shifts of these factors would be introduced only periodically (say, eight to ten years), and after at least one joint administration in which they are not included in the initial results.

4.0 The Necessity and Effects of Design Trade-offs

Even when attention focuses on the kinds of information that large-scale surveys such as NAEP can do well, there remains much leeway for setting priorities. Broad and current content coverage, for example, has always been important for NAEP; so has the capability to compare performance across time points. The NAGB Themes and Issues propose a higher priority for expeditious turnaround of results than has been the case historically. And while associations between performance and student background variables have been desired, the high cost of reliable measures of student background has led NAEP to rely on less trustworthy self-reports. Three key points must be kept in mind:

 

  • Different purposes are best achieved by different design configurations. (For example, assessment frameworks that were created anew with each assessment cycle would guarantee the most current perspectives on what is deemed important in a subject area, but devastate comparisons across assessment cycles.)

  • Any single design involves tradeoffs among features that strengthen, weaken, or sometimes even preclude inferences associated with different purposes. (For example, an assessment such as NAEP that uses relatively short, matrix-sampled test forms provides efficient estimates of population characteristics, but poor estimates for individual students.)

  • Establishing priorities among purposes enables assessment designers to plan a configuration that maximizes the attainment of high-priority purposes, while satisfying lesser priorities to lesser degrees.

4.1 A Fundamental Tradeoff

Perhaps the crucial tradeoffs to be addressed in a NAEP redesign emerge from the interplay of the following points made in the Themes and Issues:

 

  • Group focus. "The National Assessment only provides group results; it is not an individual student test."

  • Validity. "[V]alidity … of the data will remain a hallmark of the National Assessment." [In particular, this has included content coverage—consensually-determined frameworks and item pools that represent the breadth and depth of knowledge and skills in a given subject area, insofar as it is possible to assess them by NAEP.]

  • Achievement level reporting. "The National Assessment should continue to report student achievement results based on performance standards."

  • Simpler design. "Options should be identified to simplify the design of the National Assessment and reduce reliance on conditioning, plausible values, and imputation to estimate group scores."

Content-coverage has been important to NAEP since its inception. Such comprehensiveness cannot be attained if all students are administered the same, or virtually parallel, test forms. In and of itself, variation in test forms is not a barrier to rapidity and simplicity. The NAEP design of the early 1970s had few restrictions on booklet construction yet supported simple analyses—but largely because results were reported in terms of performance on items, not in terms of performance by students.

This may seem like a trivial distinction, since all the data are performances on items by students. The key difference, though, is that under item-level reporting, the issue addressed is how students would do on this item, regardless of performance on other items. In a student-level framework of reporting (even if scores are never actually calculated for individual students), the focus is on how a given student would do across items. This means projecting from how she does on the one particular set of items in her test form to how she might have done on some larger set (e.g., an actual set of reference items, or a performance scale that implies levels of performance in a domain of items). Interrelationships among performances across items therefore matter, and the complexities of some kind of linking and scaling procedures appear. Methodologies available for linking results on different test forms vary in their complexity. The simplest can be employed when (1) forms are parallel, which demands tight constraints on form design and works against breadth of content coverage, and (2) target inferences are about individuals measured equally well, rather than about properties of distributions of groups.
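
To make the simplest end of that spectrum concrete, the sketch below applies a linear (mean-sigma) transformation, estimated from equivalent random groups, to place scores from one nearly parallel form onto the scale of another. The scores, form names, and group-equivalence assumption are all invented for illustration; anything less parallel than this quickly requires the more complex machinery discussed later.

```python
# A sketch of the simplest linking method mentioned above: linear (mean-sigma)
# equating under a random-groups assumption.  Scores and group sizes are invented.
import numpy as np

rng = np.random.default_rng(4)
form_x = rng.normal(52.0, 9.0, size=1500)   # hypothetical scores on reference form X
form_y = rng.normal(48.5, 8.2, size=1500)   # hypothetical scores on new form Y,
                                            # taken by an equivalent random group

slope = form_x.std(ddof=1) / form_y.std(ddof=1)
intercept = form_x.mean() - slope * form_y.mean()

def y_to_x(score_y: float) -> float:
    """Place a Form Y score on the Form X scale."""
    return slope * score_y + intercept

print(f"A Form Y score of 50 maps to about {y_to_x(50):.1f} on the Form X scale")
```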

The current NAEP configuration has neither of these characteristics. Data come from booklets that vary within assessments and over time. Students are administered too few items to obtain accurate measures of their performance, since experience has shown that administering large numbers of tasks under unmotivated ‘drop in from the sky’ testing abrades the engagement of students and schools alike. And the target of inference is proportions of students at or above designated achievement levels—one of the hardest to estimate from sparse matrix-sampling designs. This is the state of affairs that led to the complex statistical methodologies noted in the ‘simpler design’ desideratum.

Is it possible to have a cleaner design, simpler analyses, and faster reporting—yet maintain broad content coverage and valid achievement-level reporting? Our perspective emphasizes (1) use of management principles in design, so that procedures can be faster, simpler, and more stable no matter how tradeoffs are balanced, and (2) arrangement of design and reporting priorities so as to be, at once, consonant with the desiderata in the Themes and Issues, but ordered so as to reduce costs and complexities in achieving them.

4.2 Tradeoffs and Test Specifications

Among the major features of an assessment that affect the structure of test forms are the following: 1) content specifications (including definition of objectives or outcomes and the number of items measuring each outcome); 2) item types and formats (including but not limited to multiple-choice or performance items); 3) desired standard error functions, especially as they relate to achievement levels; 4) testing time per student; and 5) linking requirements (between forms or grades). Decisions about these design features—which, as we shall discuss, should flow from decisions about priorities on assessment purposes—will create the look of the assessment and have a great influence on the complexity of the analysis techniques needed.

Test frameworks determine the breadth of content coverage needed, but test specifications are more specific than frameworks. If it is desired to make broad and robust generalizations about student achievement, then broad content coverage is needed. The level of detail at which distinctions will be made is also important. For example, if it is desired to draw generalizable conclusions about students’ achievement in problem solving versus algebra, then a sufficient number of items needs to be included to measure those separate objectives. In the past, NAEP has been notable for the breadth of its content coverage, which has positively affected its reputation as a valid and useful benchmark of American student achievement. However, this breadth has contributed to the need for a large number of test forms, with attendant complexity and expense.

It is an explicit desideratum of the Themes and Issues that constructed-response or performance items be included in a redesigned NAEP. The number and type of performance items have tremendous impact on testing time and scoring costs. Also, while increasing the depth of assessment, the task effects inherent in performance items decrease the generalizability of results relative to devoting the same amount of testing time to multiple-choice items. That is, the use of just one performance task creates the need to use additional performance tasks in order to maintain stable results. For example, if only one math task is used in one year and it focuses heavily on geometry, and the next year an algebra-laden task is used, it will not be possible to understand the meaning of score changes: Are the changes due to changes in levels of student achievement in math skills common to both tasks, or are they due to the fact that students can do one type of task better than the other? Using several carefully chosen tasks in each assessment improves the interpretability of the results since it affords the possibility of sorting out some of these competing explanations.

The desideratum to report scores in terms of achievement levels places particular emphasis on the standard error functions, or the degree of accuracy of information about individual students. Items need to be placed in the assessment to match the target achievement levels. For example, to measure Advanced performance accurately, difficult items must be in the assessment. If NAGB decides that it is important to place more emphasis on measuring students’ progress as they move toward achieving the Basic level, more items at the low end of the scale need to be added to the assessment. There is a fundamental dilemma in designing an assessment before standards are set: it may be that reasonable standards are set but that a given assessment design cannot measure, with the necessary accuracy, the proportion of students reaching those standards.
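
The dependence of accuracy at a cut score on where the item difficulties sit can be illustrated with a simple item response theory calculation. The sketch below uses a two-parameter logistic model with invented item parameters and an invented ‘Advanced’ cut; none of these values is NAEP’s.

```python
# Hedged IRT illustration (not NAEP's model or parameters): test information at a
# hypothetical 'Advanced' cut depends on whether item difficulties sit near the cut.
import numpy as np

def two_pl_info(theta: float, a: float, b: np.ndarray) -> np.ndarray:
    """Item information under a two-parameter logistic model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return (a ** 2) * p * (1.0 - p)

rng = np.random.default_rng(0)
pools = {
    "mostly easy items": rng.normal(-0.5, 0.6, size=40),  # invented difficulties
    "items near the cut": rng.normal(1.5, 0.6, size=40),  # invented difficulties
}
advanced_cut = 1.8  # invented cut score on the theta scale

for name, difficulties in pools.items():
    info = two_pl_info(advanced_cut, a=1.0, b=difficulties).sum()
    print(f"{name}: information at cut = {info:.1f}, SE at cut = {1 / np.sqrt(info):.2f}")
```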

Past NAEP results have found that when students are tested for longer than one testing session (about an hour), there is a substantial loss of student participation.(2) Such loss biases assessment results. As long as it is desired to measure more content than one student will take, more than one test form must be used, and complexities arise in design and analyses. For example, imagine that it takes two forms to cover the NAEP content framework and item format specifications to an acceptable degree. Since two forms are needed, in one or more ways they cannot be parallel; they may measure different content, perhaps with different formats, or have different standard error functions. In an extreme case, one form might contain only multiple-choice items (Form A) and the other contain one or more performance items (Form B). To obtain an overall picture of performance of a group of students, it will be necessary to pool results from the two forms. It will not be possible for only one form to be used by, say, states, to link their assessments to NAEP; both forms will be needed.

Furthermore, if more than one form is needed to cover the desired content and item formats, comparability of results over years requires the use of either a) tight restrictions on test form characteristics or b) complex analysis procedures. Continuing the example above, call the first year’s test forms A1 and B1. To maintain overall consistency of results in the second year of testing, it is necessary to design A2 to be parallel to A1 and B2 to be parallel to B1. (Looser restrictions could be used, but they are more complicated to explain and implement.) If form consistency is not maintained, then the distributions of observed scores (and percents of students in each achievement level) will be affected by differences in standard error functions. Sophisticated statistical techniques exist for dealing with these differences (e.g., the "plausible values" methodology), but one of the Themes and Issues desiderata is to reduce use of such techniques. We will discuss these issues further in Section 6.1C.

4.3 Remarks

It is not possible to "have it all"; trade-offs must be made. The present NAEP design emphasizes breadth of content coverage, use of performance items, minimum testing time per student, and achievement level reporting. These features have been obtained by increasing the cost and complexity of the form design and analysis. The cost and complexity can be reduced, but then something else must be given up. The configuration we sketch in Section 7, for example, maintains broad content coverage, allows for controlled evolution of the task pools, and permits more rapid reporting—but it does so by constraining the specifications of booklets upon which standard, initial reports are based. Subsequent reports incorporating broader content, newer and more complex tasks, and additional student background variables can come later, necessarily carried out with more complex analyses.

5.0 A Selective History of Elements of the NAEP Design

This section briefly reviews selected elements of design configurations NAEP has exhibited over the years, in terms of purposes, priorities, and trade-offs—some explicit, others implicit; some intentional, others adventitious; and some with unforeseen consequences. This discussion further illustrates the principles introduced above, and sets the stage for deliberation of options for the future.

5.1 1970-1983

Certain features of NAEP were instituted at its onset, conceived to make the program sufficiently useful, cost-effective, and politically benign to come into being.

5.1.1 Student Sampling

NAEP was designed to gather information from samples of students rather than from every student. This approach, motivated more by practice in public-opinion polling than educational testing, allowed extraordinary efficiencies when the target of inference was performance of groups of students rather than of individual students. Accurate estimates of national performance, for example, could be grounded on a random sample of a few thousand students. A multi-stage sample was employed (a simple random sample of students from the nation is impractical), necessitating that clustering effects and stratification be accounted for in estimating item averages and precision of estimation. Since results were not obtained for all students, nor used for purposes specific to sampled individuals, motivation was more of a concern than in typical tests in which something good happens to a student if he does well, or something bad happens if he does poorly.

A tradeoff appeared in the sampling of students at random from their schools, rather than from intact classrooms. The advantage: A lower clustering effect, implying more efficient estimates of group performance for a given sample size. The disadvantage: Hierarchical linear modeling (HLM), which would examine impact of class and teacher effects, was precluded.
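
The efficiency gain from sampling within schools rather than by intact classrooms can be sized with the standard design-effect approximation, deff = 1 + (m − 1)ρ, where m is the number of sampled students per cluster and ρ is the intraclass correlation. The sketch below uses invented values of both purely for illustration.

```python
# Kish design-effect arithmetic with invented numbers: the less alike students are
# within a sampled cluster (lower intraclass correlation), the smaller the penalty
# for clustering and the larger the effective sample size.
def design_effect(cluster_size: int, icc: float) -> float:
    return 1.0 + (cluster_size - 1) * icc

n_students = 10_000
for label, icc in (("random within school", 0.05), ("intact classrooms", 0.20)):
    deff = design_effect(cluster_size=25, icc=icc)
    print(f"{label}: deff = {deff:.2f}, effective n = {n_students / deff:,.0f}")
```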

5.1.2 Item Sampling

Item-sampling is the dual of student-sampling. Since performance in any subject area is only poorly reflected by any single item, or even several of them, we learn more comprehensively about all the many facets of skill and knowledge in a subject from a large number of diverse tasks—far too many for any single student to be administered, especially under unmotivated conditions. NAEP pioneered the radical solution of item-sampling: each sampled student was administered a sample of items from the pool. Technical innovations made it possible to obtain, from these ‘matrix samples’ of responses, estimates of average performance (e.g., Lord, 1962). Matrix sampling was ideal for broad content coverage and efficient estimates of performance in large domains of items. An important feature of matrix sampling is that it supports estimates of average performance in the domain or on individual items even if students respond to very few items. This was a partial solution to the motivation problem, since under low-stakes conditions, motivation declines as amount-of-effort-required increases. (Indeed, motivation can decline to the point of negative returns as testing sessions become longer; two hours of testing time per student can provide less information about a group than one hour of testing time, if rates of school and student refusal, and item omit rates, increase.)
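
A toy simulation conveys the logic of matrix sampling: each simulated student answers only a small random subset of a large item pool, yet pooling across students recovers the domain average closely. The pool size, item difficulties, and sample sizes below are invented.

```python
# Toy matrix-sampling simulation: every parameter below is invented.
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items, items_per_student = 2000, 200, 10
p_correct = rng.beta(4, 3, size=n_items)       # hypothetical per-item probabilities correct

right = np.zeros(n_items)
asked = np.zeros(n_items)
for _ in range(n_students):
    booklet = rng.choice(n_items, size=items_per_student, replace=False)
    right[booklet] += rng.random(items_per_student) < p_correct[booklet]
    asked[booklet] += 1

print(f"true domain mean:       {p_correct.mean():.3f}")
print(f"matrix-sample estimate: {(right / asked).mean():.3f}")
```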

Items in the original NAEP design were administered by paced audio tapes. That is, all students in a testing session were administered the same booklet of items, and an audiotape moved students through the booklet item by item. A number of trade-offs were involved here: Administration was logistically cumbersome, and data were less than optimally efficient because of the clustering of students. On the other hand, information was better item-by-item than when students are simply given a number of items to work and a block of time to work on them. In this latter situation, it is up to students to decide when, for how long, and in what order they will work on each item—a factor of some importance in their performance, but uncontrolled by administration conditions.

5.1.3 Reporting

From the inception of NAEP through 1988, NAEP reported results in terms of regions of the nation rather than states. Obviously, since regions of the country are not responsible agencies for education, reporting in these terms had less policy relevance than reporting in terms of states or school districts. Why make such a tradeoff? In order to make NAEP acceptable, so it could come into being in the first place. It was not politic to create a national assessment that was too useful.

Similarly, sampling and reporting were organized in terms of ages of students rather than grades, even though schooling in this nation is mainly organized in terms of grades. Ages 9, 13, and 17 were targeted, in tune with the international assessments then being carried out by the International Association for the Evaluation of Educational Achievement (IEA).

Reporting originally focused on single items: percents of correct response, or distributions of kinds of response, in the nation as a whole and in subgroups of students defined by the background variables NAEP also included in its surveys (e.g., race/ethnicity, parents’ education, and region of the country). This item-by-item reporting contrasts with student-based reporting, in which individual students’ performance is summarized over items, and the distributions of these summaries are analyzed. A great advantage of item-level reporting was that no equating or scaling procedures were required; average performance on an item was simply what it was. Analysis was as simple as possible as far as items were concerned, given that the complex student-sampling required a certain level of complexity in analysis (jackknife estimation of variance due to student sampling, which is still used). Another advantage was that there were relatively few constraints on the composition of assessment booklets. They need not have had similar content, formats, difficulties, or lengths. A disadvantage was that item-by-item reporting provided estimates of average performance by subgroup, but it precluded the conception of distributions of students’ performance, or of reporting in terms of achievement levels.
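
For readers unfamiliar with the jackknife mentioned above, the sketch below shows the bare delete-one-cluster mechanism with invented data. Operational NAEP uses a paired scheme over primary sampling units together with sampling weights, which is more involved than this.

```python
# Bare-bones delete-one-cluster jackknife with invented data; operational NAEP
# uses a paired scheme over primary sampling units and incorporates weights.
import numpy as np

rng = np.random.default_rng(2)
clusters = [rng.normal(0.62 + rng.normal(0.0, 0.03), 0.15, size=30) for _ in range(20)]

def overall_mean(groups):
    return np.concatenate(groups).mean()

full_estimate = overall_mean(clusters)
pseudo = np.array([overall_mean(clusters[:i] + clusters[i + 1:])
                   for i in range(len(clusters))])
n = len(clusters)
jackknife_var = (n - 1) / n * ((pseudo - pseudo.mean()) ** 2).sum()

print(f"estimate = {full_estimate:.3f}, jackknife SE = {np.sqrt(jackknife_var):.3f}")
```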

Subject area experts liked the original item-by-item reports, but such detailed reports were quickly found to be unsatisfactory for communicating with policy-makers and the public. When someone asked ‘how are kids doing’, she did not want two hundred answers, one for each item—especially if the same general message was being repeated for most of the items. Beginning around 1974, reports began to provide results in terms of average performance over clusters of related items.

5.1.4 Measuring Trends

Once reporting was organized around average percents of correct response across clusters of tasks, these clusters naturally became the basis of comparisons across time and across age groups. The disadvantage was that the groups of items in common across years were not selected purposefully to this end; they varied in number and content coverage, and constituted only haphazard and unrelated reporting scales. For example, percents-correct for 13-year olds might be higher than those for 17-year olds, simply because the 13-year olds’ common items happened to be among the easier ones they were administered. Moreover, the release of 1/4 of the items with each assessment cycle meant that fewer and fewer items were available for comparing performance over time. In short, this method of comparing achievement over time had not been planned. It arose as an ad hoc response to a mission whose importance grew over time, but for which the design was not well suited.

5.2 The 1984 Redesign

After having started out with a clean design in the early 1970’s, satisfactorily addressing the perceived needs of the time with the technologies of the time, NAEP became increasingly unwieldy over the years as expectations changed. Ad hoc procedures (such as trend reports on clusters of items) had been introduced to meet new expectations as well as possible, but even so dissatisfaction was increasing. The competition for a redesigned NAEP in 1984 led to a contract in which many of the features of the current configuration originated (see Messick, Beaton, & Lord, 1983). New priorities were recognized, and the new design was introduced to reflect different balancing of recognized tradeoffs.

5.2.1 Student-Sampling

Recognizing the fact that schooling in the US was organized mainly by grades, the 1984 redesign introduced concurrent age and grade sampling. The sampled grades were the ‘modal’ grades associated with ages: Grade 4 with Age 9, Grade 8 with Age 13, and Grade 11 with Age 17. A given administration scheme and set of test booklets was used in each grade/age combination. Advantages of this extension of NAEP included the maintenance of age-based surveys for trend and international comparisons, and the availability of grade-based surveys for increased policy relevance. Disadvantages included increased fieldwork; additional complexities in sampling, weighting, administration, and data structures; and dual analyses and possibilities of analytic errors.

5.2.2 Item-Sampling

A more complex version of multiple-matrix item-sampling was introduced in 1984: BIB-spiraling. Booklets were organized around three blocks of items, and blocks were combined across booklets so that each block would appear at least once with every other block. This Balanced Incomplete Block design is the ‘BIB’ in BIB-spiraling. The motivation for this innovation was to support the construction of response scales for performance; that is, to support reporting in terms of what individual students do on collections of items, rather than simply average performance per item. Specifically, scaling by means of item response theory (IRT) models was introduced. This scaling allowed consideration of distributions of performance, setting the stage for achievement level reporting. The trade-off was increased complexity of analysis.
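
The combinatorial idea behind the ‘BIB’ can be illustrated with a standard textbook design in which seven blocks are arranged into booklets of three so that every pair of blocks appears together exactly once. This is only an illustration of the balance property, not the actual NAEP block-and-booklet layout.

```python
# The balance property behind 'BIB': a (7, 3, 1) balanced incomplete block design,
# a textbook layout used here only to illustrate the idea, not NAEP's actual one.
from itertools import combinations

booklets = [(1, 2, 3), (1, 4, 5), (1, 6, 7),
            (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6)]

pair_counts = {}
for booklet in booklets:
    for pair in combinations(sorted(booklet), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

assert len(pair_counts) == len(list(combinations(range(1, 8), 2)))  # all 21 pairs occur
assert all(count == 1 for count in pair_counts.values())            # each exactly once
print(f"every pair of the 7 blocks appears together exactly once across {len(booklets)} booklets")
```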

The ‘spiral’ in BIB-spiraling involved administering different booklets to students in a given testing session—spiraling through the entire set of booklets rather than having every student in a session paced through the same booklet. These spiraled booklets were constructed so that each block was allotted the same amount of time, and students allocated their time as they chose within each block. Advantages of this procedure included easier logistics, since audiotape equipment and its vicissitudes were no longer a factor, and more efficient sampling in one sense, since the testing-session-by-booklet clustering effect was eliminated. However, additional uncertainties (sources of variance) were introduced into item-level results: larger position effects, more omits, and more dependence on students’ varying time-management skills. National estimates of item-level performance were less trustworthy, since the percent-correct now depended materially on whether an item happened to appear near the beginning or the end of a block.(3) Blocks of items, rather than individual items, became the fungible units of interpretation.

5.2.3 Plausible values

A major feature of the current NAEP configuration is the use of ‘plausible values’ methodology to estimate distributions of students’ proficiencies. It is noteworthy that this methodology was an unintended consequence of the redesign. Specifically, it became necessary as a consequence of: (1) placing a high priority on student-based, rather than item-based, reporting; (2) collecting data in booklets constructed to far looser constraints than are employed in typical student testing programs; and (3) finding that complex analyses had to be invented to meet these missions with the data that had been collected.

IRT scaling was introduced to enable comparisons of group-level performance across distinct years, ages, and booklets. By the early 1980’s, established scaling methods were available to do this with reasonable speed, stability, and expenditure. Item parameters were estimated to establish a common scale; IRT ability estimates were produced for each examinee; and subsequent analyses could be carried out across non-identical item sets. In particular, the IRT calibration and scoring were completely separate from the preparation of, and analyses concerning, any other background or instructional variables whose relationships with performance were of interest. Subgroup distributions and secondary analyses could then be carried out with these estimates for individual students. It was therefore planned to carry out the analysis of NAEP data using this approach—a relatively simple analysis, under which the following elements all appeared to be in concert: the data, the analysis, the commitments, the PERT chart, the requisite resources, and the long-term stability of the approach.

But this approach failed in the 1984 assessment. Although NAEP’s sparse matrix-sampling design was extremely efficient for obtaining information about population characteristics, it did not support response-data-only IRT ability estimates for individual students upon which suitable analyses could then be based. Vastly differing mixes across booklets of content, difficulty, test length, and timing further impaired the approach, since varying measurement error distributions caused distortions in individual-student score estimates that were larger than the true differences of interest, such as between sexes or regions. These form-to-form factors, it was realized, were explicitly managed in programs like the SAT and ACT. In such applications, IRT could indeed characterize and take advantage of patterns among students’ performances on different sets of items—but only because the test forms were sufficiently long and parallel. This was the crisis faced with the 1984 assessment:

 

  • Data had already been collected, in anticipation of an analysis of performance with a methodology that was familiar, expeditious, and totally separate from background variables and sampling design considerations;

  • Commitments as to the content and timeliness of results were based on this assumption; and

  • The anticipated analysis failed.

Two steps were taken to meet the crisis:

1. A conceptual framework was established for using IRT-based models to establish a common reporting metric and characterize population relationships from sparse matrix-sampling data (i.e., marginal estimation(4)). The reporting metric was a 0-500 scale based on the IRT ability. (See Section 6.1C for further discussion.)

2. A marginal-estimation analysis system was devised to deliver on a majority of the established commitments, using the data in hand. Specifically, the ‘plausible values’ approach, based on Rubin’s (1987) multiple imputation methods for handling missing data, was introduced to implement the concept of marginal estimation. The main idea was to estimate the joint distribution among IRT ‘true scores’ and student background variables, then produce pseudo-data sets from which these results could be reproduced.
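
To show what this means for an analyst, the sketch below computes a group statistic once per set of plausible values and combines the results with Rubin’s multiple-imputation rules: average the estimates, then add the between-set variance (inflated by 1 + 1/M) to the average sampling variance. The plausible values and sampling variances here are invented placeholders, not NAEP data.

```python
# Sketch of secondary analysis with plausible values and Rubin's (1987) rules.
# The plausible values and sampling variances below are invented placeholders.
import numpy as np

rng = np.random.default_rng(3)
n_students, M = 500, 5
plausible_values = rng.normal(250, 35, size=(M, n_students))  # M sets of hypothetical PVs
sampling_var = np.full(M, 2.0 ** 2)   # hypothetical sampling variance of the group mean,
                                      # one per PV set (in practice from the jackknife)

estimates = plausible_values.mean(axis=1)   # the statistic of interest, once per PV set
point = estimates.mean()
within = sampling_var.mean()                # average sampling variance
between = estimates.var(ddof=1)             # variance due to uncertainty about proficiency
total_var = within + (1 + 1 / M) * between

print(f"group mean = {point:.1f}, SE = {np.sqrt(total_var):.1f}")
```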

The commitment to provide a secondary-user data tape merits special note. It had originally been promised to provide users with a NAEP data file including IRT ability estimates as well as background information for each student, from which any analyses could be carried out. This would have been both easy to do and satisfactory for secondary analyses had the anticipated analyses worked as planned. They did not. User tapes containing plausible values were produced instead. For secondary analyses involving a given variable to yield satisfactory results, that variable had to be included in the construction of the plausible values (i.e., included in the ‘conditioning model’). Were it not for the mission of producing this specific type of data file for secondary users (a tape that could be used more or less as if the original plans had worked), the imputation procedures could have been avoided: alternative, somewhat simpler, marginal analyses could have been carried out for given inferences without first having to condition on the full set of background variables. As it was, all student background variables had to be cleaned, processed, and, when required, collated before the analyses for initial reports were carried out. It is important to note that this requirement placed on the critical path to initial reports the most problematic background variables (especially untested new or revised self-report items, and those from teacher surveys, which involved complex file matching).

The commitments were largely met, but at the cost of severe and unanticipated mismatches among the analysis, the commitments, the PERT chart, the requisite resources, and the stability of the configuration. The new analysis procedures made it possible to bring together results from booklets of different lengths, difficulties, and compositions, although there were limits to how far even this approach could be pushed. The analysis, however, was more complex, more dependent on models, and unfamiliar (and therefore suspect) even among the educational measurement community.

5.2.4 Trend Analysis

Once the marginal analyses described above had been devised, the IRT model and distributional estimation methods were applied to map the historical stream of pre-1984 reading data into the 0-500 scale. Stable and credible results were attained, which echoed, amplified, and made more comprehensible the cross-year and cross-age results that had been reported in the past in terms of average percents-correct. The procedure worked well despite the widely varying design configurations in past assessments. In retrospect, it was realized that paced administration helped satisfy IRT assumptions by reducing context effects.

This historical trend was necessarily based on historical data, under the content frameworks as they had been developed and operationalized (e.g., audiotape paced administration, definition of age cohorts, booklet designs). These same procedures were used to collect data in 1984 concurrently with the new BIB-spiral administration, and this bridge was used to set the baseline for what was anticipated to be the start of a new trend line based on new procedures. No new trend line ever materialized. Changes in frameworks, item specifications, definitions, and administration conditions were introduced every one or two assessment cycles so that a consistently defined metric could never be established for solid comparisons over time for more than two assessments.

5.3 The 1986 Reading Anomaly

The ‘Reading Anomaly’ refers to results from the 1986 assessment that showed declines from 1984 levels that were much greater than changes in any four-, five-, or six-year period in the previous history of NAEP. It is not that the changes were large in absolute terms; at Age 17, the most startling, they were only about 3 points in terms of average percents correct. But this was large in relation to the target of inference, namely population changes, because population changes are very small over short time periods. Subsequent investigations (Beaton & Zwick, 1990) showed that the anomaly could be traced to seemingly inconsequential changes in booklet configuration, administration procedures, item context, and post-stratification procedures—each one designed to provide better information—yet which together made it impossible to compare results across assessments. Beaton and Zwick offered the moral, "When measuring change, do not change the measure."

This experience provides the sobering realization of the severe constraints on population definition, booklet design, administration protocol, and analytic procedures that ensue when the mission of NAEP includes tracking change over time. Hundreds of seemingly small facets of the configuration could be tweaked to produce modest improvements in estimation within one time point, but these changes could have a greater impact on the overall results than actual differences over a two-year period in what students know and can do.

5.4 Some Changes Since the Anomaly

5.4.1 ‘The only constant is change’

Changes in frameworks, item specifications, the time of year of testing, age definitions, exclusion rules, and so on, have been the rule in NAEP. It has sometimes been possible to jointly administer the assessment under both the previous and new versions of each change and to estimate an adjustment factor that takes into account the average effect of the change. Even when such adjustments appear to have been successful, the average impact need not be representative of the impact on different demographic or curricular groups. Thus, results concerning such variables have an added source of variance that is not accounted for in standard errors of estimation. Higher rates of ‘false positives’ of significant differences result. And, even when adjustments appear to have been successful, additional time has been added to the critical path for attempting to estimate the effect of the change, to determine whether it can be adjusted for, and possibly to invent a way to make the adjustment. Sometimes the impact of change is large enough to make it impossible to directly compare results from one assessment to the next. In all the years of NAEP assessments since 1984, it has only happened once that results from three successive assessments have been comparable (1994 Mathematics). Typically only two in a row are comparable before revisions sufficiently large to obviate the continuation of a ‘clean’ trend line are introduced. ‘One in a row’ is not uncommon.

The trade-off involved with continuous change is a victory of local optimization over global optimization. The intended advantage is improvement each time, in the sense of honing the data collection to the mission as perceived by content-area committees or other stakeholders. The negative effects are that (1) there are almost always some new wrinkles in each assessment, so there are almost always some first-time glitches and slow reporting; and (2) no new trend line has ever gotten started and remained in place long enough to become fast and reliable. An interesting unintended positive effect has been the continuation of the so-called ‘long-term trend’ assessments in Reading, Mathematics, and Science, which still use definitions, booklets, and administration procedures from the 1970’s and early 1980’s. Procedures for long-term trend have been refined and honed, so the analyses needed for what would correspond to the ‘standard report card’ in NAGB’s Themes and Issues can be carried out reliably and quickly—within 3 months after receipt of data. This observation supports our recommendations in Section 7 for ways to design a standard NAEP assessment configuration that can yield initial reports within six months after the completion of testing.

5.4.2 Grade 12 Reporting

The 1984 grade/age combinations included grades 4, 8, and 11. It was noticed that if a given subject area were assessed every four years, NAEP could track and compare cohort effects if grades four years apart were surveyed. The 1986 assessment therefore surveyed grades 3, 7, and 11. Grade 3 proved problematic, as students at this age exhibited considerable difficulties dealing with the assessment. Moreover, interest was expressed by various stakeholders in surveying progress at key transition points in the educational system: Grades 4, 8, and 12. These became the surveyed grades beginning in 1988. Probably the most serious disadvantage in this shift is the lack of motivation among Grade 12 students, as evidenced in field observers’ notes, students’ self-reports, and omit patterns in data (see Section 6.1G).

5.4.3 State-Level Reporting

The Alexander-James Report opened the door to reporting at the state level in the 1988 assessment. An obvious advantage is increased policy relevance: states are agencies with direct responsibility for education. A disadvantage was the increase in the size of NAEP by almost two orders of magnitude, adding considerable logistic and analytic challenges. Creative ways of addressing these challenges have been advanced, such as administration by local rather than contractor personnel, under contractor-directed training and sampled observation. Comparisons between state assessment samples and a specially selected ‘state assessment comparison’ subsample of the contractor-administered national NAEP have exhibited small but statistically significant differences. Another disadvantage in state-level reporting is the increased burden on states—especially small states, for which practically all schools are involved in NAEP every assessment cycle, and sparsely-populated states, for which logistic difficulties arise with small schools spread across wide geographic areas (see discussions in Section 6.1H and in the Appendix). Whether states and schools will be sufficiently motivated to maintain this effort over the long run is an open question.

An interesting example of a costly unintended consequence is the breakout of private school results in state NAEP assessments. The original plan was to report only public schools, since they have public responsibility, and it is generally easier to locate and gain cooperation from them than private schools. But states have different proportions and compositions of public/private schooling, so omitting private schools from the sample may bias inferences about how students in a state are faring compared to students in other states. Now a reliable unbiased estimate of students in all public schools in a state can be obtained with a sample of, say, 50 to 100 public schools. A reliable unbiased estimate of all schools in the state, public AND private, can be obtained with perhaps an additional subsample of, say, 10 private schools. However, 10 schools is NOT sufficient for a reliable estimate of all private school students, and obtaining unbiased estimates is difficult because non-Catholic private schools have a high rate of refusing to participate. The advantage of having state-level private-school results is mainly to avoid users simply subtracting the ‘public schools’ estimate from the ‘all schools’ estimate to get their own (highly unreliable) private-school estimate. Yet securing enough private schools to ground a reliable private school estimate could, in some states, require as much effort as the public school effort! (Keith Rust discusses this issue in his 5/8/96 memo to Mary Lyn Bourque; see Appendix to this report.)

5.5 Where Are We Now?

NAEP finds itself in much the same situation as it did in the days of the 1984 NAEP redesign competition: A clean design had been introduced a bit more than a decade previously, which responded to the needs of its time, but which changing desires, technologies, and political milieu appeared to push beyond its limits. Now as then, much has been learned from preceding work upon which to craft configurations better suited to the current situation.

6. Themes and Issues

This section discusses the objectives and recommendations from the National Assessment Governing Board’s "Themes and Issues" document. The discussion is organized point-by-point, with cross-references as appropriate. Further amplification of specific tradeoffs and issues appears here. The results are synthesized into a sketch of a feasible configuration in Section 7.

OBJECTIVE 1: To measure national and state progress toward the third National Education Goal and provide timely, fair, and accurate data about student achievement at the national level, among the states, and in comparison with other nations.

1A. Test all subjects specified by Congress: reading, writing, mathematics, science, history, geography, civics, the arts, foreign language, and economics.

 

  • The National Assessment should be conducted annually;

  • Reading, writing, mathematics, and science should be given priority, with testing in these subjects conducted according to a publicly released 10-year schedule adopted by the NAGB;

  • History, geography, the arts, civics, foreign language, and economics also should be tested on a reliable basis according to a publicly released schedule adopted by NAGB.

There are many ways to devise schedules that meet the stated goals. Although it is not up to the Design/Feasibility Team to say what subjects should be assessed or how often, we can say that if the tracking of trends is a primary goal, then there is a need to maintain considerable stability in the framework and assessment design for at least three administrations. This section offers three examples, not as recommendations, but as vehicles to illustrate issues and tradeoffs. It will be noted that all of these examples show a lot more assessment going on than under the current design, with more subjects assessed and more frequent assessment. This will be feasible and affordable only if subjects which are not being assessed comprehensively are kept relatively simple and exhibit minimal changes (the nature of which is discussed further in Section 6.1B). We assume that "comprehensive assessments" coincide with revised subject-area frameworks (Section 6.2A).

Example 1

Table 6.1A-1 is an example with two core subjects, Math and Reading, which are assessed biennially, and eight other subjects, which are assessed two or three times during each ten-year period. Three subjects are assessed each year, although field testing of new items from other subjects can take place in any year as required. Math and Reading are given highest priority in this example since there seems to be no argument from any quarter (educators, policy makers, and parents) that these two subjects are critical for students’ success. The example supposes that after Math and Reading, the four subjects with next priority are Science, U.S. History, Writing, and Geography. These subjects are grouped into two pairs, which are assessed in the years between the Reading/Math years, so that each is assessed every four years. Three other subjects appear once every four years, and one subject is tested every five years.

Table 6.1A-1
Example 1 of Subject Area Assessment Cycles

         Subjects
Year     1            2            3
 1       Math*        Reading*     Civics
 2       Science      History      F. Language
 3       Math*        Reading*     Arts
 4       Writing      Geography    Economics
 5       Math*        Reading*     Civics
 6       Science      History      F. Language
 7       Math*        Reading*     Arts
 8       Writing      Geography    Economics
 9       Math*        Reading*     Civics
10       Science      History      F. Language
11       Math*        Reading*     Arts
etc.

Note: Core subjects marked with *

Example 2

Table 6.1A-2 is an example with four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. A ten-year period is illustrated, in which core subjects are assessed every three years and non-priority subjects every five years. This example assesses two or three of the main subjects each year, with the possibility of augmenting a two-subject year with a special assessment.

 

Table 6.1A-2
Example 2 of Subject Area Assessment Cycles

                Type of Assessment
Year     Comprehensive    Standard             Special/Probe
 1       Math*            Writing*, History    -
 2       Reading*         Arts, Economics      -
 3       Science*         F. Language          Possible
 4       Writing*         Civics, Math*        -
 5       History          Reading*             Possible
 6       Geography        Science*             Possible
 7       Arts             Math*, Writing*      -
 8       Economics        Reading*             Possible
 9       F. Language      Science*, History    -
10       Math*, Civics    Writing*             -
etc.

Note: Core subjects marked with *

Compared with Example 1, twice as many core subjects are assessed. The tradeoff for more core subjects is assessing them only every three years. Three-year cycles for the core subjects may suffice for timely monitoring of slowly changing trends. Also, if a National Assessment is the basic NAEP to which states can attach themselves, a state interested in only, say, Reading and Math can hold its participation to only one subject every other year. Moreover, a more integrated process (framework development/item development/tryout/final "test"/reporting of results) could be achieved, with additional field-testing and such activities as DIF analyses and achievement level setting occurring in off years. A disadvantage of a three-year pattern for main assessments is that cohorts cannot be tracked.

Example 3

Table 6.1A-3 is a variation of Example 2, similar in that there are again four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. Also, special assessments can appear periodically. It differs in that (1) there is an eight-year rather than a ten-year pattern, (2) core subjects are assessed biennially, and (3) in order to achieve the foregoing increases in intensity of assessment, four subjects are assessed every year. States could elect to participate in four-year cycles for core subjects in order to reduce their costs and burdens.

 

Table 6.1A-3
Example 3 of Subject Area Assessment Cycles

                     Type of Assessment
Year     Comprehensive           Standard                        Special/Probe
 1       Math*                   Science*, History, Civics       -
 2       Reading*                Writing*, Arts                  Possible
 3       Science*                Math*, Geography, Economics     -
 4       Writing*                Reading*, F. Language           Possible
 5       History, Civics         Math*, Science*                 -
 6       Arts                    Reading*, Writing*              Possible
 7       Geography, Economics    Math*, Science*                 -
 8       F. Language             Reading*, Writing*              Possible
etc.

Note: Core subjects marked with *

1B. Vary the amount of detail in testing and in reporting results.

 

  • National Assessment testing and reporting should vary, using standard report cards most frequently and comprehensive reporting in selected subjects about every ten years;

  • National Assessment results should be timely, with the goal being to release results within 6 months of the completion of testing.

The notion of decreasing returns plays a role in deciding how comprehensive data-gathering and reporting should be. For the sake of argument, suppose that a typical main assessment under the current configuration supports a thousand inferences, in the way of distributions at achievement levels, comparisons among subgroups, levels of background variables, and associations between background variables and performance. This is a lot to learn the first time it is done, and there will be several "surprises"—leads for following up with additional or different kinds of research. The second time the same survey is conducted, however, most of these results will be essentially the same as they were two years before. Changes across time will not be measured precisely enough to detect change in most variables, except for large, well-measured ones. Collecting the same kind of data provides little additional information about the stories behind the surprises. All in all, the cost is about the same as the first time, but the informational value is far less. "Information per dollar" from the same survey continues to decrease over time (Boruch & Terhanian, 1996).

This phenomenon supports the notion of having only occasional comprehensive assessments, with performance and background variables rethought so we can surprise ourselves again. Between these periodic larger efforts, two complementary kinds of assessment can take place: (1) more modest and largely constant ‘standard’ assessments that report basic results and track major changes reliably and quickly; and (2) targeted assessments that dig more deeply into focused aspects of performance or correlates thereof, but off the critical path to standard reports. Targeted assessments can be costed out and designed separately, but administered jointly with the core assessment.

The decision to vary the intensity of assessments is not really a technical one. A key technical issue, however, comes from the assumption that comparisons across assessments varying in intensity are desired. This suggests the need to have the design provide the means for making such comparisons dependable. What size standard errors are acceptable? This question might be addressed in terms of the magnitude of changes that have been used to make the case that achievement is improving or declining—for example, the size of changes reported in the long-term trend analyses. There is also the issue of subgroup comparisons. The law says that NAEP should "include information on special groups, including, wherever feasible, information collected, cross-tabulated, analyzed, and reported by sex, race or ethnicity, and socioeconomic status." Thus, group comparisons and changes deemed important in past policy discussions of NAEP results might be used to set targets for standard errors, which in turn determine the sample size (more precisely, the allocation of sample sizes in a multi-stage sampling design, in which the number of PSUs is the dominant factor). We note that cutting back on non-cognitive background variables does not have to be accompanied by cutting back on the achievement items.
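
Purely as illustrative arithmetic (none of the inputs comes from NAEP), the sketch below works back from a target standard error for a subgroup difference to an effective sample size, then inflates it by an assumed design effect for the multi-stage sample.

```python
# Illustrative arithmetic only; none of these inputs comes from NAEP.
import math

target_se_of_difference = 2.0   # assumed target: SE of a subgroup difference, in scale points
score_sd = 35.0                 # assumed within-group standard deviation
deff = 2.5                      # assumed design effect of the multi-stage sample

# For two independent, equal-sized groups, SE(diff) = sd * sqrt(2 / n_eff).
n_eff_per_group = 2 * (score_sd / target_se_of_difference) ** 2
n_per_group = math.ceil(n_eff_per_group * deff)
print(f"effective n per group ~ {n_eff_per_group:.0f}; sampled n per group ~ {n_per_group}")
```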

Completion Date/Elapsed Time

One desideratum for the new National Assessment is that "results should be timely, with the goal being to release results within 6 months of the completion of testing." Following is a discussion of a sample NAEP schedule and some ballpark comparisons with other large scale testing programs conducted by states and commercial publishers.

The KPMG Peat Marwick-Mathtech review of NAEP (1996) describes the timeline for the 1994 NAEP Reading Report. The completion dates and elapsed time for the main activities can be summarized as follows:

Step   Task(s)                                      Completion   Months
0      Testing                                      4/01/94      -
1      Scoring & preliminary weights                7/30/94      4
2      DIF item review                              8/31/94      1
3      Scaling, conditioning, & weighting           12/8/94      3
4      Draft report                                 3/01/95      3
5      NCES-NAGB review/revision; final report      3/07/96      12
       Total                                                     23

These elapsed times can be compared to those experienced by operational state testing programs. For steps 1 to 3, the NAEP times, while not short, do not appear wildly out of line with other programs that involve similar types of work. This is not to say that time savings could not be attained in those steps, but that it would take detailed analysis to determine the amount of savings that could be expected from specific changes or improvements in the process. But those reductions, even if successful, cannot have too great an impact on the total elapsed time as it stands now. For example, if steps 1 to 3 were reduced by 25%, the total timeline would be reduced by 2 months. That amount of reduction is important if we are trying to get an 8-month process down to 6 months, but it is little help in getting a 23-month process down to 6 months. For a redesigned NAEP, an appropriate strategy for fine-tuning the timelines for the basic scoring and scaling process—beyond changing the process to minimize tasks on the critical path—is to make timely delivery of results an important factor in evaluating responses to the RFP and to negotiate with potential contractors.

The time spent on drafting and reviewing the NAEP report—15 months—was far and away the greatest contributor to the overall elapsed time. We grant that this amount of time for the review/revision step is longer than most—but a configuration that accommodates the possibility of this amount of time along the critical path clearly works against the objective of predictably fast reports. This time is far greater than that spent by most state testing programs to review and interpret results. Ballpark figures for states to conduct such reviews are 1 to 3 months. Many state programs do release detailed technical reports that can take many months to develop, but these reports are written after scores have been released to schools and the public. The obvious step to take to produce more timely results is to drastically change the nature and review cycle of the initial reports that are released.

For published multiple-choice achievement tests (e.g., CAT, CTBS, ITBS, MAT, SAT), publishers generally provide turn-around times of 2 to 4 weeks from the receipt of answer documents to the shipping of reports. This turnaround time is possible because all data collection, analysis, and scoring program quality assurance steps take place before the test is published. That is, when answer documents are received from customers, they are scanned and run through operational scoring programs that are known to be accurate. Secondary data analyses in score reports are limited to basic summary statistics. Also, except for algorithmically determined analyses (e.g., flagging of "significant" results), interpretations of the results are not provided.

Use of constructed-response or performance items adds to the turn-around time because of the hand scoring that must occur. The amount of additional time needed depends on the complexity of the scoring rubrics, the volume of scoring, and the scoring capacity. However, once the hand scoring is completed and merged, if necessary, with other machine-scanned (multiple-choice) data, processing of score reports proceeds with the same speed as for tests composed solely of multiple-choice items.

When a contractor develops a custom test, it is possible for the operational scoring of that test to proceed with the same speed as a published test, if all data collection, analysis, and quality assurance needed for the production of scoring tables and scoring systems has taken place before the operational testing occurs. If data from the operational testing are used to generate scoring tables (e.g., to create scales or local norms, or to link a new form to an existing scale), then additional time is needed for the analyses that generate the scoring tables and for final quality assurance. In order to save time, all specifications are in place and computer programs for scaling and scoring are checked out before operational data are available. To avoid reruns, it is critical that all rules for valid student responses and exclusions are unambiguously specified early in the process. If, in addition to scaling or equating analyses, item review and selection occurs, then more time is added to the process. All these principles can be considered in designing the new National Assessment to produce more timely results.

Contractors conduct many checks on results to make sure that they are accurate; however, for operational programs (as contrasted with pilot or tryout studies), contractors usually do not provide in-depth interpretation of results at the time of their release to the client. Most state departments or districts conduct their own checks on the reasonableness of the results and develop interpretations for release to school boards and the media. The time needed for these checks and interpretations obviously varies by program, but in general states do not provide in-depth analyses of their results anywhere near the magnitude provided by the First Look NAEP reports. The more thorough the analyses and the more extensive the review and approval process, the longer the delay in releasing results. As stated earlier, for operational testing programs, states normally take no more than 1 to 3 months for conducting secondary analyses, reviewing, and interpreting results.

These principles are incorporated in the ‘feasible design’ configuration sketched in Section 7. We focus there on putting out a ‘standard report’ which provides essentially numerical, graphical, and tabular results, and the information necessary to parse those results (which can be prepared before the data are in hand)—all of which can be accomplished within six months of the end of data collection once a system has been in place for two or three assessment cycles. If more extensive interpretive and contextual analyses are desired, it should be routinely possible, based on the experience of state testing programs, to release such a report within twelve months of the end of data collection.

As a final comment, we highlight the fact that complex analyses (done once) do not, in and of themselves, preclude rapid turnaround. ‘Review & revision’ was the bottleneck in the example Peat-Marwick studied. Unless this step can be controlled, a six-month turnaround cannot be realized.

1C. Simplify the National Assessment design.

 

  • Options should be identified to simplify the design of the National Assessment and reduce reliance on conditioning, plausible values, and imputation to estimate group scores.

Analytic procedures under the current NAEP design are far more complicated than those of a classroom test, but far simpler than those of the space shuttle program. What is the right level of complexity? A maxim often attributed to Einstein holds that models should be as simple as possible, but no simpler. In NAEP, this means that procedures should be only as complex as necessary to meet its missions, within constraints and resources consonant with those missions.

Section 5 traced the NAEP design as it evolved in response to expectations and constraints, with Section 5.2.3 focusing on conditioning, plausible values, and imputation procedures in the current NAEP design. After a brief recap, we consider the combination of missions, constraints, and resources these procedures arose to serve, and how changing them would allow for simpler procedures. We discuss configurations that eliminate the need for all forms of marginal estimation, and note how they strain other desiderata in the Themes and Issues. We then consider how one might simultaneously (1) refocus the NAEP mission in line with the Themes and Issues, and (2) accordingly restructure its design to exploit the advantages of marginal estimation(5) while minimizing the complexities and instabilities associated with the current configuration.

Where are we now, and how did we get here?

The design for the 1984 assessment proposed an appealing tradeoff: A modest amount of additional complexity (in the form of IRT modeling and estimation of IRT ability scores for individual students) would provide substantial benefits such as (1) a common reporting scale over time despite design changes, (2) a scale that could be interpreted in terms of what students were generally able to do, (3) the capability to estimate distributions of student proficiency (not possible under item-level reporting), and (4) secondary user data tapes which would support any and all subsequent analyses on the IRT scale. Although it proved possible to fit IRT scales to NAEP data, it also turned out that the expectations regarding estimates of the effects NAEP was meant to report could not be met with the anticipated methodologies (see Section 5 for further historical perspective). A markedly more complex methodology that could meet most of the expectations was invented (on the critical path to the report of the 1984 survey, no less; see Chapters 9-11 in Beaton, 1987). This methodology was based on marginal estimation of student distributions (Lord, 1969; Mislevy, 1984, 1985), and the particular implementation used for NAEP employed conditioning, plausible values, and imputation procedures based on work by the statistician Donald Rubin (1987).

The original expectations concerning the proposed IRT procedures were based on experiences in the context of measuring and comparing individual students. These procedures failed to be sufficiently accurate for measuring the more subtle effects NAEP must address. In particular, point estimates of student IRT ability parameters, from domains and test forms as sparse and as varied as those of NAEP, simply failed to provide adequate estimates of trends, group results, and associations between proficiency and background variables. Each ability estimate has a certain amount of error or noise, which may be negligible for measuring or comparing individual students, but which accumulates to cause non-negligible biases in estimates of group distributions (e.g., proportions of students above the ‘Proficient’ cut point). Marginal analyses estimate group distributions directly from item responses, bypassing the problematic step of estimating abilities for each individual student. The approach appropriately handles the different test form lengths, numbers of items, and booklet difficulties that different students are administered within and across assessment years, as well as the varying amounts of information for students at different areas on the scale.
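
To make the point concrete, the following small simulation shows how error in individual scores, though harmless for comparing individuals, inflates an estimate of the percentage of students at or above a cut point. All of its numbers (the population distribution, the error standard deviations, and the cut point) are assumed purely for illustration and are not NAEP values.

# Illustrative simulation: measurement error in individual scores biases
# estimates of group statistics such as "percent at or above a cut point."
# All numbers here (population, error SDs, cut point) are assumed for
# illustration and do not reflect actual NAEP values.
import numpy as np

rng = np.random.default_rng(12345)
n_students = 100_000
true_ability = rng.normal(0.0, 1.0, n_students)   # true proficiency, N(0, 1)
cut_point = 1.0                                    # hypothetical 'Proficient' cut

true_pct = np.mean(true_ability >= cut_point) * 100
print(f"True percent at/above cut: {true_pct:.1f}%")

# Shorter, sparser forms imply a larger error SD in each student's estimate.
for error_sd in (0.6, 0.3, 0.1):                   # sparse booklet ... long test
    observed = true_ability + rng.normal(0.0, error_sd, n_students)
    obs_pct = np.mean(observed >= cut_point) * 100
    print(f"error SD {error_sd:.1f}: estimated percent = {obs_pct:.1f}% "
          f"(bias {obs_pct - true_pct:+.1f} points)")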

If only a few distributions are to be estimated, such as proficiency distributions with respect to values of, say, 20 background variables, the marginal analyses can be carried out directly and the results reported without student scores or anything that looks like them. If these targeted background variables were stable, this process need not be unduly burdensome, fragile, or time-consuming. But the expectations of reporting results on essentially all background variables and providing a public data tape with IRT ability estimates that a secondary user could analyze on the IRT scale in any way led to a further complexity that did increase the burden, lower the stability, and add time to the critical path. This step was the creation of files of multiple imputations, derived from a very large marginal analysis that included hundreds of effects of background variables, which would essentially reproduce the results of that analysis. The critical path to initial reports thus included the creation of ‘master’ files of multiple imputations that supported IRT analyses of not just basic, well-behaved background variables, but also ones that were new, poorly behaved, or required always-unpredictable file matching.

The current design and analysis configuration is thus an accident of history. Vastly increased expectations, formed in anticipation of simpler IRT analyses, are being largely satisfied, but only with more complex analyses than were originally intended. Locally optimal changes, such as introducing untested or substantial changes in test specifications into the critical path or postponing modeling and analysis decisions, must be compensated for in analyses that are more complex and less stable than anticipated, with no prospect of becoming less complex or more stable under the current configuration and organization of activities.

Can marginal analysis be totally eliminated?

Testing programs such as the SAT, the ACT, the Armed Services Vocational Aptitude Battery (ASVAB), and the achievement tests operated by private testing agencies do not use marginal analysis, or the attendant conditioning, plausible values, or multiple imputations procedures. How do they avoid it? Mainly because each form of such a test is strongly parallel to every other form. They are clones with respect to content, format, test length, difficulty, and accuracy of measurement at different areas along the scale. NAEP could achieve this situation only by severely constraining the scope of content coverage and variety from form to form, and all but precluding changes in forms except at periodic intervals when the framework is changed. Distributions of observed scores could be compared and tracked, since with parallel forms there is no problem of integrating information from forms with different measurement properties—the problem marginal analyses were introduced to address. But this highly constrained test specification would subvert the goal of broad content coverage, costing NAEP credibility as a nationally representative survey of achievement in a subject area domain.

While in theory tests must be closely parallel to produce equivalent score distributions, in practice it is an empirical question how parallel tests need to be to produce a desired level of consistency. A generalizability study, in which the consistency of results is examined for given changes in test forms, can be conducted to determine the practical effects of particular test designs (for example, Yen, 1995). Thus, while theory supports the use of either closely parallel test forms or marginal analyses to produce consistent results, appropriate generalizability studies could show that greater degrees of freedom are available in the design of the new National Assessment.

An alternative route to eliminating marginal analysis is to employ a weaker form of achievement level reporting. The current configuration requires a student-level focus; i.e., how well a student would fare on a number of items in a subject area. Absent the parallel forms described above, this requires some kind of scaling and some kind of marginal analysis to estimate distributions of student-level scores, and proportions of students above various proficiency cut points. But an item-level focus, hearkening back to pre-1984 NAEP, would simply examine performance item by item: Performance of 80% correct is desired for this item, say, but observed performance is 45%. No complex scaling is needed, and test design is unfettered. This is a meager kind of achievement level reporting though, and probably not satisfactory from NAGB’s point of view.

Still another route would be to attempt to obtain precise scores for every sampled student, to see if sufficient accuracy could be gained to realize the initial hope of basing all analyses and user tapes on simple ability estimates for every student. Longer tests or computerized adaptive testing (CAT) would be vehicles for accumulating the additional information for each student:

 

  • The former route (long tests) can introduce non-negligible fatigue and/or motivation effects that depress performance as the administration continues. Johnson et al. (1996), for example, found that performance in the latter portion of double-length NAEP booklets was lower than that observed in the first portion by amounts on the order of typical changes across assessment cycles—once again, not a large effect in absolute terms, but large in terms of the effects NAEP is meant to assess.

  • CAT (or something achieving the same ends, such as adaptive data-gathering over the Internet) is likely to appear eventually in NAEP. Such methods are already in place in related contexts such as household survey research, inventory control, and many licensure and admissions tests. Use of CAT virtually compels NAEP to use some kind of scaling method, such as IRT. The ability to present students with items that are neither overly difficult nor overly easy for them would enhance the stability of NAEP results considerably, although there will still be varying amounts of information associated with different examinees. Even CAT algorithms designed to obtain the same amount of ‘apparent’ precision from all examinees only approximate this result. The resulting point estimates might suffice for a wide variety of secondary analyses, and thus support a secondary user tape much like the one envisioned for 1984—but marginal analyses would still be preferred for ‘official’ NAEP results. They would not be exactly replicable from the ability estimates, and the results obtained from the ability estimates would be inferior.

The prospect of longer tests or CAT being used to totally eliminate marginal analyses is thus bleak, but the prospect of CAT to substantially "reduce reliance on conditioning, plausible values, and imputation to estimate group scores" is quite encouraging: Better data for each student would make marginal estimation procedures, when required, more stable, and less dependent on modeling assumptions.
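
As an indication of what a scaling-based adaptive design involves, the sketch below implements a toy CAT loop under a Rasch model: it selects each next item to maximize information at the provisional ability estimate and updates a small grid posterior after each response. The item pool, prior, and test length are invented for the illustration; this is a sketch of the general idea, not a proposal for an operational NAEP delivery system.

# Toy computerized adaptive testing (CAT) loop under a Rasch model: administer
# the unused item with maximum Fisher information at the current ability
# estimate, score the simulated response, and update a grid posterior.
# Purely illustrative; an operational CAT would add exposure control,
# content balancing, stopping rules, and a richer IRT model.
import numpy as np

rng = np.random.default_rng(1)
pool_difficulty = rng.uniform(-2.5, 2.5, 60)   # hypothetical Rasch item pool
theta_true = 0.8                               # simulated examinee
grid = np.linspace(-4, 4, 81)
log_post = -0.5 * grid**2                      # log of a N(0, 1) prior

administered = []
for step in range(15):
    theta_hat = grid[np.argmax(log_post)]      # current posterior-mode estimate
    unused = [i for i in range(len(pool_difficulty)) if i not in administered]
    p_hat = 1 / (1 + np.exp(-(theta_hat - pool_difficulty[unused])))
    next_item = unused[int(np.argmax(p_hat * (1 - p_hat)))]   # max Rasch information
    administered.append(next_item)

    # Simulate the response and update the posterior over the grid
    b = pool_difficulty[next_item]
    correct = rng.random() < 1 / (1 + np.exp(-(theta_true - b)))
    p_grid = 1 / (1 + np.exp(-(grid - b)))
    log_post += np.log(p_grid) if correct else np.log(1 - p_grid)

print("difficulties administered:", np.round(pool_difficulty[administered], 2))
print("final ability estimate:", round(float(grid[np.argmax(log_post)]), 2))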

The final route noted here eliminates marginal analysis from only one portion of analysis and reporting, but in a way that has advantages for rapid reporting and linkage to other assessment efforts: it involves ‘marketbasket reporting’ with parallel collections of tasks, each of which can be administered to a student in a typical testing session. (See Section 6.1I for additional discussion of marketbasket reporting.) If the observed score on one such marketbasket is designated as the ‘official’ reporting metric, and parallel forms are linked to it by standard equating methods, then distributions of scores in this metric are immediately on the correct reporting metric. Disadvantages associated with this approach include limitations on the comprehensiveness of coverage of the score, foregoing of information from all other test forms, and the increased complexity of mapping information from other forms to this scale. In particular, marginal analysis is required for the other forms, and an additional step is required to project information from them to the official reporting scale.

An Alternative

We have noted that marginal analysis can be fast, stable, and invisible to the user of results, if (1) the results it is used to support are basic reports on a limited (say 20-40) set of predefined, previously used and tested background variables that require no complex file matching, and (2) no file of multiple imputations is simultaneously required to support any other analyses with any other background variables. This is essentially the current situation with long-term trend, for which analyses necessary to support a ‘standard report’ could generally be completed within three months after the data are received if attention were turned to them first. Subsequent reports, incorporating wider varieties of analyses, including more background variables, and offering additional interpretive discussion, could follow off the critical path to the standard report.

This proposal bears similarities to the idea of ‘two-tiered reporting’ that the Advisory Council on Education Statistics advised against (ACES, April 1996). Two-tiered reporting, as considered by NAGB and the NAEP Design and Analysis Committee, would consist of an initial report based on a marginal analysis including only a predefined core set of well-behaved background variables—producing its own set of multiple imputations—while subsequent reports would be based on a second set of multiple imputations like the single one now created, which would include essentially all variables in a fairly large model. It was noted that estimates for quantities reported in the initial report would differ when estimated from this second set of plausible values. ACES judged that having ‘initial results’ and ‘corrected results’ would be confusing and undesirable.

The ACES position is quite reasonable if one looks at the two sets of results for core variables as ‘a quick bad estimate’ and a subsequent ‘careful good estimate’—a perspective encouraged by the release of the second multiple imputations tape under the presumption that it supports the ‘best’ answers to any analysis a secondary user would want to run. But this misses the point that the second set is still not ‘the best’ for any given inference. A user with a particular model or inference in mind could custom-build her own marginal analysis and get a better result than the second tape would provide (usually not much better, and quite possibly worse unless she had a strong statistical background). The first marginal analysis we propose would provide quick good estimates of key results, using the relevant data and a robust and justifiable model. Subsequent analyses would focus on finer-grained questions, include additional data from other background variables, and often employ a different form of model. They could also, if desired, provide estimates for the same key results as initially reported—also good, if done right; not identical, better in some ways but not as good in others, and generally negligibly different from the initial ones.

The first round of analysis would generally not provide results for background variables outside the initial run that are as good as those from the current single plausible values file, and the tradeoff we suggest does not promise them. Rather than providing a ‘once and best’ file of plausible values from which one could carry out any analysis—including exactly reproducing the ‘official’ estimates of key initial results—only the raw data and procedures by which a secondary analyst could fit marginal models and estimate effects of interest would be provided. Secondary users could carry out analyses that included variables not in the standard report, and forms of models not generally fit in the course of official standard NAEP reports—and NAEP would no longer take the responsibility of creating imputation sets to support such analyses. The secondary analyst could, if desired, fit the official standard model to the official standard reporting variables in order to replicate official standard results.

Remarks

There are clear advantages in simplifying the procedures used in the National Assessment in terms of ease of explanation and credibility. However, in a program like the National Assessment that "serves so many masters," it is unlikely that the procedures used can be so simplified that the general public can fully understand all of them. Furthermore, it is to be expected in any long-term testing program that changes and problems will inevitably occur for which effective technical solutions must be devised. Thus, it is disadvantageous to preclude the use of any particular technical procedure in the design of the new National Assessment. At the same time, in order to obtain creative and competitive bids from potential contractors, the RFP should avoid mandating any particular technical procedures. The RFP should use simplicity of explanation, cost, and efficient timelines as a partial list of the evaluation criteria for the proposed technical procedures, and different bidders can propose different solutions. The approach we have sketched above is a kind of ‘existence proof’, indicating how innovative analytic procedures that appear in the current configuration of NAEP could be used to better advantage in a simpler configuration. It is likely that others can propose alternative procedures that are also technically appropriate.

1D. Simplify the way the National Assessment reports trends in student achievement.

 

  • A carefully planned transition should be developed to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program;

  • As a part of the transition, NAGB will review the tests now used to monitor long-term trends in reading, writing, mathematics, and science to determine whether and how they might be used now that new tests and performance standards have been developed during the 1990’s for the main National Assessment. NAGB will decide how to continue the present long-term trend assessments, how often they would be used, and how the results would be reported.

The ‘P’ in NAEP stands for ‘Progress’—which, by definition, implies change. It has always been a high-priority mission of NAEP to track trends in "student achievement." But "student achievement" is not a well-defined, existing entity in and of itself. We can observe many things that students say or do that we think are related to what they know and can do, but only by specifying particular ways that we will organize and synthesize these observations can we start to "measure" student achievement. That is, we require an operational definition to "report student achievement." We require an operational definition that is the same at two points in time to "report change in student achievement," and an operational definition that is the same for three or more points in time to "report trends in student achievement."

Specifying an operational definition of student achievement means fixing certain facets of design, tasks, etc. Fixing these conditions makes accurate comparisons over time possible, but constrains the meaning of scores and makes them seem more precise than they really are. For example, one operational definition may show 60% Proficient in Year 1 and 63% in Year 2; a second, equally acceptable, operational definition may show 68% Proficient in Year 1 and 70% in Year 2. The methods agree in essence as to change, but differ more substantially about absolute proportions in either year. Choosing a configuration (‘fixing these facets’) and sticking with it enables us to measure change. Allowing them to vary subverts this effort, since seemingly minor variations in facets of test design and administration have more impact on absolute levels of performance than changes in educational results over the same period of time (see Section 5.3 on the 1986 "Reading Anomaly").

The NAEP "long-term trend" assessments have been run concurrently with main NAEP assessments for more than a decade now. They use definitions, booklets, and administration procedures, and reporting scales that date back as far as to the 1970’s. They are employed only because since that time, practically no operational definitions of proficiency and populations have remained the same long enough to institute a new trend line for more than one or two assessment cycles.

The "feasible configuration" sketched in Section 7 includes a kind of "punctuated equilibrium" approach for tracking trends. A major framework redefinition initiates a new metric for tracking trends, perhaps in terms of one or more marketbaskets of items (see Section 6.1I for more on marketbasket and other reporting metrics). This redefinition coincides with a "comprehensive" assessment in the subject area, and allows for stable trend comparisons in succeeding assessments of the subject which use the same framework and test specifications. When the next framework definition is carried out 8 to 12 years hence, the previous reporting metric will continue to be used for at least the next two assessment cycles in that subject (Figure 6.1D-1). The overlapping trend lines ensure the ability to identify and interpret changes in achievement across changes in the framework.

 

Figure 6.1D-1
Phasing-in Revised Reporting Metric

Depending on the extent to which what is important in a subject area changes, the new metric may prove incommensurate with the previous one even after simple adjustments, showing reversals among subgroups, different relationships between proficiency and background factors, even different directions of trends. It may be desirable, therefore, to include blocks of items from previous frameworks in a given subject-area assessment from time to time (e.g., the old "long-term trend" books), just to see how today’s students fare with tasks that were deemed important for students of the past.

1E. Use performance standards to report whether student achievement is "good enough."

 

  • The National Assessment should continue to report student achievement results based on performance standards.

Our comments on achievement-level reporting fall into two broad categories. The first concerns the technical implications of this desideratum for design features, in light of other desiderata. The second discusses achievement levels as operational definitions, which can help when evaluating tradeoffs in which they are involved.

Design Constraints Induced by Achievement-Level Reporting

Reporting results in terms of achievement relative to performance standards means providing statements such as "35% of the fourth grade students are at or above the Basic level." This implies the conceptual framework of a scale for integrating performances on individual items into projected student performance in some broader context—e.g., in terms of an IRT scale, or hypothesized performance in a marketbasket of particular tasks or a large domain of tasks. Such reporting is distinguished from, and more difficult to estimate than, the item-by-item results that characterized the first NAEP assessments. It is no longer possible to ignore associations among performance on different items, or to construct test forms without regard to content and format balance and measurement properties.

If all students, across all groups and occasions that are to be compared, are administered either the same form or strictly parallel forms, achievement-level reporting is not particularly difficult. One can track the proportion of students who get 600 or above on the SAT Verbal, for example, because all SAT Verbal tests are essentially the same. This is an easy tabulation of ‘observed score’ performance relative to an ‘observed score’ standard. There are, of course, technical considerations about form construction that impact the accuracy of estimating proportions of students at various levels. For example, if there is a very high cutpoint for ‘Advanced,’ the accuracy of the estimate of the proportion of students at or above this level depends critically on the number of difficult items. If there are no hard items, then the proportion of Advanced students is poorly estimated. If there are sufficient hard items but few Advanced students, then (1) an accurate estimate is obtained of just how few students there are, but (2) the hard items are administered to many students who have little chance of answering them correctly—and these students are presented fewer items they have a chance to respond to meaningfully. (CAT, or computerized adaptive testing, would mitigate this latter problem, but introduces added logistical complications and requires a scaling model.)

If students within or across assessments are administered different forms, however, the same underlying level of performance gives different results on the observed score metric. For example, if a measure of free-throw accuracy has only four attempts, maybe 15% of the students will make 100% of their attempts. Are they truly 100% accurate shooters? No. A ten-attempt contest might show only 4% make all their shots. A hundred-attempt contest would probably show none making all shots. The situation becomes more complex if some shots are from the free throw line, some from the corner, and others are lay-ups—and different students have different mixes of shots! A more complex method of summarizing performance, akin to IRT, would be required. In essence, it becomes necessary to define a hypothetical scale of performance that would be used to characterize shooting accuracy across these many possible situations, then figure out what evidence different numbers and locations of attempts convey about it. This is the ‘true-score’ scale—like the NAEP IRT scale. One could project from it to a hypothetical standard collection of shots—like a marketbasket scale. The analyses required to carry out these steps, to estimate the distribution of true math (or shooting) abilities and relate all the items (or shots) to it, are ‘marginal analyses.’
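
The arithmetic behind the free-throw analogy is easy to check. The short computation below assumes, purely for illustration, that shooters’ true accuracies are spread uniformly between 40% and 80%; the exact percentages therefore differ from those quoted above, but the pattern is the same: the fraction of shooters who appear perfectly accurate collapses as the number of attempts grows.

# Free-throw analogy in numbers: the share of shooters who make all of their
# attempts falls sharply as the number of attempts grows, even though true
# accuracy is unchanged. The accuracy distribution here is assumed purely for
# illustration.
import numpy as np

rng = np.random.default_rng(7)
n_shooters = 100_000
true_accuracy = rng.uniform(0.40, 0.80, n_shooters)   # hypothetical true accuracies

for n_attempts in (4, 10, 100):
    made_all = rng.binomial(n_attempts, true_accuracy) == n_attempts
    print(f"{n_attempts:>3} attempts: {100 * made_all.mean():5.1f}% appear perfectly accurate")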

The joint intention to report achievement level results and to administer heterogeneous booklets of tasks that cover a broad range of a subject area, therefore, probably requires some level of complexity in the form of marginal estimation. Not all of the marginal estimation procedures in the current configuration—conditioning and production of comprehensive plausible values files in particular—are necessary if only achievement level results for selected reporting categories are to be produced. Thus, achievement level reporting, even with heterogeneous forms, does not necessarily bring about the full complement of potential complexities, instabilities, and long time-lines that appear in portions of the current procedures. Sections 6.1B and 6.1C discuss further how these problems can be largely obviated.

Achievement Standards as Operational Definitions

Given a determination of a scale upon which to summarize performance, setting achievement standards is at heart a judgmental process. What is important to keep in mind is the nature of the data that NAEP can gather about students. For example, NAEP cannot determine how well a student has learned those particular things he or she has been working on, or gather evidence about performance on extended projects. Some of what we’d really like to know about students, if we could learn everything we wanted to determine whether a student were ‘really’ Advanced, is necessarily missing. (In Messick’s (1989) terminology, this is ‘construct under-representation’.) Moreover, NAEP data are collected under specific circumstances and formats, which influence motivation and opportunity to perform. These factors influence performance, even though we would prefer they did not. (In Messick’s terms, this is ‘construct-irrelevant variance’.)

How can we know the degree to which what we call ‘Advanced’ based on NAEP performance reflects what we would conclude if we were to observe students in their classrooms and in their lives for extended periods of time? The answer is validity studies—examinations of the relationship between NAEP performance and achievement levels as to what is captured, what is missed, how students who seem to be doing well in school perform on NAEP, how students who seem Advanced on NAEP perform in school, and so on. Such studies have begun to be carried out, and they are a necessary complement to the standards-setting process. Figure 6.1E-1 illustrates some of these relationships graphically.

Another aspect of achievement levels as operational definitions is that they may change over time. Not only does what students study change over time, but how well we would like them to do can change too. If standards-based reporting is to include ‘where we want to be’ as well as ‘where we have been,’ we must recognize that ‘where we want to be’ might be different ten years from now! It would appear sensible to keep the operational definitions of both proficiency in a subject area and achievement levels fixed for successive time points, in order to track change over time. Periodic re-structuring of both would then take place concurrently, perhaps timed with comprehensive assessments. We note in passing that market-basket reporting (Section 6.1I), with published representative market-baskets, would facilitate public discussion and debate of achievement levels.

 

Figure 6.1E-1
Achievement Level Setting Data, Operations, and Validity Connections

In sum, there is no end to the refinements that can be made in standard-setting procedures. Technical input of the kind we can provide can add only so much "science" to this judgmental, value-driven process. What technical procedures can do is to …

 

  • help provide judges with both intra-judge and inter-judge feedback about implications and interconnections among their decisions,

  • structure the process so as to be comprehensible to both the judges themselves and others who will use the cutpoints in the future,

  • characterize variance of opinion among judges, which is a component of uncertainty about statements such as "30% of the students perform at or above the ‘Proficient’ level in reading," and

  • design and carry out ‘validity’ studies to further illuminate the scope and range of the standards.

All of these kinds of activities have been pursued since NAGB instituted achievement level reporting, and, with experience and input from other parties (e.g., the National Academy of Education and the General Accounting Office), a defensible process is beginning to emerge. More aggressive validity studies are in order, coming from different perspectives, with different data sources, and from different agencies and stakeholders. Cronbach’s view of validation calls for a range of studies, each challenging the standards in a different way, to provide a more solid justification in the long run (for example, comparing the number of students obtaining scores of 3 or above on various AP tests with the number of students implied to be ‘Advanced’ in terms of the achievement level reports). After a point, which may well have already been reached in the current standard-setting process, more evidence of the same kind—i.e., more elaborate standard settings per se—is less valuable than evidence of these different kinds.

1F. Use international comparisons.

 

  • National Assessment test frameworks, test specifications, achievement levels, and data interpretations should take into account, where feasible, curricula, standards, and student performance in other nations;

  • The National Assessment should promote "linking" studies with international assessments.

International comparative studies are invaluable for providing contextual interpretation for achievement in any participating country (Mislevy, 1995), but they are fraught with difficult technical, sampling, translation/adaptation, and interpretation problems. We struggle with the technical problems of equating even for nearly parallel forms, but the problems of linking assessment data from an international context are considerably more difficult. For these reasons, perhaps the greatest benefits to NAEP from international studies will be qualitative rather than quantitative:

 

  • Subject area committees can examine frameworks, specifications, and curriculum analyses of international studies to glean ideas about what to assess and how to assess it.

  • Panels setting achievement levels can examine the comparative performance of students from other countries as one of many sources of information to ground their perceptions of what U.S. students "should" be able to do. (Judges must be clearly informed about the limitations of international comparative studies, including potential problems with differential motivation, curriculum mismatch, non-comparable samples, test administration differences, and translation/adaptation.)

  • Authors of NAEP interpretive reports and secondary researchers will be able to use international results to provide a broader context for analyzing NAEP data, as it will be possible to (qualitatively) compare the U.S. with other countries’ patterns of association between proficiency and background and schooling variables (e.g., Raudenbush & Willms, 1991).

We envisage two useful kinds of linking studies. First, samples of U.S. students participating in an international assessment can be administered intact blocks of NAEP items as well, so that the joint distributions of NAEP and the international performances can be estimated. It should not be expected that a one-to-one correspondence can be established between the two, but the understanding of the relationships will enrich interpretations and understandings of both. Second, international assessments may wish to embed blocks of NAEP items in their own surveys. Similar caveats apply. Both approaches, it will be noted, are facilitated by a ‘modular’ NAEP design, student-level marketbaskets of items (see Section 6.1I), simple ways of mapping performance onto a NAEP reporting metric, and the availability of NAEP booklets with fairly simple scoring demands (e.g., the ‘standard’ booklets described in Section 6.2B), composed of multiple-choice and short constructed-response tasks.

1G. Emphasize reporting for Grades 4, 8, and 12.

 

  • The National Assessment should continue to test in and report results for grades 4, 8, and 12; however, in selected subjects, one or more of these grades may not be tested;

  • Age-based testing and reporting should continue only to the extent necessary for international comparisons and for long-term trends, should NAGB decide to continue long-term trends in their current form;

  • Grade 12 results should be accompanied by clear, highlighted statements about school and student participation, student motivation, and cautions, where appropriate, about interpreting 12th grade achievement results;

  • The National Assessment should work to improve school and student participation rates and student motivations at grade 12.

Because US schools are organized by grades for the most part, grade-based assessment is considerably easier than age-based assessment with respect to specifying a sampling frame, identifying schools, and administering the assessment within schools. Grade-based reporting is also more policy-relevant. It seems quite reasonable to emphasize grade-based assessment in NAEP.

What are the tradeoffs? The most obvious is reduced comparability to international assessments. But since international assessments are non-periodic and idiosyncratic, it would be desirable for NAEP to carry out (and cost-out separately) age-based assessment only in conjunction with specific efforts. A second potential attraction of age-based assessment is that it can address a population not limited to schools; that is, it can include out-of-school as well as in-school 17-year olds. But this option has not been pursued by NAEP for some time anyway, because of the higher costs of household surveys compared to school surveys. Out-of-school testing is better suited to projects like the Young Adult Literacy Survey (YALS). Rather than having NAEP carry out such activities, it would be preferable to have NAEP item blocks embedded in these studies, so that selected results could be analyzed within the NAEP interpretive framework.

Another tradeoff does not appear to be a concern at present, but could become one if NAEP accrues higher stakes for states, schools, or districts: Retention and testing-exclusion policies can be manipulated to produce favorable results in grade-based reports. For example, if we want our school’s Grade 8 results to look good, we hold back low-performing 13-year olds. One can change the timing and the categories in which students appear under grade-based reporting more easily than under age-based reporting.

Focus on Grade 12

The following observations about Grade 12 assessment have considerable support:

 

  • Grade 12 testing is not a high priority to states (which is another reason to envision NAEP as a modular system, with a central national core that could also be the core of state assessments).

  • Participation rates are lower in grade 12 than in either grades 4 or 8, both at the school and student level.

  • Motivation is a problem. This is supported by field administrators’ reports, students’ survey responses, and empirical data from recent assessments (e.g., omits and not-reached rates).

  • International comparisons at grade 12 are more problematic than at the lower grades, due to substantial variation among nations with respect to which students at this age attend school and what they study. For example, in some countries to which comparisons with the US might be made, fewer than 20% of the students reach the 12th grade.

  • Special assessments (e.g., advanced math) are useful at grade 12.

Motivation

The two factors usually noted as ‘causing’ the motivation problem are time of year and lack of individual scores. NAEP is administered in February and March. Graduation for these students is only a few months away, so they will not view external tests without consequences as important. Individual scores are not provided with NAEP. Even if they were, it is doubtful that motivation would increase substantially, given the low stakes.

Two additional factors that may also affect the motivation of grade 12 students are test format and test content. Regarding format: it appears that relatively long constructed-response questions or extended-context theme blocks can affect the willingness of grade 12 students to provide their best, or even their typical, work. Omit rates of over 40% were observed for such items on the 1994 Geography assessment, for example. Regarding content: the target population for the grade 12 assessment is all students in grade 12. Given that considerable specialization occurs in the high school curriculum (e.g., in science, math, and the arts), what should be assessed as part of the grade 12 assessment in such curriculum areas? How this question is answered may affect motivation. For example, if 25% of the grade 12 students take a physics course, is it reasonable to ask questions on the science assessment that require background information typically acquired only in a physics course? If a grade 12 assessment contains "too much" material of this type, the motivation of most grade 12 students to perform at their typical level will surely be adversely affected.

It is obviously not very satisfactory to continue the trends of decreasing motivation and participation, and merely attach strong warning labels. There is little rationale for spending large amounts of money and energy to collect data that we acknowledge are not only substandard, but substandard in ways for which our analyses cannot compensate.

Switch to Grade 11?

One possible alternative is to switch the upper-level assessment to Grade 11, as was briefly done in the mid-1980s. This grade is equally amenable to international comparisons based on age cohorts, since there are about as many 17-year olds in grade 11 as there are in grade 12. (Which grade has more 17-year olds depends on how one defines ‘age 17’—calendar year, school year, age at time of testing). Field reports from those assessments suggest better motivation at Grade 11.

There are two main disadvantages to making this change. First is losing the ability to track cohorts in a given subject area in two- or four-year cycles, as they are assessed in grades 4, 8, and 12. This feature of NAEP has been used in some secondary analyses, but its usefulness is questionable if Grade 12 data are wanting. Second is losing an ostensible ‘exit measure.’ However, the exit examinations so important in many countries are high stakes to those students, and, importantly, are matched with the students’ curricula, more like Advanced Placement tests or New York’s Regents Exams in this country. Such exit exams are more specifically useful, but less generally comparable, than broad-range surveys like NAEP. It is a reasonable tradeoff for the Board to consider: Grade 11 NAEP as a fairly good index of performance near the end of secondary school, as opposed to Grade 12 NAEP as a fairly poor exit examination a year later.

Modifications to the Assessment

The team does not propose that NAEP provide individual student scores in the near future. If Grade 12 testing is continued, and continues to take place in February and March, what can be done to improve both the participation of schools and students, and the motivation of students? This question has no simple answer, but we note below some directions in which to work. The forthcoming RFP should request suggestions for improving participation and motivation, to stimulate a broad range of thinking.

Some improvement in motivation may be accomplished by modifying test content and test administration procedures. These ideas to increase school and student engagement also apply if NAEP switches to Grade 11.

1. Test Content. For those curriculum areas that have specialized courses it may be feasible for the framework committees to define a set of content/process specifications that is appropriate for all students, and a supplementary set of specifications at grade 12 that is appropriate for students taking the specialized courses. The latter might form distinct subscales or marketbaskets, although some smaller number of these tasks could appear in the ‘overall’ marketbasket if marketbasket-based reporting is employed. Such a system would fit well with the notion of special targeted testing at grade 12.

2. Test Administration Procedures. Some variations of the current administration procedures that might be considered are:

 

a. Administer the tests in very small groups (e.g., 5-10 students).

b. Have high school teachers administer the tests.

c. Use computer-assisted administration. (Experimentation with adaptive testing could begin with grade 12 students.)

3. Complementary Data-Gathering. Marketing researchers have bumped into the limits of what they can learn from standard surveys. As they ask large, representative samples of respondents more questions or more complex questions in ‘drop in from the sky’ format, they find that cooperation wanes and they learn less. These researchers have complemented survey research with focus group research: Much smaller groups of consumers are involved, but they have in-depth, less-structured discussions, extending sometimes a full day or more. What is learned is deeper, but not necessarily in forms amenable to simple statistical analyses. Motivation is high, because of the personal attention, the engaging nature of the interactions, and the feeling that one’s opinions really matter (and they usually get a good meal and a check as well).

The NAEP analog would be small samples of 12th graders who responded to standard NAEP blocks, but then were also presented more in-depth problems, engaged in think-aloud solutions, and encouraged to discuss, with administrators and peers, their views about the tasks and tasks’ relationships to their studies. The idea is to keep the core of the upper-level assessment simple and straightforward so as to avoid depressing motivation—focusing on what broad-based, drop-in-from-the-sky, surveys do best—while complementing this information with richer information from smaller samples of students under conditions that promote the higher levels of motivation needed for more ambitious work.

1H. National Assessment results for states.

 

  • National Assessment state-level assessments should be conducted on a reliable, predictable schedule according to a 10-year plan adopted by NAGB;

  • Reading, writing, mathematics, and science at grades 4 and 8 should be given priority for state-level testing;

  • Testing in other subjects and at grade 12 should be permitted at state option and cost;

  • Where possible, national results should be estimated from state samples in order to reduce burden on states, increase efficiency, and save costs.

National and State Samples and Data

The NAGB "Themes and Issues" draft policy statement recommended that "where possible, national results should be estimated from state samples in order to reduce the burden on states, increase efficiency and save costs". As is pointed out in a May 9, 1996 memorandum on sampling issues for redesign from Keith Rust to Mary Lyn Bourque (see Appendix), three approaches can be distinguished for obtaining a single national data set from which both state level and national estimates could be estimated. (1) Samples could be drawn for each state with those state samples supplemented as necessary to obtain an adequate national sample. (2) A national sample could be drawn and supplemented as necessary to obtain adequate state samples for states choosing to participate. (3) Two distinct samples (one national and one for participating states) could be drawn with minimal overlap of schools, but then the samples could be combined into a single data set for selected analysis and reporting purposes.

Any of these three approaches is technically feasible. However, the degree to which each approach would, in fact, accomplish NAGB’s goals of reducing the burden on states, increasing efficiency, or reducing costs depends on a combination of policy and assessment design decisions and on the level, certainty, and quality of state participation in each assessment. Approaches 1 and 2 face three major obstacles: (a) the equating required by differences in administration procedures, (b) differences in subjects or blocks of items for national and state assessments, and (c) issues related to state participation.

Differences in the administration procedures of NAEP for the national and state samples in each of the assessments since 1990 have required an equating of the two sets of assessment results. As is discussed in Keith Rust’s May 9 memorandum, equating requirements present a major obstacle that prevented the implementation of either approaches 1 or 2 in the 1996 assessment. One possible way of overcoming the obstacles presented by the equating requirement would be to use the same administration procedures for both state and national samples. Since administration by the contractor to both state and national samples, as has been the practice for the national samples, would mean a substantial increase in costs, the more feasible alternative would seem to be to use local administrators with monitoring of a subsample by the contractor as is now done in the state assessments. The degree of monitoring and degree of training could be increased, thereby increasing the consistency of administration activities beyond current practice in state NAEP. In considering this option it would be prudent to investigate the possible effects such a change might have on participation rates among schools selected for the national sample in states not participating in the state assessment.

The second obstacle in the past to approaches 1 and 2 is that the state assessment at a given grade is only a part of the national assessment. For example, the state assessment at grade 4 only involved mathematics whereas the national assessment at grade 4 involved both mathematics and science. Even in mathematics, the national sample at grade 4 involved assessment blocks (e.g., estimation or mathematics theme assessments) that were not part of the state assessment. Although components that are only applicable at the national level could be handled by a special national sample, such an approach would wipe out any gains in efficiency or cost that might otherwise have been achieved. Thus, for approaches 1 or 2 to be effective in achieving the goals of increased efficiency and reduced cost, the state and national assessments would need to be identical in terms of subjects assessed and the blocks of items included in the assessments to be used for any national reporting (although the national assessment could use additional blocks that were not included in the state assessments, such as extended writing blocks or ‘theme blocks’ in science).

The third major obstacle to approaches 1 and 2 is the dependence of the quality of the national results on last-minute withdrawals of states from the assessment and on the level of school participation within selected states. Although the first two obstacles can be overcome by policies under the control of NAGB and NCES, albeit at the expense of additional constraints on what can be included in the National Assessment and how the assessments will be administered, the third obstacle falls outside the control of NAGB or NCES. Consequently, we believe that it would be better to consider the third option: that is, maintaining separate state and national samples, but combining them to obtain better estimates at both the state and national levels, though mainly at the national level.

Rust notes in his May 9 memorandum that combining the samples would require the development and use of an additional set of weights. The additional work on weighting would need to be done concurrently with other operational activities such as scoring of assessment exercises so that this work would not slow down the reporting of results. Respondents to the NAEP RFP might be asked to propose and defend alternative sampling designs such as the use of states as strata that would take advantage of the combined state and national samples to obtain improved estimates for both states and the nation.
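
The general principle behind such combined estimates can be illustrated schematically: two estimates of the same quantity can be combined with weights inversely proportional to their sampling variances. The sketch below illustrates only that principle, under an assumption of independent samples; it is not the weighting methodology described in the Rust memorandum, and all of the numbers are hypothetical.

# Schematic composite estimation: combine an estimate from the national sample
# with one from the aggregated state samples, weighting each inversely to its
# sampling variance. An illustration of the principle only (and it assumes the
# two samples are independent), not the operational NAEP weighting procedure.
def composite_estimate(est_a, se_a, est_b, se_b):
    """Inverse-variance weighted combination of two independent estimates."""
    w_a = 1.0 / se_a**2
    w_b = 1.0 / se_b**2
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_se = (w_a + w_b) ** -0.5
    return combined, combined_se

# Hypothetical numbers: a national-sample mean and an aggregated state-sample mean.
national = (275.0, 1.2)      # (mean scale score, standard error)
state_agg = (276.1, 0.8)
mean, se = composite_estimate(*national, *state_agg)
print(f"combined estimate: {mean:.1f} (SE {se:.2f})")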

For the reasons discussed in the Rust memorandum, it is unlikely that the burden on states will be greatly reduced by any of the three approaches. The key to burden is the number of schools per state. Although some reduction from the current sample of 100 schools would be possible without great loss, reducing the number below a lower bound of, say, 65-75 schools would undermine the quality of the results. An analysis of data from selected previous years’ state NAEP assessments would further quantify the tradeoff; i.e., in return for reducing burden by this magnitude, what decrease in precision of results would be incurred, especially with respect to breakdowns of the data (e.g., by racial/ethnic group)? More significant savings would accrue from conducting state assessments on a less frequent schedule than from carrying out frequent assessments with smaller samples of schools.
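
The shape of that tradeoff can be roughed out with the usual design-effect approximation for two-stage samples. In the sketch below, the student standard deviation, the within-school sample size, and the intraclass correlation are all assumed values chosen only to illustrate how the approximate standard error of a state mean grows as the number of schools shrinks.

# Rough illustration of the burden/precision tradeoff: approximate standard
# error of a state mean under two-stage (school, student) sampling, using the
# design-effect approximation deff = 1 + (m - 1) * icc.
# The student SD, within-school sample size, and intraclass correlation are
# assumed values, chosen only to show the shape of the tradeoff.
import math

student_sd = 35.0         # SD of student scale scores (assumed)
students_per_school = 25  # students assessed per school (assumed)
icc = 0.10                # intraclass correlation among schools (assumed)

deff = 1 + (students_per_school - 1) * icc
for n_schools in (100, 75, 65, 50):
    n_students = n_schools * students_per_school
    se = student_sd * math.sqrt(deff / n_students)
    print(f"{n_schools:>3} schools: approx SE of state mean = {se:.2f}")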

1I. Use innovations in measurement and reporting.

 

  • The National Assessment should assess the merits of advances related to technology and the measurement and reporting of student achievement;

  • Where warranted, the National Assessment should implement such advances in order to reduce costs and/or improve test administration, measurement, and reporting.

The following comments fall into two main subsections. The first concerns an issue of measurement and reporting, namely, market-basket and domain-referenced reporting. (Regardless of any approaches adopted for reporting, of course, field-testing of report forms for various audiences needs to be carried out and the findings used in preparing improved reports, scales, and interpretations.) The second concerns innovations more generally, offering comments on why and how they might be introduced into NAEP, and lists a number that occur to us as having some merit.

 

Market-Basket Reporting

This section describes a method for reporting assessment results that is comfortable for users who are familiar only with traditional test scores, yet allows for a bit more flexibility in assessment design than traditional test-score methods. It allows for the possibility of embedding parallel ‘market-baskets’ of items within more complex assessment designs. (The term "items" is meant here to include performance and constructed-response tasks as well as multiple-choice questions.) Results from market-basket forms would support faster and simpler, though less efficient, reporting, while information from broader ranges of items and data could be mapped into its scale using more complex statistical methods. Under some variations of the idea, released market-basket forms could be made available to embed in other projects with strengths and designs that complement NAEP’s.

Background: Linking the Results of Different Tests

Educational assessments, like educational tests, can present to each student only a handful of the items in a subject domain (among those which can be administered in the assessment setting). There are several reasons why different students are usually administered different sets of items. If the same items are always used in high-stakes tests, they become familiar and students’ responses no longer provide information about their capabilities as more broadly construed. If only a narrow set of items is used in large-scale surveys, no information is obtained about many other skills of interest. This is especially so with more complex performance tasks. Figuring out how to integrate information from different test forms is a perennial problem in assessment. A brief discussion of available approaches follows, focusing on issues that arise in NAEP (see Mislevy, 1992, & Linn, 1993, for more on linking).

Linking results from different test forms is not an issue when the same test form is always used for everyone, so that students’ observed scores all provide information in the same frame of reference. The most familiar way of creating a common frame of reference for different tests is EQUATING: Tests are carefully structured to have the same numbers of items and the same formats, to cover the same basic skills in about the same mix, to have about the same average difficulty and mix of hard and easy items, and the same timing and administration conditions. Total scores are nearly interchangeable in these ‘parallel tests’, and scores from any of them have about the same meaning and the same precision. Fairly straightforward statistical equating procedures make minor adjustments to match up score distributions more precisely, and a one-to-one correspondence table can be constructed for mapping scores from one form to another and, in doing so, adequately capturing all the information about students’ proficiencies that their performances conveyed. Note that it is not the equating formulas that make the scores comparable; it is the highly constrained way in which the tests are constructed.
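
The statistical adjustment itself can be quite simple. The sketch below shows linear equating, one common method, which places scores from one parallel form onto the scale of another by matching means and standard deviations. The score data are simulated, and an operational equating would add smoothing, sampling-design corrections, and a rounded conversion table.

# Minimal linear equating sketch: map scores on parallel form Y onto the scale
# of form X by matching means and standard deviations. The score vectors are
# simulated for illustration.
import numpy as np

def linear_equate(scores_x, scores_y):
    """Return a function mapping a form-Y score to the form-X scale."""
    mx, sx = np.mean(scores_x), np.std(scores_x, ddof=1)
    my, sy = np.mean(scores_y), np.std(scores_y, ddof=1)
    return lambda y: mx + (sx / sy) * (y - my)

rng = np.random.default_rng(0)
form_x = rng.normal(30.0, 6.0, 2000)   # hypothetical scores on form X
form_y = rng.normal(28.5, 5.5, 2000)   # hypothetical scores on a slightly easier form Y

to_x_scale = linear_equate(form_x, form_y)
conversion_table = {y: float(round(to_x_scale(y), 1)) for y in range(20, 41, 5)}
print(conversion_table)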

Can this approach be used in a large-scale assessment such as NAEP? Yes, but it imposes heavier constraints on form construction than have ever been seen in NAEP. In order to use observed scores as the main reporting framework, a subset of essentially parallel forms would need to be used within assessments and across assessment years.

CALIBRATION is a way of mapping information from somewhat more diverse test forms, although it too depends on disciplined test construction in order to work and requires more complex statistical methods. The idea is that the same skills and mix of items are employed, but possibly with different test lengths and difficulties. Information is therefore ‘about the same thing’, but with different accuracy for students with different test forms or at different levels of proficiency even within the same form. Under these constraints, it is often possible to use a model (an IRT model, in particular) that exploits common relationships among items even though they appear in dissimilar test forms. Unless test forms are nearly parallel, however, distributions of individual examinees’ IRT ability estimates yield incomparable estimates of proficiency distributions in groups of examinees. That is, no one-to-one correspondence table can give optimal answers for the full range of inferences one would want to make from the data—importantly, in NAEP, not for population distributions, proportions of students above proficiency-level cut points, or relationships among proficiency and background variables. More complex statistical methods (i.e., marginal estimation methods) should be used to deal with the differing amounts of uncertainty associated with individual students’ responses, given their performance and the particular configuration of items they were administered. This is currently accomplished in NAEP with Rubin’s multiple imputations procedures—i.e., NAEP ‘plausible values.’
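The following sketch, which assumes a simple Rasch model with invented item difficulties, responses, and a standard normal population prior, illustrates the point about differing uncertainty: two booklets calibrated to the same scale, one short and easy and one longer and harder, yield posteriors over proficiency with different spreads, which is what marginal estimation methods (and, in NAEP, plausible values) are designed to take into account.

```python
import numpy as np

# Hypothetical Rasch item difficulties for two booklets calibrated to a
# common scale: a short, easy booklet and a longer, harder one.
booklet_short = np.array([-1.5, -1.0, -0.5, 0.0, 0.5])
booklet_long = np.linspace(-0.5, 2.0, 20)

theta_grid = np.linspace(-4, 4, 401)        # grid over latent proficiency
prior = np.exp(-0.5 * theta_grid**2)        # N(0,1) population prior (unnormalized)

def posterior(responses, difficulties):
    """Posterior over theta given 0/1 item scores under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta_grid[:, None] - difficulties[None, :])))
    like = np.prod(np.where(responses[None, :] == 1, p, 1 - p), axis=1)
    post = like * prior
    return post / post.sum()

# Two students with the same proportion correct, but on different booklets.
post_short = posterior(np.array([1, 1, 1, 0, 0]), booklet_short)
post_long = posterior(np.array([1] * 12 + [0] * 8), booklet_long)

for name, post in [("short/easy booklet", post_short), ("long/hard booklet", post_long)]:
    mean = np.sum(theta_grid * post)
    sd = np.sqrt(np.sum((theta_grid - mean) ** 2 * post))
    print(f"{name}: posterior mean {mean:5.2f}, posterior sd {sd:4.2f}")
```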

PROJECTION is extrapolating results from one form of gathering evidence, to what might have been observed using a different form of gathering evidence—for example, projecting results from the Armed Services Vocational Aptitude Battery (ASVAB) to NAEP (Bloxom, Pashley, Nicewander, & Yan, 1995). In statistical terminology, projection is calculating the predictive distribution of scores on one test given observations from another, taking into account relationships with background variables if one contemplates making inferences about the associations between these variables and proficiency. As with calibration, more complex statistical methods are required. Moreover, when the skill requirements and mix of item types are not tightly constrained, the empirical relationships upon which projections are based can change over time, across ages, and from one demographic or curricular subgroup to another.
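As a rough illustration of the logic of projection (not the Bloxom et al. procedure itself), the sketch below estimates, separately within each background subgroup, the predictive distribution of scores on one instrument given scores on another; all data and coefficients are simulated, and the point is simply that the projection relationship depends on the conditioning variables and could differ across subgroups or over time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linking-study data: scores on an external test (x) and on a
# NAEP-like form (y), plus a background indicator g, for the same students.
n = 2000
g = rng.integers(0, 2, n)                    # e.g., a curricular subgroup
true = rng.normal(0.3 * g, 1.0, n)           # latent proficiency differs by group
x = 50 + 10 * true + rng.normal(0, 4, n)     # external test score
y = 150 + 30 * true + rng.normal(0, 12, n)   # NAEP-like score

def project(x_new, g_new):
    """Predictive mean and spread of y given (x, g), from a within-group
    regression of y on x (parameter uncertainty ignored for simplicity)."""
    mask = (g == g_new)
    X = np.column_stack([np.ones(mask.sum()), x[mask]])
    beta, res, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    resid_sd = np.sqrt(res[0] / (mask.sum() - 2))
    return beta[0] + beta[1] * x_new, resid_sd

print(project(60, 0))
print(project(60, 1))   # same external score, different subgroup
```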

Market-Basket Reporting: Definitions and Variations

A marketbasket of items is a collection of items one might administer, performance on which constitutes a reporting scale. As discussed below, this may be performance in terms of ‘true-scores’ or ‘observed scores.’ Both right/wrong items and open-ended items could be included, as long as a well-defined score were agreed-upon. The items in a marketbasket would be made public so that users would have a concrete reference for the meaning of score levels. The number and mix of items to be included in a marketbasket can be determined by consensus, but the following are some considerations:

 

  • Representativeness. A range of item types and content is desirable, as are illustrations of key skills and concepts. Innovative, experimental, and specialized tasks would appear in other, non-marketbasket assessment forms, and could be introduced and varied more flexibly.

  • Size. Three possible sizes of marketbaskets are discussed below, in terms of their comparative advantages and disadvantages. The variations are (1) a marketbasket being a typical booklet that an actual student takes; (2) a somewhat longer, but not exhaustive, collection of actual items; and (3) a large collection which effectively defines the subject domain of interest.

  • Replicability. Unlike the Consumer Price Index (CPI), which holds a collection of consumer products fixed for a period of years, it is desirable to be able to construct parallel collections of marketbasket items.

Variation #1: Marketbasket Size That of a Typical Assessment Form

The basic idea would be that a first marketbasket collection would be used to establish a reporting metric—observed scores on this set of items, and other sets that are very much like it—and released to the public. Replicate marketbasket collections would be administered in the same assessment, and, having been built to be essentially parallel to the original collection, could be linked to it by means of equating functions. (Algorithms are available for constructing, say, six replicate marketbasket collections from a suitable startup pool of items [Stocking et al., 1991; Theunissen, 1985].) These equating functions, once defined, would be used in succeeding assessments so that linking between assessments would be taken off the critical path to reporting information from marketbasket booklets; thus, reports limited to marketbasket results could be fast and reliable. (An advantageous side benefit is that responses from IEP/LEP students would not need to be analyzed through IRT models in order to be included in national results, thus removing a tenuous step from the analysis.)

Other booklets could be included in an assessment without the tight constraints required for marketbasket forms. Information from these forms could be reported in its own right, and/or projected to the marketbasket scale using the more complex statistical procedures mentioned above, but off the critical path to initial, fast-track, reports. (Specifically, one could fit a possibly-multidimensional IRT model to the full data set including one or more marketbasket collections, and, from all responses, build a predictive distribution for observed scores on the marketbasket reporting scale and draw plausible values from it to convey the results and their precision.) The more such forms differ from the marketbasket collections, the more interesting they are in their own right, but the less information they bring to bear on the marketbasket-metric results and the more difficult and unstable their link to the marketbasket metric becomes.
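A minimal sketch of the parenthetical procedure above, with a unidimensional Rasch model standing in for the possibly-multidimensional IRT model and all item parameters and responses invented: a student's responses on a non-marketbasket form determine a posterior over proficiency, from which plausible values of the released market-basket observed score can be drawn.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_grid = np.linspace(-4, 4, 401)
prior = np.exp(-0.5 * theta_grid**2)          # N(0,1) population prior (unnormalized)

# Hypothetical Rasch difficulties: the released market-basket form and a
# different, non-parallel form that a sampled student actually took.
basket_items = np.linspace(-1.5, 1.5, 12)
other_items = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.5])
responses = np.array([1, 1, 1, 1, 0, 0])      # the student's scores on the other form

def irf(theta, b):
    """Rasch item response probabilities for abilities theta and difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

# Posterior over proficiency given the non-market-basket responses.
p_other = irf(theta_grid, other_items)
like = np.prod(np.where(responses == 1, p_other, 1 - p_other), axis=1)
post = like * prior
post /= post.sum()

# Plausible values on the market-basket observed-score metric: sample theta
# from the posterior, then simulate the number-correct score on the basket.
theta_draws = rng.choice(theta_grid, size=5, p=post)
pv_scores = [int(rng.binomial(1, irf(np.array([t]), basket_items)).sum())
             for t in theta_draws]
print(pv_scores)   # five plausible market-basket scores for this student
```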

The advantage to having marketbasket collections that could be released (and replaced) over time is that these collections and their accompanying relationship to the reporting metric could be made available to embed in other projects—state or local assessments, national or international surveys (e.g., NELS or TIMSS), program evaluations, and public and private research projects. Using the observed-score metric of the marketbasket as the reporting scale means that such a project could get observed score distributions that were on the NAEP scale without complex statistical methods.

Variation #2: Marketbasket Larger than a Typical Assessment Form

A disadvantage of using as a marketbasket scale a set of items a student would be administered in a testing session is that the breadth of the subject domain could probably not be fully represented. It may be that several such sets are needed to adequately convey the mix of formats, skills, and topics specified in the framework. It is possible to use such a larger collection as a framework for reporting results, too. The important advantage of more than one set is better representativeness and better communication of content coverage. The disadvantage is that observed scores on a typical administered booklet no longer provide unbiased estimates of population distributions other than central tendency (in particular, not proportions of students at or above proficiency level cut points). Marginal estimation would usually be required for such inferences in terms of these larger marketbaskets. Either the true-score or projected observed-score metric could be established as the reporting scale, although either would involve more complex statistical analyses than simply reporting observed-score results.

In this variation, as in the others, we assume that items will continue to be released to the public. One would have created the marketbasket (so it is representative of the subject area framework), plus perhaps an equivalent or larger set of representative items that would be administered in the first comprehensive assessment under the new framework. All would be scaled together, but under analyses off the critical path to standard reports in that assessment cycle. The standard report for that cycle would have been based on the previous framework and turned around rapidly. The original marketbasket collection would be released at this time, results would be reported in its metric, and the next assessment cycle would use the unreleased scaled marketbasket forms. In that cycle new forms could be calibrated onto the scale (again off the critical path to reports) to continue the practice of releasing items and refreshing the pool. In this way, item development would be concentrated at the initiation of the new framework, with continued development concurrent with standard reporting in successive years. Depending on the number of cycles between framework redefinitions, perhaps a third to a half of the forms would be developed and scaled ‘up front’, while the rest were developed over time.

Variation #3: Marketbasket Constitutes Subject Domain (Bock, 1996)

This variation builds on Darrell Bock’s (1996) idea of ‘domain referenced scoring’. Bock described advantages of bringing together, at the initiation of a new framework in a subject area, a sufficient number of items to constitute an operational definition of skill in that domain—depending on the subject, perhaps 500 to 5000 items. An appealing advantage of this approach is that the entire domain of items would be released immediately to the public. Having specified how one would define a score if a student were able to respond to all of these items, it is possible to calculate a predictive distribution for this domain score from a student’s response to some subset of the items.

Domain-referenced scoring with marginal estimation means establishing IRT or similar scales among the items in the domain. This might require either multiple subscales or multivariate IRT models. These scaling models, in conjunction with marginal estimation methods, would be the vehicle through which predictive distributions to the domain as a whole were calculated. Varying compositions of booklets could be used, and computerized adaptive testing could be employed as well.

This variation requires large item development and item calibration efforts at the beginning of a (say) ten-year period. Given that all items in a domain would be released, it would be necessary to provide the public with additional help in interpreting assessment results. The public visibility of the item pool would have the advantage of stimulating discussion and learning in the subject area, though possibly its use by various individuals and groups would raise performance on these kinds of tasks at the expense of other skills which cannot be addressed in the NAEP setting.

Changing the Marketbasket

The advantages of the marketbasket approach (speed, convenience, and stability) follow largely from the additional constraints imposed on the portion of the assessment built around marketbaskets. As with the Consumer Price Index (CPI), such a marketbasket would tend to become less pertinent over time. As curricular emphases and public interests change over time, new portions of the assessment could be added and reported in their own right to provide more timely information, while trends would still be reported on the marketbasket metric. After some point (ten years?) however, it would seem appropriate to provide for reconstituting the marketbasket. These shifts would correspond with revising subject area frameworks. As with the CPI, new collections of items could well exhibit different trends from the previous mix, and different relationships with background variables. After including both new and old marketbasket booklets in at least two successive assessments to determine the differences and offer an approximate overall-linkage (e.g., matching means and standard deviations), trends would be reported in terms of the new collections. Periodically, however, booklets corresponding to superseded marketbasket metrics could be embedded in current assessments in order to provide relevant comparisons. Conversely, near the end of a marketbasket cycle, new trial marketbasket forms would be tested and administered in parallel with the established marketbasket forms.

Some Parallels Between Marketbasket Reporting and the CPI

Figure 6.1I-1 draws parallels between ‘marketbasket reporting’ in the Consumer Price Index and how it might work in NAEP. The excerpt on the CPI discusses the marketbasket update that took place early in 1987. Updates of the CPI had taken place at ten-year intervals, and reflected changes in Americans’ buying habits.

 

Innovations in NAEP

At a time when the National Assessment is being refocused in order to improve its efficiency and responsiveness, it is especially important that NAGB maintain clarity in its motivation for innovations in measurement and reporting. It must be assumed that the National Assessment will not have funds for "innovation for innovation’s sake." For example, computerized testing deserves attention if it can reduce costs, increase motivation (e.g., for 12th grade students), or permit measurement of important achievement domains that are presently unmeasured (e.g., interactive problem solving). However, the fact that computerized testing permits measurement of a particular achievement domain does not mean that the domain is important enough to include in the National Assessment. Similarly, changes in reporting technology make sense only if the changes produce real improvements in the usefulness of NAEP data from the perspective of the NAEP "customer."

In contrast, many NAEP innovations over the years have met the test of furthering the perceived missions of NAEP. As examples, matrix-sampling allowed wide content coverage without taxing respondents unduly; open-ended tasks broadened the kinds of knowledge and skills that could be assessed; and plausible values permitted estimation of group distributions and achievement levels from sparse, matrix-sampled data. Each of these innovations, as we have seen, reflects tradeoffs; each introduces its own countervailing complexities.

 

Figure 6.1I-1
Parallels Between the CPI and NAEP

Periodic RFPs should invite vendors to propose innovative uses of technology that promote the new NAEP directions. Such vendors are likely to come up with a wide range of suggestions, more thoughtful and creative than those we might mention here. Tryouts should begin on a small scale, generally without the need for nationally representative data. No innovation would appear on the critical path to standard reports until it was thoroughly understood. Results from these trials would be fed back to the subject-area committees to keep them informed of actualities and feasibilities, as well as possibilities, for the NAEP context. Some possible avenues for explorations that occur to us are listed below.

 

  • Use of computers for constructed response tasks which involve drawing, moving, arranging, etc., which can be scored algorithmically, so that NAEP could have more open-ended tasks without more human scoring time, cost, and sources of uncertainty.

  • Adaptive testing in the usual ‘difficulty’ sense has been advanced in many quarters. The usual argument of efficient measurement is not as compelling in the context of estimating group characteristics as it is in the more familiar context of measuring individual students, since the dominant factor in the uncertainty about groups is variance across students rather than within students. A more accurate estimate of a group mean is obtained, for example, with 10 responses each from 100 students than with 1000 responses each from 50 students (a worked example of this comparison appears after this list). It should also be noted that using adaptive testing means unavoidable reliance on scaling models and almost certain reliance on marginal estimation methods.

  • Adaptive testing in the sense of subject subareas is attractive. All students could get broad content samples, which would include, say, a couple of easier physics or chemistry tasks. Students who do well on them would get more. This achieves the objective of targeted sampling for special assessment, but in a way that doesn’t involve additional logistics or sampling procedures for identifying special subpopulations of students. The sample is still representative for a subarea, although students who have demonstrated they are proficient will provide more data than those who are not proficient. In contrast to data obtained only from nonrepresentative special subpopulation samples (e.g., AP students or magnet school students), it is known how to project the results from the above-described approach to national estimates. (The benefits discussed here apply also for two-stage or multi-stage testing, a simpler variation of adaptive testing in which item-selection decisions are not carried out after every item but only occasionally, to select blocks of items.)

  • Computerized scoring of writing samples, especially if they have been entered directly into the computer, permits automatic and economical analyses of lexical and syntactic aspects of writing, to complement raters’ judgments.

  • The use of imaging in scoring, already employed by National Computer Systems in a variety of NAEP assessments (e.g., writing), permits more economical logistics for scoring of open-ended portions of paper-and-pencil booklets.

     

  • Data analytic tools that facilitate secondary analyses should continue to be developed. NAEP employs complex student- and item-sampling designs, and scaling models in order to achieve its primary missions efficiently, but at the cost of increased challenges for secondary analysts. The ‘plausible values’ methodology is a technical/measurement innovation that simplifies their task, but with its own tradeoffs. A single large imputation run provides reasonably good estimates for a wide variety of secondary analyses, but for any one specific analysis, a better model could be built. Tools that would allow secondary users to build and fit such models without having to deal with all the complexities of marginal estimation are now within reach.

  • On-line access to NAEP data and analytic tools, as described above, would facilitate and foster secondary uses of NAEP as a national resource. NAEP publications have recently become available on the World-Wide Web, which is a laudable achievement to be followed up aggressively.

  • The area of NAEP score reporting, aside from whether marketbasket, scale-scores, or achievement levels are used, offers potential for improved communication. Experimentation and focus-group studies should be carried out, in light of questions such as the following: Who are the audiences for NAEP results? What are their levels of sophistication with respect to data/graphic displays/statistics, etc.? How can the NAEP results be displayed to maximize understandability and usefulness among the various intended audiences?
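The following sketch works through the group-mean comparison mentioned in the adaptive-testing item above, using assumed variance components in which variation across students dominates the response-level noise within a student; under that assumption, 10 responses each from 100 students yield a smaller standard error for the group mean than 1000 responses each from 50 students.

```python
import numpy as np

# Assumed variance components (illustrative only): between-student variation
# in proficiency dominates the within-student noise of a single response.
var_between = 1.00   # variance of true proficiency across students
var_within = 4.00    # response-level noise variance within a student

def se_group_mean(n_students, responses_per_student):
    """Standard error of the estimated group mean when each sampled student's
    proficiency is estimated by the average of his or her responses."""
    var = var_between / n_students + var_within / (n_students * responses_per_student)
    return np.sqrt(var)

print(se_group_mean(100, 10))    # 10 responses each from 100 students  (smaller SE)
print(se_group_mean(50, 1000))   # 1000 responses each from 50 students (larger SE)
```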

OBJECTIVE 2: To develop, through a national consensus, sound assessments to measure what students know and can do as well as what students should know and be able to do.

2A. Keep test frameworks and specifications stable.

 

  • Test frameworks and test specifications developed for the National Assessment generally should remain stable for at least ten years;

  • To ensure that trend results can be reported, the pool of test questions developed in each subject for the National Assessment should provide a stable measure of student performance for at least ten years;

  • In rare circumstances, such as where significant changes in curricula have occurred, the Governing Board may consider making changes to test frameworks and specifications before ten years have elapsed;

  • In developing new test frameworks and specifications, or in making major alterations to approved frameworks and specifications, the cost of the resulting assessment should be estimated. The Governing Board will consider the effect of that cost on the ability to test other subjects before approving a proposed test framework and/or specifications.

"Proficiency" in any NAEP subject area is an operational definition, and one which depends to a large extent on subject-area frameworks. Changing the operational definition at this level is costly and always opens the door to lack of comparability between the reporting metrics of the new and previous assessments. For these reasons, holding the frameworks and specifications constant for intervals of, say, 8 to 12 years increases the stability of NAEP results, leads to quicker standard reports, and reduces some major costs in several ways (e.g., framework design committee work, achievement level setting, and item development).

Keeping the frameworks constant does not necessitate freezing assessments as tightly as the current NAEP long-term trend assessments. Given a framework and test specifications, new tasks can be written and introduced into the assessment (on a track that is off the critical path to standard reports) while maintaining the integrity of the reporting metric. This phased-in introduction of items similar to ones already used allows a modest degree of evolution of the item domain, while retaining the reporting metric for trends.

More dramatic changes in what is important in a domain can be introduced more rapidly if desired, still without incurring the full costs of framework redefinition or comprehensive assessments, by carrying out a "special assessment" in that subject area—perhaps spiraling such booklets in with the standard administration. Results could be published on the topic in question without being included in the standard reporting scale or publication. Such focused or special assessments can be carried out as often as one likes and can afford—separate costing and reporting protects speed and stability of longer-term measures, while allowing for timely information. The core assessment can be supplemented at any time with ideas that appear in neither the frameworks nor the test specifications; however, the supplements must be kept off the critical path when introduced.

2B. Use an appropriate mix of multiple-choice and ‘performance’ questions.

 

  • Both multiple-choice and performance items should continue to be used in the National Assessment;

  • In developing new test frameworks, specifications, and questions, decisions about the appropriate mix of multiple-choice and performance items should take into account the nature of the subject, the range of skills to be assessed, and cost.

A core NAEP mission is "to develop, through a national consensus, sound assessments to measure what students know and can do as well [as] what students should know and be able to do." Thus, it is critically important for the National Assessment to reflect the breadth and richness of valued content and processes. Undue narrowing of the National Assessment will damage its credibility, so these frameworks must set the stage for the mix of tasks that appear in the assessment. On the other hand, the core assessment should not be jeopardized by use of tasks or procedures which are not well-suited to NAEP’s large-scale, drop-in-from-the-sky, unmotivated character. It is consistent with NAEP’s mission to use mixtures of multiple-choice (MC) and constructed response items (some of which, such as writing exercises and hands-on science tasks, are more extended). A compromise is to build the core assessment around tasks that are well-suited to NAEP’s standard conditions, and assess large extended performance tasks off the main reporting scale. More in-depth assessment conditions, with smaller samples of students who are also administered a portion of the standard NAEP tasks, would provide better information about such skills and communicate their importance to the public.

Characteristics of performance tasks

Performance tasks have been used increasingly in NAEP because they provide a different kind of evidence than MC items about what students know and can do. Accordingly, they have very different evidentiary characteristics. A major plus is that they can provide direct evidence about productive aspects of students’ competence. Balancing considerations are logistic complexities, time requirements, content specificity, and motivation problems. We have seen, as an example of logistic complexities, differences attributable to modified training materials that swamp differences attributable to changes in what students know and can do. We have seen, as an example of motivation problems, large discrepancies in performance tasks among various students who do well on MC—in 1994 geography, as many as 40 percent of the twelfth grade students simply did not bother with tasks that required entry and effort. It is a further concern that rates of omission appear to be associated with student ethnicity (Swinton, 1993).

There is a fairly clear distinction between what might be called "constructed response" (CR) tasks and "extended constructed response" (ECR) tasks. Both require open-ended responses, as opposed to MC items, but they differ notably in the amount of time and entry required of students. ECR tasks generally require more time, exhibit interconnections among aspects of the performance, and can elicit poor performance for a variety of reasons—only one of which is lack of the requisite skills or knowledge. For example, Yepes-Bayara (1996) describes a talk-aloud study of sixteen 8th grade students as they worked their way through both a "regular" block of science items from the 1996 assessment (i.e., MC and CR tasks) and a "special" block (i.e., either an ECR hands-on experiment or a "theme block" comprising several MC and CR tasks all dealing with the ecology of a pond). He found that the major difference between students who did well and who did poorly on the hands-on block was not a lack of science understanding, but trouble with planning and management skills.

Implications for task design

A recurring but largely remediable problem involving ECR tasks has been the finding that certain tasks or booklet configurations do not work well in the standard NAEP context, yet they are wastefully administered to national samples of thousands of students. ECR tasks are used to good effect when integrated with instructional programs, so that the prerequisite background knowledge and familiarity with task demands is assured. Creating ECR tasks that can be administered to a random sample of students across the nation, regardless of curricular experiences or background knowledge, has proved difficult.

One avenue toward a solution is better feedback from field testing. Data from smaller scale field tests, which are not necessarily nationally representative, can identify what is working well and what is not. Simple data analysis will generally not be sufficient; data not otherwise available, such as that collected by Yepes-Bayara (op. cit.), will be necessary to understand what is really happening when students interact with ECR tasks. Without such studies, the danger is that interpretations of poor performance will presume problems with the targeted knowledge and skills, when factors such as lack of familiarity with format, specifics of the given task, and time management skills are the trouble.

A second avenue is an oft-suggested recommendation that NAEP establish standing subject-matter panels. A 1993 NAE report describes the proposed functions of these panels in some detail, including the recommendation that....

the panels should provide continuity to the assessment by being involved in all aspects of the process, including formulating the framework and objectives; reviewing items, item-scoring rubrics, and reporting formats, and helping to achieve agreement on narrative descriptions of performance standards and representative illustrative tasks. This would help to ensure that there is both logical and statistical correspondence between content standards, performance standards, and illustrative tasks. (p. 123)

In the current NAEP scheme, subject area committees develop frameworks. Task developers attempt to create tasks that reflect the frameworks. Statistical analysts gather and analyze data that have the potential to determine the extent to which the tasks truly provide useful information about students’ skills, as envisioned by the framework committees. But the feedback loop is never closed, forfeiting the opportunity to continually refine the subject-area specialists’ vision with the actualities of the assessment. A standing panel could serve this role.

A third avenue, further down the road, capitalizes on adaptive testing. Since an ECR task can be inappropriate for a given student because of her poor matchup with its difficulty, content, context, or format, other information about the student could be used to select ECR tasks for which the chance of a good matchup is better.

Implications for scaling and achievement-level setting

ECR tasks can be a source of "negative returns" for NAEP. In "drop-in-from-the-sky" assessments, such tasks provoke considerable problems with nonresponse, low motivation, and poor performance for reasons other than the targeted skill. They take much more time than MC and CR tasks. The variety of reasons for poor performance often renders them uninformative from the technical perspective of their contributions to measurement accuracy, in terms of the domain of tasks in the subject area as a whole. They are the most difficult tasks for human scorers, and as such are fertile sources of scoring anomalies that must be identified and, if possible, rectified. Despite their rich possibilities for providing more information about what students know and can do, they can result in a NAEP that actually tells less! Yet, to be as faithful as possible to the subject area frameworks, they are important to include in NAEP.

A compromise solution is to take ECR tasks out of the core of standard results—to include only MC and CR tasks in the core assessment data-gathering, scale construction, achievement level setting, and the standard reporting timeline. Stability, speed, and accuracy would all be enhanced, although trading away (1) the breadth of subject area coverage and (2) communication to the public about the importance of learning ostensibly modeled by ECR tasks. An alternative route to recovering these functions that appear to be lost (but are not really, since including them in the main stream does not really achieve them anyway) is discussed below.

Implications for reporting

ECR tasks are important to assess and report, but the standard system of procedures cannot handle them well; besides problems of interpretability and motivation, they are often problematic in terms of equating, scaling, standard-setting, and score reporting. The solution is to create and publicize an alternate system that does suit them. The properties of this parallel system might include ...

 

  • some ECR assessment modules that could be presented to national samples of students along with standard modules (much as they are now), but limited to tasks that have been found to provide useful data in this format;

  • smaller samples of students to be assessed with ECR tasks in contexts that evoke richer information about what is actually happening when students work through the tasks (with the side benefit that this special attention increases students’ motivation); and

  • well-publicized special reports that highlight the results of these tasks.

The last of these would mean, of course, making the limitations of NAEP results clearer than they have been in the past—that is, what one can and cannot learn about what students know and can do, as conveyed by NAEP scales, item performance, distributions, and achievement-level results alike. But these limitations are not new; they have always been present in the data, if lost in the interpretations. The proposed approach has the benefit of focusing one kind of report (the standard report) on what NAEP has always been able to do relatively well, and focusing separate attention on what it has not done well, through the use of data-gathering and reporting efforts with different configurations to overcome many of its (formerly hidden) deficiencies.

Implications for Assessing Trends

To have reliable estimates of trends, it is necessary to have stability, not only in test frameworks, but in the test specifications that flow from the frameworks. Test specifications include definitions of objectives (or outcomes) and their relative emphasis (in percent of items or score points), item types and formats measuring those objectives (including but not limited to MC or CR items), item bias evaluation criteria (for detecting DIF, or differential item functioning), and standard error functions (accuracy at each point along the scale).

It is particularly important to attend to standard error functions when estimating proportions of students reaching performance standards. In terms of variances, the traditional formula reflects the fact that observed score variance is the sum of true score variance and error variance. The distribution of observed scores is affected by the distribution of true scores and the distribution of measurement error (which can be summarized by the standard error function). The percents of students apparently reaching performance standards, particularly those related to extreme performance, can be greatly affected by changes in the standard error function, even if the distribution of true scores remains unchanged. For example, more high-school basketball players will be able to sink at least 75% of their free throws if we only observe four attempts than if we observe a hundred attempts.
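The free-throw illustration can be made concrete with a simple binomial calculation (the 60 percent true shooting rate below is an assumed figure): with only four observed attempts, nearly half of such players appear to reach the 75 percent standard, while with one hundred attempts almost none do, even though the underlying proficiency is identical in both cases.

```python
from math import ceil, comb

def prob_at_least(p, n, frac=0.75):
    """Probability of succeeding on at least `frac` of n attempts, given true rate p."""
    k_min = ceil(frac * n)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# A player whose true free-throw percentage is 60% (assumed for illustration):
print(prob_at_least(0.60, 4))     # only 4 attempts observed  -> about 0.48
print(prob_at_least(0.60, 100))   # 100 attempts observed     -> well under 0.01
```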

In order to draw conclusions about trends in student performance that are unaffected by nuisance factors, either (a) the tests must be constructed to have stable standard error functions (and if multiple forms are used simultaneously, the mix of the forms must be maintained), or (b) more sophisticated statistical procedures for estimating true score distributions separate from observed score distributions must be used (i.e., marginal estimation procedures; see Section 6.1C). Procedure (a) is easier to carry out and easier to communicate, since it doesn’t depend on satisfying modeling assumptions or on estimation procedures that are accurate under trying conditions. However, it requires a degree of discipline and control in test frameworks and specifications that has not characterized the National Assessment to date.

OBJECTIVE 3: To help states and others link their assessments with the National Assessment and use National Assessment data to improve educational performance.

 

  • The National Assessment should develop policies, practices, and procedures that enable states, school districts, and others who want to do so at their own cost, to conduct studies to link their test results to the National Assessment;

  • The National Assessment should be designed so that others may access and use National Assessment test data and background information;

  • The National Assessment should employ safeguards to protect the integrity of the National Assessment program, prevent misuse of data, and ensure the privacy of individual test takers.

The two following sections concern (1) linking NAEP to other assessments, so that others may employ NAEP methods and interpretive frameworks, and (2) designing NAEP so that others may access and use NAEP test data and background information.

 

Linking NAEP and Other Assessments

States and test publishers will probably want to link their assessments to the National Assessment. It is clearly advantageous to NAEP’s credibility and visibility that such links be fostered. It is also critical to NAEP’s credibility that the types of inferences that various kinds of linkages support be clearly understood (Mislevy, 1992; Linn, 1993). It is worth mentioning one possible consequence of linking that may be unanticipated by states. Once the National Assessment is administered on a regular basis, it will be possible to compare a state’s trend lines for NAEP and the predicted trends based on the state assessment. Differences are bound to occur (as indeed they did last year in Kentucky). These differences will keep newspapers, state departments, and NAEP’s Technical Advisory Committees busy.

NAEP should not be in the business of policing and certifying linkages between NAEP and other assessments. The best way to support these efforts would be to provide clear discussions and outlines of procedures for valid linking approaches, and examples to use as models (e.g., Bloxom et al., 1995; Williams et al., in press).

NAEP can also facilitate linkages to other assessments by using modular design elements in its own standard assessments. Publicly-available intact NAEP booklets, accompanied by descriptions of administration conditions and usage, can be administered to samples of students in state or private assessments. For example, ‘Variation #1’ of the section on marketbasket reporting (Section 6.1I) describes booklets which, possibly through an equating lookup table, map observed scores to the NAEP reporting scale so that other projects could approximate the distributions of their students on a NAEP marketbasket scale without complex statistical procedures.

From the perspective of national educational research and improvement, it is particularly important for NAEP to be linked to studies whose methods and approaches complement those of NAEP, such as longitudinal studies and field comparisons of alternative educational approaches. NAEP can provide national snapshots of educational performance; longitudinal studies tell more about how these pictures evolve over time, and experiments and field trials give clues as to why. For these latter kinds of studies to be able to include results and analyses within the NAEP interpretive framework enriches both NAEP and the linked studies.

 

Access and Use of NAEP Data

Beside public reports, two historical methods for making NAEP data available to others have been the provision of exhaustive tables of marginal results (‘NAEP almanacs’) and of secondary user data files. Both must be accompanied by technical documentation for wise use—the latter especially so. Usage has been limited in large part by the challenge of carrying out analyses. Two approaches to extending use, already initiated but meriting further aggressive development, are tools that facilitate analyses and on-line access to materials:

 

  • At present, data-analytic tools can be used to carry out analyses of multiple imputations, accounting for both sampling and measurement uncertainty. (A sketch of the standard step of combining results across multiple imputations appears after this list.) The next steps would be (1) graphically-oriented interfaces, to further reduce the level of statistical sophistication required to perform familiar analyses, and (2) tools which carry out appropriate marginal analyses directly from response data and user-posited models, to eliminate the step of ‘one size fits all’ multiple-imputation files.

  • At present, on-line access to NAEP publications, including results which can be further analyzed, is underway on the World-Wide Web. On-line access to basic data and tools such as described above would further facilitate wider use, and open the door to user networks and sharing of information and results.
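As a sketch of the combining step referred to in the first item above, the function below applies Rubin's rules for analyses repeated over multiply-imputed (plausible-value) data sets: the final point estimate averages the per-imputation estimates, and the final variance adds the between-imputation component, which reflects measurement uncertainty, to the average sampling variance. The numbers in the example call are hypothetical.

```python
import numpy as np

def combine_plausible_values(estimates, variances):
    """Combine an analysis repeated over M plausible-value sets using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()          # combined point estimate
    u = variances.mean()             # average within-imputation (sampling) variance
    b = estimates.var(ddof=1)        # between-imputation variance (measurement)
    total_var = u + (1 + 1 / m) * b
    return qbar, np.sqrt(total_var)

# e.g., a subgroup mean computed five times, once per plausible value, each
# with its own sampling variance (all numbers hypothetical):
est, se = combine_plausible_values([281.3, 280.9, 281.8, 281.1, 281.5],
                                   [1.21, 1.25, 1.19, 1.22, 1.24])
print(est, se)
```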

7.0 A Feasible Configuration

7.1 Overview

NAEP serves many missions and many interested parties. Practically all of these missions can still be served while the project is moved considerably in the directions outlined in NAGB’s Themes and Issues, through a reconfiguration along the lines sketched in this section. Evidence from NAEP itself (in particular, experience with the long-term trend) indicates this objective is eminently achievable. Accomplishing it requires (1) establishing priorities on its various missions, (2) creating a design/analysis/reporting plan that optimizes attaining these missions at the cost of putting other missions on slower tracks, and (3) phasing in revisions so they never appear on the critical path to initial reports until they have been administered in their final form off the critical path. More comprehensive reporting, in terms of additional background variables, broader arrays of tasks, and interpretative discussion of results, can appear further downstream. Modular construction of core NAEP blocks would facilitate their inclusion in other projects, governmental and private, providing ways of using the NAEP interpretative framework to address missions that are poorly served by large-sample surveys like NAEP.

7.2 Modularity of Design

The core elements of NAEP to support standard reports would be booklets containing basic student background self-report questions, and, within a given subject area, similar mixes of multiple-choice and constructed response tasks. The constructed response tasks would tend to be shorter ones, as those which are problematic from either scoring considerations or student motivation would be better located off the critical path to standard reports. The number of booklets needed to adequately represent a subject area, given these constraints, is an important policy question that lies beyond our franchise to resolve (see Section 6.1I for discussion of tradeoffs).

Auxiliary information, which could be spiraled into the main administration but not be placed on the critical path to standard reports, would include teacher surveys, parent surveys (should any be carried out in future), and broader varieties of tasks (e.g., estimation blocks in math, theme blocks in science, student-choice tasks in reading and writing, occasional administration of long-term trend blocks; information from such blocks could be included in the reporting metric through the use of projection, but their main value would lie in their own unique revelations). In an earlier ‘comprehensive’ administration, the core booklets would have been linked into the reporting scale, using a fully- or partially-BIB design to support scaling analyses. This linking might be either equating or calibration, depending on the reporting metric that has been designated (i.e., equating for single-booklet size marketbaskets, calibration for larger marketbaskets or abstract scales). This having been accomplished, use of these booklets in subsequent assessments need not be BIBbed, thereby simplifying printing and administration logistics somewhat. New booklets approximately parallel to these could be introduced if desired, but this would be a smaller-scale item-development effort, relying on established frameworks and item specifications. Such new booklets would be spiraled in with the established core booklets and calibrated into the scale, but again not on the critical path to reporting in the cycle in which they are introduced. Modest changes can be introduced in this way, much as the Consumer Price Index will, from time to time, substitute one item in its marketbasket for another if that item is no longer available.

Periodically, new frameworks will be introduced in a subject area—say, at eight to ten year intervals. What is referred to in Themes and Issues as ‘comprehensive reports’ would logically take place when booklets under the new framework are put in place. ‘Put in place’ means three administrations (see Figure 6.1D-1):

(1) field trials of items in an assessment cycle which includes that subject area, for learning about the feasibility and operation of the tasks (Note: new tasks, designs, etc., should be pretested in small samples that need not be representative, and analyzed for feasibility, gross errors, etc.); followed by

(2) construction and administration of forms as they would be anticipated to appear in a full administration in the next assessment cycle; further refinement of tasks and analytic procedures would take place here, as would DIF analyses and possibly standard setting activities; then

(3) full administration in a comprehensive assessment. Core booklets from the previous framework would be administered concurrently in this last year, however, to serve as the basis of a standard report. The comprehensive report would be on a track more like reporting under the current configuration, with more comprehensive analyses and interpretive discussion, but expected within, say, 12-15 months of data collection. The comprehensive assessment activities and reports would include analyses that examined the link between the new and the previous reporting scales. Subsequent standard assessments would then be built around a stable core from the new framework.

For state NAEP assessments in a given subject and grade, the basic item administration design would be the same core booklets used in the national sample. The student sampling design (discussed more fully in Section 6.1H) would consist of a self-sufficient national sample. Each participating state would have its own sample, although the schools in that state which appear in the national sample would constitute a portion of the state sample for state-level reports. Administration procedures for state and national NAEP would be made comparable, through local administration in all cases but with additional training and closer monitoring by the contractor than is currently done in state NAEP. The samples of schools in a given subject area could be reduced from the current 100 to 75 with tolerable loss of precision. We advise against designing NAEP as mainly a state-level assessment, from which national results will be summarized. (See Section 6.1H and Appendix for elaboration of our rationale.) Present enthusiasm notwithstanding, that alternative is risky because it relies heavily on a long-term commitment of states to a burdensome activity, the benefits of which may not seem to justify the effort over time.

7.3 Reporting Approach

The high priority NAGB places on achievement-level reporting necessitates a student-based definition of proficiency (regardless of whether individual student scores are ever calculated), and the heavy design constraints necessary to rely on only simple analyses to link results across booklets and over time seem unacceptable to many user groups. Together, these considerations necessitate some kind of response scaling. This opens the door to a wide variety of reporting mechanisms, including arbitrary scales with specified desirable properties (e.g., the 0-500 scale now used, or the within-grade scales that were used by the California Assessment Program in the 1980’s, or projections to actual groups of items). Section 6.1I above discusses this last possibility more fully, outlining pluses and minuses of three variations on the theme: (1) replicate market-baskets of a size that can be administered to a given student, (2) a somewhat larger market-basket constructed to constitute a representation of a larger conceptual domain, and (3) literal construction of an item domain as described by Bock (1996).

Replicate market-baskets, each of which consists of a representative mix of items constituting a booklet in the core data-collection for standard reports, have the advantage of providing observed scores (possibly ‘equated’ observed scores) that are unbiased estimates of NAEP results, assuming the NAEP scale is defined in terms of observed scores on a baseline market-basket booklet. One important benefit of this feature is that such booklets are easy-to-communicate anchors for public understanding of NAEP results. A second is that these booklets would be very easy to embed in other studies, and get results that are (relatively) easy to link to NAEP results (in particular, plausible values methods could be avoided). A disadvantage is that any one marketbasket booklet would be an incomplete representation of the content domain. One would need to examine several of them to appreciate the breadth and variety of tasks across which generalization is intended.

This deficiency would be overcome by a marketbasket larger than could be routinely administered to a sampled student—either for a suite of core booklets, or a large established pool of items deemed to constitute the domain (Bock, 1996). The tradeoff is that mapping to the larger set from smaller sets of items (in particular, typical core booklets) is less straightforward. Although individual students’ ability estimates can still be calibrated so that they are unbiased estimates of their proficiencies, distributions of these estimates suffer one of the basic problems that led to marginal analyses in the first place: Distributions of these estimates are biased estimates of population distributions. In particular, estimates of proportions of students above, say, the cutoff for ‘Advanced’ based on distributions of estimated individual abilities tend to overestimate true population proportions by amounts greater than sampling error, and greater than changes in proficiency across time points.
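A simple simulation, with an assumed error standard deviation and cut point, illustrates the bias described above: individual ability estimates equal true proficiency plus measurement error, so their distribution is more dispersed than the true distribution, and the proportion of estimates above a high cut point exceeds the true proportion by far more than sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)

n_students = 100_000
true_theta = rng.normal(0.0, 1.0, n_students)    # true proficiency, N(0,1)
error = rng.normal(0.0, 0.5, n_students)         # measurement error (assumed SD)
estimated_theta = true_theta + error             # individual ability estimates

cut_advanced = 2.0   # a high cut point, standing in for 'Advanced'
print("true proportion above cut:      ", np.mean(true_theta > cut_advanced))
print("proportion of estimates above:  ", np.mean(estimated_theta > cut_advanced))
```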

Any of the above approaches supports achievement level setting (ALS) much as it is now, insofar as panel members’ judgments are gathered (see Section 6.1E for discussion of ALS). Analytic procedures can be simplified, however, when the reporting scales are based on projected scores on specified sets of tasks. Panels’ ratings of performance on these tasks are on the reporting scale perforce, thereby eliminating the step of translation through a scaling model. It remains important nevertheless that panel members be provided actual data on those tasks as one source of information to ground their judgments.

The replicate market-baskets approach with scores projected on the tasks that constitute a booklet would also allow analysis of achievement level results using "whole-booklet" methods for purposes of enhancing understanding of achievement levels, validation of the levels, or, possibly as an alternate way of setting achievement levels that could remove IRT analyses from the standard-setting process.

7.4 Phased Reporting

Standard reports of key results, based on core booklets of items and background variables that have been administered in identical form in previous assessments, can be produced quickly and reliably. By identifying, from among the myriad of potential background variables, perhaps 20 that are central, established, and stable, we can, like the Census Bureau, strip the others from the critical path to the initial ‘standard report’ results. We can arrange the analyses required for these results to be more stable and easier to troubleshoot than the current all-inclusive analysis. Under these constraints it is reasonable to expect that the information necessary for standard reports could be provided to NCES within six months of the completion of data collection. If standard reports are limited to numerical results with minimal interpretation, pre-approved shells could be released without excessive departmental review. This would be akin to proofreading galleys, rather than opening another round of input for interpretation of those results—a process of verification rather than creation, since the more time-consuming and thoughtful planning for the shell would have taken place before the receipt of the data. Interpretations, from the Department and/or the Board, could accompany the standard report in the form of press releases or more discursive companion reports, rather than be part of the data release, if those agencies desired. This keeps the vicissitudes of reaching consensus on interpretations off the contractors’ critical path to their contribution to the standard report.

Subsequent waves of analysis can incorporate additional variables (e.g., "focus reports"). These latter reports can address background variables that are known a priori to be experimental, unstable, or problematic, but remain off the critical path to the initial results. It has been suggested that more detailed results should not be delayed, because no one will be interested in them then. If this is true, should they be obstacles on the critical path to the main results?

7.5 Analytic Methodology

Some kind of scaling model and ‘true-score’ or marginal analyses will likely be required if NAEP is to allow flexibility in booklet construction within and across assessments (i.e., not all booklets would have to be ‘clones,’ like ASVAB or SAT forms), but still grant priorities to (1) broad content coverage, (2) trend monitoring, and (3) student-based interpretation so as to support achievement-level reporting. Plausible values methodology is one route to this end, but not the only one; other approaches, though all sophisticated, could be accomplished, but no simple analysis will meet requirements (1)-(3) unless substantial constraints are imposed on booklet construction within and across assessments.

Marginal analyses required to produce standard results can appear on the critical path to the six-month standard report, but only by limiting analyses to a specified subset of well-behaved and well-understood variables, which have been used in identical form in a previous assessment and require no file matching from ancillary surveys carried out concurrently with the student assessment. (This does allow the use of data gathered by the sampling contractor in previous steps, such as that obtained from census information for drawing the sample of schools, or from earlier contacts with sampled schools.) Changed items and revised decisions jeopardize the timeline, and if even seemingly minor changes are routinely made, the timelines will be routinely missed despite everyone’s best efforts.

One of the tradeoffs necessitated by this fast reporting track is that it will not be possible to produce a public user file of the current form, with a single file of plausible values that support both (1) the exact replication of ‘official’ results and (2) secondary analyses of NAEP responses in relation to any and all background data—in particular, problematic, new, or concurrent-survey background variables. Satisfying this requirement in the current configuration leads to the ‘everything must be analyzed before anything can be reported’ phenomenon.

The hesitancy over releasing results based on only stable and familiar parts of the data may stem from the belief that there will later be a final, single, comprehensive analysis, which supports a best and correct answer for any inference one might be interested in. There cannot be. Since some complexities in student- and item-sampling are required to meet NAEP missions within affordable costs, model-based inferences cannot be avoided. No model can be all encompassing; given any particular inference, it is possible to custom-design a model and analytic procedure that provide optimal results for that inference, but necessarily do less well for others. The plausible values files under the current configuration are themselves the result of a tradeoff, attempting to support a wide range of analyses fairly well. More time and resources for a more comprehensive single model provide decreasing returns, and eventually negative returns as models become unstable when they exceed the informational capacity of the data. The long-range solution is to provide user files with response data and tools for users to carry out their own marginal analyses—to obtain sharper results for any specific analyses they elect to pursue, or, if they wish, to replicate ‘official results’ by using the ‘official results’ model.

7.6 Costs

One charge to this team was to "provide advice on the ... necessary components and costs of implementing a National Assessment of Educational Progress based on the policy themes and ideas...." The team believed that in order to fulfill this charge with respect to costs some information about present costs was needed. However, that information could not be made available to us in sufficient detail to provide accurate cost estimates of specific design configurations. In the absence of that information, we can provide the following general advice.

It is advantageous to NAEP to maximize competition among bidders for the new National Assessment, for bidders to believe that there is real competition, and for bidders to understand the RFP and the bid evaluation criteria. In the RFP, any processes that are really required should be specified, but in general it is better to describe an outcome and leave it to each bidder to propose a process that produces that outcome. If unusual or specialized procedures are required, the number of qualified bidders will be reduced. To the extent feasible, the RFP should be broken into parts, so that bidders’ expertise can be optimally matched with program needs. The RFP should specify weights to be given to various factors, such as satisfying RFP requirements, minimizing elapsed time between testing and reports, and minimizing costs.

The program design factors listed below play important roles in determining the cost of assessment, and should be the focus of attention when costs are a consideration. An important part of setting priorities for the National Assessment is implicit in establishing budgets. While it is useful to rank priorities and discuss them in qualitative terms, the exercise of costing out configurations will make priorities clearer to NAEP constituencies.

 

  • Sampling and administration. Carrying out the sampling procedures and field work of a NAEP assessment is a large and unavoidable expense. This means that, all other things being equal, annual assessment is more expensive than biennial, and state assessments are more expensive than national-only assessments. In the 1994 Reading assessment, for example, the state assessment component accounted for about 60% of the cost and the national component for about 40%. A large factor in costs, therefore, is determined simply by the decision of how often to go out into the field, and at what level (state and national, versus national only). A design in which state-level assessment took place only every two or three years would be not only less burdensome for states, but less costly to NAEP itself.

  • Number of subject areas assessed. There are considerable economies in assessing more than one subject area once it has been determined to go into the field, due to common sampling and administration costs. For example, if state NAEP were to assess Reading and Math in two-year cycles, assessing both in the same year and having no state assessment in intervening years would be less costly than alternating Reading one year and Math the next.

  • New frameworks, specifications, and achievement level setting. The long-term trend NAEP assessments cost roughly two million dollars. They have no costs for frameworks, item writing, or achievement level setting. If these activities were conducted only at intervals of eight to ten years, in conjunction with a ‘comprehensive assessment’ in a subject area, the standard assessments in intervening years would have costs more similar to those of the current long-term assessments.

  • Amount of hand scoring. Hand scoring is more expensive than mechanical scoring. We have recommended that standard assessments contain a proportion of open-ended items in accordance with subject area frameworks, while avoiding whenever possible those items that are difficult and unreliable to score (which also tend to be those that exhibit problems with student omission and motivation).

     

  • Sampling designs that require substantial effort to access and test small numbers of students. A primary example here is private school subsampling if private school estimates are desired in state reports.

  • Number of test forms and items. The number of items is determined partly by the framework, as sufficient items must be written to provide adequate content coverage, and partly by the practice of releasing a portion of the pool with each assessment. If assessment occurs more frequently, fewer items can be released at each cycle while still providing similar numbers of released items over time. The number of test forms used in a given assessment depends partly on the number of subjects assessed (see above; more subjects mean more forms, though this is still less expensive than assessing those subjects in different years), and partly on whether the administration is a comprehensive or a standard assessment. For a comprehensive assessment, more interconnections among blocks of items are desirable to support scaling and linking. In standard assessments in succeeding years, only a subset of the booklets need be administered.

  • Matching of data from different sources. Teacher surveys, transcript studies, and the portfolio assessment are all examples of NAEP activities that require file matching if their information is to be related to the main proficiency scales. These activities should be designed so that they can be costed out separately, carried out only on an ‘as needed’ basis, and kept off the critical path to standard assessment reports.

  • Changes in specifications. The example of the current long-term assessment demonstrates that complex analyses, in and of themselves, need not be costly or time-consuming—if stable and well-understood procedures are applied to stable and well-understood data. Changes along the critical path, with respect to items, booklet configurations, analytic procedures, administration protocols, scoring rules, training materials, and countless other details, all open the door to unanticipated effects on the data. Some of them will be detected, and engender the costs and delays of rework. Others will not be detected, and add to the variance and instability of the results.

  • New complex analysis procedures. The unanticipated development of the currently-used marginal analysis procedures has consumed several person-years of coding, testing, revising, and extending programs—often on the critical path to results, no less. New procedures should not be developed on an ad hoc basis in an attempt to accommodate changes in item specifications or administration procedures whose technical implications had not been considered.

References

Advisory Council on Education Statistics (ACES). (April, 1996). Letter to Mr. William Randall, Commissioner of Education.

Albert, J.H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.

Beaton, A.E. (1987). The NAEP 1983/84 Technical Report (NAEP Report 15-TR-20). Princeton: Educational Testing Service.

Beaton, A.E., & Zwick, R. (1990). The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 reading anomaly. (No. 17-TR-21) Princeton, NJ: National Assessment of Educational Progress/Educational Testing Service.

Bloxom, B., Pashley, P.J., Nicewander, W.A., & Yan, D. (1995). Linking to a large-scale assessment: An empirical evaluation. Journal of Educational and Behavioral Statistics, 20, 1-26.

Bock, R.D. (1996). Domain-referenced reporting in large-scale educational assessments. Commissioned paper to the National Academy of Education, for the Capstone Report of the NAE Technical Review Panel on State/NAEP Assessment.

Boruch, R.F., & Terhanian, G. (1996). "So What?" The implications of new analytic methods for designing NCES surveys. Paper prepared for the Office of Educational Research and Improvement, National Center for Educational Statistics, U.S. Department of Education. Philadelphia: Graduate School of Education, University of Pennsylvania.

Coleman, J.S., et al. (1966). Equality of educational opportunity. Washington, D.C.: U.S. Department of Health, Education, and Welfare.

Deming, W.E. (1982). Out of the crisis. Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of Technology.

Ercikan, K. (in press). Linking statewide tests to the National Assessment of Educational Progress: Accuracy of combining test results across states. Applied Measurement in Education.

Fetters, W.B., Stowe, P.S., & Owings, J.A. (1984). High School and Beyond: Quality of responses of high school students to questionnaire items. Washington, D.C.: National Center for Education Statistics.

Hershey, R.D. (1987, February 26). Updating the market basket. New York Times, pp. D1, D6.

Holland, P.W., & Rubin, D.B. (1987). Causal inference in retrospective studies. Research Report RR-87-7. Princeton: Educational Testing Service.

Johnson, E.G., Liang, J-L., Norris, N., Rogers, A., & Nicewander, A. (1996). "Directly estimated NAEP scale values from double-length assessment booklets—A replacement for plausible values?" Paper presented at the annual meeting of the National Council on Measurement in Education, New York, April, 1996.

KPMG Peat-Marwick LLP and Mathtech, Inc. (1996). A review of the National Assessment of Educational Progress: Management and methodological procedures. Study conducted for the U.S. Department of Education, National Center for Education Statistics. Washington, DC: Author.

Kane, M. (1996). Paper presented in the session Psychometric Issues in Setting Standards in NAEP at the annual meeting of the National Council on Measurement in Education, New York, April, 1996.

Koretz, D. (1992a). What happened to test scores, and why? Educational Measurement: Issues and Practice, 11, 7-11.

Koretz, D. (1992b). Evaluating and validating indicators of mathematics and science education. RAND Note No. N-2900-NSF. Santa Monica, CA: RAND.

Linn, R.L. (1990). Historical origins and issues in the National Assessment of Educational Progress. In Assessment at the National Level, a symposium presented at the Institute for Practice and Research in Education Forum, University of Pittsburgh, October 26.

Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.

Linn, R. L., & Kiplinger, V. L. (1994). Linking statewide tests to the National Assessment of Educational Progress: Stability of results. Applied Measurement in Education, 8, 135-156.

Lord, F.M. (1962). Estimating norms by item sampling. Educational and Psychological Measurement, 22, 259-267.

Lord, F.M. (1969). Estimating true score distributions in psychological testing (An empirical Bayes problem). Psychometrika, 34, 259-299.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd Ed.) (pp. 13-103). New York: American Council on Education/Macmillan.

Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era. NAEP Report 83-1. Princeton, NJ: National Assessment of Educational Progress.

Mislevy, R.J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.

Mislevy, R.J. (1985). Estimation of latent group effects. Journal of the American Statistical Association, 80, 993-997.

Mislevy, R.J. (1990). Item-by-form variation in the 1984 and 1986 NAEP reading surveys. In A.E. Beaton & R. Zwick, The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 reading anomaly (pp. 145-163). Report No. 17-TR-21. Princeton, NJ: Educational Testing Service.

Mislevy, R.J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. (foreword by R.L. Linn) Princeton, NJ: Policy Information Center, Educational Testing Service. (ERIC #: ED-353-302)

Mislevy, R.J. (1995). What can we learn from international assessments? Educational Evaluation and Policy Analysis, 17, 419-437.

Mullis, I.V.S., Campbell, J.R., & Farstrup, A.E. (1993). NAEP 1992 reading report card for the nation and the states. Princeton, NJ: Educational Testing Service.

Muthén, B. (1988). LISCOMP [computer program]. Mooresville, IN: Scientific Software, Inc.

National Academy of Education (1993). The Trial State Assessment: Prospects and realities. Stanford, CA: Author.

Raudenbush, S.W., & Willms, J.D. (Eds.) (1991). Schools, classrooms, and pupils: International studies of schooling from a multilevel perspective. San Diego: Academic Press.

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 161-169.

Stocking, M.L., Swanson, L., & Pearlman, M. (1991). Automated item selection using item response theory. Research Report 91-9. Princeton, NJ: Educational Testing Service.

Swinton, S. (1993). Differential response rates to open-ended and multiple-choice NAEP items by ethnic groups. Paper presented at the annual meeting of the American Educational Research Association, Atlanta GA, April 1993.

Theunissen, T.J.J.M. (1985). Binary programming and test design. Psychometrika, 50, 411-420.

Walton, M. (1986). The Deming management method. New York: Perigee.

Williams, V.S.L., Billeaud, K., Davis, L.A., Thissen, D., & Sanford, E. (in press). Projecting to the NAEP scale: Results from the North Carolina End-of-Grade testing program. Journal of Educational Measurement.

Yen, W. M. (1995) The technical quality of performance assessments: Standard errors of percents of students reaching standards. Presidential address for the National Council on Measurement in Education, San Francisco, April 1995.

Yepes-Baraya, M. (1996). A cognitive study based on the National Assessment of Educational Progress (NAEP) Science Assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, New York City, April, 1996.

Appendix


DATE: May 9, 1996

TO: Mary Lyn Bourque, NAGB
cc: Benjamin King

FROM: Keith Rust

SUBJECT: Sampling Issues for Redesign

This note is in response to your memo to Benjamin King and me of April 5, 1996, concerning the sampling issues that arise in a consideration of the redesign of NAEP for beyond 1998. Although "everything is related to everything else," I will comment on the issues as you raised them in your memo. Then I will add a few remarks on the topic of targeted assessments.

1. What is the feasibility of combining now-separate state and national samples for the cross-sectional components into a combined sample from which both state and national estimates could be made?

I think it is useful to think of this issue in three parts: 1) the "representativeness" of the student samples; 2) the operational feasibility of conducting the assessments; and 3) the impact of different testing conditions for state and national administrations.

One can think of three models for obtaining a single national data set from which both state level and national estimates could be obtained. The first is to draw the samples for each state, and then supplement as necessary to obtain an adequate national sample. The second is to reverse this process, and draw a national sample, and then supplement it as necessary in each state that wishes to participate. The third is to draw two distinct samples (with minimum overlap of schools), one designed for the national assessment, and one for the state assessments. Then for analysis and reporting purposes they could be combined into a single data set.

A key aspect of this problem is that it is the effects of testing conditions, together with operational considerations, that present most of the obstacles to either of the first two approaches. Obtaining representative samples of students for both national and state assessments, which are largely overlapping, is technically not very difficult, although it must be carried out carefully to avoid bias in the results.

Serious consideration was given to the first two of these schemes for the 1996 assessment. We considered a plan whereby the national assessment would be stratified by state. Within each state, a proportional, unclustered sample of schools would be chosen. The state sample within each participating state would then have been a supplement to the national sample that fell in the state. This is quite feasible and relatively straightforward from a sampling viewpoint, and in fact this approach was used to obtain district NAEP samples that were supplements of state NAEP samples within participating districts (samples were drawn for four districts, but only Milwaukee participated).

The drawbacks to the approach, and the reason it was not implemented, were operational and related to equating. The equating problem is fundamental. Current NAEP scaling procedures require that equivalent samples of students be obtained under national and state conditions. Thus the national sample that falls within the aggregate of participating states is equated to the aggregate state samples from these states. With the above sampling plan, it would have been necessary to have, for schools from the national sample that fell within the participating states, a random half assigned to state administration and half to national. This would have required special weights for the purpose of equating, in addition to the weights for reporting that would blend the two samples together. Keeping in mind that some states withdrew from the assessment just weeks before testing, it would have been operationally very difficult to then recruit those national sample schools, slated for state administration, into the national sample.

The other major drawback is the operational one that results from the fact that, at a given grade, the state assessment is only a part of the national assessment. The state assessments for 1996 were in math at grades 4 and 8, and science at grade 8. For the national assessment, grade 4 also had a science assessment, a math estimation assessment, a math theme assessment, and a random subsample where special accommodations were offered to IEP and LEP students in math and science. At grade 8, the national assessment also had math estimation, math theme, a targeted math assessment, and accommodations in math and science. In addition, large reading assessments were also originally scheduled for both grades, but were canceled, at a point which would have been disastrous had an integrated sample been attempted. Although these extra components could be handled through a special national sample, doing this fairly effectively destroys any cost savings. If national is to have more components than state, it is difficult to save resources by combining samples for the common parts.

When one considers the idea of supplementing the state samples with national samples in those states which do not participate, most of the above objections arise, plus some additional ones. The problems of equating state and national assessments are still present, as is the issue of national samples for the components not in the state program. Furthermore, the national estimates are held hostage to state participation, and the quality of that participation. In 1996, several states dropped out at the last minute. If California were to drop out at a late stage, this would present a great problem. Perhaps worse would be the case of California’s participating, but with a 50% response rate, and numerous failures of the quality control checks.

Thus it appears to me that if the two samples are to be combined, at the point of sample selection, to provide significant savings in resources and burden, two conditions are needed. The first is that the differences in results arising from state and national administration must either be removed or ignored. The second is that, at a given grade, state and national assessments must be the same, and not with state as a subset of national.

That being said, I suspect that the Board’s concern over the issue of combining samples, although legitimate, is somewhat misplaced. I believe that the real concern is not that the samples are separate, but that the resulting data sets are kept distinct, even after equating. It does not seem logical that national results should be reported using samples many times smaller than the aggregate state samples, when most states participate. In the southeast region, in some years all states have participated. Thus I think the question should not be "How do we get these samples together?" but rather "How do we get these data together?" Combining the samples is one way to do this, but given the above discussion, in most circumstances it is unlikely to be the best.

I believe it would be possible to combine the state and national data sets for a given grade and subject into a single data set that could be used for reporting national and state level results. This would require some extra time, since an additional set of weights would be required to combine these. This weighting process could be simplified if states were used as strata for the national design, so that the piece to be combined with the state data would be separated from the rest of the national sample along stratum lines. The national sample does not currently use states as strata, as this is not fully efficient for producing national estimates. If it were used in a case where national and state data were to be combined, this change might well be worthwhile. However, it is not necessary to stratify the national sample by state in order to combine the data.
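To illustrate the kind of combination involved (this is only a generic inverse-variance sketch, not the weighting procedure Westat would actually implement, which would operate on the individual student weights), an estimate for a domain covered by both samples could be formed as

    \hat{\theta}_{comb} = \lambda\,\hat{\theta}_{state} + (1-\lambda)\,\hat{\theta}_{national}, \qquad
    \lambda = \frac{1/V_{state}}{1/V_{state} + 1/V_{national}},

where the \hat{\theta}'s are the separately weighted estimates and the V's their estimated sampling variances; equivalently, each sample's reporting weights are rescaled so that the two samples contribute in proportion to the information they carry. The idea is simple, but producing and checking such weights is the source of the additional time discussed below.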

Although combining national and state data would require extra resources and time for weighting, it is not clear to me that this would have a substantial impact on the overall timeline for reporting. To a fair extent, this extra time could overlap with other activities that are going on to develop reports.

Finally, it is important to note that almost all of the benefits in combining data will accrue to national estimates. These will be more reliable, and amenable to much finer breakdown, than at present. There would be little impact on the results for an individual state.

2. If state and national samples must, of necessity, remain separate, what cost-efficiencies can be incorporated? For example, the current national design calls for about 100 PSUs, while the long-term trend design calls for only 52 PSUs. What would be the effect of reducing the number of PSUs in the cross-sectional samples?

The effects of changing the numbers of PSUs, for a given sample size of schools and students to be assessed, are on the reliability of the estimates (sampling error, and the estimates of sampling error), and the costs of centralized administration.

There are two aspects to the issue of reliability. Below a certain threshold number of PSUs, it is not possible to obtain reliable estimates of sampling error. This means that one can have no confidence in confidence intervals (no matter how wide), and significance tests become meaningless. My personal view is that this minimum is about 50, although I have been forced to work on studies with as few as 20. The Hispanic Health and Nutrition Examination Survey, conducted a few years ago, used 8, and this has been causing analysts considerable problems ever since.

Above this minimum threshold, increasing the number of PSUs decreases sampling error, but with diminishing returns. Once the point is reached where a given assessment component averages one school per PSU, there are no more gains from increasing the number of PSUs (for current NAEP national samples, this number is about 500).
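The familiar cluster-sampling approximation (given here only as a sketch of the logic, not as a formula drawn from NAEP technical documentation) makes the diminishing returns explicit. For L PSUs with an average of \bar{m} assessed students per PSU, population variance \sigma^2, and intraclass correlation \rho,

    \mathrm{Var}(\bar{y}) \;\approx\; \frac{\sigma^2}{L\,\bar{m}}\,\bigl[\,1 + (\bar{m}-1)\,\rho\,\bigr].

Holding the total student sample L\bar{m} fixed, spreading it over more PSUs shrinks \bar{m} and with it the bracketed design effect; once each PSU contributes only a single school's worth of students, the remaining clustering is within schools, and further increases in the number of PSUs buy essentially nothing.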

From the operational standpoint, a smaller number of PSUs reduces costs because the supervisors and administrators do not need to travel as much. The benefits of this run out when the point is reached that one person cannot handle all of the work in a PSU in the time frame necessary. If one is to hire an extra person, she/he might as well be somewhere else.

Because the long-term trend samples are smaller than the main national samples, and because they are administered largely at different times of the year, it is more efficient to use fewer PSUs for them. If more PSUs were used, administration costs would increase considerably, with only very moderate gains in sampling error. Conversely, if fewer PSUs were used for the main samples, there would be little or no cost savings, the sampling errors would increase somewhat, and the estimates of the standard errors themselves would become less reliable.

In considering issues of sampling reliability, one needs to consider the effect of the number of PSUs on subdomain estimates. For demographic groups well spread across the country (gender, parents’ education), the benefits of increasing the number of PSUs are somewhat less than for the whole sample, as they are relatively less clustered to begin with. But for groups defined by geography (region, urbanization, and to some extent race/ethnicity), increasing the number of PSUs tends to have more benefit. This is because these groups are confined to (or dominated by) a subset of PSUs. Thus, for example, attempting to produce estimates for large-central-city schools in the southeast region is problematic, especially from the trend samples, since those schools that fall into the sample all come from one or two places.

All this is a round-about way of saying that I do not believe that there are savings to be had from deliberately reducing the number of PSUs in the national sample. The number of PSUs is a design parameter, to be optimized for a given design, not a matter of policy. Perhaps this is best illustrated by the fact that for the 1997 assessment in the arts at grade 8 (which is not a trend sample) we plan to use 52 PSUs.

One point to note is that, if trend samples (or samples for "standard" reports) were to be assessed at the same time as main samples (or samples for "comprehensive" reports), then the administrative workloads could be combined, and all assessments could be carried out in, say, 100 PSUs. This would reduce the sampling errors for the smaller samples somewhat.

One of the best ways to get sampling efficiency from NAEP samples is to have as many assessments as possible spiraled together at a grade (regardless of whether BIB spiraling is used within a subject). This is because spiraling ensures that all assessments are conducted even in the smallest schools, which can reasonably be asked to conduct only a single session (for this reason, the benefits are greatest at grade 4 and for private schools). This is very much counter to the situation in 1996, although in 1994 history and geography were spiraled together, to their mutual benefit in terms of sampling error.

Finally, I think those involved in NAEP policy should keep in mind the relatively great expense that is involved in sampling private schools for national samples at present. Private school students make up less than 10 percent of the student population (with the majority being in Catholic schools). As NAEP oversamples these students by a factor of three, they make up about 25 percent of the student samples (again, with Catholic students in the majority - about 60 percent). Yet because private schools are so small, they constitute about 40 percent of the schools in the sample, with the majority of these being non-Catholic. The participation rate for Catholic schools is higher than for public schools (especially at grade 12), whereas the participation rate for non-Catholic private schools is much lower; generally below the NCES minimum standard of 70 percent. It seems to me that to justify having one quarter of the schools in national samples be non-Catholic private schools (with about 3 to 4 percent of the student population), NAEP needs to find a way to obtain much more satisfactory participation rates from these schools.

3. What would the sampling design look like for STANDARD NAEPs? For COMPREHENSIVE NAEPs? Need they be the same?

From the information under tab 9 in the briefing book that you provided, I see little scope for differences in sampling between these two types of reports. It seems to me that the big differences might be in the need for conditioning. Since the same kinds of breakdowns are required by race/ethnicity, sex, SES, public/private, achievement levels, etc., I think that the same sample sizes would be needed for both types of reports. Furthermore, these would probably not be much different from the sample sizes that we see now. These tabulations drive sample size requirements at present, more than issues of subscales or requirements for minimum sizes needed to establish scales. National NAEP already has to oversample to get adequate numbers of students to report by race/ethnicity and public/private.

The dropping of the requirement to collect school, teacher, and principal data for the standard reports might have a favorable effect on school response rates, although I doubt it would be major. Certainly my perception is that these components are viewed as a burden, out of proportion to the importance of the data they provide, relative to other NAEP data, such as student proficiency data.

To effect a significant reduction in student sample size for a given cognitive assessment, one needs to reduce reporting requirements for small student subgroups, and accept increased levels of uncertainty for larger groups, both in comparison to each other and over time. The other way to keep sample sizes relatively small is to have an assessment that extracts a relatively large amount of information about what a student can do in a given time period. The trend toward more performance-oriented testing, and more comprehensive frameworks, tends to work against this. These types of assessment aim to increase validity, and in doing so tend to decrease efficiency. The current plans for the 1998 writing assessment appear likely to generate pressure to increase sample sizes significantly, compared to past assessments in other subjects.

4. Is BIB-spiraling and matrix sampling a necessary assumption?

I have already mentioned that spiraling across subjects is quite beneficial in terms of sampling error. That is separate from the question of BIB-spiraling and matrix sampling within a subject. This is somewhat outside my area of expertise, but actually I think the answer is obviously "yes", using the following logic.

A. It would be suicidal to attempt to increase substantially the testing time for any one student. There is upward pressure here, but no-one is suggesting that NAEP should take a week for each student, as some state or district testing programs appear to require.

B. I do not believe that it would be good or popular to give up NAEP scales (and consequently achievement levels). Mean p-values for items, some of which at least would not be released, would seem to be of limited value in the long-term.

C. Given that, the only option would be to drastically reduce the breadth of the framework for a given assessment. One would have to report essentially by subscales, and combine these together. One would have to produce a math scale with no students taking both algebra and geometry, for example. This is actually "TIB-spiraling" - "Totally Incomplete Blocks"! Although I believe this would be possible (since math scores for individual students are not needed), I think that the loss of ability to make use of the high intercorrelation among subscales would mean that larger student samples would be needed (I am not certain about this, but I expect that Al Beaton, Bob Mislevy, and Gene Johnson each know the answer without thinking).

Even though it might be possible to do away with BIB-spiraling, what is the benefit? If people find it difficult to analyze NAEP data now, how will it be when there are no math scores at all for any student in the math data, only algebra (or a part of algebra) for some students, and geometry for others?

BIB-spiraling was introduced to NAEP to make scaling possible. Nothing has happened in the past 12 years to decrease the demand for scaling, or to reduce the need for BIB-spiraling in order to achieve scaling. The somewhat separate question is whether conditioning and plausible values are essential/desirable. I am even less qualified to comment on this issue.

5. What about the sampling design in small states (e.g. DE, RI), and low density states (e.g., MN, AK)? (Actually, I think you mean MT rather than MN).

I think that to achieve a rational approach to varying the sampling scheme for these states, to reduce the burden on the state and the individual schools, it is important to step back and revisit the overall policy about state NAEP reporting reliability requirements. This can be posed as three questions (although the last two might be the same, depending upon the answer to the first one):

A. Should all states be subject to the same levels of reliability, or should results for larger states be more reliable than those for smaller states?

B. What should be the minimum level of reliability that is accepted?

C. What should be the maximum level of reliability that is required of any one state?

When the design for 1990 was formulated, the answers to these questions were developed as follows: The federal government should require a certain level of reliability for each state, but states were free to ask for additional sample if they wished. This additional sample could be targeted at certain subgroups, but no reporting by district or school was allowed. No state has ever asked for additional sample. (In 1996 the Department of Defense asked for a grade 4 science sample, but their case is rather different from the states and other jurisdictions.)

The minimum size was set at 2,000 assessed students. The reasoning was that this would provide, for each state, the level of precision that was achieved for each region in national NAEP samples. It was the judgment of NCES, ETS, and their advisors that this was a sensible level to aim at. This number was not arrived at by considering prespecified precision requirements or by using power calculations (although such calculations may well be behind the determination of the national sample size numbers from which this was derived).

The sample size of 100 schools was motivated by two things. First, a sample of 20-30 students per school seemed reasonable and practical, at least when only one subject was tested; reducing the per-school sample much below that would not materially reduce the burden on a school. The second consideration involved a set of issues similar to those involved in determining the number of PSUs in national NAEP. In state NAEP, schools play the role of primary sampling unit (PSU), as there is no clustering of the school sample within a state. It is important to have a sufficient number of schools in the sample to ensure that the design effect, due to the clustering of the student sample within schools, is not unduly large, and to ensure that estimates of sampling error are themselves reliable. Since school mean proficiencies are subject to quirks of distribution to a much greater extent than geographic PSU means, the desirable number of schools for a state assessment should be rather higher than the number of PSUs needed for a national assessment. The consequences of having a sample of schools that is too small were seen in 1994, in the attempt to report results for private schools by state.

Since 1990, four variations to this basic plan have been implemented. All are directed at either improving precision, reducing state and school burden, or both. These are as follows:

A. The student sample size requirement for public schools was increased from 2,000 to 2,500. In 1990 we at Westat wanted to be very sure that states would achieve the required sample size of 2,000 assessed students. Yet we did not know how the school and student response rates, and student exclusion rates, would vary across states. So we were quite conservative in the sample design, even to the extent of taking more than 100 schools in some states. In the event, most states ended up with over 2,500 students assessed. Everyone seemed to like this result, so a target of 2,500 was institutionalized in 1992.

B. In states with a few large schools, and many small schools, the sample in large schools was increased, and the number of small schools was decreased. This was an improvement in the design, since the sample was made closer to self-weighting. In 1990, students in large schools had been underrepresented in the sample (with appropriate weighting procedures applied to remove any bias). This also reduced the burden on the half-dozen or so plains and mountain states involved. I wrote to the state coordinators of the states involved prior to making this change, and no-one objected. Montana was one of the states most affected by this change.

C. Small schools are undersampled, by a factor down to one-half. I recall that in 1992, Nebraska required a sample of 200 grade 4 public schools in order to get a sample for two subjects, and other states required many more than 100. Thus in 1994, we introduced a modest undersampling of small schools. Schools with 10 or fewer students have a "half chance" of selection; those with between 10 and 20 students have their chances prorated accordingly (a sketch of this proration rule follows item D below). The total target sample size of students remains unchanged; some additional large schools get into the sample to make up for the small schools that are reduced. Again, weighting takes care of bias issues. This change makes a fairly sizable reduction in the number of schools needed in states like Nebraska and Maine. There is some reduction in precision in states heavily affected by this, but it is very minor.

D. Schools with very large samples of students assigned are permitted a reduced sample size. This policy helps states such as Delaware, Alaska, and Guam, where a very high proportion of the state’s students are concentrated in just a few schools. Implementing this policy does not affect the number of schools in the sample one way or the other. It reduces the total sample size of students (by definition, never by more than 50%, and in practice not by more than about 30%), and results in an undersampling of students from large schools. Thus the precision of the results is decreased, noticeably in the most extreme cases.
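The proration rule in item C can be sketched in a few lines of code; the function name and the use of per-grade enrollment as the size measure are illustrative assumptions, not Westat specifications. The base weight of any selected small school would be inflated by the reciprocal of this factor, which is how weighting removes the bias.

    def small_school_selection_factor(grade_enrollment):
        """Relative adjustment to a school's selection probability.

        Hypothetical illustration of the rule described in item C:
        schools with 10 or fewer eligible students get half the chance
        implied by their size measure, schools with 10-20 students are
        prorated linearly, and larger schools are unaffected.
        """
        if grade_enrollment <= 10:
            return 0.5
        elif grade_enrollment < 20:
            # linear proration between 0.5 at 10 students and 1.0 at 20
            return 0.5 + 0.5 * (grade_enrollment - 10) / 10.0
        else:
            return 1.0

    # Example: a school with 14 eligible fourth-graders
    factor = small_school_selection_factor(14)   # 0.7
    weight_inflation = 1.0 / factor              # about 1.43, applied to the base weight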

It can be seen that these changes are relatively minor. It is still generally true that each state assesses 2,500+ students per grade per subject, in about 100 schools. The number of schools increases when more than one subject is assessed, by a lot if there are many small and moderate-size schools (fewer than 60 students per grade) in the state.

So what can be done to reduce the burden on small states? The first thing that needs to be answered is whether 2,500 students are really needed in every state. The implementation of the policy in D above suggests not; samples as small as 1,250 would be acceptable. The second question that follows is whether there is a rationale for having a smaller sample in some states than others. So far I have not seen one. NAEP reports attempt to give the same sorts of information for all states, and I have not heard suggestions that some states should have more breakdowns than others. Does this mean that the sample size of students should be reduced to 1,500 or 1,250 for all states? This could be addressed by considering how much more white space would show up in the center of the mileage chart, and how many subgroup and across-time comparisons would be lost within each state. The trade-offs are fairly clear, and it should be possible to work towards consensus.
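A rough benchmark for those trade-offs (ignoring design effects and assuming the design is otherwise unchanged) is that standard errors scale with the inverse square root of the sample size, so halving a state sample from 2,500 to 1,250 widens confidence intervals by roughly

    \sqrt{2500 / 1250} \approx 1.41,

that is, by about 40 percent, with a corresponding loss of detectable subgroup and across-time differences within the state.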

Perhaps this is leading towards a plan where there is a minimum requirement for a sample size of, say, 1,500, with states able to volunteer for larger samples, up to 3,000, if they wish. My impression is that this will result in samples of 1,500 across the board, in states large and small. Alternatively, the Board could require different sample sizes for different states, depending upon size and resources. If this is done, it must be realized that smaller states will get less reliable results.

This discussion of student sample sizes dodges the big issue for most states, and that is the number of schools in the assessment. In moderate and large size states, reducing the number of students without changing the number of schools would be seen as no benefit at all. In very small states, as we have already seen, it is viewed as only a limited benefit.

Reducing the number of schools should be considered very carefully. As I indicated, the difficulties in reporting of private school results by state shows the problems that can arise when the sample of schools is too small. Having a sample of 100 schools has been very beneficial up to this point, in that we have not seen instances where the results for a state appear to have been skewed by one or two outlying schools in the sample. As I mentioned above, on technical grounds I am always nervous when a sample has fewer than 50 PSUs, and would be especially so when the PSUs are schools, and the purpose of the survey is NAEP. A reduction to, say, 75 schools per state would probably not cause undue problems, but I question whether this would be viewed as a real benefit by the smaller states.

The most dramatic way to reduce the burden for states is to keep the number of subjects and grades for a given year low. Assessing two subjects per grade, especially at grade 4, adds a lot of burden. Perhaps allowing states to choose to conduct only a subset of the subjects and grades offered would be helpful, although this leads to lots of problems in reporting, especially for trend. However, by default this has occurred because states have pulled out late from the assessment for one of the two grades (two states did this in 1996) or states decline to release results for part (but not all) of the assessment (two jurisdictions have done this in the past).

The main message needs to be that states and others should not expect to get something for nothing, and that NAEP needs to ensure that its credibility is not threatened by disseminating unreliable results because of pressures to reduce sample sizes.

6. What about the sampling design for private schools in small states?

Again, as long as we avoid trying to get something for nothing, I believe that it is possible to come up with some sensible directions here. We have learned that it is not realistic to sample just a sufficient number of private schools to supplement the public school sample, expect these private schools to participate, publish results that do not show a private school breakout, and expect that people will not try to figure out how private schools stand relative to public schools anyway. Equally, it is unrealistic for the federal government to pay for a sufficiently large sample of private schools in each state that reliable estimates could be published at the state level. The burden on private schools would be very heavy (in many states all schools would be included every time), and it is not clear that private schools would find these results particularly interesting.

There are only two ways out of this that I can see. The first is to drop private schools from the state assessments, at least as a requirement. If certain states wished to have a supplemental sample of private schools, perhaps that could be provided. I have only become aware of one state where the chief state school officer is perhaps particularly interested in private schools (Louisiana). My perception is that the push to include private schools in state NAEP comes very much from "inside the beltway".

The second approach would be to use the aggregate data from private schools to provide breakdowns of the national private school population that are of interest to private schools. For example, breakdowns by affiliation, size of school, urban/rural, males and females, parents’ education, and race/ethnicity, all within the private school population, might provide the incentive for private schools to participate. Then I would suggest that the sample in each state be just large enough to supplement the public school sample appropriately (as has been the case in 1994 and 1996), with no reporting of private school results by state. One would have to rely on education and refutation to overcome the temptation of some to deduce state-level private school results by subtraction. I believe this would not be as difficult as some have thought.

7. Is the sample size the same when one is establishing the scale in a particular content area as when one is maintaining the same scale in another NAEP cycle?

Again, this is a little outside my area of expertise. It seems to me, however, that in the past the sample size requirements have been overwhelmingly dominated by reporting requirements, not scaling requirements (with the reverse being true for field test sample sizes of course). From what I see of what is proposed for Standard reports, this seems likely to remain true in the future. I would guess that sample size requirements for scaling would only become an issue if the only results to be reported with reasonable reliability would be for the overall population, male/female, white/other race-ethnicity, and urban/rural, for example.

8. The Use of Targeted Assessments in NAEP

It appears that, over time, gaps have developed between what American students can do and what things NAEP would like to report on their ability to do. That is, the achievement levels are set much higher than the bulk of the distribution of actual achievement. My sense is that the NAEP assessment is now best suited to measure something in between these two extremes. As such, it may not be doing an especially good job of either. In order to assess accurately what things students do and do not know across the spectrum, and at the same time obtain accurate data as to how students are performing relative to the Proficient and Advanced achievement levels, it might be good to consider the use of targeted assessments. Thus in a given content area there could be material given to a sample from the whole population, and then other, more challenging, material given to students whose course-taking patterns suggest a greater level of proficiency. With this approach, NAEP might achieve both of the above objectives more effectively.

Targeted assessments have been used effectively in TIMSS at grade 12 in 1995, and in NAEP at grades 8 and 12 in 1996. In both cases this was for math and science. The possibility has been raised of using this approach for the 1998 NAEP writing assessment, at least at grade 12.

The NAEP and TIMSS experiences have shown that the sampling and administration of these targeted assessments can work well. Furthermore, they appear to be well-received by both schools and students. In NAEP it appears that schools were anxious to identify a group of students to take the targeted assessment, even if none of their students had taken the necessary courses to qualify. This approach might go some way to addressing the concern about motivation among grade 12 students.

Although this approach appears to offer a number of advantages, I do not think a decision to adopt this approach on a large scale should be taken lightly. In particular, I think that the Board needs to address itself first to the possible charge that NAEP is becoming elitist, interested only in what the best and most advantaged students can do.

I hope that you find at least some of these comments useful. Please let me know if you would like more detail on any aspects.

 

Endnotes

1. The uncertainty was small in absolute terms, but large compared to both sampling error and the size of changes in average proficiency over years.

2. An experiment with double-length forms reported by Johnson et al. (1996) showed some drop-off effects on performance in final blocks, compared to performance on earlier blocks in the same session, of about 4 points on a percent-correct scale. This would not be a large effect for the purpose of comparing individuals, but it is a substantial effect for the purpose of comparing subgroups or changes over time; these targets of inference are themselves only of this size.

3. Position effects also add uncertainty to the process of achievement level setting. Panel members judge expected percents of correct response to items for students at various levels of proficiency, but they ignore where an item might appear in a block.

4. "Marginal estimation" means estimating characteristics of population and subpopulation proficiency distributions, and of associations among proficiencies and background variables, directly from noisy observed responses. It contrasts with estimating point estimates examinee by examinee from his or her observed data, and using these values to estimate population characteristics. Marginal estimation provides superior estimates, dramatically so in cases where different examinees provide responses to forms that differ in difficulty, number of items, accuracy, and so on.

An example, extended in Section 6.1E, concerns basketball shooting proficiency. If all students attempt four free throws, comparing their shooting abilities is easy, if not especially accurate. Maybe 15% of students will make 100% of their attempts. Are they truly 100% accurate shooters? No. A ten-attempt contest might show only 4% making all their shots. A hundred-attempt contest would probably show none making all shots. The situation becomes all the more complex if some shots are from the free-throw line, some are from the corner, and still others are lay-ups -- and different students have different mixes of shots! A more complex method of summarizing performance, akin to IRT, would be required. In essence, it becomes necessary to define a hypothetical scale of performance that would be used to characterize shooting accuracy across these many possible situations, then figure out what evidence different numbers and locations of attempts convey about it. This is the 'true-score' scale -- like the NAEP IRT scale. One could project from it to a hypothetical standard collection of shots. The analyses required to carry out these steps, to estimate the distribution of "true" shooting abilities and relate all the shots to it, are 'marginal analyses.'
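A small simulation makes the arithmetic behind this example concrete. The uniform spread of 'true' accuracies used here is an arbitrary illustrative assumption (as are the function and variable names), not an estimate from any data; with it, the fraction of 'perfect' shooters comes out on the order of 20% at four attempts, about 5% at ten, and essentially zero at one hundred, which is the pattern described above.

    import random

    def perfect_rate(n_attempts, n_students=100_000, seed=1):
        """Fraction of simulated students who make every shot in a contest
        of n_attempts free throws, when each student's true accuracy is
        drawn from a spread of abilities (uniform on 0.3-0.9 here, purely
        for illustration)."""
        rng = random.Random(seed)
        perfect = 0
        for _ in range(n_students):
            p = rng.uniform(0.3, 0.9)   # the student's unobserved 'true' accuracy
            if all(rng.random() < p for _ in range(n_attempts)):
                perfect += 1
        return perfect / n_students

    for n in (4, 10, 100):
        print(n, perfect_rate(n))

The observed proportion of perfect performances thus says as much about the number of attempts as about the underlying abilities, which is why inferences must be drawn about the latent 'true-score' distribution rather than read directly off the raw scores.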

The multiple imputations approach used in current NAEP operations is one example of a marginal analysis in testing. Others include Markov chain Monte Carlo estimation (e.g., Albert, 1992), structural equations modeling with categorical test items and latent variables (e.g., Muthén, 1988), and classical test theory methods for correcting variances and correlations for measurement error (Spearman, 1907).

"Marginal estimation" means estimating characteristics of population and subpopulation proficiency distributions, and of associations among proficiencies and background variables, directly from noisy observed responses. It contrasts with estimating point estimates examinee by examinee from his or her observed data, and using these values to estimate population characteristics. Marginal estimation provides superior estimates, dramatically so in cases where different examinees provide responses to forms that differ in difficulty, number of items, accuracy, and so on.