Design/Feasibility Team

Report to the National Assessment Governing Board

July 1, 1996

Robert Forsyth, University of Iowa
Ronald Hambleton, University of Massachusetts
Robert Linn, University of Colorado
Robert Mislevy, Educational Testing Service
Wendy Yen, CTB/McGraw-Hill

For their helpful discussions and useful information, we wish to express our gratitude to the National Assessment Governing Board and staff, the National Center for Education Statistics, Educational Testing Service, and Mathtech. Keith Rust and Ben King provided invaluable input on sampling considerations.



Executive Summary

The redesign of the National Assessment of Educational Progress (NAEP) has been the focus of extensive deliberations by the National Assessment Governing Board (NAGB) during the past year. As part of those deliberations, NAGB has developed a paper called "Themes and Issues" in which the Board has identified some critical objectives of NAEP and recommended a number of characteristics to be achieved in the redesign. The Design Feasibility Team (DFT) was formed by NAGB to lay out technical implications for the design, analysis, and reporting of NAEP that are implicit in the Board’s Themes and Issues.

The report of the DFT is intended to provide a bridge between the general desired characteristics and priorities for NAEP expressed in the Themes and Issues, on the one hand, and the specifics of a Request for Proposals (RFP) and the actual detailed designs developed by respondents to the RFP, on the other. The report does not detail a specific design. Indeed, the DFT believes it would be presumptuous to do so at this stage and potentially counterproductive because there are many possible approaches to achieving the various objectives of the redesign. Rather than rushing to premature closure on the details of design, the report lays out the trade-offs that need to be evaluated in considering alternative designs and identifies criteria that may be used in judging the quality of alternatives in terms of the objectives of NAGB’s Themes and Issues.

Before turning to a discussion of the specifics of the Themes and Issues and their implications for design, analysis, and reporting, it is important to understand the context in which NAEP operates. Thus, after a brief introduction, Section 2 of the DFT report begins with a discussion of the key features of NAEP (e.g., a large-scale, nationally-representative, cross-sectional survey using standardized tasks not directly tied to specific instruction or curriculum, conducted in a brief period of time). Those characteristics make NAEP good for some purposes (e.g., monitoring trends) but not others (e.g., making causal inferences). It is argued that a key objective in rethinking NAEP is to focus resources within the range of missions that a survey with its evidentiary characteristics is good at, and minimizing what it is not good at.

Section 3 of the report addresses management and administration issues that impact the cost, dependability, and timeliness of NAEP. Several key ideas are identified that may be used as criteria in judging alternative designs. Notable among these are the need to think in terms of global optimization of the whole NAEP system rather than only local optimization of each component, the concept of the critical path, and the need to monitor variation in the results due to factors such as sampling error, changes in assessment tasks, and changes in procedures. Keeping the activities that are on the critical path to a minimum is seen as key to achieving the goals of simplification and faster reporting. Four implications of the discussion of management and administration issues are identified: (1) overarching priorities need to be specified to keep local optimization from subverting the larger goal of global optimization; (2) a modular design that identifies a "core NAEP" for tracking trends and rapid reporting as well as modules for special purposes is needed for an acceptably simple and efficient critical path while maintaining the richness of assessment expected of NAEP; (3) phased analysis and reporting is needed; and (4) changes need to be phased in.

Although a desirable narrowing of demands on NAEP can be achieved by restricting attention to the kinds of information that can be provided effectively by a large-scale survey such as NAEP, there is still considerable leeway in setting priorities. It is not possible to "have it all;" trade-offs must be made. Some of the necessary trade-offs that need to be weighed in the redesign are discussed in section 4 of the DFT report. The Themes and Issues provide general guidance that will be helpful in setting priorities and evaluating the necessary trade-offs. The DFT discussion of the Themes and Issues should provide some additional basis for evaluating the trade-offs.

In planning the redesign it is useful to have an understanding of how we got to where we are in NAEP. Toward this end, section 5 of the DFT report provides a brief selective history of NAEP thus setting the stage for the main section of the report—the detailed discussion of the Themes and Issues in section 6.

Trade-offs are elaborated in section 6 and approaches to meeting the objectives of the Board’s Themes and Issues are discussed. For example, trade-offs among three approaches to combining state and national samples are evaluated. Three variations of an approach to reporting, called marketbasket reporting, that the DFT believes will help in meeting several of NAGB’s key objectives are elaborated. Some potential simplifications in analyses are identified especially with regard to the core NAEP to be used for rapid reporting and tracking trends. It is noted, however, that complex analyses (done once) do not, in and of themselves, preclude rapid turnaround. Indeed, it is not so much complex analyses as time spent in (1) report review and revision and (2) rework that can be avoided through system redesign that appear to be bottlenecks.

The final section of the report sketches a feasible configuration for NAEP that incorporates the objectives specified in the Themes and Issues. This configuration includes a modular design, the use of a marketbasket for reporting, phased analysis and release of reports, and previously-proven analyses on the critical path for the core NAEP results. It is not the only configuration possible, but it is presented as an example of a design that appears to address NAGB’s Themes and Ideas. The team hopes that NAGB will find the discussions of technical issues and design tradeoffs of some assistance in their consideration of alternative configurations.


Design/Feasibility Team
Mission Statement

The Design/Feasibility Team shall provide advice on the technical feasibility, the necessary components, and costs of implementing a National Assessment of Educational Progress based on the policy themes and ideas being drafted by the National Assessment Governing Board. Specifically, the charge to the design/feasibility team is a threefold one.

First, using the Board’s themes and ideas as articulated in the preliminary policy paper and Board policies now in place, the design/feasibility team should identify the necessary components of a design that would fully embody such themes and ideas. The question to be addressed is, "How can these policy directions best be operationalized in a large scale assessment?" The necessary components so identified may form the bases for the specifications of the Request For Proposals to potential contractors. In developing the necessary components the Design/Feasibility Team may want to propose various options. If this is the case, priorities and trade-offs among options would also be identified.

Second, the Design/Feasibility Team should examine the necessary components of the resulting design for both intended and unintended consequences. The focus here is to ask the question, "If all these moving parts are put in motion, what will be the effect?" The design/feasibility team should plan to provide empirical evidence where possible to support their conclusions. The design/feasibility team will be advised by one or more financial consultants.

Third, the charge to the Design/Feasibility Team is to identify those areas in the design which appear not to be feasible for the National Assessment operation in the next 5-year and 10-year periods, or those which might result in ultimate deleterious effects on the NAEP program.

The Design/Feasibility Team shall complete its report no later than June 30, 1996. A status report shall be made to the Board at its May meeting, and the final recommendations will be presented to the Board in the form of a Design/Feasibility Team Report.


1.0 Introduction And Overview

The role of the Design Feasibility Team (DFT) is to lay out technical implications for NAEP design, analysis, and reporting that are implicit in National Assessment Governing Board "Themes & Issues" (Table 1-1 gives its key points). We will sketch a configuration that moves NAEP in the directions outlined therein. It would be presumptuous for us to detail a specific design, since other effective ideas may have not yet surfaced, or even been conceived. Such ideas, and the wherewithal to craft them into a detailed plan, will emerge through competition for the contracts or supporting grants, as multiple organizations devote substantial time and talent to win the project. But we, the DFT members, have had the opportunity to experience firsthand what has worked well and what has not in several large-scale assessments, including NAEP itself; and we have gained some insights into why this may be so. We will point out trade-offs and implications that are not always apparent on the surface, and highlight issues that will have to be addressed in any specific design proposal. We will sketch design components which, in concert, can move NAEP in the directions that the NAGB Themes and Issues propose.

We envisage a core national assessment, administered on a predictable schedule, which focuses on those things that a large-scale, cross-sectional, nationally-representative survey can, by its nature, do well. For this core, analysis and reporting can be accomplished more quickly, efficiently, and reliably than under the current NAEP configuration. Modular design would facilitate integrating this core with other NAEP components, such as state assessments, new and more varied tasks, and auxiliary information such as teacher surveys—but none of these would appear on the critical path to initial time-series reports. This modularity would facilitate the use of NAEP linkages with extra-NAEP studies that provide kinds of information that large-scale cross-sectional surveys cannot. These would include longitudinal surveys, program evaluations, state and local testing programs, and research studies of classroom practices and student learning. Changes in the core would be phased in over multiple time points, as their worth and feasibility are demonstrated and interest proves enduring.

The next two sections of this report cut across the particulars of NAEP designs, no matter how purposes and tradeoffs are resolved.

Section 2 addresses the missions for which large-scale assessments like NAEP, by their very nature, are and are not well-suited. The key idea is to focus NAEP’s efforts on what it can do well. We will note in passing, though, that leeway remains within these possibilities for where to focus attention—that is, for specifying the purposes that have the highest priorities. Tradeoffs arise because different purposes are better served by different assessment configurations.

Section 3 addresses management and administration issues that impact the cost, reliability, and timeliness of NAEP. The key idea is organizing NAEP activities to eliminate bottlenecks and inefficiencies.

Sections 4 and 5 provide additional background specific to NAEP. Section 4 discusses the issue of design tradeoffs, and Section 5 reviews how some of these tradeoffs have been decided over the years in NAEP, as reflected in design elements and their expected and unexpected consequences.

Sections 6 and 7 address redesign issues directly. Building on the preceding sections and on experience with NAEP and other large-scale assessments, Section 6 comments individually on the NAGB Themes and Issues in greater detail. Section 7 sketches a feasible configuration for NAEP that incorporates the Themes and Issues. Alternatives that reflect tradeoffs among competing purposes and values are noted.

Table 1-1

NAGB Themes and Issues’ Objectives, Sub-Objectives, and Recommendations

OBJECTIVE 1: To measure national and state progress toward the third National Education Goal and provide timely, fair, and accurate data about student achievement at the national level, among the states, and in comparison with other nations.

A. Test all subjects specified by Congress: reading, writing, mathematics, science, history, geography, civics, the arts, foreign language, and economics.

o The National Assessment should be conducted annually;

o Reading, writing, mathematics, and science should be given priority, with testing in these subjects conducted according to a publicly released 10-year schedule adopted by the NAGB;

o History, geography, the arts, civics, foreign language, and economics also should be tested on a reliable basis according to a publicly released schedule adopted by NAGB.

B. Vary the amount of detail in testing and in reporting results.

o National Assessment testing and reporting should vary, using standard report cards most frequently and comprehensive reporting in selected subjects about every ten years;

o National Assessment results should be timely, with the goal being to release results within 6 months of the completion of testing.

C. Simplify the National Assessment design.

o Options should be identified to simplify the design of the National Assessment and reduce reliance on conditioning, plausible values, and imputation to estimate group scores.

D. Simplify the way the National Assessment reports trends in student achievement.

o A carefully planned transition should be developed to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program;

o As a part of the transition, NAGB will review the tests now used to monitor long-term trends in reading, writing, mathematics, and science to determine whether and how they might be used now that new tests and performance standards have been developed during the 1990’s for the main National Assessment. NAGB will decide how to continue the present long-term trend assessments, how often they would be used, and how the results would be reported.

E. Use performance standards to report whether student achievement is "good enough."

o The National Assessment should continue to report student achievement results based on performance standards.

F. Use international comparisons.

o NAEP test frameworks, test specifications, achievement levels, and data interpretations should take into account, where feasible, curricula, standards, and student performance in other nations;

o The National Assessment should promote "linking" studies with international assessments.

G. Emphasize reporting for grades 4, 8, and 12.

o The National Assessment should continue to test in and report results for grades 4, 8, and 12; however, in selected subjects, one or more of these grades may not be tested;

o Age-based testing and reporting should continue only to the extent necessary for international comparisons and for long-term trends, should NAGB decide to continue long-term trends in their current form;

o Grade 12 results should be accompanied by clear, highlighted statements about school and student participation, student motivation, and cautions, where appropriate, about interpreting 12th grade achievement results;

o The National Assessment should work to improve school and student participation rates and student motivations at grade 12.

H. National Assessment results for states.

o National Assessment state-level assessments should be conducted on a reliable, predictable schedule according to a 10-year plan adopted by NAGB;

o Reading, writing, mathematics, and science at grades 4 and 8 should be given priority for state-level testing;

o Testing in other subjects and at grade 12 should be permitted at state option and cost;

o Where possible, national results should be estimated from state samples in order to reduce burden on states, increase efficiency, and save costs.

I. Use innovations in measurement and reporting.

o The National Assessment should assess the merits of advances related to technology and the measurement and reporting of student achievement;

o Where warranted, the National Assessment should implement such advances in order to reduce costs and/or improve test administration, measurement, and reporting.

OBJECTIVE 2: To develop, through a national consensus, sound assessments to measure what students know and can do as well as what students should know and be able to do.

A. Keep test frameworks and specifications stable.

o Test frameworks and test specifications developed for the National Assessment generally should remain stable for at least ten years;

o To ensure that trend results can be reported, the pool of test questions developed in each subject for the NAEP should provide a stable measure of student performance for at least ten years;

o In rare circumstances, such as where significant changes in curricula have occurred, the Governing Board may consider making changes to test frameworks and specifications before ten years have elapsed;

o In developing new test frameworks and specifications, or in making major alterations to approved frameworks and specifications, the cost of the resulting assessment should be estimated. The Governing Board will consider the effect of that cost on the ability to test other subjects before approving a proposed test framework and/or specifications.

B. Use an appropriate mix of multiple-choice and ‘performance’ questions.

o Both multiple-choice and performance items should continue to be used in the NAEP;

o In developing new test frameworks, specifications, and questions, decisions about the appropriate mix of multiple-choice and performance items should take into account the nature of the subject, the range of skills to be assessed, and cost.

OBJECTIVE 3: To help states and others link their assessments with the National Assessment and use National Assessment data to improve educational performance.

o The National Assessment should develop policies, practices, and procedures that enable states, school districts, and others who want to do so at their own cost, to conduct studies to link their test results to the National Assessment;

o The National Assessment should be designed so that others may access and use National Assessment test data and background information;

o The National Assessment should employ safeguards to protect the integrity of the National Assessment program, prevent misuse of data, and ensure the privacy of individual test takers.


2.0 What NAEP Can and Cannot Do

Historically, NAEP has exhibited the following characteristics: It is a large-scale, nationally-representative, cross-sectional survey. It is timed, standardized, and, from the students’, teachers’, and school administrators’ points of view, low-stakes. It is not directly connected with students’ instruction, in that the tasks they are administered have not been selected in light of what they have been working on in their classes, and students receive no feedback on how they have done. Developing the NAEP content framework is a national consensus process. Since the early 1990’s, developing achievement level standards has also been a national consensus process. This section discusses the kinds of missions that an assessment possessing these properties is well-suited to support, describes the kinds it is not well-suited to support, no matter how elegantly designed and skillfully executed, and notes some broad implications for NAEP design. Additional discussion related to specific Themes and Issues will appear in Section 6.

2.1 Missions for Which NAEP Is Well-Suited

NAEP is virtually unique as a time series of large, nationally-representative samples of students. As such, it provides information about both the status of achievement in the nation as a whole at each time point, and about changes over time. It has thus played a central role in debates about the trends in American achievement over time (e.g., Koretz, 1992a). Other widely-used tests cannot serve this function. The Scholastic Assessment Test (SAT) or American College Test (ACT) cannot play this role, despite the large number of students involved, because their student samples are self-selected. Commercial achievement tests used by states and districts are not directly comparable. Longitudinal studies such as the National Educational Longitudinal Studies (NELS) and High School and Beyond (HSB) do not collect comparable information about successive cohorts of students at regular intervals. Thus, as the National Academy of Education (1993) has argued, NAEP’s capability to track student achievement over time is one of its unique and most precious features.

Of course, tracking trends in achievement over time requires gathering comparable data about students’ achievement. A tension is thus introduced between, on the one hand, maximizing measurement of change by comparing performance on a given collection of tasks in one assessment with performance on the same collection in succeeding years; and, on the other hand, revising the collection of tasks in succeeding years in order to reflect changing beliefs about what knowledge and skills are important to assess.

Also notable is the national notice NAEP can command for its methodologies and the focus of its attention (e.g., through achievement level reporting). NAEP has historically been a source of innovation for assessment methodology, and a forum for discussion about the topics and skills schooling should address. Moreover, the form and content of a national assessment communicate volumes of information in and of themselves to many audiences, before any data about students are even collected. Indeed, a core NAEP mission is "to develop, through a national consensus, sound assessments to measure what students know and can do as well [as] what students should know and be able to do." Thus, it is critically important for the National Assessment to reflect the breadth and richness of valued content and processes. NAEP is thus in a position to contribute strongly to national discussion about what is important for students to learn, and to establish frameworks for these discussions that could extend to other studies and other purposes beyond NAEP itself. One of the missions of NAEP is "to help states and others link their assessments with the National Assessment," and there are useful ways NAEP can foster these connections. The designs described in this document move NAEP further in that direction (subject to the caveats pointed out in the following section).

NAEP also collects information about students in the form of demographic, personal, and instructional background variables. Student-level information comes from self reports and from associated NAEP teacher and principal surveys, and some school-level information comes from census data. Associations between these variables and levels of performance can thus be estimated, and are routinely calculated and reported. To the extent that these variables are defined and measured in the same way in successive assessments, trends in these associations can be monitored.

2.2 Missions for Which NAEP Is Not As Well-Suited

Perhaps the most notable limitations of NAEP in terms of the kinds of inferences it can support stem from its being cross-sectional (as opposed to longitudinal) and observational (as opposed to experimental, or random-assignment). Being cross-sectional means that a given student is observed at only a single point in time. It is therefore impossible to estimate growth curves or patterns over time at the level of individual students, or to estimate associations between growth curves and student background variables. Such inferences are possible only in longitudinal studies such as NELS and HSB.

Even these longitudinal studies, if they merely track students in existing schools with existing instruction, fail to provide definitive evidence about causes or determinants of achievement. Inferences of this type require comparisons of comparable students under different conditions—a requirement that can rarely be strongly supported in cross-sectional observational surveys such as NAEP. This means that the associations between performance and student background variables, though they might be suggestive and useful for follow-up, are insufficient for concluding that those background variables caused higher or lower performance.

As an example of the kind of inferential error that can result from observational studies, we point to the following counterintuitive association in the 1992 NAEP reading assessment (Mullis, Campbell, & Farstrup, 1993). The question is whether extra instruction in reading helps children read better. Of course, we respond. Yet in that study, the amount of reading instruction fourth-graders receive was correlated negatively with their performance on the reading tasks:

Time Spent in Reading Instruction    30-45 Minutes    60 Minutes    90 Minutes or More
Average Proficiency                  220              219           216

With a negative correlation (r= -.1) between reading performance and time spent in reading instruction, it appears that increasing reading instruction decreases reading performance! But the average difference among students in the population who received various amounts of reading instruction—the ‘prima facie’ effect—doesn’t necessarily estimate the average causal effect of reading instruction on performance, because factors that may influence instructional time or reading performance are not taken into account in the comparison (Holland & Rubin, 1987). The NAEP report explains that the negative relationship in this example makes sense when we remember that (a) students who get extra help are usually students who seem to need extra help, and (b) students who seem to need extra help usually have low test scores. Other prima facie effects that we interpret as causal effects if they conform to our expectations can be just as wrong for similar reasons.
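The selection mechanism described above can be illustrated with a small simulation. The sketch below is not drawn from NAEP data: the ability distribution, the cut points for assigning instructional time, and the assumed positive effect of 3 score points per 30 minutes of instruction are all hypothetical values chosen for illustration. It shows that when students who appear to need help receive more instruction, the prima facie association can be negative even though the built-in causal effect is positive, and that the same effect shows up with its correct sign when instructional time is assigned at random.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(assign_by_need, n=20000, seed=0, effect_per_30_min=3.0):
    """Simulate minutes of reading instruction and resulting scores.

    All numbers are hypothetical assumptions, not NAEP estimates.
    Returns (correlation, {minutes: mean score}).
    """
    rng = random.Random(seed)
    minutes, scores = [], []
    for _ in range(n):
        ability = rng.gauss(220, 30)  # proficiency before instruction
        if assign_by_need:
            # Teachers see a noisy signal of ability; weaker-looking
            # students are given more instructional time.
            perceived = ability + rng.gauss(0, 60)
            if perceived < 200:
                m = 90
            elif perceived < 240:
                m = 60
            else:
                m = 30
        else:
            # Random assignment, as in an experiment.
            m = rng.choice([30, 60, 90])
        # Instruction helps every student by the same assumed amount.
        score = ability + effect_per_30_min * (m / 30) + rng.gauss(0, 10)
        minutes.append(m)
        scores.append(score)
    means = {}
    for level in (30, 60, 90):
        group = [s for m, s in zip(minutes, scores) if m == level]
        means[level] = sum(group) / len(group)
    return pearson(minutes, scores), means


if __name__ == "__main__":
    for by_need in (True, False):
        corr, means = simulate(assign_by_need=by_need)
        label = "assigned by need" if by_need else "assigned at random"
        print(label, round(corr, 3), {k: round(v, 1) for k, v in means.items()})
```

Under these assumptions the need-based run shows average scores falling as instructional time rises, while the random-assignment run shows them rising. Only the second comparison recovers the sign of the effect that was built into the simulation, and it is the kind of comparison that an observational survey cannot provide.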

Other offsetting features of NAEP include limitations with respect to (a) motivation, (b) reliance upon survey data, and (c) constraints on students’ time and the connection of assessment tasks with their instructional experiences. We will discuss motivation in Section 6.1G. As for reliance on survey data, we point out that teacher and pupil reports of instructional practice are notoriously dubious. Soliciting background data from the students themselves is quite economical, compared to ascertaining information such as home characteristics from actual observation or record searches. But especially with younger students, the trade-off is accuracy:

Some indicator systems have relied on student reports for information on background factors. … A[n] … analysis of the quality of responses in the High School and Beyond study provided … sobering results. Correlation coefficients between sophomores’ and parents’ reports of background variables ranged from very low to quite high—for example, .21 for the presence of a specific place to study in the home; .35 for the presence of an encyclopedia in the home (an item used in the NAEP as well); .44 for mother’s occupation; .50 for family income; .56 for whether the family owns or rents its residence; .81 for mother’s education; and .87 for father’s education (Fetters, Stowe, & Owings, 1984). (Koretz, 1992b, pp. 17-18)

As for constraints on students’ time in assessment and lack of connection with their instructional programs, we must recognize that what we can learn about students from the NAEP cognitive tasks is limited in its scope. That is, there are kinds of learning we want our students to accomplish, but about which NAEP cannot provide direct evidence. For example, NAEP is not well suited to support inferences about how well students perform in tasks that extend over time, that involve the use of resources beyond the NAEP setting, or directly address skills and concepts on which the student has been specifically working. In these senses, NAEP tends to underestimate what students can do (Kane, 1996). Conversely, NAEP can overestimate the capabilities of students who do well on its limited palette of tasks but fare poorly in the context of the classroom. These facts hold implications for both achievement-level reporting and for the view of domains of NAEP tasks as representations of domains of learning (see Section 6.1E on standard-level setting).

A related mission for which NAEP is not well-suited is as a measurement tool for high stakes state or local accountability. While there is much consensus around the country in terms of what should be taught, there are also serious differences, with perspectives ranging from the most conservative to the most avant-garde. These differences produce intense scrutiny of any assessments used for high stakes evaluations. The National Assessment is vulnerable to attack if it is seen as a federal test implemented to support a federal curriculum. While the low-stakes nature of NAEP has contributed to participation and motivation problems, the same low stakes have also been a key contributor to its longevity, support, and usefulness.

This leads naturally to the mission of linking NAEP with the assessments of states and others. It is critical to NAEP’s credibility that the limitations of what can and cannot be accomplished with such links be acknowledged. NAEP frameworks will rarely match any given state’s frameworks, and NAEP assessment forms will rarely be parallel with state assessment forms. Student and administrator motivations are very different on the NAEP and local assessments. All of these differences produce uncertainty (‘error’) in linking state assessments to the National Assessment (Linn & Kiplinger, 1994; Ercikan, in press). Some states may wish to use the link to assess how their students would do on NAEP in years or grades where NAEP is not administered. Others may wish to use the link to estimate how the nation would do on a state assessment, estimating national norms for it. But the state assessment cannot be a "stand in" for NAEP, or vice versa. The changes over grades and years that states are concerned about assessing will often be smaller than the linking errors.

The bottom line for assessments like the current NAEP is that they can provide excellent information about the status of a limited number and nature of indicators of WHAT students do, and establish frameworks for public discussion of educational progress and policy—but limited information on WHY (i.e., the determinants of their performance, which is what policy-makers are really interested in), HOW (i.e., what educators in content areas and educational and cognitive psychologists are really interested in), or UNDER WHAT CONDITIONS (another thing educational and cognitive psychologists see as important). Different ways of gathering information are much better suited to providing information about these aspects of student learning, including longitudinal studies, laboratory research, in-depth cognitive studies of smaller numbers of individual students, controlled field trials, and careful observational studies of classroom processes.

2.3 Implications

A key objective in rethinking NAEP is to focus resources within the range of missions that a survey with its evidentiary characteristics is good at, and to minimize effort spent on what it is not good at. If it is deemed important at a national level to obtain information that NAEP is ill suited to provide, we should not attempt to stretch NAEP to do so (necessarily poorly). Rather, we should conceive of an informational system in which NAEP is but one component; a system in which complementary and interconnected research of various kinds is each designed to do well the kinds of things it can do, and does not waste time and money doing things it cannot do well. This would argue for a simpler and more compact National Assessment which effectively indicates status and trend of key indicators, routinely gathering information about selected background variables as well, but not professing to answer causal questions about trends or to explain the cognition underlying performances. Instead, NAEP should be designed to be easy to ‘plug into’ alternative projects and ways of gathering data that are well designed for other purposes. Examples of complementary studies that could include National Assessment indicators among their own data-gathering are program evaluations, classroom observations, cognitive research studies, protocol analyses of large-scale assessment tasks, longitudinal surveys such as NELS, and studies including in-depth background and instructional practices of students.


3.0 Management Principles

Many of the problems that have plagued NAEP over the years, including anomalies, errors, high costs, and extended time lines, can be diminished by applying familiar management principles from business and industry (e.g., Deming, 1982). They apply no matter what configuration of design, analysis, and reporting is ultimately decided upon for NAEP. They concern how complex systems, with multiple steps and many actors, are structured. The following sections present the relevant concepts, and illustrate how they apply in NAEP.

3.1 Local vs. Global Optimization

How do we improve quality and productivity? "‘By everyone doing his best’?," asked W. Edwards Deming; "Five words—and it is wrong. … You have to know what to do. You have to know what to do, then do your best. Sure we need everybody’s best—everybody working together with a common aim. And knowing something about how to achieve it" (Walton, 1986, p. 32). The concept of ‘local optimization’ is ‘everyone doing his best’ but with a limited understanding of how their work fits into the system as a whole. The criteria that seem important to each contributor may do a good job of balancing tradeoffs that are visible to each of them, in accordance with priorities as they see them—yet when brought together, the contribution of one group can block or delay contributions of others. The resulting system, even if locally optimal everywhere, can be globally suboptimal. Some examples from NAEP:

  • For the 1986 Reading Assessment, test developers made slight revisions to NAEP tasks from previous years in order to improve their comprehensibility or grammar. They were better items. They became worthless for gauging change over time, however, since the small changes in performance these minor revisions caused (e.g., percents-correct from, say, 65% to 68%) often exceeded the amount of change in population performance over a two-year period (Beaton & Zwick, 1990). The course now followed is less optimal locally, but more desirable globally: Use unchanged items in unchanged blocks for trend analyses; administer and score these blocks of items in just the same way as in the previous cycle; and treat any revised items, even ‘slightly revised’ items, as new items.

  • A major revision of a content-area framework produces new task specifications and many new tasks—tasks that reflect the latest thinking in the field and the most up-to-date research. And what could be better than putting these new tasks, interpreted through the new framework, into place immediately? But without having administered these tasks in their final configuration before the operational test administration, we cannot know which ones will provide useful data from students, or whether we will be able to link results from the new framework to the previous framework. We can only find out after we actually have data—when we must carry out analyses with uncertain and unpredictable results, when errors or unexpected complications may call for the invention of new analytic techniques, when unforeseen glitches may require expensive and time-consuming rework, and when interagency decisions among alternative analysis and reporting options must be wrangled out. A more globally optimal course would be to include the new tasks under the new framework in a first assessment cycle jointly with established items in the previously established framework, but the initial results would be reported relative to the established framework only. The unpredictable and exploratory analyses required for the new components would be carried out more deliberately. Alternative procedures would be compared more thoughtfully, or invented if necessary—without the need for untested patches rushed into place to meet reporting deadlines. What works and what doesn’t could be determined. Results of these analyses could be released in more detailed reports that introduce reporting under the new framework and show its relationship to the previous framework. Preparations could be made for faster and more stable initial reporting under the new framework in the next assessment cycle.

3.2 The Critical Path

The PERT chart is a popular management tool for understanding the interrelationships among tasks in a large project. It shows which tasks depend on others, and which can be carried out in parallel. Importantly, it describes the chain of tasks, each depending on the previous, which absolutely must be carried out for the project to be completed; this is the critical path. Carrying out the tasks in the critical path determines the minimal amount of time required to complete the project. This concept is a key to cutting the time to report NAEP results: No task should appear on this critical path between data collection and reporting if it can be done before or if it is not essential for the report.
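The critical-path idea can be made concrete with a small sketch. The task names, durations (in weeks), and dependencies below are hypothetical and chosen only for illustration; they are not NAEP's actual schedule. The function finds the longest chain of dependent tasks leading to a first report, then shows how that chain shortens when one task no longer gates the report.

```python
from functools import lru_cache

# Hypothetical durations in weeks; every name and number here is an assumption.
DURATION = {
    "collect_data": 0,
    "score_responses": 6,
    "final_weights": 8,
    "merge_teacher_survey": 10,
    "scale_core_items": 4,
    "first_report": 2,
}

# Each task maps to the tasks that must finish before it can start.
EVERYTHING_FIRST = {
    "collect_data": [],
    "score_responses": ["collect_data"],
    "final_weights": ["collect_data"],
    "merge_teacher_survey": ["collect_data"],
    "scale_core_items": ["score_responses", "final_weights", "merge_teacher_survey"],
    "first_report": ["scale_core_items"],
}

# Same tasks, but the survey merge is still done without gating the first report.
PHASED = dict(EVERYTHING_FIRST, scale_core_items=["score_responses", "final_weights"])


def critical_path(prereqs, duration, goal):
    """Return (total_time, path) for the longest dependency chain ending at goal."""

    @lru_cache(maxsize=None)
    def longest(task):
        best_time, best_path = 0, ()
        for p in prereqs[task]:
            t, path = longest(p)
            if t > best_time or not best_path:
                best_time, best_path = t, path
        return best_time + duration[task], best_path + (task,)

    return longest(goal)


weeks_before, path_before = critical_path(EVERYTHING_FIRST, DURATION, "first_report")
weeks_after, path_after = critical_path(PHASED, DURATION, "first_report")
print(weeks_before, path_before)
print(weeks_after, path_after)
```

Under these assumed numbers, taking the slowest task off the critical path shortens the time to the first report without dropping that task from the project; it simply finishes in parallel and feeds a later report.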

There appear to be few tasks currently on the NAEP critical path that can be moved ahead without incurring any tradeoffs whatsoever. Others involve tradeoffs, but ones for which disadvantages appear to be overwhelmed by the advantages of speed and efficiency. For example:

  • The current NAEP analysis configuration requires, for a given subject area, analyses that involve all tasks in all scales, final sampling weights, and all background variables, including the matching and merging of teacher survey data. It is nearly true that everything must be done before anything can be reported; almost every element in the data lies on the critical path to the first report.

  • Major analytic decisions such as whether double-length writing blocks can be incorporated into the Writing scale, or whether to carry out cross-grade or within-grade scaling, are often slated to be made only after the data are in, and have been analyzed. This requires parallel development of alternative analytic procedures, honed to the point that they are sufficiently reliable to be employed in production. It requires time- and staff-consuming analyses and interagency decisions to make final determinations.

3.3 Decreasing Returns/Negative Returns

Most people are familiar with the principle of decreasing returns. In test theory, for example, three item responses provide more information about a student than two responses, but the increase is not as much as the gain from two responses over one. The Spearman-Brown formula allows us to approximate these decreasing gains. However, the increment in testing time from two to three is just as much as the increase from one to two. At some point, the added items do not provide enough additional information to justify their cost. We will see several examples of this principle at work in NAEP, and it enters into deciding among design tradeoffs.
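A brief numerical sketch of the Spearman-Brown formula shows the shrinking gains. The single-response reliability of 0.30 is an assumed value chosen for illustration, not a NAEP estimate.

```python
def spearman_brown(rho_one, k):
    """Reliability of a test k times as long as one whose reliability is rho_one."""
    return k * rho_one / (1 + (k - 1) * rho_one)


rho_one = 0.30  # ASSUMED reliability of a single item response (illustrative only)

reliabilities = [spearman_brown(rho_one, k) for k in range(1, 6)]
# Gain in reliability from each added response; testing time rises by one unit each step.
gains = [b - a for a, b in zip([0.0] + reliabilities, reliabilities)]

for k, (rel, gain) in enumerate(zip(reliabilities, gains), start=1):
    print(f"{k} responses: reliability {rel:.3f}, gain {gain:.3f}")
```

Each added response costs the same testing time, yet the gain column falls at every step, which is the decreasing-returns pattern described above.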

The lesser-known phenomenon of negative returns also arises frequently in NAEP. To continue the test theory example, when increasing test length begins to influence students’ performance because of fatigue, frustration, or lack of cooperation, the Spearman-Brown predictions of decreasing returns are no longer correct. Costs are linearly higher, but the information gained can actually be less than it would have been with fewer items. This situation arises in NAEP as a consequence of motivation, logistical limitations, and attempts to address inferences that large-scale surveys are not, by their nature, suited to support. Some examples:

  • A short constructed response task has the potential to tell us more about a student’s thinking than a multiple-choice item. A longer constructed response task might tell even more about some students, but nothing at all about those who decide not to bother responding to it. Omit rates for grades 8 and 12 in 1994 Geography, for example, averaged less than 1% for multiple-choice tasks and about 5% for short constructed response tasks—but up to 40% for tasks in which students were asked to provide extensive responses. Since these omissions are self-selected, even if 10,000 students do respond assiduously, there is less information about average performance in the population than from a random sample of just 100 students who were all engaged in the task.

  • It has been well-known, since at least the days of the Coleman Report (Coleman et al., 1966), that students’ home experiences have substantial impact on their school achievement. The self-reported information about students that NAEP routinely gathers is affordable, but of varying quality. Recent explorations into whether to attempt to obtain more accurate information about students’ home experiences and socio-economic status (SES) have (probably wisely) recommended against using either more detailed census data or parent surveys in the main assessment. Costs would rise substantially, and public resistance could increase to such a degree as to erode cooperation. And, because NAEP is a survey rather than an experiment, it would still not be possible to unravel the comparative effects of schooling and background experiences.
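The comparison in the first example above, between 10,000 self-selected respondents and a random sample of 100, can be sketched with a simple error calculation. The sketch assumes scores are standardized (standard deviation of 1) and that students who omit the task would have scored 0.5 standard deviations below those who respond; that gap is a hypothetical value, not an estimate from NAEP data.

```python
import math


def rmse_of_mean(n, sd=1.0, bias=0.0):
    """Root mean squared error of a sample mean: sqrt(bias^2 + sampling variance)."""
    return math.sqrt(bias ** 2 + sd ** 2 / n)


omit_rate = 0.40  # upper end of the omit rates quoted in the text
gap = 0.5         # ASSUMED: omitters would score 0.5 SD below responders

# If responders average `gap` above omitters, the responders' mean sits
# omit_rate * gap above the mean of the full population.
bias = omit_rate * gap

rmse_self_selected = rmse_of_mean(10_000, bias=bias)
rmse_random = rmse_of_mean(100)

# Smallest responder/omitter gap at which the random sample of 100 is more accurate.
break_even_gap = math.sqrt(1 / 100 - 1 / 10_000) / omit_rate

print(f"self-selected n=10,000: RMSE {rmse_self_selected:.3f}")
print(f"random n=100:           RMSE {rmse_random:.3f}")
print(f"break-even gap:         {break_even_gap:.3f} SD")
```

Under these assumptions the bias term dominates: adding more self-selected respondents shrinks only the sampling variance, which is already negligible at 10,000.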

3.4 Operational Definitions

Educators can agree unanimously that we need to help students "improve their math skills," but disagree vehemently about just how to appraise students’ skills. Their conceptions of mathematical skills diverge as they move from generalities to the classroom. They employ the language and concepts of alternative perspectives on how mathematics is taught, how it is learned, and about which topics and skills are important. The disparate assessments they have in mind all provide evidence about students’ competence—but each from a particular point of view of that competence, how it is evidenced, and how much to value different aspects of it.

Several levels of abstraction might be conceived for thinking or talking about student achievement, but it is an actual specific assessment that a student ultimately encounters. "Test specifications" identify what a particular assessment should comprise: The kinds and numbers of tasks, the way it will be carried out, and the processes by which observations will be summarized and reported. This level of specification determines an operational definition of competence. Deming (1982) describes how similar processes are routinely required in industry, law, and medicine:

Does pollution mean, for example, carbon monoxide in sufficient concentration to cause sickness in 3 breaths, or does one mean carbon monoxide in sufficient concentration to cause sickness when breathed continuously over a period of 5 days? In either case, how is the effect going to be recognized? By what procedure is the presence of carbon monoxide to be detected? What is the diagnosis or criterion for poisoning? Men? Animals? If men, how will they be selected? How many? How many in the sample must satisfy the criteria for poisoning from carbon monoxide in order that we may declare the air to be unsafe for a few breaths, or for a steady diet?

Operational definitions are necessary for economy and reliability. Without an operational definition, unemployment, pollution, safety of goods and of apparatus, effectiveness (as of a drug), side-effects, duration of dosage before side-effects become apparent (as examples), have no meaning unless defined in statistical terms. Without an operational definition, investigations on a problem will be costly and ineffective, almost certain to lead to endless bickering and controversy. (pp. 286-287)

For practical work, stakeholders agree on one or more operational definitions to track the more abstractly defined matters in which they are interested. The U.S. Food and Drug Administration, for example, works with an operational definition for "acceptable frozen broccoli" that includes ‘less than 272 aphids per pound’—obviously a consensually defined quantity. Different operational definitions, equally defensible, can lead to somewhat different results—but only after they have been specified can accurate estimation, or discourse based on the matter, proceed.

In NAEP, an operational definition of proficiency in a subject area is determined jointly by the subject-area framework, test specifications, administration procedures, and scaling/reporting procedures. Even a seemingly minor decision about whether to ignore omitted responses or to count them wrong is part of the definition. Any change in any of the components changes the operational definition of the proficiency, and has the potential to affect results by more than changes in what students actually know and can do affect them.

Operational definitions come into play in NAEP in several other places, such as sampling frames, background variables, exclusion rules for testing students, and, importantly, achievement levels. This last instance is discussed in Section 6.1E.

3.5 Variation in Systems

At the heart of Deming’s revolutionary approach to quality control was an understanding of variation in a system. Any system exhibits variation. Even an established system under what Deming called ‘statistical control’ exhibits a certain amount of variation. Resources are squandered when attention is focused on variation within these limits. One way that resources are effectively used is identifying and resolving ‘special’ causes of variation that lie outside the natural variation of a system—"putting out fires", or, in NAEP, "resolving anomalies." Statistical ideas help distinguish special causes from the natural variation of a system. In industry, typical ‘control limits’ for zeroing in on outliers are three standard deviations beyond average results. While putting out fires is an effective use of resources, it does not improve a system. Only changing the system can do that. The second way to use resources effectively is to change the system so as to improve its product—and, almost always, to reduce the amount of variation in the system. These principles are relevant to the NAEP redesign, for decreasing reporting time and improving the accuracy of trend results.
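The three-standard-deviation control limits mentioned above can be sketched as follows. The reporting times (in months) are invented for illustration and do not describe any actual assessment cycle.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Lower and upper limits set k standard deviations from the baseline mean."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def special_causes(baseline, new_values, k=3.0):
    """Values outside the control limits, i.e. candidates for a special cause."""
    lower, upper = control_limits(baseline, k)
    return [x for x in new_values if x < lower or x > upper]


# HYPOTHETICAL months-to-report for a stable process, then three new cycles.
baseline = [9, 10, 11, 10, 9, 11, 10, 10]
new_cycles = [11, 12, 18]

print(control_limits(baseline))
print(special_causes(baseline, new_cycles))
```

Only the value outside the limits merits a search for a specific cause. The other new values are within the system's natural variation, and lowering them would require changing the system itself.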

3.5.1 Reporting time

Figures 3-1 a) and b) present HYPOTHETICAL illustrations of two reporting systems. The top panel suggests time-to-release of main reports under the current main-NAEP configuration, which includes revisions, changes, new procedures, and reporting decisions (such as standard setting, how to handle scaling, what results to report and how to report them). This figure is fictitious, partly because calendar time to reports depends on which reports are given priority. For example, the average time is higher than desired, although some reports are ready fairly quickly. But the variation is very wide, due in large part to unforeseeable needs for rework due to unstable or new portions of the assessment or attendant processes, under a configuration in which almost everything must be analyzed and resolved before anything is reported. Simply exhorting everyone to do better does little to bring average reporting time down, since the wide variation in the system, as configured, leads predictably to some reporting times above the desired target. Focusing resources on the specific incidents that led an assessment to come in after schedule is wasted, if the underlying cause is an untested change, an inherently unstable variable, or a survey that requires complex file-matching—if the next assessment cycle will include new untested changes, inherently unstable variables, and surveys that require complex file-matching.

Figure 3-1
Hypothetical Distributions of Time-to-Report in Two Assessment Systems

The bottom panel illustrates some important observations we have made about the process of reporting long-term trends. Long-term trend reporting, neglected in the main NAEP activities, has coincidentally become a stable process. Very few changes at all enter into test designs, administration, or analysis (although reporting has sometimes been extensive, as trend reports sometimes have much interpretation and contextualization). The time necessary to prepare the basic data for reporting is not only shorter, but exhibits far less variation. This first feature is the bottom line, of course, and we will be exploring ways to achieve it in a redesigned standard NAEP. The second feature, reduced variation, is important not just for the predictability and the reliability of the system, but because it permits quicker and more accurate detection of true ‘special causes’. That is, if variation due to controllable nuisance effects is decreased, true anomalies are faster and easier to detect and resolve.

3.5.2 Accuracy of trend results

Deming, as a statistician, appreciated both the value of statistical models for gauging uncertainty and the limitations. A limitation of model-based estimates of uncertainty (i.e., standard errors) is that they depend on the model. To the degree that the model is wrong or incomplete, the reported standard errors are wrong—usually too small, because they do not include important sources of variation in the results. This is important in NAEP in the following way.

NAEP results may be called ‘reading proficiency’ or ‘math performance’ in a rather generic or global use of the term, but what they really are, are summaries of observations (which we believe have something to do with students’ knowledge and skills) collected in specific ways under specific conditions. Literally hundreds of specifics are involved, everything from definitions of the population frames, sampling procedures, color of ink, and weights, to item specifications, timing, administration, analytic procedures, and training procedures for scorers. Design changes have these three important implications:

  • Every single one of these specifics affects the level of the outcome to some degree.

  • Some of these features, when changed, can have greater impact than the true target of inference, namely, change of student proficiency over time.

  • Changing several features, even if each is seemingly minor, can also have greater impact than the change of student proficiency over time.

Error variance from some of these effects can be handled with statistical models—student sampling and item sampling, in particular. These sources of variance are all that show up in reported standard errors. Score variations due to other feature changes usually are not estimated, except when a change is deemed sufficiently suspect to merit dual administration under old and new conditions, and an attempt made to adjust for the average effect of the change. This means that the variation associated with seemingly small changes is present in results, but not in the standard errors for them.

Two untoward consequences result from this underestimation of the variability in results. First, distortions result in planning the sampling design. For example, there was uncertainty present in the 1986 design due to changing item context that was as large as uncertainty due to student sampling(1) (Mislevy, 1990). The huge expense of securing large random samples of students is wasted if locally desirable changes in design and procedure add variance back into the results.

Second, while the current student- and item-based standard errors are not too bad for within-assessment comparisons (because conditions are constant within that assessment), they underestimate more seriously standard errors for trends because of changes across assessment cycles. Setting control limits in relation to the underestimated standard errors guarantees that false alarms will be set off on a regular basis. Too many observations will be identified as suspicious. This triggers a search for ‘the mistake’, a special cause of variation, when there is no special cause; just another draw from the natural variance of a noisy system. If one wants accurate reports and the current system is not accurate enough, continually chasing a few real and many false signals of anomalies cannot solve the problem. The real solution requires honest estimates of the actual uncertainty in the existing system, then changing the system so it is less noisy.
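The false-alarm argument can be illustrated numerically. The sketch assumes normally distributed results and, echoing the 1986 example above, an unmodeled source of variance as large as the modeled sampling variance, so that the true standard error is the reported one multiplied by the square root of 2. The two-standard-error flagging rule is also an assumption made for illustration.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2))


def false_alarm_rate(reported_se, true_se, k=2.0):
    """Chance that a no-change result falls outside +/- k reported standard errors."""
    return two_sided_tail(k * reported_se / true_se)


# Rate if the reported standard error captured all of the variation.
nominal = false_alarm_rate(1.0, 1.0)

# ASSUMPTION: unmodeled variance equals modeled variance, so true SE = sqrt(2) * reported SE.
actual = false_alarm_rate(1.0, math.sqrt(2))

print(f"nominal false-alarm rate: {nominal:.3f}")
print(f"actual false-alarm rate:  {actual:.3f}")
```

Under these assumptions the share of results flagged as suspicious when nothing has changed is more than three times the nominal rate, and each flag prompts a search for a mistake that does not exist.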

The impact of variations in design options, and the consequent generalizability of inferences drawn from NAEP data, can and should be examined empirically by the use of generalizability studies. These studies should be done as part of the planning process, and should not be on the critical path to main reports. In such studies, different versions of an assessment are developed that vary in controlled ways. For example, test forms may be developed that contain different items but that are designed to be parallel in terms of the number and type of items and their measurement properties. Or forms may be created that vary more systematically, such as in their proportions of constructed-response, performance, and multiple-choice items. The variation in results across forms provides important information about how much error can be expected from changes in the assessment. This information makes it clearer how generalizable are conclusions drawn from a particular assessment design.
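A much-simplified sketch of what such a study yields is given below. It uses a basic method-of-moments calculation rather than a full generalizability analysis, and the form means and standard errors are invented for illustration.

```python
import math
from statistics import mean, variance


def form_variance_component(form_means, sampling_ses):
    """Variance among form means beyond what student sampling alone would produce."""
    observed = variance(form_means)                 # sample variance across forms
    sampling = mean(se ** 2 for se in sampling_ses)  # average sampling variance
    return max(0.0, observed - sampling)


# HYPOTHETICAL scale-score means from five forms built to be parallel,
# each with a conventional sampling standard error of 1.0 point.
form_means = [250.0, 253.0, 248.0, 251.0, 248.0]
sampling_ses = [1.0] * 5

form_var = form_variance_component(form_means, sampling_ses)
reported_se = sampling_ses[0]
honest_se = math.sqrt(reported_se ** 2 + form_var)

print(f"form-to-form variance component: {form_var:.2f}")
print(f"reported SE: {reported_se:.2f}, SE including form effects: {honest_se:.2f}")
```

With these invented figures, a result based on a single form is roughly twice as uncertain as its reported standard error suggests.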

3.6 Implications and Approaches

How can we apply these management concepts to achieve the desiderata of the Themes and Issues? Our discussion of the Themes and Issues and our design sketch make use of four ideas:

  • Setting Priorities
  • Modular Design
  • Phased Analysis & Reporting
  • Phased-In Change

Setting Priorities. The way to circumvent local optimization is to specify overarching priorities. This makes it possible to create an assessment design and an analysis plan that will override some locally desirable alternatives. If everything is especially important, then nothing is especially important. For example, striking gains in speed and stability of results can be produced by initially focusing energies on the aspects of the data that are most important and least problematic. However, it must be recognized that this prioritization inevitably means that certain analyses or variables that are most important to some NAEP constituents are not on the critical path to initial results.

Modular Design. The idea here is to design NAEP in terms of distinguishable modules, perhaps the most important of which ("a core NAEP") supports trend comparisons over time and consists of elements which are important, stable, and (comparatively) easy to analyze and report. These core modules could be embedded in other NAEP activities (in particular, state NAEP), and in non-NAEP studies. Other elements of NAEP could be spiraled into the main NAEP administration, but would not appear on the critical path to initial reports. These could include, for example, teacher surveys, experimental and more extensive tasks, long-term trend blocks, and blocks of tasks being readied to appear on the critical path in the next assessment cycle.

Phased Analysis & Reporting. As large as NAEP is, it is dwarfed by the census that the Census Bureau carries out every ten years. Yet the Census Bureau reports its first results six months after the data are in—as required by law. How can they do this? They do not report every possible result in every conceivable form. They report the most important results in the most straightforward way, then continue, over the next ten years, to analyze, to refine, to report, and to release further analyses in priority order. The analyses required for these results are not on the critical path to the initial report. NAEP has moved in this direction recently with its First Look reports. In Section 6.1D we discuss how even quicker initial reports could be accomplished.

Phased-In Change. In every administration of NAEP assessments, some aspects of the data collection have been essentially unchanged from the previous administration, others are changed only modestly, and others are quite different. We see time and again that the chances of problems (some remediable, others not) increase accordingly. For example, long-term trend assessments are essentially unchanged from one administration to the next, and, not surprisingly, they exhibit far fewer problems than main NAEP. Many things that could go wrong in an assessment have been discovered (often the hard way), worked through, and are avoided in successive administrations. It is largely known what the data will look like when they arrive, and what to do, and how to do it. Many of these advantages could be built into a core NAEP, while relaxing some of the incidental constraints that also characterize the long-term trend assessments. Open-ended tasks, which are not part of the current long-term assessments, could be included in the mix of tasks. New blocks could be introduced, as long as (a) they were not included in the standard results the first time they are used, and (b) they were very similar to blocks already in the mix in terms of structure, difficulty, content, and format balance. More consequential shifts of these factors would be introduced only periodically (say, every eight to ten years), and after at least one joint administration in which they are not included in the initial results.


4.0 The Necessity and Effects of Design Trade-offs

Even when attention focuses on the kinds of information that large-scale surveys such as NAEP can do well, there remains much leeway for setting priorities. Broad and current content coverage, for example, has always been important for NAEP; so has the capability to compare performance across time points. The NAGB Themes and Issues propose a higher priority for expeditious turnaround of results than has been the case historically. And while associations between performance and student background variables have been desired, the high cost of reliable measures of student background has led NAEP to rely on less trustworthy self-reports. Three key points must be kept in mind:

  • Different purposes are best achieved by different design configurations. (For example, assessment frameworks that were created anew with each assessment cycle would guarantee the most current perspectives on what is deemed important in a subject area, but devastate comparisons across assessment cycles.)

  • Any single design involves tradeoffs among features that strengthen, weaken, or sometimes even preclude inferences associated with different purposes. (For example, an assessment such as NAEP that uses relatively short, matrix-sampled test forms provides efficient estimates of population characteristics, but poor estimates for individual students.)

  • Establishing priorities among purposes enables assessment designers to plan a configuration that maximizes the attainment of high-priority purposes, while satisfying lesser priorities to lesser degrees.

4.1 A Fundamental Tradeoff

Perhaps the crucial tradeoffs to be addressed in a NAEP redesign emerge from the interplay of the following points made in the Themes and Issues:

  • Group focus. "The National Assessment only provides group results; it is not an individual student test."

  • Validity. "[V]alidity … of the data will remain a hallmark of the National Assessment." [In particular, this has included content coverage—consensually-determined frameworks and item pools that represent the breadth and depth of knowledge and skills in a given subject area, insofar as it is possible to assess them by NAEP.]

  • Achievement level reporting. "The National Assessment should continue to report student achievement results based on performance standards."

  • Simpler design. "Options should be identified to simplify the design of the National Assessment and reduce reliance on conditioning, plausible values, and imputation to estimate group scores."

Content-coverage has been important to NAEP since its inception. Such comprehensiveness cannot be attained if all students are administered the same, or virtually parallel, test forms. In and of itself, variation in test forms is not a barrier to rapidity and simplicity. The NAEP design of the early 1970s had few restrictions on booklet construction yet supported simple analyses—but largely because results were reported in terms of performance on items, not in terms of performance by students.

This may seem like a trivial distinction, since all the data are performances on items by students. The key difference, though, is that under item-level reporting, the issue addressed is how students would do on this item, regardless of performance on other items. In a student-level framework of reporting (even if scores are never even calculated for individual students), the focus is on how a given student would do across items. This means projecting from how she does on the one particular set of items in her test form, to how she might have done on some larger set (e.g., an actual set of reference items, or a performance scale that implies levels of performance in a domain of items). This means that interrelationships among performances across items are important, and the complexities of some kind of linking and scaling procedures appear. Methodologies available for linking results on different test forms vary in their complexity. The simplest can be employed when (1) forms are parallel, which demands tight constraints on form design and works against breadth of content-coverage, and (2) target inferences are about individuals measured equally well, rather than about properties of distributions of groups.

The current NAEP configuration has neither of these characteristics. Data come from booklets that vary within assessments and over time. Students are administered too few items to obtain accurate measures of their performance, since experience has shown that administering large numbers of tasks under unmotivated ‘drop in from the sky’ testing abrades the engagement of students and schools alike. And the target of inference is proportions of students at or above designated achievement levels—one of the hardest to estimate from sparse matrix-sampling designs. This is the state of affairs that led to the complex statistical methodologies noted in the ‘simpler design’ desideratum.
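One reason proportions above an achievement level are hard to estimate from short tests can be sketched under strong simplifying assumptions: true proficiency is standard normal, each observed score is the true score plus independent error, and the cut score and reliabilities below are hypothetical. Simply counting observed scores above the cut then overstates the share of students whose true proficiency exceeds it, because measurement error widens the observed distribution.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def share_above_cut(cut, reliability=1.0):
    """Share of observed scores above cut when true scores are N(0, 1).

    Observed score = true score + independent error, with
    reliability = var(true) / var(observed), so var(observed) = 1 / reliability.
    """
    observed_sd = math.sqrt(1.0 / reliability)
    return normal_upper_tail(cut / observed_sd)


cut = 1.5  # HYPOTHETICAL cut score, in standard deviations above the mean

true_share = share_above_cut(cut)                     # error-free measurement
naive_share = share_above_cut(cut, reliability=0.5)   # ASSUMED short-form reliability

print(f"true share above cut:     {true_share:.3f}")
print(f"share of observed scores: {naive_share:.3f}")
```

In this sketch the naive count is more than double the true share, which suggests why group-level estimation methods, rather than tallies of individual scores, have been used with sparse designs.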

Is it possible to have a cleaner design, simpler analyses, and faster reporting—yet maintain broad content coverage and valid achievement-level reporting? Our perspective emphasizes (1) use of management principles in design, so that procedures can be faster, simpler, and more stable no matter how tradeoffs are balanced, and (2) arrangement of design and reporting priorities so as to be, at once, consonant with the desiderata in the Themes and Issues, but ordered so as to reduce costs and complexities in achieving them.

4.2 Tradeoffs and Test Specifications

Among the major features of an assessment that affect the structure of test forms are the following: 1) content specifications (including definition of objectives or outcomes and the number of items measuring each outcome); 2) item types and formats (including but not limited to multiple-choice or performance items); 3) desired standard error functions, especially as they relate to achievement levels; 4) testing time per student; and 5) linking requirements (between forms or grades). Decisions about these design features—which, as we shall discuss, should flow from decisions about priorities on assessment purposes—will create the look of the assessment and have a great influence on the complexity of the analysis techniques needed.

Test frameworks determine the breadth of content coverage needed, but test specifications are more specific than frameworks. If it is desired to make broad and robust generalizations about student achievement, then broad content coverage is needed. The level of detail at which distinctions will be made is also important. For example, if it is desired to draw generalizable conclusions about students’ achievement in problem solving versus algebra, then a sufficient number of items needs to be included to measure those separate objectives. In the past, NAEP has been notable for the breadth of its content coverage, which has positively affected its reputation as a valid and useful benchmark of American student achievement. However, this breadth has contributed to the need for a large, complex, expensive number of test forms.

It is an explicit desideratum of the Themes and Issues that constructed response or performance items be included in a redesigned NAEP. The number and type of performance items have tremendous impact on testing time and scoring costs. Also, while performance items increase the depth of assessment, their inherent task effects decrease the generalizability of results relative to devoting the same amount of testing time to multiple-choice items. That is, the use of just one performance task creates the need to use additional performance tasks in order to maintain stable results. For example, if only one math task is used in one year and it focuses heavily on geometry, and the next year an algebra-laden task is used, it will not be possible to understand the meaning of score changes: Are the changes due to changes in levels of student achievement in math skills common to both tasks or are they due to the fact that students can do one type of task better than the other? Using several carefully chosen tasks in each assessment improves the interpretability of the results since it affords the possibility of sorting out some of these competing explanations.

The desideratum to report scores in terms of achievement levels places particular emphasis on the standard error functions, or degree of accuracy of information about individual students. Items need to be placed in the assessment to match the target achievement levels. For example, to accurately measure Advanced performance, difficult items must be in the assessment. If NAGB decides that it is important to place more emphasis on measuring students’ progress as they move toward achieving the Basic level, more items at the low end of the scale need to be added to the assessment. There is a fundamental dilemma in designing an assessment before standards are set: it may be that reasonable standards are set but that a given assessment design cannot measure, with the necessary accuracy, the proportion of students reaching those standards.

Past NAEP results have found that when students are tested for longer than one testing session (about an hour), there is a substantial loss of student participation.(2) Such loss biases assessment results. As long as it is desired to measure more content than one student will take, more than one test form must be used, and complexities arise in design and analyses. For example, imagine that it takes two forms to cover the NAEP content framework and item format specifications to an acceptable degree. Since two forms are needed, in one or more ways they cannot be parallel; they may measure different content, perhaps with different formats, or have different standard error functions. In an extreme case, one form might contain only multiple-choice items (Form A) and the other contain one or more performance items (Form B). To obtain an overall picture of performance of a group of students, it will be necessary to pool results from the two forms. It will not be possible for only one form to be used by, say, states, to link their assessments to NAEP; both forms will be needed.

Furthermore, if more than one form is needed to cover the desired content and item formats, comparability of results over years requires the use of either a) tight restrictions on test form characteristics or b) complex analysis procedures. Continuing the example above, call the first year’s test forms A1 and B1. To maintain overall consistency of results in the second year of testing, it is necessary to design A2 to be parallel to A1 and B2 to be parallel to B1. (Looser restrictions could be used, but they are more complicated to explain and implement.) If form consistency is not maintained, then the distributions of observed scores (and percents of students in each achievement level) will be affected by differences in standard error functions. Sophisticated statistical techniques exist for dealing with these differences (e.g., the "plausible values" methodology), but one of the Themes and Issues desiderata is to reduce use of such techniques. We will discuss these issues further in Section 6.1C.

4.3 Remarks

It is not possible to "have it all"; trade-offs must be made. The present NAEP design emphasizes breadth of content coverage, use of performance items, minimum testing time per student, and achievement level reporting. These features have been obtained by increasing the cost and complexity of the form design and analysis. The cost and complexity can be reduced, but then something else must be given up. The configuration we sketch in Section 7, for example, maintains broad content coverage, allows for controlled evolution of the task pools, and permits more rapid reporting, but it does so by constraining the specifications of booklets upon which standard, initial reports are based. Subsequent reports incorporating broader content, newer and more complex tasks, and additional student background variables can come later, necessarily carried out with more complex analyses.


5.0 A Selective History of Elements of the NAEP Design

This section briefly reviews selected elements of design configurations NAEP has exhibited over the years, in terms of purposes, priorities, and trade-offs—some explicit, others implicit; some intentional, others adventitious; and some with unforeseen consequences. This discussion further illustrates the principles introduced above, and sets the stage for deliberation of options for the future.

5.1 1970-1983

Certain features of NAEP were instituted at its onset, conceived to produce results sufficiently useful, cost-effective, and politically benign to come into being.

5.1.1 Student Sampling

NAEP was designed to gather information from samples of students rather than from every student. This approach, motivated more by practice in public-opinion polling than educational testing, allowed extraordinary efficiencies when the target of inference was performance of groups of students rather than of individual students. Accurate estimates of national performance, for example, could be grounded on a random sample of a few thousand students. A multi-stage sample was employed (a simple random sample of students from the nation is impractical), necessitating that clustering effects and stratification be accounted for in estimating item averages and precision of estimation. Since results were not obtained for all students, nor used for purposes specific to sampled individuals, motivation was more of a concern than in typical tests in which something good happens to a student if he does well, or something bad happens if he does poorly.

A tradeoff appeared in the sampling of students at random from their schools, rather than from intact classrooms. The advantage: A lower clustering effect, implying more efficient estimates of group performance for a given sample size. The disadvantage: Hierarchical linear modeling (HLM), which would examine impact of class and teacher effects, was precluded.
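The efficiency side of this tradeoff can be illustrated with the standard approximation for the design effect of a clustered sample, deff = 1 + (m - 1) * rho, where m is the number of students sampled per cluster and rho is the intraclass correlation. The following is a minimal sketch only; the sample size, cluster size, and the two intraclass correlations are hypothetical values chosen for illustration and are not NAEP figures.

```python
def design_effect(cluster_size, intraclass_corr):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * intraclass_corr


def effective_sample_size(n_students, cluster_size, intraclass_corr):
    """Size of a simple random sample that would give the same precision."""
    return n_students / design_effect(cluster_size, intraclass_corr)


# Hypothetical comparison: 2,500 students, 25 per school in both designs.
# Assumption: students drawn from one intact classroom resemble each other
# more (higher rho) than students drawn at random from the whole school.
intact_classrooms = effective_sample_size(2500, 25, intraclass_corr=0.20)
random_within_school = effective_sample_size(2500, 25, intraclass_corr=0.10)

print(round(intact_classrooms), round(random_within_school))
```

Under these assumed values the random-within-school design behaves like a larger simple random sample than the intact-classroom design, which is the sense in which a lower clustering effect yields more efficient estimates for a given sample size.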

5.1.2 Item Sampling

Item-sampling is the dual of student-sampling. Since performance in any subject area is only poorly reflected by any single item, or even several of them, we learn more comprehensively about all the many facets of skill and knowledge in a subject from a large number of diverse tasks—far too many for any single student to be administered, especially under unmotivated conditions. NAEP pioneered the radical solution of item-sampling: each sampled student was administered a sample of items from the pool. Technical innovations made it possible to obtain, from these ‘matrix samples’ of responses, estimates of average performance (e.g., Lord, 1962). Matrix sampling was ideal for broad content coverage and efficient estimates of performance in large domains of items. An important feature of matrix sampling is that it supports estimates of average performance in the domain or on individual items even if students respond to very few items. This was a partial solution to the motivation problem, since under low-stakes conditions, motivation declines as amount-of-effort-required increases. (Indeed, motivation can decline to the point of negative returns as testing sessions become longer; two hours of testing time per student can provide less information about a group than one hour of testing time, if rates of school and student refusal, and item omit rates, increase.)
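The central property of matrix sampling, that domain-level averages can be recovered even though each student answers only a few items, can be shown with a small simulation. This is a simplified sketch under stated assumptions: the pool size, the number of items per student, and the per-item proportions correct are hypothetical, and students are treated as exchangeable (no differences in ability and no clustering).

```python
import random


def matrix_sample_estimate(n_students=2000, n_items=60, items_per_student=5, seed=7):
    """Estimate the domain average percent-correct from sparse item samples."""
    rng = random.Random(seed)
    # Hypothetical true proportion correct for each item in the pool.
    p_true = [0.3 + 0.4 * i / (n_items - 1) for i in range(n_items)]
    correct = [0] * n_items
    seen = [0] * n_items
    for _ in range(n_students):
        # Each student is administered a small random sample of the pool.
        for i in rng.sample(range(n_items), items_per_student):
            seen[i] += 1
            correct[i] += rng.random() < p_true[i]
    # Average the per-item percents-correct over items that were administered.
    item_means = [c / s for c, s in zip(correct, seen) if s > 0]
    true_domain_mean = sum(p_true) / n_items
    estimated_domain_mean = sum(item_means) / len(item_means)
    return true_domain_mean, estimated_domain_mean


true_mean, estimated_mean = matrix_sample_estimate()
print(true_mean, estimated_mean)
```

No student in this sketch sees more than five of the sixty items, yet the pooled estimate of the domain average is close to the true value because every item is answered by many different students.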

Items in the original NAEP design were administered by paced audio tapes. That is, all students in a testing session were administered the same booklet of items, and an audiotape moved students through the booklet item by item. A number of trade-offs were involved here: Administration was logistically cumbersome, and data were less than optimally efficient because of the clustering of students. On the other hand, information was better item-by-item than when students are simply given a number of items to work, and a block of time to work on them. In this latter situation, it is up to students to decide when, for how long, and in what order they will work on each item, a factor of some importance in their performance, but uncontrolled by administration conditions.

5.1.3 Reporting

From the inception of NAEP through 1988, NAEP reported results in terms of regions of the nation rather than states. Obviously, since regions of the country are not responsible agencies for education, reporting in these terms had less policy relevance than reporting in terms of states or school districts. Why make such a tradeoff? In order to make NAEP acceptable, so it could come into being in the first place. It was not politic to create a national assessment that was too useful.

Similarly, sampling and reporting were organized in terms of ages of students rather than grades, even though schooling in this nation is mainly organized in terms of grades. Ages 9, 13, and 17 were targeted, in tune with international assessments which were starting to be carried out by the International Association for the Evaluation of Educational Achievement (IEA).

Reporting originally focused on single items: percents of correct response, or distributions of kinds of response, in the nation as a whole and in subgroups of students defined by the background variables NAEP also included in its surveys (e.g., race/ethnicity, parents’ education, and region of the country). This item-by-item reporting contrasts with student-based reporting, in which individual students’ performance is summarized over items, and the distributions of these summaries are analyzed. A great advantage of item-level reporting was that no equating or scaling procedures were required; average performance on an item was simply what it was. Analysis was as simple as possible as far as items were concerned, given that the complex student-sampling required a certain level of complexity in analysis (jackknife estimation of variance due to student sampling, which is still used). Another advantage was that there were relatively few constraints on the composition of assessment booklets. They need not have had similar content, formats, difficulties, or lengths. A disadvantage was that item-by-item reporting provided estimates of average performance by subgroup, but it precluded the conception of distributions of students’ performance, or of reporting in terms of achievement levels.

Subject area experts liked the original item-by-item reports, but such detailed reports were quickly found to be unsatisfactory for communicating with policy-makers and the public. When someone asked ‘how are kids doing’, she did not want two hundred answers, one for each item—especially if the same general message was being repeated for most of the items. Beginning around 1974, reports began to provide results in terms of average performance over clusters of related items.

5.1.4 Measuring Trends

Once reporting was organized around average percents of correct response across clusters of tasks, these clusters naturally became the basis of comparisons across time and across age groups. The disadvantage was that the groups of items in common across years were not selected purposefully to this end; they varied in number and content coverage, and constituted only haphazard and unrelated reporting scales. For example, percents-correct for 13-year olds might be higher than those for 17-year olds, simply because the 13-year olds’ common items happened to be among the easier ones they were administered. Moreover, the release of 1/4 of the items with each assessment cycle meant that fewer and fewer items were available for comparing performance over time. In short, this method of comparing achievement over time had not been planned. It arose as an ad hoc response to a mission whose importance grew over time, but for which the design was not well suited.

5.2 The 1984 Redesign

After having started out with a clean design in the early 1970’s, satisfactorily addressing the perceived needs of the time with the technologies of the time, NAEP became increasingly unwieldy over the years as expectations changed. Ad hoc procedures (such as trend reports on clusters of items) had been introduced to meet new expectations as well as possible, but even so dissatisfaction was increasing. The competition for a redesigned NAEP in 1984 led to a contract in which many of the features of the current configuration originated (see Messick, Beaton, & Lord, 1983). New priorities were recognized, and the new design was introduced to reflect different balancing of recognized tradeoffs.

5.2.1 Student-Sampling

Recognizing the fact that schooling in the US was organized mainly by grades, the 1984 redesign introduced concurrent age and grade sampling. The sampled grades were the ‘modal’ grades associated with ages: Grade 4 with Age 9, Grade 8 with Age 13, and Grade 11 with Age 17. A given administration scheme and set of test booklets was used in each grade/age combination. Advantages of this extension of NAEP included the maintenance of age-based surveys for trend and international comparisons, and the availability of grade-based surveys for increased policy relevance. Disadvantages included increased fieldwork; additional complexities in sampling, weighting, administration, and data structures; and dual analyses and possibilities of analytic errors.

5.2.2 Item-Sampling

A more complex version of multiple-matrix item-sampling was introduced in 1984: BIB-spiraling. BIB-spiraling presented booklets which were organized around three blocks of items, and these blocks were combined so that each block would appear at least once with every other block. This Balanced Incomplete Block design is the ‘BIB’ in BIB-spiraling. The motivation for this innovation was to support the construction of response scales for performance; that is, to support reporting in terms of what individual students do on collections of items, rather than simply average performance per item. Specifically, scaling by means of item response theory (IRT) models was introduced. This scaling allowed consideration of distributions of performance, setting the stage for achievement level reporting. The tradeoff was increased complexity of analysis.
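The pairing property of a balanced incomplete block design can be made concrete with a small example. The sketch below builds seven three-block booklets from seven blocks by cyclically shifting a base booklet, then checks that every pair of blocks shares a booklet. The seven-block configuration and the cyclic construction are assumptions chosen for illustration; they are not presented as the configuration actually used in the 1984 assessment.

```python
from collections import Counter
from itertools import combinations


def cyclic_bib(n_blocks=7, base=(0, 1, 3)):
    """Build booklets by shifting a base booklet around the block labels.

    With 7 blocks, the base (0, 1, 3) is a difference set modulo 7, so the
    shifted booklets pair every two blocks together exactly once.
    """
    return [tuple((b + shift) % n_blocks for b in base) for shift in range(n_blocks)]


def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = Counter()
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] += 1
    return counts


booklets = cyclic_bib()
counts = pair_counts(booklets)

for number, booklet in enumerate(booklets, start=1):
    print("Booklet", number, "contains blocks", booklet)
print("Distinct block pairs covered:", len(counts))
```

Because the booklets are cyclic shifts of one another, each block also occupies each of the three booklet positions exactly once, which is one simple way to balance position effects across blocks.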

The ‘spiral’ in BIB-spiraling involved administering different booklets to students in a given testing session, spiraling through the entire set of booklets rather than having every student in a session paced through the same booklet. These spiraled booklets were constructed so that each block was allotted the same amount of time, and students allocated their time as they chose within each block. Advantages of this procedure included easier logistics, since audiotape equipment and its vicissitudes were no longer a factor, and more efficient sampling in one sense, since the testing-session-by-booklet clustering effect was eliminated. However, additional uncertainties (sources of variance) were introduced into item-level results: larger position effects, more omits, and more dependence on students’ varying time-management skills. National estimates of item-level performance were less trustworthy, since the percent-correct now depended materially on whether an item happened to appear near the beginning or the end of a block.(3) Blocks of items, rather than individual items, became the fungible unit of interpretation.

5.2.3 Plausible values

A major feature of the current NAEP configuration is the use of ‘plausible values’ methodology to estimate distributions of students’ proficiencies. It is noteworthy that this methodology was an unintended consequence of the redesign. Specifically, it became necessary as a consequence of: (1) placing a high priority on student-based, rather than item-based, reporting; (2) collecting data in booklets constructed to far looser constraints than are employed in typical student testing programs; and (3) finding that complex analyses had to be invented to meet these missions with the data that had been collected.

IRT scaling was introduced to enable comparisons of group-level performance across distinct years, ages, and booklets. By the early 1980’s, established scaling methods were available to do this with reasonable speed, stability, and expenditure. Item parameters were estimated to establish a common scale; IRT ability estimates were produced for each examinee; and subsequent analyses could be carried out across non-identical item sets. In particular, the IRT calibration and scoring were completely separate from the preparation of, and analyses concerning, any other background or instructional variables whose relationships with performance were of interest. Subgroup distributions and secondary analyses could then be carried out with these estimates for individual students. It was therefore planned to carry out the analysis of NAEP data using this approach—a relatively simple analysis, under which the following elements all appeared to be in concert: the data, the analysis, the commitments, the PERT chart, the requisite resources, and the long-term stability of the approach.

But this approach failed in the 1984 assessment. Although NAEP’s sparse matrix-sampling design was extremely efficient for obtaining information about population characteristics, it did not support response-data-only IRT ability estimates for individual students upon which suitable analyses could then be based. Vastly differing mixes across booklets of content, difficulty, test-length, and timing further impaired the approach, since varying measurement error distributions caused distortions in individual-student score estimates that were larger than the true differences of interest, such as between sexes or regions. These form-to-form factors, it was realized, were explicitly managed in programs like the SAT and ACT. In such applications, IRT could indeed characterize and take advantage of patterns among students’ performances on different sets of items, but only because the test forms were sufficiently long and parallel. This was the crisis faced with the 1984 assessment:

  • Data had already been collected, in anticipation of an analysis of performance with a methodology that was familiar, expeditious, and totally separate from background variables and sampling design considerations;

  • Commitments as to the content and timeliness of results were based on this assumption; and

  • The anticipated analysis failed.

Two steps were taken to meet the crisis:

1. A conceptual framework was established for using IRT-based models to establish a common reporting metric and characterize population relationships from sparse matrix-sampling data (i.e., marginal estimation(4)). The reporting metric was a 0-500 scale based on the IRT ability. (See Section 6.1C for further discussion.)

2. A marginal-estimation analysis system was devised to deliver on a majority of the established commitments, using the data in hand. Specifically, the ‘plausible values’ approach, based on Rubin’s (1987) multiple imputation methods for handling missing data, was introduced to implement the concept of marginal estimation. The main idea was to estimate the joint distribution among IRT ‘true scores’ and student background variables, then produce pseudo-data sets from which these results could be reproduced.
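The reasoning behind the two steps above can be illustrated with a deliberately simplified simulation. It is a sketch under stated assumptions, not a description of the operational NAEP analysis: it assumes a Rasch model with five hypothetical item difficulties, a known standard normal population distribution in place of an estimated model conditioned on background variables, and a single plausible value per student. It compares the spread of posterior-mean point estimates with the spread of values drawn at random from each student's posterior distribution.

```python
import math
import random
import statistics


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def posterior_on_grid(responses, difficulties, grid, prior_mean=0.0, prior_sd=1.0):
    """Normalized posterior weights over a grid of proficiency values."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        for x, b in zip(responses, difficulties):
            p = rasch_p(theta, b)
            w *= p if x else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def simulate(n_students=1000, seed=1):
    rng = random.Random(seed)
    difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]   # hypothetical short booklet
    grid = [-4.0 + 0.1 * k for k in range(81)]   # proficiency grid, -4 to 4
    point_estimates, plausible_values = [], []
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)              # true proficiency
        responses = [rng.random() < rasch_p(theta, b) for b in difficulties]
        post = posterior_on_grid(responses, difficulties, grid)
        # Point estimate: posterior mean, which is shrunk toward the prior mean.
        point_estimates.append(sum(t * w for t, w in zip(grid, post)))
        # Plausible value: one random draw from the same posterior.
        plausible_values.append(rng.choices(grid, weights=post, k=1)[0])
    return (statistics.pvariance(point_estimates),
            statistics.pvariance(plausible_values))


var_point, var_plausible = simulate()
print(var_point, var_plausible)
```

With so few items per student, the posterior-mean estimates are much less variable than the simulated population, so a distribution built from them would misstate the proportions of students above any cut point. The randomly drawn values have a variance close to the population value of 1, which is the property that makes pseudo-data sets of this kind usable for estimating distributions.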

The commitment of providing a secondary user data tape merits special note. It had originally been promised to provide users with a NAEP data file including IRT ability estimates as well as background information for each student, from which any analyses could be carried out. This would have been both easy to do and satisfactory for secondary analyses had the anticipated analyses worked as planned. They did not. User tapes containing plausible values were instead produced. If secondary analyses were to include a given variable, that variable had to be included in the construction of the plausible values (i.e., in the ‘conditioning model’) in order to obtain satisfactory results. Were it not for the mission of producing this specific type of data file for secondary users, a tape which could be used more or less as if the original plans had worked, the imputation procedures could have been avoided: appropriate, somewhat simpler marginal analyses could have been carried out for given inferences. This in turn necessitated cleaning, processing, and, when required, collating all student background variables before analyses for initial reports were carried out. It is important to note that this requirement placed on the critical path to initial reports the most problematic background variables (especially untested new or revised self-report items, and those from teacher surveys, which involved complex file matching).

The commitments were largely met, but at the cost of severe and unanticipated mismatches among the analysis, the commitments, the PERT chart, the requisite resources, and the stability of the configuration. The new analysis procedures made it possible to bring together results from booklets of different lengths, difficulties, and compositions, although there were limits to how far even this approach could be pushed. The analysis, however, was more complex, more dependent on models, and unfamiliar (and therefore suspect) even among the educational measurement community.

5.2.4 Trend Analysis

Once the marginal analyses described above had been devised, the IRT model and distributional estimation methods were applied to map the historical stream of pre-1984 reading data into the 0-500 scale. Stable and credible results were attained, which echoed, amplified, and made more comprehensible the cross-year and cross-age results that had been reported in the past in terms of average percents-correct. The procedure worked well despite the widely varying design configurations in past assessments. In retrospect, it was realized that paced administration helped satisfy IRT assumptions by reducing context effects.

This historical trend was necessarily based on historical data, under the content frameworks as they had been developed and operationalized (e.g., audiotape paced administration, definition of age cohorts, booklet designs). These same procedures were used to collect data in 1984 concurrently with the new BIB-spiral administration, and this bridge was used to set the baseline for what was anticipated to be the start of a new trend line based on new procedures. No new trend line ever materialized. Changes in frameworks, item specifications, definitions, and administration conditions were introduced every one or two assessment cycles so that a consistently defined metric could never be established for solid comparisons over time for more than two assessments.

5.3 The 1986 Reading Anomaly

The ‘Reading Anomaly’ refers to results from the 1986 assessment that showed declines from 1984 levels that were much greater than changes in any four-, five-, or six-year period in the previous history of NAEP. It is not that the changes were large in absolute terms; at Age 17, the most startling, they were only about 3 points in terms of average percents correct. But this was large in relation to the target of inference, namely population changes, because population changes are very small over short time periods. Subsequent investigations (Beaton & Zwick, 1990) showed that the anomaly could be traced to seemingly inconsequential changes in booklet configuration, administration procedures, item context, and post-stratification procedures. Each change was designed to provide better information, yet together they made it impossible to compare results across assessments. Beaton and Zwick offered the moral, "When measuring change, do not change the measure."

This experience provides the sobering realization of the severe constraints on population definition, booklet design, administration protocol, and analytic procedures that ensue when the mission of NAEP includes tracking change over time. Hundreds of seemingly small facets of the configuration could be tweaked to produce modest improvements in estimation within one time point, but these changes could have a greater impact on the overall results than actual differences over a two-year period in what students know and can do.

5.4 Some Changes Since the Anomaly

5.4.1 ‘The only constant is change’

Changes in frameworks, item specifications, the time of year of testing, age definitions, exclusion rules, and so on, have been the rule in NAEP. It has sometimes been possible to jointly administer the assessment under both the previous and new versions of each change and to estimate an adjustment factor that takes into account the average effect of the change. Even when such adjustments appear to have been successful, the average impact need not be representative of the impact on different demographic or curricular groups. Thus, results concerning such variables have an added source of variance that is not accounted for in standard errors of estimation. Higher rates of ‘false positives’ of significant differences result. And, even when adjustments appear to have been successful, additional time has been added to the critical path for attempting to estimate the effect of the change, to determine whether it can be adjusted for, and possibly to invent a way to make the adjustment. Sometimes the impact of change is large enough to make it impossible to directly compare results from one assessment to the next. In all the years of NAEP assessments since 1984, it has only happened once that results from three successive assessments have been comparable (1994 Mathematics). Typically only two in a row are comparable before revisions sufficiently large to obviate the continuation of a ‘clean’ trend line are introduced. ‘One in a row’ is not uncommon.

The trade-off involved with continuous change is a victory of local optimization over global optimization. The intended advantage is improvement each time, insofar as honing data to mission as perceived by content area committees or other stakeholders. The negative effects are that (1) there are almost always some new wrinkles in each assessment, so there are almost always some first-time glitches and slow reporting; and (2) no new trend line has ever gotten started and remained in place long enough to become fast and reliable. An interesting unintended positive effect has been the continuation of so-called ‘long term trend’ assessments in Reading, Mathematics, and Science, which still use definitions, booklets, and administration procedures from the 1970’s and early 1980’s. Procedures for long-term trend have been refined and honed, so analyses needed for what would correspond to the ‘standard report card’ in NAGB’s Themes and Issues can be carried out reliably and quickly—within 3 months after receipt of data. This observation supports our recommendations in Section 7 for ways to design a standard NAEP assessment configuration that can yield initial reports within six months after the completion of testing.

5.4.2 Grade 12 Reporting

The 1984 grade/age combinations included grades 4, 8, and 11. It was noticed that if a given subject area were assessed every four years, NAEP could track and compare cohort effects if grades four years apart were surveyed. The 1986 assessment therefore surveyed grades 3, 7, and 11. Grade 3 proved problematic, as students at this age exhibited considerable difficulties dealing with the assessment. Moreover, interest was expressed by various stakeholders in surveying progress at key transition points in the educational system: Grades 4, 8, and 12. These became the surveyed grades beginning in 1988. Probably the most serious disadvantage in this shift is the lack of motivation among Grade 12 students, as evidenced in field observers’ notes, students’ self-reports, and omit patterns in data (see Section 6.1G).

5.4.3 State-Level Reporting

The Alexander-James Report opened the door to reporting at the state level in the 1988 assessment. An obvious advantage is increased policy relevance: states are agencies with direct responsibility for education. A disadvantage was the increase in the size of NAEP by almost two orders of magnitude, adding considerable logistic and analytic challenges. Creative ways of addressing these challenges have been advanced, such as administration by local rather than contractor personnel, under contractor-directed training and sampled observation. Comparisons of state assessment samples with a specially-selected ‘state assessment comparison’ subsample of the contractor-administered national NAEP have shown small but statistically significant differences. Another disadvantage in state-level reporting is the increased burden on states, especially small states, for whom practically all schools are involved in NAEP every assessment cycle, and sparsely-populated states, for whom logistic difficulties arise with small schools spread across wide geographic areas (see discussions in Section 1H and in the Appendix). Whether states and schools will be sufficiently motivated to maintain this effort over the long run is an open question.

An interesting example of a costly unintended consequence is the breakout of private school results in state NAEP assessments. The original plan was to report only public schools, since they have public responsibility, and it is generally easier to locate and gain cooperation from them than private schools. But states have different proportions and compositions of public/private schooling, so omitting private schools from the sample may bias inferences about how students in a state are faring compared to students in other states. Now a reliable unbiased estimate of students in all public schools in a state can be obtained with a sample of, say, 50 to 100 public schools. A reliable unbiased estimate of all schools in the state, public AND private, can be obtained with perhaps an additional subsample of, say, 10 private schools. However, 10 schools is NOT sufficient for a reliable estimate of all private school students, and obtaining unbiased estimates is difficult because non-Catholic private schools have a high rate of refusing to participate. The advantage of having state-level private-school results is mainly to avoid users simply subtracting the ‘public schools’ estimate from the ‘all schools’ estimate to get their own (highly unreliable) private-school estimate. Yet securing enough private schools to ground a reliable private school estimate could, in some states, require as much effort as the public school effort! (Keith Rust discusses this issue in his 5/8/96 memo to Mary Lyn Bourque; see Appendix to this report.)

5.5 Where Are We Now?

NAEP finds itself in much the same situation as it did in the days of the 1984 NAEP redesign competition: A clean design had been introduced a bit more than a decade previously, which responded to the needs of its time, but for which changing desires, technologies, and political milieu appeared to push beyond its limits. Now as then, much has been learned from preceding work upon which to craft configurations better suited to the current situation.


6. Themes and Issues

This section discusses the objectives and recommendations from the National Assessment Governing Board’s "Themes and Issues" document. The discussion is organized point-by-point, with cross references as appropriate. Further amplification of specific tradeoffs and issues appears here. The results are synthesized into a sketch of a feasible configuration in Section 7.

OBJECTIVE 1: To measure national and state progress toward the third National Education Goal and provide timely, fair, and accurate data about student achievement at the national level, among the states, and in comparison with other nations.

1A. Test all subjects specified by Congress: reading, writing, mathematics, science, history, geography, civics, the arts, foreign language, and economics.

  • The National Assessment should be conducted annually;

  • Reading, writing, mathematics, and science should be given priority, with testing in these subjects conducted according to a publicly released 10-year schedule adopted by the NAGB;

  • History, geography, the arts, civics, foreign language, and economics also should be tested on a reliable basis according to a publicly released schedule adopted by NAGB.

There are many ways to devise schedules that meet the stated goals. Although it is not up to the Design/Feasibility Team to say which subjects should be assessed or how often, we can say that if the tracking of trends is a primary goal, then the framework and assessment design need to remain considerably stable for at least three administrations. This section offers three examples, not as recommendations, but as vehicles to illustrate issues and tradeoffs. All of these examples show considerably more assessment than the current design, with more subjects assessed and more frequent assessment. This will be feasible and affordable only if subjects that are not being assessed comprehensively are kept relatively simple and exhibit minimal changes (the nature of which is discussed further in Section 6.1B). We assume that "comprehensive assessments" coincide with revised subject-area frameworks (Section 6.2A).

Example 1

Table 6.1A-1 is an example with two core subjects, Math and Reading, which are assessed biennially, and eight other subjects, which are assessed two or three times during each ten-year period. Three subjects are assessed each year, although field testing of new items from other subjects can take place in any year as required. Math and Reading are given highest priority in this example since there seems to be no argument from any quarter (educators, policy makers, and parents) that these two subjects are critical for students’ success. The example supposes that after Math and Reading, the four subjects with next priority are Science, U.S. History, Writing, and Geography. These subjects are grouped into two pairs, one pair of which is assessed between the Reading/Math years, so that each is assessed every four years. Three other subjects appear once every four years, and one subject is tested every five years.

Table 6.1A-1
Example 1 of Subject Area Assessment Cycles

        Subjects
Year    1          2            3

1       Math       Reading      Civics
2       Science    History      F. Language
3       Math       Reading      Arts
4       Writing    Geography    Economics
5       Math       Reading      Civics
6       Science    History      F. Language
7       Math       Reading      Arts
8       Writing    Geography    Economics
9       Math       Reading      Civics
10      Science    History      F. Language
11      Math       Reading      Arts
etc.

Note: Core subjects in bold
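
The frequencies described for Example 1 can be checked mechanically. The following is a minimal sketch, assuming the four-year rotation transcribed from Table 6.1A-1 simply repeats; the names ROTATION, subjects_in_year, and counts_over are illustrative and are not part of the report.

```python
from collections import Counter

# Four-year rotation transcribed from Table 6.1A-1 (years 1-4, then repeating).
ROTATION = [
    ("Math", "Reading", "Civics"),
    ("Science", "History", "F. Language"),
    ("Math", "Reading", "Arts"),
    ("Writing", "Geography", "Economics"),
]


def subjects_in_year(year):
    """Return the subjects assessed in a given year (year 1 is the first year)."""
    return ROTATION[(year - 1) % len(ROTATION)]


def counts_over(years):
    """Count how often each subject is assessed in years 1..years."""
    tally = Counter()
    for year in range(1, years + 1):
        tally.update(subjects_in_year(year))
    return tally


counts = counts_over(10)
# List subjects from most to least frequently assessed over ten years.
for subject, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{subject:12s} {n}")
```

Under this reading of the table, the two core subjects each come up five times in ten years, and every other subject comes up two or three times.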

Example 2

Table 6.1A-2 is an example with four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. A ten-year period is illustrated, in which core subjects are assessed every three years and non-priority subjects every five years. This example assesses two or three of the main subjects each year, with the possibility of augmenting a two-subject year with a special assessment.

Table 6.1A-2
Example 2 of Subject Area Assessment Cycles

        Type of Assessment
Year    Comprehensive    Standard            Special/Probe

1       Math             Writing, History    -
2       Reading          Arts, Economics     -
3       Science          F. Language         Possible
4       Writing          Civics, Math        -
5       History          Reading             Possible
6       Geography        Science             Possible
7       Arts             Math, Writing       -
8       Economics        Reading             Possible
9       F. Language      Science, History    -
10      Math, Civics     Writing             -
etc.

Note: Core subjects in bold

Compared with Example 1, twice as many core subjects are assessed. The tradeoff for more core subjects is assessing them only every three years. Three-year cycles for the core subjects may suffice for timely monitoring of slowly changing trends. Also, if a National Assessment is the basic NAEP to which states can attach themselves, a state interested in only, say, Reading and Math can hold its participation to only one subject every other year. Moreover, a more integrated process (framework development/item development/tryout/final "test"/reporting of results) could be achieved, with additional field-testing and such activities as DIF analyses and achievement level setting occurring in off years. A disadvantage of a three-year pattern for main assessments is that cohorts cannot be tracked.
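
The cohort-tracking disadvantage can be made concrete with a small check. This is only a sketch under an assumption that is not stated in this section: that the assessed grades are 4, 8, and 12, so that a cohort reaches the next assessed grade four years after it was last tested. The function name cohort_can_be_tracked is hypothetical.

```python
# Assumption (not stated in this section): assessed grades are 4, 8, and 12,
# so a cohort tested in one grade reaches the next assessed grade 4 years later.
GRADE_GAP_YEARS = 4


def cohort_can_be_tracked(cycle_years, grade_gap=GRADE_GAP_YEARS):
    """True if a subject assessed every `cycle_years` years is assessed again
    in exactly the year a cohort reaches the next assessed grade."""
    return grade_gap % cycle_years == 0


for cycle in (1, 2, 3, 4):
    print(f"{cycle}-year cycle: cohort tracking possible = "
          f"{cohort_can_be_tracked(cycle)}")
```

Under that assumption, two-year and four-year cycles line up with the grade gap, while a three-year cycle never does.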

Example 3

Table 6.1A-3 is a variation of Example 2, similar in that there are again four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. Also, special assessments can appear periodically. It differs in that (1) there is an eight-year rather than a ten-year pattern, (2) core subjects are assessed biennially, and (3) in order to achieve the foregoing increases in intensity of assessment, four subjects are assessed every year. States could elect to participate in four-year cycles for core subjects in order to reduce their costs and burdens.

Table 6.1A-3
Example 3 of Subject Area Assessment Cycles

        Type of Assessment
Year    Comprehensive           Standard                      Special/Probe

1       Math                    Science, History, Civics      -
2       Reading                 Writing, Arts                 Possible
3       Science                 Math, Geography, Economics    -
4       Writing                 Reading, F. Language          Possible
5       History, Civics         Math, Science                 -
6       Arts                    Reading, Writing              Possible
7       Geography, Economics    Math, Science                 -
8       F. Language             Reading, Writing              Possible
etc.

Note: Core subjects in bold
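
The option for states to join only every fourth year can be sketched against the Example 3 schedule. The schedule below is transcribed from Table 6.1A-3 (Comprehensive and Standard columns combined); the helper names and the rule that a state starts with a subject's first scheduled year are illustrative assumptions, not part of the report.

```python
# Subjects assessed each year, transcribed from Table 6.1A-3
# (Comprehensive and Standard columns combined; Special/Probe omitted).
SCHEDULE = {
    1: ["Math", "Science", "History", "Civics"],
    2: ["Reading", "Writing", "Arts"],
    3: ["Science", "Math", "Geography", "Economics"],
    4: ["Writing", "Reading", "F. Language"],
    5: ["History", "Civics", "Math", "Science"],
    6: ["Arts", "Reading", "Writing"],
    7: ["Geography", "Economics", "Math", "Science"],
    8: ["F. Language", "Reading", "Writing"],
}
CORE = ["Math", "Reading", "Science", "Writing"]


def years_assessed(subject):
    """Years (within the eight-year pattern) in which a subject is assessed."""
    return [year for year, subjects in sorted(SCHEDULE.items())
            if subject in subjects]


def state_years(subject, every=4):
    """Years a state would take part if it joined only every `every` years,
    starting (by assumption) with the subject's first scheduled year."""
    years = years_assessed(subject)
    return [year for year in years if (year - years[0]) % every == 0]


for subject in CORE:
    print(f"{subject:8s} national: {years_assessed(subject)}  "
          f"state (4-year): {state_years(subject)}")
```

In this sketch each core subject appears nationally every second year, and a state on a four-year cycle would take part in half of those occasions.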

1B. Vary the amount of detail in testing and in reporting results.

  • National Assessment testing and reporting should vary, using standard report cards most frequently and comprehensive reporting in selected subjects about every ten years;

  • National Assessment results should be timely, with the goal being to release results within 6 months of the completion of testing.

The notion of decreasing returns plays a role in deciding how comprehensive data-gathering and reporting should be. For the sake of argument, suppose that a typical main assessment under the current configuration supports a thousand inferences, in the way of distributions at achievement levels, comparisons among subgroups, levels of background variables, and associations between background variables and performance. This is a lot to learn the first time it is done, and there will be several "surprises": leads for following up with additional or different kinds of research. The second time the same survey is conducted, however, most of these results will be essentially the same as they were two years before. Changes across time will not be measured precisely enough to detect change in most variables, except for large, well-measured ones. Collecting the same kind of data provides little additional information about the stories behind the surprises. All in all, the cost is about the same as the first time, but the informational value is far less. "Information per dollar" from the same survey continues to decrease over time (Boruch & Terhanian, 1996).

This phenomenon supports the notion of having only occasional comprehensive assessments, with performance and background variables rethought so we can surprise ourselves again. Between these periodic larger efforts, two complementary kinds of assessment can take place: (1) more modest and largely constant ‘standard’ assessments that report basic results and track major changes reliably and quickly; and (2) targeted assessments that dig more deeply into focused aspects of performance or correlates thereof, but off the critical path to standard reports. Targeted assessments can be costed out and designed separately, but administered jointly with the core assessment.

The decision to vary the intensity of assessments is not really a technical one. A key technical issue, however, comes from the assumption that comparisons across assessments varying in intensity are desired. This suggests the need to have the design provide the means for making such comparisons dependable. What size standard errors are acceptable? This question might be addressed in terms of the magnitude of changes that have been used to make the case that achievement is improving or declining, for example, the size of the changes reported in the long-term trend analyses. There is also the issue of subgroup comparisons. The law says that NAEP should "include information on special groups, including, wherever feasible, information collected, cross-tabulated, analyzed, and reported by sex, race or ethnicity, and socioeconomic status." Thus, group comparisons and changes deemed important in past policy discussions of NAEP results might be used to set targets for standard errors, which in turn determine the sample size (more precisely, the set of sample sizes in a multi-stage sampling design, in which the number of PSUs is the dominant factor). We note that cutting back on non-cognitive background variables does not have to be accompanied by cutting back on the achievement items.
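
The link from a target standard error to sample size can be illustrated with the textbook design-effect approximation for a two-stage sample with equal cluster sizes, deff = 1 + (m - 1) * rho. This is a simplified sketch, not NAEP's actual sampling or variance procedure, and every number below (score standard deviation, intraclass correlation, students per PSU, target standard error) is a hypothetical value chosen only for illustration.

```python
import math


def design_effect(students_per_psu, rho):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (students_per_psu - 1) * rho


def psus_needed(target_se, sigma, students_per_psu, rho):
    """Smallest number of PSUs k such that
    sigma * sqrt(deff / (k * m)) <= target_se."""
    deff = design_effect(students_per_psu, rho)
    k = sigma ** 2 * deff / (students_per_psu * target_se ** 2)
    return math.ceil(k)


# Hypothetical inputs (assumptions for illustration only).
SIGMA = 35.0      # standard deviation of scale scores
RHO = 0.2         # intraclass correlation among students in the same PSU
TARGET_SE = 1.0   # target standard error of the mean, in scale points

k_30 = psus_needed(TARGET_SE, SIGMA, students_per_psu=30, rho=RHO)
k_60 = psus_needed(TARGET_SE, SIGMA, students_per_psu=60, rho=RHO)
print(f"30 students per PSU: {k_30} PSUs ({k_30 * 30} students)")
print(f"60 students per PSU: {k_60} PSUs ({k_60 * 60} students)")
```

With these hypothetical inputs, doubling the number of students per PSU nearly doubles the total number of students but reduces the number of PSUs required only slightly, which is one way to see why the number of PSUs dominates.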

Completion Date/Elapsed Time

One desideratum for the new National Assessment is that "results should be timely, with the goal being to release results within 6 months of the completion of testing." Following is a discussion of a sample NAEP schedule and some ballpark comparisons with other large-scale testing programs conducted by states and commercial publishers.

The KPMG Peat Marwick-Mathtech review of NAEP (1996) describes the timeline for the 1994 NAEP Reading Report. The completion dates and elapsed time for the main activities can be summarized as follows:

Step    Task(s)                                    Completion    Months

0       Testing                                    4/01/94
1       Scoring & preliminary weights              7/30/94       4
2       DIF item review                            8/31/94       1
3       Scaling, conditioning, & weighting         12/8/94       3
4       Draft report                               3/01/95       3
5       NCES-NAGB review/revision; final report    3/07/96       12

        Total                                                    23
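
The elapsed months in the timeline can be recomputed from the completion dates. The sketch below assumes that months are obtained by dividing elapsed days by an average month length and rounding; that convention is an assumption made here, not something the KPMG Peat Marwick-Mathtech review states. The names MILESTONES, elapsed_months, and GOAL_MONTHS are illustrative.

```python
from datetime import date

# Completion dates transcribed from the 1994 NAEP Reading Report timeline.
MILESTONES = [
    ("Testing", date(1994, 4, 1)),
    ("Scoring & preliminary weights", date(1994, 7, 30)),
    ("DIF item review", date(1994, 8, 31)),
    ("Scaling, conditioning, & weighting", date(1994, 12, 8)),
    ("Draft report", date(1995, 3, 1)),
    ("NCES-NAGB review/revision; final report", date(1996, 3, 7)),
]
DAYS_PER_MONTH = 365.25 / 12  # average month length (assumed rounding convention)
GOAL_MONTHS = 6               # release goal quoted from "Themes and Issues"


def elapsed_months(start, end):
    """Elapsed time between two dates, rounded to whole months."""
    return round((end - start).days / DAYS_PER_MONTH)


# Pair each step with the preceding milestone to get per-step durations.
steps = [
    (task, elapsed_months(previous_date, done))
    for (_, previous_date), (task, done) in zip(MILESTONES, MILESTONES[1:])
]
total = sum(months for _, months in steps)

for task, months in steps:
    print(f"{task:42s} {months:3d}")
print(f"{'Total':42s} {total:3d}  (goal: {GOAL_MONTHS})")
```

Under this convention the per-step values and the total agree with the Months column, and the gap between the total and the six-month goal follows directly.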