Design/Feasibility Team
Report to the National Assessment Governing Board
July 1, 1996
Robert Forsyth, University of Iowa
| Time Spent in Reading Instruction | 30-45 Minutes | 60 Minutes | 90 Minutes or More |
| Average Proficiency | 220 | 219 | 216 |
With a negative correlation (r = -.1) between reading performance and time spent in reading instruction, it appears that increasing reading instruction decreases reading performance! But the average difference among students in the population who received various amounts of reading instruction (the prima facie effect) doesn't necessarily estimate the average causal effect of reading instruction on performance, because factors that may influence instructional time or reading performance are not taken into account in the comparison (Holland & Rubin, 1987). The NAEP report explains that the negative relationship in this example makes sense when we remember that (a) students who get extra help are usually students who seem to need extra help, and (b) students who seem to need extra help usually have low test scores. Other prima facie effects that we interpret as causal effects if they conform to our expectations can be just as wrong for similar reasons.
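The selection mechanism described above can be made concrete with a small sketch. All numbers here are hypothetical and chosen only for illustration: two kinds of students (baseline scores of 225 and 205), an assumed positive causal gain of 3 points per step of added instruction, and an assumed pattern in which students who seem to need help are assigned more time. None of these values come from NAEP data.

```python
# Hypothetical numbers chosen only to illustrate selection into instruction time.
TRUE_GAIN_PER_STEP = 3.0  # assumed causal gain for each step up in instruction time

# (baseline score, steps of added instruction, number of students)
# step 0 = 30-45 minutes, step 1 = 60 minutes, step 2 = 90 minutes or more
groups = [
    (225.0, 0, 80),  # students who seem not to need extra help
    (205.0, 0, 20),  # students who seem to need extra help
    (225.0, 1, 50),
    (205.0, 1, 50),
    (225.0, 2, 20),
    (205.0, 2, 80),
]

def mean_score_by_step(groups, gain):
    """Average observed score at each level of instruction time."""
    totals = {}
    for baseline, step, n in groups:
        score = baseline + gain * step  # instruction helps every student
        total, count = totals.get(step, (0.0, 0))
        totals[step] = (total + score * n, count + n)
    return {step: total / count for step, (total, count) in sorted(totals.items())}

means = mean_score_by_step(groups, TRUE_GAIN_PER_STEP)
for step, value in means.items():
    print(step, value)
```

Under these made-up inputs the group averages fall as instruction time rises, even though every individual student gains from added instruction. The prima facie effect and the causal effect have opposite signs because the groups differ in who was assigned to them.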
Other offsetting features of NAEP include limitations with respect to (a) motivation, (b) reliance upon survey data, and (c) constraints on students' time and the connection of assessment tasks with their instructional experiences. We will discuss motivation in Section 6.1G. As for reliance on survey data, we point out that teacher and pupil reports of instructional practice are notoriously dubious. Soliciting background data from the students themselves is quite economical, compared to ascertaining information such as home characteristics from actual observation or record searches. But especially with younger students, the trade-off is accuracy:
Some indicator systems have relied on student reports for information on background factors. A[n] analysis of the quality of responses in the High School and Beyond study provided sobering results. Correlation coefficients between sophomores' and parents' reports of background variables ranged from very low to quite high: for example, .21 for the presence of a specific place to study in the home; .35 for the presence of an encyclopedia in the home (an item used in the NAEP as well); .44 for mother's occupation; .50 for family income; .56 for whether the family owns or rents its residence; .81 for mother's education; and .87 for father's education (Fetters, Stowe, & Owings, 1984). (Koretz, 1992b, pp. 17-18)
As for constraints on students' time in assessment and lack of connection with their instructional programs, we must recognize that what we can learn about students from the NAEP cognitive tasks is limited in its scope. That is, there are kinds of learning we want our students to accomplish, but about which NAEP cannot provide direct evidence. For example, NAEP is not well suited to support inferences about how well students perform in tasks that extend over time, that involve the use of resources beyond the NAEP setting, or directly address skills and concepts on which the student has been specifically working. In these senses, NAEP tends to underestimate what students can do (Kane, 1996). Conversely, NAEP can overestimate the capabilities of students who do well on its limited palette of tasks but fare poorly in the context of the classroom. These facts hold implications for both achievement-level reporting and for the view of domains of NAEP tasks as representations of domains of learning (see Section 6.1E on standard-level setting).
A related mission for which NAEP is not well-suited is as a measurement tool for high stakes state or local accountability. While there is much consensus around the country in terms of what should be taught, there are also serious differences, with perspectives ranging from the most conservative to the most avant-garde. These differences produce intense scrutiny of any assessments used for high stakes evaluations. The National Assessment is vulnerable to attack if it is seen as a federal test implemented to support a federal curriculum. While the low-stakes nature of NAEP has contributed to participation and motivation problems, the same low stakes have also been a key contributor to its longevity, support, and usefulness.
This leads naturally to the mission of linking NAEP with the assessments of states and others. It is critical to NAEP's credibility that the limitations of what can and cannot be accomplished with such links be acknowledged. NAEP frameworks will rarely match any given state's frameworks, and NAEP assessment forms will rarely be parallel with state assessment forms. Student and administrator motivations are very different on the NAEP and local assessments. All of these differences produce uncertainty (error) in linking state assessments to the National Assessment (Linn & Kiplinger, 1994; Ercikan, in press). Some states may wish to use the link to assess how their students would do on NAEP in years or grades where NAEP is not administered. Others may wish to use the link to estimate how the nation would do on a state assessment, estimating national norms for it. But the state assessment cannot be a "stand in" for NAEP, or vice versa. The changes over grades and years that states are concerned about assessing will often be smaller than the linking errors.
The bottom line for assessments like the current NAEP is that they can provide excellent information about the status of a limited number and nature of indicators of WHAT students do, and establish frameworks for public discussion of educational progress and policy, but limited information on WHY (i.e., the determinants of their performance, which is what policy-makers are really interested in), HOW (i.e., what educators in content areas and educational and cognitive psychologists are really interested in), or UNDER WHAT CONDITIONS (another thing educational and cognitive psychologists see as important). Different ways of gathering information are much better suited to providing information about these aspects of student learning, including longitudinal studies, laboratory research, in-depth cognitive studies of smaller numbers of individual students, controlled field trials, and careful observational studies of classroom processes.
2.3 Implications
A key objective in rethinking NAEP is to focus resources within the range of missions that a survey with its evidentiary characteristics is good at, and to minimize effort on what it is not good at. If it is deemed important at a national level to obtain information that NAEP is ill suited to provide, we should not attempt to stretch NAEP to do so (necessarily poorly). Rather, we should conceive of an informational system in which NAEP is but one component; a system in which complementary and interconnected research of various kinds is each designed to do well the kinds of things it can do, and does not waste time and money doing things it cannot do well. This would argue for a simpler and more compact National Assessment which effectively indicates status and trend of key indicators, routinely gathering information about selected background variables as well, but not professing to answer causal questions about trends or to explain the cognition underlying performances. Instead, NAEP should be designed to be easy to plug into alternative projects and ways of gathering data that are well designed for other purposes. Examples of complementary studies that could include National Assessment indicators among their own data-gathering are program evaluations, classroom observations, cognitive research studies, protocol analyses of large-scale assessment tasks, longitudinal surveys such as NELS, and studies including in-depth background and instructional practices of students.
Many of the problems that have plagued NAEP over the years, including anomalies, errors, high costs, and extended time lines, can be diminished by applying familiar management principles from business and industry (e.g., Deming, 1982). They apply no matter what configuration of design, analysis, and reporting is ultimately decided upon for NAEP. They concern how complex systems, with multiple steps and many actors, are structured. The following sections present the relevant concepts, and illustrate how they apply in NAEP.
3.1 Local vs. Global Optimization
How do we improve quality and productivity? "By everyone doing his best?," asked W. Edwards Deming; "Five words, and it is wrong. You have to know what to do. You have to know what to do, then do your best. Sure we need everybody's best, everybody working together with a common aim. And knowing something about how to achieve it" (Walton, 1986, p. 32). The concept of local optimization is everyone doing his best but with a limited understanding of how their work fits into the system as a whole. The criteria that seem important to each contributor may do a good job of balancing tradeoffs that are visible to each of them, in accordance with priorities as they see them; yet when brought together, the contribution of one group can block or delay contributions of others. The resulting system, even if locally optimal everywhere, can be globally suboptimal. Some examples from NAEP:
3.2 The Critical Path
The PERT chart is a popular management tool for understanding the interrelationships among tasks in a large project. It shows which tasks depend on others, and which can be carried out in parallel. Importantly, it describes the chain of tasks, each depending on the previous, which absolutely must be carried out for the project to be completed; this is the critical path. Carrying out the tasks in the critical path determines the minimal amount of time required to complete the project. This concept is a key to cutting the time to report NAEP results: No task should appear on this critical path between data collection and reporting if it can be done before or if it is not essential for the report.
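The critical path idea can be sketched as a longest-path calculation over a task dependency graph. The task names, durations (in weeks), and dependencies below are hypothetical placeholders, not NAEP's actual schedule; the helper `critical_path` is likewise an illustrative name.

```python
from functools import lru_cache

# Hypothetical tasks and durations in weeks; not NAEP's actual schedule.
durations = {
    "collect_data": 8,
    "score_responses": 6,
    "build_weights": 4,
    "scale_items": 5,
    "set_standards": 10,
    "write_report": 3,
}
# Each task lists the tasks that must finish before it can start.
depends_on = {
    "collect_data": [],
    "score_responses": ["collect_data"],
    "build_weights": ["collect_data"],
    "scale_items": ["score_responses", "build_weights"],
    "set_standards": ["scale_items"],
    "write_report": ["set_standards"],
}

def critical_path(durations, depends_on):
    """Return (minimum project time, chain of tasks that determines it)."""
    @lru_cache(maxsize=None)
    def longest_to(task):
        # Earliest finish of `task` and the chain of predecessors forcing it.
        best_time, best_chain = 0, ()
        for dep in depends_on[task]:
            finish, chain = longest_to(dep)
            if finish > best_time:
                best_time, best_chain = finish, chain
        return best_time + durations[task], best_chain + (task,)
    return max((longest_to(t) for t in durations), key=lambda r: r[0])

total, path = critical_path(durations, depends_on)

# Same tasks, but standard setting is done ahead of time and no longer
# blocks the report (an assumed redesign, for illustration only).
moved = dict(depends_on, set_standards=[], write_report=["scale_items"])
total_moved, path_moved = critical_path(durations, moved)

print(total, path)
print(total_moved, path_moved)
```

In this made-up schedule, taking the standard-setting task off the chain between data collection and reporting shortens the minimum time to report without shortening any individual task.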
There appear to be a few tasks currently on the NAEP critical path that can be moved ahead without incurring any tradeoffs whatsoever. Others involve tradeoffs, but ones for which disadvantages appear to be overwhelmed by the advantages of speed and efficiency. For example:
3.3 Decreasing Returns/Negative Returns
Most people are familiar with the principle of decreasing returns. In test theory, for example, three item responses provide more information about a student than two responses, but the increase is not as much as the gain from two responses over one. The Spearman-Brown formula allows us to approximate these decreasing gains. However, the increment in testing time from two to three is just as much as the increase from one to two. At some point, the added items do not provide enough additional information to justify their cost. We will see several examples of this principle at work in NAEP, and it enters into deciding among design tradeoffs.
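As a rough numerical illustration, the Spearman-Brown formula gives the reliability of a test lengthened by a factor k as k*r / (1 + (k - 1)*r), where r is the reliability of the unit-length test. The sketch below uses reliability as a stand-in for "information" and assumes a single-item reliability of 0.30; that value is hypothetical, not a NAEP estimate.

```python
def spearman_brown(unit_reliability, k):
    """Predicted reliability of a test k times as long as the unit test."""
    return k * unit_reliability / (1 + (k - 1) * unit_reliability)

UNIT_RELIABILITY = 0.30  # assumed reliability of one item response (hypothetical)

reliabilities = [spearman_brown(UNIT_RELIABILITY, k) for k in (1, 2, 3, 4)]
# Gain from adding each successive item, at a constant cost in testing time.
gains = [later - earlier for earlier, later in zip(reliabilities, reliabilities[1:])]

for k, rel in enumerate(reliabilities, start=1):
    print(k, round(rel, 3))
print([round(g, 3) for g in gains])
```

Each added item costs the same testing time, but each successive gain in predicted reliability is smaller than the one before it.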
The lesser-known phenomenon of negative returns also arises frequently in NAEP. To continue the test theory example, when increasing test length begins to influence students' performance because of fatigue, frustration, or lack of cooperation, the Spearman-Brown predictions of decreasing returns are no longer correct. Costs are linearly higher, but the information gained can actually be less than it would have been with fewer items. This situation arises in NAEP as a consequence of motivation, logistical limitations, and attempts to address inferences that large-scale surveys are not, by their nature, suited to support. Some examples:
3.4 Operational Definitions
Educators can agree unanimously that we need to help students "improve their math skills," but disagree vehemently about just how to appraise students' skills. Their conceptions of mathematical skills diverge as they move from generalities to the classroom. They employ the language and concepts of alternative perspectives on how mathematics is taught, how it is learned, and about which topics and skills are important. The disparate assessments they have in mind all provide evidence about students' competence, but each from a particular point of view of that competence, how it is evidenced, and how much to value different aspects of it.
Several levels of abstraction might be conceived for thinking or talking about student achievement, but it is an actual specific assessment that a student ultimately encounters. "Test specifications" identify what a particular assessment should comprise: The kinds and numbers of tasks, the way it will be carried out, and the processes by which observations will be summarized and reported. This level of specification determines an operational definition of competence. Deming (1982) describes how similar processes are routinely required in industry, law, and medicine:
Does pollution mean, for example, carbon monoxide in sufficient concentration to cause sickness in 3 breaths, or does one mean carbon monoxide in sufficient concentration to cause sickness when breathed continuously over a period of 5 days? In either case, how is the effect going to be recognized? By what procedure is the presence of carbon monoxide to be detected? What is the diagnosis or criterion for poisoning? Men? Animals? If men, how will they be selected? How many? How many in the sample must satisfy the criteria for poisoning from carbon monoxide in order that we may declare the air to be unsafe for a few breaths, or for a steady diet?
Operational definitions are necessary for economy and reliability. Without an operational definition, unemployment, pollution, safety of goods and of apparatus, effectiveness (as of a drug), side-effects, duration of dosage before side-effects become apparent (as examples), have no meaning unless defined in statistical terms. Without an operational definition, investigations on a problem will be costly and ineffective, almost certain to lead to endless bickering and controversy. (pp. 286-287)
For practical work, stakeholders agree on one or more operational definitions to track the more abstractly defined matters in which they are interested. The U.S. Food and Drug Administration, for example, works with an operational definition for "acceptable frozen broccoli" that includes less than 272 aphids per pound, obviously a consensually defined quantity. Different operational definitions, equally defensible, can lead to somewhat different results, but only after they have been specified can accurate estimation, or discourse based on the matter, proceed.
In NAEP, an operational definition of proficiency in a subject area is determined jointly by the subject-area framework, test specifications, administration procedures, and scaling/reporting procedures. Even a seemingly minor decision about whether to ignore omitted responses or to count them wrong is part of the definition. Any change in any of the components changes the operational definition of the proficiency, and has the potential to affect results by more than changes in what students actually know and can do affect them.
Operational definitions come into play in NAEP in several other places, such as sampling frames, background variables, exclusion rules for testing students, and, importantly, achievement levels. This last instance is discussed in Section 6.1E.
3.5 Variation in Systems
At the heart of Deming's revolutionary approach to quality control was an understanding of variation in a system. Any system exhibits variation. Even an established system under what Deming called statistical control exhibits a certain amount of variation. Resources are squandered when attention is focused on variation within these limits. One way that resources are effectively used is identifying and resolving special causes of variation that lie outside the natural variation of a system: "putting out fires", or, in NAEP, "resolving anomalies." Statistical ideas help distinguish special causes from the natural variation of a system. In industry, typical control limits for zeroing in on outliers are three standard deviations beyond average results. While putting out fires is an effective use of resources, it does not improve a system. Only changing the system can do that. The second way to use resources effectively is to change the system so as to improve its product and, almost always, to reduce the amount of variation in the system. These principles are relevant to the NAEP redesign, for decreasing reporting time and improving the accuracy of trend results.
3.5.1 Reporting time
Figures 3-1 a) and b) present HYPOTHETICAL illustrations of two reporting systems. The top panel suggests time-to-release of main reports under the current main-NAEP configuration, which includes revisions, changes, new procedures, and reporting decisions (such as standard setting, how to handle scaling, what results to report and how to report them). This figure is fictitious, partly because calendar time to reports depends on which reports are given priority. For example, the average time is higher than desired, although some reports are ready fairly quickly. But the variation is very wide, due in large part to unforeseeable needs for rework caused by unstable or new portions of the assessment or attendant processes, under a configuration in which almost everything must be analyzed and resolved before anything is reported. Simply exhorting everyone to do better does little to bring average reporting time down, since the wide variation in the system, as configured, leads predictably to some reporting times above the desired target. Focusing resources on the specific incidents that led an assessment to come in after schedule is wasted if the underlying cause is an untested change, an inherently unstable variable, or a survey that requires complex file-matching, and if the next assessment cycle will include new untested changes, inherently unstable variables, and surveys that require complex file-matching.
The bottom panel illustrates some important observations we have made about the process of reporting long-term trends. Long-term trend reporting, neglected in the main NAEP activities, has coincidentally become a stable process. Very few changes at all enter into test designs, administration, or analysis (although reporting has sometimes been extensive, as trend reports sometimes have much interpretation and contextualization). The time necessary to prepare the basic data for reporting is not only shorter, but exhibits far less variation. This first feature is the bottom line, of course, and we will be exploring ways to achieve it in a redesigned standard NAEP. The second feature, reduced variation, is important not just for the predictability and the reliability of the system, but because it permits quicker and more accurate detection of true special causes. That is, if variation due to controllable nuisance effects is decreased, true anomalies are faster and easier to detect and resolve.
3.5.2 Accuracy of trend results
Deming, as a statistician, appreciated both the value of statistical models for gauging uncertainty and the limitations. A limitation of model-based estimates of uncertainty (i.e., standard errors) is that they depend on the model. To the degree that the model is wrong or incomplete, the reported standard errors are wrong, usually too small, because they do not include important sources of variation in the results. This is important in NAEP in the following way.
NAEP results may be called reading proficiency or math performance in a rather generic or global use of the term, but what they really are, are summaries of observations (which we believe have something to do with students' knowledge and skills) collected in specific ways under specific conditions. Literally hundreds of specifics are involved, everything from definitions of the population frames, sampling procedures, color of ink, and weights, to item specifications, timing, administration, analytic procedures, and training procedures for scorers. Design changes have these three important implications:
Error variance from some of these effects can be handled with statistical models: student sampling and item sampling, in particular. These sources of variance are all that show up in reported standard errors. Score variations due to other feature changes usually are not estimated, except when a change is deemed sufficiently suspect to merit dual administration under old and new conditions, and an attempt made to adjust for the average effect of the change. This means that the variation associated with seemingly small changes is present in results, but not in the standard errors for them.
Two untoward consequences result from this underestimation of the variability in results. First, distortions result in planning the sampling design. For example, there was uncertainty present in the 1986 design due to changing item context that was as large as uncertainty due to student sampling(1) (Mislevy, 1990). The huge expense of securing large random samples of students is wasted if locally desirable changes in design and procedure add variance back into the results.
Second, while the current student- and item-based standard errors are not too bad for within-assessment comparisons (because conditions are constant within that assessment), they underestimate more seriously standard errors for trends because of changes across assessment cycles. Setting control limits in relation to the underestimated standard errors guarantees that false alarms will be set off on a regular basis. Too many observations will be identified as suspicious. This triggers a search for the mistake, a special cause of variation, when there is no special cause; just another draw from the natural variance of a noisy system. If one wants accurate reports and the current system is not accurate enough, continually chasing a few real and many false signals of anomalies cannot solve the problem. The real solution requires honest estimates of the actual uncertainty in the existing system, then changing the system so it is less noisy.
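The false-alarm argument can be quantified with a short sketch. It assumes that a no-change result is normally distributed, that control limits are set at three reported standard errors, and, as one hypothetical case, that an unmodeled source of variation is as large as the modeled one (equal variances, so the true standard error is the square root of 2 times the reported one). The function name `false_alarm_rate` is illustrative.

```python
import math

def false_alarm_rate(k, reported_se, true_se):
    """Two-sided chance that a result with no special cause falls outside
    +/- k reported standard errors, assuming a normal distribution whose
    spread is the true standard error."""
    z = k * reported_se / true_se  # limit expressed in true-SE units
    return math.erfc(z / math.sqrt(2))

# Reported standard error is correct.
honest = false_alarm_rate(3, 1.0, 1.0)
# Reported standard error omits a source of variance as large as the one it
# includes (an assumed case for illustration).
understated = false_alarm_rate(3, 1.0, math.sqrt(2))

print(round(honest, 4), round(understated, 4))
```

With honest standard errors, roughly 3 results in 1,000 would trip a three-sigma limit by chance. Under the assumed understatement the rate rises to roughly 3 in 100, so most flagged "anomalies" would be ordinary draws from the system's natural variation.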
The impact of variations in design options, and the consequent generalizability of inferences drawn from NAEP data, can and should be examined empirically by the use of generalizability studies. These studies should be done as part of the planning process, and should not be on the critical path to main reports. In such studies, different versions of an assessment are developed that vary in controlled ways. For example, test forms may be developed that contain different items but that are designed to be parallel in terms of the number and type of items and their measurement properties. Or forms may be created that vary more systematically, such as in their proportions of constructed-response, performance, and multiple-choice items. The variation in results across forms provides important information about how much error can be expected from changes in the assessment. This information makes it clearer how generalizable are conclusions drawn from a particular assessment design.
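A very simplified version of such a study can be sketched as follows. It is a method-of-moments estimate, not a full generalizability analysis: the variance of mean scores across forms is treated as the sum of a form-effect variance and the average sampling variance of those means. The form means and standard errors below are hypothetical, and `form_variance_component` is an illustrative name.

```python
from statistics import mean, variance

def form_variance_component(form_means, sampling_ses):
    """Estimate variance in results attributable to the choice of form.

    Assumes the forms were given to randomly equivalent samples, so the
    observed variance of form means = form-effect variance + sampling variance.
    """
    observed = variance(form_means)                    # sample variance across forms
    sampling = mean([se ** 2 for se in sampling_ses])  # average sampling variance
    return max(0.0, observed - sampling)               # truncate negative estimates at zero

# Hypothetical mean scale scores from four forms built to be parallel.
form_means = [250.0, 252.0, 247.0, 251.0]
sampling_ses = [1.0, 1.0, 1.0, 1.0]

component = form_variance_component(form_means, sampling_ses)
print(round(component, 3), round(component ** 0.5, 3))
```

A form component that is large relative to the sampling variance would indicate that reported standard errors based on student sampling alone understate the uncertainty introduced by changing forms.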
3.6 Implications and Approaches
How can we apply these management concepts to achieve the desiderata of the Themes and Issues? Our discussion of the Themes and Issues and our design sketch make use of four ideas:
Modular Design. The idea here is to design NAEP in terms of distinguishable modules, perhaps the most important of which ("a core NAEP") supports trend comparisons over time and consists of elements which are important, stable, and (comparatively) easy to analyze and report. These core modules could be embedded in other NAEP activities (in particular, state NAEP), and in non-NAEP studies. Other elements of NAEP could be spiraled into the main NAEP administration, but would not appear on the critical path to initial reports. These could include, for example, teacher surveys, experimental and more extensive tasks, long-term trend blocks, and blocks of tasks being readied to appear on the critical path in the next assessment cycle.
Phased Analysis & Reporting. As large as NAEP is, it is dwarfed by the census that the Census Bureau carries out every ten years. Yet the Census Bureau reports its first results six months after the data are in, as required by law. How can they do this? They do not report every possible result in every conceivable form. They report the most important results in the most straightforward way, then continue, over the next ten years, to analyze, to refine, to report, and to release further analyses in priority order. The analyses required for these results are not on the critical path to the initial report. NAEP has moved in this direction recently with its First Look reports. In Section 6.1D we discuss how even quicker initial reports could be accomplished.
Phased-In Change. In every administration of NAEP assessments, some aspects of the data collection have been essentially unchanged from the previous administration, others are changed only modestly, and others are quite different. We see time and again that the chances of problems (some remediable, others not) increase accordingly. For example, long-term trend assessments are essentially unchanged from one administration to the next, and, not surprisingly, they exhibit far fewer problems than main NAEP. Many things that could go wrong in an assessment have been discovered (often the hard way), worked through, and are avoided in successive administrations. It is largely known what the data will look like when they arrive, and what to do, and how to do it. Many of these advantages could be built into a core NAEP, while relaxing some of the incidental constraints that also characterize the long-term trend assessments. Open-ended tasks, which are not part of the current long-term assessments, could be included in the mix of tasks. New blocks could be introduced, as long as (a) they were not included in the standard results the first time they are used, and (b) they were very similar to blocks already in the mix in terms of structure, difficulty, content, and format balance. More consequential shifts of these factors would be introduced only periodically (say, every eight to ten years), and after at least one joint administration in which they are not included in the initial results.
Even when attention focuses on the kinds of information that large-scale surveys such as NAEP can do well, there remains much leeway for setting priorities. Broad and current content coverage, for example, has always been important for NAEP; so has the capability to compare performance across time points. The NAGB Themes and Issues propose a higher priority for expeditious turnaround of results than has been the case historically. And while associations between performance and student background variables have been desired, the high cost of reliable measures of student background has led NAEP to rely on less trustworthy self-reports. Three key points must be kept in mind:
4.1 A Fundamental Tradeoff
Perhaps the crucial tradeoffs to be addressed in a NAEP redesign emerge from the interplay of the following points made in the Themes and Issues:
Content-coverage has been important to NAEP since its inception. Such comprehensiveness cannot be attained if all students are administered the same, or virtually parallel, test forms. In and of itself, variation in test forms is not a barrier to rapidity and simplicity. The NAEP design of the early 1970s had few restrictions on booklet construction yet supported simple analyses, but largely because results were reported in terms of performance on items, not in terms of performance by students.
This may seem like a trivial distinction, since all the data are performances on items by students. The key difference, though, is that under item-level reporting, the issue addressed is how students would do on this item, regardless of performance on other items. In a student-level framework of reporting (even if scores are never even calculated for individual students), the focus is on how a given student would do across items. This means projecting from how she does on the one particular set of items in her test form, to how she might have done on some larger set (e.g., an actual set of reference items, or a performance scale that implies levels of performance in a domain of items). This means that interrelationships among performances across items are important, and the complexities of some kind of linking and scaling procedures appear. Methodologies available for linking results on different test forms vary in their complexity. The simplest can be employed when (1) forms are parallel, which demands tight constraints on form design and works against breadth of content-coverage, and (2) target inferences are about individuals measured equally well, rather than about properties of distributions of groups.
The current NAEP configuration has neither of these characteristics. Data come from booklets that vary within assessments and over time. Students are administered too few items to obtain accurate measures of their performance, since experience has shown that administering large numbers of tasks under unmotivated "drop in from the sky" testing abrades the engagement of students and schools alike. And the target of inference is proportions of students at or above designated achievement levels, one of the hardest to estimate from sparse matrix-sampling designs. This is the state of affairs that led to the complex statistical methodologies noted in the simpler design desideratum.
Is it possible to have a cleaner design, simpler analyses, and faster reporting, yet maintain broad content coverage and valid achievement-level reporting? Our perspective emphasizes (1) use of management principles in design, so that procedures can be faster, simpler, and more stable no matter how tradeoffs are balanced, and (2) arrangement of design and reporting priorities so as to be, at once, consonant with the desiderata in the Themes and Issues, but ordered so as to reduce costs and complexities in achieving them.
4.2 Tradeoffs and Test Specifications
Among the major features of an assessment that affect the structure of test forms are the following: 1) content specifications (including definition of objectives or outcomes and the number of items measuring each outcome); 2) item types and formats (including but not limited to multiple-choice or performance items); 3) desired standard error functions, especially as they relate to achievement levels; 4) testing time per student; and 5) linking requirements (between forms or grades). Decisions about these design featureswhich, as we shall discuss, should flow from decisions about priorities on assessment purposeswill create the look of the assessment and have a great influence on the complexity of the analysis techniques needed.
Test frameworks determine the breadth of content coverage needed, but test specifications are more specific than frameworks. If it is desired to make broad and robust generalizations about student achievement, then broad content coverage is needed. The level of detail at which distinctions will be made is also important. For example, if it is desired to draw generalizable conclusions about students achievement in problem solving versus algebra, then a sufficient number of items needs to be included to measure those separate objectives. In the past, NAEP has been notable for the breadth of its content coverage, which has positively affected its reputation as a valid and useful benchmark of American student achievement. However, this breadth has contributed to the need for a large, complex, expensive number of test forms.
It is an explicit desideratum of the Themes and Issues that constructed response or performance items be included in a redesigned NAEP. The number and type of performance items have tremendous impact on testing time and scoring costs. Also, while increasing the depth of assessment, the task effects inherent in performance items decrease the generalizability of results relative to devoting the same amount of testing time on multiple-choice items. That is, the use of just one performance task creates the need to use additional performance tasks in order to maintain stable results. For example, if only one math task is used in one year and it focuses heavily on geometry, and the next year an algebra-laden task is used, it will not be possible to understand the meaning of score changes: Are the changes due to changes in levels of student achievement in math skills common to both tasks or are they due to the fact that students can do one type of task better than the other? Using several carefully chosen tasks in each assessment improves the interpretability of the results since it affords the possibility of sorting out some of these competing explanations.
The desideratum to report scores in terms of achievement levels places particular emphasis on the standard error functions, or degree of accuracy of information about individual students. Items need to be placed in the assessment to match the target achievement levels. For example, to accurately measure Advanced performance, difficult items must be in the assessment. If NAGB decides that it is important to place more emphasis on measuring students' progress as they move toward achieving the Basic level, more items at the low end of the scale need to be added to the assessment. There is a fundamental dilemma in designing an assessment before standards are set: reasonable standards may be set, yet a given assessment design may be unable to measure, with the necessary accuracy, the proportion of students reaching those standards.
Past NAEP results have found that when students are tested for longer than one testing session (about an hour), there is a substantial loss of student participation.(2) Such loss biases assessment results. As long as it is desired to measure more content than one student will take, more than one test form must be used, and complexities arise in design and analyses. For example, imagine that it takes two forms to cover the NAEP content framework and item format specifications to an acceptable degree. Since two forms are needed, in one or more ways they cannot be parallel; they may measure different content, perhaps with different formats, or have different standard error functions. In an extreme case, one form might contain only multiple-choice items (Form A) and the other contain one or more performance items (Form B). To obtain an overall picture of performance of a group of students, it will be necessary to pool results from the two forms. It will not be possible for only one form to be used by, say, states, to link their assessments to NAEP; both forms will be needed.
Furthermore, if more than one form is needed to cover the desired content and item formats, comparability of results over years requires either (a) tight restrictions on test form characteristics or (b) complex analysis procedures. Continuing the example above, call the first year's test forms A1 and B1. To maintain overall consistency of results in the second year of testing, it is necessary to design A2 to be parallel to A1 and B2 to be parallel to B1. (Looser restrictions could be used, but they are more complicated to explain and implement.) If form consistency is not maintained, then the distributions of observed scores (and percents of students in each achievement level) will be affected by differences in standard error functions. Sophisticated statistical techniques exist for dealing with these differences (e.g., the "plausible values" methodology), but one of the Themes and Issues desiderata is to reduce use of such techniques. We will discuss these issues further in Section 6.1C.
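The effect of differing standard error functions on achievement-level percentages can be illustrated with a small calculation. The sketch below is not drawn from NAEP data: it assumes, purely for illustration, that true proficiency is normally distributed with mean 0 and standard deviation 1, that each form adds independent normal measurement error, and that the achievement-level cut score sits at 1.0. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pct_above_cut(cut, true_sd, error_sd, mean=0.0):
    """Percent of observed scores above `cut` when observed = true + error,
    with true ~ N(mean, true_sd^2) and independent error ~ N(0, error_sd^2)."""
    observed_sd = math.sqrt(true_sd ** 2 + error_sd ** 2)
    return 100.0 * (1.0 - normal_cdf((cut - mean) / observed_sd))

# Hypothetical values: same population and cut score, three error levels.
for error_sd in (0.0, 0.3, 0.6):
    print(error_sd, round(pct_above_cut(1.0, 1.0, error_sd), 1))
```

Under these assumptions the population is identical in all three cases, yet the percentage of observed scores above the cut grows as measurement error grows. This is why forms with different standard error functions cannot simply be swapped from one year to the next.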
4.3 Remarks
It is not possible to "have it all"; trade-offs must be made. The present NAEP design emphasizes breadth of content coverage, use of performance items, minimum testing time per student, and achievement level reporting. These features have been obtained by increasing the cost and complexity of the form design and analysis. The cost and complexity can be reduced, but then something else must be given up. The configuration we sketch in Section 7, for example, maintains broad content coverage, allows for controlled evolution of the task pools, and permits more rapid reporting, but it does so by constraining the specifications of booklets upon which standard, initial reports are based. Subsequent reports incorporating broader content, newer and more complex tasks, and additional student background variables can come later, necessarily carried out with more complex analyses.
This section briefly reviews selected elements of design configurations NAEP has exhibited over the years, in terms of purposes, priorities, and trade-offs: some explicit, others implicit; some intentional, others adventitious; and some with unforeseen consequences. This discussion further illustrates the principles introduced above, and sets the stage for deliberation of options for the future.
5.1 1970-1983
Certain features of NAEP were instituted at its onset, conceived to produce results sufficiently useful, cost-effective, and politically benign to come into being.
5.1.1 Student Sampling
NAEP was designed to gather information from samples of students rather than from every student. This approach, motivated more by practice in public-opinion polling than educational testing, allowed extraordinary efficiencies when the target of inference was performance of groups of students rather than of individual students. Accurate estimates of national performance, for example, could be grounded on a random sample of a few thousand students. A multi-stage sample was employed (a simple random sample of students from the nation is impractical), necessitating that clustering effects and stratification be accounted for in estimating item averages and precision of estimation. Since results were not obtained for all students, nor used for purposes specific to sampled individuals, motivation was more of a concern than in typical tests in which something good happens to a student if he does well, or something bad happens if he does poorly.
A tradeoff appeared in the sampling of students at random from their schools, rather than from intact classrooms. The advantage: A lower clustering effect, implying more efficient estimates of group performance for a given sample size. The disadvantage: Hierarchical linear modeling (HLM), which would examine impact of class and teacher effects, was precluded.
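The efficiency argument can be made concrete with the standard design-effect approximation, deff = 1 + (m - 1) * rho, where m is the number of students sampled per cluster and rho is the intraclass correlation. The sketch below is illustrative only: the score standard deviation, sample size, cluster size, and the two correlation values are hypothetical, chosen to reflect the general expectation that students in an intact classroom resemble one another more than students drawn at random from a whole school.

```python
import math

def design_effect(cluster_size, intraclass_corr):
    """Kish's approximation: deff = 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * intraclass_corr

def standard_error_of_mean(score_sd, n_students, cluster_size, intraclass_corr):
    """Standard error of a mean from a clustered sample of n_students."""
    deff = design_effect(cluster_size, intraclass_corr)
    return score_sd * math.sqrt(deff / n_students)

# Hypothetical comparison: 3,000 students, 25 per cluster, score SD of 35.
se_within_school = standard_error_of_mean(35.0, 3000, 25, 0.10)
se_intact_class = standard_error_of_mean(35.0, 3000, 25, 0.25)
print(round(se_within_school, 2), round(se_intact_class, 2))
```

With the same number of students, the larger intraclass correlation assumed for intact classrooms yields a larger standard error, which is the efficiency cost the random-within-school design avoided.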
5.1.2 Item Sampling
Item-sampling is the dual of student-sampling. Since performance in any subject area is only poorly reflected by any single item, or even several of them, we learn more comprehensively about all the many facets of skill and knowledge in a subject from a large number of diverse tasks, far too many for any single student to be administered, especially under unmotivated conditions. NAEP pioneered the radical solution of item-sampling: each sampled student was administered a sample of items from the pool. Technical innovations made it possible to obtain, from these matrix samples of responses, estimates of average performance (e.g., Lord, 1962). Matrix sampling was ideal for broad content coverage and efficient estimates of performance in large domains of items. An important feature of matrix sampling is that it supports estimates of average performance in the domain or on individual items even if students respond to very few items. This was a partial solution to the motivation problem, since under low-stakes conditions, motivation declines as amount-of-effort-required increases. (Indeed, motivation can decline to the point of negative returns as testing sessions become longer; two hours of testing time per student can provide less information about a group than one hour of testing time, if rates of school and student refusal, and item omit rates, increase.)
Items in the original NAEP design were administered by paced audio tapes. That is, all students in a testing session were administered the same booklet of items, and an audiotape moved students through the booklet item by item. A number of trade-offs were involved here: Administration was logistically cumbersome, and data were less than optimally efficient because of the clustering of students. On the other hand, information was better item-by-item than when students are simply given a number of items to work, and a block of time to work on them. In this latter situation, it is up to students to decide when, for how long, and in what order they will work on each item, a factor of some importance in their performance, but one uncontrolled by administration conditions.
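A small simulation may help show why matrix sampling works for group-level estimates. The sketch below is a deliberately simplified illustration, not NAEP's estimation procedure: the pool size, sample size, and items per student are hypothetical, and every student is assumed to have the same chance of answering a given item correctly (real students differ in proficiency).

```python
import random

def simulate_matrix_sample(n_students=2000, n_items=60, items_per_student=6, seed=0):
    """Each student answers only a few randomly chosen items from the pool."""
    rng = random.Random(seed)
    # Hypothetical true proportion-correct for each item in the pool.
    true_p = [rng.uniform(0.2, 0.9) for _ in range(n_items)]
    correct = [0] * n_items
    attempts = [0] * n_items
    for _ in range(n_students):
        for item in rng.sample(range(n_items), items_per_student):
            attempts[item] += 1
            if rng.random() < true_p[item]:
                correct[item] += 1
    # Item-level estimates use only the students who saw each item.
    est_p = [c / a if a else float("nan") for c, a in zip(correct, attempts)]
    return true_p, est_p

true_p, est_p = simulate_matrix_sample()
print("true domain average:     ", round(sum(true_p) / len(true_p), 3))
print("estimated domain average:", round(sum(est_p) / len(est_p), 3))
```

Although each simulated student sees only one tenth of the pool, the domain average is recovered closely, because every item is still answered by a few hundred students.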
5.1.3 Reporting
From the inception of NAEP through 1988, NAEP reported results in terms of regions of the nation rather than states. Obviously, since regions of the country are not responsible agencies for education, reporting in these terms had less policy relevance than reporting in terms of states or school districts. Why make such a tradeoff? In order to make NAEP acceptable, so it could come into being in the first place. It was not politic to create a national assessment that was too useful.
Similarly, sampling and reporting were organized in terms of ages of students rather than grades, even though schooling in this nation is mainly organized in terms of grades. Ages 9, 13, and 17 were targeted, in tune with the international assessments that were starting to be carried out by the International Association for the Evaluation of Educational Achievement (IEA).
Reporting originally focused on single items: percents of correct response, or distributions of kinds of response, in the nation as a whole and in subgroups of students defined by the background variables NAEP also included in its surveys (e.g., race/ethnicity, parents' education, and region of the country). This item-by-item reporting contrasts with student-based reporting, in which individual students' performance is summarized over items, and the distributions of these summaries are analyzed. A great advantage of item-level reporting was that no equating or scaling procedures were required; average performance on an item was simply what it was. Analysis was as simple as possible as far as items were concerned, given that the complex student-sampling required a certain level of complexity in analysis (jackknife estimation of variance due to student sampling, which is still used). Another advantage was that there were relatively few constraints on the composition of assessment booklets. They need not have had similar content, formats, difficulties, or lengths. A disadvantage was that item-by-item reporting provided estimates of average performance by subgroup, but it precluded the conception of distributions of students' performance, or of reporting in terms of achievement levels.
Subject area experts liked the original item-by-item reports, but such detailed reports were quickly found to be unsatisfactory for communicating with policy-makers and the public. When someone asked "How are kids doing?", she did not want two hundred answers, one for each item, especially if the same general message was being repeated for most of the items. Beginning around 1974, reports began to provide results in terms of average performance over clusters of related items.
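The jackknife variance estimation mentioned above can be illustrated with a generic delete-one-cluster version. This is a textbook sketch under simplifying assumptions (unweighted data, no stratification), not NAEP's operational procedure; the function name and the data are hypothetical.

```python
def jackknife_variance_of_mean(cluster_scores):
    """Delete-one-cluster jackknife variance of the overall mean.

    cluster_scores: list of lists, one list of student scores per sampled cluster.
    """
    n = len(cluster_scores)
    total = sum(sum(c) for c in cluster_scores)
    count = sum(len(c) for c in cluster_scores)
    # Recompute the mean n times, each time leaving out one whole cluster.
    replicates = [(total - sum(c)) / (count - len(c)) for c in cluster_scores]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)

# Hypothetical data: four sampled schools with two students each.
schools = [[1, 1], [2, 2], [3, 3], [4, 4]]
print(jackknife_variance_of_mean(schools))
```

Because whole clusters are dropped rather than individual students, the replicate-to-replicate variation reflects the clustering in the sample, which a simple-random-sample variance formula would ignore.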
5.1.4 Measuring Trends
Once reporting was organized around average percents of correct response across clusters of tasks, these clusters naturally became the basis of comparisons across time and across age groups. The disadvantage was that the groups of items in common across years were not selected purposefully to this end; they varied in number and content coverage, and constituted only haphazard and unrelated reporting scales. For example, percents-correct for 13-year-olds might be higher than those for 17-year-olds, simply because the 13-year-olds' common items happened to be among the easier ones they were administered. Moreover, the release of 1/4 of the items with each assessment cycle meant that fewer and fewer items were available for comparing performance over time. In short, this method of comparing achievement over time had not been planned. It arose as an ad hoc response to a mission whose importance grew over time, but for which the design was not well suited.
5.2 The 1984 Redesign
After having started out with a clean design in the early 1970s, satisfactorily addressing the perceived needs of the time with the technologies of the time, NAEP became increasingly unwieldy over the years as expectations changed. Ad hoc procedures (such as trend reports on clusters of items) had been introduced to meet new expectations as well as possible, but even so dissatisfaction was increasing. The competition for a redesigned NAEP in 1984 led to a contract in which many of the features of the current configuration originated (see Messick, Beaton, & Lord, 1983). New priorities were recognized, and the new design was introduced to reflect different balancing of recognized tradeoffs.
5.2.1 Student-Sampling
Recognizing that schooling in the US was organized mainly by grades, the 1984 redesign introduced concurrent age and grade sampling. The sampled grades were the modal grades associated with ages: Grade 4 with Age 9, Grade 8 with Age 13, and Grade 11 with Age 17. A given administration scheme and set of test booklets was used in each grade/age combination. Advantages of this extension of NAEP included the maintenance of age-based surveys for trend and international comparisons, and the availability of grade-based surveys for increased policy relevance. Disadvantages included increased fieldwork; additional complexities in sampling, weighting, administration, and data structures; and dual analyses and possibilities of analytic errors.
5.2.2 Item-Sampling
A more complex version of multiple-matrix item-sampling was introduced in 1984: BIB-spiraling. BIB spiraling presented booklets which were organized around three blocks of items, and these blocks were combined so that each block would appear at least once with every other block. This Balanced Incomplete Block design is the BIB in BIB spiraling. The motivation for this innovation was to support the construction of response scales for performance; that is, to support reporting in terms of what individual students do on collections of items, rather than simply average performance per item. Specifically, scaling by means of item response theory (IRT) models was introduced. This scaling allowed consideration of distributions of performance, setting the stage for achievement level reporting. The trade off was increased complexity of analysis.
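The pairing property of a Balanced Incomplete Block design can be checked mechanically. The sketch below builds one small, well-known example (seven blocks arranged into seven three-block booklets by cyclic shifts of a base triple) and counts how often each pair of blocks shares a booklet. The block count and the construction are illustrative assumptions; the actual 1984 booklet layout is not reproduced here.

```python
from itertools import combinations

def cyclic_bib(n_blocks=7, base=(0, 1, 3)):
    """Build booklets by shifting a base triple of block numbers modulo n_blocks."""
    return [tuple((b + shift) % n_blocks for b in base) for shift in range(n_blocks)]

def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = {}
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

booklets = cyclic_bib()
for booklet in booklets:
    print(booklet)
# Every value in this set is the number of booklets shared by some pair of blocks.
print(set(pair_counts(booklets).values()))
```

In this particular construction each block also occupies each of the three booklet positions exactly once, which is one simple way to balance position effects across blocks.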
The spiral in BIB-spiraling involved administering different booklets to students in a given testing session, spiraling through the entire set of booklets rather than having every student in a session paced through the same booklet. These spiraled booklets were constructed so that each block was allotted the same amount of time, and students allocated their time as they chose within each block. Advantages of this procedure included easier logistics, since audiotape equipment and its vicissitudes were no longer a factor, and more efficient sampling in one sense, since the testing-session-by-booklet clustering effect was eliminated. However, additional uncertainties (sources of variance) were introduced into item-level results: larger position effects, more omits, and more dependence on students' varying time-management skills. National estimates of item-level performance were less trustworthy, since the percent-correct now depended materially on whether an item happened to appear near the beginning or the end of a block.(3) Blocks of items, rather than individual items, became the fungible unit of interpretation.
5.2.3 Plausible values
A major feature of the current NAEP configuration is the use of plausible values methodology to estimate distributions of students' proficiencies. It is noteworthy that this methodology was an unintended consequence of the redesign. Specifically, it became necessary as a consequence of: (1) placing a high priority on student-based, rather than item-based, reporting; (2) collecting data in booklets constructed to far looser constraints than are employed in typical student testing programs; and (3) finding that complex analyses had to be invented to meet these missions with the data that had been collected.
IRT scaling was introduced to enable comparisons of group-level performance across distinct years, ages, and booklets. By the early 1980s, established scaling methods were available to do this with reasonable speed, stability, and expenditure. Item parameters were estimated to establish a common scale; IRT ability estimates were produced for each examinee; and subsequent analyses could be carried out across non-identical item sets. In particular, the IRT calibration and scoring were completely separate from the preparation of, and analyses concerning, any other background or instructional variables whose relationships with performance were of interest. Subgroup distributions and secondary analyses could then be carried out with these estimates for individual students. It was therefore planned to carry out the analysis of NAEP data using this approach, a relatively simple analysis under which the following elements all appeared to be in concert: the data, the analysis, the commitments, the PERT chart, the requisite resources, and the long-term stability of the approach.
But this approach failed in the 1984 assessment. Although NAEP's sparse matrix-sampling design was extremely efficient for obtaining information about population characteristics, it did not support response-data-only IRT ability estimates for individual students upon which suitable analyses could then be based. Vastly differing mixes across booklets of content, difficulty, test length, and timing further impaired the approach, since varying measurement error distributions caused distortions in individual-student score estimates that were larger than the true differences of interest, such as between sexes or regions. These form-to-form factors, it was realized, were explicitly managed in programs like the SAT and ACT. In such applications, IRT could indeed characterize and take advantage of patterns among students' performances on different sets of items, but only because the test forms were sufficiently long and parallel. This was the crisis faced with the 1984 assessment:
Two steps were taken to meet the crisis:
1. A conceptual framework was established for using IRT-based models to establish a common reporting metric and characterize population relationships from sparse matrix-sampling data (i.e., marginal estimation(4)). The reporting metric was a 0-500 scale based on the IRT ability. (See Section 6.1C for further discussion.)
2. A marginal-estimation analysis system was devised to deliver on a majority of the established commitments, using the data in hand. Specifically, the plausible values approach, based on Rubin's (1987) multiple imputation methods for handling missing data, was introduced to implement the concept of marginal estimation. The main idea was to estimate the joint distribution among IRT true scores and student background variables, then produce pseudo-data sets from which these results could be reproduced.
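Once several sets of plausible values exist, a statistic is computed separately on each set and the results are combined. The sketch below applies the standard multiple-imputation combining rules from Rubin (1987); it shows only this final combining step, not how plausible values are drawn. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine M estimates (one per set of plausible values) using Rubin's rules.

    estimates: the statistic computed on each of the M pseudo-data sets.
    sampling_variances: the sampling variance of that statistic in each set.
    """
    m = len(estimates)
    point = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across the M estimates
    total_variance = within + (1 + 1 / m) * between
    return point, total_variance

# Hypothetical subgroup means from five sets of plausible values.
point, total_var = combine_plausible_values([250, 252, 251, 249, 253], [4, 4, 4, 4, 4])
print(point, total_var)
```

The between-set term is what carries the uncertainty about individual proficiencies into the reported standard error; using a single point estimate per student would drop that term.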
The commitment of providing a secondary user data tape merits special note. It had originally been promised to provide users with a NAEP data file including IRT ability estimates as well as background information for each student, from which any analyses could be carried out. This would have been both easy to do and satisfactory for secondary analyses had the anticipated analyses worked as planned. They did not. User tapes containing plausible values were instead produced. If secondary analyses were to include a given variable, that variable had to be included in the construction of the plausible values (i.e., included in the conditioning model) in order to obtain satisfactory results. Were it not for the mission of producing this specific type of data file for secondary users (a tape that could be used more or less as if the original plans had worked), the imputation procedures could have been avoided by using alternative, somewhat simpler, marginal analyses: an appropriate marginal analysis could have been carried out for each given inference without first having to construct plausible values. The tape requirement instead necessitated cleaning, processing, and, when required, collating all student background variables before analyses for initial reports were carried out. It is important to note that this requirement placed on the critical path to initial reports the most problematic background variables (especially untested new or revised self-report items, and those from teacher surveys, which involved complex file matching).
The commitments were largely met, but at the cost of severe and unanticipated mismatches among the analysis, the commitments, the PERT chart, the requisite resources, and the stability of the configuration. The new analysis procedures made it possible to bring together results from booklets of different lengths, difficulties, and compositions, although there were limits to how far even this approach could be pushed. The analysis, however, was more complex, more dependent on models, and unfamiliar (and therefore suspect) even among the educational measurement community.
5.2.4 Trend Analysis
Once the marginal analyses described above had been devised, the IRT model and distributional estimation methods were applied to map the historical stream of pre-1984 reading data into the 0-500 scale. Stable and credible results were attained, which echoed, amplified, and made more comprehensible the cross-year and cross-age results that had been reported in the past in terms of average percents-correct. The procedure worked well despite the widely varying design configurations in past assessments. In retrospect, it was realized that paced administration helped satisfy IRT assumptions by reducing context effects.
This historical trend was necessarily based on historical data, under the content frameworks as they had been developed and operationalized (e.g., audiotape paced administration, definition of age cohorts, booklet designs). These same procedures were used to collect data in 1984 concurrently with the new BIB-spiral administration, and this bridge was used to set the baseline for what was anticipated to be the start of a new trend line based on new procedures. No new trend line ever materialized. Changes in frameworks, item specifications, definitions, and administration conditions were introduced every one or two assessment cycles so that a consistently defined metric could never be established for solid comparisons over time for more than two assessments.
5.3 The 1986 Reading Anomaly
The Reading Anomaly refers to results from the 1986 assessment that showed declines from 1984 levels that were much greater than changes in any four-, five-, or six-year period in the previous history of NAEP. It is not that the changes were large in absolute terms; at Age 17, the most startling, they were only about 3 points in terms of average percents correct. But this was large in relation to the target of inference, namely population changes, because population changes are very small over short time periods. Subsequent investigations (Beaton & Zwick, 1990) showed that the anomaly could be traced to seemingly inconsequential changes in booklet configuration, administration procedures, item context, and post-stratification procedures, each one designed to provide better information, yet which together made it impossible to compare results across assessments. Beaton and Zwick offered the moral, "When measuring change, do not change the measure."
This experience provides the sobering realization of the severe constraints on population definition, booklet design, administration protocol, and analytic procedures that ensue when the mission of NAEP includes tracking change over time. Hundreds of seemingly small facets of the configuration could be tweaked to produce modest improvements in estimation within one time point, but these changes could have a greater impact on the overall results than actual differences over a two-year period in what students know and can do.
5.4 Some Changes Since the Anomaly
5.4.1 The only constant is change
Changes in frameworks, item specifications, the time of year of testing, age definitions, exclusion rules, and so on, have been the rule in NAEP. It has sometimes been possible to jointly administer the assessment under both the previous and new versions of each change and to estimate an adjustment factor that takes into account the average effect of the change. Even when such adjustments appear to have been successful, the average impact need not be representative of the impact on different demographic or curricular groups. Thus, results concerning such variables have an added source of variance that is not accounted for in standard errors of estimation. Higher rates of false positives of significant differences result. And, even when adjustments appear to have been successful, additional time has been added to the critical path for attempting to estimate the effect of the change, to determine whether it can be adjusted for, and possibly to invent a way to make the adjustment. Sometimes the impact of change is large enough to make it impossible to directly compare results from one assessment to the next. In all the years of NAEP assessments since 1984, it has only happened once that results from three successive assessments have been comparable (1994 Mathematics). Typically only two in a row are comparable before revisions sufficiently large to obviate the continuation of a clean trend line are introduced. One in a row is not uncommon.
The trade-off involved with continuous change is a victory of local optimization over global optimization. The intended advantage is improvement each time, insofar as honing data to mission as perceived by content area committees or other stakeholders. The negative effects are that (1) there are almost always some new wrinkles in each assessment, so there are almost always some first-time glitches and slow reporting; and (2) no new trend line has ever gotten started and remained in place long enough to become fast and reliable. An interesting unintended positive effect has been the continuation of so-called long-term trend assessments in Reading, Mathematics, and Science, which still use definitions, booklets, and administration procedures from the 1970s and early 1980s. Procedures for long-term trend have been refined and honed, so analyses needed for what would correspond to the standard report card in NAGB's Themes and Issues can be carried out reliably and quickly, within 3 months after receipt of data. This observation supports our recommendations in Section 7 for ways to design a standard NAEP assessment configuration that can yield initial reports within six months after the completion of testing.
5.4.2 Grade 12 Reporting
The 1984 grade/age combinations included grades 4, 8, and 11. It was noticed that if a given subject area were assessed every four years, NAEP could track and compare cohort effects if grades four years apart were surveyed. The 1986 assessment therefore surveyed grades 3, 7, and 11. Grade 3 proved problematic, as students at this age exhibited considerable difficulties dealing with the assessment. Moreover, interest was expressed by various stakeholders in surveying progress at key transition points in the educational system: Grades 4, 8, and 12. These became the surveyed grades beginning in 1988. Probably the most serious disadvantage in this shift is the lack of motivation among Grade 12 students, as evidenced in field observers' notes, students' self-reports, and omit patterns in data (see Section 6.1G).
5.4.3 State-Level Reporting
The Alexander-James Report opened the door to reporting at the state level in the 1988 assessment. An obvious advantage is increased policy relevance: states are agencies with direct responsibility for education. A disadvantage was the increase in the size of NAEP by almost two orders of magnitude, adding considerable logistic and analytic challenges. Creative ways of addressing these challenges have been advanced, such as administration by local rather than contractor personnel, under contractor-directed training and sampled observation. Comparisons between state assessment samples and a specially selected state assessment comparison subsample of the contractor-administered national NAEP have shown small but statistically significant differences. Another disadvantage in state-level reporting is the increased burden on states, especially small states, for whom practically all schools are involved in NAEP every assessment cycle, and sparsely populated states, for whom logistic difficulties arise with small schools spread across wide geographic areas (see discussions in Section 1H and in the Appendix). Whether states and schools will be sufficiently motivated to maintain this effort over the long run is an open question.
An interesting example of a costly unintended consequence is the breakout of private school results in state NAEP assessments. The original plan was to report only public schools, since they have public responsibility, and it is generally easier to locate and gain cooperation from them than private schools. But states have different proportions and compositions of public/private schooling, so omitting private schools from the sample may bias inferences about how students in a state are faring compared to students in other states. Now a reliable unbiased estimate of students in all public schools in a state can be obtained with a sample of, say, 50 to 100 public schools. A reliable unbiased estimate of all schools in the state, public AND private, can be obtained with perhaps an additional subsample of, say, 10 private schools. However, 10 schools is NOT sufficient for a reliable estimate of all private school students, and obtaining unbiased estimates is difficult because non-Catholic private schools have a high rate of refusing to participate. The advantage of having state-level private-school results is mainly to avoid users simply subtracting the public schools estimate from the all schools estimate to get their own (highly unreliable) private-school estimate. Yet securing enough private schools to ground a reliable private school estimate could, in some states, require as much effort as the public school effort! (Keith Rust discusses this issue in his 5/8/96 memo to Mary Lyn Bourque; see Appendix to this report.)
5.5 Where Are We Now?
NAEP finds itself in much the same situation as it did in the days of the 1984 NAEP redesign competition: A clean design had been introduced a bit more than a decade previously, which responded to the needs of its time, but for which changing desires, technologies, and political milieu appeared to push beyond its limits. Now as then, much has been learned from preceding work upon which to craft configurations better suited to the current situation.
This section discusses the objectives and recommendations from the National Assessment Governing Board's "Themes and Issues" document. The discussion is organized point-by-point, with cross references as appropriate. Further amplification of specific tradeoffs and issues appears here. The results are synthesized into a sketch of a feasible configuration in Section 7.
OBJECTIVE 1: To measure national and state progress toward the third National Education Goal and provide timely, fair, and accurate data about student achievement at the national level, among the states, and in comparison with other nations.
1A. Test all subjects specified by Congress: reading, writing, mathematics, science, history, geography, civics, the arts, foreign language, and economics.
There are many ways to devise schedules that meet the stated goals. Although it is not up to the Design/Feasibility Team to say which subjects or how often, we can say that if the tracking of trends is a primary goal, then there is a need to maintain considerable stability in the framework and assessment design for at least three administrations. This section offers three examples, not as recommendations, but as vehicles to illustrate issues and tradeoffs. It will be noted that all of these examples show a lot more assessment going on than under the current design, with more subjects assessed and more frequent assessment. This will be feasible and affordable only if subjects which are not being assessed comprehensively are kept relatively simple and exhibit minimal changes (the nature of which is discussed further in Section 6.1B). We assume that "comprehensive assessments" coincide with revised subject-area frameworks (Section 6.2A).
Example 1
Table 6.1A-1 is an example with two core subjects, Math and Reading, which are assessed biennially, and eight other subjects, which are assessed two or three times during each ten-year period. Three subjects are assessed each year, although field testing of new items from other subjects can take place in any year as required. Math and Reading are given highest priority in this example since there seems to be no argument from any quarter (educators, policy makers, and parents) that these two subjects are critical for students' success. We suppose that after Math and Reading, the four subjects with next priority are Science, U.S. History, Writing, and Geography. These subjects are grouped into two pairs, one pair of which is assessed between the Reading/Math years, so that each is assessed every four years. Three other subjects appear once every four years, and one subject is tested every five years.
| | Subjects | | |
| Year | 1 | 2 | 3 |
| 1 | Math | Reading | Civics |
| 2 | Science | History | F. Language |
| 3 | Math | Reading | Arts |
| 4 | Writing | Geography | Economics |
| 5 | Math | Reading | Civics |
| 6 | Science | History | F. Language |
| 7 | Math | Reading | Arts |
| 8 | Writing | Geography | Economics |
| 9 | Math | Reading | Civics |
| 10 | Science | History | F. Language |
| 11 | Math | Reading | Arts |
| etc. | | | |
Note: Core subjects in bold
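The rotation in Table 6.1A-1 can be checked mechanically. The following Python sketch transcribes the four-year pattern as it appears in the table and computes how many years separate successive assessments of each subject. The names `ROTATION`, `schedule`, `assessment_years`, and `intervals` are ours, introduced only for illustration.

```python
# Four-year rotation transcribed from Table 6.1A-1 (years 1-4, then repeating).
ROTATION = [
    ("Math", "Reading", "Civics"),
    ("Science", "History", "F. Language"),
    ("Math", "Reading", "Arts"),
    ("Writing", "Geography", "Economics"),
]


def schedule(n_years):
    """Return {year: subjects} for years 1..n_years by repeating the rotation."""
    return {y: ROTATION[(y - 1) % len(ROTATION)] for y in range(1, n_years + 1)}


def assessment_years(sched):
    """Map each subject to the sorted list of years in which it is assessed."""
    years = {}
    for year, subjects in sorted(sched.items()):
        for subject in subjects:
            years.setdefault(subject, []).append(year)
    return years


def intervals(sched):
    """Map each subject to the set of gaps (in years) between its assessments."""
    return {
        subject: {later - earlier for earlier, later in zip(ys, ys[1:])}
        for subject, ys in assessment_years(sched).items()
    }


if __name__ == "__main__":
    for subject, gaps in sorted(intervals(schedule(11)).items()):
        print(subject, sorted(gaps))
```

Run on the eleven years shown in the table, the sketch reports a two-year gap for Math and Reading and a four-year gap for each of the remaining subjects. The same functions could be applied to any alternative rotation by editing `ROTATION`.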
Example 2
Table 6.1A-2 is an example with four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. A ten-year period is illustrated, in which core subjects are assessed every three years and lower-priority subjects every five years. This example assesses two or three of the main subjects each year, with the possibility of augmenting a two-subject year with a special assessment.
| | Type of Assessment | | |
| Year | Comprehensive | Standard | Special/Probe |
| 1 | Math | Writing, History | - |
| 2 | Reading | Arts, Economics | - |
| 3 | Science | F. Language | Possible |
| 4 | Writing | Civics, Math | - |
| 5 | History | Reading | Possible |
| 6 | Geography | Science | Possible |
| 7 | Arts | Math, Writing | - |
| 8 | Economics | Reading | Possible |
| 9 | F. Language | Science, History | - |
| 10 | Math, Civics | Writing | - |
| etc. | | | |
Note: Core subjects in bold
Compared with Example 1, twice as many core subjects are assessed. The tradeoff for more core subjects is assessing them only every three years. Three-year cycles for the core subjects may suffice for timely monitoring of slowly changing trends. Also, if a National Assessment is the basic NAEP to which states can attach themselves, a state interested in only, say, Reading and Math can hold its participation to only one subject every other year. Moreover, a more integrated process (framework development, item development, tryout, final "test," and reporting of results) could be achieved, with additional field-testing and such activities as DIF analyses and achievement level setting occurring in off years. A disadvantage of a three-year pattern for main assessments is that cohorts cannot be tracked.
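The cohort-tracking point can be made concrete with a small Python sketch. It rests on an assumption not stated in this section: that the assessed grades are 4, 8, and 12, so that a cohort reaches the next assessed grade four years after it was last assessed. The function name `cohort_can_be_tracked` is hypothetical.

```python
# Assumption: assessed grades are 4, 8, and 12, i.e. four years apart.
GRADE_GAP_YEARS = 4


def cohort_can_be_tracked(cycle_years, grade_gap=GRADE_GAP_YEARS):
    """Return True if a subject assessed every `cycle_years` years is assessed
    again in exactly the year a cohort reaches the next assessed grade."""
    return grade_gap % cycle_years == 0


if __name__ == "__main__":
    for cycle in (2, 3, 4):
        print(cycle, cohort_can_be_tracked(cycle))
```

Under this assumption, two-year and four-year cycles return to a subject in the year the cohort reaches the next assessed grade, whereas a three-year cycle does not.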
Example 3
Table 6.1A-3 is a variation of Example 2, similar in that there are again four core subjects, Math, Reading, Science, and Writing, and six lower-priority subjects. Also, special assessments can appear periodically. It differs in that (1) there is an eight-year rather than a ten-year pattern, (2) core subjects are assessed biennially, and (3) in order to achieve the foregoing increases in intensity of assessment, four subjects are assessed every year. States could elect to participate in four-year cycles for core subjects in order to reduce their costs and burdens.
| | Type of Assessment | | |
| Year | Comprehensive | Standard | Special/Probe |
| 1 | Math | Science, History, Civics | - |
| 2 | Reading | Writing, Arts | Possible |
| 3 | Science | Math, Geography, Economics | - |
| 4 | Writing | Reading, F. Language | Possible |
| 5 | History, Civics | Math, Science | - |
| 6 | Arts | Reading, Writing | Possible |
| 7 | Geography, Economics | Math, Science | - |
| 8 | F. Language | Reading, Writing | Possible |
| etc. | | | |
Note: Core subjects in bold
1B. Vary the amount of detail in testing and in reporting results.
The notion of decreasing returns plays a role in deciding how comprehensive data-gathering and reporting should be. For the sake of argument, suppose that a typical main assessment under the current configuration supports a thousand inferences, in the way of distributions at achievement levels, comparisons among subgroups, levels of background variables, and associations between background variables and performance. This is a lot to learn the first time it is done, and there will be several "surprises," that is, leads for following up with additional or different kinds of research. The second time the same survey is conducted, however, most of these results will be essentially the same as they were two years before. Changes across time will not be measured precisely enough to detect change in most variables, except for large, well-measured ones. Collecting the same kind of data provides little additional information about the stories behind the surprises. All in all, the cost is about the same as the first time, but the informational value is far less. "Information per dollar" from the same survey continues to decrease over time (Boruch & Terhanian, 1996).
This phenomenon supports the notion of having only occasional comprehensive assessments, with performance and background variables rethought so we can surprise ourselves again. Between these periodic larger efforts, two complementary kinds of assessment can take place: (1) more modest and largely constant standard assessments that report basic results and track major changes reliably and quickly; and (2) targeted assessments that dig more deeply into focused aspects of performance or correlates thereof, but off the critical path to standard reports. Targeted assessments can be costed out and designed separately, but administered jointly with the core assessment.
The decision to vary the intensity of assessments is not really a technical one. A key technical issue, however, comes from the assumption that comparisons across assessments varying in intensity are desired. This suggests the need to have the design provide the means for making such comparisons dependable. What size standard errors are acceptable? This question might be addressed in terms of the magnitude of changes that have been used to make the case that achievement is improving or declining, for example, the size of the changes reported in the long-term trend analyses. There is also the issue of subgroup comparisons. The law says that NAEP should "include information on special groups, including, wherever feasible, information collected, cross-tabulated, analyzed, and reported by sex, race or ethnicity, and socioeconomic status." Thus, group comparisons and changes deemed important in past policy discussions of NAEP results might be used to set targets for standard errors, which in turn determine the sample size (more precisely, the sample sizes at each stage of a multi-stage sampling design, in which the number of PSUs is the dominant factor). We note that cutting back on non-cognitive background variables does not have to be accompanied by cutting back on the achievement items.
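To illustrate how a target standard error translates into a sample size, here is a minimal Python sketch. It uses the textbook approximation SE = SD x sqrt(deff / n) for a mean estimated from a complex sample, where deff is the design effect, and it treats two assessments as independent when computing the smallest detectable change. The function names are ours, and the numbers in the usage example are hypothetical and are not NAEP values.

```python
import math


def required_sample_size(sd, target_se, design_effect=1.0):
    """Students needed so that the standard error of a mean is at most
    `target_se`, using the approximation SE = sd * sqrt(design_effect / n)."""
    return math.ceil(design_effect * (sd / target_se) ** 2)


def minimum_detectable_change(se_first, se_second, z=1.96):
    """Smallest change between two independent assessments that can be
    distinguished from zero at the confidence level implied by `z`."""
    return z * math.sqrt(se_first ** 2 + se_second ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: scale SD of 35, target SE of 1.0, design effect of 2.
    print(required_sample_size(35, 1.0, design_effect=2.0))
    # Hypothetical inputs: both assessments report a standard error of 1.0.
    print(round(minimum_detectable_change(1.0, 1.0), 2))
```

In a multi-stage design the design effect depends heavily on how many PSUs are drawn and how students cluster within them, so a calculation of this kind would in practice be carried out in terms of PSU counts rather than raw student counts.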
Completion Date/Elapsed Time
One desideratum for the new National Assessment is that "results should be timely, with the goal being to release results within 6 months of the completion of testing." What follows is a discussion of a sample NAEP schedule and some ballpark comparisons with other large-scale testing programs conducted by states and commercial publishers.
The KPMG Peat Marwick-Mathtech review of NAEP (1996) describes the timeline for the 1994 NAEP Reading Report. The completion dates and elapsed time for the main activities can be summarized as follows:
| Step | Task(s) | Completion | Months |
| 0 | Testing | 4/01/94 | |
| 1 | Scoring & preliminary weights | 7/30/94 | 4 |
| 2 | DIF item review | 8/31/94 | 1 |
| 3 | Scaling, conditioning, & weighting | 12/8/94 | 3 |
| 4 | Draft report | 3/01/95 | 3 |
| 5 | NCES-NAGB review/revision; final report | 3/07/96 | 12 |
| Total | | | 23 |
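The elapsed times in the table can be recomputed from the completion dates and set against the six-month goal. The Python sketch below does this, reading the two-digit years as 1994 to 1996 and rounding to whole months using an average month length of 365.25 / 12 days. Both conventions, and the function names, are our assumptions.

```python
from datetime import date

GOAL_MONTHS = 6  # release goal quoted above
DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length for rounding

# Completion dates from the 1994 NAEP Reading Report timeline.
MILESTONES = [
    ("Testing", date(1994, 4, 1)),
    ("Scoring & preliminary weights", date(1994, 7, 30)),
    ("DIF item review", date(1994, 8, 31)),
    ("Scaling, conditioning, & weighting", date(1994, 12, 8)),
    ("Draft report", date(1995, 3, 1)),
    ("NCES-NAGB review/revision; final report", date(1996, 3, 7)),
]


def months_between(start, end):
    """Elapsed time between two dates, rounded to the nearest whole month."""
    return round((end - start).days / DAYS_PER_MONTH)


def step_months(milestones):
    """Months taken by each step, measured from the previous completion date."""
    return [months_between(a[1], b[1]) for a, b in zip(milestones, milestones[1:])]


def total_months(milestones):
    """Months from the end of testing to the final milestone."""
    return months_between(milestones[0][1], milestones[-1][1])


if __name__ == "__main__":
    print(step_months(MILESTONES))
    print(total_months(MILESTONES), "months against a goal of", GOAL_MONTHS)
```

Under these rounding conventions the recomputed step durations and the 23-month total agree with the table, which puts the 1994 report 17 months beyond the six-month goal.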