by Dr. John Poulsen and Kurtis Hewson
Standardized testing is demonized in some circles as the vilest form of assessment. Its critics point to many problems with how these tests are created and administered, as well as how the results are used. In other circles, standardized testing represents true assessment, whereby individual performances can be compared to other performances in a meaningful manner. That is, some see standardized testing as a fair form of comparison; others do not. Knowing where standardized testing came from and what motivated its growth may help in understanding the results of standardized tests and perhaps in using them to improve teaching and learning. This article serves as an overview of the history and current realities of standardized testing.
Considering the role standardized testing has acquired in education systems internationally, one can safely assume that a vast majority of Canadians have experienced these tests as students. More and more students’ lives are being influenced by standardized testing, as a societal push for educational accountability has led to a dramatic increase in the use of these assessments across districts and nations (Guskey & Jung, 2013). Their value is much debated by educators, academics, and politicians, but what is clear is that their use seems to be increasing rather than decreasing. Experiencing standardized tests as students can provide a useful perspective; however, it is important that faculty and students have a general understanding of the history of standardized or high-stakes testing, as well as a basic overview of how these assessments are built.
This article will explore the history of standardized testing, recent developments within standardized testing, creation of test questions, and applicability.
Stiggins (2008) states that
these once-a-year tests are not likely to be of much value to classroom teachers as you plan and carry out day-to-day instruction. They are assessments OF learning that are too infrequent, broad in focus, and slow in returning results to inform the ongoing array of daily decisions. But this does not mean that these tests are without purpose or value. They can communicate valuable information about students’ achievement status to other decision makers (pp. 347-348).
This relatively rational statement could be considered a definition of the battle lines that have been drawn up between those who are proponents of standardized tests and those against them.
The intent of standardized testing is to have large numbers of students write a single test and then to compare any single score against all the others to see where that individual sits within the large sample. The results are then plotted on a bell curve that shows where a score falls relative to the average and spread of all scores. Standardized tests are given to large groups numbering at least in the thousands, sometimes millions. To make the results as valid as possible, the administration of the assessment is “standardized”; the tests are:
- written at the same time on the same day by all students,
- administered with consistent instructions,
- given the same time allotment for every student, and
- scored in the same manner.
Bubble sheets for multiple-choice questions are commonly machine-scored, for example with Scantron. Essays are marked by specialists who have been trained to mark in a similar fashion.
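The comparison of one score against a large sample, described above, can be illustrated with a short sketch. This is only a minimal illustration, not how any particular testing agency reports results: the function names and the ten raw scores are hypothetical, and the last function assumes the scores follow an ideal bell (normal) curve, which real score distributions only approximate.

```python
from statistics import NormalDist, mean, pstdev


def z_score(score, all_scores):
    """Number of standard deviations a score sits above or below the group mean."""
    mu = mean(all_scores)
    sigma = pstdev(all_scores)  # population standard deviation of the group
    return (score - mu) / sigma


def percentile_rank(score, all_scores):
    """Percentage of scores in the group that fall strictly below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


def normal_percentile(z):
    """Percentile implied by a z-score IF scores followed an ideal bell curve."""
    return 100.0 * NormalDist().cdf(z)


# Hypothetical raw scores, invented purely for illustration.
sample_scores = [52, 58, 61, 64, 67, 70, 70, 73, 76, 79]

student_score = 76
z = z_score(student_score, sample_scores)
print(f"z-score: {z:.2f}")
print(f"percentile rank in this sample: {percentile_rank(student_score, sample_scores):.0f}")
print(f"percentile under a normal-curve assumption: {normal_percentile(z):.0f}")
```

With a real test the sample would number in the thousands or more; the arithmetic is the same, and the two percentile figures would be expected to agree more closely as the sample grows and if the scores are close to normally distributed.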
Burke (1999) maintains that traditionally “standardized” meant that the test is standard, or the same, in three ways: (a) format/questions, (b) instructions, and (c) time allotment. Format/questions means that the test questions are the same for all students writing the exam. Students are asked to show what they know in the same format, which is usually multiple choice. Multiple choice is the format of choice because, as Stiggins (2008) suggests, “It is relatively easy to develop, administer, and score in large numbers” (p. 354). Further, for the test to be fair, in the sense that all students have the same chance to answer each question correctly, all questions must be the same.
The instructions are to be the same as well. These are to be delivered in the same way to all students so that no students are advantaged or disadvantaged. The last standardization is time allotment. All students are to be given the same amount of time to finish the exam.
However, the standardization of standardized exams is being eroded. A common change allows students with certain learning needs to have more than the allotted amount of time to complete the exam. These students are often allowed to write in different rooms as well.
The second requirement of standardized tests is also frequently adapted. Students with reading problems can get “readers” to read the questions to them. The rationale is that the curriculum asks that students know certain information, and the purpose of the exam is to determine whether students know this information, not whether they can read. These readers may, however, adapt the standardized instructions that the students receive. Also, having the questions read aloud may give these students an advantage or disadvantage that other students do not have. Therefore, the second and third requirements of standardized testing are no longer strongly in effect.
There are forms of standardized testing other than multiple-choice questions, for example, essay writing. This form of testing currently has the disadvantage of needing markers to assess the essays. Essay markers must be trained to gain a sense of what the standards are. Then they must engage in the time-consuming activity of reading the essays. Even with training, assessors can give significantly different grades to the same essay.
Proponents of standardized testing point to large-scale uses of the tests that go beyond the individual student or even the school. Standardized testing allows comparison between provincial education systems or even national education systems. Advocates say that standardized tests are impartial and rational. They state that standardized tests are an inexpensive way to check that schools and teachers are accountable, and that students, and therefore the public, are getting the education that public dollars are paying for. Standardized tests by this measure are intended to examine the whole education system, and therefore individual scores may not be as significant.
The history of standardized testing is underpinned by noble sentiments. Testing can be found in all cultures. Evaluating the understanding of someone learning a new skill is common for all societies. Standardized testing as we know it today began in earnest in China as a form of aptitude testing, trying to ascertain who would be best at a particular job. Fletcher (2009) states that, “The earliest record of standardized testing comes from China, where hopefuls for government jobs had to fill out examinations testing their knowledge of Confucian philosophy and poetry.” These exams started in about 100 CE but were firmly established during the Sui Dynasty in 605 CE. They attempted to predict aptitude by discerning the best candidates for the Chinese civil service.
The most recent impetus for standardized testing was the Industrial Revolution and the movement toward increased schooling, in which students were moved out of the work force and into schools. One of the easiest and arguably the cheapest ways to test large numbers of those children was with a standardized exam.
Alfred Binet (1857-1911) and Theodore Simon (1872-1961) developed what is now commonly known as an IQ test, beginning in the late 1800s and culminating with the Binet-Simon scale in 1905. These intelligence tests were created in response to the French government wanting to develop special education classes for students who were not benefiting from the newly instituted regular compulsory education program. The tests tried to identify students who needed focused instruction in order to get the most from their education. These standardized tests were an attempt to streamline education so that society would gain maximum benefit from each citizen, a noble sentiment.
The test contained problems arranged in order of difficulty in a range of subjects but had as the basis items assessing comprehension, reasoning, and judgment (Reynolds, Livingston, & Willson, 2009). Louis Terman (1877-1956), who was teaching at the time at Stanford University, noted the success of these exams and their potential applicability in America. He spearheaded the creation of the Stanford-Binet Test which remains, in its fifth iteration, the most popular IQ testing vehicle in existence.
Fletcher (2009) suggests that “… by World War I, standardized testing was standard practice: aptitude quizzes called Army Mental Tests were conducted to assign U.S. servicemen jobs during the war effort.” Robert Yerkes was one of the academics assigned to test the servicemen and then suggest appropriate placement. This testing of servicemen helped build up a record of statistical evidence for IQ testing. Carl Brigham worked with Yerkes in the testing of servicemen. After the war he published a book, A Study of American Intelligence, based on the results from World War I. From these findings and analysis he created the Scholastic Aptitude Test (SAT) in 1926. Its intention was to screen college applicants to ensure that worthy candidates were allowed admission. The test became immediately popular, and by 1945 it had become a standard method of college and university entrance, again a noble enterprise.
Everett Lindquist invented the American College Test (ACT) in 1959 as a competitor to the SAT. In 2011, more than 3.3 million individuals wrote SAT and ACT exams. The ACT is considered more of a test of accumulated knowledge, while the SAT is thought to test logic. Other important standardized exams are the Medical College Admission Test (MCAT) and the Graduate Management Admission Test (GMAT).
These standardized tests that attempt to predict success or aptitude seem to be successful. Reynolds, Livingston, and Willson (2009) state, “As a general rule, research has shown with considerable consistency that contemporary intelligence tests are good predictors of academic success” (p. 334). Fishman and Pasanella (1960) reviewed SAT predictive validity in the 1950s, finding that the median correlation between student first-year success and the SAT score was a significant 0.61. Recently Kobrin, Patterson, Shaw, Mattern, and Barbuti (2008) found a correlation of 0.29, a respectable correlation between SAT scores and First Year Grade Point Average (FYGPA).
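For readers unfamiliar with the statistic, the correlations reported above are Pearson correlation coefficients, which range from -1 to 1. The sketch below shows how such a coefficient is computed. It is a minimal illustration only: the function name is hypothetical, and the five pairs of test scores and first-year grade point averages are invented for the example; they are not data from Fishman and Pasanella (1960) or Kobrin et al. (2008).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical admission-test scores and first-year GPAs, invented for illustration.
test_scores = [1000, 1100, 1200, 1300, 1400]
first_year_gpa = [2.8, 2.5, 3.4, 3.0, 3.6]

r = pearson_r(test_scores, first_year_gpa)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```

One common way to read such a coefficient is to square it: under a simple linear model, r squared is the share of variation in one variable associated with the other. By that reading, a correlation of 0.61 corresponds to about 37% of the variation in first-year results, and a correlation of 0.29 to about 8%.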
In Alberta, standardized testing began in the 1960s. McEwen (1995) suggests that Alberta’s introduction of achievement testing for Grades 3, 6, and 9 was done in response to a worldwide wave of educational reform that wanted more accountability in education. At the Grade 12 level, diploma exams were reinstated in 1984 after being removed for a few years. McEwen clarifies the reason for the achievement tests:
Public education is funded by taxpayers who want and have a right to know if they are getting value for their investment. Such accountability requires public information. An indicator system is a tool to focus reform and to improve accountability by providing better information about the education system’s performance. The goals, or intended benefits, of implementing indicator systems are to assess the effectiveness and efficiency of the educational enterprise, to improve education, and to provide a mechanism for accountability (p. 28).
Pros and Cons of Standardized Testing
The primary conundrums in standardized testing of achievement lie in the validity and applicability of the test results. Validity relates to how accurately the test results actually reflect the students’ knowledge about the subject. Standardized tests use a minimum number of questions, so getting even one or two wrong for environmental reasons will affect an individual student’s results. The factors that affect whether a student gets a question right or wrong may be infinite, but they could be organized into (a) situational/environmental confounding factors, (b) personal/emotional factors, and (c) the grade-spread requirement in standardized testing.
Even though standardized testing attempts to minimize confounding variables by requiring students to write in similar situations, some students may be writing in situations that are significantly different from those of other students. For example, the room might be too bright or too dark, or too cold or too hot. Such testing conditions may cause students to perform poorly: students might miss questions not because they do not know the material but for something as simple as poor lighting in the testing centre that caused headaches, or a testing room that was too cold for certain students to focus.
Students who are poor test takers because of nerves associated with tests may not be able to show what they can accomplish in the high-stakes atmosphere of standardized testing. Their anxiety, not whether they know the material, becomes the determining factor in how well they do on the test. Even students who are normally good test takers can have a skewed result; for example, a student who had an emotional moment just before the test might not be able to focus and might receive a result that is not reflective of his or her capabilities.
Perhaps the primary concern with achievement standardized testing is that testing should be based on curricular outcomes that are mandated by the provincial or state governing bodies. Standardized tests have to make a one-size-fits-all test that will not fit all because as Popham (1999) says, “… standardized achievement tests will invariably contain a number of items that are not aligned with what’s emphasized in a particular setting” (p. 331). A 1983 study of alignment between textbook content and the standardized test found that, “In no case was even 50 percent of a test’s content satisfactorily addressed in any textbook” (Popham, p. 331). That is, there was a poor correlation between what was in the test and in the textbooks that were a prime resource to prepare students for the test.
Test creators seek a score spread in their questions. They seek questions that are not answered correctly by too many students. Questions that are answered correctly by more than 60% of the students are usually removed from the test. Popham indicates this is a problem because “… items on which students perform well often cover the content that, because of its importance, teachers stress” (p. 332). So the important material that is required by the curriculum is often not tested.
How questions are determined to be most worthy for standardized testing is important. When deciding which questions to use, test creators, in essence, try to find questions that only the top 50% of the students will get right. These types of questions are popular in standardized testing because they support the common theory of testing whereby the highest achieving students answer the questions correctly. So, standardized tests can be self-affirming. Students who are in the top 50% of the class answered the questions correctly because they are in the top 50% of the class.
Further, if a concept is taught to all students in a class and all students answer the question correctly, that question will not be used in the future, as it does not spread the students’ scores so that fine-grained norm-referenced numbers can be associated with each student. That is, if all students did well on the test, there would be no bell curve and no associated indication of where each student sits on the curve. Put more simply, there have to be questions that are answered correctly by only about 50% of the students in order for comparisons to be made.
A student’s socio-economic status is highly correlated with standardized achievement test scores. This is probably due to the tests being skewed to reflect learning that children gain at home. Again there is a curriculum and testing mismatch. For example, if a question asks about a “field of work” such as law or medicine, students whose parents are in such professions may understand the concept from conversations at home. However, students whose parents work in the service industry or at the local grocery store may not. Answering the question correctly may not be a function of what was learned at school but rather of what has been learned out of school. Opponents of standardized achievement testing suggest that it is not fair to test student achievement on material that is not in the curriculum.
What instructors or textbooks focus on may not be reflected in the test. The requirement for a score spread in the exams means that questions that are answered correctly by a majority of students will probably be removed because they do not discriminate enough.
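The item-selection logic described in the preceding paragraphs can be sketched in a few lines. This is an illustration under stated assumptions, not any test publisher's actual procedure: the function names and the small table of responses are hypothetical, the upper cutoff of 0.60 follows the figure given above, and the lower cutoff of 0.40 is an assumption added only to make the example concrete.

```python
def item_difficulty(responses):
    """Proportion of students who answered each item correctly.

    `responses` holds one row per student and one column per item,
    with 1 for a correct answer and 0 for an incorrect one.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students for i in range(n_items)]


def items_kept_for_spread(responses, low=0.40, high=0.60):
    """Indices of items whose proportion correct falls inside the chosen band.

    The band is an illustrative assumption: `high` mirrors the 60% figure in
    the text, while `low` is hypothetical.
    """
    difficulties = item_difficulty(responses)
    return [i for i, p in enumerate(difficulties) if low <= p <= high]


# Hypothetical results for six students on four items, invented for illustration.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]

print(item_difficulty(responses))
print(items_kept_for_spread(responses))
```

In this invented example every student answers the first item correctly, so it is dropped even though it may cover exactly the content that was taught most thoroughly; only the item answered correctly by half of the students survives the filter.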
The history of standardized testing suggests that the impetus for large-scale testing has been based on noble aspirations, primarily that of having the right person in the right place, whether that place is the correct job in the military or the correct form of education. Standardized testing has value in today’s society. Aptitude testing for admission into colleges and universities seems to be especially effective as quantitative research has established links between such testing and later success at post-secondary institutions.
Achievement testing has issues, especially related to situational/environmental factors, personal/emotional factors, and the grade-spread requirement, that may make applicability difficult to ascertain. That is, standardized testing may be best at determining aptitude or future ability in an individual and also good at examining a school district’s effectiveness. Standardized tests seem to be weaker at correctly indicating how much a specific student has learned.
Alberta Assessment Consortium (2012). A new look at public assurance: Imagining the possibilities for Alberta students. Retrieved from http://www.aac.ab.ca/a-new-look-at-public-assurance-imagining-the-possibilities-for-alberta-students.html
Alberta Education (1997). Teaching Quality Standard applicable to the provision of basic education in Alberta. (Ministerial order #016/97). Retrieved from http://education.alberta.ca/media/6734948/teaching%20quality%20standard%20-%20english.pdf
Bew, Lord P. (2011). Independent review of key stage 2 testing, assessment and accountability, final report, as written for the Department of Education. Retrieved from https://www.education.gov.uk/publications/standard/publicationDetail/Page1/DFE-00068-2011
Boardman, A. G., & Woodruff, A. L. (2004). Teacher change and “high stakes” assessment: What happens to professional development. Teaching & Teacher Education, 20(6), 545-557.
Booi, L., & Couture, J. C. (2011). Testing, testing. What Alberta can learn from Finland about standardization and the role of the teacher. Alberta Views, 7, 28-32.
Brookhart, S. M. (2001). The “Standards” and classroom assessment research. Paper presented at the annual meeting of the American Association of Colleges for Teacher Education, Dallas, TX. (ERIC Document Reproduction Service No. ED451189).
Burke, K. (1999). The mindful school: How to assess authentic learning (3rd ed.). Arlington Heights, IL: Skylight Publishing.
Fishman, J. A., & Pasanella, A. K. (1960). College admission selection studies. Review of Educational Research, 30(4), 298-310.
Fletcher, D. (2009, December 11). Standardized testing. Time. Retrieved from http://www.time.com/time/nation/article/0,8599,1947019,00.html
Franklin, C. A., & Snow-Gerono, J. L. (2007). Perceptions of teaching in an environment of standardized testing: Voices from the field. The Researcher, 21(1), 2-21.
Gordon, S. P., & Reese, M. (1997). High-stakes testing: Worth the price? Journal of School Leadership, 7, 345-368.
Gronlund, N., & Waugh, C. (2009). Assessment of student achievement (9th ed.). Upper Saddle River, NJ: Pearson.
Guskey, T. R., & Jung, L. A. (2013). Answers to essential questions about standards, assessments, grading, & reporting. Thousand Oaks, CA: Corwin.
Kobrin, J., Patterson, B., Shaw, E., Mattern, K., & Barbuti, S. (2008). Validity of SAT for predicting first year college grade point average (Report No. 2008-5). New York, NY: College Board. Retrieved from http://professionals.collegeboard.com/profdownload/Validity_of_the_SAT_for_Predicting_First_Year_College_Grade_Point_Average.pdf
McEwen, N. (1995). Accountability in education in Canada. Canadian Journal of Education, 20, 1-17.
Pedulla, J. P. (2003). State-mandated testing – What do teachers think? Educational Leadership, 61(3), 42-46.
Popham, J. (2002). Classroom assessment: What teachers need to know (3rd ed.). Boston, MA: Allyn and Bacon.
Popham, W. J. (1999). Why standardized tests don’t measure educational quality. Educational Leadership, 56(6), 8-15.
Reynolds, C., Livingston, R., & Willson, V. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson.
Stiggins, R. J. (1999). Are you assessment literate? The High School Journal, 6(5), 20-23.
Stiggins, R. J. (2008). An introduction to student-involved assessment for learning (5th ed.). Columbus, OH: Pearson Merrill Prentice Hall.