Teaching, testing, student achievement, and teacher evaluation

Incoherent thoughts brought up via this posting:

Tisch Commentary

I will use quotation marks repeatedly in this small essay, but they are
certainly not intended to be scare quotes. I want to keep the
phrase “teaching to the test” as a single, unitary entity. Quotation
marks seemed to be the easiest way to do this. And I am very lazy.

I’m sure everyone’s heard of the concept of “teaching to the test”, and
most probably think it’s a bad thing. I’m here to point out one logical
issue with this, and try to back it up with a few technical points about
the psychometric construction and evaluation of standardized tests.

Caveat emptor: I am not currently, nor have I ever been, employed
directly in the construction or evaluation of standardized
educational tests. I have worked in a psychological lab that did
considerable work on these, and I have worked with other medium-to-large
scale testing programs. I am familiar with the technical side of both
test construction and evaluation, but I have no knowledge of any
particular testing program.

Now, the first issue is basically a logical one. People often believe
that “teaching to the test” is bad, and this seems to rest on the
implicit assumption that the test is somehow, some way, deficient. If
that assumption is true, then the conclusion that teaching to this test
is bad likely follows.

Let’s take a step back, and think about what the phrase “teaching to the
test” means. In the design of instructional systems, that is, say,
prepping an academic course or a vocational training program, or
what-have-you, it is important to have something called learning
objectives: those things the learner should know or be able to do when
the program or course is finished. These are the key “take-aways”
(scare quotes around corporate jargon intentional). These objectives
form the blueprint for the course, so it’s best to be explicit in
defining them; then, both the student and the instructor will know,
consciously, what is expected. The blueprint can, as a bonus, be used
as the test blueprint. That is, it specifies what information, skills,
and other material are fair game for the test.
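
To make that a bit more concrete, here is a toy sketch in Python of
what a blueprint boils down to; the objectives and item counts are
invented purely for illustration:

```python
# A toy test blueprint: a mapping from learning objectives to the number
# of exam items planned for each. Objectives and counts are invented.
blueprint = {
    "Define adverse impact and compute the adverse impact ratio": 3,
    "Describe common job analysis methods": 4,
    "Evaluate designs for assessing training programs": 5,
}

total_items = sum(blueprint.values())
for objective, n_items in blueprint.items():
    share = n_items / total_items
    print(f"{objective}: {n_items} items ({share:.0%} of the exam)")
```

The point is simply that the same structure that guides the course also
tells everyone what is fair game on the exam.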

For instance, in my Intro to Industrial and Organizational Psychology
course, each class is introduced with the learning objectives for that
lesson, with lecture, discussion, and exercises (hopefully) designed to
impart the knowledge and skills. Some objectives are better than others,
and some lectures, demos, and exercises are better than others, too. In
the end, though, those objectives serve as the blueprint for the exams.
It is then easier for me to write an exam, and easier for my students to
study; they know what they need to know. I am, in fact, “teaching to the
test”; that test has, however, been designed. Now, it’s been
designed by me, so your mileage may vary on whether it’s good enough,
but it covers everything I think an undergraduate introductory IO
psychology student should know.

Other forms of academic testing carry the same issues. A test, if
designed to capture whether students possess the information and skills
that a particular curriculum is intended to impart, should be built on
the same blueprint (according to the same learning objectives) as the
curriculum itself. If there are no explicit objectives, designing the
test can serve to specify them. If this process is done consciously,
then the act of “teaching to the test” should not be a problem. That is,
the content should be no narrower than it would be in any other designed
curriculum.

Of course, all of the above can be well or poorly done. Learning
objectives vary in scope, scale, and specificity. For instance, “Know
the definition of adverse impact and be able to calculate the adverse
impact ratio” is a very narrow, specific learning objective that
expressly defines a set of exam questions. On the other hand,
“Understand issues of experimental design involved in evaluating
training programs” is a much broader learning objective. Exam questions
can be written for this objective, but they are far less clear-cut than
those for the previous example.
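
As a concrete illustration of what that narrow objective asks for, here
is a minimal sketch in Python of the adverse impact ratio calculation;
the hiring numbers are made up, and the 4/5ths (0.80) threshold is the
usual rule of thumb rather than anything tied to a particular exam.

```python
# Adverse impact ratio: the selection rate of the focal group divided by
# the selection rate of the reference group (the group with the highest
# rate). The numbers below are made up purely for illustration.

def adverse_impact_ratio(hired_focal, applicants_focal,
                         hired_reference, applicants_reference):
    focal_rate = hired_focal / applicants_focal
    reference_rate = hired_reference / applicants_reference
    return focal_rate / reference_rate

# Hypothetical data: 20 of 100 focal-group applicants hired,
# 40 of 120 reference-group applicants hired.
ratio = adverse_impact_ratio(20, 100, 40, 120)
print(f"Adverse impact ratio: {ratio:.2f}")  # 0.60

# Under the common 4/5ths rule of thumb, a ratio below 0.80 flags
# potential adverse impact.
if ratio < 0.80:
    print("Ratio falls below the 4/5ths threshold.")
```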

Learning objectives could even be irrelevant for the topic or area at
hand. So, in general, these objectives should come from some sort of
needs analysis, to identify gaps between needs and skills. For instance,
the AICPA, the body that develops and administers the CPA exam,
conducts regular practice studies to determine what practicing
accountants actually do, and works to update the exam accordingly.

Now, of course, you may believe that none of this work can create a test
that legitimately covers the body of knowledge that individuals need.
You may even be right, though this seems doubtful unless the claim is
taken tautologically: every test of finite length will be content
deficient, because we cannot cover everything in limited space.
However, that does not mean the test is useless; it may be very good if
it covers a broad enough body of knowledge and skill, or if it covers
those things necessary for meeting some threshold. Regardless of
this, if we prefer to have explicit metrics, is it not better to have
ones that are based on extensive work than on implicit standards or
guesswork? Or worse, the prejudices of those in power?

Also, there’s often criticism of the use of multiple-choice questions
(MCQs) in these exams: no one has to answer MCQs in real life, so these
questions are said to be a poor gauge of whether someone really
possesses the knowledge and skills of the curriculum versus being good
at guessing. It is true that there is some element of “test-wiseness”,
the ability to make good choices about which answer is most likely to
be correct even when you do not know the correct answer. However, these
tests, though lacking in what we in the business call “face validity”
(they don’t look like the real-world demonstration of knowledge and
skills), seem to predict real-world outcomes as well as any
psychological measure (often better).

Furthermore, when we try to assess more differentiated skills, using
things like essay-type questions or other “constructed-response”
questions, these tend to correlate extremely highly with the MCQs
(often even “loading on a common factor”, to be overly technical).
Another example: when Nathan Brody analyzed tests of Sternberg’s
triarchic intelligence, all of the tests, including those of academic
intelligence (like these tests) and those of practical intelligence
(like street smarts), lumped together. I don’t say this to make a
definitive statement of the “there is only one intelligence” variety,
merely to demonstrate that it is extremely difficult to construct exams
that actually measure something other than the kind of academic skills
and abilities the standardized tests do.
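
For that overly technical aside, here is a small simulated sketch in
Python (hypothetical scores, not data from any real testing program) of
how two section scores driven by one latent ability end up highly
correlated, with the first component carrying most of the variance:

```python
# Simulate MCQ and constructed-response scores that both depend on a
# single latent ability, then inspect their correlation and how much
# variance the first principal component of the correlation matrix
# accounts for. All data are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

ability = rng.normal(size=n)                       # latent ability
mcq = 0.8 * ability + 0.6 * rng.normal(size=n)     # MCQ section score
essay = 0.8 * ability + 0.6 * rng.normal(size=n)   # constructed-response score

corr = np.corrcoef(np.column_stack([mcq, essay]), rowvar=False)
print("MCQ-essay correlation:", round(corr[0, 1], 2))   # around 0.64

# If one factor dominates, the first eigenvalue of the correlation
# matrix accounts for most of the total variance.
eigvals = np.linalg.eigvalsh(corr)[::-1]
print("Share of variance in first component:",
      round(eigvals[0] / eigvals.sum(), 2))              # around 0.8
```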

Now, I make no claims about how well the specific tests work in
practice, nor how well they’ve been designed and evaluated. I think it
is also important to point out that even if the construction and
evaluation were flawless, the testing program could easily be misused
or abused for political, ideological, or economic reasons. I certainly
have no knowledge of whether this is the case in the State of New York,
or any other state. This sort of abuse, however, turns testing into a
bad metric, in that the testing program no longer serves its
purpose, but a different one. That is worth keeping in mind when
discussing these issues. There are both technical and political
challenges to dealing with how to evaluate students and their teachers,
but those two components are separable, and should be treated
distinctly.
