Emory Report
Feb. 15, 1999

Volume 51, No. 20

First person:

Student course evaluations and what research teaches us

In many respects, I consider myself quite fortunate, even lucky. As a faculty member at two universities over the past nine years, I have usually received high student course evaluations. Most students seem to enjoy my courses and provide me with positive feedback on my teaching. Yet something about my good fortune has always puzzled me. Specifically, over the years I have met faculty colleagues who are every bit as conscientious about teaching as I am and who take their jobs as educators very seriously, yet who receive mediocre or low course evaluations. Therein lies a mystery that has perplexed educational psychologists--not to mention the aggrieved teachers themselves--for decades. Therein also lies a dilemma that colleges and universities entrusted with the task of evaluating teaching quality are obliged to confront.

As Emory's newly formed University Advisory Council on Teaching grapples with the question of how to improve the quality of teaching at Emory, the issue of how to evaluate teaching effectiveness will likely assume center stage. My fervent hope is that these discussions, unlike many that have taken place on campuses around the country, will be informed by the scientific literature on the validity of student course evaluations. In my experience, many faculty members, including those intimately involved in tenure and promotion decisions, are largely unaware of this voluminous body of research. This literature is not easy to interpret and raises more questions than it answers. But if teaching is to be evaluated fairly, these questions must be addressed.

So, what have psychologists learned about the validity of student course ratings?

First, researchers have found that when objective indices of student learning are used as criteria, student course evaluations tend to possess at least some degree of validity (see Abrami, d'Apollonia and Cohen, 1990, Journal of Educational Psychology). In other words, course evaluations are positively correlated with how much students have learned in their courses. But the overall picture is murky. Most of these correlations are modest or even weak in magnitude, and a few investigators have even reported sizeable negative correlations between student ratings and objective indices of learning. These conflicting findings should give us pause, because they suggest that course evaluations, although somewhat useful, are not especially robust or dependable indicators of student learning.

Second, student course evaluations are associated with variables that appear to be largely irrelevant to objective indices of student learning. For example, research by Anthony Greenwald and his colleagues demonstrates that student course ratings are positively and substantially correlated with grading leniency (Greenwald, 1997, American Psychologist). Although this finding is open to several alternative explanations, the most plausible--and the one supported by sophisticated statistical models--is that faculty members who give difficult examinations and who grade stringently often produce disgruntled students. Several other largely extraneous factors are associated with course evaluations. Instructors who teach larger classes, required classes and science courses tend to receive lower ratings than instructors who teach small classes, elective courses and humanities courses, respectively (Marsh and Roche, 1997, American Psychologist).

There is also evidence that student evaluations are influenced by the "halo effect"--the tendency of respondents to give overly positive or negative ratings to many items on the basis of one (or a few) strongly liked or disliked characteristics. For example, Greenwald found that students who give low ratings to their teachers also tend to give low ratings to variables that bear little obvious relation to teacher effectiveness, such as the quality of classroom audiovisual aids and even the legibility of the instructor's handwriting. Students tend to paint their course evaluations with a broad brush.

Why do some instructors receive higher ratings than others? In part, it seems likely that students appreciate clear and well-organized lecturers and justifiably reward them with high ratings. It also seems likely, however, that student evaluations are unduly influenced by teachers' personal characteristics, some of which may be irrelevant to objective measures of learning. Robert Rosenthal and his colleagues showed observers very brief (e.g., six-second) snatches of silent videotapes of college teachers, and then asked them to guess these teachers' course evaluations. Remarkably, observers' guesses correlated highly with the teachers' course evaluations, even though observers could not hear a word of what the teachers said.

Presumably, observers detected certain nonverbal behaviors, such as expressiveness and energy level, that tend to make teachers popular. These findings suggest that students may base their course evaluations partly on superficial teacher attributes. Of course, some of these attributes, such as outward enthusiasm, may be conducive to student learning. The problem is that teachers who effectively convey a great deal of information, but who are not especially dynamic, may be penalized relative to their more flamboyant colleagues.

What about alternatives to student course evaluations? A number of researchers, including Herbert Marsh, have found that peer evaluations of teaching, which are used increasingly on campuses across the country, are virtually uncorrelated with objective measures of teaching effectiveness. At the very least, peer evaluations should be interpreted with considerable caution.

So where does this leave us? Although the answers are not entirely clear, the research literature provides some preliminary suggestions for interpreting course evaluations. This literature suggests that course evaluations can be a modestly helpful barometer of teaching effectiveness but that such evaluations probably can be affected by variables that are independent of teaching quality. The latter findings imply that course evaluations should be judged not in isolation but in the explicit context of such potentially confounding variables as grading leniency, class size and subject matter (e.g., science versus humanities). For example, systematic comparisons of course evaluations with evaluations from similar courses (or the same courses taught by other instructors), rather than with all course evaluations across the University, should help minimize variance irrelevant to instructor effectiveness.
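As a rough illustration of the comparison suggested above, the sketch below standardizes each instructor's rating against the ratings from similar courses rather than against the whole university. It is only a minimal sketch under stated assumptions: the function name, the grouping labels and the sample ratings are all hypothetical, and a real analysis would also need to account for factors such as grading leniency.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_groups(records):
    """Return {instructor: z-score} computed within each comparison group.

    `records` is a list of (instructor, group, rating) tuples. The group
    label (e.g. "large required science") is a hypothetical way of marking
    which courses count as similar; it is an assumption of this sketch.
    """
    ratings_by_group = defaultdict(list)
    for _, group, rating in records:
        ratings_by_group[group].append(rating)

    z_scores = {}
    for instructor, group, rating in records:
        group_ratings = ratings_by_group[group]
        spread = pstdev(group_ratings)
        # If every rating in the group is identical, no one stands out.
        if spread == 0:
            z_scores[instructor] = 0.0
        else:
            z_scores[instructor] = (rating - mean(group_ratings)) / spread
    return z_scores


# Hypothetical ratings on a 5-point scale, invented for illustration only.
sample = [
    ("A", "large required science", 3.0),
    ("B", "large required science", 3.4),
    ("C", "large required science", 3.8),
    ("D", "small elective humanities", 4.2),
    ("E", "small elective humanities", 4.4),
    ("F", "small elective humanities", 4.6),
]
scores = standardize_within_groups(sample)
```

In this made-up data, instructor C has a lower raw rating than instructor D, yet C stands above the average for comparable courses while D falls below it. That is the kind of reversal a university-wide comparison would hide.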

The relatively low validities of student course evaluations imply that exclusive reliance on these evaluations is problematic. Emory and some other institutions have moved increasingly toward multidimensional assessments of teaching quality such as teaching "portfolios," which incorporate course evaluations, syllabi, examinations and peer evaluations. Such portfolios probably represent an improvement over course evaluations alone, although even here we must be careful not to place undue emphasis on indicators (e.g., peer evaluations) that possess questionable validity.

Regrettably, many colleges and universities have committed the error of neglecting the research literature and treating student course evaluations as the gold standard or ultimate arbiter of teaching effectiveness. We at Emory can and should do better.

Scott Lilienfeld is assistant professor in the Department of Psychology.
