How to Improve Student Testing with Crowdsourcing and Data

In a recent post, I wrote about why A/B testing versus holistic UX design is a false choice.

One approach measures, then optimizes against an overly simplistic set of variables. The other generally eschews measurement, opting instead to blindly trust the holistic vision of a (no doubt brilliant) product designer. It’s a false choice because one can instead have close to the best of both worlds: use principles of holistic design, then measure the crap out of that design.

The current debate about school assessment has a very similar structure.

Teachers unions and education schools lead one side of the debate. Their rallying cry is essentially “we are the experts at teaching students, so there’s no need to waste time with silly standardized tests.” This mindset is justifiable for the very best teachers, who (like great product designers) have good instincts and teach well; it’s completely ludicrous for the worst teachers, who need either data that can help them improve or a new profession. Moreover, it results in a ridiculous system where pay (and layoffs) are based on seniority rather than performance. Think about the three best teachers you had and the three worst. If the three best were also the three most senior, and the three worst were the most junior, I’ll shut up. But I’m guessing that’s not the case, which means that — because of the anti-testing, anti-accountability mentality — your best teachers weren’t getting what they deserved.

On the other side are advocates of accountability and metrics, who want to rid the world of unsuccessful teachers and schools. Without measuring student performance, they argue, it’s impossible to know which students, teachers, and schools are performing well and which are not. To my mind, this is all good. But there’s one hitch: the tool of choice for accountability advocates is crappy standardized testing. In their current form, standardized tests are time-consuming for teachers and students, don’t measure what most teachers want to measure, and are oversimplified in format (usually multiple choice). Moreover, they usually matter immensely for schools but don’t affect individual students’ outcomes at all.

As with the parallel testing-vs-design question, there’s a third way that takes advantage of the best of both worlds.

Virtually all teachers use tests and essays to assess their students, and it’s nearly always the case that all students in a class get the same questions or assignments. So there’s little debate as to whether testing that’s standardized at the class level is a good tool for gauging student progress: it’s widely used to grade students on their progress.

That raises a question: why can’t one use the same system both for evaluating students’ performance in a class and for evaluating their overall progress relative to the state or country?

If it’s 1950, the answer is that there’s no fast way to transmit questions and answers to test between teachers in different schools. So if there’s only one US History teacher in a high school, and you want to coordinate evaluation across schools, you have to mail questions and answers between schools and everything is slow and messy.

Good news! It’s not 1950, it’s 2012. So data transfer isn’t an issue.

But how do you give teachers the right to choose how their students will be evaluated, while also tracking at the district, state, or federal level?

It’s a little bit nuanced, so here goes:

1) Group teachers based on the subject they’re teaching and the way they want to evaluate students (e.g., long-form answers, short text answers, multiple choice, or all of the above).

2) For any given subject — very narrowly defined (adding fractions with small denominators; Kennedy-era Civil Rights legislation attempts; basic properties of neutrons; To Kill a Mockingbird first three chapters) — allow teachers to add questions they use for their tests.

3) For new questions, allow upvotes and downvotes from other teachers teaching the same subject, so the system can tell the difference between good questions and bad ones. (One simple way to turn votes into a ranking is sketched just after this list.)

4) When it comes time to give students a test, the teacher picks the subject(s) on which s/he will test students. Immediately before the test is given, the teacher is shown a pool of questions that may appear on the test, with the opportunity to filter out questions that are a poor fit for the class. Filtering should be based not on difficulty (which a mature system should handle satisfactorily) but on the subject matter covered.

5) The actual test questions will be taken from the pool the teacher saw, and given to students. Ideally, on a computer, but that’s not a dealbreaker. This is the straightforward part!

6) The students’ responses will be evaluated by other teachers in the same group (from #1), not by the students’ own teacher. For calibration and sanity-checking, a few students’ responses might be graded by their own teacher in addition to someone else. The grading system used could be similar to existing ones — possibly grading students on a curve relative to others in the class — or it could be relative to other students state- or nationwide.

7) This then allows anyone — the teacher, the school, the district, the state — to gauge student performance relative to certain norms. There are a lot of checks to put in place to make sure teachers are grading comparably: if one teacher gives a set of 20 answers an average of a C, and another teacher gives the same set an average of a B+, either the system or the teacher will need to adjust (the second sketch below shows one simple adjustment). But once those checks are in place, it’s relatively straightforward to combine question difficulty and teacher scoring.
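To make step 3 concrete, here’s a minimal sketch of vote-based ranking. Rather than sorting by the raw upvote fraction, it uses the lower bound of the Wilson score interval, so a question with a handful of votes can’t outrank a well-vetted one. The question IDs and vote counts are hypothetical, invented purely for illustration:

```python
import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the fraction of upvotes.
    Penalizes small samples: 8 of 10 upvotes ranks below 80 of 100, because
    we're less sure about the former."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (center - margin) / denom

# Hypothetical question pool: (question_id, upvotes, downvotes)
pool = [("fractions-q1", 8, 2), ("fractions-q2", 80, 20), ("fractions-q3", 3, 0)]
ranked = sorted(pool, key=lambda q: wilson_lower_bound(q[1], q[2]), reverse=True)
for qid, up, down in ranked:
    print(qid, round(wilson_lower_bound(up, down), 3))
# fractions-q2 ranks first: the same approval rate as q1, but far more evidence.
```

This is one option among several: a plain average, a Bayesian prior, or a flagging threshold would fit the same slot in the system.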
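And for the cross-grader checks in steps 6 and 7, here’s an equally rough sketch of one possible adjustment: estimate each grader’s bias as their average deviation from the consensus score on double-graded answers, then subtract it. The grader names, scores, and 0–4 scale are all hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical peer-grading records on a 0-4 scale: (grader, answer_id, score).
# These answers were graded twice, per the calibration idea in step 6.
grades = [
    ("ms_jones", "a1", 2.0), ("mr_smith", "a1", 3.3),
    ("ms_jones", "a2", 2.3), ("mr_smith", "a2", 3.7),
    ("ms_jones", "a3", 1.7), ("mr_smith", "a3", 3.0),
]

# Consensus score for each answer: the mean of all scores it received.
by_answer = defaultdict(list)
for grader, answer, score in grades:
    by_answer[answer].append(score)
consensus = {a: mean(s) for a, s in by_answer.items()}

# A grader's bias: their average deviation from consensus.
by_grader = defaultdict(list)
for grader, answer, score in grades:
    by_grader[grader].append(score - consensus[answer])
bias = {g: mean(d) for g, d in by_grader.items()}

# Adjusted score: raw score minus the grader's bias.
for grader, answer, score in grades:
    print(grader, answer, round(score - bias[grader], 2))
# ms_jones's bias comes out around -0.67 (a harsh grader) and mr_smith's
# around +0.67 (a lenient one); after adjustment, both land near consensus.
```

A real system would want more than an additive offset (graders differ in spread as well as level, and overlaps are sparse), but the shape of the check is the same.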

Such a system would have a number of benefits. It would crowdsource questions, which in the long run would both save teachers time (less time creating tests) and lead to better tests.

It would provide better, faster feedback to teachers and administrators, telling them almost immediately how well their students are performing relative to norms and where there are gaps in knowledge and understanding.

And it would eliminate the painful process of standardized testing that exists today, while providing a quantitative framework that teachers could easily buy into.

Mike Greenfield founded Bonafide, Circle of Moms, and Team Rankings, led LinkedIn's analytics team, and built much of PayPal's early fraud detection technology. Ping him at [first_name] at mikegreenfield.com.