Interview Scorecards That Actually Predict Performance
Most interview scorecards are bias-laundering machines. Here is how to design ones that meaningfully predict on-the-job performance.
If you have ever sat in a debrief where five interviewers each scored a candidate 4 out of 5 on 'communication' and then disagreed violently about whether to hire them, you have experienced the central problem of unstructured interviewing: the scores are a vocabulary, not a measurement. Fixing this is one of the highest-leverage things a talent team can do, and it does not require a $100k vendor — it requires discipline.
Why most scorecards are theatre
The typical scorecard asks interviewers to rate a candidate from 1 to 5 on six or seven abstract attributes: communication, problem solving, leadership, culture fit, ownership, technical depth, learning agility. Each interviewer interprets each attribute differently. 'Communication' means 'spoke clearly' to one interviewer and 'told a structured story' to another. 'Culture fit' is the most dangerous of all — it is almost always a proxy for 'reminds me of myself'. The scores then get averaged, the average looks objective, and the actual hiring decision is made in the room by whoever speaks loudest.
The fix is not to score harder. It is to score different things.
Anchor every rating in observable behaviour
A good scorecard rating is a description, not a number. Instead of '4/5 — Communication', it reads 'Explained the problem in plain language; checked for understanding twice; asked one clarifying question before answering'. The interviewer's job is to record what happened, not to render a verdict. The verdict gets made in the debrief, with all the evidence on the table.
Concretely, this means every competency on the scorecard needs three things: a one-line definition, two or three behavioural anchors at each level (1, 3, 5), and a prompt that forces the interviewer to write a sentence of evidence before they can submit a rating. Screeq enforces this at the form level — you cannot submit a 5 without typing what you saw. That single constraint changes the quality of the debrief more than any other lever we have measured.
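As a rough illustration of those three requirements, here is a minimal Python sketch. It is not Screeq's implementation. The class names, the example anchors and the MIN_EVIDENCE_WORDS threshold are all assumptions made for the example; in particular, "a sentence of evidence" is approximated here as a minimum word count.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence

# Assumption: "a sentence of evidence" is approximated as a minimum word count.
MIN_EVIDENCE_WORDS = 5


@dataclass(frozen=True)
class Competency:
    name: str
    definition: str                        # one-line definition
    anchors: Mapping[int, Sequence[str]]   # behavioural anchors for levels 1, 3, 5

    def __post_init__(self) -> None:
        # Anchors must exist at exactly the three described levels.
        if set(self.anchors) != {1, 3, 5}:
            raise ValueError("anchors are required at levels 1, 3 and 5")
        for level, examples in self.anchors.items():
            if not 2 <= len(examples) <= 3:
                raise ValueError(f"level {level} needs two or three anchors")


@dataclass(frozen=True)
class Rating:
    competency: Competency
    score: int
    evidence: str   # what the interviewer observed, in their own words

    def __post_init__(self) -> None:
        if self.score not in range(1, 6):
            raise ValueError("score must be an integer from 1 to 5")
        # The rating is rejected unless evidence has been written down first.
        if len(self.evidence.split()) < MIN_EVIDENCE_WORDS:
            raise ValueError("describe what you observed before submitting a rating")


# Hypothetical competency; the anchors are illustrative only.
communication = Competency(
    name="Communication",
    definition="Explains ideas so the listener can act on them.",
    anchors={
        1: ["Answers before understanding the question", "Uses unexplained jargon"],
        3: ["Explains the problem clearly when prompted", "Answers the question asked"],
        5: ["Explains the problem in plain language", "Checks for understanding"],
    },
)

rating = Rating(
    competency=communication,
    score=5,
    evidence="Explained the problem in plain language; checked for understanding twice.",
)
```

Constructing Rating(communication, 5, "") raises a ValueError, which is the form-level constraint in miniature: no evidence, no rating.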
Calibrate, then calibrate again
The most underrated practice in interview design is the calibration session. Pull four interviewers into a room, give them the same recorded interview, ask them to score it independently, then compare. If your inter-rater reliability (for example, an intraclass correlation coefficient across the four sets of scores) is below 0.7, your scorecard is not measuring what you think it is. Adjust the anchors, retrain, repeat. This is not a one-time exercise: calibration drifts as new interviewers join, so rerun it every quarter.
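The article does not say which reliability statistic to use, so the sketch below picks one common choice as an assumption: the two-way random-effects, absolute-agreement, single-rater intraclass correlation, often written ICC(2,1). Rows are the things being rated (here, each competency on the recorded interview), columns are the interviewers. The function name and the scores are hypothetical.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the score that rater j gave to rated item i.
    """
    n = len(scores)       # number of rated items (rows)
    k = len(scores[0])    # number of raters (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with at least 2 items and 2 raters")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items, raters, and residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("scores have no variance; reliability is undefined")
    return (ms_rows - ms_error) / denominator


# Hypothetical calibration session: five competencies scored by four interviewers.
session_scores = [
    [4, 4, 3, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 4],
    [1, 2, 1, 1],
]

reliability = icc_2_1(session_scores)
needs_recalibration = reliability < 0.7   # threshold taken from the text above
```

Absolute agreement is used rather than consistency because a scorecard is meant to produce the same number from different interviewers, not merely the same ranking. An interviewer who is consistently one point harsher than everyone else lowers this statistic, which is the behaviour a calibration session is trying to surface.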
Senior leaders should be in these sessions, not because they are better interviewers (they often are not) but because their visible commitment to the practice is what makes it stick. A calibration session led by the CTO is a culture event, not an HR event.
The role-specific scorecard beats the universal one
There is no such thing as a good universal scorecard. The competencies that predict success for a senior backend engineer are different from those that predict success for an enterprise account executive, which are different again for a customer success lead. The right number of competencies per role is four to six — fewer and you are not covering the role; more and interviewers stop reading them.
Build a competency library, not a single scorecard. Each role draws four to six competencies from the library, with role-specific anchors. New roles inherit from the closest existing role. Two years in, you have a defensible, calibrated, evolving system instead of a PDF nobody reads.
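One way to picture the library is as a shared dictionary of competencies from which each role selects a subset and supplies its own anchors. The sketch below is an assumption about how that could be modelled, not a description of any particular tool. The competency names, definitions, anchors and the make_scorecard and inherit helpers are all hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical library: competency name -> one-line definition.
LIBRARY: Dict[str, str] = {
    "system_design": "Designs services that stay simple as requirements change.",
    "debugging": "Narrows down a fault methodically from symptoms to cause.",
    "code_review": "Gives feedback that improves the code and the author.",
    "ownership": "Follows a problem through to resolution without being chased.",
    "discovery": "Uncovers the customer's real problem before proposing anything.",
    "negotiation": "Reaches agreements that hold up after the call ends.",
}

Anchors = Dict[int, str]          # level (1, 3, 5) -> role-specific anchor
Scorecard = Dict[str, Anchors]    # competency name -> anchors for this role


def make_scorecard(competencies: Scorecard) -> Scorecard:
    """Validate that a role draws four to six competencies from the library."""
    unknown = set(competencies) - set(LIBRARY)
    if unknown:
        raise ValueError(f"not in the library: {sorted(unknown)}")
    if not 4 <= len(competencies) <= 6:
        raise ValueError("a role should draw four to six competencies")
    return dict(competencies)


def inherit(parent: Scorecard,
            drop: Iterable[str] = (),
            add: Optional[Scorecard] = None) -> Scorecard:
    """Start a new role from the closest existing one, then adjust it."""
    dropped = set(drop)
    merged = {name: a for name, a in parent.items() if name not in dropped}
    merged.update(add or {})      # new competencies, or overridden anchors
    return make_scorecard(merged)


def anchors(low: str, mid: str, high: str) -> Anchors:
    return {1: low, 3: mid, 5: high}


backend_engineer = make_scorecard({
    "system_design": anchors("No trade-offs named", "Names one trade-off", "Compares options"),
    "debugging": anchors("Guesses at causes", "Forms a hypothesis", "Tests hypotheses in order"),
    "code_review": anchors("Style comments only", "Finds a real bug", "Explains the fix"),
    "ownership": anchors("Hands off early", "Finishes when asked", "Closes the loop unprompted"),
})

# A new role inherits from the closest existing one and swaps one competency.
solutions_engineer = inherit(
    backend_engineer,
    drop=["code_review"],
    add={"discovery": anchors("Pitches at once", "Asks about needs", "Restates the problem")},
)
```

Keeping definitions in the library and anchors on the role means a definition is written once, while each role can still describe what a 1, 3 or 5 looks like in its own context.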
In closing
The goal of a scorecard is not to make hiring feel scientific. It is to make hiring decisions reviewable, less biased and improvable over time. Done well, it is the single most important artefact in your talent function.