Causal Thinking in UX Research with Small Sample Sizes

Five users won’t give you statistical power. They can still give you the right answer – if you reason like a detective, not a pollster.

1. Each participant is a case study, not a data point. With n=8, you don’t have a thin slice of a survey sample. You have eight detailed natural experiments. Treat them that way. What did P3 do that P4 didn’t? Why? The richness per participant is the asset; flattening them into frequencies throws it away. Example: Five users tried the new onboarding. Three completed it; two abandoned. The “60% completion rate” tells you almost nothing. The fact that both abandoners hit the same screen, paused for 40 seconds, and asked the same question – that tells you exactly what’s wrong.

2. Within-person beats between-person, almost every time. Same user, two designs – you’ve controlled for everything that makes people different. Different users, one design each, and personality, mood, expertise, and context are all loose in the room. The price of a within-person comparison is order effects: whichever design comes second benefits from practice or suffers from fatigue. Counterbalance the order and you’ve built a tiny but real experiment. Example: “Five users preferred the new design” is weak if each saw only one. “Five users completed the task faster on the new design than the old, and four said it felt easier” is much stronger – same brain, two conditions.
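One way to counterbalance is simply to alternate which design each participant sees first. The sketch below is a minimal illustration, not a prescribed method; the participant IDs, the condition labels "old" and "new", and the function name `counterbalance` are all made up for the example.

```python
from itertools import cycle

def counterbalance(participants, conditions=("old", "new")):
    """Alternate the starting design from one participant to the next.

    Hypothetical helper: odd-numbered participants start with the first
    condition, even-numbered participants start with the second.
    """
    orders = [tuple(conditions), tuple(reversed(conditions))]
    return {p: order for p, order in zip(participants, cycle(orders))}

plan = counterbalance(["P1", "P2", "P3", "P4", "P5"])
for participant, order in plan.items():
    print(participant, "->", " then ".join(order))
```

With an odd number of participants the split is uneven (here three start on the old design and two on the new), so it is worth noting in the write-up which order each person saw.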

3. You’re not observing users. You’re observing users-being-interviewed-by-you. The moderator is part of the causal chain. A warm moderator gets agreement; a probing one gets criticism; a junior one gets politeness. The same script produces different findings depending on who runs it. Pretending the moderator is a neutral instrument is the first lie of qualitative research. Example: A team runs the same usability study with two researchers. Researcher A reports “users found it intuitive.” Researcher B reports “users were confused but didn’t want to seem rude.” Both observed the same behaviors. The moderation style was the variable.

4. Recruitment is the study. Whatever you can learn is bounded by who you recruited. Recruit current users and ask “is this product easy to learn?” – you’ve already excluded everyone who found it too hard to keep using. Recruit “people interested in productivity software” and you’ve selected for a specific relationship with the category. Example: A B2B SaaS team interviews 12 customers and concludes the product is well-loved. The 200 trial users who never converted are invisible. They’re the ones who could actually answer the question being asked.

5. The task is the intervention. You’re studying the task you assigned at least as much as the product. “Buy a pair of running shoes” and “browse for inspiration” produce different cognitive modes, different attention patterns, different judgments. Change the task wording and the findings change. The task framing is causally upstream of everything you observe. Example: A checkout flow tested with “complete this purchase using the test credit card” looks frictionless. The same flow with real money and real hesitation looks completely different. The stakes weren’t in the design – they were in the task.

6. Order contaminates everything. Task 2 is never independent of Task 1. The user is now warmed up, fatigued, biased toward the pattern they just learned, or actively comparing. If you didn’t counterbalance, you can’t separate “design effect” from “second-task effect.” Example: All five users struggled with the search feature, which they encountered after a smooth onboarding. Was search hard, or were they just losing patience? Without rotation, you can’t tell.
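A simple rotation gives each task a turn in each position across participants. This is a sketch under assumed task names, and `rotated_orders` is a hypothetical helper. A plain rotation balances position only: it does not balance which task comes immediately before which, so a fuller design would use a balanced Latin square or randomized orders.

```python
def rotated_orders(tasks, n_participants):
    """Return one task order per participant by cyclically shifting the list.

    Participant 0 gets the original order, participant 1 starts one task
    later, and so on, wrapping around after len(tasks) participants.
    """
    orders = []
    for p in range(n_participants):
        shift = p % len(tasks)
        orders.append(tasks[shift:] + tasks[:shift])
    return orders

# Task names are illustrative only.
orders = rotated_orders(["onboarding", "search", "checkout"], 5)
for i, order in enumerate(orders, start=1):
    print(f"P{i}:", ", ".join(order))
```

If users still struggle with search when it comes first, fatigue is a much less plausible explanation.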

7. What people say and what people do are different systems. Stated preference and revealed behavior come out of different cognitive processes. Self-report passes through self-image, social pressure, and post-hoc rationalization. Behavior doesn’t. When they disagree, behavior usually wins – but the gap itself is the finding. Example: Users say they want more customization options. Analytics show 4% of users ever change a default. Both are true. The causal story is that customization makes the product feel powerful at evaluation time, not that anyone uses it.

8. Saturation tells you about the typical, not the true. The “five users find 85% of issues” rule holds under one assumption: that every issue has roughly the same chance, about one in three, of showing up in any single session. Real issues don’t behave that way. Five users surface the common issues – which are exactly the ones you’d find anyway. The rare-but-severe issues, the ones that hit one user in fifty and tank your NPS, live in the tail. Saturation is a stopping rule for breadth, not depth. Example: Five rounds of usability testing on a banking app surface dozens of minor issues, none catastrophic. A wider beta finds a rare flow where users accidentally transfer to the wrong account. Rounds of n=5 were never likely to catch it.
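The arithmetic behind the rule is short enough to check. If an issue affects a fraction p of users, the chance that at least one of n independent participants hits it is 1 - (1 - p)^n. The sketch below assumes independent sessions and uses the commonly cited per-user rate of 0.31 for a "typical" issue; the function name is made up for the example.

```python
def discovery_probability(p, n):
    """Chance that at least one of n independent users encounters an issue
    that each user hits with probability p."""
    return 1 - (1 - p) ** n

common = discovery_probability(0.31, 5)   # assumed typical per-user rate
rare = discovery_probability(0.02, 5)     # the one-user-in-fifty issue

print(f"common issue, 5 users: {common:.0%}")
print(f"rare issue, 5 users: {rare:.0%}")
```

Under these assumptions the common issue is found about 84% of the time and the one-in-fifty issue about 10% of the time, which is why the tail needs a different method and not just another round of five.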

9. Novelty isn’t preference. Show a user the current design and a new one; the new one usually wins. Some of that is design quality and some is just newness. Two months in, the novelty dissolves and the real preference shows up. If you can’t run longitudinally, at least separate “exciting” from “better” in your questioning. Example: A redesign tests beautifully in concept reviews. Six months after launch, support tickets are up and power users are angry. The thing being measured in the test wasn’t preference – it was contrast.

10. The mechanism is the finding, not the count. With n=5 you cannot say “30% of users had this problem.” You can say: “Here is the mechanism by which this problem occurs, and we observed it twice.” The mechanism is what transfers to other users, other contexts, other versions of the design. The percentage doesn’t. Example: Two users out of seven tried to drag an item that wasn’t draggable. The number 2/7 means nothing. The mechanism – “the visual styling matched other draggable elements” – is a specific, fixable, generalizable claim.

11. Look for the user who breaks the pattern. With small n, the disconfirming case is worth more than the confirming ones. Four users sailed through; one struggled badly. Don’t average them – investigate the outlier. The exception is where the causal model lives. Example: Six users found the feature easily. The seventh, a screen-reader user, couldn’t find it at all. The “86% success rate” framing buries the actual finding: there’s a specific population for whom the design fails completely.

12. Distinguish “this design has a problem” from “any new design would have this problem.” Users will struggle with anything unfamiliar for the first few minutes. Some of what you’re calling “usability issues” is just “this is the first time they’ve seen it.” A placebo design – a plausible alternative shown the same way – tells you which problems are about this design specifically. Example: Users hesitate at the new navigation. Is it confusing, or is it just new? Show a different new navigation to a comparable group. If they hesitate too, your finding is about novelty, not your design.

Five users won’t give you statistical power. They can still give you the right answer – if you reason like a detective, not a pollster.
