Choosing the right sample size for a given decision - UX Principles for the Government

The most common question in government UX research is also the most often answered wrongly: how many people do we need to talk to?

The wrong answer is “five” — repeated reflexively, because someone read a Nielsen Norman article fifteen years ago. The other wrong answer is “as many as possible,” repeated nervously, because a stakeholder might challenge the findings. Both miss the point. Sample size is not a property of research in the abstract. It is a property of the decision the research has to support.

Different decisions need different sample sizes. The skill is matching them. This post is the practitioner’s heuristic — not a statistics lecture — for getting the match roughly right.

The four-decision framework

Most government UX research supports one of four decision types. Each has its own sample size logic.

Find usability problems in a flow. Five to eight users per segment.

Understand a journey, a context, or a behaviour. Eight to fifteen participants.

Segment a population or define audience groups. Twenty to thirty participants per segment.

Measure something — completion rate, satisfaction, comprehension, error rate. A hundred to several hundred, depending on the precision you need.

That’s the entire framework. The rest of this post is when each applies, why, and the traps that catch teams who get it wrong.

Finding usability problems

The five-user rule comes from a problem-discovery model that Jakob Nielsen and Thomas Landauer published in 1993, from which Nielsen later drew the headline figure: five testers catch roughly 85% of usability problems in a given interface. The mathematics is sound and the rule is, broadly, correct. What it isn’t is universal.

Five users works when you are testing one flow, with one fairly homogeneous user group, for one type of usability problem (mechanical: can they complete the task, do they understand the labels, where do they get stuck). In that narrow case, the sixth user usually surfaces the same issues the first five did, and the cost of recruiting them outweighs the marginal insight.
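The diminishing return can be sketched with the usual discovery formula: if each tester independently notices a given problem with probability p, then n testers notice it with probability 1 - (1 - p)^n. The value p = 0.31 below is the average commonly quoted from Nielsen and Landauer's work and is an assumption here; your own service may have a very different detection rate, and the function name is purely illustrative.

```python
def share_of_problems_found(n_users: int, p_detect: float = 0.31) -> float:
    """Expected share of problems seen at least once by n_users testers.

    Assumes every tester detects every problem independently with the
    same probability p_detect (0.31 is an assumed, commonly quoted average).
    """
    return 1 - (1 - p_detect) ** n_users


# Print the expected share for a few sample sizes to see the curve flatten.
for n in (1, 3, 5, 6, 8):
    print(n, round(share_of_problems_found(n), 2))
```

Under that assumption the curve reaches about 84% at five users and only about 89% at six, which is the arithmetic behind "the sixth user usually surfaces the same issues". If p is lower, because the users or the problems are more varied, the curve rises far more slowly.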

The rule breaks the moment any one of those conditions fails to hold. If your service has to work for distinct user segments (first-time applicants and returning users, digital-confident and digital-hesitant, native speakers and second-language speakers), you need five to eight per segment, not five total. A test with two first-time applicants and three returning users tells you nothing reliable about either group.

It also breaks for complex services with branching paths. Five users on a benefits eligibility checker that has fifteen possible outcomes will exercise five of them at most, and typically only three or four. The rest go untested. Add users until the major paths are covered, or test the paths one at a time.
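A rough way to see the coverage problem is to estimate how many distinct outcomes a group of users is expected to hit. The sketch below assumes each user lands on one outcome, independently, according to a probability list you supply; both example distributions are invented for illustration and are not measurements from any real service.

```python
def expected_outcomes_covered(n_users: int, outcome_probs: list[float]) -> float:
    """Expected number of distinct outcomes reached by at least one user.

    An outcome with probability p is missed by all n_users with
    probability (1 - p) ** n_users, so it is covered with the complement.
    """
    return sum(1 - (1 - p) ** n_users for p in outcome_probs)


# Hypothetical case 1: all fifteen outcomes equally likely.
uniform = [1 / 15] * 15

# Hypothetical case 2: three common outcomes take 75% of users,
# the other twelve share the remaining 25%.
skewed = [0.25] * 3 + [0.25 / 12] * 12

for n in (5, 12, 20):
    print(n,
          round(expected_outcomes_covered(n, uniform), 1),
          round(expected_outcomes_covered(n, skewed), 1))
```

Even in the evenly spread case, five users are expected to reach fewer than five outcomes, and a skewed service does worse. Random recruitment is a slow way to cover rare paths, which is why testing paths deliberately, one at a time, is often the cheaper route.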

The practical rule for usability research in government:

One simple flow, one user type: five to eight users.
Multiple user segments: five to eight per segment.
Complex branching service: enough users to cover the major paths, often twelve to twenty.
Always include at least one user from a hard-to-reach group, even if that means recruiting one more than the framework suggests.

The cost of one extra participant is small. The cost of missing a critical issue because you stopped at five is large.

Understanding journeys, contexts, and behaviour

Journey mapping, ethnographic research, lived-experience research, and most discovery work belongs in this category. The decision you’re supporting is not “is this interface usable” but “what is going on in this person’s life that makes them need this service, and what shape does that need take.”

The sample size for this kind of work is eight to fifteen participants. Below eight, you don’t see enough patterns to distinguish a participant’s idiosyncrasy from a recurring theme. Above fifteen, the marginal participant rarely adds new patterns — they confirm or refine what’s already there.

The composition matters more than the number. Eight participants representing the four hard-to-reach groups you actually need to design for is more useful than fifteen who are convenient to recruit. If you recruit only the easy participants, you don’t get a better sample by recruiting more of them; you get a more confident wrong answer.

A specific pattern worth naming: teams who do twelve interviews and find one extremely vivid story sometimes treat that story as representative. It might be. It might not. The point of twelve interviews is to be able to tell which — by checking whether the vivid story is echoed, at least partially, in others. If it stands alone, it’s a data point, not a finding.

Segmenting a population

When the decision is “how should we define our user segments,” “what archetypes does this audience break into,” or “do these groups differ meaningfully in their behaviour” — you have left exploratory research and entered something more structural. The sample needs to be large enough that segment differences are visible above the noise of individual variation.

The rule of thumb is twenty to thirty participants per segment you expect to surface. If you think there are three meaningful segments, plan for sixty to ninety participants total. This is expensive, and it’s the reason segmentation work tends to combine qualitative research with survey instruments — the survey scales the sample, the interviews provide the depth.
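The budgeting arithmetic is simple enough to keep in a planning script. This is a minimal sketch; the function name is hypothetical, and the default range is the twenty-to-thirty rule of thumb from the paragraph above.

```python
def total_participants(n_segments: int,
                       per_segment: tuple[int, int] = (20, 30)) -> tuple[int, int]:
    """Return the (low, high) total when every segment needs its own sample."""
    low, high = per_segment
    return n_segments * low, n_segments * high


# Segmentation work with three expected segments.
print(total_participants(3))
# The same per-segment logic applied to usability testing (five to eight each).
print(total_participants(2, (5, 8)))
```

The point of writing it down is that the multiplier is the number of segments, not the number of studies: adding a fourth expected segment adds another twenty to thirty participants to the plan.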

The trap here is over-claiming from too small a sample. A team that does twelve interviews and produces four personas has not done segmentation. They have done good qualitative research and then drawn lines through it that the data doesn’t support. The personas may turn out to be right, but the research didn’t prove them.

If you need segmentation findings but don’t have segmentation budget, the honest answer is to call them working hypotheses rather than segments, and label the next research phase as testing whether they hold.

Measuring something

Measurement research — quantifying completion rates, satisfaction scores, comprehension, error rates, time on task — has very different sample size logic from anything qualitative. Now you’re trying to produce a number with a known margin of error, and the size of the margin is what sets the sample.

For most government UX measurement purposes, the practitioner’s rules of thumb are:

A hundred participants gives you roughly a ±10 percentage point margin of error around a proportion, at 95% confidence. Enough to say “about half” with confidence; not enough to distinguish 45% from 55%.
Four hundred participants gives you roughly ±5 points. Enough to detect meaningful differences between groups, provided each group is itself sampled at around this size.
A thousand gives you roughly ±3 points. Enough for headline statistics in a published report.

These are approximations. The actual numbers depend on what you’re measuring, how variable the population is, and how the sampling was done. But they’re roughly right for most government use cases, and they’re a useful sanity check on claims like “the completion rate is 67%, based on eight participants” — which is a sentence that should never appear in any document.
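The rules of thumb above can be checked with the standard normal-approximation formula for a proportion. The sketch below assumes a simple random sample, 95% confidence (z = 1.96), and the worst case p = 0.5; the function names are illustrative. The approximation is unreliable at very small n, so the eight-participant figure is shown only to indicate scale.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


def sample_size_for_margin(margin: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose approximate margin of error is at or below `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Margins for the rule-of-thumb sample sizes, plus the eight-participant case.
for n in (8, 100, 400, 1000):
    print(n, f"+/-{margin_of_error(n) * 100:.1f} points")

# Working backwards from the precision you need.
print(sample_size_for_margin(0.05))
```

With eight participants the margin comes out at roughly 35 points either side, which is why a single completion-rate figure from a sample that size carries almost no information about the population.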

The other trap is the implicit measurement claim hiding inside qualitative research. When a usability report says “users found the form confusing,” it’s making a frequency claim. If it’s based on three of five users finding it confusing, the claim is true of the sample but is not strong evidence about the population. Be honest about what the sample supports, and what it doesn’t.
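To put a number on how weak "three of five" is as population evidence, one option is a Wilson score interval, which behaves better than the plain normal approximation at small n. This is a sketch that treats the five participants as a random sample, which recruited usability participants rarely are, so the real uncertainty is if anything wider.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for 95% confidence)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Three of five participants found the form confusing.
low, high = wilson_interval(3, 5)
print(f"{low:.0%} to {high:.0%}")
```

The interval runs from roughly 23% to 88%: the data is compatible with the form confusing a minority of users or nearly all of them. That is a good reason to report the count rather than a generalisation.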

What changes the answer

Three factors shift the right sample size away from the rules of thumb above.

Population homogeneity. If your users are very similar to each other — same demographics, same context, same task — the smaller end of each range is fine. If they’re highly varied, you need the upper end, plus deliberate segment-by-segment recruitment.

Stakes of the decision. A research finding that supports a £200,000 UI change can rest on a smaller sample than one supporting a £20 million policy decision. The cost of being wrong should inform the cost of being thorough. If the decision affects vulnerable populations, lean toward more.

Existing knowledge. Research that confirms or extends an existing strong evidence base can rest on a smaller sample than research that has to stand alone. Five users to validate a redesign of an already-researched form is reasonable. Five users to characterise a population the team has never spoken to before is not.

A few additional practical disciplines

Recruit until you stop learning, not until you hit a number. The sample sizes above are planning numbers — they help you budget and book participants. But the actual decision of when to stop should be based on whether each new participant is still adding new insight. If you’re hearing the same things from participant six that you heard from participants one through five, stop. If you’re still hearing new things at participant fifteen, keep going. The number you planned for is a guide, not a target.
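One way to make "stop when you stop learning" checkable is to log the themes each session raises and watch for a run of sessions that add nothing new. The sketch below is an illustration, not an established standard: the three-session threshold and the theme labels are assumptions, and deciding what counts as a new theme remains a judgement call for the researcher.

```python
def participants_until_saturation(themes_per_participant, quiet_run=3):
    """Return the 1-based participant number at which to consider stopping.

    Stops after `quiet_run` consecutive participants add no unseen theme.
    Returns None if that never happens in the sessions logged so far.
    """
    seen = set()
    run = 0  # consecutive participants with no new themes
    for number, themes in enumerate(themes_per_participant, start=1):
        new_themes = set(themes) - seen
        seen |= new_themes
        run = 0 if new_themes else run + 1
        if run >= quiet_run:
            return number
    return None


# Hypothetical theme codes from five sessions.
sessions = [
    {"login", "jargon"},
    {"jargon", "upload"},
    {"upload"},
    {"login"},
    {"jargon"},
]
print(participants_until_saturation(sessions))
```

In this made-up log the last new theme appears with participant two, so the rule suggests stopping after participant five. A stricter team might raise quiet_run, or reset the count whenever a participant from a new segment joins.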

Don’t trade segment representation for sample size. Eight participants with deliberate hard-to-reach representation will tell you more than fifteen participants who happened to be available. Plan the composition first, then the count.

Be honest about what the sample supports. The most common credibility failure in government UX research is overclaiming — extending findings from a sample to a population without acknowledging the gap. The fix is not to inflate sample sizes; it’s to write more carefully. “Three of five participants struggled with the eligibility question” is honest. “Users struggled with the eligibility question” is a stronger claim than the data supports.

Pre-register your sample logic where you can. Write down before the research begins what sample size you plan, why, what decisions it will support, and what it won’t. This protects you against the most common stakeholder pressure — the demand for “more data” when the findings are inconvenient. If the sample plan was agreed in advance, the findings stand on their own.

The one sentence to take away

Sample size is the answer to a question about a decision, not a property of research in the abstract. Decide the decision first. The sample size follows. If someone asks “how many people do we need to talk to” and you don’t yet know what the research has to support, the honest answer is “I’ll tell you when we’ve agreed what we’re trying to decide.”

That conversation, held early, prevents most of the sample-size arguments that happen later.
