The most advanced AI systems on the market struggle with the everyday arithmetic of Britain’s benefits system, according to new research from the Alan Turing Institute that found error rates exceeding 30% when leading models were asked to calculate Universal Credit entitlements. The findings, published this week, arrive at a delicate moment: the Department for Work and Pensions (DWP) is actively piloting AI tools to help claimants navigate the system, raising fresh concern that flawed software could quietly miscalculate the incomes of some of the country’s most vulnerable people.
The benchmark, dubbed WelfareBench, tested OpenAI’s GPT-5, Anthropic’s Claude and Google’s Gemini against hundreds of real-world Universal Credit scenarios drawn from anonymised casework and DWP guidance. While the models handled simple eligibility questions competently, their performance collapsed on the system’s more intricate mechanics — particularly the tapered deductions that reduce a claimant’s award as their earnings rise.
Where the models broke down
Universal Credit applies a 55% taper rate, meaning that for every pound of net earnings above a claimant’s work allowance, their benefit is reduced by 55 pence. Claimants who do not qualify for a work allowance see the taper applied from the first pound they earn. The calculation sounds simple, but it interacts with work allowances, the benefit cap, childcare costs, housing elements and deductions for advances or debts in ways that quickly become labyrinthine.
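To make the basic taper arithmetic concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the DWP's calculation: the function name, the rounding rule and the figures in the usage note are hypothetical, and it ignores the benefit cap, childcare costs, housing elements and other deductions.

```python
from decimal import Decimal, ROUND_HALF_UP

TAPER_RATE = Decimal("0.55")  # 55% taper, as described in the article


def tapered_award(max_award, net_earnings, work_allowance):
    """Return the award left after the earnings taper (illustrative only)."""
    max_award = Decimal(str(max_award))
    net_earnings = Decimal(str(net_earnings))
    work_allowance = Decimal(str(work_allowance))

    # Only net earnings above the work allowance are tapered.
    excess = max(net_earnings - work_allowance, Decimal("0"))

    # Assumption: round the reduction to the nearest penny.
    reduction = (excess * TAPER_RATE).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )

    # The award cannot fall below zero.
    return max(max_award - reduction, Decimal("0"))
```

With a hypothetical maximum award of £1,000, net earnings of £900 and a work allowance of £400, `tapered_award(1000, 900, 400)` gives 725.00: the £500 of earnings above the allowance reduces the award by £275.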
It was here that the frontier models faltered most. Across the taper-heavy scenarios, the Turing researchers recorded error rates above 30%, with some models confidently producing entitlement figures that were tens of pounds adrift per month — enough to leave a household short on rent or, conversely, overpaid and later pursued for repayment.
“These models are extraordinarily fluent, and that fluency is precisely the danger,” said Dr Priya Anand, the computational social scientist who led the study. “They explain their reasoning in plausible, authoritative prose and then arrive at the wrong number. A claimant has no way of knowing the answer is wrong, and neither, often, does an overstretched caseworker.”
The team also found the models were inconsistent: asking the same question twice could yield different figures, and small rewordings of a scenario sometimes flipped an answer from correct to incorrect. That brittleness, the researchers argue, is incompatible with a domain where accuracy is non-negotiable.
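One simple way to quantify that kind of inconsistency is to ask the same scenario several times and measure how often the answers agree. The sketch below assumes the answers have already been collected as whole-pence figures; the function name and the report fields are hypothetical and are not taken from WelfareBench.

```python
from collections import Counter


def consistency_report(answers_pence):
    """Summarise how often repeated runs of one scenario agree.

    answers_pence: entitlement figures in whole pence, one per run.
    """
    if not answers_pence:
        raise ValueError("need at least one answer")

    counts = Counter(answers_pence)
    # The most frequently returned figure and how many runs produced it.
    modal_answer, modal_count = counts.most_common(1)[0]

    return {
        "modal_answer": modal_answer,
        # Share of runs that returned the most common figure.
        "agreement": modal_count / len(answers_pence),
        # Gap between the highest and lowest figure, in pence.
        "spread": max(answers_pence) - min(answers_pence),
    }
```

For four hypothetical runs returning `[72500, 72500, 69750, 72500]`, the report shows 75% agreement and a spread of 2,750 pence, or £27.50 between the highest and lowest answer. Agreement alone does not show that the modal answer is correct, only that the model is stable.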
Why benefits maths is uniquely hard for AI
Large language models are trained to predict plausible text, not to execute precise multi-step arithmetic under strict rules. The UK benefits system is, by design, a thicket of conditional logic — exactly the kind of structured, rule-bound reasoning where probabilistic text generation tends to slip.
The report notes that performance improved when models were given a calculator tool or asked to show their working step by step, but errors persisted even then, often because the model misapplied a rule rather than miscomputed a sum. The recurring failures fell into four groups:
- Taper interactions: models frequently applied the 55% taper to gross rather than net earnings, or forgot the work allowance entirely.
- Benefit cap: several scenarios saw the cap ignored or applied incorrectly.
- Deductions: repayments for advance payments and debts were routinely omitted.
- Edge cases: self-employment, fluctuating income and shared-care arrangements produced the highest failure rates.
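The first of these failure modes can be shown with a few lines of Python. This is a sketch with hypothetical figures, working in whole pence and rounding down (an assumption made to keep the arithmetic exact); it covers only the taper errors, not the benefit cap or deductions.

```python
TAPER_PERCENT = 55  # 55% taper, as described in the article


def taper_reduction(earnings_pence, work_allowance_pence):
    """Reduction in pence for earnings above the work allowance."""
    excess = max(earnings_pence - work_allowance_pence, 0)
    # Assumption: integer pence, rounded down.
    return excess * TAPER_PERCENT // 100


# Hypothetical monthly figures, in pence.
gross = 120_000      # £1,200 gross earnings
net = 100_000        # £1,000 net earnings
allowance = 40_000   # £400 work allowance

correct = taper_reduction(net, allowance)         # taper on net earnings
on_gross = taper_reduction(gross, allowance)      # error: taper on gross
no_allowance = taper_reduction(net, 0)            # error: allowance forgotten

# How far each mistake over-deducts, in pence.
gross_error = on_gross - correct
allowance_error = no_allowance - correct
```

Here the correct reduction is £330. Tapering gross earnings gives £440, and forgetting the work allowance gives £550, so the two mistakes would understate this hypothetical award by £110 and £220 a month respectively.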
A warning shot for DWP’s AI ambitions
The timing is pointed. The DWP has confirmed it is trialling AI assistants intended to help claimants understand their entitlements and reduce pressure on phone lines and jobcentres. Ministers have framed generative AI as a route to a more responsive, efficient state.
The Turing findings suggest such deployments need firm guardrails. Welfare campaigners warn that even small, intermittent errors could have outsized consequences for households living close to the margin.
“An overpayment isn’t a glitch you shrug off — it’s a debt the department will claw back months later, often without warning,” said Marcus Lowe, a policy analyst at the fictional welfare think tank the Sutton Centre for Social Policy. “If a chatbot tells someone they’re entitled to more than they are, the system punishes the claimant, not the software.”
A DWP spokesperson, responding in general terms, stressed that any AI tools in its pilots are designed to signpost guidance rather than issue binding entitlement decisions, and that official calculations remain the responsibility of departmental systems and trained staff. The Turing team welcomed that distinction but cautioned that the line between “signposting” and “advice” blurs rapidly in practice.
What the researchers want next
The Institute is calling for mandatory, domain-specific benchmarking before any public-facing benefits AI is deployed, alongside clear disclaimers, human review of outputs and transparency about error rates. It has released WelfareBench publicly so that developers and government bodies can test systems against the same scenarios.
“We’re not saying AI has no place here,” Dr Anand added. “We’re saying you cannot deploy a tool that is wrong a third of the time and call it help.”
What this means
The study lands as a sobering check on the rush to wire generative AI into public services. Frontier models are impressive generalists, but the WelfareBench results show that fluency is not the same as accuracy — and that the gap is most dangerous precisely where the stakes are highest. For the DWP and other departments eyeing AI rollouts, the message is clear: rigorous, transparent testing against real-world rules must come before deployment, not after. Until error rates fall dramatically, the safest place for these tools in the benefits system is alongside human expertise, not in front of vulnerable claimants making decisions about how to feed their families.