Code suggested by AI assistants such as GitHub Copilot and its rivals introduces subtle security flaws in roughly one in five commits — and that code is more likely to clear human review than equivalent human-written changes, according to a peer-reviewed study published this week. The findings, drawn from an analysis of millions of public GitHub commits, suggest that the productivity gains firms are eagerly tallying may be quietly accruing a long-term security debt that few organisations are tracking.
The research, led by a team at the fictional Centre for Software Integrity Research (CSIR) and presented at a software engineering conference this month, examined commit histories across thousands of repositories where AI assistance could be reliably identified. The authors say their headline figure — 19.7% of AI-influenced commits containing at least one detectable vulnerability — should give pause to engineering leaders treating AI tooling as an unalloyed win.
What the study measured
The CSIR team combined static analysis, manual auditing of a stratified sample, and metadata signals to flag commits likely to have originated from AI suggestions. They then compared vulnerability rates, review outcomes and time-to-merge against a control set of human-authored changes.
Three findings stood out. First, AI-influenced commits carried a measurably higher rate of security weaknesses — chiefly improper input validation, insecure default configurations, and mishandled authentication logic. Second, those commits were approved during code review more readily, with fewer review comments and shorter discussion threads. Third, the flaws skewed towards the subtle: not glaring errors, but plausible-looking code that behaves correctly in the common case.
“The dangerous pattern isn’t the obviously broken suggestion — reviewers catch those,” said Dr Amara Okonkwo, lead author of the study. “It’s the code that reads cleanly, compiles, passes the happy-path tests, and contains a flaw that only surfaces under adversarial conditions. Reviewers extend a kind of unearned trust to fluent code.”
That last point — what the authors call “fluency bias” — may be the most consequential. AI assistants produce idiomatic, confidently structured code that mirrors the surrounding style. The researchers argue this surface polish lowers a reviewer’s guard, shifting attention away from the security-relevant edge cases that demand the most scrutiny.
Why the flaws slip through
The study identifies several reinforcing mechanisms. AI models are trained on vast corpora of public code that itself contains insecure patterns, so the assistants reproduce common mistakes with uncommon confidence. Suggestions also arrive in-flow, at the moment a developer is focused on functionality rather than threat modelling, encouraging acceptance with minimal interrogation.
The reviewers fare little better. Because AI output tends to be syntactically clean and consistent, it triggers fewer of the stylistic red flags that normally prompt a closer look.
The recurring weaknesses fall into four groups:
- Input validation gaps: assistants frequently omitted sanitisation for inputs they assumed were trusted.
- Insecure defaults: generated configuration favoured convenience over hardening.
- Authentication and authorisation slips: logic that worked but failed to enforce boundaries consistently.
- Outdated cryptographic patterns: echoes of deprecated practices embedded in training data.
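To make the first category concrete, here is a minimal Python sketch of the kind of gap described above. It is not drawn from the study's data: the upload directory and both function names are hypothetical, and the code assumes Python 3.9 or later. The first function handles an ordinary filename correctly yet lets a crafted one escape the intended directory; the second resolves the path and rejects anything that lands outside it.

```python
from pathlib import Path

# Hypothetical upload root, used only for illustration.
BASE_DIR = Path("/srv/uploads")


def naive_upload_path(filename: str) -> Path:
    """Fluent, plausible, and fine for 'report.txt'.

    A name such as '../../etc/passwd' walks out of BASE_DIR,
    because nothing checks where the joined path ends up.
    """
    return BASE_DIR / filename


def checked_upload_path(filename: str) -> Path:
    """Resolve the joined path and refuse anything outside BASE_DIR."""
    base = BASE_DIR.resolve()
    candidate = (BASE_DIR / filename).resolve()
    # is_relative_to() is available from Python 3.9.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes upload directory: {filename!r}")
    return candidate
```

Both versions return the same result for everyday input, which is why a happy-path test and a quick read of the diff would not separate them.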
“We’ve effectively automated the production of plausible code without automating the production of secure code,” said Tom Reedley, a software supply-chain analyst at the fictional consultancy Northbridge Advisory. “The two are not the same thing, and a lot of dashboards are conflating them.”
The productivity-versus-security blind spot
The findings land at an awkward moment for an industry busily quantifying the returns on AI tooling. Many firms now track metrics such as pull requests merged, lines of code generated and developer-reported time savings — figures that flatter the case for wider rollout. Almost none, the researchers note, track the corresponding security trajectory.
“There is a measurement asymmetry,” Dr Okonkwo said. “Productivity is immediate and easy to count. Security debt is deferred and diffuse. You don’t see it until an incident forces you to, and by then it’s spread across thousands of commits.”
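One way to narrow that asymmetry is to report a vulnerability rate next to the throughput figures. The sketch below is an assumption-laden illustration rather than the study's method: it supposes each commit record already carries two hypothetical fields, an `ai_influenced` flag and a `has_finding` flag set by a scanner or audit, and simply computes the share of flagged commits for each origin.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CommitRecord:
    sha: str
    ai_influenced: bool  # hypothetical flag, e.g. from tooling metadata
    has_finding: bool    # True if a scan or audit flagged a vulnerability


def vulnerability_rates(commits: Iterable[CommitRecord]) -> Dict[str, float]:
    """Return the fraction of commits with a finding, keyed by origin."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for commit in commits:
        origin = "ai" if commit.ai_influenced else "human"
        totals[origin] += 1
        if commit.has_finding:
            flagged[origin] += 1
    # Only origins that appear in the input are reported,
    # so there is no division by zero.
    return {origin: flagged[origin] / totals[origin] for origin in totals}
```

Plotted over time beside pull requests merged, a figure like this would make the deferred cost visible before an incident does.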
For UK organisations, the timing is particularly pointed given tightening expectations around software supply-chain assurance and incoming regulatory pressure on critical sectors. A vulnerability introduced quietly today may surface as a compliance failure — or a breach — months down the line.
The authors stop short of advising against AI assistants. Instead, they recommend treating AI-influenced commits as a higher-risk category warranting dedicated security review, mandatory static and dynamic analysis, and review processes explicitly designed to counter fluency bias. Some teams in the study that flagged AI-origin code for extra scrutiny saw vulnerability rates fall sharply.
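As a rough sketch of what such a policy might look like in practice, the Python below maps a commit's origin to the checks it must clear before merge. The check names and the `ai_influenced` flag are hypothetical, chosen to mirror the recommendations above; the study does not specify an implementation.

```python
from typing import List, Set, Tuple

# Hypothetical check names mirroring the recommendations in the text.
BASELINE_CHECKS: Tuple[str, ...] = ("peer_review",)
AI_EXTRA_CHECKS: Tuple[str, ...] = (
    "static_analysis",
    "dynamic_analysis",
    "security_review",
)


def required_checks(ai_influenced: bool) -> Tuple[str, ...]:
    """AI-influenced commits get the baseline plus the extra checks."""
    if ai_influenced:
        return BASELINE_CHECKS + AI_EXTRA_CHECKS
    return BASELINE_CHECKS


def missing_checks(ai_influenced: bool, completed: Set[str]) -> List[str]:
    """List the required checks that have not yet been completed."""
    return [c for c in required_checks(ai_influenced) if c not in completed]
```

A merge gate could block any change for which `missing_checks` returns a non-empty list, which keeps the extra scrutiny tied to the commit's origin rather than to how polished the diff looks.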
What this means
The study doesn’t argue that AI coding assistants are unsafe — it argues that they are being measured incompletely. The productivity gains are real, but so is a quietly accumulating security liability that conventional review is poorly equipped to catch. For engineering leaders, the practical takeaway is to balance the speed metrics with security ones: track vulnerability rates in AI-influenced code, harden review processes against the false comfort of fluent output, and treat machine-generated suggestions as drafts to be interrogated rather than answers to be trusted. The firms that win on AI productivity will be the ones that refuse to ignore the bill arriving later.