Part 7 of The Agentic AI Playbook
Series Navigation:
- Part 1: This Isn’t Autocomplete
- Part 2: What Your AI Doesn’t Know
- Part 3: Define Before You Design
- Part 4: Spec First, Always
- Part 5: How to Work With It Daily
- Part 6: When One Agent Isn’t Enough
- Part 7: What Goes Wrong
- Companion: Ontology in the Age of AI
- Full Series Guide
Every anti-pattern in this post felt reasonable in the moment. That is what makes them dangerous. Nobody decides to skip the spec because they are lazy. They skip it because the task looks simple, the deadline is close, and the AI is right there waiting. Nobody decides to deploy without review because they are reckless. They do it because the tests passed and the code looked clean.
The failures that follow are not random. They are predictable. And almost every one of them traces back to the same place: something that should have happened before the prompting did not.
The preparation failures
These are the most consequential mistakes, because they compound. Every artefact the AI produces downstream will carry the gap forward.
Starting without context is the most common. No CLAUDE.md, no briefing, no architecture guidelines. The AI defaults to the most common patterns it has seen in training — which are rarely your patterns. It invents naming conventions, chooses architectural styles, and makes structural decisions based on what most teams do, not what your team does. The first few results look reasonable. By the time the inconsistencies surface, they are distributed across multiple files and sessions.
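The gap is easy to close with even a minimal briefing file. The sketch below is illustrative only — the paths, naming rules, and boundaries are invented placeholders for the kinds of conventions worth writing down, not a recommended set:

```markdown
# CLAUDE.md — project briefing (illustrative sketch)

## Conventions
- Adapters live in `adapters/`, domain logic in `domain/` (paths are examples).
- Naming: REST paths kebab-case, database columns snake_case, DTOs suffixed `Dto`.

## Boundaries
- Never write to the database outside a repository class.
- All public endpoints require an authenticated principal; fail closed.

## Out of scope
- Do not modify generated code under `gen/`; regenerate it instead.
```

Even a file this short replaces "what most teams do" with "what this team does" for every decision the AI would otherwise make silently.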
Starting without a spec is the second. “Build me a user service” produces a service. It is just not your service. Every ambiguity in the instruction becomes a decision the AI makes on your behalf: what fields the entity has, how errors are surfaced, how edge cases are handled, what happens on concurrent writes. Those decisions are invisible until you try to use the output, at which point correcting them is more expensive than the spec would have been.
Starting without an ontology is the third, and in brownfield work it is the most insidious. The AI reads your existing codebase and inherits its conceptual model. If Customer means three different things across three services, the AI will carry all three meanings forward into every artefact it produces. You asked for modernisation. You got preservation with better formatting.
These three failures share a structure. The work that was skipped before prompting becomes rework after it. The rework takes longer than the preparation would have. Every time.
The speed trap
Agentic AI does not introduce new classes of failure.
It accelerates everything, which brings your existing process gaps into view faster than you are used to.
Teams that relied on “a careful developer will catch it during coding” discover that what they mistook for quality control was actually the thinking that happened during typing. When the typing disappears, so does that thinking — unless you put it somewhere else first. Into the spec. Into the edge cases. Into the acceptance criteria. Into the threat model.
The speed feels like an advantage and it is, but only if the foundation is in place. Without it, you are not moving faster toward a good outcome. You are accumulating mistakes faster. A vague requirement that used to surface as a problem two days into implementation now surfaces as a problem in two minutes. If you have not built the habit of fixing requirements before they reach implementation, you will spend those two minutes trying to course-correct a session that should never have started without a clearer brief.
The teams that genuinely benefit from agentic AI are the ones that reinvested the time the AI saved them on coding into design, specification, and adversarial thinking. The teams that struggle are the ones that treated the AI as a shortcut past that work rather than through it.
Review and verification failures
Running no review is dangerous in an obvious way: the AI can write syntactically correct code that has subtle logic errors, missing edge cases, or security holes that pass tests. “Tests pass” is not the same as “code is correct.”
But the less obvious failure is the rubber-stamp review: running an automated review pass, seeing “no issues found,” and merging without applying any human judgement to the 20% that requires it. The value of the 80/20 model described earlier in this series is not in the 80% the AI handles — it is in deliberately reserving your attention for the security-critical paths, architectural decisions, data model changes, and integration contracts that the AI cannot reliably evaluate. If that reservation does not happen, you have created review theatre rather than a review process.
The other review failure is passive acceptance of plausible-sounding output. The AI can be confidently wrong. It can hallucinate method signatures, invent configuration properties, and recommend patterns that work in one version of a library but not yours. The more authoritative the response sounds, the less likely you are to verify it — which is exactly backwards. When the AI references a specific API or library feature, check the documentation for your version. Run the code. A hallucinated API is invisible to review and immediately obvious at runtime.
What the symptoms tell you
Output quality drops mid-session. This is the context window degrading. Long conversations accumulate every message, file read, and tool output. As the window fills, earlier content gets compressed or dropped. The AI starts contradicting decisions it made an hour ago. It re-introduces bugs it already fixed. Start a fresh session. Save current progress to a file before ending — a plan document, a summary, a spec update — and read it back in at the start of the next one.
The AI produces different results for the same prompt. This is normal and not a problem to solve. Large language models are probabilistic. The fix is not to find the “correct” prompt — it is to use specs and acceptance criteria as your verification layer. If the output satisfies the spec and passes the tests, it is correct regardless of how it looks compared to a previous run.
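One way to make that verification layer concrete is to encode acceptance criteria as executable checks. The sketch below assumes a hypothetical user-creation endpoint; the field names and the `meets_spec` helper are invented for illustration:

```python
# Acceptance check derived from a hypothetical spec for POST /users.
# Two AI runs may produce structurally different code; both count as
# correct if their output satisfies these criteria.

REQUIRED_FIELDS = {"id": str, "email": str, "created_at": str}

def meets_spec(response: dict) -> bool:
    """True when the response matches the spec, regardless of which run produced it."""
    has_required = all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )
    # Spec: credentials must never appear in a response — fail the check if they do.
    return has_required and "password" not in response
```

The check says nothing about how the code is written, only about what it produces — which is exactly the property you need when outputs vary between runs.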
The AI keeps making the same mistake across sessions. The correction died with the conversation. The fix is explicit capture: when you correct something fundamental, say “remember this for future conversations.” Then verify the memory was actually written. If it keeps recurring, it belongs in your CLAUDE.md, not in memory.
The AI says it ran the tests but you see no test output. Check for actual tool invocations in the session output. You should see a shell command executed and its real output. If you only see the AI asserting that tests passed, it reasoned about the result rather than running the command. Add a rule: when told to run tests, use the shell tool to execute them; do not simulate.
Multi-agent sessions produce conflicting output. File ownership was not defined, or the spec used prose descriptions instead of concrete examples. Two agents reading “returns user data” produce two different response shapes. Write the exact JSON. Every endpoint, every response body. When two agents are reading the same literal example, their implementations will be compatible.
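“Write the exact JSON” means something like the fragment below — a hypothetical spec entry with invented field names, shown only to illustrate the level of concreteness:

```json
{
  "endpoint": "GET /users/{id}",
  "response_200": {
    "id": "usr_123",
    "email": "jane@example.com",
    "roles": ["admin"],
    "created_at": "2024-06-01T12:00:00Z"
  },
  "response_404": {
    "error": "user_not_found"
  }
}
```

A prose line like “returns user data” leaves the shape, casing, and error contract to each agent’s imagination; the literal example leaves nothing.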
Security is a preparation problem
The most important thing to understand about security and agentic AI: the AI does not introduce new vulnerability classes. It surfaces existing specification gaps faster.
Most security bugs are caused by missing requirements, not bad code. “Implement JWT validation” is missing the second part of the sentence: implement JWT validation that also rejects tokens with no algorithm specified, expired tokens with any clock skew beyond thirty seconds, and tokens signed with the wrong key. That second part is implicit. Implicit requirements are where vulnerabilities live, and they live there regardless of whether a human or an AI writes the code.
The fix is to make security requirements explicit in the spec before anything is implemented. Which algorithms are allowed and which must be rejected. What happens on validation failure — fail closed, not fail open. Rate limiting requirements. Input validation boundaries. Logging requirements, including what must never be logged. Give the AI a spec that contains those requirements and it will implement them. Leave them out and it will miss them, for the same reason a developer would: nobody said they were needed.
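What those explicit requirements look like once implemented can be sketched directly. The validator below is hand-rolled for illustration only — HS256-only, with `validate_token` and its constants invented here; a real system would use a maintained JWT library rather than this:

```python
import base64
import hashlib
import hmac
import json
import time

ALLOWED_ALGS = {"HS256"}   # explicit allow-list: "none" is rejected by omission
CLOCK_SKEW_SECONDS = 30    # maximum tolerated clock skew, taken from the spec

def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)

def validate_token(token: str, key: bytes) -> dict:
    """Reject tokens that violate the explicit spec; return the claims otherwise."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") not in ALLOWED_ALGS:
        # Catches alg=none and algorithm-downgrade attempts.
        raise ValueError("algorithm not allowed")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        # Catches tokens signed with the wrong key.
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp") is None or claims["exp"] < time.time() - CLOCK_SKEW_SECONDS:
        # Fail closed: a missing exp is treated the same as an expired one.
        raise ValueError("token expired")
    return claims
```

Every rejection branch corresponds to one sentence of the explicit spec. Leave any sentence out of the spec and the matching branch simply never gets written.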
Use the AI to strengthen the design phase, not just accelerate the implementation. Ask it to identify attack vectors you might have missed. Ask it to generate adversarial test cases for your security spec. Ask it to map the authentication boundaries and identify where an attacker could bypass a check. That is where AI in security delivers genuine leverage: in the threat modelling and specification work that happens before a line of code is written.
Security-critical code warrants three review layers regardless of who wrote it: spec-driven adversarial tests that exercise known attack vectors, an automated code review pass for obvious issues, and human expert review focused on what the spec might have missed. The vulnerability does not care who typed the code. The three layers catch what each layer alone would miss.
Devil’s advocate
A list of anti-patterns can create its own failure mode: checklist thinking. You run through the list before each session, confirm you have a spec and a CLAUDE.md and a defined review process, and then proceed to produce mediocre output anyway because the spec was too thin, the rules were too vague, and the review was too shallow. Compliance with the checklist is not the same as quality in the underlying artefacts.
There is also a real argument that some of this overhead is not always justified. For genuinely simple, isolated, low-stakes changes, a minimal prompt and a quick review are the right tool. Applying the full spec-first lifecycle to a field rename or a format fix creates friction without proportionate value. The discipline described in this series is calibrated for work that has meaningful scope, integration surface, or consequence. Applying it uniformly regardless of context creates bureaucracy, not quality.
And the preparation work itself can be done badly. An ontology that defines the wrong concepts, a spec that specifies the wrong behaviour, a set of rules that encode outdated conventions — these are not neutral. They actively mislead the AI and compound as surely as a missing spec would. The preparation is only as good as the thinking that went into it.
The meta-point
Almost every failure in this post traces back to missing preparation. Not to the AI. Not to the tooling. Not to the approach.
The series has made this argument from the first post: the quality of what you get from agentic AI is determined almost entirely by what you do before you prompt. Brief the AI before it builds. Define what things mean before you design. Write the spec before implementation begins. Structure the review so your attention lands on the things that actually require it. Run the right team of agents for the scope of the work.
Those are not AI disciplines. They are engineering disciplines that matter more when an AI is involved, because the AI amplifies whatever is already true about your process. A clear brief becomes a clear build. An ambiguous one becomes an ambiguous build at twice the speed.
The frustration curve described in Part 1 is real. The first two weeks with agentic AI are often slower and more corrective than expected. Most of that frustration is preparation catching up with ambition. The teams that push through and build the habits — the CLAUDE.md, the ontology, the spec, the review model — report consistent productivity gains from week three onward.
The preparation is the work. The AI handles the rest.
The Agentic AI Playbook — Part 7 of 7