The Agentic AI Playbook, Part 2 of 7: Why Your AI Keeps Getting It Wrong (Hint: It’s Not the AI)

March 19, 2026

Part 2 of The Agentic AI Playbook


You would not hire a contractor and hand them a laptop without a briefing. No codebase tour, no explanation of the architecture, no “by the way, we never use field injection here.” You would spend at least an hour getting them up to speed. Then you would expect them to remember it.

Most people skip that step with agentic AI entirely. They open a session, fire off a task, and wonder why the output doesn’t match how their team actually works.

The agent is not being difficult. It genuinely does not know your world. It has read a vast amount of code in training, which means it will default to common patterns — and common patterns are rarely your patterns. Without explicit context, it will do what most teams do, not what you do.

The fix is straightforward. You write the briefing once, and the agent follows it forever.


CLAUDE.md: the standing brief

A CLAUDE.md file is a plain markdown file the agent reads at the start of every conversation. Think of it as a persistent instruction sheet: your stack, your conventions, your constraints, and the things it should never do. Write it once; it applies to every session, every team member, every task.

The most useful thing to understand about it is the hierarchy. A user-level file at ~/.claude/CLAUDE.md holds your personal preferences across all projects. A project-level file in the repo root holds team-wide standards, committed to git so everyone’s agent behaves consistently. Directory-level files can refine rules further for specific parts of the codebase. Rules cascade: the project-level file overrides the user-level, a directory-level file overrides the project-level.
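As a concrete sketch, the three levels might sit like this (the repo name and module layout here are hypothetical; the `~/.claude/` path is the user-level location described above):

```
~/.claude/CLAUDE.md          # user level: personal preferences, every project
my-repo/
├── CLAUDE.md                # project level: team standards, committed to git
└── payments/
    └── CLAUDE.md            # directory level: refinements for this module only
```

The more specific file wins wherever rules conflict, so the payments module can tighten the team standard without anyone editing the root file.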

What you put in it matters more than the format. The most common mistake is being vague.

“Follow clean code principles” is useless. “Use constructor injection, never field injection; use records for DTOs, never Lombok @Data; all list endpoints must support pagination with a default page size of 20” is something the agent can actually act on. Specificity is everything.

Beyond coding standards, two categories of rules are especially valuable.

“Do not” rules. If the agent keeps doing something you dislike, add a rule. Do not add comments unless asked. Do not create abstract classes where a simple function will do. Do not modify files outside the scope of the current task. Every time you find yourself correcting the same behaviour twice, it should become a rule.

Non-functional requirements. These are the guardrails the agent will miss entirely unless you state them explicitly. Idempotency on state-changing endpoints. Explicit timeouts on external HTTP calls. Pagination on any query that could return unbounded results. These are the things that cause production incidents, and they are exactly the kind of thing an agent will skip if nobody told it they mattered.

One thing worth understanding early: the CLAUDE.md does not have to contain everything. It can reference other files. Your core rules file might point to a separate document containing detailed NFRs, another containing design guidelines, another containing your security policies or data governance guardrails. “Before proposing any API changes, read docs/api-standards.md. Before touching the payment domain, read docs/adr/payment-decisions.md.” The CLAUDE.md acts as the briefing and the index; the detail lives in dedicated files that can be maintained, versioned, and reviewed independently. This matters particularly in larger teams or regulated environments where policies and guardrails are substantial enough to warrant their own governance.

The right approach to building a CLAUDE.md is to start small and iterate. Five to ten rules on day one, covering your strongest conventions. Watch what the agent gets wrong. Apply a simple rule: if you correct the same thing twice, it becomes a rule. After a few weeks, your file becomes a precise encoding of how your team actually works.

The starting point differs between greenfield and brownfield. In greenfield, you define your world from scratch: your stack choices, your naming conventions, your architectural patterns. The CLAUDE.md is a blank canvas and the agent will build exactly what you describe. In brownfield, you need to document two things: the world as it currently is, and the world you are moving toward. The agent needs to know which patterns to carry forward and which to retire. Without that explicit direction, it defaults to mirroring what it reads in the existing codebase — including the parts you are trying to leave behind. A brownfield CLAUDE.md typically includes a section on target patterns alongside a section on legacy patterns to avoid. That distinction is what separates genuine modernisation from polishing the old system.
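A brownfield file might separate the two worlds like this — the specific patterns are illustrative, standing in for whatever your codebase actually contains:

```markdown
## Target patterns (use for all new code)
- New list endpoints use cursor-based pagination.
- New services follow the hexagonal layout: domain core, adapters at the edge.

## Legacy patterns (do not replicate)
- The ServiceHelper god classes: do not add methods to them.
- Field injection in older modules: migrate to constructor injection
  whenever you touch one of these classes.
```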


Memory: not starting from zero every morning

CLAUDE.md handles the static context — your standards, your architecture, your constraints. But some knowledge is dynamic. Decisions made during a session. Context that emerged from a tricky debugging conversation. The reason your team stopped using a particular pattern after an incident last year.

Without memory, that knowledge disappears when the conversation ends. The next session, the agent has no idea. You repeat yourself. You re-explain things you already explained. You correct mistakes you already corrected.

Memory is the mechanism that fixes this. The agent stores what it learns as small files between sessions. When you start a new conversation, it reads the relevant ones and picks up where it left off.

The critical habit is explicit capture. If the agent makes a mistake and you correct it, say “remember this for future conversations.” Without that instruction, the correction dies with the session. With it, the agent writes a memory file and the mistake doesn’t recur.

At the end of a significant session, ask the agent what it learned that should be remembered. It will review the conversation and identify the key decisions, corrections, and context that deserve to persist. This takes two minutes and saves the first ten minutes of every future session.

The same logic applies to project context that lives in people’s heads rather than in the code. “The billing service is being deprecated — new features go into billing-v2.” “We moved away from Flyway for new services after the migration incident.” “The payment gateway has an undocumented rate limit of 50 requests per second.” These are things the agent cannot infer from the codebase, but they are exactly the kind of thing that shapes every decision made inside it.

Memory does need occasional maintenance. It can go stale. Decisions get reversed. Architecture evolves. A memory that was accurate six months ago can now actively mislead. Spend ten minutes every month scanning the memory directory, removing outdated entries, and merging anything that’s become contradictory. Treat it like you’d treat any other project artefact.


Why a decision was made matters as much as what it was

There is a specific type of context that deserves more attention than it usually gets: the reasoning behind past decisions.

The agent knows your codebase. It can read your code and infer that you’re using cursor-based pagination. What it cannot infer is why. If that decision was made because offset-based pagination degraded badly under load at scale, the agent needs to know that. Otherwise, the next time it encounters a pagination problem, it might suggest offset-based as a simpler option — not because it’s ignoring your decision, but because it doesn’t know the failure you were trying to avoid.

Architectural Decision Records solve this. A short document per decision: the context, the decision, and the consequences. Stored in the repo. Referred to in your CLAUDE.md so the agent knows to check them before proposing changes.

The value is not just in preventing bad suggestions. It is in enabling good ones. An agent that understands why you made the choices you did can work within them more intelligently than one that can only see the what.


The feedback loop

The most useful long-term pattern here is the one that makes your setup progressively smarter rather than staying static.

A human reviewer spots something the agent missed: a service method modifying two aggregates without a transaction boundary. They add a rule. The next session, the agent catches it automatically. Another session, another finding, another rule. Over time, the CLAUDE.md accumulates the team’s hard-won knowledge. Issues that caused production incidents become permanent checks. Patterns that junior developers commonly get wrong get caught before they reach a human reviewer.

The role shift is significant. You stop being the person who reviews every line of code and start being the person who designs the system that reviews code. Your attention goes to the 20% of the work that actually requires human judgement: security logic, architectural decisions, data model changes, integration contracts. The rest is handled.

You can take this further. Instruct the agent to run its own lessons-learned review as part of verification. When the agent finishes a task and checks its output against the spec, if it finds a gap it missed, an assumption that turned out to be wrong, or an edge case it didn’t anticipate, it should flag it and propose a rule that would have caught it earlier. Put this directly in your CLAUDE.md: “When verifying your own work, if you identify a gap or missed requirement, propose a new rule to prevent the same gap in future.” The agent then participates in improving the system that governs it. This is not foolproof, but it means the feedback loop doesn’t depend entirely on a human noticing every failure. The agent contributes to its own improvement.

This only works if the feedback loop is actually closed. Corrections that don’t become rules are corrections you will make again.


Devil’s advocate

The CLAUDE.md approach has a real weakness: it only constrains what you thought to write down.

The agent follows explicit rules well. What it cannot do is exercise judgement about situations your rules didn’t anticipate. A new pattern, an unusual edge case, a type of problem your codebase hasn’t encountered before — none of your rules cover those. You can’t write rules for everything, and trying to do so will make your CLAUDE.md enormous and unmanageable.

There is also a risk of stale rules guiding the agent in the wrong direction. A rule that made sense when the project was small can become actively harmful at scale. Rules that no one remembers adding for reasons no one can recall are worse than no rules at all.

And memory poisoning is a real failure mode. If the agent stores something incorrect — and you confirmed it without checking — future sessions will build on that wrong assumption. It compounds quietly until something breaks.

None of this undermines the approach. It means the approach requires maintenance, not just setup. A CLAUDE.md that’s actively maintained is a force multiplier. One that’s left to accumulate and drift becomes noise.


What to do tomorrow morning

Three things, in order.

Create a CLAUDE.md in your current project. Start with what you already know: your stack, your top five conventions, the two or three things the agent should never do. Don’t try to be comprehensive; try to be specific.

Run the agent on a task you’d normally do yourself. Watch what it gets wrong. For each mistake, ask whether a rule would have prevented it. If yes, add the rule.

At the end of that session, ask: “What did we learn today that should be remembered?” Let the agent identify it and save it. That’s the beginning of a knowledge base that compounds over time.


Next: Part 3 — Define Before You Design
