The Agentic AI Playbook, Part 5 of 7: Prompting as Delegation: A Practical Framework for Agentic AI

March 19, 2026

The difference between mediocre and great results from agentic AI is almost entirely in how people communicate. Not what they ask for. How they ask for it.

Most teams notice this eventually. Two prompts request the same thing and get wildly different outputs. The one that worked was specific. It gave context, named constraints, and said what “done” looks like. The one that did not was a sentence: a direction rather than an instruction.

This is learnable. There is a daily rhythm that consistently produces good results, and it does not require expertise in AI. It requires the same discipline you would bring to directing any capable engineer.


Structure your prompts like delegation

When you ask an agentic AI to do something, you are not submitting a query. You are delegating a task. The distinction matters, because queries can be vague and delegations cannot.

Every effective prompt has four components.

Context: what does the AI need to know about the current state of things? What is already built? What broke? What constraints does this task have to respect? If there is a relevant ticket or requirement, paste the key parts directly into the prompt.

Goal: what does the end state actually look like? “Refactor the payment module” is vague. “Extract the payment validation logic into a dedicated PaymentValidator class, keeping the same public interface so existing callers don’t change” is actionable. The AI will build to the specification it receives. If the specification is a direction, the output will be directional. If it is a destination, the output will arrive there.

Constraints: what should the AI not do? Without explicit constraints, the AI makes choices you did not sanction. Do not modify the database schema. Keep the existing API contract. Only touch files in this directory. Do not introduce new frameworks. These boundaries are not defensive — they are the difference between reviewing twelve files and reviewing three.

Definition of done: how will you know it worked? All existing tests pass. The new test I described passes. The code compiles without warnings. This gives the AI a finish line it can verify against before presenting results. Without one, “done” is whatever the AI decides it is.

This structure is not a template to fill out every time. It is a mental checklist. Before sending a prompt, ask yourself whether you have covered all four. The ones you skip are the ones that produce corrections.
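Put together, a prompt covering all four components might look like the sketch below. The file and class names are illustrative, not from any real codebase:

```markdown
Context: Validation logic in our checkout service is currently inlined
in `processPayment()`, which also handles retries and logging. All
callers go through the public `processPayment(order)` method.

Goal: Extract the validation logic into a dedicated `PaymentValidator`
class. Keep the public interface unchanged so existing callers don't
need modification.

Constraints: Do not modify the database schema. Do not introduce new
dependencies. Only touch files under `src/payments/`.

Definition of done: All existing tests pass, and a new unit test
exercises `PaymentValidator` directly.
```

Notice that every sentence either adds information the AI cannot infer or removes a choice it would otherwise make on its own.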


One conversation, one task

Claude’s context window is its working memory. Long conversations degrade. After dozens of exchanges, the AI starts forgetting earlier decisions, repeating suggestions you already rejected, and losing track of the overall plan.

The rule of thumb is one conversation per logical task. Finish the authentication fix, then open a new conversation for the invoice feature. The cost of re-establishing context at the start of a new conversation is lower than the cost of the AI conflating two unrelated tasks halfway through.

When a conversation is growing long but you are not ready to start fresh, ask the AI to summarise the current state: what has been done, what is left, what decisions were made. That summary becomes the opening message in the next session if you do switch.
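A handoff request along these lines works well as the closing message of a long session; the exact wording is a suggestion, not a required format:

```markdown
Before we wrap up: summarise the current state of this task so I can
start a fresh conversation from it. Cover three things:

1. What has been completed so far, with file names.
2. What remains to be done, in order.
3. Key decisions we made and why, including approaches we rejected.

Format it so I can paste it verbatim as the opening message of the
next session.
```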

Knowing when to restart is as important as knowing how to start. If the AI heads in a direction you did not intend, stop it. Be direct: that approach will not work because we need to maintain backwards compatibility; let us use the adapter pattern instead. Redirect clearly rather than nudge vaguely. Vague feedback produces vague corrections.

If output is more than 80% right, iterate in the same conversation with targeted corrections. If it is between 50 and 80%, start fresh with a better prompt and tighter constraints. Below 50%, or after three failed iterations on the same problem, the issue is almost certainly in the prompt or the spec. Rewrite from the beginning.


The review model

If you read every line the AI produces, you have made yourself the bottleneck. You are generating code more quickly and then reviewing all of it manually. The productivity gain disappears.

The answer is not to skip review. It is to stop reviewing everything with the same attention.

Routine checks belong to the AI. Style and convention compliance, known anti-patterns, test quality, consistency with existing codebase patterns. Set up your CLAUDE.md rules to define what the AI reviewer should check. Run a review pass after every implementation, before you look at anything. Make it non-negotiable.
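What that review pass checks lives in your CLAUDE.md. A minimal, hypothetical set of reviewer rules might look like this (the referenced style guide path and patterns are examples, not a prescribed standard):

```markdown
## Review rules

After every implementation, run a review pass before presenting results:

- Check new code against the conventions in `docs/style-guide.md`.
- Flag any raw SQL outside the repository layer.
- Verify every new public function has a corresponding test.
- Confirm error handling follows the existing patterns in this codebase.
- List anything you could not verify, rather than staying silent.
```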

Your review focuses on what automated checks cannot reliably judge: security-critical logic, architectural decisions, data model changes, integration contract changes, anything regulated or domain-specific. This is the 20% that requires your knowledge of the system, the business, and the risk.

The feedback loop that makes this model improve over time is straightforward. You review the critical 20% and catch something the AI missed. That becomes a rule. The AI catches it next time. Over time, the 20% that needs your attention shifts toward higher-level concerns. The system gets smarter, not just you.

Your role in this model is review system designer, not line-by-line reviewer.


Memory: not starting from zero

Without memory, every conversation starts from scratch. You re-explain the same context, correct the same mistakes, re-establish the same preferences. It is the same onboarding conversation with the same engineer every single morning.

The fix is explicit capture. When you correct something fundamental during a session, say “remember this for future conversations.” Without that instruction, the correction dies when the conversation ends. With it, the AI writes a memory file and the mistake does not recur.

At the end of any significant session, ask the AI what it learned that should be remembered. It will review the conversation and identify the key decisions, corrections, and context worth preserving. Two minutes at the end of a session saves the first ten minutes of the next one.
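The resulting memory entries are most useful when they record the decision and the reason, not just the rule. A captured entry might read like this (contents entirely illustrative):

```markdown
## Decisions

- Payment-provider integrations use the adapter pattern. Direct
  coupling was rejected because providers change frequently.

## Corrections

- Do not suggest upgrading the web framework to v5. We are pinned
  to v4 until the reporting module is rewritten.
```

An entry with a reason attached stays useful even after the rule itself changes, because the next reader can judge whether the reason still holds.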

Memory does need maintenance. Decisions get reversed. Architecture evolves. A memory that was accurate six months ago can actively mislead if it is still present. Scan the memory directory monthly, remove outdated entries, merge anything that has become contradictory. Treat it like any other project artefact.

One failure mode to watch for: memory poisoning. The AI saves something incorrect because you confirmed it without checking. Every subsequent session builds on that wrong assumption. The mistakes compound quietly until something breaks and you trace it backwards. If you realise a memory is wrong, correct it immediately. Bad memories do not fix themselves.


Tooling: what actually adds value

MCPs and skills extend what the AI can do. Not everything in the ecosystem is worth configuring, and adding tools for theoretical usefulness creates noise rather than value.

Two MCPs earn their place from the start. Context7 gives the AI access to current library documentation. The AI’s training has a cutoff date, and for anything that shipped after it, the AI may suggest deprecated patterns or hallucinate API details. Context7 resolves that by fetching real documentation on demand. The second is a GitHub integration, which lets the AI read issue descriptions and PR review comments directly rather than relying on you to paste them in.
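In Claude Code, both can be registered in a project-level `.mcp.json`. A sketch is below; package names and config keys change between versions, so verify against each server's current documentation before using it:

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```

Keeping the token in an environment variable rather than the file itself means the config can be committed to the repository safely.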

Skills give the AI structured workflows rather than leaving it to generalist defaults. The skill pack worth installing immediately is Superpowers (also published as Obra). It adds structured approaches across the full delivery cycle: a brainstorm skill that explores approaches and trade-offs before any design locks in, a plan skill that produces a reviewable implementation plan before any code is written, a TDD skill that writes the test before the implementation, and a review skill that checks finished code against your standards and the original requirements. Each skill encodes discipline as a repeatable step rather than something that depends on you remembering to ask. The brainstorm and review steps in particular change the quality ceiling — the AI challenges its own design before you have to, and checks its own output before you see it.

When a single agent starts hitting its limits — tasks that span too many files, features that cut across too many layers, work that exceeds a single context window — Claude-flow is worth looking at. It is a multi-agent orchestration layer that lets you run coordinated Claude agents in parallel: one handling the service layer, another handling the frontend, a coordinator managing the overall plan and merging results. Each agent has its own full context window, so nothing gets lost or confused across a large task. It is not something you need on day one, but once single-agent work starts feeling limiting, it is the natural next step rather than trying to force increasingly complex work into a single session.

Custom skills are worth considering once your team has established a repeatable process. If your team has a standard approach to database migrations or a checklist for API design, encoding it as a skill turns tribal knowledge into an automated process that every developer and every AI session follows consistently.
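A custom skill is, at its simplest, a markdown file the AI loads when a task matches its description. A hypothetical database-migration skill might look like the following; the frontmatter fields follow the common SKILL.md convention, so check the current skills documentation for the exact format:

```markdown
---
name: db-migration
description: Use when creating or reviewing a database migration.
---

# Database migration checklist

1. Every migration has a tested rollback counterpart.
2. No destructive change (dropping a column or table) without a
   two-step deploy: stop writes first, remove in a later release.
3. New columns are nullable or have defaults, so large tables are
   not locked by a rewrite.
4. Name migrations `YYYYMMDD_<ticket>_<summary>`.
```

The checklist itself is the tribal knowledge; the frontmatter is what turns it into something every session applies without being asked.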

Add tooling in response to friction, not in anticipation of it.


Brownfield is different in tactics, not principles

The same principles apply to brownfield work as to greenfield, but the starting point is different.

In greenfield, you define your world before the AI builds in it. In brownfield, the legacy estate is the AI’s source of truth until you tell it otherwise. Ask it to modernise a service without guidance, and it will mirror the existing system’s patterns back in cleaner code. That is not transformation. It is preservation with better formatting.

Before working in a brownfield area, ask the AI to describe what it sees. What patterns does this part of the codebase use? What would you need to know to add a feature here? This surfaces the implicit conventions and assumptions the AI has inferred from reading the code. Some of those inferences will be accurate. Some will not. Find out which before the AI acts on them.
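The survey step can be a literal prompt, run before any change is requested (the directory name is illustrative):

```markdown
Before we change anything: read the code under `src/billing/` and
describe what you see.

1. What architectural patterns does this area use, and how
   consistently?
2. What implicit conventions would a new feature here be expected
   to follow?
3. What assumptions are you making that you could not verify from
   the code alone?

Do not propose changes yet; just describe.
```

The third question is the important one. It forces the AI to separate what it read from what it inferred, which is exactly the distinction you need before letting it act.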

When the goal is genuine modernisation, you need to explicitly tell the AI which patterns to carry forward and which to replace. Without that decision, the target state inherits the old confusion. The ontology and spec work from the earlier posts apply here more directly than in greenfield — not less. There is more to define, because the existing system is a mixture of things you want and things you are trying to leave behind.


Devil’s advocate

The four-component prompt structure is sound, but it assumes you know what you need before writing the prompt. For genuinely exploratory work, the right approach is sometimes to prompt loosely, observe what the AI does with the ambiguity, and use that output as a diagnostic. What did it assume? Those assumptions are often a clearer articulation of the problem space than your initial description, and a better foundation for the real prompt.

There is also a risk that the review model degrades into learned complacency. If you consistently trust the AI’s first-pass review, you gradually rely on it to catch things it cannot catch. The 80/20 split is a starting point. If the AI’s review is consistently missing things you find in the critical 20%, the calibration is wrong and the CLAUDE.md rules need updating.

And memory accumulation has compounding risks on both ends. Too little, and you repeat yourself indefinitely. Too much, and the memory index becomes noise the AI has to parse before it can start anything useful. Monthly maintenance is not optional. A memory system that is not maintained becomes a liability faster than one that was never started.


What to do tomorrow

Pick a task you would normally do yourself.

Before opening the AI session, write down the context, goal, constraints, and definition of done. Not as a formal document. As the prompt itself. Spend five minutes on this.

Run the task. Afterwards, note every place where the output was not what you wanted. For each one, ask whether a clearer constraint or a more explicit definition of done would have prevented it. If yes, adjust your prompting approach. If a correction was fundamental enough to apply beyond this task, add it to your CLAUDE.md.

Ask the AI what it learned in that session that should be remembered. Let it identify and save it.

Do this for three tasks in a row. By the third one, the structure is no longer effort. It is habit.


Next: Part 6 — When One Agent Isn’t Enough
