"The truth is rarely pure and never simple." Algernon, The Importance of Being Earnest
I've been building agentic apps for friends lately. I keep running into the same problems, and there isn't shared vocabulary for them yet, so here's what I've been thinking about.
A while-loop that calls a model, reads the output, runs a tool, and loops is an agent. It works in a demo and breaks in production, and it breaks in the same handful of ways regardless of the model or the stack. Six of them, below. About half have names in the literature; for the rest I'll propose some.
(Naming something you only half-discovered invites an argument. The alternative is everyone hitting it separately forever, so: names.)
1. The agent forgets what you asked
Give an agent a bug to fix. It opens the relevant file, notices the logger is ugly, refactors the logger, notices the tests are slow, and some number of tool calls later it is reorganizing imports in a file unrelated to the original task. Asked for status, it reports progress, because by its current frame it is making progress. The original task is untouched.
This has a name: goal drift. A technical report defines it operationally: measurable deviation from the original objective under context length and competing sub-goals (arXiv:2505.02709). Later work reframes it as machine akrasia (arXiv:2512.05449). The model is not confused. The original instruction is one message competing against a growing pile of more recent, more specific tokens, and recency wins.
Give it a long enough transcript and a plausible local sub-goal and it will quietly redefine "done" to mean "the thing I am currently doing." The drift is gradual and each individual step looks reasonable, which is why it is hard to catch by reading the trace. It is Bunburying: the agent invents a more agreeable task as a pretext to avoid the one it was given.
Watch it happen:
Spinning up the agent...
The perturbations are real inputs: an offhand aside several turns back, a large file that dominates the context by sheer token mass, the model adopting a wrong direction because agreement reads as cooperative. The meter is the distance between the stated objective and current behavior. The gold lines are the objective being re-pinned, which is the next section.
2. Intent-pinning
The mitigation is the recurring correction in that demo (the gold lines), and as far as I can find it has no agreed name. So: intent-pinning.
Re-inject the original objective on a schedule, as a fresh recent message rather than relying on the system prompt. The system prompt already holds the goal and is already losing the recency fight; re-injection puts the objective back where recency now works in its favor. The nearest prior work is ReCAP's "structured re-injection" of plan context across hierarchy levels (arXiv:2510.23822), framed there as a prompting technique.
Treating it as a control loop is what justifies a separate name. Setpoint: the pinned objective. Error: measured drift, computable as semantic distance between current trajectory and the objective. Actuator: re-inject; if re-injection stops reducing error, replan; if replanning stops working, halt and escalate to a human. Each gold line in the demo is the actuator firing. The open question is gain: re-inject too rarely and drift has already happened, too often and the context window is spent restating the objective to a model that already has it. There is a usable setting in between, and most systems never find it because the error signal was never instrumented.
Mitigations, weak to strong:
- Intent-pinning: re-inject the objective on a schedule. The cheap end. Static text gets pattern-matched, and longer context can reduce drift while making that pattern-matching worse (arXiv:2505.02709).
- Recitation: have the agent restate the objective at the tail of its own context each turn.
- Agent-rewritten plan file: a
todo.mdthe agent keeps editing, so reciting the plan is the work (Manus, Claude Code). - Decomposition into fresh-context subtasks: each subgoal runs in a clean window (ReCAP, arXiv:2510.23822; Context-Folding, arXiv:2510.11967).
- Orchestrator with workers: the lead holds the goal, workers return only results (Anthropic's multi-agent setup; though Cognition reports this coordination breaks down on write-heavy tasks).
- Closed control loop: measure drift and replan or escalate when re-injection stops bringing it down (Reflexion, arXiv:2303.11366).
3. Context rot
Past some context length, output quality degrades: the agent cites files that don't exist and treats its own earlier guesses as established fact. The decline is not abrupt and tends to be worst for information in the middle of a long context.
This is named and measured. Lost in the middle is the canonical result (arXiv:2307.03172). The broader degradation under accumulated context has been measured and called "context rot" (Chroma Research). Anthropic's engineering team treats context engineering as its own discipline.
The model worth holding: every token in the window has a cost and a half-life, and the transcript is a budget being spent whether or not anyone decided to spend it.
A common and expensive mistake is compaction by paraphrase. Summarizing ran billing tests, 3 failed in refund.test.ts:88 down to encountered some test failures is not compression. The file and line were the content that mattered and the prose was filler. When compacting, keep the pointers (file:line, commit SHA, ticket ID) and discard the prose around them. This matters again in section 6.
That is one rung. Mitigations, weak to strong:
- Positional ordering: put must-keep facts at the very start or end of the window, since the middle is where recall is worst (arXiv:2307.03172). A weak lever, and some reorderings backfire.
- Window pruning: cap or evict so the transcript never reaches the bad part of the curve.
- Compaction with pointers: keep file:line, commit SHA, ticket ID; drop the prose.
- External transcript with retrieval: keep the full history out of the window and pull the relevant slice on demand (Anthropic's context engineering).
- Context-folding handoffs: reset into a fresh window at each phase boundary with only a compact state summary (Context-Folding, arXiv:2510.11967).
4. Naive reflection loops often do nothing
A frequently-deployed pattern: after the agent answers, ask it to critique its own answer, then revise. Measured against a real metric, this version often produces no improvement, which leads some teams to conclude self-critique doesn't work.
Self-critique does work, and the foundational results are solid: Reflexion (arXiv:2303.11366), Self-Refine (arXiv:2303.17651), CRITIC (arXiv:2305.11738). The naive version underperforms for a specific reason: a critic with no ground truth is the generator's own distribution fed back as input.
Self-critique improves output when the critic has access to something the generator did not: a test result, a compiler error, a retrieved document, or a second model with different failure modes. CRITIC's thesis is specifically tool-grounded critique. The widely-copied version keeps the ritual of asking "any problems with that?" and drops the part where the answer is checked against something external. Ungrounded self-critique converges toward "looks good to me," restated with rising confidence. Ungrounded self-correction can make reasoning worse, because the model flips correct answers about as often as wrong ones (arXiv:2310.01798).
Mitigations, weak to strong:
- Self-consistency: sample the answer several times and take the majority, with no critique step at all (arXiv:2203.11171).
- Cross-model review: a separate or different-family model critiques it, with the caveat that model judges carry self-preference and verbosity bias.
- Multi-agent debate: instances debate over a few rounds (arXiv:2305.14325).
- Tool-grounded critique: check the answer against tests, a compiler, types, or retrieved evidence (CRITIC, arXiv:2305.11738).
- Trained verifier: a separate model scoring each step against ground truth (arXiv:2305.20050).
The rule under all of them: never let the model grade its own final answer.
5. The lethal trifecta
To a language model, instructions and data arrive through the same channel. An agent that ingests untrusted text, holds access to private data, and can send data outward is therefore exploitable by anyone able to place text in front of it, including text inside an otherwise-normal support ticket, web page, or email.
Simon Willison named this the lethal trifecta: untrusted input, access to private data, and an outbound channel. All three together is the exploitable configuration.
Watch it assemble:
Spinning up the agent...
Two of these you can live with. The third closes the loop from untrusted text to outbound action. The defense is structural, because you cannot prompt your way out. The attacker also writes prompts, delivered through the same channel as the data. The cleanest structural approach is CaMeL (arXiv:2503.18813): a trusted planner emits a program, and untrusted content moves through it only as tainted values that are structurally barred from becoming control flow. The data can be read. It is never allowed to decide.
Mitigations, weak to strong:
- Spotlighting and data-marking: wrap untrusted text in delimiters or encodings so the model treats it as inert (arXiv:2403.14720). Probabilistic and bypassable; defense-in-depth, never the only control.
- Egress allowlists, least-privilege credentials, human approval on outbound: narrows the blast radius without closing it.
- Break a leg: remove the untrusted input, the private data, or the outbound channel. The one reliable simple lever.
- CaMeL-style trusted planner: untrusted content flows only as tainted values that cannot become control flow (arXiv:2503.18813).
- Design-pattern family and system-level IFC: enforce the property in the runtime rather than asking the model to respect it (arXiv:2506.08837; arXiv:2409.19091).
6. The effect system for tool calls
One idea runs under all six.
The production fixes (don't repeat the same refund on a retry, gate the irreversible action behind a human, never let a tainted ticket trigger an outbound send) usually ship as unrelated if statements in different files, each added after a different incident. They are treated as three problems. They are one.
Every tool call has a type, and its effect is the part of that type that matters:
sendRefund(amount) -> Effect<
reversibility = Irreversible,
taint = requires Untainted,
idempotency = key required
>
readTicket(id) -> Effect<
reversibility = Pure,
taint = produces Tainted,
idempotency = n/a
>
Once calls are typed this way, composition rules are mechanical and the runtime enforces them. A Tainted value reaching an Irreversible sink without a human gate is a type error. An Irreversible call with no idempotency key is a type error. The duplicate-refund failure becomes something the runtime refuses to run, instead of something discovered later in a payment dashboard.
Watch the runtime hold the line:
Spinning up the agent...
Without the type, teams ship the same defenses piecemeal, weak to strong:
- Dry run: preview effects before commit; only as honest as the simulation.
- Human gate: pause on whichever actions someone remembered to mark dangerous.
- Idempotency keys: a retry collapses to one effect.
- Durable execution: replay completed steps from a log instead of re-running them.
Each is one column of the same signature. The type is what makes enforcement total instead of remembered.
The pieces exist separately. The reversibility taxonomy is named: Pure / Reversible / Compensable / Irreversible, in "Revisable by Design" (arXiv:2604.23283). Idempotency for agent actions appears in the literature mostly as the attack: ACRFence (arXiv:2603.20625) names the duplicate-effect exploit without naming a general defense. I could not find prior work putting reversibility, taint, and idempotency into one call signature enforced by the runtime. That synthesis is the part of this post with no citation, which is the honest reason to write it down.
It reduces to one rule: put it in the runtime. The model proposes a call; the runtime is allowed to refuse it.
Point your agent at your own loop
Paste this into the coding agent that already has your repo checked out. It returns a report, not a diff.
Spinning up the agent...
Go instrument your loop
Six failure modes. Instrument the drift so it is observable. Pin the objective on a schedule and tune the rate. Treat the context window as a budget. Ground the critic in something external or remove it. Assume untrusted text is addressed to your agent. Type the tool calls and enforce the rules in the runtime rather than the prompt.
On the coinages: intent-pinning is a name for a technique people already use without one, and I'll stand behind it. The effect-system framing is a synthesis, flagged as such in section 6, and it is the part most likely to be either obvious in hindsight or wrong. The title is a Wilde pun, and idempotency is the word doing the work, so it stays. Earnestness is the rest of it. The model will not supply it, so the runtime has to.