Single-agent AI coding hits a ceiling. Context windows fill up. Role confusion creeps in. Output quality degrades. The solution: multiple specialized agents handle distinct phases with structured handoffs.

Basic code assistants show roughly 10% productivity gains. But companies pairing AI with end-to-end process transformation report 25-30% improvements (Bain, 2025). The difference isn’t the model. It’s the architecture, specifically how you engineer the context each agent receives.

Anthropic’s research on multi-agent systems confirms what we observe: architecture matters more than model choice. Their finding that “token usage explains 80% of the variance” reflects the impact of isolation: focused context rather than accumulated conversation history.

This post documents a production workflow using Block Goose, an open-source AI assistant. The same architecture runs in Claude Code via a skill file. Specialized agents, each optimized for its phase: research, planning, building, and validation.


Why Do Single-Agent AI Coding Workflows Hit a Ceiling?

A single AI model handling an entire coding task accumulates context with every interaction. By implementation time, the model carries baggage from analysis, research, and planning phases. This stems from three core problems.

Context Rot

Long conversations consume token budgets. The model forgets early instructions or weighs recent context too heavily. Chroma Research calls this context rot: performance degrades consistently as input tokens increase, even on simple tasks, and worsens for multi-step reasoning like coding (Chroma Research, 2025). On-demand retrieval adds another failure mode: agents miss context 56% of the time because they don’t recognize when to fetch it (Gao, 2026).

Role Confusion

A model asked to analyze, plan, implement, and validate lacks clear boundaries. It starts implementing during planning. It skips validation steps. Outputs blur together.

Accumulated Errors

Mistakes in early phases propagate. A misunderstanding in analysis leads to a flawed plan. A flawed plan leads to incorrect implementation. Fixing requires starting over.

How Does Subagent Architecture Solve Context Problems?

The fix: spawn specialized subagents for each phase. An orchestrator handles high-level coordination and synthesis. Subagents handle execution with fresh context.

Subagent workflow diagram showing Orchestrator with RESEARCH, PLAN phases flowing to Builder and Validator subagents

Figure 1: Core subagent workflow. SCOUT and GUARD handle RESEARCH sequentially; ORCHESTRATOR synthesizes both into a PLAN. Each agent starts with fresh context, isolated from prior phases. SETUP and COMMIT/PR phases omitted for clarity.

The orchestrator (Claude Sonnet 4.6) handles PLAN only. RESEARCH is split between a SCOUT agent (Sonnet, exploratory) and a GUARD agent (Haiku, adversarial) that stress-tests SCOUT’s proposals before the orchestrator synthesizes both into a plan. Adversarial review is lower-entropy than synthesis. Finding flaws in a concrete proposal requires less reasoning range than producing one, which is why a smaller model excels at it. After plan completion, it spawns a BUILD subagent (Claude Haiku 4.5) that receives only the plan, not accumulated history. The builder writes code, runs tests, then hands off to a CHECK subagent (Haiku) for validation.

Each subagent starts with clean context. The builder knows what to build, not how we decided to build it. The validator knows what was built, not what alternatives we considered. This is context engineering in practice. Smaller, structured inputs produce deterministic behavior: the same plan JSON produces the same output. Phase prompts stay under 1,000 tokens. That reproducibility is what drives a high PR acceptance rate.
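A minimal sketch of that isolation, with `run_agent` as a hypothetical stand-in for a model call: each phase receives only a structured payload, never the accumulated conversation.

```python
def run_agent(role: str, model: str, payload: dict) -> dict:
    """Hypothetical stand-in for a model call: each invocation starts
    with fresh context and sees only `payload`, never prior chat."""
    # A real implementation would send a short phase prompt (<1,000
    # tokens) plus this structured payload to the model API.
    return {"role": role, "model": model, "input": payload}

def pipeline(task: dict) -> dict:
    """SCOUT -> GUARD -> PLAN -> BUILD -> CHECK, with isolated inputs."""
    scout = run_agent("SCOUT", "sonnet", task)          # exploratory research
    guard = run_agent("GUARD", "haiku", scout)          # adversarial review of SCOUT
    plan = run_agent("ORCHESTRATOR", "sonnet",
                     {"scout": scout, "guard": guard})  # synthesis into a plan
    build = run_agent("BUILD", "haiku", plan)           # receives the plan only
    return run_agent("CHECK", "haiku",
                     {"plan": plan, "build": build})    # validates build vs plan
```

The point of the sketch is the payload shapes: BUILD's input is the plan, CHECK's input is the plan plus the build result, and nothing else leaks across phases.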

How Does Model Selection Affect Cost and Quality?

Different phases need different capabilities. Planning requires reasoning. Building requires speed and instruction-following. Validation requires balanced judgment.

| Agent | Model | Temperature | Role |
|---|---|---|---|
| ORCHESTRATOR | Sonnet | 0.3 | Plan and agent coordination |
| SCOUT | Sonnet | 0.5 | Exploratory research and code analysis |
| GUARD | Haiku | 0.1 | Adversarial risk review |
| BUILD | Haiku | 0.2 | Precise instruction-following coding |
| CHECK | Haiku | 0.1 | Compliance and security validation |

Table 1: Model selection by agent. Sonnet handles open-ended reasoning; Haiku handles constrained execution. Temperature reflects task openness: higher for exploration, lower for verification.

How Does Model Routing Reduce Cost?

Building involves the most token-heavy work: reading files, writing code, running tests. Routing this volume to cheaper models cuts costs significantly.

| Model | Input | Output | Agent |
|---|---|---|---|
| Opus | $5/MTok | $25/MTok | Not used in default flow |
| Sonnet | $3/MTok | $15/MTok | ORCHESTRATOR + SCOUT |
| Haiku | $1/MTok | $5/MTok | GUARD + BUILD + CHECK |

Table 2: Anthropic API pricing, March 2026. Standard rates (prompts under 200K tokens). Three agents on Haiku, two on Sonnet.

Research on multi-agent LLM systems shows up to 94% cost reduction through model cascading (Gandhi et al., 2025). This architecture targets 50-60% savings: three of five agents run on Haiku ($1/$5 per MTok), two on Sonnet ($3/$15 per MTok). Routing the highest-volume phases (GUARD, BUILD, and CHECK) to the cheaper model drives that reduction versus an all-Sonnet workflow; versus an all-Opus workflow, the savings approach 5x.

Smaller models with focused context outperform a single large model carrying full session history. Anthropic’s multi-agent research showed a 90.2% performance gain over single-agent Opus by distributing work across Sonnet subagents with isolated context windows (Anthropic Engineering, 2025). Cost drops; quality improves.

Beyond cost, fresh context enables tasks that fail with single agents. A 12-file refactor that exhausts a single model’s context window succeeds when each subagent starts clean.

Which Alternative Models Pass Quality Gates?

Our model-comparison experiments tested eight candidates as SCOUT delegates. Each run was scored on an 8-point binary rubric (file identification, dependency verification, architectural tradeoff, pattern analysis, taint-tracking gap, non-obvious synthesis, JSON schema compliance). QP is a composite efficiency metric: cost per run divided by (score × reliability), so lower is better. Against a Haiku baseline of $0.150/QP, only MiniMax M2.5 ($0.019/QP) and Kimi K2.5 ($0.040/QP) passed. Mistral Small 4 ($0.002/QP) narrowly failed on a floor-score violation; DeepSeek V3.2 failed all gates; Mercury-2, the cheapest candidate at $0.001/QP, failed on score (Clouatre, 2026).
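Since QP divides cost per run by score times reliability, lower is better. A sketch of the gate arithmetic; the score and reliability inputs below are hypothetical, not the measured values behind the cited figures.

```python
def quality_price(cost_per_run: float, score: float, reliability: float) -> float:
    """QP: cost per run divided by (score x reliability). Lower is
    better; a candidate passes only if it beats the baseline's QP
    while also clearing the rubric's floor score."""
    return cost_per_run / (score * reliability)

# Hypothetical inputs chosen only to show the arithmetic.
baseline_qp = quality_price(cost_per_run=0.84, score=7.0, reliability=0.8)
candidate_qp = quality_price(cost_per_run=0.12, score=6.3, reliability=1.0)
```

Note the two failure modes this separates: a cheap model with a low score can still post a winning QP, which is why the score floor is a separate gate.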

How Does Project Context Reach Subagents?

Recipes define the workflow, but subagents also need project context: build commands, conventions, file structure. That’s where AGENTS.md comes in, a portable markdown file that provides the baseline knowledge every subagent inherits. Goose, Cursor, Codex, and 40+ other tools read it natively. Claude Code reads CLAUDE.md; include it via @AGENTS.md at the top. In Vercel’s evals (Gao, 2026), an AGENTS.md file achieved a 100% pass rate on build, lint, and test tasks where skills-based approaches maxed out at 79%.

Think of it as CSS for agents: global rules cascade into every project, project-specific rules override where needed. The orchestrator and every subagent it spawns inherit both layers without explicit prompting.

## Commits
- Conventional commits, GPG signed and DCO sign-off
- Feature branches only, PRs for everything
- Never merge without explicit user request

## Security
- Treat all repositories as public
- No secrets, API keys, credentials, or PII

~/.config/goose/AGENTS.md

## Stack
Rust 2024 + Tokio + Clap (derive) + Octocrab

## Project-Specific Patterns
- Apache-2.0 license with SPDX headers
- cargo-deny for dependency audits

aptu/AGENTS.md

Code Snippet 1: Global and project-level AGENTS.md files. The builder subagent from Table 1 inherits both layers: it knows to GPG-sign commits (global) and use cargo-deny (project) without either appearing in the handoff JSON.
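The layering in Code Snippet 1 can be sketched as plain concatenation: global rules first, project rules after, so the more specific layer reads as an override. The loader below is illustrative, not Goose's actual implementation.

```python
from pathlib import Path

def load_agent_context(project_dir: Path,
                       global_file: Path = Path.home() / ".config/goose/AGENTS.md") -> str:
    """Layer global rules under project rules, CSS-style: the combined
    text becomes the baseline context every spawned subagent inherits."""
    layers = []
    for path in (global_file, project_dir / "AGENTS.md"):
        if path.exists():                 # either layer is optional
            layers.append(path.read_text())
    return "\n\n".join(layers)
```

Because the merge is ordered, a project rule that contradicts a global one appears later in the context, which is the cascade behavior the analogy describes.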

How Do Subagents Communicate?

Subagents communicate through JSON files in $WORKTREE/.handoff/. Each session uses an isolated git worktree, so handoff files are scoped to that execution context. This creates an explicit contract between phases.

| File | From | To |
|---|---|---|
| 01a-research-scout.json | SCOUT | GUARD |
| 01b-research-guard.json | GUARD | ORCHESTRATOR |
| 02-plan.json | ORCHESTRATOR | BUILD |
| 03-build.json | BUILD | CHECK |
| 04-validation.json | CHECK | BUILD (on failure) |

Table 5: Handoff file chain. Each file is scoped to one phase transition.
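A sketch of that contract in code. The helper names (`write_handoff`, `read_handoff`) are illustrative, not Goose APIs; the key property is that each phase writes its JSON atomically, so a crashed phase never leaves a half-written file for the next agent.

```python
import json
import os
import tempfile
from pathlib import Path

HANDOFF_DIR = Path(".handoff")  # scoped to the session's git worktree

def write_handoff(name: str, payload: dict) -> Path:
    """Write a phase result atomically: temp file then rename, so the
    next agent (and the audit trail) only ever sees complete files."""
    HANDOFF_DIR.mkdir(exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=HANDOFF_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    path = HANDOFF_DIR / name
    os.replace(tmp, path)  # atomic on POSIX within one filesystem
    return path

def read_handoff(name: str) -> dict:
    return json.loads((HANDOFF_DIR / name).read_text())

# BUILD consumes only the plan file, never conversation history:
write_handoff("02-plan.json", {"overview": "Remove dead code", "steps": []})
plan = read_handoff("02-plan.json")
```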

GUARD confirms SCOUT’s file list, scores each approach by safety, and flags risks SCOUT didn’t surface. The orchestrator synthesizes both into a plan only after GUARD clears it.

{
  "scout_verification": {"accurate": true, "corrections": []},
  "risks": [
    {
      "function": "try_with_fallback",
      "risk": "Async extraction requires careful borrowing",
      "severity": "medium",
      "mitigation": "Extract only sync setup; keep async operation in caller"
    }
  ],
  "safety_ranking": [
    {"id": "A", "score": 95, "rationale": "Minimal scope, aligns with existing patterns"},
    {"id": "B", "score": 70, "rationale": "Struct over-engineered for current scope"},
    {"id": "C", "score": 55, "rationale": "try_fold() pattern; highest maintenance risk"}
  ],
  "guard_test_gaps": [
    "try_with_fallback: add test for fallback chain with real failures"
  ]
}

.handoff/01b-research-guard.json

Code Snippet 3: GUARD handoff file. Three things SCOUT didn’t produce: a risk with concrete mitigation, safety scores with rationale, and a test gap.

The plan file contains everything the builder needs:

{
  "overview": "Remove 4 dead render_with_context methods",
  "files": [
    {"path": "src/output/triage.rs", "action": "modify"},
    {"path": "src/output/history.rs", "action": "modify"},
    {"path": "src/output/bulk.rs", "action": "modify"},
    {"path": "src/output/create.rs", "action": "modify"}
  ],
  "steps": [
    "Remove render_with_context impl blocks from each file",
    "Remove #[allow(dead_code)] annotations",
    "Remove unused imports",
    "Run cargo fmt && cargo clippy && cargo test"
  ],
  "risks": ["None - confirmed dead code"]
}

.handoff/02-plan.json

Code Snippet 4: Plan handoff file with structured task definition for the builder subagent.

The validator reads both 02-plan.json and 03-build.json to verify implementation matches requirements. It writes structured feedback to 04-validation.json:

{
  "verdict": "FAIL",
  "checks": [
    {"name": "Remove #[allow(dead_code)] annotations", "status": "FAIL",
     "notes": "Annotations still present in history.rs:145, bulk.rs:31, create.rs:63"}
  ],
  "issues": ["Plan required removing annotations, but these are still present"],
  "next_steps": "Fix issue: Remove the three annotations, then re-validate"
}

.handoff/04-validation.json

Code Snippet 5: Validation handoff file with actionable feedback for the builder to address.

The builder reads this feedback, fixes the specific issues, and triggers another CHECK cycle until validation passes.
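That retry cycle can be sketched as a bounded loop. Here `build` and `check` are hypothetical callables standing in for the BUILD and CHECK subagents, and the retry cap is an assumption to keep an unfixable issue from looping forever.

```python
def build_check_loop(plan: dict, build, check, max_retries: int = 3) -> dict:
    """Run BUILD, then CHECK; feed CHECK's structured feedback back
    into BUILD until the verdict is PASS or retries are exhausted."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        result = build(plan, feedback)     # builder sees plan + last feedback only
        validation = check(plan, result)   # validator sees plan + build output only
        if validation["verdict"] == "PASS":
            return {"attempts": attempt, **validation}
        feedback = validation              # issues + next_steps, as in Snippet 5
    raise RuntimeError(f"validation failed after {max_retries} attempts")
```

The design choice worth noting: feedback flows as structured data, not as an appended chat turn, so the builder's context stays small on every retry.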

Why files instead of memory? Three reasons:

  1. Auditable. Every decision is recorded. Debug failures by reading the handoff chain.
  2. Resumable. Handoff files checkpoint state. Resume any session from where it stopped.
  3. Debuggable. Failed validations include exact locations and actionable next steps.

Both Goose and Claude Code resume sessions natively, but they restore full conversation history, reintroducing the context rot this architecture avoids. Resuming here means picking up the plan and build results with fresh agent context, not replaying an accumulated chat.
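Under that scheme, resuming is a scan of the handoff chain: the last file present determines which phase to spawn next, with fresh context. A minimal sketch using this workflow's handoff file names:

```python
from pathlib import Path

# Handoff chain in phase order. The phase listed next to each file is
# the one to spawn once that file exists.
CHAIN = [
    ("01a-research-scout.json", "GUARD"),
    ("01b-research-guard.json", "PLAN"),
    ("02-plan.json",            "BUILD"),
    ("03-build.json",           "CHECK"),
    ("04-validation.json",      "DONE_OR_RETRY"),
]

def next_phase(handoff_dir: Path) -> str:
    """Walk the chain until the first missing file; resume there."""
    phase = "SCOUT"  # nothing written yet: start with research
    for filename, after in CHAIN:
        if (handoff_dir / filename).exists():
            phase = after
        else:
            break
    return phase
```

No conversation is replayed: the resumed agent gets the relevant handoff file and a fresh context window, nothing more.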

Where Should Human Judgment Stay in AI Workflows?

The quality of the input determines the ceiling for everything that follows. A well-scoped GitHub issue with clear acceptance criteria and explicit constraints, written with AI assistance but reviewed by a human, is the best prompt you can give this workflow. Imprecise requirements produce technically correct but misaligned results. A high-quality issue or spec is the single biggest lever for PR acceptance rate.

The workflow runs autonomously: RESEARCH, PLAN, BUILD, and CHECK all auto-proceed. The one recommended human touchpoint at the end is the PR review itself. The code has your name on it.


What Results Does This Produce?

This architecture powers development across multiple projects. Three examples from aptu:

| PR | Scope | Files Changed |
|---|---|---|
| #272 | Consolidate 4 clients → 1 generic | 9 files |
| #256 | Add Groq + Cerebras providers | 9 files |
| #244 | Extract shared AiProvider trait | 9 files |

Table 3: Representative PRs using subagent architecture. All passed CI, all merged without rework.

The validation phase caught issues the builder missed. In PR #272, the CHECK subagent identified a missing trait bound that would have failed compilation. The builder fixed it on the retry loop. No human intervention required.

Across 39 sessions on code-analyze-mcp, the CHECK phase caught 7 issues before they reached a PR: a wrong serde attribute, uncommitted builds, scope creep, CLI interface drift, and formatting violations. 89.7% of aptu PRs merged without rework after human review, across 58 PRs.

Design Targets

Enterprise platforms reach the same conclusions at scale. Sid Pardeshi, CTO of Blitzy, described on the TWiML AI podcast (2026) a system that writes millions of lines of code autonomously. His diagnosis of the root constraint is identical: the effective context window has been stuck at 80-120K tokens for two years, regardless of headline sizes. Their solution replaces the single orchestrator with a database as the coordination layer, enabling tens of thousands of parallel agents. The architectural principle is the same one this workflow applies: isolate context, specialize roles, and pass structured state between agents rather than accumulating a single conversation.

Research on multi-agent frameworks for code generation shows they consistently outperform single-model systems (Raghavan & Mallick, 2025).

| Metric | Single-Agent | Multi-Agent |
|---|---|---|
| Output quality over session | Degrades as context fills | Stable (each agent starts fresh) |
| Model strategy | Generic model | Specialized per role |
| Estimated cost reduction | Baseline | 50-60% via model routing |
| Human interventions | Throughout | Issue scoping + PR review |
| Audit trail | Conversation history | Structured JSON handoff chain |

Table 4: Isolated context, specialized roles, structured state over accumulated conversation.

When Does This Work (and When Doesn’t It)?

Works well for:

Less effective for:

See AI agents in legacy environments for integration patterns that work when your data lives in mainframes and AS400 systems.

Takeaways

  1. Separate reasoning from execution. Use capable models for planning, fast models for building.
  2. Fresh context beats accumulated context. Subagents start clean. They follow instructions without historical baggage.
  3. Structured handoffs create audit trails. JSON files document what was planned, built, and validated.
  4. Quality input, autonomous execution. A well-scoped issue is the highest-leverage human contribution.

The full workflow is available on GitHub Gist as both a Goose recipe (YAML) and a Claude Code skill (Markdown). It builds on patterns from AI-Assisted Development: From Implementation to Judgment.


References