Agent Harness Architecture Explained

The term "agent harness" has been gaining traction in engineering circles since early 2026, but the concept underneath it is practical enough to explain in a single walkthrough. If you're building agents, evaluating agent platforms, or trying to figure out what questions to ask your engineering team, this is the piece to bookmark.

The Core Loop

Every agent harness runs the same fundamental pattern. The model receives context, decides on an action, the harness executes that action through a tool, captures the result, feeds it back into the model's context, and the cycle continues until the task is done or a stopping condition fires.

This is called the action/observation loop. The model makes an action decision. The harness routes that decision to a tool or an MCP server. The tool returns a result. The harness updates session state, runs any guardrail or approval checks, and either passes the observation back to the model for the next step or pauses for human review. When the work is complete, the harness packages the final output and emits traces, logs, and metrics.

The model handles only one part of the cycle: the reasoning step. Everything else in the loop is harness territory: tool execution, state management, guardrail enforcement, approval routing, and telemetry. The model decides what to do. The harness decides whether to let it, executes the action, records what happened, and sets up the next iteration.
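To make the division of labor concrete, here is a minimal Python sketch of the loop. It is an illustration under stated assumptions, not any framework's API: `Action`, `run_loop`, `scripted_model`, and the `get_balance` tool are hypothetical names, and the model is replaced by a scripted function so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """A model decision: call a tool, or finish with a final answer."""
    kind: str                                   # "tool" or "final"
    tool: str = ""
    args: dict = field(default_factory=dict)
    text: str = ""


def run_loop(model: Callable[[list], Action],
             tools: dict,
             task: str,
             max_steps: int = 5) -> tuple:
    """Run the action/observation cycle until the model finishes or the step limit fires."""
    context = [("task", task)]   # session state: everything the model has seen so far
    trace = []                   # telemetry: one record per step
    for step in range(max_steps):
        action = model(context)                      # reasoning step (the model's only job)
        if action.kind == "final":
            trace.append({"step": step, "event": "final"})
            return action.text, trace
        if action.tool not in tools:                 # simple guardrail: reject unknown tools
            observation = f"error: unknown tool {action.tool!r}"
        else:
            observation = tools[action.tool](**action.args)   # tool execution (harness)
        context.append(("observation", observation))          # feed the result back
        trace.append({"step": step, "event": "tool", "tool": action.tool})
    return "stopped: step limit reached", trace


def scripted_model(context: list) -> Action:
    """Stand-in for an LLM: look up a balance once, then answer."""
    observations = [value for key, value in context if key == "observation"]
    if not observations:
        return Action(kind="tool", tool="get_balance", args={"account": "A-1"})
    return Action(kind="final", text=f"Balance is {observations[-1]}")


tools = {"get_balance": lambda account: "1200.00"}   # hypothetical tool
answer, trace = run_loop(scripted_model, tools, "What is the balance of A-1?")
print(answer)
```

Only the call to `model(context)` is the model's work. The tool lookup, the step limit, the context update, and the trace records all belong to the harness.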

The Ten Components

Production harnesses share a common set of building blocks. Different frameworks name them differently, but the underlying capabilities are consistent across OpenAI's Agents SDK, LangGraph, Google ADK, Microsoft Agent Framework, and AWS AgentCore.

System instructions define the agent's role, goals, boundaries, and behavioral constraints. These are the standard operating procedure for the agent, loaded into the model's context at the start of every session. In practice, these often live in versioned files (AGENTS.md, CLAUDE.md) that grow incrementally as teams discover new failure modes and encode permanent fixes. OpenAI's Codex repository uses 88 separate instruction files across its subcomponents.

Tool use is how the agent interacts with external systems. APIs, databases, search engines, file systems, code execution environments, and increasingly MCP servers that standardize the connection interface. Each tool's name, description, and schema gets loaded into the prompt on every request, which means tool design directly shapes agent performance. A clean, focused tool set outperforms a sprawling one because the model can hold the entire menu in working memory.
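As a rough illustration of why tool count matters, the sketch below serializes tool definitions the way a harness might include them in each request. The `ToolSpec` class, the `render_tool_menu` helper, and the tool names are assumptions made for this example, and real frameworks format tool definitions in their own ways.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: dict   # JSON-Schema-style description of the arguments


def render_tool_menu(tools: list) -> str:
    """Serialize every tool spec into the text that would accompany each request."""
    return json.dumps(
        [{"name": t.name, "description": t.description, "parameters": t.parameters}
         for t in tools],
        indent=2,
    )


lookup_invoice = ToolSpec(
    name="lookup_invoice",
    description="Fetch one invoice by its ID.",
    parameters={
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
)

focused = [lookup_invoice]
sprawling = focused + [
    ToolSpec(f"misc_tool_{i}", "Rarely used helper.", {"type": "object", "properties": {}})
    for i in range(20)
]

# The menu is resent on every request, so its size is a recurring cost.
print(len(render_tool_menu(focused)), len(render_tool_menu(sprawling)))
```

Character counts are only a proxy for tokens, but the direction is the same: every extra tool adds to what the model has to read before it can decide anything.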

The action loop is the model's decision step. Given the current context, the model chooses whether to call a tool, generate a response, hand off to a sub-agent, or request more information. This is the only part of the cycle that's non-deterministic. Everything around it should be deterministic.

The observation loop is the return path. Tool results, validation outputs, and feedback flow back into the model's context for the next step. Robinhood's architecture, with a validation agent checking every output before it can proceed, is an observation-loop pattern. The observation step is also where you can inject additional context: flags from guardrail checks, warnings about approaching token limits, status updates from parallel processes, or reminders of the original task specification.
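The context-injection idea can be sketched as a small function that wraps the raw tool result before it goes back to the model. The function name, the `RESULT`/`FLAG`/`WARNING`/`REMINDER` labels, and the 80% warning ratio are all assumptions chosen for illustration.

```python
def build_observation(tool_result: str,
                      tokens_used: int,
                      token_limit: int,
                      guardrail_flags: list,
                      task_spec: str,
                      warn_ratio: float = 0.8) -> str:
    """Wrap a raw tool result with extra context the harness wants the model to see."""
    parts = [f"RESULT: {tool_result}"]
    for flag in guardrail_flags:                     # flags raised by guardrail checks
        parts.append(f"FLAG: {flag}")
    if tokens_used >= warn_ratio * token_limit:      # nearing the context budget
        parts.append(f"WARNING: {tokens_used}/{token_limit} tokens used")
        parts.append(f"REMINDER: original task was: {task_spec}")
    return "\n".join(parts)


observation = build_observation(
    tool_result="3 rows returned",
    tokens_used=9000,
    token_limit=10000,
    guardrail_flags=["amount outside expected range"],
    task_spec="Reconcile the March ledger",
)
print(observation)
```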

State is the current working data for the active session. The live case file. It includes the conversation history, tool outputs, reasoning breadcrumbs, and any pending approvals. LangGraph implements state through thread-scoped checkpoints. OpenAI uses Sessions. Google ADK uses stateful sessions. The critical property is persistence: if the agent crashes mid-task, can you reload state and resume from the last checkpoint, or do you start over? For any workflow longer than a few minutes, checkpoint-and-resume is foundational.
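Checkpoint-and-resume can be sketched with nothing more than a JSON file. This is a simplified stand-in for what the frameworks above provide, and the function names and state layout are hypothetical. It saves after every completed step, so a restart picks up at the first unfinished one.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename it into place so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_steps(path: str, steps: list, crash_at: Optional[int] = None) -> dict:
    """Run the remaining steps, checkpointing after each one."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["results"].append(f"done:{steps[i]}")
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state


steps = ["fetch", "analyze", "draft"]
path = os.path.join(tempfile.mkdtemp(), "session.json")
try:
    run_steps(path, steps, crash_at=2)    # crashes before the third step
except RuntimeError:
    pass
final = run_steps(path, steps)            # resumes at the third step, no repeated work
print(final["results"])
```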

Memory is what the system retains across sessions. State is scoped to the current task, even when it is checkpointed for recovery. Memory outlives the task. Examples include prior meeting notes, learned preferences, case history, and exception patterns. LangGraph handles short-term memory through checkpoints and is building toward long-term memory primitives. Google's Memory Bank provides cross-session persistence. In regulated finance, memory needs explicit scope, retention rules, and deletion mechanisms. You can't let the agent accumulate knowledge indefinitely without a data governance framework around it.
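A toy in-memory store shows what explicit scope, retention, and deletion could look like. `MemoryRecord` and `MemoryStore` are hypothetical names, the scopes and retention periods are made up, and timestamps are passed in explicitly so the behavior is easy to follow. A real deployment would back this with a database and a governance policy.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    scope: str           # who or what the memory belongs to, e.g. a client ID
    text: str
    created_at: float    # seconds since some reference time
    ttl_seconds: float   # retention period

    def expired(self, now: float) -> bool:
        return now - self.created_at >= self.ttl_seconds


class MemoryStore:
    def __init__(self) -> None:
        self._records = []

    def remember(self, scope: str, text: str, now: float, ttl_seconds: float) -> None:
        self._records.append(MemoryRecord(scope, text, now, ttl_seconds))

    def recall(self, scope: str, now: float) -> list:
        """Return only unexpired memories for the requested scope."""
        return [r.text for r in self._records
                if r.scope == scope and not r.expired(now)]

    def purge_expired(self, now: float) -> int:
        """Retention rule: drop everything past its TTL. Returns the number removed."""
        before = len(self._records)
        self._records = [r for r in self._records if not r.expired(now)]
        return before - len(self._records)

    def forget_scope(self, scope: str) -> int:
        """Deletion mechanism: remove every record for one scope. Returns the number removed."""
        before = len(self._records)
        self._records = [r for r in self._records if r.scope != scope]
        return before - len(self._records)


store = MemoryStore()
store.remember("client-42", "Prefers quarterly summaries", now=0.0, ttl_seconds=100.0)
store.remember("client-42", "Temporary fee waiver", now=0.0, ttl_seconds=10.0)
store.remember("client-99", "Escalate wire exceptions", now=0.0, ttl_seconds=100.0)
print(store.recall("client-42", now=50.0))
```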

Guardrails and approvals are the control layer. Guardrails validate inputs, outputs, and tool behavior automatically. Approvals pause the agent and route work to a human for review. In practice, these two mechanisms work together. A guardrail might check that a generated number falls within a plausible range. An approval gate might pause the workflow whenever the agent attempts a write operation to a system of record. OpenAI's guardrails documentation, LangGraph's human-in-the-loop middleware, Google's semantic governance policies, and AWS's policy enforcement layer all implement variants of this pattern. The key design question for finance is where the approval boundaries sit. Any action above a dollar threshold, any output that goes external, any write to a production system, any interaction with client data: these are the natural approval gates for most finance workflows.
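The approval boundaries listed above translate fairly directly into a policy function. The sketch below is one possible encoding. `ProposedAction`, its fields, the tool names, and the 10,000 USD threshold are assumptions for illustration only, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    is_write: bool = False
    amount_usd: float = 0.0
    external: bool = False
    touches_client_data: bool = False


APPROVAL_THRESHOLD_USD = 10_000.0   # hypothetical dollar threshold


def within_plausible_range(value: float, low: float, high: float) -> bool:
    """Guardrail: an automatic check that runs without a human."""
    return low <= value <= high


def approval_reasons(action: ProposedAction) -> list:
    """Approval gate: collect every reason this action needs human review."""
    reasons = []
    if action.amount_usd > APPROVAL_THRESHOLD_USD:
        reasons.append("amount above threshold")
    if action.external:
        reasons.append("output goes external")
    if action.is_write:
        reasons.append("write to a production system")
    if action.touches_client_data:
        reasons.append("interaction with client data")
    return reasons


def decide(action: ProposedAction) -> str:
    """Pause for a human if any gate fires, otherwise let the harness execute."""
    return "pause_for_approval" if approval_reasons(action) else "execute"


read_only = ProposedAction(tool="search_docs")
journal = ProposedAction(tool="post_journal_entry", is_write=True, amount_usd=25_000.0)
print(decide(read_only), decide(journal), approval_reasons(journal))
```

Returning the list of reasons, not just a yes or no, gives the reviewer and the audit trail the same explanation for why the workflow paused.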

Orchestration is routing, sub-agent delegation, retries, checkpointing, and branching logic. For most workflows, a single-agent loop is enough. You add orchestration complexity when the work naturally decomposes into specialist roles or when one model/tool profile clearly underperforms on a subset of the task. Anthropic's guidance is explicit on this point: start with simple, composable patterns. Add multi-agent orchestration only when the job requires it.
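Two of the orchestration pieces named above, routing and retries, fit in a few lines. This is a deliberately small sketch: `pick_specialist`, `run_with_retries`, and the specialists themselves are hypothetical, and treating `RuntimeError` as the retryable failure is an assumption.

```python
from typing import Callable

Specialist = Callable[[dict], str]


def pick_specialist(task: dict, specialists: dict, default: str = "generalist") -> str:
    """Route by task kind; fall back to the default agent when no specialist matches."""
    kind = task.get("kind", "")
    return kind if kind in specialists else default


def run_with_retries(fn: Specialist, task: dict, max_attempts: int = 3) -> tuple:
    """Call fn until it succeeds or the attempt budget is spent. Returns (result, attempts)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(task), attempt
        except RuntimeError as exc:      # treated here as a retryable failure
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_retrieval(task: dict) -> str:
    """Hypothetical specialist that fails on its first call, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("upstream timeout")
    return f"retrieved documents for {task['query']}"


specialists = {
    "retrieval": flaky_retrieval,
    "generalist": lambda task: f"answered {task['query']}",
}

task = {"kind": "retrieval", "query": "Q3 exceptions"}
name = pick_specialist(task, specialists)
result, attempts = run_with_retries(specialists[name], task)
print(name, attempts, result)
```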

Telemetry covers traces, metrics, logs, debugging, and evaluation. Every action, tool call, approval decision, latency measurement, token count, and state transition should be recorded. This is your process mining layer plus your audit trail. OpenAI's tracing framework, Google's OpenTelemetry instrumentation, AWS's OTEL-compatible observability, and LangSmith all serve this function. In finance, telemetry is what lets you answer the question your auditors will eventually ask: "Show me exactly what this agent did, with what data, and why."
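For a feel of what gets recorded, here is a hand-rolled trace sink in Python. It is a stand-in for illustration and not the OpenTelemetry API or any vendor's tracing SDK. `TraceSink`, `span`, and the attribute names are hypothetical.

```python
import json
import time
from contextlib import contextmanager


class TraceSink:
    """Collects one record per action, with status and latency."""

    def __init__(self) -> None:
        self.events = []

    @contextmanager
    def span(self, kind: str, **attrs):
        record = {"kind": kind, **attrs}
        start = time.perf_counter()
        try:
            yield record                      # the caller can attach more fields
            record["status"] = "ok"
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self.events.append(record)        # recorded whether the step succeeded or failed

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to ship to a log pipeline."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


sink = TraceSink()
with sink.span("tool_call", tool="lookup_invoice", session="s-1") as rec:
    rec["tokens"] = 42
with sink.span("approval", decision="approved", approver="analyst-7"):
    pass
print(sink.to_jsonl())
```

Because the record is appended in the `finally` clause, failed steps leave a trace too, which is usually what an auditor wants to see first.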

The Capability-Risk Spectrum

Different workflows need different levels of agent autonomy, and each level carries different control requirements.

A prompt-only assistant has low-to-moderate capability, high human control, and limited workflow completion. It answers questions, drafts text, summarizes documents. The risk is low because the human is doing all the acting. This is where most finance teams started in 2023-2024.

A single-agent harness with tools and approvals has moderate-to-high capability, high control if the approvals are well designed, and covers most enterprise finance use cases. Research prep, client follow-up drafting, controls analysis, reporting support, reconciliation assistance. The risk is tool misuse if the approval design is weak. For most finance deployments in 2026, this is the right level of complexity.

A multi-agent harness has high capability but requires careful governance. Multiple specialized agents coordinate on a task, with routing, handoffs, and shared state. Research, complex case handling, investigation triage. The risks are coordination complexity, cost, hidden failure modes that emerge from agent interactions, and debugging difficulty when something goes wrong in the handoff between agents. Robinhood's paired investigating-and-validating agents are a well-governed example.

A long-running autonomous harness has the highest capability and the hardest control requirements. Agents that operate across hours or days, making sequences of decisions with minimal human oversight. The risks are drift, permission creep, context staleness, and weak observability. This pattern only works in narrow expert domains with strong automated tests and well-defined boundaries.

Most finance teams should be building single-agent harnesses with tools and approvals. The temptation to jump to multi-agent architectures is real, but the governance overhead increases significantly with each level. Start narrow. Prove the harness works. Then expand.

What a Minimal Harness Looks Like

A minimal harness for a general enterprise use case has five components wired together: an app/API layer that receives the trigger, an agent orchestrator that runs the loop, an LLM that handles reasoning, a tool router that dispatches to external systems (SQL, CRM, document retrieval), and a supporting layer of session state, guardrails, and a trace sink that constrains and records everything that happens. You can build this on any of the major frameworks in a matter of days.

What a Finance-Grade Harness Requires

A production harness for regulated finance adds several layers on top. A gateway with authentication. PII redaction and policy checks before anything reaches the model. A router or planner agent that assigns work to specialist agents for retrieval, analysis, or client actions. An MCP gateway for standardized tool access. A session store and long-term memory with explicit retention policies. A human approval queue with resume-or-reject logic. And a full trace, log, metrics, and evaluation pipeline that feeds into dashboards, alerts, and SIEM integration.

The jump from minimal to finance-grade is significant, but it's mostly additive. You're wrapping the same core loop in additional control layers. The core architecture stays the same. The governance surface area grows.
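The "mostly additive" point can be shown by wrapping an unchanged core in extra layers. In this sketch `core_agent` stands in for the minimal harness loop, and the wrappers add redaction and authentication around it. All names are hypothetical, and the email-only regex is a toy that falls far short of real PII redaction.

```python
import re
from typing import Callable

Handler = Callable[[str], str]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # toy pattern, for illustration only


def core_agent(request: str) -> str:
    """Stand-in for the minimal harness loop."""
    return f"handled: {request}"


def with_redaction(inner: Handler) -> Handler:
    """Layer: scrub email addresses before anything reaches the core."""
    def handler(request: str) -> str:
        return inner(EMAIL.sub("[REDACTED_EMAIL]", request))
    return handler


def with_auth(inner: Handler, allowed_users: set) -> Callable[[str, str], str]:
    """Layer: reject callers that are not on the allow list."""
    def handler(user: str, request: str) -> str:
        if user not in allowed_users:
            raise PermissionError(f"user {user!r} is not authorized")
        return inner(request)
    return handler


finance_agent = with_auth(with_redaction(core_agent), allowed_users={"analyst-7"})
print(finance_agent("analyst-7", "Email jane.doe@example.com the Q3 summary"))
```

Each layer can be added or removed without touching `core_agent`, which is the property that keeps the governance surface separable from the loop itself.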

Build to Delete

Every piece of harness logic should be built with the assumption that it might need to be deleted. Manus, the agent platform Meta acquired for roughly $2 billion, rebuilt its harness five times in six months. Vercel removed 80% of their agent's tools and got better results. LangChain re-architected their agent three times in a single year.

The harness compensates for current model limitations, and those limitations change with every model release. Capabilities that required complex hand-coded pipelines a year ago can now often be handled by a single prompt within the context window. If you over-engineer the control flow, the next model update can break your system. Build the harness so that its pieces are easy to rip out. Version it, test it, observe it, and be ready to simplify it the moment the model catches up.

In regulated finance, this creates a tension. You need the harness to be robust enough for compliance and auditing, but lightweight enough to evolve as models improve. The resolution is disciplined modularity: each component has a clear boundary, a clear purpose, and a clear removal path. The harness gets simpler over time as the model absorbs more capability.