Agent Harnesses for AI in Finance

For roughly two years, every startup with a chatbot called whatever AI gizmo it had built an agent. A language model with a system prompt and an API key got called an agent. A retrieval pipeline that answered questions about internal documents got called an agent. The term got diluted to the point of uselessness.

An agent, if the word is going to carry any weight, needs to take actions, use tools, operate over multiple steps, and decide what to do next without a human directing each move. A chatbot needs a good prompt. An agent needs an operating environment: tool access, verification, state management, approval gates, an audit trail. The controls are different because the risk profile is different.

That operating environment is now catching up to the models. Tool use got reliable. Multi-step reasoning got competent. Frameworks matured around state management and observability. Finance teams, which sat out the first wave for defensible reasons, are starting to deploy agents into workflows where the work is well-defined enough to govern and the stakes justify the engineering.

This is showing up in production now. Robinhood's financial crimes team pairs every investigating AI agent with a separate validation agent that checks the work before it can proceed. The validation step is enforced structurally in code, not left to a prompt. If the validator flags a problem, the output stops. That setup, the structural enforcement layer around the model, is an agent harness.
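To make "enforced structurally in code" concrete, here is a minimal sketch of an investigator and validator pairing. It is not Robinhood's implementation; the names (`Finding`, `Verdict`, `run_gated`, the stub agents) are hypothetical, and the stubs stand in for model-backed agents so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """Output of the investigating agent (hypothetical shape)."""
    case_id: str
    summary: str
    evidence_ids: tuple[str, ...]


@dataclass(frozen=True)
class Verdict:
    """Result of the validation agent's review."""
    approved: bool
    reason: str


class ValidationBlocked(Exception):
    """Raised when the validator rejects a finding; nothing downstream runs."""


def run_gated(
    case_id: str,
    investigate: Callable[[str], Finding],
    validate: Callable[[Finding], Verdict],
) -> Finding:
    """Run the investigator, then the validator; release output only on approval."""
    finding = investigate(case_id)
    verdict = validate(finding)
    if not verdict.approved:
        # The stop is a code path, not an instruction in a prompt.
        raise ValidationBlocked(f"{case_id}: {verdict.reason}")
    return finding


# Stand-ins for model-backed agents (assumptions for illustration only).
def stub_investigator(case_id: str) -> Finding:
    return Finding(case_id, "Structuring pattern across three accounts", ())


def stub_validator(finding: Finding) -> Verdict:
    if not finding.evidence_ids:
        return Verdict(False, "no evidence cited")
    return Verdict(True, "ok")
```

Calling `run_gated("case-1", stub_investigator, stub_validator)` raises `ValidationBlocked`, because the stub finding cites no evidence. The point of the design is that the investigator has no path to the caller that bypasses the validator.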

An agent harness is the execution infrastructure surrounding a language model: the instruction set, the tools it can access, the verification loops, the approval gates, the state management, the audit trail. The model is a capable analyst. The harness is the operating environment around that analyst: the task queue, the policy manual, the approved systems access, the working papers, the escalation rules, the evidence trail. The model alone can answer questions. Add a harness and it can execute a reconciliation workflow or triage an investigation or draft client follow-ups with CRM context, and you can prove what it did afterward.
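Two of those components, approved tool access and the evidence trail, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference design: the `Harness` class, the `get_ledger_balance` tool, and the account data are all hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable


class ToolNotAllowed(Exception):
    """Raised when the model requests a tool outside the allowlist."""


class Harness:
    def __init__(self, tools: dict[str, Callable[..., Any]]):
        self._tools = dict(tools)  # the only tools the model may call
        self.audit_log: list[dict[str, Any]] = []  # append-only evidence trail

    def _record(self, event: str, **details: Any) -> None:
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "event": event,
            **details,
        })

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        """Execute a model-requested tool call, logging the request and outcome."""
        if name not in self._tools:
            self._record("tool_denied", tool=name, args=kwargs)
            raise ToolNotAllowed(name)
        result = self._tools[name](**kwargs)
        self._record("tool_called", tool=name, args=kwargs, result=result)
        return result

    def export_audit(self) -> str:
        """Serialize the trail for reviewers; default=str covers non-JSON values."""
        return json.dumps(self.audit_log, indent=2, default=str)


def get_ledger_balance(account: str) -> float:
    # Hypothetical read-only tool; a real one would query the ledger system.
    return {"1000-cash": 125_000.0}.get(account, 0.0)


harness = Harness({"get_ledger_balance": get_ledger_balance})
balance = harness.call_tool("get_ledger_balance", account="1000-cash")
```

A request for any tool that was not registered, such as a hypothetical `post_journal_entry`, is denied and logged rather than executed, so the trail records what the model tried to do as well as what it did.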

Harnesses matter more in finance than in most other domains because the cost of being wrong is severe. When a coding agent hallucinates, someone usually catches it during code review. Annoying, but survivable. When a finance agent hallucinates, you're looking at a material misstatement or a compliance finding. The tolerance for unchecked output is effectively zero. The verification layer isn't optional; it's what makes deployment possible. Verification lives in the harness.
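One simple form of harness-level verification is a deterministic tie-out: recompute a figure from source data and compare it to what the agent reported. The sketch below assumes amounts arrive as strings; the function name and the tolerance parameter are hypothetical.

```python
from decimal import Decimal


def verify_reported_total(
    reported_total: str,
    line_items: list[str],
    tolerance: Decimal = Decimal("0.00"),
) -> tuple[bool, Decimal]:
    """Recompute the total from source line items and compare to the agent's figure.

    Amounts are parsed with Decimal to avoid binary floating-point rounding.
    Returns (passed, difference), where difference is reported minus recomputed.
    """
    recomputed = sum((Decimal(x) for x in line_items), Decimal("0"))
    difference = Decimal(reported_total) - recomputed
    return abs(difference) <= tolerance, difference


passed, diff = verify_reported_total("300.31", ["100.10", "200.20"])
# passed is False and diff is Decimal("0.01"): the output should not ship.
```

The check does not ask the model whether it is confident. It recomputes the answer independently, which is what makes it a control rather than a suggestion.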

Morgan Stanley's deployment of AI across its wealth management division shows this at scale. The firm reports over 98% advisor adoption of its AI tools, with document access jumping from 20% to 80% and advisor follow-ups compressing from days to hours. But underneath those numbers is a harness story: the quality bar was met through rigorous evaluation suites, retrieval tuning, compliance quality assurance, and human adjustment loops. The model selection mattered, sure. The controls around the model determined whether the thing could actually ship in a regulated wealth management environment. Morgan Stanley built the harness first, then let the model loose inside it.

Auquan's case tells a similar story from the buy side. The platform targets the unstructured-data problem that private equity, asset management, and investment banking teams deal with constantly: earnings transcripts, filings, news, research reports, all of it scattered across formats and systems. Its agent architecture reportedly delivers up to 95% manual-effort reduction, with over 50,000 hours saved and 50% cost reduction for financial institutions. The value is in orchestration over messy unstructured inputs, with quantifiable throughput gains. The agents execute well-defined retrieval, analysis, and packaging workflows with human checkpoints at the decision-relevant stages. They're closer to a disciplined research process than an autonomous analyst making calls on its own.

Domain-specific models help. BloombergGPT demonstrated that a finance-trained model could outperform similarly sized general-purpose models on financial benchmarks. A better model improves the analyst. You still need the operating layer around the analyst for tool access, approvals, state management, and auditability. The model and the harness solve different problems. Conflating them is how teams end up with a brilliant AI that nobody in compliance will sign off on.

For finance teams evaluating or building agents, the harness architecture matters at least as much as the model choice. Probably more. What verification checks run before the agent's output is delivered? What actions require human approval, and is that approval enforced in code? What happens when the agent fails mid-task: does it checkpoint and resume, or start over? Can you produce an audit trail that shows what the agent did, what data it accessed, and what decisions it made at each step? These are controls questions. Finance teams have been asking versions of them about human workflows for decades. The only thing that's changed is that now the worker is a model.
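Two of these questions, approval enforced in code and checkpoint-and-resume, can be sketched together. This is a minimal illustration, assuming workflow state is JSON-serializable and approvals are recorded outside the agent; `run_workflow`, `ApprovalRequired`, and the step format are hypothetical names, not any framework's API.

```python
import json
from pathlib import Path
from typing import Callable


class ApprovalRequired(Exception):
    """Raised when a gated step is reached without a recorded human approval."""


def run_workflow(
    steps: list[tuple[str, Callable[[dict], dict], bool]],
    checkpoint: Path,
    approvals: set[str],
) -> dict:
    """Run (name, fn, needs_approval) steps in order, checkpointing after each.

    On restart, completed steps are skipped, so a failure mid-task resumes
    from the last checkpoint instead of starting over.
    """
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"done": [], "state": {}}

    for name, fn, needs_approval in steps:
        if name in saved["done"]:
            continue  # already completed in an earlier run
        if needs_approval and name not in approvals:
            # Enforced here, in code: the step cannot run without approval.
            raise ApprovalRequired(name)
        saved["state"] = fn(saved["state"])
        saved["done"].append(name)
        checkpoint.write_text(json.dumps(saved))
    return saved["state"]
```

With steps such as `("match", match_fn, False)` followed by `("post", post_fn, True)`, a first run completes the matching step, writes the checkpoint, and raises `ApprovalRequired("post")`. Once a human records approval for `post`, a second run skips the matching step and resumes at posting. The checkpoint file also serves as a partial record of which steps ran, in order.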

Anthropic's guidance on building agents makes a similar point: find the simplest solution that works, and add complexity only when it is needed. Most enterprises don't need autonomous agent swarms. They need well-governed harnesses around narrow, high-value workflows. In finance, the winning architecture is more control per unit of autonomy. Start with one workflow, one owner, one KPI, one approval policy, one trace dashboard. Build the harness right. Then expand.