Cost engineering 60 Claude agents: prompt caching, Batch API and the Haiku/Sonnet/Opus mix

Marketing OS runs more than 60 production agents on Claude across sales, SEO, content, paid media, social, email and analytics. The headline price-per-million-tokens at Anthropic's list rate would make this economics-prohibitive at any meaningful scale.

It isn't, because the headline number is the wrong number to optimise. Four levers, applied consistently across the portfolio, compound to roughly an order of magnitude reduction in cost-per-deliverable.

Lever one: prompt caching

Anthropic's prompt cache lets you mark up to four cache breakpoints inside a request. Tokens before each breakpoint are stored for five minutes. If the next request starts with the same prefix, the cached portion costs one-tenth the input price.

In a typical Marketing OS agent run, the prompt looks like this:

System prompt (3–6K tokens) — agent identity, tool catalogue, refusal rules
Brand context (4–10K tokens) — the customer's brand voice fingerprint, tone rules, banned phrases, product catalogue
Workflow context (1–3K tokens) — the customer's current campaign, audience, recent activity
User turn (variable)

Items 1 and 2 are stable across runs. We mark them as cache breakpoints. On the second and subsequent runs within five minutes — which is essentially every run during an active workflow — they cost a tenth of full price.

Cache hit rate across the portfolio runs at 80%-plus in production. We measure it daily and alert on drift below 70%. A drop usually means someone added a per-run timestamp or a UUID to the brand context block, which we then refactor out.

Lever two: Batch API for non-realtime work

A surprising amount of marketing work is not real-time. Daily anomaly detection runs at 02:00. Weekly performance narrators run on Sunday night. Monthly client reports compile on the first of the month. None of these need a synchronous response.

Anthropic's Batch API discounts these workloads by 50%. The trade-off is response within 24 hours instead of within seconds. For our scheduled work, 24 hours is fine; we run the batch overnight and pick up the results in the morning shift.

Of our 60+ agents, about 18 run primarily through Batch API. They account for roughly 35% of token volume. The compounded effect is a discount of around 17% on the whole portfolio, just from batching what could be batched.

Lever three: the Haiku / Sonnet / Opus mix

Claude comes in three tiers. Pricing differs by an order of magnitude across the tiers; quality differs by a smaller margin on most workloads. The optimisation is to push as much work as possible to Haiku without quality degradation.

Across the portfolio, our enforced mix is:

15% on Haiku — classification, routing, simple lookups, deterministic transforms
75% on Sonnet — generation, analysis, reasoning, the bulk of agent work
10% on Opus — multi-step strategy, cross-document synthesis, edge cases

The enforcement comes through per-agent budgets in the agent registry. Each agent declares a model tier in its defineAgent() call. The orchestrator (AGT-110) refuses to route work to a tier above the declared budget unless an explicit override flag is set, which logs an audit event.

Promoting a workload from Sonnet to Opus has to be justified with a test set showing the quality lift. We've found about half the time the cheaper model is actually better, because Opus tends to over-elaborate on tasks where concision matters more than depth.

Lever four: per-agent and per-client ceilings

The first three levers are about cost per token. This one is about damage control.

Every agent has a budget — tokens per run, run-time per run, total cost per day. The orchestrator hard-stops a run that breaches the budget and returns a partial result with a budget-exceeded flag. The user can opt in to a higher ceiling for the next run.

Per-client ceilings work the same way at the workspace level. A client on the Starter plan has a monthly Claude spend ceiling; once they're at 90%, we warn; at 100%, the orchestrator throttles. Throttling means switching to longer Batch API turnarounds rather than refusing requests, so the customer experience degrades gracefully.

The ceiling system was the most operationally important thing we built, and it took us a year to do it properly. Before the ceiling, one mis-configured agent could blow a month's budget in a weekend. After, the worst case is a $20 over-run that the orchestrator catches and refunds against next month.

The compounded result

A blog post draft that would cost roughly $0.40 at Claude Sonnet list pricing, with no caching and synchronous calls, comes in at around $0.06 in our production system. Multiply across 60 agents and tens of thousands of daily runs and the difference is what makes the product viable as SaaS.

The lesson, which is true outside AI workloads as well: list pricing is the worst price you'll ever pay. Engineering the cost levers is the real product.

Cost engineering 60 Claude agents: prompt caching, Batch API and the Haiku/Sonnet/Opus mix

Lever one: prompt caching

Lever two: Batch API for non-realtime work

Lever three: the Haiku / Sonnet / Opus mix

Lever four: per-agent and per-client ceilings

The compounded result

Keep reading.

Why Claude is the default AI across every SourceForge product

Have a project that touches what you just read?