AI Agent Cost Optimization: How I Cut My Claude API Bill by 60%


Three months after deploying a suite of AI agents across my client accounts, I got an Anthropic invoice that made me sit down.

It wasn't catastrophic — but it was growing faster than the value we were delivering. Multiple workflows running on Claude Opus, long prompts being re-sent on every call, agents spinning up expensive chains for questions that didn't need them. The kind of waste that's invisible until you actually look at the numbers.

Over the following six weeks, I systematically cut that bill by 60% without degrading quality for any of the agents that matter. This is what actually worked.

The Problem With "Just Use the Best Model"

The default approach when building AI agents is to reach for the most capable model available. That makes sense when you're prototyping. In production, it's a budget killer.

Claude Opus 4.6 is genuinely exceptional. It handles nuance, ambiguity, complex reasoning, and long context better than anything else I've worked with. It's also roughly 5x the price of Claude Sonnet and 20x the price of Claude Haiku per token.

If your agent is using Opus to answer "is this email a support request or a sales inquiry?" — you're massively overpaying. That's a classification task. Haiku does it perfectly. The difference in output quality for that specific task is zero. The difference in cost is 20x.

The first thing I did was audit every agent task and categorize it by what the task actually required:

Tier 1 — Simple classification, extraction, routing (Haiku)

  • Email intent classification
  • Extracting structured data from forms
  • Routing webhooks to the right workflow
  • Yes/no decisions with clear criteria

Tier 2 — Moderate reasoning, generation, summarization (Sonnet)

  • Writing first drafts of client-facing content
  • Summarizing meeting transcripts
  • Generating workflow descriptions
  • Multi-step data processing

Tier 3 — Complex reasoning, nuanced judgment, long-form generation (Opus)

  • Strategic recommendations
  • Analyzing attribution data anomalies
  • Generating proposals or detailed reports
  • Anything requiring domain expertise and judgment

After mapping every agent call to these tiers, I found that about 65% of them were using Opus for Tier 1 tasks. That alone accounted for the majority of the bill.

Model Routing in Practice

The routing logic itself is straightforward in n8n. I have a classification node at the start of most agent chains that decides which model to use based on the task type coming in.

// In an n8n Code node
const taskType = $input.item.json.task_type;
const complexity = $input.item.json.estimated_complexity;

const modelMap = {
  'classify': 'claude-haiku-4-5-20251001',
  'extract': 'claude-haiku-4-5-20251001',
  'route': 'claude-haiku-4-5-20251001',
  'summarize': 'claude-sonnet-4-6',
  'generate': 'claude-sonnet-4-6',
  'analyze': complexity === 'high' ? 'claude-opus-4-6' : 'claude-sonnet-4-6',
  'strategize': 'claude-opus-4-6',
  'report': 'claude-opus-4-6',
};

return { model: modelMap[taskType] || 'claude-sonnet-4-6' };

For agents that handle varied inputs (like an email manager that processes everything from "can you reschedule?" to "here's our Q1 attribution discrepancy analysis"), I run a cheap pre-classification step with Haiku first. It costs roughly $0.001 and tells me whether to route to Sonnet or Opus. The savings from avoiding unnecessary Opus calls easily justify the cost of the pre-classifier.
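A minimal sketch of that pre-classification step, assuming the Anthropic Messages API request shape and the same model IDs as above (the function names and the one-word "simple"/"complex" verdict are illustrative choices, not the exact production prompt):

```javascript
// Sketch: cheap Haiku pre-classifier that decides which model the
// expensive main call should use.
function buildPreClassifierRequest(messageText) {
  return {
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 5, // one word is all we need back
    system: 'Classify the complexity of the incoming message. ' +
            'Respond with exactly one word: "simple" or "complex".',
    messages: [{ role: 'user', content: messageText }],
  };
}

// Map the one-word verdict to the model for the main call.
function routeFromVerdict(verdict) {
  return verdict.trim().toLowerCase() === 'complex'
    ? 'claude-opus-4-6'
    : 'claude-sonnet-4-6';
}

// The actual HTTP call (an n8n HTTP Request node, or fetch) would POST
// the built body to https://api.anthropic.com/v1/messages.
```

The pre-classifier's verdict then feeds the model field of the real request, so Opus only gets invoked when the cheap model says the input warrants it.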

Prompt Caching: The Feature Most People Miss

Anthropic's prompt caching is probably the single highest-leverage optimization available, and it's underused.

The way most agent systems work: you define a system prompt with extensive context (role, instructions, constraints, examples, tool definitions), and then you re-send that entire prompt with every single API call. If your system prompt is 2,000 tokens and you're making 500 calls per day, you're paying to transmit 1 million tokens of identical context daily.

With prompt caching, Anthropic caches the system prompt (or any large static prefix) on their side for up to 5 minutes. Cache hits cost 10% of the original input token price. On long system prompts, this delivers immediate cost reduction.
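The arithmetic behind that claim, as a rough calculator. The $3/MTok input price is an illustrative assumption, and the one-time cache-write premium and output tokens are ignored:

```javascript
// Rough daily input-token cost for a repeated static prefix, with and
// without prompt caching (cache reads at 10% of the base input price).
function dailyInputCost({ systemTokens, callsPerDay, pricePerMTok, cached }) {
  const effectiveRate = cached ? pricePerMTok * 0.1 : pricePerMTok;
  return (systemTokens * callsPerDay / 1_000_000) * effectiveRate;
}

const params = { systemTokens: 2000, callsPerDay: 500, pricePerMTok: 3 };
const withoutCache = dailyInputCost({ ...params, cached: false }); // ≈ $3.00/day
const withCache = dailyInputCost({ ...params, cached: true });     // ≈ $0.30/day
```

On a 2,000-token system prompt at 500 calls a day, that one static prefix drops from about $3.00 to about $0.30 of daily input spend, and the effect scales linearly with prompt length and call volume.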

Implementation requires adding cache_control markers to your request:

{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "You are an automation specialist for VIXI Agency...[2000 tokens of context]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}

The catch: the cache is per-model and expires after 5 minutes of inactivity (each cache hit refreshes the timer). For agents that run continuously, this is nearly free. For batch jobs that run once a day, the benefit is only realized if you process multiple items in a single run rather than as separate, widely spaced calls.

I restructured our overnight batch agents to process all their items in a single API session rather than spawning separate calls for each item. This alone cut batch processing costs by around 40%.

Token Waste: What Your Prompts Are Actually Doing

Run a sample of your agent conversations through a token counter before and after optimization. You'll be surprised.

Common waste patterns I found:

Redundant context. A prompt that explains your business, your client, the task, the output format, the constraints, and then includes all the same information again embedded in the user message. Pick one place to put each piece of context.

Over-specified formatting instructions. If you're getting JSON back and then parsing it, you don't need five paragraphs explaining exactly how to format the JSON. One example is worth a thousand words of description, and a well-placed "Respond only with valid JSON, no explanation" saves you the explanation tokens on every call.

History bloat in multi-turn agents. If your agent maintains conversation history, implement a summarization step every N turns. Keep the last 3 exchanges verbatim, compress everything older into a summary. For a customer support agent that can have 20+ turn conversations, this reduced context sizes by 70%.
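A sketch of that compaction pattern. Here `summarize` stands in for a cheap Haiku summarization call (an assumption about how you'd implement it); the placeholder below just truncates for illustration:

```javascript
// Sketch: keep the last N exchanges verbatim, collapse everything older
// into a single summary turn prepended to the history.
function compactHistory(messages, keepExchanges = 3, summarize = defaultSummarize) {
  const keep = keepExchanges * 2; // one exchange = a user turn + an assistant turn
  if (messages.length <= keep) return messages;
  const older = messages.slice(0, messages.length - keep);
  const recent = messages.slice(messages.length - keep);
  return [
    { role: 'user', content: `Summary of earlier conversation:\n${summarize(older)}` },
    ...recent,
  ];
}

// Placeholder for a real summarization call (e.g. a Haiku request).
function defaultSummarize(messages) {
  return messages.map(m => `${m.role}: ${m.content.slice(0, 80)}`).join('\n');
}
```

Running this before each API call bounds the context size regardless of how long the conversation gets, which is where the savings come from on 20+ turn threads.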

Examples that are too long. Few-shot examples are powerful but expensive. Two crisp examples at 200 tokens each often outperform six elaborate ones at 800 tokens. Test it. You may be paying 4x for marginally better output.

For our main Orus agent (the one I use personally for complex reasoning tasks), I rewrote the system prompt from 3,800 tokens to 1,400 tokens. Performance for actual tasks stayed the same or improved — turns out a bloated prompt introduces noise that the model has to work around.

Batching and Async Processing

Not all agent work needs to be real-time.

I had several workflows that triggered an API call for every single event as it arrived — each email, each form submission, each webhook. For anything that didn't require an immediate response to a user, this was wasteful.

I replaced the event-by-event pattern with a queue + batch processor. Events queue in Supabase. Every 15 minutes, a workflow processes everything in the queue in a single agent session. The agent sees all 12 emails from the last 15 minutes in one call rather than 12 separate calls.
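The batch step itself can be as simple as folding the queued items into one user message before the single API call. The item shape and the delimiters below are assumptions, not the exact production format:

```javascript
// Sketch: combine N queued items into one message so the long system
// prompt is sent (and cached) once instead of N times.
function buildBatchedMessage(items) {
  const body = items
    .map((item, i) => `### Item ${i + 1}\n${item.text}`)
    .join('\n\n');
  return {
    role: 'user',
    content: 'Process each item below independently and return a JSON array ' +
             'with one result object per item, in order.\n\n' + body,
  };
}
```

Asking for results "in order" keeps the response trivially mappable back to the queue rows when you write the results out to Supabase.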

For tasks where batch processing is acceptable, this reduces overhead significantly — fewer API calls means less connection overhead, and batch context often lets the agent spot patterns it would miss processing items individually.

The constraint is latency. If your workflow needs to respond within seconds, batching isn't an option. If it just needs to complete within the hour, batch everything you can.

Output Length Control

By default, models will generate as much text as they think is appropriate. For many tasks, that's more than you need.

The max_tokens parameter is your friend, but it's a blunt instrument. More useful is explicit length instruction in your prompt: "Respond in 2-3 sentences maximum" or "Return only the JSON object, no surrounding text."

For our classification agents, I added a hard constraint: "Your entire response must be a single JSON object with no explanation, preamble, or follow-up." Average response length dropped from ~180 tokens to ~40 tokens. Same task, 78% fewer output tokens, no quality loss.
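Combining both levers, a hard max_tokens ceiling plus the explicit format constraint, might look like the sketch below (the exact JSON shape named in the instruction is illustrative):

```javascript
// Sketch: classification request with a hard token ceiling and an
// explicit output-format constraint, using the Haiku model ID from earlier.
function buildClassificationRequest(email) {
  return {
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 100, // hard ceiling; the prompt should keep responses well under it
    system: 'Classify the email. Your entire response must be a single JSON ' +
            'object of the form {"intent": "...", "confidence": 0.0-1.0} ' +
            'with no explanation, preamble, or follow-up.',
    messages: [{ role: 'user', content: email }],
  };
}
```

The ceiling is the safety net and the instruction does the real work: the model stops generating because the prompt told it to, not because it hit the limit mid-sentence.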

For content generation tasks where length actually matters, I set reasonable ceilings rather than leaving it open-ended. "Write a 400-600 word section on X" produces better, more focused content than "Write a section on X" and costs roughly half as much.

The Full Picture

After six weeks of systematic optimization, here's where I landed:

| Optimization | Cost Reduction |
|---|---|
| Model routing (Opus → Haiku/Sonnet where appropriate) | ~35% |
| Prompt caching on long system prompts | ~15% |
| Prompt compression and cleanup | ~8% |
| Batching overnight jobs | ~5% |
| Output length constraints | ~4% |

Total reduction: approximately 60% of the original bill, with zero degradation on any workflow that matters.

The principle underlying all of it: use the most expensive tool only for work that actually requires it. That sounds obvious. In practice, the default in AI development is to reach for the best available model and assume the cost will figure itself out later. It won't.

Audit your calls. Map your tasks to tiers. Implement caching on anything with a long static prefix. Batch what doesn't need to be real-time. Your future self — and your invoice — will thank you.


Running AI agents in production at your agency? We help teams optimize their AI infrastructure for both performance and cost. Book a call to see where the leverage points are in your stack.