I Looked at My Claude Bill. 90% Was Tokens I Didn't Need.
How MCP tools and database queries are silently draining your AI budget
DISCLAIMER: The opinions here are my own and not those of my employer, Netflix. Although I work at Netflix, none of this work reflects my work there, and none of it is used at Netflix.
Last month, I watched my Claude Code bill hit $287.
Not because I was doing anything special. Just normal development work: debugging a React app, refactoring some Python services, running MCP tools to query my database.
I looked at the token breakdown and found the culprit: 90% of my context window was filled with redundant garbage.
Every time I ran a database query, the MCP tool returned 500 rows. I only needed 3. Every time I searched logs, I got 1,000 entries. I needed the 2 with errors. Every API response was 50KB of nested JSON. The actual information I needed was 200 bytes.
I was paying for tokens I didn’t need. And so are you.
TL;DR: Slash your Claude Code, Cursor, and AI Agent token costs, without loss of accuracy, by using Headroom, an OSS project!
The Dirty Secret of AI Coding Tools
Here’s something Anthropic, OpenAI, and Cursor don’t want to talk about:
Your context window is a tax, not a feature.
Yes, Claude Sonnet 4.5 now supports 1M tokens.
Yes, GPT-4o handles 128K.
But you're paying for every single one of them.
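At $3/million input tokens for Claude Sonnet 4.5 (or $6/million if you exceed 200K context), that 50KB API response is roughly 12,500 tokens, assuming about 4 bytes per token, or close to 4 cents every time it's sent. Do that 100 times a day across a team of 10 developers, and you're burning about $37 a day, or more than $800 over a month of workdays, on... JSON boilerplate.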
And it gets worse with premium models. Claude Opus 4.5 runs $5/$25 per million input/output tokens. GPT-4o is $2.50/$10. The new reasoning models like o1 charge $15/$60, and they generate thousands of "thinking tokens" that you can't even see but still pay for.
The math gets worse with tools.
When you use MCP (Model Context Protocol) to connect your AI to databases, APIs, and file systems, the tool outputs explode. A simple “find users who signed up last week” query might return:
{
  "results": [
    {"id": 1, "name": "Alice", "email": "alice@...", "created_at": "...", "updated_at": "...", "last_login": "...", "preferences": {...}, "metadata": {...}},
    {"id": 2, "name": "Bob", ...},
    // ... 498 more identical structures
  ],
  "pagination": {...},
  "query_metadata": {...}
}
That’s 45,000 tokens. For data you’ll glance at once and never reference again.
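To put a price on that, here is a small worked calculation. It uses only figures from this post (45,000 tokens, $3 per million input tokens); the helper name `input_cost_usd` and the ten follow-up turns are illustrative assumptions.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens * price_per_million_usd / 1_000_000


# One 45,000-token tool result at $3 per million input tokens.
per_call = input_cost_usd(45_000, 3.00)
print(f"one call: ${per_call:.3f}")  # one call: $0.135

# The result stays in the conversation, so without prompt caching it is
# billed again on each later turn. Ten turns is an assumed figure.
print(f"resent over 10 turns: ${per_call * 10:.2f}")  # resent over 10 turns: $1.35
```

The second number is the one that hurts: a result you glance at once keeps getting billed for as long as it sits in the conversation.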
Why Existing Solutions Don’t Work
“Just use summarization!” No.
Summarization is lossy. The LLM throws away information you might actually need. Ever had Claude confidently tell you something wasn’t in the logs, only to find out later it was—but got summarized away?
“Just use RAG!” Still no.
RAG is great for static knowledge bases. It’s terrible for ephemeral tool outputs. You’re not going to embed and index every database query result.
“Just truncate!” The worst option.
Truncation is how you get the “needle in a haystack” problem. The one line that matters is always in the middle of what you cut.
Every “solution” I found was designed for a different problem: helping LLMs handle documents, not helping developers handle tool outputs.
The Insight That Changed Everything
I realized: LLM context is not random data. It’s highly structured.
Think about what fills your context window:
JSON with repeated schemas
Log lines with identical formats
Database rows with the same columns
API responses with nested templates
This isn’t prose. This isn’t creative writing. This is compressible data masquerading as text.
The key insight: You can shrink typical tool outputs by 50-90% without losing any information.
Not summarization. Not truncation. Actual lossless compression that’s reversible when the model needs the full data.
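A tiny illustration of why repeated schemas compress so well: store the shared keys once and keep each row as a bare list of values. This is a generic columnar trick shown only to make the point, not Headroom's actual encoding, and the names `to_columnar` and `from_columnar` are made up for this sketch.

```python
import json


def to_columnar(rows: list[dict]) -> dict:
    """Store the shared keys once and each row as a plain list of values."""
    if not rows:
        return {"keys": [], "rows": []}
    keys = list(rows[0].keys())
    if any(list(r.keys()) != keys for r in rows):
        raise ValueError("rows do not share one schema")
    return {"keys": keys, "rows": [[r[k] for k in keys] for r in rows]}


def from_columnar(packed: dict) -> list[dict]:
    """Exact inverse of to_columnar."""
    return [dict(zip(packed["keys"], values)) for values in packed["rows"]]


# 500 rows with the same four keys, like a typical query result.
rows = [
    {"id": i, "name": f"user{i}", "status": "active", "plan": "free"}
    for i in range(500)
]
packed = to_columnar(rows)

original_size = len(json.dumps(rows))
packed_size = len(json.dumps(packed))
assert from_columnar(packed) == rows  # the round trip loses nothing
print(f"{original_size} -> {packed_size} characters")
```

The key names were repeated 500 times in the original and appear once in the packed form, which is where the savings come from.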
Introducing Headroom
I built Headroom to solve this problem.
It’s a context optimization layer that sits between your app and the LLM. Here’s what it does:
1. SmartCrusher: Statistical Compression
When Headroom sees a tool output with 500 database rows, it doesn’t summarize them. It applies statistical analysis:
Keep anomalies: Errors, outliers, values > 2 standard deviations from the mean
Keep relevant items: BM25/embedding similarity to the user’s query
Keep context: First few and last few items
Compress the rest: Everything else gets represented by its statistics
That 45,000 token result becomes 4,500 tokens. 90% reduction.
And here’s the critical part: if the LLM needs a specific item that was compressed, Headroom can restore it. The compression is reversible.
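The selection rules above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not SmartCrusher's real code: it works on one numeric field, skips the BM25/embedding relevance step entirely, and the names `crush`, `keep_edges`, and `stash` are hypothetical.

```python
import statistics


def crush(rows, field, keep_edges=3, z_threshold=2.0):
    """Keep outliers and edge rows; describe the rest by summary statistics."""
    values = [r[field] for r in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)

    kept, stash = [], {}
    for i, row in enumerate(rows):
        is_edge = i < keep_edges or i >= len(rows) - keep_edges
        is_outlier = stdev > 0 and abs(row[field] - mean) > z_threshold * stdev
        if is_edge or is_outlier:
            kept.append(row)
        else:
            stash[i] = row  # held outside the prompt so it can be restored
    summary = {
        "omitted": len(stash),
        "field": field,
        "mean": round(mean, 2),
        "stdev": round(stdev, 2),
    }
    return kept, summary, stash


# 500 rows of unremarkable latencies with one slow request in the middle.
rows = [{"id": i, "latency_ms": 100 + (i % 5)} for i in range(500)]
rows[250]["latency_ms"] = 5000

kept, summary, stash = crush(rows, "latency_ms")
print(len(kept), summary["omitted"])  # 7 493
print(stash[42] == rows[42])          # True: an omitted row can be restored
```

The model sees seven rows plus a one-line summary, and the slow request survives because it sits far outside the rest of the distribution.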
2. CacheAligner: Fix Your Cache Hits
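Here’s something that drives me crazy. OpenAI and Anthropic both offer prompt caching: once a prompt prefix is cached, later calls that reuse it get a discount on those tokens. On Anthropic you pay a premium to write the cache and then get 90% off cached reads; OpenAI applies its discount automatically.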
Except it never works.
Why? Because your system prompt probably says something like:
You are a helpful assistant. Today is January 15, 2025.
That date changes every day, and caches match on an exact prefix, so everything after the date misses. Cache invalidated.
CacheAligner automatically identifies dynamic content in your prompts and moves it to the end, keeping your static prefixes identical across calls. Suddenly your cache hit rate goes from 5% to 80%.
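Here is a minimal sketch of the idea, assuming the only dynamic content is a date and that splitting on sentence boundaries is good enough. The regex and the function name `align_for_cache` are assumptions made for illustration; Headroom's own detection is likely more general.

```python
import re

# Matches dates written like "January 15, 2025" or "2025-01-15".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b|\b\d{4}-\d{2}-\d{2}\b"
)


def align_for_cache(system_prompt: str) -> tuple[str, str]:
    """Split a prompt into a stable prefix and a dynamic suffix."""
    static, dynamic = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", system_prompt.strip()):
        (dynamic if DATE_PATTERN.search(sentence) else static).append(sentence)
    return " ".join(static), " ".join(dynamic)


monday = "You are a helpful assistant. Today is January 15, 2025. Answer concisely."
tuesday = "You are a helpful assistant. Today is January 16, 2025. Answer concisely."

static_a, dynamic_a = align_for_cache(monday)
static_b, dynamic_b = align_for_cache(tuesday)

print(static_a)              # You are a helpful assistant. Answer concisely.
print(dynamic_a)             # Today is January 15, 2025.
print(static_a == static_b)  # True: the cacheable prefix no longer changes
```

Send the static part first and the dynamic part after it, and the prefix stays byte-for-byte identical from one day to the next.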
3. RollingWindow: Never Break a Tool Call
Context window overflow is catastrophic with tools. If truncation keeps a tool call but drops its response, or keeps a response without its call, the provider may reject the request outright, or the LLM hallucinates.
RollingWindow manages overflow intelligently:
Drops oldest tool outputs first (as atomic units)
Never orphans a tool call from its response
Preserves the last N conversation turns
Never touches the system prompt
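These rules can be sketched as follows, assuming OpenAI-style chat messages where an assistant message carries `tool_calls` and each reply is a `tool` message. This is an illustration, not RollingWindow's real code: the 4-characters-per-token estimate is a rough assumption, "last N turns" is approximated as the last N message groups, and the function names are hypothetical.

```python
def estimate_tokens(message: dict) -> int:
    """Rough size estimate: about 4 characters per token (an assumption)."""
    return len(str(message.get("content") or "")) // 4 + 1


def group_units(messages: list[dict]) -> list[list[dict]]:
    """Bundle each assistant tool call with the tool responses that follow it."""
    units: list[list[dict]] = []
    for msg in messages:
        if msg["role"] == "tool" and units and units[-1][0].get("tool_calls"):
            units[-1].append(msg)
        else:
            units.append([msg])
    return units


def fit_window(messages: list[dict], budget: int, keep_last: int = 2) -> list[dict]:
    """Drop the oldest tool call/response units until the estimate fits."""
    units = group_units(messages)

    def total() -> int:
        return sum(estimate_tokens(m) for unit in units for m in unit)

    i = 0
    while total() > budget and i < len(units) - keep_last:
        if units[i][0].get("tool_calls"):
            units.pop(i)  # call and responses leave together
        else:
            i += 1        # system and user messages are left alone
    return [m for unit in units for m in unit]


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "find errors"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "x" * 4000},
    {"role": "user", "content": "and now?"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_2"}]},
    {"role": "tool", "tool_call_id": "call_2", "content": "y" * 400},
    {"role": "assistant", "content": "Done."},
]

trimmed = fit_window(messages, budget=200)
print(len(messages), "->", len(trimmed))  # 8 -> 6
```

The old 4,000-character tool output and the call that produced it disappear as a pair, so the history that remains is still a valid conversation.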
Real Numbers
I’ve been running Headroom on my own Claude Code and Cursor usage for a month. Here’s what I’m seeing:
That last row matters. I’m not claiming magic. Dense, unique content doesn’t compress. Headroom knows this and doesn’t waste cycles trying.
My monthly costs dropped from ~$280 to ~$110. That’s 60% savings with zero accuracy loss. Here is an exact recording of one of the sessions.
The 5-Minute Setup
Here’s the thing: you don’t have to change your code.
Option 1: Proxy (works with everything)
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
Now, point your tool at localhost:8787:
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Cursor
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
# Any OpenAI-compatible client
export OPENAI_BASE_URL=http://localhost:8787/v1
That’s it. Headroom intercepts requests, compresses tool outputs, optimizes caching, and forwards to the real API.
Option 2: SDK (for fine-grained control)
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# Use exactly like the original
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Check your savings
print(client.get_stats())
# {"session": {"tokens_saved_total": 12500, ...}}
Who Should Use This?
Developers using Claude Code or Cursor: If you’re hitting rate limits or burning through your monthly allocation in a week, Headroom extends your runway.
Teams building AI agents: MCP tools are powerful but expensive. Headroom makes them sustainable.
Startups watching burn rate: Every dollar saved on LLM costs is a dollar for runway.
Enterprises at scale: 50 developers × $200/month savings = $120K/year back in your budget.
What Headroom Is NOT
Let me be clear about limitations:
Not magic: If your content is already dense and unique, there’s nothing to compress.
Not lossy summarization: We don’t use an LLM to summarize. We use statistical compression.
Not free: Headroom adds ~2-5ms latency per request. For most use cases, that’s negligible. For ultra-low-latency needs, measure it.
Not a replacement for good prompt engineering: If your prompts are bad, Headroom won’t fix them.
The Bigger Picture
LLM inference costs are dropping (about 10x per year for commodity models). But usage is growing faster. And reasoning models like o1 are actually increasing costs—they think for thousands of tokens before responding.
The companies winning this race aren’t the ones with the biggest context windows. They’re the ones using context most efficiently.
Headroom is open source (Apache 2.0) because I believe this should be infrastructure, not a SaaS tax. The more people using it, the better it gets.
Try It
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
Star it if it saves you money: github.com/chopratejas/headroom
Report bugs, request features, or contribute: PRs welcome.
I’m building this in public. Follow along, and let’s make AI development sustainable.
Subscribe for more deep dives on AI engineering and practical cost optimization.




