How I Cut Our AI API Bill by 95%: What Actually Worked

When I first looked at our AI infrastructure spend six months ago, I nearly choked on my coffee. We were burning $11,000 a month on LLM calls for a product serving maybe 4,000 active users. The math was brutal — we were subsidizing every interaction, and our unit economics were completely broken.

The worst part? I knew it was bad, but I didn't realise how much was being left on the table. After three months of focused optimization, we're running the same workload for under $400/month. That's not a typo. Here's the playbook, written from the trenches.

If you're a CTO or engineering lead shipping AI features right now, this is for you. No fluff, no hand-waving — just the architecture decisions that moved the needle on our P&L.

The First Mistake: Defaulting to the Most Expensive Model

I'm guilty of this. We started with GPT-4o for everything because it was the path of least resistance. The docs are good, the SDK works out of the box, and when you're moving fast on a prototype, you don't want to think about model selection.

The problem is that "don't think about model selection" becomes a permanent state when nobody on the team questions it. Six months in, we were still sending classification tasks, simple chat replies, and translation requests through the most expensive model in the stack. That's pure waste.

Here's what changed my mind: I built a simple mapping table that matched task complexity to model cost. Just sitting down and writing it out made the absurdity obvious.

| Task Type | What We Were Using | What We Switched To | Savings |
|---|---|---|---|
| Simple chat | GPT-4o at $10.00/M output | DeepSeek V4 Flash at $0.25/M | 97.5% |
| Classification | GPT-4o-mini at $0.60/M | Qwen3-8B at $0.01/M | 98.3% |
| Code generation | GPT-4o at $10.00/M | DeepSeek Coder at $0.25/M | 97.5% |
| Summarization | GPT-4o at $10.00/M | Qwen3-32B at $0.28/M | 97.2% |
| Translation | GPT-4o at $10.00/M | Qwen-MT-Turbo at $0.30/M | 97% |

Look at that classification row. We were paying $0.60/M for routing user inputs into one of six buckets when Qwen3-8B handles it at $0.01/M. That's a 60× multiplier on zero added complexity.

Here's the basic implementation we ended up standardizing across our services:

from openai import OpenAI

# One OpenAI-compatible client pointed at the routing endpoint
client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key="YOUR_GLOBAL_API_KEY"
)

# Task type -> model name (prices are per million output tokens)
MODEL_MAP = {
    "chat": "deepseek-v4-flash",          # $0.25/M output
    "code": "deepseek-coder",             # $0.25/M output
    "classification": "Qwen/Qwen3-8B",    # $0.01/M output
    "summarization": "Qwen/Qwen3-32B",    # $0.28/M output
    "translation": "Qwen-MT-Turbo",       # $0.30/M output
    "reasoning": "deepseek-reasoner",     # $2.50/M output, only for hard stuff
}

def route_request(user_input: str) -> str:
    # classify_complexity() is our own classifier (not shown); it returns a MODEL_MAP key
    task = classify_complexity(user_input)
    return MODEL_MAP[task]

response = client.chat.completions.create(
    model=route_request(user_input),
    messages=[{"role": "user", "content": user_input}]
)

The big lesson here: model selection isn't a one-time decision, it's a per-request routing problem. And the routing logic is trivial — usually a few hundred tokens of classifier output.
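The classify_complexity helper used above isn't shown. As a rough illustration only, here is a rule-based stand-in that returns one of the MODEL_MAP keys. The keyword rules are hypothetical and are not the classifier described in this post, which is model-based; a cheap model call could replace the body without changing the interface.

```python
import re

# Hypothetical keyword rules, checked in order. The task names mirror MODEL_MAP keys.
_RULES = [
    ("translation", re.compile(r"\btranslate\b", re.I)),
    ("summarization", re.compile(r"\bsummar(?:y|ize|ise)\b|\btl;?dr\b", re.I)),
    ("code", re.compile(r"```|\b(?:function|stack trace|refactor|regex)\b", re.I)),
    ("reasoning", re.compile(r"\b(?:prove|step by step|trade-?offs?)\b", re.I)),
]


def classify_complexity(user_input: str) -> str:
    """Return a task label for routing; falls back to 'chat' when no rule matches."""
    for task, pattern in _RULES:
        if pattern.search(user_input):
            return task
    return "chat"
```

Falling back to "chat" keeps the router from raising a KeyError on inputs that match no rule.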

Tiered Routing: Why Pay Premium When Budget Will Do?

After we deployed basic model selection, we still had a problem. Some requests needed the good models. Some didn't. We were paying for the good model on every request because we didn't have a confidence threshold to fall back on.

So we built a tiered routing layer. Try cheap first, escalate only when needed. This is the pattern that took us from "already pretty good" to "absurdly cheap."

def smart_generate(prompt: str, max_budget_tier: int = 3) -> dict:
    """
    Tier 1: Ultra-budget model handles easy queries
    Tier 2: Standard model handles moderate complexity
    Tier 3: Premium model reserved for hard reasoning

    "cost" is the approximate USD price per 1K output tokens of the tier used.
    call_model() and quality_score() are our own helpers.
    """

    # Tier 1: $0.01/M, handles 80%+ of traffic
    tier1_resp = call_model("Qwen/Qwen3-8B", prompt)
    if quality_score(tier1_resp) >= 0.8:
        return {"response": tier1_resp, "tier": 1, "cost": 0.00001}

    # Tier 2: $0.25/M, handles most of the rest
    tier2_resp = call_model("deepseek-v4-flash", prompt)
    if quality_score(tier2_resp) >= 0.9:
        return {"response": tier2_resp, "tier": 2, "cost": 0.00025}

    # Budget capped below tier 3: keep the tier 2 answer instead of
    # calling the same model a second time
    if max_budget_tier < 3:
        return {"response": tier2_resp, "tier": 2, "cost": 0.00025}

    # Tier 3: Premium model, only the hardest 5%
    tier3_resp = call_model("deepseek-reasoner", prompt)
    return {"response": tier3_resp, "tier": 3, "cost": 0.0025}

The real-world result on our customer support chatbot: monthly bill dropped from $420 to $28. That's roughly a 93% reduction from tiered routing alone, on top of the savings we already had from smart model selection.

The reason this works at scale is that quality requirements are bimodal. Most queries are either trivially easy (greetings, simple lookups, FAQ-type questions) or genuinely hard (multi-step reasoning, edge cases). The middle ground is smaller than you'd expect.

Your quality scoring function is the heart of this system. We use a combination of:

• A second cheap model that grades the first response (an LLM-as-judge check)

• Heuristic checks for length, format compliance, and refusal patterns

• Embedding similarity to known-good reference answers for our top query types
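The heuristic part of that list can be sketched in a few lines. This is a minimal illustration, not our production scorer: the function name, the penalty weights, the length bounds, and the refusal phrases are all assumptions picked for the example, and it takes the response text as a plain string.

```python
import re
from typing import Optional

# Hypothetical refusal phrases; extend with whatever your models actually emit.
_REFUSAL = re.compile(
    r"\b(?:i can(?:'|no)t help|i(?:'m| am) unable to|as an ai\b|i do not have access)",
    re.I,
)


def heuristic_quality_score(
    text: str,
    min_chars: int = 20,
    max_chars: int = 4000,
    require_pattern: Optional[str] = None,
) -> float:
    """Score a response in [0, 1] using length, refusal, and format checks only."""
    if not text.strip():
        return 0.0
    score = 1.0
    if len(text) < min_chars:
        score -= 0.5  # suspiciously short answer
    if len(text) > max_chars:
        score -= 0.2  # rambling answer
    if _REFUSAL.search(text):
        score -= 0.6  # looks like a refusal
    if require_pattern and not re.search(require_pattern, text, re.M):
        score -= 0.3  # expected format is missing
    return max(0.0, min(1.0, score))
```

A score like this is cheap enough to run on every response, with the model-graded and embedding-based checks reserved for answers that land near the threshold.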

Response Caching: Free Money

This one's almost embarrassing because it's so obvious in retrospect. We had no caching layer for months. Every request hit the API even when the exact same question had been answered 50 times that day.

FAQ pages, documentation lookups, "how do I reset my password" type queries: these are massively cacheable. Our hit rate now sits between 50% and 80% depending on the surface.

import hashlib
import json
import time

# In-memory cache: request hash -> {"response": ..., "timestamp": ...}
_cache = {}

def cached_chat(model: str, messages: list, ttl: int = 3600):
    """
    Cache identical requests for `ttl` seconds.
    Saves 20-50% on most workloads at zero quality cost.
    """
    # MD5 is only a fast fingerprint here, not a security measure
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    if key in _cache:
        entry = _cache[key]
        if time.time() - entry["timestamp"] < ttl:
            return entry["response"]  # Cache hit: $0 cost

    # Cache miss or expired entry: call the API and store the result
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )

    _cache[key] = {
        "response": response,
        "timestamp": time.time()
    }
    return response

For production we moved this from an in-memory dict to Redis with a 24-hour TTL on most entries. The implementation got a bit more complex around serialization, but the pattern is identical.

One caveat: don't cache personalized responses or anything where the prompt includes user-specific data without normalizing it first. We strip PII from cache keys to avoid serving User A's response to User B.
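To make the normalization step concrete, here is a minimal sketch of building a cache key from a scrubbed copy of the messages. The regexes, placeholder tokens, and function names are assumptions for illustration; real PII detection needs more than two patterns.

```python
import hashlib
import json
import re

# Hypothetical, deliberately simple patterns for emails and phone numbers.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def normalize_text(text: str) -> str:
    """Replace user-specific values with placeholders and collapse case/whitespace."""
    text = _EMAIL.sub("<email>", text)
    text = _PHONE.sub("<phone>", text)
    return " ".join(text.lower().split())


def cache_key(model: str, messages: list) -> str:
    """Hash the model name plus the normalized messages."""
    normalized = [
        {"role": m["role"], "content": normalize_text(m["content"])}
        for m in messages
    ]
    payload = json.dumps({"model": model, "messages": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

This is only safe when the answer does not depend on the values being stripped. If it does, skip the cache for that request.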

Prompt Compression: The Hidden Multiplier

This is where it gets interesting at scale. Every token you don't send is money saved, and most prompts are way longer than they need to be.

We had a system prompt for our RAG pipeline that clocked in around 2,000 tokens. It was thorough, well-organized, and completely bloated. Compressing it to 400 tokens saved us $0.024 per request on DeepSeek V4 Flash.

$0.024 sounds trivial. Multiply by 10,000 requests per day and you're at $240/day. That's $87,600/year saved on a single prompt.
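The arithmetic behind figures like these is linear in the token price, so it is worth running with your own provider's rate. A minimal sketch follows; the function name is hypothetical, and the price is left as a parameter because this post only lists output-token prices.

```python
def prompt_savings(
    tokens_before: int,
    tokens_after: int,
    usd_per_million_tokens: float,
    requests_per_day: int,
) -> dict:
    """Estimate savings from a shorter prompt at a given per-token price."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * usd_per_million_tokens / 1_000_000
    per_day = per_request * requests_per_day
    return {
        "per_request": per_request,
        "per_day": per_day,
        "per_year": per_day * 365,
    }


# Cutting 2,000 tokens to 400 at 10,000 requests/day, at two example rates
high_rate = prompt_savings(2000, 400, 15.00, 10_000)
low_rate = prompt_savings(2000, 400, 0.25, 10_000)
```

A saving of $0.024 per request on a 1,600-token cut corresponds to a rate of $15 per million tokens. At $0.25 per million, the same cut works out to $0.0004 per request, or about $4 per day at 10,000 requests.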

The compression itself is cheap — you use the budget model to summarize context before you send it to the expensive model:

def compress_prompt(text: str, target_ratio: float = 0.5) -> str:
    """
    Compress long prompts using a cheap model.
    target_ratio=0.5 means compress to 50% of original length.
    """
    if len(text) < 500:
        return text  # Not worth compressing

    target_chars = int(len(text) * target_ratio)
    response = call_model(
        "Qwen/Qwen3-8B",
        f"Summarize this content in approximately {target_chars} characters, "
        f"preserving all key instructions and constraints: {text}"
    )
    # call_model returns the full API response, so pull out the text
    return response.choices[0].message.content

# Usage: compress once (for example at startup) and reuse the result,
# so the compression call is not paid on every request
system_prompt = load_full_prompt()  # our own loader (not shown)
compressed = compress_prompt(system_prompt, target_ratio=0.2)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": compressed},
        {"role": "user", "content": user_input}
    ]
)

The trick is to preserve the semantic content while cutting filler. LLMs are remarkably good at this when you ask them to.

A few prompt compression tactics we use beyond model-based summarization:

• Removing redundant examples from few-shot prompts once testing shows the model follows the pattern without them

• Replacing verbose instructions with terse commands ("Be concise" instead of "Please provide responses that are clear, concise, and to the point, avoiding unnecessary verbosity")

• Deduplicating retrieved context chunks before injection

At scale, even a 15% reduction in average prompt length compounds significantly across millions of requests.
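Deduplicating retrieved chunks is the easiest of those tactics to show in code. This is a minimal sketch assuming exact duplicates after whitespace and case normalization; the function name is hypothetical, and near-duplicate detection (for example with embeddings) is out of scope here.

```python
import re
from typing import List


def dedupe_chunks(chunks: List[str]) -> List[str]:
    """Drop chunks that repeat an earlier one, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for chunk in chunks:
        fingerprint = re.sub(r"\s+", " ", chunk).strip().lower()
        if fingerprint and fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(chunk)  # keep the first occurrence, in original order
    return unique
```

Order is preserved so the highest-ranked copy of each chunk is the one that survives.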

Batch Processing: One Call Beats Three

This one's simple. If you have 10 questions to answer, don't make 10 API calls. Make one.

The naive approach:

# Before: one API call per question, so N calls, N copies of the shared overhead, N round trips
results = []
for question in questions:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": question}]
    )
    results.append(response)

The batched approach:

# After: 1 API call, shared system prompt, much faster
batch_prompt = "\n\n".join([f"Question {i+1}: {q}" for i, q in enumerate(questions)])

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "Answer each numbered question in order. Format: '1. [answer]\n2. [answer]'"},
        {"role": "user", "content": batch_prompt}
    ]
)

# Parse out individual answers
answers = parse_numbered_response(response.choices[0].message.content)

The savings come from sharing the system prompt across all questions. You're paying for those shared input tokens once instead of N times. With typical overhead of 100-300 tokens per request for system prompts and message formatting, batching 10 requests saves you 900-2700 input tokens per batch.

The catch is latency — a batched call takes longer than a single call. So this only works for asynchronous workflows: bulk classification, batch summarization, overnight report generation, etc. Don't batch your user-facing chat responses.
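The parse_numbered_response helper in the batched example isn't shown. One possible version, assuming the model follows the "1. answer" format requested in the system prompt, splits on line-leading item numbers:

```python
import re
from typing import List

# Matches "1. " or "1) " at the start of a line
_ITEM = re.compile(r"^\s*(\d+)[.)]\s+", re.M)


def parse_numbered_response(text: str) -> List[str]:
    """Split a numbered reply into a list of answers, in the order they appear."""
    matches = list(_ITEM.finditer(text))
    answers = []
    for i, match in enumerate(matches):
        # Each answer runs from the end of its marker to the start of the next marker
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        answers.append(text[match.end():end].strip())
    return answers
```

In practice, compare len(answers) with len(questions) and fall back to individual calls when they differ, since an answer that itself contains a numbered list will be split incorrectly.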

Vendor Lock-In Is a Strategic Risk, Not Just a Cost Issue

Let me step back from the tactical stuff for a second. The reason model selection, tiered routing, and prompt compression are all possible is that we have access to multiple models through a unified API. This is the strategic move that enables everything else.

If you're locked into a single provider, you can't negotiate, you can't route around outages, and you can't take advantage of price drops when new models launch. Last quarter, three major providers cut prices on their flagship models within six weeks of each other. The teams locked into single vendors missed all three windows.

We use Global API as our primary routing layer. One endpoint, one SDK, every model we need. This gave us three things that mattered:

• Negotiating leverage. When our rep at the underlying model provider knows we can route around them in a day, the conversation about pricing gets more productive.

• Instant failover. When DeepSeek had an outage two months ago, we flipped 30% of traffic to alternative models in under an hour. No code changes, just config.

• Freedom to optimize. Every optimization technique I described above requires being able to call multiple models cheaply. Single-vendor lock-in kills that optionality.

# Multi-provider setup through Global API
client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key="YOUR_GLOBAL_API_KEY"
)

# Swap models without changing SDK or auth
def call_model(model_name: str, prompt: str):
    return client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )

That last block of code looks trivial, but it's the foundation. One client object, one auth flow, every model in the market. When someone tells me they've achieved vendor independence, I ask them how many code paths they had to rewrite to switch providers last time. If the answer isn't "zero," they aren't actually independent.
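The failover described earlier in this section can be layered on top of a function like call_model. The sketch below is an illustration under assumptions, not our production code: call_with_failover is a hypothetical name, the caller function is passed in so the example stays self-contained, and it catches every exception where real code would catch the SDK's specific error types.

```python
from typing import Callable, Sequence, Tuple


def call_with_failover(
    call: Callable[[str, str], object],
    models: Sequence[str],
    prompt: str,
) -> Tuple[str, object]:
    """Try each model in order and return (model_name, result) from the first that succeeds."""
    last_error = None
    for model in models:
        try:
            return model, call(model, prompt)
        except Exception as exc:  # assumption: narrow this to provider/network errors
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

Because the model order is just a list, changing the failover priority is a config change rather than a code change.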

The Combined Stack: Real Numbers

Let me put this together with actual numbers from our production system. We serve about 4,000 active users generating roughly 180,000 LLM requests per month.

Before optimization:

• All requests through GPT-4o at $10.00/M output

• No caching

• No batching

• Full 2,000-token system prompt on every request

• Monthly bill: ~$11,000

After optimization:

• 80% of requests handled by Qwen3-8B at $0.01/M

• 15% handled by DeepSeek V4 Flash at $0.25/M

• 5% escalated to DeepSeek Reasoner at $2.50/M

• Cache hit rate of 60%