AI Engineering·June 12, 2026·12 min read

AI Cost Optimization: How We Cut a Document AI Bill by 99%

A practical AI cost optimization guide built on a real case study: how task-based model routing cut one document platform's AI spend by 99% and trims most LLM bills 30 to 50%.

AI CostsLLMCost OptimizationAI StrategySelf-Hosting

We built a document platform that turns scanned PDFs and photographed forms into clean structured data. The first design sent every page to a frontier multimodal model. It worked beautifully and it was going to bankrupt the thing at scale. So we changed one decision about which model sees which page, and the AI bill dropped by roughly 99%. Same accuracy. Same languages. Same zero-touch automation.

That is the AI cost optimization story this guide is really about. The 99% case is the extreme end, but the lever behind it is the same one that takes 30 to 50 percent off most AI bills: stop sending easy work to your most expensive model.

Why your AI bill exploded even as prices fell

If you are watching your AI spend climb while every vendor pricing page brags about getting cheaper, you are not imagining it. The price per token has fallen ~280× in two years. Over the same period total enterprise AI spend has risen ~320%. Cheaper units, much bigger bills.

Unit prices collapsed; usage grew faster. The result is a rising bill that feels impossible to explain.

The reasons are consistent across every team I have seen. Three of them do most of the damage.

Usage outran the price cuts. Enterprise token consumption has multiplied ~13× since early 2025. Cheaper tokens just got consumed in far greater volume.
Agentic workflows multiplied spend. Teams piloted AI as a single-query chatbot, then shipped multi-step agents. Token use jumped an order of magnitude past what their ROI models assumed.
Everything routes to the flagship model. Most organisations default to "the best model" for every task. A trivial classification that could run for $0.05 per million tokens gets sent to a $30+ per million reasoning engine. The spread between the cheapest and most expensive models is roughly 4,500×.

The result is measurable waste. Studies put typical LLM overspend at 50–90%, and 69% of CFOs believe 10–30% of their cloud bill is wasted. Cloud is now the #2 cost line behind labour. This is no longer an IT footnote. 90% of CFOs say they actively worry about it.

The real problem isn't the model, it's the routing

Cost conversations usually open with "which model is cheapest?" That is the wrong question. The expensive model is not the problem. Sending easy work to it is.

Think of it like staffing. You would not put your principal engineer on password resets. Yet "use GPT-class models for everything" does precisely that, assigning your most expensive resource to your most trivial tasks. The way out is a routing problem first and a pricing problem second.

The fix is a tiered routing layer. Classify each request by how hard it actually is, then send it to the cheapest model that can do the job, with a cache in front so you never pay twice for the same answer.

Task-based routing: a cache absorbs repeats, then each request goes to the cheapest tier that can handle it. The flagship model is reserved for the small slice that truly needs it.

The four levers that deliver the 30–50%

You don't need all four to see results. Together they compound.

Lever 1: Right-size the model to the task

Audit what each AI call actually does. Classification, extraction, short-text summaries, simple routing decisions: most of these never need a frontier model. A small or open-weight model handles them at a fraction of the cost. Keep the expensive models for genuine multi-step reasoning. This one change is usually the biggest win.

Lever 2: Cache aggressively

A huge share of requests are near-duplicates. A semantic cache returns a stored answer for an equivalent query, which turns repeat work into a $0 operation and cuts latency at the same time.

Lever 3: Hybrid hosting where volume justifies it

For predictable, high-volume tasks, a self-hosted open-weight model can run far cheaper than per-call API pricing. The key word is predictable. There is an honest break-even below.

Lever 4: Token discipline

Trim bloated system prompts. Cap context windows. Batch requests where you can, and stop re-sending static context on every call. None of this is glamorous, but it multiplies directly against every request you make.

The 99% case study, in full

Back to the document platform from the top of this article. The intelligent document processing platform we built converts scanned PDFs and photographed forms into clean, structured Markdown across 10+ languages, including hard scripts like Urdu, Arabic and Amharic, with zero manual intervention.

The naive and expensive design sends every page image straight to a frontier multimodal LLM for OCR and extraction. It works. It is also ruinous at thousands of pages a month, because you are paying flagship multimodal prices for pages that are perfectly clean and trivial to read.

The right-sized pipeline flips that. Fast local OCR (RapidOCR) and a layout parser (Docling) handle the overwhelming majority of pages at near-zero marginal cost. Only the pages that come back low-confidence, such as smudged scans, complex tables, or unusual scripts, fall through to the paid LLM. The expensive model becomes a fallback for the hard 1–10%, not the default for everything.

document_router.py

from rapidocr import RapidOCR          # fast, local, ~free per page
from llm import extract_structured     # frontier multimodal — pay per call
 
ocr = RapidOCR()
CONFIDENCE_FLOOR = 0.85
 
def process_page(image):
    # 1. Bulk path: local OCR clears the 90%+ of clean pages at ~zero cost
    result = ocr(image)
    if result.mean_confidence >= CONFIDENCE_FLOOR:
        return result.to_markdown()
 
    # 2. Fallback: only low-confidence / complex pages reach the paid model
    return extract_structured(image)   # the slice that actually needs it

The business outcome speaks the language decision-makers care about:

Same accuracy, same languages, same zero-touch automation, at roughly 1% of the all-frontier cost. That's the routing lever at full stretch.

It also replaced manual data entry that averaged 6–8 minutes per page, so the savings stack. Lower AI spend, and the labour gone too. That is the outcome a CFO signs off on without a second meeting.

When self-hosting actually pays off (and when it doesn't)

This is where most "cut your AI costs" advice goes wrong by yelling "self-host everything!" The honest math is messier, and saying so is exactly why decision-makers should trust the recommendation.

Approach	Where it wins	Watch out for
Managed API	Low/spiky volume, fast iteration, frontier quality on demand	Per-call cost balloons at scale
Self-hosted	High, predictable volume (≈100M+ tokens/day) or strict data-privacy needs	Real cost is 3–5× the raw GPU price plus 10–20 engineer-hours/month
Hybrid (recommended)	Almost everyone. Cheap/self-hosted for the bulk, API for the hard cases	Needs a routing layer (the whole point of this article)

Here is the uncomfortable truth. For the majority of use cases, a managed API is still cheaper once you count the hidden operational load of self-hosting: engineering time, monitoring, GPU utilisation. Self-hosting only flips the math at genuinely high volume, or when privacy forces on-premise. That is why the hybrid, routed approach wins for most teams. You capture self-hosting economics on your predictable bulk traffic and keep frontier quality a fallback away.

Your 30-day AI cost-reduction playbook

A sequence any team can run this month:

Instrument spend. Add per-feature, per-model cost tracking. You can't cut what you can't see, and "where is the money going?" is usually a one-day fix that reveals the 20% of calls driving 80% of cost.
Right-size the top 3 cost drivers. Move your highest-volume simple tasks off the flagship model to a small or mid-tier one. Measure quality side by side. In most cases it is indistinguishable.
Add a cache. Semantic caching on repeat-heavy endpoints. Immediate, compounding savings.
Tighten tokens. Prune system prompts, cap context, batch where possible.
Pilot self-hosting on ONE task. Pick your single highest-volume, most predictable workload, and compare true total cost (not just GPU price) against the API.

Done in order, the first three steps alone typically land the 30 to 50 percent. They are all software, with no new infrastructure to babysit.

Frequently asked questions

How much can I realistically cut my AI costs?

Most teams find 30 to 50 percent from routing and caching alone, with no change to capability. At the extreme, where a heavy task can be moved off a frontier model onto cheap local processing, reductions above 90% are achievable. The document platform above is one such case. The savings scale with how badly your current setup over-routes to expensive models.

Is self-hosting an LLM cheaper than using an API?

Only at high, predictable volume (roughly 100M+ tokens/day) or when data privacy requires on-premise deployment. Self-hosting costs 3 to 5 times the raw GPU price once you add monitoring, ops and engineering time. For most workloads a managed API, or a hybrid that self-hosts only the bulk traffic, comes out cheaper.

Will trimming my AI spend reduce quality or capability?

Done right, no. The savings come from sending easy tasks to cheaper models that handle them just as well, caching repeats, and keeping frontier models for genuinely hard work. Flagship quality stays exactly where it matters. You just stop paying flagship prices for trivial tasks.

What's the fastest way to reduce LLM API costs?

Caching and model right-sizing. Both are software changes you can ship in days with no new infrastructure, and together they usually capture the bulk of the savings. Self-hosting comes later, only for your highest-volume task.

How do I know which tasks to route to cheaper models?

Instrument your spend first, then look at the high-volume, low-complexity calls. Think classification, extraction, short summaries, simple routing. Run a small model against them and compare outputs side by side. Where quality holds, and it usually does, route them down a tier.

The takeaway

Runaway AI spend is rarely a pricing problem you are stuck with. It is a routing problem you can fix. Match each task to the cheapest model that can do it, cache the repeats, self-host the predictable bulk, and keep frontier models for the work that truly needs them. That is how you take 30 to 50 percent off the bill without losing anything users would notice. In the right case, like the document platform, it is 99%.

If your AI bill is climbing faster than your usage justifies, that is the kind of problem I solve for a living. See how I package data and AI work, or tell me about your workload and let's find the savings. For the layers either side of this one, the companion reads are the ETL orchestration tools comparison (Dagster vs Airflow vs Prefect) for the pipeline that feeds your models, and the cloud data warehouse migration guide for the storage layer underneath. If the data pipeline behind all this runs on AWS Glue, its bill has the same hidden defaults waiting to be switched off.

Mirza Hammad Tariq

AWS Data Engineer with 5+ years building production-grade ETL pipelines, cloud data warehouses, and scalable data architectures in Python, SQL, Dagster, and AWS.

Work With Me

Related case studies

AI EngineeringProduction

Intelligent Document Processing Platform

AI-Powered Multilingual OCR & Document Intelligence

Converts scanned PDFs and photographed forms into clean, structured Markdown across 10+ languages, including hard scripts like Urdu, Arabic and Amhari…

AutomatedManual keying eliminated

FastAPICeleryRedisDocling

View Case Study

Keep reading

Continue reading

Data EngineeringJun 5, 2026

Dagster vs Airflow vs Prefect for ETL in 2026

Dagster vs Airflow vs Prefect for ETL in 2026, compared by the one question they disagree on: what a pipeline is actually made of. Field notes from a 50M-records/day build.

DagsterAirflowPrefectETL

Read Article12 min read

Data EngineeringJun 17, 2026

Cloud Data Warehouse Migration: Snowflake vs Redshift vs BigQuery

A cloud data warehouse migration guide to Snowflake vs Redshift vs BigQuery vs Databricks: how to choose on cost, lock-in and performance, and how to de-risk the move.

Data WarehouseSnowflakeRedshiftBigQuery

Read Article14 min read

Data EngineeringJun 25, 2026

AWS Athena Query Optimization: Scan Less, Pay Less

AWS Athena query optimization comes down to one thing: scanning less data. How partitioning, Parquet, bucketing and projection cut what you scan, and the bill.

Amazon AthenaQuery OptimizationCost OptimizationPartitioning

Read Article13 min read

Taking on new projects · Outside IR35

Have a data pipeline or warehouse problem worth solving?

From messy source data to analytics-ready warehouses that cut cost. Let's scope it. I reply within one business day.

Start a Project Connect on LinkedIn