Your OpenAI bill doubled last month and you don't know why

You opened the invoice and it is double last month's. Usage is up, but not 2x. You have 1,000 traces a day and no way to tell which are expensive, which are cheap, or which are burning tokens on retries. Your cost optimization playbook is "move to a cheaper model," which you know is a shortcut that trades quality for savings.

The fix is trace-driven cost analysis. Langfuse records token count and cost per trace. With 4 SQL queries against that data, you can find the specific waste patterns: long prompts that did not need to be long, redundant tool calls, retries that never should have fired, and verbose outputs the user did not ask for. Fixing each pattern gives you 5-20 percent savings without changing the model.

This post is the cost optimization workflow: the 4 queries, how to interpret them, the fixes each one points to, and the cost-per-user view that separates heavy users from waste.

Why are agent costs so hard to debug without trace data?

Because the cost signal is spread across thousands of individual LLM calls, each with its own prompt, model, and token count. Your OpenAI invoice shows one number. To optimize, you need per-call data. Invoice-only cost analysis has 3 specific failure modes:

  1. No attribution. You cannot tell if the spike is driven by more users, longer prompts, more retries, or a bug in one feature.
  2. No waste detection. You cannot spot redundant tool calls, duplicate prompts that could have been served from cache, or overly verbose completions.
  3. No per-user view. You cannot tell if one heavy user is skewing the average, or if the whole distribution shifted.

Langfuse traces give you all 3. The queries below extract the actionable signal in under 10 minutes.

```mermaid
graph TD
    Traces[(Langfuse traces)] --> Q1[Query 1: cost per feature]
    Traces --> Q2[Query 2: longest prompts]
    Traces --> Q3[Query 3: retry waste]
    Traces --> Q4[Query 4: cost per user]

    Q1 --> F1[Fix: feature flag the expensive one]
    Q2 --> F2[Fix: quote extraction or chunking]
    Q3 --> F3[Fix: better retry logic]
    Q4 --> F4[Fix: rate limit heavy users]

    style Traces fill:#dbeafe,stroke:#1e40af
    style F1 fill:#dcfce7,stroke:#15803d
    style F2 fill:#dcfce7,stroke:#15803d
    style F3 fill:#dcfce7,stroke:#15803d
    style F4 fill:#dcfce7,stroke:#15803d
```

Query 1: which feature burns the most tokens?

Group traces by the metadata.feature tag (set this in your Langfuse integration) and sum cost per feature per day.

```sql
-- filename: cost_by_feature.sql
-- description: Cost per feature per day for the last 14 days.
SELECT
  date_trunc('day', start_time) as day,
  metadata->>'feature' as feature,
  SUM(total_cost) as daily_cost,
  COUNT(*) as trace_count,
  AVG(total_cost) as avg_cost_per_trace
FROM traces
WHERE start_time > NOW() - INTERVAL '14 days'
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC;
```

Expected finding: 1 or 2 features dominate. Often they are the ones you launched last (new = unoptimized). Fix: feature-flag the expensive one, add aggressive caching, or rewrite the prompt to be shorter.
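Query 1 only works if every trace carries the tag. If yours do not yet, here is a minimal sketch of what the tagging might look like. The `langfuse.trace(...)` call follows the v2 Python SDK shape, and `trace_metadata` is a hypothetical helper, not part of Langfuse; adapt both to your integration.

```python
# Sketch: make sure every trace carries a 'feature' tag so Query 1 can
# group on metadata->>'feature'. trace_metadata is a hypothetical helper.

def trace_metadata(feature: str, **extra) -> dict:
    """Build the metadata dict attached to every Langfuse trace."""
    if not feature:
        raise ValueError("every trace needs a feature tag")
    return {"feature": feature, **extra}

# Assumed integration point (Langfuse Python SDK v2 shape):
# from langfuse import Langfuse
# langfuse = Langfuse()
# trace = langfuse.trace(name="chat", user_id=user_id,
#                        metadata=trace_metadata("search"))
```

The point is to centralize the tag in one helper so a missing or misspelled feature name fails loudly instead of silently producing a NULL group in Query 1.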

Query 2: which traces have the longest prompts?

```sql
-- filename: longest_prompts.sql
-- description: Top 20 longest-prompt traces in the last 24h.
SELECT
  t.id,
  t.user_id,
  t.name,
  SUM(o.prompt_tokens) as total_prompt_tokens,
  SUM(o.total_cost) as trace_cost
FROM traces t
JOIN observations o ON o.trace_id = t.id
WHERE t.start_time > NOW() - INTERVAL '1 day'
GROUP BY t.id, t.user_id, t.name
ORDER BY total_prompt_tokens DESC
LIMIT 20;
```

Expected finding: a small number of traces use 5-20x more prompt tokens than average. Usually they are retrieving too much context, or the agent is including its full tool history on every turn. Fix: quote extraction before the final LLM call, or context window management.

For the quote extraction pattern that solves this, see the Advanced RAG quote extraction for context compression post.
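To make the mechanics concrete, here is a deliberately naive sketch of the quote-extraction idea: keep only the sentences that overlap with the question instead of sending full retrieved chunks. Real systems use a cheap LLM call for the extraction step; the keyword overlap below is a stand-in, not the production technique.

```python
# Naive quote-extraction sketch: score each sentence in the retrieved
# chunks by word overlap with the question, keep the top few. A real
# implementation would use a cheap LLM call and proper tokenization.

def extract_quotes(question: str, chunks: list[str], max_sentences: int = 4) -> str:
    q_terms = set(question.lower().split())
    scored = []
    for chunk in chunks:
        for sentence in chunk.split(". "):
            overlap = len(q_terms & set(sentence.lower().split()))
            if overlap:
                scored.append((overlap, sentence.strip()))
    # Highest-overlap sentences first; stable sort keeps document order on ties.
    scored.sort(key=lambda s: -s[0])
    return ". ".join(s for _, s in scored[:max_sentences])
```

Even this crude version shows the savings shape: 5 full chunks of 500 tokens each become 3-4 sentences of maybe 100 tokens total.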

Query 3: how much token spend comes from retries?

```sql
-- filename: retry_waste.sql
-- description: Traces with 2+ LLM calls on the same prompt (retry waste).
SELECT
  trace_id,
  LEFT(md5(prompt::text), 8) as prompt_hash,  -- group repeats of the same prompt
  COUNT(*) as attempt_count,
  SUM(total_cost) - MIN(total_cost) as wasted_cost  -- every attempt after the first
FROM observations
WHERE start_time > NOW() - INTERVAL '7 days'
  AND type = 'GENERATION'
GROUP BY trace_id, LEFT(md5(prompt::text), 8)
HAVING COUNT(*) >= 2
ORDER BY wasted_cost DESC
LIMIT 50;
```

Expected finding: 3-10 percent of spend comes from retrying the same LLM call. Usually the retry logic is too aggressive, or a validation failure re-calls the LLM instead of fixing the parser. Fix: tighter retry policy, better output validation, or just cache the first attempt.
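A sketch of what "tighter retry policy" means in code: retry only on transient errors, and let validation failures propagate so they get fixed in the parser instead of burning another LLM call. `TransientError` and `ValidationError` here are placeholder exception types; map them to your client's actual exceptions (rate limits, timeouts vs. schema failures).

```python
# Sketch: retry only on transient errors. A bad output is a parser bug,
# not a reason to re-call the LLM -- that re-call is the waste Query 3 finds.

import time

class TransientError(Exception): ...   # stand-in for rate limits, timeouts
class ValidationError(Exception): ...  # stand-in for bad-output failures

def call_with_retry(fn, max_attempts: int = 3, backoff: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # linear backoff between attempts
        # ValidationError and anything else propagate immediately.
```

The asymmetry is the point: a timeout is worth a second attempt, a malformed JSON response is not, because the same prompt will likely produce the same malformed shape.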

Query 4: cost per user

```sql
-- filename: cost_per_user.sql
-- description: Top 20 users by cost over the last 30 days.
SELECT
  user_id,
  COUNT(DISTINCT id) as trace_count,
  SUM(total_cost) as total_cost,
  AVG(total_cost) as avg_per_trace,
  MAX(total_cost) as max_single_trace
FROM traces
WHERE start_time > NOW() - INTERVAL '30 days'
  AND user_id IS NOT NULL
GROUP BY user_id
ORDER BY total_cost DESC
LIMIT 20;

Expected finding: top 5 users account for 30-60 percent of total spend. Some are legitimate power users; some are scraping or testing your product at the free tier. Fix: rate limit the outliers, or charge them the actual cost.

For the token bucket rate limiter that implements per-user caps, see the Rate limiting FastAPI agents with token buckets post.
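As a taste of the idea, a per-user cap can start as a minimal in-process token bucket. The `capacity` and `refill_rate` values are illustrative, and a real deployment needs shared state (e.g. Redis) across workers; the linked post covers that version.

```python
# Minimal in-process token bucket: each user spends from a bucket that
# refills at a fixed rate. Illustrative only -- multi-worker deployments
# need the bucket state in shared storage.

import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens a user can burst
        self.refill_rate = refill_rate  # tokens restored per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: dict[str, TokenBucket] = {}  # one bucket per user_id
```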

What are the 4 biggest savings patterns?

Ranked by how much each one typically saves:

  1. Quote extraction / context compression (saves 30-60 percent): instead of sending full retrieved chunks to the model, extract only the relevant sentences. Dramatic savings on any RAG system.

  2. Cheaper models for simple tasks (saves 20-40 percent): route classification, extraction, and summarization tasks to Haiku instead of Sonnet. Keep Sonnet for complex reasoning.

  3. Response caching on repeat prompts (saves 10-30 percent): users often ask the same question multiple times per session. Cache the response and serve from cache on exact match.

  4. Tighter retry logic (saves 5-15 percent): many retries re-call the LLM unnecessarily. Validate output more carefully and retry only on specific errors, not on any exception.

Add them in order. Each one pays for the time to implement.
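Pattern 3 is the quickest to prototype. A minimal sketch of an exact-match cache keyed on (model, prompt); an in-process dict works for one worker, and a real deployment would swap in Redis with a TTL.

```python
# Sketch of pattern 3: exact-match response cache keyed on (model, prompt).
# In-process dict for illustration; use Redis with a TTL in production.

import hashlib

class ResponseCache:
    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0  # track hit count to measure the savings

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.md5(f"{model}:{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_llm):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = call_llm(model, prompt)
        self._store[key] = result
        return result
```

Exact match is deliberately conservative: it never serves a stale answer to a different question, and the hit rate it does get (same user re-asking in one session) is pure savings.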

What to do Monday morning

  1. Confirm Langfuse is tagging every trace with a metadata.feature label. If not, add it to your integration first.
  2. Run the 4 queries. Write down the finding from each: which feature is most expensive, longest prompt cost, retry waste, top cost users.
  3. Pick the biggest lever. Usually it is quote extraction or model routing. Both save 20-40 percent with one weekend of work.
  4. Implement the fix. Measure before and after with the same queries. Expect a 10-30 percent drop in total cost.
  5. Add a weekly review: run query 1 every Monday morning and compare to the previous week. Flag any feature whose cost grew by more than 20 percent.
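Step 5 can run as a single query. This sketch assumes the same `traces` schema as Query 1 and Postgres's `FILTER` clause:

```sql
-- Sketch: features whose cost grew more than 20 percent week over week.
-- Assumes the traces schema from Query 1 (Postgres).
SELECT
  metadata->>'feature' as feature,
  SUM(total_cost) FILTER (WHERE start_time > NOW() - INTERVAL '7 days') as this_week,
  SUM(total_cost) FILTER (WHERE start_time <= NOW() - INTERVAL '7 days') as last_week
FROM traces
WHERE start_time > NOW() - INTERVAL '14 days'
GROUP BY 1
HAVING SUM(total_cost) FILTER (WHERE start_time > NOW() - INTERVAL '7 days')
     > 1.2 * SUM(total_cost) FILTER (WHERE start_time <= NOW() - INTERVAL '7 days');
-- Note: features with no last-week spend won't match; check new launches separately.
```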

The headline: agent cost optimization is 4 SQL queries against Langfuse, 4 fixes, and typically a 40-60 percent total savings. No model downgrade, no quality loss, just removing waste.

Frequently asked questions

How do I know if my agent is wasting tokens?

Run the 4 queries above: cost per feature, longest prompts, retry waste, cost per user. If any single feature dominates, any traces have prompt lengths 5x the average, retries contribute more than 5 percent of spend, or the top 5 users use more than 50 percent of tokens, you have waste to address. Most production agents have all 4 patterns.

What is the single biggest cost savings pattern?

Quote extraction or context compression before the final LLM call. Typical savings: 30-60 percent of prompt tokens, which is usually 40-70 percent of total cost. Instead of sending 5 full retrieved chunks, extract the 3-4 sentences that actually matter and send only those. The extra extraction call uses a cheap model and is still a net win.

Should I move my agent to a cheaper model to cut costs?

Not as the first move. Model downgrade trades quality for savings, which is often a false economy. First find the waste patterns and fix them. Quote extraction, response caching, and tighter retries together save 40-60 percent without changing the model. After those, model routing (cheap model for simple tasks, flagship for complex) is the next lever.

How do I track cost by user in Langfuse?

Set the user_id field on every trace in your Langfuse integration. Then query the traces table grouping by user_id and summing total_cost. The default Langfuse integrations for OpenAI, Anthropic, and LangChain all accept a user_id parameter.

How often should I run cost analysis?

Weekly in normal operation, daily during a cost investigation. Set up the 4 queries as Grafana panels so you can check them in one view. Alert on any feature's daily cost doubling compared to the previous week; that catches runaway bugs within 24 hours instead of at month-end invoice time.

Key takeaways

  1. Invoice-only cost analysis cannot find waste. Trace-level data from Langfuse lets you find and fix the specific patterns that burn tokens.
  2. Four queries cover 95 percent of optimization: cost per feature, longest prompts, retry waste, cost per user. Run weekly.
  3. Quote extraction is the single biggest savings pattern. 30-60 percent token reduction, usually without quality loss.
  4. Model routing (cheap model for simple tasks, flagship for complex) is the second lever. 20-40 percent savings with careful task classification.
  5. Rate limit the top 5 cost users. They often account for 30-60 percent of spend and include scrapers and free-tier abusers.
  6. To see cost optimization wired into a full production agent stack with Langfuse, Grafana, and rate limits, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the Langfuse cost tracking documentation including per-model pricing configuration, see the Langfuse cost guide.
