LLM API Cost Breakdown: Claude, Gemini & OpenAI Compared

Every developer building with LLMs learns the same lesson sooner or later: your first cost estimate is wrong. Not because the per-token prices are hard to find, but because the number of tokens you actually send per request - once you factor in conversation history, context, retrieved documents, and system prompts - is much larger than the number of tokens in the user's message.

This guide breaks down the current API pricing for Anthropic (Claude), Google (Gemini), and OpenAI (GPT-4.1), covers the cost levers that actually move the needle in production, and - because it is the piece most pricing guides skip entirely - shows how your memory architecture determines whether your LLM costs are sustainable or not.

If you want to run the numbers on your own usage, Mem0's LLM cost calculator lets you model token costs across models and see the impact of memory-based context management on your actual bill.

What You Are Actually Paying For

Before the model comparison, it helps to be precise about what tokens cost.

Every LLM API charges separately for input tokens (everything you send to the model: the system prompt, conversation history, retrieved context, user message) and output tokens (what the model generates). Input and output rates are different - usually output is 3 to 5 times more expensive per token because generation is computationally more intensive than processing.

A token is roughly 0.75 words or 4 characters in English text. A typical short user message is 50-150 tokens. A system prompt for a production agent might be 500-2,000 tokens. If you are sending the last 20 conversation turns as history, that is another 3,000-10,000 tokens. The tokens add up fast, and every request pays for the full input total.
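The 4-characters-per-token heuristic above can be turned into a quick back-of-envelope estimator. This is a rough sketch only; real token counts come from the provider's tokenizer, and the message sizes below are illustrative placeholders:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_request_tokens(system_prompt: str, history: list[str], user_message: str) -> int:
    """Total billed input tokens for one request: everything sent counts."""
    parts = [system_prompt, *history, user_message]
    return sum(estimate_tokens(p) for p in parts)

system = "x" * 4000          # ~1,000-token system prompt
history = ["y" * 1200] * 20  # 20 conversation turns of ~300 tokens each
msg = "z" * 400              # ~100-token user message

total = estimate_request_tokens(system, history, msg)
print(total)  # → 7100
```

The user typed ~100 tokens; the request bills for ~7,100. That ratio is the whole cost story in miniature.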

This is why the choice between context-stuffing and selective memory retrieval is not just an architecture decision. It is a cost decision.

Claude API Pricing (Anthropic) - March 2026

Anthropic's current production lineup has three tiers, each with a distinct performance-to-cost profile:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 200K standard / 1M extended |
| Claude Sonnet 4.6 | $3.00 (≤200K) / $6.00 (>200K) | $15.00 (≤200K) / $22.50 (>200K) | 200K standard / 1M extended |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |

Batch API: 50% discount across all models for asynchronous workloads.

Prompt caching: Cache writes at 1.25x the base input price for a 5-minute cache, 2x for a 1-hour cache. Cache reads at 0.1x the base input price. For any system prompt or retrieved context that appears in multiple requests, caching is one of the most effective cost levers available.
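The caching multipliers above imply a simple break-even calculation. A sketch using Claude Sonnet 4.6's base input price and an assumed 2,000-token cached prefix reused across 1,000 requests within the cache TTL:

```python
BASE = 3.00                # Claude Sonnet 4.6 input price, $ per 1M tokens
WRITE_RATE = 1.25 * BASE   # 5-minute cache write rate
READ_RATE = 0.10 * BASE    # cache read rate
PREFIX = 2_000             # cached system-prompt tokens (illustrative)
N = 1_000                  # requests reusing the prefix within the TTL

# Without caching: every request pays full price for the prefix.
uncached = N * PREFIX / 1e6 * BASE
# With caching: one write, then (N - 1) cheap reads.
cached = PREFIX / 1e6 * WRITE_RATE + (N - 1) * PREFIX / 1e6 * READ_RATE
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")  # → uncached $6.00 vs cached $0.61
```

The 1.25x write premium is recovered after a handful of cache hits; beyond that, the prefix costs a tenth of its uncached price.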

Sonnet 4.6 is the model most production teams are defaulting to: it has the reasoning capability of a frontier model at a price point that scales. Haiku 4.5 is the right choice for high-throughput, low-latency use cases like classification, extraction, or short-turn conversational tasks where cost-per-call matters more than raw reasoning depth.

For a full breakdown of Claude-specific pricing tiers, Mem0's Anthropic Claude pricing guide covers the model generations in detail.

Gemini API Pricing (Google) - March 2026

Google's Gemini 2.5 lineup has a clear value advantage at scale, particularly for input-heavy workloads:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Gemini 2.5 Pro | $1.25 (≤200K) / $2.50 (>200K) | $10.00 | 1M tokens |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M tokens |

Batch API: 50% discount for asynchronous workloads, bringing Gemini 2.5 Pro to $0.625/$5.00 per million tokens.

Context caching: Up to 90% reduction on cached input tokens, making Gemini particularly cost-effective for applications that repeatedly send the same system context, documents, or retrieved knowledge to the model.

Gemini 2.5 Flash's pricing - $0.30 per million input tokens with flat pricing across its full 1M token context window - makes it the strongest option for volume-sensitive applications. You get a large context window at a price that does not scale with context length.

Gemini 2.5 Pro's 1M token context window with the tiered pricing structure (doubling above 200K tokens) matters: at scale, requests that routinely exceed 200K tokens become expensive quickly. The argument for long-context approaches being "simpler" breaks down at the cost level well before it breaks down architecturally.

GPT-4.1 API Pricing (OpenAI) - March 2026

OpenAI's GPT-4.1 generation provides a strong mid-market option, with a notably generous context window and competitive input pricing:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4.1 | $2.00 | $8.00 | 1M tokens |
| GPT-4.1 Nano | $0.05 | $0.20 | 1M tokens |

Batch API: 50% discount for asynchronous processing.

Prompt caching: Cache reads at 0.1x the standard input price, matching Anthropic's discount on repeated context.

GPT-4.1's 1M token context window at $2.00/M input puts it between Gemini 2.5 Pro and Claude Sonnet 4.6 in both price and context capacity. GPT-4.1 Nano is the most aggressively priced option across all three providers for high-volume, lightweight tasks - $0.05 per million input tokens is significantly cheaper than any equivalent from Anthropic or Google.

Head-to-Head: The Same Workload Across Providers

Numbers in isolation are less useful than numbers applied to a real workload. Take a mid-complexity conversational AI agent: a 500-token system prompt, 3,000 tokens of conversation history passed each turn, a 100-token user message, and a 300-token response.

Total input per request: 3,600 tokens. Total output: 300 tokens.

Cost per 1,000 requests:

| Model | Input cost | Output cost | Total |
|---|---|---|---|
| GPT-4.1 Nano | $0.18 | $0.06 | $0.24 |
| Gemini 2.5 Flash | $1.08 | $0.75 | $1.83 |
| Claude Haiku 4.5 | $3.60 | $1.50 | $5.10 |
| Gemini 2.5 Pro | $4.50 | $3.00 | $7.50 |
| GPT-4.1 | $7.20 | $2.40 | $9.60 |
| Claude Sonnet 4.6 | $10.80 | $4.50 | $15.30 |
| Claude Opus 4.6 | $18.00 | $7.50 | $25.50 |

Scale this to 100,000 requests per month and Claude Opus 4.6 costs $2,550. GPT-4.1 Nano costs $24. The variance between the cheapest and most expensive model is roughly 100x for the same workload.

This is why model selection matters and why the pattern of sending full conversation history on every request compounds the cost problem dramatically.
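The head-to-head table falls out of a few lines of arithmetic. A sketch you can adapt to your own per-request token counts, using the per-1M-token prices from the tables above:

```python
# (input $/1M tokens, output $/1M tokens) from the provider tables above.
PRICES = {
    "GPT-4.1 Nano":      (0.05, 0.20),
    "Gemini 2.5 Flash":  (0.30, 2.50),
    "Claude Haiku 4.5":  (1.00, 5.00),
    "Gemini 2.5 Pro":    (1.25, 10.00),
    "GPT-4.1":           (2.00, 8.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.6":   (5.00, 25.00),
}

def cost_per_1k_requests(model: str, input_tokens: int = 3_600, output_tokens: int = 300) -> float:
    """Dollar cost of 1,000 requests at the given per-request token counts."""
    in_price, out_price = PRICES[model]
    per_request = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return per_request * 1_000

for model in PRICES:
    print(f"{model:18s} ${cost_per_1k_requests(model):.2f}")
```

Swapping in your own `input_tokens` and `output_tokens` is usually more informative than any published benchmark workload.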

The Hidden Cost Driver: How Many Tokens You Send Per Request

The table above assumes 3,000 tokens of history per request. Most teams think that is reasonable. At 10 conversation turns of average length, it is.

But a customer support agent with an ongoing case might have 40-50 turns of history. A coding assistant might be sending 15,000 tokens of file context alongside each message. A research assistant with access to retrieved documents might routinely send 20,000-80,000 token inputs.

At those volumes, the input token cost stops being a line item and starts being the budget.

The full-context approach - sending everything, every time - also happens to be the most computationally expensive path at inference. Mem0's research on the LOCOMO benchmark measured exactly this trade-off: full-context approaches consume roughly 26,000 tokens per conversation on average. Mem0's selective memory retrieval consumes roughly 1,800 tokens - a 93% reduction.

Here is what that reduction looks like on your monthly invoice at 100,000 conversations:

| Provider + Model | Full-context monthly input cost | With Mem0 (1,800 tokens) | Monthly savings |
|---|---|---|---|
| Gemini 2.5 Pro | $3,250 | $225 | $3,025 |
| GPT-4.1 | $5,200 | $360 | $4,840 |
| Claude Sonnet 4.6 | $7,800 | $540 | $7,260 |
| Claude Opus 4.6 | $13,000 | $900 | $12,100 |

These figures cover input tokens only. Output token costs come on top, and the latency impact of large inputs adds engineering cost beyond that.

The 90% token reduction is not a theoretical ceiling. It is the measured output of the two-phase extraction pipeline - extract relevant facts, retrieve only what is needed for the current query - benchmarked in published research (arXiv:2504.19413).
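The savings table is straight multiplication from the benchmark's 26,000-versus-1,800-token averages and the input prices above. A sketch:

```python
CONVERSATIONS = 100_000        # monthly conversations
FULL_CONTEXT_TOKENS = 26_000   # avg input tokens/conversation, full-context (LOCOMO)
MEMORY_TOKENS = 1_800          # avg with selective memory retrieval

INPUT_PRICE = {  # $ per 1M input tokens, standard tier
    "Gemini 2.5 Pro": 1.25,
    "GPT-4.1": 2.00,
    "Claude Sonnet 4.6": 3.00,
    "Claude Opus 4.6": 5.00,
}

def monthly_input_cost(price_per_m: float, tokens_per_conversation: int) -> float:
    return CONVERSATIONS * tokens_per_conversation / 1e6 * price_per_m

for model, price in INPUT_PRICE.items():
    full = monthly_input_cost(price, FULL_CONTEXT_TOKENS)
    mem = monthly_input_cost(price, MEMORY_TOKENS)
    print(f"{model}: ${full:,.0f} -> ${mem:,.0f} (saves ${full - mem:,.0f})")
```

Note the savings scale linearly with the model's input price: the more expensive the model, the more the memory layer pays for itself.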

The Three Cost Levers That Actually Matter

Once you have the right model for your use case, three levers drive most of the remaining cost variation:

1. Context discipline

The cheapest optimization available to any LLM application is sending fewer tokens per request. Practically, this means: not sending the full conversation history when a compact memory retrieval would serve the same purpose, not attaching entire documents when only the relevant sections apply to the current query, and not repeating system context that could be cached.

The context engineering guide covers how to think about what belongs in the context window versus what should live in a memory layer.
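In code, context discipline looks like retrieving a handful of relevant facts instead of replaying the transcript. The sketch below uses a toy keyword-matching store as a stand-in; the `MemoryStore` class and its `search` method are illustrative, not any specific library's API (a production system would use embeddings and a memory layer like Mem0):

```python
class MemoryStore:
    """Toy in-memory fact store; real systems use embeddings + a vector index."""
    def __init__(self):
        self.facts: list[str] = []

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def search(self, query: str, limit: int = 5) -> list[str]:
        # Naive relevance: count shared words between query and fact.
        words = set(query.lower().split())
        scored = sorted(self.facts,
                        key=lambda f: -len(words & set(f.lower().split())))
        return scored[:limit]

def build_prompt(system_prompt: str, store: MemoryStore, user_message: str, k: int = 3) -> str:
    """Context-disciplined prompt: k relevant memories, not the full history."""
    context = "\n".join(f"- {m}" for m in store.search(user_message, limit=k))
    return f"{system_prompt}\n\nRelevant facts:\n{context}\n\nUser: {user_message}"

store = MemoryStore()
store.add("User prefers dark mode")
store.add("User's billing plan is Pro")
store.add("User lives in Berlin")
print(build_prompt("You are a support agent.", store, "what billing plan am I on?", k=1))
```

The prompt stays a few hundred tokens regardless of how long the relationship with the user runs, which is the property the cost tables above reward.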

2. Caching

Prompt caching is available across all three providers and cuts repeated input token costs by 90% (read at 0.1x the base input price). Any application that sends the same system prompt, document set, or conversation prefix on multiple requests should be using caching. It is the highest-leverage cost reduction available without changing the model or the memory architecture.

The mechanics differ slightly: Anthropic supports 5-minute and 1-hour cache TTLs, OpenAI and Google have their own cache management. The reduction is substantial regardless of provider.

3. Batch processing

Every major provider offers 50% off for asynchronous batch workloads. If your application has any processing that does not need to return a response in real time - classification, memory extraction, document processing, scheduled summarization - the Batch API halves your cost on those calls. Many teams are running memory extraction and update operations through the Batch API specifically because the slight latency increase is invisible to users while the cost reduction is real.
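The batch arithmetic is simple but worth making concrete. A sketch assuming an illustrative 500M batchable input tokens per month at Claude Sonnet 4.6's standard input price (the volume is a made-up example, not a benchmark figure):

```python
BASE_INPUT = 3.00              # $ per 1M input tokens (Claude Sonnet 4.6, standard)
BATCH_DISCOUNT = 0.50          # 50% off asynchronous batch workloads
monthly_tokens = 500_000_000   # illustrative: 500M batchable input tokens/month

realtime = monthly_tokens / 1e6 * BASE_INPUT
batched = realtime * (1 - BATCH_DISCOUNT)
print(f"realtime ${realtime:,.0f} vs batch ${batched:,.0f}")  # → realtime $1,500 vs batch $750
```

The only cost of taking the discount is latency, which is why it fits memory extraction and scheduled summarization so well.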

When Long-Context Pricing Matters

The 1M token context windows available on Gemini 2.5 Pro, GPT-4.1, and Claude models (extended mode) represent a genuine capability expansion for certain use cases: full codebase analysis, legal document review, processing entire research papers in a single call.

For those use cases, the cost math changes. At 500,000 input tokens:

● Gemini 2.5 Pro: $2.50/M (long-context tier) × 0.5M = $1.25 per request
● GPT-4.1: $2.00/M × 0.5M = $1.00 per request
● Claude Sonnet 4.6: $6.00/M (long-context tier) × 0.5M = $3.00 per request

Long-context use cases are genuinely served by these expanded windows. But for agents and conversational AI, sending 500K tokens to pass conversation history is the wrong tool. For those applications, the 26,000-versus-1,800-token comparison from the benchmark is the more relevant figure.

A 2026 research paper - "Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents" (arXiv:2603.04814) - addresses this directly: fact-based memory retrieval is not just cheaper, it maintains competitive accuracy at a fraction of the context window cost for persistent agent workloads.

Calculate Your Actual LLM Cost

The pricing tables above give you the rates. What actually determines your bill is the combination of requests per month, tokens per request (input and output), which operations qualify for batch, and how much context you are sending versus retrieving selectively.

Mem0's LLM cost calculator at llmcost.app lets you model this across models. You can input your expected usage volumes, adjust token counts for your specific workload, and see the cost difference between full-context and memory-based context management side by side.

For teams already using Mem0 in production, the long-term memory guide covers the memory scoping patterns that keep token counts low across sessions. The LLM chat history summarization guide covers the architectural decision between summarization, truncation, and memory extraction and the cost implications of each.

The model you choose matters. The number of tokens you send matters more.
