Context Cache
For repeated system prompts, long document contexts, or fixed prefixes in multi-turn conversations, enabling context caching can save up to 90% of input costs.
How It Works
1
Mark for Caching
Add cache_control annotation on content block
2
First Request
Marked portion is cached, billed at 1.25x input price
3
Subsequent Requests
Cache hit, billed at 0.1x input price (save 90%)
Billing Rules
Cache Conditions
- Minimum Token Count:Explicit caching requires marked content ≥ 1024 tokens (some models 256/512)
- Cache TTL:ephemeral type cache is valid for approximately 5 minutes; requests with the same prefix automatically hit during this period
- Maximum Markers:Up to 4
cache_controlmarkers - Implicit vs Explicit:Implicit caching (no markers) is determined automatically by the system with no configuration; explicit caching uses markers to precisely control cache boundaries
- Mutually Exclusive:Explicit and implicit caching are mutually exclusive in the same request; when markers exist, explicit caching takes precedence
Usage: OpenAI Protocol
In /v1/chat/completions requests, change content to array format and add cache_control on the text blocks you want to cache:
Usage: Anthropic Protocol
In /v1/messages requests, also add markers on system or messages content blocks:
Python SDK Example
Supported Models
Response Format
Cache hit information is returned via the usage field, with slight format differences between protocols:
OpenAI Protocol Response
usage.prompt_tokens_details: cached_tokens: 1804 cache_creation_input_tokens: 0
Anthropic Protocol Response
usage: cache_read_input_tokens: 1804 cache_creation_input_tokens: 0
Best Practices
- Place unchanged long content(system prompt,reference documents, code context) at the beginning of messages and mark for caching
- Put user messages last - Cache covers from the beginning of the messages array to the marked position; changing content placed after does not affect cache hits
- Suitable scenarios: RAG document injection, fixed system prompt in multi-turn conversations, Agent tool definitions, code repository context
- Not suitable: Completely different content each request, prompt length below minimum threshold
Chat Completions
OpenAI format complete reference
Anthropic Messages
Anthropic protocol invocation
Model Pricing
View full tiered pricing