Context Cache

For repeated system prompts, long document context, or fixed prefixes in multi-turn conversations, context caching can cut the cost of the cached input tokens by up to 90%.

How It Works

Mark the cache: add a cache_control annotation to a content block
First request: the marked portion is written to the cache and billed at 1.25x the input price
Later requests: the cache is hit and the cached portion is billed at 0.1x the input price (90% savings)

Billing Rules

| Token Type | Billing Multiplier | Response Field | Description |
| --- | --- | --- | --- |
| Cache creation | 1.25x | cache_creation_input_tokens | First request writes the cache, slightly higher than normal input |
| Cache hit | 0.1x | cached_tokens | Later requests hit the cache, saving 90% |
| Normal input | 1x | prompt_tokens - cached portion | Input not marked for caching |
| Output | 1x | completion_tokens | Output billed normally, unaffected by caching |
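
To make the multipliers concrete, the sketch below estimates the input cost of one request from its usage numbers. Only the 1.25x, 0.1x and 1x multipliers come from the table above. The function name `estimate_input_cost`, the `price_per_million` parameter, and the assumption that `prompt_tokens` includes both the cache-hit and cache-creation portions are illustrative and should be checked against real responses.

```python
# Multipliers taken from the billing table; everything else is illustrative.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_HIT_MULTIPLIER = 0.1


def estimate_input_cost(prompt_tokens, cached_tokens, cache_creation_tokens, price_per_million):
    """Estimate the input cost of one request.

    Assumption: prompt_tokens already includes the cached and cache-creation tokens.
    """
    normal_tokens = prompt_tokens - cached_tokens - cache_creation_tokens
    weighted_tokens = (
        normal_tokens
        + cache_creation_tokens * CACHE_WRITE_MULTIPLIER
        + cached_tokens * CACHE_HIT_MULTIPLIER
    )
    return weighted_tokens * price_per_million / 1_000_000
```

For a 10,000-token cached document plus a 50-token question, the first request is weighted as 12,550 input tokens and each later hit as 1,050, against 10,050 without caching. Under these assumptions the cache pays for itself from the first hit.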

Cache Conditions

Minimum tokens: explicit caching requires the marked content to be ≥ 1024 tokens (256/512 for some models)
Cache TTL: ephemeral caches last about 5 minutes, during which requests with the same prefix hit automatically
Max markers: up to 4 cache_control markers per request
Implicit vs explicit: implicit caching (no markers) is decided automatically with no configuration; explicit caching uses markers to precisely control cache boundaries
Mutually exclusive: explicit and implicit caching are mutually exclusive in one request; when markers are present, explicit takes precedence
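
The conditions above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `count_cache_markers` and `check_cache_conditions` are hypothetical helpers, and the token counts of the marked blocks must be supplied by the caller because token counting is model-specific.

```python
MAX_CACHE_MARKERS = 4  # limit stated in the conditions above


def count_cache_markers(payload):
    """Count content blocks carrying a cache_control marker in system and messages."""
    def blocks(content):
        # Plain-string content has no blocks and therefore cannot carry markers.
        return content if isinstance(content, list) else []

    candidates = list(blocks(payload.get("system")))
    for message in payload.get("messages", []):
        candidates.extend(blocks(message.get("content")))
    return sum(
        1 for block in candidates
        if isinstance(block, dict) and "cache_control" in block
    )


def check_cache_conditions(payload, marked_token_counts, min_tokens=1024):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    markers = count_cache_markers(payload)
    if markers > MAX_CACHE_MARKERS:
        problems.append(f"{markers} cache_control markers, limit is {MAX_CACHE_MARKERS}")
    for index, tokens in enumerate(marked_token_counts):
        if tokens < min_tokens:
            problems.append(
                f"marked block {index} has {tokens} tokens, minimum is {min_tokens}"
            )
    return problems
```

Pass a lower `min_tokens` for models whose minimum is 256 or 512.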

Usage: OpenAI Protocol

In a /v1/chat/completions request, change content to an array and add cache_control to the text blocks you want cached:

curl https://nexusflow.hk/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-plus",
    "enable_context_caching": true,
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a financial analysis assistant. Below is the full text of the company annual report (about 50,000 words)...",
            "cache_control": {"type": "ephemeral"}
          }
        ]
      },
      {"role": "user", "content": "Summarize the key risks in this annual report"}
    ]
  }'

Usage: Anthropic Protocol

In a /v1/messages request, similarly add the marker to a content block in system or messages:

curl https://nexusflow.hk/v1/messages \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a code review expert. Below is the full codebase context...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Find the security vulnerabilities in this code"}
    ]
  }'
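
The same request can be issued from Python. This sketch uses only the standard library and mirrors the curl call above. The helper names `build_messages_request` and `send` are illustrative, and the shape of the returned JSON is assumed to follow the Anthropic protocol.

```python
import json
import urllib.request

API_URL = "https://nexusflow.hk/v1/messages"  # endpoint from the curl example


def build_messages_request(api_key, context_text, question, model="claude-sonnet-4-6"):
    """Build (but do not send) a /v1/messages request whose system block is marked for caching."""
    body = {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": context_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_messages_request(api_key, codebase_text, "Find the security vulnerabilities in this code"))` twice within the cache TTL should report a cache write on the first call and a cache hit on the second.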

Python SDK Example

from openai import OpenAI

client = OpenAI(
    api_key="sk-air-xxx",
    base_url="https://nexusflow.hk/v1"
)

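# Placeholder: replace with your own long document text (it must meet the model's minimum cache length)
long_document = "..."

# The long system prompt is written to the cache on the first request; later requests hit it automatically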
response = client.chat.completions.create(
    model="qwen3.5-plus",
    extra_body={"enable_context_caching": True},
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": long_document,  # your long document
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "Please summarize the key points"}
    ]
)

# Check cache hit details; getattr guards against responses that omit the cache fields
details = response.usage.prompt_tokens_details
print(f"Cache hit: {getattr(details, 'cached_tokens', 0)} tokens")
print(f"Cache created: {getattr(details, 'cache_creation_input_tokens', 0)} tokens")

Supported Models

| Provider | Models | Min Cache Length (tokens) |
| --- | --- | --- |
| Qwen | Qwen3.7 Max, Qwen3.6 Max Preview, Qwen3.6 Plus/Flash, Qwen3.5 Plus/Flash, Qwen3 Max, Qwen Plus/Turbo, Qwen VL series, Qwen3 Coder series | 1024 (explicit) / 256 (implicit) |
| DeepSeek | DeepSeek V3.2 | 1024 (explicit) |
| Zhipu GLM | GLM 5.2, GLM 5.1, GLM 5, GLM 4.7 | 512 |
| Kimi | Kimi K2.5, K2.6 | 1024 (explicit) |
| Anthropic | Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 | 1024 |

Response Format

Cache hit information is returned via the usage field, with slightly different formats per protocol:

OpenAI Protocol Response

usage.prompt_tokens_details:
  cached_tokens: 1804
  cache_creation_input_tokens: 0

Anthropic Protocol Response

usage:
  cache_read_input_tokens: 1804
  cache_creation_input_tokens: 0
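
Because the two protocols name the counters differently, client code that handles both may want a single accessor. A minimal sketch, assuming the usage object has already been converted to a plain dict and that the field names are exactly those shown above (`cache_usage` is a hypothetical helper):

```python
def cache_usage(usage):
    """Return (cache_read_tokens, cache_creation_tokens) from either protocol's usage dict."""
    details = usage.get("prompt_tokens_details")
    if details is not None:
        # OpenAI protocol: counters are nested under prompt_tokens_details.
        read = details.get("cached_tokens", 0)
        created = details.get("cache_creation_input_tokens", 0)
    else:
        # Anthropic protocol: counters sit directly on usage.
        read = usage.get("cache_read_input_tokens", 0)
        created = usage.get("cache_creation_input_tokens", 0)
    # Treat explicit nulls the same as missing fields.
    return (read or 0, created or 0)
```

A non-zero first value indicates a cache hit, and a non-zero second value indicates that this request wrote the cache.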

Best Practices

Place unchanging long content (system prompt, reference documents, code context) at the start of messages and mark it for caching
Put user messages last: the cache spans from the start of the messages array to the marker, so content that changes after the marker does not affect cache hits
Good fits: RAG document injection, fixed system prompts in multi-turn chat, agent tool definitions, code repository context
Poor fits: requests with completely different content each time, or prompts below the minimum length
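
The ordering advice above can be captured in a small message builder. This is a sketch under the assumption that the OpenAI-style message format from the earlier examples is used. `build_messages` and its parameters are illustrative names.

```python
def build_messages(system_prompt, reference_document, history, question):
    """Place stable content first and mark the end of the stable prefix for caching."""
    stable_prefix = {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": reference_document,
                # The marker goes on the last unchanging block.
                "cache_control": {"type": "ephemeral"},
            },
        ],
    }
    # Earlier turns and the new question come after the marker.
    return [stable_prefix, *history, {"role": "user", "content": question}]
```

Every call with the same `system_prompt` and `reference_document` produces an identical first message, so the cached prefix stays the same while the conversation grows.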
