nexusflow

Context Cache

For repeated system prompts, long document contexts, or fixed prefixes in multi-turn conversations, context caching can cut the input cost of the cached portion by up to 90%.

How It Works

1. Mark for Caching: Add a cache_control annotation to the content block you want cached.
2. First Request: The marked portion is written to the cache and billed at 1.25x the input price.
3. Subsequent Requests: Cache hits are billed at 0.1x the input price (a 90% saving).
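
To see how the 1.25x write and the 0.1x hits combine over several requests, the sketch below computes the cost of the cached prefix relative to sending it uncached every time. The function name relative_prefix_cost is hypothetical, and the calculation assumes every request after the first hits the cache (that is, they all arrive before the cache expires).

```python
def relative_prefix_cost(requests: int,
                         write_multiplier: float = 1.25,
                         hit_multiplier: float = 0.1) -> float:
    """Cost of the cached prefix over `requests` calls, relative to no caching.

    Assumption: the first request writes the cache and every later request
    is a cache hit. A value below 1.0 means caching is cheaper overall.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    cached_total = write_multiplier + hit_multiplier * (requests - 1)
    uncached_total = 1.0 * requests
    return cached_total / uncached_total


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, round(relative_prefix_cost(n), 3))
```

Under these assumptions a single request costs more than not caching, caching already pays off at the second request, and the saving approaches 90% only as the number of hits grows.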

Billing Rules

| Token Type | Billing Multiplier | Response Field | Description |
| --- | --- | --- | --- |
| Cache creation | 1.25x | cache_creation_input_tokens | First request writes to the cache; slightly higher than normal input |
| Cache hit | 0.1x | cached_tokens | Subsequent requests hit the cache, saving 90% |
| Normal input | 1x | prompt_tokens - cached portion | Input not marked for caching |
| Output | 1x | completion_tokens | Normal output billing, not affected by caching |
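
As a rough illustration of these rules, the sketch below converts the token counts from one response into "normal input token" units. The helper billed_input_units is hypothetical, and it assumes that prompt_tokens already includes both the cache-hit and the cache-creation tokens; check the pricing page if your responses are counted differently.

```python
def billed_input_units(prompt_tokens: int,
                       cached_tokens: int = 0,
                       cache_creation_tokens: int = 0) -> float:
    """Input cost of one request, in units of one normal input token.

    Assumption: prompt_tokens is the total input and already contains the
    cached_tokens and cache_creation_tokens portions.
    """
    normal_tokens = prompt_tokens - cached_tokens - cache_creation_tokens
    if normal_tokens < 0:
        raise ValueError("cached portions exceed prompt_tokens")
    return (normal_tokens * 1.0               # normal input
            + cache_creation_tokens * 1.25    # cache creation
            + cached_tokens * 0.1)            # cache hit


if __name__ == "__main__":
    # First request: 1804 tokens written to the cache, 196 normal tokens.
    print(billed_input_units(2000, cache_creation_tokens=1804))
    # Later request: the same 1804 tokens are read from the cache.
    print(billed_input_units(2000, cached_tokens=1804))
```

Output tokens are left out because caching does not change how they are billed.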

Cache Conditions

  • Minimum Token Count: Explicit caching requires the marked content to be at least 1024 tokens (256 or 512 for some models)
  • Cache TTL: An ephemeral cache entry is valid for approximately 5 minutes; requests with the same prefix hit it automatically during this period
  • Maximum Markers: Up to 4 cache_control markers per request
  • Implicit vs Explicit: Implicit caching (no markers) is applied automatically by the system and needs no configuration; explicit caching uses markers to control cache boundaries precisely
  • Mutually Exclusive: Explicit and implicit caching are mutually exclusive within a request; when markers are present, explicit caching takes precedence
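
The marker and length conditions above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any SDK: the function names are hypothetical, only the messages array is inspected (a separate Anthropic-style system field is not), and the token count must come from your own estimate because no tokenizer is included here.

```python
MAX_CACHE_MARKERS = 4       # limit stated in the conditions above
DEFAULT_MIN_TOKENS = 1024   # some models use 256 or 512 instead


def count_cache_markers(messages: list) -> int:
    """Count content blocks that carry a cache_control annotation."""
    count = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # only array-format content can carry markers
            count += sum(
                1 for block in content
                if isinstance(block, dict) and "cache_control" in block
            )
    return count


def check_cache_request(messages: list, marked_tokens: int,
                        min_tokens: int = DEFAULT_MIN_TOKENS) -> list:
    """Return a list of human-readable problems; an empty list means no issue found.

    marked_tokens is the caller's own estimate of the marked content length.
    """
    problems = []
    markers = count_cache_markers(messages)
    if markers > MAX_CACHE_MARKERS:
        problems.append(
            f"{markers} cache_control markers found, at most {MAX_CACHE_MARKERS} are allowed"
        )
    if markers > 0 and marked_tokens < min_tokens:
        problems.append(
            f"marked content is about {marked_tokens} tokens, below the minimum of {min_tokens}"
        )
    return problems
```

Pass min_tokens=512 or min_tokens=256 for models that list a lower threshold in the supported models table.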

Usage: OpenAI Protocol

In /v1/chat/completions requests, change content to array format and add cache_control on the text blocks you want to cache:

curl https://nexusflow.vip/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-plus",
    "enable_context_caching": true,
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a financial analysis assistant. Here is the full company annual report (approximately 50,000 words)...",
            "cache_control": {"type": "ephemeral"}
          }
        ]
      },
      {"role": "user", "content": "Summarize the core risks of this annual report"}
    ]
  }'

Usage: Anthropic Protocol

In /v1/messages requests, add the same cache_control markers to the content blocks of system or messages:

curl https://nexusflow.vip/v1/messages \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a code review expert. Here is the full codebase context...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Find security vulnerabilities in this code"}
    ]
  }'

Python SDK Example

from openai import OpenAI

client = OpenAI(
    api_key="sk-air-xxx",
    base_url="https://nexusflow.vip/v1"
)

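# Placeholder: replace with your real long document (it must meet the model's minimum cacheable length)
long_document = "..."

# The long system prompt is written to the cache on the first request; later requests with the same prefix hit the cache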
response = client.chat.completions.create(
    model="qwen3.5-plus",
    extra_body={"enable_context_caching": True},
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": long_document,  # Your long document
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {"role": "user", "content": "Please summarize the key points"}
    ]
)

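# Check cache hit status (getattr guards against fields that are absent when caching does not apply)
details = response.usage.prompt_tokens_details
print(f"Cache hit: {getattr(details, 'cached_tokens', 0)} tokens")
print(f"Cache creation: {getattr(details, 'cache_creation_input_tokens', 0)} tokens")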

Supported Models

| Provider | Models | Min Cache Length (tokens) |
| --- | --- | --- |
| Qwen (Tongyi) | Qwen3.7 Max, Qwen3.6 Max Preview, Qwen3.6 Plus/Flash, Qwen3.5 Plus/Flash, Qwen3 Max, Qwen Plus/Turbo, Qwen VL series, Qwen3 Coder series | 1024 (explicit) / 256 (implicit) |
| DeepSeek | DeepSeek V3.2 | 1024 (explicit) |
| GLM (Zhipu) | GLM 5.1, GLM 5, GLM 4.7 | 512 |
| Kimi | Kimi K2.5, K2.6 | 1024 (explicit) |
| Anthropic | Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 | 1024 |

Response Format

Cache hit information is returned via the usage field, with slight format differences between protocols:

OpenAI Protocol Response:

usage.prompt_tokens_details:
  cached_tokens: 1804
  cache_creation_input_tokens: 0

Anthropic Protocol Response:

usage:
  cache_read_input_tokens: 1804
  cache_creation_input_tokens: 0
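
Because the two protocols name the cache-hit field differently, client code that handles both may want a single view of the numbers. The sketch below is one way to do that on the parsed JSON usage object; the function name cache_stats and the returned keys hit and created are hypothetical, and missing fields are treated as zero.

```python
def cache_stats(usage: dict) -> dict:
    """Normalize cache counters from either protocol's usage object.

    OpenAI protocol: counters live under usage["prompt_tokens_details"].
    Anthropic protocol: counters live directly on usage.
    """
    details = usage.get("prompt_tokens_details")
    if details is not None:
        hit = details.get("cached_tokens")
        created = details.get("cache_creation_input_tokens")
    else:
        hit = usage.get("cache_read_input_tokens")
        created = usage.get("cache_creation_input_tokens")
    # Treat absent or null counters as zero.
    return {"hit": hit or 0, "created": created or 0}


if __name__ == "__main__":
    openai_usage = {"prompt_tokens_details": {"cached_tokens": 1804,
                                              "cache_creation_input_tokens": 0}}
    anthropic_usage = {"cache_read_input_tokens": 1804,
                       "cache_creation_input_tokens": 0}
    print(cache_stats(openai_usage))
    print(cache_stats(anthropic_usage))
```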

Best Practices

  • Place unchanged long content (system prompt, reference documents, code context) at the beginning of messages and mark it for caching
  • Put user messages last: the cache covers everything from the beginning of the messages array to the marked position, so content that changes after the marker does not affect cache hits
  • Suitable scenarios: RAG document injection, fixed system prompts in multi-turn conversations, Agent tool definitions, code repository context
  • Not suitable: content that is completely different on each request, or prompts shorter than the minimum threshold
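
The ordering advice above can be captured in a small message builder for the OpenAI protocol. This is only a sketch under the stated layout: build_messages is a hypothetical helper, the stable context goes first with a single cache_control marker, and the conversation history and the new question follow it so they never alter the cached prefix.

```python
def build_messages(stable_context: str, history: list, question: str) -> list:
    """Build a messages array with the cacheable prefix first.

    stable_context: long content that stays identical across requests.
    history: earlier turns as {"role": ..., "content": ...} dicts.
    question: the new user message, placed last.
    """
    cached_system_message = {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": stable_context,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
    return [cached_system_message, *history, {"role": "user", "content": question}]


if __name__ == "__main__":
    context = "Reference document text goes here"  # placeholder content
    first = build_messages(context, [], "Summarize the core risks")
    second = build_messages(context, [], "List the revenue drivers")
    # The marked prefix is identical across both requests.
    print(first[0] == second[0])
```

Keeping the prefix byte-for-byte identical matters: anything that varies per request, such as a timestamp inside the system prompt, would need to move after the marker.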