POST /v1/chat/completions

Chat Completions API

Creates a chat completion response. The endpoint is fully compatible with the OpenAI Chat Completions format and can be used directly with the official OpenAI SDKs (Python / Node.js): only base_url and api_key need to change. It supports streaming, multi-turn conversations, function calling, vision, and more.

✓ Protocol Coverage

/v1/chat/completions supports all models available on NexusFlow, including Qwen, GLM, DeepSeek, Kimi, MiniMax, and more.

Request Endpoint

POST https://nexusflow.hk/v1/chat/completions

Request Headers

Header	Value	Required	Description
`Authorization`	`Bearer <API_KEY>`	*	API key. Created in the console, starts with sk-air-.
`Content-Type`	`application/json`	*	Request body format, always JSON.

Request Parameters

Parameter	Type	Required	Description
`model`	string	*	Model ID, for example qwen3.5-plus or deepseek-v4-flash. See the Models list.
`messages`	array	*	Array of conversation messages. Each message has a role (system / user / assistant / tool) and a content field. content can be a string or an array of content blocks; multimodal content availability depends on the model's capabilities.
`stream`	boolean	-	Whether to enable streaming output. When enabled, tokens are returned incrementally as SSE (Server-Sent Events). Default: `false`
`temperature`	number	-	Sampling temperature, range [0, 2). Higher values are more random, lower values more deterministic. Adjust either this or top_p, not both. Default: `1.0`
`top_p`	number	-	Nucleus sampling probability threshold, range (0, 1]. The model samples only from the smallest token set whose cumulative probability reaches top_p. Default: `1.0`
`max_tokens`	integer	-	Maximum number of tokens to generate. Limits vary by model; the model default is used when not set.
`tools`	array	-	List of available tool/function definitions for function calling. Each tool has a type and a function field.
`tool_choice`	string | object	-	Tool selection strategy. Stably supports "auto", "none", or {"type":"function","function":{"name":"..."}} to specify a function. Forcing a tool is not recommended for thinking mode models. Default: `"auto"`
`stop`	string | string[]	-	A stop word or array of stop words (up to 4). The model ends output immediately when it generates a stop word.
`frequency_penalty`	number	-	Frequency penalty, range [-2.0, 2.0]. Positive values penalize tokens by how often they have appeared, reducing repetition. Default: `0`
`presence_penalty`	number	-	Presence penalty, range [-2.0, 2.0]. Positive values penalize tokens that have already appeared, increasing topic diversity. Default: `0`
`enable_thinking`	boolean	-	Whether to enable thinking mode. Only hybrid thinking models support the true/false toggle; thinking-only models keep thinking even if false is passed.
`stream_options`	object	-	Additional options for streaming requests. Set {"include_usage": true} to return token usage in the final SSE chunk.
`response_format`	object	-	Response format control. Supports {"type":"text"} (default) and {"type":"json_object"} (JSON mode).
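
The ranges documented in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the NexusFlow API: the helper name `check_params` is hypothetical, and it covers only the constraints stated above (required fields, the temperature, top_p, and penalty ranges, and the stop-word limit).

```python
def check_params(body: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    # model and messages are the only required fields.
    for key in ("model", "messages"):
        if key not in body:
            problems.append(f"missing required field: {key}")
    # temperature: [0, 2), upper bound excluded.
    t = body.get("temperature")
    if t is not None and not (0 <= t < 2):
        problems.append("temperature must be in [0, 2)")
    # top_p: (0, 1], lower bound excluded.
    p = body.get("top_p")
    if p is not None and not (0 < p <= 1):
        problems.append("top_p must be in (0, 1]")
    # Both penalties share the range [-2.0, 2.0].
    for key in ("frequency_penalty", "presence_penalty"):
        v = body.get(key)
        if v is not None and not (-2.0 <= v <= 2.0):
            problems.append(f"{key} must be in [-2.0, 2.0]")
    # stop may be a single string or an array of up to 4 strings.
    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 stop words")
    return problems


ok = check_params({"model": "qwen3.5-plus", "messages": [], "temperature": 0.7})
bad = check_params({"model": "qwen3.5-plus", "messages": [], "temperature": 2.0})
```

Per-model limits such as max_tokens are not checked here because they vary by model.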

Code Examples

curl -X POST https://nexusflow.hk/v1/chat/completions \
  -H "Authorization: Bearer sk-air-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-plus",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1000
  }'
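
The same request can be assembled in Python using only the standard library. This is a sketch under the assumptions of the cURL example above (same URL, headers, and body); the helper name `build_request` is illustrative, and the request is built but not sent so the snippet runs without a key or network access.

```python
import json
import urllib.request

API_URL = "https://nexusflow.hk/v1/chat/completions"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request with the two required headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


body = {
    "model": "qwen3.5-plus",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    "temperature": 0.7,
    "max_tokens": 1000,
}
req = build_request("sk-air-your-key", body)

# To send it (requires a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     completion = json.load(resp)
```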

Response Format (Non-streaming)

Non-streaming requests return a complete JSON object whose object field is "chat.completion".

Response Example

{
  "id": "chatcmpl-abc123xyz789",
  "object": "chat.completion",
  "created": 1709123456,
  "model": "qwen3.5-plus",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a branch of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 256,
    "total_tokens": 284
  }
}

Response Fields

Field	Type	Description
`id`	string	Unique identifier for this request, e.g. chatcmpl-abc123xyz789.
`object`	string	Always "chat.completion".
`created`	integer	Creation time as a Unix timestamp (seconds).
`model`	string	The model name actually used.
`choices`	array	Array of generated results (usually one element).
`choices[].index`	integer	Index of the result within the array.
`choices[].message.role`	string	Message role, always "assistant".
`choices[].message.content`	string | null	Generated text content. May be null when the model calls a tool.
`choices[].message.reasoning_content`	string	Chain-of-thought content returned by reasoning models (e.g. QwQ). Non-reasoning models do not return this field.
`choices[].message.tool_calls`	array	Array of tool call requests. Returned only when the model decides to call a tool.
`choices[].finish_reason`	string	Stop reason: stop (natural end), length (reached max_tokens), tool_calls (called a tool).
`usage.prompt_tokens`	integer	Number of tokens consumed by the input.
`usage.completion_tokens`	integer	Number of tokens consumed by the output.
`usage.total_tokens`	integer	Total tokens consumed (prompt_tokens + completion_tokens).
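
As an illustration of reading these fields, the sketch below parses the response example above and extracts the parts most callers need. The helper name `summarize` is hypothetical; it treats a null content as an empty string and treats reasoning_content and tool_calls as optional, as described in the table.

```python
import json

raw = """
{
  "id": "chatcmpl-abc123xyz789",
  "object": "chat.completion",
  "created": 1709123456,
  "model": "qwen3.5-plus",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a branch of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 28, "completion_tokens": 256, "total_tokens": 284}
}
"""


def summarize(completion: dict) -> dict:
    """Pull the commonly used fields out of a non-streaming response."""
    choice = completion["choices"][0]
    message = choice["message"]
    return {
        # content may be null when the model calls a tool
        "text": message.get("content") or "",
        # present only for reasoning models
        "reasoning": message.get("reasoning_content"),
        # present only when the model decides to call a tool
        "tool_calls": message.get("tool_calls", []),
        "finish_reason": choice["finish_reason"],
        "total_tokens": completion["usage"]["total_tokens"],
    }


result = summarize(json.loads(raw))
```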

Streaming Response Format (SSE)

When stream: true, the response is returned incrementally as Server-Sent Events (SSE). Each event starts with data: and the stream ends with data: [DONE]. Each chunk's object field is "chat.completion.chunk".

SSE Data Format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"content":"Machine"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"content":" learning"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":28,"completion_tokens":256,"total_tokens":284}}

data: [DONE]

Chunk Fields

Field	Type	Description
`id`	string	Same request ID as the full response.
`object`	string	Always "chat.completion.chunk".
`choices[].delta.role`	string	Appears only in the first chunk, with value "assistant".
`choices[].delta.content`	string	Incremental text content of this chunk.
`choices[].delta.reasoning_content`	string	Incremental chain-of-thought content of this chunk (reasoning models).
`choices[].delta.tool_calls`	array	Incremental tool call data (streaming function calling).
`choices[].finish_reason`	string | null	Non-null only in the final chunk, indicating the stop reason.
`usage`	object	Returned in the final chunk only when stream_options.include_usage is true, reporting token usage.

Tip: When using the OpenAI SDK, you do not need to parse SSE manually — the SDK handles streaming responses and provides an iterator interface. You only need to parse SSE data yourself when using cURL or a raw HTTP client.
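
For the raw HTTP case mentioned in the tip, the following is a minimal sketch of SSE handling based on the format described above. The helper name `collect_stream` is illustrative; it takes an iterable of already-decoded text lines, ignores anything that is not a data: event, and stops at data: [DONE]. It collects only delta.content; reasoning_content and tool_calls deltas would need similar accumulation.

```python
import json


def collect_stream(lines):
    """Accumulate delta.content from SSE lines until 'data: [DONE]'."""
    text_parts = []
    finish_reason = None
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # usage appears only when stream_options.include_usage is true
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Machine"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" learning"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":28,"completion_tokens":256,"total_tokens":284}}',
    "",
    "data: [DONE]",
]
text, reason, usage = collect_stream(sample)
```

In a real client, the lines would come from the HTTP response body, read and decoded line by line.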

Protocol Pass-through

NexusFlow's /v1/chat/completions passes the OpenAI Chat Completions protocol through unchanged: the request body is forwarded to the upstream as-is, and the response is returned as-is. Extension fields such as tools, tool_choice, response_format, enable_thinking, thinking_budget, enable_search, search_options, seed, top_k, logprobs, and stream_options can be used directly. Actual support depends on the model.

Billing

Tiered Pricing

Model families such as Qwen and GLM use tiered pricing based on the request's input token count. The total prompt tokens of a single request determine the applicable price tier, and input and output are billed at that tier's unit prices respectively.

Example: qwen3-max

Tier	Input Token Range	Input Price ($/M)	Output Price ($/M)
Tier 1	0 ~ 32K	2.5	10
Tier 2	32K ~ 128K	4	16
Tier 3	128K ~ 256K	7	28

For example, a request with 50K input tokens + 2K output tokens is billed at $4/M for input and $16/M for output (Tier 2). See the full tiered pricing on the Pricing page.
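
The tier lookup can be sketched as a small calculation. This is an illustration, not the billing implementation. It uses the qwen3-max prices from the table above and makes two assumptions the page does not state: 1K means 1,000 tokens, and each tier's upper bound is inclusive. The function name `request_cost` is hypothetical.

```python
# (upper bound of prompt tokens, input $/M, output $/M) for qwen3-max.
# Assumptions: 1K = 1,000 tokens; upper bounds are inclusive.
QWEN3_MAX_TIERS = [
    (32_000, 2.5, 10.0),
    (128_000, 4.0, 16.0),
    (256_000, 7.0, 28.0),
]


def request_cost(prompt_tokens: int, completion_tokens: int,
                 tiers=QWEN3_MAX_TIERS) -> float:
    """Return the cost in USD; the prompt size alone selects the tier."""
    for upper, in_price, out_price in tiers:
        if prompt_tokens <= upper:
            return (prompt_tokens * in_price
                    + completion_tokens * out_price) / 1_000_000
    raise ValueError("prompt exceeds the largest documented tier")


# 50K input + 2K output falls in Tier 2: 0.20 (input) + 0.032 (output) USD.
cost = request_cost(50_000, 2_000)
```

Note that the output price also changes with the tier, even though only the input token count selects it.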

Context Cache (Prompt Caching)

Context caching is supported when calling via /v1/messages (Anthropic protocol). For repeated system prompts or long documents, the upstream automatically caches the prompt prefix, and subsequent requests get a discount on the cached portion:

Token Type	Billing Multiplier	Description
cache_creation_input_tokens	1.25x input price	First write to cache, slightly higher than regular input
cache_read_input_tokens	0.1x input price	Cache hit, 90% discount
input_tokens (non-cached)	1x input price	Billed normally

/v1/chat/completions supports explicit caching via the enable_context_caching: true parameter. /v1/messages (Anthropic protocol) supports the cache_control content block annotation. Both protocols automatically benefit from implicit cache discounts.
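
The effect of the cache multipliers on input cost can be worked through as follows. This is a sketch under stated assumptions: the three token counts are taken to be disjoint (input_tokens counts only the non-cached portion), the $4/M input price and the token counts are hypothetical example values, and the function name `input_cost_with_cache` is illustrative.

```python
def input_cost_with_cache(input_tokens: int,
                          cache_creation_tokens: int,
                          cache_read_tokens: int,
                          input_price_per_m: float) -> float:
    """Return the input-side cost in USD using the documented multipliers."""
    weighted_tokens = (
        input_tokens * 1.0              # non-cached input, billed normally
        + cache_creation_tokens * 1.25  # first write to cache
        + cache_read_tokens * 0.1       # cache hit
    )
    return weighted_tokens * input_price_per_m / 1_000_000


# Hypothetical scenario: a 10,000-token shared prefix plus 1,000 fresh tokens,
# at an assumed input price of $4/M.
first_call = input_cost_with_cache(1_000, 10_000, 0, 4.0)   # prefix written
later_call = input_cost_with_cache(1_000, 0, 10_000, 4.0)   # prefix read
```

In this scenario the first call costs slightly more than an uncached request, and each later call that hits the cache costs far less.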

Notes

The max_tokens limit varies by model; see the Models list for each model's limits.
Adjust only one of temperature or top_p; setting both may produce unpredictable results.
In streaming output, only the final chunk's finish_reason is non-null, marking the end of generation.
For image understanding, use multimodal models such as the Qwen-VL series. content must be an array that includes an image_url type.
For function calling, use model families that support tools, such as Qwen, DeepSeek, and GLM.
Thinking mode (enable_thinking) support varies by model ID; see the support matrix in the Parameter Matrix.
The request body is passed through to the upstream protocol; extension fields beyond this doc (such as thinking_budget, enable_search, search_options) can be used directly, with actual support depending on the model.
For the full parameter reference and model compatibility matrix, see the Parameter Matrix.
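
The notes on image understanding and function calling can be illustrated with two request bodies. These are sketches in the OpenAI-compatible format the endpoint accepts. The model IDs are placeholders, the image URL is an example address, and the `get_weather` tool is a hypothetical function defined only for illustration.

```python
import json

# Image understanding: content is an array that includes an image_url block.
vision_body = {
    "model": "your-vision-model-id",  # placeholder, e.g. a Qwen-VL series model
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
}

# Function calling: each tool has a type and a function field.
tools_body = {
    "model": "your-tool-capable-model-id",  # placeholder
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical function
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

# Both bodies serialize to JSON ready to be sent as the request body.
vision_json = json.dumps(vision_body)
tools_json = json.dumps(tools_body)
```

When the model decides to call the tool, the response carries choices[].message.tool_calls and finish_reason is tool_calls.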

Models

Browse all available models and capabilities

Error Codes

Error code reference and troubleshooting guide

Rate Limits

Request rate limits and quotas