
Chat Completions API

Creates a chat completion. The endpoint is fully compatible with the OpenAI Chat Completions format, so the official OpenAI SDK (Python / Node.js) works after changing only base_url and api_key. Supported capabilities include streaming output, multi-turn conversation, Function Calling, and vision understanding.

Request Endpoint

POST https://nexusflow.vip/v1/chat/completions

Request Headers

| Header | Value | Required | Description |
| --- | --- | --- | --- |
| Authorization | Bearer <API_KEY> | Yes | API key. Create one in the dashboard; keys start with sk-air-. |
| Content-Type | application/json | Yes | Request body format; always JSON. |

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID, e.g. qwen3.5-plus or deepseek-v4-flash. See the Model List for all IDs. |
| messages | array | Yes | Chat message array. Each message has a role (system / user / assistant / tool) and a content field. content can be a string or a content array; multimodal content availability depends on the model's capabilities. |
| stream | boolean | No | Whether to enable streaming output. When enabled, the response is returned incrementally in SSE (Server-Sent Events) format. Default: false. |
| temperature | number | No | Sampling temperature, range [0, 2). Higher values produce more random output; lower values produce more deterministic output. Adjust either temperature or top_p, not both. Default: 1.0. |
| top_p | number | No | Nucleus sampling probability threshold, range (0, 1]. The model samples only from the most probable tokens whose cumulative probability reaches top_p. Default: 1.0. |
| max_tokens | integer | No | Maximum number of tokens to generate. Upper limits differ by model; if not set, the model's default value is used. |
| tools | array | No | List of tools/functions available for Function Calling. Each tool has type and function fields. |
| tool_choice | string or object | No | Tool calling strategy. Supports "auto", "none", or {"type":"function","function":{"name":"..."}} to force a particular function. Forcing a specific tool is not recommended for thinking-mode models. Default: "auto". |
| stop | string or string[] | No | Stop word or array of stop words (up to 4). Generation stops immediately when the model produces a stop word. |
| frequency_penalty | number | No | Frequency penalty, range [-2.0, 2.0]. Penalizes tokens according to how often they already appear in the generated text, reducing repetition. Default: 0. |
| presence_penalty | number | No | Presence penalty, range [-2.0, 2.0]. Penalizes tokens that have already appeared, increasing topic diversity. Default: 0. |
| enable_thinking | boolean | No | Whether to enable thinking mode. Only hybrid thinking models can be toggled with true/false; pure thinking models keep thinking even if false is passed. |
| stream_options | object | No | Additional options for streaming requests. Set {"include_usage": true} to return token usage in the final SSE chunk. |
| response_format | object | No | Response format control. Supports {"type":"text"} (default) and {"type":"json_object"} (JSON mode). |
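
The ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch rather than the service's own validation: the function name check_sampling_params is hypothetical, and the server may accept or reject values differently from what the table suggests.

```python
from typing import Any


def check_sampling_params(payload: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a request payload (empty if none)."""
    problems: list[str] = []

    # model and messages are the only required fields in the table.
    for required in ("model", "messages"):
        if required not in payload:
            problems.append(f"missing required field: {required}")

    # temperature uses a half-open range: 0 is allowed, 2 is not.
    temperature = payload.get("temperature")
    if temperature is not None and not (0 <= temperature < 2):
        problems.append("temperature must be in [0, 2)")

    # top_p excludes 0 and includes 1.
    top_p = payload.get("top_p")
    if top_p is not None and not (0 < top_p <= 1):
        problems.append("top_p must be in (0, 1]")

    for name in ("frequency_penalty", "presence_penalty"):
        value = payload.get(name)
        if value is not None and not (-2.0 <= value <= 2.0):
            problems.append(f"{name} must be in [-2.0, 2.0]")

    # stop may be a single string or a list of up to 4 strings.
    stop = payload.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 entries")

    return problems
```

An empty list means the payload passed these local checks only; model-specific limits such as the max_tokens ceiling are not covered.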

Code Examples

curl -X POST https://nexusflow.vip/v1/chat/completions \
  -H "Authorization: Bearer sk-air-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-plus",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1000
  }'
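
The same request can be sketched in Python with only the standard library, which makes the two required headers and the JSON body explicit. The helper names build_chat_request and send_chat_request are hypothetical, and the 60-second timeout is an arbitrary illustrative choice.

```python
import json
import urllib.request

API_URL = "https://nexusflow.vip/v1/chat/completions"


def build_chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling send_chat_request(build_chat_request("sk-air-your-key", payload)) with the payload from the cURL example would perform the same non-streaming call.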

Response Format (Non-streaming)

Non-streaming requests return a complete JSON object. The object field value is "chat.completion".

Response Example

{
  "id": "chatcmpl-abc123xyz789",
  "object": "chat.completion",
  "created": 1709123456,
  "model": "qwen3.5-plus",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a branch of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 256,
    "total_tokens": 284
  }
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique request identifier, e.g. chatcmpl-abc123xyz789. |
| object | string | Fixed as "chat.completion". |
| created | integer | Creation time, Unix timestamp (seconds). |
| model | string | The actual model name used. |
| choices | array | Generated result array (usually contains 1 element). |
| choices[].index | integer | Index position in the result array. |
| choices[].message.role | string | Message role, fixed as "assistant". |
| choices[].message.content | string or null | Generated text content. Can be null when the model calls a tool. |
| choices[].message.reasoning_content | string | Chain-of-thought content returned by reasoning models (e.g. QwQ). Non-reasoning models do not return this field. |
| choices[].message.tool_calls | array | Tool call request array. Only returned when the model decides to call a tool. |
| choices[].finish_reason | string | Stop reason: stop (natural end), length (reached max_tokens), tool_calls (tool call). |
| usage.prompt_tokens | integer | Number of input tokens consumed. |
| usage.completion_tokens | integer | Number of output tokens consumed. |
| usage.total_tokens | integer | Total token consumption (prompt_tokens + completion_tokens). |
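
When finish_reason is tool_calls, the client is expected to run the requested functions and send their results back in a follow-up request. The sketch below shows one way to turn a response into the messages to append to the conversation. It assumes the OpenAI tool call layout (id, function.name, and function.arguments as a JSON-encoded string) and the role "tool" reply with tool_call_id; the helper name run_tool_calls and the tools registry are hypothetical.

```python
import json
from typing import Any, Callable


def run_tool_calls(
    response: dict[str, Any],
    tools: dict[str, Callable[..., Any]],
) -> list[dict[str, Any]]:
    """Return the messages to append to the conversation after one response.

    The assistant message always comes first; when finish_reason is
    "tool_calls", one role="tool" message per call follows it.
    """
    choice = response["choices"][0]
    message = choice["message"]
    follow_up = [message]

    if choice["finish_reason"] != "tool_calls":
        return follow_up

    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Assumption: arguments arrive as a JSON-encoded string (OpenAI format).
        arguments = json.loads(function["arguments"] or "{}")
        result = tools[function["name"]](**arguments)
        follow_up.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            }
        )
    return follow_up
```

Appending the returned messages to the original messages array and posting again lets the model produce its final answer from the tool results.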

Streaming Response Format (SSE)

When stream: true is set, the response is returned incrementally via Server-Sent Events (SSE). Each event is a line that starts with data: followed by one JSON chunk, and the stream ends with the termination marker data: [DONE]. Each chunk's object field value is "chat.completion.chunk".

SSE Data Format

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"content":"Machine"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{"content":" learning"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709123456,"model":"qwen3.5-plus","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":28,"completion_tokens":256,"total_tokens":284}}

data: [DONE]

Chunk Field Descriptions

| Field | Type | Description |
| --- | --- | --- |
| id | string | Same request ID as the complete response. |
| object | string | Fixed as "chat.completion.chunk". |
| choices[].delta.role | string | Only appears in the first chunk; value is "assistant". |
| choices[].delta.content | string | Incremental text content of the current chunk. |
| choices[].delta.reasoning_content | string | Incremental chain-of-thought content of the current chunk (reasoning models). |
| choices[].delta.tool_calls | array | Incremental tool call data (streaming Function Calling). |
| choices[].finish_reason | string or null | Non-null only in the final chunk, where it indicates the stop reason. |
| usage | object | Token usage, returned in the final chunk only when stream_options.include_usage is true. |

Tip: When using the OpenAI SDK, you do not need to parse SSE manually. The SDK handles streaming responses automatically and provides an iterator interface. Manual SSE parsing is only needed when using cURL or a raw HTTP client.
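
For the raw HTTP case, the chunk format described above can be folded back into a single result as follows. This is a minimal sketch that takes already-decoded text lines as input; the helper names iter_sse_chunks and collect_stream are hypothetical, and streamed tool calls are not reassembled here.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield decoded chunk objects from raw SSE lines until data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_stream(lines: Iterable[str]) -> dict[str, Any]:
    """Fold a chunk stream into the final text, stop reason, and usage."""
    parts: list[str] = []
    reasoning: list[str] = []
    finish_reason = None
    usage = None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            # finish_reason is null until the final chunk.
            finish_reason = choice.get("finish_reason") or finish_reason
        # usage is present only when stream_options.include_usage is true.
        usage = chunk.get("usage") or usage
    return {
        "content": "".join(parts),
        "reasoning_content": "".join(reasoning),
        "finish_reason": finish_reason,
        "usage": usage,
    }
```

Any iterable of text lines works as input, for example the decoded lines of an HTTP response body read incrementally.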

Relationship with the Bailian Official Chat API

NexusFlow's /v1/chat/completions is fully protocol-compatible with Alibaba Cloud Bailian's OpenAI-compatible endpoint: the request body is forwarded upstream as-is, and the response is relayed unchanged. Fields such as tools, tool_choice, response_format, enable_thinking, thinking_budget, enable_search, search_options, seed, top_k, logprobs, and stream_options can therefore be used directly, including Bailian extensions that the OpenAI format does not define. Which fields take effect depends on the specific model. Official reference: Qwen API Reference.
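
With a raw HTTP client, extension fields simply go into the JSON body next to the standard ones. The OpenAI Python SDK instead takes non-standard fields through the extra_body argument of chat.completions.create. The sketch below separates the two groups; the helper name split_for_sdk is hypothetical, and the EXTENSION_FIELDS set is an assumption covering only the extension names mentioned above.

```python
from typing import Any

# Assumption: these Bailian fields are not keyword arguments of the OpenAI
# SDK's create() call, so they have to travel in extra_body.
EXTENSION_FIELDS = {
    "enable_thinking",
    "thinking_budget",
    "enable_search",
    "search_options",
    "top_k",
}


def split_for_sdk(params: dict[str, Any]) -> dict[str, Any]:
    """Move extension fields into extra_body, leaving standard fields in place."""
    standard = {k: v for k, v in params.items() if k not in EXTENSION_FIELDS}
    extra = {k: v for k, v in params.items() if k in EXTENSION_FIELDS}
    if extra:
        standard["extra_body"] = extra
    return standard
```

The result could then be passed as client.chat.completions.create(**split_for_sdk(params)), with the client configured for the NexusFlow base_url and api_key.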

Billing Details

Tiered Billing

Bailian series models (Qwen/Tongyi, GLM, etc.) use tiered billing based on the input token count per request. The total prompt tokens of a single request determine the applicable pricing tier, with input and output billed at the corresponding tier's unit price.

Example: qwen3-max

| Tier | Input Token Range | Input Price (¥/M) | Output Price (¥/M) |
| --- | --- | --- | --- |
| Tier 1 | 0 ~ 32K | 2.5 | 10 |
| Tier 2 | 32K ~ 128K | 4 | 16 |
| Tier 3 | 128K ~ 256K | 7 | 28 |

Example: a request with 50K input tokens and 2K output tokens falls into Tier 2, so input is billed at ¥4/M and output at ¥16/M: 50K × ¥4/M + 2K × ¥16/M = ¥0.20 + ¥0.032 = ¥0.232. See the full tiered pricing on the Pricing page.
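
The tier lookup can be sketched in a few lines. The prices are copied from the qwen3-max table; the function name tiered_cost is hypothetical, and two details are assumptions because the source does not state them: K is taken as 1,000 tokens, and each tier's upper bound is treated as inclusive.

```python
from decimal import Decimal

# (upper bound of input tokens, input ¥ per 1M, output ¥ per 1M) for qwen3-max.
# Assumptions: K = 1,000 tokens, and upper bounds are inclusive.
QWEN3_MAX_TIERS = [
    (32_000, Decimal("2.5"), Decimal("10")),
    (128_000, Decimal("4"), Decimal("16")),
    (256_000, Decimal("7"), Decimal("28")),
]
PER_MILLION = Decimal(1_000_000)


def tiered_cost(prompt_tokens: int, completion_tokens: int, tiers=QWEN3_MAX_TIERS) -> Decimal:
    """Price one request in ¥: the prompt size picks the tier for input and output."""
    for upper_bound, input_price, output_price in tiers:
        if prompt_tokens <= upper_bound:
            return (
                prompt_tokens * input_price + completion_tokens * output_price
            ) / PER_MILLION
    raise ValueError("prompt exceeds the largest listed tier")
```

Because the tier is chosen from the prompt size alone, a long prompt raises the unit price of the output tokens as well.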

Context Caching (Prompt Caching)

Context caching is supported when calling via /v1/messages (Anthropic protocol). For repeated system prompts or long documents, DashScope automatically caches the prompt prefix, and the cached portion of subsequent requests is billed at a discount:

| Token Type | Billing Multiplier | Description |
| --- | --- | --- |
| cache_creation_input_tokens | 1.25x input price | First write to the cache; slightly higher than standard input |
| cache_read_input_tokens | 0.1x input price | Cache hit; 90% discount |
| input_tokens (non-cache) | 1x input price | Normal billing |

/v1/chat/completions supports explicit caching via the enable_context_caching: true parameter (Bailian series models). /v1/messages (Anthropic protocol) supports cache_control content block annotations. Both protocols also automatically benefit from implicit cache discounts.
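
The multipliers in the caching table translate into an input-side cost as sketched below. The function name cached_input_cost is hypothetical, the usage keys follow the token type names in the table, and how caching interacts with tier selection is not stated in the source, so the input price is passed in by the caller.

```python
from decimal import Decimal

# Multipliers from the caching table, applied to the standard input price.
CACHE_MULTIPLIERS = {
    "cache_creation_input_tokens": Decimal("1.25"),
    "cache_read_input_tokens": Decimal("0.1"),
    "input_tokens": Decimal("1"),
}


def cached_input_cost(usage: dict[str, int], input_price_per_million: Decimal) -> Decimal:
    """Input-side cost in ¥ for one request, weighting each token type."""
    weighted_tokens = sum(
        (usage.get(name, 0) * multiplier for name, multiplier in CACHE_MULTIPLIERS.items()),
        Decimal(0),
    )
    return weighted_tokens * input_price_per_million / Decimal(1_000_000)
```

At an input price of ¥4/M, a prompt with 40,000 tokens read from cache and 10,000 uncached tokens costs ¥0.056 on the input side, compared with ¥0.20 for 50,000 uncached tokens.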

Important Notes

  • Different models have different max_tokens upper limits. Please refer to the Model List for each model's limitations.
  • Adjust either temperature or top_p, not both; changing both in the same request may produce unpredictable results.
  • In streaming output, only the final chunk has a non-null finish_reason value, indicating generation has ended.
  • For image understanding, it is recommended to use multimodal models such as Qwen-VL. The content field must use the array format and include image_url type entries.
  • For Function Calling, it is recommended to use model series that support tool calling, such as Qwen, DeepSeek, GLM, etc.
  • Thinking mode (enable_thinking) must be used with the appropriate model ID. See the support matrix at Parameters Matrix.
  • The request body is protocol-compatible with the upstream Bailian API; undocumented Bailian extension fields (e.g. thinking_budget, enable_search, search_options) can be used directly, with specific support depending on the model.
  • For the complete parameter descriptions and model compatibility matrix, see Parameters Matrix.
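
The image-understanding note above requires content in array form. The sketch below builds such a message using the OpenAI content part layout (text and image_url entries). The helper names are hypothetical, and whether a given model accepts base64 data URLs in addition to HTTP(S) image URLs is an assumption to verify against the Model List.

```python
import base64
from typing import Any


def image_user_message(text: str, image_url: str) -> dict[str, Any]:
    """Build a user message whose content is an array with text and an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL (assumed to be accepted by the model)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting message goes into the messages array like any other, alongside a multimodal model ID in the model field.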

Model List: View all available models and capabilities
Error Codes: Error code descriptions and troubleshooting guide
Rate Limits: Request rate limits and quotas