Rate Limits
NexusFlow manages peak traffic through requests-per-minute (RPM) limits, tokens-per-minute (TPM) limits, concurrent-stream limits, an async task queue, and monitoring. The concurrency you can sustain is not a single number; it depends on the combination of rate limits, queuing, polling cadence, and model latency.
RPM
Caps the number of requests per minute so that sudden spikes do not overwhelm upstream services.
TPM
Caps the number of tokens processed per minute so that long-context traffic does not crowd out other requests.
Concurrency
Long-running tasks should go through the async queue rather than hold a synchronous connection open for an extended period.
Monitoring
Track how time to first token (TTFT), success rate, and per-model latency change during peak periods.
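The RPM and TPM limits can also be respected on the client side before a request is sent. The following is a minimal sketch of a sliding-window guard, not NexusFlow's own implementation. The class name `MinuteWindowLimiter`, the limit values, and the idea of passing an estimated token count are all assumptions for illustration; use the limits from your own plan.

```python
import time
from collections import deque


class MinuteWindowLimiter:
    """Client-side sliding-window guard for requests and tokens per minute."""

    def __init__(self, rpm_limit, tpm_limit, clock=time.monotonic):
        # Both limits are hypothetical inputs; take real values from your plan.
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict_expired(self, now):
        # Drop requests that were admitted 60 seconds ago or earlier.
        while self.events and now - self.events[0][0] >= 60.0:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, estimated_tokens):
        """Record the request and return True if both budgets allow it."""
        now = self.clock()
        self._evict_expired(now)
        if len(self.events) + 1 > self.rpm_limit:
            return False  # would exceed the request budget
        if self.tokens_in_window + estimated_tokens > self.tpm_limit:
            return False  # would exceed the token budget
        self.events.append((now, estimated_tokens))
        self.tokens_in_window += estimated_tokens
        return True
```

When `try_acquire` returns `False`, the caller can wait briefly and retry, or hand the work to a queue instead of sending it immediately.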
Plan Rate Limits
Model Token Limits
Rate Limiting Response Headers
Currently, the only stable response headers you can rely on are those that report remaining balance. For more granular rate-limit headers, refer to subsequent platform releases.
High-Concurrency Scenario Recommendations
Sync vs. Async Traffic Separation
Send chat requests to `/v1/chat/completions` and image or video tasks to `/v1/tasks`, so that long-running tasks stay off the synchronous path.
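One way to keep the two paths apart is to pick the endpoint in a single place in the client. This is a small sketch; the two paths come from the text above, while the workload labels (`"chat"`, `"image"`, `"video"`) and the function name `endpoint_for` are hypothetical.

```python
CHAT_PATH = "/v1/chat/completions"  # synchronous path
TASKS_PATH = "/v1/tasks"            # async task path

# Hypothetical labels for workloads that should never hold a sync connection.
ASYNC_KINDS = {"image", "video"}


def endpoint_for(kind: str) -> str:
    """Return the API path a workload of the given kind should use."""
    if kind in ASYNC_KINDS:
        return TASKS_PATH
    if kind == "chat":
        return CHAT_PATH
    raise ValueError(f"unknown workload kind: {kind!r}")
```

Centralizing the choice makes it harder for a long-running job to end up on the synchronous path by accident.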
Smart Polling
Don't poll task status at high frequency. Use a fixed interval of 3 to 5 seconds or exponential backoff, so that status checks do not amplify load during peak periods.
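A polling loop with exponential backoff might look like the sketch below. It deliberately leaves the HTTP call to the caller: `fetch_status` and `is_terminal` are hypothetical callables you supply, because the task status response format is not described here. The 3-second starting interval follows the recommendation above; the growth factor, cap, and timeout are assumed values.

```python
import time


def backoff_delays(initial=3.0, factor=1.5, cap=30.0):
    """Yield wait times that grow geometrically up to a cap (assumed values)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def poll_until_done(fetch_status, is_terminal, timeout=600.0,
                    sleep=time.sleep, delays=None):
    """Call fetch_status() with growing pauses until is_terminal(status) is true.

    Raises TimeoutError if the next pause would exceed the total wait budget.
    """
    if delays is None:
        delays = backoff_delays()
    waited = 0.0
    for delay in delays:
        status = fetch_status()
        if is_terminal(status):
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"task not finished after {waited:.1f}s of waiting")
        sleep(delay)
        waited += delay
    raise TimeoutError("delay schedule exhausted before the task finished")
```

If many clients start polling at the same moment, adding a small random jitter to each delay helps keep their requests from lining up.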
Monitor via the Dashboard
Track request volume, TTFT, success rate, and per-model latency changes to recognize whether you're approaching capacity limits.
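If you also log requests on your own side, the same signals can be summarized per model and compared over time. The sketch below assumes a hypothetical `RequestSample` record collected by your client; it does not read from the NexusFlow dashboard or any NexusFlow API.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RequestSample:
    """One client-side observation (hypothetical record layout)."""
    model: str
    ok: bool
    ttft_seconds: float
    latency_seconds: float


def summarize(samples):
    """Group samples by model and compute volume, success rate, and timings."""
    by_model = {}
    for sample in samples:
        by_model.setdefault(sample.model, []).append(sample)

    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            "requests": len(rows),
            "success_rate": sum(r.ok for r in rows) / len(rows),
            "median_ttft": median(r.ttft_seconds for r in rows),
            "mean_latency": mean(r.latency_seconds for r in rows),
        }
    return summary
```

A falling success rate or a rising median TTFT at steady request volume is a reasonable hint that you are getting close to a limit.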
Business-Side Degradation
During peak periods, switch to faster models first, or degrade gracefully by lowering `max_tokens` and reducing long-context usage.
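Two of these levers, the model switch and the lower `max_tokens`, can be applied to a request payload just before it is sent. This is a sketch under assumptions: the payload is taken to be a dict with `model` and `max_tokens` keys, and the model names in the usage example are placeholders, not real NexusFlow model identifiers.

```python
def degrade_request(payload, under_pressure, fallback_model, max_tokens_cap):
    """Return a copy of the payload, downgraded when the system is under pressure.

    The `model` key and the way pressure is detected are assumptions here;
    only `max_tokens` is named in the text above.
    """
    degraded = dict(payload)  # never mutate the caller's payload
    if not under_pressure:
        return degraded
    degraded["model"] = fallback_model
    requested = payload.get("max_tokens", max_tokens_cap)
    degraded["max_tokens"] = min(requested, max_tokens_cap)
    return degraded


# Usage with placeholder model names:
request = {"model": "large-model", "max_tokens": 4096}
peak_request = degrade_request(request, True, "fast-model", 1024)
```

The `under_pressure` flag could be driven by the signals you already monitor, such as a drop in success rate or a rise in TTFT.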