Models¶
The proxy exposes whichever single model is currently provisioned on the 8× B200 node. Only one model is live at a time.
Currently live¶
**DeepSeek-V4-Pro** — deployed via the Docker image
`vllm/vllm-openai:deepseekv4-cu130` with the
`--tokenizer-mode deepseek_v4` encoder.
| Property | Value |
|---|---|
| Parallelism | DP=8 expert-parallel |
| Speculative decoding | MTP spec=2 |
| Context window | 256K (MAX_MODEL_LEN=262144) |
| KV cache | fp8 |
| Indexer | FP4 cache |
| GPU memory utilisation | 0.93 (per GPU) |
| `--max-num-seqs` | 256 |
| `--max-num-batched-tokens` | 8192 |
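For orientation, the parameters above map onto flags of the open-source vLLM CLI roughly as follows. This is a sketch, not the deployment's actual launcher: the model path is an assumption, and the MTP speculative-decoding settings (spec=2) are configured differently across vLLM versions, so they are omitted here.

```shell
# Sketch only: flag names from the open-source vLLM CLI; the real launcher,
# image tag, and MTP speculative-decoding config on the host may differ.
docker run --gpus all vllm/vllm-openai:deepseekv4-cu130 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --tokenizer-mode deepseek_v4 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.93 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
```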
Variants¶
| Model name | Mode | Notes |
|---|---|---|
| `DeepSeek-V4-Pro` | Think High | Default; emits a `<think>...</think>` summary block |
| `DeepSeek-V4-Pro` + `output_config.effort: max` | Think Max | Prepends "Reasoning Effort: Absolute maximum…" prefix |
| `DeepSeek-V4-Pro` + `output_config.effort: xhigh` | Think Max | Synonym for `max` |
| `DeepSeek-V4-Pro-nonthinking` | Non-think (chat) | Skips the `<think>` block entirely; emits the summary directly |
| `*-nonthinking` / `*-nonthink` suffix | Non-think | Any model name with this suffix is treated as non-thinking |
See Thinking Modes for the full mapping table and the
underlying chat_template_kwargs injection.
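For illustration, a request selecting the Think Max variant might look like the following. The proxy URL is a placeholder, and the top-level placement of `output_config` is an assumption; check your deployment's request schema.

```python
import json
import urllib.request

# Hypothetical proxy address -- substitute your deployment's endpoint.
PROXY_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "DeepSeek-V4-Pro",
    # Proxy-level selector from the variants table: "max" -> Think Max.
    "output_config": {"effort": "max"},
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
}

request = urllib.request.Request(
    PROXY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # run against a live proxy
```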
Alternate launchers¶
These models are not always live, but their launchers and weights are maintained on the GPU host:
| Model name | Launcher | Notes |
|---|---|---|
| `glm-5.1-fp8` | `scripts/launch-vllm.sh` | GLM-5.1-FP8 via vLLM 0.19.1 stable, TP=8 |
| `Kimi-K2.6` | `scripts/launch-kimi.sh` | Kimi-K2.6 via vLLM 0.19.1 stable, TP=8 |
When a different model is provisioned, the public alias rewriting still
forwards `DeepSeek-V4-Pro` / `Kimi-K2.6` / `glm-5.1-fp8` / `claude-*` to
whatever is live (provided the operator has set `PROXY_ALIAS_MODEL` and
`PROXY_ALIAS_REWRITE_ALL=on` for the new tenant).
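A minimal sketch of the operator-side tenant environment, using the variable names above (the example target model is illustrative):

```shell
# Operator config sketch: point all public aliases at the live model.
export PROXY_ALIAS_MODEL=glm-5.1-fp8   # whichever model is actually serving
export PROXY_ALIAS_REWRITE_ALL=on      # rewrite every known alias to it
```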
Model discovery¶
`GET /v1/models` returns the vLLM payload verbatim (with the synthetic
`DeepSeek-V4-Pro-nonthinking` row added when DeepSeek is live). Use this
endpoint to discover the served model name, and treat the result as
case-sensitive: vLLM matches model names exactly, so echo back whatever
string the listing returns rather than normalising its case.
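A small sketch of client-side discovery, assuming a placeholder base URL; the abridged sample payload shows the expected shape when DeepSeek is live.

```python
import json
import urllib.request

def list_model_ids(base_url: str) -> list[str]:
    """Return served model IDs from /v1/models, verbatim."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        listing = json.load(resp)
    # Keep the IDs exactly as returned: vLLM matches names case-sensitively.
    return [model["id"] for model in listing.get("data", [])]

# Abridged shape of the payload when DeepSeek is live (illustrative only):
sample = {
    "object": "list",
    "data": [
        {"id": "DeepSeek-V4-Pro", "object": "model"},
        {"id": "DeepSeek-V4-Pro-nonthinking", "object": "model"},
    ],
}
ids = [model["id"] for model in sample["data"]]
print(ids)  # ['DeepSeek-V4-Pro', 'DeepSeek-V4-Pro-nonthinking']
```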
Compatibility table¶
| Surface | Supported? | Notes |
|---|---|---|
| OpenAI Chat Completions | ✓ | streaming & non-streaming |
| OpenAI legacy Completions | ✓ | depends on backend |
| OpenAI Responses API | ✓ | depends on backend |
| OpenAI Embeddings | ✓ | only if embedding model is loaded |
| Anthropic Messages | ✓ | streaming & non-streaming |
| Anthropic thinking block | ✓ | mapped onto DeepSeek `chat_template_kwargs` |
| Tool / function calling | ✓ | DeepSeek `deepseekv32_tool_parser` patch active (see project `deploy/patches/`) |
| Vision (image input) | ✗ | not enabled on the live tenant |
| File uploads | ✗ | not exposed |
| Audio | ✗ | not exposed |
Tool calling on `DeepSeek-V4-Pro` uses the in-tree
`deepseekv32_tool_parser` with the project's hot-patch for streaming
DSML leakage (see commit history).
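On the client side, tool calling uses the standard OpenAI `tools` request shape, which the parser consumes on the way back out. A minimal sketch follows; the tool name and schema are hypothetical.

```python
import json

# Standard OpenAI-style tool definition; the proxy passes this through to
# vLLM, where the patched deepseekv32_tool_parser extracts any tool call
# from the model output (including streamed responses).
payload = {
    "model": "DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
body = json.dumps(payload)  # ready to POST to /v1/chat/completions
```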