Models

The proxy exposes whichever single model is currently provisioned on the 8× B200 node. Only one model is live at a time.

Currently live

DeepSeek-V4-Pro, deployed via the Docker image vllm/vllm-openai:deepseekv4-cu130 with --tokenizer-mode deepseek_v4.

| Property | Value |
| --- | --- |
| Parallelism | DP=8 expert-parallel |
| Speculative decoding | MTP spec=2 |
| Context window | 256K (MAX_MODEL_LEN=262144) |
| KV cache | fp8 |
| Indexer | FP4 cache |
| GPU memory utilisation | 0.93 (per GPU) |
| --max-num-seqs | 256 |
| --max-num-batched-tokens | 8192 |
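
For orientation, the settings above map onto a vLLM launch roughly as sketched below. This is not the command used on the host: the weights path, volume mount, port, and the exact spelling of the MTP speculative-decoding option are assumptions, and the indexer FP4 cache setting is omitted because its flag is not documented on this page; only the values themselves come from the table.

```
# Sketch only; the real launch command lives on the GPU host and may differ.
# The weights path and the speculative-config spelling are assumptions.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:deepseekv4-cu130 \
  --model /models/DeepSeek-V4-Pro \
  --served-model-name DeepSeek-V4-Pro \
  --tokenizer-mode deepseek_v4 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.93 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192
```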

Variants

| Model name | Mode | Notes |
| --- | --- | --- |
| DeepSeek-V4-Pro | Think High | Default; emits a <think>...</think> summary block |
| DeepSeek-V4-Pro + output_config.effort: max | Think Max | Prepends the "Reasoning Effort: Absolute maximum…" prefix |
| DeepSeek-V4-Pro + output_config.effort: xhigh | Think Max | Synonym for max |
| DeepSeek-V4-Pro-nonthinking | Non-think (chat) | Skips the <think> block entirely; emits the summary directly |
| *-nonthinking / *-nonthink suffix | Non-think | Any model name with either suffix is treated as non-thinking |

See Thinking Modes for the full mapping table and the underlying chat_template_kwargs injection.
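
As a quick sketch of the two ways to steer thinking from a client (the authoritative request shape is in Thinking Modes; treating output_config as a top-level field on the Chat Completions body is an assumption here):

```
# Think Max: default model name plus the effort hint.
curl -s https://proxy.worv.ai/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "output_config": {"effort": "max"},
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
  }'

# Non-thinking: select the synthetic -nonthinking alias instead.
curl -s https://proxy.worv.ai/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Pro-nonthinking",
    "messages": [{"role": "user", "content": "Summarise this in one line."}]
  }'
```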

Alternate launchers

These models are not always live, but their launchers and weights are maintained on the GPU host:

| Model name | Launcher | Notes |
| --- | --- | --- |
| glm-5.1-fp8 | scripts/launch-vllm.sh | GLM-5.1-FP8 via vLLM 0.19.1 stable, TP=8 |
| Kimi-K2.6 | scripts/launch-kimi.sh | Kimi-K2.6 via vLLM 0.19.1 stable, TP=8 |
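
The launcher scripts themselves are not reproduced on this page. A minimal TP=8 launcher in the same spirit would look roughly like the sketch below; the weights path and port are placeholders, and the real scripts may set additional flags.

```
# Illustrative only; scripts/launch-vllm.sh on the GPU host may differ.
# /models/GLM-5.1-FP8 and the port are placeholders.
vllm serve /models/GLM-5.1-FP8 \
  --served-model-name glm-5.1-fp8 \
  --tensor-parallel-size 8 \
  --port 8000
```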

When a different model is provisioned, the public alias rewriting still forwards requests for DeepSeek-V4-Pro / Kimi-K2.6 / glm-5.1-fp8 / claude-* to whichever model is live (provided the operator has set PROXY_ALIAS_MODEL and PROXY_ALIAS_REWRITE_ALL=on for the new tenant).
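
Operator-side, that amounts to roughly the following; the alias value is a placeholder for whichever served-model-name is actually live, and how the settings are applied per tenant is not covered here.

```
# Hypothetical values: PROXY_ALIAS_MODEL must name the model that is
# actually live; PROXY_ALIAS_REWRITE_ALL=on enables rewriting of all aliases.
export PROXY_ALIAS_MODEL="glm-5.1-fp8"
export PROXY_ALIAS_REWRITE_ALL=on
```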

Model discovery

```
curl -s https://proxy.worv.ai/v1/models \
  -H "Authorization: Bearer $API_KEY" | jq '.data[].id'
```

Returns the vLLM /v1/models payload verbatim (including the synthetic DeepSeek-V4-Pro-nonthinking row when DeepSeek is live). Use this endpoint to discover the served-model-name, and do not match it case-insensitively: vLLM itself is case-sensitive, so always echo back exactly the string the listing returns.
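
One way to honour that with the same tools, as a minimal sketch:

```
# Take the first served model id verbatim and reuse it, preserving its casing.
MODEL=$(curl -s https://proxy.worv.ai/v1/models \
  -H "Authorization: Bearer $API_KEY" | jq -r '.data[0].id')

curl -s https://proxy.worv.ai/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"
```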

Compatibility table

| Surface | Supported? | Notes |
| --- | --- | --- |
| OpenAI Chat Completions | Yes | Streaming & non-streaming |
| OpenAI legacy Completions | Depends on backend | |
| OpenAI Responses API | Depends on backend | |
| OpenAI Embeddings | Only if an embedding model is loaded | |
| Anthropic Messages | Yes | Streaming & non-streaming |
| Anthropic thinking block | Yes | Mapped onto DeepSeek chat_template_kwargs |
| Tool / function calling | Yes | DeepSeek deepseekv32_tool_parser patch active; see project deploy/patches/ |
| Vision (image input) | No | Not enabled on the live tenant |
| File uploads | No | Not exposed |
| Audio | No | Not exposed |
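
For the Anthropic surface, a hedged sketch of a Messages call through the proxy follows. The specific claude-* model name is a placeholder, and whether the proxy expects x-api-key or a Bearer token on this path is not specified on this page; the thinking block is what gets mapped onto DeepSeek's chat_template_kwargs.

```
# Sketch only: claude-* name and auth header are assumptions.
curl -s https://proxy.worv.ai/v1/messages \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 8192,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [{"role": "user", "content": "Explain KV-cache quantisation briefly."}]
  }'
```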

Tool calling on DeepSeek-V4-Pro uses the in-tree deepseekv32_tool_parser with the project's hot-patch for streaming DSML leakage (see commit history).
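
Clients drive it through the standard OpenAI tools field; nothing proxy-specific is needed on the request side. A minimal sketch, where the get_weather tool is a hypothetical example and not part of the deployment:

```
curl -s https://proxy.worv.ai/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```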