Models¶
The proxy exposes whichever single model is currently provisioned on the 8× B200 node. Only one model is live at a time.
Currently live¶
**DeepSeek-V4-Pro** — deployed via the Docker image
`vllm/vllm-openai:deepseekv4-cu130` with the
`--tokenizer-mode deepseek_v4` encoder.
| Property | Value |
|---|---|
| Parallelism | DP=8 expert-parallel |
| Speculative decoding | MTP spec=2 |
| Context window | 256K (MAX_MODEL_LEN=262144) |
| KV cache | fp8 |
| Indexer | FP4 cache |
| GPU memory utilisation | 0.93 (per GPU) |
| `--max-num-seqs` | 256 |
| `--max-num-batched-tokens` | 8192 |
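For orientation, the parameters above map onto flags of the open-source vLLM CLI roughly as follows. This is a sketch, not the deployment's actual launcher: the model path is an assumption, and the MTP speculative-decoding settings (spec=2) are configured differently across vLLM versions, so they are omitted here.

```shell
# Sketch only: flag names from the open-source vLLM CLI; the real launcher,
# image tag, and MTP speculative-decoding config on the host may differ.
docker run --gpus all vllm/vllm-openai:deepseekv4-cu130 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --tokenizer-mode deepseek_v4 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.93 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
```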
Variants¶
| Model name | Mode | Notes |
|---|---|---|
| `DeepSeek-V4-Pro` | Think High | Default; emits a `<think>...</think>` summary block |
| `DeepSeek-V4-Pro` + `output_config.effort: max` | Think Max | Prepends "Reasoning Effort: Absolute maximum…" prefix |
| `DeepSeek-V4-Pro` + `output_config.effort: xhigh` | Think Max | Synonym for `max` |
| `DeepSeek-V4-Pro-nonthinking` | Non-think (chat) | Skips the `<think>` block entirely; emits the summary directly |
| `*-nonthinking` / `*-nonthink` suffix | Non-think | Any model name with this suffix is treated as non-thinking |
See Thinking Modes for the full mapping table and the
underlying chat_template_kwargs injection.
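For illustration, a request selecting the Think Max variant might look like the following. The proxy URL is a placeholder, and the top-level placement of `output_config` is an assumption; check your deployment's request schema.

```python
import json
import urllib.request

# Hypothetical proxy address -- substitute your deployment's endpoint.
PROXY_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "DeepSeek-V4-Pro",
    # Proxy-level selector from the variants table: "max" -> Think Max.
    "output_config": {"effort": "max"},
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
}

request = urllib.request.Request(
    PROXY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # run against a live proxy
```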
Alternate launchers¶
These models are not always live, but their launchers and weights are maintained on the GPU host:
| Model name | Launcher | Notes |
|---|---|---|
| `glm-5.1-fp8` | `scripts/launch-vllm.sh` | GLM-5.1-FP8 via vLLM 0.19.1 stable, TP=8 |
| `Kimi-K2.6` | `scripts/launch-kimi.sh` | Kimi-K2.6 via vLLM 0.19.1 stable, TP=8 |
When a different model is provisioned, the public alias rewriting still
forwards `DeepSeek-V4-Pro` / `Kimi-K2.6` / `glm-5.1-fp8` / `claude-*` to
whatever is live (provided the operator has set `PROXY_ALIAS_MODEL` and
`PROXY_ALIAS_REWRITE_ALL=on` for the new tenant).
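A minimal sketch of the operator-side tenant environment, using the variable names above (the example target model is illustrative):

```shell
# Operator config sketch: point all public aliases at the live model.
export PROXY_ALIAS_MODEL=glm-5.1-fp8   # whichever model is actually serving
export PROXY_ALIAS_REWRITE_ALL=on      # rewrite every known alias to it
```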
Model discovery¶
`GET /v1/models` returns the vLLM payload verbatim (with the synthetic
`DeepSeek-V4-Pro-nonthinking` row added when DeepSeek is live). Use this
endpoint to discover the served model name, and treat the result as
case-sensitive: vLLM matches model names exactly, so echo back whatever
string the listing returns rather than normalising its case.
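A small sketch of client-side discovery, assuming a placeholder base URL; the abridged sample payload shows the expected shape when DeepSeek is live.

```python
import json
import urllib.request

def list_model_ids(base_url: str) -> list[str]:
    """Return served model IDs from /v1/models, verbatim."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        listing = json.load(resp)
    # Keep the IDs exactly as returned: vLLM matches names case-sensitively.
    return [model["id"] for model in listing.get("data", [])]

# Abridged shape of the payload when DeepSeek is live (illustrative only):
sample = {
    "object": "list",
    "data": [
        {"id": "DeepSeek-V4-Pro", "object": "model"},
        {"id": "DeepSeek-V4-Pro-nonthinking", "object": "model"},
    ],
}
ids = [model["id"] for model in sample["data"]]
print(ids)  # ['DeepSeek-V4-Pro', 'DeepSeek-V4-Pro-nonthinking']
```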
Compatibility table¶
| Surface | Supported? | Notes |
|---|---|---|
| OpenAI Chat Completions | ✓ | streaming & non-streaming |
| OpenAI legacy Completions | ✓ | depends on backend |
| OpenAI Responses API | ✓ | depends on backend |
| OpenAI Embeddings | ✓ | only if embedding model is loaded |
| Anthropic Messages | ✓ | streaming & non-streaming |
| Anthropic thinking block | ✓ | mapped onto DeepSeek `chat_template_kwargs` |
| Tool / function calling | ✓ | DeepSeek `deepseekv32_tool_parser` patch active (see project `deploy/patches/`) |
| Vision (image input) | ✗ | not enabled on the live tenant |
| File uploads | ✗ | not exposed |
| Audio | ✗ | not exposed |
Tool calling on `DeepSeek-V4-Pro` uses the in-tree
`deepseekv32_tool_parser` with the project's hot-patch for streaming
DSML leakage (see commit history).
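On the client side, tool calling uses the standard OpenAI `tools` request shape, which the parser consumes on the way back out. A minimal sketch follows; the tool name and schema are hypothetical.

```python
import json

# Standard OpenAI-style tool definition; the proxy passes this through to
# vLLM, where the patched deepseekv32_tool_parser extracts any tool call
# from the model output (including streamed responses).
payload = {
    "model": "DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
body = json.dumps(payload)  # ready to POST to /v1/chat/completions
```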