Thinking Modes (DeepSeek-V4-Pro)¶
DeepSeek-V4-Pro supports three reasoning regimes. The proxy translates the inbound model name and (for Anthropic-shaped clients) the `output_config.effort` field into vLLM `chat_template_kwargs` for the `--tokenizer-mode deepseek_v4` encoder.
Modes at a glance¶
| Mode | What it does | Triggered by |
|---|---|---|
| Think High | Default. Emits a `<think>...</think>` block followed by the answer. | Any non-suffixed model name without an explicit `effort: max`/`xhigh`. |
| Think Max | Think High plus the "Reasoning Effort: Absolute maximum…" system-prompt prefix. Use for hard math / proofs / research. | `output_config.effort: "max"` or `"xhigh"`. |
| Non-think | Skips the `<think>` block entirely; emits a chat-style answer directly. | Model name ending in `-nonthinking` or `-nonthink`, or exactly `DeepSeek-V4-Pro-nonthinking`. |
Underlying chat_template_kwargs¶
The vLLM `deepseek_v4` tokenizer reads two keys:

```json
{
  "thinking": true | false,                  // or the legacy alias "enable_thinking"
  "reasoning_effort": "max" | "high" | null
}
```
The proxy fills these as follows (gated on `PROXY_ALIAS_MODEL == DeepSeek-V4-Pro`):

| Inbound model | Inbound `output_config.effort` | Injected `chat_template_kwargs` |
|---|---|---|
| `DeepSeek-V4-Pro-nonthinking` (any case) | (ignored) | `{"thinking": false}` |
| `*-nonthinking`, `*-nonthink` | (ignored) | `{"thinking": false}` |
| any other | `"max"` or `"xhigh"` | `{"thinking": true, "reasoning_effort": "max"}` |
| any other | `"low"`, `"medium"`, `"high"`, missing, or unknown | `{"thinking": true, "reasoning_effort": "high"}` |
`reasoning_effort: "high"` is set explicitly (rather than left as `null`) so the template renderer always picks a deliberate budget.
Client-supplied `chat_template_kwargs` keys win: the proxy only fills what's missing. So if a request already says `"chat_template_kwargs": {"thinking": false}`, the proxy will not override it.
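A minimal Python sketch of this defaulting-and-merge behaviour; the function name and exact structure are illustrative, not the proxy's actual code:

```python
def resolve_chat_template_kwargs(model: str,
                                 effort: str | None,
                                 client_kwargs: dict | None) -> dict:
    """Defaulting rules from the mapping table above (sketch only)."""
    if model.lower().endswith(("-nonthinking", "-nonthink")):
        defaults = {"thinking": False}
    elif effort in ("max", "xhigh"):
        defaults = {"thinking": True, "reasoning_effort": "max"}
    else:
        # "low" / "medium" / "high" / missing / unknown all collapse to
        # "high", so the template renderer always sees a deliberate budget.
        defaults = {"thinking": True, "reasoning_effort": "high"}
    # Client-supplied keys win: start from defaults, let the client override.
    merged = dict(defaults)
    merged.update(client_kwargs or {})
    return merged
```

For example, `resolve_chat_template_kwargs("DeepSeek-V4-Pro", None, {"thinking": False})` returns `{"thinking": False, "reasoning_effort": "high"}`: the client's `thinking` key is preserved and only the missing key is filled.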
Worked examples¶
POST /v1/chat/completions

```json
{
  "model": "DeepSeek-V4-Pro",
  "messages": [{"role": "user", "content": "Plan a refactor"}]
}
```
Becomes, on the wire to vLLM:
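```json
{
  "model": "DeepSeek-V4-Pro",
  "messages": [{"role": "user", "content": "Plan a refactor"}],
  "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
}
```

(The request hits the "any other model, missing effort" row, so the default high-effort kwargs are injected; this sketch assumes the proxy leaves the rest of the body untouched.)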
POST /v1/chat/completions

```json
{
  "model": "DeepSeek-V4-Pro",
  "output_config": {"effort": "max"},
  "messages": [{"role": "user", "content": "Prove that..."}]
}
```
Becomes:
A "Reasoning Effort: Absolute maximum…" prefix is prepended to the prompt at message index 0.
POST /v1/chat/completions

```json
{
  "model": "DeepSeek-V4-Pro-nonthinking",
  "messages": [{"role": "user", "content": "What's 2+2?"}]
}
```
Becomes:
The model name is rewritten to the canonical `DeepSeek-V4-Pro` (vLLM doesn't actually serve a `-nonthinking` variant); the encoder consumes `thinking: false` and emits the answer directly.
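A sketch of the forwarded body after the rewrite:

```json
{
  "model": "DeepSeek-V4-Pro",
  "messages": [{"role": "user", "content": "What's 2+2?"}],
  "chat_template_kwargs": {"thinking": false}
}
```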
Anthropic-shape compatibility¶
When using `/v1/messages`, the Anthropic-native `thinking: {"type": "enabled", "budget_tokens": N}` block is forwarded verbatim to the backend. It does not override `chat_template_kwargs`; both can coexist (the engine uses the budget value from `thinking.budget_tokens` and the regime from `chat_template_kwargs.thinking`).
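For instance (a sketch; the budget value is illustrative):

POST /v1/messages

```json
{
  "model": "DeepSeek-V4-Pro",
  "max_tokens": 1024,
  "thinking": {"type": "enabled", "budget_tokens": 8192},
  "messages": [{"role": "user", "content": "Prove that..."}]
}
```

Here the engine takes its 8192-token budget from `thinking.budget_tokens`, while the thinking regime still comes from whatever `chat_template_kwargs.thinking` the proxy injected.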
For Claude Code clients, the `cds` shell wrapper (see the project repo's `zshrc`) already pins `ANTHROPIC_DEFAULT_*_MODEL=DeepSeek-V4-Pro`, so the proxy receives `DeepSeek-V4-Pro` regardless of which Claude tier the client thinks it is invoking.