Thinking Modes (DeepSeek-V4-Pro)

DeepSeek-V4-Pro supports three reasoning regimes. The proxy translates the inbound model name and (for Anthropic-shaped clients) the output_config.effort field into vLLM chat_template_kwargs for the --tokenizer-mode deepseek_v4 encoder.

Modes at a glance

| Mode | What it does | Triggered by |
| --- | --- | --- |
| Think High | Default. Emits a <think>...</think> block followed by the answer. | Any non-suffixed model name without an explicit effort of max/xhigh. |
| Think Max | Think High plus the "Reasoning Effort: Absolute maximum…" system-prompt prefix. Use for hard math, proofs, and research. | output_config.effort: "max" or "xhigh". |
| Non-think | Skips the <think> block entirely; emits a chat-style answer directly. | Model name ending with -nonthinking or -nonthink, or exactly DeepSeek-V4-Pro-nonthinking. |

Underlying chat_template_kwargs

The vLLM deepseek_v4 tokenizer reads two keys:

{
  "thinking": true | false,        // (or legacy alias enable_thinking)
  "reasoning_effort": "max" | "high" | null
}
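
The tokenizer accepts either spelling, so nothing here requires normalization; but if you want to canonicalize client payloads yourself before forwarding, a minimal sketch (the helper is hypothetical, not part of the proxy):

def canonicalize_kwargs(kwargs: dict) -> dict:
    # Map the legacy enable_thinking alias onto the canonical thinking key.
    out = dict(kwargs)
    if "thinking" not in out and "enable_thinking" in out:
        out["thinking"] = out.pop("enable_thinking")
    return out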

The proxy fills these as follows (gated on PROXY_ALIAS_MODEL == DeepSeek-V4-Pro):

| Inbound model | Inbound output_config.effort | Injected chat_template_kwargs |
| --- | --- | --- |
| DeepSeek-V4-Pro-nonthinking (any case) | (ignored) | {"thinking": false} |
| *-nonthinking, *-nonthink | (ignored) | {"thinking": false} |
| any other | "max" or "xhigh" | {"thinking": true, "reasoning_effort": "max"} |
| any other | "low", "medium", "high", missing, or unknown | {"thinking": true, "reasoning_effort": "high"} |

reasoning_effort: "high" is set explicitly (rather than null) so the template renderer always picks a deliberate budget.

Client-supplied chat_template_kwargs keys win — the proxy only fills what's missing. So if a request already says "chat_template_kwargs": {"thinking": false}, the proxy will not override it.
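
That rule is a shallow merge in which client keys take precedence; roughly (hypothetical helper):

def merge_kwargs(client: dict | None, defaults: dict) -> dict:
    # Start from the proxy defaults, then let client-supplied keys win.
    merged = dict(defaults)
    merged.update(client or {})
    return merged

So merge_kwargs({"thinking": False}, {"thinking": True, "reasoning_effort": "high"}) keeps the client's thinking: false. Whether the proxy still fills reasoning_effort next to a client-pinned thinking: false is not specified here; the sketch does.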

Worked examples

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro",
  "messages": [{"role": "user", "content": "Plan a refactor"}]
}

Becomes, on the wire to vLLM:

{
  "model": "DeepSeek-V4-Pro",
  "messages": [...],
  "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
}

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro",
  "output_config": {"effort": "max"},
  "messages": [{"role": "user", "content": "Prove that..."}]
}

Becomes (only the injected field is shown):

{
  "chat_template_kwargs": {"thinking": true, "reasoning_effort": "max"}
}

A "Reasoning Effort: Absolute maximum…" prefix is prepended to the prompt at message index 0.

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro-nonthinking",
  "messages": [{"role": "user", "content": "What's 2+2?"}]
}

Becomes:

{
  "model": "DeepSeek-V4-Pro",
  "chat_template_kwargs": {"thinking": false}
}

The model name is rewritten to the canonical form (vLLM doesn't actually serve a -nonthinking variant); the encoder consumes thinking: false and the model emits the answer directly, with no <think> block.

Anthropic-shape compatibility

When using /v1/messages, the Anthropic-native thinking: {"type": "enabled", "budget_tokens": N} block is forwarded verbatim to the backend. It does not override chat_template_kwargs; both can coexist (the engine uses the budget value from thinking.budget_tokens and the regime from chat_template_kwargs.thinking).
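
Sketched as the forwarded payload (a Python dict for consistency with the other sketches; the budget value is illustrative):

payload = {
    "model": "DeepSeek-V4-Pro",
    # Anthropic-native block, forwarded verbatim: supplies the token budget.
    "thinking": {"type": "enabled", "budget_tokens": 8192},
    # Filled by the proxy (or supplied by the client): supplies the regime.
    "chat_template_kwargs": {"thinking": True, "reasoning_effort": "high"},
    "messages": [{"role": "user", "content": "Prove that..."}],
}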

For Claude Code clients, the cds shell wrapper (see the zshrc in the project repo) already pins ANTHROPIC_DEFAULT_*_MODEL=DeepSeek-V4-Pro, so the proxy receives DeepSeek-V4-Pro regardless of which Claude tier the client thinks it is invoking.