Thinking Modes (DeepSeek-V4-Pro)

DeepSeek-V4-Pro supports three reasoning regimes. The proxy translates the inbound model name and (for Anthropic-shaped clients) the output_config.effort field into vLLM chat_template_kwargs for the --tokenizer-mode deepseek_v4 encoder.

Modes at a glance

| Mode | What it does | Triggered by |
| --- | --- | --- |
| Think High | Default. Emits a <think>...</think> block followed by the answer. | Any non-suffixed model name without an explicit effort of max/xhigh. |
| Think Max | Think High plus the "Reasoning Effort: Absolute maximum…" system-prompt prefix. Use for hard math, proofs, and research. | output_config.effort: "max" or "xhigh". |
| Non-think | Skips the <think> block entirely; emits a chat-style answer directly. | Model name ending with -nonthinking or -nonthink, or exactly DeepSeek-V4-Pro-nonthinking. |

Underlying chat_template_kwargs

The vLLM deepseek_v4 tokenizer reads two keys:

{
  "thinking": true | false,        // (or legacy alias enable_thinking)
  "reasoning_effort": "max" | "high" | null
}
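
The tokenizer accepts either spelling, so nothing here requires normalization; but if you want to canonicalize client payloads yourself before forwarding, a minimal sketch (the helper is hypothetical, not part of the proxy):

def canonicalize_kwargs(kwargs: dict) -> dict:
    # Map the legacy enable_thinking alias onto the canonical thinking key.
    out = dict(kwargs)
    if "thinking" not in out and "enable_thinking" in out:
        out["thinking"] = out.pop("enable_thinking")
    return out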

The proxy fills these as follows (gated on PROXY_ALIAS_MODEL == DeepSeek-V4-Pro):

| Inbound model | Inbound output_config.effort | Injected chat_template_kwargs |
| --- | --- | --- |
| DeepSeek-V4-Pro-nonthinking (any case) | (ignored) | {"thinking": false} |
| *-nonthinking, *-nonthink | (ignored) | {"thinking": false} |
| any other | "max" or "xhigh" | {"thinking": true, "reasoning_effort": "max"} |
| any other | "low", "medium", "high", missing, or unknown | {"thinking": true, "reasoning_effort": "high"} |

reasoning_effort: "high" is set explicitly (rather than null) so the template renderer always picks a deliberate budget.

Client-supplied chat_template_kwargs keys win — the proxy only fills what's missing. So if a request already says "chat_template_kwargs": {"thinking": false}, the proxy will not override it.
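
That rule is a shallow merge in which client keys take precedence; roughly (hypothetical helper):

def merge_kwargs(client: dict | None, defaults: dict) -> dict:
    # Start from the proxy defaults, then let client-supplied keys win.
    merged = dict(defaults)
    merged.update(client or {})
    return merged

So merge_kwargs({"thinking": False}, {"thinking": True, "reasoning_effort": "high"}) keeps the client's thinking: false. Whether the proxy still fills reasoning_effort next to a client-pinned thinking: false is not specified here; the sketch does.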

Worked examples

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro",
  "messages": [{"role": "user", "content": "Plan a refactor"}]
}

Becomes, on the wire to vLLM:

{
  "model": "DeepSeek-V4-Pro",
  "messages": [...],
  "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
}

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro",
  "output_config": {"effort": "max"},
  "messages": [{"role": "user", "content": "Prove that..."}]
}

Becomes (only the injected field is shown):

{
  "chat_template_kwargs": {"thinking": true, "reasoning_effort": "max"}
}

A "Reasoning Effort: Absolute maximum…" prefix is prepended to the prompt at message index 0.

POST /v1/chat/completions
{
  "model": "DeepSeek-V4-Pro-nonthinking",
  "messages": [{"role": "user", "content": "What's 2+2?"}]
}

Becomes:

{
  "model": "DeepSeek-V4-Pro",
  "chat_template_kwargs": {"thinking": false}
}

The model name is rewritten to the canonical form (vLLM doesn't actually serve a -nonthinking variant); the encoder consumes thinking: false and the model emits the answer directly, with no <think> block.

Anthropic-shape compatibility

When using /v1/messages, the Anthropic-native thinking: {"type": "enabled", "budget_tokens": N} block is forwarded verbatim to the backend. It does not override chat_template_kwargs; both can coexist (the engine uses the budget value from thinking.budget_tokens and the regime from chat_template_kwargs.thinking).
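
Sketched as the forwarded payload (a Python dict for consistency with the other sketches; the budget value is illustrative):

payload = {
    "model": "DeepSeek-V4-Pro",
    # Anthropic-native block, forwarded verbatim: supplies the token budget.
    "thinking": {"type": "enabled", "budget_tokens": 8192},
    # Filled by the proxy (or supplied by the client): supplies the regime.
    "chat_template_kwargs": {"thinking": True, "reasoning_effort": "high"},
    "messages": [{"role": "user", "content": "Prove that..."}],
}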

For Claude Code clients, the cds shell wrapper (see the zshrc in the project repo) already pins ANTHROPIC_DEFAULT_*_MODEL=DeepSeek-V4-Pro, so the proxy receives DeepSeek-V4-Pro regardless of which Claude tier the client thinks it is invoking.