# Inference endpoints (/v1)
The proxy is a thin pass-through to vLLM under
https://proxy.worv.ai/v1/<subpath>. Any subpath reachable on the
backend is reachable through the proxy — common ones are listed
below.
| Method | Path | Notes |
|---|---|---|
| GET | `/v1/models` | Augmented to advertise DeepSeek-V4-Pro-nonthinking when DeepSeek is live |
| POST | `/v1/chat/completions` | OpenAI Chat Completions; stream: true supported (SSE) |
| POST | `/v1/completions` | OpenAI legacy text completions |
| POST | `/v1/embeddings` | If the backend serves an embedding model |
| POST | `/v1/messages` | Anthropic Messages API; stream: true supported (SSE) |
| POST | `/v1/responses` | OpenAI Responses API |
| `*` | `/v1/<anything>` | Forwarded as-is (path traversal `..` rejected with 400) |
All HTTP verbs are accepted (GET, POST, PUT, PATCH, DELETE,
OPTIONS).
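For example, the catch-all row behaves like this (the subpath below is illustrative, not a documented vLLM route):

```bash
# Any /v1/<subpath> is forwarded as-is; the backend answers 404 if the route
# does not exist there. (Illustrative subpath, not a documented endpoint.)
curl https://proxy.worv.ai/v1/some/backend/route \
  -H "Authorization: Bearer $API_KEY"

# Path-traversal attempts are rejected by the proxy with 400 before anything
# is forwarded. --path-as-is stops curl itself from collapsing the "..".
curl -i --path-as-is https://proxy.worv.ai/v1/../admin \
  -H "Authorization: Bearer $API_KEY"
```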
## Streaming
stream: true is forwarded transparently when the backend opts into SSE
(Content-Type: text/event-stream). Edge-side timeouts:
| Setting | Default | Override |
|---|---|---|
| Streaming inter-byte read | 1200 s | PROXY_STREAM_READ_TIMEOUT env var |
| Non-stream read | 1200 s | PROXY_NONSTREAM_READ_TIMEOUT env var |
| Connect | 10 s | hard-coded |
| nginx proxy_read_timeout | 1800 s | per-vhost |
DeepSeek-V4-Pro's DP=8 + MTP prefill on a 100 k+ token Claude-Code payload can take well over 300 s before the first byte; the long timeouts are deliberate. Clients should not retry on 504s blindly.
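If the defaults need raising, the override column corresponds to environment variables read by the proxy process. A minimal sketch (the values are assumed to be plain integers in seconds, set wherever the proxy's environment is defined):

```bash
# Raise both edge-side read timeouts to 30 minutes before (re)starting the proxy.
export PROXY_STREAM_READ_TIMEOUT=1800
export PROXY_NONSTREAM_READ_TIMEOUT=1800
```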
## Model field
Submit one of:
- `DeepSeek-V4-Pro` – canonical live model
- `DeepSeek-V4-Pro-nonthinking` – chat mode (no `<think>` block); see Thinking Modes
- `glm-5.1-fp8` / `Kimi-K2.6` – retired/alternate aliases; rewritten transparently when `PROXY_ALIAS_REWRITE_ALL=on`
- Any `claude-*` ID – rewritten transparently to the live model
- Any string – passed through if the live alias is unset; rewritten if `PROXY_ALIAS_REWRITE_ALL=on`
Matching is case-insensitive since 2026-04-26: deepseek-v4-pro,
DEEPSEEK-V4-PRO, DeepSeek-v4-pro, Claude-Sonnet-4-6, GLM-5.1-FP8
all route correctly. The proxy normalises the case to the canonical
served-model name before forwarding to vLLM (which is case-sensitive on
the served-model name).
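For example, the lowercase spelling works in an otherwise ordinary request:

```bash
# "deepseek-v4-pro" is normalised to the canonical served-model name
# before the request reaches vLLM.
curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": "ping"}]
      }'
```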
See the full alias matrix in Notes & Limits.
## Token-count semantics
For OpenAI-style responses the proxy parses the JSON usage.prompt_tokens
and usage.completion_tokens fields (or, for streaming, the final
usage block in the SSE stream). For Anthropic-style responses it parses
message.usage.input_tokens / output_tokens accordingly. Per-key
totals and per-day buckets are written to SQLite synchronously before the
response stream closes.
If the backend returns no usage block, only the request count is
incremented.
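For reference, the fields the proxy reads look like this in a non-streaming OpenAI-style response (token counts illustrative):

```json
{
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 256,
    "total_tokens": 1490
  }
}
```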
## max_tokens capping
The proxy clamps the client's max_tokens to fit the live context window:
```
available = MAX_CONTEXT_TOKENS - estimated_input - 512
capped    = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)
```
estimated_input is a coarse len(payload_text) / 3 heuristic (vLLM
itself enforces the precise limit). Defaults: MAX_CONTEXT_TOKENS=202752,
MAX_OUTPUT_TOKENS=16384. Both are configurable per-deployment via env
var.
If the cap is hit, the request still succeeds — the proxy mutates
max_tokens in place. There is no error, no warning header, and no
client-side notification.
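A worked example under the defaults: a request whose serialised payload is roughly 300,000 characters is estimated at about 100,000 input tokens, so:

```
available = 202752 - 100000 - 512     = 102240
capped    = min(32768, 102240, 16384) = 16384
```

A client that asked for max_tokens: 32768 is therefore silently clamped to 16384.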
## System-prompt injection
For /v1/chat/completions and /v1/messages the proxy may prepend a
system prompt to:
- Discourage Han-script ideographs in output if the inbound conversation is not Chinese / Japanese.
- Apply DeepSeek-V4-Pro Reasoning Effort prefixes (Think Max mode).
Existing client-supplied system messages are preserved and the proxy's prefix is concatenated, not replaced.
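Conceptually, a /v1/chat/completions request ends up looking something like this (the injected wording is a placeholder and the exact concatenation format is an assumption; only the preserve-and-prefix behaviour is documented):

```
client sends:    [{"role": "system", "content": "You are concise."}, …]
proxy forwards:  [{"role": "system", "content": "<proxy prefix>\n\nYou are concise."}, …]
```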
## Examples

### Streaming chat completion
```bash
curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  --no-buffer \
  -d '{
        "model": "DeepSeek-V4-Pro",
        "stream": true,
        "messages": [
          {"role": "system", "content": "You are concise."},
          {"role": "user", "content": "Three primes greater than 100."}
        ]
      }'
```
Response is text/event-stream with data: {…} SSE frames terminated
by data: [DONE].
### Anthropic Messages with thinking budget
```bash
curl https://proxy.worv.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
        "model": "DeepSeek-V4-Pro",
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": 2048},
        "messages": [{"role": "user", "content": "Solve: ..."}]
      }'
```
### Listing models
Returns the upstream vLLM /v1/models payload, with a synthesised
DeepSeek-V4-Pro-nonthinking entry appended whenever DeepSeek is the
live alias.
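For example:

```bash
curl https://proxy.worv.ai/v1/models \
  -H "Authorization: Bearer $API_KEY"
```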