
Inference endpoints (/v1)

The proxy is a thin pass-through to vLLM under https://proxy.worv.ai/v1/<subpath>. Any subpath reachable on the backend is reachable through the proxy — common ones are listed below.

Method  Path                  Notes
GET     /v1/models            Augmented to advertise DeepSeek-V4-Pro-nonthinking when DeepSeek is live
POST    /v1/chat/completions  OpenAI Chat Completions; stream: true supported (SSE)
POST    /v1/completions       OpenAI legacy text completions
POST    /v1/embeddings        If the backend serves an embedding model
POST    /v1/messages          Anthropic Messages API; stream: true supported (SSE)
POST    /v1/responses         OpenAI Responses API
*       /v1/<anything>        Forwarded as-is (path traversal .. rejected with 400)

All HTTP verbs are accepted (GET, POST, PUT, PATCH, DELETE, OPTIONS).

Streaming

stream: true is forwarded transparently when the backend opts into SSE (Content-Type: text/event-stream). Edge-side timeouts:

Setting                    Default  Override
Streaming inter-byte read  1200 s   PROXY_STREAM_READ_TIMEOUT env var
Non-stream read            1200 s   PROXY_NONSTREAM_READ_TIMEOUT env var
Connect                    10 s     hard-coded
nginx proxy_read_timeout   1800 s   per-vhost

DeepSeek-V4-Pro's DP=8 + MTP prefill on a 100 k+ token Claude-Code payload can take well over 300 s before the first byte; the long timeouts are deliberate. Clients should not retry on 504s blindly.
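
A minimal streaming-client sketch with matching client-side timeouts, using Python's requests library (the payload mirrors the examples further down; the timeout tuple is an assumption you should tune to your deployment):

import json, os, requests

# Connect quickly, but allow a long read window: first-byte latency on large
# prefills can exceed 300 s, so the client read timeout should mirror the
# proxy's generous defaults rather than a typical 30-60 s value.
resp = requests.post(
    "https://proxy.worv.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={"model": "DeepSeek-V4-Pro", "stream": True,
          "messages": [{"role": "user", "content": "ping"}]},
    stream=True,
    timeout=(10, 1200),  # (connect, read) seconds, matching the table above
)
for line in resp.iter_lines():
    if line and line != b"data: [DONE]" and line.startswith(b"data: "):
        chunk = json.loads(line[len(b"data: "):])
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)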

Model field

Submit one of:

  • DeepSeek-V4-Pro — canonical live model
  • DeepSeek-V4-Pro-nonthinking — chat-mode (no <think> block); see Thinking Modes
  • glm-5.1-fp8 / Kimi-K2.6 — retired/alternate aliases; rewritten transparently when PROXY_ALIAS_REWRITE_ALL=on
  • Any claude-* ID — rewritten transparently to the live model
  • Any string — passed through if the live alias is unset; rewritten if PROXY_ALIAS_REWRITE_ALL=on

Matching is case-insensitive since 2026-04-26: deepseek-v4-pro, DEEPSEEK-V4-PRO, DeepSeek-v4-pro, Claude-Sonnet-4-6, GLM-5.1-FP8 all route correctly. The proxy normalises the case to the canonical served-model name before forwarding to vLLM (which is case-sensitive on the served-model name).
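
A sketch of the resolution order the list above implies; LIVE_MODEL and REWRITE_ALL are hypothetical names for illustration, not the proxy's actual internals:

def resolve_model(requested: str) -> str:
    LIVE_MODEL = "DeepSeek-V4-Pro"
    REWRITE_ALL = True  # PROXY_ALIAS_REWRITE_ALL=on
    lowered = requested.lower()
    if lowered == LIVE_MODEL.lower():
        return LIVE_MODEL                          # normalise casing for vLLM
    if lowered == f"{LIVE_MODEL}-nonthinking".lower():
        return f"{LIVE_MODEL}-nonthinking"
    if lowered.startswith("claude-") or REWRITE_ALL:
        return LIVE_MODEL                          # rewritten transparently
    return requested                               # passed through untouched

# resolve_model("GLM-5.1-FP8") and resolve_model("Claude-Sonnet-4-6")
# both route to "DeepSeek-V4-Pro".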

See the full alias matrix in Notes & Limits.

Token-count semantics

For OpenAI-style responses the proxy parses the JSON usage.prompt_tokens and usage.completion_tokens fields (or, for streaming, the final usage block in the SSE stream). For Anthropic-style responses it parses message.usage.input_tokens / output_tokens accordingly. Per-key totals and per-day buckets are written to SQLite synchronously before the response stream closes.

If the backend returns no usage block, only the request count is incremented.
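
A sketch of that parsing rule (illustrative, not the proxy's code); it accepts either response shape and signals a missing usage block with None:

def extract_usage(body: dict) -> tuple[int, int] | None:
    # OpenAI bodies carry usage at the top level; Anthropic-style bodies may
    # nest it under the message object, per the description above.
    usage = body.get("usage") or body.get("message", {}).get("usage")
    if not usage:
        return None  # no usage block: only the request count is incremented
    if "prompt_tokens" in usage:                             # OpenAI shape
        return usage["prompt_tokens"], usage.get("completion_tokens", 0)
    return usage["input_tokens"], usage["output_tokens"]     # Anthropic shape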

max_tokens capping

The proxy clamps the client's max_tokens to fit the live context window:

available = MAX_CONTEXT_TOKENS - estimated_input - 512
capped    = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)

estimated_input is a coarse len(payload_text) / 3 heuristic (vLLM itself enforces the precise limit). Defaults: MAX_CONTEXT_TOKENS=202752, MAX_OUTPUT_TOKENS=16384. Both are configurable per deployment via env vars.

If the cap is hit, the request still succeeds — the proxy mutates max_tokens in place. There is no error, no warning header, and no client-side notification.
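
A worked sketch of the clamp under the defaults above (illustrative only):

MAX_CONTEXT_TOKENS = 202_752
MAX_OUTPUT_TOKENS = 16_384

def cap_max_tokens(client_max_tokens: int, payload_text: str) -> int:
    estimated_input = len(payload_text) // 3       # coarse chars/3 heuristic
    available = MAX_CONTEXT_TOKENS - estimated_input - 512
    return min(client_max_tokens, available, MAX_OUTPUT_TOKENS)

# A 300,000-character payload estimates to 100,000 input tokens, leaving
# 202,752 - 100,000 - 512 = 102,240 available, so a client asking for
# 32,768 output tokens is silently clamped to 16,384.
print(cap_max_tokens(32_768, "x" * 300_000))  # -> 16384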

System-prompt injection

For /v1/chat/completions and /v1/messages the proxy may prepend a system prompt to:

  1. Discourage Han-script ideographs in output if the inbound conversation is not Chinese / Japanese.
  2. Apply DeepSeek-V4-Pro Reasoning Effort prefixes (Think Max mode).

Existing client-supplied system messages are preserved and the proxy's prefix is concatenated, not replaced.
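
A sketch of the concatenation rule for an OpenAI-style messages array (the prefix wording here is invented for illustration):

PROXY_PREFIX = "Avoid Han-script ideographs in your output."  # hypothetical text

def inject_system_prompt(messages: list[dict]) -> list[dict]:
    if messages and messages[0].get("role") == "system":
        # Preserve the client's system message; prepend rather than replace.
        merged = dict(messages[0])
        merged["content"] = PROXY_PREFIX + "\n\n" + merged["content"]
        return [merged] + messages[1:]
    return [{"role": "system", "content": PROXY_PREFIX}] + messages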

Examples

Streaming chat completion

curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  --no-buffer \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user",   "content": "Three primes greater than 100."}
    ]
  }'

Response is text/event-stream with data: {…} SSE frames terminated by data: [DONE].
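
The same request via the openai Python SDK, pointing base_url at the proxy (assuming openai>=1.0):

import os
from openai import OpenAI

client = OpenAI(base_url="https://proxy.worv.ai/v1",
                api_key=os.environ["API_KEY"])

stream = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    stream=True,
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Three primes greater than 100."},
    ],
)
for chunk in stream:
    # delta.content is None on role-only and terminal chunks
    print(chunk.choices[0].delta.content or "", end="", flush=True)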

Anthropic Messages with thinking budget

curl https://proxy.worv.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [{"role": "user", "content": "Solve: ..."}]
  }'
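
The same request via the anthropic Python SDK (assuming a recent SDK version that supports the thinking parameter; base_url points at the proxy root since the SDK appends /v1/messages itself):

import os
import anthropic

client = anthropic.Anthropic(base_url="https://proxy.worv.ai",
                             api_key=os.environ["API_KEY"])

message = client.messages.create(
    model="DeepSeek-V4-Pro",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Solve: ..."}],
)
# Token counts as recorded by the proxy's usage accounting (see above)
print(message.usage.input_tokens, message.usage.output_tokens)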

Listing models

curl https://proxy.worv.ai/v1/models \
  -H "Authorization: Bearer $API_KEY"

Returns the upstream vLLM /v1/models payload, with a synthesised DeepSeek-V4-Pro-nonthinking entry appended whenever DeepSeek is the live alias.