
Inference endpoints (/v1)

The proxy is a thin pass-through to vLLM under https://proxy.worv.ai/v1/<subpath>. Any subpath reachable on the backend is reachable through the proxy — common ones are listed below.

Method  Path                  Notes
GET     /v1/models            Augmented to advertise DeepSeek-V4-Pro-nonthinking when DeepSeek is live
POST    /v1/chat/completions  OpenAI Chat Completions; stream: true supported (SSE)
POST    /v1/completions       OpenAI legacy text completions
POST    /v1/embeddings        If the backend serves an embedding model
POST    /v1/messages          Anthropic Messages API; stream: true supported (SSE)
POST    /v1/responses         OpenAI Responses API
*       /v1/<anything>        Forwarded as-is (path traversal .. rejected with 400)

All HTTP verbs are accepted (GET, POST, PUT, PATCH, DELETE, OPTIONS).

Streaming

stream: true is forwarded transparently when the backend opts into SSE (Content-Type: text/event-stream). Edge-side timeouts:

Setting                    Default  Override
Streaming inter-byte read  1200 s   PROXY_STREAM_READ_TIMEOUT env var
Non-stream read            1200 s   PROXY_NONSTREAM_READ_TIMEOUT env var
Connect                    10 s     hard-coded
nginx proxy_read_timeout   1800 s   per-vhost

DeepSeek-V4-Pro's DP=8 + MTP prefill on a 100 k+ token Claude-Code payload can take well over 300 s before the first byte; the long timeouts are deliberate. Clients should not retry on 504s blindly.
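
A minimal streaming-client sketch with matching client-side timeouts, using Python's requests library (the payload mirrors the examples further down; the timeout tuple is an assumption you should tune to your deployment):

import json, os, requests

# Connect quickly, but allow a long read window: first-byte latency on large
# prefills can exceed 300 s, so the client read timeout should mirror the
# proxy's generous defaults rather than a typical 30-60 s value.
resp = requests.post(
    "https://proxy.worv.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={"model": "DeepSeek-V4-Pro", "stream": True,
          "messages": [{"role": "user", "content": "ping"}]},
    stream=True,
    timeout=(10, 1200),  # (connect, read) seconds, matching the table above
)
for line in resp.iter_lines():
    if line and line != b"data: [DONE]" and line.startswith(b"data: "):
        chunk = json.loads(line[len(b"data: "):])
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)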

Model field

Submit one of:

  • DeepSeek-V4-Pro — canonical live model
  • DeepSeek-V4-Pro-nonthinking — chat-mode (no <think> block); see Thinking Modes
  • glm-5.1-fp8 / Kimi-K2.6 — retired/alternate aliases; rewritten transparently when PROXY_ALIAS_REWRITE_ALL=on
  • Any claude-* ID — rewritten transparently to the live model
  • Any string — passed through if the live alias is unset; rewritten if PROXY_ALIAS_REWRITE_ALL=on

Matching is case-insensitive since 2026-04-26: deepseek-v4-pro, DEEPSEEK-V4-PRO, DeepSeek-v4-pro, Claude-Sonnet-4-6, GLM-5.1-FP8 all route correctly. The proxy normalises the case to the canonical served-model name before forwarding to vLLM (which is case-sensitive on the served-model name).
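
A sketch of the resolution order the list above implies; LIVE_MODEL and REWRITE_ALL are hypothetical names for illustration, not the proxy's actual internals:

def resolve_model(requested: str) -> str:
    LIVE_MODEL = "DeepSeek-V4-Pro"
    REWRITE_ALL = True  # PROXY_ALIAS_REWRITE_ALL=on
    lowered = requested.lower()
    if lowered == LIVE_MODEL.lower():
        return LIVE_MODEL                          # normalise casing for vLLM
    if lowered == f"{LIVE_MODEL}-nonthinking".lower():
        return f"{LIVE_MODEL}-nonthinking"
    if lowered.startswith("claude-") or REWRITE_ALL:
        return LIVE_MODEL                          # rewritten transparently
    return requested                               # passed through untouched

# resolve_model("GLM-5.1-FP8") and resolve_model("Claude-Sonnet-4-6")
# both route to "DeepSeek-V4-Pro".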

See the full alias matrix in Notes & Limits.

Token-count semantics

For OpenAI-style responses the proxy parses the JSON usage.prompt_tokens and usage.completion_tokens fields (or, for streaming, the final usage block in the SSE stream). For Anthropic-style responses it parses message.usage.input_tokens / output_tokens accordingly. Per-key totals and per-day buckets are written to SQLite synchronously before the response stream closes.

If the backend returns no usage block, only the request count is incremented.
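
A sketch of that parsing rule (illustrative, not the proxy's code); it accepts either response shape and signals a missing usage block with None:

def extract_usage(body: dict) -> tuple[int, int] | None:
    # OpenAI bodies carry usage at the top level; Anthropic-style bodies may
    # nest it under the message object, per the description above.
    usage = body.get("usage") or body.get("message", {}).get("usage")
    if not usage:
        return None  # no usage block: only the request count is incremented
    if "prompt_tokens" in usage:                             # OpenAI shape
        return usage["prompt_tokens"], usage.get("completion_tokens", 0)
    return usage["input_tokens"], usage["output_tokens"]     # Anthropic shape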

max_tokens capping

The proxy clamps the client's max_tokens to fit the live context window:

available = MAX_CONTEXT_TOKENS - estimated_input - 512
capped    = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)

estimated_input is a coarse len(payload_text) / 3 heuristic (vLLM itself enforces the precise limit). Defaults: MAX_CONTEXT_TOKENS=202752, MAX_OUTPUT_TOKENS=16384. Both are configurable per deployment via env vars.

If the cap is hit, the request still succeeds — the proxy mutates max_tokens in place. There is no error, no warning header, and no client-side notification.
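
A worked sketch of the clamp under the defaults above (illustrative only):

MAX_CONTEXT_TOKENS = 202_752
MAX_OUTPUT_TOKENS = 16_384

def cap_max_tokens(client_max_tokens: int, payload_text: str) -> int:
    estimated_input = len(payload_text) // 3       # coarse chars/3 heuristic
    available = MAX_CONTEXT_TOKENS - estimated_input - 512
    return min(client_max_tokens, available, MAX_OUTPUT_TOKENS)

# A 300,000-character payload estimates to 100,000 input tokens, leaving
# 202,752 - 100,000 - 512 = 102,240 available, so a client asking for
# 32,768 output tokens is silently clamped to 16,384.
print(cap_max_tokens(32_768, "x" * 300_000))  # -> 16384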

System-prompt injection

For /v1/chat/completions and /v1/messages the proxy may prepend a system prompt to:

  1. Discourage Han-script ideographs in output if the inbound conversation is not Chinese / Japanese.
  2. Apply DeepSeek-V4-Pro Reasoning Effort prefixes (Think Max mode).

Existing client-supplied system messages are preserved and the proxy's prefix is concatenated, not replaced.
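
A sketch of the concatenation rule for an OpenAI-style messages array (the prefix wording here is invented for illustration):

PROXY_PREFIX = "Avoid Han-script ideographs in your output."  # hypothetical text

def inject_system_prompt(messages: list[dict]) -> list[dict]:
    if messages and messages[0].get("role") == "system":
        # Preserve the client's system message; prepend rather than replace.
        merged = dict(messages[0])
        merged["content"] = PROXY_PREFIX + "\n\n" + merged["content"]
        return [merged] + messages[1:]
    return [{"role": "system", "content": PROXY_PREFIX}] + messages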

Examples

Streaming chat completion

curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  --no-buffer \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user",   "content": "Three primes greater than 100."}
    ]
  }'

Response is text/event-stream with data: {…} SSE frames terminated by data: [DONE].
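
The same request via the openai Python SDK, pointing base_url at the proxy (assuming openai>=1.0):

import os
from openai import OpenAI

client = OpenAI(base_url="https://proxy.worv.ai/v1",
                api_key=os.environ["API_KEY"])

stream = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    stream=True,
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Three primes greater than 100."},
    ],
)
for chunk in stream:
    # delta.content is None on role-only and terminal chunks
    print(chunk.choices[0].delta.content or "", end="", flush=True)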

Anthropic Messages with thinking budget

curl https://proxy.worv.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [{"role": "user", "content": "Solve: ..."}]
  }'
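
The same request via the anthropic Python SDK (assuming a recent SDK version that supports the thinking parameter; base_url points at the proxy root since the SDK appends /v1/messages itself):

import os
import anthropic

client = anthropic.Anthropic(base_url="https://proxy.worv.ai",
                             api_key=os.environ["API_KEY"])

message = client.messages.create(
    model="DeepSeek-V4-Pro",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Solve: ..."}],
)
# Token counts as recorded by the proxy's usage accounting (see above)
print(message.usage.input_tokens, message.usage.output_tokens)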

Listing models

curl https://proxy.worv.ai/v1/models \
  -H "Authorization: Bearer $API_KEY"

Returns the upstream vLLM /v1/models payload, with a synthesised DeepSeek-V4-Pro-nonthinking entry appended whenever DeepSeek is the live alias.