worv.ai Inference API

Self-hosted, OpenAI-/Anthropic-compatible inference proxy.

The proxy serves whichever frontier-class open-weights model is currently provisioned on the 8× NVIDIA B200 backend — DeepSeek-V4-Pro (live as of 2026-04-26), with GLM-5.1-FP8 and Kimi-K2.6 as alternate launch profiles. Clients should not hard-code the model name; the proxy transparently rewrites known aliases to whatever is currently live.
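As a rough illustration of the alias rewriting described above, the sketch below maps an incoming model name onto the live one. The alias names are the ones this page lists; the function name `resolve_model`, the `LIVE_MODEL` constant, and the pass-through behavior for unknown names are assumptions made for illustration, not the proxy's actual implementation.

```python
# Hypothetical sketch of model-alias rewriting; not the proxy's real code.
LIVE_MODEL = "DeepSeek-V4-Pro"  # assumed to come from deployment config

# Alternate launch profiles named in this documentation, stored lowercase.
STATIC_ALIASES = {"glm-5.1-fp8", "kimi-k2.6"}


def resolve_model(requested: str, live: str = LIVE_MODEL) -> str:
    """Return the live model for known aliases, else the name unchanged."""
    name = requested.strip().lower()  # case variants collapse here
    if (
        name.startswith("claude-")     # any claude-* alias
        or name in STATIC_ALIASES      # alternate launch profiles
        or name == live.lower()        # the live alias itself
    ):
        return live
    # Assumption: names outside the alias set pass through untouched.
    return requested
```

Under these assumptions, `resolve_model("claude-sonnet-4")` and `resolve_model("GLM-5.1-FP8")` both return `"DeepSeek-V4-Pro"`, which is why a client can keep sending a stable alias across model swaps.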

Hosts

Host	Purpose	Authentication
`https://proxy.worv.ai`	Public OpenAI/Anthropic-compatible inference under `/v1/*`	API key
`https://admin.proxy.worv.ai`	Admin console — key management, usage stats, GPU telemetry	Session + CSRF
`https://chat.proxy.worv.ai`	Open WebUI front-end (browser chat)	Open WebUI account
`https://docs.proxy.worv.ai`	This documentation	Public

All hosts share the same backing instance and resolve to the same IP (52.78.33.184 — AWS EC2, Seoul). HTTPS is enforced; the front edge automatically 301-redirects HTTP to HTTPS.

Quick start

curl (OpenAI)

curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}]
  }'

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.worv.ai/v1",
    api_key="<your-api-key>",
)
resp = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Reply with one word: pong"}],
)
print(resp.choices[0].message.content)

Anthropic Messages

curl https://proxy.worv.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "ping"}]
  }'

Claude Code (env)

export ANTHROPIC_BASE_URL="https://proxy.worv.ai"
export ANTHROPIC_AUTH_TOKEN="<your-api-key>"
export ANTHROPIC_DEFAULT_OPUS_MODEL="DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_SONNET_MODEL="DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="DeepSeek-V4-Pro"
claude
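
The Anthropic Messages call from the quick start can also be issued from Python without any SDK. The sketch below rebuilds the same request (URL, headers, and body taken from the curl example) using only the standard library; the helper names `build_messages_request` and `send` are illustrative, and the shape of the response is not specified on this page, so the code returns the decoded JSON as-is.

```python
import json
import urllib.request

BASE_URL = "https://proxy.worv.ai"


def build_messages_request(api_key: str, prompt: str,
                           model: str = "DeepSeek-V4-Pro",
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/messages."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call would look like `send(build_messages_request(os.environ["API_KEY"], "ping"))`, with `os` imported and the key exported as in the curl examples.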

Compatibility surface

The `/v1/*` namespace forwards to a vLLM backend, so everything the backend serves is reachable transparently. The proxy adds:

- API-key gating: every request requires a valid key (see Authentication).
- Model alias rewriting: `claude-*`, `glm-5.1-fp8`, `kimi-k2.6`, case variants, and the live alias all route to the current backend (see Notes & Limits).
- Dynamic `max_tokens` capping: client-supplied output budgets are clamped so that prompt + output + 512 fits the served context window.
- DeepSeek thinking-mode mapping: `DeepSeek-V4-Pro-nonthinking` and Claude-style `output_config.effort` translate to vLLM `chat_template_kwargs` (see Thinking Modes).
- Optional Han-script stripping in streaming output (off by default).
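
The `max_tokens` capping rule above amounts to a small piece of arithmetic. The sketch below is one way to express it, assuming the `MAX_CONTEXT_TOKENS` default of 202,752 listed under Hardware & live deployment; the function name `clamp_max_tokens` and the choice to bottom out at 0 when the prompt leaves no room are assumptions, and the window actually served may differ from the default.

```python
CONTEXT_WINDOW = 202_752  # default MAX_CONTEXT_TOKENS from this page
SAFETY_RESERVE = 512      # reserve named in the capping rule


def clamp_max_tokens(requested: int, prompt_tokens: int,
                     context_window: int = CONTEXT_WINDOW,
                     reserve: int = SAFETY_RESERVE) -> int:
    """Largest budget <= requested with prompt + output + reserve <= window."""
    available = context_window - prompt_tokens - reserve
    # Assumption: the budget bottoms out at 0 when no room remains;
    # how the proxy handles that case is not documented here.
    return max(0, min(requested, available))
```

With these defaults, a 200,000-token prompt leaves 202,752 − 200,000 − 512 = 2,240 tokens, so a request for 16,384 output tokens would be clamped to 2,240, while the same request with a 1,000-token prompt passes through unchanged.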

Hardware & live deployment

Item	Value
Edge	AWS EC2 (Seoul) `proxy.worv.ai`
Backend	8× NVIDIA B200 SXM5 (183 GiB / GPU, NVLink 18 mesh)
Live model	DeepSeek-V4-Pro — DP=8 expert-parallel + MTP spec=2 + 256K context, fp8 KV
Inference engine	vLLM 0.19.1 stable (Kimi/GLM) · vLLM `deepseekv4-cu130` Docker image (DeepSeek)
Default total budget	202,752 tokens (`MAX_CONTEXT_TOKENS`), with 512 tokens reserved for system prompt safety
Default max output	16,384 tokens (`MAX_OUTPUT_TOKENS`)

Operational state and historical changes live in the project repo (`docs/current-state.md`).