worv.ai Inference API

Self-hosted, OpenAI-/Anthropic-compatible inference proxy.

The proxy serves whichever frontier-class open-weights model is currently provisioned on the 8× NVIDIA B200 backend — DeepSeek-V4-Pro (live as of 2026-04-26), with GLM-5.1-FP8 and Kimi-K2.6 as alternate launch profiles. Clients should not hard-code the model name; the proxy transparently rewrites known aliases to whatever is currently live.
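
If you want to check what the live alias currently resolves to, you can list the served models instead of assuming a name. A minimal sketch with the OpenAI Python SDK, assuming the standard /v1/models listing is passed through (vLLM exposes it by default):

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.worv.ai/v1",
    api_key="<your-api-key>",
)
# Print the model IDs the backend currently serves.
for model in client.models.list():
    print(model.id)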

Hosts

Host                        | Purpose                                                     | Authentication
https://proxy.worv.ai       | Public OpenAI/Anthropic-compatible inference under /v1/*    | API key
https://admin.proxy.worv.ai | Admin console — key management, usage stats, GPU telemetry  | Session + CSRF
https://chat.proxy.worv.ai  | Open WebUI front-end (browser chat)                         | Open WebUI account
https://docs.proxy.worv.ai  | This documentation                                          | Public

All hosts share the same backing instance and resolve to the same IP (52.78.33.184 — AWS EC2, Seoul). HTTPS is enforced; the front edge automatically 301-redirects HTTP to HTTPS.
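
As a quick, purely illustrative check of the redirect (the requests package and the target path are arbitrary choices):

import requests

# Plain-HTTP requests are answered with a 301 redirect to the HTTPS host.
resp = requests.get("http://proxy.worv.ai/", allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))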

Quick start

OpenAI-compatible Chat Completions (curl):

curl https://proxy.worv.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}]
  }'

OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.worv.ai/v1",
    api_key="<your-api-key>",
)
resp = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Reply with one word: pong"}],
)
print(resp.choices[0].message.content)

Anthropic-compatible Messages (curl):

curl https://proxy.worv.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "DeepSeek-V4-Pro",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "ping"}]
  }'

Claude Code:

export ANTHROPIC_BASE_URL="https://proxy.worv.ai"
export ANTHROPIC_AUTH_TOKEN="<your-api-key>"
export ANTHROPIC_DEFAULT_OPUS_MODEL="DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_SONNET_MODEL="DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="DeepSeek-V4-Pro"
claude
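
Streaming also passes straight through. A minimal sketch with the OpenAI Python SDK, assuming the backend's standard server-sent-event streaming (which vLLM serves) is left enabled:

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.worv.ai/v1",
    api_key="<your-api-key>",
)
stream = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental content delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()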

Compatibility surface

The /v1/* namespace forwards to a vLLM backend, so everything the backend serves is reachable transparently. The proxy adds:

  • API-key gating — every request requires a valid key (see Authentication).
  • Model alias rewriting — claude-*, glm-5.1-fp8, kimi-k2.6, case variants, and the live alias all route to the current backend (see Notes & Limits).
  • Dynamic max_tokens capping — client-supplied output budgets are clamped so prompt + output + 512 fits the served context window (see the sketch after this list).
  • DeepSeek thinking-mode mapping — DeepSeek-V4-Pro-nonthinking and Claude-style output_config.effort translate to vLLM chat_template_kwargs (see Thinking Modes).
  • Optional Han-script stripping in streaming output (off by default).
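
For illustration, the max_tokens clamping rule can be written out as below. This is a sketch of the behavior described in the list above, using the budget values from the deployment table that follows; clamp_max_tokens and RESERVED_TOKENS are hypothetical names, not the proxy's actual code.

MAX_CONTEXT_TOKENS = 202_752   # total budget (see table below)
MAX_OUTPUT_TOKENS = 16_384     # default output cap (see table below)
RESERVED_TOKENS = 512          # safety margin kept free for the system prompt

def clamp_max_tokens(prompt_tokens, requested_max_tokens=None):
    # Illustrative only: prompt + output + reserve must fit inside the total budget.
    available = MAX_CONTEXT_TOKENS - prompt_tokens - RESERVED_TOKENS
    requested = MAX_OUTPUT_TOKENS if requested_max_tokens is None else requested_max_tokens
    return max(0, min(requested, available))

# A 190,000-token prompt leaves 202,752 - 190,000 - 512 = 12,240 output tokens,
# so a request asking for 16,384 is clamped to 12,240.
print(clamp_max_tokens(190_000, 16_384))  # 12240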

Hardware & live deployment

Item                 | Value
Edge                 | AWS EC2 (Seoul), proxy.worv.ai
Backend              | 8× NVIDIA B200 SXM5 (183 GiB / GPU, NVLink 18 mesh)
Live model           | DeepSeek-V4-Pro — DP=8 expert-parallel + MTP spec=2 + 256K context, fp8 KV
Inference engine     | vLLM 0.19.1 stable (Kimi/GLM) · vLLM deepseekv4-cu130 Docker image (DeepSeek)
Default total budget | 202,752 tokens (MAX_CONTEXT_TOKENS); 512 tokens reserved for system-prompt safety
Default max output   | 16,384 tokens (MAX_OUTPUT_TOKENS)

Operational state and historical changes live in the project repo (docs/current-state.md).