worv.ai Inference API¶
Self-hosted, OpenAI-/Anthropic-compatible inference proxy.
The proxy serves whichever frontier-class open-weights model is currently provisioned on the 8× NVIDIA B200 backend — DeepSeek-V4-Pro (live as of 2026-04-26), with GLM-5.1-FP8 and Kimi-K2.6 as alternate launch profiles. Clients should not hard-code a specific model name; the proxy transparently rewrites known aliases to whichever model is currently live.
Hosts¶
| Host | Purpose | Authentication |
|---|---|---|
| https://proxy.worv.ai | Public OpenAI/Anthropic-compatible inference under /v1/* | API key |
| https://admin.proxy.worv.ai | Admin console — key management, usage stats, GPU telemetry | Session + CSRF |
| https://chat.proxy.worv.ai | Open WebUI front-end (browser chat) | Open WebUI account |
| https://docs.proxy.worv.ai | This documentation | Public |
All hosts share the same backing instance and resolve to the same IP
(52.78.33.184 — AWS EC2, Seoul). HTTPS is enforced; the
front edge automatically 301-redirects HTTP to HTTPS.
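A quick sanity check of the redirect behavior, as a sketch assuming the `requests` package:

```python
# Plain HTTP should bounce straight to HTTPS at the edge.
import requests

r = requests.get("http://proxy.worv.ai/", allow_redirects=False, timeout=10)
print(r.status_code, r.headers.get("Location"))  # expect: 301 https://proxy.worv.ai/
```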
Quick start¶
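A minimal first request, sketched with the standard `openai` Python client; the API key is a placeholder, and `deepseek-v4-pro` stands in for any of the rewritten aliases described under Compatibility surface, all of which route to whichever model is live.

```python
# Minimal chat completion against the public host. The key and model
# alias are placeholders; any known alias is rewritten to the live model.
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.worv.ai/v1",
    api_key="YOUR_API_KEY",  # issued via the admin console
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,  # clamped server-side if it would overflow the context budget
)
print(resp.choices[0].message.content)
```

Anthropic-style clients can target the same base host, since both API surfaces live under /v1/*.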
Compatibility surface¶
The /v1/* namespace forwards to a vLLM backend, so everything the
backend serves is reachable transparently. The proxy adds:
- API-key gating — every request requires a valid key (see Authentication).
- Model alias rewriting — `claude-*`, `glm-5.1-fp8`, `kimi-k2.6`, case variants, and the live alias all route to the current backend (see Notes & Limits).
- Dynamic `max_tokens` capping — client-supplied output budgets are clamped so `prompt + output + 512` fits the served context window (a worked sketch follows the deployment table below).
- DeepSeek thinking-mode mapping — `DeepSeek-V4-Pro-nonthinking` and Claude-style `output_config.effort` translate to vLLM `chat_template_kwargs` (see Thinking Modes); an example follows this list.
- Optional Han-script stripping in streaming output (off by default).
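To make the thinking-mode mapping concrete, here is a hedged sketch of two requests that should be equivalent under the translation above. The `thinking` kwarg name inside `chat_template_kwargs` is an assumption about the backend's chat template, not something this documentation confirms; in normal use the suffixed alias alone suffices and the proxy performs the translation itself.

```python
# Hedged sketch of the thinking-mode mapping; the `thinking` kwarg is an
# assumed chat-template parameter, not confirmed by this documentation.
from openai import OpenAI

client = OpenAI(base_url="https://proxy.worv.ai/v1", api_key="YOUR_API_KEY")

# 1) Suffixed alias: the proxy maps it to chat_template_kwargs for vLLM.
resp = client.chat.completions.create(
    model="DeepSeek-V4-Pro-nonthinking",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)

# 2) Explicit passthrough via extra_body, roughly what the proxy emits.
resp = client.chat.completions.create(
    model="DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(resp.choices[0].message.content)
```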
Hardware & live deployment¶
| Item | Value |
|---|---|
| Edge | AWS EC2 (Seoul) proxy.worv.ai |
| Backend | 8× NVIDIA B200 SXM5 (183 GiB / GPU, NVLink 18 mesh) |
| Live model | DeepSeek-V4-Pro — DP=8 expert-parallel + MTP spec=2 + 256K context, fp8 KV |
| Inference engine | vLLM 0.19.1 stable (Kimi/GLM) · vLLM deepseekv4-cu130 Docker image (DeepSeek) |
| Default total budget | 202,752 tokens (`MAX_CONTEXT_TOKENS`), with 512 tokens reserved as a safety margin for the system prompt |
| Default max output | 16,384 tokens (`MAX_OUTPUT_TOKENS`) |
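As a worked illustration of the dynamic `max_tokens` clamp from the compatibility list, the sketch below applies the two limits above. The function name and exact token accounting are assumptions, not the proxy's actual code.

```python
# Illustrative sketch of the dynamic max_tokens clamp; names and exact
# accounting are assumptions. Constants come from the deployment table.
MAX_CONTEXT_TOKENS = 202_752  # total budget: prompt + output + reserve
MAX_OUTPUT_TOKENS = 16_384    # default when the client omits max_tokens
SAFETY_RESERVE = 512          # headroom kept for the system prompt

def clamp_max_tokens(prompt_tokens: int, requested: int | None = None) -> int:
    """Shrink the output budget so prompt + output + reserve fits the window."""
    if requested is None:
        requested = MAX_OUTPUT_TOKENS
    available = MAX_CONTEXT_TOKENS - prompt_tokens - SAFETY_RESERVE
    return max(0, min(requested, available))

print(clamp_max_tokens(1_000))            # 16384  (default budget fits)
print(clamp_max_tokens(200_000, 16_384))  # 2240   (clamped to the window)
```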
Operational state and historical changes live in the project repo
(docs/current-state.md).