worv.ai Inference API¶
Self-hosted, OpenAI-/Anthropic-compatible inference proxy.
The proxy serves whichever frontier-class open-weights model is currently
provisioned on the 8× NVIDIA B200 backend —
GLM-5.1-FP8 (live as of 2026-04-27), with DeepSeek-V4-Pro,
DeepSeek-V4-Flash, and Kimi-K2.6 as alternate launch profiles.
Clients should not hard-code the model name; the proxy transparently
rewrites known aliases (glm-5.1-fp8, DeepSeek-V4-Pro,
DeepSeek-V4-Flash, kimi-k2.6, claude-*) to whatever is currently
live.
Hosts¶
| Host | Purpose | Authentication |
|---|---|---|
https://proxy.worv.ai |
Public OpenAI/Anthropic-compatible inference under /v1/* |
API key |
https://admin.proxy.worv.ai |
Admin console — key management, usage stats, GPU telemetry | Session + CSRF |
https://chat.proxy.worv.ai |
Open WebUI front-end (browser chat) | Open WebUI account |
https://docs.proxy.worv.ai |
This documentation | Public |
All hosts share the same backing instance and resolve to the same IP
(52.78.33.184 — AWS EC2, Seoul). HTTPS is enforced; the
front edge automatically 301-redirects HTTP to HTTPS.
Quick start¶
Compatibility surface¶
The /v1/* namespace forwards to a vLLM backend, so everything the
backend serves is reachable transparently. The proxy adds:
- API-key gating — every request requires a valid key (see Authentication).
- Model alias rewriting —
claude-*,glm-5.1-fp8,kimi-k2.6, case variants, and the live alias all route to the current backend (see Notes & Limits). - Dynamic
max_tokenscapping — client-supplied output budgets are clamped soprompt + output + 512fits the served context window. - DeepSeek thinking-mode mapping —
DeepSeek-V4-Pro-nonthinkingand Claude-styleoutput_config.efforttranslate to vLLMchat_template_kwargs(see Thinking Modes). - Optional Han-script stripping in streaming output (off by default).
Hardware & live deployment¶
| Item | Value |
|---|---|
| Edge | AWS EC2 (Seoul) proxy.worv.ai |
| Backend | 8× NVIDIA B200 SXM5 (183 GiB / GPU, NVLink 18 mesh) |
| Live model | GLM-5.1-FP8 — TP=8 (single-engine), 202K context, fp8 KV, MTP off (reverted from DeepSeek-V4-Pro on 2026-04-27 after a vLLM v1 DP-router wedge) |
| Inference engine | vLLM 0.19.1 stable (Kimi/GLM) · vLLM deepseekv4-cu130 Docker image (DeepSeek) |
| Default total budget | 202,752 tokens (MAX_CONTEXT_TOKENS); reserved 512 tokens for system prompt safety |
| Default max output | 16,384 tokens (MAX_OUTPUT_TOKENS) |
Operational state and historical changes live in the project repo
(docs/current-state.md).