
Notes & Limits

Model alias rewriting

The proxy can rewrite client-supplied model strings to whatever model is currently provisioned. Three env vars on the service control this:

| Env var | Effect |
| --- | --- |
| PROXY_ALIAS_MODEL | Canonical served-model name (e.g. DeepSeek-V4-Pro). When unset, model IDs pass through unchanged. |
| PROXY_ALIAS_SOURCES | Comma-separated list of legacy IDs that should also rewrite to the canonical name (e.g. glm-5.1-fp8,Kimi-K2.6). |
| PROXY_ALIAS_REWRITE_ALL | Set to on to rewrite every non-empty model string to the canonical name. |

Decision order

  1. If PROXY_ALIAS_MODEL is unset → pass through.
  2. If the inbound model already equals the canonical name (case-sensitive) → pass through.
  3. Otherwise, lowercase both sides and rewrite to the canonical name if any of the following holds:
       • PROXY_ALIAS_REWRITE_ALL=on, or
       • the lowercased inbound equals the lowercased canonical name, or
       • the lowercased inbound starts with claude-, or
       • the lowercased inbound appears in the lowercased PROXY_ALIAS_SOURCES list.
  4. Otherwise → pass through (vLLM will 404 if the served-model-name doesn't match).
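
A minimal sketch of that decision order in Python, with illustrative names (the proxy's actual module layout and helper names may differ):

import os

ALIAS_MODEL = os.environ.get("PROXY_ALIAS_MODEL")          # canonical served-model name
ALIAS_SOURCES = {s.strip().lower()
                 for s in os.environ.get("PROXY_ALIAS_SOURCES", "").split(",")
                 if s.strip()}
REWRITE_ALL = os.environ.get("PROXY_ALIAS_REWRITE_ALL", "").lower() == "on"

def resolve_model(inbound: str) -> str:
    """Return the model string that will be forwarded to vLLM."""
    if not ALIAS_MODEL or not inbound:
        return inbound                         # 1. no canonical configured -> pass through
    if inbound == ALIAS_MODEL:
        return inbound                         # 2. exact, case-sensitive match -> pass through
    lowered = inbound.lower()
    if (REWRITE_ALL                            # 3. rewrite-all switch, or
            or lowered == ALIAS_MODEL.lower()  #    case-insensitive match, or
            or lowered.startswith("claude-")   #    Anthropic-style IDs, or
            or lowered in ALIAS_SOURCES):      #    configured legacy IDs
        return ALIAS_MODEL
    return inbound                             # 4. unknown -> pass through (vLLM may 404)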

Case-insensitivity

Since 2026-04-26 the alias check is case-insensitive on both sides. deepseek-v4-pro, DEEPSEEK-V4-PRO, Claude-Sonnet-4-6, and GLM-5.1-FP8 all rewrite correctly. The payload's model field is always overwritten with the canonical spelling before forwarding so that vLLM (which is case-sensitive) sees the exact string it serves.

Dynamic max_tokens capping

estimated_input = len(payload_text_chars) // 3
available       = MAX_CONTEXT_TOKENS - estimated_input - 512
capped          = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)
| Env var | Default | Notes |
| --- | --- | --- |
| MAX_CONTEXT_TOKENS | 202752 | Total budget the backend exposes. 256 K context = full DeepSeek-V4-Pro window. |
| MAX_OUTPUT_TOKENS | 16384 | Upper bound on output regardless of available budget. |

The 512-token slack is reserved for system-prompt injection. The heuristic is intentionally pessimistic (3 chars/token vs. the real ~3.5-4 for English) so that we never over-cap; vLLM still enforces the exact context limit. There is no client-visible signal that capping occurred.
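
The same heuristic as runnable Python, with illustrative names (cap_max_tokens is not necessarily the proxy's actual function, and serialising the whole payload to count characters is an approximation):

import json
import os

MAX_CONTEXT_TOKENS = int(os.environ.get("MAX_CONTEXT_TOKENS", 202752))
MAX_OUTPUT_TOKENS = int(os.environ.get("MAX_OUTPUT_TOKENS", 16384))
SLACK_TOKENS = 512  # reserved for system-prompt injection

def cap_max_tokens(payload: dict) -> dict:
    """Clamp the client-supplied max_tokens to the remaining context budget."""
    estimated_input = len(json.dumps(payload, ensure_ascii=False)) // 3  # ~3 chars/token, pessimistic
    available = MAX_CONTEXT_TOKENS - estimated_input - SLACK_TOKENS
    requested = payload.get("max_tokens", MAX_OUTPUT_TOKENS)
    payload["max_tokens"] = max(1, min(requested, available, MAX_OUTPUT_TOKENS))
    return payload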

Han-script stripping

When the user-side messages are not detected as Chinese / Japanese, the proxy can strip Han ideographs and CJK punctuation from the model's streaming output. This is opt-in via PROXY_STRIP_HAN=on — off by default.

Detection is based on Unicode codepoints in the inbound user messages:

  • Han ideographs (CJK Unified Ideographs blocks)
  • Hiragana / Katakana ⇒ treated as Japanese; output preserved
  • Hangul ⇒ treated as Korean; output preserved (Han stripping doesn't affect Hangul)

Tool-use streaming chunks are not filtered — only the natural language content blocks.
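
A sketch of the detection and stripping described above, using approximate Unicode ranges (the proxy's exact ranges and function names may differ):

import re

HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3000-\u303f\uff01-\uff60]")  # Han + CJK punctuation
KANA_RE = re.compile(r"[\u3040-\u30ff]")  # Hiragana + Katakana -> Japanese

def user_messages_look_chinese_or_japanese(messages: list[dict]) -> bool:
    """True if any user message contains Han or Kana codepoints."""
    text = " ".join(m.get("content", "") for m in messages
                    if m.get("role") == "user" and isinstance(m.get("content"), str))
    return bool(HAN_RE.search(text) or KANA_RE.search(text))

def strip_han(text: str) -> str:
    """Remove Han ideographs and CJK punctuation; Hangul is never matched, so Korean text is untouched."""
    return HAN_RE.sub("", text)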

Rate limits

| Endpoint | Limit |
| --- | --- |
| POST /api/login | 10 / minute / IP |
| /v1/* | None at the proxy. Backend vLLM has its own concurrency limits via --max-num-seqs. |
| /api/* (other) | None. |

X-Forwarded-For is honoured (ProxyFix is in the WSGI middleware), so the rate limiter sees the real client IP behind the nginx edge.
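
For illustration, the login limit plus ProxyFix wiring would look roughly like this, assuming Flask-Limiter is the rate limiter in use (the actual code may differ):

from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1)  # trust one X-Forwarded-For hop

limiter = Limiter(key_func=get_remote_address, app=app)

@app.route("/api/login", methods=["POST"])
@limiter.limit("10 per minute")  # /v1/* and other /api/* routes carry no proxy-side limit
def login():
    ...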

Timeouts (full chain)

| Hop | Timeout |
| --- | --- |
| nginx → Flask (proxy_read_timeout) | 1800 s |
| Flask → vLLM connect | 10 s |
| Flask → vLLM read (streaming) | 1200 s (env PROXY_STREAM_READ_TIMEOUT) |
| Flask → vLLM read (non-stream) | 1200 s (env PROXY_NONSTREAM_READ_TIMEOUT) |
| Flask → vLLM connect retry | once on ConnectionError |
| Session cookie lifetime | 365 days |
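
In requests terms, the Flask → vLLM hop amounts to roughly the following (illustrative names; the real proxy code may structure this differently):

import os
import requests

CONNECT_TIMEOUT = 10
STREAM_READ_TIMEOUT = int(os.environ.get("PROXY_STREAM_READ_TIMEOUT", 1200))

def forward_to_vllm(url: str, payload: dict, headers: dict) -> requests.Response:
    """POST with a (connect, read) timeout tuple; retry once on a connect failure."""
    for attempt in range(2):
        try:
            return requests.post(url, json=payload, headers=headers, stream=True,
                                 timeout=(CONNECT_TIMEOUT, STREAM_READ_TIMEOUT))
        except requests.exceptions.ConnectionError:
            if attempt == 1:
                raise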

CORS

CORS is disabled by default. Set FRONTEND_URL on the service to enable a single allowed origin with supports_credentials=True (used by the admin web UI when served from a different host during dev).
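
Assuming flask-cors (which the supports_credentials=True wording suggests), the setup is roughly:

import os
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
frontend_url = os.environ.get("FRONTEND_URL")
if frontend_url:
    # Only enabled when FRONTEND_URL is set: one allowed origin, credentials permitted.
    CORS(app, origins=[frontend_url], supports_credentials=True)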

Body size limit

Flask's MAX_CONTENT_LENGTH is 10 MiB. Requests exceeding that return 413 Request Entity Too Large from Flask before any backend contact.
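
In Flask this is a single config entry (sketch):

from flask import Flask

app = Flask(__name__)
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024  # 10 MiB; larger bodies are rejected with 413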

Path-traversal guard

Any /v1/<subpath> containing .. segments is rejected with 400 {"error": "Invalid path"}.
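
The guard itself is a simple segment check; a sketch (the helper name is illustrative):

from flask import jsonify

def reject_traversal(subpath: str):
    """Return a 400 response if any path segment is '..', else None."""
    if ".." in subpath.split("/"):
        return jsonify({"error": "Invalid path"}), 400
    return None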

Header forwarding

Request headers forwarded to the backend exclude: host, x-api-key, authorization, transfer-encoding. The proxy re-injects its own backend Authorization header from VLLM_API_KEY.

Response headers passed back to the client exclude: content-encoding, content-length, transfer-encoding, connection, keep-alive, proxy-authenticate, proxy-authorization, te, trailers, upgrade, x-powered-by, x-request-id.
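
A sketch of the two filters; the exclusion sets mirror the lists above, while the function names are illustrative:

import os

REQUEST_EXCLUDE = {"host", "x-api-key", "authorization", "transfer-encoding"}
RESPONSE_EXCLUDE = {"content-encoding", "content-length", "transfer-encoding", "connection",
                    "keep-alive", "proxy-authenticate", "proxy-authorization", "te",
                    "trailers", "upgrade", "x-powered-by", "x-request-id"}

def filter_request_headers(headers: dict) -> dict:
    out = {k: v for k, v in headers.items() if k.lower() not in REQUEST_EXCLUDE}
    out["Authorization"] = f"Bearer {os.environ['VLLM_API_KEY']}"  # re-injected backend credential
    return out

def filter_response_headers(headers: dict) -> dict:
    return {k: v for k, v in headers.items() if k.lower() not in RESPONSE_EXCLUDE}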

Streaming Server-Sent Events

Upstream text/event-stream responses are streamed verbatim: proxy_buffering is off at nginx, and Flask returns a Response(generator) with chunked encoding. SSE comments (: heartbeat\n\n) and data: frames both pass through unmodified unless Han-stripping is on, in which case each data: JSON payload is parsed, filtered, and re-serialised.
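
On the Flask side, the relay reduces to a generator wrapped in a streaming Response (illustrative; the Han-filtering branch is omitted):

from flask import Response
import requests

def relay_sse(upstream: requests.Response) -> Response:
    """Stream an upstream text/event-stream body to the client chunk by chunk."""
    def generate():
        for chunk in upstream.iter_content(chunk_size=None):  # yield chunks as they arrive
            if chunk:
                yield chunk
    return Response(generate(), mimetype="text/event-stream")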

Deployment surface

| Component | Path |
| --- | --- |
| Service unit | glm-proxy.service (systemd, on proxy.worv.ai) |
| Code root | /home/ubuntu/glm-proxy/ |
| Static frontend | /home/ubuntu/glm-proxy/dist/ |
| SQLite | /home/ubuntu/glm-proxy/proxy.db |
| Edge | nginx site files in /etc/nginx/sites-enabled/glm-proxy, /etc/nginx/sites-enabled/admin-proxy, /etc/nginx/sites-enabled/chat.proxy.worv.ai |
| TLS | Let's Encrypt (certbot) per-vhost in /etc/letsencrypt/live/ |
| Backend | SSH-tunnelled to vLLM on the 8× B200 GPU host |

A change to proxy_routes.py deploys with:

scp -i ~/.ssh/worv-proxy.pem server/proxy_routes.py \
  ubuntu@proxy.worv.ai:/home/ubuntu/glm-proxy/proxy_routes.py
ssh -i ~/.ssh/worv-proxy.pem ubuntu@proxy.worv.ai \
  "sudo systemctl restart glm-proxy"