# Notes & Limits

## Model alias rewriting
The proxy can rewrite client-supplied model strings to whatever model
is currently provisioned. Three env vars on the service control this:
| Env var | Effect |
|---|---|
| `PROXY_ALIAS_MODEL` | Canonical served-model name (e.g. `DeepSeek-V4-Pro`). When unset, model IDs pass through unchanged. |
| `PROXY_ALIAS_SOURCES` | Comma-separated list of legacy IDs that should also rewrite to the canonical name (e.g. `glm-5.1-fp8,Kimi-K2.6`). |
| `PROXY_ALIAS_REWRITE_ALL` | `on` to rewrite every non-empty model string to the canonical name. |
### Decision order

The checks run in this order (a minimal sketch in Python follows the list):

- If `PROXY_ALIAS_MODEL` is unset → pass through.
- If the inbound model already equals the canonical name (case-sensitive) → pass through.
- Else, lowercase both sides and rewrite if any of:
    - `PROXY_ALIAS_REWRITE_ALL=on`, or
    - lowercased inbound equals lowercased canonical, or
    - lowercased inbound starts with `claude-`, or
    - lowercased inbound is in lowercased `PROXY_ALIAS_SOURCES`.
- Else → pass through (vLLM will 404 if the served-model-name doesn't match).
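A minimal sketch of that decision order, assuming the env vars described above (the function name and exact parsing are illustrative, not the proxy's actual code):

```python
import os

def resolve_model_alias(inbound: str) -> str:
    """Illustrative sketch of the decision order above."""
    canonical = os.environ.get("PROXY_ALIAS_MODEL")
    if not canonical:
        return inbound                      # no alias configured: pass through
    if inbound == canonical:
        return inbound                      # exact (case-sensitive) match
    lo = inbound.lower()
    sources = {s.strip().lower()
               for s in os.environ.get("PROXY_ALIAS_SOURCES", "").split(",")
               if s.strip()}
    if inbound and (os.environ.get("PROXY_ALIAS_REWRITE_ALL") == "on"
                    or lo == canonical.lower()
                    or lo.startswith("claude-")
                    or lo in sources):
        return canonical                    # overwrite with canonical spelling
    return inbound                          # unknown ID: vLLM will 404
```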
### Case-insensitivity

Since 2026-04-26 the alias check is case-insensitive on both sides.
`deepseek-v4-pro`, `DEEPSEEK-V4-PRO`, `Claude-Sonnet-4-6`, and
`GLM-5.1-FP8` all rewrite correctly. The payload's `model` field is
always overwritten with the canonical spelling before forwarding, so
that vLLM (which is case-sensitive) sees the exact string it serves.
## Dynamic `max_tokens` capping
```python
# Pessimistic heuristic: ~3 characters per token.
estimated_input = len(payload_text_chars) // 3
# 512-token slack reserved for system-prompt injection.
available = MAX_CONTEXT_TOKENS - estimated_input - 512
capped = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)
```
| Env var | Default | Notes |
|---|---|---|
| `MAX_CONTEXT_TOKENS` | 202 752 | Total budget the backend exposes. 256 K context = full DeepSeek-V4-Pro window. |
| `MAX_OUTPUT_TOKENS` | 16 384 | Upper bound on output regardless of available budget. |
The 512-token slack is reserved for system-prompt injection. The heuristic is intentionally pessimistic (3 chars/token vs. the real ~3.5–4 for English) so that we never over-cap; vLLM still enforces the exact context limit. There is no client-visible signal that capping occurred.
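As a worked example with the defaults above: a payload of 300 000 characters estimates to 100 000 input tokens, leaving 202 752 - 100 000 - 512 = 102 240 tokens available, so a client requesting `max_tokens=32768` is capped to min(32 768, 102 240, 16 384) = 16 384.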
## Han-script stripping

When the user-side messages are not detected as Chinese or Japanese, the
proxy can strip Han ideographs and CJK punctuation from the model's
streaming output. This is opt-in via `PROXY_STRIP_HAN=on` (off by
default).
Detection is based on Unicode codepoints in the inbound user messages:

- Han ideographs (CJK Unified Ideographs blocks) ⇒ treated as Chinese; output preserved
- Hiragana / Katakana ⇒ treated as Japanese; output preserved
- Hangul ⇒ treated as Korean; output preserved (Han stripping doesn't affect Hangul)

Tool-use streaming chunks are not filtered; only the natural-language content blocks are.
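A rough sketch of how detection and stripping along these lines could look. The codepoint ranges below are the standard Unicode blocks and are illustrative; the proxy's exact set is not documented here:

```python
import re

# Han ideographs plus common CJK punctuation/fullwidth forms. Hiragana,
# Katakana and Hangul are deliberately NOT included, so Japanese and
# Korean characters are never removed.
HAN_RE = re.compile(
    r"[\u3400-\u4DBF"     # CJK Unified Ideographs Extension A
    r"\u4E00-\u9FFF"      # CJK Unified Ideographs
    r"\u3000-\u303F"      # CJK symbols & punctuation
    r"\uFF01-\uFF60]"     # fullwidth forms
)
KANA_RE = re.compile(r"[\u3040-\u30FF]")  # Hiragana + Katakana

def should_strip_han(user_text: str) -> bool:
    # Han present => Chinese, kana present => Japanese; both preserve output.
    return not (KANA_RE.search(user_text) or HAN_RE.search(user_text))

def strip_han(chunk: str) -> str:
    return HAN_RE.sub("", chunk)
```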
## Rate limits

| Endpoint | Limit |
|---|---|
| `POST /api/login` | 10 / minute / IP |
| `/v1/*` | None at the proxy. Backend vLLM has its own concurrency limits via `--max-num-seqs`. |
| `/api/*` (other) | None. |
`X-Forwarded-For` is honoured (`ProxyFix` sits in the WSGI middleware
stack), so the rate limiter sees the real client IP behind the nginx edge.
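For reference, the standard Werkzeug `ProxyFix` wiring looks like this; the exact `x_for`/`x_proto` hop counts depend on the deployment and are an assumption here:

```python
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
# Trust one proxy hop (the nginx edge) for X-Forwarded-For / -Proto, so
# request.remote_addr (and therefore the rate limiter) sees the real client IP.
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1)
```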
## Timeouts (full chain)

| Hop | Timeout |
|---|---|
| nginx → Flask (`proxy_read_timeout`) | 1800 s |
| Flask → vLLM connect | 10 s |
| Flask → vLLM read (streaming) | 1200 s (env `PROXY_STREAM_READ_TIMEOUT`) |
| Flask → vLLM read (non-stream) | 1200 s (env `PROXY_NONSTREAM_READ_TIMEOUT`) |
| Flask → vLLM connect retry | once on `ConnectionError` |
| Session cookie lifetime | 365 days |
## CORS

CORS is disabled by default. Set `FRONTEND_URL` on the service to
enable a single allowed origin with `supports_credentials=True` (used
by the admin web UI when served from a different host during dev).
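With Flask-CORS, the behaviour described above could be wired roughly like this (a sketch; that Flask-CORS is the library in use is an assumption):

```python
import os
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
frontend = os.environ.get("FRONTEND_URL")
if frontend:
    # Exactly one allowed origin, with cookies/credentials enabled.
    CORS(app, origins=[frontend], supports_credentials=True)
```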
## Body size limit

Flask's `MAX_CONTENT_LENGTH` is 10 MiB. Requests exceeding that
return `413 Request Entity Too Large` from Flask before any backend
contact.
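In Flask this is a single config key; a minimal sketch:

```python
from flask import Flask

app = Flask(__name__)
# Bodies larger than 10 MiB are rejected with 413 before any route runs.
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024
```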
## Path-traversal guard

Any `/v1/<subpath>` containing `..` segments is rejected with
`400 {"error": "Invalid path"}`.
## Header forwarding

Request headers forwarded to the backend exclude:
`host`, `x-api-key`, `authorization`, `transfer-encoding`. The proxy
re-injects its own backend `Authorization` header from
`VLLM_API_KEY`.

Response headers passed back to the client exclude:
`content-encoding`, `content-length`, `transfer-encoding`, `connection`,
`keep-alive`, `proxy-authenticate`, `proxy-authorization`, `te`,
`trailers`, `upgrade`, `x-powered-by`, `x-request-id`.
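A sketch of that filtering (the Bearer scheme for the re-injected `Authorization` header is an assumption):

```python
import os

EXCLUDED_REQUEST = {"host", "x-api-key", "authorization", "transfer-encoding"}
EXCLUDED_RESPONSE = {
    "content-encoding", "content-length", "transfer-encoding", "connection",
    "keep-alive", "proxy-authenticate", "proxy-authorization", "te",
    "trailers", "upgrade", "x-powered-by", "x-request-id",
}

def backend_headers(inbound: dict[str, str]) -> dict[str, str]:
    headers = {k: v for k, v in inbound.items()
               if k.lower() not in EXCLUDED_REQUEST}
    # Re-inject the backend credential (Bearer scheme assumed here).
    headers["Authorization"] = f"Bearer {os.environ['VLLM_API_KEY']}"
    return headers

def client_headers(upstream: dict[str, str]) -> dict[str, str]:
    return {k: v for k, v in upstream.items()
            if k.lower() not in EXCLUDED_RESPONSE}
```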
## Streaming Server-Sent Events

`text/event-stream` upstream responses are streamed verbatim, with
`proxy_buffering off` at nginx and a chunked Flask `Response(generator)`.
SSE comments (`: heartbeat\n\n`) and `data:` frames both pass through
unmodified unless Han-stripping is on, in which case each `data:` JSON
payload is parsed, filtered, and re-serialised.
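A pass-through relay in this style might look like the following; the helper name is illustrative, and the timeout values are taken from the table above:

```python
import requests
from flask import Response

def relay_sse(url: str, headers: dict, payload: dict) -> Response:
    upstream = requests.post(url, json=payload, headers=headers,
                             stream=True, timeout=(10, 1200))  # connect, read
    # chunk_size=None yields chunks as they arrive upstream, so heartbeat
    # comments and `data:` frames reach the client without buffering.
    return Response(upstream.iter_content(chunk_size=None),
                    status=upstream.status_code,
                    content_type="text/event-stream")
```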
## Deployment surface

| Component | Path |
|---|---|
| Service unit | `glm-proxy.service` (systemd, on proxy.worv.ai) |
| Code root | `/home/ubuntu/glm-proxy/` |
| Static frontend | `/home/ubuntu/glm-proxy/dist/` |
| SQLite | `/home/ubuntu/glm-proxy/proxy.db` |
| Edge | nginx site files in `/etc/nginx/sites-enabled/glm-proxy`, `/etc/nginx/sites-enabled/admin-proxy`, `/etc/nginx/sites-enabled/chat.proxy.worv.ai` |
| TLS | Let's Encrypt (certbot) per-vhost in `/etc/letsencrypt/live/` |
| Backend | SSH-tunnelled to vLLM on the 8× B200 GPU host |
A change to `proxy_routes.py` deploys with: