# Notes & Limits

## Model alias rewriting
The proxy can rewrite client-supplied model strings to whatever model
is currently provisioned. Three env vars on the service control this:
| Env var | Effect |
|---|---|
| `PROXY_ALIAS_MODEL` | Canonical served-model name (e.g. `DeepSeek-V4-Pro`). When unset, model IDs pass through unchanged. |
| `PROXY_ALIAS_SOURCES` | Comma-separated list of legacy IDs that should also rewrite to the canonical name (e.g. `glm-5.1-fp8,Kimi-K2.6`). |
| `PROXY_ALIAS_REWRITE_ALL` | `on` to rewrite every non-empty model string to the canonical name. |
### Decision order

The checks run in this order (a minimal sketch in Python follows the list):

- If `PROXY_ALIAS_MODEL` is unset → pass through.
- If the inbound model already equals the canonical name (case-sensitive) → pass through.
- Else, lowercase both sides and rewrite if any of:
    - `PROXY_ALIAS_REWRITE_ALL=on`, or
    - lowercased inbound equals lowercased canonical, or
    - lowercased inbound starts with `claude-`, or
    - lowercased inbound is in lowercased `PROXY_ALIAS_SOURCES`.
- Else → pass through (vLLM will 404 if the served-model-name doesn't match).
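A minimal sketch of that decision order, assuming the env vars described above (the function name and exact parsing are illustrative, not the proxy's actual code):

```python
import os

def resolve_model_alias(inbound: str) -> str:
    """Illustrative sketch of the decision order above."""
    canonical = os.environ.get("PROXY_ALIAS_MODEL")
    if not canonical:
        return inbound                      # no alias configured: pass through
    if inbound == canonical:
        return inbound                      # exact (case-sensitive) match
    lo = inbound.lower()
    sources = {s.strip().lower()
               for s in os.environ.get("PROXY_ALIAS_SOURCES", "").split(",")
               if s.strip()}
    if inbound and (os.environ.get("PROXY_ALIAS_REWRITE_ALL") == "on"
                    or lo == canonical.lower()
                    or lo.startswith("claude-")
                    or lo in sources):
        return canonical                    # overwrite with canonical spelling
    return inbound                          # unknown ID: vLLM will 404
```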
### Case-insensitivity

Since 2026-04-26 the alias check is case-insensitive on both sides.
`deepseek-v4-pro`, `DEEPSEEK-V4-PRO`, `Claude-Sonnet-4-6`, and
`GLM-5.1-FP8` all rewrite correctly. The payload's `model` field is
always overwritten with the canonical spelling before forwarding, so
that vLLM (which is case-sensitive) sees the exact string it serves.
## Dynamic `max_tokens` capping
```python
# Pessimistic heuristic: ~3 characters per token.
estimated_input = len(payload_text_chars) // 3
# 512-token slack reserved for system-prompt injection.
available = MAX_CONTEXT_TOKENS - estimated_input - 512
capped = min(client_max_tokens, available, MAX_OUTPUT_TOKENS)
```
| Env var | Default | Notes |
|---|---|---|
| `MAX_CONTEXT_TOKENS` | 202 752 | Total budget the backend exposes. 256 K context = full DeepSeek-V4-Pro window. |
| `MAX_OUTPUT_TOKENS` | 16 384 | Upper bound on output regardless of available budget. |
The 512-token slack is reserved for system-prompt injection. The heuristic is intentionally pessimistic (3 chars/token vs. the real ~3.5–4 for English) so that we never over-cap; vLLM still enforces the exact context limit. There is no client-visible signal that capping occurred.
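As a worked example with the defaults above: a payload of 300 000 characters estimates to 100 000 input tokens, leaving 202 752 - 100 000 - 512 = 102 240 tokens available, so a client requesting `max_tokens=32768` is capped to min(32 768, 102 240, 16 384) = 16 384.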
## Han-script stripping

When the user-side messages are not detected as Chinese or Japanese, the
proxy can strip Han ideographs and CJK punctuation from the model's
streaming output. This is opt-in via `PROXY_STRIP_HAN=on` (off by
default).
Detection is based on Unicode codepoints in the inbound user messages:

- Han ideographs (CJK Unified Ideographs blocks) ⇒ treated as Chinese; output preserved
- Hiragana / Katakana ⇒ treated as Japanese; output preserved
- Hangul ⇒ treated as Korean; output preserved (Han stripping doesn't affect Hangul)

Tool-use streaming chunks are not filtered; only the natural-language content blocks are.
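A rough sketch of how detection and stripping along these lines could look. The codepoint ranges below are the standard Unicode blocks and are illustrative; the proxy's exact set is not documented here:

```python
import re

# Han ideographs plus common CJK punctuation/fullwidth forms. Hiragana,
# Katakana and Hangul are deliberately NOT included, so Japanese and
# Korean characters are never removed.
HAN_RE = re.compile(
    r"[\u3400-\u4DBF"     # CJK Unified Ideographs Extension A
    r"\u4E00-\u9FFF"      # CJK Unified Ideographs
    r"\u3000-\u303F"      # CJK symbols & punctuation
    r"\uFF01-\uFF60]"     # fullwidth forms
)
KANA_RE = re.compile(r"[\u3040-\u30FF]")  # Hiragana + Katakana

def should_strip_han(user_text: str) -> bool:
    # Han present => Chinese, kana present => Japanese; both preserve output.
    return not (KANA_RE.search(user_text) or HAN_RE.search(user_text))

def strip_han(chunk: str) -> str:
    return HAN_RE.sub("", chunk)
```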
## Rate limits

| Endpoint | Limit |
|---|---|
| `POST /api/login` | 10 / minute / IP |
| `/v1/*` | None at the proxy. Backend vLLM has its own concurrency limits via `--max-num-seqs`. |
| `/api/*` (other) | None. |
`X-Forwarded-For` is honoured (`ProxyFix` sits in the WSGI middleware
stack), so the rate limiter sees the real client IP behind the nginx edge.
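For reference, the standard Werkzeug `ProxyFix` wiring looks like this; the exact `x_for`/`x_proto` hop counts depend on the deployment and are an assumption here:

```python
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
# Trust one proxy hop (the nginx edge) for X-Forwarded-For / -Proto, so
# request.remote_addr (and therefore the rate limiter) sees the real client IP.
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1)
```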
## Timeouts (full chain)

| Hop | Timeout |
|---|---|
| nginx → Flask (`proxy_read_timeout`) | 1800 s |
| Flask → vLLM connect | 10 s |
| Flask → vLLM read (streaming) | 1200 s (env `PROXY_STREAM_READ_TIMEOUT`) |
| Flask → vLLM read (non-stream) | 1200 s (env `PROXY_NONSTREAM_READ_TIMEOUT`) |
| Flask → vLLM connect retry | once on `ConnectionError` |
| Session cookie lifetime | 365 days |
## CORS

CORS is disabled by default. Set `FRONTEND_URL` on the service to
enable a single allowed origin with `supports_credentials=True` (used
by the admin web UI when served from a different host during dev).
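With Flask-CORS, the behaviour described above could be wired roughly like this (a sketch; that Flask-CORS is the library in use is an assumption):

```python
import os
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
frontend = os.environ.get("FRONTEND_URL")
if frontend:
    # Exactly one allowed origin, with cookies/credentials enabled.
    CORS(app, origins=[frontend], supports_credentials=True)
```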
## Body size limit

Flask's `MAX_CONTENT_LENGTH` is 10 MiB. Requests exceeding that
return `413 Request Entity Too Large` from Flask before any backend
contact.
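In Flask this is a single config key; a minimal sketch:

```python
from flask import Flask

app = Flask(__name__)
# Bodies larger than 10 MiB are rejected with 413 before any route runs.
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024
```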
## Path-traversal guard

Any `/v1/<subpath>` containing `..` segments is rejected with
`400 {"error": "Invalid path"}`.
## Header forwarding

Request headers forwarded to the backend exclude:
`host`, `x-api-key`, `authorization`, `transfer-encoding`. The proxy
re-injects its own backend `Authorization` header from
`VLLM_API_KEY`.

Response headers passed back to the client exclude:
`content-encoding`, `content-length`, `transfer-encoding`, `connection`,
`keep-alive`, `proxy-authenticate`, `proxy-authorization`, `te`,
`trailers`, `upgrade`, `x-powered-by`, `x-request-id`.
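A sketch of that filtering (the Bearer scheme for the re-injected `Authorization` header is an assumption):

```python
import os

EXCLUDED_REQUEST = {"host", "x-api-key", "authorization", "transfer-encoding"}
EXCLUDED_RESPONSE = {
    "content-encoding", "content-length", "transfer-encoding", "connection",
    "keep-alive", "proxy-authenticate", "proxy-authorization", "te",
    "trailers", "upgrade", "x-powered-by", "x-request-id",
}

def backend_headers(inbound: dict[str, str]) -> dict[str, str]:
    headers = {k: v for k, v in inbound.items()
               if k.lower() not in EXCLUDED_REQUEST}
    # Re-inject the backend credential (Bearer scheme assumed here).
    headers["Authorization"] = f"Bearer {os.environ['VLLM_API_KEY']}"
    return headers

def client_headers(upstream: dict[str, str]) -> dict[str, str]:
    return {k: v for k, v in upstream.items()
            if k.lower() not in EXCLUDED_RESPONSE}
```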
## Streaming Server-Sent Events

`text/event-stream` upstream responses are streamed verbatim, with
`proxy_buffering off` at nginx and a chunked Flask `Response(generator)`.
SSE comments (`: heartbeat\n\n`) and `data:` frames both pass through
unmodified unless Han-stripping is on, in which case each `data:` JSON
payload is parsed, filtered, and re-serialised.
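A pass-through relay in this style might look like the following; the helper name is illustrative, and the timeout values are taken from the table above:

```python
import requests
from flask import Response

def relay_sse(url: str, headers: dict, payload: dict) -> Response:
    upstream = requests.post(url, json=payload, headers=headers,
                             stream=True, timeout=(10, 1200))  # connect, read
    # chunk_size=None yields chunks as they arrive upstream, so heartbeat
    # comments and `data:` frames reach the client without buffering.
    return Response(upstream.iter_content(chunk_size=None),
                    status=upstream.status_code,
                    content_type="text/event-stream")
```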
## Deployment surface

| Component | Path |
|---|---|
| Service unit | `glm-proxy.service` (systemd, on proxy.worv.ai) |
| Code root | `/home/ubuntu/glm-proxy/` |
| Static frontend | `/home/ubuntu/glm-proxy/dist/` |
| SQLite | `/home/ubuntu/glm-proxy/proxy.db` |
| Edge | nginx site files in `/etc/nginx/sites-enabled/glm-proxy`, `/etc/nginx/sites-enabled/admin-proxy`, `/etc/nginx/sites-enabled/chat.proxy.worv.ai` |
| TLS | Let's Encrypt (certbot) per-vhost in `/etc/letsencrypt/live/` |
| Backend | SSH-tunnelled to vLLM on the 8× B200 GPU host |
A change to `proxy_routes.py` deploys with: