Admin Console — GPU Telemetry
Live and historical GPU stats for the 8× B200 backend.
The GPU host pushes nvidia-smi samples to the proxy at ~1 Hz via
POST /api/gpu-stats. The proxy persists samples in SQLite and prunes
rows older than 7 days.
GET /api/gpu-stats
Latest sample for each GPU.
| Auth | Session |
| CSRF | Not required |
Response
[
  {
    "id": 1234567,
    "recorded_at": "2026-04-26T04:00:00+00:00",
    "gpu_id": 0,
    "name": "NVIDIA B200",
    "power_draw_w": 612.4,
    "temp_c": 47,
    "memory_used_mb": 173456,
    "memory_total_mb": 183359,
    "utilization_pct": 84
  },
  ...
]
Ordered by gpu_id ASC (so [0..7] for the 8-GPU node).
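A "latest row per GPU" lookup like this can be expressed as a single SQLite query. The sketch below is illustrative only: the gpu_stats table name comes from the prune statement on this page, but the column layout is assumed to mirror the response fields, and using MAX(id) as "most recent" assumes ids grow with insertion order.

```python
import sqlite3

# Hypothetical schema: columns mirror the response fields shown above.
SCHEMA = """
CREATE TABLE gpu_stats (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    recorded_at TEXT NOT NULL,
    gpu_id INTEGER NOT NULL,
    name TEXT,
    power_draw_w REAL,
    temp_c INTEGER,
    memory_used_mb INTEGER,
    memory_total_mb INTEGER,
    utilization_pct INTEGER
)
"""

# Pick the highest id per gpu_id, then join back to fetch the full row.
LATEST_PER_GPU = """
SELECT s.*
FROM gpu_stats AS s
JOIN (SELECT gpu_id, MAX(id) AS max_id FROM gpu_stats GROUP BY gpu_id) AS latest
  ON s.id = latest.max_id
ORDER BY s.gpu_id ASC
"""


def latest_samples(conn):
    """Return one dict per GPU, ordered by gpu_id ascending."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(LATEST_PER_GPU)]


def demo():
    """Insert two samples for each of two GPUs and return the latest rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("2026-04-26T03:59:59+00:00", 1, "NVIDIA B200", 600.0),
        ("2026-04-26T03:59:59+00:00", 0, "NVIDIA B200", 610.0),
        ("2026-04-26T04:00:00+00:00", 0, "NVIDIA B200", 612.4),
        ("2026-04-26T04:00:00+00:00", 1, "NVIDIA B200", 598.2),
    ]
    conn.executemany(
        "INSERT INTO gpu_stats (recorded_at, gpu_id, name, power_draw_w) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return latest_samples(conn)
```

Calling demo() yields two rows, GPU 0 first, each carrying the values from its most recently inserted sample.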
GET /api/gpu-stats/power-history
Time-bucketed cluster power draw — useful for the dashboard's "power over time" chart.
| Auth | Session |
| CSRF | Not required |
Query params
| Param | Type | Default | Range |
|---|---|---|---|
| days | int | 7 | 1..30 |
| bucket_minutes | int | 5 | 1..1440 |
Response
{
  "days": 7,
  "bucket_minutes": 5,
  "points": [
    {
      "recorded_at": "2026-04-26T00:05:00+09:00",
      "power_draw_w": 5234.7
    },
    ...
  ]
}
power_draw_w per bucket is computed as the maximum of the
per-sample cluster sum (sum across all 8 GPUs at each recorded_at).
Using MAX rather than AVG preserves brief power spikes that averaging
the bucket's 1 Hz samples would otherwise smear away.
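The two-step aggregation (sum across GPUs per timestamp, then max per bucket) can be sketched in plain Python. This is not the proxy's actual query. It assumes buckets are aligned to the Unix epoch and labelled by their start time, neither of which this page specifies, and power_history is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def power_history(samples, bucket_minutes=5):
    """Aggregate (recorded_at, gpu_id, power_draw_w) tuples into bucket peaks.

    recorded_at must be a timezone-aware datetime.
    """
    # Step 1: cluster power at each sampling instant is the sum over GPUs.
    cluster = defaultdict(float)
    for recorded_at, _gpu_id, watts in samples:
        cluster[recorded_at] += watts

    # Step 2: floor each instant to its bucket start (assumed epoch-aligned)
    # and keep the largest cluster sum seen in that bucket.
    bucket_s = bucket_minutes * 60
    peaks = {}
    for recorded_at, total in cluster.items():
        epoch = int(recorded_at.timestamp())
        start = datetime.fromtimestamp(epoch - epoch % bucket_s, tz=recorded_at.tzinfo)
        peaks[start] = max(peaks.get(start, 0.0), total)

    return [
        {"recorded_at": ts.isoformat(), "power_draw_w": round(w, 1)}
        for ts, w in sorted(peaks.items())
    ]
```

For example, if a two-GPU cluster reads 200 W and then 600 W within one bucket, the bucket reports 600 W, whereas an average would report 400 W and hide the spike.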
POST /api/gpu-stats
Reporter ingest endpoint — called by the GPU host's
scripts/report-gpu-stats.sh cron / systemd timer.
| Auth | API key (Authorization: Bearer or x-api-key) |
| CSRF | Not required (no session) |
This is the only /api/* endpoint that takes an API key rather than
a session. It exists as a lightweight reporter ingress.
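Accepting the key from either header could look roughly like the following. The function names are hypothetical, and the precedence (Authorization: Bearer checked before x-api-key) is an assumption; this page does not say which header wins when both are present.

```python
import hmac


def extract_api_key(headers):
    """Return the key from 'Authorization: Bearer <key>' or 'x-api-key', else None."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v for k, v in headers.items()}

    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()

    key = lowered.get("x-api-key", "").strip()
    return key or None


def is_authorized(headers, expected_key):
    """True only if a key was presented and it matches expected_key."""
    presented = extract_api_key(headers)
    if presented is None:
        return False
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(presented.encode(), expected_key.encode())
```

A handler would return 401 whenever is_authorized is false, which covers both the missing and the invalid key cases listed for this endpoint.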
Request
{
  "gpus": [
    {
      "gpu_id": 0,                // int, 0..255
      "name": "NVIDIA B200",      // str, ≤ 64 chars
      "power_draw_w": 612.4,      // optional, 0..10000
      "temp_c": 47,               // optional, 0..150
      "memory_used_mb": 173456,   // optional, ≥ 0
      "memory_total_mb": 183359,  // optional, ≥ 0
      "utilization_pct": 84       // optional, 0..100
    },
    ...
  ]
}
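A reporter could assemble this body from nvidia-smi's CSV output. The sketch below is in Python for illustration and is not the contents of scripts/report-gpu-stats.sh. It assumes the invocation nvidia-smi --query-gpu=index,name,power.draw,temperature.gpu,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits, and it passes nvidia-smi's MiB memory figures straight into the *_mb fields. build_payload is a hypothetical name.

```python
import csv
import io


def _num(text, cast):
    """Parse a numeric cell; return None for values like '[N/A]'."""
    try:
        return cast(float(text))
    except ValueError:
        return None


def build_payload(csv_text):
    """Turn nvidia-smi CSV rows (assumed column order above) into the request body."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 7:
            continue  # blank or malformed line
        index, name, power, temp, used, total, util = (c.strip() for c in row)
        gpu = {"gpu_id": int(index), "name": name}
        optional = {
            "power_draw_w": _num(power, float),
            "temp_c": _num(temp, int),
            "memory_used_mb": _num(used, int),
            "memory_total_mb": _num(total, int),
            "utilization_pct": _num(util, int),
        }
        # Omit readings the driver could not provide rather than sending null.
        gpu.update({k: v for k, v in optional.items() if v is not None})
        gpus.append(gpu)
    return {"gpus": gpus}
```

Serialising the result with json.dumps and sending it with the API key header would complete one reporting cycle.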
Out-of-range values for any one field cause the row for that GPU to be silently skipped (other GPUs in the same payload still ingest). This fail-soft behaviour is deliberate so a single bad sample doesn't poison the dashboard.
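The fail-soft rule can be illustrated with a small per-row validator. This is a sketch under assumptions, not the proxy's code: the ranges are copied from the request comments, name is treated as required because it is not marked optional, and a null optional field is treated the same as an absent one.

```python
# Inclusive bounds per optional field; None means no upper bound.
OPTIONAL_RANGES = {
    "power_draw_w": (0, 10000),
    "temp_c": (0, 150),
    "memory_used_mb": (0, None),
    "memory_total_mb": (0, None),
    "utilization_pct": (0, 100),
}


def _in_range(value, lo, hi=None):
    """True for a real number (not bool) within [lo, hi]."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value >= lo and (hi is None or value <= hi)


def valid_rows(payload):
    """Return the GPU rows that pass every check; silently drop the rest."""
    kept = []
    for gpu in payload.get("gpus", []):
        if not isinstance(gpu, dict):
            continue
        gpu_id, name = gpu.get("gpu_id"), gpu.get("name")
        if isinstance(gpu_id, bool) or not isinstance(gpu_id, int) or not 0 <= gpu_id <= 255:
            continue
        if not isinstance(name, str) or len(name) > 64:
            continue
        # One bad optional field rejects this GPU's row only, not the payload.
        if all(
            gpu.get(field) is None or _in_range(gpu[field], *bounds)
            for field, bounds in OPTIONAL_RANGES.items()
        ):
            kept.append(gpu)
    return kept
```

With a payload where GPU 1 reports temp_c of 999, only that row is dropped and the remaining GPUs are still stored.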
Response
200 {"ok": true} — even if some sub-rows were skipped.
| Status | Cause |
|---|---|
| 401 | Invalid / missing API key |
| 400 | Missing top-level gpus array |
The same write also runs the 7-day prune (DELETE FROM gpu_stats WHERE
recorded_at < ?) so retention stays bounded without a separate cron.
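Insert-then-prune in a single transaction might look like the sketch below. The DELETE statement is the one quoted above; the reduced table layout, the ingest helper, and the use of ISO 8601 strings for recorded_at are assumptions. Comparing timestamps as text is only correct if every row uses the same format and UTC offset.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)

# Hypothetical, reduced schema for the sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS gpu_stats (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    recorded_at TEXT NOT NULL,
    gpu_id INTEGER NOT NULL,
    name TEXT,
    power_draw_w REAL
)
"""


def ingest(conn, rows, now=None):
    """Store validated rows, then prune anything older than RETENTION."""
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat(timespec="seconds")
    cutoff = (now - RETENTION).isoformat(timespec="seconds")
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO gpu_stats (recorded_at, gpu_id, name, power_draw_w) "
            "VALUES (?, ?, ?, ?)",
            [(stamp, r["gpu_id"], r["name"], r.get("power_draw_w")) for r in rows],
        )
        conn.execute("DELETE FROM gpu_stats WHERE recorded_at < ?", (cutoff,))
```

At roughly one write per second the DELETE usually removes few rows per call, and an index on recorded_at (not shown) would keep it cheap.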
Notes
- The GPU host has a separate "backdoor" reporter (DCGM Exporter on port
9400) used by Clunix's Prometheus stack. That one is not part of
this proxy; see the project's
docs/security-audit-2026-04-25.md for an audit of all background services on new-learning-1.
- power_draw_w reflects nvidia-smi's instantaneous reading. B200 SXM TDP is ~1 kW/GPU; idle draw on this box runs ~240-260 W/GPU. Sustained draw above 1 kW per GPU indicates a model running at saturation.