Admin Console — GPU Telemetry

Live and historical GPU stats for the 8× B200 backend.

The GPU host pushes nvidia-smi samples to the proxy at ~1 Hz via POST /api/gpu-stats. The proxy persists samples in SQLite and prunes rows older than 7 days.

GET /api/gpu-stats

Latest sample for each GPU.

Auth: Session
CSRF: Not required

Response

[
  {
    "id": 1234567,
    "recorded_at": "2026-04-26T04:00:00+00:00",
    "gpu_id": 0,
    "name": "NVIDIA B200",
    "power_draw_w": 612.4,
    "temp_c": 47,
    "memory_used_mb": 173456,
    "memory_total_mb": 183359,
    "utilization_pct": 84
  },
  ...
]

Ordered by gpu_id ASC (so [0..7] for the 8-GPU node).
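A dashboard-side consumer of this response can be sketched in a few lines. The `summarize` helper below is hypothetical (not part of the proxy); it takes the JSON array the endpoint returns and renders one status line per GPU, deriving memory utilization from `memory_used_mb` / `memory_total_mb`.

```python
# Hypothetical consumer of GET /api/gpu-stats (illustrative, not proxy code).
def summarize(samples):
    """Return one status line per GPU from the endpoint's JSON array."""
    lines = []
    # The endpoint already orders by gpu_id ASC; sorting again makes the
    # helper safe to reuse on unordered data.
    for s in sorted(samples, key=lambda s: s["gpu_id"]):
        mem_pct = 100.0 * s["memory_used_mb"] / s["memory_total_mb"]
        lines.append(
            f"GPU{s['gpu_id']} {s['name']}: {s['utilization_pct']}% util, "
            f"{mem_pct:.1f}% mem, {s['power_draw_w']:.0f} W, {s['temp_c']}C"
        )
    return lines

if __name__ == "__main__":
    # Using the example sample from the response above.
    rows = [{
        "gpu_id": 0, "name": "NVIDIA B200", "power_draw_w": 612.4,
        "temp_c": 47, "memory_used_mb": 173456, "memory_total_mb": 183359,
        "utilization_pct": 84,
    }]
    print("\n".join(summarize(rows)))
    # → GPU0 NVIDIA B200: 84% util, 94.6% mem, 612 W, 47C
```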

GET /api/gpu-stats/power-history

Time-bucketed cluster power draw — useful for the dashboard's "power over time" chart.

Auth: Session
CSRF: Not required

Query params

Param           Type  Default  Range
days            int   7        1..30
bucket_minutes  int   5        1..1440

Response

{
  "days": 7,
  "bucket_minutes": 5,
  "points": [
    {
      "recorded_at": "2026-04-26T00:05:00+09:00",
      "power_draw_w": 5234.7
    },
    ...
  ]
}

power_draw_w per bucket is computed as the maximum, over the bucket, of the per-sample cluster sum (power summed across all 8 GPUs at each recorded_at). Using MAX rather than AVG preserves brief power spikes that averaging across a bucket of 1 Hz samples would otherwise smear away.
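The aggregation above can be sketched in Python. This is an illustrative reimplementation (the proxy presumably does this in SQL); samples here are `(epoch_seconds, gpu_id, watts)` tuples rather than rows.

```python
# Illustrative MAX-of-cluster-sum bucketing (not the proxy's actual code).
from collections import defaultdict

def power_history(samples, bucket_minutes=5):
    """MAX over each bucket of the per-timestamp cluster power sum."""
    # 1) Sum power across all GPUs sharing a recorded_at timestamp.
    cluster = defaultdict(float)
    for ts, _gpu_id, watts in samples:
        cluster[ts] += watts
    # 2) Take the maximum cluster sum within each time bucket.
    size = bucket_minutes * 60
    buckets = defaultdict(float)
    for ts, total in cluster.items():
        key = ts - ts % size
        buckets[key] = max(buckets[key], total)
    return sorted(buckets.items())

samples = [
    (0,   0, 600.0), (0,   1, 610.0),   # t=0s: cluster sum 1210 W
    (1,   0, 900.0), (1,   1, 950.0),   # t=1s: brief spike, sum 1850 W
    (300, 0, 605.0), (300, 1, 615.0),   # next bucket: 1220 W
]
print(power_history(samples, bucket_minutes=5))
# → [(0, 1850.0), (300, 1220.0)] — MAX keeps the spike; AVG would smear it
```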

POST /api/gpu-stats

Reporter ingest endpoint — called by the GPU host's scripts/report-gpu-stats.sh cron / systemd timer.

Auth: API key (Authorization: Bearer or x-api-key)
CSRF: Not required (no session)

This is the only /api/* endpoint that takes an API key rather than a session. It exists as a lightweight reporter ingress.

Request

{
  "gpus": [
    {
      "gpu_id": 0,                 // int, 0..255
      "name": "NVIDIA B200",       // str, ≤ 64 chars
      "power_draw_w": 612.4,       // optional, 0..10000
      "temp_c": 47,                // optional, 0..150
      "memory_used_mb": 173456,    // optional, ≥ 0
      "memory_total_mb": 183359,   // optional, ≥ 0
      "utilization_pct": 84        // optional, 0..100
    },
    ...
  ]
}
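A reporter like scripts/report-gpu-stats.sh would typically build this payload from `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output. The parser below is a hypothetical sketch of that step (the real script is not shown here and may differ):

```python
# Hypothetical payload builder for POST /api/gpu-stats, parsing output of:
#   nvidia-smi --query-gpu=index,name,power.draw,temperature.gpu,\
#     memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits
def build_payload(csv_text):
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, name, power, temp, used, total, util = (
            f.strip() for f in line.split(",")
        )
        gpus.append({
            "gpu_id": int(idx),
            "name": name,
            "power_draw_w": float(power),
            "temp_c": int(temp),
            "memory_used_mb": int(used),
            "memory_total_mb": int(total),
            "utilization_pct": int(util),
        })
    return {"gpus": gpus}
```

The resulting dict is what the reporter would JSON-encode and POST with the API key header.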

Out-of-range values for any one field cause the row for that GPU to be silently skipped (other GPUs in the same payload still ingest). This fail-soft behaviour is deliberate so a single bad sample doesn't poison the dashboard.
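The fail-soft filtering can be illustrated with a small reimplementation. The ranges below come from the request schema above; the function itself is not the proxy's actual validation code.

```python
# Illustrative fail-soft range checks (ranges from the request schema above;
# the proxy's real validation code is not shown in this doc).
RANGES = {
    "gpu_id": (0, 255),
    "power_draw_w": (0, 10_000),
    "temp_c": (0, 150),
    "memory_used_mb": (0, float("inf")),
    "memory_total_mb": (0, float("inf")),
    "utilization_pct": (0, 100),
}

def accepted_rows(gpus):
    """Return only the rows whose present fields are all in range."""
    ok = []
    for row in gpus:
        valid = (
            "gpu_id" in row
            and isinstance(row.get("name"), str)
            and len(row["name"]) <= 64
        )
        for field, (lo, hi) in RANGES.items():
            if field in row and not (lo <= row[field] <= hi):
                valid = False  # silently skip this GPU's row
        if valid:
            ok.append(row)
    return ok
```

A payload with one bad row still ingests the good rows, matching the behaviour described above.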

Response

200 {"ok": true} — even if some sub-rows were skipped.

Status  Cause
401     Invalid / missing API key
400     Missing top-level gpus array

The same write also runs the 7-day prune (DELETE FROM gpu_stats WHERE recorded_at < ?) so retention stays bounded without a separate cron.
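The prune step can be sketched as follows. The DELETE statement matches the doc; the surrounding helper, table shape, and timestamp handling are assumptions for illustration (ISO-8601 UTC strings compare correctly as text when they share an offset, which the samples above do).

```python
# Sketch of the retention prune bundled into the ingest write.
# The DELETE matches the doc; the helper around it is illustrative.
import sqlite3
from datetime import datetime, timedelta, timezone

def prune(conn, days=7, now=None):
    """Delete gpu_stats rows older than `days`; return the rows removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat()
    cur = conn.execute("DELETE FROM gpu_stats WHERE recorded_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Running this on every ingest keeps the table bounded at roughly 7 days × 1 Hz × 8 GPUs of rows with no separate cron job.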

Notes

  • The GPU host has a separate "backdoor" reporter (DCGM Exporter on port 9400) used by Clunix's Prometheus stack. That one is not part of this proxy — see the project's docs/security-audit-2026-04-25.md for an audit of all background services on new-learning-1.
  • power_draw_w reflects nvidia-smi's instantaneous reading. B200 SXM TDP is ~1 kW/GPU; idle draw on this box runs ~240-260 W/GPU. Sustained draw near 1 kW per GPU indicates a model running the card at saturation.