Admin Console — GPU Telemetry

Live and historical GPU stats for the 8× B200 backend.

The GPU host pushes nvidia-smi samples to the proxy at ~1 Hz via POST /api/gpu-stats. The proxy persists samples in SQLite and prunes rows older than 7 days.

GET /api/gpu-stats

Latest sample for each GPU.

Auth: Session
CSRF: Not required

Response

[
  {
    "id": 1234567,
    "recorded_at": "2026-04-26T04:00:00+00:00",
    "gpu_id": 0,
    "name": "NVIDIA B200",
    "power_draw_w": 612.4,
    "temp_c": 47,
    "memory_used_mb": 173456,
    "memory_total_mb": 183359,
    "utilization_pct": 84
  },
  ...
]

Ordered by gpu_id ASC (so [0..7] for the 8-GPU node).
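A dashboard-side consumer of this response can be sketched in a few lines. The `summarize` helper below is hypothetical (not part of the proxy); it takes the JSON array the endpoint returns and renders one status line per GPU, deriving memory utilization from `memory_used_mb` / `memory_total_mb`.

```python
# Hypothetical consumer of GET /api/gpu-stats (illustrative, not proxy code).
def summarize(samples):
    """Return one status line per GPU from the endpoint's JSON array."""
    lines = []
    # The endpoint already orders by gpu_id ASC; sorting again makes the
    # helper safe to reuse on unordered data.
    for s in sorted(samples, key=lambda s: s["gpu_id"]):
        mem_pct = 100.0 * s["memory_used_mb"] / s["memory_total_mb"]
        lines.append(
            f"GPU{s['gpu_id']} {s['name']}: {s['utilization_pct']}% util, "
            f"{mem_pct:.1f}% mem, {s['power_draw_w']:.0f} W, {s['temp_c']}C"
        )
    return lines

if __name__ == "__main__":
    # Using the example sample from the response above.
    rows = [{
        "gpu_id": 0, "name": "NVIDIA B200", "power_draw_w": 612.4,
        "temp_c": 47, "memory_used_mb": 173456, "memory_total_mb": 183359,
        "utilization_pct": 84,
    }]
    print("\n".join(summarize(rows)))
    # → GPU0 NVIDIA B200: 84% util, 94.6% mem, 612 W, 47C
```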

GET /api/gpu-stats/power-history

Time-bucketed cluster power draw — useful for the dashboard's "power over time" chart.

Auth: Session
CSRF: Not required

Query params

Param           Type  Default  Range
days            int   7        1..30
bucket_minutes  int   5        1..1440

Response

{
  "days": 7,
  "bucket_minutes": 5,
  "points": [
    {
      "recorded_at": "2026-04-26T00:05:00+09:00",
      "power_draw_w": 5234.7
    },
    ...
  ]
}

power_draw_w per bucket is computed as the maximum, over the bucket, of the per-sample cluster sum (power summed across all 8 GPUs at each recorded_at). Using MAX rather than AVG preserves brief power spikes that averaging across a bucket of 1 Hz samples would otherwise smear away.
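The aggregation above can be sketched in Python. This is an illustrative reimplementation (the proxy presumably does this in SQL); samples here are `(epoch_seconds, gpu_id, watts)` tuples rather than rows.

```python
# Illustrative MAX-of-cluster-sum bucketing (not the proxy's actual code).
from collections import defaultdict

def power_history(samples, bucket_minutes=5):
    """MAX over each bucket of the per-timestamp cluster power sum."""
    # 1) Sum power across all GPUs sharing a recorded_at timestamp.
    cluster = defaultdict(float)
    for ts, _gpu_id, watts in samples:
        cluster[ts] += watts
    # 2) Take the maximum cluster sum within each time bucket.
    size = bucket_minutes * 60
    buckets = defaultdict(float)
    for ts, total in cluster.items():
        key = ts - ts % size
        buckets[key] = max(buckets[key], total)
    return sorted(buckets.items())

samples = [
    (0,   0, 600.0), (0,   1, 610.0),   # t=0s: cluster sum 1210 W
    (1,   0, 900.0), (1,   1, 950.0),   # t=1s: brief spike, sum 1850 W
    (300, 0, 605.0), (300, 1, 615.0),   # next bucket: 1220 W
]
print(power_history(samples, bucket_minutes=5))
# → [(0, 1850.0), (300, 1220.0)] — MAX keeps the spike; AVG would smear it
```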

POST /api/gpu-stats

Reporter ingest endpoint — called by the GPU host's scripts/report-gpu-stats.sh cron / systemd timer.

Auth: API key (Authorization: Bearer or x-api-key)
CSRF: Not required (no session)

This is the only /api/* endpoint that takes an API key rather than a session. It exists as a lightweight reporter ingress.

Request

{
  "gpus": [
    {
      "gpu_id": 0,                 // int, 0..255
      "name": "NVIDIA B200",       // str, ≤ 64 chars
      "power_draw_w": 612.4,       // optional, 0..10000
      "temp_c": 47,                // optional, 0..150
      "memory_used_mb": 173456,    // optional, ≥ 0
      "memory_total_mb": 183359,   // optional, ≥ 0
      "utilization_pct": 84        // optional, 0..100
    },
    ...
  ]
}
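A reporter like scripts/report-gpu-stats.sh would typically build this payload from `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output. The parser below is a hypothetical sketch of that step (the real script is not shown here and may differ):

```python
# Hypothetical payload builder for POST /api/gpu-stats, parsing output of:
#   nvidia-smi --query-gpu=index,name,power.draw,temperature.gpu,\
#     memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits
def build_payload(csv_text):
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, name, power, temp, used, total, util = (
            f.strip() for f in line.split(",")
        )
        gpus.append({
            "gpu_id": int(idx),
            "name": name,
            "power_draw_w": float(power),
            "temp_c": int(temp),
            "memory_used_mb": int(used),
            "memory_total_mb": int(total),
            "utilization_pct": int(util),
        })
    return {"gpus": gpus}
```

The resulting dict is what the reporter would JSON-encode and POST with the API key header.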

Out-of-range values for any one field cause the row for that GPU to be silently skipped (other GPUs in the same payload still ingest). This fail-soft behaviour is deliberate so a single bad sample doesn't poison the dashboard.
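The fail-soft filtering can be illustrated with a small reimplementation. The ranges below come from the request schema above; the function itself is not the proxy's actual validation code.

```python
# Illustrative fail-soft range checks (ranges from the request schema above;
# the proxy's real validation code is not shown in this doc).
RANGES = {
    "gpu_id": (0, 255),
    "power_draw_w": (0, 10_000),
    "temp_c": (0, 150),
    "memory_used_mb": (0, float("inf")),
    "memory_total_mb": (0, float("inf")),
    "utilization_pct": (0, 100),
}

def accepted_rows(gpus):
    """Return only the rows whose present fields are all in range."""
    ok = []
    for row in gpus:
        valid = (
            "gpu_id" in row
            and isinstance(row.get("name"), str)
            and len(row["name"]) <= 64
        )
        for field, (lo, hi) in RANGES.items():
            if field in row and not (lo <= row[field] <= hi):
                valid = False  # silently skip this GPU's row
        if valid:
            ok.append(row)
    return ok
```

A payload with one bad row still ingests the good rows, matching the behaviour described above.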

Response

200 {"ok": true} — even if some sub-rows were skipped.

Status  Cause
401     Invalid / missing API key
400     Missing top-level gpus array

The same write also runs the 7-day prune (DELETE FROM gpu_stats WHERE recorded_at < ?) so retention stays bounded without a separate cron.
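The prune step can be sketched as follows. The DELETE statement matches the doc; the surrounding helper, table shape, and timestamp handling are assumptions for illustration (ISO-8601 UTC strings compare correctly as text when they share an offset, which the samples above do).

```python
# Sketch of the retention prune bundled into the ingest write.
# The DELETE matches the doc; the helper around it is illustrative.
import sqlite3
from datetime import datetime, timedelta, timezone

def prune(conn, days=7, now=None):
    """Delete gpu_stats rows older than `days`; return the rows removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat()
    cur = conn.execute("DELETE FROM gpu_stats WHERE recorded_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Running this on every ingest keeps the table bounded at roughly 7 days × 1 Hz × 8 GPUs of rows with no separate cron job.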

Notes

  • The GPU host has a separate "backdoor" reporter (DCGM Exporter on port 9400) used by Clunix's Prometheus stack. That one is not part of this proxy — see the project's docs/security-audit-2026-04-25.md for an audit of all background services on new-learning-1.
  • power_draw_w reflects nvidia-smi's instantaneous reading. B200 SXM TDP is ~1 kW/GPU; idle draw on this box runs ~240-260 W/GPU. Sustained draw near 1 kW per GPU indicates a model running the card at saturation.