Status board internals

Why the status board is a standalone app, its probe model, and the fire-and-forget Supabase heartbeat.

A standalone live status board at status.busymate.net. It probes every Busymate service — Dashboard, Backend (Supabase), MCP, Proxy, Docs, and itself — and renders overall health, per-service detail, VPS host metrics, a Proxy Security card (open-relay flood + the #001 source-IP gate), and a Supabase Project card (per-service health for our exact Cloud Pro project). It is a separate Next.js + Turbopack + shadcn/ui app (React 19, dark-only) with its own systemd unit (busymate-status), its own port (:3940), and its own deploy.

For the operator-facing task guide (reading the board, local dev, deploy), see Status board — How-to.

Why it's a separate component

A status page embedded in the thing it monitors goes down exactly when you need it. If the board lived inside the dashboard, a dashboard outage would take the status page down with it — and the page exists precisely to tell you the dashboard is down.

So the board is deliberately decoupled:

  • Its own Next.js app, systemd unit, port, and nginx vhost: it imports nothing from the dashboard and shares no process with it.
  • Its health JSON at GET /api/status is unauthenticated and returns coarse health only, never secrets. Auth being down is one of the failure modes it has to survive, so it can't depend on auth to render.
  • It runs on the VPS it reports on, which is what makes a real status page possible without a separate monitoring agent: host load, memory, and uptime come straight from Node's os.*, and systemd state comes from systemctl show on the same box.

The only things it shares with the rest of the monorepo are the repo-root version.json (read off disk for build numbers) and the optional Supabase ingest endpoint (fire-and-forget, see below); neither is on the critical render path.

Topology

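[Diagram: the browser polls web/status (VPS :3940) every 7 s and receives JSON from GET /api/status. web/status runs probe() per service, reads os.* host metrics and systemctl show, and keeps a 4 s cache and a 10 s heartbeat. It HTTP-probes the sibling services (dashboard, backend, mcp, proxy, docs; up/code/ms) plus itself (self:true), and pushes snapshots to the Supabase status-ingest Edge Function for historical and Realtime consumers.]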

The probe model

Every monitored service is a Component in a static TARGETS array in web/status/app/api/status/route.ts. A collect() pass probes all six in parallel (Promise.all) and assembles a StatusBody.

Each target carries metadata plus tuning flags:

| Field | Meaning |
|---|---|
| key | Stable identity (dashboard, supabase, mcp, proxy-server, docs, status). Also the lookup key into version.json for the build number. |
| label / role | Display name + one-line description. |
| host | Public hostname (rendered as the "open" link). |
| url | What probe() fetches. For internal services this is loopback (http://127.0.0.1:3838/api/version) to avoid bouncing through nginx. |
| critical | A core service. Dashboard and Backend are core; everything else is non-core. |
| unit | systemd unit name for systemctl show, or null for cloud-hosted services (Backend, MCP). |
| self | This very board: it's serving the request, so it's trivially up (no self-fetch). |
| wantBuild | Whether to parse build from the probe's JSON body (only the dashboard's /api/version returns one). |
| upWhen | Custom up/down predicate. Default is status < 500; MCP uses s > 0 (any answer at all proves the edge function is reachable). |
| metricsUrl | Loopback flood/abuse metrics URL, proxy only (http://127.0.0.1:8888/metrics). Fetched alongside the main probe and attached as probe.flood; stripped from the serialised body (it's an internal 127.0.0.1 URL). Drives the Proxy Security card. |

probe(t):

  • self:true short-circuits to { up: true, httpStatus: 200, latencyMs: 0 }.
  • Otherwise fetch(url) with a 4.5 s timeout (AbortSignal.timeout), redirect: "manual", cache: "no-store", and a busymate-status/1 user-agent.
  • Latency is performance.now() delta in ms.
  • If wantBuild, the response is cloned and parsed for a numeric build field.
  • If metricsUrl is set (proxy only), fetchFloodMetrics() best-effort-fetches the loopback /metrics (2.5 s timeout) and attaches it as probe.flood. Any failure (proxy down, old build without /metrics, malformed body) yields null and never affects the up/down verdict.
  • up = upWhen(status) if provided, else status < 500.
  • Any throw (timeout, connection refused) returns { up: false, httpStatus: null, error }; TimeoutError is normalised to "timeout".

Liveness probes are picked to be cheap and auth-free. Backend hits https://api.busymate.net/auth/v1/health — any non-5xx (200, or a 401 when the edge is picky about the apikey header) proves the auth API is up. Proxy hits /ca/bundle. Docs hits /. MCP hits / and treats any status as up.
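The probe steps above can be sketched roughly as follows. This is a minimal illustration, not the code in route.ts: the Target and ProbeResult shapes are trimmed down, fetchStatus is a hypothetical stand-in for the real fetch() call (which, per the list above, uses AbortSignal.timeout, redirect: "manual", and cache: "no-store"), and returning latencyMs: null on failure is an assumption.

```typescript
// Hypothetical, trimmed-down shapes; only the field names come from the text above.
interface Target {
  key: string;
  url: string;
  self?: boolean;
  upWhen?: (status: number) => boolean;
}

interface ProbeResult {
  up: boolean;
  httpStatus: number | null;
  latencyMs: number | null;
  error?: string;
}

// Stand-in for fetch(): resolves to the HTTP status, rejects on timeout or connection failure.
type FetchStatus = (url: string, timeoutMs: number) => Promise<number>;

const PROBE_TIMEOUT_MS = 4500;

// Default predicate: any status below 500 counts as up; a custom upWhen overrides it.
function isUp(status: number, upWhen?: (s: number) => boolean): boolean {
  return upWhen ? upWhen(status) : status < 500;
}

// Collapse a thrown value into a short string; TimeoutError becomes "timeout".
function normaliseError(err: unknown): string {
  const name = (err as { name?: unknown } | null)?.name;
  if (name === "TimeoutError") return "timeout";
  return err instanceof Error ? err.message : String(err);
}

async function probe(
  t: Target,
  fetchStatus: FetchStatus,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  // The board is serving this very request, so it is trivially up.
  if (t.self) return { up: true, httpStatus: 200, latencyMs: 0 };

  const started = now();
  try {
    const status = await fetchStatus(t.url, PROBE_TIMEOUT_MS);
    return { up: isUp(status, t.upWhen), httpStatus: status, latencyMs: now() - started };
  } catch (err) {
    // Assumption: latency is reported as null when the probe throws.
    return { up: false, httpStatus: null, latencyMs: null, error: normaliseError(err) };
  }
}
```

With this shape, the MCP target would pass upWhen: (s) => s > 0, while the Backend target relies on the default, so a 401 from /auth/v1/health still counts as up.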

systemd uptime

For targets with a unit, unitState(unit) runs systemctl show <unit> --property=ActiveState,SubState,ActiveEnterTimestamp (2.5 s timeout, read-only). ActiveEnterTimestamp is parsed into uptimeSec. If systemctl isn't present (local dev on a Mac), it returns { active: "unknown", … } — and the client treats "unknown" as OK, not a failure, so the board renders cleanly off-VPS.
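As an illustration of the parsing step, here is a small sketch that turns systemctl show output into the unitState shape. It assumes the timestamp is printed in UTC in the form "Mon 2024-05-06 10:11:12 UTC"; other time zones would need extra handling, and the real unitState() may parse it differently. parseUnitState is a hypothetical name.

```typescript
interface UnitState {
  active: string;
  sub: string;
  uptimeSec: number | null;
}

// Parse KEY=VALUE lines from `systemctl show <unit> --property=...`.
// Missing properties fall back to "unknown", which the client treats as OK.
function parseUnitState(stdout: string, nowMs: number): UnitState {
  const props = new Map<string, string>();
  for (const line of stdout.split("\n")) {
    const eq = line.indexOf("=");
    if (eq > 0) props.set(line.slice(0, eq), line.slice(eq + 1).trim());
  }

  // Assumption: the timestamp is rendered in UTC, e.g. "Mon 2024-05-06 10:11:12 UTC".
  const m = /(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}) UTC/.exec(
    props.get("ActiveEnterTimestamp") ?? "",
  );
  let uptimeSec: number | null = null;
  if (m) {
    const [y, mo, d, h, mi, s] = m.slice(1).map(Number);
    const enteredMs = Date.UTC(y, mo - 1, d, h, mi, s);
    uptimeSec = Math.max(0, Math.floor((nowMs - enteredMs) / 1000));
  }

  return {
    active: props.get("ActiveState") ?? "unknown",
    sub: props.get("SubState") ?? "unknown",
    uptimeSec,
  };
}
```

Feeding it an empty string (for example, when systemctl is missing) yields active: "unknown" and uptimeSec: null, which matches the off-VPS behaviour described above.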

Build numbers

manifestBuilds() reads version.json from ../version.json or ../../version.json relative to process.cwd(), and maps components.<key>.build. The dashboard's live build comes from its own /api/version probe; for the others the manifest value is the source. The board therefore shows the deployed build per component, surfacing drift at a glance.
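The mapping and the "live build wins" rule could look something like this. It is a sketch only: the components.<key>.build path comes from the text, while buildsFromManifest and effectiveBuild are hypothetical names and the file reading itself is left out.

```typescript
// Only the components.<key>.build path is taken from the text; the rest is illustrative.
interface VersionManifest {
  components?: Record<string, { build?: unknown }>;
}

// Keep only numeric build values, keyed by component key.
function buildsFromManifest(manifest: VersionManifest): Partial<Record<string, number>> {
  const out: Partial<Record<string, number>> = {};
  for (const [key, entry] of Object.entries(manifest.components ?? {})) {
    if (typeof entry.build === "number") out[key] = entry.build;
  }
  return out;
}

// A build parsed from a live probe (dashboard only) takes precedence over the manifest.
function effectiveBuild(
  key: string,
  liveBuild: number | null | undefined,
  manifestBuilds: Partial<Record<string, number>>,
): number | null {
  return liveBuild ?? manifestBuilds[key] ?? null;
}
```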

Overall status

After the parallel collect(), the overall verdict is computed:

  • down — any core service is down (critical && !up). Backend or Dashboard being unreachable is a major outage.
  • degraded — some non-core service is down but all core services are up.
  • operational — everything responding.

The client maps these to "All systems operational" (emerald), "Partial degradation" (amber), and "Major outage" (red), with a pulsing dot for the operational state.
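The verdict rules and the banner mapping are small enough to write out. The following is a sketch under the rules stated above; overallStatus and BANNER are hypothetical names, and the tone strings are only labels for the colours mentioned in the text.

```typescript
type Overall = "operational" | "degraded" | "down";

interface ComponentHealth {
  critical: boolean;
  up: boolean;
}

// down: any core service is down; degraded: only non-core services are down.
function overallStatus(components: ComponentHealth[]): Overall {
  if (components.some((c) => c.critical && !c.up)) return "down";
  if (components.some((c) => !c.up)) return "degraded";
  return "operational";
}

// Client-side banner text and colour per verdict.
const BANNER: Record<Overall, { label: string; tone: string }> = {
  operational: { label: "All systems operational", tone: "emerald" },
  degraded: { label: "Partial degradation", tone: "amber" },
  down: { label: "Major outage", tone: "red" },
};
```

For example, with Dashboard and Backend up but Docs down, overallStatus returns "degraded" and the banner reads "Partial degradation".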

A per-service card is also considered unhealthy if its systemd unit reports a non-active/non-unknown state, even when the HTTP probe succeeded — a process can answer while its unit is failed mid-restart.

Proxy Security card

Because the status app runs on the same VPS as the proxy, it can read the proxy's open-relay-abuse counters off loopback for free. The proxy target carries metricsUrl: "http://127.0.0.1:8888/metrics"; fetchFloodMetrics() fetches it during the proxy probe and attaches the result as probe.flood (a FloodMetrics mirroring FloodStats.snapshot() in web/proxy-server/src/floodStats.ts). The client renders a dedicated Proxy Security card directly below the Host card and above the Supabase Project card — but only when flood is non-null, so an old proxy build without /metrics simply omits the card.

The card surfaces the issue-#001 source-IP gate:

  • the gateMode badge (shadow = logging only, amber; enforce = refusing un-bound source IPs, emerald);
  • an under attack / clean verdict from underAttack;
  • a metric grid of last-60 s abuse counters (abuse/min, distinct abusive IPs, refused = denied + noBinding, cap-drops), live pool pressure (active conns, throttled IPs, top-abuser conns), and devices/knownIps gated;
  • a lifetime allowed · refused · cap-dropped footer.

fetchFloodMetrics() is fully best-effort: it uses a 2.5 s timeout and a body-shape sanity check (gateMode string + last60s + pool present), and any error returns null.
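The body-shape sanity check might be sketched like this. Only the three checked fields (gateMode, last60s, pool) come from the text; FloodMetricsLike, asFloodMetrics, and parseFloodBody are hypothetical names, and the real FloodMetrics type carries more fields.

```typescript
// Loose stand-in for FloodMetrics: only the three sanity-checked fields are typed.
interface FloodMetricsLike {
  gateMode: string;
  last60s: object;
  pool: object;
  [extra: string]: unknown;
}

const isObject = (v: unknown): v is object => typeof v === "object" && v !== null;

// Accept the body only if gateMode is a string and last60s + pool are present.
function asFloodMetrics(body: unknown): FloodMetricsLike | null {
  if (!isObject(body)) return null;
  const b = body as Record<string, unknown>;
  if (typeof b["gateMode"] !== "string") return null;
  if (!isObject(b["last60s"]) || !isObject(b["pool"])) return null;
  return b as FloodMetricsLike;
}

// Best-effort: malformed JSON or a wrong shape both collapse to null.
function parseFloodBody(text: string): FloodMetricsLike | null {
  try {
    return asFloodMetrics(JSON.parse(text));
  } catch {
    return null;
  }
}
```

Because every failure path returns null, the caller can attach the result as probe.flood without it ever influencing the up/down verdict.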

Supabase Project card

The Backend probe (/auth/v1/health) only answers up/down. The Supabase Project card adds the detail: it reports our exact Cloud Pro project (SUPABASE_PROJECT_REF, default xfjplaganjqowkcnznbr), not generic upstream Supabase health. collectSupabase() assembles a SupabaseInfo from three server-side fetches run in parallel, all stripped of secrets before serialising:

  • Anonymous version probes (fetchSupabaseVersions) — GoTrue /auth/v1/health (sent with the browser-safe publishable apikey) and Storage /storage/v1/version. Both answer unauthenticated, double as the latency probe (versionLatencyMs), and prove Auth + Storage are live even with no token.
  • Management API per-service health (fetchSupabaseServiceHealth) — GET https://api.supabase.com/v1/projects/{ref}/health?services=… for db · rest · auth · realtime · storage · pooler. Token-gated: returns null (and the four db/rest/realtime/pooler rows render muted as needs token) unless SUPABASE_ACCESS_TOKEN is set.
  • Project meta (fetchSupabaseProject) — GET /v1/projects/{ref} for project status (ACTIVE_HEALTHY, …), region, and Postgres version. Also token-gated.

Each service row merges the authoritative Management-API status (source: "management") with the anon version string where we have one; with no token, Auth/Storage fall back to the anon liveness verdict and the rest report healthy: null (unknown, rendered muted — never red).

Region comes from the Management API (project.region); the SUPABASE_REGION env/default (eu-west-1) is only the no-token fallback. Our project runs in eu-west-1 (Ireland) — intentionally not the lon1/London VPS region, so the default was corrected to match reality.

Client-side Statuspage fetch

Supabase's platform-wide public Statuspage (status.supabase.com/api/v2/summary.json) is fetched from the viewer's browser, never the server — AWS WAF CAPTCHA-blocks our VPS datacenter IP, but the summary API is CORS-open (access-control-allow-origin: *), so each visitor's own residential/office IP fetches it cleanly. The useSupabasePlatform hook polls it every 60 s (paused when the tab is hidden), and the card shows only the slices that name our region: the matching compute-capacity dot, plus any unresolved incidents or scheduled maintenance affecting our region. A failed fetch leaves the platform overlay null and the card still renders the server-probed versions.

The Management API token itself lives in the VPS supabase.conf systemd drop-in (SUPABASE_ACCESS_TOKEN). Off-VPS or without it, the four token-gated service rows stay muted while Auth + Storage continue reporting live.

/api/status response shape

The route is force-dynamic, runtime: "nodejs", revalidate: 0. The serialised body (StatusBody in web/status/app/api/status/route.ts, mirrored in StatusClient.tsx):

ts
interface StatusBody {
  generatedAt: string;                              // ISO-8601
  overall: "operational" | "degraded" | "down";
  host: {
    hostname: string;
    platform: string;                               // `${os.platform()} ${os.release()}`
    uptimeSec: number;
    cpus: number;
    load: [number, number, number];                 // os.loadavg() — 1m, 5m, 15m
    loadRatio: number;                              // load[0] / cpus
    memTotal: number;                               // bytes
    memUsed: number;                                // bytes (memTotal − freemem)
    memUsedRatio: number;
  };
  components: Array<{
    key: string;
    label: string;
    role: string;
    host: string;
    url: string | null;
    critical: boolean;
    unit: string | null;
    self?: boolean;
    probe: {
      up: boolean;
      httpStatus: number | null;
      latencyMs: number | null;
      build?: number | null;
      error?: string;
      flood?: FloodMetrics | null;                  // proxy only — loopback /metrics snapshot
    };
    unitState: { active: string; sub: string; uptimeSec: number | null } | null;
  }>;
  supabase: {                                         // our project's detail (collectSupabase)
    projectRef: string;
    region: string;                                   // Management API; eu-west-1 fallback
    versionLatencyMs: number | null;
    projectStatus: string | null;                     // ACTIVE_HEALTHY | … (null without token)
    postgresVersion: string | null;                   // null without token
    hasToken: boolean;                                 // SUPABASE_ACCESS_TOKEN present?
    services: Array<{
      name: string;                                   // db | rest | auth | realtime | storage | pooler
      label: string;
      status: string;                                 // ACTIVE_HEALTHY | up | down | unknown
      healthy: boolean | null;                        // null = couldn't determine (no token / no anon probe)
      version: string | null;                         // GoTrue / Storage version where known
      source: "management" | "anon";
    }>;
  };
}

The wantBuild, upWhen, and metricsUrl fields are internal tuning knobs — they're stripped before serialisation. FloodMetrics mirrors the proxy's /metrics snapshot (gateMode, underAttack, last60s, pool, lifetime, …) and appears only on the proxy component's probe; it's null when the proxy is unreachable or predates the endpoint. The platform-wide Supabase Statuspage is not in this body — it's fetched client-side (see the Supabase Project card). The response carries Cache-Control: no-store and an X-Status-Cache: hit|miss header so you can tell whether you got the cached snapshot or a fresh collect().

There are no secrets in this body: no tokens, no env, no internal IPs beyond the loopback probe URLs. That's deliberate — the endpoint is public so it survives an auth outage.
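One way to strip the internal tuning knobs before serialising is a rest-destructure, sketched below. InternalTarget is a trimmed-down, hypothetical shape and toPublicTarget is a hypothetical name; the real route may do this differently.

```typescript
// Trimmed-down target: the three optional fields are the internal tuning knobs.
interface InternalTarget {
  key: string;
  label: string;
  url: string | null;
  critical: boolean;
  wantBuild?: boolean;
  upWhen?: (status: number) => boolean;
  metricsUrl?: string;
}

type PublicTarget = Omit<InternalTarget, "wantBuild" | "upWhen" | "metricsUrl">;

// Drop the internal fields; everything else passes through unchanged.
function toPublicTarget(t: InternalTarget): PublicTarget {
  const { wantBuild, upWhen, metricsUrl, ...publicFields } = t;
  return publicFields;
}
```

JSON.stringify would already omit a function such as upWhen, but metricsUrl and wantBuild are plain values and have to be removed explicitly.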

Host metrics

Because the app runs on the VPS, host metrics come straight from Node's os module — no agent, no shell scraping:

  • Load: os.loadavg() gives the 1m/5m/15m averages; loadRatio is the 1-minute figure divided by os.cpus().length. The board surfaces it as "% of capacity (1m)".
  • Memory: memUsed = os.totalmem() − os.freemem(), rendered as used/total GB and a used-ratio.
  • Uptime: os.uptime() (seconds since boot), formatted as Xd Yh.

The client renders load and memory as discrete 20-cell meter gauges (Meter in StatusClient.tsx) — pure Tailwind cells, no inline widths. Tone is driven by ratio: emerald below 60%, amber 60–85%, red at or above 85% (ratioTone).
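The ratio arithmetic and the tone thresholds can be sketched as pure functions. The thresholds and the 20-cell count come from the text; hostRatios and filledCells are hypothetical names, and rounding to the nearest cell (with clamping, since a load ratio can exceed 1) is an assumption about how Meter fills its cells.

```typescript
type Tone = "emerald" | "amber" | "red";

const METER_CELLS = 20;

// Inputs correspond to os.loadavg()[0], os.cpus().length, os.totalmem(), os.freemem().
function hostRatios(load1m: number, cpus: number, memTotal: number, memFree: number) {
  const memUsed = memTotal - memFree;
  return {
    loadRatio: load1m / cpus,
    memUsed,
    memUsedRatio: memUsed / memTotal,
  };
}

// emerald below 60%, amber from 60% up to 85%, red at or above 85%.
function ratioTone(ratio: number): Tone {
  if (ratio >= 0.85) return "red";
  if (ratio >= 0.6) return "amber";
  return "emerald";
}

// Assumption: the gauge clamps the ratio to [0, 1] and rounds to the nearest cell.
function filledCells(ratio: number, cells: number = METER_CELLS): number {
  const clamped = Math.min(1, Math.max(0, ratio));
  return Math.round(clamped * cells);
}
```

For instance, a 1-minute load of 2 on a 4-core host gives loadRatio 0.5, which renders as 10 emerald cells under these assumptions.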

Caching, polling, and the self-heartbeat

Three timers keep the board fresh without hammering the probed services:

  • Server cache — 4 s. GET /api/status serves a cached StatusBody for up to CACHE_TTL_MS (4000 ms). Concurrent viewers share one snapshot; the probed services see at most one collect() burst every 4 s regardless of viewer count.
  • Client poll — 7 s. StatusClient polls /api/status every REFRESH_MS (7000 ms), and pauses while the tab is hidden (document.hidden). A separate 1 s ticker only re-renders the "updated Ns ago" relative timestamp; it does not re-fetch.
  • Self-heartbeat — 10 s. A module-level setInterval (HEARTBEAT_MS, 10000 ms) recomputes a snapshot and pushes it to the Supabase ingest endpoint even when nobody has the board open, so the stored history never has a gap. It's unref()-ed (won't keep the event loop alive on its own), guarded by a globalThis flag against dev-HMR duplicate intervals, and a running latch prevents overlapping ticks.
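The server cache and the heartbeat's running latch can be illustrated with two small helpers. This is a sketch, not the route's code: createSnapshotCache and createLatchedTick are hypothetical names, the clock is passed in so the logic stays testable, and caching the in-flight promise (so concurrent requests share one collect()) is an assumption.

```typescript
const CACHE_TTL_MS = 4000;

// Returns a getter that reuses one snapshot promise for up to ttlMs.
// The "hit" | "miss" flag is what an X-Status-Cache header would report.
function createSnapshotCache<T>(collect: () => Promise<T>, ttlMs: number = CACHE_TTL_MS) {
  let entry: { at: number; value: Promise<T> } | null = null;
  return (now: number): { value: Promise<T>; cache: "hit" | "miss" } => {
    if (entry !== null && now - entry.at < ttlMs) {
      return { value: entry.value, cache: "hit" };
    }
    const fresh = { at: now, value: collect() };
    entry = fresh;
    return { value: fresh.value, cache: "miss" };
  };
}

// Wraps a heartbeat task so a tick is skipped while the previous one is still running.
// The returned function reports whether the task was actually started.
function createLatchedTick(task: () => Promise<void>): () => boolean {
  let running = false;
  const release = (): void => {
    running = false;
  };
  return () => {
    if (running) return false;
    running = true;
    try {
      // Release the latch whether the task resolves or rejects.
      task().then(release, release);
    } catch {
      release();
    }
    return true;
  };
}
```

In the real module the latched tick would be driven by the setInterval described above (unref()-ed and guarded by a globalThis flag); those parts are omitted here.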

Supabase ingest

Each computed snapshot is also pushed, fire-and-forget, to a Supabase Edge Function so it's queryable over REST/MCP and fanned out over Realtime/WS for any historical or dashboard consumer.

  • Where: STATUS_INGEST_URL (defaults to https://api.busymate.net/functions/v1/status-ingest).
  • Auth: a shared secret in STATUS_INGEST_SECRET, sent as the x-status-ingest-secret header. Without the secret the push is skipped silently (one warning), so local dev never errors.
  • Never blocks: pushSnapshot() never throws; any failure (timeout, non-2xx, network) is logged and swallowed. The GET handler calls it with void (un-awaited), so ingest being down can never block or fail a status response.
  • Two callers: the GET handler (on a cache miss) and the 10 s heartbeat both push, so the stored snapshot tracks both real traffic and the no-viewer baseline.
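The never-throws contract can be sketched as below. The header name and the skip-without-secret rule come from the list above; IngestConfig, buildIngestRequest, the injected post function, and the content-type header are assumptions made for illustration.

```typescript
interface IngestConfig {
  url: string; // STATUS_INGEST_URL
  secret?: string; // STATUS_INGEST_SECRET
}

interface IngestRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical transport; in the real route this would be a fetch() POST with a timeout.
type Post = (req: IngestRequest) => Promise<{ ok: boolean; status: number }>;

// Returns null when no secret is configured, which means "skip the push".
function buildIngestRequest(snapshot: unknown, cfg: IngestConfig): IngestRequest | null {
  if (!cfg.secret) return null;
  return {
    url: cfg.url,
    headers: {
      "content-type": "application/json", // assumption
      "x-status-ingest-secret": cfg.secret,
    },
    body: JSON.stringify(snapshot),
  };
}

// Every failure path is logged and swallowed, so the returned promise never rejects.
async function pushSnapshot(
  snapshot: unknown,
  cfg: IngestConfig,
  post: Post,
  log: (msg: string) => void = () => {},
): Promise<void> {
  try {
    const req = buildIngestRequest(snapshot, cfg);
    if (req === null) return; // no secret: skip silently
    const res = await post(req);
    if (!res.ok) log(`status ingest returned ${res.status}`);
  } catch (err) {
    log(`status ingest failed: ${String(err)}`);
  }
}
```

A caller in the GET handler would then write void pushSnapshot(body, cfg, post) so the response is sent without waiting on ingest.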

Deploy note: enabling ingest in production means adding STATUS_INGEST_SECRET to the busymate-status systemd drop-in env. See Status board — How-to for the deploy mechanics.

File map

  • web/status/app/api/status/route.ts: probe() · fetchFloodMetrics() · collectSupabase() · unitState() · manifestBuilds() · collect() · 4s cache · 10s heartbeat · ingest
  • app/StatusClient.tsx: Client board: 7s poll, Proxy Security + Supabase Project cards, useSupabasePlatform (client-side Statuspage)
  • app/page.tsx: force-dynamic page → renders StatusClient
  • app/layout.tsx: Dark-only root layout
  • components/ui/: shadcn primitives (card, badge, separator)
  • deploy/systemd/: busymate-status.service unit file
  • deploy/nginx/: status.busymate.net.conf vhost
  • package.json: dev :3940 · build · deploy (ssh git pull + restart)

Where to look next