Architecture

One control plane, any number of nodes. Nodes only ever call home, so nothing on a node needs to be reachable from outside.

┌─────────────────────────────────────────────────────────┐
│ Control Plane (Docker Compose)                            │
│  Caddy (TLS, reverse proxy)                                │
│   ├─ /auth /nodes /environments /assignments /users        │
│   │  /audit-log /enrollment /install-agent.sh → backend    │
│   └─ everything else → frontend (Angular, via nginx)       │
│  Backend (FastAPI): REST API + gRPC server + internal CA   │
│  PostgreSQL                                                │
└───────────────┬─────────────────────────────────────────┘
                │ gRPC: :50051 enrollment (TLS), :50052 heartbeat (mTLS)
     ┌──────────┼──────────┐
     ▼          ▼          ▼
  Agent      Agent      Agent (Proxmox host, PVE API)
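
The two ports in the diagram differ in whether the server demands a client certificate. Below is a minimal sketch of that difference using Python's standard ssl module as a stand-in. The backend's real gRPC credential setup is not shown in this guide, and the function names are hypothetical.

```python
import ssl
from typing import Optional


def enrollment_context() -> ssl.SSLContext:
    """TLS only: the server proves its identity, the client sends no certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_NONE  # a node that is enrolling has no certificate yet
    return ctx


def heartbeat_context() -> ssl.SSLContext:
    """mTLS: the handshake fails unless the client presents a trusted certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED
    # A real server would also call ctx.load_cert_chain(...) for its own
    # certificate and ctx.load_verify_locations(...) with the internal CA.
    return ctx


def common_name(peer_cert: dict) -> Optional[str]:
    """Read the subject CN from the dict returned by SSLSocket.getpeercert()."""
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

On the mTLS side, the CN returned by common_name is what would identify the calling node.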

Components

Control plane

Layered Python service: routers → services → repositories → SQLAlchemy models, with Alembic migrations. An internal CA (EC P-384) issues client certificates for agent↔control-plane mTLS. The gRPC server exposes two listeners — one TLS-only for enrollment (the node has no certificate yet), one mTLS-required for heartbeats (node identity comes from the certificate's CN). Tunneling is pluggable via a TunnelProvider interface, with SSH ProxyJump and NetBird implementations. Email is generic SMTP (STARTTLS or implicit TLS), used for invites and admin notifications.
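
To make the pluggable tunneling idea concrete, here is one way a TunnelProvider interface could look. Only the name TunnelProvider and the two implementations come from this guide. The method name, its signature, and the AccessTarget fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AccessTarget:
    """Hypothetical inputs needed to reach a node (field names are assumptions)."""
    node_host: str
    linux_user: str
    jump_host: str


class TunnelProvider(Protocol):
    def access_instructions(self, target: AccessTarget) -> str:
        """Return what the user needs in order to connect."""
        ...


class SshProxyJumpProvider:
    def access_instructions(self, target: AccessTarget) -> str:
        # Host, HostName, User and ProxyJump are standard ssh_config keywords.
        return "\n".join([
            f"Host {target.node_host}",
            f"    HostName {target.node_host}",
            f"    User {target.linux_user}",
            f"    ProxyJump {target.jump_host}",
        ])


class NetBirdProvider:
    def __init__(self, setup_key: str) -> None:
        # How the setup key is issued is out of scope for this sketch.
        self._setup_key = setup_key

    def access_instructions(self, target: AccessTarget) -> str:
        return self._setup_key


def instructions_for(provider: TunnelProvider, target: AccessTarget) -> str:
    # Callers depend only on the interface, so providers stay interchangeable.
    return provider.access_instructions(target)
```

Because callers only see the interface, adding another tunnel type in this sketch means adding one class and leaving the calling code unchanged.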

Agent (Go)

A single static binary per node. Discovers hardware (CPU/RAM/GPU via /proc and nvidia-smi), applies one of four interchangeable GPU enforcement strategies (soft, cgroup, mig, container), manages local Linux users and authorized_keys, and — for Proxmox hosts — talks to the PVE REST API for VM/LXC and user management.
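
GPU discovery through nvidia-smi can be sketched as parsing its CSV query output. The example is written in Python for illustration only, since the agent itself is Go. Which fields the agent actually queries is an assumption, and the type and function names are hypothetical.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import List

# --query-gpu and --format=csv,noheader,nounits are real nvidia-smi options;
# the choice of fields is an assumption. memory.total is reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


@dataclass
class Gpu:
    index: int
    name: str
    memory_mib: int


def parse_gpus(output: str) -> List[Gpu]:
    """Turn lines like '0, <name>, <MiB>' into Gpu records."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, memory = (field.strip() for field in row)
        gpus.append(Gpu(int(index), name, int(memory)))
    return gpus


def discover_gpus() -> List[Gpu]:
    """Run nvidia-smi; report no GPUs if the tool is missing or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpus(result.stdout)
```

A production version would also need to handle fields that nvidia-smi reports as unavailable, which this sketch does not.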

Frontend

Angular standalone app with two areas: the user portal (accept invite, SSH key wizard, assigned resources) and the admin dashboard (nodes, Environments, users, assignments, audit log).

Key flows

  1. Registration — invite-only. Admin invites → email with a token → user completes name/password, email prefilled from the token.
  2. Node enrollment — admin creates a node → one-time token → install-agent.sh builds the agent, generates a CSR, gets a certificate from the internal CA.
  3. Resource assignment — admin assigns GPU/CPU/RAM to an Environment; unless "fixed", the control plane dynamically resolves a node with a free GPU at request time.
  4. Heartbeat — the agent reports hardware capabilities every N seconds over mTLS; the control plane replies with the active assignments to enforce on that node (which Linux user to create, SSH key, enforcement mode).
  5. User access — the dashboard generates an SSH ProxyJump config or a NetBird setup key, depending on the Environment's tunnel provider.
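
The dynamic resolution in flow 3 can be sketched as follows. The first-fit policy, the Node fields, and the capacity check for fixed nodes are all assumptions. This guide only states that a node with a free GPU is resolved at request time unless the assignment is fixed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    total_gpus: int
    assigned_gpus: int

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.assigned_gpus


def resolve_node(
    nodes: List[Node],
    gpus_needed: int,
    fixed_node: Optional[str] = None,
) -> Optional[Node]:
    """Pick a node for a request, or return None if nothing fits."""
    if fixed_node is not None:
        # A fixed assignment considers only the named node.
        candidates = [n for n in nodes if n.name == fixed_node]
    else:
        candidates = nodes
    # First fit: take the first candidate with enough free GPUs.
    for node in candidates:
        if node.free_gpus >= gpus_needed:
            return node
    return None
```

First fit keeps the sketch short. A real scheduler might prefer the least-loaded node or pack assignments tightly instead.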

Known limitation

Physical GPU indices aren't yet persisted per assignment (only the count is tracked); they're derived as contiguous blocks at heartbeat time. This is sufficient for soft/cgroup enforcement, but explicit index persistence will be needed once GPU allocation becomes fully hardware-aware.
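
A minimal sketch of deriving contiguous blocks from counts alone. It assumes assignments are processed in a stable order and identified by Linux user; the function name and input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def derive_gpu_indices(
    assignments: List[Tuple[str, int]],
    total_gpus: int,
) -> Dict[str, List[int]]:
    """Map each (linux_user, gpu_count) pair to a contiguous block of indices."""
    result: Dict[str, List[int]] = {}
    next_index = 0
    for user, count in assignments:
        if next_index + count > total_gpus:
            raise ValueError(f"not enough GPUs for {user}")
        result[user] = list(range(next_index, next_index + count))
        next_index += count
    return result
```

With assignments [("alice", 2), ("bob", 1)] on a 4-GPU node, alice gets [0, 1] and bob gets [2]. Under this scheme, removing alice's assignment shifts bob's block to [0] on the next derivation, which is the kind of instability that persisted indices would avoid.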