Architecture
One control plane, any number of nodes. Nodes only ever call home — nothing needs to be reachable from outside.
┌─────────────────────────────────────────────────────────┐
│ Control Plane (Docker Compose)                          │
│ Caddy (TLS, reverse proxy)                              │
│   ├─ /auth /nodes /environments /assignments /users     │
│   │  /audit-log /enrollment /install-agent.sh → backend │
│   └─ everything else → frontend (Angular, via nginx)    │
│ Backend (FastAPI): REST API + gRPC server + internal CA │
│ PostgreSQL                                              │
└───────────────┬─────────────────────────────────────────┘
                │ gRPC: :50051 enrollment (TLS), :50052 heartbeat (mTLS)
     ┌──────────┼──────────┐
     ▼          ▼          ▼
   Agent      Agent      Agent  (Proxmox host, PVE API)
Components
Control plane
Layered Python service: routers → services → repositories → SQLAlchemy models, with Alembic migrations. An internal CA (EC P-384) issues client certificates for agent↔control-plane mTLS. The gRPC server exposes two listeners — one TLS-only for enrollment (the node has no certificate yet), one mTLS-required for heartbeats (node identity comes from the certificate's CN). Tunneling is pluggable via a TunnelProvider interface, with SSH ProxyJump and NetBird implementations. Email is generic SMTP (STARTTLS or implicit TLS), used for invites and admin notifications.
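To make the pluggable tunneling idea concrete, here is a minimal sketch of what a provider interface of this kind could look like. Only the name TunnelProvider and the two provider types come from the description above; the AccessTarget type, the access_instructions method, and all field names are assumptions made for illustration, not the project's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTarget:
    """Hypothetical description of what a user needs to reach."""
    node_host: str   # address of the node as seen from the jump host
    linux_user: str  # local account created on the node
    jump_host: str   # bastion the user can reach (used by the SSH provider)


class TunnelProvider(ABC):
    """Interface name from the text; the method shape is assumed."""

    @abstractmethod
    def access_instructions(self, target: AccessTarget) -> str:
        """Return text the dashboard could show to the user."""


class SshProxyJumpProvider(TunnelProvider):
    def access_instructions(self, target: AccessTarget) -> str:
        # Host, HostName, User and ProxyJump are standard OpenSSH
        # client configuration keywords.
        return "\n".join([
            f"Host {target.node_host}",
            f"    HostName {target.node_host}",
            f"    User {target.linux_user}",
            f"    ProxyJump {target.jump_host}",
        ])


class NetBirdProvider(TunnelProvider):
    def __init__(self, setup_key: str) -> None:
        # In a real system the key would be issued per user or per
        # Environment; here it is simply passed in.
        self._setup_key = setup_key

    def access_instructions(self, target: AccessTarget) -> str:
        return f"NetBird setup key for {target.linux_user}: {self._setup_key}"
```

With this shape, the rest of the backend only depends on TunnelProvider, so adding a third tunnel type would mean adding one class rather than touching the callers.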
Agent (Go)
A single static binary per node. Discovers hardware (CPU/RAM/GPU via /proc and nvidia-smi), applies one of four interchangeable GPU enforcement strategies (soft, cgroup, mig, container), manages local Linux users and authorized_keys, and — for Proxmox hosts — talks to the PVE REST API for VM/LXC and user management.
Frontend
Angular standalone app with two areas: the user portal (accept invite, SSH key wizard, assigned resources) and the admin dashboard (nodes, Environments, users, assignments, audit log).
Key flows
- Registration — invite-only. Admin invites → email with a token → user completes name/password, email prefilled from the token.
- Node enrollment — admin creates a node → one-time token → install-agent.sh builds the agent, generates a CSR, gets a certificate from the internal CA.
- Resource assignment — admin assigns GPU/CPU/RAM to an Environment; unless "fixed", the control plane dynamically resolves a node with a free GPU at request time.
- Heartbeat — the agent reports hardware capabilities every N seconds over mTLS; the control plane replies with the active assignments to enforce on that node (which Linux user to create, SSH key, enforcement mode).
- User access — the dashboard generates an SSH ProxyJump config or a NetBird setup key, depending on the Environment's tunnel provider.
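The dynamic resolution step in the resource assignment flow can be sketched as follows. This is an illustration under stated assumptions: the NodeState type, the resolve_node function, and the "most free GPUs first" tie-breaking policy are all hypothetical, since the text only says that a node with a free GPU is chosen at request time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeState:
    """Hypothetical view of a node built from its latest heartbeat."""
    name: str
    gpu_total: int     # GPUs reported by the agent
    gpu_assigned: int  # GPUs already covered by active assignments

    @property
    def gpu_free(self) -> int:
        return self.gpu_total - self.gpu_assigned


def resolve_node(
    nodes: Iterable[NodeState],
    gpus_needed: int,
    fixed_node: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the node to use, or None if nothing fits."""
    if fixed_node is not None:
        # "Fixed" assignments skip dynamic resolution entirely.
        return fixed_node
    candidates = [n for n in nodes if n.gpu_free >= gpus_needed]
    if not candidates:
        return None
    # Assumed policy: prefer the node with the most free GPUs,
    # using the name as a deterministic tie-breaker.
    best = min(candidates, key=lambda n: (-n.gpu_free, n.name))
    return best.name
```

A caller would typically treat a None result as "no capacity right now" and surface that to the user instead of creating the assignment.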
Known limitation
Physical GPU indices aren't yet persisted per assignment (only the count is tracked); they're derived as contiguous blocks at heartbeat time. This is sufficient for soft/cgroup enforcement, but explicit index persistence will be needed once GPU allocation becomes fully hardware-aware.
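As a rough illustration of deriving contiguous blocks from counts alone, consider the sketch below. The function name, the input shape, and the assumption that assignments arrive in a stable order are all hypothetical; the text does not specify how the ordering is determined.

```python
from typing import Dict, List, Sequence, Tuple


def derive_gpu_indices(
    assignment_counts: Sequence[Tuple[str, int]],
    gpu_total: int,
) -> Dict[str, List[int]]:
    """Map each assignment to a contiguous block of GPU indices.

    assignment_counts holds (assignment_id, gpu_count) pairs in the
    order the blocks should be laid out (assumed to be stable).
    """
    result: Dict[str, List[int]] = {}
    next_index = 0
    for assignment_id, count in assignment_counts:
        end = next_index + count
        if end > gpu_total:
            raise ValueError(
                f"assignment {assignment_id!r} needs GPUs up to index "
                f"{end - 1}, but the node only has {gpu_total}"
            )
        result[assignment_id] = list(range(next_index, end))
        next_index = end
    return result
```

In this sketch, removing an earlier assignment shifts the indices of every later one, which is one reason persisted indices would matter for strategies tied to specific physical devices.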