The Goal
The lab has reached the point where a single Proxmox host is no longer enough. I want to test high-availability behaviour properly — what actually happens when a node fails, how clean live migration looks under load, and whether the corosync ring holds up the way the documentation claims. The end state is a small Proxmox cluster I can break confidently and rebuild quickly.
Why Clustering at Home?
The case for clustering in a home lab is more compelling than it first appears. Virtualization Howto makes the point well: even with just two physical nodes, you get the same benefits as a production deployment — high availability, seamless live migration of virtual machines, and centralised web-based management. For anything you actually depend on at home — DNS, Home Assistant, Pi-hole, a Plex server — losing a single node should not mean losing the service.
It does add complexity. Networking matters more. Firewall rules matter more. And the cluster bus (corosync) is genuinely fussy about latency and packet loss. But that complexity is exactly what makes it a good learning environment.
Planned Hardware
The Plan
The build is straightforward on paper. Per Virtualization Howto's cluster installation guide:
- Install Proxmox VE on each node with a unique hostname and static IP
- From the Proxmox Web GUI: Datacenter → Cluster → Create Cluster
- Generate the join information token on the primary node
- On the second node: Datacenter → Cluster → Join Cluster, paste the token
- Verify quorum and ring status with pvecm status and journalctl -b -u corosync
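The verification step can be scripted rather than eyeballed. Below is a minimal Python sketch that pulls the "Key: value" lines out of pvecm status output and checks the Quorate field. The function names are my own, and SAMPLE is an illustrative, abbreviated stand-in for the real output (which I have not captured yet), so treat the exact layout as an assumption.

```python
def parse_pvecm_status(text: str) -> dict:
    """Collect 'Key: value' lines from pvecm status output into a dict.

    Lines without a colon (section titles, rulers, blank lines) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_quorate(text: str) -> bool:
    """True when the output reports 'Quorate: Yes'."""
    return parse_pvecm_status(text).get("Quorate", "").lower() == "yes"


# Illustrative sample only: abbreviated and NOT captured from a real cluster.
SAMPLE = """\
Quorum information
------------------
Nodes:            2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Total votes:      2
Quorum:           2
"""

if __name__ == "__main__":
    info = parse_pvecm_status(SAMPLE)
    print("quorate:", is_quorate(SAMPLE))
    print("votes:", info["Total votes"], "of", info["Expected votes"])
```

On a real node the same functions could be fed the live output, for example from subprocess.run(["pvecm", "status"], capture_output=True, text=True).stdout, run as root.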
What I'm Watching For
Corosync stability. Their HA Do's & Don'ts article is blunt about this: corosync hates dropped packets and asymmetric links. Wireless is out. A single network cable is asking for trouble. The recommendation is at least two identical wired connections in an active/backup bond — so a single cable failure never takes the cluster bus offline. I'll set this up properly from the start rather than wait for a 3 a.m. failure to discover it.
Quorum behaviour with two nodes. A two-node cluster has an obvious problem: quorum needs a majority of votes, so with only two votes in total, losing either node leaves the survivor without quorum. The fix is corosync's external vote support: a QDevice, backed by a small always-on machine such as a Raspberry Pi or NUC. That machine is not a cluster member; it just casts a third vote so the surviving node knows it's safe to keep running.
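The vote arithmetic is worth writing down. This is a small Python sketch of the majority rule, assuming every node and the external device carry exactly one vote each; the function names are mine, not anything from corosync.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def survives(total_votes: int, votes_lost: int) -> bool:
    """True if the remaining votes still reach the majority threshold."""
    return total_votes - votes_lost >= quorum_threshold(total_votes)


if __name__ == "__main__":
    # Assumption: one vote per node, one vote for the external device.
    for label, total in [("two nodes", 2), ("two nodes + external vote", 3)]:
        print(f"{label}: need {quorum_threshold(total)} of {total}, "
              f"survives one node failing: {survives(total, 1)}")
```

With two votes the threshold is two, so any single failure blocks the cluster. With three votes the threshold is still two, which is why one extra vote is enough.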
Live migration under load. The test is to spin up a VM with active TCP sessions, hammer it with iperf3, and migrate it between nodes while the traffic is running. The interesting metric isn't whether it works; it's how long the dirty page sync takes and whether anything visibly drops.
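To get a feel for why the dirty page sync is the number to watch, here is a deliberately simplified Python model of iterative pre-copy. Each round copies whatever is still outstanding, while the guest keeps dirtying memory at a fixed rate. The constant rates, the 64 MiB stop threshold and the round cap are all assumptions for illustration; this is not how QEMU actually schedules a migration.

```python
def precopy_rounds(ram_mib, bandwidth_mib_s, dirty_mib_s,
                   stop_threshold_mib=64, max_rounds=30):
    """Toy pre-copy model.

    Returns (rounds, seconds spent copying, MiB left for the final pause),
    or None if the remainder never drops below the threshold.
    """
    remaining = ram_mib
    elapsed = 0.0
    for round_no in range(1, max_rounds + 1):
        copy_time = remaining / bandwidth_mib_s
        elapsed += copy_time
        # Memory dirtied while this round was in flight must be sent again,
        # but it can never exceed the guest's total RAM.
        remaining = min(ram_mib, dirty_mib_s * copy_time)
        if remaining <= stop_threshold_mib:
            return round_no, elapsed, remaining
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 4 GiB guest, ~1000 MiB/s link.
    print(precopy_rounds(4096, 1000, 100))   # light write load
    print(precopy_rounds(4096, 1000, 1200))  # dirtying faster than the link
```

The model converges only when the dirty rate stays below the link bandwidth. That is the behaviour I expect the iperf3 test to expose, although a network-heavy guest does not necessarily dirty much memory.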
Next Update
Once the hardware is racked I'll post the actual pvecm status output, the corosync configuration (with redundant rings), and what failover actually looks like when I yank a power cable. The Ceph storage layer is an obvious next step after that — but one thing at a time.