markus.preinl • May 21, 2026

Building a Proxmox cluster: High availability for your server infrastructure

A Proxmox cluster connects multiple virtualization hosts into a shared environment.

This allows for centralized management of VMs (virtual machines), live migration, and faster restarts in case of failures.

A stable Proxmox cluster requires more than just additional servers: A clean quorum, reliable cluster communication, a suitable storage strategy, and clear operational processes are crucial. These factors determine whether high availability truly works in everyday use.

This article explains how a Proxmox cluster is structured, the prerequisites it should meet, and which storage options are suitable for small and medium-sized enterprises (SMEs).

Table of contents

Why a Proxmox Cluster? High Availability and Business Benefits for Enterprises
Cluster Basics: Architecture, Quorum, and Avoiding Split-Brain
Requirements: Hardware, Network, and Design for a Stable Setup
Setting Up an HA Cluster: Step by Step
Storage Strategy: Shared Storage, Ceph vs. NFS, and Alternatives
Operation and Best Practices: Updates, Monitoring, Backups, and Maintenance Processes
Costs, Subscriptions, and Planning for SMEs
Conclusion
FAQ

Technician working on the server rack to set up network and nodes for a Proxmox cluster

Why a Proxmox cluster? High availability and business benefits for companies

A Proxmox cluster connects multiple virtualization hosts to a shared operating environment. Workloads can be centrally managed, moved between hosts (live migration), and automatically restarted in case of failures. This reduces the risk of extended downtime, makes maintenance more predictable, and ensures that critical services remain available more quickly.

In businesses, outages are caused not only by defects, but often also by maintenance, misconfigurations, or bottlenecks. A Proxmox cluster reduces single points of failure at the host level and provides the foundation for migrating services in a controlled way or restarting them automatically on another host.

It is important not to confuse high availability with backup and disaster recovery: HA reduces downtime during operation, but it does not replace recovery after data loss or site failure.

For a concise overview of product features, see also: Proxmox VE Features.

In practice, many cluster projects fail not because of Proxmox itself, but because of planning, network design, and operation. FIGULI CONSULTING helps companies to set up exactly these points cleanly – from conception to implementation to stable regular operation.

What failure scenarios does a Proxmox cluster solve in practice?

Typical failures affect individual components: power supplies, controllers, RAM, SSDs, or entire hosts. A cluster can mitigate these events by restarting affected workloads on remaining hosts or preparing for planned maintenance through migration. This requires that the remaining capacity is allocated for emergency operation and that system dependencies are known.

Host failure: automatic relocation and restart of VMs/containers
Planned maintenance: live migration before reboot or hardware replacement
Power issue: the failure of an individual UPS branch is absorbed if power paths are redundant
Network disruption: redundancy through separate paths and proper Corosync configuration

How exactly do typical workloads benefit from HA & Live Migration?

Typical workloads such as ERP, file servers, or internal applications benefit primarily from predictable maintenance and shorter restart times. For companies, the technology matters less than the question of how quickly a service is available again after a host failure. Thanks to live migration, Proxmox hosts can be updated more frequently, and at short notice in urgent cases, without taking important applications such as CRM, ERP, and file servers out of service.

Cluster basics: Architecture, quorum, and avoiding split-brain

A Proxmox cluster only functions stably if decisions can be made unambiguously in the event of a failure. Quorum, Corosync, and fencing are crucial for this. Only the part of the cluster with the majority is permitted to execute critical actions. This prevents separate cluster parts from continuing to operate in parallel and thus avoids the risk of data corruption.

The technical implementation is achieved through Corosync, quorum mechanisms, and supplementary safeguards such as fencing and watchdog timers.

How a Proxmox cluster is structured and what that means in operation

A cluster consists of multiple nodes that share a common cluster configuration and status information. Management is centralized via the interface, with changes to relevant cluster objects being consistently distributed. This facilitates standardization but also increases the importance of clean change processes, as configuration errors can have a more rapid impact.

What Quorum is in a Proxmox cluster and why it is the basis for high availability

Quorum ensures that only the part of the cluster with the required majority vote can continue working. This prevents conflicting states, such as when two cluster parts remain active simultaneously after a network outage. Quorum is therefore crucial for high availability: without it, many cluster actions are limited or even impossible.

Majority decides: Quorum is based on votes, not computing power.
Without quorum: limited cluster actions and increased risk.
For small environments: QDevice can stabilize a 2-node design.

How split-brain syndrome develops and which measures reliably prevent it

Split-brain occurs when cluster components lose contact with each other, yet both remain active and continue to control resources. The cause is usually a network partition or an unstable Corosync network, less frequently a combination of latency, packet loss, and faulty switch configurations. In storage scenarios with write operations, this can lead to conflicting data states.

Split-brain can be prevented through a robust quorum design, redundant Corosync paths (ring0 and ring1), and consistent fencing. Fencing ensures that a node is forcibly shut down before the other component takes over resources. In Proxmox, the watchdog approach is a key component for mitigating risky situations.

Overview of quorum and typical topologies

Topology	Impact on quorum and operation
2 nodes without QDevice	Quorum is lost if a node fails; HA functions are then only partially usable.
2 nodes with QDevice	Quorum remains achievable even if a node fails; requires an additional, separate system for voting.
3 nodes	Quorum remains stable even if a node fails; often the simplest approach for true HA in small environments.
4+ nodes	More reserve capacity and better distribution; network and storage design become more important to manage complexity.
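
The majority rule behind this table can be sketched in a few lines of Python. This is a simplified model and not Corosync's actual vote handling: it assumes one vote per node and exactly one vote for a healthy QDevice, and the function names are hypothetical.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def tolerated_node_failures(nodes: int, qdevice: bool = False) -> int:
    """How many nodes may fail while the remaining votes still reach quorum.

    Assumption: every node has one vote and a healthy QDevice adds exactly one.
    """
    total = nodes + (1 if qdevice else 0)
    spare_votes = total - votes_needed(total)
    return min(nodes, spare_votes)


# Topologies from the table above (4 nodes stands in for "4+ nodes").
for label, nodes, qdevice in [
    ("2 nodes without QDevice", 2, False),
    ("2 nodes with QDevice", 2, True),
    ("3 nodes", 3, False),
    ("4 nodes", 4, False),
]:
    failures = tolerated_node_failures(nodes, qdevice)
    print(f"{label}: tolerates {failures} node failure(s)")
```

The sketch only counts votes. It says nothing about whether the remaining hosts have enough capacity to run the affected workloads.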

Requirements: Hardware, network & design for a stable setup

A stable Proxmox cluster rarely fails due to software issues, but rather due to poor planning: insufficient resources, unclear network segmentation, or a lack of redundancy.

These are precisely the factors that determine whether high availability will function effectively in a critical situation. Before you begin, you should define which services must continue running during a failover and which can be temporarily dispensed with.

The most important prerequisites at a glance:

N+1 capacity: one host can fail without overloading critical systems
Redundant power paths and a sound UPS strategy
Network segmentation for management, Corosync, storage, and migration
Clear operational goals: RTO/RPO and priorities for each workload

What hardware and node requirements are suitable for SMEs?

For SMEs, a few well-sized nodes are often more practical than many small ones. Plan CPU and RAM so that normal operation does not run at sustained loads near 80–90%, because the remaining hosts must absorb additional load in a failover scenario. Reserves for live migration, caching, and short-term peaks are also crucial, especially for databases or terminal servers.

Size CPU/RAM with reserves for failover and maintenance.
Plan for redundant power supplies and separate power feeds.
Evaluate storage based on latency and IOPS, not just terabytes.
Define a spare parts and RMA process to minimize downtime.
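
A rough N+1 check for RAM can be sketched as follows. The `Host` class, the function name, and the 80% ceiling are assumptions for illustration. The check works on cluster totals only and ignores whether individual VMs actually fit on a particular host.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    ram_gb: float       # RAM available for guests on this host
    used_ram_gb: float  # RAM currently committed to guests


def survives_single_failure(hosts: list[Host], max_utilization: float = 0.8) -> bool:
    """True if the guests of any one failed host fit on the remaining hosts.

    max_utilization is an assumed planning ceiling, not a Proxmox setting.
    """
    total_load = sum(h.used_ram_gb for h in hosts)
    for failed in hosts:
        remaining = [h for h in hosts if h is not failed]
        if not remaining:
            return False
        capacity = sum(h.ram_gb for h in remaining) * max_utilization
        if total_load > capacity:
            return False
    return True


# Hypothetical three-host cluster with 256 GB per host.
cluster = [Host("pve1", 256, 120), Host("pve2", 256, 120), Host("pve3", 256, 120)]
print("N+1 satisfied:", survives_single_failure(cluster))
```

The same pattern can be applied to CPU cores or storage IOPS by swapping the fields that are compared.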

How the network should be structured in the Proxmox cluster

A robust design separates traffic types: Management access, Corosync cluster communication, storage traffic, and migration should be at least logically separated, and in critical environments, also physically separated. This prevents backup jobs or storage spikes from impacting cluster communication. For Corosync, low latency and very low packet loss are more important than maximum bandwidth.

Corosync: preferably its own Layer 2 area, stable and redundant
Migration: sufficient bandwidth to prevent maintenance from becoming a bottleneck
Storage network: consistent MTU and careful jumbo frame planning, if used
ring0/ring1: separate paths to avoid partitions

Which cluster topologies make sense and how they affect reliability

For true HA, a majority decision is crucial. A 3-node design is often the simplest minimum topology because quorum is maintained even if a node fails, and no additional QDevice is required. For very small environments, a 2-node cluster with QDevice can be an economical alternative if the QDevice is operated separately and is not subject to the same risk as the nodes.

3 nodes: usually a stable entry point for HA and quorum
2 nodes + QDevice: possible, but the QDevice must be separate and stable
Shared dependencies significantly reduce the benefits of HA

Technical background information on quorum and Corosync can be found in the official Proxmox documentation.

Setting up an HA cluster: Step-by-step

A Proxmox HA cluster only functions reliably if the setup is structured. Typical problems don't arise from Proxmox itself, but from errors in DNS, time, network, or storage.

Therefore, proceed in clear steps:

Prepare the basics: Define hostnames, DNS, NTP, and network segments. Determine which interfaces will be used for management and Corosync, and plan for sufficient capacity reserves.
Create the cluster and add nodes: Create the cluster on the first node and add further nodes incrementally. After each step, check the quorum, cluster status, and reachability.
Configure Corosync for stability and redundancy: Set up ring0 and ring1 on separate paths and monitor latency and packet loss. Ensure consistent MTU settings.
Activate and test HA resources: Initially, activate HA for selected systems and test controlled failover. Define startup sequences and test the behavior in case of failure.
Configure fencing and watchdog: Ensure that a failed node is stopped and can no longer write to storage before another node takes over. Test this behavior under realistic conditions.

High availability (HA) should only be enabled broadly once the cluster, Corosync, storage, and monitoring are running stably. Otherwise, high availability is merely theoretical.

Checklist:

Uniform hostnames and DNS entries
NTP configured for all nodes
Management, Corosync, and storage VLANs defined
Capacity reserve for N+1 documented

The setup process is crucial in determining whether a cluster will run stably later on. FIGULI CONSULTING provides support with clear methodologies, sound network and HA concepts, and practical testing to ensure high availability actually works in critical situations.

How to create the cluster and add nodes cleanly

Start with consistent hostnames, clean name resolution, and stable time synchronization. NTP is essential for logging and troubleshooting. Also, define which interfaces will be used for management and Corosync.

How to configure Corosync for stability and redundancy

Configure Corosync so that cluster communication is not affected by normal network traffic. Use ring0 and ring1 on separate paths and monitor latency and packet loss.

Run ring0/ring1 separately via NICs, VLANs, and switches.
Measure latency and packet loss and define limits.
Keep the MTU consistent for each ring and deploy changes in a controlled manner.
Simulate link failures and document the cluster response.
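
Defining limits for latency and packet loss can look like the following sketch. It evaluates round-trip times that were collected elsewhere, for example from ping probes on each ring. The function name and the default thresholds are assumptions, not values recommended by Proxmox or Corosync.

```python
from statistics import mean
from typing import Optional


def link_health(samples_ms: list[Optional[float]],
                max_avg_rtt_ms: float = 5.0,
                max_loss_ratio: float = 0.01) -> dict:
    """Summarize probe results for one link; None marks a lost probe.

    The thresholds are hypothetical and should be replaced by your own limits.
    """
    if not samples_ms:
        raise ValueError("no samples collected")
    received = [s for s in samples_ms if s is not None]
    loss_ratio = 1 - len(received) / len(samples_ms)
    avg_rtt = mean(received) if received else float("inf")
    return {
        "avg_rtt_ms": avg_rtt,
        "loss_ratio": loss_ratio,
        "ok": avg_rtt <= max_avg_rtt_ms and loss_ratio <= max_loss_ratio,
    }


# Hypothetical measurements for ring0 and ring1.
print("ring0:", link_health([0.4, 0.5, 0.6, 0.5]))
print("ring1:", link_health([0.4, None, 0.5, None]))
```

Running such a check per ring before and after a simulated link failure gives you comparable numbers for the documentation.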

How to activate and test HA resources including controlled failover

Enable HA initially for a few, well-understood systems and define startup sequences and dependencies. Interdependent services such as databases and application servers should not start uncontrolled in parallel; each service should start only once the systems it depends on are ready.

Enable HA gradually and define priorities for each workload.
Use maintenance mode before restarting or patching hosts.
Perform failover tests with application checks, not just VM status.
Measure RTO and optimize startup sequences.
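
A startup sequence can be derived from documented dependencies with a topological sort. The sketch below uses `graphlib` from the Python standard library (Python 3.9 or newer). The service names are hypothetical, and the result is a planning aid only: how the order is enforced in the cluster is outside this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical services: each key depends on the services in its set.
dependencies = {
    "database": set(),
    "app-server": {"database"},
    "web-frontend": {"app-server"},
    "file-server": set(),
}


def startup_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


print(startup_order(dependencies))
```

A cycle error at this stage is useful in its own right, because it reveals dependencies that would also block a clean restart after a failover.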

How to set up fencing/watchdog to prevent data corruption and safely mitigate split-brain situations

Fencing ensures that a node cannot perform any further write operations in case of an error before another node takes over. This is particularly important in shared or replicated storage scenarios to prevent conflicting data states. A watchdog can additionally ensure that a node automatically restarts or isolates itself in critical situations.

Fencing as protection against data corruption in unclear cluster states
Activate the watchdog and verify its behavior in test cases
Ensure independence from the management network where possible
Create a runbook for fencing events

Further technical details can be found in the official Proxmox High Availability documentation.

Storage Strategy: Shared Storage, Ceph vs. NFS & Alternatives

Storage determines whether high availability (HA) and live migration function reliably in a Proxmox cluster. A design without single points of failure is crucial.

When do you need shared storage – and when don't you?

For uninterrupted live migration, shared storage is usually necessary in practice, as the VM data is immediately available on the target host. HA setups also benefit from this.

Alternatives are possible without shared storage, but require more effort for replication and recovery.

Live migration: generally requires shared storage
HA restart: significantly easier with shared storage
Without shared storage: increased complexity regarding data availability

Ceph vs. NFS in a Proxmox cluster: Which is a better fit?

NFS is easy to operate, but it creates a central dependency. If the storage fails, many systems are affected.

Ceph distributes data across multiple nodes and increases fault tolerance, but requires more resources and operational overhead.

NFS: easy to get started, but central dependency
Ceph: higher fault tolerance, but more complex operation
Decision: depends on RTO/RPO, budget, and expertise

How do you plan a Ceph setup effectively?

Ceph requires clean network design and sufficient resources. Errors in this area quickly lead to performance or stability problems.

Dedicated storage network with stable bandwidth
Plan for replication and rebuild reserves
Clearly define failure domains (host, rack, power)
Establish monitoring before going live
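
Replication and rebuild reserves can be estimated with a simple rule of thumb. The sketch below is a planning heuristic under stated assumptions, not a Ceph formula: hosts of equal size, host-level replication, enough surviving capacity after the loss of one host, and a maximum fill level of 85%. The function and parameter names are hypothetical.

```python
def usable_capacity_tb(hosts: int,
                       raw_tb_per_host: float,
                       replicas: int = 3,
                       max_fill: float = 0.85) -> float:
    """Estimate usable capacity that still fits after losing one host.

    Assumptions: equally sized hosts, one replica per host, and a planning
    ceiling of max_fill for the capacity that survives a host failure.
    """
    if hosts < replicas:
        raise ValueError("need at least as many hosts as replicas")
    surviving_raw_tb = (hosts - 1) * raw_tb_per_host
    return surviving_raw_tb * max_fill / replicas


# Hypothetical cluster: 4 hosts with 20 TB raw capacity each.
print(f"{usable_capacity_tb(4, 20.0):.1f} TB usable")
```

The estimate is deliberately conservative. It is meant to show how quickly replication and rebuild reserves reduce the raw capacity, not to replace sizing with real workload data.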

Ceph vs. NFS in direct comparison

Criterion	Classification
Operation	NFS is easier, Ceph is significantly more demanding
Fault tolerance	Ceph is more robust, NFS needs additional safeguards.
Performance	NFS depends on the storage system behind it, Ceph scales with nodes.
Cost	NFS is cheaper to start with, Ceph requires more effort.

Operations & Best Practices: Updates without downtime, monitoring, backups and maintenance processes

A Proxmox cluster only delivers real benefits when its operation and maintenance are properly managed. This includes rolling updates, monitoring with clear alert rules, and a backup strategy aligned with RPO and RTO. Without these processes, a cluster merely postpones problems.

Weaknesses in processes and maintenance become apparent, especially during operation. Learn more in the article "IT Maintenance for Businesses: Why Regular Support Is Crucial."

Regular testing is essential: Failover, restore, and maintenance procedures should be performed in a controlled manner. This allows you to identify early on whether the cluster, storage, and HA rules are working together reliably.

Rolling updates with live migration
Monitoring for cluster, storage, and HA
Backups including offsite and restore tests
Clear runbooks for operations and incident management

Especially during ongoing operations, FIGULI CONSULTING helps companies structure maintenance, monitoring, and failover tests in such a way that clusters can not only be set up but also operated stably in the long term.

How do you perform updates in the Proxmox cluster without downtime?

Updates are performed as a rolling process: node in maintenance mode, workloads migrated, patched, tested – then the next node.

Activate maintenance mode before migration and updates.
Check cluster status and quorum after each step.
Define a rollback plan and maintenance window.
Include firmware and drivers.
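
The rolling procedure can be expressed as a small control loop. This is a sketch under assumptions: `drain`, `patch`, and `cluster_healthy` are hypothetical callbacks that you would implement for your own environment (maintenance mode and migration, package and firmware updates, quorum and status checks). No Proxmox commands are called here.

```python
from typing import Callable, Iterable


def rolling_update(nodes: Iterable[str],
                   drain: Callable[[str], None],
                   patch: Callable[[str], None],
                   cluster_healthy: Callable[[], bool]) -> list[str]:
    """Update one node at a time and stop as soon as the cluster is unhealthy.

    Returns the nodes that were updated and verified successfully.
    """
    updated = []
    for node in nodes:
        if not cluster_healthy():
            break              # never start on a degraded cluster
        drain(node)            # maintenance mode, migrate workloads away
        patch(node)            # install updates, reboot if required
        if not cluster_healthy():
            break              # stop and investigate before touching more nodes
        updated.append(node)
    return updated


# Dry run with placeholder callbacks that only print what would happen.
result = rolling_update(
    ["pve1", "pve2", "pve3"],
    drain=lambda n: print(f"drain {n}"),
    patch=lambda n: print(f"patch {n}"),
    cluster_healthy=lambda: True,
)
print("updated:", result)
```

Stopping at the first unhealthy check keeps a failed update confined to one node, which is the point of the rolling approach.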

How do you set up monitoring and alerting effectively?

Monitoring must cover cluster-specific risks, not just "host up/down."

Monitor quorum, Corosync, and packet loss.
Define storage health and capacity thresholds.
Set targeted alerts for HA events and restarts.
Clearly define escalation procedures and responsibilities.

Especially with management access and privileged accounts, endpoint security should also be considered. Learn more in the article "Endpoint Security: Why Antivirus Alone Is No Longer Enough."

Which backup strategy makes sense for HA?

High Availability (HA) is not a substitute for backups. Protection against data loss requires clear Recovery Point Objective (RPO) and Recovery Time Objective (RTO) goals, offsite backups, and regular restore tests.

Define RPO/RTO for each system
Perform offsite backups to protect against ransomware
Perform regular restore tests
Monitor backup windows and growth
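
Whether backups still meet the defined RPO can be checked automatically. The following sketch compares the newest backup per system with its RPO. The system names, timestamps, and function name are hypothetical, and the backup timestamps would have to come from your backup tooling.

```python
from datetime import datetime, timedelta


def rpo_violations(last_backup: dict[str, datetime],
                   rpo: dict[str, timedelta],
                   now: datetime) -> list[str]:
    """Return systems whose newest backup is missing or older than their RPO."""
    late = []
    for system, limit in rpo.items():
        newest = last_backup.get(system)
        if newest is None or now - newest > limit:
            late.append(system)
    return sorted(late)


# Hypothetical example data.
now = datetime(2026, 5, 21, 12, 0)
last_backup = {
    "erp": now - timedelta(hours=2),
    "files": now - timedelta(hours=30),
}
rpo = {
    "erp": timedelta(hours=4),
    "files": timedelta(hours=24),
    "crm": timedelta(hours=24),  # no backup recorded at all
}
print("RPO violations:", rpo_violations(last_backup, rpo, now))
```

A check like this only confirms that backups exist and are recent. Restore tests remain necessary to prove that they are usable.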

What typical errors jeopardize high availability?

The biggest risks rarely lie in Proxmox itself, but rather in its design and operation.

Incorrect quorum design
Unstable Corosync network
Storage bottlenecks or insufficient reserves
Untested fencing
Different firmware or MTU (configuration drift)
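
Configuration drift of this kind can be detected by comparing an inventory of settings across nodes. The sketch below flags every node that deviates from the most common value. The inventory structure, node names, and values are hypothetical, and collecting the data is left to your own tooling.

```python
from collections import Counter
from typing import Optional


def find_drift(settings: dict[str, dict[str, str]]) -> dict[str, dict[str, Optional[str]]]:
    """For each setting that differs across nodes, list the deviating nodes.

    A node deviates if its value differs from the most common value;
    a missing setting is reported as None.
    """
    drift = {}
    keys = {key for node_cfg in settings.values() for key in node_cfg}
    for key in sorted(keys):
        values = {node: cfg.get(key) for node, cfg in settings.items()}
        common_value, _ = Counter(values.values()).most_common(1)[0]
        outliers = {n: v for n, v in values.items() if v != common_value}
        if outliers:
            drift[key] = outliers
    return drift


# Hypothetical inventory collected from three nodes.
inventory = {
    "pve1": {"mtu": "9000", "nic_firmware": "1.2.3"},
    "pve2": {"mtu": "9000", "nic_firmware": "1.2.3"},
    "pve3": {"mtu": "1500", "nic_firmware": "1.2.3"},
}
print(find_drift(inventory))
```

Running the comparison after every maintenance window helps catch drift before it shows up as an unstable cluster.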

Costs, subscriptions & planning for SMEs

The cost of a Proxmox cluster stems less from the software itself and more from redundancy, storage, networking, and operations. The crucial factor is determining which failure risks actually need to be mitigated and how much effort is reasonable to invest. This determines whether a 2-node setup with QDevice is sufficient or whether a 3-node design or Ceph is necessary.

Align costs with objectives: RTO/RPO, critical workloads, growth.
Use redundancy strategically where genuine single points of failure exist.
Plan for operational costs: updates, monitoring, testing, documentation.

Which cost categories are realistic?

The most significant costs arise from hosts (CPU/RAM), storage, network, power/UPS, and ongoing operations.

Redundancy is most beneficial for power and network, followed by storage.

Power and network: the greatest leverage for real-world reliability.
Storage: prevents performance problems and cascading failures.
Hosts: N+1 redundancy is essential for failover.

What role do Proxmox subscriptions play?

Subscriptions primarily cover updates, support, and access to stable repositories.

For production environments, it's crucial to implement updates in a controlled and low-risk manner. Without a clear update strategy, operational risk increases.

Therefore, consider subscriptions not as license costs, but as a component of operational security.

Which setup options are suitable for SMEs?

The choice depends on budget, risk, and availability requirements.

2 Nodes + QDevice: compact, but with an additional dependency.
3 Nodes: stable standard for HA and quorum.
Ceph: useful when distributed storage redundancy is needed.

RTO is primarily determined by HA, RPO by backup and application design.

Conclusion

A Proxmox cluster is a sensible choice for businesses when virtualization needs to be not only flexible but also highly reliable. The key is not individual features, but a clean overall design encompassing quorum, Corosync, suitable storage, fencing, and clearly defined operational processes.

For SMEs, this means: First, define goals, failure scenarios, and priorities, then carefully plan the topology, storage, and high availability rules. This is how you avoid making high availability unnecessarily complex, or finding that it does not respond as expected in a critical situation.

FIGULI CONSULTING helps companies plan Proxmox clusters with technical precision, realistically test failover, and set up an environment that remains stable and transparent during operation.

Plan your Proxmox cluster now.

FAQ

What is a Proxmox cluster?

A Proxmox cluster connects multiple virtualization hosts into a common management and operating environment. This allows workloads to be controlled centrally, hosts to be managed together, and VMs to be restarted more quickly in the event of failures.

How does high availability (HA) work with Proxmox?

HA is based on cluster communication, quorum, and the HA manager, which monitors defined resources and restarts or relocates them in case of failures. For this to function reliably, the network and storage must be stable, and fencing must protect against risky conditions. HA reduces downtime but does not replace backups.

How many nodes does a Proxmox cluster need for quorum?

For quorum to survive the failure of one node, a majority of votes must remain. Three nodes are often the simplest minimum for this. A 2-node cluster loses quorum as soon as one node fails, unless a separately operated QDevice provides an additional vote.

What is Quorum in the Proxmox Cluster?

Quorum means that a majority of the cluster's votes is present, so that the cluster can make valid decisions. Only the part of the cluster holding this majority is permitted to execute critical cluster operations, which prevents conflicting states. Quorum is therefore a crucial foundation for safe failover and consistent management.

Do I need shared storage for HA and live migration?

For uninterrupted live migration, shared storage is often the simplest solution because the VM disks are immediately available on the target host. High availability (HA) can also function without shared storage, but then requires different concepts for data availability and consistent recovery. Recovery time objective (RTO)/recovery point objective (RPO) and workload type are crucial factors.

Ceph vs. NFS in a Proxmox cluster: which is a better fit?

NFS is easier to operate, but it introduces a central dependency. Ceph distributes data across multiple nodes and increases fault tolerance, but requires more resources, a clean network design, and more operational expertise. Which option is better depends on budget, team expertise, and availability goals.

Is a Proxmox cluster worthwhile for SMEs?

Yes, if central systems such as ERP, file servers, databases, or internal applications need to remain available with minimal downtime. The decisive factor is not the company size, but rather how critical outages are and whether a single host poses a risk.

Notice:
This article provides general technical information and does not replace individual planning, risk analysis, or manufacturer consultation. Configurations and recommendations must be adapted to the specific environment and tested before going into production.