Fix RabbitMQ cluster partitions

This section describes how to recover a RabbitMQ cluster that has become partitioned.

What is a partition

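A partition occurs when the nodes of a RabbitMQ cluster lose contact with each other, typically because of network instability, heavy message queue loads, or host restarts. Each side then assumes the other is down and keeps operating on its own, which is known as a "split-brain" scenario. This can significantly disrupt operations, and manual intervention is necessary when the cluster cannot recover by itself.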

Symptoms of a RabbitMQ partition can include:

Message Delivery Failures: One of the most immediate effects of a RabbitMQ partition is the failure of message delivery. Microservices may either stop receiving messages altogether or experience significant delays.
Errors in Status or Event Reporting: Front-facing services attempt to publish device data to RabbitMQ but fail, which triggers retries. Under a constant load of device data, these retries can exhaust the threads allocated for publishing to RabbitMQ, and further requests are then rejected.

How to identify a partition

Partitions can be identified via:

Server logs
RabbitMQ Admin UI
CLI commands
HTTP API

Server logs (Elastic)

In the event of a network partition within a RabbitMQ cluster, certain log entries can help quickly identify the issue. For example, you might encounter logs similar to:

Network partition detected. Node rabbit@node1 cannot see nodes [rabbit@node2, rabbit@node3].

The absence of explicit partition detection logs does not necessarily mean there is no partition; RabbitMQ might not have detected it yet.
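When scanning exported logs, a small parser can pull out which node reported the partition and which peers it lost. The sketch below assumes the wording of the sample line above; the actual log text in your RabbitMQ version may differ, so treat the pattern as a starting point. The name parse_partition_line is hypothetical.

```python
import re

# Pattern modeled on the sample log line above (an assumption);
# adjust it to the wording your RabbitMQ version actually emits.
PARTITION_LINE = re.compile(
    r"Network partition detected\. Node (?P<node>\S+) "
    r"cannot see nodes \[(?P<peers>[^\]]*)\]"
)


def parse_partition_line(line):
    """Return (reporting_node, [unreachable_nodes]), or None if the line does not match."""
    match = PARTITION_LINE.search(line)
    if match is None:
        return None
    peers = [p.strip() for p in match.group("peers").split(",") if p.strip()]
    return match.group("node"), peers
```

A line that does not match returns None, which, as noted above, does not prove that the cluster is healthy.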

RabbitMQ Admin UI

Example Partition in RabbitMQ Admin UI:

In the first image, only 2 nodes are displayed. Typically, there are at least 3 RabbitMQ nodes. Refreshing the page might redirect you to another node due to load balancing. In the next image, the node rabbitmq-statefulset-1 appears to have formed its own cluster.
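The same check can be made through the HTTP API of the management plugin, whose /api/nodes endpoint returns one object per node, including name, running, and partitions fields. The following is a minimal sketch: the base URL, credentials, and expected node names are placeholders, and the function names fetch_nodes and summarize are hypothetical.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """GET /api/nodes from the management plugin and return the parsed JSON list."""
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize(nodes, expected_names):
    """Report expected nodes that are absent or not running, and any reported partitions."""
    running = {n["name"] for n in nodes if n.get("running", False)}
    return {
        "missing": sorted(set(expected_names) - running),
        "partitions": {n["name"]: n["partitions"] for n in nodes if n.get("partitions")},
    }
```

Because a load balancer may route each request to a different node, it can help to query every node directly and compare the results; a node that has formed its own cluster will list only itself.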

How to fix a partition

To recover from a split-brain, first choose the partition that you trust the most. This partition becomes the authority for the state of the system (schema, messages); any changes that have occurred on the other partitions will be lost.

Stop the RabbitMQ application

First, you need to stop the RabbitMQ application on the node that is out of sync or has been partitioned.

Connect to the node’s (container) shell and execute the following command:

rabbitmqctl stop_app

Reset the RabbitMQ node

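After stopping the application, reset the node. This step removes the node from the cluster and returns it to a blank state: its cluster membership, its local configuration (such as users and virtual hosts), and any persistent messages stored on it are deleted. Run it only on nodes outside the trusted partition; the node will then need to rejoin the cluster.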

rabbitmqctl reset

Join the cluster

Next, join the node back to the cluster by specifying the name of the node you trust (the one you determined to be the source of truth). Replace rabbit@rabbitmq1 with the actual name of the node you’re using as the authoritative source.

rabbitmqctl join_cluster rabbit@rabbitmq1

Start the RabbitMQ application

Finally, start the RabbitMQ application on the node. This action will reintegrate the node into the cluster, and it will start functioning according to the cluster configuration.

rabbitmqctl start_app
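
The four steps above can be chained so that a failure at any step stops the sequence instead of continuing on a half-reset node. This is a sketch only: it assumes rabbitmqctl is on the PATH of the shell it runs in (for example, inside the node's container), and the function names rejoin_commands and rejoin_cluster are hypothetical.

```python
import subprocess


def rejoin_commands(trusted_node):
    """Return the rabbitmqctl invocations, in order, for rejoining via trusted_node."""
    return [
        ["rabbitmqctl", "stop_app"],
        ["rabbitmqctl", "reset"],
        ["rabbitmqctl", "join_cluster", trusted_node],
        ["rabbitmqctl", "start_app"],
    ]


def rejoin_cluster(trusted_node, run=subprocess.run):
    """Run each step on the local node; check=True raises on the first failing step."""
    for command in rejoin_commands(trusted_node):
        run(command, check=True)
```

For example, rejoin_cluster("rabbit@rabbitmq1") would be run on each node outside the trusted partition, one node at a time.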