Fix RabbitMQ cluster partitions
This section describes how to recover a RabbitMQ cluster that has become partitioned.
What is a partition
A partition occurs when the nodes of a RabbitMQ cluster lose contact with each other, typically because of network instability, heavy message queue loads, or host restarts. Each side then keeps operating on its own, which is known as a "split-brain" scenario and can significantly disrupt operations. Manual intervention is necessary when the cluster cannot recover by itself.
Symptoms of a RabbitMQ partition can include:
- Message Delivery Failures: One of the most immediate effects of a RabbitMQ partition is the failure of message delivery. Microservices may either stop receiving messages altogether or experience significant delays.
- Errors in Status or Event Reporting: Front-facing services attempt to store device data on RabbitMQ, but fail, triggering retries. With a constant load of device data, this can exhaust the threads allocated for depositing data in RabbitMQ, resulting in requests being rejected.
How to identify a partition
Partitions can be identified via:
- Server logs
- RabbitMQ Admin UI
- CLI commands
- HTTP API
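For the CLI route, `rabbitmqctl cluster_status` reports any partitions the node has recorded. The sketch below is only an illustration: `show_partitions` is a hypothetical helper name, and it assumes the sectioned plain-text output of recent RabbitMQ releases, where a "Network Partitions" heading is followed by `(none)` when no partition is recorded. Older releases print a different format, in which case running `rabbitmqctl cluster_status` directly and reading the partitions entry is the safer option.

```shell
#!/usr/bin/env bash
# Print only the "Network Partitions" part of `rabbitmqctl cluster_status`.
# Assumption: sectioned plain-text output (heading, blank line, body, blank line).
show_partitions() {
  rabbitmqctl cluster_status | awk '
    /^Network Partitions$/ { in_section = 1; next }  # section heading found
    in_section && NF == 0 && seen { exit }           # blank line after the body ends the section
    in_section && NF > 0 { seen = 1; print }         # print the section body
  '
}

# Example (run inside the node shell):
#   show_partitions
```

Anything other than `(none)` in the result suggests that the node has recorded a partition.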
Server logs (elastic)
In the event of a network partition within a RabbitMQ cluster, certain log entries can help quickly identify the issue. For example, you might encounter logs similar to:
Network partition detected. Node rabbit@node1 cannot see nodes [rabbit@node2, rabbit@node3].
Note: The absence of explicit partition detection logs does not necessarily mean there is no partition. RabbitMQ might not have detected it yet.
RabbitMQ Admin UI
Example Partition in RabbitMQ Admin UI:
In the first image, only 2 nodes are displayed, although there are typically at least 3 RabbitMQ nodes. Refreshing the page might redirect you to another node due to load balancing. In the next image, the node rabbitmq-statefulset-1 appears to have formed its own cluster.
How to fix a partition
To recover from a split-brain, first choose one partition which you trust the most. This partition will become the authority for the state of the system (schema, messages) to use; any changes which have occurred on other partitions will be lost.
Stop the RabbitMQ application
First, stop the RabbitMQ application on the node that is out of sync or has been partitioned.
Connect to the node’s (container) shell and run the following command.
rabbitmqctl stop_app
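If the nodes run as Kubernetes pods (an assumption suggested by the node name rabbitmq-statefulset-1 and not confirmed here), the command can also be sent without opening an interactive shell. In this sketch, `node_ctl` is a hypothetical helper name, and `kubectl` is assumed to be configured for the right cluster and namespace.

```shell
#!/usr/bin/env bash
# Run a rabbitmqctl subcommand inside a RabbitMQ pod.
# Assumptions: Kubernetes deployment, kubectl pointing at the right
# cluster and namespace, rabbitmqctl available in the container.
node_ctl() {
  local pod="$1"
  shift
  kubectl exec "$pod" -- rabbitmqctl "$@"
}

# Example: stop the RabbitMQ application on the partitioned node.
#   node_ctl rabbitmq-statefulset-1 stop_app
```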
Reset the RabbitMQ node
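After stopping the application, reset the node. This step returns the node to a blank state: it removes the node from the cluster and deletes its local data, including configured users, virtual hosts, and persistent messages. Run it only on nodes outside the trusted partition, because anything that exists only on the reset node is lost. The node then needs to rejoin the cluster.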
rabbitmqctl reset
Join the cluster
Next, join the node back to the cluster by specifying the name of the node you trust (the one you determined to be the source of truth). Replace rabbit@rabbitmq1 with the actual node name of the cluster’s master node or the node you’re using as the authoritative source.
rabbitmqctl join_cluster rabbit@rabbitmq1
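The steps above can be chained into one sequence to run inside the shell of the node being repaired. This is a minimal sketch, not a definitive procedure: `rejoin_cluster` is a hypothetical helper name, and rabbit@rabbitmq1 is only the example node name used above. The final `start_app` is not listed in the steps above; it is the standard `rabbitmqctl` command for starting the application again after `join_cluster`.

```shell
#!/usr/bin/env bash
# Rejoin a partitioned node to the trusted partition.
# The && chain stops at the first command that fails.
rejoin_cluster() {
  local trusted_node="$1"                     # node chosen as the source of truth
  rabbitmqctl stop_app &&                     # stop the RabbitMQ application
  rabbitmqctl reset &&                        # wipe local state and cluster membership
  rabbitmqctl join_cluster "$trusted_node" && # join the trusted node's cluster
  rabbitmqctl start_app                       # start the application again
}

# Example:
#   rejoin_cluster rabbit@rabbitmq1
```

Chaining with `&&` avoids resetting or joining a node whose previous step did not succeed.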