Jesse's Software Engineering Blog
RabbitMQ Cluster
RabbitMQ is a robust message broker for handling distributed message queues. In high availability applications, a single server may not be sufficient to meet the availability needs of the application’s data. RabbitMQ offers various clustering options and configurations to ensure messages are persisted through failures and available when needed.
Setup
Although a RabbitMQ cluster can be set up and run locally, this post will use Vagrant to set up a two node environment. Each node needs RabbitMQ installed and running, and the host name for each node must be set prior to the RabbitMQ installation. Updating a hostname after installation can lead to issues that require manually removing the Erlang database files or re-installing RabbitMQ.
The Puppet modules and Vagrant file (vagrant/ubuntu/rabbitmq-cluster) used here can be found on github to follow along, or the boxes can be set up manually to run the commands.
Whether using the predefined Vagrant file or building new boxes, verify access to both servers before continuing:
vagrant up
vagrant ssh master
vagrant ssh slave
Considerations
Some considerations before beginning to configure the RabbitMQ cluster:
An Erlang cookie is used by the nodes to determine whether they can communicate with one another. The cookie can be any sequence of alphanumeric characters, as long as each node has the exact same string.
The clustered nodes cannot communicate with one another by IP address, therefore the host names need to be set on each of the individual nodes, as well as in the server’s /etc/hosts file.
Clustering
Once the Vagrant VMs are available, ensure that RabbitMQ is installed along with the management plugin, and that a new admin user has been created.
SSH into the master server and update the /etc/hosts file to map the host names to the IP addresses
vagrant ssh master
sudo vi /etc/hosts

10.2.2.2 rabbit-master
10.2.2.4 rabbit-slave
Copy the .erlang.cookie. This can be done manually or via scp
sudo cat /var/lib/rabbitmq/.erlang.cookie
SSH into the slave server and update the hosts file and erlang cookie
vagrant ssh slave
sudo vi /etc/hosts

10.2.2.2 rabbit-master
10.2.2.4 rabbit-slave

# stop the server before updating the cookie
sudo service rabbitmq-server stop
sudo vi /var/lib/rabbitmq/.erlang.cookie
sudo service rabbitmq-server start
NOTE: I have run into issues when updating the cookie of a running node, so I always stop the node prior to the update. If you receive an error like node … with name “rabbit” … already running, you can manually find and kill the RabbitMQ processes
ps aux | grep erl
ps aux | grep epmd
sudo kill -9 <pid>
The nodes are now able to connect. Still on the slave, cluster with the master
# check current cluster status
sudo rabbitmqctl cluster_status

# stop the app first
sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rabbit@rabbit-master
sudo rabbitmqctl start_app

# verify cluster status
sudo rabbitmqctl cluster_status
If successful, the pre and post cluster_status outputs should read
# pre
[{nodes,[{disc,['rabbit@rabbit-slave']}]},
 {running_nodes,['rabbit@rabbit-slave']},
 {cluster_name,<<"rabbit@rabbit-slave">>},
 {partitions,[]}]

# post
[{nodes,[{disc,['rabbit@rabbit-master','rabbit@rabbit-slave']}]},
 {running_nodes,['rabbit@rabbit-slave','rabbit@rabbit-master']},
 {cluster_name,<<"rabbit@rabbit-master">>},
 {partitions,[]}]
One optional flag for the join_cluster command is --ram. Paraphrased from the RabbitMQ documentation:
Nodes can be RAM based or disk (disc) based. RAM nodes keep their state only in memory, disk nodes keep state in memory and on disk. As RAM nodes don't have to write to disk as much as disk nodes, they can perform better…, the performance improvements will affect only resources management (e.g. adding/removing queues, exchanges, or vhosts), but not publishing or consuming speed. It is sufficient, but not recommended, to have just one disk node within a cluster, to store the state of the cluster safely.
To have added a RAM node instead
sudo rabbitmqctl join_cluster --ram rabbit@rabbit-master
This can also be updated in the future. It is important to stop the app before making such changes and start it again afterwards
sudo rabbitmqctl stop_app
sudo rabbitmqctl change_cluster_node_type ram
sudo rabbitmqctl start_app
Now that the nodes are connected, status updates and commands can be run from any node. Still on the slave node, check the cluster status of the master node.
sudo rabbitmqctl -n rabbit@rabbit-master cluster_status

Cluster status of node 'rabbit@rabbit-master' ...
[{nodes,[{disc,['rabbit@rabbit-master','rabbit@rabbit-slave']}]},
 {running_nodes,['rabbit@rabbit-slave','rabbit@rabbit-master']},
 {cluster_name,<<"rabbit@rabbit-master">>},
 {partitions,[]}]
This level of interoperability also implies that a cluster can be made from any node. In multi node systems, the join_cluster command can be run on any of the running nodes.
To remove a node from a cluster and have it run standalone
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
Auto Connecting
In most situations, having to SSH into each node and attach it to the cluster is not ideal. Also, if any node is reset, manual SSH access would be required to reattach it. Using the RabbitMQ configuration file, the clustering params can be predefined. For a fresh install with no defined configurations, simply add the following to both servers
sudo vi /etc/rabbitmq/rabbitmq.config

[{rabbit,
  [{cluster_nodes, {['rabbit@rabbit-master','rabbit@rabbit-slave'], disc}}]}].
The configuration file will only be read by fresh, newly installed nodes, or nodes that have been manually reset. To test, manually reset the slave node and verify the cluster connection
# run on slave
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
sudo rabbitmqctl cluster_status
Even though the node was reset, it automatically reattaches to the cluster due to the RabbitMQ configuration settings. This will also happen when the server reboots or the RabbitMQ service restarts.
The cluster verification can also be done via the management console UI, 10.2.2.4:15672, as long as a user was created and the ports are accessible. On the Overview tab there will be a list of the nodes in the cluster.
Mirroring
Although there are two nodes in the cluster, high availability has yet to be achieved for the queues. The node a queue is created on becomes that queue’s “master” node. If a queue is created on the master node, the queue will be available from the slave node, but it will not physically reside on the slave. Therefore, if the master were to go down, there would be no way to access any queue that was only defined on the master. This can be avoided by setting a RabbitMQ policy.
Running the following set_policy command forces matching queues to be replicated to the other nodes in the cluster. There are a variety of different ha-modes. The command uses a regular expression to define which queues to mirror. In the following example, all queues beginning with jesse. will be replicated across the cluster.
sudo rabbitmqctl set_policy ha-all "^jesse\." '{"ha-mode":"all"}'
Although the queues are replicated across all nodes, it’s important to note that consumers are always connected to the master regardless of the host they connect with. So even when specifying a slave IP or host in the client connection string, the client will truly be connected to the master. Even though mirroring enhances availability, it does not have the performance benefit of distributing load across nodes, as all requests are re-routed to the master. When the master receives requests/changes to the queues, those changes are broadcast to all the slave nodes and applied in the same order to keep the slaves in sync. Thus, it is more efficient to always deal directly with the master to avoid the slave to master redirect, but this can cause other issues, as discussed later.
The HA policy for the nodes can be confirmed in the management UI under the Admin -> policies section.
Verify
With the cluster configured and mirroring enabled, the RabbitMQ broker is now better equipped to handle node failures. To verify this, have both nodes online and part of the cluster, and both of the management UIs open to view the node/queue statuses. All of the confirmations can also be done from the command line instead of using the management UI.
This example will use PHP, and requires the following Composer dependency to be installed
{ "require": { "videlalvaro/php-amqplib": "2.2.*" } }
All of the following examples will be prefixed with
<?php

use PhpAmqpLib\Connection\AMQPConnection;
use PhpAmqpLib\Message\AMQPMessage;

require '/vendor/autoload.php';
Start by adding a new durable queue to the master node
$Connection = new AMQPConnection('10.2.2.2', 5672, 'jesse', 'pass');
$channel = $Connection->channel();
$channel->queue_declare('jesse.jobs', false, true, false, false);
$channel->basic_qos(null, 1, null);
The queue should be visible under the Queues tab for both the master and slave nodes. Also notice the ha-all label under the Features column for the queue, confirming the HA policy is applied.
Publish some messages to the master node
$Connection = new AMQPConnection('10.2.2.2', 5672, 'jesse', 'pass');
$channel = $Connection->channel();
$channel->queue_declare('jesse.jobs', false, true, false, false);
$channel->basic_qos(null, 1, null);

$properties = [
    'delivery_mode' => 2, // persistent
    'priority' => 1
];

$Message = new AMQPMessage(json_encode(['brewhaha']), $properties);
$channel->basic_publish($Message, '', 'jesse.jobs');
Verify that the queue on both nodes has the same number of messages. Repeat the publish, but publish to the slave, 10.2.2.4, and verify the queues are still in sync.
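As an optional programmatic check, a passive queue declare returns the queue name, message count, and consumer count without modifying the queue. A minimal sketch, reusing the common prefix, queue, and credentials from the examples above:

// Passively declare the queue against each node and compare the counts;
// a passive declare does not modify the queue, it simply returns
// the queue name, message count, and consumer count
foreach (['10.2.2.2', '10.2.2.4'] as $host) {
    $Connection = new AMQPConnection($host, 5672, 'jesse', 'pass');
    $channel = $Connection->channel();

    list($queue, $messageCount) = $channel->queue_declare('jesse.jobs', true);
    echo "$host: $messageCount messages", PHP_EOL;

    $channel->close();
    $Connection->close();
}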
Now turn the master off
vagrant ssh master
sudo rabbitmqctl stop_app
The master node UI now reflects that the node is down, and the node list on the slave node UI shows the same. Confirm messages can still be published to the slave. Trying to publish messages to the master will throw an exception. When the master goes down, the oldest slave, the one most likely to be in sync, becomes the new master. Turning the original master back on will make that node a new slave, and it will auto connect back to the cluster due to the auto connect configurations created earlier.
# run on the original master node
sudo rabbitmqctl start_app
sudo rabbitmqctl cluster_status

Cluster status of node 'rabbit@rabbit-master' ...
[{nodes,[{disc,['rabbit@rabbit-master','rabbit@rabbit-slave']}]},
 {running_nodes,['rabbit@rabbit-slave','rabbit@rabbit-master']},
 {cluster_name,<<"rabbit@rabbit-master">>},
 {partitions,[]}]
Special considerations need to be made if new and rebooted nodes must always be fully synced with the master node.
Looking at the management UI, under the Queues tab, the Node column for the jobs queue states rabbit-slave, correctly showing that the slave is now the new master.
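To finish the verification, the messages published before the failover can still be consumed from the surviving node. A minimal consumer sketch, again assuming the common prefix, the jesse.jobs queue, and the slave IP from the examples above:

// Connect to the surviving node and consume the mirrored messages
$Connection = new AMQPConnection('10.2.2.4', 5672, 'jesse', 'pass');
$channel = $Connection->channel();
$channel->queue_declare('jesse.jobs', false, true, false, false);
$channel->basic_qos(null, 1, null);

// Print and acknowledge each message as it arrives
$callback = function (AMQPMessage $Message) {
    echo $Message->body, PHP_EOL;
    $Message->delivery_info['channel']->basic_ack($Message->delivery_info['delivery_tag']);
};

$channel->basic_consume('jesse.jobs', '', false, false, false, false, $callback);

// Block until the consumer is cancelled or the connection closes
while (count($channel->callbacks)) {
    $channel->wait();
}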
Load Balancing
One last consideration is how to load balance the nodes. While it is ideal to only connect to the master to improve performance, this can cause issues when the master node goes down. Some solutions for load balancing:
Dynamic configurations – Keep dynamic configurations of the nodes and their statuses, and let the client (code base) determine which node to connect to. By knowing which node is the master and which nodes are currently online, the client application can select the correct node, or even randomize the connections. This would require monitoring the nodes and keeping a list somewhere accessible by the code. Using host names, i.e. rabbit-master, to connect instead of IPs would allow this approach to work even across server reboots/IP resets. A connection sketch follows this list.
Load balancer – Using a load balancer, such as HAProxy, gives much more flexibility on connection routing, while allowing the client to connect to a single host which is then dispersed to the nodes based on the load balancer's configuration. While providing more flexibility than dynamic configurations used by clients, a dedicated load balancer adds a new point of failure: losing the load balancer would make the RabbitMQ cluster's availability irrelevant.
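As a minimal sketch of the dynamic configuration approach, the client can simply try each known node in order and use the first one that accepts a connection. The connectToCluster helper below is hypothetical, and it reuses the common prefix, host names, port, and credentials assumed in the earlier examples:

// Hypothetical helper: try each known node in order until a connection succeeds
function connectToCluster(array $hosts)
{
    foreach ($hosts as $host) {
        try {
            return new AMQPConnection($host, 5672, 'jesse', 'pass');
        } catch (\Exception $e) {
            // Node unreachable, fall through and try the next one
        }
    }

    throw new \RuntimeException('No RabbitMQ nodes are reachable');
}

// Prefer the master, fall back to the slave if the master is down
$Connection = connectToCluster(['rabbit-master', 'rabbit-slave']);
$channel = $Connection->channel();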