OpenStack Nova RabbitMQ migration
Recently we migrated our Nova service to a new RabbitMQ cluster. It looked straightforward at first, but there was a catch.
Intro
If you have been working with OpenStack, you probably know that its default message queue is RabbitMQ. Besides Nova, many other OpenStack services use that queue, such as Cinder, Octavia, and Trove.
Topology
The role of RabbitMQ in Nova can be simplified like this:
Assume you have a running RabbitMQ cluster with 3 nodes: 10.237.5.11, 10.237.5.12 and 10.237.5.13.
For simplicity, we will expand the RabbitMQ cluster with a new node, 10.237.5.14, and point our endpoint at that new node only.
Change
First attempt
The obvious first step is to change the Nova endpoint by setting a new value for transport_url:
[DEFAULT]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
...
[oslo_messaging_notifications]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
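Before restarting anything, it can help to confirm that no section of the edited file still names an old node. The sketch below is only an illustration: `stale_sections` and `OLD_NODES` are hypothetical names, it uses Python's standard `configparser` rather than oslo.config, and it assumes IPv4 hosts in the usual `user:password@host:port` form (several brokers may be separated by commas). A real nova.conf may contain constructs that `configparser` rejects.

```python
import configparser
from urllib.parse import urlsplit

# Old RabbitMQ nodes from the topology above (hypothetical constant name).
OLD_NODES = {"10.237.5.11", "10.237.5.12", "10.237.5.13"}


def stale_sections(conf_text, old_nodes=OLD_NODES):
    """Return the names of sections whose transport_url names an old node."""
    # default_section is set to an unused name so that [DEFAULT] is read as
    # an ordinary section; interpolation=None keeps '%' in passwords literal.
    parser = configparser.ConfigParser(
        default_section="__none__", interpolation=None, strict=False
    )
    parser.read_string(conf_text)
    stale = []
    for section in parser.sections():
        url = parser.get(section, "transport_url", fallback=None)
        if url is None:
            continue
        # Strip "user:password@" and ":port" from every broker entry.
        hosts = {
            entry.rsplit("@", 1)[-1].partition(":")[0]
            for entry in urlsplit(url).netloc.split(",")
        }
        if hosts & set(old_nodes):
            stale.append(section)
    return stale
```

An empty list means every transport_url found in the file points only at nodes outside `OLD_NODES`. It says nothing about values stored elsewhere.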
Apply these changes and restart all the Nova services. Surprisingly, that is not enough! It works for some processes inside nova_api, but not all of them. Checking the network connections on one of the old RabbitMQ nodes shows that some nova-scheduler and nova-api processes are still not using the new endpoint:
$ ss -npt dst :5672 | grep "10.237.5.11:5672"
ESTAB 0 0 10.237.5.11:36378 10.237.5.11:5672 users:(("nova-scheduler",pid=23458,fd=14))
ESTAB 0 0 10.237.5.11:45110 10.237.5.11:5672 users:(("nova-scheduler",pid=23412,fd=14))
ESTAB 0 0 10.237.5.11:57906 10.237.5.11:5672 users:(("nova-scheduler",pid=23475,fd=14))
ESTAB 0 0 10.237.5.11:45406 10.237.5.11:5672 users:(("nova-scheduler",pid=23430,fd=14))
ESTAB 0 0 10.237.5.11:59714 10.237.5.11:5672 users:(("nova-scheduler",pid=23404,fd=14))
ESTAB 0 0 10.237.5.11:44712 10.237.5.11:5672 users:(("nova-api",pid=23466,fd=15))
ESTAB 0 0 10.237.5.11:36434 10.237.5.11:5672 users:(("nova-api",pid=23419,fd=15))
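With many processes, a per-process summary of that output is easier to read than the raw rows. A minimal sketch, assuming the `ss -npt` row layout shown above (state, two queue counters, local address, peer address, `users:((...))`) and IPv4 addresses; `stale_connections` and `ROW` are hypothetical names:

```python
import re
from collections import Counter

# One `ss -npt` row: state, Recv-Q, Send-Q, local addr, peer addr, process info.
ROW = re.compile(
    r'^\S+\s+\d+\s+\d+\s+\S+\s+(?P<peer>\S+):(?P<port>\d+)\s+'
    r'users:\(\("(?P<proc>[^"]+)",pid=(?P<pid>\d+)'
)


def stale_connections(ss_output, old_nodes):
    """Count connections per process name whose peer address is an old node."""
    counts = Counter()
    for line in ss_output.splitlines():
        match = ROW.match(line.strip())
        # Header rows and rows without process info simply do not match.
        if match and match.group("peer") in old_nodes:
            counts[match.group("proc")] += 1
    return counts
```

Feeding it the captured output together with the set of old node addresses gives one count per process name, which makes it easy to see which services still hold connections to the old cluster.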
If we wrongly assume that Nova's message queue configuration has been fully migrated and shut down RabbitMQ on the old nodes (10.237.5.11-13), we may get this error:
2024-03-26 14:00:42.315 26 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-03-26 14:00:42.321 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): error: [Errno 111] ECONNREFUSED
Be careful: once you stop the old RabbitMQ nodes, the running nova_api and nova_scheduler will not work properly, even though the configuration file has been changed.
Final fix
After digging for a while, we found that Nova takes transport_url not only from the configuration file but also from the database, namely from the cell_mappings table of the nova_api database:
SELECT transport_url FROM `nova_api`.`cell_mappings`
Not surprisingly, it was still pointing to the old values. Update it and things are good.
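To prepare the corrected value, the stored URL can be rewritten so that only the broker list changes while the credentials and the virtual host stay the same. A minimal sketch, assuming a URL of the form `rabbit://user:password@host:port[,user:password@host:port...]/vhost` with the same credentials for every broker; `repoint` is a hypothetical helper, not part of Nova:

```python
from urllib.parse import urlsplit, urlunsplit


def repoint(transport_url, new_hosts):
    """Return transport_url with its broker list replaced by new_hosts.

    Credentials are copied from the first broker entry; the scheme, the
    virtual host path and any query string are kept unchanged.
    """
    parts = urlsplit(transport_url)
    if not parts.netloc:
        # Nothing to rewrite, e.g. a placeholder such as "none:///".
        return transport_url
    first = parts.netloc.split(",")[0]
    # creds and sep are empty strings when the URL carries no credentials.
    creds, sep, _ = first.rpartition("@")
    netloc = ",".join(f"{creds}{sep}{host}" for host in new_hosts)
    return urlunsplit(parts._replace(netloc=netloc))
```

For example, `repoint(old_url, ["10.237.5.14:5672"])` yields a URL that names only the new node. One way to write the result back is `nova-manage cell_v2 update_cell --cell_uuid <uuid> --transport-url <url>` rather than editing the table by hand; treat that as a suggestion to verify against your Nova release, and restart the Nova services afterwards so they reconnect.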
