Openstack Nova on Rabbitmq migration
Last time we worked on migrating our Nova service to a new Rabbitmq cluster. Things looked straightforward at first, but there was a catch.
Intro
If you have worked with Openstack, you probably know that its default message queue is Rabbitmq. Besides Nova, many other Openstack services use that queue, such as Cinder, Octavia, Trove...
Topology
The role of Rabbitmq in Nova can be simplified like this:
Assume you have a running Rabbitmq cluster with 3 nodes: 10.237.5.11, 10.237.5.12 and 10.237.5.13.
For simplicity, we will expand the Rabbitmq cluster with a new node, 10.237.5.14, and change our endpoint to point at that new node only.
Change
First attempt
The obvious first step is to change the Nova endpoint by setting a new value for transport_url:
[DEFAULT]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
...
[oslo_messaging_notifications]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
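Before restarting anything, it can help to confirm that every edited config file really carries the new value. The following is a minimal sketch using Python's standard configparser, not an Openstack tool. The helper names `transport_urls` and `points_to` are ours, and it assumes the file is plain INI and that the URL carries an explicit port.

```python
import configparser

NEW_NODE = "10.237.5.14"  # the new Rabbitmq node from this post


def transport_urls(conf_text):
    """Return {section: transport_url} for the two sections edited above."""
    # interpolation=None: passwords may contain '%';
    # strict=False: tolerate repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    urls = {"DEFAULT": parser.defaults().get("transport_url")}
    section = "oslo_messaging_notifications"
    if parser.has_section(section):
        # configparser falls back to [DEFAULT] when the section omits the key.
        urls[section] = parser.get(section, "transport_url", fallback=None)
    return urls


def points_to(url, node=NEW_NODE):
    """True if the URL names the given node (assumes an explicit ':port')."""
    return url is not None and f"@{node}:" in url
```

Reading the file with `open(path).read()` and checking `all(points_to(u) for u in transport_urls(text).values())` gives a quick yes/no per file. It only says what is written in the file, not what the running processes actually use.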
Apply these changes and restart all the nova services. Surprisingly, that is not all! It works for some processes inside nova_api, but not all of them. By checking the network connections on one of the old Rabbitmq nodes, we can see that some processes are still not using the new endpoint:
$ ss -npt dst :5672 | grep "10.237.5.11:5672"
ESTAB 0 0 10.237.5.11:36378 10.237.5.11:5672 users:(("nova-scheduler",pid=23458,fd=14))
ESTAB 0 0 10.237.5.11:45110 10.237.5.11:5672 users:(("nova-scheduler",pid=23412,fd=14))
ESTAB 0 0 10.237.5.11:57906 10.237.5.11:5672 users:(("nova-scheduler",pid=23475,fd=14))
ESTAB 0 0 10.237.5.11:45406 10.237.5.11:5672 users:(("nova-scheduler",pid=23430,fd=14))
ESTAB 0 0 10.237.5.11:59714 10.237.5.11:5672 users:(("nova-scheduler",pid=23404,fd=14))
ESTAB 0 0 10.237.5.11:44712 10.237.5.11:5672 users:(("nova-api",pid=23466,fd=15))
ESTAB 0 0 10.237.5.11:36434 10.237.5.11:5672 users:(("nova-api",pid=23419,fd=15))
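With more than a handful of connections, reading this output by eye gets tedious. Here is a small sketch that counts leftover connections per process and old node. It assumes the `ss -npt` line layout shown above (peer address followed by the `users:((...))` column), and `stale_connections` is a name we made up for illustration.

```python
import re
from collections import Counter

OLD_NODES = {"10.237.5.11", "10.237.5.12", "10.237.5.13"}

# Peer address on port 5672, followed by the first process name
# in the users:(("name",pid=...,fd=...)) column.
LINE_RE = re.compile(r'\s(\d+\.\d+\.\d+\.\d+):5672\s+users:\(\("([^"]+)"')


def stale_connections(ss_output, old_nodes=OLD_NODES):
    """Count connections to old Rabbitmq nodes, keyed by (process, node)."""
    counts = Counter()
    for line in ss_output.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(1) in old_nodes:
            counts[(match.group(2), match.group(1))] += 1
    return counts
```

Feeding it the output of `ss -npt dst :5672` (without the grep) should return an empty counter once the migration is really done.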
And if we wrongly assume that Nova's message queue config has been fully migrated and shut down Rabbitmq on the old nodes (10.237.5.11-13), we may get this error:
2024-03-26 14:00:42.315 26 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-03-26 14:00:42.321 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): error: [Errno 111] ECONNREFUSED
Be careful: if you stop the old Rabbitmq nodes at this point, the running nova_api and nova_scheduler will not work properly, even though the new config has been applied.
Final fix
After digging for a while, we found that nova_api takes its transport_url not only from the configuration file but also from the database.
SELECT transport_url FROM `nova_api`.`cell_mappings`
Unsurprisingly, it still points to the old values. Update it and things are good.
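Rather than editing the table by hand, the stored value can also be changed with `nova-manage cell_v2 update_cell`. The sketch below only builds the command lines and does not run anything. It assumes you have also selected the `uuid` column from `cell_mappings`; the function names are ours, and the flags are worth checking against `nova-manage cell_v2 update_cell --help` on your release.

```python
import shlex
from urllib.parse import urlsplit

OLD_NODES = {"10.237.5.11", "10.237.5.12", "10.237.5.13"}
NEW_URL = "rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672"


def url_hosts(transport_url):
    """Hosts named in a transport URL (comma-separated multi-host allowed)."""
    netloc = urlsplit(transport_url).netloc
    # Drop "user:password@" and ":port" from each entry; IPv6 is not handled.
    return {entry.rsplit("@", 1)[-1].partition(":")[0]
            for entry in netloc.split(",") if entry}


def update_commands(cells, old_nodes=OLD_NODES, new_url=NEW_URL):
    """Build one nova-manage command per cell still naming an old node.

    cells: iterable of (cell_uuid, transport_url) rows from cell_mappings.
    """
    commands = []
    for cell_uuid, transport_url in cells:
        if url_hosts(transport_url) & set(old_nodes):
            commands.append(shlex.join([
                "nova-manage", "cell_v2", "update_cell",
                "--cell_uuid", cell_uuid,
                "--transport-url", new_url,
            ]))
    return commands
```

Rows whose URL names no host at all (for example a `none:///` entry) are left alone, since nothing in them matches an old node. Review the printed commands before running them.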