
Openstack Nova on Rabbitmq migration

· 4 min read
Sang, Nguyen Nhat
Infrastructure Engineer at VNG

We recently worked on migrating our Nova service to a new Rabbitmq cluster. It looked straightforward at first, but we ran into an issue.

Intro

If you have worked with Openstack, you probably know that its default message queue is Rabbitmq. Besides Nova, many other Openstack services use that queue, such as Cinder, Octavia and Trove.

Topology

The role of Rabbitmq in Nova can be simplified like this:

Assume you have a running Rabbitmq cluster with 3 nodes: 10.237.5.11, 10.237.5.12 and 10.237.5.13.

For simplicity, we will expand the Rabbitmq cluster with a new node, 10.237.5.14, and point our endpoint at that new node only.

Change

First attempt

The obvious first step is to change the Nova endpoint by setting a new value for transport_url:

/etc/nova/nova.conf
[DEFAULT]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
...

[oslo_messaging_notifications]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
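
Before restarting anything, it can help to confirm that every transport_url in the file really points at the new node. The following is a minimal Python sketch, not part of Nova: the helper names `broker_hosts` and `hosts_by_section` are hypothetical, the sample text mirrors the config above, and it assumes IPv4 addresses or hostnames (no IPv6 literals).

```python
import configparser
from urllib.parse import urlsplit

# Sample config mirroring the snippet above; in practice you would read
# /etc/nova/nova.conf instead.
SAMPLE_CONF = """
[DEFAULT]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672

[oslo_messaging_notifications]
transport_url = rabbit://rabbitmq_user:rabbit_password@10.237.5.14:5672
"""


def broker_hosts(transport_url):
    """Return the host of every broker listed in a transport URL.

    A transport URL may list several brokers separated by commas.
    """
    netloc = urlsplit(transport_url).netloc
    hosts = []
    for entry in netloc.split(","):
        hostport = entry.rsplit("@", 1)[-1]        # drop "user:password@"
        hosts.append(hostport.rsplit(":", 1)[0])   # drop ":port"
    return hosts


def hosts_by_section(conf_text):
    """Map each section that sets transport_url to its broker hosts."""
    parser = configparser.ConfigParser(
        interpolation=None,             # passwords may contain "%"
        strict=False,                   # tolerate repeated keys
        default_section="__unused__",   # treat [DEFAULT] as a normal section
    )
    parser.read_string(conf_text)
    return {
        section: broker_hosts(parser[section]["transport_url"])
        for section in parser.sections()
        if "transport_url" in parser[section]
    }


if __name__ == "__main__":
    for section, hosts in hosts_by_section(SAMPLE_CONF).items():
        print(section, hosts)
```

Any section that still lists 10.237.5.11, 10.237.5.12 or 10.237.5.13 has not been updated yet.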

Apply these changes and restart all the Nova services. Surprisingly, that is not enough! It works for some processes inside nova_api but not all of them. Checking the network connections on one of the Rabbitmq nodes shows that some processes are still not using the new endpoint:

$ ss -npt dst :5672 | grep "10.237.5.11:5672"
ESTAB 0 0 10.237.5.11:36378 10.237.5.11:5672 users:(("nova-scheduler",pid=23458,fd=14))
ESTAB 0 0 10.237.5.11:45110 10.237.5.11:5672 users:(("nova-scheduler",pid=23412,fd=14))
ESTAB 0 0 10.237.5.11:57906 10.237.5.11:5672 users:(("nova-scheduler",pid=23475,fd=14))
ESTAB 0 0 10.237.5.11:45406 10.237.5.11:5672 users:(("nova-scheduler",pid=23430,fd=14))
ESTAB 0 0 10.237.5.11:59714 10.237.5.11:5672 users:(("nova-scheduler",pid=23404,fd=14))
ESTAB 0 0 10.237.5.11:44712 10.237.5.11:5672 users:(("nova-api",pid=23466,fd=15))
ESTAB 0 0 10.237.5.11:36434 10.237.5.11:5672 users:(("nova-api",pid=23419,fd=15))
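
To summarize output like the above per service, here is a minimal Python sketch that groups the established connections to old nodes by process name. It assumes the `ss -npt` line layout shown above; `stale_connections`, `OLD_NODES` and the regular expression are hypothetical helpers written for this post, and the sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# The three nodes we want to retire.
OLD_NODES = {"10.237.5.11", "10.237.5.12", "10.237.5.13"}

# A few lines copied from the ss output above.
SS_OUTPUT = """\
ESTAB 0 0 10.237.5.11:36378 10.237.5.11:5672 users:(("nova-scheduler",pid=23458,fd=14))
ESTAB 0 0 10.237.5.11:45110 10.237.5.11:5672 users:(("nova-scheduler",pid=23412,fd=14))
ESTAB 0 0 10.237.5.11:44712 10.237.5.11:5672 users:(("nova-api",pid=23466,fd=15))
"""

# Columns: state, recv-q, send-q, local address, peer address, process info.
LINE_RE = re.compile(
    r'^ESTAB\s+\d+\s+\d+\s+\S+\s+(?P<peer>[\d.]+):(?P<port>\d+)\s+'
    r'users:\(\("(?P<proc>[^"]+)",pid=(?P<pid>\d+)'
)


def stale_connections(ss_output, old_nodes=OLD_NODES):
    """Return {process name: sorted PIDs} still connected to an old node."""
    stale = defaultdict(list)
    for line in ss_output.splitlines():
        match = LINE_RE.match(line.strip())
        # Header lines and other states do not match and are skipped.
        if match and match["peer"] in old_nodes:
            stale[match["proc"]].append(int(match["pid"]))
    return {proc: sorted(pids) for proc, pids in stale.items()}


if __name__ == "__main__":
    for proc, pids in stale_connections(SS_OUTPUT).items():
        print(f"{proc}: {len(pids)} connection(s), pids {pids}")
```

An empty result is a reasonable signal that it is safe to move on; anything else means some service is still reading an old endpoint from somewhere.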

And if we mistakenly assume that Nova's message queue config has been migrated and shut down Rabbitmq on the old nodes (10.237.5.11-13), we may get this error:

2024-03-26 14:00:42.315 26 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

2024-03-26 14:00:42.321 26 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): error: [Errno 111] ECONNREFUSED

danger

Be careful: when you stop the old Rabbitmq nodes, the running nova_api and nova_scheduler will not work properly, even though the new config has been applied.

Final fix

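After digging for a while, we found that nova_api takes its transport_url not only from the default configuration file but also from the database: each cell's transport_url is stored in the cell_mappings table of the nova_api database.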

SELECT transport_url FROM `nova_api`.`cell_mappings`;

Not surprisingly, it was still pointing to the old values. After updating it to the new endpoint, everything works.
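
Building the replacement value by hand is error-prone, so here is a minimal Python sketch that derives the new transport_url from an existing one while keeping the credentials and virtual host. The function name `repoint` is hypothetical, and the sketch assumes that entries not using the rabbit scheme (for example a placeholder URL on cell0) should be left untouched.

```python
from urllib.parse import urlsplit, urlunsplit

NEW_HOST = "10.237.5.14"


def repoint(transport_url, new_host=NEW_HOST, new_port=5672):
    """Return transport_url rewritten to use only new_host:new_port.

    Credentials are taken from the first broker in the URL; the path
    (virtual host) and query string are preserved.
    """
    parts = urlsplit(transport_url)
    if parts.scheme != "rabbit":
        # Assumption: non-rabbit placeholders must not be modified.
        return transport_url
    first_broker = parts.netloc.split(",")[0]
    creds, sep, _old_hostport = first_broker.rpartition("@")
    netloc = f"{creds}{sep}{new_host}:{new_port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    old = (
        "rabbit://rabbitmq_user:rabbit_password@10.237.5.11:5672,"
        "rabbitmq_user:rabbit_password@10.237.5.12:5672,"
        "rabbitmq_user:rabbit_password@10.237.5.13:5672/"
    )
    print(repoint(old))
```

The resulting value can then be written back to `cell_mappings`. Instead of a raw SQL update, `nova-manage cell_v2 update_cell` with its `--cell_uuid` and `--transport-url` options may be the safer route; check the documentation for your Nova release before relying on it.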