2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)

This document discusses various latency issues encountered in our production platform at Société Générale, which makes heavy use of OpenStack. The platform is experiencing rapid growth, putting considerable load on the distributed services that OpenStack relies upon.

After investigation, a timeout issue was identified during the virtual machine creation process. The occurrence of this timeout suggests communication difficulties between the Nova and Neutron services.

Under the Hood

During the virtual machine creation process, once a compute host has been selected, Nova begins the build process on that host. This involves several steps, including initiating the QEMU process and creating a TAP interface on the host. At this point, Nova places the virtual machine in a paused state, waiting for an event (or signal) from Neutron indicating that the network for that particular virtual machine is ready. Receiving this event from Neutron can be time-consuming, due to the following sequence of events:

  1. The Neutron agent on the compute host, which is responsible for wiring the network and informing Nova, sends a message via RPC to the Neutron server.

  2. The Neutron server, in turn, relays this to the Nova API through a REST call.

  3. Finally, the Nova API informs the relevant compute host of the event via another RPC message.

That whole process uses the external events callback API introduced in the Icehouse release, relying on the network-vif-plugged and network-vif-failed events.
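
The length of time Nova will keep the virtual machine paused while waiting for this event is bounded by its VIF plugging options. For reference, the snippet below shows the relevant nova.conf settings; the values shown are the upstream defaults, not a tuning recommendation from our deployment.

[DEFAULT]
vif_plugging_timeout = 300
vif_plugging_is_fatal = True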

Given the difficulty Neutron has informing Nova in time, we decided to focus on reducing the amount of activity placed on it.

Reducing Nova requests to Neutron

For each virtual machine, Nova periodically asks Neutron to refresh its networking information cache. This cache exists to reduce the number of API requests that Nova makes to Neutron, and the periodic task keeps it accurate. However, the default interval for this refresh task is 60 seconds, and in a heavily loaded environment with thousands of virtual machines this results in a substantial number of cache refresh requests reaching Neutron.

[DEFAULT]
heal_instance_info_cache_interval = 600

The networking driver in use is quite stable, and according to discussions within the community it was safe to raise the refresh interval to 600 seconds. It would even have been possible to disable this feature entirely.
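
If we had chosen to disable it, the upstream documentation indicates that setting the interval to zero or a negative value turns the periodic sync off entirely. The snippet below is only a sketch of that option; we did not deploy it ourselves.

[DEFAULT]
heal_instance_info_cache_interval = 0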

Note

When you restart the Nova Compute service, the healing process is reset and starts over. This can be particularly problematic in environments where the Nova Compute service is restarted frequently.

Increasing RPC workers for Neutron

We have also decided to significantly increase the value of rpc_workers. Given that RPC handling is largely I/O-bound, we considered that allowing twice as many workers as there are cores on our hosts would be both conservative and safe.

[DEFAULT]
rpc_workers = 20
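
As a sanity check, the sizing rule we applied can be written down in a few lines of Python. This is only a sketch of the reasoning (twice the core count); Neutron does not compute this itself, and the resulting value should still be reviewed against the host's actual load.

import os

# Sizing rule described above: RPC handling is largely I/O-bound, so we
# allow twice as many workers as there are cores on the host.
cores = os.cpu_count() or 1
suggested_rpc_workers = 2 * cores

print(f"rpc_workers = {suggested_rpc_workers}")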

Increasing Neutron max_pool_size

We have made a deliberate change to raise the max_pool_size value from 1 to 60 in the Neutron database settings. This adjustment is consistent with the increased number of workers, since we can expect those workers to make use of the database.

[database]
max_pool_size = 60
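
Because every Neutron worker is a separate process with its own connection pool, it is worth estimating the worst-case number of database connections before raising this value. The short Python sketch below is a back-of-the-envelope calculation; the api_workers figure is an assumption for illustration and max_overflow is ignored, so adjust it to your own deployment before drawing conclusions.

# Rough worst-case connection budget for the Neutron server processes.
# Each worker keeps its own pool of up to max_pool_size connections.
rpc_workers = 20      # value from the [DEFAULT] section above
api_workers = 10      # assumption, for illustration only
max_pool_size = 60    # value from the [database] section above

worst_case = (rpc_workers + api_workers) * max_pool_size
print(f"up to {worst_case} connections against the database")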

Deferring flow deletion

We have observed that the Neutron agents in our OpenStack environment experience delays when deleting network flows as part of the virtual machine termination process. This operation is blocking, so the agent remains unresponsive to any other task until the flow deletion completes.

We decided to deploy the change from OpenDev review #843253, which aims to mitigate this issue by offloading the flow deletion task to a separate thread, freeing the main thread to continue with other operations.

  # will not match with the ip flow's cookie so OVS won't actually
  # delete the flow
  flow['cookie'] = ovs_lib.COOKIE_ANY
- self._delete_flows(**flow)
+ self._delete_flows(deferred=False, **flow)
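
To illustrate the general pattern that change relies on, here is a minimal, self-contained Python sketch of offloading a blocking deletion to a background thread so the main loop stays responsive. It does not reproduce the actual Neutron code; the function and queue names, and the sleep standing in for the slow OVS call, are ours.

import queue
import threading
import time

# Blocking work is pushed onto a queue and handled by a background thread,
# keeping the caller (the agent's main loop in Neutron's case) responsive.
_deletions: queue.Queue = queue.Queue()

def _worker() -> None:
    while True:
        flow = _deletions.get()
        time.sleep(1)  # stand-in for the slow, blocking flow deletion
        print(f"deleted flow {flow}")
        _deletions.task_done()

threading.Thread(target=_worker, daemon=True).start()

def delete_flow(flow, deferred=True):
    """Queue the deletion, or perform it inline when deferred=False."""
    if deferred:
        _deletions.put(flow)  # returns immediately
    else:
        time.sleep(1)         # blocking path, kept for callers that need it
        print(f"deleted flow {flow}")

delete_flow({"cookie": 0x1}, deferred=True)  # caller keeps running
_deletions.join()                            # wait only for demonstration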

Improvements after deployment

Finally, after deploying these changes, we have noticed a considerable improvement in the stability and success rate of virtual machine creation. Creation latency is now stable, and virtual machines transition to the active state within a reasonable amount of time.