6.17.5. OpenStack Neutron Density Testing report

Abstract: This document presents OpenStack Networking (Neutron) density test results against a 200-node OpenStack environment. All tests have been performed according to the OpenStack Neutron Density Testing test plan.

6.17.5.1. Environment description

6.17.5.1.1. Lab A (200 nodes)

3 controllers, 196 computes, 1 node for Grafana/Prometheus

6.17.5.1.1.1. Hardware configuration of each server

Description of controller servers

  server    vendor,model     Supermicro MBD-X10DRI
  CPU       vendor,model     Intel Xeon E5-2650v3
            processor_count  2
            core_count       10
            frequency_MHz    2300
  RAM       vendor,model     8x Samsung M393A2G40DB0-CPB
            amount_MB        2097152
  NETWORK   vendor,model     Intel,I350 Dual Port
            bandwidth        1G
            vendor,model     Intel,82599ES Dual Port
            bandwidth        10G
  STORAGE   vendor,model     Intel SSD DC S3500 Series
            SSD/HDD          SSD
            size             240GB
            vendor,model     2x WD WD5003AZEX
            SSD/HDD          HDD
            size             500GB

Description of compute servers

  server    vendor,model     SUPERMICRO 5037MR-H8TRF
  CPU       vendor,model     INTEL XEON Ivy Bridge 6C E5-2620
            processor_count  1
            core_count       6
            frequency_MHz    2100
  RAM       vendor,model     4x Samsung DDRIII 8GB DDR3-1866
            amount_MB        32768
  NETWORK   vendor,model     AOC-STGN-i2S - 2-port
            bandwidth        10G
  STORAGE   vendor,model     Intel SSD DC S3500 Series
            SSD/HDD          SSD
            size             80GB
            vendor,model     1x WD Scorpio Black BP WD7500BPKT
            SSD/HDD          HDD
            size             750GB

6.17.5.2. Test results

6.17.5.2.1. Test Case: VM density check

The idea was to boot as many VMs as possible (in batches of 200-1000 VMs) and make sure they are properly wired and have access to the external network. The test measures the maximum number of VMs that can be deployed without compromising cloud operability.

External access was verified via an external server to which VMs connect upon spawning. The server logs incoming connections from provisioned VMs, which send their instance IPs to it via POST requests. Instances also report the number of attempts it took to obtain an IP address from the metadata server and to connect to the HTTP server, respectively.
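For illustration, a minimal sketch of such an instance-side reporting script is shown below; the report URL, payload fields, and retry interval are assumptions, not the exact script used in the test:

  # Hypothetical instance-side check: retry the metadata service until an IP
  # address is obtained, then POST the result to the external logging server.
  import json
  import time
  import urllib.request

  METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"
  REPORT_URL = "http://203.0.113.10:8000/report"  # example external server address

  attempts = 0
  ip = None
  while ip is None:
      attempts += 1
      try:
          ip = urllib.request.urlopen(METADATA_URL, timeout=5).read().decode()
      except OSError:
          time.sleep(5)  # metadata service not reachable yet, retry

  payload = json.dumps({"ip": ip, "metadata_attempts": attempts}).encode()
  req = urllib.request.Request(REPORT_URL, data=payload,
                               headers={"Content-Type": "application/json"})
  urllib.request.urlopen(req, timeout=10)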

A Heat template was used to create 1 network with a subnet, 1 DVR router, and a VM per compute node. Heat stacks were created in batches of 1 to 5 (5 most of the time), so 1 iteration effectively adds 5 new networks/routers and 196 * 5 = 980 VMs. During test execution we constantly monitored the lab’s status using the Grafana dashboard and checked the agents’ status.
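A rough sketch of how such batched stack creation could be driven with openstacksdk is shown below; the cloud name, stack names, template file, and exact call arguments are assumptions rather than the actual tooling used in the test:

  # Hypothetical driver for batched Heat stack creation. Assumes a cloud
  # entry named "density-lab" in clouds.yaml and a local template that
  # defines one network/subnet, one DVR router, and one VM per compute node.
  import openstack

  BATCH_SIZE = 5       # stacks per iteration, as in the test description
  TOTAL_STACKS = 125   # number of stacks reached in the test

  conn = openstack.connect(cloud="density-lab")
  template = open("density_stack.yaml").read()

  for start in range(0, TOTAL_STACKS, BATCH_SIZE):
      batch = [
          conn.orchestration.create_stack(name="density-%03d" % i, template=template)
          for i in range(start, min(start + BATCH_SIZE, TOTAL_STACKS))
      ]
      # Wait for the whole batch before starting the next iteration.
      for stack in batch:
          conn.orchestration.wait_for_status(
              stack, status="CREATE_COMPLETE",
              failures=["CREATE_FAILED"], interval=10, wait=3600)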

As a result, we were able to successfully create 125 Heat stacks (196 VMs each), which gives a total of 24500 VMs.

Iteration 1:

[Figure: iteration1.png]

Iteration i:

[Figure: iterationi.png]

Example of Grafana dashboard during density test:

[Figure: grafana.png]

6.17.5.2.2. Observed issues

Issues faced during testing:

During testing it was also necessary to tune the nodes’ configuration to cope with the growing number of VMs per node, for example (an illustrative configuration sketch follows this list):

  • Increase the ARP table size on compute nodes and controllers
  • Raise cpu_allocation_ratio from 8.0 to 12.0 in nova.conf to prevent hitting the Nova vCPU limit on computes
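The settings involved might look roughly like the following; the sysctl threshold values below are illustrative assumptions, while the allocation ratio is the one used in the test:

  # /etc/sysctl.conf on compute nodes and controllers: enlarge the kernel
  # neighbor (ARP) table; the exact thresholds depend on the expected VM count.
  net.ipv4.neigh.default.gc_thresh1 = 4096
  net.ipv4.neigh.default.gc_thresh2 = 8192
  net.ipv4.neigh.default.gc_thresh3 = 16384

  # /etc/nova/nova.conf on compute nodes:
  [DEFAULT]
  cpu_allocation_ratio = 12.0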

At ~16000 VMs we reached the ARP table size limit on compute nodes, so Heat stack creation started to fail. Having increased the maximum table size, we decided to clean up the failed stacks; in attempting to do so we ran into the following Nova issue (LP #1606825, nova-compute hangs while executing a blocking call to librbd): on VM deletion nova-compute may hang for a while executing a call to librbd and eventually appear as down in the Nova service-list output. This issue was fixed with the help of the Mirantis Nova team and the fix was applied on the lab as a patch.

After launching ~20000 VMs the cluster started experiencing problems with RabbitMQ and Ceph. When the number of VMs reached 24500, control plane services and agents started to go down massively: the initial failure might have been caused by the lack of allowed PIDs per OSD node (https://bugs.launchpad.net/fuel/+bug/1536271). The Ceph failure affected all services, e.g. MySQL errors in the Neutron server that led to agents going down and massive resource rescheduling/resync. After the Ceph failure the control plane cluster could not be recovered, so the density test had to be stopped before the capacity of the compute nodes was exhausted.

The Ceph team commented that 3 Ceph monitors are not enough for over 20000 VMs (each having 2 drives) and recommended having at least 1 monitor per ~1000 client connections or moving the monitors to dedicated nodes.

Note

The connectivity check of the Integrity test passed 100% even when the control plane cluster was failing. That is a good illustration of control plane failures not affecting the data plane.

Note

Final result: 24500 VMs on the cluster.