6.17.1. Neutron L3 HA test results (Liberty)

This report is generated for the OpenStack Neutron L3 HA test plan.

6.17.1.1. Environment description

6.17.1.1.1. Cluster description

  • 3 controllers
  • 46 compute nodes

6.17.1.1.2. Software versions

MOS 8.0

6.17.1.1.3. Hardware configuration of each server

Description of the servers' hardware:

Server (compute):
  1x SUPERMICRO SUPERSERVER 5037MR-H8TRF MICRO-CLOUD
  http://www.supermicro.com/products/system/3u/5037/sys-5037mr-h8trf.cfm
CPU:
  1x INTEL XEON Ivy Bridge 6C E5-2620 V2 2.1G 15M 7.2GT/s QPI 80W SOCKET 2011R 1600
  http://ark.intel.com/products/75789/Intel-Xeon-Processor-E5-2620-v2-15M-Cache-2_10-GHz
RAM:
  4x Samsung DDR3 8GB DDR3-1866 1Rx4 ECC REG RoHS M393B1G70QH0-CMA
NIC:
  1x AOC-STGN-i2S, 2-port 10 Gigabit Ethernet SFP+

6.17.1.2. Rally test results

L3 HA has a restriction of 255 routers per HA network per tenant. At the moment there is no way to create a new HA network for a tenant once the number of virtual IPs exceeds this limit. Because of this, the number of tenants was increased for some tests (NeutronNetworks.create_and_list_routers).
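To make the limit concrete, here is a minimal sketch using the Liberty-era neutron CLI (the router name is hypothetical; creating HA routers requires admin credentials):

    # Create an HA router; each HA router consumes one VRRP VRID out of the
    # 1-255 range available on the tenant's HA network.
    neutron router-create --ha True demo-ha-router

    # The implicitly created per-tenant HA network is visible to the admin;
    # its name follows the pattern "HA network tenant <tenant_id>".
    neutron net-list | grep "HA network tenant"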

The most important results come from the test_create_delete_routers test, as it makes it possible to catch race conditions during the creation and deletion of HA routers, HA networks and HA interfaces. Several known bugs related to this have already been fixed upstream. To uncover more potential issues, test_create_delete_routers was run multiple times with different concurrency.
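For reference, a minimal sketch of a Rally task file for one of these runs (the file name and the unlimited quotas are assumptions; the times and concurrency values mirror one row of the table below):

    {
        "NeutronNetworks.create_and_delete_routers": [
            {
                "args": {
                    "network_create_args": {},
                    "subnet_create_args": {},
                    "subnet_cidr_start": "1.1.0.0/30",
                    "subnets_per_network": 1,
                    "router_create_args": {}
                },
                "runner": {"type": "constant", "times": 200, "concurrency": 60},
                "context": {
                    "users": {"tenants": 1, "users_per_tenant": 1},
                    "quotas": {"neutron": {"network": -1, "subnet": -1,
                                           "router": -1, "port": -1}}
                }
            }
        ]
    }

Such a file is executed with "rally task start create_and_delete_routers.json", and an HTML report like the ones linked below can then be produced with "rally task report --out report.html".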

Results of test_create_delete_routers:

Times   Concurrency   Number of errors   Link for rally report
92      20            0                  rally report
92      40            0                  rally report
150     50            1                  rally report
150     50            0                  rally report
200     60            1                  rally report
200     60            1                  rally report
200     70            2                  rally report
200     70            0                  rally report
200     75            1                  rally report
200     75            1                  rally report
300     100           1                  rally report
300     100           0                  rally report
400     100           1                  rally report
400     100           0                  rally report

Multiple scenarios (each "rally report" link covers the group of three scenarios run together):

Test                        Number of tenants   Times   Concurrency   Number of errors   Link for rally report
create_and_delete_routers   1                   92      10            0                  rally report
create_and_list_routers     2                   368     10            272
create_and_update_routers   1                   92      10            0
create_and_delete_routers   1                   92      10            0                  rally report
create_and_list_routers     2                   100     10            6
create_and_update_routers   1                   92      10            0
create_and_delete_routers   1                   92      10            0                  rally report
create_and_list_routers     10                  368     10            0
create_and_update_routers   1                   92      10            0
create_and_delete_routers   1                   300     50            1                  rally report
create_and_list_routers     10                  368     50            0
create_and_update_routers   1                   300     50            0
create_and_delete_routers   1                   300     50            1                  rally report
create_and_list_routers     10                  368     50            0
create_and_update_routers   1                   300     50            0

The errors discovered have been classified as the following bugs:

Bugs

Short description                                            Trace      Upstream bug   Status
IpAddressGenerationFailure: No more IP addresses
available on network                                         trace      bug/1562887    Open (affects Neutron without L3 HA enabled; probably a Rally bug)
Device "tap-<id>" does not exist                             trace      bug/1562887    Open
Session rollback                                             trace      bug/1550886    In progress
SubnetInUse: Unable to complete operation on subnet          trace      bug/1562878    Open
MessagingTimeout: Timed out waiting for a reply to message   trace      bug/1555670    Open
DBDeadlock: ipallocationpools                                trace      bug/1562876    Open
Not all HA networks deleted                                  no trace   bug/1562892    Open

6.17.1.2.1. Summary:

  1. The number of failed iterations is less than 1% (the exception is create_and_list_routers, but the problem disappeared once the number of tenants was increased; automatic creation of a new HA network after the previous one runs out of virtual IPs is more of a feature request than a bug).
  2. All bugs found are of Medium or Low priority.

6.17.1.3. Shaker test results

Test                              L3 HA                     L3 HA during              Router rescheduling (Non L3 HA)
                                                            L3 agents restart         during L3 agent restart
                                  Lost   Errors   Report    Lost   Errors   Report    Lost   Errors    Report
OpenStack L3 East-West            0      0        report    0      0        report    50     5         report
OpenStack L3 East-West
Performance                       1      0        report    0      0        report    0      1 (all)   report
OpenStack L3 North-South          0      0        report    8      0        report    95     3         report
OpenStack L3 North-South UDP      10     1        report    14     0        report
OpenStack L3 North-South
Performance (concurrency 2)       0      0        report    0      0        report
OpenStack L3 North-South
Performance (concurrency 5)       0      0        report    1      0        report
OpenStack L3 North-South Dense    0      0        report    41     0        report    81     1         report

Shaker reports the minimum, mean and maximum values of different connection measurements. For each test, the minimum of all minimum values and the maximum of all maximum values were taken, and the mean was computed over all mean values. These aggregated values are presented in the table below.
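A minimal sketch of this aggregation, assuming the per-sample min/mean/max triples have been exported to a CSV file (ping_icmp.csv is a hypothetical name):

    # Columns: min,mean,max - one row per sample.
    awk -F, '
        NR == 1 { mn = $1; mx = $3 }          # seed with the first row
        $1 < mn { mn = $1 }                   # minimum of all minimums
        $3 > mx { mx = $3 }                   # maximum of all maximums
                { sum += $2; n++ }            # accumulate the means
        END     { printf "min=%.2f mean=%.2f max=%.2f\n", mn, sum / n, mx }
    ' ping_icmp.csv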

Measurement                L3 HA                           L3 HA during                    Router rescheduling (Non L3 HA)
                                                           l3 agents restart               during l3 agent restart
                           min      mean      max          min      mean      max          min      mean      max

OpenStack L3 East-West
ping_icmp, ms              0.05     2.45      12.39        0.07     7.39      18.03        0.41     32.84     2583.93
tcp_download, Mbits/s      0.02     874.04    5820.88      0.11     957.66    5883.96      77.41    896.96    3703.83
tcp_upload, Mbits/s        0.02     884.25    5649.94      0.13     897.11    5963.02      64.11    1268.74   5111.02

OpenStack L3 East-West Performance
ping_icmp, ms              0.64     0.81      1.45         0.57     0.82      1.79         No statistic
Bandwidth, Mbit/s          839.84   1876.83   3880.01      630.0    1497.19   3020.0
Packets, pps               101680.0 129664.2  136880.0     89660.0  129515.33 367930.0
retransmits                0.0      0.67      25.0         0.0      2.5       72.0

OpenStack L3 North-South
ping_icmp, ms              0.08     9.83      27.61        0.06     7.11      25.73        0.33     0.62      2.45
tcp_download, Mbits/s      65.28    902.35    4454.43      72.7     769.61    4494.97      741.95   1647.07   2776.53
tcp_upload, Mbits/s        0.13     815.02    4345.86      0.13     867.68    4289.98      No statistic

OpenStack L3 North-South UDP
Packets, pps               31218.0  123452.06 476254.0     39196.0  122214.76 431108.0

OpenStack L3 North-South Performance (concurrency 2)
ping_icmp, ms              0.9      1.22      2.36         0.67     0.93      2.34
Bandwidth, Mbit/s          439.91   449.94    525.5        0.0      2000.8    3400.5
Packets, pps               126360.0 129349.33 135150.0     131700.0 135319.33 140550.0
retransmits                0.0      1.0       83.0         0.0      3.0       205.0

OpenStack L3 North-South Performance (concurrency 5)
ping_icmp, ms              0.74     0.97      1.72         0.2      1.02      3.01
Bandwidth, Mbit/s          41.99    181.01    386.43       0.0      1720.71   3519.77
Packets, pps               122140.0 131601.17 138220.0     103510.0 129021.6  138860.0
retransmits                0.0      1.0       49.0         0.0      3.17      231.0

OpenStack L3 North-South Dense
ping_icmp, ms              0.56     18.18     96.42        0.38     4.07      56.35        0.45     9.79      106.52
tcp_download, Mbits/s      1.72     210.2     862.02       322.24   1634.48   4656.44      11.61    407.69    2235.84
tcp_upload, Mbits/s        18.88    209.49    781.86       49.96    1590.83   4667.82      18.77    1955.41   4333.32

These results show that there is no significant difference between the results obtained during multiple L3 agent restarts and those from normal test execution.

The average difference between the values without and with restart is presented in the next table:

        ping_icmp,   tcp_download,   tcp_upload,   Bandwidth,   Packets,   retransmits
        ms           Mbits/s         Mbits/s       Mbit/s       pps
min     0.17         -103.34         -10.39        230.58       4333       0
mean    2.02         -458.39         -482.39       -903.64      -501.07    -2
max     5.78         -1299.35        -1381.05      -1717.11     -47986     -117

6.17.1.3.1. Summary:

  1. The comparison between L3 HA and standard router rescheduling shows that L3 HA allows testing to proceed uninterrupted, without a huge loss of statistics, during L3 agent restarts.
  2. Comparing L3 HA results with and without restart shows that bandwidth and throughput do not decrease during agent restarts.

6.17.1.4. Manual tests execution

During manual testing, the following scenarios were tested:

  • Ping to an external network from a VM during a reset of the primary
    (non-primary) controller
  • Ping from one VM to another VM in a different network during an L3
    agent ban
  • Iperf UDP testing between VMs in different networks during an L3
    agent ban

All tests were performed with a large number of routers.
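A sketch of how the agent-failure scenarios were driven, assuming a Pacemaker-managed deployment as in MOS (the resource name p_neutron-l3-agent and the node name node-1 are assumptions that depend on the environment):

    # Terminal 1: continuous ping through the HA router (see the tables below).
    ping 8.8.8.8

    # Terminal 2: ban the L3 agent on one controller to force a failover...
    pcs resource ban p_neutron-l3-agent node-1

    # ...and later return it to service.
    pcs resource clear p_neutron-l3-agent node-1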

6.17.1.4.1. Ping to external network from VM during reset of primary (non-primary) controller

[Figure: ping_external1.png]
Command: ping 8.8.8.8

Iteration   Number of routers   Number of lost packets
1           1                   3
2           25                  3
3           50                  3
4           100                 3
5           150                 3
6           170                 3
7           175                 89
8           175                 116
9           175                 52
10          200                 51
11          200                 3

The current results look unstable and do not depend directly on the number of routers. The huge packet loss in iterations 7-10 happened because the agent on the recovered controller became "active" (master) while another L3 agent was already active. After some time it became the only "active" L3 agent for the router.

This issue needs special attention and will be investigated as bug/1563298.
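The duplicate-master condition can be observed from the neutron CLI, which for HA routers reports the ha_state of every hosting agent (the router ID below is a placeholder):

    # Exactly one hosting agent should report ha_state "active" and the rest
    # "standby"; during the failure window two agents reported "active".
    neutron l3-agent-list-hosting-router <router-id>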

6.17.1.4.2. Ping from one VM to another VM in a different network during L3 agent ban

[Figure: ping1.png]
Command: ping 10.0.1.6

Iteration   Number of routers   Number of lost packets
1           100                 4
2           100                 4
3           100                 3
4           200                 3
5           200                 3
6           200                 103
7           200                 26
8           200                 3
9           250                 3
10          250                 4

The packet loss in iterations 6-7 happened for a reason similar to the previous manual scenario: the L3 agent status flapped during the loss.

With 250 routers, L3 agents started to fall into an unmanaged state.
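Assuming the agents are managed by Pacemaker, as in MOS, the failed or unmanaged agents can be spotted with the following (a sketch, not a definitive procedure):

    # Cluster view: failed or unmanaged neutron agent resources show up here.
    pcs status

    # Neutron view: the "alive" column of flapping agents toggles to xxx.
    neutron agent-list | grep "L3 agent"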

6.17.1.4.3. Iperf UDP testing between VMs in different networks during L3 agent ban

[Figure: iperf_addresses1.png]
Command: iperf -c 10.0.3.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u

Number of routers   Loss (%)
10                  0.14
10                  4.9
10                  1.3
10                  5.3
24                  1.3
24                  8.9
24                  6.1
24                  2.4
50                  1.7
50                  10
50                  40
50                  18
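For completeness, the receiving VM must run a matching UDP server; a sketch mirroring the client command above (the port is taken from it):

    # On the destination VM (10.0.3.4): listen for UDP traffic and report
    # per-interval loss every 10 seconds.
    iperf -s -u -p 5001 -i 10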

6.17.1.4.4. Summary:

  1. A bug (bug/1563298) was filed for the unstable behaviour of L3 HA.
  2. With fewer than 170 routers, the network can be considered stable against failures.
  3. With more than 240 routers, agent recovery leads to agents falling into an unmanaged state.