5.27.2. OpenStack Neutron L3 HA Test Plan

status: ready

version: 1.0

Abstract:

Many L3 agents can be spawned, but each L3 agent is a single point of failure (SPOF). If an L3 agent fails, all virtual routers scheduled to this agent are lost, and consequently all VMs connected to these virtual routers are isolated from external networks and possibly from other tenant networks.

The main purpose of L3 HA is to address this issue by adding a new type of router (HA router), which is spawned twice on two different agents. One L3 agent is in charge of the master instance of the router, and another L3 agent is in charge of the slave instance.

L3 HA functionality in Neutron was implemented in Juno, but it has not been tested in detail at scale. The purpose of this document is to describe the scenarios for such testing.

../../../_images/L3HA.png
Conventions:
  • VRRP - Virtual Router Redundancy Protocol
  • Keepalived - Routing software based on VRRP protocol
  • Rally - Benchmarking tool for OpenStack
  • Shaker - Data plane performance testing tool
  • iperf - Commonly-used network testing tool

5.27.2.1. Test Plan

The purpose of this section is to describe scenarios for testing L3 HA. The most important aspect is the number of packets lost during a restart of the L3 agent or of the controller as a whole. The second aspect is the number of routers that can move from one agent to another without any of them falling into an unmanaged state.

5.27.2.1.1. Test Environment

5.27.2.1.1.1. Preparation

This test plan is performed against an existing OpenStack cloud.

5.27.2.1.1.2. Environment description

The environment description includes the hardware specification of servers, network parameters, operating system and OpenStack deployment characteristics.

5.27.2.1.1.2.1. Hardware

This section contains a list of all types of hardware nodes.

Parameter  Value  Comments
model             e.g. Supermicro X9SRD-F
CPU               e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
role              e.g. compute or network

5.27.2.1.1.2.2. Network

This section contains a list of interfaces and network parameters. For complicated cases this section may include a topology diagram and switch parameters.

Parameter         Value  Comments
network role             e.g. provider or public
card model               e.g. Intel
driver                   e.g. ixgbe
speed                    e.g. 10G or 1G
MTU                      e.g. 9000
offloading modes         e.g. default

5.27.2.1.1.2.3. Software

This section describes installed software.

Parameter        Value  Comments
OS                      e.g. Ubuntu 14.04.3
OpenStack               e.g. Liberty
Hypervisor              e.g. KVM
Neutron plugin          e.g. ML2 + OVS
L2 segmentation         e.g. VLAN or VXLAN or GRE
virtual routers         HA

5.27.2.1.2. Test Case 1: Comparative analysis of metrics with and without L3 agents restart

5.27.2.1.2.1. Description

Shaker is able to deploy OpenStack instances and networks in different topologies. For L3 HA, the most important scenarios are those that check connectivity between VMs in different networks (L3 east-west) and connectivity via floating IP (L3 north-south).

The following tests should be executed:

  1. OpenStack L3 East-West
    • This scenario launches pairs of VMs in different networks connected to one router (L3 east-west).
  2. OpenStack L3 East-West Performance
    • This scenario launches 1 pair of VMs in different networks connected to one router (L3 east-west). VMs are hosted on different compute nodes.
  3. OpenStack L3 North-South
    • This scenario launches pairs of VMs on different compute nodes. The VMs are in different networks connected via different routers; the master accesses the slave by floating IP.
  4. OpenStack L3 North-South UDP
  5. OpenStack L3 North-South Performance
  6. OpenStack L3 North-South Dense
    • This scenario launches pairs of VMs on one compute node. The VMs are in different networks connected via different routers; the master accesses the slave by floating IP.

For scenarios 1, 2, 3 and 6, results were also collected for an L3 agent restart with the L3 HA option disabled and standard router rescheduling enabled.

While the Shaker tests were running, the scripts restart.sh and restart_not_ha.sh were executed.

5.27.2.1.2.2. List of performance metrics

Priority  Value            Measurement Units  Description
1         Latency          ms                 The network latency
1         TCP bandwidth    Mbits/s            TCP network bandwidth
2         UDP bandwidth    packets per sec    Number of UDP packets of 32 bytes size
2         TCP retransmits  packets per sec    Number of retransmitted TCP packets
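
The comparison between a baseline run and a run with L3 agent restarts can be expressed as a relative change per metric. The following is a minimal sketch, not part of Shaker; the metric names and the numbers are hypothetical placeholders rather than measured results.

```python
def relative_change(baseline, disturbed):
    """Return the change of `disturbed` relative to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (disturbed - baseline) / baseline * 100.0


def compare_runs(baseline_run, restart_run):
    """Compare two {metric_name: value} dicts metric by metric."""
    return {
        name: relative_change(baseline_run[name], restart_run[name])
        for name in baseline_run
        if name in restart_run
    }


# Hypothetical values, for illustration only (not measured results).
baseline_run = {"latency_ms": 2.0, "tcp_bandwidth_mbits": 800.0}
restart_run = {"latency_ms": 2.5, "tcp_bandwidth_mbits": 600.0}
changes = compare_runs(baseline_run, restart_run)
```

A positive value means the metric grew during the restart run, so the sign has to be read per metric: growth is bad for latency and retransmits, and good for bandwidth.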

5.27.2.1.3. Test Case 2: Rally tests execution

5.27.2.1.3.1. Description

Rally makes it possible to check the ability of OpenStack to perform simple operations such as create-delete, create-update, etc. at scale.

L3 HA has a restriction of 255 routers per HA network per tenant. At the moment it is not possible to create a new HA network for a tenant when the number of VIPs exceeds this limit. For this reason the number of tenants was increased for some tests (NeutronNetworks.create_and_list_routers). The most important results are provided by the test_create_delete_routers test, as it can catch possible race conditions during creation and deletion of HA routers, HA networks and HA interfaces. Several known bugs related to this have already been fixed upstream. To find more possible issues, test_create_delete_routers was run multiple times with different concurrency.
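
The number of tenants needed for a given total of HA routers follows directly from the 255-router limit. The sketch below assumes one HA network per tenant, as described above; the function name and the sample totals are illustrative only.

```python
# Assumption: the limit quoted above applies per HA network,
# and every tenant has exactly one HA network.
MAX_HA_ROUTERS_PER_TENANT = 255


def min_tenants(total_routers, per_tenant_limit=MAX_HA_ROUTERS_PER_TENANT):
    """Return the smallest number of tenants that can hold `total_routers`."""
    if total_routers < 0:
        raise ValueError("total_routers must not be negative")
    if per_tenant_limit <= 0:
        raise ValueError("per_tenant_limit must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return (total_routers + per_tenant_limit - 1) // per_tenant_limit


# 1000 HA routers do not fit into 3 tenants (3 * 255 = 765), so 4 are needed.
tenants_for_1000 = min_tenants(1000)
```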

5.27.2.1.3.2. List of performance metrics

Priority  Measurement Units       Description
1         Number of failed tests  Number of tests that failed during Rally tests execution
2         Concurrency             Number of tests that executed in parallel
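
Runs with different concurrency can be generated from one task template. The sketch below builds Rally task definitions as Python dicts and serializes one to JSON. It assumes that test_create_delete_routers corresponds to the Rally scenario NeutronNetworks.create_and_delete_routers; the iteration count, tenant count, CIDR and concurrency values are hypothetical and are not the values used for this test plan.

```python
import json


def build_router_task(times, concurrency, tenants, users_per_tenant=1):
    """Build a Rally task dict for one create-and-delete-routers run."""
    if times <= 0 or concurrency <= 0 or tenants <= 0:
        raise ValueError("times, concurrency and tenants must be positive")
    return {
        # Assumed mapping of test_create_delete_routers to a Rally scenario.
        "NeutronNetworks.create_and_delete_routers": [
            {
                "args": {
                    "network_create_args": {},
                    "subnet_create_args": {},
                    "subnet_cidr_start": "1.1.0.0/30",  # hypothetical CIDR
                    "subnets_per_network": 1,
                    "router_create_args": {},
                },
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                "context": {
                    "network": {},
                    "users": {
                        "tenants": tenants,
                        "users_per_tenant": users_per_tenant,
                    },
                    # -1 lifts the quota so that it does not limit the test.
                    "quotas": {
                        "neutron": {"network": -1, "subnet": -1, "router": -1}
                    },
                },
            }
        ]
    }


# One task per concurrency level (hypothetical levels).
tasks = [build_router_task(times=100, concurrency=c, tenants=5)
         for c in (1, 5, 10)]
task_json = json.dumps(tasks[0], indent=2)
```

Each generated dict can be written to a file and passed to Rally as a task; keeping everything except the concurrency fixed makes failures easier to attribute to races.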

5.27.2.1.4. Test Case 3: Manual destruction test: Ping to external network from VM during reset of primary (non-primary) controller

5.27.2.1.4.1. Description

../../../_images/ping_external.png

Scenario steps:

  1. Create a router
    neutron router-create routerHA --ha True
  2. Set the gateway for the external network and add an interface
    neutron router-gateway-set routerHA <ext_net_id>
    neutron router-interface-add routerHA <private_subnet_id>
  3. Boot an instance in the private network
    nova boot --image <image_id> --flavor <flavor_id> --nic net_id=<private_net_id> vm1
  4. Log in to the VM using ssh or the VNC console
  5. Start ping 8.8.8.8 and check that packets are not lost
  6. Check which agent is active with
    neutron l3-agent-list-hosting-router <router_id>
  7. Restart the node on which the L3 agent is active
    sudo shutdown -r now (or sudo reboot)
  8. Wait until another agent becomes active and the restarted node recovers
    neutron l3-agent-list-hosting-router <router_id>
  9. Stop ping and check the number of packets that were lost.
  10. Increase the number of routers and repeat steps 5-9
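
The number of lost packets in step 9 can be read from the summary line that ping prints when it is stopped. The following is a minimal sketch of such a parser; the sample output is illustrative, and the pattern assumes the usual "N packets transmitted, M received" summary (with or without the word "packets" before "received").

```python
import re

# Matches the summary line of both iputils and BusyBox style ping output.
_PING_SUMMARY = re.compile(
    r"(?P<sent>\d+) packets transmitted, "
    r"(?P<received>\d+) (?:packets )?received"
)


def lost_packets(ping_output):
    """Return (sent, lost) parsed from the summary line of ping output."""
    match = _PING_SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent = int(match.group("sent"))
    received = int(match.group("received"))
    return sent, sent - received


# Illustrative output, not a measured result.
sample = """\
--- 8.8.8.8 ping statistics ---
120 packets transmitted, 114 received, 5% packet loss, time 119180ms
"""
sent, lost = lost_packets(sample)
```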

5.27.2.1.4.2. List of performance metrics

Priority  Measurement Units       Description
1         Number of lost packets  Number of packets that were lost during restart of the node
2         Number of routers       Number of existing routers in the environment

5.27.2.1.5. Test Case 4: Manual destruction test: Ping from one VM to another VM in a different network during ban of L3 agent

5.27.2.1.5.1. Description

../../../_images/ping.png

Scenario steps:

  1. Create a router
    neutron router-create routerHA --ha True
  2. Add interfaces for two internal networks
    neutron router-interface-add routerHA <private_subnet1_id>
    neutron router-interface-add routerHA <private_subnet2_id>
  3. Boot one instance in private net1 and one in private net2
    nova boot --image <image_id> --flavor <flavor_id> --nic net_id=<private_net1_id> vm1
    nova boot --image <image_id> --flavor <flavor_id> --nic net_id=<private_net2_id> vm2
  4. Log in to VM1 using ssh or the VNC console
  5. Start ping vm2_ip and check that packets are not lost
  6. Check which agent is active with
    neutron l3-agent-list-hosting-router <router_id>
  7. Ban the active L3 agent:
    pcs resource ban p_neutron-l3-agent node-<id>
  8. Wait until another agent becomes active in
    neutron l3-agent-list-hosting-router <router_id>
  9. Clear the banned agent
    pcs resource clear p_neutron-l3-agent node-<id>
  10. Stop ping and check the number of packets that were lost.
  11. Increase the number of routers and repeat steps 5-10
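
Steps 6 and 8 both read the table printed by neutron l3-agent-list-hosting-router. The following is a minimal sketch of how that table could be parsed to find the active agent and to confirm that a failover took place. It assumes the client prints an ha_state column with the values active and standby; the agent IDs and host names in the samples are hypothetical.

```python
def parse_cli_table(text):
    """Parse a '|'-delimited CLI table into a list of row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def active_hosts(text):
    """Return the hosts whose ha_state column reads 'active'."""
    return [row["host"] for row in parse_cli_table(text)
            if row.get("ha_state") == "active"]


def failover_happened(before, after):
    """True if an agent is active after the ban and it is a different one."""
    new_active = active_hosts(after)
    return bool(new_active) and set(new_active) != set(active_hosts(before))


# Hypothetical outputs captured in step 6 and step 8.
BEFORE = """
+---------+--------+----------------+-------+----------+
| id      | host   | admin_state_up | alive | ha_state |
+---------+--------+----------------+-------+----------+
| agent-1 | node-1 | True           | :-)   | standby  |
| agent-2 | node-2 | True           | :-)   | active   |
+---------+--------+----------------+-------+----------+
"""
AFTER = BEFORE.replace("standby", "tmp").replace("active ", "standby").replace("tmp", "active")
```

Calling failover_happened(BEFORE, AFTER) in a polling loop gives a simple stop condition for step 8.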

5.27.2.1.5.2. List of performance metrics

Priority  Measurement Units       Description
1         Number of lost packets  Number of packets that were lost while the L3 agent was banned
2         Number of routers       Number of existing routers in the environment

5.27.2.1.6. Test Case 5: Manual destruction test: iperf UDP testing between VMs in different networks during ban of L3 agent

5.27.2.1.6.1. Description

../../../_images/iperf_addresses.png

Scenario steps:

  1. Create VMs.
  2. Log in to VM1 using ssh or the VNC console and run
    iperf -s -u
  3. Log in to VM2 using ssh or the VNC console and run
    iperf -c vm1_ip -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u
  4. Check that loss is less than 1%
  5. Check which agent is active with
    neutron l3-agent-list-hosting-router <router_id>
  6. Run the command from step 3 again
  7. Ban the active L3 agent:
    pcs resource ban p_neutron-l3-agent node-<id>
  8. Check the results of the iperf command and clear the banned L3 agent.
    pcs resource clear p_neutron-l3-agent node-<id>
  9. Increase the number of routers and repeat steps 3-8
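
The loss check in steps 4 and 8 can be automated by parsing the lost/total datagram counters from the iperf server report. The following is a minimal sketch; the sample report line is illustrative, not a measured result, and the datagram rate calculation assumes that 30M means 30 * 10^6 bits per second and that --len 64 sets a 64-byte payload.

```python
import re

# Matches the "lost/total (percent%)" part of an iperf UDP server report.
_LOSS_FIELD = re.compile(r"(?P<lost>\d+)/\s*(?P<total>\d+)\s+\([\d.]+%\)")


def udp_loss_percent(report_line):
    """Return the datagram loss in percent from one iperf report line."""
    match = _LOSS_FIELD.search(report_line)
    if match is None:
        raise ValueError("no lost/total datagram field found")
    lost = int(match.group("lost"))
    total = int(match.group("total"))
    if total == 0:
        raise ValueError("report contains no datagrams")
    return 100.0 * lost / total


def datagrams_per_second(bandwidth_bits, payload_bytes):
    """Return the datagram rate needed to reach the target bandwidth."""
    return bandwidth_bits / (payload_bytes * 8)


# Illustrative report line (hypothetical numbers).
sample = "[  3]  0.0-60.0 sec   214 MBytes  29.9 Mbits/sec   0.021 ms 17578/3515625 (0.5%)"
loss = udp_loss_percent(sample)
rate = datagrams_per_second(30_000_000, 64)
```

With these assumptions the client has to send roughly 58.6 thousand datagrams per second, so a 60 second run carries about 3.5 million datagrams and the 1% threshold corresponds to about 35 thousand lost datagrams.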

5.27.2.1.6.2. List of performance metrics

Priority  Value          Measurement Units  Description
1         UDP bandwidth  %                  Loss of UDP packets of 64 bytes size

5.27.2.2. Reports

Test plan execution reports: