6.20.2. OpenStack reliability testing

Status: draft
Version: 0
Abstract: This document describes an abstract methodology for OpenStack cluster high-availability testing and analysis. OpenStack data plane testing is currently out of scope and will be covered in a future revision. All tests described below were performed as part of OpenStack reliability testing.

6.20.2.1. Test results

6.20.2.1.1. Test environment

6.20.2.1.1.1. Software configuration on servers with OpenStack

Basic cluster configuration
  Name                 Build-9.0.0-451
  OpenStack release    Mitaka on Ubuntu 14.04
  Total nodes          6
  Controller           3 nodes
  Compute, Ceph OSD    3 nodes with KVM hypervisor
  Network              Neutron with tunneling segmentation
  Storage back ends    Ceph RBD for volumes (Cinder)
                       Ceph RadosGW for objects (Swift API)
                       Ceph RBD for ephemeral volumes (Nova)
                       Ceph RBD for images (Glance)

6.20.2.1.1.2. Software configuration on servers with Rally role

Before you start configuring a server with the Rally role, verify that Rally is installed. For more information, see the Rally installation documentation.
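
For example, you can quickly confirm the installed version (assuming the rally binary is on the PATH):

    rally --version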

Software version of the Rally server
  Software            Version
  Rally               0.4.0
  Operating system    Ubuntu 14.04.3 LTS
6.20.2.1.1.2.1. Environment description

6.20.2.1.1.3. Hardware

Description of server hardware

  Server name            Role
  728997-comp-disk-228   controller
  728998-comp-disk-227   controller
  728999-comp-disk-226   controller
  729000-comp-disk-225   compute, ceph-osd
  729001-comp-disk-224   compute, ceph-osd
  729002-comp-disk-223   compute, ceph-osd
  729017-comp-disk-255   Rally

The controller and compute nodes have an identical hardware configuration:

  SERVER    vendor, model    HP, DL380 Gen9
  OS        kernel           3.13.0-87-generic
            distribution     Ubuntu-trusty
            architecture     x86_64
  CPU       vendor, model    Intel, E5-2680 v3
            processor_count  2
            core_count       12
            frequency_MHz    2500
  RAM       vendor, model    HP, 752369-081
            amount_MB        262144
  NETWORK   interface_name   p1p1
            vendor, model    Intel, X710 Dual Port
            bandwidth        10 Gbit
  STORAGE   dev_name         /dev/sda
            vendor, model    raid10 - HP P840, 12 disks EH0600JEDHE
            SSD/HDD          HDD
            size             3.6 TB

6.20.2.1.1.4. Software

Services on servers by role
Role Service name
controller
horizon
keystone
nova-api
nova-scheduler
nova-cert
nova-conductor
nova-consoleauth
nova-consoleproxy
cinder-api
cinder-backup
cinder-scheduler
cinder-volume
glance-api
glance-glare
glance-registry
neutron-dhcp-agent
neutron-l3-agent
neutron-metadata-agent
neutron-openvswitch-agent
neutron-server
heat-api
heat-api-cfn
heat-api-cloudwatch
ceph-mon
rados-gw
heat-engine
memcached
rabbitmq_server
mysqld
galera
corosync
pacemaker
haproxy
compute-osd
nova-compute
neutron-l3-agent
neutron-metadata-agent
neutron-openvswitch-agent
ceph-osd
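
Several of the controller services listed above (for example, haproxy, rabbitmq_server, and mysqld/galera) are supervised by Pacemaker/Corosync in this architecture (see the next section). Their state can be inspected on a controller node with the standard Pacemaker tooling, for example:

    # One-shot snapshot of all Pacemaker-managed resources
    crm_mon -1
    # Corosync ring status
    corosync-cfgtool -s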

6.20.2.1.1.5. High availability cluster architecture

Controller nodes:

Mirantis reference HA architecture

Compute nodes:

Mirantis reference HA architecture

6.20.2.1.1.6. Networking

All servers have a similar network configuration:

Network Scheme of the environment

The following example shows a part of a switch configuration for each switch port connected to the ens1f0 interface of a server:

switchport mode trunk
switchport trunk native vlan 600
switchport trunk allowed vlan 600-602,630-649
spanning-tree port type edge trunk
spanning-tree bpduguard enable
no snmp trap link-status
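
To check the trunk configuration from the server side, you can bring up a temporary VLAN subinterface on the trunked NIC; a minimal sketch, assuming iproute2 and VLAN 630 from the allowed range (the test address is hypothetical):

    # Create a temporary subinterface tagged with VLAN 630
    ip link add link ens1f0 name ens1f0.630 type vlan id 630
    ip addr add 192.168.130.10/24 dev ens1f0.630   # hypothetical test address
    ip link set ens1f0.630 up
    # ... run ping or other reachability checks, then clean up
    ip link del ens1f0.630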

6.20.2.1.2. Factors description

  • reboot-random-controller: consists of a node-crash fault injection on a random OpenStack controller node.
  • sigkill-random-rabbitmq: consists of a service-crash fault injection on a random slave RabbitMQ messaging node.
  • sigkill-random-mysql: consists of a service-crash fault injection on a random MySQL node.
  • freeze-random-nova-api: consists of a service-hang fault injection applied to all nova-api processes on a random controller node for 150 seconds (see the sketch after this list).
  • freeze-random-memcached: consists of a service-hang fault injection applied to the memcached service on a random controller node for 150 seconds.
  • freeze-random-keystone: consists of a service-hang fault injection applied to the keystone service (public and admin endpoints) on a random controller node for 150 seconds.
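
All three freeze factors rely on the standard SIGSTOP/SIGCONT mechanism to simulate a service hang. The following is a minimal sketch of such an injection, assuming the target processes can be matched by a pkill pattern; it illustrates the idea only and is not the actual scrappy implementation:

    #!/bin/bash
    # Suspend all processes matching a pattern for a fixed interval,
    # then resume them.
    PATTERN="${1:?process name pattern is required}"
    DURATION="${2:-150}"

    pkill -STOP -f "${PATTERN}"   # freeze the matching processes
    sleep "${DURATION}"           # keep them frozen for the test interval
    pkill -CONT -f "${PATTERN}"   # resume the processes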

6.20.2.2. Testing process

Use the following VM parameters for testing purposes:

Test parameters
  Name                            Value
  Flavor to create VM from        m1.tiny
  Image name to create VM from    cirros
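
Before running the tests, you can verify that the flavor and the image exist in the environment, for example with the Mitaka-era CLI clients:

    nova flavor-show m1.tiny
    glance image-list | grep -i cirros
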
  1. Create a work directory on the server with the Rally role. In this documentation, this directory is referred to as WORK_DIR; an example path is /data/rally.

  2. Create a plugins directory in WORK_DIR and copy the scrappy.py plugin into that directory.

  3. Download the bash framework scrappy.sh and scrappy.conf to WORK_DIR/plugins.

  4. Modify the scrappy.conf file with appropriate values. For example:

    # SSH credentials
    SSH_LOGIN="root"
    SSH_PASS="r00tme"
    
    # Controller nodes
    CONTROLLERS[0]="10.44.0.7"
    CONTROLLERS[1]="10.44.0.6"
    CONTROLLERS[2]="10.44.0.5"
    
    # Compute nodes
    COMPUTES[0]="10.44.0.3"
    COMPUTES[1]="10.44.0.4"
    COMPUTES[2]="10.44.0.8"
    
    #Scrappy base path
    SCRAPPY_BASE="/root/scrappy"
    
  5. Create a scenarios directory in WORK_DIR and copy all Rally scenarios with factors that you are planning to test to that directory. For example: random_controller_reboot_factor.json.

  6. Create a deployment.json file in WORK_DIR and fill it with your OpenStack environment information. It should look like this:

    {
      "admin": {
        "password": "password",
        "tenant_name": "tenant",
        "username": "user"
      },
      "auth_url": "http://1.2.3.4:5000/v2.0",
      "region_name": "RegionOne",
      "type": "ExistingCloud",
      "endpoint_type": "internal",
      "admin_port": 35357,
      "https_insecure": true
    }
    
  7. Prepare for tests:

    # Fail early if WORK_DIR is not set
    : "${WORK_DIR:?WORK_DIR must be set}"
    DEPLOYMENT_NAME="$(uuidgen)"
    DEPLOYMENT_CONFIG="${WORK_DIR}/deployment.json"
    rally deployment create --filename "${DEPLOYMENT_CONFIG}" --name "${DEPLOYMENT_NAME}"
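    # Optionally, verify that Rally can reach the OpenStack services of the
    # newly registered deployment before running any tasks:
    rally deployment check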
    
  8. Create a /root/scrappy directory on every node in your OpenStack environment and copy scrappy_host.sh to that directory.

  9. Perform tests:

    PLUGIN_PATH="${WORK_DIR}/plugins"
    SCENARIOS="random_controller_reboot_factor.json"
    for scenario in ${SCENARIOS}; do
      rally --plugin-paths "${PLUGIN_PATH}" task start --tag "${scenario}" "${WORK_DIR}/scenarios/${scenario}"
    done
    task_list="$(rally task list --uuids-only)"
    rally task report --tasks ${task_list} --out="${WORK_DIR}/rally_report.html"
    

Once these steps are done, you get an HTML file with Rally test results.
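
Besides the HTML report, the SLA verdict of each task can be queried directly; a short example, reusing the task_list variable from the previous step:

    # Print the SLA check result for every finished task
    for uuid in ${task_list}; do
      rally task sla_check --uuid ${uuid}
    done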

6.20.2.2.1. Test case 1: NovaServers.boot_and_delete_server

Description

This Rally scenario boots and then deletes virtual instances through the OpenStack Nova API while the fault factors are injected.

Service-level agreement

Parameter Value
MTTR (sec) <=240
Success rate (%) >=95
Auto-healing Yes

Parameters

Parameter Value
Runner constant
Concurrency 5
Times 100
Injection-iteration 20
Testing-cycles 5

List of reliability metrics

  Priority  Metric         Units    Description
  1         SLA            Boolean  Service-level agreement result
  2         Auto-healing   Boolean  Whether the cluster auto-healed after the fault injection
  3         Failure rate   Percent  Test iteration failure ratio
  4         MTTR (auto)    Seconds  Automatic mean time to repair
  5         MTTR (manual)  Seconds  Manual mean time to repair, used when MTTR (auto) is Inf.

6.20.2.2.2. Test case 1 results

6.20.2.2.2.1. reboot-random-controller

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {

          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 5
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh random_controller_reboot",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 4.31 2 Yes Yes, up to 148.52 sec.
2 19.88 14 Yes Yes, up to 150.946 sec.
3 7.31 8 Yes Yes, up to 124.593 sec.
4 95.07 9 Yes Yes, up to 240.893 sec.
5 Inf. 80.00 No Inf.

Rally report: reboot_random_controller.html

Testing results summary
Value MTTR(sec) Failure rate
Min 4.31 2
Max 95.07 80
SLA Yes No

Detailed results description

This factor affects OpenStack cluster operation on every run. Auto-healing works but may take a long time: in our tests, the cluster recovered only during the fifth testing cycle, after Rally had already finished the run with an error status. Performance degradation is therefore very significant while the cluster is recovering.

6.20.2.2.2.2. sigkill-random-rabbitmq

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {

          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 5
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh random_controller_kill_rabbitmq",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 0 0 Yes Yes, up to 12.266 sec.
2 0 0 Yes Yes, up to 15.775 sec.
3 98.52 1 Yes Yes, up to 145.115 sec.
4 0 0 Yes No
5 0 0 Yes Yes, up to 65.926 sec.

Rally report: random_controller_kill_rabbitmq.html

Testing results summary
Value MTTR(sec) Failure rate
Min 0 0
Max 98.52 1
SLA Yes Yes

Detailed results description

This factor may affect OpenStack cluster operation. Auto-healing works fine. Performance degradation is significant while the cluster is recovering.
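
To confirm that the killed RabbitMQ instance rejoined the cluster after the fault injection, check the cluster membership on any controller, for example:

    # Compare the "nodes" and "running_nodes" lists in the output
    rabbitmqctl cluster_status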

6.20.2.2.2.3. sigkill-random-mysql

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {

          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 5
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh send_signal /usr/sbin/mysqld -KILL",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 2.31 0 Yes Yes, up to 12.928 sec.
2 0 0 Yes Yes, up to 11.156 sec.
3 0 1 Yes Yes, up to 13.592 sec.
4 0 0 Yes Yes, up to 11.864 sec.
5 0 0 Yes Yes, up to 12.715 sec.

Rally report: random_controller_kill_mysqld.html

Testing results summary
Value MTTR(sec) Failure rate
Min 0 0
Max 2.31 1
SLA Yes Yes

Detailed results description

This factor may affect OpenStack cluster operation. Auto-healing works fine. Performance degradation is not significant.
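
The fast recovery is consistent with Galera re-admitting the restarted node. This can be verified on any controller; a quick check, assuming local MySQL root access:

    # A cluster size of 3 means all controllers are back in the Galera cluster
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"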

6.20.2.2.2.4. freeze-random-nova-api

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {
          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 15
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh random_controller_freeze_process_fixed_interval nova-api 150",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 0 0 Yes Yes, up to 156.935 sec.
2 0 0 Yes Yes, up to 155.085 sec.
3 0 0 Yes Yes, up to 156.93 sec.
4 0 0 Yes Yes, up to 156.782 sec.
5 150.55 1 Yes Yes, up to 154.741 sec.

Rally report: random_controller_freeze_nova_api_150_sec.html

Testing results summary
Value MTTR(sec) Failure rate
Min 0 0
Max 150.55 1
SLA Yes Yes

Detailed results description

This factor affects OpenStack cluster operation. Auto-healing does not work: cluster operation was recovered only after a SIGCONT POSIX signal was sent to all frozen nova-api processes. Performance degradation is determined by the duration of the factor. This behaviour is not normal for an HA OpenStack configuration and should be investigated.
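
For reference, the manual recovery amounts to resuming the suspended processes on the affected controller, for example:

    # Resume all frozen nova-api processes
    pkill -CONT -f nova-api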

6.20.2.2.2.5. freeze-random-memcached

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {
          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 5
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh random_controller_freeze_process_fixed_interval memcached 150",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 0 0 Yes Yes, up to 26.679 sec.
2 0 0 Yes Yes, up to 23.726 sec.
3 0 0 Yes Yes, up to 21.893 sec.
4 0 0 Yes Yes, up to 22.796 sec.
5 0 0 Yes Yes, up to 27.737 sec.

Rally report: random_controller_freeze_memcached_150_sec.html

Testing results summary
Value MTTR(sec) Failure rate
Min 0 0
Max 0 0
SLA Yes Yes

Detailed results description

This factor does not affect OpenStack cluster operation. During the factor testing, a small performance degradation was observed.

6.20.2.2.2.6. freeze-random-keystone

Rally scenario used during factor testing:

{% set flavor_name = flavor_name or "m1.tiny" %}
{% set image_name = image_name or "^(cirros.*uec|TestVM)$" %}
{
  "NovaServers.boot_and_delete_server": [
      {% for i in range (0, 5, 1) %}
      {
          "args": {
              "flavor": {
                  "name": "{{flavor_name}}"
              },
              "image": {
                  "name": "{{image_name}}"
              },
              "force_delete": false
          },
          "runner": {
              "type": "constant",
              "times": 100,
              "concurrency": 5
          },
          "context": {
              "users": {
                  "tenants": 1,
                  "users_per_tenant": 1
              }
          },
            "sla": {
                "scrappy": {
                  "on_iter": 20,
                  "execute": "/bin/bash /data/rally/rally_plugins/scrappy/scrappy.sh random_controller_freeze_process_fixed_interval keystone 150",
                  "cycle": {{i}}
                }
            }
        }{% if not loop.last %},{% endif %}
    {% endfor %}
    ]
}

Factor testing results:

Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 97.19 7 Yes No
2 93.87 6 Yes No
3 92.12 8 Yes No
4 94.51 6 Yes No
5 98.37 7 Yes No

Rally report: random_controller_freeze_keystone_150_sec.html

Testing results summary
Value MTTR(sec) Failure rate
Min 92.12 6
Max 98.37 8
SLA Yes No

Detailed results description

This factor affects OpenStack cluster operation. After the keystone processes freeze on a controller, the HA logic needs approximately 95 seconds to recover service operation. After recovery, no performance degradation is observed, although this was verified only at low concurrency. This behaviour is not normal for an HA OpenStack configuration and should be investigated in the future.