5.28.1. OpenStack reliability testing

status:ready
version:1.0
Abstract: This document describes an abstract methodology for OpenStack cluster high-availability testing and analysis. OpenStack data plane testing is currently out of scope and will be described in a future version.
Conventions:
  • OpenStack cluster: a set of server nodes with a deployed and fully operational OpenStack environment in a high-availability configuration.
  • Fault-injection operation: one of the common types of failure that can occur in a production environment: service-hang, service-crash, network-partition, network-flapping, and node-crash.
  • Service-hang: faults are injected into a specified OpenStack service by sending it the SIGSTOP and SIGCONT POSIX signals.
  • Service-crash: faults are injected by sending the SIGKILL signal to a specified OpenStack service.
  • Node-crash: faults are injected into an OpenStack cluster by rebooting or shutting down a server node.
  • Network-partition: faults are injected by inserting iptables rules on the OpenStack cluster nodes that host the service to be network-partitioned.
  • Network-flapping: faults are injected by inserting and deleting iptables rules on the fly on the OpenStack cluster nodes that host the service under test.
  • Factor: a set of atomic fault-injection operations. For example: reboot-random-controller, reboot-random-rabbitmq.
  • Test plan: contains two elements: a test scenario execution graph and fault-injection factors.
  • SLA: service-level agreement.
  • Testing-cycles: the number of test cycles run for each factor.
  • Inf: indicates that the cluster did not auto-heal after fault-factor injection, so the time to auto-heal is treated as infinite.
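
The signal-based and iptables-based operations above can be sketched in a few lines of Python. This is only an illustration, not the tooling used to produce any results: the function names are hypothetical, the process ID is assumed to be known already, and the iptables rule (dropping inbound TCP traffic to one port) is one possible way to isolate a service. The iptables commands are only built here, not executed, since running them requires root.

```python
import os
import signal
import time


def hang_service(pid, seconds):
    """Service-hang: pause the process with SIGSTOP, resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


def crash_service(pid):
    """Service-crash: terminate the process immediately with SIGKILL."""
    os.kill(pid, signal.SIGKILL)


def partition_commands(port):
    """Network-partition: build iptables commands that drop, and later
    restore, inbound TCP traffic to one service port (assumed rule shape)."""
    rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"]
    return {
        "inject": ["iptables", "-I"] + rule,  # insert the DROP rule
        "heal": ["iptables", "-D"] + rule,    # delete the same rule
    }
```

Network-flapping could reuse partition_commands by alternating the "inject" and "heal" commands at a chosen interval.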

5.28.1.1. Test Plan

5.28.1.1.1. Test Environment

This section should contain all information about the deployed OpenStack environment, including an archive of the contents of the /etc folder from all nodes.
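
One way to produce the per-node archive is sketched below, assuming the script runs on each node with enough privileges to read the directory. The function name and the archive naming are hypothetical; copying the archives off the nodes is left out.

```python
import tarfile
from pathlib import Path


def archive_config(src_dir, archive_path):
    """Pack src_dir (for example /etc on one node) into a gzip tarball."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name so that archives
        # from different nodes unpack into predictable paths.
        tar.add(str(src), arcname=src.name)
    return archive_path
```

For example, archive_config("/etc", "node-1-etc.tar.gz") would be run once per node, with a distinct archive name for each.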

5.28.1.1.1.1. Preparation

This section should contain all steps needed to reproduce the deployment of the OpenStack environment and the client node. For example, if the testing environment is deployed with DevStack, this section should contain all DevStack configuration files, the DevStack version, and all deployment steps.

5.28.1.1.1.2. Environment description

This section should contain all cluster hardware information, including processor model and its frequency, memory size, storage type and its capacity, network interfaces, and others. A separate client node must be used to drive the tests.

5.28.1.1.1.2.1. Hardware

This section should contain a full specification of the hardware nodes.

Description of server hardware
Group    Parameter         Value
SERVER   name
         role
         vendor,model
         operating_system
CPU      vendor,model
         processor_count
         core_count
         frequency_MHz
RAM      vendor,model
         amount_MB
NETWORK  interface_name
         vendor,model
         bandwidth
STORAGE  dev_name
         vendor,model
         SSD/HDD
         size
5.28.1.1.1.2.2. Networking

This section should contain a full description of the network equipment used in the OpenStack cluster. A network topology diagram and the network hardware configuration files should be included in this section.

5.28.1.1.2. Factors description

Describe here the factors used during the test runs. Examples:

  • reboot-random-controller: a node-crash fault injection on a random OpenStack controller node.
  • reboot-random-rabbitmq: a node-crash fault injection on a random RabbitMQ messaging node.
  • sigstop-random-nova-api: a service-hang fault injection on a random nova-api service.
  • sigkill-random-mysql: a service-crash fault injection on a random MySQL node.
  • network-partition-random-mysql: a network-partition fault injection on a random MySQL node.
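
The example factors can be captured as a small lookup table that maps each factor name to a fault-injection operation and a target group. The sketch below is illustrative only: the inventory structure, the group names, and the plan_injection helper are assumptions, not part of any existing tool.

```python
import random

# Factor name -> (fault-injection operation, target group).
FACTORS = {
    "reboot-random-controller": ("node-crash", "controller"),
    "reboot-random-rabbitmq": ("node-crash", "rabbitmq"),
    "sigstop-random-nova-api": ("service-hang", "nova-api"),
    "sigkill-random-mysql": ("service-crash", "mysql"),
    "network-partition-random-mysql": ("network-partition", "mysql"),
}


def plan_injection(factor, inventory, rng=random):
    """Resolve a factor into an operation and one randomly chosen target.

    inventory is a hypothetical dict mapping a group name to the list of
    nodes (or service instances) that belong to it.
    """
    operation, group = FACTORS[factor]
    candidates = inventory[group]
    if not candidates:
        raise ValueError(f"no targets available for group {group!r}")
    return operation, rng.choice(candidates)
```

Passing a seeded random.Random instance as rng makes the choice of target reproducible between test cycles.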

5.28.1.1.3. Test Case 1: NovaServers.boot_and_delete_server

5.28.1.1.3.1. Description

This Rally scenario boots and deletes virtual instances through the OpenStack Nova API while fault factors are injected.

5.28.1.1.3.2. Service-level agreement

In this section, specify SLA values. For example:

Parameter Value
MTTR (sec) <=240
Failure rate (%) <=95
Auto-healing Yes
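
A check of one test cycle against these example SLA values could look like the sketch below. The thresholds are the example values from the table; the dictionary keys and the function name are hypothetical.

```python
# Example SLA values from the table above.
SLA = {"mttr_sec": 240, "failure_rate_pct": 95, "auto_healing": True}


def meets_sla(mttr_sec, failure_rate_pct, auto_healed, sla=SLA):
    """Return True if one test cycle satisfies every SLA parameter."""
    if sla["auto_healing"] and not auto_healed:
        return False
    return (mttr_sec <= sla["mttr_sec"]
            and failure_rate_pct <= sla["failure_rate_pct"])
```

An MTTR of Inf (float("inf") in Python) fails the comparison naturally, so a cluster that never recovers cannot pass.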

5.28.1.1.3.3. Parameters

In this section, specify the load parameters used during the test. For example:

Parameter Value
Runner constant
Concurrency X
Times Y
Injection-iteration Z
Testing-cycles N
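
As an illustration, the Runner, Concurrency, and Times parameters map onto the runner section of a Rally task file. The sketch below builds such a task as a Python dictionary and serializes it to JSON. The flavor and image names are placeholders, and Injection-iteration and Testing-cycles are deliberately left out because how they are wired into the run depends on the fault-injection tooling in use.

```python
import json


def build_task(concurrency, times):
    """Build a Rally task definition for NovaServers.boot_and_delete_server."""
    return {
        "NovaServers.boot_and_delete_server": [
            {
                # Placeholder flavor and image names; replace with values
                # that exist in the cloud under test.
                "args": {
                    "flavor": {"name": "m1.tiny"},
                    "image": {"name": "cirros"},
                },
                "runner": {
                    "type": "constant",
                    "concurrency": concurrency,
                    "times": times,
                },
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_task(concurrency=4, times=100), indent=2))
```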

5.28.1.1.3.4. List of reliability metrics

Priority  Value          Measurement Units  Description
1         SLA            Boolean            Service-level agreement result
2         Auto-healing   Boolean            Whether the cluster auto-healed after fault injection
3         Failure rate   Percent            Test iteration failure ratio
4         MTTR (auto)    Seconds            Automatic mean time to repair
5         MTTR (manual)  Seconds            Manual mean time to repair, if MTTR (auto) is Inf
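
The failure rate and the automatic repair time can be derived from per-iteration results. The sketch below assumes each iteration is recorded as a (timestamp in seconds, success flag) pair and that the fault-injection time is known. It measures repair time for a single injection as the time from injection to the first success after the last failed iteration; this definition is an assumption made for illustration, and MTTR would then be the mean of these values over the injections.

```python
import math


def failure_rate(results):
    """Percentage of failed iterations in a list of (timestamp, ok) pairs."""
    if not results:
        raise ValueError("no iterations recorded")
    failed = sum(1 for _, ok in results if not ok)
    return 100.0 * failed / len(results)


def time_to_repair(results, injected_at):
    """Seconds from fault injection until iterations succeed again.

    Returns 0.0 if nothing failed after the injection and math.inf if the
    last recorded iteration still failed (no auto-healing observed).
    """
    after = sorted((t, ok) for t, ok in results if t >= injected_at)
    failures = [i for i, (_, ok) in enumerate(after) if not ok]
    if not failures:
        return 0.0
    last_failure = failures[-1]
    if last_failure == len(after) - 1:
        return math.inf
    return after[last_failure + 1][0] - injected_at
```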

5.28.1.1.3.5. Results

5.28.1.1.3.5.1. reboot-random-controller
Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 X Y Yes Yes
2 X Y Yes Yes
3 X Y No Yes
4 X Y Yes Yes
5 X Y Yes Yes

Place a link here to the Rally report file with the results of testing this factor.

Testing results summary
Value MTTR Failure rate
Min X Y
Max X Y
SLA X Y
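
The summary rows can be computed from the per-cycle results. The sketch below assumes each cycle is a dictionary with "mttr" and "failure_rate" keys (hypothetical names) and treats the SLA row as a pass/fail verdict on the worst observed value, which is one possible reading of the table.

```python
def summarize(cycles, sla_mttr, sla_failure_rate):
    """Return the Min, Max, and SLA rows for MTTR and failure rate."""
    mttrs = [c["mttr"] for c in cycles]
    rates = [c["failure_rate"] for c in cycles]
    return {
        "Min": (min(mttrs), min(rates)),
        "Max": (max(mttrs), max(rates)),
        # The SLA holds only if the worst cycle stays within the limit.
        "SLA": (max(mttrs) <= sla_mttr, max(rates) <= sla_failure_rate),
    }
```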
5.28.1.1.3.5.2. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.28.1.1.3.5.3. reboot-random-rabbitmq
Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 X Y Yes Yes
2 X Y Yes Yes
3 X Y No Yes
4 X Y Yes Yes
5 X Y Yes Yes

Place a link here to the Rally report file with the results of testing this factor.

Testing results summary
Value MTTR Failure rate
Min X Y
Max X Y
SLA X Y
5.28.1.1.3.5.4. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.28.1.1.4. Test Case 2: GlanceImages.create_and_delete_image

5.28.1.1.4.1. Description

This Rally scenario creates and deletes images through the OpenStack Glance API while fault factors are injected.

5.28.1.1.4.2. Service-level agreement

In this section, specify SLA values. For example:

Parameter Value
MTTR (sec) <=120
Failure rate (%) <=95
Auto-healing Yes

5.28.1.1.4.3. Parameters

In this section, specify the load parameters used during the test. For example:

Parameter Value
Runner constant
Concurrency X
Times Y
Injection-iteration Z
Testing-cycles N

5.28.1.1.4.4. List of reliability metrics

Priority  Value          Measurement Units  Description
1         SLA            Boolean            Service-level agreement result
2         Auto-healing   Boolean            Whether the cluster auto-healed after fault injection
3         Failure rate   Percent            Test iteration failure ratio
4         MTTR (auto)    Seconds            Automatic mean time to repair
5         MTTR (manual)  Seconds            Manual mean time to repair, if MTTR (auto) is Inf

5.28.1.1.4.5. Results

5.28.1.1.4.5.1. reboot-random-controller
Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 X Y Yes Yes
2 X Y Yes Yes
3 X Y No Yes
4 X Y Yes Yes
5 X Y Yes Yes

Place a link here to the Rally report file with the results of testing this factor.

Testing results summary
Value MTTR Failure rate
Min X Y
Max X Y
SLA X Y
5.28.1.1.4.5.2. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.28.1.1.4.5.3. reboot-random-rabbitmq
Full description of cyclic execution results
Cycles MTTR(sec) Failure rate(%) Auto-healing Performance degradation
1 X Y Yes Yes
2 X Y Yes Yes
3 X Y No Yes
4 X Y Yes Yes
5 X Y Yes Yes

Place a link here to the Rally report file with the results of testing this factor.

Testing results summary
Value MTTR Failure rate
Min X Y
Max X Y
SLA X Y
5.28.1.1.4.5.4. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.28.1.2. Reports

Test plan execution reports: