Appendix L: Masakari

Overview

As of the 20.05 charm release, Masakari can be deployed to provide automated instance recovery for clouds that use shared storage for its instances. The following functionality is provided:

  1. Evacuation of instances (supported since OpenStack Stein)

    In the event of hypervisor software failure the associated compute node is shut down and instance images are started on another hypervisor.

  2. Restarting of instances (supported since OpenStack Ussuri)

    A failed instance can be restarted on its current hypervisor.

See the Masakari charm for an overview of the charms involved.

Note

MAAS is required when enabling Masakari on Charmed OpenStack.

Software

Install the software necessary for configuring Masakari:

sudo snap install openstackclients

Verify that the segment sub-command is available (this is provided by the python-masakariclient plugin):

openstack segment --help

Important

If the segment sub-command is not available you will need a more recent version of the openstackclients snap. For example, you may need to use the ‘edge’ channel: sudo snap refresh openstackclients --channel=edge.

Instance evacuation mechanics

In order for an instance to be relocated to another hypervisor some form of shared storage must be implemented. As a result, the scenario where a hypervisor has lost network access to its peers yet continues to access that shared storage must be considered.

The mechanics of instance evacuation is now described:

Masakari Monitors, on a peer hypervisor, detects that its peer is unavailable and notifies the Masakari API server. This in turn triggers the Masakari engine to initiate a failover of the instance via Nova. Assuming that Nova concurs that the hypervisor is absent, it will attempt to start the instance on another hypervisor. At this point there are two instances competing for the same disk image, which can lead to data corruption.

The solution is to enable a STONITH Pacemaker plugin, which will power off the compute node via the MAAS API when Pacemaker detects the hypervisor as being offline.

Caution

Since nova-compute is typically deployed on bare metal, which may host containerised applications and possibly even applications alongside nova-compute (e.g. ceph-osd), care is advised when designing a cloud with Masakari to avoid a powered-off compute node from disrupting crucial non-HA cloud services.

Ensure that Masakari functionality has been fully validated in a staging environment prior to using it in production.

Usage

Configuration

The below overlay bundle can be used to deploy Masakari when using a bundle to deploy OpenStack.

Ensure that the machines section and the placement directives (i.e. the to option under the masakari application) can co-exist with your OpenStack bundle.

Provide values for the maas_url, maas_credentials, and vip hacluster charm options . A VIP is a virtual IP needed for Masakari to enable HA (a requirement when using the masakari charm). If multiple networks are used, multiple (space separated) VIPs should be provided. See OpenStack high availability for HA guidance.

Enable STONITH via the enable-stonith pacemaker-remote charm option.

Provide values for the binding (network spaces) masakari charm option according to your local environment. For simplicity (or for testing), the same network space can be used for all Masakari bindings.

machines:
  '0':
    series: bionic
  '1':
    series: bionic
  '2':
    series: bionic
  '3':
    series: bionic
relations:
- - nova-compute:juju-info
  - masakari-monitors:container
- - masakari:ha
  - hacluster:ha
- - keystone:identity-credentials
  - masakari-monitors:identity-credentials
- - nova-compute:juju-info
  - pacemaker-remote:juju-info
- - hacluster:pacemaker-remote
  - pacemaker-remote:pacemaker-remote
- - masakari:identity-service
  - keystone:identity-service
- - masakari:shared-db
  - mysql:shared-db
- - masakari:amqp
  - rabbitmq-server:amqp
series: bionic
applications:
  masakari-monitors:
    charm: cs:masakari-monitors
  hacluster:
    charm: cs:hacluster
    options:
      maas_url: <INSERT MAAS URL>
      maas_credentials: <INSERT MAAS API KEY>
  pacemaker-remote:
    charm: cs:pacemaker-remote
    options:
      enable-stonith: True
      enable-resources: False
  masakari:
    charm: cs:masakari
    series: bionic
    num_units: 3
    options:
      openstack-origin: cloud:bionic-stein
      vip: <INSERT VIP(S)>
    bindings:
      public: public
      admin: admin
      internal: internal
      shared-db: internal
      amqp: internal
    to:
    - 'lxd:1'
    - 'lxd:2'
    - 'lxd:3'

Deployment

To deploy Masakari during the deployment of a new cloud (e.g. via the openstack-base bundle):

juju deploy ./bundle.yaml --overlay masakari-overlay.yaml

To add Masakari to an existing deployment (i.e. the Juju model has pre-existing machines) the --map-machines option should be used.

The cloud should then be configured for usage. See Configure OpenStack for assistance.

For the purposes of this document the below hypervisors are presumed:

+-------------------+---------+-------+
| Host              | Status  | State |
+-------------------+---------+-------+
| virt-node-01.maas | enabled | up    |
| virt-node-10.maas | enabled | up    |
| virt-node-02.maas | enabled | up    |
+-------------------+---------+-------+

In addition let us assume that instance ‘bionic-1’ now resides on host ‘virt-node-02.maas’:

+----------------------+-------------------+
| Field                | Value             |
+----------------------+-------------------+
| OS-EXT-SRV-ATTR:host | virt-node-02.maas |
+----------------------+-------------------+

The above information was obtained by the following two commands, respectively:

openstack compute service list -c Host -c Status -c State --service nova-compute
openstack server show bionic-1 -c OS-EXT-SRV-ATTR:host

Instance evacuation recovery methods

With Masakari, compute nodes are grouped into failover segments. In the event of a compute node failure, that node’s instances are moved onto another compute node within the same segment.

The destination node is determined by the recovery method configured for the affected segment. There are four methods:

  • reserved_host

  • auto

  • rh_priority

  • auto_priority

A compute node failure can be simulated by bringing down its primary network interface. For example, to bring down a node that corresponds to unit nova-compute/2:

juju run --unit nova-compute/2 sudo ip link set br-ens3 down

‘reserved_host’

The reserved_host recovery method relocates instances to a subset of non-active nodes. Because these nodes are not active and are typically resourced adequately for failover duty, there is a guarantee that sufficient resources will exist on a reserved node to accommodate migrated instances.

For example, to create segment ‘S1’, configure it to use the reserved_host method, and assign it three compute nodes, with one being tagged as a reserved node:

openstack segment create S1 reserved_host COMPUTE
openstack segment host create virt-node-10.maas COMPUTE SSH S1
openstack segment host create virt-node-02.maas COMPUTE SSH S1
openstack segment host create --reserved True virt-node-01.maas COMPUTE SSH S1

View the details of a segment:

openstack segment list

Sample output:

+--------------------------------------+------+-------------+--------------+-----------------+
| uuid                                 | name | description | service_type | recovery_method |
+--------------------------------------+------+-------------+--------------+-----------------+
| 3af6dfe7-1619-486f-a2c6-8453488c6a66 | S2   | None        | COMPUTE      | auto            |
+--------------------------------------+------+-------------+--------------+-----------------+

A segment’s hosts can be listed like this:

openstack segment host list -c name -c reserved -c on_maintenance S2

The output should show a value of ‘True’ in the ‘reserved’ column for the appropriate node:

+-------------------+----------+----------------+
| name              | reserved | on_maintenance |
+-------------------+----------+----------------+
| virt-node-01.maas | True     | False          |
| virt-node-10.maas | False    | False          |
| virt-node-02.maas | False    | False          |
+-------------------+----------+----------------+

Finally, disable the reserved node in Nova so that it becomes non-active, and thus available for failover:

openstack compute service set --disable virt-node-01.maas nova-compute

The cloud’s compute node list should show a status of ‘disabled’ for the appropriate node:

+-------------------+----------+-------+
| Host              | Status   | State |
+-------------------+----------+-------+
| virt-node-01.maas | disabled | up    |
| virt-node-10.maas | enabled  | up    |
| virt-node-02.maas | enabled  | up    |
+-------------------+----------+-------+

When a compute node failure is detected, Masakari will, in Nova, disable the failed node and enable a reserved node. The state of the node should also show as ‘down’.

Presuming that node ‘virt-node-02.maas’ has failed the cloud’s compute node list should become:

+-------------------+----------+-------+
| Host              | Status   | State |
+-------------------+----------+-------+
| virt-node-01.maas | enabled  | up    |
| virt-node-10.maas | enabled  | up    |
| virt-node-02.maas | disabled | down  |
+-------------------+----------+-------+

The reserved node will begin hosting evacuated instances and Masakari will remove the reserved flag from it. It will also place the failed node in maintenance mode.

The segment’s host list should show:

+-------------------+----------+----------------+
| name              | reserved | on_maintenance |
+-------------------+----------+----------------+
| virt-node-01.maas | False    | False          |
| virt-node-10.maas | False    | False          |
| virt-node-02.maas | False    | True           |
+-------------------+----------+----------------+

The expectation is that instance ‘bionic-1’ has been moved from ‘virt-node-02.maas’ to the reserved node, host ‘virt-node-01.maas’:

+----------------------+-------------------+
| Field                | Value             |
+----------------------+-------------------+
| OS-EXT-SRV-ATTR:host | virt-node-01.maas |
+----------------------+-------------------+

‘auto’

The auto recovery method relocates instances to any available node in the same segment. Because all the nodes are active, contrarily to the reserved_host method, there is no guarantee that sufficient resources will exist on the destination node to accommodate migrated instances.

For example, to create segment ‘S2’, configure it to use the auto method, and assign it three compute nodes:

openstack segment create S2 auto COMPUTE
openstack segment host create virt-node-01.maas COMPUTE SSH S2
openstack segment host create virt-node-02.maas COMPUTE SSH S2
openstack segment host create virt-node-10.maas COMPUTE SSH S2

In contrast to the reserved_host method all the nodes show as active (i.e. none are reserved):

+-------------------+----------+----------------+
| name              | reserved | on_maintenance |
+-------------------+----------+----------------+
| virt-node-10.maas | False    | False          |
| virt-node-02.maas | False    | False          |
| virt-node-01.maas | False    | False          |
+-------------------+----------+----------------+

Continuing with the above observation, upon node failure, there are no hypervisors for Masakari to enable in Nova. A failed node will however be put on_maintenance in Masakari:

+-------------------+----------+----------------+
| name              | reserved | on_maintenance |
+-------------------+----------+----------------+
| virt-node-10.maas | False    | False          |
| virt-node-02.maas | False    | False          |
| virt-node-01.maas | False    | True           |
+-------------------+----------+----------------+

‘rh_priority’ and ‘auto_priority’

The below recovery methods utilise one of the previously described methods but use the other as a failover.

  • rh_priority

    Attempts to evacuate instances using the reserved_host method. If the latter is unsuccessful the auto method will be used.

  • auto_priority

    Attempts to evacuate instances using the auto method. If the latter is unsuccessful the reserved_host method will be used.

Instance restart

The enabling of the instance restart feature is done on a per-instance basis.

For example, tag instance ‘bionic-1’ as HA-enabled in order to have it restarted automatically on its hypervisor:

openstack server set --property HA_Enabled=True bionic-1

Important

Perhaps non-intuitively, if the instance evacuation feature is not desired a hypervisor must nonetheless be assigned a failover segment in order for the restart feature to be available to its instances.

An instance failure can be simulated by killing its process. First determine its hypervisor and qemu guest name:

openstack server show bionic-1 -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:instance_name

Output:

+-------------------------------+-------------------+
| Field                         | Value             |
+-------------------------------+-------------------+
| OS-EXT-SRV-ATTR:host          | virt-node-02.maas |
| OS-EXT-SRV-ATTR:instance_name | instance-00000001 |
+-------------------------------+-------------------+

If you do not have admin rights in the cloud the above fields may not be visible.

This hypervisor corresponds to unit nova-compute/2 in this example cloud.

Check the current PID, kill the process, wait a minute, and verify that a new process gets started:

juju run --unit nova-compute/2 'pgrep -f guest=instance-00000001'
juju run --unit nova-compute/2 'sudo pkill -f -9 guest=instance-00000001'
juju run --unit nova-compute/2 'pgrep -f guest=instance-00000001'

Supplementary information

This section contains information that can be useful when working with Masakari.

  • Once a failed node has been re-inserted into the cloud it will show, in Nova, as ‘disabled’ but ‘up’ and, in Masakari, as ‘on_maintenance’. It can become an active hypervisor with:

    openstack compute service set --enable <host-name> nova-compute
    openstack segment host update --on_maintenance=False <segment-name> <host-name>
    
  • A segment’s recovery method can be updated with:

    openstack segment update --recovery_method <method> --service_type COMPUTE <segment-name>
    
  • A node cannot be assigned to a segment while it’s assigned to another segment. It must first be removed from the current segment with:

    openstack segment host delete <segment-name> <host-name>
    
  • A node’s reserved status can be updated with:

    openstack segment host update --reserved=<boolean> <segment-name> <host-name>