Health Policy

The health policy is designed for Senlin to detect cluster node failures and to recover from them in a way that users can customize. The health policy is not meant to be a universal solution to every problem related to high availability. However, the ultimate goal of the development team is to provide an auto-healing framework that is usable, flexible, and extensible for most deployment scenarios.

The policy type is currently applicable to clusters whose profile type is one of os.nova.server or os.heat.stack. This could be extended in the future.

Properties

detection
  interval
    Number of seconds between polls. Only required when type is ‘NODE_STATUS_POLLING’, ‘NODE_STATUS_POLL_URL’, or ‘HYPERVISOR_STATUS_POLLING’.
  node_update_timeout
    Number of seconds to wait after the last node update before checking node health.
  recovery_conditional
    The condition that determines when recovery is performed if multiple detection modes are specified. ‘ALL_FAILED’ means that all detection modes have to return failed health checks before a node is recovered. ‘ANY_FAILED’ means that a failed health check from a single detection mode triggers a node recovery.
  detection_modes
    List properties:
    type
      Type of node failure detection.
    options
      poll_url
        URL to poll for node status. See documentation for valid expansion parameters. Only required when type is ‘NODE_STATUS_POLL_URL’.
      poll_url_conn_error_as_unhealthy
        Whether to treat URL connection errors as an indication of an unhealthy node. Only required when type is ‘NODE_STATUS_POLL_URL’.
      poll_url_healthy_response
        String pattern in the poll URL response body that indicates a healthy node. Required when type is ‘NODE_STATUS_POLL_URL’.
      poll_url_retry_interval
        Number of seconds between URL polling retries before a node is considered down. Required when type is ‘NODE_STATUS_POLL_URL’.
      poll_url_retry_limit
        Number of times to retry URL polling when the response body does not contain the poll_url_healthy_response string before a node is considered down. Required when type is ‘NODE_STATUS_POLL_URL’.
      poll_url_ssl_verify
        Whether to verify SSL when calling the URL to poll for node status. Only required when type is ‘NODE_STATUS_POLL_URL’.
recovery
  node_delete_timeout
    Number of seconds to wait for node deletion to finish before starting node creation for the RECREATE recovery option. Required when type is ‘NODE_STATUS_POLL_URL’ and the recovery action is RECREATE.
  node_force_recreate
    Whether to create the node even if node deletion failed. Required when type is ‘NODE_STATUS_POLL_URL’ and the recovery action is RECREATE.
  actions
    List properties:
    name
      Name of action to execute.
    params
      Parameters for the action.
  fencing
    List of services to be fenced.

Sample

A typical spec for a health policy looks like the following example:

# Sample health policy based on node health checking
type: senlin.policy.health
version: 1.1
description: A policy for maintaining node health in a cluster.
properties:
  detection:
    # Number of seconds between two consecutive health checks
    interval: 600

    detection_modes:
      # Type for health checking, valid values include:
      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
      - type: NODE_STATUS_POLLING

  recovery:
    # Action to try on a failed node. Support for multiple actions
    # is planned for the future. Valid values include:
    # REBOOT, REBUILD, RECREATE
    actions:
      - name: RECREATE

There are two groups of properties (detection and recovery), which provide information about failure detection and failure recovery, respectively.

For failure detection, you can specify a detection mode that can be one of the following three values:

  • NODE_STATUS_POLLING: The Senlin engine (more specifically, the health manager service) is expected to poll every node periodically to find out whether it is “alive” or not.

  • NODE_STATUS_POLL_URL: The Senlin engine (more specifically, the health manager service) is expected to poll the specified URL periodically to find out whether a node is considered healthy or not.

  • LIFECYCLE_EVENTS: Many services can emit notification messages on the message queue when configured to do so. The Senlin engine is expected to listen to these events and react to them appropriately.

It is possible to combine NODE_STATUS_POLLING and NODE_STATUS_POLL_URL detection by specifying multiple detection modes. When multiple detection modes are given, the Senlin engine tries each detection type in the order specified. The recovery_conditional property determines whether a node is recovered as soon as any one check fails (ANY_FAILED) or only after all checks fail (ALL_FAILED).

LIFECYCLE_EVENTS cannot be combined with any other detection type.
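As an illustration of combining detection modes, the following spec is a minimal sketch rather than a tested configuration. The interval value and the poll URL are hypothetical, and the remaining poll_url_* options are left out for brevity.

```yaml
# Sketch only: two detection modes combined in one health policy.
# The URL and the numeric values are illustrative assumptions.
type: senlin.policy.health
version: 1.1
description: Health policy combining status polling and URL polling.
properties:
  detection:
    interval: 120
    # Recover a node only when every detection mode reports a failure.
    recovery_conditional: ALL_FAILED
    detection_modes:
      # Modes are tried in the order listed here.
      - type: NODE_STATUS_POLLING
      - type: NODE_STATUS_POLL_URL
        options:
          # Hypothetical endpoint; other poll_url_* options omitted.
          poll_url: http://health.example.com/status
  recovery:
    actions:
      - name: RECREATE
```

Switching recovery_conditional to ANY_FAILED would make a single failed check sufficient to trigger recovery.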

All detection types can carry an optional map of options. When the detection type is set to “NODE_STATUS_POLL_URL”, for example, you can set the poll_url property to the URL to be used for health checking.

As the policy type implementation stabilizes, more options may be added later.
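The fragment below sketches how the options map might look for URL polling. The option names come from the property list above, while the endpoint, the response pattern, and the retry numbers are assumptions chosen only for illustration.

```yaml
# Fragment of the "properties" section; all values are assumed examples.
detection:
  interval: 120
  detection_modes:
    - type: NODE_STATUS_POLL_URL
      options:
        # Hypothetical health endpoint.
        poll_url: https://health.example.com/status
        poll_url_ssl_verify: true
        # Treat a connection error as an unhealthy node.
        poll_url_conn_error_as_unhealthy: true
        # Pattern expected in the response body of a healthy node.
        poll_url_healthy_response: "passing"
        # Retry up to 3 times, 5 seconds apart, before the node
        # is considered down.
        poll_url_retry_limit: 3
        poll_url_retry_interval: 5
```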

For failure recovery, there are currently two properties: actions and fencing. The actions property takes a list of actions, each with a name and an optional map of parameters specific to that action. For example, the REBOOT action can be accompanied by a type parameter that indicates whether the intended reboot operation is a soft reboot or a hard reboot.
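A recovery section using REBOOT with a type parameter could be sketched as follows. The exact spelling of the parameter value (shown here as soft) is an assumption; check the profile documentation for the accepted values.

```yaml
# Fragment of the "properties" section.
recovery:
  actions:
    # Only one action can be listed at present.
    - name: REBOOT
      params:
        # Assumed value spelling; the alternative would be a hard reboot.
        type: soft
```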

Note

The plan for recovery actions is to support a list of actions that the Senlin engine can try one by one. Currently, you can specify only one action due to an implementation limitation.

Another extension to the recovery action is to add triggers to user-provided workflows. This is also under development.

Validation

Due to an implementation limitation, you can currently specify only one action for the recovery.actions property. This constraint will be removed once support for action lists is completed.

Fencing

Fencing may be an important step during a reliable node recovery process. Without fencing, we cannot ensure that the compute, network and/or storage resources are in a consistent, predictable state. However, fencing is very difficult because it always involves an out-of-band operation on the resource controller, for example, an IPMI command to power off a physical host sent to a specific IP address.

Currently, the health policy only supports the fencing of virtual machines by forcibly deleting them before taking measures to recover them.
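A recovery section that requests fencing might look like the sketch below. The service name COMPUTE is an assumption about the accepted value for virtual machine fencing and should be verified against the policy schema.

```yaml
# Fragment of the "properties" section.
recovery:
  actions:
    - name: RECREATE
  fencing:
    # Assumed service name for fencing virtual machines.
    - COMPUTE
```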

Snapshots

There have been some requests to take snapshots of a node before recovery so that the recovered node(s) can resume from where they failed. This feature is also on the TODO list for the development team.

References

For more detailed information on how the health policy works, please see Health Policy V1.1.