Available Plugins

In this section we present all the plugins that are shipped along with Watcher. If you want to know which plugins your Watcher services have access to, you can use the Guru Meditation Reports to display them.

Goals

airflow_optimization

AirflowOptimization

This goal is used to optimize the airflow within a cloud infrastructure.

cluster_maintaining

ClusterMaintenance

This goal is used to maintain compute nodes without having the user’s application being interrupted.

dummy

Dummy

Reserved goal that is used for testing purposes.

hardware_maintenance

HardwareMaintenance

This goal is used to migrate instances and volumes away from a set of compute and storage nodes that are under maintenance.

noisy_neighbor

NoisyNeighborOptimization

This goal is used to identify and migrate a Noisy Neighbor: a low-priority VM that degrades the performance of a high-priority VM, measured in IPC (instructions per cycle), by over-utilizing the Last Level Cache.

saving_energy

SavingEnergy

This goal is used to reduce power consumption within a data center.

server_consolidation

ServerConsolidation

This goal is for efficient usage of compute server resources in order to reduce the total number of servers.

thermal_optimization

ThermalOptimization

This goal is used to balance the temperature across different servers.

unclassified

Unclassified

This goal is used to ease the development process of a strategy. Because it contains no actual indicator specification, this goal can be used whenever a strategy has yet to be formally associated with an existing goal. If the goal to achieve has been identified but no implementation is available, this goal can also be used as a transitional stage.

workload_balancing

WorkloadBalancing

This goal is used to evenly distribute workloads across different servers.

Scoring Engines

dummy_scorer

Sample Scoring Engine implementing simplified workload classification.

Typically, a scoring engine would be implemented using machine learning techniques. For example, for a workload classification problem the solution could consist of the following steps:

  1. Define a problem to solve: we want to detect the workload on the machine based on the collected metrics like power consumption, temperature, CPU load, memory usage, disk usage, network usage, etc.

  2. The workloads could be predefined, e.g. IDLE, CPU-INTENSIVE, MEMORY-INTENSIVE, IO-BOUND, … Or we could let the ML algorithm find the workloads based on the learning data provided. This decision determines the kind of learning algorithm used (supervised vs. unsupervised learning).

  3. Collect metrics from sample servers (learning data).

  4. Define the analytical model, pick ML framework and algorithm.

  5. Apply learning data to the data model. Once taught, the data model becomes a scoring engine and can start doing predictions or classifications.

  6. Wrap up the scoring engine with the class like this one, so it has a standard interface and can be used inside Watcher.

This class is a greatly simplified version of the above model. The goal is to provide an example of how such a class could be implemented and used in Watcher, without adding dependencies like machine learning frameworks (which can be quite heavy) or over-complicating its internal implementation, which could distract from the overall picture.

That said, this class implements a workload classification “manually” (in plain python code) and is not intended to be used in production.
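As an illustration of what "manual" classification in plain Python can mean, here is a minimal sketch. The function name, the feature names and the 80 percent thresholds are all hypothetical; they are not the rules used by Watcher's dummy_scorer.

```python
# Hypothetical illustration only: the feature names and thresholds below are
# assumptions, not the rules implemented by Watcher's dummy_scorer.
WORKLOADS = ("IDLE", "CPU-INTENSIVE", "MEMORY-INTENSIVE", "IO-BOUND")


def classify_workload(cpu_load, memory_usage, disk_usage):
    """Return a workload label from utilization values given in percent."""
    if cpu_load >= 80:
        return "CPU-INTENSIVE"
    if memory_usage >= 80:
        return "MEMORY-INTENSIVE"
    if disk_usage >= 80:
        return "IO-BOUND"
    # Nothing is saturated, so this sketch treats the machine as idle.
    return "IDLE"
```

A real scoring engine would replace these hand-written rules with a trained model, while keeping the same kind of interface.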

Scoring Engine Containers

dummy_scoring_container

Sample Scoring Engine container returning a list of scoring engines.

Note that it can be used in dynamic scenarios, and the returned list might contain instances based on some external configuration (e.g. in a database). For these scoring engines to become discoverable in the Watcher API and the Watcher CLI, a database re-sync is required. It can be run with the watcher-sync tool, for example.

Strategies

actuator

Actuator

Actuator that simply executes the actions given as parameters.

This strategy allows anyone to create an action plan with a predefined set of actions. This strategy can be used for 2 different purposes:

  • Test actions

  • Use this strategy based on an event trigger to perform some explicit task

basic

Good server consolidation strategy

Basic offline consolidation using live migration

Consolidation of VMs is essential to achieve energy optimization in cloud environments such as OpenStack. As VMs are spun up and/or moved over time, it becomes necessary to migrate VMs among servers to lower the costs. However, migration of VMs introduces runtime overheads and consumes extra energy, so a good server consolidation strategy should plan migrations carefully in order to both minimize energy consumption and comply with the various SLAs.

This algorithm not only minimizes the overall number of used servers, but also minimizes the number of migrations.

It has been developed only for tests. You must have at least 2 physical compute nodes to run it, so you can easily run it on DevStack. It assumes that live migration is possible on your OpenStack cluster.

dummy

Dummy strategy used for integration testing via Tempest

Description

This strategy does not provide any useful optimization. Its only purpose is to be used by Tempest tests.

Requirements

<None>

Limitations

Do not use in production.

Spec URL

<None>

dummy_with_resize

Dummy strategy used for integration testing via Tempest

Description

This strategy does not provide any useful optimization. Its only purpose is to be used by Tempest tests.

Requirements

<None>

Limitations

Do not use in production.

Spec URL

<None>

dummy_with_scorer

A dummy strategy using dummy scoring engines.

This is a dummy strategy demonstrating how to work with scoring engines. One scoring engine is predicting the workload type of a machine based on the telemetry data, the other one is simply calculating the average value for given elements in a list. Results are then passed to the NOP action.

The strategy presents the whole workflow:

  • Get a reference to a scoring engine

  • Prepare input data (features) for score calculation

  • Perform score calculation

  • Use scorer’s metadata for results interpretation
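The workflow steps can be sketched as follows. This is only an illustration: the `AverageScorer` class, its method names, the metadata layout and the JSON-string input and output convention are assumptions made for this example, not Watcher's actual scoring engine classes.

```python
import json


class AverageScorer:
    """Hypothetical stand-in for a scoring engine that averages a list."""

    def get_metainfo(self):
        # Metadata telling the caller how to read the result (assumed format).
        return json.dumps({"result_fields": ["average"]})

    def calculate_score(self, features):
        # Features arrive as a JSON-encoded list of numbers (assumption).
        values = json.loads(features)
        return json.dumps([sum(values) / len(values)])


def run_workflow(values):
    """Walk through the four steps and return the interpreted result."""
    scorer = AverageScorer()                        # get a scoring engine
    features = json.dumps(values)                   # prepare input features
    raw_result = scorer.calculate_score(features)   # perform score calculation
    fields = json.loads(scorer.get_metainfo())["result_fields"]
    # Use the scorer's metadata to label each value in the result.
    return dict(zip(fields, json.loads(raw_result)))
```

In the dummy strategy the interpreted results are then handed to a NOP action; the sketch simply returns them.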

host_maintenance

[PoC] Host Maintenance

Description

It is a migration strategy for maintaining one compute node without interrupting the user’s application. If a backup node is given, the strategy first migrates all instances from the maintenance node to the backup node. If no backup node is provided, it migrates all instances, relying on nova-scheduler.

Requirements

  • You must have at least 2 physical compute nodes to run this strategy.

Limitations

  • This is a proof of concept that is not meant to be used in production

  • It migrates all instances from one host to other hosts. It is better to run this strategy when the load is not heavy, and to use it with a ONESHOT audit.

  • It assumes that cold and live migrations are possible.

node_resource_consolidation

Consolidating resources on nodes using server migration

Description

This strategy checks the resource usage of compute nodes. If the used resources are less than the total, it tries to migrate servers to consolidate resource use.

Requirements

  • You must have at least 2 compute nodes to run this strategy.

  • Hardware: compute nodes should use the same physical CPUs/RAMs

Limitations

  • This is a proof of concept that is not meant to be used in production

  • It assumes that live migrations are possible.

Spec URL

http://specs.openstack.org/openstack/watcher-specs/specs/train/implemented/node-resource-consolidation.html

noisy_neighbor

Noisy Neighbor strategy using live migration

Description

This strategy can identify and migrate a Noisy Neighbor: a low-priority VM that degrades the performance of a high-priority VM, measured in IPC (instructions per cycle), by over-utilizing the Last Level Cache.

Requirements

To enable the LLC metric, a recent Intel server with CMT support is required.

Limitations

This is a proof of concept that is not meant to be used in production

Spec URL

http://specs.openstack.org/openstack/watcher-specs/specs/pike/implemented/noisy_neighbor_strategy.html

outlet_temperature

[PoC] Outlet temperature control using live migration

Description

It is a migration strategy based on the outlet temperature of compute hosts. It generates solutions to move a workload whenever a server’s outlet temperature is higher than the specified threshold.

Requirements

  • Hardware: All compute hosts should support IPMI and PTAS technology.

  • Software: The Ceilometer component ceilometer-agent-ipmi must be running on each compute host, and the Ceilometer API must successfully report the hardware.ipmi.node.outlet_temperature telemetry.

  • You must have at least 2 physical compute hosts to run this strategy.

Limitations

  • This is a proof of concept that is not meant to be used in production

  • We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.

  • It assumes that live migrations are possible.

Spec URL

https://github.com/openstack/watcher-specs/blob/master/specs/mitaka/implemented/outlet-temperature-based-strategy.rst

saving_energy

Saving Energy Strategy

Description

Saving Energy Strategy together with VM Workload Consolidation Strategy can perform the Dynamic Power Management (DPM) functionality, which tries to save power by dynamically consolidating workloads even further during periods of low resource utilization. Virtual machines are migrated onto fewer hosts and the unneeded hosts are powered off.

After consolidation, Saving Energy Strategy produces a solution of powering off/on according to the following detailed policy:

In this policy, the user provides a preset number (min_free_hosts_num), which is the minimum number of free compute nodes the user expects to have. Here “free compute nodes” refers to nodes that are unused but still powered on.

If the actual number of unused nodes in the power-on state is larger than the given number, the redundant nodes are selected randomly and powered off. If the actual number of unused nodes in the power-on state is smaller than the given number and there are spare unused nodes in the power-off state, some of those nodes are selected randomly and powered on.

Requirements

In this policy, in order to calculate the min_free_hosts_num, users must provide two parameters:

  • One parameter (“min_free_hosts_num”) is a constant integer larger than zero.

  • The other parameter (“free_used_percent”) is a percentage that describes the quotient min_free_hosts_num/nodes_with_VMs_num, where nodes_with_VMs_num is the number of nodes that have VMs running on them. This parameter is used to calculate a dynamic min_free_hosts_num.

The larger of the two values is used as the final min_free_hosts_num.
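The arithmetic of this policy can be sketched as below. The function names are hypothetical, and rounding the dynamic value up with `math.ceil` is an assumption of this sketch; the text above does not say how a fractional result is rounded.

```python
import math


def final_min_free_hosts(min_free_hosts_num, free_used_percent,
                         nodes_with_vms_num):
    """Return the larger of the constant and the dynamic minimum."""
    # Dynamic minimum derived from the percentage (rounding up is assumed).
    dynamic = math.ceil(nodes_with_vms_num * free_used_percent / 100.0)
    return max(min_free_hosts_num, dynamic)


def plan_power_changes(free_powered_on, free_powered_off, min_free):
    """Return (nodes to power off, nodes to power on) for the policy."""
    if free_powered_on > min_free:
        return free_powered_on - min_free, 0
    if free_powered_on < min_free:
        # Only spare nodes that are currently powered off can be powered on.
        return 0, min(min_free - free_powered_on, free_powered_off)
    return 0, 0
```

For example, with min_free_hosts_num set to 1, free_used_percent set to 25 and 10 nodes running VMs, the dynamic value is 3 and therefore wins. Which nodes are picked is random in the strategy; the sketch only computes how many.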

Limitations

  • You must have at least 2 physical compute hosts.

Spec URL

http://specs.openstack.org/openstack/watcher-specs/specs/pike/implemented/energy-saving-strategy.html

storage_capacity_balance

Storage capacity balance using cinder volume migration

Description

This strategy migrates volumes based on the workload of the cinder pools. It decides to migrate a volume whenever a pool’s utilization percentage is higher than the specified threshold. The volume to be moved should bring the pool close to the average workload of all cinder pools.

Requirements

  • You must have at least 2 cinder volume pools to run this strategy.

Limitations

  • Volume migration depends on the storage device. It may take a long time.

Spec URL

http://specs.openstack.org/openstack/watcher-specs/specs/queens/implemented/storage-capacity-balance.html

uniform_airflow

[PoC] Uniform Airflow using live migration

Description

It is a migration strategy based on the airflow of physical servers. It generates solutions to move a VM whenever a server’s airflow is higher than the specified threshold.

Requirements

  • Hardware: compute node with NodeManager 3.0 support

  • Software: The Ceilometer component ceilometer-agent-compute must be running on each compute node, and the Ceilometer API must successfully report the “airflow, system power, inlet temperature” telemetry.

  • You must have at least 2 physical compute nodes to run this strategy

Limitations

  • This is a proof of concept that is not meant to be used in production.

  • We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.

  • It assumes that live migrations are possible.

vm_workload_consolidation

VM Workload Consolidation Strategy

A load consolidation strategy based on a heuristic first-fit algorithm that focuses on measured CPU utilization and tries to minimize the number of hosts that have too much or too little load, while respecting resource capacity constraints.

This strategy produces a solution resulting in more efficient utilization of cluster resources using the following four phases:

  • Offload phase - handling over-utilized resources

  • Consolidation phase - handling under-utilized resources

  • Solution optimization - reducing number of migrations

  • Disabling of unused compute nodes

Capacity coefficients (cc) can be used to adjust the optimization thresholds. Different resources may require different coefficient values, and using different coefficient values in the two phases may lead to more efficient consolidation in the end. If cc equals 1, the full resource capacity may be used; cc values lower than 1 lead to resource under-utilization, and values higher than 1 lead to resource overbooking. For example, if the targeted utilization is 80 percent of a compute node’s capacity, the coefficient in the consolidation phase will be 0.8, but it may be any lower value in the offloading phase. The lower it gets, the more released (distributed) the cluster will appear for the following consolidation phase.
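The effect of a capacity coefficient can be shown with a small sketch. The function name and the single-resource model are hypothetical simplifications; the strategy itself considers several resources.

```python
def fits_on_node(used, requested, total_capacity, cc):
    """Return True if the request fits under the cc-scaled capacity."""
    # cc < 1 leaves headroom, cc == 1 uses the full capacity,
    # cc > 1 allows overbooking.
    return used + requested <= total_capacity * cc
```

With a total capacity of 100 units and 60 already used, a request for 25 units is rejected at cc = 0.8 (85 exceeds 80) but accepted at cc = 1.0.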

As this strategy leverages VM live migration to move the load from one compute node to another, this feature needs to be set up correctly on all compute nodes within the cluster. This strategy assumes it is possible to live migrate any VM from an active compute node to any other active compute node.

workload_balance

[PoC] Workload balance using live migration

Description

It is a migration strategy based on the VM workload of physical servers. It generates solutions to move a workload whenever a server’s CPU or RAM utilization % is higher than the specified threshold. The VM to be moved should make the host close to average workload of all compute nodes.
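One way to read "close to average" is to pick the VM whose removal leaves the overloaded host nearest to the cluster average. The sketch below assumes that reading, a single load metric, and hypothetical names; it is not the strategy's actual selection code.

```python
def pick_vm_to_move(host_load, vm_loads, average_load):
    """Return the name of the VM whose removal best approaches the average.

    vm_loads maps a VM name to the load it contributes to the host.
    """
    return min(
        vm_loads,
        key=lambda name: abs((host_load - vm_loads[name]) - average_load),
    )
```

For a host at 90 percent with VMs contributing 10, 30 and 50 percent and a cluster average of 60 percent, moving the 30 percent VM leaves the host exactly at the average.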

Requirements

  • Hardware: compute node should use the same physical CPUs/RAMs

  • Software: The Ceilometer component ceilometer-agent-compute must be running on each compute node, and the Ceilometer API must successfully report the “instance_cpu_usage” and “instance_ram_usage” telemetry.

  • You must have at least 2 physical compute nodes to run this strategy.

Limitations

  • This is a proof of concept that is not meant to be used in production

  • We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.

  • It assumes that live migrations are possible.

workload_stabilization

Workload Stabilization control using live migration

This is a workload stabilization strategy based on a standard deviation algorithm. The goal is to determine whether there is an overload in a cluster and to respond to it by migrating VMs to stabilize the cluster.

This strategy has been tested in a small (32 nodes) cluster.

It assumes that live migrations are possible in your cluster.
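The standard deviation check can be illustrated with a minimal sketch. The function name, the normalized load values and the use of the population standard deviation are assumptions of this example, not details confirmed by the description above.

```python
import statistics


def is_overloaded(host_loads, threshold):
    """Return True when the spread of host loads exceeds the threshold."""
    # Population standard deviation of the per-host load (assumed variant).
    return statistics.pstdev(host_loads) > threshold
```

A cluster with host loads of 0.9, 0.1 and 0.5 has a deviation of roughly 0.33, so it would be flagged with a threshold of 0.2, while three hosts at 0.5 each would not.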

zone_migration

Zone migration using instance and volume migration

This is a zone migration strategy that migrates many instances and volumes efficiently, with minimum downtime, for hardware maintenance.

Actions

change_node_power_state

Compute node power on/off

By using this action, you will be able to power a compute node on or off.

The action schema is:

schema = Schema({
 'resource_id': str,
 'state': str,
})

The resource_id references a baremetal node id (list of available ironic nodes is returned by this command: ironic node-list). The state value should either be on or off.

change_nova_service_state

Disables or enables the nova-compute service, deployed on a host

By using this action, you will be able to update the state of a nova-compute service. A disabled nova-compute service cannot be selected by the nova scheduler for future server deployments.

The action schema is:

schema = Schema({
 'resource_id': str,
 'state': str,
 'disabled_reason': str,
})

The resource_id references a nova-compute service name (list of available nova-compute services is returned by this command: nova service-list --binary nova-compute). The state value should either be ONLINE or OFFLINE. The disabled_reason references the reason why Watcher disables this nova-compute service. The value should be with watcher_ prefix, such as watcher_disabled, watcher_maintaining.

migrate

Migrates a server to a destination nova-compute host

This action will allow you to migrate a server to another compute destination host. Migration type ‘live’ can only be used for migrating active VMs. Migration type ‘cold’ can be used for migrating non-active VMs as well as active VMs, which will be shut down while migrating.

The action schema is:

schema = Schema({
 'resource_id': str,  # should be a UUID
 'migration_type': str,  # choices -> "live", "cold"
 'destination_node': str,
 'source_node': str,
})

The resource_id is the UUID of the server to migrate. The source_node and destination_node parameters are respectively the source and the destination compute hostname (list of available compute hosts is returned by this command: nova service-list --binary nova-compute).

Note

Nova API version must be 2.56 or above if destination_node parameter is given.
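To make the parameter rules concrete, here is a sketch of a standard-library check that mirrors the description above. The `validate_migrate_params` helper is hypothetical and is not how Watcher validates actions (the schema shown above does that). Treating source_node and destination_node as optional is also an assumption of this sketch.

```python
import uuid

MIGRATION_TYPES = ("live", "cold")


def validate_migrate_params(params):
    """Check a migrate parameter dict; raise ValueError when it is invalid."""
    resource_id = params.get("resource_id")
    if not isinstance(resource_id, str):
        raise ValueError("resource_id must be a string")
    uuid.UUID(resource_id)  # raises ValueError for a malformed UUID
    if params.get("migration_type") not in MIGRATION_TYPES:
        raise ValueError("migration_type must be 'live' or 'cold'")
    for key in ("source_node", "destination_node"):
        if key in params and not isinstance(params[key], str):
            raise ValueError(key + " must be a string")
    return params


# Made-up values for illustration only.
example_params = {
    "resource_id": "9b8d3d0e-2f7c-4b55-9a3e-1c2d3e4f5a6b",
    "migration_type": "live",
    "source_node": "compute-1",
    "destination_node": "compute-2",
}
```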

nop

Logs a message

The action schema is:

schema = Schema({
 'message': str,
})

The message is the actual message that will be logged.

resize

Resizes a server with specified flavor.

This action will allow you to resize a server to another flavor.

The action schema is:

schema = Schema({
 'resource_id': str,  # should be a UUID
 'flavor': str,  # should be either ID or Name of Flavor
})

The resource_id is the UUID of the server to resize. The flavor is the ID or Name of Flavor (Nova accepts either ID or Name of Flavor to resize() function).

sleep

Makes the executor of the action plan wait for a given duration

The action schema is:

schema = Schema({
 'duration': float,
})

The duration is expressed in seconds.

volume_migrate

Migrates a volume to destination node or type

By using this action, you will be able to migrate a cinder volume. Migration type ‘swap’ can only be used for migrating an attached volume. Migration type ‘migrate’ can be used for migrating a detached volume to a pool of the same volume type. Migration type ‘retype’ can be used for changing the volume type of a detached volume.

The action schema is:

schema = Schema({
    'resource_id': str,  # should be a UUID
    'migration_type': str,  # choices -> "swap", "migrate", "retype"
    'destination_node': str,
    'destination_type': str,
})

The resource_id is the UUID of the cinder volume to migrate. The destination_node is the destination block storage pool name (the list of available pools is returned by this command: cinder get-pools); it is mandatory for migrating a detached volume to a pool with the same volume type. The destination_type is the destination block storage type name (the list of available types is returned by this command: cinder type-list); it is mandatory for migrating a detached volume or swapping an attached volume to one with a different volume type.

Workflow Engines

taskflow

Taskflow as a workflow engine for Watcher

Full documentation on taskflow at https://docs.openstack.org/taskflow/latest

Planners

node_resource_consolidation

Node Resource Consolidation planner implementation

This implementation preserves the original order of actions in the solution and tries to parallelize actions that have the same action type.

Limitations

  • This is a proof of concept that is not meant to be used in production

weight

Weight planner implementation

This implementation builds actions with parents in accordance with weights. Set of actions having a higher weight will be scheduled before the other ones. There are two config options to configure: action_weights and parallelization.

Limitations

  • This planner requires the action_weights and parallelization config options to be well tuned.

workload_stabilization

Workload Stabilization planner implementation

This implementation comes with basic rules based on a set of weighted action types. An action with a lower weight is scheduled before the others. The set of action types can be specified by ‘weights’ in watcher.conf. You need to associate a different weight with every available action in the configuration file; otherwise you will get an error when a new action is referenced in the solution produced by a strategy.
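The ordering rule and the error for an unweighted action type can be sketched as follows. The function name, the action dictionaries and the weight values are hypothetical; the real planner reads its weights from watcher.conf and builds a full action plan.

```python
def schedule_by_weight(actions, weights):
    """Order actions so that lower-weight action types come first."""
    unknown = sorted({a["action_type"] for a in actions} - set(weights))
    if unknown:
        raise KeyError("no weight configured for: " + ", ".join(unknown))
    # sorted() is stable, so actions of equal weight keep their relative order.
    return sorted(actions, key=lambda a: weights[a["action_type"]])


# Made-up weights for illustration only.
example_weights = {"nop": 0, "sleep": 1, "migrate": 2}
example_actions = [
    {"action_type": "migrate"},
    {"action_type": "nop"},
    {"action_type": "sleep"},
]
```

Calling `schedule_by_weight(example_actions, example_weights)` places nop first and migrate last, while an action of type resize would raise an error because no weight is configured for it.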

Limitations

  • This is a proof of concept that is not meant to be used in production

Cluster Data Model Collectors

baremetal

Baremetal cluster data model collector

The Baremetal cluster data model collector creates an in-memory representation of the resources exposed by the baremetal service.

compute

Nova cluster data model collector

The Nova cluster data model collector creates an in-memory representation of the resources exposed by the compute service.

storage

Cinder cluster data model collector

The Cinder cluster data model collector creates an in-memory representation of the resources exposed by the storage service.