Available Plugins¶
In this section we present all the plugins that are shipped along with Watcher. If you want to know which plugins your Watcher services have access to, you can use the Guru Meditation Reports to display them.
Goals¶
airflow_optimization¶
AirflowOptimization
This goal is used to optimize the airflow within a cloud infrastructure.
cluster_maintaining¶
ClusterMaintenance
This goal is used to maintain compute nodes without interrupting the user’s application.
dummy¶
Dummy
Reserved goal that is used for testing purposes.
hardware_maintenance¶
HardwareMaintenance
This goal is used to migrate instances and volumes away from compute and storage nodes that are under maintenance.
noisy_neighbor¶
NoisyNeighborOptimization
This goal is used to identify and migrate a Noisy Neighbor - a low priority VM that negatively affects the performance of a high priority VM in terms of IPC by over-utilizing the Last Level Cache.
saving_energy¶
SavingEnergy
This goal is used to reduce power consumption within a data center.
server_consolidation¶
ServerConsolidation
This goal is for efficient usage of compute server resources in order to reduce the total number of servers.
thermal_optimization¶
ThermalOptimization
This goal is used to balance the temperature across different servers.
unclassified¶
Unclassified
This goal is used to ease the development process of a strategy. Containing no actual indicator specification, this goal can be used whenever a strategy has yet to be formally associated with an existing goal. If the goal to achieve has been identified but no implementation is available, this goal can also be used as a transitional stage.
workload_balancing¶
WorkloadBalancing
This goal is used to evenly distribute workloads across different servers.
Scoring Engines¶
dummy_scorer¶
Sample Scoring Engine implementing simplified workload classification.
Typically, a scoring engine would be implemented using machine learning techniques. For example, for a workload classification problem, the solution could consist of the following steps:
Define a problem to solve: we want to detect the workload on the machine based on collected metrics such as power consumption, temperature, CPU load, memory usage, disk usage, network usage, etc.
The workloads could be predefined, e.g. IDLE, CPU-INTENSIVE, MEMORY-INTENSIVE, IO-BOUND, … Or we could let the ML algorithm find the workloads based on the learning data provided. This decision determines the learning algorithm used (supervised vs. unsupervised learning).
Collect metrics from sample servers (learning data).
Define the analytical model, pick an ML framework and algorithm.
Apply learning data to the data model. Once taught, the data model becomes a scoring engine and can start doing predictions or classifications.
Wrap up the scoring engine in a class like this one, so it has a standard interface and can be used inside Watcher.
This class is a greatly simplified version of the above model. The goal is to provide an example of how such a class could be implemented and used in Watcher, without adding additional dependencies like machine learning frameworks (which can be quite heavy) or over-complicating its internal implementation, which can distract from looking at the overall picture.
That said, this class implements workload classification “manually” (in plain Python code) and is not intended to be used in production.
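The idea of a hand-written classifier behind a standard interface can be sketched as follows. This is only an illustration: the class name, method names, thresholds and labels are hypothetical and are not taken from Watcher's scoring engine code.

```python
import json


class SimpleWorkloadScorer:
    """Hypothetical scorer that classifies a workload from two metrics."""

    # Illustrative labels; the index in this list is the score returned.
    LABELS = ["IDLE", "CPU-INTENSIVE", "MEMORY-INTENSIVE"]

    def get_metadata(self):
        # Metadata lets the caller map the numeric result back to a label.
        return json.dumps({"labels": self.LABELS})

    def calculate_score(self, features):
        # Features arrive as a JSON list: [cpu_load_percent, memory_usage_percent].
        cpu, mem = json.loads(features)
        if cpu < 20 and mem < 20:  # assumed thresholds, chosen for the example
            label = 0
        elif cpu >= mem:
            label = 1
        else:
            label = 2
        return json.dumps([label])


scorer = SimpleWorkloadScorer()
labels = json.loads(scorer.get_metadata())["labels"]
result = json.loads(scorer.calculate_score(json.dumps([85, 30])))
print(labels[result[0]])
```

Exchanging features and results as JSON strings keeps the interface independent of whatever model sits behind it, which is the point of wrapping the engine in a class.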
Scoring Engine Containers¶
dummy_scoring_container¶
Sample Scoring Engine container returning a list of scoring engines.
Please note that it can be used in dynamic scenarios, and the returned list might contain instances based on some external configuration (e.g. in a database). For these scoring engines to become discoverable in the Watcher API and Watcher CLI, a database re-sync is required. It can be executed using the watcher-sync tool, for example.
Strategies¶
actuator¶
Actuator
Actuator that simply executes the actions given as parameters.
This strategy allows anyone to create an action plan with a predefined set of actions. This strategy can be used for 2 different purposes:
Test actions
Use this strategy based on an event trigger to perform some explicit task
basic¶
Good server consolidation strategy
Basic offline consolidation using live migration
Consolidation of VMs is essential to achieve energy optimization in cloud environments such as OpenStack. As VMs are spun up and/or moved over time, it becomes necessary to migrate VMs among servers to lower the costs. However, migration of VMs introduces runtime overheads and consumes extra energy, so a good server consolidation strategy should carefully plan migrations in order to both minimize energy consumption and comply with the various SLAs.
This algorithm not only minimizes the overall number of used servers, but also minimizes the number of migrations.
It has been developed only for tests. You must have at least 2 physical compute nodes to run it, so you can easily run it on DevStack. It assumes that live migration is possible on your OpenStack cluster.
dummy¶
Dummy strategy used for integration testing via Tempest
Description
This strategy does not provide any useful optimization. Its only purpose is to be used by Tempest tests.
Requirements
<None>
Limitations
Do not use in production.
Spec URL
<None>
dummy_with_resize¶
Dummy strategy used for integration testing via Tempest
Description
This strategy does not provide any useful optimization. Its only purpose is to be used by Tempest tests.
Requirements
<None>
Limitations
Do not use in production.
Spec URL
<None>
dummy_with_scorer¶
A dummy strategy using dummy scoring engines.
This is a dummy strategy demonstrating how to work with scoring engines. One scoring engine is predicting the workload type of a machine based on the telemetry data, the other one is simply calculating the average value for given elements in a list. Results are then passed to the NOP action.
The strategy presents the whole workflow:
Get a reference to a scoring engine
Prepare input data (features) for score calculation
Perform score calculation
Use scorer’s metadata for results interpretation
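The four workflow steps can be sketched as below. The two scorer functions, the `ENGINES` registry and the message format are hypothetical stand-ins. A real strategy would obtain its scoring engines through Watcher's plugin mechanism rather than from a local dictionary.

```python
import json


# Hypothetical stand-ins for the two scoring engines described above.
def workload_scorer(features):
    cpu, mem = json.loads(features)
    return json.dumps([0 if cpu < 20 and mem < 20 else 1])


def average_scorer(features):
    values = json.loads(features)
    return json.dumps([sum(values) / len(values)])


# Assumed registry: engine name -> (callable, metadata).
ENGINES = {
    "workload": (workload_scorer, {"labels": ["IDLE", "BUSY"]}),
    "average": (average_scorer, {}),
}


def run_workflow(cpu, mem):
    # 1. Get a reference to a scoring engine.
    scorer, metadata = ENGINES["workload"]
    # 2. Prepare input data (features) for score calculation.
    features = json.dumps([cpu, mem])
    # 3. Perform score calculation.
    label_index = json.loads(scorer(features))[0]
    # 4. Use the scorer's metadata for results interpretation.
    workload = metadata["labels"][label_index]
    avg_scorer, _ = ENGINES["average"]
    average = json.loads(avg_scorer(features))[0]
    # The results are then handed to a NOP action as its message.
    return {"message": f"workload={workload} average={average}"}


print(run_workflow(90, 50))
```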
host_maintenance¶
[PoC] Host Maintenance
Description
It is a migration strategy for maintaining one compute node without interrupting the user’s application. If a backup node is given, the strategy first migrates all instances from the maintenance node to the backup node. If no backup node is provided, it migrates all instances, relying on nova-scheduler.
Requirements
You must have at least 2 physical compute nodes to run this strategy.
Limitations
This is a proof of concept that is not meant to be used in production.
It migrates all instances from one host to other hosts. It is better to execute this strategy when the load is not heavy, and to use it with a ONESHOT audit.
It assumes that cold and live migrations are possible.
node_resource_consolidation¶
Consolidating resources on nodes using server migration
Description
This strategy checks the resource usage of compute nodes. If the used resources are less than the total, it tries to migrate servers to consolidate resource use.
Requirements
You must have at least 2 compute nodes to run this strategy.
Hardware: compute nodes should use the same physical CPUs/RAMs
Limitations
This is a proof of concept that is not meant to be used in production.
It assumes that live migrations are possible.
Spec URL
noisy_neighbor¶
Noisy Neighbor strategy using live migration
Description
This strategy can identify and migrate a Noisy Neighbor - a low priority VM that negatively affects the performance of a high priority VM in terms of IPC by over-utilizing the Last Level Cache.
Requirements
To enable the LLC metric, a recent Intel server with CMT support is required.
Limitations
This is a proof of concept that is not meant to be used in production.
Spec URL
outlet_temperature¶
[PoC] Outlet temperature control using live migration
Description
It is a migration strategy based on the outlet temperature of compute hosts. It generates solutions to move a workload whenever a server’s outlet temperature is higher than the specified threshold.
Requirements
Hardware: All compute hosts should support IPMI and PTAS technology
Software: Ceilometer component ceilometer-agent-ipmi running in each compute host, and Ceilometer API can report such telemetry hardware.ipmi.node.outlet_temperature successfully.
You must have at least 2 physical compute hosts to run this strategy.
Limitations
This is a proof of concept that is not meant to be used in production.
We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.
It assumes that live migrations are possible.
Spec URL
saving_energy¶
Saving Energy Strategy
Description
Saving Energy Strategy together with VM Workload Consolidation Strategy can perform the Dynamic Power Management (DPM) functionality, which tries to save power by dynamically consolidating workloads even further during periods of low resource utilization. Virtual machines are migrated onto fewer hosts and the unneeded hosts are powered off.
After consolidation, Saving Energy Strategy produces a solution of powering off/on according to the following detailed policy:
In this policy, a preset number (min_free_hosts_num) is given by the user. It describes the minimum number of free compute nodes that the user expects to have, where “free compute nodes” refers to nodes that are unused but still powered on.
If the actual number of unused nodes (in power-on state) is larger than the given number, randomly select the redundant nodes and power them off. If the actual number of unused nodes (in power-on state) is smaller than the given number and there are spare unused nodes (in power-off state), randomly select some of those nodes (unused, powered off) and power them on.
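The policy above can be sketched as a small function. The function name and argument names are hypothetical, and capping the number of nodes to power on at the number of available powered-off nodes is an assumption, since the text only says to select "some" of them.

```python
import random


def plan_power_changes(unused_on, unused_off, min_free_hosts_num, rng=random):
    """Return (nodes_to_power_off, nodes_to_power_on) for the policy.

    unused_on:  unused nodes that are currently powered on
    unused_off: unused nodes that are currently powered off
    """
    surplus = len(unused_on) - min_free_hosts_num
    if surplus > 0:
        # More free nodes than required: power off the redundant ones.
        return rng.sample(unused_on, surplus), []
    if surplus < 0 and unused_off:
        # Too few free nodes: power on spare nodes (assumed cap, see above).
        count = min(-surplus, len(unused_off))
        return [], rng.sample(unused_off, count)
    return [], []


to_off, to_on = plan_power_changes(["n1", "n2", "n3"], ["n4"], 1)
print(len(to_off), len(to_on))
```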
Requirements
In this policy, in order to calculate the min_free_hosts_num, users must provide two parameters:
One parameter (“min_free_hosts_num”) is a constant number. It should be of int type and larger than zero.
The other parameter (“free_used_percent”) is a percentage that describes the quotient min_free_hosts_num/nodes_with_VMs_num, where nodes_with_VMs_num is the number of nodes with VMs running on them. This parameter is used to calculate a dynamic min_free_hosts_num.
Then choose the larger one as the final min_free_hosts_num.
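As a worked example of combining the two parameters, the sketch below takes the larger of the constant and the percentage-based value. The function name is hypothetical, and rounding the dynamic value up is an assumption, because the text does not say how fractions are handled.

```python
import math


def effective_min_free_hosts(min_free_hosts_num, free_used_percent,
                             nodes_with_vms_num):
    # Dynamic value: a percentage of the nodes that currently run VMs.
    # Rounding up is an assumption made for this example.
    dynamic = math.ceil(nodes_with_vms_num * free_used_percent / 100)
    # The larger of the constant and the dynamic value wins.
    return max(min_free_hosts_num, dynamic)


# 30 nodes with VMs at 10 percent gives 3, which exceeds the constant 1.
print(effective_min_free_hosts(1, 10, 30))
```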
Limitations
At least 2 physical compute hosts are required.
Spec URL
storage_capacity_balance¶
Storage capacity balance using cinder volume migration
Description
This strategy migrates volumes based on the workload of the cinder pools. It decides to migrate a volume whenever a pool’s utilization % is higher than the specified threshold. The volume to be moved should bring the pool close to the average workload of all cinder pools.
Requirements
You must have at least 2 cinder volume pools to run this strategy.
Limitations
Volume migration depends on the storage device. It may take a long time.
Spec URL
uniform_airflow¶
[PoC] Uniform Airflow using live migration
Description
It is a migration strategy based on the airflow of physical servers. It generates solutions to move a VM whenever a server’s airflow is higher than the specified threshold.
Requirements
Hardware: compute node with NodeManager 3.0 support
Software: Ceilometer component ceilometer-agent-compute running in each compute node, and Ceilometer API can report such telemetry “airflow, system power, inlet temperature” successfully.
You must have at least 2 physical compute nodes to run this strategy.
Limitations
This is a proof of concept that is not meant to be used in production.
We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.
It assumes that live migrations are possible.
vm_workload_consolidation¶
VM Workload Consolidation Strategy
A load consolidation strategy based on a heuristic first-fit algorithm, which focuses on measured CPU utilization and tries to minimize the number of hosts that have too much or too little load while respecting resource capacity constraints.
This strategy produces a solution resulting in more efficient utilization of cluster resources using the following four phases:
Offload phase - handling over-utilized resources
Consolidation phase - handling under-utilized resources
Solution optimization - reducing number of migrations
Disabling of unused compute nodes
Capacity coefficients (cc) may be used to adjust optimization thresholds. Different resources may require different coefficient values, and setting different coefficient values in the two phases may lead to more efficient consolidation in the end. If cc equals 1, the full resource capacity may be used; cc values lower than 1 lead to resource under-utilization, and values higher than 1 lead to resource overbooking. For example, if the targeted utilization is 80 percent of a compute node’s capacity, the coefficient in the consolidation phase will be 0.8, but it may be any lower value in the offloading phase. The lower it gets, the more released (distributed) the cluster will appear for the following consolidation phase.
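The effect of a capacity coefficient can be illustrated with a short sketch. The helper names and the 16-vCPU node are made up for the example; the only idea taken from the text is that the usable capacity is the total capacity scaled by cc.

```python
def usable_capacity(total_capacity, cc):
    # cc == 1 uses the full capacity, cc < 1 under-utilizes, cc > 1 overbooks.
    return total_capacity * cc


def is_overloaded(used, total_capacity, cc):
    # A node counts as overloaded once usage exceeds the scaled capacity.
    return used > usable_capacity(total_capacity, cc)


# A hypothetical 16-vCPU node with 14 vCPUs in use.
print(is_overloaded(14, 16, 0.8))  # target utilization of 80 percent
print(is_overloaded(14, 16, 1.0))  # full capacity allowed
print(usable_capacity(16, 1.5))    # overbooking
```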
As this strategy leverages VM live migration to move the load from one compute node to another, this feature needs to be set up correctly on all compute nodes within the cluster. This strategy assumes it is possible to live migrate any VM from an active compute node to any other active compute node.
workload_balance¶
[PoC] Workload balance using live migration
Description
It is a migration strategy based on the VM workload of physical servers. It generates solutions to move a workload whenever a server’s CPU or RAM utilization % is higher than the specified threshold. The VM to be moved should make the host close to average workload of all compute nodes.
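The selection rule in this description can be sketched as follows. The function is hypothetical and makes a simplifying assumption that a host's load is the sum of the loads of its VMs. It picks the VM whose removal leaves the host closest to the cluster average.

```python
def pick_vm_to_move(vm_loads, threshold, average_load):
    """vm_loads maps VM name -> load it contributes to the host."""
    host_load = sum(vm_loads.values())  # assumed: host load = sum of VM loads
    if host_load <= threshold:
        return None  # host is within limits, nothing to migrate
    excess = host_load - average_load
    # Choose the VM whose load is closest to the excess over the average.
    return min(vm_loads, key=lambda vm: abs(excess - vm_loads[vm]))


vms = {"vm1": 10, "vm2": 25, "vm3": 45}
print(pick_vm_to_move(vms, threshold=70, average_load=55))
```

Only one VM is returned per call, which matches the limitation that a single migration is planned at a time.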
Requirements
Hardware: compute node should use the same physical CPUs/RAMs
Software: Ceilometer component ceilometer-agent-compute running in each compute node, and Ceilometer API can report such telemetry “instance_cpu_usage” and “instance_ram_usage” successfully.
You must have at least 2 physical compute nodes to run this strategy.
Limitations
This is a proof of concept that is not meant to be used in production.
We cannot forecast how many servers should be migrated. This is the reason why we only plan a single virtual machine migration at a time. So it’s better to use this algorithm with CONTINUOUS audits.
It assumes that live migrations are possible.
workload_stabilization¶
Workload Stabilization control using live migration
This is a workload stabilization strategy based on a standard deviation algorithm. The goal is to determine if there is an overload in a cluster and respond to it by migrating VMs to stabilize the cluster.
This strategy has been tested in a small (32-node) cluster.
It assumes that live migrations are possible in your cluster.
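The overload check behind a standard deviation approach can be sketched like this. The function name and the threshold value are hypothetical, and using the population standard deviation rather than the sample one is an assumption.

```python
import statistics


def needs_rebalancing(host_loads, threshold):
    # A large spread of per-host load suggests some hosts are overloaded
    # while others are idle, so migrations may stabilize the cluster.
    return statistics.pstdev(host_loads) > threshold


print(needs_rebalancing([0.5, 0.5, 0.5], threshold=0.2))
print(needs_rebalancing([0.1, 0.9, 0.5], threshold=0.2))
```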
zone_migration¶
Zone migration using instance and volume migration
This is a zone migration strategy to migrate many instances and volumes efficiently with minimum downtime for hardware maintenance.
Actions¶
change_node_power_state¶
Compute node power on/off
By using this action, you will be able to power a compute node on or off.
The action schema is:
schema = Schema({
    'resource_id': str,
    'state': str,
})
The resource_id references a baremetal node id (the list of available ironic nodes is returned by this command: ironic node-list).
The state value should either be on or off.
change_nova_service_state¶
Disables or enables the nova-compute service, deployed on a host
By using this action, you will be able to update the state of a nova-compute service. A disabled nova-compute service cannot be selected by the nova scheduler for future deployment of servers.
The action schema is:
schema = Schema({
    'resource_id': str,
    'state': str,
    'disabled_reason': str,
})
The resource_id references a nova-compute service name (the list of available nova-compute services is returned by this command: nova service-list --binary nova-compute).
The state value should either be ONLINE or OFFLINE.
The disabled_reason references the reason why Watcher disables this nova-compute service. The value should carry the watcher_ prefix, such as watcher_disabled or watcher_maintaining.
migrate¶
Migrates a server to a destination nova-compute host
This action will allow you to migrate a server to another compute destination host. Migration type ‘live’ can only be used for migrating active VMs. Migration type ‘cold’ can be used for migrating non-active VMs as well as active VMs, which will be shut down while migrating.
The action schema is:
schema = Schema({
    'resource_id': str,  # should be a UUID
    'migration_type': str,  # choices -> "live", "cold"
    'destination_node': str,
    'source_node': str,
})
The resource_id is the UUID of the server to migrate.
The source_node and destination_node parameters are respectively the source and the destination compute hostname (the list of available compute hosts is returned by this command: nova service-list --binary nova-compute).
Note
Nova API version must be 2.56 or above if destination_node parameter is given.
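The constraints noted in the schema comments (a UUID for resource_id, a fixed set of choices for migration_type) can be checked with plain Python, as in the sketch below. The validator function and the host names are hypothetical; this is not how Watcher itself validates action parameters.

```python
import uuid

MIGRATION_TYPES = ("live", "cold")


def validate_migrate_params(params):
    # uuid.UUID raises ValueError when the string is not a valid UUID.
    uuid.UUID(params["resource_id"])
    if params["migration_type"] not in MIGRATION_TYPES:
        raise ValueError(f"migration_type must be one of {MIGRATION_TYPES}")
    # Node names are plain strings when they are given at all.
    for key in ("destination_node", "source_node"):
        if key in params and not isinstance(params[key], str):
            raise ValueError(f"{key} must be a string")
    return params


params = {
    "resource_id": str(uuid.uuid4()),
    "migration_type": "live",
    "source_node": "compute-1",       # hypothetical hostnames
    "destination_node": "compute-2",
}
print(validate_migrate_params(params)["migration_type"])
```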
nop¶
Logs a message
The action schema is:
schema = Schema({
    'message': str,
})
The message is the actual message that will be logged.
resize¶
Resizes a server with specified flavor.
This action will allow you to resize a server to another flavor.
The action schema is:
schema = Schema({
    'resource_id': str,  # should be a UUID
    'flavor': str,  # should be either ID or Name of Flavor
})
The resource_id is the UUID of the server to resize. The flavor is the ID or name of the flavor (Nova’s resize() function accepts either).
sleep¶
Makes the executor of the action plan wait for a given duration
The action schema is:
schema = Schema({
    'duration': float,
})
The duration is expressed in seconds.
volume_migrate¶
Migrates a volume to destination node or type
By using this action, you will be able to migrate a cinder volume. Migration type ‘swap’ can only be used for migrating an attached volume. Migration type ‘migrate’ can be used for migrating a detached volume to a pool of the same volume type. Migration type ‘retype’ can be used for changing the volume type of a detached volume.
The action schema is:
schema = Schema({
    'resource_id': str,  # should be a UUID
    'migration_type': str,  # choices -> "swap", "migrate", "retype"
    'destination_node': str,
    'destination_type': str,
})
The resource_id is the UUID of the cinder volume to migrate.
The destination_node is the destination block storage pool name (the list of available pools is returned by this command: cinder get-pools), which is mandatory for migrating a detached volume to a pool with the same volume type.
The destination_type is the destination block storage type name (the list of available types is returned by this command: cinder type-list), which is mandatory for migrating a detached volume or swapping an attached volume to one with a different volume type.
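How the mandatory parameters depend on the migration type can be sketched as a small check. The function is hypothetical and encodes only one reading of the rules above: ‘migrate’ needs a destination_node and ‘retype’ needs a destination_type. No rule is enforced for ‘swap’ here, since its requirement depends on whether the volume type changes.

```python
VOLUME_MIGRATION_TYPES = ("swap", "migrate", "retype")


def check_volume_migrate(params):
    mtype = params["migration_type"]
    if mtype not in VOLUME_MIGRATION_TYPES:
        raise ValueError(f"unknown migration_type: {mtype}")
    # Same volume type, different pool: the target pool must be named.
    if mtype == "migrate" and not params.get("destination_node"):
        raise ValueError("destination_node is required for 'migrate'")
    # Changing the volume type: the target type must be named.
    if mtype == "retype" and not params.get("destination_type"):
        raise ValueError("destination_type is required for 'retype'")
    # 'swap' is left unconstrained in this sketch (see the note above).
    return params


request = {
    "resource_id": "a-volume-uuid",       # placeholder, not a real UUID
    "migration_type": "migrate",
    "destination_node": "host@backend#pool",  # hypothetical pool name
}
print(check_volume_migrate(request)["destination_node"])
```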
Workflow Engines¶
taskflow¶
Taskflow as a workflow engine for Watcher
Full documentation on taskflow is available at https://docs.openstack.org/taskflow/latest
Planners¶
node_resource_consolidation¶
Node Resource Consolidation planner implementation
This implementation preserves the original order of actions in the solution and tries to parallelize actions which have the same action type.
Limitations
This is a proof of concept that is not meant to be used in production.
weight¶
Weight planner implementation
This implementation builds actions with parents in accordance with weights. Sets of actions having a higher weight will be scheduled before the other ones. There are two config options to configure: action_weights and parallelization.
Limitations
This planner requires the action_weights and parallelization config options to be well tuned.
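A simplified view of how the two weight planner options interact is sketched below. It orders actions by descending weight and then groups consecutive actions of the same type into batches no larger than the parallelization limit. The function, the batch representation and the weight values are hypothetical; the real planner builds parent relationships between actions rather than explicit batches.

```python
def schedule_by_weight(actions, action_weights, parallelization):
    # Higher weight first; sorted() is stable, so ties keep their order.
    ordered = sorted(
        actions,
        key=lambda a: action_weights[a["action_type"]],
        reverse=True,
    )
    batches = []
    for action in ordered:
        kind = action["action_type"]
        last = batches[-1] if batches else None
        # Extend the current batch only for the same type and below the limit.
        if last and last[0]["action_type"] == kind \
                and len(last) < parallelization[kind]:
            last.append(action)
        else:
            batches.append([action])
    return batches


# Illustrative values only, not Watcher defaults.
weights = {"change_nova_service_state": 50, "migrate": 30, "nop": 10}
limits = {"change_nova_service_state": 1, "migrate": 2, "nop": 1}
plan = [{"action_type": t} for t in
        ("nop", "migrate", "change_nova_service_state", "migrate", "migrate")]
for batch in schedule_by_weight(plan, weights, limits):
    print([a["action_type"] for a in batch])
```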
workload_stabilization¶
Workload Stabilization planner implementation
This implementation comes with basic rules with a set of action types that are weighted. An action having a lower weight will be scheduled before the other ones. The set of action types can be specified by ‘weights’ in watcher.conf. You need to associate a different weight with every available action in the configuration file; otherwise you will get an error when a new action is referenced in the solution produced by a strategy.
Limitations
This is a proof of concept that is not meant to be used in production.
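The ordering rule of the workload stabilization planner, including the error for an action type without a configured weight, can be sketched as follows. The function name, the exception type and the weight values are hypothetical.

```python
def order_actions(actions, weights):
    # Every action type in the solution must have a configured weight.
    missing = {a["action_type"] for a in actions} - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    # Lower weight is scheduled first.
    return sorted(actions, key=lambda a: weights[a["action_type"]])


weights = {"nop": 0, "sleep": 1, "migrate": 2}  # illustrative values
solution = [{"action_type": t} for t in ("migrate", "nop", "sleep")]
print([a["action_type"] for a in order_actions(solution, weights)])
```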
Cluster Data Model Collectors¶
baremetal¶
Baremetal cluster data model collector
The Baremetal cluster data model collector creates an in-memory representation of the resources exposed by the baremetal service.
compute¶
Nova cluster data model collector
The Nova cluster data model collector creates an in-memory representation of the resources exposed by the compute service.
storage¶
Cinder cluster data model collector
The Cinder cluster data model collector creates an in-memory representation of the resources exposed by the storage service.