Current Series Release Notes

14.0.0-89

New Features

  • Three new parameters have been added to the nop action:

    • fail_pre_condition: When set to true, the action fails during the pre_condition step.

    • fail_execute: When set to true, the action fails during the execute step.

    • fail_post_condition: When set to true, the action fails during the post_condition step.
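
    As an illustration of how these flags can be used to exercise failure handling, the sketch below models an action with three phases. The NopLikeAction class and the ActionFailed exception are hypothetical names and are not Watcher's implementation; only the three parameter names come from the note above.

```python
class ActionFailed(Exception):
    """Hypothetical error raised when a phase is configured to fail."""


class NopLikeAction:
    """Toy stand-in for an action with three sequential phases."""

    def __init__(self, fail_pre_condition=False, fail_execute=False,
                 fail_post_condition=False):
        self.fail_pre_condition = fail_pre_condition
        self.fail_execute = fail_execute
        self.fail_post_condition = fail_post_condition

    def run(self):
        """Run the phases in order and return the names of those completed."""
        completed = []
        phases = [
            ("pre_condition", self.fail_pre_condition),
            ("execute", self.fail_execute),
            ("post_condition", self.fail_post_condition),
        ]
        for name, should_fail in phases:
            if should_fail:
                # Stop at the first phase that is flagged to fail.
                raise ActionFailed(f"{name} failed after {completed}")
            completed.append(name)
        return completed
```

    With all flags left at false, run() completes all three phases; setting one flag stops the run at the corresponding phase.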

  • A new module, watcher.wsgi, has been added as a place to gather WSGI application objects. This is intended to ease deployment by providing a consistent location for these objects. For example, if using uWSGI then instead of:

    [uwsgi]
    wsgi-file = /bin/watcher-api-wsgi
    

    You can now use:

    [uwsgi]
    module = watcher.wsgi.api:application
    

    This also simplifies deployment with other WSGI servers that expect module paths, such as gunicorn.
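
    For readers unfamiliar with module-path deployment, the sketch below shows roughly what a WSGI server does with a string of the form module:attribute: it imports the module and looks up a callable. The load_wsgi_app and call_app helpers are hypothetical and use only the standard library; wsgiref's built-in demo_app stands in for an application object such as watcher.wsgi.api:application.

```python
import importlib
from wsgiref.util import setup_testing_defaults


def load_wsgi_app(path):
    """Resolve a 'package.module:attribute' string to a WSGI callable."""
    module_name, _, attr = path.partition(":")
    module = importlib.import_module(module_name)
    # Fall back to the conventional name when no attribute is given.
    return getattr(module, attr or "application")


def call_app(app):
    """Invoke a WSGI callable once with a synthetic GET / request."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

    For example, call_app(load_wsgi_app("wsgiref.simple_server:demo_app")) returns the response status and body produced by the standard-library demo application.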

  • A new Aetos data source has been added. It allows the Watcher decision engine to collect metrics through an Aetos reverse proxy server, which provides multi-tenant-aware access to Prometheus with Keystone authentication and role-based access control. The Aetos data source uses Keystone service discovery to locate the Aetos endpoint automatically and provides enhanced security compared to direct Prometheus access. For more information about the Aetos data source, including configuration options, see https://docs.openstack.org/watcher/latest/datasources/aetos.html

  • A new state SKIPPED has been added to the Actions. Actions can reach this state in two situations:

    • Watcher detects a specific pre-defined condition in the pre_condition phase.

    • An admin sets the state to SKIPPED using a call to the new PATCH API /actions/{action_id} before the action plan is started.

    An action in SKIPPED state will not be executed by Watcher as part of an ActionPlan run.

    Additionally, a new field status_message has been added to Audits, ActionPlans and Actions which will be used to provide additional details about the state of an object.

    All these changes have been introduced in a new Watcher API microversion 1.5.

    For additional information, see the API reference.
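
    A minimal sketch of such a request follows. It only builds the request object and does not send it. The JSON Patch body, the /v1 path prefix, the X-Auth-Token header, and the OpenStack-API-Version header value are assumptions that should be checked against the API reference; build_skip_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_skip_request(endpoint, action_id, token, reason):
    """Build (but do not send) a PATCH request marking an action SKIPPED."""
    # Assumed JSON Patch body; verify the exact format in the API reference.
    patch = [
        {"op": "replace", "path": "/state", "value": "SKIPPED"},
        {"op": "replace", "path": "/status_message", "value": reason},
    ]
    return urllib.request.Request(
        url=f"{endpoint}/v1/actions/{action_id}",
        data=json.dumps(patch).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,
            # Assumed microversion header; 1.5 is the version named above.
            "OpenStack-API-Version": "infra-optim 1.5",
        },
    )
```

    The returned object can be passed to urllib.request.urlopen once the endpoint and token are real values for the deployment.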

  • The compute model has been extended with additional server attributes to provide more detailed information about compute instances. These additions enable strategies to make more precise decisions by considering more server placement constraints. The new attributes are flavor extra specs and pinned availability zone. Each new attribute requires a minimum Nova microversion, which must be configured in the nova_client section of the Watcher configuration. Refer to the Nova API reference for the microversion each attribute requires: https://docs.openstack.org/api-ref/compute/ A new configuration option enables or disables collection of the extended attributes; collection is disabled by default.

  • The Decision Engine service can now run in native threading mode instead of using the Eventlet library. Native threading is still experimental, is disabled by default, and should not be used in production. To switch from Eventlet to native threading mode, add the environment variable OS_WATCHER_DISABLE_EVENTLET_PATCHING=true to the decision engine service configuration. For more information, see the eventlet removal documentation.

  • The Host Maintenance strategy now supports two new input parameters: disable_live_migration and disable_cold_migration. These parameters allow cloud administrators to control whether live, cold or no migration should be considered during host maintenance operations.

    • If disable_live_migration is set, active instances are cold migrated unless disable_cold_migration is also set, in which case they are stopped.

    • If disable_cold_migration is set, inactive instances will not be cold migrated.

    • If both are set, only stop actions will be applied on active instances.

    A new stop action has been introduced and registered to support scenarios where migration is disabled.
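
    The combinations above can be summarized as a small decision function. This is a sketch of the rules as described in this note, not the strategy's actual code: plan_instance_action and the returned labels are hypothetical, and the default branches (live migration for active instances, cold migration for inactive ones) are inferred from the wording.

```python
def plan_instance_action(is_active, disable_live_migration=False,
                         disable_cold_migration=False):
    """Return 'live_migrate', 'cold_migrate', 'stop' or None for one instance."""
    if is_active:
        if not disable_live_migration:
            return "live_migrate"
        if not disable_cold_migration:
            # Live migration disabled: fall back to cold migration.
            return "cold_migrate"
        # Both migration types disabled: only a stop action remains.
        return "stop"
    # Inactive instances are only ever candidates for cold migration.
    if disable_cold_migration:
        return None
    return "cold_migrate"
```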

Upgrade Notes

  • The default value of [keystone_client] interface has been changed from admin to public.

  • Watcher now requires Python 3.10 or newer. The last release of Watcher to support Python 3.9 was 2025.1. Ensure you have a supported Python version before upgrading.

  • Glance, Ironic, MAAS, and Neutron integrations with Watcher are now marked as Experimental and may be deprecated in a future release. These integrations have not been tested recently and may not be fully stable.

  • Monasca client dependency is now optional. The Monasca datasource remains deprecated for removal, and the python-monascaclient package is no longer installed by default. If you use the Monasca datasource, you MUST install the optional extra when upgrading. Behavior for deployments that do not use Monasca is unchanged.

    Example

    pip install watcher[monasca]
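
    One way to confirm before upgrading that the optional client is present in the target environment is to check whether its module can be found. This is a sketch: has_module is a hypothetical helper, and monascaclient is assumed to be the import name provided by python-monascaclient.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumed import name of the python-monascaclient package.
    print("monascaclient available:", has_module("monascaclient"))
```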
    
  • Watcher now listens by default to the exchange and channel that Cinder uses by default, ‘openstack.notifications’. The documented instructions for enabling Cinder notifications have been updated to make clear that the Cinder configuration does not need to be changed.

Deprecation Notes

  • The watcher-api-wsgi console script is deprecated for removal in a future release. This artifact is generated by a setuptools extension provided by PBR, which is also deprecated. Due to changes in Python packaging, this custom extension is planned to be removed from all OpenStack projects in a future PBR release in favor of module-based WSGI application entry points.

  • The Noisy Neighbor strategy is deprecated and will be removed in a future release. This strategy relies on Last Level Cache metrics that have not been available in Nova since the Victoria release.

  • The [collector] api_query_timeout option was deprecated in favor of the [collector] api_query_interval option.

  • The [collector] api_call_retries option was deprecated in favor of the [collector] api_query_max_retries option.

  • The [watcher_datasources] query_timeout option was deprecated in favor of the [watcher_datasources] query_interval option.

  • The following deprecated options were removed.

    • [gnocchi_client] query_timeout (Use [watcher_datasources] query_interval)

    • [gnocchi_client] query_max_retries (Use [watcher_datasources] query_max_retries)
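
    To help locate these options in an existing configuration file, the sketch below scans configuration text for the deprecated and removed names listed above and reports their replacements. find_renamed_options is a hypothetical helper built on the standard library's configparser, which only approximates how oslo.config parses files.

```python
import configparser

# (section, option) -> (section, option) replacement, per the notes above.
RENAMED = {
    ("collector", "api_query_timeout"): ("collector", "api_query_interval"),
    ("collector", "api_call_retries"): ("collector", "api_query_max_retries"),
    ("watcher_datasources", "query_timeout"):
        ("watcher_datasources", "query_interval"),
    ("gnocchi_client", "query_timeout"):
        ("watcher_datasources", "query_interval"),
    ("gnocchi_client", "query_max_retries"):
        ("watcher_datasources", "query_max_retries"),
}


def find_renamed_options(conf_text):
    """Return [((old_section, old_option), (new_section, new_option)), ...]."""
    parser = configparser.ConfigParser(
        interpolation=None,                # keep '%' characters literal
        strict=False,                      # tolerate repeated options
        default_section="__no_default__",  # treat [DEFAULT] as a normal section
    )
    parser.read_string(conf_text)
    return [
        (old, new)
        for old, new in RENAMED.items()
        if parser.has_option(*old)
    ]
```

    Passing the contents of a configuration file to find_renamed_options returns one entry per old option still present, paired with the option to use instead.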

Security Issues

  • Watcher no longer forges requests on behalf of a tenant when swapping volumes. Prior to this release, Watcher had two implementations for moving a volume: it could use Cinder's volume migrate API, or its own internal implementation that directly called the Nova volume attachment update API. The former is safe and is the recommended way to move volumes between Cinder storage backends. The internal implementation was insecure, fragile due to a lack of error handling, and capable of deleting user data.

    Insecure: the internal volume migration operation created a new Keystone user with a weak name and password and added it to the tenant's project with the admin role. It then used that user to forge requests on behalf of the tenant, with admin rights, to swap the volume. If the applier was restarted during the execution of this operation, the user would never be cleaned up.

    Fragile: the error handling was minimal. The swap volume API is asynchronous, so Watcher has to poll for completion, and there was no support for resuming the operation if it was interrupted or the timeout was exceeded.

    Data loss: although the internal polling logic returned success or failure, Watcher did not check the result. Once the function returned, it unconditionally deleted the source volume. For larger volumes this could result in irretrievable data loss.

    Finally, if a volume was swapped using the internal workflow, the Nova instance was left in an out-of-sync state. If the VM was live migrated after the swap volume completed successfully but before a hard reboot, the migration would either fail, or succeed and break tenant isolation.

    See https://bugs.launchpad.net/nova/+bug/2112187 for details.

Bug Fixes

  • When using the Prometheus datasource with more than one target sharing the same value for the fqdn_label, the driver used the wrong instance label to query for host metrics. Queries no longer use the instance label; they use the fqdn_label, which identifies all the metrics for a specific compute node. See Bug 2103451: https://bugs.launchpad.net/watcher/+bug/2103451 for more info.
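
    To illustrate the idea of selecting host metrics by an FQDN label rather than by the instance label, the sketch below builds a PromQL selector string. It is not the driver's code: host_selector is a hypothetical helper, and the metric name and the label name fqdn in the usage note are example values.

```python
def host_selector(metric, fqdn_label, fqdn):
    """Build a PromQL selector that matches a host by its FQDN label."""
    # Escape backslashes and double quotes so the label value stays valid.
    escaped = fqdn.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{{fqdn_label}="{escaped}"}}'
```

    For example, host_selector("node_memory_MemAvailable_bytes", "fqdn", "compute-0.example.org") yields a selector that matches every target carrying that FQDN label value, regardless of its instance label.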

  • When performing volume migration with the zone migration strategy, the Keystone service was reached through the admin endpoint by default. The default value of [keystone_client] interface has been fixed. See https://bugs.launchpad.net/watcher/+bug/2109494 for more info.

  • Previously, when users attempted to create a new audit without providing a name and a goal or an audit template, the API returned error 500 and an incorrect error message was displayed.

    Now, Watcher displays a helpful message and returns HTTP error 400.

    For more info see: https://bugs.launchpad.net/watcher/+bug/2110947

  • Previously, when Watcher applied a volume_migrate action with the value retype for migration_type, it could wrongly report the result of the action when the retype did not trigger a volume migration.

    The logic now validates the resulting state of the action and reports it correctly.

    For more details: https://bugs.launchpad.net/watcher/+bug/2112100

  • All code related to creating Keystone users and granting roles has been removed. The internal swap volume implementation has been removed and replaced by Cinder's volume migrate API. Note that, as part of this change, Watcher no longer attempts volume migrations or retypes if the instance is in the Verify Resize task state. This resolves several issues related to volume migration in the zone migration and storage capacity balance strategies. While efforts have been made to maintain backward compatibility, these changes are required to address a security weakness in Watcher’s prior approach.

    See https://bugs.launchpad.net/nova/+bug/2112187 for more context.

  • When running an audit with the workload_stabilization strategy and the instance_ram_usage metric in a deployment using the Prometheus datasource, the host RAM usage metric was reported in the wrong unit. The algorithm then applied the wrong scale factor, which led to incorrect standard deviations and action plans.

    The host ram usage metric is now properly reported in KB when using a prometheus datasource and the strategy workload_stabilization calculates the standard deviation properly.

    For more details: https://launchpad.net/bugs/2113776

  • Fixed the API reference documentation for GET /infra-optim/v1/data_model to include all fields that were missing from the response body. See Bug 2117726 for more details.

  • Fixed nova client microversion comparison in enable and disable compute service methods. The code was incorrectly comparing API versions, which caused failures for microversions greater than 2.99. For more details, see the bug report: https://bugs.launchpad.net/watcher/+bug/2120586
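
    The note does not describe the faulty code, so the sketch below only shows a generic pitfall and one robust alternative: comparing microversions as strings or floats breaks once the minor number reaches three digits, whereas comparing integer tuples does not. parse_microversion and supports are hypothetical helper names.

```python
def parse_microversion(version):
    """Turn a string like '2.100' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")
    return int(major), int(minor)


def supports(current, required):
    """Return True if microversion `current` is at least `required`."""
    return parse_microversion(current) >= parse_microversion(required)


if __name__ == "__main__":
    # Lexicographic and float comparisons both rank 2.100 below 2.53.
    print("string compare:", "2.100" >= "2.53")
    print("float compare: ", float("2.100") >= float("2.53"))
    print("tuple compare: ", supports("2.100", "2.53"))
```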

  • Fixed action status_message update restrictions to allow updates when action is in SKIPPED state. Previously, users could only update the status_message when initially changing the action state to SKIPPED. Now users can update the status_message field at any time while the action remains in SKIPPED state, enabling them to fix typos, provide more detailed explanations, or expand on reasons that were initially omitted. For more details, see the bug report: https://bugs.launchpad.net/watcher/+bug/2121601

  • The Host Maintenance strategy should migrate servers to the backup node if one is specified, or otherwise rely on the Nova scheduler. Previously, it enabled hosts that had been disabled with the watcher_disabled reason and migrated servers to those nodes. Because compute nodes are disabled for a reason, this could impact customer workloads.

    The Host Maintenance strategy now migrates servers only to the backup node, or relies on the Nova scheduler if no backup node is provided.

  • Removed the python-dateutil dependency from Watcher to reduce the number of external dependencies and improve maintainability.

  • Previously, if an action failed in an action plan, the state of the action plan was reported as SUCCEEDED once execution had finished, regardless of the outcome.

    Watcher will now reflect the actual state of all the actions in the plan after the execution has finished. If any action has status FAILED, it will set the state of the action plan as FAILED. This is the expected behavior according to Watcher documentation.

    For more info see: https://bugs.launchpad.net/watcher/+bug/2106407
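
    The rule described here can be sketched as a simple aggregation over action states. This covers only the FAILED and SUCCEEDED outcomes named in this note; action_plan_state is a hypothetical helper, not Watcher's implementation, and other states are out of scope.

```python
def action_plan_state(action_states):
    """Return FAILED if any action failed, otherwise SUCCEEDED."""
    if any(state == "FAILED" for state in action_states):
        return "FAILED"
    return "SUCCEEDED"
```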

  • Bug #2110538: Corrected the HTTP error code returned when watcher users try to create audits with invalid parameters. The API now correctly returns a 400 Bad Request error.

  • The zone migration strategy no longer requires a dst_node to be passed. When unspecified, the Nova scheduler will select an appropriate host automatically. This brings the implementation of the strategy in line with the API schema, where dst_node is optional.

    See: https://bugs.launchpad.net/watcher/+bug/2108988 for more details.

Other Notes

  • Removed unused OperationNotPermitted exception that was dead code since the initial import of the Watcher codebase. This exception provides the appropriate 400 Bad Request response behavior.

  • The DELETE, POST and PATCH REST methods for the action APIs are forbidden and not implemented. They are now removed from the API controller.

  • The Watcher Overload Standard Deviation algorithm is now referred to in the documentation as the Workload Stabilization Strategy. The documentation of this strategy has been enhanced to clarify and better explain the usage of parameters.