Victoria Series (6.2.0 - 6.4.x) Release Notes
Adds a configuration option, which can be encoded into the ramdisk itself or passed through the PXE parameters, to instruct the agent to ignore bootloader installation or configuration failures. This functionality is useful to work around well-intentioned hardware which auto-populates all possible devices into the UEFI NVRAM firmware in order to help ensure the machine boots. However, this can also mean that any explicit configuration attempt will fail. Operators needing this bypass can use the ipa-ignore-bootloader-failure configuration option on the PXE command line or utilize the ignore_bootloader_failure option for the ramdisk configuration. In a future version of ironic, this setting may be able to be overridden by ironic node level configuration.
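As a rough illustration of the PXE side of this option, the sketch below splits a kernel command line into key/value pairs and reads the ipa-ignore-bootloader-failure flag. The helper names and the accepted truthy spellings are assumptions made for this example; this is not the agent's actual parsing code.

```python
def parse_kernel_params(cmdline):
    """Split a kernel command line into a dict of key -> value.

    Parameters given without '=' are stored with the value None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def ignore_bootloader_failure(cmdline):
    """Return True when the flag is present with a truthy value."""
    value = parse_kernel_params(cmdline).get("ipa-ignore-bootloader-failure")
    # Assumption: these spellings count as "enabled" in this sketch.
    return value is not None and value.lower() in ("1", "true", "yes", "on")


example = ("BOOT_IMAGE=vmlinuz ipa-api-url=http://192.0.2.1:6385 "
           "ipa-ignore-bootloader-failure=true")
print(ignore_bootloader_failure(example))
```

A command line that omits the parameter, or passes it without a value, is treated as "not set" in this sketch.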
Adds the capability into the agent to read and act upon bootloader CSV files which serve as authoritative indicators of what bootloader to load instead of leaning towards utilizing the default.
If multiple bootloader CSV files are present on the EFI filesystem, the first CSV file discovered will be utilized. The Ironic team considers multiple files to be a defect in the image being deployed. This may be changed in the future.
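The "first CSV file discovered wins" behaviour can be pictured with the sketch below. It walks a directory tree, returns the first file ending in .CSV, and reads the loader name and label from its first line. The function names, the sorted traversal order, the UTF-16 encoding, and the "loader,label,options,description" field layout are assumptions for illustration; the agent's own discovery logic may differ.

```python
import os
import tempfile


def find_first_bootloader_csv(efi_root):
    """Return the path of the first *.CSV file under efi_root, or None."""
    for dirpath, dirnames, filenames in os.walk(efi_root):
        dirnames.sort()  # deterministic traversal order for this example
        for name in sorted(filenames):
            if name.upper().endswith(".CSV"):
                return os.path.join(dirpath, name)
    return None


def parse_bootloader_csv(path, encoding="utf-16"):
    """Return (loader, label) from the first line of the CSV file."""
    with open(path, "r", encoding=encoding) as handle:
        first_line = handle.readline().strip()
    fields = first_line.split(",")
    return fields[0], fields[1]


# Build a throwaway EFI tree so the example is self-contained.
with tempfile.TemporaryDirectory() as efi_root:
    vendor_dir = os.path.join(efi_root, "EFI", "centos")
    os.makedirs(vendor_dir)
    csv_path = os.path.join(vendor_dir, "BOOTX64.CSV")
    with open(csv_path, "w", encoding="utf-16") as handle:
        handle.write("shimx64.efi,centos,,This is the boot entry for centos\n")
    found = find_first_bootloader_csv(efi_root)
    loader, label = parse_bootloader_csv(found)

print(loader, label)
```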
Setting the new ipa-ignore-bootloader-failure config option prevents errors due to bootloader installation failure generated by automatic bootloader entries configuration from multiple attached devices.
The system file system configuration file for Linux machines, the /etc/fstab file, is now updated to include a reference to the EFI partition in the case of a partition image based deployment. Without this reference, images deployed using partition images could end up in situations where upgrading the bootloader could fail.
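A minimal sketch of such an update is shown below: it appends an entry for the EFI partition to fstab content only when no entry for the mount point exists yet. The /boot/efi mount point, the vfat options, and the function name are assumptions chosen for the example, not values confirmed by this note.

```python
def add_efi_to_fstab(fstab_text, efi_uuid, mount_point="/boot/efi"):
    """Return fstab_text with an EFI partition entry appended if missing."""
    for line in fstab_text.splitlines():
        fields = line.split()
        if (len(fields) >= 2 and not fields[0].startswith("#")
                and fields[1] == mount_point):
            return fstab_text  # already referenced, leave the file untouched
    # Assumed entry format: UUID, mount point, type, options, dump, pass.
    entry = "UUID=%s\t%s\tvfat\tumask=0077\t0\t1\n" % (efi_uuid, mount_point)
    if fstab_text and not fstab_text.endswith("\n"):
        fstab_text += "\n"
    return fstab_text + entry


original = "UUID=1111-aaaa\t/\text4\tdefaults\t0\t1\n"
updated = add_efi_to_fstab(original, "ABCD-1234")
print(updated)
```

Calling the helper a second time on its own output returns the text unchanged, so the update is idempotent.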
Fixes a minor issue with the regular expression used for UEFI duplicate entry cleanup which was introduced in a prior change to refactor the cleanup operation to avoid UEFI firmware which treats deletion of entries after addition as an invalid operation.
Fixes cases where duplicates may not be found in the UEFI firmware NVRAM boot entry table by explicitly looking for and deleting entries with matching labels before creating the EFI boot loader entry.
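The label matching described above could look roughly like the following sketch, which scans efibootmgr-style output for entries whose label equals the one about to be created. The regular expression, the function name, and the sample output are illustrative assumptions, not the agent's actual cleanup code.

```python
import re

# Matches lines such as "Boot0001* ubuntu" or "Boot0002  ubuntu" (inactive).
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s(.*)$")


def find_entries_with_label(efibootmgr_output, label):
    """Return the boot numbers of entries whose label equals `label`."""
    matches = []
    for line in efibootmgr_output.splitlines():
        found = ENTRY_RE.match(line)
        if not found:
            continue  # BootCurrent, BootOrder, Timeout and similar lines
        # Verbose output appends a tab and the device path after the label.
        entry_label = found.group(3).split("\t")[0].strip()
        if entry_label == label:
            matches.append(found.group(1))
    return matches


sample = """BootCurrent: 0001
BootOrder: 0001,0000,0002
Boot0000* ironic1
Boot0001* ubuntu
Boot0002  ironic1
"""
duplicates = find_entries_with_label(sample, "ironic1")
print(duplicates)
```

Each returned boot number would then be removed before the new entry is added, so that no deletion has to happen after the addition.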
IPA now properly checks if the root partition is already mounted. See Story 2008631 for details.
Fixes an error with UEFI based deployments where using a partition image on an NVMe device previously failed due to the different device name pattern.
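The naming difference can be shown with a small sketch: Linux inserts a "p" between the disk name and the partition number when the disk name ends in a digit, as NVMe names do. The helper name is hypothetical.

```python
def partition_path(device, number):
    """Build the partition device path for a disk and a partition number."""
    # Disks whose name ends in a digit (nvme0n1, mmcblk0) need a "p" separator.
    separator = "p" if device[-1].isdigit() else ""
    return "%s%s%d" % (device, separator, number)


print(partition_path("/dev/sda", 1))
print(partition_path("/dev/nvme0n1", 1))
```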
Fixes an issue where partitions are not visible due to an incorrect call to have the partition table re-read.
Fixes an issue where partitions are not visible due to an incorrect call to have the partition table re-read during RAID configuration creation.
Fixes an issue where the NTP time sync at the IPA startup via chronyd is not immediate (which can break time sensitive components such as the generation of a TLS certificate).
Fixes failures with disk image conversions which result in memory allocation or input/output errors due to memory limitations by limiting the number of available memory allocation pools to a non-dynamic reasonable number which should not exceed the available system memory.
The lshw package version B.02.19.2-5 on CentOS 8.4 and 8.5 contains a bug that prevents the size of individual memory banks from being reported, with the result that the total memory size would be reported as 0 in some places. The total memory size is now taken from lshw’s total memory size output (which does not suffer from the same problem) when available.
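The fallback order described above can be sketched as follows: prefer the total reported on the memory node and only sum the individual banks when no total is present. The dictionary layout is a simplified stand-in for lshw's JSON output, and the function name is hypothetical.

```python
def total_memory_bytes(memory_node):
    """Return the total memory size from a simplified lshw-style node."""
    total = memory_node.get("size")
    if total:
        return total
    # No usable total: fall back to adding up the per-bank sizes.
    return sum(bank.get("size", 0) for bank in memory_node.get("children", []))


# Banks without a size, as produced by the affected lshw version.
affected = {"id": "memory", "size": 17179869184,
            "children": [{"id": "bank:0"}, {"id": "bank:1"}]}
# No total, but each bank reports its size.
banks_only = {"id": "memory",
              "children": [{"id": "bank:0", "size": 8589934592},
                           {"id": "bank:1", "size": 8589934592}]}

print(total_memory_bytes(affected), total_memory_bytes(banks_only))
```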
No longer crashes if MAC address cannot be determined for one of the network interfaces.
Fixes an issue where metadata erasure cleaning fails for partitions because the read-only file isn’t found, while it is available at the base device. Adds a check for the base device file on failure. See story 2008696.
Fixes the agent’s EFI boot handling such that EFI assets from a partition image are preserved and used instead of overridden. This should permit operators to use Secure Boot with partition images IF the assets are already present in the partition image.
Mirrors the previously disconnected EFI system partitions (ESPs) in UEFI software RAID setups. Disconnected ESPs can lead to nodes booting with outdated kernel parameters or the UEFI firmware not finding bootable kernels at all.
Fixes incorrect root partition UUID after streaming a raw partition image.
Fixes nodes failing after deployment completes due to issues in the Grub2 EFI loader entry addition where a BOOT.CSV file provides the authoritative pointer to the bootloader to be used for booting the OS. The base issue with Grub2 is that it would update the UEFI bootloader NVRAM entries with whatever is present in a vendor specific BOOTX64.CSV file. In some cases, a baremetal machine can crash when this occurs. More information can be found at story 2008962.
Adds a call to “udevadm settle” in write_image.sh. After GPT and MBR are destroyed, systemd-udevd gets triggered, which may hold /dev/sda open and prevent qemu-img from writing its image.
Provides a more specific error message if a UEFI-incompatible image is used in the UEFI mode.
Increases the memory usage limit for the qemu-img convert command to 2 GiB. See Story 2008667 for details.
Adds the ability to bring up VLAN interfaces and include them in the introspection report. This is needed in environments that require an IP address to be configured on tagged VLANs. A new kernel params field is added -
ipa-enable-vlan-interfaces, which defines either the VLAN interface to enable, the interface to use, or ‘all’ - which indicates all interfaces. If the particular VLAN is not provided, IPA will use the LLDP information for the interface to determine which VLANs should be enabled. See story 2008298.
Automatically generated TLS certificates now have their validity starting in the past (1 hour by default) to allow for clock skew.
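The effect of backdating can be seen in a short sketch that computes the validity window. Only the one hour default comes from the note above; the 90 day lifetime and the function name are assumptions for the example.

```python
import datetime


def validity_window(now, valid_for_days=90,
                    allowed_skew=datetime.timedelta(hours=1)):
    """Return (not_before, not_after) for a newly generated certificate."""
    # Start in the past so a peer whose clock runs behind still accepts it.
    not_before = now - allowed_skew
    not_after = now + datetime.timedelta(days=valid_for_days)
    return not_before, not_after


now = datetime.datetime(2021, 1, 1, 12, 0, 0)
not_before, not_after = validity_window(now)
print(not_before, not_after)
```

With this window, a verifier whose clock is up to one hour behind the agent still sees the certificate as already valid.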
Fixes the agent process for determining what partition label type to utilize when writing partition images. In many cases, this could fall back to msdos if the instance flavor was not properly labeled.
Correctly decodes error messages from ironic API.
The mdadm utility is no longer a hard requirement. It’s still required if software RAID is used (even when not managed by ironic).
Fixes the write_image deploy step to actually check and return any errors during its execution.
Avoids a traceback when using install_bootloader with whole disk images. If the root UUID cannot be detected, don’t try to call grub.
Enables support in IPA for hosting the API server over TLS. Using this support requires setting [DEFAULT]listen_tls to True, and then setting [ssl]key_file, and optionally [ssl]ca_file, to files embedded in the ramdisk IPA runs inside.
When a recent enough version of ironic is detected and
False, agent will now generate a self-signed TLS certificate and send it to ironic on heartbeat. This ensures encrypted communication from ironic to the agent. Set
Falseto disable this behavior.
logs inspection collector is now enabled by default, change
IPA heartbeat intervals now rely on accurate clock time. Any clean or deploy steps which attempt to sync the clock may cause heartbeats to not be emitted. IPA syncs time at startup and shutdown, so these steps should not be required.
Fixes an issue with nodes undergoing fast-track from introspection to deployment where the agent's internal cache of the node may be stale. In particular, this can be observed if the node does not honor a root device hint which is saved to Ironic’s API after the agent was started. More information can be found in story 2008039.
Fixes a minor incorrect keyword argument that was matching between the method caller and the unit test but not the actual method, unit test, and caller. This was a non-fatal issue, and should now permit the agent to attempt to lookup the node one last time before deploying the instance image to pick-up a root device hint.
Fixes an issue with the IntelCnaHardwareManager which prevented hardware managers with lower priority from being executed and therefore may have blocked the initialization and collection of the hardware these managers are supposed to take care of.
Fixes a bug where the partitions created during software RAID setup are cleaned too early and therefore may prevent the proper cleaning of the md superblocks. Leaving superblocks behind will impact the creation of new md devices later on.
Detects md component devices by their UUID, rather than by scanning the output of mdadm. This prevents devices from missing md superblock cleanup when they are not currently part of an array.
Since the Ussuri release, IPA has ignored the listen_host and listen_port directives. This fixes the behavior and restores those configuration values to working status. https://storyboard.openstack.org/#!/story/2008016
Adds an explicit capture of connectivity failures in the heartbeat process to provide a more verbose error message in line with what is occurring as opposed to just indicating that an error occurred. This new exception is called HeartbeatConnectionError and is likely only going to be visible if there is a local connectivity failure such as a router failure, a switchport in a blocking state, or a connection centered transient failure.
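The idea of capturing connectivity failures separately can be sketched as below. The exception class here is a local stand-in with the same name as the one in the note, and send_heartbeat and its post argument are hypothetical; the agent's real heartbeat code is not reproduced.

```python
class HeartbeatConnectionError(Exception):
    """Stand-in for the exception named in the note above."""


def send_heartbeat(post):
    """Call post() and translate connection failures into a specific error."""
    try:
        return post()
    except ConnectionError as exc:
        # Keep the original error chained so the root cause stays visible.
        raise HeartbeatConnectionError(
            "Heartbeat could not reach the API: %s" % exc) from exc


def unreachable():
    raise ConnectionRefusedError("connection refused")


try:
    send_heartbeat(unreachable)
except HeartbeatConnectionError as error:
    message = str(error)
    print(message)
```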
The new kernel parameter ipa-advertise-protocol can be used to change the protocol of the callback URL to
The deploy.erase_devices_metadata clean step can now also be used as a deploy step.
Introspection of PCI devices now collects PCI class, revision and PCI bus.
Adds a Poll extension which provides the ability to retrieve hardware information as well as set node data from API. This feature is required for poll mode deployment driven by ironic.
Fixes the return value of the apply_configuration deploy step: the agent RAID interface expects the final RAID configuration to be returned.
Fixes an issue where the bootloader installation can fail on a software RAID volume when no root_device hint is set. See Story 2007905
Fixes retry logic issues with the Agent Lookup which can result in the lookup failing prematurely before being completed, typically resulting in an abrupt end to the agent logging and potentially weird errors like TypeError being reported on the agent process standard error output. For more information see bug 2007968.
Fixes an issue with the ironic-python-agent where we would call to setup the bootloader, which is necessary with software raid, but also attempt to clean up iSCSI. This can cause issues when using the
deploy_interface. Now the agent will only clean up iSCSI connections if iSCSI was explicitly started. For more information, please see story 2007937.
Devices with size 0 are now ignored when collecting inventory. Some hardware represents virtual floppy devices this way, see e.g. https://www.dell.com/community/Systems-Management-General/How-to-disable-iDRAC-Virtual-CD/td-p/4734424
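A minimal sketch of this filtering is shown below. It assumes block devices are represented as dictionaries with name and size keys, which is a simplification for illustration.

```python
def drop_zero_size_devices(devices):
    """Return only the devices that report a size greater than zero."""
    return [device for device in devices if device.get("size", 0) > 0]


inventory = [
    {"name": "/dev/sda", "size": 480103981056},
    {"name": "/dev/sdb", "size": 0},  # e.g. an empty virtual floppy
]
usable = drop_zero_size_devices(inventory)
print([device["name"] for device in usable])
```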
Fixes deployment failures when the image download is interrupted mid-stream while the contents are being downloaded. Previously retries were limited to only opening the initial connection.
Fixes the short timeout retries interval, which was previously 5 seconds, to a length that will allow the agent to retry after a network interruption. The time between retries is now 10 seconds, and the number of retries is set to 9 to help ensure intermittent network outages do not cause recoverable failures.
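The retry behaviour with the new defaults can be sketched as a simple loop. The function name, the injected sleep callable, and the reading of "9 retries" as nine additional attempts after the first one are assumptions made for this example.

```python
import time


def download_with_retries(download, retries=9, interval=10, sleep=time.sleep):
    """Call download() until it succeeds or the retries are used up."""
    for attempt in range(retries + 1):
        try:
            return download()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(interval)


calls = {"count": 0}


def flaky_download():
    """Fail twice, then succeed, to imitate a short network outage."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection reset mid-stream")
    return b"image-bytes"


waits = []  # record the waits instead of actually sleeping
result = download_with_retries(flaky_download, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as an argument keeps the example fast to run and makes the waiting behaviour easy to inspect.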
Fixes an issue with high CPU usage caused by the ironic-python-agent greenthread eventlet implementation.
Using eventlet.sleep(0.1) instead of eventlet.sleep(0) gives other processes of IPA more CPU time to run.
Speeds up going from inspection to cleaning with fast-track enabled by caching hardware information between the steps.
Fixes serializing exceptions originating from ironic-lib. Previously an attempt to do so would result in a
TypeError, for example: Object of type ‘InstanceDeployFailure’ is not JSON serializable.
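The failure mode and one possible remedy can be reproduced with the standard json module alone. The exception class below is a local stand-in for the ironic-lib one, and the fallback serializer is an illustrative approach, not necessarily what the agent does.

```python
import json


class InstanceDeployFailure(Exception):
    """Local stand-in for the ironic-lib exception of the same name."""


def serialize(payload):
    """Dump payload to JSON, turning exceptions into plain dictionaries."""
    def fallback(value):
        if isinstance(value, Exception):
            return {"type": type(value).__name__, "message": str(value)}
        raise TypeError("Object of type %s is not JSON serializable"
                        % type(value).__name__)
    return json.dumps(payload, default=fallback, sort_keys=True)


payload = {"error": InstanceDeployFailure("image write failed")}

try:
    json.dumps(payload)  # no fallback: this is the failure from the note
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

encoded = serialize(payload)
print(plain_dump_failed, encoded)
```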
The size of the ESP partition created for software RAID has been increased from 128 MiB to 550 MiB. This change is in line with the recent diskimage-builder change as well as the guidance from the author of gdisk.
Fixes failure to detect a hung file download connection in the event that the kernel has not rapidly detected that the remote server has hung up the socket. This can happen when there are intermittent and transient connectivity issues such as those that can occur due to LACP failure response hold-down timers in switching fabrics.