Test Environment

This document should give you a good idea of what you can count on in the test environments managed by the Infrastructure team. This information may be useful when creating new jobs or debugging existing jobs.

Unprivileged Single Use VMs

All jobs currently run on these nodes. These are single use VMs booted in OpenStack clouds. You should start here unless you know you have a reason to use a privileged VM.

Each single use VM has these attributes which you can count on:

  • Every instance has a public IP address. This may be an IPv4 address or an IPv6 address or maybe both.

    • You may not get both, it is entirely valid for an instance to have only a public IPv6 address and for another to have only a public IPv4 address.

    • In some cases the public IPv4 address is provided via NAT and the instance will only see a private IPv4 address. In some cases instances may have both a public and a private IPv4 address.

    • It is also possible that these addresses are on multiple network interfaces.

  • CPUs are all running x86-64.

  • There is at least 8GB of system memory available.

  • There is at least 80GB of disk available. This disk may not all be exposed in a single filesystem partition and so not all mounted at /. Any additional disk can be partitioned, formatted and mounted by the root user; though if you need this it is recommended to use devstack-gate which takes care of it automatically and mounts the extra space on /opt early in its setup phase. To give you an idea of what this can look like most clouds just give us an 80GB or bigger /. One cloud gives us a 40GB / and 80GB /opt. Generally you will want to write large things to /opt to take advantage of available disk.

  • Swap is not guaranteed to be present. Some clouds give us swap and others do not. Some tests (like devstack-gate based tests) will create swap either using a second disk device if available or by using a file otherwise. Be aware you may need to create swap if you need it.

  • Filesystems are ext4. If you need other filesystems you can create them on files mounted via loop devices.

  • Package mirrors for PyPi, NPM, Ubuntu, Debian, and Centos 7 (including EPEL) are provided and preconfigured on these instances before starting any jobs. We also have mirrors for Ceph and Ubuntu Cloud Archive that jobs must opt into using (details for these are written to disk on the test instances but are disabled by default).

Because these instances are single use we are able to give jobs full root access to them. This means you can install system packages, modify partition tables, and so on. Note that if you reboot the test instances you will need to restart the zuul-console process.

If jobs need to perform privileged actions they can do so using Zuul v3’s secrets. Things like AFS access tokens or dockerhub credentials can be stored in Zuul secrets then used by jobs to perform privileged actions requiring this data. Please refer to the Zuul documentation for more info.

Known Differences to Watch Out For

  • Underlying hypervisors are not all the same. You may run into KVM or Xen and possibly others depending on the cloud in use.

  • CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region.

  • Nested virt is not available in all clouds. And in clouds where it is enabled we have observed a higher rate of crashed test VMs when using it. As a result we enforce qemu when running devstack and may further restrict the use of nested virt.

  • Some clouds give us multiple network interfaces, some only give us one. In the case of multiple network interfaces some clouds give all of them Internet routable addresses and some others do not.

  • Geographic location is widely variable. We have instances all across North America and in Europe. This may affect network performance between instances and network resources geographically distant.

  • Some network protocols may be blocked in some clouds. Specfically we have had problems with GRE. You can rely on TCP, UDP, and ICMP being functional on all of our clouds.

  • Network interface MTU of 1500 is not guaranteed. Some clouds give us smaller MTUs due to use of overlay networking. Test jobs should check interface MTUs and use an appropriate value for the current instance if creating new interfaces or bridges.

Why are Jobs for Changes Queued for a Long Time

We have a finite number of resources to run jobs on. We process jobs for changes in order based on a priority queuing system. This priority queue assigns test resources to Zuul queues based on the number of total changes in that queue. Changes at the heads of these queues are assigned resources before those at the end of the queues.

We have done this to ensure that large projects with many changes and long running jobs do not starve small projects with few changes and short jobs.

In order to make the queues run quicker there are several variables we can change:

  1. Lower demand. Fewer changes and/or jobs will result in less demand for resources increasing availability for the changes that remain.

  2. Reduce job resource costs. Reducing job runtime means those resources can be reused sooner by other jobs. Keep in mind that multinode jobs use a whole integer multiple more resources than single node jobs. You should only use multinode jobs where necessary to test specific interactions or to fit a complex test case into the resources we have.

  3. Improve job reliability. If jobs fail because the tests or software under test are unreliable then we have to run more jobs to successfully merge our software. This effect is compounded by our gate queues because anytime we have a change that fails we must remove it from the queue, rebuild the queue without that change, then restart all jobs in the queue with that change evicted.

    Keep in mind that we are dog fooding OpenStack to run OpenStack’s CI system. This means that a more reliable OpenStack is able to provide resources to our CI system effectively. Fixing OpenStack in this case is a win win situation.

  4. Add resources to our pools. If we have more total resources then we will have more to spread around.

In general, we would like to see our software perform the testing that the developers feel is necessary. We should do so responsibly. What this means is instead of deleting jobs or ignoring changes we should improve our test reliability to ensure changes exit queues as quickly as possible with minimal resource cost. This then ensures the changes behind are able to get resources quickly.

We are also always happy to add resources if they are available, but the priority from the project should be to ensure we are using what we do have responsibly.

Can my changes skip the check queue?

The OpenStack project uses a “clean check” approach to keep flaky changes out of the gate. So, a change always needs to pass “check” before it enters “gate” - and if it fails in “gate”, it re-enters the “check” pipeline.

  • If your change fails in the gate, then there is an increased chance it is introducing non-deterministic failure behavior so forcing it to go through check again helps make that more apparent.

  • This avoids also approving changes that have no hope of ever passing due to pep8 or other trivial errors.

  • It also helps with approving changes that had been sitting around with a 6-month-old passing check.

Changes in the gate pipeline are prioritized but also serialized, so if a change fails, all tests for changes behind that failing change have to be restarted. If restarts after restarts happen, then resources are never freed up for the check pipeline.

Therefore, having a stable gate pipeline is crucial - and the “clean check” requirement will help with the stable jobs.