Replace a hyperconverged Ceph storage and compute node

Introduction

A common topology for Charmed OpenStack is the co-location of the nova-compute and the ceph-osd applications. This article covers the removal and redeployment of such a data plane cloud node.

Important

For the target cloud node, only nova-compute, ceph-osd, and their subordinate applications are assumed to be deployed.

Warning

This procedure disables compute services on the node being replaced, which effectively removes its hypervisor from the cloud.

Procedure

First ensure that the cloud is in a healthy state, the Compute services and Ceph cluster in particular.
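
A pre-flight check along these lines can be scripted. The sketch below assumes that `openstack compute service list` is run with `-f value -c Host -c State`, which yields one "host state" pair per line; the function name `hosts_not_up` is illustrative only.

```shell
# Print every host whose State column (second field) is not "up".
# Expects lines of the form "<host> <state>" on stdin.
hosts_not_up() {
    awk '$2 != "up" { print $1 }'
}

# Example usage (requires a deployed cloud):
#   openstack compute service list --service nova-compute \
#       -f value -c Host -c State | hosts_not_up
#   juju ssh ceph-mon/leader sudo ceph health
```

An empty result from the first command, together with a Ceph health of HEALTH_OK, suggests that it is reasonable to proceed.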

Identify cloud node specifics

Identify the unit and hypervisor name of the compute node:

juju status nova-compute
openstack hypervisor list

Identify the unit of the storage node and the IDs of the associated OSDs:

juju status ceph-osd
juju run -a ceph-osd mount | grep ceph
juju ssh ceph-mon/leader sudo ceph osd tree

In this example,

  • the existing compute node:

    • is hosted on unit nova-compute/0

    • has a hypervisor name of node2.maas

  • the existing storage node:

    • is hosted on unit ceph-osd/2

    • has two OSDs present and their IDs are 0 and 1

The ID of the existing Juju machine is common to both applications and is assumed to be 14.

The ID of the replacement Juju machine is assumed to be 21.

  • the new storage node:

    • will be hosted on unit ceph-osd/10

    • will have storage disks /dev/nvme0n1 and /dev/nvme0n2

Warning

You must replace the values in this example with the values of your actual environment.
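
To reduce copy-and-paste mistakes, the example values can be kept in shell variables for the duration of the procedure. This is only a convenience sketch; the variable names are not part of Juju or OpenStack and are chosen here for illustration.

```shell
# Example values from this article; replace them with your own.
COMPUTE_UNIT="nova-compute/0"
HYPERVISOR="node2.maas"
OSD_UNIT="ceph-osd/2"
OSD_IDS="0 1"
OLD_MACHINE="14"
NEW_MACHINE="21"
NEW_OSD_UNIT="ceph-osd/10"
NEW_DISKS="/dev/nvme0n1 /dev/nvme0n2"

# Print a summary so the values can be reviewed before continuing.
echo "Replacing machine ${OLD_MACHINE} (${HYPERVISOR}) with machine ${NEW_MACHINE}"
```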

Disable nova-compute services

Disable nova-compute services on the node:

juju run-action --wait nova-compute/0 disable
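
The result can be confirmed from the OpenStack side. A minimal sketch, assuming the Status column of `openstack compute service list` reads "disabled" for a disabled service; the helper name `all_disabled` is illustrative.

```shell
# Succeed only when stdin is non-empty and every line reads "disabled".
all_disabled() {
    statuses=$(cat)
    [ -n "$statuses" ] || return 1
    # Any line that is not exactly "disabled" means the check fails.
    if printf '%s\n' "$statuses" | grep -qvx 'disabled'; then
        return 1
    fi
    return 0
}

# Example usage (requires a deployed cloud):
#   openstack compute service list --host node2.maas --service nova-compute \
#       -f value -c Status | all_disabled && echo "compute service disabled"
```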

Respawn any Octavia VMs

Skip this section if Octavia is not deployed in the cloud.

Any Octavia load balancer VMs (amphorae) hosted on the node need to be identified and respawned.

Note

Migrating the amphorae like any other VMs (see next section) may work but the Octavia project recommends respawning (failing over) its VMs. This is because migration may take longer than expected, which may in turn cause Octavia to see its VMs as lost/stale. See Evacuating a Specific Amphora from a Host in the upstream documentation.

List the amphorae hosted on the node:

openstack server list --host node2.maas --all-projects | grep amphora

The Amphora ID is appended to the VM name.

For each VM,

  1. gather the load balancer ID:

    openstack loadbalancer amphora show <Amphora ID>
    
  2. respawn the Octavia VM by failing over its load balancer, and monitor progress:

    openstack loadbalancer failover <LB ID>
    watch 'openstack loadbalancer amphora list | grep <LB ID>'
    

    The original VM will be removed from the compute node.
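
When several amphorae are hosted on the node, the two steps above can be sketched as a loop. This assumes that amphora VM names have the form `amphora-<Amphora ID>` and that the `loadbalancer_id` field is present in the output of `openstack loadbalancer amphora show`; the helper name is illustrative.

```shell
# Strip the "amphora-" prefix from a VM name to obtain the Amphora ID.
amphora_id_from_name() {
    printf '%s\n' "${1#amphora-}"
}

# Example usage (requires a deployed cloud with Octavia):
#   for name in $(openstack server list --host node2.maas --all-projects \
#                     -f value -c Name | grep '^amphora-'); do
#       amp_id=$(amphora_id_from_name "$name")
#       lb_id=$(openstack loadbalancer amphora show "$amp_id" \
#                   -f value -c loadbalancer_id)
#       openstack loadbalancer failover "$lb_id"
#   done
```

Failing over one load balancer at a time and checking its amphorae before moving on is likely the safer approach on a busy cloud.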

Live migrate the compute node VMs

Evacuate the compute node’s VMs by live migration:

nova host-evacuate-live node2.maas

See cloud operation Live migrate VMs from a running compute node for in-depth coverage of live migration.

Ensure that all VMs have been evacuated:

juju ssh nova-compute/0 sudo virsh list --all
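
For a scripted check, the `--name` option of `virsh list` prints one domain name per line, which makes the remaining VMs easy to count. A small sketch; the function name `count_domains` is illustrative.

```shell
# Count the non-blank lines on stdin (one libvirt domain name per line).
# grep exits non-zero when nothing matches, hence the "|| true".
count_domains() {
    grep -c '[^[:space:]]' || true
}

# Example usage (requires a deployed cloud):
#   remaining=$(juju ssh nova-compute/0 sudo virsh list --all --name | count_domains)
#   [ "$remaining" -eq 0 ] && echo "all VMs evacuated"
```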

Unregister objects from the cloud

Unregister the compute node

Unregister the compute node from the cloud:

juju run-action --wait nova-compute/0 remove-from-cloud

See cloud operation Scale back the nova-compute application for more details on this step.

Unregister the neutron agents

Unregister the associated neutron agent from the cloud. The agent’s ID should be the compute node’s name. Verify this by first listing the agents:

openstack network agent list
openstack network agent delete node2.maas
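
To see exactly which agents are registered against the node before deleting anything, the agent list can be filtered by host. The sketch below assumes the output of `openstack network agent list -f value -c ID -c Host`, which is one "ID host" pair per line; the helper name is illustrative.

```shell
# Print the ID of every agent whose Host column (second field) equals $1.
agent_ids_for_host() {
    awk -v host="$1" '$2 == host { print $1 }'
}

# Example usage (requires a deployed cloud):
#   openstack network agent list -f value -c ID -c Host \
#       | agent_ids_for_host node2.maas
```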

Remove OSD storage devices

Remove each OSD hosted on the storage node, one at a time:

juju run-action --wait ceph-osd/2 remove-disk osd-ids=osd.0 purge=true
juju run-action --wait ceph-osd/2 remove-disk osd-ids=osd.1 purge=true
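
On nodes with many OSDs, the per-OSD commands can be generated rather than typed. The following sketch only prints the commands, so that they can be reviewed before being run; the function name `remove_disk_cmds` is illustrative.

```shell
# Print one remove-disk command per OSD ID for the given ceph-osd unit.
# Nothing is executed; the output is meant to be reviewed first.
remove_disk_cmds() {
    unit=$1
    shift
    for id in "$@"; do
        printf 'juju run-action --wait %s remove-disk osd-ids=osd.%s purge=true\n' \
            "$unit" "$id"
    done
}

remove_disk_cmds ceph-osd/2 0 1
```

Running the printed commands one at a time gives the cluster a chance to rebalance between removals.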

Note

The Ceph operation Removing OSDs has more details on the remove-disk action.

Remove and add a Juju machine

Remove the affected Juju machine from the model:

juju remove-machine 14

Add a replacement Juju machine:

juju add-machine

The machine’s hardware requirements can be stated via the --constraints option. This option can also be used to select a particular MAAS node by specifying a MAAS tag. The chosen machine should have the storage devices necessary to compensate for the Ceph OSDs that were removed.
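
As an illustration of selecting a MAAS node by tag, the sketch below builds an add-machine command that uses the `tags` constraint. The tag name `hyperconverged` is hypothetical and must be replaced with a tag that exists in your MAAS; the command is printed rather than executed.

```shell
# Build (but do not run) an add-machine command that targets a MAAS tag.
add_machine_cmd() {
    printf 'juju add-machine --constraints tags=%s\n' "$1"
}

# "hyperconverged" is a hypothetical MAAS tag used for illustration.
add_machine_cmd hyperconverged
```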

Add Ceph storage and compute services

Add Ceph storage and compute services to the new Juju machine:

juju add-unit nova-compute --to 21
juju add-unit ceph-osd --to 21

Integrate the new Ceph disks

The current value of the ceph-osd charm option osd-devices may match the two storage devices belonging to the new cloud node. In such a case, there is nothing else to do; the disks will be integrated into the cluster automatically.

First list all the disks on the new storage node:

juju run-action --wait ceph-osd/10 list-disks

Then query the charm option:

juju config ceph-osd osd-devices

If the new disks are not represented by the option’s value, you can either change the value (which applies to the entire cluster) or use the add-disk action against the new ceph-osd unit. Here, we’ll use the action with our previously assumed values:

juju run-action --wait ceph-osd/10 add-disk \
   osd-devices='/dev/nvme0n1 /dev/nvme0n2'
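
Whether the add-disk action is needed at all can be decided by comparing the new disks with the configured value. A minimal sketch, assuming osd-devices holds a list of device paths separated by single spaces; the function name `devices_configured` is illustrative.

```shell
# Succeed when every device named after the first argument appears as a
# whole entry in the space-separated osd-devices value (first argument).
devices_configured() {
    configured=" $1 "
    shift
    for dev in "$@"; do
        case "$configured" in
            *" $dev "*) ;;          # device is listed, keep checking
            *) return 1 ;;          # device is missing from the option
        esac
    done
    return 0
}

# Example usage (requires a deployed cloud):
#   devices_configured "$(juju config ceph-osd osd-devices)" \
#       /dev/nvme0n1 /dev/nvme0n2 && echo "no add-disk action needed"
```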

Inspect Ceph cluster changes

Review the state of the Ceph cluster with the commands used earlier. Note that the ceph-osd unit number will have changed:

juju status ceph-osd
juju run -a ceph-osd mount | grep ceph
juju ssh ceph-mon/leader sudo ceph osd tree

Customise the local environment

Perform any customisations that may be required as per the local environment. This may include:

  1. Adding the new compute node to a Nova aggregate or availability zone

  2. Setting CRUSH device classes for the new Ceph OSDs
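
The two customisations listed above might look like the following sketch. The aggregate name, the new hypervisor name, the device class, and the OSD names passed to the function are all assumptions that depend on your environment; the function name `customise_node` is illustrative.

```shell
# Usage: customise_node <aggregate> <hypervisor> <device-class> <osd-name>...
customise_node() {
    aggregate=$1
    host=$2
    class=$3
    shift 3
    # Add the new compute node to an existing Nova aggregate.
    openstack aggregate add host "$aggregate" "$host"
    # An existing device class must be cleared before another can be set.
    juju ssh ceph-mon/leader sudo ceph osd crush rm-device-class "$@"
    juju ssh ceph-mon/leader sudo ceph osd crush set-device-class "$class" "$@"
}

# Example usage with hypothetical values (requires a deployed cloud):
#   customise_node hc-aggregate node5.maas nvme osd.0 osd.1
```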

Verify the new cloud node

The hyperconverged Ceph storage and compute node has now been replaced.

Verify that the new compute node is functional. See the verification step in cloud operation Scale out the nova-compute application for guidance.

Verify that the Ceph cluster is healthy:

juju ssh ceph-mon/leader sudo ceph status
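
The cluster may take some time to rebalance after the new OSDs join, so a polling loop can be more convenient than a single check. The sketch below retries a command until its output starts with HEALTH_OK; it assumes `ceph health` is used as the probe, and the function name `wait_for_health_ok` is illustrative.

```shell
# Usage: wait_for_health_ok <attempts> <delay-seconds> <command...>
# Re-runs the command until its output starts with HEALTH_OK.
wait_for_health_ok() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        case "$("$@" 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        attempts=$((attempts - 1))
        # Sleep only if another attempt will follow.
        [ "$attempts" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Example usage (requires a deployed cloud): up to 30 checks, 10 s apart.
#   wait_for_health_ok 30 10 juju ssh ceph-mon/leader sudo ceph health
```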