Devspecs Guide

Specifications for developers who are interested in the Tricircle.

Tricircle Asynchronous Job Management API

Background

In the Tricircle, XJob provides OpenStack multi-region functionality. It receives jobs from the Admin API or the Tricircle Central Neutron Plugin and processes them asynchronously. For example, when an instance is booted for the first time in a project, the router, security group rules, FIP and other resources may not yet exist in the local Neutron(s). Unlike the network, subnet and security group resources, which must exist before an instance can boot, these resources can be created asynchronously to speed up the response to the initial boot request. Central Neutron can send such creation jobs to the local Neutron(s) through XJob, and each local Neutron then handles them at its own pace.

Implementation

The XJob server may fail occasionally, so tenants and cloud administrators need to be able to check job status and to delete or redo a failed job if necessary. The asynchronous job management APIs provide this functionality and are listed as follows:

  • Create a job

    Create a job to synchronize resource if necessary.

    Create Job Request:

    POST /v1.0/jobs
    {
        "job": {
            "type": "port_delete",
            "project_id": "d01246bc5792477d9062a76332b7514a",
            "resource": {
                "pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
                "port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
            }
        }
    }
    
    Response:
    {
        "job": {
            "id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
            "project_id": "d01246bc5792477d9062a76332b7514a",
            "type": "port_delete",
            "timestamp": "2017-03-03 11:05:36",
            "status": "NEW",
            "resource": {
                "pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
                "port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
            }
        }
    }
    
    Normal Response Code: 202
    
  • Get a job

    Retrieve a job from the Tricircle database.

    If the job exists, its detailed information is returned. Otherwise a “Resource not found” exception is returned.

    List Request:

    GET /v1.0/jobs/3f4ecf30-0213-4f1f-9cb0-0233bcedb767
    
    Response:
    {
        "job": {
            "id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
            "project_id": "d01246bc5792477d9062a76332b7514a",
            "type": "port_delete",
            "timestamp": "2017-03-03 11:05:36",
            "status": "NEW",
            "resource": {
                "pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
                "port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
            }
        }
    }
    
    Normal Response Code: 200
    
  • Get all jobs

    Retrieve all of the jobs from the Tricircle database.

    List Request:

    GET /v1.0/jobs/detail
    
    Response:
    {
       "jobs":
           [
                {
                    "id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
                    "project_id": "d01246bc5792477d9062a76332b7514a",
                    "type": "port_delete",
                    "timestamp": "2017-03-03 11:05:36",
                    "status": "NEW",
                    "resource": {
                        "pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
                        "port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
                    }
                },
                {
                    "id": "b01fe514-5211-4758-bbd1-9f32141a7ac2",
                    "project_id": "d01246bc5792477d9062a76332b7514a",
                    "type": "seg_rule_setup",
                    "timestamp": "2017-03-01 17:14:44",
                    "status": "FAIL",
                    "resource": {
                        "project_id": "d01246bc5792477d9062a76332b7514a"
                    }
                }
           ]
    }
    
    Normal Response Code: 200
    
  • Get all jobs with filter(s)

    Retrieve job(s) from the Tricircle database. We can filter them by project ID, job type and job status. If no filter is provided, GET /v1.0/jobs will return all jobs.

    The response contains a list of jobs. Using filters, a subset of jobs will be returned.

    List Request:

    GET /v1.0/jobs?project_id=d01246bc5792477d9062a76332b7514a
    
    Response:
    {
       "jobs":
           [
                {
                    "id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
                    "project_id": "d01246bc5792477d9062a76332b7514a",
                    "type": "port_delete",
                    "timestamp": "2017-03-03 11:05:36",
                    "status": "NEW",
                    "resource": {
                        "pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
                        "port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
                    }
                },
                {
                    "id": "b01fe514-5211-4758-bbd1-9f32141a7ac2",
                    "project_id": "d01246bc5792477d9062a76332b7514a",
                    "type": "seg_rule_setup",
                    "timestamp": "2017-03-01 17:14:44",
                    "status": "FAIL",
                    "resource": {
                        "project_id": "d01246bc5792477d9062a76332b7514a"
                    }
                }
           ]
    }
    
    Normal Response Code: 200
    
  • Get all jobs’ schemas

    Retrieve all jobs’ schemas. A user may want to know which resources are needed for a specific job.

    List Request:

    GET /v1.0/jobs/schemas
    
    return all jobs' schemas.
    Response:
    {
       "schemas":
           [
                {
                    "type": "configure_route",
                    "resource": ["router_id"]
                },
                {
                    "type": "router_setup",
                    "resource": ["pod_id", "router_id", "network_id"]
                },
                {
                    "type": "port_delete",
                    "resource": ["pod_id", "port_id"]
                },
                {
                    "type": "seg_rule_setup",
                    "resource": ["project_id"]
                },
                {
                    "type": "update_network",
                    "resource": ["pod_id", "network_id"]
                },
                {
                    "type": "subnet_update",
                    "resource": ["pod_id", "subnet_id"]
                },
                {
                    "type": "shadow_port_setup",
                    "resource": ["pod_id", "network_id"]
                }
           ]
    }
    
    Normal Response Code: 200
    
  • Delete a job

    Delete a failed or duplicated job from the Tricircle database. A pair of curly braces is returned on success; otherwise an exception is thrown. In addition, we can list all jobs to verify whether the job was deleted successfully.

    Delete Job Request:

    DELETE /v1.0/jobs/{id}
    
    Response:
    This operation does not return a response body.
    
    Normal Response Code: 200
    
  • Redo a job

    Redo a job that was halted by XJob server failures or network failures. The job handler redoes a failed job at a time interval, whereas this Admin API redoes a job immediately. Nothing is returned for this request, but we can monitor the job’s status through its execution state.

    Redo Job Request:

    PUT /v1.0/jobs/{id}
    
    Response:
    This operation does not return a response body.
    
    Normal Response Code: 200
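
The requests listed above can be assembled as in the following minimal Python sketch. It only builds the method, path and JSON body and sends nothing over the network; the helper names (build_create_job, build_job_request) are hypothetical and not part of the Tricircle code base, and the check of resource fields against the schema listing is an assumed client-side convenience.

```python
import json

JOBS_PATH = "/v1.0/jobs"

# Required resource fields per job type, copied from the schema listing above.
JOB_SCHEMAS = {
    "configure_route": ["router_id"],
    "router_setup": ["pod_id", "router_id", "network_id"],
    "port_delete": ["pod_id", "port_id"],
    "seg_rule_setup": ["project_id"],
    "update_network": ["pod_id", "network_id"],
    "subnet_update": ["pod_id", "subnet_id"],
    "shadow_port_setup": ["pod_id", "network_id"],
}


def build_create_job(job_type, project_id, resource):
    """Return (method, path, body) for a create-job request (hypothetical helper)."""
    required = JOB_SCHEMAS.get(job_type)
    if required is None:
        raise ValueError("unknown job type: %s" % job_type)
    missing = [field for field in required if field not in resource]
    if missing:
        raise ValueError("missing resource fields: %s" % ", ".join(missing))
    body = {"job": {"type": job_type,
                    "project_id": project_id,
                    "resource": dict(resource)}}
    return "POST", JOBS_PATH, json.dumps(body)


def build_job_request(action, job_id):
    """Return (method, path) for the get, delete and redo operations."""
    methods = {"get": "GET", "delete": "DELETE", "redo": "PUT"}
    return methods[action], "%s/%s" % (JOBS_PATH, job_id)
```

For example, build_job_request("redo", job_id) yields the PUT request that asks for an immediate redo of the given job.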
    

Data Model Impact

In order to manage the jobs for each tenant, we need to filter them by project ID. Project ID is therefore added to the AsyncJob model and the AsyncJobLog model.
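
As a rough illustration of per-tenant filtering, the sketch below builds a filtered list path and applies the same filters to an in-memory job list. The query parameter names type and status are assumptions that mirror the response fields (only project_id appears in the example request above), and both helper names are hypothetical.

```python
from urllib.parse import urlencode

# Filter names assumed to match the job fields shown in the responses above.
ALLOWED_FILTERS = ("project_id", "type", "status")


def build_list_jobs_path(**filters):
    """Build the GET path for listing jobs, with optional filters."""
    unknown = sorted(set(filters) - set(ALLOWED_FILTERS))
    if unknown:
        raise ValueError("unsupported filters: %s" % ", ".join(unknown))
    pairs = [(name, filters[name]) for name in ALLOWED_FILTERS if name in filters]
    query = urlencode(pairs)
    return "/v1.0/jobs" + ("?" + query if query else "")


def filter_jobs(jobs, **filters):
    """Keep only the jobs whose fields equal every given filter value."""
    return [job for job in jobs
            if all(job.get(name) == value for name, value in filters.items())]
```

With no filters, build_list_jobs_path() returns the plain /v1.0/jobs path, which matches the behaviour described for the unfiltered request.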

Dependencies

None

Documentation Impact

  • Add documentation for asynchronous job management API

  • Add release note for asynchronous job management API

References

None

Cross Neutron L2 networking in Tricircle

Background

The Tricircle provides unified OpenStack API gateway and networking automation functionality. Those main functionalities allow cloud operators to manage multiple OpenStack instances which are running in one site or multiple sites as a single OpenStack cloud.

Each bottom OpenStack instance which is managed by the Tricircle is also called a pod.

The Tricircle has the following components:

  • Nova API-GW

  • Cinder API-GW

  • Neutron API Server with Neutron Tricircle plugin

  • Admin API

  • XJob

  • DB

Nova API-GW provides the functionality to trigger automatic networking creation when new VMs are being provisioned. The Neutron Tricircle plug-in provides the functionality to create cross Neutron L2/L3 networking for new VMs. After the binding of tenant ID and pod is finished in the Tricircle, Cinder API-GW and Nova API-GW pass Cinder API or Nova API requests to the appropriate bottom OpenStack instance.

Please refer to the Tricircle design blueprint[1], especially ‘7. Stateless Architecture Proposal’, for a detailed description of each component.

Problem Description

When a user creates a network in the Neutron API Server, the user can specify ‘availability_zone_hints’ (AZ or az is used as shorthand for availability zone) during network creation[5]. In the Tricircle, ‘az_hints’ indicates which AZs the network should be spread into, which is slightly different from the meaning of ‘az_hints’ in Neutron[5]. If no ‘az_hints’ is specified during network creation, the created network can be spread into any AZ. If a list of ‘az_hints’ is given during network creation, the network should be able to be spread into the AZs named in that list.

When a user creates a VM or volume, there is also a parameter called availability zone. The AZ parameter is used for volume and VM co-location, so that the volume and VM are created in the same bottom OpenStack instance.

When a VM is being attached to a network, the Tricircle checks whether the VM’s AZ is within the network’s AZ scope. If it is not, the VM creation is rejected.
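
The AZ scope check described above can be sketched as follows. This is an illustrative helper under the rules stated in this section (an empty hint list means any AZ is acceptable); the function name is hypothetical and it is not the actual Tricircle implementation.

```python
def vm_az_in_network_scope(network_az_hints, vm_az):
    """Return True if a VM in vm_az may attach to the network.

    An empty or missing az_hints list means the network can be spread
    into any AZ, so every VM AZ is accepted.
    """
    if not network_az_hints:
        return True
    return vm_az in network_az_hints
```

A False result corresponds to the case in which the VM creation is rejected.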

Currently, the Tricircle supports only one pod per AZ, and a network can be associated with only one AZ. That means a tenant’s network is currently present in only one bottom OpenStack instance, and therefore all VMs connected to the network are located in one bottom OpenStack instance. For the case of more than one pod in one AZ, refer to the dynamic pod binding spec[6].

There are many use cases where a tenant needs a network that can be spread out into multiple bottom OpenStack instances in one AZ or multiple AZs.

  • Capacity expansion: as tenants add more and more VMs, the capacity of one OpenStack instance may not be enough, and a new OpenStack instance has to be added to the cloud. But the tenant still wants to add new VMs to the same network.

  • Cross Neutron network service chaining. Service chaining is based on port-pairs. Leveraging the cross Neutron L2 networking capability provided by the Tricircle, the chaining can also be done across sites. For example, with vRouter1 in pod1 and vRouter2 in pod2, these two VMs can be chained.

  • Applications are often required to run in different availability zones to achieve high availability. An application needs to be designed as Active-Standby/Active-Active/N-Way to achieve high availability, and some components inside one application are designed to work as a distributed cluster. This design typically leads to state replication or heartbeat among application components (directly, via replicated database services, or via privately designed message formats). When such applications are deployed across multiple OpenStack instances, cross Neutron L2 networking is needed to support heartbeat or state replication.

  • When a tenant’s VMs are provisioned in different OpenStack instances, there is E-W (East-West) traffic between these VMs. The E-W traffic should be visible only to the tenant, so isolation is needed. If the traffic goes N-S (North-South) through a tenant level VPN, the overhead is too high, and the orchestration of multiple site-to-site VPN connections is also complicated. Therefore cross Neutron L2 networking that bridges the tenant’s routers in different Neutron servers can provide more lightweight isolation.

  • In a hybrid cloud, there is a cross L2 networking requirement between the private OpenStack and the public OpenStack. Cross Neutron L2 networking helps VM migration in this case, and it is not necessary to change the IP/MAC/Security Group configuration during VM migration.

The spec[6] explains how one AZ can support more than one pod, and how to schedule a proper pod during VM or volume creation.

This spec deals with cross Neutron L2 networking automation in the Tricircle.

The simplest way to spread out L2 networking to multiple OpenStack instances is to use the same VLAN. But there are a lot of limitations: (1) the number of VLAN segments is limited, (2) the VLAN network itself is not well suited to spreading across multiple sites, although you can use some gateways to do the same thing.

So flexible tenant level L2 networking across multiple Neutron servers in one site or in multiple sites is needed.

Proposed Change

Cross Neutron L2 networking can be divided into three categories, VLAN, Shared VxLAN and Mixed VLAN/VxLAN.

  • VLAN

    The network in each bottom OpenStack instance is of VLAN type and has the same VLAN ID. If we want VLAN L2 networking to work in a multi-site scenario, i.e., multiple OpenStack instances in multiple sites, a physical gateway needs to be manually configured to extend one VLAN network to other sites.

    Manual setup of the physical gateway is out of the scope of this spec.

  • Shared VxLAN

    Network in each bottom OpenStack instance is VxLAN type and has the same VxLAN ID.

    Leverage L2GW[2][3] to implement this type of L2 networking.

  • Mixed VLAN/VxLAN

    Network in each bottom OpenStack instance may have different types and/or have different segment IDs.

    Leverage L2GW[2][3] to implement this type of L2 networking.

There is another network type called “Local Network”. A “Local Network” is present in only one bottom OpenStack instance and is not present in other bottom OpenStack instances. If a VM in another pod tries to attach to the “Local Network”, the attempt should fail. This network type is useful for scenarios in which cross Neutron L2 networking is not required and one AZ does not include more than one bottom OpenStack instance.

Cross Neutron L2 networking can be established dynamically while a tenant’s VM is being provisioned.

It is assumed here that only one type of L2 networking will work in one cloud deployment.

A Cross Neutron L2 Networking Creation

Cross Neutron L2 networking is created based on the az_hint attribute of the network. If az_hint includes one or more AZs, the network is present only in that AZ or those AZs. If az_hint contains no AZ, the network can be extended to any bottom OpenStack instance.

There is a special use case for external network creation. For external network creation, you need to specify the pod_id rather than the AZ in the az_hint, so that the external network is created in only one specified pod per AZ.

Support for an external network in multiple OpenStack instances in one AZ is out of the scope of this spec.

A pluggable L2 networking framework is proposed to deal with the three types of cross Neutron L2 networking, and it should be compatible with the Local Network.

  1. Type Driver under Tricircle Plugin in Neutron API server

  • Type driver to distinguish the different types of cross Neutron L2 networking. The Tricircle plugin needs to load the type driver according to the configuration. The Tricircle can reuse the ML2 type drivers with updates.

  • Type driver to allocate VLAN segment id for VLAN L2 networking.

  • Type driver to allocate VxLAN segment id for shared VxLAN L2 networking.

  • Type driver for mixed VLAN/VxLAN to allocate VxLAN segment id for the network connecting L2GWs[2][3].

  • Type driver for Local Network only updating network_type for the network to the Tricircle Neutron DB.

When a network creation request is received in Neutron API Server in the Tricircle, the type driver will be called based on the configured network type.
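
The dispatch from configured network type to type driver can be pictured with the sketch below. It is not the ML2 type driver interface: the class names, the allocate_segment method, the ID ranges and the DRIVERS registry are all hypothetical, and only the tenant_network_type values local_network, vlan and shared_vxlan are taken from this spec.

```python
class LocalNetworkDriver:
    """Local Network needs no segment; only the network type is recorded."""

    network_type = "local_network"

    def allocate_segment(self):
        return {"network_type": self.network_type}


class RangeTypeDriver:
    """Hands out segment ids in order from a configured inclusive range."""

    def __init__(self, network_type, first_id, last_id):
        self.network_type = network_type
        self._free_ids = iter(range(first_id, last_id + 1))

    def allocate_segment(self):
        try:
            segment_id = next(self._free_ids)
        except StopIteration:
            raise RuntimeError("no free %s segment id" % self.network_type)
        return {"network_type": self.network_type,
                "segmentation_id": segment_id}


def load_type_driver(tenant_network_type, drivers):
    """Pick the driver that matches the configured tenant_network_type."""
    try:
        return drivers[tenant_network_type]
    except KeyError:
        raise ValueError("unsupported network type: %s" % tenant_network_type)


# Example registry; the ID ranges are made up for illustration.
DRIVERS = {
    "local_network": LocalNetworkDriver(),
    "vlan": RangeTypeDriver("vlan", 100, 199),
    "shared_vxlan": RangeTypeDriver("vxlan", 1000, 1999),
}
```

Keeping the drivers behind one lookup function is what makes the framework pluggable: a new network type only needs a new entry in the registry.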

  2. Nova API-GW to trigger the bottom networking automation

Nova API-GW becomes aware that a new VM is being provisioned when it receives a boot VM API request; therefore Nova API-GW is responsible for network creation in the bottom OpenStack instances.

Nova API-GW needs to get the network type from the Neutron API server in the Tricircle, and deal with the networking automation based on the network type:

  • VLAN: Nova API-GW creates the network in the bottom OpenStack instance in which the VM will run, using the VLAN segment id, network name and type retrieved from the Neutron API server in the Tricircle.

  • Shared VxLAN: Nova API-GW creates the network in the bottom OpenStack instance in which the VM will run, using the VxLAN segment id, network name and type retrieved from the Tricircle Neutron API server. After the network in the bottom OpenStack instance is created successfully, Nova API-GW needs to add this bottom network as one of the segments of the network in the Tricircle.

  • Mixed VLAN/VxLAN: Nova API-GW creates networks in the different bottom OpenStack instances in which the VMs will run, with the VLAN or VxLAN segment id respectively, and with the network name and type retrieved from the Tricircle Neutron API server. After the networks in the bottom OpenStack instances are created successfully, Nova API-GW needs to update the network in the Tricircle with the segmentation information of the bottom networks.

  3. L2GW driver under Tricircle Plugin in Neutron API server

Tricircle plugin needs to support multi-segment network extension[4].

For Shared VxLAN or Mixed VLAN/VxLAN L2 network type, L2GW driver will utilize the multi-segment network extension in Neutron API server to build the L2 network in the Tricircle. Each network in the bottom OpenStack instance will be a segment for the whole cross Neutron L2 networking in the Tricircle.

After the network in the bottom OpenStack instance is created successfully, Nova API-GW calls the Neutron server API to update the network in the Tricircle with a new segment from the network in the bottom OpenStack instance.

If the network in the bottom OpenStack instance is removed successfully, Nova API-GW calls the Neutron server API to remove that bottom OpenStack instance’s segment from the network in the Tricircle.
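
A minimal sketch of the segment add and remove steps follows, operating on a plain dictionary that stands in for the top network. The segments list and the provider:* keys follow the Neutron multi-provider extension[4]; the helper names and the choice to identify a bottom segment by its physical_network value are assumptions made for illustration.

```python
import copy


def add_segment(top_network, network_type, physical_network, segmentation_id):
    """Return a copy of the top network with one more bottom segment."""
    updated = copy.deepcopy(top_network)
    segment = {
        "provider:network_type": network_type,
        "provider:physical_network": physical_network,
        "provider:segmentation_id": segmentation_id,
    }
    segments = updated.setdefault("segments", [])
    if segment not in segments:  # adding the same segment twice is a no-op
        segments.append(segment)
    return updated


def remove_segments_for(top_network, physical_network):
    """Return a copy without the segments of the given physical network."""
    updated = copy.deepcopy(top_network)
    updated["segments"] = [
        seg for seg in updated.get("segments", [])
        if seg["provider:physical_network"] != physical_network
    ]
    return updated
```

Returning copies rather than mutating in place keeps the caller's view of the top network unchanged until the update request actually succeeds.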

When the L2GW driver under the Tricircle plugin in the Neutron API server receives the segment update request, the L2GW driver starts an async job to orchestrate the L2GW API for L2 networking automation[2][3].

Data model impact

In the database, we are considering setting physical_network in the top OpenStack instance to bottom_physical_network#bottom_pod_id to distinguish segmentation information in different bottom OpenStack instances.
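
The proposed bottom_physical_network#bottom_pod_id convention could be handled with a pair of helpers like the ones sketched below. The function names are hypothetical, and the sketch assumes that the bottom physical network name itself never contains the "#" separator.

```python
SEPARATOR = "#"


def encode_physical_network(bottom_physical_network, bottom_pod_id):
    """Build the value stored as physical_network in the top layer."""
    if SEPARATOR in bottom_physical_network:
        raise ValueError("physical network name must not contain '#'")
    return bottom_physical_network + SEPARATOR + bottom_pod_id


def decode_physical_network(top_physical_network):
    """Split the stored value back into (bottom_physical_network, pod_id)."""
    name, separator, pod_id = top_physical_network.partition(SEPARATOR)
    if not separator or not name or not pod_id:
        raise ValueError("not in <physical_network>#<pod_id> form")
    return name, pod_id
```

For example, encoding "extern" for pod "0eb59465-5132-4f57-af01-a9e306158b86" gives a single string that decodes back to the same pair.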

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Local Network Implementation

For Local Network, L2GW is not required. In this scenario, no cross Neutron L2/L3 networking is required.

A user creates network Net1 with a single AZ1 in az_hint. The Tricircle plugin checks the configuration; if tenant_network_type equals local_network, it invokes the Local Network type driver. The Local Network driver under the Tricircle plugin updates network_type in the database.

For example, a user creates VM1 in AZ1, which has only one pod POD1, and connects it to network Net1. Nova API-GW sends a network creation request to POD1 and the VM is booted in AZ1 (there should be only one pod in AZ1).

If a user wants to create VM2 in AZ2, or in POD2 in AZ1, and connect it to network Net1 in the Tricircle, the request fails, because Net1 is a local_network type network and is limited to POD1 in AZ1 only.

VLAN Implementation

For VLAN, L2GW is not required. This is the simplest form of cross Neutron L2 networking, suited to limited scenarios. For example, with a small number of networks, all VLANs are extended through a physical gateway to support cross Neutron VLAN networking, or all Neutron servers sit under the same core switch, which connects them and supports the same visible VLAN ranges.

When a user creates a network called Net1, the Tricircle plugin checks the configuration. If tenant_network_type equals vlan, the Tricircle invokes the VLAN type driver. The VLAN driver creates the segment, sets network_type to VLAN, and updates segment, network_type and physical_network in the DB.

A user creates VM1 in AZ1, and connects it to network Net1. If VM1 is to be booted in POD1, Nova API-GW needs to get the network information and send a network creation message to POD1. The network creation message includes network_type, segment and physical_network.

Then the user creates VM2 in AZ2, and connects it to network Net1. If VM2 is to be booted in POD2, Nova API-GW needs to get the network information and send a network creation message to POD2. The network creation message includes network_type, segment and physical_network.
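
The network creation message sent to each pod might look like the body built below. The provider:* attribute names come from the Neutron provider network extension; the helper name and the sample values are hypothetical, and the sketch only constructs the dictionary without calling any Neutron server.

```python
def build_bottom_network_body(name, network_type, physical_network, segment_id):
    """Build a network-create body carrying the top network's provider info."""
    return {
        "network": {
            "name": name,
            "provider:network_type": network_type,
            "provider:physical_network": physical_network,
            "provider:segmentation_id": segment_id,
        }
    }


# The same segment information is reused for every pod, so POD1 and POD2
# end up with networks that carry the same VLAN ID.
body_for_pod1 = build_bottom_network_body("Net1", "vlan", "physnet1", 120)
body_for_pod2 = build_bottom_network_body("Net1", "vlan", "physnet1", 120)
```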

Shared VxLAN Implementation

A user creates network Net1. The Tricircle plugin checks the configuration; if tenant_network_type equals shared_vxlan, it invokes the shared VxLAN driver. The shared VxLAN driver allocates the segment, sets network_type to VxLAN, and updates the network with segment and network_type in the DB.

A user creates VM1 in AZ1, and connects it to network Net1. If VM1 is to be booted in POD1, Nova API-GW needs to get the network information and send a network creation message to POD1. The network creation message includes network_type and segment.

Nova API-GW should update Net1 in the Tricircle with the segment information returned by POD1.

Then the user creates VM2 in AZ2, and connects it to network Net1. If VM2 is to be booted in POD2, Nova API-GW needs to get the network information and send a network creation message to POD2. The network creation message includes network_type and segment.

Nova API-GW should update Net1 in the Tricircle with the segment information returned by POD2.

The Tricircle plugin detects that the network includes more than one segment and calls the L2GW driver to start an async job for cross Neutron networking for Net1. The L2GW driver creates L2GW1 in POD1 and L2GW2 in POD2. In POD1, L2GW1 connects the local Net1 and creates an L2GW remote connection to L2GW2, then populates L2GW1 with the MAC/IP information that resides in POD2. In POD2, L2GW2 connects the local Net1 and creates an L2GW remote connection to L2GW1, then populates L2GW2 with the remote MAC/IP information that resides in POD1.
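
The pairing of gateways described above can be sketched as a small planning function. It does not call the L2GW API; the function name, the output layout and the "L2GW-<pod>" naming are hypothetical, and the sketch simply generalizes the two-pod example to any number of pods by connecting every pair in both directions.

```python
from itertools import permutations


def plan_l2gw_connections(pods_with_segment):
    """List the remote connections each pod's L2GW has to create.

    Nothing is planned while the network has a segment in fewer than
    two pods, because no cross Neutron connection is needed yet.
    """
    pods = sorted(set(pods_with_segment))
    if len(pods) < 2:
        return []
    return [
        {"pod": local, "l2gw": "L2GW-" + local, "remote_l2gw": "L2GW-" + remote}
        for local, remote in permutations(pods, 2)
    ]
```

For the two pods in the example, the plan contains one connection from POD1 to POD2 and one from POD2 to POD1.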

The L2GW driver in the Tricircle also detects new port creation/deletion API requests. If a port (MAC/IP) is created or deleted in POD1, the L2GW2 MAC/IP information needs to be refreshed; if a port is created or deleted in POD2, the L2GW1 MAC/IP information needs to be refreshed.

Whether to populate the port (MAC/IP) information should be configurable according to the L2GW capability. MAC/IP information is populated only for ports that do not reside in the same pod.
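
The rule that only remote ports are populated, and only when the feature is enabled, can be sketched as below. The port dictionary layout (pod, mac, ip keys), the function name and the populate_enabled flag are assumptions for illustration.

```python
def entries_to_populate(ports, local_pod, populate_enabled=True):
    """Return the (mac, ip) pairs the L2GW in local_pod should learn.

    Ports that live in local_pod are skipped, and nothing is returned
    when population is switched off for an L2GW that does not need it.
    """
    if not populate_enabled:
        return []
    return [(port["mac"], port["ip"])
            for port in ports if port["pod"] != local_pod]


PORTS = [
    {"pod": "POD1", "mac": "fa:16:3e:00:00:01", "ip": "10.0.1.3"},
    {"pod": "POD2", "mac": "fa:16:3e:00:00:02", "ip": "10.0.1.4"},
]
```

With these sample ports, the gateway in POD1 learns only the POD2 port, and the gateway in POD2 learns only the POD1 port.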

Mixed VLAN/VxLAN

To achieve cross Neutron L2 networking, L2GW is used to connect L2 networks in different Neutron servers. Using L2GW should work for both the Shared VxLAN and Mixed VLAN/VxLAN scenarios.

When L2GW connects to a local network in the same OpenStack instance, whether it is VLAN, VxLAN or GRE, the L2GW should be able to connect to the local network. Because L2GW is an extension of Neutron, the network UUID alone should be enough for L2GW to connect to the local network.

When an admin user creates a network in the Tricircle, he/she specifies the network type as one of the network types discussed above. In the phase of creating the network in the Tricircle, only one record is saved in the database; no network is created in the bottom OpenStack.

After the network in the bottom OpenStack instance is created successfully, the network information, such as segment id, network name and network type, needs to be retrieved, and this bottom network is made one of the segments of the network in the Tricircle.

In the Tricircle, a network can be created by a tenant or an admin. A tenant has no way to specify the network type and segment id, so the default network type is used instead. When a user uses the network to boot a VM, Nova API-GW checks the network type. For a Mixed VLAN/VxLAN network, Nova API-GW first creates the network in the bottom OpenStack without specifying network type and segment ID, then updates the top network with the bottom network segmentation information returned by the bottom OpenStack.

A user creates network Net1. The plugin checks the configuration; if tenant_network_type equals mixed_vlan_vxlan, it invokes the mixed VLAN and VxLAN driver. The driver needs to do nothing, since the segment is allocated in the bottom pod.

A user creates VM1 in AZ1, and connects it to the network Net1. The VM is booted in bottom POD1. Nova API-GW creates the network in POD1, queries the detailed network segmentation information (using the admin role) to get the network type and segment id, and then adds this new segment to Net1 in the Tricircle Neutron API Server.

Then the user creates another VM, VM2, with AZ info AZ2, so the VM should be booted in bottom POD2, which is located in AZ2. When VM2 is to be booted in AZ2, Nova API-GW also creates a network in POD2, queries the network information including segment and network type, and adds this new segment to Net1 in the Tricircle Neutron API Server.

The Tricircle plugin detects that Net1 includes more than one network segment and calls the L2GW driver to start an async job for cross Neutron networking for Net1. The L2GW driver creates L2GW1 in POD1 and L2GW2 in POD2. In POD1, L2GW1 connects the local Net1 and creates an L2GW remote connection to L2GW2, then populates L2GW1 with the MAC/IP information that resides in POD2. In POD2, L2GW2 connects the local Net1 and creates an L2GW remote connection to L2GW1, then populates L2GW2 with the remote MAC/IP information that resides in POD1.

The L2GW driver in the Tricircle also detects new port creation/deletion API calls. If a port (MAC/IP) is created or deleted in POD1, the L2GW2 MAC/IP information needs to be refreshed. If a port (MAC/IP) is created or deleted in POD2, the L2GW1 MAC/IP information needs to be refreshed.

Whether to populate MAC/IP information should be configurable according to the L2GW capability. MAC/IP information is populated only for ports that do not reside in the same pod.

L3 bridge network

Current implementation without cross Neutron L2 networking.

  • A special bridge network is created and connected to the routers in different bottom OpenStack instances. We configure the extra routes of the routers to route the packets from one OpenStack to another. In current implementation, we create this special bridge network in each bottom OpenStack with the same VLAN ID, so we have an L2 network to connect the routers.

Difference between L2 networking for tenant’s VM and for L3 bridging network.

  • The creation of bridge network is triggered during attaching router interface and adding router external gateway.

  • The L2 network for a VM is triggered by Nova API-GW when a VM is to be created in a pod and Nova API-GW finds that the network does not exist there; the network is then created before the VM is booted. A network or port parameter is required to boot a VM. The IP/MAC for the VM is allocated in the Tricircle, the top layer, to avoid the IP/MAC collisions that could occur if they were allocated separately in the bottom pods.

After cross Neutron L2 networking is introduced, the L3 bridge network should be updated too.

L3 bridge network N-S (North-South):

  • For each tenant, one cross Neutron N-S bridge network should be created for router N-S inter-connection. Just replace the current VLAN N-S bridge network with the corresponding Shared VxLAN or Mixed VLAN/VxLAN network.

L3 bridge network E-W (East-West):

  • When a router interface is attached, for VLAN, the current process is kept to establish the E-W bridge network. For Shared VxLAN and Mixed VLAN/VxLAN, if an L2 network can be expanded to the current pod, then just expand the L2 network to the pod; all E-W traffic then goes out from the local L2 network, and no bridge network is needed.

  • For example, with (Net1, Router1) in Pod1 and (Net2, Router1) in Pod2, if Net1 is a cross Neutron L2 network and can be expanded to Pod2, then Net1 is simply expanded to Pod2. After the Net1 expansion (just like cross Neutron L2 networking spreading one network over multiple Neutron servers), the layout is (Net1, Router1) in Pod1 and (Net1, Net2, Router1) in Pod2. In Pod2 there is no VM in Net1; it is there only for E-W traffic. Now the E-W traffic looks like this:

from Net2 to Net1:

Net2 in Pod2 -> Router1 in Pod2 -> Net1 in Pod2 -> L2GW in Pod2 -> L2GW in Pod1 -> Net1 in Pod1.

Note: The traffic from Net1 in Pod2 to Net1 in Pod1 can bypass the L2GW in Pod2. That is, outbound traffic can bypass the local L2GW if the remote VTEP of the L2GW is known to the local compute node and the VxLAN-encapsulated packet from the local compute node could be routed to the remote L2GW directly. It’s up to the L2GW implementation. With the inbound traffic going through the L2GW, the inbound traffic to the VM is not impacted by VM migration from one host to another.

If Net2 is a cross Neutron L2 network and can be expanded to Pod1 too, then Net2 is simply expanded to Pod1. After the Net2 expansion (just like cross Neutron L2 networking spreading one network over multiple Neutron servers), the layout is (Net2, Net1, Router1) in Pod1 and (Net1, Net2, Router1) in Pod2. In Pod1 there is no VM in Net2; it is there only for E-W traffic. Now the E-W traffic from Net1 to Net2 looks like this:

Net1 in Pod1 -> Router1 in Pod1 -> Net2 in Pod1 -> L2GW in Pod1 -> L2GW in Pod2 -> Net2 in Pod2.

To limit the complexity, a network’s az_hint can only be specified at creation time, and no update is allowed. If az_hint needs to be updated, you have to delete the network and create it again.

If the network can’t be expanded, then an E-W bridge network is needed. For example, Net1(AZ1, AZ2, AZ3), Router1; Net2(AZ4, AZ5, AZ6), Router1. Then a cross Neutron L2 bridge network has to be established:

Net1(AZ1, AZ2, AZ3), Router1 -> E-W bridge network -> Router1, Net2(AZ4, AZ5, AZ6).
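
The choice between expanding a network and falling back to an E-W bridge network can be sketched as follows. The sketch assumes that whether a network can be expanded to a pod is decided purely by its az_hint (an empty hint allows any AZ); a real deployment may apply further conditions, and both function names are hypothetical.

```python
def can_expand_to(az_hints, pod_az):
    """A network can reach a pod if its az_hint is empty or names the pod's AZ."""
    return not az_hints or pod_az in az_hints


def needs_ew_bridge(net1_az_hints, pod1_az, net2_az_hints, pod2_az):
    """Decide whether Router1 needs an E-W bridge between two pods.

    Net1 lives in the pod in pod1_az and Net2 in the pod in pod2_az.
    If either network can be expanded into the other pod, E-W traffic
    stays on that L2 network and no bridge network is required.
    """
    net1_reaches_pod2 = can_expand_to(net1_az_hints, pod2_az)
    net2_reaches_pod1 = can_expand_to(net2_az_hints, pod1_az)
    return not (net1_reaches_pod2 or net2_reaches_pod1)
```

For the example above, needs_ew_bridge(["AZ1", "AZ2", "AZ3"], "AZ1", ["AZ4", "AZ5", "AZ6"], "AZ4") is True, because neither network may be expanded into the other network's AZs.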

Assignee(s)

Primary assignee:

Other contributors:

Work Items

Dependencies

None

Testing

None

References

[1] https://docs.google.com/document/d/18kZZ1snMOCD9IQvUKI5NVDzSASpw-QKj7l2zNqMEd3g/

[2] https://review.opendev.org/#/c/270786/

[3] https://github.com/openstack/networking-l2gw/blob/master/specs/kilo/l2-gateway-api.rst

[4] https://developer.openstack.org/api-ref-networking-v2-ext.html#networks-multi-provider-ext

[5] https://docs.openstack.org/mitaka/networking-guide/config-az.html

[6] https://review.opendev.org/#/c/306224/

Cross Neutron VxLAN Networking in Tricircle

Background

Currently we only support VLAN as the cross-Neutron network type. For the VLAN network type, the central plugin in the Tricircle picks a physical network and allocates a VLAN tag (or uses what users specify); then, before the creation of the local network, the local plugin queries this provider network information and creates the network based on it. The Tricircle only guarantees that instance packets sent out of hosts in different pods belonging to the same VLAN network will be tagged with the same VLAN ID. Deployers need to carefully configure physical networks and switch ports to make sure that packets can be transported correctly between physical devices.

For more flexible deployment, the VxLAN network type is a better choice. Compared to the 12-bit VLAN ID, the 24-bit VxLAN ID can support a larger number of bridge networks and cross-Neutron L2 networks. With the MAC-in-UDP encapsulation of VxLAN networks, hosts in different pods only need to be IP routable to transport instance packets.
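
To make the difference in ID space concrete, the short calculation below compares the number of distinct values a 12-bit and a 24-bit identifier can take. It counts raw bit patterns only; reserved values are not subtracted.

```python
VLAN_ID_BITS = 12
VXLAN_ID_BITS = 24

vlan_id_space = 2 ** VLAN_ID_BITS    # distinct 12-bit values
vxlan_id_space = 2 ** VXLAN_ID_BITS  # distinct 24-bit values

print(vlan_id_space, vxlan_id_space, vxlan_id_space // vlan_id_space)
# prints: 4096 16777216 4096
```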

Proposal

There are several challenges in supporting cross-Neutron VxLAN networks:

  1. How to keep VxLAN ID identical for the same VxLAN network across Neutron servers

  2. How to synchronize tunnel endpoint information between pods

  3. How to trigger L2 agents to build tunnels based on this information

  4. How to support different back-ends, like ODL, L2 gateway

The first challenge can be solved in the same way as for VLAN networks: we allocate the VxLAN ID in the central plugin, and the local plugin uses the same VxLAN ID to create the local network. For the second challenge, we introduce a new table called “shadow_agents” in the Tricircle database, so the central plugin can save the tunnel endpoint information collected from one local Neutron server in this table and use it to populate the information to other local Neutron servers when needed. Here is the schema of the table:

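Shadow Agent Table

+-----------+--------+----------+---------+---------+
| Field     | Type   | Nullable | Key     | Default |
+-----------+--------+----------+---------+---------+
| id        | string | no       | primary | null    |
| pod_id    | string | no       |         | null    |
| host      | string | no       | unique  | null    |
| type      | string | no       | unique  | null    |
| tunnel_ip | string | no       |         | null    |
+-----------+--------+----------+---------+---------+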
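As a rough illustration of the schema, the table can be sketched with Python’s built-in sqlite3 module. This is not the Tricircle migration code. It assumes that the two “unique” keys form one composite unique constraint on (host, type), and the sample values are made up.

```python
import sqlite3

# Assumption: "unique" on both host and type is read as a composite
# unique constraint on (host, type). Column lengths are not specified
# in the spec, so plain TEXT is used here.
SCHEMA = """
CREATE TABLE shadow_agents (
    id        TEXT NOT NULL PRIMARY KEY,
    pod_id    TEXT NOT NULL,
    host      TEXT NOT NULL,
    type      TEXT NOT NULL,
    tunnel_ip TEXT NOT NULL,
    UNIQUE (host, type)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO shadow_agents VALUES (?, ?, ?, ?, ?)",
    ("agent-1", "pod-1", "host1", "Open vSwitch agent", "192.168.1.101"),
)

# A second record for the same host and agent type is rejected.
try:
    conn.execute(
        "INSERT INTO shadow_agents VALUES (?, ?, ?, ?, ?)",
        ("agent-2", "pod-1", "host1", "Open vSwitch agent", "192.168.1.102"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM shadow_agents").fetchone()[0]
```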

How to collect tunnel endpoint information

When the host where a port will be located is determined, the local Neutron server will receive a port-update request containing the host ID in the body. While processing this request, the local plugin can query the agent information, which contains the tunnel endpoint information, from the local Neutron database using the host ID and port VIF type. It then sends the tunnel endpoint information to the central Neutron server by issuing a port-update request with this information in the binding profile.
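A minimal sketch of how such a request body could be assembled. The profile key names are hypothetical (the spec does not name them), and the agent record layout with a “tunneling_ip” entry under “configurations” is an assumption modeled on what an Open vSwitch agent typically reports.

```python
# Hypothetical binding-profile keys; the spec does not define them.
PROFILE_HOST = "host"
PROFILE_AGENT_TYPE = "type"
PROFILE_TUNNEL_IP = "tunnel_ip"


def build_central_port_update(host_id, agent):
    """Build the port-update body sent from local to central Neutron.

    `agent` is the record found in the local Neutron database for the
    given host (assumed layout, see the lead-in).
    """
    profile = {
        PROFILE_HOST: host_id,
        PROFILE_AGENT_TYPE: agent["agent_type"],
        PROFILE_TUNNEL_IP: agent["configurations"]["tunneling_ip"],
    }
    return {"port": {"binding:profile": profile}}


agent = {
    "agent_type": "Open vSwitch agent",
    "host": "host1",
    "configurations": {"tunneling_ip": "192.168.1.101"},
}
body = build_central_port_update("host1", agent)
```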

How to populate tunnel endpoint information

When the tunnel endpoint information in one pod needs to be populated to other pods, XJob will issue port-create requests to the corresponding local Neutron servers, with the tunnel endpoint information queried from the Tricircle database in the request bodies. After receiving such a request, the local Neutron server will save the tunnel endpoint information by calling the real core plugin’s “create_or_update_agent” method. This method comes from the neutron.db.agents_db.AgentDbMixin class. Plugins that support the “agent” extension will have this method. There is actually no such agent daemon running in the target local Neutron server, but we insert a record for it in the database so the local Neutron server will assume the agent exists. That’s why we call it a shadow agent.

The proposed solution for the third challenge is based on the shadow agent and L2 population mechanism. In the original Neutron process, if the port status is updated to active, L2 population mechanism driver does two things. First, driver checks if the updated port is the first port in the target agent. If so, driver collects tunnel endpoint information of other ports in the same network, then sends the information to the target agent via RPC. Second, driver sends the tunnel endpoint information of the updated port to other agents where ports in the same network are located, also via RPC. L2 agents will build the tunnels based on the information they received. To trigger the above processes to build tunnels across Neutron servers, we further introduce shadow port.
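The two steps performed by the L2 population driver can be sketched as a pure function that returns which forwarding entries would be sent to which host. This is a simplified model for illustration, not the Neutron driver code; the port dictionaries and the function name are assumptions.

```python
from collections import defaultdict


def l2pop_messages(network_ports, updated_port):
    """Return {target_host: [(mac, tunnel_ip), ...]} for a port going active.

    network_ports: ports already active in the same network, each a dict
    with "host", "mac" and "tunnel_ip" (simplified, assumed layout).
    """
    messages = defaultdict(list)
    target = updated_port["host"]
    others = [p for p in network_ports if p["host"] != target]
    first_on_agent = all(p["host"] != target for p in network_ports)

    # Step 1: if this is the first port on the target agent, send it the
    # tunnel endpoints of the other ports in the network.
    if first_on_agent:
        for port in others:
            messages[target].append((port["mac"], port["tunnel_ip"]))

    # Step 2: tell every other agent about the updated port.
    for host in sorted({p["host"] for p in others}):
        messages[host].append((updated_port["mac"], updated_port["tunnel_ip"]))
    return dict(messages)


port1 = {"host": "host1", "mac": "fa:16:3e:00:00:01", "tunnel_ip": "10.0.1.1"}
port2 = {"host": "host2", "mac": "fa:16:3e:00:00:02", "tunnel_ip": "10.0.2.2"}
result = l2pop_messages([port2], port1)
```

Here host1 learns the endpoint of port2, and host2 learns the endpoint of port1.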

Let’s say we have two instance ports: port1 is located on host1 in pod1 and port2 is located on host2 in pod2. To make the L2 agent running on host1 build a tunnel to host2, we create a port with the same properties as port2 in pod1. As discussed above, the local Neutron server will create a shadow agent while processing the port-create request, so the local Neutron server in pod1 won’t complain that host2 doesn’t exist. To trigger the L2 population process, we then update the port status to active, so the L2 agent on host1 will receive the tunnel endpoint information of port2 and build the tunnel. Port status is a read-only property, so we can’t update it directly via the RESTful API. Instead, we issue a port-update request with a special key in the binding profile. After the local Neutron server receives such a request, it pops the special key from the binding profile and updates the port status to active. The XJob daemon will take the job of creating and updating shadow ports.
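The special-key handling can be sketched as follows. The key name “force_up” is hypothetical, since the spec only says “a special key”, and the port is modeled as a plain dictionary.

```python
# Hypothetical name for the special key in the binding profile.
FORCE_UP_KEY = "force_up"


def apply_port_update(port, body):
    """Apply a port-update body; the special key flips the status to ACTIVE.

    The key is popped so it is never stored in the binding profile.
    """
    profile = dict(body["port"].get("binding:profile", {}))
    if profile.pop(FORCE_UP_KEY, False):
        port["status"] = "ACTIVE"
    port["binding:profile"] = profile
    return port


shadow_port = {"id": "shadow-port2", "status": "DOWN", "binding:profile": {}}
body = {"port": {"binding:profile": {FORCE_UP_KEY: True, "host": "host2"}}}
updated = apply_port_update(shadow_port, body)
```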

Here is the flow of shadow agent and shadow port process:

+-------+       +---------+                                                          +---------+
|       |       |         |     +---------+                                          |         |
| Local |       | Local   |     |         |      +----------+       +------+         | Local   |
| Nova  |       | Neutron |     | Central |      |          |       |      |         | Neutron |
| Pod1  |       | Pod1    |     | Neutron |      | Database |       | XJob |         | Pod2    |
|       |       |         |     |         |      |          |       |      |         |         |
+---+---+       +----+----+     +----+----+      +----+-----+       +--+---+         +----+----+
    |               |                |                |                |                  |
    | update port1  |                |                |                |                  |
    | [host id]     |                |                |                |                  |
    +--------------->                |                |                |                  |
    |               | update port1   |                |                |                  |
    |               | [agent info]   |                |                |                  |
    |               +---------------->                |                |                  |
    |               |                | save shadow    |                |                  |
    |               |                | agent info     |                |                  |
    |               |                +---------------->                |                  |
    |               |                |                |                |                  |
    |               |                | trigger shadow |                |                  |
    |               |                | port setup job |                |                  |
    |               |                | for pod1       |                |                  |
    |               |                +--------------------------------->                  |
    |               |                |                |                | query ports in   |
    |               |                |                |                | the same network |
    |               |                |                |                +------------------>
    |               |                |                |                |                  |
    |               |                |                |                |     return port2 |
    |               |                |                |                <------------------+
    |               |                |                |   query shadow |                  |
    |               |                |                |     agent info |                  |
    |               |                |                |      for port2 |                  |
    |               |                |                <----------------+                  |
    |               |                |                |                |                  |
    |               |                |                |  create shadow |                  |
    |               |                |                | port for port2 |                  |
    |               <--------------------------------------------------+                  |
    |               |                |                |                |                  |
    |               | create shadow  |                |                |                  |
    |               | agent and port |                |                |                  |
    |               +-----+          |                |                |                  |
    |               |     |          |                |                |                  |
    |               |     |          |                |                |                  |
    |               <-----+          |                |                |                  |
    |               |                |                |  update shadow |                  |
    |               |                |                | port to active |                  |
    |               <--------------------------------------------------+                  |
    |               |                |                |                |                  |
    |               | L2 population  |                |                | trigger shadow   |
    |               +-----+          |                |                | port setup job   |
    |               |     |          |                |                | for pod2         |
    |               |     |          |                |                +-----+            |
    |               <-----+          |                |                |     |            |
    |               |                |                |                |     |            |
    |               |                |                |                <-----+            |
    |               |                |                |                |                  |
    |               |                |                |                |                  |
    +               +                +                +                +                  +

Bridge networks can support the VxLAN network type in the same way: we just create shadow ports for the router interface and the router gateway. In the above graph, the local Nova server updates the port with the host ID to trigger the whole process. The L3 agent will update the interface port and the gateway port with the host ID, so a similar process will be triggered to create shadow ports for the router interface and the router gateway.

Currently the Neutron team is working on push notification [1]: the Neutron server will send resource data to agents, and agents cache this data and use it to do the real work, like configuring openvswitch, updating iptables, configuring dnsmasq, etc. Agents no longer need to retrieve resource data from the Neutron server via RPC. Based on push notification, if tunnel endpoint information is later stored in the port object, and this information can be updated via the RESTful API, we can simplify the solution for challenges 3 and 4. We just need to create a shadow port containing the tunnel endpoint information. This information will be pushed to agents, and agents will use it to create the necessary tunnels and flows.

How to support different back-ends besides ML2+OVS implementation

We consider two typical back-ends that can support cross-Neutron VxLAN networking: L2 gateway and SDN controllers like ODL.

For L2 gateway, we only consider supporting static tunnel endpoint information as the first step. The shadow agent and shadow port process is almost the same as in the ML2+OVS implementation. The difference is that, for L2 gateway, the tunnel IP of the shadow agent is set to the tunnel endpoint of the L2 gateway. So after L2 population, L2 agents will create tunnels to the tunnel endpoint of the L2 gateway.

For SDN controller, we assume that the SDN controller has the ability to manage tunnel endpoint information across Neutron servers, so Tricircle only helps to allocate the VxLAN ID and keep the VxLAN ID identical across Neutron servers for one network. The shadow agent and shadow port process will not be used in this case. However, if different SDN controllers are used in different pods, it will be hard for each SDN controller to connect hosts managed by other SDN controllers, since each SDN controller has its own mechanism. This problem is discussed in [2]. One possible solution under Tricircle is to do what the L2 gateway case does: we create shadow ports that contain the L2 gateway tunnel endpoint information, so the SDN controller can build tunnels in its own way. We then configure the L2 gateway in each pod to forward packets between L2 gateways. The L2 gateways discussed here are mostly hardware based and can be controlled by the SDN controller. The SDN controller will use an ML2 mechanism driver to receive the L2 network context and further control the L2 gateways for the network.

To distinguish different back-ends, we will add a new configuration option cross_pod_vxlan_mode whose valid values are “p2p”, “l2gw” and “noop”. Mode “p2p” works for the ML2+OVS scenario; in this mode, shadow ports and shadow agents containing host tunnel endpoint information are created. Mode “l2gw” works for the L2 gateway scenario; in this mode, shadow ports and shadow agents containing L2 gateway tunnel endpoint information are created. For the SDN controller scenario, as discussed above, if the SDN controller can manage tunnel endpoint information by itself, we only need to use “noop” mode, meaning that neither shadow ports nor shadow agents will be created; if the SDN controller can manage a hardware L2 gateway, we can use “l2gw” mode.
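The effect of the three modes can be summarized in a small helper. This is only a sketch: the function name and arguments are hypothetical, and it does not show how the option would be registered with the configuration system.

```python
VALID_MODES = ("p2p", "l2gw", "noop")


def shadow_tunnel_ip(mode, host_tunnel_ip, l2gw_tunnel_ip):
    """Return the tunnel IP a shadow agent should carry, or None.

    None means that neither shadow ports nor shadow agents are created.
    """
    if mode not in VALID_MODES:
        raise ValueError("invalid cross_pod_vxlan_mode: %r" % (mode,))
    if mode == "p2p":
        # ML2+OVS: tunnels go directly to the remote host.
        return host_tunnel_ip
    if mode == "l2gw":
        # L2 gateway: tunnels terminate at the gateway endpoint.
        return l2gw_tunnel_ip
    # "noop": the SDN controller manages tunnel endpoints itself.
    return None


p2p_ip = shadow_tunnel_ip("p2p", "10.0.2.2", "10.0.0.254")
l2gw_ip = shadow_tunnel_ip("l2gw", "10.0.2.2", "10.0.0.254")
noop_ip = shadow_tunnel_ip("noop", "10.0.2.2", "10.0.0.254")
```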

Data Model Impact

New table “shadow_agents” is added.

Dependencies

None

Documentation Impact

  • Update configuration guide to introduce options for VxLAN network

  • Update networking guide to discuss new scenarios with VxLAN network

  • Add release note about cross-Neutron VxLAN networking support

Dynamic Pod Binding in Tricircle

Background

Most public cloud infrastructure is built with Availability Zones (AZs). Each AZ consists of one or more discrete data centers, each with high-bandwidth, low-latency network connections and separate power and facilities. AZs let cloud tenants deploy production applications and databases across multiple AZs, making them more highly available, fault tolerant and scalable than they would be in a single data center.

In production clouds, each AZ is built by modularized OpenStack, and each OpenStack is one pod. Moreover, one AZ can include multiple pods. The pods are classified into different categories. For example, servers in one pod are only for general purposes, while other pods may be built for heavy-load CAD modeling with GPUs. So pods in one AZ can be divided into different groups. Different pod groups serve different purposes, and the cost and performance of their VMs differ as well.

The concept “pod” is created for the Tricircle to facilitate managing OpenStack instances among AZs, which therefore is transparent to cloud tenants. The Tricircle maintains and manages a pod binding table which records the mapping relationship between a cloud tenant and pods. When the cloud tenant creates a VM or a volume, the Tricircle tries to assign a pod based on the pod binding table.

Motivation

In the resource allocation scenario, suppose a tenant creates a VM in one pod and a new volume in another pod. If the tenant attempts to attach the volume to the VM, the operation will fail. In other words, the volume should be in the same pod as the VM, otherwise the attachment cannot be completed. Hence, the Tricircle needs to enforce the pod binding so as to guarantee that the VM and the volume are created in one pod.

In the capacity expansion scenario, when resources in one pod are exhausted, a new pod of the same type should be added to the AZ. New resources of this type should then be provisioned in the newly added pod, which requires a dynamic change of the pod binding. The pod binding could be changed dynamically by the Tricircle, or by the admin through the Admin API for maintenance purposes. For example, during a maintenance (upgrade, repair) window, all new provisioning requests should be forwarded to the running pod, not the one under maintenance.

Solution: dynamic pod binding

Capacity expansion inside one pod is quite a headache: you have to estimate, calculate, monitor, simulate, test, and do online grey expansion for controller nodes and network nodes whenever you add new machines to the pod. It becomes a bigger challenge as more and more resources are added to one pod, and eventually you will reach the limits of a single OpenStack. If a pod’s resources are exhausted or reach the limit for provisioning new resources, the Tricircle needs to bind the tenant to a new pod instead of expanding the current pod without limit. The Tricircle needs to select a proper pod and keep the binding for a duration; during this duration, VMs and volumes for one tenant are created in the same pod.

For example, suppose we have two groups of pods, and each group has 3 pods, i.e.,

GroupA(Pod1, Pod2, Pod3) for general purpose VM,

GroupB(Pod4, Pod5, Pod6) for CAD modeling.

Tenant1 is bound to Pod1 and Pod4 during the first phase, for several months. In the first phase, we can just add a weight to each pod, for example, Pod1 with weight 1 and Pod2 with weight 2. This could be done by adding a new field to the pod table, or with no new field at all by ordering pods by the time they were created in the Tricircle. In this case, we use the pod creation time as the weight.

If the tenant wants to allocate VM/volume for general VM, Pod1 should be selected. It can be implemented with flavor or volume type metadata. For general VM/Volume, there is no special tag in flavor or volume type metadata.

If the tenant wants to allocate VM/volume for CAD modeling VM, Pod4 should be selected. For CAD modeling VM/Volume, a special tag “resource: CAD Modeling” in flavor or volume type metadata determines the binding.

When it is detected that there are no more resources in Pod1 and Pod4, the Tricircle queries the pod table, based on the resource_affinity_tag, for available pods which provision a specific type of resources. The resource_affinity_tag field is a key-value pair. A pod is selected when a matching key-value pair exists in the flavor extra-spec or volume extra-spec. A tenant will be bound to one pod in each group of pods with the same resource_affinity_tag. In this case, the Tricircle obtains Pod2 and Pod3 for general purpose, as well as Pod5 and Pod6 for CAD purpose. The Tricircle needs to change the binding; for example, tenant1 needs to be bound to Pod2 and Pod5.

Implementation

Measurement

To get the information of resource utilization of pods, the Tricircle needs to conduct some measurements on pods. The statistic task should be done in bottom pod.

For resource usage, Nova cells currently provide an interface to retrieve the usage of a cell [1]. OpenStack provides details of the capacity of a cell, including disk and RAM, via the API for showing cell capacities [1].

If OpenStack is not running with cells mode, we can ask Nova to provide an interface to show the usage detail in AZ. Moreover, an API for usage query at host level is provided for admins [3], through which we can obtain details of a host, including cpu, memory, disk, and so on.

Cinder also provides interface to retrieve the backend pool usage, including updated time, total capacity, free capacity and so on [2].

The Tricircle needs to have one task that collects the usage in the bottom pods on a daily basis, to evaluate whether the threshold is reached or not. A threshold or headroom could be configured for each pod, so that resources are never exhausted to 100%.

On top there should be no heavy process. So getting the sum info from the bottom can be done in the Tricircle. After collecting the details, the Tricircle can judge whether a pod reaches its limit.

Tricircle

The Tricircle needs a framework to support different binding policy (filter).

Each pod is one OpenStack instance, including controller nodes and compute nodes. E.g.,

                       +->  controller(s) - pod1 <--> compute nodes <---+
                                                                        |
 The tricircle         +->  controller(s) - pod2 <--> compute nodes <---+ resource migration, if necessary
(resource controller)                       ....                        |
                       +->  controller(s) - pod{N} <--> compute nodes <-+

The Tricircle selects a pod to decide which controller the requests should be forwarded to. Then the controllers in the selected pod will do their own scheduling.

The simplest binding filter is as follows. Line up all available pods in a list and always select the first one. When all the resources in the first pod have been allocated, remove it from the list. This is quite like how a production cloud is built: at first, only a few pods are in the list, and more and more pods are added when there are not enough resources in the current cloud. For example,

List1 for general pool: Pod1 <- Pod2 <- Pod3

List2 for CAD modeling pool: Pod4 <- Pod5 <- Pod6

If Pod1’s resources are exhausted, Pod1 is removed from List1. List1 changes to: Pod2 <- Pod3.

If Pod4’s resources are exhausted, Pod4 is removed from List2. List2 changes to: Pod5 <- Pod6.

If the tenant wants to allocate resources for general VM, the Tricircle selects Pod2. If the tenant wants to allocate resources for CAD modeling VM, the Tricircle selects Pod5.
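The list-based binding described above can be sketched in a few lines. The class name PodList is hypothetical and pods are represented by their names only.

```python
class PodList:
    """Ordered list of available pods; the first entry is always selected."""

    def __init__(self, pods):
        self._pods = list(pods)

    def select(self):
        # Returns None when every pod in the pool is exhausted.
        return self._pods[0] if self._pods else None

    def mark_exhausted(self, pod):
        if pod in self._pods:
            self._pods.remove(pod)


general = PodList(["Pod1", "Pod2", "Pod3"])   # List1
cad = PodList(["Pod4", "Pod5", "Pod6"])       # List2

first_general = general.select()   # Pod1 while it still has resources

general.mark_exhausted("Pod1")
cad.mark_exhausted("Pod4")

print(general.select(), cad.select())  # Pod2 Pod5
```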

Filtering

To select pods, we need a series of filters. Before dynamic pod binding is implemented, the binding criteria are hard-coded to select the first pod in the AZ. Hence, we need to design a series of filter algorithms. Firstly, we plan to design an ALLPodsFilter which does no filtering and passes all the available pods. Secondly, we plan to design an AvailabilityZoneFilter which passes the pods matching the specified availability zone. Thirdly, we plan to design a ResourceAffinityFilter which passes the pods matching the specified resource type. Based on the resource_affinity_tag, the Tricircle can be aware of which type of resource the tenant wants to provision. In the future, we can add more filters, which requires adding more information to the pod table.
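A minimal sketch of how the three planned filters could be chained. The class names follow the text; the `filter` method signature, the pod and request dictionaries, and the rule that untagged pods serve untagged requests are all assumptions made for illustration.

```python
class ALLPodsFilter:
    """Does no filtering and passes all available pods."""

    def filter(self, pods, request):
        return list(pods)


class AvailabilityZoneFilter:
    """Passes pods in the requested availability zone (all pods if none given)."""

    def filter(self, pods, request):
        az = request.get("az")
        return [p for p in pods if az is None or p["az"] == az]


class ResourceAffinityFilter:
    """Passes pods whose resource_affinity_tag matches the extra specs."""

    def filter(self, pods, request):
        wanted = request.get("extra_specs", {}).get("resource")
        return [p for p in pods
                if p["resource_affinity_tag"].get("resource") == wanted]


def filter_pods(filters, pods, request):
    # Each filter narrows the candidate list produced by the previous one.
    for pod_filter in filters:
        pods = pod_filter.filter(pods, request)
    return pods


pods = [
    {"name": "Pod2", "az": "az1", "resource_affinity_tag": {}},
    {"name": "Pod5", "az": "az1",
     "resource_affinity_tag": {"resource": "CAD Modeling"}},
    {"name": "Pod7", "az": "az2", "resource_affinity_tag": {}},
]
chain = [ALLPodsFilter(), AvailabilityZoneFilter(), ResourceAffinityFilter()]
cad_request = {"az": "az1", "extra_specs": {"resource": "CAD Modeling"}}
cad_pods = [p["name"] for p in filter_pods(chain, pods, cad_request)]
general_pods = [p["name"] for p in filter_pods(chain, pods, {"az": "az1"})]
```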

Weighting

After filtering all the pods, the Tricircle obtains the available pods for a tenant. The Tricircle needs to select the most suitable pod for the tenant. Hence, we need to define a weight function to calculate the corresponding weight of each pod. Based on the weights, the Tricircle selects the pod which has the maximum weight value. To calculate the weight of a pod, we need to design a series of weighers. The first weigher takes the pod creation time into consideration. The second one uses the idle capacity, to select the pod which has the most idle capacity. Other metrics, e.g., cost, will be added in the future.
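The weighting step can be sketched as below. This is one possible reading, not the planned implementation: it assumes that older pods should score higher (so the earliest-created pod is preferred), that each weigher returns a value between 0 and 1, and that weights are combined as a multiplier-weighted sum. All names and fields are hypothetical.

```python
from datetime import date


def creation_time_weigher(pod, pods):
    """Older pods score higher: 1.0 for the oldest, 0.0 for the newest."""
    ordered = sorted(pods, key=lambda p: p["created_at"])
    if len(ordered) == 1:
        return 1.0
    return 1.0 - ordered.index(pod) / (len(ordered) - 1)


def idle_capacity_weigher(pod, pods):
    """Fraction of the pod's capacity that is still free."""
    return pod["free"] / pod["total"]


def select_pod(pods, weighers):
    """Pick the pod with the maximum combined weight.

    weighers: list of (weigher_function, multiplier) pairs.
    """
    def total_weight(pod):
        return sum(mult * weigher(pod, pods) for weigher, mult in weighers)
    return max(pods, key=total_weight)


pods = [
    {"name": "Pod2", "created_at": date(2016, 2, 1), "free": 80, "total": 100},
    {"name": "Pod3", "created_at": date(2016, 3, 1), "free": 90, "total": 100},
]
both = select_pod(pods, [(creation_time_weigher, 1.0),
                         (idle_capacity_weigher, 1.0)])
idle_only = select_pod(pods, [(idle_capacity_weigher, 1.0)])
print(both["name"], idle_only["name"])  # Pod2 Pod3
```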

Data Model Impact

Firstly, we need to add a column “resource_affinity_tag” to the pod table, which is used to store the key-value pair, to match flavor extra-spec and volume extra-spec.

Secondly, in the pod binding table, we need to add fields of start binding time and end binding time, so the history of the binding relationship could be stored.

Thirdly, we need a table to store the usage of each pod for Cinder/Nova. We plan to use a JSON object to store the usage information. Hence, even if the usage structure changes, we don’t need to update the table. If the usage value is null, the usage has not been initialized yet. As mentioned above, the usage could be refreshed on a daily basis. If it is not initialized yet, we assume there are still plenty of resources available, and the pod can be scheduled as if it has not reached the usage threshold.
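The threshold check on the stored usage can be sketched as follows. The JSON layout (resource name mapped to “used” and “total”) and the 0.8 default threshold are assumptions; the spec does not fix either.

```python
import json


def reached_threshold(usage_json, threshold=0.8):
    """Return True if any resource in the pod has reached the threshold.

    A null (None) usage means "not initialized yet" and is treated as
    not having reached the threshold, as described above.
    """
    if usage_json is None:
        return False
    usage = json.loads(usage_json)
    return any(item["used"] / item["total"] >= threshold
               for item in usage.values())


busy = '{"vcpus": {"used": 90, "total": 100}, "ram_mb": {"used": 1024, "total": 8192}}'
quiet = '{"vcpus": {"used": 50, "total": 100}}'

print(reached_threshold(busy), reached_threshold(quiet), reached_threshold(None))
# True False False
```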

Dependencies

None

Testing

None

Documentation Impact

None

Reference

[1] https://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities

[2] https://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2

[3] https://developer.openstack.org/api-ref-compute-v2.1.html#showinfo

Enhance Reliability of Asynchronous Job

Background

Currently we are using cast method in our RPC client to trigger asynchronous job in XJob daemon. After one of the worker threads receives the RPC message from the message broker, it registers the job in the database and starts to run the handle function. The registration guarantees that asynchronous job will not be lost after the job fails and the failed job can be redone. The detailed discussion of the asynchronous job process in XJob daemon is covered in our design document [1].

Though asynchronous jobs are correctly saved after worker threads get the RPC message, we still risk losing jobs. With the cast method, it is only guaranteed that the message is received by the message broker; there is no guarantee that the message will be received by the message consumer, i.e., the RPC server thread running in XJob daemon. According to the RabbitMQ documentation, undelivered messages will be lost if the RabbitMQ server stops [2]. Message persistence or publisher confirms can be used to increase reliability, but they sacrifice performance. On the other hand, we cannot assume that message brokers other than RabbitMQ will provide similar persistence or confirmation functionality. Therefore, Tricircle itself should handle the asynchronous job reliability problem as far as possible. Since we already have a framework to register, run and redo asynchronous jobs in XJob daemon, we propose a cheaper way to improve reliability.

Proposal

One straightforward way to make sure that the RPC server has received the RPC message is to use the call method. If the RPC client uses the call method to send the RPC request, it will be blocked until the RPC server replies to the message. So if something goes wrong before the reply, the RPC client can be aware of it. Of course we cannot make the RPC client wait too long, thus RPC handlers on the RPC server side need to be simple and quick to run. Thanks to the asynchronous job framework we already have, migrating from the cast method to the call method is easy.

Here is the flow of the current process:

+--------+     +--------+         +---------+     +---------------+   +----------+
|        |     |        |         |         |     |               |   |          |
| API    |     | RPC    |         | Message |     | RPC Server    |   | Database |
| Server |     | client |         | Broker  |     | Handle Worker |   |          |
|        |     |        |         |         |     |               |   |          |
+---+----+     +---+----+         +----+----+     +-------+-------+   +----+-----+
    |              |                   |                  |                |
    | call RPC API |                   |                  |                |
    +-------------->                   |                  |                |
    |              | send cast message |                  |                |
    |              +------------------->                  |                |
    | call return  |                   | dispatch message |                |
    <--------------+                   +------------------>                |
    |              |                   |                  | register job   |
    |              |                   |                  +---------------->
    |              |                   |                  |                |
    |              |                   |                  | obtain lock    |
    |              |                   |                  +---------------->
    |              |                   |                  |                |
    |              |                   |                  | run job        |
    |              |                   |                  +----+           |
    |              |                   |                  |    |           |
    |              |                   |                  |    |           |
    |              |                   |                  <----+           |
    |              |                   |                  |                |
    |              |                   |                  |                |
    +              +                   +                  +                +

We can just leave register job phase in the RPC handle and put obtain lock and run job phase in a separate thread, so the RPC handle is simple enough to use call method to invoke it. Here is the proposed flow:

+--------+     +--------+          +---------+     +---------------+   +----------+   +-------------+   +-------+
|        |     |        |          |         |     |               |   |          |   |             |   |       |
| API    |     | RPC    |          | Message |     | RPC Server    |   | Database |   | RPC Server  |   | Job   |
| Server |     | client |          | Broker  |     | Handle Worker |   |          |   | Loop Worker |   | Queue |
|        |     |        |          |         |     |               |   |          |   |             |   |       |
+---+----+     +---+----+          +----+----+     +-------+-------+   +----+-----+   +------+------+   +---+---+
    |              |                    |                  |                |                |              |
    | call RPC API |                    |                  |                |                |              |
    +-------------->                    |                  |                |                |              |
    |              | send call message  |                  |                |                |              |
    |              +-------------------->                  |                |                |              |
    |              |                    | dispatch message |                |                |              |
    |              |                    +------------------>                |                |              |
    |              |                    |                  | register job   |                |              |
    |              |                    |                  +---------------->                |              |
    |              |                    |                  |                |                |              |
    |              |                    |                  | job enqueue    |                |              |
    |              |                    |                  +------------------------------------------------>
    |              |                    |                  |                |                |              |
    |              |                    | reply message    |                |                | job dequeue  |
    |              |                    <------------------+                |                |-------------->
    |              | send reply message |                  |                | obtain lock    |              |
    |              <--------------------+                  |                <----------------+              |
    | call return  |                    |                  |                |                |              |
    <--------------+                    |                  |                |        run job |              |
    |              |                    |                  |                |           +----+              |
    |              |                    |                  |                |           |    |              |
    |              |                    |                  |                |           |    |              |
    |              |                    |                  |                |           +---->              |
    |              |                    |                  |                |                |              |
    |              |                    |                  |                |                |              |
    +              +                    +                  +                +                +              +

In the above graph, Loop Worker is a newly introduced thread that does the actual work. Job Queue is an eventlet queue used to coordinate Handle Worker, which produces job entries, and Loop Worker, which consumes job entries. When accessing an empty queue, Loop Worker will be blocked until some job entries are put into the queue. Loop Worker retrieves job entries from the job queue and then starts to run them. Similar to the original flow, since multiple workers may get the same type of job for the same resource at the same time, a worker needs to obtain the lock before it can run the job. One problem occurs whenever XJob daemon stops before it finishes all the jobs in the job queue: all unfinished jobs are lost. To solve it, we change the original periodical task that is used to redo failed jobs, and let it also handle the jobs which have been registered for a certain time but haven’t been started. So both failed jobs and “orphan” new jobs can be picked up and redone.
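The producer/consumer arrangement and the orphan pick-up rule can be sketched with the standard library. This is an illustration under several assumptions: `queue.Queue` and `threading` stand in for the eventlet queue and green threads, a dictionary stands in for the job table, locking is omitted, the status names other than “NEW” are assumed, and the stop sentinel exists only to end the demo.

```python
import queue
import threading

jobs = {}                    # stand-in for the job table in the database
job_queue = queue.Queue()    # stdlib stand-in for the eventlet queue
_STOP = object()             # sentinel used only to end this demo


def handle_worker(job_id, now):
    """RPC handler: register the job, enqueue it, reply immediately."""
    jobs[job_id] = {"status": "NEW", "registered_at": now}
    job_queue.put(job_id)
    return "registered"      # the reply that unblocks the RPC call


def loop_worker(run_job):
    """Consume job entries; get() blocks while the queue is empty."""
    while True:
        job_id = job_queue.get()
        if job_id is _STOP:
            break
        jobs[job_id]["status"] = "RUNNING"
        try:
            run_job(job_id)
            jobs[job_id]["status"] = "SUCCESS"
        except Exception:
            jobs[job_id]["status"] = "FAIL"


def redo_candidates(now, grace_seconds):
    """Jobs the periodical task should redo: failed ones and stale NEW ones."""
    picked = []
    for job_id, job in jobs.items():
        stale_new = (job["status"] == "NEW"
                     and now - job["registered_at"] >= grace_seconds)
        if job["status"] == "FAIL" or stale_new:
            picked.append(job_id)
    return sorted(picked)


def run_job(job_id):
    if job_id == "job-2":
        raise RuntimeError("simulated failure")


worker = threading.Thread(target=loop_worker, args=(run_job,))
worker.start()
handle_worker("job-1", now=100.0)
handle_worker("job-2", now=100.0)
job_queue.put(_STOP)
worker.join()

# A job that was registered but whose queue entry was lost in a restart.
jobs["job-3"] = {"status": "NEW", "registered_at": 10.0}

print(redo_candidates(now=130.0, grace_seconds=60))  # ['job-2', 'job-3']
```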

You can see that Handle Worker doesn’t do much work: it just consumes RPC messages, registers jobs and puts job items in the job queue. So one extreme solution is to register new jobs on the API server side and start worker threads that retrieve jobs from the database and run them. In this way, we can remove all the RPC processes and use the database to coordinate. The drawback of this solution is that jobs are not dispatched. All the workers query jobs from the database, so there is a high probability that several workers obtain the same job and race for it. In the first solution, the message broker dispatches messages for us, and thereby dispatches jobs.

Considering that job dispatch is important, we can modify the second solution to arrive at a third one: new jobs are still registered on the API server side, but we keep using the cast method to trigger the asynchronous job in the XJob daemon. Since job registration is done on the API server side, we do not need to worry that jobs will be lost if cast messages are lost. If the API server fails to register the job, it returns a failure response; if registration succeeds, the job will eventually be done by an XJob daemon. By using RPC, we dispatch jobs with the help of the message broker. One thing that makes the cast method better than the call method is that retrieving RPC messages and running job handlers are done in the same thread, so if one XJob daemon is busy handling jobs, RPC messages will not be dispatched to it. With the call method, however, RPC messages are retrieved by one thread (the Handle Worker) and job handlers are run by another thread (the Loop Worker), so an XJob daemon may accumulate many jobs in its queue while it is busy handling jobs. This solution has a problem similar to that of the call method solution: if cast messages are lost, the new jobs are registered in the database but no XJob daemon is aware of them. We solve it in the same way, using the periodic task to pick up these “orphan” jobs. Here is the flow:

+--------+     +--------+         +---------+     +---------------+   +----------+
|        |     |        |         |         |     |               |   |          |
| API    |     | RPC    |         | Message |     | RPC Server    |   | Database |
| Server |     | client |         | Broker  |     | Handle Worker |   |          |
|        |     |        |         |         |     |               |   |          |
+---+----+     +---+----+         +----+----+     +-------+-------+   +----+-----+
    |              |                   |                  |                |
    | call RPC API |                   |                  |                |
    +-------------->                   |                  |                |
    |              | register job      |                  |                |
    |              +------------------------------------------------------->
    |              |                   |                  |                |
    |              | [if succeed to    |                  |                |
    |              |  register job]    |                  |                |
    |              | send cast message |                  |                |
    |              +------------------->                  |                |
    | call return  |                   | dispatch message |                |
    <--------------+                   +------------------>                |
    |              |                   |                  | obtain lock    |
    |              |                   |                  +---------------->
    |              |                   |                  |                |
    |              |                   |                  | run job        |
    |              |                   |                  +----+           |
    |              |                   |                  |    |           |
    |              |                   |                  |    |           |
    |              |                   |                  <----+           |
    |              |                   |                  |                |
    |              |                   |                  |                |
    +              +                   +                  +                +
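The flow above can be sketched in a few lines, under stated assumptions: the job table is a plain dictionary, `cast` is any fire-and-forget callable, and the function names, status strings and the 60-second orphan threshold are hypothetical values chosen for illustration only.

```python
ORPHAN_AGE_SECONDS = 60  # assumed threshold; the real value is not given here


def register_and_cast(job_table, cast, job_id, now):
    """API server side: persist the job first, then fire the cast message."""
    job_table[job_id] = {"status": "NEW", "registered_at": now}
    cast(job_id)   # fire-and-forget; no reply is awaited
    return job_id


def jobs_to_redo(job_table, now, orphan_age=ORPHAN_AGE_SECONDS):
    """Periodic task: pick failed jobs plus NEW jobs nobody started in time."""
    picked = []
    for job_id, job in sorted(job_table.items()):
        failed = job["status"] == "FAIL"
        orphan = (job["status"] == "NEW"
                  and now - job["registered_at"] >= orphan_age)
        if failed or orphan:
            picked.append(job_id)
    return picked


def lossy_cast(job_id):
    """Simulates a cast message that never reaches any XJob daemon."""
    return None
```

Because the job is written to the table before the message is sent, a lost cast only delays the job until the periodic task finds it; it never makes the job disappear.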

Discussion

In this section we discuss the pros and cons of the above three solutions.

Solution Comparison

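+--------------------------------------+--------------------------------+------------------------------------------+
| Solution                             | Pros                           | Cons                                     |
+======================================+================================+==========================================+
| API server uses call                 | no RPC message lost            | downtime of unfinished jobs in the job   |
|                                      |                                | queue when XJob daemon stops, job        |
|                                      |                                | dispatch not based on XJob daemon        |
|                                      |                                | workload                                 |
+--------------------------------------+--------------------------------+------------------------------------------+
| API server register jobs + no RPC    | no requirement on RPC (message | no job dispatch, conflict costs time     |
|                                      | broker), no downtime           |                                          |
+--------------------------------------+--------------------------------+------------------------------------------+
| API server register jobs + uses cast | job dispatch based on XJob     | downtime of lost jobs due to cast        |
|                                      | daemon workload                | messages lost                            |
+--------------------------------------+--------------------------------+------------------------------------------+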

Downtime means that after a job is dispatched to a worker, other workers need to wait for a certain time to determine that the job has expired before taking it over.

Conclusion

We decide to implement the third solution (API server register jobs + uses cast) since it improves the reliability of asynchronous jobs and at the same time dispatches the workload better.

Data Model Impact

None

Dependencies

None

Documentation Impact

None

Layer-3 Networking and Combined Bridge Network

Background

To achieve cross-Neutron layer-3 networking, we utilize a bridge network to connect networks in each Neutron server, as shown below:

East-West networking:

+-----------------------+                +-----------------------+
| OpenStack1            |                | OpenStack2            |
|                       |                |                       |
| +------+  +---------+ | +------------+ | +---------+  +------+ |
| | net1 |  |      ip1| | | bridge net | | |ip2      |  | net2 | |
| |      +--+    R    +---+            +---+    R    +--+      | |
| |      |  |         | | |            | | |         |  |      | |
| +------+  +---------+ | +------------+ | +---------+  +------+ |
+-----------------------+                +-----------------------+

Fig 1

North-South networking:

+---------------------+                  +-------------------------------+
| OpenStack1          |                  | OpenStack2                    |
|                     |                  |                               |
| +------+  +-------+ | +--------------+ | +-------+  +----------------+ |
| | net1 |  |    ip1| | |  bridge net  | | |    ip2|  |  external net  | |
| |      +--+  R1   +---+              +---+  R2   +--+                | |
| |      |  |       | | | 100.0.1.0/24 | | |       |  | 163.3.124.0/24 | |
| +------+  +-------+ | +--------------+ | +-------+  +----------------+ |
+---------------------+                  +-------------------------------+

Fig 2

To support east-west networking, we configure extra routes in routers in each OpenStack cloud:

In OpenStack1, destination: net2, nexthop: ip2
In OpenStack2, destination: net1, nexthop: ip1

To support north-south networking, we set the bridge network as the external network in OpenStack1 and as an internal network in OpenStack2. For an instance in net1 to access the external network, the packets are SNATed twice, first to ip1 and then to ip2. For floating ip binding, an ip in net1 is first bound to an ip (like 100.0.1.5) in the bridge network (the bridge network is attached to R1 as the external network), then that ip (100.0.1.5) in the bridge network is bound to an ip (like 163.3.124.8) in the real external network (the bridge network is attached to R2 as an internal network).

Problems

The idea of introducing a bridge network is good, but there are some problems in the current usage of the bridge network.

Redundant Bridge Network

We use two bridge networks to achieve layer-3 networking for each tenant. If VLAN is used as the bridge network type, the range of VLAN tags limits us to only 2048 pairs of bridge networks. This is far from enough for the number of tenants we need to support.
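The figure of 2048 follows from simple arithmetic on the 12-bit VLAN ID field. The sketch below shows that calculation; the function name and the `reserved_ids` parameter are illustrative assumptions. Note that IEEE 802.1Q reserves VLAN IDs 0 and 4095, so the practical limit is slightly lower than the round number.

```python
VLAN_ID_BITS = 12  # width of the VLAN ID field in an 802.1Q tag


def max_bridge_network_pairs(reserved_ids=0, networks_per_tenant=2):
    """Number of tenants that fit when each one needs a pair of VLANs."""
    usable_ids = 2 ** VLAN_ID_BITS - reserved_ids
    return usable_ids // networks_per_tenant


# Whole tag range, as in the text above.
pairs_full_range = max_bridge_network_pairs()
# Excluding the two reserved IDs (0 and 4095).
pairs_usable_range = max_bridge_network_pairs(reserved_ids=2)
```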

Redundant SNAT

In the current implementation, packets are SNATed twice for outbound traffic and DNATed twice for inbound traffic. The drawback is that outbound packets require extra processing. Also, we need to maintain an extra floating ip pool for inbound traffic.

DVR support

The bridge network is attached to the router as an internal network for east-west networking, and also for north-south networking when the real external network and the router are not located in the same OpenStack cloud. This is fine when the bridge network is VLAN type, since packets go directly out of the host and are exchanged by switches. But if we would like to support VxLAN as the bridge network type later, attaching the bridge network as an internal network in the DVR scenario will cause trouble. DVR connects internal networks by routing packets locally in each host; if the destination is not in the local host, the packets are sent to the destination host via a VxLAN tunnel. Here comes the problem: if the bridge network is attached as an internal network, the router interfaces will exist in all the hosts where the router namespaces are created, so the Tricircle would need to maintain lots of VTEPs and VxLAN tunnels for the bridge network. Ports in the bridge network are located in different OpenStack clouds, so a local Neutron server is not aware of ports in other OpenStack clouds and will not set up VxLAN tunnels for us.

Proposal

To address the above problems, we propose to combine the bridge networks for east-west and north-south networking. The bridge network is always attached to routers as an external network. In the DVR scenario, unlike router interfaces, the router gateway only exists in the SNAT namespace on one specific host, which reduces the number of VTEPs and VxLAN tunnels the Tricircle needs to handle. By setting the “enable_snat” option to “False” when attaching the router gateway, packets will not be SNATed when they go through the router gateway, so packets are only SNATed and DNATed once, in the real external gateway. However, since one router can only be attached to one external network, in the OpenStack cloud where the real external network is located we need to add one more router to connect the bridge network with the real external network. The network topology is shown below:
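As an illustration of the “enable_snat” setting, the sketch below builds the body of a Neutron router update request (PUT /v2.0/routers/{router_id}) that attaches the bridge network as the gateway with SNAT disabled. The helper name and the placeholder network ID are assumptions for this example; how the Tricircle actually issues the request is not shown here.

```python
import json


def gateway_update_body(bridge_network_id, enable_snat=False):
    """Build a router update body that sets the gateway without SNAT."""
    return {
        "router": {
            "external_gateway_info": {
                "network_id": bridge_network_id,
                "enable_snat": enable_snat,
            }
        }
    }


# Placeholder ID for illustration only; not taken from a real deployment.
body = gateway_update_body("BRIDGE_NET_ID")
print(json.dumps(body, indent=4))
```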

+-------------------------+                  +-------------------------+
|OpenStack1               |                  |OpenStack2               |
|  +------+   +--------+  |  +------------+  |  +--------+   +------+  |
|  |      |   |     IP1|  |  |            |  |  |IP2     |   |      |  |
|  | net1 +---+   R1   XXXXXXX bridge net XXXXXXX   R2   +---+ net2 |  |
|  |      |   |        |  |  |            |  |  |        |   |      |  |
|  +------+   +--------+  |  +---X----+---+  |  +--------+   +------+  |
|                         |      X    |      |                         |
+-------------------------+      X    |      +-------------------------+
                                 X    |
                                 X    |
+--------------------------------X----|-----------------------------------+
|OpenStack3                      X    |                                   |
|                                X    |                                   |
|  +------+    +--------+        X    |   +--------+    +--------------+  |
|  |      |    |     IP3|        X    |   |IP4     |    |              |  |
|  | net3 +----+   R3   XXXXXXXXXX    +---+   R4   XXXXXX external net |  |
|  |      |    |        |                 |        |    |              |  |
|  +------+    +--------+                 +--------+    +--------------+  |
|                                                                         |
+-------------------------------------------------------------------------+

router interface: -----
router gateway: XXXXX
IPn: router gateway ip or router interface ip

Fig 3

Extra routes and gateway ip are configured to build the connection:

routes of R1: net2 via IP2
              net3 via IP3
external gateway ip of R1: IP4
    (IP2 and IP3 are from bridge net, so routes will only be created in
     SNAT namespace)

routes of R2: net1 via IP1
              net3 via IP3
external gateway ip of R2: IP4
    (IP1 and IP3 are from bridge net, so routes will only be created in
     SNAT namespace)

routes of R3: net1 via IP1
              net2 via IP2
external gateway ip of R3: IP4
    (IP1 and IP2 are from bridge net, so routes will only be created in
     SNAT namespace)

routes of R4: net1 via IP1
              net2 via IP2
              net3 via IP3
external gateway ip of R4: real-external-gateway-ip
disable DVR mode
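The route lists above follow one pattern: each tenant router gets a route to every other tenant network via that network's router IP on the bridge network, and the extra router in front of the real external network gets routes to all tenant networks. A minimal sketch of that pattern follows; the function name and data layout are assumptions for illustration, not the Tricircle implementation.

```python
def build_extra_routes(tenant_routers, edge_router=None):
    """Compute extra routes per router.

    tenant_routers maps a router name to a (network, bridge IP) pair, where
    the bridge IP is that router's address on the bridge network.
    edge_router, if given, is the router attached to the real external
    network; it needs routes back to every tenant network.
    """
    routes = {}
    for name in tenant_routers:
        routes[name] = [
            (net, ip)
            for other, (net, ip) in tenant_routers.items()
            if other != name
        ]
    if edge_router is not None:
        routes[edge_router] = [(net, ip) for net, ip in tenant_routers.values()]
    return routes


# The topology of Fig 3: R1..R3 are tenant routers, R4 faces the external net.
fig3 = build_extra_routes(
    {"R1": ("net1", "IP1"), "R2": ("net2", "IP2"), "R3": ("net3", "IP3")},
    edge_router="R4",
)
```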

An alternative solution that avoids the extra router is this: for the router located in the same OpenStack cloud as the real external network, we attach the bridge network as an internal network, so the real external network can be attached to the same router. Here is the topology:

+-------------------------+                  +-------------------------+
|OpenStack1               |                  |OpenStack2               |
|  +------+   +--------+  |  +------------+  |  +--------+   +------+  |
|  |      |   |     IP1|  |  |            |  |  |IP2     |   |      |  |
|  | net1 +---+   R1   XXXXXXX bridge net XXXXXXX   R2   +---+ net2 |  |
|  |      |   |        |  |  |            |  |  |        |   |      |  |
|  +------+   +--------+  |  +-----+------+  |  +--------+   +------+  |
|                         |        |         |                         |
+-------------------------+        |         +-------------------------+
                                   |
                                   |
            +----------------------|---------------------------------+
            |OpenStack3            |                                 |
            |                      |                                 |
            |      +------+    +---+----+      +--------------+      |
            |      |      |    |  IP3   |      |              |      |
            |      | net3 +----+   R3   XXXXXXXX external net |      |
            |      |      |    |        |      |              |      |
            |      +------+    +--------+      +--------------+      |
            |                                                        |
            +--------------------------------------------------------+

router interface: -----
router gateway: XXXXX
IPn: router gateway ip or router interface ip

Fig 4

The limitation of this solution is that R3 needs to be set to non-DVR mode. As discussed above, for a network attached to a DVR mode router, the router interfaces of this network are created in all the hosts where the router namespaces are created. Since these interfaces all have the same IP and MAC, packets sent between instances (which could be virtual machines, containers or bare metal) can’t be directly wrapped in VxLAN packets; otherwise packets sent from different hosts would have the same source MAC. Neutron solves this problem by introducing DVR MACs, which are allocated by the Neutron server and assigned to each host that hosts a DVR mode router. Before the packets are wrapped in VxLAN packets, their source MAC is replaced by the DVR MAC of the host. If R3 is DVR mode, the source MAC of packets sent from net3 to the bridge network will be changed, but when the packets reach R1 or R2, those routers don’t recognize the DVR MAC, so the packets are dropped.

As before, extra routes and gateway ip are configured to build the connection:

routes of R1: net2 via IP2
              net3 via IP3
external gateway ip of R1: IP3
    (IP2 and IP3 are from bridge net, so routes will only be created in
     SNAT namespace)

routes of R2: net1 via IP1
              net3 via IP3
external gateway ip of R2: IP3
    (IP1 and IP3 are from bridge net, so routes will only be created in
     SNAT namespace)

routes of R3: net1 via IP1
              net2 via IP2
external gateway ip of R3: real-external-gateway-ip
    (non-DVR mode, routes will all be created in the router namespace)

The real external network can be deployed in one dedicated OpenStack cloud. In that case, there is no need to run services like Nova and Cinder in that cloud, since instances and volumes will not be provisioned there. Only the Neutron service is required. The above two topologies then become the same one:

+-------------------------+                  +-------------------------+
|OpenStack1               |                  |OpenStack2               |
|  +------+   +--------+  |  +------------+  |  +--------+   +------+  |
|  |      |   |     IP1|  |  |            |  |  |IP2     |   |      |  |
|  | net1 +---+   R1   XXXXXXX bridge net XXXXXXX   R2   +---+ net2 |  |
|  |      |   |        |  |  |            |  |  |        |   |      |  |
|  +------+   +--------+  |  +-----+------+  |  +--------+   +------+  |
|                         |        |         |                         |
+-------------------------+        |         +-------------------------+
                                   |
                                   |
                       +-----------|-----------------------------------+
                       |OpenStack3 |                                   |
                       |           |                                   |
                       |           |   +--------+    +--------------+  |
                       |           |   |IP3     |    |              |  |
                       |           +---+   R3   XXXXXX external net |  |
                       |               |        |    |              |  |
                       |               +--------+    +--------------+  |
                       |                                               |
                       +-----------------------------------------------+

Fig 5

The motivation for putting the real external network in a dedicated OpenStack cloud is to simplify the management of the real external network, and also to separate the real external network from the internal networking area for better security control.

Discussion

The implementation of DVR does bring some restrictions to our cross-Neutron layer-2 and layer-3 networking, resulting in the limitations of the above two proposals. In the first proposal, if the real external network is deployed with internal networks in the same OpenStack cloud, one extra router is needed in that cloud. Also, since one of the routers is DVR mode and the other is non-DVR mode, we need to deploy at least two l3 agents, one in dvr-snat mode and the other in legacy mode. The limitation of the second proposal is that the router is non-DVR mode, so east-west and north-south traffic all go through the router namespace in the network node.

Also, cross-Neutron layer-2 networking cannot work with DVR because of source MAC replacement. Consider the following topology:

+----------------------------------------------+       +-------------------------------+
|OpenStack1                                    |       |OpenStack2                     |
|  +-----------+   +--------+   +-----------+  |       |  +--------+   +------------+  |
|  |           |   |        |   |           |  |       |  |        |   |            |  |
|  | net1      +---+   R1   +---+ net2      |  |       |  |   R2   +---+ net2       |  |
|  | Instance1 |   |        |   | Instance2 |  |       |  |        |   | Instance3  |  |
|  +-----------+   +--------+   +-----------+  |       |  +--------+   +------------+  |
|                                              |       |                               |
+----------------------------------------------+       +-------------------------------+

Fig 6

net2 supports cross-Neutron layer-2 networking, so instances in net2 can be created in both OpenStack clouds. If the router that net1 and net2 are connected to is DVR mode, when Instance1 pings Instance2, the packets are routed locally and exchanged via a VxLAN tunnel. Source MAC replacement is correctly handled inside OpenStack1. But when Instance1 tries to ping Instance3, OpenStack2 does not recognize the DVR MAC from OpenStack1, so the connection fails. Therefore, only local type networks can be attached to a DVR mode router.
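The failure can be illustrated with a toy model, which is not how Neutron is implemented: each host has a DVR MAC allocated by its own Neutron server, and a routed packet is only accepted if its replaced source MAC is known in the destination cloud. The host names and MAC addresses below are made up for the example.

```python
# Made-up hosts: host name -> (cloud it belongs to, DVR MAC assigned to it).
HOSTS = {
    "host-a": ("OpenStack1", "fa:16:3f:00:00:01"),   # runs Instance1
    "host-b": ("OpenStack1", "fa:16:3f:00:00:02"),   # runs Instance2
    "host-c": ("OpenStack2", "fa:16:3f:00:00:03"),   # runs Instance3
}


def known_dvr_macs(cloud):
    """DVR MACs that the Neutron server of `cloud` allocated and recognizes."""
    return {mac for c, mac in HOSTS.values() if c == cloud}


def routed_packet_accepted(src_host, dst_host):
    """Model a DVR-routed packet whose source MAC was replaced on src_host."""
    _, replaced_src_mac = HOSTS[src_host]
    dst_cloud, _ = HOSTS[dst_host]
    return replaced_src_mac in known_dvr_macs(dst_cloud)


same_cloud_ok = routed_packet_accepted("host-a", "host-b")
cross_cloud_ok = routed_packet_accepted("host-a", "host-c")
```

In this model `same_cloud_ok` is true and `cross_cloud_ok` is false, mirroring the Instance1 to Instance2 and Instance1 to Instance3 cases described above.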

Cross-Neutron layer-2 networking and DVR may co-exist after we address the DVR MAC recognition problem (we will raise a discussion about this problem in the Neutron community) or introduce an l2 gateway. Actually this bridge network approach is just one possible implementation; in the near future we are considering providing a mechanism that lets an SDN controller plug in, in which case DVR and the bridge network may not be needed.

Given the above limitations, can our proposal support the major user scenarios? Depending on whether the tenant network and router are local or span Neutron servers, we divide the user scenarios into four categories. For the cross-Neutron router scenarios, we use the proposal shown in Fig 3 in our discussion.

Local Network and Local Router

Topology:

+-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |
|                 |       |                 |
| ext net1        |       | ext net2        |
|   +-----+-----+ |       |   +-----+-----+ |
|         |       |       |         |       |
|         |       |       |         |       |
|      +--+--+    |       |      +--+--+    |
|      |     |    |       |      |     |    |
|      | R1  |    |       |      | R2  |    |
|      |     |    |       |      |     |    |
|      +--+--+    |       |      +--+--+    |
|         |       |       |         |       |
|         |       |       |         |       |
|     +---+---+   |       |     +---+---+   |
|     net1        |       |     net2        |
|                 |       |                 |
+-----------------+       +-----------------+

Fig 7

Each OpenStack cloud has its own external network, and instances in each local network access the external network via the local router. If east-west networking is not required, this scenario has no requirement for cross-Neutron layer-2 and layer-3 networking functionality. Both the central Neutron server and the local Neutron servers can process network resource management requests. If east-west networking is needed, we have two choices to extend the above topology:

                                                *
+-----------------+       +-----------------+   *   +-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |   *   |OpenStack1       |       |OpenStack2       |
|                 |       |                 |   *   |                 |       |                 |
| ext net1        |       | ext net2        |   *   | ext net1        |       | ext net2        |
|   +-----+-----+ |       |   +-----+-----+ |   *   |   +-----+-----+ |       |   +-----+-----+ |
|         |       |       |         |       |   *   |         |       |       |         |       |
|         |       |       |         |       |   *   |         |       |       |         |       |
|      +--+--+    |       |      +--+--+    |   *   |      +--+--+    |       |      +--+--+    |
|      |     |    |       |      |     |    |   *   |      |     |    |       |      |     |    |
|      | R1  |    |       |      | R2  |    |   *   |      | R1  +--+ |       |  +---+ R2  |    |
|      |     |    |       |      |     |    |   *   |      |     |  | |       |  |   |     |    |
|      +--+--+    |       |      +--+--+    |   *   |      +--+--+  | |       |  |   +--+--+    |
|         |       |       |         |       |   *   |         |     | |       |  |      |       |
|         |       |       |         |       |   *   |         |     | |       |  |      |       |
|     +---+-+-+   |       |     +---+-+-+   |   *   |     +---+---+ | |       |  |  +---+---+   |
|     net1  |     |       |     net2  |     |   *   |     net1      | |       |  |  net2        |
|           |     |       |           |     |   *   |               | |       |  |              |
|  +--------+--+  |       |  +--------+--+  |   *   |               | | net3  |  |              |
|  | Instance1 |  |       |  | Instance2 |  |   *   |  +------------+------------+-----------+  |
|  +-----------+  |       |  +-----------+  |   *   |                 |       |                 |
|         |       |       |         |       |   *   +-----------------+       +-----------------+
|         |       | net3  |         |       |   *
|  +------+-------------------------+----+  |   *   Fig 8.2
|                 |       |                 |   *
+-----------------+       +-----------------+   *
                                                *
Fig 8.1

In the left topology, two instances are connected by a shared VxLAN network. Only a local network is attached to each local router, so the routers can be either legacy or DVR mode. In the right topology, two local routers are connected by a shared VxLAN network, so they can only be legacy mode.

Cross-Neutron Network and Local Router

Topology:

+-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |
|                 |       |                 |
| ext net1        |       | ext net2        |
|   +-----+-----+ |       |   +-----+-----+ |
|         |       |       |         |       |
|         |       |       |         |       |
|      +--+--+    |       |      +--+--+    |
|      |     |    |       |      |     |    |
|      | R1  |    |       |      | R2  |    |
|      |     |    |       |      |     |    |
|      +--+--+    |       |      +--+--+    |
|         |       |       |         |       |
|   net1  |       |       |         |       |
|  +--+---+---------------------+---+---+   |
|     |           |       |     |           |
|     |           |       |     |           |
|  +--+--------+  |       |  +--+--------+  |
|  | Instance1 |  |       |  | Instance2 |  |
|  +-----------+  |       |  +-----------+  |
|                 |       |                 |
+-----------------+       +-----------------+

Fig 9

From the Neutron API point of view, attaching a network to different routers that each have their own external gateway is allowed, but packets can only get out via one of the external networks because there is only one gateway ip in one subnet. In the Tricircle, however, we allocate one gateway ip for the network in each OpenStack cloud, so instances can access a specific external network via a specific gateway according to which OpenStack cloud they are located in.

We can see this topology as a simplification of the topology shown in Fig 8.1, in that it doesn’t require an extra network interface for instances. And if no networks other than net1 are attached to R1 and R2, R1 and R2 can be DVR mode.

In the NFV scenario, the instance itself usually acts as a router, so there’s no need to create a Neutron router; we directly attach the instance to the provider network and access the real external network via the provider network. In that case, when creating the Neutron network, the “router:external” label should be set to “False”. See Fig 10:

+-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |
|                 |       |                 |
| provider net1   |       | provider net2   |
|  +--+---------+ |       |  +--+---------+ |
|     |           |       |     |           |
|     |           |       |     |           |
|  +--+--------+  |       |  +--+--------+  |
|  | VNF       |  |       |  | VNF       |  |
|  | Instance1 |  |       |  | Instance2 |  |
|  +------+----+  |       |  +------+----+  |
|         |       |       |         |       |
|         |       |       |         |       |
|   net1  |       |       |         |       |
|  +------+-------------------------+---+   |
|                 |       |                 |
+-----------------+       +-----------------+

Fig 10

Local Network and Cross-Neutron Router

Topology:

+-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |
|                 |       |                 |
|                 |       | ext net         |
|                 |       |   +-------+---+ |
|   bridge net    |       |           |     |
|   +-----+-----------------+-+-+     |     |
|         |       |       | | |    +--+--+  |
|         |       |       | | |    |     |  |
|      +--+--+    |       | | +----+  R  |  |
|      |     |    |       | |      |     |  |
|      |  R  |    |       | |      +-----+  |
|      |     |    |       | |               |
|      +--+--+    |       | |   +-----+     |
|         |       |       | |   |     |     |
|         |       |       | +---+  R  |     |
|     +---+---+   |       |     |     |     |
|     net1        |       |     +--+--+     |
|                 |       |        |        |
|                 |       |        |        |
|                 |       |    +---+---+    |
|                 |       |    net2         |
|                 |       |                 |
+-----------------+       +-----------------+

Fig 11

Since the router is cross-Neutron type, the Tricircle automatically creates a bridge network to connect the router instances inside the two Neutron servers and to connect the router instance to the real external network. Networks attached to the router are local type, so the router can be either legacy or DVR mode.

Cross-Neutron Network and Cross-Neutron Router

Topology:

                                               *
+-----------------+       +-----------------+  *  +-----------------+       +-----------------+
|OpenStack1       |       |OpenStack2       |  *  |OpenStack1       |       |OpenStack2       |
|                 |       |                 |  *  |                 |       |                 |
|                 |       | ext net         |  *  |                 |       | ext net         |
|                 |       |   +-------+---+ |  *  |                 |       |   +-------+---+ |
|   bridge net    |       |           |     |  *  |   bridge net    |       |           |     |
|   +-----+-----------------+-+-+     |     |  *  |   +-----+-----------------+-+-+     |     |
|         |       |       | | |    +--+--+  |  *  |         |       |       | | |    +--+--+  |
|         |       |       | | |    |     |  |  *  |         |       |       | | |    |     |  |
|         |       |       | | +----+  R  |  |  *  |         |       |       | | +----+  R  |  |
|         |       |       | |      |     |  |  *  |         |       |       | |      |     |  |
|      +--+--+    |       | |      +-----+  |  *  |      +--+--+    |       | |      +-----+  |
|      |     |    |       | |               |  *  |      |     |    |       | |               |
|      |  R  |    |       | |   +-----+     |  *  |   +--+  R  |    |       | |   +-----+     |
|      |     |    |       | |   |     |     |  *  |   |  |     |    |       | |   |     |     |
|      +--+--+    |       | +---+  R  |     |  *  |   |  +--+--+    |       | +---+  R  +--+  |
|         |       |       |     |     |     |  *  |   |     |       |       |     |     |  |  |
|         |       |       |     +--+--+     |  *  |   |     |       |       |     +--+--+  |  |
|         |       |       |        |        |  *  |   |     |       |       |        |     |  |
|         |       |       |        |        |  *  |   |     |       |       |        |     |  |
|     +---+------------------------+---+    |  *  |   | +---+------------------------+---+ |  |
|     net1        |       |                 |  *  |   | net1        |       |              |  |
|                 |       |                 |  *  |   |             |       |              |  |
+-----------------+       +-----------------+  *  |   |             |       |              |  |
                                               *  | +-+------------------------------------++ |
Fig 12.1                                       *  | net2            |       |                 |
                                               *  |                 |       |                 |
                                               *  +-----------------+       +-----------------+
                                               *
                                                  Fig 12.2

In Fig 12.1, the router can only be in legacy mode since net1, which is attached to the router, is of shared VxLAN type. In this case the bridge network is actually not needed for east-west networking. In Fig 12.2, both net1 and net2 are of shared VxLAN type and are attached to the router (this router can also only be in legacy mode), so packets between net1 and net2 are routed in the router of the local OpenStack cloud and then sent to the target. Extra routes will be cleared so no packets will go through the bridge network. This is the current implementation of the Tricircle to support VLAN network.

Implementation

Local router: A router created with a region name specified in the availability zone hint. It is present only in that specific region.

East-west gateway router: A router which is spread into multiple regions and handles the east-west traffic to attached local networks.

The following description of the implementation is not pseudo code; it is the logical judgement for different combinations of conditions.

Adding router interface to east-west gateway router:

if IP of the router interface is the subnet default gateway IP
    # north-south traffic and east-west traffic will
    # go through this router
    # router is the default router gateway, it's the
    # single north-south external network mode
    if the network is cross Neutron L2 network
        reserve gateway port in different region
        add router interface in each region using reserved gateway port IP
        make sure the gateway port IP is the default route
    else # local network
        add router interface using the default gateway port or the port
        specified in request
else # not the default gateway IP in this subnet
    if the network is cross Neutron L2 network
        reserve gateway port in different region
        add router interface in each region using reserved gateway port IP
        update host route in each connected local network in each region,
        next hop is the reserved gateway port IP
    else # local network
        create router in the region as needed
        add router interface using the port specified in request
        if there are more than one interfaces on this router
            update host route in each connected local network in each
            region, next hop is port IP on this router.

Configure extra route to the router in each region for EW traffic
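The decision logic above can be restated as a small Python function. This is only an illustrative sketch: the function name, its parameters and the action strings are hypothetical and are not taken from the Tricircle code base, and it assumes the final extra-route step applies to every branch.

```python
def plan_ew_router_interface(is_default_gateway_ip, is_cross_neutron_l2,
                             interface_count=1):
    """Return the ordered actions for adding an interface to an
    east-west gateway router (hypothetical helper, for illustration)."""
    actions = []
    if is_cross_neutron_l2:
        # Both gateway and non-gateway cases reserve a port per region.
        actions.append("reserve gateway port in each region")
        actions.append("add router interface in each region "
                       "using reserved gateway port IP")
        if is_default_gateway_ip:
            actions.append("make reserved gateway port IP the default route")
        else:
            actions.append("update host routes, next hop is "
                           "reserved gateway port IP")
    else:  # local network
        if is_default_gateway_ip:
            actions.append("add router interface using default gateway "
                           "port or requested port")
        else:
            actions.append("create router in the region as needed")
            actions.append("add router interface using requested port")
            if interface_count > 1:
                actions.append("update host routes, next hop is "
                               "port IP on this router")
    # Assumption: the extra-route step follows every branch.
    actions.append("configure extra routes in each region for EW traffic")
    return actions
```

Returning a plain list of actions keeps the branching easy to compare against the textual description line by line.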

Adding a router interface to a local router for a cross Neutron L2 network makes the local router the default gateway router in this region:

# default north-south traffic will go through this router
add router interface using the default gateway port or the port
specified in request
make sure this local router in the region is the default gateway

If an external network is attached to the east-west gateway router, and the network’s default gateway is the east-west gateway router, then the router will be upgraded to north-south networking via single external network mode.

Constraints:

A network can only be attached to one local router in one region.

If a network has already been attached to an east-west gateway router, and the east-west gateway router is the default gateway of this network, then the network can’t be attached to another local router.

Note

A host route update in a subnet only takes effect at the next DHCP request. It may take up to dhcp_lease_duration for VMs in the subnet to update the host route, so it is better to compose the networking topology before attaching VMs to the network. dhcp_lease_duration is configured by the cloud operator. If a tenant wants the host route to work immediately, a DHCP request can be sent directly from the VMs.

Data Model Impact

None

Dependencies

None

Documentation Impact

  1. Add new guide for North South Networking via Multiple External Networks with east-west enabled.

  2. Release notes.

Reference

[1] North South Networking via Multiple External Networks: https://docs.openstack.org/tricircle/latest/networking/networking-guide-multiple-external-networks.html

[2] l3-networking-combined-bridge-net: https://github.com/openstack/tricircle/blob/master/specs/ocata/l3-networking-combined-bridge-net.rst

[3] North South Networking via Single External Network: https://docs.openstack.org/tricircle/latest/networking/networking-guide-single-external-network.html

Distributed LBaaS in Multi-Region Scenario

Background

Currently, LBaaS (Load-Balancing-as-a-Service) is not supported in the Tricircle. This spec describes how LBaaS will be implemented in the Tricircle. LBaaS is an advanced service of Neutron, which allows proprietary and open-source load balancing technologies to drive the actual load balancing of requests. Based on the networking guide of the Ocata release, LBaaS can be configured with an agent or Octavia. Given that the OpenStack community is moving to Octavia as the reference implementation of LBaaS, we only enable LBaaS based on Octavia in the Tricircle.

Unlike the existing LBaaS implementation, Octavia accomplishes its delivery of load balancing services by managing a fleet of virtual machines, containers, or bare metal servers, collectively known as amphorae, which it spins up on demand. This spec is dedicated to how to implement LBaaS in multiple regions with the Tricircle.

Overall Implementation

The Tricircle is designed in a central-local fashion, where all the local neutrons are managed by the central neutron. As a result, in order to adapt to the central-local design and the amphorae mechanism of Octavia, we plan to deploy LBaaS as follows.

              +---------------------------+
              |                           |
              |      Central Neutron      |
              |                           |
              +---------------------------+
                     Central Region

+----------------------------+    +-----------------------------+
|     +----------------+     |    |     +----------------+      |
|     |  LBaaS Octavia |     |    |     |  LBaaS Octavia |      |
|     +----------------+     |    |     +----------------+      |
| +------+ +---------------+ |    | +-------+ +---------------+ |
| | Nova | | Local Neutron | |    | | Nova  | | Local Neutron | |
| +------+ +---------------+ |    | +-------+ +---------------+ |
+----------------------------+    +-----------------------------+
          Region One                          Region Two

As demonstrated in the figure above, for each region where a local neutron is installed, admins can optionally choose to configure and install Octavia. Typically, Octavia leverages the nova installed in its region to spin up amphorae. By employing load balancing software (e.g. haproxy) installed in the amphorae and Virtual Router Redundancy Protocol (VRRP), a load balancer, which consists of a VIP and an amphora, can balance load across members with high availability. However, under the central-local scenario, we plan to let Octavia employ the central neutron in Central Region to manage networking resources, while still employing services in its own region to manage amphorae. Hence, the workflow of networking resource management in the Tricircle can be described as follows.

Tenant --> local neutron API --> neutron-LBaaS --> local Octavia --> central neutron

Specifically, when a tenant attempts to create a load balancer, the tenant needs to send a request to the local neutron-lbaas service. The service plugin of neutron-lbaas then prepares for creating the load balancer, including creating a port via the local plugin, inserting the info of the port into the database, and so on. Next the service plugin triggers the creating function of the corresponding driver of Octavia, i.e., Octavia.network.drivers.neutron.AllowedAddressPairsDriver, to create the amphora. During the creation, Octavia employs the central neutron to complete a series of operations, for instance, allocating the VIP, plugging in the VIP, and updating databases. Given that the main features of managing networking resources are already implemented, we need to adapt the mechanism of Octavia and neutron-lbaas by improving the functionalities of the local and central plugins.

Considering that the Tricircle is dedicated to enabling networking automation across Neutrons, the implementation can be divided into two parts, i.e., LBaaS members in one OpenStack instance, and LBaaS members in multiple OpenStack instances.

LBaaS members in single region

For LBaaS in one region, after installing octavia, cloud tenants should build a management network and two security groups for amphorae manually in the central neutron. Next, tenants need to create an interface for health management. Then, tenants need to configure the newly created networking resources for octavia and let octavia employ central neutron to create resources. Finally, tenants can create load balancers, listeners, pools, and members in the local neutron. In this case, all the members of a loadbalancer are in one region, regardless of whether the members reside in the same subnet or not.

LBaaS members in multiple regions

1. members in the same subnet yet located in different regions

As shown below.

+-------------------------------+  +-----------------------+
| +---------------------------+ |  |                       |
| |    Amphora                | |  |                       |
| |                           | |  |                       |
| |  +-------+  +---------+   | |  |                       |
| +--+ mgmt  +--+ subnet1 +---+ |  |                       |
|    +-------+  +---------+     |  |                       |
|                               |  |                       |
| +--------------------------+  |  | +-------------------+ |
| | +---------+  +---------+ |  |  | |    +---------+    | |
| | | member1 |  | member2 | |  |  | |    | member3 |    | |
| | +---------+  +---------+ |  |  | |    +---------+    | |
| +--------------------------+  |  | +-------------------+ |
|       network1(subnet1)       |  |   network1(subnet1)   |
+-------------------------------+  +-----------------------+
           Region One                     Region Two
Fig. 1. The scenario of balancing load across instances of one subnet which
reside in different regions.

As shown in Fig. 1, suppose that a load balancer is created in Region One, along with a listener, a pool, and two members in subnet1. When adding an instance in Region Two to the pool as a member, the local neutron creates the network in Region Two. Members that are located in different regions yet reside in the same subnet form a shared VLAN/VxLAN network. As a result, the Tricircle supports adding members that are located in different regions to a pool.

2. members residing in different subnets and regions

As shown below.

+---------------------------------------+  +-----------------------+
| +-----------------------------------+ |  |                       |
| |            Amphora                | |  |                       |
| |                                   | |  |                       |
| | +---------+  +------+ +---------+ | |  |                       |
| +-+ subnet2 +--+ mgmt +-+ subnet1 +-+ |  |                       |
|   +---------+  +------+ +---------+   |  |                       |
|                                       |  |                       |
| +----------------------------------+  |  | +-------------------+ |
| |                                  |  |  | |                   | |
| |   +---------+      +---------+   |  |  | |    +---------+    | |
| |   | member1 |      | member2 |   |  |  | |    | member3 |    | |
| |   +---------+      +---------+   |  |  | |    +---------+    | |
| |                                  |  |  | |                   | |
| +----------------------------------+  |  | +-------------------+ |
|           network1(subnet1)           |  |    network2(subnet2)  |
+---------------------------------------+  +-----------------------+
               Region One                         Region Two
Fig. 2. The scenario of balancing load across instances of different subnets
which reside in different regions as well.

As shown in Fig. 2, suppose that a load balancer is created in Region One, as well as a listener, a pool, and two members in subnet1. When adding an instance of subnet2 located in Region Two, the local neutron-lbaas queries the central neutron to check whether subnet2 exists. If subnet2 exists, the local neutron-lbaas employs Octavia to plug a port of subnet2 into the amphora. This triggers the cross-region vxlan networking process, after which the amphora can reach the members. As a result, LBaaS in multiple regions works.

Please note that LBaaS in multiple regions should not be applied to the local network case. When adding a member in a local network which resides in another region, the ‘get_subnet’ call made by neutron-lbaas will fail and return “network not located in current region”.

Data Model Impact

None

Dependencies

None

Documentation Impact

Configuration guide needs to be updated to introduce the configuration of Octavia, local neutron, and central neutron.

References

None

Tricircle Table Clean After Splitting

Background

Originally the Tricircle provided a unified OpenStack API gateway and networking automation functionality. Now the Tricircle narrows its scope to networking automation across Neutron servers, and the OpenStack API gateway functionality is developed in another project called Trio2o[1].

Problem Description

After this splitting, many tables would no longer be used, including quota, volume, aggregate and pod binding, etc. The data models, tables and APIs of them should be removed. As for the rest of the tables that are still in use in the Tricircle, they should be renamed for better understanding.

Apart from the table cleaning and renaming work, a new feature will be developed to remove the dependency on the old table. External network creation takes ‘availability_zone_hints’ as a parameter (AZ or az is used below as short for availability zone). Previously az_hints was searched in the pod binding table by az_name and tenant_id. Now that the pod binding table is deprecated, a new search strategy is needed to fix the problem[2]. A function named find_pod_by_az will be developed to look up the pod for a given az_name in the pod table. Given the az_name, if it is not empty, we first match it with region_name in the pod table. When a pod with the same region_name is found, it is returned and the search procedure is complete. If no pod is found with the same region_name, we try to match it with az_name in the pod table. If multiple pods are found, an exception is raised. If only one pod is found, this pod is returned. An exception is raised if no pod is matched at the end of the search procedure. However, if the az_name is empty, None is returned and a new configuration item “default_region_for_external_network” is used instead.
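The search procedure of find_pod_by_az can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the pod table is modelled as a list of dicts with 'region_name' and 'az_name' keys, and the exception class PodLookupError is hypothetical. The real function would query the database instead.

```python
class PodLookupError(Exception):
    """Raised when no pod, or more than one pod, matches (hypothetical)."""


def find_pod_by_az(pods, az_name):
    """Find a pod by availability zone name.

    `pods` is a list of dicts standing in for rows of the pod table.
    """
    if not az_name:
        # Empty az_name: the caller is expected to fall back to the
        # default_region_for_external_network configuration item.
        return None
    # Step 1: try to match az_name against region_name.
    for pod in pods:
        if pod["region_name"] == az_name:
            return pod
    # Step 2: no region matched, so match against az_name.
    matches = [pod for pod in pods if pod["az_name"] == az_name]
    if len(matches) > 1:
        raise PodLookupError("multiple pods match az %r" % az_name)
    if not matches:
        raise PodLookupError("no pod matches az %r" % az_name)
    return matches[0]
```

Matching region_name first means a region name can always be used as an unambiguous hint, even when several pods share one az_name.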

Proposed Change

All tables that need to be changed can be divided into two categories: tables to be removed and tables to be renamed.

Tables to be removed:

  • quality_of_service_specs

  • quota_classes

  • quota_usages

  • quotas

  • reservations

  • volume_type_extra_specs

  • volume_type_projects

  • volume_types

  • aggregates

  • aggregate_metadata

  • instance_types

  • instance_type_projects

  • instance_type_extra_specs

  • key_pairs

  • pod_binding

Tables to be renamed:

  • cascaded_pod_service_configuration (new name: cached_endpoints)

  • cascaded_pods (new name: pods)

  • cascaded_pods_resource_routing (new name: resource_routings)

  • job (new name: async_jobs)

The deprecated tables will be removed from the repository directly, and other tables containing old meanings will be renamed for better understanding.

After the deletion of the pod binding table, a new feature will be developed to look up the az in the pod table rather than the pod binding table.

Data Model Impact

In the database, many tables are removed and other tables are renamed for better understanding.

Documentation Impact

After the pod binding table is removed, the explanation of the pod binding API in the doc/source/api_v1.rst will be removed as well.

Dependencies

None

Tricircle Local Neutron Plugin

Background

One of the key values we would like to achieve via the Tricircle project is to provide networking automation functionality across several Neutron servers. Each OpenStack instance runs its own Nova and Neutron services but shares the same Keystone service or uses federated Keystone, which is a multi-region deployment mode. With networking automation, virtual machines or bare metals booted in different OpenStack instances can inter-communicate via a layer2 or layer3 network.

Considering the cross Neutron layer2 network case, if the Neutron service in each OpenStack instance allocates ip addresses independently, the same ip address could be assigned to virtual machines in different OpenStack instances, thus an ip address conflict could occur. One straightforward solution to this problem is to divide the ip allocation pool into several parts so that each OpenStack instance has one. The drawback is that since virtual machines are not distributed evenly across OpenStack instances, some OpenStack instances may use up their ip addresses while other OpenStack instances still have unallocated ip addresses. What’s worse, dividing the ip allocation pool makes it impossible for us to migrate virtual machines from one OpenStack instance to another.

Thanks to Neutron’s flexible plugin framework, by writing a new plugin and configuring Neutron server to use it, developers can define what Neutron server should do after receiving a network resource operation request. So for the ip address conflict issue discussed above, we decide to run one central Neutron server with the Tricircle central Neutron plugin (abbr: “central plugin”) to manage the ip allocation pool centrally.

Besides central plugin, we need a bridge to connect central and local Neutron servers since each OpenStack instance has its own local Nova and Neutron server but these two services are not aware of the central Neutron server. This bridge should validate requested network data via the central Neutron server, then create necessary network resources in the target OpenStack instance with the data retrieved from the central Neutron server.

Local Plugin

For connecting central and local Neutron servers, a Neutron plugin is again a good place for us to build the bridge. We can write our own plugin, the Tricircle local Neutron plugin (abbr: “local plugin”), to trigger the cross Neutron networking automation in local Neutron server. During virtual machine booting, local Nova server will interact with local Neutron server to query networks or create ports, which will trigger local plugin to retrieve data from central Neutron server and create the necessary network resources according to the data. To support different core plugins, we will introduce a new option “real_core_plugin” in the “tricircle” configuration group. During initialization, local plugin will load the plugin specified by “real_core_plugin”. Local plugin only adds logic to interact with central Neutron server, and invokes the real core plugin to finish the CRUD operations of local network resources. The following graph shows the relation between user and Nova and Neutron servers:

               +------+
               | user |
               +-+--+-+
                 |  |
     +-----------+  +----------------------+
     |  boot vm         create and query   |
     |                  network resource   |
     v                                     |
+----+-------+                             |
| local Nova |     xxxxxxxxxxxxxxx         |
+----+-------+   xxx             xxx       |
     |          xx                 xx      |
     +---+    xxx      +--------+   xxx    |
         |    x        |        |     x    |
         |    x        |        |     x    |
         v    V        |        v     x    v
+--------+---------+   |   +----+----------+----+
| local Neutron    |   |   | central Neutron    |
| +--------------+ |   |   | +----------------+ |
| | local plugin | |   |   | | central plugin | |
| +--------------+ |   |   | +----------------+ |
+------------------+   |   +--------------------+
         |             |
         +-------------+

Next we use the virtual machine booting procedure to elaborate how local plugin works. To begin with, the user creates a network and subnet via central Neutron server. Then this user passes the network id as the requested network information to local Nova server to boot a virtual machine. During parameter validation, local Nova server queries local Neutron server to ensure the passed-in network id is valid, which is a “network-get” request. In the “network-get” handle function, local plugin first checks if local Neutron already has a network with that id. If not, local plugin retrieves the network and subnet information from central Neutron server, then creates the network and subnet based on this information. If the user passes an invalid network id by mistake, local plugin will receive a 404 response from central Neutron server and simply return a 404 response to local Nova server.
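The “network-get” handling described above can be illustrated with a small in-memory model. This is a sketch only: FakeNeutron, NotFound and local_get_network are hypothetical names that stand in for the Neutron servers, the HTTP 404 response and the plugin handle function, and no real Neutron API is called.

```python
class NotFound(Exception):
    """Stands in for an HTTP 404 from a Neutron server (hypothetical)."""


class FakeNeutron:
    """In-memory stand-in for a Neutron server, keyed by resource id."""

    def __init__(self, networks=None, subnets=None):
        self.networks = dict(networks or {})
        self.subnets = dict(subnets or {})

    def get_network(self, net_id):
        try:
            return self.networks[net_id]
        except KeyError:
            raise NotFound(net_id)


def local_get_network(local, central, net_id):
    """Handle "network-get" in the local plugin (illustrative only)."""
    try:
        return local.get_network(net_id)
    except NotFound:
        pass
    # Not known locally: ask central Neutron. A NotFound raised here
    # propagates, so local Nova sees the 404 as well.
    network = central.get_network(net_id)
    # Create local copies that reuse the ids of the central resources.
    local.networks[net_id] = dict(network)
    for subnet_id in network.get("subnets", []):
        local.subnets[subnet_id] = dict(central.subnets[subnet_id])
    return local.networks[net_id]
```

After the first successful call the network exists locally, so later “network-get” requests for the same id are served without contacting central Neutron.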

After the network id validation passes, local Nova server continues to schedule a host so that the compute manager running on that host can do the remaining work. Compute manager creates a port in the requested network via local Neutron server, which is a “port-create” request. In the “port-create” handle function, local plugin sends the same request to central Neutron server to create a port, and uses the returned port information to create a local port. With local plugin, we ensure all ip addresses are allocated by central Neutron server.
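A minimal sketch of the “port-create” handling follows. FakeCentralNeutron and local_create_port are hypothetical stand-ins, and the sequential address allocation is only an assumption made so that the example is deterministic.

```python
import itertools
import uuid


class FakeCentralNeutron:
    """Stand-in for central Neutron: the only place IPs are allocated."""

    def __init__(self, prefix="10.0.0."):
        self._prefix = prefix
        # Hypothetical allocation order, starting at host address .2.
        self._hosts = itertools.count(2)

    def create_port(self, body):
        port = dict(body)
        port["id"] = str(uuid.uuid4())
        address = self._prefix + str(next(self._hosts))
        port["fixed_ips"] = [{"ip_address": address}]
        return port


def local_create_port(central, local_ports, body):
    """Handle "port-create" in the local plugin (illustrative only)."""
    # Forward the request so that central Neutron allocates the address.
    central_port = central.create_port(body)
    # Create the local port with the id and IP returned by central.
    local_ports[central_port["id"]] = dict(central_port)
    return local_ports[central_port["id"]]
```

Because the local side never picks an address itself, two OpenStack instances sharing a subnet cannot hand out the same ip address.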

At the end of the network setup of the virtual machine, compute manager issues a “port-update” request to local Neutron server to associate the host with the port. In the “port-update” handle function, local plugin recognizes that this request is sent from local Nova server because the request body contains host information, so it sends a “port-update” request to central Neutron server with the region name in the request body. In Keystone, we register services inside one OpenStack instance as one unique region, so we can use the region name to identify one OpenStack instance. After receiving the request, central Neutron server is informed that one virtual machine port is correctly set up in one OpenStack instance, so it starts the cross Neutron networking automation process, like security group rule population, tunnel setup for layer2 communication and route setup for layer3 communication, which is done by making Neutron API calls to each local Neutron server.
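The check on the “port-update” request body can be sketched as below. The attribute binding:host_id is the standard Neutron port binding attribute carrying the host. The function name and the layout of the body forwarded to central Neutron (a binding:profile dict with 'region' and 'host' keys) are assumptions made for illustration, not the confirmed Tricircle format.

```python
def build_central_port_update(request_body, region_name):
    """Return the body to forward to central Neutron, or None.

    Only requests that carry host information (i.e. those sent by
    local Nova) are forwarded. The forwarded layout is hypothetical.
    """
    port = request_body.get("port", {})
    host = port.get("binding:host_id")
    if not host:
        # No host information: not a binding update from Nova.
        return None
    return {
        "port": {
            "binding:profile": {"region": region_name, "host": host},
        }
    }
```

For example, a body containing `"binding:host_id": "compute-1"` processed in RegionOne yields an update that tells central Neutron which region the port was bound in.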

Implementation

Implementation details of the local plugin are discussed in this section.

Resource Id

Local plugin always retrieves data of network resources from central Neutron server and uses this data to create network resources in local Neutron server. During the creation of these network resources, we need to guarantee that resource ids in central and local servers are the same. Consider the scenario where a user creates a port via central Neutron server and then uses this port to boot a virtual machine. After local Nova server receives the request, it will use the port id to create a tap device for the virtual machine. If the port ids in central and local Neutron servers are different, OVS agent can’t correctly recognize the tap device and configure it. As a result, the virtual machine fails to connect to the network. Fortunately, the database access module in Neutron allows us to specify the id before creating the resource record, so in local plugin, we just specify the same id as the central resource’s when creating the local resource.

Network Type Adaption

Two network types are currently supported in central plugin: local and vlan. Before creating a network based on information retrieved from central Neutron server, local plugin needs to adapt the network type. For local type, local plugin creates the network without specifying the network type, so the default tenant network type is used. For vlan type, local plugin keeps the network type, segmentation id and physical network parameters.

We plan to support another two network types later. They are shared_vxlan and mixed network type. For shared_vxlan type, local plugin changes the network type parameter from “shared_vxlan” to “vxlan”, but keeps the segmentation id parameter (vxlan type doesn’t need the physical network parameter). For mixed type, like local type, local plugin uses the default tenant network type to create the network, but it needs to do one more thing, that is to save the segment information in central Neutron server. Neutron has an extension which allows one network to carry multiple segments information[1], so the segment information of each local network can all be saved in the central network.
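The adaption rules for the four network types can be summarised in a short Python sketch. The provider:* keys are the attribute names of Neutron’s provider network extension, while the function name adapt_network_body is hypothetical. Saving segment information back to central Neutron for the mixed type is not modelled here.

```python
def adapt_network_body(central_net):
    """Build the local network creation body from a central network."""
    net_type = central_net.get("provider:network_type")
    # The local network always reuses the id of the central network.
    body = {"id": central_net["id"]}
    if net_type == "vlan":
        # Keep network type, physical network and segmentation id.
        body["provider:network_type"] = "vlan"
        body["provider:physical_network"] = (
            central_net["provider:physical_network"])
        body["provider:segmentation_id"] = (
            central_net["provider:segmentation_id"])
    elif net_type == "shared_vxlan":
        # Local Neutron only knows "vxlan"; no physical network needed.
        body["provider:network_type"] = "vxlan"
        body["provider:segmentation_id"] = (
            central_net["provider:segmentation_id"])
    # "local" and "mixed": no provider attributes are passed, so local
    # Neutron falls back to its default tenant network type.
    return body
```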

Dhcp Port Handle

After local subnet creation, local Neutron server will schedule one dhcp agent for that subnet, and dhcp agent will automatically create a dhcp port. The ip address of this dhcp port is not allocated by central Neutron server, so we may encounter ip address conflict. We need to address this problem to ensure all ip addresses are allocated by central Neutron server.

Here is the approach. After central Neutron server receives a subnet creation request, central plugin not only creates the requested subnet, but also creates a port to pre-allocate an ip address for the dhcp port. So during creation of the local subnet, local plugin will query central Neutron server to retrieve the data of the pre-created port and use its ip address to create a local dhcp port. The “device_id” of the dhcp port is set to “reserved_dhcp_port” so that after one dhcp agent is scheduled, it will use this port rather than create a new one.
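As an illustration, the body of the local dhcp port could be built from the pre-created central port as follows. The function name is hypothetical, and setting device_owner to “network:dhcp” is an assumption based on the owner value Neutron uses for dhcp ports. Only the “reserved_dhcp_port” device id comes from the description above.

```python
def build_local_dhcp_port(central_dhcp_port):
    """Build a local dhcp port body from the pre-allocated central port."""
    return {
        # Same id and address as the port pre-created by central plugin.
        "id": central_dhcp_port["id"],
        "network_id": central_dhcp_port["network_id"],
        "fixed_ips": list(central_dhcp_port["fixed_ips"]),
        # Assumption: standard owner value for dhcp ports.
        "device_owner": "network:dhcp",
        # Marks the port as reserved so the scheduled dhcp agent
        # takes it over instead of creating a new port.
        "device_id": "reserved_dhcp_port",
    }
```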

Gateway Port Handle

If cross Neutron layer2 networking is enabled in one network, we need to allocate one gateway ip for that network in each OpenStack instance. The reason is that we want layer3 routing to be finished locally in each OpenStack instance. If all the OpenStack instances have the same gateway ip, packets sent to the gateway may reach the remote one, so the path is not the best and not predictable.

To address this problem, before creating the local subnet, local plugin sends a request to central Neutron server to create a “gateway port”, then uses the ip of this port as the gateway ip of the local subnet. The name of the gateway port includes the region name of the OpenStack instance and the id of the subnet, so each OpenStack instance can have its own gateway port and gateway ip for one specific subnet.
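The per-region gateway port lookup can be sketched as below. The name format “gateway_port_<region>_<subnet id>” and both function names are hypothetical. The text above only states that the name contains the region name and the subnet id.

```python
def gateway_port_name(region_name, subnet_id):
    """Compose a gateway port name (hypothetical naming scheme)."""
    return "gateway_port_%s_%s" % (region_name, subnet_id)


def get_or_create_gateway_port(central_ports, allocate_ip,
                               region_name, subnet_id):
    """Return the gateway port of a subnet for one region.

    `central_ports` is a list standing in for ports stored in central
    Neutron; `allocate_ip` is a callable returning a free address.
    """
    name = gateway_port_name(region_name, subnet_id)
    for port in central_ports:
        if port["name"] == name:
            # Already created for this region and subnet: reuse it.
            return port
    port = {"name": name, "subnet_id": subnet_id,
            "ip_address": allocate_ip()}
    central_ports.append(port)
    return port
```

Encoding the region and subnet in the name makes the operation idempotent: repeating the request for the same region returns the existing port and gateway ip.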

Data Model Impact

None

Dependencies

None

Documentation Impact

Installation guide needs to be updated to introduce the configuration of central and local plugin.

References

[1] https://blueprints.launchpad.net/neutron/+spec/ml2-multi-segment-api

A New Layer-3 Networking multi-NS-with-EW-enabled

Problems

The spec for l3 networking [1] demonstrates an l3 networking model which enables multiple NS traffic along with EW traffic. However, in that l3 networking model, the host route only becomes valid after the DHCP lease time has expired and the lease is renewed. It may take dhcp_lease_duration for VMs in the subnet to update the host route after a new pod with an external network is added to the Tricircle. To solve the problem, this spec introduces a new l3 networking model.

Proposal

For the networking model in [1], a tenant network is attached to two routers, one for NS traffic, the other for EW traffic. In the new networking model, inspired by the combined bridge network [2], we propose to attach the tenant network to one router, and the router takes charge of routing NS and EW traffic. The new networking model is plotted in Fig. 1.

+-----------------------+             +----------------------+
|            ext-net1   |             |        ext-net2      |
|      +---+---+        |             |            +--+---+  |
|RegionOne |            |             | RegionTwo     |      |
|      +---+---+        |             |          +----+--+   |
|      |  R1   +------+ |             | +--------+  R2   |   |
|      +-------+      | |             | |        +-------+   |
|           net1      | |             | |      net2          |
|     +------+---+-+  | |             | | +-+----+------+    |
|            |   |    | |             | |   |    |           |
|  +---------+-+ |    | |             | |   | +--+--------+  |
|  | Instance1 | |    | |             | |   | | Instance2 |  |
|  +-----------+ |    | |             | |   | +-----------+  |
|           +----+--+ | |             | |  ++------+         |
|           | R3(1) +-+-----------------+--+ R3(2) |         |
|           +-------+   |  bridge net |    +-------+         |
+-----------------------+             +----------------------+

Figure 1 Multiple external networks with east-west networking

As shown in Fig. 1, R1 connects to an external network (i.e., ext-net1) and ext-net1 is the default gateway of R1. Meanwhile, net1 is attached to R3 and R3’s default gateway is the bridge net. Further, interfaces of the bridge net are only attached to R1 and R2, which are regarded as local routers.

In such a scenario, all traffic (no matter NS or EW traffic) flows to R3. For EW traffic from net1 to net2, R3(1) forwards packets to the interface of net2 in the R3(2) router namespace. For NS traffic, R3 forwards packets to the interface of an available local router (i.e., R1 or R2) which is attached to the real external network. As a result, the bridge net is an internal net where NS and EW traffic is steered, rather than the real external network of R3.

To create such a topology, we need to create a logical (non-local) router R3 in the central Neutron. Tricircle central Neutron plugin then creates R3(1) in RegionOne and R3(2) in RegionTwo, as well as the bridge network to inter-connect R3(1) and R3(2). As such, the networking for EW traffic is ready for tenants. To enable NS traffic, real external networks are required to be attached to R3. When explicitly adding the gateway port of each external network to R3, Tricircle automatically creates a local router (e.g. R1) for the external network and sets the gateway to the local router. Then, to connect the local router (e.g. R1) and the non-local router (R3), two interfaces of bridge-net are also created and attached to the respective routers. The logical topology in central Neutron is plotted in Fig. 2.

  ext-net1             ext-net2
+---+---+            +---+---+
    |                    |
+---+---+            +---+---+
|  R1   |            |  R2   |
+---+---+            +---+---+
    |                    |
+---+--------------------+---+
|         bridge-net         |
+-------------+--------------+
              |
              |
+-------------+--------------+
|            R3              |
+---+--------------------+---+
    | net1          net2 |
+---+-----+-+      +---+-+---+
          |            |
+---------+-+       +--+--------+
| Instance1 |       | Instance2 |
+-----------+       +-----------+

Figure 2 Logical topology in central Neutron

To improve the logic of building l3 networking, we introduce a routed network to manage external networks in central Neutron. In central Neutron, one routed network is created as a logical external network, and real external networks are stored as segments of the external network. As such, the local routers (e.g., R1 and R2 in Fig. 2) are transparent to users. As a result, when a real external network is created, a local router is created and the external network’s gateway is set to the router. Moreover, a port of bridge-net is created and added to the local router.

The routed network is created as follows:

openstack --os-region-name=CentralRegion network create --share --provider-physical-network extern --provider-network-type vlan --provider-segment 3005 ext-net
openstack --os-region-name=CentralRegion network segment create --physical-network extern  --network-type vlan --segment 3005 --network ext-net ext-sm-net1
openstack --os-region-name=CentralRegion network segment create --physical-network extern  --network-type vlan --segment 3005 --network ext-net ext-sm-net2
openstack --os-region-name=CentralRegion subnet create --network ext-net --network-segment ext-sm-net1 --ip-version 4 --subnet-range 203.0.113.0/24 net1-subnet-v4
openstack --os-region-name=CentralRegion subnet create --network ext-net --network-segment ext-sm-net2 --ip-version 4 --subnet-range 203.0.114.0/24 net2-subnet-v4

The logical topology exposed to users is plotted in Fig. 3.

             ext-net (routed network)
           +---+---+
               |
               |
+--------------+-------------+
|            R3              |
+---+--------------------+---+
    | net1          net2 |
+---+-----+-+      +---+-+---+
          |            |
+---------+-+       +--+--------+
| Instance1 |       | Instance2 |
+-----------+       +-----------+

Figure 3 Logical topology exposed to users in central Neutron

For R3, net1 and net2 should be attached to R3:

openstack --os-region-name=CentralRegion router add subnet R3 <net1's subnet>
openstack --os-region-name=CentralRegion router add subnet R3 <net2's subnet>

The gateway of the ext-net, i.e., the routed network, is set to R3:

openstack --os-region-name=CentralRegion router set --external-gateway <ext-net> R3

However, a routed network does not have a gateway, so the command above fails when it tries to add the gateway of the routed network to the router R3. To make the command work, we plan to create a gateway port for the routed network before setting the gateway on the router. The port is a blank port without an IP address, because a routed network is a software entity made up of multiple segments (i.e., subnets). So that the gateways of the real external networks can still be retrieved, we store the gateway IPs in the “tags” field of the gateway port.
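The spec does not say how the gateway IPs are laid out inside the “tags” field. As a minimal sketch, assuming a hypothetical `gw:<segment>:<ip>` tag format (the prefix, the function names and the example addresses are illustrative and not part of Tricircle), storing and retrieving the gateways could look like this:

```python
import ipaddress

# Hypothetical tag prefix; the spec only says gateway IPs live in "tags".
GATEWAY_TAG_PREFIX = "gw"


def encode_gateway_tags(gateways):
    """Turn {segment_name: gateway_ip} into a sorted list of tag strings."""
    tags = []
    for segment, ip in sorted(gateways.items()):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        tags.append(f"{GATEWAY_TAG_PREFIX}:{segment}:{ip}")
    return tags


def decode_gateway_tags(tags):
    """Recover {segment_name: gateway_ip}, ignoring unrelated tags."""
    gateways = {}
    for tag in tags:
        # Split at most twice so an IPv6 address keeps its own colons.
        parts = tag.split(":", 2)
        if len(parts) == 3 and parts[0] == GATEWAY_TAG_PREFIX:
            gateways[parts[1]] = parts[2]
    return gateways
```

This format only works if segment names contain no colon; a real implementation would need to pick an encoding that fits the constraints Neutron places on tags.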

This command creates a port on bridge-net and adds it to R3, as plotted in Fig. 2.

Tricircle central Neutron plugin will automatically configure R3(1), R3(2) and bridge-network as follows:

For net1 and net2, no host route is needed, so in such an l3 networking model, users are no longer required to wait for DHCP renew to update host route. All traffic is forwarded to R3 by default.

In R3(1), extra route will be configured:

destination=net2's cidr, nexthop=R3(2)'s interface in bridge-net
destination=ext-net1's cidr, nexthop=R1's interface in bridge-net

In R3(2), extra route will be configured:

destination=net1's cidr, nexthop=R3(1)'s interface in bridge-net
destination=ext-net2's cidr, nexthop=R2's interface in bridge-net
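The extra routes listed above can be written in the `destination`/`nexthop` form that Neutron uses for router routes. The sketch below builds the two routes for R3(1) and looks up the next hop with a longest-prefix match; the helper names and every address are illustrative placeholders, not values taken from a real deployment.

```python
import ipaddress


def build_extra_routes(remote_net_cidr, remote_r3_bridge_ip,
                       ext_net_cidr, local_router_bridge_ip):
    """Return the two extra routes for one regional instance of R3."""
    return [
        {"destination": remote_net_cidr, "nexthop": remote_r3_bridge_ip},
        {"destination": ext_net_cidr, "nexthop": local_router_bridge_ip},
    ]


def next_hop(routes, dest_ip):
    """Longest-prefix match over the routes; None if nothing matches."""
    addr = ipaddress.ip_address(dest_ip)
    best = None
    for route in routes:
        net = ipaddress.ip_network(route["destination"])
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, route["nexthop"])
    return best[1] if best else None


# All addresses below are illustrative placeholders.
r3_1_routes = build_extra_routes(
    remote_net_cidr="10.0.2.0/24",        # net2's cidr
    remote_r3_bridge_ip="100.0.1.2",      # R3(2)'s interface in bridge-net
    ext_net_cidr="203.0.113.0/24",        # ext-net1's cidr
    local_router_bridge_ip="100.0.1.11",  # R1's interface in bridge-net
)
```

With these routes, traffic from net1 to net2 leaves R3(1) towards R3(2), while traffic to ext-net1 leaves towards R1, which matches the two paths described in this section.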

R3(1) and R3(2) will set the external gateway to bridge-net:

router-gateway-set R3(1) bridge-net
router-gateway-set R3(2) bridge-net

Now, north-south traffic of Instance1 and Instance2 works as follows:

Instance1 -> net1 -> R3(1) -> R1 -> ext-net1
Instance2 -> net2 -> R3(2) -> R2 -> ext-net2

Two hops for north-south traffic.

East-west traffic between Instance1 and Instance2 works as follows:

Instance1 <-> net1 <-> R3(1) <-> bridge-net <-> R3(2) <-> net2 <-> Instance2

Two hops for cross Neutron east-west traffic.

The topology with cross Neutron L2 networks in addition to local networks is illustrated in Fig. 4.

+-----------------------+            +-----------------------+
|    ext-net1           |            |          ext-net2     |
|      +---+---+        |            |             +--+---+  |
|RegionOne |            |            |  RegionTwo     |      |
|      +---+------+     |            |     +----------+--+   |
|      |    R1    +---+ |            | +---+   R2        |   |
|      +----------+   | |            | |   +-------------+   |
|    net1             | |            | |              net2   |
|     ++---+          | |            | |          +-----+    |
|      | net3         | |            | |         net4|       |
|      |  ++---+      | |            | |      +--+-+ |       |
|      |   |          | |  net5      | |         |   |       |
|      |   | +-+-----------------------------+-+ |   |       |
|      |   |   |      | |  net6      | |     |   |   |       |
|      |   |   | ++-----------------------++ |   |   |       |
|      |   |   |  |   | |            | |  |  |   |   |       |
|      |   |   |  |   | |            | |  |  |   |   |       |
|      |   |   |  |   | |            | |  |  |   |   |       |
|      |   |   |  |   | |            | |  |  |   |   |       |
| +----+---+---+--+-+ | | bridge-net | | ++--+---+---+-----+ |
| |      R3(1)      +-+----------------+-+      R3(2)      | |
| +-----------------+   |            |   +-----------------+ |
+-----------------------+            +-----------------------+

Figure 4 Multi-NS and cross Neutron L2 networks

The logical topology in central Neutron for Fig. 4 is plotted in Fig. 5.

 ext-net1                           ext-net2
+---+---+                          +--+---+
    |                                 |
 +--+-----------+                 +---+------------+
 |    R1        |                 |      R2        |
 +----------+---+                 +----+-----------+
            |                          |
 +----------+--------------------------+-----------+
 |                   bridge-net                    |
 +-----------------------+-------------------------+
                         |
 +-----------------------+-------------------------+
 |                    R3                           |
 +--+----+------+-----------------+---------+----+-+
    |    |      |                 |         |    |
    |    |      |                 |         |    |
    |    |      |                 |         |    |
    |    |    +-+--------------------+      |    |
    |    |     net5               |         |    |
    |    |         +--------------+------+  |    |
    |    |            net6                  |    |
    |  +-+---+                          +---+-+  |
    |   net3                             net2    |
  +-+---+                                    +---+-+
   net1                                       net4

Figure 5 Logical topology in central Neutron with cross Neutron L2 network

By adding networks to R3, EW traffic is routed by R3.

For net5 in RegionOne, extra route in R3(1) should be added:

destination=net1's cidr, nexthop=<net5-R3-RegionOne-interface's IP>
destination=net3's cidr, nexthop=<net5-R3-RegionOne-interface's IP>

For net5 in RegionTwo, extra route in R3(2) should be added:

destination=net1's cidr, nexthop=<net5-R3-RegionTwo-interface's IP>
destination=net3's cidr, nexthop=<net5-R3-RegionTwo-interface's IP>

The east-west traffic between these networks will work as follows:

net1 <-> R3 <-> net3
net1 <-> R3 <-> net5
net1 <-> R3 <-> net6
net3 <-> R3 <-> net5
net3 <-> R3 <-> net6
net5 <-> R3 <-> net6

For NS traffic, the route to external network is already configured, so NS traffic is routed to R1 or R2.

Implementation

Part 0: add an option in local.conf to enable the new l3 networking model

Add an option “ENABLE_HOST_ROUTE_INDEPENDENT_L3_NETWORKING”, whose value is TRUE or FALSE, to indicate whether users want to adopt the new l3 networking model.

Part 1: enable external network creation with transparent (local) router

This part mainly ensures that a real external network is created along with a local router, and that the gateway of the external network is set to the router. As shown in Fig. 2, when ext-net1 is created, R1 is created too, and the gateway of ext-net1 is set to R1. Moreover, the local router, e.g. R1, is transparent to users. In other words, users only create the external network, while Tricircle completes the creation of the local router. As a result, users are unaware of the local routers.

Part 2: enable routed network and gateway setting process

This part enables the routed network in the central Neutron. It also needs to complete the process of setting the gateway of the routed network on the distributed router, e.g. R3 in Fig. 2. Since the routed network is a software entity made up of multiple real external networks, its gateway IP is set to NULL, and the gateway IPs of the real external networks are planned to be stored in the tag field of the routed network. So this part mainly deals with the blank gateway IP of the routed network when setting the gateway on the router.

Part 3: modify floating ip creation

In the existing l3 networking model, the external network and the tenant network are connected by one router, so implementing a floating IP needs NAT only once. In the new l3 networking model, however, as shown in Fig. 2, the external network and the tenant network are attached to two different routers, which are connected by the bridge network, so floating IP traffic needs to be NATed twice. This part mainly deals with that issue.
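To make the double NAT concrete, here is a toy model of the two translation hops. It assumes that R3(1) maps the instance’s fixed IP to an intermediate address on bridge-net and that R1 maps that address to the floating IP. The spec does not state how the intermediate address is chosen, so the tables, function names and addresses below are purely illustrative.

```python
def translate(nat_table, address):
    """Apply one NAT hop; addresses without a mapping pass through unchanged."""
    return nat_table.get(address, address)


def invert(nat_table):
    """Build the reverse (inbound) table from an outbound one."""
    return {public: private for private, public in nat_table.items()}


# Hypothetical addresses: fixed IP -> bridge-net IP -> floating IP.
r3_nat = {"10.0.1.5": "100.0.1.50"}      # first NAT, on R3(1)
r1_nat = {"100.0.1.50": "203.0.113.20"}  # second NAT, on R1


def outbound(src):
    """Source address seen on the external network for traffic from src."""
    return translate(r1_nat, translate(r3_nat, src))


def inbound(dst):
    """Fixed IP that finally receives traffic sent to the floating IP dst."""
    return translate(invert(r3_nat), translate(invert(r1_nat), dst))
```

Under these assumptions `outbound("10.0.1.5")` yields the floating IP and `inbound` reverses both hops, which is the behaviour Part 3 has to arrange across the two routers.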

Data Model Impact

None

Dependencies

None

Documentation Impact

  1. Add a new guide for North South Networking via Multiple External Networks with east-west enabled.

  2. Release notes.

Tricircle Quality of Service

Background

QoS is defined as the ability to guarantee certain network requirements like bandwidth, latency, jitter and reliability in order to satisfy a Service Level Agreement (SLA) between an application provider and end tenants. In the Tricircle, each OpenStack instance runs its own Nova and Neutron services but shares the same Keystone service or uses federated Keystones, which is a multi-region deployment mode. With networking automation, networks or ports created in different OpenStack clouds should be able to be associated with QoS policies.

Proposal

As networking automation across Neutron servers can be done through the Tricircle, QoS automation should work according to the tenant’s needs too. When a tenant wants to apply QoS to a network or port from the central Neutron, QoS can’t be created in the local Neutron server in the bottom pod directly, since it’s still unclear whether the network will be present in this pod or not.

To automate QoS operations, QoS can’t be created in the local Neutron server directly until some networks/ports exist in the bottom pod. The Tricircle central Neutron plugin (abbr: “central plugin”) will operate on QoS information in the local Neutron server. Unlike a network or port, QoS does not need to be created during VM booting. To speed up local VM booting and reduce the delay caused by synchronization between central Neutron and local Neutron, the Tricircle central plugin should use an asynchronous method to associate QoS with the local network/port, or to remove the QoS association in each local Neutron if needed.

Implementation

Case 1, QoS policy creation

In this case, we only create QoS in the central Neutron.

Case 2, QoS policy association without local network/port in place

QoS has been created in the central Neutron but local network/port has not yet been created.

In this case, we just need to update network/port with QoS policy id in the central Neutron.

Case 3, QoS policy association with local network/port in place

After QoS has been created in the central Neutron and local network/port also has been created, associate QoS with network/port in the central Neutron.

In this case, the network/port has already been created in the local Neutron. After the network/port is updated with the QoS policy id in the central Neutron, a similar association is needed in the local Neutron. Central Neutron first uses the “create_qos_policy” job to create the local QoS policy, then updates the network/port QoS association asynchronously in the local Neutron, using the network/port routing information, and adds the QoS routing information to the routing table. XJob will interact with the local Neutron to update the QoS policy id for the network/port there.

Case 4, provision VM with QoS policy associated central port/network

QoS has been associated to central port/network first, local network/port is created later in VM provision.

In this case, QoS has been associated with the central network/port, and at this point the local network/port does not exist. Since QoS has not been created in the local Neutron but central Neutron has finished the association, local Neutron needs to trigger central Neutron to finish the local network/port QoS association when a VM boots in the bottom pod. During VM booting, local Neutron sends an update port request with port information to central Neutron. If the QoS id field exists in the network/port, central Neutron is triggered to use XJob to create a QoS policy creation job for the local Neutron (running it asynchronously also speeds up VM booting) and adds the QoS routing information to the routing table.

Case 5, QoS policy updating

In this case, if local network/port isn’t associated with QoS, we only update QoS in the central Neutron.

If QoS policy has been associated with local network/port in place, after central Neutron updates QoS, central Neutron will use XJob to create a QoS asynchronous updating job through the network/port routing information. XJob will asynchronously update QoS in the local Neutron.

Case 6, QoS policy disassociation

For QoS policy disassociation, we just need to set the “qos_policy_id” parameter to None when updating the network/port in the central Neutron, and the network/port is then disassociated.

In this case, if network/port in local Neutron isn’t associated with QoS, we only disassociate network/port in the central Neutron.

If QoS policy has been associated with network/port in local Neutron, after central Neutron disassociates network, central Neutron will use XJob to create a network update job to disassociate the network with the QoS policy; for port, central Neutron will synchronously update the port to disassociate it with the QoS policy in the local Neutron.

Case 7, QoS policy deletion

A QoS policy can only be deleted if it has no association in central Neutron. In this case, if local network/port isn’t associated with QoS, we only delete QoS in the central Neutron.

If there is QoS policy routing info, after central Neutron deletes QoS, central Neutron will use XJob to create a QoS asynchronous deletion job through the network/port routing information. XJob will asynchronously delete QoS in the local Neutron.

Case 8, QoS rule creation

In this case, if local network/port isn’t associated with QoS, we only create QoS rule in the central Neutron.

If QoS policy has been associated with local network/port in place, after central Neutron creates QoS rules, central Neutron will use XJob to create a QoS rules syncing job through the network/port routing information, then asynchronously creates QoS rules in the local Neutron.

Case 9, QoS rule updating

In this case, if local network/port isn’t associated with QoS, we only update QoS rule in the central Neutron. If QoS policy has been associated with local network/port in place, after central Neutron updates QoS rule, central Neutron will trigger XJob to create a QoS rules syncing job in the local Neutron through the network/port routing information. XJob will asynchronously update QoS rule in the local Neutron.

Case 10, QoS rule deletion

In this case, if local network/port isn’t associated with QoS, we only delete QoS rule in the central Neutron.

If QoS policy has been associated with local network/port in place, after central Neutron deletes QoS rule, central Neutron will use XJob to create a QoS rules syncing job through the network/port routing information. XJob will asynchronously delete QoS rule in the local Neutron.
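The ten cases above reduce to one question: does the operation need an asynchronous XJob job, and if so which one? The following sketch summarizes that decision. The job names come from the “QoS XJob jobs list”; the operation keys, the `update_network` label and the `job_for` helper are hypothetical names used only for illustration.

```python
# Job names come from the "QoS XJob jobs list"; the operation keys and the
# "update_network" label are hypothetical names used only in this sketch.
ASYNC_JOB_BY_OPERATION = {
    "associate": "create_qos_policy",
    "update_policy": "update_qos_policy",
    "delete_policy": "delete_qos_policy",
    "create_rule": "sync_qos_policy_rules",
    "update_rule": "sync_qos_policy_rules",
    "delete_rule": "sync_qos_policy_rules",
    "disassociate_network": "update_network",
}


def job_for(operation, has_local_counterpart):
    """Return the XJob job to enqueue, or None when no async job is needed.

    has_local_counterpart is True when the network/port (or the policy
    association) already exists in a local Neutron.
    """
    if not has_local_counterpart:
        return None  # only the central Neutron is updated
    # Operations missing from the table (policy creation, or the
    # synchronous port disassociation) do not enqueue a job.
    return ASYNC_JOB_BY_OPERATION.get(operation)
```

For example, `job_for("update_rule", True)` selects the rules syncing job, while `job_for("associate", False)` returns `None` because only the central Neutron record changes.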

QoS XJob jobs list

  • 1: create_qos_policy(self, ctxt, policy_id, pod_id, res_type, res_id=None)

Asynchronously creates a QoS policy in the pod whose id equals “pod_id”. The network or port is specified through the parameters res_type and res_id: if res_type is RT_NETWORK, res_id is the network’s uuid; if res_type is RT_PORT, res_id is the port’s uuid.

Triggering condition:

When associating a network/port in the central Neutron, if this network/port exists in the local Neutron, this asynchronous job is triggered to complete the local association.

When the central plugin processes a port update request sent by the local plugin and finds that the port is associated with QoS.

If pod_id is POD_NOT_SPECIFIED, the async job processes all related pods, so the create_qos_policy job is not limited to a single pod’s QoS association.

If the res_type is RT_NETWORK/RT_PORT, after creating the QoS policy in the pod, the async job binds the newly created QoS policy to the network/port specified by res_id.

  • 2: update_qos_policy(self, ctxt, policy_id, pod_id)

Asynchronously updates the QoS policy in the pod whose id equals “pod_id”.

Triggering condition:

When updating a QoS policy in the central Neutron, if it also exists in the local Neutron, this asynchronous job is triggered to complete the local QoS update.

If pod_id is POD_NOT_SPECIFIED, the async job processes all related pods, so the update_qos_policy job is not limited to a single pod’s QoS association.

  • 3: delete_qos_policy(self, ctxt, policy_id, pod_id)

Asynchronously deletes the QoS policy in the pod whose id equals “pod_id”.

Triggering condition:

When deleting a QoS policy in the central Neutron, if this QoS policy exists in the local Neutron, this asynchronous job is triggered to complete the local QoS deletion. (Warning: the QoS policy must be disassociated before it is deleted.)

If pod_id is POD_NOT_SPECIFIED, the async job processes all related pods, so the delete_qos_policy job is not limited to a single pod’s QoS association.

  • 4: sync_qos_policy_rules(self, ctxt, policy_id)

Asynchronously synchronizes the rules of one QoS policy for the specified project. There are two trigger conditions. The first is that central Neutron creates, updates or deletes QoS rules after the QoS policy has been associated with a local network/port. The second is that the central plugin processes a port update request sent by the local plugin and finds that the port is associated with a QoS policy.

If the rule exists in both the central Neutron and the local Neutron but with inconsistent content, this QoS rule is asynchronously updated in the local Neutron.

If the rule exists in the central Neutron but not in the local Neutron, this QoS rule is asynchronously created in the local Neutron.

If the rule exists in the local Neutron but not in the central Neutron, this QoS rule is asynchronously deleted in the local Neutron.
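The three rule-syncing outcomes above amount to a diff between the central and the local rule sets. A minimal sketch follows, assuming both sides can be represented as dicts keyed by the same rule id. That is a simplification, since the real job has to map central ids to local ids, and `plan_rule_sync` is a hypothetical helper name.

```python
def plan_rule_sync(central_rules, local_rules):
    """Compare rules keyed by rule id; return (create, update, delete) id lists.

    Both arguments map rule id -> rule content. The central Neutron is
    treated as the source of truth.
    """
    to_create = sorted(rid for rid in central_rules if rid not in local_rules)
    to_update = sorted(
        rid for rid in central_rules
        if rid in local_rules and central_rules[rid] != local_rules[rid]
    )
    to_delete = sorted(rid for rid in local_rules if rid not in central_rules)
    return to_create, to_update, to_delete
```

Because the plan is computed from the current state of both sides, running the job twice produces an empty plan the second time, which suits a job that may be retried.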

Data Model Impact

None

Dependencies

None

Documentation Impact

Release notes

Reliable resource deleting in Tricircle

Background

Deleting a resource that is mapped to several local Neutron(s) may cause conflicting operations. Take, for example, deleting a network in central Neutron that also resides in several local Neutron(s). A network-get request triggers local Neutron to query central Neutron and create the network, and we delete the local networks before deleting the central network. If a network-get request reaches a local Neutron server after the local network has been completely deleted in that region, while the network in central Neutron still exists (deleting all local networks takes some time), local Neutron will still retrieve the network from central Neutron and the deleted local network will be recreated. This issue also applies to the deletion of other resource types.

Proposed Solution

Tricircle recently added a feature to distinguish the source of requests[1], so we can tell whether a deletion request comes from ‘Central Neutron’ or ‘Local Neutron’. To avoid the conflict described above, we introduce a new table called “deleting_resource” in the Tricircle database, in which the central plugin records the resource deletion information when it receives a deletion request. Here is the schema of the table:

Resource deleting table

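+---------------+-----------+----------+----------+--------------------------------+
| Field         | Type      | Nullable | pk/fk/uk | Description                    |
+===============+===========+==========+==========+================================+
| resource_id   | string    | False    | uk       | resource id in central Neutron |
+---------------+-----------+----------+----------+--------------------------------+
| resource_type | string    | False    | uk       | resource_type denotes one of   |
|               |           |          |          | the available resource types   |
+---------------+-----------+----------+----------+--------------------------------+
| deleted_at    | timestamp | False    | n/a      | deletion timestamp             |
+---------------+-----------+----------+----------+--------------------------------+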
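As an illustration of the schema, the sketch below creates the table in an in-memory SQLite database. It assumes that the two “uk” columns together form one composite unique key, which the table does not state explicitly, and it uses the standard-library `sqlite3` module only to stay self-contained. It does not show Tricircle’s actual database models, and `mark_deleting` is a hypothetical helper.

```python
import sqlite3

# Assumption: the two "uk" columns form one composite unique key.
SCHEMA = """
CREATE TABLE deleting_resource (
    resource_id   TEXT NOT NULL,
    resource_type TEXT NOT NULL,
    deleted_at    TIMESTAMP NOT NULL,
    UNIQUE (resource_id, resource_type)
)
"""


def mark_deleting(conn, resource_id, resource_type, deleted_at):
    """Record a deletion in progress; return False if it was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO deleting_resource VALUES (?, ?, ?)",
                (resource_id, resource_type, deleted_at),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```

The unique key is what lets a second deletion request for the same resource be detected cheaply: the insert fails instead of creating a duplicate record.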

How to delete the resource without conflict operation

Let’s take network deletion as an example.

At the beginning of network-delete handling, the central Neutron server records the network being deleted in the “deleting_resource” table.

At this point, if a get-request comes from a local Neutron server, the central Neutron server checks whether the associated resource is recorded in the “deleting_resource” table and returns 404 to the local Neutron server if the resource is being deleted.

At this point, if a deletion request comes from central Neutron, the central Neutron server checks whether the associated resource is recorded in the “deleting_resource” table and returns 204 to the user if the resource is being deleted.

For a get-request from a user, the central Neutron server queries the “deleting_resource” table and returns the resource being deleted to the user if the queried network exists in the table. When a user deletes the network again after something has gone wrong, central Neutron returns 204 to the user.

At the end of network-delete handling, once all the mapped local networks have been deleted, the central Neutron server removes the deleting resource record and then removes the network.

In addition, the table contains a timestamp so that a cloud administrator can delete a resource that has been in deleting status for too long or is in an abnormal status.
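The deletion flow described above can be sketched with an in-memory stand-in for the “deleting_resource” table. The class and method names are hypothetical, and returning 200 for a resource that is not being deleted is a simplification of “handle the request normally”.

```python
from datetime import datetime, timedelta


class DeletingRegistry:
    """In-memory stand-in for the deleting_resource table (illustration only)."""

    def __init__(self):
        self._records = {}  # (resource_id, resource_type) -> deleted_at

    def begin_delete(self, resource_id, resource_type, now):
        """Return 204 if a deletion is already in progress, else record it."""
        key = (resource_id, resource_type)
        if key in self._records:
            return 204
        self._records[key] = now
        return None  # caller goes on to delete the mapped local resources

    def get_from_local(self, resource_id, resource_type):
        """A get-request from local Neutron sees 404 while deletion is running."""
        return 404 if (resource_id, resource_type) in self._records else 200

    def finish_delete(self, resource_id, resource_type):
        """Drop the record once all mapped local resources are gone."""
        self._records.pop((resource_id, resource_type), None)

    def stale(self, now, max_age):
        """Records older than max_age, for an administrator to inspect."""
        return sorted(k for k, ts in self._records.items() if now - ts > max_age)
```

Answering 404 to local Neutron while the record exists is what prevents the deleted local network from being recreated, and `stale` shows how the timestamp could be used to find deletions that never finished.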

[1] https://review.opendev.org/#/c/518421/

Smoke Test Engine

Problems

Currently we run a simple smoke test in the CI job. Several resources are created to build a simple topology, then we query to check whether the resources are also created in the local Neutron servers as expected. The existing problems are:

  • 1 Bash scripts are used to invoke the client to send API requests, while Python scripts are used to check the results. Mixing bash and Python makes it hard to write new tests.

  • 2 Resources are not cleaned up at the end of the test, so we can’t run other tests in the same environment.

Proposal

Using bash scripts to do both API requests and result validation is tricky and hard to read; working with Python is a better choice. Several Python libraries can help us send API requests: openstackclient, neutronclient and openstacksdk. The main goal of the first two libraries is to provide a command line interface (CLI), so they don’t expose methods for sending API requests, but we can still use them by calling the internal functions used by their CLI. The drawback is that those internal functions are undocumented and may be changed or removed someday. Compared to openstackclient and neutronclient, openstacksdk is a library aimed at application building and is well documented. In fact, openstackclient uses openstacksdk to implement some of its commands. The limitation of openstacksdk is that some service extensions like trunk and service function chaining are not supported yet, but it’s easy to extend it on our own.

Before starting to write Python code to prepare, validate and finally clean up resources for each test scenario, let’s pause and take it one step further. Heat uses templates to define the resources and networking topologies to be created; similarly, we can use a YAML file to describe our test tasks.

Schema

A task can be defined as a dict that has the following basic fields:

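+---------+--------+--------------------------------------+-----------------+
| Field   | Type   | Description                          | Required or not |
+=========+========+======================================+=================+
| task_id | string | user specified task ID               | required        |
+---------+--------+--------------------------------------+-----------------+
| region  | string | keystone region to send API          | required        |
+---------+--------+--------------------------------------+-----------------+
| type    | string | resource type                        | required        |
+---------+--------+--------------------------------------+-----------------+
| depend  | list   | task IDs the current task depends on | optional        |
+---------+--------+--------------------------------------+-----------------+
| params  | dict   | parameters to run the task, usage    | optional        |
|         |        | differs in different task types      |                 |
+---------+--------+--------------------------------------+-----------------+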

Currently four types of tasks are defined. The usage of the “params” field for each type of task is listed below:

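+-----------+---------------------------------------------+
| Task type | Usage of “params” field                     |
+===========+=============================================+
| create    | used as the post body of the create request |
+-----------+---------------------------------------------+
| query     | used as the query filter                    |
+-----------+---------------------------------------------+
| action    | used as the put body of the action request  |
+-----------+---------------------------------------------+
| validate  | used as the filter to query resources that  |
|           | need to be validated                        |
+-----------+---------------------------------------------+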

A task doesn’t have a “task type” field, but it can have an extra dict field that holds additional information needed by the task. This extra field differs between task types. A “create” task doesn’t have an extra field.

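+---------------------+-----------+--------+----------------------------------------+-----------------+
| Extra field         | Sub field | Type   | Description                            | Required or not |
+=====================+===========+========+========================================+=================+
| query               | get_one   | bool   | whether to return an element or a list | required        |
| (for query task)    |           |        |                                        |                 |
+---------------------+-----------+--------+----------------------------------------+-----------------+
| action              | target    | string | target resource ID                     | required        |
| (for action task)   +-----------+--------+----------------------------------------+-----------------+
|                     | method    | string | action method, “update” and “delete”   | required        |
|                     |           |        | are also included                      |                 |
|                     +-----------+--------+----------------------------------------+-----------------+
|                     | retries   | int    | times to retry the current task        | optional        |
+---------------------+-----------+--------+----------------------------------------+-----------------+
| validate            | predicate | string | value should be “any” or “all”; “any”  | required        |
| (for validate task) |           |        | means that for each condition, there   |                 |
|                     |           |        | exists a resource satisfying that      |                 |
|                     |           |        | condition; “all” means that every      |                 |
|                     |           |        | condition is satisfied by all the      |                 |
|                     |           |        | resources                              |                 |
|                     +-----------+--------+----------------------------------------+-----------------+
|                     | condition | list   | each condition is a dict, key of the   | required        |
|                     |           |        | dict is the field of the resource,     |                 |
|                     |           |        | value of the dict is the expected      |                 |
|                     |           |        | value of the resource field            |                 |
|                     +-----------+--------+----------------------------------------+-----------------+
|                     | retries   | int    | times to retry the current task        | optional        |
+---------------------+-----------+--------+----------------------------------------+-----------------+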
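The “any” and “all” predicates of a validate task can be sketched as follows. The function names are hypothetical, and resources are assumed to be plain dicts as they might come back from a query.

```python
def matches(resource, condition):
    """True if every key in the condition dict has the expected value."""
    return all(resource.get(field) == expected
               for field, expected in condition.items())


def validate(resources, predicate, conditions):
    """Evaluate a validate task's "predicate" over the queried resources."""
    if predicate == "any":
        # each condition must be met by at least one resource
        return all(any(matches(r, c) for r in resources) for c in conditions)
    if predicate == "all":
        # each condition must be met by every resource
        return all(all(matches(r, c) for r in resources) for c in conditions)
    raise ValueError(f"unknown predicate: {predicate!r}")
```

Note that with an empty resource list “all” is trivially satisfied while “any” fails, so a runner that retries may want to treat an empty query result as not yet ready.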

Several related tasks can be grouped to form a task set. A task set is a dict with the following fields:

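+-------------+--------+----------------------------------------------+-----------------+
| Field       | Type   | Description                                  | Required or not |
+=============+========+==============================================+=================+
| task_set_id | string | user specified task set ID                   | required        |
+-------------+--------+----------------------------------------------+-----------------+
| depend      | list   | task set IDs the current task set depends on | optional        |
+-------------+--------+----------------------------------------------+-----------------+
| tasks       | list   | task dicts of the task set                   | required        |
+-------------+--------+----------------------------------------------+-----------------+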

So the YAML file contains a list of task sets.

Result and Reference

“Create” and “query” type tasks return results, which can be used in the definition of other tasks that depend on them. Use task_id@resource_field to refer to “resource_field” of the resource returned by “task_id”. If the referenced task belongs to another task set, use task_set_id@task_id@resource_field to specify the task set ID. A reference can be used in the “params”, “action target” and “validate condition” fields. If a reference is used, task_id needs to be in the list of the task’s “depend” field, and task_set_id needs to be in the list of the task set’s “depend” field. For a “query” type task that is depended on, the “get_one” field needs to be true.
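A reference of either form can be resolved with a small helper. The sketch below assumes the engine keeps results in a nested dict, task set ID to task ID to resource; the `resolve` function and that storage layout are hypothetical.

```python
def resolve(value, results, current_task_set):
    """Replace a "task@field" or "set@task@field" reference with its value.

    `results` maps task_set_id -> task_id -> resource dict. Values that are
    not strings or contain no "@" are returned unchanged.
    """
    if not isinstance(value, str) or "@" not in value:
        return value
    parts = value.split("@")
    if len(parts) == 2:
        # short form: the task lives in the current task set
        task_set_id = current_task_set
        task_id, field = parts
    elif len(parts) == 3:
        task_set_id, task_id, field = parts
    else:
        raise ValueError(f"malformed reference: {value!r}")
    return results[task_set_id][task_id][field]
```

One consequence of this simple rule is that a literal parameter value containing “@” would be mistaken for a reference, so a real engine may need an escape mechanism.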

Example

The following example shows how to use the above schema to define tasks:

- task_set_id: preparation
  tasks:
    - task_id: image1
      region: region1
      type: image
      query:
        get_one: true
    - task_id: net1
      region: central
      type: network
      params:
        name: net1
    - task_id: subnet1
      region: central
      type: subnet
      depend: [net1]
      params:
        name: subnet1
        ip_version: 4
        cidr: 10.0.1.0/24
        network_id: net1@id
    - task_id: vm1
      region: region1
      type: server
      depend:
        - net1
        - subnet1
        - image1
      params:
        flavor_id: 1
        image_id: image1@id
        name: vm1
        networks:
          - uuid: net1@id
- task_set_id: wait-for-job
  tasks:
    - task_id: check-job
      region: central
      type: job
      validate:
        predicate: all
        retries: 10
        condition:
          - status: SUCCESS
- task_set_id: check
  depend: [preparation]
  tasks:
    - task_id: check-servers1
      region: region1
      type: server
      validate:
        predicate: any
        condition:
          - status: ACTIVE
            name: vm1

The above YAML content defines three task sets. The “preparation” task set creates a network, a subnet and a server, then the “wait-for-job” task set waits for asynchronous jobs to finish, and finally the “check” task set checks whether the server is active.

Implementation

A task engine needs to be implemented to parse the YAML file, analyse the task and task set dependencies and then run the tasks. A runner based on openstacksdk will also be implemented.
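The dependency analysis can be sketched with a topological sort. The example below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); `execution_order` is a hypothetical helper and not the engine’s actual code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(items, id_key):
    """Order task or task-set dicts so that dependencies run first.

    `items` is a list of dicts as parsed from the YAML file; `id_key` is
    "task_id" for tasks or "task_set_id" for task sets.
    """
    graph = {item[id_key]: set(item.get("depend", [])) for item in items}
    # static_order() raises graphlib.CycleError on circular dependencies
    return list(TopologicalSorter(graph).static_order())
```

The same helper works for both levels: call it once with `"task_set_id"` to order the task sets, then with `"task_id"` inside each set. An ID that appears only in a “depend” list is still returned in the order, so the engine should check separately that every dependency is defined.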

Dependencies

None

Creative Commons Attribution 3.0 License

Except where otherwise noted, this document is licensed under Creative Commons Attribution 3.0 License. See all OpenStack Legal Documents.

tricircle 6.1.0.dev16
