.. _LMA_test_results:

****************
LMA Test Results
****************

:Abstract:

  This document includes results of measuring how many resources LMA service
  needs as a monitoring service during using on a big environment (~200 nodes).
  This document includes results of reliability testing of `LMA`_ services.


Environment description
=======================
Hardware configuration of each server
-------------------------------------

.. table:: Description of servers hardware

  +-------+----------------+------------------------+------------------------+
  |role   |role            |OpenStackController     |OpenStackCompute and LMA|
  +-------+----------------+------------------------+------------------------+
  |CPU    |core_count (+HT)|40                      |12                      |
  |       +----------------+------------------------+------------------------+
  |       |frequency_MHz   |2300                    |2100                    |
  +-------+----------------+------------------------+------------------------+
  |RAM    |amount_MB       |262144                  |32768                   |
  +-------+----------------+------------------------+------------------------+
  |Disk1  |amount_GB       |111.8                   |75                      |
  +       +----------------+------------------------+------------------------+
  |       |SSD/HDD         |SSD                     |SSD                     |
  +-------+----------------+------------------------+------------------------+
  |Disk2  |amount_GB       |111.8                   |1000                    |
  +       +----------------+------------------------+------------------------+
  |       |SSD/HDD         |SSD                     |HDD                     |
  +-------+----------------+------------------------+------------------------+
  |Disk3  |amount_GB       |1800                    |-                       |
  +       +----------------+------------------------+------------------------+
  |       |SSD/HDD         |HDD                     |-                       |
  +-------+----------------+------------------------+------------------------+
  |Disk4  |amount_GB       |1800                    |-                       |
  +       +----------------+------------------------+------------------------+
  |       |SSD/HDD         |HDD                     |-                       |
  +-------+----------------+------------------------+------------------------+

Software configuration of the services
--------------------------------------
Installation of OpenStack and LMA plugins:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OpenStack has been installed using Fuel version 8.0 and fuel plugins:
3 controllers, 193 computes (20 OSD), 3 Elasticsearch, 3 InfluxDB, 1 Nagios

.. table:: Versions of some software

  +--------------------------------+------------+
  |Software                        |Version     |
  +================================+============+
  |Fuel                            |8.0         |
  +--------------------------------+------------+
  |fuel-plugin-lma-collector       |0.9         |
  +--------------------------------+------------+
  |fuel-plugin-elasticsearch-kibana|0.9         |
  +--------------------------------+------------+
  |fuel-plugin-influxdb-grafana    |0.9         |
  +--------------------------------+------------+

Testing process
===============
1.
  Fuel 8.0, LMA plugins and OpenStack have been installed installed.

2.
  Rally tests have been performed two times. Results are here:
  :download:`rally_report_1.html <./rally_report_1.html>`
  :download:`rally_report_2.html <./rally_report_2.html>`
3. Metrics (cpu, memory, I/O) have been collected using collectd
4. Disbale InfluxDB services in haproxy to prevent Heka to send metrics to
   InfluxDB. The outage time should be equal to 3 hours
5. Enable InfluxDB services in haproxy backends and measure how many resources
   and time InfluxDB needs to get all statistic from Heka after outage.
6. Disbale Elasticsearch services in haproxy to prevent Heka to send metrics to
   Elasticsearch. The outage time should be equal to 3 hours
7. Enable Elasticsearch services in haproxy backends and measure how many
   resources and time Elasticsearch needs to get all statistic from Heka after
   outage.

Usage Results
=============

Collector: Hekad / collectd
---------------------------
The following table describe how many resources was used by Hekad and Collectd
during the test in depend on OpenStack role:

.. table::
   CPU, Memory and Disk consumption in depend on OpenStack role

  +------------------------+----------------+----------------+----------------+
  | role                   |CPU             |Memory          |I/O per second  |
  |                        |(hekad/collectd)|(hekad/collectd)|(hekad/collectd)|
  +========================+================+================+================+
  | controller             | 0.7 cpu        | 223 MB         |730 KB write    |
  |                        |                |                |                |
  |                        | 0.13 cpu       | 45 MB          |730 KB read     |
  |                        |                |                |                |
  |                        |                |                |0 KB write      |
  |                        |                |                |                |
  |                        |                |                |250 KB read     |
  +------------------------+----------------+----------------+----------------+
  || Controller without    | 0.4 cpu        |no impact       |220 KB write    |
  || RabbitMQ queues       |                |                |                |
  || metrics (~4500 queues)|                |                |                |
  || `1549721`_            | 0.06 cpu       |                |280 KB read     |
  |                        |                |                |                |
  |                        |                |                |0 KB write      |
  |                        |                |                |                |
  |                        |                |                |250 KB read     |
  +------------------------+----------------+----------------+----------------+
  | aggregator             | 0.9 cpu        | 285 MB         |830 KB write    |
  |                        |                |                |                |
  |                        | 0.13 cpu       | 50 MB          |830 KB read     |
  |                        |                |                |                |
  |                        |                |                |0 KB write      |
  |                        |                |                |                |
  |                        |                |                |247 KB read     |
  +------------------------+----------------+----------------+----------------+
  | compute                | 0.2 cpu        | 145 MB         |15 KB write     |
  |                        |                |                |                |
  |                        | 0.02 cpu       | 6.1 MB         |40 KB read      |
  |                        |                |                |                |
  |                        |                |                |0 KB write      |
  |                        |                |                |                |
  |                        |                |                |22 KB read      |
  +------------------------+----------------+----------------+----------------+
  | compute/osd            | 0.25 cpu       | 154 MB         |15 KB write     |
  |                        |                |                |                |
  |                        | 0.02 cpu       | 13 MB          |40 KB read      |
  |                        |                |                |                |
  |                        |                |                |0 KB write      |
  |                        |                |                |                |
  |                        |                |                |23 KB read      |
  +------------------------+----------------+----------------+----------------+

Influxdb
--------

InfluxDB consumes manageable amount of CPU (more information in the table
below). The compaction operation is performed regularly which produces spike of
resource consumption (every ~ 6 minutes with the actual load of
200 nodes / 1000 VMs):

|image0|

The average write operation duration is 3ms (SSD drive)

+-------------------------+-----------------+--------+-------+-----------------+
| Conditions              | write/s         | cpu    | memory| I/O             |
|                         |                 |(normal |(normal|(normal/         |
|                         |                 |/spike) |/spike)|spike)           |
+=========================+=================+========+=======+=================+
| normal                  |111 HTTP writes/s|0.38 cpu|1.2GB  |1.3MB(r)/1.7MB(w)|
|                         |                 |        |       |                 |
|                         |(37 w/s per node)|2 cpu   |2.3GB  |1.5MB(r)/7.3MB(w)|
+-------------------------+-----------------+--------+-------+-----------------+
|| Controller without     |75 HTTP writes/s |0.3 cpu |1.2GB  |930KB(r)/1MB(w)  |
|| RabbitMQ queues        |(25 w/s per node)|        |       |                 |
|| metrics (~4500 queues) |                 |        |       |                 |
|| `1549721`_             |(-30% w/o        |1.9 cpu |2.2GB  |1.5MB(r)/7.3MB(w)|
||                        |rabbitmq queues) |        |       |                 |
+-------------------------+-----------------+--------+-------+-----------------+
| w/o rabbitMQ            | 93 HTTP writes/s|0.5 cpu |1.5 GB |1MB(r)/1.4MB(w)  |
|                         |(31 w/s per node)|        |       |                 |
|                         |                 |        |       |                 |
| and 1000 VMs            | (0,018 w/s/vm)  |2.5 cpu |2 GB   |1.2MB(r)/6.6MB(w)|
+-------------------------+-----------------+--------+-------+-----------------+

Disk space usage evolution with 1000 VMs:

~125 MB / hour

~3 GB / day

~90 GB / month

|image1|

Elasticsearch
-------------

The bulk operations takes ~80 ms (mean) on SATA disk (this is the mean
response time from HAProxy log).

The CPU usage depends on the REST API activity (see the extra load in
the graph below) and also seems to depends on the current index size
(CPU utilization increases proportionally while the load is constant):

|image2|

|image3|

Disk space usage evolution with a constant API solicitation (eg, while
true; nova\|cinder\|neutron list); done) and 1000 VMs spawned:

~670 MB / hour

~16 GB / day

~500 GB / month

|image4|

All RabbitMQ queues collection impact
-------------------------------------

The collection of all RabbitMQ queue metrics has a significant impact
on Heka and Collectd CPU utilization and obviously on the InfluxDB load
(HTTP request per second)

Heka

|image5|

Collectd

|image6|

InfluxDB

|image7|

Reliability Results
===================
Backends outage for 2 hours
---------------------------

InfluxDB
~~~~~~~~

After a complete InfluxDB cluster downtime (simulated by a HAProxy
shutdown) the cluster is capable to take over all metrics accumulated by
Heka instances in less than 10 minutes, here is the spike of resource
consumption per node.

+-------------------+------------------------------+--------+-------+---------+
|Conditions         |write/s                       |cpu     |memory | I/O     |
+===================+==============================+========+=======+=========+
|| take over 3 hours|| ~900 w/s                    || 6.1cpu|| 4.8GB|| 22MB(r)|
|| of metrics       || total of 2700 HTTP writes/s ||       ||      || 25MB(w)|
+-------------------+------------------------------+--------+-------+---------+

|image8|\ fuel nodes

|image9|

|image10|

|image11|

Data loss
^^^^^^^^^

A window of less than 40 minutes of metrics are lost on controllers.

Other node roles have no data loss because they have much less metrics
collected than controllers. Hence, the heka buffer size (1GB) for
influxdb queue is filled within ~1h20.

This retention period can be increased drastically by avoiding to
collect all the rabbitmq queues metrics.

The following examples show both controller and compute/osd CPU metric.
The 2 first annotations indicate the downtime (InfluxDB and
Elasticsearch) while the 2 last annotations indicate the recovery
status.

On controller node the CPU metric is lost from 18h52 to 19h29 while the
InfluxDB outage ran from ~17h30 to 19h30:

|image12|

A role with osd/compute roles didn’t lose metrics:

|image13|

Elasticsearch
~~~~~~~~~~~~~

After a complete ES cluster downtime (simulated by an HAProxy shutdown)
the cluster is capable to take over all logs accumulated by Hekad
instances in less than 10 minutes, here the spike resource consumption
per node

+-------------------+-----------+-------+-----------------------+------------+
|Conditions         |HTTP bulk  |cpu    |memory                 |I/O         |
|                   |request/s  |       |                       |            |
|                   |           |       |(normal/spike)         |normal/spike|
+===================+===========+=======+=======================+============+
|| take over 3 hours|| 680 req/s|| 4 cpu|| 16GB (jvm fixed size)|| 26 MB (r) |
|| of logs          ||          ||      ||                      || 25 MB (w) |
+-------------------+-----------+-------+-----------------------+------------+

CPU utilization:

|image14|

I/O

|image15|

Data lost
^^^^^^^^^

We lost some logs (and maybe notification) since heka log has a bunch of
“queue is full”

Apache2/Nagios3
~~~~~~~~~~~~~~~

Apache is flooded and never recover the load

Elasticsearch failover/recovery
-------------------------------

One ES node down
~~~~~~~~~~~~~~~~

The cluster is detected as WARNING (cannot honor the number of replicas)
but there is no downtime observed and no data lost since the cluster
accepts data.

.. code::

    root@node-47:~# curl 192.168.0.4:9200/\_cluster/health?pretty

    {

    "cluster\_name" : "lma",

    **"status" : "yellow",**

    "timed\_out" : false,

    "number\_of\_nodes" : 2,

    "number\_of\_data\_nodes" : 2,

    "active\_primary\_shards" : 25,

    "active\_shards" : 50,

    "relocating\_shards" : 0,

    "initializing\_shards" : 0,

    "unassigned\_shards" : 20,

    "delayed\_unassigned\_shards" : 0,

    "number\_of\_pending\_tasks" : 0,

    "number\_of\_in\_flight\_fetch" : 0

    }

    root@node-47:~# curl 192.168.0.4:9200/\_cat/indices?v

    health status index pri rep docs.count docs.deleted store.size
    pri.store.size

    green open kibana-int 5 1 2 0 52.1kb 26.1kb

    yellow open log-2016.03.08 5 2 5457994 0 2.1gb 1gb

    yellow open log-2016.03.07 5 2 10176926 0 3.7gb 1.8gb

    yellow open notification-2016.03.08 5 2 1786 0 3.5mb 1.9mb

    yellow open notification-2016.03.07 5 2 2103 0 3.7mb 1.8mb

|image16|

|image17|

|image18|

|image19|

|image20|

2 ES down
~~~~~~~~~

The cluster is unavailable, all heka buffersize data until recovery.

    root@node-47:~# curl 192.168.0.4:9200/\_cluster/health?pretty

    {

    "error" : "MasterNotDiscoveredException[waited for [30s]]",

    "status" : 503

    }

*ES logs*

[2016-03-08 09:48:10,758][INFO ][cluster.service ]
[node-47.domain.tld\_es-01] removed
{[node-153.domain.tld\_es-01][bIVAau9SRc-K3lomVAe1\_A][node-153.domain.tld][inet[/192.168.0.163:9

300]]{master=true},}, reason: zen-disco-receive(from master
[[node-204.domain.tld\_es-01][SLMBNAvcRt6DWQdNvFE4Yw][node-204.domain.tld][inet[/192.168.0.138:9300]]{master=true}])

[2016-03-08 09:48:12,375][INFO ][discovery.zen ]
[node-47.domain.tld\_es-01] master\_left
[[node-204.domain.tld\_es-01][SLMBNAvcRt6DWQdNvFE4Yw][node-204.domain.tld][inet[/192.168.0.1

38:9300]]{master=true}], reason [transport disconnected]

[2016-03-08 09:48:12,375][WARN ][discovery.zen ]
[node-47.domain.tld\_es-01] master left (reason = transport
disconnected), current nodes: {[node-47.domain.tld\_es-01][l-UXgVBgSze7g

twc6Lt\_yw][node-47.domain.tld][inet[/192.168.0.108:9300]]{master=true},}

[2016-03-08 09:48:12,375][INFO ][cluster.service ]
[node-47.domain.tld\_es-01] removed
{[node-204.domain.tld\_es-01][SLMBNAvcRt6DWQdNvFE4Yw][node-204.domain.tld][inet[/192.168.0.138:9

300]]{master=true},}, reason: zen-disco-master\_failed
([node-204.domain.tld\_es-01][SLMBNAvcRt6DWQdNvFE4Yw][node-204.domain.tld][inet[/192.168.0.138:9300]]{master=true})

[2016-03-08 09:48:21,385][DEBUG][action.admin.cluster.health]
[node-47.domain.tld\_es-01] no known master node, scheduling a retry

[2016-03-08 09:48:32,482][DEBUG][action.admin.indices.get ]
[node-47.domain.tld\_es-01] no known master node, scheduling a retry

*LMA collector logs:*

2016/03/08 09:54:00 Plugin 'elasticsearch\_output' error: HTTP response
error. Status: 503 Service Unavailable. Body:
{"error":"ClusterBlockException[blocked by: [SERVICE\_UNAVAILABLE/2/no
master];]","status":503}

InfluxDB failover/recovery
--------------------------

1 InfluxDB node is down
~~~~~~~~~~~~~~~~~~~~~~~

no downtime

⅔ nodes are down:
~~~~~~~~~~~~~~~~~

One node is in a bad shape (missing data during and after the outage!)

This is not supported

Apache2 overloaded
------------------

.. note::
   The issue described in this section has been resolved in 0.10 version. You
   can read more here
   https://blueprints.launchpad.net/lma-toolchain/+spec/scalable-nagios-api

All nodes push AFD status to Nagios through the CGI script. This
represent 110 request/s

The server cannot handle the load:

100% CPU (12), load average 190, 125 process fork/s

The CGI script is definitively not scalable.

|image21|

When increasing the AFD interval from 10 to 20 seconds on all nodes and
purging the heka output queue buffer, the load is maintainable by node
(90 forks / second):

|image22|

|image23|

Outcomes
========
InfluxDB
--------
InfluxDB worked correctly only with SSD drives. With SATA drives, it was unable
to cope with the data generated by 200 nodes.

Supported scale-up operations: 1 node -> 3 nodes.

Failover mode: a cluster of 3 nodes supports the loss of 1 node.

Deployment size <= 200 nodes
4 cpu
4 GB RAM
SSD drive
100 GB is required for retention of 30 days

Elasticsearch
-------------
Elasticsearch can handle the load with a dedicated SATA disk, using SSD drives
is obviously a better choice but not mandatory.

Supported scale-up operations:  1 node -> 3 nodes

Failover mode: a cluster of 3 nodes survives after the loss of 1 node. It can
also support the loss of 2 nodes with downtime (when using the default
configuration of number_of_replicas).

.. note::
  When OpenStack services are configured with DEBUG log level and
  relatively high load on the cluster (several API calls for some time) could lead
  to fill up the Heka buffers.

Sizing guide
------------

These guidelines apply for an environment configured to log at the INFO level.
They take info account a high rate of API calls. Using the DEBUG log level
implies much more resource consumption in terms of disk space (~ x5) and
CPU/Memory (~ x2).

Deployment size <= 200 nodes
4 CPU
8 GB RAM
SSD or SATA drive
500 GB is required for retention of 30 days

Apache2/Nagios3

.. note::
   The following issue has been resolved in 0.10 version. Therefore you don't
   need to apply the workaround described bellow.

The default configuration doesn’t allow to handle the load of 200 nodes: the
CGI script introduces a bottleneck. The recommendation for 0.9.0 is not to
deploy the lma_infrastructure_alerting plugin for an environment with more than
50 nodes. With 200 nodes, it required at least 7 cores to handle the incoming
requests.

In the current state, the recommendation to be able to handle 200 nodes is to
perform this operation after the initial deployment:

 - increase all AFD filters interval from 10s to 20s

 - decrease all Nagios outputs buffering size to 500KB, to limit the flooding at
   startup time

 - stop lma_collector on all nodes

 - remove the heka queue buffer (rm -rf /var/log/lma_collector/nagios_output)

 - restart lma_collector on all nodes

Issues which have been found during the tests
=============================================

.. table:: Issues which have been found during the tests

  +---------------------------------------------------------------+------------+
  |Issue description                                              | Link       |
  +===============================================================+============+
  || Kibana dashboards unavailable after an ElasticSearch scale up| `1552258`_ |
  || from 1 to 3 nodes                                            |            |
  +---------------------------------------------------------------+------------+
  || Reduce the monitoring scope of Rabbitmq queues               | `1549721`_ |
  +---------------------------------------------------------------+------------+
  || Nova collectd plugin timeout with a lot of instances         | `1554502`_ |
  +---------------------------------------------------------------+------------+
  || Apache doesn't handle the load to process passive checks with| `1552772`_ |
  || 200 nodes                                                    |            |
  +---------------------------------------------------------------+------------+
  || InfluxDB crash while scaling up from 1 to 2 nodes            | `1552191`_ |
  +---------------------------------------------------------------+------------+

.. references:

.. _LMA: http://fuel-plugin-lma-collector.readthedocs.io/en/latest/intro.html
.. _1549721: https://bugs.launchpad.net/lma-toolchain/+bug/1549721
.. _1552258: https://bugs.launchpad.net/lma-toolchain/+bug/1552258
.. _1554502: https://bugs.launchpad.net/lma-toolchain/+bug/1554502
.. _1552772: https://bugs.launchpad.net/lma-toolchain/+bug/1552772
.. _1552191: https://bugs.launchpad.net/lma-toolchain/+bug/1552191

.. |image0| image:: media/image25.png
            :scale: 50
.. |image1| image:: media/image16.png
            :scale: 40
.. |image2| image:: media/image39.png
            :scale: 40
.. |image3| image:: media/image30.png
            :scale: 40
.. |image4| image:: media/image10.png
            :scale: 40
.. |image5| image:: media/image41.png
            :scale: 40
.. |image6| image:: media/image13.png
            :scale: 40
.. |image7| image:: media/image20.png
            :scale: 40
.. |image8| image:: media/image46.png
            :scale: 40
.. |image9| image:: media/image45.png
            :scale: 40
.. |image10| image:: media/image38.png
            :scale: 40
.. |image11| image:: media/image21.png
            :scale: 40
.. |image12| image:: media/image19.png
            :scale: 40
.. |image13| image:: media/image47.png
            :scale: 40
.. |image14| image:: media/image40.png
            :scale: 40
.. |image15| image:: media/image27.png
            :scale: 40
.. |image16| image:: media/image42.png
            :scale: 40
.. |image17| image:: media/image44.png
            :scale: 40
.. |image18| image:: media/image14.png
            :scale: 40
.. |image19| image:: media/image37.png
            :scale: 40
.. |image20| image:: media/image02.png
            :scale: 50
.. |image21| image:: media/image43.png
            :scale: 40
.. |image22| image:: media/image23.png
            :scale: 40
.. |image23| image:: media/image17.png
            :scale: 40