When I researched monitoring solutions for the physical nodes of our OpenStack cloud environment, Zabbix was one of my favorites. Someone also suggested using Ceilometer for monitoring, but that is definitely the wrong idea. I have written some notes below to explain the design of both:

Zabbix

I. Zabbix features in a cluster:

Zabbix works well, and when it goes wrong it is mostly because of its dependencies, such as MySQL. If an error stops MySQL from working properly, Zabbix fails too. One current problem with MySQL is database replication: the current Galera cluster, with multiple active MySQL servers running, has a race condition problem when writing to databases.

The most common granularity (polling interval) is 60 seconds. However, it can be lowered to 1 second to meet the requirements for detecting the state of hosts.

The Zabbix cluster works in active/passive mode, which solves the failover problem quite well: when one Zabbix server goes down, another one is brought up. Zabbix still depends on the failover capability of the MySQL Galera cluster, but the failover capability of the Zabbix cluster itself is good.

Agent density is not a challenge for Zabbix: a single Zabbix server can serve many agents running on the nodes. The Zabbix agent uses a push mechanism (active checks) to send data to the Zabbix server, and the communication mechanism between server and agent keeps that flow of data going.

We can check the uptime of a server to detect reboots. This can be done with the Zabbix agent.
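As a rough illustration of reboot detection from uptime, here is a minimal Python sketch. It assumes we already have a series of uptime readings in seconds (for example, values collected through the agent's `system.uptime` item). The function names and the 600-second window are hypothetical and chosen only for this example.

```python
def detect_reboots(uptime_samples):
    """Return the indices of samples at which a reboot is inferred.

    uptime_samples: uptime readings in seconds, in collection order.
    Uptime only grows while a host keeps running, so a reading that is
    lower than the previous one means the host restarted in between.
    """
    reboots = []
    for i in range(1, len(uptime_samples)):
        if uptime_samples[i] < uptime_samples[i - 1]:
            reboots.append(i)
    return reboots


def recently_rebooted(uptime_seconds, window_seconds=600):
    """True if the host booted less than window_seconds ago.

    window_seconds is an assumed threshold, not a Zabbix default.
    """
    return uptime_seconds < window_seconds
```

The second function mirrors the usual threshold-style check: a single reading below the window is enough to flag a recent reboot, without keeping any history.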

While the compute node is up, we can configure the Zabbix agent to check the state of the compute host it is running on.

When the compute node is down for some reason, or a network problem breaks the connection between the Zabbix server and the agent, we have to find another way, since the Zabbix agent cannot be reached. In that case, we can query the Nova database (if Nova is not affected by the same situation) to get the state of the node. Alternatively, we can build a new solution by running external scripts (external checks) on the Zabbix server to check the state of the compute host (e.g. triggering an SSH check): https://www.zabbix.com/documentation/2.0/manual/config/items/itemtypes/external
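An external check of this kind could look like the sketch below. It is a simplification: instead of a full SSH login it only tests whether the SSH port (22) on the compute host accepts a TCP connection, and prints 1 or 0 so that the Zabbix server can store the value. The script name `check_ssh_port.py` and the function name are hypothetical. We assume the script is placed in the directory configured by `ExternalScripts` on the Zabbix server and referenced with an item key of the form `check_ssh_port.py[<host address>]`.

```python
#!/usr/bin/env python3
import socket
import sys


def host_reachable(host, port=22, timeout=3.0):
    """Return 1 if a TCP connection to host:port succeeds within timeout, else 0.

    This only proves that the port accepts connections; it does not log in.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return 0


# Zabbix passes the item key parameters as command-line arguments
# and reads the value from standard output.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(host_reachable(sys.argv[1]))
```

Because the check runs on the Zabbix server, it still returns a value when the agent on the compute node is unreachable, which is exactly the case the agent-based checks cannot cover.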

II. A monitoring solution that collaborates with other services:

  1. Zabbix server should have thresholds to rely on, and it triggers the appropriate actions when the values returned by the Zabbix agent go beyond those thresholds.
  2. Zabbix server should be given templates that make it collect metrics from the system. The templates define the structure of the metrics that Zabbix server collects and stores in the database.
  3. When Zabbix server detects that a threshold has been crossed, it gathers information from the system based on the template defined for that threshold and calls the ZabbixEndpoint service (in practice a web server) through a REST API. The ZabbixEndpoint service then takes the information received from Zabbix server and triggers the follow-up actions (e.g. sending alarms).
  4. The ZabbixEndpoint service can run in either Active/Active or Active/Passive mode on three CiCs, under the control of Pacemaker/Corosync.
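To make the role of the ZabbixEndpoint service more concrete, here is a minimal sketch of such a web server using only the Python standard library. Everything specific in it is an assumption made for illustration: the JSON fields (`host`, `status`), the action names, the default port 8080, and the function and class names. The notes above only say that the service receives information over a REST API and triggers follow-up actions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decide_action(alert):
    """Map an alert payload to a follow-up action name (names are hypothetical)."""
    if alert.get("status") == "PROBLEM":
        return "send_alarm"
    return "ignore"


class ZabbixEndpointHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted on behalf of the Zabbix server.
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(alert, dict):
            self.send_error(400, "expected a JSON object")
            return
        # A real service would send the alarm here; the sketch only reports the decision.
        body = json.dumps(
            {"host": alert.get("host"), "action": decide_action(alert)}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=8080):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), ZabbixEndpointHandler)
```

Keeping the decision logic in a plain function, separate from the HTTP handler, means the same code can run unchanged on every CiC regardless of whether Pacemaker/Corosync operates the service in Active/Active or Active/Passive mode.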

 

Ceilometer

Ceilometer strictly depends on MongoDB and RabbitMQ, but neither of them seems reliable. MongoDB is a NoSQL document database, and here it effectively acts as a cache without a backing store (for example, MySQL through SQLAlchemy) behind it, so it becomes flat-out inconsistent. One of the most severe scenarios is cache invalidation. It is also easy for MongoDB to fill up the disk when creating or storing new items under replication. In addition, inconsistency has been reported for RabbitMQ when it is used in ACTIVE/PASSIVE mode.

The ceilometer-agent runs ACTIVE/PASSIVE under the management of Pacemaker, which deals with the failover problem quite well.

The granularity of Ceilometer (the period at which metrics are collected) is not fine enough for our detection requirements, where the availability of a node should be detected as fast as possible.
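Why granularity matters for detection can be shown with a small calculation. The sketch below assumes a simple model: a node is declared down only after a given number of consecutive samples are missing, and the node may fail right after a successful sample. The function name and the model itself are assumptions for illustration, not something taken from Zabbix or Ceilometer.

```python
def worst_case_detection_delay(interval_seconds, missed_samples=1):
    """Approximate upper bound, in seconds, between a node failing and being declared down.

    The node can fail just after a successful sample, so each missing sample
    that the monitor waits for adds one full collection interval.
    """
    if interval_seconds <= 0 or missed_samples < 1:
        raise ValueError("interval_seconds must be > 0 and missed_samples >= 1")
    return interval_seconds * missed_samples
```

With the Zabbix figures mentioned earlier, a 60-second interval gives a worst case of about 60 seconds for a single missed sample, and a 1-second interval brings that down to about 1 second. Any collector whose period is measured in minutes scales the delay by the same factor.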


NOTE:

- Ceilometer was initially created for billing, not for monitoring, so when we use Ceilometer we get metrics about virtual machines, not about compute hosts. Therefore Ceilometer does not seem to be a promising candidate.

13/09/2016

VietStack team
