Monitoring EOS with tcollector and OpenTSDB

EOS is a Linux distribution (based on Fedora), which means, among other things, that it can be monitored like any Linux server running Fedora.  In this post we show how to package a popular open-source monitoring framework, tcollector, as an EOS extension.

A bit of history

OpenTSDB is a distributed time series database used for infrastructure monitoring in many medium to large scale environments.  It uses a push model: OpenTSDB is not responsible for pulling data from a fixed list of targets.  Instead, the targets themselves are responsible for pushing their monitoring data to OpenTSDB, be they bare-metal machines, VMs, containers, cron jobs, or anything else.  This is one of the key design choices that make OpenTSDB easy to scale and operate: adding monitoring capacity is as simple as spinning up more instances of OpenTSDB, and in case of failure, the targets are responsible for finding an OpenTSDB instance they can connect to.
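To make the push model concrete, here is a minimal sketch of a target pushing a single data point over OpenTSDB's telnet-style protocol.  The host name tsd.example.com, the sample metric and tag, and the helper names are placeholders of ours for illustration; port 4242 is the conventional TSD port.

```python
import socket
import time


def format_put(metric, value, tags, timestamp=None):
    """Build one line of OpenTSDB's telnet-style "put" protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    # Tags are sorted only to keep the output deterministic.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}\n"


def push(line, host="tsd.example.com", port=4242):
    """Open a TCP connection to a TSD and send one data point.

    The host is a placeholder; nothing in this sketch calls push().
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Only prints the line that would be sent; no network traffic.
    print(format_put("proc.loadavg.1min", 0.42, {"host": "switch1"}), end="")
```

The point is that the target decides when and where to send; the TSD simply accepts whatever arrives.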

One of the most common ways of pushing monitoring data to OpenTSDB is to use tcollector, a utility written in Python that usually runs on all servers and VMs.  tcollector comes with dozens of built-in collectors, which gather everything from hundreds of Linux metrics to statistics from MySQL or Postgres, Elasticsearch, Hadoop and HBase, HAProxy or Varnish, ZooKeeper, etc.  Creating new collectors is easy too: they are usually simple shell or Python scripts, but they can be written in any programming language.
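To give an idea of how small a collector can be, here is a sketch of one that reports the 1-minute load average.  A collector simply prints data points to stdout, one per line, as "metric timestamp value tag=value ...", and tcollector takes care of shipping them.  The metric name example.loadavg.1min and the helper names are made up for this illustration.

```python
import os
import time


def datapoint(metric, value, timestamp=None, **tags):
    """Format one data point the way a collector prints it to stdout."""
    if timestamp is None:
        timestamp = int(time.time())
    parts = [metric, str(timestamp), str(value)]
    # Tags are optional; sorting just keeps the output stable.
    parts += [f"{k}={v}" for k, v in sorted(tags.items())]
    return " ".join(parts)


def main():
    # os.getloadavg() is available on Unix-like systems only.
    load1, _, _ = os.getloadavg()
    print(datapoint("example.loadavg.1min", load1), flush=True)


if __name__ == "__main__":
    main()
```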

In March 2013, during one of the biannual hackathon events hosted at Arista, called “hack-a-switch”, I decided to integrate tcollector on EOS.  This involved writing a custom CLI plugin and a tiny bit of C++ code.  The extension then found an avid customer base among Arista’s POC team, which started using it regularly to track real-time CPU and memory usage during POCs, especially POCs with demanding cloud customers.

Fast forward to the end of that year: in November 2013 we started building the EOS SDK, in partnership with one of our biggest cloud customers.  By the following month, we had Python bindings available thanks to swig.  In September 2014, we rewrote the extension to use the EOS SDK, and the rewrite was subsequently open-sourced on GitHub.  So here are the instructions to build and deploy your own tcollector extension, so you can monitor EOS just like any server, while also getting visibility into the data plane.

OpenTSDB screenshot

Building the tcollector extension

You need an environment in which you can build RPMs.  If you don’t have one handy, or are running Mac OS X, you can use Docker Machine to start a Fedora container in which you can build the extension.

docker run -it fedora bash
dnf install -y make git rpm-build zip
git clone https://github.com/OpenTSDB/tcollector.git
cd tcollector/rpm
make

This will create a file named tcollector-1.2.2-1.swix (or whatever the current version is when you build the extension yourself).  A SWIX is really just a ZIP file with a manifest.txt.  It’s a small file format that makes it convenient to bundle multiple RPMs into one deployment unit for EOS.
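Since a SWIX is a ZIP file, you can peek inside one with any ZIP tool before copying it to a switch.  Here is a small sketch using Python's standard zipfile module; describe_swix is a helper name of ours, and we make no assumption about what the manifest contains, we only print it.

```python
import sys
import zipfile


def describe_swix(path_or_file):
    """Return (member names, manifest text or None) for a SWIX archive."""
    with zipfile.ZipFile(path_or_file) as swix:
        names = swix.namelist()
        manifest = None
        if "manifest.txt" in names:
            manifest = swix.read("manifest.txt").decode("utf-8", errors="replace")
    return names, manifest


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python describe_swix.py tcollector-1.2.2-1.swix
    members, manifest_text = describe_swix(sys.argv[1])
    for name in members:
        print(name)
    if manifest_text is not None:
        print("--- manifest.txt ---")
        print(manifest_text, end="")
```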

Deploying the tcollector extension

First check that EOS SDK is available on your switch.  EOS SDK is bundled with the EOS image starting from EOS 4.17.0F, and is available as an extension for earlier releases (check our download site).  The easiest way to check whether you have the SDK is:

switch#show version detail | grep EosSdk
EosSdk               1.7.1           2877426.4154F

Here we have v1.7.1 – you need at least v1.5.1, but using v1.7.0 or newer is recommended.
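If you have more than a handful of switches, you may want to script this check.  Here is a minimal sketch that assumes you have already captured the output of show version detail as text (how you fetch it is up to you); the function names are ours and are not part of EOS or the SDK.

```python
MINIMUM = (1, 5, 1)      # oldest EosSdk version that works
RECOMMENDED = (1, 7, 0)  # recommended minimum


def parse_version(text):
    """Turn a dotted version such as '1.7.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def eossdk_version(show_output):
    """Find the EosSdk line in 'show version detail' output.

    Returns the version as a tuple, or None if the SDK is not listed.
    """
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "EosSdk":
            return parse_version(fields[1])
    return None


if __name__ == "__main__":
    sample = "EosSdk               1.7.1           2877426.4154F"
    version = eossdk_version(sample)
    if version is None:
        print("EosSdk not found")
    elif version < MINIMUM:
        print("EosSdk too old")
    elif version < RECOMMENDED:
        print("EosSdk works, but upgrading is recommended")
    else:
        print("EosSdk OK")
```

Comparing tuples of integers rather than strings avoids the classic trap where "1.10.0" sorts before "1.5.1".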

Copy your SWIX to the switch (either by scp’ing to the switch from your Docker container, or by fetching it from the switch via the copy command):

switch#copy file:/tmp/tcollector-1.2.2-1.swix extension:
switch#extension tcollector-1.2.2-1.swix

Check that the extension is properly installed:

switch#show extensions
Name                                       Version/Release           Status extension
------------------------------------------ ------------------------- ------ ----
tcollector-1.2.2-1.swix                    1.2.2/1                   A, I      3

A: available | NA: not available | I: installed | NI: not installed | F: forced

Here our status is “A, I”, meaning “available, installed”, which is what you want to see.  In order for the extension to be automatically re-installed on reboot, you then need to do:

switch#copy installed-extensions boot-extensions
Copy completed successfully.

We can now configure and enable the tcollector extension:

switch#configure
switch(config)#daemon tcollector
switch(config-daemon-tcollector)#exec /usr/bin/tcollector
switch(config-daemon-tcollector)#option tsd-host value tsd.host.name.here
switch(config-daemon-tcollector)#no shutdown

The only “option” you need to pass is tsd-host, which is the hostname you want the switch to push its data to.  This hostname can be a DNS name that resolves to multiple A records, to provide a simple but effective load balancing and HA solution.  If you need tcollector to connect from a VRF other than the default one, configure the VRF name using the vrf option:

switch(config-daemon-tcollector)#option vrf value management
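To illustrate why a tsd-host name with multiple A records gives you load balancing and HA almost for free, here is a rough sketch of the idea: resolve the name, shuffle the addresses, and connect to the first one that answers.  This is not tcollector's actual code; the function names and the injectable resolver and connect parameters are ours, added so the logic can be exercised without a network.

```python
import random
import socket


def candidate_endpoints(host, port=4242, resolver=socket.getaddrinfo):
    """Resolve host to a de-duplicated list of (IPv4 address, port)."""
    infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    addresses = sorted({info[4][0] for info in infos})
    return [(address, port) for address in addresses]


def connect_any(host, port=4242, resolver=socket.getaddrinfo,
                connect=socket.create_connection, rng=random):
    """Connect to any reachable TSD behind a multi-record DNS name."""
    endpoints = candidate_endpoints(host, port, resolver)
    rng.shuffle(endpoints)  # spreads clients across TSD instances
    last_error = None
    for endpoint in endpoints:
        try:
            return connect(endpoint, timeout=5)
        except OSError as exc:
            last_error = exc  # this TSD is down, try the next one
    raise ConnectionError(f"no TSD reachable for {host}") from last_error
```

The shuffle spreads switches across TSD instances, and the loop covers the failure case: a dead instance costs one failed connection attempt before the next address is tried.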

You can then take a look at the logs of the tcollector agent to make sure everything is working fine:

switch#show agent tcollector logs
===> /var/log/agents/tcollector-15709 Sun Feb  7 22:50:45 2016 <===
===== Output from /usr/bin/tcollector [] (PID=15709) started Feb  7 22:40:43 ===
[…]
2016-02-07 22:40:45.613033 15709 tcollector           5 INFO: Selected connection: tsd:4242
2016-02-07 22:40:59.523139 15709 tcollector           5 INFO: removing smart-stats.py from the list of collectors (by request)
2016-02-07 22:50:45.145047 15709 tcollector           5 INFO: Heartbeat (9 collectors running)

(This log file is at /var/log/agents/tcollector-<pid> in case you want to tail it.)

At this point you’re all set!  Data points are flowing into OpenTSDB and you can start graphing the status of the control plane in real time, just like any other server or VM.

Other options that can be configured from the CLI:

  • option tsd-port value <port-number>
    This one is fairly self-explanatory.
  • option transport value <http|https>
    Uses HTTP(S) API calls instead of the simple telnet-like protocol.  Using HTTPS is recommended if you are pushing to a backend that is outside your datacenter, such as a remote hosted backend like RunAbove’s IoT PaaS, which is OpenTSDB-compatible.
  • option username value <user>
    option password value <pass>
    Username-password pair for HTTP Basic Auth.  These only make sense when the HTTP(S) transport is configured.
  • option trace value <debug|info|warn|error>
    Override the logging level, useful for troubleshooting.
  • Want to add more options to expose some of the many knobs tcollector has to the CLI?  Please send a pull request on GitHub!
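For the curious, here is roughly what the HTTP(S) transport with Basic Auth amounts to on the wire: a POST of JSON data points to OpenTSDB's /api/put endpoint, with an Authorization header when a username and password are set.  This is a sketch under our own assumptions rather than the code tcollector runs; the function names, host name, and credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_put_request(host, port, points, transport="https",
                      username=None, password=None):
    """Return (url, headers, body) for a POST to OpenTSDB's /api/put."""
    url = f"{transport}://{host}:{port}/api/put"
    body = json.dumps(points).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if username is not None and password is not None:
        # HTTP Basic Auth: base64 of "user:password".
        raw = f"{username}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    return url, headers, body


def send(url, headers, body):
    """Send the request; not called here since the host is a placeholder."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    points = [{"metric": "proc.loadavg.1min", "timestamp": 1454885445,
               "value": 0.42, "tags": {"host": "switch1"}}]
    url, headers, body = build_put_request("tsd.example.com", 443, points,
                                           username="user", password="pass")
    print(url)
    print(body.decode("utf-8"))
```

Note that Basic Auth is only an encoding, not encryption, which is one more reason to prefer https when credentials are configured.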

Enjoy your real-time network monitoring and kiss goodbye to those SNMP scripts!