The Art of Monitoring

James Turnbull

Reviewers: Jatri Dave, Tanya Jha

Chapter 1: An Introduction to Monitoring

What is monitoring?
From a technological perspective, monitoring comprises the tools and processes by which you measure and manage your IT systems. Monitoring provides the translation between business value and the metrics generated by your systems and applications. These metrics can be translated into measures of user experience, which in turn provide feedback to the business to help ensure it delivers what customers require. Monitoring also provides feedback on services that are insufficient or not working.
A monitoring system has two customers: (1) the business and (2) Information Technology.
The stages of a three-level maturity model reflecting how monitoring evolves are:
(i) Manual: monitoring is largely manual, user-initiated, or not done at all.
(ii) Reactive: monitoring is mostly automatic, with some remnants of manual or unmonitored components.
(iii) Proactive: monitoring is automatic and generated by configuration management. Examples of monitoring tools are Nagios, Sensu, and Graphite.

Chapter 2: Monitoring, Metrics, and Measurement

This chapter proposes an architecture for a monitoring framework, discusses the faults of traditional monitoring frameworks, and compares the new architecture with them. The new architecture is based on "whitebox", push-based monitoring rather than "blackbox", pull-based monitoring. In blackbox, pull-based monitoring, the components being monitored are queried from outside; in whitebox, push-based monitoring, the components being monitored send their data to a central collector. This architecture is centered around collecting events, logs, and metrics. Metrics are properties of hardware and software, usually collected as a value that occurred at a timestamp. These metrics can be transformed mathematically using sum, count, average, percentiles, and so on. If we have multiple sources, we may want to show an aggregated result of the metrics across all the sources. A good visualization of the data provides powerful analytics; it should clearly show the data without distorting it.
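
To make the transformation idea concrete, here is a minimal Python sketch (the sample data is invented for illustration) of the aggregations mentioned above: count, sum, average, and a nearest-rank percentile over timestamped metric values.

    from statistics import mean

    # (timestamp, value) samples as they might arrive from one source
    samples = [(1469923200, 12.0), (1469923260, 15.0), (1469923320, 9.0)]
    values = [v for _, v in samples]

    def percentile(data, pct):
        # Nearest-rank percentile: the value at or below which pct% of samples fall
        ordered = sorted(data)
        rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
        return ordered[rank]

    print("count:  ", len(values))             # 3
    print("sum:    ", sum(values))             # 36.0
    print("average:", mean(values))            # 12.0
    print("p50:    ", percentile(values, 50))  # 12.0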

Chapter 3: Managing Events and Metrics with Riemann

Key design features for the routing engine are as follows:

  • Receive events and metrics, scaling as the environment grows.
  • Maintain sufficient state to do event matching and provide context for notifications.
  • Munge events, including extracting metrics from them.
  • Categorize and route data to be stored, graphed, alerted on, or sent to any other potential destinations.

  • "If only I had the theorems! Then I should nd the proofs easily enough." -Bernard Riemann

Riemann is a monitoring tool that aggregates events from hosts and applications and can feed them into a stream processing language to be manipulated, summarized, or actioned. It can also track the state of incoming events and provide notifications.
The chapter briefly explains how to install Riemann on three different hosts. It then covers, in detail, configuring Riemann, connecting Riemann servers, alerting on the upstream Riemann servers, testing, validating, performance, scaling, and making Riemann highly available.
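
As one concrete illustration of sending an event to Riemann, here is a small Python sketch using the third-party bernhard client (pip install bernhard); it assumes a Riemann server listening on the default TCP port 5555 on localhost, and the host and service names are invented.

    import bernhard  # third-party Riemann client library

    client = bernhard.Client(host='127.0.0.1', port=5555)

    # An event carries a host, service, metric, state, and a TTL after
    # which Riemann considers the event expired.
    client.send({
        'host': 'web1',
        'service': 'api latency',
        'metric': 0.254,
        'state': 'ok',
        'ttl': 60,
    })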

Chapter 4: Storing and Graphing Metrics, including Graphite and Grafana

The last chapter dealt with Riemann, which provides a central destination and routing engine for events and metrics. This chapter deals with the storage of time-series metrics. "Graphite is an engine that stores time-series data and then can render graphs from that data using an API. Grafana is an open-source metrics dashboard that supports Graphite, InfluxDB, and OpenTSDB." Graphite is made up of three components: Carbon, Whisper, and Graphite Web. Carbon is a collection of event-driven daemons that listen on network ports and receive and write time-series data. Whisper is a flat-file database used for storing time-series data. Graphite Web is a Django-based web UI that can be used to compose graphs from the collected metric data, but it can be hard to install and configure, so Grafana is used instead. The chapter discusses how to install, configure, and run Graphite and Grafana. The events gathered by Riemann are sent in the form of metrics to Graphite. NTP is configured so that event timestamps are consistent between Riemann and Graphite.
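
Since consistent timestamps matter here (hence the NTP configuration), it may help to see Carbon's plaintext line protocol: one "metric.path value timestamp" line per metric over TCP. A minimal Python sketch, assuming a Carbon line receiver on the default port 2003 and an invented metric path:

    import socket
    import time

    CARBON = ("127.0.0.1", 2003)  # Carbon's default plaintext receiver

    def send_metric(path, value, timestamp=None):
        # One metric per line: "metric.path value unix_timestamp\n"
        line = f"{path} {value} {int(timestamp or time.time())}\n"
        with socket.create_connection(CARBON) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("hosts.web1.load.shortterm", 0.42)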

Chapter 5: Host-based Monitoring with collectd

In this chapter, host-based data is collected and sent to Riemann by collectd, a daemon that acts as a monitoring collection agent. collectd runs locally on the hosts, doing the local monitoring and collecting data from a variety of components; plugins are enabled to collect this data. The resulting events are sent to Riemann, which forwards them as metrics to Graphite, where they can be graphed in Grafana.
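
collectd plugins are normally configured in collectd.conf; purely as a hedged illustration of what a collection plugin does, here is a minimal custom read callback for collectd's own python plugin (the collectd module exists only inside the running daemon and is not pip-installable, and the plugin and metric names are invented):

    import collectd  # provided by the collectd daemon at runtime

    def read(data=None):
        # Dispatch a single gauge value into collectd's normal pipeline,
        # from which write plugins can forward it to Riemann or Graphite.
        vl = collectd.Values(type='gauge', plugin='example',
                             type_instance='queue_depth')
        vl.dispatch(values=[42])

    collectd.register_read(read)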

Chapter 6: Using collectd Events in Riemann

The following things about collectd events in Riemann are covered in this chapter:

  • The processes plugin is introduced to check for running processes. It generates a series of metrics for the processes, such as the ps_count metric, which counts the number of running processes.
    Three streams are utilized to measure availability:
    (i) the tagged stream wrapper,
    (ii) a stream that matches the threshold notification, and
    (iii) a stream that matches expired events.
  • Other actions that can be taken using collectd notifications include redeploying code, attempting to restart a stopped service, triggering a configuration management run, and many more.
  • How to replicate some of the host monitoring checks is explained in detail.
  • We can apply two mechanisms, data granularity and check functions, to make better use of metric data (the check-function idea is sketched after this list).
  • Constructing a new dashboard in Grafana by creating a host dashboard, a host graph, and a memory graph is demonstrated here.
  • Commercial tools such as New Relic, Circonus, and Datadog, and open-source tools such as Ganglia, Munin, and StatsD, are alternatives to collectd.
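
In the book these checks are written as Riemann streams in Clojure; purely to illustrate the check-function idea referenced above, here is a Python sketch that maps a metric such as ps_count to a state by comparing it against thresholds (the threshold values are invented):

    def check_threshold(metric, warning, critical):
        # Map a collectd-style gauge (e.g. ps_count) to a notification state.
        if metric >= critical:
            return "critical"
        if metric >= warning:
            return "warning"
        return "ok"

    # e.g. flag a runaway number of worker processes
    print(check_threshold(metric=12, warning=8, critical=10))  # -> "critical"
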
Chapter 7: Containers - Another Kind of Host

This chapter deals with monitoring containers. Containers are lightweight and short-lived, which poses a problem for monitoring them. collectd does not ship with a default plugin for Docker, so an open-source plugin is used. In Docker, a daemon running on the host creates and manages containers. We can interact with the daemon using the docker CLI or through an API that lets users query Docker for statistics. The docker collectd plugin is used to collect statistics on the containers; once collected, the statistics are converted into metrics and sent to Riemann.
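
For a sense of the statistics the daemon exposes, here is a small sketch using the Docker SDK for Python (pip install docker), which talks to the same daemon API that the docker CLI and the collectd plugin use:

    import docker

    client = docker.from_env()  # honours DOCKER_HOST; defaults to the local socket
    for container in client.containers.list():
        stats = container.stats(stream=False)  # one-shot stats snapshot
        mem_usage = stats["memory_stats"].get("usage", 0)
        print(container.name, mem_usage)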

Chapter 8: Logs and Logging, covering structured logging and the ELK stack

In this chapter, we're introduced to the next layer of the framework: logging. While hosts, services, and applications generate crucial metrics and events, they also often generate logs that can tell us useful things about their state and status.
Additionally, logs are incredibly useful for diagnostic purposes when investigating an issue or incident. The chapter describes how these logs are captured, sent to a central store, and used to detect issues and provide diagnostic assistance. Generating metrics and graphs from these logs is also explained briefly.
A log management platform is built to complement the other components of the monitoring framework. Some of the logs are collected and sent to Riemann and Graphite, to be recorded as metrics. The chapter also explains how to integrate Docker containers into the logging setup.
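
As a minimal sketch of structured logging (the book's platform is the ELK stack; the field names here are illustrative), an application can emit each log event as a single JSON object that a shipper can forward to a central store:

    import json
    import logging
    import time

    class JSONFormatter(logging.Formatter):
        def format(self, record):
            # One JSON object per event: easy to ship, parse, and index.
            return json.dumps({
                "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("payment accepted")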

Chapter 9: Building Monitored Applications

This chapter deals with monitoring applications. Two types of metrics are monitored in application monitoring. One is the application metric, which measures the state and performance of the application. The other is the business metric, which sits a layer above the application metric: for example, if payment transaction latency is an application metric, then the value of that transaction can be the business metric. These metrics can be calculated using StatsD, a daemon that listens on TCP or UDP ports, collects messages, and parses metrics out of them. We can also add logging so that our application emits log events, and add notifications to our application when a deployment takes place.
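
StatsD's wire format is plain text, usually over UDP: one "name:value|type" message per metric, where "c" is a counter, "ms" a timer, and "g" a gauge. A minimal Python sketch, assuming a StatsD daemon on the default port 8125 (the metric names are invented):

    import socket

    STATSD = ("127.0.0.1", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    sock.sendto(b"payments.completed:1|c", STATSD)   # business metric: a count
    sock.sendto(b"payments.latency:254|ms", STATSD)  # application metric: a timing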

Chapter 10: Alerting and Alert Management

The main purpose of this chapter is to make notifications useful: to send them at the right time and to put useful information in them. Sending too many notifications makes recipients numb to them, tempting them to turn notifications off, and can lead to crucial notifications being dismissed as unimportant updates. Not every notification that passes through is useful to its recipients, so this chapter shows how to add appropriate context to notifications so that they become instantly useful. We also learn how to maintain these notifications over time, and how tracking which notifications recipients actually read, and which they ignore, helps identify patterns and trends.

Chapter 11: Monitoring Tornado: A Capstone

The chapter combines all the pieces from the previous chapters to monitor an application stack. The application is Tornado, an order-processing system that buys and sells items. The chapter discusses monitoring the web tier of the application. collectd plugins are used to monitor the different web components, the log output from the services is used to collect metrics and diagnostic information, and small checks are added in Riemann to notify us if any issues occur.

Chapter 12: Monitoring Tornado: Application Tier

This chapter deals with monitoring the application tier of the Tornado application, which is made up of two API servers running on top of the Java Virtual Machine. The JVM is monitored using the GenericJMX plugin, executed by the java collectd plugin. To collect the logs, the Timbre logging framework is used. The Tornado API is monitored by querying the API to make sure it is returning results, and by instrumenting the application so that it generates events as it operates.
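
A simple version of the "query the API to make sure it is returning results" check might look like the following Python sketch; the endpoint URL is hypothetical, standing in for the real Tornado API address:

    import urllib.request

    def api_healthy(url="http://tornado-api1:8080/api", timeout=5):
        # The check passes if the API answers with HTTP 200 within the timeout.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    print("ok" if api_healthy() else "failing")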

Chapter 13: Monitoring Tornado: Data Tier

In this chapter the Tornado database tier is monitored. The database tier consists of the tornado-db server running MySQL and a tornado-redis server running Redis. MySQL is monitored through its health and metrics, as well as through the results of application-specific queries, using a collectd plugin. To monitor the Redis server, the collectd redis plugin is used to connect to the Redis instances via the credis library, which returns usage statistics.
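
For comparison, the same kind of usage statistics the plugin gathers can be pulled directly with Redis's INFO command; here is a sketch using the redis-py client (pip install redis), with an illustrative host name:

    import redis

    r = redis.Redis(host="tornado-redis", port=6379)
    info = r.info()  # Redis's INFO command, parsed into a dict

    print("used_memory:", info["used_memory"])
    print("connected_clients:", info["connected_clients"])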