The Art of Monitoring
Reviewers: Jatri Dave, Tanya Jha
Chapter 1: An Introduction to Monitoring
What is monitoring?
From a technological perspective, monitoring is the tools and
processes by which you measure and manage your IT systems.
Monitoring provides the translation between business value and the
metrics generated by your systems and applications. These metrics can
be translated into a measurable user experience, which in turn gives
the business feedback to help ensure the delivery of customers'
requirements, as well as feedback on services that are insufficient
or not working.
A monitoring system has two customers:
(1) The business (2) Information Technology
The stages of a three-level maturity model reflecting
monitoring evolution are:
(i) Manual: In this stage, monitoring is largely manual, user initiated,
or not done at all.
(ii) Reactive: This type of monitoring is mostly automatic,
with some remnants of manual or unmonitored components.
(iii) Proactive: Here, monitoring is automatic and generated
by configuration management.
Examples of monitoring tools across these stages are Nagios, Sensu,
and Graphite.
Chapter 2: A Monitoring and Measurement Framework
This chapter proposes an architecture for a monitoring framework. It also discusses the faults of traditional monitoring frameworks and compares the new architecture with them. The new architecture is based on "whitebox," push-based monitoring rather than "blackbox," pull-based monitoring. In pull-based ("blackbox") monitoring, the components being monitored are queried from the outside; in push-based ("whitebox") monitoring, the components send their data to a central collector. This kind of architecture is centered around collecting events, logs, and metrics. Metrics are properties of hardware and software, usually collected as a value recorded at a timestamp. These metrics can be transformed mathematically using sums, counts, averages, percentiles, and so on. If we have multiple sources, we may want to show an aggregated result of the metrics across all the sources. A good visualization of the data provides powerful analytics: it should clearly show the data without distorting it.
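The mathematical transformations the chapter mentions can be sketched in a few lines. This is an illustrative example, not code from the book; the latency values and metric names are invented for demonstration:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Request latencies (ms) reported by two hypothetical web servers.
web1 = [12, 15, 11, 90, 14]
web2 = [13, 16, 250, 12, 15]

# Aggregate across all sources before summarising, as the chapter suggests.
combined = web1 + web2
summary = {
    "count": len(combined),
    "sum": sum(combined),
    "average": sum(combined) / len(combined),
    "p99": percentile(combined, 99),
}
print(summary)
```

Note how the average (44.8 ms here) hides the 250 ms outlier that the 99th percentile exposes, which is why monitoring systems favour percentiles for latency.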
Chapter 3: Managing Events and metrics with Riemann
"If only I had the theorems! Then I should find the proofs easily enough." -Bernhard Riemann
Riemann is a monitoring tool that aggregates events from hosts
and applications and can feed them into a stream processing
language to be manipulated, summarized, or actioned. It can also
track the state of incoming events and provide notifications.
A brief explanation of installing Riemann on three different hosts
is given. Moreover, configuring Riemann, connecting Riemann servers,
alerting on the upstream Riemann servers, testing, validating,
performance, scaling, and making Riemann highly available
are also explained in detail.
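Riemann configurations are actually written in Clojure; the sketch below only mimics the stream idea in Python for illustration. Each stream is a function that receives an event and passes it on to child streams, so routing and notification are just function composition. The event fields mirror Riemann's (host, service, state, metric), but the `notify` stand-in is an assumption, not a real Riemann notifier:

```python
def where(predicate, *children):
    """Forward an event to the child streams only if the predicate matches,
    loosely analogous to Riemann's (where ...) stream."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream

notifications = []

def notify(event):
    # Stand-in for paging or emailing; Riemann would call a real notifier here.
    notifications.append(f"{event['host']}/{event['service']} is {event['state']}")

# Route only critical events to the notifier.
critical = where(lambda e: e["state"] == "critical", notify)

critical({"host": "web1", "service": "cpu", "state": "ok", "metric": 0.4})
critical({"host": "web2", "service": "cpu", "state": "critical", "metric": 0.97})
print(notifications)
```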
Chapter 4: Storing and graphing metrics, including Graphite and Grafana
The last chapter dealt with Riemann, which provides a central destination and routing engine for events and metrics. This chapter deals with the storage of time-series metrics. "Graphite is an engine that stores time-series data and then can render graphs from that data using an API. Grafana is an open-source metrics dashboard that supports Graphite, InfluxDB, and OpenTSDB." Graphite is made up of three components: Carbon, Whisper, and Graphite Web. Carbon is a collection of event-driven daemons that listen on network ports; they receive and write time-series data. Whisper is a flat-file database used for storing time-series data. Graphite Web is a Django-based web UI that can be used to compose graphs from the collected metrics data, but it can be hard to install and configure, so Grafana is used instead. The chapter discusses how to install, configure, and run Graphite and Grafana. Events are gathered by Riemann and sent as metrics to Graphite. NTP is configured so that event timestamps are consistent between Riemann and Graphite.
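Carbon's simplest ingestion path is its plaintext protocol: one line per metric, `<path> <value> <timestamp>`, sent to TCP port 2003 by default. The sketch below shows the format; the hostname is a placeholder, not one from the book:

```python
import socket

def carbon_line(path, value, timestamp):
    """Format a metric in Carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    return f"{path} {value} {int(timestamp)}\n"

def send_to_carbon(line, host="graphite.example.com", port=2003):
    # Hostname is a placeholder; 2003 is Carbon's default plaintext port.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

line = carbon_line("riemann.web1.cpu.usage", 0.42, 1700000000)
print(line, end="")
# send_to_carbon(line)  # uncomment with a reachable Carbon daemon
```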
Chapter 5: Host-based monitoring with collectd
In this chapter, host-based data is collected and sent to Riemann. This is done by collectd, a daemon acting as a monitoring collection agent. Collectd runs locally on the hosts, where it monitors and collects data from a variety of components; plugins are enabled to collect data from specific sources. The collected events are sent to Riemann, which in turn forwards them as metrics to Graphite, where they can be graphed in Grafana.
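A minimal configuration for this setup might look like the fragment below. This is a sketch, not the book's exact config: the hostname is a placeholder, and the plugin selection is illustrative. It loads a few collection plugins and the write_riemann plugin, which ships events to a Riemann server:

```
# /etc/collectd/collectd.conf (fragment) -- hostname is a placeholder
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
LoadPlugin write_riemann

<Plugin write_riemann>
  <Node "local_riemann">
    Host "riemann.example.com"
    Port "5555"
    Protocol TCP
  </Node>
  Tag "collectd"
</Plugin>
```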
Chapter 6: Using collectd events in Riemann
This chapter covers working with collectd events in Riemann: identifying and filtering them, and building checks and notifications on top of them.
Chapter 7: Containers - another kind of host
This chapter deals with monitoring containers. Containers are lightweight and short-lived, which poses a problem in monitoring them. Docker does not have a default plugin for collectd, so an open-source plugin is used. In Docker, a daemon running on the host creates and manages containers. We can interact with the daemon using the docker CLI command or using an API that lets users query Docker for stats. The Docker collectd plugin is used to collect statistics on the containers. Once the statistics are collected, they are converted to metrics and sent to Riemann.
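The Docker stats API (`GET /containers/<id>/stats`) returns a JSON payload from which such metrics are derived. A small sketch of turning one field pair into a metric, using the real `memory_stats.usage` and `memory_stats.limit` field names but invented sample values:

```python
import json

def memory_percent(stats):
    """Compute memory usage as a percentage from a Docker stats API payload."""
    mem = stats["memory_stats"]
    return round(mem["usage"] / mem["limit"] * 100, 2)

# Trimmed sample payload; the numbers are invented for illustration.
sample = json.loads("""
{"memory_stats": {"usage": 268435456, "limit": 1073741824}}
""")
print(memory_percent(sample))
```

A collector would compute values like this per container and emit them as Riemann events.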
Chapter 8: Logs and Logging, covering structured logging and the ELK stack
In this chapter, we’re introduced to the next layer of framework:
logging. While the hosts, services, and applications generate crucial
metrics and events, they also often generate logs that can tell
useful things about their state and status.
Additionally, logs are incredibly useful for diagnostic purposes
when investigating an issue or incident. It is also described how
these logs are captured, sent to a central store, and made use of
to detect issues and provide diagnostic assistance. Generating metrics
and graphs from these logs is also briefly explained.
A log management platform is built to complement the other components
of the monitoring framework. Some of the logs are collected and sent
to Riemann and Graphite, to be recorded as metrics. Integrating Docker
containers into the logging setup is also explained.
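Structured logging, as the chapter uses the term, means emitting log events in a machine-parseable shape rather than free text. A sketch of a JSON log formatter, where the field names are illustrative assumptions rather than a fixed schema from the book:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line -- a shape
    an ELK-style pipeline can index without custom parsing rules."""
    def format(self, record):
        return json.dumps({
            "@timestamp": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Format one record directly to show the emitted line.
record = logging.LogRecord(
    "tornado.payments", logging.INFO, __file__, 1,
    "payment processed", None, None,
)
line = JsonFormatter().format(record)
print(line)
```

Attaching this formatter to a `logging.StreamHandler` would make every log line from the application a self-describing JSON document.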
Chapter 9: Building monitored applications
This chapter deals with monitoring applications. Two types of metrics are monitored in application monitoring. One is the application metric, which measures the state and performance of the application. The other is the business metric, which sits at a layer above the application metric: for example, if payment transaction latency is an application metric, then the value of that transaction is a business metric. These metrics can be calculated using StatsD, a daemon that listens on TCP or UDP ports, collects messages, and parses metrics out of them. We can also add logging to the application to record events, and emit a notification from the application when a deployment takes place.
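StatsD's line protocol is simple enough to sketch directly: `<name>:<value>|<type>`, where the type is `c` (counter), `ms` (timer), or `g` (gauge), usually sent over UDP to port 8125. The metric names and hostname below are invented examples, not the book's:

```python
import socket

def statsd_line(name, value, metric_type):
    """Format a metric in StatsD's line protocol: '<name>:<value>|<type>'."""
    return f"{name}:{value}|{metric_type}"

def send_udp(line, host="statsd.example.com", port=8125):
    # Hostname is a placeholder; 8125 is the conventional StatsD port.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# An application metric (latency timer) and a business metric (transaction value gauge):
latency = statsd_line("payments.latency", 120, "ms")
value = statsd_line("payments.value", 49.99, "g")
# send_udp(latency); send_udp(value)  # uncomment with a reachable StatsD daemon
```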
Chapter 10: Alerting and Alert Management
The main purpose of this chapter is to make notifications useful: to send them at the right time and put useful information in them. Sending too many notifications makes recipients numb to them; they will want to turn them off, and crucial notifications may end up dismissed as unimportant updates. Not all of the notifications that go out are useful to their recipients, so this chapter shows how to add appropriate context to notifications so that they become instantly useful. It also covers how to maintain these notifications over time, and how to identify patterns and trends in which notifications recipients read and which they ignore.
Chapter 11: Monitoring Tornado: a capstone
The chapter combines all the pieces from the previous chapters to monitor an application stack. The application is Tornado, an order-processing system that buys and sells items. The chapter discusses monitoring the web tier of the application. Collectd plugins are used to monitor the different web components, the log output from the services is used to collect metrics and diagnostic information, and small checks are added in Riemann to send notifications if any issues occur.
Chapter 12: Monitoring Tornado: Application Tier
This chapter deals with monitoring the application tier of the Tornado application, which is made up of two API servers running on top of the Java Virtual Machine. The JVM is monitored using the GenericJMX plugin, executed by collectd's java plugin. To collect the logs, the Timbre logging framework is used. The Tornado API is monitored by querying the API to make sure that it is returning results, and by instrumenting the application so that it generates events as it operates.
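The "query the API and turn the result into an event" idea can be sketched as a pure function. The service name, field names, and latency threshold below are illustrative assumptions, not values from the book:

```python
def api_check_event(status_code, latency_ms, threshold_ms=500):
    """Turn the result of probing an API endpoint into a Riemann-style event dict."""
    if status_code != 200:
        state = "critical"       # the API is not returning results
    elif latency_ms > threshold_ms:
        state = "warning"        # results, but slower than we would like
    else:
        state = "ok"
    return {"service": "tornado-api", "state": state, "metric": latency_ms}

print(api_check_event(200, 120))
print(api_check_event(500, 45))
```

A scheduler would run this probe periodically and send the resulting event to Riemann, where the streams from earlier chapters route it to notifiers and to Graphite.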
Chapter 13: Monitoring Tornado: Data tier
In this chapter the Tornado database tier is monitored. The database tier consists of the tornado-db server running MySQL and a tornado-redis server running Redis. The MySQL server is monitored using its health and metrics as well as the results of application-specific queries; this is done using a collectd plugin. To monitor the Redis server, the collectd Redis plugin is used to connect to the Redis instances through the credis library; this returns usage statistics.
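The usage statistics the plugin reads come from Redis's INFO command, which returns `key:value` lines grouped under `#` section headers. A sketch of parsing that format; the field names below are real INFO keys, but the values are invented:

```python
def parse_redis_info(info_text):
    """Parse Redis INFO output ('key:value' lines, '#' section headers) into a dict."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        stats[key] = value
    return stats

# Trimmed sample of real INFO field names; the values are invented.
sample = """# Clients
connected_clients:14
# Memory
used_memory:1048576
# Stats
total_commands_processed:53212
"""
stats = parse_redis_info(sample)
print(stats["used_memory"])
```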