Elastic Stack: a renewed introduction

A few years ago, I had an assignment at a former client involving Elasticsearch, Logstash and Kibana to build an operational dashboard. It was fun to do and very instructive; afterwards, I wrote an article about my experiences and spoke about them at various conferences. Recently, another team in my company asked me to assist them in setting up the stack. A good chance to catch up with some old friends and see how they have changed over the years.

Elasticsearch

To start with Elasticsearch: it has moved from the 1.4 release I used before to 5.x today. The 5.x release brought quite a big change: the Ingest Node. Using an Ingest Node, you can pre-process documents before they are actually stored in Elasticsearch. Processing takes place in pipelines, which consist of processors. There are plenty of processors, varying from simple things like adding a field to a document to complex stuff like extracting structured fields from a single text field, extracting key/value pairs or parsing JSON. Processors can be chained to form pipelines that eventually perform complex transformations. Of course, error handling is included too, meaning you can define what should happen when one of the steps fails. This error handling can be defined per processor or at the global level of the pipeline.

After defining a pipeline, you often want to test it. Elastic offers an endpoint to do just that: the Simulate API. First, store the pipeline in Elasticsearch (using PUT _ingest/pipeline/my-pipeline-id, for example from Kibana), then simulate an invocation of it using POST _ingest/pipeline/my-pipeline-id/_simulate. (Alternatively, you can skip the storing step and pass the pipeline definition directly in the body of POST _ingest/pipeline/_simulate.) The response contains the documents you supplied as they look after the pipeline has transformed them; the documents themselves are not stored in Elasticsearch.
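
As a minimal sketch, assuming a trivial pipeline that does nothing but add a field (the pipeline id and the field name are placeholders):

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "example pipeline that only adds a field",
  "processors" : [
    {
      "set" : {
        "field" : "environment",
        "value" : "test"
      }
    }
  ]
}

POST _ingest/pipeline/my-pipeline-id/_simulate
{
  "docs" : [
    { "_source" : { "message" : "a sample document" } }
  ]
}

The response echoes the sample document with the environment field added; nothing is indexed or stored.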

Pipelines

So, what does a pipeline look like? Let's look at a simple example that processes lines from an nginx access log. These lines often look like this: 207.46.13.230 - - [16/Apr/2017:00:12:29 +0000] "GET /robots.txt HTTP/1.1" 301 178 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)".

{
  "description" : "nginx access logs pipeline",
  "processors" : [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client_address} - (%{WORD:username})?(-)? \\[%{HTTPDATE}] \"%{WORD:method} %{URIPATHPARAM:request_uri} %{EMAILLOCALPART}\" %{NUMBER:status:int} %{NUMBER:bytes:int} \"(%{URI:referrer})?(-)?\" %{GREEDYDATA:user_agent}"]
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "grok": {
        "field": "source",
        "patterns": ["%{GREEDYDATA}/(%{DATA:domain}_)?access.log","%{GREEDYDATA}/(%{DATA:domain}_)?error.log"]
      }
    }
  ],
  "on_failure" : [
    {
      "set" : {
        "field" : "ingest-error",
        "value" : "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}

Most of the magic happens in the grok processor. It attempts to match the message field of each submitted document against a pattern. The pattern contains a lot of blocks like %{NUMBER:bytes:int}; the general form is %{SYNTAX:SEMANTIC:TYPE}, where SEMANTIC and TYPE can be omitted. This example says “there is a sequence of characters that matches the NUMBER pattern; store it in the bytes field of the target document, with type int”. If SEMANTIC is omitted, the text that matched the pattern will not be included as a field in the target document.
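
To see what this does to the sample access log line, you can feed it to the Simulate API, assuming the pipeline above was stored under the id nginx-access (I also add a source field here, since the third grok processor expects one; the path is just a placeholder):

POST _ingest/pipeline/nginx-access/_simulate
{
  "docs" : [
    {
      "_source" : {
        "source" : "/var/log/nginx/access.log",
        "message" : "207.46.13.230 - - [16/Apr/2017:00:12:29 +0000] \"GET /robots.txt HTTP/1.1\" 301 178 \"-\" \"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)\""
      }
    }
  ]
}

The response should show a document in which fields like client_address, method, request_uri, status and bytes have been extracted from the original line, while the message field itself has been removed.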

Writing these patterns is often a trial-and-error process. But don't worry: there are online Grok debugging tools that let you test a pattern against sample log lines before putting it into a pipeline.

Kibana

Kibana is now also at a 5.x version, since it is versioned together with the rest of the Elastic Stack. From an end-user perspective it might seem, at first glance, that not much has changed, but I suspect a lot has changed under the hood. At the very least it is now deployed in a different way, since it comes with its own web server. What I quite liked is the Console (under Dev Tools), a query editor that gives you a simple way to interact with Elasticsearch. It is a bit like Postman or similar tools, but simpler, since it is only meant to communicate with Elasticsearch.
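
For instance, a couple of requests you might run from the Console (the index pattern filebeat-* is just a placeholder for whatever indices you have):

GET _cat/indices?v

GET filebeat-*/_search
{
  "query" : {
    "match" : { "status" : 301 }
  }
}

The Console autocompletes both the endpoints and the request bodies, which makes exploring the APIs a lot easier.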

Filebeat

When it comes to getting your logging data into Elastic, I met a new friend: Filebeat. Filebeat follows a different approach than Logstash: Logstash does all processing locally, on the machine where the raw logging data lives, whereas Filebeat is relatively dumb and just ships all logging to Elasticsearch, relying on Elasticsearch to do the processing, cleaning and the like. Whether or not this is a good solution will depend on your context. If, for some reason, your log files contain information that is not allowed to leave the system (privacy-sensitive data, for example), you might not want to ship it to Elasticsearch, even if a pipeline would strip the privacy-related information there. But in other situations Filebeat may well be a viable alternative, for example because it is more lightweight than Logstash.

As said, Filebeat's approach is different: it keeps track of which parts of a log file have been processed so far, and detects changes to that file (because your application generated new logging, for example). These changes are then sent to Elasticsearch. This means the configuration file is relatively simple:

filebeat.prospectors:
- input_type: log
  paths:
  - /var/log/nginx/*.log
  fields:
    application: nginx
  tags: ["access"]

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  pipeline: "general"
  pipelines:
    - pipeline: "nginx-access"
      when.contains:
        tags: "access"

In this example, we see one prospector, a process that manages one or more harvesters. Harvesters are the components that keep track of a single file, opening and closing it and processing the data in it. The prospector will add an extra application field to the Elastic documents it generates, and it will also tag them. The output configuration is fairly simple as well: just ship all logging to Elastic. Note that we can also specify which ingest pipeline should be invoked: documents tagged access are processed by the “nginx-access” pipeline, while all other documents fall back to the “general” pipeline.
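
Note that both pipeline ids refer to ingest pipelines that must already exist in Elasticsearch: “nginx-access” could be the pipeline shown earlier, and a hypothetical “general” pipeline might be as simple as this sketch:

PUT _ingest/pipeline/general
{
  "description" : "catch-all pipeline for shipped log lines",
  "processors" : [
    {
      "set" : {
        "field" : "shipped_by",
        "value" : "filebeat"
      }
    }
  ]
}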

It is also possible to process logging that spans multiple lines. A typical example is a stacktrace, but some applications write a timestamp to the first line and the message to a second one. In either case, you want to keep those lines together when they are sent to Elastic. You can tweak this behaviour with the multiline.pattern, multiline.negate and multiline.match settings. multiline.pattern expects a regular expression that is matched against every line from the input file (a small example follows the list below):

  • When multiline.negate: true
    • and the line does not match the pattern
      • and multiline.match: after, Filebeat will combine the line that did not match with the line before.
    • and the line does match the pattern
      • Filebeat will consider this matching line as a new log message. The value for multiline.match does not matter in this case.
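
As a small sketch (the log path and the date format are assumptions about your application, not something Filebeat prescribes): if every new log message starts with a timestamp like 2017-04-16 12:34:56, the following prospector settings glue every line that does not start with such a date onto the message before it:

filebeat.prospectors:
- input_type: log
  paths:
  - /var/log/myapp/*.log
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after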

The Filebeat documentation gives some hints on how to experiment with these settings.

Conclusion

It was interesting and fun to see how the various tools have developed over time. I was especially pleased by the pipelines feature of Elastic, since it makes it easier to maintain the parsing and processing of log data. In a situation where you have, say, four application servers, keeping the Grok patterns up-to-date and in sync can be quite cumbersome. By storing them in a central place, that burden is now gone.