
Reducing Data Volume

The Enriching Data chapter discussed how data collected at a remote point can be enriched before being sent to a central collector. However, when the volume of data is large relative to the available bandwidth, it is useful to reduce the amount of data sent to the collector using the distributed processing power of the nodes. This is an effective way to minimize licensing costs (e.g., Splunk) and to work with edge sites where only mobile data is available.

Filtering

The first method involves discarding events that are not considered important. filter has three variations for categorizing events:

  1. Pattern matches on field values:
- filter:
    patterns:
    - severity: high
    - source: ^GAUTENG-
  2. Conditional expressions:
- filter:
    condition: speed > 1
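
To make the intent concrete, here is a minimal Python sketch of what these two filter styles do conceptually. It is not the agent's implementation: it assumes that patterns are regular expressions (as the ^GAUTENG- example suggests) and that every listed pattern must match, which is an assumption rather than documented behaviour.

import re

# Conceptual illustration only: pass an event when every pattern matches
# its field (regular-expression search), or when a condition holds.
def matches_patterns(event, patterns):
    return all(
        field in event and re.search(regex, str(event[field]))
        for field, regex in patterns.items()
    )

events = [
    {"severity": "high", "source": "GAUTENG-01", "speed": 2.5},
    {"severity": "low", "source": "NATAL-07", "speed": 0.4},
]

patterns = {"severity": "high", "source": "^GAUTENG-"}
by_pattern = [e for e in events if matches_patterns(e, patterns)]
by_condition = [e for e in events if e["speed"] > 1]
print(by_pattern)    # only the GAUTENG event
print(by_condition)  # only the GAUTENG event (speed 2.5 > 1)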

A variation of this involves using stream to watch for changes — a highly effective technique for only sending changed values:

- stream:
    operation: delta
    watch: throughput
- filter:
    condition: delta != 0
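
Conceptually this is just change detection: remember the previous value and drop the event when it has not changed. The Python sketch below illustrates that idea only; it is not the stream operation itself, and passing the very first reading through is an assumption made here for the example.

# Illustration of delta-based change detection (not the stream operation itself):
# remember the last throughput and drop events whose value has not changed.
def changed_only(events, watch="throughput"):
    previous = None
    for event in events:
        value = event.get(watch)
        if previous is not None and value == previous:
            continue  # delta is zero, so drop the event
        previous = value
        yield event

readings = [{"throughput": 100}, {"throughput": 100}, {"throughput": 120}]
print(list(changed_only(readings)))
# [{'throughput': 100}, {'throughput': 120}]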

Discarding and Renaming

We may only be interested in certain data fields, which brings us to the third variation of filter:

  3. Schema:
- filter:
    schema:
    - source
    - destination
    - sent_kilobytes_per_sec

filter.schema only passes through events that have the specified fields, and it discards any other fields from those events.
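
As a conceptual illustration (not the filter implementation), the behaviour described above amounts to projecting each event onto a whitelist of fields and dropping events that do not carry all of them. The field values below are made up for the example.

# Conceptual sketch of a schema filter: keep only the listed fields and
# drop any event that does not have all of them.
SCHEMA = ["source", "destination", "sent_kilobytes_per_sec"]

def apply_schema(event, schema=SCHEMA):
    if not all(field in event for field in schema):
        return None  # event does not match the schema
    return {field: event[field] for field in schema}

event = {"source": "edge-01", "destination": "central",
         "sent_kilobytes_per_sec": 12, "scratch": 99}
print(apply_schema(event))
# {'source': 'edge-01', 'destination': 'central', 'sent_kilobytes_per_sec': 12}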

note

We recommend documenting event structure and getting rid of any temporary fields generated during processing.

To do this, remember that it is always possible to explicitly use remove.

When JSON is used as the data transport, field names are a significant part of the payload size. It is therefore useful to rename fields to shorter names:

- rename:
  - source: s
  - destination: d
  - sent_kilobytes_per_sec: sent
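
The saving is easy to quantify. The Python sketch below is purely illustrative, with made-up values; it renames the fields of one event and compares the JSON payload sizes.

import json

# Illustrative only: shorter field names shrink every JSON payload,
# which adds up quickly at high event rates.
RENAMES = {"source": "s", "destination": "d", "sent_kilobytes_per_sec": "sent"}

def rename_fields(event, renames=RENAMES):
    return {renames.get(key, key): value for key, value in event.items()}

event = {"source": "GAUTENG-01", "destination": "central",
         "sent_kilobytes_per_sec": 12}
short = rename_fields(event)
print(len(json.dumps(event)), len(json.dumps(short)))  # 80 47, roughly 40% smaller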

Using a More Compact Data Format

CSV is a very efficient data transfer format because rows do not repeat column names.

collapse will convert the fields of a JSON event into CSV data.

In most cases, this CSV data would need to be converted back into JSON for storage in analytics engines like Elasticsearch. Creating a Logstash filter to perform this conversion can be tedious, so collapse provides some workarounds, as shown below.

The CSV output allows the column names and types to be written to a field. There is also an option to write this header only when the fields change:

# Input: {"a":1,"b":"hello"}, {"a":2,"b":"goodbye"}

- collapse:
    output-field: d
    csv:
      header-field: h
      header-field-types: true
      header-field-on-change: true

# Output: {"d":"1,hello","h":"a:num,b:str"}, {"d":"2,goodbye"}
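
The following Python sketch reproduces the idea behind this example. It is not the collapse implementation; the num and str type tags simply mirror the output shown above.

# Conceptual sketch of collapsing events into CSV rows, emitting the typed
# header into a separate field only when the set of fields changes.
def collapse(events):
    last_header = None
    for event in events:
        header = ",".join(
            f"{key}:{'num' if isinstance(value, (int, float)) else 'str'}"
            for key, value in event.items()
        )
        out = {"d": ",".join(str(value) for value in event.values())}
        if header != last_header:
            out["h"] = header
            last_header = header
        yield out

print(list(collapse([{"a": 1, "b": "hello"}, {"a": 2, "b": "goodbye"}])))
# [{'d': '1,hello', 'h': 'a:num,b:str'}, {'d': '2,goodbye'}]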

The reverse operation, expand, can take place on a server-side Pipe. It takes the output of the remote Pipe and restores the original events:

- expand:
    input-field: d
    remove: true
    delim: ',' # default
    csv:
      header-field: h
      header-field-types: true
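
To round off the picture, here is a matching Python sketch of the reverse step. Again, this is an illustration rather than the expand implementation, and it restores numeric values as floats.

# Conceptual sketch of expanding CSV rows back into events, using the most
# recently seen typed header to name the columns and restore numeric values.
def expand(events, delim=","):
    header = None
    for event in events:
        if "h" in event:
            header = [column.split(":") for column in event["h"].split(delim)]
        values = event["d"].split(delim)
        yield {
            name: float(value) if kind == "num" else value
            for (name, kind), value in zip(header, values)
        }

collapsed = [{"d": "1,hello", "h": "a:num,b:str"}, {"d": "2,goodbye"}]
print(list(expand(collapsed)))
# [{'a': 1.0, 'b': 'hello'}, {'a': 2.0, 'b': 'goodbye'}]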

If there is a corresponding Pipe on the server, you can also move any enrichments to the Pipe in question.