
Enriching Data

Raw data is converted to JSON format (see working with data). It is then enriched, reshaped, and annotated for easy consumption and storage:

  • add to add constant value fields

  • script to conditionally add scripted fields

  • enrich for general CSV lookups

  • inputs as actions can be used for arbitrary retrieval

Fields with Agent-specific Values

Data needs to be tagged with the location at the source. There are standard Context variables that are different for each Agent, and more can be added to the Pipe Contexts:

- add:
    output-fields:
    - site: '{{name}}'
    - pipe: '{{pipe}}'

Generated Fields

All data needs a timestamp. The time action adds one, set to the processing time at the Agent:

- time:
    output-field: '@timestamp'

A sequence number can be added to each event. Note that it is not persistent across restarts:

- script:
    let:
    - seq: 'count()'

However, using uuid() is a more efficient method of generating such a field, as it gives each event a unique ID.
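To make the difference between the two kinds of generated field concrete, here is a minimal Python sketch. It is not the Agent's implementation: the counter is a hypothetical stand-in for count() (its starting value of 1 is an assumption), and the id field name is invented for illustration.

```python
import itertools
import uuid

# Hypothetical stand-in for count(): an in-memory counter that starts
# again from 1 (assumed starting value) whenever the process restarts.
_seq = itertools.count(1)


def tag_event(event):
    """Return a copy of the event with a sequence number and a UUID added."""
    tagged = dict(event)
    tagged["seq"] = next(_seq)        # unique only within this run
    tagged["id"] = str(uuid.uuid4())  # random ID, unique across runs
    return tagged


first = tag_event({"msg": "a"})
second = tag_event({"msg": "b"})
```

The sequence number is only meaningful within one run, whereas the random ID needs no state to survive a restart.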

Calculated Fields

Use the script action to calculate values for fields. For example, the let section of script can anonymize data using hash functions, after which remove drops the original fields:

- script:
    let:
    - name_hash: md5(name)
    - address_hash: md5(address)

- remove:
  - name
  - address

This prepares data for storage and for further processing outside private networks.

Hashes are one-way functions, so the original values cannot be recovered from them. If the values must remain recoverable, sensitive fields can be encrypted using encrypt() instead.
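The hash-then-remove step can be sketched in Python as follows. This is an illustration only: the anonymize helper is hypothetical, and we assume md5() yields the hexadecimal digest of the UTF-8 encoded field value.

```python
import hashlib


def anonymize(event, fields):
    """Replace each listed field with a '<field>_hash' MD5 hex digest."""
    out = dict(event)
    for field in fields:
        if field in out:
            # Remove the original value so it never leaves the network.
            value = str(out.pop(field))
            # Assumption: the hash is the hex digest of the UTF-8 text.
            out[field + "_hash"] = hashlib.md5(value.encode("utf-8")).hexdigest()
    return out


event = {"name": "Alice", "address": "1 Main Road", "id": 23}
anonymized = anonymize(event, ["name", "address"])
```

Equal inputs always produce equal hashes, so the hashed fields can still be used for grouping and joining downstream.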

Find the available scripting functions here.

Conditional Fields

If condition is defined, script only adds fields when it evaluates to true.

When adding literal strings, set is easier to use than let. Because the add and script actions never overwrite existing fields by default, the following snippet sets quality to good if a > 1 and to bad otherwise. The second script only takes effect when the first did not add the field:

- script:
    condition: a > 1
    set:
    - quality: good

- script:
    set:
    - quality: bad

For a more elegant solution, use the cond function:

- script:
    let:
    - quality: cond(a > 1,"good","bad")
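The two approaches give the same result, which a small Python sketch can demonstrate. The helper names are hypothetical, and treating a missing a as 0 is an assumption made only to keep the example short.

```python
def add_if_absent(event, field, value, condition=True):
    """Add a field unless it already exists or the condition is false."""
    if condition and field not in event:
        event[field] = value
    return event


def quality_two_step(event):
    """Two actions in sequence, relying on the never-overwrite default."""
    out = dict(event)
    add_if_absent(out, "quality", "good", condition=out.get("a", 0) > 1)
    add_if_absent(out, "quality", "bad")  # skipped if quality already set
    return out


def quality_cond(event):
    """Single expression, analogous to cond(a > 1, "good", "bad")."""
    out = dict(event)
    out["quality"] = "good" if out.get("a", 0) > 1 else "bad"
    return out
```

The two-step version depends on action order and on the overwrite default, while the single expression states the intent in one place.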

Table Lookup

enrich is an efficient way to enrich data with tables read from a CSV file. If the value of an event field matches a value in one column, we can use the value of another column on the same row to create a new field.

Note that the lookup files need to be attached to the Pipe using a files: section. An example of this can be found at the end of this section.

A sample lookup table, names.csv:

id,name,nick,office
23,Alice,bbye,head
12,Bob,wkr,kzn
13,John,nomo,wcape

If iden in the event matches the id in the table, we can set nice_name to the value of name:

# Input: {"iden":12}

- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nice_name
      lookup-field: name

# Output: {"iden":12,"nice_name":"Bob"}

Each match must specify a type. The available types are:

  • str: text values
  • num: numbers
  • ip: IPv4 addresses
  • cidr: IPv4 address ranges. For example: '192.168.1.0/16'
  • num-list: numbers separated by commas. For example: '10,20,30'
  • str-list: text values separated by commas. For example: 'office,home'
  • num-range: numeric ranges. For example: '10-23'
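One way to read the match types is as comparison rules between an event value and a lookup cell. The Python sketch below is an interpretation, not the actual matching code: the matches function is hypothetical, and inclusive range bounds and exact list membership are assumptions.

```python
import ipaddress


def matches(match_type, event_value, lookup_value):
    """Return True if event_value matches lookup_value under match_type."""
    if match_type == "str":
        return str(event_value) == lookup_value
    if match_type == "num":
        return float(event_value) == float(lookup_value)
    if match_type == "ip":
        return ipaddress.ip_address(event_value) == ipaddress.ip_address(lookup_value)
    if match_type == "cidr":
        # strict=False accepts networks written with host bits set,
        # such as '192.168.1.0/16'.
        network = ipaddress.ip_network(lookup_value, strict=False)
        return ipaddress.ip_address(event_value) in network
    if match_type == "num-list":
        return float(event_value) in [float(v) for v in lookup_value.split(",")]
    if match_type == "str-list":
        return str(event_value) in lookup_value.split(",")
    if match_type == "num-range":
        # Assumption: bounds are inclusive and non-negative.
        low, high = (float(v) for v in lookup_value.split("-"))
        return low <= float(event_value) <= high
    raise ValueError(f"unknown match type: {match_type}")
```

For example, matches("cidr", "192.168.44.5", "192.168.1.0/16") is true because the /16 prefix only fixes the first two octets.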

You may need to satisfy multiple matches:

# Input: {"iden":12,"office":"kzn"}

- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    - type: str
      event-field: office
      lookup-field: office
    add:
      event-field: nice_name
      lookup-field: name

# Output: {"iden":12,"office":"kzn","nice_name":"Bob"}
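The multi-match lookup above can be pictured as a scan for the first row that satisfies every match. Here is a minimal Python sketch under that assumption; the enrich, num_eq, and str_eq helpers are hypothetical, and the table is embedded as a string so the example is self-contained.

```python
import csv
import io

NAMES_CSV = """id,name,nick,office
23,Alice,bbye,head
12,Bob,wkr,kzn
13,John,nomo,wcape
"""


def num_eq(event_value, cell):
    return event_value is not None and float(event_value) == float(cell)


def str_eq(event_value, cell):
    return event_value is not None and str(event_value) == cell


def enrich(event, rows, matches, event_field, lookup_field):
    """Copy lookup_field from the first row that satisfies every match."""
    for row in rows:
        if all(compare(event.get(ev), row[col]) for compare, ev, col in matches):
            out = dict(event)
            out.setdefault(event_field, row[lookup_field])  # never overwrite
            return out
    return dict(event)  # no matching row: event is unchanged


rows = list(csv.DictReader(io.StringIO(NAMES_CSV)))
matches = [(num_eq, "iden", "id"), (str_eq, "office", "office")]
result = enrich({"iden": 12, "office": "kzn"}, rows, matches, "nice_name", "name")
```

An event whose office does not agree with the row for its iden is left without nice_name, since all matches must hold on the same row.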

Adding multiple values with enrich can be tedious, since the match must be repeated:

# Input: {"iden":12}

- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nice_name
      lookup-field: name

  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-field: nickname
      lookup-field: nick

# Output: {"iden":12,"nice_name":"Bob","nickname":"wkr"}

There is a convenient shortcut, provided the fields to be added have the same names as the lookup columns:

# Input: {"iden":12}

- enrich:
  - lookup-file: names.csv
    match:
    - type: num
      event-field: iden
      lookup-field: id
    add:
      event-fields:
      - name: <unknown>
      - nick: ''

# Output: {"iden":12,"name":"Bob","nick":"wkr"}

Each entry under event-fields gives a field name, which must match a column name in the CSV file. The value after the colon is the default value for that field.
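The shortcut can be sketched in Python as copying several same-named columns at once. This is a hypothetical helper, and it assumes the default value is used when no row matches, which the text above does not state explicitly.

```python
ROWS = [
    {"id": "23", "name": "Alice", "nick": "bbye", "office": "head"},
    {"id": "12", "name": "Bob", "nick": "wkr", "office": "kzn"},
    {"id": "13", "name": "John", "nomo": "nomo", "office": "wcape"},
]


def enrich_fields(event, rows, key_field, key_column, defaults):
    """Add every field in defaults, taken from the matching row if any."""
    out = dict(event)
    key = str(out.get(key_field))
    row = next((r for r in rows if r[key_column] == key), None)
    for name, default in defaults.items():
        # Assumption: fall back to the default when the lookup fails.
        out.setdefault(name, row[name] if row else default)
    return out


defaults = {"name": "<unknown>", "nick": ""}
found = enrich_fields({"iden": 12}, ROWS, "iden", "id", defaults)
missing = enrich_fields({"iden": 99}, ROWS, "iden", "id", defaults)
```

With the sample table, the event with iden 12 picks up Bob's name and nick, while an unknown iden receives the defaults.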

The lookup CSV file will be reloaded if it is modified. This allows other Pipes to modify the enrichment globally.

A complete example illustrating the inclusion of a fruits.csv file in an optional lookups subdirectory:

name: simple_echo_with_enrich

files:
- lookups/fruits.csv

input:
  echo:
    json: true
    event: |
      { "this": "a", "that": "b" }

actions:
- enrich:
    lookup-file: fruits.csv
    add:
      event-field: fruit
      lookup-field: output_column
    match:
    - type: str
      event-field: this
      lookup-field: input_column

output:
  print: STDOUT

# Output: {"that":"b","this":"a","fruit":"Apple"}

Here is the lookups/fruits.csv file:

input_column,output_column
a,"Apple"
b,"Banana"

Enriching with Input

Using inputs as actions is a powerful technique. Let's say we have an HTTP endpoint that is given a name and returns the city where the person lives as {"city":"NAME"}. Events containing name then receive a city field:

name: http-enrich

input:
  exec:
    command: echo '{"name":"Joe"}'
    raw: true

actions:
- input:
    http-poll:
      address: http://127.0.0.1:3030
      query:
      - name: ${name}
      raw: true

output:
  write: console

# Output: {"city":"Johannesburg","name":"Joe"}

Much of the functionality on Unix-like systems is provided through command-line tools, and we can execute these commands as actions. For example, the host command can perform either a forward or reverse DNS lookup:

name: host-enrich

input:
  exec:
    command: echo '{"ip":"98.137.246.7"}'
    raw: true

actions:
- exec:
    command: host ${ip}
    result:
      stdout-field: host

- raw:
    extract:
      input-field: host
      pattern: '(\S+)\.$'

output:
  write: console

# Output:
# {"ip":"98.137.246.7","host":"media-router-fp1.prod1.media.vip.gq1.yahoo.com"}

The only extra step is extracting the hostname from the end of the command output, which the raw action does with the pattern.
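To see what the pattern captures, we can apply the same regular expression in Python. The sample line is an assumed example of typical reverse-lookup output from host, and the extract_hostname helper is hypothetical.

```python
import re

# Same pattern as in the pipe: the last run of non-space characters,
# without the trailing dot that ends a fully qualified domain name.
HOST_PATTERN = re.compile(r"(\S+)\.$")


def extract_hostname(host_output):
    """Return the hostname at the end of the output, or None if absent."""
    found = HOST_PATTERN.search(host_output.strip())
    return found.group(1) if found else None


# Assumed sample of what `host 98.137.246.7` prints.
sample = (
    "7.246.137.98.in-addr.arpa domain name pointer "
    "media-router-fp1.prod1.media.vip.gq1.yahoo.com.\n"
)
hostname = extract_hostname(sample)
```

Output that does not end in a dot, such as an error message, yields no match, so the field would not be rewritten.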

Note

The script function ip2asn is a more appropriate way to get the actual ASN. This function uses the Team Cymru service and, in this case, returns YAHOO-GQ1, US.

Use input: redis to look up a field in a hash. This is particularly effective when the lookups are simple and often repeated.