Introduction to OpenTelemetry Collector
In recent years, the term OpenTelemetry (OTel) has been hotter than ever. One of the popular components in that ecosystem is the OpenTelemetry Collector (otel-collector), designed to facilitate the collection, processing, and exporting of telemetry data such as metrics, traces, and logs.
In this article, we focus mainly on logging 🧾.
Intro
Its official documentation states:
The OpenTelemetry Collector is an executable file that can receive telemetry, process it, and export it to multiple targets, such as observability backends.
Basically, an otel-collector configuration consists of 3 parts:
- receivers: collect data from a source
- processors: transform the data on its way from receivers to exporters
- exporters: send the data to its destination
The pipeline below illustrates the path of data from receivers to exporters:
receivers → processors → exporters
Collecting sample raw logs
Let's consider this raw query log from phpMyAdmin:
[2024-07-28 14:53:14] SELECT @@lower_case_table_names
[2024-07-28 14:53:15] SET collation_connection = 'utf8mb4_unicode_ci';
I'm only taking two lines for the sake of simplicity. Now let's take a look at our minimal otel-collector configuration (with inline comments):
# receivers section
receivers:
  # we use the filelog receiver
  # more info: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver
  filelog/test:
    include:
      - /var/log/test.log
    # start from the beginning of the file as we're testing
    start_at: beginning
    # operator section, can have multiple operators
    operators:
      # use regex_parser and capture all the characters with (.*)
      # you can test the regex on https://regex101.com/
      - type: regex_parser
        regex: '(?P<raw_log>.*)'
# this section reads input from the beginning of the file `/var/log/test.log` and captures each whole line with the regex

# processors section
processors:
  # the batch processor buffers records and flushes them in batches;
  # send_batch_size sets the batch size threshold
  batch:
    send_batch_size: 512

# exporters section
exporters:
  # we use the file exporter
  # more info: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/fileexporter
  file/test:
    path: /var/log/output.log
# basically this section outputs the final data to `/var/log/output.log`

# pipeline section: wires together all of the above
service:
  pipelines:
    logs/test:
      receivers:
        - filelog/test # must match the receiver name defined above
      processors:
        - batch # must match the processor name defined above
      exporters:
        - file/test # must match the exporter name defined above
For testing the regex_parser, you should use start_at: beginning; later we will change it to start_at: end, which works like tail -F.
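Before starting the collector, we can sanity-check the configuration with the collector's validate subcommand (available in recent builds, including the contrib image used below):

docker run --rm -v /tmp/config.yaml:/etc/otelcol-contrib/config.yaml:ro \
otel/opentelemetry-collector-contrib:0.102.1 \
validate --config /etc/otelcol-contrib/config.yaml

It prints an error and exits non-zero if the YAML is malformed or references unknown components.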
Start the app using docker with the below command:
docker run --rm -d --name=otel-collector --hostname=$(hostname) --user=0 \
-v /var/log:/var/log \
-v /tmp/config.yaml:/etc/otelcol-contrib/config.yaml:ro \
--network=host otel/opentelemetry-collector-contrib:0.102.1 \
--config /etc/otelcol-contrib/config.yaml
I used the --rm flag so the container is removed as soon as it stops, which makes it quick to re-create with a new configuration later.
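To confirm the collector started cleanly (and to debug pipeline errors later), tail the collector's own logs:

docker logs -f otel-collector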
Output examination
Let's cat /var/log/output.log so that we can learn more about the output format:
...
"logRecords": [
  {
    "observedTimeUnixNano": "1722190924664883682",
    "body": {
      "stringValue": "[2024-07-28 14:53:14] SELECT @@lower_case_table_names"
    },
    "attributes": [
      {
        "key": "log.file.name",
        "value": {
          "stringValue": "test.log"
        }
      },
      {
        "key": "raw_log",
        "value": {
          "stringValue": "[2024-07-28 14:53:14] SELECT @@lower_case_table_names"
        }
      }
    ],
    ...
  },
  {
    "observedTimeUnixNano": "1722190924664953651",
    "body": {
      "stringValue": "[2024-07-28 14:53:15] SET collation_connection = 'utf8mb4_unicode_ci';"
    },
    "attributes": [
      {
        "key": "log.file.name",
        "value": {
          "stringValue": "test.log"
        }
      },
      {
        "key": "raw_log",
        "value": {
          "stringValue": "[2024-07-28 14:53:15] SET collation_connection = 'utf8mb4_unicode_ci';"
        }
      }
    ],
    ...
  }
]
...
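Note that the file exporter writes each flush as one long JSON line, so the raw file is hard to read; if you have jq installed, you can pretty-print it:

cat /var/log/output.log | jq .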
We can see that, by default, the raw log line is stored in body, and the receiver automatically adds an attribute with the key log.file.name for us. Our regex_parser produces the raw_log attribute; since (.*) matches the whole line, its content is identical to body.
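To watch a new record flow through the pipeline end to end, append a line to the watched file (the query below is just a made-up example) and check the output again:

echo "[2024-07-28 14:53:16] SHOW DATABASES" >> /var/log/test.log
cat /var/log/output.log | jq .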
Multi-line log collection
If every entry in our log file is a single line like the example above, we're all good. Sometimes, though, we face logs in a multi-line format (e.g. Java/Python stack traces). Let's consider a more complex version of the phpMyAdmin log:
[2024-07-28 14:53:13] SELECT TABLE_NAME
FROM information_schema.VIEWS
[2024-07-28 14:53:14] SELECT @@lower_case_table_names
[2024-07-28 14:53:15] SET collation_connection = 'utf8mb4_unicode_ci';
[2024-07-28 14:53:16] SELECT TABLE_NAME
FROM information_schema.VIEWS
WHERE TABLE_SCHEMA = 'nova-240318'
AND TABLE_NAME = 'block_device_mapping'
After that, restart the otel-collector container with docker restart otel-collector. The new output.log will look like this:
{
  "body": {
    "stringValue": "[2024-07-28 14:53:13] SELECT TABLE_NAME"
  },
  ...
  "key": "raw_log",
  "stringValue": "[2024-07-28 14:53:13] SELECT TABLE_NAME"
  ...
  "body": {
    "stringValue": "FROM information_schema.VIEWS"
  },
  ...
  "key": "raw_log",
  "stringValue": "FROM information_schema.VIEWS"
  ...
  "body": {
    "stringValue": "[2024-07-28 14:53:14] SELECT @@lower_case_table_names"
  },
  ...
  "key": "raw_log",
  "stringValue": "[2024-07-28 14:53:14] SELECT @@lower_case_table_names"
  ...
  "body": {
    "stringValue": "WHERE TABLE_SCHEMA = 'nova-240318'"
  },
  ...
  "key": "raw_log",
  "stringValue": "WHERE TABLE_SCHEMA = 'nova-240318'"
}
The collector takes each single physical line as a new log entry, which is not what we want. The recombine operator comes in handy in this situation; you can learn more about it here. Let's modify our receivers:
receivers:
  filelog/test:
    include:
      - /var/log/test.log
    start_at: beginning
    operators:
      - type: recombine
        combine_field: body # combine/merge body fields when the condition below matches
        is_first_entry: body matches "^\\[" # a new log entry starts with [; other lines are continuations
        source_identifier: attributes["log.file.name"] # distinguish between multiple log files
      - type: regex_parser
        regex: '(?P<raw_log>[\S\s]*)'
...
In this case, raw_log spans multiple lines, so we should use [\S\s]* for the regex match instead of .*: by default, . does not match newlines, so .* would only capture the current line.
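As a side note, the filelog operators use Go's regexp engine, so an equivalent option should be the (?s) flag, which makes . match newline characters as well:

- type: regex_parser
  regex: '(?s)(?P<raw_log>.*)'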
Rerun the container, and output.log will look like this:
{
  "body": {
    "stringValue": "[2024-07-28 14:53:13] SELECT TABLE_NAME\nFROM information_schema.VIEWS"
  }, ...
  "body": {
    "stringValue": "[2024-07-28 14:53:14] SELECT @@lower_case_table_names"
  }, ...
  "body": {
    "stringValue": "[2024-07-28 14:53:15] SET collation_connection = 'utf8mb4_unicode_ci';"
  }, ...
  "body": {
    "stringValue": "[2024-07-28 14:53:16] SELECT TABLE_NAME\nFROM information_schema.VIEWS\nWHERE TABLE_SCHEMA = 'nova-240318'\nAND TABLE_NAME = 'block_device_mapping'"
  }, ...
}
Remember to change to start_at: end to tail only the newest logs.
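For reference, here is the receiver section we ended up with, with start_at switched to end:

receivers:
  filelog/test:
    include:
      - /var/log/test.log
    # tail only new lines, like tail -F
    start_at: end
    operators:
      - type: recombine
        combine_field: body
        is_first_entry: body matches "^\\["
        source_identifier: attributes["log.file.name"]
      - type: regex_parser
        regex: '(?P<raw_log>[\S\s]*)'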