Today, a coworker reported that they had added a counter metric project_monitor_created
but it wasn't showing up in our metrics backend (Datadog).
The other metrics from the service were all going through fine,
so why not this one?
For context, our metric collection system consists of:
an application exposing metrics in Prometheus exposition format at /metrics,
an OpenTelemetry Collector scraping it and doing a delta conversion,
Vector doing some extra transformations,
and finally sending it all to Datadog.
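As a rough picture of what the delta conversion stage does, here is a minimal sketch that turns a series of cumulative counter readings into per-interval increments. It is a simplification for illustration, not the collector's actual cumulativetodelta implementation; the function name and the reset handling are my own assumptions.

```python
def cumulative_to_delta(cumulative_values):
    """Convert successive cumulative counter readings into deltas.

    Simplified sketch: the first reading only establishes a baseline, and a
    drop in value is assumed to be a counter reset (the new reading is then
    taken as the delta).
    """
    deltas = []
    previous = None
    for value in cumulative_values:
        if previous is not None:
            if value >= previous:
                deltas.append(value - previous)
            else:
                # Counter went backwards: assume the process restarted.
                deltas.append(value)
        previous = value
    return deltas


print(cumulative_to_delta([3, 5, 5, 9]))  # [2, 0, 4]
```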
First, we verify that the application is actually producing the metric (curl localhost:8080/metrics) and that it's increasing.
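If you'd rather script that check than eyeball curl output, something like the following works as a sketch. It assumes the same localhost:8080/metrics endpoint as above; the helper names are hypothetical, and the parser is deliberately minimal (it ignores escaping rules in label values).

```python
from urllib.request import urlopen


def sample_values(exposition_text, metric_name):
    """Return the values of all samples named metric_name in the given text."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            rest = line[line.rfind("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if name == metric_name:
            # First token after the name/labels is the value; an optional
            # timestamp may follow it.
            values.append(float(rest.split()[0]))
    return values


def scrape(url="http://localhost:8080/metrics"):
    """Fetch the raw exposition text from the application."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling scrape() twice a few seconds apart and comparing sample_values(text, "project_monitor_created") between the two results shows whether the counter is moving.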
We first suspect Vector, because it's usually the one with problems.
But running vector tap on the pipeline just shows it getting data points with value: 0.0
(when we didn't crash our Teleport agents trying to dump all that data).
So we move back a stage into the collector.
Adding a debug exporter wasn't too hard;
I also used a filter processor to only include the metric I wanted:
connectors:
  forward:

processors:
  filter:
    metrics:
      metric:
        # OTTL conditions select what to drop, so negate to keep only this metric
        - 'name != "project_monitor_created"'

exporters:
  debug:
    verbosity: detailed  # print each data point, not just counts

service:
  pipelines:
    metrics/original:
      receivers:
        - prometheus
      processors:
        - cumulativetodelta
        # ... others
      exporters:
        - forward

    metrics/debug:
      receivers:
        - forward
      processors:
        - filter
      exporters:
        - debug
The output of this also showed data points with Value: 0.0.
So the problem was even earlier in the stack.
Instead of pulling from the final output stage,
we can use the output of the prometheus receiver directly in our debug pipeline.
This finally has something interesting:
The data point value is still Value: 0.0, but it has a StartTimestamp: 1970-01-01 00:00:03.
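A start timestamp that close to the Unix epoch is a strong hint that some small number is being read as "seconds since 1970". A quick sanity check of that reading (the helper name is just for illustration):

```python
from datetime import datetime, timezone


def as_epoch_seconds(value):
    """Interpret a number as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(value, tz=timezone.utc)


print(as_epoch_seconds(3).strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 00:00:03
```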
This gave me a decent idea of what was happening:
even though the receiver is called a prometheus receiver,
it apparently also follows OpenMetrics conventions.
Prometheus metrics can be a bit too loosely typed,
so there's the OpenMetrics project, which tries to formalize the format
and, in the process, introduced some extra features.
If you look at the OpenMetrics spec, you can find that _created is a suffix for counters, along with the following quotes:
A MetricPoint in a Metric with the type Counter SHOULD have a Timestamp value called Created. This can help ingestors discern between new metrics and long-running ones it did not see before.
The MetricPoint's Total Value Sample MetricName MUST have the suffix "_total". If present the MetricPoint's Created Value Sample MetricName MUST have the suffix "_created".
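So our collector decided to treat the value of project_monitor_created as the Created timestamp (the start timestamp) of a counter whose actual value we never expose, which would also explain a start time just a few seconds after the Unix epoch. Once we found this, it was a relatively easy fix of renaming the metric.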
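The suffix rule quoted above can be sketched roughly like this. This is an illustration of the naming convention only, not the receiver's real parsing code, and project_monitor_creations_total is a hypothetical replacement name (the post doesn't depend on which new name you pick, as long as it doesn't end in _created).

```python
def classify_counter_sample(sample_name):
    """Split a counter sample name into (family, role) by OpenMetrics suffix."""
    if sample_name.endswith("_total"):
        return sample_name[: -len("_total")], "total"
    if sample_name.endswith("_created"):
        return sample_name[: -len("_created")], "created"
    # No recognized suffix: treat the sample as the plain value.
    return sample_name, "value"


# The original name is read as the Created timestamp of "project_monitor".
print(classify_counter_sample("project_monitor_created"))
# ('project_monitor', 'created')

# A hypothetical renamed metric is read as an ordinary counter total.
print(classify_counter_sample("project_monitor_creations_total"))
# ('project_monitor_creations', 'total')
```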