The DIY PaaS continues, and today I'm thinking about how to measure CPU and memory in this giant jigsaw puzzle.
The Kubernetes control plane exposes metrics about itself through its apiservers. Additionally, kube-state-metrics generates metrics about the objects inside the cluster.
TODO:
- decide if we need to replace the instance label with a stable name in the case of multiple instances
```yaml
scrape_configs:
  - job_name: kubernetes-apiservers
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        action: keep
        regex: default;kubernetes;https

  # note kube-state-metrics also has an alternate port with metrics about itself
  - job_name: kube-state-metrics
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        action: keep
        regex: kube-state-metrics;kube-state-metrics;kube-state-metrics
```
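Once kube-state-metrics is being scraped, resource requests become queryable. A minimal recording-rule sketch of the kind of rollups I have in mind (this assumes the kube-state-metrics v2.x metric names; older releases exposed e.g. `kube_pod_container_resource_requests_memory_bytes` instead):

```yaml
groups:
  - name: cluster-requests
    rules:
      # total memory requested per namespace, in bytes
      - record: namespace:kube_pod_container_resource_requests_memory:sum
        expr: sum by (namespace) (kube_pod_container_resource_requests{resource="memory", unit="byte"})
      # total CPU requested per namespace, in cores
      - record: namespace:kube_pod_container_resource_requests_cpu:sum
        expr: sum by (namespace) (kube_pod_container_resource_requests{resource="cpu", unit="core"})
```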
Kubernetes kubelets expose both their own metrics and the metrics of the pods running on their node.
```yaml
scrape_configs:
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: kubernetes-cadvisor
    kubernetes_sd_configs:
      - role: node
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: node-exporter
    kubernetes_sd_configs:
      # remember to create a service to expose this
      # using role: node is also possible if you expose a hostPort
      - role: endpoints
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        action: keep
        regex: node-exporter;node-exporter;node-exporter
      # rename the instance label from the discovered pod IP (we're using endpoints) to the node name
      - action: replace
        target_label: instance
        source_labels:
          - __meta_kubernetes_pod_node_name
```
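With the cadvisor and node-exporter jobs scraping, actual usage (as opposed to requests) can be derived. A sketch of recording rules, assuming the default metric names these exporters ship with:

```yaml
groups:
  - name: cluster-usage
    rules:
      # per-pod memory usage from cadvisor, in bytes
      - record: namespace_pod:container_memory_working_set_bytes:sum
        expr: sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
      # per-pod CPU usage from cadvisor, in cores
      - record: namespace_pod:container_cpu_usage:sum_rate5m
        expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      # per-node CPU utilisation from node-exporter, as a 0-1 fraction
      - record: instance:node_cpu_utilisation:avg_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```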
Prometheus has a useful /federate endpoint you can use to dump out everything after relabelling. Example query:

```
http://localhost:8080/federate?match[]={job=~".*"}
```
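On the consuming side, another Prometheus can scrape that same endpoint. A sketch (the target host is a placeholder, and the catch-all `match[]` should usually be narrowed):

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true  # keep the original job/instance labels instead of overwriting them
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".*"}'
    static_configs:
      - targets:
          - source-prometheus.example.internal:9090
```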
repo: testrepo-cluster-metrics
In the future we might be able to run N+1 (N = number of nodes) instances of opentelemetry-collector, N as agents (DaemonSet) and 1 as gateway, replacing the need for N node-exporter instances, 1 kube-state-metrics, as well as N+1 tracing collectors and N+1 logging collectors.
As it currently stands, it still needs some more work to export the metrics in a stable manner,
and maybe some extra exporters to write directly to storage.