Customizing Prometheus Operator Configuration

2021-09-14 16:32:58 Author: Internet

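We often need to customize Prometheus's scrape configuration, alerting rules, and notification delivery. For Prometheus and Alertmanager instances deployed on virtual machines, these settings live in the following files, respectively: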

prometheus.yaml
*-rules.yaml
alertmanager.yaml

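When Prometheus is deployed on Kubernetes/OpenShift through the Operator, however, the Prometheus and Alertmanager instances are managed by that Operator, so we cannot simply edit the ConfigMap or Secret mounted into the Pods to update the configuration. Instead, we have to modify the custom resources defined by the related CRDs (Custom Resource Definitions).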

The main CRDs that Prometheus Operator creates in the cluster are:

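Prometheus: manages the Prometheus StatefulSet in the cluster
ServiceMonitor: selects the Endpoints objects to scrape through a label selector (matched against Service labels)
Alertmanager: manages the Alertmanager StatefulSet in the cluster
PrometheusRule: dynamically loads alerting rules into the Prometheus instance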

The sections below discuss how each of the configuration files above maps to these custom resources.

Monitoring

A sample prometheus.yaml is shown below; see the official documentation for the definition of each field.

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:         # The labels to add to any time series or alerts when communicating with external systems (federation, remote storage, Alertmanager).
    cluster: my-k8s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

The corresponding Prometheus object is configured as follows:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: openshift-monitoring
spec:
  # global
  scrapeInterval: 15s
  evaluationInterval: 15s
  externalLabels:
    cluster: my-k8s
  # alerting
  alerting:
    alertmanagers:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      name: alertmanager-main
      namespace: openshift-monitoring
      port: web
      scheme: https
      tlsConfig:
        ca: {}
        caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
        cert: {}
        serverName: alertmanager-main.openshift-monitoring.svc
  # rule_file
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  # scrape_configs
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      k8s-app: node-exporter

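The fields under global can be set directly in the spec of the Prometheus object, although they are spelled differently (camelCase such as scrapeInterval instead of snake_case such as scrape_interval). The full field mapping is available in the API documentation published by Prometheus Operator. Note that on OpenShift, as opposed to vanilla Kubernetes, changes made to the spec with oc edit prometheus are reverted to the defaults after a while, because the Cluster Monitoring Operator reconciles the object. Instead, we need to create a ConfigMap named cluster-monitoring-config in the openshift-monitoring project; only global settings defined there take effect: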

apiVersion: v1
data:
  config.yaml: |
    prometheusK8s:
      scrapeInterval: 15s
      evaluationInterval: 15s
      externalLabels:
        cluster: my-k8s
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring

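Because the Alertmanager instance is managed by its own custom resource, the fields under alerting are no longer a static configuration (static_configs); they identify the Alertmanager instance by name, namespace, and port, together with its authentication settings. Likewise, serviceMonitorSelector and ruleSelector select the associated ServiceMonitor and PrometheusRule objects, respectively, and the equivalents of scrape_configs and rule_files are declared in the selected objects: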

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: openshift-monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    bearerTokenSecret:
      key: ""
    interval: 30s
    port: https
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    scheme: https
    tlsConfig:
      ca: {}
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      cert: {}
      serverName: node-exporter.openshift-monitoring.svc
  jobLabel: k8s-app
  namespaceSelector: {}
  selector:
    matchLabels:
      k8s-app: node-exporter

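This object carries the label k8s-app: node-exporter, so it is selected by the sample Prometheus object. The detailed field definitions of ServiceMonitor can also be found in the API documentation; a few commonly used fields are introduced briefly here: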

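endpoints[].interval: the scrape interval for this endpoint only; it takes precedence over the global scrapeInterval;
endpoints[].port: the name of the Service port that exposes the metrics (spec.ports[].name in the Service), not the value of its targetPort;
endpoints[].relabelings: equivalent to the relabel_configs field in prometheus.yaml;
jobLabel: names a label on the selected Service whose value becomes the job label of the scraped metrics. Here it is k8s-app, so a Service labeled k8s-app: node-exporter yields job="node-exporter"; this plays the role of job_name in prometheus.yaml;
namespaceSelector: the namespaces in which to look for Endpoints objects; when left empty, only the namespace of the ServiceMonitor itself is searched;
selector: a label selector that picks the Services whose Endpoints objects should be monitored.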
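
To make the relationship between these fields and the Service they refer to more concrete, here is a minimal sketch of a Service and a ServiceMonitor that scrapes it. Every name in it (my-app, my-namespace, the metrics port, the team label) is hypothetical, and the team label is only assumed to match the serviceMonitorSelector of your Prometheus object; adapt them to your own cluster.

```yaml
# Hypothetical Service exposing metrics; all names are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app            # matched by the ServiceMonitor selector and read by jobLabel
spec:
  selector:
    app: my-app
  ports:
  - name: metrics          # the port NAME is what endpoints[].port refers to
    port: 8080
    targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    team: my-team          # assumption: must match the Prometheus serviceMonitorSelector
spec:
  jobLabel: app            # job="my-app", taken from the Service label "app"
  selector:
    matchLabels:
      app: my-app          # selects the Service above
  namespaceSelector:
    matchNames:
    - my-namespace         # where to look for the Endpoints objects
  endpoints:
  - port: metrics          # Service port name, not a port number
    path: /metrics
    interval: 30s          # applies to this endpoint only
```

If the Service port had no name, the port reference would have nothing to match, which is a common reason for a ServiceMonitor that produces no targets.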

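Once the Prometheus and ServiceMonitor objects are configured, the generated configuration file prometheus.env.yaml can be found in the /etc/prometheus/config_out directory of the Prometheus container; it takes the place of prometheus.yaml as the configuration file of the Prometheus instance.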
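
For orientation, the scrape job generated from the node-exporter ServiceMonitor looks roughly like the sketch below. This is a simplified illustration and not verbatim Operator output: the job naming scheme, the service discovery role, and the exact relabeling rules are assumptions that vary between Prometheus Operator versions, and TLS and authentication settings are omitted.

```yaml
# Simplified sketch of a generated scrape job; not verbatim Operator output.
scrape_configs:
- job_name: serviceMonitor/openshift-monitoring/node-exporter/0   # naming scheme is version dependent
  scrape_interval: 30s              # from endpoints[].interval
  scheme: https                     # from endpoints[].scheme
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - openshift-monitoring        # empty namespaceSelector: the ServiceMonitor's own namespace
  relabel_configs:
  # Keep only targets whose Service carries k8s-app=node-exporter (from selector).
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    regex: node-exporter
    action: keep
  # Keep only the endpoint port named "https" (from endpoints[].port).
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    regex: https
    action: keep
  # Use the value of the Service label k8s-app as the job label (from jobLabel).
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    target_label: job
    regex: (.+)
    replacement: ${1}
  # User-supplied relabeling copied from endpoints[].relabelings.
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: instance
    regex: (.*)
    replacement: $1
    action: replace
```

Comparing a sketch like this with the real prometheus.env.yaml is a quick way to check whether a ServiceMonitor was picked up at all.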

Alerting rules

Alerting rules are declared in a PrometheusRule object, whose spec.groups takes exactly the same form as *-rules.yaml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: app-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: app-rules
    rules:
    - alert: APP_NotReady
      annotations:
        description: No pod is ready for app {{ $labels.container }}.
        summary: App Not Ready
      expr: sum(kube_pod_container_status_ready{namespace="app",container != "deployment"})
        by (container) < 1
      for: 1m
      labels:
        severity: critical
        user: app-admin

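This object carries the labels prometheus: k8s and role: alert-rules, so it is selected by the sample Prometheus object. Likewise, the corresponding *-rules.yaml file can be found in the /etc/prometheus/rules directory of the Prometheus container.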
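
To illustrate that equivalence, the rule from the PrometheusRule object above can be written as a standalone rules file simply by lifting spec.groups to the top level. The sketch below assumes nothing beyond the example itself; only the Kubernetes wrapper (apiVersion, kind, metadata, spec) is dropped.

```yaml
# Standalone rules file equivalent to spec.groups of the PrometheusRule above.
groups:
- name: app-rules
  rules:
  - alert: APP_NotReady
    # Fires when no container of the app has been ready for one minute.
    expr: sum(kube_pod_container_status_ready{namespace="app",container!="deployment"}) by (container) < 1
    for: 1m
    labels:
      severity: critical
      user: app-admin
    annotations:
      summary: App Not Ready
      description: No pod is ready for app {{ $labels.container }}.
```

A file in this form can be validated locally with promtool check rules before the groups are pasted into a PrometheusRule object.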

Notifications

The Alertmanager object manages the Alertmanager StatefulSet in the cluster, and the notification configuration lives in a Secret mounted by that StatefulSet. We can decode the content of the Secret from Base64:

[root@bastion ~]# oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml
[root@bastion ~]# cat alertmanager.yaml 
global:
  smtp_smarthost: "******"
  smtp_from: "prometheus@me.com"
route:
  group_by: [alertname]
  receiver: default
receivers:
  - name: default
    email_configs:
      - to: "app@me.com"
        send_resolved: true

We can change how Alertmanager delivers notifications by modifying the Secret named alertmanager-main.
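
One way to make such a change is to rewrite the Secret as a manifest. The sketch below uses the stringData field so that the configuration can be written in plain text instead of Base64. It reuses the configuration decoded above and adds one route as an example; the app-admin receiver and its address are hypothetical, and the smtp_smarthost value stays redacted as in the original.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: openshift-monitoring
stringData:                         # plain text; the API server stores it Base64-encoded under data
  alertmanager.yaml: |
    global:
      smtp_smarthost: "******"      # redacted in the original; set to your SMTP host and port
      smtp_from: "prometheus@me.com"
    route:
      group_by: [alertname]
      receiver: default
      routes:
      - match:                      # older matcher syntax; newer Alertmanager releases prefer "matchers"
          severity: critical
        receiver: app-admin
    receivers:
    - name: default
      email_configs:
      - to: "app@me.com"
        send_resolved: true
    - name: app-admin               # hypothetical receiver added for illustration
      email_configs:
      - to: "app-admin@me.com"      # hypothetical address
        send_resolved: true
```

After the manifest is applied with oc apply -f or oc replace -f, the new configuration should be picked up without restarting the Pods by hand, although it is worth confirming this in the Alertmanager UI or logs.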

References

Prometheus Configuration

Prometheus Operator API

How to configure prometheus remote_write / remoteWrite in OpenShift Container Platform 4.x

Source: https://www.cnblogs.com/koktlzz/p/15268185.html