Building a Prometheus-Based Monitoring and Alerting System in Python

2023-02-18 16:34:52

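It was too cold to go out this weekend, so I stayed home and put together a demo of web-UI-based ("white-screen") operations for Prometheus. So far only a few simple backend endpoints exist; input validation and the like have not been added yet.

I'm recording it here for now. Once the backend is finished, I'll try writing the frontend as well.

Key points:

1. Prometheus targets are stored in a database; they only need to follow a certain format. Prometheus has long supported dynamic target discovery over an HTTP endpoint (http_sd_configs). The endpoint returns a JSON list in which each entry has a "targets" list and a "labels" object.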
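As a rough illustration of that format, here is a minimal Python sketch that turns database rows into the JSON body an http_sd_configs endpoint is expected to return. The row fields, the sample addresses, and the helper name build_sd_payload are assumptions made for this example and are not taken from the actual backend; the __meta_* label names are the ones the relabel rules in prometheus.yml refer to.

```python
import json

# Hypothetical rows, shaped the way a targets table might return them.
ROWS = [
    {"address": "10.0.0.11:9100", "datacenter": "dc1", "job": "node",
     "role": "db", "cluster": "mysql-a", "instance": "db01"},
    {"address": "10.0.0.12:9100", "datacenter": "dc1", "job": "node",
     "role": "db", "cluster": "mysql-a", "instance": "db02"},
]


def build_sd_payload(rows):
    """Return one HTTP SD target group per database row.

    Every label value must be a string; Prometheus rejects other JSON types.
    """
    groups = []
    for row in rows:
        groups.append({
            "targets": [row["address"]],
            "labels": {
                "__meta_datacenter": str(row["datacenter"]),
                "__meta_prometheus_job": str(row["job"]),
                "__meta_role": str(row["role"]),
                "__meta_cluster": str(row["cluster"]),
                "__meta_instance": str(row["instance"]),
            },
        })
    return groups


if __name__ == "__main__":
    # The web framework would return this body with Content-Type application/json.
    print(json.dumps(build_sd_payload(ROWS), indent=2))
```

Emitting one group per row keeps the per-target labels independent; rows that share every label could instead be merged into a single group with several entries in "targets".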

The Prometheus configuration file needs a small change to add some relabel rules, as follows:

$ cat /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.31.181:9093
rule_files:
   - "rules/*.yml"
#   - "rules/*.yaml"

scrape_configs:
  - job_name: "alertcenter_api"
    metrics_path: "/metrics"
    http_sd_configs:
      - url: "http://192.168.31.79:8000/api/prom/prom_targets"
        refresh_interval: 30s
    relabel_configs:
    - source_labels:
      - "__meta_datacenter"
      separator: "-"
      regex: "(.*)"
      target_label: "datacenter"
      action: replace
      replacement: "$1"
    - source_labels:
      - "__meta_prometheus_job"
      separator: "-"
      regex: "(.*)"
      target_label: "job"
    - source_labels:
      - "__meta_role"
      separator: "-"
      regex: "(.*)"
      target_label: "role"
    - source_labels:
      - "__meta_cluster"
      separator: "-"
      regex: "(.*)"
      target_label: "cluster"
    - source_labels:
      - "__meta_instance"
      separator: "-"
      regex: "(.*)"
      target_label: "instance"
    - source_labels:
      - "__address__"
      separator: "-"
      regex: "(.*)"
      target_label: "endpoint"

2. Alerting rules are also stored in the database. The rows are rendered to JSON, converted into a YAML file, and then applied to Prometheus so that they take effect.
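The rule pipeline could be sketched roughly as below. This is only an illustration under assumptions: the row fields, the sample rule, and the function names are hypothetical, PyYAML is assumed to be installed for the YAML step, and the reload call assumes Prometheus was started with --web.enable-lifecycle.

```python
import urllib.request

# Hypothetical rows, shaped the way a rules table might return them.
RULE_ROWS = [
    {"group": "node", "alert": "InstanceDown", "expr": "up == 0",
     "for": "1m", "severity": "critical",
     "summary": "Instance {{ $labels.instance }} is down"},
]


def build_rule_groups(rows):
    """Group rows by rule group and return the rule-file document as a dict."""
    groups = {}
    for row in rows:
        rule = {
            "alert": row["alert"],
            "expr": row["expr"],
            "for": row["for"],
            "labels": {"severity": row["severity"]},
            "annotations": {"summary": row["summary"]},
        }
        groups.setdefault(row["group"], []).append(rule)
    return {"groups": [{"name": n, "rules": r} for n, r in groups.items()]}


def write_rule_file(rows, path):
    """Dump the document as YAML; PyYAML is imported lazily (third-party)."""
    import yaml
    with open(path, "w", encoding="utf-8") as fh:
        yaml.safe_dump(build_rule_groups(rows), fh,
                       allow_unicode=True, sort_keys=False)


def reload_prometheus(base_url):
    """Ask Prometheus to re-read its config and rule files via POST /-/reload."""
    req = urllib.request.Request(base_url.rstrip("/") + "/-/reload",
                                 data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200
```

With the rule_files setting shown in prometheus.yml, the file would presumably be written somewhere under /usr/local/prometheus/rules/ with a .yml suffix before triggering the reload. Running promtool check rules on the generated file first is a cheap way to catch a bad expression before it reaches Prometheus.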

3. Alertmanager alerting. Configure a webhook, roughly like this:

$ cat /usr/local/alertmanager-0.23.0.linux-amd64/alertmanager.yml
global: 
  resolve_timeout: 30s 

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 30s
  receiver: 'webhook1'

  routes:
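  # NOTE: "match" compares the label value literally, so this pattern is not
  # treated as a regex here; the "match_re" route below is the one that works.
  - match:
      job: ^.*(数据库|mysql|MySQL).*$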
    receiver: dba
    group_wait: 10s
    group_interval: 30s
    repeat_interval: 30s
  - match_re:
      job: ^.*(数据库|mysql|MySQL).*$
    group_wait: 30s
    group_interval: 30s
    repeat_interval: 30s
    receiver: dba

receivers:
- name: webhook1
  webhook_configs:
  - send_resolved: true
    url: http://192.168.31.79:8000/api/prom/test
- name: dba
  webhook_configs:
  - send_resolved: true
    url: http://192.168.31.79:8000/api/prom/test

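The POST endpoint does quite a lot. The rough steps are:

1. Receive the message pushed by Alertmanager (so far there appear to be two kinds: firing alerts and resolved notifications).

2. Use Selenium to open the Prometheus web UI and take a screenshot.

3. Upload the screenshot to Tencent Cloud object storage and generate a fixed, publicly accessible link.

4. Send a DingTalk alert message containing the text and the screenshot.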
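Steps 1 and 4 can be sketched in plain Python as follows. This is not the actual endpoint code: the function name, the sample payload, and the message layout are assumptions for illustration. The payload fields used (status, commonLabels, alerts, labels, annotations, startsAt) follow the Alertmanager webhook format, and the returned dict follows the markdown message shape of a DingTalk custom robot; the screenshot and upload steps are represented only by a screenshot_url argument.

```python
# Hypothetical sample of what Alertmanager POSTs to the webhook.
SAMPLE = {
    "status": "firing",
    "commonLabels": {"alertname": "InstanceDown", "job": "mysql"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "instance": "db01"},
            "annotations": {"summary": "Instance db01 is down"},
            "startsAt": "2023-02-18T08:00:00Z",
        }
    ],
}


def build_dingtalk_message(payload, screenshot_url=None):
    """Turn an Alertmanager webhook payload into a DingTalk markdown message."""
    status = payload.get("status", "unknown")  # "firing" or "resolved"
    alertname = payload.get("commonLabels", {}).get("alertname", "unknown")
    title = f"[{status.upper()}] {alertname}"

    lines = [f"### {title}"]
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"- {labels.get('instance', '?')}: "
            f"{annotations.get('summary', '')} "
            f"(since {alert.get('startsAt', '?')})"
        )
    if screenshot_url:
        # Public link produced by the upload step.
        lines.append(f"![screenshot]({screenshot_url})")

    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": "\n".join(lines)},
    }
```

The returned dict would be serialized to JSON and POSTed to the robot's webhook URL. Keeping the formatting in a pure function like this makes it easy to test without Alertmanager, Selenium, or DingTalk in the loop.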

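There is still a lot to do on the alerting side, for example:

1. Critical alerts need an acknowledge button. If nobody acknowledges after N repeats, the alert escalates (first-line on-call -> team lead -> director).

2. Alert silence windows (some jobs run batch workloads at night and may show very high load, so alerting continuously during that time is pointless).

3. Alert merging.

4. Custom alert recipients.

5. Accepting alerts that are not pushed by Alertmanager, for example an alert triggered when a shell script fails.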
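None of these features exist yet, so the following is only a sketch of how items 1 and 2 might be approached. The tier names, the repeats_per_level default, and both function names are assumptions for illustration.

```python
from datetime import time

# Hypothetical escalation chain, lowest tier first.
ESCALATION_CHAIN = ["first-line", "leader", "director"]


def escalation_level(unacked_count, repeats_per_level=3):
    """Pick the recipient tier for an alert nobody has acknowledged.

    The alert moves up one tier every `repeats_per_level` unacknowledged
    notifications and stays at the top tier once it gets there.
    """
    index = min(unacked_count // repeats_per_level, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[index]


def in_silence_window(now, start, end):
    """Return True if `now` falls in [start, end), including windows that
    cross midnight (for example 23:00 to 05:00)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


if __name__ == "__main__":
    print(escalation_level(4))
    print(in_silence_window(time(1, 30), time(23, 0), time(5, 0)))
```

In a real service the unacknowledged count and the per-job silence windows would presumably live in the same database as the targets and rules.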