Prometheus GPU Monitoring
Date: 2023-09-14 09:16:04
1, Prometheus GPU monitoring
- Install DCGM
Package used here: datacenter-gpu-manager_1.7.2_amd64.deb
After installing it, confirm the version:
# dcgmi --version
dcgmi version: 1.7.2
2, Install gpu-monitoring-tools
# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
# cd gpu-monitoring-tools/
# make binary
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
# make install
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
install -m 557 dcgm-exporter /usr/bin/dcgm-exporter
install -m 557 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv
install -m 557 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv
- Run dcgm-exporter
# which dcgm-exporter
/usr/bin/dcgm-exporter
# dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
- Test: the metrics endpoint should return monitoring data
# curl 192.168.1.2:9400/metrics
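The curl output is in the Prometheus text exposition format, so it can also be checked from a script. Below is a minimal Python sketch that pulls one metric out of that text and groups it by the `gpu` label. The function names and the sample values in `SAMPLE` are made up for illustration; the real output of your exporter carries more labels and more metrics.

```python
import re
import urllib.request

# Matches the gpu="..." label inside the {...} label section of a sample line.
GPU_LABEL = re.compile(r'(?:^|,)\s*gpu="([^"]*)"')

# Illustrative sample only; these values are invented, not real exporter output.
SAMPLE = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 0
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa"} 61.5
"""


def fetch_metrics(url, timeout=5.0):
    """Download the raw metrics text, e.g. from http://192.168.1.2:9400/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_gpu_metric(text, metric_name):
    """Return {gpu_index: value} for every sample of metric_name in the text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Label values may contain spaces, so split around the braces.
            name = line[: line.index("{")]
            labels = line[line.index("{") + 1 : line.rindex("}")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            parts = line.split()
            name, labels, fields = parts[0], "", parts[1:]
        if name != metric_name or not fields:
            continue
        match = GPU_LABEL.search(labels)
        gpu = match.group(1) if match else ""
        values[gpu] = float(fields[0])
    return values


if __name__ == "__main__":
    print(parse_gpu_metric(SAMPLE, "DCGM_FI_DEV_GPU_UTIL"))
```

To run it against a live exporter, pass `fetch_metrics("http://192.168.1.2:9400/metrics")` instead of `SAMPLE`.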
2.1, Start dcgm-exporter at boot
# vim /lib/systemd/system/dcgm-exporter.service
Create a new service with the following contents:
[Unit]
Description=dcgm-exporter service
[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl enable dcgm-exporter.service
# systemctl start dcgm-exporter.service
# systemctl status dcgm-exporter.service
3, Modify the Prometheus configuration
- Add dcgm-exporter as a scrape job
  # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['192.168.1.2:9400']
# cat prometheus.yml
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # node_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100','192.168.1.2:9100']
  # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['192.168.1.2:9400']
- Restart prometheus
# systemctl restart prometheus.service
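After the restart it is worth confirming that Prometheus is actually scraping the new job. Prometheus serves target status at `/api/v1/targets`; the Python sketch below reads that response and reports the health of each instance in one job. The base URL `http://localhost:9090` is an assumption taken from the `prometheus` job target in the config, and `SAMPLE_TARGETS` is a trimmed, invented payload used only to show the shape of the data.

```python
import json
import urllib.request

# Trimmed, invented example of a /api/v1/targets response (real ones have more fields).
SAMPLE_TARGETS = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "gpu", "instance": "192.168.1.2:9400"}, "health": "up"},
            {"labels": {"job": "node", "instance": "192.168.1.2:9100"}, "health": "down"},
        ]
    },
}


def fetch_targets(base_url, timeout=5.0):
    """Fetch target status from a Prometheus server, e.g. http://localhost:9090."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def job_target_health(payload, job):
    """Map each instance of the given job to its health string ("up", "down", ...)."""
    health = {}
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") == job:
            health[labels.get("instance", "")] = target.get("health", "unknown")
    return health


if __name__ == "__main__":
    print(job_target_health(SAMPLE_TARGETS, "gpu"))
```

An instance that is missing from the result usually means the job was not loaded at all, while a "down" entry means Prometheus could not reach the exporter.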
4, Grafana
5, Use monitoring dashboard 9957
This dashboard lets you switch between nodes.
6, Grafana settings
- Power monitoring (instance is the IP address of the node, with the exporter port)
DCGM_FI_DEV_POWER_USAGE{instance="192.168.1.101:9400"}
- GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="192.168.1.101:9400"}
7, Use dashboard 12027
  # dcgm-exporter
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['127.0.0.1:9400','192.168.1.101:9400','192.168.1.102:9400']
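With more than a few GPU nodes, the targets list gets tedious to edit by hand. As a rough sketch, the job stanza can be generated from a list of hosts. The helper name `render_scrape_job` is hypothetical, and the default port 9400 is taken from the targets above; paste the output under `scrape_configs:` in prometheus.yml.

```python
def render_scrape_job(job_name, hosts, port=9400):
    """Render one static scrape job in the same layout as the config above."""
    targets = ",".join(f"'{host}:{port}'" for host in hosts)
    return (
        f"  - job_name: '{job_name}'\n"
        f"    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


if __name__ == "__main__":
    hosts = ["127.0.0.1", "192.168.1.101", "192.168.1.102"]
    print(render_scrape_job("gpu-metrics", hosts), end="")
```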
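- Configure the panels manually
- View the GPU metrics
curl http://127.0.0.1:9400/metrics
- Power usage
DCGM_FI_DEV_POWER_USAGE{instance="127.0.0.1:9400"}
- Memory used
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}
- Total memory (used plus free)
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}+DCGM_FI_DEV_FB_FREE{instance="127.0.0.1:9400"}
- GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="127.0.0.1:9400"}
- GPU memory utilization (this metric reports memory copy activity, not the share of memory in use)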
DCGM_FI_DEV_MEM_COPY_UTIL{instance="192.168.0.114:9400"}
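The same expressions can be sent to the Prometheus HTTP API (`/api/v1/query`) instead of typed into a Grafana panel. The Python sketch below builds the query URL and also derives a "percent of memory in use" expression from the used and free metrics above. The helper names and the base URL `http://localhost:9090` are assumptions for illustration.

```python
from urllib.parse import urlencode


def selector(metric, instance):
    """Build a selector such as DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}."""
    return f'{metric}{{instance="{instance}"}}'


def fb_used_percent_expr(instance):
    """PromQL for used memory as a percentage of used + free on one node."""
    used = selector("DCGM_FI_DEV_FB_USED", instance)
    free = selector("DCGM_FI_DEV_FB_FREE", instance)
    return f"100 * {used} / ({used} + {free})"


def instant_query_url(base_url, expr):
    """URL for an instant query against a Prometheus server."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


if __name__ == "__main__":
    expr = fb_used_percent_expr("127.0.0.1:9400")
    print(expr)
    print(instant_query_url("http://localhost:9090", expr))
```

The printed expression can be pasted into a Grafana panel as-is. The division matches series on their shared labels, so it yields one value per GPU.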
8, Use the GPU-Nodes-Metrics-Nvidia dashboard (12639)