Monitoring Nvidia GPUs without fancy tools - securely
TL;DR: Not everyone needs a K8s environment where everything is automated and over-engineered. Sometimes you just need a small Prometheus and Grafana setup to collect data. This post walks through a bit of automation to start and destroy such a monitoring solution. You can skip to the end for the link to the GitHub repo.
When we think of monitoring GPUs, we usually think of multi-cluster configurations or a bunch of enterprise Nvidia GPUs. But what about a small office or a home environment where you just want to see what you are using? So I thought: why not use docker compose, with a little automation to run the monitoring and delete it when I am done?
The first thing was to find a way to scrape metrics from my GPU. Google pointed me to DCGM from Nvidia; however, this proved to be a dead end for me, as it works with datacenter drivers only. Then I found this cool project: https://github.com/utkuozdemir/nvidia_gpu_exporter. This was exactly what I needed, nothing fancy, it just collects all the metrics from the nvidia-smi command.
My next step was to find a docker compose file. That was pretty fast, but all the solutions I found on the web were extremely lacking and OPEN! No TLS, not even basic authentication. We cannot have that… So I got to work.
Some might say: what is wrong with HTTP, these are only metrics? I will list just a few problems:
Information leakage. Please do not give an attacker information about what you are running in your environment; it makes it a lot easier to attack. Metrics can reveal APIs and more. Not to mention what happens if they find a way in through that exporter.
An open endpoint is an open ticket for a DoS attack on you.
If they know what you are running, this opens you up to supply chain attacks, data exfiltration etc.
You get the picture, open ports without authentication = bad idea!
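A quick way to see whether one of your own endpoints falls into this category is to request it without any credentials and look at the status code. The sketch below is only an illustration: `classify_status` and `probe` are names made up for this post, and the mapping assumes the endpoint answers 401 or 403 when it wants credentials.

```shell
#!/bin/bash
# Map the status code of an unauthenticated request to a rough verdict.
# curl prints 000 when no HTTP response arrived (connection refused,
# TLS handshake rejected because no client certificate was sent, ...).
classify_status() {
  case "$1" in
    200)     echo "open" ;;
    401|403) echo "protected" ;;
    000)     echo "no-http-response" ;;
    *)       echo "unknown" ;;
  esac
}

# Request the URL with no credentials and no client certificate.
# -k skips server certificate verification: we only care whether the
# endpoint answers anonymous clients, not whether we trust its cert.
probe() {
  local code
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "$1")
  echo "$1 $(classify_status "$code")"
}
```

For example, `probe http://10.200.0.80:9835/metrics` against an exporter started with default settings would be expected to report "open".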
Creating the docker compose file was pretty simple:
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.config.file=/etc/prometheus/web-config.yml'
    ports:
      - 9090:9090
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
    restart: unless-stopped
    environment:
      - GF_SERVER_PROTOCOL=https
      - GF_SERVER_CERT_FILE=/var/lib/grafana/ssl/grafana.crt
      - GF_SERVER_CERT_KEY=/var/lib/grafana/ssl/grafana.key
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=grafana
    volumes:
      - ./grafana:/etc/grafana/provisioning/datasources
      - ./grafana_ssl:/var/lib/grafana/ssl
      - grafanadb:/var/lib/grafana
volumes:
  prom_data:
  grafanadb:
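The compose file ships with admin / grafana as the Grafana credentials, which is fine for a first run but easy to forget about. One option, sketched here as an assumption rather than what the repo does, is to generate a password and let docker compose read it from a `.env` file. `gen_password`, `write_env_file` and the `GRAFANA_ADMIN_PASSWORD` variable name are all made up for this example.

```shell
#!/bin/bash
# Print a random alphanumeric password (default length 24).
gen_password() {
  local len="${1:-24}"
  # 1024 random bytes leave roughly 250 alphanumeric characters after
  # filtering, far more than the lengths this sketch is meant for.
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c "1-$len"
}

# Write the password to an env file that docker compose reads for
# variable substitution. The file holds a secret, so create it private.
write_env_file() {
  local file="${1:-.env}"
  ( umask 077; printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$(gen_password)" >"$file" )
}
```

After calling `write_env_file` next to the compose file, the environment entry could read `GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}` instead of the literal value.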
Next I needed to create all the certificates, configuration files, passwords etc. I could do all that manually, but in about a week I would forget what I did and how it worked. So I created a little unorganized script to build it for me (no judgement please - it was for my home network):
#!/bin/bash
[ -d grafana ] && echo "Already installed" && exit 0
YQ_VERSION=4.44.6
sudo curl -sfL https://github.com/mikefarah/yq/releases/download/v$YQ_VERSION/yq_linux_amd64 -o /usr/bin/yq
sudo chmod +x /usr/bin/yq
# Constants
PROM_PASSWORD=prometheus
# Cert details
COUNTRY=IL
STATE=Israel
CITY=Tel-Aviv
ORG=RoifGroup
OU=IT
# Gen password for prometheus
pip3 install bcrypt >/dev/null 2>&1
BCRYPT_PROM_PASS=$(python3 -c "import bcrypt; print(bcrypt.hashpw(b'$PROM_PASSWORD', bcrypt.gensalt()).decode())")
mkdir grafana grafana_ssl prometheus
# Create CA
cat <<EOF >san.cnf
[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no
[req_distinguished_name]
C = $COUNTRY
ST = $STATE
L = $CITY
O = $ORG
OU = $OU
CN = example
[v3_req]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = example
EOF
cat <<EOF >node_exporter.cnf
[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no
[req_distinguished_name]
C = $COUNTRY
ST = $STATE
L = $CITY
O = $ORG
OU = $OU
CN = node_exporter
[v3_req]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = node_exporter
EOF
openssl genrsa -out ca.key 4096
sed 's/example/prometheus-ca/g' san.cnf >ca.cnf
openssl req -new -x509 -key ca.key -out prometheus/ca.crt -days 3650 -config ca.cnf
# Create Prom certificate
openssl genrsa -out prometheus/prometheus.key 4096
sed 's/example/prometheus/g' san.cnf >prometheus.cnf
openssl req -new -key prometheus/prometheus.key -out prometheus.csr -config prometheus.cnf
openssl x509 -req -in prometheus.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out prometheus/prometheus.crt -days 3650 -extfile prometheus.cnf -extensions v3_req
# Create grafana certificate
openssl genrsa -out grafana_ssl/grafana.key 4096
sed 's/example/grafana/g' san.cnf >grafana.cnf
openssl req -new -key grafana_ssl/grafana.key -out grafana.csr -config grafana.cnf
openssl x509 -req -in grafana.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out grafana_ssl/grafana.crt -days 3650 -extfile grafana.cnf -extensions v3_req
# Create Node Exporter certificate
mkdir -p node_exporter_installer
openssl genrsa -out node_exporter_installer/node_exporter.key 4096
openssl req -new -key node_exporter_installer/node_exporter.key -out node_exporter.csr -config node_exporter.cnf
openssl x509 -req -in node_exporter.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out node_exporter_installer/node_exporter.crt -days 3650 -extfile node_exporter.cnf -extensions v3_req
# Indent the CA by 8 spaces so it nests under tlsCACert in datasource.yml
CA=$(cat prometheus/ca.crt | sed 's/^/        /')
cat <<EOF >prometheus/prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: []
      scheme: http
      timeout: 10s
      api_version: v2
scrape_configs:
  - job_name: prometheus
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.crt'
      cert_file: '/etc/prometheus/prometheus.crt'
      key_file: '/etc/prometheus/prometheus.key'
      server_name: 'prometheus'
    basic_auth:
      username: 'admin'
      password: '$PROM_PASSWORD'
    static_configs:
      - targets:
          - localhost:9090
  - job_name: nodes
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.crt'
      cert_file: '/etc/prometheus/prometheus.crt'
      key_file: '/etc/prometheus/prometheus.key'
      server_name: 'node_exporter'
    static_configs:
      - targets:
          - 10.200.0.80:9100
  - job_name: gpus
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.crt'
      cert_file: '/etc/prometheus/prometheus.crt'
      key_file: '/etc/prometheus/prometheus.key'
      server_name: 'node_exporter'
    static_configs:
      - targets:
          - 10.200.0.80:9835
EOF
cat <<EOF >prometheus/web-config.yml
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
  client_ca_file: ca.crt
  client_auth_type: VerifyClientCertIfGiven
basic_auth_users:
  admin: $BCRYPT_PROM_PASS
EOF
cat <<EOF >grafana/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: https://prometheus:9090
    basicAuth: true
    basicAuthUser: admin
    basicAuthPassword: $PROM_PASSWORD
    jsonData:
      tlsAuthWithCACert: true
    secureJsonData:
      tlsCACert: |
$CA
      basicAuthPassword: $PROM_PASSWORD
    isDefault: true
    access: proxy
    editable: true
EOF
sudo chown 0:0 -R grafana
sudo chown 65534:65534 -R prometheus
sudo chown 472:0 -R grafana_ssl/*
cp prometheus/ca.crt node_exporter_installer/
docker compose up -d
You might look at this and think: what a mess! Well, it kind of is… but it gets the job done. Basically, this creates a CA plus server and client certificates for my setup, so I do not leave everything open over HTTP for everyone to explore. I left Prometheus with basic auth since I wanted to debug some things, and I was too lazy to move it to mTLS after I was done, but the switch is pretty simple. The script then starts our two containers (Prometheus and Grafana) with certificates signed by our own self-signed CA.
Now we need to deploy the agents. I needed both the Nvidia GPU exporter and the regular node exporter to collect my data. No need to deploy those manually, so:
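For reference, the switch to mTLS mostly comes down to the `client_auth_type` value in the Prometheus web config. The following is a sketch, not a file from this setup: `write_mtls_web_config` is a hypothetical helper, and it assumes the same certificate file names the script generates.

```shell
#!/bin/bash
# Write a Prometheus web config that demands a client certificate signed
# by our CA and drops the basic_auth_users section. $1 is the output path.
write_mtls_web_config() {
  local out="$1"
  cat <<'EOF' >"$out"
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
  client_ca_file: ca.crt
  client_auth_type: RequireAndVerifyClientCert
EOF
}
```

With this in place every client has to present a certificate, so the Grafana datasource would also need `tlsAuth: true` under `jsonData` plus `tlsClientCert` and `tlsClientKey` under `secureJsonData`, and a browser would need a client certificate to open the Prometheus UI.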
#!/bin/bash
set -x
NODE_EXPORTER_VERSION=1.8.2
GPU_EXPORTER_VERSION=1.2.1
[ -f /usr/bin/nvidia_gpu_exporter ] && echo "Exporter already installed" && exit 0
curl https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${GPU_EXPORTER_VERSION}/nvidia_gpu_exporter_${GPU_EXPORTER_VERSION}_linux_x86_64.tar.gz -OsfL
sudo tar -C /usr/bin/ -xvzf nvidia_gpu_exporter_${GPU_EXPORTER_VERSION}_linux_x86_64.tar.gz nvidia_gpu_exporter
rm nvidia_gpu_exporter_${GPU_EXPORTER_VERSION}_linux_x86_64.tar.gz
curl https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz -OsfL
tar -xvzf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter
sudo mv node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter /usr/bin/
rm -rf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz node_exporter-$NODE_EXPORTER_VERSION.linux-amd64
sudo useradd --system --no-create-home --shell /usr/sbin/nologin nvidia_gpu_exporter
cat <<EOF >nvidia_gpu_exporter.service
[Unit]
Description=Nvidia GPU Exporter
After=network-online.target
[Service]
Type=simple
User=nvidia_gpu_exporter
Group=nvidia_gpu_exporter
ExecStart=/usr/bin/nvidia_gpu_exporter --web.config.file=/etc/node_exporter/web-config.yml
SyslogIdentifier=nvidia_gpu_exporter
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
cat <<EOF >node_exporter.service
[Unit]
Description=Node Exporter
After=network-online.target
[Service]
Type=simple
User=nvidia_gpu_exporter
Group=nvidia_gpu_exporter
ExecStart=/usr/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml
SyslogIdentifier=node_exporter
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
sudo mkdir -p /etc/node_exporter
cat <<EOF >web-config.yml
tls_server_config:
  cert_file: "/etc/node_exporter/node_exporter.crt"
  key_file: "/etc/node_exporter/node_exporter.key"
  client_ca_file: "/etc/node_exporter/ca.crt"
  client_auth_type: "RequireAndVerifyClientCert"
EOF
sudo cp web-config.yml node_exporter_installer/node_exporter.* node_exporter_installer/ca.crt /etc/node_exporter/
sudo chown -R nvidia_gpu_exporter: /etc/node_exporter
sudo chmod -R 440 /etc/node_exporter/*
sudo mv nvidia_gpu_exporter.service node_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia_gpu_exporter node_exporter
This basically sets up the agents we need with all the certificates. I used the same node_exporter certificate for all my agents. I tried to think what the harm in that would be, could not come up with anything, and so left it at one certificate.
Final step: add the IP of the agent to the Prometheus scrape config:
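One side effect of the shared certificate is that every agent presents the name `node_exporter`, which is why the scrape jobs set `server_name`. Testing an agent by hand needs the same trick. The sketch below only prints the curl command so it can be reviewed before running; `agent_check_cmd` is a made-up helper, and the paths assume the directory where the first script ran.

```shell
#!/bin/bash
# Print a curl command that scrapes one agent over mTLS.
# $1 = agent IP, $2 = port (9100 node exporter, 9835 GPU exporter).
# --resolve pins the certificate name node_exporter to the agent IP, so
# hostname verification passes without touching DNS or /etc/hosts.
agent_check_cmd() {
  local ip="$1" port="$2"
  echo "curl --cacert prometheus/ca.crt" \
       "--cert prometheus/prometheus.crt --key prometheus/prometheus.key" \
       "--resolve node_exporter:$port:$ip" \
       "https://node_exporter:$port/metrics"
}
```

Running `agent_check_cmd 10.200.0.80 9835` prints the command for the GPU exporter. Since the first script hands the prometheus directory to another user, the printed command may need sudo to read the key.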
#!/bin/bash
# Const
NODE_EXPORTER_PORT=9100
GPU_EXPORTER_PORT=9835
[ $# -ne 1 ] && echo -e "Error: No argument provided, enter in the following syntax:\n ./add_gpu_system 192.168.1.30" >&2 && exit 1
sudo yq e '.scrape_configs[] |= select(.job_name == "gpus") |= (.static_configs[0].targets += "'"$1:$GPU_EXPORTER_PORT"'")' -i prometheus/prometheus.yml
sudo yq e '.scrape_configs[] |= select(.job_name == "nodes") |= (.static_configs[0].targets += "'"$1:$NODE_EXPORTER_PORT"'")' -i prometheus/prometheus.yml
docker restart prometheus
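The script passes its argument straight into the yq expression, so a typo ends up in prometheus.yml and running it twice adds the same target twice. A small guard could look like the sketch below; `is_ipv4` and `has_target` are hypothetical helpers, not part of the script above, and `has_target` assumes each target sits on its own `- ip:port` line.

```shell
#!/bin/bash
# Return 0 if $1 is a dotted-quad IPv4 address with octets 0-255.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.
  for octet in $ip; do
    # 10# forces base 10 so octets like 08 are not parsed as octal
    (( 10#$octet <= 255 )) || return 1
  done
}

# Return 0 if target $2 (ip:port) is already listed in config file $1.
# Lines are compared after stripping surrounding whitespace.
has_target() {
  awk -v t="- $2" '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if ($0 == t) found = 1
  } END { exit !found }' "$1"
}
```

The add script could then bail out early with something like `is_ipv4 "$1" || exit 1` and skip the yq call when `has_target prometheus/prometheus.yml "$1:$GPU_EXPORTER_PORT"` succeeds.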
And voilà! We have a working system. Now just add the dashboards we need and we are basically done. The final result looks something like this:

If you need to remove an agent or clean up the environment, I added a few scripts to do that too.
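To give an idea of what removing an agent involves, here is a simplified sketch (not the script from the repo). `remove_target` is a hypothetical helper that drops one `ip:port` entry from the generated prometheus.yml; it assumes each target sits on its own `- ip:port` line, as in the file generated earlier.

```shell
#!/bin/bash
# Remove the list entry "- <ip:port>" from a Prometheus config file.
# $1 = config file, $2 = target such as 192.168.1.30:9835.
remove_target() {
  local file="$1" target="$2" tmp
  tmp=$(mktemp) || return 1
  # Compare each line, stripped of surrounding whitespace, with the exact
  # list entry; every other line is copied through unchanged.
  awk -v t="- $target" '{
    line = $0
    gsub(/^[ \t]+|[ \t]+$/, "", line)
    if (line != t) print
  }' "$file" >"$tmp" && cat "$tmp" >"$file"
  rm -f "$tmp"
}
```

An agent has one entry per job, so it takes one call with port 9835 and one with port 9100, followed by `docker restart prometheus`. The config file belongs to another user after the setup script ran, so this would have to run as root.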
Now, deployment takes me less than a minute.

