<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[roifgroup]]></title><description><![CDATA[roifgroup]]></description><link>https://blog.roifgroup.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1736636539615/fb7c8421-7925-49c2-8f15-a6f96e7b5c8d.png</url><title>roifgroup</title><link>https://blog.roifgroup.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 24 May 2026 01:59:11 GMT</lastBuildDate><atom:link href="https://blog.roifgroup.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Monitoring Nvidia GPUs without fansy tools - securely]]></title><description><![CDATA[TL;DR: Not everyone needs a K8S environment where everything is automated and over engineered. Sometimes you just need a small prometheus & grafana to collect data. This is a little automated to start and destroy a monitoring solution. You can skip t...]]></description><link>https://blog.roifgroup.com/monitoring-nvidia-gpus-without-fansy-tools-securely</link><guid isPermaLink="true">https://blog.roifgroup.com/monitoring-nvidia-gpus-without-fansy-tools-securely</guid><category><![CDATA[Grafana]]></category><category><![CDATA[#prometheus]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[GPU]]></category><category><![CDATA[Docker]]></category><dc:creator><![CDATA[RoifGroup]]></dc:creator><pubDate>Sat, 11 Jan 2025 23:40:02 GMT</pubDate><content:encoded><![CDATA[<p><strong>TL;DR: Not everyone needs a K8S environment where everything is automated and over-engineered. Sometimes you just need a small Prometheus &amp; Grafana setup to collect data. This is a little automation to start and destroy a monitoring solution. You can skip to the end for the link to the GitHub repo.</strong></p>
<p>When we think of monitoring GPUs, we usually think of multi-cluster configurations or a bunch of enterprise Nvidia GPUs. But what about a small office or a home environment where you just want to see what you are using? So I thought, why not use docker compose for a little automation that runs the monitoring and deletes it when I am done.</p>
<p>The first thing was to find a way to scrape metrics from my GPU. What Google turned up was DCGM from Nvidia; however, this proved to be pointless for me as it works with datacenter drivers only. Then I found this cool project: <a target="_blank" href="https://github.com/utkuozdemir/nvidia_gpu_exporter">https://github.com/utkuozdemir/nvidia_gpu_exporter</a>. This was exactly what I needed, nothing fancy, it just collects all the metrics from the nvidia-smi command.</p>
<p>I started by looking for an existing docker compose file. That was pretty fast, but I found that all the current solutions on the web are extremely lacking and OPEN! No TLS, not even basic authentication. We cannot have that… So I got to work.</p>
<p>Some might say: well, what is wrong with http, these are only metrics? I will list just a few of the problems:</p>
<ol>
<li><p>Information leakage: please do not give an attacker information about what you are running in your environment, it makes it a lot easier to attack. They can expose APIs etc. Not to mention what happens if they find a way in through that exporter.</p>
</li>
<li><p>This is an open invitation to launch a DoS attack on you.</p>
</li>
<li><p>If they know what you are running this opens you up to supply chain attacks, data exfiltration etc.</p>
</li>
</ol>
<p>You get the picture, open ports without authentication = bad idea!</p>
<p>My first step was to create a docker compose file; that was pretty simple.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">prometheus:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">prom/prometheus</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">prometheus</span>
    <span class="hljs-attr">command:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">'--config.file=/etc/prometheus/prometheus.yml'</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">'--web.config.file=/etc/prometheus/web-config.yml'</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-number">9090</span><span class="hljs-string">:9090</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">unless-stopped</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./prometheus:/etc/prometheus</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">prom_data:/prometheus</span>
  <span class="hljs-attr">grafana:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/grafana</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">grafana</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-number">3000</span><span class="hljs-string">:3000</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">unless-stopped</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">GF_SERVER_PROTOCOL=https</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">GF_SERVER_CERT_FILE=/var/lib/grafana/ssl/grafana.crt</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">GF_SERVER_CERT_KEY=/var/lib/grafana/ssl/grafana.key</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">GF_SECURITY_ADMIN_USER=admin</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">GF_SECURITY_ADMIN_PASSWORD=grafana</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./grafana:/etc/grafana/provisioning/datasources</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./grafana_ssl:/var/lib/grafana/ssl</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">grafanadb:/var/lib/grafana</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">prom_data:</span>
  <span class="hljs-attr">grafanadb:</span>
</code></pre>
<p>Next we need to create all the certificates, configuration files, passwords etc. I could do that manually, but in about a week I would forget what I did and how it worked. So I created a little unorganized script to build it all for me (no judgement please - it was for my home network):</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

[ -d grafana ] &amp;&amp; <span class="hljs-built_in">echo</span> <span class="hljs-string">"Already installed"</span> &amp;&amp; <span class="hljs-built_in">exit</span> 0

YQ_VERSION=4.44.6
sudo curl -sfL https://github.com/mikefarah/yq/releases/download/v<span class="hljs-variable">$YQ_VERSION</span>/yq_linux_amd64 -o /usr/bin/yq
sudo chmod +x /usr/bin/yq

<span class="hljs-comment"># Constants</span>
PROM_PASSWORD=prometheus

<span class="hljs-comment"># Cert details</span>
COUNTRY=IL
STATE=Israel
CITY=Tel-Aviv
ORG=RoifGroup
OU=IT

<span class="hljs-comment"># Gen password for prometheus</span>
pip3 install bcrypt &gt;/dev/null 2&gt;&amp;1
BCRYPT_PROM_PASS=$(python3 -c <span class="hljs-string">"import bcrypt; print(bcrypt.hashpw(b'<span class="hljs-variable">$PROM_PASSWORD</span>', bcrypt.gensalt()).decode())"</span>)

mkdir grafana grafana_ssl prometheus

<span class="hljs-comment"># Create CA</span>
cat &lt;&lt;EOF &gt;san.cnf
[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no

[req_distinguished_name]
C = <span class="hljs-variable">$COUNTRY</span>
ST = <span class="hljs-variable">$STATE</span>
L = <span class="hljs-variable">$CITY</span>
O = <span class="hljs-variable">$ORG</span>
OU = <span class="hljs-variable">$OU</span>
CN = example

[v3_req]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = example
EOF

cat &lt;&lt;EOF &gt;node_exporter.cnf
[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no

[req_distinguished_name]
C = <span class="hljs-variable">$COUNTRY</span>
ST = <span class="hljs-variable">$STATE</span>
L = <span class="hljs-variable">$CITY</span>
O = <span class="hljs-variable">$ORG</span>
OU = <span class="hljs-variable">$OU</span>
CN = node_exporter

[v3_req]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = node_exporter
EOF

openssl genrsa -out ca.key 4096
sed <span class="hljs-string">'s/example/prometheus-ca/g'</span> san.cnf &gt;ca.cnf
openssl req -new -x509 -key ca.key -out prometheus/ca.crt -days 3650 -config ca.cnf

<span class="hljs-comment"># Create Prom certificate</span>
openssl genrsa -out prometheus/prometheus.key 4096
sed <span class="hljs-string">'s/example/prometheus/g'</span> san.cnf &gt;prometheus.cnf
openssl req -new -key prometheus/prometheus.key -out prometheus.csr -config prometheus.cnf
openssl x509 -req -<span class="hljs-keyword">in</span> prometheus.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out prometheus/prometheus.crt -days 3650 -extfile prometheus.cnf -extensions v3_req

<span class="hljs-comment"># Create grafana certificate</span>
openssl genrsa -out grafana_ssl/grafana.key 4096
sed <span class="hljs-string">'s/example/grafana/g'</span> san.cnf &gt;grafana.cnf
openssl req -new -key grafana_ssl/grafana.key -out grafana.csr -config grafana.cnf
openssl x509 -req -<span class="hljs-keyword">in</span> grafana.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out grafana_ssl/grafana.crt -days 3650 -extfile grafana.cnf -extensions v3_req

<span class="hljs-comment"># Create Node Exporter certificate</span>
mkdir -p node_exporter_installer
openssl genrsa -out node_exporter_installer/node_exporter.key 4096
openssl req -new -key node_exporter_installer/node_exporter.key -out node_exporter.csr -config node_exporter.cnf
openssl x509 -req -<span class="hljs-keyword">in</span> node_exporter.csr -CA prometheus/ca.crt -CAkey ca.key -CAcreateserial -out node_exporter_installer/node_exporter.crt -days 3650 -extfile node_exporter.cnf -extensions v3_req

CA=$(cat prometheus/ca.crt | sed <span class="hljs-string">'s/^/      /'</span>)

cat &lt;&lt;EOF &gt;prometheus/prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets: []
    scheme: http
    timeout: 10s
    api_version: v2
scrape_configs:
- job_name: prometheus
  honor_timestamps: <span class="hljs-literal">true</span>
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    ca_file: <span class="hljs-string">'/etc/prometheus/ca.crt'</span>
    cert_file: <span class="hljs-string">'/etc/prometheus/prometheus.crt'</span>
    key_file: <span class="hljs-string">'/etc/prometheus/prometheus.key'</span>
    server_name: <span class="hljs-string">'prometheus'</span>
  basic_auth:
    username: <span class="hljs-string">'admin'</span>
    password: <span class="hljs-string">'$PROM_PASSWORD'</span>
  static_configs:
  - targets:
    - localhost:9090
- job_name: nodes
  honor_timestamps: <span class="hljs-literal">true</span>
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    ca_file: <span class="hljs-string">'/etc/prometheus/ca.crt'</span>
    cert_file: <span class="hljs-string">'/etc/prometheus/prometheus.crt'</span>
    key_file: <span class="hljs-string">'/etc/prometheus/prometheus.key'</span>
    server_name: <span class="hljs-string">'node_exporter'</span>
  static_configs:
  - targets:
    - 10.200.0.80:9100
- job_name: gpus
  honor_timestamps: <span class="hljs-literal">true</span>
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    ca_file: <span class="hljs-string">'/etc/prometheus/ca.crt'</span>
    cert_file: <span class="hljs-string">'/etc/prometheus/prometheus.crt'</span>
    key_file: <span class="hljs-string">'/etc/prometheus/prometheus.key'</span>
    server_name: <span class="hljs-string">'node_exporter'</span>
  static_configs:
  - targets:
    - 10.200.0.80:9835
EOF

cat &lt;&lt;EOF &gt;prometheus/web-config.yml
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
  client_ca_file: ca.crt
  client_auth_type: VerifyClientCertIfGiven
basic_auth_users:
  admin: <span class="hljs-variable">$BCRYPT_PROM_PASS</span>
EOF

cat &lt;&lt;EOF &gt;grafana/datasource.yml
apiVersion: 1

datasources:
- name: Prometheus
  <span class="hljs-built_in">type</span>: prometheus
  url: https://prometheus:9090
  basicAuth: <span class="hljs-literal">true</span>
  basicAuthUser: admin
  basicAuthPassword: <span class="hljs-variable">$PROM_PASSWORD</span>
  jsonData:
    tlsAuthWithCACert: <span class="hljs-literal">true</span>
  secureJsonData:
    tlsCACert: |
<span class="hljs-variable">$CA</span>
    basicAuthPassword: <span class="hljs-variable">$PROM_PASSWORD</span>
  isDefault: <span class="hljs-literal">true</span>
  access: proxy
  editable: <span class="hljs-literal">true</span>
EOF

sudo chown 0:0 -R grafana
sudo chown 65534:65534 -R prometheus
sudo chown 472:0 -R grafana_ssl/*
cp prometheus/ca.crt node_exporter_installer/

docker compose up -d
</code></pre>
<p>You see this and think, what a mess! Well, it kind of is… But it gets the job done. Basically this creates a CA plus server and client certificates for my setup, so I do not leave everything open over http for everyone to explore. I left Prometheus with basic auth since I wanted to debug some things and I was too lazy to move it to mTLS too after I was done, but it is pretty simple to do the switch. The end result is our two containers (Prometheus and Grafana) running with certificates signed by our own CA.</p>
<p>Now we need to deploy the agents. I needed both the Nvidia GPU exporter and the regular node exporter to collect my data. No need to deploy those manually, so:</p>
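<p>For the curious, switching Prometheus itself to mTLS could look roughly like the sketch below. It is not part of my scripts: the require_client_certs helper is a name made up for illustration, it assumes GNU sed, and it only rewrites the client_auth_type line that the install script writes to prometheus/web-config.yml.</p>
<pre><code class="lang-bash">#!/bin/bash

# Hypothetical helper (not in the repo): make Prometheus demand a client
# certificate instead of only checking one when it is offered.
# $1 = path to the web config file written by the install script
require_client_certs() {
  # GNU sed: rewrite the client_auth_type line in place, keep its indentation.
  sed -i 's/^\( *client_auth_type:\).*/\1 RequireAndVerifyClientCert/' "$1"
}

# Example (the prometheus directory belongs to uid 65534 after the install,
# so this most likely needs root):
#   require_client_certs prometheus/web-config.yml; docker restart prometheus
</code></pre>
<p>Keep in mind that after this change anything without a certificate signed by our CA gets rejected during the TLS handshake, so the Grafana datasource (and your browser) would need a client certificate and key as well. That part is not shown here.</p>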
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

<span class="hljs-built_in">set</span> -x

NODE_EXPORTER_VERSION=1.8.2
GPU_EXPORTER_VERSION=1.2.1

[ -f /usr/bin/nvidia_gpu_exporter ] &amp;&amp; <span class="hljs-built_in">echo</span> <span class="hljs-string">"Exporter already installed"</span> &amp;&amp; <span class="hljs-built_in">exit</span> 0

curl https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v<span class="hljs-variable">${GPU_EXPORTER_VERSION}</span>/nvidia_gpu_exporter_<span class="hljs-variable">${GPU_EXPORTER_VERSION}</span>_linux_x86_64.tar.gz -OsfL
sudo tar -C /usr/bin/ -xvzf nvidia_gpu_exporter_<span class="hljs-variable">${GPU_EXPORTER_VERSION}</span>_linux_x86_64.tar.gz nvidia_gpu_exporter
rm nvidia_gpu_exporter_<span class="hljs-variable">${GPU_EXPORTER_VERSION}</span>_linux_x86_64.tar.gz

curl https://github.com/prometheus/node_exporter/releases/download/v<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>/node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64.tar.gz -OsfL
tar -xvzf node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64.tar.gz node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64/node_exporter
sudo mv node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64/node_exporter /usr/bin/
rm -rf node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64.tar.gz node_exporter-<span class="hljs-variable">$NODE_EXPORTER_VERSION</span>.linux-amd64

sudo useradd --system --no-create-home --shell /usr/sbin/nologin nvidia_gpu_exporter

cat &lt;&lt;EOF &gt;nvidia_gpu_exporter.service
[Unit]
Description=Nvidia GPU Exporter
After=network-online.target

[Service]
Type=simple

User=nvidia_gpu_exporter
Group=nvidia_gpu_exporter

ExecStart=/usr/bin/nvidia_gpu_exporter --web.config.file=/etc/node_exporter/web-config.yml

SyslogIdentifier=nvidia_gpu_exporter

Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

cat &lt;&lt;EOF &gt;node_exporter.service
[Unit]
Description=Node Exporter
After=network-online.target

[Service]
Type=simple

User=nvidia_gpu_exporter
Group=nvidia_gpu_exporter

ExecStart=/usr/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml

SyslogIdentifier=node_exporter

Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

sudo mkdir -p /etc/node_exporter
cat &lt;&lt;EOF &gt;web-config.yml
tls_server_config:
  cert_file: <span class="hljs-string">"/etc/node_exporter/node_exporter.crt"</span>
  key_file: <span class="hljs-string">"/etc/node_exporter/node_exporter.key"</span>
  client_ca_file: <span class="hljs-string">"/etc/node_exporter/ca.crt"</span>
  client_auth_type: <span class="hljs-string">"RequireAndVerifyClientCert"</span>
EOF

sudo cp web-config.yml node_exporter_installer/node_exporter.* node_exporter_installer/ca.crt /etc/node_exporter/
sudo chown -R nvidia_gpu_exporter: /etc/node_exporter
sudo chmod -R 440 /etc/node_exporter/*

sudo mv nvidia_gpu_exporter.service node_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl <span class="hljs-built_in">enable</span> --now nvidia_gpu_exporter node_exporter
</code></pre>
<p>This basically sets up the agents we need with all the certificates. I used the same node_exporter certificate for all my agents. I tried to think what the harm in that is and could not come up with anything, so I left it using one certificate.</p>
<p>Final step, add the IP of the agent to Prometheus so it gets scraped:</p>
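<p>To check that an agent really answers only to our certificates, something like this sketch can be run from the monitoring host. The check_exporter name is made up for this post. It assumes curl is installed and that you run it from the directory holding the prometheus folder (the key belongs to uid 65534 after the install, so it may need sudo).</p>
<pre><code class="lang-bash">#!/bin/bash

# Hypothetical helper: print the HTTP status code an exporter answers with
# when we present the Prometheus client certificate, like a real scrape does.
# $1 = agent IP, $2 = port (9100 node_exporter, 9835 nvidia_gpu_exporter)
check_exporter() {
  local ip=$1 port=$2
  # --resolve pins the certificate name "node_exporter" to the agent IP,
  # so hostname verification passes without touching DNS or /etc/hosts.
  curl --silent --output /dev/null --write-out '%{http_code}\n' \
    --cacert prometheus/ca.crt \
    --cert prometheus/prometheus.crt \
    --key prometheus/prometheus.key \
    --resolve "node_exporter:${port}:${ip}" \
    "https://node_exporter:${port}/metrics"
}

# Example: check_exporter 10.200.0.80 9835
</code></pre>
<p>A healthy agent should answer with 200. Running the same request without the --cert and --key options should fail during the TLS handshake, which is exactly the point of RequireAndVerifyClientCert.</p>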
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

<span class="hljs-comment"># Const</span>
NODE_EXPORTER_PORT=9100
GPU_EXPORTER_PORT=9835

[ <span class="hljs-variable">$#</span> -ne 1 ] &amp;&amp; <span class="hljs-built_in">echo</span> -e <span class="hljs-string">"Error: No argument provided, enter in the following syntax:\n ./add_gpu_system 192.168.1.30"</span> &amp;&amp; <span class="hljs-built_in">exit</span> 1

sudo yq e <span class="hljs-string">'.scrape_configs[] |= select(.job_name == "gpus") |= (.static_configs[0].targets += "'</span><span class="hljs-string">"<span class="hljs-variable">$1</span>:<span class="hljs-variable">$GPU_EXPORTER_PORT</span>"</span><span class="hljs-string">'")'</span> -i prometheus/prometheus.yml
sudo yq e <span class="hljs-string">'.scrape_configs[] |= select(.job_name == "nodes") |= (.static_configs[0].targets += "'</span><span class="hljs-string">"<span class="hljs-variable">$1</span>:<span class="hljs-variable">$NODE_EXPORTER_PORT</span>"</span><span class="hljs-string">'")'</span> -i prometheus/prometheus.yml

docker restart prometheus
</code></pre>
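<p>The script above trusts whatever you pass it, so a typo ends up straight in prometheus.yml. A small guard like the following sketch could go in front of the yq calls. The is_ipv4 function is my own illustration and is not in the repo.</p>
<pre><code class="lang-bash">#!/bin/bash

# Hypothetical helper: succeed only for a dotted IPv4 address such as
# 192.168.1.30, so typos never reach prometheus.yml.
is_ipv4() {
  local ip=$1 octet
  # Shape check first: four groups of one to three digits.
  [[ $ip =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.
  for octet in $ip; do
    # 10# forces base 10 so an octet like 08 is not read as octal.
    [ "$((10#$octet))" -le 255 ] || return 1
  done
}

# Example use at the top of the script:
#   is_ipv4 "$1" || exit 1
</code></pre>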
<p>And voila! We have a working system. Now just add the dashboards we need and we are basically done. Final result looks something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736638166398/2a96183f-80c4-4dda-9a5d-1009ebf50891.png" alt class="image--center mx-auto" /></p>
<p>If you need to remove an agent or clean up the environment, I added a few scripts to do that too.</p>
<p>Now, deployment takes me less than a minute.</p>
<p><a target="_blank" href="https://github.com/roifgroup/gpu-monitoring-prometheus">All the code is right here</a></p>
]]></content:encoded></item></channel></rss>