Nomad nvidia-device-plugin

Just so I don't have to look it up over and over again: to detect Nomad client nodes and the GPUs they are equipped with, there is, among other things, the nvidia-device-plugin, which Nomad uses for "fingerprinting" nodes.

Trying it out is as simple as it gets: download the plugin, place it in Nomad's plugin directory, and start Nomad. For production use, the plugin should be listed explicitly in the configuration, as sketched below.
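A minimal sketch of such a client configuration, assuming the plugin binary was placed in /opt/nomad/plugins (the path and the fingerprint_period value are examples; adjust them to your setup):

# Client configuration sketch, e.g. /etc/nomad.d/client.hcl
plugin_dir = "/opt/nomad/plugins"

plugin "nvidia-gpu" {
  config {
    enabled = true
    # Optional: how often the plugin re-fingerprints the GPUs (example value).
    fingerprint_period = "1m"
  }
}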

On startup, Nomad should then write at least

2024-03-22T11:50:43.126Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0

to its log file. Once the node has been fingerprinted, you can determine the node's ID with nomad node status (run without arguments, it lists all nodes) and then use that ID to display further information:

ubuntu@l40-single:~/tb$ ./bin/nomad node status f67d07fe 
ID              = f67d07fe-e3fc-a925-2d31-4fb4d3ab69df
Name            = l40-single
Node Pool       = default
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 117h56m0s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,raw_exec

Node Events
Time                  Subsystem  Message
2024-03-22T11:50:44Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/32000 MHz  0 B/63 GiB  0 B/152 GiB

Allocation Resource Utilization
CPU          Memory
0/32000 MHz  0 B/63 GiB

Host Resource Utilization
CPU           Memory          Disk
19/32000 MHz  834 MiB/63 GiB  344 GiB/496 GiB

Device Resource Utilization
nvidia/gpu/NVIDIA L40[GPU-43ef7714-0dcc-749b-39f0-257a9d4512af]  690 / 46068 MiB

Allocations
No allocations placed

The -verbose option yields even more information:

ubuntu@l40-single:~/tb$ ./bin/nomad node status -verbose f67d07fe 
ID              = f67d07fe-e3fc-a925-2d31-4fb4d3ab69df
Name            = l40-single
Node Pool       = default
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 117h59m17s

Drivers
Driver    Detected  Healthy  Message  Time
docker    true      true     Healthy  2024-03-22T11:50:43Z
exec      true      true     Healthy  2024-03-22T11:50:43Z
java      true      true     Healthy  2024-03-22T11:50:43Z
qemu      false     false    <none>   2024-03-22T11:50:43Z
raw_exec  true      true     Healthy  2024-03-22T11:50:43Z

Node Events
Time                  Subsystem  Message          Details
2024-03-22T11:50:44Z  Cluster    Node registered  <none>

Allocated Resources
CPU          Memory      Disk
0/32000 MHz  0 B/63 GiB  0 B/152 GiB

Allocation Resource Utilization
CPU          Memory
0/32000 MHz  0 B/63 GiB

Host Resource Utilization
CPU            Memory          Disk
140/32000 MHz  847 MiB/63 GiB  344 GiB/496 GiB

Device Resource Utilization
nvidia/gpu/NVIDIA L40[GPU-43ef7714-0dcc-749b-39f0-257a9d4512af]  690 / 46068 MiB

Allocations
No allocations placed

Attributes
cpu.arch                  = amd64
cpu.frequency.efficiency  = 2000
cpu.frequency.performance = 0
cpu.modelname             = AMD EPYC-Milan Processor
cpu.numcores              = 16
cpu.numcores.efficiency   = 16
cpu.numcores.performance  = 0
cpu.reservablecores       = 16
cpu.totalcompute          = 32000
cpu.usablecompute         = 32000
driver.docker             = 1
driver.docker.bridge_ip   = 172.17.0.1
driver.docker.os_type     = linux
driver.docker.runtimes    = io.containerd.runc.v2,nvidia,runc
driver.docker.version     = 25.0.4
driver.exec               = 1
driver.java               = 1
driver.java.runtime       = OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
driver.java.version       = 11.0.22
driver.java.vm            = OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
driver.raw_exec           = 1
kernel.arch               = x86_64
kernel.landlock           = v1
kernel.name               = linux
kernel.version            = 5.15.0-100-generic
memory.totalbytes         = 67420852224
nomad.advertise.address   = 127.0.0.1:4646
nomad.bridge.hairpin_mode = false
nomad.revision            = 594fedbfbc4f0e532b65e8a69b28ff9403eb822e
nomad.service_discovery   = true
nomad.version             = 1.7.6
numa.node.count           = 1
numa.node0.cores          = 0-15
os.cgroups.version        = 2
os.name                   = ubuntu
os.signals                = SIGILL,SIGINT,SIGPROF,SIGSEGV,SIGUSR2,SIGXFSZ,SIGABRT,SIGHUP,SIGQUIT,SIGXCPU,SIGWINCH,SIGBUS,SIGCONT,SIGSTOP,SIGTERM,SIGTRAP,SIGTTIN,SIGIO,SIGTSTP,SIGUSR1,SIGPIPE,SIGNULL,SIGKILL,SIGSYS,SIGTTOU,SIGALRM,SIGFPE,SIGIOT
os.version                = 22.04
unique.hostname           = l40-single
unique.network.ip-address = 127.0.0.1
unique.storage.bytesfree  = 163323670528
unique.storage.bytestotal = 532608356352
unique.storage.volume     = /dev/sda1

Device Group Attributes
Device Group     = nvidia/gpu/NVIDIA L40
bar1             = 65536 MiB
cores_clock      = 2490 MHz
display_state    = Enabled
driver_version   = 545.23.08
memory_clock     = 9001 MHz
memory           = 46068 MiB
pci_bandwidth    = 31504 MB/s
persistence_mode = Disabled
power            = 300 W

Meta
connect.gateway_image     = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level         = info
connect.proxy_concurrency = 1
connect.sidecar_image     = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}

Alle "gefundenen" Attribute können natürlich in Nomad Jobs für die Selektion von Nodes verwendet werden.