Just so I don't have to keep looking this up: to detect Nomad client nodes and the GPUs they are equipped with, there is, among other options, the nvidia-device-plugin, which Nomad uses for "fingerprinting" nodes.
Using it in a test setup is about as simple as it gets: download the plugin, place it in Nomad's plugin directory, and start Nomad. For production use, the plugin should be listed explicitly in the configuration.
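For the production case, the entry in the agent configuration could look roughly like the sketch below. It assumes that the plugin binary is named nomad-device-nvidia (the label of the plugin block has to match the file name of the binary in the plugin directory) and that the plugin directory is /opt/nomad/plugins; both are placeholders and need to be adapted to the actual installation.

```hcl
# Sketch of a Nomad client agent configuration fragment.
# Assumption: the plugin directory path is a placeholder.
plugin_dir = "/opt/nomad/plugins"

# Assumption: the plugin binary in plugin_dir is called "nomad-device-nvidia".
plugin "nomad-device-nvidia" {
  config {
    # Enable GPU fingerprinting on this node.
    enabled = true

    # How often the plugin re-checks the GPUs on the node.
    fingerprint_period = "1m"
  }
}
```

Nomad has to be restarted after the change so that the plugin is loaded with this configuration.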
On startup, Nomad should then log at least the following line:
2024-03-22T11:50:43.126Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0
Once the node has been fingerprinted, you can look up its ID with nomad node status and then pass that ID to the same command to display further details:
ubuntu@l40-single:~/tb$ ./bin/nomad node status f67d07fe
ID = f67d07fe-e3fc-a925-2d31-4fb4d3ab69df
Name = l40-single
Node Pool = default
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 117h56m0s
Host Volumes = <none>
Host Networks = <none>
CSI Volumes = <none>
Driver Status = docker,exec,java,raw_exec
Node Events
Time                  Subsystem  Message
2024-03-22T11:50:44Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/32000 MHz  0 B/63 GiB  0 B/152 GiB

Allocation Resource Utilization
CPU          Memory
0/32000 MHz  0 B/63 GiB

Host Resource Utilization
CPU           Memory          Disk
19/32000 MHz  834 MiB/63 GiB  344 GiB/496 GiB

Device Resource Utilization
nvidia/gpu/NVIDIA L40[GPU-43ef7714-0dcc-749b-39f0-257a9d4512af]  690 / 46068 MiB

Allocations
No allocations placed
The -verbose option shows even more information:
ubuntu@l40-single:~/tb$ ./bin/nomad node status -verbose f67d07fe
ID = f67d07fe-e3fc-a925-2d31-4fb4d3ab69df
Name = l40-single
Node Pool = default
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 117h59m17s
Drivers
Driver    Detected  Healthy  Message  Time
docker    true      true     Healthy  2024-03-22T11:50:43Z
exec      true      true     Healthy  2024-03-22T11:50:43Z
java      true      true     Healthy  2024-03-22T11:50:43Z
qemu      false     false    <none>   2024-03-22T11:50:43Z
raw_exec  true      true     Healthy  2024-03-22T11:50:43Z

Node Events
Time                  Subsystem  Message          Details
2024-03-22T11:50:44Z  Cluster    Node registered  <none>

Allocated Resources
CPU          Memory      Disk
0/32000 MHz  0 B/63 GiB  0 B/152 GiB

Allocation Resource Utilization
CPU          Memory
0/32000 MHz  0 B/63 GiB

Host Resource Utilization
CPU            Memory          Disk
140/32000 MHz  847 MiB/63 GiB  344 GiB/496 GiB

Device Resource Utilization
nvidia/gpu/NVIDIA L40[GPU-43ef7714-0dcc-749b-39f0-257a9d4512af]  690 / 46068 MiB

Allocations
No allocations placed
Attributes
cpu.arch = amd64
cpu.frequency.efficiency = 2000
cpu.frequency.performance = 0
cpu.modelname = AMD EPYC-Milan Processor
cpu.numcores = 16
cpu.numcores.efficiency = 16
cpu.numcores.performance = 0
cpu.reservablecores = 16
cpu.totalcompute = 32000
cpu.usablecompute = 32000
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.os_type = linux
driver.docker.runtimes = io.containerd.runc.v2,nvidia,runc
driver.docker.version = 25.0.4
driver.exec = 1
driver.java = 1
driver.java.runtime = OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
driver.java.version = 11.0.22
driver.java.vm = OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
driver.raw_exec = 1
kernel.arch = x86_64
kernel.landlock = v1
kernel.name = linux
kernel.version = 5.15.0-100-generic
memory.totalbytes = 67420852224
nomad.advertise.address = 127.0.0.1:4646
nomad.bridge.hairpin_mode = false
nomad.revision = 594fedbfbc4f0e532b65e8a69b28ff9403eb822e
nomad.service_discovery = true
nomad.version = 1.7.6
numa.node.count = 1
numa.node0.cores = 0-15
os.cgroups.version = 2
os.name = ubuntu
os.signals = SIGILL,SIGINT,SIGPROF,SIGSEGV,SIGUSR2,SIGXFSZ,SIGABRT,SIGHUP,SIGQUIT,SIGXCPU,SIGWINCH,SIGBUS,SIGCONT,SIGSTOP,SIGTERM,SIGTRAP,SIGTTIN,SIGIO,SIGTSTP,SIGUSR1,SIGPIPE,SIGNULL,SIGKILL,SIGSYS,SIGTTOU,SIGALRM,SIGFPE,SIGIOT
os.version = 22.04
unique.hostname = l40-single
unique.network.ip-address = 127.0.0.1
unique.storage.bytesfree = 163323670528
unique.storage.bytestotal = 532608356352
unique.storage.volume = /dev/sda1
Device Group Attributes
Device Group = nvidia/gpu/NVIDIA L40
bar1 = 65536 MiB
cores_clock = 2490 MHz
display_state = Enabled
driver_version = 545.23.08
memory_clock = 9001 MHz
memory = 46068 MiB
pci_bandwidth = 31504 MB/s
persistence_mode = Disabled
power = 300 W
Meta
connect.gateway_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level = info
connect.proxy_concurrency = 1
connect.sidecar_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
All of the attributes "discovered" this way can, of course, be used in Nomad jobs to select nodes.
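A minimal sketch of such a selection follows. It uses a node-level constraint on the driver.docker.runtimes attribute and a device-level constraint on the memory attribute of the device group, both taken from the output above. The job, group, and task names, the container image tag, and the 40 GiB threshold are assumptions chosen purely for illustration.

```hcl
# Sketch of a batch job that requests one NVIDIA GPU.
# Assumption: job/group/task names and the image tag are placeholders.
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  # Node attribute from the "Attributes" section: only place the job on
  # nodes whose Docker installation offers the nvidia runtime.
  constraint {
    attribute = "${attr.driver.docker.runtimes}"
    operator  = "set_contains"
    value     = "nvidia"
  }

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:12.3.2-base-ubuntu22.04"
        command = "nvidia-smi"
      }

      resources {
        # Matches the device group nvidia/gpu/NVIDIA L40 shown above.
        device "nvidia/gpu" {
          count = 1

          # Device attribute from "Device Group Attributes":
          # require a GPU with at least 40 GiB of memory.
          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "40 GiB"
          }
        }
      }
    }
  }
}
```

If the constraints match, the allocation should land on the l40-single node, and the task log should contain the nvidia-smi output for the assigned GPU.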