NVIDIA GRID™
The following guide outlines how to install the required NVIDIA GRID™ drivers to use vGPUs with a HyperCloud cluster.
End User install instructions
-
To use vGPU, first acquire the GRID driver package and copy the entire GRID .zip file onto the HyperCloud dashboard using scp. Once copied, run nvidia-grid-install <GRID-zip-file> on the dashboard. For example: nvidia-grid-install /var/cores/NVIDIA-GRID-Linux-KVM-535.154.02-535.154.05-538.15.zip
-
Run hypercloud-reboot-cluster-live -compute to reboot all the compute nodes. When a vGPU-capable node starts, the GRID kernel drivers are loaded automatically and three vGPUs are configured with 8 GB of RAM each and the well-known UUIDs 00000000-0000-0000-0001-000000000001, 00000000-0000-0000-0001-000000000002, and 00000000-0000-0000-0001-000000000003. This is the only GPU configuration option within HyperCloud. By policy, the last number grouping in the UUID indicates the vGPU ID (numbering starts at 1) and the second-to-last grouping represents the GPU (numbering starts at 1); for example, the second-to-last grouping will be 0002 for vGPUs created on the second GPU.
Note
For HyperCloud nodes that are not vGPU-capable, nothing will occur. The GRID kernel drivers will not be loaded.
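The UUID policy described above can be sketched as a small shell helper. This is only an illustration: the function name vgpu_uuid is hypothetical, and it assumes the two groupings are zero-padded decimal numbers, which the three UUIDs listed above are consistent with but do not prove.

```shell
# Hypothetical helper: build the well-known vGPU UUID from a GPU number
# and a vGPU number (both start at 1).
# Assumption: both groupings are zero-padded decimal values.
vgpu_uuid() {
    gpu_index="$1"    # second-to-last grouping (4 digits)
    vgpu_index="$2"   # last grouping (12 digits)
    printf '00000000-0000-0000-%04d-%012d\n' "$gpu_index" "$vgpu_index"
}
```

For example, vgpu_uuid 1 3 prints the third UUID listed above. Pass plain integers without leading zeros, since printf may interpret a leading zero as octal.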
-
Create a VM template in the HyperCloud GUI to use a vGPU.
- From the menu on the left, click "Templates" then "VMs" then the "+", selecting Create.
- Then, complete the template creation as per documentation.
-
Before finalizing the template, select the Tags tab and, under "Raw Data", paste the following in the DATA field:
-
Modify the UUID to match the vGPU that a VM created from this template will use.
-
Click the green Create to finish creating the VM Template.
Warning
The UUID cannot be changed from the UI for an instantiated VM. It can only be changed in the template. A template will need to be created for each vGPU usage configuration; furthermore, a UUID cannot be duplicated across multiple instances.
OS and driver
-
Create a VM running Ubuntu Linux version 22.04 with a vGPU attached to it, as described in the instructions above.
Note
Only Ubuntu and RHEL are supported by NVIDIA.
-
From inside the Ubuntu VM, run:
-
Install the NVIDIA GRID driver from the NVIDIA NDA website.
-
Run the following:
Example
The following shows an internal SoftIron repository for the drivers. Replace the URL as applicable.
wget https://git.softiron.com/jenkins/cloud/nvidia_grid/NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
mkdir nv
cd nv
unzip ../NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
cd Guest_Drivers/
chmod 0644 ./nvidia-linux-grid-535_535.161.08_amd64.deb
apt install ./nvidia-linux-grid-535_535.161.08_amd64.deb
There may be a warning about running as root; this can be ignored.
-
Run nvidia-smi to confirm that the driver has loaded.
-
Once the VM has been instantiated with the recently created vGPU template, use lspci to see that the vGPU has been passed through to the VM.
-
The GRID driver can now be installed onto the VM by following NVIDIA's documentation.
- On Ubuntu, extract the GRID zip file and install the .deb in Guest_Drivers; similarly, for RHEL, install the .rpm in the same directory.
Once the installation is complete, the command nvidia-smi will show the vGPU:
Make note of the supported CUDA version in the top right corner.
root@ubuntu-vgpu:~/gpu-burn# nvidia-smi
Thu Apr 11 13:29:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10-8Q On | 00000000:00:05.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 0MiB / 8192MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Of note
The vGPU shows 8192 MiB of RAM.
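The memory check above can also be scripted rather than read from the table. The sketch below is an illustration only: the function name check_vgpu_memory is hypothetical, and it assumes the input comes from nvidia-smi --query-gpu=name,memory.total --format=csv,noheader, which prints one line per GPU in the form "NVIDIA A10-8Q, 8192 MiB".

```shell
# Hypothetical helper: read "name, <N> MiB" lines on stdin and succeed only
# if at least one GPU is listed and every GPU reports the expected memory.
check_vgpu_memory() {
    expected_mib="$1"
    awk -F', *' -v want="$expected_mib" '
        {
            split($2, mem, " ")              # "8192 MiB" -> mem[1] = "8192"
            if (mem[1] + 0 != want + 0) bad = 1
            seen = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}

# Intended usage inside the VM:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader | check_vgpu_memory 8192
```

The function returns a non-zero status if no GPU is listed or if any listed GPU reports a different amount of memory, so it can be used directly in an if statement.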
End User uninstall instructions
Simply run the following command from the Dashboard:
CUDA Install
- Download CUDA from NVIDIA at https://developer.nvidia.com/cuda-downloads?target_os=Linux.
- Select the OS and make sure to select "deb (network)".
- Run the "Base Install" instructions from the web page, but note that the last line installs a specific CUDA version. For HyperCloud 2.3.x this must be changed to match the version shown by the nvidia-smi command, in this case 12.2; the last line of the instructions must therefore reference 12-2 instead of 12-4:
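The version suffix can be derived from the nvidia-smi header instead of being typed by hand. The following is a minimal sketch: the function name cuda_suffix_from_smi is hypothetical, and it assumes the header contains the text "CUDA Version: X.Y" as in the output shown earlier.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the CUDA
# version in the dashed form used by the install instructions (12.2 -> 12-2).
cuda_suffix_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1-\2/p' | head -n 1
}

# Intended usage inside the VM:
#   nvidia-smi | cuda_suffix_from_smi
```

The printed suffix replaces 12-4 in the last line of NVIDIA's "Base Install" instructions; the exact package name should be taken from NVIDIA's download page, not from this sketch.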
GPU-Burn install
-
Download and build GPU-burn
-
Run GPU-burn with:
600 is the number of seconds to run the program.
Info
For a licensed GPU on HyperCloud 2.3 on HC41XXX family nodes, the benchmark numbers should be around 13900.
License setup
The NVIDIA vGPU driver requires a license from NVIDIA in order to operate. Unlicensed vGPUs will run at full speed for 20 minutes, then throttle to a lower speed.
See: https://docs.nvidia.com/grid/13.0/grid-licensing-user-guide/index.html.
-
Go to: https://ui.licensing.nvidia.com/ to set up a license server and retrieve the license.
Info
Setting up the license server is outside the scope of this document and requires an NVIDIA NDA account. Note that the new method of licensing does not involve setting up a local server and installing Tomcat. If you are looking at documents which describe this, they are out of date and will not work.
-
The easiest way to handle licensing is to set up a license server that runs on NVIDIA's cloud, which NVIDIA has made straightforward. This requires the VM to have access to the internet. If this is not possible, NVIDIA does allow local license servers to be run on your network (but not the old-style Tomcat server). This is also outside the scope of this document.
-
After the license server has been set up, the licenses will need to be assigned. When assigning licenses, the server must be Stopped. The licenses required for the current HyperCloud configuration of vGPU are RTX Virtual Workstation. No other license types will work.
-
Select the license server, click the green Actions button at the top right, and select "Download Configuration Token". This will provide a .tok file. The VM's GRID driver uses this file to acquire a license from NVIDIA's server.
-
Copy this file onto the VM in the /etc/nvidia/ClientConfigToken/ directory, then run systemctl restart nvidia-gridd.service.
-
Wait for ~10 seconds, then run systemctl status nvidia-gridd.service.
The session should resemble the following:
root@ubuntu-vgpu:~/gpu-burn# systemctl restart nvidia-gridd.service
root@ubuntu-vgpu:~/gpu-burn# sleep 10
root@ubuntu-vgpu:~/gpu-burn# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-04-11 14:00:58 UTC; 9s ago
Process: 28237 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
Main PID: 28238 (nvidia-gridd)
Tasks: 4 (limit: 19140)
Memory: 1.4M
CPU: 213ms
CGroup: /system.slice/nvidia-gridd.service
└─28238 /usr/bin/nvidia-gridd
Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Starting NVIDIA Grid Daemon...
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Started (28238)
Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Started NVIDIA Grid Daemon.
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: vGPU Software package (0)
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Ignore service provider and node-locked licensing
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: NLS initialized
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Acquiring license. (Info: api.cls.licensing.nvidia.>
Apr 11 14:01:00 ubuntu-vgpu nvidia-gridd[28238]: License acquired successfully. (Info: api.cls.licen>
- Ensure the output reports a successful license acquisition; the status can then be further verified with nvidia-smi:
root@ubuntu-vgpu:~/gpu-burn# nvidia-smi -q |grep License
vGPU Software Licensed Product
License Status : Licensed (Expiry: 2024-4-12 14:1:0 GMT)
The license usage is also displayed on the NVIDIA license server's website.
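The nvidia-smi license check can be turned into a pass/fail test for use in scripts. This is a sketch only: the function name license_is_active is hypothetical, and it assumes the "License Status : Licensed" line format shown in the output above.

```shell
# Hypothetical helper: read `nvidia-smi -q` output on stdin and succeed only
# if a "License Status" line reports "Licensed". A value of "Unlicensed"
# does not match, because "Licensed" must directly follow the colon.
license_is_active() {
    grep -E 'License Status[[:space:]]*:[[:space:]]*Licensed' >/dev/null
}

# Intended usage inside the VM:
#   nvidia-smi -q | license_is_active && echo "vGPU is licensed"
```

Because the function only inspects text on stdin, it can also be run against saved nvidia-smi -q output when diagnosing a VM after the fact.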
- Run the gpu-burn command again to verify that the benchmark is as expected (~13900 for an NVIDIA A10 GPU on HC41XXX family nodes).