NVIDIA GRID™

The following guide outlines how to install the required NVIDIA GRID™ drivers to use vGPUs with a HyperCloud or VM Squared cluster.

HyperCloud / VM Squared release 2.6.1 supports NVIDIA GRID 16.9.

End User install instructions

To use vGPU nodes in the cluster, first acquire the NVIDIA GRID driver package. A registered username and password with purchased entitlement will be required to access the software. This can be found at:

ui.licensing.nvidia.com/software
- One of the two entitlements is required:
Select and download the LTS package for Linux KVM.
Upload the file(s) into the FILES datastore via the Glasshouse web GUI (recommended) or any other supported method.

Then, the following command can be run from the dashboard CLI:

[test] root@si-dashboard:~# nvidia-grid-install 

Available Images: 
1:  NVIDIA GRID 16.9 (535.230.02)

Enter the number of an image to use  (Ctrl-C to cancel): 1
Proceeding with image: NVIDIA GRID 16.2
Continue? (type yes to continue): yes
Computing sha256 of NVIDIA GRID driver package...
Checking that NVIDIA GRID driver is a supported version...
Extracting NVIDIA GRID driver components...
Extracting NVIDIA GRID host driver .run file...
Creating directory NVIDIA-Linux-x86_64-535.230.02-vgpu-kvm
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.230.02........................................................................................................................................................................................................................................................................................
Copying files...
Creating hotpatch...
Building Hotpatch for VM Squared
Installing hotpatch on floating services node...
Imported hot patch "nvidia-grid" from /var/cores/nvidia-grid.bundle

Installation successful. Reboot any vGPU nodes to start using vGPU in VMs.

Note

For clusters which do not contain a vGPU, any attempt to specify a vGPU in the template will result in an error state where the scheduler will never deploy the VM.

If the command is accidentally run again after installation, the following will occur:

root@si-dashboard:~# nvidia-grid-install
nvidia-grid appears to already be installed. Please uninstall
it before proceeding. It can be uninstalled by running:
nvidia-grid-uninstall

At this point, reboot the compute nodes to start using the vGPU.

When the node is back online, the compute host can be inspected to verify the vGPUs are available for use.
Create a VM template in the GUI to use a vGPU.
- From the menu on the left, click "Templates" then "VMs" then the "+", selecting Create.
- Then, complete the template creation as per documentation.
Within the template wizard, a PCI device (vGPU) and profile can be attached.

Further verification of attachment can be seen by viewing the template's context for PCI devices:
Once the template is created and a VM is instantiated, the vGPU can be verified by opening the VM list menu from the dashboard and viewing the PCI tab or via its context:

OS and driver

Create a VM running Ubuntu Linux version 22.04 with a vGPU attached to it, as described in the instructions above.

Note

Only Ubuntu and RHEL are supported by NVIDIA.
From inside the Ubuntu VM, run:
```
apt update
```
Install the NVIDIA GRID driver from the NVIDIA NDA website.

Run the following:

Example

The following shows an internal SoftIron repository for the drivers. Replace the URL as applicable.

wget https://git.softiron.com/jenkins/cloud/nvidia_grid/NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
mkdir nv
cd nv
unzip ../NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
cd Guest_Drivers/
chmod 0644 ./nvidia-linux-grid-535_535.161.08_amd64.deb
apt install ./nvidia-linux-grid-535_535.161.08_amd64.deb

There may be a complaint about running as root; this can be ignored.

Run nvidia-smi to confirm that the driver has loaded.
```
nvidia-smi
```
Once the VM has been instantiated with the recently created vGPU template, use lspci to see that the vGPU has been passed through to the VM.
Example
```
root@ubuntu-vm:~# lspci -d 10de:
00:05.0 VGA compatible controller: NVIDIA Corporation GA102GL [A10] (rev a1)
```
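The same check can be scripted. The following is a minimal sketch, assuming the `lspci` output is piped in on stdin; the function name `has_nvidia_device` is hypothetical and not part of any NVIDIA or HyperCloud tooling.

```shell
# Succeeds (exit 0) if the lspci output on stdin lists at least one NVIDIA device.
# has_nvidia_device is a hypothetical helper name used for illustration only.
has_nvidia_device() {
    grep -qi 'NVIDIA Corporation'
}

# Usage inside the VM (requires the pciutils package for lspci):
#   lspci -d 10de: | has_nvidia_device && echo "vGPU visible"
```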
The GRID driver can now be installed onto the VM by following NVIDIA's documentation.
- On Ubuntu, extract the GRID zip file and install the .deb in Guest_Drivers; similarly, for RHEL, install the .rpm in the directory.

Once the installation is complete, the command nvidia-smi will show the vGPU:

Make note of the supported CUDA version in the top right corner.

root@ubuntu-vgpu:~/gpu-burn# nvidia-smi
Thu Apr 11 13:29:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-8Q                  On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A P8              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI       PID   Type   Process name                             GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Of note

The vGPU shows 8192 MiB of RAM.

End User uninstall instructions

Simply run the following command from the Dashboard:

root@si-dashboard:~# nvidia-grid-uninstall
Warning: Permanently added 'hypercloud-storage' (RSA) to the list of known hosts.
Expunged hot patch "nvidia-grid"
root@si-dashboard:~#

CUDA Install

Download CUDA from NVIDIA at https://developer.nvidia.com/cuda-downloads?target_os=Linux.
- Select the OS and make sure to select "deb (network)".
Run the "Base Install" instructions from the web page, but note that the last line installs a specific CUDA version. For HyperCloud 2.3.x, this must be changed to match the CUDA version reported by the nvidia-smi command (12.2 in this case). The last line of the instructions therefore becomes (instead of 12-4):

sudo apt-get -y install cuda-toolkit-12-2
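
The package name can also be derived from the nvidia-smi header rather than typed by hand. This is a sketch only, assuming the header contains a `CUDA Version: X.Y` field as in the output shown earlier; the helper names `cuda_version_from_smi` and `cuda_toolkit_package` are hypothetical.

```shell
# Print the CUDA version found in nvidia-smi header text on stdin, e.g. "12.2".
# Hypothetical helper; relies on the "CUDA Version: X.Y" field in the header.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Turn a CUDA version such as "12.2" into the apt package name "cuda-toolkit-12-2".
cuda_toolkit_package() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Usage inside the VM (assumes nvidia-smi is on PATH):
#   sudo apt-get -y install "$(cuda_toolkit_package "$(nvidia-smi | cuda_version_from_smi)")"
```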

GPU-Burn install

Download and build GPU-burn:

git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make COMPUTE=86
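
The value 86 corresponds to compute capability 8.6, which is what the NVIDIA A10 used in this guide reports. For other GPUs, the value can be derived rather than hard-coded. The sketch below assumes the installed driver's nvidia-smi supports the `compute_cap` query field, which has not been confirmed for every vGPU guest driver; `compute_flag` is a hypothetical helper name.

```shell
# Convert a compute capability such as "8.6" into the form "86"
# expected by the COMPUTE make variable. Hypothetical helper.
compute_flag() {
    printf '%s\n' "$1" | tr -d '. '
}

# Usage (assumption: nvidia-smi accepts the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   make COMPUTE="$(compute_flag "$cap")"
```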

Run GPU-burn with:
```
./gpu_burn 600
```
600 is the number of seconds to run the program.

Info

For a licensed GPU on HyperCloud 2.3 on HC41XXX family nodes, the benchmark numbers should be around 13900.

License setup

The NVIDIA vGPU driver requires a license from NVIDIA in order to operate. Unlicensed vGPUs will run at full speed for 20 minutes, then throttle to a lower speed after that time.

See: https://docs.nvidia.com/grid/13.0/grid-licensing-user-guide/index.html.

Go to: https://ui.licensing.nvidia.com/ to set up a license server and retrieve the license.

Info

Setting up the license server is outside the scope of this document and requires an NVIDIA NDA account. Note that the new method of licensing does not involve setting up a local server and installing Tomcat. If you are looking at documents which describe this, they are out of date and will not work.
The easiest way to conduct licensing is to set up a license server that runs on NVIDIA’s cloud, and they have made it easy to do this. This will require the VM to have access to the internet. If this is not possible, they do allow local license servers to be run on your network (but it’s not the old-style Tomcat server). This is also outside the scope of this document.
After the license server has been set up, the licenses will need to be assigned. When assigning licenses, the server must be Stopped. The licenses required for the current HyperCloud configuration of vGPU are RTX Virtual Workstation. No other license types will work.
Select the license server, click the green Actions button at the top right, and select "Download Configuration Token". This provides a .tok file, which the VM's GRID driver uses to acquire a license from NVIDIA's server.
Copy this file onto the VM in the /etc/nvidia/ClientConfigToken/ directory, then run:
```
systemctl restart nvidia-gridd.service
```
Wait for ~10 seconds then run:
```
systemctl status nvidia-gridd.service
```

The session should resemble the following:

root@ubuntu-vgpu:~/gpu-burn# systemctl restart nvidia-gridd.service
root@ubuntu-vgpu:~/gpu-burn# sleep 10
root@ubuntu-vgpu:~/gpu-burn# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
    Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
    Active: active (running) since Thu 2024-04-11 14:00:58 UTC; 9s ago
    Process: 28237 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 28238 (nvidia-gridd)
    Tasks: 4 (limit: 19140)
    Memory: 1.4M
        CPU: 213ms
    CGroup: /system.slice/nvidia-gridd.service
            └─28238 /usr/bin/nvidia-gridd

Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Starting NVIDIA Grid Daemon...
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Started (28238)
Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Started NVIDIA Grid Daemon.
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: vGPU Software package (0)
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Ignore service provider and node-locked licensing
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: NLS initialized
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Acquiring license. (Info: api.cls.licensing.nvidia.>
Apr 11 14:01:00 ubuntu-vgpu nvidia-gridd[28238]: License acquired successfully. (Info: api.cls.licen>

Ensure the output reports that the license was acquired successfully. The license status can then be verified further with nvidia-smi:

root@ubuntu-vgpu:~/gpu-burn# nvidia-smi -q |grep License
    vGPU Software Licensed Product
    License Status                  : Licensed (Expiry: 2024-4-12 14:1:0 GMT)
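
Because the license can take a few seconds to be acquired after the service restarts, the check can be wrapped in a short polling loop. This is a minimal sketch, assuming the `License Status` line has the format shown above; the function names and the default attempt count and delay are hypothetical.

```shell
# Succeeds (exit 0) if `nvidia-smi -q` text on stdin reports a licensed vGPU.
# "Unlicensed" does not match, because "Licensed" must directly follow the colon.
is_licensed() {
    grep -Eq 'License Status[[:space:]]*:[[:space:]]*Licensed'
}

# Poll nvidia-smi until the vGPU is licensed or the attempts run out.
# $1 = number of attempts (default 6), $2 = seconds between attempts (default 5).
wait_for_license() {
    attempts=${1:-6}
    delay=${2:-5}
    while [ "$attempts" -gt 0 ]; do
        if nvidia-smi -q | is_licensed; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Usage inside the VM:
#   systemctl restart nvidia-gridd.service && wait_for_license && echo "Licensed"
```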

License usage is also displayed on the NVIDIA license server's website.

Run the gpu-burn command again to verify that the benchmark result is as expected (~13900 for an NVIDIA A10 GPU on HC41XXX family nodes).