NVIDIA GRID™

The following guide outlines how to install the required NVIDIA GRID™ drivers to use vGPUs with a HyperCloud or VM Squared cluster.

HyperCloud / VM Squared release 2.6.1 supports NVIDIA GRID 16.9.

End User install instructions

  1. To use vGPU nodes in the cluster, first acquire the NVIDIA GRID driver package. A registered username and password with a purchased entitlement are required to access the software, which can be found at:

    ui.licensing.nvidia.com/software

    • One of the two entitlements is required:

    [Screenshot: entitlements]

  2. Select and download the LTS package for Linux KVM.

    [Screenshot: 16.9 pkg]

  3. Upload the file(s) into the FILES datastore via the Glasshouse web GUI (recommended) or any other supported method.

    [Screenshots: files, create, upload progress, ready]

  4. Then run the following command from the dashboard CLI:

    [test] root@si-dashboard:~# nvidia-grid-install 
    
    Available Images: 
    1:  NVIDIA GRID 16.9 (535.230.02)
    
    Enter the number of an image to use  (Ctrl-C to cancel): 1
    Proceeding with image: NVIDIA GRID 16.9
    Continue? (type yes to continue): yes
    Computing sha256 of NVIDIA GRID driver package...
    Checking that NVIDIA GRID driver is a supported version...
    Extracting NVIDIA GRID driver components...
    Extracting NVIDIA GRID host driver .run file...
    Creating directory NVIDIA-Linux-x86_64-535.230.02-vgpu-kvm
    Verifying archive integrity... OK
    Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.230.02........................................................................................................................................................................................................................................................................................
    Copying files...
    Creating hotpatch...
    Building Hotpatch for VM Squared
    Installing hotpatch on floating services node...
    Imported hot patch "nvidia-grid" from /var/cores/nvidia-grid.bundle
    
    Installation successful. Reboot any vGPU nodes to start using vGPU in VMs.
    

    Note

    For clusters which do not contain a vGPU, any attempt to specify a vGPU in the template will result in an error state where the scheduler will never deploy the VM.

    If the command is accidentally run again after installation, the following will occur:

    root@si-dashboard:~# nvidia-grid-install
    nvidia-grid appears to already be installed. Please uninstall
    it before proceeding. It can be uninstalled by running:
    nvidia-grid-uninstall
    

    At this point, reboot the compute nodes to start using the vGPU.

  5. When the node is back online, the compute host can be inspected to verify the vGPUs are available for use.

    [Screenshot: hosts]

  6. Create a VM template in the GUI to use a vGPU.

    • From the menu on the left, click "Templates" then "VMs" then the "+", selecting Create.
    • Then, complete the template creation as per documentation.

    Within the template wizard, a PCI device (vGPU) and profile can be attached.

    [Screenshot: attach device]

    Further verification of attachment can be seen by viewing the template's context for PCI devices:

    [Screenshot: template context]

  7. Once the template is created and a VM is instantiated, the vGPU can be verified by opening the VM list menu from the dashboard and viewing the PCI tab or via its context:

    [Screenshots: vm pci, vm pci context]

OS and driver

  1. Create a VM running Ubuntu Linux version 22.04 with a vGPU attached to it, as described in the instructions above.

    Note

    Only Ubuntu and RHEL are supported by NVIDIA.

  2. From inside the Ubuntu VM, run:

    apt update
    
  3. Install the NVIDIA GRID driver from the NVIDIA NDA website.

  4. Run the following:

    Example

    The following uses an internal SoftIron repository for the drivers. Replace the URL as applicable.

    wget https://git.softiron.com/jenkins/cloud/nvidia_grid/NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
    mkdir nv
    cd nv
    unzip ../NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
    cd Guest_Drivers/
    chmod 0644 ./nvidia-linux-grid-535_535.161.08_amd64.deb
    apt install ./nvidia-linux-grid-535_535.161.08_amd64.deb
    

    The install may produce a complaint about running as root; this can generally be ignored.

  5. Run nvidia-smi to confirm that the driver has loaded.

    nvidia-smi
    
  6. Once the VM has been instantiated with the recently created vGPU template, use lspci to see that the vGPU has been passed through to the VM.

    Example

    root@ubuntu-vm:~# lspci -d 10de:
    00:05.0 VGA compatible controller: NVIDIA Corporation GA102GL [A10] (rev a1)
    
  7. The GRID driver can now be installed onto the VM by following NVIDIA's documentation.

    • On Ubuntu, extract the GRID zip file and install the .deb in Guest_Drivers; similarly, for RHEL, install the .rpm in the directory.

Once the installation is complete, the nvidia-smi command will show the vGPU. Make note of the supported CUDA version in the top right corner of the output:

root@ubuntu-vgpu:~/gpu-burn# nvidia-smi
Thu Apr 11 13:29:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-8Q                  On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A P8              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI       PID   Type   Process name                             GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Of note

The vGPU shows 8192 MiB of RAM.
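
The framebuffer check above can be scripted. The following is a minimal sketch, assuming the vGPU naming convention in which the number in the profile name (the 8 in A10-8Q) is the framebuffer size in GiB, and assuming an input line of the form printed by nvidia-smi --query-gpu=name,memory.total --format=csv,noheader. The function name profile_fb_matches is hypothetical.

```shell
# Hypothetical helper: succeed if the memory reported for a vGPU matches the
# size encoded in its profile name.
# Assumed input form: "NVIDIA A10-8Q, 8192 MiB"
profile_fb_matches() {
    line=$1
    name=${line%%,*}          # "NVIDIA A10-8Q"
    mem=${line#*, }           # "8192 MiB"
    mem=${mem%% *}            # "8192"
    # Extract the digits between the last "-" and the series letter.
    gb=$(printf '%s\n' "$name" | sed -n 's/.*-\([0-9][0-9]*\)[A-Z].*/\1/p')
    [ -n "$gb" ] && [ "$mem" -eq $((gb * 1024)) ]
}

# Example usage inside the VM:
#   profile_fb_matches "$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader | head -n 1)" \
#       && echo "framebuffer matches profile"
```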

End User uninstall instructions

Simply run the following command from the Dashboard:

root@si-dashboard:~# nvidia-grid-uninstall
Warning: Permanently added 'hypercloud-storage' (RSA) to the list of known hosts.
Expunged hot patch "nvidia-grid"
root@si-dashboard:~#

CUDA Install

  • Download CUDA from NVIDIA at https://developer.nvidia.com/cuda-downloads?target_os=Linux.
    • Select the OS and make sure to select "deb (network)".
  • Run the "Base Install" instructions from the web page, but note that the last line installs a specific CUDA version. For HyperCloud 2.3.x, change this to match the CUDA version reported by nvidia-smi (12.2 in this case), so the last line of the instructions becomes the following (instead of 12-4):
sudo apt-get -y install cuda-toolkit-12-2
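
The mapping from the CUDA version reported by nvidia-smi to the package name can be sketched as a small shell helper. This is a minimal sketch, assuming the nvidia-smi header contains a "CUDA Version: X.Y" field as in the output shown earlier; the function name cuda_toolkit_pkg is hypothetical.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the matching
# cuda-toolkit package name (e.g. "CUDA Version: 12.2" -> cuda-toolkit-12-2).
# Prints nothing if no "CUDA Version: X.Y" field is found.
cuda_toolkit_pkg() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/cuda-toolkit-\1-\2/p' | head -n 1
}

# Example usage inside the VM (check that the name is non-empty before installing):
#   pkg=$(nvidia-smi | cuda_toolkit_pkg)
#   [ -n "$pkg" ] && sudo apt-get -y install "$pkg"
```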

GPU-Burn install

  • Download and build GPU-burn

    git clone https://github.com/wilicc/gpu-burn.git
    cd gpu-burn
    make COMPUTE=86
    
  • Run GPU-burn with:

    ./gpu-burn 600
    

    600 is the number of seconds to run the program.
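
The 86 in the make command is the GPU's CUDA compute capability (8.6 for the A10) written without the dot. For other GPUs, the value can be derived rather than hard-coded. The following is a minimal sketch; the function name compute_to_make_value is hypothetical, and the usage line assumes the installed driver's nvidia-smi supports the compute_cap query field.

```shell
# Hypothetical helper: convert a compute capability such as "8.6" (read on
# stdin) into the dot-free form passed to make, e.g. "86".
compute_to_make_value() {
    head -n 1 | tr -d '. \r\n'
}

# Example usage inside the gpu-burn directory:
#   make COMPUTE="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | compute_to_make_value)"
```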

Info

For a licensed GPU on HyperCloud 2.3 on HC41XXX family nodes, the benchmark numbers should be around 13900.

License setup

The NVIDIA vGPU driver requires a license from NVIDIA in order to operate. Unlicensed vGPUs run at full speed for 20 minutes, then throttle to a lower speed.

See: https://docs.nvidia.com/grid/13.0/grid-licensing-user-guide/index.html.

  • Go to: https://ui.licensing.nvidia.com/ to set up a license server and retrieve the license.

    Info

    Setting up the license server is outside the scope of this document and requires an NVIDIA NDA account. Note that the new method of licensing does not involve setting up a local server and installing Tomcat. If you are looking at documents which describe this, they are out of date and will not work.

  • The easiest approach to licensing is to set up a license server that runs on NVIDIA’s cloud. This requires the VM to have access to the internet. If this is not possible, NVIDIA also allows local license servers to be run on your network (but not the old-style Tomcat server). This is also outside the scope of this document.

  • After the license server has been set up, the licenses will need to be assigned. When assigning licenses, the server must be Stopped. The licenses required for the current HyperCloud configuration of vGPU are RTX Virtual Workstation. No other license types will work.

  • Select the license server, click the green Actions button at the top right, and select "Download Configuration Token". This provides a .tok file, which the VM's GRID driver uses to acquire a license from NVIDIA's server.

  • Copy this file onto the VM in the /etc/nvidia/ClientConfigToken/ directory, then run:

    systemctl restart nvidia-gridd.service
    
  • Wait for ~10 seconds then run:

    systemctl status nvidia-gridd.service
    

The session should resemble the following:

root@ubuntu-vgpu:~/gpu-burn# systemctl restart nvidia-gridd.service
root@ubuntu-vgpu:~/gpu-burn# sleep 10
root@ubuntu-vgpu:~/gpu-burn# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
    Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
    Active: active (running) since Thu 2024-04-11 14:00:58 UTC; 9s ago
    Process: 28237 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 28238 (nvidia-gridd)
    Tasks: 4 (limit: 19140)
    Memory: 1.4M
        CPU: 213ms
    CGroup: /system.slice/nvidia-gridd.service
            └─28238 /usr/bin/nvidia-gridd

Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Starting NVIDIA Grid Daemon...
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Started (28238)
Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Started NVIDIA Grid Daemon.
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: vGPU Software package (0)
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Ignore service provider and node-locked licensing
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: NLS initialized
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Acquiring license. (Info: api.cls.licensing.nvidia.>
Apr 11 14:01:00 ubuntu-vgpu nvidia-gridd[28238]: License acquired successfully. (Info: api.cls.licen>
  • Ensure the output reports a successful license acquisition. The status can then be further verified with nvidia-smi:
root@ubuntu-vgpu:~/gpu-burn# nvidia-smi -q |grep License
    vGPU Software Licensed Product
    License Status                  : Licensed (Expiry: 2024-4-12 14:1:0 GMT)
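
The token copy, service restart, and license check described above can be combined into a script. The following is a minimal sketch, not an NVIDIA or SoftIron tool: the function names install_token, is_licensed, and wait_for_license are hypothetical, and the parsing assumes the "License Status" line format shown in the nvidia-smi -q output above. Polling replaces the fixed 10-second wait.

```shell
# Hypothetical helper: copy a .tok file into the client token directory.
# $1 = token file, $2 = destination (defaults to the directory named above).
install_token() {
    src=$1
    dest_dir=${2:-/etc/nvidia/ClientConfigToken}
    case "$src" in
        *.tok) ;;
        *) echo "expected a .tok file: $src" >&2; return 1 ;;
    esac
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir"/
}

# Succeed if stdin contains a "License Status" line whose value starts with
# "Licensed" (an "Unlicensed" value does not match).
is_licensed() {
    grep -Eq 'License Status[[:space:]]*:[[:space:]]*Licensed'
}

# Retry up to $1 times, $2 seconds apart, running the remaining arguments
# as the status command on each attempt.
wait_for_license() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" 2>/dev/null | is_licensed; then
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Example usage inside the VM (the token file name is a placeholder):
#   install_token ./client_configuration_token.tok \
#       && systemctl restart nvidia-gridd.service \
#       && wait_for_license 6 5 nvidia-smi -q \
#       && echo "license acquired"
```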

License usage is also displayed on the NVIDIA license server's website.

  • Run the gpu-burn command again to verify that the benchmark result is as expected (~13900 for an NVIDIA A10 GPU on HC41XXX family nodes).