Don't Call It A Comeback: Proxmox Server for AI training and exploration

So I finally bit the bullet and built a server capable of handling various AI projects, such as LLMs and other NLP models. I'm hoping to eventually dive into generative tools such as Stable Diffusion as well.

Components:

Motherboard: GIGABYTE X870E AORUS Master AMD AM5

CPU: AMD Ryzen™ 9 9950X 16-Core, 32-Thread Unlocked Desktop Processor

GPU x 2: GIGABYTE GeForce RTX 5060 Ti Eagle OC 16GB 128-bit GDDR7, PCIe 5.0

I installed Proxmox so I could spin up separate instances for different projects. The problem, which is what this post will address, is that passing two GPUs through to a virtual machine in Proxmox is incredibly annoying. And I had to do it twice, since a power surge caused some issues with the motherboard only two weeks after I completed the build.

Step 1: Proxmox Host Configuration

  1. Enable IOMMU in GRUB:

    • The IOMMU (Input/Output Memory Management Unit) is what allows PCI devices to be isolated into groups for passthrough. It's crucial that your GPUs end up in their own groups: devices in an IOMMU group can only be passed through together, so if a GPU shares a group with hardware the host still needs, passing it through can crash the whole server.

      • Check that IOMMU support is detected: dmesg | grep -i iommu (on an AMD platform like this one, look for AMD-Vi messages).

      • Open the GRUB configuration file: nano /etc/default/grub.

      • Add intel_iommu=on (Intel) or amd_iommu=on (AMD, which applies to this build) to the GRUB_CMDLINE_LINUX_DEFAULT line (see the snippet below).

      • Update GRUB: update-grub.
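      • On this AMD build, the relevant line ends up looking like the sketch below (iommu=pt is a common optional addition that puts the IOMMU in passthrough mode; treat it as an assumption, not a requirement). After rebooting, the short shell loop is a handy way to confirm each GPU landed in its own group; group numbers and PCI addresses will differ on your machine.

        # /etc/default/grub
        GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

        # List every PCI device by IOMMU group; each GPU (and its
        # audio function) should sit in a group of its own.
        for d in /sys/kernel/iommu_groups/*/devices/*; do
            n=${d#*/iommu_groups/}; n=${n%%/*}
            printf 'IOMMU group %s: %s\n' "$n" "$(lspci -nns ${d##*/})"
        done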

  2. Blacklist the NVIDIA Drivers on the Host:

    • This part is crucial: it prevents the host from loading the GPUs' drivers, which in turn keeps the cards available for the guest VM. If the host grabs the GPUs, you can't pass them to the VM.

    • Edit blacklist file: nano /etc/modprobe.d/pve-blacklist.conf.

    • Add blacklist entries for nouveau, nvidiafb, and nvidia (see the snippet below).

    • Update initramfs after blacklisting: update-initramfs -u.

    • You'll have to reboot the machine for these changes to take effect.
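    • As a sketch, the finished blacklist file looks like this:

      # /etc/modprobe.d/pve-blacklist.conf
      # Keep the host from ever loading a driver for the NVIDIA cards.
      blacklist nouveau
      blacklist nvidiafb
      blacklist nvidia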

  3. Isolate the GPU with VFIO:

    • Next you need to bind the GPUs to the vfio-pci driver, ensuring they're ready for passthrough. Binding them to vfio-pci keeps the blacklisted drivers from claiming the cards. You'd think blacklisting them would be enough, but noooooo.

    • Find the GPUs' PCI IDs by pulling up the list of NVIDIA devices: lspci -nn | grep -i nvidia.

    • Create a file, nano /etc/modprobe.d/vfio.conf, and specify the GPUs' PCI IDs (example below).

    • Update initramfs again: update-initramfs -u.

    • Reboot one final time.
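    • As an example, with placeholder IDs (use the bracketed vendor:device pairs from your own lspci -nn output, and note that the GPU and its HDMI audio function each have their own ID):

      # Sample lspci -nn output (addresses and IDs here are placeholders):
      #   01:00.0 VGA compatible controller [0300]: NVIDIA ... [10de:1234]
      #   01:00.1 Audio device [0403]: NVIDIA ... [10de:5678]

      # /etc/modprobe.d/vfio.conf
      options vfio-pci ids=10de:1234,10de:5678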

Step 2: Proxmox VM Settings

Add the PCI Device:

  1. Navigate to the VM's hardware tab and add a "PCI Device."

  2. Add one GPU at a time. Both GPUs should have "All Functions" and "PCI-Express" checked.

  3. Only one GPU should have "Primary GPU" checked.

Configure QEMU Arguments:

  4. Open the VM's config file: nano /etc/pve/qemu-server/<VMID>.conf

  5. Add this line to the end: args: -cpu 'host,kvm=off,hv_vendor_id=null'

    1. Here is a breakdown of what each part does:

      • args: This is a Proxmox-specific parameter that lets you pass custom arguments directly to the QEMU command-line.

      • -cpu 'host,kvm=off,hv_vendor_id=null': This is the core of the trick.

        • host: Tells QEMU to expose the host CPU's features to the guest, giving it full access to all the CPU's capabilities.

        • kvm=off: Hides the KVM hypervisor's signature from the guest. This signature is one of the main things the NVIDIA driver checks when deciding whether it's running in a VM.

        • hv_vendor_id=null: Overrides the hypervisor vendor ID that is normally exposed to the guest (here with the literal string "null"). This is another signal the NVIDIA driver inspects to detect virtualization.

  6. Load VBIOS rom file for VM

    1. Finding the VBIOS rom file for my specific GPU proved to be difficult, but after some Googling I was able to find it on a GPU ranking site and download it.

    2. Edit the conf file: nano /etc/pve/qemu-server/<VMID>.conf

    3. Edit the hostpci0 line: hostpci0: 01:00,pcie=1,x-vga=1,romfile=my_gpu_vbios.rom (the romfile path is resolved relative to /usr/share/kvm/, so place the .rom file there).
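    4. Putting it together, the relevant lines of <VMID>.conf should look roughly like this. The hostpci1 address for the second card is an example (yours comes from lspci), and only the primary GPU gets x-vga=1:

      args: -cpu 'host,kvm=off,hv_vendor_id=null'
      hostpci0: 01:00,pcie=1,x-vga=1,romfile=my_gpu_vbios.rom
      hostpci1: 02:00,pcie=1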

Adjust BIOS and Display Settings:

  7. Go to the VM's Hardware tab (the Display entry lives there, not under Options).

  8. Set the "Display" to None to prevent Proxmox from trying to use a virtual display.

Step 3: Installing the NVIDIA Drivers in the Debian VM

  1. Clean Up Any Previous Driver Attempts:

    • Purge all existing NVIDIA packages: sudo apt --purge autoremove "nvidia-*"

  2. Install Essential Build Tools:

    • Install dkms, build-essential, and the kernel headers: sudo apt install dkms build-essential linux-headers-$(uname -r).

  3. Download and Run the NVIDIA Installer:

    • Download the driver from NVIDIA's official site; since the VM runs Debian, you want the Linux x86_64 .run installer.

    • Make the installer executable (chmod +x) and run it (sudo ./NVIDIA-Linux-x86_64-*.run).
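    • End to end, that looks roughly like this (the filename depends on the driver version you downloaded):

      chmod +x NVIDIA-Linux-x86_64-*.run
      sudo ./NVIDIA-Linux-x86_64-*.run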

  4. Answer the Installer Prompts Correctly:

    • Yes to "Register the kernel module with DKMS."

    • Yes to "Rebuild the initramfs."

    • No to "Update X configuration file."

  5. Final Verification:

    • Reboot the VM. Then log back in.

    • Check if the driver is working by opening a terminal and running: nvidia-smi.

    • If that command shows the details of both GPUs, you have successfully passed them through to your VM.
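    • As an extra sanity check, nvidia-smi -L lists every GPU the driver can see; with both cards passed through you should see two entries (UUIDs abbreviated here):

      $ nvidia-smi -L
      GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-xxxxxxxx-...)
      GPU 1: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-xxxxxxxx-...)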