By Julia Hansbrough, Software Engineer
Google recently launched GPUs on Google Cloud Platform (GCP), which will allow customers to leverage this hardware for highly parallel workloads. These GPUs are connected to our cloud machines via a variety of PCIe switches, and that required us to have a deep understanding of PCIe security.
Securing PCIe devices requires overcoming some inherent challenges. For instance, GPUs have become far more complex in the past few decades, opening up new avenues for attack. Since GPUs are designed to directly access system memory, and since hardware has historically been considered trusted, it’s difficult to ensure all the settings to keep it contained are set accurately, and difficult to ensure whether such settings even work. And since GPU manufacturers don’t make the source code or binaries available for the GPU’s main processes, we can’t examine those to gain more confidence. You can read more about the challenges presented by the PCI and PCIe specs here.
With the risk of malicious behavior from compromised PCIe devices, Google needed to have a plan for combating these types of attacks, especially in a world of cloud services and publicly available virtual machines. Our approach has been to focus on mitigation: ensuring that compromised PCIe devices can’t jeopardize the security of the rest of the computer.
Fuzzing to the rescue
A key weapon in our arsenal is fuzzing, a testing technique that uses invalid, unexpected or random inputs to expose irregular behavior, such as memory leaks, crashes, or undocumented functionality. The hardware fuzzer we built directly tests the behavior of the PCIe switches used by our cloud GPUs.
After our initial research into the PCIe spec, we prepared a list of edge cases and device behaviors that didn’t have clearly defined outcomes. We wanted to test these behaviors on real hardware, and we also wanted to find out whether real hardware implemented the well defined parts of the spec properly. Hardware bugs are actually quite common, but many security professionals assume their absence, simply trusting the manufacturer. At Google, we want to verify every layer of the stack, including hardware.
Our plan called for a fuzzer that was highly specialized, and designed to be effective against the production configurations we use in our cloud hardware. We use a variety of GPU and switch combinations on our machines, so we set up some programmable network interface controllers (NICs) in similar configurations to simulate GPU memory accesses.
Our fuzzer used those NICs to aggressively hammer the port directly upstream from each NIC, as well as any other accessible ports in the network, with a variety of memory reads and writes. These operations included a mixture of targeted attacks, randomness and “lucky numbers” that tend to cause problems on many hardware architectures. We wanted to detect changes to the configuration of any port as a result of the fuzzing, particularly the port’s secondary and subordinate bus numbers. PCIe networks with Source Validation enabled are governed primarily by these bus numbers, which dictate where packets can and cannot go. Being able to reconfigure a port’s secondary or subordinate bus numbers could give you access to parts of the PCIe network that should be forbidden.
Our security team reviewed any suspicious memory reads or writes to determine if they represent security vulnerabilities, and adjusted either the fuzzer or our PCIe settings accordingly.
We discovered some curiosities. For instance, on one incorrect configuration, some undocumented debug registers on the switch were incorrectly exposed to downstream devices, which we discovered could cause serious malfunctioning of the switch under certain access patterns. If a device can cause out-of-spec behavior in the switch it’s connected to, it may be able to cause insecure routing, which would compromise the entire network. The value of fuzzing is its ability to find vulnerabilities in undocumented and undefined areas, outside the normal set of behaviors and operations defined in the spec. But by the end of the process, we had determined a minimum set of ACS features necessary to securely run GPUs in the cloud.
Let’s check out those memory mappings too
When you make use of a GPU on a local computer through the root OS, it has direct memory access to the computer’s memory. This is very fast and straightforward. However, that model doesn’t work in a virtualized environment like Google Compute Engine.
When a virtual machine is initialized, a set of page tables maps the guest’s physical memory to the host’s physical memory, but the GPU has no way to know about those mappings, and thus will attempt to write to the wrong places. This is where the Input–output memory management unit (IOMMU) comes in. The IOMMU is a page table, translating GPU accesses into DRAM/MMIO reads and writes. It’s implemented in hardware, which reduces the remapping overhead.
This means the IOMMU is performing a pretty delicate operation. It’s mapping its own I/O virtual addresses into host physical addresses. We wanted to verify that the IOMMU was functioning correctly, and ensure that it was enabled any time a device may be running untrusted code, so that there would be no opportunity for unfiltered accesses.
Furthermore, there were features of the IOMMU that we didn’t want, like compatibility interrupts. This is a type of interrupt that exists to support older Intel platforms that lack the interrupt-remapping capabilities that the IOMMU gives you. They’re not necessary for modern hardware, and leaving them enabled allows guests to trigger unexpected MSIs, machine reboots, and host crashes.
The most interesting challenge here is protecting against PCIe’s Address Translation Services (ATS). Using this feature, any device can claim it’s using an address that’s already been translated, and thus bypass IOMMU translation. For trusted devices, this is a useful performance improvement. For untrusted devices, this is a big security threat. ATS could allow a compromised device to ignore the IOMMU and write to places it shouldn’t have access to.
Luckily, there’s an ACS setting that can disable ATS for any given device. Thus, we disabled compatibility interrupts, disabled ATS, and had a separate fuzzer attempt to access memory outside the range specifically mapped to it. After some aggressive testing we determined that the IOMMU worked as advertised and could not be bypassed by a malicious device.
Beyond simply verifying our hardware in a test environment, we wanted to make sure our hardware remains secure in all of production. Misconfigurations are likely the biggest source of major outages in production environments, and it’s a similar story with security vulnerabilities. Since ACS and IOMMU can be enabled or disabled at multiple layers of the stack—potentially varying between kernel versions, the default settings of the device, or other seemingly-minor tweaks—we would be remiss to rely solely on isolated unit tests to verify these settings. So, we developed tooling to monitor the ACS and IOMMU settings in production, so that any misconfiguration of the system could be quickly detected and rolled back.
As much as possible, it’s good practice not to trust hardware without first verifying that it works correctly, and our targeted attacks and robust fuzzing allowed us to settle on a list of ACS settings that allow us to share GPUs with cloud users securely. This has resulted in being able to provide GPUs to our customers with a high degree of confidence in the security of the underlying system. Stay tuned for more posts that detail how we implement security at Google Cloud.