Stage 2 translation

What is stage 2 translation?

Stage 2 translation allows a hypervisor to control a view of memory in a Virtual Machine (VM). Specifically, it allows the hypervisor to control which memory-mapped system resources a VM can access, and where those resources appear in the address space of the VM.

This ability to control memory access is important for isolation and sandboxing. Stage 2 translation can be used to ensure that a VM can only see the resources that are allocated to it, and not the resources that are allocated to other VMs or the hypervisor.

Stage 2 translation is, as the name suggests, a second stage of address translation. To support it, a new set of translation tables, known as Stage 2 tables, is required, as shown here:

 VA to IPA to PA address translation

An Operating System (OS) controls a set of translation tables that map from the virtual address space to what it thinks is the physical address space. However, these addresses undergo a second translation to reach the real physical address space. This second stage of translation is controlled by the hypervisor.

The OS-controlled translation is called stage 1 translation, and the hypervisor-controlled translation is called stage 2 translation. The address space that the OS thinks is physical memory is referred to as the Intermediate Physical Address (IPA) space.
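
To make the two stages concrete, here is a minimal C sketch of the concept. The page size, table contents and function names are invented for illustration, and a single flat table per stage stands in for the multi-level walk that the hardware actually performs: the OS-controlled table produces an IPA, and the hypervisor-controlled table turns that IPA into a PA.

    #include <stdint.h>
    #include <stdio.h>

    /* Purely illustrative single-level "page tables" with a 4KB page size.
     * Real translations are multi-level walks performed by the MMU; this
     * sketch only shows that the output of the OS-controlled stage feeds
     * the hypervisor-controlled stage. */
    #define PAGE_SHIFT 12
    #define NUM_PAGES  16

    static uint64_t stage1_table[NUM_PAGES]; /* VA page  -> IPA page (OS)        */
    static uint64_t stage2_table[NUM_PAGES]; /* IPA page -> PA page (hypervisor) */

    static uint64_t translate(uint64_t va)
    {
        uint64_t offset   = va & ((1u << PAGE_SHIFT) - 1);
        uint64_t ipa_page = stage1_table[va >> PAGE_SHIFT];  /* stage 1 */
        uint64_t pa_page  = stage2_table[ipa_page];          /* stage 2 */
        return (pa_page << PAGE_SHIFT) | offset;
    }

    int main(void)
    {
        stage1_table[1] = 4;  /* the OS maps VA page 1 to IPA page 4         */
        stage2_table[4] = 9;  /* the hypervisor maps IPA page 4 to PA page 9 */
        printf("VA 0x1080 -> PA 0x%llx\n",
               (unsigned long long)translate(0x1080));       /* prints 0x9080 */
        return 0;
    }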

Note: For an introduction to how address translation works, see our guide on Memory Management.

The format of the translation tables used for stage 2 is very similar to that used for stage 1. However, some of the attributes are handled differently in stage 2. In particular, the Type, Normal or Device, is encoded directly into the table entry, rather than indirectly via a MAIR_ELx register.
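
The following sketch shows how a stage 2 page descriptor might be built in C. The macro and function names are invented; the field encodings follow the VMSAv8-64 stage 2 descriptor format, with the Type carried in the MemAttr field of the entry itself rather than in MAIR_ELx:

    #include <stdint.h>

    /* Minimal sketch of a stage 2 page descriptor (level 3, 4KB granule). */
    #define S2_DESC_VALID_PAGE   (0x3ull << 0)   /* valid + page descriptor     */
    #define S2_MEMATTR_DEVICE    (0x1ull << 2)   /* MemAttr: Device-nGnRE       */
    #define S2_MEMATTR_NORMAL_WB (0xFull << 2)   /* MemAttr: Normal, Write-Back */
    #define S2_AP_RW             (0x3ull << 6)   /* S2AP: read/write            */
    #define S2_SH_INNER          (0x3ull << 8)   /* Inner Shareable             */
    #define S2_AF                (0x1ull << 10)  /* Access Flag                 */

    /* Build a stage 2 entry that maps one 4KB IPA page to a PA page. */
    static uint64_t s2_page_desc(uint64_t pa, int device)
    {
        uint64_t attrs = device ? S2_MEMATTR_DEVICE : S2_MEMATTR_NORMAL_WB;
        return (pa & ~0xFFFull) | attrs | S2_AP_RW | S2_SH_INNER | S2_AF |
               S2_DESC_VALID_PAGE;
    }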

VMIDs

Each VM is assigned a virtual machine identifier (VMID). The VMID is used to tag translation lookaside buffer (TLB) entries, to identify which VM each entry belongs to.  This tagging allows translations for multiple different VMs to be present in the TLBs at the same time.

The VMID can be either 8 or 16 bits, and is controlled by the VTCR_EL2.VS bit. Support for 16-bit VMIDs is optional, and was added in Armv8.1-A.

Note: Translations for the EL2 and EL3 translation regimes are not tagged with a VMID, because they are not subject to stage 2 translation.
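
As a rough illustration, a hypervisor assigns the VMID when it installs a VM's stage 2 tables, because VTTBR_EL2 carries both the stage 2 table base address and the VMID in its upper bits. The helper names in this sketch are invented:

    #include <stdint.h>

    #define VTTBR_VMID_SHIFT 48   /* VMID lives in VTTBR_EL2[63:48] */

    static inline void write_vttbr_el2(uint64_t v)
    {
        __asm__ volatile("msr vttbr_el2, %0" :: "r"(v));
    }

    /* Install a VM's stage 2 tables, tagging its translations with 'vmid'.
     * With 8-bit VMIDs (VTCR_EL2.VS == 0), only the low 8 bits are used. */
    static void install_stage2_tables(uint64_t s2_table_base, uint16_t vmid)
    {
        uint64_t vttbr = (s2_table_base & 0x0000FFFFFFFFFFFEull) |
                         ((uint64_t)vmid << VTTBR_VMID_SHIFT);
        write_vttbr_el2(vttbr);
        __asm__ volatile("isb");
    }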

VMID interaction with ASIDs

TLB entries can also be tagged with an Address Space Identifier (ASID). An application is assigned an ASID by the OS, and all the TLB entries for that application are tagged with that ASID. This means that TLB entries for different applications can coexist in the TLB, without the possibility of one application using the TLB entries that belong to a different application.

Each VM has its own ASID namespace. For example, two VMs might both use ASID 5, but for different applications. It is the combination of ASID and VMID that identifies a translation.
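
Conceptually, a TLB lookup for the EL1&0 regime has to match the VA, the ASID and the VMID together. This small C model illustrates only that matching rule, not any real TLB implementation; the struct and function names are invented:

    #include <stdbool.h>
    #include <stdint.h>

    /* Conceptual tag of a TLB entry for the EL1&0 translation regime. */
    struct tlb_tag {
        uint64_t va_page;
        uint16_t asid;    /* assigned by the guest OS, per application */
        uint16_t vmid;    /* assigned by the hypervisor, per VM        */
        bool     global;  /* nG bit clear: entry matches any ASID      */
    };

    static bool tlb_hit(const struct tlb_tag *e,
                        uint64_t va_page, uint16_t asid, uint16_t vmid)
    {
        /* ASID 5 in one VM never matches ASID 5 in another VM, because
         * the VMID comparison is always performed. */
        return e->va_page == va_page &&
               e->vmid == vmid &&
               (e->global || e->asid == asid);
    }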

Attribute combining and overriding

The stage 1 and stage 2 mappings both include attributes, such as type and access permissions. The Memory Management Unit (MMU) combines the attributes from the two stages to give a final effective value. The MMU does this by selecting the stage that is more restrictive, as you can see here:

Combining stage 1 and stage 2 attributes 

In this example, the Device type is more restrictive than the Normal type.  Therefore, the resulting type is Device. The result would be the same if we reversed the example, so that stage 1 = Normal, and stage 2 = Device.
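
The rule for the Type can be written as a one-line function. This is only a sketch of the combining rule for the memory type; cacheability and shareability combine in a similar most-restrictive-wins way:

    /* Device is more restrictive than Normal, so it wins regardless of
     * which stage supplied it. */
    typedef enum { MEM_NORMAL, MEM_DEVICE } mem_type_t;

    static mem_type_t combine_type(mem_type_t stage1, mem_type_t stage2)
    {
        return (stage1 == MEM_DEVICE || stage2 == MEM_DEVICE) ? MEM_DEVICE
                                                              : MEM_NORMAL;
    }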

This method of combining attributes works for most use cases, but sometimes the hypervisor might want to override this behavior, for example during early boot of a VM. For these cases, there are control bits that override the normal behavior:

  • HCR_EL2.CD. This makes all stage 1 attributes Non-cacheable.
  • HCR_EL2.DC. This forces stage 1 attributes to be Normal, Write-Back Cacheable.
  • HCR_EL2.FWB. This allows stage 2 to override the stage 1 attribute, instead of regular attribute combining.

Note: HCR_EL2.FWB was introduced in Armv8.4-A.
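
For example, a hypervisor might set HCR_EL2.DC while a guest boots with its stage 1 MMU still disabled, then clear it again later. This sketch assumes the HCR_EL2 bit positions for DC, CD and FWB, and the helper names are invented:

    #include <stdint.h>

    #define HCR_EL2_DC   (1ull << 12)  /* force stage 1 Normal, Write-Back   */
    #define HCR_EL2_CD   (1ull << 32)  /* force stage 1 Non-cacheable        */
    #define HCR_EL2_FWB  (1ull << 46)  /* let stage 2 override stage 1 attrs */

    static inline uint64_t read_hcr_el2(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, hcr_el2" : "=r"(v));
        return v;
    }

    static inline void write_hcr_el2(uint64_t v)
    {
        __asm__ volatile("msr hcr_el2, %0" :: "r"(v));
        __asm__ volatile("isb");
    }

    /* Treat guest memory as cacheable while the guest runs with its own
     * MMU off; clear DC once the guest has enabled stage 1 translation. */
    static void set_early_boot_attrs(int early_boot)
    {
        uint64_t hcr = read_hcr_el2();
        hcr = early_boot ? (hcr | HCR_EL2_DC) : (hcr & ~HCR_EL2_DC);
        write_hcr_el2(hcr);
    }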

Emulating Memory-mapped Input/Output (MMIO)

Like the physical address space on a physical machine, the IPA space in a VM contains regions that are used to access both memory and peripherals, as shown here:

 

The VM can use peripheral regions to access both real physical peripherals, which are often referred to as directly assigned peripherals, and virtual peripherals.

Virtual peripherals are completely emulated in software by the hypervisor, as this diagram highlights:

Stage 2 mappings for virtual and assigned peripherals 

An assigned peripheral is a real physical device that has been allocated to the VM, and mapped into its IPA space. This allows software that is running within the VM to interact with the peripheral directly.

A virtual peripheral is one that the hypervisor emulates in software. The corresponding stage 2 table entries are marked as faulting. Software in the VM thinks that it can talk directly to the peripheral, but each access triggers a stage 2 fault, with the hypervisor emulating the peripheral access in the exception handler.
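
Marking an entry as faulting can be as simple as leaving the stage 2 descriptor invalid, as this fragment suggests. The table layout and helper name are invented, and TLB maintenance is omitted:

    #include <stdint.h>

    #define S2_DESC_INVALID 0ull   /* bit 0 clear: any access faults to EL2 */

    static inline void unmap_virtual_peripheral(uint64_t *s2_level3_table,
                                                unsigned index)
    {
        s2_level3_table[index] = S2_DESC_INVALID;
    }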

To emulate a peripheral, a hypervisor needs to know not only which peripheral was accessed, but also which register in that peripheral was accessed, whether the access was a read or a write, the size of the access, and the registers used for transferring data.

Starting with the address, our guide on the Exception Model introduces the FAR_ELx registers. When dealing with stage 1 faults, these registers report the virtual address that triggered the exception. A virtual address is not helpful to a hypervisor, because the hypervisor would not usually know how the Guest OS has configured its virtual address space. For stage 2 faults, there is an additional register, HPFAR_EL2, which reports the IPA of the address that aborted. Because the IPA space is controlled by the hypervisor, it can use this information to determine the register that it needs to emulate.

Our guide on the Exception Model shows how the ESR_ELx registers report information about the exception. For single general-purpose register loads or stores that trigger a stage 2 fault, additional syndrome information is provided. This information includes the size of the access and the source or destination register, allowing a hypervisor to determine the type of access that is being made to the virtual peripheral.
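
Putting these registers together, a hypervisor's stage 2 abort handler might decode the access as shown below. The struct and helper names are invented; the fields read from ESR_EL2, HPFAR_EL2 and FAR_EL2 are the syndrome and address fields described above:

    #include <stdbool.h>
    #include <stdint.h>

    struct s2_fault {
        uint64_t ipa;        /* faulting intermediate physical address  */
        bool     is_write;   /* WnR, ESR_EL2 bit 6                      */
        bool     valid_isv;  /* ISV, bit 24: syndrome fields are valid  */
        unsigned size;       /* access size in bytes, from SAS [23:22]  */
        unsigned reg;        /* source/destination Xt, from SRT [20:16] */
    };

    static inline uint64_t read_esr_el2(void)
    {
        uint64_t v; __asm__ volatile("mrs %0, esr_el2" : "=r"(v)); return v;
    }
    static inline uint64_t read_hpfar_el2(void)
    {
        uint64_t v; __asm__ volatile("mrs %0, hpfar_el2" : "=r"(v)); return v;
    }
    static inline uint64_t read_far_el2(void)
    {
        uint64_t v; __asm__ volatile("mrs %0, far_el2" : "=r"(v)); return v;
    }

    static void decode_stage2_fault(struct s2_fault *f)
    {
        uint64_t esr   = read_esr_el2();
        uint64_t hpfar = read_hpfar_el2();
        uint64_t far   = read_far_el2();

        f->valid_isv = (esr >> 24) & 1;
        f->size      = 1u << ((esr >> 22) & 0x3);  /* SAS: 1, 2, 4 or 8 bytes */
        f->reg       = (esr >> 16) & 0x1F;         /* SRT: Xt register number */
        f->is_write  = (esr >> 6) & 1;             /* WnR: write, not read    */

        /* HPFAR_EL2.FIPA gives the page-aligned IPA; FAR_EL2 supplies the
         * offset of the faulting virtual address within the page. */
        f->ipa = ((hpfar & 0xFFFFFFFFFF0ull) << 8) | (far & 0xFFFull);
    }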

This diagram illustrates the process of trapping then emulating the access:

Example of emulating an access to MMIO 

This process is described in these steps, and sketched in code after them:

  1. Software in the VM attempts to access the virtual peripheral. In this example, this is the receive FIFO of a virtual UART.
  2. This access is blocked at stage 2 translation, leading to an abort routed to EL2.
    • The abort populates ESR_EL2 with information about the exception, including the number of bytes accessed, the target register and whether it was a load or store.
    • The abort also populates HPFAR_EL2 with the IPA of the aborting access.
  3. The hypervisor uses the information from ESR_EL2 and HPFAR_EL2 to identify the virtual peripheral register accessed. This information allows the hypervisor to emulate the operation. It then returns to the vCPU via an ERET.
    • Execution restarts on the instruction after the LDR.
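
A heavily simplified handler for the virtual UART read in these steps might look like the following. The vCPU register file, the UART's IPA and its FIFO model are all invented for illustration, and the fault information is assumed to have been decoded from ESR_EL2 and HPFAR_EL2 as sketched earlier:

    #include <stdint.h>

    #define VUART_BASE_IPA 0x09000000ull  /* hypothetical IPA of the vUART */
    #define VUART_RX_FIFO  0x00u          /* hypothetical register offset  */

    struct vcpu {
        uint64_t x[32];   /* saved guest registers; x[31] is a dummy slot
                             used when the access targets XZR */
    };

    static inline uint64_t read_elr_el2(void)
    {
        uint64_t v; __asm__ volatile("mrs %0, elr_el2" : "=r"(v)); return v;
    }
    static inline void write_elr_el2(uint64_t v)
    {
        __asm__ volatile("msr elr_el2, %0" :: "r"(v));
    }

    static uint8_t vuart_pop_rx_fifo(void)
    {
        return 0x41;   /* placeholder: a real device model dequeues data */
    }

    static void handle_vuart_fault(struct vcpu *vcpu, uint64_t ipa,
                                   unsigned reg, int is_write)
    {
        if (!is_write && (ipa - VUART_BASE_IPA) == VUART_RX_FIFO) {
            /* Emulate the load: place the FIFO data in the guest's Xt. */
            vcpu->x[reg] = vuart_pop_rx_fifo();
        }
        /* Step ELR_EL2 past the trapped 4-byte LDR, so that the ERET back
         * to the vCPU resumes on the following instruction. */
        write_elr_el2(read_elr_el2() + 4);
    }
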
System Memory Management Units (SMMUs)

So far, we have considered different types of access that come from the processor. Other masters in a system, such as DMA controllers, might be allocated for use by a VM. We need some way to extend the stage 2 protections to those masters as well.

Consider a system with a DMA controller that does not use virtualization, as shown in the following diagram:

DMA controller that does not use virtualization 

The DMA controller would be programmed via a driver, typically in kernel space.  That kernel space driver can ensure that the OS level memory protections are not breached.  This means that one application cannot use the DMA to get access to memory that it should not be able to see.

Let's consider the same system, but with the OS running within a VM, as shown in the following diagram:

 

In this system, a hypervisor is using stage 2 to provide isolation between VMs. The ability of software to see memory is limited by the stage 2 tables that the hypervisor controls.

Allowing a driver in the VM to directly interact with the DMA controller creates two problems:

  • Isolation: The DMA controller is not subject to the stage 2 tables, and could be used to breach the VM’s sandbox.
  • Address space: With two stages of translation, what the kernel believes to be PAs are IPAs. The DMA controller still sees PAs, therefore the kernel and DMA controller have different views of memory. To overcome this problem, the hypervisor could trap every interaction between the VM and the DMA controller, providing the necessary translation. When memory is fragmented, this process is inefficient and problematic.

An alternative to trapping and emulating driver accesses is to extend the stage 2 regime to also cover other masters, such as our DMA controller. When this happens, those masters also need an MMU. This is referred to as a System Memory Management Unit (SMMU, sometimes also called IOMMU):

System Memory Management Unit 

The hypervisor would be responsible for programming the SMMU, so that the upstream master, which is the DMA in our example, sees the same view of memory as the VM to which it is assigned. 

This process solves both of the problems that we identified. The SMMU can enforce the isolation between VMs, ensuring that external masters cannot be used to breach the sandbox. The SMMU also gives a consistent view of memory to software in the VM and the external masters allocated to the VM.

Virtualization is not the only use case for SMMUs. There are many other cases that are not covered within the scope of this guide.
