L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

- Processors from AMD, Centaur and other non Intel vendors

- Older processor models, where the CPU family is < 6

- A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
  Penwell, Pineview, Silvermont, Airmont, Merrifield)

- The Intel XEON PHI family

- Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
  IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
  by the Meltdown vulnerability either. These CPUs should become
  available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

=============  =================  ==============================
CVE-2018-3615  L1 Terminal Fault  SGX related aspects
CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
=============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is retired eventually, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for unprivileged malicious code to mount side channel attacks,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacking any physical memory address in the system and the attack
works across all protection domains. It allows an attack on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

Operating Systems store arbitrary information in the address bits of a
PTE which is marked non present. This allows a malicious user space
application to attack the physical memory to which these PTEs resolve.
In some cases user-space can maliciously influence the information
encoded in the address bits of the PTE, thus making attacks more
deterministic and more practical.

The Linux kernel contains a mitigation for this attack vector, PTE
inversion, which is permanently enabled and has no performance
impact. The kernel ensures that the address bits of PTEs, which are not
marked present, never point to cacheable physical memory space.

A system with an up to date kernel is protected against attacks from
malicious user space applications.

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The fact that L1TF breaks all domain protections allows malicious guest
OSes, which can control the PTEs directly, and malicious guest user
space applications, which run on an unprotected guest kernel lacking the
PTE inversion mitigation for L1TF, to attack physical host memory.

A special aspect of L1TF in the context of virtualization is simultaneous
multi-threading (SMT). The Intel implementation of SMT is called
HyperThreading. The fact that Hyperthreads on the affected processors
share the L1 Data Cache (L1D) is important for this. As the flaw only
allows attacking data which is present in the L1D, a malicious guest
running on one Hyperthread can attack the data which is brought into the
L1D by the context which runs on the sibling Hyperthread of the same
physical core. This context can be host OS, host user space or a
different guest.

If the processor does not support Extended Page Tables, the attack is
only possible when the hypervisor does not sanitize the content of the
effective (shadow) page tables.

While solutions exist to mitigate these attack vectors fully, these
mitigations are not enabled by default in the Linux kernel because they
can affect performance significantly. The kernel provides several
mechanisms which can be utilized to address the problem depending on the
deployment scenario. The mitigations, their protection scope and impact
are described in the next sections.

The default mitigations and the rationale for choosing them are explained
at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

   /sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

===========================   ===============================
'Not affected'                The processor is not vulnerable
'Mitigation: PTE Inversion'   The host protection is active
===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

- SMT status:

  =====================  ================
  'VMX: SMT vulnerable'  SMT is enabled
  'VMX: SMT disabled'    SMT is disabled
  =====================  ================

- L1D Flush mode:

  ================================  ====================================
  'L1D vulnerable'                  L1D flushing is disabled

  'L1D conditional cache flushes'   L1D flush is conditionally enabled

  'L1D cache flushes'               L1D flush is unconditionally enabled
  ================================  ====================================

The resulting grade of protection is discussed in the following sections.
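
For example, on a vulnerable host with the KVM/VMX module loaded the file
can simply be read from the shell. The output below is only illustrative;
the exact string depends on the kernel version and the active
configuration::

   cat /sys/devices/system/cpu/vulnerabilities/l1tf
   Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable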


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

To make sure that a guest cannot attack data which is present in the L1D
the hypervisor flushes the L1D before entering the guest.

Flushing the L1D evicts not only the data which should not be accessed
by a potentially malicious guest, it also flushes the guest
data. Flushing the L1D has a performance impact as the processor has to
bring the flushed guest data back into the L1D. Depending on the
frequency of VMEXIT/VMENTER and the type of computations in the guest
performance degradation in the range of 1% to 50% has been observed. For
scenarios where guest VMEXIT/VMENTER are rare the performance impact is
minimal. Virtio and mechanisms like posted interrupts are designed to
confine the VMEXITs to a bare minimum, but specific configurations and
application scenarios might still suffer from a high VMEXIT rate.

The kernel provides two L1D flush modes:

- conditional ('cond')
- unconditional ('always')

The conditional mode avoids L1D flushing after VMEXITs which execute
only audited code paths before the corresponding VMENTER. These code
paths have been verified not to expose secrets or other interesting
data to an attacker, but they can leak information about the address
space layout of the hypervisor.

Unconditional mode flushes L1D on all VMENTER invocations and provides
maximum protection. It has a higher overhead than the conditional
mode. The overhead cannot be quantified correctly as it depends on the
workload scenario and the resulting number of VMEXITs.

The general recommendation is to enable L1D flush on VMENTER. The kernel
defaults to conditional mode on affected processors.

**Note**, that L1D flush does not prevent the SMT problem because the
sibling thread will also bring its data back into the L1D which makes it
attackable again.

L1D flush can be controlled by the administrator via the kernel command
line and sysfs control files. See :ref:`mitigation_control_command_line`
and :ref:`mitigation_control_kvm`.
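
The currently active flush mode can be read back at runtime once the
kvm_intel module is loaded (see :ref:`mitigation_control_kvm` for the
parameter details); the output shown here is the default::

   cat /sys/module/kvm_intel/parameters/vmentry_l1d_flush
   cond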

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address the SMT problem, it is possible to make a guest or a group of
guests affine to one or more physical cores. The proper mechanism for
that is to utilize exclusive cpusets to ensure that no other guest or
host tasks can run on these cores.

If only a single guest or related guests run on sibling SMT threads on
the same physical core then they can only attack their own memory and
restricted parts of the host memory.

Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context. The amount
of valuable information from the host OS context depends on the context
which the host OS executes, i.e. interrupts, soft interrupts and kernel
threads. The amount of valuable data from these contexts cannot be
declared as non-interesting for an attacker without deep inspection of
the code.

**Note**, that assigning guests to a fixed set of physical cores affects
the ability of the scheduler to do load balancing and might have
negative effects on CPU utilization depending on the hosting
scenario. Disabling SMT might be a viable alternative for particular
scenarios.

For further information about confining guests to a single or to a group
of cores consult the cpusets documentation:

https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
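
A minimal sketch of such a confinement, assuming the cgroup-v1 cpuset
controller is mounted at /sys/fs/cgroup/cpuset and that CPUs 4-7 are the
SMT siblings of the two physical cores dedicated to the guest, could look
like this::

   # create an exclusive cpuset for the guest's vCPU threads
   mkdir /sys/fs/cgroup/cpuset/vm1
   echo 4-7 > /sys/fs/cgroup/cpuset/vm1/cpuset.cpus
   echo 0   > /sys/fs/cgroup/cpuset/vm1/cpuset.mems   # NUMA node 0, adjust to the topology
   echo 1   > /sys/fs/cgroup/cpuset/vm1/cpuset.cpu_exclusive

   # move each vCPU thread of the guest into the cpuset ($VCPU_TID is a placeholder)
   echo $VCPU_TID > /sys/fs/cgroup/cpuset/vm1/tasks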

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

Interrupts can be made affine to logical CPUs. This is not universally
true because there are types of interrupts which are truly per CPU
interrupts, e.g. the local timer interrupt. Aside from that, multi queue
devices affine their interrupts to single CPUs or groups of CPUs per
queue without allowing the administrator to control the affinities.

Moving the interrupts, which can be affinity controlled, away from CPUs
which run untrusted guests reduces the attack vector space.

Whether the interrupts which are affine to CPUs running untrusted
guests provide interesting data for an attacker depends on the system
configuration and the scenarios which run on the system. While for some
of the interrupts it can be assumed that they won't expose interesting
information beyond exposing hints about the host OS memory layout, there
is no way to make general assumptions.

Interrupt affinity can be controlled by the administrator via the
/proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
available at:

https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
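
As an illustration, assuming IRQ 42 belongs to a device whose interrupts
should be kept away from guest CPUs 4-7, its affinity can be restricted to
the remaining CPUs (the IRQ number and CPU ranges are placeholders)::

   # inspect the current affinity of the interrupt
   cat /proc/irq/42/smp_affinity_list

   # restrict it to CPUs 0-3, i.e. away from the guest CPUs
   echo 0-3 > /proc/irq/42/smp_affinity_list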

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

To prevent the SMT issues of L1TF it might be necessary to disable SMT
completely. Disabling SMT can have a significant performance impact, but
the impact depends on the hosting scenario and the type of workloads.
The impact of disabling SMT also needs to be weighed against the impact
of other mitigation solutions like confining guests to dedicated cores.

The kernel provides a sysfs interface to retrieve the status of SMT and
to control it. It also provides a kernel command line interface to
control SMT.

The kernel command line interface consists of the following options:

=========== ==========================================================
nosmt       Affects the bring up of the secondary CPUs during boot. The
            kernel tries to bring all present CPUs online during the
            boot process. "nosmt" makes sure that from each physical
            core only one - the so called primary (hyper) thread is
            activated. Due to a design flaw of Intel processors related
            to Machine Check Exceptions the non primary siblings have
            to be brought up at least partially and are then shut down
            again.  "nosmt" can be undone via the sysfs interface.

nosmt=force Has the same effect as "nosmt" but it does not allow
            undoing the SMT disable via the sysfs interface.
=========== ==========================================================

The sysfs interface provides two files:

- /sys/devices/system/cpu/smt/control
- /sys/devices/system/cpu/smt/active

/sys/devices/system/cpu/smt/control:

This file allows reading out the SMT control state and provides the
ability to disable or (re)enable SMT. The possible states are:

==============  ===================================================
on              SMT is supported by the CPU and enabled. All
                logical CPUs can be onlined and offlined without
                restrictions.

off             SMT is supported by the CPU and disabled. Only
                the so called primary SMT threads can be onlined
                and offlined without restrictions. An attempt to
                online a non-primary sibling is rejected.

forceoff        Same as 'off' but the state cannot be controlled.
                Attempts to write to the control file are rejected.

notsupported    The processor does not support SMT. It's therefore
                not affected by the SMT implications of L1TF.
                Attempts to write to the control file are rejected.
==============  ===================================================

The possible states which can be written into this file to control SMT
state are:

- on
- off
- forceoff

/sys/devices/system/cpu/smt/active:

This file reports whether SMT is enabled and active, i.e. if on any
physical core two or more sibling threads are online.
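
For example, the SMT state can be inspected and SMT disabled at runtime via
the files above (the shown values are illustrative; the write requires
root)::

   cat /sys/devices/system/cpu/smt/control        # e.g. "on"
   cat /sys/devices/system/cpu/smt/active         # e.g. "1"

   # disable SMT; only the primary thread of each core stays online
   echo off > /sys/devices/system/cpu/smt/control
   cat /sys/devices/system/cpu/smt/active         # now "0"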

SMT control is also possible at boot time via the l1tf kernel command
line parameter in combination with L1D flush control. See
:ref:`mitigation_control_command_line`.

5. Disabling EPT
^^^^^^^^^^^^^^^^

Disabling EPT for virtual machines provides full mitigation for L1TF even
with SMT enabled, because the effective page tables for guests are
managed and sanitized by the hypervisor. However, disabling EPT has a
significant performance impact, especially when the Meltdown mitigation
KPTI is enabled.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
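
As an illustration only, EPT can be turned off when loading the module, or
made persistent via a modprobe configuration file (the file name below is
an example)::

   # disable EPT for the current module load
   modprobe kvm_intel ept=0

   # or persist the setting across reboots
   echo "options kvm_intel ept=0" > /etc/modprobe.d/kvm-l1tf.conf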

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

============  =============================================================
full          Provides all available mitigations for the L1TF
              vulnerability. Disables SMT and enables all mitigations in
              the hypervisors, i.e. unconditional L1D flushing

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

full,force    Same as 'full', but disables SMT and L1D flush runtime
              control. Implies the 'nosmt=force' command line option.
              (i.e. sysfs control of SMT is disabled.)

flush         Leaves SMT enabled and enables the default hypervisor
              mitigation, i.e. conditional L1D flushing

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
              i.e. conditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
              started in a potentially insecure configuration.

off           Disables hypervisor mitigations and doesn't emit any
              warnings.
              It also drops the swap size and available RAM limit
              restrictions on both hypervisor and bare metal.

============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
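
As a practical example, the option can be made persistent by appending it to
the kernel command line in the boot loader configuration. The snippet below
assumes a GRUB 2 based setup with the conventional /etc/default/grub file;
other distributions manage kernel parameters differently::

   # /etc/default/grub (existing options are abbreviated as "...")
   GRUB_CMDLINE_LINUX_DEFAULT="... l1tf=full,force"

   # regenerate the configuration afterwards, e.g. with:
   grub-mkconfig -o /boot/grub/grub.cfg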


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

============  ==============================================================
always        L1D cache flush on every VMENTER.

cond          Flush L1D on VMENTER only when the code between VMEXIT and
              VMENTER can leak host memory which is considered
              interesting for an attacker. This still can leak host memory
              which e.g. allows determining the host's address space
              layout.

never         Disables the mitigation
============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and modified at runtime via the sysfs
file:

   /sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
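
For example, assuming 'l1tf=full,force' is not in effect, the flush mode can
be changed at runtime or selected when the module is loaded::

   # switch to unconditional flushing at runtime
   echo always > /sys/module/kvm_intel/parameters/vmentry_l1d_flush

   # or select the mode when loading the module
   modprobe kvm_intel vmentry_l1d_flush=always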

.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The system is protected by the kernel unconditionally and no further
action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the guest comes from a trusted source and the guest OS kernel is
guaranteed to have the L1TF mitigations in place, the system is fully
protected against L1TF and no further action is required.

To avoid the overhead of the default L1D flushing on VMENTER the
administrator can disable the flushing via the kernel command line and
sysfs control files. See :ref:`mitigation_control_command_line` and
:ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

If SMT is not supported by the processor or disabled in the BIOS or by
the kernel, it's only required to enforce L1D flushing on VMENTER.

Conditional L1D flushing is the default behaviour and can be tuned. See
:ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

If EPT is not supported by the processor or disabled in the hypervisor,
the system is fully protected. SMT can stay enabled and L1D flushing on
VMENTER is not required.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

If SMT and EPT are supported and active then various degrees of
mitigations can be employed:

- L1D flushing on VMENTER:

  L1D flushing on VMENTER is the minimal protection requirement, but it
  is only potent in combination with other mitigation methods.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

- Guest confinement:

  Confinement of guests to a single or a group of physical cores which
  are not running any other processes can reduce the attack surface
  significantly, but interrupts, soft interrupts and kernel threads can
  still expose valuable data to a potential attacker. See
  :ref:`guest_confinement`.

- Interrupt isolation:

  Isolating the guest CPUs from interrupts can reduce the attack surface
  further, but still allows a malicious guest to explore a limited amount
  of host physical memory. This can at least be used to gain knowledge
  about the host address space layout. Depending on the scenario, the
  interrupts which have a fixed affinity to the CPUs which run the
  untrusted guests can still trigger soft interrupts and schedule kernel
  threads which might expose valuable information. See
  :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

- Disabling SMT:

  Disabling SMT and enforcing the L1D flushing provides the maximum
  amount of protection. This mitigation does not depend on any of the
  above mitigation methods.

  SMT control and L1D flushing can be tuned by the command line
  parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
  time with the matching sysfs control files. See :ref:`smt_control`,
  :ref:`mitigation_control_command_line` and
  :ref:`mitigation_control_kvm`.

- Disabling EPT:

  Disabling EPT provides the maximum amount of protection as well. It
  does not depend on any of the above mitigation methods. SMT can stay
  enabled and L1D flushing is not required, but the performance impact is
  significant.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
  parameter.

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine.  VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

- Flush the L1D cache on every switch from the nested hypervisor to the
  nested virtual machine, so that the nested hypervisor's secrets are not
  exposed to the nested virtual machine;

- Flush the L1D cache on every switch from the nested virtual machine to
  the nested hypervisor; this is a complex operation, and flushing the L1D
  cache prevents the bare metal hypervisor's secrets from being exposed to
  the nested virtual machine;

- Instruct the nested hypervisor to not perform any L1D cache flush. This
  is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

The kernel default mitigations for vulnerable processors are:

- PTE inversion to protect against malicious user space. This is done
  unconditionally and cannot be controlled. The swap storage is limited
  to ~16TB.

- L1D conditional flushing on VMENTER when EPT is enabled for
  a guest.

The kernel does not by default enforce the disabling of SMT, which leaves
SMT systems vulnerable when running untrusted guests with EPT enabled.

The rationale for this choice is:

- Force disabling SMT can break existing setups, especially with
  unattended updates.

- If regular users run untrusted guests on their machine, then L1TF is
  just an add-on to other malware which might be embedded in an untrusted
  guest, e.g. spam-bots or attacks on the local network.

  There is no technical way to prevent a user from running untrusted code
  on their machines blindly.

- It's technically extremely unlikely and from today's knowledge even
  impossible that L1TF can be exploited via the most popular attack
  mechanisms like JavaScript because these mechanisms have no way to
  control PTEs. If this were possible and no other mitigation were
  available, then the default might be different.

- The administrators of cloud and hosting setups have to carefully
  analyze the risk for their scenarios and make the appropriate
  mitigation choices, which might even vary across their deployed
  machines and also result in other changes of their overall setup.
  There is no way for the kernel to provide a sensible default for this
  kind of scenario.