Documentation/admin-guide/l1tf.rst

   1 L1TF - L1 Terminal Fault
   2 ========================
   3
   4 L1 Terminal Fault is a hardware vulnerability which allows unprivileged
   5 speculative access to data which is available in the Level 1 Data Cache
   6 when the page table entry controlling the virtual address, which is used
   7 for the access, has the Present bit cleared or other reserved bits set.
   8
   9 Affected processors
  10 -------------------
  11
  12 This vulnerability affects a wide range of Intel processors. The
  13 vulnerability is not present on:
  14
  15    - Processors from AMD, Centaur and other non-Intel vendors
  16
  17    - Older processor models, where the CPU family is < 6
  18
  19    - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
  20      Penwell, Pineview, Silvermont, Airmont, Merrifield)
  21
  22    - The Intel Core Duo Yonah variants (2006 - 2008)
  23
  24    - The Intel XEON PHI family
  25
  26    - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
  27      IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
  28      by the Meltdown vulnerability either. These CPUs should become
  29      available by end of 2018.
  30
  31 Whether a processor is affected or not can be read out from the L1TF
  32 vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
  33
  34 Related CVEs
  35 ------------
  36
  37 The following CVE entries are related to the L1TF vulnerability:
  38
  39    =============  =================  ==============================
  40    CVE-2018-3615  L1 Terminal Fault  SGX related aspects
  41    CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
  42    CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
  43    =============  =================  ==============================
  44
  45 Problem
  46 -------
  47
  48 If an instruction accesses a virtual address for which the relevant page
  49 table entry (PTE) has the Present bit cleared or other reserved bits set,
  50 then speculative execution ignores the invalid PTE and loads the referenced
  51 data if it is present in the Level 1 Data Cache, as if the page referenced
  52 by the address bits in the PTE was still present and accessible.
  53
  54 While this is a purely speculative mechanism and the instruction will raise
  55 a page fault when it is retired eventually, the pure act of loading the
  56 data and making it available to other speculative instructions opens up the
  57 opportunity for side channel attacks to unprivileged malicious code,
  58 similar to the Meltdown attack.
  59
  60 While Meltdown breaks the user space to kernel space protection, L1TF
  61 allows attacks on any physical memory address in the system, and the attack
  62 works across all protection domains. It allows an attack on SGX and also
  63 works from inside virtual machines because the speculation bypasses the
  64 extended page table (EPT) protection mechanism.
  65
  66
  67 Attack scenarios
  68 ----------------
  69
  70 1. Malicious user space
  71 ^^^^^^^^^^^^^^^^^^^^^^^
  72
  73    Operating Systems store arbitrary information in the address bits of a
  74    PTE which is marked non present. This allows a malicious user space
  75    application to attack the physical memory to which these PTEs resolve.
  76    In some cases user-space can maliciously influence the information
  77    encoded in the address bits of the PTE, thus making attacks more
  78    deterministic and more practical.
  79
  80    The Linux kernel contains a mitigation for this attack vector, PTE
  81    inversion, which is permanently enabled and has no performance
  82    impact. The kernel ensures that the address bits of PTEs, which are not
  83    marked present, never point to cacheable physical memory space.
  84
  85    A system with an up to date kernel is protected against attacks from
  86    malicious user space applications.
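
The idea behind PTE inversion can be illustrated with a minimal Python
sketch. This is not the kernel implementation. The bit layout (Present bit
0, physical address bits 12 to 51) is the usual x86-64 layout and is an
assumption here, and the names ``PRESENT``, ``ADDR_MASK``, ``store_pte``
and ``load_pte`` are hypothetical. The sketch only shows how inverting the
address bits of a non-present entry moves a low offset to a very high
physical address, which is assumed not to be backed by cacheable memory.

```python
PRESENT = 1 << 0                                 # assumed: Present bit of a PTE
ADDR_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)   # assumed: address bits 12..51


def store_pte(value: int) -> int:
    """Return the value to store; invert the address bits of non-present PTEs."""
    if value & PRESENT:
        return value                 # present entries are stored unchanged
    return value ^ ADDR_MASK         # non-present entries get inverted address bits


def load_pte(stored: int) -> int:
    """Recover the original value when software reads the entry back."""
    # XOR with the same mask is its own inverse, and the Present bit is
    # never touched, so the same rule undoes the inversion.
    return store_pte(stored)
```

A non-present entry such as ``0x1234002`` is stored with bit 51 of the
address field set, and ``load_pte`` returns the original value.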
  87
  88 2. Malicious guest in a virtual machine
  89 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  90
  91    The fact that L1TF breaks all domain protections allows malicious guest
  92    OSes, which can control the PTEs directly, and malicious guest user
  93    space applications, which run on an unprotected guest kernel lacking the
  94    PTE inversion mitigation for L1TF, to attack physical host memory.
  95
  96    A special aspect of L1TF in the context of virtualization is symmetric
  97    multi threading (SMT). The Intel implementation of SMT is called
  98    HyperThreading. The fact that Hyperthreads on the affected processors
  99    share the L1 Data Cache (L1D) is important for this. As the flaw only
 100    allows attacks on data which is present in L1D, a malicious guest running
 101    on one Hyperthread can attack the data which is brought into the L1D by
 102    the context which runs on the sibling Hyperthread of the same physical
 103    core. This context can be host OS, host user space or a different guest.
 104
 105    If the processor does not support Extended Page Tables, the attack is
 106    only possible when the hypervisor does not sanitize the content of the
 107    effective (shadow) page tables.
 108
 109    While solutions exist to mitigate these attack vectors fully, these
 110    mitigations are not enabled by default in the Linux kernel because they
 111    can affect performance significantly. The kernel provides several
 112    mechanisms which can be utilized to address the problem depending on the
 113    deployment scenario. The mitigations, their protection scope and impact
 114    are described in the next sections.
 115
 116    The default mitigations and the rationale for choosing them are explained
 117    at the end of this document. See :ref:`default_mitigations`.
 118
 119 .. _l1tf_sys_info:
 120
 121 L1TF system information
 122 -----------------------
 123
 124 The Linux kernel provides a sysfs interface to enumerate the current L1TF
 125 status of the system: whether the system is vulnerable, and which
 126 mitigations are active. The relevant sysfs file is:
 127
 128 /sys/devices/system/cpu/vulnerabilities/l1tf
 129
 130 The possible values in this file are:
 131
 132   ===========================   ===============================
 133   'Not affected'                The processor is not vulnerable
 134   'Mitigation: PTE Inversion'   The host protection is active
 135   ===========================   ===============================
 136
 137 If KVM/VMX is enabled and the processor is vulnerable then the following
 138 information is appended to the 'Mitigation: PTE Inversion' part:
 139
 140   - SMT status:
 141
 142     =====================  ================
 143     'VMX: SMT vulnerable'  SMT is enabled
 144     'VMX: SMT disabled'    SMT is disabled
 145     =====================  ================
 146
 147   - L1D Flush mode:
 148
 149     ================================  ====================================
 150     'L1D vulnerable'                  L1D flushing is disabled
 151
 152     'L1D conditional cache flushes'   L1D flush is conditionally enabled
 153
 154     'L1D cache flushes'               L1D flush is unconditionally enabled
 155     ================================  ====================================
 156
 157 The resulting grade of protection is discussed in the following sections.
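
A monitoring script could interpret the file along the following lines.
This is a minimal sketch: the function names are hypothetical, and it only
searches for the strings listed in the tables above. How the kernel joins
these strings on a single line is not specified here, so the sample line
in the usage note is an assumed layout.

```python
L1TF_PATH = "/sys/devices/system/cpu/vulnerabilities/l1tf"


def parse_l1tf_status(text: str) -> dict:
    """Map the content of the l1tf sysfs file to a small status dict."""
    text = text.strip()
    if text == "Not affected":
        return {"affected": False, "pte_inversion": False,
                "smt": None, "l1d_flush": None}

    smt = None
    if "SMT vulnerable" in text:
        smt = "enabled"
    elif "SMT disabled" in text:
        smt = "disabled"

    flush = None
    if "L1D vulnerable" in text:
        flush = "disabled"
    elif "L1D conditional cache flushes" in text:
        flush = "conditional"
    elif "L1D cache flushes" in text:
        flush = "unconditional"

    return {"affected": True,
            "pte_inversion": "Mitigation: PTE Inversion" in text,
            "smt": smt, "l1d_flush": flush}


def read_l1tf_status(path: str = L1TF_PATH) -> dict:
    """Read and parse the sysfs file (requires a kernel exposing it)."""
    with open(path) as f:
        return parse_l1tf_status(f.read())
```

For an assumed line such as ``Mitigation: PTE Inversion; VMX: SMT
vulnerable, L1D conditional cache flushes`` the sketch reports SMT as
enabled and the L1D flush mode as conditional.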
 158
 159
 160 Host mitigation mechanism
 161 -------------------------
 162
 163 The kernel is unconditionally protected against L1TF attacks from malicious
 164 user space running on the host.
 165
 166
 167 Guest mitigation mechanisms
 168 ---------------------------
 169
 170 .. _l1d_flush:
 171
 172 1. L1D flush on VMENTER
 173 ^^^^^^^^^^^^^^^^^^^^^^^
 174
 175    To make sure that a guest cannot attack data which is present in the L1D
 176    the hypervisor flushes the L1D before entering the guest.
 177
 178    Flushing the L1D evicts not only the data which should not be accessed
 179    by a potentially malicious guest, it also flushes the guest
 180    data. Flushing the L1D has a performance impact as the processor has to
 181    bring the flushed guest data back into the L1D. Depending on the
 182    frequency of VMEXIT/VMENTER and the type of computations in the guest
 183    performance degradation in the range of 1% to 50% has been observed. For
 184    scenarios where guest VMEXIT/VMENTER are rare the performance impact is
 185    minimal. Virtio and mechanisms like posted interrupts are designed to
 186    confine the VMEXITs to a bare minimum, but specific configurations and
 187    application scenarios might still suffer from a high VMEXIT rate.
 188
 189    The kernel provides two L1D flush modes:
 190     - conditional ('cond')
 191     - unconditional ('always')
 192
 193    The conditional mode avoids L1D flushing after VMEXITs which execute
 194    only audited code paths before the corresponding VMENTER. These code
 195    paths have been verified not to expose secrets or other
 196    interesting data to an attacker, but they can leak information about the
 197    address space layout of the hypervisor.
 198
 199    Unconditional mode flushes L1D on all VMENTER invocations and provides
 200    maximum protection. It has a higher overhead than the conditional
 201    mode. The overhead cannot be quantified correctly as it depends on the
 202    work load scenario and the resulting number of VMEXITs.
 203
 204    The general recommendation is to enable L1D flush on VMENTER. The kernel
 205    defaults to conditional mode on affected processors.
 206
 207    **Note**, that L1D flush does not prevent the SMT problem because the
 208    sibling thread will also bring back its data into the L1D which makes it
 209    attackable again.
 210
 211    L1D flush can be controlled by the administrator via the kernel command
 212    line and sysfs control files. See :ref:`mitigation_control_command_line`
 213    and :ref:`mitigation_control_kvm`.
 214
 215 .. _guest_confinement:
 216
 217 2. Guest VCPU confinement to dedicated physical cores
 218 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 219
 220    To address the SMT problem, it is possible to make a guest or a group of
 221    guests affine to one or more physical cores. The proper mechanism for
 222    that is to utilize exclusive cpusets to ensure that no other guest or
 223    host tasks can run on these cores.
 224
 225    If only a single guest or related guests run on sibling SMT threads on
 226    the same physical core then they can only attack their own memory and
 227    restricted parts of the host memory.
 228
 229    Host memory is attackable when one of the sibling SMT threads runs in
 230    host OS (hypervisor) context and the other in guest context. The amount
 231    of valuable information from the host OS context depends on the context
 232    which the host OS executes, i.e. interrupts, soft interrupts and kernel
 233    threads. The amount of valuable data from these contexts cannot be
 234    declared as non-interesting for an attacker without deep inspection of
 235    the code.
 236
 237    **Note**, that assigning guests to a fixed set of physical cores affects
 238    the ability of the scheduler to do load balancing and might have
 239    negative effects on CPU utilization depending on the hosting
 240    scenario. Disabling SMT might be a viable alternative for particular
 241    scenarios.
 242
 243    For further information about confining guests to a single or to a group
 244    of cores consult the cpusets documentation:
 245
 246    https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
 247
 248 .. _interrupt_isolation:
 249
 250 3. Interrupt affinity
 251 ^^^^^^^^^^^^^^^^^^^^^
 252
 253    Interrupts can be made affine to logical CPUs. This is not universally
 254    true because there are types of interrupts which are truly per CPU
 255    interrupts, e.g. the local timer interrupt. Aside from that, multi-queue
 256    devices affine their interrupts to single CPUs or groups of CPUs per
 257    queue without allowing the administrator to control the affinities.
 258
 259    Moving the interrupts, which can be affinity controlled, away from CPUs
 260    which run untrusted guests, reduces the attack vector space.
 261
 262    Whether the interrupts which are affine to CPUs, which run untrusted
 263    guests, provide interesting data for an attacker depends on the system
 264    configuration and the scenarios which run on the system. While for some
 265    of the interrupts it can be assumed that they won't expose interesting
 266    information beyond exposing hints about the host OS memory layout, there
 267    is no way to make general assumptions.
 268
 269    Interrupt affinity can be controlled by the administrator via the
 270    /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
 271    available at:
 272
 273    https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
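
When moving interrupts away from guest CPUs, the administrator needs the
set of remaining host CPUs in the format each file expects: smp_affinity
takes a hexadecimal bitmask, smp_affinity_list takes a CPU list such as
``0-3,8``. The following sketch converts between the two. The function
names are hypothetical, and the comma-separated 32-bit grouping which the
kernel uses when printing masks on large systems is not handled.

```python
def parse_cpu_list(cpu_list: str) -> set:
    """Expand a CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in cpu_list.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_to_mask(cpus) -> str:
    """Return the hexadecimal bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def host_only_mask(all_cpus: str, guest_cpus: str) -> str:
    """Bitmask of the CPUs which do not run untrusted guests."""
    return cpus_to_mask(parse_cpu_list(all_cpus) - parse_cpu_list(guest_cpus))
```

For example, with CPUs ``0-7`` online and guests confined to ``2-3,6-7``,
``host_only_mask`` yields the mask for CPUs 0, 1, 4 and 5.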
 274
 275 .. _smt_control:
 276
 277 4. SMT control
 278 ^^^^^^^^^^^^^^
 279
 280    To prevent the SMT issues of L1TF it might be necessary to disable SMT
 281    completely. Disabling SMT can have a significant performance impact, but
 282    the impact depends on the hosting scenario and the type of workloads.
 283    The impact of disabling SMT also needs to be weighed against the impact
 284    of other mitigation solutions like confining guests to dedicated cores.
 285
 286    The kernel provides a sysfs interface to retrieve the status of SMT and
 287    to control it. It also provides a kernel command line interface to
 288    control SMT.
 289
 290    The kernel command line interface consists of the following options:
 291
 292      =========== ==========================================================
 293      nosmt       Affects the bring up of the secondary CPUs during boot. The
 294                  kernel tries to bring all present CPUs online during the
 295                  boot process. "nosmt" makes sure that from each physical
 296                  core only one - the so called primary (hyper) thread is
 297                  activated. Due to a design flaw of Intel processors related
 298                  to Machine Check Exceptions the non primary siblings have
 299                  to be brought up at least partially and are then shut down
 300                  again.  "nosmt" can be undone via the sysfs interface.
 301
 302      nosmt=force Has the same effect as "nosmt", but the SMT disable
 303                  cannot be undone via the sysfs interface.
 304      =========== ==========================================================
 305
 306    The sysfs interface provides two files:
 307
 308    - /sys/devices/system/cpu/smt/control
 309    - /sys/devices/system/cpu/smt/active
 310
 311    /sys/devices/system/cpu/smt/control:
 312
 313      This file exposes the SMT control state and provides the
 314      ability to disable or (re)enable SMT. The possible states are:
 315
 316         ==============  ===================================================
 317         on              SMT is supported by the CPU and enabled. All
 318                         logical CPUs can be onlined and offlined without
 319                         restrictions.
 320
 321         off             SMT is supported by the CPU and disabled. Only
 322                         the so called primary SMT threads can be onlined
 323                         and offlined without restrictions. An attempt to
 324                         online a non-primary sibling is rejected.
 325
 326         forceoff        Same as 'off' but the state cannot be controlled.
 327                         Attempts to write to the control file are rejected.
 328
 329         notsupported    The processor does not support SMT. It's therefore
 330                         not affected by the SMT implications of L1TF.
 331                         Attempts to write to the control file are rejected.
 332         ==============  ===================================================
 333
 334      The possible states which can be written into this file to control SMT
 335      state are:
 336
 337      - on
 338      - off
 339      - forceoff
 340
 341    /sys/devices/system/cpu/smt/active:
 342
 343      This file reports whether SMT is enabled and active, i.e. if on any
 344      physical core two or more sibling threads are online.
 345
 346    SMT control is also possible at boot time via the l1tf kernel command
 347    line parameter in combination with L1D flush control. See
 348    :ref:`mitigation_control_command_line`.
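
The rules for the control file can be summarized in a short sketch. The
helper names are hypothetical, and the code only encodes the states and
writable values listed above; it is not a description of the kernel
internals. ``set_smt`` needs root privileges and a kernel which provides
the sysfs files.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")
WRITABLE_VALUES = ("on", "off", "forceoff")
LOCKED_STATES = ("forceoff", "notsupported")   # writes are rejected in these states


def smt_write_accepted(current_state: str, new_value: str) -> bool:
    """Tell whether writing new_value is expected to be accepted."""
    if new_value not in WRITABLE_VALUES:
        return False
    return current_state not in LOCKED_STATES


def set_smt(new_value: str, smt_dir: Path = SMT_DIR) -> None:
    """Write new_value to the control file after checking the current state."""
    control = Path(smt_dir) / "control"
    current = control.read_text().strip()
    if not smt_write_accepted(current, new_value):
        raise ValueError(f"cannot change SMT state from {current!r} to {new_value!r}")
    control.write_text(new_value)
```

Checking the state first gives a clearer error than the rejected write
itself, e.g. when the system was booted with "nosmt=force".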
 349
 350 5. Disabling EPT
 351 ^^^^^^^^^^^^^^^^
 352
 353   Disabling EPT for virtual machines provides full mitigation for L1TF even
 354   with SMT enabled, because the effective page tables for guests are
 355   managed and sanitized by the hypervisor. However, disabling EPT has a
 356   significant performance impact, especially when the Meltdown mitigation
 357   KPTI is enabled.
 358
 359   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
 360
 361 There is ongoing research and development for new mitigation mechanisms to
 362 address the performance impact of disabling SMT or EPT.
 363
 364 .. _mitigation_control_command_line:
 365
 366 Mitigation control on the kernel command line
 367 ---------------------------------------------
 368
 369 The kernel command line allows control of the L1TF mitigations at boot
 370 time with the option "l1tf=". The valid arguments for this option are:
 371
 372   ============  =============================================================
 373   full          Provides all available mitigations for the L1TF
 374                 vulnerability. Disables SMT and enables all mitigations in
 375                 the hypervisors, i.e. unconditional L1D flushing.
 376
 377                 SMT control and L1D flush control via the sysfs interface
 378                 is still possible after boot.  Hypervisors will issue a
 379                 warning when the first VM is started in a potentially
 380                 insecure configuration, i.e. SMT enabled or L1D flush
 381                 disabled.
 382
 383   full,force    Same as 'full', but disables SMT and L1D flush runtime
 384                 control. Implies the 'nosmt=force' command line option.
 385                 (i.e. sysfs control of SMT is disabled.)
 386
 387   flush         Leaves SMT enabled and enables the default hypervisor
 388                 i.e. conditional L1D flushing.
 389
 390                 SMT control and L1D flush control via the sysfs interface
 391                 is still possible after boot.  Hypervisors will issue a
 392                 warning when the first VM is started in a potentially
 393                 insecure configuration, i.e. SMT enabled or L1D flush
 394                 disabled.
 395
 396   flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
 397                 i.e. conditional L1D flushing.
 398
 399                 SMT control and L1D flush control via the sysfs interface
 400                 is still possible after boot.  Hypervisors will issue a
 401                 warning when the first VM is started in a potentially
 402                 insecure configuration, i.e. SMT enabled or L1D flush
 403                 disabled.
 404
 405   flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
 406                 started in a potentially insecure configuration.
 407
 408   off           Disables hypervisor mitigations and doesn't emit any
 409                 warnings.
 410   ============  =============================================================
 411
 412 The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
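
The table can be expressed as a small lookup, which is handy for auditing
the boot configuration of many hosts. This is a sketch: the type and
function names are hypothetical, the last "l1tf=" occurrence is assumed to
win, and unknown values raise ``KeyError`` here instead of being handled
the way the kernel handles them. For 'off' the table does not state
whether sysfs control remains possible, so that field is left as ``None``.

```python
from typing import NamedTuple, Optional


class L1tfPolicy(NamedTuple):
    disables_smt: bool
    l1d_flush: str                  # 'always', 'cond' or 'never'
    sysfs_control: Optional[bool]   # None: not stated in the table
    warns: bool


L1TF_OPTIONS = {
    "full":         L1tfPolicy(True,  "always", True,  True),
    "full,force":   L1tfPolicy(True,  "always", False, True),
    "flush":        L1tfPolicy(False, "cond",   True,  True),
    "flush,nosmt":  L1tfPolicy(True,  "cond",   True,  True),
    "flush,nowarn": L1tfPolicy(False, "cond",   True,  False),
    "off":          L1tfPolicy(False, "never",  None,  False),
}
DEFAULT_OPTION = "flush"


def l1tf_policy(cmdline: str) -> L1tfPolicy:
    """Return the policy selected by a kernel command line string."""
    option = DEFAULT_OPTION
    for word in cmdline.split():
        if word.startswith("l1tf="):
            option = word[len("l1tf="):]   # assumption: last occurrence wins
    return L1TF_OPTIONS[option]
```

The input can be the content of /proc/cmdline, e.g.
``l1tf_policy(open("/proc/cmdline").read())``.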
 413
 414
 415 .. _mitigation_control_kvm:
 416
 417 Mitigation control for KVM - module parameter
 418 -------------------------------------------------------------
 419
 420 The KVM hypervisor mitigation mechanism, flushing the L1D cache when
 421 entering a guest, can be controlled with a module parameter.
 422
 423 The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
 424 following arguments:
 425
 426   ============  ==============================================================
 427   always        L1D cache flush on every VMENTER.
 428
 429   cond          Flush L1D on VMENTER only when the code between VMEXIT and
 430                 VMENTER can leak host memory which is considered
 431                 interesting for an attacker. This can still leak host
 432                 memory which reveals, e.g., the host's address space layout.
 433
 434   never         Disables the mitigation.
 435   ============  ==============================================================
 436
 437 The parameter can be provided on the kernel command line, as a module
 438 parameter when loading the modules, and modified at runtime via the sysfs
 439 file:
 440
 441 /sys/module/kvm_intel/parameters/vmentry_l1d_flush
 442
 443 The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
 444 line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
 445 module parameter is ignored and writes to the sysfs file are rejected.
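
How the two controls combine can be sketched as follows. Only the
'full,force' override and the 'cond' default are stated above. That an
explicitly given module parameter takes precedence over the mode implied
by the other "l1tf=" options is an assumption of this sketch, and the
function name is hypothetical.

```python
VALID_MODES = ("always", "cond", "never")

# Flush mode implied by the l1tf= option when no module parameter is given.
IMPLIED_MODE = {
    "full": "always",
    "full,force": "always",
    "flush": "cond",
    "flush,nosmt": "cond",
    "flush,nowarn": "cond",
    "off": "never",
}


def effective_l1d_flush(l1tf_option="flush", module_param=None):
    """Return the L1D flush mode resulting from both settings."""
    if module_param is not None and module_param not in VALID_MODES:
        raise ValueError(f"invalid vmentry_l1d_flush value: {module_param!r}")
    if l1tf_option == "full,force":
        return "always"              # enforced, the module parameter is ignored
    if module_param is not None:
        return module_param          # assumption: explicit parameter wins
    return IMPLIED_MODE[l1tf_option]
```

With no options at all the result is 'cond', and with 'l1tf=full,force'
the result stays 'always' even if 'never' is requested.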
 446
 447
 448 Mitigation selection guide
 449 --------------------------
 450
 451 1. No virtualization in use
 452 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 453
 454    The system is protected by the kernel unconditionally and no further
 455    action is required.
 456
 457 2. Virtualization with trusted guests
 458 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 459
 460    If the guest comes from a trusted source and the guest OS kernel is
 461    guaranteed to have the L1TF mitigations in place the system is fully
 462    protected against L1TF and no further action is required.
 463
 464    To avoid the overhead of the default L1D flushing on VMENTER the
 465    administrator can disable the flushing via the kernel command line and
 466    sysfs control files. See :ref:`mitigation_control_command_line` and
 467    :ref:`mitigation_control_kvm`.
 468
 469
 470 3. Virtualization with untrusted guests
 471 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 472
 473 3.1. SMT not supported or disabled
 474 """"""""""""""""""""""""""""""""""
 475
 476   If SMT is not supported by the processor or disabled in the BIOS or by
 477   the kernel, it's only required to enforce L1D flushing on VMENTER.
 478
 479   Conditional L1D flushing is the default behaviour and can be tuned. See
 480   :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
 481
 482 3.2. EPT not supported or disabled
 483 """"""""""""""""""""""""""""""""""
 484
 485   If EPT is not supported by the processor or disabled in the hypervisor,
 486   the system is fully protected. SMT can stay enabled and L1D flushing on
 487   VMENTER is not required.
 488
 489   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
 490
 491 3.3. SMT and EPT supported and active
 492 """""""""""""""""""""""""""""""""""""
 493
 494   If SMT and EPT are supported and active then various degrees of
 495   mitigations can be employed:
 496
 497   - L1D flushing on VMENTER:
 498
 499     L1D flushing on VMENTER is the minimal protection requirement, but it
 500     is only potent in combination with other mitigation methods.
 501
 502     Conditional L1D flushing is the default behaviour and can be tuned. See
 503     :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
 504
 505   - Guest confinement:
 506
 507     Confinement of guests to a single or a group of physical cores which
 508     are not running any other processes, can reduce the attack surface
 509     significantly, but interrupts, soft interrupts and kernel threads can
 510     still expose valuable data to a potential attacker. See
 511     :ref:`guest_confinement`.
 512
 513   - Interrupt isolation:
 514
 515     Isolating the guest CPUs from interrupts can reduce the attack surface
 516     further, but still allows a malicious guest to explore a limited amount
 517     of host physical memory. This can at least be used to gain knowledge
 518     about the host address space layout. The interrupts which have a fixed
 519     affinity to the CPUs which run the untrusted guests can, depending on
 520     the scenario, still trigger soft interrupts and schedule kernel threads
 521     which might expose valuable information. See
 522     :ref:`interrupt_isolation`.
 523
 524 The above three mitigation methods combined can provide protection to a
 525 certain degree, but the risk of the remaining attack surface has to be
 526 carefully analyzed. For full protection the following methods are
 527 available:
 528
 529   - Disabling SMT:
 530
 531     Disabling SMT and enforcing the L1D flushing provides the maximum
 532     amount of protection. This mitigation does not depend on any of the
 533     above mitigation methods.
 534
 535     SMT control and L1D flushing can be tuned by the command line
 536     parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
 537     time with the matching sysfs control files. See :ref:`smt_control`,
 538     :ref:`mitigation_control_command_line` and
 539     :ref:`mitigation_control_kvm`.
 540
 541   - Disabling EPT:
 542
 543     Disabling EPT provides the maximum amount of protection as well. It does
 544     not depend on any of the above mitigation methods. SMT can stay
 545     enabled and L1D flushing is not required, but the performance impact is
 546     significant.
 547
 548     EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
 549     parameter.
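
The selection guide can be condensed into a small decision function. This
is a sketch with hypothetical names. It returns the alternative sets of
measures which give full protection according to the guide, where an empty
set means that no further action is required. Guest confinement and
interrupt isolation are left out because they only reduce the attack
surface. "Trusted" means that the guest comes from a trusted source and
its kernel is guaranteed to have the L1TF mitigations in place.

```python
def full_protection_options(virtualization, guests_trusted=False,
                            smt_active=True, ept_active=True):
    """Return a list of alternatives; each is a tuple of required measures."""
    if not virtualization or guests_trusted:
        return [()]                    # sections 1 and 2: no further action
    if not ept_active:
        return [()]                    # section 3.2: fully protected without EPT
    if not smt_active:
        return [("l1d_flush",)]        # section 3.1: L1D flush on VMENTER suffices
    # section 3.3: SMT and EPT are both active
    return [("disable_smt", "l1d_flush"), ("disable_ept",)]
```

For a host with untrusted guests, SMT and EPT active, the function returns
the two alternatives "disable SMT and enforce L1D flushing" and "disable
EPT".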
 550
 551
 552 .. _default_mitigations:
 553
 554 Default mitigations
 555 -------------------
 556
 557   The kernel default mitigations for vulnerable processors are:
 558
 559   - PTE inversion to protect against malicious user space. This is done
 560     unconditionally and cannot be controlled.
 561
 562   - L1D conditional flushing on VMENTER when EPT is enabled for
 563     a guest.
 564
 565   The kernel does not by default enforce the disabling of SMT, which leaves
 566   SMT systems vulnerable when running untrusted guests with EPT enabled.
 567
 568   The rationale for this choice is:
 569
 570   - Force disabling SMT can break existing setups, especially with
 571     unattended updates.
 572
 573   - If regular users run untrusted guests on their machine, then L1TF is
 574     just an add on to other malware which might be embedded in an untrusted
 575     guest, e.g. spam-bots or attacks on the local network.
 576
 577     There is no technical way to prevent a user from running untrusted code
 578     on their machines blindly.
 579
 580   - It's technically extremely unlikely and from today's knowledge even
 581     impossible that L1TF can be exploited via the most popular attack
 582     mechanisms like JavaScript because these mechanisms have no way to
 583     control PTEs. If this were possible and no other mitigation were
 584     available, then the default might be different.
 585
 586   - The administrators of cloud and hosting setups have to carefully
 587     analyze the risk for their scenarios and make the appropriate
 588     mitigation choices, which might even vary across their deployed
 589     machines and also result in other changes of their overall setup.
 590     There is no way for the kernel to provide a sensible default for these
 591     kinds of scenarios.