L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

- Processors from AMD, Centaur and other non-Intel vendors

- Older processor models, where the CPU family is < 6

- A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
  Penwell, Pineview, Silvermont, Airmont, Merrifield)

- The Intel Core Duo Yonah variants (2006 - 2008)

- The Intel XEON PHI family

- Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
  IA32_ARCH_CAPABILITIES MSR. If the bit is set, the CPU is not affected
  by the Meltdown vulnerability either. These CPUs should become
  available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
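
The ARCH_CAP_RDCL_NO check mentioned in the list can be sketched as
follows. This is an illustration only: it assumes that
IA32_ARCH_CAPABILITIES is MSR 0x10A with RDCL_NO in bit 0, and that the
``msr`` driver is loaded so that ``/dev/cpu/<N>/msr`` exists. The helper
names are hypothetical, and the read fails on CPUs which do not implement
the MSR.

.. code-block:: python

   import struct

   IA32_ARCH_CAPABILITIES = 0x10A  # MSR index (assumption, see lead-in)
   ARCH_CAP_RDCL_NO = 1 << 0       # bit 0 of that MSR (assumption)


   def has_rdcl_no(msr_value):
       """Return True if the RDCL_NO bit is set in the given MSR value."""
       return bool(msr_value & ARCH_CAP_RDCL_NO)


   def read_msr(index, cpu=0):
       """Read a 64-bit MSR via the msr driver (requires root).

       Raises OSError if the device is missing or the CPU does not
       implement the requested MSR.
       """
       # Unbuffered, so that exactly one 8 byte read reaches the driver.
       with open(f"/dev/cpu/{cpu}/msr", "rb", buffering=0) as msr:
           msr.seek(index)
           return struct.unpack("<Q", msr.read(8))[0]

For example, ``has_rdcl_no(read_msr(IA32_ARCH_CAPABILITIES))`` reports the
bit for CPU 0. The sysfs vulnerability file remains the simpler way to
find out whether a system is affected.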

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

=============  =================  ==============================
CVE-2018-3615  L1 Terminal Fault  SGX related aspects
CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
=============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is eventually retired, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows an attack on any physical memory address in the system, and the
attack works across all protection domains. It allows an attack on SGX and
also works from inside virtual machines because the speculation bypasses
the extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

Operating Systems store arbitrary information in the address bits of a
PTE which is marked non present. This allows a malicious user space
application to attack the physical memory to which these PTEs resolve.
In some cases user-space can maliciously influence the information
encoded in the address bits of the PTE, thus making attacks more
deterministic and more practical.

The Linux kernel contains a mitigation for this attack vector, PTE
inversion, which is permanently enabled and has no performance
impact. The kernel ensures that the address bits of PTEs, which are not
marked present, never point to cacheable physical memory space.

A system with an up to date kernel is protected against attacks from
malicious user space applications.
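
The idea behind PTE inversion can be illustrated with a few lines of bit
manipulation. This is a simplified sketch, not the kernel's
implementation: it assumes an x86-64 PTE with the Present bit in bit 0 and
the physical address field in bits 12 to 51, and the function names are
made up for illustration.

.. code-block:: python

   PTE_PRESENT = 1 << 0
   # Physical address field of the PTE: bits 12..51 (assumed layout).
   PTE_ADDR_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)


   def invert_if_not_present(pte):
       """Flip the address bits of a non-present PTE; keep present PTEs."""
       if pte & PTE_PRESENT:
           return pte
       return pte ^ PTE_ADDR_MASK


   def speculative_target(pte):
       """Physical address a speculative load would derive from the PTE."""
       return pte & PTE_ADDR_MASK

A non-present entry whose address bits hold a small value such as
``0x1234000`` resolves to ``0xFFFFFFEDCB000`` after inversion, an address
near the top of the physical address space that is not expected to be
backed by cacheable memory. Applying the inversion twice restores the
original value, so the stored information is not lost.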

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The fact that L1TF breaks all domain protections allows malicious guest
OSes, which can control the PTEs directly, and malicious guest user
space applications, which run on an unprotected guest kernel lacking the
PTE inversion mitigation for L1TF, to attack physical host memory.

A special aspect of L1TF in the context of virtualization is symmetric
multi threading (SMT). The Intel implementation of SMT is called
HyperThreading. The fact that Hyperthreads on the affected processors
share the L1 Data Cache (L1D) is important for this. As the flaw only
allows attacks on data which is present in L1D, a malicious guest running
on one Hyperthread can attack the data which is brought into the L1D by
the context which runs on the sibling Hyperthread of the same physical
core. This context can be host OS, host user space or a different guest.

If the processor does not support Extended Page Tables, the attack is
only possible when the hypervisor does not sanitize the content of the
effective (shadow) page tables.

While solutions exist to mitigate these attack vectors fully, these
mitigations are not enabled by default in the Linux kernel because they
can affect performance significantly. The kernel provides several
mechanisms which can be utilized to address the problem depending on the
deployment scenario. The mitigations, their protection scope and impact
are described in the next sections.

The default mitigations and the rationale for choosing them are explained
at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

   /sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

===========================  ===============================
'Not affected'               The processor is not vulnerable
'Mitigation: PTE Inversion'  The host protection is active
===========================  ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

- SMT status:

  =====================  ================
  'VMX: SMT vulnerable'  SMT is enabled
  'VMX: SMT disabled'    SMT is disabled
  =====================  ================

- L1D Flush mode:

  ================================  ====================================
  'L1D vulnerable'                  L1D flushing is disabled

  'L1D conditional cache flushes'   L1D flush is conditionally enabled

  'L1D cache flushes'               L1D flush is unconditionally enabled
  ================================  ====================================

The resulting grade of protection is discussed in the following sections.
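
A monitoring script could interpret the status file roughly as shown
below. This is a minimal sketch: it only looks for the substrings listed
in the tables above, because the exact way the kernel joins the parts
(separators, ordering) is not specified here and may differ between
kernel versions. The function names are hypothetical.

.. code-block:: python

   from pathlib import Path

   L1TF_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


   def parse_l1tf_status(text):
       """Summarize a status string using the documented substrings."""
       info = {
           "affected": "Not affected" not in text,
           "pte_inversion": "PTE Inversion" in text,
           "smt": None,        # None: no SMT information reported
           "l1d_flush": None,  # None: no L1D flush information reported
       }
       if "SMT vulnerable" in text:
           info["smt"] = "enabled"
       elif "SMT disabled" in text:
           info["smt"] = "disabled"
       # Test the longer phrase first, "cache flushes" is a substring of it.
       if "conditional cache flushes" in text:
           info["l1d_flush"] = "cond"
       elif "cache flushes" in text:
           info["l1d_flush"] = "always"
       elif "L1D vulnerable" in text:
           info["l1d_flush"] = "disabled"
       return info


   def read_l1tf_status(path=L1TF_STATUS):
       """Read and summarize the status file of the running system."""
       return parse_l1tf_status(Path(path).read_text())

Substring matching is used deliberately so that the sketch does not
depend on how the individual parts are concatenated.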


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

To make sure that a guest cannot attack data which is present in the L1D
the hypervisor flushes the L1D before entering the guest.

Flushing the L1D evicts not only the data which should not be accessed
by a potentially malicious guest, it also flushes the guest
data. Flushing the L1D has a performance impact as the processor has to
bring the flushed guest data back into the L1D. Depending on the
frequency of VMEXIT/VMENTER and the type of computations in the guest,
performance degradation in the range of 1% to 50% has been observed. For
scenarios where guest VMEXIT/VMENTER are rare the performance impact is
minimal. Virtio and mechanisms like posted interrupts are designed to
confine the VMEXITs to a bare minimum, but specific configurations and
application scenarios might still suffer from a high VMEXIT rate.

The kernel provides two L1D flush modes:

- conditional ('cond')
- unconditional ('always')

The conditional mode avoids L1D flushing after VMEXITs which execute
only audited code paths before the corresponding VMENTER. These code
paths have been verified not to expose secrets or other interesting data
to an attacker, but they can leak information about the address space
layout of the hypervisor.

Unconditional mode flushes L1D on all VMENTER invocations and provides
maximum protection. It has a higher overhead than the conditional
mode. The overhead cannot be quantified correctly as it depends on the
workload scenario and the resulting number of VMEXITs.

The general recommendation is to enable L1D flush on VMENTER. The kernel
defaults to conditional mode on affected processors.

**Note** that L1D flush does not prevent the SMT problem because the
sibling thread will also bring back its data into the L1D which makes it
attackable again.

L1D flush can be controlled by the administrator via the kernel command
line and sysfs control files. See :ref:`mitigation_control_command_line`
and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address the SMT problem, it is possible to make a guest or a group of
guests affine to one or more physical cores. The proper mechanism for
that is to utilize exclusive cpusets to ensure that no other guest or
host tasks can run on these cores.

If only a single guest or related guests run on sibling SMT threads on
the same physical core then they can only attack their own memory and
restricted parts of the host memory.

Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context. The amount
of valuable information from the host OS context depends on the context
in which the host OS executes, i.e. interrupts, soft interrupts and kernel
threads. The amount of valuable data from these contexts cannot be
declared as non-interesting for an attacker without deep inspection of
the code.

**Note** that assigning guests to a fixed set of physical cores affects
the ability of the scheduler to do load balancing and might have
negative effects on CPU utilization depending on the hosting
scenario. Disabling SMT might be a viable alternative for particular
scenarios.

For further information about confining guests to a single or to a group
of cores consult the cpusets documentation:

https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
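
Confining a guest to whole physical cores requires knowing which logical
CPUs are SMT siblings. The following sketch groups logical CPUs by core.
It assumes that the sibling information is taken from
``/sys/devices/system/cpu/cpu<N>/topology/thread_siblings_list`` and that
the file uses the usual kernel CPU list format (for example ``0,4`` or
``0-1``). Both function names are hypothetical.

.. code-block:: python

   def parse_cpu_list(text):
       """Expand a CPU list such as '0-3,8' into a sorted list of ints."""
       cpus = set()
       for part in text.strip().split(","):
           if not part:
               continue
           first, _, last = part.partition("-")
           cpus.update(range(int(first), int(last or first) + 1))
       return sorted(cpus)


   def sibling_groups(sibling_lists):
       """Group logical CPUs by physical core.

       sibling_lists maps a logical CPU number to the content of its
       thread_siblings_list file.
       """
       groups = {tuple(parse_cpu_list(s)) for s in sibling_lists.values()}
       return sorted(groups)

Each resulting tuple is one physical core. An exclusive cpuset for a
guest should contain either all logical CPUs of such a tuple or none of
them.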

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

Interrupts can be made affine to logical CPUs. This is not universally
true because there are types of interrupts which are truly per CPU
interrupts, e.g. the local timer interrupt. Aside from that, multi queue
devices affine their interrupts to single CPUs or groups of CPUs per
queue without allowing the administrator to control the affinities.

Moving the interrupts, which can be affinity controlled, away from CPUs
which run untrusted guests, reduces the attack vector space.

Whether the interrupts which are affine to CPUs that run untrusted
guests provide interesting data for an attacker depends on the system
configuration and the scenarios which run on the system. While for some
of the interrupts it can be assumed that they won't expose interesting
information beyond exposing hints about the host OS memory layout, there
is no way to make general assumptions.

Interrupt affinity can be controlled by the administrator via the
/proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
available at:

https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
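
The value written to ``smp_affinity`` is a hexadecimal CPU bitmask. A
minimal sketch for computing a mask which excludes the CPUs used by
untrusted guests is shown below. The function names are hypothetical, and
the plain hex string is only meant for systems with up to 32 logical
CPUs. Larger systems need the mask split into comma separated 32 bit
groups, or the ``smp_affinity_list`` file can be used instead.

.. code-block:: python

   def host_only_cpus(all_cpus, guest_cpus):
       """Logical CPUs left for interrupts once guest CPUs are excluded."""
       return sorted(set(all_cpus) - set(guest_cpus))


   def affinity_mask(cpus):
       """Return the hex bitmask which selects the given logical CPUs."""
       mask = 0
       for cpu in cpus:
           mask |= 1 << cpu
       return format(mask, "x")

On an 8 CPU system where CPUs 2, 3, 6 and 7 run guests,
``affinity_mask(host_only_cpus(range(8), [2, 3, 6, 7]))`` yields ``33``.
Writing the value requires root, and the kernel may reject the write for
interrupts whose affinity cannot be changed.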

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

To prevent the SMT issues of L1TF it might be necessary to disable SMT
completely. Disabling SMT can have a significant performance impact, but
the impact depends on the hosting scenario and the type of workloads.
The impact of disabling SMT also needs to be weighed against the impact
of other mitigation solutions like confining guests to dedicated cores.

The kernel provides a sysfs interface to retrieve the status of SMT and
to control it. It also provides a kernel command line interface to
control SMT.

The kernel command line interface consists of the following options:

===========  ==========================================================
nosmt        Affects the bring up of the secondary CPUs during boot. The
             kernel tries to bring all present CPUs online during the
             boot process. "nosmt" makes sure that only one thread of
             each physical core, the so called primary (hyper) thread,
             is activated. Due to a design flaw of Intel processors
             related to Machine Check Exceptions the non primary
             siblings have to be brought up at least partially and are
             then shut down again. "nosmt" can be undone via the sysfs
             interface.

nosmt=force  Has the same effect as "nosmt" but it does not allow the
             SMT disable to be undone via the sysfs interface.
===========  ==========================================================

The sysfs interface provides two files:

- /sys/devices/system/cpu/smt/control
- /sys/devices/system/cpu/smt/active

/sys/devices/system/cpu/smt/control:

  This file allows reading the SMT control state and provides the
  ability to disable or (re)enable SMT. The possible states are:

  ==============  ===================================================
  on              SMT is supported by the CPU and enabled. All
                  logical CPUs can be onlined and offlined without
                  restrictions.

  off             SMT is supported by the CPU and disabled. Only
                  the so called primary SMT threads can be onlined
                  and offlined without restrictions. An attempt to
                  online a non-primary sibling is rejected.

  forceoff        Same as 'off' but the state cannot be controlled.
                  Attempts to write to the control file are rejected.

  notsupported    The processor does not support SMT. It's therefore
                  not affected by the SMT implications of L1TF.
                  Attempts to write to the control file are rejected.
  ==============  ===================================================

  The possible states which can be written into this file to control SMT
  state are:

  - on
  - off
  - forceoff

/sys/devices/system/cpu/smt/active:

  This file reports whether SMT is enabled and active, i.e. if on any
  physical core two or more sibling threads are online.

SMT control is also possible at boot time via the l1tf kernel command
line parameter in combination with L1D flush control. See
:ref:`mitigation_control_command_line`.
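
The state rules of the control file can be captured in a small helper
which checks a request before anything is written. This is a sketch
under the assumptions stated in the comments, and the function names are
hypothetical.

.. code-block:: python

   from pathlib import Path

   SMT_CONTROL = Path("/sys/devices/system/cpu/smt/control")
   SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")

   # States in which the control file accepts writes.
   CONTROLLABLE = {"on", "off"}
   # Values which may be written to the control file.
   WRITABLE_VALUES = {"on", "off", "forceoff"}


   def plan_smt_write(current, wanted):
       """Return the value to write, or None if no write is needed.

       Raises ValueError if the request cannot succeed.
       """
       if wanted not in WRITABLE_VALUES:
           raise ValueError(f"cannot write {wanted!r} to the control file")
       if current == wanted:
           return None
       if current not in CONTROLLABLE:
           raise ValueError(f"SMT state {current!r} rejects writes")
       return wanted


   def set_smt(wanted):
       """Apply a state change on the running system (requires root)."""
       value = plan_smt_write(SMT_CONTROL.read_text().strip(), wanted)
       if value is not None:
           SMT_CONTROL.write_text(value)
       # Assumption: the 'active' file contains '1' while SMT is active.
       return SMT_ACTIVE.read_text().strip() == "1"

Checking the request first avoids a failing write when the system was
booted with "nosmt=force" or when the processor does not support SMT.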

5. Disabling EPT
^^^^^^^^^^^^^^^^

Disabling EPT for virtual machines provides full mitigation for L1TF even
with SMT enabled, because the effective page tables for guests are
managed and sanitized by the hypervisor. However, disabling EPT has a
significant performance impact, especially when the Meltdown mitigation
KPTI is enabled.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows control of the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

============  =============================================================
full          Provides all available mitigations for the L1TF
              vulnerability. Disables SMT and enables all mitigations in
              the hypervisors, i.e. unconditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot. Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

full,force    Same as 'full', but disables SMT and L1D flush runtime
              control. Implies the 'nosmt=force' command line option.
              (i.e. sysfs control of SMT is disabled.)

flush         Leaves SMT enabled and enables the default hypervisor
              mitigation, i.e. conditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot. Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
              i.e. conditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot. Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
              started in a potentially insecure configuration.

off           Disables hypervisor mitigations and doesn't emit any
              warnings.
============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
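
For tooling which audits boot configurations, the table above can be
restated as a lookup structure. The sketch below is one possible
encoding. The type and function names are hypothetical, mapping 'off' to
the 'never' flush mode is an interpretation of "disables hypervisor
mitigations", and keeping the last occurrence of a repeated option is an
assumption of this sketch.

.. code-block:: python

   from typing import NamedTuple, Optional


   class L1tfOption(NamedTuple):
       smt_disabled: bool               # SMT is disabled at boot
       l1d_flush: str                   # 'always', 'cond' or 'never'
       warns: bool                      # hypervisor warns if insecure
       runtime_control: Optional[bool]  # None: not stated in the table


   L1TF_OPTIONS = {
       "full":         L1tfOption(True,  "always", True,  True),
       "full,force":   L1tfOption(True,  "always", True,  False),
       "flush":        L1tfOption(False, "cond",   True,  True),
       "flush,nosmt":  L1tfOption(True,  "cond",   True,  True),
       "flush,nowarn": L1tfOption(False, "cond",   False, True),
       "off":          L1tfOption(False, "never",  False, None),
   }


   def l1tf_option(cmdline):
       """Extract the l1tf= argument from a kernel command line string."""
       value = "flush"  # documented default
       for word in cmdline.split():
           if word.startswith("l1tf="):
               # Assumption: a later occurrence overrides an earlier one.
               value = word[len("l1tf="):]
       return value

For example, ``L1TF_OPTIONS[l1tf_option(open("/proc/cmdline").read())]``
describes the configuration requested for the running system.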


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

============  ==============================================================
always        L1D cache flush on every VMENTER.

cond          Flush L1D on VMENTER only when the code between VMEXIT and
              VMENTER can leak host memory which is considered
              interesting for an attacker. This still can leak host memory
              which allows e.g. to determine the host's address space
              layout.

never         Disables the mitigation.
============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and modified at runtime via the sysfs
file:

   /sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced, the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
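
Changing the flush mode at runtime can be sketched as follows. The helper
names are hypothetical. The sketch only accepts the three arguments
listed in the table and treats any error on write as a rejection, which
also covers the case that the kvm_intel module is not loaded.

.. code-block:: python

   from pathlib import Path

   FLUSH_PARAM = Path("/sys/module/kvm_intel/parameters/vmentry_l1d_flush")
   FLUSH_MODES = ("always", "cond", "never")


   def check_flush_mode(value):
       """Return the stripped mode, or raise ValueError if it is unknown."""
       mode = value.strip()
       if mode not in FLUSH_MODES:
           raise ValueError(f"unknown L1D flush mode: {mode!r}")
       return mode


   def set_flush_mode(mode, path=FLUSH_PARAM):
       """Write the mode; return False if the write is rejected."""
       mode = check_flush_mode(mode)
       try:
           Path(path).write_text(mode)
       except OSError:
           # E.g. booted with l1tf=full,force, missing privileges or
           # kvm_intel not loaded.
           return False
       return True

Validating the argument first keeps typos from reaching the kernel, and
the boolean result lets a caller detect an enforced configuration.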


Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The system is protected by the kernel unconditionally and no further
action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the guest comes from a trusted source and the guest OS kernel is
guaranteed to have the L1TF mitigations in place the system is fully
protected against L1TF and no further action is required.

To avoid the overhead of the default L1D flushing on VMENTER the
administrator can disable the flushing via the kernel command line and
sysfs control files. See :ref:`mitigation_control_command_line` and
:ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

If SMT is not supported by the processor or disabled in the BIOS or by
the kernel, it's only required to enforce L1D flushing on VMENTER.

Conditional L1D flushing is the default behaviour and can be tuned. See
:ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

If EPT is not supported by the processor or disabled in the hypervisor,
the system is fully protected. SMT can stay enabled and L1D flushing on
VMENTER is not required.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

If SMT and EPT are supported and active then various degrees of
mitigations can be employed:

- L1D flushing on VMENTER:

  L1D flushing on VMENTER is the minimal protection requirement, but it
  is only potent in combination with other mitigation methods.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

- Guest confinement:

  Confinement of guests to a single or a group of physical cores which
  are not running any other processes can reduce the attack surface
  significantly, but interrupts, soft interrupts and kernel threads can
  still expose valuable data to a potential attacker. See
  :ref:`guest_confinement`.

- Interrupt isolation:

  Isolating the guest CPUs from interrupts can reduce the attack surface
  further, but still allows a malicious guest to explore a limited amount
  of host physical memory. This can at least be used to gain knowledge
  about the host address space layout. The interrupts which have a fixed
  affinity to the CPUs which run the untrusted guests can, depending on
  the scenario, still trigger soft interrupts and schedule kernel threads
  which might expose valuable information. See
  :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

- Disabling SMT:

  Disabling SMT and enforcing the L1D flushing provides the maximum
  amount of protection. This mitigation does not depend on any of the
  above mitigation methods.

  SMT control and L1D flushing can be tuned by the command line
  parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
  time with the matching sysfs control files. See :ref:`smt_control`,
  :ref:`mitigation_control_command_line` and
  :ref:`mitigation_control_kvm`.

- Disabling EPT:

  Disabling EPT provides the maximum amount of protection as well. It
  does not depend on any of the above mitigation methods. SMT can stay
  enabled and L1D flushing is not required, but the performance impact is
  significant.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
  parameter.
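
The selection guide can be condensed into a small decision helper. This
is only a sketch of the cases described above. The function name and the
wording of the returned strings are made up for illustration, and the
helper does not replace a risk analysis for a concrete deployment.

.. code-block:: python

   def recommend(untrusted_guests, smt_active, ept_active):
       """Map a deployment scenario to the options of the guide."""
       if not untrusted_guests:
           # Cases 1 and 2: no virtualization, or trusted guests whose
           # kernels have the L1TF mitigations in place.
           return ["no further action required"]
       if not ept_active:
           # Case 3.2
           return ["fully protected; SMT can stay enabled and L1D "
                   "flushing on VMENTER is not required"]
       if not smt_active:
           # Case 3.1
           return ["enforce L1D flushing on VMENTER"]
       # Case 3.3
       return [
           "disable SMT and enforce L1D flushing (full protection)",
           "disable EPT (full protection, significant performance impact)",
           "combine L1D flushing, guest confinement and interrupt "
           "isolation (partial protection, analyze the remaining risk)",
       ]

For a host with untrusted guests, SMT and EPT active,
``recommend(True, True, True)`` lists the two full protection methods
first and the combined partial mitigations last.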


.. _default_mitigations:

Default mitigations
-------------------

The kernel default mitigations for vulnerable processors are:

- PTE inversion to protect against malicious user space. This is done
  unconditionally and cannot be controlled.

- L1D conditional flushing on VMENTER when EPT is enabled for
  a guest.

The kernel does not by default enforce the disabling of SMT, which leaves
SMT systems vulnerable when running untrusted guests with EPT enabled.

The rationale for this choice is:

- Force disabling SMT can break existing setups, especially with
  unattended updates.

- If regular users run untrusted guests on their machine, then L1TF is
  just an add on to other malware which might be embedded in an untrusted
  guest, e.g. spam-bots or attacks on the local network.

  There is no technical way to prevent a user from running untrusted code
  on their machines blindly.

- It's technically extremely unlikely and from today's knowledge even
  impossible that L1TF can be exploited via the most popular attack
  mechanisms like JavaScript because these mechanisms have no way to
  control PTEs. If this were possible and no other mitigation were
  available, then the default might be different.

- The administrators of cloud and hosting setups have to carefully
  analyze the risk for their scenarios and make the appropriate
  mitigation choices, which might even vary across their deployed
  machines and also result in other changes of their overall setup.
  There is no way for the kernel to provide a sensible default for these
  kinds of scenarios.