GitHub/mt8127/android_kernel_alcatel_ttab.git
7 years agosh: fix copy_from_user()
Al Viro [Mon, 22 Aug 2016 03:39:47 +0000 (23:39 -0400)]
sh: fix copy_from_user()

commit 6e050503a150b2126620c1a1e9b3a368fcd51eac upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoscore: fix copy_from_user() and friends
Al Viro [Mon, 22 Aug 2016 02:30:44 +0000 (22:30 -0400)]
score: fix copy_from_user() and friends

commit b615e3c74621e06cd97f86373ca90d43d6d998aa upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoblackfin: fix copy_from_user()
Al Viro [Fri, 9 Sep 2016 23:16:58 +0000 (19:16 -0400)]
blackfin: fix copy_from_user()

commit 8f035983dd826d7e04f67b28acf8e2f08c347e41 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocris: buggered copy_from_user/copy_to_user/clear_user
Al Viro [Thu, 18 Aug 2016 23:34:00 +0000 (19:34 -0400)]
cris: buggered copy_from_user/copy_to_user/clear_user

commit eb47e0293baaa3044022059f1fa9ff474bfe35cb upstream.

* copy_from_user() on access_ok() failure ought to zero the destination
* none of those primitives should skip the access_ok() check in case of
small constant size.

Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agofrv: fix clear_user()
Al Viro [Fri, 19 Aug 2016 00:54:02 +0000 (20:54 -0400)]
frv: fix clear_user()

commit 3b8767a8f00cc6538ba6b1cf0f88502e2fd2eb90 upstream.

It should check access_ok().  Otherwise a bunch of places turn into
trivially exploitable rootholes.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoasm-generic: make get_user() clear the destination on errors
Al Viro [Thu, 18 Aug 2016 03:19:01 +0000 (23:19 -0400)]
asm-generic: make get_user() clear the destination on errors

commit 9ad18b75c2f6e4a78ce204e79f37781f8815c0fa upstream.

both for access_ok() failures and for faults halfway through

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoARC: uaccess: get_user to zero out dest in cause of fault
Vineet Gupta [Fri, 19 Aug 2016 19:10:02 +0000 (12:10 -0700)]
ARC: uaccess: get_user to zero out dest in cause of fault

commit 05d9d0b96e53c52a113fd783c0c97c830c8dc7af upstream.

Al reported potential issue with ARC get_user() as it wasn't clearing
out destination pointer in case of fault due to bad address etc.

Verified using following

| {
|   u32 bogus1 = 0xdeadbeef;
| u64 bogus2 = 0xdead;
| int rc1, rc2;
|
|   pr_info("Orig values %x %llx\n", bogus1, bogus2);
| rc1 = get_user(bogus1, (u32 __user *)0x40000000);
| rc2 = get_user(bogus2, (u64 __user *)0x50000000);
| pr_info("access %d %d, new values %x %llx\n",
| rc1, rc2, bogus1, bogus2);
| }

| [ARCLinux]# insmod /mnt/kernel-module/qtn.ko
| Orig values deadbeef dead
| access -14 -14, new values 0 0

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agos390: get_user() should zero on failure
Al Viro [Mon, 22 Aug 2016 02:00:54 +0000 (22:00 -0400)]
s390: get_user() should zero on failure

commit fd2d2b191fe75825c4c7a6f12f3fef35aaed7dd7 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoscore: fix __get_user/get_user
Al Viro [Mon, 22 Aug 2016 02:13:39 +0000 (22:13 -0400)]
score: fix __get_user/get_user

commit c2f18fa4cbb3ad92e033a24efa27583978ce9600 upstream.

* should zero on any failure
* __get_user() should use __copy_from_user(), not copy_from_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agosh64: failing __get_user() should zero
Al Viro [Mon, 22 Aug 2016 03:33:47 +0000 (23:33 -0400)]
sh64: failing __get_user() should zero

commit c6852389228df9fb3067f94f3b651de2a7921b36 upstream.

It could be done in exception-handling bits in __get_user_b() et.al.,
but the surgery involved would take more knowledge of sh64 details
than I have or _want_ to have.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agom32r: fix __get_user()
Al Viro [Fri, 9 Sep 2016 23:20:13 +0000 (19:20 -0400)]
m32r: fix __get_user()

commit c90a3bc5061d57e7931a9b7ad14784e1a0ed497d upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agomn10300: failing __get_user() and get_user() should zero
Al Viro [Sat, 20 Aug 2016 20:32:02 +0000 (16:32 -0400)]
mn10300: failing __get_user() and get_user() should zero

commit 43403eabf558d2800b429cd886e996fd555aa542 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agomicroblaze: fix copy_from_user()
Al Viro [Fri, 9 Sep 2016 23:22:34 +0000 (19:22 -0400)]
microblaze: fix copy_from_user()

commit d0cf385160c12abd109746cad1f13e3b3e8b50b8 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[wt: s/might_fault/might_sleep]

Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agomicroblaze: fix __get_user()
Al Viro [Fri, 9 Sep 2016 23:23:33 +0000 (19:23 -0400)]
microblaze: fix __get_user()

commit e98b9e37ae04562d52c96f46b3cf4c2e80222dc1 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoparisc: Ensure consistent state when switching to kernel stack at syscall entry
John David Anglin [Sat, 29 Oct 2016 03:00:34 +0000 (23:00 -0400)]
parisc: Ensure consistent state when switching to kernel stack at syscall entry

commit 6ed518328d0189e0fdf1bb7c73290d546143ea66 upstream.

We have one critical section in the syscall entry path in which we switch from
the userspace stack to kernel stack. In the event of an external interrupt, the
interrupt code distinguishes between those two states by analyzing the value of
sr7. If sr7 is zero, it uses the kernel stack. Therefore it's important, that
the value of sr7 is in sync with the currently enabled stack.

This patch now disables interrupts while executing the critical section.  This
prevents the interrupt handler to possibly see an inconsistent state which in
the worst case can lead to crashes.

Interestingly, in the syscall exit path interrupts were already disabled in the
critical section which switches back to the userspace stack.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agos390/dasd: fix hanging device after clear subchannel
Stefan Haberland [Mon, 8 Aug 2016 12:08:17 +0000 (14:08 +0200)]
s390/dasd: fix hanging device after clear subchannel

commit 9ba333dc55cbb9523553df973adb3024d223e905 upstream.

When a device is in a status where CIO has killed all I/O by itself the
interrupt for a clear request may not contain an irb to determine the
clear function. Instead it contains an error pointer -EIO.
This was ignored by the DASD int_handler leading to a hanging device
waiting for a clear interrupt.

Handle -EIO error pointer correctly for requests that are clear pending and
treat the clear as successful.

Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoavr32: off by one in at32_init_pio()
Dan Carpenter [Wed, 13 Jul 2016 10:08:55 +0000 (13:08 +0300)]
avr32: off by one in at32_init_pio()

commit 55f1cf83d5cf885c75267269729805852039c834 upstream.

The pio_dev[] array has MAX_NR_PIO_DEVICES elements so the > should be
>=.

Fixes: 5f97f7f9400d ('[PATCH] avr32 architecture')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoavr32: fix 'undefined reference to `___copy_from_user'
Guenter Roeck [Sat, 17 Sep 2016 14:52:49 +0000 (07:52 -0700)]
avr32: fix 'undefined reference to `___copy_from_user'

commit 65c0044ca8d7c7bbccae37f0ff2972f0210e9f41 upstream.

avr32 builds fail with:

arch/avr32/kernel/built-in.o: In function `arch_ptrace':
(.text+0x650): undefined reference to `___copy_from_user'
arch/avr32/kernel/built-in.o:(___ksymtab+___copy_from_user+0x0): undefined
reference to `___copy_from_user'
kernel/built-in.o: In function `proc_doulongvec_ms_jiffies_minmax':
(.text+0x5dd8): undefined reference to `___copy_from_user'
kernel/built-in.o: In function `proc_dointvec_minmax_sysadmin':
sysctl.c:(.text+0x6174): undefined reference to `___copy_from_user'
kernel/built-in.o: In function `ptrace_has_cap':
ptrace.c:(.text+0x69c0): undefined reference to `___copy_from_user'
kernel/built-in.o:ptrace.c:(.text+0x6b90): more undefined references to
`___copy_from_user' follow

Fixes: 8630c32275ba ("avr32: fix copy_from_user()")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Havard Skinnemoen <hskinnemoen@gmail.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoavr32: fix copy_from_user()
Al Viro [Fri, 9 Sep 2016 23:28:23 +0000 (19:28 -0400)]
avr32: fix copy_from_user()

commit 8630c32275bac2de6ffb8aea9d9b11663e7ad28e upstream.

really ugly, but apparently avr32 compilers turns access_ok() into
something so bad that they want it in assembler.  Left that way,
zeroing added in inline wrapper.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agopowerpc/nvram: Fix an incorrect partition merge
Pan Xinhui [Thu, 10 Dec 2015 07:30:02 +0000 (15:30 +0800)]
powerpc/nvram: Fix an incorrect partition merge

commit 11b7e154b132232535befe51c55db048069c8461 upstream.

When we merge two contiguous partitions whose signatures are marked
NVRAM_SIG_FREE, We need update prev's length and checksum, then write it
to nvram, not cur's. So lets fix this mistake now.

Also use memset instead of strncpy to set the partition's name. It's
more readable if we want to fill up with duplicate chars .

Fixes: fa2b4e54d41f ("powerpc/nvram: Improve partition removal")
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agopowerpc/64: Fix incorrect return value from __copy_tofrom_user
Paul Mackerras [Tue, 11 Oct 2016 11:25:47 +0000 (22:25 +1100)]
powerpc/64: Fix incorrect return value from __copy_tofrom_user

commit 1a34439e5a0b2235e43f96816dbb15ee1154f656 upstream.

Debugging a data corruption issue with virtio-net/vhost-net led to
the observation that __copy_tofrom_user was occasionally returning
a value 16 larger than it should.  Since the return value from
__copy_tofrom_user is the number of bytes not copied, this means
that __copy_tofrom_user can occasionally return a value larger
than the number of bytes it was asked to copy.  In turn this can
cause higher-level copy functions such as copy_page_to_iter_iovec
to corrupt memory by copying data into the wrong memory locations.

It turns out that the failing case involves a fault on the store
at label 79, and at that point the first unmodified byte of the
destination is at R3 + 16.  Consequently the exception handler
for that store needs to add 16 to R3 before using it to work out
how many bytes were not copied, but in this one case it was not
adding the offset to R3.  To fix it, this moves the label 179 to
the point where we add 16 to R3.  I have checked manually all the
exception handlers for the loads and stores in this code and the
rest of them are correct (it would be excellent to have an
automated test of all the exception cases).

This bug has been present since this code was initially
committed in May 2002 to Linux version 2.5.20.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agopowerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data()
Gavin Shan [Tue, 2 Aug 2016 04:10:32 +0000 (14:10 +1000)]
powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data()

commit 5adaf8629b193f185ca5a1665b9e777a0579f518 upstream.

This fixes the warnings reported from sparse:

  pci.c:312:33: warning: restricted __be64 degrades to integer
  pci.c:313:33: warning: restricted __be64 degrades to integer

Fixes: cee72d5bb489 ("powerpc/powernv: Display diag data on p7ioc EEH errors")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agopowerpc/vdso64: Use double word compare on pointers
Anton Blanchard [Sun, 25 Sep 2016 07:16:53 +0000 (17:16 +1000)]
powerpc/vdso64: Use double word compare on pointers

commit 5045ea37377ce8cca6890d32b127ad6770e6dce5 upstream.

__kernel_get_syscall_map() and __kernel_clock_getres() use cmpli to
check if the passed in pointer is non zero. cmpli maps to a 32 bit
compare on binutils, so we ignore the top 32 bits.

A simple test case can be created by passing in a bogus pointer with
the bottom 32 bits clear. Using a clk_id that is handled by the VDSO,
then one that is handled by the kernel shows the problem:

  printf("%d\n", clock_getres(CLOCK_REALTIME, (void *)0x100000000));
  printf("%d\n", clock_getres(CLOCK_BOOTTIME, (void *)0x100000000));

And we get:

  0
  -1

The bigger issue is if we pass a valid pointer with the bottom 32 bits
clear, in this case we will return success but won't write any data
to the pointer.

I stumbled across this issue because the LLVM integrated assembler
doesn't accept cmpli with 3 arguments. Fix this by converting them to
cmpldi.

Fixes: a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel")
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agopowerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
Paul Mackerras [Fri, 2 Sep 2016 11:47:59 +0000 (21:47 +1000)]
powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET

commit f077aaf0754bcba0fffdbd925bc12f09cd1e38aa upstream.

In commit c60ac5693c47 ("powerpc: Update kernel VSID range", 2013-03-13)
we lost a check on the region number (the top four bits of the effective
address) for addresses below PAGE_OFFSET.  That commit replaced a check
that the top 18 bits were all zero with a check that bits 46 - 59 were
zero (performed for all addresses, not just user addresses).

This means that userspace can access an address like 0x1000_0xxx_xxxx_xxxx
and we will insert a valid SLB entry for it.  The VSID used will be the
same as if the top 4 bits were 0, but the page size will be some random
value obtained by indexing beyond the end of the mm_ctx_high_slices_psize
array in the paca.  If that page size is the same as would be used for
region 0, then userspace just has an alias of the region 0 space.  If the
page size is different, then no HPTE will be found for the access, and
the process will get a SIGSEGV (since hash_page_mm() will refuse to create
a HPTE for the bogus address).

The access beyond the end of the mm_ctx_high_slices_psize can be at most
5.5MB past the array, and so will be in RAM somewhere.  Since the access
is a load performed in real mode, it won't fault or crash the kernel.
At most this bug could perhaps leak a little bit of information about
blocks of 32 bytes of memory located at offsets of i * 512kB past the
paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11.

Fixes: c60ac5693c47 ("powerpc: Update kernel VSID range")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoMIPS: ptrace: Fix regs_return_value for kernel context
Marcin Nowakowski [Wed, 12 Oct 2016 07:32:56 +0000 (09:32 +0200)]
MIPS: ptrace: Fix regs_return_value for kernel context

commit 74f1077b5b783e7bf4fa3007cefdc8dbd6c07518 upstream.

Currently regs_return_value always negates reg[2] if it determines
the syscall has failed, but when called in kernel context this check is
invalid and may result in returning a wrong value.

This fixes errors reported by CONFIG_KPROBES_SANITY_TEST

Fixes: d7e7528bcd45 ("Audit: push audit success and retcode into arch ptrace.h")
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/14381/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoMIPS: Malta: Fix IOCU disable switch read for MIPS64
Paul Burton [Fri, 2 Sep 2016 15:07:10 +0000 (16:07 +0100)]
MIPS: Malta: Fix IOCU disable switch read for MIPS64

commit 305723ab439e14debc1d339aa04e835d488b8253 upstream.

Malta boards used with CPU emulators feature a switch to disable use of
an IOCU. Software has to check this switch & ignore any present IOCU if
the switch is closed. The read used to do this was unsafe for 64 bit
kernels, as it simply casted the address 0xbf403000 to a pointer &
dereferenced it. Whilst in a 32 bit kernel this would access kseg1, in a
64 bit kernel this attempts to access xuseg & results in an address
error exception.

Fix by accessing a correctly formed ckseg1 address generated using the
CKSEG1ADDR macro.

Whilst modifying this code, define the name of the register and the bit
we care about within it, which indicates whether PCI DMA is routed to
the IOCU or straight to DRAM. The code previously checked that bit 0 was
also set, but the least significant 7 bits of the CONFIG_GEN0 register
contain the value of the MReqInfo signal provided to the IOCU OCP bus,
so singling out bit 0 makes little sense & that part of the check is
dropped.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: b6d92b4a6bdb ("MIPS: Add option to disable software I/O coherency.")
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14187/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoarm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP
Will Deacon [Fri, 26 Aug 2016 10:36:39 +0000 (11:36 +0100)]
arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP

commit 3a402a709500c5a3faca2111668c33d96555e35a upstream.

When TIF_SINGLESTEP is set for a task, the single-step state machine is
enabled and we must take care not to reset it to the active-not-pending
state if it is already in the active-pending state.

Unfortunately, that's exactly what user_enable_single_step does, by
unconditionally setting the SS bit in the SPSR for the current task.
This causes failures in the GDB testsuite, where GDB ends up missing
expected step traps if the instruction being stepped generates another
trap, e.g. PTRACE_EVENT_FORK from an SVC instruction.

This patch fixes the problem by preserving the current state of the
stepping state machine when TIF_SINGLESTEP is set on the current thread.

Cc: <stable@vger.kernel.org>
Reported-by: Yao Qi <yao.qi@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoarm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb()
Will Deacon [Mon, 5 Sep 2016 10:56:05 +0000 (11:56 +0100)]
arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb()

commit 872c63fbf9e153146b07f0cece4da0d70b283eeb upstream.

smp_mb__before_spinlock() is intended to upgrade a spin_lock() operation
to a full barrier, such that prior stores are ordered with respect to
loads and stores occuring inside the critical section.

Unfortunately, the core code defines the barrier as smp_wmb(), which
is insufficient to provide the required ordering guarantees when used in
conjunction with our load-acquire-based spinlock implementation.

This patch overrides the arm64 definition of smp_mb__before_spinlock()
to map to a full smp_mb().

Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoarm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
James Hogan [Mon, 25 Jul 2016 15:59:52 +0000 (16:59 +0100)]
arm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO

commit 3146bc64d12377a74dbda12b96ea32da3774ae07 upstream.

AT_VECTOR_SIZE_ARCH should be defined with the maximum number of
NEW_AUX_ENT entries that ARCH_DLINFO can contain, but it wasn't defined
for arm64 at all even though ARCH_DLINFO will contain one NEW_AUX_ENT
for the VDSO address.

This shouldn't be a problem as AT_VECTOR_SIZE_BASE includes space for
AT_BASE_PLATFORM which arm64 doesn't use, but lets define it now and add
the comment above ARCH_DLINFO as found in several other architectures to
remind future modifiers of ARCH_DLINFO to keep AT_VECTOR_SIZE_ARCH up to
date.

Fixes: f668cd1673aa ("arm64: ELF definitions")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoarm64: avoid returning from bad_mode
Mark Rutland [Mon, 23 Jan 2017 18:28:50 +0000 (18:28 +0000)]
arm64: avoid returning from bad_mode

commit 7d9e8f71b989230bc613d121ca38507d34ada849 upstream.

Generally, taking an unexpected exception should be a fatal event, and
bad_mode is intended to cater for this. However, it should be possible
to contain unexpected synchronous exceptions from EL0 without bringing
the kernel down, by sending a SIGILL to the task.

We tried to apply this approach in commit 9955ac47f4ba1c95 ("arm64:
don't kill the kernel on a bad esr from el0"), by sending a signal for
any bad_mode call resulting from an EL0 exception.

However, this also applies to other unexpected exceptions, such as
SError and FIQ. The entry paths for these exceptions branch to bad_mode
without configuring the link register, and have no kernel_exit. Thus, if
we take one of these exceptions from EL0, bad_mode will eventually
return to the original user link register value.

This patch fixes this by introducing a new bad_el0_sync handler to cater
for the recoverable case, and restoring bad_mode to its original state,
whereby it calls panic() and never returns. The recoverable case
branches to bad_el0_sync with a bl, and returns to userspace via the
usual ret_to_user mechanism.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9955ac47f4ba1c95 ("arm64: don't kill the kernel on a bad esr from el0")
Reported-by: Mark Salter <msalter@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoARM: sa1111: fix pcmcia suspend/resume
Russell King [Tue, 6 Sep 2016 13:34:05 +0000 (14:34 +0100)]
ARM: sa1111: fix pcmcia suspend/resume

commit 06dfe5cc0cc684e735cb0232fdb756d30780b05d upstream.

SA1111 PCMCIA was broken when PCMCIA switched to using dev_pm_ops for
the PCMCIA socket class.  PCMCIA used to handle suspend/resume via the
socket hosting device, which happened at normal device suspend/resume
time.

However, the referenced commit changed this: much of the resume now
happens much earlier, in the noirq resume handler of dev_pm_ops.

However, on SA1111, the PCMCIA device is not accessible as the SA1111
has not been resumed at _noirq time.  It's slightly worse than that,
because the SA1111 has already been put to sleep at _noirq time, so
suspend doesn't work properly.

Fix this by converting the core SA1111 code to use dev_pm_ops as well,
and performing its own suspend/resume at noirq time.

This fixes these errors in the kernel log:

pcmcia_socket pcmcia_socket0: time out after reset
pcmcia_socket pcmcia_socket1: time out after reset

and the resulting lack of PCMCIA cards after a S2RAM cycle.

Fixes: d7646f7632549 ("pcmcia: use dev_pm_ops for class pcmcia_socket_class")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoARM: sa1100: clear reset status prior to reboot
Russell King [Fri, 19 Aug 2016 15:34:45 +0000 (16:34 +0100)]
ARM: sa1100: clear reset status prior to reboot

commit da60626e7d02a4f385cae80e450afc8b07035368 upstream.

Clear the current reset status prior to rebooting the platform.  This
adds the bit missing from 04fef228fb00 ("[ARM] pxa: introduce
reset_status and clear_reset_status for driver's usage").

Fixes: 04fef228fb00 ("[ARM] pxa: introduce reset_status and clear_reset_status for driver's usage")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7
Srinivas Ramana [Fri, 30 Sep 2016 14:03:31 +0000 (15:03 +0100)]
ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7

commit 117e5e9c4cfcb7628f08de074fbfefec1bb678b7 upstream.

If the bootloader uses the long descriptor format and jumps to
kernel decompressor code, TTBCR may not be in a right state.
Before enabling the MMU, it is required to clear the TTBCR.PD0
field to use TTBR0 for translation table walks.

The commit dbece45894d3a ("ARM: 7501/1: decompressor:
reset ttbcr for VMSA ARMv7 cores") does the reset of TTBCR.N, but
doesn't consider all the bits for the size of TTBCR.N.

Clear TTBCR.PD0 field and reset all the three bits of TTBCR.N to
indicate the use of TTBR0 and the correct base address width.

Fixes: dbece45894d3 ("ARM: 7501/1: decompressor: reset ttbcr for VMSA ARMv7 cores")
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoARM: 8616/1: dt: Respect property size when parsing CPUs
Robin Murphy [Mon, 26 Sep 2016 15:50:55 +0000 (16:50 +0100)]
ARM: 8616/1: dt: Respect property size when parsing CPUs

commit ba6dea4f7cedb4b1c17e36f4087675d817c2e24b upstream.

Whilst MPIDR values themselves are less than 32 bits, it is still
perfectly valid for a DT to have #address-cells > 1 in the CPUs node,
resulting in the "reg" property having leading zero cell(s). In that
situation, the big-endian nature of the data conspires with the current
behaviour of only reading the first cell to cause the kernel to think
all CPUs have ID 0, and become resoundingly unhappy as a consequence.

Take the full property length into account when parsing CPUs so as to
be correct under any circumstances.

Cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoiommu/amd: Free domain id when free a domain of struct dma_ops_domain
Baoquan He [Thu, 15 Sep 2016 08:50:52 +0000 (16:50 +0800)]
iommu/amd: Free domain id when free a domain of struct dma_ops_domain

commit c3db901c54466a9c135d1e6e95fec452e8a42666 upstream.

The current code missed freeing domain id when free a domain of
struct dma_ops_domain.

Signed-off-by: Baoquan He <bhe@redhat.com>
Fixes: ec487d1a110a ('x86, AMD IOMMU: add domain allocation and deallocation functions')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoiommu/amd: Update Alias-DTE in update_device_table()
Joerg Roedel [Tue, 26 Jul 2016 13:18:54 +0000 (15:18 +0200)]
iommu/amd: Update Alias-DTE in update_device_table()

commit 3254de6bf74fe94c197c9f819fe62a3a3c36f073 upstream.

Not doing so might cause IO-Page-Faults when a device uses
an alias request-id and the alias-dte is left in a lower
page-mode which does not cover the address allocated from
the iova-allocator.

Fixes: 492667dacc0a ('x86/amd-iommu: Remove amd_iommu_pd_table')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/um: reuse asm-generic/barrier.h
Michael S. Tsirkin [Mon, 21 Dec 2015 07:22:18 +0000 (09:22 +0200)]
x86/um: reuse asm-generic/barrier.h

commit 577f183acc88645eae116326cc2203dc88ea730c upstream.

On x86/um CONFIG_SMP is never defined.  As a result, several macros
match the asm-generic variant exactly. Drop the local definitions and
pull in asm-generic/barrier.h instead.

This is in preparation to refactoring this code area.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Richard Weinberger <richard@nod.at>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/build: Build compressed x86 kernels as PIE
H.J. Lu [Thu, 17 Mar 2016 03:04:35 +0000 (20:04 -0700)]
x86/build: Build compressed x86 kernels as PIE

commit 6d92bc9d483aa1751755a66fee8fb39dffb088c0 upstream.

The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
relocation to get the symbol address in PIC.  When the compressed x86
kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
to their fixed symbol addresses.  However, when the compressed x86
kernel is loaded at a different address, it leads to the following
load failure:

  Failed to allocate space for phdrs

during the decompression stage.

If the compressed x86 kernel is relocatable at run-time, it should be
compiled with -fPIE, instead of -fPIC, if possible and should be built as
Position Independent Executable (PIE) so that linker won't optimize
R_386_GOT32X relocation to its fixed symbol address.

Older linkers generate R_386_32 relocations against locally defined
symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
R_386_32 relocations when relocating the kernel.  To generate
R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
hidden in both 32-bit and 64-bit x86 kernels.

To build a 64-bit compressed x86 kernel as PIE, we need to disable the
relocation overflow check to avoid relocation overflow errors. We do
this with a new linker command-line option, -z noreloc-overflow, which
got added recently:

 commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
 Author: H.J. Lu <hjl.tools@gmail.com>
 Date:   Tue Mar 15 11:07:06 2016 -0700

    Add -z noreloc-overflow option to x86-64 ld

    Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
    disable relocation overflow check.  This can be used to avoid relocation
    overflow check if there will be no dynamic relocation overflow at
    run-time.

The 64-bit compressed x86 kernel is built as PIE only if the linker supports
-z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
boots fine even when it is built as a normal executable.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Edited the changelog and comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/paravirt: Do not trace _paravirt_ident_*() functions
Steven Rostedt [Wed, 25 May 2016 17:47:26 +0000 (13:47 -0400)]
x86/paravirt: Do not trace _paravirt_ident_*() functions

commit 15301a570754c7af60335d094dd2d1808b0641a5 upstream.

Å\81ukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:

  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()

It was found that this is only an issue when noreplace-paravirt is added
to the kernel command line.

This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.

In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue.  But both _paravirt_ident_{32,64}() causes the
following splat when they are traced:

 mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054)
 mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070)
 mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054)
 mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054)
 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
 Modules linked in: e1000e
 CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000
 RIP: 0010:[<ffffffff81134148>]  [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0
 RSP: 0018:ffff8800d4aefb90  EFLAGS: 00000246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40
 RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030
 RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000
 R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8
 R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0
 FS:  00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0
 Call Trace:
   _raw_spin_lock+0x27/0x30
   handle_pte_fault+0x13db/0x16b0
   handle_mm_fault+0x312/0x670
   __do_page_fault+0x1b1/0x4e0
   do_page_fault+0x22/0x30
   page_fault+0x28/0x30
   __vfs_read+0x28/0xe0
   vfs_read+0x86/0x130
   SyS_read+0x46/0xa0
   entry_SYSCALL_64_fastpath+0x1e/0xa8
 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b

Reported-by: Å\81ukasz Daniluk <lukasz.daniluk@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/mm/pat, /dev/mem: Remove superfluous error message
Jiri Kosina [Fri, 8 Jul 2016 09:38:28 +0000 (11:38 +0200)]
x86/mm/pat, /dev/mem: Remove superfluous error message

commit 39380b80d72723282f0ea1d1bbf2294eae45013e upstream.

Currently it's possible for broken (or malicious) userspace to flood a
kernel log indefinitely with messages a-la

Program dmidecode tried to access /dev/mem between f0000->100000

because range_is_allowed() is case of CONFIG_STRICT_DEVMEM being turned on
dumps this information each and every time devmem_is_allowed() fails.

Reportedly userspace that is able to trigger contignuous flow of these
messages exists.

It would be possible to rate limit this message, but that'd have a
questionable value; the administrator wouldn't get information about all
the failing accessess, so then the information would be both superfluous
and incomplete at the same time :)

Returning EPERM (which is what is actually happening) is enough indication
for userspace what has happened; no need to log this particular error as
some sort of special condition.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pm
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/apic: Do not init irq remapping if ioapic is disabled
Wanpeng Li [Tue, 23 Aug 2016 12:07:19 +0000 (20:07 +0800)]
x86/apic: Do not init irq remapping if ioapic is disabled

commit 2e63ad4bd5dd583871e6602f9d398b9322d358d9 upstream.

native_smp_prepare_cpus
  -> default_setup_apic_routing
    -> enable_IR_x2apic
      -> irq_remapping_prepare
        -> intel_prepare_irq_remapping
          -> intel_setup_irq_remapping

So IR table is setup even if "noapic" boot parameter is added. As a result we
crash later when the interrupt affinity is set due to a half initialized
remapping infrastructure.

Prevent remap initialization when IOAPIC is disabled.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/mm: Disable preemption during CR3 read+write
Sebastian Andrzej Siewior [Fri, 5 Aug 2016 13:37:39 +0000 (15:37 +0200)]
x86/mm: Disable preemption during CR3 read+write

commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream.

There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

 -> mmput()
 -> exit_mmap()
 -> tlb_finish_mmu()
 -> tlb_flush_mmu()
 -> tlb_flush_mmu_tlbonly()
 -> tlb_flush()
 -> flush_tlb_mm_range()
 -> __flush_tlb_up()
 -> __flush_tlb()
 ->  __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more timeâ\80¦

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/traps: Ignore high word of regs->cs in early_idt_handler_common
Andy Lutomirski [Thu, 1 Dec 2016 17:26:42 +0000 (09:26 -0800)]
x86/traps: Ignore high word of regs->cs in early_idt_handler_common

This is a backport of:
commit fc0e81b2bea0ebceb71889b61d2240856141c9ee upstream

On the 80486 DX, it seems that some exceptions may leave garbage in
the high bits of CS.  This causes sporadic failures in which
early_fixup_exception() refuses to fix up an exception.

As far as I can tell, this has been buggy for a long time, but the
problem seems to have been exacerbated by commits:

  1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4")
  e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines")

This appears to have broken for as long as we've had early
exception handling.

[ This backport should apply to kernels from 3.4 - 4.5. ]

Fixes: 4c5023a3fa2e ("x86-32: Handle exception table entries during early boot")
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org
Reported-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
Juergen Gross [Thu, 23 Jun 2016 05:12:27 +0000 (07:12 +0200)]
x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()

commit 1cf38741308c64d08553602b3374fb39224eeb5a upstream.

xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper
bound of the loop setting non-kernel-image entries to zero should not
exceed the size of level2_kernel_pgt.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen-pciback: Add name prefix to global 'permissive' variable
Ben Hutchings [Sun, 12 Apr 2015 23:26:35 +0000 (00:26 +0100)]
xen-pciback: Add name prefix to global 'permissive' variable

commit 8014bcc86ef112eab9ee1db312dba4e6b608cf89 upstream.

The variable for the 'permissive' module parameter used to be static
but was recently changed to be extern.  This puts it in the kernel
global namespace if the driver is built-in, so its name should begin
with a prefix identifying the driver.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: af6fc858a35b ("xen-pciback: limit guest control of command register")
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 23:13:27 +0000 (18:13 -0500)]
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.

commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0 upstream.

commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.

Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).

Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.

Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!

When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.

This is part of XSA-157

Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
Konrad Rzeszutek Wilk [Wed, 1 Apr 2015 14:49:47 +0000 (10:49 -0400)]
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.

commit 7cfb905b9638982862f0331b36ccaaca5d383b49 upstream.

Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).

This does not change the behavior of XEN_PCI_OP_disable_msi[|x].

The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.

However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.

This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.

Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.

This is part of XSA-157

Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Do not install an IRQ handler for MSI interrupts.
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 22:24:08 +0000 (17:24 -0500)]
xen/pciback: Do not install an IRQ handler for MSI interrupts.

commit a396f3a210c3a61e94d6b87ec05a75d0be2a60d0 upstream.

Otherwise an guest can subvert the generic MSI code to trigger
an BUG_ON condition during MSI interrupt freeing:

 for (i = 0; i < entry->nvec_used; i++)
        BUG_ON(irq_has_action(entry->irq + i));

Xen PCI backed installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts the GSI line
is shared by the backend devices.

To subvert the backend the guest needs to make the backend
to change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.

Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.

Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.

Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).

Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.

The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.

P.S.
Xen PCIBack when it sets up the device for the guest consumption
ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.

This is part of XSA-157

Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X...
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 23:07:44 +0000 (18:07 -0500)]
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled

commit 5e0ce1455c09dd61d029b8ad45d82e1ac0b6c4c9 upstream.

The guest sequence of:

  a) XEN_PCI_OP_enable_msix
  b) XEN_PCI_OP_enable_msix

results in hitting an NULL pointer due to using freed pointers.

The device passed in the guest MUST have MSI-X capability.

The a) constructs and SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also free's ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).

The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit an
NULL pointer further on when 'strlen' is attempted on a freed pointer.

The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not stricly neccessary.

This is part of XSA-157

Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
Konrad Rzeszutek Wilk [Fri, 3 Apr 2015 15:08:22 +0000 (11:08 -0400)]
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled

commit 56441f3c8e5bd45aab10dd9f8c505dd4bec03b0d upstream.

The guest sequence of:

 a) XEN_PCI_OP_enable_msi
 b) XEN_PCI_OP_enable_msi
 c) XEN_PCI_OP_disable_msi

results in hitting an BUG_ON condition in the msi.c code.

The MSI code uses an dev->msi_list to which it adds MSI entries.
Under the above conditions an BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.

The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
is still set).  c) pci_disable_msi passes the msi_enabled checks and hits:

BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

and blows up.

The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not stricly neccessary.

This is part of XSA-157.

Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Save the number of MSI-X entries to be copied later.
Konrad Rzeszutek Wilk [Thu, 11 Feb 2016 21:10:24 +0000 (16:10 -0500)]
xen/pciback: Save the number of MSI-X entries to be copied later.

commit d159457b84395927b5a52adb72f748dd089ad5e5 upstream.

Commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5 (xen/pciback: Save
xen_pci_op commands before processing it) broke enabling MSI-X because
it would never copy the resulting vectors into the response.  The
number of vectors requested was being overwritten by the return value
(typically zero for success).

Save the number of vectors before processing the op, so the correct
number of vectors are copied afterwards.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen/pciback: Save xen_pci_op commands before processing it
Konrad Rzeszutek Wilk [Mon, 16 Nov 2015 17:40:48 +0000 (12:40 -0500)]
xen/pciback: Save xen_pci_op commands before processing it

commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5 upstream.

Double fetch vulnerabilities that happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.

The xen_pcibk_do_op function performs a switch statements on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.

This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.

This is part of XSA155.

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Jan Beulich" <JBeulich@suse.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen-blkback: only read request operation from shared ring once
Roger Pau Monné [Tue, 3 Nov 2015 16:34:09 +0000 (16:34 +0000)]
xen-blkback: only read request operation from shared ring once

commit 1f13d75ccb806260079e0679d55d9253e370ec8a upstream.

A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.

When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().

This is part of XSA155.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Jan Beulich" <JBeulich@suse.com>
[wt: s/READ_ONCE/ACCESS_ONCE for 3.10]
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen-netback: use RING_COPY_REQUEST() throughout
David Vrabel [Fri, 30 Oct 2015 15:17:06 +0000 (15:17 +0000)]
xen-netback: use RING_COPY_REQUEST() throughout

commit 68a33bfd8403e4e22847165d149823a2e0e67c9c upstream.

Instead of open-coding memcpy()s and directly accessing Tx and Rx
requests, use the new RING_COPY_REQUEST() that ensures the local copy
is correct.

This is more than is strictly necessary for guest Rx requests since
only the id and gref fields are used and it is harmless if the
frontend modifies these.

This is part of XSA155.

Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[wt: adjustments for 3.10 : netbk_rx_meta instead of struct xenvif_rx_meta]

Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen-netback: don't use last request to determine minimum Tx credit
David Vrabel [Fri, 30 Oct 2015 15:16:01 +0000 (15:16 +0000)]
xen-netback: don't use last request to determine minimum Tx credit

commit 0f589967a73f1f30ab4ac4dd9ce0bb399b4d6357 upstream.

The last from guest transmitted request gives no indication about the
minimum amount of credit that the guest might need to send a packet
since the last packet might have been a small one.

Instead allow for the worst case 128 KiB packet.

This is part of XSA155.

Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoxen: Add RING_COPY_REQUEST()
David Vrabel [Fri, 30 Oct 2015 14:58:08 +0000 (14:58 +0000)]
xen: Add RING_COPY_REQUEST()

commit 454d5d882c7e412b840e3c99010fe81a9862f6fb upstream.

Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
(i.e., by not considering that the other end may alter the data in the
shared ring while it is being inspected).  Safe usage of a request
generally requires taking a local copy.

Provide a RING_COPY_REQUEST() macro to use instead of
RING_GET_REQUEST() and an open-coded memcpy().  This takes care of
ensuring that the copy is done correctly regardless of any possible
compiler optimizations.

Use a volatile source to prevent the compiler from reordering or
omitting the copy.

This is part of XSA155.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agox86/mm/xen: Suppress hugetlbfs in PV guests
Jan Beulich [Thu, 21 Apr 2016 06:27:04 +0000 (00:27 -0600)]
x86/mm/xen: Suppress hugetlbfs in PV guests

commit 103f6112f253017d7062cd74d17f4a514ed4485c upstream.

Huge pages are not normally available to PV guests. Not suppressing
hugetlbfs use results in an endless loop of page faults when user mode
code tries to access a hugetlbfs mapped area (since the hypervisor
denies such PTEs to be created, but error indications can't be
propagated out of xen_set_pte_at(), just like for various of its
siblings), and - once killed in an oops like this:

  kernel BUG at .../fs/hugetlbfs/inode.c:428!
  invalid opcode: 0000 [#1] SMP
  ...
  RIP: e030:[<ffffffff811c333b>]  [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320
  ...
  Call Trace:
   [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40
   [<ffffffff81167b3d>] evict+0xbd/0x1b0
   [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0
   [<ffffffff81165b0e>] dput+0x1fe/0x220
   [<ffffffff81150535>] __fput+0x155/0x200
   [<ffffffff81079fc0>] task_work_run+0x60/0xa0
   [<ffffffff81063510>] do_exit+0x160/0x400
   [<ffffffff810637eb>] do_group_exit+0x3b/0xa0
   [<ffffffff8106e8bd>] get_signal+0x1ed/0x470
   [<ffffffff8100f854>] do_signal+0x14/0x110
   [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0
   [<ffffffff814178a5>] retint_user+0x8/0x13

This is CVE-2016-3961 / XSA-174.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <JGross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoppp: defer netns reference release for ppp channel
WANG Cong [Wed, 6 Jul 2016 05:12:36 +0000 (22:12 -0700)]
ppp: defer netns reference release for ppp channel

commit 205e1e255c479f3fd77446415706463b282f94e4 upstream

Matt reported that we have a NULL pointer dereference
in ppp_pernet() from ppp_connect_channel(),
i.e. pch->chan_net is NULL.

This is due to that a parallel ppp_unregister_channel()
could happen while we are in ppp_connect_channel(), during
which pch->chan_net set to NULL. Since we need a reference
to net per channel, it makes sense to sync the refcnt
with the life time of the channel, therefore we should
release this reference when we destroy it.

Fixes: 1f461dcdd296 ("ppp: take reference on channels netns")
Reported-by: Matt Bennett <Matt.Bennett@alliedtelesis.co.nz>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linux-ppp@vger.kernel.org
Cc: Guillaume Nault <g.nault@alphalink.fr>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoPM / devfreq: Fix incorrect type issue.
Xiaolong Ye [Fri, 11 Sep 2015 03:05:23 +0000 (11:05 +0800)]
PM / devfreq: Fix incorrect type issue.

commit 5f25f066f75a67835abb5e400471a27abd09395b upstream

time_in_state in struct devfreq is defined as unsigned long, so
devm_kzalloc should use sizeof(unsigned long) as argument instead
of sizeof(unsigned int), otherwise it will cause unexpected result
in 64bit system.

Signed-off-by: Xiaolong Ye <yexl@marvell.com>
Signed-off-by: Kevin Liu <kliu5@marvell.com>
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: Disable irq while unregistering user notifier
Ignacio Alvarado [Fri, 4 Nov 2016 19:15:55 +0000 (12:15 -0700)]
KVM: Disable irq while unregistering user notifier

commit 1650b4ebc99da4c137bfbfc531be4a2405f951dd upstream.

Function user_notifier_unregister should be called only once for each
registered user notifier.

Function kvm_arch_hardware_disable can be executed from an IPI context
which could cause a race condition with a VCPU returning to user mode
and attempting to unregister the notifier.

Signed-off-by: Ignacio Alvarado <ikalvarado@google.com>
Fixes: 18863bdd60f8 ("KVM: x86 shared msr infrastructure")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim KrÄ\8dmáÅ\99 <rkrcmar@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
Paolo Bonzini [Thu, 17 Nov 2016 14:55:46 +0000 (15:55 +0100)]
KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr

commit 7301d6abaea926d685832f7e1f0c37dd206b01f4 upstream.

Reported by syzkaller:

    [ INFO: suspicious RCU usage. ]
    4.9.0-rc4+ #47 Not tainted
    -------------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    stack backtrace:
    CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
     ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000
     0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9
     ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000
    Call Trace:
     [<     inline     >] __dump_stack lib/dump_stack.c:15
     [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51
     [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445
     [<     inline     >] __kvm_memslots include/linux/kvm_host.h:534
     [<     inline     >] kvm_memslots include/linux/kvm_host.h:541
     [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941
     [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: fda4e2e85589191b123d31cdc21fd33ee70f50fd
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim KrÄ\8dmáÅ\99 <rkrcmar@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: x86: fix wbinvd_dirty_mask use-after-free
Ido Yariv [Fri, 21 Oct 2016 16:39:57 +0000 (12:39 -0400)]
KVM: x86: fix wbinvd_dirty_mask use-after-free

commit bd768e146624cbec7122ed15dead8daa137d909d upstream.

vcpu->arch.wbinvd_dirty_mask may still be used after freeing it,
corrupting memory. For example, the following call trace may set a bit
in an already freed cpu mask:
    kvm_arch_vcpu_load
    vcpu_load
    vmx_free_vcpu_nested
    vmx_free_vcpu
    kvm_arch_vcpu_free

Fix this by deferring freeing of wbinvd_dirty_mask.

Signed-off-by: Ido Yariv <ido@wizery.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim KrÄ\8dmáÅ\99 <rkrcmar@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: MIPS: Make ERET handle ERL before EXL
James Hogan [Tue, 25 Oct 2016 15:11:11 +0000 (16:11 +0100)]
KVM: MIPS: Make ERET handle ERL before EXL

commit ede5f3e7b54a4347be4d8525269eae50902bd7cd upstream.

The ERET instruction to return from exception is used for returning from
exception level (Status.EXL) and error level (Status.ERL). If both bits
are set however we should be returning from ERL first, as ERL can
interrupt EXL, for example when an NMI is taken. KVM however checks EXL
first.

Fix the order of the checks to match the pseudocode in the instruction
set manual.

Fixes: e685c689f3a8 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄ\8dmáÅ\99" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
Radim KrÄ\8dmáÅ\99 [Mon, 8 Aug 2016 18:16:23 +0000 (20:16 +0200)]
KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write

commit dccbfcf52cebb8963246eba5b177b77f26b34da0 upstream.

If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
write with vmcs02 as the current VMCS.
This will incorrectly apply modifications intended for vmcs01 to vmcs02
and L2 can use it to gain access to L0's x2APIC registers by disabling
virtualized x2APIC while using msr bitmap that assumes enabled.

Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
current VMCS.  An alternative solution would temporarily make vmcs01 the
current VMCS, but it requires more care.

Fixes: 8d14695f9542 ("x86, apicv: add virtual x2apic support")
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim KrÄ\8dmáÅ\99 <rkrcmar@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: MIPS: Drop other CPU ASIDs on guest MMU changes
James Hogan [Wed, 9 Nov 2016 14:46:24 +0000 (14:46 +0000)]
KVM: MIPS: Drop other CPU ASIDs on guest MMU changes

commit 91e4f1b6073dd680d86cdb7e42d7cccca9db39d8 upstream.

When a guest TLB entry is replaced by TLBWI or TLBWR, we only invalidate
TLB entries on the local CPU. This doesn't work correctly on an SMP host
when the guest is migrated to a different physical CPU, as it could pick
up stale TLB mappings from the last time the vCPU ran on that physical
CPU.

Therefore invalidate both user and kernel host ASIDs on other CPUs,
which will cause new ASIDs to be generated when it next runs on those
CPUs.

We're careful only to do this if the TLB entry was already valid, and
only for the kernel ASID where the virtual address it mapped is outside
of the guest user address range.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄ\8dmáÅ\99" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-
Cc: Jiri Slaby <jslaby@suse.cz>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoKVM: MIPS: Precalculate MMIO load resume PC
James Hogan [Wed, 9 Nov 2016 16:13:50 +0000 (16:13 +0000)]
KVM: MIPS: Precalculate MMIO load resume PC

commit e1e575f6b026734be3b1f075e780e91ab08ca541 upstream.

The advancing of the PC when completing an MMIO load is done before
re-entering the guest, i.e. before restoring the guest ASID. However if
the load is in a branch delay slot it may need to access guest code to
read the prior branch instruction. This isn't safe in TLB mapped code at
the moment, nor in the future when we'll access unmapped guest segments
using direct user accessors too, as it could read the branch from host
user memory instead.

Therefore calculate the resume PC in advance while we're still in the
right context and save it in the new vcpu->arch.io_pc (replacing the no
longer needed vcpu->arch.pending_load_cause), and restore it on MMIO
completion.

Fixes: e685c689f3a8 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄ\8dmáÅ\99 <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x: 5f508c43a764: MIPS: KVM: Fix unused variable build warning
Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agoMIPS: KVM: Fix unused variable build warning
Nicholas Mc Guire [Wed, 9 Nov 2016 16:13:49 +0000 (16:13 +0000)]
MIPS: KVM: Fix unused variable build warning

commit 5f508c43a7648baa892528922402f1e13f258bd4 upstream.

As kvm_mips_complete_mmio_load() did not yet modify PC at this point
as James Hogans <james.hogan@imgtec.com> explained the curr_pc variable
and the comments along with it can be dropped.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Link: http://lkml.org/lkml/2015/5/8/422
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/9993/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: gcm - Fix IV buffer size in crypto_gcm_setkey
Ondrej MosnáÄ\8dek [Fri, 23 Sep 2016 08:47:32 +0000 (10:47 +0200)]
crypto: gcm - Fix IV buffer size in crypto_gcm_setkey

commit 50d2e6dc1f83db0563c7d6603967bf9585ce934b upstream.

The cipher block size for GCM is 16 bytes, and thus the CTR transform
used in crypto_gcm_setkey() will also expect a 16-byte IV. However,
the code currently reserves only 8 bytes for the IV, causing
an out-of-bounds access in the CTR transform. This patch fixes
the issue by setting the size of the IV buffer to 16 bytes.

Fixes: 84c911523020 ("[CRYPTO] gcm: Add support for async ciphers")
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: skcipher - Fix blkcipher walk OOM crash
Herbert Xu [Thu, 27 Oct 2016 14:29:51 +0000 (17:29 +0300)]
crypto: skcipher - Fix blkcipher walk OOM crash

commit acdb04d0b36769b3e05990c488dc74d8b7ac8060 upstream.

When we need to allocate a temporary blkcipher_walk_next and it
fails, the code is supposed to take the slow path of processing
the data block by block.  However, due to an unrelated change
we instead end up dereferencing the NULL pointer.

This patch fixes it by moving the unrelated bsize setting out
of the way so that we enter the slow path as inteded.

Fixes: 7607bd8ff03b ("[CRYPTO] blkcipher: Added blkcipher_walk_virt_block")
Cc: stable@vger.kernel.org
Reported-by: xiakaixu <xiakaixu@huawei.com>
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: cryptd - initialize child shash_desc on import
Ard Biesheuvel [Thu, 27 Oct 2016 14:29:50 +0000 (17:29 +0300)]
crypto: cryptd - initialize child shash_desc on import

commit 0bd2223594a4dcddc1e34b15774a3a4776f7749e upstream.

When calling .import() on a cryptd ahash_request, the structure members
that describe the child transform in the shash_desc need to be initialized
like they are when calling .init()

Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Load TX SG list after waiting
Herbert Xu [Thu, 27 Oct 2016 14:29:48 +0000 (17:29 +0300)]
crypto: algif_skcipher - Load TX SG list after waiting

commit 4f0414e54e4d1893c6f08260693f8ef84c929293 upstream.

We need to load the TX SG list in sendmsg(2) after waiting for
incoming data, not before.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Fix race condition in skcipher_check_key
Herbert Xu [Thu, 27 Oct 2016 14:29:47 +0000 (17:29 +0300)]
crypto: algif_skcipher - Fix race condition in skcipher_check_key

commit 1822793a523e5d5730b19cc21160ff1717421bc8 upstream.

We need to lock the child socket in skcipher_check_key as otherwise
two simultaneous calls can cause the parent socket to be freed.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_hash - Fix race condition in hash_check_key
Herbert Xu [Thu, 27 Oct 2016 14:29:46 +0000 (17:29 +0300)]
crypto: algif_hash - Fix race condition in hash_check_key

commit ad46d7e33219218605ea619e32553daf4f346b9f upstream.

We need to lock the child socket in hash_check_key as otherwise
two simultaneous calls can cause the parent socket to be freed.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: af_alg - Forbid bind(2) when nokey child sockets are present
Herbert Xu [Thu, 27 Oct 2016 14:29:45 +0000 (17:29 +0300)]
crypto: af_alg - Forbid bind(2) when nokey child sockets are present

commit a6a48c565f6f112c6983e2a02b1602189ed6e26e upstream.

This patch forbids the calling of bind(2) when there are child
sockets created by accept(2) in existence, even if they are created
on the nokey path.

This is needed as those child sockets have references to the tfm
object which bind(2) will destroy.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Remove custom release parent function
Herbert Xu [Thu, 27 Oct 2016 14:29:44 +0000 (17:29 +0300)]
crypto: algif_skcipher - Remove custom release parent function

commit d7b65aee1e7b4c87922b0232eaba56a8a143a4a0 upstream.

This patch removes the custom release parent function as the
generic af_alg_release_parent now works for nokey sockets too.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_hash - Remove custom release parent function
Herbert Xu [Thu, 27 Oct 2016 14:29:43 +0000 (17:29 +0300)]
crypto: algif_hash - Remove custom release parent function

commit f1d84af1835846a5a2b827382c5848faf2bb0e75 upstream.

This patch removes the custom release parent function as the
generic af_alg_release_parent now works for nokey sockets too.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path
Herbert Xu [Thu, 27 Oct 2016 14:29:42 +0000 (17:29 +0300)]
crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path

commit 6a935170a980024dd29199e9dbb5c4da4767a1b9 upstream.

This patch allows af_alg_release_parent to be called even for
nokey sockets.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Add key check exception for cipher_null
Herbert Xu [Thu, 27 Oct 2016 14:29:41 +0000 (17:29 +0300)]
crypto: algif_skcipher - Add key check exception for cipher_null

commit 6e8d8ecf438792ecf7a3207488fb4eebc4edb040 upstream.

This patch adds an exception to the key check so that cipher_null
users may continue to use algif_skcipher without setting a key.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: skcipher - Add crypto_skcipher_has_setkey
Herbert Xu [Thu, 27 Oct 2016 14:29:40 +0000 (17:29 +0300)]
crypto: skcipher - Add crypto_skcipher_has_setkey

commit a1383cd86a062fc798899ab20f0ec2116cce39cb upstream.

This patch adds a way for skcipher users to determine whether a key
is required by a transform.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_hash - Require setkey before accept(2)
Herbert Xu [Thu, 27 Oct 2016 14:29:39 +0000 (17:29 +0300)]
crypto: algif_hash - Require setkey before accept(2)

commit 6de62f15b581f920ade22d758f4c338311c2f0d4 upstream.

Hash implementations that require a key may crash if you use
them without setting a key.  This patch adds the necessary checks
so that if you do attempt to use them without a key that we return
-ENOKEY instead of proceeding.

This patch also adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: shash - Fix has_key setting
Herbert Xu [Thu, 27 Oct 2016 14:29:49 +0000 (17:29 +0300)]
crypto: shash - Fix has_key setting

commit 00420a65fa2beb3206090ead86942484df2275f3 upstream.

The has_key logic is wrong for shash algorithms as they always
have a setkey function.  So we should instead be testing against
shash_no_setkey.

Fixes: a5596d633278 ("crypto: hash - Add crypto_ahash_has_setkey")
Cc: stable@vger.kernel.org
Reported-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: hash - Add crypto_ahash_has_setkey
Herbert Xu [Thu, 27 Oct 2016 14:29:38 +0000 (17:29 +0300)]
crypto: hash - Add crypto_ahash_has_setkey

commit a5596d6332787fd383b3b5427b41f94254430827 upstream.

This patch adds a way for ahash users to determine whether a key
is required by a crypto_ahash transform.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Add nokey compatibility path
Herbert Xu [Thu, 27 Oct 2016 14:29:37 +0000 (17:29 +0300)]
crypto: algif_skcipher - Add nokey compatibility path

commit a0fa2d037129a9849918a92d91b79ed6c7bd2818 upstream.

This patch adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: af_alg - Add nokey compatibility path
Herbert Xu [Thu, 27 Oct 2016 14:29:36 +0000 (17:29 +0300)]
crypto: af_alg - Add nokey compatibility path

commit 37766586c965d63758ad542325a96d5384f4a8c9 upstream.

This patch adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: af_alg - Disallow bind/setkey/... after accept(2)
Herbert Xu [Thu, 27 Oct 2016 14:29:35 +0000 (17:29 +0300)]
crypto: af_alg - Disallow bind/setkey/... after accept(2)

commit c840ac6af3f8713a71b4d2363419145760bd6044 upstream.

Each af_alg parent socket obtained by socket(2) corresponds to a
tfm object once bind(2) has succeeded.  An accept(2) call on that
parent socket creates a context which then uses the tfm object.

Therefore as long as any child sockets created by accept(2) exist
the parent socket must not be modified or freed.

This patch guarantees this by using locks and a reference count
on the parent socket.  Any attempt to modify the parent socket will
fail with EBUSY.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agocrypto: algif_skcipher - Require setkey before accept(2)
Herbert Xu [Thu, 27 Oct 2016 14:29:34 +0000 (17:29 +0300)]
crypto: algif_skcipher - Require setkey before accept(2)

commit dd504589577d8e8e70f51f997ad487a4cb6c026f upstream.

Some cipher implementations will crash if you try to use them
without calling setkey first.  This patch adds a check so that
the accept(2) call will fail with -ENOKEY if setkey hasn't been
done on the socket yet.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agosched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
Peter Zijlstra [Wed, 7 Oct 2015 12:14:13 +0000 (14:14 +0200)]
sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()

commit ecf7d01c229d11a44609c0067889372c91fb4f36 upstream.

Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.

        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
7 years agosched/core: Fix a race between try_to_wake_up() and a woken up task
Balbir Singh [Mon, 5 Sep 2016 03:16:40 +0000 (13:16 +1000)]
sched/core: Fix a race between try_to_wake_up() and a woken up task

commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

do {
schedule()
set_current_state(TASK_(UN)INTERRUPTIBLE);
} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

while (p->on_cpu)
cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1 CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
wakeup_routine()
  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)   wake_up_process()
 }   try_to_wake_up()
 set_current_state(TASK_RUNNING);   ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoLinux 3.10.104
Willy Tarreau [Fri, 21 Oct 2016 10:13:35 +0000 (12:13 +0200)]
Linux 3.10.104

8 years agomm: remove gup_flags FOLL_WRITE games from __get_user_pages()
Linus Torvalds [Thu, 13 Oct 2016 20:07:36 +0000 (13:07 -0700)]
mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better).  The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9.  Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.

Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
     s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoxen-netback: ref count shared rings
Wei Liu [Fri, 11 Jul 2014 16:37:37 +0000 (17:37 +0100)]
xen-netback: ref count shared rings

... so that we can make sure the rings are not freed until all SKBs in
internal queues are consumed.

1. The VM is receiving packets through bonding + bridge + netback +
   netfront.
2. For some unknown reason at least one packet remains in the rx queue
   and is not delivered to the domU immediately by netback.
3. The VM finishes shutting down.
4. The shared ring between dom0 and domU is freed.
5. then xen-netback continues processing the pending requests and tries
   to put the packet into the now already released shared ring.

> XXXlan0: port 9(vif26.0) entered disabled state
> BUG: unable to handle kernel paging request at ffffc900108641d8
> IP: [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> PGD 57e20067 PUD 57e21067 PMD 571a7067 PTE 0
> Oops: 0000 [#1] SMP
> ...
> CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian 3.10.11-1.58.201405060908
> Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051 07/22/2011
> task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
> RIP: e030:[<ffffffffa04147dc>]  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> RSP: e02b:ffff8800561edce8  EFLAGS: 00010202
> RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
> RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
> RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
> R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
> R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
> FS:  00007f19a54a8700(0000) GS:ffff88005da00000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
>  ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
>  0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
>  0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
> Call Trace:
>  [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
>  [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
>  [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
>  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
>  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
>  [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
>  [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
>  [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
> Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07 c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24 60 00 00 00 00 8b 44 d1 04 89 44 24
> RIP  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
>  RSP <ffff8800561edce8>
> CR2: ffffc900108641d8

Track the shared ring buffer being unmapped and drop those packets.

Ref-count the rings as followed:
  map         -> set to 1
   start_xmit -> inc when queueing SKB to internal queue
   rx_action  -> dec after finishing processing a SKB
  unmap       -> dec and wait to be 0

Note that this is different from ref counting the vif structure itself.
Currently only guest Rx path is taken care of because that's where the
bug surfaced.

This bug doesn't exist in kernel >=3.12 as multi-queue support was added
there.

Link: <https://lists.xenproject.org/archives/html/xen-devel/2014-06/msg00818.html>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Philipp Hahn <hahn@univention.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Tested-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agosecurity: let security modules use PTRACE_MODE_* with bitmasks
Jann Horn [Wed, 20 Jan 2016 23:00:01 +0000 (15:00 -0800)]
security: let security modules use PTRACE_MODE_* with bitmasks

commit 3dfb7d8cdbc7ea0c2970450e60818bb3eefbad69 upstream.

It looks like smack and yama weren't aware that the ptrace mode
can have flags ORed into it - PTRACE_MODE_NOAUDIT until now, but
only for /proc/$pid/stat, and with the PTRACE_MODE_*CREDS patch,
all modes have flags ORed into them.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: no smk_ptrace_mode() in 3.10]
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoMIPS: KVM: Check for pfn noslot case
James Hogan [Thu, 15 Sep 2016 21:51:27 +0000 (22:51 +0100)]
MIPS: KVM: Check for pfn noslot case

commit ba913e4f72fc9cfd03dad968dfb110eb49211d80 upstream.

When mapping a page into the guest we error check using is_error_pfn(),
however this doesn't detect a value of KVM_PFN_NOSLOT, indicating an
error HVA for the page. This can only happen on MIPS right now due to
unusual memslot management (e.g. being moved / removed / resized), or
with an Enhanced Virtual Memory (EVA) configuration where the default
KVM_HVA_ERR_* and kvm_is_error_hva() definitions are unsuitable (fixed
in a later patch). This case will be treated as a pfn of zero, mapping
the first page of physical memory into the guest.

It would appear the MIPS KVM port wasn't updated prior to being merged
(in v3.10) to take commit 81c52c56e2b4 ("KVM: do not treat noslot pfn as
a error pfn") into account (merged v3.8), which converted a bunch of
is_error_pfn() calls to is_error_noslot_pfn(). Switch to using
is_error_noslot_pfn() instead to catch this case properly.

Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄ\8dmáÅ\99" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[james.hogan@imgtec.com: Backport to v3.16.y]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agomm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
Andrea Arcangeli [Fri, 26 Feb 2016 23:19:28 +0000 (15:19 -0800)]
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED

commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream.

pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessy (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
from pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (for writing not).

While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
tiny window for a race.

Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.

[js] 3.12 backport: no pmd_devmap in 3.12 yet.

[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoACPI / sysfs: fix error code in get_status()
Dan Carpenter [Thu, 5 May 2016 13:23:04 +0000 (16:23 +0300)]
ACPI / sysfs: fix error code in get_status()

commit f18ebc211e259d4f591e39e74b2aa2de226c9a1d upstream.

The problem with ornamental, do-nothing gotos is that they lead to
"forgot to set the error code" bugs.  We should be returning -EINVAL
here but we don't.  It leads to an uninitalized variable in
counter_show():

    drivers/acpi/sysfs.c:603 counter_show()
    error: uninitialized symbol 'status'.

Fixes: 1c8fce27e275 (ACPI: introduce drivers/acpi/sysfs.c)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agostaging: comedi: daqboard2000: bug fix board type matching code
Ian Abbott [Wed, 29 Jun 2016 19:27:44 +0000 (20:27 +0100)]
staging: comedi: daqboard2000: bug fix board type matching code

commit 80e162ee9b31d77d851b10f8c5299132be1e120f upstream.

`daqboard2000_find_boardinfo()` is supposed to check if the
DaqBoard/2000 series model is supported, based on the PCI subvendor and
subdevice ID.  The current code is wrong as it is comparing the PCI
device's subdevice ID to an expected, fixed value for the subvendor ID.
It should be comparing the PCI device's subvendor ID to this fixed
value.  Correct it.

Fixes: 7e8401b23e7f ("staging: comedi: daqboard2000: add back
subsystem_device check")
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Cc: <stable@vger.kernel.org> # 3.7+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agocrypto: nx - off by one bug in nx_of_update_msc()
Dan Carpenter [Fri, 15 Jul 2016 11:09:13 +0000 (14:09 +0300)]
crypto: nx - off by one bug in nx_of_update_msc()

commit e514cc0a492a3f39ef71b31590a7ef67537ee04b upstream.

The props->ap[] array is defined like this:

struct alg_props ap[NX_MAX_FC][NX_MAX_MODE][3];

So we can see that if msc->fc and msc->mode are == to NX_MAX_FC or
NX_MAX_MODE then we're off by one.

Fixes: ae0222b7289d ('powerpc/crypto: nx driver code supporting nx encryption')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agomegaraid_sas: Fix probing cards without io port
Yinghai Lu [Sat, 6 Aug 2016 06:37:34 +0000 (23:37 -0700)]
megaraid_sas: Fix probing cards without io port

commit e7f851684efb3377e9c93aca7fae6e76212e5680 upstream.

Found one megaraid_sas HBA probe fails,

[  187.235190] scsi host2: Avago SAS based MegaRAID driver
[  191.112365] megaraid_sas 0000:89:00.0: BAR 0: can't reserve [io  0x0000-0x00ff]
[  191.120548] megaraid_sas 0000:89:00.0: IO memory region busy!

and the card has resource like,
[  125.097714] pci 0000:89:00.0: [1000:005d] type 00 class 0x010400
[  125.104446] pci 0000:89:00.0: reg 0x10: [io  0x0000-0x00ff]
[  125.110686] pci 0000:89:00.0: reg 0x14: [mem 0xce400000-0xce40ffff 64bit]
[  125.118286] pci 0000:89:00.0: reg 0x1c: [mem 0xce300000-0xce3fffff 64bit]
[  125.125891] pci 0000:89:00.0: reg 0x30: [mem 0xce200000-0xce2fffff pref]

that does not io port resource allocated from BIOS, and kernel can not
assign one as io port shortage.

The driver is only looking for MEM, and should not fail.

It turns out megasas_init_fw() etc are using bar index as mask.  index 1
is used as mask 1, so that pci_request_selected_regions() is trying to
request BAR0 instead of BAR1.

Fix all related reference.

Fixes: b6d5d8808b4c ("megaraid_sas: Use lowest memory bar for SR-IOV VF support")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoaacraid: Check size values after double-fetch from user
Dave Carroll [Fri, 5 Aug 2016 19:44:10 +0000 (13:44 -0600)]
aacraid: Check size values after double-fetch from user

commit fa00c437eef8dc2e7b25f8cd868cfa405fcc2bb3 upstream.

In aacraid's ioctl_send_fib() we do two fetches from userspace, one the
get the fib header's size and one for the fib itself. Later we use the
size field from the second fetch to further process the fib. If for some
reason the size from the second fetch is different than from the first
fix, we may encounter an out-of- bounds access in aac_fib_send(). We
also check the sender size to insure it is not out of bounds. This was
reported in https://bugzilla.kernel.org/show_bug.cgi?id=116751 and was
assigned CVE-2016-6480.

Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Fixes: 7c00ffa31 '[SCSI] 2.6 aacraid: Variable FIB size (updated patch)'
Cc: stable@vger.kernel.org
Signed-off-by: Dave Carroll <david.carroll@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
8 years agoPCI: Limit config space size for Netronome NFP4000
Simon Horman [Fri, 11 Dec 2015 02:30:12 +0000 (11:30 +0900)]
PCI: Limit config space size for Netronome NFP4000

commit c2e771b02792d222cbcd9617fe71482a64f52647 upstream.

Like the NFP6000, the NFP4000 as an erratum where reading/writing to PCI
config space addresses above 0x600 can cause the NFP to generate PCIe
completion timeouts.

Limit the NFP4000's PF's config space size to 0x600 bytes as is already
done for the NFP6000.

The NFP4000's VF is 0x6004 (PCI_DEVICE_ID_NETRONOME_NFP6000_VF), the same
device ID as the NFP6000's VF.  Thus, its config space is already limited
by the existing use of quirk_nfp6000().

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>