Ingo Molnar [Sat, 2 Apr 2016 07:31:10 +0000 (09:31 +0200)]
Merge tag 'efi-urgent' of git://git./linux/kernel/git/mfleming/efi into efi/urgent
Pull EFI/arm64 fix from Matt Fleming:
" * Fix a boot crash on arm64 caused by a recent commit to mark the EFI
memory map as 'MEMBLOCK_NOMAP' which causes the regions to be
omitted from the kernel direct mapping - Ard Biesheuvel "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ard Biesheuvel [Wed, 30 Mar 2016 07:46:23 +0000 (09:46 +0200)]
efi/arm64: Don't apply MEMBLOCK_NOMAP to UEFI memory map mapping
Commit
4dffbfc48d65 ("arm64/efi: mark UEFI reserved regions as
MEMBLOCK_NOMAP") updated the mapping logic of both the RuntimeServices
regions as well as the kernel's copy of the UEFI memory map to set the
MEMBLOCK_NOMAP flag, which causes these regions to be omitted from the
kernel direct mapping, and from being covered by a struct page.
For the RuntimeServices regions, this is an obvious win, since the contents
of these regions have significance to the firmware executable code itself,
and are mapped in the EFI page tables using attributes that are described in
the UEFI memory map, and which may differ from the attributes we use for
mapping system RAM. It also prevents the contents from being modified
inadvertently, since the EFI page tables are only live during runtime
service invocations.
None of these concerns apply to the allocation that covers the UEFI memory
map, since it is entirely owned by the kernel. Setting the MEMBLOCK_NOMAP on
the region did allow us to use ioremap_cache() to map it both on arm64 and
on ARM, since the latter does not allow ioremap_cache() to be used on
regions that are covered by a struct page.
The ioremap_cache() on ARM restriction will be lifted in the v4.7 timeframe,
but in the mean time, it has been reported that commit
4dffbfc48d65 causes
a regression on 64k granule kernels. This is due to the fact that, given
the 64 KB page size, the region that we end up removing from the kernel
direct mapping is rounded up to 64 KB, and this 64 KB page frame may be
shared with the initrd when booting via GRUB (which does not align its
EFI_LOADER_DATA allocations to 64 KB like the stub does). This will crash
the kernel as soon as it tries to access the initrd.
Since the issue is specific to arm64, revert back to memblock_reserve()'ing
the UEFI memory map when running on arm64. This is a temporary fix for v4.5
and v4.6, and will be superseded in the v4.7 timeframe when we will be able
to move back to memblock_reserve() unconditionally.
Fixes:
4dffbfc48d65 ("arm64/efi: mark UEFI reserved regions as MEMBLOCK_NOMAP")
Reported-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Mark Langsdorf <mlangsdo@redhat.com>
Cc: <stable@vger.kernel.org> # v4.5
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Linus Torvalds [Thu, 31 Mar 2016 12:55:14 +0000 (07:55 -0500)]
Merge branch 'parisc-4.6-2' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Fix seccomp filter support and SIGSYS signals on compat kernel.
Both patches are tagged for v4.5 stable kernel"
* 'parisc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix and enable seccomp filter support
parisc: Fix SIGSYS signals in compat case
Linus Torvalds [Thu, 31 Mar 2016 12:19:39 +0000 (07:19 -0500)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs fix from Al Viro.
Automount handling was broken by commit
e3c13928086f ("namei: massage
lookup_slow() to be usable by lookup_one_len_unlocked()") moving the
test for negative dentry too early.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()"
Linus Torvalds [Thu, 31 Mar 2016 12:13:56 +0000 (07:13 -0500)]
Merge tag 'gpio-v4.6-2' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Here are two GPIO fixes for the v4.6 series, both in drivers:
- Prevent NULL dereference in the Xgene driver
- Fix an uninitialized spinlock in the menz127 driver"
* tag 'gpio-v4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: xgene: Prevent NULL pointer dereference
gpio: menz127: Drop lock field from struct men_z127_gpio
Linus Torvalds [Thu, 31 Mar 2016 11:56:50 +0000 (06:56 -0500)]
Merge branch 'libnvdimm-for-next' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull nvdimm mcsafe_memcpy use from Dan Williams:
"Now that mcsafe_memcpy() has landed, and the return value was been
clarified in commit
cbf8b5a2b649 ("x86/mm, x86/mce: Fix return
type/value for memcpy_mcsafe()"), let's hook up its primary usage in
the pmem driver.
The compilation problems from the initial posting have been fixed,
this has appeared in a -next release with no reported issues, and it
picked up an ack from Ingo. There is no pressing need to merge this
in 4.6- rc2. However, if we wait until 4.7 the new memcpy_mcsafe()
capability will ship without a user in 4.6-final"
* 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()
Helge Deller [Wed, 30 Mar 2016 12:14:31 +0000 (14:14 +0200)]
parisc: Fix and enable seccomp filter support
The seccomp filter support requires careful handling of task registers. This
includes reloading of the return value (%r28) and proper syscall exit if
secure_computing() returned -1.
Additionally we need to sign-extend the syscall number from signed 32bit to
signed 64bit in do_syscall_trace_enter() since the ptrace interface only allows
storing 32bit values in compat mode.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v4.5
Helge Deller [Wed, 30 Mar 2016 12:11:50 +0000 (14:11 +0200)]
parisc: Fix SIGSYS signals in compat case
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v4.5
Al Viro [Thu, 31 Mar 2016 04:23:05 +0000 (00:23 -0400)]
fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()"
We should try to trigger automount *before* bailing out on negative dentry.
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reported-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Thu, 31 Mar 2016 01:40:42 +0000 (20:40 -0500)]
Merge tag 'nios2-v4.6-rc2' of git://git./linux/kernel/git/lftan/nios2
Pull nios2 fix from Ley Foon Tan:
"Replace fdt_translate_address with of_flat_dt_translate_address"
Fixes a build failure.
* tag 'nios2-v4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
nios2: Replace fdt_translate_address with of_flat_dt_translate_address
Guenter Roeck [Tue, 29 Mar 2016 09:02:13 +0000 (17:02 +0800)]
nios2: Replace fdt_translate_address with of_flat_dt_translate_address
nios2 builds fail with the following build error.
arch/nios2/kernel/prom.c: In function 'early_init_dt_scan_serial':
arch/nios2/kernel/prom.c:100:2: error:
implicit declaration of function 'fdt_translate_address'
Commit
c90fe9c0394b ("of: earlycon: Move address translation to
of_setup_earlycon()") replaced fdt_translate_address() with
of_flat_dt_translate_address() but missed updating the nios2 code.
Fixes:
c90fe9c0394b ("of: earlycon: Move address translation to of_setup_earlycon()")
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Ley Foon Tan <lftan@altera.com>
Linus Torvalds [Wed, 30 Mar 2016 18:28:34 +0000 (13:28 -0500)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This fixes a bug in pkcs7_validate_trust and its users where the
output value may in fact be taken from uninitialised memory"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
PKCS#7: pkcs7_validate_trust(): initialize the _trusted output argument
Linus Torvalds [Wed, 30 Mar 2016 18:24:28 +0000 (13:24 -0500)]
Merge tag 'dlm-4.6-fixes' of git://git./linux/kernel/git/teigland/linux-dlm
Pull dlm fix from David Teigland:
"This fixes a bug from the configfs cleanup"
* tag 'dlm-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: config: Fix ENOMEM failures in make_cluster()
Linus Torvalds [Wed, 30 Mar 2016 18:22:47 +0000 (13:22 -0500)]
Merge tag 'hwmon-for-linus-v4.6-rc2' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Fix crash due to NULL pointer access in max1111 driver"
* tag 'hwmon-for-linus-v4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated
Axel Lin [Mon, 21 Mar 2016 12:03:50 +0000 (20:03 +0800)]
gpio: xgene: Prevent NULL pointer dereference
platform_get_resource() can return NULL, thus add NULL test to prevent NULL
pointer dereference.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Axel Lin [Wed, 9 Mar 2016 12:38:55 +0000 (20:38 +0800)]
gpio: menz127: Drop lock field from struct men_z127_gpio
Current code uses a uninitialized spin lock.
bgpio_init() already initialized a spin lock, so let's switch to use
&gc->bgpio_lock instead and remove the lock from struct men_z127_gpio.
Fixes:
f436bc2726c6 "gpio: add driver for MEN 16Z127 GPIO controller"
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Andrew Price [Tue, 22 Mar 2016 17:36:34 +0000 (17:36 +0000)]
dlm: config: Fix ENOMEM failures in make_cluster()
Commit
1ae1602de0 "configfs: switch ->default groups to a linked list"
left the NULL gps pointer behind after removing the kcalloc() call which
made it non-NULL. It also left the !gps check in place so make_cluster()
now fails with ENOMEM. Remove the remaining uses of the gps variable to
fix that.
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Dave Hansen [Mon, 14 Dec 2015 19:06:34 +0000 (11:06 -0800)]
x86/mm/pkeys: Add missing Documentation
Stefan Richter noticed that the X86_INTEL_MEMORY_PROTECTION_KEYS option
in arch/x86/Kconfig references Documentation/x86/protection-keys.txt,
but the file does not exist.
This is a patch merging mishap: the final (v8) version of the pkeys
series did not include the documentation patch 32 and v7 included.
Add it now.
Reported-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20151214190634.426BEE41@viggo.jf.intel.com
[ Added changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Huang Rui [Fri, 25 Mar 2016 02:08:40 +0000 (10:08 +0800)]
x86/cpu: Add advanced power management bits
Bit 11 of CPUID 8000_0007 edx is processor feedback interface.
Bit 12 of CPUID 8000_0007 edx is accumulated power.
Print proper names in proc/cpuinfo
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Cc: Tony Li <tony.li@amd.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Len Brown" <lenb@kernel.org>
Link: http://lkml.kernel.org/r/1458871720-3209-1-git-send-email-ray.huang@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Borislav Petkov [Mon, 28 Mar 2016 18:20:17 +0000 (20:20 +0200)]
x86/thread_info: Merge two !__ASSEMBLY__ sections
We have
#ifndef __ASSEMBLY__
...
#endif
#ifndef __ASSEMBLY__
...
#endif
Merge the two.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1459189217-25532-1-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Vladimir Zapolskiy [Sat, 26 Mar 2016 18:47:00 +0000 (20:47 +0200)]
x86/cpufreq: Remove duplicated TDP MSR macro definitions
The list of CPU model specific registers contains two copies of TDP
registers, remove the one, which is out of numerical order in the
list.
Fixes:
6a35fc2d6c22 ("cpufreq: intel_pstate: get P1 from TAR when available")
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Kristen Carlson
Accardi <kristen@linux.intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: http://lkml.kernel.org/r/1459018020-24577-1-git-send-email-vladimir_zapolskiy@mentor.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Borislav Petkov [Mon, 28 Mar 2016 09:56:09 +0000 (11:56 +0200)]
x86/Documentation: Start documenting x86 topology
This should contain important aspects of how we represent the system
topology on x86. If people have questions about it and this file doesn't
answer it, then it must be updated.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20160328095609.GD26651@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Borislav Petkov [Fri, 25 Mar 2016 14:52:36 +0000 (15:52 +0100)]
x86/cpu: Get rid of compute_unit_id
It is cpu_core_id anyway.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458917557-8757-3-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Fri, 25 Mar 2016 14:52:35 +0000 (15:52 +0100)]
perf/x86/amd: Cleanup Fam10h NB event constraints
Avoid allocating the AMD NB event constraints data structure when not
needed. This gets rid of x86_max_cores usage and avoids allocating
this on AMD Core Perfctr supporting hardware (which has separate MSRs
for NB events).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: aherrmann@suse.com
Cc: Rui Huang <ray.huang@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: jencce.kernel@gmail.com
Link: http://lkml.kernel.org/r/20160320124629.GY6375@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Fri, 25 Mar 2016 14:52:34 +0000 (15:52 +0100)]
x86/topology: Fix AMD core count
It turns out AMD gets x86_max_cores wrong when there are compute
units.
The issue is that Linux assumes:
nr_logical_cpus = nr_cores * nr_siblings
But AMD reports its CU unit as 2 cores, but then sets num_smp_siblings
to 2 as well.
Boris: fixup ras/mce_amd_inj.c too, to compute the Node Base Core
properly, according to the new nomenclature.
Fixes:
1f12e32f4cd5 ("x86/topology: Create logical package id")
Reported-by: Xiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <aherrmann@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dan Williams [Tue, 8 Mar 2016 18:30:19 +0000 (10:30 -0800)]
x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()
Update the definition of memcpy_from_pmem() to return 0 or a negative
error code. Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Linus Torvalds [Mon, 28 Mar 2016 20:17:02 +0000 (15:17 -0500)]
Merge git://git./linux/kernel/git/davem/ide
Pull IDE fixes from David Miller:
"Just two small changes:
1) Remove bogus init annotation in icside, from Arnd Bergmann.
2) Don't use zero clock rates in palm_bk3710 driver, from Wolfram
Sang"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
ide: palm_bk3710: test clock rate to avoid division by 0
ide: icside: remove incorrect initconst annotation
Linus Torvalds [Mon, 28 Mar 2016 20:14:11 +0000 (15:14 -0500)]
Merge git://git./linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
"Minor typing cleanup from Joe Perches, and some comment typo fixes
from Adam Buchbinder"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: Convert naked unsigned uses to unsigned int
sparc: Fix misspellings in comments.
Linus Torvalds [Mon, 28 Mar 2016 20:04:55 +0000 (15:04 -0500)]
Merge git://git./linux/kernel/git/cmetcalf/linux-tile
Pull arch/tile bugfixes from Chris Metcalf:
"These include updates to MAINTAINERS, some comment spelling fixes, and
a bugfix to the tile kgdb.c support"
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: Fix misspellings in comments.
MAINTAINERS: update web link for tile architecture
MAINTAINERS: update arch/tile maintainer email domain
tile kgdb: fix bug in copy to gdb regs, and optimize memset
Guenter Roeck [Sat, 26 Mar 2016 19:28:05 +0000 (12:28 -0700)]
hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated
arm:pxa_defconfig can result in the following crash if the max1111 driver
is not instantiated.
Unhandled fault: page domain fault (0x01b) at 0x00000000
pgd =
c0004000
[
00000000] *pgd=
00000000
Internal error: : 1b [#1] PREEMPT ARM
Modules linked in:
CPU: 0 PID: 300 Comm: kworker/0:1 Not tainted
4.5.0-01301-g1701f680407c #10
Hardware name: SHARP Akita
Workqueue: events sharpsl_charge_toggle
task:
c390a000 ti:
c391e000 task.ti:
c391e000
PC is at max1111_read_channel+0x20/0x30
LR is at sharpsl_pm_pxa_read_max1111+0x2c/0x3c
pc : [<
c03aaab0>] lr : [<
c0024b50>] psr:
20000013
...
[<
c03aaab0>] (max1111_read_channel) from [<
c0024b50>]
(sharpsl_pm_pxa_read_max1111+0x2c/0x3c)
[<
c0024b50>] (sharpsl_pm_pxa_read_max1111) from [<
c00262e0>]
(spitzpm_read_devdata+0x5c/0xc4)
[<
c00262e0>] (spitzpm_read_devdata) from [<
c0024094>]
(sharpsl_check_battery_temp+0x78/0x110)
[<
c0024094>] (sharpsl_check_battery_temp) from [<
c0024f9c>]
(sharpsl_charge_toggle+0x48/0x110)
[<
c0024f9c>] (sharpsl_charge_toggle) from [<
c004429c>]
(process_one_work+0x14c/0x48c)
[<
c004429c>] (process_one_work) from [<
c0044618>] (worker_thread+0x3c/0x5d4)
[<
c0044618>] (worker_thread) from [<
c004a238>] (kthread+0xd0/0xec)
[<
c004a238>] (kthread) from [<
c000a670>] (ret_from_fork+0x14/0x24)
This can occur because the SPI controller driver (SPI_PXA2XX) is built as
module and thus not necessarily loaded. While building SPI_PXA2XX into the
kernel would make the problem disappear, it appears prudent to ensure that
the driver is instantiated before accessing its data structures.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Linus Torvalds [Sat, 26 Mar 2016 23:03:24 +0000 (16:03 -0700)]
Linux 4.6-rc1
Linus Torvalds [Sat, 26 Mar 2016 22:53:16 +0000 (15:53 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Linus Torvalds [Sat, 26 Mar 2016 19:59:04 +0000 (12:59 -0700)]
Merge tag 'ofs-pull-tag-1' of git://git./linux/kernel/git/hubcap/linux
Pull orangefs filesystem from Mike Marshall.
This finally merges the long-pending orangefs filesystem, which has been
much cleaned up with input from Al Viro over the last six months. From
the documentation file:
"OrangeFS is an LGPL userspace scale-out parallel storage system. It
is ideal for large storage problems faced by HPC, BigData, Streaming
Video, Genomics, Bioinformatics.
Orangefs, originally called PVFS, was first developed in 1993 by Walt
Ligon and Eric Blumer as a parallel file system for Parallel Virtual
Machine (PVM) as part of a NASA grant to study the I/O patterns of
parallel programs.
Orangefs features include:
- Distributes file data among multiple file servers
- Supports simultaneous access by multiple clients
- Stores file data and metadata on servers using local file system
and access methods
- Userspace implementation is easy to install and maintain
- Direct MPI support
- Stateless"
see Documentation/filesystems/orangefs.txt for more in-depth details.
* tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
orangefs: fix orangefs_superblock locking
orangefs: fix do_readv_writev() handling of error halfway through
orangefs: have ->kill_sb() evict the VFS side of things first
orangefs: sanitize ->llseek()
orangefs-bufmap.h: trim unused junk
orangefs: saner calling conventions for getting a slot
orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
orangefs: get rid of readdir_handle_s
ornagefs: ensure that truncate has an up to date inode size
orangefs: move code which sets i_link to orangefs_inode_getattr
orangefs: remove needless wrapper around GFP_KERNEL
orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
orangefs: refactor inode type or link_target change detection
orangefs: use new getattr for revalidate and remove old getattr
orangefs: use new getattr in inode getattr and permission
orangefs: use new orangefs_inode_getattr to get size in write and llseek
orangefs: use new orangefs_inode_getattr to create new inodes
orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
orangefs: remove inode->i_lock wrapper
orangefs: put register_chrdev immediately before register_filesystem
...
Linus Torvalds [Sat, 26 Mar 2016 18:37:42 +0000 (11:37 -0700)]
Merge tag 'ntb-4.6' of git://github.com/jonmason/ntb
Pull NTB bug fixes from Jon Mason:
"NTB bug fixes for tasklet from spinning forever, link errors,
translation window setup, NULL ptr dereference, and ntb-perf errors.
Also, a modification to the driver API that makes _addr functions
optional"
* tag 'ntb-4.6' of git://github.com/jonmason/ntb:
NTB: Remove _addr functions from ntb_hw_amd
NTB: Make _addr functions optional in the API
NTB: Fix incorrect clean up routine in ntb_perf
NTB: Fix incorrect return check in ntb_perf
ntb: fix possible NULL dereference
ntb: add missing setup of translation window
ntb: stop link work when we do not have memory
ntb: stop tasklet from spinning forever during shutdown.
ntb: perf test: fix address space confusion
Linus Torvalds [Sat, 26 Mar 2016 18:31:01 +0000 (11:31 -0700)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"The only new stuff which missed the first pull request is an update to
the UFS driver.
The rest is an assortment of bug fixes and minor tweaks which appeared
recently (some are fixes for recent code and some are stuff spotted
recently by the checkers or the new gcc-6 compiler [most of Arnd's
stuff])"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
scsi_common: do not clobber fixed sense information
scsi: ufs: select CONFIG_NLS
scsi: fc: use get/put_unaligned64 for wwn access
fnic: move printk()s outside of the critical code section.
qla2xxx: avoid maybe_uninitialized warning
megaraid_sas: add missing curly braces in ioctl handler
lpfc: fix misleading indentation
scsi_transport_sas: add 'scsi_target_id' sysfs attribute
scsi_dh_alua: uninitialized variable in alua_check_vpd()
scsi: ufs-qcom: add printouts of testbus debug registers
scsi: ufs-qcom: enable/disable the device ref clock
scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
scsi: ufs: add device quirk delay before putting UFS rails in LPM
scsi: ufs: fix leakage during link off state
scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
scsi: ufs: handle non spec compliant bkops behaviour by device
scsi: ufs: add retry for query descriptors
scsi: ufs: add error recovery after DL NAC error
scsi: ufs: make error handling bit faster
scsi: ufs: disable vccq if it's not needed by UFS device
...
Linus Torvalds [Sat, 26 Mar 2016 17:13:05 +0000 (10:13 -0700)]
f2fs/crypto: fix xts_tweak initialization
Commit
0b81d07790726 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
to just "FS" for preprocessor symbols).
Because of the symbol renaming, it's a bit hard to see it as a file
move: use
git show -M30
0b81d07790726
to lower the rename detection to just 30% similarity and make git show
the files as renamed (the header file won't be shown as a rename even
then - since all it contains is symbol definitions, it looks almost
completely different).
Even with the renames showing as renames, the diffs are not all that
easy to read, since so much is just the renames. But Eric Biggers
noticed that it's not just all renames: the initialization of the
xts_tweak had been broken too, using the inode number rather than the
page offset.
That's not right - it makes the xfs_tweak the same for all pages of each
inode. It _might_ make sense to make the xfs_tweak contain both the
offset _and_ the inode number, but not just the inode number.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allen Hubbe [Mon, 21 Mar 2016 08:53:14 +0000 (04:53 -0400)]
NTB: Remove _addr functions from ntb_hw_amd
Kernel zero day testing warned about address space confusion. A virtual
iomem address was used where a physical address is expected. The
offending functions implement an optional part of the api, so they are
removed. They can be added later, after testing.
Fixes:
a1b3695820aa490e58915d720a1438069813008b
Signed-off-by: Allen Hubbe <Allen.Hubbe@emc.com>
Acked-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Al Viro [Fri, 25 Mar 2016 23:56:34 +0000 (19:56 -0400)]
orangefs: fix orangefs_superblock locking
* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
* remove from the list _before_ orangefs_unmount() - request_mutex
in the latter will make sure that nothing observed in the loop in
ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
of loop
* on removal, keep the forward pointer and zero the back one. That
way we can drop and regain the spinlock in the loop body (again,
ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
rest of the list.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 01:15:43 +0000 (20:15 -0500)]
orangefs: fix do_readv_writev() handling of error halfway through
Error should only be returned if nothing had been read/written.
Otherwise we need to report a short read/write instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 02:08:29 +0000 (21:08 -0500)]
orangefs: have ->kill_sb() evict the VFS side of things first
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 01:25:19 +0000 (20:25 -0500)]
orangefs: sanitize ->llseek()
a) open files can't have NULL inodes
b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
c) make_bad_inode() on lseek()?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 01:12:04 +0000 (20:12 -0500)]
orangefs-bufmap.h: trim unused junk
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 01:10:26 +0000 (20:10 -0500)]
orangefs: saner calling conventions for getting a slot
just have it return the slot number or -E... - the caller checks
the sign anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 01:06:19 +0000 (20:06 -0500)]
orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
it's always __orangefs_bufmap
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro [Wed, 17 Feb 2016 00:54:13 +0000 (19:54 -0500)]
orangefs: get rid of readdir_handle_s
no point, really - we couldn't keep those across the calls of
getdents(); it would be too easy to DoS, having all slots exhausted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Linus Torvalds [Fri, 25 Mar 2016 23:59:11 +0000 (16:59 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge fourth patch-bomb from Andrew Morton:
"A lot more stuff than expected, sorry. A bunch of ocfs2 reviewing was
finished off.
- mhocko's oom-reaper out-of-memory-handler changes
- ocfs2 fixes and features
- KASAN feature work
- various fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (42 commits)
thp: fix typo in khugepaged_scan_pmd()
MAINTAINERS: fill entries for KASAN
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2
mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
mm, kasan: add GFP flags to KASAN API
mm, kasan: SLAB support
kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()
include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()
mm/page_alloc: prevent merging between isolated and other pageblocks
drivers/memstick/host/r592.c: avoid gcc-6 warning
ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
ocfs2: solve a problem of crossing the boundary in updating backups
ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
ocfs2/dlm: fix race between convert and recovery
ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
...
Linus Torvalds [Fri, 25 Mar 2016 23:55:37 +0000 (16:55 -0700)]
Merge tag 'pm+acpi-4.6-rc1-3' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixlet from Rafael Wysocki:
"One of commits in my previous pull request changed the permissions of
drivers/power/avs/rockchip-io-domain.c to executable by mistake"
* tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Fix permissions of drivers/power/avs/rockchip-io-domain.c
Linus Torvalds [Fri, 25 Mar 2016 23:48:45 +0000 (16:48 -0700)]
Merge tag 'please-pull-preadv2' of git://git./linux/kernel/git/aegl/linux
Pull ia64 update from Tony Luck:
"Wire up new system calls p{read,write}v2 for ia64"
* tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
[IA64] Enable preadv2 and pwritev2 syscalls for ia64
Linus Torvalds [Fri, 25 Mar 2016 23:39:05 +0000 (16:39 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull more input updates from Dmitry Torokhov:
"Second round of updates for the input subsystem.
The BYD PS/2 protocol driver now uses absolute reporting mode and
should behave more like other touchpads; Synaptics driver needed to
extend one of its quirks to a newer firmware version, and a few USB
drivers got tightened up checks for the contents of their descriptors"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: sur40 - fix DMA on stack
Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
Input: synaptics - handle spurious release of trackstick buttons, again
Input: synaptics-rmi4 - remove check of Non-NULL array
Input: byd - enable absolute mode
Input: ims-pcu - sanity check against missing interfaces
Input: melfas_mip4 - add hw_version sysfs attribute
Kirill A. Shutemov [Fri, 25 Mar 2016 21:22:20 +0000 (14:22 -0700)]
thp: fix typo in khugepaged_scan_pmd()
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Fri, 25 Mar 2016 21:22:17 +0000 (14:22 -0700)]
MAINTAINERS: fill entries for KASAN
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolai Stange [Fri, 25 Mar 2016 21:22:14 +0000 (14:22 -0700)]
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.
Since roundup_pow_of_two(0) is undefined, UBSAN reports
UBSAN: Undefined behaviour in include/linux/log2.h:63:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-
20160318+ #14
[...]
Call Trace:
[...]
[<
ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
[<
ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
[<
ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
[<
ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
[<
ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
[<
ffffffff813cc955>] generic_file_read_iter+0x185/0x420
[...]
[<
ffffffff81510b06>] __vfs_read+0x256/0x3d0
[...]
when get_init_ra_size() gets called from ondemand_readahead().
The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since
1UL << (sizeof(unsigned long) * 8)
evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.
What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().
According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:22:11 +0000 (14:22 -0700)]
kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:22:08 +0000 (14:22 -0700)]
mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot
will allow KASAN store allocation/deallocation stack traces for memory
chunks. The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.
IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.
Right now stackdepot support is only enabled in SLAB allocator. Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.
This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.
Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.
[akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabinin@virtuozzo.com: comment style fixes]
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:22:05 +0000 (14:22 -0700)]
arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.
Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:22:02 +0000 (14:22 -0700)]
mm, kasan: add GFP flags to KASAN API
Add GFP flags to KASAN hooks for future patches to use.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:21:59 +0000 (14:21 -0700)]
mm, kasan: SLAB support
Add KASAN hooks to SLAB allocator.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 25 Mar 2016 21:21:56 +0000 (14:21 -0700)]
kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()
This patchset implements SLAB support for KASAN
Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap
objects, therefore we reimplement this feature in mm/kasan/stackdepot.c.
The intention is to ultimately switch SLUB to use this implementation as
well, which will save a lot of memory (right now SLUB bloats each object
by 256 bytes to store the allocation/deallocation stacks).
Also neither SLUB nor SLAB delay the reuse of freed memory chunks, which
is necessary for better detection of use-after-free errors. We
introduce memory quarantine (mm/kasan/quarantine.c), which allows
delayed reuse of deallocated memory.
This patch (of 7):
Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as
the test only checks the page allocator functionality. Also reimplement
kmalloc_large_oob_right() so that the test allocates a large enough
chunk of memory that still does not trigger the page allocator fallback.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 25 Mar 2016 21:21:53 +0000 (14:21 -0700)]
include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()
A leftover from commit
c32b3cbe0d06 ("oom, PM: make OOM detection in the
freezer path raceless").
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 25 Mar 2016 21:21:50 +0000 (14:21 -0700)]
mm/page_alloc: prevent merging between isolated and other pageblocks
Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:
> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree:
6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal:
16342016 kB
> MemFree:
22367268 kB
> MemAvailable:
22370528 kB
Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:
> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partialy isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
> MIGRATE_ISOLATE, respectively.
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.
This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
but since it doesn't hold the zone->lock between pageblocks, a race
window does exist.
It's also likely that unexpected merging can occur between
MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
__free_one_page() since commit
3c605096d315 ("mm/page_alloc: restrict
max order of merging on isolated pageblock"). However, we only check
the migratetype of the pageblock where buddy merging has been initiated,
not the migratetype of the buddy pageblock (or group of pageblocks)
which can be MIGRATE_ISOLATE.
Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath
and bloat-o-meter has shown significant code bloat (the function is
inline).
This patch reduces the bloat at some expense of more complicated code.
The buddy-merging while-loop in __free_one_page() is initially bounded
to pageblock_border and without any migratetype checks. The checks are
placed outside, bumping the max_order if merging is allowed, and
returning to the while-loop with a statement which can't be possibly
considered harmful.
This fixes the accounting bug and also removes the arguably weird state
in the original commit
3c605096d315 where buddies could be left
unmerged.
Fixes:
3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Link: https://lkml.org/lkml/2016/3/2/280
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Debugged-by: Laura Abbott <labbott@redhat.com>
Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 25 Mar 2016 21:21:47 +0000 (14:21 -0700)]
drivers/memstick/host/r592.c: avoid gcc-6 warning
The r592 driver relies on behavior of the DMA mapping API that is
normally observed but not guaranteed by the API. Instead it uses a
runtime check to fail transfers if the API ever behaves
When CONFIG_NEED_SG_DMA_LENGTH is not set, one of the checks turns into a
comparison of a variable with itself, which gcc-6.0 now warns about:
drivers/memstick/host/r592.c: In function 'r592_transfer_fifo_dma':
drivers/memstick/host/r592.c:302:31: error: self-comparison always evaluates to false [-Werror=tautological-compare]
(sg_dma_len(&dev->req->sg) < dev->req->sg.length)) {
^
The check itself is not a problem, so this patch just rephrases the
condition in a way that gcc does not consider an indication of a mistake.
We already know that dev->req->sg.length was initially R592_LFIFO_SIZE, so
we can compare it to that constant again.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Quentin Lambert <lambert.quentin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Fri, 25 Mar 2016 21:21:44 +0000 (14:21 -0700)]
ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
Now function ocfs2_replay_truncate_records() first modifies tl_used,
then calls ocfs2_extend_trans() to extend transactions for gd and alloc
inode used for freeing clusters. jbd2_journal_restart() may be called
and it may happen that tl_used in truncate log is decreased but the
clusters are not freed, which means these clusters are lost. So we
should avoid extending transactions in these two operations.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Fri, 25 Mar 2016 21:21:41 +0000 (14:21 -0700)]
ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
I found that jbd2_journal_restart() is called in some places without
keeping things consistently before. However, jbd2_journal_restart() may
commit the handle's transaction and restart another one. If the first
transaction is committed successfully while another not, it may cause
filesystem inconsistency or read only. This is an effort to fix this
kind of problems.
This patch (of 3):
The following functions will be called while truncating an extent:
ocfs2_remove_btree_range
-> ocfs2_start_trans
-> ocfs2_remove_extent
-> ocfs2_truncate_rec
-> ocfs2_extend_rotate_transaction
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_rotate_tree_left
-> ocfs2_remove_rightmost_path
-> ocfs2_extend_rotate_transaction
-> ocfs2_unlink_subtree
-> ocfs2_update_edge_lengths
-> ocfs2_extend_trans
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_et_update_clusters
-> ocfs2_commit_trans
jbd2_journal_restart() may be called and it may happened that the buffers
dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in
ocfs2_et_update_clusters() are not, the total clusters on extent tree and
i_clusters in ocfs2_dinode is inconsistency. So the clusters got from
ocfs2_dinode is incorrect, and it also cause read-only problem when call
ocfs2_commit_truncate() with the error message: "Inode %llu has empty
extent block at %llu".
We should extend enough credits for function ocfs2_remove_rightmost_path
and ocfs2_update_edge_lengths to avoid this inconsistency.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Fri, 25 Mar 2016 21:21:38 +0000 (14:21 -0700)]
ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
We have found a bug when two nodes doing umount one after another.
1) Node 1 migrate a lockres that has 3 locks in grant queue such as
N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock
N3(NL) and N4(PR) are empty on node 2 because migration target do not
copy lvb to these two lock.
2) Node 3 want to convert to PR, it can be granted in
__dlmconvert_master(), and the order of these locks is unchanged. The
lvb of the lock N3(PR) on node 2 is copyed from lockres in function
dlm_update_lvb() while the lvb of lock N4(PR) is still empty.
3) Node 2 want to leave domain, it will migrate this lockres to node 3.
Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration()
when adding the lock N4(PR) to mres with the following message because
the lvb of mres is already copied from lock N3(PR), but the lvb of lock
N4(PR) is empty.
"Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
we assume that lun will be resized to 1TB(cluster_size is 32kb), it will
include 0~
33554431 cluster, in update_backups func, it will backup super
block in location of 1TB which is the 33554432th cluster, so the
phenomenon of crossing the boundary happens.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jiangyiwen [Fri, 25 Mar 2016 21:21:32 +0000 (14:21 -0700)]
ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
This patch fixes a deadlock, as follows:
Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted
2) start to mount b start to mount a
3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++
4) -------------------- all nodes' network down --------------------
5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.
6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.
Because all works in the ocfs2_wq are relative to the super block.
The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 25 Mar 2016 21:21:29 +0000 (14:21 -0700)]
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
When master handles convert request, it queues ast first and then
returns status. This may happen that the ast is sent before the request
status because the above two messages are sent by two threads. And
right after the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 25 Mar 2016 21:21:26 +0000 (14:21 -0700)]
ocfs2/dlm: fix race between convert and recovery
There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.
dlmconvert_remote
{
spin_lock(&res->spinlock);
list_move_tail(&lock->list, &res->converting);
lock->convert_pending = 1;
spin_unlock(&res->spinlock);
status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.
spin_lock(&res->spinlock);
lock->convert_pending = 0;
if (status != DLM_NORMAL)
dlm_revert_pending_convert(res, lock);
spin_unlock(&res->spinlock);
}
In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set
(res is still in recovering) or res master changed (new master has
finished recovery), reset the status to DLM_RECOVERING, then it will
retry convert.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:23 +0000 (14:21 -0700)]
ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
The code should call ocfs2_free_alloc_context() to free meta_ac &
data_ac before calling ocfs2_run_deallocs(). Because
ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by
meta_ac. So try to release the lock before ocfs2_run_deallocs().
Fixes:
af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:20 +0000 (14:21 -0700)]
ocfs2: fix disk file size and memory file size mismatch
When doing append direct write in an already allocated cluster, and fast
path in ocfs2_dio_get_block() is triggered, function
ocfs2_dio_end_io_write() will be skipped as there is no context
allocated.
As a result, the disk file size will not be changed as it should be.
The solution is to skip fast path when we are about to change file size.
Fixes:
af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:18 +0000 (14:21 -0700)]
ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write
Take ip_alloc_sem to prevent concurrent access to extent tree, which may
cause the extent tree in an unstable state.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:15 +0000 (14:21 -0700)]
ocfs2: fix ip_unaligned_aio deadlock with dio work queue
In the current implementation of unaligned aio+dio, lock order behave as
follow:
in user process context:
-> call io_submit()
-> get i_mutex
<== window1
-> get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() return
in dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
<== window2
-> get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutex
There is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there
is a user process in the 'window1' stage, the system will became
deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio
lock, while there is a direct bio hold ip_unaligned_aio mutex who is
waiting for a dio work queue thread to be schedule. But all the dio
work queue thread is waiting for i_mutex lock in 'window2'.
This case only happened in a test which send a large number(more than
256) of aio at one io_submit() call.
My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.
[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:12 +0000 (14:21 -0700)]
ocfs2: code clean up for direct io
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
* remove append dio check: it will be checked in ocfs2_direct_IO()
* remove file hole check: file hole is supported for now
* remove inline data check: it will be checked in ocfs2_direct_IO()
* remove the full_coherence check when append dio: we will get the
inode_lock in ocfs2_dio_get_block, there is no need to fall back to
buffer io to ensure the coherence semantics.
Now the drop dio procedure is gone. :)
[akpm@linux-foundation.org: remove unused label]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:09 +0000 (14:21 -0700)]
ocfs2: fix sparse file & data ordering issue in direct io
There are mainly three issues in the direct io code path after commit
24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"):
* Does not support sparse file.
* Does not support data ordering. eg: when write to a file hole, it
will alloc extent first. If system crashed before io finished, data
will corrupt.
* Potential risk when doing aio+dio. The -EIOCBQUEUED return value is
likely to be ignored by ocfs2_direct_IO_write().
To resolve above problems, re-design direct io code with following ideas:
* Use buffer io to fill in holes. And this will make better
performance also.
* Clear unwritten after direct write finished. So we can make sure
meta data changes after data write to disk. (Unwritten extent is
invisible to user, from user's view, meta data is not changed when
allocate an unwritten extent.)
* Clear ocfs2_direct_IO_write(). Do all ending work in end_io.
This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
test cases of ltp.
For performance improvement, see following test result:
ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
The original way:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=
1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
After this patch:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=
1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:06 +0000 (14:21 -0700)]
ocfs2: record UNWRITTEN extents when populate write desc
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is still one issue in the direct write procedure.
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.
To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.
This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:03 +0000 (14:21 -0700)]
ocfs2: return the physical address in ocfs2_write_cluster
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io needs to get the physical address from write_begin, to map the
user page. This patch is to change the arg 'phys' of
ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin.
And we can retrieve it to the direct io procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:21:01 +0000 (14:21 -0700)]
ocfs2: do not change i_size in write_end for direct io
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Append direct io do not change i_size in get block phase. It only move
to orphan when starting write. After data is written to disk, it will
delete itself from orphan and update i_size. So skip i_size change
section in write_begin for direct io.
And when there is no extents alloc, no meta data changes needed for
direct io (since write_begin start trans for 2 reason: alloc extents &
change i_size. Now none of them needed). So we can skip start trans
procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:20:58 +0000 (14:20 -0700)]
ocfs2: test target page before change it
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io data will not appear in buffer. The w_target_page member will
not be filled by direct io. So avoid to use it when it's NULL. Unlinke
buffer io and mmap, direct io will call write_begin with more than 1
page a time. So the target_index is not sufficient to describe the
actual data. change it to a range start at target_index, end in
end_index.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:20:55 +0000 (14:20 -0700)]
ocfs2: use c_new to indicate newly allocated extents
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is a problem in ocfs2's direct io implement: if system crashed
after extents allocated, and before data return, we will get a extent
with dirty data on disk. This problem violate the journal=order
semantics, which means meta changes take effect after data written to
disk. To resolve this issue, direct write can use the UNWRITTEN flag to
describe a extent during direct data writeback. The direct write
procedure should act in the following order:
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
This patch is to change the 'c_unwritten' member of
ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear
the unwritten flag. It do not care if a extent is allocated or not.
And use 'c_new' to specify a newly allocated extent. So the direct io
procedure can use c_clear_unwritten to control the UNWRITTEN bit on
extent.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Ding [Fri, 25 Mar 2016 21:20:52 +0000 (14:20 -0700)]
ocfs2: add ocfs2_write_type_t type to identify the caller of write
Patchset: fix ocfs2 direct io code patch to support sparse file and data
ordering semantics
The idea is to use buffer io(more precisely use the interface
ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
beyond block size. And clear UNWRITTEN flag until direct io data has
been written to disk, which can prevent data corruption when system
crashed during direct write.
And we will also archive a better performance: eg. dd direct write new
file with block size 4KB: before this patchset:
2.5 MB/s
after this patchset:
66.4 MB/s
This patch (of 8):
To support direct io in ocfs2_write_begin_nolock &
ocfs2_write_end_nolock.
Remove unused args filp & flags. Add new arg type. The type is one of
buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type
has implemented. direct type will be implemented later.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Fri, 25 Mar 2016 21:20:50 +0000 (14:20 -0700)]
ocfs2: o2hb: fix double free bug
This is a regression issue and caused the following kernel panic when do
ocfs2 multiple test.
BUG: unable to handle kernel paging request at
00000002000800c0
IP: [<
ffffffff81192978>] kmem_cache_alloc+0x78/0x160
PGD
7bbe5067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-
20160225 #1
Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
task:
ffff88007a521a80 ti:
ffff88007aed0000 task.ti:
ffff88007aed0000
RIP: 0010:[<
ffffffff81192978>] [<
ffffffff81192978>] kmem_cache_alloc+0x78/0x160
RSP: 0018:
ffff88007aed3a48 EFLAGS:
00010282
RAX:
0000000000000000 RBX:
0000000000000000 RCX:
0000000000001991
RDX:
0000000000001990 RSI:
00000000024000c0 RDI:
000000000001b330
RBP:
ffff88007aed3a98 R08:
ffff88007d29b330 R09:
00000002000800c0
R10:
0000000c51376d87 R11:
ffff8800792cac38 R12:
ffff88007cc30f00
R13:
00000000024000c0 R14:
ffffffff811b053f R15:
ffff88007aed3ce7
FS:
0000000000000000(0000) GS:
ffff88007d280000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000002000800c0 CR3:
000000007aeb2000 CR4:
00000000000406e0
Call Trace:
__d_alloc+0x2f/0x1a0
d_alloc+0x17/0x80
lookup_dcache+0x8a/0xc0
path_openat+0x3c3/0x1210
do_filp_open+0x80/0xe0
do_sys_open+0x110/0x200
SyS_open+0x19/0x20
do_syscall_64+0x72/0x230
entry_SYSCALL64_slow_path+0x25/0x25
Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
RIP kmem_cache_alloc+0x78/0x160
CR2:
00000002000800c0
---[ end trace
823969e602e4aaac ]---
Fixes:
a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 25 Mar 2016 21:20:47 +0000 (14:20 -0700)]
drivers/input: eliminate INPUT_COMPAT_TEST macro
INPUT_COMPAT_TEST became much simpler after commit
f4056b52845283
("input: redefine INPUT_COMPAT_TEST as in_compat_syscall()") so we can
cleanly eliminate it altogether.
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 25 Mar 2016 21:20:44 +0000 (14:20 -0700)]
oom, oom_reaper: protect oom_reaper_list using simpler way
"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
by simply checking tsk->oom_reaper_list != NULL.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:41 +0000 (14:20 -0700)]
oom: make oom_reaper freezable
After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target task
after it is done. This might however race with the PM freezer:
CPU0 CPU1 CPU2
freeze_processes
try_to_freeze_tasks
# Allocation request
out_of_memory
oom_killer_disable
wake_oom_reaper(P1)
__oom_reap_task
exit_oom_victim(P1)
wait_event(oom_victims==0)
[...]
do_exit(P1)
perform IO/interfere with the freezer
which breaks the oom_killer_disable semantic. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are frozen already.
In order to close this race, make the oom_reaper thread freezable. This
will work because
a) already running oom_reaper will block freezer to enter the
quiescent state
b) wake_oom_reaper will not wake up the reaper after it has been
frozen
c) the only way to call exit_oom_victim after try_to_freeze_tasks
is from the oom victim's context when we know the further
interference shouldn't be possible
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Fri, 25 Mar 2016 21:20:39 +0000 (14:20 -0700)]
oom: make oom_reaper_list single linked
Entries are only added/removed from oom_reaper_list at head so we can
use a single linked list and hence save a word in task_struct.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:36 +0000 (14:20 -0700)]
oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
the same task multiple times - e.g. by sacrificing the same child
multiple times.
This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:33 +0000 (14:20 -0700)]
mm, oom_reaper: implement OOM victims queuing
wake_oom_reaper has allowed only 1 oom victim to be queued. The main
reason for that was the simplicity as other solutions would require some
way of queuing. The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck
out_of_memory
wake_oom_reaper
cmpxchg // OK
oom_reaper
oom_reap_task
__oom_reap_task
oom_victim terminates
atomic_inc_not_zero // fail
out_of_memory
wake_oom_reaper
cmpxchg // fails
task_to_reap = NULL
This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible. E.g. the original victim
might have not released a lot of memory for some reason.
The situation would improve considerably if wake_oom_reaper used a more
robust queuing. This is what this patch implements. This means adding
oom_reaper_list list_head into task_struct (eat a hole before embeded
thread_struct for that purpose) and a oom_reaper_lock spinlock for
queuing synchronization. wake_oom_reaper will then add the task on the
queue and oom_reaper will dequeue it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:30 +0000 (14:20 -0700)]
mm, oom_reaper: report success/failure
Inform about the successful/failed oom_reaper attempts and dump all the
held locks to tell us more who is blocking the progress.
[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:27 +0000 (14:20 -0700)]
oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 Mar 2016 21:20:24 +0000 (14:20 -0700)]
mm, oom: introduce oom reaper
This patch (of 5):
This is based on the idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.
The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory. Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.
It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event (e.g.
lock) which is blocked by another task looping in the page allocator.
This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned
by the oom victim under an assumption that such a memory won't be needed
when its owner is killed and kicked from the userspace anyway. There is
one notable exception to this, though, if the OOM victim was in the
process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.
A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.
oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between. Users of
mmap_sem which need it for write should be carefully reviewed to use
_killable waiting as much as possible and reduce allocations requests
done with the lock held to absolute minimum to reduce the risk even
further.
The API between oom killer and oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
NULL->mm transition and oom_reaper clear this atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
the time. As the operation is potentially disruptive we are trying to
limit it to the ncessary minimum and the reaper blocks any updates while
it operates on an mm. mm_struct is pinned by mm_count to allow parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 25 Mar 2016 21:20:21 +0000 (14:20 -0700)]
sched: add schedule_timeout_idle()
This will be needed in the patch "mm, oom: introduce oom reaper".
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tony Luck [Fri, 25 Mar 2016 21:37:32 +0000 (14:37 -0700)]
[IA64] Enable preadv2 and pwritev2 syscalls for ia64
New system calls added in:
f17d8b35452cab31a70d224964cd583fb2845449
vfs: vfs: Define new syscalls preadv2,pwritev2
Signed-off-by: Tony Luck <tony.luck@intel.com>
Rafael J. Wysocki [Fri, 25 Mar 2016 21:33:10 +0000 (22:33 +0100)]
Fix permissions of drivers/power/avs/rockchip-io-domain.c
The permissions of this file were modified by commit (
f447671b9e4f PM /
AVS: rockchip-io: add io selectors and supplies for rk3399) by mistake,
so fix them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Geliang Tang [Sun, 13 Mar 2016 07:18:39 +0000 (15:18 +0800)]
libceph: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Geliang Tang [Sun, 13 Mar 2016 07:26:29 +0000 (15:26 +0800)]
ceph: use kmem_cache_zalloc
Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Geliang Tang [Sun, 13 Mar 2016 07:17:32 +0000 (15:17 +0800)]
rbd: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Yan, Zheng [Thu, 17 Mar 2016 06:41:59 +0000 (14:41 +0800)]
ceph: use lookup request to revalidate dentry
If dentry has no lease, ceph_d_revalidate() previously return 0.
This causes VFS to invalidate the dentry and create a new dentry
for later lookup. Invalidating a dentry also detach any underneath
mount points. So mount point inside cephfs can disapear mystically
(even the mount point is not modified by other hosts).
The fix is using lookup request to revalidate dentry without lease.
This can partly solve the mount points disapear issue (as long as
the mount point is not modified by other hosts)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 16 Mar 2016 08:40:23 +0000 (16:40 +0800)]
ceph: kill ceph_get_dentry_parent_inode()
use vfs helper dget_parent() instead
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Mon, 7 Mar 2016 02:34:50 +0000 (10:34 +0800)]
ceph: fix security xattr deadlock
When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.
The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)
Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Sat, 12 Mar 2016 05:32:16 +0000 (13:32 +0800)]
ceph: don't request vxattrs from MDS
It's uselese because MDS reply does not carry any vxattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>