GitHub/moto-9609/android_kernel_motorola_exynos9610.git
9 years agos390/smp: reduce size of struct pcpu
Heiko Carstens [Wed, 11 Feb 2015 11:31:03 +0000 (12:31 +0100)]
s390/smp: reduce size of struct pcpu

Reduce the size of struct pcpu, since the pcpu_devices array consists
of NR_CPUS elements of type struct pcpu. For most machines this is just
a waste of memory.
So let's try to make it a bit smaller.
This saves 16k with performance_defconfig.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
9 years agos390/topology: convert cpu_topology array to per cpu variable
Heiko Carstens [Wed, 4 Feb 2015 13:48:25 +0000 (14:48 +0100)]
s390/topology: convert cpu_topology array to per cpu variable

Convert the per cpu topology cpu masks to a per cpu variable.
At least for machines which do have less possible cpus than NR_CPUS this can
save a bit of memory (z/VM: max 64 vs 512 for performance_defconfig).

This reduces the kernel image size by 100k.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
9 years agos390/topology: delay initialization of topology cpu masks
Heiko Carstens [Wed, 4 Feb 2015 13:21:31 +0000 (14:21 +0100)]
s390/topology: delay initialization of topology cpu masks

There is no reason to initialize the topology cpu masks already while
setup_arch() is being called. It is sufficient to initialize the masks
before the scheduler becomes SMP aware.
Therefore a pre-SMP initcall aka early_initcall is suffucient.

This also allows to convert the cpu_topology array into a per cpu
variable with a later patch. Without this patch this wouldn't be
possible since the per cpu memory areas are not allocated while setup_arch
is executed.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
9 years agos390/vdso: fix clock_gettime for CLOCK_THREAD_CPUTIME_ID, -2 and -3
Martin Schwidefsky [Tue, 27 Jan 2015 14:15:13 +0000 (15:15 +0100)]
s390/vdso: fix clock_gettime for CLOCK_THREAD_CPUTIME_ID, -2 and -3

Git commit 8d8f2e18a6dbd3d09dd918788422e6ac8c878e96
"s390/vdso: ectg gettime support for CLOCK_THREAD_CPUTIME_ID"
broke clock_gettime for CLOCK_THREAD_CPUTIME_ID.

Git commit c742b31c03f37c5c499178f09f57381aa6c70131
"fast vdso implementation for CLOCK_THREAD_CPUTIME_ID"
introduced the ECTG for clock id -2. Correct would have been
clock id -3.

Fix the whole mess, CLOCK_THREAD_CPUTIME_ID is based on
CPUCLOCK_SCHED and can not be speed up by the vdso. A speedup
is only available for clock id -3 which is CPUCLOCK_VIRT for
the task currently running on the CPU.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
9 years agoMerge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux...
Linus Torvalds [Thu, 12 Feb 2015 04:25:11 +0000 (20:25 -0800)]
Merge branch 'next' of git://git./linux/kernel/git/jmorris/linux-security

Pull security layer updates from James Morris:
 "Highlights:

   - Smack adds secmark support for Netfilter
   - /proc/keys is now mandatory if CONFIG_KEYS=y
   - TPM gets its own device class
   - Added TPM 2.0 support
   - Smack file hook rework (all Smack users should review this!)"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
  cipso: don't use IPCB() to locate the CIPSO IP option
  SELinux: fix error code in policydb_init()
  selinux: add security in-core xattr support for pstore and debugfs
  selinux: quiet the filesystem labeling behavior message
  selinux: Remove unused function avc_sidcmp()
  ima: /proc/keys is now mandatory
  Smack: Repair netfilter dependency
  X.509: silence asn1 compiler debug output
  X.509: shut up about included cert for silent build
  KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
  MAINTAINERS: email update
  tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
  smack: fix possible use after frees in task_security() callers
  smack: Add missing logging in bidirectional UDS connect check
  Smack: secmark support for netfilter
  Smack: Rework file hooks
  tpm: fix format string error in tpm-chip.c
  char/tpm/tpm_crb: fix build error
  smack: Fix a bidirectional UDS connect check typo
  smack: introduce a special case for tmpfs in smack_d_instantiate()
  ...

9 years agoMerge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Linus Torvalds [Thu, 12 Feb 2015 04:07:47 +0000 (20:07 -0800)]
Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit

Pull audit fix from Paul Moore:
 "Just one patch from the audit tree for v3.20, and a very minor one at
  that.

  The patch simply removes an old, unused field from the audit_krule
  structure, a private audit-only struct.  In audit related news, we did
  a proper overhaul of the audit pathname code and removed the nasty
  getname()/putname() hacks for audit, you should see those patches in
  Al's vfs tree if you haven't already.

  That's it for audit this time, let's hope for a quiet -rcX series"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: remove vestiges of vers_ops

9 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Thu, 12 Feb 2015 02:23:28 +0000 (18:23 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge second set of updates from Andrew Morton:
 "More of MM"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
  mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
  mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
  vmstat: Reduce time interval to stat update on idle cpu
  mm/page_owner.c: remove unnecessary stack_trace field
  Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
  mm: incorporate read-only pages into transparent huge pages
  vmstat: do not use deferrable delayed work for vmstat_update
  mm: more aggressive page stealing for UNMOVABLE allocations
  mm: always steal split buddies in fallback allocations
  mm: when stealing freepages, also take pages created by splitting buddy page
  mincore: apply page table walker on do_mincore()
  mm: /proc/pid/clear_refs: avoid split_huge_page()
  mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
  mempolicy: apply page table walker on queue_pages_range()
  arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
  memcg: cleanup preparation for page table walk
  numa_maps: remove numa_maps->vma
  numa_maps: fix typo in gather_hugetbl_stats
  pagemap: use walk->vma instead of calling find_vma()
  clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
  ...

9 years agoMerge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Linus Torvalds [Thu, 12 Feb 2015 02:15:38 +0000 (18:15 -0800)]
Merge tag 'powerpc-3.20-1' of git://git./linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - Update of all defconfigs

 - Addition of a bunch of config options to modernise our defconfigs

 - Some PS3 updates from Geoff

 - Optimised memcmp for 64 bit from Anton

 - Fix for kprobes that allows 'perf probe' to work from Naveen

 - Several cxl updates from Ian & Ryan

 - Expanded support for the '24x7' PMU from Cody & Sukadev

 - Freescale updates from Scott:
    "Highlights include 8xx optimizations, some more work on datapath
     device tree content, e300 machine check support, t1040 corenet
     error reporting, and various cleanups and fixes"

* tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
  cxl: Add missing return statement after handling AFU errror
  cxl: Fail AFU initialisation if an invalid configuration record is found
  cxl: Export optional AFU configuration record in sysfs
  powerpc/mm: Warn on flushing tlb page in kernel context
  powerpc/powernv: Add OPAL soft-poweroff routine
  powerpc/perf/hv-24x7: Document sysfs event description entries
  powerpc/perf/hv-gpci: add the remaining gpci requests
  powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
  powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
  perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
  perf: add PMU_EVENT_ATTR_STRING() helper
  perf: provide sysfs_show for struct perf_pmu_events_attr
  powerpc/kernel: Avoid initializing device-tree pointer twice
  powerpc: Remove old compile time disabled syscall tracing code
  powerpc/kernel: Make syscall_exit a local label
  cxl: Fix device_node reference counting
  powerpc/mm: bail out early when flushing TLB page
  powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
  perf/powerpc: reset event hw state when adding it to the PMU
  powerpc/qe: Use strlcpy()
  ...

9 years agoMerge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64...
Linus Torvalds [Thu, 12 Feb 2015 02:03:54 +0000 (18:03 -0800)]
Merge tag 'arm64-upstream' of git://git./linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "arm64 updates for 3.20:

   - reimplementation of the virtual remapping of UEFI Runtime Services
     in a way that is stable across kexec
   - emulation of the "setend" instruction for 32-bit tasks (user
     endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
     accordingly)
   - compat_sys_call_table implemented in C (from asm) and made it a
     constant array together with sys_call_table
   - export CPU cache information via /sys (like other architectures)
   - DMA API implementation clean-up in preparation for IOMMU support
   - macros clean-up for KVM
   - dropped some unnecessary cache+tlb maintenance
   - CONFIG_ARM64_CPU_SUSPEND clean-up
   - defconfig update (CPU_IDLE)

  The EFI changes going via the arm64 tree have been acked by Matt
  Fleming.  There is also a patch adding sys_*stat64 prototypes to
  include/linux/syscalls.h, acked by Andrew Morton"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits)
  arm64: compat: Remove incorrect comment in compat_siginfo
  arm64: Fix section mismatch on alloc_init_p[mu]d()
  arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros
  arm64: mm: use *_sect to check for section maps
  arm64: drop unnecessary cache+tlb maintenance
  arm64:mm: free the useless initial page table
  arm64: Enable CPU_IDLE in defconfig
  arm64: kernel: remove ARM64_CPU_SUSPEND config option
  arm64: make sys_call_table const
  arm64: Remove asm/syscalls.h
  arm64: Implement the compat_sys_call_table in C
  syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64
  compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes
  arm64: uapi: expose our struct ucontext to the uapi headers
  smp, ARM64: Kill SMP single function call interrupt
  arm64: Emulate SETEND for AArch32 tasks
  arm64: Consolidate hotplug notifier for instruction emulation
  arm64: Track system support for mixed endian EL0
  arm64: implement generic IOMMU configuration
  arm64: Combine coherent and non-coherent swiotlb dma_ops
  ...

9 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Thu, 12 Feb 2015 01:42:32 +0000 (17:42 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux

Pull s390 updates from Martin Schwidefsky:

 - The remaining patches for the z13 machine support: kernel build
   option for z13, the cache synonym avoidance, SMT support,
   compare-and-delay for spinloops and the CES5S crypto adapater.

 - The ftrace support for function tracing with the gcc hotpatch option.
   This touches common code Makefiles, Steven is ok with the changes.

 - The hypfs file system gets an extension to access diagnose 0x0c data
   in user space for performance analysis for Linux running under z/VM.

 - The iucv hvc console gets wildcard spport for the user id filtering.

 - The cacheinfo code is converted to use the generic infrastructure.

 - Cleanup and bug fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
  s390/process: free vx save area when releasing tasks
  s390/hypfs: Eliminate hypfs interval
  s390/hypfs: Add diagnose 0c support
  s390/cacheinfo: don't use smp_processor_id() in preemptible context
  s390/zcrypt: fixed domain scanning problem (again)
  s390/smp: increase maximum value of NR_CPUS to 512
  s390/jump label: use different nop instruction
  s390/jump label: add sanity checks
  s390/mm: correct missing space when reporting user process faults
  s390/dasd: cleanup profiling
  s390/dasd: add locking for global_profile access
  s390/ftrace: hotpatch support for function tracing
  ftrace: let notrace function attribute disable hotpatching if necessary
  ftrace: allow architectures to specify ftrace compile options
  s390: reintroduce diag 44 calls for cpu_relax()
  s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
  s390/zcrypt: Number of supported ap domains is not retrievable.
  s390/spinlock: add compare-and-delay to lock wait loops
  s390/tape: remove redundant if statement
  s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
  ...

9 years agoMerge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl...
Linus Torvalds [Thu, 12 Feb 2015 01:36:52 +0000 (17:36 -0800)]
Merge tag 'please-pull-pstore' of git://git./linux/kernel/git/aegl/linux

Pull pstore update from Tony Luck:
 "Miscellaneous fs/pstore fixes"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: Fix sprintf format specifier in pstore_dump()
  pstore: Add pmsg - user-space accessible pstore object
  pstore: Handle zero-sized prz in series
  pstore: Remove superfluous memory size check
  pstore: Use scnprintf() in pstore_mkfile()

9 years agoMerge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Thu, 12 Feb 2015 01:14:54 +0000 (17:14 -0800)]
Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights incluse:

  Features:
   - Removing the forced serialisation of open()/close() calls in
     NFSv4.x (x>0) makes for a significant performance improvement in
     metadata intensive workloads.
   - Full support for the pNFS "flexible files" layout type
   - Further RPC/RDMA client improvements from Chuck

  Bugfixes:
   - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
   - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
   - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
     as part of the namespace cleanup,
   - Stable fix: Ensure we reference the inode for return-on-close in
     delegreturn
   - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
     the same source address/port combination during a disconnect/
     reconnect event.  This is a requirement imposed by most NFSv3
     server duplicate reply cache implementations.

  Optimisations:
   - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT

  Other:
   - Add Anna Schumaker as co-maintainer for the NFS client"

* tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
  SUNRPC: Cleanup to remove xs_tcp_close()
  pnfs: delete an unintended goto
  pnfs/flexfiles: Do not dprintk after the free
  SUNRPC: Fix stupid typo in xs_sock_set_reuseport
  SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
  SUNRPC: Handle connection reset more efficiently.
  SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
  SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
  SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
  SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
  SUNRPC: Remove TCP socket linger code
  SUNRPC: Remove TCP client connection reset hack
  SUNRPC: TCP/UDP always close the old socket before reconnecting
  SUNRPC: Add helpers to prevent socket create from racing
  SUNRPC: Ensure xs_reset_transport() resets the close connection flags
  SUNRPC: Do not clear the source port in xs_reset_transport
  SUNRPC: Handle EADDRINUSE on connect
  SUNRPC: Set SO_REUSEPORT socket option for TCP connections
  NFSv4.1: Fix pnfs_put_lseg races
  NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
  ...

9 years agomm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:42 +0000 (15:28 -0800)]
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()

I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed".  The problem occurs in
OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:39 +0000 (15:28 -0800)]
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()

I noticed, that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed".  The problem
occurs in OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agovmstat: Reduce time interval to stat update on idle cpu
Christoph Lameter [Wed, 11 Feb 2015 23:28:36 +0000 (15:28 -0800)]
vmstat: Reduce time interval to stat update on idle cpu

It was noted that the vm stat shepherd runs every 2 seconds and that the
vmstat update is then scheduled 2 seconds in the future.

This yields an interval of double the time interval which is not desired.

Change the shepherd so that it does not delay the vmstat update on the
other cpu.  We stil have to use schedule_delayed_work since we are using a
delayed_work_struct but we can set the delay to 0.

Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/page_owner.c: remove unnecessary stack_trace field
Sergei Rogachev [Wed, 11 Feb 2015 23:28:34 +0000 (15:28 -0800)]
mm/page_owner.c: remove unnecessary stack_trace field

Page owner uses the page_ext structure to keep meta-information for every
page in the system.  The structure also contains a field of type 'struct
stack_trace', page owner uses this field during invocation of the function
save_stack_trace.  It is easy to notice that keeping a copy of this
structure for every page in the system is very inefficiently in terms of
memory.

The patch removes this unnecessary field of page_ext and forces page owner
to use a stack_trace structure allocated on the stack.

[akpm@linux-foundation.org: use struct initializers]
Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoDocumentation/filesystems/proc.txt: describe /proc/<pid>/map_files
Cyrill Gorcunov [Wed, 11 Feb 2015 23:28:31 +0000 (15:28 -0800)]
Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files

[akpm@linux-foundation.org: tweaks]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: incorporate read-only pages into transparent huge pages
Ebru Akagunduz [Wed, 11 Feb 2015 23:28:28 +0000 (15:28 -0800)]
mm: incorporate read-only pages into transparent huge pages

This patch aims to improve THP collapse rates, by allowing THP collapse in
the presence of read-only ptes, like those left in place by do_swap_page
after a read fault.

Currently THP can collapse 4kB pages into a THP when there are up to
khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
the same limit for read-only ptes.

The patch was tested with a test program that allocates 800MB of memory,
writes to it, and then sleeps.  I force the system to swap out all but
190MB of the program by touching other memory.  Afterwards, the test
program does a mix of reads and writes to its memory, and the memory gets
swapped back in.

Without the patch, only the memory that did not get swapped out remained
in THPs, which corresponds to 24% of the memory of the program.  The
percentage did not increase over time.

With this patch, after 5 minutes of waiting khugepaged had collapsed 50%
of the program's memory back into THPs.

Test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:      100464 kB
AnonHugePages:  100352 kB
Swap:           699540 kB
Fraction:       99,88

cat /proc/meminfo:
AnonPages:      1754448 kB
AnonHugePages:  1716224 kB
Fraction:       97,82

After swapped in:
In a few seconds:
cat /proc/pid/smaps:
Anonymous:      800004 kB
AnonHugePages:  145408 kB
Swap:           0 kB
Fraction:       18,17

cat /proc/meminfo:
AnonPages:      2455016 kB
AnonHugePages:  1761280 kB
Fraction:       71,74

In 5 minutes:
cat /proc/pid/smaps
Anonymous:      800004 kB
AnonHugePages:  407552 kB
Swap:           0 kB
Fraction:       50,94

cat /proc/meminfo:
AnonPages:      2456872 kB
AnonHugePages:  2023424 kB
Fraction:       82,35

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:      190660 kB
AnonHugePages:  190464 kB
Swap:           609344 kB
Fraction:       99,89

cat /proc/meminfo:
AnonPages:      1740456 kB
AnonHugePages:  1667072 kB
Fraction:       95,78

After swapped in:
cat /proc/pid/smaps:
Anonymous:      800004 kB
AnonHugePages:  190464 kB
Swap:           0 kB
Fraction:       23,80

cat /proc/meminfo:
AnonPages:      2350032 kB
AnonHugePages:  1667072 kB
Fraction:       70,93

I waited 10 minutes the fractions did not change without the patch.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agovmstat: do not use deferrable delayed work for vmstat_update
Michal Hocko [Wed, 11 Feb 2015 23:28:24 +0000 (15:28 -0800)]
vmstat: do not use deferrable delayed work for vmstat_update

Vinayak Menon has reported that an excessive number of tasks was throttled
in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
was relatively high compared to NR_INACTIVE_FILE.  However it turned out
that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
vm_stat_diff wasn't transferred into the global counter.

vmstat_work which is responsible for the sync is defined as deferrable
delayed work which means that the defined timeout doesn't wake up an idle
CPU.  A CPU might stay in an idle state for a long time and general effort
is to keep such a CPU in this state as long as possible which might lead
to all sorts of troubles for vmstat consumers as can be seen with the
excessive direct reclaim throttling.

This patch basically reverts 39bf6270f524 ("VM statistics: Make timer
deferrable") but it shouldn't cause any problems for idle CPUs because
only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5
("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
longer time shouldn't have per-cpu drift.

Fixes: 39bf6270f524 (VM statistics: Make timer deferrable)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: more aggressive page stealing for UNMOVABLE allocations
Vlastimil Babka [Wed, 11 Feb 2015 23:28:21 +0000 (15:28 -0800)]
mm: more aggressive page stealing for UNMOVABLE allocations

When allocation falls back to stealing free pages of another migratetype,
it can decide to steal extra pages, or even the whole pageblock in order
to reduce fragmentation, which could happen if further allocation
fallbacks pick a different pageblock.  In try_to_steal_freepages(), one of
the situations where extra pages are stolen happens when we are trying to
allocate a MIGRATE_RECLAIMABLE page.

However, MIGRATE_UNMOVABLE allocations are not treated the same way,
although spreading such allocation over multiple fallback pageblocks is
arguably even worse than it is for RECLAIMABLE allocations.  To minimize
fragmentation, we should minimize the number of such fallbacks, and thus
steal as much as is possible from each fallback pageblock.

Note that in theory this might put more pressure on movable pageblocks and
cause movable allocations to steal back from unmovable pageblocks.
However, movable allocations are not as aggressive with stealing, and do
not cause permanent fragmentation, so the tradeoff is reasonable, and
evaluation seems to support the change.

This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
steal extra free pages.  When evaluating with stress-highalloc from
mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
roughly 1/6.  The number of these fallbacks stealing from MIGRATE_MOVABLE
block is reduced to 1/3.  There was no observation of growing number of
unmovable pageblocks over time, and also not of increased movable
allocation fallbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: always steal split buddies in fallback allocations
Vlastimil Babka [Wed, 11 Feb 2015 23:28:18 +0000 (15:28 -0800)]
mm: always steal split buddies in fallback allocations

When allocation falls back to another migratetype, it will steal a page
with highest available order, and (depending on this order and desired
migratetype), it might also steal the rest of free pages from the same
pageblock.

Given the preference of highest available order, it is likely that it will
be higher than the desired order, and result in the stolen buddy page
being split.  The remaining pages after split are currently stolen only
when the rest of the free pages are stolen.  This can however lead to
situations where for MOVABLE allocations we split e.g.  order-4 fallback
UNMOVABLE page, but steal only order-0 page.  Then on the next MOVABLE
allocation (which may be batched to fill the pcplists) we split another
order-3 or higher page, etc.  By stealing all pages that we have split, we
can avoid further stealing.

This patch therefore adjusts the page stealing so that buddy pages created
by split are always stolen.  This has effect only on MOVABLE allocations,
as RECLAIMABLE and UNMOVABLE allocations already always do that in
addition to stealing the rest of free pages from the pageblock.  The
change also allows to simplify try_to_steal_freepages() and factor out CMA
handling.

According to Mel, it has been intended since the beginning that buddy
pages after split would be stolen always, but it doesn't seem like it was
ever the case until commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA
migration type added").  The commit has unintentionally introduced this
behavior, but was reverted by commit 0cbef29a7821 ("mm:
__rmqueue_fallback() should respect pageblock type").  Neither included
evaluation.

My evaluation with stress-highalloc from mmtests shows about 2.5x
reduction of page stealing events for MOVABLE allocations, without
affecting the page stealing events for other allocation migratetypes.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: when stealing freepages, also take pages created by splitting buddy page
Vlastimil Babka [Wed, 11 Feb 2015 23:28:15 +0000 (15:28 -0800)]
mm: when stealing freepages, also take pages created by splitting buddy page

When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.

Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks.  Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.

The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results.  Here I used different baseline which includes RFC compaction
improvements from [1].  I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.

First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test.  First
column is after fresh reboot, other two are reiterations of test without
reboot.  That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).

Baseline:

                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  5-nothp-1       5-nothp-2       5-nothp-3
Page alloc extfrag event                               10264225     8702233    10244125
Extfrag fragmenting                                    10263271     8701552    10243473
Extfrag fragmenting for unmovable                         13595       17616       15960
Extfrag fragmenting unmovable placed with movable          7989       12193        8447
Extfrag fragmenting for reclaimable                         658        1840        1817
Extfrag fragmenting reclaimable placed with movable         558        1677        1679
Extfrag fragmenting for movable                        10249018     8682096    10225696

With Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  6-nothp-1       6-nothp-2       6-nothp-3
Page alloc extfrag event                               11834954     9877523     9774860
Extfrag fragmenting                                    11833993     9876880     9774245
Extfrag fragmenting for unmovable                          7342       16129       11712
Extfrag fragmenting unmovable placed with movable          4191       10547        6270
Extfrag fragmenting for reclaimable                         373        1130         923
Extfrag fragmenting reclaimable placed with movable         302         906         738
Extfrag fragmenting for movable                        11826278     9859621     9761610

With Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  7-nothp-1       7-nothp-2       7-nothp-3
Page alloc extfrag event                                4725990     3668793     3807436
Extfrag fragmenting                                     4725104     3668252     3806898
Extfrag fragmenting for unmovable                          6678        7974        7281
Extfrag fragmenting unmovable placed with movable          2051        3829        4017
Extfrag fragmenting for reclaimable                         429        1208        1278
Extfrag fragmenting reclaimable placed with movable         369         976        1034
Extfrag fragmenting for movable                         4717997     3659070     3798339

With Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  8-nothp-1       8-nothp-2       8-nothp-3
Page alloc extfrag event                                5016183     4700142     3850633
Extfrag fragmenting                                     5015325     4699613     3850072
Extfrag fragmenting for unmovable                          1312        3154        3088
Extfrag fragmenting unmovable placed with movable          1115        2777        2714
Extfrag fragmenting for reclaimable                         437        1193        1097
Extfrag fragmenting reclaimable placed with movable         330         969         879
Extfrag fragmenting for movable                         5013576     4695266     3845887

In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise.  Here, each patch
improves the situation for unmovable events.  Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO.  The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless.  These are least critical as
compaction can move them around.

If we look at success rates, the patches don't affect them, that didn't change.

Baseline:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 1:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            6-nothp-1             6-nothp-2             6-nothp-3
Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)

Patch 2:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            7-nothp-1             7-nothp-2             7-nothp-3
Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 3:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            8-nothp-1             8-nothp-2             8-nothp-3
Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)

While there's no improvement here, I consider reduced fragmentation events
to be worth on its own.  Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:

Patch 1:

Compaction stalls                 4153        3959        3978
Compaction success                1523        1441        1446
Compaction failures               2630        2517        2531
Page migrate success           4600827     4943120     5104348
Page migrate failure             19763       16656       17806
Compaction pages isolated      9597640    10305617    10653541
Compaction migrate scanned    77828948    86533283    87137064
Compaction free scanned      517758295   521312840   521462251
Compaction cost                   5503        5932        6110

Patch 2:

Compaction stalls                 3800        3450        3518
Compaction success                1421        1316        1317
Compaction failures               2379        2134        2201
Page migrate success           4160421     4502708     4752148
Page migrate failure             19705       14340       14911
Compaction pages isolated      8731983     9382374     9910043
Compaction migrate scanned    98362797    96349194    98609686
Compaction free scanned      496512560   469502017   480442545
Compaction cost                   5173        5526        5811

As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.

Configuring the benchmark to allocate like THP page fault (i.e.  no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot.  This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.

Baseline:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    5-thp-1         5-thp-2         5-thp-3
Page alloc extfrag event                                8148965     6227815     6646741
Extfrag fragmenting                                     8147872     6227130     6646117
Extfrag fragmenting for unmovable                         10324       12942       15975
Extfrag fragmenting unmovable placed with movable          5972        8495       10907
Extfrag fragmenting for reclaimable                         601        1707        2210
Extfrag fragmenting reclaimable placed with movable         520        1570        2000
Extfrag fragmenting for movable                         8136947     6212481     6627932

Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    6-thp-1         6-thp-2         6-thp-3
Page alloc extfrag event                                8345457     7574471     7020419
Extfrag fragmenting                                     8343546     7573777     7019718
Extfrag fragmenting for unmovable                         10256       18535       30716
Extfrag fragmenting unmovable placed with movable          6893       11726       22181
Extfrag fragmenting for reclaimable                         465        1208        1023
Extfrag fragmenting reclaimable placed with movable         353         996         843
Extfrag fragmenting for movable                         8332825     7554034     6987979

Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    7-thp-1         7-thp-2         7-thp-3
Page alloc extfrag event                                3512847     3020756     2891625
Extfrag fragmenting                                     3511940     3020185     2891059
Extfrag fragmenting for unmovable                          9017        6892        6191
Extfrag fragmenting unmovable placed with movable          1524        3053        2435
Extfrag fragmenting for reclaimable                         445        1081        1160
Extfrag fragmenting reclaimable placed with movable         375         918         986
Extfrag fragmenting for movable                         3502478     3012212     2883708

Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    8-thp-1         8-thp-2         8-thp-3
Page alloc extfrag event                                3181699     3082881     2674164
Extfrag fragmenting                                     3180812     3082303     2673611
Extfrag fragmenting for unmovable                          1201        4031        4040
Extfrag fragmenting unmovable placed with movable           974        3611        3645
Extfrag fragmenting for reclaimable                         478        1165        1294
Extfrag fragmenting reclaimable placed with movable         387         985        1030
Extfrag fragmenting for movable                         3179133     3077107     2668277

The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.

Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.

[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2

This patch (of 3):

When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype.  With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:

1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)

These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g.  distribute UNMOVABLE allocations among multiple
pageblocks.

Originally, decision for 1) has implied the decision for 3).  Commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.

Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended.  Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1).  This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.

This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.

Note that evaluation has shown that the behavior introduced by
47118af076f6 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again.  For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a7821.

[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a7821]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomincore: apply page table walker on do_mincore()
Naoya Horiguchi [Wed, 11 Feb 2015 23:28:11 +0000 (15:28 -0800)]
mincore: apply page table walker on do_mincore()

This patch makes do_mincore() use walk_page_vma(), which reduces many
lines of code by using common page table walk code.

[daeseok.youn@gmail.com: remove unneeded variable 'err']
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: /proc/pid/clear_refs: avoid split_huge_page()
Kirill A. Shutemov [Wed, 11 Feb 2015 23:28:08 +0000 (15:28 -0800)]
mm: /proc/pid/clear_refs: avoid split_huge_page()

Currently pagewalker splits all THP pages on any clear_refs request.  It's
not necessary.  We can handle this on PMD level.

One side effect is that soft dirty will potentially see more dirty memory,
since we will mark whole THP page dirty at once.

Sanity checked with CRIU test suite. More testing is required.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
Naoya Horiguchi [Wed, 11 Feb 2015 23:28:06 +0000 (15:28 -0800)]
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)

walk_page_range() silently skips vma having VM_PFNMAP set, which leads to
undesirable behaviour at client end (who called walk_page_range).  For
example for pagemap_read(), when no callbacks are called against VM_PFNMAP
vma, pagemap_read() may prepare pagemap data for next virtual address
range at wrong index.  That could confuse and/or break userspace
applications.

This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows:
- for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
  over vma(VM_PFNMAP),
- for clear_refs and queue_pages which have their own ->tests_walk,
  just return 1 and skip vma(VM_PFNMAP). This is no problem because
  these are not interested in hole regions,
- for other callers, just skip the vma(VM_PFNMAP) as a default behavior.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomempolicy: apply page table walker on queue_pages_range()
Naoya Horiguchi [Wed, 11 Feb 2015 23:28:03 +0000 (15:28 -0800)]
mempolicy: apply page table walker on queue_pages_range()

queue_pages_range() does page table walking in its own way now, but there
is some code duplicate.  This patch applies page table walker to reduce
lines of code.

queue_pages_range() has to do some precheck to determine whether we really
walk over the vma or just skip it.  Now we have test_walk() callback in
mm_walk for this purpose, so we can do this replacement cleanly.
queue_pages_test_walk() depends on not only the current vma but also the
previous one, so queue_pages->prev is introduced to remember it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoarch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
Naoya Horiguchi [Wed, 11 Feb 2015 23:28:00 +0000 (15:28 -0800)]
arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()

We don't have to use mm_walk->private to pass vma to the callback function
because of mm_walk->vma.  And walk_page_vma() is useful if we walk over a
single vma.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomemcg: cleanup preparation for page table walk
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:57 +0000 (15:27 -0800)]
memcg: cleanup preparation for page table walk

pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private.  And both of mem_cgroup_count_precharge() and
mem_cgroup_move_charge() do for each vma loop themselves, but now it's
done in pagewalk.c, so let's clean up them.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agonuma_maps: remove numa_maps->vma
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:54 +0000 (15:27 -0800)]
numa_maps: remove numa_maps->vma

pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private.  And show_numa_map() walks pages on vma basis, so using
walk_page_vma() is preferable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agonuma_maps: fix typo in gather_hugetbl_stats
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:51 +0000 (15:27 -0800)]
numa_maps: fix typo in gather_hugetbl_stats

Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code
grep-friendly.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agopagemap: use walk->vma instead of calling find_vma()
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:48 +0000 (15:27 -0800)]
pagemap: use walk->vma instead of calling find_vma()

Page table walker has the information of the current vma in mm_walk, so we
don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call
any longer.  Currently pagemap_pte_range() does vma loop itself, so this
patch reduces many lines of code.

NULL-vma check is omitted because we assume that we never run these
callbacks on any address outside vma.  And even if it were broken, NULL
pointer dereference would be detected, so we can get enough information
for debugging.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoclear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:46 +0000 (15:27 -0800)]
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()

clear_refs_write() has some prechecks to determine if we really walk over
a given vma.  Now we have a test_walk() callback to filter vmas, so let's
utilize it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agosmaps: remove mem_size_stats->vma and use walk_page_vma()
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:43 +0000 (15:27 -0800)]
smaps: remove mem_size_stats->vma and use walk_page_vma()

pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private.  And show_smap() walks pages on vma basis, so using
walk_page_vma() is preferable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agopagewalk: add walk_page_vma()
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:40 +0000 (15:27 -0800)]
pagewalk: add walk_page_vma()

Introduce walk_page_vma(), which is useful for the callers which want to
walk over a given vma.  It's used by later patches.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agopagewalk: improve vma handling
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:37 +0000 (15:27 -0800)]
pagewalk: improve vma handling

Current implementation of page table walker has a fundamental problem in
vma handling, which started when we tried to handle vma(VM_HUGETLB).
Because it's done in pgd loop, considering vma boundary makes code
complicated and bug-prone.

From the users viewpoint, some user checks some vma-related condition to
determine whether the user really does page walk over the vma.

In order to solve these, this patch moves vma check outside pgd loop and
introduce a new callback ->test_walk().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/pagewalk: remove pgd_entry() and pud_entry()
Naoya Horiguchi [Wed, 11 Feb 2015 23:27:34 +0000 (15:27 -0800)]
mm/pagewalk: remove pgd_entry() and pud_entry()

Currently no user of page table walker sets ->pgd_entry() or
->pud_entry(), so checking their existence in each loop is just wasting
CPU cycle.  So let's remove it to reduce overhead.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoproc/pagemap: walk page tables under pte lock
Konstantin Khlebnikov [Wed, 11 Feb 2015 23:27:31 +0000 (15:27 -0800)]
proc/pagemap: walk page tables under pte lock

Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()
<read *pte>
pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: gup: kvm use get_user_pages_unlocked
Andrea Arcangeli [Wed, 11 Feb 2015 23:27:28 +0000 (15:27 -0800)]
mm: gup: kvm use get_user_pages_unlocked

Use the more generic get_user_pages_unlocked which has the additional
benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
(which allows the first page fault in an unmapped area to be always able
to block indefinitely by being allowed to release the mmap_sem).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: gup: use get_user_pages_unlocked
Andrea Arcangeli [Wed, 11 Feb 2015 23:27:26 +0000 (15:27 -0800)]
mm: gup: use get_user_pages_unlocked

This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
the page fault in order to release the mmap_sem during the I/O.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: gup: use get_user_pages_unlocked within get_user_pages_fast
Andrea Arcangeli [Wed, 11 Feb 2015 23:27:23 +0000 (15:27 -0800)]
mm: gup: use get_user_pages_unlocked within get_user_pages_fast

This allows the get_user_pages_fast slow path to release the mmap_sem
before blocking.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: gup: add __get_user_pages_unlocked to customize gup_flags
Andrea Arcangeli [Wed, 11 Feb 2015 23:27:20 +0000 (15:27 -0800)]
mm: gup: add __get_user_pages_unlocked to customize gup_flags

Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
to get a proper -EHWPOSION retval instead of -EFAULT to take a more
appropriate action if get_user_pages runs into a memory failure.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: gup: add get_user_pages_locked and get_user_pages_unlocked
Andrea Arcangeli [Wed, 11 Feb 2015 23:27:17 +0000 (15:27 -0800)]
mm: gup: add get_user_pages_locked and get_user_pages_unlocked

FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion.  The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.

Andres fixed it for the KVM page fault.  However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).

So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel.  It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.

The only few places that remains uncovered are drivers like v4l and other
exceptions that tends to work on their own memory and they're not working
on random user memory (for example like O_DIRECT that uses gup_fast and is
fully covered by this patch).

A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually.  The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).

While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .

The userfaultfd allows to block the page fault, and in order to do so I
need to drop the mmap_sem first.  So this patch also ensures that all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set.  Then the userfaultfd blocks and it is waken
only when the pagetable is already mapped.  The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.

This patch (of 5):

We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call.  Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before.  Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma
Vlastimil Babka [Wed, 11 Feb 2015 23:27:15 +0000 (15:27 -0800)]
mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma

The previous commit ("mm/thp: Allocate transparent hugepages on local
node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
special policy for THP allocations.  The function has the same interface
as alloc_pages_vma(), shares a lot of boilerplate code and a long
comment.

This patch merges the hugepage special case into alloc_pages_vma.  The
extra if condition should be cheap enough price to pay.  We also prevent
a (however unlikely) race with parallel mems_allowed update, which could
make hugepage allocation restart only within the fallback call to
alloc_hugepage_vma() and not reconsider the special rule in
alloc_hugepage_vma().

Also by making sure mpol_cond_put(pol) is always called before actual
allocation attempt, we can use a single exit path within the function.

Also update the comment for missing node parameter and obsolete reference
to mm_sem.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/thp: allocate transparent hugepages on local node
Aneesh Kumar K.V [Wed, 11 Feb 2015 23:27:12 +0000 (15:27 -0800)]
mm/thp: allocate transparent hugepages on local node

This make sure that we try to allocate hugepages from local node if
allowed by mempolicy.  If we can't, we fallback to small page allocation
based on mempolicy.  This is based on the observation that allocating
pages on local node is more beneficial than allocating hugepages on remote
node.

With this patch applied we may find transparent huge page allocation
failures if the current node doesn't have enough freee hugepages.  Before
this patch such failures result in us retrying the allocation on other
nodes in the numa node mask.

[akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/compaction: add tracepoint to observe behaviour of compaction defer
Joonsoo Kim [Wed, 11 Feb 2015 23:27:09 +0000 (15:27 -0800)]
mm/compaction: add tracepoint to observe behaviour of compaction defer

Compaction deferring logic is heavy hammer that block the way to the
compaction.  It doesn't consider overall system state, so it could prevent
user from doing compaction falsely.  In other words, even if system has
enough range of memory to compact, compaction would be skipped due to
compaction deferring logic.  This patch add new tracepoint to understand
work of deferring logic.  This will also help to check compaction success
and fail.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/compaction: more trace to understand when/why compaction start/finish
Joonsoo Kim [Wed, 11 Feb 2015 23:27:06 +0000 (15:27 -0800)]
mm/compaction: more trace to understand when/why compaction start/finish

It is not well analyzed that when/why compaction start/finish or not.
With these new tracepoints, we can know much more about start/finish
reason of compaction.  I can find following bug with these tracepoint.

http://www.spinics.net/lists/linux-mm/msg81582.html

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/compaction: print current range where compaction work
Joonsoo Kim [Wed, 11 Feb 2015 23:27:04 +0000 (15:27 -0800)]
mm/compaction: print current range where compaction work

It'd be useful to know current range where compaction work for detailed
analysis.  With it, we can know pageblock where we actually scan and
isolate, and, how much pages we try in that pageblock and can guess why it
doesn't become freepage with pageblock order roughly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/compaction: enhance tracepoint output for compaction begin/end
Joonsoo Kim [Wed, 11 Feb 2015 23:27:01 +0000 (15:27 -0800)]
mm/compaction: enhance tracepoint output for compaction begin/end

We now have tracepoint for begin event of compaction and it prints start
position of both scanners, but, tracepoint for end event of compaction
doesn't print finish position of both scanners.  It'd be also useful to
know finish position of both scanners so this patch add it.  It will help
to find odd behavior or problem on compaction internal logic.

And mode is added to both begin/end tracepoint output, since according to
mode, compaction behavior is quite different.

And lastly, status format is changed to string rather than status number
for readability.

[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/compaction: change tracepoint format from decimal to hexadecimal
Joonsoo Kim [Wed, 11 Feb 2015 23:26:58 +0000 (15:26 -0800)]
mm/compaction: change tracepoint format from decimal to hexadecimal

To check the range that compaction is working, tracepoint print
start/end pfn of zone and start pfn of both scanner with decimal format.
Since we manage all pages in order of 2 and it is well represented by
hexadecimal, this patch change the tracepoint format from decimal to
hexadecimal.  This would improve readability.  For example, it makes us
easily notice whether current scanner try to compact previously
attempted pageblock or not.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agopage_writeback: put account_page_redirty() after set_page_dirty()
Konstantin Khebnikov [Wed, 11 Feb 2015 23:26:55 +0000 (15:26 -0800)]
page_writeback: put account_page_redirty() after set_page_dirty()

Helper account_page_redirty() fixes dirty pages counter for redirtied
pages.  This patch puts it after dirtying and prevents temporary
underflows of dirtied pages counters on zone/bdi and current->nr_dirtied.

Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: fix false-positive warning on exit due mm_nr_pmds(mm)
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:53 +0000 (15:26 -0800)]
mm: fix false-positive warning on exit due mm_nr_pmds(mm)

The problem is that we check nr_ptes/nr_pmds in exit_mmap() which happens
*before* pgd_free().  And if an arch does pte/pmd allocation in
pgd_alloc() and frees them in pgd_free() we see offset in counters by the
time of the checks.

We tried to workaround this by offsetting expected counter value according
to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap().  But it
doesn't work in some cases:

1. ARM with LPAE enabled also has non-zero USER_PGTABLES_CEILING, but
   upper addresses occupied with huge pmd entries, so the trick with
   offsetting expected counter value will get really ugly: we will have
   to apply it nr_pmds, but not nr_ptes.

2. Metag has non-zero FIRST_USER_ADDRESS, but doesn't do allocation
   pte/pmd page tables allocation in pgd_alloc(), just setup a pgd entry
   which is allocated at boot and shared accross all processes.

The proposal is to move the check to check_mm() which happens *after*
pgd_free() and do proper accounting during pgd_alloc() and pgd_free()
which would bring counters to zero if nothing leaked.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Tyler Baker <tyler.baker@linaro.org>
Tested-by: Tyler Baker <tyler.baker@linaro.org>
Tested-by: Nishanth Menon <nm@ti.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: account pmd page tables to the process
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:50 +0000 (15:26 -0800)]
mm: account pmd page tables to the process

Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)

#define NR_PUD 130000

int main(void)
{
char *addr = NULL;
unsigned long i;

prctl(PR_SET_THP_DISABLE);
for (i = 0; i < NR_PUD ; i++) {
addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
break;
}
*addr = 'x';
munmap(addr, PMD_SIZE);
mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
if (addr == MAP_FAILED)
perror("re-mmap"), exit(1);
}
printf("PID %d consumed %lu KiB in PMD page tables\n",
getpid(), i * 4096 >> 10);
return pause();
}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoarm: define __PAGETABLE_PMD_FOLDED for !LPAE
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:47 +0000 (15:26 -0800)]
arm: define __PAGETABLE_PMD_FOLDED for !LPAE

ARM uses custom implementation of PMD folding in 2-level page table case.
Generic code expects to see __PAGETABLE_PMD_FOLDED to be defined if PMD is
folded, but ARM doesn't do this.  Let's fix it.

Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc().  It
also fixes problems with recently-introduced pmd accounting on ARM without
LPAE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Nishanth Menon <nm@ti.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Tested-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Felipe Balbi <balbi@ti.com>
Tested-by: Nishanth Menon <nm@ti.com>
Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Tested-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm, asm-generic: define PUD_SHIFT in <asm-generic/4level-fixup.h>
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:44 +0000 (15:26 -0800)]
mm, asm-generic: define PUD_SHIFT in <asm-generic/4level-fixup.h>

If an architecure uses <asm-generic/4level-fixup.h>, build fails if we
try to use PUD_SHIFT in generic code:

   In file included from arch/microblaze/include/asm/bug.h:1:0,
                    from include/linux/bug.h:4,
                    from include/linux/thread_info.h:11,
                    from include/asm-generic/preempt.h:4,
                    from arch/microblaze/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:18,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from mm/mmap.c:12:
   mm/mmap.c: In function 'exit_mmap':
>> mm/mmap.c:2858:46: error: 'PUD_SHIFT' undeclared (first use in this function)
       round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
                                                 ^
   include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
     int __ret_warn_on = !!(condition);    \
                            ^
   mm/mmap.c:2858:46: note: each undeclared identifier is reported only once for each function it appears in
       round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
                                                 ^
   include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
     int __ret_warn_on = !!(condition);    \
                            ^
As with <asm-generic/pgtable-nopud.h>, let's define PUD_SHIFT to
PGDIR_SHIFT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: make FIRST_USER_ADDRESS unsigned long on all archs
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:41 +0000 (15:26 -0800)]
mm: make FIRST_USER_ADDRESS unsigned long on all archs

LKP has triggered a compiler warning after my recent patch "mm: account
pmd page tables to the process":

    mm/mmap.c: In function 'exit_mmap':
 >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]

The code:

 > 2857                WARN_ON(mm_nr_pmds(mm) >
   2858                                round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);

In this, on tile, we have FIRST_USER_ADDRESS defined as 0.  round_up() has
the same type -- int.  PUD_SHIFT.

I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
long.  On every arch for consistency.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomicroblaze: define __PAGETABLE_PMD_FOLDED
Kirill A. Shutemov [Wed, 11 Feb 2015 23:26:38 +0000 (15:26 -0800)]
microblaze: define __PAGETABLE_PMD_FOLDED

Microblaze uses custom implementation of PMD folding, but doesn't define
__PAGETABLE_PMD_FOLDED, which generic code expects to see.  Let's fix it.

Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc().  It
also fixes problems with recently-introduced pmd accounting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: consolidate swap controller code
Johannes Weiner [Wed, 11 Feb 2015 23:26:36 +0000 (15:26 -0800)]
mm: memcontrol: consolidate swap controller code

The swap controller code is scattered all over the file.  Gather all
the code that isn't directly needed by the memory controller at the
end of the file in its own CONFIG_MEMCG_SWAP section.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: consolidate memory controller initialization
Johannes Weiner [Wed, 11 Feb 2015 23:26:33 +0000 (15:26 -0800)]
mm: memcontrol: consolidate memory controller initialization

The initialization code for the per-cpu charge stock and the soft
limit tree is compact enough to inline it into mem_cgroup_init().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: simplify soft limit tree init code
Johannes Weiner [Wed, 11 Feb 2015 23:26:30 +0000 (15:26 -0800)]
mm: memcontrol: simplify soft limit tree init code

- No need to test the node for N_MEMORY.  node_online() is enough for
  node fallback to work in slab, use NUMA_NO_NODE for everything else.

- Remove the BUG_ON() for allocation failure.  A NULL pointer crash is
  just as descriptive, and the absent return value check is obvious.

- Move local variables to the inner-most blocks.

- Point to the tree structure after its initialized, not before, it's
  just more logical that way.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: cma: fix totalcma_pages to include DT defined CMA regions
George G. Davis [Wed, 11 Feb 2015 23:26:27 +0000 (15:26 -0800)]
mm: cma: fix totalcma_pages to include DT defined CMA regions

The totalcma_pages variable is not updated to account for CMA regions
defined via device tree reserved-memory sub-nodes.  Fix this omission by
moving the calculation of totalcma_pages into cma_init_reserved_mem()
instead of cma_declare_contiguous() such that it will include reserved
memory used by all CMA regions.

Signed-off-by: George G. Davis <george_davis@mentor.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agooom, PM: make OOM detection in the freezer path raceless
Michal Hocko [Wed, 11 Feb 2015 23:26:24 +0000 (15:26 -0800)]
oom, PM: make OOM detection in the freezer path raceless

Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter.  The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution.  That requires the full OOM and freezer's task freezing
exclusion, though.  This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations.  Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.

The page fault path is covered now as well although it was assumed to be
safe before.  As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here.  Same applies
to the memcg OOM killer.

out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation.  The page allocation
path simply fails the allocation same as before.  The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.

Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because

- the victim cannot loop in the page fault handler (it would die
  on the way out from the exception)
- it cannot loop in the page allocator because all the further
  allocation would fail and __GFP_NOFAIL allocations are not
  acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
  (try_to_freeze expects lockless context) and kernel threads and
  work queues are not frozen yet

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agosysrq: convert printk to pr_* equivalent
Michal Hocko [Wed, 11 Feb 2015 23:26:21 +0000 (15:26 -0800)]
sysrq: convert printk to pr_* equivalent

While touching this area let's convert printk to pr_*.  This also makes
the printing of continuation lines done properly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoPM: convert printk to pr_* equivalent
Michal Hocko [Wed, 11 Feb 2015 23:26:18 +0000 (15:26 -0800)]
PM: convert printk to pr_* equivalent

While touching this area let's convert printk to pr_*.  This also makes
the printing of continuation lines done properly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agooom: thaw the OOM victim if it is frozen
Michal Hocko [Wed, 11 Feb 2015 23:26:15 +0000 (15:26 -0800)]
oom: thaw the OOM victim if it is frozen

oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
victim.  This is basically noop when the task is frozen though because the
task sleeps in the uninterruptible sleep.  The victim is eventually thawed
later when oom_scan_process_thread meets the task again in a later OOM
invocation so the OOM killer doesn't live lock.  But this is less than
optimal.

Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to
the victim.  We are not checking whether the task is frozen because that
would be racy and __thaw_task does that already.  oom_scan_process_thread
doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are
excluded completely now.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agooom: add helpers for setting and clearing TIF_MEMDIE
Michal Hocko [Wed, 11 Feb 2015 23:26:12 +0000 (15:26 -0800)]
oom: add helpers for setting and clearing TIF_MEMDIE

This patchset addresses a race which was described in the changelog for
5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen.  But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation.  In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen.  This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_.  As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably.  I can
only speculate what might happen when a task is still runnable
unexpectedly.

On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
echo processors > /sys/power/pm_test
while true
do
echo mem > /sys/power/state
sleep 1s
done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled.  I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...

The first 2 patches are simple cleanups for OOM.  They should go in
regardless the rest IMO.

Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.

The main patch is the last one and I would appreciate acks from Tejun and
Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code.  I have found several
surprises there.

This patch (of 5):

This patch is just a preparatory and it doesn't introduce any functional
change.

Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: fold move_anon() and move_file()
Johannes Weiner [Wed, 11 Feb 2015 23:26:09 +0000 (15:26 -0800)]
mm: memcontrol: fold move_anon() and move_file()

Turn the move type enum into flags and give the flags field a shorter
name.  Once that is done, move_anon() and move_file() are simple enough to
just fold them into the callsites.

[akpm@linux-foundation.org: tweak MOVE_MASK definition, per Michal]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: default hierarchy interface for memory
Johannes Weiner [Wed, 11 Feb 2015 23:26:06 +0000 (15:26 -0800)]
mm: memcontrol: default hierarchy interface for memory

Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.

This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below.  The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.

The control files are thus:

  - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

  - memory.low configures the lower end of the cgroup's expected
    memory consumption range.  The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

  - memory.high configures the upper end of the cgroup's expected
    memory consumption range.  A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

  - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

  - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max.  This
    allows users to identify configuration problems when observing a
    degradation in workload performance.  An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.

For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:

  - The original lower boundary, the soft limit, is defined as a limit
    that is per default unset.  As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out.  The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior.  First off, the soft limit has no
    hierarchical meaning.  All configured groups are organized in a
    global rbtree and treated like equal peers, regardless where they
    are located in the hierarchy.  This makes subtree delegation
    impossible.  Second, the soft limit reclaim pass is so aggressive
    that it not just introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve.  A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible.  Secondly, new cgroups have no reserve per
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass.  This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes.  Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

  - The original high boundary, the hard limit, is defined as a strict
    limit that can not budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory.  The memory consumption of workloads varies
    during runtime, and that requires users to overcommit.  But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively.  When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer.  As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation.  The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded.  But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than killing the group.  Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

  - The original control file names are unwieldy and inconsistent in
    many different ways.  For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

  - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files.  But that very high number is implementation
    and architecture dependent and not very descriptive.  And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not inconsistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.

[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: page_counter: pull "-1" handling out of page_counter_memparse()
Johannes Weiner [Wed, 11 Feb 2015 23:26:03 +0000 (15:26 -0800)]
mm: page_counter: pull "-1" handling out of page_counter_memparse()

The unified hierarchy interface for memory cgroups will no longer use "-1"
to mean maximum possible resource value.  In preparation for this, make
the string an argument and let the caller supply it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: use correct format specifiers when printing address ranges
Juergen Gross [Wed, 11 Feb 2015 23:26:01 +0000 (15:26 -0800)]
mm: use correct format specifiers when printing address ranges

Especially on 32 bit kernels memory node ranges are printed with 32 bit
wide addresses only.  Use u64 types and %llx specifiers to print full
width of addresses.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomemcg: add BUILD_BUG_ON() for string tables
Greg Thelen [Wed, 11 Feb 2015 23:25:58 +0000 (15:25 -0800)]
memcg: add BUILD_BUG_ON() for string tables

Use BUILD_BUG_ON() to compile assert that memcg string tables are in sync
with corresponding enums.  There aren't currently any issues with these
tables.  This is just defensive.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agovmscan: force scan offline memory cgroups
Vladimir Davydov [Wed, 11 Feb 2015 23:25:55 +0000 (15:25 -0800)]
vmscan: force scan offline memory cgroups

Since commit b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups") pages charged to a memory cgroup are not reparented when
the cgroup is removed.  Instead, they are supposed to be reclaimed in a
regular way, along with pages accounted to online memory cgroups.

However, an lruvec of an offline memory cgroup will sooner or later get so
small that it will be scanned only at low scan priorities (see
get_scan_count()).  Therefore, if there are enough reclaimable pages in
big lruvecs, pages accounted to offline memory cgroups will never be
scanned at all, wasting memory.

Fix this by unconditionally forcing scanning dead lruvecs from kswapd.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: more checks on free_pages_prepare() for tail pages
Kirill A. Shutemov [Wed, 11 Feb 2015 23:25:52 +0000 (15:25 -0800)]
mm: more checks on free_pages_prepare() for tail pages

Although it was not called, destroy_compound_page() did some potentially
useful checks.  Let's re-introduce them in free_pages_prepare(), where
they can be actually triggered when CONFIG_DEBUG_VM=y.

compound_order() assert is already in free_pages_prepare().  We have few
checks for tail pages left.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/page_alloc.c: drop dead destroy_compound_page()
Kirill A. Shutemov [Wed, 11 Feb 2015 23:25:50 +0000 (15:25 -0800)]
mm/page_alloc.c: drop dead destroy_compound_page()

The only caller is __free_one_page(). By the time we should have
page->flags to be cleared already:

 - for 0-order pages though PCP list:
free_hot_cold_page()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
<put the page to PCP list>

free_pcppages_bulk()
page = <withdraw pages from PCP list>
__free_one_page(page)

 - for non-0-order pages:
__free_pages_ok()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
free_one_page()
__free_one_page()

So there's no way PageCompound() will return true in __free_one_page().
Let's remove dead destroy_compound_page() and put assert for page->flags
there instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: microoptimize zonelist operations
Vlastimil Babka [Wed, 11 Feb 2015 23:25:47 +0000 (15:25 -0800)]
mm: microoptimize zonelist operations

next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
via extra parameter.  Since the latter can be trivially obtained by
dereferencing the former, the overhead of the extra parameter is
unjustified.

This patch thus removes the zone parameter from next_zones_zonelist().
Both callers happen to be in the same header file, so it's simple to add
the zoneref dereference inline.  We save some bytes of code size.

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
function                                     old     new   delta
nr_free_zone_pages                           129     115     -14
__alloc_pages_nodemask                      2300    2285     -15
get_page_from_freelist                      2652    2576     -76

add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
function                                     old     new   delta
try_to_compact_pages                         569     579     +10

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: reduce try_to_compact_pages parameters
Vlastimil Babka [Wed, 11 Feb 2015 23:25:44 +0000 (15:25 -0800)]
mm: reduce try_to_compact_pages parameters

Expand the usage of the struct alloc_context introduced in the previous
patch also for calling try_to_compact_pages(), to reduce the number of its
parameters.  Since the function is in different compilation unit, we need
to move alloc_context definition in the shared mm/internal.h header.

With this change we get simpler code and small savings of code size and stack
usage:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function                                     old     new   delta
__alloc_pages_direct_compact                 283     256     -27
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
function                                     old     new   delta
try_to_compact_pages                         582     569     -13

Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
scripts/checkstack.pl).

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm, page_alloc: reduce number of alloc_pages* functions' parameters
Vlastimil Babka [Wed, 11 Feb 2015 23:25:41 +0000 (15:25 -0800)]
mm, page_alloc: reduce number of alloc_pages* functions' parameters

Introduce struct alloc_context to accumulate the numerous parameters
passed between the alloc_pages* family of functions and
get_page_from_freelist().  This excludes gfp_flags and alloc_info, which
mutate too much along the way, and allocation order, which is conceptually
different.

The result is shorter function signatures, as well as overal code size and
stack usage reductions.

bloat-o-meter:

add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
function                                     old     new   delta
get_page_from_freelist                      2525    2652    +127
__alloc_pages_direct_compact                 329     283     -46
__alloc_pages_nodemask                      2564    2300    -264

checkstack.pl:

function                            old    new
__alloc_pages_nodemask              248    200
get_page_from_freelist              168    184
__alloc_pages_direct_compact         40     24

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: set page->pfmemalloc in prep_new_page()
Vlastimil Babka [Wed, 11 Feb 2015 23:25:38 +0000 (15:25 -0800)]
mm: set page->pfmemalloc in prep_new_page()

The possibility of replacing the numerous parameters of alloc_pages*
functions with a single structure has been discussed when Minchan proposed
to expand the x86 kernel stack [1].  This series implements the change,
along with few more cleanups/microoptimizations.

The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
openSUSE 13.2 for compiling.  Config includess NUMA and COMPACTION.

The core change is the introduction of a new struct alloc_context, which looks
like this:

struct alloc_context {
        struct zonelist *zonelist;
        nodemask_t *nodemask;
        struct zone *preferred_zone;
        int classzone_idx;
        int migratetype;
        enum zone_type high_zoneidx;
};

All the contents is mostly constant, except that __alloc_pages_slowpath()
changes preferred_zone, classzone_idx and potentially zonelist.  But
that's not a problem in case control returns to retry_cpuset: in
__alloc_pages_nodemask(), those will be reset to initial values again
(although it's a bit subtle).  On the other hand, gfp_flags and alloc_info
mutate so much that it doesn't make sense to put them into alloc_context.
Still, the result is one parameter instead of up to 7.  This is all in
Patch 2.

Patch 3 is a step to expand alloc_context usage out of page_alloc.c
itself.  The function try_to_compact_pages() can also much benefit from
the parameter reduction, but it means the struct definition has to be
moved to a shared header.

Patch 1 should IMHO be included even if the rest is deemed not useful
enough.  It improves maintainability and also has some code/stack
reduction.  Patch 4 is OTOH a tiny optimization.

Overall bloat-o-meter results:

add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
function                                     old     new   delta
nr_free_zone_pages                           129     115     -14
__alloc_pages_direct_compact                 329     256     -73
get_page_from_freelist                      2670    2576     -94
__alloc_pages_nodemask                      2564    2285    -279
try_to_compact_pages                         582     579      -3

Overall stack sizes per ./scripts/checkstack.pl:

                          old   new delta
get_page_from_freelist:   184   184     0
__alloc_pages_nodemask    248   200   -48
__alloc_pages_direct_c     40     -   -40
try_to_compact_pages       72    72     0
                                      -88

[1] http://marc.info/?l=linux-mm&m=140142462528257&w=2

This patch (of 4):

prep_new_page() sets almost everything in the struct page of the page
being allocated, except page->pfmemalloc.  This is not obvious and has at
least once led to a bug where page->pfmemalloc was forgotten to be set
correctly, see commit 8fb74b9fb2b1 ("mm: compaction: partially revert
capture of suitable high-order page").

This patch moves the pfmemalloc setting to prep_new_page(), which means it
needs to gain alloc_flags parameter.  The call to prep_new_page is moved
from buffered_rmqueue() to get_page_from_freelist(), which also leads to
simpler code.  An obsolete comment for buffered_rmqueue() is replaced.

In addition to better maintainability there is a small reduction of code
and stack usage for get_page_from_freelist(), which inlines the other
functions involved.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
function                                     old     new   delta
get_page_from_freelist                      2670    2525    -145

Stack usage is reduced from 184 to 168 bytes.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agosparc32: fix broken set_pte()
Kirill A. Shutemov [Wed, 11 Feb 2015 23:25:35 +0000 (15:25 -0800)]
sparc32: fix broken set_pte()

32-bit sparc uses swap instruction to implement set_pte().  It called
using GCC inline assembler.  But it misses the "memory" clobber to
indicate that pte value will be updated in memory.

As result GCC doesn't know that it cannot postpone pte pointer dereference
which occurs before set_pte() to post-set_pte() time.

It leads to real-world bugs -- [1]. In this situation we have code:

ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
...
ptep_modify_prot_commit(mm, addr, pte, ptent);

ptep_modify_prot_start() in sparc case is just 'pte' dereference plus
pte_clear().  pte_clear() calls broken set_pte().  GCC thinks it's valid
to dereference 'pte' again on pte_modify() and gets cleared pte.
ptep_modify_prot_commit() puts 'pteent' with pfn==0 back to page table,
which eventually leads to the crash.

[1] http://lkml.kernel.org/r/54C06B19.8060305@roeck-us.net

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: add migration entry check in __unmap_hugepage_range
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:32 +0000 (15:25 -0800)]
mm/hugetlb: add migration entry check in __unmap_hugepage_range

If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries.  This patch simply clears the pte for
migration entries as we do for hwpoison entries.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:28 +0000 (15:25 -0800)]
mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection

There is a race condition between hugepage migration and
change_protection(), where hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them.  That causes unexpected
results like kernel crash.  HWPoison entries also can cause the same
problem.

This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
function to do proper actions.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:25 +0000 (15:25 -0800)]
mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()

When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

This race happens when pte turns into migration entry just after the first
check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
To fix this, we need to check pte_present() again after huge_ptep_get().

This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done in ptl.  Due to this reordering, we need use
trylock_page() in page != pagecache_page case to respect locking order.

Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: take page table lock in follow_huge_pmd()
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:22 +0000 (15:25 -0800)]
mm/hugetlb: take page table lock in follow_huge_pmd()

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing.  This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.

This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.

This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned.  So the caller must be changed to
properly handle the returned tail pages.

We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: pmd_huge() returns true for non-present hugepage
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:19 +0000 (15:25 -0800)]
mm/hugetlb: pmd_huge() returns true for non-present hugepage

Migrating hugepages and hwpoisoned hugepages are considered as non-present
hugepages, and they are referenced via migration entries and hwpoison
entries in their page table slots.

This behavior causes race condition because pmd_huge() doesn't tell
non-huge pages from migrating/hwpoisoned hugepages.  follow_page_mask() is
one example where the kernel would call follow_page_pte() for such
hugepage while this function is supposed to handle only normal pages.

To avoid this, this patch makes pmd_huge() return true when pmd_none() is
true *and* pmd_present() is false.  We don't have to worry about mixing up
non-present pmd entry with normal pmd (pointing to leaf level pte entry)
because pmd_present() is true in normal pmd.

The same race condition could happen in (x86-specific) gup_pmd_range(),
where this patch simply adds pmd_present() check instead of pmd_huge().
This is because gup_pmd_range() is fast path.  If we have non-present
hugepage in this function, we will go into gup_huge_pmd(), then return 0
at flag mask check, and finally fall back to the slow path.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/hugetlb: reduce arch dependent code around follow_huge_*
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:15 +0000 (15:25 -0800)]
mm/hugetlb: reduce arch dependent code around follow_huge_*

Currently we have many duplicates in definitions around
follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
patch tries to remove the m.  The basic idea is to put the default
implementation for these functions in mm/hugetlb.c as weak symbols
(regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
arch-specific code only when the arch needs it.

For follow_huge_addr(), only powerpc and ia64 have their own
implementation, and in all other architectures this function just returns
ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
default.

As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
always return 0 in your architecture (like in ia64 or sparc,) it's never
called (the callsite is optimized away) no matter how implemented it is.
So in such architectures, we don't need arch-specific implementation.

In some architecture (like mips, s390 and tile,) their current
arch-specific follow_huge_(pmd|pud)() are effectively identical with the
common code, so this patch lets these architecture use the common code.

One exception is metag, where pmd_huge() could return non-zero but it
expects follow_huge_pmd() to always return NULL.  This means that we need
arch-specific implementation which returns NULL.  This behavior looks
strange to me (because non-zero pmd_huge() implies that the architecture
supports PMD-based hugepage, so follow_huge_pmd() can/should return some
relevant value,) but that's beyond this cleanup patch, so let's keep it.

Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
  patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
  is true when follow_huge_pmd() can be called (note that pmd_huge() has
  the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
- in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
  code. This patch forces these archs use PMD_MASK, but it's OK because
  they are identical in both archs.
  In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
  In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
  PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
  PTE_ORDER is always 0, so these are identical.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm, vmscan: wake up all pfmemalloc-throttled processes at once
Vlastimil Babka [Wed, 11 Feb 2015 23:25:12 +0000 (15:25 -0800)]
mm, vmscan: wake up all pfmemalloc-throttled processes at once

Kswapd in balance_pgdate() currently uses wake_up() on processes waiting
in throttle_direct_reclaim(), which only wakes up a single process.  This
might leave processes waiting for longer than necessary, until the check
is reached in the next loop iteration.  Processes might also be left
waiting if zone was fully balanced in single iteration.  Note that the
comment in balance_pgdat() also says "Wake them", so waking up a single
process does not seem intentional.

Thus, replace wake_up() with wake_up_all().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: fix typo of MIGRATE_RESERVE in comment
Baoquan He [Wed, 11 Feb 2015 23:25:10 +0000 (15:25 -0800)]
mm: fix typo of MIGRATE_RESERVE in comment

Found it when I want to jump to the definition of MIGRATE_RESERVE ctags.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agokmemcheck: move hook into __alloc_pages_nodemask() for the page allocator
Xishi Qiu [Wed, 11 Feb 2015 23:25:07 +0000 (15:25 -0800)]
kmemcheck: move hook into __alloc_pages_nodemask() for the page allocator

Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
__alloc_pages_nodemask()
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()

And the page will not be tracked by kmemcheck in the following path.
__alloc_pages_nodemask()
get_page_from_freelist()

So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
like this:
__alloc_pages_nodemask()
...
get_page_from_freelist()
if (!page)
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
...

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm/page_alloc.c:__alloc_pages_nodemask(): don't alter arg gfp_mask
Andrew Morton [Wed, 11 Feb 2015 23:25:04 +0000 (15:25 -0800)]
mm/page_alloc.c:__alloc_pages_nodemask(): don't alter arg gfp_mask

__alloc_pages_nodemask() strips __GFP_IO when retrying the page
allocation.  But it does this by altering the function-wide variable
gfp_mask.  This will cause subsequent allocation attempts to inadvertently
use the modified gfp_mask.

Also, pass the correct mask (the mask we actually used) into
trace_mm_page_alloc().

Cc: Ming Lei <ming.lei@canonical.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: memcontrol: track move_lock state internally
Johannes Weiner [Wed, 11 Feb 2015 23:25:01 +0000 (15:25 -0800)]
mm: memcontrol: track move_lock state internally

The complexity of memcg page stat synchronization is currently leaking
into the callsites, forcing them to keep track of the move_lock state and
the IRQ flags.  Simplify the API by tracking it in the memcg.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoswap: remove unused mem_cgroup_uncharge_swapcache declaration
Vladimir Davydov [Wed, 11 Feb 2015 23:24:59 +0000 (15:24 -0800)]
swap: remove unused mem_cgroup_uncharge_swapcache declaration

The body of this function was removed by commit 0a31bc97c80c ("mm:
memcontrol: rewrite uncharge API").

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agooom: make sure that TIF_MEMDIE is set under task_lock
Michal Hocko [Wed, 11 Feb 2015 23:24:56 +0000 (15:24 -0800)]
oom: make sure that TIF_MEMDIE is set under task_lock

OOM killer tries to exclude tasks which do not have mm_struct associated
because killing such a task wouldn't help much.  The OOM victim gets
TIF_MEMDIE set to disable OOM killer while the current victim releases the
memory and then enables the OOM killer again by dropping the flag.

oom_kill_process is currently prone to a race condition when the OOM
victim is already exiting and TIF_MEMDIE is set after the task releases
its address space.  This might theoretically lead to OOM livelock if the
OOM victim blocks on an allocation later during exiting because it
wouldn't kill any other process and the exiting one won't be able to exit.
 The situation is highly unlikely because the OOM victim is expected to
release some memory which should help to sort out OOM situation.

Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
which will serialize the OOM killer with exit_mm which sets task->mm to
NULL.  Setting the flag for current is not necessary because check and set
is not racy.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agooom: don't count on mm-less current process
Tetsuo Handa [Wed, 11 Feb 2015 23:24:54 +0000 (15:24 -0800)]
oom: don't count on mm-less current process

out_of_memory() doesn't trigger the OOM killer if the current task is
already exiting or it has fatal signals pending, and gives the task
access to memory reserves instead.  However, doing so is wrong if
out_of_memory() is called by an allocation (e.g. from exit_task_work())
after the current task has already released its memory and cleared
TIF_MEMDIE at exit_mm().  If we again set TIF_MEMDIE to post-exit_mm()
current task, the OOM killer will be blocked by the task sitting in the
final schedule() waiting for its parent to reap it.  It will trigger an
OOM livelock if its parent is unable to reap it due to doing an
allocation and waiting for the OOM killer to kill it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm:add KPF_ZERO_PAGE flag for /proc/kpageflags
Wang, Yalin [Wed, 11 Feb 2015 23:24:51 +0000 (15:24 -0800)]
mm:add KPF_ZERO_PAGE flag for /proc/kpageflags

Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
detect zero_page in /proc/kpageflags, and then do memory analysis more
accurately.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: add VM_BUG_ON_PAGE() to page_mapcount()
Wang, Yalin [Wed, 11 Feb 2015 23:24:48 +0000 (15:24 -0800)]
mm: add VM_BUG_ON_PAGE() to page_mapcount()

Add VM_BUG_ON_PAGE() for slab pages.  _mapcount is an union with slab
struct in struct page, so we must avoid accessing _mapcount if this page
is a slab page.  Also remove the unneeded bracket.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomm: add fields for compound destructor and order into struct page
Kirill A. Shutemov [Wed, 11 Feb 2015 23:24:46 +0000 (15:24 -0800)]
mm: add fields for compound destructor and order into struct page

Currently, we use lru.next/lru.prev plus cast to access or set
destructor and order of compound page.

Let's replace it with explicit fields in struct page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agoMerge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6
Linus Torvalds [Wed, 11 Feb 2015 21:03:11 +0000 (13:03 -0800)]
Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6

Pull documentation updates from Jonathan Corbet:
 "Highlights this time around include:

   - A thrashing of SubmittingPatches to bring it out of the "send
     everything to Linus" era of kernel development.

   - A new document on completions from Nicholas McGuire

   - Lots of typo fixes, formatting improvements, corrections, build
     fixes, and more"

* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (35 commits)
  Documentation: Fix the wrong command `echo -1 > set_ftrace_pid` for cleaning the filter.
  can-doc: Fixed a wrong filepath in can.txt
  Documentation: Fix trivial typo in comment.
  kgdb,docs: Fix typo and minor style issues
  Documentation: add description for FTRACE probe status
  doc: brief user documentation for completion
  Documentation/misc-devices/mei: Fix indentation of embedded code.
  Documentation/misc-devices/mei: Fix indentation of enumeration.
  Documentation/misc-devices/mei: Fix spacing around parentheses.
  Documentation/misc-devices/mei: Fix formatting of headings.
  Documentation: devicetree: Fix double words in Doumentation/devicetree
  Documentation: mm: Fix typo in vm.txt
  lockstat: Add documentation on contention and contenting points
  Documentation: fix blackfin gptimers-example build errors
  Fixes column alignment in table of contents entry 1.9 in Documentation/filesystems/proc.txt
  CodingStyle: enable emacs display of trailing whitespace
  DocBook: Do not exceed argument list limit
  gpio: board.txt: Fix the gpio name example
  Documentation/SubmittingPatches: unify whitespace/tabs for the DCO
  MAINTAINERS: Add the docs-next git tree to the maintainer entry
  ...

9 years agoMerge branch 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu...
Linus Torvalds [Wed, 11 Feb 2015 20:56:40 +0000 (12:56 -0800)]
Merge branch 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration

Pull mailbox framework updates from Jassi Brar.

* 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
  mailbox: Add Altera mailbox driver
  mailbox: check for bit set before polling
  Mailbox: Fix return value check in pcc_init()

9 years agocipso: don't use IPCB() to locate the CIPSO IP option
Paul Moore [Wed, 11 Feb 2015 19:46:37 +0000 (14:46 -0500)]
cipso: don't use IPCB() to locate the CIPSO IP option

Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack.  While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit
971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.

This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB().  Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.

Cc: <stable@vger.kernel.org> # 3.18
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
9 years agoMerge tag 'pinctrl-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 11 Feb 2015 19:23:13 +0000 (11:23 -0800)]
Merge tag 'pinctrl-v3.20-1' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pincontrol updates from Linus Walleij:
 :This is the bulk of pin control changes for the v3.20 cycle:

  Framework changes and enhancements:
   - Passing -DDEBUG recursively to subdir drivers so we get debug
     messages properly turned on.
   - Infer map type from DT property in the groups parsing code in the
     generic pinconfig code.
   - Support for custom parameter passing in generic pin config.  This
     is used when you are using the generic pin config, but want to add
     a few custom properties that no other driver will use.

  New drivers:
   - Driver for the Xilinx Zynq
   - Driver for the AmLogic Meson SoCs

  New features in drivers:
   - Sleep support (suspend/resume) for the Cherryview driver
   - mvebeu a38x can now mux a UART on pins MPP19 and MPP20
   - Migrated the qualcomm driver to generic pin config handling of
     extended config options in the core code.
   - Support BUS1 and AUDIO in the Exynos pin controller.
   - Add some missing functions in the sun6i driver.
   - Add support for the A31S variant in the sun6i driver.
   - EMEv2 support in the Renesas PFC driver.
   - Add support for Qualcomm MSM8916 in the qcom driver.

  Deleted features
   - Drop support for the SiRF Marco that was never released to the
     market.
   - Drop SH7372 support as the support for this platform is removed
     from the kernel"

* tag 'pinctrl-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (40 commits)
  sh-pfc: emev2 - Fix mangled author name
  pinctrl: cherryview: Configure HiZ pins to be input when requested as GPIOs
  pinctrl: imx25: fix numbering for pins
  pinctrl: pinctrl-imx: don't use invalid value of conf_reg
  pinctrl: qcom: delete pin_config_get/set pinconf operations
  pinctrl: qcom: Add msm8916 pinctrl driver
  DT: pinctrl: Document Qualcomm MSM8916 pinctrl binding
  pinctrl: qcom: increase variable size for register offsets
  pinctrl: hide PCONFDUMP in #ifdef
  pinctrl: rockchip: Only mask interrupts; never disable
  pinctrl: zynq: Fix usb0 pins
  pinctrl: sh-pfc: sh7372: Remove DT binding documentation
  pinctrl: sh-pfc: sh7372: Remove PFC support
  sh-pfc: Add emev2 pinmux support
  sh-pfc: add macro to define pinmux without function
  pinctrl: add driver for Amlogic Meson SoCs
  staging: drivers: pinctrl: Fixed checkpatch.pl warnings
  pinctrl: exynos: Add AUDIO pin controller for exynos7
  sh-pfc: r8a7790: add MLB+ pin group
  sh-pfc: r8a7791: add MLB+ pin group
  ...

9 years agoMerge tag 'gpio-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 11 Feb 2015 19:17:34 +0000 (11:17 -0800)]
Merge tag 'gpio-v3.20-1' of git://git./linux/kernel/git/linusw/linux-gpio

Pull GPIO changes from Linus Walleij:
 "This is the GPIO bulk changes for the v3.20 series:

  GPIOLIB core changes:
   - Create and use of_mm_gpiochip_remove() for removing memory-mapped
     OF GPIO chips
   - GPIO MMIO library suppports bgpio_set_multiple for switching
     several lines at once, a feature merged in the last cycle.

  New drivers:
   - New driver for the APM X-gene standby GPIO controller
   - New driver for the Fujitsu MB86S7x GPIO controller

  Cleanups:
   - Moved rcar driver to use gpiolib irqchip
   - Moxart converted to the GPIO MMIO library
   - GE driver converted to GPIO MMIO library
   - Move sx150x to irqdomain
   - Move max732x to irqdomain
   - Move vx855 to use managed resources
   - Move dwapb to use managed resources
   - Clean tc3589x from platform data
   - Clean stmpe driver to use device tree only probe

  New subtypes:
   - sx1506 support in the sx150x driver
   - Quark 1000 SoC support in the SCH driver
   - Support X86 in the Xilinx driver
   - Support PXA1928 in the PXA driver

  Extended drivers:
   - max732x supports device tree probe
   - sx150x supports device tree probe

  Various minor cleanups and bug fixes"

* tag 'gpio-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (61 commits)
  gpio: kconfig: replace PPC_OF with PPC
  gpio: pxa: add PXA1928 gpio type support
  dt/bindings: gpio: add compatible string for marvell,pxa1928-gpio
  gpio: pxa: remove mach IRQ includes
  gpio: max732x: use an inline function for container cast
  gpio: use sizeof() instead of hardcoded values
  gpio: max732x: add set_multiple function
  gpio: sch: Consolidate similar algorithms
  gpio: tz1090-pdc: Use resource_size to fix off-by-one resource size calculation
  gpio: ge: Convert to use devm_kstrdup
  gpio: correctly use const char * const
  gpio: sx150x: fixup OF support
  gpio: mpc8xxx: Use of_mm_gpiochip_remove
  gpio: Add Fujitsu MB86S7x GPIO driver
  gpio: mpc8xxx: Convert to platform device interface.
  gpio: zevio: Use of_mm_gpiochip_remove
  gpio: gpio-mm-lantiq: Use of_mm_gpiochip_remove
  gpio: gpio-mm-lantiq: Use of_property_read_u32
  gpio: gpio-mm-lantiq: Do not replicate code
  gpio :gpio-mm-lantiq: Use devm_kzalloc
  ...