Rik van Riel [Fri, 5 Mar 2010 21:42:07 +0000 (13:42 -0800)]
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thiago Farina [Fri, 5 Mar 2010 21:42:04 +0000 (13:42 -0800)]
mm/memcontrol.c: fix "integer as NULL pointer" sparse warning
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer
Signed-off-by: Thiago Farina <tfransosi@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 5 Mar 2010 21:42:03 +0000 (13:42 -0800)]
include/linux/fs.h: convert FMODE_* constants to hex
It was tolerable until Eric went and added
8388608.
Cc: Eric Paris <eparis@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Fri, 5 Mar 2010 21:42:03 +0000 (13:42 -0800)]
readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.
In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.
POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.
So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).
In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!
Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> [2.6.33.x]
Cc: <qbarnes+nfs@yahoo-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Fri, 5 Mar 2010 21:42:01 +0000 (13:42 -0800)]
vfs: take f_lock on modifying f_mode after open time
We'll introduce FMODE_RANDOM which will be runtime modified. So protect
all runtime modification to f_mode with f_lock to avoid races.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> [2.6.33.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:42:00 +0000 (13:42 -0800)]
mm/migrate.c: kill anon local variable from migrate_page_copy
commit
01b1ae63c2 ("memcg: simple migration handling") removed
mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local
variable `anon' is now unused.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:59 +0000 (13:41 -0800)]
mm/mempolicy.c: fix indentation of the comments of do_migrate_pages
Currently, do_migrate_pages() have very long comment and this is not
indent properly. I often misunderstand it is function starting commnents
and confused it.
this patch fixes it.
note: this patch doesn't break 80 column rule. I guess original
author intended this indentaion, but an accident corrupted it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
akpm@linux-foundation.org [Fri, 5 Mar 2010 21:41:58 +0000 (13:41 -0800)]
memory-hotplug: create /sys/firmware/memmap entry for new memory
A memmap is a directory in sysfs which includes 3 text files: start, end
and type. For example:
start: 0x100000
end: 0x7e7b1cff
type: System RAM
Interface firmware_map_add was not called explicitly. Remove it and add
function firmware_map_add_hotplug as hotplug interface of memmap.
Each memory entry has a memmap in sysfs, When we hot-add new memory, sysfs
does not export memmap entry for it. We add a call in function add_memory
to function firmware_map_add_hotplug.
Add a new function add_sysfs_fw_map_entry() to create memmap entry, it
will be called when initialize memmap and hot-add memory.
[akpm@linux-foundation.org: un-kernedoc a no longer kerneldoc comment]
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:57 +0000 (13:41 -0800)]
mm: fix mbind vma merge problem
Strangely, current mbind() doesn't merge vma with neighbor vma although it's possible.
Unfortunately, many vma can reduce performance...
This patch fixes it.
reproduced program
----------------------------------------------------------------
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
static unsigned long pagesize;
int main(int argc, char** argv)
{
void* addr;
int ch;
int node;
struct bitmask *nmask = numa_allocate_nodemask();
int err;
int node_set = 0;
char buf[128];
while ((ch = getopt(argc, argv, "n:")) != -1){
switch (ch){
case 'n':
node = strtol(optarg, NULL, 0);
numa_bitmask_setbit(nmask, node);
node_set = 1;
break;
default:
;
}
}
argc -= optind;
argv += optind;
if (!node_set)
numa_bitmask_setbit(nmask, 0);
pagesize = getpagesize();
addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
MAP_ANON|MAP_PRIVATE, 0, 0);
if (addr == MAP_FAILED)
perror("mmap "), exit(1);
fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);
/* make page populate */
memset(addr, 0, pagesize*3);
/* first mbind */
err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
nmask->size, MPOL_MF_MOVE_ALL);
if (err)
error("mbind1 ");
/* second mbind */
err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
if (err)
error("mbind2 ");
sprintf(buf, "cat /proc/%d/maps", getpid());
system(buf);
return 0;
}
----------------------------------------------------------------
result without this patch
addr = 0x7fe26ef09000
[snip]
7fe26ef09000-
7fe26ef0a000 rw-p
00000000 00:00 0
7fe26ef0a000-
7fe26ef0b000 rw-p
00000000 00:00 0
7fe26ef0b000-
7fe26ef0c000 rw-p
00000000 00:00 0
7fe26ef0c000-
7fe26ef0d000 rw-p
00000000 00:00 0
=> 0x7fe26ef09000-0x7fe26ef0c000 have three vmas.
result with this patch
addr = 0x7fc9ebc76000
[snip]
7fc9ebc76000-
7fc9ebc7a000 rw-p
00000000 00:00 0
7fffbe690000-
7fffbe6a5000 rw-p
00000000 00:00 0 [stack]
=> 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma.
[minchan.kim@gmail.com: fix file offset passed to vma_merge()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:55 +0000 (13:41 -0800)]
mm: restore zone->all_unreclaimable to independence word
commit
e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.
Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.
[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Hong [Fri, 5 Mar 2010 21:41:54 +0000 (13:41 -0800)]
mm: remove free_hot_page()
free_hot_page() is just a wrapper around free_hot_cold_page() with
parameter 'cold = 0'. After adding a clear comment for
free_hot_cold_page(), it is reasonable to remove a level of call.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Hong [Fri, 5 Mar 2010 21:41:53 +0000 (13:41 -0800)]
mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct
Move a call of trace_mm_page_free_direct() from free_hot_page() to
free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
as it is done in function __free_pages_ok().
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Hong [Fri, 5 Mar 2010 21:41:52 +0000 (13:41 -0800)]
mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct
trace_mm_page_free_direct() is called in function __free_pages(). But it
is called again in free_hot_page() if order == 0 and produce duplicate
records in trace file for mm_page_free_direct event. As below:
K-PID CPU# TIMESTAMP FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=
ffffea0003db9f40 pfn=
1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=
ffffea0003db9f40 pfn=
1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=
ffffea0003db9f40 pfn=
1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=
ffffea0003db9f40 pfn=
1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=
ffffea0003db9f40 pfn=
1155800 order=0
This patch removes the first call and adds a call to
trace_mm_page_free_direct() in __free_pages_ok().
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Ming Chun <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:47 +0000 (13:41 -0800)]
mm, lockdep: annotate reclaim context to zone reclaim too
Commit
cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
context annotation. But it didn't annotate zone reclaim. This patch do
it.
The point is, commit
cf40bd16fd annotate __alloc_pages_direct_reclaim but
zone-reclaim doesn't use __alloc_pages_direct_reclaim.
current call graph is
__alloc_pages_nodemask
get_page_from_freelist
zone_reclaim()
__alloc_pages_slowpath
__alloc_pages_direct_reclaim
try_to_free_pages
Actually, if zone_reclaim_mode=1, VM never call
__alloc_pages_direct_reclaim in usual VM pressure.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:47 +0000 (13:41 -0800)]
vmscan: get_scan_ratio() cleanup
The get_scan_ratio() should have all scan-ratio related calculations.
Thus, this patch move some calculation into get_scan_ratio.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Fri, 5 Mar 2010 21:41:45 +0000 (13:41 -0800)]
vmscan: check high watermark after shrink zone
Kswapd checks that zone has sufficient pages free via zone_watermark_ok().
If any zone doesn't have enough pages, we set all_zones_ok to zero.
!all_zone_ok makes kswapd retry rather than sleeping.
I think the watermark check before shrink_zone() is pointless. Only after
kswapd has tried to shrink the zone is the check meaningful.
Move the check to after the call to shrink_zone().
[akpm@linux-foundation.org: fix comment, layout]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Fri, 5 Mar 2010 21:41:44 +0000 (13:41 -0800)]
mm: use rlimit helpers
Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.
I.e. either use rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE if not applicable.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:43 +0000 (13:41 -0800)]
mm: mlock_vma_pages_range() only return success or failure
Currently, mlock_vma_pages_range() only return len or 0. then current
error handling of mmap_region() is meaningless complex.
This patch makes simplify and makes consist with brk() code.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Fri, 5 Mar 2010 21:41:43 +0000 (13:41 -0800)]
mm: mlock_vma_pages_range() never return negative value
Currently, mlock_vma_pages_range() never return negative value. Then, we
can remove some worthless error check.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Fri, 5 Mar 2010 21:41:42 +0000 (13:41 -0800)]
mm: count swap usage
A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.
Besides we can count the number of swapents per a process by scanning
/proc/<pid>/smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)
This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc/<pid>/status file as
[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <=============== this.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Fri, 5 Mar 2010 21:41:40 +0000 (13:41 -0800)]
mm: avoid false sharing of mm_counter
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.
This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.
Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.
rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Fri, 5 Mar 2010 21:41:39 +0000 (13:41 -0800)]
mm: clean up mm_counter
Presently, per-mm statistics counter is defined by macro in sched.h
This patch modifies it to
- defined in mm.h as inlinf functions
- use array instead of macro's name creation.
This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Fri, 5 Mar 2010 21:41:38 +0000 (13:41 -0800)]
infiniband: use for_each_set_bit()
Replace open-coded loop with for_each_set_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Fri, 5 Mar 2010 21:41:37 +0000 (13:41 -0800)]
bitops: rename for_each_bit() to for_each_set_bit()
Rename for_each_bit to for_each_set_bit in the kernel source tree. To
permit for_each_clear_bit(), should that ever be added.
The patch includes a macro to map the old for_each_bit() onto the new
for_each_set_bit(). This is a (very) temporary thing to ease the migration.
[akpm@linux-foundation.org: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Fri, 5 Mar 2010 21:41:36 +0000 (13:41 -0800)]
timbgpio: fix build
Use of get_irq_chip_data() et al. requires including linux/irq.h
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Richard Röjfors <richard.rojfors@pelagicore.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro [Sat, 6 Mar 2010 18:41:07 +0000 (18:41 +0000)]
Fix a dumb typo - use of & instead of &&
We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")
Reported-by: Walter Sheets <w41ter@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 5 Mar 2010 22:35:40 +0000 (14:35 -0800)]
Merge branch 'slab-for-linus' of git://git./linux/kernel/git/penberg/slab-2.6
* 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
SLUB: Fix per-cpu merge conflict
failslab: add ability to filter slab caches
slab: fix regression in touched logic
dma kmalloc handling fixes
slub: remove impossible condition
slab: initialize unused alien cache entry as NULL at alloc_alien_cache().
SLUB: Make slub statistics use this_cpu_inc
SLUB: this_cpu: Remove slub kmem_cache fields
SLUB: Get rid of dynamic DMA kmalloc cache allocation
SLUB: Use this_cpu operations in slub
Linus Torvalds [Fri, 5 Mar 2010 21:25:45 +0000 (13:25 -0800)]
Merge branch 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (44 commits)
NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping
NFS: Clean up nfs_sync_mapping
NFS: Simplify nfs_wb_page()
NFS: Replace __nfs_write_mapping with sync_inode()
NFS: Simplify nfs_wb_page_cancel()
NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages
NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
NFS: Reduce the number of unnecessary COMMIT calls
NFS: Add a count of the number of unstable writes carried by an inode
NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c
nfs41 fix NFS4ERR_CLID_INUSE for exchange id
NFS: Fix an allocation-under-spinlock bug
SUNRPC: Handle EINVAL error returns from the TCP connect operation
NFSv4.1: Various fixes to the sequence flag error handling
nfs4: renewd renew operations should take/put a client reference
nfs41: renewd sequence operations should take/put client reference
nfs: prevent backlogging of renewd requests
nfs: kill renewd before clearing client minor version
NFS: Make close(2) asynchronous when closing NFS O_DIRECT files
NFS: Improve NFS iostat byte count accuracy for writes
...
Linus Torvalds [Fri, 5 Mar 2010 21:25:24 +0000 (13:25 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Add hardlink support to .u extension
9P2010.L handshake: .L protocol negotiation
9P2010.L handshake: Remove "dotu" variable
9P2010.L handshake: Add mount option
9P2010.L handshake: Add VFS flags
net/9p: Handle mount errors correctly.
net/9p: Remove MAX_9P_CHAN limit
net/9p: Add multi channel support.
Linus Torvalds [Fri, 5 Mar 2010 21:20:53 +0000 (13:20 -0800)]
Merge branch 'for_linus' of git://git./linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...
Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
Linus Torvalds [Fri, 5 Mar 2010 21:12:34 +0000 (13:12 -0800)]
Merge branch 'kvm-updates/2.6.34' of git://git./virt/kvm/kvm
* 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (145 commits)
KVM: x86: Add KVM_CAP_X86_ROBUST_SINGLESTEP
KVM: VMX: Update instruction length on intercepted BP
KVM: Fix emulate_sys[call, enter, exit]()'s fault handling
KVM: Fix segment descriptor loading
KVM: Fix load_guest_segment_descriptor() to inject page fault
KVM: x86 emulator: Forbid modifying CS segment register by mov instruction
KVM: Convert kvm->requests_lock to raw_spinlock_t
KVM: Convert i8254/i8259 locks to raw_spinlocks
KVM: x86 emulator: disallow opcode 82 in 64-bit mode
KVM: x86 emulator: code style cleanup
KVM: Plan obsolescence of kernel allocated slots, paravirt mmu
KVM: x86 emulator: Add LOCK prefix validity checking
KVM: x86 emulator: Check CPL level during privilege instruction emulation
KVM: x86 emulator: Fix popf emulation
KVM: x86 emulator: Check IOPL level during io instruction emulation
KVM: x86 emulator: fix memory access during x86 emulation
KVM: x86 emulator: Add Virtual-8086 mode of emulation
KVM: x86 emulator: Add group9 instruction decoding
KVM: x86 emulator: Add group8 instruction decoding
KVM: do not store wqh in irqfd
...
Trivial conflicts in Documentation/feature-removal-schedule.txt
Aneesh Kumar K.V [Fri, 5 Mar 2010 20:43:43 +0000 (14:43 -0600)]
fs/9p: Add hardlink support to .u extension
For regular file and directories we put the link
count in th extension field in a tagged string format.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Sripathi Kodi [Fri, 5 Mar 2010 18:51:04 +0000 (18:51 +0000)]
9P2010.L handshake: .L protocol negotiation
This patch adds 9P2010.L protocol negotiation with the server
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Sripathi Kodi [Fri, 5 Mar 2010 18:50:14 +0000 (18:50 +0000)]
9P2010.L handshake: Remove "dotu" variable
Removes 'dotu' variable and make everything dependent
on 'proto_version' field.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Sripathi Kodi [Fri, 5 Mar 2010 18:49:11 +0000 (18:49 +0000)]
9P2010.L handshake: Add mount option
Add new mount V9FS mount option to specify protocol version
This patch adds a new mount option to specify protocol version.
With this option it is possible to use "-o version=" switch to
specify 9P protocol version to use. Valid options for version
are:
9p2000
9p2000.u
9p2010.L
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Sripathi Kodi [Fri, 5 Mar 2010 18:48:00 +0000 (18:48 +0000)]
9P2010.L handshake: Add VFS flags
Add 9P2000.u and 9P2010.L protocol flags to V9FS VFS
This patch adds 9P2000.u and 9P2010.L protocol flags into V9FS VFS side code
and removes the single flag used for 'extended'.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Aneesh Kumar K.V [Mon, 15 Feb 2010 17:27:02 +0000 (17:27 +0000)]
net/9p: Handle mount errors correctly.
With this patch we have
# mount -t 9p -o trans=virtio virtio2 /mnt/
# mount -t 9p -o trans=virtio virtio2 /mnt/
mount: virtio2 already mounted or /mnt/ busy
mount: according to mtab, virtio2 is already mounted on /mnt
# mount -t 9p -o trans=virtio virtio3 /mnt/ -o debug=0xfff
mount: special device virtio3 does not exist
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Aneesh Kumar K.V [Mon, 15 Feb 2010 17:27:01 +0000 (17:27 +0000)]
net/9p: Remove MAX_9P_CHAN limit
Use a list to track the channel instead of statically
allocated array
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Aneesh Kumar K.V [Mon, 15 Feb 2010 17:27:00 +0000 (17:27 +0000)]
net/9p: Add multi channel support.
This is needed for supporting multiple mount points.
We can find out the device names to be used with mount by checking
/sys/devices/virtio-pci/virtio*/device file
if the device file have value 9 then the specific virtio device can
be used for mounting.
ex:
#cat /sys/devices/virtio-pci/virtio1/device
9
now we can mount using
# mount -t 9p -o trans=virtio virtio1 /mnt/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Trond Myklebust [Fri, 5 Mar 2010 20:46:18 +0000 (15:46 -0500)]
Merge branch 'writeback-for-2.6.34' into nfs-for-2.6.34
Trond Myklebust [Sat, 20 Feb 2010 01:03:30 +0000 (17:03 -0800)]
NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:03:29 +0000 (17:03 -0800)]
NFS: Clean up nfs_sync_mapping
Remove the redundant call to filemap_write_and_wait().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:03:28 +0000 (17:03 -0800)]
NFS: Simplify nfs_wb_page()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:03:26 +0000 (17:03 -0800)]
NFS: Replace __nfs_write_mapping with sync_inode()
Now that we have correct COMMIT semantics in writeback_single_inode, we can
reduce and simplify nfs_wb_all(). Also replace nfs_wb_nocommit() with a
call to filemap_write_and_wait(), which doesn't need to hold the
inode->i_mutex.
With that done, we can eliminate nfs_write_mapping() altogether.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:03:21 +0000 (17:03 -0800)]
NFS: Simplify nfs_wb_page_cancel()
In all cases we should be able to just remove the request and call
cancel_dirty_page().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:03:18 +0000 (17:03 -0800)]
NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages
Since nfs_scan_list() doesn't wait for locked pages, we have a race in
which it is possible to end up with an inode that needs to send a COMMIT,
but which does not have the I_DIRTY_DATASYNC flag set.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 01:02:24 +0000 (17:02 -0800)]
NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Trond Myklebust [Sat, 20 Feb 2010 01:00:02 +0000 (17:00 -0800)]
NFS: Reduce the number of unnecessary COMMIT calls
If the caller is doing a non-blocking flush, and there are still writebacks
pending on the wire, we can usually defer the COMMIT call until those
writes are done.
Also ensure that we honour the wbc->nonblocking flag.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 00:53:39 +0000 (16:53 -0800)]
NFS: Add a count of the number of unstable writes carried by an inode
In order to know when we should do opportunistic commits of the unstable
writes, when the VM is doing a background flush, we add a field to count
the number of unstable writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Sat, 20 Feb 2010 00:46:56 +0000 (16:46 -0800)]
NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c
The sole purpose of nfs_write_inode is to commit unstable writes, so
move it into fs/nfs/write.c, and make nfs_commit_inode static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Linus Torvalds [Fri, 5 Mar 2010 19:53:53 +0000 (11:53 -0800)]
Merge branch 'write_inode2' of git://git./linux/kernel/git/viro/vfs-2.6
* 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
pass writeback_control to ->write_inode
make sure data is on disk before calling ->write_inode
Linus Torvalds [Fri, 5 Mar 2010 19:46:31 +0000 (11:46 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Switch !O_CREAT case to use of do_last()
Get rid of symlink body copying
Finish pulling of -ESTALE handling to upper level in do_filp_open()
Turn do_link spaghetty into a normal loop
Unify exits in O_CREAT handling
Kill is_link argument of do_last()
Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open()
Leave mangled flag only for setting nd.intent.open.flag
Get rid of passing mangled flag to do_last()
Don't pass mangled open_flag to finish_open()
pull more into do_last()
bail out with ELOOP earlier in do_link loop
pull the common predecessors into do_last()
postpone __putname() until after do_last()
unroll do_last: loop in do_filp_open()
Shift releasing nd->root from do_last() to its caller
gut do_filp_open() a bit more (do_last separation)
beginning to untangle do_filp_open()
Randy Dunlap [Fri, 5 Mar 2010 17:52:52 +0000 (09:52 -0800)]
x86: fix mtrr missing kernel-doc
Fix missing kernel-doc notation in mtrr/main.c:
Warning(arch/x86/kernel/cpu/mtrr/main.c:152): No description found for parameter 'info'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 5 Mar 2010 18:50:22 +0000 (10:50 -0800)]
Merge branch 'perf-probes-for-linus-2' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Issue at least one memory barrier in stop_machine_text_poke()
perf probe: Correct probe syntax on command line help
perf probe: Add lazy line matching support
perf probe: Show more lines after last line
perf probe: Check function address range strictly in line finder
perf probe: Use libdw callback routines
perf probe: Use elfutils-libdw for analyzing debuginfo
perf probe: Rename probe finder functions
perf probe: Fix bugs in line range finder
perf probe: Update perf probe document
perf probe: Do not show --line option without dwarf support
kprobes: Add documents of jump optimization
kprobes/x86: Support kprobes jump optimization on x86
x86: Add text_poke_smp for SMP cross modifying code
kprobes/x86: Cleanup save/restore registers
kprobes/x86: Boost probes when reentering
kprobes: Jump optimization sysctl interface
kprobes: Introduce kprobes jump optimization
kprobes: Introduce generic insn_slot framework
kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
Linus Torvalds [Fri, 5 Mar 2010 18:47:57 +0000 (10:47 -0800)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
padata: Allocate the cpumask for the padata instance
crypto: authenc - Move saved IV in front of the ablkcipher request
crypto: hash - Fix handling of unaligned buffers
crypto: authenc - Use correct ahash complete functions
crypto: md5 - Set statesize
Linus Torvalds [Fri, 5 Mar 2010 18:47:00 +0000 (10:47 -0800)]
Merge branch 'for_linus' of git://git./linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits)
ext4: fix up rb_root initializations to use RB_ROOT
ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
ext4: Fix the NULL reference in double_down_write_data_sem()
ext4: Fix insertion point of extent in mext_insert_across_blocks()
ext4: consolidate in_range() definitions
ext4: cleanup to use ext4_grp_offs_to_block()
ext4: cleanup to use ext4_group_first_block_no()
ext4: Release page references acquired in ext4_da_block_invalidatepages
ext4: Fix ext4_quota_write cross block boundary behaviour
ext4: Convert BUG_ON checks to use ext4_error() instead
ext4: Use direct_IO_no_locking in ext4 dio read
ext4: use ext4_get_block_write in buffer write
ext4: mechanical rename some of the direct I/O get_block's identifiers
ext4: make "offset" consistent in ext4_check_dir_entry()
ext4: Handle non empty on-disk orphan link
ext4: explicitly remove inode from orphan list after failed direct io
ext4: fix error handling in migrate
ext4: deprecate obsoleted mount options
ext4: Fix fencepost error in chosing choosing group vs file preallocation.
jbd2: clean up an assertion in jbd2_journal_commit_transaction()
...
Linus Torvalds [Fri, 5 Mar 2010 18:46:04 +0000 (10:46 -0800)]
Merge git://git./linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: get rid of obsolete definition in header file
Squashfs: get rid of obsolete variable in struct squashfs_sb_info
Squashfs: add decompressor entries for lzma and lzo
Squashfs: add a decompressor framework
Squashfs: factor out remaining zlib dependencies into separate wrapper file
Squashfs: move zlib decompression wrapper code into a separate file
Christoph Hellwig [Fri, 5 Mar 2010 08:21:37 +0000 (09:21 +0100)]
pass writeback_control to ->write_inode
This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 5 Mar 2010 08:21:21 +0000 (09:21 +0100)]
make sure data is on disk before calling ->write_inode
Similar to the fsync issue fixed a while ago in commit
2daea67e966dc0c42067ebea015ddac6834cef88 we need to write for data to
actually hit the disk before writing out the metadata to guarantee
data integrity for filesystems that modify the inode in the data I/O
completion path. Currently XFS and NFS handle this manually, and AFS
has a write_inode method that does nothing but waiting for data, while
others are possibly missing out on this.
Fortunately this change has a lot less impact than the fsync change
as none of the write_inode methods starts data writeout of any form
by itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Phillip Lougher [Thu, 25 Feb 2010 01:31:13 +0000 (01:31 +0000)]
Squashfs: get rid of obsolete definition in header file
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Phillip Lougher [Thu, 25 Feb 2010 00:54:48 +0000 (00:54 +0000)]
Squashfs: get rid of obsolete variable in struct squashfs_sb_info
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Al Viro [Sat, 26 Dec 2009 15:56:19 +0000 (10:56 -0500)]
Switch !O_CREAT case to use of do_last()
... and now we have all intents crap well localized
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 13:37:05 +0000 (08:37 -0500)]
Get rid of symlink body copying
Now that nd->last stays around until ->put_link() is called, we can
just postpone that ->put_link() in do_filp_open() a bit and don't
bother with copying.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 12:21:48 +0000 (07:21 -0500)]
Finish pulling of -ESTALE handling to upper level in do_filp_open()
Don't bother with path_walk() (and its retry loop); link_path_walk()
will do it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 12:16:40 +0000 (07:16 -0500)]
Turn do_link spaghetty into a normal loop
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 12:09:49 +0000 (07:09 -0500)]
Unify exits in O_CREAT handling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 12:04:50 +0000 (07:04 -0500)]
Kill is_link argument of do_last()
We set it to 1 iff we return NULL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Dec 2009 12:01:01 +0000 (07:01 -0500)]
Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open()
Note that in case of !O_CREAT we know that nd.root has already been given up
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 12:15:41 +0000 (07:15 -0500)]
Leave mangled flag only for setting nd.intent.open.flag
Nothing else uses it anymore
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 11:51:13 +0000 (06:51 -0500)]
Get rid of passing mangled flag to do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 11:49:47 +0000 (06:49 -0500)]
Don't pass mangled open_flag to finish_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 08:39:50 +0000 (03:39 -0500)]
pull more into do_last()
Handling of LAST_DOT/LAST_ROOT/LAST_DOTDOT/terminating slash
can be pulled in as well
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 07:27:30 +0000 (02:27 -0500)]
bail out with ELOOP earlier in do_link loop
If we'd passed through 32 trailing symlinks already, there's
no sense following the 33rd - we'll bail out anyway. Better
bugger off earlier.
It *does* change behaviour, after a fashion - if the 33rd happens
to be a procfs-style symlink, original code *would* allow it.
This one will not. Cry me a river if that hurts you. Please, do.
And post a video of that, while you are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 07:12:06 +0000 (02:12 -0500)]
pull the common predecessors into do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 07:08:19 +0000 (02:08 -0500)]
postpone __putname() until after do_last()
Since do_last() doesn't mangle nd->last_name, we can safely postpone
__putname() done in handling of trailing symlinks until after the
call of do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 07:05:43 +0000 (02:05 -0500)]
unroll do_last: loop in do_filp_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 07:02:38 +0000 (02:02 -0500)]
Shift releasing nd->root from do_last() to its caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 06:58:28 +0000 (01:58 -0500)]
gut do_filp_open() a bit more (do_last separation)
Brute-force separation of stuff reachable from do_last: with
the exception of do_link:; just take all that crap to a helper
function as-is and have it tell the caller if it has to go
to do_link.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 24 Dec 2009 06:26:48 +0000 (01:26 -0500)]
beginning to untangle do_filp_open()
That's going to be a long and painful series. The first step:
take the stuff reachable from 'ok' label in do_filp_open() into
a new helper (finish_open()).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Venkatesh Pallipadi [Fri, 5 Mar 2010 03:25:21 +0000 (22:25 -0500)]
ext4: fix up rb_root initializations to use RB_ROOT
ext4 uses rb_node = NULL; to zero rb_root at few places. Using
RB_ROOT as the initializer is more portable in case the underlying
implementation of rbtrees changes in the future.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Paris <eparis@redhat.com>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:08 +0000 (09:05 -0500)]
quota: stop using QUOTA_OK / NO_QUOTA
Just use 0 / -EDQUOT directly - that's what it translates to anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:07 +0000 (09:05 -0500)]
dquot: cleanup dquot initialize routine
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:06 +0000 (09:05 -0500)]
dquot: move dquot initialization responsibility into the filesystem
Currently various places in the VFS call vfs_dq_init directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the initialization. For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.
For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.
For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.
Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:05 +0000 (09:05 -0500)]
dquot: cleanup dquot drop routine
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:04 +0000 (09:05 -0500)]
dquot: move dquot drop responsibility into the filesystem
Currently clear_inode calls vfs_dq_drop directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the drop inside the ->clear_inode
superblock operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:03 +0000 (09:05 -0500)]
dquot: cleanup dquot transfer routine
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:02 +0000 (09:05 -0500)]
dquot: move dquot transfer responsibility into the filesystem
Currently notify_change calls vfs_dq_transfer directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the transfer. Most filesystems already
do this, only ufs and udf need the code added, and for jfs it needs to
be enabled unconditionally instead of only when ACLs are enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:01 +0000 (09:05 -0500)]
dquot: cleanup inode allocation / freeing routines
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.
Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Wed, 3 Mar 2010 14:05:00 +0000 (09:05 -0500)]
dquot: cleanup space allocation / freeing routines
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.
Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not. Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Dmitry Monakhov [Tue, 2 Mar 2010 12:51:02 +0000 (15:51 +0300)]
ext3: add writepage sanity checks
- There is theoretical possibility to perform writepage on
RO superblock. Add explicit check for what case.
- Page must being locked before writepage.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Jan Kara [Mon, 1 Mar 2010 13:02:37 +0000 (14:02 +0100)]
ext3: Truncate allocated blocks if direct IO write fails to update i_size
We have to truncate blocks allocated to file during direct IO when we
fail to update i_size properly.
Signed-off-by: Jan Kara <jack@suse.cz>
Jan Kara [Mon, 22 Feb 2010 20:07:17 +0000 (21:07 +0100)]
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
Sometimes invalidate_bdev() can fail to invalidate a part of block
device cache because of dirty data. If the filesystem has blocksize
smaller than page size, this can happen even for pages containing
quota files and thus kernel would operate on stale data. Fix the
issue by syncing the filesystem before invalidating the cache.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Dmitry Monakhov [Tue, 16 Feb 2010 05:31:50 +0000 (08:31 +0300)]
quota: generalize quota transfer interface
Current quota transfer interface support only uid/gid.
This patch extend interface in order to support various quotas types
The goal is accomplished without changes in most frequently used
vfs_dq_transfer() func.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Dmitry Monakhov [Tue, 16 Feb 2010 05:31:49 +0000 (08:31 +0300)]
quota: sb_quota state flags cleanup
- remove hardcoded USRQUOTA/GRPQUOTA flags
- convert int to bool for appropriate functions
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Jan Kara [Tue, 16 Feb 2010 19:37:12 +0000 (20:37 +0100)]
jbd: Delay discarding buffers in journal_unmap_buffer
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.
This patch is a backport of JBD2 fix by dingdinghua <dingdinghua@nrchpc.ac.cn>.
Signed-off-by: Jan Kara <jack@suse.cz>
Dmitry Monakhov [Tue, 16 Feb 2010 16:33:42 +0000 (19:33 +0300)]
ext3: quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext3_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal credits
reservation. And later bug_on triggering. As soon this never happen the
boundary cross loop is NOOP. In order to make things straight
let's remove this loop and assert cross boundary condition.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Tue, 16 Feb 2010 08:44:56 +0000 (03:44 -0500)]
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
We already do these checks in the generic code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Tue, 16 Feb 2010 08:44:55 +0000 (03:44 -0500)]
quota: split out compat_sys_quotactl support from quota.c
Instead of adding ifdefs just split it into a new file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Tue, 16 Feb 2010 08:44:54 +0000 (03:44 -0500)]
quota: split out netlink notification support from quota.c
Instead of adding ifdefs just split it into a new file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig [Tue, 16 Feb 2010 08:44:53 +0000 (03:44 -0500)]
quota: remove invalid optimization from quota_sync_all
Checking the "VFS" quota enabled and dirty bits from generic code means
this code will never get called for other implementations, e.g. XFS and
GFS2. Grabbing the reference on the superblock really isn't much overhead
for a global Q_SYNC call, so just drop this optimization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>