Serge E. Hallyn [Wed, 23 Mar 2011 23:43:17 +0000 (16:43 -0700)]
userns: security: make capabilities relative to the user namespace
- Introduce ns_capable to test for a capability in a non-default
user namespace.
- Teach cap_capable to handle capabilities in a non-default
user namespace.
The motivation is to get to the unprivileged creation of new
namespaces. It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.
I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.
Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created. THis is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator. Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case
01/11/2011: [serge] add task_ns_capable helper
01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
02/16/2011: [serge] fix a logic bug: the root user is always creator of
init_user_ns, but should not always have capabilities to
it! Fix the check in cap_capable().
02/21/2011: Add the required user_ns parameter to security_capable,
fixing a compile failure.
02/23/2011: Convert some macros to functions as per akpm comments. Some
couldn't be converted because we can't easily forward-declare
them (they are inline if !SECURITY, extern if SECURITY). Add
a current_user_ns function so we can use it in capability.h
without #including cred.h. Move all forward declarations
together to the top of the #ifdef __KERNEL__ section, and use
kernel-doc format.
02/23/2011: Per dhowells, clean up comment in cap_capable().
02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable.
(Original written and signed off by Eric; latest, modified version
acked by him)
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: export current_user_ns() for ecryptfs]
[serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Wed, 23 Mar 2011 23:43:16 +0000 (16:43 -0700)]
userns: add a user_namespace as creator/owner of uts_namespace
The expected course of development for user namespaces targeted
capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.
Goals:
- Make it safe for an unprivileged user to unshare namespaces. They
will be privileged with respect to the new namespace, but this should
only include resources which the unprivileged user already owns.
- Provide separate limits and accounting for userids in different
namespaces.
Status:
Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
CAP_SETGID capabilities. What this gets you is a whole new set of
userids, meaning that user 500 will have a different 'struct user' in
your namespace than in other namespaces. So any accounting information
stored in struct user will be unique to your namespace.
However, throughout the kernel there are checks which
- simply check for a capability. Since root in a child namespace
has all capabilities, this means that a child namespace is not
constrained.
- simply compare uid1 == uid2. Since these are the integer uids,
uid 500 in namespace 1 will be said to be equal to uid 500 in
namespace 2.
As a result, the lxc implementation at lxc.sf.net does not use user
namespaces. This is actually helpful because it leaves us free to
develop user namespaces in such a way that, for some time, user
namespaces may be unuseful.
Bugs aside, this patchset is supposed to not at all affect systems which
are not actively using user namespaces, and only restrict what tasks in
child user namespace can do. They begin to limit privilege to a user
namespace, so that root in a container cannot kill or ptrace tasks in the
parent user namespace, and can only get world access rights to files.
Since all files currently belong to the initila user namespace, that means
that child user namespaces can only get world access rights to *all*
files. While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.
I've run the 'runltplite.sh' with and without this patchset and found no
difference.
This patch:
copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
will have the new user namespace as its owner. That is what we want,
since we want root in that new userns to be able to have privilege over
it.
Changelog:
Feb 15: don't set uts_ns->user_ns if we didn't create
a new uts_ns.
Feb 23: Move extern init_user_ns declaration from
init/version.c to utsname.h.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 23 Mar 2011 23:43:14 +0000 (16:43 -0700)]
procfs: kill the global proc_mnt variable
After the previous cleanup in proc_get_sb() the global proc_mnt has no
reasons to exists, kill it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric W. Biederman [Wed, 23 Mar 2011 23:43:13 +0000 (16:43 -0700)]
pidns: call pid_ns_prepare_proc() from create_pid_namespace()
Reorganize proc_get_sb() so it can be called before the struct pid of the
first process is allocated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric W. Biederman [Wed, 23 Mar 2011 23:43:12 +0000 (16:43 -0700)]
pid: remove the child_reaper special case in init/main.c
This patchset is a cleanup and a preparation to unshare the pid namespace.
These prerequisites prepare for Eric's patchset to give a file descriptor
to a namespace and join an existing namespace.
This patch:
It turns out that the existing assignment in copy_process of the
child_reaper can handle the initial assignment of child_reaper we just
need to generalize the test in kernel/fork.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Richard Weinberger [Wed, 23 Mar 2011 23:43:11 +0000 (16:43 -0700)]
sysctl: restrict write access to dmesg_restrict
When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel
ring buffer. But a root user without CAP_SYS_ADMIN is able to reset
dmesg_restrict to 0.
This is an issue when e.g. LXC (Linux Containers) are used and complete
user space is running without CAP_SYS_ADMIN. A unprivileged and jailed
root user can bypass the dmesg_restrict protection.
With this patch writing to dmesg_restrict is only allowed when root has
CAP_SYS_ADMIN.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Dan Rosenberg <drosenberg@vsecurity.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Holasek [Wed, 23 Mar 2011 23:43:09 +0000 (16:43 -0700)]
sysctl: add some missing input constraint checks
Add boundaries of allowed input ranges for: dirty_expire_centisecs,
drop_caches, overcommit_memory, page-cluster and panic_on_oom.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Acked-by: Dave Young <hidave.darkstar@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Denis Kirjanov [Wed, 23 Mar 2011 23:43:08 +0000 (16:43 -0700)]
sysctl_check: drop dead code
Drop dead code.
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Denis Kirjanov [Wed, 23 Mar 2011 23:43:08 +0000 (16:43 -0700)]
sysctl_check: drop table->procname checks
Since the for loop checks for the table->procname drop useless
table->procname checks inside the loop body
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Wed, 23 Mar 2011 23:43:07 +0000 (16:43 -0700)]
rapidio: fix potential null deref on failure path
If rio is not a switch then "rswitch" is null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:06 +0000 (16:43 -0700)]
rapidio: remove mport resource reservation from common RIO code
Removes resource reservation from the common sybsystem initialization code
and make it part of mport driver initialization. This resolves conflict
with resource reservation by device specific mport drivers.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:05 +0000 (16:43 -0700)]
rapidio: modify mport ID assignment
Changes mport ID and host destination ID assignment to implement unified
method common to all mport drivers. Makes "riohdid=" kernel command line
parameter common for all architectures with support for more that one host
destination ID assignment.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:04 +0000 (16:43 -0700)]
rapidio: modify subsystem and driver initialization sequence
Subsystem initialization sequence modified to support presence of multiple
RapidIO controllers in the system. The new sequence is compatible with
initialization of PCI devices.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:03 +0000 (16:43 -0700)]
rapidio: modify configuration to support PCI-SRIO controller
1. Add an option to include RapidIO support if the PCI is available.
2. Add FSL_RIO configuration option to enable controller selection.
3. Add RapidIO support option into x86 and MIPS architectures.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:02 +0000 (16:43 -0700)]
rapidio: add architecture specific callbacks
This set of patches eliminates RapidIO dependency on PowerPC architecture
and makes it available to other architectures (x86 and MIPS). It also
enables support of new platform independent RapidIO controllers such as
PCI-to-SRIO and PCI Express-to-SRIO.
This patch:
Extend number of mport callback functions to eliminate direct linking of
architecture specific mport operations.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:43:00 +0000 (16:43 -0700)]
rapidio: add RapidIO documentation
Add RapidIO documentation files as it was discussed earlier (see thread
http://marc.info/?l=linux-kernel&m=
129202338918062&w=2)
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Wed, 23 Mar 2011 23:42:59 +0000 (16:42 -0700)]
rapidio: add new sysfs attributes
Add new sysfs attributes.
1. Routing information required to to reach the RIO device:
destid - device destination ID (real for for endpoint, route for switch)
hopcount - hopcount for maintenance requests (switches only)
2. device linking information:
lprev - name of device that precedes the given device in the enumeration
or discovery order (displayed along with of the port to which it
is attached).
lnext - names of devices (with corresponding port numbers) that are
attached to the given device as next in the enumeration or
discovery order (switches only)
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changli Gao [Wed, 23 Mar 2011 23:42:58 +0000 (16:42 -0700)]
drivers/char/mem.c: clean up the code
Reduce the lines of code and simplify the logic.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Wed, 23 Mar 2011 23:42:57 +0000 (16:42 -0700)]
drivers/staging/tty/specialix.c: convert func_enter to func_exit
Convert calls to func_enter on leaving a function to func_exit.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
@@
- func_enter();
+ func_exit();
return...;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Roger Wolff <R.E.Wolff@BitWizard.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Wed, 23 Mar 2011 23:42:56 +0000 (16:42 -0700)]
drivers/tty/bfin_jtag_comm.c: avoid calling put_tty_driver on NULL
put_tty_driver calls tty_driver_kref_put on its argument, and then
tty_driver_kref_put calls kref_put on the address of a field of this
argument. kref_put checks for NULL, but in this case the field is likely
to have some offset and so the result of taking its address will not be
NULL. Labels are added to be able to skip over the call to put_tty_driver
when the argument will be NULL.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression *x;
@@
*if (x == NULL)
{ ...
* put_tty_driver(x);
...
return ...;
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Torben Hohn <torbenh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Niranjana Vishwanathapura [Wed, 23 Mar 2011 23:42:55 +0000 (16:42 -0700)]
drivers/char: add MSM smd_pkt driver
Add smd_pkt driver which provides device interface to smd packet ports.
Signed-off-by: Niranjana Vishwanathapura <nvishwan@codeaurora.org>
Cc: Brian Swetland <swetland@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Brown <davidb@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergey Senozhatsky [Wed, 23 Mar 2011 23:42:54 +0000 (16:42 -0700)]
drivers/char/ipmi/ipmi_si_intf.c: fix cleanup_one_si section mismatch
commit
d2478521afc2022 ("char/ipmi: fix OOPS caused by
pnp_unregister_driver on unregistered driver") introduced a section
mismatch by calling __exit cleanup_ipmi_si from __devinit init_ipmi_si.
Remove __exit annotation from cleanup_ipmi_si.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 23 Mar 2011 23:42:53 +0000 (16:42 -0700)]
proc: protect mm start_code/end_code in /proc/pid/stat
While mm->start_stack was protected from cross-uid viewing (commit
f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
processes")), the start_code and end_code values were not. This would
allow the text location of a PIE binary to leak, defeating ASLR.
Note that the value "1" is used instead of "0" for a protected value since
"ps", "killall", and likely other readers of /proc/pid/stat, take
start_code of "0" to mean a kernel thread and will misbehave. Thanks to
Brad Spengler for pointing this out.
Addresses CVE-2011-0726
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: <stable@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Wed, 23 Mar 2011 23:42:52 +0000 (16:42 -0700)]
proc: make struct proc_dir_entry::namelen unsigned int
1. namelen is declared "unsigned short" which hints for "maybe space savings".
Indeed in 2.4 struct proc_dir_entry looked like:
struct proc_dir_entry {
unsigned short low_ino;
unsigned short namelen;
Now, low_ino is "unsigned int", all savings were gone for a long time.
"struct proc_dir_entry" is not that countless to worry about it's size,
anyway.
2. converting from unsigned short to int/unsigned int can only create
problems, we better play it safe.
Space is not really conserved, because of natural alignment for the next
field. sizeof(struct proc_dir_entry) remains the same.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jovi Zhang [Wed, 23 Mar 2011 23:42:51 +0000 (16:42 -0700)]
procfs: fix some wrong error code usage
[root@wei 1]# cat /proc/1/mem
cat: /proc/1/mem: No such process
error code -ESRCH is wrong in this situation. Return -EPERM instead.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aaro Koskinen [Wed, 23 Mar 2011 23:42:50 +0000 (16:42 -0700)]
procfs: fix /proc/<pid>/maps heap check
The current code fails to print the "[heap]" marking if the heap is split
into multiple mappings.
Fix the check so that the marking is displayed in all possible cases:
1. vma matches exactly the heap
2. the heap vma is merged e.g. with bss
3. the heap vma is splitted e.g. due to locked pages
Test cases. In all cases, the process should have mapping(s) with
[heap] marking:
(1) vma matches exactly the heap
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
int main (void)
{
if (sbrk(4096) != (void *)-1) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}
# ./test1
check /proc/553/maps
[1] + Stopped ./test1
# cat /proc/553/maps | head -4
00008000-
00009000 r-xp
00000000 01:00
3113640 /test1
00010000-
00011000 rw-p
00000000 01:00
3113640 /test1
00011000-
00012000 rw-p
00000000 00:00 0 [heap]
4006f000-
40070000 rw-p
00000000 00:00 0
(2) the heap vma is merged
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
char foo[4096] = "foo";
char bar[4096];
int main (void)
{
if (sbrk(4096) != (void *)-1) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}
# ./test2
check /proc/556/maps
[2] + Stopped ./test2
# cat /proc/556/maps | head -4
00008000-
00009000 r-xp
00000000 01:00
3116312 /test2
00010000-
00012000 rw-p
00000000 01:00
3116312 /test2
00012000-
00014000 rw-p
00000000 00:00 0 [heap]
4004a000-
4004b000 rw-p
00000000 00:00 0
(3) the heap vma is splitted (this fails without the patch)
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
int main (void)
{
if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
(sbrk(4096) != (void *)-1)) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}
# ./test3
check /proc/559/maps
[1] + Stopped ./test3
# cat /proc/559/maps|head -4
00008000-
00009000 r-xp
00000000 01:00
3119108 /test3
00010000-
00011000 rw-p
00000000 01:00
3119108 /test3
00011000-
00012000 rw-p
00000000 00:00 0 [heap]
00012000-
00013000 rw-p
00000000 00:00 0 [heap]
It looks like the bug has been there forever, and since it only results in
some information missing from a procfile, it does not fulfil the -stable
"critical issue" criteria.
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Wed, 23 Mar 2011 23:42:48 +0000 (16:42 -0700)]
proc: hide kernel addresses via %pK in /proc/<pid>/stack
This file is readable for the task owner. Hide kernel addresses from
unprivileged users, leave them function names and offsets.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Kees Cook <kees.cook@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Wed, 23 Mar 2011 23:42:48 +0000 (16:42 -0700)]
cpuset: hold callback_mutex in cpuset_post_clone()
Chaning cpuset->mems/cpuset->cpus should be protected under
callback_mutex.
cpuset_clone() doesn't follow this rule. It's ok because it's
called when creating and initializing a cgroup, but we'd better
hold the lock to avoid subtil break in the future.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Wed, 23 Mar 2011 23:42:47 +0000 (16:42 -0700)]
cpuset: fix unchecked calls to NODEMASK_ALLOC()
Those functions that use NODEMASK_ALLOC() can't propagate errno
to users, but will fail silently.
Fix it by using a static nodemask_t variable for each function, and
those variables are protected by cgroup_mutex;
[akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Wed, 23 Mar 2011 23:42:46 +0000 (16:42 -0700)]
cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach()
oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
to cpuset_migrate_mm().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Wed, 23 Mar 2011 23:42:45 +0000 (16:42 -0700)]
cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist()
It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().
As spotted by Paul, a side effect is we fix a bug that the function can
return -ENOMEM but the caller doesn't expect negative return value.
Therefore change the return value of cpuset_sprintf_cpulist() and
cpuset_sprintf_memlist() from int to size_t.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 23 Mar 2011 23:42:44 +0000 (16:42 -0700)]
memcg: give current access to memory reserves if it's trying to die
When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.
This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Wed, 23 Mar 2011 23:42:42 +0000 (16:42 -0700)]
memcg: fix leak on wrong LRU with FUSE
fs/fuse/dev.c::fuse_try_move_page() does
(1) remove a page by ->steal()
(2) re-add the page to page cache
(3) link the page to LRU if it was not on LRU at (1)
This implies the page is _on_ LRU when it's added to radix-tree. So, the
page is added to memory cgroup while it's on LRU. because LRU is lazy and
no one flushs it.
This is the same behavior as SwapCache and needs special care as
- remove page from LRU before overwrite pc->mem_cgroup.
- add page to LRU after overwrite pc->mem_cgroup.
And we need to taking care of pagevec.
If PageLRU(page) is set before we add PCG_USED bit, the page will not be
added to memcg's LRU (in short period). So, regardlress of PageLRU(page)
value before commit_charge(), we need to check PageLRU(page) after
commit_charge().
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Balbir Singh <balbir@in.ibm.com>
Reported-by: Daniel Poelzleithner <poelzi@poelzi.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 23 Mar 2011 23:42:41 +0000 (16:42 -0700)]
memcg: page_cgroup array is never stored on reserved pages
KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
PageReserved because we never store the array on reserved pages (neither
alloc_pages_exact nor vmalloc use those pages).
So we can replace the check by a BUG_ON.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 23 Mar 2011 23:42:40 +0000 (16:42 -0700)]
page_cgroup: reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
Currently we are allocating a single page_cgroup array per memory section
(stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is
correct but memory inefficient solution because the allocated memory
(unless we fall back to vmalloc) is not kmalloc friendly:
- 32b - 16384 entries (20B per entry) fit into
327680B so the
524288B slab cache is used
- 32b with PAE - 131072 entries with
2621440B fit into
4194304B
- 64b - 32768 entries (40B per entry) fit into
2097152 cache
This is ~37% wasted space per memory section and it sumps up for the whole
memory. On a x86_64 machine it is something like 6MB per 1GB of RAM.
We can reduce the internal fragmentation by using alloc_pages_exact which
allocates PAGE_SIZE aligned blocks so we will get down to <4kB wasted
memory per section which is much better.
We still need a fallback to vmalloc because we have no guarantees that we
will have a continuous memory of that size (order-10) later on during the
hotplug events.
[hannes@cmpxchg.org: do not define unused free_page_cgroup() without memory hotplug]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 23 Mar 2011 23:42:39 +0000 (16:42 -0700)]
mm/memcontrol.c: suppress uninitialized-var warning with older gcc's
mm/memcontrol.c: In function 'mem_cgroup_force_empty':
mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function
It's a false positive.
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:38 +0000 (16:42 -0700)]
memcg: use native word page statistics counters
The statistic counters are in units of pages, there is no reason to make
them 64-bit wide on 32-bit machines.
Make them native words. Since they are signed, this leaves 31 bit on
32-bit machines, which can represent roughly 8TB assuming a page size of
4k.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:37 +0000 (16:42 -0700)]
memcg: break out event counters from other stats
For increasing and decreasing per-cpu cgroup usage counters it makes sense
to use signed types, as single per-cpu values might go negative during
updates. But this is not the case for only-ever-increasing event
counters.
All the counters have been signed 64-bit so far, which was enough to count
events even with the sign bit wasted.
This patch:
- divides s64 counters into signed usage counters and unsigned
monotonically increasing event counters.
- converts unsigned event counters into 'unsigned long' rather than
'u64'. This matches the type used by the /proc/vmstat event counters.
The next patch narrows the signed usage counters type (on 32-bit CPUs,
that is).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:36 +0000 (16:42 -0700)]
memcg: unify charge/uncharge quantities to units of pages
There is no clear pattern when we pass a page count and when we pass a
byte count that is a multiple of PAGE_SIZE.
We never charge or uncharge subpage quantities, so convert it all to page
counts.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:35 +0000 (16:42 -0700)]
memcg: convert uncharge batching from bytes to page granularity
We never uncharge subpage quantities.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:34 +0000 (16:42 -0700)]
memcg: convert per-cpu stock from bytes to page granularity
We never keep subpage quantities in the per-cpu stock.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:33 +0000 (16:42 -0700)]
memcg: keep only one charge cancelling function
We have two charge cancelling functions: one takes a page count, the other
a page size. The second one just divides the parameter by PAGE_SIZE and
then calls the first one. This is trivial, no need for an extra function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:32 +0000 (16:42 -0700)]
memcg: remove memcg->reclaim_param_lock
The reclaim_param_lock is only taken around single reads and writes to
integer variables and is thus superfluous. Drop it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:31 +0000 (16:42 -0700)]
memcg: charged pages always have valid per-memcg zone info
page_cgroup_zoneinfo() will never return NULL for a charged page, remove
the check for it in mem_cgroup_get_reclaim_stat_from_page().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:30 +0000 (16:42 -0700)]
memcg: remove direct page_cgroup-to-page pointer
In struct page_cgroup, we have a full word for flags but only a few are
reserved. Use the remaining upper bits to encode, depending on
configuration, the node or the section, to enable page_cgroup-to-page
lookups without a direct pointer.
This saves a full word for every page in a system with memory cgroups
enabled.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:29 +0000 (16:42 -0700)]
memcg: condense page_cgroup-to-page lookup points
The per-cgroup LRU lists string up 'struct page_cgroup's. To get from
those structures to the page they represent, a lookup is required.
Currently, the lookup is done through a direct pointer in struct
page_cgroup, so a lot of functions down the callchain do this lookup by
themselves instead of receiving the page pointer from their callers.
The next patch removes this pointer, however, and the lookup is no longer
that straight-forward. In preparation for that, this patch only leaves
the non-optional lookups when coming directly from the LRU list and passes
the page down the stack.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:28 +0000 (16:42 -0700)]
memcg: fold __mem_cgroup_move_account into caller
It is one logical function, no need to have it split up.
Also, get rid of some checks from the inner function that ensured the
sanity of the outer function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:27 +0000 (16:42 -0700)]
memcg: change page_cgroup_zoneinfo signature
Instead of passing a whole struct page_cgroup to this function, let it
take only what it really needs from it: the struct mem_cgroup and the
page.
This has the advantage that reading pc->mem_cgroup is now done at the same
place where the ordering rules for this pointer are enforced and
explained.
It is also in preparation for removing the pc->page backpointer.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:26 +0000 (16:42 -0700)]
memcg: no uncharged pages reach page_cgroup_zoneinfo
This patch series removes the direct page pointer from struct page_cgroup,
which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu
enable memcg per default, openSUSE apparently too).
The node id or section number is encoded in the remaining free bits of
pc->flags which allows calculating the corresponding page without the
extra pointer.
I ran, what I think is, a worst-case microbenchmark that just cats a large
sparse file to /dev/null, because it means that walking the LRU list on
behalf of per-cgroup reclaim and looking up pages from page_cgroups is
happening constantly and at a high rate. But it made no measurable
difference. A profile reported a 0.11% share of the new
lookup_cgroup_page() function in this benchmark.
This patch:
All callsites check PCG_USED before passing pc->mem_cgroup, so the latter
is never NULL.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daisuke Nishimura [Wed, 23 Mar 2011 23:42:25 +0000 (16:42 -0700)]
memcg: add memcg sanity checks at allocating and freeing pages
Add checks at allocating or freeing a page whether the page is used (iow,
charged) from the view point of memcg.
This check may be useful in debugging a problem and we did similar checks
before the commit
52d4b9ac(memcg: allocate all page_cgroup at boot).
This patch adds some overheads at allocating or freeing memory, so it's
enabled only when CONFIG_DEBUG_VM is enabled.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:24 +0000 (16:42 -0700)]
memcg: remove NULL check from lookup_page_cgroup() result
The page_cgroup array is set up before even fork is initialized. I
seriously doubt that this code executes before the array is alloc'd.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:23 +0000 (16:42 -0700)]
memcg: remove impossible conditional when committing
No callsite ever passes a NULL pointer for a struct mem_cgroup * to the
committing function. There is no need to check for it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:22 +0000 (16:42 -0700)]
memcg: remove unused page flag bitfield defines
These definitions have been unused since '
4b3bde4 memcg: remove the
overhead associated with the root cgroup'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:21 +0000 (16:42 -0700)]
memcg: simplify the way memory limits are checked
Since transparent huge pages, checking whether memory cgroups are below
their limits is no longer enough, but the actual amount of chargeable
space is important.
To not have more than one limit-checking interface, replace
memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
single memory_cgroup_margin() that returns the chargeable space and leaves
the comparison to the callsite.
Soft limits are now checked the other way round, by using the already
existing function that returns the amount by which soft limits are
exceeded: res_counter_soft_limit_excess().
Also remove all the corresponding functions on the res_counter side that
are now no longer used.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 23 Mar 2011 23:42:20 +0000 (16:42 -0700)]
memcg: soft limit reclaim should end at limit not below
Soft limit reclaim continues until the usage is below the current soft
limit, but the documented semantics are actually that soft limit reclaim
will push usage back until the soft limits are met again.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Wed, 23 Mar 2011 23:42:19 +0000 (16:42 -0700)]
memcg: fix ugly initialization of return value is in caller
Remove initialization of vaiable in caller of memory cgroup function.
Actually, it's return value of memcg function but it's initialized in
caller.
Some memory cgroup uses following style to bring the result of start
function to the end function for avoiding races.
mem_cgroup_start_A(&(*ptr))
/* Something very complicated can happen here. */
mem_cgroup_end_A(*ptr)
In some calls, *ptr should be initialized to NULL be caller. But it's
ugly. This patch fixes that *ptr is initialized by _start function.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Wed, 23 Mar 2011 23:42:18 +0000 (16:42 -0700)]
memcg: res_counter_read_u64(): fix potential races on 32-bit machines
res_counter_read_u64 reads u64 value without lock. It's dangerous in a
32bit environment. Add locking.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:16 +0000 (16:42 -0700)]
bitops: remove minix bitops from asm/bitops.h
minix bit operations are only used by minix filesystem and useless by
other modules. Because byte order of inode and block bitmaps is different
on each architecture like below:
m68k:
big-endian 16bit indexed bitmaps
h8300, microblaze, s390, sparc, m68knommu:
big-endian 32 or 64bit indexed bitmaps
m32r, mips, sh, xtensa:
big-endian 32 or 64bit indexed bitmaps for big-endian mode
little-endian bitmaps for little-endian mode
Others:
little-endian bitmaps
In order to move minix bit operations from asm/bitops.h to architecture
independent code in minix filesystem, this provides two config options.
CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
m32r, mips, sh, xtensa). The architectures which always use little-endian
bitmaps do not select these options.
Finally, we can remove minix bit operations from asm/bitops.h for all
architectures.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Michal Simek <monstr@monstr.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:15 +0000 (16:42 -0700)]
m68k: remove inline asm from minix_find_first_zero_bit
As a preparation for moving minix bit operations from asm/bitops.h to
architecture independent code in minix filesystem, this removes inline asm
from minix_find_first_zero_bit() for m68k.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:14 +0000 (16:42 -0700)]
bitops: remove ext2 non-atomic bitops from asm/bitops.h
As the result of conversions, there are no users of ext2 non-atomic bit
operations except for ext2 filesystem itself. Now we can put them into
architecture independent code in ext2 filesystem, and remove from
asm/bitops.h for all architectures.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:13 +0000 (16:42 -0700)]
dm: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:13 +0000 (16:42 -0700)]
md: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:11 +0000 (16:42 -0700)]
ufs: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:11 +0000 (16:42 -0700)]
udf: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:10 +0000 (16:42 -0700)]
reiserfs: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:08 +0000 (16:42 -0700)]
nilfs2: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:08 +0000 (16:42 -0700)]
ocfs2: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:07 +0000 (16:42 -0700)]
ext4: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:06 +0000 (16:42 -0700)]
ext3: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:05 +0000 (16:42 -0700)]
rds: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andy Grover <andy.grover@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:04 +0000 (16:42 -0700)]
kvm: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:04 +0000 (16:42 -0700)]
asm-generic: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:02 +0000 (16:42 -0700)]
bitops: introduce little-endian bitops for most architectures
Introduce little-endian bit operations to the big-endian architectures
which do not have native little-endian bit operations and the
little-endian architectures. (alpha, avr32, blackfin, cris, frv, h8300,
ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa)
These architectures can just include generic implementation
(asm-generic/bitops/le.h).
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:42:00 +0000 (16:42 -0700)]
m68knommu: introduce little-endian bitops
Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 bit operations are kept as wrapper macros using
little-endian bit operations to maintain bisectability until the
conversions are finished.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:59 +0000 (16:41 -0700)]
bitops: introduce CONFIG_GENERIC_FIND_BIT_LE
This introduces CONFIG_GENERIC_FIND_BIT_LE to tell whether to use generic
implementation of find_*_bit_le() in lib/find_next_bit.c or not.
For now we select CONFIG_GENERIC_FIND_BIT_LE for all architectures which
enable CONFIG_GENERIC_FIND_NEXT_BIT.
But m68knommu wants to define own faster find_next_zero_bit_le() and
continues using generic find_next_{,zero_}bit().
(CONFIG_GENERIC_FIND_NEXT_BIT and !CONFIG_GENERIC_FIND_BIT_LE)
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:58 +0000 (16:41 -0700)]
m68k: introduce little-endian bitops
Introduce little-endian bit operations by renaming native ext2 bit
operations and changing find_*_bit_le() to take a "void *". The ext2 bit
operations are kept as wrapper macros using little-endian bit operations
to maintain bisectability until the conversions are finished.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:57 +0000 (16:41 -0700)]
arm: introduce little-endian bitops
Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 and minix bit operations are kept as wrapper macros
using little-endian bit operations to maintain bisectability until the
conversions are finished.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:57 +0000 (16:41 -0700)]
s390: introduce little-endian bitops
Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 bit operations are kept as wrapper macros using
little-endian bit operations to maintain bisectability until the
conversions are finished.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:56 +0000 (16:41 -0700)]
powerpc: introduce little-endian bitops
Introduce little-endian bit operations by renaming existing powerpc native
little-endian bit operations and changing them to take any pointer types.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:50 +0000 (16:41 -0700)]
asm-generic: change little-endian bitops to take any pointer types
This makes the little-endian bitops take any pointer types by changing the
prototypes and adding casts in the preprocessor macros.
That would seem to at least make all the filesystem code happier, and they
can continue to do just something like
#define ext2_set_bit __test_and_set_bit_le
(or whatever the exact sequence ends up being).
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:47 +0000 (16:41 -0700)]
asm-generic: rename generic little-endian bitops functions
As a preparation for providing little-endian bitops for all architectures,
This renames generic implementation of little-endian bitops. (remove
"generic_" prefix and postfix "_le")
s/generic_find_next_le_bit/find_next_bit_le/
s/generic_find_next_zero_le_bit/find_next_zero_bit_le/
s/generic_find_first_zero_le_bit/find_first_zero_bit_le/
s/generic___test_and_set_le_bit/__test_and_set_bit_le/
s/generic___test_and_clear_le_bit/__test_and_clear_bit_le/
s/generic_test_le_bit/test_bit_le/
s/generic___set_le_bit/__set_bit_le/
s/generic___clear_le_bit/__clear_bit_le/
s/generic_test_and_set_le_bit/test_and_set_bit_le/
s/generic_test_and_clear_le_bit/test_and_clear_bit_le/
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:46 +0000 (16:41 -0700)]
bitops: merge little and big endian definisions in asm-generic/bitops/le.h
This patch series introduces little-endian bit operations in asm/bitops.h
for all architectures and converts all ext2 non-atomic and minix bit
operations to use little-endian bit operations. It enables us to remove
ext2 non-atomic and minix bit operations from asm/bitops.h. The reason
they should be removed from asm/bitops.h is as follows:
For ext2 non-atomic bit operations, they are used for little-endian byte
order bitmap access by some filesystems and modules. But using ext2_*()
functions on a module other than ext2 filesystem makes some feel strange.
For minix bit operations, they are only used by minix filesystem and are
useless by other modules. Because byte order of inode and block bitmap is
This patch:
In order to make the forthcoming changes smaller, this merges macro
definisions in asm-generic/bitops/le.h for big-endian and little-endian as
much as possible.
This also removes unused BITOP_WORD macro.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:45 +0000 (16:41 -0700)]
rds: stop including asm-generic/bitops/le.h directly
asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.
This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.
It seems odd to use ext2_*_bit() on rds, but it will replaced with
__{set,clear,test}_bit_le() after introducing little endian bit operations
for all architectures. This indirect step is necessary to maintain
bisectability for some architectures which have their own little-endian
bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andy Grover <andy.grover@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 23 Mar 2011 23:41:44 +0000 (16:41 -0700)]
kvm: stop including asm-generic/bitops/le.h directly
asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.
This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.
It seems odd to use ext2_set_bit() on kvm, but it will replaced with
__set_bit_le() after introducing little endian bit operations for all
architectures. This indirect step is necessary to maintain bisectability
for some architectures which have their own little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 23 Mar 2011 23:41:43 +0000 (16:41 -0700)]
fs/adfs/adfs.h: fix unsigned comparison
fs/adfs/adfs.h: In function 'append_filetype_suffix':
fs/adfs/adfs.h:115: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luck, Tony [Wed, 23 Mar 2011 23:41:43 +0000 (16:41 -0700)]
ia64: fix build breakage in asm/thread_info.h
In commit
504f52b5439aaf26d3e2c1d45ec10fce38c8dd27
mm: NUMA aware alloc_task_struct_node()
Eric Dumazet forgot a "\". Add it.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Wilson [Wed, 23 Mar 2011 18:16:55 +0000 (18:16 +0000)]
Revert "drm/i915: Don't save/restore hardware status page address register"
This reverts commit
a7a75c8f70d6f6a2f16c9f627f938bbee2d32718.
There are two different variations on how Intel hardware addresses the
"Hardware Status Page". One as a location in physical memory and the
other as an offset into the virtual memory of the GPU, used in more
recent chipsets. (The HWS itself is a cacheable region of memory which
the GPU can write to without requiring CPU synchronisation, used for
updating various details of hardware state, such as the position of
the GPU head in the ringbuffer, the last breadcrumb seqno, etc).
These two types of addresses were updated in different locations of code
- one inline with the ringbuffer initialisation, and the other during
device initialisation. (The HWS page is logically associated with
the rings, and there is one HWS page per ring.) During resume, only the
ringbuffers were being re-initialised along with the virtual HWS page,
leaving the older physical address HWS untouched. This then caused a
hang on the older gen3/4 (915GM, 945GM, 965GM) the first time we tried
to synchronise the GPU as the breadcrumbs were never being updated.
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jan Niehusmann <jan@gondor.com>
Reported-by: Justin P. Mattock <justinmattock@gmail.com>
Reported-and-tested-by: Michael "brot" Groh <brot@minad.de>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 23 Mar 2011 14:58:09 +0000 (07:58 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: HDA: Realtek: Avoid unnecessary volume control index on Surround/Side
ASoC: Support !REGULATOR build for sgtl5000
ALSA: hda - VIA: Fix VT1708 can't build up Headphone control issue
ALSA: hda - VIA: Correct stream names for VT1818S
ALSA: hda - VIA: Fix codec type for VT1708BCE at the right timing
ALSA: hda - VIA: Fix invalid A-A path volume adjust issue
ALSA: hda - VIA: Add missing support for VT1718S in A-A path
ALSA: hda - VIA: Fix independent headphone no sound issue
ALSA: hda - VIA: Fix stereo mixer recording no sound issue
ALSA: hda - Set EAPD for Realtek ALC665
ALSA: usb - Remove trailing spaces from USB card name strings
sound: read i_size with i_size_read()
ASoC: Remove bogus check for register validity in debugfs write
ASoC: mini2440: Fix uda134x codec problem.
Cesar Eduardo Barros [Wed, 23 Mar 2011 02:03:13 +0000 (23:03 -0300)]
sys_swapon: fix inode locking
A conflict between
52c50567d8ab ("mm: swap: unlock swapfile inode mutex
before closing file on bad swapfiles") and
83ef99befc32 ("sys_swapon:
remove did_down variable") caused a double unlock of the inode mutex
(once in bad_swap: before the filp_close, once at the end just before
returning).
The patch which added the extra unlock cleared did_down to avoid
unlocking twice, but the other patch removed the did_down variable.
To fix, set inode to NULL after the first unlock, since it will be used
after that point only for the final unlock.
While checking this patch, I found a path which could unlock without
locking, in case the same inode was added as a swapfile twice. To fix,
move the setting of the inode variable further down, to just before
claim_swapfile, which will lock the inode before doing anything else.
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiko Carstens [Wed, 23 Mar 2011 07:24:58 +0000 (08:24 +0100)]
smp: add missing init.h include
Commit
34db18a054c6 ("smp: move smp setup functions to kernel/smp.c")
causes this build error on s390 because of a missing init.h include:
CC arch/s390/kernel/asm-offsets.s
In file included from /home2/heicarst/linux-2.6/arch/s390/include/asm/spinlock.h:14:0,
from include/linux/spinlock.h:87,
from include/linux/seqlock.h:29,
from include/linux/time.h:8,
from include/linux/timex.h:56,
from include/linux/sched.h:57,
from arch/s390/kernel/asm-offsets.c:10:
include/linux/smp.h:117:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'setup_nr_cpu_ids'
include/linux/smp.h:118:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'smp_init'
Fix it by adding the include statement.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Takashi Iwai [Wed, 23 Mar 2011 11:05:01 +0000 (12:05 +0100)]
Merge branch 'topic/asoc' into for-linus
David Henningsson [Wed, 23 Mar 2011 07:35:07 +0000 (08:35 +0100)]
ALSA: HDA: Realtek: Avoid unnecessary volume control index on Surround/Side
Similar to commit
7e59e097c09b82760bb0fe08b0fa2b704d76c3f4, this patch
avoids unnecessary volume control indices for more
Realtek auto-parsers, e g the ALC66x family, on the "Surround" and "Side"
controls.
These indices cause these volume controls to be ignored by PulseAudio and
vmaster and should be removed whenever possible.
Cc: stable@kernel.org
Reported-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: David Henningsson <david.henningsson@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Linus Torvalds [Wed, 23 Mar 2011 00:53:13 +0000 (17:53 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (66 commits)
avr32: at32ap700x: fix typo in DMA master configuration
dmaengine/dmatest: Pass timeout via module params
dma: let IMX_DMA depend on IMX_HAVE_DMA_V1 instead of an explicit list of SoCs
fsldma: make halt behave nicely on all supported controllers
fsldma: reduce locking during descriptor cleanup
fsldma: support async_tx dependencies and automatic unmapping
fsldma: fix controller lockups
fsldma: minor codingstyle and consistency fixes
fsldma: improve link descriptor debugging
fsldma: use channel name in printk output
fsldma: move related helper functions near each other
dmatest: fix automatic buffer unmap type
drivers, pch_dma: Fix warning when CONFIG_PM=n.
dmaengine/dw_dmac fix: use readl & writel instead of __raw_readl & __raw_writel
avr32: at32ap700x: Specify DMA Flow Controller, Src and Dst msize
dw_dmac: Setting Default Burst length for transfers as 16.
dw_dmac: Allow src/dst msize & flow controller to be configured at runtime
dw_dmac: Changing type of src_master and dest_master to u8.
dw_dmac: Pass Channel Priority from platform_data
dw_dmac: Pass Channel Allocation Order from platform_data
...
Jean Delvare [Tue, 22 Mar 2011 23:35:13 +0000 (16:35 -0700)]
bloat-o-meter: include read-only data section in report
I'm not sure why the read-only data section is excluded from the report,
it seems as relevant as the other data sections (b and d).
I've stripped the symbols starting with __mod_ as they can have their
names dynamically generated and thus comparison between binaries is not
possible.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Keniston [Tue, 22 Mar 2011 23:35:12 +0000 (16:35 -0700)]
zlib: slim down zlib_deflate() workspace when possible
Instead of always creating a huge (268K) deflate_workspace with the
maximum compression parameters (windowBits=15, memLevel=8), allow the
caller to obtain a smaller workspace by specifying smaller parameter
values.
For example, when capturing oops and panic reports to a medium with
limited capacity, such as NVRAM, compression may be the only way to
capture the whole report. In this case, a small workspace (24K works
fine) is a win, whether you allocate the workspace when you need it (i.e.,
during an oops or panic) or at boot time.
I've verified that this patch works with all accepted values of windowBits
(positive and negative), memLevel, and compression level.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Vagin [Tue, 22 Mar 2011 23:35:11 +0000 (16:35 -0700)]
fs/devpts/inode.c: correctly check d_alloc_name() return code in devpts_pty_new()
d_alloc_name return NULL in case error, but we expect errno in
devpts_pty_new.
Addresses http://bugzilla.openvz.org/show_bug.cgi?id=1758
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roland Dreier [Tue, 22 Mar 2011 23:35:10 +0000 (16:35 -0700)]
aio: wake all waiters when destroying ctx
The test program below will hang because io_getevents() uses
add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
wakes up one of the threads. Fix this by using wake_up_all() in the aio
code paths where we want to make sure no one gets stuck.
// t.c -- compile with gcc -lpthread -laio t.c
#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
static const int nthr = 2;
void *getev(void *ctx)
{
struct io_event ev;
io_getevents(ctx, 1, 1, &ev, NULL);
printf("io_getevents returned\n");
return NULL;
}
int main(int argc, char *argv[])
{
io_context_t ctx = 0;
pthread_t thread[nthr];
int i;
io_setup(1024, &ctx);
for (i = 0; i < nthr; ++i)
pthread_create(&thread[i], NULL, getev, ctx);
sleep(1);
io_destroy(ctx);
for (i = 0; i < nthr; ++i)
pthread_join(thread[i], NULL);
return 0;
}
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Gordeev [Tue, 22 Mar 2011 23:35:06 +0000 (16:35 -0700)]
pps: remove unreachable code
Remove code enabled only when CONFIG_PREEMPT_RT is turned on because it is
not used in the vanilla kernel.
Signed-off-by: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Rodolfo Giometti <giometti@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stuart Swales [Tue, 22 Mar 2011 23:35:06 +0000 (16:35 -0700)]
adfs: add hexadecimal filetype suffix option
ADFS (FileCore) storage complies with the RISC OS filetype specification
(12 bits of file type information is stored in the file load address,
rather than using a file extension). The existing driver largely ignores
this information and does not present it to the end user.
It is desirable that stored filetypes be made visible to the end user to
facilitate a precise copy of data and metadata from a hard disc (or image
thereof) into a RISC OS emulator (such as RPCEmu) or to a network share
which can be accessed by real Acorn systems.
This patch implements a per-mount filetype suffix option (use -o
ftsuffix=1) to present any filetype as a ,xyz hexadecimal suffix on each
file. This type suffix is compatible with that used by RISC OS systems
that access network servers using NFS client software and by RPCemu's host
filing system.
Signed-off-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stuart Swales [Tue, 22 Mar 2011 23:35:05 +0000 (16:35 -0700)]
adfs: improve timestamp precision
ADFS (FileCore) storage complies with the RISC OS timestamp specification
(40-bit centiseconds since 01 Jan 1900 00:00:00). It is desirable that
stored timestamp precision be maintained to facilitate a precise copy of
data and metadata from a hard disc (or image thereof) into a RISC OS
emulator (such as RPCEmu).
This patch implements a full-precision conversion from ADFS to Unix
timestamp as the existing driver, for ease of calculation with old 32-bit
compilers, uses the common trick of shifting the 40-bits representing
centiseconds around into 32-bits representing seconds thereby losing
precision.
Signed-off-by: Stuart Swales<stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>