Peter Zijlstra [Wed, 24 Sep 2014 08:18:56 +0000 (10:18 +0200)]
sched: Exclude cond_resched() from nested sleep test
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.
In particular, preemption preserves task_struct::state.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:55 +0000 (10:18 +0200)]
sched: Debug nested sleeps
Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.
Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.
Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:54 +0000 (10:18 +0200)]
sched, net: Clean up sk_wait_event() vs. might_sleep()
WARNING: CPU: 1 PID: 1744 at kernel/sched/core.c:7104 __might_sleep+0x58/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<
ffffffff81070e10>] prepare_to_wait+0x50 /0xa0
[<
ffffffff8105bc38>] __might_sleep+0x58/0x90
[<
ffffffff8148c671>] lock_sock_nested+0x31/0xb0
[<
ffffffff81498aaa>] sk_stream_wait_memory+0x18a/0x2d0
Which is a false positive because sk_wait_event() will already have
TASK_RUNNING at that point if it would've gone through
schedule_timeout().
So annotate with sched_annotate_sleep(); which goes away on !DEBUG builds.
Reported-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140924082242.524407432@infradead.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: netdev@vger.kernel.org
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:53 +0000 (10:18 +0200)]
sched, modules: Fix nested sleep in add_unformed_module()
This is a genuine bug in add_unformed_module(), we cannot use blocking
primitives inside a wait loop.
So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org
[ So this is probably complex to backport and the race wasn't reported AFAIK,
so not marked for -stable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:52 +0000 (10:18 +0200)]
sched, smp: Correctly deal with nested sleeps
smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.
__might_sleep+0x74/0x80
kmem_cache_alloc_trace+0x4e/0x1c0
perf_event_alloc+0x55/0x450
perf_event_create_kernel_counter+0x2f/0x100
watchdog_nmi_enable+0x8d/0x160
watchdog_enable+0x45/0x90
smpboot_thread_fn+0xec/0x2b0
kthread+0xe4/0x100
ret_from_fork+0x7c/0xb0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:51 +0000 (10:18 +0200)]
sched, tty: Deal with nested sleeps
n_tty_{read,write} are wait loops with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state.
Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/20140924082242.323011233@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:50 +0000 (10:18 +0200)]
sched, inotify: Deal with nested sleeps
inotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).
Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:49 +0000 (10:18 +0200)]
sched, exit: Deal with nested sleeps
do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.
Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<
ffffffff8109a788>] do_wait+0x88/0x270
Call Trace:
[<
ffffffff81694991>] dump_stack+0x4e/0x7a
[<
ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
[<
ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
[<
ffffffff810bca6e>] __might_sleep+0x7e/0x90
[<
ffffffff811a1c15>] might_fault+0x55/0xb0
[<
ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
[<
ffffffff8109a804>] do_wait+0x104/0x270
[<
ffffffff8109b837>] SyS_wait4+0x77/0x100
[<
ffffffff8169d692>] system_call_fastpath+0x16/0x1b
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: umgwanakikbuti@gmail.com
Cc: ilya.dryomov@inktank.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:48 +0000 (10:18 +0200)]
sched/wait: Add might_sleep() checks
Add more might_sleep() checks, suppose someone put a wait_event() like
thing in a wait loop..
Can't put might_sleep() in ___wait_event() because there's the locked
primitives which call ___wait_event() with locks held.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.119255706@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:47 +0000 (10:18 +0200)]
sched/wait: Provide infrastructure to deal with nested blocking
There are a few places that call blocking primitives from wait loops,
provide infrastructure to support this without the typical
task_struct::state collision.
We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 24 Sep 2014 08:18:46 +0000 (10:18 +0200)]
locking/mutex: Don't assume TASK_RUNNING
We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.
There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.
Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Wanpeng Li [Tue, 14 Oct 2014 02:22:40 +0000 (10:22 +0800)]
sched/deadline: Don't balance during wakeup if wakee is pinned
Use nr_cpus_allowed to bail from select_task_rq() when only one cpu
can be used, and saves some cycles for pinned tasks.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Wanpeng Li [Tue, 14 Oct 2014 02:22:39 +0000 (10:22 +0800)]
sched/deadline: Don't check SD_BALANCE_FORK
There is no need to do balance during fork since SCHED_DEADLINE
tasks can't fork. This patch avoid the SD_BALANCE_FORK check.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Tue, 7 Oct 2014 08:52:11 +0000 (09:52 +0100)]
sched/deadline: Ensure that updates to exclusive cpusets don't break AC
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.
This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Fri, 19 Sep 2014 09:22:40 +0000 (10:22 +0100)]
sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:
- No check is performed when the user tries to attach a task to
an exlusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).
- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.
This patch fixes both things at once, as they are opposite faces
of the same coin.
The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Wanpeng Li [Wed, 22 Oct 2014 00:36:43 +0000 (08:36 +0800)]
sched/deadline: Do not try to push tasks if pinned task switches to dl
As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):
| If rq has already had 2 or more pushable tasks and we try to add a
| pinned task then call of push_rt_task will just waste a time.
Just switched pinned task is not able to be pushed. If the rq has had
several dl tasks before they have already been considered as candidates
to be pushed (or pulled). This patch implements the same behavior as rt
class which introduced by commit
10447917551e ("sched/rt: Do not try to
push tasks if pinned task switches to RT").
Suggested-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg Nesterov [Wed, 8 Oct 2014 18:33:48 +0000 (20:33 +0200)]
sched: Kill task_preempt_count()
task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.
Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20141008183348.GC17495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg Nesterov [Thu, 9 Oct 2014 19:32:32 +0000 (21:32 +0200)]
sched: Make finish_task_switch() return 'struct rq *'
Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().
It would be simpler to call this_rq() once in finish_task_switch()
and return the this rq to the callers.
Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009193232.GB5408@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg Nesterov [Wed, 8 Oct 2014 19:36:44 +0000 (21:36 +0200)]
sched: Fix schedule_tail() to disable preemption
finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141008193644.GA32055@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg Nesterov [Tue, 7 Oct 2014 19:51:08 +0000 (21:51 +0200)]
sched: Fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()
task_preempt_count() has nothing to do with the actual preempt counter,
thread_info->saved_preempt_count is only valid right after switch_to().
__trace_sched_switch_state() can use preempt_count(), prev is still the
current task when trace_sched_switch() is called.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Added BUG_ON(). ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20141007195108.GB28002@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Thu, 9 Oct 2014 21:27:47 +0000 (17:27 -0400)]
sched/numa: Check all nodes when placing a pseudo-interleaved group
In pseudo-interleaved numa_groups, all tasks try to relocate to
the group's preferred_nid. When a group is spread across multiple
NUMA nodes, this can lead to tasks swapping their location with
other tasks inside the same group, instead of swapping location with
tasks from other NUMA groups. This can keep NUMA groups from converging.
Examining all nodes, when dealing with a task in a pseudo-interleaved
NUMA group, avoids this problem. Note that only CPUs in nodes that
improve the task or group score are examined, so the loop isn't too
bad.
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Vinod Chegu" <chegu_vinod@hp.com>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Fri, 17 Oct 2014 07:29:53 +0000 (03:29 -0400)]
sched/numa: Find the preferred nid with complex NUMA topology
On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.
The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Fri, 17 Oct 2014 07:29:52 +0000 (03:29 -0400)]
sched/numa: Calculate node scores in complex NUMA topologies
In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.
In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.
Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.
Placement policy on directly connected NUMA systems is not affected.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Fri, 17 Oct 2014 07:29:51 +0000 (03:29 -0400)]
sched/numa: Prepare for complex topology placement
Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Fri, 17 Oct 2014 07:29:50 +0000 (03:29 -0400)]
sched/numa: Classify the NUMA topology of a system
Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, ie. a non-NUMA system.
Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.
On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.
On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.
Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
[ Changed to use kernel/sched/sched.h ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Fri, 17 Oct 2014 07:29:49 +0000 (03:29 -0400)]
sched/numa: Export info needed for NUMA balancing on complex topologies
Export some information that is necessary to do placement of
tasks on systems with multi-level NUMA topologies.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai [Tue, 21 Oct 2014 16:35:56 +0000 (20:35 +0400)]
sched/dl: Fix preemption checks
1) switched_to_dl() check is wrong. We reschedule only
if rq->curr is deadline task, and we do not reschedule
if it's a lower priority task. But we must always
preempt a task of other classes.
2) dl_task_timer():
Policy does not change in case of priority inheritance.
rt_mutex_setprio() changes prio, while policy remains old.
So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check policy instead of priority. Boosted
task may be rq->curr.
(I didn't change switched_from_dl() because no check is necessary
there at all).
I've looked at this place(switched_to_dl) several times and even fixed
this function, but found just now... I suppose some performance tests
may work better after this.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Chen Hanxiao [Tue, 7 Oct 2014 09:29:07 +0000 (17:29 +0800)]
sched: Update comments for CLONE_NEWNS
Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1412674147-8941-1-git-send-email-chenhanxiao@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg Nesterov [Sun, 5 Oct 2014 20:23:22 +0000 (22:23 +0200)]
sched: stop the unbound recursion in preempt_schedule_context()
preempt_schedule_context() does preempt_enable_notrace() at the end
and this can call the same function again; exception_exit() is heavy
and it is quite possible that need-resched is true again.
1. Change this code to dec preempt_count() and check need_resched()
by hand.
2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
the enable/disable dance around __schedule(). But in this case
we need to move into sched/core.c.
3. Cosmetic, but x86 forgets to declare this function. This doesn't
really matter because it is only called by asm helpers, still it
make sense to add the declaration into asm/preempt.h to match
preempt_schedule().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai [Thu, 16 Oct 2014 10:39:37 +0000 (14:39 +0400)]
sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
This bash command reproduces problem:
$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
divide error: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task:
ffff88013c852600 ti:
ffff880037a68000 task.ti:
ffff880037a68000
RIP: 0010:[<
ffffffff81074191>] [<
ffffffff81074191>] task_scan_min+0x21/0x50
RSP: 0000:
ffff880037a6bce0 EFLAGS:
00010246
RAX:
0000000000000a00 RBX:
00000000000003e8 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
ffff88013c852600
RBP:
ffff880037a6bcf0 R08:
0000000000000001 R09:
0000000000015c90
R10:
ffff880239bf6c00 R11:
0000000000000016 R12:
0000000000003fff
R13:
ffff88013c852600 R14:
ffffea0008d1b000 R15:
0000000000000003
FS:
00007f12bb048700(0000) GS:
ffff88007da00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
0000000001505678 CR3:
0000000234770000 CR4:
00000000000006f0
Stack:
ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
Call Trace:
[<
ffffffff810741d1>] task_scan_max+0x11/0x40
[<
ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
[<
ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
[<
ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
[<
ffffffff8103e2f1>] __do_page_fault+0x191/0x510
[<
ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
[<
ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
[<
ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
[<
ffffffff8104887d>] ? do_fork+0x14d/0x340
[<
ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
[<
ffffffff811799df>] ? __fd_install+0x1f/0x60
[<
ffffffff8103e67c>] do_page_fault+0xc/0x10
[<
ffffffff8150d322>] page_fault+0x22/0x30
RIP [<
ffffffff81074191>] task_scan_min+0x21/0x50
RSP <
ffff880037a6bce0>
---[ end trace
9a826d16936c04de ]---
Also fix race in task_scan_min (it depends on compiler behaviour).
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yasuaki Ishimatsu [Wed, 22 Oct 2014 07:04:35 +0000 (16:04 +0900)]
sched/fair: Care divide error in update_task_scan_period()
While offling node by hot removing memory, the following divide error
occurs:
divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [<
ffffffff810a7081>] task_numa_fault
RSP <
ffff88084eb2bcb0>
The issue occurs as follows:
1. When page fault occurs and page is allocated from node 1,
task_struct->numa_faults_buffer_memory[] of node 1 is
incremented and p->numa_faults_locality[] is also incremented
as follows:
o numa_faults_buffer_memory[] o numa_faults_locality[]
NR_NUMA_HINT_FAULT_TYPES
| 0 | 1 |
---------------------------------- ----------------------
node 0 | 0 | 0 | remote | 0 |
node 1 | 0 | 1 | locale | 1 |
---------------------------------- ----------------------
2. node 1 is offlined by hot removing memory.
3. When page fault occurs, fault_types[] is calculated by using
p->numa_faults_buffer_memory[] of all online nodes in
task_numa_placement(). But node 1 was offline by step 2. So
the fault_types[] is calculated by using only
p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
are set to 0.
4. The values(0) of fault_types[] pass to update_task_scan_period().
5. numa_faults_locality[1] is set to 1. So the following division is
calculated.
static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private){
...
ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
}
6. But both of private and shared are set to 0. So divide error
occurs here.
The divide error is rare case because the trigger is node offline.
This patch always increments denominator for avoiding divide error.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai [Wed, 22 Oct 2014 07:17:11 +0000 (11:17 +0400)]
sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:
task_numa_compare() do_exit()
... current->flags |= PF_EXITING;
... release_task()
... ~~delayed_put_task_struct()~~
... schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch provides simple check of PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Locked dst_rq->lock protects from concurrency with the last schedule().
Reusing or unmapping of cur's memory may happen without it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Fri, 24 Oct 2014 09:16:38 +0000 (10:16 +0100)]
sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
dl_task_timer() is racy against several paths. Daniel noticed that
the replenishment timer may experience a race condition against an
enqueue_dl_entity() called from rt_mutex_setprio(). With his own
words:
rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
start_dl_timer() throttled = 1, rt_mutex_setprio() throlled = 0,
sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task()
throttled is 0
=> BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
enqueued on the -deadline runqueue.
As we do for the other races, we just bail out in the replenishment
timer code.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Fri, 24 Oct 2014 09:16:37 +0000 (10:16 +0100)]
sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
In the deboost path, right after the dl_boosted flag has been
reset, we can currently end up replenishing using -deadline
parameters of a !SCHED_DEADLINE entity. This of course causes
a bug, as those parameters are empty.
In the case depicted above it is safe to simply bail out, as
the deboosted task is going to be back to its original scheduling
class anyway.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai [Mon, 27 Oct 2014 10:18:25 +0000 (14:18 +0400)]
sched: Fix race between task_group and sched_task_group
The race may happen when somebody is changing task_group of a forking task.
Child's cgroup is the same as parent's after dup_task_struct() (there just
memory copying). Also, cfs_rq and rt_rq are the same as parent's.
But if parent changes its task_group before it's called cgroup_post_fork(),
we do not reflect this situation on child. Child's cfs_rq and rt_rq remain
the same, while child's task_group changes in cgroup_post_fork().
To fix this we introduce fork() method, which calls sched_move_task() directly.
This function changes sched_task_group on appropriate (also its logic has
no problem with freshly created tasks, so we shouldn't introduce something
special; we are able just to use it).
Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sun, 26 Oct 2014 23:48:41 +0000 (16:48 -0700)]
Linux 3.18-rc2
Linus Torvalds [Sun, 26 Oct 2014 18:35:51 +0000 (11:35 -0700)]
Merge tag 'armsoc-for-rc2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"Another week, another small batch of fixes.
Most of these make zynq, socfpga and sunxi platforms work a bit
better:
- due to new requirements for regulators, DWMMC on socfpga broke past
v3.17
- SMP spinup fix for socfpga
- a few DT fixes for zynq
- another option (FIXED_REGULATOR) for sunxi is needed that used to
be selected by other options but no longer is.
- a couple of small DT fixes for at91
- ...and a couple for i.MX"
* tag 'armsoc-for-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: dts: imx28-evk: Let i2c0 run at 100kHz
ARM: i.MX6: Fix "emi" clock name typo
ARM: multi_v7_defconfig: enable CONFIG_MMC_DW_ROCKCHIP
ARM: sunxi_defconfig: enable CONFIG_REGULATOR_FIXED_VOLTAGE
ARM: dts: socfpga: Add a 3.3V fixed regulator node
ARM: dts: socfpga: Fix SD card detect
ARM: dts: socfpga: rename gpio nodes
ARM: at91/dt: sam9263: fix PLLB frequencies
power: reset: at91-reset: fix power down register
MAINTAINERS: add atmel ssc driver maintainer entry
arm: socfpga: fix fetching cpu1start_addr for SMP
ARM: zynq: DT: trivial: Fix mc node
ARM: zynq: DT: Add cadence watchdog node
ARM: zynq: DT: Add missing reference for memory-controller
ARM: zynq: DT: Add missing reference for ADC
ARM: zynq: DT: Add missing address for L2 pl310
ARM: zynq: DT: Remove 222 MHz OPP
ARM: zynq: DT: Fix GEM register area size
Linus Torvalds [Sun, 26 Oct 2014 18:19:18 +0000 (11:19 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"overlayfs merge + leak fix for d_splice_alias() failure exits"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
overlayfs: embed middle into overlay_readdir_data
overlayfs: embed root into overlay_readdir_data
overlayfs: make ovl_cache_entry->name an array instead of pointer
overlayfs: don't hold ->i_mutex over opening the real directory
fix inode leaks on d_splice_alias() failure exits
fs: limit filesystem stacking depth
overlay: overlay filesystem documentation
overlayfs: implement show_options
overlayfs: add statfs support
overlay filesystem
shmem: support RENAME_WHITEOUT
ext4: support RENAME_WHITEOUT
vfs: add RENAME_WHITEOUT
vfs: add whiteout support
vfs: export check_sticky()
vfs: introduce clone_private_mount()
vfs: export __inode_permission() to modules
vfs: export do_splice_direct() to modules
vfs: add i_op->dentry_open()
Olof Johansson [Sun, 26 Oct 2014 03:44:05 +0000 (20:44 -0700)]
Merge tag 'imx-fixes-3.18' of git://git./linux/kernel/git/shawnguo/linux into fixes
Merge "ARM: imx: fixes for 3.18" from Shawn Guo:
The i.MX fixes for 3.18:
- Revert one patch which increases I2C bus frequency on imx28-evk
- Fix a typo on imx6q EIM clock name
* tag 'imx-fixes-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
ARM: dts: imx28-evk: Let i2c0 run at 100kHz
ARM: i.MX6: Fix "emi" clock name typo
Signed-off-by: Olof Johansson <olof@lixom.net>
Fabio Estevam [Mon, 20 Oct 2014 13:08:01 +0000 (11:08 -0200)]
ARM: dts: imx28-evk: Let i2c0 run at 100kHz
Commit
78b81f4666fb ("ARM: dts: imx28-evk: Run I2C0 at 400kHz") caused issues
when doing the following sequence in loop:
- Boot the kernel
- Perform audio playback
- Reboot the system via 'reboot' command
In many times the audio card cannot be probed, which causes playback to fail.
After restoring to the original i2c0 frequency of 100kHz there is no such
problem anymore.
This reverts commit
78b81f4666fbb22a20b1e63e5baf197ad2e90e88.
Cc: <stable@vger.kernel.org> # 3.16+
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Steve Longerbeam [Tue, 14 Oct 2014 17:41:49 +0000 (20:41 +0300)]
ARM: i.MX6: Fix "emi" clock name typo
Fix a typo error, the "emi" names refer to the eim clocks.
The change fixes typo in EIM and EIM_SLOW pre-output dividers and
selectors clock names. Notably EIM_SLOW clock itself is named correctly.
Signed-off-by: Steve Longerbeam <steve_longerbeam@mentor.com>
[vladimir_zapolskiy@mentor.com: ported to v3.17]
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Sascha Hauer <kernel@pengutronix.de>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Al Viro [Fri, 24 Oct 2014 03:03:03 +0000 (23:03 -0400)]
overlayfs: embed middle into overlay_readdir_data
same story...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 24 Oct 2014 03:00:53 +0000 (23:00 -0400)]
overlayfs: embed root into overlay_readdir_data
no sense having it a pointer - all instances have it pointing to
local variable in the same stack frame
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 24 Oct 2014 02:58:56 +0000 (22:58 -0400)]
overlayfs: make ovl_cache_entry->name an array instead of pointer
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 24 Oct 2014 02:56:05 +0000 (22:56 -0400)]
overlayfs: don't hold ->i_mutex over opening the real directory
just use it to serialize the assignment
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Fri, 24 Oct 2014 19:48:47 +0000 (12:48 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"This is the first round of fixes and tying up loose ends for MIPS.
- plenty of fixes for build errors in specific obscure configurations
- remove redundant code on the Lantiq platform
- removal of a useless SEAD I2C driver that was causing a build issue
- fix an earlier TLB exeption handler fix to also work on Octeon.
- fix ISA level dependencies in FPU emulator's instruction decoding.
- don't hardcode kernel command line in Octeon software emulator.
- fix an earlier fix for the Loondson 2 clock setting"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: SEAD3: Fix I2C device registration.
MIPS: SEAD3: Nuke PIC32 I2C driver.
MIPS: ftrace: Fix a microMIPS build problem
MIPS: MSP71xx: Fix build error
MIPS: Malta: Do not build the malta-amon.c file if CMP is not enabled
MIPS: Prevent compiler warning from cop2_{save,restore}
MIPS: Kconfig: Add missing MIPS_CPS dependencies to PM and cpuidle
MIPS: idle: Remove leftover __pastwait symbol and its references
MIPS: Sibyte: Include the swarm subdir to the sb1250 LittleSur builds
MIPS: ptrace.h: Add a missing include
MIPS: ath79: Fix compilation error when CONFIG_PCI is disabled
MIPS: MSP71xx: Remove compilation error when CONFIG_MIPS_MT is present
MIPS: Octeon: Remove special case for simulator command line.
MIPS: tlbex: Properly fix HUGE TLB Refill exception handler
MIPS: loongson2_cpufreq: Fix CPU clock rate setting mismerge
pci: pci-lantiq: remove duplicate check on resource
MIPS: Lasat: Add missing CONFIG_PROC_FS dependency to PICVUE_PROC
MIPS: cp1emu: Fix ISA restrictions for cop1x_op instructions
Linus Torvalds [Fri, 24 Oct 2014 19:48:04 +0000 (12:48 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- enable 48-bit VA space now that KVM has been fixed, together with a
couple of fixes for pgd allocation alignment and initial memblock
current_limit. There is still a dependency on !ARM_SMMU which needs
to be updated as it uses the page table manipulation macros of the
host kernel
- eBPF fixes following changes/conflicts during the merging window
- Compat types affecting compat_elf_prpsinfo
- Compilation error on UP builds
- ASLR fix when /proc/sys/kernel/randomize_va_space == 0
- DT definitions for CLCD support on ARMv8 model platform
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Fix memblock current_limit with 64K pages and 48-bit VA
arm64: ASLR: Don't randomise text when randomise_va_space == 0
arm64: vexpress: Add CLCD support to the ARMv8 model platform
arm64: Fix compilation error on UP builds
Documentation/arm64/memory.txt: fix typo
net: bpf: arm64: minor fix of type in jited
arm64: bpf: add 'load 64-bit immediate' instruction
arm64: bpf: add 'shift by register' instructions
net: bpf: arm64: address randomize and write protect JIT code
arm64: mm: Correct fixmap pagetable types
arm64: compat: fix compat types affecting struct compat_elf_prpsinfo
arm64: Align less than PAGE_SIZE pgds naturally
arm64: Allow 48-bits VA space without ARM_SMMU
Linus Torvalds [Fri, 24 Oct 2014 19:45:47 +0000 (12:45 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
Pull two sparc fixes from David Miller:
1) Fix boots with gcc-4.9 compiled sparc64 kernels.
2) Add missing __get_user_pages_fast() on sparc64 to fix hangs on
futexes used in transparent hugepage areas.
It's really idiotic to have a weak symbolled fallback that just
returns zero, and causes this kind of bug. There should be no
backup implementation and the link should fail if the architecture
fails to provide __get_user_pages_fast() and supports transparent
hugepages.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Implement __get_user_pages_fast().
sparc64: Fix register corruption in top-most kernel stack frame during boot.
Linus Torvalds [Fri, 24 Oct 2014 19:42:55 +0000 (12:42 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"This is a pretty large update. I think it is roughly as big as what I
usually had for the _whole_ rc period.
There are a few bad bugs where the guest can OOPS or crash the host.
We have also started looking at attack models for nested
virtualization; bugs that usually result in the guest ring 0 crashing
itself become more worrisome if you have nested virtualization,
because the nested guest might bring down the non-nested guest as
well. For current uses of nested virtualization these do not really
have a security impact, but you never know and bugs are bugs
nevertheless.
A lot of these bugs are in 3.17 too, resulting in a large number of
stable@ Ccs. I checked that all the patches apply there with no
conflicts"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: vfio: fix unregister kvm_device_ops of vfio
KVM: x86: Wrong assertion on paging_tmpl.h
kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
KVM: x86: PREFETCH and HINT_NOP should have SrcMem flag
KVM: x86: Emulator does not decode clflush well
KVM: emulate: avoid accessing NULL ctxt->memopp
KVM: x86: Decoding guest instructions which cross page boundary may fail
kvm: x86: don't kill guest on unknown exit reason
kvm: vmx: handle invvpid vm exit gracefully
KVM: x86: Handle errors when RIP is set during far jumps
KVM: x86: Emulator fixes for eip canonical checks on near branches
KVM: x86: Fix wrong masking on relative jump/call
KVM: x86: Improve thread safety in pit
KVM: x86: Prevent host from panicking on shared MSR writes.
KVM: x86: Check non-canonical addresses upon WRMSR
Linus Torvalds [Fri, 24 Oct 2014 19:41:50 +0000 (12:41 -0700)]
Merge tag 'stable/for-linus-3.18-b-rc1-tag' of git://git./linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- Fix regression in xen_clocksource_read() which caused all Xen guests
to crash early in boot.
- Several fixes for super rare race conditions in the p2m.
- Assorted other minor fixes.
* tag 'stable/for-linus-3.18-b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pci: Allocate memory for physdev_pci_device_add's optarr
x86/xen: panic on bad Xen-provided memory map
x86/xen: Fix incorrect per_cpu accessor in xen_clocksource_read()
x86/xen: avoid race in p2m handling
x86/xen: delay construction of mfn_list_list
x86/xen: avoid writing to freed memory after race in p2m handling
xen/balloon: Don't continue ballooning when BP_ECANCELED is encountered
Linus Torvalds [Fri, 24 Oct 2014 19:35:48 +0000 (12:35 -0700)]
Merge tag 'sound-3.18-rc2' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Here are a chunk of small fixes since rc1: two PCM core fixes, one is
a long-standing annoyance about lockdep and another is an ARM64 mmap
fix.
The rest are a HD-audio HDMI hotplug notification fix, a fix for
missing NULL termination in Realtek codec quirks and a few new
device/codec-specific quirks as usual"
* tag 'sound-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Add missing terminating entry to SND_HDA_PIN_QUIRK macro
ALSA: pcm: Fix false lockdep warnings
ALSA: hda - Fix inverted LED gpio setup for Lenovo Ideapad
ALSA: hda - hdmi: Fix missing ELD change event on plug/unplug
ALSA: usb-audio: Add support for Steinberg UR22 USB interface
ALSA: ALC283 codec - Avoid pop noise on headphones during suspend/resume
ALSA: pcm: use the same dma mmap codepath both for arm and arm64
Linus Torvalds [Fri, 24 Oct 2014 19:33:32 +0000 (12:33 -0700)]
Merge tag 'random_for_linus' of git://git./linux/kernel/git/tytso/random
Pull /dev/random updates from Ted Ts'o:
"This adds a memzero_explicit() call which is guaranteed not to be
optimized away by GCC. This is important when we are wiping
cryptographically sensitive material"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
crypto: memzero_explicit - make sure to clear out sensitive data
random: add and use memzero_explicit() for clearing data
Linus Torvalds [Fri, 24 Oct 2014 18:29:31 +0000 (11:29 -0700)]
Merge tag 'pm+acpi-3.18-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"This is material that didn't make it to my 3.18-rc1 pull request for
various reasons, mostly related to timing and travel (LinuxCon EU /
LPC) plus a couple of fixes for recent bugs.
The only really new thing here is the PM QoS class for memory
bandwidth, but it is simple enough and users of it will be added in
the next cycle. One major change in behavior is that platform devices
enumerated by ACPI will use 32-bit DMA mask by default. Also included
is an ACPICA update to a new upstream release, but that's mostly
cleanups, changes in tools and similar. The rest is fixes and
cleanups mostly.
Specifics:
- Fix for a recent PCI power management change that overlooked the
fact that some IRQ chips might not be able to configure PCIe PME
for system wakeup from Lucas Stach.
- Fix for a bug introduced in 3.17 where acpi_device_wakeup() is
called with a wrong ordering of arguments from Zhang Rui.
- A bunch of intel_pstate driver fixes (all -stable candidates) from
Dirk Brandewie, Gabriele Mazzotta and Pali Rohár.
- Fixes for a rather long-standing problem with the OOM killer and
the freezer that frozen processes killed by the OOM do not actually
release any memory until they are thawed, so OOM-killing them is
rather pointless, with a couple of cleanups on top (Michal Hocko,
Cong Wang, Rafael J Wysocki).
- ACPICA update to upstream release
20140926, inlcuding mostly
cleanups reducing differences between the upstream ACPICA and the
kernel code, tools changes (acpidump, acpiexec) and support for the
_DDN object (Bob Moore, Lv Zheng).
- New PM QoS class for memory bandwidth from Tomeu Vizoso.
- Default 32-bit DMA mask for platform devices enumerated by ACPI
(this change is mostly needed for some drivers development in
progress targeted at 3.19) from Heikki Krogerus.
- ACPI EC driver cleanups, mostly related to debugging, from Lv
Zheng.
- cpufreq-dt driver updates from Thomas Petazzoni.
- powernv cpuidle driver update from Preeti U Murthy"
* tag 'pm+acpi-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (34 commits)
intel_pstate: Correct BYT VID values.
intel_pstate: Fix BYT frequency reporting
intel_pstate: Don't lose sysfs settings during cpu offline
cpufreq: intel_pstate: Reflect current no_turbo state correctly
cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers
cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
PCI / PM: handle failure to enable wakeup on PCIe PME
ACPI: invoke acpi_device_wakeup() with correct parameters
PM / freezer: Clean up code after recent fixes
PM: convert do_each_thread to for_each_process_thread
OOM, PM: OOM killed task shouldn't escape PM suspend
freezer: remove obsolete comments in __thaw_task()
freezer: Do not freeze tasks killed by OOM killer
ACPI / platform: provide default DMA mask
cpuidle: powernv: Populate cpuidle state details by querying the device-tree
cpufreq: cpufreq-dt: adjust message related to regulators
cpufreq: cpufreq-dt: extend with platform_data
cpufreq: allow driver-specific data
ACPI / EC: Cleanup coding style.
ACPI / EC: Refine event/query debugging messages.
...
Linus Torvalds [Fri, 24 Oct 2014 18:21:43 +0000 (11:21 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
"Sorry that I missed the merge window as there is a bug found in the
last minute, and I have to fix it and wait for the code to be tested
in linux-next tree for a few days. Now the buggy patch has been
dropped entirely from my next branch. Thus I hope those changes can
still be merged in 3.18-rc2 as most of them are platform thermal
driver changes.
Specifics:
- introduce ACPI INT340X thermal drivers.
Newer laptops and tablets may have thermal sensors and other
devices with thermal control capabilities that are exposed for the
OS to use via the ACPI INT340x device objects. Several drivers are
introduced to expose the temperature information and cooling
ability from these objects to user-space via the normal thermal
framework.
From: Lu Aaron, Lan Tianyu, Jacob Pan and Zhang Rui.
- introduce a new thermal governor, which just uses a hysteresis to
switch abruptly on/off a cooling device. This governor can be used
to control certain fan devices that can not be throttled but just
switched on or off. From: Peter Feuerer.
- introduce support for some new thermal interrupt functions on
i.MX6SX, in IMX thermal driver. From: Anson, Huang.
- introduce tracing support on thermal framework. From: Punit
Agrawal.
- small fixes in OF thermal and thermal step_wise governor"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (25 commits)
Thermal: int340x thermal: select ACPI fan driver
Thermal: int3400_thermal: use acpi_thermal_rel parsing APIs
Thermal: int340x_thermal: expose acpi thermal relationship tables
Thermal: introduce int3403 thermal driver
Thermal: introduce INT3402 thermal driver
Thermal: move the KELVIN_TO_MILLICELSIUS macro to thermal.h
ACPI / Fan: support INT3404 thermal device
ACPI / Fan: add ACPI 4.0 style fan support
ACPI / fan: convert to platform driver
ACPI / fan: use acpi_device_xxx_power instead of acpi_bus equivelant
ACPI / fan: remove no need check for device pointer
ACPI / fan: remove unused macro
Thermal: int3400 thermal: register to thermal framework
Thermal: int3400 thermal: add capability to detect supporting UUIDs
Thermal: introduce int3400 thermal driver
ACPI: add ACPI_TYPE_LOCAL_REFERENCE support to acpi_extract_package()
ACPI: make acpi_create_platform_device() an external API
thermal: step_wise: fix: Prevent from binary overflow when trend is dropping
ACPI: introduce ACPI int340x thermal scan handler
thermal: Added Bang-bang thermal governor
...
Catalin Marinas [Fri, 24 Oct 2014 17:16:47 +0000 (18:16 +0100)]
arm64: Fix memblock current_limit with 64K pages and 48-bit VA
With 48-bit VA space, the 64K page configuration uses 3 levels instead
of 2 and PUD_SIZE != PMD_SIZE. Since with 64K pages we only cover
PMD_SIZE with the initial swapper_pg_dir populated in head.S, the
memblock current_limit needs to be set accordingly in map_mem() to avoid
allocating unmapped memory. The memblock current_limit is progressively
increased as more blocks are mapped.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
David S. Miller [Fri, 24 Oct 2014 16:59:02 +0000 (09:59 -0700)]
sparc64: Implement __get_user_pages_fast().
It is not sufficient to only implement get_user_pages_fast(), you
must also implement the atomic version __get_user_pages_fast()
otherwise you end up using the weak symbol fallback implementation
which simply returns zero.
This is dangerous, because it causes the futex code to loop forever
if transparent hugepages are supported (see get_futex_key()).
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 23 Oct 2014 19:58:13 +0000 (12:58 -0700)]
sparc64: Fix register corruption in top-most kernel stack frame during boot.
Meelis Roos reported that kernels built with gcc-4.9 do not boot, we
eventually narrowed this down to only impacting machines using
UltraSPARC-III and derivitive cpus.
The crash happens right when the first user process is spawned:
[ 54.451346] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[ 54.451346]
[ 54.571516] CPU: 1 PID: 1 Comm: init Not tainted
3.16.0-rc2-00211-gd7933ab #96
[ 54.666431] Call Trace:
[ 54.698453] [
0000000000762f8c] panic+0xb0/0x224
[ 54.759071] [
000000000045cf68] do_exit+0x948/0x960
[ 54.823123] [
000000000042cbc0] fault_in_user_windows+0xe0/0x100
[ 54.902036] [
0000000000404ad0] __handle_user_windows+0x0/0x10
[ 54.978662] Press Stop-A (L1-A) to return to the boot prom
[ 55.050713] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
Further investigation showed that compiling only per_cpu_patch() with
an older compiler fixes the boot.
Detailed analysis showed that the function is not being miscompiled by
gcc-4.9, but it is using a different register allocation ordering.
With the gcc-4.9 compiled function, something during the code patching
causes some of the %i* input registers to get corrupted. Perhaps
we have a TLB miss path into the firmware that is deep enough to
cause a register window spill and subsequent restore when we get
back from the TLB miss trap.
Let's plug this up by doing two things:
1) Stop using the firmware stack for client interface calls into
the firmware. Just use the kernel's stack.
2) As soon as we can, call into a new function "start_early_boot()"
to put a one-register-window buffer between the firmware's
deepest stack frame and the top-most initial kernel one.
Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Chandran [Fri, 10 Oct 2014 11:31:24 +0000 (12:31 +0100)]
arm64: ASLR: Don't randomise text when randomise_va_space == 0
When user asks to turn off ASLR by writing "0" to
/proc/sys/kernel/randomize_va_space there should not be
any randomization to mmap base, stack, VDSO, libs, text and heap
Currently arm64 violates this behavior by randomising text.
Fix this by defining a constant ELF_ET_DYN_BASE. The randomisation of
mm->mmap_base is done by setup_new_exec -> arch_pick_mmap_layout ->
mmap_base -> mmap_rnd.
Signed-off-by: Arun Chandran <achandran@mvista.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ralf Baechle [Fri, 24 Oct 2014 10:23:26 +0000 (12:23 +0200)]
MIPS: SEAD3: Fix I2C device registration.
This isn't a module and shouldn't be one.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Wanpeng Li [Thu, 9 Oct 2014 10:30:08 +0000 (18:30 +0800)]
kvm: vfio: fix unregister kvm_device_ops of vfio
After commit
80ce163 (KVM: VFIO: register kvm_device_ops dynamically),
kvm_device_ops of vfio can be registered dynamically. Commit
3c3c29fd
(kvm-vfio: do not use module_init) move the dynamic register invoked by
kvm_init in order to fix broke unloading of the kvm module. However,
kvm_device_ops of vfio is unregistered after rmmod kvm-intel module
which lead to device type collision detection warning after kvm-intel
module reinsmod.
WARNING: CPU: 1 PID: 10358 at /root/cathy/kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:3289 kvm_init+0x234/0x282 [kvm]()
Modules linked in: kvm_intel(O+) kvm(O) nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache lockd sunrpc pci_stub bridge stp llc autofs4 8021q cpufreq_ondemand ipv6 joydev microcode pcspkr igb i2c_algo_bit ehci_pci ehci_hcd e1000e i2c_i801 ixgbe ptp pps_core hwmon mdio tpm_tis tpm ipmi_si ipmi_msghandler acpi_cpufreq isci libsas scsi_transport_sas button dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm_intel]
CPU: 1 PID: 10358 Comm: insmod Tainted: G W O 3.17.0-rc1 #2
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.
1311111329 11/11/2013
0000000000000cd9 ffff880ff08cfd18 ffffffff814a61d9 0000000000000cd9
0000000000000000 ffff880ff08cfd58 ffffffff810417b7 ffff880ff08cfd48
ffffffffa045bcac ffffffffa049c420 0000000000000040 00000000000000ff
Call Trace:
[<
ffffffff814a61d9>] dump_stack+0x49/0x60
[<
ffffffff810417b7>] warn_slowpath_common+0x7c/0x96
[<
ffffffffa045bcac>] ? kvm_init+0x234/0x282 [kvm]
[<
ffffffff810417e6>] warn_slowpath_null+0x15/0x17
[<
ffffffffa045bcac>] kvm_init+0x234/0x282 [kvm]
[<
ffffffffa016e995>] vmx_init+0x1bf/0x42a [kvm_intel]
[<
ffffffffa016e7d6>] ? vmx_check_processor_compat+0x64/0x64 [kvm_intel]
[<
ffffffff810002ab>] do_one_initcall+0xe3/0x170
[<
ffffffff811168a9>] ? __vunmap+0xad/0xb8
[<
ffffffff8109c58f>] do_init_module+0x2b/0x174
[<
ffffffff8109d414>] load_module+0x43e/0x569
[<
ffffffff8109c6d8>] ? do_init_module+0x174/0x174
[<
ffffffff8109c75a>] ? copy_module_from_user+0x39/0x82
[<
ffffffff8109b7dd>] ? module_sect_show+0x20/0x20
[<
ffffffff8109d65f>] SyS_init_module+0x54/0x81
[<
ffffffff814a9a12>] system_call_fastpath+0x16/0x1b
---[ end trace
0626f4a3ddea56f3 ]---
The bug can be reproduced by:
rmmod kvm_intel.ko
insmod kvm_intel.ko
without rmmod/insmod kvm.ko
This patch fixes the bug by unregistering kvm_device_ops of vfio when the
kvm-intel module is removed.
Reported-by: Liu Rongrong <rongrongx.liu@intel.com>
Fixes:
3c3c29fd0d7cddc32862c350d0700ce69953e3bd
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 30 Sep 2014 17:49:18 +0000 (20:49 +0300)]
KVM: x86: Wrong assertion on paging_tmpl.h
Even after the recent fix, the assertion on paging_tmpl.h is triggered.
Apparently, the assertion wants to check that the PAE is always set on
long-mode, but does it in incorrect way. Note that the assertion is not
enabled unless the code is debugged by defining MMU_DEBUG.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Quentin Casasnovas [Fri, 17 Oct 2014 20:55:59 +0000 (22:55 +0200)]
kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
The third parameter of kvm_unpin_pages() when called from
kvm_iommu_map_pages() is wrong, it should be the number of pages to un-pin
and not the page size.
This error was facilitated with an inconsistent API: kvm_pin_pages() takes
a size, but kvn_unpin_pages() takes a number of pages, so fix the problem
by matching the two.
This was introduced by commit
350b8bd ("kvm: iommu: fix the third parameter
of kvm_iommu_put_pages (CVE-2014-3601)"), which fixes the lack of
un-pinning for pages intended to be un-pinned (i.e. memory leak) but
unfortunately potentially aggravated the number of pages we un-pin that
should have stayed pinned. As far as I understand though, the same
practical mitigations apply.
This issue was found during review of Red Hat 6.6 patches to prepare
Ksplice rebootless updates.
Thanks to Vegard for his time on a late Friday evening to help me in
understanding this code.
Fixes:
350b8bd ("kvm: iommu: fix the third parameter of... (CVE-2014-3601)")
Cc: stable@vger.kernel.org
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 13 Oct 2014 10:04:14 +0000 (13:04 +0300)]
KVM: x86: PREFETCH and HINT_NOP should have SrcMem flag
The decode phase of the x86 emulator assumes that every instruction with the
ModRM flag, and which can be used with RIP-relative addressing, has either
SrcMem or DstMem. This is not the case for several instructions - prefetch,
hint-nop and clflush.
Adding SrcMem|NoAccess for prefetch and hint-nop and SrcMem for clflush.
This fixes CVE-2014-8480.
Fixes:
41061cdb98a0bec464278b4db8e894a3121671f5
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 13 Oct 2014 10:04:13 +0000 (13:04 +0300)]
KVM: x86: Emulator does not decode clflush well
Currently, all group15 instructions are decoded as clflush (e.g., mfence,
xsave). In addition, the clflush instruction requires no prefix (66/f2/f3)
would exist. If prefix exists it may encode a different instruction (e.g.,
clflushopt).
Creating a group for clflush, and different group for each prefix.
This has been the case forever, but the next patch needs the cflush group
in order to fix a bug introduced in 3.17.
Fixes:
41061cdb98a0bec464278b4db8e894a3121671f5
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 23 Oct 2014 12:54:14 +0000 (14:54 +0200)]
KVM: emulate: avoid accessing NULL ctxt->memopp
A failure to decode the instruction can cause a NULL pointer access.
This is fixed simply by moving the "done" label as close as possible
to the return.
This fixes CVE-2014-8481.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Fixes:
41061cdb98a0bec464278b4db8e894a3121671f5
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ralf Baechle [Fri, 24 Oct 2014 11:27:37 +0000 (13:27 +0200)]
MIPS: SEAD3: Nuke PIC32 I2C driver.
A platform driver for which nothing ever registers the corresponding
platform device.
Also it was driving the same hardware as sead3-i2c-drv.c so redundant
anyway and couldn't co-exist with that driver because each of them was
using a private spinlock to protect access to the same hardware
resources.
This also fixes a randconfig problem:
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c: In function 'i2c_platform_probe':
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c:345:2: error: implicit declaration of
function 'i2c_add_numbered_adapter' [-Werror=implicit-function-declaration]
ret = i2c_add_numbered_adapter(&priv->adap);
^
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c: In function
'i2c_platform_remove':
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c:361:2: error: implicit declaration
of function 'i2c_del_adapter' [-Werror=implicit-function-declaration]
i2c_del_adapter(&priv->adap);
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Nadav Amit [Thu, 2 Oct 2014 22:10:04 +0000 (01:10 +0300)]
KVM: x86: Decoding guest instructions which cross page boundary may fail
Once an instruction crosses a page boundary, the size read from the second page
disregards the common case that part of the operand resides on the first page.
As a result, fetch of long insturctions may fail, and thereby cause the
decoding to fail as well.
Cc: stable@vger.kernel.org
Fixes:
5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Michael S. Tsirkin [Thu, 18 Sep 2014 13:21:16 +0000 (16:21 +0300)]
kvm: x86: don't kill guest on unknown exit reason
KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was
triggered by a priveledged application. Let's not kill the guest: WARN
and inject #UD instead.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Petr Matousek [Tue, 23 Sep 2014 18:22:30 +0000 (20:22 +0200)]
kvm: vmx: handle invvpid vm exit gracefully
On systems with invvpid instruction support (corresponding bit in
IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid
causes vm exit, which is currently not handled and results in
propagation of unknown exit to userspace.
Fix this by installing an invvpid vm exit handler.
This is CVE-2014-3646.
Cc: stable@vger.kernel.org
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 18 Sep 2014 19:39:39 +0000 (22:39 +0300)]
KVM: x86: Handle errors when RIP is set during far jumps
Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not
handle this case, and may result in failed vm-entry once the assignment is
done. The tricky part of doing so is that loading the new CS affects the
VMCS/VMCB state, so if we fail during loading the new RIP, we are left in
unconsistent state. Therefore, this patch saves on 64-bit the old CS
descriptor and restores it if loading RIP failed.
This fixes CVE-2014-3647.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 18 Sep 2014 19:39:38 +0000 (22:39 +0300)]
KVM: x86: Emulator fixes for eip canonical checks on near branches
Before changing rip (during jmp, call, ret, etc.) the target should be asserted
to be canonical one, as real CPUs do. During sysret, both target rsp and rip
should be canonical. If any of these values is noncanonical, a #GP exception
should occur. The exception to this rule are syscall and sysenter instructions
in which the assigned rip is checked during the assignment to the relevant
MSRs.
This patch fixes the emulator to behave as real CPUs do for near branches.
Far branches are handled by the next patch.
This fixes CVE-2014-3647.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 18 Sep 2014 19:39:37 +0000 (22:39 +0300)]
KVM: x86: Fix wrong masking on relative jump/call
Relative jumps and calls do the masking according to the operand size, and not
according to the address size as the KVM emulator does today.
This patch fixes KVM behavior.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andy Honig [Wed, 27 Aug 2014 21:42:54 +0000 (14:42 -0700)]
KVM: x86: Improve thread safety in pit
There's a race condition in the PIT emulation code in KVM. In
__kvm_migrate_pit_timer the pit_timer object is accessed without
synchronization. If the race condition occurs at the wrong time this
can crash the host kernel.
This fixes CVE-2014-3611.
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andy Honig [Wed, 27 Aug 2014 18:16:44 +0000 (11:16 -0700)]
KVM: x86: Prevent host from panicking on shared MSR writes.
The previous patch blocked invalid writes directly when the MSR
is written. As a precaution, prevent future similar mistakes by
gracefulling handle GPs caused by writes to shared MSRs.
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 16 Sep 2014 00:24:05 +0000 (03:24 +0300)]
KVM: x86: Check non-canonical addresses upon WRMSR
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).
Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD. To eliminate the differences between the architecutres, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.
Some references from Intel and AMD manuals:
According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
According to AMD manual instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."
This patch fixes CVE-2014-3610.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Henningsson [Fri, 24 Oct 2014 08:00:38 +0000 (10:00 +0200)]
ALSA: hda - Add missing terminating entry to SND_HDA_PIN_QUIRK macro
Without this terminating entry, the pin matching would continue
across random memory until a zero or a non-matching entry was found.
The result being that in some cases, the pin quirk would not be
applied correctly.
Cc: stable@vger.kernel.org
Signed-off-by: David Henningsson <david.henningsson@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Olof Johansson [Fri, 24 Oct 2014 04:05:45 +0000 (21:05 -0700)]
Merge tag 'socfpga_fixes_for_3.18' of git://git.rocketboards.org/linux-socfpga-next into fixes
Merge "SOCFPGA fixes for 3.18" from Dinh Nguyen:
These patches fixes an SMP and SDMMC driver hang during boot up on the
SOCFPGA platform.
Patch "arm: socfpga: fix fetching cpu1start_addr for SMP" fixes the SMP
trampoline code in order for CPU1 to correctly fetch it's cpu1start_addr.
Patch "ARM: dts: socfpga: rename gpio nodes" renames that GPIO node in order
to allow a standard way of specifying status="okay" in the board DTS file.
Patch "ARM: dts: socfpga: Fix SD card detect" fixes a SDMMC driver hang
during boot. The reason for the hang was the deferred probe of the SDMMC
driver was waiting for the GPIO resource that would never come.
Patch "ARM: dts: socfpga: Add a 3.3V fixed regulator node" adds a fixed
regulator node for the SDMMC driver to use.
* tag 'socfpga_fixes_for_3.18' of git://git.rocketboards.org/linux-socfpga-next:
ARM: dts: socfpga: Add a 3.3V fixed regulator node
ARM: dts: socfpga: Fix SD card detect
ARM: dts: socfpga: rename gpio nodes
arm: socfpga: fix fetching cpu1start_addr for SMP
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Fri, 24 Oct 2014 04:02:49 +0000 (21:02 -0700)]
Merge tag 'at91-fixes' of git://git./linux/kernel/git/nferre/linux-at91 into fixes
Merge "at91: fixes for v3.18 #1" from Nicholas Ferre:
First AT91 fixes for 3.18:
- one more MAINTAINERS entry for the SSC driver
- a fix for the newly introduced power/reset driver
- a fix on at91sam9263 USB due to PLLB misconfiguration
* tag 'at91-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nferre/linux-at91:
ARM: at91/dt: sam9263: fix PLLB frequencies
power: reset: at91-reset: fix power down register
MAINTAINERS: add atmel ssc driver maintainer entry
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Fri, 24 Oct 2014 04:01:02 +0000 (21:01 -0700)]
Merge tag 'zynq-dt-fixes-for-3.18' of https://github.com/Xilinx/linux-xlnx into fixes
Merge "Xilinx Zynq dt fixes for v3.18" from Michal Simek:
arm: Xilinx Zynq DT fixes for v3.18
- Fix gem register size
- Fix OPP
- Add missing references
- Trivial cleanup
* tag 'zynq-dt-fixes-for-3.18' of https://github.com/Xilinx/linux-xlnx:
ARM: zynq: DT: trivial: Fix mc node
ARM: zynq: DT: Add cadence watchdog node
ARM: zynq: DT: Add missing reference for memory-controller
ARM: zynq: DT: Add missing reference for ADC
ARM: zynq: DT: Add missing address for L2 pl310
ARM: zynq: DT: Remove 222 MHz OPP
ARM: zynq: DT: Fix GEM register area size
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Mon, 20 Oct 2014 07:18:50 +0000 (00:18 -0700)]
ARM: multi_v7_defconfig: enable CONFIG_MMC_DW_ROCKCHIP
Allows booting from SD/MMC on RK3288 and other platforms. Added here so I
can enable the board in the boot farm.
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Sun, 19 Oct 2014 21:29:14 +0000 (14:29 -0700)]
ARM: sunxi_defconfig: enable CONFIG_REGULATOR_FIXED_VOLTAGE
I missed in
9a2ad529ed26 that REGULATOR_FIXED_VOLTAGE had also gotten
deselected, so it needs to be added back as an explicit option.
Signed-off-by: Olof Johansson <olof@lixom.net>
Al Viro [Fri, 24 Oct 2014 02:52:55 +0000 (22:52 -0400)]
Merge branch 'overlayfs.v25' of git://git./linux/kernel/git/mszeredi/vfs into for-linus
Al Viro [Thu, 23 Oct 2014 17:26:21 +0000 (13:26 -0400)]
fix inode leaks on d_splice_alias() failure exits
d_splice_alias() callers expect it to either stash the inode reference
into a new alias, or drop the inode reference. That makes it possible
to just return d_splice_alias() result from ->lookup() instance, without
any extra housekeeping required.
Unfortunately, that should include the failure exits. If d_splice_alias()
returns an error, it leaves the dentry it has been given negative and
thus it *must* drop the inode reference. Easily fixed, but it goes way
back and will need backporting.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Markos Chandras [Mon, 20 Oct 2014 08:39:31 +0000 (09:39 +0100)]
MIPS: ftrace: Fix a microMIPS build problem
Code before the .fixup section needs to have the .insn directive.
This has no side effects on MIPS32/64 but it affects the way microMIPS
loads the address for the return label.
Fixes the following build problem:
mips-linux-gnu-ld: arch/mips/built-in.o: .fixup+0x4a0: Unsupported jump between
ISA modes; consider recompiling with interlinking enabled.
mips-linux-gnu-ld: final link failed: Bad value
Makefile:819: recipe for target 'vmlinux' failed
The fix is similar to
1658f914ff91c3bf ("MIPS: microMIPS:
Disable LL/SC and fix linker bug.")
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Cc: stable@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/8117/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Stefan Hengelein [Sun, 19 Oct 2014 18:04:26 +0000 (20:04 +0200)]
MIPS: MSP71xx: Fix build error
When CONFIG_MIPS_MT_SMP is enabled, the following compilation error
occurs:
arch/mips/pmcs-msp71xx/msp_irq_cic.c:134: error: ‘irq’ undeclared
This code clearly never saw a compiler.
The surrounding code suggests, that 'd->irq' was intended, not
'irq'.
This error was found with vampyr.
Signed-off-by: Stefan Hengelein <stefan.hengelein@fau.de>
Fixes:
d7881fbdf866d7d0 ("MIPS: msp71xx: Convert to new irq_chip functions")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/8116/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:39 +0000 (00:14 +0200)]
fs: limit filesystem stacking depth
Add a simple read-only counter to super_block that indicates how deep this
is in the stack of filesystems. Previously ecryptfs was the only stackable
filesystem and it explicitly disallowed multiple layers of itself.
Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.
To limit the kernel stack usage we must limit the depth of the
filesystem stack. Initially the limit is set to 2.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Neil Brown [Thu, 23 Oct 2014 22:14:39 +0000 (00:14 +0200)]
overlay: overlay filesystem documentation
Document the overlay filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Erez Zadok [Thu, 23 Oct 2014 22:14:38 +0000 (00:14 +0200)]
overlayfs: implement show_options
This is useful because of the stacking nature of overlayfs. Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.
AV: even failing ovl_parse_opt() could've done some kstrdup()
AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Andy Whitcroft [Thu, 23 Oct 2014 22:14:38 +0000 (00:14 +0200)]
overlayfs: add statfs support
Add support for statfs to the overlayfs filesystem. As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs. There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.
Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:38 +0000 (00:14 +0200)]
overlay filesystem
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.
This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.
The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.
The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.
Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.
Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).
Usage:
mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay
The following cotributions have been folded into this patch:
Neil Brown <neilb@suse.de>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read
Felix Fietkau <nbd@openwrt.org>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts
Erez Zadok <ezk@fsl.cs.sunysb.edu>
- fix cleanup after WARN_ON
Sedat Dilek <sedat.dilek@googlemail.com>
- fix up permission to confirm to new API
Robin Dong <hao.bigrat@gmail.com>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link
Andy Whitcroft <apw@canonical.com>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode
AV:
- ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
- ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
lack of check for udir being the parent of upper, dropping and regaining
the lock on udir (which would require _another_ check for parent being
right).
- bogus d_drop() in copyup and rename [fix from your mail]
- copyup/remove and copyup/rename races [fix from your mail]
- ovl_dir_fsync() leaving ERR_PTR() in ->realfile
- ovl_entry_free() is pointless - it's just a kfree_rcu()
- fold ovl_do_lookup() into ovl_lookup()
- manually assigning ->d_op is wrong. Just use ->s_d_op.
[patches picked from Miklos]:
* copyup/remove and copyup/rename races
* bogus d_drop() in copyup and rename
Also thanks to the following people for testing and reporting bugs:
Jordi Pujol <jordipujolp@gmail.com>
Andy Whitcroft <apw@canonical.com>
Michal Suchanek <hramrach@centrum.cz>
Felix Fietkau <nbd@openwrt.org>
Erez Zadok <ezk@fsl.cs.sunysb.edu>
Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:37 +0000 (00:14 +0200)]
shmem: support RENAME_WHITEOUT
Allocate a dentry, initialize it with a whiteout and hash it in the place
of the old dentry. Later the old dentry will be moved away and the
whiteout will remain.
i_mutex protects agains concurrent readdir.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:37 +0000 (00:14 +0200)]
ext4: support RENAME_WHITEOUT
Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is
created before the rename takes place. The whiteout inode is added to the
old entry instead of deleting it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:37 +0000 (00:14 +0200)]
vfs: add RENAME_WHITEOUT
This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a
whiteout of source. The whiteout creation is atomic relative to the
rename.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:36 +0000 (00:14 +0200)]
vfs: add whiteout support
Whiteout isn't actually a new file type, but is represented as a char
device (Linus's idea) with 0/0 device number.
This has several advantages compared to introducing a new whiteout file
type:
- no userspace API changes (e.g. trivial to make backups of upper layer
filesystem, without losing whiteouts)
- no fs image format changes (you can boot an old kernel/fsck without
whiteout support and things won't break)
- implementation is trivial
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:36 +0000 (00:14 +0200)]
vfs: export check_sticky()
It's already duplicated in btrfs and about to be used in overlayfs too.
Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:36 +0000 (00:14 +0200)]
vfs: introduce clone_private_mount()
Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:35 +0000 (00:14 +0200)]
vfs: export __inode_permission() to modules
We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:35 +0000 (00:14 +0200)]
vfs: export do_splice_direct() to modules
Export do_splice_direct() to modules. Needed by overlay filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 23 Oct 2014 22:14:35 +0000 (00:14 +0200)]
vfs: add i_op->dentry_open()
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Linus Torvalds [Thu, 23 Oct 2014 22:04:27 +0000 (15:04 -0700)]
Merge tag 'remove-weak-declarations' of git://git./linux/kernel/git/helgaas/pci
Pull weak function declaration removal from Bjorn Helgaas:
"The "weak" attribute is commonly used for the default version of a
function, where an architecture can override it by providing a strong
version.
Some header file declarations included the "weak" attribute. That's
error-prone because it causes every implementation to be weak, with no
strong version at all, and the linker chooses one based on link order.
What we want is the "weak" attribute only on the *definition* of the
default implementation. These changes remove "weak" from the
declarations, leaving it on the default definitions"
* tag 'remove-weak-declarations' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
uprobes: Remove "weak" from function declarations
memory-hotplug: Remove "weak" from memory_block_size_bytes() declaration
kgdb: Remove "weak" from kgdb_arch_pc() declaration
ARC: kgdb: generic kgdb_arch_pc() suffices
vmcore: Remove "weak" from function declarations
clocksource: Remove "weak" from clocksource_default_clock() declaration
x86, intel-mid: Remove "weak" from function declarations
audit: Remove "weak" from audit_classify_compat_syscall() declaration