Nick Piggin [Fri, 7 Jan 2011 06:50:08 +0000 (17:50 +1100)]
fs: prefetch inode data in dcache lookup
This makes single threaded git diff -1.25% +/- 0.05% elapsed time on my
2s12c24t Westmere system, and -0.86% +/- 0.05% on my 2s8c Barcelona, by
prefetching the important first cacheline of the inode in while we do the
actual name compare and other operations on the dentry.
There was no measurable slowdown in the single file stat case, or the creat
case (where negative dentries would be common).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:07 +0000 (17:50 +1100)]
fs: improve scalability of pseudo filesystems
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:06 +0000 (17:50 +1100)]
fs: dcache per-inode inode alias locking
dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:05 +0000 (17:50 +1100)]
fs: dcache per-bucket dcache hash locking
We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:04 +0000 (17:50 +1100)]
bit_spinlock: add required includes
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:03 +0000 (17:50 +1100)]
kernel: add bl_list
Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head. This will be subsequently used to implement per-bucket bit spinlock
for inode and dentry hashes, and may be useful in other cases such as network
hashes.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:02 +0000 (17:50 +1100)]
xfs: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:01 +0000 (17:50 +1100)]
btrfs: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:50:00 +0000 (17:50 +1100)]
ext2,3,4: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:59 +0000 (17:49 +1100)]
fs: provide simple rcu-walk generic_check_acl implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
This could easily be extended to put acls under RCU and check them
under seqlock, if need be. But this implementation is enough to show
the rcu-walk aware permissions code for path lookups is working, and
will handle cases where there are no ACLs or ACLs in just the final
element.
This patch implicity converts tmpfs to rcu-aware permission check.
Subsequent patches onvert ext*, xfs, and, btrfs. Each of these uses
acl/permission code in a different way, so convert them all to provide
templates and proof of concept.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:58 +0000 (17:49 +1100)]
fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:57 +0000 (17:49 +1100)]
fs: rcu-walk aware d_revalidate method
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:56 +0000 (17:49 +1100)]
fs: cache optimise dentry and inode for rcu-walk
Put dentry and inode fields into top of data structure. This allows RCU path
traversal to perform an RCU dentry lookup in a path walk by touching only the
first 56 bytes of the dentry.
We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.
inode is also rearranged so that RCU lookup will only touch a single cacheline
in the inode, plus one in the i_ops structure.
This is important for directory component lookups in RCU path walking. In the
kernel source, directory names average is around 6 chars, so this works.
When we reach the last element of the lookup, we need to lock it and take its
refcount which requires another cacheline access.
Align dentry and inode operations structs, so members will be at predictable
offsets and we can group common operations into head of structure.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:55 +0000 (17:49 +1100)]
fs: dcache reduce branches in lookup path
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:54 +0000 (17:49 +1100)]
fs: dcache remove d_mounted
Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
The flag can be cleared by checking the hash table to see if there are any
mounts left, which is not time critical because it is performed at detach time.
The mounted state of a dentry is only used to speculatively take a look in the
mount hash table if it is set -- before following the mount, vfsmount lock is
taken and mount re-checked without races.
This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
might use later (and some configs have larger than 32-bit spinlocks which might
make use of the hole).
Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.
At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.
XXX: autofs may not work right if we have other mounts go over the top of it?
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:53 +0000 (17:49 +1100)]
fs: fs_struct use seqlock
Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.
Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workload.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:52 +0000 (17:49 +1100)]
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:51 +0000 (17:49 +1100)]
kernel: optimise seqlock
Add branch annotations for seqlock read fastpath, and introduce
__read_seqcount_begin and __read_seqcount_end functions, that can avoid the
smp_rmb() if used carefully. These will be used by store-free path walking
algorithm performance is critical and seqlocks are in use.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:50 +0000 (17:49 +1100)]
fs: avoid inode RCU freeing for pseudo fs
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:49 +0000 (17:49 +1100)]
fs: icache RCU free inodes
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:48 +0000 (17:49 +1100)]
fs: consolidate dentry kill sequence
The tricky locking for disposing of a dentry is duplicated 3 times in the
dcache (dput, pruning a dentry from the LRU, and pruning its ancestors).
Consolidate them all into a single function dentry_kill.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:47 +0000 (17:49 +1100)]
fs: use RCU in shrink_dentry_list to reduce lock nesting
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:46 +0000 (17:49 +1100)]
fs: reduce dcache_inode_lock width in lru scanning
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:45 +0000 (17:49 +1100)]
fs: dcache reduce prune_one_dentry locking
prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
is probably less common.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:44 +0000 (17:49 +1100)]
fs: dcache reduce d_parent locking
Use RCU to simplify locking in dget_parent.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:43 +0000 (17:49 +1100)]
fs: dcache rationalise dget variants
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:42 +0000 (17:49 +1100)]
fs: dcache reduce dcache_inode_lock
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:41 +0000 (17:49 +1100)]
fs: dcache reduce locking in d_alloc
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:40 +0000 (17:49 +1100)]
fs: dcache reduce dput locking
It is possible to run dput without taking data structure locks up-front. In
many cases where we don't kill the dentry anyway, these locks are not required.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:39 +0000 (17:49 +1100)]
fs: dcache avoid starvation in dcache multi-step operations
Long lived dcache "multi-step" operations which retry on rename seq can
be starved with a lot of rename activity. If they fail after the 1st pass,
take the rename_lock for writing to avoid further starvation.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:38 +0000 (17:49 +1100)]
fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:37 +0000 (17:49 +1100)]
fs: Use rename lock and RCU for multi-step operations
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.
We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:36 +0000 (17:49 +1100)]
fs: increase d_name lock coverage
Cover d_name with d_lock in more cases, where there may be concurrent
modification to it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:35 +0000 (17:49 +1100)]
fs: scale inode alias list
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:34 +0000 (17:49 +1100)]
fs: dcache scale subdirs
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:33 +0000 (17:49 +1100)]
fs: dcache scale d_unhashed
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:32 +0000 (17:49 +1100)]
fs: dcache scale dentry refcount
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:31 +0000 (17:49 +1100)]
fs: dcache scale lru
Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock, which allows LRU lists to be
accessed without the lru lock, using RCU in future patches.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:30 +0000 (17:49 +1100)]
fs: dcache scale hash
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:29 +0000 (17:49 +1100)]
hostfs: simplify locking
Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:28 +0000 (17:49 +1100)]
fs: change d_hash for rcu-walk
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:27 +0000 (17:49 +1100)]
fs: change d_compare for rcu-walk
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:26 +0000 (17:49 +1100)]
fs: name case update method
smpfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:25 +0000 (17:49 +1100)]
jfs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:24 +0000 (17:49 +1100)]
cifs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:23 +0000 (17:49 +1100)]
fs: change d_delete semantics
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:22 +0000 (17:49 +1100)]
fs: dcache documentation cleanup
Remove redundant (and incorrect, since dcache RCU lookup) dentry locking
documentation and point to the canonical document.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:21 +0000 (17:49 +1100)]
config fs: avoid switching ->d_op on live dentry
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix configfs so as not to do this.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:20 +0000 (17:49 +1100)]
cgroup fs: avoid switching ->d_op on live dentry
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix cgroupfs so as not to do this.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:19 +0000 (17:49 +1100)]
fs: use fast counters for vfs caches
percpu_counter library generates quite nasty code, so unless you need
to dynamically allocate counters or take fast approximate value, a
simple per cpu set of counters is much better.
The percpu_counter can never be made to work as well, because it has an
indirection from pointer to percpu memory, and it can't use direct
this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
code will always be worse.
In the fastpath, it is the difference between this:
incl %gs:nr_dentry # nr_dentry
and this:
movl percpu_counter_batch(%rip), %edx # percpu_counter_batch,
movl $1, %esi #,
movq $nr_dentry, %rdi #,
call __percpu_counter_add # (plus I clobber registers)
__percpu_counter_add:
pushq %rbp #
movq %rsp, %rbp #,
subq $32, %rsp #,
movq %rbx, -24(%rbp) #,
movq %r12, -16(%rbp) #,
movq %r13, -8(%rbp) #,
movq %rdi, %rbx # fbc, fbc
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
movq %gs:kernel_stack,%rax #, pfo_ret__
# 0 "" 2
#NO_APP
incl -8124(%rax) # <variable>.preempt_count
movq 32(%rdi), %r12 # <variable>.counters, tcp_ptr__
#APP
# 78 "lib/percpu_counter.c" 1
add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__
# 0 "" 2
#NO_APP
movslq (%r12),%r13 #* tcp_ptr__, tmp73
movslq %edx,%rax # batch, batch
addq %rsi, %r13 # amount, count
cmpq %rax, %r13 # batch, count
jge .L27 #,
negl %edx # tmp76
movslq %edx,%rdx # tmp76, tmp77
cmpq %rdx, %r13 # tmp77, count
jg .L28 #,
.L27:
movq %rbx, %rdi # fbc,
call _raw_spin_lock #
addq %r13, 8(%rbx) # count, <variable>.count
movq %rbx, %rdi # fbc,
movl $0, (%r12) #,* tcp_ptr__
call _raw_spin_unlock #
.L29:
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
movq %gs:kernel_stack,%rax #, pfo_ret__
# 0 "" 2
#NO_APP
decl -8124(%rax) # <variable>.preempt_count
movq -8136(%rax), %rax #, D.14625
testb $8, %al #, D.14625
jne .L32 #,
.L31:
movq -24(%rbp), %rbx #,
movq -16(%rbp), %r12 #,
movq -8(%rbp), %r13 #,
leave
ret
.p2align 4,,10
.p2align 3
.L28:
movl %r13d, (%r12) # count,*
jmp .L29 #
.L32:
call preempt_schedule #
.p2align 4,,6
jmp .L31 #
.size __percpu_counter_add, .-__percpu_counter_add
.p2align 4,,15
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:18 +0000 (17:49 +1100)]
vfs: revert per-cpu nr_unused counters for dentry and inodes
The nr_unused counters count the number of objects on an LRU, and as such they
are synchronized with LRU object insertion and removal and scanning, and
protected under the LRU lock.
Making it per-cpu does not actually get any concurrency improvements because of
this lock, and summing the counter is much slower, and
incrementing/decrementing it costs more code size and is slower too.
These counters should stay per-LRU, which currently means global.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:17 +0000 (17:49 +1100)]
kernel: kmem_ptr_validate considered harmful
This is a nasty and error prone API. It is no longer used, remove it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Fri, 7 Jan 2011 06:49:16 +0000 (17:49 +1100)]
fs: d_validate fixes
d_validate has been broken for a long time.
kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.
So the parent cannot be checked, nor the name hashed. The dentry pointer
can not be touched until it can be verified under lock. Hashing simply
cannot be used.
Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Nick Piggin [Wed, 5 Jan 2011 09:01:21 +0000 (20:01 +1100)]
Revert "fs: use RCU read side protection in d_validate"
This reverts commit
3825bdb7ed920845961f32f364454bee5f469abb.
You cannot dget() a dentry without having a reference, or holding
a lock that guarantees it remains valid.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Linus Torvalds [Wed, 5 Jan 2011 00:50:19 +0000 (16:50 -0800)]
Linux 2.6.37
Linus Torvalds [Tue, 4 Jan 2011 21:55:49 +0000 (13:55 -0800)]
Merge git://git./linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
ipv4/route.c: respect prefsrc for local routes
bridge: stp: ensure mac header is set
bridge: fix br_multicast_ipv6_rcv for paged skbs
atl1: fix oops when changing tx/rx ring params
drivers/atm/atmtcp.c: add missing atm_dev_put
starfire: Fix dma_addr_t size test for MIPS
tg3: fix return value check in tg3_read_vpd()
Broadcom CNIC core network driver: fix mem leak on allocation failures in cnic_alloc_uio_rings()
ISDN, Gigaset: Fix memory leak in do_disconnect_req()
CAN: Use inode instead of kernel address for /proc file
skfp: testing the wrong variable in skfp_driver_init()
ppp: allow disabling multilink protocol ID compression
ehea: Avoid changing vlan flags
ueagle-atm: fix PHY signal initialization race
Joel Sing [Mon, 3 Jan 2011 20:24:20 +0000 (20:24 +0000)]
ipv4/route.c: respect prefsrc for local routes
The preferred source address is currently ignored for local routes,
which results in all local connections having a src address that is the
same as the local dst address. Fix this by respecting the preferred source
address when it is provided for local routes.
This bug can be demonstrated as follows:
# ifconfig dummy0 192.168.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 192.168.0.1
# ip route change table local local 192.168.0.1 dev dummy0 \
proto kernel scope host src 127.0.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 127.0.0.1
We now establish a local connection and verify the source IP
address selection:
# nc -l 192.168.0.1 3128 &
# nc 192.168.0.1 3128 &
# netstat -ant | grep 192.168.0.1:3128.*EST
tcp 0 0 192.168.0.1:3128 192.168.0.1:33228 ESTABLISHED
tcp 0 0 192.168.0.1:33228 192.168.0.1:3128 ESTABLISHED
Signed-off-by: Joel Sing <jsing@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Hellwig [Tue, 4 Jan 2011 06:14:24 +0000 (07:14 +0100)]
remove trim_fs method from Documentation/filesystems/Locking
The ->trim_fs has been removed meanwhile, so remove it from the documentation
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 4 Jan 2011 00:37:01 +0000 (16:37 -0800)]
Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: pxa: fix page table corruption on resume
ARM: it8152: add IT8152_LAST_IRQ definition to fix build error
ARM: pxa: PXA_ESERIES depends on FB_W100.
ARM: 6605/1: Add missing include "asm/memory.h"
ARM: 6540/1: Stop irqsoff trace on return to user
ARM: 6537/1: update Nomadik, U300 and Ux500 maintainers
ARM: 6536/1: Add missing SZ_{32,64,128}
ARM: fix cache-feroceon-l2 after stack based kmap_atomic()
ARM: fix cache-xsc3l2 after stack based kmap_atomic()
ARM: get rid of kmap_high_l1_vipt()
ARM: smp: avoid incrementing mm_users on CPU startup
ARM: pxa: PXA_ESERIES depends on FB_W100.
Andrew Morton [Mon, 3 Jan 2011 22:59:11 +0000 (14:59 -0800)]
arch/mn10300/kernel/irq.c: fix build
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=25702
Reported-by: Martin Ettl <ettl.martin@gmx.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mimi Zohar [Mon, 3 Jan 2011 22:59:10 +0000 (14:59 -0800)]
ima: fix add LSM rule bug
If security_filter_rule_init() doesn't return a rule, then not everything
is as fine as the return code implies.
This bug only occurs when the LSM (eg. SELinux) is disabled at runtime.
Adding an empty LSM rule causes ima_match_rules() to always succeed,
ignoring any remaining rules.
default IMA TCB policy:
# PROC_SUPER_MAGIC
dont_measure fsmagic=0x9fa0
# SYSFS_MAGIC
dont_measure fsmagic=0x62656572
# DEBUGFS_MAGIC
dont_measure fsmagic=0x64626720
# TMPFS_MAGIC
dont_measure fsmagic=0x01021994
# SECURITYFS_MAGIC
dont_measure fsmagic=0x73636673
< LSM specific rule >
dont_measure obj_type=var_log_t
measure func=BPRM_CHECK
measure func=FILE_MMAP mask=MAY_EXEC
measure func=FILE_CHECK mask=MAY_READ uid=0
Thus without the patch, with the boot parameters 'tcb selinux=0', adding
the above 'dont_measure obj_type=var_log_t' rule to the default IMA TCB
measurement policy, would result in nothing being measured. The patch
prevents the default TCB policy from being replaced.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Cc: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: David Safford <safford@watson.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Russell King [Mon, 3 Jan 2011 22:55:21 +0000 (22:55 +0000)]
Merge branch 'fix' of git://git./linux/kernel/git/ycmiao/pxa-linux-2.6
Florian Westphal [Mon, 3 Jan 2011 04:16:28 +0000 (04:16 +0000)]
bridge: stp: ensure mac header is set
commit
bf9ae5386bca8836c16e69ab8fdbe46767d7452a
(llc: use dev_hard_header) removed the
skb_reset_mac_header call from llc_mac_hdr_init.
This seems fine itself, but br_send_bpdu() invokes ebtables LOCAL_OUT.
We oops in ebt_basic_match() because it assumes eth_hdr(skb) returns
a meaningful result.
Cc: acme@ghostprotocols.net
References: https://bugzilla.kernel.org/show_bug.cgi?id=24532
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 3 Jan 2011 19:51:22 +0000 (11:51 -0800)]
Merge branch 'perf-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Fix callchain hit bad cast on ascii display
arch/x86/oprofile/op_model_amd.c: Perform initialisation on a single CPU
watchdog: Improve initialisation error message and documentation
Linus Torvalds [Mon, 3 Jan 2011 19:50:26 +0000 (11:50 -0800)]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6:
[media] em28xx: radio_fops should also use unlocked_ioctl
[media] wm8775: Revert changeset
fcb9757333 to avoid a regression
[media] cx25840: Prevent device probe failure due to volume control ERANGE error
Linus Torvalds [Mon, 3 Jan 2011 19:48:54 +0000 (11:48 -0800)]
Merge branch 'fixes' of git://git./linux/kernel/git/djbw/async_tx
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
dmaengine: provide dummy functions for DMA_ENGINE=n
mv_xor: fix race in tasklet function
Jan Beulich [Mon, 3 Jan 2011 15:07:02 +0000 (15:07 +0000)]
name_to_dev_t() must not call __init code
The function can't be __init itself (being called from some sysfs
handler), and hence none of the functions it calls can be either.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tomas Winkler [Mon, 3 Jan 2011 19:26:08 +0000 (11:26 -0800)]
bridge: fix br_multicast_ipv6_rcv for paged skbs
use pskb_may_pull to access ipv6 header correctly for paged skbs
It was omitted in the bridge code leading to crash in blind
__skb_pull
since the skb is cloned undonditionally we also simplify the
the exit path
this fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=25202
Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 IEEE 802.11: authenticated
Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 IEEE 802.11: associated (aid 2)
Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 RADIUS: starting accounting session
4D0608A3-
00000005
Dec 15 14:36:41 User-PC kernel: [175576.120287] ------------[ cut here ]------------
Dec 15 14:36:41 User-PC kernel: [175576.120452] kernel BUG at include/linux/skbuff.h:1178!
Dec 15 14:36:41 User-PC kernel: [175576.120609] invalid opcode: 0000 [#1] SMP
Dec 15 14:36:41 User-PC kernel: [175576.120749] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
Dec 15 14:36:41 User-PC kernel: [175576.121035] Modules linked in: approvals binfmt_misc bridge stp llc parport_pc ppdev arc4 iwlagn snd_hda_codec_realtek iwlcore i915 snd_hda_intel mac80211 joydev snd_hda_codec snd_hwdep snd_pcm snd_seq_midi drm_kms_helper snd_rawmidi drm snd_seq_midi_event snd_seq snd_timer snd_seq_device cfg80211 eeepc_wmi usbhid psmouse intel_agp i2c_algo_bit intel_gtt uvcvideo agpgart videodev sparse_keymap snd shpchp v4l1_compat lp hid video serio_raw soundcore output snd_page_alloc ahci libahci atl1c
Dec 15 14:36:41 User-PC kernel: [175576.122712]
Dec 15 14:36:41 User-PC kernel: [175576.122769] Pid: 0, comm: kworker/0:0 Tainted: G W 2.6.37-rc5-wl+ #3 1015PE/1016P
Dec 15 14:36:41 User-PC kernel: [175576.123012] EIP: 0060:[<
f83edd65>] EFLAGS:
00010283 CPU: 1
Dec 15 14:36:41 User-PC kernel: [175576.123193] EIP is at br_multicast_rcv+0xc95/0xe1c [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.123362] EAX:
0000001c EBX:
f5626318 ECX:
00000000 EDX:
00000000
Dec 15 14:36:41 User-PC kernel: [175576.123550] ESI:
ec512262 EDI:
f5626180 EBP:
f60b5ca0 ESP:
f60b5bd8
Dec 15 14:36:41 User-PC kernel: [175576.123737] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Dec 15 14:36:41 User-PC kernel: [175576.123902] Process kworker/0:0 (pid: 0, ti=
f60b4000 task=
f60a8000 task.ti=
f60b0000)
Dec 15 14:36:41 User-PC kernel: [175576.124137] Stack:
Dec 15 14:36:41 User-PC kernel: [175576.124181]
ec556500 f6d06800 f60b5be8 c01087d8 ec512262 00000030 00000024 f5626180
Dec 15 14:36:41 User-PC kernel: [175576.124181]
f572c200 ef463440 f5626300 3affffff f6d06dd0 e60766a4 000000c4 f6d06860
Dec 15 14:36:41 User-PC kernel: [175576.124181]
ffffffff ec55652c 00000001 f6d06844 f60b5c64 c0138264 c016e451 c013e47d
Dec 15 14:36:41 User-PC kernel: [175576.124181] Call Trace:
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c01087d8>] ? sched_clock+0x8/0x10
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0138264>] ? enqueue_entity+0x174/0x440
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c016e451>] ? sched_clock_cpu+0x131/0x190
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c013e47d>] ? select_task_rq_fair+0x2ad/0x730
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0524fc1>] ? nf_iterate+0x71/0x90
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f83e4914>] ? br_handle_frame_finish+0x184/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f83e4790>] ? br_handle_frame_finish+0x0/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f83e46e9>] ? br_handle_frame+0x189/0x230 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f83e4790>] ? br_handle_frame_finish+0x0/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f83e4560>] ? br_handle_frame+0x0/0x230 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c04ff026>] ? __netif_receive_skb+0x1b6/0x5b0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c04f7a30>] ? skb_copy_bits+0x110/0x210
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0503a7f>] ? netif_receive_skb+0x6f/0x80
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f82cb74c>] ? ieee80211_deliver_skb+0x8c/0x1a0 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f82cc836>] ? ieee80211_rx_handlers+0xeb6/0x1aa0 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c04ff1f0>] ? __netif_receive_skb+0x380/0x5b0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c016e242>] ? sched_clock_local+0xb2/0x190
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c012b688>] ? default_spin_lock_flags+0x8/0x10
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c05d83df>] ? _raw_spin_lock_irqsave+0x2f/0x50
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f82cd621>] ? ieee80211_prepare_and_rx_handle+0x201/0xa90 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f82ce154>] ? ieee80211_rx+0x2a4/0x830 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f815a8d6>] ? iwl_update_stats+0xa6/0x2a0 [iwlcore]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f8499212>] ? iwlagn_rx_reply_rx+0x292/0x3b0 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c05d83df>] ? _raw_spin_lock_irqsave+0x2f/0x50
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f8483697>] ? iwl_rx_handle+0xe7/0x350 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
f8486ab7>] ? iwl_irq_tasklet+0xf7/0x5c0 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c01aece1>] ? __rcu_process_callbacks+0x201/0x2d0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0150d05>] ? tasklet_action+0xc5/0x100
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0150a07>] ? __do_softirq+0x97/0x1d0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c05d910c>] ? nmi_stack_correct+0x2f/0x34
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0150970>] ? __do_softirq+0x0/0x1d0
Dec 15 14:36:41 User-PC kernel: [175576.124181] <IRQ>
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c01508f5>] ? irq_exit+0x65/0x70
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c05df062>] ? do_IRQ+0x52/0xc0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c01036b0>] ? common_interrupt+0x30/0x38
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c03a1fc2>] ? intel_idle+0xc2/0x160
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c04daebb>] ? cpuidle_idle_call+0x6b/0x100
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c0101dea>] ? cpu_idle+0x8a/0xf0
Dec 15 14:36:41 User-PC kernel: [175576.124181] [<
c05d2702>] ? start_secondary+0x1e8/0x1ee
Cc: David Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
J. K. Cliburn [Sat, 1 Jan 2011 05:02:12 +0000 (05:02 +0000)]
atl1: fix oops when changing tx/rx ring params
Commit
3f5a2a713aad28480d86b0add00c68484b54febc zeroes out the statistics
message block (SMB) and coalescing message block (CMB) when adapter ring
resources are freed. This is desirable behavior, but, as a side effect,
the commit leads to an oops when atl1_set_ringparam() attempts to alter
the number of rx or tx elements in the ring buffer (by using ethtool
-G, for example). We don't want SMB or CMB to change during this
operation.
Modify atl1_set_ringparam() to preserve SMB and CMB when changing ring
parameters.
Cc: stable@kernel.org
Signed-off-by: Jay Cliburn <jcliburn@gmail.com>
Reported-by: Tõnu Raitviir <jussuf@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingo Molnar [Mon, 3 Jan 2011 18:59:24 +0000 (19:59 +0100)]
Merge branch 'perf/urgent' of git://git./linux/kernel/git/frederic/random-tracing into perf/urgent
Aric D. Blumer [Wed, 29 Dec 2010 16:18:29 +0000 (11:18 -0500)]
ARM: pxa: fix page table corruption on resume
Before this patch, the following error would sometimes occur after a
resume on pxa3xx:
/path/to/mm/memory.c:144: bad pmd
8040542e.
The problem was that a temporary page table mapping was being improperly
restored.
The PXA3xx resume code creates a temporary mapping of resume_turn_on_mmu
to avoid a prefetch abort. The pxa3xx_resume_after_mmu code requires
that the r1 register holding the address of this mapping not be
modified, however, resume_turn_on_mmu does modify it. It is mostly
correct in that r1 receives the base table address, but it may also
get other bits in 13:0. This results in pxa3xx_resume_after_mmu
restoring the original mapping to the wrong place, corrupting memory
and leaving the temporary mapping in place.
Signed-off-by: Matt Reimer <mreimer@sdgsystems.com>
Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
Mike Rapoport [Wed, 29 Dec 2010 07:06:26 +0000 (09:06 +0200)]
ARM: it8152: add IT8152_LAST_IRQ definition to fix build error
The commit
6ac6b817f3f4c23c5febd960d8deb343e13af5f3 (ARM: pxa: encode
IRQ number into .nr_irqs) removed definition of ITE_LAST_IRQ which
caused the following build error:
CC arch/arm/common/it8152.o
arch/arm/common/it8152.c: In function 'it8152_init_irq':
arch/arm/common/it8152.c:86: error: 'IT8152_LAST_IRQ' undeclared (first use in this function)
arch/arm/common/it8152.c:86: error: (Each undeclared identifier is reported only once
arch/arm/common/it8152.c:86: error: for each function it appears in.)
make[2]: *** [arch/arm/common/it8152.o] Error 1
Defining the IT8152_LAST_IRQ in the arch/arm/include/hardware/it8152.c
fixes the build.
Signed-off-by: Mike Rapoport <mike@compulab.co.il>
Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
Lennert Buytenhek [Tue, 14 Dec 2010 23:20:16 +0000 (07:20 +0800)]
ARM: pxa: PXA_ESERIES depends on FB_W100.
As arch/arm/mach-pxa/eseries.c references w100fb_gpio_{read,write}()
directly.
Signed-off-by: Lennert Buytenhek <buytenh@secretlab.ca>
Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
Frederic Weisbecker [Mon, 3 Jan 2011 15:13:11 +0000 (16:13 +0100)]
perf: Fix callchain hit bad cast on ascii display
ipchain__fprintf_graph() casts the number of hits in a branch as an
int, which means we lose its highests bits.
This results in meaningless number of callchain hits in perf.data
that have a high number of hits recorded, typically those that have
callchain branches hits appearing more than INT_MAX. This happens
easily as those are pondered by the event period.
Reported-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Robert Richter [Mon, 3 Jan 2011 11:15:14 +0000 (12:15 +0100)]
arch/x86/oprofile/op_model_amd.c: Perform initialisation on a single CPU
Disable preemption in init_ibs(). The function only checks the
ibs capabilities and sets up pci devices (if necessary). It runs
only on one cpu but operates with the local APIC and some MSRs,
thus it is better to disable preemption.
[ 7.034377] BUG: using smp_processor_id() in preemptible [
00000000] code: modprobe/483
[ 7.034385] caller is setup_APIC_eilvt+0x155/0x180
[ 7.034389] Pid: 483, comm: modprobe Not tainted 2.6.37-rc1-
20101110+ #1
[ 7.034392] Call Trace:
[ 7.034400] [<
ffffffff812a2b72>] debug_smp_processor_id+0xd2/0xf0
[ 7.034404] [<
ffffffff8101e985>] setup_APIC_eilvt+0x155/0x180
[ ... ]
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=22812
Reported-by: <atswartz@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: oprofile-list@lists.sourceforge.net <oprofile-list@lists.sourceforge.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org> [2.6.37.x]
LKML-Reference: <
20110103111514.GM4739@erda.amd.com>
[ small cleanups ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hans Verkuil [Sat, 18 Dec 2010 12:59:51 +0000 (09:59 -0300)]
[media] em28xx: radio_fops should also use unlocked_ioctl
em28xx uses core assisted locking, so it shouldn't use .ioctl.
The .ioctl callback was replaced by .unlocked_ioctl for video nodes,
but not for radio nodes. This is now corrected.
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Mauro Carvalho Chehab [Mon, 3 Jan 2011 11:09:56 +0000 (09:09 -0200)]
[media] wm8775: Revert changeset
fcb9757333 to avoid a regression
It seems that cx88 and ivtv use wm8775 on some different modes. The
patch that added support for a board with wm8775 broke ivtv boards with
this device. As we're too close to release 2.6.37, let's just revert
it.
Reported-by: Andy Walls <awalls@md.metrocast.net>
Reported-by: Eric Sharkey <eric@lisaneric.org>
Reported-by: Auric <auric@aanet.com.au>
Reported by: David Gesswein <djg@pdp8online.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Andy Walls [Sun, 5 Dec 2010 22:42:30 +0000 (19:42 -0300)]
[media] cx25840: Prevent device probe failure due to volume control ERANGE error
This patch fixes a regression that crept into 2.6.36.
The volume control scale in the cx25840 driver has an unusual mapping
from register values to v4l2 volume control values. Enforce the mapping
limits, so that the default volume control setting does not fall out of
bounds to prevent the cx25840 module device probe from failing.
Signed-off-by: Andy Walls <awalls@md.metrocast.net>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Guennadi Liakhovetski [Wed, 22 Dec 2010 13:46:46 +0000 (14:46 +0100)]
dmaengine: provide dummy functions for DMA_ENGINE=n
This lets drivers, optionally using the dmaengine, build with DMA_ENGINE
unselected.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Saeed Bishara [Tue, 21 Dec 2010 14:53:39 +0000 (16:53 +0200)]
mv_xor: fix race in tasklet function
use mv_xor_slot_cleanup() instead of __mv_xor_slot_cleanup() as the former function
aquires the spin lock that needed to protect the drivers data.
Cc: <stable@kernel.org>
Signed-off-by: Saeed Bishara <saeed@marvell.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Axel Lin [Mon, 3 Jan 2011 01:26:53 +0000 (02:26 +0100)]
ARM: 6605/1: Add missing include "asm/memory.h"
This patch fixes below build error by adding the missing asm/memory.h,
which is needed for arch_is_coherent().
$ make pxa3xx_defconfig; make
CC init/do_mounts_rd.o
In file included from include/linux/list_bl.h:5,
from include/linux/rculist_bl.h:7,
from include/linux/dcache.h:7,
from include/linux/fs.h:381,
from init/do_mounts_rd.c:3:
include/linux/bit_spinlock.h: In function 'bit_spin_unlock':
include/linux/bit_spinlock.h:61: error: implicit declaration of function 'arch_is_coherent'
make[1]: *** [init/do_mounts_rd.o] Error 1
make: *** [init] Error 2
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Acked-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Ben Hutchings [Sun, 2 Jan 2011 23:02:42 +0000 (23:02 +0000)]
watchdog: Improve initialisation error message and documentation
The error message 'NMI watchdog failed to create perf event...'
does not make it clear that this is a fatal error for the
watchdog. It also currently prints the error value as a
pointer, rather than extracting the error code with PTR_ERR().
Fix that.
Add a note to the description of the 'nowatchdog' kernel
parameter to associate it with this message.
Reported-by: Cesare Leonardi <celeonar@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 599368@bugs.debian.org
Cc: 608138@bugs.debian.org
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org> # .37.x and later
LKML-Reference: <
1294009362.3167.126.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Maurus Cuelenaere [Sun, 2 Jan 2011 19:48:16 +0000 (14:48 -0500)]
hwmon: (s3c-hwmon) Fix compilation
The owner field was removed from struct attribute in
6fd69dc578fa0b1bbc3aad70ae3af9a137211707, so don't assign it anymore.
Signed-off-by: Maurus Cuelenaere <mcuelenaere@gmail.com>
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
Linus Torvalds [Sun, 2 Jan 2011 18:44:21 +0000 (10:44 -0800)]
Merge branch 'kvm-updates/2.6.37' of git://git./virt/kvm/kvm
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: i8259: initialize isr_ack
KVM: MMU: Fix incorrect direct gfn for unpaged mode shadow
Linus Torvalds [Sun, 2 Jan 2011 18:43:51 +0000 (10:43 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: hda: Use LPIB quirk for Dell Inspiron m101z/1120
sound: Prevent buffer overflow in OSS load_mixer_volumes
ASoC: codecs: wm8753: Fix register cache incoherency
ASoC: codecs: wm9090: Fix register cache incoherency
ASoC: codecs: wm8962: Fix register cache incoherency
ASoC: codecs: wm8955: Fix register cache incoherency
ASoC: codecs: wm8904: Fix register cache incoherency
ASoC: codecs: wm8741: Fix register cache incoherency
ASoC: codecs: wm8523: Fix register cache incoherency
ASoC: codecs: max98088: Fix register cache incoherency
ASoC: codecs: Add missing control_type initialization
Linus Torvalds [Sun, 2 Jan 2011 18:37:19 +0000 (10:37 -0800)]
Merge branch 'rc-fixes' of git://git./linux/kernel/git/mmarek/kbuild-2.6
* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
kconfig: fix undesirable side effect of adding "visible" menu attribute
Takashi Iwai [Sun, 2 Jan 2011 10:01:55 +0000 (11:01 +0100)]
Merge branch 'fix/asoc' into for-linus
Avi Kivity [Fri, 31 Dec 2010 08:52:15 +0000 (10:52 +0200)]
KVM: i8259: initialize isr_ack
isr_ack is never initialized. So, until the first PIC reset, interrupts
may fail to be injected. This can cause Windows XP to fail to boot, as
reported in the fallout from the fix to
https://bugzilla.kernel.org/show_bug.cgi?id=21962.
Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Julia Lawall [Wed, 29 Dec 2010 04:01:03 +0000 (04:01 +0000)]
drivers/atm/atmtcp.c: add missing atm_dev_put
The earlier call to atm_dev_lookup increases the reference count of dev,
so decrease it on the way out.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x, E;
constant C;
@@
x = atm_dev_lookup(...);
... when != false x != NULL
when != true x == NULL
when != \(E = x\|x = E\)
when != atm_dev_put(dev);
*return -C;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Wed, 29 Dec 2010 04:26:17 +0000 (04:26 +0000)]
starfire: Fix dma_addr_t size test for MIPS
Commit
56543af "starfire: use BUILD_BUG_ON for netdrv_addr_t" revealed
that the preprocessor condition used to find the size of dma_addr_t
yielded the wrong result for some architectures and configurations.
This was kluged for 64-bit PowerPC in commit
3e502e6 by adding yet
another case to the condition. However, 64-bit MIPS configurations
are not detected reliably either.
This should be fixed by using CONFIG_ARCH_DMA_ADDR_T_64BIT, but that
isn't yet defined everywhere it should be.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Sterba [Wed, 29 Dec 2010 03:40:31 +0000 (03:40 +0000)]
tg3: fix return value check in tg3_read_vpd()
Besides -ETIMEDOUT and -EINTR, pci_read_vpd may return other error
values like -ENODEV or -EINVAL which are ignored due to the buggy
check, but the data are not read from VPD anyway and this is checked
subsequently with at most 3 needless loop iterations. This does not
show up as a runtime bug.
CC: Matt Carlson <mcarlson@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Juhl [Fri, 31 Dec 2010 19:18:48 +0000 (11:18 -0800)]
Broadcom CNIC core network driver: fix mem leak on allocation failures in cnic_alloc_uio_rings()
We are leaking memory in drivers/net/cnic.c::cnic_alloc_uio_rings() if
either of the calls to dma_alloc_coherent() fail. This patch fixes it by
freeing both the memory allocated with kzalloc() and memory allocated with
previous calls to dma_alloc_coherent() when there's a failure.
Thanks to Joe Perches <joe@perches.com> for suggesting a better
implementation than my initial version.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Juhl [Sun, 26 Dec 2010 09:59:58 +0000 (09:59 +0000)]
ISDN, Gigaset: Fix memory leak in do_disconnect_req()
Hi,
In drivers/isdn/gigaset/capi.c::do_disconnect_req() we will leak the
memory allocated (with kmalloc) to 'b3cmsg' if the call to alloc_skb()
fails.
...
b3cmsg = kmalloc(sizeof(*b3cmsg), GFP_KERNEL);
allocation here ------^
if (!b3cmsg) {
dev_err(cs->dev, "%s: out of memory\n", __func__);
send_conf(iif, ap, skb, CAPI_MSGOSRESOURCEERR);
return;
}
capi_cmsg_header(b3cmsg, ap->id, CAPI_DISCONNECT_B3, CAPI_IND,
ap->nextMessageNumber++,
cmsg->adr.adrPLCI | (1 << 16));
b3cmsg->Reason_B3 = CapiProtocolErrorLayer1;
b3skb = alloc_skb(CAPI_DISCONNECT_B3_IND_BASELEN, GFP_KERNEL);
if (b3skb == NULL) {
dev_err(cs->dev, "%s: out of memory\n", __func__);
send_conf(iif, ap, skb, CAPI_MSGOSRESOURCEERR);
return;
leak here ------^
...
This leak is easily fixed by just kfree()'ing the memory allocated to
'b3cmsg' right before we return. The following patch does that.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Rosenberg [Sun, 26 Dec 2010 06:54:53 +0000 (06:54 +0000)]
CAN: Use inode instead of kernel address for /proc file
Since the socket address is just being used as a unique identifier, its
inode number is an alternative that does not leak potentially sensitive
information.
CC-ing stable because MITRE has assigned CVE-2010-4565 to the issue.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 30 Dec 2010 20:09:26 +0000 (12:09 -0800)]
Merge branch 'drm-intel-fixes' of git://git./linux/kernel/git/ickle/drm-intel
* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel:
drm/i915/dvo: Report LVDS attached to ch701x as connected
Revert "drm/i915/bios: Reverse order of 100/120 Mhz SSC clocks"
drm/i915: Verify Ironlake eDP presence on DP_A using the capability fuse
drm/i915, intel_ips: When i915 loads after IPS, make IPS relink to i915.
drm/i915/sdvo: Add hdmi connector properties after initing the connector
drm/i915: Set the required VFMUNIT clock gating disable on Ironlake.
Nitin Gupta [Thu, 30 Dec 2010 09:07:58 +0000 (04:07 -0500)]
Revert "Staging: zram: work around oops due to startup ordering snafu"
This reverts commit
7e24cce38a99f373450db67bf576fe73e8168d66 because it
was never appropriate for mainline.
Do not check for init flag before starting I/O - zram module is unusable
without this fix.
The oops mentioned in the reverted commit message was actually a problem
only with the zram version as present in project's own repository where
we allocate struct zram_stats_cpu upon device initialization. OTOH, In
mainline/staging version of zram, we allocate struct stats upfront, so
this oops cannot happen in mainline version.
Checking for init_done flag in zram_make_request() results in a *no-op*
for any I/O operation since we simply always return success. This flag
is actually set when the first write occurs on a zram disk which
triggers its initialization.
Bug report: https://bugzilla.kernel.org/show_bug.cgi?id=25722
Reported-by: Dennis Jansen <dennis.jansen@web.de>
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 30 Dec 2010 18:07:44 +0000 (10:07 -0800)]
Merge branch 'merge-spi' of git://git.secretlab.ca/git/linux-2.6
* 'merge-spi' of git://git.secretlab.ca/git/linux-2.6:
spi/m68knommu: Coldfire QSPI platform support
spi/omap2_mcspi.c: Force CS to be in inactive state after off-mode transition
KAMEZAWA Hiroyuki [Wed, 29 Dec 2010 22:07:11 +0000 (14:07 -0800)]
memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check
At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked.
But as commented in mem_cgroup_from_task(), mm->owner can be NULL
in some racy case. This check of VM_BUG_ON() is bad.
A possible story to hit this is at swapoff()->try_to_unuse(). It passes
mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we
can't get proper mem_cgroup from swap_cgroup information, mm->owner is used
as charge target and we see NULL.
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Thu, 16 Dec 2010 11:04:54 +0000 (12:04 +0100)]
update Documentation/filesystems/Locking
Mostly inspired by all the recent BKL removal changes, but a lot of older
updates also weren't properly recorded.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Wilson [Thu, 30 Dec 2010 12:54:00 +0000 (12:54 +0000)]
drm/i915/dvo: Report LVDS attached to ch701x as connected
As we have already detected something attached to the chip during
initialisation, always report the LVDS connector status as connected
during probing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>