Christoph Hellwig [Tue, 5 May 2009 13:41:25 +0000 (15:41 +0200)]
enforce ->sync_fs is only called for rw superblock
Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount. sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there. cachefiles grew
s_umount locking.
I've also added a WARN_ON to sync_filesystem to assert this for
future callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Tue, 5 May 2009 14:08:56 +0000 (16:08 +0200)]
cleanup sync_supers
Merge the write_super helper into sync_super and move the check for
->write_super earlier so that we can avoid grabbing a reference to
a superblock that doesn't have it.
While we're at it also add a little comment documenting sync_supers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Alexey Dobriyan [Sun, 3 May 2009 23:32:03 +0000 (03:32 +0400)]
dcache: extrace and use d_unlinked()
d_unlinked() will be used in middle-term to ban checkpointing when opened
but unlinked file is detected, and in long term, to detect such situation
and special case on it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Tue, 28 Apr 2009 16:00:26 +0000 (18:00 +0200)]
remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.
Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.
Exceptions:
- affs already has identical copy & pasted code at the beginning of
affs_put_super so no need to do it twice.
- xfs does the right thing without it and I have changes pending for
the xfs tree touching this are so I don't really need conflicts
here..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 13:46:45 +0000 (09:46 -0400)]
qnx4: remove ->write_super
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 13:46:44 +0000 (09:46 -0400)]
ocfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 13:46:43 +0000 (09:46 -0400)]
gfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 13:46:42 +0000 (09:46 -0400)]
ext3: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 13:46:41 +0000 (09:46 -0400)]
btrfs: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:55 +0000 (16:43 +0200)]
quota: Introduce writeout_quota_sb() (version 4)
Introduce this function which just writes all the quota structures but
avoids all the syncing and cache pruning work to expose quota structures
to userspace. Use this function from __sync_filesystem when wait == 0.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Mon, 27 Apr 2009 14:43:54 +0000 (16:43 +0200)]
quota: cleanup dquota sync functions (version 4)
Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given
superblock. This is a small wrapper around sync_dquots which for the
case of a non-NULL superblock is a small wrapper around quota_sync_sb.
Just make quota_sync_sb global (rename it to sync_quota_sb) and call it
directly. Also call it directly for those cases in quota.c that have a
superblock and leave sync_dquots purely an iterator over sync_quota_sb and
remove it's superblock argument.
To make this nicer move the check for the lack of a quota_sync method
from the callers into sync_quota_sb.
[folded build fix from Alexander Beregalov <a.beregalov@gmail.com>]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:53 +0000 (16:43 +0200)]
vfs: Rename fsync_super() to sync_filesystem() (version 4)
Rename the function so that it better describe what it really does. Also
remove the unnecessary include of buffer_head.h.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:52 +0000 (16:43 +0200)]
vfs: Move syncing code from super.c to sync.c (version 4)
Move sync_filesystems(), __fsync_super(), fsync_super() from
super.c to sync.c where it fits better.
[build fixes folded]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:51 +0000 (16:43 +0200)]
vfs: Make sys_sync() use fsync_super() (version 4)
It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.
Nice bonus is that we get a complete livelock avoidance and write_supers()
is now only used for periodic writeback of superblocks.
sync_blockdevs() introduced a couple of patches ago is gone now.
[build fixes folded]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:50 +0000 (16:43 +0200)]
vfs: Make __fsync_super() a static function (version 4)
__fsync_super() does the same thing as fsync_super(). So change the only
caller to use fsync_super() and make __fsync_super() static. This removes
unnecessarily duplicated call to sync_blockdev() and prepares ground
for the changes to __fsync_super() in the following patches.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:49 +0000 (16:43 +0200)]
vfs: Call ->sync_fs() even if s_dirt is 0 (version 4)
sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
->sync_fs() isn't called. This does not really make much sence since s_dirt is
generally used by a filesystem to mean that ->write_super() needs to be called.
But ->sync_fs() does different things. I even suspect that some filesystems
(btrfs?) sets s_dirt just to fool this logic.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Mon, 27 Apr 2009 14:43:48 +0000 (16:43 +0200)]
vfs: Fix sys_sync() and fsync_super() reliability (version 4)
So far, do_sync() called:
sync_inodes(0);
sync_supers();
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);
This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem hits also other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).
Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.
The same issue is fixed in __fsync_super() function used on umount /
remount read-only.
[AV: build fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Tue, 28 Apr 2009 16:05:55 +0000 (18:05 +0200)]
remove s_async_list
Remove the unused s_async_list in the superblock, a leftover of the
broken async inode deletion code that leaked into mainline. Having this
in the middle of the sync/unmount path is not helpful for the following
cleanups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
npiggin@suse.de [Sun, 26 Apr 2009 10:25:56 +0000 (20:25 +1000)]
fs: move mark_files_ro into file_table.c
This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).
[AV: ... and it shouldn't be static after that move]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
npiggin@suse.de [Sun, 26 Apr 2009 10:25:55 +0000 (20:25 +1000)]
fs: introduce mnt_clone_write
This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.
Before:
avg = 462.286
std = 5.46106
After:
avg = 453.12
std = 9.58257
(50 runs of each, stddev gives a reasonable confidence)
It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.
After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).
[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
npiggin@suse.de [Sun, 26 Apr 2009 10:25:54 +0000 (20:25 +1000)]
fs: mnt_want_write speedup
This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark yes, but it exercises some important paths in the mm.
Before:
avg = 501.9
std = 14.7773
After:
avg = 462.286
std = 5.46106
(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)
It does this by removing the complex per-cpu locking and counter-cache and
replaces it with a percpu counter in struct vfsmount. This makes the code
much simpler, and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It results in about 900 bytes smaller code too. It
does increase the size of a vfsmount, however.
It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single threaded path performance for the moment.
[AV: minor cleanup]
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 17:19:18 +0000 (13:19 -0400)]
Move junk from proc_fs.h to fs/proc/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 18:06:57 +0000 (14:06 -0400)]
switch lookup_mnt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 17:59:41 +0000 (13:59 -0400)]
switch follow_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 17:58:15 +0000 (13:58 -0400)]
switch follow_down()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 07:28:19 +0000 (03:28 -0400)]
Switch collect_mounts() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 07:26:48 +0000 (03:26 -0400)]
switch follow_up() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 07:00:46 +0000 (03:00 -0400)]
switch rqst_exp_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 06:42:05 +0000 (02:42 -0400)]
switch rqst_exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 06:14:32 +0000 (02:14 -0400)]
switch exp_parent() to struct path
... and lose the always-NULL last argument (non-NULL case had been
split off a while ago).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 18 Apr 2009 06:04:46 +0000 (02:04 -0400)]
nfsd struct path use: exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 16:21:18 +0000 (12:21 -0400)]
Don't bother with check_mnt() in do_add_mount() on shrinkable ones
These guys are what we add as submounts; checks for "is that attached in
our namespace" are simply irrelevant for those and counterproductive for
use of private vfsmount trees a-la what NFS folks want.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 15:53:49 +0000 (11:53 -0400)]
Make vfs_path_lookup() use starting point as root
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 15:49:53 +0000 (11:49 -0400)]
Cache root in nameidata
New field: nd->root. When pathname resolution wants to know the root,
check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
copy current->fs->root there. After path_walk() is finished, we check
if we'd got a cached value in nd->root and drop it. Before calling
path_walk() we should either set nd->root.mnt to NULL *or* copy (and
pin down) some path to nd->root. In the latter case we won't be
looking at current->fs->root at all.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 15:44:16 +0000 (11:44 -0400)]
Preparations to caching root in path_walk()
Split do_path_lookup(), opencode the call from do_filp_open()
do_filp_open() is the only caller of do_path_lookup() that
cares about root afterwards (it keeps resolving symlinks on
O_CREAT path after it'd done LOOKUP_PARENT walk). So when
we start caching fs->root in path_walk(), it'll need a different
treatment.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 7 Apr 2009 15:08:56 +0000 (11:08 -0400)]
Get rid of path_lookup in autofs4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jeff Mahoney [Sun, 10 May 2009 20:05:39 +0000 (16:05 -0400)]
reiserfs: allow exposing privroot w/ xattrs enabled
This patch adds an -oexpose_privroot option to allow access to the privroot.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Thu, 11 Jun 2009 21:23:12 +0000 (14:23 -0700)]
Merge git://git./linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (23 commits)
Btrfs: fix extent_buffer leak during tree log replay
Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
Btrfs: fix -o nodatasum printk spelling
Btrfs: check duplicate backrefs for both data and metadata
Btrfs: init worker struct fields before kthread-run
Btrfs: pin buffers during write_dev_supers
Btrfs: avoid races between super writeout and device list updates
Fix btrfs when ACLs are configured out
Btrfs: fdatasync should skip metadata writeout
Btrfs: remove crc32c.h and use libcrc32c directly.
Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Btrfs: autodetect SSD devices
Btrfs: add mount -o ssd_spread to spread allocations out
Btrfs: avoid allocation clusters that are too spread out
Btrfs: Add mount -o nossd
Btrfs: avoid IO stalls behind congested devices in a multi-device FS
Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
Btrfs: fix metadata dirty throttling limits
Btrfs: reduce mount -o ssd CPU usage
Btrfs: balance btree more often
...
Linus Torvalds [Thu, 11 Jun 2009 21:22:55 +0000 (14:22 -0700)]
Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
fsnotify: allow groups to set freeing_mark to null
inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
dnotify: do not bother to lock entry->lock when reading mask
dnotify: do not use ?true:false when assigning to a bool
fsnotify: move events should indicate the event was on a child
inotify: reimplement inotify using fsnotify
fsnotify: handle filesystem unmounts with fsnotify marks
fsnotify: fsnotify marks on inodes pin them in core
fsnotify: allow groups to add private data to events
fsnotify: add correlations between events
fsnotify: include pathnames with entries when possible
fsnotify: generic notification queue and waitq
dnotify: reimplement dnotify using fsnotify
fsnotify: parent event notification
fsnotify: add marks to inodes so groups can interpret how to handle those inodes
fsnotify: unified filesystem notification backend
Linus Torvalds [Thu, 11 Jun 2009 21:15:57 +0000 (14:15 -0700)]
Merge branch 'for-linus' of git://linux-arm.org/linux-2.6
* 'for-linus' of git://linux-arm.org/linux-2.6:
kmemleak: Add the corresponding MAINTAINERS entry
kmemleak: Simple testing module for kmemleak
kmemleak: Enable the building of the memory leak detector
kmemleak: Remove some of the kmemleak false positives
kmemleak: Add modules support
kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
kmemleak: Add the vmalloc memory allocation/freeing hooks
kmemleak: Add the slub memory allocation/freeing hooks
kmemleak: Add the slob memory allocation/freeing hooks
kmemleak: Add the slab memory allocation/freeing hooks
kmemleak: Add documentation on the memory leak detector
kmemleak: Add the base support
Manual conflict resolution (with the slab/earlyboot changes) in:
drivers/char/vt.c
init/main.c
mm/slab.c
Linus Torvalds [Thu, 11 Jun 2009 21:01:07 +0000 (14:01 -0700)]
Merge branch 'perfcounters-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...
Linus Torvalds [Thu, 11 Jun 2009 19:25:06 +0000 (12:25 -0700)]
Merge branch 'topic/slab/earlyboot' of git://git./linux/kernel/git/penberg/slab-2.6
* 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
vgacon: use slab allocator instead of the bootmem allocator
irq: use kcalloc() instead of the bootmem allocator
sched: use slab in cpupri_init()
sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
memcg: don't use bootmem allocator in setup code
irq/cpumask: make memoryless node zero happy
x86: remove some alloc_bootmem_cpumask_var calling
vt: use kzalloc() instead of the bootmem allocator
sched: use kzalloc() instead of the bootmem allocator
init: introduce mm_init()
vmalloc: use kzalloc() instead of alloc_bootmem()
slab: setup allocators earlier in the boot sequence
bootmem: fix slab fallback on numa
bootmem: use slab if bootmem is no longer available
Eric Paris [Thu, 11 Jun 2009 15:09:48 +0000 (11:09 -0400)]
fsnotify: allow groups to set freeing_mark to null
Most fsnotify listeners (all but inotify) do not care about marks being
freed. Allow groups to set freeing_mark to null and do not call any
function if it is set that way.
Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Thu, 11 Jun 2009 15:09:47 +0000 (11:09 -0400)]
inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
inotify and dnotify will both indicate that they want any event which came
from a child inode. The fix is to mask off FS_EVENT_ON_CHILD when deciding
if inotify or dnotify is interested in a given event.
Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Thu, 11 Jun 2009 15:09:47 +0000 (11:09 -0400)]
dnotify: do not bother to lock entry->lock when reading mask
entry->lock is needed to make sure entry->mask does not change while
manipulating it. In dnotify_should_send_event() we don't care if we get an
old or a new mask value out of this entry so there is no point it taking
the lock.
Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Thu, 11 Jun 2009 15:09:47 +0000 (11:09 -0400)]
dnotify: do not use ?true:false when assigning to a bool
dnotify_should send event assigned a bool using ?true:false when computing
a bit operation. This is poitless and the bool type does this for us.
Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Thu, 11 Jun 2009 15:09:47 +0000 (11:09 -0400)]
fsnotify: move events should indicate the event was on a child
fsnotify tells its listeners explicitly when an event happened on the given
inode verses on the child of the given inode. (see __fsnotify_parent)
However, the semantics of fsnotify_move() are such that we deliver events
directly to the two parent directories in question (old_dir and new_dir)
directly without using the __fsnotify_parent() call. fsnotify should be
adding FS_EVENT_ON_CHILD for the notifications to these parents.
Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Thu, 21 May 2009 21:02:01 +0000 (17:02 -0400)]
inotify: reimplement inotify using fsnotify
Reimplement inotify_user using fsnotify. This should be feature for feature
exactly the same as the original inotify_user. This does not make any changes
to the in kernel inotify feature used by audit. Those patches (and the eventual
removal of in kernel inotify) will come after the new inotify_user proves to be
working correctly.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:58 +0000 (17:01 -0400)]
fsnotify: handle filesystem unmounts with fsnotify marks
When an fs is unmounted with an fsnotify mark entry attached to one of its
inodes we need to destroy that mark entry and we also (like inotify) send
an unmount event.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:54 +0000 (17:01 -0400)]
fsnotify: fsnotify marks on inodes pin them in core
This patch pins any inodes with an fsnotify mark in core. The idea is that
as soon as the mark is removed from the inode->fsnotify_mark_entries list
the inode will be iput. In reality is doesn't quite work exactly this way.
The igrab will happen when the mark is added to an inode, but the iput will
happen when the inode pointer is NULL'd inside the mark.
It's possible that 2 racing things will try to remove the mark from
different directions. One may try to remove the mark because of an
explicit request and one might try to remove it because the inode was
deleted. It's possible that the removal because of inode deletion will
remove the mark from the inode's list, but the removal by explicit request
will actually set entry->inode == NULL; and call the iput. This is safe.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:50 +0000 (17:01 -0400)]
fsnotify: allow groups to add private data to events
inotify needs per group information attached to events. This patch allows
groups to attach private information and implements a callback so that
information can be freed when an event is being destroyed.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:47 +0000 (17:01 -0400)]
fsnotify: add correlations between events
As part of the standard inotify events it includes a correlation cookie
between two dentry move operations. This patch includes the same behaviour
in fsnotify events. It is needed so that inotify userspace can be
implemented on top of fsnotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:43 +0000 (17:01 -0400)]
fsnotify: include pathnames with entries when possible
When inotify wants to send events to a directory about a child it includes
the name of the original file. This patch collects that filename and makes
it available for notification.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:37 +0000 (17:01 -0400)]
fsnotify: generic notification queue and waitq
inotify needs to do asyc notification in which event information is stored
on a queue until the listener is ready to receive it. This patch
implements a generic notification queue for inotify (and later fanotify) to
store events to be sent at a later time.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:33 +0000 (17:01 -0400)]
dnotify: reimplement dnotify using fsnotify
Reimplement dnotify using fsnotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:29 +0000 (17:01 -0400)]
fsnotify: parent event notification
inotify and dnotify both use a similar parent notification mechanism. We
add a generic parent notification mechanism to fsnotify for both of these
to use. This new machanism also adds the dentry flag optimization which
exists for inotify to dnotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:26 +0000 (17:01 -0400)]
fsnotify: add marks to inodes so groups can interpret how to handle those inodes
This patch creates a way for fsnotify groups to attach marks to inodes.
These marks have little meaning to the generic fsnotify infrastructure
and thus their meaning should be interpreted by the group that attached
them to the inode's list.
dnotify and inotify will make use of these markings to indicate which
inodes are of interest to their respective groups. But this implementation
has the useful property that in the future other listeners could actually
use the marks for the exact opposite reason, aka to indicate which inodes
it had NO interest in.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Eric Paris [Thu, 21 May 2009 21:01:20 +0000 (17:01 -0400)]
fsnotify: unified filesystem notification backend
fsnotify is a backend for filesystem notification. fsnotify does
not provide any userspace interface but does provide the basis
needed for other notification schemes such as dnotify. fsnotify
can be extended to be the backend for inotify or the upcoming
fanotify. fsnotify provides a mechanism for "groups" to register for
some set of filesystem events and to then deliver those events to
those groups for processing.
fsnotify has a number of benefits, the first being actually shrinking the size
of an inode. Before fsnotify to support both dnotify and inotify an inode had
unsigned long i_dnotify_mask; /* Directory notify events */
struct dnotify_struct *i_dnotify; /* for directory notifications */
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list
But with fsnotify this same functionallity (and more) is done with just
__u32 i_fsnotify_mask; /* all events for this inode */
struct hlist_head i_fsnotify_mark_entries; /* marks on this inode */
That's right, inotify, dnotify, and fanotify all in 64 bits. We used that
much space just in inotify_watches alone, before this patch set.
fsnotify object lifetime and locking is MUCH better than what we have today.
inotify locking is incredibly complex. See
8f7b0ba1c8539 as an example of
what's been busted since inception. inotify needs to know internal semantics
of superblock destruction and unmounting to function. The inode pinning and
vfs contortions are horrible.
no fsnotify implementers do allocation under locks. This means things like
f04b30de3 which (due to an overabundance of caution) changes GFP_KERNEL to
GFP_NOFS can be reverted. There are no longer any allocation rules when using
or implementing your own fsnotify listener.
fsnotify paves the way for fanotify. In brief fanotify is a notification
mechanism that delivers the lisener both an 'event' and an open file descriptor
to the object in question. This means that fanotify is pathname agnostic.
Some on lkml may not care for the original companies or users that pushed for
TALPA, but fanotify was designed with flexibility and input for other users in
mind. The readahead group expressed interest in fanotify as it could be used
to profile disk access on boot without breaking the audit system. The desktop
search groups have also expressed interest in fanotify as it solves a number
of the race conditions and problems present with managing inotify when more
than a limited number of specific files are of interest. fanotify can provide
for a userspace access control system which makes it a clean interface for AV
vendors to hook without trying to do binary patching on the syscall table,
LSM, and everywhere else they do their things today. With this patch series
fanotify can be implemented in less than 1200 lines of easy to review code.
Almost all of which is the socket based user interface.
This patch series builds fsnotify to the point that it can implement
dnotify and inotify_user. Patches exist and will be sent soon after
acceptance to finish the in kernel inotify conversion (audit) and implement
fanotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Linus Torvalds [Thu, 11 Jun 2009 18:27:09 +0000 (11:27 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
jfs: Add missing mutex_unlock call to error path
missing unlock in jfs_quota_write()
Linus Torvalds [Thu, 11 Jun 2009 18:26:56 +0000 (11:26 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: memtest: use pointers of equal type for comparison
Oleg Nesterov [Thu, 11 Jun 2009 12:12:55 +0000 (13:12 +0100)]
slow_work_thread() should do the exclusive wait
slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up all
kslowd threads. This is not what we want, so we change slow_work_thread() to
use prepare_to_wait_exclusive() instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Jun 2009 18:23:17 +0000 (11:23 -0700)]
Merge branch 'upstream-linus' of git://git./linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
[libata] ata_piix: Enable parallel scan
sata_nv: use hardreset only for post-boot probing
[libata] ahci: Restore SB600 SATA controller 64 bit DMA
ata_piix: Remove stale comment
ata_piix: Turn on hotplugging support for older chips
ahci: misc cleanups for EM stuff
[libata] get rid of ATA_MAX_QUEUE loop in ata_qc_complete_multiple() v2
sata_sil: enable 32-bit PIO
sata_sx4: speed up ECC initialization
libata-sff: avoid byte swapping in ata_sff_data_xfer()
[libata] ahci: use less error-prone array initializers
Linus Torvalds [Thu, 11 Jun 2009 17:52:27 +0000 (10:52 -0700)]
Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
Linus Torvalds [Thu, 11 Jun 2009 17:36:12 +0000 (10:36 -0700)]
Merge git://git./linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (25 commits)
GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta
GFS2: Fix cache coherency between truncate and O_DIRECT read
GFS2: Fix locking issue mounting gfs2meta fs
GFS2: Remove unused variable
GFS2: smbd proccess hangs with flock() call.
GFS2: Remove args subdir from gfs2 sysfs files
GFS2: Remove lockstruct subdir from gfs2 sysfs files
GFS2: Move gfs2_unlink_ok into ops_inode.c
GFS2: Move gfs2_readlinki into ops_inode.c
GFS2: Move gfs2_rmdiri into ops_inode.c
GFS2: Merge mount.c and ops_super.c into super.c
GFS2: Clean up some file names
GFS2: Be more aggressive in reclaiming unlinked inodes
GFS2: Add a rgrp bitmap full flag
GFS2: Improve resource group error handling
GFS2: Don't warn when delete inode fails on ro filesystem
GFS2: Update docs
GFS2: Umount recovery race fix
GFS2: Remove a couple of unused sysfs entries
GFS2: Add commit= mount option
...
Linus Torvalds [Thu, 11 Jun 2009 17:33:36 +0000 (10:33 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/bp/bp
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (26 commits)
amd64_edac: add MAINTAINERS entry
EDAC: do not enable modules by default
amd64_edac: do not enable module by default
amd64_edac: add module registration routines
amd64_edac: add ECC reporting initializers
amd64_edac: add EDAC core-related initializers
amd64_edac: add error decoding logic
amd64_edac: add ECC chipkill syndrome mapping table
amd64_edac: add per-family descriptors
amd64_edac: add F10h-and-later methods-p3
amd64_edac: add F10h-and-later methods-p2
amd64_edac: add F10h-and-later methods-p1
amd64_edac: add k8-specific methods
amd64_edac: assign DRAM chip select base and mask in a family-specific way
amd64_edac: add helper to dump relevant registers
amd64_edac: add DRAM address type conversion facilities
amd64_edac: add functionality to compute the DRAM hole
amd64_edac: add sys addr to memory controller mapping helpers
amd64_edac: add memory scrubber interface
amd64_edac: add MCA error types
...
Linus Torvalds [Thu, 11 Jun 2009 17:08:33 +0000 (10:08 -0700)]
Merge git://git./linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (266 commits)
sh: Tie sparseirq in to Kconfig.
sh: Wire up sys_rt_tgsigqueueinfo.
sh: Fix sys_pwritev() syscall table entry for sh32.
sh: Fix sh4a llsc-based cmpxchg()
sh: sh7724: Add JPU support
sh: sh7724: INTC setting update
sh: sh7722 clock framework rewrite
sh: sh7366 clock framework rewrite
sh: sh7343 clock framework rewrite
sh: sh7724 clock framework rewrite V3
sh: sh7723 clock framework rewrite V2
sh: add enable()/disable()/set_rate() to div6 code
sh: add AP325RXA mode pin configuration
sh: add Migo-R mode pin configuration
sh: sh7722 mode pin definitions
sh: sh7724 mode pin comments
sh: sh7723 mode pin V2
sh: rework mode pin code
sh: clock div6 helper code
sh: clock div4 frequency table offset fix
...
Linus Torvalds [Thu, 11 Jun 2009 17:03:30 +0000 (10:03 -0700)]
Merge branch 'kvm-updates/2.6.31' of git://git./virt/kvm/kvm
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: Prevent overflow in largepages calculation
KVM: Disable large pages on misaligned memory slots
KVM: Add VT-x machine check support
KVM: VMX: Rename rmode.active to rmode.vm86_active
KVM: Move "exit due to NMI" handling into vmx_complete_interrupts()
KVM: Disable CR8 intercept if tpr patching is active
KVM: Do not migrate pending software interrupts.
KVM: inject NMI after IRET from a previous NMI, not before.
KVM: Always request IRQ/NMI window if an interrupt is pending
KVM: Do not re-execute INTn instruction.
KVM: skip_emulated_instruction() decode instruction if size is not known
KVM: Remove irq_pending bitmap
KVM: Do not allow interrupt injection from userspace if there is a pending event.
KVM: Unprotect a page if #PF happens during NMI injection.
KVM: s390: Verify memory in kvm run
KVM: s390: Sanity check on validity intercept
KVM: s390: Unlink vcpu on destroy - v2
KVM: s390: optimize float int lock: spin_lock_bh --> spin_lock
KVM: s390: use hrtimer for clock wakeup from idle - v2
KVM: s390: Fix memory slot versus run - v3
...
Linus Torvalds [Thu, 11 Jun 2009 17:02:46 +0000 (10:02 -0700)]
Merge git://git./linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: remove never-used in6_addr option
cifs: add addr= mount option alias for ip=
[CIFS] Add mention of new mount parm (forceuid) to cifs readme
cifs: make overriding of ownership conditional on new mount options
cifs: fix IPv6 address length check
cifs: clean up set_cifs_acl interfaces
cifs: reorganize get_cifs_acl
[CIFS] Update readme to indicate change to default mount (serverino)
cifs: make serverino the default when mounting
cifs: rename cifs_iget to cifs_root_iget
cifs: make cnvrtDosUnixTm take a little-endian args and an offset
cifs: have cifs_NTtimeToUnix take a little-endian arg
cifs: tighten up default file_mode/dir_mode
cifs: fix artificial limit on reading symlinks
Linus Torvalds [Thu, 11 Jun 2009 17:01:41 +0000 (10:01 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
nommu: Provide mmap_min_addr definition.
TOMOYO: Add description of lists and structures.
TOMOYO: Remove unused field.
integrity: ima audit dentry_open failure
TOMOYO: Remove unused parameter.
security: use mmap_min_addr indepedently of security models
TOMOYO: Simplify policy reader.
TOMOYO: Remove redundant markers.
SELinux: define audit permissions for audit tree netlink messages
TOMOYO: Remove unused mutex.
tomoyo: avoid get+put of task_struct
smack: Remove redundant initialization.
integrity: nfsd imbalance bug fix
rootplug: Remove redundant initialization.
smack: do not beyond ARRAY_SIZE of data
integrity: move ima_counts_get
integrity: path_check update
IMA: Add __init notation to ima functions
IMA: Minimal IMA policy and boot param for TCB IMA policy
selinux: remove obsolete read buffer limit from sel_read_bool
...
Linus Torvalds [Thu, 11 Jun 2009 17:00:50 +0000 (10:00 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
ext4: Avoid corrupting the uninitialized bit in the extent during truncate
ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
ext4: fix dx_map_entry to support 256k directory blocks
ext4: truncate the file properly if we fail to copy data from userspace
ext4: Avoid leaking blocks after a block allocation failure
ext4: Change all super.c messages to print the device
ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
ext4: super.c whitespace cleanup
jbd2: Fix minor typos in comments in fs/jbd2/journal.c
ext4: Clean up calls to ext4_get_group_desc()
ext4: remove unused function __ext4_write_dirty_metadata
ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
ext4: down i_data_sem only for read when walking tree for fiemap
ext4: Add a comprehensive block validity check to ext4_get_blocks()
ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
ext4: Add documentation to the ext4_*get_block* functions
...
Linus Torvalds [Thu, 11 Jun 2009 17:00:03 +0000 (10:00 -0700)]
Merge branch 'for-2.6.31' of git://git./linux/kernel/git/bart/ide-2.6
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits)
ide-tape: fix debug call
alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC
ide-dma: don't reset request fields on dma_timeout_retry()
ide: drop rq->data handling from ide_map_sg()
ide-atapi: kill unused fields and callbacks
ide-tape: simplify read/write functions
ide-tape: use byte size instead of sectors on rw issue functions
ide-tape: unify r/w init paths
ide-tape: kill idetape_bh
ide-tape: use standard data transfer mechanism
ide-tape: use single continuous buffer
ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len
ide-tape,floppy: fix failed command completion after request sense
ide-pm: don't abuse rq->data
ide-cd,atapi: use bio for internal commands
ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer
ide-cd: convert to using generic sense request
ide: add helpers for preparing sense requests
ide-cd: don't abuse rq->buffer
ide-atapi: don't abuse rq->buffer
...
Pekka Enberg [Thu, 11 Jun 2009 16:25:37 +0000 (19:25 +0300)]
vgacon: use slab allocator instead of the bootmem allocator
Slab is initialized before the console subsystem so use the slab allocator in
vgacon_scrollback_startup().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Thu, 11 Jun 2009 11:46:49 +0000 (14:46 +0300)]
irq: use kcalloc() instead of the bootmem allocator
Fixes the following problem:
[ 0.000000] Experimental hierarchical RCU init done.
[ 0.000000] NR_IRQS:4352 nr_irqs:256
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: at mm/bootmem.c:537 alloc_arch_preferred_bootmem+0x40/0x7e()
[ 0.000000] Hardware name: To Be Filled By O.E.M.
[ 0.000000] Pid: 0, comm: swapper Not tainted
2.6.30-tip-02161-g7a74539-dirty #59709
[ 0.000000] Call Trace:
[ 0.000000] [<
ffffffff823f8c8e>] ? alloc_arch_preferred_bootmem+0x40/0x7e
[ 0.000000] [<
ffffffff81067168>] warn_slowpath_common+0x88/0xcb
[ 0.000000] [<
ffffffff810671d2>] warn_slowpath_null+0x27/0x3d
[ 0.000000] [<
ffffffff823f8c8e>] alloc_arch_preferred_bootmem+0x40/0x7e
[ 0.000000] [<
ffffffff823f9307>] ___alloc_bootmem_nopanic+0x4e/0xec
[ 0.000000] [<
ffffffff823f93c5>] ___alloc_bootmem+0x20/0x61
[ 0.000000] [<
ffffffff823f962e>] __alloc_bootmem+0x1e/0x34
[ 0.000000] [<
ffffffff823f757c>] early_irq_init+0x6d/0x118
[ 0.000000] [<
ffffffff823e0140>] ? early_idt_handler+0x0/0x71
[ 0.000000] [<
ffffffff823e0cf7>] start_kernel+0x192/0x394
[ 0.000000] [<
ffffffff823e0140>] ? early_idt_handler+0x0/0x71
[ 0.000000] [<
ffffffff823e02ad>] x86_64_start_reservations+0xb4/0xcf
[ 0.000000] [<
ffffffff823e0000>] ? __init_begin+0x0/0x140
[ 0.000000] [<
ffffffff823e0420>] x86_64_start_kernel+0x158/0x17b
[ 0.000000] ---[ end trace
a7919e7f17c0a725 ]---
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] Detected 2002.510 MHz processor.
[ 0.004000] Console: colour VGA+ 80x25
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Thu, 11 Jun 2009 05:41:22 +0000 (08:41 +0300)]
sched: use slab in cpupri_init()
Lets not use the bootmem allocator in cpupri_init() as slab is already up when
it is run.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Thu, 11 Jun 2009 05:35:27 +0000 (08:35 +0300)]
sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
Slab is initialized when sched_init() runs now so lets use alloc_cpumask_var().
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Yinghai Lu [Fri, 29 May 2009 01:15:16 +0000 (18:15 -0700)]
memcg: don't use bootmem allocator in setup code
The bootmem allocator is no longer available for page_cgroup_init() because we
set up the kernel slab allocator much earlier now.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Yinghai Lu [Fri, 29 May 2009 01:14:40 +0000 (18:14 -0700)]
irq/cpumask: make memoryless node zero happy
Don't hardcode to node zero for early boot IRQ setup memory allocations.
[ penberg@cs.helsinki.fi: minor cleanups ]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Yinghai Lu [Mon, 25 May 2009 12:10:58 +0000 (15:10 +0300)]
x86: remove some alloc_bootmem_cpumask_var calling
Now that we set up the slab allocator earlier, we can get rid of some
alloc_bootmem_cpumask_var() calls in boot code.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Wed, 10 Jun 2009 20:53:37 +0000 (23:53 +0300)]
vt: use kzalloc() instead of the bootmem allocator
Now that kmem_cache_init() happens before console_init(), we should use
kzalloc() and not the bootmem allocator.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Wed, 10 Jun 2009 20:42:36 +0000 (23:42 +0300)]
sched: use kzalloc() instead of the bootmem allocator
Now that kmem_cache_init() happens before sched_init(), we should use kzalloc()
and not the bootmem allocator.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Thu, 11 Jun 2009 15:29:06 +0000 (18:29 +0300)]
init: introduce mm_init()
As suggested by Christoph Lameter, introduce mm_init() now that we initialize
all the kernel memory allocations together.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Mon, 25 May 2009 12:01:35 +0000 (15:01 +0300)]
vmalloc: use kzalloc() instead of alloc_bootmem()
We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
the bootmem allocator when initializing vmalloc data structures.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Wed, 10 Jun 2009 16:40:04 +0000 (19:40 +0300)]
slab: setup allocators earlier in the boot sequence
This patch makes kmalloc() available earlier in the boot sequence so we can get
rid of some bootmem allocations. The bulk of the changes are due to
kmem_cache_init() being called with interrupts disabled which requires some
changes to allocator boostrap code.
Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps
before we call mem_init() during boot as reported by Ingo Molnar:
We have a hard crash in the WP-protect code:
[ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2
ffcff000
[ 0.000000] EDI
00000188 ESI
00000ac7 EBP
c17eaf9c ESP
c17eaf8c
[ 0.000000] EBX
000014e0 EDX
0000000e ECX
01856067 EAX
00000001
[ 0.000000] err
00000003 EIP
c10135b1 CS
00000060 flg
00010002
[ 0.000000] Stack:
c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
[ 0.000000]
00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
[ 0.000000]
c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
[ 0.000000] Pid: 0, comm: swapper Not tainted
2.6.30-tip-02161-g7a74539-dirty #52203
[ 0.000000] Call Trace:
[ 0.000000] [<
c15357c2>] ? printk+0x14/0x16
[ 0.000000] [<
c10135b1>] ? do_test_wp_bit+0x19/0x23
[ 0.000000] [<
c17fd410>] ? test_wp_bit+0x26/0x64
[ 0.000000] [<
c17fd7e5>] ? mem_init+0x1ba/0x1d8
[ 0.000000] [<
c17f1668>] ? start_kernel+0x164/0x2f7
[ 0.000000] [<
c17f1322>] ? unknown_bootoption+0x0/0x19c
[ 0.000000] [<
c17f106a>] ? __init_begin+0x6a/0x6f
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Thu, 11 Jun 2009 05:10:28 +0000 (08:10 +0300)]
bootmem: fix slab fallback on numa
If the user requested bootmem allocation on a specific node, we should use
kzalloc_node() for the fallback allocation.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Wed, 10 Jun 2009 17:05:53 +0000 (20:05 +0300)]
bootmem: use slab if bootmem is no longer available
As a preparation for initializing the slab allocator early, make sure the
bootmem allocator does not crash and burn if someone calls it after slab is up;
otherwise we'd need a flag day for switching to early slab.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Catalin Marinas [Thu, 11 Jun 2009 12:24:14 +0000 (13:24 +0100)]
kmemleak: Add the corresponding MAINTAINERS entry
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:24:14 +0000 (13:24 +0100)]
kmemleak: Simple testing module for kmemleak
This patch adds a loadable module that deliberately leaks memory. It
is used for testing various memory leaking scenarios.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:24:13 +0000 (13:24 +0100)]
kmemleak: Enable the building of the memory leak detector
This patch adds the Kconfig.debug and Makefile entries needed for
building kmemleak into the kernel.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:24:13 +0000 (13:24 +0100)]
kmemleak: Remove some of the kmemleak false positives
There are allocations for which the main pointer cannot be found but
they are not memory leaks. This patch fixes some of them. For more
information on false positives, see Documentation/kmemleak.txt.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:23:20 +0000 (13:23 +0100)]
kmemleak: Add modules support
This patch handles the kmemleak operations needed for modules loading so
that memory allocations from inside a module are properly tracked.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:23:19 +0000 (13:23 +0100)]
kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
The alloc_large_system_hash function is called from various places in
the kernel and it contains pointers to other allocated structures. It
therefore needs to be traced by kmemleak.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:23:19 +0000 (13:23 +0100)]
kmemleak: Add the vmalloc memory allocation/freeing hooks
This patch adds the callbacks to kmemleak_(alloc|free) functions from
vmalloc/vfree.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:23:18 +0000 (13:23 +0100)]
kmemleak: Add the slub memory allocation/freeing hooks
This patch adds the callbacks to kmemleak_(alloc|free) functions from the
slub allocator.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Catalin Marinas [Thu, 11 Jun 2009 12:23:17 +0000 (13:23 +0100)]
kmemleak: Add the slob memory allocation/freeing hooks
This patch adds the callbacks to kmemleak_(alloc|free) functions from the
slob allocator.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Catalin Marinas [Thu, 11 Jun 2009 12:22:40 +0000 (13:22 +0100)]
kmemleak: Add the slab memory allocation/freeing hooks
This patch adds the callbacks to kmemleak_(alloc|free) functions from
the slab allocator. The patch also adds the SLAB_NOLEAKTRACE flag to
avoid recursive calls to kmemleak when it allocates its own data
structures.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Catalin Marinas [Thu, 11 Jun 2009 12:22:39 +0000 (13:22 +0100)]
kmemleak: Add documentation on the memory leak detector
This patch adds the Documentation/kmemleak.txt file with some
information about how kmemleak works.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Catalin Marinas [Thu, 11 Jun 2009 12:22:39 +0000 (13:22 +0100)]
kmemleak: Add the base support
This patch adds the base support for the kernel memory leak
detector. It traces the memory allocation/freeing in a way similar to
the Boehm's conservative garbage collector, the difference being that
the unreferenced objects are not freed but only shown in
/sys/kernel/debug/kmemleak. Enabling this feature introduces an
overhead to memory allocations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Linus Torvalds [Thu, 11 Jun 2009 16:02:31 +0000 (09:02 -0700)]
Merge branches 'frv' and 'mn10300'
* frv:
FRV: Implement new-style ptrace
FRV: Don't turn on TIF_SYSCALL_TRACE unconditionally in syscall prologue
FRV: Implement TIF_NOTIFY_RESUME
FRV: Remove in-kernel strace code
FRV: BUG to BUG_ON changes
FRV: bitops: Change the bitmap index from int to unsigned long
* mn10300:
MN10300: Add utrace/tracehooks support
MN10300: Don't set the dirty bit in the DTLB entries in the TLB-miss handler
David Howells [Thu, 11 Jun 2009 12:08:37 +0000 (13:08 +0100)]
MN10300: Add utrace/tracehooks support
Add utrace/tracehooks support to MN10300.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Thu, 11 Jun 2009 12:08:32 +0000 (13:08 +0100)]
MN10300: Don't set the dirty bit in the DTLB entries in the TLB-miss handler
Remove the special handling for the Data TLB entry dirty bit in the TLB-miss
handler. As the code stands, all that it does is to cause us to take a second
data address exception to set the dirty bit. Instead, we can just let
pte_mkdirty() set the bit.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>