Al Viro [Wed, 1 Aug 2012 09:07:04 +0000 (13:07 +0400)]
delousing target_core_file a bit
* set_fs(KERNEL_DS) + getname() is probably the weirdest implementation
of strdup() I've seen. Especially since they don't to copy it at all...
* filp_open() never returns NULL; it's ERR_PTR(-E...) on failure.
* file->f_dentry is never going to be NULL, TYVM.
* match_strdup() + snprintf() + kfree() is a bloody weird way to spell
match_strlcpy().
Pox on cargo-cult programmers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Valerie Aurora [Tue, 12 Jun 2012 14:20:48 +0000 (16:20 +0200)]
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.
Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:47 +0000 (16:20 +0200)]
fs: Remove old freezing mechanism
Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:46 +0000 (16:20 +0200)]
ext2: Implement freezing
The only missing piece to make freezing work reliably with ext2 is to
stop iput() of unlinked inode from deleting the inode on frozen filesystem.
So add a necessary protection to ext2_evict_inode().
We also provide appropriate ->freeze_fs and ->unfreeze_fs functions.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:45 +0000 (16:20 +0200)]
btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check. We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.
Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).
CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:44 +0000 (16:20 +0200)]
nilfs2: Convert to new freezing mechanism
We change nilfs_page_mkwrite() to provide proper freeze protection for
writeable page faults (we must wait for frozen filesystem even if the
page is fully mapped).
We remove all vfs_check_frozen() checks since they are now handled by
the generic code.
CC: linux-nilfs@vger.kernel.org
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:43 +0000 (16:20 +0200)]
ntfs: Convert to new freezing mechanism
Move check in ntfs_file_aio_write_nolock() to ntfs_file_aio_write() and
use new freeze protection.
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:42 +0000 (16:20 +0200)]
fuse: Convert to new freezing mechanism
Convert check in fuse_file_aio_write() to using new freeze protection.
CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:41 +0000 (16:20 +0200)]
gfs2: Convert to new freezing mechanism
We update gfs2_page_mkwrite() to use new freeze protection and the transaction
code to use freeze protection while the transaction is running. That is needed
to stop iput() of unlinked file from modifying the filesystem. The rest is
handled by the generic code.
CC: cluster-devel@redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:40 +0000 (16:20 +0200)]
ocfs2: Convert to new freezing mechanism
Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze
protection. We also protect several ioctl entry points which were missing the
protection. Finally, we add freeze protection to the journaling mechanism so
that iput() of unlinked inode cannot modify a frozen filesystem.
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:39 +0000 (16:20 +0200)]
xfs: Convert to new freezing code
Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.
CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:38 +0000 (16:20 +0200)]
ext4: Convert to new freezing mechanism
We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).
CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:37 +0000 (16:20 +0200)]
fs: Protect write paths by sb_start_write - sb_end_write
There are several entry points which dirty pages in a filesystem. mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().
->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:36 +0000 (16:20 +0200)]
fs: Skip atime update on frozen filesystem
It is unexpected to block reading of frozen filesystem because of atime update.
Also handling blocking on frozen filesystem because of atime update would make
locking more complex than it already is. So just skip atime update when
filesystem is frozen like we skip it when filesystem is remounted read-only.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:35 +0000 (16:20 +0200)]
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:34 +0000 (16:20 +0200)]
fs: Improve filesystem freezing handling
vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.
Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.
This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one of
internal filesystem purposes (used e.g. to track running transactions). Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.
Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.
[AV: microoptimize sb_start_write(); we want it fast in normal case]
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 31 Jul 2012 05:28:31 +0000 (09:28 +0400)]
switch the protection of percpu_counter list to spinlock
... making percpu_counter_destroy() non-blocking
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:33 +0000 (16:20 +0200)]
nfsd: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.
CC: linux-nfs@vger.kernel.org
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:32 +0000 (16:20 +0200)]
btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:31 +0000 (16:20 +0200)]
fat: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
outside of i_mutex as in other places.
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:30 +0000 (16:20 +0200)]
fs: Push mnt_want_write() outside of i_mutex
Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (essentially has a trylock semantics) but when the
function starts to handle also frozen filesystems, it will get a full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.
One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex. Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.
[AV: scratch the previous, mnt_want_write() has been moved to kern_path_create()
by now]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:29 +0000 (16:20 +0200)]
mm: Make default vm_ops provide ->page_mkwrite handler
Make default vm_ops provide ->page_mkwrite handler. Currently it only updates
file's modification times and gets locked page but later it will also handle
filesystem freezing.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:28 +0000 (16:20 +0200)]
mm: Update file times from fault path only if .page_mkwrite is not set
Filesystems wanting to properly support freezing need to have control
when file_update_time() is called. After pushing file_update_time()
to all relevant .page_mkwrite implementations we can just stop calling
file_update_time() when filesystem implements .page_mkwrite.
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:27 +0000 (16:20 +0200)]
sysfs: Push file_update_time() into bin_page_mkwrite()
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:26 +0000 (16:20 +0200)]
gfs2: Push file_update_time() into gfs2_page_mkwrite()
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:25 +0000 (16:20 +0200)]
9p: Push file_update_time() into v9fs_vm_page_mkwrite()
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:24 +0000 (16:20 +0200)]
ceph: Push file_update_time() into ceph_page_mkwrite()
CC: Sage Weil <sage@newdream.net>
CC: ceph-devel@vger.kernel.org
Acked-by: Sage Weil <sage@newdream.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:23 +0000 (16:20 +0200)]
fs: Push file_update_time() into __block_page_mkwrite()
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 12 Jun 2012 14:20:22 +0000 (16:20 +0200)]
fb_defio: Push file_update_time() into fb_deferred_io_mkwrite()
CC: Jaya Kumar <jayalk@intworks.biz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 30 Jul 2012 20:53:35 +0000 (00:53 +0400)]
simplify lookup_open()/atomic_open() - do the temporary mnt_want_write() early
The write ref to vfsmount taken in lookup_open()/atomic_open() is going to
be dropped; we take the one to stay in dentry_open(). Just grab the temporary
in caller if it looks like we are going to need it (create/truncate/writable open)
and pass (by value) "has it succeeded" flag. Instead of doing mnt_want_write()
inside, check that flag and treat "false" as "mnt_want_write() has just failed".
mnt_want_write() is cheap and the things get considerably simpler and more robust
that way - we get it and drop it in the same function, to start with, rather
than passing a "has something in the guts of really scary functions taken it"
back to caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 30 Jul 2012 07:50:30 +0000 (11:50 +0400)]
fix O_EXCL handling for devices
O_EXCL without O_CREAT has different semantics; it's "fail if already opened",
not "fail if already exists". commit
71574865 broke that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 29 Jul 2012 19:17:39 +0000 (23:17 +0400)]
lockd: handle lockowner allocation failure in nlmclnt_proc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 25 Jul 2012 20:39:50 +0000 (00:39 +0400)]
lockd: shift grabbing a reference to nlm_host into nlm_alloc_call()
It's used both for client and server hosts; we can't do nlmclnt_release_host()
on failure exits, since the host might need nlmsvc_release_host(), with BUG_ON()
for calling the wrong one. Makes life simpler for callers, actually...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kees Cook [Thu, 26 Jul 2012 00:29:08 +0000 (17:29 -0700)]
fs: add link restriction audit reporting
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kees Cook [Thu, 26 Jul 2012 00:29:07 +0000 (17:29 -0700)]
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=
87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández GarcÃa-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREATE|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/
9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=
f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jeff Layton [Wed, 25 Jul 2012 14:19:47 +0000 (10:19 -0400)]
vfs: don't let do_last pass negative dentry to audit_inode
I can reliably reproduce the following panic by simply setting an audit
rule on a recent 3.5.0+ kernel:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000040
IP: [<
ffffffff810d1250>] audit_copy_inode+0x10/0x90
PGD
7acd9067 PUD
7b8fb067 PMD 0
Oops: 0000 [#86] SMP
Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc tpm_bios btrfs zlib_deflate libcrc32c kvm_amd kvm joydev virtio_net pcspkr i2c_piix4 floppy virtio_balloon microcode virtio_blk cirrus drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
CPU 0
Pid: 1286, comm: abrt-dump-oops Tainted: G D 3.5.0+ #1 Bochs Bochs
RIP: 0010:[<
ffffffff810d1250>] [<
ffffffff810d1250>] audit_copy_inode+0x10/0x90
RSP: 0018:
ffff88007aebfc38 EFLAGS:
00010282
RAX:
0000000000000000 RBX:
ffff88003692d860 RCX:
00000000000038c4
RDX:
0000000000000000 RSI:
ffff88006baf5d80 RDI:
ffff88003692d860
RBP:
ffff88007aebfc68 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000001 R12:
0000000000000000
R13:
ffff880036d30f00 R14:
ffff88006baf5d80 R15:
ffff88003692d800
FS:
00007f7562634740(0000) GS:
ffff88007fc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000040 CR3:
000000003643d000 CR4:
00000000000006f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Process abrt-dump-oops (pid: 1286, threadinfo
ffff88007aebe000, task
ffff880079614530)
Stack:
ffff88007aebfdf8 ffff88007aebff28 ffff88007aebfc98 ffffffff81211358
ffff88003692d860 0000000000000000 ffff88007aebfcc8 ffffffff810d4968
ffff88007aebfcc8 ffff8800000038c4 0000000000000000 0000000000000000
Call Trace:
[<
ffffffff81211358>] ? ext4_lookup+0xe8/0x160
[<
ffffffff810d4968>] __audit_inode+0x118/0x2d0
[<
ffffffff811955a9>] do_last+0x999/0xe80
[<
ffffffff81191fe8>] ? inode_permission+0x18/0x50
[<
ffffffff81171efa>] ? kmem_cache_alloc_trace+0x11a/0x130
[<
ffffffff81195b4a>] path_openat+0xba/0x420
[<
ffffffff81196111>] do_filp_open+0x41/0xa0
[<
ffffffff811a24bd>] ? alloc_fd+0x4d/0x120
[<
ffffffff811855cd>] do_sys_open+0xed/0x1c0
[<
ffffffff810d40cc>] ? __audit_syscall_entry+0xcc/0x300
[<
ffffffff811856c1>] sys_open+0x21/0x30
[<
ffffffff81611ca9>] system_call_fastpath+0x16/0x1b
RSP <
ffff88007aebfc38>
CR2:
0000000000000040
The problem is that do_last is passing a negative dentry to audit_inode.
The comments on lookup_open note that it can pass back a negative dentry
if O_CREAT is not set.
This patch fixes the oops, but I'm not clear on whether there's a better
approach.
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 Jul 2012 17:15:37 +0000 (21:15 +0400)]
brcm80211: pointless current->files passed to filp_close()
... only needed if it's been in descriptor table
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 Jul 2012 17:26:51 +0000 (21:26 +0400)]
sound_firmware: don't pass crap to filp_close()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 Jul 2012 17:23:33 +0000 (21:23 +0400)]
gadgetfs: clean up
sigh...
* opened files have non-NULL dentries and non-NULL inodes
* close_filp() needs current->files only if the file had been
in descriptor table.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 Jul 2012 17:09:14 +0000 (21:09 +0400)]
slightly reduce lossage in gdm72xx
* filp_close() needs non-NULL second argument only if it'd been in descriptor
table
* opened files have non-NULL dentries, TYVM
* ... and those dentries are positive - it's kinda hard to open a file that
doesn't exist.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 Jul 2012 17:02:01 +0000 (21:02 +0400)]
slightly reduce idiocy in drivers/staging/bcm/Misc.c
a) vfs_llseek() does *not* access userland pointers of any kind
b) neither does filp_close(), for that matter
c) ... nor filp_open()
d) vfs_read() does, but we do have a wrapper for that (kernel_read()),
so there's no need to reinvent it.
e) passing current->files to filp_close() on something that never
had been in descriptor table is pointless.
ISAGN: voodoo dolls to be used on voodoo programmers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 21 Jul 2012 11:33:25 +0000 (15:33 +0400)]
consolidate pipe file creation
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 20 Jul 2012 19:28:46 +0000 (23:28 +0400)]
take grabbing f->f_path to do_dentry_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 20 Jul 2012 19:05:59 +0000 (23:05 +0400)]
uninline file_free_rcu()
What inline? Its only use is passing its address to call_rcu(), for fuck sake!
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 20 Jul 2012 08:09:19 +0000 (12:09 +0400)]
ecryptfs_lookup_interpose(): allocate dentry_info first
less work on failure that way
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 20 Jul 2012 08:03:41 +0000 (12:03 +0400)]
sanitize ecryptfs_lookup()
* ->lookup() never gets hit with . or ..
* dentry it gets is unhashed, so unless we had gone and hashed it ourselves, there's
no need to d_drop() the sucker.
* wrong name printed in one of the printks (NULL, in fact)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 22:37:29 +0000 (02:37 +0400)]
clean unix_bind() up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 22:25:00 +0000 (02:25 +0400)]
pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp.
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now. Makes more sense, POSIX allows, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 21:17:26 +0000 (01:17 +0400)]
mknod: take sanity checks on mode into the very beginning
Note that applying umask can't affect their results. While
that affects errno in cases like
mknod("/no_such_directory/a", 030000)
yielding -EINVAL (due to impossible mode_t) instead of
-ENOENT (due to inexistent directory), IMO that makes a lot
more sense, POSIX allows to return either and any software
that relies on getting -ENOENT instead of -EINVAL in that
case deserves everything it gets.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 21:15:31 +0000 (01:15 +0400)]
new helper: done_path_create()
releases what needs to be released after {kern,user}_path_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 12:23:13 +0000 (16:23 +0400)]
pull unlock+dput() out into do_spu_create()
... and cleaning spufs_create() a bit, while we are at it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 12:12:22 +0000 (16:12 +0400)]
spufs: pull unlock-and-dput() up into spufs_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 12:07:30 +0000 (16:07 +0400)]
spufs_create_context(): simplify failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 12:03:21 +0000 (16:03 +0400)]
move spu_forget() into spufs_rmdir()
now that __fput() is *not* done in any callchain containing mmput(),
we can do that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 07:19:07 +0000 (11:19 +0400)]
ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 07:17:49 +0000 (11:17 +0400)]
btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 26 Jun 2012 17:58:53 +0000 (21:58 +0400)]
switch dentry_open() to struct path, make it grab references itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 25 Jun 2012 07:46:13 +0000 (11:46 +0400)]
spufs: shift dget/mntget towards dentry_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 14 Jul 2012 09:49:40 +0000 (13:49 +0400)]
zoran: don't bother with struct file * in zoran_map
all we need it for is file->private_data, which is assign-once, already
assigned by that point and, incidentally, its value is already in use
by zoran ->mmap() anyway. So just store that pointer instead...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 25 Jun 2012 07:38:56 +0000 (11:38 +0400)]
ecryptfs: don't reinvent the wheels, please - use struct completion
... and keep the sodding requests on stack - they are small enough.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 19 Jul 2012 05:18:15 +0000 (09:18 +0400)]
don't expose I_NEW inodes via dentry->d_inode
d_instantiate(dentry, inode);
unlock_new_inode(inode);
is a bad idea; do it the other way round...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 18 Jul 2012 16:43:19 +0000 (20:43 +0400)]
tidy up namei.c a bit
locking/unlocking for rcu walk taken to a couple of inline helpers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 18 Jul 2012 13:32:50 +0000 (17:32 +0400)]
unobfuscate follow_up() a bit
really convoluted test in there has grown up during struct mount
introduction; what it checks is that we'd reached the root of
mount tree.
Eric Sandeen [Mon, 30 Apr 2012 18:16:04 +0000 (13:16 -0500)]
ext3: pass custom EOF to generic_file_llseek_size()
Use the new custom EOF argument to generic_file_llseek_size so
that SEEK_END will go to the max hash value for htree dirs
in ext3 rather than to i_size_read()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Eric Sandeen [Mon, 30 Apr 2012 18:14:03 +0000 (13:14 -0500)]
ext4: use core vfs llseek code for dir seeks
Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.
Also fix up comments on ext4_llseek() to reflect reality.
Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Eric Sandeen [Mon, 30 Apr 2012 18:11:29 +0000 (13:11 -0500)]
vfs: allow custom EOF in generic_file_llseek code
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value. Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.
This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position. With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.
As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does. But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.
(Patch also fixes up some comments slightly)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:34 +0000 (16:45 +0200)]
vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
wakeup_flusher_threads(0) will queue work doing complete writeback for each
flusher thread. Thus there is not much point in submitting another work doing
full inode WB_SYNC_NONE writeback by writeback_inodes_sb().
After this change it does not make sense to call nonblocking ->sync_fs and
block device flush before calling sync_inodes_sb() because
wakeup_flusher_threads() is completely asynchronous and thus these functions
would be called in parallel with inode writeback running which will effectively
void any work they do. So we move sync_inodes_sb() call before these two
functions.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:33 +0000 (16:45 +0200)]
vfs: Remove unnecessary flushing of block devices
It is not necessary to write block devices twice. The reason why we first did
flush and then proper sync is that
for_each_bdev() {
write_bdev()
wait_for_completion()
}
is much slower than
for_each_bdev()
write_bdev()
for_each_bdev()
wait_for_completion()
when there is bigger amount of data. But as is seen in the above, there's no real
need to scan pages and submit them twice. We just need to separate the submission
and waiting part. This patch does that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:32 +0000 (16:45 +0200)]
vfs: Make sys_sync writeout also block device inodes
In case block device does not have filesystem mounted on it, sys_sync will just
ignore it and doesn't writeout its dirty pages. This is because writeback code
avoids writing inodes from superblock without backing device and
blockdev_superblock is such a superblock. Since it's unexpected that sync
doesn't writeout dirty data for block devices be nice to users and change the
behavior to do so. So now we iterate over all block devices on blockdev_super
instead of iterating over all superblocks when syncing block devices.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:31 +0000 (16:45 +0200)]
vfs: Create function for iterating over block devices
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:30 +0000 (16:45 +0200)]
vfs: Reorder operations during sys_sync
Change the order of operations during sync from
for_each_sb {
writeback_inodes_sb();
sync_fs(nowait);
__sync_blockdev(nowait);
}
for_each_sb {
sync_inodes_sb();
sync_fs(wait);
__sync_blockdev(wait);
}
to
for_each_sb
writeback_inodes_sb();
for_each_sb
sync_fs(nowait);
for_each_sb
__sync_blockdev(nowait);
for_each_sb
sync_inodes_sb();
for_each_sb
sync_fs(wait);
for_each_sb
__sync_blockdev(wait);
This is a preparation for the following patches in this series.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:29 +0000 (16:45 +0200)]
quota: Move quota syncing to ->sync_fs method
Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).
So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:28 +0000 (16:45 +0200)]
quota: Split dquot_quota_sync() to writeback and cache flushing part
Split off part of dquot_quota_sync() which writes dquots into a quota file
to a separate function. In the next patch we will use the function from
filesystems and we do not want to abuse ->quota_sync quotactl callback more
than necessary.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Tue, 3 Jul 2012 14:45:27 +0000 (16:45 +0200)]
vfs: Move noop_backing_dev_info check from sync into writeback
In principle, a filesystem may want to have ->sync_fs() called during sync(1)
although it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
Only writeback code really needs bdi set to something reasonable. So move the
checks where they are more logical.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 13:28:08 +0000 (16:28 +0300)]
fs/ufs: get rid of write_super
This patch makes UFS stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.
The way we implement this is that we schedule a delay job instead relying on
's_dirt' and '->write_super()'.
The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back. But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.
Tested using fsstress from the LTP project.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 13:28:07 +0000 (16:28 +0300)]
fs/ufs: re-arrange the code a bit
This patch does not do any functional changes. It only moves 3 functions
in fs/ufs/super.c a little bit up in order to prepare for further changes
where I'll need this new arrangement to avoid forward declarations.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 13:28:06 +0000 (16:28 +0300)]
fs/ufs: remove extra superblock write on unmount
UFS calls 'ufs_write_super()' from 'ufs_put_super()' in order to write the
superblocks to the media. However, it is not needed because VFS calls
'->sync_fs()' before calling '->put_super()' - so by the time we are in
'ufs_write_super()', the superblocks are already synchronized.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Tue, 3 Jul 2012 13:43:28 +0000 (16:43 +0300)]
fs/sysv: stop using write_super and s_dirt
It does not look like sysv FS needs 'write_super()' at all, because all it
does is a timestamp update. I cannot test this patch, because this
file-system is so old and probably has not been used by anyone for years,
so there are no tools to create it in Linux. But from the code I see that
marking the superblock as dirty is basically marking the superblock buffers as
drity and then setting the s_dirt flag. And when 'write_super()' is executed to
handle the s_dirt flag, we just update the timestamp and again mark the
superblock buffer as dirty. Seems pointless.
It looks like we can update the timestamp more opprtunistically - on unmount
or remount of sync, and nothing should change.
Thus, this patch removes 'sysv_write_super()' and 's_dirt'.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Tue, 3 Jul 2012 13:43:27 +0000 (16:43 +0300)]
fs/sysv: remove another useless write_super call
We do not need to call 'sysv_write_super()' from 'sysv_remount()',
because VFS has called 'sysv_sync_fs()' before calling '->remount()'.
So remove it. Remove also '(un)lock_super()' which obvioulsy is becoming
useless in this function.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Tue, 3 Jul 2012 13:43:26 +0000 (16:43 +0300)]
fs/sysv: remove useless write_super call
We do not need to call 'sysv_write_super()' from 'sysv_put_super()',
because VFS has called 'sysv_sync_fs()' before calling '->put_super()'.
So remove it.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:49 +0000 (17:28 +0300)]
hfs: get rid of hfs_sync_super
This patch makes hfs stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.
The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back. But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.
Tested using fsstress from the LTP project.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:48 +0000 (17:28 +0300)]
hfs: introduce VFS superblock object back-reference
Add an 'sb' VFS superblock back-reference to the 'struct hfs_sb_info' data
structure - we will need to find the VFS superblock from a
'struct hfs_sb_info' object in the next patch, so this change is jut a
preparation.
Remove few useless newlines while on it.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:47 +0000 (17:28 +0300)]
hfs: simplify a bit checking for R/O
We have the following pattern in 2 places in HFS
if (!RDONLY)
hfs_mdb_commit();
This patch pushes the RDONLY check down to 'hfs_mdb_commit()'. This will
make the following patches a bit simpler.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:46 +0000 (17:28 +0300)]
hfs: remove extra mdb write on unmount
HFS calls 'hfs_write_super()' from 'hfs_put_super()' in order to write the MDB
to the media. However, it is not needed because VFS calls '->sync_fs()' before
calling '->put_super()' - so by the time we are in 'hfs_write_super()', the MDB
is already synchronized.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:45 +0000 (17:28 +0300)]
hfs: get rid of lock_super
Stop using lock_super for serializing the MDB changes - use the buffer-head own
lock instead. Tested with fsstress.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:28:44 +0000 (17:28 +0300)]
hfs: push lock_super down
HFS uses 'lock_super()'/'unlock_super()' around 'hfs_mdb_commit()' in order
to serialize MDB (Master Directory Block) changes. Push it down to
'hfs_mdb_commit()' in order to simplify the code a bit.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:26:31 +0000 (17:26 +0300)]
hfsplus: get rid of write_super
This patch makes hfsplus stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.
The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back. But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.
Tested using fsstress from the LTP project.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:26:30 +0000 (17:26 +0300)]
hfsplus: remove useless check
This check is useless because we always have 'sb->s_fs_info' to be non-NULL.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:26:29 +0000 (17:26 +0300)]
hfsplus: amend debugging print
Print correct function name in the debugging print of the
'hfsplus_sync_fs()' function.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Artem Bityutskiy [Thu, 12 Jul 2012 14:26:28 +0000 (17:26 +0300)]
hfsplus: make hfsplus_sync_fs static
... because it is used only in fs/hfsplus/super.c.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 30 Jun 2012 07:55:24 +0000 (11:55 +0400)]
hold task_lock around checks in keyctl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 24 Jun 2012 06:03:05 +0000 (10:03 +0400)]
get rid of ->scm_work_list
recursion in __scm_destroy() will be cut by delaying final fput()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 24 Jun 2012 06:00:10 +0000 (10:00 +0400)]
aio: now fput() is OK from interrupt context; get rid of manual delayed __fput()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 24 Jun 2012 05:56:45 +0000 (09:56 +0400)]
switch fput to task_work_add
... and schedule_work() for interrupt/kernel_thread callers
(and yes, now it *is* OK to call from interrupt).
We are guaranteed that __fput() will be done before we return
to userland (or exit). Note that for fput() from a kernel
thread we get an async behaviour; it's almost always OK, but
sometimes you might need to have __fput() completed before
you do anything else. There are two mechanisms for that -
a general barrier (flush_delayed_fput()) and explicit
__fput_sync(). Both should be used with care (as was the
case for fput() from kernel threads all along). See comments
in fs/file_table.c for details.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 27 Jun 2012 07:33:29 +0000 (11:33 +0400)]
deal with task_work callbacks adding more work
It doesn't matter on normal return to userland path (we'll recheck the
NOTIFY_RESUME flag anyway), but in case of exit_task_work() we'll
need that as soon as we get callbacks capable of triggering more
task_work_add().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 27 Jun 2012 07:31:24 +0000 (11:31 +0400)]
move exit_task_work() past exit_files() et.al.
... and get rid of PF_EXITING check in task_work_add().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 27 Jun 2012 07:07:19 +0000 (11:07 +0400)]
merge task_work and rcu_head, get rid of separate allocation for keyring case
task_work and rcu_head are identical now; merge them (calling the result
struct callback_head, rcu_head #define'd to it), kill separate allocation
in security/keys since we can just use cred->rcu now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 27 Jun 2012 05:24:13 +0000 (09:24 +0400)]
trim task_work: get rid of hlist
layout based on Oleg's suggestion; single-linked list,
task->task_works points to the last element, forward pointer
from said last element points to head. I'd still prefer
much more regular scheme with two pointers in task_work,
but...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 26 Jun 2012 18:10:04 +0000 (22:10 +0400)]
trimming task_work: kill ->data
get rid of the only user of ->data; this is _not_ the final variant - in the
end we'll have task_work and rcu_head identical and just use cred->rcu,
at which point the separate allocation will be gone completely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 15 Jul 2012 10:10:52 +0000 (14:10 +0400)]
signal: make sure we don't get stopped with pending task_work
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>