Anton Blanchard [Fri, 7 Jan 2011 03:30:41 +0000 (03:30 +0000)]
xfs: Add log level to assertion printk
I received a ppc64 bug report involving xfs but the assertion was
filtered out by the console log level. Use KERN_CRIT to ensure it
makes it out.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Jesper Juhl [Sat, 25 Dec 2010 20:14:53 +0000 (20:14 +0000)]
xfs: fix an assignment within an ASSERT()
In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
label we have this:
ASSERT(error = 0);
I believe a comparison was intended, not an assignment. If I'm
right, the patch below fixes that up.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 7 Jan 2011 13:02:23 +0000 (13:02 +0000)]
xfs: fix error handling for synchronous writes
If we get an IO error on a synchronous superblock write, we attach an
error release function to it so that when the last reference goes away
the release function is called and the buffer is invalidated and
unlocked. The buffer is left locked until the release function is
called so that other concurrent users of the buffer will be locked out
until the buffer error is fully processed.
Unfortunately, for the superblock buffer the filesyetm itself holds a
reference to the buffer which prevents the reference count from
dropping to zero and the release function being called. As a result,
once an IO error occurs on a sync write, the buffer will never be
unlocked and all future attempts to lock the buffer will hang.
To make matters worse, this problems is not unique to such buffers;
if there is a concurrent _xfs_buf_find() running, the lookup will grab
a reference to the buffer and then wait on the buffer lock, preventing
the reference count from ever falling to zero and hence unlocking the
buffer.
As such, the whole b_relse function implementation is broken because it
cannot rely on the buffer reference count falling to zero to unlock the
errored buffer. The synchronous write error path is the only path that
uses this callback - it is used to ensure that the synchronous waiter
gets the buffer error before the error state is cleared from the buffer
by the release function.
Given that the only sychronous buffer writes now go through xfs_bwrite
and the error path in question can only occur for a write of a dirty,
logged buffer, we can move most of the b_relse processing to happen
inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
In addition to that we make sure the error is not cleared in
xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
Given that xfs_bwrite keeps the buffer locked until it has waited for
it and checked the error this allows to reliably propagate the error
to the caller, and make sure that the buffer is reliably unlocked.
Given that xfs_buf_iodone_callbacks was the only instance of the
b_relse callback we can remove it entirely.
Based on earlier patches by Dave Chinner and Ajeet Yadav.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 7 Jan 2011 13:02:04 +0000 (13:02 +0000)]
xfs: add FITRIM support
Allow manual discards from userspace using the FITRIM ioctl. This is not
intended to be run during normal workloads, as the freepsace btree walks
can cause large performance degradation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Tue, 4 Jan 2011 04:49:29 +0000 (04:49 +0000)]
xfs: ensure log covering transactions are synchronous
To ensure the log is covered and the filesystem idles correctly, we
need to ensure that dummy transactions hit the disk and do not stay
pinned in memory. If the superblock is pinned in memory, it can't
be flushed so the log covering cannot make progress. The result is
dependent on timing - more oftent han not we continue to issues a
log covering transaction every 36s rather than idling after ~90s.
Fix this by making the log covering transaction synchronous. To
avoid additional log force from xfssyncd, make the log covering
transaction take the place of the existing log force in the xfssyncd
background sync process.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Mon, 10 Jan 2011 23:22:40 +0000 (10:22 +1100)]
xfs: serialise unaligned direct IOs
When two concurrent unaligned, non-overlapping direct IOs are issued
to the same block, the direct Io layer will race to zero the block.
The result is that one of the concurrent IOs will overwrite data
written by the other IO with zeros. This is demonstrated by the
xfsqa test 240.
To avoid this problem, serialise all unaligned direct IOs to an
inode with a big hammer. We need a big hammer approach as we need to
serialise AIO as well, so we can't just block writes on locks.
Hence, the big hammer is calling xfs_ioend_wait() while holding out
other unaligned direct IOs from starting.
We don't bother trying to serialised aligned vs unaligned IOs as
they are overlapping IO and the result of concurrent overlapping IOs
is undefined - the result of either IO is a valid result so we let
them race. Hence we only penalise unaligned IO, which already has a
major overhead compared to aligned IO so this isn't a major problem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:23:42 +0000 (10:23 +1100)]
xfs: factor common write setup code
The buffered IO and direct IO write paths share a common set of
checks and limiting code prior to issuing the write. Factor that
into a common helper function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:17:30 +0000 (10:17 +1100)]
xfs: split buffered IO write path from xfs_file_aio_write
Complete the split of the different write IO paths by splitting the
buffered IO write path out of xfs_file_aio_write(). This makes the
different mechanisms of the write patchs easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:15:36 +0000 (10:15 +1100)]
xfs: split direct IO write path from xfs_file_aio_write
The current xfs_file_aio_write code is a mess of locking shenanigans
to handle the different locking requirements of buffered and direct
IO. Start to clean this up by disentangling the direct IO path from
the mess.
This also removes the failed direct IO fallback path to buffered IO.
XFS handles all direct IO cases without needing to fall back to
buffered IO, so we can safely remove this unused path. This greatly
simplifies the logic and locking needed in the write path.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Wed, 12 Jan 2011 00:37:10 +0000 (11:37 +1100)]
xfs: introduce xfs_rw_lock() helpers for locking the inode
We need to obtain the i_mutex, i_iolock and i_ilock during the read
and write paths. Add a set of wrapper functions to neatly
encapsulate the lock ordering and shared/exclusive semantics to make
the locking easier to follow and get right.
Note that this changes some of the exclusive locking serialisation in
that serialisation will occur against the i_mutex instead of the
XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
arguably more efficient to use the mutex for such serialisation than
the rw_sem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:14:16 +0000 (10:14 +1100)]
xfs: factor post-write newsize updates
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:14:06 +0000 (10:14 +1100)]
xfs: factor common post-write isize handling code
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 10 Jan 2011 23:13:53 +0000 (10:13 +1100)]
xfs: ensure sync write errors are returned
xfs_file_aio_write() only returns the error from synchronous
flushing of the data and inode if error == 0. At the point where
error is being checked, it is guaranteed to be > 0. Therefore any
errors returned by the data or fsync flush will never be returned.
Fix the checks so we overwrite the current error once and only if an
error really occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:29:14 +0000 (12:29 +1100)]
xfs: convert grant head manipulations to lockless algorithm
The only thing that the grant lock remains to protect is the grant head
manipulations when adding or removing space from the log. These calculations
are already based on atomic variables, so we can already update them safely
without locks. However, the grant head manpulations require atomic multi-step
calculations to be executed, which the algorithms currently don't allow.
To make these multi-step calculations atomic, convert the algorithms to
compare-and-exchange loops on the atomic variables. That is, we sample the old
value, perform the calculation and use atomic64_cmpxchg() to attempt to update
the head with the new value. If the head has not changed since we sampled it,
it will succeed and we are done. Otherwise, we rerun the calculation again from
a new sample of the head.
This allows us to remove the grant lock from around all the grant head space
manipulations, and that effectively removes the grant lock from the log
completely. Hence we can remove the grant lock completely from the log at this
point.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:29:01 +0000 (12:29 +1100)]
xfs: introduce new locks for the log grant ticket wait queues
The log grant ticket wait queues are currently protected by the log
grant lock. However, the queues are functionally independent from
each other, and operations on them only require serialisation
against other queue operations now that all of the other log
variables they use are atomic values.
Hence, we can make them independent of the grant lock by introducing
new locks just to protect the lists operations. because the lists
are independent, we can use a lock per list and ensure that reserve
and write head queuing do not contend.
To ensure forced shutdowns work correctly in conjunction with the
new fast paths, ensure that we check whether the log has been shut
down in the grant functions once we hold the relevant spin locks but
before we go to sleep. This is needed to co-ordinate correctly with
the wakeups that are issued on the ticket queues so we don't leave
any processes sleeping on the queues during a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Fri, 3 Dec 2010 13:02:40 +0000 (00:02 +1100)]
xfs: convert log grant heads to atomic variables
Convert the log grant heads to atomic64_t types in preparation for
converting the accounting algorithms to atomic operations. his patch
just converts the variables; the algorithmic changes are in a
separate patch for clarity.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:28:39 +0000 (12:28 +1100)]
xfs: convert l_tail_lsn to an atomic variable.
log->l_tail_lsn is currently protected by the log grant lock. The
lock is only needed for serialising readers against writers, so we
don't really need the lock if we make the l_tail_lsn variable an
atomic. Converting the l_tail_lsn variable to an atomic64_t means we
can start to peel back the grant lock from various operations.
Also, provide functions to safely crack an atomic LSN variable into
it's component pieces and to recombined the components into an
atomic variable. Use them where appropriate.
This also removes the need for explicitly holding a spinlock to read
the l_tail_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Dave Chinner [Fri, 3 Dec 2010 11:11:29 +0000 (22:11 +1100)]
xfs: convert l_last_sync_lsn to an atomic variable
log->l_last_sync_lsn is updated in only one critical spot - log
buffer Io completion - and is protected by the grant lock here. This
requires the grant lock to be taken for every log buffer IO
completion. Converting the l_last_sync_lsn variable to an atomic64_t
means that we do not need to take the grant lock in log buffer IO
completion to update it.
This also removes the need for explicitly holding a spinlock to read
the l_last_sync_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:09:20 +0000 (12:09 +1100)]
xfs: make AIL tail pushing independent of the grant lock
The xlog_grant_push_ail() currently takes the grant lock internally to sample
the tail lsn, last sync lsn and the reserve grant head. Most of the callers
already hold the grant lock but have to drop it before calling
xlog_grant_push_ail(). This is a left over from when the AIL tail pushing was
done in line and hence xlog_grant_push_ail had to drop the grant lock. AIL push
is now done in another thread and hence we can safely hold the grant lock over
the entire xlog_grant_push_ail call.
Push the grant lock outside of xlog_grant_push_ail() to simplify the locking
and synchronisation needed for tail pushing. This will reduce traffic on the
grant lock by itself, but this is only one step in preparing for the complete
removal of the grant lock.
While there, clean up the formatting of xlog_grant_push_ail() to match the
rest of the XFS code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:09:01 +0000 (12:09 +1100)]
xfs: use wait queues directly for the log wait queues
The log grant queues are one of the few places left using sv_t
constructs for waiting. Given we are touching this code, we should
convert them to plain wait queues. While there, convert all the
other sv_t users in the log code as well.
Seeing as this removes the last users of the sv_t type, remove the
header file defining the wrapper and the fragments that still
reference it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:08:20 +0000 (12:08 +1100)]
xfs: combine grant heads into a single 64 bit integer
Prepare for switching the grant heads to atomic variables by
combining the two 32 bit values that make up the grant head into a
single 64 bit variable. Provide wrapper functions to combine and
split the grant heads appropriately for calculations and use them as
necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:06:05 +0000 (12:06 +1100)]
xfs: rework log grant space calculations
The log grant space calculations are repeated for both write and
reserve grant heads. To make it simpler to convert the calculations
toa different algorithm, factor them so both the gratn heads use the
same calculation functions. Once this is done we can drop the
wrappers that are used in only a couple of place to update both
grant heads at once as they don't provide any particular value.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:02:52 +0000 (12:02 +1100)]
xfs: fact out common grant head/log tail verification code
Factor repeated debug code out of grant head manipulation functions into a
separate function. This removes ifdef DEBUG spagetti from the code and makes
the code easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 21 Dec 2010 01:02:25 +0000 (12:02 +1100)]
xfs: convert log grant ticket queues to list heads
The grant write and reserve queues use a roll-your-own double linked
list, so convert it to a standard list_head structure and convert
all the list traversals to use list_for_each_entry(). We can also
get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
check to tell if the ticket is in a list or not.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 01:36:15 +0000 (12:36 +1100)]
xfs: use AIL bulk delete function to implement single delete
We now have two copies of AIL delete operations that are mostly
duplicate functionality. The single log item deletes can be
implemented via the bulk updates by turning xfs_trans_ail_delete()
into a simple wrapper. This removes all the duplicate delete
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 01:34:26 +0000 (12:34 +1100)]
xfs: use AIL bulk update function to implement single updates
We now have two copies of AIL insert operations that are mostly
duplicate functionality. The single log item updates can be
implemented via the bulk updates by turning xfs_trans_ail_update()
into a simple wrapper. This removes all the duplicate insert
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 01:03:17 +0000 (12:03 +1100)]
xfs: remove all the inodes on a buffer from the AIL in bulk
When inode buffer IO completes, usually all of the inodes are removed from the
AIL. This involves processing them one at a time and taking the AIL lock once
for every inode. When all CPUs are processing inode IO completions, this causes
excessive amount sof contention on the AIL lock.
Instead, change the way we process inode IO completion in the buffer
IO done callback. Allow the inode IO done callback to walk the list
of IO done callbacks and pull all the inodes off the buffer in one
go and then process them as a batch.
Once all the inodes for removal are collected, take the AIL lock
once and do a bulk removal operation to minimise traffic on the AIL
lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Fri, 3 Dec 2010 06:00:52 +0000 (17:00 +1100)]
xfs: consume iodone callback items on buffers as they are processed
To allow buffer iodone callbacks to consume multiple items off the
callback list, first we need to convert the xfs_buf_do_callbacks()
to consume items and always pull the next item from the head of the
list.
The means the item list walk is never dependent on knowing the
next item on the list and hence allows callbacks to remove items
from the list as well. This allows callbacks to do bulk operations
by scanning the list for identical callbacks, consuming them all
and then processing them in bulk, negating the need for multiple
callbacks of that type.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Fri, 17 Dec 2010 09:08:04 +0000 (20:08 +1100)]
xfs: reduce the number of AIL push wakeups
The xfaild often tries to rest to wait for congestion to pass of for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further
make short sleeps uninterruptible as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 01:02:19 +0000 (12:02 +1100)]
xfs: bulk AIL insertion during transaction commit
When inserting items into the AIL from the transaction committed
callbacks, we take the AIL lock for every single item that is to be
inserted. For a CIL checkpoint commit, this can be tens of thousands
of individual inserts, yet almost all of the items will be inserted
at the same point in the AIL because they have the same index.
To reduce the overhead and contention on the AIL lock for such
operations, introduce a "bulk insert" operation which allows a list
of log items with the same LSN to be inserted in a single operation
via a list splice. To do this, we need to pre-sort the log items
being committed into a temporary list for insertion.
The complexity is that not every log item will end up with the same
LSN, and not every item is actually inserted into the AIL. Items
that don't match the commit LSN will be inserted and unpinned as per
the current one-at-a-time method (relatively rare), while items that
are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit lsn are placed in a
temporary array and inserted into the AIL in bulk each time the
array fills up.
As a result of this, we trade off AIL hold time for a significant
reduction in traffic. lock_stat output shows that the worst case
hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversal decreases
significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Fri, 3 Dec 2010 05:42:57 +0000 (16:42 +1100)]
xfs: clean up xfs_ail_delete()
xfs_ail_delete() has a needlessly complex interface. It returns the log item
that was passed in for deletion (which the callers then assert is identical to
the one passed in), and callers of xfs_ail_delete() still need to invalidate
current traversal cursors.
Make xfs_ail_delete() return void, move the cursor invalidation inside it, and
clean up the callers just to use the log item pointer they passed in.
While cleaning up, remove the messy and unnecessary "/* ARGUSED */" comments
around all these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 00:59:49 +0000 (11:59 +1100)]
xfs: Pull EFI/EFD handling out from under the AIL lock
EFI/EFD interactions are protected from races by the AIL lock. They
are the only type of log items that require the the AIL lock to
serialise internal state, so they need to be separated from the AIL
lock before we can do bulk insert operations on the AIL.
To acheive this, convert the counter of the number of extents in the
EFI to an atomic so it can be safely manipulated by EFD processing
without locks. Also, convert the EFI state flag manipulations to use
atomic bit operations so no locks are needed to record state
changes. Finally, use the state bits to determine when it is safe to
free the EFI and clean up the code to do this neatly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Mon, 20 Dec 2010 00:57:24 +0000 (11:57 +1100)]
xfs: fix EFI transaction cancellation.
XFS_EFI_CANCELED has not been set in the code base since
xfs_efi_cancel() was removed back in 2006 by commit
065d312e15902976d256ddaf396a7950ec0350a8 ("[XFS] Remove unused
iop_abort log item operation), and even then xfs_efi_cancel() was
never called. I haven't tracked it back further than that (beyond
git history), but it indicates that the handling of EFIs in
cancelled transactions has been broken for a long time.
Basically, when we get an IOP_UNPIN(lip, 1); call from
xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log
item descriptor we leak it. Fix the behviour to be correct and kill
the XFS_EFI_CANCELED flag.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Thu, 2 Dec 2010 05:31:13 +0000 (16:31 +1100)]
xfs: connect up buffer reclaim priority hooks
Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Thu, 2 Dec 2010 05:30:55 +0000 (16:30 +1100)]
xfs: add a lru to the XFS buffer cache
Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsibile for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.
The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().
Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is none zero
we re-add the buffer reference and add the inode to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, if
released the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.
This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 30 Nov 2010 06:27:57 +0000 (17:27 +1100)]
xfs: convert xfsbud shrinker to a per-buftarg shrinker.
Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Thu, 16 Dec 2010 06:08:41 +0000 (17:08 +1100)]
xfs: convert pag_ici_lock to a spin lock
now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect it's exclusive
nature.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Fri, 17 Dec 2010 06:29:43 +0000 (17:29 +1100)]
xfs: convert inode cache lookups to use RCU locking
With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There is
also a lot more write lock acquistions than there are read locks (4:1 ratio)
so the read locking is not really buying us much in the way of parallelism.
To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so enѕure that ever freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
lookup path, but also add a check for a zero inode number as well.
We canthen convert all the read locking lockups to use RCU read side locking
and hence remove all read side locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Thu, 16 Dec 2010 05:41:39 +0000 (16:41 +1100)]
xfs: rcu free inodes
Introduce RCU freeing of XFS inodes so that we can convert lookup
traversals to use rcu_read_lock() protection. This patch only
introduces the RCU freeing to minimise the potential conflicts with
mainline if this is merged into mainline via a VFS patchset. It
abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union so it is safe to use like this and
simplifies and merge issues.
This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
The later lookup patches need the same "found free inode" protection
regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with apprpriately at merge time without
affecting any other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Dave Chinner [Thu, 23 Dec 2010 01:02:31 +0000 (12:02 +1100)]
xfs: don't truncate prealloc from frequently accessed inodes
A long standing problem for streaming writeѕ through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF. This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.
To avoid this problem, keep track of ->release calls made on a dirty
inode. For most cases, an inode is only going to be opened once for
writing and then closed again during it's lifetime in cache. Hence
if there are multiple ->release calls when the inode is dirty, there
is a good chance that the inode is being accessed by the NFS server.
Hence set a flag the first time ->release is called while there are
delalloc blocks still outstanding on the inode.
If this flag is set when ->release is next called, then do no
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.
If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 4 Jan 2011 00:35:03 +0000 (11:35 +1100)]
xfs: dynamic speculative EOF preallocation
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.
For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..
1048575]:
1048672..
2097247 0 (
1048672..
2097247)
1048576
1: [
1048576..
2097151]:
5242976..
6291551 0 (
5242976..
6291551)
1048576
2: [
2097152..
4194303]:
12583008..
14680159 0 (
12583008..
14680159)
2097152
3: [
4194304..
8388607]:
25165920..
29360223 0 (
25165920..
29360223)
4194304
4: [
8388608..
16777215]:
58720352..
67108959 0 (
58720352..
67108959)
8388608
5: [
16777216..
33554423]:
117440584..
134217791 0 (
117440584..
134217791)
16777208
6: [
33554424..
50331511]:
184549056..
201326143 0 (
184549056..
201326143)
16777088
7: [
50331512..
67108599]:
251657408..
268434495 0 (
251657408..
268434495)
16777088
and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]:
2490472..
2752615 0 (
2490472..
2752615) 262144
1: [262144..524287]:
6291560..
6553703 0 (
6291560..
6553703) 262144
2: [524288..
1048575]:
13631592..
14155879 0 (
13631592..
14155879) 524288
3: [
1048576..
2097151]:
30408808..
31457383 0 (
30408808..
31457383)
1048576
4: [
2097152..
4194303]:
52428904..
54526055 0 (
52428904..
54526055)
2097152
5: [
4194304..
8388607]:
104857704..
109052007 0 (
104857704..
109052007)
4194304
6: [
8388608..
16777215]:
209715304..
218103911 0 (
209715304..
218103911)
8388608
7: [
16777216..
33554423]:
452984848..
469762055 0 (
452984848..
469762055)
16777208
Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):
freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)
This should reduce the amount of space held in speculative
preallocation for such cases.
The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Dave Chinner [Thu, 23 Dec 2010 00:57:37 +0000 (11:57 +1100)]
xfs: use KM_NOFS for allocations during attribute list operations
When listing attributes, we are doiing memory allocations under the
inode ilock using only KM_SLEEP. This allows memory allocation to
recurse back into the filesystem and do writeback, which may the
ilock we already hold on the current inode. THis will deadlock.
Hence use KM_NOFS for such allocations outside of transaction
context to ensure that reclaim recursion does not occur.
Reported-by: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Thu, 23 Dec 2010 00:57:13 +0000 (11:57 +1100)]
xfs: provide a inode iolock lockdep class
The XFS iolock needs to be re-initialised to a new lock class before
it enters reclaim to prevent lockdep false positives. Unfortunately,
this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
state can be recycled and not re-initialised before being reused.
We need to re-initialise the lock state when transfering out of
XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
class as if the inode was just allocated. Hence we need a specific
lockdep class variable for the iolock so that both initialisations
use the same class.
While there, add a specific class for inodes in the reclaim state so
that it is easy to tell from lockdep reports what state the inode
was in that generated the report.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Fri, 10 Dec 2010 15:04:11 +0000 (15:04 +0000)]
xfs: factor duplicate code in xfs_alloc_ag_vextent_near into a helper
Add a new xfs_alloc_find_best_extent that does a forward/backward
search in the allocation btree. That code previously was existed
two times in xfs_alloc_ag_vextent_near, once for each search
direction.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 15:03:57 +0000 (15:03 +0000)]
xfs: clean up xfs_alloc_ag_vextent_exact
Use a goto label to consolidate all block not found cases, and add a
tracepoint for them. Also clean up a few whitespace issues.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:25 +0000 (08:42 +0000)]
xfs: simplify xfs_map_at_offset
Move the buffer locking into the callers as they need to do it
wether they call xfs_map_at_offset or not. Remove the b_bdev
assignment, which is already done by get_blocks. Remove the
duplicate extent type asserts in xfs_convert_page just before
calling xfs_map_at_offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:24 +0000 (08:42 +0000)]
xfs: refactor xfs_vm_writepage
After the last patches the code for overwrites is the same as for
delayed and unwritten extents except that it doesn't need to call
xfs_map_at_offset. Take care of that fact to simplify
xfs_vm_writepage.
The buffer loop now first checks the type of buffer and checks/sets
the ioend type, or continues to the next buffer if it's not
interesting to us. Only after that we validate the iomap and
perform the block mapping if needed, all in common code for the
cases where we have to do work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:23 +0000 (08:42 +0000)]
xfs: remove the all_bh flag from xfs_convert_page
The all_bh flag is always set when entering the page clustering
machinery with a regular written extent, which means the check for
it is superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:22 +0000 (08:42 +0000)]
xfs: remove xfs_probe_cluster
xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE
entire flag, which tells it to not cap the extent at the passed in
size, but just treat the size as an minimum to map. This means
xfs_probe_cluster is entirely useless as we'll always get the whole
extent back anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:21 +0000 (08:42 +0000)]
xfs: simplify xfs_map_blocks
No need to lock the extent map exclusive when performing an
overwrite, we know the extent map must already have been loaded by
get_blocks. Apply the non-blocking inode semantics to all mapping
types instead of just delayed allocations. Remove the handling of
not yet allocated blocks for the IO_UNWRITTEN case - if an extent is
marked as unwritten allocated in the buffer it must already have an
extent on disk.
Add asserts to verify all the assumptions above in debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:20 +0000 (08:42 +0000)]
xfs: kill xfs_iomap
Opencode the xfs_iomap code in it's two callers. The overlap of
passed flags already was minimal and will be further reduced in the
next patch.
As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
for I/O end processing are merged into a single set of flags, which
should be a bit more descriptive of the operation we perform.
Also improve the tracing by giving each caller it's own type set of
tracepoints.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:19 +0000 (08:42 +0000)]
xfs: cleanup the xfs_iomap_write_* helpers
Remove passing the BMAPI_* flags to these helpers, in
xfs_iomap_write_direct the check BMAPI_DIRECT was always true, and
in the xfs_iomap_write_delay path is was never checked at all.
Remove the nmap return value as we never make use of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:18 +0000 (08:42 +0000)]
xfs: a few small tweaks for overwrites in xfs_vm_writepage
Don't trylock the buffer. We are the only one ever locking it for a
regular file address space, and trylock was only copied from the
generic code which did it due to the old buffer based writeout in
jbd. Also make sure to only write out the buffer if the iomap
actually is valid, because we wouldn't have a proper mapping
otherwise. In practice we will never get an invalid mapping here as
the page lock guarantees truncate doesn't race with us, but better
be safe than sorry. Also make sure we allocate a new ioend when
crossing boundaries between mappings, just like we do for delalloc
and unwritten extents. Again this currently doesn't matter as the
I/O end handler only cares for the boundaries for unwritten extents,
but this makes the code fully correct and the same as for
delalloc/unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:17 +0000 (08:42 +0000)]
xfs: remove some dead bio handling code
We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this
can only happen for discards, and used to happen for barriers, none
of which is every submitted by xfs_submit_ioend_bio. Also remove
the loop around bio_alloc as it will never fail due to it's mempool
backing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Fri, 10 Dec 2010 08:42:16 +0000 (08:42 +0000)]
xfs: improve mapping type check in xfs_vm_writepage
Currently we only refuse a "read-only" mapping for writing out
unwritten and delayed buffers, and refuse any other for overwrites.
Improve the checks to require delalloc mappings for delayed buffers,
and unwritten extent mappings for unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Wed, 1 Dec 2010 22:06:24 +0000 (22:06 +0000)]
xfs: untangle phase1 vs phase2 recovery helpers
Dispatch to a different helper for phase1 vs phase2 in
xlog_recover_commit_trans instead of doing it in all the
low-level functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Wed, 1 Dec 2010 22:06:23 +0000 (22:06 +0000)]
xfs: refactor xlog_recover_commit_trans
Merge the call to xlog_recover_reorder_trans and the loop over the
recovery items from xlog_recover_do_trans into xlog_recover_commit_trans,
and keep the switch statement over the log item types as a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Wed, 1 Dec 2010 22:06:22 +0000 (22:06 +0000)]
xfs: use struct list_head for the buf cancel table
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Wed, 1 Dec 2010 22:06:21 +0000 (22:06 +0000)]
xfs: remove leftovers of old buffer log items in recovery code
XFS used to support different types of buffer log items long time
ago. Remove the switch statements checking the log item type in
various buffer recovery helpers that were left over from those days
and the rather useless xlog_recover_do_buffer_pass2 wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Samuel Kvasnica [Fri, 19 Nov 2010 13:38:49 +0000 (13:38 +0000)]
xfs: fix exporting with left over 64-bit inodes
We now support mounting and using filesystems with 64-bit inodes
even when not mounted with the inode64 option (which now only
controls if we allocate new inodes in that space or not). Make sure
we always use large NFS file handles when exporting a filesystem
that may contain 64-bit inodes. Note that this only affects newly
generated file handles, any outstanding 32-bit file handle is still
accepted.
[hch: the comment and commit log are mine, the rest is from a patch
snipplet from Samuel]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Christoph Hellwig [Tue, 7 Dec 2010 10:16:41 +0000 (10:16 +0000)]
xfs: log timestamp changes to the source inode in rename
Now that we don't mark VFS inodes dirty anymore for internal
timestamp changes, but rely on the transaction subsystem to push
them out, we need to explicitly log the source inode in rename after
updating it's timestamps to make sure the changes actually get
forced out by sync/fsync or an AIL push.
We already account for the fourth inode in the log reservation, as a
rename of directories needs to update the nlink field, so just
adding the xfs_trans_log_inode call is enough.
This fixes the xfsqa 065 regression introduced by:
"xfs: don't use vfs writeback for pure metadata modifications"
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Tue, 30 Nov 2010 04:15:31 +0000 (15:15 +1100)]
xfs: only run xfs_error_test if error injection is active
Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:
samples pcnt function DSO
_______ _____ ___________________________ _________________
224648.00 36.8% xfs_error_test [kernel.kallsyms]
86045.00 14.1% xfs_btree_check_sblock [kernel.kallsyms]
39778.00 6.5% prandom32 [kernel.kallsyms]
37436.00 6.1% xfs_btree_increment [kernel.kallsyms]
29278.00 4.8% xfs_btree_get_rec [kernel.kallsyms]
27717.00 4.5% random32 [kernel.kallsyms]
Walking btree blocks during allocation checking them requires each
block (a cache hit, so no I/O) call xfs_error_test(), which then
does a random32() call as the first operation. IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.
Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 30 Nov 2010 04:15:46 +0000 (15:15 +1100)]
xfs: avoid moving stale inodes in the AIL
When an inode has been marked stale because the cluster is being
freed, we don't want to (re-)insert this inode into the AIL. There
is a race condition where the cluster buffer may be unpinned before
the inode is inserted into the AIL during transaction committed
processing. If the buffer is unpinned before the inode item has been
committed and inserted, then it is possible for the buffer to be
released and hence processthe stale inode callbacks before the inode
is inserted into the AIL.
In this case, we then insert a clean, stale inode into the AIL which
will never get removed by an IO completion. It will, however, get
reclaimed and that triggers an assert in xfs_inode_free()
complaining about freeing an inode still in the AIL.
This race can be avoided by not moving stale inodes forward in the AIL
during transaction commit completion processing. This closes the
race condition by ensuring we never insert clean stale inodes into
the AIL. It is safe to do this because a dirty stale inode, by
definition, must already be in the AIL.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 30 Nov 2010 04:16:02 +0000 (15:16 +1100)]
xfs: delayed alloc blocks beyond EOF are valid after writeback
There is an assumption in the parts of XFS that flushing a dirty
file will make all the delayed allocation blocks disappear from an
inode. That is, that after calling xfs_flush_pages() then
ip->i_delayed_blks will be zero.
This is an invalid assumption as we may have specualtive
preallocation beyond EOF and they are recorded in
ip->i_delayed_blks. A flush of the dirty pages of an inode will not
change the state of these blocks beyond EOF, so a non-zero
deeelalloc block count after a flush is valid.
The bmap code has an invalid ASSERT() that needs to be removed, and
the swapext code has a bug in that while it swaps the data forks
around, it fails to swap the i_delayed_blks counter associated with
the fork and hence can get the block accounting wrong.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 30 Nov 2010 04:16:16 +0000 (15:16 +1100)]
xfs: push stale, pinned buffers on trylock failures
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.
The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.
This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Tue, 30 Nov 2010 04:14:39 +0000 (15:14 +1100)]
xfs: fix failed write truncation handling.
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks. As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well. Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.
But full blown truncate seems like overkill here. We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.
The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.
Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Linus Torvalds [Tue, 30 Nov 2010 04:42:04 +0000 (20:42 -0800)]
Linux 2.6.37-rc4
Linus Torvalds [Tue, 30 Nov 2010 04:41:39 +0000 (20:41 -0800)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Use call_rcu_sched() for pagetables
Peter Zijlstra [Fri, 26 Nov 2010 14:38:45 +0000 (15:38 +0100)]
powerpc: Use call_rcu_sched() for pagetables
PowerPC relies on IRQ-disable to guard against RCU quiecent states,
use the appropriate RCU call version.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Dave Airlie [Mon, 29 Nov 2010 23:15:46 +0000 (09:15 +1000)]
Revert "debug_locks: set oops_in_progress if we will log messages."
This reverts commit
e0fdace10e75dac67d906213b780ff1b1a4cc360.
On-list discussion seems to suggest that the robustness fixes for printk
make this unnecessary and DaveM has also agreed in person at Kernel Summit
and on list.
The main problem with this code is once we hit a lockdep splat we always
keep oops_in_progress set, the console layer uses oops_in_progress with KMS
to decide when it should be showing the oops and not showing X, so it causes
problems around suspend/resume time when a userspace resume can cause a console
switch away from X, only if oops_in_progress is set (which is what we want
if an oops actually is in progress, but not because we had a lockdep splat
2 days prior).
Cc: David S Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 29 Nov 2010 22:38:06 +0000 (14:38 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
tpm: Autodetect itpm devices
Linus Torvalds [Mon, 29 Nov 2010 22:36:33 +0000 (14:36 -0800)]
Merge git://git./linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
af_unix: limit recursion level
pch_gbe driver: The wrong of initializer entry
pch_gbe dreiver: chang author
ucc_geth: fix ucc halt problem in half duplex mode
inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners
ehea: Add some info messages and fix an issue
hso: fix disable_net
NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty
cxgb4vf: fix setting unicast/multicast addresses ...
net, ppp: Report correct error code if unit allocation failed
DECnet: don't leak uninitialized stack byte
au1000_eth: fix invalid address accessing the MAC enable register
dccp: fix error in updating the GAR
tcp: restrict net.ipv4.tcp_adv_win_scale (#20312)
netns: Don't leak others' openreq-s in proc
Net: ceph: Makefile: Remove unnessary code
vhost/net: fix rcu check usage
econet: fix CVE-2010-3848
econet: fix CVE-2010-3850
econet: disallow NULL remote addr for sendmsg(), fixes CVE-2010-3849
...
Linus Torvalds [Mon, 29 Nov 2010 22:36:07 +0000 (14:36 -0800)]
Merge branch 'omap-fixes-for-linus' of git://git./linux/kernel/git/tmlind/linux-omap-2.6
* 'omap-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6:
OMAP2+: PM/serial: hold console semaphore while OMAP UARTs are disabled
OMAP: UART: don't resume UARTs that are not enabled.
Matthew Garrett [Thu, 21 Oct 2010 21:42:40 +0000 (17:42 -0400)]
tpm: Autodetect itpm devices
Some Lenovos have TPMs that require a quirk to function correctly. This can
be autodetected by checking whether the device has a _HID of INTC0102. This
is an invalid PNPid, and as such is discarded by the pnp layer - however
it's still present in the ACPI code, so we can pull it out that way. This
means that the quirk won't be automatically applied on non-ACPI systems,
but without ACPI we don't have any way to identify the chip anyway so I
don't think that's a great concern.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: James Morris <jmorris@namei.org>
Linus Torvalds [Mon, 29 Nov 2010 22:11:08 +0000 (14:11 -0800)]
Merge git://git./linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
Linus Torvalds [Mon, 29 Nov 2010 22:10:44 +0000 (14:10 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/bp/bp
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
EDAC: Fix typos in Documentation/edac.txt
EDAC, MCE: Fix edac_init_mce_inject error handling
EDAC: Remove deprecated kbuild goal definitions
Linus Torvalds [Mon, 29 Nov 2010 22:10:22 +0000 (14:10 -0800)]
Merge git://git./linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Userland expects quota limit/warn/usage in 512b blocks
Eric Dumazet [Thu, 25 Nov 2010 04:11:39 +0000 (04:11 +0000)]
af_unix: limit recursion level
Its easy to eat all kernel memory and trigger NMI watchdog, using an
exploit program that queues unix sockets on top of others.
lkml ref : http://lkml.org/lkml/2010/11/25/8
This mechanism is used in applications, one choice we have is to have a
recursion limit.
Other limits might be needed as well (if we queue other types of files),
since the passfd mechanism is currently limited by socket receive queue
sizes only.
Add a recursion_level to unix socket, allowing up to 4 levels.
Each time we send an unix socket through sendfd mechanism, we copy its
recursion level (plus one) to receiver. This recursion level is cleared
when socket receive queue is emptied.
Reported-by: Марк Коренберг <socketpair@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiharu Okada [Mon, 29 Nov 2010 06:18:07 +0000 (06:18 +0000)]
pch_gbe driver: The wrong of initializer entry
The wrong of initializer entry was modified.
Signed-off-by: Toshiharu Okada <toshiharu-linux@dsn.okisemi.com>
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiharu Okada [Sun, 21 Nov 2010 19:58:37 +0000 (19:58 +0000)]
pch_gbe dreiver: chang author
This driver's AUTHOR was changed to "Toshiharu Okada" from "Masayuki Ohtake".
I update the Kconfig, renamed "Topcliff" to "EG20T".
Signed-off-by: Toshiharu Okada <toshiharu-linux@dsn.okisemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Mason [Mon, 29 Nov 2010 14:49:11 +0000 (09:49 -0500)]
Btrfs: don't use migrate page without CONFIG_MIGRATION
Fixes compile error
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yang Li [Thu, 25 Nov 2010 23:29:58 +0000 (23:29 +0000)]
ucc_geth: fix ucc halt problem in half duplex mode
In commit
58933c64(ucc_geth: Fix the wrong the Rx/Tx FIFO size),
the UCC_GETH_UTFTT_INIT is set to 512 based on the recommendation
of the QE Reference Manual. But that will sometimes cause tx halt
while working in half duplex mode.
According to errata draft QE_GENERAL-A003(High Tx Virtual FIFO
threshold size can cause UCC to halt), setting UTFTT less than
[(UTFS x (M - 8)/M) - 128] will prevent this from happening
(M is the minimum buffer size).
The patch changes UTFTT back to 256.
Signed-off-by: Li Yang <leoli@freescale.com>
Cc: Jean-Denis Boyer <jdboyer@media5corp.com>
Cc: Andreas Schmitz <Andreas.Schmitz@riedel.net>
Cc: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nagendra Tomar [Fri, 26 Nov 2010 14:26:27 +0000 (14:26 +0000)]
inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners
inet sockets corresponding to passive connections are added to the bind hash
using ___inet_inherit_port(). These sockets are later removed from the bind
hash using __inet_put_port(). These two functions are not exactly symmetrical.
__inet_put_port() decrements hashinfo->bsockets and tb->num_owners, whereas
___inet_inherit_port() does not increment them. This results in both of these
going to -ve values.
This patch fixes this by calling inet_bind_hash() from ___inet_inherit_port(),
which does the right thing.
'bsockets' and 'num_owners' were introduced by commit
a9d8f9110d7e953c
(inet: Allowing more than 64k connections and heavily optimize bind(0))
Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Breno Leitao [Fri, 26 Nov 2010 07:26:27 +0000 (07:26 +0000)]
ehea: Add some info messages and fix an issue
This patch adds some debug information about ehea not being able to
allocate enough spaces. Also it correctly updates the amount of available
skb.
Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Mason [Mon, 29 Nov 2010 00:56:33 +0000 (19:56 -0500)]
Btrfs: deal with DIO bios that span more than one ordered extent
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Linus Torvalds [Mon, 29 Nov 2010 00:27:19 +0000 (16:27 -0800)]
Un-inline get_pipe_info() helper function
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 28 Nov 2010 22:09:57 +0000 (14:09 -0800)]
Export 'get_pipe_info()' to other users
And in particular, use it in 'pipe_fcntl()'.
The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.
The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called. But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 28 Nov 2010 21:56:09 +0000 (13:56 -0800)]
Rename 'pipe_info()' to 'get_pipe_info()'
.. and change it to take the 'file' pointer instead of an inode, since
that's what all users want anyway.
The renaming is preparatory to exporting it to other users. The old
'pipe_info()' name was too generic and is already used elsewhere, so
before making the function public we need to use a more specific name.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 28 Nov 2010 20:25:02 +0000 (12:25 -0800)]
Merge branch 'perf-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Fix the software context switch counter
perf, x86: Fixup Kconfig deps
x86, perf, nmi: Disable perf if counters are not accessible
perf: Fix inherit vs. context rotation bug
Linus Torvalds [Sun, 28 Nov 2010 20:24:20 +0000 (12:24 -0800)]
Merge branch 'fwnet' of git://git./linux/kernel/git/ieee1394/linux1394-2.6
* 'fwnet' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
firewire: net: throttle TX queue before running out of tlabels
firewire: net: replace lists by counters
firewire: net: fix memory leaks
firewire: net: count stats.tx_packets and stats.tx_bytes
Filip Aben [Thu, 25 Nov 2010 03:40:50 +0000 (03:40 +0000)]
hso: fix disable_net
The HSO driver incorrectly creates a serial device instead of a net
device when disable_net is set. It shouldn't create anything for the
network interface.
Signed-off-by: Filip Aben <f.aben@option.com>
Reported-by: Piotr Isajew <pki@ex.com.pl>
Reported-by: Johan Hovold <jhovold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Slaby [Wed, 24 Nov 2010 13:54:54 +0000 (13:54 +0000)]
NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty
We register lapb when tty is created, but unregister it only when the
device is UP. So move the lapb_unregister to x25_asy_close_tty after
the device is down.
The old behaviour causes ldisc switching to fail each second attempt,
because we noted for us that the device is unused, so we use it the
second time, but labp layer still have it registered, so it fails
obviously.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: Sergey Lapin <slapin@ossfans.org>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Tested-by: Sergey Lapin <slapin@ossfans.org>
Tested-by: Mikhail Ulyanov <ulyanov.mikhail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Casey Leedom [Wed, 24 Nov 2010 12:23:57 +0000 (12:23 +0000)]
cxgb4vf: fix setting unicast/multicast addresses ...
We were truncating the number of unicast and multicast MAC addresses
supported. Additionally, we were incorrectly computing the MAC Address
hash (a "1 << N" where we needed a "1ULL << N").
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cyrill Gorcunov [Tue, 23 Nov 2010 11:43:44 +0000 (11:43 +0000)]
net, ppp: Report correct error code if unit allocation failed
Allocating unit from ird might return several error codes
not only -EAGAIN, so it should not be changed and returned
precisely. Same time unit release procedure should be invoked
only if device is unregistering.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Rosenberg [Tue, 23 Nov 2010 11:02:13 +0000 (11:02 +0000)]
DECnet: don't leak uninitialized stack byte
A single uninitialized padding byte is leaked to userspace.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
CC: stable <stable@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wolfgang Grandegger [Tue, 23 Nov 2010 06:40:25 +0000 (06:40 +0000)]
au1000_eth: fix invalid address accessing the MAC enable register
"aup->enable" holds already the address pointing to the MAC enable
register. The bug was introduced by commit d0e7cb:
"au1000-eth: remove volatiles, switch to I/O accessors".
CC: Florian Fainelli <florian@openwrt.org>
Signed-off-by: Wolfgang Grandegger <wg@denx.de>
Acked-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 23 Nov 2010 02:36:56 +0000 (02:36 +0000)]
dccp: fix error in updating the GAR
This fixes a bug in updating the Greatest Acknowledgment number Received (GAR):
the current implementation does not track the greatest received value -
lower values in the range AWL..AWH (RFC 4340, 7.5.1) erase higher ones.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 28 Nov 2010 19:27:44 +0000 (11:27 -0800)]
Merge branch 'vhost-net' of git://git./linux/kernel/git/mst/vhost
Alexey Dobriyan [Mon, 22 Nov 2010 12:54:21 +0000 (12:54 +0000)]
tcp: restrict net.ipv4.tcp_adv_win_scale (#20312)
tcp_win_from_space() does the following:
if (sysctl_tcp_adv_win_scale <= 0)
return space >> (-sysctl_tcp_adv_win_scale);
else
return space - (space >> sysctl_tcp_adv_win_scale);
"space" is int.
As per C99 6.5.7 (3) shifting int for 32 or more bits is
undefined behaviour.
Indeed, if sysctl_tcp_adv_win_scale is exactly 32,
space >> 32 equals space and function returns 0.
Which means we busyloop in tcp_fixup_rcvbuf().
Restrict net.ipv4.tcp_adv_win_scale to [-31, 31].
Fix https://bugzilla.kernel.org/show_bug.cgi?id=20312
Steps to reproduce:
echo 32 >/proc/sys/net/ipv4/tcp_adv_win_scale
wget www.kernel.org
[softlockup]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov [Mon, 22 Nov 2010 03:26:12 +0000 (03:26 +0000)]
netns: Don't leak others' openreq-s in proc
The /proc/net/tcp leaks openreq sockets from other namespaces.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>