Joe Thornber [Fri, 10 May 2013 13:37:21 +0000 (14:37 +0100)]
dm thin: generate event when metadata threshold passed
Generate a dm event when the amount of remaining thin pool metadata
space falls below a certain level.
The threshold is taken to be a quarter of the size of the metadata
device with a minimum threshold of 4MB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:20 +0000 (14:37 +0100)]
dm persistent metadata: add space map threshold callback
Add a threshold callback to dm persistent data space maps.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:20 +0000 (14:37 +0100)]
dm persistent data: add threshold callback to space map
Add a threshold callback function to the persistent data space map
interface for a subsequent patch to use.
dm-thin and dm-cache are interested in knowing when they're getting
low on metadata or data blocks. This patch introduces a new method
for registering a callback against a threshold.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:19 +0000 (14:37 +0100)]
dm thin: detect metadata device resizing
Allow the dm thin pool metadata device to be extended.
Whenever a pool is resumed, detect whether the size of the metadata
device has increased, and if so, extend the metadata to use the new
space.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:19 +0000 (14:37 +0100)]
dm persistent data: support space map resizing
Support extending a dm persistent data metadata space map.
The extend itself is implemented by switching back to the boostrap
allocator and pointing to the new space. The extra bitmap indexes are
then allocated from the new space, and finally we switch back to the
proper space map ops and tweak the reference counts.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:19 +0000 (14:37 +0100)]
dm thin: open dev read only when possible
If a thin pool is created in read-only-metadata mode then only open the
metadata device read-only.
Previously it was always opened with FMODE_READ | FMODE_WRITE.
(Note that dm_get_device() still allows read-only dm devices to be used
read-write at the moment: If I create a read-only linear device for the
metadata, via dmsetup load --readonly, then I can still create a rw pool
out of it.)
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:18 +0000 (14:37 +0100)]
dm thin: refactor data dev resize
Refactor device size functions in preparation for similar metadata
device resizing functions.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:18 +0000 (14:37 +0100)]
dm cache: replace memcpy with struct assignment
Use struct assignment rather than memcpy in dm cache.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:18 +0000 (14:37 +0100)]
dm cache: fix typos in comments
Fix up some typos in dm-cache comments.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Alasdair G Kergon [Fri, 10 May 2013 13:37:17 +0000 (14:37 +0100)]
dm cache policy: fix description of lookup fn
Correct the documented requirement on the return code from dm cache policy
lookup functions stated in the policy module header file.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Alasdair G Kergon [Fri, 10 May 2013 13:37:17 +0000 (14:37 +0100)]
dm: document iterate_devices
Document iterate_devices in device-mapper.h.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:17 +0000 (14:37 +0100)]
dm persistent data: fix error message typos
Fix some typos in dm-space-map-metadata.c error messages.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 10 May 2013 13:37:16 +0000 (14:37 +0100)]
dm cache: tune migration throttling
Tune the dm cache migration throttling.
i) Issue a tick every second, just in case there's no i/o going through.
ii) Drop the migration threshold right down to something suitable for
background work.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 10 May 2013 13:37:16 +0000 (14:37 +0100)]
dm mpath: enable WRITE SAME support
Enable WRITE SAME support in dm multipath. As far as multipath is
concerned it is just another write request.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Bharata B Rao <bharata.rao@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 10 May 2013 13:37:16 +0000 (14:37 +0100)]
dm table: fix write same support
If device_not_write_same_capable() returns true then the iterate_devices
loop in dm_table_supports_write_same() should return false.
Reported-by: Bharata B Rao <bharata.rao@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 10 May 2013 13:37:15 +0000 (14:37 +0100)]
dm bufio: avoid a possible __vmalloc deadlock
This patch uses memalloc_noio_save to avoid a possible deadlock in
dm-bufio. (it could happen only with large block size, at most
PAGE_SIZE << MAX_ORDER (typically 8MiB).
__vmalloc doesn't fully respect gfp flags. The specified gfp flags are
used for allocation of requested pages, structures vmap_area, vmap_block
and vm_struct and the radix tree nodes.
However, the kernel pagetables are allocated always with GFP_KERNEL.
Thus the allocation of pagetables can recurse back to the I/O layer and
cause a deadlock.
This patch uses the function memalloc_noio_save to set per-process
PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore
it. When this flag is set, all allocations in the process are done with
implied GFP_NOIO flag, thus the deadlock can't happen.
This should be backported to stable kernels, but they don't have the
PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore
functions. So, PF_MEMALLOC should be set and restored instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Wei Yongjun [Fri, 10 May 2013 13:37:15 +0000 (14:37 +0100)]
dm snapshot: fix error return code in snapshot_ctr
Return -ENOMEM instead of success if unable to allocate pending
exception mempool in snapshot_ctr.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Wei Yongjun [Fri, 10 May 2013 13:37:14 +0000 (14:37 +0100)]
dm cache: fix error return code in cache_create
Return -ENOMEM if memory allocation fails in cache_create
instead of 0 (to avoid NULL pointer dereference).
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 10 May 2013 13:37:14 +0000 (14:37 +0100)]
dm stripe: fix regression in stripe_width calculation
Fix a regression in the calculation of the stripe_width in the
dm stripe target which led to incorrect processing of device limits.
The stripe_width is the stripe device length divided by the number of
stripes. The group of commits in the range
f14fa69 ("dm stripe: fix
size test") to
eb850de ("dm stripe: support for non power of 2
chunksize") interfered with each other (a merging error) and led to the
stripe_width being set incorrectly to the stripe device length divided by
chunk_size * stripe_count.
For example, a stripe device's table with: 0
33553920 striped 3 512 ...
should result in a stripe_width of
11184640 (
33553920 / 3), but due to
the bug it was getting set to 21845 (
33553920 / (512 * 3)).
The impact of this bug is that device topologies that previously worked
fine with the stripe target are no longer considered valid. In
particular, there is a higher risk of seeing this issue if one of the
stripe devices has a 4K logical block size. Resulting in an error
message like this:
"device-mapper: table: 253:4: len=21845 not aligned to h/w logical block size 4096 of dm-1"
The fix is to swap the order of the divisions and to use a temporary
variable for the second one, so that width retains the intended
value.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.6+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Linus Torvalds [Wed, 8 May 2013 18:51:05 +0000 (11:51 -0700)]
Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
Linus Torvalds [Wed, 8 May 2013 17:13:35 +0000 (10:13 -0700)]
Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
Linus Torvalds [Wed, 8 May 2013 03:49:51 +0000 (20:49 -0700)]
Merge branch 'akpm' (incoming from Andrew)
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
Kent Overstreet [Tue, 7 May 2013 23:19:08 +0000 (16:19 -0700)]
aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.
[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:19:11 +0000 (16:19 -0700)]
aio: kill ki_retry
Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
need this anymore - ki_retry was only called right after the kiocb was
initialized.
This also refactors and trims some duplicated code, as well as cleaning up
the refcounting/error handling a bit.
[akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
[akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:19:10 +0000 (16:19 -0700)]
aio: kill ki_key
ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:56 +0000 (16:18 -0700)]
aio: give shared kioctx fields their own cachelines
[akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:55 +0000 (16:18 -0700)]
aio: kill struct aio_ring_info
struct aio_ring_info was kind of odd, the only place it's used is where
it's embedded in struct kioctx - there's no real need for it.
The next patch rearranges struct kioctx and puts various things on their
own cachelines - getting rid of struct aio_ring_info now makes that
reordering a bit clearer.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:53 +0000 (16:18 -0700)]
aio: kill batch allocation
Previously, allocating a kiocb required touching quite a few global
(well, per kioctx) cachelines... so batching up allocation to amortize
those was worthwhile. But we've gotten rid of some of those, and in
another couple of patches kiocb allocation won't require writing to any
shared cachelines, so that means we can just rip this code out.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:51 +0000 (16:18 -0700)]
aio: change reqs_active to include unreaped completions
The aio code tries really hard to avoid having to deal with the
completion ringbuffer overflowing. To do that, it has to keep track of
the number of outstanding kiocbs, and the number of completions
currently in the ringbuffer - and it's got to check that every time we
allocate a kiocb. Ouch.
But - we can improve this quite a bit if we just change reqs_active to
mean "number of outstanding requests and unreaped completions" - that
means kiocb allocation doesn't have to look at the ringbuffer, which is
a fairly significant win.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:49 +0000 (16:18 -0700)]
aio: use cancellation list lazily
Cancelling kiocbs requires adding them to a per kioctx linked list,
which is one of the few things we need to take the kioctx lock for in
the fast path. But most kiocbs can't be cancelled - so if we just do
this lazily, we can avoid quite a bit of locking overhead.
While we're at it, instead of using a flag bit switch to using ki_cancel
itself to indicate that a kiocb has been cancelled/completed. This lets
us get rid of ki_flags entirely.
[akpm@linux-foundation.org: remove buggy BUG()]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:47 +0000 (16:18 -0700)]
aio: use flush_dcache_page()
This wasn't causing problems before because it's not needed on x86, but
it is needed on other architectures.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:45 +0000 (16:18 -0700)]
aio: make aio_read_evt() more efficient, convert to hrtimers
Previously, aio_read_event() pulled a single completion off the
ringbuffer at a time, locking and unlocking each time. Change it to
pull off as many events as it can at a time, and copy them directly to
userspace.
This also fixes a bug where if copying the event to userspace failed,
we'd lose the event.
Also convert it to wait_event_interruptible_hrtimeout(), which
simplifies it quite a bit.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:43 +0000 (16:18 -0700)]
wait: add wait_event_hrtimeout()
Analagous to wait_event_timeout() and friends, this adds
wait_event_hrtimeout() and wait_event_interruptible_hrtimeout().
Note that unlike the versions that use regular timers, these don't
return the amount of time remaining when they return - instead, they
return 0 or -ETIME if they timed out. because I was uncomfortable with
the semantics of doing it the other way (that I could get it right,
anyways).
If the timer expires, there's no real guarantee that expire_time -
current_time would be <= 0 - due to timer slack certainly, and I'm not
sure I want to know the implications of the different clock bases in
hrtimers.
If the timer does expire and the code calculates that the time remaining
is nonnegative, that could be even worse if the calling code then reuses
that timeout. Probably safer to just return 0 then, but I could imagine
weird bugs or at least unintended behaviour arising from that too.
I came to the conclusion that if other users end up actually needing the
amount of time remaining, the sanest thing to do would be to create a
version that uses absolute timeouts instead of relative.
[akpm@linux-foundation.org: fix description of `timeout' arg]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:41 +0000 (16:18 -0700)]
aio: refcounting cleanup
The usage of ctx->dead was fubar - it makes no sense to explicitly check
it all over the place, especially when we're already using RCU.
Now, ctx->dead only indicates whether we've dropped the initial
refcount. The new teardown sequence is:
set ctx->dead
hlist_del_rcu();
synchronize_rcu();
Now we know no system calls can take a new ref, and it's safe to drop
the initial ref:
put_ioctx();
We also need to ensure there are no more outstanding kiocbs. This was
done incorrectly - it was being done in kill_ctx(), and before dropping
the initial refcount. At this point, other syscalls may still be
submitting kiocbs!
Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
kioctx->users has dropped to 0 and we know no more iocbs could be
submitted.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:39 +0000 (16:18 -0700)]
aio: make aio_put_req() lockless
Freeing a kiocb needed to touch the kioctx for three things:
* Pull it off the reqs_active list
* Decrementing reqs_active
* Issuing a wakeup, if the kioctx was in the process of being freed.
This patch moves these to aio_complete(), for a couple reasons:
* aio_complete() already has to issue the wakeup, so if we drop the
kioctx refcount before aio_complete does its wakeup we don't have to
do it twice.
* aio_complete currently has to take the kioctx lock, so it makes sense
for it to pull the kiocb off the reqs_active list too.
* A later patch is going to change reqs_active to include unreaped
completions - this will mean allocating a kiocb doesn't have to look
at the ringbuffer. So taking the decrement of reqs_active out of
kiocb_free() is useful prep work for that patch.
This doesn't really affect cancellation, since existing (usb) code that
implements a cancel function still calls aio_complete() - we just have
to make sure that aio_complete does the necessary teardown for cancelled
kiocbs.
It does affect code paths where we free kiocbs that were never
submitted; they need to decrement reqs_active and pull the kiocb off the
reqs_active list. This occurs in two places: kiocb_batch_free(), which
is going away in a later patch, and the error path in io_submit_one.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:37 +0000 (16:18 -0700)]
aio: do fget() after aio_get_req()
aio_get_req() will fail if we have the maximum number of requests
outstanding, which depending on the application may not be uncommon. So
avoid doing an unnecessary fget().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:35 +0000 (16:18 -0700)]
aio: dprintk() -> pr_debug()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:33 +0000 (16:18 -0700)]
aio: move private stuff out of aio.h
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:31 +0000 (16:18 -0700)]
aio: add kiocb_cancel()
Minor refactoring, to get rid of some duplicated code
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 7 May 2013 23:18:29 +0000 (16:18 -0700)]
aio: kill return value of aio_complete()
Nothing used the return value, and it probably wasn't possible to use it
safely for the locked versions (aio_complete(), aio_put_req()). Just
kill it.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zach Brown [Tue, 7 May 2013 23:18:27 +0000 (16:18 -0700)]
char: add aio_{read,write} to /dev/{null,zero}
These are handy for measuring the cost of the aio infrastructure with
operations that do very little and complete immediately.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zach Brown [Tue, 7 May 2013 23:18:25 +0000 (16:18 -0700)]
aio: remove retry-based AIO
This removes the retry-based AIO infrastructure now that nothing in tree
is using it.
We want to remove retry-based AIO because it is fundemantally unsafe.
It retries IO submission from a kernel thread that has only assumed the
mm of the submitting task. All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task.
This design flaw means that nothing of any meaningful complexity can use
retry-based AIO.
This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking
around the unused run list in the submission path.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zach Brown [Tue, 7 May 2013 23:18:23 +0000 (16:18 -0700)]
gadget: remove only user of aio retry
This removes the only in-tree user of aio retry. This will let us
remove the retry code from the aio core.
Removing retry is relatively easy as the USB gadget wasn't using it to
retry IOs at all. It always fully submitted the IO in the context of
the initial io_submit() call. It only used the AIO retry facility to
get the submitter's mm context for copying the result of a read back to
user space. This is easy to implement with use_mm() and a work struct,
much like kvm does with async_pf_execute() for get_user_pages().
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zach Brown [Tue, 7 May 2013 23:18:21 +0000 (16:18 -0700)]
aio: remove dead code from aio.h
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zach Brown [Tue, 7 May 2013 23:18:19 +0000 (16:18 -0700)]
mm: remove old aio use_mm() comment
Bunch of performance improvements and cleanups Zach Brown and I have
been working on. The code should be pretty solid at this point, though
it could of course use more review and testing.
The results in my testing are pretty impressive, particularly when an
ioctx is being shared between multiple threads. In my crappy synthetic
benchmark, with 4 threads submitting and one thread reaping completions,
I saw overhead in the aio code go from ~50% (mostly ioctx lock
contention) to low single digits. Performance with ioctx per thread
improved too, but I'd have to rerun those benchmarks.
The reason I've been focused on performance when the ioctx is shared is
that for a fair number of real world completions, userspace needs the
completions aggregated somehow - in practice people just end up
implementing this aggregation in userspace today, but if it's done right
we can do it much more efficiently in the kernel.
Performance wise, the end result of this patch series is that submitting
a kiocb writes to _no_ shared cachelines - the penalty for sharing an
ioctx is gone there. There's still going to be some cacheline
contention when we deliver the completions to the aio ringbuffer (at
least if you have interrupts being delivered on multiple cores, which
for high end stuff you do) but I have a couple more patches not in this
series that implement coalescing for that (by taking advantage of
interrupt coalescing). With that, there's basically no bottlenecks or
performance issues to speak of in the aio code.
This patch:
use_mm() is used in more places than just aio. There's no need to mention
callers when describing the function.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 7 May 2013 23:18:18 +0000 (16:18 -0700)]
mm/vmalloc.c: add vfree comment
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Tue, 7 May 2013 23:18:17 +0000 (16:18 -0700)]
remove unused random32() and srandom32()
After finishing a naming transition, remove unused backward
compatibility wrapper macros
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 7 May 2013 23:18:16 +0000 (16:18 -0700)]
drivers/infiniband/hw: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random number
generator.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Tue, 7 May 2013 23:18:15 +0000 (16:18 -0700)]
drivers/net: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random number
generator.
[akpm@linux-foundation.org: convert team_mode_random.c]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Thomas Sailer <t.sailer@alumni.ethz.ch>
Acked-by: Bing Zhao <bzhao@marvell.com> [mwifiex]
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Thomas Sailer <t.sailer@alumni.ethz.ch>
Cc: Jean-Paul Roubelat <jpr@f6fbb.org>
Cc: Bing Zhao <bzhao@marvell.com>
Cc: Brett Rudley <brudley@broadcom.com>
Cc: Arend van Spriel <arend@broadcom.com>
Cc: "Franky (Zhenhui) Lin" <frankyl@broadcom.com>
Cc: Hante Meuleman <meuleman@broadcom.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Tue, 7 May 2013 23:18:13 +0000 (16:18 -0700)]
hugetlbfs: fix mmap failure in unaligned size request
The current kernel returns -EINVAL unless a given mmap length is
"almost" hugepage aligned. This is because in sys_mmap_pgoff() the
given length is passed to vm_mmap_pgoff() as it is without being aligned
with hugepage boundary.
This is a regression introduced in commit
40716e29243d ("hugetlbfs: fix
alignment of huge page requests"), where alignment code is pushed into
hugetlb_file_setup() and the variable len in caller side is not changed.
To fix this, this patch partially reverts that commit, and adds
alignment code in caller side. And it also introduces hstate_sizelog()
in order to get proper hstate to specified hugepage size.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881
[akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: <iceman_dvd@yahoo.com>
Cc: Steven Truelove <steven.truelove@utoronto.ca>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhao Hongjiang [Tue, 7 May 2013 23:18:12 +0000 (16:18 -0700)]
parisc: remove the second argument of kmap_atomic()
kmap_atomic() requires only one argument now.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas Stach [Tue, 7 May 2013 23:18:11 +0000 (16:18 -0700)]
drivers/rtc/rtc-rs5c372.c: add R2221T/L variant to the driver
Register layout is the same, so just add the variant to the appropriate
places.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 7 May 2013 23:18:10 +0000 (16:18 -0700)]
include/linux/mm.h: complete the mm_walk definition
That nameless-function-arguments thing drives me batty. Fix.
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 7 May 2013 23:18:09 +0000 (16:18 -0700)]
mm, memcg: add rss_huge stat to memory.stat
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat. The units are in
bytes.
This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.
The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiang Liu [Tue, 7 May 2013 23:18:08 +0000 (16:18 -0700)]
mm/SPARC: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 8 May 2013 00:59:53 +0000 (17:59 -0700)]
arm: fix mismerge of arch/arm/mach-omap2/timer.c
I badly screwed up the merge in commit
6fa52ed33bea ("Merge tag
'drivers-for-linus' of git://git.kernel.org/pub/.../arm-soc") by
incorrectly taking the arch/arm/mach-omap2/* data fully from the merge
target because the 'drivers-for-linus' branch seemed to be a proper
superset of the duplicate ARM commits.
That was bogus: commit
ff931c821bab ("ARM: OMAP: clocks: Delay clk inits
atleast until slab is initialized") only existed in head, and the
changes to arch/arm/mach-omap2/timer.c from that commit got list.
Re-doing the merge more carefully, I do think this part was the only
thing I screwed up. Knock wood.
Reported-by: Tony Lindgren <tony@atomide.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Olof Johansson <olof@lixom.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Tue, 7 May 2013 22:39:03 +0000 (15:39 -0700)]
rwsem: check counter to avoid cmpxchg calls
This patch tries to reduce the amount of cmpxchg calls in the writer
failed path by checking the counter value first before issuing the
instruction. If ->count is not set to RWSEM_WAITING_BIAS then there is
no point wasting a cmpxchg call.
Furthermore, Michel states "I suppose it helps due to the case where
someone else steals the lock while we're trying to acquire
sem->wait_lock."
Two very different workloads and machines were used to see how this
patch improves throughput: pgbench on a quad-core laptop and aim7 on a
large 8 socket box with 80 cores.
Some results comparing Michel's fast-path write lock stealing
(tps-rwsem) on a quad-core laptop running pgbench:
| db_size | clients | tps-rwsem | tps-patch |
+---------+----------+----------------+--------------+
| 160 MB | 1 | 6906 | 9153 | + 32.5
| 160 MB | 2 | 15931 | 22487 | + 41.1%
| 160 MB | 4 | 33021 | 32503 |
| 160 MB | 8 | 34626 | 34695 |
| 160 MB | 16 | 33098 | 34003 |
| 160 MB | 20 | 31343 | 31440 |
| 160 MB | 30 | 28961 | 28987 |
| 160 MB | 40 | 26902 | 26970 |
| 160 MB | 50 | 25760 | 25810 |
------------------------------------------------------
| 1.6 GB | 1 | 7729 | 7537 |
| 1.6 GB | 2 | 19009 | 23508 | + 23.7%
| 1.6 GB | 4 | 33185 | 32666 |
| 1.6 GB | 8 | 34550 | 34318 |
| 1.6 GB | 16 | 33079 | 32689 |
| 1.6 GB | 20 | 31494 | 31702 |
| 1.6 GB | 30 | 28535 | 28755 |
| 1.6 GB | 40 | 27054 | 27017 |
| 1.6 GB | 50 | 25591 | 25560 |
------------------------------------------------------
| 7.6 GB | 1 | 6224 | 7469 | + 20.0%
| 7.6 GB | 2 | 13611 | 12778 |
| 7.6 GB | 4 | 33108 | 32927 |
| 7.6 GB | 8 | 34712 | 34878 |
| 7.6 GB | 16 | 32895 | 33003 |
| 7.6 GB | 20 | 31689 | 31974 |
| 7.6 GB | 30 | 29003 | 28806 |
| 7.6 GB | 40 | 26683 | 26976 |
| 7.6 GB | 50 | 25925 | 25652 |
------------------------------------------------------
For the aim7 worloads, they overall improved on top of Michel's
patchset. For full graphs on how the rwsem series plus this patch
behaves on a large 8 socket machine against a vanilla kernel:
http://stgolabs.net/rwsem-aim7-results.tar.gz
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anatol Pomozov [Tue, 7 May 2013 22:37:48 +0000 (15:37 -0700)]
kref: minor cleanup
- make warning smp-safe
- result of atomic _unless_zero functions should be checked by caller
to avoid use-after-free error
- trivial whitespace fix.
Link: https://lkml.org/lkml/2013/4/12/391
Tested: compile x86, boot machine and run xfstests
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
[ Removed line-break, changed to use WARN_ON_ONCE() - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 7 May 2013 22:14:53 +0000 (15:14 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
"A couple of fixes + getting rid of __blkdev_put() return value"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc: Use PDE attribute setting accessor functions
make blkdev_put() return void
block_device_operations->release() should return void
mtd_blktrans_ops->release() should return void
hfs: SMP race on directory close()
Linus Torvalds [Tue, 7 May 2013 22:13:48 +0000 (15:13 -0700)]
Merge branch 'parisc-for-3.10' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"Main fixes and updates in this patch series are:
- we faced kernel stack corruptions because of multiple delivery of
interrupts
- added kernel stack overflow checks
- added possibility to use dedicated stacks for irq processing
- initial support for page sizes > 4k
- more information in /proc/interrupts (e.g. TLB flushes and number
of IPI calls)
- documented how the parisc gateway page works
- and of course quite some other smaller cleanups and fixes."
* 'parisc-for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: tlb flush counting fix for SMP and UP
parisc: more irq statistics in /proc/interrupts
parisc: implement irq stacks
parisc: add kernel stack overflow check
parisc: only re-enable interrupts if we need to schedule or deliver signals when returning to userspace
parisc: implement atomic64_dec_if_positive()
parisc: use long branch in fork_like macro
parisc: fix NATIVE set up in build
parisc: document the parisc gateway page
parisc: fix partly 16/64k PAGE_SIZE boot
parisc: Provide default implementation for dma_{alloc, free}_attrs
parisc: fix whitespace errors in arch/parisc/kernel/traps.c
parisc: remove the second argument of kmap_atomic
Linus Torvalds [Tue, 7 May 2013 22:11:43 +0000 (15:11 -0700)]
Merge tag '3.9-rc3-smp-6-tag' of git://git./linux/kernel/git/sstabellini/xen
Pull ARM Xen SMP updates from Stefano Stabellini:
"This contains a bunch of Xen/ARM specific changes, including some
fixes, SMP support for Xen on ARM, and moving the xenvm machine from
mach-vexpress to mach-virt.
The non-Xen files that are touched are arch/arm/Kconfig, to select
ARM_PSCI on XEN, and arch/arm/boot/dts/Makefile, to build the xenvm
DTB if CONFIG_ARCH_VIRT.
Highlights:
- Move xenvm to mach-virt.
- Implement SMP support in Xen on ARM.
- Add support for machine reboot and power off via Xen hypercalls"
* tag '3.9-rc3-smp-6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen:
xen/arm: remove duplicated include from enlighten.c
xen/arm: use sched_op hypercalls for machine reboot and power off
xenvm: add a simple PSCI node and a second cpu
xen/arm: XEN selects ARM_PSCI
xen: move the xenvm machine to mach-virt
xen/arm: SMP support
xen/arm: implement HYPERVISOR_vcpu_op
xen/arm: actually pass a non-NULL percpu pointer to request_percpu_irq
Helge Deller [Tue, 7 May 2013 21:42:47 +0000 (21:42 +0000)]
parisc: tlb flush counting fix for SMP and UP
Fix up build error on UP and show correctly number of function call
(ipi) irqs.
Signed-off-by: Helge Deller <deller@gmx.de>
Linus Torvalds [Tue, 7 May 2013 21:04:56 +0000 (14:04 -0700)]
Merge tag 'remoteproc-3.10' of git://git./linux/kernel/git/ohad/remoteproc
Pull remoteproc update from Ohad Ben-Cohen:
- Some refactoring, cleanups and small improvements from Sjur
Brændeland. The improvements are mainly about better supporting
varios virtio properties (such as virtio's config space, status and
features). I now see that I messed up while commiting one of Sjur's
patches and erroneously put myself as the author, as well as letting
a nasty typo sneak in. I will not fix this in order to avoid
rebasing the patches. Sjur - sorry!
- A new remoteproc driver for OMAP-L13x (technically a DaVinci
platform) from Robert Tivy.
- Extend OMAP support to OMAP5 as well, from Vincent Stehlé.
- Fix Kconfig VIRTUALIZATION dependency, from Suman Anna (a
non-critical fix which arrived late during the rc cycle).
* tag 'remoteproc-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/remoteproc:
remoteproc: fix kconfig dependencies for VIRTIO
remoteproc/davinci: add a remoteproc driver for OMAP-L13x DSP
remoteproc: support default firmware name in rproc_alloc()
remoteproc/omap: support OMAP5 too
remoteproc: set vring addresses in resource table
remoteproc: support virtio config space.
remoteproc: perserve resource table data
remoteproc: calculate max_notifyid by counting vrings
remoteproc: code cleanup of resource parsing
remoteproc: parse STE-firmware and find resource table address
remoteproc: add find_loaded_rsc_table firmware ops
remoteproc: refactor rproc_elf_find_rsc_table()
Linus Torvalds [Tue, 7 May 2013 21:02:00 +0000 (14:02 -0700)]
Merge tag 'rpmsg-3.10' of git://git./linux/kernel/git/ohad/rpmsg
Pull rpmsg changes from Ohad Ben-Cohen:
"A small pull request consisting of:
- Make rpmsg process all pending messages instead of just one, from
Robert Tivy
- Fix Kconfig dependency on VIRTUALIZATION, from Suman.
Note: this was submitted late during the 3.9 rc cycle and it seemed
appropriate to wait with it for the merge window.
- Belated addition of an rpmsg entry to the MAINTAINERS file. People
seem to look for this"
* tag 'rpmsg-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/rpmsg:
rpmsg: fix kconfig dependencies for VIRTIO
MAINTAINERS: add rpmsg entry
rpmsg: process _all_ pending messages in rpmsg_recv_done
Linus Torvalds [Tue, 7 May 2013 21:01:27 +0000 (14:01 -0700)]
Merge tag 'hwspinlock-3.10' of git://git./linux/kernel/git/ohad/hwspinlock
Pullhwspinlock update from Ohad Ben-Cohen:
"A single patch from Vincent extending OMAP's hwspinlock support to
OMAP5"
* tag 'hwspinlock-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/hwspinlock:
hwspinlock/omap: support OMAP5 as well
Linus Torvalds [Tue, 7 May 2013 21:00:25 +0000 (14:00 -0700)]
Merge branch 'for-3.10' of git://git./linux/kernel/git/tj/libata
Pull libata maintainership change from Tejun Heo.
Tejun is taking over from Jeff, after many many years. Thanks Jeff.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: change maintainer
Helge Deller [Mon, 6 May 2013 19:20:26 +0000 (19:20 +0000)]
parisc: more irq statistics in /proc/interrupts
Add framework and initial values for more fine grained statistics in
/proc/interrupts.
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Tue, 7 May 2013 20:25:42 +0000 (20:25 +0000)]
parisc: implement irq stacks
Default kernel stack size on parisc is 16k. During tests we found that the
kernel stack can easily grow beyond 13k, which leaves 3k left for irq
processing.
This patch adds the possibility to activate an additional stack of 16k per CPU
which is being used during irq processing. This implementation does not yet
uses this irq stack for the irq bh handler.
The assembler code for call_on_stack was heavily cleaned up by John
David Anglin.
CC: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Tue, 7 May 2013 19:28:52 +0000 (19:28 +0000)]
parisc: add kernel stack overflow check
Add the CONFIG_DEBUG_STACKOVERFLOW config option to enable checks to
detect kernel stack overflows.
Stack overflows can not be detected reliable since we do not want to
introduce too much overhead.
Instead, during irq processing in do_cpu_irq_mask() we check kernel
stack usage of the interrupted kernel process. Kernel threads can be
easily detected by checking the value of space register 7 (sr7) which
is zero when running inside the kernel.
Since THREAD_SIZE is 16k and PAGE_SIZE is 4k, reduce the alignment of
the init thread to the lower value (PAGE_SIZE) in the kernel
vmlinux.ld.S linker script.
Signed-off-by: Helge Deller <deller@gmx.de>
Geert Uytterhoeven [Tue, 7 May 2013 09:20:26 +0000 (11:20 +0200)]
proc: Use PDE attribute setting accessor functions
arch/arm/mach-msm/last_radio_log.c: In function 'msm_init_last_radio_log':
arch/arm/mach-msm/last_radio_log.c:69:7: error: dereferencing pointer to incomplete type
arch/cris/kernel/profile.c: In function 'init_cris_profile':
arch/cris/kernel/profile.c:79:8: error: dereferencing pointer to incomplete type
Use proc_set_size(), cfr. commit
271a15eabe094538d958dc68ccfc9c36b699247a
("proc: Supply PDE attribute setting accessor functions")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
John David Anglin [Tue, 7 May 2013 00:07:25 +0000 (00:07 +0000)]
parisc: only re-enable interrupts if we need to schedule or deliver signals when returning to userspace
Helge and I have found that we have a kernel stack overflow problem
which causes a variety of random failures.
Currently, we re-enable interrupts when returning from an external
interrupt incase we need to schedule or delivery
signals. As a result, a potentially unlimited number of interrupts
can occur while we are running on the kernel
stack. It is very limited in space (currently, 16k). This change
defers enabling interrupts until we have
actually decided to schedule or delivery signals. This only occurs
when we about to return to userspace. This
limits the number of interrupts on the kernel stack to one. In other
cases, interrupts remain disabled until the
final return from interrupt (rfi).
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Linus Torvalds [Tue, 7 May 2013 18:28:42 +0000 (11:28 -0700)]
Merge tag 'multiplatform-for-linus-2' of git://git./linux/kernel/git/arm/arm-soc
Pull late ARM Exynos multiplatform changes from Arnd Bergmann:
"These continue the multiplatform support for exynos, adding support
for building most of the essential drivers (clocksource, clk, irqchip)
when combined with other platforms. As a result, it should become
really easy to add full multiplatform exynos support in 3.11, although
we don't yet enable it for 3.10.
The changes were not included in the earlier multiplatform series in
order to avoid clashes with the other Exynos updates.
This also includes work from Tomasz Figa to fix the pwm clocksource
code on Exynos, which is not strictly required for multiplatform, but
related to the other patches in this set and needed as a bug fix for
at least one board."
* tag 'multiplatform-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (22 commits)
ARM: dts: exynops4210: really add universal_c210 dts
ARM: dts: exynos4210: Add basic dts file for universal_c210 board
ARM: dts: exynos4: Add node for PWM device
ARM: SAMSUNG: Do not register legacy timer interrupts on Exynos
clocksource: samsung_pwm_timer: Work around rounding errors in clockevents core
clocksource: samsung_pwm_timer: Correct programming of clock events
clocksource: samsung_pwm_timer: Use proper clockevents max_delta
clocksource: samsung_pwm_timer: Add support for non-DT platforms
clocksource: samsung_pwm_timer: Drop unused samsung_pwm struct
clocksource: samsung_pwm_timer: Keep all driver data in a structure
clocksource: samsung_pwm_timer: Make PWM spinlock global
clocksource: samsung_pwm_timer: Let platforms select the driver
Documentation: Add device tree bindings for Samsung PWM timers
clocksource: add samsung pwm timer driver
irqchip: exynos: look up irq using irq_find_mapping
irqchip: exynos: pass irq_base from platform
irqchip: exynos: localize irq lookup for ATAGS
irqchip: exynos: allocate combiner_data dynamically
irqchip: exynos: pass max combiner number to combiner_init
ARM: exynos: add missing properties for combiner IRQs
...
Linus Torvalds [Tue, 7 May 2013 18:22:14 +0000 (11:22 -0700)]
Merge tag 'cleanup-for-linus-2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC late cleanups from Arnd Bergmann:
"These are cleanups and smaller changes that either depend on earlier
feature branches or came in late during the development cycle. We
normally try to get all cleanups early, so these are the exceptions:
- A follow-up on the clocksource reworks, hopefully the last time we
need to merge clocksource subsystem changes through arm-soc.
A first set of patches was part of the original 3.10 arm-soc
cleanup series because of interdependencies with timer drivers now
moved out of arch/arm.
- Migrating the SPEAr13xx platform away from using auxdata for DMA
channel descriptions towards using information in device tree,
based on the earlier SPEAr multiplatform series
- A few follow-ups on the Atmel SAMA5 support and other changes for
Atmel at91 based on the larger at91 reworks.
- Moving the armada irqchip implementation to drivers/irqchip
- Several OMAP cleanups following up on the larger series already
merged in 3.10."
* tag 'cleanup-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (50 commits)
ARM: OMAP4: change the device names in usb_bind_phy
ARM: OMAP2+: Fix mismerge for timer.c between
ff931c82 and
da4a686a
ARM: SPEAr: conditionalize SMP code
ARM: arch_timer: Silence debug preempt warnings
ARM: OMAP: remove unused variable
serial: amba-pl011: fix !CONFIG_DMA_ENGINE case
ata: arasan: remove the need for platform_data
ARM: at91/sama5d34ek.dts: remove not needed compatibility string
ARM: at91: dts: add MCI DMA support
ARM: at91: dts: add i2c dma support
ARM: at91: dts: set #dma-cells to the correct value
ARM: at91: suspend both memory controllers on at91sam9263
irqchip: armada-370-xp: slightly cleanup irq controller driver
irqchip: armada-370-xp: move IRQ handler to avoid forward declaration
irqchip: move IRQ driver for Armada 370/XP
ARM: mvebu: move L2 cache initialization in init_early()
devtree: add binding documentation for sp804
ARM: integrator-cp: convert use CLKSRC_OF for timer init
ARM: versatile: use OF init for sp804 timer
ARM: versatile: add versatile dtbs to dtbs target
...
Linus Torvalds [Tue, 7 May 2013 18:06:17 +0000 (11:06 -0700)]
Merge tag 'dt-for-linus-2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC device tree updates (part 2) from Arnd Bergmann:
"These are mostly new device tree bindings for existing drivers, as
well as changes to the device tree source files to add support for
those devices, and a couple of new boards, most notably Samsung's
Exynos5 based Chromebook.
The changes depend on earlier platform specific updates and touch the
usual platforms: omap, exynos, tegra, mxs, mvebu and davinci."
* tag 'dt-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (169 commits)
ARM: exynos: dts: cros5250: add EC device
ARM: dts: Add sbs-battery for exynos5250-snow
ARM: dts: Add i2c-arbitrator bus for exynos5250-snow
ARM: dts: add mshc controller node for Exynos4x12 SoCs
ARM: dts: Add chip-id controller node on Exynos4/5 SoC
ARM: EXYNOS: Create virtual I/O mapping for Chip-ID controller using device tree
ARM: davinci: da850-evm: add SPI flash support
ARM: davinci: da850: override SPI DT node device name
ARM: davinci: da850: add SPI1 DT node
spi/davinci: add DT binding documentation
spi/davinci: no wildcards in DT compatible property
ARM: dts: mvebu: Convert mvebu device tree files to 64 bits
ARM: dts: mvebu: introduce internal-regs node
ARM: dts: mvebu: Convert all the mvebu files to use the range property
ARM: dts: mvebu: move all peripherals inside soc
ARM: dts: mvebu: fix cpus section indentation
ARM: davinci: da850: add EHRPWM & ECAP DT node
ARM/dts: OMAP3: fix pinctrl-single configuration
ARM: dts: Add OMAP3430 SDP NOR flash memory binding
ARM: dts: Add NOR flash bindings for OMAP2420 H4
...
Linus Torvalds [Tue, 7 May 2013 18:02:18 +0000 (11:02 -0700)]
Merge tag 'soc-for-linus-3' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC platform updates (part 3) from Arnd Bergmann:
"This is the third and smallest of the SoC specific updates. Changes
include:
- SMP support for the Xilinx zynq platform
- Smaller imx changes
- LPAE support for mvebu
- Moving the orion5x, kirkwood, dove and mvebu platforms to a common
"mbus" driver for their internal devices.
It would be good to get feedback on the location of the "mbus" driver.
Since this is used on multiple platforms may potentially get shared
with other architectures (powerpc and arm64), it was moved to
drivers/bus/. We expect other similar drivers to get moved to the
same place in order to avoid creating more top-level directories under
drivers/ or cluttering up the messy drivers/misc/ even more."
* tag 'soc-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (50 commits)
ARM: imx: reset_controller may be disabled
ARM: mvebu: Align the internal registers virtual base to support LPAE
ARM: mvebu: Limit the DMA zone when LPAE is selected
arm: plat-orion: remove addr-map code
arm: mach-mv78xx0: convert to use the mvebu-mbus driver
arm: mach-orion5x: convert to use mvebu-mbus driver
arm: mach-dove: convert to use mvebu-mbus driver
arm: mach-kirkwood: convert to use mvebu-mbus driver
arm: mach-mvebu: convert to use mvebu-mbus driver
ARM i.MX53: set CLK_SET_RATE_PARENT flag on the tve_ext_sel clock
ARM i.MX53: tve_di clock is not part of the CCM, but of TVE
ARM i.MX53: make tve_ext_sel propagate rate change to PLL
ARM i.MX53: Remove unused tve_gate clkdev entry
ARM i.MX5: Remove tve_sel clock from i.MX53 clock tree
ARM: i.MX5: Add PATA and SRTC clocks
ARM: imx: do not bring up unavailable cores
ARM: imx: add initial imx6dl support
ARM: imx1: mm: add call to mxc_device_init
ARM: imx_v4_v5_defconfig: Add CONFIG_GPIO_SYSFS
ARM: imx_v6_v7_defconfig: Select CONFIG_PERF_EVENTS
...
Linus Torvalds [Tue, 7 May 2013 17:57:51 +0000 (10:57 -0700)]
Merge tag 'soc-for-linus-2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC platform updates (part 2) from Arnd Bergmann:
"These patches are all for Renesas shmobile, and depend on the earlier
pinctrl updates. Remarkably, this adds support for three new SoCs:
r8a73a4, r8a73a4 and r8a7778. The bulk of the code added for these is
for pinctrl (using the new subsystem) and for clocks (not yet using
the common clock subsystem). The latter will have to get converted in
one of the upcoming releases, but shmobile is not ready for that yet.
The series also contains Renesas shmobile board changes, adding one
board file for each of the three new SoCs. These boards are using a
mix of classic and device-tree based probing, as there is still a lot
of infrastructure in shmobile that has not been converted to DT yet.
Once those are resolved to the degree that no board specific setup
code is needed, they can get folded into the respective SoC setup files."
* tag 'soc-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (78 commits)
ARM: shmobile: use r8a7790 timer setup code on Lager
ARM: shmobile: force enable of r8a7790 arch timer
ARM: shmobile: Add second I/O range for r8a7790 PFC
ARM: shmobile: bockw: enable network settings on bootargs
ARM: shmobile: bockw: add SMSC ethernet support
ARM: shmobile: R8A7778: add Ether support
ARM: shmobile: bockw: enable SMSC ethernet on defconfig
ARM: shmobile: r8a7778: add r8a7778_init_irq_extpin()
ARM: shmobile: r8a7778: remove pointless PLATFORM_INFO()
ARM: shmobile: mackerel: clean up MMCIF vs. SDHI1 selection
ARM: shmobile: mackerel: add interrupt names for SDHI0
ARM: shmobile: mackerel: switch SDHI and MMCIF interfaces to slot-gpio
ARM: shmobile: mackerel: remove OCR masks, where regulators are used
ARM: shmobile: mackerel: SDHI resources do not have to be numbered
ARM: shmobile: Initial r8a7790 Lager board support
ARM: shmobile: APE6EVM LAN9220 support
ARM: shmobile: APE6EVM PFC support
ARM: shmobile: APE6EVM base support
ARM: shmobile: kzm9g-reference: add ethernet support
ARM: shmobile: add R-Car M1A Bock-W platform support
...
Tejun Heo [Tue, 7 May 2013 17:09:27 +0000 (10:09 -0700)]
libata: change maintainer
Jeff is leaving for something more interesting and I'm inheriting the
maintainership of libata. Thanks a lot for the good work and have
fun, Jeff!
v2: The original path forgot to update git tree URL. Updated.
Spotted by Sergei.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Linus Torvalds [Tue, 7 May 2013 17:13:52 +0000 (10:13 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull more vhost fixes from Michael Tsirkin:
"This fixes some minor issues in the patches that have been merged.
We also finally drop the workaround disabling event_idx for scsi: it
was always questionable, and now we know it's not needed.
There's also a memory leak fix"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-scsi: Enable VIRTIO_RING_F_EVENT_IDX
vhost: drop virtio_net.h dependency
vhost-net: Cleanup vhost_ubuf and vhost_zcopy
vhost: Remove vhost_enable_zcopy in vhost.h
vhost: Remove comments for hdr in vhost.h
vhost: Move VHOST_NET_FEATURES to net.c
vhost-net: Free ubuf when vhost_dev_set_owner fails
vhost: Export vhost_dev_set_owner
Linus Torvalds [Tue, 7 May 2013 17:12:32 +0000 (10:12 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
"This contains two patchsets from Maxim Patlasov.
The first reworks the request throttling so that only async requests
are throttled. Wakeup of waiting async requests is also optimized.
The second series adds support for async processing of direct IO which
optimizes direct IO and enables the use of the AIO userspace
interface."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add flag to turn on async direct IO
fuse: truncate file if async dio failed
fuse: optimize short direct reads
fuse: enable asynchronous processing direct IO
fuse: make fuse_direct_io() aware about AIO
fuse: add support of async IO
fuse: move fuse_release_user_pages() up
fuse: optimize wake_up
fuse: implement exclusive wakeup for blocked_waitq
fuse: skip blocking on allocations of synchronous requests
fuse: add flag fc->initialized
fuse: make request allocations for background processing explicit
Linus Torvalds [Tue, 7 May 2013 16:34:40 +0000 (09:34 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
"Here are a few more powerpc bits that I would like in 3.10.
Mostly remaining bolts & screw tightening of power8 support such as
actually exposing the new features via the previously added AT_HWCAP2,
and a few fixes, some of them for problems exposed recently like
irqdomain warnings or sysfs access permission issues, some exposed by
power8 hardware.
The only change outside of arch/powerpc is a small one to irqdomain.c
to allow silent failure to fix a problem on Cell where we get a dozen
WARN_ON's tripping at boot for what is basically a normal case."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Make hard_irq_disable() do the right thing vs. irq tracing
powerpc/topology: Fix spurr attribute permission
powerpc/pci: Support per-aperture memory offset
powerpc/cell/iommu: Improve error message for missing node
powerpc/cell/spufs: Fix status attribute permission
irqdomain: Allow quiet failure mode
powerpc/pnv: Fix "compatible" property for P8 PHB
powerpc/pci: Don't add bogus empty resources to PHBs
powerpc/powerpnv: Properly handle failure starting CPUs
powerpc/cputable: Advertise support for ISEL/HTM/DSCR/TAR on POWER8
powerpc/cputable: Advertise ISEL support on appropriate embedded processors
powerpc/cputable: Advertise DSCR support on P7/P7+
powerpc/cputable: Reserve bits in HWCAP2 for new features
powerpc/pseries: Perform proper max_bus_speed detection
powerpc/pseries: Force 32 bit MSIs for devices that require it
powerpc/tm: Fix null pointer deference in flush_hash_page
powerpc/powernv: Defer OPAL exception handler registration
powerpc: Emulate non privileged DSCR read and write
Linus Torvalds [Tue, 7 May 2013 16:22:03 +0000 (09:22 -0700)]
Merge branch 'rwsem-optimizations'
Merge rwsem optimizations from Michel Lespinasse:
"These patches extend Alex Shi's work (which added write lock stealing
on the rwsem slow path) in order to provide rwsem write lock stealing
on the fast path (that is, without taking the rwsem's wait_lock).
I have unfortunately been unable to push this through -next before due
to Ingo Molnar / David Howells / Peter Zijlstra being busy with other
things. However, this has gotten some attention from Rik van Riel and
Davidlohr Bueso who both commented that they felt this was ready for
v3.10, and Ingo Molnar has said that he was OK with me pushing
directly to you. So, here goes :)
Davidlohr got the following test results from pgbench running on a
quad-core laptop:
| db_size | clients | tps-vanilla | tps-rwsem |
+---------+----------+----------------+--------------+
| 160 MB | 1 | 5803 | 6906 | + 19.0%
| 160 MB | 2 | 13092 | 15931 |
| 160 MB | 4 | 29412 | 33021 |
| 160 MB | 8 | 32448 | 34626 |
| 160 MB | 16 | 32758 | 33098 |
| 160 MB | 20 | 26940 | 31343 | + 16.3%
| 160 MB | 30 | 25147 | 28961 |
| 160 MB | 40 | 25484 | 26902 |
| 160 MB | 50 | 24528 | 25760 |
------------------------------------------------------
| 1.6 GB | 1 | 5733 | 7729 | + 34.8%
| 1.6 GB | 2 | 9411 | 19009 | + 101.9%
| 1.6 GB | 4 | 31818 | 33185 |
| 1.6 GB | 8 | 33700 | 34550 |
| 1.6 GB | 16 | 32751 | 33079 |
| 1.6 GB | 20 | 30919 | 31494 |
| 1.6 GB | 30 | 28540 | 28535 |
| 1.6 GB | 40 | 26380 | 27054 |
| 1.6 GB | 50 | 25241 | 25591 |
------------------------------------------------------
| 7.6 GB | 1 | 5779 | 6224 |
| 7.6 GB | 2 | 10897 | 13611 | + 24.9%
| 7.6 GB | 4 | 32683 | 33108 |
| 7.6 GB | 8 | 33968 | 34712 |
| 7.6 GB | 16 | 32287 | 32895 |
| 7.6 GB | 20 | 27770 | 31689 | + 14.1%
| 7.6 GB | 30 | 26739 | 29003 |
| 7.6 GB | 40 | 24901 | 26683 |
| 7.6 GB | 50 | 17115 | 25925 | + 51.5%
------------------------------------------------------
(Davidlohr also has one additional patch which further improves
throughput, though I will ask him to send it directly to you as I have
suggested some minor changes)."
* emailed patches from Michel Lespinasse <walken@google.com>:
rwsem: no need for explicit signed longs
x86 rwsem: avoid taking slow path when stealing write lock
rwsem: do not block readers at head of queue if other readers are active
rwsem: implement support for write lock stealing on the fastpath
rwsem: simplify __rwsem_do_wake
rwsem: skip initial trylock in rwsem_down_write_failed
rwsem: avoid taking wait_lock in rwsem_down_write_failed
rwsem: use cmpxchg for trying to steal write lock
rwsem: more agressive lock stealing in rwsem_down_write_failed
rwsem: simplify rwsem_down_write_failed
rwsem: simplify rwsem_down_read_failed
rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
rwsem: shorter spinlocked section in rwsem_down_failed_common()
rwsem: make the waiter type an enumeration rather than a bitmask
Linus Torvalds [Tue, 7 May 2013 15:42:20 +0000 (08:42 -0700)]
Merge branch 'slab/for-linus' of git://git./linux/kernel/git/penberg/linux
Pull slab changes from Pekka Enberg:
"The bulk of the changes are more slab unification from Christoph.
There's also few fixes from Aaron, Glauber, and Joonsoo thrown into
the mix."
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
mm, slab_common: Fix bootstrap creation of kmalloc caches
slab: Return NULL for oversized allocations
mm: slab: Verify the nodeid passed to ____cache_alloc_node
slub: tid must be retrieved from the percpu area of the current processor
slub: Do not dereference NULL pointer in node_match
slub: add 'likely' macro to inc_slabs_node()
slub: correct to calculate num of acquired objects in get_partial_node()
slub: correctly bootstrap boot caches
mm/sl[au]b: correct allocation type check in kmalloc_slab()
slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
slab: Handle ARCH_DMA_MINALIGN correctly
slab: Common definition for kmem_cache_node
slab: Rename list3/l3 to node
slab: Common Kmalloc cache determination
stat: Use size_t for sizes instead of unsigned
slab: Common function to create the kmalloc array
slab: Common definition for the array of kmalloc caches
slab: Common constants for kmalloc boundaries
slab: Rename nodelists to node
slab: Common name for the per node structures
...
Linus Torvalds [Tue, 7 May 2013 14:59:19 +0000 (07:59 -0700)]
Merge branch 'misc' of git://git./linux/kernel/git/mmarek/kbuild
Pull misc kbuild updates from Michal Marek:
"Non-critical kbuild changes:
- make coccicheck improvements, but no new semantic patches this time
- make rpm improvements
- make tar-pkg change to include the architecture in the filename.
This is a deliberate incompatibility, but nobody has complained so
far and it is useful if you build for different architectures. It
also matches what the deb-pkg and rpm-pkg targets produce.
- kbuild documentation fix"
* 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
rpm-pkg: Remove pointless set -e statements
rpm-pkg: Always regenerate the specfile
rpm-pkg: Do not write to the parent directory
rpm-pkg: Do not package the whole source directory
buildtar: Add ARCH to the archive name
Coccinelle: Fix patch output when coccicheck is used with M= and C=
Coccinelle: Add support to the SPFLAGS variable
Coccinelle: Cleanup the setting of the FLAGS and OPTIONS variables
Coccinelle: Restore coccicheck verbosity in ONLINE mode (C=1 or C=2)
scripts/package/Makefile: compare objtree with srctree instead of test KBUILD_OUTPUT
doc: change example to existing Makefile fragment
scripts/tags.sh: Add magic for OFFSET and DEFINE
Linus Torvalds [Tue, 7 May 2013 14:58:05 +0000 (07:58 -0700)]
Merge branch 'kconfig' of git://git./linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
- use pkg-config to detect curses libraries
- clean up the way curses headers are searched
- Some randconfig fixes, of which one had to be reverted
- KCONFIG_SEED for randconfig debugging
- memuconfig memory leak plugged
- menuconfig > breadcrumbs > navigation
- xconfig compilation fix
- Other minor fixes
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: fix lists definition for C++
Revert "kconfig: fix randomising choice entries in presence of KCONFIG_ALLCONFIG"
kconfig: implement KCONFIG_PROBABILITY for randconfig
kconfig: allow specifying the seed for randconfig
kconfig: fix randomising choice entries in presence of KCONFIG_ALLCONFIG
kconfig: do not override symbols already set
kconfig: fix randconfig tristate detection
kconfig/lxdialog: rationalise the include paths where to find {.n}curses{,w}.h
menuconfig: Add "breadcrumbs" navigation aid
menuconfig: Fix memory leak introduced by jump keys feature
merge_config.sh: Avoid creating unnessary source softlinks
kconfig: optionally use pkg-config to detect ncurses libs
menuconfig: optionally use pkg-config to detect ncurses libs
Linus Torvalds [Tue, 7 May 2013 14:56:26 +0000 (07:56 -0700)]
Merge branch 'kbuild' of git://git./linux/kernel/git/mmarek/kbuild
Pull kbuild changes from Michal Marek:
"Kbuild commits for v3.10-rc1:
- Fix make mrproper after mod/file2alias rework
- Fix ld-option Makefile function
- Rewrite headers_install to shell to drop Perl dependency.
There are some more patches I have to look at, so I might send another
pull request later. Or just queue them for 3.11."
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
Fix cleaning in scripts/mod
headers_install.pl: convert to headers_install.sh
kbuild: fix ld-option function
Li Zefan [Tue, 7 May 2013 13:56:54 +0000 (15:56 +0200)]
menuconfig: fix NULL pointer dereference when searching a symbol
Searching for PPC_EFIKA results in a segmentation fault, and it's
because get_symbol_prop() returns NULL.
In this case CONFIG_PPC_EFIKA is defined in arch/powerpc/platforms/
52xx/Kconfig, so it won't be parsed if ARCH!=PPC, but menuconfig knows
this symbol when it parses sound/soc/fsl/Kconfig:
config SND_MPC52xx_SOC_EFIKA
tristate "SoC AC97 Audio support for bbplan Efika and STAC9766"
depends on PPC_EFIKA
This bug was introduced by commit
bcdedcc1afd6 ("menuconfig: print more
info for symbol without prompts").
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Tested-by: Libo Chen <libo.chen@huawei.com>
Reviewed-by: "Yann E. MORIN" <yann.morin.1998@free.fr>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bruce Allan [Tue, 7 May 2013 05:52:47 +0000 (22:52 -0700)]
e1000e: fix scheduling while atomic bug
A scheduling while atomic bug was introduced recently (by commit
ce43a2168c59: "e1000e: cleanup USLEEP_RANGE checkpatch checks").
Revert the particular instance of usleep_range() which causes the bug.
Reported-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Tue, 7 May 2013 13:46:02 +0000 (06:46 -0700)]
rwsem: no need for explicit signed longs
Change explicit "signed long" declarations into plain "long" as suggested
by Peter Hurley.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:46:01 +0000 (06:46 -0700)]
x86 rwsem: avoid taking slow path when stealing write lock
modify __down_write[_nested] and __down_write_trylock to grab the write
lock whenever the active count is 0, even if there are queued waiters
(they must be writers pending wakeup, since the active count is 0).
Note that this is an optimization only; architectures without this
optimization will still work fine:
- __down_write() would take the slow path which would take the wait_lock
and then try stealing the lock (as in the spinlocked rwsem implementation)
- __down_write_trylock() would fail, but callers must be ready to deal
with that - since there are some writers pending wakeup, they could
have raced with us and obtained the lock before we steal it.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:46:00 +0000 (06:46 -0700)]
rwsem: do not block readers at head of queue if other readers are active
This change fixes a race condition where a reader might determine it
needs to block, but by the time it acquires the wait_lock the rwsem has
active readers and no queued waiters.
In this situation the reader can run in parallel with the existing
active readers; it does not need to block until the active readers
complete.
Thanks to Peter Hurley for noticing this possible race.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:59 +0000 (06:45 -0700)]
rwsem: implement support for write lock stealing on the fastpath
When we decide to wake up readers, we must first grant them as many read
locks as necessary, and then actually wake up all these readers. But in
order to know how many read shares to grant, we must first count the
readers at the head of the queue. This might take a while if there are
many readers, and we want to be protected against a writer stealing the
lock while we're counting. To that end, we grant the first reader lock
before counting how many more readers are queued.
We also require some adjustments to the wake_type semantics.
RWSEM_WAKE_NO_ACTIVE used to mean that we had found the count to be
RWSEM_WAITING_BIAS, in which case the rwsem was known to be free as
nobody could steal it while we hold the wait_lock. This doesn't make
sense once we implement fastpath write lock stealing, so we now use
RWSEM_WAKE_ANY in that case.
Similarly, when rwsem_down_write_failed found that a read lock was
active, it would use RWSEM_WAKE_READ_OWNED which signalled that new
readers could be woken without checking first that the rwsem was
available. We can't do that anymore since the existing readers might
release their read locks, and a writer could steal the lock before we
wake up additional readers. So, we have to use a new RWSEM_WAKE_READERS
value to indicate we only want to wake readers, but we don't currently
hold any read lock.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:58 +0000 (06:45 -0700)]
rwsem: simplify __rwsem_do_wake
This is mostly for cleanup value:
- We don't need several gotos to handle the case where the first
waiter is a writer. Two simple tests will do (and generate very
similar code).
- In the remainder of the function, we know the first waiter is a reader,
so we don't have to double check that. We can use do..while loops
to iterate over the readers to wake (generates slightly better code).
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:57 +0000 (06:45 -0700)]
rwsem: skip initial trylock in rwsem_down_write_failed
We can skip the initial trylock in rwsem_down_write_failed() if there
are known active lockers already, thus saving one likely-to-fail
cmpxchg.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:56 +0000 (06:45 -0700)]
rwsem: avoid taking wait_lock in rwsem_down_write_failed
In rwsem_down_write_failed(), if there are active locks after we wake up
(i.e. the lock got stolen from us), skip taking the wait_lock and go
back to sleep immediately.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:55 +0000 (06:45 -0700)]
rwsem: use cmpxchg for trying to steal write lock
Using rwsem_atomic_update to try stealing the write lock forced us to
undo the adjustment in the failure path. We can have simpler and faster
code by using cmpxchg instead.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:54 +0000 (06:45 -0700)]
rwsem: more agressive lock stealing in rwsem_down_write_failed
Some small code simplifications can be achieved by doing more agressive
lock stealing:
- When rwsem_down_write_failed() notices that there are no active locks
(and thus no thread to wake us if we decided to sleep), it used to wake
the first queued process. However, stealing the lock is also sufficient
to deal with this case, so we don't need this check anymore.
- In try_get_writer_sem(), we can steal the lock even when the first waiter
is a reader. This is correct because the code path that wakes readers is
protected by the wait_lock. As to the performance effects of this change,
they are expected to be minimal: readers are still granted the lock
(rather than having to acquire it themselves) when they reach the front
of the wait queue, so we have essentially the same behavior as in
rwsem-spinlock.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:53 +0000 (06:45 -0700)]
rwsem: simplify rwsem_down_write_failed
When waking writers, we never grant them the lock - instead, they have
to acquire it themselves when they run, and remove themselves from the
wait_list when they succeed.
As a result, we can do a few simplifications in rwsem_down_write_failed():
- We don't need to check for !waiter.task since __rwsem_do_wake() doesn't
remove writers from the wait_list
- There is no point releaseing the wait_lock before entering the wait loop,
as we will need to reacquire it immediately. We can change the loop so
that the lock is always held at the start of each loop iteration.
- We don't need to get a reference on the task structure, since the task
is responsible for removing itself from the wait_list. There is no risk,
like in the rwsem_down_read_failed() case, that a task would wake up and
exit (thus destroying its task structure) while __rwsem_do_wake() is
still running - wait_lock protects against that.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:52 +0000 (06:45 -0700)]
rwsem: simplify rwsem_down_read_failed
When trying to acquire a read lock, the RWSEM_ACTIVE_READ_BIAS
adjustment doesn't cause other readers to block, so we never have to
worry about waking them back after canceling this adjustment in
rwsem_down_read_failed().
We also never want to steal the lock in rwsem_down_read_failed(), so we
don't have to grab the wait_lock either.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:51 +0000 (06:45 -0700)]
rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
Remove the rwsem_down_failed_common function and replace it with two
identical copies of its code in rwsem_down_{read,write}_failed.
This is because we want to make different optimizations in
rwsem_down_{read,write}_failed; we are adding this pure-duplication
step as a separate commit in order to make it easier to check the
following steps.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Tue, 7 May 2013 13:45:50 +0000 (06:45 -0700)]
rwsem: shorter spinlocked section in rwsem_down_failed_common()
This change reduces the size of the spinlocked and TASK_UNINTERRUPTIBLE
sections in rwsem_down_failed_common():
- We only need the sem->wait_lock to insert ourselves on the wait_list;
the waiter node can be prepared outside of the wait_lock.
- The task state only needs to be set to TASK_UNINTERRUPTIBLE immediately
before checking if we actually need to sleep; it doesn't need to protect
the entire function.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>