GitHub/mt8127/android_kernel_alcatel_ttab.git
13 years agopnfs-obj: pg_test check for max_io_size
Boaz Harrosh [Wed, 25 May 2011 18:25:29 +0000 (21:25 +0300)]
pnfs-obj: pg_test check for max_io_size

Implement pg_test vector to test for max IO sizes. We calculate
a max_io_size member only once, and cache it in lseg so to not
do so on every page insert.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[simplify logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: define nfs_generic_pg_test
Boaz Harrosh [Sun, 29 May 2011 08:45:39 +0000 (11:45 +0300)]
NFSv4.1: define nfs_generic_pg_test

By default, unless pnfs is used coalesce pages until pg_bsize
(rsize or wsize) is reached.

pnfs layout drivers define their own pg_test methods that use
pnfs_generic_pg_test and need to define their own I/O size
limits (e.g. based on the file stripe size).

[Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: use pnfs_generic_pg_test directly by layout driver
Benny Halevy [Wed, 25 May 2011 17:54:40 +0000 (20:54 +0300)]
NFSv4.1: use pnfs_generic_pg_test directly by layout driver

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: change pg_test return type to bool
Benny Halevy [Wed, 25 May 2011 18:03:56 +0000 (21:03 +0300)]
NFSv4.1: change pg_test return type to bool

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: unify pnfs_pageio_init functions
Benny Halevy [Wed, 25 May 2011 17:25:22 +0000 (20:25 +0300)]
NFSv4.1: unify pnfs_pageio_init functions

Use common code for pnfs_pageio_init_{read,write} and use
a common generic pg_test function.

Note that this function always assumes the the layout driver's
pg_test method is implemented.

[Fix BUG]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: objlayout_encode_layoutcommit implementation
Boaz Harrosh [Sun, 22 May 2011 16:54:13 +0000 (19:54 +0300)]
pnfs-obj: objlayout_encode_layoutcommit implementation

* Define API for io-engines to report delta_space_used in IOs
* Encode the osd-layout specific information of the layoutcommit
  XDR buffer.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: encode_layoutcommit
Benny Halevy [Sun, 22 May 2011 16:53:48 +0000 (19:53 +0300)]
pnfs: encode_layoutcommit

Add a layout driver method to encode the layout type specific
opaque part of layout commit in-line in the xdr stream.

Currently, the pnfs-objects layout driver uses it to encode metadata hints
to the MDS and the blocks layout driver to commit provisionally allocated
extents to the file.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: report errors and .encode_layoutreturn Implementation.
Boaz Harrosh [Thu, 26 May 2011 18:49:46 +0000 (21:49 +0300)]
pnfs-obj: report errors and .encode_layoutreturn Implementation.

An io_state pre-allocates an error information structure for each
possible osd-device that might error during IO. When IO is done if all
was well the io_state is freed. (as today). If the I/O has ended with an
error, the io_state is queued on a per-layout err_list. When eventually
encode_layoutreturn() is called, each error is properly encoded on the
XDR buffer and only then the io_state is removed from err_list and
de-allocated.

It is up to the io_engine to fill in the segment that fault and the type
of osd_error that occurred. By calling objlayout_io_set_result() for
each failing device.

In objio_osd:
* Allocate io-error descriptors space as part of io_state
* Use generic objlayout error reporting at end of io.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: encode_layoutreturn
Andy Adamson [Sun, 22 May 2011 16:53:10 +0000 (19:53 +0300)]
pnfs: encode_layoutreturn

Add a layout driver method to encode the layout type specific
opaque part of layout return in-line in the xdr stream.

Currently the pnfs-objects layout driver uses it to encode i/o error
information on LAYOUTRETURN.

Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for encode_layoutreturn]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: layoutret_on_setattr
Benny Halevy [Wed, 14 Jul 2010 19:43:57 +0000 (15:43 -0400)]
pnfs: layoutret_on_setattr

With the objects layout security model, we have object capabilities
that are associated with the layout and we anticipate that the server
will issue a cb_layoutrecall for any setattr that changes security
related attributes (user/group/mode/acl) or truncates the file.

Therefore, the layout is returned before issuing the setattr to avoid
the anticipated cb_layoutrecall.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: layoutreturn
Benny Halevy [Sun, 22 May 2011 16:52:37 +0000 (19:52 +0300)]
pnfs: layoutreturn

NFSv4.1 LAYOUTRETURN implementation

Currently, does not support layout-type payload encoding.

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed recaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: osd raid engine read/write implementation
Boaz Harrosh [Sun, 22 May 2011 16:52:19 +0000 (19:52 +0300)]
pnfs-obj: osd raid engine read/write implementation

With the use of the in-kernel osd library. Implement read/write
of data from/to osd-objects according to information specified
in the objects-layout.

Support for stripping over mirrors with a received stripe_unit.
There are however a few constrains which are not supported:
 1. Stripe Unit must be a multiple of PAGE_SIZE
 2. stripe length (stripe_unit * number_of_stripes) can not be
    bigger then 32bit.

Also support raid-groups and partial-layout. Partial-layout is
when not all the groups are received on the line, addressing
only a partial range of the file.

TODO:
  Only raid0! raid 4/5/6 support will come at later stage

A none supported layout will send IO through the MDS

[Important fallout from the last rebase]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: support for non-rpc layout drivers
Benny Halevy [Sun, 22 May 2011 16:52:03 +0000 (19:52 +0300)]
pnfs: support for non-rpc layout drivers

Non-rpc layout driver such as for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.

[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: define per-inode private structure
Benny Halevy [Sun, 22 May 2011 16:51:48 +0000 (19:51 +0300)]
pnfs-obj: define per-inode private structure

allocate and deallocate per-inode private pnfs_layout_hdr
in preparation for I/O implementation.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: alloc and free layout_hdr layoutdriver methods
Benny Halevy [Sun, 22 May 2011 16:51:33 +0000 (19:51 +0300)]
pnfs: alloc and free layout_hdr layoutdriver methods

[gfp_flags]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: objio_osd device information retrieval and caching
Boaz Harrosh [Thu, 26 May 2011 18:45:34 +0000 (21:45 +0300)]
pnfs-obj: objio_osd device information retrieval and caching

When a new layout is received in objio_alloc_lseg all device_ids
referenced are retrieved. The device information is queried for from MDS
and then the osd_device is looked-up from the osd-initiator library. The
devices are cached in a per-mount-point list, for later use. At unmount
all devices are "put" back to the library.

objlayout_get_deviceinfo(), objlayout_put_deviceinfo() middleware
API for retrieving device information given a device_id.

TODO: The device cache can get big. Cap its size. Keep an LRU and start
      to return devices which were not used, when list gets to big, or
      when new entries allocation fail.

[pnfs-obj: Bugs in new global-device-cache code]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[use global device cache]
[use layout driver in global device cache]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: decode layout, alloc/free lseg
Boaz Harrosh [Sun, 22 May 2011 16:50:20 +0000 (19:50 +0300)]
pnfs-obj: decode layout, alloc/free lseg

objlayout_alloc_lseg prepares an xdr_stream and calls the
raid engins objio_alloc_lseg() to allocate a private
pnfs_layout_segment.

objio_osd.c::objio_alloc_lseg() uses passed xdr_stream to
decode and store the layout_segment information in an
objio_segment struct, using the pnfs_osd_xdr.h API for
the actual parsing the layout xdr.

objlayout_free_lseg calls objio_free_lseg() to free the
allocated space.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[removed "extern" from function definitions]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: pnfs_osd XDR client implementation
Boaz Harrosh [Sun, 22 May 2011 16:49:57 +0000 (19:49 +0300)]
pnfs-obj: pnfs_osd XDR client implementation

* Add the fs/nfs/objlayout/pnfs_osd_xdr_cli.c file, which will
  include the XDR encode/decode implementations for the pNFS
  client objlayout driver.

[Wrong type in comments]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: pnfs_osd XDR definitions
Benny Halevy [Sun, 22 May 2011 16:49:32 +0000 (19:49 +0300)]
pnfs-obj: pnfs_osd XDR definitions

* Add the pnfs_osd_xdr.h header

* defintions the pnfs_osd_layout structure including all it's
  sub-types and constants.
* Declare the pnfs_osd_xdr_decode_layout API + all needed
  inline helpers.

* Define the pnfs_osd_deviceaddr structure and all its subtypes and
  constants.
* Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.

* Define the pnfs_osd_ioerr structure, its substructures and constants.
* Declare API for encoding of a pnfs_osd_ioerr into XDR stream.

* Define the pnfs_osd_layoutupdate structure and its substructures.
* Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.

[Remove server definitions]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs-obj: objlayoutdriver module skeleton
Benny Halevy [Sun, 22 May 2011 16:49:06 +0000 (19:49 +0300)]
pnfs-obj: objlayoutdriver module skeleton

* Define the PNFS_OBJLAYOUT Kconfig option in the nfs
  master Kconfig file.
* Add the objlayout driver to the Kernel's Kbuild system.
* Add the fs/nfs/objlayout/Kbuild file for building the
  objlayoutdriver.ko driver
* Define fs/nfs/objlayout/objio_osd.c, register the driver on module
  initialization and unregister on exit.

[pnfs-obj: remove of CONFIG_PNFS fallout]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[added "unsure" clause]
[depend on NFS_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: client stats
J. Bruce Fields [Sun, 22 May 2011 16:48:21 +0000 (19:48 +0300)]
pnfs: client stats

A pNFS client auto-negotiates a lot of features (minorversion level,
pNFS layout type, etc.).  This is convenient, but makes certain kinds of
failures hard for a user to detect.

For example, if the client falls back on 4.0, or falls back to MDS IO
because the user didn't connect to the right iscsi disks before
mounting, the only symptoms may be reduced performance, which may not be
noticed till long after the actual failure, and may be difficult for a
user to diagnose.

However, such "failures" may also be perfectly normal in some cases, so
we don't want to spam the system logs with them.

One approach would be to put some more information into
/proc/self/mountstats.

Signed-off-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: add commit client stats]
[fixup data types for "ret" variables in pnfs_try_to* inline funcs.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[fix definition of show_pnfs for !CONFIG_PNFS]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Fix show_sessions in the not CONFIG_NFS_V4_1 case]
    There is a build error when CONFIG_NFS_V4 is set but
    CONFIG_NFS_V4_1 is *not* set. show_sessions() prototype
    was unbalanced between the two cases.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfs: super.c remove CONFIG_PNFS]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: Use byte-range for cb_layoutrecall
Benny Halevy [Sun, 22 May 2011 16:48:02 +0000 (19:48 +0300)]
pnfs: Use byte-range for cb_layoutrecall

Use recalled range to invalidate particular layout segments in the layout cache.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: align layoutget requests on page boundaries
Benny Halevy [Sun, 22 May 2011 16:47:46 +0000 (19:47 +0300)]
pnfs: align layoutget requests on page boundaries

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: Use byte-range for layoutget
Benny Halevy [Sun, 22 May 2011 16:47:26 +0000 (19:47 +0300)]
pnfs: Use byte-range for layoutget

Add offset and count parameters to pnfs_update_layout and use them to get
the layout in the pageio path.

Order cache layout segments in the following order:
* offset (ascending)
* length (descending)
* iomode (RW before READ)

Test byte range against the layout segment in use in pnfs_{read,write}_pg_test
so not to coalesce pages not using the same layout segment.

[fix lseg ordering]
[clean up pnfs_find_lseg lseg arg]
[remove unnecessary FIXME]
[fix ordering in pnfs_insert_layout]
[clean up pnfs_insert_layout]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoSUNRPC: introduce xdr_init_decode_pages
Benny Halevy [Thu, 19 May 2011 18:16:47 +0000 (14:16 -0400)]
SUNRPC: introduce xdr_init_decode_pages

Initialize xdr_stream and xdr_buf using an array of page pointers
and length of buffer.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: use layout driver in global device cache
Benny Halevy [Tue, 24 May 2011 15:04:02 +0000 (18:04 +0300)]
NFSv4.1: use layout driver in global device cache

pnfs deviceids are unique per server, per layout type.
struct nfs_client is currently used to distinguish deviceids from
different nfs servers, yet these may clash between different layout
types on the same server.  Therefore, use the layout driver associated
with each deviceid at insertion time to look it up, unhash, or
delete it.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: CB_NOTIFY_DEVICEID
Marc Eshel [Sun, 22 May 2011 16:47:09 +0000 (19:47 +0300)]
pnfs: CB_NOTIFY_DEVICEID

Note: This functionlaity is incomplete as all layout segments referring to
the 'to be removed device id' need to be reaped, and all in flight I/O drained.

[use be32 res in nfs4_callback_devicenotify]
[use nfs_client to qualify deviceid for cb_notify_deviceid]
[use global deviceid cache for CB_NOTIFY_DEVICEID]
[refactor device cache _lookup_deviceid]
[refactor device cache _find_get_deviceid]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Bug in new global-device-cache code]
[layout_driver MUST set free_deviceid_node if using dev-cache]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoeCryptfs: Cleanup inode initialization code
Tyler Hicks [Tue, 24 May 2011 07:16:51 +0000 (02:16 -0500)]
eCryptfs: Cleanup inode initialization code

The eCryptfs inode get, initialization, and dentry interposition code
has two separate paths. One is for when dentry interposition is needed
after doing things like a mkdir in the lower filesystem and the other
is needed after a lookup. Unlocking new inodes and doing a d_add() needs
to happen at different times, depending on which type of dentry
interposing is being done.

This patch cleans up the inode get and initialization code paths and
splits them up so that the locking and d_add() differences mentioned
above can be handled appropriately in a later patch.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Tested-by: David <david@unsolicited.net>
13 years agoNFSv4.1: purge deviceid cache on nfs_free_client
Benny Halevy [Fri, 20 May 2011 11:47:33 +0000 (13:47 +0200)]
NFSv4.1: purge deviceid cache on nfs_free_client

Use the pnfs_layoutdriver_type both as a qualifier for the deviceid,
distinguishing deviceid from different layout types on the server,
and for freeing the layout-driver allocated structure containing the
nfs4_deviceid_node.

[BUG in _deviceid_purge_client]
[layout_driver MUST set free_deviceid_node if using dev-cache]
[let ver < 4.1 compile]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[removed EXPORT_SYMBOL_GPL(nfs4_deviceid_purge_client)]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoeCryptfs: Consolidate inode functions into inode.c
Tyler Hicks [Tue, 24 May 2011 02:18:20 +0000 (21:18 -0500)]
eCryptfs: Consolidate inode functions into inode.c

These functions should live in inode.c since their focus is on inodes
and they're primarily used by functions in inode.c.

Also does a simple cleanup of ecryptfs_inode_test() and rolls
ecryptfs_init_inode() into ecryptfs_inode_set().

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Tested-by: David <david@unsolicited.net>
13 years agomm, rmap: Add yet more comments to page_get_anon_vma/page_lock_anon_vma
Peter Zijlstra [Sun, 29 May 2011 08:33:44 +0000 (10:33 +0200)]
mm, rmap: Add yet more comments to page_get_anon_vma/page_lock_anon_vma

Inspired by an analysis from Hugh on why again all this doesn't explode
in our face.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agodm kcopyd: return client directly and not through a pointer
Mikulas Patocka [Sun, 29 May 2011 12:03:13 +0000 (13:03 +0100)]
dm kcopyd: return client directly and not through a pointer

Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: reserve fewer pages
Mikulas Patocka [Sun, 29 May 2011 12:03:11 +0000 (13:03 +0100)]
dm kcopyd: reserve fewer pages

Reserve just the minimum of pages needed to process one job.

Because we allocate pages from page allocator, we don't need to reserve
a large number of pages.  The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm io: use fixed initial mempool size
Mikulas Patocka [Sun, 29 May 2011 12:03:09 +0000 (13:03 +0100)]
dm io: use fixed initial mempool size

Replace the arbitrary calculation of an initial io struct mempool size
with a constant.

The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4.  This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.

Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures.  One structure is enough to
process the whole request.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: alloc pages from the main page allocator
Mikulas Patocka [Sun, 29 May 2011 12:03:07 +0000 (13:03 +0100)]
dm kcopyd: alloc pages from the main page allocator

This patch changes dm-kcopyd so that it allocates pages from the main
page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
fail in case of memory pressure). If the allocation fails, dm-kcopyd
allocates pages from its own reserve.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: add gfp parm to alloc_pl
Mikulas Patocka [Sun, 29 May 2011 12:03:04 +0000 (13:03 +0100)]
dm kcopyd: add gfp parm to alloc_pl

Introduce a parameter for gfp flags to alloc_pl() for use in following
patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: remove superfluous page allocation spinlock
Mikulas Patocka [Sun, 29 May 2011 12:03:02 +0000 (13:03 +0100)]
dm kcopyd: remove superfluous page allocation spinlock

Remove the spinlock protecting the pages allocation.  The spinlock is only
taken on initialization or from single-threaded workqueue.  Therefore, the
spinlock is useless.

The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.

kcopyd_get_pages is only called from run_pages_job, which is only
called from process_jobs called from do_work.

kcopyd_put_pages is called from client_alloc_pages (which is initialization
function) or from run_complete_job. run_complete_job is only called from
process_jobs called from do_work.

Another spinlock, kc->job_lock is taken each time someone pushes or pops
some work for the worker thread.  Once we take kc->job_lock, we
guarantee that any written memory is visible to the other CPUs.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: preallocate sub jobs to avoid deadlock
Mikulas Patocka [Sun, 29 May 2011 12:03:00 +0000 (13:03 +0100)]
dm kcopyd: preallocate sub jobs to avoid deadlock

There's a possible theoretical deadlock in dm-kcopyd because multiple
allocations from the same mempool are required to finish a request.
Avoid this by preallocating sub jobs.

There is a mempool of 512 entries. Each request requires up to 9
entries from the mempool. If we have at least 57 concurrent requests
running, the mempool may overflow and mempool allocations may start
blocking until another entry is freed to the mempool. Because the same
thread is used to free entries to the mempool and allocate entries from
the mempool, this may result in a deadlock.

This patch changes it so that one mempool entry contains all 9 "struct
kcopyd_job" required to fulfill the whole request. The allocation is
done only once in dm_kcopyd_copy and no further mempool allocations are
done during request processing.

If dm_kcopyd_copy is not run in the completion thread, this
implementation is deadlock-free.

MIN_JOBS needs reducing accordingly and we've chosen to reduce it
further to 8.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm kcopyd: avoid pointless job splitting
Mikulas Patocka [Sun, 29 May 2011 12:02:58 +0000 (13:02 +0100)]
dm kcopyd: avoid pointless job splitting

Don't split SUB_JOB_SIZE jobs

If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
Splitting it just unnecessarily wastes time, because the split job size
is SUB_JOB_SIZE too.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm mpath: do not fail paths after integrity errors
Martin K. Petersen [Sun, 29 May 2011 12:02:55 +0000 (13:02 +0100)]
dm mpath: do not fail paths after integrity errors

Integrity errors need to be passed to the owner of the integrity
metadata for processing. Consequently EILSEQ should be passed up the
stack.

Cc: stable@kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm table: reject devices without request fns
Milan Broz [Sun, 29 May 2011 12:02:52 +0000 (13:02 +0100)]
dm table: reject devices without request fns

This patch adds a check that a block device has a request function
defined before it is used.  Otherwise, misconfiguration can cause an oops.

Because we are allowing devices with zero size e.g. an offline multipath
device as in commit 2cd54d9bedb79a97f014e86c0da393416b264eb3
("dm: allow offline devices") there needs to be an additional check
to ensure devices are initialised.  Some block devices, like a loop
device without a backing file, exist but have no request function.

Reproducer is trivial: dm-mirror on unbound loop device
(no backing file on loop devices)

dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"

and mirror resync will immediatelly cause OOps.

BUG: unable to handle kernel NULL pointer dereference at   (null)
 ? generic_make_request+0x2bd/0x590
 ? kmem_cache_alloc+0xad/0x190
 submit_bio+0x53/0xe0
 ? bio_add_page+0x3b/0x50
 dispatch_io+0x1ca/0x210 [dm_mod]
 ? read_callback+0x0/0xd0 [dm_mirror]
 dm_io+0xbb/0x290 [dm_mod]
 do_mirror+0x1e0/0x748 [dm_mirror]

Signed-off-by: Milan Broz <mbroz@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years agodm table: allow targets to support discards internally
Mike Snitzer [Sun, 29 May 2011 11:52:55 +0000 (12:52 +0100)]
dm table: allow targets to support discards internally

Permit a target to support discards regardless of whether or not all its
underlying devices do.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
13 years ago[S390] mm: fix mmu_gather rework
Heiko Carstens [Sun, 29 May 2011 10:40:51 +0000 (12:40 +0200)]
[S390] mm: fix mmu_gather rework

Quite a few functions that get called from the tlb gather code require that
preemption must be disabled. So disable preemption inside of the called
functions instead.
The only drawback is that rcu_table_freelist_finish() doesn't get necessarily
called on the cpu(s) that filled the free lists. So we may see a delay, until
we finally see an rcu callback. However over time this shouldn't matter.

So we get rid of lots of "BUG: using smp_processor_id() in preemptible"
messages.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
13 years ago[S390] mm: fix storage key handling
Heiko Carstens [Sun, 29 May 2011 10:40:50 +0000 (12:40 +0200)]
[S390] mm: fix storage key handling

page_get_storage_key() and page_set_storage_key() expect a page address
and not its page frame number. This got inconsistent with 2d42552d
"[S390] merge page_test_dirty and page_clear_dirty".

Result is that we read/write storage keys from random pages and do not
have a working dirty bit tracking at all.
E.g. SetPageUpdate() doesn't clear the dirty bit of requested pages, which
for example ext4 doesn't like very much and panics after a while.

Unable to handle kernel paging request at virtual user address (null)
Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 Not tainted 2.6.39-07551-g139f37f-dirty #152
Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
           0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
           0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
           0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
Krnl Code: 0000000000316b04f0a0000407f4       srp     4(11,%r0),2036,0
           0000000000316b0ab9020022           ltgr    %r2,%r2
           0000000000316b0ea7740015           brc     7,316b38
          >0000000000316b12e3d0c0000024       stg     %r13,0(%r12)
           0000000000316b184120c010           la      %r2,16(%r12)
           0000000000316b1c4130d060           la      %r3,96(%r13)
           0000000000316b20e340d0600004       lg      %r4,96(%r13)
           0000000000316b26c0e50002b567       brasl   %r14,36d5f4
Call Trace:
([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
 [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
 [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
 [<00000000002597e8>] writeback_single_inode+0xf8/0x268
 [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
 [<000000000025a700>] writeback_inodes_wb+0x80/0x168
 [<000000000025aa92>] wb_writeback+0x2aa/0x324
 [<000000000025abde>] wb_do_writeback+0xd2/0x274
 [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
 [<00000000001737be>] kthread+0xa6/0xb0
 [<000000000056c1da>] kernel_thread_starter+0x6/0xc
 [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Last Breaking-Event-Address:
 [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138

Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
13 years agoNFSv4.1: make deviceid cache global
Benny Halevy [Fri, 20 May 2011 02:14:47 +0000 (22:14 -0400)]
NFSv4.1: make deviceid cache global

Move deviceid cache from the pnfs files layout driver to the
generic layer in preparation for the objects layout driver.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agopnfs: resolve header dependency in pnfs.h
Benny Halevy [Thu, 5 May 2011 05:28:46 +0000 (08:28 +0300)]
pnfs: resolve header dependency in pnfs.h

Some definitions in the header file depend on nfs_fs.h so pnfs.h can't
be included independently.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: use struct nfs_client to qualify deviceid
Benny Halevy [Fri, 20 May 2011 08:45:05 +0000 (10:45 +0200)]
NFSv4.1: use struct nfs_client to qualify deviceid

deviceids are unique per server, per layout type.
Therefore, in the global cache in the files layout driver
deviceids from different servers may clash so we need
to qualify them with a struct nfs_client that represents
the nfs server that returned the deviceid.

Introduced in 2.6.39 commit ea8eecdd
"NFSv4.1 move deviceid cache to filelayout driver"

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoNFSv4.1: fix typo in filelayout_check_layout
Jim Rees [Wed, 23 Feb 2011 00:31:57 +0000 (19:31 -0500)]
NFSv4.1: fix typo in filelayout_check_layout

Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
13 years agoSquashfs: Fix sanity check patches on big-endian systems
Phillip Lougher [Sat, 28 May 2011 23:38:46 +0000 (00:38 +0100)]
Squashfs: Fix sanity check patches on big-endian systems

le64 values should be swapped when accessing on
big-endian systems.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
13 years agoMerge branch 'ec-cleanup' into release
Len Brown [Sun, 29 May 2011 08:40:39 +0000 (04:40 -0400)]
Merge branch 'ec-cleanup' into release

Conflicts:
drivers/platform/x86/compal-laptop.c

13 years agoMerge branches 'acpica', 'aml-custom', 'bugzilla-16548', 'bugzilla-20242', 'd3-cold...
Len Brown [Sun, 29 May 2011 08:38:48 +0000 (04:38 -0400)]
Merge branches 'acpica', 'aml-custom', 'bugzilla-16548', 'bugzilla-20242', 'd3-cold', 'ec-asus' and 'thermal-fix' into release

13 years agox86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
Len Brown [Fri, 1 Apr 2011 19:46:09 +0000 (15:46 -0400)]
x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param

mwait_idle() is a C1-only idle loop intended to be more efficient
than HLT on SMP hardware that supports it.

But mwait_idle() has been replaced by the more general
mwait_idle_with_hints(), which handles both C1 and deeper C-states.
ACPI uses only mwait_idle_with_hints(), and never uses mwait_idle().

Deprecate mwait_idle() and the "idle=mwait" cmdline param
to simplify the x86 idle code.

After this change, kernels configured with
(!CONFIG_ACPI=n && !CONFIG_INTEL_IDLE=n) when run on hardware
that support MWAIT will simply use HLT.  If MWAIT is desired
on those systems, cpuidle and the cpuidle drivers above
can be used.

cc: x86@kernel.org
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agox86 idle: deprecate "no-hlt" cmdline param
Len Brown [Fri, 1 Apr 2011 19:41:17 +0000 (15:41 -0400)]
x86 idle: deprecate "no-hlt" cmdline param

We'd rather that modern machines not check if HLT works on
every entry into idle, for the benefit of machines that had
marginal electricals 15-years ago.  If those machines are still running
the upstream kernel, they can use "idle=poll".  The only difference
will be that they'll now invoke HLT in machine_hlt().

cc: x86@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agox86 idle APM: deprecate CONFIG_APM_CPU_IDLE
Len Brown [Fri, 1 Apr 2011 19:19:23 +0000 (15:19 -0400)]
x86 idle APM: deprecate CONFIG_APM_CPU_IDLE

We don't want to export the pm_idle function pointer to modules.
Currently CONFIG_APM_CPU_IDLE w/ CONFIG_APM_MODULE forces us to.

CONFIG_APM_CPU_IDLE is of dubious value, it runs only on 32-bit
uniprocessor laptops that are over 10 years old.  It calls into
the BIOS during idle, and is known to cause a number of machines
to fail.

Removing CONFIG_APM_CPU_IDLE and will allow us to stop exporting
pm_idle.  Any systems that were calling into the APM BIOS
at run-time will simply use HLT instead.

cc: x86@kernel.org
cc: Jiri Kosina <jkosina@suse.cz>
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agox86 idle floppy: deprecate disable_hlt()
Len Brown [Fri, 1 Apr 2011 19:08:48 +0000 (15:08 -0400)]
x86 idle floppy: deprecate disable_hlt()

Plan to remove floppy_disable_hlt in 2012, an ancient
workaround with comments that it should be removed.

This allows us to remove clutter and a run-time branch
from the idle code.

WARN_ONCE() on invocation until it is removed.

cc: x86@kernel.org
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agox86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
Len Brown [Fri, 1 Apr 2011 19:28:09 +0000 (15:28 -0400)]
x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it

In the long run, we don't want default_idle() or (pm_idle)() to
be exported outside of process.c.  Start by not exporting them
to modules, unless the APM build demands it.

cc: x86@kernel.org
cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agox86 idle: clarify AMD erratum 400 workaround
Len Brown [Fri, 1 Apr 2011 20:59:53 +0000 (16:59 -0400)]
x86 idle: clarify AMD erratum 400 workaround

The workaround for AMD erratum 400 uses the term "c1e" falsely suggesting:
1. Intel C1E is somehow involved
2. All AMD processors with C1E are involved

Use the string "amd_c1e" instead of simply "c1e" to clarify that
this workaround is specific to AMD's version of C1E.
Use the string "e400" to clarify that the workaround is specific
to AMD processors with Erratum 400.

This patch is text-substitution only, with no functional change.

cc: x86@kernel.org
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoACPI EC: remove redundant code
Zhang Rui [Tue, 26 Apr 2011 08:29:55 +0000 (16:29 +0800)]
ACPI EC: remove redundant code

ec->handle is set in ec_parse_device(), so don't bother to set it again.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoACPI: Add D3 cold state
Lin Ming [Wed, 4 May 2011 14:56:43 +0000 (22:56 +0800)]
ACPI: Add D3 cold state

_SxW returns an Integer containing the lowest D-state supported in state
Sx. If OSPM has not indicated that it supports _PR3, then the value “3”
corresponds to D3.  If it has indicated _PR3 support, the value “3”
represents D3hot and the value “4” represents D3cold.

Linux does set _OSC._PR3, so we should fix it to expect that _SxW can
return 4.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoACPI: processor: fix processor_physically_present in UP kernel
Lin Ming [Mon, 16 May 2011 01:11:00 +0000 (09:11 +0800)]
ACPI: processor: fix processor_physically_present in UP kernel

Usually, there are multiple processors defined in ACPI table, for
example

    Scope (_PR)
    {
        Processor (CPU0, 0x00, 0x00000410, 0x06) {}
        Processor (CPU1, 0x01, 0x00000410, 0x06) {}
        Processor (CPU2, 0x02, 0x00000410, 0x06) {}
        Processor (CPU3, 0x03, 0x00000410, 0x06) {}
    }

processor_physically_present(...) will be called to check whether those
processors are physically present.

Currently we have below codes in processor_physically_present,

cpuid = acpi_get_cpuid(...);
if ((cpuid == -1) && (num_possible_cpus() > 1))
        return false;
return true;

In UP kernel, acpi_get_cpuid(...) always return -1 and
num_possible_cpus() always return 1, so
processor_physically_present(...) always returns true for all passed in
processor handles.

This is wrong for UP processor or SMP processor running UP kernel.

This patch removes the !SMP version of acpi_get_cpuid(), so both UP and
SMP kernel use the same acpi_get_cpuid function.

And for UP kernel, only processor 0 is valid.

https://bugzilla.kernel.org/show_bug.cgi?id=16548
https://bugzilla.kernel.org/show_bug.cgi?id=16357

Tested-by: Anton Kochkov <anton.kochkov@gmail.com>
Tested-by: Ambroz Bizjak <ambrop7@gmail.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier...
Linus Torvalds [Sun, 29 May 2011 06:12:28 +0000 (23:12 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/vapier/blackfin

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin:
  Blackfin: debug-mmrs: include RSI_PID[4567] MMRs
  Blackfin: bf51x: fix up RSI_PID# MMR defines
  Blackfin: bf52x/bf54x: fix up usb MMR defines
  Blackfin: debug-mmrs: fix typos with gptimers/mdma/ppi
  Blackfin: gptimers: add structure for hardware register layout
  Blackfin: wire up new sendmmsg syscall
  Blackfin: mach/bfin_serial_5xx.h: punt now-unused header
  Blackfin: bfin_serial.h: turn default port wrappers into stubs

13 years agoscsi: fix scsi_proc new kernel-doc warning
Randy Dunlap [Sun, 29 May 2011 00:26:40 +0000 (17:26 -0700)]
scsi: fix scsi_proc new kernel-doc warning

Fix kernel-doc warnings in scsi_proc.c:

  Warning(drivers/scsi/scsi_proc.c:390): No description found for parameter 'dev'
  Warning(drivers/scsi/scsi_proc.c:390): No description found for parameter 'data'
  Warning(drivers/scsi/scsi_proc.c:390): Excess function parameter 's' description in 'always_match'
  Warning(drivers/scsi/scsi_proc.c:390): Excess function parameter 'p' description in 'always_match'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoACPI: Split out custom_method functionality into an own driver
Thomas Renninger [Thu, 26 May 2011 10:26:24 +0000 (12:26 +0200)]
ACPI: Split out custom_method functionality into an own driver

With /sys/kernel/debug/acpi/custom_method root can write
to arbitrary memory and increase his priveleges, even if
these are restricted.

-> Make this an own debug .config option and warn about the
security issue in the config description.

-> Still keep acpi/debugfs.c which now only creates an empty
   /sys/kernel/debug/acpi directory. There might be other
   users of it later.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: rui.zhang@intel.com
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoACPI: Cleanup custom_method debug stuff
Thomas Renninger [Thu, 26 May 2011 10:26:23 +0000 (12:26 +0200)]
ACPI: Cleanup custom_method debug stuff

- Move param aml_debug_output to other params into sysfs.c
- Split acpi_debugfs_init to prepare custom_method to be
  an own .config option and driver.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: rui.zhang@intel.com
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoACPI EC: enable MSI workaround for Quanta laptops
Zhang Rui [Tue, 26 Apr 2011 08:30:02 +0000 (16:30 +0800)]
ACPI EC: enable MSI workaround for Quanta laptops

Enable MSI workaround for Quanta laptops.
https://bugzilla.kernel.org/show_bug.cgi?id=20242

Tested-by: Jan-Matthias Braun <jan_braun@gmx.net>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agoidle governor: Avoid lock acquisition to read pm_qos before entering idle
Tim Chen [Fri, 11 Feb 2011 20:49:04 +0000 (12:49 -0800)]
idle governor: Avoid lock acquisition to read pm_qos before entering idle

Thanks to the reviews and comments by Rafael, James, Mark and Andi.
Here's version 2 of the patch incorporating your comments and also some
update to my previous patch comments.

I noticed that before entering idle state, the menu idle governor will
look up the current pm_qos target value according to the list of qos
requests received.  This look up currently needs the acquisition of a
lock to access the list of qos requests to find the qos target value,
slowing down the entrance into idle state due to contention by multiple
cpus to access this list.  The contention is severe when there are a lot
of cpus waking and going into idle.  For example, for a simple workload
that has 32 pair of processes ping ponging messages to each other, where
64 cpu cores are active in test system, I see the following profile with
37.82% of cpu cycles spent in contention of pm_qos_lock:

-     37.82%          swapper  [kernel.kallsyms]          [k]
_raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.65% pm_qos_request
           menu_select
           cpuidle_idle_call
         - cpu_idle
              99.98% start_secondary

A better approach will be to cache the updated pm_qos target value so
reading it does not require lock acquisition as in the patch below.
With this patch the contention for pm_qos_lock is removed and I saw a
2.2X increase in throughput for my message passing workload.

cc: stable@kernel.org
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Acked-by: mark gross <markgross@thegnar.org>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agocpuidle: menu: fixed wrapping timers at 4.294 seconds
Tero Kristo [Thu, 24 Feb 2011 15:19:23 +0000 (17:19 +0200)]
cpuidle: menu: fixed wrapping timers at 4.294 seconds

Cpuidle menu governor is using u32 as a temporary datatype for storing
nanosecond values which wrap around at 4.294 seconds. This causes errors
in predicted sleep times resulting in higher than should be C state
selection and increased power consumption. This also breaks cpuidle
state residency statistics.

cc: stable@kernel.org # .32.x through .39.x
Signed-off-by: Tero Kristo <tero.kristo@nokia.com>
Signed-off-by: Len Brown <len.brown@intel.com>
13 years agomm: fix page_lock_anon_vma leaving mutex locked
Hugh Dickins [Sat, 28 May 2011 20:20:21 +0000 (13:20 -0700)]
mm: fix page_lock_anon_vma leaving mutex locked

On one machine I've been getting hangs, a page fault's anon_vma_prepare()
waiting in anon_vma_lock(), other processes waiting for that page's lock.

This is a replay of last year's f18194275c39 "mm: fix hang on
anon_vma->root->lock".

The new page_lock_anon_vma() places too much faith in its refcount: when
it has acquired the mutex_trylock(), it's possible that a racing task in
anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
to 1, and is about to reset its anon_vma->root.

Fix this by saving anon_vma->root, and relying on the usual page_mapped()
check instead of a refcount check: if page is still mapped, the anon_vma
is still ours; if page is not still mapped, we're no longer interested.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agomm: fix kernel BUG at mm/rmap.c:1017!
Hugh Dickins [Sat, 28 May 2011 20:17:04 +0000 (13:17 -0700)]
mm: fix kernel BUG at mm/rmap.c:1017!

I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
just once.  The stack showed khugepaged allocation trying to compact
pages: the call to page_add_anon_rmap() coming from remove_migration_pte().

That path holds anon_vma lock, but does not hold mmap_sem: it can
therefore race with a split_vma(), and in commit 5f70b962ccc2 "mmap:
avoid unnecessary anon_vma lock" we just took away the anon_vma lock
protection when adjusting vma->vm_end.

I don't think that particular BUG_ON ever caught anything interesting,
so better replace it by a comment, than reinstate the anon_vma locking.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agotmpfs: fix race between truncate and writepage
Hugh Dickins [Sat, 28 May 2011 20:14:09 +0000 (13:14 -0700)]
tmpfs: fix race between truncate and writepage

While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging
(interruptibly), repeatedly failing to locate the owner of a 0xff entry in
the swap_map.

Although shmem_writepage() does abandon when it sees incoming page index
is beyond eof, there was still a window in which shmem_truncate_range()
could come in between writepage's dropping lock and updating swap_map,
find the half-completed swap_map entry, and in trying to free it,
leave it in a state that swap_shmem_alloc() could not correct.

Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling
of the different cases, but easiest to fix by moving swap_shmem_alloc()
under cover of the lock.

More interesting than the bug: it's been there since 2.6.33, why could
I not see it with earlier kernels?  The mmotm of two weeks ago seems to
have some magic for generating races, this is just one of three I found.

With yesterday's git I first saw this in mainline, bisected in search of
that magic, but the easy reproducibility evaporated.  Oh well, fix the bug.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoBlackfin: debug-mmrs: include RSI_PID[4567] MMRs
Mike Frysinger [Thu, 26 May 2011 22:07:11 +0000 (18:07 -0400)]
Blackfin: debug-mmrs: include RSI_PID[4567] MMRs

The documentation is a little iffy as to whether these are actual MMRs,
but reading them on the hardware works, and the previous version of this
logic (the SDH) had PID[4567].  So add it for RSI too.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: bf51x: fix up RSI_PID# MMR defines
Mike Frysinger [Thu, 26 May 2011 22:05:15 +0000 (18:05 -0400)]
Blackfin: bf51x: fix up RSI_PID# MMR defines

Looks like the copying of MMR defines from the SDH block missed updating
the addresses of the RSI_PID# registers.  So tweak them to reflect the
actual hardware.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: bf52x/bf54x: fix up usb MMR defines
Mike Frysinger [Thu, 26 May 2011 21:39:17 +0000 (17:39 -0400)]
Blackfin: bf52x/bf54x: fix up usb MMR defines

The bf52x/bf54x have the incorrect addresses for USB_EP_NI7_RXINTERVAL
and USB_EP_NI7_TXCOUNT, so adjust those.

Further, the bf54x header puts the USB defines in the wrong place, so
shuffle them back to the right grouping.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: debug-mmrs: fix typos with gptimers/mdma/ppi
Mike Frysinger [Thu, 26 May 2011 21:27:36 +0000 (17:27 -0400)]
Blackfin: debug-mmrs: fix typos with gptimers/mdma/ppi

This code was mostly developed against a BF54x, so some BF537-specific
issues were missed.

The PPI block starts at PPI_CONTROL, not PPI_STATUS (which is the reverse
of the EPPI block).

The MDMA block starts at MDMA_NEXT_DESC_PTR, not MDMA_CONFIG.  Seems the
sim does not catch misreads here so that'll need to get fixed.

The gptimer block is mostly 32bit regs, not 16bit.  Use the gptimer struct
to figure that out rather than hardcoding it locally.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: gptimers: add structure for hardware register layout
Mike Frysinger [Thu, 26 May 2011 21:26:58 +0000 (17:26 -0400)]
Blackfin: gptimers: add structure for hardware register layout

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: wire up new sendmmsg syscall
Mike Frysinger [Tue, 24 May 2011 01:45:42 +0000 (21:45 -0400)]
Blackfin: wire up new sendmmsg syscall

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: mach/bfin_serial_5xx.h: punt now-unused header
Mike Frysinger [Mon, 28 Sep 2009 03:16:01 +0000 (03:16 +0000)]
Blackfin: mach/bfin_serial_5xx.h: punt now-unused header

Now that the serial code has been unified in bfin_serial.h, and the
Blackfin UART driver pushed its resources to the boards files, we
don't need these headers anymore.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoBlackfin: bfin_serial.h: turn default port wrappers into stubs
Mike Frysinger [Wed, 27 Oct 2010 07:41:03 +0000 (03:41 -0400)]
Blackfin: bfin_serial.h: turn default port wrappers into stubs

Any consumer that needs to access the MMRs has to provide these helpers,
so make the default into useless stubs.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Linus Torvalds [Sat, 28 May 2011 20:03:41 +0000 (13:03 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
  Cache xattr security drop check for write v2
  fs: block_page_mkwrite should wait for writeback to finish
  mm: Wait for writeback when grabbing pages to begin a write
  configfs: remove unnecessary dentry_unhash on rmdir, dir rename
  fat: remove unnecessary dentry_unhash on rmdir, dir rename
  hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
  minix: remove unnecessary dentry_unhash on rmdir, dir rename
  fuse: remove unnecessary dentry_unhash on rmdir, dir rename
  coda: remove unnecessary dentry_unhash on rmdir, dir rename
  afs: remove unnecessary dentry_unhash on rmdir, dir rename
  affs: remove unnecessary dentry_unhash on rmdir, dir rename
  9p: remove unnecessary dentry_unhash on rmdir, dir rename
  ncpfs: fix rename over directory with dangling references
  ncpfs: document dentry_unhash usage
  ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
  hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
  hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
  hfs: remove unnecessary dentry_unhash on rmdir, dir rename
  omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
  udf: remove unnecessary dentry_unhash from rmdir, dir rename
  ...

13 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 May 2011 19:57:01 +0000 (12:57 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, asm: Clean up desc.h a bit
  x86, amd: Do not enable ARAT feature on AMD processors below family 0x12
  x86: Move do_page_fault()'s error path under unlikely()
  x86, efi: Retain boot service code until after switching to virtual mode
  x86: Remove unnecessary check in detect_ht()
  x86: Reorder mm_context_t to remove x86_64 alignment padding and thus shrink mm_struct
  x86, UV: Clean up uv_tlb.c
  x86, UV: Add support for SGI UV2 hub chip
  x86, cpufeature: Update CPU feature RDRND to RDRAND

13 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 May 2011 19:56:46 +0000 (12:56 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
  sched: Fix ->min_vruntime calculation in dequeue_entity()
  sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: More sched_domain iterations fixes

13 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 May 2011 19:56:32 +0000 (12:56 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
  rcu: Remove waitqueue usage for cpu, node, and boost kthreads
  rcu: Avoid acquiring rcu_node locks in timer functions
  atomic: Add atomic_or()
  Documentation: Add statistics about nested locks
  rcu: Decrease memory-barrier usage based on semi-formal proof
  rcu: Make rcu_enter_nohz() pay attention to nesting
  rcu: Don't do reschedule unless in irq
  rcu: Remove old memory barriers from rcu_process_callbacks()
  rcu: Add memory barriers
  rcu: Fix unpaired rcu_irq_enter() from locking selftests

13 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 28 May 2011 19:55:55 +0000 (12:55 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  perf: Fix SIGIO handling
  perf top: Don't stop if no kernel symtab is found
  perf top: Handle kptr_restrict
  perf top: Remove unused macro
  perf events: initialize fd array to -1 instead of 0
  perf tools: Make sure kptr_restrict warnings fit 80 col terms
  perf tools: Fix build on older systems
  perf symbols: Handle /proc/sys/kernel/kptr_restrict
  perf: Remove duplicate headers
  ftrace: Add internal recursive checks
  tracing: Update btrfs's tracepoints to use u64 interface
  tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
  ftrace: Set ops->flag to enabled even on static function tracing
  tracing: Have event with function tracer check error return
  ftrace: Have ftrace_startup() return failure code
  jump_label: Check entries limit in __jump_label_update
  ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
  scripts/tags.sh: Add magic for trace-events for etags too
  scripts/tags.sh: Fix ctags for DEFINE_EVENT()
  x86/ftrace: Fix compiler warning in ftrace.c
  ...

13 years agoMerge branch 'for-usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah...
Linus Torvalds [Sat, 28 May 2011 19:36:15 +0000 (12:36 -0700)]
Merge branch 'for-usb-next' of git://git./linux/kernel/git/sarah/xhci

* 'for-usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci:
  Intel xhci: Limit number of active endpoints to 64.
  Intel xhci: Ignore spurious successful event.
  Intel xhci: Support EHCI/xHCI port switching.
  Intel xhci: Add PCI id for Panther Point xHCI host.
  xhci: STFU: Be quieter during URB submission and completion.
  xhci: STFU: Don't print event ring dequeue pointer.
  xhci: STFU: Remove function tracing.
  xhci: Don't submit commands when the host is dead.
  xhci: Clear stopped_td when Stop Endpoint command completes.

13 years agoMerge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
Linus Torvalds [Sat, 28 May 2011 19:35:15 +0000 (12:35 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/djbw/async_tx

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (33 commits)
  x86: poll waiting for I/OAT DMA channel status
  maintainers: add dma engine tree details
  dmaengine: add TODO items for future work on dma drivers
  dmaengine: Add API documentation for slave dma usage
  dmaengine/dw_dmac: Update maintainer-ship
  dmaengine: move link order
  dmaengine/dw_dmac: implement pause and resume in dwc_control
  dmaengine/dw_dmac: Replace spin_lock* with irqsave variants and enable submission from callback
  dmaengine/dw_dmac: Divide one sg to many desc, if sg len is greater than DWC_MAX_COUNT
  dmaengine/dw_dmac: set residue as total len in dwc_tx_status if status is !DMA_SUCCESS
  dmaengine/dw_dmac: don't call callback routine in case dmaengine_terminate_all() is called
  dmaengine: at_hdmac: pause: no need to wait for FIFO empty
  pch_dma: modify pci device table definition
  pch_dma: Support new device ML7223 IOH
  pch_dma: Support I2S for ML7213 IOH
  pch_dma: Fix DMA setting issue
  pch_dma: modify for checkpatch
  pch_dma: fix dma direction issue for ML7213 IOH video-in
  dmaengine: at_hdmac: use descriptor chaining help function
  dmaengine: at_hdmac: implement pause and resume in atc_control
  ...

Fix up trivial conflict in drivers/dma/dw_dmac.c

13 years agoMerge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6
Linus Torvalds [Sat, 28 May 2011 19:33:51 +0000 (12:33 -0700)]
Merge branch 'for-next' of git://git./linux/kernel/git/sameo/mfd-2.6

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
  mfd: Fix build breakage caused by tps65910 gpio directory move
  mfd: Use mfd cell platform_data for db8500-prcmu cells platform bits

13 years agoMerge branch 'spi/next' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Sat, 28 May 2011 17:57:39 +0000 (10:57 -0700)]
Merge branch 'spi/next' of git://git.secretlab.ca/git/linux-2.6

* 'spi/next' of git://git.secretlab.ca/git/linux-2.6:
  spi/spi_bfin_sport: new driver for a SPI bus via the Blackfin SPORT peripheral
  spi/tle620x: add missing device_remove_file()

13 years agoMerge branch 'gpio/next' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Sat, 28 May 2011 17:56:34 +0000 (10:56 -0700)]
Merge branch 'gpio/next' of git://git.secretlab.ca/git/linux-2.6

* 'gpio/next' of git://git.secretlab.ca/git/linux-2.6:
  gpio/pch_gpio: Support new device ML7223
  gpio: make gpio_{request,free}_array gpio array parameter const
  GPIO: OMAP: move to drivers/gpio
  GPIO: OMAP: move register offset defines into <plat/gpio.h>
  gpio: Convert gpio_is_valid to return bool
  gpio: Move the s5pc100 GPIO to drivers/gpio
  gpio: Move the s5pv210 GPIO to drivers/gpio
  gpio: Move the exynos4 GPIO to drivers/gpio
  gpio: Move to Samsung common GPIO library to drivers/gpio
  gpio/nomadik: add function to read GPIO pull down status
  gpio/nomadik: show all pins in debug
  gpio: move Nomadik GPIO driver to drivers/gpio
  gpio: move U300 GPIO driver to drivers/gpio
  langwell_gpio: add runtime pm support
  gpio/pca953x: Add support for pca9574 and pca9575 devices
  gpio/cs5535: Show explicit dependency between gpio_cs5535 and mfd_cs5535

13 years agoMerge branch 'setns'
Linus Torvalds [Sat, 28 May 2011 17:51:01 +0000 (10:51 -0700)]
Merge branch 'setns'

* setns:
  ns: Wire up the setns system call

Done as a merge to make it easier to fix up conflicts in arm due to
addition of sendmmsg system call

13 years agons: Wire up the setns system call
Eric W. Biederman [Sat, 28 May 2011 02:28:27 +0000 (19:28 -0700)]
ns: Wire up the setns system call

32bit and 64bit on x86 are tested and working.  The rest I have looked
at closely and I can't find any problems.

setns is an easy system call to wire up.  It just takes two ints so I
don't expect any weird architecture porting problems.

While doing this I have noticed that we have some architectures that are
very slow to get new system calls.  cris seems to be the slowest where
the last system calls wired up were preadv and pwritev.  avr32 is weird
in that recvmmsg was wired up but never declared in unistd.h.  frv is
behind with perf_event_open being the last syscall wired up.  On h8300
the last system call wired up was epoll_wait.  On m32r the last system
call wired up was fallocate.  mn10300 has recvmmsg as the last system
call wired up.  The rest seem to at least have syncfs wired up which was
new in the 2.6.39.

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall  conflicts.
v7: ported to Linus's latest post 2.6.39 tree.

>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>  arch/blackfin/mach-common/entry.S      |    1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>
Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoCache xattr security drop check for write v2
Andi Kleen [Sat, 28 May 2011 15:25:51 +0000 (08:25 -0700)]
Cache xattr security drop check for write v2

Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.

Why xattr lookup on every write I hear you ask?

write wants to drop suid and security related xattrs that could set o
capabilities for executables.  To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.

In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.

Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.

This patch adds a simple per inode to avoid this problem.  We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.

I also used the same flag to avoid the suid check, although
that one is pretty cheap.

A file system can also set this flag when it creates the inode,
if it has a cheap way to do so.  This is done for some common file systems
in followon patches.

With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.

v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 years agorcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
Paul E. McKenney [Wed, 25 May 2011 20:42:06 +0000 (13:42 -0700)]
rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state

Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
result in softlockup warnings.  Because some of RCU's kthreads can
legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
state in order to avoid those warnings.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agorcu: Remove waitqueue usage for cpu, node, and boost kthreads
Peter Zijlstra [Fri, 20 May 2011 23:06:29 +0000 (16:06 -0700)]
rcu: Remove waitqueue usage for cpu, node, and boost kthreads

It is not necessary to use waitqueues for the RCU kthreads because
we always know exactly which thread is to be awakened.  In addition,
wake_up() only issues an actual wakeup when there is a thread waiting on
the queue, which was why there was an extra explicit wake_up_process()
to get the RCU kthreads started.

Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
eliminates the need for the initial wake_up_process() and also shrinks
the data structure size a bit.  The wakeup logic is placed in a new
rcu_wait() macro.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agorcu: Avoid acquiring rcu_node locks in timer functions
Paul E. McKenney [Wed, 11 May 2011 12:41:41 +0000 (05:41 -0700)]
rcu: Avoid acquiring rcu_node locks in timer functions

This commit switches manipulations of the rcu_node ->wakemask field
to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
acquiring the rcu_node lock.  This should avoid the following lockdep
splat reported by Valdis Kletnieks:

[   12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
[   12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
[   12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[   12.987691] hub 1-4:1.0: USB hub found
[   12.987877] hub 1-4:1.0: 3 ports detected
[   12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
[   13.071471] udevadm used greatest stack depth: 3984 bytes left
[   13.172129]
[   13.172130] =======================================================
[   13.172425] [ INFO: possible circular locking dependency detected ]
[   13.172650] 2.6.39-rc6-mmotm0506 #1
[   13.172773] -------------------------------------------------------
[   13.172997] blkid/267 is trying to acquire lock:
[   13.173009]  (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]
[   13.173009] but task is already holding lock:
[   13.173009]  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
[   13.173009]
[   13.173009] which lock already depends on the new lock.
[   13.173009]
[   13.173009]
[   13.173009] the existing dependency chain (in reverse order) is:
[   13.173009]
[   13.173009] -> #2 (rcu_node_level_0){..-...}:
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
[   13.173009]        [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5
[   13.173009]        [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7
[   13.173009]        [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23
[   13.173009]        [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75
[   13.173009]        [<ffffffff81030cc6>] update_curr+0x101/0x12e
[   13.173009]        [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b
[   13.173009]        [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68
[   13.173009]        [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128
[   13.173009]        [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c
[   13.173009]        [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d
[   13.173009]        [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18
[   13.173009]        [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20
[   13.173009]        [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23
[   13.173009]        [<ffffffff810b739c>] find_get_page+0xa9/0xb9
[   13.173009]        [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d
[   13.173009]        [<ffffffff810d1a25>] __do_fault+0x54/0x3e6
[   13.173009]        [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed
[   13.173009]        [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0
[   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   13.173009]
[   13.173009] -> #1 (&rq->lock){-.-.-.}:
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
[   13.173009]        [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3
[   13.173009]        [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108
[   13.173009]        [<ffffffff810376c3>] do_fork+0x265/0x33f
[   13.173009]        [<ffffffff81007d02>] kernel_thread+0x6b/0x6d
[   13.173009]        [<ffffffff8153a9dd>] rest_init+0x21/0xd2
[   13.173009]        [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6
[   13.173009]        [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3
[   13.173009]        [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7
[   13.173009]
[   13.173009] -> #0 (&p->pi_lock){-.-.-.}:
[   13.173009]        [<ffffffff81067788>] check_prev_add+0x68/0x20e
[   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]        [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
[   13.173009]        [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]        [<ffffffff81032f3c>] wake_up_process+0x10/0x12
[   13.173009]        [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
[   13.173009]        [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
[   13.173009]        [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
[   13.173009]        [<ffffffff8103e487>] __do_softirq+0x109/0x26a
[   13.173009]        [<ffffffff8157144c>] call_softirq+0x1c/0x30
[   13.173009]        [<ffffffff81003207>] do_softirq+0x44/0xf1
[   13.173009]        [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
[   13.173009]        [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
[   13.173009]        [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
[   13.173009]        [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
[   13.173009]        [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
[   13.173009]        [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
[   13.173009]        [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
[   13.173009]        [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
[   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   13.173009]
[   13.173009] other info that might help us debug this:
[   13.173009]
[   13.173009] Chain exists of:
[   13.173009]   &p->pi_lock --> &rq->lock --> rcu_node_level_0
[   13.173009]
[   13.173009]  Possible unsafe locking scenario:
[   13.173009]
[   13.173009]        CPU0                    CPU1
[   13.173009]        ----                    ----
[   13.173009]   lock(rcu_node_level_0);
[   13.173009]                                lock(&rq->lock);
[   13.173009]                                lock(rcu_node_level_0);
[   13.173009]   lock(&p->pi_lock);
[   13.173009]
[   13.173009]  *** DEADLOCK ***
[   13.173009]
[   13.173009] 3 locks held by blkid/267:
[   13.173009]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de
[   13.173009]  #1:  (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9
[   13.173009]  #2:  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
[   13.173009]
[   13.173009] stack backtrace:
[   13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
[   13.173009] Call Trace:
[   13.173009]  <IRQ>  [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9
[   13.173009]  [<ffffffff81067788>] check_prev_add+0x68/0x20e
[   13.173009]  [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46
[   13.173009]  [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
[   13.173009]  [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
[   13.173009]  [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
[   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff81032f3c>] wake_up_process+0x10/0x12
[   13.173009]  [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
[   13.173009]  [<ffffffff810451da>] ? del_timer+0x75/0x75
[   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
[   13.173009]  [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
[   13.173009]  [<ffffffff8103e487>] __do_softirq+0x109/0x26a
[   13.173009]  [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6
[   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
[   13.173009]  [<ffffffff8157144c>] call_softirq+0x1c/0x30
[   13.173009]  [<ffffffff81003207>] do_softirq+0x44/0xf1
[   13.173009]  [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
[   13.173009]  [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
[   13.173009]  [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
[   13.173009]  <EOI>  [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310
[   13.173009]  [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310
[   13.173009]  [<ffffffff812220e7>] ? clear_page_c+0x7/0x10
[   13.173009]  [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd
[   13.173009]  [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
[   13.173009]  [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
[   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
[   13.173009]  [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
[   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
[   13.173009]  [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
[   13.173009]  [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
[   13.173009]  [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
[   13.173009]  [<ffffffff810d915f>] ? sys_brk+0x32/0x10c
[   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
[   13.173009]  [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c
[   13.173009]  [<ffffffff812235dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   13.173009]  [<ffffffff8156a75f>] page_fault+0x1f/0x30
[   14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agoatomic: Add atomic_or()
Paul E. McKenney [Wed, 11 May 2011 12:33:33 +0000 (05:33 -0700)]
atomic: Add atomic_or()

An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
add a generic version.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agoMerge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck...
Ingo Molnar [Fri, 27 May 2011 10:38:52 +0000 (12:38 +0200)]
Merge branch 'rcu/urgent' of git://git./linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent

13 years agoperf: Fix SIGIO handling
Peter Zijlstra [Thu, 26 May 2011 15:02:53 +0000 (17:02 +0200)]
perf: Fix SIGIO handling

Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
explicitly push the wakeup (including signals) when requested.

Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agoDocumentation: Add statistics about nested locks
Juri Lelli [Thu, 19 May 2011 10:53:53 +0000 (12:53 +0200)]
Documentation: Add statistics about nested locks

Explain what the trailing "/1" on some lock class names of
lock_stat output means.

Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4F6C1.5090701@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agocpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
KOSAKI Motohiro [Thu, 19 May 2011 06:08:58 +0000 (15:08 +0900)]
cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise RT scheduler may confuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agosched: Fix ->min_vruntime calculation in dequeue_entity()
Peter Zijlstra [Tue, 17 May 2011 23:21:10 +0000 (16:21 -0700)]
sched: Fix ->min_vruntime calculation in dequeue_entity()

Dima Zavin <dima@android.com> reported:

"After pulling the thread off the run-queue during a cgroup change,
the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
then gets normalized to this new value. This can then lead to the thread
getting an unfair boost in the new group if the vruntime of the next
task in the old run-queue was way further ahead."

Reported-by: Dima Zavin <dima@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>