GitHub/LineageOS/android_kernel_motorola_exynos9610.git
10 years agolibceph: dump pg_temp mappings to debugfs
Ilya Dryomov [Thu, 13 Mar 2014 14:36:13 +0000 (16:36 +0200)]
libceph: dump pg_temp mappings to debugfs

Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g:

    pg_temp 2.6 [2,3,4]

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: do not prefix osd lines with \t in debugfs output
Ilya Dryomov [Thu, 13 Mar 2014 14:36:12 +0000 (16:36 +0200)]
libceph: do not prefix osd lines with \t in debugfs output

To save screen space in anticipation of more fields (e.g. primary
affinity).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: refer to osdmap directly in osdmap_show()
Ilya Dryomov [Thu, 13 Mar 2014 14:36:12 +0000 (16:36 +0200)]
libceph: refer to osdmap directly in osdmap_show()

To make it more readable and save screen space.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agocrush: support chooseleaf_vary_r tunable (tunables3) by default
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: support chooseleaf_vary_r tunable (tunables3) by default

Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
10 years agocrush: add SET_CHOOSELEAF_VARY_R step
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: add SET_CHOOSELEAF_VARY_R step

This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
10 years agocrush: add chooseleaf_vary_r tunable
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: add chooseleaf_vary_r tunable

The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call.  That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
10 years agocrush: allow crush rules to set (re)tries counts to 0
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: allow crush rules to set (re)tries counts to 0

These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
10 years agocrush: fix off-by-one errors in total_tries refactor
Ilya Dryomov [Wed, 19 Mar 2014 14:58:36 +0000 (16:58 +0200)]
crush: fix off-by-one errors in total_tries refactor

Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis.  That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem.  Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.

This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does.  Inspection of the
crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
10 years agoceph: don't include ceph.{file,dir}.layout vxattr in listxattr()
Yan, Zheng [Mon, 24 Mar 2014 11:15:05 +0000 (19:15 +0800)]
ceph: don't include ceph.{file,dir}.layout vxattr in listxattr()

This avoids 'cp -a' modifying layout of new files/directories.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: check buffer size in ceph_vxattrcb_layout()
Yan, Zheng [Mon, 24 Mar 2014 05:00:54 +0000 (13:00 +0800)]
ceph: check buffer size in ceph_vxattrcb_layout()

If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: fix null pointer dereference in discard_cap_releases()
Yan, Zheng [Mon, 24 Mar 2014 01:56:43 +0000 (09:56 +0800)]
ceph: fix null pointer dereference in discard_cap_releases()

send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agolibceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()
Yan, Zheng [Sat, 22 Mar 2014 22:50:39 +0000 (06:50 +0800)]
libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: Remove get/set acl on symlinks
Fabian Frederick [Fri, 21 Mar 2014 16:52:58 +0000 (09:52 -0700)]
ceph: Remove get/set acl on symlinks

Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
10 years agoceph: set mds_wanted when MDS reply changes a cap to auth cap
Yan, Zheng [Tue, 18 Mar 2014 02:15:29 +0000 (10:15 +0800)]
ceph: set mds_wanted when MDS reply changes a cap to auth cap

When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: use fl->fl_file as owner identifier of flock and posix lock
Yan, Zheng [Sun, 9 Mar 2014 15:16:40 +0000 (23:16 +0800)]
ceph: use fl->fl_file as owner identifier of flock and posix lock

flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: forbid mandatory file lock
Yan, Zheng [Tue, 4 Mar 2014 07:50:06 +0000 (15:50 +0800)]
ceph: forbid mandatory file lock

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: use fl->fl_type to decide flock operation
Yan, Zheng [Tue, 4 Mar 2014 07:42:24 +0000 (15:42 +0800)]
ceph: use fl->fl_type to decide flock operation

VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: update i_max_size even if inode version does not change
Yan, Zheng [Sat, 8 Mar 2014 12:12:23 +0000 (20:12 +0800)]
ceph: update i_max_size even if inode version does not change

handle following sequence of events:
 - client releases a inode with i_max_size > 0. The release message
   is queued. (is not sent to the auth MDS)
 - a 'lookup' request reply from non-auth MDS returns the same inode.
 - client opens the inode in write mode. The version of inode trace
   in 'open' request reply is equal to the cached inode's version.
 - client requests new max size. The MDS ignores the request because
   it does not affect client's write range

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: make sure write caps are registered with auth MDS
Yan, Zheng [Sat, 8 Mar 2014 01:51:45 +0000 (09:51 +0800)]
ceph: make sure write caps are registered with auth MDS

Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: print inode number for LOOKUPINO request
Yan, Zheng [Sat, 1 Mar 2014 14:22:57 +0000 (22:22 +0800)]
ceph: print inode number for LOOKUPINO request

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: add get_name() NFS export callback
Yan, Zheng [Thu, 6 Mar 2014 08:40:32 +0000 (16:40 +0800)]
ceph: add get_name() NFS export callback

Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: fix ceph_fh_to_parent()
Yan, Zheng [Sat, 1 Mar 2014 15:09:05 +0000 (23:09 +0800)]
ceph: fix ceph_fh_to_parent()

ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: add get_parent() NFS export callback
Yan, Zheng [Sat, 1 Mar 2014 14:11:45 +0000 (22:11 +0800)]
ceph: add get_parent() NFS export callback

The callback uses LOOKUPPARENT MDS request to find parent.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: simplify ceph_fh_to_dentry()
Yan, Zheng [Sat, 1 Mar 2014 10:05:41 +0000 (18:05 +0800)]
ceph: simplify ceph_fh_to_dentry()

MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoceph: fscache: Wait for completion of object initialization
Yunchuan Wen [Thu, 26 Dec 2013 14:29:28 +0000 (06:29 -0800)]
ceph: fscache: Wait for completion of object initialization

The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
10 years agoceph: fscache: Update object store limit after file writing
Yunchuan Wen [Thu, 26 Dec 2013 14:29:27 +0000 (06:29 -0800)]
ceph: fscache: Update object store limit after file writing

Synchronize object->store_limit[_l] with new inode->i_size after file writing.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
10 years agoceph: fscache: add an interface to synchronize object store limit
Yunchuan Wen [Thu, 26 Dec 2013 14:29:26 +0000 (06:29 -0800)]
ceph: fscache: add an interface to synchronize object store limit

Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
10 years agoceph: do not set r_old_dentry_dir on link()
Sage Weil [Tue, 5 Feb 2013 21:41:23 +0000 (13:41 -0800)]
ceph: do not set r_old_dentry_dir on link()

This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: do not assume r_old_dentry[_dir] always set together
Sage Weil [Tue, 5 Feb 2013 21:40:09 +0000 (13:40 -0800)]
ceph: do not assume r_old_dentry[_dir] always set together

Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true.  Separate out the ref cleanup and make the debugs dump behave when
it is NULL.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: do not chain inode updates to parent fsync
Sage Weil [Tue, 5 Feb 2013 21:52:29 +0000 (13:52 -0800)]
ceph: do not chain inode updates to parent fsync

The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()
Sage Weil [Tue, 5 Feb 2013 21:36:05 +0000 (13:36 -0800)]
ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()

This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: let MDS adjust readdir 'frag'
Yan, Zheng [Mon, 3 Mar 2014 01:20:44 +0000 (09:20 +0800)]
ceph: let MDS adjust readdir 'frag'

If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.

Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
10 years agoceph: fix reset_readdir()
Yan, Zheng [Fri, 28 Feb 2014 08:36:09 +0000 (16:36 +0800)]
ceph: fix reset_readdir()

When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agoceph: fix ceph_dir_llseek()
Yan, Zheng [Thu, 27 Feb 2014 08:26:24 +0000 (16:26 +0800)]
ceph: fix ceph_dir_llseek()

Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agorbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
Ilya Dryomov [Tue, 25 Feb 2014 14:22:28 +0000 (16:22 +0200)]
rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op

In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order).  Backwards compatibility is taken care
of on the libceph/osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once.  The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into.  To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agorbd: num_ops parameter for rbd_osd_req_create()
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
rbd: num_ops parameter for rbd_osd_req_create()

In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create().  The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: bump CEPH_OSD_MAX_OP to 3
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
libceph: bump CEPH_OSD_MAX_OP to 3

Our longest osd request now contains 3 ops: copyup+hint+write.

Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2.  Fix it, and switch to rbd_assert while at it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op

This is primarily for rbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: encode CEPH_OSD_OP_FLAG_* op flags
Ilya Dryomov [Tue, 25 Feb 2014 14:22:26 +0000 (16:22 +0200)]
libceph: encode CEPH_OSD_OP_FLAG_* op flags

Encode ceph_osd_op::flags field so that it gets sent over the wire.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agorbd: fix error paths in rbd_img_request_fill()
Ilya Dryomov [Tue, 4 Mar 2014 09:57:17 +0000 (11:57 +0200)]
rbd: fix error paths in rbd_img_request_fill()

Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():

    Assertion failure in rbd_obj_request_destroy() at line 1867:

    rbd_assert(obj_request->img_request == NULL);

rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del().  Use it.

Fixes: http://tracker.ceph.com/issues/7327

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agorbd: remove out_partial label in rbd_img_request_fill()
Ilya Dryomov [Tue, 4 Mar 2014 09:57:17 +0000 (11:57 +0200)]
rbd: remove out_partial label in rbd_img_request_fill()

Commit 03507db631c94 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus.  Remove it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
10 years agolibceph: a per-osdc crush scratch buffer
Ilya Dryomov [Fri, 31 Jan 2014 15:54:26 +0000 (17:54 +0200)]
libceph: a per-osdc crush scratch buffer

With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain.  Replace it with a buffer inside
struct osdmap and a mutex.  This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
10 years agoLinux 3.14
Linus Torvalds [Mon, 31 Mar 2014 03:40:15 +0000 (20:40 -0700)]
Linux 3.14

10 years agoMerge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Mon, 31 Mar 2014 00:26:08 +0000 (17:26 -0700)]
Merge branch 'for-linus-2' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "Switch mnt_hash to hlist, turning the races between __lookup_mnt() and
  hash modifications into false negatives from __lookup_mnt() (instead
  of hangs)"

On the false negatives from __lookup_mnt():
 "The *only* thing we care about is not getting stuck in __lookup_mnt().
  If it misses an entry because something in front of it just got moved
  around, etc, we are fine.  We'll notice that mount_lock mismatch and
  that'll be it"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch mnt_hash to hlist
  don't bother with propagate_mnt() unless the target is shared
  keep shadowed vfsmounts together
  resizable namespace.c hashes

10 years agoMAINTAINERS: resume as Documentation maintainer
Randy Dunlap [Fri, 28 Mar 2014 16:45:33 +0000 (09:45 -0700)]
MAINTAINERS: resume as Documentation maintainer

I am the new kernel tree Documentation maintainer (except for parts that
are handled by other people, of course).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Rob Landley <rob@landley.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Mon, 31 Mar 2014 00:20:40 +0000 (17:20 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:
 "Some more updates for the input subsystem.

  You will get a fix for race in mousedev that has been causing quite a
  few oopses lately and a small fixup for force feedback support in
  evdev"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: mousedev - fix race when creating mixed device
  Input: don't modify the id of ioctl-provided ff effect on upload failure

10 years agoAUDIT: Allow login in non-init namespaces
Eric Paris [Sun, 30 Mar 2014 23:07:54 +0000 (19:07 -0400)]
AUDIT: Allow login in non-init namespaces

It its possible to configure your PAM stack to refuse login if audit
messages (about the login) were unable to be sent.  This is common in
many distros and thus normal configuration of many containers.  The PAM
modules determine if audit is enabled/disabled in the kernel based on
the return value from sending an audit message on the netlink socket.
If userspace gets back ECONNREFUSED it believes audit is disabled in the
kernel.  If it gets any other error else it refuses to let the login
proceed.

Just about ever since the introduction of namespaces the kernel audit
subsystem has returned EPERM if the task sending a message was not in
the init user or pid namespace.  So many forms of containers have never
worked if audit was enabled in the kernel.

BUT if the container was not in net_init then the kernel network code
would send ECONNREFUSED (instead of the audit code sending EPERM).  Thus
by pure accident/dumb luck/bug if an admin configured the PAM stack to
reject all logins that didn't talk to audit, but then ran the login
untility in the non-init_net namespace, it would work!! Clearly this was
a bug, but it is a bug some people expected.

With the introduction of network namespace support in 3.14-rc1 the two
bugs stopped cancelling each other out.  Now, containers in the
non-init_net namespace refused to let users log in (just like PAM was
configfured!) Obviously some people were not happy that what used to let
users log in, now didn't!

This fix is kinda hacky.  We return ECONNREFUSED for all non-init
relevant namespaces.  That means that not only will the old broken
non-init_net setups continue to work, now the broken non-init_pid or
non-init_user setups will 'work'.  They don't really work, since audit
isn't logging things.  But it's what most users want.

In 3.15 we should have patches to support not only the non-init_net
(3.14) namespace but also the non-init_pid and non-init_user namespace.
So all will be right in the world.  This just opens the doors wide open
on 3.14 and hopefully makes users happy, if not the audit system...

Reported-by: Andre Tomt <andre@tomt.net>
Reported-by: Adam Richter <adam_richter2004@yahoo.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoext4: atomically set inode->i_flags in ext4_set_inode_flags()
Theodore Ts'o [Sun, 30 Mar 2014 14:20:01 +0000 (10:20 -0400)]
ext4: atomically set inode->i_flags in ext4_set_inode_flags()

Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoswitch mnt_hash to hlist
Al Viro [Fri, 21 Mar 2014 01:10:51 +0000 (21:10 -0400)]
switch mnt_hash to hlist

fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating.  Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.

[fix for dumb braino folded]

Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
10 years agodon't bother with propagate_mnt() unless the target is shared
Al Viro [Fri, 21 Mar 2014 14:14:08 +0000 (10:14 -0400)]
don't bother with propagate_mnt() unless the target is shared

If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
10 years agokeep shadowed vfsmounts together
Al Viro [Fri, 21 Mar 2014 00:34:43 +0000 (20:34 -0400)]
keep shadowed vfsmounts together

preparation to switching mnt_hash to hlist

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
10 years agoresizable namespace.c hashes
Al Viro [Fri, 28 Feb 2014 18:46:44 +0000 (13:46 -0500)]
resizable namespace.c hashes

* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
10 years agoMerge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 29 Mar 2014 22:01:09 +0000 (15:01 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "A late breaking fix from John.  (The bug fixed has a hard lockup
  potential, but that was not observed, warnings were)"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Revert to calling clock_was_set_delayed() while in irq context

10 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph...
Linus Torvalds [Sat, 29 Mar 2014 22:00:27 +0000 (15:00 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client

Pull Ceph fix from Sage Weil:
 "This drops a bad assert that a few users have been hitting but we've
  only recently been able to track down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: drop an unsafe assertion

10 years agoInput: mousedev - fix race when creating mixed device
Dmitry Torokhov [Thu, 6 Mar 2014 20:57:24 +0000 (12:57 -0800)]
Input: mousedev - fix race when creating mixed device

We should not be using static variable mousedev_mix in methods that can be
called before that singleton gets assigned. While at it let's add open and
close methods to mousedev structure so that we do not need to test if we
are dealing with multiplexor or normal device and simply call appropriate
method directly.

This fixes: https://bugzilla.kernel.org/show_bug.cgi?id=71551

Reported-by: GiulioDP <depasquale.giulio@gmail.com>
Tested-by: GiulioDP <depasquale.giulio@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
10 years agoInput: don't modify the id of ioctl-provided ff effect on upload failure
Elias Vanderstuyft [Sat, 29 Mar 2014 19:08:45 +0000 (12:08 -0700)]
Input: don't modify the id of ioctl-provided ff effect on upload failure

If a new (id == -1) ff effect was uploaded from userspace,
ff-core.c::input_ff_upload() will have assigned a positive number to the
new effect id.  Currently, evdev.c::evdev_do_ioctl() will save this new id
to userspace, regardless of whether the upload succeeded or not.

On upload failure, this can be confusing because the dev->ff->effects[]
array will not contain an element at the index of that new effect id.

This patch fixes this by leaving the id unchanged after upload fails.

Note: Unfortunately applications should still expect changed effect id for
quite some time.

This has been discussed on:
http://www.mail-archive.com/linux-input@vger.kernel.org/msg08513.html
("ff-core effect id handling in case of a failed effect upload")

Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Elias Vanderstuyft <elias.vds@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
10 years agorbd: drop an unsafe assertion
Alex Elder [Tue, 25 Mar 2014 13:36:02 +0000 (15:36 +0200)]
rbd: drop an unsafe assertion

Olivier Bonvalet reported having repeated crashes due to a failed
assertion he was hitting in rbd_img_obj_callback():

    Assertion failure in rbd_img_obj_callback() at line 2165:
rbd_assert(which >= img_request->next_completion);

With a lot of help from Olivier with reproducing the problem
we were able to determine the object and image requests had
already been completed (and often freed) at the point the
assertion failed.

There was a great deal of discussion on the ceph-devel mailing list
about this.  The problem only arose when there were two (or more)
object requests in an image request, and the problem was always
seen when the second request was being completed.

The problem is due to a race in the window between setting the
"done" flag on an object request and checking the image request's
next completion value.  When the first object request completes, it
checks to see if its successor request is marked "done", and if
so, that request is also completed.  In the process, the image
request's next_completion value is updated to reflect that both
the first and second requests are completed.  By the time the
second request is able to check the next_completion value, it
has been set to a value *greater* than its own "which" value,
which caused an assertion to fail.

Fix this problem by skipping over any completion processing
unless the completing object request is the next one expected.
Test only for inequality (not >=), and eliminate the bad
assertion.

Tested-by: Olivier Bonvalet <ob@daevel.fr>
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
10 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Fri, 28 Mar 2014 22:09:37 +0000 (15:09 -0700)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) We've discovered a common error in several networking drivers, they
    put VLAN offload features into ->vlan_features, which would suggest
    that they support offloading 2 or more levels of VLAN encapsulation.
    Not only do these devices not do that, but we don't have the
    infrastructure yet to handle that at all.

    Fixes from Vlad Yasevich.

 2) Fix tcpdump crash with bridging and vlans, also from Vlad.

 3) Some MAINTAINERS updates for random32 and bonding.

 4) Fix late reseeds of prandom generator, from Sasha Levin.

 5) Bridge doesn't handle stacked vlans properly, fix from Toshiaki
    Makita.

 6) Fix deadlock in openvswitch, from Flavio Leitner.

 7) get_timewait4_sock() doesn't report delay times correctly, fix from
    Eric Dumazet.

 8) Duplicate address detection and addrconf verification need to run in
    contexts where RTNL can be obtained.  Move them to run from a
    workqueue.  From Hannes Frederic Sowa.

 9) Fix route refcount leaking in ip tunnels, from Pravin B Shelar.

10) Don't return -EINTR from non-blocking recvmsg() on AF_UNIX sockets,
    from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
  vlan: Warn the user if lowerdev has bad vlan features.
  veth: Turn off vlan rx acceleration in vlan_features
  ifb: Remove vlan acceleration from vlan_features
  qlge: Do not propaged vlan tag offloads to vlans
  bridge: Fix crash with vlan filtering and tcpdump
  net: Account for all vlan headers in skb_mac_gso_segment
  MAINTAINERS: bonding: change email address
  MAINTAINERS: bonding: change email address
  ipv6: move DAD and addrconf_verify processing to workqueue
  tcp: fix get_timewait4_sock() delay computation on 64bit
  openvswitch: fix a possible deadlock and lockdep warning
  bridge: Fix handling stacked vlan tags
  bridge: Fix inabillity to retrieve vlan tags when tx offload is disabled
  vhost: validate vhost_get_vq_desc return value
  vhost: fix total length when packets are too short
  random32: avoid attempt to late reseed if in the middle of seeding
  random32: assign to network folks in MAINTAINERS
  net/mlx4_core: pass pci_device_id.driver_data to __mlx4_init_one during reset
  core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
  vlan: Set hard_header_len according to available acceleration
  ...

10 years agoMerge branch 'vlan_offloads'
David S. Miller [Fri, 28 Mar 2014 21:17:16 +0000 (17:17 -0400)]
Merge branch 'vlan_offloads'

Vlad Yasevich says:

====================
Audit all drivers for correct vlan_features.

Some drivers set vlan acceleration features in vlan_features.  This causes
issues with Q-in-Q/802.1ad configurations.

Audit all the drivers for correct vlan_features.  Fix broken ones.
Add a warning to vlan code to help catch future offenders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovlan: Warn the user if lowerdev has bad vlan features.
Vlad Yasevich [Fri, 28 Mar 2014 02:14:49 +0000 (22:14 -0400)]
vlan: Warn the user if lowerdev has bad vlan features.

Some drivers incorrectly assign vlan acceleration features to
vlan_features thus causing issues for Q-in-Q vlan configurations.
Warn the user of such cases.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoveth: Turn off vlan rx acceleration in vlan_features
Vlad Yasevich [Fri, 28 Mar 2014 02:14:48 +0000 (22:14 -0400)]
veth: Turn off vlan rx acceleration in vlan_features

For completeness, turn off vlan rx acceleration in vlan_features so
that it doesn't show up on q-in-q setups.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoifb: Remove vlan acceleration from vlan_features
Vlad Yasevich [Fri, 28 Mar 2014 02:14:47 +0000 (22:14 -0400)]
ifb: Remove vlan acceleration from vlan_features

Do not include vlan acceleration features in vlan_features as that
precludes correct Q-in-Q operation.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoqlge: Do not propaged vlan tag offloads to vlans
Vlad Yasevich [Fri, 28 Mar 2014 02:14:46 +0000 (22:14 -0400)]
qlge: Do not propaged vlan tag offloads to vlans

qlge driver turns off NETIF_F_HW_CTAG_FILTER, but forgets to
turn off HW_CTAG_TX and HW_CTAG_RX on vlan devices.  With the
current settings, q-in-q will only generate a single vlan header.
Remember to mask off CTAG_TX and CTAG_RX features in vlan_features.

CC: Shahed Shaikh <shahed.shaikh@qlogic.com>
CC: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
CC: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Fix crash with vlan filtering and tcpdump
Vlad Yasevich [Fri, 28 Mar 2014 01:51:18 +0000 (21:51 -0400)]
bridge: Fix crash with vlan filtering and tcpdump

When the vlan filtering is enabled on the bridge, but
the filter is not configured on the bridge device itself,
running tcpdump on the bridge device will result in a
an Oops with NULL pointer dereference.  The reason
is that br_pass_frame_up() will bypass the vlan
check because promisc flag is set.  It will then try
to get the table pointer and process the packet based
on the table.  Since the table pointer is NULL, we oops.
Catch this special condition in br_handle_vlan().

Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: Account for all vlan headers in skb_mac_gso_segment
Vlad Yasevich [Thu, 27 Mar 2014 21:26:18 +0000 (17:26 -0400)]
net: Account for all vlan headers in skb_mac_gso_segment

skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb.  However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers.  That may not
always be the case.  If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.

A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated.  This way skb_mac_gso_segment
will correctly skip all mac headers.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMAINTAINERS: bonding: change email address
Veaceslav Falico [Thu, 27 Mar 2014 17:43:50 +0000 (18:43 +0100)]
MAINTAINERS: bonding: change email address

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMAINTAINERS: bonding: change email address
Jay Vosburgh [Thu, 27 Mar 2014 17:33:44 +0000 (10:33 -0700)]
MAINTAINERS: bonding: change email address

Update my email address.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'akpm' (patches from Andrew Morton)
Linus Torvalds [Fri, 28 Mar 2014 20:57:13 +0000 (13:57 -0700)]
Merge branch 'akpm' (patches from Andrew Morton)

Merge two fixes from Andrew Morton:
 "The x86 fix should come from x86 guys but they appear to be
  conferencing or otherwise distracted.

  The ocfs2 fix is a bit of a mess - the code runs into an immediate
  NULL deref and we're trying to work out how this got through test and
  review, but we haven't heard from Goldwyn in the past few days.
  Sasha's patch fixes the oops, but the feature as a whole is probably
  broken.  So this is a stopgap for 3.14 - I'll aim to get the real
  fixes into 3.14.x"

* emailed patches from Andrew Morton akpm@linux-foundation.org>:
  x86: fix boot on uniprocessor systems
  ocfs2: check if cluster name exists before deref

10 years agox86: fix boot on uniprocessor systems
Artem Fetishev [Fri, 28 Mar 2014 20:33:39 +0000 (13:33 -0700)]
x86: fix boot on uniprocessor systems

On x86 uniprocessor systems topology_physical_package_id() returns -1
which causes rapl_cpu_prepare() to leave rapl_pmu variable uninitialized
which leads to GPF in rapl_pmu_init().

See arch/x86/kernel/cpu/perf_event_intel_rapl.c.

It turns out that physical_package_id and core_id can actually be
retreived for uniprocessor systems too.  Enabling them also fixes
rapl_pmu code.

Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoocfs2: check if cluster name exists before deref
Sasha Levin [Fri, 28 Mar 2014 20:33:38 +0000 (13:33 -0700)]
ocfs2: check if cluster name exists before deref

Commit c74a3bdd9b52 ("ocfs2: add clustername to cluster connection") is
trying to strlcpy a string which was explicitly passed as NULL in the
very same patch, triggering a NULL ptr deref.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: strlcpy (lib/string.c:388 lib/string.c:151)
  CPU: 19 PID: 19426 Comm: trinity-c19 Tainted: G        W     3.14.0-rc7-next-20140325-sasha-00014-g9476368-dirty #274
  RIP:  strlcpy (lib/string.c:388 lib/string.c:151)
  Call Trace:
   ocfs2_cluster_connect (fs/ocfs2/stackglue.c:350)
   ocfs2_cluster_connect_agnostic (fs/ocfs2/stackglue.c:396)
   user_dlm_register (fs/ocfs2/dlmfs/userdlm.c:679)
   dlmfs_mkdir (fs/ocfs2/dlmfs/dlmfs.c:503)
   vfs_mkdir (fs/namei.c:3467)
   SyS_mkdirat (fs/namei.c:3488 fs/namei.c:3472)
   tracesys (arch/x86/kernel/entry_64.S:749)

akpm: this patch probably disables the feature.  A temporary thing to
avoid triviel oopses.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoipv6: move DAD and addrconf_verify processing to workqueue
Hannes Frederic Sowa [Thu, 27 Mar 2014 17:28:07 +0000 (18:28 +0100)]
ipv6: move DAD and addrconf_verify processing to workqueue

addrconf_join_solict and addrconf_join_anycast may cause actions which
need rtnl locked, especially on first address creation.

A new DAD state is introduced which defers processing of the initial
DAD processing into a workqueue.

To get rtnl lock we need to push the code paths which depend on those
calls up to workqueues, specifically addrconf_verify and the DAD
processing.

(v2)
addrconf_dad_failure needs to be queued up to the workqueue, too. This
patch introduces a new DAD state and stop the DAD processing in the
workqueue (this is because of the possible ipv6_del_addr processing
which removes the solicited multicast address from the device).

addrconf_verify_lock is removed, too. After the transition it is not
needed any more.

As we are not processing in bottom half anymore we need to be a bit more
careful about disabling bottom half out when we lock spin_locks which are also
used in bh.

Relevant backtrace:
[  541.030090] RTNL: assertion failed at net/core/dev.c (4496)
[  541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.10.33-1-amd64-vyatta #1
[  541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  541.031146]  ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8
[  541.031148]  0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18
[  541.031150]  0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000
[  541.031152] Call Trace:
[  541.031153]  <IRQ>  [<ffffffff8148a9f0>] ? dump_stack+0xd/0x17
[  541.031180]  [<ffffffff813c98c1>] ? __dev_set_promiscuity+0x101/0x180
[  541.031183]  [<ffffffff813d3540>] ? __hw_addr_create_ex+0x60/0xc0
[  541.031185]  [<ffffffff813cfe1a>] ? __dev_set_rx_mode+0xaa/0xc0
[  541.031189]  [<ffffffff813d3a81>] ? __dev_mc_add+0x61/0x90
[  541.031198]  [<ffffffffa01dcf9c>] ? igmp6_group_added+0xfc/0x1a0 [ipv6]
[  541.031208]  [<ffffffff8111237b>] ? kmem_cache_alloc+0xcb/0xd0
[  541.031212]  [<ffffffffa01ddcd7>] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6]
[  541.031216]  [<ffffffffa01c2fae>] ? addrconf_join_solict+0x2e/0x40 [ipv6]
[  541.031219]  [<ffffffffa01ba2e9>] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6]
[  541.031223]  [<ffffffffa01c0772>] ? addrconf_join_anycast+0x92/0xa0 [ipv6]
[  541.031226]  [<ffffffffa01c311e>] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6]
[  541.031229]  [<ffffffffa01c3213>] ? ipv6_ifa_notify+0x33/0x50 [ipv6]
[  541.031233]  [<ffffffffa01c36c8>] ? addrconf_dad_completed+0x28/0x100 [ipv6]
[  541.031241]  [<ffffffff81075c1d>] ? task_cputime+0x2d/0x50
[  541.031244]  [<ffffffffa01c38d6>] ? addrconf_dad_timer+0x136/0x150 [ipv6]
[  541.031247]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]
[  541.031255]  [<ffffffff8105313a>] ? call_timer_fn.isra.22+0x2a/0x90
[  541.031258]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]

Hunks and backtrace stolen from a patch by Stephen Hemminger.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotcp: fix get_timewait4_sock() delay computation on 64bit
Eric Dumazet [Thu, 27 Mar 2014 14:19:19 +0000 (07:19 -0700)]
tcp: fix get_timewait4_sock() delay computation on 64bit

It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.

This could result in wrong output in /proc/net/tcp for tm->when field.

Fixes: 96f817fedec4 ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoopenvswitch: fix a possible deadlock and lockdep warning
Flavio Leitner [Thu, 27 Mar 2014 14:05:34 +0000 (11:05 -0300)]
openvswitch: fix a possible deadlock and lockdep warning

There are two problematic situations.

A deadlock can happen when is_percpu is false because it can get
interrupted while holding the spinlock. Then it executes
ovs_flow_stats_update() in softirq context which tries to get
the same lock.

The second sitation is that when is_percpu is true, the code
correctly disables BH but only for the local CPU, so the
following can happen when locking the remote CPU without
disabling BH:

       CPU#0                            CPU#1
  ovs_flow_stats_get()
   stats_read()
 +->spin_lock remote CPU#1        ovs_flow_stats_get()
 |  <interrupted>                  stats_read()
 |  ...                       +-->  spin_lock remote CPU#0
 |                            |     <interrupted>
 |  ovs_flow_stats_update()   |     ...
 |   spin_lock local CPU#0 <--+     ovs_flow_stats_update()
 +---------------------------------- spin_lock local CPU#1

This patch disables BH for both cases fixing the deadlocks.
Acked-by: Jesse Gross <jesse@nicira.com>
=================================
[ INFO: inconsistent lock state ]
3.14.0-rc8-00007-g632b06a #1 Tainted: G          I
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0/0 [HC0[0]:SC1[5]:HE1:SE0] takes:
(&(&cpu_stats->lock)->rlock){+.?...}, at: [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810f973f>] __lock_acquire+0x68f/0x1c40
[<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
[<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
[<ffffffffa05dd9e4>] ovs_flow_stats_get+0xc4/0x1e0 [openvswitch]
[<ffffffffa05da855>] ovs_flow_cmd_fill_info+0x185/0x360 [openvswitch]
[<ffffffffa05daf05>] ovs_flow_cmd_build_info.constprop.27+0x55/0x90 [openvswitch]
[<ffffffffa05db41d>] ovs_flow_cmd_new_or_set+0x4dd/0x570 [openvswitch]
[<ffffffff816c245d>] genl_family_rcv_msg+0x1cd/0x3f0
[<ffffffff816c270e>] genl_rcv_msg+0x8e/0xd0
[<ffffffff816c0239>] netlink_rcv_skb+0xa9/0xc0
[<ffffffff816c0798>] genl_rcv+0x28/0x40
[<ffffffff816bf830>] netlink_unicast+0x100/0x1e0
[<ffffffff816bfc57>] netlink_sendmsg+0x347/0x770
[<ffffffff81668e9c>] sock_sendmsg+0x9c/0xe0
[<ffffffff816692d9>] ___sys_sendmsg+0x3a9/0x3c0
[<ffffffff8166a911>] __sys_sendmsg+0x51/0x90
[<ffffffff8166a962>] SyS_sendmsg+0x12/0x20
[<ffffffff817e3ce9>] system_call_fastpath+0x16/0x1b
irq event stamp: 1740726
hardirqs last  enabled at (1740726): [<ffffffff8175d5e0>] ip6_finish_output2+0x4f0/0x840
hardirqs last disabled at (1740725): [<ffffffff8175d59b>] ip6_finish_output2+0x4ab/0x840
softirqs last  enabled at (1740674): [<ffffffff8109be12>] _local_bh_enable+0x22/0x50
softirqs last disabled at (1740675): [<ffffffff8109db05>] irq_exit+0xc5/0xd0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&cpu_stats->lock)->rlock);
  <Interrupt>
    lock(&(&cpu_stats->lock)->rlock);

 *** DEADLOCK ***

5 locks held by swapper/0/0:
 #0:  (((&ifa->dad_timer))){+.-...}, at: [<ffffffff810a7155>] call_timer_fn+0x5/0x320
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81788a55>] mld_sendpack+0x5/0x4a0
 #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8175d149>] ip6_finish_output2+0x59/0x840
 #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8168ba75>] __dev_queue_xmit+0x5/0x9b0
 #4:  (rcu_read_lock){.+.+..}, at: [<ffffffffa05e41b5>] internal_dev_xmit+0x5/0x110 [openvswitch]

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I  3.14.0-rc8-00007-g632b06a #1
Hardware name:                  /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
 0000000000000000 0fcf20709903df0c ffff88042d603808 ffffffff817cfe3c
 ffffffff81c134c0 ffff88042d603858 ffffffff817cb6da 0000000000000005
 ffffffff00000001 ffff880400000000 0000000000000006 ffffffff81c134c0
Call Trace:
 <IRQ>  [<ffffffff817cfe3c>] dump_stack+0x4d/0x66
 [<ffffffff817cb6da>] print_usage_bug+0x1f4/0x205
 [<ffffffff810f7f10>] ? check_usage_backwards+0x180/0x180
 [<ffffffff810f8963>] mark_lock+0x223/0x2b0
 [<ffffffff810f96d3>] __lock_acquire+0x623/0x1c40
 [<ffffffff810f5707>] ? __lock_is_held+0x57/0x80
 [<ffffffffa05e26c6>] ? masked_flow_lookup+0x236/0x250 [openvswitch]
 [<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dcc64>] ovs_dp_process_received_packet+0x84/0x120 [openvswitch]
 [<ffffffff810f93f7>] ? __lock_acquire+0x347/0x1c40
 [<ffffffffa05e3bea>] ovs_vport_receive+0x2a/0x30 [openvswitch]
 [<ffffffffa05e4218>] internal_dev_xmit+0x68/0x110 [openvswitch]
 [<ffffffffa05e41b5>] ? internal_dev_xmit+0x5/0x110 [openvswitch]
 [<ffffffff8168b4a6>] dev_hard_start_xmit+0x2e6/0x8b0
 [<ffffffff8168be87>] __dev_queue_xmit+0x417/0x9b0
 [<ffffffff8168ba75>] ? __dev_queue_xmit+0x5/0x9b0
 [<ffffffff8175d5e0>] ? ip6_finish_output2+0x4f0/0x840
 [<ffffffff8168c430>] dev_queue_xmit+0x10/0x20
 [<ffffffff8175d641>] ip6_finish_output2+0x551/0x840
 [<ffffffff8176128a>] ? ip6_finish_output+0x9a/0x220
 [<ffffffff8176128a>] ip6_finish_output+0x9a/0x220
 [<ffffffff8176145f>] ip6_output+0x4f/0x1f0
 [<ffffffff81788c29>] mld_sendpack+0x1d9/0x4a0
 [<ffffffff817895b8>] mld_send_initial_cr.part.32+0x88/0xa0
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8178e301>] ipv6_mc_dad_complete+0x31/0x50
 [<ffffffff817690d7>] addrconf_dad_completed+0x147/0x220
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8176934f>] addrconf_dad_timer+0x19f/0x1c0
 [<ffffffff810a71e9>] call_timer_fn+0x99/0x320
 [<ffffffff810a7155>] ? call_timer_fn+0x5/0x320
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff810a76c4>] run_timer_softirq+0x254/0x3b0
 [<ffffffff8109d47d>] __do_softirq+0x12d/0x480

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Fix handling stacked vlan tags
Toshiaki Makita [Thu, 27 Mar 2014 12:46:56 +0000 (21:46 +0900)]
bridge: Fix handling stacked vlan tags

If a bridge with vlan_filtering enabled receives frames with stacked
vlan tags, i.e., they have two vlan tags, br_vlan_untag() strips not
only the outer tag but also the inner tag.

br_vlan_untag() is called only from br_handle_vlan(), and in this case,
it is enough to set skb->vlan_tci to 0 here, because vlan_tci has already
been set before calling br_handle_vlan().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobridge: Fix inabillity to retrieve vlan tags when tx offload is disabled
Toshiaki Makita [Thu, 27 Mar 2014 12:46:55 +0000 (21:46 +0900)]
bridge: Fix inabillity to retrieve vlan tags when tx offload is disabled

Bridge vlan code (br_vlan_get_tag()) assumes that all frames have vlan_tci
if they are tagged, but if vlan tx offload is manually disabled on bridge
device and frames are sent from vlan device on the bridge device, the tags
are embedded in skb->data and they break this assumption.
Extract embedded vlan tags and move them to vlan_tci at ingress.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovhost: validate vhost_get_vq_desc return value
Michael S. Tsirkin [Thu, 27 Mar 2014 10:53:37 +0000 (12:53 +0200)]
vhost: validate vhost_get_vq_desc return value

vhost fails to validate negative error code
from vhost_get_vq_desc causing
a crash: we are using -EFAULT which is 0xfffffff2
as vector size, which exceeds the allocated size.

The code in question was introduced in commit
8dd014adfea6f173c1ef6378f7e5e7924866c923
    vhost-net: mergeable buffers support

CVE-2014-0055

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovhost: fix total length when packets are too short
Michael S. Tsirkin [Thu, 27 Mar 2014 10:00:26 +0000 (12:00 +0200)]
vhost: fix total length when packets are too short

When mergeable buffers are disabled, and the
incoming packet is too large for the rx buffer,
get_rx_bufs returns success.

This was intentional in order for make recvmsg
truncate the packet and then handle_rx would
detect err != sock_len and drop it.

Unfortunately we pass the original sock_len to
recvmsg - which means we use parts of iov not fully
validated.

Fix this up by detecting this overrun and doing packet drop
immediately.

CVE-2014-0077

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agorandom32: avoid attempt to late reseed if in the middle of seeding
Sasha Levin [Fri, 28 Mar 2014 16:38:42 +0000 (17:38 +0100)]
random32: avoid attempt to late reseed if in the middle of seeding

Commit 4af712e8df ("random32: add prandom_reseed_late() and call when
nonblocking pool becomes initialized") has added a late reseed stage
that happens as soon as the nonblocking pool is marked as initialized.

This fails in the case that the nonblocking pool gets initialized
during __prandom_reseed()'s call to get_random_bytes(). In that case
we'd double back into __prandom_reseed() in an attempt to do a late
reseed - deadlocking on 'lock' early on in the boot process.

Instead, just avoid even waiting to do a reseed if a reseed is already
occuring.

Fixes: 4af712e8df99 ("random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Fri, 28 Mar 2014 20:03:00 +0000 (13:03 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input subsystem fixes from Dmitry Torokhov:
 "Updates to Synaptics touchpad to better cope with devices in Lenovo
  laptops, and a couple more fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: synaptics - add manual min/max quirk for ThinkPad X240
  Input: synaptics - add manual min/max quirk
  Input: cypress_ps2 - don't report as a button pads
  Input: da9052_onkey - use correct register bit for key status
  Input: adp5588-keys - get value from data out when dir is out

10 years agorandom32: assign to network folks in MAINTAINERS
Sasha Levin [Thu, 27 Mar 2014 06:01:34 +0000 (02:01 -0400)]
random32: assign to network folks in MAINTAINERS

lib/random32.c was split out of the network code and is de-facto
still maintained by the almighty net/ gods.

Make it a bit more official so that people who aren't aware of
that know where to send their patches.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 28 Mar 2014 17:58:10 +0000 (10:58 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "I didn't want these to wait for stable cycle.

  The nouveau and radeon ones are the same problem, where the runtime pm
  stuff broke non-runtime pm managed secondary GPUs.

  The udl fix is for an oops on unplug, and the i915 fix is for a
  regression on Sandybridge even though it may break haswell (regression
  wins)"

Daniel Vetter comments:
 "My apologies for the i915 regression fumble, that thing somehow fell
  through the cracks here for almost half a year :( Imo that's more than
  enough flailing to just go ahead with the revert, and the re-broken
  hsw should get peoples attention ..."

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/i915: Undo gtt scratch pte unmapping again
  drm/radeon: fix runtime suspend breaking secondary GPUs
  drm/nouveau: fail runtime pm properly.
  drm/udl: take reference to device struct for dma-bufs

10 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Fri, 28 Mar 2014 17:55:44 +0000 (10:55 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c build fix from Wolfram Sang:
 "The build fix from my last request unveiled another build problem
  which is fixed with this patch"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: cpm: Fix build by adding of_address.h and of_irq.h

10 years agoMerge tag 'stable/for-linus-3.14-rc8-tag' of git://git.kernel.org/pub/scm/linux/kerne...
Linus Torvalds [Fri, 28 Mar 2014 17:52:05 +0000 (10:52 -0700)]
Merge tag 'stable/for-linus-3.14-rc8-tag' of git://git./linux/kernel/git/xen/tip

Pull Xen bugfixes from David Vrabel:
 "Fix two bugs that cause x86 PV guest crashes.

  1. Ballooning a 32-bit guest would eventually crash it.

  2. Revert a broken fix for a regression with NUMA_BALACING.  The bad
     fix caused PV guests to crash after migration.  This is not ideal
     but unpicking the madness that is _PAGE_NUMA == _PAGE_PROTNONE will
     take a while longer"

* tag 'stable/for-linus-3.14-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Revert "xen: properly account for _PAGE_NUMA during xen pte translations"
  xen/balloon: flush persistent kmaps in correct position

10 years agoInput: synaptics - add manual min/max quirk for ThinkPad X240
Hans de Goede [Fri, 28 Mar 2014 08:01:38 +0000 (01:01 -0700)]
Input: synaptics - add manual min/max quirk for ThinkPad X240

This extends Benjamin Tissoires manual min/max quirk table with support for
the ThinkPad X240.

Cc: stable@vger.kernel.org
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
10 years agoInput: synaptics - add manual min/max quirk
Benjamin Tissoires [Fri, 28 Mar 2014 07:43:00 +0000 (00:43 -0700)]
Input: synaptics - add manual min/max quirk

The new Lenovo Haswell series (-40's) contains a new Synaptics touchpad.
However, these new Synaptics devices report bad axis ranges.
Under Windows, it is not a problem because the Windows driver uses RMI4
over SMBus to talk to the device. Under Linux, we are using the PS/2
fallback interface and it occurs the reported ranges are wrong.

Of course, it would be too easy to have only one range for the whole
series, each touchpad seems to be calibrated in a different way.

We can not use SMBus to get the actual range because I suspect the firmware
will switch into the SMBus mode and stop talking through PS/2 (this is the
case for hybrid HID over I2C / PS/2 Synaptics touchpads).

So as a temporary solution (until RMI4 land into upstream), start a new
list of quirks with the min/max manually set.

Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
10 years agotime: Revert to calling clock_was_set_delayed() while in irq context
John Stultz [Thu, 27 Mar 2014 23:30:49 +0000 (16:30 -0700)]
time: Revert to calling clock_was_set_delayed() while in irq context

In commit 47a1b796306356f35 ("tick/timekeeping: Call
update_wall_time outside the jiffies lock"), we moved to calling
clock_was_set() due to the fact that we were no longer holding
the timekeeping or jiffies lock.

However, there is still the problem that clock_was_set()
triggers an IPI, which cannot be done from the timer's hard irq
context, and will generate WARN_ON warnings.

Apparently in my earlier testing, I'm guessing I didn't bump the
dmesg log level, so I somehow missed the WARN_ONs.

Thus we need to revert back to calling clock_was_set_delayed().

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1395963049-11923-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
10 years agodrm/i915: Undo gtt scratch pte unmapping again
Daniel Vetter [Wed, 26 Mar 2014 19:10:09 +0000 (20:10 +0100)]
drm/i915: Undo gtt scratch pte unmapping again

It apparently blows up on some machines. This functionally reverts

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64841
Reported-and-Tested-by: Brad Jackson <bjackson0971@gmail.com>
Cc: stable@vger.kernel.org
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: Todd Previte <tprevite@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
10 years agodrm/radeon: fix runtime suspend breaking secondary GPUs
Dave Airlie [Thu, 27 Mar 2014 02:31:08 +0000 (02:31 +0000)]
drm/radeon: fix runtime suspend breaking secondary GPUs

Same fix as for nouveau, when we fail with EINVAL, subsequent
gets fail hard, causing the device not to open.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
10 years agonet/mlx4_core: pass pci_device_id.driver_data to __mlx4_init_one during reset
Wei Yang [Thu, 27 Mar 2014 01:28:31 +0000 (09:28 +0800)]
net/mlx4_core: pass pci_device_id.driver_data to __mlx4_init_one during reset

The second parameter of __mlx4_init_one() is used to identify whether the
pci_dev is a PF or VF. Currently, when it is invoked in mlx4_pci_slot_reset()
this information is missed.

This patch match the pci_dev with mlx4_pci_table and passes the
pci_device_id.driver_data to __mlx4_init_one() in mlx4_pci_slot_reset().

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocore, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
Zoltan Kiss [Wed, 26 Mar 2014 22:37:45 +0000 (22:37 +0000)]
core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovlan: Set hard_header_len according to available acceleration
Vlad Yasevich [Wed, 26 Mar 2014 15:47:56 +0000 (11:47 -0400)]
vlan: Set hard_header_len according to available acceleration

Currently, if the card supports CTAG acceleration we do not
account for the vlan header even if we are configuring an
8021AD vlan.  This may not be best since we'll do software
tagging for 8021AD which will cause data copy on skb head expansion
Configure the length based on available hw offload capabilities and
vlan protocol.

CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agousbnet: include wait queue head in device structure
Oliver Neukum [Wed, 26 Mar 2014 13:32:51 +0000 (14:32 +0100)]
usbnet: include wait queue head in device structure

This fixes a race which happens by freeing an object on the stack.
Quoting Julius:
> The issue is
> that it calls usbnet_terminate_urbs() before that, which temporarily
> installs a waitqueue in dev->wait in order to be able to wait on the
> tasklet to run and finish up some queues. The waiting itself looks
> okay, but the access to 'dev->wait' is totally unprotected and can
> race arbitrarily. I think in this case usbnet_bh() managed to succeed
> it's dev->wait check just before usbnet_terminate_urbs() sets it back
> to NULL. The latter then finishes and the waitqueue_t structure on its
> stack gets overwritten by other functions halfway through the
> wake_up() call in usbnet_bh().

The fix is to just not allocate the data structure on the stack.
As dev->wait is abused as a flag it also takes a runtime PM change
to fix this bug.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
Reported-by: Grant Grundler <grundler@google.com>
Tested-by: Grant Grundler <grundler@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovirtio-net: correct error handling of virtqueue_kick()
Jason Wang [Wed, 26 Mar 2014 05:03:00 +0000 (13:03 +0800)]
virtio-net: correct error handling of virtqueue_kick()

Current error handling of virtqueue_kick() was wrong in two places:
- The skb were freed immediately when virtqueue_kick() fail during
  xmit. This may lead double free since the skb was not detached from
  the virtqueue.
- try_fill_recv() returns false when virtqueue_kick() fail. This will
  lead unnecessary rescheduling of refill work.

Actually, it's safe to just ignore the kick failure in those two
places. So this patch fixes this by partially revert commit
67975901183799af8e93ec60e322f9e2a1940b9b.

Fixes 67975901183799af8e93ec60e322f9e2a1940b9b
(virtio_net: verify if virtqueue_kick() succeeded).

Cc: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agovfs: Allocate anon_inode_inode in anon_inode_init()
Jan Kara [Wed, 26 Mar 2014 05:20:14 +0000 (06:20 +0100)]
vfs: Allocate anon_inode_inode in anon_inode_init()

Currently we allocated anon_inode_inode in anon_inodefs_mount. This is
somewhat fragile as if that function ever gets called again, it will
overwrite anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
[ Further simplified on suggestion from Dave Jones ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agodrm/nouveau: fail runtime pm properly.
Dave Airlie [Wed, 26 Mar 2014 04:09:37 +0000 (14:09 +1000)]
drm/nouveau: fail runtime pm properly.

If we were on a non-optimus device, we'd return -EINVAL, this would
lead to the over engineered runtime pm system to go into an error
state, subsequent get_sync's would fail, so we'd never be able
to open the device again.

(like really get_sync shouldn't fail if the device isn't powered
down).

Signed-off-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
10 years agodrm/udl: take reference to device struct for dma-bufs
Dave Airlie [Tue, 25 Mar 2014 04:38:44 +0000 (14:38 +1000)]
drm/udl: take reference to device struct for dma-bufs

this stops the device from being deleted before all the dma-bufs
on it are freed, this fixes an oops when you unplug a udl device while
it has imported a buffer from another device.

Signed-off-by: Dave Airlie <airlied@redhat.com>
10 years agonet: unix: non blocking recvmsg() should not return -EINTR
Eric Dumazet [Wed, 26 Mar 2014 01:42:27 +0000 (18:42 -0700)]
net: unix: non blocking recvmsg() should not return -EINTR

Some applications didn't expect recvmsg() on a non blocking socket
could return -EINTR. This possibility was added as a side effect
of commit b3ca9b02b00704 ("net: fix multithreaded signal handling in
unix recv routines").

To hit this bug, you need to be a bit unlucky, as the u->readlock
mutex is usually held for very small periods.

Fixes: b3ca9b02b00704 ("net: fix multithreaded signal handling in unix recv routines")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'mvneta'
David S. Miller [Wed, 26 Mar 2014 20:53:04 +0000 (16:53 -0400)]
Merge branch 'mvneta'

Thomas Petazzoni says:

====================
net: mvneta: fix usage as a module

The following set of two patches fix the usage of the mvneta driver
when built as a module, and used in RGMII configurations. It is
somewhat similar to a previous fix that was made by Arnaud Patard, but
which was limited to SGMII configurations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: mvneta: use devm_ioremap_resource() instead of of_iomap()
Thomas Petazzoni [Tue, 25 Mar 2014 23:26:55 +0000 (00:26 +0100)]
net: mvneta: use devm_ioremap_resource() instead of of_iomap()

The mvneta driver currently uses of_iomap(), which has two drawbacks:
it doesn't request the resource, and it isn't devm-style so some error
handling is needed.

This commit switches to use devm_ioremap_resource() instead, which
automatically requests the resource (so the I/O registers region shows
up properly in /proc/iomem), and also is devm-style, which allows to
get rid of some error handling to unmap the I/O registers region.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: mvneta: fix usage as a module on RGMII configurations
Thomas Petazzoni [Tue, 25 Mar 2014 23:25:42 +0000 (00:25 +0100)]
net: mvneta: fix usage as a module on RGMII configurations

Commit 5445eaf309ff ('mvneta: Try to fix mvneta when compiled as
module') fixed the mvneta driver to make it work properly when loaded
as a module in SGMII configuration, which was tested successful by the
author on the Armada XP OpenBlocks AX3, which uses SGMII.

However, it turns out that the Armada XP GP, which uses RGMII, is
affected by a similar problem: its SERDES configuration is lost when
mvneta is loaded as a module, because this configuration is set by the
bootloader, and then lost because the clock is gated by the clock
framework until the mvneta driver is loaded again and the clock is
re-enabled.

However, it turns out that for the RGMII case, setting the SERDES
configuration is not sufficient: the PCS enable bit in the
MVNETA_GMAC_CTRL_2 register must also be set, like in the SGMII
configuration.

Therefore, this commit reworks the SGMII/RGMII initialization: the
only difference between the two now is a different SERDES
configuration, all the rest is identical.

In detail, to achieve this, the commit:

 * Renames MVNETA_SGMII_SERDES_CFG to MVNETA_SERDES_CFG because it is
   not specific to SGMII, but also used on RGMII configurations.

 * Adds a MVNETA_RGMII_SERDES_PROTO definition, that must be used as
   the MVNETA_SERDES_CFG value in RGMII configurations.

 * Removes the mvneta_gmac_rgmii_set() and mvneta_port_sgmii_config()
   functions, and instead directly do the SGMII/RGMII configuration in
   mvneta_port_up(), from where those functions where called. It is
   worth mentioning that mvneta_gmac_rgmii_set() had an 'enable'
   parameter that was always passed as '1', so it was pretty useless.

 * Reworks the mvneta_port_up() function to set the MVNETA_SERDES_CFG
   register to the appropriate value depending on the RGMII vs. SGMII
   configuration. It also unconditionally set the PCS_ENABLE bit (was
   already done for SGMII, but is now also needed for RGMII), and sets
   the PORT_RGMII bit (which was already done for both SGMII and
   RGMII).

This commit was successfully tested with mvneta compiled as a module,
on both the OpenBlocks AX3 (SGMII configuration) and the Armada XP GP
(RGMII configuration).

Reported-by: Steve McIntyre <steve@einval.com>
Cc: stable@vger.kernel.org # 3.11.x: 5445eaf309ff mvneta: Try to fix mvneta when compiled as module
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>