Ilya Dryomov [Wed, 2 Apr 2014 16:34:04 +0000 (20:34 +0400)]
libceph: output primary affinity values on osdmap updates
Similar to osd weights, output primary affinity values on incremental
osdmap updates.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Yan, Zheng [Tue, 1 Apr 2014 12:34:56 +0000 (20:34 +0800)]
ceph: flush cap release queue when trimming session caps
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Tue, 1 Apr 2014 12:34:18 +0000 (20:34 +0800)]
ceph: don't grabs open file reference for aborted request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Tue, 1 Apr 2014 12:20:17 +0000 (20:20 +0800)]
ceph: drop extra open file reference in ceph_atomic_open()
ceph_atomic_open() calls ceph_open() after receiving the MDS reply.
ceph_open() grabs an extra open file reference. (The open request
already holds an open file reference)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sat, 29 Mar 2014 05:41:15 +0000 (13:41 +0800)]
ceph: preallocate buffer for readdir reply
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:50 +0000 (17:12 +0200)]
libceph: enable PRIMARY_AFFINITY feature bit
Announce our support for osdmaps with non-default primary affinity
values.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:49 +0000 (17:12 +0200)]
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:49 +0000 (17:12 +0200)]
libceph: add support for osd primary affinity
Respond to non-default primary_affinity values accordingly. (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:48 +0000 (17:12 +0200)]
libceph: add support for primary_temp mappings
Change apply_temp() to override primary in the same way pg_temp
overrides osd set. primary_temp overrides pg_temp primary too.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:48 +0000 (17:12 +0200)]
libceph: return primary from ceph_calc_pg_acting()
In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly. Primary is now specified separately
from the order of osds in the set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:48 +0000 (17:12 +0200)]
libceph: switch ceph_calc_pg_acting() to new helpers
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:47 +0000 (17:12 +0200)]
libceph: introduce apply_temps() helper
apply_temp() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, therefore transforming it
into an acting set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:47 +0000 (17:12 +0200)]
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existant and down osds.
raw_to_up_osds() helper for pruning non-existant and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:47 +0000 (17:12 +0200)]
libceph: ceph_can_shift_osds(pool) and pool type defines
Bring in pg_pool_t::can_shift_osds() counterpart along with pool type
defines.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Mon, 24 Mar 2014 15:12:46 +0000 (17:12 +0200)]
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
Sync up with ceph.git definitions. Bring in ceph_osd_is_down().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:31 +0000 (19:05 +0200)]
libceph: enable OSDMAP_ENC feature bit
Announce our support for "new" (v7 - split and separately versioned
client and osd sections) osdmap enconding.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:30 +0000 (19:05 +0200)]
libceph: primary_affinity decode bits
Add two helpers to decode primary_affinity (full map, vector<u32>) and
new_primary_affinity (inc map, map<u32, u32>) and switch to them.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:30 +0000 (19:05 +0200)]
libceph: primary_affinity infrastructure
Add primary_affinity infrastructure. primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.
Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:30 +0000 (19:05 +0200)]
libceph: primary_temp decode bits
Add a common helper to decode both primary_temp (full map, map<pg_t,
u32>) and new_primary_temp (inc map, same) and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:29 +0000 (19:05 +0200)]
libceph: primary_temp infrastructure
Add primary_temp mappings infrastructure. struct ceph_pg_mapping is
overloaded, primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.
Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g:
primary_temp 2.6 4
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:29 +0000 (19:05 +0200)]
libceph: generalize ceph_pg_mapping
In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:29 +0000 (19:05 +0200)]
libceph: introduce get_osdmap_client_data_v()
Full and incremental osdmaps are structured identically and have
identical headers. Add a helper to decode both "old" (16-bit version,
v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
enconding headers and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:28 +0000 (19:05 +0200)]
libceph: introduce decode{,_new}_pg_temp() and switch to them
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
(inc map, same) decoding logic into a common helper and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:28 +0000 (19:05 +0200)]
libceph: switch osdmap_set_max_osd() to krealloc()
Use krealloc() instead of rolling our own. (krealloc() with a NULL
first argument acts as a kmalloc()). Properly initalize the new array
elements. This is needed to make future additions to osdmap easier.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:27 +0000 (19:05 +0200)]
libceph: introduce decode{,_new}_pools() and switch to them
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 21 Mar 2014 17:05:27 +0000 (19:05 +0200)]
libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
To be in line with all the other osdmap decode helpers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:16 +0000 (16:36 +0200)]
libceph: fix and clarify ceph_decode_need() sizes
Sum up sizeof(...) results instead of (incorrectly) hard-coding the
number of bytes, expressed in ints and longs.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:16 +0000 (16:36 +0200)]
libceph: nuke bogus encoding version check in osdmap_apply_incremental()
Only version 6 of osdmap encoding is supported, anything other than
version 6 results in an error and halts the decoding process. Checking
if version is >= 5 is therefore bogus.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:15 +0000 (16:36 +0200)]
libceph: fixup error handling in osdmap_apply_incremental()
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:15 +0000 (16:36 +0200)]
libceph: fix crush_decode() call site in osdmap_decode()
The size of the memory area feeded to crush_decode() should be limited
not only by osdmap end, but also by the crush map length. Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:14 +0000 (16:36 +0200)]
libceph: check length of osdmap osd arrays
Check length of osd_state, osd_weight and osd_addr arrays. They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:14 +0000 (16:36 +0200)]
libceph: safely decode max_osd value in osdmap_decode()
max_osd value is not covered by any ceph_decode_need(). Use a safe
version of ceph_decode_* macro to decode it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:14 +0000 (16:36 +0200)]
libceph: fixup error handling in osdmap_decode()
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:13 +0000 (16:36 +0200)]
libceph: split osdmap allocation and decode steps
Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:13 +0000 (16:36 +0200)]
libceph: dump osdmap and enhance output on decode errors
Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset. dout() map epoch
and max_osd value on success.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:13 +0000 (16:36 +0200)]
libceph: dump pg_temp mappings to debugfs
Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g:
pg_temp 2.6 [2,3,4]
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:12 +0000 (16:36 +0200)]
libceph: do not prefix osd lines with \t in debugfs output
To save screen space in anticipation of more fields (e.g. primary
affinity).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Thu, 13 Mar 2014 14:36:12 +0000 (16:36 +0200)]
libceph: refer to osdmap directly in osdmap_show()
To make it more readable and save screen space.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: support chooseleaf_vary_r tunable (tunables3) by default
Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: add SET_CHOOSELEAF_VARY_R step
This lets you adjust the vary_r tunable on a per-rule basis.
Reflects ceph.git commit
f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: add chooseleaf_vary_r tunable
The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call. That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.
Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.
Note that this was done from the get-go for the new crush_choose_indep
algorithm.
This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.
Reflects ceph.git commit
a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Wed, 19 Mar 2014 14:58:37 +0000 (16:58 +0200)]
crush: allow crush rules to set (re)tries counts to 0
These two fields are misnomers; they are *retry* counts.
Reflects ceph.git commit
f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Wed, 19 Mar 2014 14:58:36 +0000 (16:58 +0200)]
crush: fix off-by-one errors in total_tries refactor
Back in
27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.
Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.
This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.
Reflects ceph.git commit
795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Yan, Zheng [Mon, 24 Mar 2014 11:15:05 +0000 (19:15 +0800)]
ceph: don't include ceph.{file,dir}.layout vxattr in listxattr()
This avoids 'cp -a' modifying layout of new files/directories.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Mon, 24 Mar 2014 05:00:54 +0000 (13:00 +0800)]
ceph: check buffer size in ceph_vxattrcb_layout()
If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Mon, 24 Mar 2014 01:56:43 +0000 (09:56 +0800)]
ceph: fix null pointer dereference in discard_cap_releases()
send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Sat, 22 Mar 2014 22:50:39 +0000 (06:50 +0800)]
libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()
When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Fabian Frederick [Fri, 21 Mar 2014 16:52:58 +0000 (09:52 -0700)]
ceph: Remove get/set acl on symlinks
Remove unsupported symlink operations.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Yan, Zheng [Tue, 18 Mar 2014 02:15:29 +0000 (10:15 +0800)]
ceph: set mds_wanted when MDS reply changes a cap to auth cap
When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sun, 9 Mar 2014 15:16:40 +0000 (23:16 +0800)]
ceph: use fl->fl_file as owner identifier of flock and posix lock
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).
The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.
The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Tue, 4 Mar 2014 07:50:06 +0000 (15:50 +0800)]
ceph: forbid mandatory file lock
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Tue, 4 Mar 2014 07:42:24 +0000 (15:42 +0800)]
ceph: use fl->fl_type to decide flock operation
VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sat, 8 Mar 2014 12:12:23 +0000 (20:12 +0800)]
ceph: update i_max_size even if inode version does not change
handle following sequence of events:
- client releases a inode with i_max_size > 0. The release message
is queued. (is not sent to the auth MDS)
- a 'lookup' request reply from non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of inode trace
in 'open' request reply is equal to the cached inode's version.
- client requests new max size. The MDS ignores the request because
it does not affect client's write range
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Sat, 8 Mar 2014 01:51:45 +0000 (09:51 +0800)]
ceph: make sure write caps are registered with auth MDS
Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sat, 1 Mar 2014 14:22:57 +0000 (22:22 +0800)]
ceph: print inode number for LOOKUPINO request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Thu, 6 Mar 2014 08:40:32 +0000 (16:40 +0800)]
ceph: add get_name() NFS export callback
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Sat, 1 Mar 2014 15:09:05 +0000 (23:09 +0800)]
ceph: fix ceph_fh_to_parent()
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Sat, 1 Mar 2014 14:11:45 +0000 (22:11 +0800)]
ceph: add get_parent() NFS export callback
The callback uses LOOKUPPARENT MDS request to find parent.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Sat, 1 Mar 2014 10:05:41 +0000 (18:05 +0800)]
ceph: simplify ceph_fh_to_dentry()
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yunchuan Wen [Thu, 26 Dec 2013 14:29:28 +0000 (06:29 -0800)]
ceph: fscache: Wait for completion of object initialization
The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Yunchuan Wen [Thu, 26 Dec 2013 14:29:27 +0000 (06:29 -0800)]
ceph: fscache: Update object store limit after file writing
Synchronize object->store_limit[_l] with new inode->i_size after file writing.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Yunchuan Wen [Thu, 26 Dec 2013 14:29:26 +0000 (06:29 -0800)]
ceph: fscache: add an interface to synchronize object store limit
Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Sage Weil [Tue, 5 Feb 2013 21:41:23 +0000 (13:41 -0800)]
ceph: do not set r_old_dentry_dir on link()
This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.
Also, taking this reference is useless.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Sage Weil [Tue, 5 Feb 2013 21:40:09 +0000 (13:40 -0800)]
ceph: do not assume r_old_dentry[_dir] always set together
Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true. Separate out the ref cleanup and make the debugs dump behave when
it is NULL.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Sage Weil [Tue, 5 Feb 2013 21:52:29 +0000 (13:52 -0800)]
ceph: do not chain inode updates to parent fsync
The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.
Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Sage Weil [Tue, 5 Feb 2013 21:36:05 +0000 (13:36 -0800)]
ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()
This is just old_dir; no reason to abuse the dcache pointers.
Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Mon, 3 Mar 2014 01:20:44 +0000 (09:20 +0800)]
ceph: let MDS adjust readdir 'frag'
If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.
Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Fri, 28 Feb 2014 08:36:09 +0000 (16:36 +0800)]
ceph: fix reset_readdir()
When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Yan, Zheng [Thu, 27 Feb 2014 08:26:24 +0000 (16:26 +0800)]
ceph: fix ceph_dir_llseek()
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.
At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 25 Feb 2014 14:22:28 +0000 (16:22 +0200)]
rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the libceph/osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
rbd: num_ops parameter for rbd_osd_req_create()
In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create(). The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
libceph: bump CEPH_OSD_MAX_OP to 3
Our longest osd request now contains 3 ops: copyup+hint+write.
Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2. Fix it, and switch to rbd_assert while at it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 25 Feb 2014 14:22:27 +0000 (16:22 +0200)]
libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
This is primarily for rbd's benefit and is supposed to combat
fragmentation:
"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks. We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."
SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 25 Feb 2014 14:22:26 +0000 (16:22 +0200)]
libceph: encode CEPH_OSD_OP_FLAG_* op flags
Encode ceph_osd_op::flags field so that it gets sent over the wire.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 4 Mar 2014 09:57:17 +0000 (11:57 +0200)]
rbd: fix error paths in rbd_img_request_fill()
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():
Assertion failure in rbd_obj_request_destroy() at line 1867:
rbd_assert(obj_request->img_request == NULL);
rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del(). Use it.
Fixes: http://tracker.ceph.com/issues/7327
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Tue, 4 Mar 2014 09:57:17 +0000 (11:57 +0200)]
rbd: remove out_partial label in rbd_img_request_fill()
Commit
03507db631c94 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus. Remove it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ilya Dryomov [Fri, 31 Jan 2014 15:54:26 +0000 (17:54 +0200)]
libceph: a per-osdc crush scratch buffer
With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain. Replace it with a buffer inside
struct osdmap and a mutex. This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Linus Torvalds [Mon, 31 Mar 2014 03:40:15 +0000 (20:40 -0700)]
Linux 3.14
Linus Torvalds [Mon, 31 Mar 2014 00:26:08 +0000 (17:26 -0700)]
Merge branch 'for-linus-2' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Switch mnt_hash to hlist, turning the races between __lookup_mnt() and
hash modifications into false negatives from __lookup_mnt() (instead
of hangs)"
On the false negatives from __lookup_mnt():
"The *only* thing we care about is not getting stuck in __lookup_mnt().
If it misses an entry because something in front of it just got moved
around, etc, we are fine. We'll notice that mount_lock mismatch and
that'll be it"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch mnt_hash to hlist
don't bother with propagate_mnt() unless the target is shared
keep shadowed vfsmounts together
resizable namespace.c hashes
Randy Dunlap [Fri, 28 Mar 2014 16:45:33 +0000 (09:45 -0700)]
MAINTAINERS: resume as Documentation maintainer
I am the new kernel tree Documentation maintainer (except for parts that
are handled by other people, of course).
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Rob Landley <rob@landley.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 31 Mar 2014 00:20:40 +0000 (17:20 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Some more updates for the input subsystem.
You will get a fix for race in mousedev that has been causing quite a
few oopses lately and a small fixup for force feedback support in
evdev"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: mousedev - fix race when creating mixed device
Input: don't modify the id of ioctl-provided ff effect on upload failure
Eric Paris [Sun, 30 Mar 2014 23:07:54 +0000 (19:07 -0400)]
AUDIT: Allow login in non-init namespaces
It its possible to configure your PAM stack to refuse login if audit
messages (about the login) were unable to be sent. This is common in
many distros and thus normal configuration of many containers. The PAM
modules determine if audit is enabled/disabled in the kernel based on
the return value from sending an audit message on the netlink socket.
If userspace gets back ECONNREFUSED it believes audit is disabled in the
kernel. If it gets any other error else it refuses to let the login
proceed.
Just about ever since the introduction of namespaces the kernel audit
subsystem has returned EPERM if the task sending a message was not in
the init user or pid namespace. So many forms of containers have never
worked if audit was enabled in the kernel.
BUT if the container was not in net_init then the kernel network code
would send ECONNREFUSED (instead of the audit code sending EPERM). Thus
by pure accident/dumb luck/bug if an admin configured the PAM stack to
reject all logins that didn't talk to audit, but then ran the login
untility in the non-init_net namespace, it would work!! Clearly this was
a bug, but it is a bug some people expected.
With the introduction of network namespace support in 3.14-rc1 the two
bugs stopped cancelling each other out. Now, containers in the
non-init_net namespace refused to let users log in (just like PAM was
configfured!) Obviously some people were not happy that what used to let
users log in, now didn't!
This fix is kinda hacky. We return ECONNREFUSED for all non-init
relevant namespaces. That means that not only will the old broken
non-init_net setups continue to work, now the broken non-init_pid or
non-init_user setups will 'work'. They don't really work, since audit
isn't logging things. But it's what most users want.
In 3.15 we should have patches to support not only the non-init_net
(3.14) namespace but also the non-init_pid and non-init_user namespace.
So all will be right in the world. This just opens the doors wide open
on 3.14 and hopefully makes users happy, if not the audit system...
Reported-by: Andre Tomt <andre@tomt.net>
Reported-by: Adam Richter <adam_richter2004@yahoo.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Theodore Ts'o [Sun, 30 Mar 2014 14:20:01 +0000 (10:20 -0400)]
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.
Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro [Fri, 21 Mar 2014 01:10:51 +0000 (21:10 -0400)]
switch mnt_hash to hlist
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating. Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.
[fix for dumb braino folded]
Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 21 Mar 2014 14:14:08 +0000 (10:14 -0400)]
don't bother with propagate_mnt() unless the target is shared
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 21 Mar 2014 00:34:43 +0000 (20:34 -0400)]
keep shadowed vfsmounts together
preparation to switching mnt_hash to hlist
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 28 Feb 2014 18:46:44 +0000 (13:46 -0500)]
resizable namespace.c hashes
* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sat, 29 Mar 2014 22:01:09 +0000 (15:01 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"A late breaking fix from John. (The bug fixed has a hard lockup
potential, but that was not observed, warnings were)"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Revert to calling clock_was_set_delayed() while in irq context
Linus Torvalds [Sat, 29 Mar 2014 22:00:27 +0000 (15:00 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
"This drops a bad assert that a few users have been hitting but we've
only recently been able to track down"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: drop an unsafe assertion
Dmitry Torokhov [Thu, 6 Mar 2014 20:57:24 +0000 (12:57 -0800)]
Input: mousedev - fix race when creating mixed device
We should not be using static variable mousedev_mix in methods that can be
called before that singleton gets assigned. While at it let's add open and
close methods to mousedev structure so that we do not need to test if we
are dealing with multiplexor or normal device and simply call appropriate
method directly.
This fixes: https://bugzilla.kernel.org/show_bug.cgi?id=71551
Reported-by: GiulioDP <depasquale.giulio@gmail.com>
Tested-by: GiulioDP <depasquale.giulio@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Elias Vanderstuyft [Sat, 29 Mar 2014 19:08:45 +0000 (12:08 -0700)]
Input: don't modify the id of ioctl-provided ff effect on upload failure
If a new (id == -1) ff effect was uploaded from userspace,
ff-core.c::input_ff_upload() will have assigned a positive number to the
new effect id. Currently, evdev.c::evdev_do_ioctl() will save this new id
to userspace, regardless of whether the upload succeeded or not.
On upload failure, this can be confusing because the dev->ff->effects[]
array will not contain an element at the index of that new effect id.
This patch fixes this by leaving the id unchanged after upload fails.
Note: Unfortunately applications should still expect changed effect id for
quite some time.
This has been discussed on:
http://www.mail-archive.com/linux-input@vger.kernel.org/msg08513.html
("ff-core effect id handling in case of a failed effect upload")
Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Elias Vanderstuyft <elias.vds@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Alex Elder [Tue, 25 Mar 2014 13:36:02 +0000 (15:36 +0200)]
rbd: drop an unsafe assertion
Olivier Bonvalet reported having repeated crashes due to a failed
assertion he was hitting in rbd_img_obj_callback():
Assertion failure in rbd_img_obj_callback() at line 2165:
rbd_assert(which >= img_request->next_completion);
With a lot of help from Olivier with reproducing the problem
we were able to determine the object and image requests had
already been completed (and often freed) at the point the
assertion failed.
There was a great deal of discussion on the ceph-devel mailing list
about this. The problem only arose when there were two (or more)
object requests in an image request, and the problem was always
seen when the second request was being completed.
The problem is due to a race in the window between setting the
"done" flag on an object request and checking the image request's
next completion value. When the first object request completes, it
checks to see if its successor request is marked "done", and if
so, that request is also completed. In the process, the image
request's next_completion value is updated to reflect that both
the first and second requests are completed. By the time the
second request is able to check the next_completion value, it
has been set to a value *greater* than its own "which" value,
which caused an assertion to fail.
Fix this problem by skipping over any completion processing
unless the completing object request is the next one expected.
Test only for inequality (not >=), and eliminate the bad
assertion.
Tested-by: Olivier Bonvalet <ob@daevel.fr>
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Linus Torvalds [Fri, 28 Mar 2014 22:09:37 +0000 (15:09 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) We've discovered a common error in several networking drivers, they
put VLAN offload features into ->vlan_features, which would suggest
that they support offloading 2 or more levels of VLAN encapsulation.
Not only do these devices not do that, but we don't have the
infrastructure yet to handle that at all.
Fixes from Vlad Yasevich.
2) Fix tcpdump crash with bridging and vlans, also from Vlad.
3) Some MAINTAINERS updates for random32 and bonding.
4) Fix late reseeds of prandom generator, from Sasha Levin.
5) Bridge doesn't handle stacked vlans properly, fix from Toshiaki
Makita.
6) Fix deadlock in openvswitch, from Flavio Leitner.
7) get_timewait4_sock() doesn't report delay times correctly, fix from
Eric Dumazet.
8) Duplicate address detection and addrconf verification need to run in
contexts where RTNL can be obtained. Move them to run from a
workqueue. From Hannes Frederic Sowa.
9) Fix route refcount leaking in ip tunnels, from Pravin B Shelar.
10) Don't return -EINTR from non-blocking recvmsg() on AF_UNIX sockets,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
vlan: Warn the user if lowerdev has bad vlan features.
veth: Turn off vlan rx acceleration in vlan_features
ifb: Remove vlan acceleration from vlan_features
qlge: Do not propaged vlan tag offloads to vlans
bridge: Fix crash with vlan filtering and tcpdump
net: Account for all vlan headers in skb_mac_gso_segment
MAINTAINERS: bonding: change email address
MAINTAINERS: bonding: change email address
ipv6: move DAD and addrconf_verify processing to workqueue
tcp: fix get_timewait4_sock() delay computation on 64bit
openvswitch: fix a possible deadlock and lockdep warning
bridge: Fix handling stacked vlan tags
bridge: Fix inabillity to retrieve vlan tags when tx offload is disabled
vhost: validate vhost_get_vq_desc return value
vhost: fix total length when packets are too short
random32: avoid attempt to late reseed if in the middle of seeding
random32: assign to network folks in MAINTAINERS
net/mlx4_core: pass pci_device_id.driver_data to __mlx4_init_one during reset
core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
vlan: Set hard_header_len according to available acceleration
...
David S. Miller [Fri, 28 Mar 2014 21:17:16 +0000 (17:17 -0400)]
Merge branch 'vlan_offloads'
Vlad Yasevich says:
====================
Audit all drivers for correct vlan_features.
Some drivers set vlan acceleration features in vlan_features. This causes
issues with Q-in-Q/802.1ad configurations.
Audit all the drivers for correct vlan_features. Fix broken ones.
Add a warning to vlan code to help catch future offenders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Fri, 28 Mar 2014 02:14:49 +0000 (22:14 -0400)]
vlan: Warn the user if lowerdev has bad vlan features.
Some drivers incorrectly assign vlan acceleration features to
vlan_features thus causing issues for Q-in-Q vlan configurations.
Warn the user of such cases.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Fri, 28 Mar 2014 02:14:48 +0000 (22:14 -0400)]
veth: Turn off vlan rx acceleration in vlan_features
For completeness, turn off vlan rx acceleration in vlan_features so
that it doesn't show up on q-in-q setups.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Fri, 28 Mar 2014 02:14:47 +0000 (22:14 -0400)]
ifb: Remove vlan acceleration from vlan_features
Do not include vlan acceleration features in vlan_features as that
precludes correct Q-in-Q operation.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Fri, 28 Mar 2014 02:14:46 +0000 (22:14 -0400)]
qlge: Do not propaged vlan tag offloads to vlans
qlge driver turns off NETIF_F_HW_CTAG_FILTER, but forgets to
turn off HW_CTAG_TX and HW_CTAG_RX on vlan devices. With the
current settings, q-in-q will only generate a single vlan header.
Remember to mask off CTAG_TX and CTAG_RX features in vlan_features.
CC: Shahed Shaikh <shahed.shaikh@qlogic.com>
CC: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
CC: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Fri, 28 Mar 2014 01:51:18 +0000 (21:51 -0400)]
bridge: Fix crash with vlan filtering and tcpdump
When the vlan filtering is enabled on the bridge, but
the filter is not configured on the bridge device itself,
running tcpdump on the bridge device will result in a
an Oops with NULL pointer dereference. The reason
is that br_pass_frame_up() will bypass the vlan
check because promisc flag is set. It will then try
to get the table pointer and process the packet based
on the table. Since the table pointer is NULL, we oops.
Catch this special condition in br_handle_vlan().
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Thu, 27 Mar 2014 21:26:18 +0000 (17:26 -0400)]
net: Account for all vlan headers in skb_mac_gso_segment
skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb. However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers. That may not
always be the case. If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.
A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated. This way skb_mac_gso_segment
will correctly skip all mac headers.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>