GitHub/LineageOS/android_kernel_motorola_exynos9610.git
7 years agof2fs: use rw_semaphore to protect SIT cache
Chao Yu [Mon, 30 Oct 2017 09:49:53 +0000 (17:49 +0800)]
f2fs: use rw_semaphore to protect SIT cache

There are some cases user didn't update SIT cache under this lock,
so let's use rw_semaphore instead of mutex to enhance concurrently
accessing.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support quota sys files
Jaegeuk Kim [Fri, 6 Oct 2017 16:14:28 +0000 (09:14 -0700)]
f2fs: support quota sys files

This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: add quota_ino feature infra
Jaegeuk Kim [Fri, 6 Oct 2017 04:03:06 +0000 (21:03 -0700)]
f2fs: add quota_ino feature infra

This patch adds quota_ino feature infra to be used for quota files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: optimize __update_nat_bits
Fan Li [Mon, 30 Oct 2017 07:19:48 +0000 (15:19 +0800)]
f2fs: optimize __update_nat_bits

Make three modification for __update_nat_bits:
1. Take the codes of dealing the nat with nid 0 out of the loop
    Such nat only needs to be dealt with once at beginning.
2. Use " nat_index == 0" instead of " start_nid == 0" to decide if it's the first nat block
    It's better that we don't assume @start_nid is the first nid of the nat block it's in.
3. Use " if (nat_blk->entries[i].block_addr != NULL_ADDR)" to explicitly comfirm the value of block_addr
    use constant to make sure the codes is right, even if the value of NULL_ADDR changes.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: modify for accurate fggc node io stat
Yunlei He [Mon, 30 Oct 2017 06:18:55 +0000 (14:18 +0800)]
f2fs: modify for accurate fggc node io stat

modify for accurate fggc node io stat

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoRevert "f2fs: handle dirty segments inside refresh_sit_entry"
Yunlong Song [Mon, 30 Oct 2017 01:33:41 +0000 (09:33 +0800)]
Revert "f2fs: handle dirty segments inside refresh_sit_entry"

This reverts commit 5e443818fa0b2a2845561ee25bec181424fb2889

The commit should be reverted because call sequence of below two parts
of code must be kept:
a. update sit information, it needs to be updated before segment
allocation since latter allocation may trigger SSR, and SSR allocation
needs latest valid block information of all segments.
b. update segment status, it needs to be updated after segment allocation
since we can skip updating current opened segment status.

Fixes: 5e443818fa0b ("f2fs: handle dirty segments inside refresh_sit_entry")
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: remove refresh_sit_entry function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: add a function to move nid
Fan Li [Sat, 28 Oct 2017 11:03:37 +0000 (19:03 +0800)]
f2fs: add a function to move nid

This patch add a new function to move nid from one state to another.
Move operation is heavily used, by adding a new function for it
we can cut down some branches from several flow.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: export SSR allocation threshold
Chao Yu [Sat, 28 Oct 2017 08:52:33 +0000 (16:52 +0800)]
f2fs: export SSR allocation threshold

This patch exports min_ssr_segments threshold in sysfs to let user
control triggering SSR allocation flexibly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: give correct trimmed blocks in fstrim
Chao Yu [Sat, 28 Oct 2017 08:52:32 +0000 (16:52 +0800)]
f2fs: give correct trimmed blocks in fstrim

We have supported to issue discard in specified range during fstrim,
it needs to return caller with successfully trimmed bytes in that
range instead of bytes of invalid blocks which are scanned in
checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support bio allocation error injection
Chao Yu [Sat, 28 Oct 2017 08:52:31 +0000 (16:52 +0800)]
f2fs: support bio allocation error injection

This patch adds to support bio allocation error injection to simulate
out-of-memory test scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support get_page error injection
Chao Yu [Sat, 28 Oct 2017 08:52:30 +0000 (16:52 +0800)]
f2fs: support get_page error injection

This patch adds to support get_page error injection to simulate
out-of-memory test scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: add missing sysfs description
Chao Yu [Sat, 28 Oct 2017 08:52:29 +0000 (16:52 +0800)]
f2fs: add missing sysfs description

There are some missing sysfs entries' description in document, add them.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support soft block reservation
Yunlong Song [Fri, 27 Oct 2017 12:45:05 +0000 (20:45 +0800)]
f2fs: support soft block reservation

It supports to extend reserved_blocks sysfs interface to be soft
threshold, which allows user configure it exceeding current available
user space. This patch also introduces a new sysfs interface called
current_reserved_blocks, which shows the current blocks that have
already been reserved.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: handle error case when adding xattr entry
Jaegeuk Kim [Mon, 16 Oct 2017 22:05:16 +0000 (15:05 -0700)]
f2fs: handle error case when adding xattr entry

This patch fixes recovering incomplete xattr entries remaining in inline xattr
and xattr block, caused by any kind of errors.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support flexible inline xattr size
Chao Yu [Wed, 6 Sep 2017 13:59:50 +0000 (21:59 +0800)]
f2fs: support flexible inline xattr size

Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.

In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.

So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.

To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.

Inode disk layout:
  +----------------------+
  | .i_mode              |
  | ...                  |
  | .i_ext               |
  +----------------------+
  | .i_extra_isize       |
  | .i_inline_xattr_size |-----------+
  | ...                  |           |
  +----------------------+           |
  | .i_addr              |           |
  |  - block address or  |           |
  |  - inline data       |           |
  +----------------------+<---+      v
  |    inline xattr      |    +---inline xattr range
  +----------------------+<---+
  | .i_nid               |
  +----------------------+
  |   node_footer        |
  | (nid, ino, offset)   |
  +----------------------+

Note that, we have to cnosider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.

Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: show current cp state
Jaegeuk Kim [Thu, 26 Oct 2017 08:31:22 +0000 (10:31 +0200)]
f2fs: show current cp state

This patch shows whether checkpoint met any error case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: add missing quota_initialize
Jaegeuk Kim [Mon, 23 Oct 2017 21:50:15 +0000 (23:50 +0200)]
f2fs: add missing quota_initialize

This patch adds to call quota_intialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: show # of dirty segments via sysfs
Jaegeuk Kim [Tue, 24 Oct 2017 07:46:54 +0000 (09:46 +0200)]
f2fs: show # of dirty segments via sysfs

This patch adds one sysfs entry to show # of dirty segments which can be
used for gc timing by user.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: stop all the operations by cp_error flag
Jaegeuk Kim [Mon, 23 Oct 2017 21:48:49 +0000 (23:48 +0200)]
f2fs: stop all the operations by cp_error flag

This patch replaces to use cp_error flag instead of RDONLY for quota off.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: remove several redundant assignments
Colin Ian King [Thu, 19 Oct 2017 10:58:21 +0000 (12:58 +0200)]
f2fs: remove several redundant assignments

There are several assignments to variables that are redundant
as the values are never read when the variables are updated later
and so the redundant statements can be safely removed.

Cleans up clang warnings:
fs/f2fs/segment.c:923:19: warning: Value stored to 'p' during its initialization is never read
fs/f2fs/segment.c:2060:2: warning: Value stored to 'hint' is never read
fs/f2fs/segment.c:2353:2: warning: Value stored to 'start_block' is never read
fs/f2fs/segment.c:2354:2: warning: Value stored to 'end_block' is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: avoid using timespec
Arnd Bergmann [Thu, 19 Oct 2017 09:52:47 +0000 (11:52 +0200)]
f2fs: avoid using timespec

All uses of timespec are deprecated, and this one is not particularly
useful, as the documented method for converting seconds to jiffies
is to multiply by 'HZ'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: fix to correct no_fggc_candidate
Chao Yu [Wed, 18 Oct 2017 02:34:14 +0000 (10:34 +0800)]
f2fs: fix to correct no_fggc_candidate

There may be extreme case as below:

For one section contains one segment, and there are total 100 segments
with 10% over-privision ratio in f2fs partition, fggc_threshold will
be rounded down to 460 instead of 460.8 as below caclulation:

sbi->fggc_threshold = div_u64((u64)(main_count - ovp_count) *
BLKS_PER_SEC(sbi), (main_count - resv_count));

If section usage is as:
60 segments which contain 460 valid blocks
40 segments which contain 462 valid blocks

As valid block number in all sections is large than fggc_threshold, so
none of them will be chosen as candidate due to incorrect fggc_threshold.

Let's just soften the term of choosing foreground GC candidates.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoRevert "f2fs: return wrong error number on f2fs_quota_write"
Jaegeuk Kim [Thu, 19 Oct 2017 19:07:11 +0000 (12:07 -0700)]
Revert "f2fs: return wrong error number on f2fs_quota_write"

This reverts commit 4f31d26b0c17f2aae6a6afeb823a87e20671ab4b.

It turns out that we need to report error number if nothing was written.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: remove obsolete pointer for truncate_xattr_node
Jaegeuk Kim [Thu, 19 Oct 2017 18:48:57 +0000 (11:48 -0700)]
f2fs: remove obsolete pointer for truncate_xattr_node

This patch removes obosolete parameter for truncate_xattr_node.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: retry ENOMEM for quota_read|write
Jaegeuk Kim [Thu, 19 Oct 2017 16:43:56 +0000 (09:43 -0700)]
f2fs: retry ENOMEM for quota_read|write

This gives another chance to read or write quota data.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: limit # of inmemory pages
Jaegeuk Kim [Thu, 19 Oct 2017 02:05:57 +0000 (19:05 -0700)]
f2fs: limit # of inmemory pages

If some abnormal users try lots of atomic write operations, f2fs is able to
produce pinned pages in the main memory which affects system performance.
This patch limits that as 20% over total memory size, and if f2fs reaches
to the limit, it will drop all the inmemory pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: update ctx->pos correctly when hitting hole in directory
Chao Yu [Fri, 13 Oct 2017 10:01:36 +0000 (18:01 +0800)]
f2fs: update ctx->pos correctly when hitting hole in directory

This patch fixes to update ctx->pos correctly when hitting hole in
directory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: relocate readahead codes in readdir()
Chao Yu [Fri, 13 Oct 2017 10:01:35 +0000 (18:01 +0800)]
f2fs: relocate readahead codes in readdir()

Previously, for large directory, we just do readahead only once in
readdir(), readdir()'s performance may drop when traversing latter
blocks. In order to avoid this, relocate readahead codes to covering
all traverse flow.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: allow readdir() to be interrupted
Chao Yu [Fri, 13 Oct 2017 10:01:34 +0000 (18:01 +0800)]
f2fs: allow readdir() to be interrupted

This patch follows ext4 to allow readdir() in large empty directory to
be interrupted. Referenced commit of ext4: 1f60fbe72749 ("ext4: allow
readdir()'s of large empty directories to be interrupted").

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: trace f2fs_readdir
Chao Yu [Fri, 13 Oct 2017 10:01:33 +0000 (18:01 +0800)]
f2fs: trace f2fs_readdir

This patch adds trace for f2fs_readdir.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: trace f2fs_lookup
Chao Yu [Tue, 17 Oct 2017 09:33:41 +0000 (17:33 +0800)]
f2fs: trace f2fs_lookup

This patch adds trace for f2fs_lookup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: skip searching non-exist range in truncate_hole
Weichao Guo [Sat, 14 Oct 2017 00:13:32 +0000 (08:13 +0800)]
f2fs: skip searching non-exist range in truncate_hole

Let's skip entire non-exist area to speed up truncate_hole by
using get_next_page_offset.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: avoid stale fi->gdirty_list pointer
Jaegeuk Kim [Fri, 13 Oct 2017 02:12:53 +0000 (19:12 -0700)]
f2fs: avoid stale fi->gdirty_list pointer

When doing fault injection test, f2fs_evict_inode() didn't remove gdirty_list
which incurs a kernel panic due to wrong pointer access.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs/crypto: drop crypto key at evict_inode only
Jaegeuk Kim [Sat, 7 Oct 2017 07:08:05 +0000 (00:08 -0700)]
f2fs/crypto: drop crypto key at evict_inode only

This patch avoids dropping crypto key in f2fs_drop_inode, so we can guarantee
it happens only at evict_inode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: fix to avoid race when accessing last_disk_size
Chao Yu [Mon, 9 Oct 2017 09:55:19 +0000 (17:55 +0800)]
f2fs: fix to avoid race when accessing last_disk_size

last_disk_size could be wrong due to concurrently updating, so using
i_sem semaphore to make last_disk_size updating exclusive to fix this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: Fix bool initialization/comparison
Thomas Meyer [Sat, 7 Oct 2017 14:02:21 +0000 (16:02 +0200)]
f2fs: Fix bool initialization/comparison

Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: give up CP_TRIMMED_FLAG if it drops discards
Chao Yu [Wed, 4 Oct 2017 01:08:37 +0000 (09:08 +0800)]
f2fs: give up CP_TRIMMED_FLAG if it drops discards

In ->umount, once we drop remained discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: trace f2fs_remove_discard
Chao Yu [Wed, 4 Oct 2017 01:08:36 +0000 (09:08 +0800)]
f2fs: trace f2fs_remove_discard

This patch adds tracepoint to trace f2fs_remove_discard.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: reduce cmd_lock coverage in __issue_discard_cmd
Chao Yu [Wed, 4 Oct 2017 01:08:35 +0000 (09:08 +0800)]
f2fs: reduce cmd_lock coverage in __issue_discard_cmd

__submit_discard_cmd may lead long latency due to exhaustion of I/O
request resource in block layer, so issuing all discard under cmd_lock
may lead to hangtask, in order to avoid that, let's reduce it's coverage.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: split discard policy
Chao Yu [Wed, 4 Oct 2017 01:08:34 +0000 (09:08 +0800)]
f2fs: split discard policy

There are many different scenarios such as fstrim, umount, urgent or
background where we will issue discards, actually, they need use
different policy in aspect of io aware, discard granularity, delay
interval and so on. But now they just share one common discard policy,
so there will be race when changing policy in between these scenarios,
the interference of changing discard policy will be very serious.

This patch changes to split discard policy for different scenarios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: wrap discard policy
Chao Yu [Wed, 4 Oct 2017 01:08:33 +0000 (09:08 +0800)]
f2fs: wrap discard policy

This patch wraps scattered optional parameters into discard policy as
below, later, with it we expect that we can adjust these parameters with
proper strategy in different scenario.

struct discard_policy {
unsigned int min_interval; /* used for candidates exist */
unsigned int max_interval; /* used for candidates not exist */
unsigned int max_requests; /* # of discards issued per round */
unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */
bool io_aware; /* issue discard in idle time */
bool sync; /* submit discard with REQ_SYNC flag */
};

This patch doesn't change any logic of codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: support issuing/waiting discard in range
Chao Yu [Wed, 4 Oct 2017 01:08:32 +0000 (09:08 +0800)]
f2fs: support issuing/waiting discard in range

Fstrim intends to trim invalid blocks of filesystem only with specified
range and granularity, but actually, it will issue all previous cached
discard commands which may be out-of-range and be with unmatched
granularity, it's unneeded.

In order to fix above issues, this patch introduces new helps to support
to issue and wait discard in range and adds a new fstrim_list for tracking
in-flight discard from ->fstrim.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: fix to flush multiple device in checkpoint
Chao Yu [Fri, 29 Sep 2017 05:59:39 +0000 (13:59 +0800)]
f2fs: fix to flush multiple device in checkpoint

If f2fs manages multiple devices, in checkpoint, we need to issue flush
in those devices which contain dirty data/node in their cache before
we write checkpoint region, otherwise, filesystem metadata could be
corrupted if hitting SPO after checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: enhance multiple device flush
Chao Yu [Fri, 29 Sep 2017 05:59:38 +0000 (13:59 +0800)]
f2fs: enhance multiple device flush

When multiple device feature is enabled, during ->fsync we will issue
flush in all devices to make sure node/data of the file being persisted
into storage. But some flushes of device could be unneeded as file's
data may be not writebacked into those devices. So this patch adds and
manage bitmap per inode in global cache to indicate which device is
dirty and it needs to issue flush during ->fsync, hence, we could improve
performance of fsync in scenario of multiple device.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: fix to show ino management cache size correctly
Chao Yu [Fri, 29 Sep 2017 05:59:37 +0000 (13:59 +0800)]
f2fs: fix to show ino management cache size correctly

It needs to stat size of ino management cache with all type instead of
orphan ino type.

Fixes: 652be55162dc ("f2fs: show # of orphan inodes")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush
Chao Yu [Fri, 29 Sep 2017 05:59:36 +0000 (13:59 +0800)]
f2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush

If we failed to issue flush in ->fsync, we need to keep FI_UPDATE_WRITE
flag to make sure triggering flush in next ->fsync.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: obsolete ALLOC_NID_LIST list
Chao Yu [Fri, 29 Sep 2017 05:59:35 +0000 (13:59 +0800)]
f2fs: obsolete ALLOC_NID_LIST list

As Fan Li reported, there is no user traversing nid_list[ALLOC_NID_LIST]
which is used for tracking preallocated nids. Let's drop it, and only
track preallocated nids in free_nid_root radix-tree.

Reported-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: convert inline data for direct I/O & FI_NO_PREALLOC
Weichao Guo [Fri, 29 Sep 2017 14:43:23 +0000 (22:43 +0800)]
f2fs: convert inline data for direct I/O & FI_NO_PREALLOC

In FI_NO_PREALLOC cases, direct I/O path may allocate blocks for an
inode but keep its inline data flag. This inconsistency may trigger
vfs clear_inode nrpages bug_on when evicting the inode. We should
convert inline data first in this case.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: allow readpages with NULL file pointer
Hsiang Kao [Sat, 23 Sep 2017 18:45:42 +0000 (02:45 +0800)]
f2fs: allow readpages with NULL file pointer

Keep in line with the other Linux file system implementations
since page_cache_sync_readahead supports NULL file pointer,
and thus we can readahead data by f2fs itself without file opening
(something like the btrfs behavior).

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: show flush list status in sysfs
Chao Yu [Thu, 14 Sep 2017 02:18:01 +0000 (10:18 +0800)]
f2fs: show flush list status in sysfs

This patch adds to show flush list status in sysfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: introduce read_xattr_block
Chao Yu [Mon, 4 Sep 2017 10:58:03 +0000 (18:58 +0800)]
f2fs: introduce read_xattr_block

Commit ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_xattr_block to clean up redundant codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: introduce read_inline_xattr
Chao Yu [Mon, 4 Sep 2017 10:58:02 +0000 (18:58 +0800)]
f2fs: introduce read_inline_xattr

Commit ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_inline_xattr to clean up redundant codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoRevert "f2fs: reuse nids more aggressively"
Chao Yu [Mon, 25 Sep 2017 06:17:51 +0000 (14:17 +0800)]
Revert "f2fs: reuse nids more aggressively"

Commit 268344664603 ("f2fs: reuse nids more aggressively") tries to
reuse nids as many as possilbe, in order to mitigate producing obsolete
node pages in page cache.

But acutally, before we reuse the nids and related node page cache,
we will always invalidate that node page, so there will be not any
obsolete node pages in cache.

Let's just revert previous commit, so that nm_i::next_scan_nid can be
increased ascendingly, making __build_free_nids traverses all NAT pages
more easily, finally, free nid bitmap cache can be enabled as soon as
possible.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoRevert "f2fs: node segment is prior to data segment selected victim"
Yunlong Song [Sat, 23 Sep 2017 09:02:18 +0000 (17:02 +0800)]
Revert "f2fs: node segment is prior to data segment selected victim"

This reverts commit b9cd20619e359d199b755543474c3d853c8e3415.

That patch causes much fewer node segments (which can be used for SSR)
than before, and in the corner case (e.g. create and delete *.txt files in
one same directory, there will be very few node segments but many data
segments), if the reserved free segments are all used up during gc, then
the write_checkpoint can still flush dentry pages to data ssr segments,
but will probably fail to flush node pages to node ssr segments, since
there are not enough node ssr segments left (the left ones are all
full).

So revert this patch to give a fair chance to let node segments remain
for SSR, which provides more robustness for corner cases.

Conflicts:
fs/f2fs/gc.c

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoLinux 4.14.7
Greg Kroah-Hartman [Sun, 17 Dec 2017 14:08:14 +0000 (15:08 +0100)]
Linux 4.14.7

7 years agodvb_frontend: don't use-after-free the frontend struct
Mauro Carvalho Chehab [Tue, 7 Nov 2017 13:39:39 +0000 (08:39 -0500)]
dvb_frontend: don't use-after-free the frontend struct

commit b1cb7372fa822af6c06c8045963571d13ad6348b upstream.

dvb_frontend_invoke_release() may free the frontend struct.
So, the free logic can't update it anymore after calling it.

That's OK, as __dvb_frontend_free() is called only when the
krefs are zeroed, so nobody is using it anymore.

That should fix the following KASAN error:

The KASAN report looks like this (running on kernel 3e0cc09a3a2c40ec1ffb6b4e12da86e98feccb11 (4.14-rc5+)):
==================================================================
BUG: KASAN: use-after-free in __dvb_frontend_free+0x113/0x120
Write of size 8 at addr ffff880067d45a00 by task kworker/0:1/24

CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.14.0-rc5-43687-g06ab8a23e0e6 #545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 print_address_description+0x78/0x280 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351
 kasan_report+0x23d/0x350 mm/kasan/report.c:409
 __asan_report_store8_noabort+0x1c/0x20 mm/kasan/report.c:435
 __dvb_frontend_free+0x113/0x120 drivers/media/dvb-core/dvb_frontend.c:156
 dvb_frontend_put+0x59/0x70 drivers/media/dvb-core/dvb_frontend.c:176
 dvb_frontend_detach+0x120/0x150 drivers/media/dvb-core/dvb_frontend.c:2803
 dvb_usb_adapter_frontend_exit+0xd6/0x160 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:340
 dvb_usb_adapter_exit drivers/media/usb/dvb-usb/dvb-usb-init.c:116
 dvb_usb_exit+0x9b/0x200 drivers/media/usb/dvb-usb/dvb-usb-init.c:132
 dvb_usb_device_exit+0xa5/0xf0 drivers/media/usb/dvb-usb/dvb-usb-init.c:295
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f1/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
 worker_thread+0x221/0x1850 kernel/workqueue.c:2253
 kthread+0x363/0x440 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

Allocated by task 24:
 save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kmem_cache_alloc_trace+0x11e/0x2d0 mm/slub.c:2772
 kmalloc ./include/linux/slab.h:493
 kzalloc ./include/linux/slab.h:666
 dtt200u_fe_attach+0x4c/0x110 drivers/media/usb/dvb-usb/dtt200u-fe.c:212
 dtt200u_frontend_attach+0x35/0x80 drivers/media/usb/dvb-usb/dtt200u.c:136
 dvb_usb_adapter_frontend_init+0x32b/0x660 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:286
 dvb_usb_adapter_init drivers/media/usb/dvb-usb/dvb-usb-init.c:86
 dvb_usb_init drivers/media/usb/dvb-usb/dvb-usb-init.c:162
 dvb_usb_device_init+0xf73/0x17f0 drivers/media/usb/dvb-usb/dvb-usb-init.c:277
 dtt200u_usb_probe+0xa1/0xe0 drivers/media/usb/dvb-usb/dtt200u.c:155
 usb_probe_interface+0x35d/0x8e0 drivers/usb/core/driver.c:361
 really_probe drivers/base/dd.c:413
 driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
 __device_attach_driver+0x230/0x290 drivers/base/dd.c:653
 bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
 __device_attach+0x26b/0x3c0 drivers/base/dd.c:710
 device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
 bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
 device_add+0xd0b/0x1660 drivers/base/core.c:1835
 usb_set_configuration+0x104e/0x1870 drivers/usb/core/message.c:1932
 generic_probe+0x73/0xe0 drivers/usb/core/generic.c:174
 usb_probe_device+0xaf/0xe0 drivers/usb/core/driver.c:266
 really_probe drivers/base/dd.c:413
 driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
 __device_attach_driver+0x230/0x290 drivers/base/dd.c:653
 bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
 __device_attach+0x26b/0x3c0 drivers/base/dd.c:710
 device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
 bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
 device_add+0xd0b/0x1660 drivers/base/core.c:1835
 usb_new_device+0x7b8/0x1020 drivers/usb/core/hub.c:2457
 hub_port_connect drivers/usb/core/hub.c:4903
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x194d/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
 worker_thread+0x221/0x1850 kernel/workqueue.c:2253
 kthread+0x363/0x440 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

Freed by task 24:
 save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459
 kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
 slab_free_hook mm/slub.c:1390
 slab_free_freelist_hook mm/slub.c:1412
 slab_free mm/slub.c:2988
 kfree+0xf6/0x2f0 mm/slub.c:3919
 dtt200u_fe_release+0x3c/0x50 drivers/media/usb/dvb-usb/dtt200u-fe.c:202
 dvb_frontend_invoke_release.part.13+0x1c/0x30 drivers/media/dvb-core/dvb_frontend.c:2790
 dvb_frontend_invoke_release drivers/media/dvb-core/dvb_frontend.c:2789
 __dvb_frontend_free+0xad/0x120 drivers/media/dvb-core/dvb_frontend.c:153
 dvb_frontend_put+0x59/0x70 drivers/media/dvb-core/dvb_frontend.c:176
 dvb_frontend_detach+0x120/0x150 drivers/media/dvb-core/dvb_frontend.c:2803
 dvb_usb_adapter_frontend_exit+0xd6/0x160 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:340
 dvb_usb_adapter_exit drivers/media/usb/dvb-usb/dvb-usb-init.c:116
 dvb_usb_exit+0x9b/0x200 drivers/media/usb/dvb-usb/dvb-usb-init.c:132
 dvb_usb_device_exit+0xa5/0xf0 drivers/media/usb/dvb-usb/dvb-usb-init.c:295
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f1/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
 worker_thread+0x221/0x1850 kernel/workqueue.c:2253
 kthread+0x363/0x440 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

The buggy address belongs to the object at ffff880067d45500
 which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 1280 bytes inside of
 2048-byte region [ffff880067d45500ffff880067d45d00)
The buggy address belongs to the page:
page:ffffea00019f5000 count:1 mapcount:0 mapping:          (null)
index:0x0 compound_mapcount: 0
flags: 0x100000000008100(slab|head)
raw: 0100000000008100 0000000000000000 0000000000000000 00000001000f000f
raw: dead000000000100 dead000000000200 ffff88006c002d80 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880067d45900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880067d45980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880067d45a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff880067d45a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880067d45b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: ead666000a5f ("media: dvb_frontend: only use kref after initialized")

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Matthias Schwarzott <zzam@gentoo.org>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agomedia: dvb-core: always call invoke_release() in fe_free()
Daniel Scheller [Sun, 29 Oct 2017 15:43:22 +0000 (11:43 -0400)]
media: dvb-core: always call invoke_release() in fe_free()

commit 62229de19ff2b7f3e0ebf4d48ad99061127d0281 upstream.

Follow-up to: ead666000a5f ("media: dvb_frontend: only use kref after initialized")

The aforementioned commit fixed refcount OOPSes when demod driver attaching
succeeded but tuner driver didn't. However, the use count of the attached
demod drivers don't go back to zero and thus couldn't be cleanly unloaded.
Improve on this by calling dvb_frontend_invoke_release() in
__dvb_frontend_free() regardless of fepriv being NULL, instead of returning
when fepriv is NULL. This is safe to do since _invoke_release() will check
for passed pointers being valid before calling the .release() function.

[mchehab@s-opensource.com: changed the logic a little bit to reduce
 conflicts with another bug fix patch under review]
Fixes: ead666000a5f ("media: dvb_frontend: only use kref after initialized")
Signed-off-by: Daniel Scheller <d.scheller@gmx.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agox86/intel_rdt: Fix potential deadlock during resctrl unmount
Reinette Chatre [Fri, 20 Oct 2017 09:16:58 +0000 (02:16 -0700)]
x86/intel_rdt: Fix potential deadlock during resctrl unmount

[ Upstream commit 36b6f9fcb8928c06b6638a4cf91bc9d69bb49aa2 ]

Lockdep warns about a potential deadlock:

[   66.782842] ======================================================
[   66.782888] WARNING: possible circular locking dependency detected
[   66.782937] 4.14.0-rc2-test-test+ #48 Not tainted
[   66.782983] ------------------------------------------------------
[   66.783052] umount/336 is trying to acquire lock:
[   66.783117]  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff81032395>] rdt_kill_sb+0x215/0x390
[   66.783193]
               but task is already holding lock:
[   66.783244]  (rdtgroup_mutex){+.+.}, at: [<ffffffff810321b6>] rdt_kill_sb+0x36/0x390
[   66.783305]
               which lock already depends on the new lock.

[   66.783364]
               the existing dependency chain (in reverse order) is:
[   66.783419]
               -> #3 (rdtgroup_mutex){+.+.}:
[   66.783467]        __lock_acquire+0x1293/0x13f0
[   66.783509]        lock_acquire+0xaf/0x220
[   66.783543]        __mutex_lock+0x71/0x9b0
[   66.783575]        mutex_lock_nested+0x1b/0x20
[   66.783610]        intel_rdt_online_cpu+0x3b/0x430
[   66.783649]        cpuhp_invoke_callback+0xab/0x8e0
[   66.783687]        cpuhp_thread_fun+0x7a/0x150
[   66.783722]        smpboot_thread_fn+0x1cc/0x270
[   66.783764]        kthread+0x16e/0x190
[   66.783794]        ret_from_fork+0x27/0x40
[   66.783825]
               -> #2 (cpuhp_state){+.+.}:
[   66.783870]        __lock_acquire+0x1293/0x13f0
[   66.783906]        lock_acquire+0xaf/0x220
[   66.783938]        cpuhp_issue_call+0x102/0x170
[   66.783974]        __cpuhp_setup_state_cpuslocked+0x154/0x2a0
[   66.784023]        __cpuhp_setup_state+0xc7/0x170
[   66.784061]        page_writeback_init+0x43/0x67
[   66.784097]        pagecache_init+0x43/0x4a
[   66.784131]        start_kernel+0x3ad/0x3f7
[   66.784165]        x86_64_start_reservations+0x2a/0x2c
[   66.784204]        x86_64_start_kernel+0x72/0x75
[   66.784241]        verify_cpu+0x0/0xfb
[   66.784270]
               -> #1 (cpuhp_state_mutex){+.+.}:
[   66.784319]        __lock_acquire+0x1293/0x13f0
[   66.784355]        lock_acquire+0xaf/0x220
[   66.784387]        __mutex_lock+0x71/0x9b0
[   66.784419]        mutex_lock_nested+0x1b/0x20
[   66.784454]        __cpuhp_setup_state_cpuslocked+0x52/0x2a0
[   66.784497]        __cpuhp_setup_state+0xc7/0x170
[   66.784535]        page_alloc_init+0x28/0x30
[   66.784569]        start_kernel+0x148/0x3f7
[   66.784602]        x86_64_start_reservations+0x2a/0x2c
[   66.784642]        x86_64_start_kernel+0x72/0x75
[   66.784678]        verify_cpu+0x0/0xfb
[   66.784707]
               -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   66.784759]        check_prev_add+0x32f/0x6e0
[   66.784794]        __lock_acquire+0x1293/0x13f0
[   66.784830]        lock_acquire+0xaf/0x220
[   66.784863]        cpus_read_lock+0x3d/0xb0
[   66.784896]        rdt_kill_sb+0x215/0x390
[   66.784930]        deactivate_locked_super+0x3e/0x70
[   66.784968]        deactivate_super+0x40/0x60
[   66.785003]        cleanup_mnt+0x3f/0x80
[   66.785034]        __cleanup_mnt+0x12/0x20
[   66.785070]        task_work_run+0x8b/0xc0
[   66.785103]        exit_to_usermode_loop+0x94/0xa0
[   66.786804]        syscall_return_slowpath+0xe8/0x150
[   66.788502]        entry_SYSCALL_64_fastpath+0xab/0xad
[   66.790194]
               other info that might help us debug this:

[   66.795139] Chain exists of:
                 cpu_hotplug_lock.rw_sem --> cpuhp_state --> rdtgroup_mutex

[   66.800035]  Possible unsafe locking scenario:

[   66.803267]        CPU0                    CPU1
[   66.804867]        ----                    ----
[   66.806443]   lock(rdtgroup_mutex);
[   66.808002]                                lock(cpuhp_state);
[   66.809565]                                lock(rdtgroup_mutex);
[   66.811110]   lock(cpu_hotplug_lock.rw_sem);
[   66.812608]
                *** DEADLOCK ***

[   66.816983] 2 locks held by umount/336:
[   66.818418]  #0:  (&type->s_umount_key#35){+.+.}, at: [<ffffffff81229738>] deactivate_super+0x38/0x60
[   66.819922]  #1:  (rdtgroup_mutex){+.+.}, at: [<ffffffff810321b6>] rdt_kill_sb+0x36/0x390

When the resctrl filesystem is unmounted the locks should be obtain in the
locks in the same order as was done when the cpus came online:

      cpu_hotplug_lock before rdtgroup_mutex.

This also requires to switch the static_branch_disable() calls to the
_cpulocked variant because now cpu hotplug lock is held already.

[ tglx: Switched to cpus_read_[un]lock ]

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Acked-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/cc292e76be073f7260604651711c47b09fd0dc81.1508490116.git.reinette.chatre@intel.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoRDMA/cxgb4: Annotate r2 and stag as __be32
Leon Romanovsky [Wed, 25 Oct 2017 20:10:19 +0000 (23:10 +0300)]
RDMA/cxgb4: Annotate r2 and stag as __be32

[ Upstream commit 7d7d065a5eec7e218174d5c64a9f53f99ffdb119 ]

Chelsio cxgb4 HW is big-endian, hence there is need to properly
annotate r2 and stag fields as __be32 and not __u32 to fix the
following sparse warnings.

  drivers/infiniband/hw/cxgb4/qp.c:614:16:
    warning: incorrect type in assignment (different base types)
      expected unsigned int [unsigned] [usertype] r2
      got restricted __be32 [usertype] <noident>
  drivers/infiniband/hw/cxgb4/qp.c:615:18:
    warning: incorrect type in assignment (different base types)
      expected unsigned int [unsigned] [usertype] stag
      got restricted __be32 [usertype] <noident>

Cc: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agomd: free unused memory after bitmap resize
Zdenek Kabelac [Wed, 8 Nov 2017 12:44:56 +0000 (13:44 +0100)]
md: free unused memory after bitmap resize

[ Upstream commit 0868b99c214a3d55486c700de7c3f770b7243e7c ]

When bitmap is resized, the old kalloced chunks just are not released
once the resized bitmap starts to use new space.

This fixes in particular kmemleak reports like this one:

unreferenced object 0xffff8f4311e9c000 (size 4096):
  comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
  hex dump (first 32 bytes):
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
  backtrace:
    [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
    [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
    [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
    [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
    [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
    [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
    [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
    [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
    [<ffffffffa6774b52>] dm_resume+0x122/0x140
    [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
    [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
    [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
    [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
    [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
    [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agodm raid: fix panic when attempting to force a raid to sync
Heinz Mauelshagen [Thu, 2 Nov 2017 18:58:28 +0000 (19:58 +0100)]
dm raid: fix panic when attempting to force a raid to sync

[ Upstream commit 233978449074ca7e45d9c959f9ec612d1b852893 ]

Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.

Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoaudit: ensure that 'audit=1' actually enables audit for PID 1
Paul Moore [Fri, 1 Sep 2017 13:44:34 +0000 (09:44 -0400)]
audit: ensure that 'audit=1' actually enables audit for PID 1

[ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ]

Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked.  This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.

This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoaudit: Allow auditd to set pid to 0 to end auditing
Steve Grubb [Tue, 17 Oct 2017 22:29:22 +0000 (18:29 -0400)]
audit: Allow auditd to set pid to 0 to end auditing

[ Upstream commit 33e8a907804428109ce1d12301c3365d619cc4df ]

The API to end auditing has historically been for auditd to set the
pid to 0. This patch restores that functionality.

See: https://github.com/linux-audit/audit-kernel/issues/69

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonvmet-rdma: update queue list during ib_device removal
Israel Rukshin [Sun, 5 Nov 2017 08:43:01 +0000 (08:43 +0000)]
nvmet-rdma: update queue list during ib_device removal

[ Upstream commit 43b92fd27aaef0f529c9321cfebbaec1d7b8f503 ]

A NULL deref happens when nvmet_rdma_remove_one() is called more than once
(e.g. while connected via 2 ports).
The first call frees the queues related to the first ib_device but
doesn't remove them from the queue list.
While calling nvmet_rdma_remove_one() for the second ib_device it goes over
the full queue list again and we get the NULL deref.

Fixes: f1d4ef7d ("nvmet-rdma: register ib_client to not deadlock in device removal")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grmberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoblk-mq: Avoid that request queue removal can trigger list corruption
Bart Van Assche [Wed, 8 Nov 2017 18:23:45 +0000 (10:23 -0800)]
blk-mq: Avoid that request queue removal can trigger list corruption

[ Upstream commit aba7afc5671c23beade64d10caf86e24a9105dab ]

Avoid that removal of a request queue sporadically triggers the
following warning:

list_del corruption. next->prev should be ffff8807d649b970, but was 6b6b6b6b6b6b6b6b
WARNING: CPU: 3 PID: 342 at lib/list_debug.c:56 __list_del_entry_valid+0x92/0xa0
Call Trace:
 process_one_work+0x11b/0x660
 worker_thread+0x3d/0x3b0
 kthread+0x129/0x140
 ret_from_fork+0x27/0x40

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoide: ide-atapi: fix compile error with defining macro DEBUG
Hongxu Jia [Fri, 10 Nov 2017 07:59:17 +0000 (15:59 +0800)]
ide: ide-atapi: fix compile error with defining macro DEBUG

[ Upstream commit 8dc7a31fbce5e2dbbacd83d910da37105181b054 ]

Compile ide-atapi failed with defining macro "DEBUG"
...
|drivers/ide/ide-atapi.c:285:52: error: 'struct request' has
no member named 'cmd'; did you mean 'csd'?
|  debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]);
...

Since we split the scsi_request out of struct request, it missed
do the same thing on debug_log

Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request")

Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoipvlan: fix ipv6 outbound device
Keefe Liu [Thu, 9 Nov 2017 12:09:31 +0000 (20:09 +0800)]
ipvlan: fix ipv6 outbound device

[ Upstream commit ca29fd7cce5a6444d57fb86517589a1a31c759e1 ]

When process the outbound packet of ipv6, we should assign the master
device to output device other than input device.

Signed-off-by: Keefe Liu <liuqifa@huawei.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agopowerpc/powernv/idle: Round up latency and residency values
Vaidyanathan Srinivasan [Wed, 23 Aug 2017 18:58:41 +0000 (00:28 +0530)]
powerpc/powernv/idle: Round up latency and residency values

[ Upstream commit 8d4e10e9ed9450e18fbbf6a8872be0eac9fd4999 ]

On PowerNV platforms, firmware provides exit latency and
target residency for each of the idle states in nano
seconds.  Cpuidle framework expects the values in micro
seconds.  Round up to nearest micro seconds to avoid errors
in cases where the values are defined as fractional micro
seconds.

Default idle state of 'snooze' has exit latency of zero.  If
other states have fractional micro second exit latency, they
would get rounded down to zero micro second and make cpuidle
framework choose deeper idle state when snooze loop is the
right choice.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agokbuild: do not call cc-option before KBUILD_CFLAGS initialization
Masahiro Yamada [Thu, 12 Oct 2017 09:22:25 +0000 (18:22 +0900)]
kbuild: do not call cc-option before KBUILD_CFLAGS initialization

[ Upstream commit 433dc2ebe7d17dd21cba7ad5c362d37323592236 ]

Some $(call cc-option,...) are invoked very early, even before
KBUILD_CFLAGS, etc. are initialized.

The returned string from $(call cc-option,...) depends on
KBUILD_CPPFLAGS, KBUILD_CFLAGS, and GCC_PLUGINS_CFLAGS.

Since they are exported, they are not empty when the top Makefile
is recursively invoked.

The recursion occurs in several places.  For example, the top
Makefile invokes itself for silentoldconfig.  "make tinyconfig",
"make rpm-pkg" are the cases, too.

In those cases, the second call of cc-option from the same line
runs a different shell command due to non-pristine KBUILD_CFLAGS.

To get the same result all the time, KBUILD_* and GCC_PLUGINS_CFLAGS
must be initialized before any call of cc-option.  This avoids
garbage data in the .cache.mk file.

Move all calls of cc-option below the config targets because target
compiler flags are unnecessary for Kconfig.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoKVM: arm/arm64: vgic-its: Preserve the revious read from the pending table
Marc Zyngier [Thu, 16 Nov 2017 17:58:17 +0000 (17:58 +0000)]
KVM: arm/arm64: vgic-its: Preserve the revious read from the pending table

commit 64afe6e9eb4841f35317da4393de21a047a883b3 upstream.

The current pending table parsing code assumes that we keep the
previous read of the pending bits, but keep that variable in
the current block, making sure it is discarded on each loop.

We end-up using whatever is on the stack. Who knows, it might
just be the right thing...

Fixes: 33d3bc9556a7d ("KVM: arm64: vgic-its: Read initial LPI pending table")
Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agofix kcm_clone()
Al Viro [Tue, 5 Dec 2017 23:27:57 +0000 (23:27 +0000)]
fix kcm_clone()

commit a5739435b5a3b8c449f8844ecd71a3b1e89f0a33 upstream.

1) it's fput() or sock_release(), not both
2) don't do fd_install() until the last failure exit.
3) not a bug per se, but... don't attach socket to struct file
   until it's set up.

Take reserving descriptor into the caller, move fd_install() to the
caller, sanitize failure exits and calling conventions.

Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agofcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
Jeff Layton [Tue, 14 Nov 2017 19:42:57 +0000 (14:42 -0500)]
fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall

commit 4d2dc2cc766c3b51929658cacbc6e34fc8e242fb upstream.

Currently, we're capping the values too low in the F_GETLK64 case. The
fields in that structure are 64-bit values, so we shouldn't need to do
any sort of fixup there.

Make sure we check that assumption at build time in the future however
by ensuring that the sizes we're copying will fit.

With this, we no longer need COMPAT_LOFF_T_MAX either, so remove it.

Fixes: 94073ad77fff2 (fs/locks: don't mess with the address limit in compat_fcntl64)
Reported-by: Vitaly Lipatov <lav@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agousb: gadget: ffs: Forbid usb_ep_alloc_request from sleeping
Vincent Pelletier [Sun, 26 Nov 2017 06:52:53 +0000 (06:52 +0000)]
usb: gadget: ffs: Forbid usb_ep_alloc_request from sleeping

commit 30bf90ccdec1da9c8198b161ecbff39ce4e5a9ba upstream.

Found using DEBUG_ATOMIC_SLEEP while submitting an AIO read operation:

[  100.853642] BUG: sleeping function called from invalid context at mm/slab.h:421
[  100.861148] in_atomic(): 1, irqs_disabled(): 1, pid: 1880, name: python
[  100.867954] 2 locks held by python/1880:
[  100.867961]  #0:  (&epfile->mutex){....}, at: [<f8188627>] ffs_mutex_lock+0x27/0x30 [usb_f_fs]
[  100.868020]  #1:  (&(&ffs->eps_lock)->rlock){....}, at: [<f818ad4b>] ffs_epfile_io.isra.17+0x24b/0x590 [usb_f_fs]
[  100.868076] CPU: 1 PID: 1880 Comm: python Not tainted 4.14.0-edison+ #118
[  100.868085] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[  100.868093] Call Trace:
[  100.868122]  dump_stack+0x47/0x62
[  100.868156]  ___might_sleep+0xfd/0x110
[  100.868182]  __might_sleep+0x68/0x70
[  100.868217]  kmem_cache_alloc_trace+0x4b/0x200
[  100.868248]  ? dwc3_gadget_ep_alloc_request+0x24/0xe0 [dwc3]
[  100.868302]  dwc3_gadget_ep_alloc_request+0x24/0xe0 [dwc3]
[  100.868343]  usb_ep_alloc_request+0x16/0xc0 [udc_core]
[  100.868386]  ffs_epfile_io.isra.17+0x444/0x590 [usb_f_fs]
[  100.868424]  ? _raw_spin_unlock_irqrestore+0x27/0x40
[  100.868457]  ? kiocb_set_cancel_fn+0x57/0x60
[  100.868477]  ? ffs_ep0_poll+0xc0/0xc0 [usb_f_fs]
[  100.868512]  ffs_epfile_read_iter+0xfe/0x157 [usb_f_fs]
[  100.868551]  ? security_file_permission+0x9c/0xd0
[  100.868587]  ? rw_verify_area+0xac/0x120
[  100.868633]  aio_read+0x9d/0x100
[  100.868692]  ? __fget+0xa2/0xd0
[  100.868727]  ? __might_sleep+0x68/0x70
[  100.868763]  SyS_io_submit+0x471/0x680
[  100.868878]  do_int80_syscall_32+0x4e/0xd0
[  100.868921]  entry_INT80_32+0x2a/0x2a
[  100.868932] EIP: 0xb7fbb676
[  100.868941] EFLAGS: 00000292 CPU: 1
[  100.868951] EAX: ffffffda EBX: b7aa2000 ECX: 00000002 EDX: b7af8368
[  100.868961] ESI: b7fbb660 EDI: b7aab000 EBP: bfb6c658 ESP: bfb6c638
[  100.868973]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoipmi: Stop timers before cleaning up the module
Masamitsu Yamazaki [Wed, 15 Nov 2017 07:33:14 +0000 (07:33 +0000)]
ipmi: Stop timers before cleaning up the module

commit 4f7f5551a760eb0124267be65763008169db7087 upstream.

System may crash after unloading ipmi_si.ko module
because a timer may remain and fire after the module cleaned up resources.

cleanup_one_si() contains the following processing.

        /*
         * Make sure that interrupts, the timer and the thread are
         * stopped and will not run again.
         */
        if (to_clean->irq_cleanup)
                to_clean->irq_cleanup(to_clean);
        wait_for_timer_and_thread(to_clean);

        /*
         * Timeouts are stopped, now make sure the interrupts are off
         * in the BMC.  Note that timers and CPU interrupts are off,
         * so no need for locks.
         */
        while (to_clean->curr_msg || (to_clean->si_state != SI_NORMAL)) {
                poll(to_clean);
                schedule_timeout_uninterruptible(1);
        }

si_state changes as following in the while loop calling poll(to_clean).

  SI_GETTING_MESSAGES
    => SI_CHECKING_ENABLES
     => SI_SETTING_ENABLES
      => SI_GETTING_EVENTS
       => SI_NORMAL

As written in the code comments above,
timers are expected to stop before the polling loop and not to run again.
But the timer is set again in the following process
when si_state becomes SI_SETTING_ENABLES.

  => poll
     => smi_event_handler
       => handle_transaction_done
          // smi_info->si_state == SI_SETTING_ENABLES
         => start_getting_events
           => start_new_msg
            => smi_mod_timer
              => mod_timer

As a result, before the timer set in start_new_msg() expires,
the polling loop may see si_state becoming SI_NORMAL
and the module clean-up finishes.

For example, hard LOCKUP and panic occurred as following.
smi_timeout was called after smi_event_handler,
kcs_event and hangs at port_inb()
trying to access I/O port after release.

    [exception RIP: port_inb+19]
    RIP: ffffffffc0473053  RSP: ffff88069fdc3d80  RFLAGS: 00000006
    RAX: ffff8806800f8e00  RBX: ffff880682bd9400  RCX: 0000000000000000
    RDX: 0000000000000ca3  RSI: 0000000000000ca3  RDI: ffff8806800f8e40
    RBP: ffff88069fdc3d80   R8: ffffffff81d86dfc   R9: ffffffff81e36426
    R10: 00000000000509f0  R11: 0000000000100000  R12: 0000000000]:000000
    R13: 0000000000000000  R14: 0000000000000246  R15: ffff8806800f8e00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 --- <NMI exception stack> ---

To fix the problem I defined a flag, timer_can_start,
as member of struct smi_info.
The flag is enabled immediately after initializing the timer
and disabled immediately before waiting for timer deletion.

Fixes: 0cfec916e86d ("ipmi: Start the timer and thread on internal msgs")
Signed-off-by: Yamazaki Masamitsu <m-yamazaki@ah.jp.nec.com>
[Some fairly major changes went into the IPMI driver in 4.15, so this
 required a backport as the code had changed and moved to a different
 file.]
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agosctp: use right member as the param of list_for_each_entry
Xin Long [Sun, 26 Nov 2017 12:56:07 +0000 (20:56 +0800)]
sctp: use right member as the param of list_for_each_entry

[ Upstream commit a8dd397903a6e57157f6265911f7d35681364427 ]

Commit d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues
when migrating a sock") made a mistake that using 'list' as the param of
list_for_each_entry to traverse the retransmit, sacked and abandoned
queues, while chunks are using 'transmitted_list' to link into these
queues.

It could cause NULL dereference panic if there are chunks in any of these
queues when peeling off one asoc.

So use the chunk member 'transmitted_list' instead in this patch.

Fixes: d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues when migrating a sock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agocls_bpf: don't decrement net's refcount when offload fails
Jakub Kicinski [Mon, 27 Nov 2017 19:11:41 +0000 (11:11 -0800)]
cls_bpf: don't decrement net's refcount when offload fails

[ Upstream commit 25415cec502a1232b19fffc85465882b19a90415 ]

When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point.  This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().

__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out.  The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.

Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.

Fixes: aae2c35ec892 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet: openvswitch: datapath: fix data type in queue_gso_packets
Gustavo A. R. Silva [Sat, 25 Nov 2017 19:14:40 +0000 (13:14 -0600)]
net: openvswitch: datapath: fix data type in queue_gso_packets

[ Upstream commit 2734166e89639c973c6e125ac8bcfc2d9db72b70 ]

gso_type is being used in binary AND operations together with SKB_GSO_UDP.
The issue is that variable gso_type is of type unsigned short and
SKB_GSO_UDP expands to more than 16 bits:

SKB_GSO_UDP = 1 << 16

this makes any binary AND operation between gso_type and SKB_GSO_UDP to
be always zero, hence making some code unreachable and likely causing
undesired behavior.

Fix this by changing the data type of variable gso_type to unsigned int.

Addresses-Coverity-ID: 1462223
Fixes: 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet: accept UFO datagrams from tuntap and packet
Willem de Bruijn [Tue, 21 Nov 2017 15:22:25 +0000 (10:22 -0500)]
net: accept UFO datagrams from tuntap and packet

[ Upstream commit 0c19f846d582af919db66a5914a0189f9f92c936 ]

Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotun: fix rcu_read_lock imbalance in tun_build_skb
Xin Long [Sun, 19 Nov 2017 11:31:04 +0000 (19:31 +0800)]
tun: fix rcu_read_lock imbalance in tun_build_skb

[ Upstream commit 654d573845f35017dc397840fa03610fef3d08b0 ]

rcu_read_lock in tun_build_skb is used to rcu_dereference tun->xdp_prog
safely, rcu_read_unlock should be done in every return path.

Now I could see one place missing it, where it returns NULL in switch-case
XDP_REDIRECT,  another palce using rcu_read_lock wrongly, where it returns
NULL in if (xdp_xmit) chunk.

So fix both in this patch.

Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet: ipv6: Fixup device for anycast routes during copy
David Ahern [Tue, 21 Nov 2017 15:08:57 +0000 (07:08 -0800)]
net: ipv6: Fixup device for anycast routes during copy

[ Upstream commit 98d11291d189cb5adf49694d0ad1b971c0212697 ]

Florian reported a breakage with anycast routes due to commit
4832c30d5458 ("net: ipv6: put host and anycast routes on device with
address"). Prior to this commit anycast routes were added against the
loopback device causing repetitive route entries with no insight into
why they existed. e.g.:
  $ ip -6 ro ls  table local type anycast
  anycast 2001:db8:1:: dev lo proto kernel metric 0 pref medium
  anycast 2001:db8:2:: dev lo proto kernel metric 0 pref medium
  anycast fe80:: dev lo proto kernel metric 0 pref medium
  anycast fe80:: dev lo proto kernel metric 0 pref medium

The point of commit 4832c30d5458 is to add the routes using the device
with the address which is causing the route to be added. e.g.,:
  $ ip -6 ro ls  table local type anycast
  anycast 2001:db8:1:: dev eth1 proto kernel metric 0 pref medium
  anycast 2001:db8:2:: dev eth2 proto kernel metric 0 pref medium
  anycast fe80:: dev eth2 proto kernel metric 0 pref medium
  anycast fe80:: dev eth1 proto kernel metric 0 pref medium

For traffic to work as it did before, the dst device needs to be switched
to the loopback when the copy is created similar to local routes.

Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotun: free skb in early errors
Wei Xu [Fri, 1 Dec 2017 10:10:37 +0000 (05:10 -0500)]
tun: free skb in early errors

[ Upstream commit c33ee15b3820a03cf8229ba9415084197b827f8c ]

tun_recvmsg() supports accepting skb by msg_control after
commit ac77cfd4258f ("tun: support receiving skb through msg_control"),
the skb if presented should be freed no matter how far it can go
along, otherwise it would be leaked.

This patch fixes several missed cases.

Signed-off-by: Wei Xu <wexu@redhat.com>
Reported-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp: when scheduling TLP, time of RTO should account for current ACK
Neal Cardwell [Sat, 18 Nov 2017 02:06:14 +0000 (21:06 -0500)]
tcp: when scheduling TLP, time of RTO should account for current ACK

[ Upstream commit ed66dfaf236c04d414de1d218441296e57fb2bd2 ]

Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e that was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
       at write_queue_head->paket_tx_time + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                   tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
            if (flag & FLAG_ACKED) {
                   tcp_rearm_rto(sk);

    3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
       to TLP at now + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                   tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotap: free skb if flags error
Wei Xu [Fri, 1 Dec 2017 10:10:38 +0000 (05:10 -0500)]
tap: free skb if flags error

[ Upstream commit 61d78537843e676e7f56ac6db333db0c0529b892 ]

tap_recvmsg() supports accepting skb by msg_control after
commit 3b4ba04acca8 ("tap: support receiving skb from msg_control"),
the skb if presented should be freed within the function, otherwise
it would be leaked.

Signed-off-by: Wei Xu <wexu@redhat.com>
Reported-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet: sched: cbq: create block for q->link.block
Jiri Pirko [Mon, 27 Nov 2017 17:37:21 +0000 (18:37 +0100)]
net: sched: cbq: create block for q->link.block

[ Upstream commit d51aae68b142f48232257e96ce317db25445418d ]

q->link.block is not initialized, that leads to EINVAL when one tries to
add filter there. So initialize it properly.

This can be reproduced by:
$ tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 rate 1000Mbit bandwidth 1000Mbit
$ tc filter add dev eth0 parent 1: protocol ip prio 100 u32 match ip protocol 0 0x00 flowid 1:1

Reported-by: Jaroslav Aster <jaster@redhat.com>
Reported-by: Ivan Vecera <ivecera@redhat.com>
Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp: use current time in tcp_rcv_space_adjust()
Eric Dumazet [Wed, 6 Dec 2017 19:08:19 +0000 (11:08 -0800)]
tcp: use current time in tcp_rcv_space_adjust()

[ Upstream commit 8632385022f2b05a6ca0b9e0f95575865de0e2ce ]

When I switched rcv_rtt_est to high resolution timestamps, I forgot
that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust()

Using an old timestamp leads to autotuning lags.

Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()
Tommi Rantala [Wed, 29 Nov 2017 10:48:42 +0000 (12:48 +0200)]
tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()

[ Upstream commit c7799c067c2ae33e348508c8afec354f3257ff25 ]

Remove the second tipc_rcv() call in tipc_udp_recv(). We have just
checked that the bearer is not up, and calling tipc_rcv() with a bearer
that is not up leads to a TIPC div-by-zero crash in
tipc_node_calculate_timer(). The crash is rare in practice, but can
happen like this:

  We're enabling a bearer, but it's not yet up and fully initialized.
  At the same time we receive a discovery packet, and in tipc_udp_recv()
  we end up calling tipc_rcv() with the not-yet-initialized bearer,
  causing later the div-by-zero crash in tipc_node_calculate_timer().

Jon Maloy explains the impact of removing the second tipc_rcv() call:
  "link setup in the worst case will be delayed until the next arriving
   discovery messages, 1 sec later, and this is an acceptable delay."

As the tipc_rcv() call is removed, just leave the function via the
rcu_out label, so that we will kfree_skb().

[   12.590450] Own node address <1.1.1>, network identity 1
[   12.668088] divide error: 0000 [#1] SMP
[   12.676952] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.2-dirty #1
[   12.679225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   12.682095] task: ffff8c2a761edb80 task.stack: ffffa41cc0cac000
[   12.684087] RIP: 0010:tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc]
[   12.686486] RSP: 0018:ffff8c2a7fc838a0 EFLAGS: 00010246
[   12.688451] RAX: 0000000000000000 RBX: ffff8c2a5b382600 RCX: 0000000000000000
[   12.691197] RDX: 0000000000000000 RSI: ffff8c2a5b382600 RDI: ffff8c2a5b382600
[   12.693945] RBP: ffff8c2a7fc838b0 R08: 0000000000000001 R09: 0000000000000001
[   12.696632] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c2a5d8949d8
[   12.699491] R13: ffffffff95ede400 R14: 0000000000000000 R15: ffff8c2a5d894800
[   12.702338] FS:  0000000000000000(0000) GS:ffff8c2a7fc80000(0000) knlGS:0000000000000000
[   12.705099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.706776] CR2: 0000000001bb9440 CR3: 00000000bd009001 CR4: 00000000003606e0
[   12.708847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   12.711016] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   12.712627] Call Trace:
[   12.713390]  <IRQ>
[   12.714011]  tipc_node_check_dest+0x2e8/0x350 [tipc]
[   12.715286]  tipc_disc_rcv+0x14d/0x1d0 [tipc]
[   12.716370]  tipc_rcv+0x8b0/0xd40 [tipc]
[   12.717396]  ? minmax_running_min+0x2f/0x60
[   12.718248]  ? dst_alloc+0x4c/0xa0
[   12.718964]  ? tcp_ack+0xaf1/0x10b0
[   12.719658]  ? tipc_udp_is_known_peer+0xa0/0xa0 [tipc]
[   12.720634]  tipc_udp_recv+0x71/0x1d0 [tipc]
[   12.721459]  ? dst_alloc+0x4c/0xa0
[   12.722130]  udp_queue_rcv_skb+0x264/0x490
[   12.722924]  __udp4_lib_rcv+0x21e/0x990
[   12.723670]  ? ip_route_input_rcu+0x2dd/0xbf0
[   12.724442]  ? tcp_v4_rcv+0x958/0xa40
[   12.725039]  udp_rcv+0x1a/0x20
[   12.725587]  ip_local_deliver_finish+0x97/0x1d0
[   12.726323]  ip_local_deliver+0xaf/0xc0
[   12.726959]  ? ip_route_input_noref+0x19/0x20
[   12.727689]  ip_rcv_finish+0xdd/0x3b0
[   12.728307]  ip_rcv+0x2ac/0x360
[   12.728839]  __netif_receive_skb_core+0x6fb/0xa90
[   12.729580]  ? udp4_gro_receive+0x1a7/0x2c0
[   12.730274]  __netif_receive_skb+0x1d/0x60
[   12.730953]  ? __netif_receive_skb+0x1d/0x60
[   12.731637]  netif_receive_skb_internal+0x37/0xd0
[   12.732371]  napi_gro_receive+0xc7/0xf0
[   12.732920]  receive_buf+0x3c3/0xd40
[   12.733441]  virtnet_poll+0xb1/0x250
[   12.733944]  net_rx_action+0x23e/0x370
[   12.734476]  __do_softirq+0xc5/0x2f8
[   12.734922]  irq_exit+0xfa/0x100
[   12.735315]  do_IRQ+0x4f/0xd0
[   12.735680]  common_interrupt+0xa2/0xa2
[   12.736126]  </IRQ>
[   12.736416] RIP: 0010:native_safe_halt+0x6/0x10
[   12.736925] RSP: 0018:ffffa41cc0cafe90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff4d
[   12.737756] RAX: 0000000000000000 RBX: ffff8c2a761edb80 RCX: 0000000000000000
[   12.738504] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   12.739258] RBP: ffffa41cc0cafe90 R08: 0000014b5b9795e5 R09: ffffa41cc12c7e88
[   12.740118] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[   12.740964] R13: ffff8c2a761edb80 R14: 0000000000000000 R15: 0000000000000000
[   12.741831]  default_idle+0x2a/0x100
[   12.742323]  arch_cpu_idle+0xf/0x20
[   12.742796]  default_idle_call+0x28/0x40
[   12.743312]  do_idle+0x179/0x1f0
[   12.743761]  cpu_startup_entry+0x1d/0x20
[   12.744291]  start_secondary+0x112/0x120
[   12.744816]  secondary_startup_64+0xa5/0xa5
[   12.745367] Code: b9 f4 01 00 00 48 89 c2 48 c1 ea 02 48 3d d3 07 00
00 48 0f 47 d1 49 8b 0c 24 48 39 d1 76 07 49 89 14 24 48 89 d1 31 d2 48
89 df <48> f7 f1 89 c6 e8 81 6e ff ff 5b 41 5c 5d c3 66 90 66 2e 0f 1f
[   12.747527] RIP: tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc] RSP: ffff8c2a7fc838a0
[   12.748555] ---[ end trace 1399ab83390650fd ]---
[   12.749296] Kernel panic - not syncing: Fatal exception in interrupt
[   12.750123] Kernel Offset: 0x13200000 from 0xffffffff82000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   12.751215] Rebooting in 60 seconds..

Fixes: c9b64d492b1f ("tipc: add replicast peer discovery")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
David Ahern [Sun, 3 Dec 2017 17:33:00 +0000 (09:33 -0800)]
tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()

[ Usptream commit b4d1605a8ea608fd7dc45b926a05d75d340bde4b ]

After this fix : ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"),
socket lookups happen while skb->cb[] has not been mangled yet by TCP.

Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agos390/qeth: fix GSO throughput regression
Julian Wiedmann [Fri, 1 Dec 2017 09:14:50 +0000 (10:14 +0100)]
s390/qeth: fix GSO throughput regression

[ Upstream commit 6d69b1f1eb7a2edf8a3547f361c61f2538e054bb ]

Using GSO with small MTUs currently results in a substantial throughput
regression - which is caused by how qeth needs to map non-linear skbs
into its IO buffer elements:
compared to a linear skb, each GSO-segmented skb effectively consumes
twice as many buffer elements (ie two instead of one) due to the
additional header-only part. This causes the Output Queue to be
congested with low-utilized IO buffers.

Fix this as follows:
If the MSS is low enough so that a non-SG GSO segmentation produces
order-0 skbs (currently ~3500 byte), opt out from NETIF_F_SG. This is
where we anticipate the biggest savings, since an SG-enabled
GSO segmentation produces skbs that always consume at least two
buffer elements.

Larger MSS values continue to get a SG-enabled GSO segmentation, since
1) the relative overhead of the additional header-only buffer element
becomes less noticeable, and
2) the linearization overhead increases.

With the throughput regression fixed, re-enable NETIF_F_SG by default to
reap the significant CPU savings of GSO.

Fixes: 5722963a8e83 ("qeth: do not turn on SG per default")
Reported-by: Nils Hoppmann <niho@de.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agos390/qeth: fix thinko in IPv4 multicast address tracking
Julian Wiedmann [Fri, 1 Dec 2017 09:14:49 +0000 (10:14 +0100)]
s390/qeth: fix thinko in IPv4 multicast address tracking

[ Upsteam commit bc3ab70584696cb798b9e1e0ac8e6ced5fd4c3b8 ]

Commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
reworked how secondary addresses are managed for qeth devices.
Instead of dropping & subsequently re-adding all addresses on every
ndo_set_rx_mode() call, qeth now keeps track of the addresses that are
currently registered with the HW.
On a ndo_set_rx_mode(), we thus only need to do (de-)registration
requests for the addresses that have actually changed.

On L3 devices, the lookup for IPv4 Multicast addresses checks the wrong
hashtable - and thus never finds a match. As a result, we first delete
*all* such addresses, and then re-add them again. So each set_rx_mode()
causes a short period where the IPv4 Multicast addresses are not
registered, and the card stops forwarding inbound traffic for them.

Fix this by setting the ->is_multicast flag on the lookup object, thus
enabling qeth_l3_ip_from_hash() to search the correct hashtable and
find a match there.

Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agos390/qeth: build max size GSO skbs on L2 devices
Julian Wiedmann [Fri, 1 Dec 2017 09:14:51 +0000 (10:14 +0100)]
s390/qeth: build max size GSO skbs on L2 devices

[ Upstream commit 0cbff6d4546613330a1c5f139f5c368e4ce33ca1 ]

The current GSO skb size limit was copy&pasted over from the L3 path,
where it is needed due to a TSO limitation.
As L2 devices don't offer TSO support (and thus all GSO skbs are
segmented before they reach the driver), there's no reason to restrict
the stack in how large it may build the GSO skbs.

Fixes: d52aec97e5bc ("qeth: enable scatter/gather in layer 2 mode")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp/dccp: block bh before arming time_wait timer
Eric Dumazet [Fri, 1 Dec 2017 18:06:56 +0000 (10:06 -0800)]
tcp/dccp: block bh before arming time_wait timer

[ Upstream commit cfac7f836a715b91f08c851df915d401a4d52783 ]

Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
that might be caused by the following bug.

timewait timer is pinned to the cpu, because we want to transition
timwewait refcount from 0 to 4 in one go, once everything has been
initialized.

At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
handling") was merged, TCP was always running from BH habdler.

After commit 5413d1babe8f ("net: do not block BH while processing
socket backlog") we definitely can run tcp_time_wait() from process
context.

We need to block BH in the critical section so that the pinned timer
has still its purpose.

This bug is more likely to happen under stress and when very small RTO
are used in datacenter flows.

Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agostmmac: reset last TSO segment size after device open
Lars Persson [Fri, 1 Dec 2017 10:12:44 +0000 (11:12 +0100)]
stmmac: reset last TSO segment size after device open

[ Upstream commit 45ab4b13e46325d00f4acdb365d406e941a15f81 ]

The mss variable tracks the last max segment size sent to the TSO
engine. We do not update the hardware as long as we receive skb:s with
the same value in gso_size.

During a network device down/up cycle (mapped to stmmac_release() and
stmmac_open() callbacks) we issue a reset to the hardware and it
forgets the setting for mss. However we did not zero out our mss
variable so the next transmission of a gso packet happens with an
undefined hardware setting.

This triggers a hang in the TSO engine and eventuelly the netdev
watchdog will bark.

Fixes: f748be531d70 ("stmmac: support new GMAC4")
Signed-off-by: Lars Persson <larper@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet: remove hlist_nulls_add_tail_rcu()
Eric Dumazet [Tue, 5 Dec 2017 20:45:56 +0000 (12:45 -0800)]
net: remove hlist_nulls_add_tail_rcu()

[ Upstream commit d7efc6c11b277d9d80b99b1334a78bfe7d7edf10 ]

Alexander Potapenko reported use of uninitialized memory [1]

This happens when inserting a request socket into TCP ehash,
in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.

Bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for
mixed v4/v6 sockets")

Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6
ordering fix") missed the opportunity to get rid of
hlist_nulls_add_tail_rcu() :

Both UDP sockets and TCP/DCCP listeners no longer use
__sk_nulls_add_node_rcu() for their hash insertion.

Since all other sockets have unique 4-tuple, the reuseport status
has no special meaning, so we can always use hlist_nulls_add_head_rcu()
for them and save few cycles/instructions.

[1]

==================================================================
BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x185/0x1d0 lib/dump_stack.c:52
 kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
 __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
 __sk_nulls_add_node_rcu ./include/net/sock.h:684
 inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
 reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
 inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
 tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
 tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
 tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
 tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
 tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
 ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
 NF_HOOK ./include/linux/netfilter.h:248
 ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
 dst_input ./include/net/dst.h:477
 ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
 NF_HOOK ./include/linux/netfilter.h:248
 ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
 __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
 __netif_receive_skb net/core/dev.c:4336
 netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
 napi_skb_finish net/core/dev.c:4858
 napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
 e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
 e1000_clean_rx_irq+0x1492/0x1d30
drivers/net/ethernet/intel/e1000/e1000_main.c:4474
 e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
 napi_poll net/core/dev.c:5500
 net_rx_action+0x73c/0x1820 net/core/dev.c:5566
 __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364
 irq_exit+0x203/0x240 kernel/softirq.c:405
 exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
 do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
 common_interrupt+0x86/0x86

Fixes: d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
Fixes: d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Potapenko <glider@google.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agousbnet: fix alignment for frames with no ethernet header
Bjørn Mork [Wed, 6 Dec 2017 19:21:24 +0000 (20:21 +0100)]
usbnet: fix alignment for frames with no ethernet header

[ Upstream commit a4abd7a80addb4a9547f7dfc7812566b60ec505c ]

The qmi_wwan minidriver support a 'raw-ip' mode where frames are
received without any ethernet header. This causes alignment issues
because the skbs allocated by usbnet are "IP aligned".

Fix by allowing minidrivers to disable the additional alignment
offset. This is implemented using a per-device flag, since the same
minidriver also supports 'ethernet' mode.

Fixes: 32f7adf633b9 ("net: qmi_wwan: support "raw IP" mode")
Reported-and-tested-by: Jay Foster <jay@systech.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp: remove buggy call to tcp_v6_restore_cb()
Eric Dumazet [Thu, 30 Nov 2017 01:43:57 +0000 (17:43 -0800)]
tcp: remove buggy call to tcp_v6_restore_cb()

[ Upstream commit 3016dad75b48279e579117ee3ed566ba90a3b023 ]

tcp_v6_send_reset() expects to receive an skb with skb->cb[] layout as
used in TCP stack.
MD5 lookup uses tcp_v6_iif() and tcp_v6_sdif() and thus
TCP_SKB_CB(skb)->header.h6

This patch probably fixes RST packets sent on behalf of a timewait md5
ipv6 socket.

Before Florian patch, tcp_v6_restore_cb() was needed before jumping to
no_tcp_socket label.

Fixes: 271c3b9b7bda ("tcp: honour SO_BINDTODEVICE for TW_RST case too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agonet/packet: fix a race in packet_bind() and packet_notifier()
Eric Dumazet [Tue, 28 Nov 2017 16:03:30 +0000 (08:03 -0800)]
net/packet: fix a race in packet_bind() and packet_notifier()

[ Upstream commit 15fe076edea787807a7cdc168df832544b58eba6 ]

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.

Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945!  ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS:  0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
 tun_detach drivers/net/tun.c:670 [inline]
 tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
 __fput+0x333/0x7f0 fs/file_table.c:210
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x199/0x270 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x9bb/0x1ae0 kernel/exit.c:865
 do_group_exit+0x149/0x400 kernel/exit.c:968
 SYSC_exit_group kernel/exit.c:979 [inline]
 SyS_exit_group+0x1d/0x20 kernel/exit.c:977
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agopacket: fix crash in fanout_demux_rollover()
Mike Maloney [Tue, 28 Nov 2017 15:44:29 +0000 (10:44 -0500)]
packet: fix crash in fanout_demux_rollover()

syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.

po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release().   It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.

To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors.  This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).

Crashing stack trace:
 fanout_demux_rollover+0xb6/0x4d0 net/packet/af_packet.c:1392
 packet_rcv_fanout+0x649/0x7c8 net/packet/af_packet.c:1487
 dev_queue_xmit_nit+0x835/0xc10 net/core/dev.c:1953
 xmit_one net/core/dev.c:2975 [inline]
 dev_hard_start_xmit+0x16b/0xac0 net/core/dev.c:2995
 __dev_queue_xmit+0x17a4/0x2050 net/core/dev.c:3476
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3509
 neigh_connected_output+0x489/0x720 net/core/neighbour.c:1379
 neigh_output include/net/neighbour.h:482 [inline]
 ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
 ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
 NF_HOOK_COND include/linux/netfilter.h:239 [inline]
 ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
 dst_output include/net/dst.h:459 [inline]
 NF_HOOK.constprop.35+0xff/0x630 include/linux/netfilter.h:250
 mld_sendpack+0x6a8/0xcc0 net/ipv6/mcast.c:1660
 mld_send_initial_cr.part.24+0x103/0x150 net/ipv6/mcast.c:2072
 mld_send_initial_cr net/ipv6/mcast.c:2056 [inline]
 ipv6_mc_dad_complete+0x99/0x130 net/ipv6/mcast.c:2079
 addrconf_dad_completed+0x595/0x970 net/ipv6/addrconf.c:4039
 addrconf_dad_work+0xac9/0x1160 net/ipv6/addrconf.c:3971
 process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2113
 worker_thread+0x223/0x1990 kernel/workqueue.c:2247
 kthread+0x35e/0x430 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:432

Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")
Fixes: 509c7a1ecc860 ("packet: avoid panic in packet_getsockopt()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Mike Maloney <maloney@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agotcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
Eric Dumazet [Sun, 3 Dec 2017 17:32:59 +0000 (09:32 -0800)]
tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()

[ Upstream commit eeea10b83a139451130df1594f26710c8fa390c8 ]

James Morris reported kernel stack corruption bug [1] while
running the SELinux testsuite, and bisected to a recent
commit bffa72cf7f9d ("net: sk_buff rbnode reorg")

We believe this commit is fine, but exposes an older bug.

SELinux code runs from tcp_filter() and might send an ICMP,
expecting IP options to be found in skb->cb[] using regular IPCB placement.

We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.

This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
similar way we added them for IPv6.

[1]
[  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
[  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
[  339.822505]
[  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
[  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
[  339.885060] Call Trace:
[  339.896875]  <IRQ>
[  339.908103]  dump_stack+0x63/0x87
[  339.920645]  panic+0xe8/0x248
[  339.932668]  ? ip_push_pending_frames+0x33/0x40
[  339.946328]  ? icmp_send+0x525/0x530
[  339.958861]  ? kfree_skbmem+0x60/0x70
[  339.971431]  __stack_chk_fail+0x1b/0x20
[  339.984049]  icmp_send+0x525/0x530
[  339.996205]  ? netlbl_skbuff_err+0x36/0x40
[  340.008997]  ? selinux_netlbl_err+0x11/0x20
[  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
[  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
[  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
[  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
[  340.074562]  ? tcp_filter+0x2c/0x40
[  340.086400]  ? tcp_v4_rcv+0x820/0xa20
[  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
[  340.111279]  ? ip_local_deliver+0x6f/0xe0
[  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
[  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
[  340.147442]  ? ip_rcv+0x27c/0x3c0
[  340.158668]  ? inet_del_offload+0x40/0x40
[  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
[  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
[  340.195282]  ? __netif_receive_skb+0x18/0x60
[  340.207288]  ? process_backlog+0x95/0x140
[  340.218948]  ? net_rx_action+0x26c/0x3b0
[  340.230416]  ? __do_softirq+0xc9/0x26a
[  340.241625]  ? do_softirq_own_stack+0x2a/0x40
[  340.253368]  </IRQ>
[  340.262673]  ? do_softirq+0x50/0x60
[  340.273450]  ? __local_bh_enable_ip+0x57/0x60
[  340.285045]  ? ip_finish_output2+0x175/0x350
[  340.296403]  ? ip_finish_output+0x127/0x1d0
[  340.307665]  ? nf_hook_slow+0x3c/0xb0
[  340.318230]  ? ip_output+0x72/0xe0
[  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
[  340.340070]  ? ip_local_out+0x35/0x40
[  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
[  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
[  340.372484]  ? __skb_clone+0x2e/0x130
[  340.382633]  ? tcp_transmit_skb+0x558/0xa10
[  340.393262]  ? tcp_connect+0x938/0xad0
[  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
[  340.414206]  ? tcp_v4_connect+0x457/0x4e0
[  340.424471]  ? __inet_stream_connect+0xb3/0x300
[  340.435195]  ? inet_stream_connect+0x3b/0x60
[  340.445607]  ? SYSC_connect+0xd9/0x110
[  340.455455]  ? __audit_syscall_entry+0xaf/0x100
[  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
[  340.476636]  ? __audit_syscall_exit+0x209/0x290
[  340.487151]  ? SyS_connect+0xe/0x10
[  340.496453]  ? do_syscall_64+0x67/0x1b0
[  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: James Morris <james.l.morris@oracle.com>
Tested-by: James Morris <james.l.morris@oracle.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agosit: update frag_off info
Hangbin Liu [Thu, 30 Nov 2017 02:41:14 +0000 (10:41 +0800)]
sit: update frag_off info

[ Upstream commit f859b4af1c52493ec21173ccc73d0b60029b5b88 ]

After parsing the sit netlink change info, we forget to update frag_off in
ipip6_tunnel_update(). Fix it by assigning frag_off with new value.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agords: Fix NULL pointer dereference in __rds_rdma_map
Håkon Bugge [Wed, 6 Dec 2017 16:18:28 +0000 (17:18 +0100)]
rds: Fix NULL pointer dereference in __rds_rdma_map

[ Upstream commit f3069c6d33f6ae63a1668737bc78aaaa51bff7ca ]

This is a fix for syzkaller719569, where memory registration was
attempted without any underlying transport being loaded.

Analysis of the case reveals that it is the setsockopt() RDS_GET_MR
(2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable.

Here is an example stack trace when the bug is hit:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
IP: __rds_rdma_map+0x36/0x440 [rds]
PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp
coretemp kvm_intel kvm irqbypass crct10dif_pclmul c rc32_pclmul
ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd
iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd
shpchp pcspkr i2c_i801 ioatd ma ipmi_msghandler wmi lpc_ich mfd_core
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2
mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect
sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm
crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror
dm_region_hash dm_log dm_mod
CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2
Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
task: ffff882f9190db00 task.stack: ffffc9002b994000
RIP: 0010:__rds_rdma_map+0x36/0x440 [rds]
RSP: 0018:ffffc9002b997df0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff882fa2182580 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc9002b997e40 RDI: ffff882fa2182580
RBP: ffffc9002b997e30 R08: 0000000000000000 R09: 0000000000000002
R10: ffff885fb29e3838 R11: 0000000000000000 R12: ffff882fa2182580
R13: ffff882fa2182580 R14: 0000000000000002 R15: 0000000020000ffc
FS:  00007fbffa20b700(0000) GS:ffff882fbfb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000002f98a66006 CR4: 00000000001606e0
Call Trace:
 rds_get_mr+0x56/0x80 [rds]
 rds_setsockopt+0x172/0x340 [rds]
 ? __fget_light+0x25/0x60
 ? __fdget+0x13/0x20
 SyS_setsockopt+0x80/0xe0
 do_syscall_64+0x67/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fbff9b117f9
RSP: 002b:00007fbffa20aed8 EFLAGS: 00000293 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000000c84a4 RCX: 00007fbff9b117f9
RDX: 0000000000000002 RSI: 0000400000000114 RDI: 000000000000109b
RBP: 00007fbffa20af10 R08: 0000000000000020 R09: 00007fbff9dd7860
R10: 0000000020000ffc R11: 0000000000000293 R12: 0000000000000000
R13: 00007fbffa20b9c0 R14: 00007fbffa20b700 R15: 0000000000000021

Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48
89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48>
83 b8 c0 00 00 00 00 0f 84 25 03 00 0 0 48 8b 06 48 8b 56 08

The fix is to check the existence of an underlying transport in
__rds_rdma_map().

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>