Chao Yu [Tue, 14 Nov 2017 11:28:42 +0000 (19:28 +0800)]
f2fs: deny accessing encryption policy if encryption is off
This patch adds missing feature check in encryption ioctl interface.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 13 Nov 2017 09:32:40 +0000 (17:32 +0800)]
f2fs: inject fault in inc_valid_node_count
This patch adds missing fault injection in inc_valid_node_count.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Tue, 14 Nov 2017 01:46:38 +0000 (17:46 -0800)]
f2fs: expose quota information in debugfs
This patch shows # of dirty pages and # of hidden quota files.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlei He [Fri, 10 Nov 2017 21:36:51 +0000 (13:36 -0800)]
f2fs: separate nat entry mem alloc from nat_tree_lock
This patch splits memory allocation part in nat_entry to avoid lock contention.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
LiFan [Fri, 10 Nov 2017 07:41:42 +0000 (15:41 +0800)]
f2fs: validate before set/clear free nat bitmap
In flush_nat_entries, all dirty nats will be flushed and if
their new address isn't NULL_ADDR, their bitmaps will be updated,
the free_nid_count of the bitmaps will be increaced regardless
of whether the nats have already been occupied before.
This could lead to wrong free_nid_count.
So this patch checks the status of the bits beforeactually
set/clear them.
Fixes:
586d1492f301 ("f2fs: skip scanning free nid bitmap of full NAT blocks")
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 10 Nov 2017 01:30:42 +0000 (09:30 +0800)]
f2fs: avoid opened loop codes in __add_ino_entry
We will keep __add_ino_entry success all the time, for ENOMEM failure
case, we have already handled it by using __GFP_NOFAIL flag, so we
don't have to use additional opened loop codes here, remove them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hyunchul Lee [Thu, 9 Nov 2017 05:51:27 +0000 (14:51 +0900)]
f2fs: apply write hints to select the type of segments for buffered write
Write hints helps F2FS to determine which type of segments would be
selected for buffered write.
This patch implements the mapping from write hints to segment types
as shown below.
hints segment type
----- ------------
WRITE_LIFE_SHORT CURSEG_HOT_DATA
WRITE_LIFE_EXTREME CURSEG_COLD_DATA
others CURSEG_WARM_DATA
the F2FS poliy for hot/cold seperation has precedence over this hints.
And hints are not applied in in-place update.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 8 Nov 2017 09:47:36 +0000 (17:47 +0800)]
f2fs: introduce scan_curseg_cache for cleanup
Commit
4ac912427c42 ("f2fs: introduce free nid bitmap") copied codes
from __build_free_nids() into scan_free_nid_bits(), they are redundant,
introduce one common function scan_curseg_cache for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Tue, 7 Nov 2017 11:14:24 +0000 (19:14 +0800)]
f2fs: optimize the way of traversing free_nid_bitmap
We call scan_free_nid_bits only when there isn't many
free nids left, it means that marked bits in free_nid_bitmap
are supposed to be few, use find_next_bit_le is more
efficient in such case.
According to my tests, use find_next_bit_le instead of
test_bit_le will cut down the traversal time to one
third of its original.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Tue, 7 Nov 2017 03:04:33 +0000 (11:04 +0800)]
f2fs: keep scanning until enough free nids are acquired
In current version, after scan_free_nid_bits, the scan is over if
nid_cnt[FREE_NID] != 0. In most cases, there are still free nids in the
free list during the scan, and scan_free_nid_bits usually can't increase
nid_cnt[FREE_NID]. It causes that __build_free_nids is called many times
without solving the shortage of the free nids. This patch fixes that.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 6 Nov 2017 14:51:45 +0000 (22:51 +0800)]
f2fs: trace checkpoint reason in fsync()
This patch slightly changes need_do_checkpoint to return the detail
info that indicates why we need do checkpoint, then caller could print
it with trace message.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sun, 5 Nov 2017 13:53:30 +0000 (21:53 +0800)]
f2fs: keep isize once block is reserved cross EOF
Without FADVISE_KEEP_SIZE_BIT, we will try to recover file size
according to last non-hole block, so in fallocate(), we must set
FADVISE_KEEP_SIZE_BIT flag once we have preallocated block cross
EOF, instead of when all preallocation is success. Otherwise, file
size will be incorrect due to lack of this flag.
Simple testcase to reproduce this:
1. echo 2 > /sys/fs/f2fs/<device>/inject_type
2. echo 10 > /sys/fs/f2fs/<device>/inject_rate
3. run tests/generic/392
4. disable fault injection
5. do remount
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 3 Nov 2017 02:21:05 +0000 (10:21 +0800)]
f2fs: avoid race in between GC and block exchange
During block exchange in {insert,collapse,move}_range, page-block mapping
is unstable due to mapping moving or recovery, so there should be no
concurrent cache read operation rely on such mapping, nor cache write
operation to mess up block exchange.
So this patch let background GC be aware of that.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Thu, 2 Nov 2017 03:02:52 +0000 (11:02 +0800)]
f2fs: save a multiplication for last_nid calculation
Use a slightly easier way to calculate last_nid.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 2 Nov 2017 12:41:03 +0000 (20:41 +0800)]
f2fs: fix summary info corruption
Sometimes, after running generic/270 of fstest, fsck reports summary
info and actual position of block address in direct node becoming
inconsistent.
The root cause is race in between __f2fs_replace_block and change_curseg
as below:
Thread A Thread B
- __clone_blkaddrs
- f2fs_replace_block
- __f2fs_replace_block
- segnoA = GET_SEGNO(sbi, blkaddrA);
- type = se->type:=CURSEG_HOT_DATA
- if (!IS_CURSEG(sbi, segnoA))
type = CURSEG_WARM_DATA
- allocate_data_block
- allocate_segment
- get_ssr_segment
- change_curseg(segnoA, CURSEG_HOT_DATA)
- change_curseg(segnoA, CURSEG_WARM_DATA)
- reset_curseg
- __set_sit_entry_type
- change se->type from CURSEG_HOT_DATA to CURSEG_WARM_DATA
So finally, hot curseg locates in segnoA, but type of segnoA becomes
CURSEG_WARM_DATA.
Then if we invoke __f2fs_replace_block(blkaddrB, blkaddrA, true, false),
as blkaddrA locates in segnoA, so we will move warm type curseg to segnoA,
then change its summary cache and writeback it to summary block.
But segnoA is used by hot type curseg too, once it moves or persist, it
will cover summary block content with inner old summary cache, result in
inconsistent status.
This patch tries to fix this issue by introduce global curseg lock to avoid
race in between __f2fs_replace_block and change_curseg.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 2 Nov 2017 12:41:02 +0000 (20:41 +0800)]
f2fs: remove dead code in update_meta_page
After commit
a468f0ef516f ("f2fs: use crc and cp version to determine
roll-forward recovery"), last caller of update_meta_page passing @src
with NULL is gone, so remove related dead code there.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 2 Nov 2017 12:41:01 +0000 (20:41 +0800)]
f2fs: remove unneeded semicolon
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jeff Layton [Mon, 30 Oct 2017 15:11:56 +0000 (11:11 -0400)]
f2fs: don't bother with inode->i_version
f2fs does not set the SB_I_VERSION flag, so the i_version will never
be incremented on write. It was recently changed to increment the
i_version on a quota write, which isn't necessary here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 30 Oct 2017 09:49:54 +0000 (17:49 +0800)]
f2fs: check curseg space before foreground GC
When we are closing to trigger foreground GC, if there are only a few
of dirty metas, we can log these dirty metas in left space of opened
segments instead of triggering foreground GC.
With this patch, total count of foreground GC triggered by
test/generic/* of fstest suit reduce from 254 to 184.
So let's do the check before foreground GC anyway.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 30 Oct 2017 09:49:53 +0000 (17:49 +0800)]
f2fs: use rw_semaphore to protect SIT cache
There are some cases user didn't update SIT cache under this lock,
so let's use rw_semaphore instead of mutex to enhance concurrently
accessing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 6 Oct 2017 16:14:28 +0000 (09:14 -0700)]
f2fs: support quota sys files
This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 6 Oct 2017 04:03:06 +0000 (21:03 -0700)]
f2fs: add quota_ino feature infra
This patch adds quota_ino feature infra to be used for quota files.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Mon, 30 Oct 2017 07:19:48 +0000 (15:19 +0800)]
f2fs: optimize __update_nat_bits
Make three modification for __update_nat_bits:
1. Take the codes of dealing the nat with nid 0 out of the loop
Such nat only needs to be dealt with once at beginning.
2. Use " nat_index == 0" instead of " start_nid == 0" to decide if it's the first nat block
It's better that we don't assume @start_nid is the first nid of the nat block it's in.
3. Use " if (nat_blk->entries[i].block_addr != NULL_ADDR)" to explicitly comfirm the value of block_addr
use constant to make sure the codes is right, even if the value of NULL_ADDR changes.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlei He [Mon, 30 Oct 2017 06:18:55 +0000 (14:18 +0800)]
f2fs: modify for accurate fggc node io stat
modify for accurate fggc node io stat
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Mon, 30 Oct 2017 01:33:41 +0000 (09:33 +0800)]
Revert "f2fs: handle dirty segments inside refresh_sit_entry"
This reverts commit
5e443818fa0b2a2845561ee25bec181424fb2889
The commit should be reverted because call sequence of below two parts
of code must be kept:
a. update sit information, it needs to be updated before segment
allocation since latter allocation may trigger SSR, and SSR allocation
needs latest valid block information of all segments.
b. update segment status, it needs to be updated after segment allocation
since we can skip updating current opened segment status.
Fixes:
5e443818fa0b ("f2fs: handle dirty segments inside refresh_sit_entry")
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: remove refresh_sit_entry function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Sat, 28 Oct 2017 11:03:37 +0000 (19:03 +0800)]
f2fs: add a function to move nid
This patch add a new function to move nid from one state to another.
Move operation is heavily used, by adding a new function for it
we can cut down some branches from several flow.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:33 +0000 (16:52 +0800)]
f2fs: export SSR allocation threshold
This patch exports min_ssr_segments threshold in sysfs to let user
control triggering SSR allocation flexibly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:32 +0000 (16:52 +0800)]
f2fs: give correct trimmed blocks in fstrim
We have supported to issue discard in specified range during fstrim,
it needs to return caller with successfully trimmed bytes in that
range instead of bytes of invalid blocks which are scanned in
checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:31 +0000 (16:52 +0800)]
f2fs: support bio allocation error injection
This patch adds to support bio allocation error injection to simulate
out-of-memory test scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:30 +0000 (16:52 +0800)]
f2fs: support get_page error injection
This patch adds to support get_page error injection to simulate
out-of-memory test scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:29 +0000 (16:52 +0800)]
f2fs: add missing sysfs description
There are some missing sysfs entries' description in document, add them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Fri, 27 Oct 2017 12:45:05 +0000 (20:45 +0800)]
f2fs: support soft block reservation
It supports to extend reserved_blocks sysfs interface to be soft
threshold, which allows user configure it exceeding current available
user space. This patch also introduces a new sysfs interface called
current_reserved_blocks, which shows the current blocks that have
already been reserved.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 16 Oct 2017 22:05:16 +0000 (15:05 -0700)]
f2fs: handle error case when adding xattr entry
This patch fixes recovering incomplete xattr entries remaining in inline xattr
and xattr block, caused by any kind of errors.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 6 Sep 2017 13:59:50 +0000 (21:59 +0800)]
f2fs: support flexible inline xattr size
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.
In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.
So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.
To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
Note that, we have to cnosider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.
Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 26 Oct 2017 08:31:22 +0000 (10:31 +0200)]
f2fs: show current cp state
This patch shows whether checkpoint met any error case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 23 Oct 2017 21:50:15 +0000 (23:50 +0200)]
f2fs: add missing quota_initialize
This patch adds to call quota_intialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Tue, 24 Oct 2017 07:46:54 +0000 (09:46 +0200)]
f2fs: show # of dirty segments via sysfs
This patch adds one sysfs entry to show # of dirty segments which can be
used for gc timing by user.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 23 Oct 2017 21:48:49 +0000 (23:48 +0200)]
f2fs: stop all the operations by cp_error flag
This patch replaces to use cp_error flag instead of RDONLY for quota off.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Colin Ian King [Thu, 19 Oct 2017 10:58:21 +0000 (12:58 +0200)]
f2fs: remove several redundant assignments
There are several assignments to variables that are redundant
as the values are never read when the variables are updated later
and so the redundant statements can be safely removed.
Cleans up clang warnings:
fs/f2fs/segment.c:923:19: warning: Value stored to 'p' during its initialization is never read
fs/f2fs/segment.c:2060:2: warning: Value stored to 'hint' is never read
fs/f2fs/segment.c:2353:2: warning: Value stored to 'start_block' is never read
fs/f2fs/segment.c:2354:2: warning: Value stored to 'end_block' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Arnd Bergmann [Thu, 19 Oct 2017 09:52:47 +0000 (11:52 +0200)]
f2fs: avoid using timespec
All uses of timespec are deprecated, and this one is not particularly
useful, as the documented method for converting seconds to jiffies
is to multiply by 'HZ'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 18 Oct 2017 02:34:14 +0000 (10:34 +0800)]
f2fs: fix to correct no_fggc_candidate
There may be extreme case as below:
For one section contains one segment, and there are total 100 segments
with 10% over-privision ratio in f2fs partition, fggc_threshold will
be rounded down to 460 instead of 460.8 as below caclulation:
sbi->fggc_threshold = div_u64((u64)(main_count - ovp_count) *
BLKS_PER_SEC(sbi), (main_count - resv_count));
If section usage is as:
60 segments which contain 460 valid blocks
40 segments which contain 462 valid blocks
As valid block number in all sections is large than fggc_threshold, so
none of them will be chosen as candidate due to incorrect fggc_threshold.
Let's just soften the term of choosing foreground GC candidates.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 19:07:11 +0000 (12:07 -0700)]
Revert "f2fs: return wrong error number on f2fs_quota_write"
This reverts commit
4f31d26b0c17f2aae6a6afeb823a87e20671ab4b.
It turns out that we need to report error number if nothing was written.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 18:48:57 +0000 (11:48 -0700)]
f2fs: remove obsolete pointer for truncate_xattr_node
This patch removes obosolete parameter for truncate_xattr_node.
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 16:43:56 +0000 (09:43 -0700)]
f2fs: retry ENOMEM for quota_read|write
This gives another chance to read or write quota data.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 02:05:57 +0000 (19:05 -0700)]
f2fs: limit # of inmemory pages
If some abnormal users try lots of atomic write operations, f2fs is able to
produce pinned pages in the main memory which affects system performance.
This patch limits that as 20% over total memory size, and if f2fs reaches
to the limit, it will drop all the inmemory pages.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:36 +0000 (18:01 +0800)]
f2fs: update ctx->pos correctly when hitting hole in directory
This patch fixes to update ctx->pos correctly when hitting hole in
directory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:35 +0000 (18:01 +0800)]
f2fs: relocate readahead codes in readdir()
Previously, for large directory, we just do readahead only once in
readdir(), readdir()'s performance may drop when traversing latter
blocks. In order to avoid this, relocate readahead codes to covering
all traverse flow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:34 +0000 (18:01 +0800)]
f2fs: allow readdir() to be interrupted
This patch follows ext4 to allow readdir() in large empty directory to
be interrupted. Referenced commit of ext4:
1f60fbe72749 ("ext4: allow
readdir()'s of large empty directories to be interrupted").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:33 +0000 (18:01 +0800)]
f2fs: trace f2fs_readdir
This patch adds trace for f2fs_readdir.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Tue, 17 Oct 2017 09:33:41 +0000 (17:33 +0800)]
f2fs: trace f2fs_lookup
This patch adds trace for f2fs_lookup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Weichao Guo [Sat, 14 Oct 2017 00:13:32 +0000 (08:13 +0800)]
f2fs: skip searching non-exist range in truncate_hole
Let's skip entire non-exist area to speed up truncate_hole by
using get_next_page_offset.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 13 Oct 2017 02:12:53 +0000 (19:12 -0700)]
f2fs: avoid stale fi->gdirty_list pointer
When doing fault injection test, f2fs_evict_inode() didn't remove gdirty_list
which incurs a kernel panic due to wrong pointer access.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Sat, 7 Oct 2017 07:08:05 +0000 (00:08 -0700)]
f2fs/crypto: drop crypto key at evict_inode only
This patch avoids dropping crypto key in f2fs_drop_inode, so we can guarantee
it happens only at evict_inode.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 9 Oct 2017 09:55:19 +0000 (17:55 +0800)]
f2fs: fix to avoid race when accessing last_disk_size
last_disk_size could be wrong due to concurrently updating, so using
i_sem semaphore to make last_disk_size updating exclusive to fix this
issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thomas Meyer [Sat, 7 Oct 2017 14:02:21 +0000 (16:02 +0200)]
f2fs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:37 +0000 (09:08 +0800)]
f2fs: give up CP_TRIMMED_FLAG if it drops discards
In ->umount, once we drop remained discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:36 +0000 (09:08 +0800)]
f2fs: trace f2fs_remove_discard
This patch adds tracepoint to trace f2fs_remove_discard.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:35 +0000 (09:08 +0800)]
f2fs: reduce cmd_lock coverage in __issue_discard_cmd
__submit_discard_cmd may lead long latency due to exhaustion of I/O
request resource in block layer, so issuing all discard under cmd_lock
may lead to hangtask, in order to avoid that, let's reduce it's coverage.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:34 +0000 (09:08 +0800)]
f2fs: split discard policy
There are many different scenarios such as fstrim, umount, urgent or
background where we will issue discards, actually, they need use
different policy in aspect of io aware, discard granularity, delay
interval and so on. But now they just share one common discard policy,
so there will be race when changing policy in between these scenarios,
the interference of changing discard policy will be very serious.
This patch changes to split discard policy for different scenarios.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:33 +0000 (09:08 +0800)]
f2fs: wrap discard policy
This patch wraps scattered optional parameters into discard policy as
below, later, with it we expect that we can adjust these parameters with
proper strategy in different scenario.
struct discard_policy {
unsigned int min_interval; /* used for candidates exist */
unsigned int max_interval; /* used for candidates not exist */
unsigned int max_requests; /* # of discards issued per round */
unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */
bool io_aware; /* issue discard in idle time */
bool sync; /* submit discard with REQ_SYNC flag */
};
This patch doesn't change any logic of codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:32 +0000 (09:08 +0800)]
f2fs: support issuing/waiting discard in range
Fstrim intends to trim invalid blocks of filesystem only with specified
range and granularity, but actually, it will issue all previous cached
discard commands which may be out-of-range and be with unmatched
granularity, it's unneeded.
In order to fix above issues, this patch introduces new helps to support
to issue and wait discard in range and adds a new fstrim_list for tracking
in-flight discard from ->fstrim.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:39 +0000 (13:59 +0800)]
f2fs: fix to flush multiple device in checkpoint
If f2fs manages multiple devices, in checkpoint, we need to issue flush
in those devices which contain dirty data/node in their cache before
we write checkpoint region, otherwise, filesystem metadata could be
corrupted if hitting SPO after checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:38 +0000 (13:59 +0800)]
f2fs: enhance multiple device flush
When multiple device feature is enabled, during ->fsync we will issue
flush in all devices to make sure node/data of the file being persisted
into storage. But some flushes of device could be unneeded as file's
data may be not writebacked into those devices. So this patch adds and
manage bitmap per inode in global cache to indicate which device is
dirty and it needs to issue flush during ->fsync, hence, we could improve
performance of fsync in scenario of multiple device.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:37 +0000 (13:59 +0800)]
f2fs: fix to show ino management cache size correctly
It needs to stat size of ino management cache with all type instead of
orphan ino type.
Fixes:
652be55162dc ("f2fs: show # of orphan inodes")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:36 +0000 (13:59 +0800)]
f2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush
If we failed to issue flush in ->fsync, we need to keep FI_UPDATE_WRITE
flag to make sure triggering flush in next ->fsync.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:35 +0000 (13:59 +0800)]
f2fs: obsolete ALLOC_NID_LIST list
As Fan Li reported, there is no user traversing nid_list[ALLOC_NID_LIST]
which is used for tracking preallocated nids. Let's drop it, and only
track preallocated nids in free_nid_root radix-tree.
Reported-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Weichao Guo [Fri, 29 Sep 2017 14:43:23 +0000 (22:43 +0800)]
f2fs: convert inline data for direct I/O & FI_NO_PREALLOC
In FI_NO_PREALLOC cases, direct I/O path may allocate blocks for an
inode but keep its inline data flag. This inconsistency may trigger
vfs clear_inode nrpages bug_on when evicting the inode. We should
convert inline data first in this case.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hsiang Kao [Sat, 23 Sep 2017 18:45:42 +0000 (02:45 +0800)]
f2fs: allow readpages with NULL file pointer
Keep in line with the other Linux file system implementations
since page_cache_sync_readahead supports NULL file pointer,
and thus we can readahead data by f2fs itself without file opening
(something like the btrfs behavior).
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 14 Sep 2017 02:18:01 +0000 (10:18 +0800)]
f2fs: show flush list status in sysfs
This patch adds to show flush list status in sysfs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 4 Sep 2017 10:58:03 +0000 (18:58 +0800)]
f2fs: introduce read_xattr_block
Commit
ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_xattr_block to clean up redundant codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 4 Sep 2017 10:58:02 +0000 (18:58 +0800)]
f2fs: introduce read_inline_xattr
Commit
ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_inline_xattr to clean up redundant codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 25 Sep 2017 06:17:51 +0000 (14:17 +0800)]
Revert "f2fs: reuse nids more aggressively"
Commit
268344664603 ("f2fs: reuse nids more aggressively") tries to
reuse nids as many as possilbe, in order to mitigate producing obsolete
node pages in page cache.
But acutally, before we reuse the nids and related node page cache,
we will always invalidate that node page, so there will be not any
obsolete node pages in cache.
Let's just revert previous commit, so that nm_i::next_scan_nid can be
increased ascendingly, making __build_free_nids traverses all NAT pages
more easily, finally, free nid bitmap cache can be enabled as soon as
possible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Sat, 23 Sep 2017 09:02:18 +0000 (17:02 +0800)]
Revert "f2fs: node segment is prior to data segment selected victim"
This reverts commit
b9cd20619e359d199b755543474c3d853c8e3415.
That patch causes much fewer node segments (which can be used for SSR)
than before, and in the corner case (e.g. create and delete *.txt files in
one same directory, there will be very few node segments but many data
segments), if the reserved free segments are all used up during gc, then
the write_checkpoint can still flush dentry pages to data ssr segments,
but will probably fail to flush node pages to node ssr segments, since
there are not enough node ssr segments left (the left ones are all
full).
So revert this patch to give a fair chance to let node segments remain
for SSR, which provides more robustness for corner cases.
Conflicts:
fs/f2fs/gc.c
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Greg Kroah-Hartman [Sun, 17 Dec 2017 14:08:14 +0000 (15:08 +0100)]
Linux 4.14.7
Mauro Carvalho Chehab [Tue, 7 Nov 2017 13:39:39 +0000 (08:39 -0500)]
dvb_frontend: don't use-after-free the frontend struct
commit
b1cb7372fa822af6c06c8045963571d13ad6348b upstream.
dvb_frontend_invoke_release() may free the frontend struct.
So, the free logic can't update it anymore after calling it.
That's OK, as __dvb_frontend_free() is called only when the
krefs are zeroed, so nobody is using it anymore.
That should fix the following KASAN error:
The KASAN report looks like this (running on kernel
3e0cc09a3a2c40ec1ffb6b4e12da86e98feccb11 (4.14-rc5+)):
==================================================================
BUG: KASAN: use-after-free in __dvb_frontend_free+0x113/0x120
Write of size 8 at addr
ffff880067d45a00 by task kworker/0:1/24
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
4.14.0-rc5-43687-g06ab8a23e0e6 #545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
__dump_stack lib/dump_stack.c:16
dump_stack+0x292/0x395 lib/dump_stack.c:52
print_address_description+0x78/0x280 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351
kasan_report+0x23d/0x350 mm/kasan/report.c:409
__asan_report_store8_noabort+0x1c/0x20 mm/kasan/report.c:435
__dvb_frontend_free+0x113/0x120 drivers/media/dvb-core/dvb_frontend.c:156
dvb_frontend_put+0x59/0x70 drivers/media/dvb-core/dvb_frontend.c:176
dvb_frontend_detach+0x120/0x150 drivers/media/dvb-core/dvb_frontend.c:2803
dvb_usb_adapter_frontend_exit+0xd6/0x160 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:340
dvb_usb_adapter_exit drivers/media/usb/dvb-usb/dvb-usb-init.c:116
dvb_usb_exit+0x9b/0x200 drivers/media/usb/dvb-usb/dvb-usb-init.c:132
dvb_usb_device_exit+0xa5/0xf0 drivers/media/usb/dvb-usb/dvb-usb-init.c:295
usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
__device_release_driver drivers/base/dd.c:861
device_release_driver_internal+0x4f1/0x5c0 drivers/base/dd.c:893
device_release_driver+0x1e/0x30 drivers/base/dd.c:918
bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
device_del+0x5c4/0xab0 drivers/base/core.c:1985
usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
hub_port_connect drivers/usb/core/hub.c:4754
hub_port_connect_change drivers/usb/core/hub.c:5009
port_event drivers/usb/core/hub.c:5115
hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
worker_thread+0x221/0x1850 kernel/workqueue.c:2253
kthread+0x363/0x440 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
Allocated by task 24:
save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x11e/0x2d0 mm/slub.c:2772
kmalloc ./include/linux/slab.h:493
kzalloc ./include/linux/slab.h:666
dtt200u_fe_attach+0x4c/0x110 drivers/media/usb/dvb-usb/dtt200u-fe.c:212
dtt200u_frontend_attach+0x35/0x80 drivers/media/usb/dvb-usb/dtt200u.c:136
dvb_usb_adapter_frontend_init+0x32b/0x660 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:286
dvb_usb_adapter_init drivers/media/usb/dvb-usb/dvb-usb-init.c:86
dvb_usb_init drivers/media/usb/dvb-usb/dvb-usb-init.c:162
dvb_usb_device_init+0xf73/0x17f0 drivers/media/usb/dvb-usb/dvb-usb-init.c:277
dtt200u_usb_probe+0xa1/0xe0 drivers/media/usb/dvb-usb/dtt200u.c:155
usb_probe_interface+0x35d/0x8e0 drivers/usb/core/driver.c:361
really_probe drivers/base/dd.c:413
driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
__device_attach_driver+0x230/0x290 drivers/base/dd.c:653
bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
__device_attach+0x26b/0x3c0 drivers/base/dd.c:710
device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
device_add+0xd0b/0x1660 drivers/base/core.c:1835
usb_set_configuration+0x104e/0x1870 drivers/usb/core/message.c:1932
generic_probe+0x73/0xe0 drivers/usb/core/generic.c:174
usb_probe_device+0xaf/0xe0 drivers/usb/core/driver.c:266
really_probe drivers/base/dd.c:413
driver_probe_device+0x610/0xa00 drivers/base/dd.c:557
__device_attach_driver+0x230/0x290 drivers/base/dd.c:653
bus_for_each_drv+0x161/0x210 drivers/base/bus.c:463
__device_attach+0x26b/0x3c0 drivers/base/dd.c:710
device_initial_probe+0x1f/0x30 drivers/base/dd.c:757
bus_probe_device+0x1eb/0x290 drivers/base/bus.c:523
device_add+0xd0b/0x1660 drivers/base/core.c:1835
usb_new_device+0x7b8/0x1020 drivers/usb/core/hub.c:2457
hub_port_connect drivers/usb/core/hub.c:4903
hub_port_connect_change drivers/usb/core/hub.c:5009
port_event drivers/usb/core/hub.c:5115
hub_event+0x194d/0x3740 drivers/usb/core/hub.c:5195
process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
worker_thread+0x221/0x1850 kernel/workqueue.c:2253
kthread+0x363/0x440 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
Freed by task 24:
save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459
kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
slab_free_hook mm/slub.c:1390
slab_free_freelist_hook mm/slub.c:1412
slab_free mm/slub.c:2988
kfree+0xf6/0x2f0 mm/slub.c:3919
dtt200u_fe_release+0x3c/0x50 drivers/media/usb/dvb-usb/dtt200u-fe.c:202
dvb_frontend_invoke_release.part.13+0x1c/0x30 drivers/media/dvb-core/dvb_frontend.c:2790
dvb_frontend_invoke_release drivers/media/dvb-core/dvb_frontend.c:2789
__dvb_frontend_free+0xad/0x120 drivers/media/dvb-core/dvb_frontend.c:153
dvb_frontend_put+0x59/0x70 drivers/media/dvb-core/dvb_frontend.c:176
dvb_frontend_detach+0x120/0x150 drivers/media/dvb-core/dvb_frontend.c:2803
dvb_usb_adapter_frontend_exit+0xd6/0x160 drivers/media/usb/dvb-usb/dvb-usb-dvb.c:340
dvb_usb_adapter_exit drivers/media/usb/dvb-usb/dvb-usb-init.c:116
dvb_usb_exit+0x9b/0x200 drivers/media/usb/dvb-usb/dvb-usb-init.c:132
dvb_usb_device_exit+0xa5/0xf0 drivers/media/usb/dvb-usb/dvb-usb-init.c:295
usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
__device_release_driver drivers/base/dd.c:861
device_release_driver_internal+0x4f1/0x5c0 drivers/base/dd.c:893
device_release_driver+0x1e/0x30 drivers/base/dd.c:918
bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
device_del+0x5c4/0xab0 drivers/base/core.c:1985
usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
hub_port_connect drivers/usb/core/hub.c:4754
hub_port_connect_change drivers/usb/core/hub.c:5009
port_event drivers/usb/core/hub.c:5115
hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
process_one_work+0xc73/0x1d90 kernel/workqueue.c:2119
worker_thread+0x221/0x1850 kernel/workqueue.c:2253
kthread+0x363/0x440 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
The buggy address belongs to the object at
ffff880067d45500
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 1280 bytes inside of
2048-byte region [
ffff880067d45500,
ffff880067d45d00)
The buggy address belongs to the page:
page:
ffffea00019f5000 count:1 mapcount:0 mapping: (null)
index:0x0 compound_mapcount: 0
flags: 0x100000000008100(slab|head)
raw:
0100000000008100 0000000000000000 0000000000000000 00000001000f000f
raw:
dead000000000100 dead000000000200 ffff88006c002d80 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff880067d45900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880067d45980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880067d45a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff880067d45a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880067d45b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes:
ead666000a5f ("media: dvb_frontend: only use kref after initialized")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Matthias Schwarzott <zzam@gentoo.org>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Scheller [Sun, 29 Oct 2017 15:43:22 +0000 (11:43 -0400)]
media: dvb-core: always call invoke_release() in fe_free()
commit
62229de19ff2b7f3e0ebf4d48ad99061127d0281 upstream.
Follow-up to:
ead666000a5f ("media: dvb_frontend: only use kref after initialized")
The aforementioned commit fixed refcount OOPSes when demod driver attaching
succeeded but tuner driver didn't. However, the use count of the attached
demod drivers don't go back to zero and thus couldn't be cleanly unloaded.
Improve on this by calling dvb_frontend_invoke_release() in
__dvb_frontend_free() regardless of fepriv being NULL, instead of returning
when fepriv is NULL. This is safe to do since _invoke_release() will check
for passed pointers being valid before calling the .release() function.
[mchehab@s-opensource.com: changed the logic a little bit to reduce
conflicts with another bug fix patch under review]
Fixes:
ead666000a5f ("media: dvb_frontend: only use kref after initialized")
Signed-off-by: Daniel Scheller <d.scheller@gmx.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reinette Chatre [Fri, 20 Oct 2017 09:16:58 +0000 (02:16 -0700)]
x86/intel_rdt: Fix potential deadlock during resctrl unmount
[ Upstream commit
36b6f9fcb8928c06b6638a4cf91bc9d69bb49aa2 ]
Lockdep warns about a potential deadlock:
[ 66.782842] ======================================================
[ 66.782888] WARNING: possible circular locking dependency detected
[ 66.782937] 4.14.0-rc2-test-test+ #48 Not tainted
[ 66.782983] ------------------------------------------------------
[ 66.783052] umount/336 is trying to acquire lock:
[ 66.783117] (cpu_hotplug_lock.rw_sem){++++}, at: [<
ffffffff81032395>] rdt_kill_sb+0x215/0x390
[ 66.783193]
but task is already holding lock:
[ 66.783244] (rdtgroup_mutex){+.+.}, at: [<
ffffffff810321b6>] rdt_kill_sb+0x36/0x390
[ 66.783305]
which lock already depends on the new lock.
[ 66.783364]
the existing dependency chain (in reverse order) is:
[ 66.783419]
-> #3 (rdtgroup_mutex){+.+.}:
[ 66.783467] __lock_acquire+0x1293/0x13f0
[ 66.783509] lock_acquire+0xaf/0x220
[ 66.783543] __mutex_lock+0x71/0x9b0
[ 66.783575] mutex_lock_nested+0x1b/0x20
[ 66.783610] intel_rdt_online_cpu+0x3b/0x430
[ 66.783649] cpuhp_invoke_callback+0xab/0x8e0
[ 66.783687] cpuhp_thread_fun+0x7a/0x150
[ 66.783722] smpboot_thread_fn+0x1cc/0x270
[ 66.783764] kthread+0x16e/0x190
[ 66.783794] ret_from_fork+0x27/0x40
[ 66.783825]
-> #2 (cpuhp_state){+.+.}:
[ 66.783870] __lock_acquire+0x1293/0x13f0
[ 66.783906] lock_acquire+0xaf/0x220
[ 66.783938] cpuhp_issue_call+0x102/0x170
[ 66.783974] __cpuhp_setup_state_cpuslocked+0x154/0x2a0
[ 66.784023] __cpuhp_setup_state+0xc7/0x170
[ 66.784061] page_writeback_init+0x43/0x67
[ 66.784097] pagecache_init+0x43/0x4a
[ 66.784131] start_kernel+0x3ad/0x3f7
[ 66.784165] x86_64_start_reservations+0x2a/0x2c
[ 66.784204] x86_64_start_kernel+0x72/0x75
[ 66.784241] verify_cpu+0x0/0xfb
[ 66.784270]
-> #1 (cpuhp_state_mutex){+.+.}:
[ 66.784319] __lock_acquire+0x1293/0x13f0
[ 66.784355] lock_acquire+0xaf/0x220
[ 66.784387] __mutex_lock+0x71/0x9b0
[ 66.784419] mutex_lock_nested+0x1b/0x20
[ 66.784454] __cpuhp_setup_state_cpuslocked+0x52/0x2a0
[ 66.784497] __cpuhp_setup_state+0xc7/0x170
[ 66.784535] page_alloc_init+0x28/0x30
[ 66.784569] start_kernel+0x148/0x3f7
[ 66.784602] x86_64_start_reservations+0x2a/0x2c
[ 66.784642] x86_64_start_kernel+0x72/0x75
[ 66.784678] verify_cpu+0x0/0xfb
[ 66.784707]
-> #0 (cpu_hotplug_lock.rw_sem){++++}:
[ 66.784759] check_prev_add+0x32f/0x6e0
[ 66.784794] __lock_acquire+0x1293/0x13f0
[ 66.784830] lock_acquire+0xaf/0x220
[ 66.784863] cpus_read_lock+0x3d/0xb0
[ 66.784896] rdt_kill_sb+0x215/0x390
[ 66.784930] deactivate_locked_super+0x3e/0x70
[ 66.784968] deactivate_super+0x40/0x60
[ 66.785003] cleanup_mnt+0x3f/0x80
[ 66.785034] __cleanup_mnt+0x12/0x20
[ 66.785070] task_work_run+0x8b/0xc0
[ 66.785103] exit_to_usermode_loop+0x94/0xa0
[ 66.786804] syscall_return_slowpath+0xe8/0x150
[ 66.788502] entry_SYSCALL_64_fastpath+0xab/0xad
[ 66.790194]
other info that might help us debug this:
[ 66.795139] Chain exists of:
cpu_hotplug_lock.rw_sem --> cpuhp_state --> rdtgroup_mutex
[ 66.800035] Possible unsafe locking scenario:
[ 66.803267] CPU0 CPU1
[ 66.804867] ---- ----
[ 66.806443] lock(rdtgroup_mutex);
[ 66.808002] lock(cpuhp_state);
[ 66.809565] lock(rdtgroup_mutex);
[ 66.811110] lock(cpu_hotplug_lock.rw_sem);
[ 66.812608]
*** DEADLOCK ***
[ 66.816983] 2 locks held by umount/336:
[ 66.818418] #0: (&type->s_umount_key#35){+.+.}, at: [<
ffffffff81229738>] deactivate_super+0x38/0x60
[ 66.819922] #1: (rdtgroup_mutex){+.+.}, at: [<
ffffffff810321b6>] rdt_kill_sb+0x36/0x390
When the resctrl filesystem is unmounted the locks should be obtain in the
locks in the same order as was done when the cpus came online:
cpu_hotplug_lock before rdtgroup_mutex.
This also requires to switch the static_branch_disable() calls to the
_cpulocked variant because now cpu hotplug lock is held already.
[ tglx: Switched to cpus_read_[un]lock ]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Acked-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/cc292e76be073f7260604651711c47b09fd0dc81.1508490116.git.reinette.chatre@intel.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Leon Romanovsky [Wed, 25 Oct 2017 20:10:19 +0000 (23:10 +0300)]
RDMA/cxgb4: Annotate r2 and stag as __be32
[ Upstream commit
7d7d065a5eec7e218174d5c64a9f53f99ffdb119 ]
Chelsio cxgb4 HW is big-endian, hence there is need to properly
annotate r2 and stag fields as __be32 and not __u32 to fix the
following sparse warnings.
drivers/infiniband/hw/cxgb4/qp.c:614:16:
warning: incorrect type in assignment (different base types)
expected unsigned int [unsigned] [usertype] r2
got restricted __be32 [usertype] <noident>
drivers/infiniband/hw/cxgb4/qp.c:615:18:
warning: incorrect type in assignment (different base types)
expected unsigned int [unsigned] [usertype] stag
got restricted __be32 [usertype] <noident>
Cc: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Zdenek Kabelac [Wed, 8 Nov 2017 12:44:56 +0000 (13:44 +0100)]
md: free unused memory after bitmap resize
[ Upstream commit
0868b99c214a3d55486c700de7c3f770b7243e7c ]
When bitmap is resized, the old kalloced chunks just are not released
once the resized bitmap starts to use new space.
This fixes in particular kmemleak reports like this one:
unreferenced object 0xffff8f4311e9c000 (size 4096):
comm "lvm", pid 19333, jiffies
4295263268 (age 528.265s)
hex dump (first 32 bytes):
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
backtrace:
[<
ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
[<
ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
[<
ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
[<
ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
[<
ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
[<
ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
[<
ffffffffa676f1ae>] bitmap_load+0xbe/0x160
[<
ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
[<
ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
[<
ffffffffa6774b52>] dm_resume+0x122/0x140
[<
ffffffffa6779b9f>] dev_suspend+0x18f/0x290
[<
ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
[<
ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
[<
ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
[<
ffffffffa62d7269>] SyS_ioctl+0x79/0x90
[<
ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Heinz Mauelshagen [Thu, 2 Nov 2017 18:58:28 +0000 (19:58 +0100)]
dm raid: fix panic when attempting to force a raid to sync
[ Upstream commit
233978449074ca7e45d9c959f9ec612d1b852893 ]
Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.
Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul Moore [Fri, 1 Sep 2017 13:44:34 +0000 (09:44 -0400)]
audit: ensure that 'audit=1' actually enables audit for PID 1
[ Upstream commit
173743dd99a49c956b124a74c8aacb0384739a4c ]
Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked. This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.
This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Steve Grubb [Tue, 17 Oct 2017 22:29:22 +0000 (18:29 -0400)]
audit: Allow auditd to set pid to 0 to end auditing
[ Upstream commit
33e8a907804428109ce1d12301c3365d619cc4df ]
The API to end auditing has historically been for auditd to set the
pid to 0. This patch restores that functionality.
See: https://github.com/linux-audit/audit-kernel/issues/69
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Israel Rukshin [Sun, 5 Nov 2017 08:43:01 +0000 (08:43 +0000)]
nvmet-rdma: update queue list during ib_device removal
[ Upstream commit
43b92fd27aaef0f529c9321cfebbaec1d7b8f503 ]
A NULL deref happens when nvmet_rdma_remove_one() is called more than once
(e.g. while connected via 2 ports).
The first call frees the queues related to the first ib_device but
doesn't remove them from the queue list.
While calling nvmet_rdma_remove_one() for the second ib_device it goes over
the full queue list again and we get the NULL deref.
Fixes:
f1d4ef7d ("nvmet-rdma: register ib_client to not deadlock in device removal")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grmberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bart Van Assche [Wed, 8 Nov 2017 18:23:45 +0000 (10:23 -0800)]
blk-mq: Avoid that request queue removal can trigger list corruption
[ Upstream commit
aba7afc5671c23beade64d10caf86e24a9105dab ]
Avoid that removal of a request queue sporadically triggers the
following warning:
list_del corruption. next->prev should be
ffff8807d649b970, but was
6b6b6b6b6b6b6b6b
WARNING: CPU: 3 PID: 342 at lib/list_debug.c:56 __list_del_entry_valid+0x92/0xa0
Call Trace:
process_one_work+0x11b/0x660
worker_thread+0x3d/0x3b0
kthread+0x129/0x140
ret_from_fork+0x27/0x40
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hongxu Jia [Fri, 10 Nov 2017 07:59:17 +0000 (15:59 +0800)]
ide: ide-atapi: fix compile error with defining macro DEBUG
[ Upstream commit
8dc7a31fbce5e2dbbacd83d910da37105181b054 ]
Compile ide-atapi failed with defining macro "DEBUG"
...
|drivers/ide/ide-atapi.c:285:52: error: 'struct request' has
no member named 'cmd'; did you mean 'csd'?
| debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]);
...
Since we split the scsi_request out of struct request, it missed
do the same thing on debug_log
Fixes:
82ed4db499b8 ("block: split scsi_request out of struct request")
Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Keefe Liu [Thu, 9 Nov 2017 12:09:31 +0000 (20:09 +0800)]
ipvlan: fix ipv6 outbound device
[ Upstream commit
ca29fd7cce5a6444d57fb86517589a1a31c759e1 ]
When process the outbound packet of ipv6, we should assign the master
device to output device other than input device.
Signed-off-by: Keefe Liu <liuqifa@huawei.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vaidyanathan Srinivasan [Wed, 23 Aug 2017 18:58:41 +0000 (00:28 +0530)]
powerpc/powernv/idle: Round up latency and residency values
[ Upstream commit
8d4e10e9ed9450e18fbbf6a8872be0eac9fd4999 ]
On PowerNV platforms, firmware provides exit latency and
target residency for each of the idle states in nano
seconds. Cpuidle framework expects the values in micro
seconds. Round up to nearest micro seconds to avoid errors
in cases where the values are defined as fractional micro
seconds.
Default idle state of 'snooze' has exit latency of zero. If
other states have fractional micro second exit latency, they
would get rounded down to zero micro second and make cpuidle
framework choose deeper idle state when snooze loop is the
right choice.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Masahiro Yamada [Thu, 12 Oct 2017 09:22:25 +0000 (18:22 +0900)]
kbuild: do not call cc-option before KBUILD_CFLAGS initialization
[ Upstream commit
433dc2ebe7d17dd21cba7ad5c362d37323592236 ]
Some $(call cc-option,...) are invoked very early, even before
KBUILD_CFLAGS, etc. are initialized.
The returned string from $(call cc-option,...) depends on
KBUILD_CPPFLAGS, KBUILD_CFLAGS, and GCC_PLUGINS_CFLAGS.
Since they are exported, they are not empty when the top Makefile
is recursively invoked.
The recursion occurs in several places. For example, the top
Makefile invokes itself for silentoldconfig. "make tinyconfig",
"make rpm-pkg" are the cases, too.
In those cases, the second call of cc-option from the same line
runs a different shell command due to non-pristine KBUILD_CFLAGS.
To get the same result all the time, KBUILD_* and GCC_PLUGINS_CFLAGS
must be initialized before any call of cc-option. This avoids
garbage data in the .cache.mk file.
Move all calls of cc-option below the config targets because target
compiler flags are unnecessary for Kconfig.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Marc Zyngier [Thu, 16 Nov 2017 17:58:17 +0000 (17:58 +0000)]
KVM: arm/arm64: vgic-its: Preserve the revious read from the pending table
commit
64afe6e9eb4841f35317da4393de21a047a883b3 upstream.
The current pending table parsing code assumes that we keep the
previous read of the pending bits, but keep that variable in
the current block, making sure it is discarded on each loop.
We end-up using whatever is on the stack. Who knows, it might
just be the right thing...
Fixes:
33d3bc9556a7d ("KVM: arm64: vgic-its: Read initial LPI pending table")
Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Al Viro [Tue, 5 Dec 2017 23:27:57 +0000 (23:27 +0000)]
fix kcm_clone()
commit
a5739435b5a3b8c449f8844ecd71a3b1e89f0a33 upstream.
1) it's fput() or sock_release(), not both
2) don't do fd_install() until the last failure exit.
3) not a bug per se, but... don't attach socket to struct file
until it's set up.
Take reserving descriptor into the caller, move fd_install() to the
caller, sanitize failure exits and calling conventions.
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jeff Layton [Tue, 14 Nov 2017 19:42:57 +0000 (14:42 -0500)]
fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
commit
4d2dc2cc766c3b51929658cacbc6e34fc8e242fb upstream.
Currently, we're capping the values too low in the F_GETLK64 case. The
fields in that structure are 64-bit values, so we shouldn't need to do
any sort of fixup there.
Make sure we check that assumption at build time in the future however
by ensuring that the sizes we're copying will fit.
With this, we no longer need COMPAT_LOFF_T_MAX either, so remove it.
Fixes:
94073ad77fff2 (fs/locks: don't mess with the address limit in compat_fcntl64)
Reported-by: Vitaly Lipatov <lav@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vincent Pelletier [Sun, 26 Nov 2017 06:52:53 +0000 (06:52 +0000)]
usb: gadget: ffs: Forbid usb_ep_alloc_request from sleeping
commit
30bf90ccdec1da9c8198b161ecbff39ce4e5a9ba upstream.
Found using DEBUG_ATOMIC_SLEEP while submitting an AIO read operation:
[ 100.853642] BUG: sleeping function called from invalid context at mm/slab.h:421
[ 100.861148] in_atomic(): 1, irqs_disabled(): 1, pid: 1880, name: python
[ 100.867954] 2 locks held by python/1880:
[ 100.867961] #0: (&epfile->mutex){....}, at: [<
f8188627>] ffs_mutex_lock+0x27/0x30 [usb_f_fs]
[ 100.868020] #1: (&(&ffs->eps_lock)->rlock){....}, at: [<
f818ad4b>] ffs_epfile_io.isra.17+0x24b/0x590 [usb_f_fs]
[ 100.868076] CPU: 1 PID: 1880 Comm: python Not tainted 4.14.0-edison+ #118
[ 100.868085] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[ 100.868093] Call Trace:
[ 100.868122] dump_stack+0x47/0x62
[ 100.868156] ___might_sleep+0xfd/0x110
[ 100.868182] __might_sleep+0x68/0x70
[ 100.868217] kmem_cache_alloc_trace+0x4b/0x200
[ 100.868248] ? dwc3_gadget_ep_alloc_request+0x24/0xe0 [dwc3]
[ 100.868302] dwc3_gadget_ep_alloc_request+0x24/0xe0 [dwc3]
[ 100.868343] usb_ep_alloc_request+0x16/0xc0 [udc_core]
[ 100.868386] ffs_epfile_io.isra.17+0x444/0x590 [usb_f_fs]
[ 100.868424] ? _raw_spin_unlock_irqrestore+0x27/0x40
[ 100.868457] ? kiocb_set_cancel_fn+0x57/0x60
[ 100.868477] ? ffs_ep0_poll+0xc0/0xc0 [usb_f_fs]
[ 100.868512] ffs_epfile_read_iter+0xfe/0x157 [usb_f_fs]
[ 100.868551] ? security_file_permission+0x9c/0xd0
[ 100.868587] ? rw_verify_area+0xac/0x120
[ 100.868633] aio_read+0x9d/0x100
[ 100.868692] ? __fget+0xa2/0xd0
[ 100.868727] ? __might_sleep+0x68/0x70
[ 100.868763] SyS_io_submit+0x471/0x680
[ 100.868878] do_int80_syscall_32+0x4e/0xd0
[ 100.868921] entry_INT80_32+0x2a/0x2a
[ 100.868932] EIP: 0xb7fbb676
[ 100.868941] EFLAGS:
00000292 CPU: 1
[ 100.868951] EAX:
ffffffda EBX:
b7aa2000 ECX:
00000002 EDX:
b7af8368
[ 100.868961] ESI:
b7fbb660 EDI:
b7aab000 EBP:
bfb6c658 ESP:
bfb6c638
[ 100.868973] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Masamitsu Yamazaki [Wed, 15 Nov 2017 07:33:14 +0000 (07:33 +0000)]
ipmi: Stop timers before cleaning up the module
commit
4f7f5551a760eb0124267be65763008169db7087 upstream.
System may crash after unloading ipmi_si.ko module
because a timer may remain and fire after the module cleaned up resources.
cleanup_one_si() contains the following processing.
/*
* Make sure that interrupts, the timer and the thread are
* stopped and will not run again.
*/
if (to_clean->irq_cleanup)
to_clean->irq_cleanup(to_clean);
wait_for_timer_and_thread(to_clean);
/*
* Timeouts are stopped, now make sure the interrupts are off
* in the BMC. Note that timers and CPU interrupts are off,
* so no need for locks.
*/
while (to_clean->curr_msg || (to_clean->si_state != SI_NORMAL)) {
poll(to_clean);
schedule_timeout_uninterruptible(1);
}
si_state changes as following in the while loop calling poll(to_clean).
SI_GETTING_MESSAGES
=> SI_CHECKING_ENABLES
=> SI_SETTING_ENABLES
=> SI_GETTING_EVENTS
=> SI_NORMAL
As written in the code comments above,
timers are expected to stop before the polling loop and not to run again.
But the timer is set again in the following process
when si_state becomes SI_SETTING_ENABLES.
=> poll
=> smi_event_handler
=> handle_transaction_done
// smi_info->si_state == SI_SETTING_ENABLES
=> start_getting_events
=> start_new_msg
=> smi_mod_timer
=> mod_timer
As a result, before the timer set in start_new_msg() expires,
the polling loop may see si_state becoming SI_NORMAL
and the module clean-up finishes.
For example, hard LOCKUP and panic occurred as following.
smi_timeout was called after smi_event_handler,
kcs_event and hangs at port_inb()
trying to access I/O port after release.
[exception RIP: port_inb+19]
RIP:
ffffffffc0473053 RSP:
ffff88069fdc3d80 RFLAGS:
00000006
RAX:
ffff8806800f8e00 RBX:
ffff880682bd9400 RCX:
0000000000000000
RDX:
0000000000000ca3 RSI:
0000000000000ca3 RDI:
ffff8806800f8e40
RBP:
ffff88069fdc3d80 R8:
ffffffff81d86dfc R9:
ffffffff81e36426
R10:
00000000000509f0 R11:
0000000000100000 R12:
0000000000]:000000
R13:
0000000000000000 R14:
0000000000000246 R15:
ffff8806800f8e00
ORIG_RAX:
ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
To fix the problem I defined a flag, timer_can_start,
as member of struct smi_info.
The flag is enabled immediately after initializing the timer
and disabled immediately before waiting for timer deletion.
Fixes:
0cfec916e86d ("ipmi: Start the timer and thread on internal msgs")
Signed-off-by: Yamazaki Masamitsu <m-yamazaki@ah.jp.nec.com>
[Some fairly major changes went into the IPMI driver in 4.15, so this
required a backport as the code had changed and moved to a different
file.]
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xin Long [Sun, 26 Nov 2017 12:56:07 +0000 (20:56 +0800)]
sctp: use right member as the param of list_for_each_entry
[ Upstream commit
a8dd397903a6e57157f6265911f7d35681364427 ]
Commit
d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues
when migrating a sock") made a mistake that using 'list' as the param of
list_for_each_entry to traverse the retransmit, sacked and abandoned
queues, while chunks are using 'transmitted_list' to link into these
queues.
It could cause NULL dereference panic if there are chunks in any of these
queues when peeling off one asoc.
So use the chunk member 'transmitted_list' instead in this patch.
Fixes:
d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues when migrating a sock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jakub Kicinski [Mon, 27 Nov 2017 19:11:41 +0000 (11:11 -0800)]
cls_bpf: don't decrement net's refcount when offload fails
[ Upstream commit
25415cec502a1232b19fffc85465882b19a90415 ]
When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point. This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().
__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out. The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.
Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.
Fixes:
aae2c35ec892 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Gustavo A. R. Silva [Sat, 25 Nov 2017 19:14:40 +0000 (13:14 -0600)]
net: openvswitch: datapath: fix data type in queue_gso_packets
[ Upstream commit
2734166e89639c973c6e125ac8bcfc2d9db72b70 ]
gso_type is being used in binary AND operations together with SKB_GSO_UDP.
The issue is that variable gso_type is of type unsigned short and
SKB_GSO_UDP expands to more than 16 bits:
SKB_GSO_UDP = 1 << 16
this makes any binary AND operation between gso_type and SKB_GSO_UDP to
be always zero, hence making some code unreachable and likely causing
undesired behavior.
Fix this by changing the data type of variable gso_type to unsigned int.
Addresses-Coverity-ID:
1462223
Fixes:
0c19f846d582 ("net: accept UFO datagrams from tuntap and packet")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Willem de Bruijn [Tue, 21 Nov 2017 15:22:25 +0000 (10:22 -0500)]
net: accept UFO datagrams from tuntap and packet
[ Upstream commit
0c19f846d582af919db66a5914a0189f9f92c936 ]
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.
Partially revert the UFO removal from
182e0b6b5846~1..
d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.
It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit
939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit
8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").
(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.
Tested
Booted a v4.13 guest kernel with QEMU. On a host kernel before this
patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
enabled, same as on a v4.13 host kernel.
A UFO packet sent from the guest appears on the tap device:
host:
nc -l -p -u 8000 &
tcpdump -n -i tap0
guest:
dd if=/dev/zero of=payload.txt bs=1 count=2000
nc -u 192.16.1.1 8000 < payload.txt
Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
packets arriving fragmented:
./with_tap_pair.sh ./tap_send_ufo tap0 tap1
(from https://github.com/wdebruij/kerneltools/tree/master/tests)
Changes
v1 -> v2
- simplified set_offload change (review comment)
- documented test procedure
Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes:
fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xin Long [Sun, 19 Nov 2017 11:31:04 +0000 (19:31 +0800)]
tun: fix rcu_read_lock imbalance in tun_build_skb
[ Upstream commit
654d573845f35017dc397840fa03610fef3d08b0 ]
rcu_read_lock in tun_build_skb is used to rcu_dereference tun->xdp_prog
safely, rcu_read_unlock should be done in every return path.
Now I could see one place missing it, where it returns NULL in switch-case
XDP_REDIRECT, another palce using rcu_read_lock wrongly, where it returns
NULL in if (xdp_xmit) chunk.
So fix both in this patch.
Fixes:
761876c857cb ("tap: XDP support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David Ahern [Tue, 21 Nov 2017 15:08:57 +0000 (07:08 -0800)]
net: ipv6: Fixup device for anycast routes during copy
[ Upstream commit
98d11291d189cb5adf49694d0ad1b971c0212697 ]
Florian reported a breakage with anycast routes due to commit
4832c30d5458 ("net: ipv6: put host and anycast routes on device with
address"). Prior to this commit anycast routes were added against the
loopback device causing repetitive route entries with no insight into
why they existed. e.g.:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev lo proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
The point of commit
4832c30d5458 is to add the routes using the device
with the address which is causing the route to be added. e.g.,:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev eth1 proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth1 proto kernel metric 0 pref medium
For traffic to work as it did before, the dst device needs to be switched
to the loopback when the copy is created similar to local routes.
Fixes:
4832c30d5458 ("net: ipv6: put host and anycast routes on device with address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wei Xu [Fri, 1 Dec 2017 10:10:37 +0000 (05:10 -0500)]
tun: free skb in early errors
[ Upstream commit
c33ee15b3820a03cf8229ba9415084197b827f8c ]
tun_recvmsg() supports accepting skb by msg_control after
commit
ac77cfd4258f ("tun: support receiving skb through msg_control"),
the skb if presented should be freed no matter how far it can go
along, otherwise it would be leaked.
This patch fixes several missed cases.
Signed-off-by: Wei Xu <wexu@redhat.com>
Reported-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>