Zygo Blaxell [Mon, 13 Jun 2016 03:39:58 +0000 (23:39 -0400)]
btrfs: avoid blocking open_ctree from cleaner_kthread
This fixes a problem introduced in commit
2f3165ecf103599f82bf0ea254039db335fb5005
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".
open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.
Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.
When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 27 May 2016 17:03:04 +0000 (13:03 -0400)]
Btrfs: don't BUG_ON() in btrfs_orphan_add
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 8 Jun 2016 04:36:38 +0000 (00:36 -0400)]
btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor. btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction. trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.
In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion. In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 6 Jun 2016 19:01:23 +0000 (12:01 -0700)]
Btrfs: check if extent buffer is aligned to sectorsize
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.
This adds a warning to let us quickly know what was happening.
Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Heinrich Schuchardt [Sat, 11 Jun 2016 16:11:10 +0000 (18:11 +0200)]
btrfs: Use correct format specifier
Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Wed, 8 Jun 2016 21:36:12 +0000 (14:36 -0700)]
Merge branch 'misc-fixes-4.7' of git://git./linux/kernel/git/kdave/linux into for-linus-4.7
Chris Mason [Wed, 8 Jun 2016 21:35:11 +0000 (14:35 -0700)]
Merge branch 'for-chris' of git://git./linux/kernel/git/kdave/linux into for-linus-4.7
Feifei Xu [Wed, 1 Jun 2016 11:18:30 +0000 (19:18 +0800)]
Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
In __test_eb_bitmaps(), we write random data to a bitmap. Then copy
the bitmap to another bitmap that resides inside an extent buffer.
Later we verify the values of corresponding bits in the bitmap and the
bitmap inside the extent buffer. However, extent_buffer_test_bit()
reads in byte granularity while test_bit() reads in unsigned long
granularity. Hence we end up comparing wrong bits on big-endian
systems such as ppc64. This commit fixes the issue by reading the
bitmap in byte granularity.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:29 +0000 (19:18 +0800)]
Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
With 64K sectorsize, 1G sized block group cannot span across bitmaps.
To execute test_bitmaps() function, this commit allocates
"BITS_PER_BITMAP * sectorsize + PAGE_SIZE" sized block group.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:28 +0000 (19:18 +0800)]
Btrfs: self-tests: Use macros instead of constants and add missing newline
This commit replaces numerical constants with appropriate
preprocessor macros.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:27 +0000 (19:18 +0800)]
Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
To test all possible sectorsizes, this commit adds a sectorsize
array. This commit executes the tests for all possible sectorsizes and
nodesizes.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:26 +0000 (19:18 +0800)]
Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE.
In such a scenario, we will never be able to have an extent buffer
containing more than one page. Hence in such cases this commit does not
execute the page straddling tests.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 16 Sep 2015 13:34:53 +0000 (15:34 +0200)]
btrfs: advertise which crc32c implementation is being used at module load
Since several architectures support hardware-accelerated crc32c
calculation, it would be nice to confirm that btrfs is actually using it.
We can see an elevated use count for the module, but it doesn't actually
show who the users are. This patch simply prints the name of the driver
after successfully initializing the shash.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ added a helper and used in module load-time message ]
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 3 Jun 2016 19:05:15 +0000 (12:05 -0700)]
Btrfs: add validadtion checks for chunk loading
To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.
Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 3 Jun 2016 19:05:14 +0000 (12:05 -0700)]
Btrfs: add more validation checks for superblock
This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Sat, 4 Jun 2016 00:41:42 +0000 (17:41 -0700)]
Btrfs: clear uptodate flags of pages in sys_array eb
We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb. As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.
If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will skip our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"
This adds the 'clear uptodate' part to force it to read from disk.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Sat, 19 Sep 2015 18:28:25 +0000 (11:28 -0700)]
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map. The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.
This actually works most of the time, except when there are extent maps
past the end of the inline extent. rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.
Josef made a script to trigger with xfs_io:
#!/bin/bash
xfs_io -f -c "pwrite 0 1000" inline
xfs_io -c "falloc -k 4k 1M" inline
xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
xfs_io -c "fadvise -d 0 1000" inline
cat inline
You'll get EIOs trying to read inline after this because add_extent_map
is returning EEXIST
Signed-off-by: Chris Mason <clm@fb.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:25 +0000 (19:18 +0800)]
Btrfs: self-tests: Support non-4k page size
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:24 +0000 (19:18 +0800)]
Btrfs: Fix integer overflow when calculating bytes_per_bitmap
On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
fix integer overflow.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Feifei Xu [Wed, 1 Jun 2016 11:18:23 +0000 (19:18 +0800)]
Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
On a ppc64 machine using 64K as the block size, assume that the RB
tree at btrfs_free_space_ctl->free_space_offset contains following
two entries:
1. A bitmap entry having an offset value of 0 and having the bits
corresponding to the address range [128M+512K, 128M+768K] set.
2. An extent entry corresponding to the address range
[128M-256K, 128M-128K]
In such a scenario, test_check_exists() invoked for checking the
existence of address range [128M+768K, 256M] can lead to an
infinite loop as explained below:
- Checking for the extent entry fails.
- Checking for a bitmap entry results in the free space info in
range [128M+512K, 128M+768K] beng returned.
- rb_prev(info) returns NULL because the bitmap entry starting from
offset 0 comes first in the RB tree.
- current_node = bitmap node.
- while (current_node)
tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
Since extent based free space entry's last address is smaller
than the address being searched for (i.e. 128M+768K) we
incorrectly again obtain the extent node as the "next right node"
of the RB tree and thus end up looping infinitely.
This patch fixes the issue by checking the "tmp" variable which point
to the most recently searched free space node.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Thu, 2 Jun 2016 00:03:50 +0000 (17:03 -0700)]
Merge branch 'dev-replace-fixes-4.7' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.7
Josef Bacik [Wed, 23 Sep 2015 19:00:37 +0000 (15:00 -0400)]
Btrfs: end transaction if we abort when creating uuid root
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 27 May 2016 21:21:27 +0000 (22:21 +0100)]
Btrfs: fix race between device replace and read repair
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Fri, 27 May 2016 16:42:05 +0000 (17:42 +0100)]
Btrfs: fix race between device replace and discard
While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:
[ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
[ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 3206.388595] task:
ffff88017ace0100 ti:
ffff880171b98000 task.ti:
ffff880171b98000
[ 3206.388595] RIP: 0010:[<
ffffffff8124d233>] [<
ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
[ 3206.388595] RSP: 0018:
ffff880171b9bb80 EFLAGS:
00010246
[ 3206.388595] RAX:
ffff880171b9bc28 RBX:
000000000090d000 RCX:
0000000000000000
[ 3206.388595] RDX:
ffffffff82fa1b48 RSI:
ffffffff8179f46c RDI:
ffffffff82fa1b48
[ 3206.388595] RBP:
ffff880171b9bcc0 R08:
0000000000000000 R09:
0000000000000001
[ 3206.388595] R10:
ffff880171b9bce0 R11:
000000000090f000 R12:
ffff880171b9bbe8
[ 3206.388595] R13:
0000000000000010 R14:
0000000000004868 R15:
6b6b6b6b6b6b6b6b
[ 3206.388595] FS:
00007f6182e4e700(0000) GS:
ffff88023fdc0000(0000) knlGS:
0000000000000000
[ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 3206.388595] CR2:
00007f617c2bbb18 CR3:
000000017ad9c000 CR4:
00000000000006e0
[ 3206.388595] Stack:
[ 3206.388595]
0000000000004878 0000000000000000 0000000002400040 0000000000000000
[ 3206.388595]
0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
[ 3206.388595]
ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
[ 3206.388595] Call Trace:
[ 3206.388595] [<
ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<
ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<
ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
[ 3206.388595] [<
ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
[ 3206.388595] [<
ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<
ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
[ 3206.388595] [<
ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<
ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
[ 3206.388595] [<
ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
[ 3206.388595] [<
ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
[ 3206.388595] [<
ffffffff811a8417>] do_fsync+0x31/0x4a
[ 3206.388595] [<
ffffffff811a8637>] SyS_fsync+0x10/0x14
[ 3206.388595] [<
ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 3206.388595] [<
ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[ 3206.388595] [<
ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.
This is illustrated by the following diagram.
CPU 1 CPU 2
btrfs_dev_replace_start()
(...)
btrfs_dev_replace_finishing()
btrfs_start_transaction()
btrfs_commit_transaction()
(...)
btrfs_sync_file()
btrfs_start_transaction()
(...)
btrfs_commit_transaction()
btrfs_finish_extent_commit()
btrfs_discard_extent()
btrfs_map_block()
--> returns a struct btrfs_bio
with a stripe that has a
device field pointing to
source device of the replace
operation (the device that
is being replaced)
mutex_lock(&uuid_mutex)
mutex_lock(&fs_info->fs_devices->device_list_mutex)
mutex_lock(&fs_info->chunk_mutex)
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates the mapping tree and for each
extent map that has a stripe pointing to
the source device, it updates the stripe
to point to the target device instead
btrfs_rm_dev_replace_blocked()
--> waits for fs_info->bio_counter to go down to 0
btrfs_rm_dev_replace_remove_srcdev()
--> removes source device from the list of devices
mutex_unlock(&fs_info->chunk_mutex)
mutex_unlock(&fs_info->fs_devices->device_list_mutex)
mutex_unlock(&uuid_mutex)
btrfs_rm_dev_replace_free_srcdev()
--> frees the source device
--> iterates over all stripes
of the returned struct
btrfs_bio
--> for each stripe it
dereferences its device
pointer
--> it ends up finding a
pointer to the device
used as the source
device for the replace
operation and that was
already freed
So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Wed, 18 May 2016 19:29:44 +0000 (20:29 +0100)]
Btrfs: fix race between device replace and chunk allocation
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.
However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.
For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:
[ extent bg A ] [ hole, unallocated space ] [extent bg B ]
3Gb 4Gb 5Gb
While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:
CPU 1 CPU 2
<we are at transaction N>
scrub_enumerate_chunks()
--> searches the device tree for
extents belonging to the source
device using the device tree's
commit root
--> 1st iteration finds extent belonging to
block group A
--> sets block group A to RO mode
(btrfs_inc_block_group_ro)
--> sets cursor left to found_key.offset
which is 3Gb
--> scrub_chunk() starts
copies all allocated extents from
block group's A stripe at source
device into target device
btrfs_alloc_chunk()
--> allocates device extent
in the range [4Gb, 5Gb[
from the source device for
a new block group C
extent allocated from block
group C for a direct IO,
buffered write or btree node/leaf
extent is written to, perhaps
in response to a writepages()
call from the VM or directly
through direct IO
the write is made only against
the source device and not against
the target device because the
extent's offset is in the interval
[4Gb, 5Gb[ which is larger then
the value of cursor_left (3Gb)
--> scrub_chunks() finishes
--> updates left cursor from 3Gb to
4Gb
--> btrfs_dec_block_group_ro() sets
block group A back to RW mode
<we are still at transaction N>
--> 2nd iteration finds extent belonging to
block group B - it did not find the new
extent in the range [4Gb, 5Gb[ for block
group C because we are using the device
tree's commit root or even because the
block group's items are not all yet
inserted in the respective btrees, that is,
the block group is still attached to some
transaction handle's new_bgs list and
btrfs_create_pending_block_groups() was
not called yet against that transaction
handle, so the device extent items were
not yet inserted into the devices tree
<we are still at transaction N>
--> so we end not copying anything from the newly
allocated device extent from the source device
to the target device
So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sat, 14 May 2016 18:44:40 +0000 (19:44 +0100)]
Btrfs: fix race setting block group back to RW mode during device replace
After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sat, 14 May 2016 15:32:35 +0000 (16:32 +0100)]
Btrfs: fix unprotected assignment of the left cursor for device replace
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sat, 14 May 2016 08:12:53 +0000 (09:12 +0100)]
Btrfs: fix race setting block group readonly during device replace
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:
CPU 1 CPU 2
btrfs_dev_replace_start()
btrfs_scrub_dev()
scrub_enumerate_chunks()
--> finds device extent belonging
to block group X
<transaction N starts>
starts buffered write
against some inode
writepages is run against
that inode forcing dellaloc
to run
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
__extent_writepage()
writepage_delalloc()
run_delalloc_range()
cow_file_range()
btrfs_reserve_extent()
--> allocates an extent
from block group X
(which is not yet
in RO mode)
btrfs_add_ordered_extent()
--> creates ordered extent Y
flush_epd_write_bio()
--> bio against the extent from
block group X is submitted
btrfs_inc_block_group_ro(bg X)
--> sets block group X to readonly
scrub_chunk(bg X)
scrub_stripe(device extent from srcdev)
--> keeps searching for extent items
belonging to the block group using
the extent tree's commit root
--> it never blocks due to
fs_info->scrub_pause_req as no
one tries to commit transaction N
--> copies all extents found from the
source device into the target device
--> finishes search loop
bio completes
ordered extent Y completes
and creates delayed data
reference which will add an
extent item to the extent
tree when run (typically
at transaction commit time)
--> so the task doing the
scrub/device replace
at CPU 1 misses this
and does not copy this
extent into the new/target
device
btrfs_dec_block_group_ro(bg X)
--> turns block group X back to RW mode
dev_replace->cursor_left is set to the
logical end offset of block group X
So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Fri, 20 May 2016 03:34:23 +0000 (04:34 +0100)]
Btrfs: fix race between device replace and block group removal
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:
[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
oppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421]
0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
[ 9238.594422]
0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
[ 9238.594423]
0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428] [<
ffffffff8126b42c>] dump_stack+0x67/0x90
[ 9238.594430] [<
ffffffff81052b14>] __warn+0xc2/0xdd
[ 9238.594432] [<
ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434] [<
ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
[ 9238.594450] [<
ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452] [<
ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[ 9238.594464] [<
ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476] [<
ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489] [<
ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490] [<
ffffffff8106f403>] kthread+0xd4/0xdc
[ 9238.594494] [<
ffffffff8149e242>] ret_from_fork+0x22/0x40
[ 9238.594495] [<
ffffffff8106f32f>] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace
183efbe50275f059 ]---
The sequence of steps leading to this is like the following:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_delete_unused_bgs()
btrfs_remove_chunk()
looks up for the extent map
corresponding to the chunk
lock_chunks() (chunk_mutex)
check_system_chunk()
unlock_chunks() (chunk_mutex)
locks fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
iterates over all stripes from
the extent map
--> calls btrfs_free_dev_extent()
passing it the target device
that still has an ID of 0ULL
--> btrfs_free_dev_extent() fails
--> aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid (which is necessarily > 0)
frees the srcdev
unlocks fs_info->chunk_mutex
So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit
50460e37186a
("Btrfs: fix race when finishing dev replace leading to transaction abort").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Fri, 20 May 2016 00:57:20 +0000 (01:57 +0100)]
Btrfs: fix race between readahead and device replace/removal
The list of devices is protected by the device_list_mutex and the device
replace code, in its finishing phase correctly takes that mutex before
removing the source device from that list. However the readahead code was
iterating that list without acquiring the respective mutex leading to
crashes later on due to invalid memory accesses:
[125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
[125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
processor ser
[125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1
[125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
[125671.834973] task:
ffff8801ac520540 ti:
ffff8801ac918000 task.ti:
ffff8801ac918000
[125671.834973] RIP: 0010:[<
ffffffff81270479>] [<
ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
[125671.834973] RSP: 0018:
ffff8801ac91bc28 EFLAGS:
00010206
[125671.834973] RAX:
0000000000000000 RBX:
6b6b6b6b6b6b6b6a RCX:
0000000000000000
[125671.834973] RDX:
0000000000000000 RSI:
00000000000c1bff RDI:
ffff88002ebd62a8
[125671.834973] RBP:
ffff8801ac91bc70 R08:
0000000000000001 R09:
0000000000000000
[125671.834973] R10:
ffff8801ac91bc70 R11:
0000000000000000 R12:
ffff88002ebd62a8
[125671.834973] R13:
0000000000000000 R14:
0000000000000000 R15:
00000000000c1bff
[125671.834973] FS:
0000000000000000(0000) GS:
ffff88023fd40000(0000) knlGS:
0000000000000000
[125671.834973] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[125671.834973] CR2:
000000000073cae4 CR3:
00000000b7723000 CR4:
00000000000006e0
[125671.834973] Stack:
[125671.834973]
0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
[125671.834973]
0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
[125671.834973]
ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
[125671.834973] Call Trace:
[125671.834973] [<
ffffffff81270541>] radix_tree_lookup+0xd/0xf
[125671.834973] [<
ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
[125671.834973] [<
ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
[125671.834973] [<
ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
[125671.834973] [<
ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
[125671.834973] [<
ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
[125671.834973] [<
ffffffff81069691>] process_one_work+0x271/0x4e9
[125671.834973] [<
ffffffff81069dda>] worker_thread+0x1eb/0x2c9
[125671.834973] [<
ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
[125671.834973] [<
ffffffff8106f403>] kthread+0xd4/0xdc
[125671.834973] [<
ffffffff8149e242>] ret_from_fork+0x22/0x40
[125671.834973] [<
ffffffff8106f32f>] ? kthread_stop+0x286/0x286
So fix this by taking the device_list_mutex in the readahead code. We
can't use here the lighter approach of using a rcu_read_lock() and
rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
because we end up doing calls to sleeping functions (kzalloc()) in the
respective code path.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Vinson Lee [Sat, 28 May 2016 07:04:38 +0000 (07:04 +0000)]
btrfs: Use __u64 in exported linux/btrfs.h.
This patch fixes this build error.
/usr/include/linux/btrfs.h:121:3: error: unknown type name ‘u64’
u64 devid;
^~~
Fixes:
6b526ed70cf1 ("btrfs: introduce device delete by devid")
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Mon, 16 May 2016 16:21:01 +0000 (09:21 -0700)]
Btrfs: fix handling of faults from btrfs_copy_from_user
When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.
Commit
2e78c927d79 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time. This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.
The accounting errors led to warnings during btrfs_destroy_inode:
[ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23
[ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 70.847542]
0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
[ 70.847543]
0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
[ 70.847547]
ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
[ 70.847547] Call Trace:
[ 70.847550] [<
ffffffff8149d5e9>] dump_stack+0x51/0x78
[ 70.847551] [<
ffffffff8107bdfd>] __warn+0xfd/0x120
[ 70.847553] [<
ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20
[ 70.847555] [<
ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847556] [<
ffffffff812003a1>] ? __destroy_inode+0x71/0x140
[ 70.847558] [<
ffffffff812004b3>] destroy_inode+0x43/0x70
[ 70.847559] [<
ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40
[ 70.847560] [<
ffffffff81200c68>] evict+0x148/0x1d0
[ 70.847562] [<
ffffffff81398ade>] ? start_transaction+0x3de/0x460
[ 70.847564] [<
ffffffff81200d49>] dispose_list+0x59/0x80
[ 70.847565] [<
ffffffff81201ba0>] evict_inodes+0x180/0x190
[ 70.847566] [<
ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50
[ 70.847568] [<
ffffffff811e95f8>] generic_shutdown_super+0x48/0x100
[ 70.847569] [<
ffffffff810b75c0>] ? woken_wake_function+0x20/0x20
[ 70.847571] [<
ffffffff811e9796>] kill_anon_super+0x16/0x30
[ 70.847573] [<
ffffffff81365cde>] btrfs_kill_super+0x1e/0x130
[ 70.847574] [<
ffffffff811e99be>] deactivate_locked_super+0x4e/0x90
[ 70.847576] [<
ffffffff811e9e61>] deactivate_super+0x51/0x70
[ 70.847577] [<
ffffffff8120536f>] cleanup_mnt+0x3f/0x80
[ 70.847579] [<
ffffffff81205402>] __cleanup_mnt+0x12/0x20
[ 70.847581] [<
ffffffff81098358>] task_work_run+0x68/0xa0
[ 70.847582] [<
ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0
[ 70.847583] [<
ffffffff81002e1d>] do_syscall_64+0xbd/0x170
[ 70.847586] [<
ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25
This is the test program I used to force short returns from
btrfs_copy_from_user
void *dontneed(void *arg)
{
char *p = arg;
int ret;
while(1) {
ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
if (ret) {
perror("madvise");
exit(1);
}
}
}
int main(int ac, char **av) {
int ret;
int fd;
char *filename;
unsigned long offset;
char *buf;
int i;
pthread_t tid;
if (ac != 2) {
fprintf(stderr, "usage: dammitdave filename\n");
exit(1);
}
buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(buf, 'a', BUFSIZE);
filename = av[1];
ret = pthread_create(&tid, NULL, dontneed, buf);
if (ret) {
fprintf(stderr, "error %d from pthread_create\n", ret);
exit(1);
}
ret = pthread_detach(tid);
if (ret) {
fprintf(stderr, "pthread detach failed %d\n", ret);
exit(1);
}
while (1) {
fd = open(filename, O_RDWR | O_CREAT, 0600);
if (fd < 0) {
perror("open");
exit(1);
}
for (i = 0; i < ROUNDS; i++) {
int this_write = BUFSIZE;
offset = rand() % MAXSIZE;
ret = pwrite(fd, buf, this_write, offset);
if (ret < 0) {
perror("pwrite");
exit(1);
} else if (ret != this_write) {
fprintf(stderr, "short write to %s offset %lu ret %d\n",
filename, offset, ret);
exit(1);
}
if (i == ROUNDS - 1) {
ret = sync_file_range(fd, offset, 4096,
SYNC_FILE_RANGE_WRITE);
if (ret < 0) {
perror("sync_file_range");
exit(1);
}
}
}
ret = ftruncate(fd, 0);
if (ret < 0) {
perror("ftruncate");
exit(1);
}
ret = close(fd);
if (ret) {
perror("close");
exit(1);
}
ret = unlink(filename);
if (ret) {
perror("unlink");
exit(1);
}
}
return 0;
}
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Jones <dsj@fb.com>
Fixes:
2e78c927d79333f299a8ac81c2fd2952caeef335
cc: stable@vger.kernel.org # v4.6
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Thu, 26 May 2016 19:49:21 +0000 (12:49 -0700)]
Merge branch 'for-chris-4.7' of git://git./linux/kernel/git/kdave/linux into for-linus-4.7
David Sterba [Wed, 25 May 2016 20:51:04 +0000 (22:51 +0200)]
Merge branch 'dev/comp-workspaces' into for-chris-4.7-
20160525
David Sterba [Wed, 25 May 2016 20:51:03 +0000 (22:51 +0200)]
Merge branch 'cleanups-4.7' into for-chris-4.7-
20160525
David Sterba [Wed, 25 May 2016 20:51:02 +0000 (22:51 +0200)]
Merge branch 'misc-4.7' into for-chris-4.7-
20160525
Nicholas D Steeves [Fri, 20 May 2016 01:18:45 +0000 (21:18 -0400)]
btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Tue, 17 May 2016 09:37:38 +0000 (17:37 +0800)]
btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
We usually call btrfs_put_bbio() when btrfs_map_block() failed,
btrfs_put_bbio() works right whether bbio is a valid value, or NULL.
But there is a exception, in some case, btrfs_map_block() will return
fail without touching *bbio(keeping its original value), and if bbio
was not initialized yet, invalid memory accessing will happened.
Above case is in scrub_missing_raid56_pages(), and similar case in
scrub_raid56_parity().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Wed, 18 May 2016 00:21:48 +0000 (17:21 -0700)]
Btrfs: fix unexpected return value of fiemap
btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:
btrfs_lookup_file_extent ->
btrfs_search_slot(..., ins_len=0, cow=0)
and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.
This may confuse applications using ioctl(FIEL_IOC_FIEMAP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Sat, 14 May 2016 00:06:59 +0000 (17:06 -0700)]
Btrfs: free sys_array eb as soon as possible
While reading sys_chunk_array in superblock, btrfs creates a temporary
extent buffer. Since we don't use it after finishing reading
sys_chunk_array, we don't need to keep it in memory.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Tue, 17 May 2016 21:43:19 +0000 (14:43 -0700)]
Merge branch 'for-chris-4.7' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.7
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Tue, 17 May 2016 21:24:44 +0000 (14:24 -0700)]
Merge branch 'for-chris-4.7' of git://git./linux/kernel/git/kdave/linux into for-linus-4.7
David Sterba [Mon, 16 May 2016 13:46:29 +0000 (15:46 +0200)]
Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-
20160516
# Conflicts:
# include/uapi/linux/btrfs.h
David Sterba [Mon, 16 May 2016 13:46:26 +0000 (15:46 +0200)]
Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-
20160516
David Sterba [Mon, 16 May 2016 13:46:24 +0000 (15:46 +0200)]
Merge branch 'cleanups-4.7' into for-chris-4.7-
20160516
David Sterba [Mon, 16 May 2016 13:46:23 +0000 (15:46 +0200)]
Merge branch 'misc-4.7' into for-chris-4.7-
20160516
Scott Talbert [Mon, 9 May 2016 13:14:28 +0000 (09:14 -0400)]
btrfs: fix memory leak during RAID 5/6 device replacement
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never
freed anywhere.
Signed-off-by: Scott Talbert <scott.talbert@hgst.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 12 May 2016 12:53:36 +0000 (13:53 +0100)]
Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex
is not held) introduced in commit
38851cc19adb ("Btrfs: implement unlocked
dio write"), we started having races between such writes with concurrent
fsync operations that use the fast fsync path. These races were addressed
in the patches titled "Btrfs: fix race between fsync and lockless direct
IO writes" and "Btrfs: fix race between fsync and direct IO writes for
prealloc extents". The races happened because the direct IO path, like
every other write path, does create extent maps followed by the
corresponding ordered extents while the fast fsync path collected first
ordered extents and then it collected extent maps. This made it possible
to log file extent items (based on the collected extent maps) without
waiting for the corresponding ordered extents to complete (get their IO
done). The two fixes mentioned before added a solution that consists of
making the direct IO path create first the ordered extents and then the
extent maps, while the fsync path attempts to collect any new ordered
extents once it collects the extent maps. This was simple and did not
require adding any synchonization primitive to any data structure (struct
btrfs_inode for example) but it makes things more fragile for future
development endeavours and adds an exceptional approach compared to the
other write paths.
This change adds a read-write semaphore to the btrfs inode structure and
makes the direct IO path create the extent maps and the ordered extents
while holding read access on that semaphore, while the fast fsync path
collects extent maps and ordered extents while holding write access on
that semaphore. The logic for direct IO write path is encapsulated in a
new helper function that is used both for cow and nocow direct IO writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Mon, 9 May 2016 12:15:41 +0000 (13:15 +0100)]
Btrfs: fix race between block group relocation and nocow writes
Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:
CPU 1 CPU 2
btrfs_relocate_block_group(bg X)
direct IO write starts against
an extent in block group X
using nocow mode (inode has the
nodatacow flag or the write is
for a prealloc extent)
btrfs_direct_IO()
btrfs_get_blocks_direct()
--> can_nocow_extent() returns 1
btrfs_inc_block_group_ro(bg X)
--> turns block group into RO mode
btrfs_wait_ordered_roots()
--> returns and does not know about
the DIO write happening at CPU 2
(the task there has not created
yet an ordered extent)
relocate_block_group(bg X)
--> rc->stage == MOVE_DATA_EXTENTS
find_next_extent()
--> returns extent that the DIO
write is going to write to
relocate_data_extent()
relocate_file_extent_cluster()
--> reads the extent from disk into
pages belonging to the relocation
inode and dirties them
--> creates DIO ordered extent
btrfs_submit_direct()
--> submits bio against a location
on disk obtained from an extent
map before the relocation started
btrfs_wait_ordered_range()
--> writes all the pages read before
to disk (belonging to the
relocation inode)
relocation finishes
bio completes and wrote new data
to the old location of the block
group
So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.
The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Mon, 9 May 2016 12:15:27 +0000 (13:15 +0100)]
Btrfs: fix race between fsync and direct IO writes for prealloc extents
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit
38851cc19adb ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.
This is an extension of what was done in commit
de0ee0edb21f ("Btrfs: fix
race between fsync and lockless direct IO writes").
So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Thu, 5 May 2016 09:26:26 +0000 (10:26 +0100)]
Btrfs: fix number of transaction units for renames with whiteout
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 5 May 2016 01:08:56 +0000 (02:08 +0100)]
Btrfs: pin logs earlier when doing a rename exchange operation
The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
which had a race fixed by my previous patch titled "Btrfs: pin log earlier
when renaming", and so it suffers from the same problem.
We pin the logs of the affected roots after we insert the new inode
references, leaving a time window where concurrent tasks logging the
inodes can end up logging both the new and old references, resulting
in log trees that when replayed can turn the metadata into inconsistent
states. This behaviour was added to btrfs_rename() in 2009 without any
explanation about why not pinning the logs earlier, just leaving a
comment about the posibility for the race. As of today it's perfectly
safe and sane to pin the logs before we start doing any of the steps
involved in the rename operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 5 May 2016 01:02:27 +0000 (02:02 +0100)]
Btrfs: unpin logs if rename exchange operation fails
If rename exchange operations fail at some point after we pinned any of
the logs, we end up aborting the current transaction but never unpin the
logs, which leaves concurrent tasks that are trying to sync the logs (as
part of an fsync request from user space) blocked forever and preventing
the filesystem from being unmountable.
Fix this by safely unpinning the log.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 5 May 2016 00:41:57 +0000 (01:41 +0100)]
Btrfs: fix inode leak on failure to setup whiteout inode in rename
If we failed to fully setup the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, not decrementing
its link count nor removing all its items from the fs/subvol tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Dan Fuhry [Thu, 17 Mar 2016 14:23:38 +0000 (15:23 +0100)]
btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
behavior in the renameat2() syscall. This behavior is primarily used by
overlayfs. This patch adds support for these flags to btrfs, enabling it to
be used as a fully functional upper layer for overlayfs.
RENAME_EXCHANGE support was written by Davide Italiano originally
submitted on 2 April 2015.
Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Dan Fuhry <dfuhry@datto.com>
[ remove unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Fri, 29 Apr 2016 12:14:42 +0000 (13:14 +0100)]
Btrfs: pin log earlier when renaming
We were pinning the log right after the first step in the rename operation
(inserting inode ref for the new name in the destination directory)
instead of doing it before. This behaviour was introduced in 2009 for some
reason that was not mentioned neither on the changelog nor any comment,
with the drawback of a small time window where concurrent log writers can
end up logging the new inode reference for the inode we are renaming while
the rename operation is in progress (so that we can end up with a log
containing both the new and old references). As of today there's no reason
to not pin the log before that first step anymore, so just fix this.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Fri, 29 Apr 2016 10:34:22 +0000 (11:34 +0100)]
Btrfs: unpin log if rename operation fails
If rename operations fail at some point after we pinned the log, we end
up aborting the current transaction but never unpin the log, which leaves
concurrent tasks that are trying to sync the log (as part of an fsync
request from user space) blocked forever and preventing the filesystem
from being unmountable.
Fix this by safely unpinning the log.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Tue, 26 Apr 2016 14:39:32 +0000 (15:39 +0100)]
Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).
These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).
So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.
This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:
CPU 1 CPU 2
btrfs_relocate_block_group(bg X)
starts direct IO write,
target inode currently has no
ordered extents ongoing nor
dirty pages (delalloc regions),
therefore the root for our inode
is not in the list
fs_info->ordered_roots
btrfs_direct_IO()
__blockdev_direct_IO()
btrfs_get_blocks_direct()
btrfs_lock_extent_direct()
locks range in the io tree
btrfs_new_extent_direct()
btrfs_reserve_extent()
--> extent allocated
from bg X
btrfs_inc_block_group_ro(bg X)
btrfs_start_delalloc_roots()
__start_delalloc_inodes()
--> does nothing, no dealloc ranges
in the inode's io tree so the
inode's root is not in the list
fs_info->delalloc_roots
btrfs_wait_ordered_roots()
--> does not find the inode's root in the
list fs_info->ordered_roots
--> ends up not waiting for the direct IO
write started by the task at CPU 2
relocate_block_group(rc->stage ==
MOVE_DATA_EXTENTS)
prepare_to_relocate()
btrfs_commit_transaction()
iterates the extent tree, using its
commit root and moves extents into new
locations
btrfs_add_ordered_extent_dio()
--> now a ordered extent is
created and added to the
list root->ordered_extents
and the root added to the
list fs_info->ordered_roots
--> this is too late and the
task at CPU 1 already
started the relocation
btrfs_commit_transaction()
btrfs_finish_ordered_io()
btrfs_alloc_reserved_file_extent()
--> adds delayed data reference
for the extent allocated
from bg X
relocate_block_group(rc->stage ==
UPDATE_DATA_PTRS)
prepare_to_relocate()
btrfs_commit_transaction()
--> delayed refs are run, so an extent
item for the allocated extent from
bg X is added to extent tree
--> commit roots are switched, so the
next scan in the extent tree will
see the extent item
sees the extent in the extent tree
When this happens the relocation produces the following warning when it
finishes:
[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]
0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]
0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]
ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998] [<
ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998] [<
ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998] [<
ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<
ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998] [<
ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<
ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998] [<
ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998] [<
ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998] [<
ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998] [<
ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998] [<
ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998] [<
ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998] [<
ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998] [<
ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998] [<
ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998] [<
ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998] [<
ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998] [<
ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace
eb7803b24ebab8ad ]---
This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:
[ 7344.885290] BTRFS critical (device sdi): unable to find logical
12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task:
ffff880215818600 ti:
ffff880204684000 task.ti:
ffff880204684000
[ 7344.888431] RIP: 0010:[<
ffffffffa037c88c>] [<
ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:
ffff8802046878f0 EFLAGS:
00010282
[ 7344.888431] RAX:
00000000ffffffea RBX:
0000000000001000 RCX:
0000000000000001
[ 7344.888431] RDX:
ffff88023ec0f950 RSI:
ffffffff8183b638 RDI:
00000000ffffffff
[ 7344.888431] RBP:
ffff880204687908 R08:
0000000000000001 R09:
0000000000000000
[ 7344.888431] R10:
ffff880204687770 R11:
ffffffff82f2d52d R12:
0000000000001000
[ 7344.888431] R13:
ffff88021afbfee8 R14:
0000000000006208 R15:
ffff88006cd199b0
[ 7344.888431] FS:
00007f1f9e1d6700(0000) GS:
ffff88023ec00000(0000) knlGS:
0000000000000000
[ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 7344.888431] CR2:
00007f1f9dc8cb60 CR3:
000000023e3b6000 CR4:
00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]
0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]
ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]
ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431] [<
ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431] [<
ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431] [<
ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431] [<
ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<
ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431] [<
ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431] [<
ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<
ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431] [<
ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431] [<
ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431] [<
ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<
ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431] [<
ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431] [<
ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431] [<
ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<
ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<
ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431] [<
ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431] [<
ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431] [<
ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431] [<
ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431] [<
ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431] [<
ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP [<
ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP <
ffff8802046878f0>
[ 7344.970544] ---[ end trace
eb7803b24ebab8ae ]---
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Tue, 26 Apr 2016 14:36:38 +0000 (15:36 +0100)]
Btrfs: don't wait for unrelated IO to finish before relocation
Before the relocation process of a block group starts, it sets the block
group to readonly mode, then flushes all delalloc writes and then finally
it waits for all ordered extents to complete. This last step includes
waiting for ordered extents destinated at extents allocated in other block
groups, making us waste unecessary time.
So improve this by waiting only for ordered extents that fall into the
block group's range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Mon, 25 Apr 2016 03:45:02 +0000 (04:45 +0100)]
Btrfs: fix empty symlink after creating symlink and fsync parent dir
If we create a symlink, fsync its parent directory, crash/power fail and
mount the filesystem, we end up with an empty symlink, which not only is
useless it's also not allowed in linux (the man page symlink(2) is well
explicit about that). So we just need to make sure to fully log an inode
if it's a symlink, to ensure its inline extent gets logged, ensuring the
same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ sync
$ ln -s /mnt/foo /mnt/testdir/bar
$ xfs_io -c fsync /mnt/testdir
<power fail>
$ mount /dev/sdb /mnt
$ readlink /mnt/testdir/bar
<empty string>
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 6 Apr 2016 16:11:56 +0000 (17:11 +0100)]
Btrfs: fix for incorrect directory entries after fsync log replay
If we move a directory to a new parent and later log that parent and don't
explicitly log the old parent, when we replay the log we can end up with
entries for the moved directory in both the old and new parent directories.
Besides being ilegal to have directories with multiple hard links in linux,
it also resulted in the leaving the inode item with a link count of 1.
A similar issue also happens if we move a regular file - after the log tree
is replayed the file has a link in both the old and new parent directories,
when it should be only at the new directory.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/x
$ mkdir /mnt/y
$ touch /mnt/x/foo
$ mkdir /mnt/y/z
$ sync
$ ln /mnt/x/foo /mnt/x/bar
$ mv /mnt/y/z /mnt/x/z
< power fail >
$ mount /dev/sdc /mnt
$ ls -1Ri /mnt
/mnt:
257 x
258 y
/mnt/x:
259 bar
259 foo
260 z
/mnt/x/z:
/mnt/y:
260 z
/mnt/y/z:
$ umount /dev/sdc
$ btrfs check /dev/sdc
Checking filesystem on /dev/sdc
UUID:
a67e2c4a-a4b4-4fdc-b015-
9d9af1e344be
checking extents
checking free space cache
checking fs roots
root 5 inode 260 errors 2000, link count wrong
unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
(...)
Attempting to remove the directory becomes impossible:
$ mount /dev/sdc /mnt
$ rmdir /mnt/y/z
$ ls -lh /mnt/y
ls: cannot access /mnt/y/z: No such file or directory
total 0
d????????? ? ? ? ? ? z
$ rmdir /mnt/x/z
rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
$ ls -lh /mnt/x
ls: cannot access /mnt/x/z: Stale file handle
total 0
-rw-r--r-- 2 root root 0 Apr 6 18:06 bar
-rw-r--r-- 2 root root 0 Apr 6 18:06 foo
d????????? ? ? ? ? ? z
So make sure that on rename we set the last_unlink_trans value for our
inode, even if it's a directory, to the value of the current transaction's
ID and that if the new parent directory is logged that we fallback to a
transaction commit.
A test case for fstests is being submitted as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
David Sterba [Thu, 12 May 2016 09:05:03 +0000 (11:05 +0200)]
btrfs: build fixup for qgroup_account_snapshot
The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an
independent branch for the same merge target (4.7). To make the code
compilable for bisectability reasons, add a temporary stub.
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 11 May 2016 19:53:52 +0000 (12:53 -0700)]
btrfs: qgroup: Fix qgroup accounting when creating snapshot
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.
Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
inside btrfs_commit_transaction() just be commit_cowonly_roots().
However there is a exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() but no any commit root switch.
In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
======
btrfs_qgroup_account_extent: bytenr =
29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr =
29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======
The problem here is in first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.
But at second qgroup_account_extent(), its reference got decreased, but
between these two qgroup_account_extent(), there is no switch roots.
This leads to the same nr_old_roots, and this extent just got ignored by
qgroup, which means this extent is wrongly accounted.
Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with needed preparation.
Mark: I added a check at the top of qgroup_account_snapshot() to skip this
code if qgroups are turned off. xfstest btrfs/122 exposes this problem.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Vincent Stehlé [Tue, 10 May 2016 12:56:20 +0000 (14:56 +0200)]
Btrfs: fix fspath error deallocation
Make sure to deallocate fspath with vfree() in case of error in
init_ipath().
fspath is allocated with vmalloc() in init_data_container() since
commit
425d17a290c0 ("Btrfs: use larger limit for translation of logical to
inode").
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 Apr 2016 01:07:39 +0000 (03:07 +0200)]
btrfs: make find_workspace warn if there are no workspaces
Be verbose if there are no workspaces at all, ie. the module init time
preallocation failed.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 Apr 2016 00:41:17 +0000 (02:41 +0200)]
btrfs: make find_workspace always succeed
With just one preallocated workspace we can guarantee forward progress
even if there's no memory available for new workspaces. The cost is more
waiting but we also get rid of several error paths.
On average, there will be several idle workspaces, so the waiting
penalty won't be so bad.
In the worst case, all cpus will compete for one workspace until there's
some memory. Attempts to allocate a new one are done each time the
waiters are woken up.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 Apr 2016 00:55:15 +0000 (02:55 +0200)]
btrfs: preallocate compression workspaces
Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when need them, it might actually
deadlock, but in such situation the system has effectively not enough
memory to operate properly.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 Apr 2016 00:15:15 +0000 (02:15 +0200)]
btrfs: rename and document compression workspace members
The names are confusing, pick more fitting names and add comments.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 9 May 2016 12:11:38 +0000 (14:11 +0200)]
btrfs: GFP_NOFS does not GFP_HIGHMEM
Masking HIGHMEM out of NOFS does not make sense.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 9 May 2016 09:32:39 +0000 (11:32 +0200)]
btrfs: switch to common message helpers in open_ctree, adjust messages
Currently we lack the identification of the filesystem in most if not
all mount messages, done via printk/pr_* functions. We can use the
btrfs_* helpers in open_ctree, as the fs_info <-> sb link is established
at the beginning of the function.
The messages have been updated at the same time to be more consistent:
* dropped sb->s_id, as it's not available via btrfs_*
* added %d for return code where appropriate
* wording changed
* %Lx replaced by %llx
Signed-off-by: David Sterba <dsterba@suse.com>
Adam Borowski [Sun, 8 May 2016 13:08:00 +0000 (15:08 +0200)]
btrfs: fix int32 overflow in shrink_delalloc().
UBSAN: Undefined behaviour in fs/btrfs/extent-tree.c:4623:21
signed integer overflow:
10808 * 262144 cannot be represented in type 'int [8]'
If 8192<=items<16384, we request a writeback of an insane number of pages
which is benign (everything will be written). But if items>=16384, the
space reservation won't be enough.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 8 May 2016 21:38:32 +0000 (14:38 -0700)]
Linux 4.6-rc7
Linus Torvalds [Sat, 7 May 2016 17:53:32 +0000 (10:53 -0700)]
Merge tag 'char-misc-4.6-rc7' of git://git./linux/kernel/git/gregkh/char-misc
Pull misc driver fixes from Gfreg KH:
"Here are three small fixes for some driver problems that were
reported. Full details in the shortlog below.
All of these have been in linux-next with no reported issues"
* tag 'char-misc-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
nvmem: mxs-ocotp: fix buffer overflow in read
Drivers: hv: vmbus: Fix signaling logic in hv_need_to_signal_on_read()
misc: mic: Fix for double fetch security bug in VOP driver
Linus Torvalds [Sat, 7 May 2016 17:50:48 +0000 (10:50 -0700)]
Merge tag 'staging-4.6-rc7' of git://git./linux/kernel/git/gregkh/staging
Pull IIO driver fixes from Grek KH:
"It's really just IIO drivers here, some small fixes that resolve some
'crash on boot' errors that have shown up in the -rc series, and other
bugfixes that are required.
All have been in linux-next with no reported problems"
* tag 'staging-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: imu: mpu6050: Fix name/chip_id when using ACPI
iio: imu: mpu6050: fix possible NULL dereferences
iio:adc:at91-sama5d2: Repair crash on module removal
iio: ak8975: fix maybe-uninitialized warning
iio: ak8975: Fix NULL pointer exception on early interrupt
Linus Torvalds [Sat, 7 May 2016 17:47:03 +0000 (10:47 -0700)]
Merge tag 'usb-4.6-rc7' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some last-remaining fixes for USB drivers to resolve issues
that have shown up in testing. And two new device ids as well.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "USB / PM: Allow USB devices to remain runtime-suspended when sleeping"
usb: musb: jz4740: fix error check of usb_get_phy()
Revert "usb: musb: musb_host: Enable HCD_BH flag to handle urb return in bottom half"
usb: musb: gadget: nuke endpoint before setting its descriptor to NULL
USB: serial: cp210x: add Straizona Focusers device ids
USB: serial: cp210x: add ID for Link ECU
Linus Torvalds [Sat, 7 May 2016 15:27:35 +0000 (08:27 -0700)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
"These are a number of updates to fix a few problems found in the ARM
nommu code over the last couple of years, caused mostly by changes on
the mmu side"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8573/1: domain: move {set,get}_domain under config guard
ARM: 8572/1: nommu: change memory reserve for the vectors
ARM: 8571/1: nommu: fix PMSAv7 setup
Linus Torvalds [Sat, 7 May 2016 15:17:45 +0000 (08:17 -0700)]
Merge tag 'media/v4.6-5' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- deadlock fixes on driver probe at exynos4-is and s43-camif drivers
- a build breakage if media controller is enabled and USB or PCI is
built as module.
* tag 'media/v4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] media-device: fix builds when USB or PCI is compiled as module
[media] media: s3c-camif: fix deadlock on driver probe()
[media] media: exynos4-is: fix deadlock on driver probe
Linus Torvalds [Sat, 7 May 2016 15:13:42 +0000 (08:13 -0700)]
Merge branch 'for-4.6-fixes' of git://git./linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"An ahci driver addition and updates to ahci port enable handling for
some platform devices"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ata: add AMD Seattle platform driver
ARM: dts: apq8064: add ahci ports-implemented mask
ata: ahci-platform: Add ports-implemented DT bindings.
libahci: save port map for forced port map
Linus Torvalds [Sat, 7 May 2016 15:10:08 +0000 (08:10 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma
Pull rdma fix from Doug Ledford:
"Fix for max sector calculation in iSER"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/iser: Fix max_sectors calculation
Linus Torvalds [Fri, 6 May 2016 20:08:35 +0000 (13:08 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull writeback fix from Jens Axboe:
"Just a single fix for domain aware writeback, fixing a regression that
can cause balance_dirty_pages() to keep looping while not getting any
work done"
* 'for-linus' of git://git.kernel.dk/linux-block:
writeback: Fix performance regression in wb_over_bg_thresh()
Linus Torvalds [Fri, 6 May 2016 19:59:27 +0000 (12:59 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"This contains two fixes: a boot fix for older SGI/UV systems, and an
APIC calibration fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
x86/platform/UV: Bring back the call to map_low_mmrs in uv_system_init
Linus Torvalds [Fri, 6 May 2016 18:58:45 +0000 (11:58 -0700)]
Merge tag 'pm+acpi-4.6-rc7' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"Fixes for problems introduced or discovered recently (intel_pstate,
sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework,
generic device properties framework) and one fix for a hotplug-related
deadlock in ACPICA that's been there forever, but is nasty enough.
Specifics:
- Fix for a recent regression in the intel_pstate driver causing it
to fail to restore the HWP (HW-managed P-states) configuration of
the boot CPU after suspend-to-RAM (Rafael Wysocki).
- Fix for two recent regressions in the intel_pstate driver, one that
can trigger a divide by zero if the driver is accessed via sysfs
before it manages to take the first sample and one causing it to
fail to update a structure field used in a trace point, so the
information coming from it is less useful (Rafael Wysocki).
- Fix for a problem in the sti-cpufreq driver introduced during the
4.5 cycle that causes it to break CPU PM in multi-platform kernels
by registering cpufreq-dt (which subsequently doesn't work)
unconditionally and preventing the driver that would actually work
from registering (Sudeep Holla).
- Stable-candidate fix for an ARM64 cpuidle issue causing idle state
usage counters to be incorrectly updated for idle states that were
not entered due to errors (James Morse).
- Fix for a recently introduced issue in the OPP (Operating
Performance Points) framework causing it to print bogus error
messages for missing optional regulators (Viresh Kumar).
- Fix for a recently introduced issue in the generic device
properties framework that may cause it to attempt to dereferece and
invalid pointer in some cases (Heikki Krogerus).
- Fix for a deadlock in the ACPICA core that may be triggered by
device (eg Thunderbolt) hotplug (Prarit Bhargava)"
* tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / OPP: Remove useless check
ACPICA: Dispatcher: Update thread ID for recursive method calls
intel_pstate: Fix intel_pstate_get()
cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
cpufreq: st: enable selective initialization based on the platform
ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
device property: Avoid potential dereferences of invalid pointers
Linus Torvalds [Fri, 6 May 2016 18:53:27 +0000 (11:53 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"This contains a single fix that fixes a nohz tick stopping bug when
mixed-poliocy SCHED_FIFO and SCHED_RR tasks are present on a runqueue"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
Linus Torvalds [Fri, 6 May 2016 18:40:24 +0000 (11:40 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"This tree contains two fixes: new Intel CPU model numbers and an
AMD/iommu uncore PMU driver fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/iommu: Do not register a task ctx for uncore like PMUs
perf/x86: Add model numbers for Kabylake CPUs
Linus Torvalds [Fri, 6 May 2016 18:33:02 +0000 (11:33 -0700)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"This tree contains three fixes: a console spam fix, a file pattern fix
and a sysfb_efi fix for a bug that triggered on older ThinkPads"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sysfb_efi: Fix valid BAR address range check
x86/efi-bgrt: Switch all pr_err() to pr_notice() for invalid BGRT
MAINTAINERS: Remove asterisk from EFI directory names
Linus Torvalds [Fri, 6 May 2016 18:27:05 +0000 (11:27 -0700)]
Merge branch 'parisc-4.6-5' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"Patch from Dmitry V Levin to fix a kernel crash when a straced process
calls the (invalid) syscall which is equal to value of __NR_Linux_syscalls"
* 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls
Linus Torvalds [Fri, 6 May 2016 18:14:38 +0000 (11:14 -0700)]
Merge tag 'arc-4.6-rc7-fixes' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
"Late in the cycle, but this has fixes for couple of issues: a PAE40
boot crash and Arnd spotting lack of barriers in BE io-accessors.
The 3rd patch for enabling highmem in low physical mem ;-) honestly is
more than a "fix" but its been in works for some time, seems to be
stable in testing and enables 2 of our customers to go forward with
4.6 kernel.
- Fix for PTE truncation in PAE40 builds
- Fix for big endian IO accessors lacking IO barrier
- Allow HIGHMEM to work with low physical addresses"
* tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: support HIGHMEM even without PAE40
ARC: Fix PAE40 boot failures due to PTE truncation
ARC: Add missing io barriers to io{read,write}{16,32}be()
Linus Torvalds [Fri, 6 May 2016 18:05:07 +0000 (11:05 -0700)]
Merge tag 'powerpc-4.6-5' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Fix bad inline asm constraint in create_zero_mask() from Anton
Blanchard"
* tag 'powerpc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix bad inline asm constraint in create_zero_mask()
Linus Torvalds [Fri, 6 May 2016 17:59:53 +0000 (10:59 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Fixes for i915, amdgpu/radeon and imx.
The IMX fix is for an autoloading regression found in Fedora. The
radeon fixes, are the same fix to amdgpu/radeon to avoid a hardware
lockup in some circumstances with a bad mode, and a double free bug I
took a few hours chasing down the other morning.
The i915 fixes are across the board, all stable material, and fixing
some hangs and suspend/resume issues, along with a live status
regressions"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
gpu: ipu-v3: Fix imx-ipuv3-crtc module autoloading
drm/amdgpu: make sure vertical front porch is at least 1
drm/radeon: make sure vertical front porch is at least 1
drm/amdgpu: set metadata pointer to NULL after freeing.
drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW
drm/i915: Fake HDMI live status
drm/i915: Fix eDP low vswing for Broadwell
drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume
drm/i915: Fix system resume if PCI device remained enabled
drm/i915: Avoid stalling on pending flips for legacy cursor updates
Zygo Blaxell [Thu, 5 May 2016 04:23:49 +0000 (00:23 -0400)]
btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes
During a mount, we start the cleaner kthread first because the transaction
kthread wants to wake up the cleaner kthread. We start the transaction
kthread next because everything in btrfs wants transactions. We do reloc
recovery in the thread that was doing the original mount call once the
transaction kthread is running. This means that the cleaner kthread
could already be running when reloc recovery happens (e.g. if a snapshot
delete was started before a crash).
Relocation does not play well with the cleaner kthread, so a mutex was
added in commit
5f3164813b90f7dbcb5c3ab9006906222ce471b7 "Btrfs: fix
race between balance recovery and root deletion" to prevent both from
being active at the same time.
If the cleaner kthread is already holding the mutex by the time we get
to btrfs_recover_relocation, the mount will be blocked until at least
one deleted subvolume is cleaned (possibly more if the mount process
doesn't get the lock right away). During this time (which could be an
arbitrarily long time on a large/slow filesystem), the mount process is
stuck and the filesystem is unnecessarily inaccessible.
Fix this by locking cleaner_mutex before we start cleaner_kthread, and
unlocking the mutex after mount no longer requires it. This ensures
that the mounting process will not be blocked by the cleaner kthread.
The cleaner kthread is already prepared for mutex contention and will
just go to sleep until the mutex is available.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 4 May 2016 12:10:47 +0000 (14:10 +0200)]
btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 4 May 2016 09:32:00 +0000 (11:32 +0200)]
btrfs: add write protection to SET_FEATURES ioctl
Perform the want_write check if we get far enough to do any writes.
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 12 Apr 2016 13:36:16 +0000 (21:36 +0800)]
btrfs: fix lock dep warning move scratch super outside of chunk_mutex
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.
To reproduce:
(fresh boot)
mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
mount /dev/sdc /btrfs
dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
(get devmgt from https://github.com/asj/devmgt.git)
devmgt detach /dev/sde
dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
sync
btrfs replace start -Brf 3 /dev/sdf /btrfs <--
devmgt attach host7
======================================================
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---------------------------------------------------
btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[<
ffffffff812449b4>] __sb_start_write+0xb4/0xf0
but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[<
ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]
which lock already depends on the new lock.
Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->chunk_mutex);
lock(sb_writers);
*** DEADLOCK ***
-> #0 (sb_writers){.+.+.+}:
[<
ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
[<
ffffffff810e707e>] lock_acquire+0xbe/0x210
[<
ffffffff810df49a>] percpu_down_read+0x4a/0xa0
[<
ffffffff812449b4>] __sb_start_write+0xb4/0xf0
[<
ffffffff81265534>] mnt_want_write+0x24/0x50
[<
ffffffff812508a2>] path_openat+0x952/0x1190
[<
ffffffff81252451>] do_filp_open+0x91/0x100
[<
ffffffff8123f5cc>] file_open_name+0xfc/0x140
[<
ffffffff8123f643>] filp_open+0x33/0x60
[<
ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
[<
ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[<
ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[<
ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[<
ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ashish Samant [Sat, 30 Apr 2016 01:33:59 +0000 (18:33 -0700)]
btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
page_index should be checked with the same to trigger BUG_ON().
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Tue, 12 Apr 2016 16:54:40 +0000 (12:54 -0400)]
Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ]
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Wed, 27 Apr 2016 00:53:31 +0000 (17:53 -0700)]
Btrfs: fix divide error upon chunk's stripe_len
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
__btrfs_map_block().
This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 26 Apr 2016 14:22:06 +0000 (16:22 +0200)]
btrfs: sysfs: protect reading label by lock
If the label setting ioctl races with sysfs label handler, we could get
mixed result in the output, part old part new. We should either get the
old or new label. The chances to hit this race are low.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 26 Apr 2016 14:03:57 +0000 (16:03 +0200)]
btrfs: add check to sysfs handler of label
Add a sanity check for the fs_info as we will dereference it, similar to
what the 'store features' handler does.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 23 Jan 2015 17:43:31 +0000 (18:43 +0100)]
btrfs: add read-only check to sysfs handler of features
We don't want to trigger the change on a read-only filesystem, similar
to what the label handler does.
Signed-off-by: David Sterba <dsterba@suse.cz>
David Sterba [Thu, 24 Mar 2016 17:00:53 +0000 (18:00 +0100)]
btrfs: reuse existing variable in scrub_stripe, reduce stack usage
The key variable occupies 17 bytes, the key_start is used once, we can
simply reuse existing 'key' for that purpose. As the key is not a simple
type, compiler doest not do it on itself.
Signed-off-by: David Sterba <dsterba@suse.com>