Filipe Manana [Wed, 2 Mar 2016 15:49:38 +0000 (15:49 +0000)]
Btrfs: fix loading of orphan roots leading to BUG_ON
When looking for orphan roots during mount we can end up hitting a
BUG_ON() (at root-tree.c:btrfs_find_orphan_roots()) if a log tree is
replayed and qgroups are enabled. This is because after a log tree is
replayed, a transaction commit is made, which triggers qgroup extent
accounting which in turn does backref walking which ends up reading and
inserting all roots in the radix tree fs_info->fs_root_radix, including
orphan roots (deleted snapshots). So after the log tree is replayed, when
finding orphan roots we hit the BUG_ON with the following trace:
[118209.182438] ------------[ cut here ]------------
[118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
[118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1
[118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
[118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246
[118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
[118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
[118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
[118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
[118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
[118209.186318] Stack:
[118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
[118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
[118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
[118209.186318] Call Trace:
[118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
[118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
[118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
[118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
[118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8
[118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f
[118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
[118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP <ffff8800af34faa8>
[118209.230735] ---[ end trace 83938f987d85d477 ]---
So fix this by not treating the error -EEXIST, returned when attempting
to insert a root already inserted by the backref walking code, as an error.
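A minimal sketch of the idea, not the actual kernel change: the table below is
a toy stand-in for fs_info->fs_root_radix, and insert_root() and
load_orphan_root() are hypothetical names used only for illustration. The
point is that a second insertion of the same root is treated as success.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_ROOTS 16

/* Toy stand-in for the radix tree of roots, keyed by root id. */
struct root_table {
	unsigned long long ids[MAX_ROOTS];
	size_t count;
};

/* Insert a root id; returns -EEXIST if it is already present. */
static int insert_root(struct root_table *t, unsigned long long id)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->ids[i] == id)
			return -EEXIST;
	if (t->count == MAX_ROOTS)
		return -ENOMEM;
	t->ids[t->count++] = id;
	return 0;
}

/* Orphan root loading: a root that is already inserted is not a failure. */
static int load_orphan_root(struct root_table *t, unsigned long long id)
{
	int ret = insert_root(t, id);

	if (ret == -EEXIST)
		ret = 0;	/* an earlier caller (e.g. backref walking) won */
	return ret;
}
```

Any other error from the insertion is still propagated to the caller.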
The following test case for xfstests reproduces the bug:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
	_cleanup_flakey
	cd /
	rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
_run_btrfs_util_prog quota enable $SCRATCH_MNT
# Create 2 directories with one file in one of them.
# We use these just to trigger a transaction commit later, moving the file from
# directory a to directory b and doing an fsync against directory a.
mkdir $SCRATCH_MNT/a
mkdir $SCRATCH_MNT/b
touch $SCRATCH_MNT/a/f
sync
# Create our test file with 2 4K extents.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
# Create a snapshot and delete it. This doesn't really delete the snapshot
# immediately, just makes it inaccessible and invisible to user space, the
# snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
# which is woken up at the next transaction commit.
# A root orphan item is inserted into the tree of tree roots, so that if a
# power failure happens before the dedicated kernel thread does the snapshot
# deletion, the next time the filesystem is mounted it resumes the snapshot
# deletion.
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap
# Now overwrite half of the extents we wrote before. Because we made a snapshot
# before, which isn't really deleted yet (since no transaction commit happened
# after we did the snapshot delete request), the non overwritten extents get
# referenced twice, once by the default subvolume and once by the snapshot.
$XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
# Now move file f from directory a to directory b and fsync directory a.
# The fsync on the directory a triggers a transaction commit (because a file
# was moved from it to another directory) and the file fsync leaves a log tree
# with file extent items to replay.
mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foobar | _filter_scratch
# Now simulate a power failure and mount the filesystem to replay the log tree.
# After the log tree was replayed, we used to hit a BUG_ON() when processing
# the root orphan item for the deleted snapshot. This is because when processing
# an orphan root the code expected to be the first code inserting the root into
# the fs_info->fs_root_radix radix tree, while in reality it was the second
# caller attempting to do it - the first caller was the transaction commit that
# took place after replaying the log tree, when updating the qgroup counters.
_flakey_drop_and_remount
echo "File digest after power failure:"
# Must match what we got before the power failure.
md5sum $SCRATCH_MNT/foobar | _filter_scratch
_unmount_flakey
status=0
exit
Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 17 Feb 2016 00:52:10 +0000 (16:52 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Filipe Manana [Mon, 15 Feb 2016 16:20:26 +0000 (16:20 +0000)]
Btrfs: fix direct IO requests not reporting IO error to user space
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:
dio_end_io(dio_bio, bio->bi_error);
It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().
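The convention can be sketched with a toy model. The struct and the toy_*
helpers below are hypothetical simplifications, not the block layer API: the
completion result only looks at the parent's bi_error, so the endio callback
has to copy the child's error into the parent itself.

```c
#include <errno.h>

/* Hypothetical, heavily simplified stand-in for struct bio. */
struct toy_bio {
	int bi_error;	/* 0 on success, negative errno on failure */
};

/* What the caller sees: the parent's error if set, else bytes issued. */
static long toy_dio_result(const struct toy_bio *dio_bio, long bytes)
{
	return dio_bio->bi_error ? dio_bio->bi_error : bytes;
}

/* Endio callback without the assignment: the child's error is lost. */
static void toy_endio_buggy(struct toy_bio *dio_bio, const struct toy_bio *bio)
{
	(void)dio_bio;
	(void)bio;
}

/* Endio callback with the assignment: the parent carries the error. */
static void toy_endio_fixed(struct toy_bio *dio_bio, const struct toy_bio *bio)
{
	dio_bio->bi_error = bio->bi_error;
}
```

With a child bio that failed with -EIO, the first callback leaves the caller
with a byte count and the second one leaves it with -EIO.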
This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. These essentially set up a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.
Cc: stable@vger.kernel.org # 4.3+
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
David Sterba [Fri, 13 Nov 2015 12:44:28 +0000 (13:44 +0100)]
btrfs: properly set the termination value of ctx->pos in readdir
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentionally set to a
larger value, then it's LLONG_MAX.
There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff i.e. LLONG_MAX before
the increment.
We can get to that situation like this:
* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.
The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:
Overflow: e
Overflow: 7fffffff
Overflow: 7fffffffffffffff
PAX: size overflow detected in function btrfs_real_readdir
fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
context: dir_context;
CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
Call Trace:
[<ffffffff81742f0f>] dump_stack+0x4c/0x7f
[<ffffffff811cb706>] report_size_overflow+0x36/0x40
[<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
[<ffffffff811dafc8>] iterate_dir+0xa8/0x150
[<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
[<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
Overflow: 1a
[<ffffffff811db070>] ? iterate_dir+0x150/0x150
[<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.
The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.
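The sequence can be replayed with a small model. This is a sketch under
assumptions, not the code in btrfs_real_readdir(): toy_readdir() and its
'fixed' switch are hypothetical, and the overflowing increment is only
detected, never executed, since signed overflow is undefined behaviour in C.

```c
#include <limits.h>
#include <stdbool.h>

/* Termination step described above: first INT_MAX, then LLONG_MAX. */
static long long terminate_pos(long long pos)
{
	return pos >= INT_MAX ? LLONG_MAX : INT_MAX;
}

/*
 * One readdir call that emits 'emitted' entries (assumed >= 0).
 * Returns false when the unconditional "pos++" would overflow.
 */
static bool toy_readdir(long long *pos, int emitted, bool fixed)
{
	/* Fixed variant: nothing emitted and pos already terminated. */
	if (fixed && emitted == 0 && *pos >= INT_MAX)
		return true;

	/* The unconditional bump would overflow from here. */
	if (*pos > LLONG_MAX - emitted - 1)
		return false;

	*pos += emitted;		/* emitting entries advances pos */
	(*pos)++;			/* bump past the last entry */
	*pos = terminate_pos(*pos);
	return true;
}
```

Without the guard, three consecutive calls take pos to INT_MAX, then to
LLONG_MAX, then into the overflow. With the guard, pos stays at INT_MAX no
matter how often readdir is called again.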
References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Thu, 11 Feb 2016 00:51:38 +0000 (16:51 -0800)]
Merge branch 'integration-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Filipe Manana [Wed, 3 Feb 2016 19:17:27 +0000 (19:17 +0000)]
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
While doing some tests I ran into a hang on an extent buffer's rwlock
that produced the following trace:
[39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
[39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
[39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800016] irq event stamp: 0
[39389.800016] hardirqs last enabled at (0): [< (null)>] (null)
[39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last disabled at (0): [< (null)>] (null)
[39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
[39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
[39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202
[39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
[39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
[39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
[39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
[39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
[39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
[39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800016] Stack:
[39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
[39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
[39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
[39389.800016] Call Trace:
[39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
[39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
[39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
[39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
[39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
[39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
[39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
[39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
[39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
[39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
[39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800012] irq event stamp: 0
[39389.800012] hardirqs last enabled at (0): [< (null)>] (null)
[39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last disabled at (0): [< (null)>] (null)
[39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1
[39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
[39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
[39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206
[39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
[39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
[39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
[39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
[39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
[39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
[39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800012] Stack:
[39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
[39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
[39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
[39389.800012] Call Trace:
[39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
[39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
[39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
[39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
[39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
[39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
[39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
[39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
[39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
[39389.800012] [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
[39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
[39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
[39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
[39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
[39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
[39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
[39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
[39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00
This happens because the code path executed by the inode_paths ioctl nests
two calls to read lock a leaf's rwlock. If, after the first call to
read_lock() and before the second call to read_lock(), another task
(running the delayed items as part of a transaction commit) calls
write_lock() against the leaf's rwlock, both tasks hang. This situation is
illustrated by the following diagram:
         Task A                           Task B

  btrfs_ref_to_path()                btrfs_commit_transaction()
    read_lock(&eb->lock);
                                       btrfs_run_delayed_items()
                                         __btrfs_commit_inode_delayed_items()
                                           __btrfs_update_delayed_inode()
                                             btrfs_lookup_inode()
                                               write_lock(&eb->lock);
                                                 --> task waits for lock

    read_lock(&eb->lock);
    --> makes this task hang
        forever (and task B too
        of course)
So fix this by avoiding doing the nested read lock, which is easily
avoidable. This issue does not happen if task B calls write_lock() after
task A does the second call to read_lock(), however there does not seem
to exist anything in the documentation that mentions what is the expected
behaviour for recursive locking of rwlocks (leaving the idea that doing
so is not a good usage of rwlocks).
Also, as a side effect necessary for this fix, make sure we do not
needlessly read lock extent buffers when the input path has skip_locking
set (used when called from send).
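The interleaving in the diagram can be replayed with a toy lock model. This is
an illustration under the assumption that a queued writer blocks new readers;
it is not the kernel's rwlock implementation, and the toy_* names are
hypothetical. Each "trylock" returns false where the real call would block.

```c
#include <stdbool.h>

/* Toy rwlock in which a waiting writer keeps new readers out. */
struct toy_rwlock {
	int readers;		/* readers currently holding the lock */
	bool writer_waiting;	/* a writer is queued behind those readers */
	bool writer_held;	/* a writer currently holds the lock */
};

/* A reader is admitted only if no writer holds or waits for the lock. */
static bool toy_read_trylock(struct toy_rwlock *l)
{
	if (l->writer_held || l->writer_waiting)
		return false;
	l->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	l->readers--;
}

/* A writer needs the lock free of readers; otherwise it queues. */
static bool toy_write_trylock(struct toy_rwlock *l)
{
	if (l->readers > 0 || l->writer_held) {
		l->writer_waiting = true;
		return false;
	}
	l->writer_waiting = false;
	l->writer_held = true;
	return true;
}
```

Task A's nested read lock fails while task B is queued, and task B cannot
proceed until task A drops its first read lock, which it never does. Taking
the read lock only once lets task A unlock and task B make progress.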
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 27 Jan 2016 19:17:20 +0000 (19:17 +0000)]
Btrfs: remove no longer used function extent_read_full_page_nolock()
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 27 Jan 2016 18:37:47 +0000 (18:37 +0000)]
Btrfs: fix page reading in extent_same ioctl leading to csum errors
In the extent_same ioctl, we were grabbing the pages (locked) and
attempting to read them without bothering about any concurrent IO
against them. That is, we were not checking for any ongoing ordered
extents nor waiting for them to complete, which leads to a race where
the extent_same() code gets a checksum verification error when it
reads the pages, producing a message like the following in dmesg
and making the operation fail to user space with -ENOMEM:
[18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
Fix this by using btrfs_readpage() for reading the pages instead of
extent_read_full_page_nolock(), which waits for any concurrent ordered
extents to complete and locks the io range. Also do better error handling
and don't treat all failures as -ENOMEM, as that's clearly misleading,
making the checks and operation identical to those of prepare_uptodate_page().
The use of extent_read_full_page_nolock() was required before commit
f441460202cb ("btrfs: fix deadlock with extent-same and readpage"),
as we had the range locked in an inode's io tree before attempting to
read the pages.
Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage")
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 27 Jan 2016 10:20:58 +0000 (10:20 +0000)]
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
In the extent_same ioctl we are getting the pages for the source and
target ranges and unlocking them immediately after, which is incorrect
because later we attempt to map them (with kmap_atomic) and access their
contents at btrfs_cmp_data(). When we do such access the pages might have
been relocated or removed from memory, which leads to an invalid memory
access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
which produces a trace like the following:
[186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
unloaded: btrfs]
[186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1
[186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
[186736.681319] RIP: 0010:[<ffffffff81264d00>] [<ffffffff81264d00>] memcmp+0xb/0x22
[186736.681319] RSP: 0018:ffff880362287d70 EFLAGS: 00010287
[186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
[186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
[186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
[186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
[186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
[186736.681319] FS: 00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
[186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
[186736.681319] Stack:
[186736.681319] ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
[186736.681319] 0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
[186736.681319] 00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
[186736.681319] Call Trace:
[186736.681319] [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
[186736.681319] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[186736.681319] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[186736.681319] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[186736.681319] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[186736.681319] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[186736.681319] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
(gdb) list *(btrfs_ioctl+0x24cb)
0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
2967 dst_addr = kmap_atomic(dst_page);
2968
2969 flush_dcache_page(src_page);
2970 flush_dcache_page(dst_page);
2971
2972 if (memcmp(addr, dst_addr, cmp_len))
2973 ret = BTRFS_SAME_DATA_DIFFERS;
2974
2975 kunmap_atomic(addr);
2976 kunmap_atomic(dst_addr);
So fix this by making sure we keep the pages locked and respect the same
locking order as everywhere else: get and lock the pages first and then
lock the range in the inode's io tree (like for example at
__btrfs_buffered_write() and extent_readpages()). If an ordered extent
is found after locking the range in the io tree, unlock the range,
unlock the pages, wait for the ordered extent to complete and repeat the
entire locking process until no overlapping ordered extents are found.
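The locking sequence can be sketched as a retry loop. This is a model under
assumptions, not the ioctl code: struct toy_range and its helpers are
hypothetical and only record which locks are held and how many ordered
extents still overlap the range.

```c
#include <stdbool.h>

/* Hypothetical state for one page range; counters record what happened. */
struct toy_range {
	int pending_ordered;	/* ordered extents overlapping the range */
	bool pages_locked;
	bool range_locked;
	int attempts;
};

static void lock_pages(struct toy_range *r)   { r->pages_locked = true; }
static void unlock_pages(struct toy_range *r) { r->pages_locked = false; }
static void lock_range(struct toy_range *r)   { r->range_locked = true; }
static void unlock_range(struct toy_range *r) { r->range_locked = false; }

/* Waiting for one ordered extent to complete removes it from the range. */
static void wait_ordered(struct toy_range *r) { r->pending_ordered--; }

/* Returns with the pages and the range locked and no ordered extents left. */
static void lock_pages_then_range(struct toy_range *r)
{
	for (;;) {
		r->attempts++;
		lock_pages(r);		/* 1: get and lock the pages */
		lock_range(r);		/* 2: lock the range in the io tree */
		if (r->pending_ordered == 0)
			return;
		unlock_range(r);	/* back off in reverse order */
		unlock_pages(r);
		wait_ordered(r);	/* then retry from the start */
	}
}
```

The pages stay locked from the successful iteration until the caller is done
comparing their contents.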
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Chris Mason [Fri, 29 Jan 2016 16:19:37 +0000 (08:19 -0800)]
Revert "btrfs: synchronize incompat feature bits with sysfs files"
This reverts commit 14e46e04958df740c6c6a94849f176159a333f13.
This ends up doing sysfs operations from deep in balance (where we
should be GFP_NOFS) and under heavy balance load, we're racing
against sysfs internals.
Revert it for now while we figure things out.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 27 Jan 2016 14:38:45 +0000 (06:38 -0800)]
btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
This was copied incorrectly from the __vmalloc call.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 27 Jan 2016 13:48:23 +0000 (05:48 -0800)]
Merge branch 'dev/fst-followup' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
David Sterba [Wed, 27 Jan 2016 13:06:29 +0000 (14:06 +0100)]
btrfs: sysfs: check initialization state before updating features
If the mount phase is not finished, we can't update the sysfs files.
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Mon, 25 Jan 2016 10:02:06 +0000 (11:02 +0100)]
Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
This reverts commit 696249132158014d594896df3a81390616069c5c. The
cleaner thread can block freezing when there's a snapshot cleaning in
progress and the other threads get suspended first. From the logs
provided by Martin we're waiting for reading extent pages:
kernel: PM: Syncing filesystems ... done.
kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
kernel: Freezing remaining freezable tasks ...
kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
kernel: btrfs-cleaner D ffff88033dd13bc0 0 152 2 0x00000000
kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
kernel: Call Trace:
kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
As this affects a released kernel (4.4) we need a minimal fix for
stable kernels.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361
Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
CC: stable@vger.kernel.org # 4.4
CC: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Fri, 22 Jan 2016 01:28:38 +0000 (09:28 +0800)]
btrfs: async-thread: Fix a use-after-free error for trace
The parameter of trace_btrfs_work_queued() can be freed by its workqueue,
so no one should use that pointer after queue_work().
Fix the use-after-free bug by moving the trace line before queue_work().
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
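The ordering issue described in this commit can be sketched in user-space C. This is an illustration only: struct work_sketch, queue_work_sim() and trace_queued() are hypothetical stand-ins, not the btrfs or kernel workqueue API, and the simulated queue runs the work item immediately to model the worst case in which the worker frees the item before the caller continues.

```c
#include <stdlib.h>

/* Hypothetical work item; the worker owns and frees it once it has run. */
struct work_sketch {
	int id;
	void (*fn)(struct work_sketch *w);
};

/* Records what the stand-in tracepoint saw, so the effect is observable. */
static int last_traced_id = -1;

/* Stand-in for a tracepoint: it only reads fields of the work item. */
static void trace_queued(const struct work_sketch *w)
{
	last_traced_id = w->id;
}

/* Worker callback: frees the item, as a real worker may do. */
static void run_and_free(struct work_sketch *w)
{
	free(w);
}

/*
 * Stand-in for queue_work(): models the worst case in which the worker
 * runs, and frees the item, before the caller gets to run again.
 */
static void queue_work_sim(struct work_sketch *w)
{
	w->fn(w);
}

/* Safe ordering: trace first, then hand the item over. */
static int submit_work(int id)
{
	struct work_sketch *w = malloc(sizeof(*w));

	if (!w)
		return -1;
	w->id = id;
	w->fn = run_and_free;

	trace_queued(w);	/* w is still owned by the caller here */
	queue_work_sim(w);	/* after this call w must not be touched */
	return 0;
}
```

Swapping the two calls at the end of submit_work() would make trace_queued() read freed memory, which is the pattern the commit removes.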
Filipe Manana [Thu, 21 Jan 2016 10:17:54 +0000 (10:17 +0000)]
Btrfs: fix race between fsync and lockless direct IO writes
An fsync, using the fast path, can race with a concurrent lockless direct
IO write and end up logging a file extent item that points to an extent
that wasn't written to yet. This is because the fast fsync path collects
ordered extents into a local list and then collects all the new extent
maps to log file extent items based on them, while the direct IO write
path creates the new extent map before it creates the corresponding
ordered extent (and submits the respective bio(s)).
So fix this by making the direct IO write path create ordered extents
before the extent maps and make the fast fsync path collect any new
ordered extents after it collects the extent maps.
Note that making the fsync handler call inode_dio_wait() (after acquiring
the inode's i_mutex) would not work and lead to a deadlock when doing
AIO, as through AIO we end up in a path where the fsync handler is called
(through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
before the inode's dio counter is decremented (inode_dio_wait() waits
for this counter to have a value of zero).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Tue, 26 Jan 2016 00:43:13 +0000 (16:43 -0800)]
Merge branch 'fix/fst-sysfs' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Mon, 25 Jan 2016 15:47:10 +0000 (16:47 +0100)]
btrfs: add free space tree to the cow-only list
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 25 Jan 2016 15:30:22 +0000 (16:30 +0100)]
btrfs: add free space tree to lockdep classes
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 22 Jan 2016 16:16:18 +0000 (17:16 +0100)]
btrfs: tweak free space tree bitmap allocation
The requested bitmap size varies, observed numbers were < 4K up to 16K.
Using vmalloc unconditionally would be too heavy, we'll try contiguous
allocations first and fall back to vmalloc if there's no contig memory.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 22 Jan 2016 09:28:24 +0000 (10:28 +0100)]
btrfs: tests: switch to GFP_KERNEL
There's no reason to do GFP_NOFS in tests, it's not data-heavy and
memory allocation failures would affect only developers or testers.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:54:41 +0000 (18:54 +0100)]
btrfs: synchronize incompat feature bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount (eg. the
LZO compression).
Synchronize the feature bits with the sysfs files representing them
right after we set/clear them.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:50:40 +0000 (18:50 +0100)]
btrfs: sysfs: introduce helper for syncing bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount. We're
going to sync them and need a helper to do that.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:36:46 +0000 (18:36 +0100)]
btrfs: sysfs: add free-space-tree bit attribute
The incompat bit representing the newly added free space tree feature is
missing. Right now it will be listed only among features supported by
the module, not per-fs.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Jan 2016 18:07:04 +0000 (19:07 +0100)]
btrfs: sysfs: fix typo in compat_ro attribute definition
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Tue, 12 Jan 2016 09:52:13 +0000 (17:52 +0800)]
btrfs: raid56: Use raid_write_end_io for scrub
There is no need for an additional end_io function for scrub; it
increases code size and introduces inconsistent lines such as:
raid_write_parity_end_io():
    int err = bio->bi_error;
    if (bio->bi_error)
raid_write_end_io():
    int err = bio->bi_error;
    if (err)
This patch combines them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 12 Jan 2016 09:22:13 +0000 (17:22 +0800)]
btrfs: Remove unnecessary ClearPageUptodate for raid56
The PageUptodate flag is already initialized to 0 for a new page,
so there is no need to clear it again.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 3 Mar 2015 12:42:48 +0000 (20:42 +0800)]
btrfs: use rbio->nr_pages to reduce calculation
We can use rbio->stripe_npages to avoid unnecessary calculation in
many places in the code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 3 Mar 2015 12:38:46 +0000 (20:38 +0800)]
btrfs: Use unified stripe_page's index calculation
The current code uses different index calculation methods for
stripe_page:
1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index
2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index
3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index
...
They give the same result when stripe_len is aligned to PAGE_CACHE_SIZE,
which is why the current code works. Introducing and using a common
function for the calculation is a better choice.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
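The claim that the three formulas agree only for an aligned stripe_len can be checked with a small user-space sketch. The function names and the fixed 4096-byte page size below are assumptions made for illustration; they are not taken from the btrfs sources.

```c
/* Assumed page size for the illustration. */
#define PAGE_SIZE_SKETCH 4096UL
#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

/* Method 1: truncating division of stripe_len by the page size. */
static unsigned long index_v1(unsigned long stripe_len,
			      unsigned long stripe_index,
			      unsigned long page_index)
{
	return (stripe_len / PAGE_SIZE_SKETCH) * stripe_index + page_index;
}

/* Method 2: round the per-stripe page count up, then multiply. */
static unsigned long index_v2(unsigned long stripe_len,
			      unsigned long stripe_index,
			      unsigned long page_index)
{
	return DIV_ROUND_UP_SKETCH(stripe_len, PAGE_SIZE_SKETCH) * stripe_index
		+ page_index;
}

/* Method 3: multiply first, then round the byte offset up to pages. */
static unsigned long index_v3(unsigned long stripe_len,
			      unsigned long stripe_index,
			      unsigned long page_index)
{
	return DIV_ROUND_UP_SKETCH(stripe_len * stripe_index, PAGE_SIZE_SKETCH)
		+ page_index;
}
```

With stripe_len = 65536 all three methods give the same index; with an unaligned value such as 6000 they diverge, which is the risk a single shared helper removes.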
Zhao Lei [Mon, 8 Dec 2014 11:55:57 +0000 (19:55 +0800)]
btrfs: Fix calculation of rbio->dbitmap's size calculation
The current code tries to calculate rbio->dbitmap's size so that it is
aligned to sizeof(long), but the implementation does not achieve this
objective; the size is aligned to sizeof(char) instead.
This patch fixes the above calculation and uses sizeof(long) instead of
a fixed "8" to increase compatibility.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
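The difference between the two alignments can be sketched as follows. This is a minimal user-space illustration, not the patched kernel code: the helper names are hypothetical, and the exact expressions used in raid56.c are not reproduced here.

```c
#include <limits.h>
#include <stddef.h>

#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

/* Bytes needed when the bit count is only rounded up to whole bytes. */
static size_t bitmap_bytes_char_aligned(size_t nbits)
{
	return DIV_ROUND_UP_SKETCH(nbits, (size_t)CHAR_BIT);
}

/* Bytes needed when the bit count is rounded up to whole longs. */
static size_t bitmap_bytes_long_aligned(size_t nbits)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;

	return DIV_ROUND_UP_SKETCH(nbits, bits_per_long) * sizeof(long);
}
```

Bitmap helpers that operate on whole longs can touch bytes past a char-aligned buffer, which is why the long-aligned size is the safer one.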
Zhao Lei [Tue, 1 Dec 2015 10:39:40 +0000 (18:39 +0800)]
btrfs: Fix no_space in write and rm loop
I see no_space in v4.4-rc1 again in xfstests generic/102.
It happens randomly, and only on some nodes
(one of 4 phy-nodes, and a kvm with a non-virtio block driver).
By bisecting, we found that the first bad commit is:
commit bdced438acd8 ("block: setup bi_phys_segments after splitting")
But the above patch only triggered the bug by making bio operations
faster (or slower).
The main reason is that in our space allocation code we need to commit
page writeback before waiting for it to complete; this patch fixes the
above bug.
BTW, there is another reason for the generic/102 failure, caused by
disabling the default mixed-blockgroup; I'll fix it in xfstests.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 6 Jan 2016 10:56:36 +0000 (18:56 +0800)]
btrfs: merge functions for wait snapshot creation
wait_for_snapshot_creation() belongs to the same group as the other two:
btrfs_start_write_no_snapshoting()
btrfs_end_write_no_snapshoting()
Rename wait_for_snapshot_creation() and move it to the same place
as the other two.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 6 Jan 2016 10:47:31 +0000 (18:47 +0800)]
btrfs: delete unused argument in btrfs_copy_from_user
size_t write_bytes is not necessary for btrfs_copy_from_user(),
delete it.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 15 Dec 2015 10:18:09 +0000 (18:18 +0800)]
btrfs: Use direct way to determine raid56 write/recover mode
The old code used bbio->raid_map to determine whether we are in a raid56
write/recover operation, because we didn't have bbio->map_type.
Now we have a direct way to check this condition, so get rid of using
the function-relative data and make the code more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 9 Dec 2015 13:03:49 +0000 (21:03 +0800)]
btrfs: Small cleanup for get index_srcdev loop
1: Adjust condition in loop to make less TAB
2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Tue, 15 Dec 2015 01:14:37 +0000 (09:14 +0800)]
btrfs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
   We already have such a check, but only for the super block sys chunk
   array.
   Now check all on-disk chunks.
2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can find some hidden bugs.
3) Chunk length
   Same as chunk logical, should be aligned to sector size.
4) Stripe length
   It should be a power of 2.
5) Chunk type
   Any bit out of TYPE_MASK | PROFILE_MASK is invalid.
With all these stricter rules, several fuzzed images reported on the
mailing list should no longer cause a kernel panic.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
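The five rules listed in this commit can be sketched as a single predicate. Everything below is hypothetical: struct chunk_sketch is a simplified stand-in rather than the on-disk chunk item, and the two mask values are illustrative placeholders, not the real btrfs block group flag masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified chunk description. */
struct chunk_sketch {
	uint64_t logical;
	uint64_t length;
	uint64_t stripe_len;
	uint64_t type;
	uint16_t num_stripes;
};

/* Illustrative mask values only; the real masks live in the btrfs headers. */
#define TYPE_MASK_SKETCH	0x7ULL
#define PROFILE_MASK_SKETCH	0x1f8ULL

static bool is_power_of_2_u64(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Returns true only if the chunk passes all five checks. */
static bool chunk_is_valid(const struct chunk_sketch *c, uint32_t sectorsize)
{
	if (sectorsize == 0)
		return false;	/* guard for the sketch's own modulo below */
	if (c->num_stripes == 0)
		return false;	/* 1) number of stripes */
	if (c->logical % sectorsize)
		return false;	/* 2) logical start aligned to sector size */
	if (c->length % sectorsize)
		return false;	/* 3) length aligned to sector size */
	if (!is_power_of_2_u64(c->stripe_len))
		return false;	/* 4) stripe length is a power of 2 */
	if (c->type & ~(TYPE_MASK_SKETCH | PROFILE_MASK_SKETCH))
		return false;	/* 5) no bits outside the known masks */
	return true;
}
```

Rejecting a malformed chunk up front keeps later code from computing with values such as a non-power-of-2 stripe length.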
Qu Wenruo [Tue, 15 Dec 2015 01:14:36 +0000 (09:14 +0800)]
btrfs: Enhance super validation check
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
   Not only the old max/min valid check, but also check if it's a power
   of 2.
   So some bogus number like 12K node size won't pass now.
2) Super flag check
   For now, there is still some inconsistency between kernel and
   btrfs-progs super flags.
   And considering btrfs-progs may add new flags for super block, this
   check will only output a warning.
3) Better root alignment check
   Now root bytenr is checked against sector size.
4) Move some checks into btrfs_check_super_valid().
   Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
   And magic number check.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
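Points 1) and 3) above can be sketched in a few lines of user-space C. The bounds of 4K and 64K are assumptions chosen for the illustration, and the helper names are hypothetical; the sketch only shows why a value such as 12K passes a plain range check but fails the power-of-2 test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds for the illustration. */
#define MIN_SIZE_SKETCH 4096U
#define MAX_SIZE_SKETCH 65536U

/* Range check plus power-of-2 check for a sector or node size. */
static bool size_is_valid(uint32_t size)
{
	if (size < MIN_SIZE_SKETCH || size > MAX_SIZE_SKETCH)
		return false;
	return (size & (size - 1)) == 0;
}

/* A root's byte offset must be a multiple of the sector size. */
static bool root_is_aligned(uint64_t bytenr, uint32_t sectorsize)
{
	return sectorsize != 0 && bytenr % sectorsize == 0;
}
```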
Filipe Manana [Fri, 15 Jan 2016 11:05:12 +0000 (11:05 +0000)]
Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:
[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires a semaphore
multiple times it should invoke down_read_nested() with a different
lockdep class for each level of recursion.
Fix this by simplifying the implementation and using a mutex instead,
which is acquired by the cleaner kthread before it runs the delayed
iputs, instead of always acquiring a semaphore before delayed references
are run from anywhere.
Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 15 Jan 2016 10:56:15 +0000 (10:56 +0000)]
Btrfs: fix typo in log message when starting a balance
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 20 Jan 2016 02:21:30 +0000 (18:21 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 20 Jan 2016 02:21:00 +0000 (18:21 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Colin Ian King [Tue, 19 Jan 2016 00:05:28 +0000 (00:05 +0000)]
btrfs: remove duplicate const specifier
duplicate const is redundant so remove it
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sebastian Andrzej Siewior [Fri, 15 Jan 2016 13:37:15 +0000 (14:37 +0100)]
btrfs: initialize the seq counter in struct btrfs_device
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<80307cec>] (dump_stack)
| [<80070e98>] (__lock_acquire)
| [<8007184c>] (lock_acquire)
| [<80287800>] (btrfs_ioctl)
| [<8012a8d4>] (do_vfs_ioctl)
| [<8012ac14>] (SyS_ioctl)
so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.
Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter [Wed, 13 Jan 2016 12:21:17 +0000 (15:21 +0300)]
Btrfs: clean up an error code in btrfs_init_space_info()
If we return 1 here, then the caller treats it as an error and returns
-EINVAL. It causes a static checker warning to treat positive returns
as an error.
Fixes: 1aba86d67f34 ('Btrfs: fix easily get into ENOSPC in mixed case')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Wed, 13 Jan 2016 14:08:01 +0000 (22:08 +0800)]
btrfs: fix iterator with update error in backref.c
Fix the following error:
fs/btrfs/backref.c:565:1-20: iterator with update on line 577
Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tsutomu Itoh [Wed, 6 Jan 2016 08:03:40 +0000 (17:03 +0900)]
Btrfs: fix output of compression message in btrfs_parse_options()
The compression message might not be correctly output.
Fix it.
[[before fix]]
# mount -o compress /dev/sdb3 /test3
[ 996.874264] BTRFS info (device sdb3): disk space caching is enabled
[ 996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
[[after fix]]
# mount -o compress /dev/sdb3 /test3
[ 401.021753] BTRFS info (device sdb3): use zlib compression
[ 401.021758] BTRFS info (device sdb3): disk space caching is enabled
[ 401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 439.824624] BTRFS info (device sdb3): force zlib compression
[ 439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 459.918430] BTRFS info (device sdb3): use zlib compression
[ 459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chandan Rajendra [Thu, 7 Jan 2016 13:26:59 +0000 (18:56 +0530)]
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
The following call trace is seen when btrfs/031 test is executed in a loop,
[ 158.661848] ------------[ cut here ]------------
[ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[ 158.664102] BTRFS: Transaction aborted (error -2)
[ 158.664774] Modules linked in:
[ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2
[ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[ 158.670769] Call Trace:
[ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c
[ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
[ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 158.685958] ---[ end trace 4b63312de5a2cb76 ]---
[ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[ 158.709508] BTRFS info (device loop0): forced readonly
[ 158.737113] BTRFS info (device loop0): disk space caching is enabled
[ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
This occurs because,
Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
  btrfs_drop_snapshot()
    Add root corresponding to subvol 257 into
    btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
  257 is returned as the next free objectid
  btrfs_read_fs_root_no_name()
    Finds the btrfs_root instance corresponding to the old subvol with ID 257
    in btrfs_fs_info->fs_roots_radix.
    Returns error since btrfs_root_item->refs has the value of 0.
To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
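The effect of initializing the counter at load time can be illustrated with a toy allocator. This is a sketch under stated assumptions, not the btrfs implementation: struct id_alloc_sketch, the helper names and the base value of 256 are hypothetical, and the "late" case simply assumes the counter is derived from the on-disk ids only after subvolume 257 has already been deleted.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed base value: ids above this are handed out to new subvolumes. */
#define BASE_OBJECTID_SKETCH 256ULL

/* Hypothetical in-memory allocator state; not the real struct btrfs_root. */
struct id_alloc_sketch {
	uint64_t highest_objectid;
};

/* Scan the ids currently present "on disk" and return the highest one. */
static uint64_t highest_on_disk(const uint64_t *ids, size_t n)
{
	uint64_t max = BASE_OBJECTID_SKETCH;
	size_t i;

	for (i = 0; i < n; i++)
		if (ids[i] > max)
			max = ids[i];
	return max;
}

/* Initialize the counter from whatever ids are visible at that moment. */
static void init_from_disk(struct id_alloc_sketch *a,
			   const uint64_t *ids, size_t n)
{
	a->highest_objectid = highest_on_disk(ids, n);
}

/* Hand out the next free id. */
static uint64_t next_free_objectid(struct id_alloc_sketch *a)
{
	return ++a->highest_objectid;
}
```

If the counter is set while id 257 is still visible, the next id is 258 and the stale in-memory root for 257 is never looked up again; if it is set only after the deletion, 257 is handed out a second time.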
Jeff Mahoney [Wed, 3 Jun 2015 14:55:48 +0000 (10:55 -0400)]
btrfs: cleanup, stop casting for extent_map->lookup everywhere
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
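The cleanup described in this commit can be sketched with two small structs. The types below are hypothetical stand-ins for the kernel structures, and the union member name map_lookup is an assumption; the sketch relies on anonymous unions, which need C11 or GNU C.

```c
/* Hypothetical stand-ins for the two kinds of object the field can hold. */
struct block_device_sketch {
	int dev_id;
};

struct map_lookup_sketch {
	int num_stripes;
};

/* Before: one pointer type, reinterpreted by a cast at every use. */
struct extent_map_cast {
	struct block_device_sketch *bdev;
};

/* After: an anonymous union names both roles, so call sites need no cast. */
struct extent_map_union {
	union {
		struct block_device_sketch *bdev;
		struct map_lookup_sketch *map_lookup;
	};
};

/* Old style: the reader has to know that bdev is really a map_lookup. */
static int stripes_via_cast(const struct extent_map_cast *em)
{
	return ((struct map_lookup_sketch *)em->bdev)->num_stripes;
}

/* New style: the member name documents what is stored. */
static int stripes_via_union(const struct extent_map_union *em)
{
	return em->map_lookup->num_stripes;
}
```

Both accessors read the same pointer; the union only moves the reinterpretation from every call site into the type definition.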
Chris Mason [Mon, 11 Jan 2016 16:39:28 +0000 (08:39 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 14:08:37 +0000 (06:08 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 13:59:32 +0000 (05:59 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Wed, 6 Jan 2016 22:42:35 +0000 (22:42 +0000)]
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
	cd /
	rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Sam Tygier [Wed, 6 Jan 2016 08:46:12 +0000 (08:46 +0000)]
Btrfs: Check metadata redundancy on balance
When converting a filesystem via balance, check that the metadata mode
is at least as redundant as the data mode. For example, give a warning
when:
-dconvert=raid1 -mconvert=single
Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
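One way to express "at least as redundant" is to compare how many device failures each profile tolerates. The sketch below is an assumption about how such a comparison could look, not the code added by this commit; the enum and both helpers are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical profile identifiers for the illustration. */
enum profile_sketch {
	P_SINGLE,
	P_DUP,
	P_RAID0,
	P_RAID1,
	P_RAID10,
	P_RAID5,
	P_RAID6
};

/* Number of device failures a profile survives (assumed ranking). */
static int tolerated_failures(enum profile_sketch p)
{
	switch (p) {
	case P_RAID1:
	case P_RAID10:
	case P_RAID5:
		return 1;
	case P_RAID6:
		return 2;
	default:
		return 0;	/* single, dup and raid0 survive no device loss */
	}
}

/* True when the requested conversion deserves a warning. */
static bool metadata_less_redundant(enum profile_sketch data,
				    enum profile_sketch meta)
{
	return tolerated_failures(meta) < tolerated_failures(data);
}
```

With this ranking, -dconvert=raid1 -mconvert=single triggers the warning, while converting both to raid1 does not.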
David Sterba [Sat, 10 Oct 2015 15:59:53 +0000 (17:59 +0200)]
btrfs: statfs: report zero available if metadata are exhausted
There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.
In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.
To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.
Current:
/dev/sdb1 4.0G 3.3G 719M 83% /mnt/test
New:
/dev/sdb1 4.0G 3.3G 0 100% /mnt/test
We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice: the exhausted state happens while there's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)
We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.
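The threshold comparison can be sketched as follows, in Python for
illustration. The function name reported_avail and the example numbers are
hypothetical; only the 4M threshold and the "metadata free minus global
reserve" delta come from the text above:

```python
MiB = 1024 * 1024
THRESHOLD = 4 * MiB  # the 4M guesswork threshold from the commit message

def reported_avail(data_avail, meta_free, global_reserve):
    """Return the Avail value this sketch would report via statfs."""
    delta = meta_free - global_reserve
    if delta < THRESHOLD:
        # Metadata is considered exhausted: hide the remaining data space.
        return 0
    return data_avail

# Hypothetical numbers: 719M of data space left, metadata delta of 2M.
print(reported_avail(719 * MiB, 10 * MiB, 8 * MiB))
```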
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:03 +0000 (18:54 +0100)]
btrfs: preallocate path for snapshot creation at ioctl time
We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:00 +0000 (18:54 +0100)]
btrfs: allocate root item at snapshot ioctl time
The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:53:56 +0000 (18:53 +0100)]
btrfs: do an allocation earlier during snapshot creation
We can allocate pending_snapshot earlier and do not have to do cleanup
in case of failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:45 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path locks
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:
text data bss dec hex filename
938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after
(and the generated assembly does not change much)
The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving across
arches, the locks are not accessed frequently and sometimes just
compared to zero.
Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.
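The packing and cacheline figures above can be reproduced with simple
arithmetic. This is a sketch that assumes a 4096-byte slab page with no
per-object overhead and 64-byte cachelines; the helper names are
hypothetical:

```python
PAGE_SIZE = 4096      # assumed slab page size
CACHELINE_SIZE = 64   # assumed cacheline size

def objects_per_page(object_size):
    """How many objects of this size fit into one slab page."""
    return PAGE_SIZE // object_size

def cachelines(object_size):
    """How many cachelines an object of this size spans (rounded up)."""
    return -(-object_size // CACHELINE_SIZE)

before, after = 136, 112  # btrfs_path sizes quoted in the commit message
gain = objects_per_page(after) - objects_per_page(before)
print("extra objects per page:", gain)
print("cachelines before/after:", cachelines(before), cachelines(after))
```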
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:42 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path lowest_level
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:38 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path reada
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:35 +0000 (16:31 +0100)]
btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since
a717531942f488209dded30f6bc648167bcefa72
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:31 +0000 (11:42 +0100)]
btrfs: constify static arrays
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:28 +0000 (11:42 +0100)]
btrfs: constify remaining structs with function pointers
* struct extent_io_ops
* struct btrfs_free_space_op
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:24 +0000 (11:42 +0100)]
btrfs tests: replace whole ops structure for free space tests
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Mon, 21 Dec 2015 15:50:23 +0000 (23:50 +0800)]
btrfs: use list_for_each_entry* in backref.c
Use list_for_each_entry*() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:17:00 +0000 (22:17 +0800)]
btrfs: use list_for_each_entry_safe in free-space-cache.c
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:16:59 +0000 (22:16 +0800)]
btrfs: use list_for_each_entry* in check-integrity.c
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Mon, 14 Dec 2015 16:42:10 +0000 (01:42 +0900)]
Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset values. To make the
code readable we write '256 * 1024 * 1024' instead of '268435456' to
represent 256MB. However, we can make it far more readable with 'SZ_256M',
which is defined in 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kinds of expressions with a
single 'SZ_xxxM' if 'xxx' is a power of 2, and with 'xxx * SZ_1M' if 'xxx'
is not a power of 2. I haven't touched '4096' and '8192' because they are
more intuitive than 'SZ_4K' and 'SZ_8K'.
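The replacement rule can be sketched like this, in Python for illustration.
linux/sizes.h spells the megabyte constants SZ_1M, SZ_2M, ... SZ_512M; the
helper size_expression is hypothetical and only covers that range:

```python
SZ_1M = 0x00100000  # same value as SZ_1M in linux/sizes.h

def size_expression(megabytes):
    """Pick the replacement for '<megabytes> * 1024 * 1024' (1..512 only)."""
    if not 1 <= megabytes <= 512:
        raise ValueError("sketch only covers the SZ_1M..SZ_512M range")
    is_power_of_two = (megabytes & (megabytes - 1)) == 0
    if is_power_of_two:
        return "SZ_%dM" % megabytes
    return "%d * SZ_1M" % megabytes

print(size_expression(256))   # power of 2: single constant
print(size_expression(100))   # not a power of 2: multiple of SZ_1M
```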
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 10:02:31 +0000 (11:02 +0100)]
btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 20:15:06 +0000 (20:15 +0000)]
btrfs: zero out delayed node upon allocation
It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 19:35:44 +0000 (19:35 +0000)]
btrfs: pass proper enum type to start_transaction()
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 18 Oct 2015 21:35:41 +0000 (21:35 +0000)]
btrfs: switch __btrfs_fs_incompat return type from int to bool
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Tue, 19 May 2015 14:46:45 +0000 (23:46 +0900)]
btrfs: remove unused inode argument from uncompress_inline()
The inode argument is never used from the beginning, so remove it.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 8 Dec 2015 13:39:32 +0000 (14:39 +0100)]
btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.
The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.
The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Dec 2015 17:09:12 +0000 (18:09 +0100)]
btrfs: drop duplicate prefix from scrub workqueues
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:09 +0000 (17:27 +0100)]
btrfs: verbose error when we find an unexpected item in sys_array
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:06 +0000 (17:27 +0100)]
btrfs: handle invalid num_stripes in sys_array
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.
A crafted or corrupted image crashes at mount time:
BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<6001c86d>] show_stack+0xfe/0x15b
[<6038a835>] dump_stack+0x2a/0x2c
[<6038776b>] panic+0x13e/0x2b3
[<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<601cfbbe>] open_ctree+0x192d/0x27af
[<6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<600d710b>] do_mount+0xa35/0xbc9
[<600d7557>] SyS_mount+0x95/0xc8
[<6001e884>] handle_syscall+0x6b/0x8e
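The idea of validating num_stripes before it reaches the size helper can
be sketched as below, in Python for illustration. The structure sizes (80
and 32 bytes), the -EIO return value and the name read_sys_array_chunk are
assumptions, not taken from the patch:

```python
import errno

# Assumed on-disk sizes, for illustration only.
CHUNK_SIZE_ONE_STRIPE = 80  # struct btrfs_chunk, which embeds one stripe
STRIPE_SIZE = 32            # struct btrfs_stripe

def chunk_item_size(num_stripes):
    # Stand-in for the BUG_ON: callers must never pass zero stripes.
    assert num_stripes > 0
    return CHUNK_SIZE_ONE_STRIPE + STRIPE_SIZE * (num_stripes - 1)

def read_sys_array_chunk(num_stripes):
    """Validate untrusted num_stripes before computing the item size."""
    if num_stripes == 0:
        # Reject the crafted/corrupted image instead of crashing.
        return -errno.EIO
    return chunk_item_size(num_stripes)

print(read_sys_array_chunk(0))
print(read_sys_array_chunk(2))
```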
Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 15:51:29 +0000 (16:51 +0100)]
btrfs: better packing of btrfs_delayed_extent_op
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.
struct btrfs_delayed_extent_op {
        struct btrfs_disk_key      key;                  /*     0    17 */
        u8                         level;                /*    17     1 */
        bool                       update_key;           /*    18     1 */
        bool                       update_flags;         /*    19     1 */
        bool                       is_data;              /*    20     1 */

        /* XXX 3 bytes hole, try to pack */

        u64                        flags_to_set;         /*    24     8 */

        /* size: 32, cachelines: 1, members: 6 */
        /* sum members: 29, holes: 1, sum holes: 3 */
        /* last cacheline: 32 bytes */
};
The final size is 32 bytes which gives +26 objects per slab page.
text data bss dec hex filename
938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before
938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after
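The offsets, the 3-byte hole and the final size in the layout dump above
can be recomputed with a small model of C struct layout rules. This is a
sketch: the layout() helper is hypothetical, the member sizes and
alignments are read off the dump (the key is treated as a packed 17-byte
member with alignment 1), and a 4096-byte slab page is assumed:

```python
def layout(members):
    """Compute C-style offsets for (name, size, alignment) members."""
    offset, holes, placed, max_align = 0, 0, {}, 1
    for name, size, align in members:
        padding = (-offset) % align   # bytes needed to reach alignment
        holes += padding
        offset += padding
        placed[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = offset + (-offset) % max_align  # tail padding
    return placed, holes, total

members = [
    ("key", 17, 1),           # packed struct btrfs_disk_key
    ("level", 1, 1),
    ("update_key", 1, 1),
    ("update_flags", 1, 1),
    ("is_data", 1, 1),
    ("flags_to_set", 8, 8),
]
placed, holes, total = layout(members)
print(placed, "holes:", holes, "size:", total)
print("extra objects per 4096-byte page:", 4096 // total - 4096 // 40)
```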
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 13:15:51 +0000 (14:15 +0100)]
btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lose track of the count if the
inode is put into the delayed iputs again while it's processed.
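A rough model of the counting scheme, in Python for illustration. The
names (Inode, add_delayed_iput, run_delayed_iputs) and the exact
bookkeeping are assumptions; the point is only that entries are processed
one at a time so that a re-queued inode keeps its count:

```python
from collections import deque

class Inode:
    def __init__(self, name):
        self.name = name
        self.on_list = False
        self.delayed_iput_count = 0  # extra puts queued while already listed

delayed_iputs = deque()

def add_delayed_iput(inode):
    if inode.on_list:
        inode.delayed_iput_count += 1   # hook is in use, just count it
    else:
        inode.on_list = True
        delayed_iputs.append(inode)

def run_delayed_iputs(iput):
    # One entry at a time, no splice: iput() may queue the inode again.
    while delayed_iputs:
        inode = delayed_iputs.popleft()
        if inode.delayed_iput_count:
            inode.delayed_iput_count -= 1
            delayed_iputs.append(inode)  # still owes more puts
        else:
            inode.on_list = False
        iput(inode)

demo = Inode("demo")
for _ in range(3):
    add_delayed_iput(demo)
run_delayed_iputs(lambda inode: print("iput", inode.name))
```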
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Thu, 19 Nov 2015 09:26:22 +0000 (17:26 +0800)]
btrfs: Support convert to -d dup for btrfs-convert
Since we will add support for -d dup for non-mixed filesystems,
the kernel needs to support converting to this raid type.
This patch removes the limitation for the above case.
Tested by following script:
(combination of dup conversion with fsck):
export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'
do_dup_test()
{
    local m_from="$1"
    local d_from="$2"
    local m_to="$3"
    local d_to="$4"
    echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
    umount "$TEST_DIR" &>/dev/null
    ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
    mount "$TEST_DEV" "$TEST_DIR" || return 1
    cp -a /sbin/* "$TEST_DIR"
    [[ "$m_from" != "$m_to" ]] && {
        ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
    }
    [[ "$d_from" != "$d_to" ]] && {
        local opt=()
        [[ "$d_to" == single ]] && opt+=("-f")
        ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
    }
    umount "$TEST_DIR" || return 1
    ./btrfsck "$TEST_DEV" || return 1
    echo
    return 0
}
test_all()
{
    for m_from in single dup; do
        for d_from in single dup; do
            for m_to in single dup; do
                for d_to in single dup; do
                    do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
                done
            done
        done
    done
}
test_all
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 22 Oct 2015 19:05:09 +0000 (15:05 -0400)]
Btrfs: igrab inode in writepage
We hit this panic on a few of our boxes this week where we have an
ordered_extent with a NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 7 Oct 2015 09:23:23 +0000 (17:23 +0800)]
Btrfs: add missing brelse when superblock checksum fails
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 5 Jan 2016 16:24:05 +0000 (16:24 +0000)]
Btrfs: fix transaction handle leak on failure to create hard link
If we failed to create a hard link we were not always releasing the
transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Thu, 31 Dec 2015 18:16:29 +0000 (18:16 +0000)]
Btrfs: fix number of transaction units required to create symlink
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:08:24 +0000 (18:08 +0000)]
Btrfs: don't leave dangling dentry if symlink creation failed
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.
Example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt
$ touch /mnt/foo
$ ln -s /mnt/foo /mnt/bar
ln: failed to create symbolic link ‘bar’: Cannot allocate memory
$ umount /mnt
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
checking extents
checking free space cache
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
found 131073 bytes used err is 1
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124305
file data blocks allocated: 262144
referenced 262144
btrfs-progs v4.2.3
So fix this by adding the directory index entries as the very last
step of symlink creation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:07:59 +0000 (18:07 +0000)]
Btrfs: send, don't BUG_ON() when an empty symlink is found
When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
latter can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).
So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.
Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 30 Dec 2015 02:42:30 +0000 (02:42 +0000)]
Btrfs: fix race between free space endio workers and space cache writeout
While running a stress test I ran into the following trace/transaction
abort:
[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901]  0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009]  ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490]  ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542] [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636] [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048] [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616] [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950] [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611] [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610] [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718] [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672] [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800] [<ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:
1) A task calls btrfs_commit_transaction() and starts the writeout of the
space caches for all currently dirty block groups (i.e. it calls
btrfs_start_dirty_block_groups());
2) The previous step starts writeback for space caches;
3) When the writeback finishes it queues jobs for free space endio work
queue (fs_info->endio_freespace_worker) that execute
btrfs_finish_ordered_io();
4) The task committing the transaction sets the transaction's state
to TRANS_STATE_COMMIT_DOING and shortly after calls
btrfs_write_dirty_block_groups();
5) A free space endio job joins the transaction, through
btrfs_join_transaction_nolock(), and updates a free space inode item
in the root tree through btrfs_update_inode_fallback();
6) Updating the free space inode item resulted in COWing one or more
nodes/leaves of the root tree, and that resulted in creating a new
metadata block group, which gets added to the transaction's list
of dirty block groups (this is a very rare case);
7) The free space endio job has not released yet its transaction handle
at this point, so the new metadata block group was not yet fully
created (didn't go through btrfs_create_pending_block_groups() yet);
8) The transaction commit task sees the new metadata block group in
the transaction's list of dirty block groups and processes it.
When it attempts to update the block group's block group item in
the extent tree, through write_one_cache_group(), it isn't able
to find it and aborts the transaction with error -ENOENT - this
is because the free space endio job hasn't yet released its
transaction handle (which calls btrfs_create_pending_block_groups())
and therefore the block group item was not yet added to the extent
tree.
Fix this by waiting for free space endio jobs if we fail to find a block
group item in the extent tree, and then retrying the block group item
update once.
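The wait-and-retry-once shape of the fix can be sketched as follows, in
Python for illustration. The callables update_item and
wait_for_endio_workers are hypothetical stand-ins for the block group item
update and for flushing the free space endio work queue:

```python
import errno

def write_block_group_item(update_item, wait_for_endio_workers):
    """Try to update a block group item; on -ENOENT wait and retry once."""
    ret = update_item()
    if ret == -errno.ENOENT:
        # The item may not exist yet because an endio job still holds its
        # transaction handle. Let those jobs finish, then try exactly once more.
        wait_for_endio_workers()
        ret = update_item()
    return ret

# Simulated race: the item only appears once the endio job has finished.
state = {"item_in_extent_tree": False}

def update_item():
    return 0 if state["item_in_extent_tree"] else -errno.ENOENT

def wait_for_endio_workers():
    state["item_in_extent_tree"] = True

print(write_block_group_item(update_item, wait_for_endio_workers))
```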
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Chris Mason [Wed, 30 Dec 2015 15:52:35 +0000 (07:52 -0800)]
btrfs: don't run delayed references while we are creating the free space tree
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.
Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 30 Dec 2015 15:37:26 +0000 (07:37 -0800)]
btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.
Merging in the free space tree deleted a variable needed when
CONFIG_BTRFS_DEBUG=y
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:30:51 +0000 (13:30 -0800)]
btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc
map->num_stripes really can't be zero, but just in case.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:29:09 +0000 (13:29 -0800)]
Merge branch 'freespace-4.5' into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:28:35 +0000 (13:28 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:17:42 +0000 (13:17 -0800)]
Merge branch 'dev/simplify-set-bit' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:11:27 +0000 (13:11 -0800)]
Merge branch 'dev/gfp-flags' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:10:26 +0000 (13:10 -0800)]
Merge branch 'cleanup/misc-simplify' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Fri, 18 Dec 2015 03:02:48 +0000 (03:02 +0000)]
Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups
We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we are not taking the dirty block groups spinlock when checking for the
list's emptiness, grabbing its first element or deleting elements from
it.
However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.
So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().
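The locking pattern can be illustrated with a userspace model, in Python
with threading.Lock standing in for the spinlock. All names here are
hypothetical; the sketch only shows that every emptiness check, head grab
and removal happens under the lock, while the write itself runs outside it:

```python
import threading

dirty_bgs_lock = threading.Lock()   # stands in for the dirty bgs spinlock
dirty_bgs = []

def add_dirty_block_group(bg):
    with dirty_bgs_lock:
        dirty_bgs.append(bg)

def write_dirty_block_groups(write_one):
    while True:
        with dirty_bgs_lock:
            if not dirty_bgs:        # emptiness check under the lock
                break
            bg = dirty_bgs.pop(0)    # grab and delete under the lock
        write_one(bg)                # may concurrently add more entries

add_dirty_block_group("bg-1")
write_dirty_block_groups(lambda bg: print("wrote", bg))
```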
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Linus Torvalds [Mon, 21 Dec 2015 00:06:09 +0000 (16:06 -0800)]
Linux 4.4-rc6
Linus Torvalds [Sun, 20 Dec 2015 18:01:11 +0000 (10:01 -0800)]
Merge tag 'rtc-4.4-3' of git://git./linux/kernel/git/abelloni/linux
Pull RTC fixes from Alexandre Belloni:
"Late fixes for the RTC subsystem for 4.4:
A fix for a nasty hardware bug in rk808 and an initialization
reordering in da9063 to fix a possible crash"
* tag 'rtc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: da9063: fix access ordering error during RTC interrupt at system power on
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
Steve Twiss [Tue, 8 Dec 2015 16:28:39 +0000 (16:28 +0000)]
rtc: da9063: fix access ordering error during RTC interrupt at system power on
This fix alters the ordering of the IRQ and device registrations in the RTC
driver probe function. This change will apply to the RTC driver that supports
both DA9063 and DA9062 PMICs.
A problem could occur with the existing RTC driver if:
A system is started from a cold boot using the PMIC RTC IRQ to initiate a
power on operation. For instance, if an RTC alarm is used to start a
platform from power off.
The existing driver IRQ is requested before the device has been properly
registered.
i.e.
ret = devm_request_threaded_irq()
comes before
rtc->rtc_dev = devm_rtc_device_register();
In this case, the interrupt can be called before the device has been
registered and the handler can be called immediately. The IRQ handler
da9063_alarm_event() contains the function call
rtc_update_irq(rtc->rtc_dev, 1, RTC_IRQF | RTC_AF);
which in turn tries to access the unavailable rtc->rtc_dev.
The fix is to reorder the functions inside the RTC probe. The IRQ is
requested after the RTC device resource has been registered so that
get_irq_byname is the last thing to happen.
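The ordering problem can be modelled in a few lines, in Python for
illustration. The Rtc class, probe() and the "rtc0" device name are
hypothetical; the model assumes a pending alarm fires as soon as the
handler is installed:

```python
class Rtc:
    def __init__(self):
        self.rtc_dev = None   # set once the RTC device is registered

def alarm_event(rtc):
    # Stand-in for the IRQ handler, which needs rtc_dev to exist.
    if rtc.rtc_dev is None:
        raise RuntimeError("IRQ handler ran before the RTC device was registered")
    return "alarm delivered to " + rtc.rtc_dev

def probe(rtc, irq_pending, register_first):
    def request_irq():
        # A pending alarm (for example the one that powered the system on)
        # is handled immediately after the handler is installed.
        return alarm_event(rtc) if irq_pending else None

    def register_device():
        rtc.rtc_dev = "rtc0"

    if register_first:            # fixed ordering
        register_device()
        return request_irq()
    result = request_irq()        # old ordering
    register_device()
    return result

print(probe(Rtc(), irq_pending=True, register_first=True))
```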
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Julius Werner [Tue, 15 Dec 2015 23:02:49 +0000 (15:02 -0800)]
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar
insufficiently represented reality, and changed the rules about
calculating leap years to account for this. Similarly, in A.D. 2013
Rockchip hardware engineers found that the new Gregorian calendar still
contained flaws, and that the month of November should be counted up to
31 days instead. Unfortunately it takes a long time for calendar changes
to gain widespread adoption, and just like more than 300 years went by
before the last Protestant nation implemented Greg's proposal, we will
have to wait a while until all religions and operating system kernels
acknowledge the inherent advantages of the Rockchip system. Until then
we need to translate dates read from (and written to) Rockchip hardware
back to the Gregorian format.
This patch works by defining Jan 1st, 2016 as the arbitrary anchor date
on which Rockchip and Gregorian calendars are in sync. From that we can
translate arbitrary later dates back and forth by counting the number
of November/December transitions since the anchor date to determine the
offset between the calendars. We choose this method (rather than trying
to regularly "correct" the date stored in hardware) since it's the only
way to ensure perfect time-keeping even if the system may be shut down
for an unknown number of years. The drawback is that other software
reading the same hardware (e.g. mainboard firmware) must use the same
translation convention (including the same anchor date) to be able to
read and write correct timestamps from/to the RTC.
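The translation scheme described above can be sketched in userspace, in
Python for illustration. This is not the driver's code: the function names
are hypothetical, dates are plain (year, month, day) tuples because
November 31st is not a valid datetime.date, and the sketch is only meant
for dates on or after the Jan 1st, 2016 anchor:

```python
from datetime import date, timedelta

ANCHOR_YEAR = 2016  # calendars are in sync on Jan 1st of this year

def nov_dec_transitions(year, month):
    """Number of November/December transitions since the anchor date."""
    return (year - ANCHOR_YEAR) + (1 if month > 11 else 0)

def rockchip_to_gregorian(year, month, day):
    # Rockchip's Nov 31st occupies the slot of the day after Nov 30th.
    base = date(year, 12, 1) if (month, day) == (11, 31) else date(year, month, day)
    g = base + timedelta(days=nov_dec_transitions(year, month))
    return (g.year, g.month, g.day)

def gregorian_to_rockchip(year, month, day):
    extra = nov_dec_transitions(year, month)
    g = date(year, month, day)
    r = g - timedelta(days=extra)
    if nov_dec_transitions(r.year, r.month) < extra:
        # Stepping back crossed a Nov 31st that the subtraction skipped.
        if r.month == 11:
            return (r.year, r.month, r.day + 1)  # may yield day 31
        r = g - timedelta(days=extra - 1)
    return (r.year, r.month, r.day)

print(gregorian_to_rockchip(2016, 12, 1))
print(rockchip_to_gregorian(2016, 12, 31))
```

Counting transitions from a fixed anchor, rather than correcting the stored
date, is what lets the offset be recomputed after the system has been off
for any number of years.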
Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>