Filipe Manana [Fri, 15 Jan 2016 11:05:12 +0000 (11:05 +0000)]
Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:
[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<
ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<
ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568]
0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568]
ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568]
0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<
ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<
ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<
ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<
ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<
ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<
ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<
ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<
ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<
ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<
ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<
ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<
ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<
ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<
ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<
ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<
ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<
ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<
ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<
ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<
ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<
ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<
ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<
ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<
ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<
ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<
ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<
ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D
ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719]
ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499]
ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834]
ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<
ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<
ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<
ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<
ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<
ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<
ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<
ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<
ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<
ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<
ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<
ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<
ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<
ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<
ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<
ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<
ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<
ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<
ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<
ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D
ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626]
ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258]
ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850]
ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<
ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<
ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<
ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<
ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<
ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<
ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<
ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<
ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<
ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<
ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<
ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<
ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<
ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<
ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<
ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<
ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<
ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<
ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<
ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires multiple times a
semaphore it should invoke down_read_nested() with a different lockdep
class for each level of recursion.
Fix this by simplifying the implementation and use a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.
Fixes:
d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 15 Jan 2016 10:56:15 +0000 (10:56 +0000)]
Btrfs: fix typo in log message when starting a balance
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 20 Jan 2016 02:21:30 +0000 (18:21 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 20 Jan 2016 02:21:00 +0000 (18:21 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Colin Ian King [Tue, 19 Jan 2016 00:05:28 +0000 (00:05 +0000)]
btrfs: remove duplicate const specifier
duplicate const is redundant so remove it
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sebastian Andrzej Siewior [Fri, 15 Jan 2016 13:37:15 +0000 (14:37 +0100)]
btrfs: initialize the seq counter in struct btrfs_device
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<
80307cec>] (dump_stack)
| [<
80070e98>] (__lock_acquire)
| [<
8007184c>] (lock_acquire)
| [<
80287800>] (btrfs_ioctl)
| [<
8012a8d4>] (do_vfs_ioctl)
| [<
8012ac14>] (SyS_ioctl)
so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.
Fixes:
7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter [Wed, 13 Jan 2016 12:21:17 +0000 (15:21 +0300)]
Btrfs: clean up an error code in btrfs_init_space_info()
If we return 1 here, then the caller treats it as an error and returns
-EINVAL. It causes a static checker warning to treat positive returns
as an error.
Fixes:
1aba86d67f34 ('Btrfs: fix easily get into ENOSPC in mixed case')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Wed, 13 Jan 2016 14:08:01 +0000 (22:08 +0800)]
btrfs: fix iterator with update error in backref.c
Fix the following error:
fs/btrfs/backref.c:565:1-20: iterator with update on line 577
Fixes:
a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tsutomu Itoh [Wed, 6 Jan 2016 08:03:40 +0000 (17:03 +0900)]
Btrfs: fix output of compression message in btrfs_parse_options()
The compression message might not be correctly output.
Fix it.
[[before fix]]
# mount -o compress /dev/sdb3 /test3
[ 996.874264] BTRFS info (device sdb3): disk space caching is enabled
[ 996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
[[after fix]]
# mount -o compress /dev/sdb3 /test3
[ 401.021753] BTRFS info (device sdb3): use zlib compression
[ 401.021758] BTRFS info (device sdb3): disk space caching is enabled
[ 401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 439.824624] BTRFS info (device sdb3): force zlib compression
[ 439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 459.918430] BTRFS info (device sdb3): use zlib compression
[ 459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chandan Rajendra [Thu, 7 Jan 2016 13:26:59 +0000 (18:56 +0530)]
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
The following call trace is seen when btrfs/031 test is executed in a loop,
[ 158.661848] ------------[ cut here ]------------
[ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[ 158.664102] BTRFS: Transaction aborted (error -2)
[ 158.664774] Modules linked in:
[ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted
4.4.0-rc6-g511711a #2
[ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 158.667392]
ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[ 158.668515]
ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[ 158.669647]
ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[ 158.670769] Call Trace:
[ 158.671153] [<
ffffffff81431fc8>] dump_stack+0x44/0x5c
[ 158.671884] [<
ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[ 158.672769] [<
ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[ 158.673620] [<
ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[ 158.674440] [<
ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[ 158.675376] [<
ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[ 158.676235] [<
ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[ 158.677268] [<
ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[ 158.678183] [<
ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[ 158.678975] [<
ffffffff81144b8e>] ? vma_merge+0xee/0x300
[ 158.679751] [<
ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[ 158.680599] [<
ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[ 158.681686] [<
ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[ 158.682581] [<
ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[ 158.683399] [<
ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[ 158.684297] [<
ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[ 158.685051] [<
ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 158.685958] ---[ end trace
4b63312de5a2cb76 ]---
[ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[ 158.709508] BTRFS info (device loop0): forced readonly
[ 158.737113] BTRFS info (device loop0): disk space caching is enabled
[ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
This occurs because,
Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
btrfs_drop_snapshot()
Add root corresponding to subvol 257 into
btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
257 is returned as the next free objectid
btrfs_read_fs_root_no_name()
Finds the btrfs_root instance corresponding to the old subvol with ID 257
in btrfs_fs_info->fs_roots_radix.
Returns error since btrfs_root_item->refs has the value of 0.
To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 3 Jun 2015 14:55:48 +0000 (10:55 -0400)]
btrfs: cleanup, stop casting for extent_map->lookup everywhere
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Mon, 11 Jan 2016 16:39:28 +0000 (08:39 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 14:08:37 +0000 (06:08 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 13:59:32 +0000 (05:59 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Wed, 6 Jan 2016 22:42:35 +0000 (22:42 +0000)]
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes:
499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Sam Tygier [Wed, 6 Jan 2016 08:46:12 +0000 (08:46 +0000)]
Btrfs: Check metadata redundancy on balance
When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single
Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Sat, 10 Oct 2015 15:59:53 +0000 (17:59 +0200)]
btrfs: statfs: report zero available if metadata are exhausted
There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.
In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.
To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.
Current:
/dev/sdb1 4.0G 3.3G 719M 83% /mnt/test
New:
/dev/sdb1 4.0G 3.3G 0 100% /mnt/test
We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice, the exhausted state happens where's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)
We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:03 +0000 (18:54 +0100)]
btrfs: preallocate path for snapshot creation at ioctl time
We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:00 +0000 (18:54 +0100)]
btrfs: allocate root item at snapshot ioctl time
The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:53:56 +0000 (18:53 +0100)]
btrfs: do an allocation earlier during snapshot creation
We can allocate pending_snapshot earlier and do not have to do cleanup
in case of failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:45 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path locks
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:
text data bss dec hex filename
938731 43670 23144
1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144
1005017 f55d9 fs/btrfs/btrfs.ko.after
(and the generated assembly does not change much)
The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.
Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:42 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path lowest_level
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:38 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path reada
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:35 +0000 (16:31 +0100)]
btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since
a717531942f488209dded30f6bc648167bcefa72
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:31 +0000 (11:42 +0100)]
btrfs: constify static arrays
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:28 +0000 (11:42 +0100)]
btrfs: constify remaining structs with function pointers
* struct extent_io_ops
* struct btrfs_free_space_op
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:24 +0000 (11:42 +0100)]
btrfs tests: replace whole ops structure for free space tests
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Mon, 21 Dec 2015 15:50:23 +0000 (23:50 +0800)]
btrfs: use list_for_each_entry* in backref.c
Use list_for_each_entry*() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:17:00 +0000 (22:17 +0800)]
btrfs: use list_for_each_entry_safe in free-space-cache.c
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:16:59 +0000 (22:16 +0800)]
btrfs: use list_for_each_entry* in check-integrity.c
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Mon, 14 Dec 2015 16:42:10 +0000 (01:42 +0900)]
Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '
268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 10:02:31 +0000 (11:02 +0100)]
btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 20:15:06 +0000 (20:15 +0000)]
btrfs: zero out delayed node upon allocation
It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 19:35:44 +0000 (19:35 +0000)]
btrfs: pass proper enum type to start_transaction()
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 18 Oct 2015 21:35:41 +0000 (21:35 +0000)]
btrfs: switch __btrfs_fs_incompat return type from int to bool
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Tue, 19 May 2015 14:46:45 +0000 (23:46 +0900)]
btrfs: remove unused inode argument from uncompress_inline()
The inode argument is never used from the beginning, so remove it.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 8 Dec 2015 13:39:32 +0000 (14:39 +0100)]
btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.
The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.
The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Dec 2015 17:09:12 +0000 (18:09 +0100)]
btrfs: drop duplicate prefix from scrub workqueues
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:09 +0000 (17:27 +0100)]
btrfs: verbose error when we find an unexpected item in sys_array
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:06 +0000 (17:27 +0100)]
btrfs: handle invalid num_stripes in sys_array
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.
A crafted or corrupted image crashes at mount time:
BTRFS: device fsid
9006933e-2a9a-44f0-917f-
514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted
4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<
6001c86d>] show_stack+0xfe/0x15b
[<
6038a835>] dump_stack+0x2a/0x2c
[<
6038776b>] panic+0x13e/0x2b3
[<
6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<
601cfbbe>] open_ctree+0x192d/0x27af
[<
6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<
600bc9a7>] mount_fs+0x11/0xf3
[<
600d5167>] vfs_kern_mount+0x75/0x11a
[<
6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<
600bc9a7>] mount_fs+0x11/0xf3
[<
600d5167>] vfs_kern_mount+0x75/0x11a
[<
600d710b>] do_mount+0xa35/0xbc9
[<
600d7557>] SyS_mount+0x95/0xc8
[<
6001e884>] handle_syscall+0x6b/0x8e
Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 15:51:29 +0000 (16:51 +0100)]
btrfs: better packing of btrfs_delayed_extent_op
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.
struct btrfs_delayed_extent_op {
struct btrfs_disk_key key; /* 0 17 */
u8 level; /* 17 1 */
bool update_key; /* 18 1 */
bool update_flags; /* 19 1 */
bool is_data; /* 20 1 */
/* XXX 3 bytes hole, try to pack */
u64 flags_to_set; /* 24 8 */
/* size: 32, cachelines: 1, members: 6 */
/* sum members: 29, holes: 1, sum holes: 3 */
/* last cacheline: 32 bytes */
};
The final size is 32 bytes which gives +26 object per slab page.
text data bss dec hex filename
938811 43670 23144
1005625 f5839 fs/btrfs/btrfs.ko.before
938747 43670 23144
1005561 f57f9 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 13:15:51 +0000 (14:15 +0100)]
btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Thu, 19 Nov 2015 09:26:22 +0000 (17:26 +0800)]
btrfs: Support convert to -d dup for btrfs-convert
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.
This patch remove limitation of above case.
Tested by following script:
(combination of dup conversion with fsck):
export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'
do_dup_test()
{
local m_from="$1"
local d_from="$2"
local m_to="$3"
local d_to="$4"
echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
umount "$TEST_DIR" &>/dev/null
./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
mount "$TEST_DEV" "$TEST_DIR" || return 1
cp -a /sbin/* "$TEST_DIR"
[[ "$m_from" != "$m_to" ]] && {
./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
}
[[ "$d_from" != "$d_to" ]] && {
local opt=()
[[ "$d_to" == single ]] && opt+=("-f")
./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
}
umount "$TEST_DIR" || return 1
./btrfsck "$TEST_DEV" || return 1
echo
return 0
}
test_all()
{
for m_from in single dup; do
for d_from in single dup; do
for m_to in single dup; do
for d_to in single dup; do
do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
done
done
done
done
}
test_all
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 22 Oct 2015 19:05:09 +0000 (15:05 -0400)]
Btrfs: igrab inode in writepage
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 7 Oct 2015 09:23:23 +0000 (17:23 +0800)]
Btrfs: add missing brelse when superblock checksum fails
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 5 Jan 2016 16:24:05 +0000 (16:24 +0000)]
Btrfs: fix transaction handle leak on failure to create hard link
If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Thu, 31 Dec 2015 18:16:29 +0000 (18:16 +0000)]
Btrfs: fix number of transaction units required to create symlink
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:08:24 +0000 (18:08 +0000)]
Btrfs: don't leave dangling dentry if symlink creation failed
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.
Example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt
$ touch /mnt/foo
$ ln -s /mnt/foo /mnt/bar
ln: failed to create symbolic link ‘bar’: Cannot allocate memory
$ umount /mnt
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID:
d5acb5ba-31bd-42da-b456-
89dca2e716e1
checking extents
checking free space cache
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
found 131073 bytes used err is 1
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124305
file data blocks allocated: 262144
referenced 262144
btrfs-progs v4.2.3
So fix this by adding the directory index entries as the very last
step of symlink creation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:07:59 +0000 (18:07 +0000)]
Btrfs: send, don't BUG_ON() when an empty symlink is found
When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
later can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).
So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.
Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 30 Dec 2015 02:42:30 +0000 (02:42 +0000)]
Btrfs: fix race between free space endio workers and space cache writeout
While running a stress test I ran into the following trace/transaction
abort:
[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901]
0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009]
ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490]
ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804] [<
ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049] [<
ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542] [<
ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326] [<
ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636] [<
ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048] [<
ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616] [<
ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950] [<
ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286] [<
ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611] [<
ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610] [<
ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718] [<
ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672] [<
ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800] [<
ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990] [<
ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace
baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:
1) A task calls btrfs_commit_transaction() and starts the writeout of the
space caches for all currently dirty block groups (i.e. it calls
btrfs_start_dirty_block_groups());
2) The previous step starts writeback for space caches;
3) When the writeback finishes it queues jobs for free space endio work
queue (fs_info->endio_freespace_worker) that execute
btrfs_finish_ordered_io();
4) The task committing the transaction sets the transaction's state
to TRANS_STATE_COMMIT_DOING and shortly after calls
btrfs_write_dirty_block_groups();
5) A free space endio job joins the transaction, through
btrfs_join_transaction_nolock(), and updates a free space inode item
in the root tree through btrfs_update_inode_fallback();
6) Updating the free space inode item resulted in COWing one or more
nodes/leaves of the root tree, and that resulted in creating a new
metadata block group, which gets added to the transaction's list
of dirty block groups (this is a very rare case);
7) The free space endio job has not released yet its transaction handle
at this point, so the new metadata block group was not yet fully
created (didn't go through btrfs_create_pending_block_groups() yet);
8) The transaction commit task sees the new metadata block group in
the transaction's list of dirty block groups and processes it.
When it attempts to update the block group's block group item in
the extent tree, through write_one_cache_group(), it isn't able
to find it and aborts the transaction with error -ENOENT - this
is because the free space endio job hasn't yet released its
transaction handle (which calls btrfs_create_pending_block_groups())
and therefore the block group item was not yet added to the extent
tree.
Fix this waiting for free space endio jobs if we fail to find a block
group item in the extent tree and then retry once updating the block
group item.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Chris Mason [Wed, 30 Dec 2015 15:52:35 +0000 (07:52 -0800)]
btrfs: don't run delayed references while we are creating the free space tree
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.
Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 30 Dec 2015 15:37:26 +0000 (07:37 -0800)]
btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.
Merging in the free space tree deleted a variable needed when
CONFIG_BTRFS_DEBUG=y
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:30:51 +0000 (13:30 -0800)]
btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc
map->num_stripes really can't be zero, but just in case.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:29:09 +0000 (13:29 -0800)]
Merge branch 'freespace-4.5' into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:28:35 +0000 (13:28 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:17:42 +0000 (13:17 -0800)]
Merge branch 'dev/simplify-set-bit' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:11:27 +0000 (13:11 -0800)]
Merge branch 'dev/gfp-flags' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:10:26 +0000 (13:10 -0800)]
Merge branch 'cleanup/misc-simplify' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Fri, 18 Dec 2015 03:02:48 +0000 (03:02 +0000)]
Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups
We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we not taking the dirty block groups spinlock when checking for the
list's emptyness, grabbing its first element or deleting elements from
it.
However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.
So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Linus Torvalds [Mon, 21 Dec 2015 00:06:09 +0000 (16:06 -0800)]
Linux 4.4-rc6
Linus Torvalds [Sun, 20 Dec 2015 18:01:11 +0000 (10:01 -0800)]
Merge tag 'rtc-4.4-3' of git://git./linux/kernel/git/abelloni/linux
Pull RTC fixes from Alexandre Belloni:
"Late fixes for the RTC subsystem for 4.4:
A fix for a nasty hardware bug in rk808 and an initialization
reordering in da9063 to fix a possible crash"
* tag 'rtc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: da9063: fix access ordering error during RTC interrupt at system power on
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
Steve Twiss [Tue, 8 Dec 2015 16:28:39 +0000 (16:28 +0000)]
rtc: da9063: fix access ordering error during RTC interrupt at system power on
This fix alters the ordering of the IRQ and device registrations in the RTC
driver probe function. This change will apply to the RTC driver that supports
both DA9063 and DA9062 PMICs.
A problem could occur with the existing RTC driver if:
A system is started from a cold boot using the PMIC RTC IRQ to initiate a
power on operation. For instance, if an RTC alarm is used to start a
platform from power off.
The existing driver IRQ is requested before the device has been properly
registered.
i.e.
ret = devm_request_threaded_irq()
comes before
rtc->rtc_dev = devm_rtc_device_register();
In this case, the interrupt can be called before the device has been
registered and the handler can be called immediately. The IRQ handler
da9063_alarm_event() contains the function call
rtc_update_irq(rtc->rtc_dev, 1, RTC_IRQF | RTC_AF);
which in turn tries to access the unavailable rtc->rtc_dev.
The fix is to reorder the functions inside the RTC probe. The IRQ is
requested after the RTC device resource has been registered so that
get_irq_byname is the last thing to happen.
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Julius Werner [Tue, 15 Dec 2015 23:02:49 +0000 (15:02 -0800)]
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar
insufficiently represented reality, and changed the rules about
calculating leap years to account for this. Similarly, in A.D. 2013
Rockchip hardware engineers found that the new Gregorian calendar still
contained flaws, and that the month of November should be counted up to
31 days instead. Unfortunately it takes a long time for calendar changes
to gain widespread adoption, and just like more than 300 years went by
before the last Protestant nation implemented Greg's proposal, we will
have to wait a while until all religions and operating system kernels
acknowledge the inherent advantages of the Rockchip system. Until then
we need to translate dates read from (and written to) Rockchip hardware
back to the Gregorian format.
This patch works by defining Jan 1st, 2016 as the arbitrary anchor date
on which Rockchip and Gregorian calendars are in sync. From that we can
translate arbitrary later dates back and forth by counting the number
of November/December transitons since the anchor date to determine the
offset between the calendars. We choose this method (rather than trying
to regularly "correct" the date stored in hardware) since it's the only
way to ensure perfect time-keeping even if the system may be shut down
for an unknown number of years. The drawback is that other software
reading the same hardware (e.g. mainboard firmware) must use the same
translation convention (including the same anchor date) to be able to
read and write correct timestamps from/to the RTC.
Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Linus Torvalds [Sun, 20 Dec 2015 01:44:19 +0000 (17:44 -0800)]
Merge tag 'tty-4.4-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some tty/serial driver fixes for 4.4-rc6 that resolve some
reported problems. All of these have been in linux-next. The details
are in the shortlog"
* tag 'tty-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: Fix GPF in flush_to_ldisc()
serial: earlycon: Add missing spinlock initialization
serial: sh-sci: Fix length of scatterlist
n_tty: Fix poll() after buffer-limited eof push read
serial: 8250_uniphier: fix dl_read and dl_write functions
Linus Torvalds [Sun, 20 Dec 2015 01:33:58 +0000 (17:33 -0800)]
Merge tag 'usb-4.4-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB and PHY fixes for 4.4-rc6. All of them resolve some
reported problems. Full details in the shortlog"
* tag 'usb-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: fix invalid memory access in hub_activate()
USB: ipaq.c: fix a timeout loop
phy: core: Get a refcount to phy in devm_of_phy_get_by_index()
phy: cygnus: pcie: add missing of_node_put
phy: miphy365x: add missing of_node_put
phy: miphy28lp: add missing of_node_put
phy: rockchip-usb: add missing of_node_put
phy: berlin-sata: add missing of_node_put
phy: mt65xx-usb3: add missing of_node_put
phy: brcmstb-sata: add missing of_node_put
phy: sun9i-usb: add USB dependency
Linus Torvalds [Sun, 20 Dec 2015 00:46:46 +0000 (16:46 -0800)]
Merge tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Four fixes for md:
- two recently introduced regressions fixed.
- one older bug in RAID10 - tagged for -stable since 4.2
- one minor sysfs api improvement"
* tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md:
Fix remove_and_add_spares removes drive added as spare in slot_store
md: fix bug due to nested suspend
MD: change journal disk role to disk 0
md/raid10: fix data corruption and crash during resync
Linus Torvalds [Sun, 20 Dec 2015 00:40:48 +0000 (16:40 -0800)]
Merge tag 'powerpc-4.4-5' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Partial revert of "powerpc: Individual System V IPC system calls"
- pr_warn_once on unsupported OPAL_MSG type from Stewart
- Fix deadlock in opal-irqchip introduced by "Fix double endian
conversion" from Alistair
* tag 'powerpc-4.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/opal-irqchip: Fix deadlock introduced by "Fix double endian conversion"
powerpc/powernv: pr_warn_once on unsupported OPAL_MSG type
Partial revert of "powerpc: Individual System V IPC system calls"
Linus Torvalds [Sat, 19 Dec 2015 18:10:43 +0000 (10:10 -0800)]
Merge tag 'spi-fix-v4.4-rc5' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A couple of reference counting bugs here, one in spidev and one with
holding an extra reference in the core that we never freed if we
removed a device, plus a driver specific fix. Both of the refcounting
bugs are very old but they've only been found by observation so
hopefully their impact has been low"
* tag 'spi-fix-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: fix parent-device reference leak
spi: spidev: Hold spi_lock over all defererences of spi in release()
spi-fsl-dspi: Fix CTAR Register access
Linus Torvalds [Sat, 19 Dec 2015 18:05:00 +0000 (10:05 -0800)]
Merge tag 'gpio-v4.4-3' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Some GPIO fixes for the v4.4 series. Most prominent: I revert the
error propagation from the .get() function until we can fix up all the
drivers properly for v4.5.
- Revert the error number propagation from the .get() vtable entry
temporarily, until we make the proper fixes to all drivers.
- Fix the clamping behaviour in the generic GPIO driver.
- Driver fix for the ath79 driver"
* tag 'gpio-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: revert get() to non-errorprogating behaviour
gpio: generic: clamp values from bgpio_get_set()
gpio: ath79: Fix the logic to clear offset bit of AR71XX_GPIO_REG_OE register
Linus Torvalds [Sat, 19 Dec 2015 18:01:03 +0000 (10:01 -0800)]
Merge tag 'pinctrl-v4.4-3' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Driver fixes for Freescale i.MX7D, Intel, Broadcom 2835
- One MAINTAINERS entry
* tag 'pinctrl-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
MAINTAINERS: pinctrl: Add maintainers for pinctrl-single
pinctrl: bcm2835: Fix initial value for direction_output
pinctrl: intel: fix offset calculation issue of register PAD_OWN
pinctrl: intel: fix bug of register offset calculation
pinctrl: freescale: add ZERO_OFFSET_VALID flag for vf610 pinctrl
Linus Torvalds [Sat, 19 Dec 2015 17:52:44 +0000 (09:52 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A set of 'usual' driver bugfixes for the I2C subsystem"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: rcar: disable runtime PM correctly in slave mode
i2c: designware: Keep pm_runtime_enable/_disable calls in sync
i2c: designware: fix IO timeout issue for AMD controller
i2c: imx: init bus recovery info before adding i2c adapter
i2c: do not use 0x in front of %pa
i2c: davinci: Increase module clock frequency
i2c: mv64xxx: The n clockdiv factor is 0 based on sunxi SoCs
i2c: rk3x: populate correct variable for sda_falling_time
Linus Torvalds [Sat, 19 Dec 2015 17:51:11 +0000 (09:51 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"Just a few assorted driver fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: elants_i2c - fix wake-on-touch
Input: elan_i2c - set input device's vendor and product IDs
Input: sun4i-lradc-keys - fix typo in binding documentation
Input: atmel_mxt_ts - add maxtouch to I2C table for module autoload
Input: arizona-haptic - fix disabling of haptics device
Input: aiptek - fix crash on detecting device without endpoints
Input: atmel_mxt_ts - add generic platform data for Chromebooks
Input: parkbd - clear unused function pointers
Input: walkera0701 - clear unused function pointers
Input: turbografx - clear unused function pointers
Input: gamecon - clear unused function pointers
Input: db9 - clear unused function pointers
Wolfram Sang [Wed, 16 Dec 2015 19:05:18 +0000 (20:05 +0100)]
i2c: rcar: disable runtime PM correctly in slave mode
When we also are I2C slave, we need to disable runtime PM because the
address detection mechanism needs to be active all the time. However, we
can reenable runtime PM once the slave instance was unregistered. So,
use pm_runtime_get_sync/put to achieve this, since it has proper
refcounting. pm_runtime_allow/forbid is like a global knob controllable
from userspace which is unsuitable here.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: stable@kernel.org
Linus Torvalds [Sat, 19 Dec 2015 05:01:35 +0000 (21:01 -0800)]
Merge tag 'pm+acpi-4.4-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a potential regression introduced during the 4.3 cycle
(generic power domains framework), a nasty bug that has been present
forever (power capping RAPL driver), a build issue (Tegra cpufreq
driver) and a minor ugliness introduced recently (intel_pstate).
Specifics:
- Fix a potential regression in the generic power domains framework
introduced during the 4.3 development cycle that may lead to
spurious failures of system suspend in certain situations (Ulf
Hansson).
- Fix a problem in the power capping RAPL (Running Average Power
Limits) driver that causes it to initialize successfully on some
systems where it is not supposed to do that which is due to an
incorrect check in an initialization routine (Prarit Bhargava).
- Fix a build problem in the cpufreq Tegra driver that depends on the
regulator framework, but that dependency is not reflected in
Kconfig (Arnd Bergmann).
- Fix a recent mistake in the intel_pstate driver where a numeric
constant is used directly instead of a symbol defined specifically
for the case in question (Prarit Bhargava)"
* tag 'pm+acpi-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
powercap / RAPL: fix BIOS lock check
cpufreq: intel_pstate: Minor cleanup for FRAC_BITS
cpufreq: tegra: add regulator dependency for T124
PM / Domains: Allow runtime PM callbacks to be re-used during system PM
Linus Torvalds [Sat, 19 Dec 2015 04:35:35 +0000 (20:35 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes this time, two in SES picked up by KASAN for various types
of buffer overrun. The first is a USB array which returns page 8
whatever is asked for and causes us to overrun with incorrect data
format assumptions and the second is an invalid iteration of page 10
(the additional information page).
The final fix is a reversion of a NULL deref fix which caused
suspend/resume not to be called in pairs leading to incorrect device
operation (Jens has queued a more proper fix for the problem in
block)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
ses: fix additional element traversal bug
Revert "SCSI: Fix NULL pointer dereference in runtime PM"
ses: Fix problems with simple enclosures
James Chen [Fri, 18 Dec 2015 23:51:48 +0000 (15:51 -0800)]
Input: elants_i2c - fix wake-on-touch
When sending "SLEEP" command to the controller it ceases scanning
completely and is unable to wake the system up from sleep, so if it is
configured as a wakeup source we should simply configure interrupt for
wakeup and rely on idle logic within the controller to reduce power
consumption while it is not used.
Signed-off-by: James Chen <james.chen@emc.com.tw>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Linus Torvalds [Fri, 18 Dec 2015 23:41:35 +0000 (15:41 -0800)]
Merge tag 'media/v4.4-3' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab.
* tag 'media/v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] airspy: increase USB control message buffer size
[media] hackrf: move RF gain ctrl enable behind module parameter
[media] hackrf: fix possible null ptr on debug printing
[media] Revert "[media] ivtv: avoid going past input/audio array"
Linus Torvalds [Fri, 18 Dec 2015 23:35:08 +0000 (15:35 -0800)]
Merge branch 'for-linus-4.4' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"A couple of small fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: check prepare_uptodate_page() error code earlier
Btrfs: check for empty bitmap list in setup_cluster_bitmaps
btrfs: fix misleading warning when space cache failed to load
Btrfs: fix transaction handle leak in balance
Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
Linus Torvalds [Fri, 18 Dec 2015 22:25:57 +0000 (14:25 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"Three patches"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
include/linux/mmdebug.h: should include linux/bug.h
mm/zswap: change incorrect strncmp use to strcmp
proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter
James Morse [Fri, 18 Dec 2015 22:22:07 +0000 (14:22 -0800)]
include/linux/mmdebug.h: should include linux/bug.h
mmdebug.h uses BUILD_BUG_ON_INVALID(), assuming someone else included
linux/bug.h. Include it ourselves.
This saves build-failures such as:
arch/arm64/include/asm/pgtable.h: In function 'set_pte_at':
arch/arm64/include/asm/pgtable.h:281:3: error: implicit declaration of function 'BUILD_BUG_ON_INVALID' [-Werror=implicit-function-declaration]
VM_WARN_ONCE(!pte_young(pte),
Fixes:
02602a18c32d7 ("bug: completely remove code generated by disabled VM_BUG_ON()")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Streetman [Fri, 18 Dec 2015 22:22:04 +0000 (14:22 -0800)]
mm/zswap: change incorrect strncmp use to strcmp
Change the use of strncmp in zswap_pool_find_get() to strcmp.
The use of strncmp is no longer correct, now that zswap_zpool_type is
not an array; sizeof() will return the size of a pointer, which isn't
the right length to compare. We don't need to use strncmp anyway,
because the existing params and the passed in params are all guaranteed
to be null terminated, so strcmp should be used.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Fri, 18 Dec 2015 22:22:01 +0000 (14:22 -0800)]
proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter
Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
the setting of ret after the get_proc_task call and incorrectly left it as
-ESRCH. Instead, return 0 when successful.
Example breakage:
echo 0 > /proc/self/coredump_filter
bash: echo: write error: No such process
Fixes:
774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 18 Dec 2015 20:51:52 +0000 (12:51 -0800)]
Merge tag 'hwmon-for-linus-v4.4-rc6' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Select CONFIG_BITREVERSE for sht15 driver to avoid build failure if
it is not configured.
- Force wait for conversion time for the first valid data in tmp102
driver to avoid reporting erroneous data to the thermal subsystem.
* tag 'hwmon-for-linus-v4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (sht15) Select CONFIG_BITREVERSE
hwmon: (tmp102) Force wait for conversion time for the first valid data
Linus Torvalds [Fri, 18 Dec 2015 20:38:35 +0000 (12:38 -0800)]
Merge tag 'iommu-fixes-v4.4-rc5' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
"Two similar fixes for the Intel and AMD IOMMU drivers to add proper
access checks before calling handle_mm_fault"
* tag 'iommu-fixes-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Do access checks before calling handle_mm_fault()
iommu/amd: Do proper access checking before calling handle_mm_fault()
Linus Torvalds [Fri, 18 Dec 2015 20:24:52 +0000 (12:24 -0800)]
Merge tag 'for-linus-4.4-rc5-tag' of git://git./linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: fix up cleanup path when alloc fails
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
xen/x86/pvh: Use HVM's flush_tlb_others op
xen: Resume PMU from non-atomic context
xen/events/fifo: Consume unprocessed events when a CPU dies
Linus Torvalds [Fri, 18 Dec 2015 20:19:01 +0000 (12:19 -0800)]
Merge tag 'arc-fixes-for-4.4-rc6' of git://git./linux/kernel/git/vgupta/arc
Pull ARC architecture fixes from Vineet Gupta:
"Fixes for:
- perf interrupts on SMP: Not enabled (at boot) and disabled (at runtime)
- stack unwinder regression (for modules, ignoring dwarf3)
- nsim hosed for non default kernel link base builds"
* tag 'arc-fixes-for-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: smp: Rename platform hook @init_cpu_smp -> @init_per_cpu
ARC: rename smp operation init_irq_cpu() to init_per_cpu()
ARC: dw2 unwind: Ignore CIE version !=1 gracefully instead of bailing
ARC: dw2 unwind: Reinstante unwinding out of modules
ARC: [plat-sim] unbork non default CONFIG_LINUX_LINK_BASE
ARC: intc: Document arc_request_percpu_irq() better
ARCv2: perf: Ensure perf intr gets enabled on all cores
ARC: intc: No need to clear IRQ_NOAUTOEN
ARCv2: intc: Fix random perf irq disabling in SMP setup
ARC: [axs10x] cap ethernet phy to 100 Mbit/sec
Linus Torvalds [Fri, 18 Dec 2015 19:47:06 +0000 (11:47 -0800)]
Merge tag 'sound-4.4-rc6' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"As usual in rc6, this update contains only a few HD-audio and
USB-audio device-specific quirks: yet another Thinkpad noise fixes,
Dell headphone mic fixes, and AudioQuest DragonFly fixes"
* tag 'sound-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Add a fixup for Thinkpad X1 Carbon 2nd
ALSA: hda - Set codec to D3 at reboot/shutdown on Thinkpads
ALSA: hda - Apply click noise workaround for Thinkpads generically
ALSA: hda - Fix headphone mic input on a few Dell ALC293 machines
ALSA: usb-audio: Add sample rate inquiry quirk for AudioQuest DragonFly
ALSA: usb-audio: Add a more accurate volume quirk for AudioQuest DragonFly
Linus Torvalds [Fri, 18 Dec 2015 19:19:16 +0000 (11:19 -0800)]
Merge tag 'for-linus-
20151217' of git://git.infradead.org/linux-mtd
Pull MTD fixes from Brian Norris:
"I was holding out on this pull request for a bit, since there are a
few other small issues being discussed that look like 4.4-rc
regressions. Hopefully I can get those stabilized soon, but these are
ready at any rate:
- A little bit of a last-minute change for the device tree "fixed
partition" binding. This is needed because we might want to reuse
the 'partitions' subnode for other sorts of partitioning
descriptions -- e.g., for describing which on-flash partition
format(s) might be used on the system.
- Also tone down a warning message, since it is probably going to
show up on a lot of systems where it should just be ignored"
* tag 'for-linus-
20151217' of git://git.infradead.org/linux-mtd:
doc: dt: mtd: partitions: add compatible property to "partitions" node
mtd: ofpart: don't complain about missing 'partitions' node too loudly
Chris Mason [Fri, 18 Dec 2015 19:11:10 +0000 (11:11 -0800)]
Merge branch 'freespace-tree' into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Alan Stern [Wed, 16 Dec 2015 18:32:38 +0000 (13:32 -0500)]
USB: fix invalid memory access in hub_activate()
Commit
8520f38099cc ("USB: change hub initialization sleeps to
delayed_work") changed the hub_activate() routine to make part of it
run in a workqueue. However, the commit failed to take a reference to
the usb_hub structure or to lock the hub interface while doing so. As
a result, if a hub is plugged in and quickly unplugged before the work
routine can run, the routine will try to access memory that has been
deallocated. Or, if the hub is unplugged while the routine is
running, the memory may be deallocated while it is in active use.
This patch fixes the problem by taking a reference to the usb_hub at
the start of hub_activate() and releasing it at the end (when the work
is finished), and by locking the hub interface while the work routine
is running. It also adds a check at the start of the routine to see
if the hub has already been disconnected, in which nothing should be
done.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Alexandru Cornea <alexandru.cornea@intel.com>
Tested-by: Alexandru Cornea <alexandru.cornea@intel.com>
Fixes:
8520f38099cc ("USB: change hub initialization sleeps to delayed_work")
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Wed, 16 Dec 2015 11:06:37 +0000 (14:06 +0300)]
USB: ipaq.c: fix a timeout loop
The code expects the loop to end with "retries" set to zero but, because
it is a post-op, it will end set to -1. I have fixed this by moving the
decrement inside the loop.
Fixes:
014aa2a3c32e ('USB: ipaq: minor ipaq_open() cleanup.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Antti Palosaari [Mon, 26 Oct 2015 20:58:14 +0000 (18:58 -0200)]
[media] airspy: increase USB control message buffer size
Driver requested device firmware version string during probe using
only 24 byte long buffer. That buffer is too small for newer firmware
versions, which causes device firmware hang - device stops responding
to any commands after that. Increase buffer size to 128 which should
be enough for any current and future version strings.
Link: https://github.com/airspy/host/issues/27
Cc: <stable@vger.kernel.org> # 3.17+
Reported-by: Benjamin Vernoux <bvernoux@gmail.com>
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Antti Palosaari [Fri, 23 Oct 2015 22:01:31 +0000 (20:01 -0200)]
[media] hackrf: move RF gain ctrl enable behind module parameter
Used Avago MGA-81563 RF amplifier could be destroyed pretty easily
with too strong signal or transmitting to bad antenna.
Add module parameter 'enable_rf_gain_ctrl' which allows enabling
RF gain control - otherwise, default without the module parameter,
RF gain control is set to 'grabbed' state which prevents setting
value to the control.
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Antti Palosaari [Wed, 21 Oct 2015 21:02:41 +0000 (19:02 -0200)]
[media] hackrf: fix possible null ptr on debug printing
drivers/media/usb/hackrf/hackrf.c:1533 hackrf_probe()
error: we previously assumed 'dev' could be null (see line 1366)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Mauro Carvalho Chehab [Wed, 11 Nov 2015 11:22:36 +0000 (09:22 -0200)]
[media] Revert "[media] ivtv: avoid going past input/audio array"
This patch broke ivtv logic, as reported at
https://bugzilla.redhat.com/show_bug.cgi?id=
1278942
This reverts commit
09290cc885937cab3b2d60a6d48fe3d2d3e04061.
Cc: stable@vger.kernel.org # for v4.1 and upper
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Greg Kroah-Hartman [Fri, 18 Dec 2015 17:24:50 +0000 (09:24 -0800)]
Merge tag 'phy-for-4.4-rc' of git://git./linux/kernel/git/kishon/linux-phy into usb-linus
Kishon writes:
phy: for 4.4 -rc
*) Add missing of_node_put in a bunch of PHY drivers
*) Add get_device in devm_of_phy_get_by_index()
*) Fix randconfig build error in sun9i usb driver
Doug Goldstein [Thu, 26 Nov 2015 20:32:39 +0000 (14:32 -0600)]
xen-pciback: fix up cleanup path when alloc fails
When allocating a pciback device fails, clear the private
field. This could lead to an use-after free, however
the 'really_probe' takes care of setting
dev_set_drvdata(dev, NULL) in its failure path (which we would
exercise if the ->probe function failed), so we we
are OK. However lets be defensive as the code can change.
Going forward we should clean up the pci_set_drvdata(dev, NULL)
in the various code-base. That will be for another day.
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Arnd Bergmann [Fri, 18 Dec 2015 14:52:28 +0000 (15:52 +0100)]
hwmon: (sht15) Select CONFIG_BITREVERSE
If CONFIG_BITREVERSE is not built-in, the sht15 driver fails to link:
drivers/built-in.o: In function `sht15_crc8':
drivers/hwmon/sht15.c:195: undefined reference to `byte_rev_table'
This adds a Kconfig 'select' statement, like all other users of
bitrev.h have it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes:
33836ee98533 ("hwmon:change sht15_reverse()")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 23:13:27 +0000 (18:13 -0500)]
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
commit
f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.
Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).
Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.
Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!
When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Wed, 1 Apr 2015 14:49:47 +0000 (10:49 -0400)]
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).
This does not change the behavior of XEN_PCI_OP_disable_msi[|x].
The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.
However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.
This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.
Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>