Btrfs: fix scrub race leading to use-after-free
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:
[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---
This is due to a race that can happen in a very small time window, as
illustrated by the following sequence diagram:
            CPU 1                                          CPU 2

                                                   btrfs_scrub_dev()
    scrub_bio_end_io_worker()
      scrub_pending_bio_dec()
        atomic_dec(&sctx->bios_in_flight)
                                                     wait sctx->bios_in_flight == 0
                                                     wait sctx->workers_pending == 0
                                                     mutex_lock(&fs_info->scrub_lock)
                                                     (...)
                                                     mutex_unlock(&fs_info->scrub_lock)
                                                     scrub_free_ctx(sctx)
                                                       kfree(sctx)
        wake_up(&sctx->list_wait)
          __wake_up()
            spin_lock_irqsave(&sctx->list_wait->lock, flags)
Another variation of this scenario that results in the same use-after-free
issue is:
            CPU 1                                          CPU 2

                                                   btrfs_scrub_dev()
                                                     wait sctx->bios_in_flight == 0
    scrub_bio_end_io_worker()
      scrub_pending_bio_dec()
        __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
            wake up task at CPU 2
                                                     wait sctx->workers_pending == 0
                                                     mutex_lock(&fs_info->scrub_lock)
                                                     (...)
                                                     mutex_unlock(&fs_info->scrub_lock)
                                                     scrub_free_ctx(sctx)
                                                       kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)
Fix this by holding the scrub lock while doing the wakeup.
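
The locking pattern can be sketched in userspace as follows. This is only
an illustrative model under stated assumptions, not the kernel change
itself: a pthread mutex stands in for fs_info->scrub_lock, a plain counter
stands in for the wait queue that wake_up() touches, and the names
model_ctx, model_pending_bio_dec and run_scrub_model are hypothetical. The
point it shows is that the side that frees the context takes the lock
first, so it cannot free while the other side is still inside its
decrement-and-wakeup section.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for fs_info->scrub_lock. */
static pthread_mutex_t scrub_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for the scrub context (sctx). */
struct model_ctx {
    atomic_int bios_in_flight;
    int wakeups; /* stands in for the wait queue touched by wake_up() */
};

/* Completion side, modelled on scrub_pending_bio_dec(). */
static void *model_pending_bio_dec(void *arg)
{
    struct model_ctx *ctx = arg;

    pthread_mutex_lock(&scrub_lock);
    atomic_fetch_sub(&ctx->bios_in_flight, 1);
    /* The waiter may already observe 0 here, but it cannot free ctx yet. */
    ctx->wakeups++; /* models wake_up(&sctx->list_wait) */
    pthread_mutex_unlock(&scrub_lock);
    /* ctx must not be touched past this point. */
    return NULL;
}

/* Owner side, modelled on the tail of btrfs_scrub_dev(). */
static int run_scrub_model(void)
{
    struct model_ctx *ctx = malloc(sizeof(*ctx));
    pthread_t worker;
    int wakeups;

    if (!ctx)
        return -1;
    atomic_init(&ctx->bios_in_flight, 1);
    ctx->wakeups = 0;
    if (pthread_create(&worker, NULL, model_pending_bio_dec, ctx) != 0) {
        free(ctx);
        return -1;
    }

    /* Lockless wait, like the "wait sctx->bios_in_flight == 0" step. */
    while (atomic_load(&ctx->bios_in_flight) != 0)
        sched_yield();

    /* Taking the lock blocks until the worker has finished its wakeup. */
    pthread_mutex_lock(&scrub_lock);
    assert(atomic_load(&ctx->bios_in_flight) == 0);
    wakeups = ctx->wakeups;
    pthread_mutex_unlock(&scrub_lock);

    free(ctx); /* safe: the worker no longer references ctx */
    pthread_join(worker, NULL);
    return wakeups; /* number of wakeups completed before the free */
}
```

Without the lock around the decrement and wakeup in the worker, the owner
could pass both of its steps and free the context while the worker is still
about to touch it, which is the window shown in the diagrams above.
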
This isn't a recent regression; the issue has been around since the scrub
feature was added (2011, commit a2de733c78fa7af51ba9670482fa7d392aa67c57).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>