Recent testing of netpoll with bonding produced this backtrace:
------------[ cut here ]------------
kernel BUG at drivers/net/bonding/bonding.h:134!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.2/usb7/devnum
CPU 0
Pid: 1876, comm: rmmod Not tainted 2.6.36-rc3+ #10 D26928/
RIP: 0010:[<ffffffffa0514ba4>]  [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0
RSP: 0018:ffff88003b1b5d58  EFLAGS: 00010296
RAX: ffff88003b9b6200 RBX: ffff8800373e8e00 RCX: 00000000000f4240
RDX: 00000000ffffffff RSI: 0000000000000286 RDI: 0000000000000286
RBP: ffff88003b1b5dc8 R08: 0000000000000000 R09: 00000001af7de920
R10: 0000000000000000 R11: ffff880002495e98 R12: ffff880037922700
R13: ffff880038c31000 R14: ffff880037922730 R15: 0000000000000286
FS:  00007f90e6d72700(0000) GS:ffff880002400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000346f0d9ad0 CR3: 000000003b263000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmmod (pid: 1876, threadinfo ffff88003b1b4000, task ffff88003b36aa80)
Stack:
 00000000ffffffff ffff88003b1b5d7a ffff8800379221e8 ffff880037922000
<0> ffff88003b1b5dc8 ffffffff813eb5fb ffff88003b1b5da8 0000000031b177a3
<0> ffff88003b1b5da8 ffff880037922000 ffff88003b1b5e48 ffff88003b1b5e48
Call Trace:
 [<ffffffff813eb5fb>] ? rtmsg_ifinfo+0xcb/0xf0
 [<ffffffff813daad8>] rollback_registered_many+0x168/0x280
 [<ffffffff813dac09>] unregister_netdevice_many+0x19/0x80
 [<ffffffff813e97b3>] __rtnl_kill_links+0x63/0x90
 [<ffffffff813e980b>] __rtnl_link_unregister+0x2b/0x60
 [<ffffffff813e9bde>] rtnl_link_unregister+0x1e/0x30
 [<ffffffffa052124b>] bonding_exit+0x37/0x51 [bonding]
 [<ffffffff81098b2e>] sys_delete_module+0x19e/0x270
 [<ffffffff810bb2b2>] ? audit_syscall_entry+0x252/0x280
 [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0 [bonding]
 RSP <ffff88003b1b5d58>
---[ end trace 1395ad691cea24d1 ]---

It occurs because of my recent netpoll blocking patches, which I added to avoid
recursive deadlock in the bonding driver. Those patches rely on some per-cpu
bits, but the shutdown path forces some rescheduling as we cancel workqueues for
the driver and wait for some device refcounts. If, after the forced reschedule,
we wind up on a different cpu, we trigger the BUG in unblock_netpoll_tx.
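
The failure mode can be illustrated with a small userspace model. This is only
a sketch under stated assumptions: blocked[], model_block(), model_unblock()
and model_scenario() are hypothetical names invented for illustration, not the
driver's code, and a false return stands in for the point where the kernel
would hit its BUG check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for the per-cpu "tx blocked" bits. */
static bool blocked[MODEL_NR_CPUS];

/* Set the bit for the cpu we are running on; false if it was already set. */
static bool model_block(int cpu)
{
	if (blocked[cpu])
		return false;
	blocked[cpu] = true;
	return true;
}

/* Clear the bit for the cpu we are running on.
 * Returns false where the real code would BUG: the bit is not set here. */
static bool model_unblock(int cpu)
{
	if (!blocked[cpu])
		return false;
	blocked[cpu] = false;
	return true;
}

/* Block on one cpu, then unblock on a (possibly different) cpu, as happens
 * when the task sleeps in between and is rescheduled elsewhere. */
static bool model_scenario(int block_cpu, int unblock_cpu)
{
	memset(blocked, 0, sizeof(blocked));	/* start from a clean state */
	assert(block_cpu >= 0 && block_cpu < MODEL_NR_CPUS);
	assert(unblock_cpu >= 0 && unblock_cpu < MODEL_NR_CPUS);

	if (!model_block(block_cpu))
		return false;
	return model_unblock(unblock_cpu);
}
```

In this model, model_scenario(0, 0) succeeds, while model_scenario(0, 1) fails
because the bit that was set on cpu 0 is looked up on cpu 1.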

The fix is to remove the netpoll block/unblock calls from bond_release_all.
This is safe because bond_uninit, which is called via ndo_uninit in
rollback_registered_many, does not run until after we send a NETDEV_UNREGISTER
event. That event triggers netconsole to remove us as a netpoll client, so we
are guaranteed not to recurse into our own tx path here.
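
The ordering argument can be sketched the same way. Again this is a
hypothetical userspace model, not kernel code: struct model_dev and the
model_* functions are invented names, and the only thing the sketch encodes is
the order described above (notification first, uninit second).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state that matters for the argument. */
struct model_dev {
	bool netpoll_client;	/* true while netconsole may transmit via us */
	bool uninit_saw_client;	/* what the uninit step observed */
};

/* Stand-in for the NETDEV_UNREGISTER event reaching netconsole,
 * which drops the device as a netpoll client. */
static void model_notify_unregister(struct model_dev *dev)
{
	dev->netpoll_client = false;
}

/* Stand-in for the ndo_uninit callback (bond_uninit in this message).
 * It records whether a netpoll client could still re-enter the tx path. */
static void model_uninit(struct model_dev *dev)
{
	dev->uninit_saw_client = dev->netpoll_client;
}

/* Ordering as described in the text: notify first, then uninit. */
static void model_rollback(struct model_dev *dev)
{
	model_notify_unregister(dev);
	model_uninit(dev);
}
```

With this ordering, the uninit step never observes an active netpoll client,
which is why the block/unblock pair is not needed there.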

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>