Jan Kara [Sun, 17 Aug 2014 09:49:57 +0000 (11:49 +0200)]
isofs: Fix unbounded recursion when processing relocated directories
commit
410dd3cf4c9b36f27ed4542ee18b1af5e68645a4 upstream.
We did not check relocated directory in any way when processing Rock
Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL
entry pointing to another CL entry leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).
Fix the problem by not allowing CL entry to point to a directory entry
with CL entry (such use makes no good sense anyway) and by checking
whether CL entry doesn't point to itself.
Reported-by: Chris Evans <cevans@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jiri Kosina [Thu, 21 Aug 2014 14:57:48 +0000 (09:57 -0500)]
HID: fix a couple of off-by-ones
commit
4ab25786c87eb20857bbb715c3ae34ec8fd6a214 upstream.
There are a few very theoretical off-by-one bugs in report descriptor size
checking when performing a pre-parsing fixup. Fix those.
Reported-by: Ben Hawkes <hawkes@google.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jiri Kosina [Thu, 21 Aug 2014 14:57:17 +0000 (09:57 -0500)]
HID: logitech: perform bounds checking on device_id early enough
commit
ad3e14d7c5268c2e24477c6ef54bbdf88add5d36 upstream.
device_index is a char type and the size of paired_dj_deivces is 7
elements, therefore proper bounds checking has to be applied to
device_index before it is used.
We are currently performing the bounds checking in
logi_dj_recv_add_djhid_device(), which is too late, as malicious device
could send REPORT_TYPE_NOTIF_DEVICE_UNPAIRED early enough and trigger the
problem in one of the report forwarding functions called from
logi_dj_raw_event().
Fix this by performing the check at the earliest possible ocasion in
logi_dj_raw_event().
Reported-by: Ben Hawkes <hawkes@google.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dave Chiluk [Tue, 24 Jun 2014 15:11:26 +0000 (10:11 -0500)]
stable_kernel_rules: Add pointer to netdev-FAQ for network patches
commit
b76fc285337b6b256e9ba20a40cfd043f70c27af upstream.
Stable_kernel_rules should point submitters of network stable patches to the
netdev_FAQ.txt as requests for stable network patches should go to netdev
first.
Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Thu, 14 Aug 2014 01:24:29 +0000 (09:24 +0800)]
Linux 3.10.53
Andrey Utkin [Mon, 4 Aug 2014 20:47:41 +0000 (23:47 +0300)]
arch/sparc/math-emu/math_32.c: drop stray break operator
[ Upstream commit
093758e3daede29cb4ce6aedb111becf9d4bfc57 ]
This commit is a guesswork, but it seems to make sense to drop this
break, as otherwise the following line is never executed and becomes
dead code. And that following line actually saves the result of
local calculation by the pointer given in function argument. So the
proposed change makes sense if this code in the whole makes sense (but I
am unable to analyze it in the whole).
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sowmini Varadhan [Fri, 1 Aug 2014 13:50:40 +0000 (09:50 -0400)]
sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
[ Upstream commit
4ec1b01029b4facb651b8ef70bc20a4be4cebc63 ]
The LDC handshake could have been asynchronously triggered
after ldc_bind() enables the ldc_rx() receive interrupt-handler
(and thus intercepts incoming control packets)
and before vio_port_up() calls ldc_connect(). If that is the case,
ldc_connect() should return 0 and let the state-machine
progress.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Karl Volz <karl.volz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christopher Alexander Tobias Schulze [Sun, 3 Aug 2014 14:01:53 +0000 (16:01 +0200)]
sunsab: Fix detection of BREAK on sunsab serial console
[ Upstream commit
fe418231b195c205701c0cc550a03f6c9758fd9e ]
Fix detection of BREAK on sunsab serial console: BREAK detection was only
performed when there were also serial characters received simultaneously.
To handle all BREAKs correctly, the check for BREAK and the corresponding
call to uart_handle_break() must also be done if count == 0, therefore
duplicate this code fragment and pull it out of the loop over the received
characters.
Patch applies to 3.16-rc6.
Signed-off-by: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christopher Alexander Tobias Schulze [Sun, 3 Aug 2014 13:44:52 +0000 (15:44 +0200)]
bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
[ Upstream commit
5cdceab3d5e02eb69ea0f5d8fa9181800baf6f77 ]
Fix regression in bbc i2c temperature and fan control on some Sun systems
that causes the driver to refuse to load due to the bbc_i2c_bussel resource not
being present on the (second) i2c bus where the temperature sensors and fan
control are located. (The check for the number of resources was removed when
the driver was ported to a pure OF driver in mid 2008.)
Signed-off-by: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Tue, 5 Aug 2014 03:07:37 +0000 (20:07 -0700)]
sparc64: Guard against flushing openfirmware mappings.
[ Upstream commit
4ca9a23765da3260058db3431faf5b4efd8cf926 ]
Based almost entirely upon a patch by Christopher Alexander Tobias
Schulze.
In commit
db64fe02258f1507e13fe5212a989922323685ce ("mm: rewrite vmap
layer") lazy VMAP tlb flushing was added to the vmalloc layer. This
causes problems on sparc64.
Sparc64 has two VMAP mapped regions and they are not contiguous with
eachother. First we have the malloc mapping area, then another
unrelated region, then the vmalloc region.
This "another unrelated region" is where the firmware is mapped.
If the lazy TLB flushing logic in the vmalloc code triggers after
we've had both a module unload and a vfree or similar, it will pass an
address range that goes from somewhere inside the malloc region to
somewhere inside the vmalloc region, and thus covering the
openfirmware area entirely.
The sparc64 kernel learns about openfirmware's dynamic mappings in
this region early in the boot, and then services TLB misses in this
area. But openfirmware has some locked TLB entries which are not
mentioned in those dynamic mappings and we should thus not disturb
them.
These huge lazy TLB flush ranges causes those openfirmware locked TLB
entries to be removed, resulting in all kinds of problems including
hard hangs and crashes during reboot/reset.
Besides causing problems like this, such huge TLB flush ranges are
also incredibly inefficient. A plea has been made with the author of
the VMAP lazy TLB flushing code, but for now we'll put a safety guard
into our flush_tlb_kernel_range() implementation.
Since the implementation has become non-trivial, stop defining it as a
macro and instead make it a function in a C source file.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Mon, 4 Aug 2014 23:34:01 +0000 (16:34 -0700)]
sparc64: Do not insert non-valid PTEs into the TSB hash table.
[ Upstream commit
18f38132528c3e603c66ea464727b29e9bbcb91b ]
The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
only be called when the PTE being installed will be accessible by the user.
This is not true for code paths originating from remove_migration_pte().
There are dire consequences for placing a non-valid PTE into the TSB. The TLB
miss frramework assumes thatwhen a TSB entry matches we can just load it into
the TLB and return from the TLB miss trap.
So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
and over, never satisfying the miss.
Just exit early from update_mmu_cache() and friends in this situation.
Based upon a report and patch from Christopher Alexander Tobias Schulze.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Sat, 17 May 2014 18:28:05 +0000 (11:28 -0700)]
sparc64: Add membar to Niagara2 memcpy code.
[ Upstream commit
5aa4ecfd0ddb1e6dcd1c886e6c49677550f581aa ]
This is the prevent previous stores from overlapping the block stores
done by the memcpy loop.
Based upon a glibc patch by Jose E. Marchesi
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Wed, 7 May 2014 21:07:32 +0000 (14:07 -0700)]
sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
[ Upstream commit
b18eb2d779240631a098626cb6841ee2dd34fda0 ]
Access to the TSB hash tables during TLB misses requires that there be
an atomic 128-bit quad load available so that we fetch a matching TAG
and DATA field at the same time.
On cpus prior to UltraSPARC-III only virtual address based quad loads
are available. UltraSPARC-III and later provide physical address
based variants which are easier to use.
When we only have virtual address based quad loads available this
means that we have to lock the TSB into the TLB at a fixed virtual
address on each cpu when it runs that process. We can't just access
the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
take a recursive TLB miss inside of the TLB miss handler without
risking running out of hardware trap levels (some trap combinations
can be deep, such as those generated by register window spill and fill
traps).
Without huge pages it's working perfectly fine, but when the huge TSB
got added another chunk of fixed virtual address space was not
allocated for this second TSB mapping.
So we were mapping both the 8K and 4MB TSBs to the same exact virtual
address, causing multiple TLB matches which gives undefined behavior.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Wed, 7 May 2014 04:27:37 +0000 (21:27 -0700)]
sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.
[ Upstream commit
e5c460f46ae7ee94831cb55cb980f942aa9e5a85 ]
This was found using Dave Jone's trinity tool.
When a user process which is 32-bit performs a load or a store, the
cpu chops off the top 32-bits of the effective address before
translating it.
This is because we run 32-bit tasks with the PSTATE_AM (address
masking) bit set.
We can't run the kernel with that bit set, so when the kernel accesses
userspace no address masking occurs.
Since a 32-bit process will have no mappings in that region we will
properly fault, so we don't try to handle this using access_ok(),
which can safely just be a NOP on sparc64.
Real faults from 32-bit processes should never generate such addresses
so a bug check was added long ago, and it barks in the logs if this
happens.
But it also barks when a kernel user access causes this condition, and
that _can_ happen. For example, if a pointer passed into a system call
is "0xfffffffc" and the kernel access 4 bytes offset from that pointer.
Just handle such faults normally via the exception entries.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Tue, 29 Apr 2014 06:52:11 +0000 (23:52 -0700)]
sparc64: Fix top-level fault handling bugs.
[ Upstream commit
70ffc6ebaead783ac8dafb1e87df0039bb043596 ]
Make get_user_insn() able to cope with huge PMDs.
Next, make do_fault_siginfo() more robust when get_user_insn() can't
actually fetch the instruction. In particular, use the MMU announced
fault address when that happens, instead of calling
compute_effective_address() and computing garbage.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Tue, 29 Apr 2014 06:50:08 +0000 (23:50 -0700)]
sparc64: Handle 32-bit tasks properly in compute_effective_address().
[ Upstream commit
d037d16372bbe4d580342bebbb8826821ad9edf0 ]
If we have a 32-bit task we must chop off the top 32-bits of the
64-bit value just as the cpu would.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kirill Tkhai [Wed, 16 Apr 2014 20:45:24 +0000 (00:45 +0400)]
sparc64: Make itc_sync_lock raw
[ Upstream commit
49b6c01f4c1de3b5e5427ac5aba80f9f6d27837a ]
One more place where we must not be able
to be preempted or to be interrupted in RT.
Always actually disable interrupts during
synchronization cycle.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David S. Miller [Thu, 1 May 2014 02:37:48 +0000 (19:37 -0700)]
sparc64: Fix argument sign extension for compat_sys_futex().
[ Upstream commit
aa3449ee9c87d9b7660dd1493248abcc57769e31 ]
Only the second argument, 'op', is signed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Tue, 5 Aug 2014 14:49:52 +0000 (16:49 +0200)]
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
[ Upstream commit
757efd32d5ce31f67193cc0e6a56e4dffcc42fb1 ]
Dave reported following splat, caused by improper use of
IP_INC_STATS_BH() in process context.
BUG: using __this_cpu_add() in preemptible [
00000000] code: trinity-c117/14551
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
Call Trace:
[<
ffffffff9e7ee207>] dump_stack+0x4e/0x7a
[<
ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
[<
ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
[<
ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
[<
ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
[<
ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
[<
ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[<
ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
[<
ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
[<
ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
[<
ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
[<
ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
[<
ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
[<
ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
[<
ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
[<
ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
[<
ffffffff9e73b624>] inet_sendmsg+0x104/0x220
[<
ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
[<
ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
[<
ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
[<
ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
[<
ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
[<
ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
[<
ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
[<
ffffffff9e7f9be4>] tracesys+0xdd/0xe2
This is a followup of commits
f1d8cba61c3c4b ("inet: fix possible
seqlock deadlocks") and
7f88c6b23afbd315 ("ipv6: fix possible seqlock
deadlock in ip6_finish_output2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Fri, 1 Aug 2014 03:00:35 +0000 (23:00 -0400)]
iovec: make sure the caller actually wants anything in memcpy_fromiovecend
[ Upstream commit
06ebb06d49486676272a3c030bfeef4bd969a8e6 ]
Check for cases when the caller requests 0 bytes instead of running off
and dereferencing potentially invalid iovecs.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vlad Yasevich [Thu, 31 Jul 2014 14:33:06 +0000 (10:33 -0400)]
net: Correctly set segment mac_len in skb_segment().
[ Upstream commit
fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ]
When performing segmentation, the mac_len value is copied right
out of the original skb. However, this value is not always set correctly
(like when the packet is VLAN-tagged) and we'll end up copying a bad
value.
One way to demonstrate this is to configure a VM which tags
packets internally and turn off VLAN acceleration on the forwarding
bridge port. The packets show up corrupt like this:
16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
(0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
0x0000: 8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
0x0010: 0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
0x0020: 29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3
0x0030: 000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
0x0040: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
0x0050: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
0x0060: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
...
This also leads to awful throughput as GSO packets are dropped and
cause retransmissions.
The solution is to set the mac_len using the values already available
in then new skb. We've already adjusted all of the header offset, so we
might as well correctly figure out the mac_len using skb_reset_mac_len().
After this change, packets are segmented correctly and performance
is restored.
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vlad Yasevich [Thu, 31 Jul 2014 14:30:25 +0000 (10:30 -0400)]
macvlan: Initialize vlan_features to turn on offload support.
[ Upstream commit
081e83a78db9b0ae1f5eabc2dedecc865f509b98 ]
Macvlan devices do not initialize vlan_features. As a result,
any vlan devices configured on top of macvlans perform very poorly.
Initialize vlan_features based on the vlan features of the lower-level
device.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Borkmann [Tue, 22 Jul 2014 13:22:45 +0000 (15:22 +0200)]
net: sctp: inherit auth_capable on INIT collisions
[ Upstream commit
1be9a950c646c9092fb3618197f7b6bfb50e82aa ]
Jason reported an oops caused by SCTP on his ARM machine with
SCTP authentication enabled:
Internal error: Oops: 17 [#1] ARM
CPU: 0 PID: 104 Comm: sctp-test Not tainted
3.13.0-68744-g3632f30c9b20-dirty #1
task:
c6eefa40 ti:
c6f52000 task.ti:
c6f52000
PC is at sctp_auth_calculate_hmac+0xc4/0x10c
LR is at sg_init_table+0x20/0x38
pc : [<
c024bb80>] lr : [<
c00f32dc>] psr:
40000013
sp :
c6f538e8 ip :
00000000 fp :
c6f53924
r10:
c6f50d80 r9 :
00000000 r8 :
00010000
r7 :
00000000 r6 :
c7be4000 r5 :
00000000 r4 :
c6f56254
r3 :
c00c8170 r2 :
00000001 r1 :
00000008 r0 :
c6f1e660
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control:
0005397f Table:
06f28000 DAC:
00000015
Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
Stack: (0xc6f538e8 to 0xc6f54000)
[...]
Backtrace:
[<
c024babc>] (sctp_auth_calculate_hmac+0x0/0x10c) from [<
c0249af8>] (sctp_packet_transmit+0x33c/0x5c8)
[<
c02497bc>] (sctp_packet_transmit+0x0/0x5c8) from [<
c023e96c>] (sctp_outq_flush+0x7fc/0x844)
[<
c023e170>] (sctp_outq_flush+0x0/0x844) from [<
c023ef78>] (sctp_outq_uncork+0x24/0x28)
[<
c023ef54>] (sctp_outq_uncork+0x0/0x28) from [<
c0234364>] (sctp_side_effects+0x1134/0x1220)
[<
c0233230>] (sctp_side_effects+0x0/0x1220) from [<
c02330b0>] (sctp_do_sm+0xac/0xd4)
[<
c0233004>] (sctp_do_sm+0x0/0xd4) from [<
c023675c>] (sctp_assoc_bh_rcv+0x118/0x160)
[<
c0236644>] (sctp_assoc_bh_rcv+0x0/0x160) from [<
c023d5bc>] (sctp_inq_push+0x6c/0x74)
[<
c023d550>] (sctp_inq_push+0x0/0x74) from [<
c024a6b0>] (sctp_rcv+0x7d8/0x888)
While we already had various kind of bugs in that area
ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
we/peer is AUTH capable") and
b14878ccb7fa ("net: sctp: cache
auth_enable per endpoint"), this one is a bit of a different
kind.
Giving a bit more background on why SCTP authentication is
needed can be found in RFC4895:
SCTP uses 32-bit verification tags to protect itself against
blind attackers. These values are not changed during the
lifetime of an SCTP association.
Looking at new SCTP extensions, there is the need to have a
method of proving that an SCTP chunk(s) was really sent by
the original peer that started the association and not by a
malicious attacker.
To cause this bug, we're triggering an INIT collision between
peers; normal SCTP handshake where both sides intent to
authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
parameters that are being negotiated among peers:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
-------------------- COOKIE-ECHO -------------------->
<-------------------- COOKIE-ACK ---------------------
RFC4895 says that each endpoint therefore knows its own random
number and the peer's random number *after* the association
has been established. The local and peer's random number along
with the shared key are then part of the secret used for
calculating the HMAC in the AUTH chunk.
Now, in our scenario, we have 2 threads with 1 non-blocking
SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY
and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling
sctp_bindx(3), listen(2) and connect(2) against each other,
thus the handshake looks similar to this, e.g.:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
<--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] -----------
-------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] -------->
...
Since such collisions can also happen with verification tags,
the RFC4895 for AUTH rather vaguely says under section 6.1:
In case of INIT collision, the rules governing the handling
of this Random Number follow the same pattern as those for
the Verification Tag, as explained in Section 5.2.4 of
RFC 2960 [5]. Therefore, each endpoint knows its own Random
Number and the peer's Random Number after the association
has been established.
In RFC2960, section 5.2.4, we're eventually hitting Action B:
B) In this case, both sides may be attempting to start an
association at about the same time but the peer endpoint
started its INIT after responding to the local endpoint's
INIT. Thus it may have picked a new Verification Tag not
being aware of the previous Tag it had sent this endpoint.
The endpoint should stay in or enter the ESTABLISHED
state but it MUST update its peer's Verification Tag from
the State Cookie, stop any init or cookie timers that may
running and send a COOKIE ACK.
In other words, the handling of the Random parameter is the
same as behavior for the Verification Tag as described in
Action B of section 5.2.4.
Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
side effect interpreter, and in fact it properly copies over
peer_{random, hmacs, chunks} parameters from the newly created
association to update the existing one.
Also, the old asoc_shared_key is being released and based on
the new params, sctp_auth_asoc_init_active_key() updated.
However, the issue observed in this case is that the previous
asoc->peer.auth_capable was 0, and has *not* been updated, so
that instead of creating a new secret, we're doing an early
return from the function sctp_auth_asoc_init_active_key()
leaving asoc->asoc_shared_key as NULL. However, we now have to
authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).
That in fact causes the server side when responding with ...
<------------------ AUTH; COOKIE-ACK -----------------
... to trigger a NULL pointer dereference, since in
sctp_packet_transmit(), it discovers that an AUTH chunk is
being queued for xmit, and thus it calls sctp_auth_calculate_hmac().
Since the asoc->active_key_id is still inherited from the
endpoint, and the same as encoded into the chunk, it uses
asoc->asoc_shared_key, which is still NULL, as an asoc_key
and dereferences it in ...
crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)
... causing an oops. All this happens because sctp_make_cookie_ack()
called with the *new* association has the peer.auth_capable=1
and therefore marks the chunk with auth=1 after checking
sctp_auth_send_cid(), but it is *actually* sent later on over
the then *updated* association's transport that didn't initialize
its shared key due to peer.auth_capable=0. Since control chunks
in that case are not sent by the temporary association which
are scheduled for deletion, they are issued for xmit via
SCTP_CMD_REPLY in the interpreter with the context of the
*updated* association. peer.auth_capable was 0 in the updated
association (which went from COOKIE_WAIT into ESTABLISHED state),
since all previous processing that performed sctp_process_init()
was being done on temporary associations, that we eventually
throw away each time.
The correct fix is to update to the new peer.auth_capable
value as well in the collision case via sctp_assoc_update(),
so that in case the collision migrated from 0 -> 1,
sctp_auth_asoc_init_active_key() can properly recalculate
the secret. This therefore fixes the observed server panic.
Fixes:
730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing")
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Paasch [Tue, 29 Jul 2014 11:40:57 +0000 (13:40 +0200)]
tcp: Fix integer-overflow in TCP vegas
[ Upstream commit
1f74e613ded11517db90b2bd57e9464d9e0fb161 ]
In vegas we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.
Then, we need to do do_div to allow this to be used on 32-bit arches.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Doug Leith <doug.leith@nuim.ie>
Fixes:
8d3a564da34e (tcp: tcp_vegas cong avoid fix)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Paasch [Tue, 29 Jul 2014 10:07:27 +0000 (12:07 +0200)]
tcp: Fix integer-overflows in TCP veno
[ Upstream commit
45a07695bc64b3ab5d6d2215f9677e5b8c05a7d0 ]
In veno we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.
A first attempt at fixing
76f1017757aa0 ([TCP]: TCP Veno congestion
control) was made by
159131149c2 (tcp: Overflow bug in Vegas), but it
failed to add the required cast in tcp_veno_cong_avoid().
Fixes:
76f1017757aa0 ([TCP]: TCP Veno congestion control)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andrey Ryabinin [Sat, 26 Jul 2014 17:26:58 +0000 (21:26 +0400)]
net: sendmsg: fix NULL pointer dereference
[ Upstream commit
40eea803c6b2cfaab092f053248cbeab3f368412 ]
Sasha's report:
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel with the KASAN patchset, I've stumbled on the following spew:
>
> [ 4448.949424] ==================================================================
> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
> [ 4448.952988] Read of size 2 by thread T19638:
> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted
3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
> [ 4448.956823]
ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
> [ 4448.958233]
ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
> [ 4448.959552]
0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
> [ 4448.961266] Call Trace:
> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
> [ 4448.970103] sock_sendmsg (net/socket.c:654)
> [ 4448.971584] ? might_fault (mm/memory.c:3741)
> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
> [ 4448.988929] ==================================================================
This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
This bug was introduced in
f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address."
But in fact this affects sendto when address 0 is mapped and contains
socket address structure in it. In such case copy-in address will succeed,
verify_iovec() function will successfully exit with msg->msg_namelen > 0
and msg->msg_name == NULL.
This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Sat, 26 Jul 2014 06:58:10 +0000 (08:58 +0200)]
ip: make IP identifiers less predictable
[ Upstream commit
04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ]
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.
With commit
73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.
This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.
Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.
This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.
We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.
For IPv6, adds saddr into the hash as well, but not nexthdr.
If I ping the patched target, we can see ID are now hard to predict.
21:57:11.008086 IP (...)
A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
target > A: ICMP echo reply, seq 1, length 64
21:57:12.013133 IP (...)
A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
target > A: ICMP echo reply, seq 2, length 64
21:57:13.016580 IP (...)
A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
target > A: ICMP echo reply, seq 3, length 64
[1] TCP sessions uses a per flow ID generator not changed by this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Mon, 2 Jun 2014 12:26:03 +0000 (05:26 -0700)]
inetpeer: get rid of ip_id_count
[ Upstream commit
73f156a6e8c1074ac6327e0abd1169e95eb66463 ]
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dmitry Kravkov [Thu, 24 Jul 2014 15:54:47 +0000 (18:54 +0300)]
bnx2x: fix crash during TSO tunneling
[ Upstream commit
fe26566d8a05151ba1dce75081f6270f73ec4ae1 ]
When TSO packet is transmitted additional BD w/o mapping is used
to describe the packed. The BD needs special handling in tx
completion.
kernel: Call Trace:
kernel: <IRQ> [<
ffffffff815e19ba>] dump_stack+0x19/0x1b
kernel: [<
ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
kernel: [<
ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
kernel: [<
ffffffff814a8c0d>] ? find_iova+0x4d/0x90
kernel: [<
ffffffff814ab0e2>] intel_unmap_page.part.36+0x142/0x160
kernel: [<
ffffffff814ad0e6>] intel_unmap_page+0x26/0x30
kernel: [<
ffffffffa01f55d7>] bnx2x_free_tx_pkt+0x157/0x2b0 [bnx2x]
kernel: [<
ffffffffa01f8dac>] bnx2x_tx_int+0xac/0x220 [bnx2x]
kernel: [<
ffffffff8101a0d9>] ? read_tsc+0x9/0x20
kernel: [<
ffffffffa01f8fdb>] bnx2x_poll+0xbb/0x3c0 [bnx2x]
kernel: [<
ffffffff814d041a>] net_rx_action+0x15a/0x250
kernel: [<
ffffffff81067047>] __do_softirq+0xf7/0x290
kernel: [<
ffffffff815f3a5c>] call_softirq+0x1c/0x30
kernel: [<
ffffffff81014d25>] do_softirq+0x55/0x90
kernel: [<
ffffffff810673e5>] irq_exit+0x115/0x120
kernel: [<
ffffffff815f4358>] do_IRQ+0x58/0xf0
kernel: [<
ffffffff815e94ad>] common_interrupt+0x6d/0x6d
kernel: <EOI> [<
ffffffff810bbff7>] ? clockevents_notify+0x127/0x140
kernel: [<
ffffffff814834df>] ? cpuidle_enter_state+0x4f/0xc0
kernel: [<
ffffffff81483615>] cpuidle_idle_call+0xc5/0x200
kernel: [<
ffffffff8101bc7e>] arch_cpu_idle+0xe/0x30
kernel: [<
ffffffff810b4725>] cpu_startup_entry+0xf5/0x290
kernel: [<
ffffffff815cfee1>] start_secondary+0x265/0x27b
kernel: ---[ end trace
11aa7726f18d7e80 ]---
Fixes:
a848ade408b ("bnx2x: add CSUM and TSO support for encapsulation protocols")
Reported-by: Yulong Pei <ypei@redhat.com>
Cc: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Thu, 7 Aug 2014 21:42:40 +0000 (14:42 -0700)]
Linux 3.10.52
Boris Ostrovsky [Wed, 9 Jul 2014 17:18:18 +0000 (13:18 -0400)]
x86/espfix/xen: Fix allocation of pages for paravirt page tables
commit
8762e5092828c4dc0f49da5a47a644c670df77f3 upstream.
init_espfix_ap() is currently off by one level when informing hypervisor
that allocated pages will be used for ministacks' page tables.
The most immediate effect of this on a PV guest is that if
'stack_page = __get_free_page()' returns a non-zeroed-out page the hypervisor
will refuse to use it for a page table (which it shouldn't be anyway). This will
result in warnings by both Xen and Linux.
More importantly, a subsequent write to that page (again, by a PV guest) is
likely to result in fatal page fault.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1404926298-5565-1-git-send-email-boris.ostrovsky@oracle.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Minfei Huang [Wed, 4 Jun 2014 23:11:53 +0000 (16:11 -0700)]
lib/btree.c: fix leak of whole btree nodes
commit
c75b53af2f0043aff500af0a6f878497bef41bca upstream.
I use btree from 3.14-rc2 in my own module. When the btree module is
removed, a warning arises:
kmem_cache_destroy btree_node: Slab cache still has objects
CPU: 13 PID: 9150 Comm: rmmod Tainted: GF O 3.14.0-rc2 #1
Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013
Call Trace:
dump_stack+0x49/0x5d
kmem_cache_destroy+0xcf/0xe0
btree_module_exit+0x10/0x12 [btree]
SyS_delete_module+0x198/0x1f0
system_call_fastpath+0x16/0x1b
The cause is that it doesn't release the last btree node, when height = 1
and fill = 1.
[akpm@linux-foundation.org: remove unneeded test of NULL]
Signed-off-by: Minfei Huang <huangminfei@ucloud.cn>
Cc: Joern Engel <joern@logfs.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Tue, 15 Jul 2014 00:02:31 +0000 (17:02 -0700)]
net/l2tp: don't fall back on UDP [get|set]sockopt
commit
3cf521f7dc87c031617fd47e4b7aa2593c2f3daf upstream.
The l2tp [get|set]sockopt() code has fallen back to the UDP functions
for socket option levels != SOL_PPPOL2TP since day one, but that has
never actually worked, since the l2tp socket isn't an inet socket.
As David Miller points out:
"If we wanted this to work, it'd have to look up the tunnel and then
use tunnel->sk, but I wonder how useful that would be"
Since this can never have worked so nobody could possibly have depended
on that functionality, just remove the broken code and return -EINVAL.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: James Chapman <jchapman@katalix.com>
Acked-by: David Miller <davem@davemloft.net>
Cc: Phil Turnbull <phil.turnbull@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
willy tarreau [Thu, 16 Jan 2014 07:20:11 +0000 (08:20 +0100)]
net: mvneta: replace Tx timer with a real interrupt
commit
71f6d1b31fb1f278a345a30a2180515adc7d80ae upstream.
Right now the mvneta driver doesn't handle Tx IRQ, and relies on two
mechanisms to flush Tx descriptors : a flush at the end of mvneta_tx()
and a timer. If a burst of packets is emitted faster than the device
can send them, then the queue is stopped until next wake-up of the
timer 10ms later. This causes jerky output traffic with bursts and
pauses, making it difficult to reach line rate with very few streams.
A test on UDP traffic shows that it's not possible to go beyond 134
Mbps / 12 kpps of outgoing traffic with 1500-bytes IP packets. Routed
traffic tends to observe pauses as well if the traffic is bursty,
making it even burstier after the wake-up.
It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.
Thus, this patch enables Tx interrupts and removes the timer. It does
the two at once because it's not really possible to make the two
mechanisms coexist, so a split patch doesn't make sense.
First tests performed on a Mirabox (Armada 370) show that less CPU
seems to be used when sending traffic. One reason might be that we now
call the mvneta_tx_done_gbe() with a mask indicating which queues have
been done instead of looping over all of them.
The same UDP test above now happily reaches 987 Mbps / 87.7 kpps.
Single-stream TCP traffic can now more easily reach line rate. HTTP
transfers of 1 MB objects over a single connection went from 730 to
840 Mbps. It is even possible to go significantly higher (>900 Mbps)
by tweaking tcp_tso_win_divisor.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Arnaud Ebalard <arno@natisbad.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
willy tarreau [Thu, 16 Jan 2014 07:20:10 +0000 (08:20 +0100)]
net: mvneta: add missing bit descriptions for interrupt masks and causes
commit
40ba35e74fa56866918d2f3bc0528b5b92725d5e upstream.
Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :
http://www.plugcomputer.org/downloads/mirabox/
So let's put them back in the driver in order to reduce the burden of
current and future maintenance.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
willy tarreau [Thu, 16 Jan 2014 07:20:09 +0000 (08:20 +0100)]
net: mvneta: do not schedule in mvneta_tx_timeout
commit
290213667ab53a95456397763205e4b1e30f46b5 upstream.
If a queue timeout is reported, we can oops because of some
schedules while the caller is atomic, as shown below :
mvneta
d0070000.ethernet eth0: tx timeout
BUG: scheduling while atomic: bash/1528/0x00000100
Modules linked in: slhttp_ethdiv(C) [last unloaded: slhttp_ethdiv]
CPU: 2 PID: 1528 Comm: bash Tainted: G WC 3.13.0-rc4-mvebu-nf #180
[<
c0011bd9>] (unwind_backtrace+0x1/0x98) from [<
c000f1ab>] (show_stack+0xb/0xc)
[<
c000f1ab>] (show_stack+0xb/0xc) from [<
c02ad323>] (dump_stack+0x4f/0x64)
[<
c02ad323>] (dump_stack+0x4f/0x64) from [<
c02abe67>] (__schedule_bug+0x37/0x4c)
[<
c02abe67>] (__schedule_bug+0x37/0x4c) from [<
c02ae261>] (__schedule+0x325/0x3ec)
[<
c02ae261>] (__schedule+0x325/0x3ec) from [<
c02adb97>] (schedule_timeout+0xb7/0x118)
[<
c02adb97>] (schedule_timeout+0xb7/0x118) from [<
c0020a67>] (msleep+0xf/0x14)
[<
c0020a67>] (msleep+0xf/0x14) from [<
c01dcbe5>] (mvneta_stop_dev+0x21/0x194)
[<
c01dcbe5>] (mvneta_stop_dev+0x21/0x194) from [<
c01dcfe9>] (mvneta_tx_timeout+0x19/0x24)
[<
c01dcfe9>] (mvneta_tx_timeout+0x19/0x24) from [<
c024afc7>] (dev_watchdog+0x18b/0x1c4)
[<
c024afc7>] (dev_watchdog+0x18b/0x1c4) from [<
c0020b53>] (call_timer_fn.isra.27+0x17/0x5c)
[<
c0020b53>] (call_timer_fn.isra.27+0x17/0x5c) from [<
c0020cad>] (run_timer_softirq+0x115/0x170)
[<
c0020cad>] (run_timer_softirq+0x115/0x170) from [<
c001ccb9>] (__do_softirq+0xbd/0x1a8)
[<
c001ccb9>] (__do_softirq+0xbd/0x1a8) from [<
c001cfad>] (irq_exit+0x61/0x98)
[<
c001cfad>] (irq_exit+0x61/0x98) from [<
c000d4bf>] (handle_IRQ+0x27/0x60)
[<
c000d4bf>] (handle_IRQ+0x27/0x60) from [<
c000843b>] (armada_370_xp_handle_irq+0x33/0xc8)
[<
c000843b>] (armada_370_xp_handle_irq+0x33/0xc8) from [<
c000fba9>] (__irq_usr+0x49/0x60)
Ben Hutchings attempted to propose a better fix consisting in using a
scheduled work for this, but while it fixed this panic, it caused other
random freezes and panics proving that the reset sequence in the driver
is unreliable and that additional fixes should be investigated.
When sending multiple streams over a link limited to 100 Mbps, Tx timeouts
happen from time to time, and the driver correctly recovers only when the
function is disabled.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
willy tarreau [Thu, 16 Jan 2014 07:20:08 +0000 (08:20 +0100)]
net: mvneta: use per_cpu stats to fix an SMP lock up
commit
74c41b048db1073a04827d7f39e95ac1935524cc upstream.
Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
when they update the stats, and as a result, it randomly happens that
the stats freeze on SMP if two updates happen during stats retrieval.
This is very easily reproducible by starting two HTTP servers and binding
each of them to a different CPU, then consulting /proc/net/dev in loops
during transfers, the interface should immediately lock up. This issue
also randomly happens upon link state changes during transfers, because
the stats are collected in this situation, but it takes more attempts to
reproduce it.
The comments in netdevice.h suggest using per_cpu stats instead to get
rid of this issue.
This patch implements this. It merges both rx_stats and tx_stats into
a single "stats" member with a single syncp. Both mvneta_rx() and
mvneta_rx() now only update the a single CPU's counters.
In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
to get their respective stats.
With this change, stats are still correct and no more lockup is encountered.
Note that this bug was present since the first import of the mvneta
driver. It might make sense to backport it to some stable trees. If
so, it depends on "
d33dc73 net: mvneta: increase the 64-bit rx/tx stats
out of the hot path".
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
[wt: port to 3.10 : u64_stats_init() does not exist in 3.10 and is not needed]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
willy tarreau [Thu, 16 Jan 2014 07:20:07 +0000 (08:20 +0100)]
net: mvneta: increase the 64-bit rx/tx stats out of the hot path
commit
dc4277dd41a80fd5f29a90412ea04bc3ba54fbf1 upstream.
Better count packets and bytes in the stack and on 32 bit then
accumulate them at the end for once. This saves two memory writes
and two memory barriers per packet. The incoming packet rate was
increased by 4.7% on the Openblocks AX3 thanks to this.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Johannes Berg [Mon, 7 Jul 2014 10:01:11 +0000 (12:01 +0200)]
Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
commit
08b9939997df30e42a228e1ecb97f99e9c8ea84e upstream.
This reverts commit
277d916fc2e959c3f106904116bb4f7b1148d47a as it was
at least breaking iwlwifi by setting the IEEE80211_TX_CTL_NO_PS_BUFFER
flag in all kinds of interface modes, not only for AP mode where it is
appropriate.
To avoid reintroducing the original problem, explicitly check for probe
request frames in the multicast buffering code.
Fixes:
277d916fc2e9 ("mac80211: move "bufferable MMPDU" check to fix AP mode scan")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Malcolm Priestley [Wed, 23 Jul 2014 20:35:11 +0000 (21:35 +0100)]
staging: vt6655: Fix Warning on boot handle_irq_event_percpu.
commit
6cff1f6ad4c615319c1a146b2aa0af1043c5e9f5 upstream.
WARNING: CPU: 0 PID: 929 at /home/apw/COD/linux/kernel/irq/handle.c:147 handle_irq_event_percpu+0x1d1/0x1e0()
irq 17 handler device_intr+0x0/0xa80 [vt6655_stage] enabled interrupts
Using spin_lock_irqsave appears to fix this.
Signed-off-by: Malcolm Priestley <tvboxspy@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andy Lutomirski [Wed, 23 Jul 2014 15:34:11 +0000 (08:34 -0700)]
x86_64/entry/xen: Do not invoke espfix64 on Xen
commit
7209a75d2009dbf7745e2fd354abf25c3deb3ca3 upstream.
This moves the espfix64 logic into native_iret. To make this work,
it gets rid of the native patch for INTERRUPT_RETURN:
INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.
This changes the 16-bit SS behavior on Xen from OOPSing to leaking
some bits of the Xen hypervisor's RSP (I think).
[ hpa: this is a nonzero cost on native, but probably not enough to
measure. Xen needs to fix this in their own code, probably doing
something equivalent to espfix64. ]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Sun, 4 May 2014 17:36:22 +0000 (10:36 -0700)]
x86, espfix: Make it possible to disable 16-bit support
commit
34273f41d57ee8d854dcd2a1d754cbb546cb548f upstream.
Embedded systems, which may be very memory-size-sensitive, are
extremely unlikely to ever encounter any 16-bit software, so make it
a CONFIG_EXPERT option to turn off support for any 16-bit software
whatsoever.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Sun, 4 May 2014 17:00:49 +0000 (10:00 -0700)]
x86, espfix: Make espfix64 a Kconfig option, fix UML
commit
197725de65477bc8509b41388157c1a2283542bb upstream.
Make espfix64 a hidden Kconfig option. This fixes the x86-64 UML
build which had broken due to the non-existence of init_espfix_bsp()
in UML: since UML uses its own Kconfig, this option does not appear in
the UML build.
This also makes it possible to make support for 16-bit segments a
configuration option, for the people who want to minimize the size of
the kernel.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Richard Weinberger <richard@nod.at>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Fri, 2 May 2014 18:33:51 +0000 (11:33 -0700)]
x86, espfix: Fix broken header guard
commit
20b68535cd27183ebd3651ff313afb2b97dac941 upstream.
Header guard is #ifndef, not #ifdef...
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Thu, 1 May 2014 21:12:23 +0000 (14:12 -0700)]
x86, espfix: Move espfix definitions into a separate header file
commit
e1fe9ed8d2a4937510d0d60e20705035c2609aea upstream.
Sparse warns that the percpu variables aren't declared before they are
defined. Rather than hacking around it, move espfix definitions into
a proper header file.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Tue, 29 Apr 2014 23:46:09 +0000 (16:46 -0700)]
x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
commit
3891a04aafd668686239349ea58f3314ea2af86b upstream.
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. This
causes some 16-bit software to break, but it also leaks kernel state
to user space. We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.
In checkin:
b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.
This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart. When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace. The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.
(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)
Special thanks to:
- Andy Lutomirski, for the suggestion of using very small stack slots
and copy (as opposed to map) the IRET frame there, and for the
suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H. Peter Anvin [Wed, 21 May 2014 17:22:59 +0000 (10:22 -0700)]
Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"
commit
7ed6fb9b5a5510e4ef78ab27419184741169978a upstream.
This reverts commit
fa81511bb0bbb2b1aace3695ce869da9762624ff in
preparation of merging in the proper fix (espfix64).
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jan Kara [Fri, 1 Aug 2014 10:20:02 +0000 (12:20 +0200)]
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
commit
504d58745c9ca28d33572e2d8a9990b43e06075d upstream.
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
(&port_lock_key){-.....}, at: [<
811c60be>] serial8250_console_write+0x8c/0x10c
but task is already holding lock:
(hrtimer_bases.lock){-.-...}, at: [<
8103caeb>] hrtimer_try_to_cancel+0x13/0x66
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (hrtimer_bases.lock){-.-...}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
8103c918>] __hrtimer_start_range_ns+0x1c/0x197
[<
8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
[<
81080792>] task_clock_event_start+0x3a/0x3f
[<
810807a4>] task_clock_event_add+0xd/0x14
[<
8108259a>] event_sched_in+0xb6/0x17a
[<
810826a2>] group_sched_in+0x44/0x122
[<
81082885>] ctx_sched_in.isra.67+0x105/0x11f
[<
810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
[<
81082bf6>] __perf_install_in_context+0x8b/0xa3
[<
8107eb8e>] remote_function+0x12/0x2a
[<
8105f5af>] smp_call_function_single+0x2d/0x53
[<
8107e17d>] task_function_call+0x30/0x36
[<
8107fb82>] perf_install_in_context+0x87/0xbb
[<
810852c9>] SYSC_perf_event_open+0x5c6/0x701
[<
810856f9>] SyS_perf_event_open+0x17/0x19
[<
8142f8ee>] syscall_call+0x7/0xb
-> #4 (&ctx->lock){......}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f04c>] _raw_spin_lock+0x21/0x30
[<
81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
8142cae0>] schedule+0xf/0x11
[<
8142f9a6>] work_resched+0x5/0x30
-> #3 (&rq->lock){-.-.-.}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f04c>] _raw_spin_lock+0x21/0x30
[<
81040873>] __task_rq_lock+0x33/0x3a
[<
8104184c>] wake_up_new_task+0x25/0xc2
[<
8102474b>] do_fork+0x15c/0x2a0
[<
810248a9>] kernel_thread+0x1a/0x1f
[<
814232a2>] rest_init+0x1a/0x10e
[<
817af949>] start_kernel+0x303/0x308
[<
817af2ab>] i386_start_kernel+0x79/0x7d
-> #2 (&p->pi_lock){-.-...}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
810413dd>] try_to_wake_up+0x1d/0xd6
[<
810414cd>] default_wake_function+0xb/0xd
[<
810461f3>] __wake_up_common+0x39/0x59
[<
81046346>] __wake_up+0x29/0x3b
[<
811b8733>] tty_wakeup+0x49/0x51
[<
811c3568>] uart_write_wakeup+0x17/0x19
[<
811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<
811c5f28>] serial8250_handle_irq+0x54/0x6a
[<
811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<
811c56d8>] serial8250_interrupt+0x38/0x9e
[<
810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<
81051296>] handle_irq_event+0x2c/0x43
[<
81052cee>] handle_level_irq+0x57/0x80
[<
81002a72>] handle_irq+0x46/0x5c
[<
810027df>] do_IRQ+0x32/0x89
[<
8143036e>] common_interrupt+0x2e/0x33
[<
8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<
811c25a4>] uart_start+0x2d/0x32
[<
811c2c04>] uart_write+0xc7/0xd6
[<
811bc6f6>] n_tty_write+0xb8/0x35e
[<
811b9beb>] tty_write+0x163/0x1e4
[<
811b9cd9>] redirected_tty_write+0x6d/0x75
[<
810b6ed6>] vfs_write+0x75/0xb0
[<
810b7265>] SyS_write+0x44/0x77
[<
8142f8ee>] syscall_call+0x7/0xb
-> #1 (&tty->write_wait){-.....}:
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
81046332>] __wake_up+0x15/0x3b
[<
811b8733>] tty_wakeup+0x49/0x51
[<
811c3568>] uart_write_wakeup+0x17/0x19
[<
811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<
811c5f28>] serial8250_handle_irq+0x54/0x6a
[<
811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<
811c56d8>] serial8250_interrupt+0x38/0x9e
[<
810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<
81051296>] handle_irq_event+0x2c/0x43
[<
81052cee>] handle_level_irq+0x57/0x80
[<
81002a72>] handle_irq+0x46/0x5c
[<
810027df>] do_IRQ+0x32/0x89
[<
8143036e>] common_interrupt+0x2e/0x33
[<
8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<
811c25a4>] uart_start+0x2d/0x32
[<
811c2c04>] uart_write+0xc7/0xd6
[<
811bc6f6>] n_tty_write+0xb8/0x35e
[<
811b9beb>] tty_write+0x163/0x1e4
[<
811b9cd9>] redirected_tty_write+0x6d/0x75
[<
810b6ed6>] vfs_write+0x75/0xb0
[<
810b7265>] SyS_write+0x44/0x77
[<
8142f8ee>] syscall_call+0x7/0xb
-> #0 (&port_lock_key){-.....}:
[<
8104a62d>] __lock_acquire+0x9ea/0xc6d
[<
8104a942>] lock_acquire+0x92/0x101
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
811c60be>] serial8250_console_write+0x8c/0x10c
[<
8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<
8104f5d5>] console_unlock+0x1d7/0x398
[<
8104fb70>] vprintk_emit+0x3da/0x3e4
[<
81425f76>] printk+0x17/0x19
[<
8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<
8105c548>] clockevents_program_event+0xe7/0xf3
[<
8105cc1c>] tick_program_event+0x1e/0x23
[<
8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<
8103c49e>] __remove_hrtimer+0x5b/0x79
[<
8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<
8103cb4b>] hrtimer_cancel+0xd/0x18
[<
8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<
81080705>] task_clock_event_stop+0x20/0x64
[<
81080756>] task_clock_event_del+0xd/0xf
[<
81081350>] event_sched_out+0xab/0x11e
[<
810813e0>] group_sched_out+0x1d/0x66
[<
81081682>] ctx_sched_out+0xaf/0xbf
[<
81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
8142cae0>] schedule+0xf/0x11
[<
8142f9a6>] work_resched+0x5/0x30
other info that might help us debug this:
Chain exists of:
&port_lock_key --> &ctx->lock --> hrtimer_bases.lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(hrtimer_bases.lock);
lock(&ctx->lock);
lock(hrtimer_bases.lock);
lock(&port_lock_key);
*** DEADLOCK ***
4 locks held by trinity-main/74:
#0: (&rq->lock){-.-.-.}, at: [<
8142c6f3>] __schedule+0xed/0x4cb
#1: (&ctx->lock){......}, at: [<
81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
#2: (hrtimer_bases.lock){-.-...}, at: [<
8103caeb>] hrtimer_try_to_cancel+0x13/0x66
#3: (console_lock){+.+...}, at: [<
8104fb5d>] vprintk_emit+0x3c7/0x3e4
stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted
3.15.0-rc8-06195-g939f04b #2
00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
[<
81426f69>] dump_stack+0x16/0x18
[<
81425a99>] print_circular_bug+0x18f/0x19c
[<
8104a62d>] __lock_acquire+0x9ea/0xc6d
[<
8104a942>] lock_acquire+0x92/0x101
[<
811c60be>] ? serial8250_console_write+0x8c/0x10c
[<
811c6032>] ? wait_for_xmitr+0x76/0x76
[<
8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<
811c60be>] ? serial8250_console_write+0x8c/0x10c
[<
811c60be>] serial8250_console_write+0x8c/0x10c
[<
8104af87>] ? lock_release+0x191/0x223
[<
811c6032>] ? wait_for_xmitr+0x76/0x76
[<
8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<
8104f5d5>] console_unlock+0x1d7/0x398
[<
8104fb70>] vprintk_emit+0x3da/0x3e4
[<
81425f76>] printk+0x17/0x19
[<
8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<
8105cc1c>] tick_program_event+0x1e/0x23
[<
8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<
8103c49e>] __remove_hrtimer+0x5b/0x79
[<
8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<
8103cb4b>] hrtimer_cancel+0xd/0x18
[<
8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<
81080705>] task_clock_event_stop+0x20/0x64
[<
81080756>] task_clock_event_del+0xd/0xf
[<
81081350>] event_sched_out+0xab/0x11e
[<
810813e0>] group_sched_out+0x1d/0x66
[<
81081682>] ctx_sched_out+0xaf/0xbf
[<
81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<
8104416d>] ? __dequeue_entity+0x23/0x27
[<
81044505>] ? pick_next_task_fair+0xb1/0x120
[<
8142cacc>] __schedule+0x4c6/0x4cb
[<
81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
[<
810475b0>] ? trace_hardirqs_off+0xb/0xd
[<
81056346>] ? rcu_irq_exit+0x64/0x77
Fix the problem by using printk_deferred() which does not call into the
scheduler.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
John Stultz [Wed, 4 Jun 2014 23:11:40 +0000 (16:11 -0700)]
printk: rename printk_sched to printk_deferred
commit
aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Lars-Peter Clausen [Thu, 17 Jul 2014 15:59:00 +0000 (16:59 +0100)]
iio: buffer: Fix demux table creation
commit
61bd55ce1667809f022be88da77db17add90ea4e upstream.
When creating the demux table we need to iterate over the selected scan mask for
the buffer to get the samples which should be copied to destination buffer.
Right now the code uses the mask which contains all active channels, which means
the demux table contains entries which causes it to copy all the samples from
source to destination buffer one by one without doing any demuxing.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Malcolm Priestley [Wed, 23 Jul 2014 20:35:12 +0000 (21:35 +0100)]
staging: vt6655: Fix disassociated messages every 10 seconds
commit
4aa0abed3a2a11b7d71ad560c1a3e7631c5a31cd upstream.
byReAssocCount is incremented every second resulting in
disassociated message being send every 10 seconds whether
connection or not.
byReAssocCount should only advance while eCommandState
is in WLAN_ASSOCIATE_WAIT
Change existing scope to if condition.
Signed-off-by: Malcolm Priestley <tvboxspy@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David Rientjes [Wed, 30 Jul 2014 23:08:24 +0000 (16:08 -0700)]
mm, thp: do not allow thp faults to avoid cpuset restrictions
commit
b104a35d32025ca740539db2808aa3385d0f30eb upstream.
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
James Bottomley [Thu, 3 Jul 2014 17:17:34 +0000 (19:17 +0200)]
scsi: handle flush errors properly
commit
89fb4cd1f717a871ef79fa7debbe840e3225cd54 upstream.
Flush commands don't transfer data and thus need to be special cased
in the I/O completion handler so that we can propagate errors to
the block layer and filesystem.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Reported-by: Steven Haber <steven@qumulo.com>
Tested-by: Steven Haber <steven@qumulo.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alexandre Bounine [Wed, 30 Jul 2014 23:08:26 +0000 (16:08 -0700)]
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
commit
0193ed8225e1a79ed64632106ec3cc81798cb13c upstream.
This is a bug fix for the situation when function tsi721_desc_get() fails
to obtain a free transaction descriptor.
The bug usually results in a memory access crash dump when data transfer
scatter-gather list has more entries than size of hardware buffer
descriptors ring. This fix ensures that error is properly returned to a
caller instead of an invalid entry.
This patch is applicable to kernel versions starting from v3.5.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eliad Peller [Thu, 17 Jul 2014 12:00:56 +0000 (15:00 +0300)]
cfg80211: fix mic_failure tracing
commit
8c26d458394be44e135d1c6bd4557e1c4e1a0535 upstream.
tsc can be NULL (mac80211 currently always passes NULL),
resulting in NULL-dereference. check before copying it.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Konstantin Khlebnikov [Fri, 25 Jul 2014 08:17:12 +0000 (09:17 +0100)]
ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
commit
811a2407a3cf7bbd027fbe92d73416f17485a3d8 upstream.
On LPAE, each level 1 (pgd) page table entry maps 1GiB, and the level 2
(pmd) entries map 2MiB.
When the identity mapping is created on LPAE, the pgd pointers are copied
from the swapper_pg_dir. If we find that we need to modify the contents
of a pmd, we allocate a new empty pmd table and insert it into the
appropriate 1GB slot, before then filling it with the identity mapping.
However, if the 1GB slot covers the kernel lowmem mappings, we obliterate
those mappings.
When replacing a PMD, first copy the old PMD contents to the new PMD, so
that we preserve the existing mappings, particularly the mappings of the
kernel itself.
[rewrote commit message and added code comment -- rmk]
Fixes:
ae2de101739c ("ARM: LPAE: Add identity mapping support for the 3-level page table format")
Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Milan Broz [Tue, 29 Jul 2014 18:41:09 +0000 (18:41 +0000)]
crypto: af_alg - properly label AF_ALG socket
commit
4c63f83c2c2e16a13ce274ee678e28246bd33645 upstream.
Th AF_ALG socket was missing a security label (e.g. SELinux)
which means that socket was in "unlabeled" state.
This was recently demonstrated in the cryptsetup package
(cryptsetup v1.6.5 and later.)
See https://bugzilla.redhat.com/show_bug.cgi?id=
1115120
This patch clones the sock's label from the parent sock
and resolves the issue (similar to AF_BLUETOOTH protocol family).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Thu, 31 Jul 2014 21:55:39 +0000 (14:55 -0700)]
Linux 3.10.51
Zoltan Kiss [Wed, 26 Mar 2014 22:37:45 +0000 (22:37 +0000)]
core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
commit
36d5fe6a000790f56039afe26834265db0a3ad4c upstream.
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.13: skb_zerocopy() is new in 3.14, but was moved from a
static function in nfnetlink_queue. We need to patch that and its caller, but
not openvswitch.]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Michael Brown [Thu, 10 Jul 2014 11:26:20 +0000 (12:26 +0100)]
x86/efi: Include a .bss section within the PE/COFF headers
commit
c7fb93ec51d462ec3540a729ba446663c26a0505 upstream.
The PE/COFF headers currently describe only the initialised-data
portions of the image, and result in no space being allocated for the
uninitialised-data portions. Consequently, the EFI boot stub will end
up overwriting unexpected areas of memory, with unpredictable results.
Fix by including a .bss section in the PE/COFF headers (functionally
equivalent to the init_size field in the bzImage header).
Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
Cc: Thomas Bächler <thomas@archlinux.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Martin Schwidefsky [Mon, 23 Jun 2014 13:29:40 +0000 (15:29 +0200)]
s390/ptrace: fix PSW mask check
commit
dab6cf55f81a6e16b8147aed9a843e1691dcd318 upstream.
The PSW mask check of the PTRACE_POKEUSR_AREA command is incorrect.
The PSW_MASK_USER define contains the PSW_MASK_ASC bits, the ptrace
interface accepts all combinations for the address-space-control
bits. To protect the kernel space the PSW mask check in ptrace needs
to reject the address-space-control bit combination for home space.
Fixes CVE-2014-3534
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Linus Torvalds [Sat, 26 Jul 2014 21:52:01 +0000 (14:52 -0700)]
Fix gcc-4.9.0 miscompilation of load_balance() in scheduler
commit
2062afb4f804afef61cbe62a30cac9a46e58e067 upstream.
Michel Dänzer and a couple of other people reported inexplicable random
oopses in the scheduler, and the cause turns out to be gcc mis-compiling
the load_balance() function when debugging is enabled. The gcc bug
apparently goes back to gcc-4.5, but slight optimization changes means
that it now showed up as a problem in 4.9.0 and 4.9.1.
The instruction scheduling problem causes gcc to schedule a spill
operation to before the stack frame has been created, which in turn can
corrupt the spilled value if an interrupt comes in. There may be other
effects of this bug too, but that's the code generation problem seen in
Michel's case.
This is fixed in current gcc HEAD, but the workaround as suggested by
Markus Trippelsdorf is pretty simple: use -fno-var-tracking-assignments
when compiling the kernel, which disables the gcc code that causes the
problem. This can result in slightly worse debug information for
variable accesses, but that is infinitely preferable to actual code
generation problems.
Doing this unconditionally (not just for CONFIG_DEBUG_INFO) also allows
non-debug builds to verify that the debug build would be identical: we
can do
export GCC_COMPARE_DEBUG=1
to make gcc internally verify that the result of the build is
independent of the "-g" flag (it will make the compiler build everything
twice, toggling the debug flag, and compare the results).
Without the "-fno-var-tracking-assignments" option, the build would fail
(even with 4.8.3 that didn't show the actual stack frame bug) with a gcc
compare failure.
See also gcc bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61801
Reported-by: Michel Dänzer <michel@daenzer.net>
Suggested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Naoya Horiguchi [Wed, 23 Jul 2014 21:00:19 +0000 (14:00 -0700)]
mm: hugetlb: fix copy_hugetlb_page_range()
commit
0253d634e0803a8376a0d88efee0bf523d8673f9 upstream.
Commit
4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry") changed the order of
huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
in some workloads like hugepage-backed heap allocation via libhugetlbfs.
This patch fixes it.
The test program for the problem is shown below:
$ cat heap.c
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#define HPS 0x200000
int main() {
int i;
char *p = malloc(HPS);
memset(p, '1', HPS);
for (i = 0; i < 5; i++) {
if (!fork()) {
memset(p, '2', HPS);
p = malloc(HPS);
memset(p, '3', HPS);
free(p);
return 0;
}
}
sleep(1);
free(p);
return 0;
}
$ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
Fixes
4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry"), so is applicable to -stable kernels which
include it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Suggested-by: Guillaume Morin <guillaume@morinfr.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sven Wegener [Tue, 22 Jul 2014 08:26:06 +0000 (10:26 +0200)]
x86_32, entry: Store badsys error code in %eax
commit
8142b215501f8b291a108a202b3a053a265b03dd upstream.
Commit
554086d ("x86_32, entry: Do syscall exit work on badsys
(CVE-2014-4508)") introduced a regression in the x86_32 syscall entry
code, resulting in syscall() not returning proper errors for undefined
syscalls on CPUs supporting the sysenter feature.
The following code:
> int result = syscall(666);
> printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));
results in:
> result=666 errno=0 error=Success
Obviously, the syscall return value is the called syscall number, but it
should have been an ENOSYS error. When run under ptrace it behaves
correctly, which makes it hard to debug in the wild:
> result=-1 errno=38 error=Function not implemented
The %eax register is the return value register. For debugging via ptrace
the syscall entry code stores the complete register context on the
stack. The badsys handlers only store the ENOSYS error code in the
ptrace register set and do not set %eax like a regular syscall handler
would. The old resume_userspace call chain contains code that clobbers
%eax and it restores %eax from the ptrace registers afterwards. The same
goes for the ptrace-enabled call chain. When ptrace is not used, the
syscall return value is the passed-in syscall number from the untouched
%eax register.
Use %eax as the return value register in syscall_badsys and
sysenter_badsys, like a real syscall handler does, and have the caller
push the value onto the stack for ptrace access.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.net
Reviewed-and-tested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Guenter Roeck [Fri, 18 Jul 2014 14:31:18 +0000 (07:31 -0700)]
hwmon: (smsc47m192) Fix temperature limit and vrm write operations
commit
043572d5444116b9d9ad8ae763cf069e7accbc30 upstream.
Temperature limit clamps are applied after converting the temperature
from milli-degrees C to degrees C, so either the clamp limit needs
to be specified in degrees C, not milli-degrees C, or clamping must
happen before converting to degrees C. Use the latter method to avoid
overflows.
vrm is an u8, so the written value needs to be limited to [0, 255].
Cc: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
John David Anglin [Wed, 23 Jul 2014 23:44:12 +0000 (19:44 -0400)]
parisc: Remove SA_RESTORER define
commit
20dbea494543aefaace874cc3ec93a39b94b1ec4 upstream.
The sa_restorer field in struct sigaction is obsolete and no longer in
the parisc implementation. However, the core code assumes the field is
present if SA_RESTORER is defined. So, the define needs to be removed.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Silesh C V [Wed, 23 Jul 2014 20:59:59 +0000 (13:59 -0700)]
coredump: fix the setting of PF_DUMPCORE
commit
aed8adb7688d5744cb484226820163af31d2499a upstream.
Commit
079148b919d0 ("coredump: factor out the setting of PF_DUMPCORE")
cleaned up the setting of PF_DUMPCORE by removing it from all the
linux_binfmt->core_dump() and moving it to zap_threads().But this ended
up clearing all the previously set flags. This causes issues during
core generation when tsk->flags is checked again (eg. for PF_USED_MATH
to dump floating point registers). Fix this.
Signed-off-by: Silesh C V <svellattu@mvista.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dmitry Torokhov [Sat, 19 Jul 2014 23:30:31 +0000 (16:30 -0700)]
Input: fix defuzzing logic
commit
50c5d36dab930b1f1b1e3348b8608aa8b9ee7610 upstream.
We attempt to remove noise from coordinates reported by devices in
input_handle_abs_event(), unfortunately, unless we were dropping the
event altogether, we were ignoring the adjusted value and were passing
on the original value instead.
Reviewed-by: Andrew de los Reyes <adlr@chromium.org>
Reviewed-by: Benson Leung <bleung@chromium.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mikulas Patocka [Tue, 4 Mar 2014 22:13:47 +0000 (17:13 -0500)]
slab_common: fix the check for duplicate slab names
commit
694617474e33b8603fc76e090ed7d09376514b1a upstream.
The patch
3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the
problem where kmem_cache_create incorrectly reports duplicate cache name
and fails. The problem is described in the header of that patch.
However, the patch doesn't really fix the problem because of these
reasons:
* the logic to test for debugging is reversed. It was intended to perform
the check only if slub debugging is enabled (which implies that caches
with the same parameters are not merged). Therefore, there should be
#if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
The current code has the condition reversed and performs the test if
debugging is disabled.
* slub debugging may be enabled or disabled based on kernel command line,
CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test
based on definition of CONFIG_SLUB_DEBUG_ON is unreliable.
This patch fixes the problem by removing the test
"!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
checked if the SLUB allocator is used.
Note to stable kernel maintainers: when backporint this patch, please
backport also the patch
3e374919b314f20e2a04f641ebc1093d758f66a4.
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Lameter [Sat, 21 Sep 2013 21:56:34 +0000 (21:56 +0000)]
slab_common: Do not check for duplicate slab names
commit
3e374919b314f20e2a04f641ebc1093d758f66a4 upstream.
SLUB can alias multiple slab kmem_create_requests to one slab cache to save
memory and increase the cache hotness. As a result the name of the slab can be
stale. Only check the name for duplicates if we are in debug mode where we do
not merge multiple caches.
This fixes the following problem reported by Jonathan Brassow:
The problem with kmem_cache* is this:
*) Assume CONFIG_SLUB is set
1) kmem_cache_create(name="foo-a")
- creates new kmem_cache structure
2) kmem_cache_create(name="foo-b")
- If identical cache characteristics, it will be merged with the previously
created cache associated with "foo-a". The cache's refcount will be
incremented and an alias will be created via sysfs_slab_alias().
3) kmem_cache_destroy(<ptr>)
- Attempting to destroy cache associated with "foo-a", but instead the
refcount is simply decremented. I don't even think the sysfs aliases are
ever removed...
4) kmem_cache_create(name="foo-a")
- This FAILS because kmem_cache_sanity_check colides with the existing
name ("foo-a") associated with the non-removed cache.
This is a problem for RAID (specifically dm-raid) because the name used
for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache
persists for long enough, the memory address of an old mddev will be
reused for a new mddev - causing an identical formulation of the cache
name. Even though kmem_cache_destory had long ago been used to delete
the old cache, the merging of caches has cause the name and cache of that
old instance to be preserved and causes a colision (and thus failure) in
kmem_cache_create(). I see this regularly in my testing.
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tony Luck [Fri, 18 Jul 2014 18:43:01 +0000 (11:43 -0700)]
tracing: Fix wraparound problems in "uptime" trace clock
commit
58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.
The "uptime" trace clock added in:
commit
8aacf017b065a805d27467843490c976835eb4a5
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Fixes:
8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tejun Heo [Sat, 5 Jul 2014 22:43:21 +0000 (18:43 -0400)]
blkcg: don't call into policy draining if root_blkg is already gone
commit
0b462c89e31f7eb6789713437eb551833ee16ff3 upstream.
While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL. If someone else starts to drain
while the queue is in this state, the following oops happens.
NULL pointer dereference at
0000000000000028
IP: [<
ffffffff8144e944>] blk_throtl_drain+0x84/0x230
PGD
e4a1067 PUD
b773067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task:
ffff88000e222250 ti:
ffff88000efd4000 task.ti:
ffff88000efd4000
RIP: 0010:[<
ffffffff8144e944>] [<
ffffffff8144e944>] blk_throtl_drain+0x84/0x230
RSP: 0018:
ffff88000efd7bf0 EFLAGS:
00010046
RAX:
0000000000000000 RBX:
ffff880015091450 RCX:
0000000000000001
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
0000000000000000
RBP:
ffff88000efd7c10 R08:
0000000000000000 R09:
0000000000000001
R10:
ffff88000e222250 R11:
0000000000000000 R12:
ffff880015091450
R13:
ffff880015092e00 R14:
ffff880015091d70 R15:
ffff88001508fc28
FS:
00007f1332650740(0000) GS:
ffff88001fa80000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
0000000000000028 CR3:
0000000009446000 CR4:
00000000000006e0
Stack:
ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
Call Trace:
[<
ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
[<
ffffffff81427641>] __blk_drain_queue+0x71/0x180
[<
ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
[<
ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
[<
ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
[<
ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
[<
ffffffff8142d476>] blk_release_queue+0x26/0xd0
[<
ffffffff81454968>] kobject_cleanup+0x38/0x70
[<
ffffffff81454848>] kobject_put+0x28/0x60
[<
ffffffff81427505>] blk_put_queue+0x15/0x20
[<
ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
[<
ffffffff810bc339>] execute_in_process_context+0x89/0xa0
[<
ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
[<
ffffffff817930e2>] device_release+0x32/0xa0
[<
ffffffff81454968>] kobject_cleanup+0x38/0x70
[<
ffffffff81454848>] kobject_put+0x28/0x60
[<
ffffffff817934d7>] put_device+0x17/0x20
[<
ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
[<
ffffffff817d121b>] scsi_remove_device+0x2b/0x40
[<
ffffffff817d1257>] sdev_store_delete+0x27/0x30
[<
ffffffff81792ca8>] dev_attr_store+0x18/0x30
[<
ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
[<
ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
[<
ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
[<
ffffffff811f69bd>] SyS_write+0x4d/0xc0
[<
ffffffff81d24692>] system_call_fastpath+0x16/0x1b
776687bce42b ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.
Fix it by skippping calling into policy draining if all the blkgs are
already gone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Romain Degez [Fri, 11 Jul 2014 16:08:13 +0000 (18:08 +0200)]
ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode)
commit
b32bfc06aefab61acc872dec3222624e6cd867ed upstream.
Add support of the Promise FastTrak TX8660 SATA HBA in ahci mode by
registering the board in the ahci_pci_tbl[].
Note: this HBA also provide a hardware RAID mode when activated in
BIOS but specific drivers from the manufacturer are required in this
case.
Signed-off-by: Romain Degez <romain.degez@gmail.com>
Tested-by: Romain Degez <romain.degez@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tejun Heo [Wed, 23 Jul 2014 13:05:27 +0000 (09:05 -0400)]
libata: introduce ata_host->n_tags to avoid oops on SAS controllers
commit
1a112d10f03e83fb3a2fdc4c9165865dec8a3ca6 upstream.
1871ee134b73 ("libata: support the ata host which implements a queue
depth less than 32") directly used ata_port->scsi_host->can_queue from
ata_qc_new() to determine the number of tags supported by the host;
unfortunately, SAS controllers doing SATA don't initialize ->scsi_host
leading to the following oops.
BUG: unable to handle kernel NULL pointer dereference at
0000000000000058
IP: [<
ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: isci libsas scsi_transport_sas mgag200 drm_kms_helper ttm
CPU: 1 PID: 518 Comm: udevd Not tainted 3.16.0-rc6+ #62
Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.02.0002.
122320131210 12/23/2013
task:
ffff880c1a00b280 ti:
ffff88061a000000 task.ti:
ffff88061a000000
RIP: 0010:[<
ffffffff814e0618>] [<
ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
RSP: 0018:
ffff88061a003ae8 EFLAGS:
00010012
RAX:
0000000000000001 RBX:
ffff88000241ca80 RCX:
00000000000000fa
RDX:
0000000000000020 RSI:
0000000000000020 RDI:
ffff8806194aa298
RBP:
ffff88061a003ae8 R08:
ffff8806194a8000 R09:
0000000000000000
R10:
0000000000000000 R11:
ffff88000241ca80 R12:
ffff88061ad58200
R13:
ffff8806194aa298 R14:
ffffffff814e67a0 R15:
ffff8806194a8000
FS:
00007f3ad7fe3840(0000) GS:
ffff880627620000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000058 CR3:
000000061a118000 CR4:
00000000001407e0
Stack:
ffff88061a003b20 ffffffff814e96e1 ffff88000241ca80 ffff88061ad58200
ffff8800b6bf6000 ffff880c1c988000 ffff880619903850 ffff88061a003b68
ffffffffa0056ce1 ffff88061a003b48 0000000013d6e6f8 ffff88000241ca80
Call Trace:
[<
ffffffff814e96e1>] ata_sas_queuecmd+0xa1/0x430
[<
ffffffffa0056ce1>] sas_queuecommand+0x191/0x220 [libsas]
[<
ffffffff8149afee>] scsi_dispatch_cmd+0x10e/0x300 [<
ffffffff814a3bc5>] scsi_request_fn+0x2f5/0x550
[<
ffffffff81317613>] __blk_run_queue+0x33/0x40
[<
ffffffff8131781a>] queue_unplugged+0x2a/0x90
[<
ffffffff8131ceb4>] blk_flush_plug_list+0x1b4/0x210
[<
ffffffff8131d274>] blk_finish_plug+0x14/0x50
[<
ffffffff8117eaa8>] __do_page_cache_readahead+0x198/0x1f0
[<
ffffffff8117ee21>] force_page_cache_readahead+0x31/0x50
[<
ffffffff8117ee7e>] page_cache_sync_readahead+0x3e/0x50
[<
ffffffff81172ac6>] generic_file_read_iter+0x496/0x5a0
[<
ffffffff81219897>] blkdev_read_iter+0x37/0x40
[<
ffffffff811e307e>] new_sync_read+0x7e/0xb0
[<
ffffffff811e3734>] vfs_read+0x94/0x170
[<
ffffffff811e43c6>] SyS_read+0x46/0xb0
[<
ffffffff811e33d1>] ? SyS_lseek+0x91/0xb0
[<
ffffffff8171ee29>] system_call_fastpath+0x16/0x1b
Code: 00 00 00 88 50 29 83 7f 08 01 19 d2 83 e2 f0 83 ea 50 88 50 34 c6 81 1d 02 00 00 40 c6 81 17 02 00 00 00 5d c3 66 0f 1f 44 00 00 <89> 14 25 58 00 00 00
Fix it by introducing ata_host->n_tags which is initialized to
ATA_MAX_QUEUE - 1 in ata_host_init() for SAS controllers and set to
scsi_host_template->can_queue in ata_host_register() for !SAS ones.
As SAS hosts are never registered, this will give them the same
ATA_MAX_QUEUE - 1 as before. Note that we can't use
scsi_host->can_queue directly for SAS hosts anyway as they can go
higher than the libata maximum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Reported-by: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Reported-by: Peter Hurley <peter@hurleysoftware.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Fixes:
1871ee134b73 ("libata: support the ata host which implements a queue depth less than 32")
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kevin Hao [Sat, 12 Jul 2014 04:08:24 +0000 (12:08 +0800)]
libata: support the ata host which implements a queue depth less than 32
commit
1871ee134b73fb4cadab75752a7152ed2813c751 upstream.
The sata on fsl mpc8315e is broken after the commit
8a4aeec8d2d6
("libata/ahci: accommodate tag ordered controllers"). The reason is
that the ata controller on this SoC only implement a queue depth of
16. When issuing the commands in tag order, all the commands in tag
16 ~ 31 are mapped to tag 0 unconditionally and then causes the sata
malfunction. It makes no senses to use a 32 queue in software while
the hardware has less queue depth. So consider the queue depth
implemented by the hardware when requesting a command tag.
Fixes:
8a4aeec8d2d6 ("libata/ahci: accommodate tag ordered controllers")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Hellwig [Tue, 8 Jul 2014 10:25:28 +0000 (12:25 +0200)]
block: don't assume last put of shared tags is for the host
commit
d45b3279a5a2252cafcd665bbf2db8c9b31ef783 upstream.
There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
funtion that just release a references and get rid of the BUG() when the
host reference wasn't the last.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mikulas Patocka [Wed, 2 Jul 2014 16:46:23 +0000 (12:46 -0400)]
block: provide compat ioctl for BLKZEROOUT
commit
3b3a1814d1703027f9867d0f5cbbfaf6c7482474 upstream.
This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
to two uint64_t values, so there is no need to translate it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Antti Palosaari [Fri, 4 Jul 2014 08:44:39 +0000 (05:44 -0300)]
media: tda10071: force modulation to QPSK on DVB-S
commit
db4175ae2095634dbecd4c847da439f9c83e1b3b upstream.
Only supported modulation for DVB-S is QPSK. Modulation parameter
contains invalid value for DVB-S on some cases, which leads driver
refusing tuning attempt. Due to that, hard code modulation to QPSK
in case of DVB-S.
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hans Verkuil [Mon, 16 Jun 2014 12:08:29 +0000 (09:08 -0300)]
media: hdpvr: fix two audio bugs
commit
3445857b22eafb70a6ac258979e955b116bfd2c6 upstream.
When the audio encoding is changed the driver calls hdpvr_set_audio
with the current opt->audio_input value. However, that should have
been opt->audio_input + 1. So changing the audio encoding inadvertently
changes the input as well. This bug has always been there.
The second bug was introduced in kernel 3.10 and that broke the
default_audio_input module option handling: the audio encoding was
never switched to AC3 if default_audio_input was set to 2 (SPDIF input).
In addition, since starting with 3.10 the audio encoding is always set
at the start the first bug now always happens when the driver is loaded.
In the past this bug would only surface if the user would change the
audio encoding after the driver was loaded.
Also fixes a small trivial typo (bufffer -> buffer).
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reported-by: Scott Doty <scott@corp.sonic.net>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Mon, 28 Jul 2014 15:00:59 +0000 (08:00 -0700)]
Linux 3.10.50
Anton Kolesov [Fri, 20 Jun 2014 16:28:39 +0000 (20:28 +0400)]
ARC: Implement ptrace(PTRACE_GET_THREAD_AREA)
commit
a4b6cb735b25aa84a462a1985e3e43bebaf5beb4 upstream.
This patch adds implementation of GET_THREAD_AREA ptrace request type. This
is required by GDB to debug NPTL applications.
Signed-off-by: Anton Kolesov <Anton.Kolesov@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mateusz Guzik [Sat, 14 Jun 2014 13:00:09 +0000 (15:00 +0200)]
sched: Fix possible divide by zero in avg_atom() calculation
commit
b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.
proc_sched_show_task() does:
if (nr_switches)
do_div(avg_atom, nr_switches);
nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.
Fix the problem by using div64_ul() instead.
As a side effect calculations of avg_atom for big nr_switches are now correct.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Peter Zijlstra [Fri, 6 Jun 2014 17:53:16 +0000 (19:53 +0200)]
locking/mutex: Disable optimistic spinning on some architectures
commit
4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.
The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.
There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.
Opt in for known good archs.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Takashi Iwai [Tue, 15 Jul 2014 06:51:27 +0000 (08:51 +0200)]
PM / sleep: Fix request_firmware() error at resume
commit
4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.
The commit [
247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
https://bugzilla.novell.com/show_bug.cgi?id=873790
The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock. The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state. This
is for avoiding the invalid f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes(). Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable(). Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.
This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.
Fixes:
247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mike Snitzer [Mon, 14 Jul 2014 20:59:39 +0000 (16:59 -0400)]
dm cache metadata: do not allow the data block size to change
commit
048e5a07f282c57815b3901d4a68a77fa131ce0a upstream.
The block size for the dm-cache's data device must remained fixed for
the life of the cache. Disallow any attempt to change the cache's data
block size.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mike Snitzer [Mon, 14 Jul 2014 20:35:54 +0000 (16:35 -0400)]
dm thin metadata: do not allow the data block size to change
commit
9aec8629ec829fc9403788cd959e05dd87988bd1 upstream.
The block size for the thin-pool's data device must remained fixed for
the life of the thin-pool. Disallow any attempt to change the
thin-pool's data block size.
It should be noted that attempting to change the data block size via
thin-pool table reload will be ignored as a side-effect of the thin-pool
handover that the thin-pool target does during thin-pool table reload.
Here is an example outcome of attempting to load a thin-pool table that
reduced the thin-pool's data block size from 1024K to 512K.
Before:
kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks
After:
kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
kernel: device-mapper: ioctl: error adding target to table
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
John Stultz [Mon, 7 Jul 2014 21:06:11 +0000 (14:06 -0700)]
alarmtimer: Fix bug where relative alarm timers were treated as absolute
commit
16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.
This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.
Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Deucher [Mon, 14 Jul 2014 21:57:19 +0000 (17:57 -0400)]
drm/radeon: avoid leaking edid data
commit
0ac66effe7fcdee55bda6d5d10d3372c95a41920 upstream.
In some cases we fetch the edid in the detect() callback
in order to determine what sort of monitor is connected.
If that happens, don't fetch the edid again in the get_modes()
callback or we will leak the edid.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jason Wang [Mon, 12 May 2014 08:35:39 +0000 (16:35 +0800)]
drm/qxl: return IRQ_NONE if it was not our irq
commit
fbb60fe35ad579b511de8604b06a30b43846473b upstream.
Return IRQ_NONE if it was not our irq. This is necessary for the case
when qxl is sharing irq line with a device A in a crash kernel. If qxl
is initialized before A and A's irq was raised during this gap,
returning IRQ_HANDLED in this case will cause this irq to be raised
again after EOI since kernel think it was handled but in fact it was
not.
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Deucher [Tue, 15 Jul 2014 13:48:53 +0000 (09:48 -0400)]
drm/radeon: set default bl level to something reasonable
commit
201bb62402e0227375c655446ea04fcd0acf7287 upstream.
If the value in the scratch register is 0, set it to the
max level. This fixes an issue where the console fb blanking
code calls back into the backlight driver on unblank and then
sets the backlight level to 0 after the driver has already
set the mode and enabled the backlight.
bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=81382
https://bugs.freedesktop.org/show_bug.cgi?id=70207
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: David Heidelberger <david.heidelberger@ixit.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tomasz Figa [Thu, 17 Jul 2014 15:23:44 +0000 (17:23 +0200)]
irqchip: gic: Fix core ID calculation when topology is read from DT
commit
29e697b11853d3f83b1864ae385abdad4aa2c361 upstream.
Certain GIC implementation, namely those found on earlier, single
cluster, Exynos SoCs, have registers mapped without per-CPU banking,
which means that the driver needs to use different offset for each CPU.
Currently the driver calculates the offset by multiplying value returned
by cpu_logical_map() by CPU offset parsed from DT. This is correct when
CPU topology is not specified in DT and aforementioned function returns
core ID alone. However when DT contains CPU topology, the function
changes to return cluster ID as well, which is non-zero on mentioned
SoCs and so breaks the calculation in GIC driver.
This patch fixes this by masking out cluster ID in CPU offset
calculation so that only core ID is considered. Multi-cluster Exynos
SoCs already have banked GIC implementations, so this simple fix should
be enough.
Reported-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Tomasz Figa <t.figa@samsung.com>
Fixes:
db0d4db22a78d ("ARM: gic: allow GIC to support non-banked setups")
Link: https://lkml.kernel.org/r/1405610624-18722-1-git-send-email-t.figa@samsung.com
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Matthias Brugger [Thu, 3 Jul 2014 11:58:52 +0000 (13:58 +0200)]
irqchip: gic: Add support for cortex a7 compatible string
commit
a97e8027b1d28eafe6bafe062556c1ec926a49c6 upstream.
Patch
0a68214b "ARM: DT: Add binding for GIC virtualization extentions (VGIC)" added
the "arm,cortex-a7-gic" compatible string, but the corresponding IRQCHIP_DECLARE
was never added to the gic driver.
To let real Cortex-A7 SoCs use it, add the necessary declaration to the device driver.
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lkml.kernel.org/r/1404388732-28890-1-git-send-email-matthias.bgg@gmail.com
Fixes:
0a68214b76ca ("ARM: DT: Add binding for GIC virtualization extentions (VGIC)")
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Martin Lau [Tue, 10 Jun 2014 06:06:42 +0000 (23:06 -0700)]
ring-buffer: Fix polling on trace_pipe
commit
97b8ee845393701edc06e27ccec2876ff9596019 upstream.
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Fixes:
2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Amitkumar Karwar [Fri, 20 Jun 2014 18:45:25 +0000 (11:45 -0700)]
mwifiex: fix Tx timeout issue
commit
d76744a93246eccdca1106037e8ee29debf48277 upstream.
https://bugzilla.kernel.org/show_bug.cgi?id=70191
https://bugzilla.kernel.org/show_bug.cgi?id=77581
It is observed that sometimes Tx packet is downloaded without
adding driver's txpd header. This results in firmware parsing
garbage data as packet length. Sometimes firmware is unable
to read the packet if length comes out as invalid. This stops
further traffic and timeout occurs.
The root cause is uninitialized fields in tx_info(skb->cb) of
packet used to get garbage values. In this case if
MWIFIEX_BUF_FLAG_REQUEUED_PKT flag is mistakenly set, txpd
header was skipped. This patch makes sure that tx_info is
correctly initialized to fix the problem.
Reported-by: Andrew Wiley <wiley.andrew.j@gmail.com>
Reported-by: Linus Gasser <list@markas-al-nour.org>
Reported-by: Michael Hirsch <hirsch@teufel.de>
Tested-by: Xinming Hu <huxm@marvell.com>
Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Maithili Hinge <maithili@marvell.com>
Signed-off-by: Avinash Patil <patila@marvell.com>
Signed-off-by: Bing Zhao <bzhao@marvell.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
HATAYAMA Daisuke [Wed, 25 Jun 2014 01:09:07 +0000 (10:09 +0900)]
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
commit
b292d7a10487aee6e74b1c18b8d95b92f40d4a4f upstream.
Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
For example, we use external NMI to make system panic to get crash
dump, but in this case, the external NMI is falsely handled do to the
issue.
This commit deals with the issue simply by ignoring CondChgd bit.
Here is explanation in detail.
On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.
intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.
The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.
As a result, if CondChgd bit is set, any NMI is falsely handled by the
NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until CondChgd bit is cleared.
I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.
On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.
I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:
In 18.9.1:
"The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to
indicate changes to the state of performancmonitoring hardware"
In Table 35-2 IA-32 Architectural MSRs
63 CondChg: status bits of this register has changed.
These are different from the bahviour I see on the actual system as I
explained above.
At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Mon, 21 Jul 2014 05:17:42 +0000 (07:17 +0200)]
ipv4: fix buffer overflow in ip_options_compile()
[ Upstream commit
10ec9472f05b45c94db3c854d22581a20b97db41 ]
There is a benign buffer overflow in ip_options_compile spotted by
AddressSanitizer[1] :
Its benign because we always can access one extra byte in skb->head
(because header is followed by struct skb_shared_info), and in this case
this byte is not even used.
[28504.910798] ==================================================================
[28504.912046] AddressSanitizer: heap-buffer-overflow in ip_options_compile
[28504.913170] Read of size 1 by thread T15843:
[28504.914026] [<
ffffffff81802f91>] ip_options_compile+0x121/0x9c0
[28504.915394] [<
ffffffff81804a0d>] ip_options_get_from_user+0xad/0x120
[28504.916843] [<
ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
[28504.918175] [<
ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
[28504.919490] [<
ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
[28504.920835] [<
ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
[28504.922208] [<
ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
[28504.923459] [<
ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
[28504.924722]
[28504.925106] Allocated by thread T15843:
[28504.925815] [<
ffffffff81804995>] ip_options_get_from_user+0x35/0x120
[28504.926884] [<
ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
[28504.927975] [<
ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
[28504.929175] [<
ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
[28504.930400] [<
ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
[28504.931677] [<
ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
[28504.932851] [<
ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
[28504.934018]
[28504.934377] The buggy address
ffff880026382828 is located 0 bytes to the right
[28504.934377] of 40-byte region [
ffff880026382800,
ffff880026382828)
[28504.937144]
[28504.937474] Memory state around the buggy address:
[28504.938430]
ffff880026382300: ........ rrrrrrrr rrrrrrrr rrrrrrrr
[28504.939884]
ffff880026382400:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.941294]
ffff880026382500: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
[28504.942504]
ffff880026382600:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.943483]
ffff880026382700:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.944511] >
ffff880026382800: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
[28504.945573] ^
[28504.946277]
ffff880026382900:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.094949]
ffff880026382a00:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.096114]
ffff880026382b00:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.097116]
ffff880026382c00:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.098472]
ffff880026382d00:
ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.099804] Legend:
[28505.100269] f - 8 freed bytes
[28505.100884] r - 8 redzone bytes
[28505.101649] . - 8 allocated bytes
[28505.102406] x=1..7 - x allocated bytes + (8-x) redzone bytes
[28505.103637] ==================================================================
[1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ben Hutchings [Sun, 20 Jul 2014 23:06:48 +0000 (00:06 +0100)]
dns_resolver: Null-terminate the right string
[ Upstream commit
640d7efe4c08f06c4ae5d31b79bd8740e7f6790a ]
*_result[len] is parsed as *(_result[len]) which is not at all what we
want to touch here.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes:
84a7c0b1db1c ("dns_resolver: assure that dns_query() result is null-terminated")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Manuel Schölling [Sat, 7 Jun 2014 21:57:25 +0000 (23:57 +0200)]
dns_resolver: assure that dns_query() result is null-terminated
[ Upstream commit
84a7c0b1db1c17d5ded8d3800228a608e1070b40 ]
dns_query() credulously assumes that keys are null-terminated and
returns a copy of a memory block that is off by one.
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sowmini Varadhan [Wed, 16 Jul 2014 14:02:26 +0000 (10:02 -0400)]
sunvnet: clean up objects created in vnet_new() on vnet_exit()
[ Upstream commit
a4b70a07ed12a71131cab7adce2ce91c71b37060 ]
Nothing cleans up the objects created by
vnet_new(), they are completely leaked.
vnet_exit(), after doing the vio_unregister_driver() to clean
up ports, should call a helper function that iterates over vnet_list
and cleans up those objects. This includes unregister_netdevice()
as well as free_netdev().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Karl Volz <karl.volz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Schulz [Sat, 12 Jul 2014 22:53:15 +0000 (00:53 +0200)]
net: pppoe: use correct channel MTU when using Multilink PPP
[ Upstream commit
a8a3e41c67d24eb12f9ab9680cbb85e24fcd9711 ]
The PPP channel MTU is used with Multilink PPP when ppp_mp_explode() (see
ppp_generic module) tries to determine how big a fragment might be. According
to RFC 1661, the MTU excludes the 2-byte PPP protocol field, see the
corresponding comment and code in ppp_mp_explode():
/*
* hdrlen includes the 2-byte PPP protocol field, but the
* MTU counts only the payload excluding the protocol field.
* (RFC1661 Section 2)
*/
mtu = pch->chan->mtu - (hdrlen - 2);
However, the pppoe module *does* include the PPP protocol field in the channel
MTU, which is wrong as it causes the PPP payload to be 1-2 bytes too big under
certain circumstances (one byte if PPP protocol compression is used, two
otherwise), causing the generated Ethernet packets to be dropped. So the pppoe
module has to subtract two bytes from the channel MTU. This error only
manifests itself when using Multilink PPP, as otherwise the channel MTU is not
used anywhere.
In the following, I will describe how to reproduce this bug. We configure two
pppd instances for multilink PPP over two PPPoE links, say eth2 and eth3, with
a MTU of 1492 bytes for each link and a MRRU of 2976 bytes. (This MRRU is
computed by adding the two link MTUs and subtracting the MP header twice, which
is 4 bytes long.) The necessary pppd statements on both sides are "multilink
mtu 1492 mru 1492 mrru 2976". On the client side, we additionally need "plugin
rp-pppoe.so eth2" and "plugin rp-pppoe.so eth3", respectively; on the server
side, we additionally need to start two pppoe-server instances to be able to
establish two PPPoE sessions, one over eth2 and one over eth3. We set the MTU
of the PPP network interface to the MRRU (2976) on both sides of the connection
in order to make use of the higher bandwidth. (If we didn't do that, IP
fragmentation would kick in, which we want to avoid.)
Now we send a ICMPv4 echo request with a payload of 2948 bytes from client to
server over the PPP link. This results in the following network packet:
2948 (echo payload)
+ 8 (ICMPv4 header)
+ 20 (IPv4 header)
---------------------
2976 (PPP payload)
These 2976 bytes do not exceed the MTU of the PPP network interface, so the
IP packet is not fragmented. Now the multilink PPP code in ppp_mp_explode()
prepends one protocol byte (0x21 for IPv4), making the packet one byte bigger
than the negotiated MRRU. So this packet would have to be divided in three
fragments. But this does not happen as each link MTU is assumed to be two bytes
larger. So this packet is diveded into two fragments only, one of size 1489 and
one of size 1488. Now we have for that bigger fragment:
1489 (PPP payload)
+ 4 (MP header)
+ 2 (PPP protocol field for the MP payload (0x3d))
+ 6 (PPPoE header)
--------------------------
1501 (Ethernet payload)
This packet exceeds the link MTU and is discarded.
If one configures the link MTU on the client side to 1501, one can see the
discarded Ethernet frames with tcpdump running on the client. A
ping -s 2948 -c 1 192.168.15.254
leads to the smaller fragment that is correctly received on the server side:
(tcpdump -vvvne -i eth3 pppoes and ppp proto 0x3d)
52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
length 1514: PPPoE [ses 0x3] MLPPP (0x003d), length 1494: seq 0x000,
Flags [end], length 1492
and to the bigger fragment that is not received on the server side:
(tcpdump -vvvne -i eth2 pppoes and ppp proto 0x3d)
52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
length 1515: PPPoE [ses 0x5] MLPPP (0x003d), length 1495: seq 0x000,
Flags [begin], length 1493
With the patch below, we correctly obtain three fragments:
52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
length 1514: PPPoE [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
Flags [begin], length 1492
52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
length 1514: PPPoE [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
Flags [none], length 1492
52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
length 27: PPPoE [ses 0x1] MLPPP (0x003d), length 7: seq 0x000,
Flags [end], length 5
And the ICMPv4 echo request is successfully received at the server side:
IP (tos 0x0, ttl 64, id 21925, offset 0, flags [DF], proto ICMP (1),
length 2976)
192.168.222.2 > 192.168.15.254: ICMP echo request, id 30530, seq 0,
length 2956
The bug was introduced in commit
c9aa6895371b2a257401f59d3393c9f7ac5a8698
("[PPPOE]: Advertise PPPoE MTU") from the very beginning. This patch applies
to 3.10 upwards but the fix can be applied (with minor modifications) to
kernels as old as 2.6.32.
Signed-off-by: Christoph Schulz <develop@kristov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>