Herbert Xu [Fri, 7 Nov 2014 13:27:09 +0000 (21:27 +0800)]
ipv4: Avoid reading user iov twice after raw_probe_proto_opt
Ever since raw_probe_proto_opt was added it had the problem of
causing the user iov to be read twice, once during the probe for
the protocol header and once again in ip_append_data.
This is a potential security problem since it means that whatever
we're probing may be invalid. This patch plugs the hole by
firstly advancing the iov so we don't read the same spot again,
and secondly saving what we read the first time around for use
by ip_append_data.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 7 Nov 2014 13:27:08 +0000 (21:27 +0800)]
ipv4: Use standard iovec primitive in raw_probe_proto_opt
The function raw_probe_proto_opt tries to extract the first two
bytes from the user input in order to seed the IPsec lookup for
ICMP packets. In doing so it's processing iovec by hand and
overcomplicating things.
This patch replaces the manual iovec processing with a call to
memcpy_fromiovecend.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 10 Nov 2014 18:27:49 +0000 (13:27 -0500)]
net: Move bonding headers under include/net
This ways drivers like cxgb4 don't need to do ugly relative includes.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Fri, 7 Nov 2014 04:46:14 +0000 (20:46 -0800)]
cxgb4: Remove unnecessary struct in6_addr * casts
Just use the address of the in6_addr.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 10 Nov 2014 17:57:14 +0000 (12:57 -0500)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
RDMA/cxgb4,cxgb4vf,cxgb4i,csiostor: Cleanup macros
This series moves the debugfs code to a new file debugfs.c and cleans up
macros/register defines.
Various patches have ended up changing the style of the symbolic macros/register
defines and some of them used the macros/register defines that matches the
output of the script from the hardware team.
As a result, the current kernel.org files are a mix of different macro styles.
Since this macro/register defines is used by five different drivers, a
few patch series have ended up adding duplicate macro/register define entries
with different styles. This makes these register define/macro files a complete
mess and we want to make them clean and consistent.
Will post few more series so that we can cover all the macros so that they all
follow the same style to be consistent.
The patches series is created against 'net-next' tree.
And includes patches on cxgb4, cxgb4vf, iw_cxgb4, csiostor and cxgb4i driver.
We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
V3: Use suffix instead of prefix for macros/register defines
V2: Changes the description and cover-letter content to answer David Miller's
question
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 7 Nov 2014 04:05:25 +0000 (09:35 +0530)]
cxgb4: Cleanup macros so they follow the same style and look consistent, part 2
Various patches have ended up changing the style of the symbolic macros/register
defines to different style.
As a result, the current kernel.org files are a mix of different macro styles.
Since this macro/register defines is used by different drivers a
few patch series have ended up adding duplicate macro/register define entries
with different styles. This makes these register define/macro files a complete
mess and we want to make them clean and consistent. This patch cleans up a part
of it.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 7 Nov 2014 04:05:24 +0000 (09:35 +0530)]
cxgb4: Cleanup macros so they follow the same style and look consistent
Various patches have ended up changing the style of the symbolic macros/register
to different style.
As a result, the current kernel.org files are a mix of different macro styles.
Since this macro/register defines is used by different drivers a
few patch series have ended up adding duplicate macro/register define entries
with different styles. This makes these register define/macro files a complete
mess and we want to make them clean and consistent. This patch cleans up a part
of it.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 7 Nov 2014 04:05:23 +0000 (09:35 +0530)]
cxgb4: Add cxgb4_debugfs.c, move all debugfs code to new file
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 7 Nov 2014 05:10:11 +0000 (21:10 -0800)]
mlx4: use napi_complete_done()
To enable gro_flush_timeout, a driver has to use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
Receiver performs less calls to upper stacks, less wakes up.
This also reduces cpu usage on the sender, as it receives less ACK
packets.
Note that reducing number of wakes up increases cpu efficiency, but can
decrease QPS, as applications wont have the chance to warmup cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30
1199462.57 19705.72 0.00
0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80
1199916.51 1239.80 0.00
0.00 0.50
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 7 Nov 2014 05:09:44 +0000 (21:09 -0800)]
net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.
Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.
To reach big GRO packets on 10Gbe NIC, one can use :
ethtool -C eth0 rx-usecs 4 rx-frames 44
But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.
Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.
This patch uses a different strategy : Let GRO stack decides what do do,
based on traffic pattern.
Packets with Push flag wont be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.
This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanosecond.
To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
Receiver performs less calls to upper stacks, less wakes up.
This also reduces cpu usage on the sender, as it receives less ACK
packets.
Note that reducing number of wakes up increases cpu efficiency, but can
decrease QPS, as applications wont have the chance to warmup cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30
1199462.57 19705.72 0.00
0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80
1199916.51 1239.80 0.00
0.00 0.50
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Taht [Thu, 6 Nov 2014 16:10:14 +0000 (08:10 -0800)]
rtnetlink: add babel protocol recognition
Babel uses rt_proto 42. Add to userspace visible header file.
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Thu, 6 Nov 2014 20:53:41 +0000 (12:53 -0800)]
dccp: Convert DCCP_WARN to net_warn_ratelimited
Remove the dependency on the "warning" sysctl (net_msg_warn)
which is only used by the LIMIT_NETDEBUG macro.
Convert the LIMIT_NETDEBUG use in DCCP_WARN to the more
common net_warn_ratelimited mechanism.
This still ratelimits based on the net_ratelimit()
function, but removes the check for the sysctl.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rick Jones [Thu, 6 Nov 2014 18:37:54 +0000 (10:37 -0800)]
udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts
As NIC multicast filtering isn't perfect, and some platforms are
quite content to spew broadcasts, we should not trigger an event
for skb:kfree_skb when we do not have a match for such an incoming
datagram. We do though want to avoid sweeping the matter under the
rug entirely, so increment a suitable statistic.
This incorporates feedback from David L. Stevens, Karl Neiss and Eric
Dumazet.
V3 - use bool per David Miller
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Neukum [Thu, 6 Nov 2014 14:19:14 +0000 (15:19 +0100)]
cdc-ether: implement MULTICAST flag on the device
Olivier having laid the groundwork this patch transmits the
multicast flag to the device to save some bus traffic.
Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Mon, 3 Nov 2014 20:42:34 +0000 (12:42 -0800)]
uapi: resort Kbuild entries
The entries in the Kbuild files are incorrectly sorted.
Matters for aesthetics only.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Fri, 7 Nov 2014 14:46:42 +0000 (16:46 +0200)]
stmmac: platform: fix sparse warnings
This patch fixes the following sparse warnings. One is fixed by casting return
value to a return type of the function. The others by creating a specific
stmmac_platform.h which provides the bits related to the platform driver.
drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c:59:29: warning: incorrect type in return expression (different address spaces)
drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c:59:29: expected void *
drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c:59:29: got void [noderef] <asn:2>*reg
drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c:64:29: warning: symbol 'meson6_dwmac_data' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c:354:29: warning: symbol 'stih4xx_dwmac_data' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c:361:29: warning: symbol 'stid127_dwmac_data' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c:133:29: warning: symbol 'sun7i_gmac_data' was not declared. Should it be static?
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Fri, 7 Nov 2014 14:53:12 +0000 (16:53 +0200)]
stmmac: remove custom implementation of print_hex_dump()
There is a kernel helper to dump buffers in a hexdecimal format. This patch
substitutes the open coded function by calling that helper.
The output is slightly changed:
- no lead space
- ASCII part will be printed along with the dump
- offset is longer than 3 characters (now 8)
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Nov 2014 17:13:40 +0000 (12:13 -0500)]
Merge branch 'iov_iter'
Herbert Xu says:
====================
Replace skb_copy_datagram_const_iovec with iterator version
This patch series adds the helper skb_copy_datagram_iter, which
is meant to replace both skb_copy_datagram_iovec and its evil
twin skb_copy_datagram_const_iovec.
It then converts tun and macvtap over to the new helper and finally
removes skb_copy_datagram_const_iovec which is only used by tun
and macvtap.
The copy_to_iter return value issue pointed out by Al has now been
fixed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 7 Nov 2014 13:22:26 +0000 (21:22 +0800)]
net: Kill skb_copy_datagram_const_iovec
Now that both macvtap and tun are using skb_copy_datagram_iter, we
can kill the abomination that is skb_copy_datagram_const_iovec.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 7 Nov 2014 13:22:25 +0000 (21:22 +0800)]
macvtap: Use iovec iterators
This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 7 Nov 2014 13:22:23 +0000 (21:22 +0800)]
tun: Use iovec iterators
This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 7 Nov 2014 13:22:22 +0000 (21:22 +0800)]
inet: Add skb_copy_datagram_iter
This patch adds skb_copy_datagram_iter, which is identical to
skb_copy_datagram_iovec except that it operates on iov_iter
instead of iovec.
Eventually all users of skb_copy_datagram_iovec should switch
over to iov_iter and then we can remove skb_copy_datagram_iovec.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Nov 2014 03:01:18 +0000 (22:01 -0500)]
Merge git://git./linux/kernel/git/davem/net
Tom Herbert [Fri, 7 Nov 2014 02:06:01 +0000 (18:06 -0800)]
vxlan: Fix to enable UDP checksums on interface
Add definition to vxlan nla_policy for UDP checksum. This is necessary
to enable UDP checksums on VXLAN.
In some instances, enabling UDP checksums can improve performance on
receive for devices that return legacy checksum-unnecessary for UDP/IP.
Also, UDP checksum provides some protection against VNI corruption.
Testing:
Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.
TCP_STREAM
IPv4, without UDP checksums
14.41% TX CPU utilization
25.71% RX CPU utilization
9083.4 Mbps
IPv4, with UDP checksums
13.99% TX CPU utilization
13.40% RX CPU utilization
9095.65 Mbps
TCP_RR
IPv4, without UDP checksums
94.08% TX CPU utilization
156/248/462 90/95/99% latencies
1.12743e+06 tps
IPv4, with UDP checksums
94.43% TX CPU utilization
158/250/462 90/95/99% latencies
1.13345e+06 tps
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Nov 2014 00:00:26 +0000 (19:00 -0500)]
Merge branch 'amd-xgbe-next'
Tom Lendacky says:
====================
amd-xgbe: AMD XGBE driver updates 2014-11-06
The following series of patches fixes a couple of bugs that slipped
through my last series.
- Free channel structure after freeing the per channel interrupts
- If an skb error allocation occurs during receive processing check
whether more descriptors are associated with the packet or whether
to start on a new packet
This patch series is based on net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 6 Nov 2014 23:02:19 +0000 (17:02 -0600)]
amd-xgbe: Check for complete packet on skb allocation error
If the skb allocation fails during receive processing, the driver would
continue reading descriptors without first determining if there were
any more descriptors for the current packet. Update the code to check
whether more descriptors are associated with the current packet or
whether to move on to the next descriptor as a new packet.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 6 Nov 2014 23:02:13 +0000 (17:02 -0600)]
amd-xgbe: Free channel/ring structures later
The channel structure is freed before freeing the per channel
interrupts resulting in a kernel oops. Move the call to free
the channel structure to after the freeing of the per channel
interrupts.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 6 Nov 2014 12:58:51 +0000 (07:58 -0500)]
netxen: Fix link event handling.
o Poll for the link events only if firmware doesn't have capability
to notify the driver for the link events.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu Varadarajan [Thu, 6 Nov 2014 09:51:39 +0000 (15:21 +0530)]
enic: update desc properly in rx_copybreak
When we reuse the rx buffer, we need to update the desc. If not hardware sees
stale value.
In the following crash, when mtu is changed, hardware sees old rx buffer value
and crashes on skb_put.
Fix this by using enic_queue_rq_desc helper function which updates the necessary
desc.
[ 64.657376] skbuff: skb_over_panic: text:
ffffffffa041f55d len:9010 put:9010 head:
ffff8800d3ca9fc0 data:
ffff8800d3caa000 tail:0x2372 end:0x640 dev:enp0s3
[ 64.659965] ------------[ cut here ]------------
[ 64.661322] kernel BUG at net/core/skbuff.c:100!
[ 64.662644] invalid opcode: 0000 [#1] PREEMPT SMP
[ 64.664001] Modules linked in: rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 cirrus ttm drm_kms_helper drm enic psmouse microcode evdev serio_raw syscopyarea sysfillrect sysimgblt i2c_piix4 i2c_core pcspkr nfs lockd grace sunrpc fscache ext4 crc16 mbcache jbd2 sd_mod ata_generic virtio_balloon ata_piix libata uhci_hcd virtio_pci virtio_ring usbcore usb_common virtio scsi_mod
[ 64.664834] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W
3.17.0-netnext-10335-g942396b-dirty #273
[ 64.664834] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 64.664834] task:
ffffffff81a1d580 ti:
ffffffff81a00000 task.ti:
ffffffff81a00000
[ 64.664834] RIP: 0010:[<
ffffffff81392cf1>] [<
ffffffff81392cf1>] skb_panic+0x61/0x70
[ 64.664834] RSP: 0018:
ffff880210603d48 EFLAGS:
00010292
[ 64.664834] RAX:
000000000000008c RBX:
ffff88020b0f6930 RCX:
0000000000000000
[ 64.664834] RDX:
000000000000008c RSI:
ffffffff8178b288 RDI:
00000000ffffffff
[ 64.664834] RBP:
ffff880210603d68 R08:
0000000000000001 R09:
0000000000000001
[ 64.664834] R10:
00000000000005ce R11:
0000000000000001 R12:
ffff88020b1f0b40
[ 64.664834] R13:
000000000000a332 R14:
ffff880209a1a000 R15:
0000000000000001
[ 64.664834] FS:
0000000000000000(0000) GS:
ffff880210600000(0000) knlGS:
0000000000000000
[ 64.664834] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[ 64.664834] CR2:
00007f6752935e48 CR3:
0000000035743000 CR4:
00000000000006f0
[ 64.664834] Stack:
[ 64.664834]
ffff8800d3caa000 0000000000002372 0000000000000640 ffff88020b1f0000
[ 64.664834]
ffff880210603d78 ffffffff81392d54 ffff880210603e08 ffffffffa041f55d
[ 64.664834]
0000000000000296 ffffffff00000000 00008e7e00008e7e ffff880200002332
[ 64.664834] Call Trace:
[ 64.664834] <IRQ>
[ 64.664834]
[ 64.664834] [<
ffffffff81392d54>] skb_put+0x54/0x60
[ 64.664834] [<
ffffffffa041f55d>] enic_rq_service.constprop.47+0x3ad/0x730 [enic]
[ 64.664834] [<
ffffffffa041fa79>] enic_poll_msix_rq+0x199/0x370 [enic]
[ 64.664834] [<
ffffffff813a5499>] net_rx_action+0x139/0x210
[ 64.664834] [<
ffffffff81290db3>] ? __this_cpu_preempt_check+0x13/0x20
[ 64.664834] [<
ffffffff8106110e>] __do_softirq+0x14e/0x280
[ 64.664834] [<
ffffffff8106152e>] irq_exit+0x8e/0xb0
[ 64.664834] [<
ffffffff8100fd21>] do_IRQ+0x61/0x100
[ 64.664834] [<
ffffffff814a2bf2>] common_interrupt+0x72/0x72
fixes:
a03bb56e67c357980dae886683733dab5583dc14 ("enic: implement rx_copybreak")
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu Varadarajan [Thu, 6 Nov 2014 09:51:38 +0000 (15:21 +0530)]
enic: handle error condition properly in enic_rq_indicate_buf
In case of error in rx path, we free the buf->os_buf but we do not make it NULL.
In next iteration we use the skb which is already freed. This causes the
following crash.
[ 886.154772] general protection fault: 0000 [#1] PREEMPT SMP
[ 886.154851] Modules linked in: rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 microcode evdev cirrus ttm drm_kms_helper drm enic syscopyarea sysfillrect sysimgblt psmouse i2c_piix4 serio_raw pcspkr i2c_core nfs lockd grace sunrpc fscache ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_common ata_generic ata_piix virtio_balloon libata scsi_mod uhci_hcd usbcore virtio_pci virtio_ring virtio usb_common
[ 886.155199] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W
3.17.0-netnext-05668-g876bc7f #272
[ 886.155263] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 886.155304] task:
ffffffff81a1d580 ti:
ffffffff81a00000 task.ti:
ffffffff81a00000
[ 886.155356] RIP: 0010:[<
ffffffff81384030>] [<
ffffffff81384030>] kfree_skb_list+0x10/0x30
[ 886.155418] RSP: 0018:
ffff880210603d48 EFLAGS:
00010206
[ 886.155456] RAX:
0000000000000020 RBX:
0000000000000000 RCX:
0000000000000000
[ 886.155504] RDX:
0000000000000000 RSI:
0000000000000001 RDI:
004500084e000017
[ 886.155553] RBP:
ffff880210603d50 R08:
00000000fe13d1b6 R09:
0000000000000001
[ 886.155601] R10:
0000000000000000 R11:
0000000000000000 R12:
ffff880209ff2f00
[ 886.155650] R13:
ffff88020ac0fe40 R14:
ffff880209ff2f00 R15:
ffff8800da8e3a80
[ 886.155699] FS:
0000000000000000(0000) GS:
ffff880210600000(0000) knlGS:
0000000000000000
[ 886.155774] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[ 886.155814] CR2:
00007f0e0c925000 CR3:
0000000035e8b000 CR4:
00000000000006f0
[ 886.155865] Stack:
[ 886.155882]
0000000000000000 ffff880210603d78 ffffffff81383f79 ffff880209ff2f00
[ 886.155942]
ffff88020b0c0b40 000000000000c000 ffff880210603d90 ffffffff81383faf
[ 886.156001]
ffff880209ff2f00 ffff880210603da8 ffffffff8138406d ffff88020b1b08c0
[ 886.156061] Call Trace:
[ 886.156080] <IRQ>
[ 886.156095]
[ 886.156112] [<
ffffffff81383f79>] skb_release_data+0xa9/0xc0
[ 886.157656] [<
ffffffff81383faf>] skb_release_all+0x1f/0x30
[ 886.159195] [<
ffffffff8138406d>] consume_skb+0x1d/0x40
[ 886.160719] [<
ffffffff813942e5>] __dev_kfree_skb_any+0x35/0x40
[ 886.162224] [<
ffffffffa02dc1d5>] enic_rq_service.constprop.47+0xe5/0x5a0 [enic]
[ 886.163756] [<
ffffffffa02dc829>] enic_poll_msix_rq+0x199/0x370 [enic]
[ 886.164730] [<
ffffffff81397e29>] net_rx_action+0x139/0x210
[ 886.164730] [<
ffffffff8105fb2e>] __do_softirq+0x14e/0x280
[ 886.164730] [<
ffffffff8105ff2e>] irq_exit+0x8e/0xb0
[ 886.164730] [<
ffffffff8100fc1d>] do_IRQ+0x5d/0x100
[ 886.164730] [<
ffffffff81496832>] common_interrupt+0x72/0x72
fixes:
a03bb56e67c357980dae886683733dab5583dc14 ("enic: implement rx_copybreak")
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 21:40:47 +0000 (16:40 -0500)]
Merge branch 'mlx5-net'
Eli Cohen says:
====================
mlx5_core fixes for 3.18
the following two patches fix races to could lead to kernel panic in some cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Thu, 6 Nov 2014 10:51:22 +0000 (12:51 +0200)]
net/mlx5_core: Fix race on driver load
When events arrive at driver load, the event handler gets called even before
the spinlock and list are initialized. Fix this by moving the initialization
before EQs creation.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Thu, 6 Nov 2014 10:51:21 +0000 (12:51 +0200)]
net/mlx5_core: Fix race in create EQ
After the EQ is created, it can possibly generate interrupts and the interrupt
handler is referencing eq->dev. It is therefore required to set eq->dev before
calling request_irq() so if an event is generated before request_irq() returns,
we will have a valid eq->dev field.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 21:33:13 +0000 (16:33 -0500)]
Merge branch 'net_next_ovs' of git://git./linux/kernel/git/pshelar/openvswitch
Pravin B Shelar says:
====================
Open vSwitch
First two patches are related to OVS MPLS support. Rest of patches
are mostly refactoring and minor improvements to openvswitch.
v1-v2:
- Fix conflicts due to "gue: Remote checksum offload"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 20:16:44 +0000 (15:16 -0500)]
Merge branch 'sunvnet-next'
Sowmini Varadhan says:
====================
sunvnet: bug fixes
This patch series has a coding-style fix and a bug fix.
The coding style fix (patch 1) is the extra indentation flagged by
Ben Hutchings:
http://marc.info/?l=linux-netdev&m=
141529243409594&w=2
The bugfix (patch 2) is the following:
when vnet_event_napi() is called as part of napi_resume
(i.e., continuation of a previous NAPI read that was truncated
due to budget constraints), and then finds no more packets to read,
the code was trying to avoid an additional trip through ldc_rx
as an optimization. However, when this corner case happens, we would
need to reset a number of dring state bits such as rcv_nxt carefully,
which quickly becomes complex and hacky. The cleaner solution
is to just roll back to vnet_poll, re-enable interrupts and set up
dring state as was done in the pre-NAPI version of the driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 6 Nov 2014 19:51:08 +0000 (14:51 -0500)]
sunvnet: Return from vnet_napi_event() if no packets to read
vnet_event_napi() may be called as part of the NAPI ->poll,
to resume reading descriptor rings. When no data is available,
descriptor ring state (e.g., rcv_nxt) needs to be reset
carefully to stay in lock-step with ldc_read(). In the interest
of simplicity, the best way to do this is to return from
vnet_event_napi() when there are no more packets to read.
The next trip through ldc_rx will correctly set up the dring state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: David Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 6 Nov 2014 19:51:02 +0000 (14:51 -0500)]
sunvnet: Fix indentation in maybe_tx_wakeup()
remove redundant tab.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 20:14:36 +0000 (15:14 -0500)]
Merge branch 'r8152-next'
Hayes Wang says:
====================
r8152: rtl_ops_init modify
Initialize the ops through tp->version. This could skip checking
each VID/PID.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 6 Nov 2014 04:47:40 +0000 (12:47 +0800)]
r8152: remove the definitions of the PID
The PIDs are only used in the id table, so the definitions are
unnacessary. Remove them wouldn't have confusion.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 6 Nov 2014 04:47:39 +0000 (12:47 +0800)]
r8152: modify rtl_ops_init
Replace using VID/PID with using tp->version to initialize the ops.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 6 Nov 2014 04:47:38 +0000 (12:47 +0800)]
r8152: move r8152b_get_version
Move r8152b_get_version() to the location before rtl_ops_init().
Then, the rtl_ops_init() could use tp->version.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 5 Nov 2014 23:42:09 +0000 (15:42 -0800)]
sock.h: Remove unused NETDEBUG macro
It's unused now, just delete it.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 5 Nov 2014 23:36:08 +0000 (15:36 -0800)]
net: esp: Convert NETDEBUG to pr_info
Commit
64ce207306de ("[NET]: Make NETDEBUG pure printk wrappers")
originally had these NETDEBUG printks as always emitting.
Commit
a2a316fd068c ("[NET]: Replace CONFIG_NET_DEBUG with sysctl")
added a net_msg_warn sysctl to these NETDEBUG uses.
Convert these NETDEBUG uses to normal pr_info calls.
This changes the output prefix from "ESP: " to include
"IPSec: " for the ipv4 case and "IPv6: " for the ipv6 case.
These output lines are now like the other messages in the files.
Other miscellanea:
Neaten the arithmetic spacing to be consistent with other
arithmetic spacing in the files.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 5 Nov 2014 22:39:21 +0000 (14:39 -0800)]
net; ipv[46] - Remove 2 unnecessary NETDEBUG OOM messages
These messages aren't useful as there's a generic dump_stack()
on OOM.
Neaten the comment and if test above the OOM by separating the
assign in if into an allocation then if test.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 5 Nov 2014 19:01:59 +0000 (20:01 +0100)]
dsa: mv88e6171: Add support for mv88e6172
The mv88e6172 is very similar to the mv88e6171. So extend the
mv88e6171 driver to support it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 5 Nov 2014 18:47:28 +0000 (19:47 +0100)]
net: dsa: slave: Fix autoneg for phys on switch MDIO bus
When the ports phys are connected to the switches internal MDIO bus,
we need to connect the phy to the slave netdev, otherwise
auto-negotiation etc, does not work.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 5 Nov 2014 19:51:51 +0000 (20:51 +0100)]
sched: fix act file names in header comment
Fixes:
4bba3925 ("[PKT_SCHED]: Prefix tc actions with act_")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ryo Munakata [Wed, 5 Nov 2014 14:45:58 +0000 (23:45 +0900)]
net/9p: remove a comment about pref member which doesn't exist
Signed-off-by: Ryo Munakata <ryomnktml@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Wed, 5 Nov 2014 13:03:31 +0000 (18:33 +0530)]
drivers: net: cpsw: remove cpsw_ale_stop from cpsw_ale_destroy
when cpsw is build as modulea and simple insert and removal of module
creates a deadlock, due to delete timer. the timer is created and destroyed
in cpsw_ale_start and cpsw_ale_stop which are from device open and close.
root@am437x-evm:~# modprobe -r ti_cpsw
[ 158.505333] INFO: trying to register non-static key.
[ 158.510623] the code is fine but needs lockdep annotation.
[ 158.516448] turning off the locking correctness validator.
[ 158.522282] CPU: 0 PID: 1339 Comm: modprobe Not tainted
3.14.23-00445-gd41c88f #44
[ 158.530359] [<
c0015380>] (unwind_backtrace) from [<
c0012088>] (show_stack+0x10/0x14)
[ 158.538603] [<
c0012088>] (show_stack) from [<
c054ad70>] (dump_stack+0x78/0x94)
[ 158.546295] [<
c054ad70>] (dump_stack) from [<
c0088008>] (__lock_acquire+0x176c/0x1b74)
[ 158.554711] [<
c0088008>] (__lock_acquire) from [<
c0088944>] (lock_acquire+0x9c/0x104)
[ 158.563043] [<
c0088944>] (lock_acquire) from [<
c004e520>] (del_timer_sync+0x44/0xd8)
[ 158.571289] [<
c004e520>] (del_timer_sync) from [<
bf2eac1c>] (cpsw_ale_destroy+0x10/0x3c [ti_cpsw])
[ 158.580821] [<
bf2eac1c>] (cpsw_ale_destroy [ti_cpsw]) from [<
bf2eb268>] (cpsw_remove+0x30/0xa0 [ti_cpsw])
[ 158.591000] [<
bf2eb268>] (cpsw_remove [ti_cpsw]) from [<
c035ef44>] (platform_drv_remove+0x18/0x1c)
[ 158.600527] [<
c035ef44>] (platform_drv_remove) from [<
c035d8bc>] (__device_release_driver+0x70/0xc8)
[ 158.610236] [<
c035d8bc>] (__device_release_driver) from [<
c035e0d4>] (driver_detach+0xb4/0xb8)
[ 158.619386] [<
c035e0d4>] (driver_detach) from [<
c035d6e4>] (bus_remove_driver+0x4c/0x90)
[ 158.627988] [<
c035d6e4>] (bus_remove_driver) from [<
c00af2a8>] (SyS_delete_module+0x10c/0x198)
[ 158.637144] [<
c00af2a8>] (SyS_delete_module) from [<
c000e580>] (ret_fast_syscall+0x0/0x48)
[ 179.524727] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=2102 jiffies, g=1487, c=1486, q=6)
[ 179.535741] INFO: Stall ended before state dump start
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karl Beldan [Wed, 5 Nov 2014 14:32:59 +0000 (15:32 +0100)]
net: mv643xx_eth: reclaim TX skbs only when released by the HW
ATM, txq_reclaim will dequeue and free an skb for each tx desc released
by the hw that has TX_LAST_DESC set. However, in case of TSO, each
hw desc embedding the last part of a segment has TX_LAST_DESC set,
losing the one-to-one 'last skb frag'/'TX_LAST_DESC set' correspondance,
which causes data corruption.
Fix this by checking TX_ENABLE_INTERRUPT instead of TX_LAST_DESC, and
warn when trying to dequeue from an empty txq (which can be symptomatic
of releasing skbs prematurely).
Fixes:
3ae8f4e0b98 ('net: mv643xx_eth: Implement software TSO')
Reported-by: Slawomir Gajzner <slawomir.gajzner@gmail.com>
Reported-by: Julien D'Ascenzio <jdascenzio@yahoo.fr>
Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Ian Campbell <ijc@hellion.org.uk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 19:43:13 +0000 (14:43 -0500)]
Merge branch 'sfc-next'
Shradha Shah says:
====================
sfc: Clean up Siena SR-IOV support in preparation for EF10 SR-IOV support
This patch series provides a base and clean up for the upcoming
EF10 SRIOV patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 5 Nov 2014 12:16:46 +0000 (12:16 +0000)]
sfc: Add NIC type operations to replace direct calls from efx.c into siena_sriov.c
Also add dummy functions where required to avoid NULL pointer dereference.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 5 Nov 2014 12:16:32 +0000 (12:16 +0000)]
sfc: Rename implementations in siena_sriov.c to have a 'siena' prefix
Patch in preparation for the upcoming EF10 sriov support.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 5 Nov 2014 12:16:18 +0000 (12:16 +0000)]
sfc: Move the current VF state from efx_nic into siena_nic_data
This patch series provides a base and cleanup for the
upcoming EF10 SRIOV support.
This patch moves the VF state into siena_nic_data as a basis to
save the VF state based on nic type.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Malcolm Crossley [Wed, 5 Nov 2014 10:50:22 +0000 (10:50 +0000)]
xen-netback: remove unconditional __pskb_pull_tail() in guest Tx path
Unconditionally pulling 128 bytes into the linear area is not required
for:
- security: Every protocol demux starts with pskb_may_pull() to pull
frag data into the linear area, if necessary, before looking at
headers.
- performance: Netback has already grant copied up-to 128 bytes from
the first slot of a packet into the linear area. The first slot
normally contain all the IPv4/IPv6 and TCP/UDP headers.
The unconditional pull would often copy frag data unnecessarily. This
is a performance problem when running on a version of Xen where grant
unmap avoids TLB flushes for pages which are not accessed. TLB
flushes can now be avoided for > 99% of unmaps (it was 0% before).
Grant unmap TLB flush avoidance will be available in a future version
of Xen (probably 4.6).
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 19:39:01 +0000 (14:39 -0500)]
Merge branch 'stmmac-next'
Andy Shevchenko says:
====================
stmmac: pci: various cleanups and fixes
There are few cleanups and fixes regarding to stmmac PCI driver.
This has been tested on Intel Galileo board with recent net-next tree.
Since v2:
- drop patch 5/5 since it will be part of a big change across entire subsystem
Since v1:
- remove already applied patch
- append patch 1/5
- rework patch 3/5 to be functional compatible with original code
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 5 Nov 2014 10:27:29 +0000 (12:27 +0200)]
stmmac: pci: convert to use dev_* macros
Instead of pr_* macros let's use dev_* macros which provide device name.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 5 Nov 2014 10:27:28 +0000 (12:27 +0200)]
stmmac: pci: use managed resources
Migrate pci driver to managed resources to reduce boilerplate error handling
code.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 5 Nov 2014 10:27:27 +0000 (12:27 +0200)]
stmmac: pci: convert to use dev_pm_ops
Convert system PM callbacks to use dev_pm_ops. In addition remove the PCI calls
related to a power state since the bus code cares about this already.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 5 Nov 2014 10:27:26 +0000 (12:27 +0200)]
stmmac: pci: use defined constant instead of magic number
The last standard PCI resource is defined as PCI_STD_RESOURCE_END. Thus, we
could use it instead of plain integer.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 5 Nov 2014 09:45:32 +0000 (11:45 +0200)]
stmmac: fix sparse warnings
This patch fixes the following sparse warnings.
drivers/net/ethernet/stmicro/stmmac/enh_desc.c:381:30: warning: symbol 'enh_desc_ops' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:253:30: warning: symbol 'ndesc_ops' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c:141:33: warning: symbol 'stmmac_ptp' was not declared. Should it be static?
There is no functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Wed, 5 Nov 2014 07:03:50 +0000 (08:03 +0100)]
ip6_tunnel: Add support for wildcard tunnel endpoints.
This patch adds support for tunnels with local or
remote wildcard endpoints. With this we get a
NBMA tunnel mode like we have it for ipv4 and
sit tunnels.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Wed, 5 Nov 2014 07:02:48 +0000 (08:02 +0100)]
ipv6: Allow sending packets through tunnels with wildcard endpoints
Currently we need the IP6_TNL_F_CAP_XMIT capabiltiy to transmit
packets through an ipv6 tunnel. This capability is set when the
tunnel gets configured, based on the tunnel endpoint addresses.
On tunnels with wildcard tunnel endpoints, we need to do the
capabiltiy checking on a per packet basis like it is done in
the receive path.
This patch extends ip6_tnl_xmit_ctl() to take local and remote
addresses as parameters to allow for per packet capabiltiy
checking.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin B Shelar [Sun, 19 Oct 2014 19:03:40 +0000 (12:03 -0700)]
openvswitch: Avoid NULL mask check while building mask
OVS does mask validation even if it does not need to convert
netlink mask attributes to mask structure. ovs_nla_get_match()
caller can pass NULL mask structure pointer if the caller does
not need mask. Therefore NULL check is required in SW_FLOW_KEY*
macros. Following patch does not convert mask netlink attributes
if mask pointer is NULL, so we do not need these checks in
SW_FLOW_KEY* macro.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Andy Zhou <azhou@nicira.com>
Pravin B Shelar [Sun, 19 Oct 2014 18:19:51 +0000 (11:19 -0700)]
openvswitch: Refactor action alloc and copy api.
There are two separate API to allocate and copy actions list. Anytime
OVS needs to copy action list, it needs to call both functions.
Following patch moves action allocation to copy function to avoid
code duplication.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Joe Stringer [Sat, 18 Oct 2014 23:14:14 +0000 (16:14 -0700)]
openvswitch: Move key_attr_size() to flow_netlink.h.
flow-netlink has netlink related code.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Lorand Jakab [Mon, 6 Oct 2014 12:45:32 +0000 (05:45 -0700)]
openvswitch: Remove flow member from struct ovs_skb_cb
The 'flow' memeber was chosen for removal because it's only used
in ovs_execute_actions() we can pass it as argument to this
function.
Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Jarno Rajahalme [Tue, 30 Sep 2014 17:52:32 +0000 (10:52 -0700)]
openvswitch: Fix the type of struct ovs_key_nd nd_target field.
Should be the same as other IPv6 address fields.
Current master produces sparse warnings without this change.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Chunhe Li [Mon, 8 Sep 2014 20:17:21 +0000 (13:17 -0700)]
openvswitch: Drop packets when interdev is not up
If the internal device is not up, it should drop received
packets. Sometimes it receive the broadcast or multicast
packets, and the ip protocol stack will casue more cpu
usage wasted.
Signed-off-by: Chunhe Li <lichunhe@huawei.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Andy Zhou [Mon, 8 Sep 2014 20:14:22 +0000 (13:14 -0700)]
openvswitch: Refactor get_dp() function into multiple access APIs.
Avoid recursive read_rcu_lock() by using the lighter weight
get_dp_rcu() API. Add proper locking assertions to get_dp().
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Joe Stringer [Mon, 8 Sep 2014 20:09:37 +0000 (13:09 -0700)]
openvswitch: Refactor ovs_flow_cmd_fill_info().
Split up ovs_flow_cmd_fill_info() to make it easier to cache parts of a
dump reply. This will be used to streamline flow_dump in a future patch.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Andy Zhou [Mon, 8 Sep 2014 07:35:02 +0000 (00:35 -0700)]
openvswitch: refactor do_output() to move NULL check out of fast path
skb_clone() NULL check is implemented in do_output(), as past of the
common (fast) path. Refactoring so that NULL check is done in the
slow path, immediately after skb_clone() is called.
Besides optimization, this change also improves code readability by
making the skb_clone() NULL check consistent within OVS datapath
module.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Jesse Gross [Mon, 6 Oct 2014 12:08:38 +0000 (05:08 -0700)]
openvswitch: Additional logging for -EINVAL on flow setups.
There are many possible ways that a flow can be invalid so we've
added logging for most of them. This adds logs for the remaining
possible cases so there isn't any ambiguity while debugging.
CC: Federico Iezzi <fiezzi@enter.it>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Joe Stringer [Mon, 8 Sep 2014 05:11:08 +0000 (22:11 -0700)]
openvswitch: Remove redundant tcp_flags code.
These two cases used to be treated differently for IPv4/IPv6,
but they are now identical.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Pravin B Shelar [Wed, 7 May 2014 01:41:20 +0000 (18:41 -0700)]
openvswitch: Move table destroy to dp-rcu callback.
Ths simplifies flow-table-destroy API. No need to pass explicit
parameter about context.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@redhat.com>
Simon Horman [Mon, 6 Oct 2014 12:05:13 +0000 (05:05 -0700)]
openvswitch: Add basic MPLS support to kernel
Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.
Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.
Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Pravin B Shelar [Wed, 5 Nov 2014 23:27:48 +0000 (15:27 -0800)]
net: Remove MPLS GSO feature.
Device can export MPLS GSO support in dev->mpls_features same way
it export vlan features in dev->vlan_features. So it is safe to
remove NETIF_F_GSO_MPLS redundant flag.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Tom Herbert [Thu, 6 Nov 2014 00:49:38 +0000 (16:49 -0800)]
fou: Fix typo in returning flags in netlink
When filling netlink info, dport is being returned as flags. Fix
instances to return correct value.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Wed, 5 Nov 2014 02:17:02 +0000 (10:17 +0800)]
r8152: disable the tasklet by default
Let the tasklet only be enabled after open(), and be disabled for
the other situation. The tasklet is only necessary after open() for
tx/rx, so it could be disabled by default.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 5 Nov 2014 19:27:38 +0000 (20:27 +0100)]
ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():
skbuff: skb_over_panic: text:
ffffffff80612a5d len:3776 put:20
head:
ffff88046d751000 data:
ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<
ffffffff80578226>] ? skb_put+0x3a/0x3b
[<
ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<
ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<
ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<
ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<
ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<
ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<
ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<
ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<
ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<
ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit
72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.
However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.
The IGMP case doesn't have this issue as commit
57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in
the cb[].
Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().
Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes:
72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Tue, 4 Nov 2014 23:37:03 +0000 (15:37 -0800)]
net: Convert SEQ_START_TOKEN/seq_printf to seq_puts
Using a single fixed string is smaller code size than using
a format and many string arguments.
Reduces overall code size a little.
$ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
text data bss dec hex filename
34269 7012 14824 56105 db29 net/ipv4/igmp.o.new
34315 7012 14824 56151 db57 net/ipv4/igmp.o.old
30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new
30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old
11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new
11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Tue, 4 Nov 2014 23:23:04 +0000 (00:23 +0100)]
fast_hash: avoid indirect function calls
By default the arch_fast_hash hashing function pointers are initialized
to jhash(2). If during boot-up a CPU with SSE4.2 is detected they get
updated to the CRC32 ones. This dispatching scheme incurs a function
pointer lookup and indirect call for every hashing operation.
rhashtable as a user of arch_fast_hash e.g. stores pointers to hashing
functions in its structure, too, causing two indirect branches per
hashing operation.
Using alternative_call we can get away with one of those indirect branches.
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Nov 2014 02:50:43 +0000 (21:50 -0500)]
Merge branch 'amd-xgbe-next'
Tom Lendacky says:
====================
amd-xgbe: AMD XGBE driver updates 2014-11-04
The following series of patches includes functional updates to the
driver as well as some trivial changes for function renaming and
spelling fixes.
- Move channel and ring structure allocation into the device open path
- Rename the pre_xmit function to dev_xmit
- Explicitly use the u32 data type for the device descriptors
- Use page allocation for the receive buffers
- Add support for split header/payload receive
- Add support for per DMA channel interrupts
- Add support for receive side scaling (RSS)
- Add support for ethtool receive side scaling commands
- Fix the spelling of descriptors
- After a PCS reset, sync the PCS and PHY modes
- Add dependency on HAS_IOMEM to both the amd-xgbe and amd-xgbe-phy
drivers
This patch series is based on net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:46 +0000 (16:07 -0600)]
amd-xgbe-phy: Let AMD_XGBE_PHY depend on HAS_IOMEM
The amd-xgbe-phy driver needs to perform ioremap calls, so add HAS_IOMEM
to its build dependency.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:41 +0000 (16:07 -0600)]
amd-xgbe: Let AMD_XGBE depend on HAS_IOMEM
The amd-xgbe driver needs to perform ioremap calls, so add HAS_IOMEM
to its build dependency.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:35 +0000 (16:07 -0600)]
amd-xgbe-phy: Sync PCS and PHY modes after reset
This patch adds support to sync the states of the PCS and the PHY
after a reset is performed. If the PCS and the PHY are not in the
same state after reset an extra mode change would be performed. This
extra mode change might not be needed if the PCS and the PHY are
synced up after reset.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:29 +0000 (16:07 -0600)]
amd-xgbe: Fix a spelling error
This patch fixes the spelling of the word "descriptor" in a couple
of locations.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:23 +0000 (16:07 -0600)]
amd-xgbe: Add receive side scaling ethtool support
This patch adds support for ethtool receive side scaling (RSS) commands.
Support is added to get/set the RSS hash key and the RSS lookup table.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:07:02 +0000 (16:07 -0600)]
amd-xgbe: Provide support for receive side scaling
This patch provides support for receive side scaling (RSS). RSS allows
for spreading incoming network packets across the Rx queues. When used
in conjunction with the per DMA channel interrupt support, this allows
the receive processing to be spread across multiple processors.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:56 +0000 (16:06 -0600)]
amd-xgbe: Add support for per DMA channel interrupts
This patch provides support for interrupts that are generated by the
Tx/Rx DMA channel pairs of the device. This allows for Tx and Rx
processing to run across multiple processsors.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:50 +0000 (16:06 -0600)]
amd-xgbe: Implement split header receive support
Provide support for splitting IP packets so that the header and
payload can be sent to different DMA addresses. This will allow
the IP header to be put into the linear part of the skb while the
payload can be added as frags.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:44 +0000 (16:06 -0600)]
amd-xgbe: Use page allocations for Rx buffers
Use page allocations for Rx buffers instead of pre-allocating skbs
of a set size.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:37 +0000 (16:06 -0600)]
amd-xgbe: Use the u32 data type for descriptors
The Tx and Rx descriptors are unsigned 32 bit values. Use the u32
type, rather than unsigned int, to map these descriptors.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:32 +0000 (16:06 -0600)]
amd-xgbe: Rename pre_xmit function to dev_xmit
The pre_xmit function name implies that it performs operations prior
to transmitting the packet when in fact it is responsible for setting
up the descriptors and initiating the transmit. Rename this to
function from pre_xmit to dev_xmit, which is consistent with the name
used during receive processing - dev_read.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 4 Nov 2014 22:06:26 +0000 (16:06 -0600)]
amd-xgbe: Move ring allocation to device open
Move the channel and ring tracking structures allocation to device
open. This will allow for future support to vary the number of Tx/Rx
queues without unloading the module.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory Fong [Tue, 4 Nov 2014 19:21:21 +0000 (11:21 -0800)]
bridge: include in6.h in if_bridge.h for struct in6_addr
if_bridge.h uses struct in6_addr ip6, but wasn't including the in6.h
header. Thomas Backlund originally sent a patch to do this, but this
revealed a redefinition issue: https://lkml.org/lkml/2013/1/13/116
The redefinition issue should have been fixed by the following Linux
commits:
ee262ad827f89e2dc7851ec2986953b5b125c6bc inet: defines IPPROTO_* needed for module alias generation
cfd280c91253cc28e4919e349fa7a813b63e71e8 net: sync some IP headers with glibc
and the following glibc commit:
6c82a2f8d7c8e21e39237225c819f182ae438db3 Coordinate IPv6 definitions for Linux and glibc
so actually include the header now.
Reported-by: Colin Guthrie <colin@mageia.org>
Reported-by: Christiaan Welvaart <cjw@daneel.dyndns.org>
Reported-by: Thomas Backlund <tmb@mageia.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Leitner [Tue, 4 Nov 2014 19:15:08 +0000 (17:15 -0200)]
tcp: zero retrans_stamp if all retrans were acked
Ueki Kohei reported that when we are using NewReno with connections that
have a very low traffic, we may timeout the connection too early if a
second loss occurs after the first one was successfully acked but no
data was transfered later. Below is his description of it:
When SACK is disabled, and a socket suffers multiple separate TCP
retransmissions, that socket's ETIMEDOUT value is calculated from the
time of the *first* retransmission instead of the *latest*
retransmission.
This happens because the tcp_sock's retrans_stamp is set once then never
cleared.
Take the following connection:
Linux remote-machine
| |
send#1---->(*1)|--------> data#1 --------->|
| | |
RTO : :
| | |
---(*2)|----> data#1(retrans) ---->|
| (*3)|<---------- ACK <----------|
| | |
| : :
| : :
| : :
16 minutes (or more) :
| : :
| : :
| : :
| | |
send#2---->(*4)|--------> data#2 --------->|
| | |
RTO : :
| | |
---(*5)|----> data#2(retrans) ---->|
| | |
| | |
RTO*2 : :
| | |
| | |
ETIMEDOUT<----(*6)| |
(*1) One data packet sent.
(*2) Because no ACK packet is received, the packet is retransmitted.
(*3) The ACK packet is received. The transmitted packet is acknowledged.
At this point the first "retransmission event" has passed and been
recovered from. Any future retransmission is a completely new "event".
(*4) After 16 minutes (to correspond with retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and (*4).
The socket's timeout SHOULD be calculated from this point in time, but
instead it's calculated from the prior "event" 16 minutes ago.
(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT.
Therefore, now we clear retrans_stamp as soon as all data during the
loss window is fully acked.
Reported-by: Ueki Kohei
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Tue, 4 Nov 2014 18:59:47 +0000 (10:59 -0800)]
ipv6: move INET6_MATCH() to include/net/inet6_hashtables.h
It is only used in net/ipv6/inet6_hashtables.c.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 5 Nov 2014 21:46:40 +0000 (16:46 -0500)]
net: Add and use skb_copy_datagram_msg() helper.
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".
When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.
Having a helper like this means there will be less places to touch
during that transformation.
Based upon descriptions and patch from Al Viro.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 5 Nov 2014 21:34:47 +0000 (16:34 -0500)]
Merge branch 'gue-next'
Tom Herbert says:
====================
gue: Remote checksum offload
This patch set implements remote checksum offload for
GUE, which is a mechanism that provides checksum offload of
encapsulated packets using rudimentary offload capabilities found in
most Network Interface Card (NIC) devices. The outer header checksum
for UDP is enabled in packets and, with some additional meta
information in the GUE header, a receiver is able to deduce the
checksum to be set for an inner encapsulated packet. Effectively this
offloads the computation of the inner checksum. Enabling the outer
checksum in encapsulation has the additional advantage that it covers
more of the packet than the inner checksum including the encapsulation
headers.
Remote checksum offload is described in:
http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
The GUE transmit and receive paths are modified to support the
remote checksum offload option. The option contains a checksum
offset and checksum start which are directly derived from values
set in stack when doing CHECKSUM_PARTIAL. On receipt of the option, the
operation is to calculate the packet checksum from "start" to end of
the packet (normally derived for checksum complete), and then set
the resultant value at checksum "offset" (the checksum field has
already been primed with the pseudo header). This emulates a NIC
that implements NETIF_F_HW_CSUM.
The primary purpose of this feature is to eliminate cost of performing
checksum calculation over a packet when encpasulating.
In this patch set:
- Move fou_build_header into fou.c and split it into a couple of
functions
- Enable offloading of outer UDP checksum in encapsulation
- Change udp_offload to support remote checksum offload, includes
new GSO type and ensuring encapsulated layers (TCP) doesn't try to
set a checksum covered by RCO
- TX support for RCO with GUE. This is configured through ip_tunnel
and set the option on transmit when packet being encapsulated is
CHECKSUM_PARTIAL
- RX support for RCO with GUE for normal and GRO paths. Includes
resolving the offloaded checksum
v2:
Address comments from davem: Move accounting for private option
field in gue_encap_hlen to patch in which we add the remote checksum
offload option.
Testing:
I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200
streams, comparing GUE with and without remote checksum offload (doing
checksum-unnecessary to complete conversion in both cases). These
were run on mlnx4 and bnx2x. Some mlnx4 results are below.
GRE/GUE
TCP_STREAM
IPv4, with remote checksum offload
9.71% TX CPU utilization
7.42% RX CPU utilization
36380 Mbps
IPv4, without remote checksum offload
12.40% TX CPU utilization
7.36% RX CPU utilization
36591 Mbps
TCP_RR
IPv4, with remote checksum offload
77.79% CPU utilization
91/144/216 90/95/99% latencies
1.95127e+06 tps
IPv4, without remote checksum offload
78.70% CPU utilization
89/152/297 90/95/99% latencies
1.95458e+06 tps
IPIP/GUE
TCP_STREAM
With remote checksum offload
10.30% TX CPU utilization
7.43% RX CPU utilization
36486 Mbps
Without remote checksum offload
12.47% TX CPU utilization
7.49% RX CPU utilization
36694 Mbps
TCP_RR
With remote checksum offload
77.80% CPU utilization
87/153/270 90/95/99% latencies
1.98735e+06 tps
Without remote checksum offload
77.98% CPU utilization
87/150/287 90/95/99% latencies
1.98737e+06 tps
SIT/GUE
TCP_STREAM
With remote checksum offload
9.68% TX CPU utilization
7.36% RX CPU utilization
35971 Mbps
Without remote checksum offload
12.95% TX CPU utilization
8.04% RX CPU utilization
36177 Mbps
TCP_RR
With remote checksum offload
79.32% CPU utilization
94/158/295 90/95/99% latencies
1.88842e+06 tps
Without remote checksum offload
80.23% CPU utilization
94/149/226 90/95/99% latencies
1.90338e+06 tps
VXLAN
TCP_STREAM
35.03% TX CPU utilization
20.85% RX CPU utilization
36230 Mbps
TCP_RR
77.36% CPU utilization
84/146/270 90/95/99% latencies
2.08063e+06 tps
We can also look at CPU time in csum_partial using perf (with bnx2x
setup). For GRE with TCP_STREAM I see:
With remote checksum offload
0.33% TX
1.81% RX
Without remote checksum offload
6.00% TX
0.51% RX
I suspect the fact that time in csum_partial noticably increases
with remote checksum offload for RX is due to taking the cache miss on
the encapsulated header in that function. By similar reasoning, if on
the TX side the packet were not in cache (say we did a splice from a
file whose data was never touched by the CPU) the CPU savings for TX
would probably be more pronounced.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>