GitHub/LineageOS/G12/android_kernel_amlogic_linux-4.9.git
10 years agotcp: refine TSO autosizing
Eric Dumazet [Sun, 7 Dec 2014 20:22:18 +0000 (12:22 -0800)]
tcp: refine TSO autosizing

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because :

 - It sends small TSO packets even in Slow Start where rate quickly
   increases.
 - It tends to make socket write queue very big, increasing tcp_ack()
   processing time, but also increasing memory needs, not necessarily
   accounted for, as fast clones overhead is currently ignored.
 - Lower GRO efficiency and more ACK packets.

Servers with a lot of small lived connections suffer from this.

Lets instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.

As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.

Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/s

 87380 2000000 2000000    0.36         44.22
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0

After patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380 2000000 2000000    0.36       44.27
IpInReceives                    221                0.0
IpOutRequests                   232                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2013953            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agochelsio: fix misspelling of current function in string
Julia Lawall [Sun, 7 Dec 2014 19:20:56 +0000 (20:20 +0100)]
chelsio: fix misspelling of current function in string

Replace a misspelled function name by %s and then __func__.

This was done using Coccinelle, including the use of Levenshtein distance,
as proposed by Rasmus Villemoes.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agohp100: fix misspelling of current function in string
Julia Lawall [Sun, 7 Dec 2014 19:20:54 +0000 (20:20 +0100)]
hp100: fix misspelling of current function in string

Replace a misspelled function name by %s and then __func__.

This was done using Coccinelle, including the use of Levenshtein distance,
as proposed by Rasmus Villemoes.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agouli526x: fix misspelling of current function in string
Julia Lawall [Sun, 7 Dec 2014 19:20:52 +0000 (20:20 +0100)]
uli526x: fix misspelling of current function in string

Replace a misspelled function name by %s and then __func__.

This was done using Coccinelle, including the use of Levenshtein distance,
as proposed by Rasmus Villemoes.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoisdn: fix misspelling of current function in string
Julia Lawall [Sun, 7 Dec 2014 19:20:47 +0000 (20:20 +0100)]
isdn: fix misspelling of current function in string

Replace a misspelled function name by %s and then __func__.

In the first case, the print is just dropped, because kmalloc itself does
enough error reporting.

This was done using Coccinelle, including the use of Levenshtein distance,
as proposed by Rasmus Villemoes.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodmfe: fix misspelling of current function in string
Julia Lawall [Sun, 7 Dec 2014 19:20:46 +0000 (20:20 +0100)]
dmfe: fix misspelling of current function in string

The function name contains cleanup, not clean.

This was done using Coccinelle, including the use of Levenshtein distance,
as proposed by Rasmus Villemoes.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomacvlan: play well with ipvlan device
Mahesh Bandewar [Sat, 6 Dec 2014 23:53:46 +0000 (15:53 -0800)]
macvlan: play well with ipvlan device

If device is already used as an ipvlan port then refuse to
use it as a macvlan port at early stage of port creation.

thost1:~# ip link add link eth0 ipvl0 type ipvlan
thost1:~# echo $?
0
thost1:~# ip link add link eth0 mvl0 type macvlan
RTNETLINK answers: Device or resource busy
thost1:~# echo $?
2
thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoipvlan: move the device check function into netdevice.h
Mahesh Bandewar [Sat, 6 Dec 2014 23:53:33 +0000 (15:53 -0800)]
ipvlan: move the device check function into netdevice.h

Move the port check [ipvlan_dev_master()] and device check
[ipvlan_dev_slave()] functions to netdevice.h and rename them
netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be
consistent with macvlan api naming.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoipvlan: play well with macvlan device
Mahesh Bandewar [Sat, 6 Dec 2014 23:53:19 +0000 (15:53 -0800)]
ipvlan: play well with macvlan device

If a device is already a macvlan port then refuse to use it as
an ipvlan port in the early stage of port creation.

thost1:~# ip link add link eth0 mvl0 type macvlan
thost1:~# echo $?
0
thost1:~# ip link add link eth0 ipvl0 type ipvlan
RTNETLINK answers: Device or resource busy
thost1:~# echo $?
2
thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonetdevice: Add a function to check macvlan port
Mahesh Bandewar [Sat, 6 Dec 2014 23:53:04 +0000 (15:53 -0800)]
netdevice: Add a function to check macvlan port

Similar to a check for macvlan device, netif_is_macvlan(), add
another function to check if a device is used as macvlan port.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodst: no need to take reference on DST_NOCACHE dsts
Hannes Frederic Sowa [Sat, 6 Dec 2014 18:19:42 +0000 (19:19 +0100)]
dst: no need to take reference on DST_NOCACHE dsts

Since commit f8864972126899 ("ipv4: fix dst race in sk_dst_get()")
DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
reference on them when we are in rcu protected sections.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodummy: add support for ethtool get_drvinfo
Flavio Leitner [Sat, 6 Dec 2014 00:13:24 +0000 (22:13 -0200)]
dummy: add support for ethtool get_drvinfo

The command 'ethtool -i' is useful to find details
about the interface like the device driver being used.
This was missing for dummy driver.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoDocumentation (ixgbe.txt): use a decimal address.
Rami Rosen [Fri, 5 Dec 2014 17:35:43 +0000 (19:35 +0200)]
Documentation (ixgbe.txt): use a decimal address.

This patch fixes the erronous usage of an hexadecimal address in the
example, by replacing it with a decimal address.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoopenvswitch: set correct protocol on route lookup
Jiri Benc [Fri, 5 Dec 2014 16:24:28 +0000 (17:24 +0100)]
openvswitch: set correct protocol on route lookup

Respect what the caller passed to ovs_tunnel_get_egress_info.

Fixes: 8f0aad6f35f7e ("openvswitch: Extend packet attribute for egress tunnel info")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomacvlan: allow setting LRO independently of lower device
Michal Kubeček [Fri, 5 Dec 2014 16:05:49 +0000 (17:05 +0100)]
macvlan: allow setting LRO independently of lower device

Since commit fbe168ba91f7 ("net: generic dev_disable_lro() stacked
device handling"), dev_disable_lro() zeroes NETIF_F_LRO feature flag
first for a macvlan device and then for its lower device. As an attempt
to set NETIF_F_LRO to zero is ignored, dev_disable_lro() issues a
warning and taints kernel.

Allowing NETIF_F_LRO to be set independently of the lower device
consists of three parts:

  - add the flag to hw_features to allow toggling it
  - allow setting it to 0 even if lower device has the flag set
  - add the flag to MACVLAN_FEATURES to restore copying from lower
    device on macvlan creation

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: cls_basic: fix error path in basic_change()
Jiri Pirko [Fri, 5 Dec 2014 14:50:22 +0000 (15:50 +0100)]
net: sched: cls_basic: fix error path in basic_change()

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: macb: Remove obsolete comment from Kconfig
James Byrne [Fri, 5 Dec 2014 13:03:53 +0000 (13:03 +0000)]
net: macb: Remove obsolete comment from Kconfig

The Kconfig file says that Gigabit mode is not supported, but it has been
supported since commit 140b7552fdff04bbceeb842f0e04f0b4015fe97b ("net/macb:
Add support for Gigabit Ethernet mode").

Signed-off-by: James Byrne <james.byrne@origamienergy.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: ethernet: rocker: Add dependency to CONFIG_BRIDGE in Kconfig
Andreas Ruprecht [Fri, 5 Dec 2014 08:28:03 +0000 (09:28 +0100)]
net: ethernet: rocker: Add dependency to CONFIG_BRIDGE in Kconfig

In a configuration with CONFIG_BRIDGE set to 'm' and CONFIG_ROCKER
set to 'y', undefined references occur at link time:

> drivers/built-in.o: In function `rocker_port_fdb_learn_work':
> /home/jim/linux/drivers/net/ethernet/rocker/rocker.c:3014: undefined
> reference to `br_fdb_external_learn_del'
> /home/jim/linux/drivers/net/ethernet/rocker/rocker.c:3016: undefined
> reference to `br_fdb_external_learn_add'

This patch fixes these by declaring CONFIG_ROCKER as being dependent
on CONFIG_BRIDGE.

Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Andreas Ruprecht <rupran@einserver.de>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/socket.c : introduce helper function do_sock_sendmsg to replace reduplicate code
Gu Zheng [Fri, 5 Dec 2014 07:14:23 +0000 (15:14 +0800)]
net/socket.c : introduce helper function do_sock_sendmsg to replace reduplicate code

Introduce helper function do_sock_sendmsg() to simplify sock_sendmsg{_nosec},
and replace reduplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotcp_cubic: refine Hystart delay threshold
Eric Dumazet [Fri, 5 Dec 2014 00:13:49 +0000 (16:13 -0800)]
tcp_cubic: refine Hystart delay threshold

In commit 2b4636a5f8ca ("tcp_cubic: make the delay threshold of HyStart
less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.

The remaining problem is that using delay_min + (delay_min/16) as the
threshold is too sensitive.

6.25 % of variation is too small for rtt above 60 ms, which are not
uncommon.

Lets use 12.5 % instead (delay_min + (delay_min/8))

Tested:
 80 ms RTT between peers, FQ/pacing packet scheduler on sender.
 10 bulk transfers of 10 seconds :

nstat >/dev/null
for i in `seq 1 10`
 do
   netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
 done
nstat | grep Hystart

With the 6.25 % threshold :

THROUGHPUT=20.66
THROUGHPUT=249.38
THROUGHPUT=254.10
THROUGHPUT=14.94
THROUGHPUT=251.92
THROUGHPUT=237.73
THROUGHPUT=19.18
THROUGHPUT=252.89
THROUGHPUT=21.32
THROUGHPUT=15.58
TcpExtTCPHystartTrainDetect     2                  0.0
TcpExtTCPHystartTrainCwnd       4756               0.0
TcpExtTCPHystartDelayDetect     5                  0.0
TcpExtTCPHystartDelayCwnd       180                0.0

With the 12.5 % threshold
THROUGHPUT=251.09
THROUGHPUT=247.46
THROUGHPUT=250.92
THROUGHPUT=248.91
THROUGHPUT=250.88
THROUGHPUT=249.84
THROUGHPUT=250.51
THROUGHPUT=254.15
THROUGHPUT=250.62
THROUGHPUT=250.89
TcpExtTCPHystartTrainDetect     1                  0.0
TcpExtTCPHystartTrainCwnd       3175               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotcp_cubic: add SNMP counters to track how effective is Hystart
Eric Dumazet [Fri, 5 Dec 2014 00:13:23 +0000 (16:13 -0800)]
tcp_cubic: add SNMP counters to track how effective is Hystart

When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
triggers too soon.

Having SNMP counters to have an idea of how often the various Hystart
methods trigger is useful prior to any modifications.

This patch adds SNMP counters tracking, how many time "ack train" or
"Delay" based Hystart triggers, and cumulative sum of cwnd at the time
Hystart decided to end SS (Slow Start)

myhost:~# nstat -a | grep Hystart
TcpExtTCPHystartTrainDetect     9                  0.0
TcpExtTCPHystartTrainCwnd       20650              0.0
TcpExtTCPHystartDelayDetect     10                 0.0
TcpExtTCPHystartDelayCwnd       360                0.0

->
 Train detection was triggered 9 times, and average cwnd was
 20650/9=2294,
 Delay detection was triggered 10 times and average cwnd was 36

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agox86: bpf_jit_comp: Remove inline from static function definitions
Joe Perches [Fri, 5 Dec 2014 01:01:24 +0000 (17:01 -0800)]
x86: bpf_jit_comp: Remove inline from static function definitions

Let the compiler decide instead.

No change in object size x86-64 -O2 no profiling

Signed-off-by: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agox86: bpf_jit_comp: Reduce is_ereg() code size
Joe Perches [Thu, 4 Dec 2014 23:00:48 +0000 (15:00 -0800)]
x86: bpf_jit_comp: Reduce is_ereg() code size

Use the (1 << reg) & mask trick to reduce code size.

x86-64 size difference -O2 without profiling for various
gcc versions:

$ size arch/x86/net/bpf_jit_comp.o*
   text    data     bss     dec     hex filename
   9266       4       0    9270    2436 arch/x86/net/bpf_jit_comp.o.4.4.new
  10042       4       0   10046    273e arch/x86/net/bpf_jit_comp.o.4.4.old
   9109       4       0    9113    2399 arch/x86/net/bpf_jit_comp.o.4.6.new
   9717       4       0    9721    25f9 arch/x86/net/bpf_jit_comp.o.4.6.old
   8789       4       0    8793    2259 arch/x86/net/bpf_jit_comp.o.4.7.new
  10245       4       0   10249    2809 arch/x86/net/bpf_jit_comp.o.4.7.old
   9671       4       0    9675    25cb arch/x86/net/bpf_jit_comp.o.4.9.new
  10679       4       0   10683    29bb arch/x86/net/bpf_jit_comp.o.4.9.old

Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: cls: remove unused op put from tcf_proto_ops
Jiri Pirko [Thu, 4 Dec 2014 20:41:18 +0000 (21:41 +0100)]
net: sched: cls: remove unused op put from tcf_proto_ops

It is never called and implementations are void. So just remove it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agobnx2x: Use correct fastpath version for VFs.
Yuval Mintz [Thu, 4 Dec 2014 10:52:06 +0000 (12:52 +0200)]
bnx2x: Use correct fastpath version for VFs.

Our FW can support several fastpath HSI [for backward compatibility] but up
until now VFs were always configured to use latest fastpath HSI [although VF
driver might be older and use an older fastpath HSI].

For linux drivers, the differences are insignificant since driver never
utilized features that were overridden by the HSI change. But for VMs running
other operating systems this might be a problem.
In addition, eventually FW might change fastpath HSI in such a manner that
backward compatibility WILL break unless configured with proper version.

This patch fixes the issue for other operating system VMs, as well as lays
the ground work for forward compatibility in regard to the fastpath HSI.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: tulip: Remove private "strncmp"
Rasmus Villemoes [Thu, 4 Dec 2014 10:30:40 +0000 (11:30 +0100)]
net: tulip: Remove private "strncmp"

The comment says that the built-in strncmp didn't work. That is not
surprising, as apparently "str" semantics are not really what is
wanted (hint: de4x5_strncmp only stops when two different bytes are
encountered or the end is reached; not if either byte happens to be
0). de4x5_strncmp is actually a memcmp (except for the signature and
that bytes are not necessarily treated as unsigned char); since only
the boolean value of the result is used we can just replace
de4x5_strncmp with memcmp.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agodrivers: net : cpsw: Update Kconfig for CPSW
Lokesh Vutla [Thu, 4 Dec 2014 04:54:29 +0000 (10:24 +0530)]
drivers: net : cpsw: Update Kconfig for CPSW

CPSW is present in AM33xx, AM43xx, DRA7xx.
Updating the Kconfig to depend on ARCH_OMAP2PLUS instead of listing
all SoC's.

Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: fix missing spinlock init and nullptr oops
Erik Hugne [Wed, 3 Dec 2014 15:58:40 +0000 (16:58 +0100)]
tipc: fix missing spinlock init and nullptr oops

commit 908344cdda80 ("tipc: fix bug in multicast congestion
handling") introduced two bugs with the bclink wakeup
function. This commit fixes the missing spinlock init for the
waiting_sks list. We also eliminate the race condition
between the waiting_sks length check/dequeue operations in
tipc_bclink_wakeup_users by simply removing the redundant
length check.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Tero Aho <Tero.Aho@coriant.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agor8152: redefine REALTEK_USB_DEVICE
hayeswang [Thu, 4 Dec 2014 02:43:11 +0000 (10:43 +0800)]
r8152: redefine REALTEK_USB_DEVICE

Redefine REALTEK_USB_DEVICE for the desired USB interface for probe().
There are three USB interfaces for the device. USB_CLASS_COMM and
USB_CLASS_CDC_DATA are for ECM mode (config #2). USB_CLASS_VENDOR_SPEC
is for the vendor mode (config #1). However, we are not interesting
in USB_CLASS_CDC_DATA for probe(), so redefine REALTEK_USB_DEVICE
to ignore the USB interface class of USB_CLASS_CDC_DATA.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: avoid two atomic operations in fast clones
Eric Dumazet [Thu, 4 Dec 2014 01:04:39 +0000 (17:04 -0800)]
net: avoid two atomic operations in fast clones

Commit ce1a4ea3f125 ("net: avoid one atomic operation in skb_clone()")
took the wrong way to save one atomic operation.

It is actually possible to avoid two atomic operations, if we
do not change skb->fclone values, and only rely on clone_ref
content to signal if the clone is available or not.

skb_clone() can simply use the fast clone if clone_ref is 1.

kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.

Note that because we usually free the clone before the original skb,
this particular attempt is only done for the original skb to have better
branch prediction.

SKB_FCLONE_FREE is removed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agortnetlink: delay RTM_DELLINK notification until after ndo_uninit()
Mahesh Bandewar [Wed, 3 Dec 2014 21:46:24 +0000 (13:46 -0800)]
rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()

The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
until after ndo_uninit") tried to do this ealier but while doing so
it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
delayed call to fill_info(). So this translated into asking driver
to remove private state and then query it's private state. This
could have catastropic consequences.

This change breaks the rtmsg_ifinfo() into two parts - one takes the
precise snapshot of the device by called fill_info() before calling
the ndo_uninit() and the second part sends the notification using
collected snapshot.

It was brought to notice when last link is deleted from an ipvlan device
when it has free-ed the port and the subsequent .fill_info() call is
trying to get the info from the port.

kernel: [  255.139429] ------------[ cut here ]------------
kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
kernel: [  255.139520] Call Trace:
kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotc_act: export uapi header file
stephen hemminger [Wed, 3 Dec 2014 17:33:17 +0000 (09:33 -0800)]
tc_act: export uapi header file

This file is used by iproute2 and should be exported.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'cxgb4-next'
David S. Miller [Tue, 9 Dec 2014 18:32:04 +0000 (13:32 -0500)]
Merge branch 'cxgb4-next'

Hariprasad Shenai says:

====================
cxgb4/cxgb4vf: T5 BAR2 and ethtool related fixes

This series adds new interface to calculate BAR2 SGE queue register address for
cxgb4 and cxgb4vf driver and some more sge related fixes for T5. Also adds a
patch which updates the FW version displayed by ethtool after firmware flash.

The patches series is created against 'net-next' tree.
And includes patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4: Update firmware version after flashing it via ethtool
Hariprasad Shenai [Wed, 3 Dec 2014 14:02:54 +0000 (19:32 +0530)]
cxgb4: Update firmware version after flashing it via ethtool

After successfully loading new firmware, reload the new firmware's version
number information so "ethtool -i", etc. will report the right value

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4/cxgb4vf: Use new interfaces to calculate BAR2 SGE Queue Register addresses
Hariprasad Shenai [Wed, 3 Dec 2014 14:02:53 +0000 (19:32 +0530)]
cxgb4/cxgb4vf: Use new interfaces to calculate BAR2 SGE Queue Register addresses

Use BAR2 Going To Sleep (GTS) for T5 and later. Use new BAR2 User Doorbells for
T5 for both cxgb4 and cxgb4vf driver.

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4/cxgb4vf: Add code to calculate T5 BAR2 Offsets for SGE Queue Registers
Hariprasad Shenai [Wed, 3 Dec 2014 14:02:52 +0000 (19:32 +0530)]
cxgb4/cxgb4vf: Add code to calculate T5 BAR2 Offsets for SGE Queue Registers

Add new Common Code facilities for calculating T5 BAR2 Offsets for SGE Queue
Registers. This new code can handle situations where

    Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4vf: Add and initialize some sge params for VF driver
Hariprasad Shenai [Wed, 3 Dec 2014 14:02:51 +0000 (19:32 +0530)]
cxgb4vf: Add and initialize some sge params for VF driver

Add sge_vf_eq_qpp and sge_vf_iq_qpp to (struct sge_params), initialize
sge_queues_per_page and sge_vf_qpp in t4vf_get_sge_params(), add new
t4vf_prep_adapter() which initializes basic adapter parameters.

Grab both SGE_EGRESS_QUEUES_PER_PAGE_VF and SGE_INGRESS_QUEUES_PER_PAGE_VF
for VF Drivers since we need both to calculate the User Doorbell area
offsets for Egress and Ingress Queues.

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: drop tx side permission checks
Erik Hugne [Wed, 3 Dec 2014 13:44:44 +0000 (14:44 +0100)]
tipc: drop tx side permission checks

Part of the old remote management feature is a piece of code
that checked permissions on the local system to see if a certain
operation was permitted, and if so pass the command to a remote
node. This serves no purpose after the removal of remote management
with commit 5902385a2440 ("tipc: obsolete the remote management
feature") so we remove it.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agorocker: fix eth_type type in struct rocker_ctrl
Jiri Pirko [Wed, 3 Dec 2014 13:14:54 +0000 (14:14 +0100)]
rocker: fix eth_type type in struct rocker_ctrl

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agorocker: introduce be put/get variants and use it when appropriate
Jiri Pirko [Wed, 3 Dec 2014 13:14:53 +0000 (14:14 +0100)]
rocker: introduce be put/get variants and use it when appropriate

This kills the sparse warnings.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoipv6: remove useless spin_lock/spin_unlock
Duan Jiong [Wed, 3 Dec 2014 02:29:40 +0000 (10:29 +0800)]
ipv6: remove useless spin_lock/spin_unlock

xchg is atomic, so there is no necessary to use spin_lock/spin_unlock
to protect it. At last, remove the redundant
opt = xchg(&inet6_sk(sk)->opt, opt); statement.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoamd-xgbe: IRQ names require allocated memory
Lendacky, Thomas [Wed, 3 Dec 2014 00:07:18 +0000 (18:07 -0600)]
amd-xgbe: IRQ names require allocated memory

When requesting an irq, the name passed in must be (part of) allocated
memory. The irq name was a local variable and resulted in random
characters when listing /proc/interrupts. Add a character field to the
xgbe_channel structure to hold the irq name and use that.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: fix incorrect rcu_read_unlock() in vnet_start_xmit()
David L Stevens [Tue, 9 Dec 2014 02:46:09 +0000 (21:46 -0500)]
sunvnet: fix incorrect rcu_read_unlock() in vnet_start_xmit()

This patch removes an extra rcu_read_unlock() on an allocation failure
in vnet_skb_shape(). The needed rcu_read_unlock() is already done in
the out_dropped label.

Reported-by: Rashmi Narasimhan <rashmi.narasimhan@oracle.com>
Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'genet-gphy'
David S. Miller [Tue, 9 Dec 2014 02:33:36 +0000 (21:33 -0500)]
Merge branch 'genet-gphy'

Florian Fainelli says:

====================
net: bcmgenet: support for new GPHY revision scheme

These two patches update the GENET GPHY revision logic to account for some
of our newer designs starting with GPHY rev G0.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: phy: bcm7xxx: add an explicit version check for GPHY rev G0
Florian Fainelli [Wed, 3 Dec 2014 17:57:00 +0000 (09:57 -0800)]
net: phy: bcm7xxx: add an explicit version check for GPHY rev G0

GPHY revision G0 has its version rolled over to 0x10, introduce an
explicit check for that revision and invoke the proper workaround
function for it.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: bcmgenet: add support for new GENET PHY revision scheme
Florian Fainelli [Wed, 3 Dec 2014 17:56:59 +0000 (09:56 -0800)]
net: bcmgenet: add support for new GENET PHY revision scheme

Starting with GPHY revision G0, the GENET register layout has changed to
use the same numbering scheme as the Starfighter 2 switch. This means
that GPHY major revision is in bits 15:12, minor in bits 11:8 and patch
level is in bits 7:4.

Introduce a small heuristic which checks for the old scheme first, tests
for the new scheme and finally attempts to catch reserved values and
aborts.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec...
David S. Miller [Tue, 9 Dec 2014 02:30:21 +0000 (21:30 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2014-12-03

1) Fix a set but not used warning. From Fabian Frederick.

2) Currently we make sequence number values available to userspace
   only if we use ESN. Make the sequence number values also available
   for non ESN states. From Zhi Ding.

3) Remove socket policy hashing. We don't need it because socket
   policies are always looked up via a linked list. From Herbert Xu.

4) After removing socket policy hashing, we can use __xfrm_policy_link
   in xfrm_policy_insert. From Herbert Xu.

5) Add a lookup method for vti6 tunnels with wildcard endpoints.
   I forgot this when I initially implemented vti6.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'sunvnet-next'
David S. Miller [Tue, 9 Dec 2014 02:19:13 +0000 (21:19 -0500)]
Merge branch 'sunvnet-next'

David L Stevens says:

====================
sunvnet: add SG, HW_CSUM, GSO, and TSO support

This patch set adds everything needed for TSO support in sunvnet. On my
test hardware, this increases the single-stream TCP throughput for the
default 1500-byte MTU Linux-Linux from ~2Gbps to 10Gbps and Linux-Solaris
from ~2Gbps to 6Gbps.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: add TSO support
David L Stevens [Tue, 2 Dec 2014 20:31:38 +0000 (15:31 -0500)]
sunvnet: add TSO support

This patch adds TSO support for the sunvnet driver.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: add GSO support
David L Stevens [Tue, 2 Dec 2014 20:31:30 +0000 (15:31 -0500)]
sunvnet: add GSO support

This patch adds GSO support to the sunvnet driver.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: add checksum offload support
David L Stevens [Tue, 2 Dec 2014 20:31:22 +0000 (15:31 -0500)]
sunvnet: add checksum offload support

This patch adds support for sender-side checksum offloading.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: add scatter/gather support
David L Stevens [Tue, 2 Dec 2014 20:31:13 +0000 (15:31 -0500)]
sunvnet: add scatter/gather support

This patch adds scatter/gather support to the sunvnet driver.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: add VIO v1.7 and v1.8 support
David L Stevens [Tue, 2 Dec 2014 20:31:04 +0000 (15:31 -0500)]
sunvnet: add VIO v1.7 and v1.8 support

This patch adds support for VIO v1.7 (extended descriptor format)
and v1.8 (receive-side checksumming) to the sunvnet driver.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: rename vnet_port_alloc_tx_bufs and move after version negotiation
David L Stevens [Tue, 2 Dec 2014 20:30:55 +0000 (15:30 -0500)]
sunvnet: rename vnet_port_alloc_tx_bufs and move after version negotiation

This patch changes the name of vnet_port_alloc_tx_bufs to
vnet_port_alloc_tx_ring, since there are no buffer allocations after
transmit zero copy support was added. This patch also moves the ring
allocation to after VIO version negotiation to allow for
different-sized descriptors in later VIO versions.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'rss_hash'
David S. Miller [Tue, 9 Dec 2014 02:07:25 +0000 (21:07 -0500)]
Merge branch 'rss_hash'

Amir Vadai says:

====================
ethtool, net/mlx4_en: RSS hash function selection

This patchset by Eyal adds support in set/get of RSS hash function. Current
supported functions are Toeplitz and XOR. The API is design to enable adding
new hash functions without breaking backward compatibility.
Userspace patch will be sent after API is available in kernel.

The patchset was applied and tested over commit cd4c910 ("netpoll: delete
defconfig references to obsolete NETPOLL_TRAP")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx4_en: Support for configurable RSS hash function
Eyal Perry [Tue, 2 Dec 2014 16:12:11 +0000 (18:12 +0200)]
net/mlx4_en: Support for configurable RSS hash function

The ConnectX HW is capable of using one of the following hash functions:
Toeplitz and an XOR hash function. This patch extends the implementation
of the mlx4_en driver set/get_rxfh callbacks to support getting and
setting the RSS hash function used by the device.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoethtool: Support for configurable RSS hash function
Eyal Perry [Tue, 2 Dec 2014 16:12:10 +0000 (18:12 +0200)]
ethtool: Support for configurable RSS hash function

This patch extends the set/get_rxfh ethtool-options for getting or
setting the RSS hash function.

It modifies drivers implementation of set/get_rxfh accordingly.

This change also delegates the responsibility of checking whether a
modification to a certain RX flow hash parameter is supported to the
driver implementation of set_rxfh.

User-kernel API is done through the new hfunc bitmask field in the
ethtool_rxfh struct. A bit set in the hfunc field is corresponding to an
index in the new string-set ETH_SS_RSS_HASH_FUNCS.

Got approval from most of the relevant driver maintainers that their
driver is using Toeplitz, and for the few that didn't answered, also
assumed it is Toeplitz.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: Shradha Shah <sshah@solarflare.com>
Cc: Shreyas Bhatewara <sbhatewara@vmware.com>
Cc: "VMware, Inc." <pv-drivers@vmware.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_cgroup: remove unnecessary if
Jiri Pirko [Tue, 2 Dec 2014 17:00:36 +0000 (18:00 +0100)]
net_sched: cls_cgroup: remove unnecessary if

since head->handle == handle (checked before), just assign handle.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_flow: remove duplicate assignments
Jiri Pirko [Tue, 2 Dec 2014 17:00:35 +0000 (18:00 +0100)]
net_sched: cls_flow: remove duplicate assignments

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_flow: remove faulty use of list_for_each_entry_rcu
Jiri Pirko [Tue, 2 Dec 2014 17:00:34 +0000 (18:00 +0100)]
net_sched: cls_flow: remove faulty use of list_for_each_entry_rcu

rcu variant is not correct here. The code is called by updater (rtnl
lock is held), not by reader (no rcu_read_lock is held).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_bpf: remove faulty use of list_for_each_entry_rcu
Jiri Pirko [Tue, 2 Dec 2014 17:00:33 +0000 (18:00 +0100)]
net_sched: cls_bpf: remove faulty use of list_for_each_entry_rcu

rcu variant is not correct here. The code is called by updater (rtnl
lock is held), not by reader (no rcu_read_lock is held).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
ACKed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_bpf: remove unnecessary iteration and use passed arg
Jiri Pirko [Tue, 2 Dec 2014 17:00:32 +0000 (18:00 +0100)]
net_sched: cls_bpf: remove unnecessary iteration and use passed arg

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet_sched: cls_basic: remove unnecessary iteration and use passed arg
Jiri Pirko [Tue, 2 Dec 2014 17:00:31 +0000 (18:00 +0100)]
net_sched: cls_basic: remove unnecessary iteration and use passed arg

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net...
David S. Miller [Tue, 9 Dec 2014 01:49:52 +0000 (20:49 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-12-06

This series contains updates to i40e and i40evf.

Shannon provides several patches to cleanup and fix i40e.  First removes
an unneeded break statement in i40e_vsi_link_event().  Then removes
some debug messages that really do not give any useful information and
ends up getting printed every service_task loop, which fills the logfile
with noise when AQ tracing is enabled.  Updates the aq_cmd arguments to
use %i which is much more forgiving and user friendly than the more
restrictive %x, or %d.  Fixes the netdev_stat macro, where the old
xxx_NETDEV_STAT() macro was defined long before the newer
rtnl_link_stats64 came into being, and just never got updated.
Getting the pf_id from the function number had an issue when
when the PF was setup in passthru mode, the PCI bus/device/function
was virtualized and the number in the VM is different from the number in
the bare metal.  This caused HW configuration issues when the wrong pf_id
was used to set up the HMC and other structures.  The PF_FUNC_RID register
has the real bus/device/function information as configured by the BIOS,
so use that for a better number.

Carolyn adds additional text description for the base pf0 and flow
director generated interrupts, since these interrupts are difficult
to distinguish per port on a multi-function device.

Jacob resolves an issue related to images with multiple PFs per
physical port.  We cannot fully support 1588 PTP features, since only
one port should control (i.e. write) the registers at a time.  Doing
so can cause interference of functionality.

Anjali provides several updates to i40e, first adds the Virtual Channel
OP event opcode for CONFIG_RSS, so that the Virtual Channel state
machine can properly decipher status change events.  Then updates the
driver to add (and use) i40e_is_vf macro for future expansion when new
VF MAC types get added.  Adds new update VSI flow to accommodate a
firmware dix with VSI loopback mode.  All VSIs on a VEB should either
have loopback enabled or disabled, a mixed mode is not supported for a
VEB.  Since our driver supports multiple VSIs per PF that need to talk to
each other make sure to enable Loopback for the PF and FDIR VSI as well.

Mitch provides a couple of i40e and i40evf patches.  First updates
i40evf init code more adept at handling when multiple VFs attempt
to initialize simultaneously.

Joe Perches provides a i40e patch which resolves a compile warning
about about frame size being larger than 2048 bytes by reducing the
stack use by using kmemdup and not using a very large struct on the
stack.

v2:
 - Dropped patch 13 & 14 while Mitch reworks the patches based on
   feedback from Ben Hutchings, probably the tryptophan in the turkey
   is to blame for the delay...
 - Added Joe Perches patch which resolves a compile warning about frame
   size
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'eth_skb_pad'
David S. Miller [Tue, 9 Dec 2014 01:47:47 +0000 (20:47 -0500)]
Merge branch 'eth_skb_pad'

Alexander Duyck says:

====================
net: Add helper for padding short Ethernet frames

This patch series adds a pair of helpers to pad short Ethernet frames.  The
general idea is to clean up a number of code paths that were all writing
their own versions of the same or similar function.

An added advantage is that this will help to discourage introducing new
bugs as in at least one case I found the skb->len had been updated, but the
tail pointer update was overlooked.

v2: Added skb_put_padto for cases where length is not ETH_ZLEN
    Updated intel drivers and emulex driver to use skb_put_padto
    Updated eth_skb_pad to use skb_put_padto
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agor8169: Use eth_skb_pad function
Alexander Duyck [Wed, 3 Dec 2014 16:18:04 +0000 (08:18 -0800)]
r8169: Use eth_skb_pad function

Replace rtl_skb_pad with eth_skb_pad since they do the same thing.

Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomyri10ge: use eth_skb_pad helper
Alexander Duyck [Wed, 3 Dec 2014 16:17:58 +0000 (08:17 -0800)]
myri10ge: use eth_skb_pad helper

Update myri10ge to use eth_skb_pad helper.  This also corrects a minor
issue as the driver was updating length without updating the tail pointer.

Cc: Hyong-Youb Kim <hykim@myri.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoniu: Use eth_skb_pad helper
Alexander Duyck [Wed, 3 Dec 2014 16:17:52 +0000 (08:17 -0800)]
niu: Use eth_skb_pad helper

Replace the standard layout for padding an ethernet frame with the
eth_skb_pad call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoemulex: Use skb_put_padto instead of skb_padto() and skb->len assignment
Alexander Duyck [Wed, 3 Dec 2014 16:17:46 +0000 (08:17 -0800)]
emulex: Use skb_put_padto instead of skb_padto() and skb->len assignment

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoethernet/intel: Use eth_skb_pad and skb_put_padto helpers
Alexander Duyck [Wed, 3 Dec 2014 16:17:39 +0000 (08:17 -0800)]
ethernet/intel: Use eth_skb_pad and skb_put_padto helpers

Update the Intel Ethernet drivers to use eth_skb_pad() and skb_put_padto
instead of doing their own implementations of the function.

Also this cleans up two other spots where skb_pad was called but the length
and tail pointers were being manipulated directly instead of just having
the padding length added via __skb_put.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: Add functions for handling padding frame and adding to length
Alexander Duyck [Wed, 3 Dec 2014 16:17:33 +0000 (08:17 -0800)]
net: Add functions for handling padding frame and adding to length

This patch adds two new helper functions skb_put_padto and eth_skb_pad.
These functions deviate from the standard skb_pad or skb_padto in that they
will also update the length and tail pointers so that they reflect the
padding added to the frame.

The eth_skb_pad helper is meant to be used with Ethernet devices to update
either Rx or Tx frames so that they report the correct size.  The
skb_put_padto helper is meant to be used primarily in the transmit path for
network devices that need frames to be padded up to some minimum size and
don't wish to simply update the length somewhere external to the frame.

The motivation behind this is that there are a number of implementations
throughout the network device drivers that are all doing the same thing,
but each a little bit differently and as a result several implementations
contain bugs such as updating the length without updating the tail offset
and other similar issues.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'mlx5-next'
David S. Miller [Tue, 9 Dec 2014 01:46:01 +0000 (20:46 -0500)]
Merge branch 'mlx5-next'

Eli Cohen says:

====================
mlx5 driver updates

The following series contains some fixes to mlx5 as well as update to the list
of supported devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomlx5: Fix error flow in add_keys
Eli Cohen [Tue, 2 Dec 2014 10:26:19 +0000 (12:26 +0200)]
mlx5: Fix error flow in add_keys

If mlx5_core_create_mkey fails, decrease the pending counter to undo the
previous increment.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomlx5: Fix sparse warnings
Eli Cohen [Tue, 2 Dec 2014 10:26:18 +0000 (12:26 +0200)]
mlx5: Fix sparse warnings

1. Add required __acquire/__release statements to balance spinlock usage.
2. Change the index parameter of begin_wqe() to be unsigned to match supplied
argument type.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Add more supported devices
Eli Cohen [Tue, 2 Dec 2014 10:26:17 +0000 (12:26 +0200)]
net/mlx5_core: Add more supported devices

Add ConnectX-4LX to the list of supported devices as well as their virtual
functions.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Clear outbox of dealloc uar
Majd Dibbiny [Tue, 2 Dec 2014 10:26:16 +0000 (12:26 +0200)]
net/mlx5_core: Clear outbox of dealloc uar

The outbox should be cleared before executing the command.

Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Print resource number on QP/SRQ async events
Eli Cohen [Tue, 2 Dec 2014 10:26:15 +0000 (12:26 +0200)]
net/mlx5_core: Print resource number on QP/SRQ async events

Useful for debugging purposes.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Remove unused dev cap enum fields
Eli Cohen [Tue, 2 Dec 2014 10:26:14 +0000 (12:26 +0200)]
net/mlx5_core: Remove unused dev cap enum fields

These enumerations are not used so remove them.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Fix command queue size enforcement
Eli Cohen [Tue, 2 Dec 2014 10:26:13 +0000 (12:26 +0200)]
net/mlx5_core: Fix command queue size enforcement

Command queue descriptor page size is 4KB and not the page size used by the
kernel.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Fix min vectors value in mlx5_enable_msix
Eli Cohen [Tue, 2 Dec 2014 10:26:12 +0000 (12:26 +0200)]
net/mlx5_core: Fix min vectors value in mlx5_enable_msix

mlx5 requires at least one interrupt vector for completions so fix the minvec
argument to pci_enable_msix_range() accordingly.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet/mlx5_core: Request the mlx5 IB module on driver load
Eli Cohen [Tue, 2 Dec 2014 10:26:11 +0000 (12:26 +0200)]
net/mlx5_core: Request the mlx5 IB module on driver load

Call request module on mlx5_ib so it will be available for applications
requiring it, such as installers that require boot over IB.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'r8169-next'
David S. Miller [Tue, 9 Dec 2014 01:43:31 +0000 (20:43 -0500)]
Merge branch 'r8169-next'

Chunhao Lin says:

====================
r8169:change hardware setting

This patch series contains two hardware setting modification to prevent
hardware become abnormal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agor8169:disable rtl8168ep cmac engine
Chun-Hao Lin [Tue, 2 Dec 2014 08:48:31 +0000 (16:48 +0800)]
r8169:disable rtl8168ep cmac engine

Cmac engine is the bridge between driver and dash firmware.
Other os may not disable cmac when leave. And r8169 did not allocate any
resources for cmac engine. Disable it to prevent abnormal system behavior.

Signed-off-by: Chunhao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agor8169:prevent enable hardware tx/rx too early
Chun-Hao Lin [Tue, 2 Dec 2014 08:48:30 +0000 (16:48 +0800)]
r8169:prevent enable hardware tx/rx too early

For RTL8168G/GU/H/EP and RTL8411B remove enable tx/rx from its own hw_start
function. This will prevent enable tx/rx before complete hardware tx/rx
setting.

Tx/Rx will be enabled in the end of function rtl_hw_start_8168.

Signed-off-by: Chunhao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'tipc-next'
David S. Miller [Tue, 9 Dec 2014 01:40:03 +0000 (20:40 -0500)]
Merge branch 'tipc-next'

Ying Xue says:

====================
tipc: convert name table read-write lock to RCU

Now TIPC name table is statically allocated and is protected with a
Read-Write lock. To enhance the performance of TIPC name table lookup,
we are going to involve RCU lock to protect the name table. As a
consequence, it becomes lockless to concurrently look up name table on
read side. However, before the conversion can be successfully made,
the following two things must be first done:

- change allocation way of name table from static to dynamic
- fix several incorrect locking policy issues
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: convert name table read-write lock to RCU
Ying Xue [Tue, 2 Dec 2014 07:00:30 +0000 (15:00 +0800)]
tipc: convert name table read-write lock to RCU

Convert tipc name table read-write lock to RCU. After this change,
a new spin lock is used to protect name table on write side while
RCU is applied on read side.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: remove unnecessary INIT_LIST_HEAD
Ying Xue [Tue, 2 Dec 2014 07:00:29 +0000 (15:00 +0800)]
tipc: remove unnecessary INIT_LIST_HEAD

When a list_head variable is seen as a new entry to be added to a
list head, it's unnecessary to be initialized with INIT_LIST_HEAD().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: simplify relationship between name table lock and node lock
Ying Xue [Tue, 2 Dec 2014 07:00:28 +0000 (15:00 +0800)]
tipc: simplify relationship between name table lock and node lock

When tipc name sequence is published, name table lock is released
before name sequence buffer is delivered to remote nodes through its
underlying unicast links. However, when name sequence is withdrawn,
the name table lock is held until the transmission of the removal
message of name sequence is finished. During the process, node lock
is nested in name table lock. To prevent node lock from being nested
in name table lock, while withdrawing name, we should adopt the same
locking policy of publishing name sequence: name table lock should
be released before message is sent.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: any name table member must be protected under name table lock
Ying Xue [Tue, 2 Dec 2014 07:00:27 +0000 (15:00 +0800)]
tipc: any name table member must be protected under name table lock

As tipc_nametbl_lock is used to protect name_table structure, the lock
must be held while all members of name_table structure are accessed.
However, the lock is not obtained while a member of name_table
structure - local_publ_count is read in tipc_nametbl_publish(), as
a consequence, an inconsistent value of local_publ_count might be got.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: ensure all name sequences are properly protected with its lock
Ying Xue [Tue, 2 Dec 2014 07:00:26 +0000 (15:00 +0800)]
tipc: ensure all name sequences are properly protected with its lock

TIPC internally created a name table which is used to store name
sequences. Now there is a read-write lock - tipc_nametbl_lock to
protect the table, and each name sequence saved in the table is
protected with its private lock. When a name sequence is inserted
or removed to or from the table, its members might need to change.
Therefore, in normal case, the two locks must be held while TIPC
operates the table. However, there are still several places where
we only hold tipc_nametbl_lock without proprerly obtaining name
sequence lock, which might cause the corruption of name sequence.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: ensure all name sequences are released when name table is stopped
Ying Xue [Tue, 2 Dec 2014 07:00:25 +0000 (15:00 +0800)]
tipc: ensure all name sequences are released when name table is stopped

As TIPC subscriber server is terminated before name table, no user
depends on subscription list of name sequence when name table is
stopped. Therefore, all name sequences stored in name table should
be released whatever their subscriptions lists are empty or not,
otherwise, memory leak might happen.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: make name table allocated dynamically
Ying Xue [Tue, 2 Dec 2014 07:00:24 +0000 (15:00 +0800)]
tipc: make name table allocated dynamically

Name table locking policy is going to be adjusted from read-write
lock protection to RCU lock protection in the future commits. But
its essential precondition is to convert the allocation way of name
table from static to dynamic mode.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: remove size variable from publ_list struct
Ying Xue [Tue, 2 Dec 2014 07:00:23 +0000 (15:00 +0800)]
tipc: remove size variable from publ_list struct

The size variable is introduced in publ_list struct to help us exactly
calculate SKB buffer sizes needed by publications when all publications
in name table are delivered in bulk in named_distribute(). But if
publication SKB buffer size is assumed to MTU, the size variable in
publ_list struct can be completely eliminated at the cost of wasting
a bit memory space for last SKB.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tero Aho <tero.aho@coriant.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoudp: Neaten and reduce size of compute_score functions
Joe Perches [Tue, 2 Dec 2014 04:29:06 +0000 (20:29 -0800)]
udp: Neaten and reduce size of compute_score functions

The compute_score functions are a bit difficult to read.

Neaten them a bit to reduce object sizes and make them a
bit more intelligible.

Return early to avoid indentation and avoid unnecessary
initializations.

(allyesconfig, but w/ -O2 and no profiling)

$ size net/ipv[46]/udp.o.*
   text    data     bss     dec     hex filename
  28680    1184      25   29889    74c1 net/ipv4/udp.o.new
  28756    1184      25   29965    750d net/ipv4/udp.o.old
  17600    1010       2   18612    48b4 net/ipv6/udp.o.new
  17632    1010       2   18644    48d4 net/ipv6/udp.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: bcmgenet: enable driver to work without a device tree
Petri Gynther [Tue, 2 Dec 2014 00:18:08 +0000 (16:18 -0800)]
net: bcmgenet: enable driver to work without a device tree

Modify bcmgenet driver so that it can be used on Broadcom 7xxx
MIPS-based STB platforms without a device tree.

Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agohyperv: Add support for vNIC hot removal
Haiyang Zhang [Mon, 1 Dec 2014 21:28:39 +0000 (13:28 -0800)]
hyperv: Add support for vNIC hot removal

This patch adds proper handling of the vNIC hot removal event, which includes
a rescind-channel-offer message from the host side that triggers vNIC close and
removal. In this case, the notices to the host during close and removal is not
necessary because the channel is rescinded. This patch blocks these unnecessary
messages, and lets vNIC removal process complete normally.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotest: bpf: expand DIV_KX to DIV_MOD_KX
Denis Kirjanov [Mon, 1 Dec 2014 10:12:25 +0000 (13:12 +0300)]
test: bpf: expand DIV_KX to DIV_MOD_KX

Expand DIV_KX to use BPF_MOD operation in the
DIV_KX bpf 'classic' test.

CC: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'tstamp-next'
David S. Miller [Tue, 9 Dec 2014 01:20:55 +0000 (20:20 -0500)]
Merge branch 'tstamp-next'

Willem de Bruijn says:

====================
timestamping updates

The main goal for this patchset is to allow correlating timestamps
with the egress interface. Also introduce a warning, as discussed
previously, and update the tests to verify the new feature.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet-timestamp: expand documentation and test
Willem de Bruijn [Mon, 1 Dec 2014 03:22:35 +0000 (22:22 -0500)]
net-timestamp: expand documentation and test

Documentation:
  expand explanation of timestamp counter

Test:
  new: flag -I requests and prints PKTINFO
  new: flag -x prints payload (possibly truncated)
  fix: remove pretty print that breaks common flag '-l 1'

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet-timestamp: allow reading recv cmsg on errqueue with origin tstamp
Willem de Bruijn [Mon, 1 Dec 2014 03:22:34 +0000 (22:22 -0500)]
net-timestamp: allow reading recv cmsg on errqueue with origin tstamp

Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.

on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.

on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. only change is to set ifindex, which is
not initialized for all error origins.

In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.

The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:

ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {

At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.

Signed-off-by: Willem de Bruijn <willemb@google.com>
----

Changes
  v1 -> v2
    large rewrite
    - integrate with existing pktinfo cmsg generation code
    - on ipv4: only send with new flag, to maintain legacy behavior
    - on ipv6: send at most a single pktinfo cmsg
    - on ipv6: initialize fields if not yet initialized

The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches

1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
   (http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
   cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>