Yotam Gigi [Mon, 23 Jan 2017 10:07:11 +0000 (11:07 +0100)]
mlxsw: spectrum: Add packet sample offloading support
Using the MPSC register, add the functions that configure port-based
packet sampling in hardware and the necessary datatypes in the
mlxsw_sp_port struct. In addition, add the necessary trap for sampled
packets and integrate with matchall offloading to allow offloading of the
sample tc action.
The current offload support is for the tc command:
tc filter add dev <DEV> parent ffff: \
matchall skip_sw \
action sample rate <RATE> group <GROUP> [trunc <SIZE>]
Where only ingress qdiscs are supported, and only a combination of
matchall classifier and sample action will lead to activating hardware
packet sampling.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yotam Gigi [Mon, 23 Jan 2017 10:07:10 +0000 (11:07 +0100)]
mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
The MPSC register allows to configure ingress packet sampling on specific
port of the mlxsw device. The sampled packets are then trapped via
PKT_SAMPLE trap.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yotam Gigi [Mon, 23 Jan 2017 10:07:09 +0000 (11:07 +0100)]
net/sched: Introduce sample tc action
This action allows the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly and sampling them using
the psample module. The user can configure the psample group number, the
sampling rate and the packet's truncation (to save kernel-user traffic).
Example:
To sample ingress traffic from interface eth1, one may use the commands:
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: \
matchall action sample rate 12 group 4
Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets on
dev eth1 to psample group 4.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yotam Gigi [Mon, 23 Jan 2017 10:07:08 +0000 (11:07 +0100)]
net: Introduce psample, a new genetlink channel for packet sampling
Add a general way for kernel modules to sample packets, without being tied
to any specific subsystem. This netlink channel can be used by tc,
iptables, etc. and allow to standardize packet sampling in the kernel.
For every sampled packet, the psample module adds the following metadata
fields:
PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable
PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable
PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
truncated during sampling
PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
user who initiated the sampling. This field allows the user to
differentiate between several samplers working simultaneously and
filter packets relevant to him
PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
sequence is kept for each group
PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
PSAMPLE_ATTR_DATA - the actual packet bits
The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
group. In addition, add the GET_GROUPS netlink command which allows the
user to see the current sample groups, their refcount and sequence number.
This command currently supports only netlink dump mode.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 24 Jan 2017 18:37:51 +0000 (13:37 -0500)]
Merge branch 'mdio_module_driver-misc'
Florian Fainelli says:
====================
net: couple mdio_module_driver changes
Small patch series fixing a comment for mdio_module_driver and
finally utilizing it in b53_mdio.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Jan 2017 05:17:33 +0000 (21:17 -0800)]
net: dsa: b53: Utilize mdio_module_driver
Eliminate a bit of boilerplate code.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Jan 2017 05:17:32 +0000 (21:17 -0800)]
net: phy: Fix typo for MDIO module boilerplate comment
The module boilerplate macro is named mdio_module_driver and not
module_mdio_driver, fix that.
Fixes:
a9049e0c513c ("mdio: Add support for mdio drivers.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 24 Jan 2017 18:35:41 +0000 (13:35 -0500)]
Merge branch 'stmmac-dwmac-meson8b-configurable-RGMII-TX-delay'
Martin Blumenstingl says:
====================
stmmac: dwmac-meson8b: configurable RGMII TX delay
Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
cycle (= 2ns) TX clock delay. This seems to work fine for many boards
(for example Odroid-C2 or Amlogic's reference boards) but there are
some others where TX traffic is simply broken.
There are probably multiple reasons why it's working on some boards
while it's broken on others:
- some of Amlogic's reference boards are using a Micrel PHY
- hardware circuit design
- maybe more...
iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
TX throughput was virtually zero before:
$ iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[ 4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 108 MBytes 901 Mbits/sec
[ 4] 1.00-2.00 sec 94.2 MBytes 791 Mbits/sec
[ 4] 2.00-3.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 3.00-4.00 sec 96.2 MBytes 808 Mbits/sec
[ 4] 4.00-5.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 5.00-6.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 6.00-7.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 7.00-8.00 sec 96.5 MBytes 809 Mbits/sec
[ 4] 8.00-9.00 sec 105 MBytes 884 Mbits/sec
[ 4] 9.00-10.00 sec 111 MBytes 934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1000 MBytes 839 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 998 MBytes 837 Mbits/sec receiver
iperf Done.
$ iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[ 4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.01 sec 99.5 MBytes 829 Mbits/sec 117 139 KBytes
[ 4] 1.01-2.00 sec 105 MBytes 884 Mbits/sec 129 70.7 KBytes
[ 4] 2.00-3.01 sec 107 MBytes 889 Mbits/sec 106 187 KBytes
[ 4] 3.01-4.01 sec 105 MBytes 878 Mbits/sec 92 143 KBytes
[ 4] 4.01-5.00 sec 105 MBytes 882 Mbits/sec 140 129 KBytes
[ 4] 5.00-6.01 sec 106 MBytes 883 Mbits/sec 115 195 KBytes
[ 4] 6.01-7.00 sec 102 MBytes 863 Mbits/sec 133 70.7 KBytes
[ 4] 7.00-8.01 sec 106 MBytes 884 Mbits/sec 143 97.6 KBytes
[ 4] 8.01-9.01 sec 104 MBytes 875 Mbits/sec 124 107 KBytes
[ 4] 9.01-10.01 sec 105 MBytes 876 Mbits/sec 90 139 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.01 sec 1.02 GBytes 874 Mbits/sec 1189 sender
[ 4] 0.00-10.01 sec 1.02 GBytes 873 Mbits/sec receiver
iperf Done.
I get similar TX throughput on my Meson GXBB "MXQ Pro+" board when I
disable the PHY's TX-delay and configure a 4ms TX-delay on the MAC.
So changes to at least the RTL8211F PHY driver are needed to get it
working properly in all situations.
Changes since v4:
- add a fallback of 2ns (the value which was previously hardcoded) for
the TX delay so we are backwards-compatible with older .dts'
- update the documentation with the new fallback value and add a small
note that the "amlogic,tx-delay" property is ignored when the phy-mode
is "rmii".
Changes since v3:
- rebased to apply against current net-next branch (fixes a conflict
with
d2ed0a7755fe14c7 "net: ethernet: stmmac: fix of-node and
fixed-link-phydev leaks")
Changes since v2:
- moved all .dts patches (3-7) to a separate series
- removed the default 2ns TX delay when phy-mode RGMII is specified
- (rebased against current net-next)
Changes since v1:
- renamed the devicetree property "amlogic,tx-delay" to
"amlogic,tx-delay-ns", which makes the .dts easier to read as we can
simply specify human-readable values instead of having "preprocessor
defines and calculation in human brain". Thanks to Andrew Lunn for
the suggestion!
- improved documentation to indicate when the MAC TX-delay should be
configured and how to use the PHY's TX-delay
- changed the default TX-delay in the dwmac-meson8b driver from 2ns
to 0ms when any of the rgmii-*id modes are used (the 2ns default
value still applies for phy-mode "rgmii")
- added patches to properly reset the PHY on Meson GXBB devices and to
use a similar configuration than the one we use on Meson GXL devices
(by passing a phy-handle to stmmac and defining the PHY in the mdio0
bus - patch 3-6)
- add the "amlogic,tx-delay-ns" property to all boards which are using
the RGMII PHY (patch 7)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sun, 22 Jan 2017 22:02:46 +0000 (23:02 +0100)]
net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable
Prior to this patch we were using a hardcoded RGMII TX clock delay of
2ns (= 1/4 cycle of the 125MHz RGMII TX clock). This value works for
many boards, but unfortunately not for all (due to the way the actual
circuit is designed, sometimes because the TX delay is enabled in the
PHY, etc.). Making the TX delay on the MAC side configurable allows us
to support all possible hardware combinations.
This allows fixing a compatibility issue on some boards, where the
RTL8211F PHY is configured to generate the TX delay. We can now turn
off the TX delay in the MAC, because otherwise we would be applying the
delay twice (which results in non-working TX traffic).
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Tested-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sun, 22 Jan 2017 22:02:45 +0000 (23:02 +0100)]
net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
This allows configuring the RGMII TX clock delay. The RGMII clock is
generated by underlying hardware of the the Meson 8b / GXBB DWMAC glue.
The configuration depends on the actual hardware (no delay may be
needed due to the design of the actual circuit, the PHY might add this
delay, etc.).
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Tested-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sun, 22 Jan 2017 21:16:45 +0000 (22:16 +0100)]
net: dsa: Fix inverted test for multiple CPU interface
Remove the wrong !, otherwise we get false positives about having
multiple CPU interfaces.
Fixes:
b22de490869d ("net: dsa: store CPU switch structure in the tree")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Sat, 21 Jan 2017 20:01:32 +0000 (21:01 +0100)]
bridge: multicast to unicast
Implements an optional, per bridge port flag and feature to deliver
multicast packets to any host on the according port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.
multicast-to-unicast works on top of the multicast snooping feature of
the bridge. Which means unicast copies are only delivered to hosts which
are interested in it and signalized this via IGMP/MLD reports
previously.
This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).
However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.
The initial patch and idea is from Felix Fietkau.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
[linus.luessing@c0d3.blue: various bug + style fixes, commit message]
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krister Johansen [Sat, 21 Jan 2017 01:49:11 +0000 (17:49 -0800)]
Introduce a sysctl that modifies the value of PROT_SOCK.
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
that denotes the first unprivileged inet port in the namespace. To
disable all privileged ports set this to zero. It also checks for
overlap with the local port range. The privileged and local range may
not overlap.
The use case for this change is to allow containerized processes to bind
to priviliged ports, but prevent them from ever being allowed to modify
their container's network configuration. The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace. This modification was needed to allow the container manager
to disable a namespace's priviliged port restrictions without exposing
control of the network namespace to processes in the user namespace.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 24 Jan 2017 00:26:46 +0000 (01:26 +0100)]
bpf, lpm: fix kfree of im_node in trie_update_elem
We need to initialize im_node to NULL, otherwise in case of error path
it gets passed to kfree() as uninitialized pointer.
Fixes:
b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 23 Jan 2017 21:10:38 +0000 (16:10 -0500)]
Merge branch 'bpf-lpm'
Daniel Mack says:
====================
bpf: add longest prefix match map
This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
Patch 1/2 adds the implementation, 2/2 an extensive test suite and 3/3
has benchmarking code for the new trie type.
Feedback is much appreciated.
Changelog:
v3 -> v4:
* David added a 3rd patch that augments map_perf_test for
LPM trie benchmarks
* Limit allocation of maps of this new type to CAP_SYS_ADMIN
for now, as requested by Alexei
* Add a stub .map_delete_elem so the core does not stumble
over a NULL pointer when the syscall is invoked
* Tests for non-power-of-2 prefix lengths were added
* More comment style fixes
v2 -> v3:
* Store both the key match data and the caller provided
value in the same byte array attached to a node. This
avoids double allocations
* Bring back node->flags to distinguish between 'real'
and intermediate nodes
* Fix comment style and some typos
v1 -> v2:
* Turn spin lock into raw spinlock
* Lock with irqsave options during trie_update_elem()
* Return -ENOMEM properly from trie_alloc()
* Force attr->flags == BPF_F_NO_PREALLOC during creation
* Set trie->map.pages after creation to account for map memory
* Allow arbitrary value sizes
* Removed node->flags and denode intermediate nodes through
node->value == NULL instead
rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Herrmann [Sat, 21 Jan 2017 16:26:13 +0000 (17:26 +0100)]
samples/bpf: add lpm-trie benchmark
Extend the map_perf_test_{user,kern}.c infrastructure to stress test
lpm-trie lookups. We hook into the kprobe on sys_gettid() and measure
the latency depending on trie size and lookup count.
On my Intel Haswell i7-6400U, a single gettid() syscall with an empty
bpf program takes roughly 6.5us on my system. Lookups in empty tries
take ~1.8us on first try, ~0.9us on retries. Lookups in tries with 8192
entries take ~7.1us (on the first _and_ any subsequent try).
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Herrmann [Sat, 21 Jan 2017 16:26:12 +0000 (17:26 +0100)]
bpf: Add tests for the lpm trie map
The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.
Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.
The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Mack [Sat, 21 Jan 2017 16:26:11 +0000 (17:26 +0100)]
bpf: add a longest prefix match trie map implementation
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.
Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.
The code carries more information about the internal implementation.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Sat, 21 Jan 2017 06:58:58 +0000 (12:28 +0530)]
net: xilinx: constify net_device_ops structure
Declare net_device_ops structure as const as it is only stored in
the netdev_ops field of a net_device structure. This field is of type
const, so net_device_ops structures having same properties can be made
const too.
Done using Coccinelle:
@r1 disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p={...};
@ok1@
identifier r1.i;
position p;
struct net_device ndev;
@@
ndev.netdev_ops=&i@p
@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p
@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct net_device_ops i;
File size before:
text data bss dec hex filename
6201 744 0 6945 1b21 ethernet/xilinx/xilinx_emaclite.o
File size after:
text data bss dec hex filename
6745 192 0 6937 1b19 ethernet/xilinx/xilinx_emaclite.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Sat, 21 Jan 2017 06:57:26 +0000 (12:27 +0530)]
net: moxa: constify net_device_ops structures
Declare net_device_ops structure as const as it is only stored in
the netdev_ops field of a net_device structure. This field is of type
const, so net_device_ops structures having same properties can be made
const too.
Done using Coccinelle:
@r1 disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p={...};
@ok1@
identifier r1.i;
position p;
struct net_device ndev;
@@
ndev.netdev_ops=&i@p
@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p
@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct net_device_ops i;
File size before:
text data bss dec hex filename
4821 744 0 5565 15bd ethernet/moxa/moxart_ether.o
File size after:
text data bss dec hex filename
5373 192 0 5565 15bd ethernet/moxa/moxart_ether.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Fri, 20 Jan 2017 23:21:04 +0000 (17:21 -0600)]
net: qcom/emac: claim the irq only when the device is opened
During reset, functions emac_mac_down() and emac_mac_up() are called,
so we don't want to free and claim the IRQ unnecessarily. Move those
operations to open/close.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Fri, 20 Jan 2017 23:21:03 +0000 (17:21 -0600)]
net: qcom/emac: rename emac_phy to emac_sgmii and move it
The EMAC has an internal PHY that is often called the "SGMII". This
SGMII is also connected to an external PHY, which is managed by phylib.
These dual PHYs often cause confusion. In this case, the data structure
for managing the SGMII was mis-named and located in the wrong header file.
Structure emac_phy is renamed to emac_sgmii to clearly indicate it applies
to the internal PHY only. It also also moved from emac_phy.h (which
supports the external PHY) to emac_sgmii.h (where it belongs).
To keep the changes minimal, only the structure name is changed, not
the names of any variables of that type.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 20 Jan 2017 16:25:34 +0000 (08:25 -0800)]
bnx2x: avoid two atomic ops per page on x86
Commit
4cace675d687 ("bnx2x: Alloc 4k fragment for each rx ring buffer
element") added extra put_page() and get_page() calls on arches where
PAGE_SIZE=4K like x86
Reorder things to avoid this overhead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 22 Jan 2017 21:59:00 +0000 (16:59 -0500)]
Merge branch 'bcm7278'
Florian Fainelli says:
====================
net: dsa: bcm_sf2: Add support for BCM7278
This patch series adds support for the Broadcom BCM7278 integrated switch
which is a successor of the BCM7445 switch. We have a little bit of
register shuffling going on, which is why most of the functional changes
are to deal with that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:34 +0000 (12:36 -0800)]
net: phy: bcm7xxx: Implement EGPHY workaround for 7278
Implement the HW design team recommended workaround in for 7278. Since
the GPHY now returns its revision information in MII_PHYS_ID[23] we need
to check whether the revision provided in flags is 0 or not.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:33 +0000 (12:36 -0800)]
net: phy: bcm7xxx: Add entry for BCM7278
Add support for the BCM7278 28nm process Gigabit Ethernet PHY.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:32 +0000 (12:36 -0800)]
net: dsa: bcm_sf2: Allow non-IMP ports to have Broadcom tags enabled
Parse the "brcm,use-bcm-hdr" boolean property during ports
identification to fill a bitmask of ports that should have Broadcom tags
enabled. This is needed in some configurations where per-packet metadata
can be exchanged using Broadcom tags between the switch and an on-chip
acceleration device.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:31 +0000 (12:36 -0800)]
net: dsa: bcm_sf2: Move code enabling Broadcom tags
In preparation for enabling Broadcom tags on different ports based on
configuration information, dedicate a function that is responsible for
enabling Broadcom tags for a given port and update the IMP port setup to
call it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:30 +0000 (12:36 -0800)]
net: dsa: bcm_sf2: Add support for BCM7278 integrated switch
Add support for the integrated switch found on BCM7278:
- core_reg_align is set to 1, to force a translation into the target
address space which is 8 bytes aligned
- an alternate SWITCH_REG layout is provided since registers are largely
bit/masks compatible but have different offsets
- conditional for all CORE_STS_OVERRIDE_{IMP,GMII_P} since those got
moved way out of the traditional register space
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:29 +0000 (12:36 -0800)]
net: dsa: bcm_sf2: Prepare for different register layouts
In preparation for supporting a new device with a slightly different
register layout, affecting the SWITCH_REG and SWITCH_CORE address
spaces, perform a few preparatory steps:
- allow matching the compatible string against a data description
- convert the SWITCH_REG register accesses into an indirection table
- prepare for supporting a SWITCH_CORE register alignment requirement
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 20:36:28 +0000 (12:36 -0800)]
net: dsa: bcm_sf2: Make SF2_IO64_MACRO() utilize 32-bit macro
There is no point inlining the 32-bit direct register read/write part,
just infer it from the existing macro. This will make it easier to
centralize the address rewriting that we are going to introduce later
on.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 22 Jan 2017 21:56:07 +0000 (16:56 -0500)]
Merge branch 'systemport-lite'
Florian Fainelli says:
====================
net: systemport: Add support for SYSTEMPORT lite
This patch series adds support for SYSTEMPORT Lite which is an evolution
of the existing SYSTEMPORT adapter.
The two generations are largely identical as far as the transmit/receive
path are concerned, and there were just a few control path changes here
and there.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 19:08:27 +0000 (11:08 -0800)]
net: systemport: Add support for SYSTEMPORT Lite
Add supporf for the SYSTEMPORT Lite Ethernet controller, this piece of hardware
is largely based on the full-blown SYSTEMPORT and differs in the following:
- no full-blown UniMAC, instead we have the MagicPacket matching from UniMAC at
same offset, and a GMII Interface Block (GIB) for the MAC-level stuff, since
we are always interfaced to an Ethernet switch which is fully Ethernet compliant
shortcuts could be made
- 16 transmit queues, whose interrupts are moved into the first Level-2 interrupt
controller bank
- slight TDMA offset change (a register was inserted after TDMA_STATUS, *sigh*)
- 256 RX descriptors (512 words) and 256 TX descriptors (not visible)
As a consequence of these two things, update the code paths accordingly to
differentiate the full-blown from the light version.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 20 Jan 2017 19:08:26 +0000 (11:08 -0800)]
net: systemport: Dynamically allocate number of TX rings
In preparation for adding SYSTEMPORT Lite, which has twice as less transmit
queues than SYSTEMPORT make sure we do allocate TX rings based on the
systemport,txq property to get an appropriate memory footprint.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 20 Jan 2017 16:08:56 +0000 (08:08 -0800)]
ipv6: add NUMA awareness to seg6_hmac_init_algo()
Since we allocate per cpu storage, let's also use NUMA hints.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
jpinto [Fri, 20 Jan 2017 16:00:26 +0000 (16:00 +0000)]
net: stmicro: fix LS field mask in EEE configuration
This patch fixes the LS mask when setting EEE timer.
LS field is 10 bits long and not 11 as currently.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Reported-By: Rayagond Kokatanur <rayagond@vayavyalabs.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 20 Jan 2017 14:36:57 +0000 (22:36 +0800)]
net/mlx4: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 20 Jan 2017 14:36:53 +0000 (22:36 +0800)]
6lowpan: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 19:42:52 +0000 (14:42 -0500)]
Merge branch 'dsa-hwmon'
Andrew Lunn says:
====================
net: dsa: Move temperature sensor code into PHY.
Marvell Ethernet switches contain a temperature sensor. There appears
to be one sensor, which is shared by each of the internal PHYs. Each
PHY has independent registers to read this sensor, and to set a limit
for when an alarm should be raised.
Some Marvell discrete PHY also have the same sensor and registers.
Moving the HWMON code from DSA into the PHY makes the sensor available
in discrete PHYs, and removes the layering violation, the switch
driver poking around in PHY registers.
While moving the code into the PHY driver, it has been re-written to
use the new HWMON APIs.
v2:
Better Cover note explaining one sensor, but multiple independent
registers
Simply error checking.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 20 Jan 2017 00:37:50 +0000 (01:37 +0100)]
net: dsa: Remove hwmon support
Only the Marvell mv88e6xxx DSA driver made use of the HWMON support in
DSA. The temperature sensor registers are actually in the embedded
PHYs, and the PHY driver now supports it. So remove all HWMON support
from DSA and drivers.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 20 Jan 2017 00:37:49 +0000 (01:37 +0100)]
phy: marvell: Add support for temperature sensor
Some Marvell PHYs have an inbuilt temperature sensor. Add hwmon
support for this sensor.
There are two different variants. The simpler, older chips have a 5
degree accuracy. The newer devices have 1 degree accuracy.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Josef Bacik [Thu, 19 Jan 2017 22:47:46 +0000 (17:47 -0500)]
inet: don't use sk_v6_rcv_saddr directly
When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL
sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison function
can be wrong.
Fixes:
637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 19:22:27 +0000 (14:22 -0500)]
Merge tag 'mlx5-updates-2017-01-19' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 and mlx5e updates 2017-01-19
This series includes some updates for mlx5 core and mlx5e netdevice driver.
From Leon, a small fix that remove an unnecessary print.
From Eli Cohen, a fix to the FW version printout in case of internal error.
From Eugenia Emantayev, two patches, the 1st adds mlx5 1pps (pulse per
second) mlx5 infrastructure support and the 2nd adds the necessary bits
for mlx5e ptp logic and structures.
From Mohamad, add support for s-tagged packet receive when in promiscuous
mode.
Form Gal Pressman, MCAM (Management capabilities mask register) and PCAM
(Ports capabilities mask register) registers infrastructure, those
registers are needed in order to query the different statistics registers
support in FW, in order for the driver to enable/disable query and
reporting them back to user. On top of this infrastructure we've exposed
new set of statistics groups:
- MPCNT: Physical layer statistical counters (For symbol errors)
- PPCNT: PCIe performance counters
In addition to the statistics capabilities series we've moved the mlx5 HCA
capabilities fields to a dedicated struct under the driver private data.
At the end a small patch to update & query statistics in the most desired
order.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 17:35:11 +0000 (12:35 -0500)]
Merge branch 'cpsw-common-res-usage'
Ivan Khoronzhuk says:
====================
net: ethernet: ti: cpsw: correct common res usage
This series is intended to remove unneeded redundancies connected with
common resource usage function.
Since v1:
- changed name to cpsw_get_usage_count()
- added comments to open/closw for cpsw_get_usage_count()
- added patch:
net: ethernet: ti: cpsw: clarify ethtool ops changing num of descs
Based on net-next/master
====================
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Thu, 19 Jan 2017 16:58:27 +0000 (18:58 +0200)]
net: ethernet: ti: cpsw: clarify ethtool ops changing num of descs
After adding cpsw_set_ringparam ethtool op, better to carry out
common parts of similar ops splitting descriptors in runtime. It
allows to reuse these parts and shows what the ops actually do.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Thu, 19 Jan 2017 16:58:26 +0000 (18:58 +0200)]
net: ethernet: ti: cpsw: don't duplicate common res in rx handler
No need to duplicate the same function in rx handler to get info
if any interface is running.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Thu, 19 Jan 2017 16:58:25 +0000 (18:58 +0200)]
net: ethernet: ti: cpsw: don't duplicate ndev_running
No need to create additional vars to identify if interface is running.
So simplify code by removing redundant var and checking usage counter
instead.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Thu, 19 Jan 2017 16:58:24 +0000 (18:58 +0200)]
net: ethernet: ti: cpsw: don't disable interrupts in ndo_open
No need to disable interrupts if no open devices,
they are disabled anyway.
Even no need to disable interrupts if some ndev is opened, In this
case shared resources are not touched, only parameters of ndev shell,
so no reason to disable them also. Removed lines have proved it.
So, no need in redundant check and interrupt disable.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Thu, 19 Jan 2017 16:58:23 +0000 (18:58 +0200)]
net: ethernet: ti: cpsw: remove dual check from common res usage function
Common res usage is possible only in case an interface is
running. In case of not dual emac here can be only one interface,
so while ndo_open and switch mode, only one interface can be opened,
thus if open is called no any interface is running ... and no common
res are used. So remove check on dual emac, it will simplify
code/understanding and will match the name it's called.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 17:22:27 +0000 (12:22 -0500)]
Merge branch 'rxbusy'
Mahesh Bandewar says:
====================
use netdev_is_rx_handler_busy() in few known cases
netdev_rx_handler_register() was recently split into two parts - (a) check
if the handler is used, (b) register the new handler, parts. This is
helpful in scenarios like bonding where at the time of registration there
is too much state to unwind and it should check if the device is free
before building that state. IPvlan and macvlan drivers don't have this
issue however it can make use of the same check instead of using a device
specific check.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Wed, 18 Jan 2017 23:02:55 +0000 (15:02 -0800)]
macvlan: use netdev_is_rx_handler_busy instead of checking specific type
netdev_is_rx_handler_busy() check is a superset of netif_is_ipvlan_port()
check and hence should be preferred.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Wed, 18 Jan 2017 23:02:53 +0000 (15:02 -0800)]
ipvlan: use netdev_is_rx_handler_busy instead of checking specific type
IPvlan checks if the master device is already used by checking a
specific device (here it's macvlan device). This is technically not
sufficient and it should just ensure the rx_handler is busy or not.
This would be a super check that includes macvlan and any other that
has already registered rx-handler.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Wed, 18 Jan 2017 23:02:49 +0000 (15:02 -0800)]
net: remove duplicate code.
netdev_rx_handler_register() checks to see if the handler is already
busy which was recently separated into netdev_is_rx_handler_busy(). So
use the same function inside register() to avoid code duplication.
Essentially this change should be a no-op
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Collins [Wed, 18 Jan 2017 21:04:28 +0000 (14:04 -0700)]
fq_codel: Avoid regenerating skb flow hash unless necessary
The fq_codel qdisc currently always regenerates the skb flow hash.
This wastes some cycles and prevents flow seperation in cases where
the traffic has been encrypted and can no longer be understood by the
flow dissector.
Change it to use the prexisting flow hash if one exists, and only
regenerate if necessary.
Signed-off-by: Andrew Collins <acollins@cradlepoint.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lance Richardson [Wed, 18 Jan 2017 20:24:57 +0000 (15:24 -0500)]
vxlan: preserve type of dst_port parm for encap_bypass_if_local()
Eliminate sparse warning by maintaining type of dst_port
as __be16.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lance Richardson [Wed, 18 Jan 2017 20:14:56 +0000 (15:14 -0500)]
csum: eliminate sparse warning in remcsum_unadjust()
Cast second parameter of csum_sub() from __sum16 to __wsum.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 17:10:17 +0000 (12:10 -0500)]
Merge branch 'tipc-multicast-through-replication'
Jon Maloy says:
====================
tipc: emulate multicast through replication
TIPC multicast messages are currently distributed via L2 broadcast
or IP multicast to all nodes in the cluster, irrespective of the
number of real destinations of the message.
In this series we introduce an option to transport messages via
replication ("replicast") across a selected number of unicast links,
instead of relying on the underlying media. This option is used when
true broadcast/multicast is not supported by the media, or when the
number of true destinations is much smaller than the cluster size.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 18 Jan 2017 18:50:53 +0000 (13:50 -0500)]
tipc: make replicast a user selectable option
If the bearer carrying multicast messages supports broadcast, those
messages will be sent to all cluster nodes, irrespective of whether
these nodes host any actual destinations socket or not. This is clearly
wasteful if the cluster is large and there are only a few real
destinations for the message being sent.
In this commit we extend the eligibility of the newly introduced
"replicast" transmit option. We now make it possible for a user to
select which method he wants to be used, either as a mandatory setting
via setsockopt(), or as a relative setting where we let the broadcast
layer decide which method to use based on the ratio between cluster
size and the message's actual number of destination nodes.
In the latter case, a sending socket must stick to a previously
selected method until it enters an idle period of at least 5 seconds.
This eliminates the risk of message reordering caused by method change,
i.e., when changes to cluster size or number of destinations would
otherwise mandate a new method to be used.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 18 Jan 2017 18:50:52 +0000 (13:50 -0500)]
tipc: introduce replicast as transport option for multicast
TIPC multicast messages are currently carried over a reliable
'broadcast link', making use of the underlying media's ability to
transport packets as L2 broadcast or IP multicast to all nodes in
the cluster.
When the used bearer is lacking that ability, we can instead emulate
the broadcast service by replicating and sending the packets over as
many unicast links as needed to reach all identified destinations.
We now introduce a new TIPC link-level 'replicast' service that does
this.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 18 Jan 2017 18:50:51 +0000 (13:50 -0500)]
tipc: add functionality to lookup multicast destination nodes
As a further preparation for the upcoming 'replicast' functionality,
we add some necessary structs and functions for looking up and returning
a list of all nodes that host destinations for a given multicast message.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 18 Jan 2017 18:50:50 +0000 (13:50 -0500)]
tipc: add function for checking broadcast support in bearer
As a preparation for the 'replicast' functionality we are going to
introduce in the next commits, we need the broadcast base structure to
store whether bearer broadcast is available at all from the currently
used bearer or bearers.
We do this by adding a new function tipc_bearer_bcast_support() to
the bearer layer, and letting the bearer selection function in
bcast.c use this to give a new boolean field, 'bcast_support' the
appropriate value.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gianluca Borello [Wed, 18 Jan 2017 17:55:49 +0000 (17:55 +0000)]
bpf: add bpf_probe_read_str helper
Provide a simple helper with the same semantics of strncpy_from_unsafe():
int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)
This gives more flexibility to a bpf program. A typical use case is
intercepting a file name during sys_open(). The current approach is:
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
char buf[PATHLEN]; // PATHLEN is defined to 256
bpf_probe_read(buf, sizeof(buf), ctx->di);
/* consume buf */
}
This is suboptimal because the size of the string needs to be estimated
at compile time, causing more memory to be copied than often necessary,
and can become more problematic if further processing on buf is done,
for example by pushing it to userspace via bpf_perf_event_output(),
since the real length of the string is unknown and the entire buffer
must be copied (and defining an unrolled strnlen() inside the bpf
program is a very inefficient and unfeasible approach).
With the new helper, the code can easily operate on the actual string
length rather than the buffer size:
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
char buf[PATHLEN]; // PATHLEN is defined to 256
int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);
/* consume buf, for example push it to userspace via
* bpf_perf_event_output(), but this time we can use
* res (the string length) as event size, after checking
* its boundaries.
*/
}
Another useful use case is when parsing individual process arguments or
individual environment variables navigating current->mm->arg_start and
current->mm->env_start: using this helper and the return value, one can
quickly iterate at the right offset of the memory area.
The code changes simply leverage the already existent
strncpy_from_unsafe() kernel function, which is safe to be called from a
bpf program as it is used in bpf_trace_printk().
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jan 2017 16:43:17 +0000 (11:43 -0500)]
Merge branch 'bus-agnostic-num-vf'
Phil Sutter says:
====================
Retrieve number of VFs in a bus-agnostic way
Previously, it was assumed that only PCI NICs would be capable of having
virtual functions - with my proposed enhancement of dummy NIC driver
implementing (fake) ones for testing purposes, this is no longer true.
Discussion of said patch has led to the suggestion of implementing a
bus-agnostic method for VF count retrieval so rtnetlink could work with
both real VF-capable PCI NICs as well as my dummy modifications without
introducing ugly hacks.
The following series tries to achieve just that by introducing a bus
type callback to retrieve a device's number of VFs, implementing this
callback for PCI bus and finally adjusting rtnetlink to make use of the
generalized infrastructure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Sutter [Wed, 18 Jan 2017 13:04:39 +0000 (14:04 +0100)]
device: Implement a bus agnostic dev_num_vf routine
Now that pci_bus_type has num_vf callback set, dev_num_vf can be
implemented in a bus type independent way and the check for whether a
PCI device is being handled in rtnetlink can be dropped.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Sutter [Wed, 18 Jan 2017 13:04:38 +0000 (14:04 +0100)]
PCI: implement num_vf bus type callback
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Sutter [Wed, 18 Jan 2017 13:04:37 +0000 (14:04 +0100)]
device: bus_type: Introduce num_vf callback
This allows for bus types to implement their own method of retrieving
the number of virtual functions a NIC on that type of bus supports.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 20 Jan 2017 14:27:04 +0000 (22:27 +0800)]
sock: use hlist_entry_safe
Use hlist_entry_safe() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Fri, 20 Jan 2017 13:53:06 +0000 (14:53 +0100)]
gre6: Clean up unused struct ipv6_tel_txoption definition
Commit
b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE
functions") removed the ip6gre specific transmit function, but left the
struct ipv6_tel_txoption definition. Clean it up.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 20 Jan 2017 13:06:08 +0000 (05:06 -0800)]
net: remove bh disabling around percpu_counter accesses
Shaohua Li made percpu_counter irq safe in commit
098faf5805c8
("percpu_counter: make APIs irq safe")
We can safely remove BH disable/enable sections around various
percpu_counter manipulations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Wed, 18 Jan 2017 14:52:51 +0000 (15:52 +0100)]
cxgb4: hide unused warnings
The two new variables are only used inside of an #ifdef and cause
harmless warnings when that is disabled:
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: In function 'init_one':
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:9: error: unused variable 'port_vec' [-Werror=unused-variable]
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:6: error: unused variable 'v' [-Werror=unused-variable]
This adds another #ifdef around the declarations.
Fixes:
96fe11f27b70 ("cxgb4: Implement ndo_get_phys_port_id for mgmt dev")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 18 Jan 2017 15:40:36 +0000 (07:40 -0800)]
net: ipv6: Keep nexthop of multipath route on admin down
IPv6 deletes route entries associated with multipath routes on an
admin down where IPv4 does not. For example:
$ ip ro ls vrf red
unreachable default metric 8192
1.1.1.0/24 metric 64
nexthop via 10.100.1.254 dev eth1 weight 1
nexthop via 10.100.2.254 dev eth2 weight 1
10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4
10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4
$ ip -6 ro ls vrf red
2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium
2001:db8:2:: dev red proto none metric 0 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 pref medium
2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium
...
Set link down:
$ ip li set eth1 down
IPv4 retains the multihop route but flags eth1 route as dead:
$ ip ro ls vrf red
unreachable default metric 8192
1.1.1.0/24
nexthop via 10.100.1.16 dev eth1 weight 1 dead linkdown
nexthop via 10.100.2.16 dev eth2 weight 1
10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4
and IPv6 deletes the route as part of flushing all routes for the device:
$ ip -6 ro ls vrf red
2001:db8:2:: dev red proto none metric 0 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium
...
Worse, on admin up of the device the multipath route has to be deleted
to get this leg of the route re-added.
This patch keeps routes that are part of a multipath route if
ignore_routes_with_linkdown is set with the dead and linkdown flags
enabling consistency between IPv4 and IPv6:
$ ip -6 ro ls vrf red
2001:db8:2:: dev red proto none metric 0 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown pref medium
2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 18 Jan 2017 04:14:10 +0000 (20:14 -0800)]
mlx4: support __GFP_MEMALLOC for rx
Commit
04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
As using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds this flag.
Note that this driver does not reuse pages (yet) so we do not have to
add anything else.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Tue, 17 Jan 2017 22:31:19 +0000 (16:31 -0600)]
Revert "net: qcom/emac: configure the external phy to allow pause frames"
This reverts commit
3e884493448131179a5b7cae1ddca1028ffaecc8.
With commit
529ed1275263 ("net: phy: phy drivers should not set
SUPPORTED_[Asym_]Pause"), phylib now handles automatically enabling
pause frame support in the PHY, and the MAC driver should follow suit.
Since the EMAC driver driver does this, we no longer need to force
pause frames support.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Wed, 28 Dec 2016 15:07:17 +0000 (17:07 +0200)]
net/mlx5e: Reorder update stats
Reorder update stats flow to update most important counters last,
to get more accurate results.
New update order:
- PCIe counters
- Port counters
- Vport counters
- Queue counters
- Software counters
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Gal Pressman [Wed, 14 Dec 2016 15:40:41 +0000 (17:40 +0200)]
net/mlx5: Move cached hca caps to designated caps struct
The caps structure consists of hca caps and port/management caps,
all under one roof.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Thu, 17 Nov 2016 11:46:02 +0000 (13:46 +0200)]
net/mlx5e: Expose PCIe statistics to ethtool
This patch exposes PCIe performance counters, queried with
ethtool -S <devname>.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Thu, 17 Nov 2016 11:46:01 +0000 (13:46 +0200)]
net/mlx5: Add MPCNT register infrastructure
Add the needed infrastructure for future use of MPCNT register.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Tue, 23 Aug 2016 09:23:29 +0000 (12:23 +0300)]
net/mlx5e: Expose physical layer statistical counters to ethtool
Use ethtool -S to query physical layer statistical counters including:
- rx_symbol_errors_phy: Number of symbol errors that were not corrected
by FEC correction algorithm or that FEC was not active on this interface.
- rx_corrected_bits_phy: Number of corrected bits according to active
FEC (RS/FC).
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Tue, 27 Sep 2016 14:04:51 +0000 (17:04 +0300)]
net/mlx5: Add PPCNT physical layer statistical group infrastructure
Add the needed infrastructure for future use of PPCNT physical layer
statistical group.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Gal Pressman [Thu, 8 Dec 2016 14:03:31 +0000 (16:03 +0200)]
net/mlx5: Query and cache PCAM, MCAM registers on initialization
On load_one, we now cache our capabilities registers internally, similar
to QUERY_HCA_CAP. Capabilities can later be queried using macros
introduced in this patch.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Thu, 8 Dec 2016 13:56:00 +0000 (15:56 +0200)]
net/mlx5: Implement PCAM, MCAM access register commands
Introduced registers will expose capabilities of new registers and
features related to port/management.
Driver will query MCAM and PCAM in order to avoid failing on old
firmwares with lack of support.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Thu, 8 Dec 2016 13:52:00 +0000 (15:52 +0200)]
net/mlx5: Expose PCAM, MCAM registers infrastructure
PCAM: Ports capabilities mask register.
MCAM: Management capabilities mask register.
PCAM and MCAM registers will provide information regarding firmware
support for different features, in order to avoid cases where new driver
combined with old firmware results in syndromes (for ex. PCIe counters
before this patchset).
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mohamad Haj Yahia [Sun, 9 Oct 2016 14:05:31 +0000 (17:05 +0300)]
net/mlx5e: Receive s-tagged packets in promiscuous mode
Today when the driver enter to promiscuous mode or vlan
filter is disabled, we add flow rule to receive any c-taggd
packets, therefore s-tagged packets are dropped.
In order to receive s-tagged packets as well we need to add
flow rule to receive any s-tagged packet.
Fixes:
7cb21b794baa ('net/mlx5e: Rename en_flow_table.c to en_fs.c')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mohamad Haj Yahia [Sun, 9 Oct 2016 13:25:43 +0000 (16:25 +0300)]
net/mlx5: Add support to s-tag in mlx5 firmware interface
Add svlan_tag and rename vlan_tag to cvlan_tag in flow table entry
match param.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Eugenia Emantayev [Mon, 22 Aug 2016 11:57:41 +0000 (14:57 +0300)]
net/mlx5e: Implement 1PPS support
This patch enables the 1PPS IN and 1PPS OUT support according
to the advertised HCA capability. Single pin may be configured
to one of the above mutual exclusive functions via standard
Linux tools and APIs. For example, testptp open source application.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eugenia Emantayev [Mon, 10 Oct 2016 13:05:53 +0000 (16:05 +0300)]
net/mlx5: Add MTPPS and MTPPSE registers infrastructure
Implement query and set functionality for MTPPS and MTPPSE registers.
MTPPS (Management Pulse Per Second) provides the device PPS capabilities,
configures the PPS in and out modules and holds the PPS in time stamp.
Query MTPPS is supported only when HCA_CAP.pps is set and modify is supported
when HCA_CAP.pps_modify is set.
MTPPSE (Management Pulse Per Second Event) configures the different event
generation modes for PPS. Supported when HCA_CAP.pps is set.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eli Cohen [Fri, 16 Dec 2016 16:30:15 +0000 (10:30 -0600)]
net/mlx5: Fix version printout in case of health issue
Firmware representation of the firmware version on the health buffer has
changed for newer device. The representation in the initialization
segment does not and will not change. In addition, we print the health
buffer firmware version as a raw hex number.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Leon Romanovsky [Thu, 1 Dec 2016 08:34:57 +0000 (10:34 +0200)]
net/mlx5: Remove information print after attempt to load mlx5_ib module
Infiniband part of mlx5 driver can be compiled as a module
or as a part of bzImage (compiled in). In the second case,
the call to request_module will return an error -ENOENT.
It will cause to a misleading print "failed request module
on mlx5_ib".
This patch removes this print, In order to comply with mlx4.
Fixes:
f66f049fb738 ("net/mlx5_core: Request the mlx5 IB module on driver load")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Volodymyr Bendiuga [Thu, 19 Jan 2017 16:05:04 +0000 (17:05 +0100)]
phy: increase size of MII_BUS_ID_SIZE and bus_id
Some bus names are pretty long and do not fit into
17 chars. Increase therefore MII_BUS_ID_SIZE and
phy_fixup.bus_id to larger number. Now mii_bus.id
can host larger name.
Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
Signed-off-by: Magnus Öberg <magnus.oberg@westermo.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrei.Pistirica@microchip.com [Thu, 19 Jan 2017 15:56:15 +0000 (17:56 +0200)]
macb: Common code to enable ptp support for MACB/GEM
This patch does the following:
- MACB/GEM-PTP interface
- registers and bitfields for TSU
- capability flags to enable PTP per platform basis
Signed-off-by: Andrei Pistirica <andrei.pistirica@microchip.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Thu, 19 Jan 2017 15:04:25 +0000 (16:04 +0100)]
net: caif: Remove unused stats member from struct chnl_net
The stats member of struct chnl_net is used nowhere in the code, so it
might as well be removed.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Thu, 19 Jan 2017 09:45:31 +0000 (10:45 +0100)]
net/sched: cls_flower: reduce fl_change stack size
The new ARP support has pushed the stack size over the edge on ARM,
as there are two large objects on the stack in this function (mask
and tb) and both have now grown a bit more:
net/sched/cls_flower.c: In function 'fl_change':
net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
We can solve this by dynamically allocating one or both of them.
I first tried to do it just for the mask, but that only saved
152 bytes on ARM, while this version just does it for the 'tb'
array, bringing the stack size back down to 664 bytes.
Fixes:
99d31326cbe6 ("net/sched: cls_flower: Support matching on ARP")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Wed, 18 Jan 2017 16:45:01 +0000 (17:45 +0100)]
net: Remove usage of net_device last_rx member
The network stack no longer uses the last_rx member of struct net_device
since the bonding driver switched to use its own private last_rx in
commit
9f242738376d ("bonding: use last_arp_rx in slave_last_rx()").
However, some drivers still (ab)use the field for their own purposes and
some driver just update it without actually using it.
Previously, there was an accompanying comment for the last_rx member
added in commit
4dc89133f49b ("net: add a comment on netdev->last_rx")
which asked drivers not to update is, unless really needed. However,
this commend was removed in commit
f8ff080dacec ("bonding: remove
useless updating of slave->dev->last_rx"), so some drivers added later
on still did update last_rx.
Remove all usage of last_rx and switch three drivers (sky2, atp and
smc91c92_cs) which actually read and write it to use their own private
copy in netdev_priv.
Compile-tested with allyesconfig and allmodconfig on x86 and arm.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 18 Jan 2017 01:41:39 +0000 (20:41 -0500)]
net: dsa: use cpu_switch instead of ds[0]
Now that the DSA Ethernet switches are true Linux devices, the CPU
switch is not necessarily the first one. If its address is higher than
the second switch on the same MDIO bus, its index will be 1, not 0.
Avoid any confusion by using dst->cpu_switch instead of dst->ds[0].
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 18 Jan 2017 01:41:38 +0000 (20:41 -0500)]
net: dsa: store CPU switch structure in the tree
Store a dsa_switch pointer to the CPU switch in the tree instead of only
its index. This avoids the need to initialize it to -1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Wed, 18 Jan 2017 00:28:06 +0000 (02:28 +0200)]
net: ethernet: ti: davinci_cpdma: correct check on NULL in set rate
Check "ch" on NULL first, then get ctlr.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Jan 2017 21:35:30 +0000 (16:35 -0500)]
Merge branch 'vhost_net-batching'
Jason Wang says:
====================
vhost_net tx batching
This series tries to implement tx batching support for vhost. This was
done by using MSG_MORE as a hint for under layer socket. The backend
(e.g tap) can then batch the packets temporarily in a list and
submit it all once the number of bacthed exceeds a limitation.
Tests shows obvious improvement on guest pktgen over over
mlx4(noqueue) on host:
Mpps -+%
rx-frames = 0 0.91 +0%
rx-frames = 4 1.00 +9.8%
rx-frames = 8 1.00 +9.8%
rx-frames = 16 1.01 +10.9%
rx-frames = 32 1.07 +17.5%
rx-frames = 48 1.07 +17.5%
rx-frames = 64 1.08 +18.6%
rx-frames = 64 (no MSG_MORE) 0.91 +0%
Changes from V4:
- stick to NAPI_POLL_WEIGHT for rx-frames is user specify a value
greater than it.
Changes from V3:
- use ethtool instead of module parameter to control the maximum
number of batched packets
- avoid overhead when MSG_MORE were not set and no packet queued
Changes from V2:
- remove uselss queue limitation check (and we don't drop any packet now)
Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceeds max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched pacekts from vhost to tuntap
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Wed, 18 Jan 2017 07:02:03 +0000 (15:02 +0800)]
tun: rx batching
We can only process 1 packet at one time during sendmsg(). This often
lead bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting them to host network
stack. This is done through accepting MSG_MORE as a hint from
sendmsg() caller, if it was set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE were cleared.
Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
Mpps -+%
rx-frames = 0 0.91 +0%
rx-frames = 4 1.00 +9.8%
rx-frames = 8 1.00 +9.8%
rx-frames = 16 1.01 +10.9%
rx-frames = 32 1.07 +17.5%
rx-frames = 48 1.07 +17.5%
rx-frames = 64 1.08 +18.6%
rx-frames = 64 (no MSG_MORE) 0.91 +0%
User were allowed to change per device batched packets through
ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation
to prevent bh from being disabled too long.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Wed, 18 Jan 2017 07:02:02 +0000 (15:02 +0800)]
vhost_net: tx batching
This patch tries to utilize tuntap rx batching by peeking the tx
virtqueue during transmission, if there's more available buffers in
the virtqueue, set MSG_MORE flag for a hint for backend (e.g tuntap)
to batch the packets.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Wed, 18 Jan 2017 07:02:01 +0000 (15:02 +0800)]
vhost: better detection of available buffers
This patch tries to do several tweaks on vhost_vq_avail_empty() for a
better performance:
- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
step.
This patch is need for batching supports which needs to peek whether
or not there's still available buffers in the ring.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>