Philippe Reynes [Sat, 2 Jul 2016 22:05:05 +0000 (00:05 +0200)]
net: ethernet: lantiq_etop: use phy_ethtool_{get|set}_link_ksettings
There are two generics functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Sat, 2 Jul 2016 22:05:04 +0000 (00:05 +0200)]
net: ethernet: lantiq_etop: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Sat, 2 Jul 2016 21:37:00 +0000 (23:37 +0200)]
net: ethernet: cavium: octeon: use phy_ethtool_{get|set}_link_ksettings
There are two generics functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Sat, 2 Jul 2016 21:36:59 +0000 (23:36 +0200)]
net: ethernet: cavium: octeon: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe Jaillet [Sat, 2 Jul 2016 12:31:05 +0000 (14:31 +0200)]
net/mlx4: Fix some indent inconsistancy
Silent a few smatch warnings about indentation.
This include the removal of a 'return' statement in 'resource_tracker.c'.
This 'return' will still be performed when breaking out of the
corresponding 'switch' block.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal Hadi Salim [Sat, 2 Jul 2016 10:43:16 +0000 (06:43 -0400)]
net sched actions: skbedit convert to use more modern nla_put_xxx
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal Hadi Salim [Sat, 2 Jul 2016 10:43:15 +0000 (06:43 -0400)]
net sched actions: skbedit add support for mod-ing skb pkt_type
Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)
Example usage:
tc filter add dev eth0 parent ffff: protocol ip pref 9 u32 \
match ip src 5.5.5.5/32 \
flowid 1:5 action skbedit ptype host
This will tag all packets incoming from 5.5.5.5 with type
PACKET_HOST
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal Hadi Salim [Sat, 2 Jul 2016 10:43:14 +0000 (06:43 -0400)]
net: simplify and make pkt_type_ok() available for other users
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Sat, 2 Jul 2016 06:00:50 +0000 (08:00 +0200)]
net-next: mediatek: fix compile error inside mtk_poll_controller()
commit
8067302973a1 ("net-next: mediatek: add support for IRQ grouping")
failed to properly update the irq handling inside mtk_poll_controller()
causing compile errors if netconsole was enabled. Fix this by updating
the code to use the new separated irq handler function for RX.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Jul 2016 19:21:22 +0000 (15:21 -0400)]
Merge branch 'mlxsw-router-interfaces-groundwork'
Jiri Pirko says:
====================
mlxsw: Lay the groundwork for the introduction of router interfaces
This is first patchset on a way to introduce ipv4 routing offload support
in mlxsw driver. Does preparations before router interfaces will
be introduced in mlxsw.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 2 Jul 2016 09:00:20 +0000 (11:00 +0200)]
mlxsw: spectrum: Add traps needed for router implementation
ip2me:
To instruct HW to send trapped ip2me traffic to kernel, we have to add
this trap. Selection ip2me traffic is introduced later on in this set.
ARPs:
We are going to stop flooding to CPU port when netdev isn't bridged and
only get packets destined to the netdev's IP address and certain control
packets.
Add traps for ARP request (broadcast) and response (unicast) in order to
get these to the CPU and resolve neighbours.
host miss:
If a packet is routed through a directly connected route and its
destination IP is not in the device's neighbour table, then we need to
trap it to CPU. This will cause the host to resolve the MAC of the
neighbour, which will be eventually programmed to the device's table.
router ingress:
In order to trap packets in router part.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:19 +0000 (11:00 +0200)]
mlxsw: spectrum: Use action 'discard' when removing traps
When removing packet traps we should use action 'discard' instead of
'forward', as some trap IDs we'll add cannot be configured with the
later. However, result is the same, as packets are not trapped to the
CPU.
In the future we will be able to reverse the operation properly by
detaching the trap group from the CPU.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:18 +0000 (11:00 +0200)]
mlxsw: reg: Add Router Interface Table Register
Add the Router Interface Table Register (RITR), which allows us to
create and configure router interfaces (RIFs).
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:17 +0000 (11:00 +0200)]
mlxsw: reg: Add FDB action to forward to router
Incoming packets are directed to the router when they match an FDB
entry with action forward to IP router.
Add this action, which was mistakenly named "TRAP".
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:16 +0000 (11:00 +0200)]
mlxsw: spectrum: Add router interface struct
When enabling the router in the device we will represent L3 netdevs
using router interfaces (RIFs). These will be specified whenever
programming routes or neighbours on the netdev.
Introduce the basic RIF infrastructure which allows one to lookup a RIF
by its netdev. Later patches in the series will extend this, but the
basic routines are needed now in order to direct traffic to CPU.
Pointers to the RIF structs are stored in an array indexed by the RIF's
number. This will allow us to efficiently update the kernel's neighbour
table when regularly dumping the device's table.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:15 +0000 (11:00 +0200)]
mlxsw: spectrum_router: Add basic ipv4 router initialization
Create a skeleton router file and do basic HW initialization of router.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:14 +0000 (11:00 +0200)]
mlxsw: spectrum: Initialize ports at the end of init sequence
During ports initialization a net device is registered for each
available port, which implies the port is usable. However, a port is
only usable after the different parts of the device (e.g. flooding,
buffers) are initialized. This is especially important now, when we must
initialize the router before the ports, as otherwise the device can't be
initialized.
Solve that by initializing the switch ports at the end of init sequence.
Also, remove an unnecessary warning about port up/down events, which
would otherwise be invoked whenever removing the driver, as ports are
removed before unregistering the listener for these events.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:13 +0000 (11:00 +0200)]
mlxsw: reg: Add Router General Configuration Register
Add the Router General Configuration Register (RGCR), which allows us to
enable the router in the device and configure its various parameters.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:12 +0000 (11:00 +0200)]
mlxsw: spectrum: Remove RIF from PVID vPort when joining / leaving LAG
We are going to assign router interfaces (RIFs) to netdevs if an IPv4
address was assigned to them. If one was assigned to a port netdev, this
will translate to the PVID vPort being member in a RIF.
While it's possible for a LAG slave to have an IP address, we can't have
a vPort being member in two FIDs (assuming the LAG device will be
put in bridge / assigned an IP address).
Solve that by making the PVID vPort leave any FID it might be a member
in when joining / leaving LAG.
Note that the PVID vPort is the only vPort that can be present on the
port when it's put under LAG.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:11 +0000 (11:00 +0200)]
mlxsw: spectrum: Sync PVID vPort LAG status
When VLAN devices are created on top of LAG, their underlying vPorts are
configured correctly with LAG membership.
However, the PVID vPort is implicit and already present when the port
netdev is put under LAG, so its LAG membership is never set. Set it
correctly when joining / leaving LAG.
This didn't matter until now, but we are going to introduce support for
router interfaces (RIFs), which need to take into account LAG membership.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:10 +0000 (11:00 +0200)]
mlxsw: spectrum: Remove VLANs configuration via SELF flag
When port isn't bridged it is still possible to invoke switchdev ops and
configure the device's VLAN filters.
However, this will require us to use different Router InterFaces (RIFs)
for the same netdev, instead of one per-netdev as with any other
configuration.
Taking the above into account and the fact that this functionality is
questionable with regards to the device's normal use-case, remove it and
instead return an error.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 2 Jul 2016 09:00:09 +0000 (11:00 +0200)]
mlxsw: spectrum: Send untagged packets through a port netdev
Port netdevs (e.g. swXpY) that are not bridged are represented in the
device using a vPort with VID=PVID=1 (the PVID vPort), as untagged
packets entering the switch are internally tagged with the PVID VLAN.
When these packets are routed through a different port netdev they
should egress untagged.
This wasn't a problem until now, as non-bridged traffic only originated
from the CPU, which transmits packets out of the port as-is.
When a vPort is created with VID 1 mark it as egress untagged.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Jul 2016 18:52:43 +0000 (14:52 -0400)]
Merge branch 'bnxt_en-next'
Michael Chan says:
====================
bnxt_en updates for net-next.
Mostly small miscellaneous changes.
Please review for net-next. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:29 +0000 (18:46 -0400)]
bnxt_en: Allow statistics DMA to be configurable using ethtool -C.
The allowable range is 0.25 seconds to 1 second interval. Default is
1 second.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:28 +0000 (18:46 -0400)]
bnxt_en: Assign netdev->dev_port with port ID.
This is useful for multi-function devices.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:27 +0000 (18:46 -0400)]
bnxt_en: Allow promiscuous mode for VF if default VLAN is enabled.
With a default VLAN, the VF has its own VLAN domain and it can receive
all traffic within that domain.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Fri, 1 Jul 2016 22:46:26 +0000 (18:46 -0400)]
bnxt_en: Increase maximum supported MTU to 9500.
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:25 +0000 (18:46 -0400)]
bnxt_en: Enable MRU enables bit when configuring VNIC MRU.
For correctness, the MRU enables bit must be set when passing the
MRU to firmware during vnic configuration.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Swindell [Fri, 1 Jul 2016 22:46:24 +0000 (18:46 -0400)]
bnxt_en: Add support for firmware updates for additional processors.
Add support to the Ethtool FLASHDEV command handler for additional
firmware types to cover all the on-chip processors.
Signed-off-by: Rob Swindell <rob.swindell@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Swindell [Fri, 1 Jul 2016 22:46:23 +0000 (18:46 -0400)]
bnxt_en: Request firmware reset after successful firwmare update
Upon successful mgmt processor firmware update, request a self
reset upon next PCIe reset (e.g. system reboot).
Signed-off-by: Rob Swindell <rob.swindell@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Swindell [Fri, 1 Jul 2016 22:46:22 +0000 (18:46 -0400)]
bnxt_en: Add support for updating flash more securely
To support Secure Firmware Update, we must be able to allocate
a staging area in the Flash. This patch adds support for the
"update" type to tell firmware to do that.
Signed-off-by: Rob Swindell <rob.swindell@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:21 +0000 (18:46 -0400)]
bnxt_en: Do function reset on the 1st PF open only.
Calling the firmware to do function reset on the PF will kill all the VFs.
To prevent that, we call function reset on the 1st PF open before any VF
can be activated. On subsequent PF opens (with possibly some active VFs),
a bit has been set and we'll skip the function reset. VF driver will
always do function reset on every open. If there is an AER event, we will
always do function reset.
Signed-off-by: Michael Chan <michael.chan@broadocm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:20 +0000 (18:46 -0400)]
bnxt_en: Update firmware spec. to 1.3.0.
And update driver version to 1.3.0.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 1 Jul 2016 22:46:19 +0000 (18:46 -0400)]
bnxt_en: VF/NPAR should return -EOPNOTSUPP for unsupported ethtool ops.
Returning 0 for doing nothing is confusing to the user.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Fri, 1 Jul 2016 22:02:35 +0000 (00:02 +0200)]
net: ethernet: davinci_emac: use phy_ethtool_{get|set}_link_ksettings
There are two generics functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Fri, 1 Jul 2016 22:02:34 +0000 (00:02 +0200)]
net: ethernet: davinci_emac: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Jul 2016 18:40:48 +0000 (14:40 -0400)]
Merge branch 'mlx5-next'
Saeed Mahameed says:
====================
Mellanox 100G SRIOV E-Switch offload and VF representors
We are happy to announce SRIOV E-Switch offload and VF netdev representors.
Or Gerlitz says:
Currently, the way SR-IOV embedded switches are dealt with in Linux is limited
in its expressiveness and flexibility, but this is not necessarily due to
hardware limitations. The kernel software model for controlling the SR-IOV
switch simply does not allow the configuration of anything more complex than
MAC/VLAN based forwarding.
Hence the benefits brought by SRIOV come at a price of management flexibility,
when compared to software virtual switches which are used in Para-Virtual (PV)
schemes and allow implementing complex policies and virtual topologies. Such
SW switching typically involved a complex per-packet processing within the host
kernel using subsystems such as TC, Bridge, Netfilter and Open-vswitch.
We'd like to change that and get the best of both worlds: the performance of SR-IOV
with the management flexibility of software switches. This will eventually include
a richer model for controlling the SR-IOV switch for flow-based switching and
tunneling. Under this model, the e-switch is configured dynamically and a fallback
to software exists in case the hardware is unable to offload all required flows.
This series from Hadar Hen-Zion and myself, is the 1st step in that direction,
specfically, it provides full control on the SRIOV embedded switching by host
software and paves the way to offload switching rules and polices with downstream
patches.
To allow for host based SW control on the SRIOV HW switch, we introduce per VF
representor host netdevice. The VF representor plays the same role as TAP devices
in PV setup. A packet send through the VF representor on the host arrives to
the VF, and a packet sent through the VF is received by its representor. The
administrator can hook the representor netdev into a kernel switching component.
Once they do that, packets from the VF are subject to steering (matching and
actions) of that software component."
Doing so indeed hurts the performance benefits of SRIOV as it forces all the
traffic to go through the hypervisor. However, this SW representation is what
would eventually allow us to introduce hybrid model, where we offload steering
for some of the VF/VM traffic to the HW while keeping other VM traffic to go
through the hypervisor. Examples for the latter are first packet of flows which
are needed for SW switches learning and/or matching against policy database or
types of traffic for which offloading is not desired or not supported by the
current HW eswitch generation.
The embedded switch is managed through a PCI device driver. As such, we introduce
a devlink/pci based scheme for setting the mode of the e-switch. The current mode
(where steering is done based on mac/vlan, etc) is referred to as "legacy" and the
new mode as "offloads".
For the mlx5 driver / ConnectX4 HW case, the VF representors implement a functional
subset of mlx5e Ethernet netdevices using their own profile. This design buys us robust
implementation with code reuse and sharing.
The representors are created by the host PCI driver when (1) in SRIOV and (2) the
e-switch is set to offloads mode. Currently, in mlx5 the e-switch management is done
through the PF vport (0) and hence the VF representors along with the existing PF
netdev which represents the uplink share the PCI PF device instance.
The series is built from two major components, the first relates to the e-switch
management and the second to VF representors.
We start with a refactoring that treats the existing SRIOV e-switch code as of operating
in legacy mode. Next, we add the code for the offloads mode which programs the e-switch
to operate in a way which serves for software based switching:
1. miss rule which matches all packets that do not match any HW other switching rule
and forwards them to the e-switch management port (0) for further processing.
2. infrastructure for send-to-vport rules which conceptually bypass other "normal"
steering rules which present at the e-switch datapath. Such rules apply only for packets
that originate in the e-switch manager vport (0).
Since all the VF reps run over the same e-switch port, we use more logic in the host PCI
driver to do HW steering of missed packets into the HW queue opened by a the respective VF
representor. Finally here, we add the devlink APIs to configure the e-switch mode.
The second part from Hadar starts with some refactoring work which allow for multiple
mlx5e NIC instances to be created over the same PCI function, use common resources
and avoid wrong loopbacks.
Next comes the heart of the change which is a profile definition which allow to practically
have both "conventional" mlx5e NIC use cases such as native mode (non SRIOV), VF, PF and VF
representor to share the Ethernet driver code. This is done by a small surgery that ended up
with few internal callbacks that should be implemented by a profile instance. The profile
for the conventional NIC is implemented, to preserve the existing functionality.
The last two patches add e-switch registration API for the VF representors and the
implementation of the VF representors netdevice profile. Being an mlx5e instance, the
VF representor uses HW send/recv queues, completions queues and such. It currently doesn't
support NIC offloads but some of them could be added later on. The VF representor has
switchdev ops, where currently the only supported API is the one to the HW ID,
which is needed to identify multiple representors belonging to the same e-switch.
The architecture + solution (software and firmware) work were done by a team consisting
of Ilya Lesokhin, Haggai Eran, Rony Efraim, Tal Anker, Natan Oppenheimer, Saeed Mahameed,
Hadar and Or, thanks you all!
v1 --> v2 fixes:
* removed unneeded variable (patch #3)
* removed unused value DEVLINK_ESWITCH_MODE_NONE (patch #8)
* changed the devlink mode name from "offloads" to "switchdev" which
better describes what are we referring here, using a known concept (patch #8)
* correctly refer to devlink e-switch modes (patch #10)
* use the correct mlx5e way to define the VF rep statistics (patch #16)
v2 --> v3 fixes:
* Rebased on top
6fde0e63eccb 'be2net: signedness bug in be_msix_enable()'
* Handled compilation error introduced by rebase on top "
f5074d0ce2f8 Merge branch 'mlx5-100G-fixes'"
* This series applies perfectly even with 'mlx5 resiliency and xmit path fixes' merged to net-next
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:09 +0000 (14:51 +0300)]
net/mlx5e: Introduce SRIOV VF representors
Implement the relevant profile functions to create mlx5e driver instance
serving as VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.
To do that, we also export set of shared service functions from en_main.c,
such that they can be used by both NIC and repsresentors netdevs.
The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of it's
own for phys port name.
The profiling infrastructure allow sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.
The VF reps and the PF which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supposed is attr get for the
port parent ID which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g
using tc, bridge or ovs.
Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
to user-space the VF vport number and along with the switchdev port parent
id (phys_switch_id) enable a udev base consistent naming scheme:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:08 +0000 (14:51 +0300)]
net/mlx5: Add Representors registration API
Introduce E-Switch registration/unregister representors functions.
Those functions are called by the mlx5e driver when the PF NIC is
created upon pci probe action regardless of the E-Switch mode (NONE,
LEGACY or OFFLOADS).
Adding basic E-Switch database that will hold the vport represntors
upon creation.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:07 +0000 (14:51 +0300)]
net/mlx5e: Add support for multiple profiles
To allow support in representor netdevices where we create more than one
netdevice per NIC, add profiles to the mlx5e driver. The profiling
allows for creation of mlx5e instances with different characteristics.
Each profile implements its own behavior using set of function pointers
defined in struct mlx5e_profile. This is done to allow for avoiding complex
per profix branching in the code.
Currently only the profile for the conventional NIC is implemented,
which is of use when a netdev is created upon pci probe.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:06 +0000 (14:51 +0300)]
net/mlx5e: Mark enabled RQTs instances explicitly
In the current driver implementation two types of receive queue
tables (RQTs) are in use - direct and indirect.
Change the driver to mark each new created RQT (direct or indirect)
as "enabled". This behaviour is needed for introducing new mlx5e
instances which serve to represent SRIOV VFs.
The VF representors will have only one type of RQTs (direct).
An "enabled" flag is added to each RQT to allow better handling
and code sharing between the representors and the nic netdevices.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:05 +0000 (14:51 +0300)]
net/mlx5e: TIRs management refactoring
The current refresh tirs self loopback mechanism, refreshes all the tirs
belonging to the same mlx5e instance to prevent self loopback by packets
sent over any ring of that instance. This mechanism relies on all the
tirs/tises of an instance to be created with the same transport domain
number (tdn).
Change the driver to refresh all the tirs created under the same tdn
regardless of which mlx5e netdev instance they belong to.
This behaviour is needed for introducing new mlx5e instances which serve
to represent SRIOV VFs. The representors and the PF share vport used for
E-Switch management, and we want to avoid NIC level HW loopback between
them, e.g when sending broadcast packets. To achieve that, both the
representors and the PF NIC will share the tdn.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hadar Hen Zion [Fri, 1 Jul 2016 11:51:04 +0000 (14:51 +0300)]
net/mlx5e: Create NIC global resources only once
To allow creating more than one netdev over the same PCI function, we
change the driver such that global NIC resources are created once and
later be shared amongst all the mlx5e netdevs running over that port.
Move the CQ UAR, PD (pdn), Transport Domain (tdn), MKey resources from
being kept in the mlx5e priv part to a new resources structure
(mlx5e_resources) placed under the mlx5_core device.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:51:03 +0000 (14:51 +0300)]
net/mlx5e: Add devlink based SRIOV mode changes
Implement handlers for the devlink commands to get and set the SRIOV
E-Switch mode.
When turning to the switchdev/offloads mode, we disable the e-switch
and enable it again in the new mode, create the NIC offloads table
and create VF reps.
When turning to legacy mode, we remove the VF reps and the offloads
table, and re-initiate the e-switch in it's legacy mode.
The actual creation/removal of the VF reps is done in downstream patches.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:51:02 +0000 (14:51 +0300)]
net/mlx5: Add devlink interface
The devlink interface is initially used to set/get the mode of the SRIOV e-switch.
Currently, these are only stubs for get/set, down-stream patch will actually
fill them out.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:51:01 +0000 (14:51 +0300)]
net/devlink: Add E-Switch mode control
Add the commands to set and show the mode of SRIOV E-Switch, two modes
are supported:
* legacy: operating in the "old" L2 based mode (DMAC --> VF vport)
* switchdev: the E-Switch is referred to as whitebox switch configured
using standard tools such as tc, bridge, openvswitch etc. To allow
working with the tools, for each VF, a VF representor netdevice is
created by the E-Switch manager vendor device driver instance (e.g PF).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:51:00 +0000 (14:51 +0300)]
net/mlx5: E-Switch, Add API to create vport rx rules
Add the API to create vport rx rules of the form
packet meta-data :: vport == $VPORT --> $TIR
where the TIR is opened by this VF representor.
This logic will by used for packets that didn't match any rule in the
e-switch datapath and should be received into the host OS through the
netdevice that represents the VF they were sent from.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:59 +0000 (14:50 +0300)]
net/mlx5: E-Switch, Add offloads table
Belongs to the NIC offloads name-space, and to be used as part of the
SRIOV offloads logic to steer packets that hit the e-switch miss rule
to the TIR of the relevant VF representor.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:58 +0000 (14:50 +0300)]
net/mlx5: Introduce offloads steering namespace
Add a new namespace (MLX5_FLOW_NAMESPACE_OFFLOADS) to be populated
with flow steering rules that deal with rules that have have to
be executed before the EN NIC steering rules are matched.
The namespace is located after the bypass name-space and before the
kernel name-space. Therefore, it precedes the HW processing done for
rules set for the kernel NIC name-space.
Under SRIOV, it would allow us to match on e-switch missed packet
and forward them to the relevant VF representor TIR.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:57 +0000 (14:50 +0300)]
net/mlx5: E-Switch, Add API to create send-to-vport rules
Add the API to create send-to-vport e-switch rules of the form
packet meta-data :: send-queue-number == $SQN and source-vport == 0 --> $VPORT
These rules are to be used for a send-to-vport logic which conceptually bypasses
the "normal" steering rules currently present at the e-switch datapath.
Such rule should apply only for packets that originate in the e-switch manager
vport (0) and are sent for a given SQN which is used by a given VF representor
device, and hence the matching logic.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:56 +0000 (14:50 +0300)]
net/mlx5: E-Switch, Add miss rule for offloads mode
In the sriov offloads mode, packets that are not matched by any other
rule should be sent towards the e-switch manager for further processing.
Add such "miss" rule which matches ANY packet as the last rule in the
e-switch FDB and programs the HW to send the packet to vport 0 where
the e-switch manager runs.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:55 +0000 (14:50 +0300)]
net/mlx5: E-Switch, Add support for the sriov offloads mode
Unlike the legacy mode, here, forwarding rules are not learned by the
driver per events on macs set by VFs/VMs into their vports, but rather
should be programmed by higher-level SW entities.
Saying that, still, in the offloads mode (SRIOV_OFFLOADS), two flow
groups are created by the driver for management (slow path) purposes:
The first group will be used for sending packets over e-switch vports
from the host OS where the e-switch management code runs, to be
received by VFs.
The second group will be used by a miss rule which forwards packets toward
the e-switch manager. Further logic will trap these packets such that
the receiving net-device as seen by the networking stack is the representor
of the vport that sent the packet over the e-switch data-path.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Fri, 1 Jul 2016 11:50:54 +0000 (14:50 +0300)]
net/mlx5: E-Switch, Add operational mode to the SRIOV e-Switch
Define three modes for the SRIOV e-switch operation, none (SRIOV_NONE,
none of the VF vports are enabled), legacy (SRIOV_LEGACY, the current mode)
and sriov offloads (SRIOV_OFFLOADS). Currently, when in SRIOV, only the
legacy mode is supported, where steering rules are of the form:
destination mac --> VF vport
This patch does not change any functionality.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 21:05:00 +0000 (17:05 -0400)]
Merge tag 'batadv-next-for-davem-
20160701' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature patchset includes the following changes:
- two patches with minimal clean up work by Antonio Quartulli and
Simon Wunderlich
- eight patches of B.A.T.M.A.N. V, API and documentation clean
up work, by Antonio Quartulli and Marek Lindner
- Andrew Lunn fixed the skb priority adoption when forwarding
fragmented packets (two patches)
- Multicast optimization support is now enabled for bridges which
comes with some protocol updates, by Linus Luessing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 20:57:01 +0000 (16:57 -0400)]
Merge branch 'hns-next'
Yisen Zhuang says:
====================
net: hns: fix the typo of hns
This series includes typo fixes which review by Andy, adding
the hns maintainer to MAINTAINERS, as below:
> from Daode: adds the maintainer for hns driver;
> from Daode: fix the typo of hns reviewed by Andy Shevchenko;
> from Kejian: one remove redundant function and two fix to get
configuration from DT.
changlog:
v2 -> v3:
match all files in and below drivers/net/ethernet/hisilicon/
v1 -> v2:
fix the indentations reviewed by David.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kejian Yan [Fri, 1 Jul 2016 09:34:13 +0000 (17:34 +0800)]
net: hns: get reset registers from DT
Since the registers of subctrl may be different, it is better to
mv the registers from hns mdio driver routine to device tree node.
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kejian Yan [Fri, 1 Jul 2016 09:34:12 +0000 (17:34 +0800)]
net: hns: add media-type property for hns
It is PORT_TP type if the service port is GE mode. It is wrong to
judge the port type by using if it is service port. Adding the media
type to know port type.
Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kejian Yan [Fri, 1 Jul 2016 09:34:11 +0000 (17:34 +0800)]
net: hns: remove redundant hns_mac_dev_to_enet_if()
The sequence of hns_mac_dev_to_enet_if() is the same as
hns_get_enet_interface(), and hns_get_enet_interface() is called
by initialization to get the mac mode. And the mode is not changed
anywhere. Thus add hns_mac_dev_to_enet_if() function to get the mac
mode is obviously redundant.
Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:10 +0000 (17:34 +0800)]
net: hns: normalize two different loop
There are two approaches to assign data, one does 2 loops, another
does 1 loop. This patch normalize the different methods to 1 loop.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:09 +0000 (17:34 +0800)]
net: hns: add a space before "*/"
In comment line, some time miss a space before */, so this
patch adds a space before */.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:08 +0000 (17:34 +0800)]
net: hns: delete redundant parenthese
According to the previous review comments from Andy, this patch
deletes the redundant parens in the patch.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:07 +0000 (17:34 +0800)]
net: hns: change code style from a = a + x to a += x
This patch fixes the code style in hns driver. Change it from
"buff = buff + xxx" to "buff += xxx". The reveiw comments is
from andy.
Reviewed-by: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:06 +0000 (17:34 +0800)]
net: hns: fix code style about hns driver
This patch fixes code sytle of hns driver to make it
simple.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daode Huang [Fri, 1 Jul 2016 09:34:05 +0000 (17:34 +0800)]
MAINTAINERS: add maintainers for hns driver
This patch adds maintainers for hisilicon network subsystem driver
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 20:45:24 +0000 (16:45 -0400)]
Merge branch 'rds-multipath-datastructures'
Sowmini Varadhan says:
====================
RDS:TCP data structure changes for multipath support
The second installment of changes to enable multipath support in
RDS-TCP. This series implements the changes in rds-tcp so that the
rds_conn_path has a pointer to the rds_tcp_connection in cp_transport_data.
Struct rds_tcp_connection keeps track of the inet_sk per path in
t_sock. The ->sk_user_data in turn is a pointer to the rds_conn_path.
With this set of changes, rds_tcp has the needed plumbing to handle
multiple paths(socket) per rds_connection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:18 +0000 (16:11 -0700)]
RDS: Do not send a pong to an incoming ping with 0 src port
RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originators src port. However if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:17 +0000 (16:11 -0700)]
RDS: TCP: Simplify reconnect to avoid duelling reconnnect attempts
When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:16 +0000 (16:11 -0700)]
RDS: TCP: Hooks to set up a single connection path
This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:15 +0000 (16:11 -0700)]
RDS: TCP: make receive path use the rds_conn_path
The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:14 +0000 (16:11 -0700)]
RDS: TCP: make ->sk_user_data point to a rds_conn_path
The socket callbacks should all operate on a struct rds_conn_path,
in preparation for a MP capable RDS-TCP.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:13 +0000 (16:11 -0700)]
RDS: TCP: Refactor connection destruction to handle multiple paths
A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.
For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.
For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:12 +0000 (16:11 -0700)]
RDS: TCP: Make rds_tcp_connection track the rds_conn_path
The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:11 +0000 (16:11 -0700)]
RDS: TCP: Remove dead logic around c_passive in rds-tcp
The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Jun 2016 23:11:10 +0000 (16:11 -0700)]
RDS: Rework path specific indirections
Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 20:32:27 +0000 (16:32 -0400)]
Merge branch 'bpf-cgroup2'
Martin KaFai Lau says:
====================
cgroup: bpf: cgroup2 membership test on skb
This series is to implement a bpf-way to
check the cgroup2 membership of a skb (sk_buff).
It is similar to the feature added in netfilter:
c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")
The current target is the tc-like usage.
v3:
- Remove WARN_ON_ONCE(!rcu_read_lock_held())
- Stop BPF_MAP_TYPE_CGROUP_ARRAY usage in patch 2/4
- Avoid mounting bpf fs manually in patch 4/4
- Thanks for Daniel's review and the above suggestions
- Check CONFIG_SOCK_CGROUP_DATA instead of CONFIG_CGROUPS. Thanks to
the kbuild bot's report.
Patch 2/4 only needs CONFIG_CGROUPS while patch 3/4 needs
CONFIG_SOCK_CGROUP_DATA. Since a single bpf cgrp2 array alone is
not useful for now, CONFIG_SOCK_CGROUP_DATA is also used in
patch 2/4. We can fine tune it later if we find other use cases
for the cgrp2 array.
- Return EAGAIN instead of ENOENT if the cgrp2 array entry is
NULL. It is to distinguish these two cases: 1) the userland has
not populated this array entry yet. or 2) not finding cgrp2 from the skb.
- Be-lated thanks to Alexei and Tejun on reviewing v1 and giving advice on
this work.
v2:
- Fix two return cases in cgroup_get_from_fd()
- Fix compilation errors when CONFIG_CGROUPS is not used:
- arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY
- filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup
- Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Thu, 30 Jun 2016 17:28:45 +0000 (10:28 -0700)]
cgroup: bpf: Add an example to do cgroup checking in BPF
test_cgrp2_array_pin.c:
A userland program that creates a bpf_map (BPF_MAP_TYPE_GROUP_ARRAY),
pouplates/updates it with a cgroup2's backed fd and pins it to a
bpf-fs's file. The pinned file can be loaded by tc and then used
by the bpf prog later. This program can also update an existing pinned
array and it could be useful for debugging/testing purpose.
test_cgrp2_tc_kern.c:
A bpf prog which should be loaded by tc. It is to demonstrate
the usage of bpf_skb_in_cgroup.
test_cgrp2_tc.sh:
A script that glues the test_cgrp2_array_pin.c and
test_cgrp2_tc_kern.c together. The idea is like:
1. Load the test_cgrp2_tc_kern.o by tc
2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
with a cgroup fd
3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
dropped because of a match on the cgroup
Most of the lines in test_cgrp2_tc.sh is the boilerplate
to setup the cgroup/bpf-fs/net-devices/netns...etc. It is
not bulletproof on errors but should work well enough and
give enough debug info if things did not go well.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Thu, 30 Jun 2016 17:28:44 +0000 (10:28 -0700)]
cgroup: bpf: Add bpf_skb_in_cgroup_proto
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2. It is similar to the
feature added in netfilter:
commit
c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")
The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.
Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Thu, 30 Jun 2016 17:28:43 +0000 (10:28 -0700)]
cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
To update an element, the caller is expected to obtain a cgroup2 backed
fd by open(cgroup2_dir) and then update the array with that fd.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Thu, 30 Jun 2016 17:28:42 +0000 (10:28 -0700)]
cgroup: Add cgroup_get_from_fd
Add a helper function to get a cgroup2 from a fd. It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 20:00:52 +0000 (16:00 -0400)]
Merge branch 'bpf-robustify'
Daniel Borkmann says:
====================
Further robustify putting BPF progs
This series addresses a potential issue reported to us by Jann Horn
with regards to putting progs. First patch moves progs generally under
RCU destruction and second patch refactors getting of progs to simplify
code a bit. For details, please see individual patches. Note, we think
that addressing this one in net-next should be sufficient.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 30 Jun 2016 15:24:44 +0000 (17:24 +0200)]
bpf: refactor bpf_prog_get and type check into helper
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 30 Jun 2016 15:24:43 +0000 (17:24 +0200)]
bpf: generally move prog destruction to RCU deferral
Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:
- Set up a process with threads T1, T2 and T3
- Let T1 set up a socket filter F1 that invokes another filter F2
through a BPF map [tail call]
- Let T1 trigger the socket filter via a unix domain socket write,
don't wait for completion
- Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
- Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
- Let T3 close the file descriptor for F2, dropping the reference
count of F2 to 2
- At this point, T1 should have looked up F2 from the map, but not
finished executing it
- Let T3 remove F2 from the BPF map, dropping the reference count of
F2 to 1
- Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
via schedule_work()
- At this point, the BPF program could be freed
- BPF execution is still running in a freed BPF program
While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().
That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in
ceb56070359b ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.
Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amitoj Kaur Chawla [Thu, 30 Jun 2016 08:44:01 +0000 (14:14 +0530)]
atm: horizon: Use setup_timer
Convert a call to init_timer and accompanying intializations of
the timer's data and function fields to a call to setup_timer.
The Coccinelle semantic patch that fixes this problem is
as follows:
@@
expression t,d,f,e1;
identifier x1;
statement S1;
@@
(
-t.data = d;
|
-t.function = f;
|
-init_timer(&t);
+setup_timer(&t,f,d);
|
-init_timer_on_stack(&t);
+setup_timer_on_stack(&t,f,d);
)
<... when != S1
t.x1 = e1;
...>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 09:40:58 +0000 (05:40 -0400)]
Merge branch 'qed-next'
Manish Chopra says:
====================
qede: Enhancements
This patch series have few small fastpath features
support and code refactoring.
Note - regarding get/set tunable configuration via ethtool
Surprisingly, there is NO ethtool application support for
such configuration given that we have kernel support.
Do let us know if we need to add support for that in user ethtool.
Please consider applying this series to "net-next".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 30 Jun 2016 06:35:22 +0000 (02:35 -0400)]
qede: Bump up driver version to 8.10.1.20
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 30 Jun 2016 06:35:21 +0000 (02:35 -0400)]
qede: Add get/set rx copy break tunable support
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 30 Jun 2016 06:35:20 +0000 (02:35 -0400)]
qede: Utilize xmit_more
This patch uses xmit_more optimization to reduce
number of TX doorbells write per packet.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 30 Jun 2016 06:35:19 +0000 (02:35 -0400)]
qede: qede_poll refactoring
This patch cleanups qede_poll() routine a bit
and allows qede_poll() to do single iteration to handle
TX completion [As under heavy TX load qede_poll() might
run for indefinite time in the while(1) loop for TX
completion processing and cause CPU stuck].
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 30 Jun 2016 06:35:18 +0000 (02:35 -0400)]
qede: Add support for handling IP fragmented packets.
When handling IP fragmented packets with csum in their
transport header, the csum isn't changed as part of the
fragmentation. As a result, the packet containing the
transport headers would have the correct csum of the original
packet, but one that mismatches the actual packet that
passes on the wire. As a result, on receive path HW would
give an indication that the packet has incorrect csum,
which would cause qede to discard the incoming packet.
Since HW also delivers a notification of IP fragments,
change driver behavior to pass such incoming packets
to stack and let it make the decision whether it needs
to be dropped.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 09:32:28 +0000 (05:32 -0400)]
Merge branch 'tun-skb_array'
Jason Wang says:
====================
switch to use tx skb array in tun
This series tries to switch to use skb array in tun. This is used to
eliminate the spinlock contention between producer and consumer. The
conversion was straightforward: just introdce a tx skb array and use
it instead of sk_receive_queue.
A minor issue is to keep the tx_queue_len behaviour, since tun used to
use it for the length of sk_receive_queue. This is done through:
- add the ability to resize multiple rings at once to avoid handling
partial resize failure for mutiple rings.
- add the support for zero length ring.
- introduce a notifier which was triggered when tx_queue_len was
changed for a netdev.
- resize all queues during the tx_queue_len changing.
Tests shows about 15% improvement on guest rx pps:
Before: ~1300000pps
After : ~1500000pps
Changes from V3:
- fix kbuild warnings
- call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN
Changes from V2:
- add multiple rings resizing support for ptr_ring/skb_array
- add zero length ring support
- introdce a NETDEV_CHANGE_TX_QUEUE_LEN
- drop new flags
Changes from V1:
- switch to use skb array instead of a customized circular buffer
- add non-blocking support
- rename .peek to .peek_len
- drop lockless peeking since test show very minor improvement
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-from-altitude: 34697 feet.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 30 Jun 2016 06:45:36 +0000 (14:45 +0800)]
tun: switch to use skb array for tx
We used to queue tx packets in sk_receive_queue, this is less
efficient since it requires spinlocks to synchronize between producer
and consumer.
This patch tries to address this by:
- switch from sk_receive_queue to a skb_array, and resize it when
tx_queue_len was changed.
- introduce a new proto_ops peek_len which was used for peeking the
skb length.
- implement a tun version of peek_len for vhost_net to use and convert
vhost_net to use peek_len if possible.
Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:
Before: ~1300000pps
After : ~1500000pps
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 30 Jun 2016 06:45:35 +0000 (14:45 +0800)]
net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 30 Jun 2016 06:45:34 +0000 (14:45 +0800)]
skb_array: add wrappers for resizing
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Thu, 30 Jun 2016 06:45:33 +0000 (14:45 +0800)]
ptr_ring: support resizing multiple queues
Sometimes, we need support resizing multiple queues at once. This is
because it was not easy to recover to recover from a partial failure
of multiple queues resizing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 30 Jun 2016 06:45:32 +0000 (14:45 +0800)]
skb_array: minor tweak
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 30 Jun 2016 06:45:31 +0000 (14:45 +0800)]
ptr_ring: support zero length ring
Sometimes, we need zero length ring. But current code will crash since
we don't do any check before accessing the ring. This patch fixes this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jul 2016 09:03:51 +0000 (05:03 -0400)]
Merge branch 'sch_hfsc-fixes-cleanups'
Michal Soltys says:
====================
HFSC patches, part 1
It's revised version of part of the patches I submitted really, really long
time ago (back then I asked Patrick to ignore them as I found some issues
shortly after submitting).
Anyway this is the first set with very simple fixes/changes though some of them
relatively subtle (I tried to do very exhaustive commit messages explaining what
and why with those).
The patches are against net-next tree.
The second set will be heavier - or rather with more complex explanations, among those I have:
- a fix to subtle issue introduced in
http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281
along with simplifying related stuff
- update times to 96 bits (which allows to "just" use 32 bit shifts and
improves curve definition accuracy at more extreme low/high speeds)
- add curve "merging" instead of just selecting in convex case (computations
mirror those from concave intersection)
But these are eventually for later.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Soltys [Thu, 30 Jun 2016 00:26:48 +0000 (02:26 +0200)]
net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Soltys [Thu, 30 Jun 2016 00:26:47 +0000 (02:26 +0200)]
net/sched/sch_hfsc.c: go passive after vt update
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.
Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.
One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Soltys [Thu, 30 Jun 2016 00:26:46 +0000 (02:26 +0200)]
net/sched/sch_hfsc.c: remove leftover dlist and droplist
This is update to:
commit
a09ceb0e08140a ("sched: remove qdisc->drop")
That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>