Jakub Kicinski [Thu, 3 Nov 2016 17:12:07 +0000 (17:12 +0000)]
nfp: add XDP support in the driver
Add XDP support. Separate stack's and XDP's TX rings logically.
Add functions for handling XDP_TX and cleanup of XDP's TX rings.
For XDP allocate all RX buffers as separate pages and map them
with DMA_BIDIRECTIONAL.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:06 +0000 (17:12 +0000)]
debugfs: constify argument to debugfs_real_fops()
seq_file users can only access const version of file pointer,
because the ->file member of struct seq_operations is marked
as such. Make parameter to debugfs_real_fops() const.
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Nicolai Stange <nicstange@gmail.com>
CC: Christian Lamparter <chunkeey@gmail.com>
CC: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:05 +0000 (17:12 +0000)]
nfp: reorganize nfp_net_rx() to get packet offsets early
Calculate packet offsets early in nfp_net_rx() so that we will be
able to use them in upcoming XDP handler. While at it move relevant
variables into the loop scope.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:04 +0000 (17:12 +0000)]
nfp: add support for ethtool .set_channels
Allow changing the number of rings via ethtool .set_channels API.
Runtime reconfig needs to be extended to handle number of rings.
We need to be able to activate interrupt vectors before rings are
assigned to them.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:03 +0000 (17:12 +0000)]
nfp: move RSS indirection table init into a separate function
We will need to rerun the initialization of the RSS indirection table
after the number of rings is changed. Move the code to a separate
function.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:02 +0000 (17:12 +0000)]
nfp: add helper to reassign rings to IRQ vectors
Instead of fixing ring -> vector relations up in ring swap functions
put the reassignment into a helper function which will reinit all
links.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:01 +0000 (17:12 +0000)]
nfp: loosen relation between rings and IRQs vectors
Upcoming XDP support will break the assumption that one can iterate
over IRQ vectors to get to all the rings easily. Use nn->.x_ring
arrays directly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:12:00 +0000 (17:12 +0000)]
nfp: reuse ring helpers on .ndo_open() path
Ring allocation helpers encapsulate all ring allocation and
initialization steps nicely. Reuse them on .ndo_open() path.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:11:59 +0000 (17:11 +0000)]
nfp: rename ring allocation helpers
"Shadow" in ring helpers used to mean that the helper will allocate
rings without touching existing configuration, this was used for
reconfiguration while the device was running. We will soon use
the same helpers for .ndo_open() path, so replace "shadow" with
"ring_set".
No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:11:58 +0000 (17:11 +0000)]
nfp: centralize runtime reconfiguration logic
All functions which need to reallocate ring resources at runtime
look very similar. Centralize that logic into a separate function.
Encapsulate configuration parameters in a structure.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 3 Nov 2016 17:11:57 +0000 (17:11 +0000)]
nfp: add support for ethtool .get_channels
Report number of rings via ethtool .get_channels API.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 Nov 2016 18:48:45 +0000 (14:48 -0400)]
Merge branch 'amd-xgbe-updates'
Tom Lendacky says:
====================
amd-xgbe: AMD XGBE driver updates 2016-11-03
This patch series is targeted at preparing the driver for a new PCI version
of the hardware. After this series is applied, a follow-on series will
introduce the support for the PCI version of the hardware.
The following updates and fixes are included in this driver update series:
- Fix formatting of PCS debug register dump
- Prepare for priority-based FIFO allocation
- Implement priority-based FIFO allocation
- Prepare for working with more than one type of PCS/PHY
- Prepare for the introduction of clause 37 auto-negotiation
- Add support for clause 37 auto-negotiation
- Prepare for supporting a new PCS register access method
- Add support for 64-bit management counter registers
- Update DMA channel status determination
- Prepare for supporting PCI devices in addition to platform devices
This patch series is based on net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:19:27 +0000 (13:19 -0500)]
amd-xgbe: Prepare for supporting PCI devices
Update the driver framework to separate out platform/ACPI specific code
from general code during device initialization. This will allow for the
introduction of PCI device support.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:19:17 +0000 (13:19 -0500)]
amd-xgbe: Update how to determine DMA channel status
Tx and Rx DMA channel status determiniation is different depending on the
version of the hardware. Update the channel status processing code to
account for the change. Also, reduce the timeout value used when stopping
the channels.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:19:07 +0000 (13:19 -0500)]
amd-xgbe: Support for 64-bit management counter registers
Add support for reading all management counter registers as 64-bit
values. The indication of whether to read the high 32-bits to form
a 64-bit value is indicated in the version data.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:18:56 +0000 (13:18 -0500)]
amd-xgbe: Prepare for a new PCS register access method
Prepare the code to be able to support accessing of the PCS registers
in a new way, while maintaining the current access method. Provide a
version specific field that indicates the method to use.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:18:47 +0000 (13:18 -0500)]
amd-xgbe: Add support for clause 37 auto-negotiation
Add support to be able to use clause 37 auto-negotiation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:18:38 +0000 (13:18 -0500)]
amd-xgbe: Prepare for introduction of clause 37 autoneg
Prepare for the future introduction of clause 37 auto-negotiation by
updating the current auto-negotiation related functions to identify
them as clause 73 functions. Move interrupt enablement to the
enable/disable auto-negotiation functions. Update what will be common
routines to check for the current type of AN and process accordingly.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:18:27 +0000 (13:18 -0500)]
amd-xgbe: Prepare for working with more than one type of phy
Prepare the code to be able to work with more than one type of phy by
adding additional callable functions into the phy interface and removing
phy specific settings/functions from non-phy related files.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:18:16 +0000 (13:18 -0500)]
amd-xgbe: Perform priority-based hardware FIFO allocation
Allocate the FIFO across the hardware Rx queues based on the priority
of the queues. Giving more FIFO resources to queues with a higher
priority. If PFC is active but not enabled for a queue, then less
resources can allocated to the queue.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:17:48 +0000 (13:17 -0500)]
amd-xgbe: Prepare for priority-based FIFO allocation
Currently, the Rx and Tx fifos are evenly allocated between the hardware
queues of the device. As more queues are instantiated, the fifo memory
needs to be able to be allocated based on queue priority. This allows for
higher priority queues to have more fifo memory than lower priority
queues. Prepare for this by modifying the current fifo calculation to
assign the fifo queue allocation in an array that is then used to program
the hardware.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 3 Nov 2016 18:17:38 +0000 (13:17 -0500)]
amd-xgbe: Fix formatting of PCS register dump
Fix the length value used for the PCS register dump so that the full
value can be displayed.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 Nov 2016 18:45:24 +0000 (14:45 -0400)]
Merge branch 'uid-routing'
Lorenzo Colitti says:
====================
net: inet: Support UID-based routing
This patchset adds support for per-UID routing. It allows the
administrator to configure rules such as:
ip rule add uidrange 100-200 lookup 123
This functionality has been in use by all Android devices since
5.0. It is primarily used to impose per-app routing policies (on
Android, every app has its own UID) without having to resort to
rerouting packets in iptables, which breaks getsockname() and
MTU/MSS calculation, and generally disrupts end-to-end
connectivity.
This patch series is similar to the code currently used on
Android, but has better correctness and performance because
it stores the UID in the socket instead of calling sock_i_uid.
This avoids contention on sk->sk_callback_lock, and makes it
possible to correctly route a socket on which userspace has
called close(), for which sock_i_uid will return 0.
Changes from v1:
- Don't set the UID in sk_clone_lock, it's already set by
sock_copy.
- For packets originated by kernel sockets, don't use the socket
UID. This is the UID that created the namespace, but it might
not be mapped in the namespace at all. Instead, use UID 0 in
the namespace, which is less surprising and consistent with
what happens in the root namespace.
- Fix UID routing of IPv4 and IPv6 SYN_RECV sockets.
- Fix UID routing of received IPv6 redirects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Thu, 3 Nov 2016 17:23:43 +0000 (02:23 +0900)]
net: inet: Support UID-based routing in IP protocols.
- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.
Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Thu, 3 Nov 2016 17:23:42 +0000 (02:23 +0900)]
net: core: add UID to flows, rules, and routes
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
rtnetlink. The value INVALID_UID indicates no UID was
specified.
- Add a UID field to the flow structures.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Thu, 3 Nov 2016 17:23:41 +0000 (02:23 +0900)]
net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.
Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().
Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:
1. Whenever sk_socket is non-null: sk_uid is the same as the UID
in sk_socket, i.e., matches the return value of sock_i_uid.
Specifically, the UID is set when userspace calls socket(),
fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
- For a socket that no longer has a sk_socket because
userspace has called close(): the previous UID.
- For a cloned socket (e.g., an incoming connection that is
established but on which userspace has not yet called
accept): the UID of the socket it was cloned from.
- For a socket that has never had an sk_socket: UID 0 inside
the user namespace corresponding to the network namespace
the socket belongs to.
Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 Nov 2016 18:40:01 +0000 (14:40 -0400)]
Merge branch 'dsa-mv88e6xxx-port-operation-refine'
Vivien Didelot says:
====================
net: dsa: mv88e6xxx: refine port operations
The Marvell chips have one internal SMI device per port, containing a
set of registers used to configure a port's link, STP state, default
VLAN or addresses database, etc.
This patchset creates port files to implement the port operations as
described in datasheets, and extend the chip ops structure with them.
Patches 1 to 6 implement accessors for port's STP state, port based VLAN
map, default FID, default VID, and 802.1Q mode.
Patches 7 to 11 implement the port's MAC setup of link state, duplex
mode, RGMII delay and speed, all accessed through port's register 0x01.
The new port's MAC setup code is used to re-implement the adjust_link
code and correctly force the link down before changing any of the MAC
settings, as requested by the datasheets.
The port's MAC accessors use values compatible with struct phy_device
(e.g. DUPLEX_FULL) and extend them when needed (e.g. SPEED_MAX).
Changes in v2:
- Strictly use new _UNFORCED values instead of re-using _UNKNOWN ones.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:36 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: setup port's MAC
Now that we have setters to configure the port's MAC, use them to
refactor the port setup and adjust_link code.
Note that port's MAC speed, duplex or RGMII delay must not be changed
unless the port's link is forced down. So wrap all that in a
mv88e6xxx_port_setup_mac function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:35 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port's MAC speed setter
While the two bits for link, duplex or RGMII delays are used the same
way on chips supporting the said feature, the two bits for speed have
different meaning for most of the chips out there.
Speed value is stored in bits 1:0, 0x3 means unforce (normal detection).
Some chips reuse values for alternative speeds when bit 12 is set.
Newer chips with speed > 1Gbps reuse value 0x3 thus need a new bit 13.
Here are the values to write in register 0x1 to (un)force speed:
| Speed |
88E6065 |
88E6185 |
88E6352 |
88E6390 | 88E6390X |
| ------- | ------- | ------- | ------- | ------- | -------- |
| 10 | 0x0000 | 0x0000 | 0x0000 | 0x2000 | 0x2000 |
| 100 | 0x0001 | 0x0001 | 0x0001 | 0x2001 | 0x2001 |
| 200 | 0x0002 | NA | 0x1001 | 0x3001 | 0x3001 |
| 1000 | NA | 0x0002 | 0x0002 | 0x2002 | 0x2002 |
| 2500 | NA | NA | NA | 0x3003 | 0x3003 |
| 10000 | NA | NA | NA | NA | 0x2003 |
| unforce | 0x0003 | 0x0003 | 0x0003 | 0x0000 | 0x0000 |
This patch implements a generic mv88e6xxx_port_set_speed() function used
by chip-specific wrappers to filter supported ports and speeds.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:34 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port's RGMII delay setter
Some chips such as
88E6352 and
88E6390 can be programmed to add delays
to RXCLK for IND inputs or to GTXCLK for OUTD outputs when port is in
RGMII mode.
Add a port function to program such delays according to the provided PHY
interface mode.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:33 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port duplex setter
Similarly to port's link, add setter to force port's half duplex, full
duplex or let normal duplex detection occurs.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:32 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port link setter
Most of the chips will have a port register control bits to force the
port's link up, down, or let normal link detection occurs.
Implement such operation to use it later when setting duplex, etc.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:31 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port 802.1Q mode setter
Add port functions to set the port 802.1Q mode.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:30 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port PVID accessors
Add port functions to access the ports default VID.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:29 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port FID accessors
Add functions to port files to access the ports default FID.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:28 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port vlan map setter
Add a port function to access the Port Based VLAN Map register.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:27 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port state setter
Add the port STP state setter to the port files.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 4 Nov 2016 02:23:26 +0000 (03:23 +0100)]
net: dsa: mv88e6xxx: add port files
The Marvell switches contains one internal SMI device per port, called
"Port Registers". Depending on the model, the addresses of these devices
start from 0x0, 0x8 or 0x10.
Start moving Port Registers specific code to their own files.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 3 Nov 2016 12:24:21 +0000 (13:24 +0100)]
net/sched: cls_flower: Support matching on SCTP ports
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.
Example usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 ip_proto sctp dst_port 80 \
action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Elad Raz [Thu, 3 Nov 2016 08:41:55 +0000 (09:41 +0100)]
mlxsw: pci: Fix the FW ready mask length
The system-status register is actually 16-bit wide and not 8 bit-wide.
Fixes:
233fa44bd67ae ("mlxsw: pci: Implement reset done check")
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Nov 2016 19:41:12 +0000 (15:41 -0400)]
Merge branch 'ip-recvfragsize-cmsg'
Willem de Bruijn says:
====================
ip: add RECVFRAGSIZE cmsg
On IP datagrams and raw sockets, when packets arrive fragmented,
expose the largest received fragment size through a new cmsg.
Protocols implemented on top of these sockets may use this, for
instance, to inform peers to lower MSS on platforms that silently
allow send calls to exceed PMTU and cause fragmentation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 2 Nov 2016 15:02:18 +0000 (11:02 -0400)]
ipv6: on reassembly, record frag_max_size
IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 2 Nov 2016 15:02:17 +0000 (11:02 -0400)]
ipv6: add IPV6_RECVFRAGSIZE cmsg
When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.
At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.
Tested using the test for IP_RECVFRAGSIZE plus
ip netns exec to ip addr add dev veth1 fc07::1/64
ip netns exec from ip addr add dev veth0 fc07::2/64
ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
ip netns exec from nc -q 1 -u fc07::1 6000 < payload
Both with and without enabling connection tracking
ip6tables -A INPUT -m state --state NEW -p udp -j LOG
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 2 Nov 2016 15:02:16 +0000 (11:02 -0400)]
ipv4: add IP_RECVFRAGSIZE cmsg
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.
Tested:
Sent data over a veth pair of which the source has a small mtu.
Sent data using netcat, received using a dedicated process.
Verified that the cmsg IP_RECVFRAGSIZE is returned only when
data arrives fragmented, and in that cases matches the veth mtu.
ip link add veth0 type veth peer name veth1
ip netns add from
ip netns add to
ip link set dev veth1 netns to
ip netns exec to ip addr add dev veth1 192.168.10.1/24
ip netns exec to ip link set dev veth1 up
ip link set dev veth0 netns from
ip netns exec from ip addr add dev veth0 192.168.10.2/24
ip netns exec from ip link set dev veth0 up
ip netns exec from ip link set dev veth0 mtu 1300
ip netns exec from ethtool -K veth0 ufo off
dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Nov 2016 19:31:34 +0000 (15:31 -0400)]
Merge branch 'stmmac-OXNAS'
Neil Armstrong says:
====================
net: stmmac: Add OXNAS DWMAC Glue
This patchset add support for the Sysnopsys DWMAC Gigabit Ethernet
controller Glue layer of the Oxford Semiconductor OX820 SoC.
Changes since v2 at http://lkml.kernel.org/r/
20161031105345.16711-1-narmstrong@baylibre.com :
- Disable/Unprepare clock if regmap read fails in oxnas_dwmac_init
Changes since v1 at https://patchwork.kernel.org/patch/
9388231/ :
- Split dt-bindings in a separate patch
- Add IP version in the dt-bindings compatible
- Check return of clk_prepare_enable()
- use get_stmmac_bsp_priv() helper
- hardwire setup values in oxnas_dwmac_init()
Changes since RFC at https://patchwork.kernel.org/patch/
9387257 :
- Drop init/exit callbacks
- Implement proper remove and PM callback
- Call init from probe
- Disable/Unprepare clock if stmmac probe fails
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Armstrong [Wed, 2 Nov 2016 14:02:37 +0000 (15:02 +0100)]
dt-bindings: net: Add OXNAS DWMAC Bindings
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Armstrong [Wed, 2 Nov 2016 14:02:36 +0000 (15:02 +0100)]
net: stmmac: Add OXNAS Glue Driver
Add Synopsys Designware MAC Glue layer for the Oxford Semiconductor OX820.
Acked-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Nov 2016 19:25:27 +0000 (15:25 -0400)]
Merge branch 'diag-raw-fixes'
Cyrill Gorcunov says:
====================
net: Fixes for raw diag sockets handling
Hi! Here are a few fixes for raw-diag sockets handling: missing
sock_put call and jump for exiting from nested cycle. I made
patches for iproute2 as well so will send them out soon.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Cyrill Gorcunov [Wed, 2 Nov 2016 12:36:32 +0000 (15:36 +0300)]
net: ip, raw_diag -- Use jump for exiting from nested loop
I managed to miss that sk_for_each is called under "for"
cycle so need to use goto here to return matching socket.
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cyrill Gorcunov [Wed, 2 Nov 2016 12:36:31 +0000 (15:36 +0300)]
net: ip, raw_diag -- Fix socket leaking for destroy request
In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu Varadarajan [Wed, 2 Nov 2016 00:58:50 +0000 (17:58 -0700)]
enic: set skb->hash type properly
Driver sets the skb l4/l3 hash based on NIC_CFG_RSS_HASH_TYPE_*,
which is bit mask. This is wrong. Hw actually provides us enum.
Use CQ_ENET_RQ_DESC_RSS_TYPE_* to set l3 and l4 hash type.
Fixes:
bf751ba802fe ("driver/net: enic: record q_number and rss_hash for skb")
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Tue, 1 Nov 2016 23:11:51 +0000 (00:11 +0100)]
net: 3com: typhoon: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: David Dillow <dave@thedillows.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Nov 2016 21:55:25 +0000 (14:55 -0700)]
ila: Fix crash caused by rhashtable changes
commit
ca26893f05e86 ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function already has been
allocating a iter structure and storing the pointed to that
in the args of netlink_callback. ila_xlat also uses
rhahstable_iter but is still putting that directly in
the arg block. Now since rhashtable_iter size is increased
we are overwriting beyond the structure. The next field
happens to be cb_mutex pointer in netlink_sock and hence the crash.
Fix is to alloc the rhashtable_iter and save it as pointer
in arg.
Tested:
modprobe ila
./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
./ip ila list # NO crash now
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cyrill Gorcunov [Tue, 1 Nov 2016 20:05:00 +0000 (23:05 +0300)]
net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect
While being preparing patches for killing raw sockets via
diag netlink interface I noticed that my runs are stuck:
| [root@pcs7 ~]# cat /proc/`pidof ss`/stack
| [<
ffffffff816d1a76>] __lock_sock+0x80/0xc4
| [<
ffffffff816d206a>] lock_sock_nested+0x47/0x95
| [<
ffffffff8179ded6>] udp_disconnect+0x19/0x33
| [<
ffffffff8179b517>] raw_abort+0x33/0x42
| [<
ffffffff81702322>] sock_diag_destroy+0x4d/0x52
which has not been the case before. I narrowed it down to the commit
| commit
286c72deabaa240b7eebbd99496ed3324d69f3c0
| Author: Eric Dumazet <edumazet@google.com>
| Date: Thu Oct 20 09:39:40 2016 -0700
|
| udp: must lock the socket in udp_disconnect()
where we start locking the socket for different reason.
So the raw_abort escaped the renaming and we have to
fix this typo using __udp_disconnect instead.
Fixes:
286c72deabaa ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Woojung Huh [Tue, 1 Nov 2016 20:02:00 +0000 (20:02 +0000)]
lan78xx: Use irq_domain for phy interrupt from USB Int. EP
To utilize phylib with interrupt fully than handling some of phy stuff in the MAC driver,
create irq_domain for USB interrupt EP of phy interrupt and
pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.
Idea comes from drivers/gpio/gpio-dl2.c
Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 1 Nov 2016 17:53:42 +0000 (10:53 -0700)]
tcp: enhance tcp collapsing
As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
time even if the skb at the right has fragments.
We simply have to use more generic skb_copy_bits() instead of
skb_copy_from_linear_data() in tcp_collapse_retrans()
Also need to guard this skb_copy_bits() in case there is nothing to
copy, otherwise skb_put() could panic if left skb has frags.
Tested:
Used following packetdrill test
// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0 write(4, ..., 200) = 200
+0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
+0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
+0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
+0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
+0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
+0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
+0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
+0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
+0 > P. 1301:1401(100) ack 1
+.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
// Check that TCP collapse works :
+0 > P. 1:1001(1000) ack 1
Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Tue, 1 Nov 2016 15:32:27 +0000 (16:32 +0100)]
net: 3c509: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Tue, 1 Nov 2016 15:32:26 +0000 (16:32 +0100)]
net: 3c59x: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Tue, 1 Nov 2016 15:32:25 +0000 (16:32 +0100)]
net: mii: add generic function to support ksetting support
The old ethtool api (get_setting and set_setting) has generic mii
functions mii_ethtool_sset and mii_ethtool_gset.
To support the new ethtool api ({get|set}_link_ksettings), we add
two generics mii function mii_ethtool_{get|set}_link_ksettings_get.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Nov 2016 19:07:12 +0000 (15:07 -0400)]
Merge branch 'mlx4-XDP-tx-refactor'
Tariq Toukan says:
====================
mlx4 XDP TX refactor
This patchset refactors the XDP forwarding case, so that
its dedicated transmit queues are managed in a complete
separation from the other regular ones.
It also adds ethtool counters for XDP cases.
Series generated against net-next commit:
22ca904ad70a genetlink: fix error return code in genl_register_family()
Thanks,
Tariq.
v3:
* Exposed per ring counters.
v2:
* Added ethtool counters.
* Rebased, now patch 2 reverts Brenden's fix, as the bug no longer exists:
958b3d396d7f ("net/mlx4_en: fixup xdp tx irq to match rx")
* Updated commit message of patch 2.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Wed, 2 Nov 2016 15:12:25 +0000 (17:12 +0200)]
net/mlx4_en: Add ethtool statistics for XDP cases
XDP statistics are reported in ethtool, in total and per ring,
as follows:
- xdp_drop: the number of packets dropped by xdp.
- xdp_tx: the number of packets forwarded by xdp.
- xdp_tx_full: the number of times an xdp forward failed
due to a full tx xdp ring.
In addition, all packets that are dropped/forwarded by XDP
are no longer accounted in rx_packets/rx_bytes of the ring,
so that they count traffic that is passed to the stack.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Wed, 2 Nov 2016 15:12:24 +0000 (17:12 +0200)]
net/mlx4_en: Refactor the XDP forwarding rings scheme
Separately manage the two types of TX rings: regular ones, and XDP.
Upon an XDP set, do not borrow regular TX rings and convert them
into XDP ones, but allocate new ones, unless we hit the max number
of rings.
Which means that in systems with smaller #cores we will not consume
the current TX rings for XDP, while we are still in the num TX limit.
XDP TX rings counters are not shown in ethtool statistics.
Instead, XDP counters will be added to the respective RX rings
in a downstream patch.
This has no performance implications.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Wed, 2 Nov 2016 15:12:23 +0000 (17:12 +0200)]
net/mlx4_en: Add TX_XDP for CQ types
Support XDP CQ type, and refactor the CQ type enum.
Rename the is_tx field to match the change.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 31 Oct 2016 16:49:41 +0000 (00:49 +0800)]
sctp: clean up sctp_packet_transmit
After adding sctp gso, sctp_packet_transmit is a quite big function now.
This patch is to extract the codes for packing packet to sctp_packet_pack
from sctp_packet_transmit, and add some comments, simplify the err path by
freeing auth chunk when freeing packet chunk_list in out path and freeing
head skb early if it fails to pack packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Nov 2016 19:00:48 +0000 (15:00 -0400)]
Merge branch 'cls_flower-misc'
Roi Dayan says:
====================
misc TC/flower changes
This series includes two small changes to the TC flower classifier.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Roi Dayan [Tue, 1 Nov 2016 14:08:29 +0000 (16:08 +0200)]
net/sched: cls_flower: merge filter delete/destroy common code
Move common code from fl_delete and fl_detroy to __fl_delete.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roi Dayan [Tue, 1 Nov 2016 14:08:28 +0000 (16:08 +0200)]
net/sched: cls_flower: add missing unbind call when destroying flows
tcf_unbind was called in fl_delete but was missing in fl_destroy when
force deleting flows.
Fixes:
77b9900ef53a ('tc: introduce Flower classifier')
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Nov 2016 18:57:47 +0000 (14:57 -0400)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:
1) Add fib lookup expression for nf_tables, from Florian Westphal. This
new expression provides a native replacement for iptables addrtype
and rp_filter matches. This is more flexible though, since we can
populate the kernel flowi representation to inquire fib to
accomodate new usecases, such as RTBH through skb mark.
2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
new expression allow you to access skbuff route metadata, more
specifically nexthop and classid fields.
3) Add notrack support for nf_tables, to skip conntracking, requested by
many users already.
4) Add boilerplate code to allow to use nf_log infrastructure from
nf_tables ingress.
5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
the xtables cluster match, from Liping Zhang.
6) Move socket lookup code into generic nf_socket_* infrastructure so
we can provide a native replacement for the xtables socket match.
7) Make sure nfnetlink_queue data that is updated on every packets is
placed in a different cache from read-only data, from Florian Westphal.
8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.
9) Start round robin number generation in nft_numgen from zero,
instead of n-1, for consistency with xtables statistics match,
patch from Liping Zhang.
10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
given we retry with a smaller allocation on failure, from Calvin Owens.
11) Cleanup xt_multiport to use switch(), from Gao feng.
12) Remove superfluous check in nft_immediate and nft_cmp, from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Sun, 30 Oct 2016 23:35:07 +0000 (00:35 +0100)]
netfilter: nf_queue: place volatile data in own cacheline
As the comment indicates, the data at the end of nfqnl_instance struct is
written on every queue/dequeue, so it should reside in its own cacheline.
Before this change, 'lock' was in first cacheline so we dirtied both.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang [Sat, 29 Oct 2016 13:56:27 +0000 (21:56 +0800)]
netfilter: nf_tables: remove useless U8_MAX validation
After call nft_data_init, size is already validated and desc.len will
not exceed the sizeof(struct nft_data), i.e. 16 bytes. So it will never
exceed U8_MAX.
Furthermore, in nft_immediate_init, we forget to call nft_data_uninit
when desc.len exceeds U8_MAX, although this will not happen, but it's
a logical mistake.
Now remove these redundant validation introduced by commit
36b701fae12a
("netfilter: nf_tables: validate maximum value of u32 netlink attributes")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Anders K. Pedersen [Fri, 28 Oct 2016 05:54:15 +0000 (05:54 +0000)]
netfilter: nf_tables: introduce routing expression
Introduces an nftables rt expression for routing related data with support
for nexthop (i.e. the directly connected IP address that an outgoing packet
is sent to), which can be used either for matching or accounting, eg.
# nft add rule filter postrouting \
ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop
This will drop any traffic to 192.168.1.0/24 that is not routed via
192.168.0.1.
# nft add rule filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule ip6 filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
These rules count outgoing traffic per nexthop. Note that the timeout
releases an entry if no traffic is seen for this nexthop within 10 minutes.
# nft add rule inet filter postrouting \
ether type ip \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule inet filter postrouting \
ether type ip6 \
flow table acct { rt nexthop timeout 600s counter }
Same as above, but via the inet family, where the ether type must be
specified explicitly.
"rt classid" is also implemented identical to "meta rtclassid", since it
is more logical to have this match in the routing expression going forward.
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Thu, 27 Oct 2016 18:49:48 +0000 (19:49 +0100)]
netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c
We need this split to reuse existing codebase for the upcoming nf_tables
socket expression.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Thu, 27 Oct 2016 18:49:42 +0000 (19:49 +0100)]
netfilter: nf_log: add packet logging for netdev family
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.
This patch adds the boiler plate code to register the netdev logging
family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 24 Oct 2016 14:56:40 +0000 (16:56 +0200)]
netfilter: nf_tables: add fib expression
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).
Currently supports fetching output interface index/name and the
rtm_type associated with an address.
This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.
The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.
FIB result order for oif/oifname retrieval is as follows:
- if packet is local (skb has rtable, RTF_LOCAL set, this
will also catch looped-back multicast packets), set oif to
the loopback interface.
- if fib lookup returns an error, or result points to local,
store zero result. This means '--local' option of -m rpfilter
is not supported. It is possible to use 'fib type local' or add
explicit saddr/daddr matching rules to create exceptions if this
is really needed.
- store result in the destination register.
In case of multiple routes, search set for desired oif in case
strict matching is requested.
ipv4 and ipv6 behave fib expressions are supposed to behave the same.
[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
http://patchwork.ozlabs.org/patch/688615/
to address fallout from this patch after rebasing nf-next, that was
posted to address compilation warnings. --pablo ]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Wei Yongjun [Tue, 1 Nov 2016 14:45:52 +0000 (14:45 +0000)]
genetlink: fix error return code in genl_register_family()
Fix to return a negative error code from the idr_alloc() error handling
case instead of 0, as done elsewhere in this function.
Also fix the return value check of idr_alloc() since idr_alloc return
negative errors on failure, not zero.
Fixes:
2ae0f17df1cd ("genetlink: use idr to track families")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 31 Oct 2016 22:54:00 +0000 (15:54 -0700)]
net: Enable support for VRF with ipv4 multicast
Enable support for IPv4 multicast:
- similar to unicast the flow struct is updated to L3 master device
if relevant prior to calling fib_rules_lookup. The table id is saved
to the lookup arg so the rule action for ipmr can return the table
associated with the device.
- ip_mr_forward needs to check for master device mismatch as well
since the skb->dev is set to it
- allow multicast address on VRF device for Rx by checking for the
daddr in the VRF device as well as the original ingress device
- on Tx need to drop to __mkroute_output when FIB lookup fails for
multicast destination address.
- if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
IPMR FIB rules on first device create similar to FIB rules. In
addition the VRF driver does not divert IPv4 multicast packets:
it breaks on Tx since the fib lookup fails on the mcast address.
With this patch, ipmr forwarding and local rx/tx work.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Nov 2016 15:53:26 +0000 (11:53 -0400)]
Merge branch 'tipc-socket-layer-improvements'
Parthasarathy Bhuvaragan says:
====================
tipc: socket layer improvements
The following issues with the current socket layer hinders socket diagnostics
implementation, which led to this patch series.
1. tipc socket state is derived from multiple variables like
sock->state, tsk->probing_state and tsk->connected. This style forces
us to export multiple attributes to the user space, which has to be
backward compatible.
2. Abuse of sock->state cannot be exported to user-space without
requiring tipc specific hacks in the user-space.
- For connection less (CL) sockets sock->state is overloaded to
tipc state SS_READY.
- For connection oriented (CO) listening socket sock->state is
overloaded to tipc state SS_LISTEN.
This series is split into four:
1. Bug fixes in patch #1,2,3.
2. Minor cleanups in patch#4-5.
3. Express all tipc states using a single variable in patch#6-8.
4. Migrate the new tipc states to sk->sk_state in patch#9-16.
The figures below represents the FSM after this series:
Stream Server Listening Socket:
+-----------+ +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+ +-------------+
Stream Server Data Socket:
+-----------+ +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+ +------------------+
^ |
| |
| v
+--------------------+
| TIPC_DISCONNECTING |
+--------------------+
Stream Socket Client:
+-----------+ +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+ +-----------------+ |
| |
| |
v |
+------------------+ |
| TIPC_ESTABLISHED | |
+------------------+ |
^ | |
| | |
| v |
+--------------------+ |
| TIPC_DISCONNECTING |<--+
+--------------------+
NOTE:
This is just a base refractoring required for socket diagnostics.
TIPC socket diagnostics support will be introduced in a later series.
v2: - remove extra cast and parenthesis as suggested by David S. Miller in #4.
- map new tipc state values to tcp states to address Eric Dumazet's concern,
thus allow the usage of generic sk_* helpers. This is done in patch#10-15.
- remove TIPC_PROBING state and replace it with probe_unacked flag in #11.
- replace the TIPC_CLOSING state in v1 with sk_shutdown flag in #14.
- introduce __tipc_shutdown() to avoid code duplication in #14.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:49 +0000 (14:02 +0100)]
tipc: remove SS_CONNECTED sock state
In this commit, we replace references to sock->state SS_CONNECTE
with sk_state TIPC_ESTABLISHED.
Finally, the sock->state is no longer explicitly used by tipc.
The FSM below is for various types of connection oriented sockets.
Stream Server Listening Socket:
+-----------+ +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+ +-------------+
Stream Server Data Socket:
+-----------+ +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+ +------------------+
^ |
| |
| v
+--------------------+
| TIPC_DISCONNECTING |
+--------------------+
Stream Socket Client:
+-----------+ +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+ +-----------------+ |
| |
| |
v |
+------------------+ |
| TIPC_ESTABLISHED | |
+------------------+ |
^ | |
| | |
| v |
+--------------------+ |
| TIPC_DISCONNECTING |<--+
+--------------------+
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:48 +0000 (14:02 +0100)]
tipc: create TIPC_CONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_CONNECTING
by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.
There is no functional change in this commit.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:47 +0000 (14:02 +0100)]
tipc: remove SS_DISCONNECTING state
In this commit, we replace the references to SS_DISCONNECTING with
the combination of sk_state TIPC_DISCONNECTING and flags set in
sk_shutdown.
We introduce a new function _tipc_shutdown(), which provides
the common code required by tipc_release() and tipc_shutdown().
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:46 +0000 (14:02 +0100)]
tipc: create TIPC_DISCONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
sk_state. TIPC_DISCONNECTING is replacing the socket connection status
update using SS_DISCONNECTING.
TIPC_DISCONNECTING is set for connection oriented sockets at:
- tipc_shutdown()
- connection probe timeout
- when we receive an error message on the connection.
There is no functional change in this commit.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:45 +0000 (14:02 +0100)]
tipc: create TIPC_OPEN as a new sk_state
In this commit, we create a new tipc socket state TIPC_OPEN in
sk_state. We primarily replace the SS_UNCONNECTED sock->state with
TIPC_OPEN.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:44 +0000 (14:02 +0100)]
tipc: create TIPC_ESTABLISHED as a new sk_state
Until now, tipc maintains probing state for connected sockets in
tsk->probing_state variable.
In this commit, we express this information as socket states and
this remove the variable. We set probe_unacked flag when a probe
is sent out and reset it if we receive a reply. Instead of the
probing state TIPC_CONN_OK, we create a new state TIPC_ESTABLISHED.
There is no functional change in this commit.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:43 +0000 (14:02 +0100)]
tipc: create TIPC_LISTEN as a new sk_state
Until now, tipc maintains the socket state in sock->state variable.
This is used to maintain generic socket states, but in tipc
we overload it and save tipc socket states like TIPC_LISTEN.
Other protocols like TCP, UDP store protocol specific states
in sk->sk_state instead.
In this commit, we :
- declare a new tipc state TIPC_LISTEN, that replaces SS_LISTEN
- Create a new function tipc_set_state(), to update sk->sk_state.
- TIPC_LISTEN state is maintained in sk->sk_state.
- replace references to SS_LISTEN with TIPC_LISTEN.
There is no functional change in this commit.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:42 +0000 (14:02 +0100)]
tipc: remove socket state SS_READY
Until now, tipc socket state SS_READY declares that the socket is a
connectionless socket.
In this commit, we remove the state SS_READY and replace it with a
condition which returns true for datagram / connectionless sockets.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:41 +0000 (14:02 +0100)]
tipc: remove probing_intv from tipc_sock
Until now, probing_intv is a variable in struct tipc_sock but is
always set to a constant CONN_PROBING_INTERVAL. The socket
connection is probed based on this value.
In this commit, we remove this variable and setup the socket
timer based on the constant CONN_PROBING_INTERVAL.
There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:40 +0000 (14:02 +0100)]
tipc: remove tsk->connected from tipc_sock
Until now, we determine if a socket is connected or not based on
tsk->connected, which is set once when the probing state is set
to TIPC_CONN_OK. It is unset when the sock->state is updated from
SS_CONNECTED to any other state.
In this commit, we remove connected variable from tipc_sock and
derive socket connection status from the following condition:
sock->state == SS_CONNECTED => tsk->connected
There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:39 +0000 (14:02 +0100)]
tipc: remove tsk->connected for connectionless sockets
Until now, for connectionless sockets the peer information during
connect is stored in tsk->peer and a connection state is set in
tsk->connected. This is redundant.
In this commit, for connectionless sockets we update:
- __tipc_sendmsg(), when the destination is NULL the peer existence
is determined by tsk->peer.family, instead of tsk->connected.
- tipc_connect(), remove set/unset of tsk->connected.
Hence tsk->connected is no longer used for connectionless sockets.
There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:38 +0000 (14:02 +0100)]
tipc: rename tsk->remote to tsk->peer for consistent naming
Until now, the peer information for connect is stored in tsk->remote
but the rest of code uses the name peer for peer/remote.
In this commit, we rename tsk->remote to tsk->peer to align with
naming convention followed in the rest of the code.
There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:37 +0000 (14:02 +0100)]
tipc: rename struct tipc_skb_cb member handle to bytes_read
In this commit, we rename handle to bytes_read indicating the
purpose of the member.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:36 +0000 (14:02 +0100)]
tipc: set kern=0 in sk_alloc() during tipc_accept()
Until now, tipc_accept() calls sk_alloc() with kern=1. This is
incorrect as the data socket's owner is the user application.
Thus for these accepted data sockets the network namespace
refcount is skipped.
In this commit, we fix this by setting kern=0.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:35 +0000 (14:02 +0100)]
tipc: wakeup sleeping users at disconnect
Until now, in filter_connect() when we terminate a connection due to
an error message from peer, we set the socket state to DISCONNECTING.
The socket is notified about this broken connection using EPIPE when
a user tries to send a message. However if a socket was waiting on a
poll() while the connection is being terminated, we fail to wakeup
that socket.
In this commit, we wakeup sleeping sockets at connection termination.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parthasarathy Bhuvaragan [Tue, 1 Nov 2016 13:02:34 +0000 (14:02 +0100)]
tipc: return early for non-blocking sockets at link congestion
Until now, in stream/mcast send() we pass the message to the link
layer even when the link is congested and add the socket to the
link's wakeup queue. This is unnecessary for non-blocking sockets.
If a socket is set to non-blocking and sends multicast with zero
back off time while receiving EAGAIN, we exhaust the memory.
In this commit, we return immediately at stream/mcast send() for
non-blocking sockets.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Nov 2016 15:05:01 +0000 (11:05 -0400)]
Merge branch 'nfp-cleanups-and-RX-path-rewrite'
Jakub Kicinski says:
====================
nfp: cleanups and RX path rewrite
This series lays groundwork for upcoming XDP support by updating
the RX path not to pre-allocate sk_buffs. I start with few
cleanups, removal of NFP3200-related code being the most significant.
Patch 7 moves to alloc_frag() and build_skb() APIs. Again, a number
of small cleanups follow. The set ends with adding support for
different number of RX and TX rings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:22 +0000 (20:43 +0000)]
nfp: bring back support for different ring counts
We used to always allocate the same number of TX and RX rings
so the support for having r_vectors without one of the rings
was dropped. That makes us, however, unnecessarily limited
to 8 TX rings (8 is the Linux RSS default) most of the time.
Also we are about to add channel count configuration via
ethtool, so bring that support back. TX rings can now default
to num_online_cpus() and RX rings to netif_get_num_default_rss_queues().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:21 +0000 (20:43 +0000)]
nfp: replace num_irqs with max_r_vecs
num_irqs is not used anywhere, replace it with max_r_vecs which holds
number of allocated RX/TX vectors and is going to be useful soon.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:20 +0000 (20:43 +0000)]
nfp: remove nfp_net_irqs_wanted()
nfp_net_irqs_wanted() doesn't really encapsulate much logic,
remove it and inline the calculations.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:19 +0000 (20:43 +0000)]
nfp: use unsigned int for vector/ring counts
Use unsigned int consistently for vector/ring counts.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:18 +0000 (20:43 +0000)]
nfp: create separate define for max number of vectors
We are currently using define for max TX rings to allocate IRQ
vectors. It's OK since the max number of rings for TX and RX
are currently the same, but lets make the code nicer by taking
max of the two.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 31 Oct 2016 20:43:17 +0000 (20:43 +0000)]
nfp: use AND instead of modulo to get ring indexes
We already force ring sizes to be power of 2 so replace
modulo operations with AND (size - 1) in index calculations.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>