Wei Wang [Sat, 17 Jun 2017 17:42:32 +0000 (10:42 -0700)]
ipv4: mark DST_NOGC and remove the operation of dst_free()
With the previous preparation patches, we are ready to get rid of the
dst gc operation in ipv4 code and release dst based on refcnt only.
So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
to dst_free().
At this point, all dst created in ipv4 code do not use the dst gc
anymore and will be destroyed at the point when refcnt drops to 0.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:31 +0000 (10:42 -0700)]
ipv4: call dst_hold_safe() properly
This patch checks all the calls to
dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
dst_hold_safe() is needed to avoid double free issue if dst
gc is removed and dst_release() directly destroys dst when
dst->__refcnt drops to 0.
In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
UDP and other similar protocols always hold refcount for
skb->_skb_refdst. So both paths seem to be safe.
In rx path, as it is lockless and skb_dst_set_noref() is likely to be
used, dst_hold_safe() should always be used when trying to hold dst.
In the routing code, if dst is held during an rcu protected session, it
is necessary to call dst_hold_safe() as the current dst might be in its
rcu grace period.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:30 +0000 (10:42 -0700)]
ipv4: call dst_dev_put() properly
As the intend of this patch series is to completely remove dst gc,
we need to call dst_dev_put() to release the reference to dst->dev
when removing routes from fib because we won't keep the gc list anymore
and will lose the dst pointer right after removing the routes.
Without the gc list, there is no way to find all the dst's that have
dst->dev pointing to the going-down dev.
Hence, we are doing dst_dev_put() immediately before we lose the last
reference of the dst from the routing code. The next dst_check() will
trigger a route re-lookup to find another route (if there is any).
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:29 +0000 (10:42 -0700)]
ipv4: take dst->__refcnt when caching dst in fib
In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
to struct rtable but they never increment dst->__refcnt.
This leads to the need of the dst garbage collector because when user
is done with this dst and calls dst_release(), it can only decrement
dst->__refcnt and can not free the dst even it sees dst->__refcnt
drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing
code might still hold reference to it.
And when the routing code tries to delete a route, it has to put the
dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
running periodically to check on dst->__refcnt and finally to free dst
when refcnt becomes 0.
This patch increments dst->__refcnt when
fib_nh/fib_nh_exception holds reference to this dst and properly release
the dst when fib_nh/fib_nh_exception has been updated with a new dst.
This patch is a preparation in order to fully get rid of dst gc later.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:28 +0000 (10:42 -0700)]
net: introduce a new function dst_dev_put()
This function should be called when removing routes from fib tree after
the dst gc is no longer in use.
We first mark DST_OBSOLETE_DEAD on this dst to make sure next
dst_ops->check() fails and returns NULL.
Secondly, as we no longer keep the gc_list, we need to properly
release dst->dev right at the moment when the dst is removed from
the fib/fib6 tree.
It does the following:
1. change dst->input and output pointers to dst_discard/dst_dscard_out to
discard all packets
2. replace dst->dev with loopback interface
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:27 +0000 (10:42 -0700)]
net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt
The current mechanism of freeing dst is a bit complicated. dst has its
ref count and when user grabs the reference to the dst, the ref count is
properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
code due to some historic reasons.
If the reference to dst is always taken properly, we should be able to
simplify the logic in dst_release() to destroy dst when dst->__refcnt
drops from 1 to 0. And this should be the only condition to determine
if we can call dst_destroy().
And as dst is always ref counted, there is no need for a dst garbage
list to hold the dst entries that already get removed by the routing
code but are still held by other users. And the task to periodically
check the list to free dst if ref count become 0 is also not needed
anymore.
This patch introduces a temporary flag DST_NOGC(no garbage collector).
If it is set in the dst, dst_release() will call dst_destroy() when
dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
issue.
This temporary flag is mainly used so that we can make the transition
component by component without breaking other parts.
This flag will be removed after all components are properly transitioned.
This patch also introduces a new function dst_release_immediate() which
destroys dst without waiting on the rcu when refcnt drops to 0. It will
be used in later patches.
Follow-up patches will correct all the places to properly take ref count
on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
be used to release the dst instead of dst_free() and its related
functions.
And final clean-up patch will remove the DST_NOGC flag.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:26 +0000 (10:42 -0700)]
net: use loopback dev when generating blackhole route
Existing ipv4/6_blackhole_route() code generates a blackhole route
with dst->dev pointing to the passed in dst->dev.
It is not necessary to hold reference to the passed in dst->dev
because the packets going through this route are dropped anyway.
A loopback interface is good enough so that we don't need to worry about
releasing this dst->dev when this dev is going down.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:25 +0000 (10:42 -0700)]
udp: call dst_hold_safe() in udp_sk_rx_set_dst()
In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set()
function, we will try to cache this dst in sk for connected case.
However, a better way to achieve this is to not try to hold dst in
early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This
approach is also more consistant with how tcp is handling it. And it
will make later changes simpler.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Sat, 17 Jun 2017 17:42:24 +0000 (10:42 -0700)]
ipv6: remove unnecessary dst_hold() in ip6_fragment()
In ipv6 tx path, rcu_read_lock() is taken so that dst won't get freed
during the execution of ip6_fragment(). Hence, no need to hold dst in
it.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 19:22:42 +0000 (15:22 -0400)]
Merge tag 'mlx5-updates-2017-06-16' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox mlx5 updates and cleanups 2017-06-16
mlx5-updates-2017-06-16
This series provide some updates and cleanups for mlx5 core and netdevice
driver.
From Eli Cohen, add a missing event string.
From Or Gerlitz, some checkpatch cleanups.
From Moni, Disalbe HW level LAG when SRIOV is enabled.
From Tariq, A code reuse cleanup in aRFS flow.
From Itay Aveksis, Typo fix.
From Gal Pressman, ethtool statistics updates and "update stats" deferred work optimizations.
From Majd Dibbiny, Fast unload support on kernel shutdown.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 15 Jun 2017 20:14:48 +0000 (16:14 -0400)]
net: dsa: add cross-chip multicast support
Similarly to how cross-chip VLAN works, define a bitmap of multicast
group members for a switch, now including its DSA ports, so that
multicast traffic can be sent to all switches of the fabric.
A switch may drop the frames if no user port is a member.
This brings support for multicast in a multi-chip environment.
As of now, all switches of the fabric must support the multicast
operations in order to program a single fabric port.
Reported-by: Jason Cobham <jcobham@questertangent.com>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Tested-by: Jason Cobham <jcobham@questertangent.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Fontenot [Thu, 15 Jun 2017 18:48:09 +0000 (14:48 -0400)]
ibmvnic: driver initialization for kdump/kexec
When booting into the kdump/kexec kernel, pHyp and vios
are not prepared for the initialization crq request and
a failover transport event is generated. This is not
handled correctly.
At this point in initialization the driver is still in
the 'probing' state and cannot handle a full reset of the
driver as is normally done for a failover transport event.
To correct this we catch driver resets while still in the
'probing' state and return EAGAIN. This results in the
driver tearing down the main crq and calling ibmvnic_init()
again.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Fri, 16 Jun 2017 17:46:37 +0000 (10:46 -0700)]
decnet: always not take dst->__refcnt when inserting dst into hash table
In the existing dn_route.c code, dn_route_output_slow() takes
dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
does not take dst->__refcnt before calling dn_insert_route().
This makes the whole routing code very buggy.
In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
makes the routes inserted by dn_route_output_slow() not able to be
freed as the refcnt is not released.
In dn_dst_gc(), dnrt_drop() is called to release rt which could
potentially cause the dst->__refcnt to be dropped to -1.
In dn_run_flush(), dst_free() is called to release all the dst. Again,
it makes the dst inserted by dn_route_output_slow() not able to be
released and also, it does not wait on the rcu and could potentially
cause crash in the path where other users still refer to this dst.
This patch makes sure both input and output path do not take
dst->__refcnt before calling dn_insert_route() and also makes sure
dnrt_free()/dst_free() is called when removing dst from the hash table.
The only difference between those 2 calls is that dnrt_free() waits on
the rcu while dst_free() does not.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 16:45:16 +0000 (12:45 -0400)]
Merge branch 'rds-tcp-misc-bug-fixes'
Sowmini Varadhan says:
====================
rds: tcp: misc bug fixes
This series contains 2 bug fixes (patch2, patch3) and one bit of
code cleanup (patch1) identified during database testing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 15 Jun 2017 18:28:55 +0000 (11:28 -0700)]
rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.
rds_tcp_accept_one() may reject the incoming syn for a number of
reasons, e.g., commit
1a0e100fb2c9 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
syn is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with a RST.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 15 Jun 2017 18:28:54 +0000 (11:28 -0700)]
rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets, so the address comparison patterns hit the corner cases and
brought out some bugs fixed by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 15 Jun 2017 18:28:53 +0000 (11:28 -0700)]
rds: tcp: remove cp_outgoing
After commit
1a0e100fb2c9 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 16:43:49 +0000 (12:43 -0400)]
Merge branch 'dsa-loop-Driver-updates'
Florian Fainelli says:
====================
net: dsa: loop: Driver updates
This patch series updates drivers/net/dsa/dsa_loop.c to provide more useful
coverage of the DSA APIs and how we mangle the CPU ports ethtool statistics
to include its switch-facing port counters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 15 Jun 2017 17:15:53 +0000 (10:15 -0700)]
net: dsa: loop: Implement ethtool statistics
When a DSA driver implements ethtool statistics, we also override the
master network device's ethtool statistics with the CPU port's
statistics and this has proven to be a possible source of bugs in the
past. Enhance the dsa_loop.c driver to provide statistics under the
forme of ok/error reads and writes from the per-port PHY read/writes.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 15 Jun 2017 17:15:52 +0000 (10:15 -0700)]
net: dsa: loop: Inline unregister_fixed_phys()
This is a simple function that only gets used in the driver's remove
function, inline it there.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 16:32:35 +0000 (12:32 -0400)]
Merge branch 'pktgen-new-parameters'
Tariq Toukan says:
====================
pktgen new parameters
This patchset adds two parameters to the pktgen scripts.
* The first patch adds a parameter to control the number of
packets sent by every pktgen thread.
* The second patch adds a parameter to control the index of
first thread, instead of always starting from cpu 0.
Series generated against net-next commit:
f7aec129a356 rxrpc: Cache the congestion window setting
Thanks,
Tariq.
V2:
* rebased to latest net-next.
* per Jesper's comments on Patch 2, fixed $F_THREAD description
and $L_THREAD evaluation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 16:07:22 +0000 (19:07 +0300)]
pktgen: Specify the index of first thread
Use "-f <num>", to specify the index of the first
sender thread.
In default first thread is #0.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 16:07:21 +0000 (19:07 +0300)]
pktgen: Specify num packets per thread
Use -n <num>, to specify the number of packets every
thread sends.
Zero means indefinitely.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 16:27:13 +0000 (12:27 -0400)]
Merge branch 'net-mvmdio-add-xMDIO-xSMI-support'
Antoine Tenart says:
====================
net: mvmdio: add xMDIO xSMI support
This series aims to add the xSMI support on the xMDIO bus to the
mvmdio driver. The xSMI interface complies with the IEEE 802.3 clause 45
and is used by 10GbE devices. On 7k and 8k (as of now), such an
interface is found and is used by Ethernet controllers.
Patches 1-4 and 9 are cosmetic cleanups.
Patches 5-7 are prerequisites to the xSMI support.
Patches 8 and 10-11 add the xSMI support to the mvmdio driver, and a
node is added both in the cp110 slave and master device trees.
This was tested on an Armada 8040 mcbin, as well as on both the
Armada 7040 DB and the Armada 8040 DB to ensure the SMI interface
was still working.
@Dave: patch 11 should go through the mvebu tree as asked by Gregory,
thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:25 +0000 (16:43 +0200)]
dt-bindings: orion-mdio: document the new xmdio compatible
A new compatible for Marvell xMDIO interfaces was added into the Marvell
MDIO driver. Document this new compatible.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:24 +0000 (16:43 +0200)]
net: mvmdio: simplify the smi read and write error paths
Cosmetic patch simplifying the smi read and write error paths. It also
align their error paths with the ones of the xsmi functions.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:23 +0000 (16:43 +0200)]
net: mvmdio: add xmdio xsmi support
This patch adds the xmdio xsmi interface support in the mvmdio driver.
This interface is used in Ethernet controllers on Marvell 370, 7k and 8k
(as of now). The xsmi interface supported by this driver complies with
the IEEE 802.3 clause 45. The xSMI interface is used by 10GbE devices.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:22 +0000 (16:43 +0200)]
net: mvmdio: check the MII_ADDR_C45 bit is not set for smi operations
Add a check for the read and write smi operations, to ensure the
MII_ADDR_C45 bit isn't set. This will be needed as soon as the xSMI
support is added to the mvmdio driver.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:21 +0000 (16:43 +0200)]
net: mvmdio: put the poll intervals in the ops structure
Put the two poll intervals (min and max) in the driver's ops
structure. This is needed to add the xmdio support later.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:20 +0000 (16:43 +0200)]
net: mvmdio: introduce an ops structure
Introduce an ops structure to add an indirection on the is_done
function, as this is needed to add the xMDIO support later.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Thu, 15 Jun 2017 14:43:19 +0000 (16:43 +0200)]
net: mvmdio: remove duplicate locking
The MDIO layer already provides per-bus locking, so there's no need for
MDIO bus drivers to do their own internal locking. Remove this.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:18 +0000 (16:43 +0200)]
net: mvmdio: use GENMASK for masks
Cosmetic patch to use the GENMASK helper for masks.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:17 +0000 (16:43 +0200)]
net: mvmdio: use tabs for defines
Cosmetic patch replacing spaces by tabs for defined values.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Thu, 15 Jun 2017 14:43:16 +0000 (16:43 +0200)]
net: mvmdio: reorder headers alphabetically
Cosmetic fix reordering headers alphabetically.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 15:58:38 +0000 (11:58 -0400)]
Merge branch 'bpf-xdp-Report-bpf_prog-ID-in-IFLA_XDP'
Martin KaFai Lau says:
====================
bpf: xdp: Report bpf_prog ID in IFLA_XDP
This is the first usage of the new bpf_prog ID. It is for
reporting the ID of a xdp_prog through netlink.
It rides on the existing IFLA_XDP. This patch adds IFLA_XDP_PROG_ID
for the bpf_prog ID reporting.
It starts with changing the generic_xdp first. After that,
the hardware driver is changed one by one. Jakub Kicinski mentioned
that he will soon introduce XDP_ATTACHED_HW (on top of the existing
XDP_ATTACHED_DRV and XDP_ATTACHED_SKB)
and he is going to reuse the prog_attached for this purpose.
Hence, this patch set keeps the prog_attached even though
!!prog_id also implies there is xdp_prog attached.
I have tested with generic_xdp, mlx4 and mlx5.
v3:
1. Replace 'if' by '?' when checking the xdp_prog pointer
as suggested by Jakub Kicinski (thanks!)
v2:
1. Remove READ_ONCE since it is alredy under rtnl lock
2. Keep prog_attached in 'struct netdev_xdp' as
requested by Jakub Kicinski. The existing prog_attached
and the new prog_id are put under a struct for XDP_QUERY_PROG.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:17 +0000 (17:29 -0700)]
bpf: qede: Report bpf_prog ID during XDP_QUERY_PROG
Add support to qede to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Mintz Yuval <Yuval.Mintz@cavium.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:16 +0000 (17:29 -0700)]
bpf: nfp: Report bpf_prog ID during XDP_QUERY_PROG
Add support to nfp to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:15 +0000 (17:29 -0700)]
bpf: ixgbe: Report bpf_prog ID during XDP_QUERY_PROG
Add support to ixgbe to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:14 +0000 (17:29 -0700)]
bpf: thunderx: Report bpf_prog ID during XDP_QUERY_PROG
Add support to thunderx to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Sunil Goutham <sgoutham@cavium.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:13 +0000 (17:29 -0700)]
bpf: bnxt: Report bpf_prog ID during XDP_QUERY_PROG
Add support to bnxt to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:12 +0000 (17:29 -0700)]
bpf: virtio_net: Report bpf_prog ID during XDP_QUERY_PROG
Add support to virtio_net to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:11 +0000 (17:29 -0700)]
bpf: mlx5e: Report bpf_prog ID during XDP_QUERY_PROG
Add support to mlx5e to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:10 +0000 (17:29 -0700)]
bpf: mlx4: Report bpf_prog ID during XDP_QUERY_PROG
Add support to mlx4 to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 16 Jun 2017 00:29:09 +0000 (17:29 -0700)]
net: Add IFLA_XDP_PROG_ID
Expose prog_id through IFLA_XDP_PROG_ID. This patch
makes modification to generic_xdp. The later patches will
modify other xdp-supported drivers.
prog_id is added to struct net_dev_xdp.
iproute2 patch will be followed. Here is how the 'ip link'
will look like:
> ip link show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 15:48:41 +0000 (11:48 -0400)]
Merge branch 'skb-accessor-cleanups'
Johannes Berg says:
====================
skb data accessors cleanup
Over night, Fengguang's bot told me that it compiled all of its many
various configurations successfully, and I had done allyesconfig on
x86_64 myself yesterday to iron out the things I missed.
So now I think I'm happy with it.
My tree was based on your
commit
3715c47bcda8bb56f7e2be27276282a2d0d48c09
Merge:
18b6e7955d8f d8fbd27469fc
Author: David S. Miller <davem@davemloft.net>
Date: Thu Jun 15 14:31:56 2017 -0400
Merge branch 'r8152-support-new-chips'
when the compilation tests happened, but I've reviewed the changes
coming into net-next in the meantime and didn't see any new usages
of skb data accessors having come in.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:24 +0000 (14:29 +0200)]
networking: add and use skb_put_u8()
Joe and Bjørn suggested that it'd be nicer to not have the
cast in the fairly common case of doing
*(u8 *)skb_put(skb, 1) = c;
Add skb_put_u8() for this case, and use it across the code,
using the following spatch:
@@
expression SKB, C, S;
typedef u8;
identifier fn = {skb_put};
fresh identifier fn2 = fn ## "_u8";
@@
- *(u8 *)fn(SKB, S) = C;
+ fn2(SKB, C);
Note that due to the "S", the spatch isn't perfect, it should
have checked that S is 1, but there's also places that use a
sizeof expression like sizeof(var) or sizeof(u8) etc. Turns
out that nobody ever did something like
*(u8 *)skb_put(skb, 2) = c;
which would be wrong anyway since the second byte wouldn't be
initialized.
Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:23 +0000 (14:29 +0200)]
networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)
Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:22 +0000 (14:29 +0200)]
networking: make skb_pull & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = {
skb_pull,
__skb_pull,
skb_pull_inline,
__pskb_pull_tail,
__pskb_pull,
pskb_pull
};
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = {
skb_pull,
__skb_pull,
skb_pull_inline,
__pskb_pull_tail,
__pskb_pull,
pskb_pull
};
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:21 +0000 (14:29 +0200)]
networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
which actually doesn't cover pskb_put since there are only three
users overall.
A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:20 +0000 (14:29 +0200)]
networking: introduce and use skb_put_data()
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.
An spatch similar to the one for skb_put_zero() converts many
of the places using it:
@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)
@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)
@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);
(again, manually post-processed to retain some comments)
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Jun 2017 12:29:19 +0000 (14:29 +0200)]
networking: convert many more places to skb_put_zero()
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.
The following spatch found many more and also removes the
now unnecessary casts:
@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_zero(skb, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)
@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)
@@
expression skb, len;
@@
-memset(skb_put(skb, len), 0, len);
+skb_put_zero(skb, len);
Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 15:37:13 +0000 (11:37 -0400)]
Merge branch 'r8152-adjust-runtime-suspend-resume'
Hayes Wang says:
====================
r8152: adjust runtime suspend/resume
v2:
For #1, replace GFP_KERNEL with GFP_NOIO for usb_submit_urb().
v1:
Improve the flow about runtime suspend/resume and make the code
easy to read.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 13 Jun 2017 07:14:40 +0000 (15:14 +0800)]
r8152: move calling delay_autosuspend function
Move calling delay_autosuspend() in rtl8152_runtime_suspend(). Calling
delay_autosuspend() as late as possible.
The original flows are
1. check if the driver/device is busy now.
2. set wake events.
3. enter runtime suspend.
If the wake event occurs between (1) and (2), the device may miss it. Besides,
to avoid the runtime resume occurs after runtime suspend immediately, move the
checking to the end of rtl8152_runtime_suspend().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 13 Jun 2017 07:14:39 +0000 (15:14 +0800)]
r8152: split rtl8152_resume function
Split rtl8152_resume() into rtl8152_runtime_resume() and
rtl8152_system_resume().
Besides, replace GFP_KERNEL with GFP_NOIO for usb_submit_urb().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 15:28:49 +0000 (11:28 -0400)]
tls: Depend upon INET not plain NET.
We refer to TCP et al. symbols so have to use INET as
the dependency.
ERROR: "tcp_prot" [net/tls/tls.ko] undefined!
>> ERROR: "tcp_rate_check_app_limited" [net/tls/tls.ko] undefined!
ERROR: "tcp_register_ulp" [net/tls/tls.ko] undefined!
ERROR: "tcp_unregister_ulp" [net/tls/tls.ko] undefined!
ERROR: "do_tcp_sendpages" [net/tls/tls.ko] undefined!
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jun 2017 02:53:24 +0000 (22:53 -0400)]
Merge branch 'mlx4-XDP-performance-improvements'
Tariq Toukan says:
====================
mlx4 XDP performance improvements
This patchset contains data-path improvements, mainly for XDP_DROP
and XDP_TX cases.
Main patches:
* Patch 2 by Saeed allows enabling optimized A0 RX steering (in HW) when
setting a single RX ring.
With this configuration, HW packet-rate dramatically improves,
reaching 28.1 Mpps in XDP_DROP case for both IPv4 (37% gain)
and IPv6 (53% gain).
* Patch 6 enhances the XDP xmit function. Among other changes, now we
ring one doorbell per NAPI. Patch gives 17% gain in XDP_TX case.
* Patch 7 obsoletes the NAPI of XDP_TX completion queue and integrates its
poll into the respective RX NAPI. Patch gives 15% gain in XDP_TX case.
Series generated against net-next commit:
f7aec129a356 rxrpc: Cache the congestion window setting
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:40 +0000 (14:35 +0300)]
net/mlx4_en: Refactor mlx4_en_free_tx_desc
Some code re-ordering, functionally equivalent.
- The !tx_info->inl check is evaluated anyway in both flows
(common case/end case). Run it first, this might finish
the flows earlier.
- dma_unmap calls are identical in both flows, get it out
of the if block into the common area.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:39 +0000 (14:35 +0300)]
net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
operation instead of a multiplication with TXBB_SIZE.
Operations are equivalent as TXBB_SIZE is a power of two.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:38 +0000 (14:35 +0300)]
net/mlx4_en: Increase default TX ring size
Increase the default TX ring size (from 512 to 1024) to match
the RX ring size.
This gives the XDP TX ring a better chance to keep up with the
rate of its RX ring in case of a high load of XDP_TX actions.
Tested:
Ethtool counter rx_xdp_tx_full used to increase, after applying this
patch it stopped.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:37 +0000 (14:35 +0300)]
net/mlx4_en: Poll XDP TX completion queue in RX NAPI
Instead of having their own NAPIs, XDP TX completion queues get
polled within the corresponding RX NAPI.
This prevents any possible race on TX ring prod/cons indices,
between the context that issues the transmits (RX NAPI) and the
context that handles the completions (was previously done in
a separate NAPI).
This also improves performance, as it decreases the number
of NAPIs running on a CPU, saving the overhead of syncing
and switching between the contexts.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.
XDP_TX packet rate:
-------------------------------------
| Before | After | Gain |
IPv4 | 12.0 Mpps | 13.8 Mpps | 15% |
IPv6 | 12.0 Mpps | 13.8 Mpps | 15% |
-------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:36 +0000 (14:35 +0300)]
net/mlx4_en: Improve XDP xmit function
Several performance improvements in XDP TX datapath,
including:
- Ring a single doorbell for XDP TX ring per NAPI budget,
instead of doing it per a lower threshold (was 8).
This includes removing the flow of immediate doorbell ringing
in case of a full TX ring.
- Compiler branch predictor hints.
- Calculate values in compile time rather than in runtime.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.
XDP_TX packet rate:
-------------------------------------
| Before | After | Gain |
IPv4 | 10.3 Mpps | 12.0 Mpps | 17% |
IPv6 | 10.3 Mpps | 12.0 Mpps | 17% |
-------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:35 +0000 (14:35 +0300)]
net/mlx4_en: Improve stack xmit function
Several small code and performance improvements in stack TX datapath,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- Move tx_info non-inline flow handling to a separate function.
- Calculate data_offset in compile time rather than in runtime
(for !lso_header_size branch).
- Avoid trinary-operator ("?") when value can be preset in a matching
branch.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:34 +0000 (14:35 +0300)]
net/mlx4_en: Improve transmit CQ polling
Several small performance improvements in TX CQ polling,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- More proper check of cq type.
- Use boolean instead of int for a binary indication.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet-rate tests for both regular stack and XDP use cases:
No noticeable gain, no degradation.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:33 +0000 (14:35 +0300)]
net/mlx4_en: Improve receive data-path
Several small performance improvements in RX datapath,
including:
- Compiler branch predictor hints.
- Replace a multiplication with a shift operation.
- Minimize variables scope.
- Write-prefetch for packet header.
- Avoid trinary-operator ("?") when value can be preset in a matching
branch.
- Save a branch by updating RX ring doorbell within
mlx4_en_refill_rx_buffers(), which now returns void.
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON
(enable by ethtool -L <interface> rx 1).
XDP_DROP packet rate:
Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).
Drop packets in TC:
-------------------------------------
| Before | After | Gain |
IPv4 | 4.14 Mpps | 4.18 Mpps | 1% |
-------------------------------------
XDP_TX packet rate:
-------------------------------------
| Before | After | Gain |
IPv4 | 10.1 Mpps | 10.3 Mpps | 2% |
IPv6 | 10.1 Mpps | 10.3 Mpps | 2% |
-------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 15 Jun 2017 11:35:32 +0000 (14:35 +0300)]
net/mlx4_en: Optimized single ring steering
Avoid touching RX QP RSS context when loading with only
one RX ring, to allow optimized A0 RX steering.
Enable by:
- loading mlx4_core with module param: log_num_mgm_entry_size = -6.
- then: ethtool -L <interface> rx 1
Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
XDP_DROP packet rate:
-------------------------------------
| Before | After | Gain |
IPv4 | 20.5 Mpps | 28.1 Mpps | 37% |
IPv6 | 18.4 Mpps | 28.1 Mpps | 53% |
-------------------------------------
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 15 Jun 2017 11:35:31 +0000 (14:35 +0300)]
net/mlx4_en: Remove unused argument in TX datapath function
Remove owner argument, as it is obsolete and unused.
This also saves the overhead of calculating its value in data-path.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 15 Jun 2017 19:56:21 +0000 (14:56 -0500)]
atm: solos-pci: remove useless variable assignments
Value assigned to variable _data32_ at lines 1254 and 1257 is
overwritten at line 1260 before it can be used. This makes
such variable assignments useless.
Addresses-Coverity-ID:
1227049
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 15 Jun 2017 19:06:54 +0000 (15:06 -0400)]
net: dsa: assign default CPU port to all ports
The current code only assigns the default cpu_dp to all user ports of
the switch to which the CPU port belongs. The user ports of the other
switches of the fabric thus don't have a default CPU port.
This patch fixes this by assigning the cpu_dp of all user ports of all
switches of the fabric when the tree is fully parsed.
Fixes:
a29342e73911 ("net: dsa: Associate slave network device with CPU port")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Majd Dibbiny [Thu, 9 Feb 2017 12:20:12 +0000 (14:20 +0200)]
net/mlx5: Add fast unload support in shutdown flow
Adding a support to flush all HW resources with one FW command and
skip all the heavy unload flows of the driver on kernel shutdown.
There's no need to free all the SW context since a new fresh kernel
will be loaded afterwards.
Regarding the FW resources, they should be closed, otherwise we will
have leakage in the FW. To accelerate this flow, we execute one command
in the beginning that tells the FW that the driver isn't going to close
any of the FW resources and asks the FW to clean up everything.
Once the commands complete, it's safe to close the PCI resources and
finish the routine.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Majd Dibbiny [Thu, 9 Feb 2017 11:20:46 +0000 (13:20 +0200)]
net/mlx5: Expose command polling interface
Add a new interface for commands execution that allows the
caller to wait for the command's completion in a busy-wait
loop (polling mode).
This is useful if we want to execute a command in a polling mode
while the driver is working in events mode for the rest of
the commands.
This interface will be used in the downstream patches.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Wed, 10 May 2017 12:10:33 +0000 (15:10 +0300)]
net/mlx5e: Optimize update stats work
Unlike ethtool stats, get_stats ndo provides information cached by
update stats work that is running in the background without updating
them explicitly.
We cannot update all counters inside the ndo because some
updates require firmware commands that cannot be performed under a
spinlock.
update_stats work does not need to update ALL counters, since only
some of them are needed by ndo_get_stats.
This patch will allow for a minimal run of update_stats using an extra
parameter which will update necessary counters only and cut 13
firmware commands in each iteration of the work.
Work duration previous to this patch: ~4200us.
Work duration after this patch: ~700us (17% of the original time).
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Gal Pressman [Wed, 14 Jun 2017 08:52:33 +0000 (11:52 +0300)]
net/mlx5e: Move and optimize query out of buffer function
Move "query queue counter out of buffer" helper function out of
qp.c to en_main.c, since mlx5e netdev driver is the only one to use it.
Also allocate the output buffer on the stack instead of the heap, to reduce
number of heap allocs on update_stats work.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Gal Pressman [Mon, 5 Jun 2017 13:45:45 +0000 (16:45 +0300)]
net/mlx5e: Reduce number of heap allocated buffers for update stats
Allocating buffers on the heap every 200ms is something we should avoid,
let's use buffers located on the stack instead.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Gal Pressman [Wed, 19 Apr 2017 15:10:37 +0000 (18:10 +0300)]
net/mlx5e: Rename physical symbol errors counter
Rename rx_symbol_errors_phy to rx_pcs_symbol_err_phy, in order to
prevent confusion with rx_symbol_err_phy counter.
rx_pcs_symbol_err_phy counter counts the number of symbol errors that
were detected on the PCS (regardless of traffic) and weren't
corrected by FEC correction algorithm or that FEC algorithm was
not active on this interface.
rx_symbol_err_phy refers to errors on packet level (physical error
during a packet receive).
Fixes:
5db0a4f64c04 ("net/mlx5e: Expose physical layer statistical...")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Itay Aveksis [Wed, 7 Jun 2017 14:01:51 +0000 (17:01 +0300)]
net/mlx5e: Fix typo in warning if CQ moderation is not supported
Signed-off-by: Itay Aveksis <itayav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Tariq Toukan [Sun, 21 May 2017 08:08:18 +0000 (11:08 +0300)]
net/mlx5e: Use function to map aRFS into traffic type
For a better code reuse and readability, use the existing
function arfs_get_tt() to map arfs_type into mlx5e_traffic_types,
instead of duplicating the switch-case logic.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Moni Shoua [Tue, 9 May 2017 09:20:55 +0000 (12:20 +0300)]
net/mlx5: Undo LAG upon request to create virtual functions
LAG cannot work if virtual functions are present. Therefore, if LAG is
configured, the attempt to create virtual functions will fail. This gives
precedence to LAG over SRIOV which is not the desired behavior as users
might want to use the bonding/teaming driver also want to work with SRIOV.
In that case we don't want to force an order of actions, first create
virtual functions and only than configure a bonding/teaming net device.
To fix, if LAG is configured during a request to create virtual
functions, remove it and continue.
We ignore ENODEV when trying to forbid lag. This makes sense
because "No such device" means that lag is forbidden anyway.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 May 2017 12:50:50 +0000 (15:50 +0300)]
net/mlx5: Avoid space after casting
Fix checkpatch complaints on that:
CHECK: No space is necessary after a cast
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 May 2017 12:40:43 +0000 (15:40 +0300)]
net/mlx5: Align to match opening parenthesis
Fixed checkpatch complaints of the form:
CHECK: Alignment should match open parenthesis
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 May 2017 12:35:27 +0000 (15:35 +0300)]
net/mlx5: Avoid blank lines before/after closing/opening braces
Fixed checkpatch complaints on that:
CHECK: Blank lines aren't necessary before a close brace '}'
CHECK: Blank lines aren't necessary after an open brace '{'
and one on missing blank line..
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 May 2017 12:32:35 +0000 (15:32 +0300)]
net/mlx5: Avoid using multiple blank lines
Fixed bunch of this checkpatch complaint:
CHECK: Please don't use multiple blank lines
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 28 May 2017 12:24:17 +0000 (15:24 +0300)]
net/mlx5: Fix some spelling mistakes
Fixed few places where endianness was misspelled and
one spot whwere output was:
CHECK: 'endianess' may be misspelled - perhaps 'endianness'?
CHECK: 'ouput' may be misspelled - perhaps 'output'?
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eli Cohen [Fri, 28 Oct 2016 14:52:56 +0000 (09:52 -0500)]
net/mlx5: Update eqe_type_str() event names
Add missing NIC_VPORT_CHANGE event.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
David S. Miller [Thu, 15 Jun 2017 18:31:56 +0000 (14:31 -0400)]
Merge branch 'r8152-support-new-chips'
Hayes Wang says:
====================
r8152: support new chips
These patches are used to support new chips.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 15 Jun 2017 06:44:04 +0000 (14:44 +0800)]
r8152: add byte_enable for ocp_read_word function
Add byte_enable for ocp_read_word() to replace reading 4
bytes data with reading the desired 2 bytes data.
This is used to avoid the issue which is described in
commit
b4d99def0938 ("r8152: remove sram_read"). The
original method always reads 4 bytes data, and it may
have problem when reading the PHY registers.
The new method is supported since RTL8153B, but it
doesn't influence the previous chips. The bits of the
byte_enable for the previous chips are the reserved
bits, and the hw would ignore them.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 15 Jun 2017 06:44:03 +0000 (14:44 +0800)]
r8152: support RTL8153B
This patch supports two new chips for RTL8153B.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Thu, 15 Jun 2017 06:44:02 +0000 (14:44 +0800)]
r8152: support new chip 8050
The settings of the new chip are the same with RTL8152, except that
its product ID is 0x8050.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 Jun 2017 18:29:01 +0000 (14:29 -0400)]
Merge branch 'ibmvnic-LPM-bug-fixes'
Thomas Falcon says:
====================
ibmvnic: LPM bug fixes
This series of small patches is meant to resolve a number of
bugs, mostly occurring during an ibmvnic driver reset when
recovering from a logical partition migration (LPM).
The first patch ensures that RX buffer pools are properly
activated following an adapter reset by setting the proper
flag in the pool data structure.
The second patch uses netif_tx_disable to stop TX queues when
closing the device during a reset.
Third, fixup a typo that resulted in partial sanitization of
TX/RX descriptor queues following a device reset.
Fourth, remove an ambiguous conditional check that was resulting
in a kernel panic as null RX/TX completion descriptors were being
processed during napi polling while the device is closing.
Finally, fix a condition where the napi polling routine exits
before it has completed its work budget without notifying the
upper network layers. This omission could result in the
napi_disable function sleeping indefinitely under certain conditions.
v2: Attempt to provide a proper cover letter
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Thu, 15 Jun 2017 04:50:09 +0000 (23:50 -0500)]
ibmvnic: Exit polling routine correctly during adapter reset
This patch fixes a bug where, in the case of a device reset,
the polling routine will never complete, causing napi_disable
to sleep indefinitely when attempting to close the device.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Thu, 15 Jun 2017 04:50:08 +0000 (23:50 -0500)]
ibmvnic: Remove VNIC_CLOSING check from pending_scrq
Fix a kernel panic resulting from data access of a NULL
pointer during device close. The pending_scrq routine is
meant to determine whether there is a valid sub-CRQ message
awaiting processing. When the device is closing, however,
there is a possibility that NULL messages can be processed
because pending_scrq will always return 1 even if there
no valid message in the queue.
It's not clear what this closing state check was originally
meant to accomplish, so just remove it.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Thu, 15 Jun 2017 04:50:07 +0000 (23:50 -0500)]
ibmvnic: Sanitize entire SCRQ buffer on reset
Fixup a typo so that the entire SCRQ buffer is cleaned.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Thu, 15 Jun 2017 04:50:06 +0000 (23:50 -0500)]
ibmvnic: Ensure that TX queues are disabled in __ibmvnic_close
Use netif_tx_disable to guarantee that TX queues are disabled
when __ibmvnic_close is called by the device reset routine.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Thu, 15 Jun 2017 04:50:05 +0000 (23:50 -0500)]
ibmvnic: Activate disabled RX buffer pools on reset
RX buffer pools are disabled while awaiting a device
reset if firmware indicates that the resource is closed.
This patch fixes a bug where pools were not being
subsequently enabled after the device reset, causing
the device to become inoperable.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 14 Jun 2017 22:43:37 +0000 (15:43 -0700)]
sunvnet: restrict advertized checksum offloads to just IP
As much as we'd like to play well with others, we really aren't
handling the checksums on non-IP protocol packets very well. This
is easily seen when trying to do TCP over ipv6 - the checksums are
garbage.
Here we restrict the checksum feature flag to just IP traffic so
that we aren't given work we can't yet do.
Orabug:
26175391,
26259755
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 Jun 2017 18:21:04 +0000 (14:21 -0400)]
Merge branch 'sched-act_tunnel_key-UDP-checksusm'
Jiri Benc says:
====================
net: sched: act_tunnel_key: UDP checksums
Currently, the tunnel_key tc action does not set TUNNEL_CSUM, thus
transmitting packets with zero UDP checksum. This is inconsistent with how
we treat non-lwt UDP tunnels where the default is to fill in the UDP
checksum. Non-zero UDP checksum is the better default anyway for various
reasons previously discussed.
Make this configurable for the tunnel_key tc action with the default being
non-zero checksum. Saves a lot of surprises especially with IPv6.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Wed, 14 Jun 2017 19:19:31 +0000 (21:19 +0200)]
net: sched: act_tunnel_key: make UDP checksum configurable
Allow requesting of zero UDP checksum for encapsulated packets. The name and
meaning of the attribute is "NO_CSUM" in order to have the same meaning of
the attribute missing and being 0.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Wed, 14 Jun 2017 19:19:30 +0000 (21:19 +0200)]
net: sched: act_tunnel_key: request UDP checksum by default
There's currently no way to request (outer) UDP checksum with
act_tunnel_key. This is problem especially for IPv6. Right now, tunnel_key
action with IPv6 does not work without going through hassles: both sides
have to have udp6zerocsumrx configured on the tunnel interface. This is
obviously not a good solution universally.
It makes more sense to compute the UDP checksum by default even for IPv4.
Just set the default to request the checksum when using act_tunnel_key.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 15 Jun 2017 02:58:17 +0000 (21:58 -0500)]
net: s2io: remove useless variable in fill_rx_buffers
Remove useless variable rxd_index and code related.
Addresses-Coverity-ID:
1397691
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 Jun 2017 18:07:51 +0000 (14:07 -0400)]
Merge branch 'dsa-prefix-Global-macros'
Vivien Didelot says:
====================
net: dsa: prefix Global macros
This patch series is the 2/3 step of the register definitions cleanup.
It brings no functional changes.
It prefixes and documents all Global (1) registers with MV88E6XXX_G1_
(or a specific model like MV88E6352_G1_STS_PPU_STATE), and prefers a
16-bit hexadecimal representation of the Marvell registers layout.
The next and last patchset will prefix the Global 2 registers.
====================
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 15 Jun 2017 16:14:06 +0000 (12:14 -0400)]
net: dsa: mv88e6xxx: prefix Global Prio and Tag macros
Prefix and document the remaining Global IP and IEEE Priority and Core
Tag Type registers and give them a clear 16-bit register representation.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>