David S. Miller [Sat, 26 Aug 2017 02:39:58 +0000 (19:39 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2017-08-25
This series contains updates to i40e and i40evf only.
Mitch adjusts the max packet size to account for two VLAN tags.
Sudheer provides a fix to ensure that the watchdog timer is scheduled
immediately after admin queue operations are scheduled in i40evf_down().
Fixes an issue by adding locking around the admin queue command and
update of state variables so that adminq_subtask will have the accurate
information whenever it gets scheduled.
Anjali fixes a bug where the PF flag setup should happen before the VMDq
RSS queue count is initialized for VMDq VSI to get the right number of
queues for RSS in the case of x722 devices. Fixed a problem with the
hardware ATR eviction feature where the NVM setting was incorrect.
Jake separates the flags into two types, hw_features and flags. The
hw_features flags contain a set of features which are enabled at init
time and will not contain feature flags that can be toggled. Everything
else will remain in the flags variable, and can be modified anytime
during run time. We should not be directly copying a cpumask_t, since
it is bitmap and might not be copied correctly, so use cpumask_copy()
instead.
Stefan Assmann makes vf _offload_flags more "generic" by renaming it to
vf_cap_flags, which allows other capabilities besides offloading to be
added.
Alan makes it such that if adaptive-rx/tx is enabled, the user cannot
make any manual adjustments to interrupt moderation. Also makes it so
that if ITR is disabled by adaptive-rx/tx is then enabled, ITR will be
re-enabled.
v2: Dropped patches #1 & #8 from the original patch series submission,
while Jesse and Jake re-work their patches based on feedback from
David Miller. Also removed the duplicate patch 3 that was
accidentally sent out twice in the previous submission.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 26 Aug 2017 02:24:59 +0000 (19:24 -0700)]
Merge branch 'nfp-SR-IOV-ndos-support'
Jakub Kicinski says:
====================
nfp: SR-IOV ndos support
This set adds basic SR-IOV including setting/getting VF MAC addresses,
VLANs, link state and spoofcheck settings. It is wired up for both
vNICs and representors (note: ip link will not report VF settings on
VF/PF representors because they are not linked to the PF PCI device).
Pablo and team add the basic implementation, Simon and Dirk follow
up with the representor plumbing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Fri, 25 Aug 2017 04:31:50 +0000 (21:31 -0700)]
nfp: add basic SR-IOV ndo functions to representors
Add basic ndo_set/get_vf to support SR-IOV on all types
of port representors.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Cascón [Fri, 25 Aug 2017 04:31:49 +0000 (21:31 -0700)]
nfp: add basic SR-IOV ndo functions
Add basic ndo_set/get_vf to support SR-IOV.
VF to egress phy static mapping by now.
Use vfcfg ABI version 2 to write the info to the FW and collect
the return value from the mailbox.
Signed-off-by: Pablo Cascón <pablo.cascon@netronome.com>
Signed-off-by: Jimmy Kizito <jimmy.kizito@netronome.com>
Signed-off-by: Rami Tomer <rami.tomer@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 25 Aug 2017 13:27:05 +0000 (06:27 -0700)]
tcp: fix hang in tcp_sendpage_locked()
syszkaller got a hang in tcp stack, related to a bug in
tcp_sendpage_locked()
root@syzkaller:~# cat /proc/3059/stack
[<
ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
[<
ffffffff83de9473>] lock_sock_nested+0xf3/0x110
[<
ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
[<
ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
[<
ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
[<
ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
[<
ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
[<
ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
[<
ffffffff84089203>] tcp_sendpage+0x43/0x60
[<
ffffffff841641da>] inet_sendpage+0x1aa/0x660
[<
ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
[<
ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
[<
ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
[<
ffffffff81b67243>] __splice_from_pipe+0x343/0x750
[<
ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
[<
ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
[<
ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
[<
ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
Fixes:
306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 26 Aug 2017 00:19:11 +0000 (17:19 -0700)]
Merge branch 'net_sched-clean-up-tc-classes-and-u32-filter'
Cong Wang says:
====================
net_sched: clean up tc classes and u32 filter
Patch 1 and patch 2 prepare for patch 3. Major changes
are in patch 3 and patch 4, details are there too.
v2: Add patch 1 and 2, group all into a patchset
Fix a coding style issue in patch 4
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 24 Aug 2017 23:51:30 +0000 (16:51 -0700)]
net_sched: kill u32_node pointer in Qdisc
It is ugly to hide a u32-filter-specific pointer inside Qdisc,
this breaks the TC layers:
1. Qdisc is a generic representation, should not have any specific
data of any type
2. Qdisc layer is above filter layer, should only save filters in
the list of struct tcf_proto.
This pointer is used as the head of the chain of u32 hash tables,
that is struct tc_u_hnode, because u32 filter is very special,
it allows to create multiple hash tables within one qdisc and
across multiple u32 filters.
Instead of using this ugly pointer, we can just save it in a global
hash table key'ed by (dev ifindex, qdisc handle), therefore we can
still treat it as a per qdisc basis data structure conceptually.
Of course, because of network namespaces, this key is not unique
at all, but it is fine as we already have a pointer to Qdisc in
struct tc_u_common, we can just compare the pointers when collision.
And this only affects slow paths, has no impact to fast path,
thanks to the pointer ->tp_c.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 24 Aug 2017 23:51:29 +0000 (16:51 -0700)]
net_sched: remove tc class reference counting
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:
1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.
2) For filter bindiing/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.
3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.
Therefore, this patch removes ->get() and ->put(), but:
1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.
2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.
For those who also use ->put() as ->unbind(), just rename them to reflect
this change.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 24 Aug 2017 23:51:28 +0000 (16:51 -0700)]
net_sched: introduce tclass_del_notify()
Like for TC actions, ->delete() is a special case,
we have to prepare and fill the notification before delete
otherwise would get use-after-free after we remove the
reference count.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 24 Aug 2017 23:51:27 +0000 (16:51 -0700)]
net_sched: get rid of more forward declarations
This is not needed if we move them up properly.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 25 Aug 2017 08:24:28 +0000 (11:24 +0300)]
hinic: skb_pad() frees on error
The skb_pad() function frees the skb on error, so this code has a double
free.
Fixes:
00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 26 Aug 2017 00:10:24 +0000 (17:10 -0700)]
Merge branch 'ipv6-sr-updates'
David Lebrun says:
====================
net: updates for IPv6 Segment Routing
v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect()
use cases, and L2ENCAP only uses this redirection. Fix incoherence
between arbitrary MAC header size support and fixed headroom
computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP
mode.
This patch series provides several updates for the SRv6 implementation. The
first patch leverages the existing infrastructure to support encapsulation
of IPv4 packets. The second patch implements the T.Encaps.L2 SR function,
enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet.
The last three patches update the seg6local lightweight tunnel, and mainly
implement four new actions: End.T, End.DX2, End.DX4 and End.DT6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Fri, 25 Aug 2017 07:58:17 +0000 (09:58 +0200)]
ipv6: sr: implement additional seg6local actions
This patch implements the following seg6local actions.
- SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the
next-hop looked up in the specified routing table.
- SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to
the specified network interface.
- SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it,
possibly to the specified next-hop.
- SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it
to the next-hop looked up in the specified routing table.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Fri, 25 Aug 2017 07:56:47 +0000 (09:56 +0200)]
ipv6: sr: add helper functions for seg6local
This patch adds three helper functions to be used with the seg6local packet
processing actions.
The decap_and_validate() function will be used by the End.D* actions, that
decapsulate an SR-enabled packet.
The advance_nextseg() function applies the fundamental operations to update
an SRH for the next segment.
The lookup_nexthop() function helps select the next-hop for the processed
SR packets. It supports an optional next-hop address to route the packet
specifically through it, and an optional routing table to use.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Fri, 25 Aug 2017 07:56:46 +0000 (09:56 +0200)]
ipv6: sr: enforce IPv6 packets for seg6local lwt
This patch ensures that the seg6local lightweight tunnel is used solely
with IPv6 routes and processes only IPv6 packets.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Fri, 25 Aug 2017 07:56:45 +0000 (09:56 +0200)]
ipv6: sr: add support for encapsulation of L2 frames
This patch implements the L2 frame encapsulation mechanism, referred to
as T.Encaps.L2 in the SRv6 specifications [1].
A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only
accepts packets with an existing MAC header (i.e., it will not work for
locally generated packets). The resulting packet looks like IPv6 -> SRH ->
Ethernet -> original L3 payload. The next header field of the SRH is set to
NEXTHDR_NONE.
[1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Fri, 25 Aug 2017 07:56:44 +0000 (09:56 +0200)]
ipv6: sr: add support for ip4ip6 encapsulation
This patch enables the SRv6 encapsulation mode to carry an IPv4 payload.
All the infrastructure was already present, I just had to add a parameter
to seg6_do_srh_encap() to specify the inner packet protocol, and perform
some additional checks.
Usage example:
ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudheer Mogilappagari [Wed, 12 Jul 2017 09:46:07 +0000 (05:46 -0400)]
i40e: synchronize nvmupdate command and adminq subtask
During NVM update, state machine gets into unrecoverable state because
i40e_clean_adminq_subtask can get scheduled after the admin queue
command but before other state variables are updated. This causes
incorrect input to i40e_nvmupd_check_wait_event and state transitions
don't happen.
This issue existed before but surfaced after commit
373149fc99a0
("i40e: Decrease the scope of rtnl lock")
This fix adds locking around admin queue command and update of
state variables so that adminq_subtask will have accurate information
whenever it gets scheduled.
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alan Brady [Wed, 12 Jul 2017 09:46:06 +0000 (05:46 -0400)]
i40e: prevent changing ITR if adaptive-rx/tx enabled
Currently the driver allows the user to change (or even disable)
interrupt moderation if adaptive-rx/tx is enabled when this should
not be the case.
Adaptive RX/TX will not respect the user's ITR settings so
allowing the user to change it is weird. This bug would also
allow the user to disable interrupt moderation with adaptive-rx/tx
enabled which doesn't make much sense either.
This patch makes it such that if adaptive-rx/tx is enabled, the user
cannot make any manual adjustments to interrupt moderation. It also
makes it so that if ITR is disabled but adaptive-rx/tx is then
enabled, ITR will be re-enabled.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 12 Jul 2017 09:46:05 +0000 (05:46 -0400)]
i40e: use cpumask_copy instead of direct assignment
According to the header file cpumask.h, we shouldn't be directly copying
a cpumask_t, since its a bitmap and might not be copied correctly. Lets
use the provided cpumask_copy() function instead.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alan Brady [Wed, 12 Jul 2017 09:46:04 +0000 (05:46 -0400)]
i40evf: use netdev variable in reset task
If we're going to bother initializing a variable to reference it we might
as well use it.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Stefan Assmann [Thu, 29 Jun 2017 13:12:24 +0000 (15:12 +0200)]
i40e/i40evf: rename vf_offload_flags to vf_cap_flags in struct virtchnl_vf_resource
The current name of vf_offload_flags indicates that the bitmap is
limited to offload related features. Make this more generic by renaming
it to vf_cap_flags, which allows for other capabilities besides
offloading to be added.
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 23 Jun 2017 08:24:51 +0000 (04:24 -0400)]
i40e: move check for avoiding VID=0 filters into i40e_vsi_add_vlan
In i40e_vsi_add_vlan we treat attempting to add VID=0 as an error,
because it does not do what the caller might expect. We already special
case VID=0 in i40e_vlan_rx_add_vid so that we avoid this error when
adding the VLAN.
This special casing is necessary so that we do not add the VLAN=0 filter
since we don't want to stop receiving untagged traffic. Unfortunately,
not all callers of i40e_vsi_add_vlan are aware of this, including when
we add VLANs from a VF device.
Rather than special casing every single caller of i40e_vsi_add_vlan,
lets just move this check internally. This makes the code simpler
because the caller does not need to be aware of how VLAN=0 is special,
and we don't forget to add this check in new places.
This fixes a harmless error message displaying when adding a VLAN from
within a VF. The message was meaningless but there is no reason to
confuse end users and system administrators, and this is now avoided.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 23 Jun 2017 08:24:50 +0000 (04:24 -0400)]
i40e/i40evf: use cmpxchg64 when updating private flags in ethtool
When a user gives an invalid command to change a private flag which is
not supported, either because it is read-only, or the device is not
capable of the feature, we simply ignore the request.
A naive solution would simply be to report error codes when one of the
flags was not supported. However, this causes problems because it makes
the operation not atomic. If a user requests multiple private flags
together at once we could end up changing one before failing at the
second flag.
We can do a bit better if we instead update a temporary copy of the
flags variable in the loop, and then copy it into place after. If we
aren't careful this has the pitfall of potentially silently overwriting
any changes caused by other threads.
Avoid this by using cmpxchg64 which will compare and swap the flags
variable only if it currently matched the old value. We'll report
-EAGAIN in the (hopefully rare!) case where the cmpxchg64 fails.
This ensures that we can properly report when flags are not supported in
an atomic fashion without the risk of overwriting other threads changes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Fri, 23 Jun 2017 08:24:48 +0000 (04:24 -0400)]
i40e: Detect ATR HW Evict NVM issue and disable the feature
This patch fixes a problem with the HW ATR eviction feature where the
NVM setting was incorrect. This patch detects the issue on X720
adapters and disables the feature if the NVM setting is incorrect.
Without this patch, HW ATR Evict feature does not work on broken NVMs
and is not detected either. If the HW ATR Evict feature is disabled
the SW Eviction feature will take effect.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 23 Jun 2017 08:24:47 +0000 (04:24 -0400)]
i40e: remove workaround for Open Firmware MAC address
Since commit
b499ffb0a22c ("i40e: Look up MAC address in Open Firmware
or IDPROM"), we've had support for obtaining the MAC address
form Open Firmware or IDPROM.
This code relied on sending the Open Firmware address directly to the
device firmware instead of relying on our MAC/VLAN filter list. Thus,
a work around was introduced in commit
b1b15df59232 ("i40e: Explicitly
write platform-specific mac address after PF reset")
We refactored the Open Firmware address enablement code in the ill-named
commit
41c4c2b50d52 ("i40e: allow look-up of MAC address from Open
Firmware or IDPROM")
Since this refactor, we no longer even set I40E_FLAG_PF_MAC. Further, we
don't need this work around, because we actually store the MAC address
as part of the MAC/VLAN filter hash. Thus, we will restore the address
correctly upon reset.
The refactor above failed to revert the workaround, so do that now.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 23 Jun 2017 08:24:46 +0000 (04:24 -0400)]
i40e: separate hw_features from runtime changing flags
The number of flags found in pf->flags has grown quite large, and there
are a lot of different types of flags. Most of the flags are simply
hardware features which are enabled on some firmware or some MAC types.
Other flags are dynamic run-time flags which enable or disable certain
features of the driver.
Separate these two types of flags into pf->hw_features and pf->flags.
The hw_features list will contain a set of features which are enabled at
init time. This will not contain toggles or otherwise dynamically
changing features. These flags should not need atomic protections, as
they will be set once during init and then be essentially read only.
Everything else will remain in the flags variable. These flags may be
modified at any time during run time. A future patch may wish to convert
these flags into set_bit/clear_bit/test_bit or similar approach to
ensure atomic correctness.
The I40E_FLAG_MFP_ENABLED flag may be a good fit for hw_features but
currently is used by ethtool in the private flags settings, and thus has
been left as part of flags.
Additionally, I40E_FLAG_DCB_CAPABLE may be a good fit for the
hw_features but this patch has not tried to untangle it yet.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Fri, 23 Jun 2017 08:24:45 +0000 (04:24 -0400)]
i40e: Fix a bug with VMDq RSS queue allocation
The X722 pf flag setup should happen before the VMDq RSS queue count is
initialized for VMDq VSI to get the right number of queues for RSS in
case of X722 devices.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sudheer Mogilappagari [Fri, 23 Jun 2017 08:24:44 +0000 (04:24 -0400)]
i40evf: prevent VF close returning before state transitions to DOWN
Currently i40evf_close() can return before state transitions to
__I40EVF_DOWN because of the latency involved in processing and
receiving response from PF driver and scheduling of VF watchdog_task.
Due to this inconsistency an immediate call to i40evf_open() fails
because state is still DOWN_PENDING.
When a VF interface is in up state and we try to add it as slave,
The bonding driver calls dev_close() and dev_open() in short duration
resulting in dev_open returning error. The ifenslave command needs
to be run again for dev_open to succeed.
This fix ensures that watchdog timer is scheduled immediately after
admin queue operations are scheduled in i40evf_down(). In addition a
wait condition is added at the end of i40evf_close so that function
wont return when state is still DOWN_PENDING. The timeout value is
chosen after some profiling and includes some buffer.
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Fri, 23 Jun 2017 08:24:43 +0000 (04:24 -0400)]
i40e/i40evf: adjust packet size to account for double VLANs
Now that the kernel supports double VLAN tags, we should at least play
nice. Adjust the max packet size to account for two VLAN tags, not just
one.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Eric Biggers [Thu, 24 Aug 2017 21:38:51 +0000 (14:38 -0700)]
strparser: initialize all callbacks
commit
bbb03029a899 ("strparser: Generalize strparser") added more
function pointers to 'struct strp_callbacks'; however, kcm_attach() was
not updated to initialize them. This could cause the ->lock() and/or
->unlock() function pointers to be set to garbage values, causing a
crash in strp_work().
Fix the bug by moving the callback structs into static memory, so
unspecified members are zeroed. Also constify them while we're at it.
This bug was found by syzkaller, which encountered the following splat:
IP: 0x55
PGD
3b1ca067
P4D
3b1ca067
PUD
3b12f067
PMD 0
Oops: 0010 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-
20170811 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: kstrp strp_work
task:
ffff88006bb0e480 task.stack:
ffff88006bb10000
RIP: 0010:0x55
RSP: 0018:
ffff88006bb17540 EFLAGS:
00010246
RAX:
dffffc0000000000 RBX:
ffff88006ce4bd60 RCX:
0000000000000000
RDX:
1ffff1000d9c97bd RSI:
0000000000000000 RDI:
ffff88006ce4bc48
RBP:
ffff88006bb17558 R08:
ffffffff81467ab2 R09:
0000000000000000
R10:
ffff88006bb17438 R11:
ffff88006bb17940 R12:
ffff88006ce4bc48
R13:
ffff88003c683018 R14:
ffff88006bb17980 R15:
ffff88003c683000
FS:
0000000000000000(0000) GS:
ffff88006de00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000055 CR3:
000000003c145000 CR4:
00000000000006e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
worker_thread+0x223/0x1860 kernel/workqueue.c:2233
kthread+0x35e/0x430 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
Code: Bad RIP value.
RIP: 0x55 RSP:
ffff88006bb17540
CR2:
0000000000000055
---[ end trace
f0e4920047069cee ]---
Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
CONFIG_AF_KCM=y):
#include <linux/bpf.h>
#include <linux/kcm.h>
#include <linux/types.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>
static const struct bpf_insn bpf_insns[3] = {
{ .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
{ .code = 0x95 }, /* BPF_EXIT_INSN() */
};
static const union bpf_attr bpf_attr = {
.prog_type = 1,
.insn_cnt = 2,
.insns = (uintptr_t)&bpf_insns,
.license = (uintptr_t)"",
};
int main(void)
{
int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
&bpf_attr, sizeof(bpf_attr));
int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
ioctl(kcm_fd, SIOCKCMATTACH,
&(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
}
Fixes:
bbb03029a899 ("strparser: Generalize strparser")
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Thu, 24 Aug 2017 18:50:02 +0000 (11:50 -0700)]
hv_netvsc: Fix rndis_filter_close error during netvsc_remove
We now remove rndis filter before unregister_netdev(), which calls
device close. It involves closing rndis filter already removed.
This patch fixes this error.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 Aug 2017 04:49:56 +0000 (21:49 -0700)]
Merge tag 'mlx5-updates-2017-08-24' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-08-24
This series includes updates to mlx5 core driver.
From Gal and Saeed, three cleanup patches.
From Matan, Low level flow steering improvements and optimizations,
- Use more efficient data structures for flow steering objects handling.
- Add tracepoints to flow steering operations.
- Overall these patches improve flow steering rule insertion rate by a
factor of seven in large scales (~50K rules or more).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Thu, 24 Aug 2017 10:47:39 +0000 (13:47 +0300)]
hinic: uninitialized variable in hinic_api_cmd_init()
We never set the error code in this function.
Fixes:
eabf0fad81d5 ("net-next/hinic: Initialize api cmd resources")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 25 Aug 2017 03:55:40 +0000 (20:55 -0700)]
net: mv643xx_eth: Be drop monitor friendly
txq_reclaim() does the normal transmit queue reclamation and
rxq_deinit() does the RX ring cleanup, none of these are packet drops,
so use dev_consume_skb() for both locations.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 25 Aug 2017 00:47:11 +0000 (17:47 -0700)]
tg3: Be drop monitor friendly
tg3_tx() does the normal packet TX completion,
tigon3_dma_hwbug_workaround() and tg3_tso_bug() both need to allocate a
new SKB that is suitable to workaround HW bugs, and finally
tg3_free_rings() is doing ring cleanup. Use dev_consume_skb_any() for
these 3 locations to be SKB drop monitor friendly.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 Aug 2017 01:21:17 +0000 (18:21 -0700)]
Merge branch 'ipv6-Route-ICMPv6-errors-with-the-flow-when-ECMP-in-use'
Jakub Sitnicki says:
====================
ipv6: Route ICMPv6 errors with the flow when ECMP in use
This patch set is another take at making Path MTU Discovery work when
server nodes are behind a router employing multipath routing in a
load-balance or anycast setup (that is, when not every end-node can be
reached by every path). The problem has been well described in RFC 7690
[1], but in short - in such setups ICMPv6 PTB errors are not guaranteed
to be routed back to the server node that sent a reply that exceeds path
MTU.
The proposed solution is two-fold:
(1) on the server side - reflect the Flow Label [2]. This can be done
without modifying the application using a new per-netns sysctl knob
that has been proposed independently of this patchset in the patch
entitled "ipv6: Add sysctl for per namespace flow label
reflection" [3].
(2) on the ECMP router - make the ipv6 routing subsystem look into the
ICMPv6 error packets and compute the flow-hash from its payload,
i.e. the offending packet that triggered the error. This is the
same behavior as ipv4 stack has already.
With both parts in place Path MTU Discovery can work past the ECMP
router when using IPv6.
[1] https://tools.ietf.org/html/rfc7690
[2] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
[3] http://patchwork.ozlabs.org/patch/804870/
v1 -> v2:
- don't use "extern" in external function declaration in header file
- style change, put as many arguments as possible on the first line of
a function call, and align consecutive lines to the first argument
- expand the cover letter based on the feedback
v2 -> v3:
- switch to computing flow-hash using flow dissector to align with
recent changes to multipath routing in ipv4 stack
- add a sysctl knob for enabling flow label reflection per netns
---
Testing has covered multipath routing of ICMPv6 PTB errors in forward
and local output path in a simple use-case of an HTTP server sending a
reply which is over the path MTU size [3]. I have also checked if the
flows get evenly spread over multiple paths (i.e. if there are no
regressions) [4].
[3] https://github.com/jsitnicki/tools/tree/master/net/tests/ecmp/pmtud
[4] https://github.com/jsitnicki/tools/tree/master/net/tests/ecmp/load-balance
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 23 Aug 2017 07:58:31 +0000 (09:58 +0200)]
ipv6: Use multipath hash from flow info if available
Allow our callers to influence the choice of ECMP link by honoring the
hash passed together with the flow info. This allows for special
treatment of ICMP errors which we would like to route over the same path
as the IPv6 datagram that triggered the error.
Also go through rt6_multipath_hash(), in the usual case when we aren't
dealing with an ICMP error, so that there is one central place where
multipath hash is computed.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 23 Aug 2017 07:58:30 +0000 (09:58 +0200)]
ipv6: Fold rt6_info_hash_nhsfn() into its only caller
Commit
644d0e656958 ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has
turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes
sense to keep it around. Also remove the accompanying comment that has
become outdated.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 23 Aug 2017 07:58:29 +0000 (09:58 +0200)]
ipv6: Compute multipath hash for ICMP errors from offending packet
When forwarding or sending out an ICMPv6 error, look at the embedded
packet that triggered the error and compute a flow hash over its
headers.
This let's us route the ICMP error together with the flow it belongs to
when multipath (ECMP) routing is in use, which in turn makes Path MTU
Discovery work in ECMP load-balanced or anycast setups (RFC 7690).
Granted, end-hosts behind the ECMP router (aka servers) need to reflect
the IPv6 Flow Label for PMTUD to work.
The code is organized to be in parallel with ipv4 stack:
ip_multipath_l3_keys -> ip6_multipath_l3_keys
fib_multipath_hash -> rt6_multipath_hash
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 23 Aug 2017 07:58:28 +0000 (09:58 +0200)]
net: Extend struct flowi6 with multipath hash
Allow for functions that fill out the IPv6 flow info to also pass a hash
computed over the skb contents. The hash value will drive the multipath
routing decisions.
This is intended for special treatment of ICMPv6 errors, where we would
like to make a routing decision based on the flow identifying the
offending IPv6 datagram that triggered the error, rather than the flow
of the ICMP error itself.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 Aug 2017 01:10:46 +0000 (18:10 -0700)]
devlink: Fix devlink_dpipe_table_register() stub signature.
One too many arguments compared to the non-stub version.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes:
ffd3cdccf214 ("devlink: Add support for dynamic table size")
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 23 Aug 2017 07:55:41 +0000 (09:55 +0200)]
ipv6: Add sysctl for per namespace flow label reflection
Reflecting IPv6 Flow Label at server nodes is useful in environments
that employ multipath routing to load balance the requests. As "IPv6
Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error
messages generated in response to a downstream packets from the server
can be routed by a load balancer back to the original server without
looking at transport headers, if the server applies the flow label
reflection. This enables the Path MTU Discovery past the ECMP router in
load-balance or anycast environments where each server node is reachable
by only one path.
Introduce a sysctl to enable flow label reflection per net namespace for
all newly created sockets. Same could be earlier achieved only per
socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
socket option.
[1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Wed, 23 Aug 2017 12:52:01 +0000 (18:22 +0530)]
net/mlx5e: make mlx5e_profile const
Make this const as it is only passed as an argument to the function
mlx5e_create_netdev and the corresponding argument is of type const.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Wed, 23 Aug 2017 12:47:39 +0000 (18:17 +0530)]
net/mlx4_core: make mlx4_profile const
Make these const as they are only used in a copy operation.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 18:59:37 +0000 (11:59 -0700)]
Merge branch 'xdp-more-work-on-xdp-tracepoints'
Jesper Dangaard Brouer says:
====================
xdp: more work on xdp tracepoints
More work on streamlining and performance optimizing the tracepoints
for XDP.
I've created a simple xdp_monitor application that uses this
tracepoint, and prints statistics. Available at github:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_monitor_kern.c
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_monitor_user.c
The improvement over tracepoint with strcpy:
9810372 -
8428762 = +
1381610 pps faster
- (1/
9810372 - 1/
8428762)*10^9 = -16.7 nanosec
- 100-(
8428762/
9810372*100) = strcpy-trace is 14.08% slower
- 981037/
8428762*100 = removing strcpy made it 11.64% faster
V3: Fix merge conflict with commit
e4a8e817d3cb ("bpf: misc xdp redirect cleanups")
V2: Change trace_xdp_redirect() to align with args of trace_xdp_exception()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 24 Aug 2017 10:33:23 +0000 (12:33 +0200)]
xdp: get tracepoints xdp_exception and xdp_redirect in sync
Remove the net_device string name from the xdp_exception tracepoint,
like the xdp_redirect tracepoint.
Align the TP_STRUCT to have common entries between these two
tracepoint.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 24 Aug 2017 10:33:18 +0000 (12:33 +0200)]
xdp: remove net_device names from xdp_redirect tracepoint
There is too much overhead in the current trace_xdp_redirect
tracepoint as it does strcpy and strlen on the net_device names.
Besides, exposing the ifindex/index is actually the information that
is needed in the tracepoint to diagnose issues. When a lookup fails
(either ifindex or devmap index) then there is a need for saying which
to_index that have issues.
V2: Adjust args to be aligned with trace_xdp_exception.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 24 Aug 2017 10:33:13 +0000 (12:33 +0200)]
ixgbe: use return codes from ndo_xdp_xmit that are distinguishable
For XDP_REDIRECT the use of return code -EINVAL is confusing, as it is
used in three different cases. (1) When the index or ifindex lookup
fails, and in the ixgbe driver (2) when link is down and (3) when XDP
have not been enabled.
The return code can be picked up by the tracepoint xdp:xdp_redirect
for diagnosing why XDP_REDIRECT isn't working. Thus, there is a need
different return codes to tell the issues apart.
I'm considering using a specific err-code scheme for XDP_REDIRECT
instead of using these errno codes.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 24 Aug 2017 10:33:08 +0000 (12:33 +0200)]
xdp: make generic xdp redirect use tracepoint trace_xdp_redirect
If the xdp_do_generic_redirect() call fails, it trigger the
trace_xdp_exception tracepoint. It seems better to use the same
tracepoint trace_xdp_redirect, as the native xdp_do_redirect{,_map} does.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 24 Aug 2017 10:33:03 +0000 (12:33 +0200)]
xdp: remove bpf_warn_invalid_xdp_redirect
Given there is a tracepoint that can track the error code
of xdp_do_redirect calls, the WARN_ONCE in bpf_warn_invalid_xdp_redirect
doesn't seem relevant any longer. Simply remove the function.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 16:33:17 +0000 (09:33 -0700)]
Merge branch 'mlxsw-ipv4-host-dpipe-table'
Jiri Pirko says:
====================
mlxsw: Add IPv4 host dpipe table
Arkadi says:
This patchset adds IPv4 host dpipe table support. This will provide the
ability to observe the hardware offloaded IPv4 neighbors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:10 +0000 (08:40 +0200)]
mlxsw: spectrum_dpipe: Add support for controlling neighbor counters
Add support for controlling neighbor counters via dpipe.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:09 +0000 (08:40 +0200)]
mlxsw: spectrum_dpipe: Add support for IPv4 host table dump
Add support for IPv4 host table dump.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:08 +0000 (08:40 +0200)]
mlxsw: spectrum_router: Add support for setting counters on neighbors
Add support for setting counters on neighbors based on dpipe's host table
counter status. This patch also adds the ability for getting the counter
value, which will be used by the dpipe host table implementation in the
next patches.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:07 +0000 (08:40 +0200)]
mlxsw: reg: Make flow counter set type enum to be shared
This is done as a preparation before introducing support for neighbor
counters. The flow counter's type enum is used by many registers, yet,
until now it was used only by mgpc and thus it was private. This patch
updates the namespace for more generic usage.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:06 +0000 (08:40 +0200)]
mlxsw: spectrum_dpipe: Add IPv4 host table initial support
Add IPv4 host table initial support.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:05 +0000 (08:40 +0200)]
mlxsw: spectrum_dpipe: Fix label name
Change label name for case of erif table init failure.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:04 +0000 (08:40 +0200)]
mlxsw: spectrum_router: Add helpers for neighbor access
This is done as a preparation before introducing the ability to dump the
host table via dpipe, and to count the table size. The mlxsw's neighbor
representative struct stays private to the router module.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:03 +0000 (08:40 +0200)]
devlink: Move dpipe entry clear function into devlink
The entry clear routine can be shared between the drivers, thus it is
moved inside devlink.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:02 +0000 (08:40 +0200)]
devlink: Add support for dynamic table size
Up until now the dpipe table's size was static and known at registration
time. The host table does not have constant size and it is resized in
dynamic manner. In order to support this behavior the size is changed
to be obtained dynamically via an op.
This patch also adjust the current dpipe table for the new API.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:01 +0000 (08:40 +0200)]
mlxsw: spectrum_dpipe: Fix erif table op name space
Fix ERIF's table operations name space.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:40:00 +0000 (08:40 +0200)]
devlink: Add IPv4 header for dpipe
This will be used by the IPv4 host table which will be introduced in the
following patches. This header is global and can be reused by many
drivers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arkadi Sharshevsky [Thu, 24 Aug 2017 06:39:59 +0000 (08:39 +0200)]
devlink: Add Ethernet header for dpipe
This will be used by the IPv4 host table which will be introduced in the
following patches. This header is global and can be reused by many
drivers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matan Barak [Sun, 28 May 2017 07:32:09 +0000 (10:32 +0300)]
net/mlx5: Add tracepoints
Add a tracepoint infrastructure for mlx5_core driver.
Implemented flow steering tracepoints:
1. Add flow group
2. Remove flow group
3. Add flow table entry
4. Remove flow table entry
5. Add flow table rule
6. Remove flow table rule
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Sun, 28 May 2017 13:45:31 +0000 (16:45 +0300)]
net/mlx5: Add hash table for flow groups in flow table
When adding a flow table entry (fte) to a flow table (ft), we first
need to find its flow group (fg). Currently, this is done by
traversing a linear list of all flow groups in the flow table.
Furthermore, since multiple flow groups which correspond to the same
fte mask may exist in the same ft, we can't just stop at the first
match. Converting the linear list to rhltable in order to speed things
up.
The last four patches increases the steering rules update rate by a
factor of more than 7 (for insertion of 50K steering rules).
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Sun, 28 May 2017 09:09:06 +0000 (12:09 +0300)]
net/mlx5: Add hash table to search FTEs in a flow-group
When adding a flow table entry (fte) to a flow group (fg), we first
need to check whether this fte exist. In such a case we just merge
the destinations (if possible). Currently, this is done by traversing
the fte list available in a fg. This could take a lot of time when
using large flow groups. Speeding this up by using rhashtable, which
is much faster.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Mon, 7 Aug 2017 08:14:11 +0000 (11:14 +0300)]
net/mlx5: Don't store reserved part in FTEs and FGs
The current code stores fte_match_param in the software representation
of FTEs and FGs. fte_match_param contains a large reserved area at the
bottom of the struct. Since downstream patches are going to hash this
part, we would like to avoid doing so on a reserved part.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Sun, 28 May 2017 07:58:40 +0000 (10:58 +0300)]
net/mlx5: Convert linear search for free index to ida
When allocating a flow table entry, we need to allocate a free index
in the flow group. Currently, this is done by traversing the existing
flow table entries in the flow group, until a free index is found.
Replacing this by using a ida, which allows us to find a free index
much faster.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Tue, 22 Aug 2017 11:22:15 +0000 (14:22 +0300)]
net/mlx5e: Fix wrong code indentation in conditional statement
Fix the following checkpatch warning in en_ethtool.c:
WARNING: suspect code indent for conditional statements (8, 9)
+ for (i = 0; i < NUM_PCIE_PERF_STALL_COUNTERS(priv); i++)
+ strcpy(data + (idx++) * ETH_GSTRING_LEN,
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Mon, 21 Aug 2017 14:54:21 +0000 (17:54 +0300)]
net/mlx5: Remove a leftover unused variable
mlx5_core_wq is no longer being used and should be removed
from the code.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Thu, 24 Aug 2017 12:54:11 +0000 (15:54 +0300)]
net/mlx5: Add a blank line after declarations V2
The blank line should be after u32 val = ...
and not after __be32 __iomem *addr = ...
Fixes:
ad5b39a95c83 ("net/mlx5: Add a blank line after declarations")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Joe Perches <joe@perches.com>
Daniel Borkmann [Thu, 24 Aug 2017 01:20:11 +0000 (03:20 +0200)]
bpf: netdev is never null in __dev_map_flush
No need to test for it in fast-path, every dev in bpf_dtab_netdev
is guaranteed to be non-NULL, otherwise dev_map_update_elem() will
fail in the first place.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shubham Bansal [Wed, 23 Aug 2017 15:59:10 +0000 (21:29 +0530)]
bpf, doc: Add arm32 as arch supporting eBPF JIT
As eBPF JIT support for arm32 was added recently with
commit
39c13c204bb1150d401e27d41a9d8b332be47c49, it seems appropriate to
add arm32 as arch with support for eBPF JIT in bpf and sysctl docs as well.
Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 05:38:08 +0000 (22:38 -0700)]
Merge branch 'bpf-verifier-fixes'
Edward Cree says:
====================
bpf: verifier fixes
Fix a couple of bugs introduced in my recent verifier patches.
Patch #2 does slightly increase the insn count on bpf_lxc.o, but only by
about a hundred insns (i.e. 0.2%).
v2: added test for write-marks bug (patch #1); reworded comment on
propagate_liveness() for clarity.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 23 Aug 2017 14:11:21 +0000 (15:11 +0100)]
bpf/verifier: document liveness analysis
The liveness tracking algorithm is quite subtle; add comments to explain it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 23 Aug 2017 14:10:50 +0000 (15:10 +0100)]
bpf/verifier: remove varlen_map_value_access flag
The optimisation it does is broken when the 'new' register value has a
variable offset and the 'old' was constant. I broke it with my pointer
types unification (see Fixes tag below), before which the 'new' value
would have type PTR_TO_MAP_VALUE_ADJ and would thus not compare equal;
other changes in that patch mean that its original behaviour (ignore
min/max values) cannot be restored.
Tests on a sample set of cilium programs show no change in count of
processed instructions.
Fixes:
f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Wed, 23 Aug 2017 14:10:26 +0000 (15:10 +0100)]
selftests/bpf: add a test for a pruning bug in the verifier
The test makes a read through a map value pointer, then considers pruning
a branch where the register holds an adjusted map value pointer. It
should not prune, but currently it does.
Signed-off-by: Alexei Starovoitov <ast@fb.com>
[ecree@solarflare.com: added test-name and patch description]
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 23 Aug 2017 14:10:03 +0000 (15:10 +0100)]
bpf/verifier: when pruning a branch, ignore its write marks
The fact that writes occurred in reaching the continuation state does
not screen off its reads from us, because we're not really its parent.
So detect 'not really the parent' in do_propagate_liveness, and ignore
write marks in that case.
Fixes:
dc503a8ad984 ("bpf/verifier: track liveness for pruning")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 23 Aug 2017 14:09:46 +0000 (15:09 +0100)]
selftests/bpf: add a test for a bug in liveness-based pruning
Writes in straight-line code should not prevent reads from propagating
along jumps. With current verifier code, the jump from 3 to 5 does not
add a read mark on 3:R0 (because 5:R0 has a write mark), meaning that
the jump from 1 to 3 gets pruned as safe even though R0 is NOT_INIT.
Verifier output:
0: (61) r2 = *(u32 *)(r1 +0)
1: (35) if r2 >= 0x0 goto pc+1
R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umax_value=
4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
2: (b7) r0 = 0
3: (35) if r2 >= 0x0 goto pc+1
R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umax_value=
4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
4: (b7) r0 = 0
5: (95) exit
from 3 to 5: safe
from 1 to 3: safe
processed 8 insns, stack depth 0
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Wed, 23 Aug 2017 11:59:48 +0000 (12:59 +0100)]
gre: remove duplicated assignment of iph
iph is being assigned the same value twice; remove the redundant
first assignment. (Thanks to Nikolay Aleksandrov for pointing out
that the first asssignment should be removed and not the second)
Fixes warning:
net/ipv4/ip_gre.c:265:2: warning: Value stored to 'iph' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arvind Yadav [Wed, 23 Aug 2017 10:52:20 +0000 (16:22 +0530)]
net: tipc: constify genl_ops
genl_ops are not supposed to change at runtime. All functions
working with genl_ops provided by <net/genetlink.h> work with
const genl_ops. So mark the non-const structs as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Wed, 23 Aug 2017 09:59:40 +0000 (10:59 +0100)]
net: hinic: make functions set_ctrl0 and set_ctrl1 static
The functions set_ctrl0 and set_ctrl1 are local to the source and do
not need to be in global scope, so make them static.
Cleans up sparse warnings:
symbol 'set_ctrl0' was not declared. Should it be static?
symbol 'set_ctrl1' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 23 Aug 2017 09:57:51 +0000 (11:57 +0200)]
net/sock: allow the user to set negative peek offset
This is necessary to allow the user to disable peeking with
offset once it's enabled.
Unix sockets already allow the above, with this patch we
permit it for udp[6] sockets, too.
Fixes:
627d2d6b5500 ("udp: enable MSG_PEEK at non-zero offset")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 03:44:32 +0000 (20:44 -0700)]
Merge branch 'mlxsw-multichain-tc-offload'
Jiri Pirko says:
====================
mlxsw: spectrum: Introduce multichain TC offload
This patchset introduces offloading of rules added to chain with
non-zero index, which was previously forbidden. Also, goto_chain
termination action is offloaded allowing to jump to processing
of desired chain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 23 Aug 2017 08:08:22 +0000 (10:08 +0200)]
mlxsw: spectrum_flower: Offload goto_chain termination action
If action is gact goto_chain, offload it to HW by jumping to another
ruleset.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 23 Aug 2017 08:08:21 +0000 (10:08 +0200)]
mlxsw: spectrum_acl: Provide helper to lookup ruleset
We need to lookup ruleset in order to offload goto_chain termination
action. This patch adds it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 23 Aug 2017 08:08:20 +0000 (10:08 +0200)]
mlxsw: spectrum_acl: Allow to get group_id value for a ruleset
For goto_chain action we need to know group_id of a ruleset to jump to.
Provide infrastructure in order to get it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 23 Aug 2017 08:08:19 +0000 (10:08 +0200)]
net: sched: add couple of goto_chain helpers
Add helpers to find out if a gact instance is goto_chain termination
action and to get chain index.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 23 Aug 2017 08:08:18 +0000 (10:08 +0200)]
mlxsw: spectrum: Offload multichain TC rules
Reflect chain index coming down from TC core and create a ruleset per
chain. Note that only chain 0, being the implicit chain, is bound to the
device for processing. The rest of chains have to be "jumped-to" by
actions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 03:42:10 +0000 (20:42 -0700)]
Merge branch 'mvpp2-software-TSO-support'
Antoine Tenart says:
====================
net: mvpp2: software TSO support
This series adds the s/w TSO support in the PPv2 driver, in addition to
two cosmetic commits. As stated in patch 3/3:
Using iperf and 10G ports, using TSO shows a significant performance
improvement by a factor 2 to reach around 9.5Gbps in TX; as well as a
significant CPU usage drop (from 25% to 15%).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Wed, 23 Aug 2017 07:46:56 +0000 (09:46 +0200)]
net: mvpp2: software tso support
The patch uses the tso API to implement the tso functionality in Marvell
PPv2 driver.
Using iperf and 10G ports, using TSO shows a significant performance
improvement by a factor 2 to reach around 9.5Gbps in TX; as well as a
significant CPU usage drop (from 25% to 15%).
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Wed, 23 Aug 2017 07:46:55 +0000 (09:46 +0200)]
net: mvpp2: unify the txq size define use
The txq size is defined by MVPP2_AGGR_TXQ_SIZE, which is sometime not
used directly but through variables. As it is a fixed value use the
define everywhere in the driver.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Wed, 23 Aug 2017 07:46:54 +0000 (09:46 +0200)]
net: define the TSO header size in net/tso.h
The TSO header size was defined in many drivers. Factorize the code and
define its size in net/tso.h.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 23 Aug 2017 02:07:26 +0000 (10:07 +0800)]
ipv4: do metrics match when looking up and deleting a route
Now when ipv4 route inserts a fib_info, it memcmp fib_metrics.
It means ipv4 route identifies one route also with metrics.
But when removing a route, it tries to find the route without
caring about the metrics. It will cause that the route with
right metrics can't be removed.
Thomas noticed this issue when doing the testing:
1. add:
# ip route append 192.168.7.0/24 dev v window 1000
# ip route append 192.168.7.0/24 dev v window 1001
# ip route append 192.168.7.0/24 dev v window 1002
# ip route append 192.168.7.0/24 dev v window 1003
2. delete:
# ip route delete 192.168.7.0/24 dev v window 1002
3. show:
192.168.7.0/24 proto boot scope link window 1001
192.168.7.0/24 proto boot scope link window 1002
192.168.7.0/24 proto boot scope link window 1003
The one with window 1002 wasn't deleted but the first one was.
This patch is to do metrics match when looking up and deleting
one route.
Reported-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Aug 2017 03:30:48 +0000 (20:30 -0700)]
Merge branch 'tcp-sw-rx-timestamps'
Mike Maloney says:
====================
net: Add software rx timestamp for TCP.
Add software rx timestamps for TCP, and a test to ensure consistency of
behavior between IP, UDP, and TCP implementation.
Changes since v1:
-Initialize tss->ts[1] to 0 if caller requested any timestamps.
-Fix test case to validate that tss->ts[1] is zero.
-Fix tests to actually use a raw socket.
-Fix --tcp flag to work on the test.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Maloney [Tue, 22 Aug 2017 21:08:49 +0000 (17:08 -0400)]
selftests/net: Add a test to validate behavior of rx timestamps
Validate the behavior of the combination of various timestamp socket
options, and ensure consistency across ip, udp, and tcp.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Maloney [Tue, 22 Aug 2017 21:08:48 +0000 (17:08 -0400)]
tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg
When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
timestamp corresponding to the highest sequence number data returned.
Previously the skb->tstamp is overwritten when a TCP packet is placed
in the out of order queue. While the packet is in the ooo queue, save the
timestamp in the TCB_SKB_CB. This space is shared with the gso_*
options which are only used on the tx path, and a previously unused 4
byte hole.
When skbs are coalesced either in the sk_receive_queue or the
out_of_order_queue always choose the timestamp of the appended skb to
maintain the invariant of returning the timestamp of the last byte in
the recvmsg buffer.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Manlunas [Tue, 22 Aug 2017 19:46:37 +0000 (12:46 -0700)]
liquidio: change manner of detecting whether or not NIC firmware is loaded
In the NIC firmware, the 1-bit flag indicating "firmware is loaded" moved
from SLI_SCRATCH_1 to SLI_SCRATCH_2 (these are Octeon general-purpose
scratch registers). Make the PF driver conform to this change.
Remove code that sets the "firmware is loaded" flag because it's now the
firmware's job to do that.
In the code that detects whether or not the firmware is loaded, don't just
rely on checking the "firmware is loaded" flag because that may cause a
rare false negative. Add code that deduces whether or not the firmware is
loaded; that will never give a false negative.
Also bump up driver version to match newer NIC firmware.
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Wed, 23 Aug 2017 00:04:05 +0000 (17:04 -0700)]
gre: fix goto statement typo
Fix typo: pnet_tap_faied.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>