GitHub/MotorolaMobilityLLC/kernel-slsi.git
7 years agonet/mlx5e: Xmit, no write combining
Saeed Mahameed [Fri, 24 Mar 2017 21:52:04 +0000 (00:52 +0300)]
net/mlx5e: Xmit, no write combining

mlx5e netdev Blue Flame (write combining) support demands a lot of
overhead for a little latency gain for some special cases, this overhead
is hurting the common case.

Here we remove xmit Blue Flame support by creating all bfregs with no
write combining for all SQs, and we remove a lot of BF logic and
conditions from xmit data path.

Simplify mlx5e_tx_notify_hw (doorbell function) by removing BF related
code and by removing one memory barrier needed for WC mapped SQ doorbell
buffers, which no longer exist.

Performance improvement:
System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Test case                   Before      Now      improvement
---------------------------------------------------------------
TX packets (24 threads)     50Mpps      54Mpps    8%

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine
Saeed Mahameed [Fri, 24 Mar 2017 21:52:03 +0000 (00:52 +0300)]
net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine

Use dma_rmb in mlx5e_get_cqe rather than aggressive rmb (at least on
some architectures), this should help improve the performance on such
CPU archs where dma_rmb is optimized.

Performance improvement:
System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Test case                   Baseline      Now      improvement
---------------------------------------------------------------
TX packets (24 threads)     45Mpps        50Mpps      11%
TC stack Drop (1 core)      3.45Mpps      3.6Mpps     5%
XDP Drop      (1 core)      14Mpps        16.9Mpps    20%
XDP TX        (1 core)      10.4Mpps      12Mpps      15%

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Add missing OF_MDIO dependency
Florian Fainelli [Fri, 24 Mar 2017 21:57:22 +0000 (14:57 -0700)]
net: dsa: bcm_sf2: Add missing OF_MDIO dependency

bcm_sf2 does require the MDIO_BCM_UNIMAC driver which is now dependent
on OF_MDIO but also internally uses of_mdio.c provided routines which
are guarted with OF_MDIO.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 90eff9096c01 ("net: phy: Allow splitting MDIO bus/device support from PHYs")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'ipv6-sr-perf-improvements'
David S. Miller [Fri, 24 Mar 2017 21:47:32 +0000 (14:47 -0700)]
Merge branch 'ipv6-sr-perf-improvements'

David Lebrun says:

====================
Performances improvement for IPv6 Segment Routing

This patch series improves the performances of IPv6 SR by optimizing skb head
reallocation and extending the use of dst_cache. The overall performances improve
by 35%.

Before patch series (SRH encap):
Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
  680427pps 5443Mb/sec (5443416000bps) errors: 0

After patch series (SRH encap):
Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
  1047220pps 8377Mb/sec (8377760000bps) errors: 0

Baseline for plain IPv6 forwarding:
Result: OK: 4244144(c4243722+d422) usec, 5000000 (1000byte,0frags)
  1178093pps 9424Mb/sec (9424744000bps) errors: 0
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv6: sr: use dst_cache in seg6_input
David Lebrun [Fri, 24 Mar 2017 09:46:27 +0000 (10:46 +0100)]
ipv6: sr: use dst_cache in seg6_input

We already use dst_cache in seg6_output, when handling locally generated
packets. We extend it in seg6_input, to also handle forwarded packets, and avoid
unnecessary fib lookups.

Performances for SRH encapsulation before the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
  884006pps 7072Mb/sec (7072048000bps) errors: 0

Performances after the patch:
Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
  1047220pps 8377Mb/sec (8377760000bps) errors: 0

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv6: sr: expand skb head only if necessary
David Lebrun [Fri, 24 Mar 2017 09:46:26 +0000 (10:46 +0100)]
ipv6: sr: expand skb head only if necessary

To insert or encapsulate a packet with an SRH, we need a large enough skb
headroom. Currently, we are using pskb_expand_head to inconditionally increase
the size of the headroom by the amount needed by the SRH (and IPv6 header).
If this reallocation is performed by another CPU than the one that initially
allocated the skb, then when the initial CPU kfree the skb, it will enter the
__slab_free slowpath, impacting performances.

This patch replaces pskb_expand_head with skb_cow_head, that will reallocate the
skb head only if the headroom is not large enough.

Performances for SRH encapsulation before the patch:
Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
  680427pps 5443Mb/sec (5443416000bps) errors: 0

Performances after the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
  884006pps 7072Mb/sec (7072048000bps) errors: 0

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: use setup_deferrable_timer
Geliang Tang [Fri, 24 Mar 2017 14:14:36 +0000 (22:14 +0800)]
net_sched: use setup_deferrable_timer

Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'mlxsw-query-resources'
David S. Miller [Fri, 24 Mar 2017 20:53:29 +0000 (13:53 -0700)]
Merge branch 'mlxsw-query-resources'

Jiri Pirko says:

====================
mlxsw: Query resources from firmware

Ido says:

Some parts of the driver already use the resource query mechanism, but
in other parts we still rely on hard coded values that may change over
time.

This patchset removes most of these remaining values and queries them
from the firmware instead.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Query cell size from firmware
Ido Schimmel [Fri, 24 Mar 2017 07:02:51 +0000 (08:02 +0100)]
mlxsw: spectrum: Query cell size from firmware

As explained in the previous patch, the cell size may change in future
devices, so query it from the firmware instead of hard coding it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Refactor port buffer configuration
Ido Schimmel [Fri, 24 Mar 2017 07:02:50 +0000 (08:02 +0100)]
mlxsw: spectrum: Refactor port buffer configuration

The sizes and thresholds of the priority group (PG) buffers are
configured in cells, which represent a specific amount of bytes.

The cell size can vary in different devices, so it's better to query it
from the firmware than hard coding it.

Refactor the code dealing with this value into different functions, so
that it will be easier to make the conversion in the next patch.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_buffers: Query shared buffer size from firmware
Ido Schimmel [Fri, 24 Mar 2017 07:02:49 +0000 (08:02 +0100)]
mlxsw: spectrum_buffers: Query shared buffer size from firmware

Instead of hard coding the size of the shared buffer in the driver,
query it from the firmware, as it may change in future devices.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: Query maximum number of ports from firmware
Ido Schimmel [Fri, 24 Mar 2017 07:02:48 +0000 (08:02 +0100)]
mlxsw: Query maximum number of ports from firmware

We currently hard code the maximum number of ports in the driver, but
this may change in future devices, so query it from the firmware
instead.

Fallback to a maximum of 64 ports in case this number can't be queried.
This should only happen in SwitchX-2 for which this number is correct.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Query number of LPM trees from firmware
Ido Schimmel [Fri, 24 Mar 2017 07:02:47 +0000 (08:02 +0100)]
mlxsw: spectrum_router: Query number of LPM trees from firmware

Instead of hard coding the number of LPM trees in the driver, query it
from the firmware, as it may change in future devices.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Fri, 24 Mar 2017 20:45:07 +0000 (13:45 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-03-23

This series contains updates to i40e and i40e.txt documentation.

Jake provides all the changes in the series which are centered around
ntuple filter fixes and additional support.  Fixed the current
implementation of .set_rxnfc, where we were not reading the mask field
for filter entries which was resulting in filters not behaving as
expected and not working correctly.  When cleaning up after disabling
flow director support, ensure that the default input set is correctly
reprogrammed.  Since the hardware only supports a single input set for
all flows of that type, the driver shall only allow the input set to
change if there are no other configured filters for that flow type, so
add support to detect when we can update the input set for each flow
type.  Align the driver to other drivers to partition the ring_cookie
value into 8bits of VF index, along with 32bits of queue number instead
of using the user-def field.  Added support to parse the user-def field
into a data structure format to allow future extensions of the user-def
filed by keeping all the code that read/writes the field into a single
location.  Added support for flexible payloads passed via ethtool
user-def field.  We support a single flexible word (2byte) value per
protocol type, and we handle the FLX_PIT register using a list of
flexible entries so that each flow type may be configured separately.
Enabled flow director filters for SCTPv4 packets using the ethtool
ntuple interface to enable filters.  Updated the documentation on the
i40e driver to include the newly added support to ntuple filters.
Reduced complexity of a if-continue-else-break section of code by
taking advantage of using hlist_for_each_entry_continue() instead.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: mpls: Fix setting ttl_propagate for rt2
David Ahern [Fri, 24 Mar 2017 01:02:27 +0000 (19:02 -0600)]
net: mpls: Fix setting ttl_propagate for rt2

Fix copy and paste error setting rt_ttl_propagate.

Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: sysctl: Fix a race to avoid unexpected 0 window from space
Gao Feng [Thu, 23 Mar 2017 23:05:12 +0000 (07:05 +0800)]
tcp: sysctl: Fix a race to avoid unexpected 0 window from space

Because sysctl_tcp_adv_win_scale could be changed any time, so there
is one race in tcp_win_from_space.
For example,
1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)

As a result, tcp_win_from_space returns 0. It is unexpected.

Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
register firstly, then use the register directly, it would be ok.
But we could not depend on the compiler behavior.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: make in_aton() 32-bit internally
Alexey Dobriyan [Thu, 23 Mar 2017 21:58:26 +0000 (00:58 +0300)]
net: make in_aton() 32-bit internally

Converting IPv4 address doesn't need 64-bit arithmetic.

Space savings: 10 bytes!

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function                          old     new   delta
in_aton                            96      86     -10

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: do not reset Octeon if NIC firmware was preloaded
Felix Manlunas [Thu, 23 Mar 2017 20:26:28 +0000 (13:26 -0700)]
liquidio: do not reset Octeon if NIC firmware was preloaded

The PF driver is incorrectly resetting Octeon when the module parameter
"fw_type=none" is there.  "fw_type=none" means the PF should not load any
firmware to the NIC because Octeon is already running preloaded firmware.

Fix it by putting an if (fw_type != none) around the reset code.

Because the Octeon reset is now conditionally gone, when unloading the
driver, conditionally send the RESET_PF command to the firmware who will
then free up PF-related data structures.

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: Add sysctl to toggle early demux for tcp and udp
subashab@codeaurora.org [Thu, 23 Mar 2017 19:34:16 +0000 (13:34 -0600)]
net: Add sysctl to toggle early demux for tcp and udp

Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Tom Herbert <tom@herbertland.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'systemport-tx-napi-improvements'
David S. Miller [Fri, 24 Mar 2017 19:53:15 +0000 (12:53 -0700)]
Merge branch 'systemport-tx-napi-improvements'

Florian Fainelli says:

====================
net: systemport: TX/NAPI improvements

This patch series builds up on Doug's latest changes done in BCMGENET to reduce
the number of spurious interrupts in NAPI, simplify pointer arithmetic and
finally tracking of per TX ring statistics to be SMP friendly.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: systemport: Simplify circular pointer arithmetic
Florian Fainelli [Thu, 23 Mar 2017 17:36:48 +0000 (10:36 -0700)]
net: systemport: Simplify circular pointer arithmetic

Similar to c298ede2fe21 ("net: bcmgenet: simplify circular pointer
arithmetic") we don't need to complex arthimetic since we always have a
ring size that is a power of 2.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: systemport: Clear status to reduce spurious interrupts
Florian Fainelli [Thu, 23 Mar 2017 17:36:47 +0000 (10:36 -0700)]
net: systemport: Clear status to reduce spurious interrupts

Do something similar to commit d5810ca3252a ("net: bcmgenet: clear
status to reduce spurious interrupts") and clear interrupts right before
servicing them. This reduces the number of interrupts by 10K
interrupts/sec for a TX TCP session 1Gbits/sec.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: systemport: Track per TX ring statistics
Florian Fainelli [Thu, 23 Mar 2017 17:36:46 +0000 (10:36 -0700)]
net: systemport: Track per TX ring statistics

bcm_sysport_tx_reclaim_one() is currently summing TX bytes/packets in a
way that is not SMP friendly, mutliples CPUs could run
bcm_sysport_tx_reclaim_one() independently and still update
stats->tx_bytes and stats->tx_packets, cloberring the other CPUs
statistics.

Fix this by tracking per TX rings the number of bytes, packets,
dropped and errors statistics, and provide a bcm_sysport_get_nstats()
function which aggregates everything and returns a consistent output.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'phy-mdio-split'
David S. Miller [Fri, 24 Mar 2017 19:51:05 +0000 (12:51 -0700)]
Merge branch 'phy-mdio-split'

Florian Fainelli says:

====================
net: phy: Allow splitting MDIO bus/device support

This patch series allows building support for MDIO bus controllers which
are sometimes usable and necessary in cases where there are no Ethernet PHYs.

Changes in v3:
- corrected of_mdio compile guards for prototypes vs. stubs
- added a missing OF_MDIO dependency for MDIO_BCM_UNIMAC
- fixed Kbuild bot reported errors against mdio-bitbang

Changes in v2:
- implement Russell's feedback
- solve the circular dependency in the CONFIG_MDIO_DEVICE + CONFIG_PHYLIB case
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: phy: Allow splitting MDIO bus/device support from PHYs
Florian Fainelli [Thu, 23 Mar 2017 17:01:19 +0000 (10:01 -0700)]
net: phy: Allow splitting MDIO bus/device support from PHYs

Introduce a new configuration symbol: MDIO_DEVICE which allows building
the MDIO devices and bus code, without pulling in the entire Ethernet
PHY library and devices code.

PHYLIB nows select MDIO_DEVICE and the relevant Makefile files are
updated to reflect that.

When MDIO_DEVICE (MDIO bus/device only) is selected, but not PHYLIB, we
have mdio-bus.ko as a loadable module, and it does not have a
module_exit() function because the safety of removing a bus class is
unclear.

When both MDIO_DEVICE and PHYLIB are enabled, we need to assemble
everything into a common loadable module: libphy.ko because of nasty
circular dependencies between phy.c, phy_device.c and mdio_bus.c which
are really tough to untangle.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: phy: MDIO_BCM_UNIMAC should depend on OF_MDIO
Florian Fainelli [Thu, 23 Mar 2017 17:01:18 +0000 (10:01 -0700)]
net: phy: MDIO_BCM_UNIMAC should depend on OF_MDIO

The Broadcom MDIO UniMAC driver uses routines provided by of_mdio.c which is
guarded by CONFIG_OF_MDIO.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoof_mdio: Correct check against CONFIG_OF
Florian Fainelli [Thu, 23 Mar 2017 17:01:17 +0000 (10:01 -0700)]
of_mdio: Correct check against CONFIG_OF

CONFIG_OF_MDIO is actually what triggers the build of drivers/of/of_mdio.c, so
providing inline stubs when CONFIG_OF_MDIO=y should be based on that symbol as
well.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: choke: remove dead filter classify code
Jiri Pirko [Thu, 23 Mar 2017 16:02:16 +0000 (17:02 +0100)]
net: sched: choke: remove dead filter classify code

sch_choke is classless qdisc so it does not define cl_ops. Therefore
filter_list cannot be ever changed, being NULL all the time.
Reason is this check in tc_ctl_tfilter:

/* Is it classful? */
cops = q->ops->cl_ops;
if (!cops)
return -EINVAL;

So remove this dead code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: add set_mac to the stmmac_ops
LABBE Corentin [Thu, 23 Mar 2017 13:40:22 +0000 (14:40 +0100)]
net: stmmac: add set_mac to the stmmac_ops

Two different set_mac functions exists but stmmac_dwmac4_set_mac() is
only used for enabling and never for disabling.
So on dwmac4, the MAC RX/TX is never disabled.

This patch add a generic function pointer set_mac() to stmmac_ops and
replace all call to stmmac_set_mac/stmmac_dwmac4_set_mac by a call to
this pointer.

Since dwmac4_ops is const, set_mac cannot be modified after, and so dwmac4_ops
is duplioacted like dwmac4_dma_ops.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoisdn: use setup_timer
Geliang Tang [Thu, 23 Mar 2017 13:15:57 +0000 (21:15 +0800)]
isdn: use setup_timer

Use setup_timer() instead of init_timer() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bridge-ext-learned-entries'
David S. Miller [Fri, 24 Mar 2017 19:30:22 +0000 (12:30 -0700)]
Merge branch 'bridge-ext-learned-entries'

Nikolay Aleksandrov says

====================
net: bridge: allow user-space to add ext learned entries

This set adds the ability to add externally learned entries from
user-space. For symmetry and proper function we need to allow SW entries
to take over HW learned ones (similar to how HW can take over SW entries
currently) which is needed for our use case (evpn) where we have pure SW
ports and HW ports mixed in a single bridge. This does not play well with
switchdev devices currently because there's no feedback when the entry is
taken over, but this case has never worked anyway and feedback can be
easily added when needed.
Patch 02 simply allows to use NTF_EXT_LEARNED from user-space, we already
have Quagga patches that make use of this functionality.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: bridge: allow to add externally learned entries from user-space
Nikolay Aleksandrov [Thu, 23 Mar 2017 10:27:13 +0000 (12:27 +0200)]
net: bridge: allow to add externally learned entries from user-space

The NTF_EXT_LEARNED flag was added for switchdev and externally learned
entries, but it can also be used for entries learned via a software
in user-space which requires dynamic entries that do not expire.
One such case that we have is with quagga and evpn which need dynamic
entries but also require to age them themselves.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: bridge: allow SW learn to take over HW fdb entries
Nikolay Aleksandrov [Thu, 23 Mar 2017 10:27:12 +0000 (12:27 +0200)]
net: bridge: allow SW learn to take over HW fdb entries

Allow to take over an entry which was previously learned via HW when it
shows up from a SW port. This is analogous to how HW takes over SW learned
entries already.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: Remove debugfs interface
Ido Schimmel [Thu, 23 Mar 2017 10:14:24 +0000 (11:14 +0100)]
mlxsw: Remove debugfs interface

We don't use it during development and we can't extend it either, so
remove it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoi40e: make use of hlist_for_each_entry_continue
Jacob Keller [Tue, 7 Mar 2017 23:17:52 +0000 (15:17 -0800)]
i40e: make use of hlist_for_each_entry_continue

Replace a complex if->continue->else->break construction in
i40e_next_filter. We can simply use hlist_for_each_entry_continue
instead. This drops a lot of confusing code. The resulting code is much
easier to understand the intention, and follows the more normal pattern
for using hlist loops. We could have also used a break with a "return
next" at the end of the function, instead of return NULL, but the
current implementation is explicitly clear that when you reach the end
of the loop you get a NULL value. The alternative construction is less
clear since the reader would have to know that next is NULL at the end
of the loop.

Change-Id: Ife74ca451dd79d7f0d93c672bd42092d324d4a03
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: document drivers use of ntuple filters
Jacob Keller [Mon, 6 Feb 2017 22:38:52 +0000 (14:38 -0800)]
i40e: document drivers use of ntuple filters

Add documentation describing the drivers use of ethtool ntuple filters,
including the limitations that it has due to hardware, as well as how it
reads and parses the user-def data block.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: add support for SCTPv4 FDir filters
Jacob Keller [Mon, 6 Feb 2017 22:38:51 +0000 (14:38 -0800)]
i40e: add support for SCTPv4 FDir filters

Enable FDir filters for SCTPv4 packets using the ethtool ntuple
interface to enable filters. The ethtool API does not allow masking on
the verification tag.

Change-Id: I093e88a8143994c7e6f4b7b17a0bd5cf861d18e4
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: implement support for flexible word payload
Jacob Keller [Mon, 6 Feb 2017 22:38:50 +0000 (14:38 -0800)]
i40e: implement support for flexible word payload

Add support for flexible payloads passed via ethtool user-def field.
This support is somewhat limited due to hardware design. The input set
can only be programmed once per filter type, and the flexible offset is
part of this filter input set. This means that the user cannot program
both a regular and a flexible filter at the same time for a given flow
type. Additionally, the user may not program two flexible filters of the
same flow type with different offsets, although they are allowed to
configure different values at that offset location.

We support a single flexible word (2byte) value per protocol type, and
we handle the FLX_PIT register using a list of flexible entries so that
each flow type may be configured separately.

Due to hardware implementation, the flexible data is offset from the
start of the packet payload, and thus may not be in part of the header
data. For this reason, the offset provided by the user defined data is
interpreted as a byte offset from the start of the matching payload.
Previous implementations have tried to represent the offset as from the
start of the frame, but this is not feasible because header sizes may
change due to options.

Change-Id: 36ed27995e97de63f9aea5ade5778ff038d6f811
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: add parsing of flexible filter fields from userdef
Jacob Keller [Mon, 6 Feb 2017 22:38:49 +0000 (14:38 -0800)]
i40e: add parsing of flexible filter fields from userdef

Add code to parse the user-def field into a data structure format. This
code is intended to allow future extensions of the user-def field by
keeping all code that actually reads and writes the field into a single
location. This ensures that we do not litter the driver with references
to the user-def field and minimizes the amount of bitwise operations we
need to do on the data.

Add code which parses the lower 32bits into a flexible word and its
offset. This will be used in a future patch to enable flexible filters
which can match on some arbitrary data in the packet payload. For now,
we just return -EOPNOTSUPP when this is used.

Add code to fill in the user-def field when reporting the filter back,
even though we don't actually implement any user-def fields yet.

Additionally, ensure that we mask the extended FLOW_EXT bit from the
flow_type now that we will be accepting filters which have the FLOW_EXT
bit set (and thus make use of the user-def field).

Change-Id: I238845035c179380a347baa8db8223304f5f6dd7
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: partition the ring_cookie to get VF index
Jacob Keller [Mon, 6 Feb 2017 22:38:48 +0000 (14:38 -0800)]
i40e: partition the ring_cookie to get VF index

Do not use the user-def field for determining the VF target. Instead,
similar to ixgbe, partition the ring_cookie value into 8bits of VF
index, along with 32bits of queue number. This is better than using the
user-def field, because it leaves the field open for extension in
a future patch which will enable flexible data. Also, this matches with
convention used by ixgbe and other drivers.

Change-Id: Ie36745186d817216b12f0313b99ec95cb8a9130c
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: allow changing input set for ntuple filters
Jacob Keller [Mon, 6 Feb 2017 22:38:47 +0000 (14:38 -0800)]
i40e: allow changing input set for ntuple filters

Add support to detect when we can update the input set for each flow
type.

Because the hardware only supports a single input set for all flows of
that matching type, the driver shall only allow the input set to change
if there are no other configured filters for that flow type.

Thus, the first filter added for each flow type is allowed to change the
input set, and all future filters must match the same input set. Display
a diagnostic message whenever the filter input set changes, and
a warning whenever a filter cannot be accepted because it does not match
the configured input set.

Change-Id: Ic22e1c267ae37518bb036aca4a5694681449f283
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: restore default input set for each flow type
Jacob Keller [Mon, 6 Feb 2017 22:38:46 +0000 (14:38 -0800)]
i40e: restore default input set for each flow type

Ensure that the default input set is correctly reprogrammed when
cleaning up after disabling flow director support. This ensures that the
programmed value will be in a clean state.

Although we do not yet have support for SCTPv4 filters, a future patch
will add support for this protocol, so we will correctly restore the
SCTPv4 input set here as well. Note that strictly speaking the default
hardware value for SCTP includes matching the verification tag. However,
the ethtool API does not have support for specifying this value, so
there is no reason to keep the verification field enabled.

This patch is the next step on the way to enabling partial tuple filters
which will be implemented in a following patch.

Change-Id: Ic22e1c267ae37518bb036aca4a5694681449f283
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: check current configured input set when adding ntuple filters
Jacob Keller [Tue, 7 Mar 2017 23:05:23 +0000 (15:05 -0800)]
i40e: check current configured input set when adding ntuple filters

Do not assume that hardware has been programmed with the default mask,
but instead read the input set registers to determine what is currently
programmed. This ensures that all programmed filters match exactly how
the hardware will interpret them, avoiding confusion regarding filter
behavior.

This sets the initial ground-work for allowing custom input sets where
some fields are disabled. A future patch will fully implement this
feature.

Instead of using bitwise negation, we'll just explicitly check for the
correct value. The use of htonl and htons are used to silence sparse
warnings. The compiler should be able to handle the constant value and
avoid actually performing a byteswap.

Change-Id: I3d8db46cb28ea0afdaac8c5b31a2bfb90e3a4102
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agoi40e: correctly honor the mask fields for ETHTOOL_SRXCLSRLINS
Jacob Keller [Tue, 7 Mar 2017 23:05:22 +0000 (15:05 -0800)]
i40e: correctly honor the mask fields for ETHTOOL_SRXCLSRLINS

The current implementation of .set_rxnfc does not properly read the mask
field for filter entries. This results in incorrect driver behavior, as
we do not reject filters which have masks set to ignore some fields. The
current implementation simply assumes that every part of the tuple or
"input set" is specified. This results in filters not behaving as
expected, and not working correctly.

As a first step in supporting some partial filters, add code which
checks the mask fields and rejects any filters which do not have an
acceptable mask. For now, we just assume that all fields must be set.

This will get the driver one step towards allowing some partial filters.
At a minimum, the ethtool commands which previously installed filters
that would not function will now return a non-zero exit code indicating
failure instead.

We should now be meeting the minimum requirements of the .set_rxnfc API,
by ensuring that all filters we program have a valid mask value for each
field.

Finally, add code to report the mask correctly so that the ethtool
command properly reports the mask to the user.

Note that the typecast to (__be16) when checking source and destination
port masks is required because the ~ bitwise negation operator does not
correctly handle variables other than integer size.

Change-Id: Ia020149e07c87aa3fcec7b2283621b887ef0546f
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
7 years agosched: act_csum: don't mangle TCP and UDP GSO packets
Davide Caratti [Thu, 23 Mar 2017 09:39:40 +0000 (10:39 +0100)]
sched: act_csum: don't mangle TCP and UDP GSO packets

after act_csum computes the checksum on skbs carrying GSO TCP/UDP packets,
subsequent segmentation fails because skb_needs_check(skb, true) returns
true. Because of that, skb_warn_bad_offload() is invoked and the following
message is displayed:

WARNING: CPU: 3 PID: 28 at net/core/dev.c:2553 skb_warn_bad_offload+0xf0/0xfd
<...>

  [<ffffffff8171f486>] skb_warn_bad_offload+0xf0/0xfd
  [<ffffffff8161304c>] __skb_gso_segment+0xec/0x110
  [<ffffffff8161340d>] validate_xmit_skb+0x12d/0x2b0
  [<ffffffff816135d2>] validate_xmit_skb_list+0x42/0x70
  [<ffffffff8163c560>] sch_direct_xmit+0xd0/0x1b0
  [<ffffffff8163c760>] __qdisc_run+0x120/0x270
  [<ffffffff81613b3d>] __dev_queue_xmit+0x23d/0x690
  [<ffffffff81613fa0>] dev_queue_xmit+0x10/0x20

Since GSO is able to compute checksum on individual segments of such skbs,
we can simply skip mangling the packet.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dwc-xlgmac: use dual license
Jie Deng [Thu, 23 Mar 2017 04:03:46 +0000 (12:03 +0800)]
net: dwc-xlgmac: use dual license

The driver "dwc-xlgmac" is dual-licensed.
Declare the dual license with MODULE_LICENSE().

Signed-off-by: Jie Deng <jiedeng@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dwc-xlgmac: declaration of dual license in headers
Jie Deng [Thu, 23 Mar 2017 04:03:45 +0000 (12:03 +0800)]
net: dwc-xlgmac: declaration of dual license in headers

The driver "dwc-xlgmac" is dual-licensed. This patch adds
declaration of dual license in file headers.

Signed-off-by: Jie Deng <jiedeng@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bpf-socket-cookie-uid'
David S. Miller [Fri, 24 Mar 2017 00:02:07 +0000 (17:02 -0700)]
Merge branch 'bpf-socket-cookie-uid'

Chenbo Feng says:

====================
net: core: Two Helper function about socket information

Introduce two eBpf helper function to get the socket cookie and
socket uid for each packet. The helper function is useful when
the *sk field inside sk_buff is not empty. These helper functions
can be used on socket and uid based traffic monitoring programs.

Change since V7:
* change the user namespace of uid helper function to sock_net(sk)->user_ns

Change since V6:
* change the user namespace of uid helper function back to init_user_ns
  since in some situation, for example, pinned bpf object, the current
  user namespace is not always applicable.

Change since V5:
* Delete unnecessary blank lines in sample program.
* Refine the variable orders in get_uid helper function.

Change since V4:
* Using current user namespace to get uid instead of using init_ns.
* Add compiling setup of example program in to Makefile.
* Change the name style of the example program binaries.

Change since V3:
* Fixed some typos and incorrect comments in sample program
* replaced raw insns with BPF_STX_XADD and add it to libbpf.h
* Use a temp dir as mount point instead and added a check for
  the user input string.
* Make the get uid helper function returns the user namespace uid
  instead of kuid.
* Return a overflowuid instead of 0 when no uid information is found.

Change since V2:
* Add a sample program to demostrate the usage of the helper function.
* Moved the helper function proto invoking place.
* Add function header into tools/include
* Apply sk_to_full_sk() before getting uid.

Change since V1:
* Removed the unnecessary declarations and export command
* resolved conflict with master branch.
* Examine if the socket is a full socket before getting the uid.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoA Sample of using socket cookie and uid for traffic monitoring
Chenbo Feng [Thu, 23 Mar 2017 00:27:36 +0000 (17:27 -0700)]
A Sample of using socket cookie and uid for traffic monitoring

Add a sample program to demostrate the possible usage of
get_socket_cookie and get_socket_uid helper function. The program will
store bytes and packets counting of in/out traffic monitored by iptables
and store the stats in a bpf map in per socket base. The owner uid of
the socket will be stored as part of the data entry. A shell script for
running the program is also included.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoAdd a eBPF helper function to retrieve socket uid
Chenbo Feng [Thu, 23 Mar 2017 00:27:35 +0000 (17:27 -0700)]
Add a eBPF helper function to retrieve socket uid

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket need to be a fullsock otherwise overflowuid is
returned.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoAdd a helper function to get socket cookie in eBPF
Chenbo Feng [Thu, 23 Mar 2017 00:27:34 +0000 (17:27 -0700)]
Add a helper function to get socket cookie in eBPF

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set.If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function coud be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Thu, 23 Mar 2017 22:11:56 +0000 (15:11 -0700)]
Merge git://git./linux/kernel/git/davem/net

Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'sound-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Thu, 23 Mar 2017 18:58:08 +0000 (11:58 -0700)]
Merge tag 'sound-4.11-rc4' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "This contains the collection of small fixes for 4.11 that were pending
  during my vacation:

   - a few HD-audio quirks (more Dell headset support, docking station
     support on HP laptops)

   - a regression fix for the previous ctxfi DMA mask fix

   - a correction of the new CONFIG_SND_X86 menu entry

   - a fix for the races in ALSA sequencer core spotted by syzkaller"

* tag 'sound-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Adding a group of pin definition to fix headset problem
  ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
  ALSA: x86: Make CONFIG_SND_X86 bool
  ALSA: hda - add support for docking station for HP 840 G3
  ALSA: hda - add support for docking station for HP 820 G2
  ALSA: ctxfi: Fix the incorrect check of dma_set_mask() call

7 years agoqedf: fix wrong le16 conversion
Arnd Bergmann [Mon, 20 Mar 2017 08:49:27 +0000 (09:49 +0100)]
qedf: fix wrong le16 conversion

gcc points out that we are converting a 16-bit integer into a 32-bit
little-endian type and assigning that to 16-bit little-endian
will end up with a zero:

drivers/scsi/qedf/drv_fcoe_fw_funcs.c: In function 'init_initiator_rw_fcoe_task':
include/uapi/linux/byteorder/big_endian.h:32:26: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
  t_st_ctx->read_write.rx_id = cpu_to_le32(FCOE_RX_ID);

The correct solution appears to be to just use a 16-bit byte swap instead.

Fixes: be086e7c53f1 ("qed*: Utilize Firmware 8.15.3.0")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Chad Dupuis <chad.dupuis@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'qed-management-interaction-and-feature-changes'
David S. Miller [Thu, 23 Mar 2017 18:53:31 +0000 (11:53 -0700)]
Merge branch 'qed-management-interaction-and-feature-changes'

Yuval Mintz says:

====================
qed: Management interaction & feature changes

All patches in this series either affect direct interaction with the
management firmware, or changes logic relating to some values retrieved
from it.

Patch #1 revises the basic logic for sending messages to the management
firmware and there completion, and is the most significant [at least
code-wise] of the bunch.

Patch #2 changes infrastrcure in a way that should better protect us form
mistakes leading to stack corruption such as was fixed in
bb4802428432 ("qed: Prevent stack corruption on MFW interaction").

Patch #3 corrects some update API endian issue [sent here as it would
create conflicts with #2, and because it's lack would create a rather
insignifcant problem].

Patch #4 removes some unnecessary logging, allowing cleaner forward
compatibility with future management firmware versions.

Patches #5, #6 slightly change the number of possible L2 queues in some
scenarios, leading to the possibility of having more queues / VFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Reserve VF feature before PF
Mintz, Yuval [Thu, 23 Mar 2017 13:50:20 +0000 (15:50 +0200)]
qed: Reserve VF feature before PF

Align the driver feature distribution with the flow utilized
by the management firmware - first reserve L2 queues for
VFs and use all the remaining for the PF.

The current distribution might lead to PFs with an enormous
amount of queues, but at the same time leave us with insufficient
resources for starting all VFs.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Don't waste SBs unused by RoCE
Mintz, Yuval [Thu, 23 Mar 2017 13:50:19 +0000 (15:50 +0200)]
qed: Don't waste SBs unused by RoCE

When RoCE is enabled on a given L2 interface, the interrupt lines
are divided equally between L2 and RoCE -
But in case number of lines needed for RoCE is limited by number
of available CNQs, we can utilize the additional lines for L2.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Reduce verbosity of unimplemented MFW messages
Mintz, Yuval [Thu, 23 Mar 2017 13:50:18 +0000 (15:50 +0200)]
qed: Reduce verbosity of unimplemented MFW messages

Management firmware and driver are meant to be both backward and forward
compatibile with each other.

If a new mangement firmware would work with an older driver,
it's possible that driver would receive indications which are meaningless
to it. That's perfectly acceptible from the firmware part - so no need to
log such messages at default verbosity; That would only serve to confuse
users.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Correct endian order of MAC passed to MFW
Mintz, Yuval [Thu, 23 Mar 2017 13:50:17 +0000 (15:50 +0200)]
qed: Correct endian order of MAC passed to MFW

The management firmware is running on a Big Endian processor,
and when running on LE platform HW is configured to swap access
to memory shared between management firmware and driver on
32-bit granulariy.

As a result, for matters of simplicity most of the APIs between
driver and management firmware are based on 32-bit variables.
MAC settings are one exception, as driver needs to fill a byte
array when indicating to management firmware that primary MAC
has changed.
Due to the swap, driver must make sure that the mac that was
provided in byte-order would be translated into native order,
otherwise after the swap the management firmware would read
it swapped.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Pass src/dst sizes when interacting with MFW
Tomer Tayar [Thu, 23 Mar 2017 13:50:16 +0000 (15:50 +0200)]
qed: Pass src/dst sizes when interacting with MFW

The driver interaction with management firmware involves a union
of all the data-members relating to the commands the driver prepares.

Current interface assumes the caller always passes such a union -
but thats cumbersome as well as risky [chancing a stack corruption
in case caller accidentally passes a smaller member instead of union].

Change implementation so that caller could pass a pointer to any
of the members instead of the union.

Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Revise MFW command locking
Tomer Tayar [Thu, 23 Mar 2017 13:50:15 +0000 (15:50 +0200)]
qed: Revise MFW command locking

Interaction of driver -> management firmware is based
on a one-pending mailbox [per interface], and various
mailbox commands need to be synchronized.

Current scheme is messy, and there's a difficulty extending
it as it deals differently with various commands as well as
making assumption on the required behavior for load/unload
requests.

Drop the current scheme into a completion-list-based approach;
Each flow would try sending the command when possible,
allowing one flow to complete another flow's completion and
relieve the mailbox before sending its own command.

Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason...
Linus Torvalds [Thu, 23 Mar 2017 18:39:33 +0000 (11:39 -0700)]
Merge branch 'for-linus-4.11' of git://git./linux/kernel/git/mason/linux-btrfs

Pull btrfs fixes from Chris Mason:
 "Zygo tracked down a very old bug with inline compressed extents.

  I didn't tag this one for stable because I want to do individual
  tested backports. It's a little tricky and I'd rather do some extra
  testing on it along the way"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: add missing memset while reading compressed inline extents
  Btrfs: fix regression in lock_delalloc_pages
  btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 23 Mar 2017 18:29:49 +0000 (11:29 -0700)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Several netfilter fixes from Pablo and the crew:
      - Handle fragmented packets properly in netfilter conntrack, from
        Florian Westphal.
      - Fix SCTP ICMP packet handling, from Ying Xue.
      - Fix big-endian bug in nftables, from Liping Zhang.
      - Fix alignment of fake conntrack entry, from Steven Rostedt.

 2) Fix feature flags setting in fjes driver, from Taku Izumi.

 3) Openvswitch ipv6 tunnel source address not set properly, from Or
    Gerlitz.

 4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.

 5) sk->sk_frag.page not released properly in some cases, from Eric
    Dumazet.

 6) Fix RTNL deadlocks in nl80211, from Johannes Berg.

 7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.

 8) Cure improper inflight handling during AF_UNIX GC, from Andrey
    Ulanov.

 9) sch_dsmark doesn't write to packet headers properly, from Eric
    Dumazet.

10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
    Yeganeh.

11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.

12) Fix nametbl deadlock in tipc, from Ying Xue.

13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
    Pressman.

14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.

15) Fix hashmap allocation handling, from Alexei Starovoitov.

16) nl_fib_input() needs stronger netlink message length checking, from
    Eric Dumazet.

17) Fix double-free of sk->sk_filter during sock clone, from Daniel
    Borkmann.

18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
  net:ethernet:aquantia: Fix for RX checksum offload.
  amd-xgbe: Fix the ECC-related bit position definitions
  sfc: cleanup a condition in efx_udp_tunnel_del()
  Bluetooth: btqcomsmd: fix compile-test dependency
  inet: frag: release spinlock before calling icmp_send()
  tcp: initialize icsk_ack.lrcvtime at session start time
  genetlink: fix counting regression on ctrl_dumpfamily()
  socket, bpf: fix sk_filter use after free in sk_clone_lock
  ipv4: provide stronger user input validation in nl_fib_input()
  bpf: fix hashmap extra_elems logic
  enic: update enic maintainers
  net: bcmgenet: remove bcmgenet_internal_phy_setup()
  ipv6: make sure to initialize sockc.tsflags before first use
  fjes: Do not load fjes driver if extended socket device is not power on.
  fjes: Do not load fjes driver if system does not have extended socket device.
  net/mlx5e: Count LRO packets correctly
  net/mlx5e: Count GSO packets correctly
  net/mlx5: Increase number of max QPs in default profile
  net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
  net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
  ...

7 years agoALSA: hda - Adding a group of pin definition to fix headset problem
Hui Wang [Thu, 23 Mar 2017 02:00:25 +0000 (10:00 +0800)]
ALSA: hda - Adding a group of pin definition to fix headset problem

A new Dell laptop needs to apply ALC269_FIXUP_DELL1_MIC_NO_PRESENCE to
fix the headset problem, and the pin definiton of this machine is not
in the pin quirk table yet, now adding it to the table.

Signed-off-by: Hui Wang <hui.wang@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
7 years agonet:ethernet:aquantia: Fix for RX checksum offload.
Pavel Belous [Wed, 22 Mar 2017 23:20:39 +0000 (02:20 +0300)]
net:ethernet:aquantia: Fix for RX checksum offload.

Since AQC-100/107/108 chips supports hardware checksums for RX we should indicate this
via NETIF_F_RXCSUM flag.

v1->v2: 'Signed-off-by' tag added.

Signed-off-by: Pavel Belous <pavel.belous@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoamd-xgbe: Fix the ECC-related bit position definitions
Lendacky, Thomas [Wed, 22 Mar 2017 22:25:27 +0000 (17:25 -0500)]
amd-xgbe: Fix the ECC-related bit position definitions

The ECC bit positions that describe whether the ECC interrupt is for
Tx, Rx or descriptor memory and whether the it is a single correctable
or double detected error were defined in incorrectly (reversed order).
Fix the bit position definitions for these settings so that the proper
ECC handling is performed.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'netvsc-bug-fixes-and-cleanups'
David S. Miller [Thu, 23 Mar 2017 02:38:57 +0000 (19:38 -0700)]
Merge branch 'netvsc-bug-fixes-and-cleanups'

Stephen Hemminger says:

====================
netvsc: bug fixes and cleanups

These fix NAPI issues and bugs found during testing of shutdown
testing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: fix and cleanup rndis_filter_set_packet_filter
stephen hemminger [Wed, 22 Mar 2017 21:51:05 +0000 (14:51 -0700)]
netvsc: fix and cleanup rndis_filter_set_packet_filter

Fix warning from unused set_complete variable. And rearrange code
to eliminate unnecessary goto's.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: eliminate unnecessary skb == NULL checks
stephen hemminger [Wed, 22 Mar 2017 21:51:04 +0000 (14:51 -0700)]
netvsc: eliminate unnecessary skb == NULL checks

Since there already is a special case goto for control messages (skb == NULL)
in netvsc_send, there is no need for later checks in same code path.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: remove unnecessary lock on shutdown
stephen hemminger [Wed, 22 Mar 2017 21:51:03 +0000 (14:51 -0700)]
netvsc: remove unnecessary lock on shutdown

The channel inbound lock was not being used at all by the netvsc
device, but the spin_lock was helpful by providing necessary
barrier before waiting.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: use refcount_t for keeping track of sub channels
stephen hemminger [Wed, 22 Mar 2017 21:51:02 +0000 (14:51 -0700)]
netvsc: use refcount_t for keeping track of sub channels

Rather than a lock and variable, use a refcount_t to keep track
of the number of sub channels.  Don't need to wait for subchannels
on device removal since wait was already done in device_add.

Also fix the error handling; don't wait forever in case of
an error on request to create sub channels.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: uses RCU instead of removal flag
stephen hemminger [Wed, 22 Mar 2017 21:51:01 +0000 (14:51 -0700)]
netvsc: uses RCU instead of removal flag

It is cleaner to use RCU protected pointer (nvdev_ctx->nvdev)
to indicate device is in removed state, rather than having a separate
boolean flag. By using the pointer the context can be checked
by static checkers and dynamic lockdep.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: use RCU to protect inner device structure
stephen hemminger [Wed, 22 Mar 2017 21:51:00 +0000 (14:51 -0700)]
netvsc: use RCU to protect inner device structure

The netvsc driver has an internal structure (netvsc_device) which
is created when device is opened and released when device is closed.
And also opened/released when MTU or number of channels change.

Since this is referenced in the receive and transmit path, it is
safer to use RCU to protect/prevent use after free problems.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: change max channel calculation
stephen hemminger [Wed, 22 Mar 2017 21:50:59 +0000 (14:50 -0700)]
netvsc: change max channel calculation

The default number of maximum channels should be limited to the
number of cpus available on the numa node of the primary channel.
This also makes sure maximum channels <= num_online_cpus

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: handle offline mtu and channel change
stephen hemminger [Wed, 22 Mar 2017 21:50:58 +0000 (14:50 -0700)]
netvsc: handle offline mtu and channel change

If device is not up, then changing MTU (or number of channels)
should not re-enable the device.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: fix NAPI performance regression
stephen hemminger [Wed, 22 Mar 2017 21:50:57 +0000 (14:50 -0700)]
netvsc: fix NAPI performance regression

When using NAPI, the single stream performance declined signifcantly
because the poll routine was updating host after every burst
of packets. This excess signalling caused host throttling.

This fix restores the old behavior. Host is only signalled
after the ring has been emptied.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: fix tx completions in napi poll
VSR Burru [Wed, 22 Mar 2017 18:54:50 +0000 (11:54 -0700)]
liquidio: fix tx completions in napi poll

If there are no egress packets pending, then don't look for tx completions
in napi poll.  Also, fix broken tx queue wakeup logic.

Signed-off-by: VSR Burru <veerasenareddy.burru@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: allocate RX buffers in OOM conditions in PF and VF
Satanand Burla [Wed, 22 Mar 2017 18:31:13 +0000 (11:31 -0700)]
liquidio: allocate RX buffers in OOM conditions in PF and VF

Add workqueue that is periodically run to try to allocate RX buffers in OOM
conditions in PF and VF.

Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: vmxnet3: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes [Wed, 22 Mar 2017 07:27:57 +0000 (08:27 +0100)]
net: vmxnet3: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: virtio_net: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes [Tue, 21 Mar 2017 22:24:24 +0000 (23:24 +0100)]
net: virtio_net: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agosfc: cleanup a condition in efx_udp_tunnel_del()
Dan Carpenter [Wed, 22 Mar 2017 09:10:02 +0000 (12:10 +0300)]
sfc: cleanup a condition in efx_udp_tunnel_del()

Presumably if there is an "add" function, there is also a "del"
function.  But it causes a static checker warning because it looks like
a common cut and paste bug.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoBluetooth: btqcomsmd: fix compile-test dependency
Arnd Bergmann [Mon, 20 Mar 2017 22:31:10 +0000 (15:31 -0700)]
Bluetooth: btqcomsmd: fix compile-test dependency

compile-testing fails when QCOM_SMD is a loadable module:

drivers/bluetooth/built-in.o: In function `btqcomsmd_send':
btqca.c:(.text+0xa8): undefined reference to `qcom_smd_send'
drivers/bluetooth/built-in.o: In function `btqcomsmd_probe':
btqca.c:(.text+0x3ec): undefined reference to `qcom_wcnss_open_channel'
btqca.c:(.text+0x46c): undefined reference to `qcom_smd_set_drvdata'

This clarifies the dependency to allow compile-testing only when
SMD is completely disabled, otherwise the dependency on QCOM_SMD
will make sure we can link against it.

Fixes: e27ee2b16bad ("Bluetooth: btqcomsmd: Allow driver to build if COMPILE_TEST is enabled")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[bjorn: Restructure and clarify dependency to QCOM_WCNSS_CTRL]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'xgene-v2-mdio-and-ethtool'
David S. Miller [Thu, 23 Mar 2017 02:17:15 +0000 (19:17 -0700)]
Merge branch 'xgene-v2-mdio-and-ethtool'

Iyappan Subramanian says:

====================
drivers: net: xgene-v2: Add MDIO and ethtool support

This patch set,

- adds phy management and ethtool support
- fixes ethernet reset
- addresses review comments from previous patch set

v2: Address review comments from v1
- removed mdio_lock, since there is a top level lock in mdio_bus.c

v1:
- Initial version
====================

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene-v2: misc fixes
Iyappan Subramanian [Wed, 22 Mar 2017 01:18:05 +0000 (18:18 -0700)]
drivers: net: xgene-v2: misc fixes

Fixed review comments from the previous patch-set.

- changed return value check of platform_get_irq() to < 0
- replaced devm_request(free)_irq() calls by request(free)_irq() since
  they are called from open() and close()
- changed sizeof(struct mystruct) to sizeof(*mystruct)
- reduced indentation on tx_timeout()

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene-v2: Fix port reset
Iyappan Subramanian [Wed, 22 Mar 2017 01:18:04 +0000 (18:18 -0700)]
drivers: net: xgene-v2: Fix port reset

Fixed port reset sequence by adding ECC init.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene-v2: Add ethtool support
Iyappan Subramanian [Wed, 22 Mar 2017 01:18:03 +0000 (18:18 -0700)]
drivers: net: xgene-v2: Add ethtool support

Added basic ethtool support.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene-v2: Add MDIO support
Iyappan Subramanian [Wed, 22 Mar 2017 01:18:02 +0000 (18:18 -0700)]
drivers: net: xgene-v2: Add MDIO support

Added phy management support by using phy abstraction layer APIs.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bpf-map-in-map'
David S. Miller [Wed, 22 Mar 2017 22:45:46 +0000 (15:45 -0700)]
Merge branch 'bpf-map-in-map'

Martin KaFai Lau says:

====================
bpf: Add map-in-map support

This patchset adds map-in-map support (map->map).
One use case is the (vips -> webservers) in the L4 load balancer so
that different vips can be backed by different set of webservers.

Please refer to the individual commit log for details.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Add tests for map-in-map
Martin KaFai Lau [Wed, 22 Mar 2017 17:00:35 +0000 (10:00 -0700)]
bpf: Add tests for map-in-map

Test cases for array of maps and hash of maps.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Add hash of maps support
Martin KaFai Lau [Wed, 22 Mar 2017 17:00:34 +0000 (10:00 -0700)]
bpf: Add hash of maps support

This patch adds hash of maps support (hashmap->bpf_map).
BPF_MAP_TYPE_HASH_OF_MAPS is added.

A map-in-map contains a pointer to another map and lets call
this pointer 'inner_map_ptr'.

Notes on deleting inner_map_ptr from a hash map:

1. For BPF_F_NO_PREALLOC map-in-map, when deleting
   an inner_map_ptr, the htab_elem itself will go through
   a rcu grace period and the inner_map_ptr resides
   in the htab_elem.

2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
   when deleting an inner_map_ptr, the htab_elem may
   get reused immediately.  This situation is similar
   to the existing prealloc-ated use cases.

   However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
   inner_map->ops->map_free(inner_map) which will go
   through a rcu grace period (i.e. all bpf_map's map_free
   currently goes through a rcu grace period).  Hence,
   the inner_map_ptr is still safe for the rcu reader side.

This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
check_map_prealloc() in the verifier.  preallocation is a
must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even we don't expect
heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
map-in-map first.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Add array of maps support
Martin KaFai Lau [Wed, 22 Mar 2017 17:00:33 +0000 (10:00 -0700)]
bpf: Add array of maps support

This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map).  The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.

Any bpf map type can be acted as an inner_map.  The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.

Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).

When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also.  ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity reason, a simple '==' test is used for both map_flags
and max_entries.  The equality of ops is implied by the equality of
map_type.

During outer_map creation time, an inner_map_fd is needed to create an
outer_map.  However, the inner_map_fd's life time does not depend on the
outer_map.  The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.

Also, for the outer_map:

* It allows element update and delete from syscall
* It allows element lookup from bpf_prog

The above is similar to the current fd_array pattern.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Fix and simplifications on inline map lookup
Martin KaFai Lau [Wed, 22 Mar 2017 17:00:32 +0000 (10:00 -0700)]
bpf: Fix and simplifications on inline map lookup

Fix in verifier:
For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
a broken case is "a different type of map could be used for the
same lookup instruction". For example, an array in one case and a
hashmap in another.  We have to resort to the old dynamic call behavior
in this case.  The fix is to check for collision on insn_aux->map_ptr.
If there is collision, don't inline the map lookup.

Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
patch for how-to trigger the above case.

Simplifications on array_map_gen_lookup():
1. Calculate elem_size from map->value_size.  It removes the
   need for 'struct bpf_array' which makes the later map-in-map
   implementation easier.
2. Remove the 'elem_size == 1' test

Fixes: 81ed18ab3098 ("bpf: add helper inlining infra and optimize map_array lookup")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoinet: frag: release spinlock before calling icmp_send()
Eric Dumazet [Wed, 22 Mar 2017 15:57:15 +0000 (08:57 -0700)]
inet: frag: release spinlock before calling icmp_send()

Dmitry reported a lockdep splat [1] (false positive) that we can fix
by releasing the spinlock before calling icmp_send() from ip_expire()

This is a false positive because sending an ICMP message can not
possibly re-enter the IP frag engine.

[1]
[ INFO: possible circular locking dependency detected ]
4.10.0+ #29 Not tainted
-------------------------------------------------------
modprobe/12392 is trying to acquire lock:
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock
include/linux/spinlock.h:299 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock
include/linux/netdevice.h:3486 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>]
sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180

but task is already holding lock:
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&q->lock)->rlock){+.-...}:
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669
       ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713
       packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459
       deliver_skb net/core/dev.c:1834 [inline]
       dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890
       xmit_one net/core/dev.c:2903 [inline]
       dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923
       sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308
       neigh_output include/net/neighbour.h:478 [inline]
       ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228
       ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672
       ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545
       ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
       __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
       SYSC_sendmmsg net/socket.c:2106 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2101
       do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
       return_from_SYSCALL_64+0x0/0x7a

-> #0 (_xmit_ETHER#2){+.-...}:
       check_prev_add kernel/locking/lockdep.c:1830 [inline]
       check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       __netif_tx_lock include/linux/netdevice.h:3486 [inline]
       sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_hh_output include/net/neighbour.h:468 [inline]
       neigh_output include/net/neighbour.h:476 [inline]
       ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
       ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
       icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
       ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
       call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
       expire_timers kernel/time/timer.c:1307 [inline]
       __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
       run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
       __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:657 [inline]
       smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
       apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
       __read_once_size include/linux/compiler.h:254 [inline]
       atomic_read arch/x86/include/asm/atomic.h:26 [inline]
       rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
       __rcu_is_watching kernel/rcu/tree.c:1133 [inline]
       rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
       rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
       radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
       filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
       do_fault_around mm/memory.c:3231 [inline]
       do_read_fault mm/memory.c:3265 [inline]
       do_fault+0xbd5/0x2080 mm/memory.c:3370
       handle_pte_fault mm/memory.c:3600 [inline]
       __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
       handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
       __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
       do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
       page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&q->lock)->rlock);
                               lock(_xmit_ETHER#2);
                               lock(&(&q->lock)->rlock);
  lock(_xmit_ETHER#2);

 *** DEADLOCK ***

10 locks held by modprobe/12392:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>]
__do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336
 #1:  (rcu_read_lock){......}, at: [<ffffffff8188cab6>]
filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
spin_lock include/linux/spinlock.h:299 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
pte_alloc_one_map mm/memory.c:2944 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
lockdep_copy_map include/linux/lockdep.h:175 [inline]
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201
 #5:  (rcu_read_lock){......}, at: [<ffffffff8389a633>]
ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock
include/linux/spinlock.h:309 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock
net/ipv4/icmp.c:219 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>]
icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681
 #7:  (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>]
ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198
 #8:  (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>]
__dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324
 #9:  (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at:
[<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423

stack backtrace:
CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:52
 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204
 check_prev_add kernel/locking/lockdep.c:1830 [inline]
 check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
 validate_chain kernel/locking/lockdep.c:2267 [inline]
 __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:299 [inline]
 __netif_tx_lock include/linux/netdevice.h:3486 [inline]
 sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
 __dev_xmit_skb net/core/dev.c:3092 [inline]
 __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
 neigh_hh_output include/net/neighbour.h:468 [inline]
 neigh_output include/net/neighbour.h:476 [inline]
 ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:657 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline]
RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10
RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c
RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000
R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25
R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000
 </IRQ>
 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
 filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
 do_fault_around mm/memory.c:3231 [inline]
 do_read_fault mm/memory.c:3265 [inline]
 do_fault+0xbd5/0x2080 mm/memory.c:3370
 handle_pte_fault mm/memory.c:3600 [inline]
 __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011
RIP: 0033:0x7f83172f2786
RSP: 002b:00007fffe859ae80 EFLAGS: 00010293
RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970
RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000
R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040
R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: initialize icsk_ack.lrcvtime at session start time
Eric Dumazet [Wed, 22 Mar 2017 15:10:21 +0000 (08:10 -0700)]
tcp: initialize icsk_ack.lrcvtime at session start time

icsk_ack.lrcvtime has a 0 value at socket creation time.

tcpi_last_data_recv can have bogus value if no payload is ever received.

This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agogenetlink: fix counting regression on ctrl_dumpfamily()
Stanislaw Gruszka [Wed, 22 Mar 2017 15:08:33 +0000 (16:08 +0100)]
genetlink: fix counting regression on ctrl_dumpfamily()

Commit 2ae0f17df1cd ("genetlink: use idr to track families") replaced

if (++n < fams_to_skip)
continue;
into:

if (n++ < fams_to_skip)
continue;

This subtle change cause that on retry ctrl_dumpfamily() call we omit
one family that failed to do ctrl_fill_info() on previous call, because
cb->args[0] = n number counts also family that failed to do
ctrl_fill_info().

Patch fixes the problem and avoid confusion in the future just decrease
n counter when ctrl_fill_info() fail.

User visible problem caused by this bug is failure to get access to
some genetlink family i.e. nl80211. However problem is reproducible
only if number of registered genetlink families is big enough to
cause second call of ctrl_dumpfamily().

Cc: Xose Vazquez Perez <xose.vazquez@gmail.com>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Fixes: 2ae0f17df1cd ("genetlink: use idr to track families")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agosocket, bpf: fix sk_filter use after free in sk_clone_lock
Daniel Borkmann [Wed, 22 Mar 2017 12:08:08 +0000 (13:08 +0100)]
socket, bpf: fix sk_filter use after free in sk_clone_lock

In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.

sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.

The issue is that commit 278571baca2a ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.

Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.

Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parents sk_filter.

Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.

Fixes: 278571baca2a ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: fix dma operation mode config for older versions
Joao Pinto [Wed, 22 Mar 2017 11:56:05 +0000 (11:56 +0000)]
net: stmmac: fix dma operation mode config for older versions

The dma operation mode configuration routine was wrongly moved to a
function (stmmac_mtl_configuration) that is only executed if the
core version is >= 4.00.

Fixes: 6deee2221e11 ("net: stmmac: prepare dma op mode config for multiple queues")
Reported-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.
Joel Scherpelz [Wed, 22 Mar 2017 09:19:04 +0000 (18:19 +0900)]
net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.

This commit adds a new sysctl accept_ra_rt_info_min_plen that
defines the minimum acceptable prefix length of Route Information
Options. The new sysctl is intended to be used together with
accept_ra_rt_info_max_plen to configure a range of acceptable
prefix lengths. It is useful to prevent misconfigurations from
unintentionally blackholing too much of the IPv6 address space
(e.g., home routers announcing RIOs for fc00::/7, which is
incorrect).

Signed-off-by: Joel Scherpelz <jscherpelz@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv4: provide stronger user input validation in nl_fib_input()
Eric Dumazet [Wed, 22 Mar 2017 02:22:28 +0000 (19:22 -0700)]
ipv4: provide stronger user input validation in nl_fib_input()

Alexander reported a KMSAN splat caused by reads of uninitialized
field (tb_id_in) from user provided struct fib_result_nl

It turns out nl_fib_input() sanity tests on user input is a bit
wrong :

User can pretend nlh->nlmsg_len is big enough, but provide
at sendmsg() time a too small buffer.

Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: fix hashmap extra_elems logic
Alexei Starovoitov [Wed, 22 Mar 2017 02:05:04 +0000 (19:05 -0700)]
bpf: fix hashmap extra_elems logic

In both kmalloc and prealloc mode the bpf_map_update_elem() is using
per-cpu extra_elems to do atomic update when the map is full.
There are two issues with it. The logic can be misused, since it allows
max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
at map creation time can fail percpu alloc for large map values with a warn:
WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
illegal size (32824) or align (8) for percpu allocation

The fixes for both of these issues are different for kmalloc and prealloc modes.
For prealloc mode allocate extra num_possible_cpus elements and store
their pointers into extra_elems array instead of actual elements.
Hence we can use these hidden(spare) elements not only when the map is full
but during bpf_map_update_elem() that replaces existing element too.
That also improves performance, since pcpu_freelist_pop/push is avoided.
Unfortunately this approach cannot be used for kmalloc mode which needs
to kfree elements after rcu grace period. Therefore switch it back to normal
kmalloc even when full and old element exists like it was prior to
commit 6c9059817432 ("bpf: pre-allocate hash map elements").

Add tests to check for over max_entries and large map values.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>