GitHub/moto-9609/android_kernel_motorola_exynos9610.git
7 years agonet: sched: introduce multichain support for filters
Jiri Pirko [Wed, 17 May 2017 09:08:01 +0000 (11:08 +0200)]
net: sched: introduce multichain support for filters

Instead of having only one filter per block, introduce a list of chains
for every block. Create chain 0 by default. UAPI is extended so the user
can specify which chain he wants to change. If the new attribute is not
specified, chain 0 is used. That allows to maintain backward
compatibility. If chain does not exist and user wants to manipulate with
it, new chain is created with specified index. Also, when last filter is
removed from the chain, the chain is destroyed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: push chain dump to a separate function
Jiri Pirko [Wed, 17 May 2017 09:08:00 +0000 (11:08 +0200)]
net: sched: push chain dump to a separate function

Since there will be multiple chains to dump, push chain dumping code to
a separate function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: introduce helpers to work with filter chains
Jiri Pirko [Wed, 17 May 2017 09:07:59 +0000 (11:07 +0200)]
net: sched: introduce helpers to work with filter chains

Introduce struct tcf_chain object and set of helpers around it. Wraps up
insertion, deletion and search in the filter chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: move TC_H_MAJ macro call into tcf_auto_prio
Jiri Pirko [Wed, 17 May 2017 09:07:58 +0000 (11:07 +0200)]
net: sched: move TC_H_MAJ macro call into tcf_auto_prio

Call the helper from the function rather than to always adjust the
return value of the function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: replace nprio by a bool to make the function more readable
Jiri Pirko [Wed, 17 May 2017 09:07:57 +0000 (11:07 +0200)]
net: sched: replace nprio by a bool to make the function more readable

The use of "nprio" variable in tc_ctl_tfilter is a bit cryptic and makes
a reader wonder what is going on for a while. So help him to understand
this priority allocation dance a litte bit better.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: rename tcf_destroy_chain helper
Jiri Pirko [Wed, 17 May 2017 09:07:56 +0000 (11:07 +0200)]
net: sched: rename tcf_destroy_chain helper

Make the name consistent with the rest of the helpers around.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: introduce tcf block infractructure
Jiri Pirko [Wed, 17 May 2017 09:07:55 +0000 (11:07 +0200)]
net: sched: introduce tcf block infractructure

Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: sched: move tc_classify function to cls_api.c
Jiri Pirko [Wed, 17 May 2017 09:07:54 +0000 (11:07 +0200)]
net: sched: move tc_classify function to cls_api.c

Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'dsa-sort'
David S. Miller [Wed, 17 May 2017 19:19:40 +0000 (15:19 -0400)]
Merge branch 'dsa-sort'

Andrew Lunn says:

====================
net: dsa: Sort various lists

As we gain more DSA drivers and tagging protocols, the lists are
getting a bit unruly. Do some sorting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: DSA: Sort drivers
Andrew Lunn [Tue, 16 May 2017 20:40:08 +0000 (22:40 +0200)]
drivers: net: DSA: Sort drivers

With more drivers being added, it is time to sort the drivers to
impose some order.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: Sort DSA tagging protocol drivers
Andrew Lunn [Tue, 16 May 2017 20:40:07 +0000 (22:40 +0200)]
net: dsa: Sort DSA tagging protocol drivers

With more tag protocols being added, regain some order by sorting the
entries in various places.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: fix PF falsely indicating success at setting MAC address of a nonexistent VF
Felix Manlunas [Tue, 16 May 2017 18:28:00 +0000 (11:28 -0700)]
liquidio: fix PF falsely indicating success at setting MAC address of a nonexistent VF

In the function assigned to .ndo_set_vf_mac, check the validity of the
vfidx argument before proceeding to tell the firmware to set the VF MAC
address.

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: fix insmod failure when multiple NICs are plugged in
Rick Farrington [Tue, 16 May 2017 18:14:50 +0000 (11:14 -0700)]
liquidio: fix insmod failure when multiple NICs are plugged in

When multiple liquidio NICs are plugged in, the first insmod of the PF
driver succeeds.  But after an rmmod, a subsequent insmod fails.  Reason is
during rmmod, the PF driver resets the Octeon of only one of the NICs; it
neglects to reset the Octeons of the other NICs.

Fix the insmod failure by adding the missing Octeon resets at rmmod.  Keep
a per-NIC refcount that indicates the number of active PFs in a given NIC.
When the refcount goes to zero, then reset the Octeon of that NIC.

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: store CPU port pointer in the tree
Vivien Didelot [Tue, 16 May 2017 18:10:33 +0000 (14:10 -0400)]
net: dsa: store CPU port pointer in the tree

A dsa_switch_tree instance holds a dsa_switch pointer and a port index
to identify the switch port to which the CPU is attached.

Now that the DSA layer has a dsa_port structure to hold this data, use
it to point the switch CPU port.

This patch simply substitutes s/dst->cpu_switch/dst->cpu_dp->ds/ and
s/dst->cpu_port/dst->cpu_dp->index/.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'mlxsw-Preparations-for-restructuring'
David S. Miller [Wed, 17 May 2017 18:06:56 +0000 (14:06 -0400)]
Merge branch 'mlxsw-Preparations-for-restructuring'

Jiri Pirko says:

====================
mlxsw: Preparations for restructuring

This patchset doesn't introduce any functional changes and merely meant
to make the code base more receptive for upcoming restructuring.

The first six patches mainly shuffle code in order to reduce the scope of
structs that shouldn't be defined in the main driver header. Most of them
will be later expanded, so it makes sense to correctly place them now.

The last patches mostly simplify bridge-related functions, so that they
could be more easily modified later on.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Default ports to non-virtual mode
Ido Schimmel [Tue, 16 May 2017 17:38:35 +0000 (19:38 +0200)]
mlxsw: spectrum: Default ports to non-virtual mode

In virtual mode, packets are classified to FIDs based on their ingress
port and VLAN whereas in non-virtual mode only the VLAN is taken into
account.

Currently ports are initialized to use virtual mode due to the presence
of the PVID vPort. However, we're going to transition ports between both
modes based on the FIDs they use and not merely based on the presence on
a VLAN upper. Therefore, during initialization, no mode will be
explicitly set.

Since the Programmer's Reference Manual (PRM) doesn't specify a default,
explicitly set the port to non-virtual mode and later transition the
port between both modes based on the FIDs it uses.

In a follow-up patchset, this step will be moved to the common FID core
where it logically belongs.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Move PVID code to appropriate place
Ido Schimmel [Tue, 16 May 2017 17:38:34 +0000 (19:38 +0200)]
mlxsw: spectrum: Move PVID code to appropriate place

PVID is a port attribute and should therefore reside in the main driver
file and not the switchdev specific one.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_switchdev: Don't batch learning operations
Ido Schimmel [Tue, 16 May 2017 17:38:33 +0000 (19:38 +0200)]
mlxsw: spectrum_switchdev: Don't batch learning operations

We no longer batch VLAN operations, so there's no need to set the
learning state for a range of VLANs.

Use a common function to set the learning state for a Port-VLAN, thereby
making the code saner more receptive for upcoming changes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_switchdev: Don't batch STP operations
Ido Schimmel [Tue, 16 May 2017 17:38:32 +0000 (19:38 +0200)]
mlxsw: spectrum_switchdev: Don't batch STP operations

Simplify the code by using the common function that sets an STP state
for a Port-VLAN and remove the existing one that tries to batch it for
several VLANs.

This will help us in a follow-up patchset to introduce a unified
infrastructure for bridge ports, regardless if the bridge is VLAN-aware
or not.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_switchdev: Don't batch VLAN operations
Ido Schimmel [Tue, 16 May 2017 17:38:31 +0000 (19:38 +0200)]
mlxsw: spectrum_switchdev: Don't batch VLAN operations

switchdev's VLAN object has the ability to describe a range of VLAN IDs,
but this is only used when VLAN operations are done using the SELF flag,
which is something we would like to remove as it allows one to bypass
the bridge driver.

Do VLAN operations on a per-VLAN basis, thereby simplifying the code and
preparing it for refactoring in a follow-up patchset.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_switchdev: Remove redundant check
Ido Schimmel [Tue, 16 May 2017 17:38:30 +0000 (19:38 +0200)]
mlxsw: spectrum_switchdev: Remove redundant check

Since commit 97c242902c20 ("switchdev: Execute bridge ndos only for
bridge ports") switchdev code checks that port is bridged, so no need to
perform the same check in the driver.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Initialize RIFs in a separate function
Ido Schimmel [Tue, 16 May 2017 17:38:29 +0000 (19:38 +0200)]
mlxsw: spectrum_router: Initialize RIFs in a separate function

The router interfaces (RIFs) array is currently initialized together
with the general router configuration. However, in a follow-up patchset
we're going to introduce a common RIF core that will require us to
initialize more RIF constructs, so move the RIF initialization to its
own function.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Move FIB notification block to router struct
Ido Schimmel [Tue, 16 May 2017 17:38:28 +0000 (19:38 +0200)]
mlxsw: spectrum_router: Move FIB notification block to router struct

The FIB notification block logically belongs inside the router specific
struct, so move it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Move RIFs array to its rightful place
Ido Schimmel [Tue, 16 May 2017 17:38:27 +0000 (19:38 +0200)]
mlxsw: spectrum_router: Move RIFs array to its rightful place

The router interfaces (RIFs) array is of no interest to code outside the
routing realm, so declare it inside the router specific struct instead
of the chip-wide one.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_switchdev: Reduce scope of bridge struct
Ido Schimmel [Tue, 16 May 2017 17:38:26 +0000 (19:38 +0200)]
mlxsw: spectrum_switchdev: Reduce scope of bridge struct

Some attributes in the global chip struct are only relevant for bridge
operation, so encapsulate them in their own struct that isn't exposed to
non-bridge code.

This will also help us later, when we add more bridge-specific
attributes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Reduce scope of router struct
Ido Schimmel [Tue, 16 May 2017 17:38:25 +0000 (19:38 +0200)]
mlxsw: spectrum_router: Reduce scope of router struct

In a similar fashion to previous patch, the router structure
('mlxsw_sp_router') doesn't need to be accessible to anyone, but the
router code located at spectrum_router.c

Make this apparent and reduce its scope by defining it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_buffer: Reduce scope of shared buffer struct
Ido Schimmel [Tue, 16 May 2017 17:38:24 +0000 (19:38 +0200)]
mlxsw: spectrum_buffer: Reduce scope of shared buffer struct

The shared buffer structure ('mlxsw_sp_sb') doesn't need to be
accessible to anyone, but the shared buffer code located at
spectrum_buffers.c

Make this apparent and reduce its scope by defining it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agocxgb4: add new T5 pci device id
Ganesh Goudar [Tue, 16 May 2017 16:09:05 +0000 (21:39 +0530)]
cxgb4: add new T5 pci device id

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agocxgb4: reduce resource allocation in kdump kernel
Ganesh Goudar [Tue, 16 May 2017 15:47:42 +0000 (21:17 +0530)]
cxgb4: reduce resource allocation in kdump kernel

When is_kdump_kernel() is true, reduce memory footprint of
cxgb4 by using a single "Queue Set".

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoliquidio: use pcie_flr instead of duplicating it
Christoph Hellwig [Tue, 16 May 2017 14:21:46 +0000 (16:21 +0200)]
liquidio: use pcie_flr instead of duplicating it

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: phy: Remove residual magic from PHY drivers
Andrew Lunn [Tue, 16 May 2017 16:29:11 +0000 (18:29 +0200)]
net: phy: Remove residual magic from PHY drivers

commit fa8cddaf903c ("net phylib: Remove unnecessary condition check in phy")
removed the only place where the PHY flag PHY_HAS_MAGICANEG was
checked. But it left the flag being set in the drivers. Remove the flag.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobnx2x: Remove open coded carrier check
Leon Romanovsky [Tue, 16 May 2017 12:20:56 +0000 (15:20 +0300)]
bnx2x: Remove open coded carrier check

There is inline function to test if carrier present,
so it makes open-coded solution redundant.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: internal implementation for pacing
Eric Dumazet [Tue, 16 May 2017 11:24:36 +0000 (04:24 -0700)]
tcp: internal implementation for pacing

BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'udp-scalability-improvements'
David S. Miller [Tue, 16 May 2017 19:41:31 +0000 (15:41 -0400)]
Merge branch 'udp-scalability-improvements'

Paolo Abeni says:

====================
udp: scalability improvements

This patch series implement an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under udp flood is up to the 30% depending on
the numa layout and the number of ingress queue used by the relevant nic.

The performance numbers have been gathered using pktgen as sender, with 64
bytes packets, random src port on a host b2b connected via a 10Gbs link
with the dut.

The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
the ingress nic, so that the number of ingress nic rx queues hit by the udp
traffic could be controlled via ethtool -L.

The udp_sink program was bound to the first idle cpu, to get more
stable numbers.

On a single numa node receiver:

nic rx queues           vanilla                 patched kernel
1                       1820 kpps               1900 kpps
2                       1950 kpps               2500 kpps
16                      1670 kpps               2120 kpps

When using a single nic rx queue, busy polling was also enabled,
elsewhere, in the above scenario, the bh processing becomes the bottle-neck
and this produces large artifacts in the measured performances (e.g.
improving the udp sink run time, decreases the overall tput, since more
action from the scheduler comes into play).

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

v1 -> v2:
  Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
  of udp_rmem_release() is now a bool.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoudp: keep the sk_receive_queue held when splicing
Paolo Abeni [Tue, 16 May 2017 09:20:15 +0000 (11:20 +0200)]
udp: keep the sk_receive_queue held when splicing

On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.

v1 -> v2:
  the rx_queue_lock_held param in udp_rmem_release() is
  now a bool

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoudp: use a separate rx queue for packet reception
Paolo Abeni [Tue, 16 May 2017 09:20:14 +0000 (11:20 +0200)]
udp: use a separate rx queue for packet reception

under udp flood the sk_receive_queue spinlock is heavily contended.
This patch try to reduce the contention on such lock adding a
second receive queue to the udp sockets; recvmsg() looks first
in such queue and, only if empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

On specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet/sock: factor out dequeue/peek with offset code
Paolo Abeni [Tue, 16 May 2017 09:20:13 +0000 (11:20 +0200)]
net/sock: factor out dequeue/peek with offset code

And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'nfp-LSO-checksum-and-XDP-datapath-updates'
David S. Miller [Tue, 16 May 2017 16:59:05 +0000 (12:59 -0400)]
Merge branch 'nfp-LSO-checksum-and-XDP-datapath-updates'

Jakub Kicinski says:

====================
nfp: LSO, checksum and XDP datapath updates

This series introduces a number of refinements to standard features
like LSO and checksum offload.  Three major features are support for
CHECKSUM_COMPLETE, refinement of TSO handling and another small speed
up for XDP TX.  This series also switches from depending on some
app FW<>driver ABI versions to heavier use of capabilities.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: eliminate an if statement in calculation of completed frames
Jakub Kicinski [Tue, 16 May 2017 00:55:23 +0000 (17:55 -0700)]
nfp: eliminate an if statement in calculation of completed frames

Given that our rings are always a power of 2, we can simplify the
calculation of number of completed TX descriptors by using masking
instead of if statement based on whether the index have wrapped
or not.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: add a helper for wrapping descriptor index
Jakub Kicinski [Tue, 16 May 2017 00:55:22 +0000 (17:55 -0700)]
nfp: add a helper for wrapping descriptor index

We have a number of places where we calculate the descriptor
index based on a value which may have overflown.  Create a
macro for masking with the ring size.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: complete the XDP TX ring only when it's full
Jakub Kicinski [Tue, 16 May 2017 00:55:21 +0000 (17:55 -0700)]
nfp: complete the XDP TX ring only when it's full

Since XDP TX ring holds "spare" RX buffers anyway, we don't have to
rush the completion.  We can wait until ring fills up completely
before trying to reclaim buffers.  If RX poll has ended an no
buffer has been queued for XDP TX we have no guarantee we will see
another interrupt, so run the reclaim there as well, to make sure
TX statistics won't become stale.

This should help us reclaim more buffers per single queue controller
register read.

Note that the XDP completion is very trivial, it only adds up
the sizes of transmitted frames for statistics so the latency
spike should be acceptable.  In case user sets the ring sizes
to something crazy, limit the completion to 2k entries.

The check if the ring is empty at the beginning of xdp_complete()
is no longer needed - the callers will perform it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: add CHECKSUM_COMPLETE support
Jakub Kicinski [Tue, 16 May 2017 00:55:20 +0000 (17:55 -0700)]
nfp: add CHECKSUM_COMPLETE support

Introduce NFP_NET_CFG_CTRL_CSUM_COMPLETE capability and implement parsing
of CHECKSUM_COMPLETE metadata.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: version independent support for chained RSS metadata
Edwin Peer [Tue, 16 May 2017 00:55:19 +0000 (17:55 -0700)]
nfp: version independent support for chained RSS metadata

ABI version 4 introduced metadata chaining. Using the ABI version to signal
metadata chaining precludes firmware that advertises new capabilities which
rely on prepended metadata from working on older kernels.

Capability bits are thus better suited to signalling the chained metadata
format. A new version of the RSS capability is introduced to distinguish
between the differing metadata formats for ABI versions other than 4.

Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: don't assume RSS and IRQ moderation are always enabled
Jakub Kicinski [Tue, 16 May 2017 00:55:18 +0000 (17:55 -0700)]
nfp: don't assume RSS and IRQ moderation are always enabled

Even if capability for RSS and IRQ moderation are present we may
have not initialized them for control vNIC.  Depend on selected
features mask (ctrl) rather than capabilities (cap) to determine
which features should be enabled.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: support LSO2 capability
Edwin Peer [Tue, 16 May 2017 00:55:17 +0000 (17:55 -0700)]
nfp: support LSO2 capability

Firmware advertising the LSO2 capability exploits driver provided L3 and L4
offsets in order to avoid parsing packet headers in the TX path. The vlan
field in struct nfp_net_tx_desc is repurposed, making TXVLAN a mutually
exclusive configuration to LSO2.

Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: rename l4_offset in struct nfp_net_tx_desc to lso_hdrlen
Edwin Peer [Tue, 16 May 2017 00:55:16 +0000 (17:55 -0700)]
nfp: rename l4_offset in struct nfp_net_tx_desc to lso_hdrlen

The l4_offset field referred to by NFD is confusingly named. It is not the
offset of the L4 transport header, but rather the L4 payload.

The LSO2 capability supported by alternative device firmware requires
the actual L4 offset, thus the rename seems prudent.

Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: don't enable TSO on the device when disabled
Jakub Kicinski [Tue, 16 May 2017 00:55:15 +0000 (17:55 -0700)]
nfp: don't enable TSO on the device when disabled

We advertise TSO to the stack but leave it disabled by default.
Make sure it's not only disabled in the netdev features but
also on the device itself.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: socket: mark socket protocol handler structs as const
linzhang [Mon, 15 May 2017 02:26:47 +0000 (10:26 +0800)]
net: socket: mark socket protocol handler structs as const

Signed-off-by: linzhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: hv: Add clean up for included files in Ubuntu net config
Haiyang Zhang [Fri, 12 May 2017 19:14:11 +0000 (12:14 -0700)]
tools: hv: Add clean up for included files in Ubuntu net config

The clean up function is updated to cover duplicate config info in
files included by "source" key word in Ubuntu network config.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobnxt: add dma mapping attributes
Shannon Nelson [Wed, 10 May 2017 01:30:12 +0000 (18:30 -0700)]
bnxt: add dma mapping attributes

On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING attribute
in our Rx path dma mapping in order to get the expected performance out
of the receive path.  Adding it to the Tx path has little effect, so
that's not a part of this patch.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'xgene-Add-ethtool-stats-and-bug-fixes'
David S. Miller [Tue, 16 May 2017 15:41:11 +0000 (11:41 -0400)]
Merge branch 'xgene-Add-ethtool-stats-and-bug-fixes'

Iyappan Subramanian says:

====================
drivers: net: xgene: Add ethtool stats and bug fixes

This patch set,

- adds ethtool extended statistics support
- addresses errata workarounds
- fixes bugs related to statistics

v2: Address review comments from v1
- Adds lock to protect mdio-xgene indirect MAC access
- Refactors xgene-enet indirect MAC read/write functions
- Uses mdio-xgene MAC access routines, if xgene-enet port
  use the same HW.
v1:
- Initial version

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Fix redundant prefetch buffer cleanup
Iyappan Subramanian [Wed, 10 May 2017 20:45:10 +0000 (13:45 -0700)]
drivers: net: xgene: Fix redundant prefetch buffer cleanup

Prefetch buffer cleanup code was called twice, causing EDAC to
report errors during reboot.

[ 1130.972475] xgene-edac 78800000.edac: IOB bridge agent (BA) transaction
error
[ 1130.979584] xgene-edac 78800000.edac: IOB BA write response error
[ 1130.985648] xgene-edac 78800000.edac: IOB BA write access at 0x00.00000000
()
[ 1130.993612] xgene-edac 78800000.edac: IOB BA requestor ID 0x00002400
[ 1131.000242] xgene-edac 78800000.edac: IOB bridge agent (BA) transaction
error
...

This patch fixes the errors by,

- removing the redundant prefetch buffer cleanup from port_ops->shutdown()
- moving port_ops->shutdown() after delete_rings()

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Workaround for HW errata 10GE_10/ENET_15
Quan Nguyen [Wed, 10 May 2017 20:45:09 +0000 (13:45 -0700)]
drivers: net: xgene: Workaround for HW errata 10GE_10/ENET_15

This patch adds workaround for HW errata 10GE_10 and ENET_15:
"HW statistic counters value are duplicated".

- RFCS duplicates RALN counter
- RFLR duplicates RUND counter
- TFCS duplicates TFRG counter
- RALN should be intepreted as 0 in 10G mode

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Add frame recovered statistics counter for errata 10GE_8/ENET_11
Quan Nguyen [Wed, 10 May 2017 20:45:08 +0000 (13:45 -0700)]
drivers: net: xgene: Add frame recovered statistics counter for errata 10GE_8/ENET_11

This patch adds statistic counter for frames recovered from HW errata
10GE_8 and ENET_11:
"HW reports Length error for valid 64 byte frames with len <46 bytes".

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Workaround for HW errata 10GE_4
Quan Nguyen [Wed, 10 May 2017 20:45:07 +0000 (13:45 -0700)]
drivers: net: xgene: Workaround for HW errata 10GE_4

This patch adds workaround for HW errata 10GE_4:
"XGENET_ICM_ECM_DROP_COUNT_REG_0 reg not clear on read".

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Add rx_overrun/tx_underrun statistics
Iyappan Subramanian [Wed, 10 May 2017 20:45:06 +0000 (13:45 -0700)]
drivers: net: xgene: Add rx_overrun/tx_underrun statistics

This patch adds rx_overrun and tx_underrun ethtool statistic counters.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Extend ethtool statistics
Quan Nguyen [Wed, 10 May 2017 20:45:05 +0000 (13:45 -0700)]
drivers: net: xgene: Extend ethtool statistics

This patch adds extended ethtool statistics support.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Remove unused macros
Quan Nguyen [Wed, 10 May 2017 20:45:04 +0000 (13:45 -0700)]
drivers: net: xgene: Remove unused macros

This patch cleans up unused macros to improve readability.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Refactor statistics error parsing code
Quan Nguyen [Wed, 10 May 2017 20:45:03 +0000 (13:45 -0700)]
drivers: net: xgene: Refactor statistics error parsing code

This patch fixes the tx error counters and adds more rx error counters.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Remove redundant local stats
Quan Nguyen [Wed, 10 May 2017 20:45:02 +0000 (13:45 -0700)]
drivers: net: xgene: Remove redundant local stats

Commit 5944701df90d ("net: remove useless memset's in drivers get_stats64")
makes the pdata->stats redundant. This patch removes pdata->stats and
updates get_stats64() callback accordingly.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Use rgmii mdio mac access
Quan Nguyen [Wed, 10 May 2017 20:45:01 +0000 (13:45 -0700)]
drivers: net: xgene: Use rgmii mdio mac access

This patch switches to use rgmii mdio mac access routines if available,
as they share the same HW.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: phy: xgene: Add lock to protect mac access
Quan Nguyen [Wed, 10 May 2017 20:45:00 +0000 (13:45 -0700)]
drivers: net: phy: xgene: Add lock to protect mac access

This patch,

- refactors mac access routine
- adds lock to protect mac indirect access

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers: net: xgene: Protect indirect MAC access
Iyappan Subramanian [Wed, 10 May 2017 20:44:59 +0000 (13:44 -0700)]
drivers: net: xgene: Protect indirect MAC access

This patch,

     - refactors mac read/write functions
     - adds lock to protect indirect mac access

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Mon, 15 May 2017 22:50:49 +0000 (15:50 -0700)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Track alignment in BPF verifier so that legitimate programs won't be
    rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

 2) Make tail calls work properly in arm64 BPF JIT, from Deniel
    Borkmann.

 3) Make the configuration and semantics Generic XDP make more sense and
    don't allow both generic XDP and a driver specific instance to be
    active at the same time. Also from Daniel.

 4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.

 5) Fix use-after-free in VRF driver, from Gao Feng.

 6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
    qca_spi driver, from Stefan Wahren.

 7) Always run cleanup routines in BPF samples when we get SIGTERM, from
    Andy Gospodarek.

 8) The mdio phy code should bring PHYs out of reset using the shared
    GPIO lines before invoking bus->reset(). From Florian Fainelli.

 9) Some USB descriptor access endian fixes in various drivers from
    Johan Hovold.

10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
    Pressman.

11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.

12) Cure netdev leak in AF_PACKET when using timestamping via control
    messages. From Douglas Caetano dos Santos.

13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav
    Lichvar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  ldmvsw: stop the clean timer at beginning of remove
  ldmvsw: unregistering netdev before disable hardware
  net: netcp: fix check of requested timestamping filter
  ipv6: avoid dad-failures for addresses with NODAD
  qed: Fix uninitialized data in aRFS infrastructure
  mdio: mux: fix device_node_continue.cocci warnings
  net/packet: fix missing net_device reference release
  net/mlx4_core: Use min3 to select number of MSI-X vectors
  macvlan: Fix performance issues with vlan tagged packets
  net: stmmac: use correct pointer when printing normal descriptor ring
  net/mlx5: Use underlay QPN from the root name space
  net/mlx5e: IPoIB, Only support regular RQ for now
  net/mlx5e: Fix setup TC ndo
  net/mlx5e: Fix ethtool pause support and advertise reporting
  net/mlx5e: Use the correct pause values for ethtool advertising
  vmxnet3: ensure that adapter is in proper state during force_close
  sfc: revert changes to NIC revision numbers
  net: ch9200: add missing USB-descriptor endianness conversions
  net: irda: irda-usb: fix firmware name on big-endian hosts
  net: dsa: mv88e6xxx: add default case to switch
  ...

7 years agoMerge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Mon, 15 May 2017 22:27:02 +0000 (15:27 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "A set of minor cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Minor cleanup of xattr query function
  fs: cifs: transport: Use time_after for time comparison
  SMB2: Fix share type handling
  cifs: cifsacl: Use a temporary ops variable to reduce code length
  Don't delay freeing mids when blocked on slow socket write of request
  CIFS: silence lockdep splat in cifs_relock_file()

7 years agoMerge branch 'ldmsw-fixes'
David S. Miller [Mon, 15 May 2017 19:36:09 +0000 (15:36 -0400)]
Merge branch 'ldmsw-fixes'

Shannon Nelson says:

====================
ldmvsw: port removal stability

Under heavy reboot stress testing we found a couple of timing issues
when removing the device that could cause the kernel great heartburn,
addressed by these two patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoldmvsw: stop the clean timer at beginning of remove
Shannon Nelson [Mon, 15 May 2017 17:51:08 +0000 (10:51 -0700)]
ldmvsw: stop the clean timer at beginning of remove

Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.

Orabug: 25748241

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoldmvsw: unregistering netdev before disable hardware
Thomas Tai [Mon, 15 May 2017 17:51:07 +0000 (10:51 -0700)]
ldmvsw: unregistering netdev before disable hardware

When running LDom binding/unbinding test, kernel may panic
in ldmvsw_open(). It is more likely that because we're removing
the ldc connection before unregistering the netdev in vsw_port_remove(),
we set up a window of time where one process could be removing the
device while another trying to UP the device. This also sometimes causes
vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".

Orabug: 2598091325925306

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: netcp: fix check of requested timestamping filter
Miroslav Lichvar [Mon, 15 May 2017 14:04:36 +0000 (16:04 +0200)]
net: netcp: fix check of requested timestamping filter

The driver doesn't support timestamping of all received packets and
should return error when trying to enable the HWTSTAMP_FILTER_ALL
filter.

Cc: WingMan Kwok <w-kwok2@ti.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'mlx5-fixes-2017-05-12-V2' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Mon, 15 May 2017 18:38:04 +0000 (14:38 -0400)]
Merge tag 'mlx5-fixes-2017-05-12-V2' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2017-05-12

This series contains some mlx5 fixes for net.
Please pull and let me know if there's any problem.

For -stable:
("net/mlx5e: Fix ethtool pause support and advertise reporting") kernels >= 4.8
("net/mlx5e: Use the correct pause values for ethtool advertising") kernels >= 4.8

v1->v2:
 Dropped statistics spinlock patch, it needs some extra work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv6: avoid dad-failures for addresses with NODAD
Mahesh Bandewar [Sat, 13 May 2017 00:03:39 +0000 (17:03 -0700)]
ipv6: avoid dad-failures for addresses with NODAD

Every address gets added with TENTATIVE flag even for the addresses with
IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
we realize it's an address with NODAD and complete the process without
sending any probe. However the TENTATIVE flags stays on the
address for sometime enough to cause misinterpretation when we receive a NS.
While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
and endup with an address that was originally configured as NODAD with
DADFAILED.

We can't avoid scheduling dad_work for addresses with NODAD but we can
avoid adding TENTATIVE flag to avoid this racy situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Fix uninitialized data in aRFS infrastructure
Mintz, Yuval [Sun, 14 May 2017 09:21:23 +0000 (12:21 +0300)]
qed: Fix uninitialized data in aRFS infrastructure

Current memset is using incorrect type of variable, causing the
upper-half of the strucutre to be left uninitialized and causing:

  ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable':
  ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized]

Fixes: d51e4af5c209 ("qed: aRFS infrastructure support")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomdio: mux: fix device_node_continue.cocci warnings
Julia Lawall [Fri, 12 May 2017 14:54:23 +0000 (22:54 +0800)]
mdio: mux: fix device_node_continue.cocci warnings

Device node iterators put the previous value of the index variable, so an
explicit put causes a double put.

In particular, of_mdiobus_register can fail before doing anything
interesting, so one could view it as a no-op from the reference count
point of view.

Generated by: scripts/coccinelle/iterators/device_node_continue.cocci

CC: Jon Mason <jon.mason@broadcom.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet/packet: fix missing net_device reference release
Douglas Caetano dos Santos [Fri, 12 May 2017 18:19:15 +0000 (15:19 -0300)]
net/packet: fix missing net_device reference release

When using a TX ring buffer, if an error occurs processing a control
message (e.g. invalid message), the net_device reference is not
released.

Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet/mlx4_core: Use min3 to select number of MSI-X vectors
yuval.shaia@oracle.com [Fri, 12 May 2017 06:10:51 +0000 (09:10 +0300)]
net/mlx4_core: Use min3 to select number of MSI-X vectors

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomacvlan: Fix performance issues with vlan tagged packets
Vlad Yasevich [Thu, 11 May 2017 15:09:52 +0000 (11:09 -0400)]
macvlan: Fix performance issues with vlan tagged packets

Macvlan always turns on offload features that have sofware
fallback (NETIF_GSO_SOFTWARE).  This allows much higher guest-guest
communications over macvtap.

However, macvtap does not turn on these features for vlan tagged traffic.
As a result, depending on the HW that mactap is configured on, the
performance of guest-guest communication over a vlan is very
inconsistent.  If the HW supports TSO/UFO over vlans, then the
performance will be fine.  If not, the the performance will suffer
greatly since the VM may continue using TSO/UFO, and will force the host
segment the traffic and possibly overlow the macvtap queue.

This patch adds the always on offloads to vlan_features.  This
makes sure that any vlan tagged traffic between 2 guest will not
be segmented needlessly.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: use correct pointer when printing normal descriptor ring
Niklas Cassel [Mon, 15 May 2017 08:56:06 +0000 (10:56 +0200)]
net: stmmac: use correct pointer when printing normal descriptor ring

There are two pointers in sysfs_display_ring,
one that increments if using normal dma descriptors,
another if using extended dma descriptors.

When printing the normal dma descriptors, the wrong pointer is used,
thus the printed descriptor addresses are incorrect.

Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet/mlx5: Use underlay QPN from the root name space
Yishai Hadas [Tue, 25 Apr 2017 07:39:57 +0000 (10:39 +0300)]
net/mlx5: Use underlay QPN from the root name space

Root flow table is dynamically changed by the underlying flow steering
layer, and IPoIB/ULPs have no idea what will be the root flow table in
the future, hence we need a dynamic infrastructure to move Underlay QPs
with the root flow table.

Fixes: b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
7 years agonet/mlx5e: IPoIB, Only support regular RQ for now
Saeed Mahameed [Thu, 4 May 2017 14:53:32 +0000 (17:53 +0300)]
net/mlx5e: IPoIB, Only support regular RQ for now

IPoIB doesn't support striding RQ at the moment, for this
we need to explicitly choose non striding RQ in IPoIB init,
even if the HW supports it.

Fixes: 8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
7 years agonet/mlx5e: Fix setup TC ndo
Saeed Mahameed [Tue, 9 May 2017 13:40:46 +0000 (16:40 +0300)]
net/mlx5e: Fix setup TC ndo

Fail-safe support patches introduced a trivial bug,
setup tc callback is doing a wrong check of the netdevice state,
the fix is simply to invert the condition.

Fixes: 6f9485af4020 ("net/mlx5e: Fail safe tc setup")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
7 years agonet/mlx5e: Fix ethtool pause support and advertise reporting
Gal Pressman [Wed, 19 Apr 2017 11:35:15 +0000 (14:35 +0300)]
net/mlx5e: Fix ethtool pause support and advertise reporting

Pause bit should set when RX pause is on, not TX pause.
Also, setting Asym_Pause is incorrect, and should be turned off.

Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
7 years agonet/mlx5e: Use the correct pause values for ethtool advertising
Gal Pressman [Mon, 3 Apr 2017 12:11:22 +0000 (15:11 +0300)]
net/mlx5e: Use the correct pause values for ethtool advertising

Query the operational pause from firmware (PFCC register) instead of
always passing zeros.

Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
7 years agoLinux 4.12-rc1
Linus Torvalds [Sat, 13 May 2017 20:19:49 +0000 (13:19 -0700)]
Linux 4.12-rc1

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Sat, 13 May 2017 17:25:05 +0000 (10:25 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull some more input subsystem updates from Dmitry Torokhov:
 "An updated xpad driver with a few more recognized device IDs, and a
  new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
  via SPI bus"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: cros_ec_keyb - remove extraneous 'const'
  Input: add support for PlayStation 1/2 joypads connected via SPI
  Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
  Input: xpad - sync supported devices with xboxdrv
  Input: xpad - sort supported devices by USB ID

7 years agoMerge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs
Linus Torvalds [Sat, 13 May 2017 17:23:12 +0000 (10:23 -0700)]
Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - new config option CONFIG_UBIFS_FS_SECURITY

 - minor improvements

 - random fixes

* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
  ubi: Add debugfs file for tracking PEB state
  ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
  ubifs: Remove unnecessary assignment
  ubifs: Fix cut and paste error on sb type comparisons
  ubi: fastmap: Fix slab corruption
  ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
  ubi: Make mtd parameter readable
  ubi: Fix section mismatch

7 years agoMerge branch 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 13 May 2017 17:20:02 +0000 (10:20 -0700)]
Merge branch 'for-linus-4.12-rc1' of git://git./linux/kernel/git/rw/uml

Pull UML fixes from Richard Weinberger:
 "No new stuff, just fixes"

* 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: Add missing NR_CPUS include
  um: Fix to call read_initrd after init_bootmem
  um: Include kbuild.h instead of duplicating its macros
  um: Fix PTRACE_POKEUSER on x86_64
  um: Set number of CPUs
  um: Fix _print_addr()

7 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 13 May 2017 16:49:35 +0000 (09:49 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, docs: update memory.stat description with workingset* entries
  mm: vmscan: scan until it finds eligible pages
  mm, thp: copying user pages must schedule on collapse
  dax: fix PMD data corruption when fault races with write
  dax: fix data corruption when fault races with write
  ext4: return to starting transaction in ext4_dax_huge_fault()
  mm: fix data corruption due to stale mmap reads
  dax: prevent invalidation of mapped DAX entries
  Tigran has moved
  mm, vmalloc: fix vmalloc users tracking properly
  mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
  gcov: support GCC 7.1
  mm, vmstat: Remove spurious WARN() during zoneinfo print
  time: delete current_fs_time()
  hwpoison, memcg: forcibly uncharge LRU pages

7 years ago[CIFS] Minor cleanup of xattr query function
Steve French [Sat, 13 May 2017 01:59:10 +0000 (20:59 -0500)]
[CIFS] Minor cleanup of xattr query function

Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).

Signed-off-by: Steve French <steve.french@primarydata.com>
7 years agofs: cifs: transport: Use time_after for time comparison
Karim Eshapa [Thu, 11 May 2017 23:53:38 +0000 (01:53 +0200)]
fs: cifs: transport: Use time_after for time comparison

Use time_after kernel macro for time comparison
that has safety check.

Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
7 years agoSMB2: Fix share type handling
Christophe JAILLET [Fri, 12 May 2017 15:59:32 +0000 (17:59 +0200)]
SMB2: Fix share type handling

In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK    0x01
#define SMB2_SHARE_TYPE_PIPE    0x02
#define SMB2_SHARE_TYPE_PRINT   0x03

Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.

So, test the ShareType value for equality instead.

Fixes: faaf946a7d5b ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
7 years agocifs: cifsacl: Use a temporary ops variable to reduce code length
Joe Perches via samba-technical [Sun, 7 May 2017 08:31:47 +0000 (01:31 -0700)]
cifs: cifsacl: Use a temporary ops variable to reduce code length

Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.

$ size fs/cifs/cifsacl.o*
   text    data     bss     dec     hex filename
   5338     136       8    5482    156a fs/cifs/cifsacl.o.new
   5371     136       8    5515    158b fs/cifs/cifsacl.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
7 years agomm, docs: update memory.stat description with workingset* entries
Roman Gushchin [Fri, 12 May 2017 22:47:09 +0000 (15:47 -0700)]
mm, docs: update memory.stat description with workingset* entries

Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
workingset transition") introduced three new entries in memory stat
file:

 - workingset_refault
 - workingset_activate
 - workingset_nodereclaim

This commit adds a corresponding description to the cgroup v2 docs.

Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: vmscan: scan until it finds eligible pages
Minchan Kim [Fri, 12 May 2017 22:47:06 +0000 (15:47 -0700)]
mm: vmscan: scan until it finds eligible pages

Although there are a ton of free swap and anonymous LRU page in elgible
zones, OOM happened.

  balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
  CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
   oom_kill_process+0x21d/0x3f0
   out_of_memory+0xd8/0x390
   __alloc_pages_slowpath+0xbc1/0xc50
   __alloc_pages_nodemask+0x1a5/0x1c0
   pte_alloc_one+0x20/0x50
   __pte_alloc+0x1e/0x110
   __handle_mm_fault+0x919/0x960
   handle_mm_fault+0x77/0x120
   __do_page_fault+0x27a/0x550
   trace_do_page_fault+0x43/0x150
   do_async_page_fault+0x2c/0x90
   async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:424716 inactive_anon:65314 isolated_anon:0
   active_file:52 inactive_file:46 isolated_file:0
   unevictable:0 dirty:27 writeback:0 unstable:0
   slab_reclaimable:3967 slab_unreclaimable:4125
   mapped:133 shmem:43 pagetables:1674 bounce:0
   free:4637 free_pcp:225 free_cma:0
  Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 992 992 1952
  DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 959
  Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
  DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
  Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
  378 total pagecache pages
  17 pages in swap cache
  Swap cache stats: add 17325, delete 17302, find 0/27
  Free swap  = 978940kB
  Total swap = 1048572kB
  524157 pages RAM
  0 pages HighMem/MovableOnly
  12629 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned
  [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
  [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
  [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd

With investigation, skipping page of isolate_lru_pages makes reclaim
void because it returns zero nr_taken easily so LRU shrinking is
effectively nothing and just increases priority aggressively.  Finally,
OOM happens.

The problem is that get_scan_count determines nr_to_scan with eligible
zones so although priority drops to zero, it couldn't reclaim any pages
if the LRU contains mostly ineligible pages.

get_scan_count:

        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
size = size >> sc->priority;

Assumes sc->priority is 0 and LRU list is as follows.

N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

(Ie, small eligible pages are in the head of LRU but others are
 almost ineligible pages)

In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
tail of the LRU are not eligible pages.  If get_scan_count counts
skipped pages, it doesn't reclaim any pages remained after scanning 4
pages so it ends up OOM happening.

This patch makes isolate_lru_pages try to scan pages until it encounters
eligible zones's pages.

[akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm, thp: copying user pages must schedule on collapse
David Rientjes [Fri, 12 May 2017 22:47:03 +0000 (15:47 -0700)]
mm, thp: copying user pages must schedule on collapse

We have encountered need_resched warnings in __collapse_huge_page_copy()
while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

mm->mmap_sem is held for write, but the iteration is well bounded.

Reschedule as needed.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agodax: fix PMD data corruption when fault races with write
Ross Zwisler [Fri, 12 May 2017 22:47:00 +0000 (15:47 -0700)]
dax: fix PMD data corruption when fault races with write

This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
  - we add huge zero page to the radix tree
    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agodax: fix data corruption when fault races with write
Jan Kara [Fri, 12 May 2017 22:46:57 +0000 (15:46 -0700)]
dax: fix data corruption when fault races with write

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2) CPU2 - read fault
dax_iomap_pte_fault()
  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
  grab_mapping_entry()
  - we add zero page in the radix tree
    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoext4: return to starting transaction in ext4_dax_huge_fault()
Jan Kara [Fri, 12 May 2017 22:46:54 +0000 (15:46 -0700)]
ext4: return to starting transaction in ext4_dax_huge_fault()

DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: fix data corruption due to stale mmap reads
Jan Kara [Fri, 12 May 2017 22:46:50 +0000 (15:46 -0700)]
mm: fix data corruption due to stale mmap reads

Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX.  That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:

 - open an mmap over a 2MiB hole

 - read from a 2MiB hole, faulting in a 2MiB zero page

 - write to the hole with write(3p). The write succeeds but we
   incorrectly leave the 2MiB zero page mapping intact.

 - via the mmap, read the data that was just written. Since the zero
   page mapping is still intact we read back zeroes instead of the new
   data.

Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.

Fixes: c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agodax: prevent invalidation of mapped DAX entries
Ross Zwisler [Fri, 12 May 2017 22:46:47 +0000 (15:46 -0700)]
dax: prevent invalidation of mapped DAX entries

Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.

This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.

This patch (of 4):

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

  invalidate_mapping_pages()
    invalidate_exceptional_entry()
      dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped.  This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry.  This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoTigran has moved
Andrew Morton [Fri, 12 May 2017 22:46:44 +0000 (15:46 -0700)]
Tigran has moved

Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>