Johannes Berg [Wed, 24 May 2017 07:07:47 +0000 (09:07 +0200)]
skbuff/mac80211: introduce and use skb_put_zero()
This pattern was introduced a number of times in mac80211 just now,
and since it's present in a number of other places it makes sense
to add a little helper for it.
This just adds the helper and transforms the mac80211 code, a later
patch will transform other places.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Simon Wunderlich [Tue, 23 May 2017 15:00:43 +0000 (17:00 +0200)]
mac80211: enable VHT for mesh channel processing
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Simon Wunderlich [Tue, 23 May 2017 15:00:42 +0000 (17:00 +0200)]
mac80211: mesh: support sending wide bandwidth CSA
To support HT and VHT CSA, beacons and action frames must include the
corresponding IEs.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
[make ieee80211_ie_build_wide_bw_cs() return void]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Simon Wunderlich [Tue, 16 May 2017 09:23:16 +0000 (11:23 +0200)]
mac80211: mark as action frame when parsing IEs of CSA action frames
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Benjamin Berg [Tue, 16 May 2017 09:23:13 +0000 (11:23 +0200)]
mac80211: mesh: Allow following CSA to DFS channels if userspace handles it
If userspace has flagged support for DFS earlier, then we can follow CSA
to DFS channels. So instead of rejecting the switch, allow it to happen
if the flag has been set during mesh setup.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Benjamin Berg [Tue, 16 May 2017 09:23:12 +0000 (11:23 +0200)]
wireless: Require HANDLE_DFS flag to switch channel for non-AP mode
In the case the channel should be switched to one requiring DFS we need
to make sure that userspace will handle radar events when they happen.
For AP mode this is assumed to be the case, as a manager like hostapd
is required. However IBSS and MESH modes can work without further
userspace assistance, so refuse to use DFS channels unless userspace
vouches that it handles DFS.
NOTE: Userspace should have already flagged support earlier during mesh
or IBSS setup. However, this information is not readily accessible
currently.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
[sw: style cleanups]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Benjamin Berg [Tue, 16 May 2017 09:23:11 +0000 (11:23 +0200)]
wireless: Only join DFS channels in mesh mode if userspace flags support
When joining a mesh network it is not guaranteed that userspace has a
daemon listening for radar events. This is however required for channels
requiring DFS. To flag that userspace will handle radar events, it needs
to set NL80211_ATTR_HANDLE_DFS.
This matches the current mechanism used for IBSS mode.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Johannes Berg [Fri, 19 May 2017 11:22:38 +0000 (13:22 +0200)]
mac80211: move clearing result into ieee80211_parse_ch_switch_ie()
Clear the csa_ie in ieee80211_parse_ch_switch_ie() where the data
is filled in, rather than in each caller.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Benjamin Berg [Tue, 16 May 2017 09:23:10 +0000 (11:23 +0200)]
mac80211: mesh: mark channel as unusable if a regulatory MESH CSA is received
In the Mesh Channel Switch Parameters (8.4.2.105) the reason is specified
to WLAN_REASON_MESH_CHAN_REGULATORY in the case that a regulatory
limitation was the cause for the switch. This means another station
detected a radar event.
Mark the channel as unusable if this happens.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
[sw: style cleanup, rebase]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Toke Høiland-Jørgensen [Thu, 6 Apr 2017 09:38:26 +0000 (11:38 +0200)]
mac80211: Dynamically set CoDel parameters per station
CoDel can be too aggressive if a station sends at a very low rate,
leading reduced throughput. This gets worse the more stations are
present, as each station gets more bursty the longer the round-robin
scheduling between stations takes.
This adds dynamic adjustment of CoDel parameters per station. It uses
the rate selection information to estimate throughput and sets more
lenient CoDel parameters if the estimated throughput is below a
threshold (modified by the number of active stations).
A new callback is added that drivers can use to notify mac80211 about
changes in expected throughput, so the same adjustment can be made for
cards that implement rate control in firmware. Drivers that don't use
this will just get the default parameters.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[remove currently unnecessary EXPORT_SYMBOL, fix kernel-doc, remove
inline annotation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Johannes Berg [Thu, 4 May 2017 05:52:10 +0000 (07:52 +0200)]
cfg80211: improve warnings in VHT rate calculation
Linus reported hitting the bandwidth warning, but it is indeed
pretty useless - improve it by printing the rate configuration
and make it only warn once, for both warnings here.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Christoph Hellwig [Tue, 16 May 2017 14:21:46 +0000 (16:21 +0200)]
liquidio: use pcie_flr instead of duplicating it
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 16 May 2017 16:29:11 +0000 (18:29 +0200)]
net: phy: Remove residual magic from PHY drivers
commit
fa8cddaf903c ("net phylib: Remove unnecessary condition check in phy")
removed the only place where the PHY flag PHY_HAS_MAGICANEG was
checked. But it left the flag being set in the drivers. Remove the flag.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Leon Romanovsky [Tue, 16 May 2017 12:20:56 +0000 (15:20 +0300)]
bnx2x: Remove open coded carrier check
There is inline function to test if carrier present,
so it makes open-coded solution redundant.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 16 May 2017 11:24:36 +0000 (04:24 -0700)]
tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.
However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
to use BBR for the few TCP flows they initiate/terminate.
This patch implements an automatic fallback to internal pacing.
Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.
One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.
Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.
Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.
If MQ+pfifo_fast is used on the NIC :
$ sar -n DEV 1 5 | grep eth
14:48:44 eth0 725743.00
2932134.00 46776.76
4335184.68 0.00 0.00 1.00
14:48:45 eth0 725349.00
2932112.00 46751.86
4335158.90 0.00 0.00 0.00
14:48:46 eth0 725101.00
2931153.00 46735.07
4333748.63 0.00 0.00 0.00
14:48:47 eth0 725099.00
2931161.00 46735.11
4333760.44 0.00 0.00 1.00
14:48:48 eth0 725160.00
2931731.00 46738.88
4334606.07 0.00 0.00 0.00
Average: eth0 725290.40
2931658.20 46747.54
4334491.74 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0
259825920 45644
2708324 0 0 21 2 247 98 0 0 100 0 0
4 0 0
259823744 45644
2708356 0 0 0 0
2400825 159843 0 19 81 0 0
0 0 0
259824208 45644
2708072 0 0 0 0
2407351 159929 0 19 81 0 0
1 0 0
259824592 45644
2708128 0 0 0 0
2405183 160386 0 19 80 0 0
1 0 0
259824272 45644
2707868 0 0 0 32
2396361 158037 0 19 81 0 0
Now use MQ+FQ :
lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq
$ sar -n DEV 1 5 | grep eth
14:49:57 eth0 678614.00
2727930.00 43739.13
4033279.14 0.00 0.00 0.00
14:49:58 eth0 677620.00
2723971.00 43674.69
4027429.62 0.00 0.00 1.00
14:49:59 eth0 676396.00
2719050.00 43596.83
4020125.02 0.00 0.00 0.00
14:50:00 eth0 675197.00
2714173.00 43518.62
4012938.90 0.00 0.00 1.00
14:50:01 eth0 676388.00
2719063.00 43595.47
4020171.64 0.00 0.00 0.00
Average: eth0 676843.00
2720837.40 43624.95
4022788.86 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0
259832240 46008
2710912 0 0 21 2 223 192 0 1 99 0 0
1 0 0
259832896 46008
2710744 0 0 0 0
1702206 198078 0 17 82 0 0
0 0 0
259830272 46008
2710596 0 0 0 0
1696340 197756 1 17 83 0 0
4 0 0
259829168 46024
2710584 0 0 16 0
1688472 197158 1 17 82 0 0
3 0 0
259830224 46024
2710408 0 0 0 0
1692450 197212 0 18 82 0 0
As expected, number of interrupts per second is very different.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 May 2017 19:41:31 +0000 (15:41 -0400)]
Merge branch 'udp-scalability-improvements'
Paolo Abeni says:
====================
udp: scalability improvements
This patch series implement an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.
An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.
The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.
The measured overall gain under udp flood is up to the 30% depending on
the numa layout and the number of ingress queue used by the relevant nic.
The performance numbers have been gathered using pktgen as sender, with 64
bytes packets, random src port on a host b2b connected via a 10Gbs link
with the dut.
The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
the ingress nic, so that the number of ingress nic rx queues hit by the udp
traffic could be controlled via ethtool -L.
The udp_sink program was bound to the first idle cpu, to get more
stable numbers.
On a single numa node receiver:
nic rx queues vanilla patched kernel
1 1820 kpps 1900 kpps
2 1950 kpps 2500 kpps
16 1670 kpps 2120 kpps
When using a single nic rx queue, busy polling was also enabled,
elsewhere, in the above scenario, the bh processing becomes the bottle-neck
and this produces large artifacts in the measured performances (e.g.
improving the udp sink run time, decreases the overall tput, since more
action from the scheduler comes into play).
[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
v1 -> v2:
Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
of udp_rmem_release() is now a bool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 16 May 2017 09:20:15 +0000 (11:20 +0200)]
udp: keep the sk_receive_queue held when splicing
On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.
v1 -> v2:
the rx_queue_lock_held param in udp_rmem_release() is
now a bool
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 16 May 2017 09:20:14 +0000 (11:20 +0200)]
udp: use a separate rx queue for packet reception
under udp flood the sk_receive_queue spinlock is heavily contended.
This patch try to reduce the contention on such lock adding a
second receive queue to the udp sockets; recvmsg() looks first
in such queue and, only if empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.
The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.
On specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 16 May 2017 09:20:13 +0000 (11:20 +0200)]
net/sock: factor out dequeue/peek with offset code
And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 May 2017 16:59:05 +0000 (12:59 -0400)]
Merge branch 'nfp-LSO-checksum-and-XDP-datapath-updates'
Jakub Kicinski says:
====================
nfp: LSO, checksum and XDP datapath updates
This series introduces a number of refinements to standard features
like LSO and checksum offload. Three major features are support for
CHECKSUM_COMPLETE, refinement of TSO handling and another small speed
up for XDP TX. This series also switches from depending on some
app FW<>driver ABI versions to heavier use of capabilities.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:23 +0000 (17:55 -0700)]
nfp: eliminate an if statement in calculation of completed frames
Given that our rings are always a power of 2, we can simplify the
calculation of number of completed TX descriptors by using masking
instead of if statement based on whether the index have wrapped
or not.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:22 +0000 (17:55 -0700)]
nfp: add a helper for wrapping descriptor index
We have a number of places where we calculate the descriptor
index based on a value which may have overflown. Create a
macro for masking with the ring size.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:21 +0000 (17:55 -0700)]
nfp: complete the XDP TX ring only when it's full
Since XDP TX ring holds "spare" RX buffers anyway, we don't have to
rush the completion. We can wait until ring fills up completely
before trying to reclaim buffers. If RX poll has ended an no
buffer has been queued for XDP TX we have no guarantee we will see
another interrupt, so run the reclaim there as well, to make sure
TX statistics won't become stale.
This should help us reclaim more buffers per single queue controller
register read.
Note that the XDP completion is very trivial, it only adds up
the sizes of transmitted frames for statistics so the latency
spike should be acceptable. In case user sets the ring sizes
to something crazy, limit the completion to 2k entries.
The check if the ring is empty at the beginning of xdp_complete()
is no longer needed - the callers will perform it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:20 +0000 (17:55 -0700)]
nfp: add CHECKSUM_COMPLETE support
Introduce NFP_NET_CFG_CTRL_CSUM_COMPLETE capability and implement parsing
of CHECKSUM_COMPLETE metadata.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edwin Peer [Tue, 16 May 2017 00:55:19 +0000 (17:55 -0700)]
nfp: version independent support for chained RSS metadata
ABI version 4 introduced metadata chaining. Using the ABI version to signal
metadata chaining precludes firmware that advertises new capabilities which
rely on prepended metadata from working on older kernels.
Capability bits are thus better suited to signalling the chained metadata
format. A new version of the RSS capability is introduced to distinguish
between the differing metadata formats for ABI versions other than 4.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:18 +0000 (17:55 -0700)]
nfp: don't assume RSS and IRQ moderation are always enabled
Even if capability for RSS and IRQ moderation are present we may
have not initialized them for control vNIC. Depend on selected
features mask (ctrl) rather than capabilities (cap) to determine
which features should be enabled.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edwin Peer [Tue, 16 May 2017 00:55:17 +0000 (17:55 -0700)]
nfp: support LSO2 capability
Firmware advertising the LSO2 capability exploits driver provided L3 and L4
offsets in order to avoid parsing packet headers in the TX path. The vlan
field in struct nfp_net_tx_desc is repurposed, making TXVLAN a mutually
exclusive configuration to LSO2.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edwin Peer [Tue, 16 May 2017 00:55:16 +0000 (17:55 -0700)]
nfp: rename l4_offset in struct nfp_net_tx_desc to lso_hdrlen
The l4_offset field referred to by NFD is confusingly named. It is not the
offset of the L4 transport header, but rather the L4 payload.
The LSO2 capability supported by alternative device firmware requires
the actual L4 offset, thus the rename seems prudent.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 May 2017 00:55:15 +0000 (17:55 -0700)]
nfp: don't enable TSO on the device when disabled
We advertise TSO to the stack but leave it disabled by default.
Make sure it's not only disabled in the netdev features but
also on the device itself.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
linzhang [Mon, 15 May 2017 02:26:47 +0000 (10:26 +0800)]
net: socket: mark socket protocol handler structs as const
Signed-off-by: linzhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Fri, 12 May 2017 19:14:11 +0000 (12:14 -0700)]
tools: hv: Add clean up for included files in Ubuntu net config
The clean up function is updated to cover duplicate config info in
files included by "source" key word in Ubuntu network config.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 10 May 2017 01:30:12 +0000 (18:30 -0700)]
bnxt: add dma mapping attributes
On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING attribute
in our Rx path dma mapping in order to get the expected performance out
of the receive path. Adding it to the Tx path has little effect, so
that's not a part of this patch.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 May 2017 15:41:11 +0000 (11:41 -0400)]
Merge branch 'xgene-Add-ethtool-stats-and-bug-fixes'
Iyappan Subramanian says:
====================
drivers: net: xgene: Add ethtool stats and bug fixes
This patch set,
- adds ethtool extended statistics support
- addresses errata workarounds
- fixes bugs related to statistics
v2: Address review comments from v1
- Adds lock to protect mdio-xgene indirect MAC access
- Refactors xgene-enet indirect MAC read/write functions
- Uses mdio-xgene MAC access routines, if xgene-enet port
use the same HW.
v1:
- Initial version
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Wed, 10 May 2017 20:45:10 +0000 (13:45 -0700)]
drivers: net: xgene: Fix redundant prefetch buffer cleanup
Prefetch buffer cleanup code was called twice, causing EDAC to
report errors during reboot.
[ 1130.972475] xgene-edac
78800000.edac: IOB bridge agent (BA) transaction
error
[ 1130.979584] xgene-edac
78800000.edac: IOB BA write response error
[ 1130.985648] xgene-edac
78800000.edac: IOB BA write access at 0x00.
00000000
()
[ 1130.993612] xgene-edac
78800000.edac: IOB BA requestor ID 0x00002400
[ 1131.000242] xgene-edac
78800000.edac: IOB bridge agent (BA) transaction
error
...
This patch fixes the errors by,
- removing the redundant prefetch buffer cleanup from port_ops->shutdown()
- moving port_ops->shutdown() after delete_rings()
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:09 +0000 (13:45 -0700)]
drivers: net: xgene: Workaround for HW errata 10GE_10/ENET_15
This patch adds workaround for HW errata 10GE_10 and ENET_15:
"HW statistic counters value are duplicated".
- RFCS duplicates RALN counter
- RFLR duplicates RUND counter
- TFCS duplicates TFRG counter
- RALN should be intepreted as 0 in 10G mode
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:08 +0000 (13:45 -0700)]
drivers: net: xgene: Add frame recovered statistics counter for errata 10GE_8/ENET_11
This patch adds statistic counter for frames recovered from HW errata
10GE_8 and ENET_11:
"HW reports Length error for valid 64 byte frames with len <46 bytes".
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:07 +0000 (13:45 -0700)]
drivers: net: xgene: Workaround for HW errata 10GE_4
This patch adds workaround for HW errata 10GE_4:
"XGENET_ICM_ECM_DROP_COUNT_REG_0 reg not clear on read".
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Wed, 10 May 2017 20:45:06 +0000 (13:45 -0700)]
drivers: net: xgene: Add rx_overrun/tx_underrun statistics
This patch adds rx_overrun and tx_underrun ethtool statistic counters.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:05 +0000 (13:45 -0700)]
drivers: net: xgene: Extend ethtool statistics
This patch adds extended ethtool statistics support.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:04 +0000 (13:45 -0700)]
drivers: net: xgene: Remove unused macros
This patch cleans up unused macros to improve readability.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:03 +0000 (13:45 -0700)]
drivers: net: xgene: Refactor statistics error parsing code
This patch fixes the tx error counters and adds more rx error counters.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:02 +0000 (13:45 -0700)]
drivers: net: xgene: Remove redundant local stats
Commit
5944701df90d ("net: remove useless memset's in drivers get_stats64")
makes the pdata->stats redundant. This patch removes pdata->stats and
updates get_stats64() callback accordingly.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:01 +0000 (13:45 -0700)]
drivers: net: xgene: Use rgmii mdio mac access
This patch switches to use rgmii mdio mac access routines if available,
as they share the same HW.
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quan Nguyen [Wed, 10 May 2017 20:45:00 +0000 (13:45 -0700)]
drivers: net: phy: xgene: Add lock to protect mac access
This patch,
- refactors mac access routine
- adds lock to protect mac indirect access
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Wed, 10 May 2017 20:44:59 +0000 (13:44 -0700)]
drivers: net: xgene: Protect indirect MAC access
This patch,
- refactors mac read/write functions
- adds lock to protect indirect mac access
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 15 May 2017 22:50:49 +0000 (15:50 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Track alignment in BPF verifier so that legitimate programs won't be
rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
2) Make tail calls work properly in arm64 BPF JIT, from Deniel
Borkmann.
3) Make the configuration and semantics Generic XDP make more sense and
don't allow both generic XDP and a driver specific instance to be
active at the same time. Also from Daniel.
4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.
5) Fix use-after-free in VRF driver, from Gao Feng.
6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
qca_spi driver, from Stefan Wahren.
7) Always run cleanup routines in BPF samples when we get SIGTERM, from
Andy Gospodarek.
8) The mdio phy code should bring PHYs out of reset using the shared
GPIO lines before invoking bus->reset(). From Florian Fainelli.
9) Some USB descriptor access endian fixes in various drivers from
Johan Hovold.
10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
Pressman.
11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.
12) Cure netdev leak in AF_PACKET when using timestamping via control
messages. From Douglas Caetano dos Santos.
13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav
Lichvar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
ldmvsw: stop the clean timer at beginning of remove
ldmvsw: unregistering netdev before disable hardware
net: netcp: fix check of requested timestamping filter
ipv6: avoid dad-failures for addresses with NODAD
qed: Fix uninitialized data in aRFS infrastructure
mdio: mux: fix device_node_continue.cocci warnings
net/packet: fix missing net_device reference release
net/mlx4_core: Use min3 to select number of MSI-X vectors
macvlan: Fix performance issues with vlan tagged packets
net: stmmac: use correct pointer when printing normal descriptor ring
net/mlx5: Use underlay QPN from the root name space
net/mlx5e: IPoIB, Only support regular RQ for now
net/mlx5e: Fix setup TC ndo
net/mlx5e: Fix ethtool pause support and advertise reporting
net/mlx5e: Use the correct pause values for ethtool advertising
vmxnet3: ensure that adapter is in proper state during force_close
sfc: revert changes to NIC revision numbers
net: ch9200: add missing USB-descriptor endianness conversions
net: irda: irda-usb: fix firmware name on big-endian hosts
net: dsa: mv88e6xxx: add default case to switch
...
Linus Torvalds [Mon, 15 May 2017 22:27:02 +0000 (15:27 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"A set of minor cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Minor cleanup of xattr query function
fs: cifs: transport: Use time_after for time comparison
SMB2: Fix share type handling
cifs: cifsacl: Use a temporary ops variable to reduce code length
Don't delay freeing mids when blocked on slow socket write of request
CIFS: silence lockdep splat in cifs_relock_file()
David S. Miller [Mon, 15 May 2017 19:36:09 +0000 (15:36 -0400)]
Merge branch 'ldmsw-fixes'
Shannon Nelson says:
====================
ldmvsw: port removal stability
Under heavy reboot stress testing we found a couple of timing issues
when removing the device that could cause the kernel great heartburn,
addressed by these two patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Mon, 15 May 2017 17:51:08 +0000 (10:51 -0700)]
ldmvsw: stop the clean timer at beginning of remove
Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.
Orabug:
25748241
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Tai [Mon, 15 May 2017 17:51:07 +0000 (10:51 -0700)]
ldmvsw: unregistering netdev before disable hardware
When running LDom binding/unbinding test, kernel may panic
in ldmvsw_open(). It is more likely that because we're removing
the ldc connection before unregistering the netdev in vsw_port_remove(),
we set up a window of time where one process could be removing the
device while another trying to UP the device. This also sometimes causes
vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".
Orabug:
25980913,
25925306
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Miroslav Lichvar [Mon, 15 May 2017 14:04:36 +0000 (16:04 +0200)]
net: netcp: fix check of requested timestamping filter
The driver doesn't support timestamping of all received packets and
should return error when trying to enable the HWTSTAMP_FILTER_ALL
filter.
Cc: WingMan Kwok <w-kwok2@ti.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 May 2017 18:38:04 +0000 (14:38 -0400)]
Merge tag 'mlx5-fixes-2017-05-12-V2' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2017-05-12
This series contains some mlx5 fixes for net.
Please pull and let me know if there's any problem.
For -stable:
("net/mlx5e: Fix ethtool pause support and advertise reporting") kernels >= 4.8
("net/mlx5e: Use the correct pause values for ethtool advertising") kernels >= 4.8
v1->v2:
Dropped statistics spinlock patch, it needs some extra work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Sat, 13 May 2017 00:03:39 +0000 (17:03 -0700)]
ipv6: avoid dad-failures for addresses with NODAD
Every address gets added with TENTATIVE flag even for the addresses with
IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
we realize it's an address with NODAD and complete the process without
sending any probe. However the TENTATIVE flags stays on the
address for sometime enough to cause misinterpretation when we receive a NS.
While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
and endup with an address that was originally configured as NODAD with
DADFAILED.
We can't avoid scheduling dad_work for addresses with NODAD but we can
avoid adding TENTATIVE flag to avoid this racy situation.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mintz, Yuval [Sun, 14 May 2017 09:21:23 +0000 (12:21 +0300)]
qed: Fix uninitialized data in aRFS infrastructure
Current memset is using incorrect type of variable, causing the
upper-half of the strucutre to be left uninitialized and causing:
ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable':
ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized]
Fixes:
d51e4af5c209 ("qed: aRFS infrastructure support")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Fri, 12 May 2017 14:54:23 +0000 (22:54 +0800)]
mdio: mux: fix device_node_continue.cocci warnings
Device node iterators put the previous value of the index variable, so an
explicit put causes a double put.
In particular, of_mdiobus_register can fail before doing anything
interesting, so one could view it as a no-op from the reference count
point of view.
Generated by: scripts/coccinelle/iterators/device_node_continue.cocci
CC: Jon Mason <jon.mason@broadcom.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Douglas Caetano dos Santos [Fri, 12 May 2017 18:19:15 +0000 (15:19 -0300)]
net/packet: fix missing net_device reference release
When using a TX ring buffer, if an error occurs processing a control
message (e.g. invalid message), the net_device reference is not
released.
Fixes
c14ac9451c348 ("sock: enable timestamping using control messages")
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
yuval.shaia@oracle.com [Fri, 12 May 2017 06:10:51 +0000 (09:10 +0300)]
net/mlx4_core: Use min3 to select number of MSI-X vectors
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Thu, 11 May 2017 15:09:52 +0000 (11:09 -0400)]
macvlan: Fix performance issues with vlan tagged packets
Macvlan always turns on offload features that have sofware
fallback (NETIF_GSO_SOFTWARE). This allows much higher guest-guest
communications over macvtap.
However, macvtap does not turn on these features for vlan tagged traffic.
As a result, depending on the HW that mactap is configured on, the
performance of guest-guest communication over a vlan is very
inconsistent. If the HW supports TSO/UFO over vlans, then the
performance will be fine. If not, the the performance will suffer
greatly since the VM may continue using TSO/UFO, and will force the host
segment the traffic and possibly overlow the macvtap queue.
This patch adds the always on offloads to vlan_features. This
makes sure that any vlan tagged traffic between 2 guest will not
be segmented needlessly.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Cassel [Mon, 15 May 2017 08:56:06 +0000 (10:56 +0200)]
net: stmmac: use correct pointer when printing normal descriptor ring
There are two pointers in sysfs_display_ring,
one that increments if using normal dma descriptors,
another if using extended dma descriptors.
When printing the normal dma descriptors, the wrong pointer is used,
thus the printed descriptor addresses are incorrect.
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yishai Hadas [Tue, 25 Apr 2017 07:39:57 +0000 (10:39 +0300)]
net/mlx5: Use underlay QPN from the root name space
Root flow table is dynamically changed by the underlying flow steering
layer, and IPoIB/ULPs have no idea what will be the root flow table in
the future, hence we need a dynamic infrastructure to move Underlay QPs
with the root flow table.
Fixes:
b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Thu, 4 May 2017 14:53:32 +0000 (17:53 +0300)]
net/mlx5e: IPoIB, Only support regular RQ for now
IPoIB doesn't support striding RQ at the moment, for this
we need to explicitly choose non striding RQ in IPoIB init,
even if the HW supports it.
Fixes:
8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Tue, 9 May 2017 13:40:46 +0000 (16:40 +0300)]
net/mlx5e: Fix setup TC ndo
Fail-safe support patches introduced a trivial bug,
setup tc callback is doing a wrong check of the netdevice state,
the fix is simply to invert the condition.
Fixes:
6f9485af4020 ("net/mlx5e: Fail safe tc setup")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Wed, 19 Apr 2017 11:35:15 +0000 (14:35 +0300)]
net/mlx5e: Fix ethtool pause support and advertise reporting
Pause bit should set when RX pause is on, not TX pause.
Also, setting Asym_Pause is incorrect, and should be turned off.
Fixes:
665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gal Pressman [Mon, 3 Apr 2017 12:11:22 +0000 (15:11 +0300)]
net/mlx5e: Use the correct pause values for ethtool advertising
Query the operational pause from firmware (PFCC register) instead of
always passing zeros.
Fixes:
665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Linus Torvalds [Sat, 13 May 2017 20:19:49 +0000 (13:19 -0700)]
Linux 4.12-rc1
Linus Torvalds [Sat, 13 May 2017 17:25:05 +0000 (10:25 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull some more input subsystem updates from Dmitry Torokhov:
"An updated xpad driver with a few more recognized device IDs, and a
new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
via SPI bus"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: cros_ec_keyb - remove extraneous 'const'
Input: add support for PlayStation 1/2 joypads connected via SPI
Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
Input: xpad - sync supported devices with xboxdrv
Input: xpad - sort supported devices by USB ID
Linus Torvalds [Sat, 13 May 2017 17:23:12 +0000 (10:23 -0700)]
Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
- new config option CONFIG_UBIFS_FS_SECURITY
- minor improvements
- random fixes
* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
ubi: Add debugfs file for tracking PEB state
ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
ubifs: Remove unnecessary assignment
ubifs: Fix cut and paste error on sb type comparisons
ubi: fastmap: Fix slab corruption
ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
ubi: Make mtd parameter readable
ubi: Fix section mismatch
Linus Torvalds [Sat, 13 May 2017 17:20:02 +0000 (10:20 -0700)]
Merge branch 'for-linus-4.12-rc1' of git://git./linux/kernel/git/rw/uml
Pull UML fixes from Richard Weinberger:
"No new stuff, just fixes"
* 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Add missing NR_CPUS include
um: Fix to call read_initrd after init_bootmem
um: Include kbuild.h instead of duplicating its macros
um: Fix PTRACE_POKEUSER on x86_64
um: Set number of CPUs
um: Fix _print_addr()
Linus Torvalds [Sat, 13 May 2017 16:49:35 +0000 (09:49 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, docs: update memory.stat description with workingset* entries
mm: vmscan: scan until it finds eligible pages
mm, thp: copying user pages must schedule on collapse
dax: fix PMD data corruption when fault races with write
dax: fix data corruption when fault races with write
ext4: return to starting transaction in ext4_dax_huge_fault()
mm: fix data corruption due to stale mmap reads
dax: prevent invalidation of mapped DAX entries
Tigran has moved
mm, vmalloc: fix vmalloc users tracking properly
mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
gcov: support GCC 7.1
mm, vmstat: Remove spurious WARN() during zoneinfo print
time: delete current_fs_time()
hwpoison, memcg: forcibly uncharge LRU pages
Steve French [Sat, 13 May 2017 01:59:10 +0000 (20:59 -0500)]
[CIFS] Minor cleanup of xattr query function
Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).
Signed-off-by: Steve French <steve.french@primarydata.com>
Karim Eshapa [Thu, 11 May 2017 23:53:38 +0000 (01:53 +0200)]
fs: cifs: transport: Use time_after for time comparison
Use time_after kernel macro for time comparison
that has safety check.
Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Christophe JAILLET [Fri, 12 May 2017 15:59:32 +0000 (17:59 +0200)]
SMB2: Fix share type handling
In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK 0x01
#define SMB2_SHARE_TYPE_PIPE 0x02
#define SMB2_SHARE_TYPE_PRINT 0x03
Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.
So, test the ShareType value for equality instead.
Fixes:
faaf946a7d5b ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Joe Perches via samba-technical [Sun, 7 May 2017 08:31:47 +0000 (01:31 -0700)]
cifs: cifsacl: Use a temporary ops variable to reduce code length
Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.
$ size fs/cifs/cifsacl.o*
text data bss dec hex filename
5338 136 8 5482 156a fs/cifs/cifsacl.o.new
5371 136 8 5515 158b fs/cifs/cifsacl.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Roman Gushchin [Fri, 12 May 2017 22:47:09 +0000 (15:47 -0700)]
mm, docs: update memory.stat description with workingset* entries
Commit
4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
workingset transition") introduced three new entries in memory stat
file:
- workingset_refault
- workingset_activate
- workingset_nodereclaim
This commit adds a corresponding description to the cgroup v2 docs.
Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Fri, 12 May 2017 22:47:06 +0000 (15:47 -0700)]
mm: vmscan: scan until it finds eligible pages
Although there are a ton of free swap and anonymous LRU page in elgible
zones, OOM happened.
balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
CPU: 7 PID: 1138 Comm: balloon Not tainted
4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
oom_kill_process+0x21d/0x3f0
out_of_memory+0xd8/0x390
__alloc_pages_slowpath+0xbc1/0xc50
__alloc_pages_nodemask+0x1a5/0x1c0
pte_alloc_one+0x20/0x50
__pte_alloc+0x1e/0x110
__handle_mm_fault+0x919/0x960
handle_mm_fault+0x77/0x120
__do_page_fault+0x27a/0x550
trace_do_page_fault+0x43/0x150
do_async_page_fault+0x2c/0x90
async_page_fault+0x28/0x30
Mem-Info:
active_anon:424716 inactive_anon:65314 isolated_anon:0
active_file:52 inactive_file:46 isolated_file:0
unevictable:0 dirty:27 writeback:0 unstable:0
slab_reclaimable:3967 slab_unreclaimable:4125
mapped:133 shmem:43 pagetables:1674 bounce:0
free:4637 free_pcp:225 free_cma:0
Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 992 992 1952
DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
lowmem_reserve[]: 0 0 0 959
Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
378 total pagecache pages
17 pages in swap cache
Swap cache stats: add 17325, delete 17302, find 0/27
Free swap = 978940kB
Total swap = 1048572kB
524157 pages RAM
0 pages HighMem/MovableOnly
12629 pages reserved
0 pages cma reserved
0 pages hwpoisoned
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
[ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd
With investigation, skipping page of isolate_lru_pages makes reclaim
void because it returns zero nr_taken easily so LRU shrinking is
effectively nothing and just increases priority aggressively. Finally,
OOM happens.
The problem is that get_scan_count determines nr_to_scan with eligible
zones so although priority drops to zero, it couldn't reclaim any pages
if the LRU contains mostly ineligible pages.
get_scan_count:
size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
size = size >> sc->priority;
Assumes sc->priority is 0 and LRU list is as follows.
N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H
(Ie, small eligible pages are in the head of LRU but others are
almost ineligible pages)
In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
tail of the LRU are not eligible pages. If get_scan_count counts
skipped pages, it doesn't reclaim any pages remained after scanning 4
pages so it ends up OOM happening.
This patch makes isolate_lru_pages try to scan pages until it encounters
eligible zones's pages.
[akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
Fixes:
3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Fri, 12 May 2017 22:47:03 +0000 (15:47 -0700)]
mm, thp: copying user pages must schedule on collapse
We have encountered need_resched warnings in __collapse_huge_page_copy()
while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.
mm->mmap_sem is held for write, but the iteration is well bounded.
Reschedule as needed.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 12 May 2017 22:47:00 +0000 (15:47 -0700)]
dax: fix PMD data corruption when fault races with write
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.
Currently DAX PMD read fault can race with write(2) in the following
way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pmd_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add huge zero page to the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes:
9f141d6ef6258 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Fri, 12 May 2017 22:46:57 +0000 (15:46 -0700)]
dax: fix data corruption when fault races with write
Currently DAX read fault can race with write(2) in the following way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pte_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add zero page in the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes:
9f141d6ef6258a3a37a045842d9ba7e68f368956
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Fri, 12 May 2017 22:46:54 +0000 (15:46 -0700)]
ext4: return to starting transaction in ext4_dax_huge_fault()
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes. To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().
Fixes:
9f141d6ef6258a3a37a045842d9ba7e68f368956
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Fri, 12 May 2017 22:46:50 +0000 (15:46 -0700)]
mm: fix data corruption due to stale mmap reads
Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX. That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:
- open an mmap over a 2MiB hole
- read from a 2MiB hole, faulting in a 2MiB zero page
- write to the hole with write(3p). The write succeeds but we
incorrectly leave the 2MiB zero page mapping intact.
- via the mmap, read the data that was just written. Since the zero
page mapping is still intact we read back zeroes instead of the new
data.
Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.
Fixes:
c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 12 May 2017 22:46:47 +0000 (15:46 -0700)]
dax: prevent invalidation of mapped DAX entries
Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.
This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).
The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.
This patch (of 4):
dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked. This is done via:
invalidate_mapping_pages()
invalidate_exceptional_entry()
dax_invalidate_mapping_entry()
However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped. This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().
For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.
We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().
Fixes:
c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 12 May 2017 22:46:44 +0000 (15:46 -0700)]
Tigran has moved
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 12 May 2017 22:46:41 +0000 (15:46 -0700)]
mm, vmalloc: fix vmalloc users tracking properly
Commit
1f5307b1e094 ("mm, vmalloc: properly track vmalloc users") has
pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
turned out to be a bad idea for some architectures. E.g. m68k fails
with
In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
from arch/m68k/include/asm/pgtable.h:4,
from include/linux/vmalloc.h:9,
from arch/m68k/kernel/module.c:9:
arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
>> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
#define pgd_offset_k(address) pgd_offset(&init_mm, address)
as spotted by kernel build bot. nios2 fails for other reason
In file included from include/asm-generic/io.h:767:0,
from arch/nios2/include/asm/io.h:61,
from include/linux/io.h:25,
from arch/nios2/include/asm/pgtable.h:18,
from include/linux/mm.h:70,
from include/linux/pid_namespace.h:6,
from include/linux/ptrace.h:9,
from arch/nios2/include/uapi/asm/elf.h:23,
from arch/nios2/include/asm/elf.h:22,
from include/linux/elf.h:4,
from include/linux/module.h:15,
from init/main.c:16:
include/linux/vmalloc.h: In function '__vmalloc_node_flags':
include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?
which is due to the newly added #include <asm/pgtable.h>, which on nios2
includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
again includes <linux/vmalloc.h>.
Tweaking that around just turns out a bigger headache than necessary.
This patch reverts
1f5307b1e094 and reimplements the original fix in a
different way. __vmalloc_node_flags can stay static inline which will
cover vmalloc* functions. We only have one external user
(kvmalloc_node) and we can export __vmalloc_node_flags_caller and
provide the caller directly. This is much simpler and it doesn't really
need any games with header files.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: revert old comment]
Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
Fixes:
1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SeongJae Park [Fri, 12 May 2017 22:46:38 +0000 (15:46 -0700)]
mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
One return case of `__collapse_huge_page_swapin()` does not invoke
tracepoint while every other return case does. This commit adds a
tracepoint invocation for the case.
Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Martin Liska [Fri, 12 May 2017 22:46:35 +0000 (15:46 -0700)]
gcov: support GCC 7.1
Starting from GCC 7.1, __gcov_exit is a new symbol expected to be
implemented in a profiling runtime.
[akpm@linux-foundation.org: coding-style fixes]
[mliska@suse.cz: v2]
Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz
Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.cz
Signed-off-by: Martin Liska <mliska@suse.cz>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reza Arbab [Fri, 12 May 2017 22:46:32 +0000 (15:46 -0700)]
mm, vmstat: Remove spurious WARN() during zoneinfo print
After commit
e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
zoneinfo"), /proc/zoneinfo will show unpopulated zones.
A memoryless node, having no populated zones at all, was previously
ignored, but will now trigger the WARN() in is_zone_first_populated().
Remove this warning, as its only purpose was to warn of a situation that
has since been enabled.
Aside: The "per-node stats" are still printed under the first populated
zone, but that's not necessarily the first stanza any more. I'm not
sure which criteria is more important with regard to not breaking
parsers, but it looks a little weird to the eye.
Fixes:
e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deepa Dinamani [Fri, 12 May 2017 22:46:29 +0000 (15:46 -0700)]
time: delete current_fs_time()
All uses of the current_fs_time() function have been replaced by other
time interfaces.
And, its use cases can be fulfilled by current_time() or ktime_get_*
variants.
Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 12 May 2017 22:46:26 +0000 (15:46 -0700)]
hwpoison, memcg: forcibly uncharge LRU pages
Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In
his particular case he has hit a bad_page("page still charged to
cgroup") when onlining a hwpoison page. While this looks like something
that shouldn't happen in the first place because onlining hwpages and
returning them to the page allocator makes only little sense it shows a
real problem.
hwpoison pages do not get freed usually so we do not uncharge them (at
least not since commit
0a31bc97c80c ("mm: memcontrol: rewrite uncharge
API")). Each charge pins memcg (since
e8ea14cc6ead ("mm: memcontrol:
take a css reference for each charged page")) as well and so the
mem_cgroup and the associated state will never go away. Fix this leak
by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
We also have to tweak uncharge_list because it cannot rely on zero ref
count for these pages.
[akpm@linux-foundation.org: coding-style fixes]
Fixes:
0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 12 May 2017 22:43:10 +0000 (15:43 -0700)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:
- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.
- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.
- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.
- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.
These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
Linus Torvalds [Fri, 12 May 2017 19:10:38 +0000 (12:10 -0700)]
Merge tag 'sound-fix-4.12-rc1' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This contains a one-liner change that has a significant impact:
disabling the build of OSS. It's been unmaintained for long time, and
we'd like to drop the stuff. Finally, as the first step, stop the
build. Let's see whether it works without much complaints.
Other than that, there are two small fixes for HD-audio"
* tag 'sound-fix-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
sound: Disable the build of OSS drivers
ALSA: hda: Fix cpu lockup when stopping the cmd dmas
ALSA: hda - Add mute led support for HP EliteBook 840 G3
Linus Torvalds [Fri, 12 May 2017 19:02:21 +0000 (12:02 -0700)]
Merge tag 'for-v4.12-2' of git://git./linux/kernel/git/sre/linux-power-supply
Pull more power-supply updates from Sebastian Reichel:
"The power-supply subsystem has a few more changes for the v4.12 merge
window:
- New battery driver for AXP20X and AXP22X PMICs
- Improve max17042_battery for usage on x86
- Misc small cleanups & fixes"
* tag 'for-v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (34 commits)
power: supply: cpcap-charger: Keep trickle charger bits disabled
power: supply: cpcap-charger: Fix enable for 3.8V charge setting
power: supply: cpcap-charger: Fix charge voltage configuration
power: supply: cpcap-charger: Fix charger name
power: supply: twl4030-charger: make twl4030_bci_property_is_writeable static
power: supply: sbs-battery: Add alert callback
mailmap: add Sebastian Reichel
power: supply: avoid unused twl4030-madc.h
power: supply: sbs-battery: Correct supply status with current draw
power: supply: sbs-battery: Don't ignore the first external power change
power: supply: pda_power: move from timer to delayed_work
power: supply: max17042_battery: Add support for the SCOPE property
power: supply: max17042_battery: Add support for the CHARGE_NOW property
power: supply: max17042_battery: Add support for the CHARGE_FULL_DESIGN property
power: supply: max17042_battery: mAh readings depend on r_sns value
power: supply: max17042_battery: Add support for the VOLT_MIN property
power: supply: max17042_battery: Add support for the TECHNOLOGY attribute
power: supply: max17042_battery: Add external_power_changed callback
power: supply: max17042_battery: Add support for the STATUS property
power: supply: max17042_battery: Add default platform_data fallback data
...
Linus Torvalds [Fri, 12 May 2017 18:58:45 +0000 (11:58 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
- Fix a problem where orderly_shutdown() is called for multiple times
due to multiple critical overheating events raised in a short period
by platform thermal driver. (Keerthy)
- Introduce a backup thermal shutdown mechanism, which invokes
kernel_power_off()/emergency_restart() directly, after
orderly_shutdown() being issued for certain amount of time(specified
via Kconfig). This is useful in certain conditions that userspace may
be unable to power off the system in a clean manner and leaves the
system in a critical state, like in the middle of driver probing
phase. (Keerthy)
- Introduce a new interface in thermal devfreq_cooling code so that the
driver can provide more precise data regarding actual power to the
thermal governor every time the power budget is calculated. (Lukasz
Luba)
- Introduce BCM 2835 soc thermal driver and northstar thermal driver,
within a new sub-folder. (Rafał Miłecki)
- Introduce DA9062/61 thermal driver. (Steve Twiss)
- Remove non-DT booting on TI-SoC driver. Also add support to fetching
coefficients from DT. (Keerthy)
- Refactorf RCAR Gen3 thermal driver. (Niklas Söderlund)
- Small fix on MTK and intel-soc-dts thermal driver. (Dawei Chien,
Brian Bian)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (25 commits)
thermal: core: Add a back up thermal shutdown mechanism
thermal: core: Allow orderly_poweroff to be called only once
Thermal: Intel SoC DTS: Change interrupt request behavior
trace: thermal: add another parameter 'power' to the tracing function
thermal: devfreq_cooling: add new interface for direct power read
thermal: devfreq_cooling: refactor code and add get_voltage function
thermal: mt8173: minor mtk_thermal.c cleanups
thermal: bcm2835: move to the broadcom subdirectory
thermal: broadcom: ns: specify myself as MODULE_AUTHOR
thermal: da9062/61: Thermal junction temperature monitoring driver
Documentation: devicetree: thermal: da9062/61 TJUNC temperature binding
thermal: broadcom: add Northstar thermal driver
dt-bindings: thermal: add support for Broadcom's Northstar thermal
thermal: bcm2835: add thermal driver for bcm2835 SoC
dt-bindings: Add thermal zone to bcm2835-thermal example
thermal: rcar_gen3_thermal: add suspend and resume support
thermal: rcar_gen3_thermal: store device match data in private structure
thermal: rcar_gen3_thermal: enable hardware interrupts for trip points
thermal: rcar_gen3_thermal: record and check number of TSCs found
thermal: rcar_gen3_thermal: check that TSC exists before memory allocation
...
Linus Torvalds [Fri, 12 May 2017 18:48:26 +0000 (11:48 -0700)]
Merge tag 'drm-fixes-for-v4.12-rc1' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"AMD, nouveau, one i915, and one EDID fix for v4.12-rc1
Some fixes that it would be good to have in rc1. It contains the i915
quiet fix that you reported.
It also has an amdgpu fixes pull, with lots of ongoing work on Vega10
which is new in this kernel and is preliminary support so may have a
fair bit of movement.
Otherwise a few non-Vega10 AMD fixes, one EDID fix and some nouveau
regression fixers"
* tag 'drm-fixes-for-v4.12-rc1' of git://people.freedesktop.org/~airlied/linux: (144 commits)
drm/i915: Make vblank evade warnings optional
drm/nouveau/therm: remove ineffective workarounds for alarm bugs
drm/nouveau/tmr: avoid processing completed alarms when adding a new one
drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
drm/nouveau/tmr: handle races with hw when updating the next alarm time
drm/nouveau/tmr: ack interrupt before processing alarms
drm/nouveau/core: fix static checker warning
drm/nouveau/fb/ram/gf100-: remove 0x10f200 read
drm/nouveau/kms/nv50: skip core channel cursor update on position-only changes
drm/nouveau/kms/nv50: fix source-rect-only plane updates
drm/nouveau/kms/nv50: remove pointless argument to window atomic_check_acquire()
drm/amd/powerplay: refine pwm1_enable callback functions for CI.
drm/amd/powerplay: refine pwm1_enable callback functions for vi.
drm/amd/powerplay: refine pwm1_enable callback functions for Vega10.
drm/amdgpu: refine amdgpu pwm1_enable sysfs interface.
drm/amdgpu: add amd fan ctrl mode enums.
drm/amd/powerplay: add more smu message on Vega10.
drm/amdgpu: fix dependency issue
drm/amd: fix init order of sched job
drm/amdgpu: add some additional vega10 pci ids
...
Linus Torvalds [Fri, 12 May 2017 18:44:13 +0000 (11:44 -0700)]
Merge branch 'for-next' of git://git./linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
"Things were a lot more calm than previously expected. It's primarily
fixes in various areas, with most of the new functionality centering
around TCMU backend driver work that Xiubo Li has been driving.
Here's the summary on the feature side:
- Make T10-PI verify configurable for emulated (FILEIO + RD) backends
(Dmitry Monakhov)
- Allow target-core/TCMU pass-through to use in-kernel SPC-PR logic
(Bryant Ly + MNC)
- Add TCMU support for growing ring buffer size (Xiubo Li + MNC)
- Add TCMU support for global block data pool (Xiubo Li + MNC)
and on the bug-fix side:
- Fix COMPARE_AND_WRITE non GOOD status handling for READ phase
failures (Gary Guo + nab)
- Fix iscsi-target hang with explicitly changing per NodeACL
CmdSN number depth with concurrent login driven session
reinstatement. (Gary Guo + nab)
- Fix ibmvscsis fabric driver ABORT task handling (Bryant Ly)
- Fix target-core/FILEIO zero length handling (Bart Van Assche)
Also, there was an OOPs introduced with the WRITE_VERIFY changes that
I ended up reverting at the last minute, because as not unusual Bart
and I could not agree on the fix in time for -rc1. Since it's specific
to a conformance test, it's been reverted for now.
There is a separate patch in the queue to address the underlying
control CDB write overflow regression in >= v4.3 separate from the
WRITE_VERIFY revert here, that will be pushed post -rc1"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (30 commits)
Revert "target: Fix VERIFY and WRITE VERIFY command parsing"
IB/srpt: Avoid that aborting a command triggers a kernel warning
IB/srpt: Fix abort handling
target/fileio: Fix zero-length READ and WRITE handling
ibmvscsis: Do not send aborted task response
tcmu: fix module removal due to stuck thread
target: Don't force session reset if queue_depth does not change
iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
target: Fix compare_and_write_callback handling for non GOOD status
tcmu: Recalculate the tcmu_cmd size to save cmd area memories
tcmu: Add global data block pool support
tcmu: Add dynamic growing data area feature support
target: fixup error message in target_tg_pt_gp_tg_pt_gp_id_store()
target: fixup error message in target_tg_pt_gp_alua_access_type_store()
target/user: PGR Support
target: Add WRITE_VERIFY_16
Documentation/target: add an example script to configure an iSCSI target
target: Use kmalloc_array() in transport_kmap_data_sg()
target: Use kmalloc_array() in compare_and_write_callback()
target: Improve size determinations in two functions
...
Linus Torvalds [Fri, 12 May 2017 18:39:59 +0000 (11:39 -0700)]
Merge branch 'work.sane_pwd' of git://git./linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"Making sure that something like a referral point won't end up as pwd
or root.
The main part is the last commit (fixing mntns_install()); that one
fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
bit more robust - it should be impossible to get opened files (even
O_PATH ones) for referral points in the first place, so the existing
checks are OK, but checking the same thing as in chdir(2) is just as
cheap.
The path_init() commit removes a redundant check that shouldn't have
been there in the first place"
* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make sure that mntns_install() doesn't end up with referral for root
path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
make sure that fchdir() won't accept referral points, etc.
Linus Torvalds [Fri, 12 May 2017 17:45:36 +0000 (10:45 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf updates/fixes from Ingo Molnar:
"Mostly tooling updates, but also two kernel fixes: a call chain
handling robustness fix and an x86 PMU driver event definition fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/callchain: Force USER_DS when invoking perf_callchain_user()
tools build: Fixup sched_getcpu feature test
perf tests kmod-path: Don't fail if compressed modules aren't supported
perf annotate: Fix AArch64 comment char
perf tools: Fix spelling mistakes
perf/x86: Fix Broadwell-EP DRAM RAPL events
perf config: Refactor a duplicated code for obtaining config file name
perf symbols: Allow user probes on versioned symbols
perf symbols: Accept symbols starting at address 0
tools lib string: Adopt prefixcmp() from perf and subcmd
perf units: Move parse_tag_value() to units.[ch]
perf ui gtk: Move gtk .so name to the only place where it is used
perf tools: Move HAS_BOOL define to where perl headers are used
perf memswap: Split the byteswap memory range wrappers from util.[ch]
perf tools: Move event prototypes from util.h to event.h
perf buildid: Move prototypes from util.h to build-id.h
Linus Torvalds [Fri, 12 May 2017 17:43:25 +0000 (10:43 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"A single ARM Juno clocksource driver fix"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/arm_arch_timer: Fix arch_timer_mem_find_best_frame()
Linus Torvalds [Fri, 12 May 2017 17:41:45 +0000 (10:41 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull stackprotector fixlet from Ingo Molnar:
"A single fix/enhancement to increase stackprotector canary randomness
on 64-bit kernels with very little cost"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms
Linus Torvalds [Fri, 12 May 2017 17:11:50 +0000 (10:11 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- two boot crash fixes
- unwinder fixes
- kexec related kernel direct mappings enhancements/fixes
- more Clang support quirks
- minor cleanups
- Documentation fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Fix a typo in Documentation
x86/build: Don't add -maccumulate-outgoing-args w/o compiler support
x86/boot/32: Fix UP boot on Quark and possibly other platforms
x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()
x86/kexec/64: Use gbpages for identity mappings if available
x86/mm: Add support for gbpages to kernel_ident_mapping_init()
x86/boot: Declare error() as noreturn
x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility
x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
x86/microcode/AMD: Remove redundant NULL check on mc
Linus Torvalds [Fri, 12 May 2017 17:09:14 +0000 (10:09 -0700)]
Merge tag 'for-linus-4.12b-rc0c-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains two fixes for booting under Xen introduced during this
merge window and two fixes for older problems, where one is just much
more probable due to another merge window change"
* tag 'for-linus-4.12b-rc0c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: adjust early dom0 p2m handling to xen hypervisor behavior
x86/amd: don't set X86_BUG_SYSRET_SS_ATTRS when running under Xen
xen/x86: Do not call xen_init_time_ops() until shared_info is initialized
x86/xen: fix xsave capability setting