Dietmar Eggemann [Fri, 17 Mar 2017 20:27:06 +0000 (20:27 +0000)]
ANDROID: sched/events: Introduce cfs_rq load tracking trace event
The trace event keys load and util (utilization) are mapped to:
(1) load : cfs_rq->runnable_load_avg
(2) util : cfs_rq->avg.util_avg
To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
special handling is necessary for non-existent key=value pairs:
path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.
The following list shows examples of the key=value pairs in different
configurations for:
(1) a root task_group:
cpu=4 path=/ load=6 util=331
(2) a task_group:
cpu=1 path=/tg1/tg11/tg111 load=538 util=522
(3) an autogroup:
cpu=3 path=/autogroup-18 load=997 util=517
(4) w/o CONFIG_FAIR_GROUP_SCHED:
cpu=0 path=(null) load=314 util=289
The trace event is only defined for CONFIG_SMP.
The helper function __trace_sched_path() can be used to get the length
parameter of the dynamic array (path == NULL) and to copy the path into
it (path != NULL).
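The two-call convention described for __trace_sched_path() can be illustrated in plain userspace C. This is only a minimal sketch under stated assumptions: sketch_path(), its parameters and its return value are hypothetical and are not the kernel helper. A NULL source path stands in for the !CONFIG_FAIR_GROUP_SCHED case.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the helper described above (not kernel code).
 * dst == NULL: only report how many bytes the path needs (incl. the NUL).
 * dst != NULL: copy at most len bytes, always NUL-terminated.
 * src == NULL models a configuration without a path and yields "(null)".
 */
static size_t sketch_path(const char *src, char *dst, size_t len)
{
	const char *path = src ? src : "(null)";
	size_t need = strlen(path) + 1;

	if (dst && len > 0) {
		strncpy(dst, path, len - 1);
		dst[len - 1] = '\0';
	}
	return need;
}
```

A caller would first call sketch_path(p, NULL, 0) to size the buffer and then call it again with the buffer to fill it.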
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Iae08075d889dd772c8d2e1a15dc2ca6589e5640e
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 17 Mar 2017 19:09:03 +0000 (19:09 +0000)]
ANDROID: sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG
Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be
available to be printed in the load tracking trace events provided by
this patch-stack regardless of whether CONFIG_SCHED_DEBUG is set or not.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Change-Id: I6f59783a83e0965d96e84446f64b29ad0c4dc35a
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 14 Nov 2014 16:25:50 +0000 (16:25 +0000)]
ANDROID: sched/debug: Add energy procfs interface
This patch makes the energy data available via procfs. The related files
are placed as sub-directory named 'energy' inside the
/proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
cpu/domain/group tuples which have energy information.
The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has energy information attached to domain level 0.
/--cpu0
| /--domain0
| | /--busy_factor
| | |--busy_idx
| | |--cache_nice_tries
| | |--flags
| | |--forkexec_idx
| | |--group0
| | | /--energy
| | | | /--cap_states
| | | | |--idle_states
| | | | |--nr_cap_states
| | | | |--nr_idle_states
| | |--group1
| | | /--energy
| | | | /--cap_states
| | | | |--idle_states
| | | | |--nr_cap_states
| | | | |--nr_idle_states
| | |--idle_idx
| | |--imbalance_pct
| | |--max_interval
| | |--max_newidle_lb_cost
| | |--min_interval
| | |--name
| | |--newidle_idx
| | |--wake_idx
| |--domain1
| | /--busy_factor
| | |--busy_idx
| | |--cache_nice_tries
| | |--flags
| | |--forkexec_idx
| | |--idle_idx
| | |--imbalance_pct
| | |--max_interval
| | |--max_newidle_lb_cost
| | |--min_interval
| | |--name
| | |--newidle_idx
| | |--wake_idx
The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value,
whereas 'idle_states' contains a vector of power consumption values (one
per idle state) and 'cap_states' contains a vector of (compute capacity,
power consumption) tuples (one per capacity state).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I2b96d1d46e38d1131e78c206cc1d94900e6d0690
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 14 Nov 2014 17:16:41 +0000 (17:16 +0000)]
ANDROID: arm: Support for extracting EAS energy costs from DT
This patch implements support in the arm architecture for extracting
energy cost data from DT and matches the support added for arm64
in "ANDROID: arm64: Support for extracting EAS energy costs from DT"
The data should conform to the DT bindings for energy cost data needed
by EAS (energy aware scheduling).
Test output on TC2:
150 187 172 275 215 334 258 407 301 447 344 549 387 761 430 1024
0 0 0
8
3
150 187 172 275 215 334 258 407 301 447 344 549 387 761 430 1024
0 0 0
8
3
150 187 172 275 215 334 258 407 301 447 344 549 387 761 430 1024
0 0 0
8
3
150 2967 172 2792 215 2810 258 2815 301 2919 344 2847 387 3917 430 4905
25 25 10
8
3
426 7920 512 8165 597 8172 682 8195 768 8265 853 8446 938 11426 1024 15200
70 70 25
8
3
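Assuming each group of four lines in the test output above is the content of cap_states, idle_states, nr_cap_states and nr_idle_states (which is consistent with the counts 8 and 3), a cap_states line can be split into (capacity, power) tuples. The sketch below is userspace C for illustration only; struct cap_state_sketch and parse_cap_states() are hypothetical names, not part of the patch.

```c
#include <stdlib.h>

/* Hypothetical tuple type: one capacity state as read from cap_states. */
struct cap_state_sketch {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* power consumption at this capacity */
};

/*
 * Parse "cap power cap power ..." into out[].
 * Returns the number of tuples parsed, or -1 if a capacity value
 * is not followed by a power value.
 */
static int parse_cap_states(const char *line, struct cap_state_sketch *out,
			    int max)
{
	int n = 0;
	char *end;

	while (n < max) {
		unsigned long cap, power;

		cap = strtoul(line, &end, 10);
		if (end == line)
			break;		/* no more numbers on the line */
		line = end;

		power = strtoul(line, &end, 10);
		if (end == line)
			return -1;	/* dangling capacity value */
		line = end;

		out[n].cap = cap;
		out[n].power = power;
		n++;
	}
	return n;
}
```

Applied to the last cap_states line above, this yields eight tuples, the final one being (1024, 15200).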
Change-Id: I1f6a70917ec5f4615a57cdbb7a34f1d783901c77
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Chris Redpath [Sat, 16 Dec 2017 14:32:21 +0000 (14:32 +0000)]
ANDROID: arm64: Support for extracting EAS energy costs from DT
This patch implements support for extracting energy cost data from DT.
The data should conform to the DT bindings for energy cost data needed
by EAS (energy aware scheduling).
This patch supersedes the previous EAS patches:
arm64, topology: Updates to use DT bindings for EAS costing data
sched: Support for extracting EAS energy costs from DT
arm64: use cpu scale value derived from energy model
arm64: define hikey620 sys sd energy model
arm64: introduce sys sd energy model infrastructure
arm64: factor out energy model from topology shim layer
arm64, topology: Define JUNO energy and provide it to the scheduler
There is no need to introduce code and replace it with the Android
expression of the same code in this stack.
Note that if sched-energy-costs is present at runtime, you can no longer
write cpu_capacity.
Some platforms may not provide capacity-dmips-mhz, but instead provide
an energy model in sched-energy-costs format. In this case, ensure that
the max capacity defined in the energy model is used as the raw capacity
value and that the arch_topology driver can still be loaded.
This ensures that the topology details are still available in sysfs and
also that the required flags are set.
Reported-by: Quentin Perret <quentin.perret@arm.com>
Further note that the arm support is still using a built-in energy
model, i.e. only arm64 platforms are able to provide energy model
data through the sched-energy-costs node in DT.
Change-Id: Id617b08eaf08cff3a099f35aeedbda72bb826ce6
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
(modified to apply to 4.14 and updated to override dmips-mhz)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Chris Redpath [Tue, 19 Dec 2017 16:31:05 +0000 (16:31 +0000)]
ANDROID: arm: Add Energy Model to dtb for TC2
Change-Id: I8e64f074185a91ec47dfed52404280f32c694786
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Chris Redpath [Wed, 25 Oct 2017 15:14:36 +0000 (16:14 +0100)]
ANDROID: hisilicon: Add energy model data to hisilicon 6220 dtb
Change-Id: I5890924224d5ae26144d60fb2d582de445dda2e6
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Chris Redpath [Wed, 25 Oct 2017 15:13:54 +0000 (16:13 +0100)]
ANDROID: arm64: Add Energy Model to dtb for Juno-r0 and Juno-r2
Change-Id: I0f67de02aec186c700184af60c355fac3158e2d6
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Robin Randhawa [Mon, 29 Jun 2015 16:56:20 +0000 (17:56 +0100)]
ANDROID: Documentation: DT bindings for energy model cost data required by EAS
EAS (energy aware scheduling) provides the scheduler with an alternative
objective - energy efficiency - as opposed to its current performance
oriented objectives. EAS relies on a simple platform energy cost model
to guide scheduling decisions. The model only considers the CPU
subsystem.
This patch adds documentation describing DT bindings that should be used to
supply the scheduler with an energy cost model.
Change-Id: Iddfdd0fd5be929ac82004bb80b6d87aa48e81dd8
Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Chris Redpath [Sat, 16 Dec 2017 14:33:30 +0000 (14:33 +0000)]
ANDROID: arm64, dts: add hikey cpu capacity-dmips-mhz information
Hikey is an SMP platform, so this property would normally not be necessary.
But since we drive the setting of the EAS specific sched domain flag
SD_SHARE_CAP_STATES via the init_cpu_capacity_callback() cpufreq notifier
we have to make sure that cap_parsing_failed is not set to true in
parse_cpu_capacity(). Otherwise init_cpu_capacity_callback() will bail out
before consuming the CPUFREQ_NOTIFY. The easiest way to achieve this is to
provide the dts file with this property.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I2975e457a3817793ac53b0d8b5ff87f7483aa867
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Wed, 20 Sep 2017 12:25:22 +0000 (13:25 +0100)]
ANDROID: drivers base/arch_topology: Detect SD_SHARE_CAP_STATES flag
Detect and set the SD_SHARE_CAP_STATES sched_domain flag automatically
based on the cpufreq policy related_cpus mask. Since the sched_domain
flags functions don't take any parameters we have to assume that flags
are the same for sched_domains at the same level, i.e. platforms mixing
per-core and per-cluster DVFS are not supported.
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I041d01fd5a8f9abb08fbff727efceea5ddeaa03b
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Tue, 6 Jun 2017 08:30:23 +0000 (09:30 +0100)]
ANDROID: drivers base/arch_topology: enforce SCHED_CAPACITY_SCALE as highest CPU capacity
The default CPU capacity is SCHED_CAPACITY_SCALE (1024).
On a heterogeneous system (hmp) this value can be smaller for some cpus.
The CPU capacity parsing code normalizes the capacity-dmips-mhz
properties w.r.t. the highest value found while parsing the DT to
SCHED_CAPACITY_SCALE.
CPU capacity can also be changed by writing to
/sys/devices/system/cpu/cpu*/cpu_capacity.
To make sure that a subset of all online cpus still has a CPU capacity
value of SCHED_CAPACITY_SCALE enforce in the appropriate sysfs attribute
store function cpu_capacity_store().
This will avoid weird setups like transforming an hmp into an smp
system with a CPU capacity < SCHED_CAPACITY_SCALE for all cpus.
The current cpu_capacity_store() assumes that all cpus of a cluster have
the same CPU capacity value which is true for existing hmp systems (e.g.
big.LITTLE). This assumption is also used by this patch.
If the new CPU capacity value for a cpu is smaller than
SCHED_CAPACITY_SCALE we iterate over the cpus which do not belong to the
cpu's cluster and check that there is still a cpu with CPU capacity
equal to SCHED_CAPACITY_SCALE.
The use of &cpu_topology[this_cpu].core_sibling is replaced by
topology_core_cpumask(this_cpu).
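The normalization and the store-time check described above can be sketched in userspace C. This is a simplified illustration, not the arch_topology code: the flat arrays, the cluster[] mapping and both function names are assumptions, and rejecting values above SCHED_CAPACITY_SCALE is an extra assumption of this sketch.

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Scale raw (e.g. capacity-dmips-mhz) values so the largest becomes 1024. */
static void normalize_capacity(const unsigned long *raw, unsigned long *cap,
			       int nr_cpus)
{
	unsigned long max = 0;
	int i;

	for (i = 0; i < nr_cpus; i++)
		if (raw[i] > max)
			max = raw[i];
	if (max == 0)
		return;		/* nothing parsed, leave cap[] untouched */

	for (i = 0; i < nr_cpus; i++)
		cap[i] = raw[i] * SCHED_CAPACITY_SCALE / max;
}

/*
 * Hypothetical store check: writing new_cap to every cpu of this_cpu's
 * cluster is allowed only if some cpu outside that cluster still has
 * SCHED_CAPACITY_SCALE afterwards. Returns 1 if allowed, 0 otherwise.
 */
static int capacity_store_allowed(const unsigned long *cap, const int *cluster,
				  int nr_cpus, int this_cpu,
				  unsigned long new_cap)
{
	int i;

	if (new_cap > SCHED_CAPACITY_SCALE)
		return 0;	/* assumption: out-of-range values are rejected */
	if (new_cap == SCHED_CAPACITY_SCALE)
		return 1;	/* this cluster itself keeps the highest value */

	for (i = 0; i < nr_cpus; i++)
		if (cluster[i] != cluster[this_cpu] &&
		    cap[i] == SCHED_CAPACITY_SCALE)
			return 1;
	return 0;
}
```

With raw values {500, 500, 1000, 1000} on two clusters, the capacities become {512, 512, 1024, 1024}; lowering the big cluster is refused while lowering the little cluster is accepted.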
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I2a197a012edd9f20b1c794f27567b891c0d2de12
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Thu, 8 Jun 2017 10:13:29 +0000 (11:13 +0100)]
ANDROID: drivers base/arch_topology: fold two pr_debug()'s into one
Output cpu_capacity and raw_capacity in one pr_debug instead of using
two.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I71c50b0988a95ef723602585c8f2cc7017aea78e
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Thara Gopinath [Fri, 23 Jun 2017 14:37:05 +0000 (10:37 -0400)]
ANDROID: sched: Per-Sched-domain over utilization
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is over-utilized. This patch introduces
an over utilization flag per sched domain level instead of a single flag
system wide. Load balancing is done at the sched domain where any
of the cpus is over utilized. If energy aware scheduling is
enabled and no cpu in a sched domain is overutilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
introduced is placed in this structure so that all the sched domains at the
same level share the flag. In case of an overutilized cpu, the flag gets
set at level1 sched_domain. The flag at the parent sched_domain level gets
set in either of the two following scenarios.
1. There is a misfit task in one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation can still have corner cases with respect to
misfit tasks. For example, consider a sched group with n cpus and
n+1 70% utilized tasks. Ideally this is a case for load balance to happen
in a parent sched domain. But neither is the total group utilization
high enough for the load balance to be triggered
in the parent domain, nor is there a cpu with a single overutilized task so
that a load balance is triggered in a parent domain. But again this could be
a purely academic scenario, as during task wake up these tasks will be placed
more appropriately.
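The two conditions for setting the parent-level flag, and the corner case with n+1 tasks, can be checked with a few lines of C. This is a sketch only: parent_overutilized() and its parameters are hypothetical, and utilization and capacity are assumed to share the 1024-per-cpu scale.

```c
/*
 * Hypothetical predicate for the parent sched_domain flag:
 * set if a misfit task exists or the summed utilization exceeds
 * the summed capacity of the domain.
 */
static int parent_overutilized(unsigned long total_util,
			       unsigned long total_capacity,
			       int has_misfit_task)
{
	return has_misfit_task || total_util > total_capacity;
}
```

For n = 4 cpus of capacity 1024 and five tasks at 70% (716 each), the total utilization is 3580 against a capacity of 4096, so the predicate stays false even though one cpu has to run two of the tasks.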
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Change-Id: I3f327cff4080096a3e58208dd72c9b7f7913cdb2
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 3 Feb 2015 13:54:11 +0000 (13:54 +0000)]
ANDROID: sched: Disable energy-unfriendly nohz kicks
With energy-aware scheduling enabled nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Iba347168ea34c152117d0f139e82d0d92ba2de20
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Sun, 10 May 2015 14:17:32 +0000 (15:17 +0100)]
ANDROID: sched: Consider a not over-utilized energy-aware system as balanced
In case the system operates below the tipping point indicator,
introduced in ("sched: Add over-utilization/tipping point
indicator"), bail out in find_busiest_group after the dst and src
group statistics have been checked.
There is simply no need to move usage around because all involved
cpus still have spare cycles available.
For an energy-aware system below its tipping point, we rely on the
task placement of the wakeup path. This works well for short running
tasks.
The existence of long running tasks on one of the involved cpus lets
the system operate over its tipping point. To be able to move such
a task (whose load can't be used to average the load among the cpus)
from a src cpu with lower capacity than the dst_cpu, an additional
rule has to be implemented in need_active_balance.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Iff5490200ca3ad25fb8e095b89d188f33fd8a4ef
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Wed, 30 Mar 2016 13:29:48 +0000 (14:29 +0100)]
ANDROID: sched/fair: Energy-aware wake-up task placement
When the system is not overutilized, place waking tasks on the most
energy efficient cpu. Previous attempts reduced the search space by
matching task utilization to cpu capacity before consulting the energy
model as this is an expensive operation. The search heuristics didn't
work very well and lacking any better alternatives this patch takes the
brute-force route and tries all potential targets.
This approach doesn't scale, but it might be sufficient for many
embedded applications while work is continuing on a heuristic that can
minimize the necessary computations. The heuristic must be derived from
the platform energy model rather than make additional assumptions, such
as lower capacity implying better energy efficiency. PeterZ mentioned in the
past that we might be able to derive some simpler deciding functions
using mathematical (modal?) analysis.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I424e72ab15529e45e8788d109823ae0f6de0d97f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Sat, 9 May 2015 15:49:57 +0000 (16:49 +0100)]
ANDROID: sched: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.
The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:
sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.
For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we have 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.
To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:
cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.
For systems where some cpus might have reduced capacity
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon as just a single cpu is fully utilized, as it might be one of those with
reduced capacity and in that case we want to migrate it.
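The difference between the summed criterion and the conservative per-cpu criterion can be sketched as follows. The function names, the flat arrays and the way the margin is passed in are assumptions made for this illustration; the patch itself only gives the two inequalities.

```c
/* Per-cpu test: util + margin > capacity. */
static int cpu_overutilized(unsigned long util, unsigned long capacity,
			    unsigned long margin)
{
	return util + margin > capacity;
}

/* Conservative definition: over-utilized as soon as ANY cpu is. */
static int any_cpu_overutilized(const unsigned long *util,
				const unsigned long *capacity, int nr_cpus,
				unsigned long margin)
{
	int i;

	for (i = 0; i < nr_cpus; i++)
		if (cpu_overutilized(util[i], capacity[i], margin))
			return 1;
	return 0;
}

/* Summed definition: sum(util) + margin > sum(capacity). */
static int sum_overutilized(const unsigned long *util,
			    const unsigned long *capacity, int nr_cpus,
			    unsigned long margin)
{
	unsigned long u = 0, c = 0;
	int i;

	for (i = 0; i < nr_cpus; i++) {
		u += util[i];
		c += capacity[i];
	}
	return u + margin > c;
}
```

With two cpus of capacity 1024, utilizations 300 and 1000 and a margin of 100, the summed test is false (1400 is not above 2048) while the per-cpu test is true (1100 is above 1024), which is the case the conservative definition is meant to catch.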
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I09caeeb1151b5c02d67ed738e25728d1771eb45f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Wed, 30 Mar 2016 13:20:12 +0000 (14:20 +0100)]
ANDROID: sched/fair: Add energy_diff dead-zone margin
It is not worth the overhead to migrate tasks for tiny insignificant
energy savings. To prevent this, an energy margin is introduced in
energy_diff() which effectively adds a dead-zone that rounds tiny energy
differences to zero. Since no scale is enforced for energy model data
the margin can't be absolute. Instead it is defined as +/-1.56% energy
saving compared to the current total estimated energy consumption.
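A relative dead-zone of this kind can be sketched in C. The sketch assumes the 1.56% figure is implemented as 1/64 of the current energy estimate (1/64 = 1.5625%), i.e. a right shift by 6; that reading and the function name are assumptions, not taken from the patch text.

```c
/*
 * Hypothetical dead-zone filter: returns the energy difference
 * (after - before), or 0 if its magnitude is below before/64.
 */
static long energy_diff_with_margin(unsigned long before, unsigned long after)
{
	long diff = (long)after - (long)before;
	long margin = (long)(before >> 6);	/* 1/64 = 1.5625% */

	if (diff < margin && -diff < margin)
		return 0;	/* inside the dead-zone */
	return diff;
}
```

For a current estimate of 6400 the margin is 100, so a predicted value of 6450 is treated as no change while 6600 is reported as +200.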
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I89c0043fcc414f5e57eb6264b76b6137cc43c886
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Tue, 27 Jan 2015 14:04:17 +0000 (14:04 +0000)]
ANDROID: sched: Determine the current sched_group idle-state
To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states group_idle_state() finds the group idle-state.
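One plausible way to combine per-cpu idle states into a group idle state is to take the shallowest one, since a resource shared by the group cannot be in a deeper state than its least idle cpu. The commit message does not spell out the rule, so the following C sketch is an assumption throughout, including the function name and the convention that a smaller index means a shallower state.

```c
#include <limits.h>

/*
 * Hypothetical reduction: the group idle state is the minimum
 * (shallowest) idle-state index among its cpus.
 * Returns -1 for an empty group.
 */
static int group_idle_state_sketch(const int *cpu_idle_idx, int nr_cpus)
{
	int state = INT_MAX;
	int i;

	if (nr_cpus <= 0)
		return -1;

	for (i = 0; i < nr_cpus; i++)
		if (cpu_idle_idx[i] < state)
			state = cpu_idle_idx[i];
	return state;
}
```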
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I86e80ac0ef75bcb5e8d1b8db72800a9d34880467
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 27 Jan 2015 13:48:07 +0000 (13:48 +0000)]
ANDROID: sched, cpuidle: Track cpuidle state index in the scheduler
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can be used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose it is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ib3d1178512735b0e314881f73fb8ccff5a69319f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 6 Jan 2015 17:34:05 +0000 (17:34 +0000)]
ANDROID: sched: Estimate energy impact of scheduling decisions
Adds a generic energy-aware helper function, energy_diff(), that
calculates energy impact of adding, removing, and migrating utilization
in the system.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I05ce491e5e97c9f30183a4f8f3131c92aa68cecc
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Fri, 2 Jan 2015 14:21:56 +0000 (14:21 +0000)]
ANDROID: sched: Extend sched_group_energy to test load-balancing decisions
Extended sched_group_energy() to support energy prediction with usage
(tasks) added/removed from a specific cpu or migrated between a pair of
cpus. Useful for load-balancing decision making.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I31afd581448d894a97afa7f5f7dac4666191e6a0
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 18 Dec 2014 14:47:18 +0000 (14:47 +0000)]
ANDROID: sched: Calculate energy consumption of sched_group
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.
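As a rough illustration of what such an estimate can look like, the sketch below splits a group's time into a busy share and an idle share and weights each with the matching power value. The formula, the 1024-based utilization scale and all names are assumptions of this sketch; the commit message does not state how the function computes its result.

```c
#define SKETCH_SCALE_SHIFT 10
#define SKETCH_SCALE (1UL << SKETCH_SCALE_SHIFT)

/*
 * Hypothetical estimate for one group:
 *   busy share  = norm_util / 1024, charged at busy_power
 *   idle share  = the remainder,    charged at idle_power
 * norm_util is clamped to the scale.
 */
static unsigned long group_energy_sketch(unsigned long norm_util,
					 unsigned long busy_power,
					 unsigned long idle_power)
{
	unsigned long busy, idle;

	if (norm_util > SKETCH_SCALE)
		norm_util = SKETCH_SCALE;

	busy = (norm_util * busy_power) >> SKETCH_SCALE_SHIFT;
	idle = ((SKETCH_SCALE - norm_util) * idle_power) >> SKETCH_SCALE_SHIFT;
	return busy + idle;
}
```

A group that is busy half of the time with a busy power of 15200 and an idle power of 70 would be estimated at 7600 + 35 = 7635.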
NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I0da68f5aef23247db2652fad86ee06749c7e284a
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Fri, 2 Jan 2015 17:08:52 +0000 (17:08 +0000)]
ANDROID: sched: Highest energy aware balancing sched_domain level pointer
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which energy
model is provided. At this level and all levels below all sched_groups
have energy model data attached.
Partial energy model information is possible but restricted to providing
energy model data for lower level sched_domains (sd_ea and below) and
leaving load-balancing on levels above to non-energy-aware
load-balancing. For example, it is possible to apply energy-aware
scheduling within each socket on a multi-socket system and let normal
scheduling handle load-balancing between sockets.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ie9ff4dc97b4fda3292ce58c22f1032cd3085529c
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 11 Dec 2014 15:25:29 +0000 (15:25 +0000)]
ANDROID: sched: Relocated cpu_util() and change return type
Move cpu_util() to an earlier position in fair.c and change return
type to unsigned long as negative usage doesn't make much sense. All
other load and capacity related functions use unsigned long including
the caller of cpu_util().
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ic55bbd2af1850cf2d0b9b9d9fddca229670e7774
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Thu, 30 Jul 2015 15:53:30 +0000 (16:53 +0100)]
ANDROID: sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.
Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:
1. Only 1 cpu is/remains in one cluster of a multi cluster system.
This remaining cpu only has DIE and no MC sd.
2. A complete cluster in a two cluster system is hot-plugged out.
The cpus of the remaining cluster only have MC and no DIE sd.
To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:
1. There is only 1 cpu left in the sd.
2. There have to be at least 2 sg's if certain sd flags are set.
Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.
It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.
In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.
However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.
The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.
Test (w/ CONFIG_SCHED_DEBUG):
JUNO r0 default system:
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for remaining A57 (cpu2) cpu:
$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f
Test 2: Hotplug-out the entire A57 cluster:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
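In the flag values above, the annotated entries differ from the default ones only in the lowest bit (832f vs 832e, 102f vs 102e). Assuming SD_LOAD_BALANCE occupies that bit in the tested kernel, which is inferred from the output rather than stated, the check can be written as the following sketch; the macro and function names are hypothetical.

```c
/* Assumption: SD_LOAD_BALANCE is the lowest bit of the flags word. */
#define SD_LOAD_BALANCE_SKETCH 0x0001u

/* Returns 1 if the sched_domain flags value has load balancing enabled. */
static int sd_does_load_balance(unsigned int flags)
{
	return (flags & SD_LOAD_BALANCE_SKETCH) != 0;
}
```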
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I5db1596513303caae218f3660def828a5a7e99d5
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 13 Jan 2015 13:50:46 +0000 (13:50 +0000)]
ANDROID: sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
cpufreq is currently keeping it a secret which cpus are sharing a
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).
There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.
cc: Russell King <linux@arm.linux.org.uk>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I636b038920664ade636cc9db9285d2a87943441b
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 14 Nov 2014 16:20:20 +0000 (16:20 +0000)]
ANDROID: sched: Initialize energy data structures
The sched_group_energy (sge) pointer of the first sched_group (sg) in
the sched_domain (sd) is initialized to point to the appropriate (in
terms of sd level and cpu) sge data defined in the arch and so to the
correct part of the Energy Model (EM).
Energy-aware scheduling allows that a system has only EM data up to a
certain sd level (so called highest energy aware balancing sd level).
A check in init_sched_energy() enforces that all sd's below this sd
level contain EM data.
The 'int cpu' parameter of sched_domain_energy_f requires that
check_sched_energy_data() makes sure that all cpus spanned by a sg
are provisioned with the same EM data.
This patch has also been tested with feature FORCE_SD_OVERLAP enabled.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ic73adfee78a189576fb4ed8ad309424ee498dac1
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 14 Nov 2014 16:08:45 +0000 (16:08 +0000)]
ANDROID: sched: Introduce energy data structures
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:
(1) number of elements of the idle state array
(2) pointer to the idle state array which comprises 'power consumption'
for each idle state
(3) number of elements of the capacity state array
(4) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state
The struct sched_group obtains a pointer to a struct sched_group_energy.
The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.
The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
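The four members listed above, the per-cpu provider function and the extra consistency check can be sketched in standalone C. All type and function names carry a _sketch suffix or are otherwise hypothetical, the power and capacity numbers are made up, and the cpu-to-cluster mapping (cpus 0-1 vs the rest) is an assumption for the example.

```c
struct idle_state_sketch { unsigned long power; };
struct capacity_state_sketch { unsigned long cap; unsigned long power; };

/* Mirrors items (1) to (4) of the description above. */
struct sched_group_energy_sketch {
	unsigned int nr_idle_states;
	const struct idle_state_sketch *idle_states;
	unsigned int nr_cap_states;
	const struct capacity_state_sketch *cap_states;
};

/* Provider with an 'int cpu' parameter, as discussed in the text. */
typedef const struct sched_group_energy_sketch *(*energy_fn_sketch)(int cpu);

/* Made-up data for two clusters. */
static const struct idle_state_sketch idle_a[] = { { 10 }, { 0 } };
static const struct capacity_state_sketch cap_a[] = { { 256, 50 }, { 512, 120 } };
static const struct sched_group_energy_sketch energy_a = { 2, idle_a, 2, cap_a };

static const struct idle_state_sketch idle_b[] = { { 25 }, { 0 } };
static const struct capacity_state_sketch cap_b[] = { { 512, 200 }, { 1024, 600 } };
static const struct sched_group_energy_sketch energy_b = { 2, idle_b, 2, cap_b };

static const struct sched_group_energy_sketch *energy_of(int cpu)
{
	return cpu < 2 ? &energy_a : &energy_b;
}

/* Extra check: all cpus of a group must map to the same energy data. */
static int group_energy_consistent(energy_fn_sketch fn, const int *cpus, int nr)
{
	int i;

	for (i = 1; i < nr; i++)
		if (fn(cpus[i]) != fn(cpus[0]))
			return 0;
	return 1;
}
```

A group spanning cpus {0, 1} passes the check, while a group spanning cpus {1, 2} would be rejected because its cpus are provisioned with different data.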
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I6d9f8c59c418cfaeb092500efb0ae3cd5a8a815d
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 13 Jan 2015 13:45:51 +0000 (13:45 +0000)]
ANDROID: sched: Make energy awareness a sched feature
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set false when SCHED_DEBUG is not defined. Hence this doesn't
allow energy awareness to be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.
ENERGY_AWARE is based on per-entity load-tracking hence FAIR_GROUP_SCHED
must be enabled. This dependency isn't checked at compile time yet.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ib94ec79f7f5820dfaed515ede548afe2936c15c6
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 13 Jan 2015 13:43:28 +0000 (13:43 +0000)]
ANDROID: sched: Documentation for scheduler energy cost model
This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I7ad1d6855ac92595377e3334abeb64b158588968
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 19 Oct 2017 12:51:54 +0000 (13:51 +0100)]
ANDROID: arm64: Enable dynamic sched_domain flag setting
The patch lets the arch_topology driver take over setting of
sched_domain flags that should be detected dynamically based on the
actual system topology.
cc: Catalin Marinas <catalin.marinas@arm.com>
cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I7886c0a53899987e77ef6937e1c667bf32a58bfd
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 19 Oct 2017 12:50:06 +0000 (13:50 +0100)]
ANDROID: arm: Enable dynamic sched_domain flag setting
The patch lets the arch_topology driver take over setting of
sched_domain flags that should be detected dynamically based on the
actual system topology.
cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I5bc17383c5666e8ff35d5109016dec98474997ce
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 19 Oct 2017 12:46:03 +0000 (13:46 +0100)]
ANDROID: drivers/base/arch_topology: Dynamic sched_domain flag detection
This patch adds support for dynamic sched_domain flag detection. Flags
like SD_ASYM_CPUCAPACITY are not guaranteed to be set at the same level
for all systems. Let the arch_topology driver do the detection of where
those flags should be set instead. This patch adds initial support for
setting the SD_ASYM_CPUCAPACITY flag.
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I924f55770b4065d18c2097231647ca2f19ec3718
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 7 Mar 2017 16:41:26 +0000 (16:41 +0000)]
ANDROID: sched/fair: Avoid unnecessary balancing of asymmetric capacity groups
On systems with asymmetric cpu capacities, a skewed load distribution
might yield better throughput than balancing load per group capacity.
For example, running compute intensive tasks on high capacity cpus while
leaving low capacity cpus idle. So we let load-balance back off if the
busiest group isn't really overloaded.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I8b08a0fa73f357a9972324bc76cec3912fe293cf
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Tue, 7 Mar 2017 16:40:34 +0000 (16:40 +0000)]
ANDROID: sched: Consider misfit tasks when load-balancing
On asymmetric cpu capacity systems and systems with high RT/IRQ load,
intensive tasks can end up on cpus that don't suit their compute demand.
In these scenarios 'misfit' tasks should be migrated to cpus with higher
compute capacity to ensure better throughput. group_misfit_task indicates
this scenario, but tweaks to the load-balance code are needed to make the
migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the first cpu with the rq->misfit flag raised as the source cpu.
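Modification 1 can be pictured as a simple predicate. The sketch below is an assumption-laden illustration, not the code from the patch: struct group_stats, its members and the notion of "spare capacity" as utilization below total capacity are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-group summary gathered during load-balance. */
struct group_stats {
	unsigned long capacity_per_cpu;	/* compute capacity of one cpu */
	unsigned long capacity;		/* total capacity of the group */
	unsigned long util;		/* total utilization of the group */
	bool has_misfit_task;		/* some cpu raised its misfit flag */
};

/* Assumed definition: spare capacity means utilization below capacity. */
static bool group_has_spare(const struct group_stats *g)
{
	return g->util < g->capacity;
}

/*
 * A source group with misfit tasks is only a candidate for the busiest
 * group if the destination has higher per-cpu capacity and spare capacity.
 */
static bool misfit_group_is_candidate(const struct group_stats *src,
				      const struct group_stats *dst)
{
	return src->has_misfit_task &&
	       dst->capacity_per_cpu > src->capacity_per_cpu &&
	       group_has_spare(dst);
}
```

The predicate only covers the group selection step; the imbalance calculation and source cpu selection from items 2 to 4 are left out.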
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: If523e28ca397f67aaddf7dc1ddc1c22488f899d1
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Fri, 17 Jul 2015 15:45:07 +0000 (16:45 +0100)]
ANDROID: sched: Add group_misfit_task load-balance type
To maximize throughput in systems with asymmetric cpu capacities (e.g.
high RT/IRQ load and/or ARM big.LITTLE) load-balancing has to consider
task and cpu utilization as well as per-cpu compute capacity when
load-balancing in addition to the current average load based
load-balancing policy. Tasks that are scheduled on a lower capacity
cpu need to be identified and migrated to a higher capacity cpu if
possible to maximize throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded in any other way (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I9e3ccd5c3bde1102e5121c83ec3561cf90b684b7
(fixup for !SMP platforms)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 3 Feb 2017 19:57:05 +0000 (19:57 +0000)]
ANDROID: arm64: wire cpu-invariant accounting support up to the task scheduler
Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
weak function to #define") changed the wiring which now has to be done
by associating arch_scale_cpu_capacity with the actual implementation
provided by the architecture.
Define arch_scale_cpu_capacity to use the arch_topology "driver"
function topology_get_cpu_scale() for the task scheduler's cpu-invariant
accounting instead of the default arch_scale_cpu_capacity() in
kernel/sched/sched.h.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Change-Id: I67ab10316ee0e9284bb10d54adb4053a223fa6ad
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 3 Feb 2017 21:39:57 +0000 (21:39 +0000)]
ANDROID: arm64: wire frequency-invariant accounting support up to the task scheduler
Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
changed the wiring which now has to be done by associating
arch_scale_freq_capacity with the actual implementation provided
by the architecture.
Define arch_scale_freq_capacity to use the arch_topology "driver"
function topology_get_freq_scale() for the task scheduler's
frequency-invariant accounting instead of the default
arch_scale_freq_capacity() in kernel/sched/sched.h.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Change-Id: Idb89c6a03334789b9b405d7f9cc8d79b8462a438
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 3 Feb 2017 18:11:27 +0000 (18:11 +0000)]
ANDROID: arm: wire cpu-invariant accounting support up to the task scheduler
Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
weak function to #define") changed the wiring which now has to be done
by associating arch_scale_cpu_capacity with the actual implementation
provided by the architecture.
Define arch_scale_cpu_capacity to use the arch_topology "driver"
function topology_get_cpu_scale() for the task scheduler's cpu-invariant
accounting instead of the default arch_scale_cpu_capacity() in
kernel/sched/sched.h.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Change-Id: I13cde277c54d0d6be4af9eb60f0ea1dadc51277a
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 3 Feb 2017 20:37:16 +0000 (20:37 +0000)]
ANDROID: arm: wire frequency-invariant accounting support up to the task scheduler
Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
changed the wiring which now has to be done by associating
arch_scale_freq_capacity with the actual implementation provided
by the architecture.
Define arch_scale_freq_capacity to use the arch_topology "driver"
function topology_get_freq_scale() for the task scheduler's
frequency-invariant accounting instead of the default
arch_scale_freq_capacity() in kernel/sched/sched.h.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Change-Id: I04dc62ca703c8a1ebce52b35a39fdeaa55e44ca1
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 21 Jul 2017 10:40:58 +0000 (11:40 +0100)]
ANDROID: drivers base/arch_topology: allow inlining cpu-invariant accounting support
Allow inlining of topology_get_cpu_scale() into the task
scheduler fast path (e.g. __update_load_avg_se()) by coding it as a
static inline function in the arch topology header file.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I878f33922905fb778ad0b3ee86126e62b5a7d834
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 30 Jun 2017 16:00:23 +0000 (17:00 +0100)]
ANDROID: drivers base/arch_topology: provide frequency-invariant accounting support
Implements the arch-specific (arm and arm64) frequency-invariance setter
function arch_set_freq_scale() which provides the following frequency
scaling factor:
(current_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_supported_freq(cpu)
One possible consumer of the frequency-invariance getter function
topology_get_freq_scale() is the Per-Entity Load Tracking (PELT)
mechanism of the task scheduler.
Allow inlining of topology_get_freq_scale() into the task scheduler
fast path (e.g. __update_load_avg_se()) by coding it as a static inline
function in the arch topology header file.
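As a worked example of the scaling factor, a minimal userspace sketch is shown below. It assumes SCHED_CAPACITY_SHIFT is 10 (so full scale is 1024); the function name freq_scale() is hypothetical and only mirrors the formula, not the per-cpu storage done by the driver.

```c
/* Assumed fixed-point shift; 1 << 10 = 1024 represents full capacity. */
#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * Scale the current frequency against the maximum supported frequency.
 * The shift is applied before the division to keep integer precision.
 */
static unsigned long freq_scale(unsigned long cur_freq,
				unsigned long max_freq)
{
	return (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
}
```

With frequencies in kHz, a cpu running at 1000000 out of a maximum of 2000000 gets a factor of 512, i.e. half of SCHED_CAPACITY_SCALE.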
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I7a0fadc160d85daf9e35e346391c5d0bb9167d71
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 21 Jul 2017 10:32:57 +0000 (11:32 +0100)]
ANDROID: cpufreq: dt: invoke frequency-invariance setter function
Call the frequency-invariance setter function arch_set_freq_scale()
if the new frequency has been successfully set which is indicated by
dev_pm_opp_set_rate() returning 0.
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I00dce40c8def07e87bce7bc556ffc3a8693af038
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Fri, 21 Jul 2017 10:16:45 +0000 (11:16 +0100)]
ANDROID: cpufreq: arm_big_little: invoke frequency-invariance setter function
Call the frequency-invariance setter function arch_set_freq_scale()
if the new frequency has been successfully set which is indicated by
bL_cpufreq_set_rate() returning 0.
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I6fe98b7be9439fe46a44bbfb369ca169c3eb58dc
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Sat, 22 Jul 2017 17:13:50 +0000 (18:13 +0100)]
ANDROID: cpufreq: provide default frequency-invariance setter function
Frequency-invariant accounting support based on the ratio of current
frequency and maximum supported frequency is an optional feature an arch
can implement.
Since there are cpufreq drivers (e.g. cpufreq-dt) which can be built for
different archs, a default implementation of the frequency-invariance
setter function arch_set_freq_scale() is needed.
This default implementation is an empty weak function which will be
overridden by a strong function in case the arch provides one.
The setter function passes the cpumask of related (to the frequency
change) cpus (online and offline cpus), the (new) current frequency and
the maximum supported frequency.
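The weak-default pattern can be sketched in userspace C as below. This is an illustration only: struct cpumask_stub stands in for the kernel cpumask, the helper set_rate_and_notify() and the two example_set_rate_*() callbacks are hypothetical, and __attribute__((weak)) is a GCC/Clang extension.

```c
/* Stand-in for the kernel's cpumask type (hypothetical). */
struct cpumask_stub {
	unsigned long bits;
};

/*
 * Empty weak default. An arch that supports frequency invariance would
 * provide a strong definition with the same name, which the linker
 * prefers over this one.
 */
__attribute__((weak)) void arch_set_freq_scale(struct cpumask_stub *cpus,
					       unsigned long cur_freq,
					       unsigned long max_freq)
{
	(void)cpus;
	(void)cur_freq;
	(void)max_freq;
}

/* Hypothetical rate setters standing in for a driver's set-rate call. */
static int example_set_rate_ok(unsigned long freq)
{
	(void)freq;
	return 0;
}

static int example_set_rate_fail(unsigned long freq)
{
	(void)freq;
	return -22;
}

/*
 * Driver-side usage: the setter is only invoked when the new frequency
 * has been set successfully (return value 0).
 */
static int set_rate_and_notify(int (*set_rate)(unsigned long),
			       struct cpumask_stub *cpus,
			       unsigned long new_freq,
			       unsigned long max_freq)
{
	int ret = set_rate(new_freq);

	if (!ret)
		arch_set_freq_scale(cpus, new_freq, max_freq);
	return ret;
}
```

Because the default exists, a driver built for an arch without its own setter can call arch_set_freq_scale() unconditionally and still link.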
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I912d5815ee29e1171c498e638d1a089c5a598add
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Tue, 13 Jun 2017 22:21:59 +0000 (23:21 +0100)]
ANDROID: drivers base/arch_topology: free cpumask cpus_to_visit
Free cpumask cpus_to_visit in case registering
init_cpu_capacity_notifier has failed or the parsing of the cpu
capacity-dmips-mhz property is done. The cpumask cpus_to_visit is
only used inside the notifier call init_cpu_capacity_callback.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Change-Id: I84986964e6434d23a3c0feff2d9891516abcfe59
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Dietmar Eggemann [Mon, 26 Jan 2015 19:47:28 +0000 (19:47 +0000)]
ANDROID: sched: Enable idle balance to pull single task towards cpu with higher capacity
We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu. Add an
extra criterion to need_active_balance() to kick off active load balance
if the source cpu is over-utilized and has lower capacity than the
destination cpu.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ifa66a30d53c17d339fc5058901a87a643ffc3704
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Morten Rasmussen [Thu, 2 Jul 2015 16:16:34 +0000 (17:16 +0100)]
ANDROID: sched: Prevent unnecessary active balance of single task in sched group
Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions, resulting in the
nr_balance_failed counter being incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().
A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I6a2f51017c49614d5e4224f0e16f240ad8af6d0f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Linus Torvalds [Sun, 12 Nov 2017 18:46:13 +0000 (10:46 -0800)]
Linux 4.14
Linus Torvalds [Sun, 12 Nov 2017 18:12:41 +0000 (10:12 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of small fixes:
- make KGDB work again which got broken by the conversion of WARN()
to #UD. The WARN fixup needs to run before the notifier callchain,
otherwise KGDB tries to handle it and crashes.
- disable KASAN in the ORC unwinder to prevent false positive KASAN
warnings
- prevent default mapping above 47bit when 5 level page tables are
enabled
- make the delay calibration optimization work correctly, which had
the conditionals the wrong way around and was operating on data
which was not yet updated.
- remove the bogus X86_TRAP_BP trap init from the default IDT init
table, which broke 32bit int3 handling by overwriting the correct
int3 setup.
- replace this_cpu* with boot_cpu_data access in the preemptible
oprofile init code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/debug: Handle warnings before the notifier chain, to fix KGDB crash
x86/mm: Fix ELF_ET_DYN_BASE for 5-level paging
x86/idt: Remove X86_TRAP_BP initialization in idt_setup_traps()
x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context
x86/unwind: Disable KASAN checking in the ORC unwinder
x86/smpboot: Make optimization of delay calibration work correctly
Linus Torvalds [Sun, 12 Nov 2017 17:43:53 +0000 (09:43 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf tool fixes from Thomas Gleixner:
"A small set of fixes for perf tool:
- synchronize the i915 drm header to avoid the 'out of date' warning
- make sure that perf trace cleans up its temporary files on exit
- unbreak the build with newer flex versions
- add missing braces in the eBPF parsing rules"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tooling/headers: Sync the tools/include/uapi/drm/i915_drm.h UAPI header
perf trace: Call machine__exit() at exit
perf tools: Fix eBPF event specification parsing
perf tools: Add "reject" option for parse-events.l
Linus Torvalds [Sat, 11 Nov 2017 17:10:39 +0000 (09:10 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Use after free in vlan, from Cong Wang.
2) Handle NAPI poll with a zero budget properly in mlx5 driver, from
Saeed Mahameed.
3) If DMA mapping fails in mlx5 driver, NULL out page, from Inbar
Karmy.
4) Handle overrun in RX FIFO of sun4i CAN driver, from Gerhard
Bertelsmann.
5) Missing return in mdb and vlan prepare phase of DSA layer, from
Vivien Didelot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
vlan: fix a use-after-free in vlan_device_event()
net: dsa: return after vlan prepare phase
net: dsa: return after mdb prepare phase
can: ifi: Fix transmitter delay calculation
tcp: fix tcp_fastretrans_alert warning
tcp: gso: avoid refcount_t warning from tcp_gso_segment()
can: peak: Add support for new PCIe/M2 CAN FD interfaces
can: sun4i: handle overrun in RX FIFO
can: c_can: don't indicate triple sampling support for D_CAN
net/mlx5e: Increase Striding RQ minimum size limit to 4 multi-packet WQEs
net/mlx5e: Set page to null in case dma mapping fails
net/mlx5e: Fix napi poll with zero budget
net/mlx5: Cancel health poll before sending panic teardown command
net/mlx5: Loop over temp list to release delay events
rds: ib: Fix NULL pointer dereference in debug code
David S. Miller [Sat, 11 Nov 2017 12:52:01 +0000 (21:52 +0900)]
Merge tag 'linux-can-fixes-for-4.14-20171110' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2017-11-10
This is a pull request for net/master.
The first patch by Richard Schütz for the c_can driver removes the false
indication to support triple sampling for d_can. Gerhard Bertelsmann's
patch for the sun4i driver improves the RX overrun handling. The patch
by Stephane Grosjean for the peak_canfd driver adds the PCI ids for
various new PCIe/M2 interfaces. Marek Vasut's patch for the ifi driver
fixes the transmitter delay calculation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 11 Nov 2017 10:40:05 +0000 (19:40 +0900)]
Merge tag 'mlx5-fixes-2017-11-08' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2017-11-08
The following series includes some fixes for mlx5 core and ethernet
driver.
Sorry for the late submission, but as you can see I have some very
critical fixes below that I would like merged into this RC.
Please pull and let me know if there is any problem.
For -stable:
('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13
('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13
('net/mlx5: Cancel health poll before sending panic teardown command') kernels >= 4.13
V1->V2:
- Fix Reviewed-by tag of the 2nd patch.
- Drop the FPGA 0 size fix, it needs some more change log info.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 10 Nov 2017 00:43:13 +0000 (16:43 -0800)]
vlan: fix a use-after-free in vlan_device_event()
After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:
    RCU_INIT_POINTER(dev->vlan_info, NULL);
    call_rcu(&vlan_info->rcu, vlan_info_rcu_free);
However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():
    vlan_info = rtnl_dereference(dev->vlan_info);
    if (!vlan_info)
        goto out;
    grp = &vlan_info->grp;
Depending on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right following this vlan_vid_del().
Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingo Molnar [Sat, 11 Nov 2017 08:06:57 +0000 (09:06 +0100)]
tooling/headers: Sync the tools/include/uapi/drm/i915_drm.h UAPI header
Last minute upstream update to one of the UAPI headers - sync it with tooling,
to address this warning:
Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Sat, 11 Nov 2017 08:03:59 +0000 (09:03 +0100)]
Merge branch 'perf/urgent' of git://git./linux/kernel/git/acme/linux into perf/urgent
Pull perf tooling fixes from Arnaldo Carvalho de Melo.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vivien Didelot [Wed, 8 Nov 2017 15:50:10 +0000 (10:50 -0500)]
net: dsa: return after vlan prepare phase
The current code does not return after successfully preparing the VLAN
addition on every port that is a member of it. Fix this.
Fixes: 1ca4aa9cd4cc ("net: dsa: check VLAN capability of every switch")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 8 Nov 2017 15:49:56 +0000 (10:49 -0500)]
net: dsa: return after mdb prepare phase
The current code does not return after successfully preparing the MDB
addition on every port that is a member of a multicast group. Fix this.
Fixes: a1a6b7ea7f2d ("net: dsa: add cross-chip multicast support")
Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 10 Nov 2017 22:18:24 +0000 (14:18 -0800)]
Merge tag 'ceph-for-4.14-rc9' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"Memory allocation flags fix, marked for stable"
* tag 'ceph-for-4.14-rc9' of git://github.com/ceph/ceph-client:
rbd: use GFP_NOIO for parent stat and data requests
Linus Torvalds [Fri, 10 Nov 2017 22:14:23 +0000 (14:14 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input layer updates from Dmitry Torokhov:
- a new ACPI ID for Elan touchpad found in yet another Ideapad model
- Synaptics RMI4 will allow binding to controllers reporting SMB
version 3 (note that we are not adding any new ACPI IDs to the
Synaptics PS/2 driver so unless user explicitly enables intertouch
support there is no user-visible change)
- a fixup to TSC 2004/5 touchscreen driver to mark input devices as
"direct" to help userspace identify the type of device they are
dealing with
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: synaptics-rmi4 - RMI4 can also use SMBUS version 3
Input: tsc200x-core - set INPUT_PROP_DIRECT
Input: elan_i2c - add ELAN060C to the ACPI table
Linus Torvalds [Fri, 10 Nov 2017 20:24:42 +0000 (12:24 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fix from Radim Krčmář:
"Fix PPC HV host crash that can occur as a result of resizing the guest
hashed page table"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates
Linus Torvalds [Fri, 10 Nov 2017 20:21:15 +0000 (12:21 -0800)]
Merge tag 'mips_fixes_4.14_2' of git://git./linux/kernel/git/jhogan/mips
Pull MIPS fixes from James Hogan:
"A final few MIPS fixes for 4.14:
- fix BMIPS NULL pointer dereference (4.7)
- fix AR7 early GPIO init allocation failure (3.19)
- fix dead serial output on certain AR7 platforms (2.6.35)"
* tag 'mips_fixes_4.14_2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
MIPS: AR7: Ensure that serial ports are properly set up
MIPS: AR7: Defer registration of GPIO
MIPS: BMIPS: Fix missing cbr address
Maciej W. Rozycki [Fri, 10 Nov 2017 20:05:24 +0000 (20:05 +0000)]
.mailmap: Add Maciej W. Rozycki's Imagination e-mail address
Following my recent transition from Imagination Technologies to the
reincarnated MIPS company add a .mailmap mapping for my work address,
so that `scripts/get_maintainer.pl' gets it right for past commits.
Signed-off-by: Maciej W. Rozycki <macro@mips.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 10 Nov 2017 19:19:11 +0000 (11:19 -0800)]
Revert "x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo"
This reverts commit 941f5f0f6ef5338814145cf2b813cf1f98873e2f.
Sadly, it turns out that we really can't just do the cross-CPU IPI to
all CPUs to get their proper frequencies, because it's much too
expensive on systems with lots of cores.
So we'll have to revert this for now, and revisit it using a smarter
model (probably doing one system-wide IPI at open time, and doing all
the frequency calculations in parallel).
Reported-by: WANG Chao <chao.wang@ucloud.cn>
Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 10 Nov 2017 17:59:41 +0000 (09:59 -0800)]
Merge tag 'drm-fixes-for-v4.14-rc9' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Last few patches to wrap up.
Two i915 fixes that are on their way to stable, one vmware black
screen bug, and one const patch that I was going to drop, but it was
clearly a pretty safe one liner"
* tag 'drm-fixes-for-v4.14-rc9' of git://people.freedesktop.org/~airlied/linux:
drm/i915: Deconstruct struct sgt_dma initialiser
drm/i915: Reject unknown syncobj flags
drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue
drm/vmwgfx: constify vmw_fence_ops
Marek Vasut [Fri, 10 Nov 2017 10:22:39 +0000 (11:22 +0100)]
can: ifi: Fix transmitter delay calculation
The CANFD transmitter delay calculation formula was updated in the
latest software drop from IFI and improves the behavior of the IFI
CANFD core during bitrate switching. Use the new formula to improve
stability of the CANFD operation.
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Markus Marb <markus@marb.org>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Yuchung Cheng [Tue, 7 Nov 2017 23:33:43 +0000 (15:33 -0800)]
tcp: fix tcp_fastretrans_alert warning
This patch fixes the cause of a WARNING indicating TCP has pending
retransmission in Open state in tcp_fastretrans_alert().
The root cause is a bad interaction between path mtu probing,
if enabled, and the RACK loss detection. Upon receiving a SACK
above the sequence of the MTU probing packet, RACK could mark the
probe packet lost in tcp_fastretrans_alert(), prior to calling
tcp_simple_retransmit().
tcp_simple_retransmit() only enters Loss state if it newly marks
the probe packet lost. If the probe packet is already identified as
lost by RACK, the sender remains in Open state with some packets
marked lost and retransmitted. Then the next SACK would trigger
the warning. The likely scenario is that the probe packet was
lost due to its size or network congestion. The actual impact of
this warning is small: the sender potentially enters fast recovery
one ACK later.
The simple fix is to always enter recovery (Loss) state if some
packet is marked lost during path MTU probing.
Fixes: a0370b3f3f2c ("tcp: enable RACK loss detection to trigger recovery")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 7 Nov 2017 23:15:04 +0000 (15:15 -0800)]
tcp: gso: avoid refcount_t warning from tcp_gso_segment()
When a GSO skb of truesize O is segmented into 2 new skbs of truesize N1
and N2, we want to transfer socket ownership to the new fresh skbs.
In order to avoid expensive atomic operations on a cache line subject to
cache bouncing, we replace the sequence :
refcount_add(N1, &sk->sk_wmem_alloc);
refcount_add(N2, &sk->sk_wmem_alloc); // repeated by number of segments
refcount_sub(O, &sk->sk_wmem_alloc);
by a single
refcount_add(sum_of(N) - O, &sk->sk_wmem_alloc);
Problem is :
In some pathological cases, sum(N) - O might be a negative number, and
syzkaller bot was apparently able to trigger this trace [1]
atomic_t was ok with this construct, but we need to take care of the
negative delta with refcount_t
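As a rough userspace illustration of the negative-delta handling described
above (a sketch, not the actual patch): toy_refcount and its helpers are
hypothetical stand-ins for refcount_t, refcount_add() and
refcount_sub_and_test(), assumed here to saturate at UINT_MAX. The point is
only that a signed delta has to be split into an add or a sub, because a
negative value converted to unsigned looks like a huge increment.

```c
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for refcount_t (assumption: saturates instead of wrapping). */
struct toy_refcount {
	unsigned int refs;
};

/* Add i; an increment that would overflow pins the counter at UINT_MAX. */
static void toy_refcount_add(unsigned int i, struct toy_refcount *r)
{
	if (r->refs == UINT_MAX)
		return;			/* already saturated */
	if (i > UINT_MAX - r->refs)
		r->refs = UINT_MAX;	/* "saturated; leaking memory" */
	else
		r->refs += i;
}

/* Subtract i; returns true when the counter drops to zero. */
static bool toy_refcount_sub_and_test(unsigned int i, struct toy_refcount *r)
{
	if (r->refs == UINT_MAX || i > r->refs)
		return false;		/* saturated or would underflow */
	r->refs -= i;
	return r->refs == 0;
}

/* Apply a signed delta (sum(N) - O) without feeding a negative to add(). */
static void toy_apply_delta(int delta, struct toy_refcount *r)
{
	if (delta >= 0)
		toy_refcount_add((unsigned int)delta, r);
	else	/* 0u - x negates in unsigned arithmetic, safe even for INT_MIN */
		(void)toy_refcount_sub_and_test(0u - (unsigned int)delta, r);
}
```

With a counter of 5000, toy_apply_delta(-2257, ...) leaves 2743, whereas
passing the same value straight to toy_refcount_add() saturates the counter.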
[1]
refcount_t: saturated; leaking memory.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 8404 at lib/refcount.c:77 refcount_add_not_zero+0x198/0x200 lib/refcount.c:77
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 8404 Comm: syz-executor2 Not tainted 4.14.0-rc5-mm1+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1c4/0x1e0 kernel/panic.c:546
report_bug+0x211/0x2d0 lib/bug.c:183
fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
RIP: 0010:refcount_add_not_zero+0x198/0x200 lib/refcount.c:77
RSP: 0018:ffff8801c606e3a0 EFLAGS: 00010282
RAX: 0000000000000026 RBX: 0000000000001401 RCX: 0000000000000000
RDX: 0000000000000026 RSI: ffffc900036fc000 RDI: ffffed0038c0dc68
RBP: ffff8801c606e430 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8801d97f5eba R11: 0000000000000000 R12: ffff8801d5acf73c
R13: 1ffff10038c0dc75 R14: 00000000ffffffff R15: 00000000fffff72f
refcount_add+0x1b/0x60 lib/refcount.c:101
tcp_gso_segment+0x10d0/0x16b0 net/ipv4/tcp_offload.c:155
tcp4_gso_segment+0xd4/0x310 net/ipv4/tcp_offload.c:51
inet_gso_segment+0x60c/0x11c0 net/ipv4/af_inet.c:1271
skb_mac_gso_segment+0x33f/0x660 net/core/dev.c:2749
__skb_gso_segment+0x35f/0x7f0 net/core/dev.c:2821
skb_gso_segment include/linux/netdevice.h:3971 [inline]
validate_xmit_skb+0x4ba/0xb20 net/core/dev.c:3074
__dev_queue_xmit+0xe49/0x2070 net/core/dev.c:3497
dev_queue_xmit+0x17/0x20 net/core/dev.c:3538
neigh_hh_output include/net/neighbour.h:471 [inline]
neigh_output include/net/neighbour.h:479 [inline]
ip_finish_output2+0xece/0x1460 net/ipv4/ip_output.c:229
ip_finish_output+0x85e/0xd10 net/ipv4/ip_output.c:317
NF_HOOK_COND include/linux/netfilter.h:238 [inline]
ip_output+0x1cc/0x860 net/ipv4/ip_output.c:405
dst_output include/net/dst.h:459 [inline]
ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
ip_queue_xmit+0x8c6/0x18e0 net/ipv4/ip_output.c:504
tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1137
tcp_write_xmit+0x663/0x4de0 net/ipv4/tcp_output.c:2341
__tcp_push_pending_frames+0xa0/0x250 net/ipv4/tcp_output.c:2513
tcp_push_pending_frames include/net/tcp.h:1722 [inline]
tcp_data_snd_check net/ipv4/tcp_input.c:5050 [inline]
tcp_rcv_established+0x8c7/0x18a0 net/ipv4/tcp_input.c:5497
tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1460
sk_backlog_rcv include/net/sock.h:909 [inline]
__release_sock+0x124/0x360 net/core/sock.c:2264
release_sock+0xa4/0x2a0 net/core/sock.c:2776
tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
sock_sendmsg_nosec net/socket.c:632 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:642
___sys_sendmsg+0x31c/0x890 net/socket.c:2048
__sys_sendmmsg+0x1e6/0x5f0 net/socket.c:2138
Fixes: 14afee4b6092 ("net: convert sock.sk_wmem_alloc from atomic_t to refcount_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephane Grosjean [Thu, 9 Nov 2017 13:42:14 +0000 (14:42 +0100)]
can: peak: Add support for new PCIe/M2 CAN FD interfaces
This adds support for the following PEAK-System CAN FD interfaces:
PCAN-cPCIe FD CAN FD Interface for cPCI Serial (2 or 4 channels)
PCAN-PCIe/104-Express CAN FD Interface for PCIe/104-Express (1, 2 or 4 ch.)
PCAN-miniPCIe FD CAN FD Interface for PCIe Mini (1, 2 or 4 channels)
PCAN-PCIe FD OEM CAN FD Interface for PCIe OEM version (1, 2 or 4 ch.)
PCAN-M.2 CAN FD Interface for M.2 (1 or 2 channels)
Like the PCAN-PCIe FD interface, all of these boards run the same IP Core
that is able to handle CAN FD (see also http://www.peak-system.com).
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Gerhard Bertelsmann [Mon, 6 Nov 2017 17:16:56 +0000 (18:16 +0100)]
can: sun4i: handle overrun in RX FIFO
The SUN4I CAN IP has a 64-byte deep FIFO buffer. If the buffer is not
drained fast enough (overrun), its contents get mangled. Frames already
received are dropped; their data can't be restored.
Signed-off-by: Gerhard Bertelsmann <info@gerhard-bertelsmann.de>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Richard Schütz [Sun, 29 Oct 2017 12:03:22 +0000 (13:03 +0100)]
can: c_can: don't indicate triple sampling support for D_CAN
The D_CAN controller doesn't provide a triple sampling mode, so don't set
the CAN_CTRLMODE_3_SAMPLES flag in ctrlmode_supported. Currently enabling
triple sampling is a no-op.
Signed-off-by: Richard Schütz <rschuetz@uni-koblenz.de>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.6
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Alexander Shishkin [Mon, 24 Jul 2017 10:04:28 +0000 (13:04 +0300)]
x86/debug: Handle warnings before the notifier chain, to fix KGDB crash
Commit:
9a93848fe787 ("x86/debug: Implement __WARN() using UD0")
turned warnings into UD0, but the fixup code only runs after the
notify_die() chain. This is a problem, in particular, with kgdb,
which kicks in as if it was a BUG().
Fix this by running the fixup code before the notifier chain in
the invalid op handler path.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: <stable@vger.kernel.org> # v4.12+
Link: http://lkml.kernel.org/r/20170724100428.19173-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Eugenia Emantayev [Thu, 12 Jan 2017 15:11:45 +0000 (17:11 +0200)]
net/mlx5e: Increase Striding RQ minimum size limit to 4 multi-packet WQEs
This is to prevent the case of working with a single MPWQE
(1 WQE is always reserved as RQ is linked-list).
When the WQE is fully consumed, HW should still have available buffer
in order not to drop packets.
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Inbar Karmy [Sun, 15 Oct 2017 14:30:59 +0000 (17:30 +0300)]
net/mlx5e: Set page to null in case dma mapping fails
Currently, when dma mapping fails, put_page is called,
but the page is not set to null. Later, in the page_reuse treatment in
mlx5e_free_rx_descs(), mlx5e_page_release() is called for the second time,
improperly doing dma_unmap (for a non-mapped address) and an extra put_page.
Prevent this by nullifying the page pointer when dma_map fails.
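A minimal userspace sketch of why clearing the pointer matters, assuming a
release path that skips entries without a page. All toy_* names are
hypothetical and only mimic the shape of the allocation and release paths
described above; they are not the driver's functions.

```c
#include <stddef.h>

struct toy_page {
	int refcount;
};

struct toy_dma_info {
	struct toy_page *page;
	int mapped;		/* 1 while a (simulated) DMA mapping exists */
};

static void toy_put_page(struct toy_page *p)
{
	p->refcount--;
}

/* Release path: a NULL page means there is nothing to unmap or put. */
static void toy_page_release(struct toy_dma_info *di)
{
	if (!di->page)
		return;
	if (di->mapped)
		di->mapped = 0;	/* stands in for dma_unmap */
	toy_put_page(di->page);
	di->page = NULL;
}

/* Allocation path; map_ok == 0 simulates a DMA mapping failure. */
static int toy_page_alloc_mapped(struct toy_dma_info *di,
				 struct toy_page *page, int map_ok)
{
	page->refcount++;	/* take a reference for this descriptor */
	di->page = page;
	if (!map_ok) {
		toy_put_page(page);
		di->page = NULL;	/* without this, release runs again */
		return -1;
	}
	di->mapped = 1;
	return 0;
}
```

After a failed toy_page_alloc_mapped(), a later toy_page_release() is a
no-op, so the page reference is dropped exactly once.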
Fixes: accd58833237 ("net/mlx5e: Introduce RX Page-Reuse")
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Tue, 31 Oct 2017 22:34:00 +0000 (15:34 -0700)]
net/mlx5e: Fix napi poll with zero budget
napi->poll can be called with budget 0, e.g. in netpoll scenarios
where the caller only wants to poll TX rings
(poll_one_napi@net/core/netpoll.c).
The below commit changed RX polling from a "while" loop to "do {} while",
which caused the initial budget to be ignored and at least one RX
packet to be handled.
This fixes the following warning:
[ 2852.049194] mlx5e_napi_poll+0x0/0x260 [mlx5_core] exceeded budget in poll
[ 2852.049195] ------------[ cut here ]------------
[ 2852.049195] WARNING: CPU: 0 PID: 25691 at net/core/netpoll.c:171 netpoll_poll_dev+0x18a/0x1a0
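The budget problem described above can be sketched in plain C. The toy_*
functions are hypothetical, and the early return is just one possible guard,
not necessarily how the driver checks the budget; "pending" stands in for
the number of packets waiting in the RX ring.

```c
/* do/while shape: one packet is handled before the budget is consulted. */
static int toy_poll_rx_do_while(int budget, int *pending)
{
	int work_done = 0;

	do {
		if (*pending == 0)
			break;
		(*pending)--;		/* "handle" one RX packet */
	} while (++work_done < budget);

	return work_done;
}

/* Guarded shape: a zero budget returns before touching the RX ring. */
static int toy_poll_rx_guarded(int budget, int *pending)
{
	int work_done = 0;

	if (budget <= 0)
		return 0;

	do {
		if (*pending == 0)
			break;
		(*pending)--;
	} while (++work_done < budget);

	return work_done;
}
```

With three packets pending and budget 0, the first variant reports one unit
of work (exceeding the budget) while the guarded variant reports zero.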
Fixes: 4b7dfc992514 ("net/mlx5e: Early-return on empty completion queues")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Tested-by: Martin KaFai Lau <kafai@fb.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Huy Nguyen [Tue, 26 Sep 2017 20:11:56 +0000 (15:11 -0500)]
net/mlx5: Cancel health poll before sending panic teardown command
After the panic teardown firmware command, health_care detects the error
in PCI bus and calls the mlx5_pci_err_detected. This health_care flow is
no longer needed because the panic teardown firmware command will bring
down the PCI bus communication with the HCA.
The solution is to cancel the health care timer and its pending
workqueue request before sending panic teardown firmware command.
Kernel trace:
mlx5_core 0033:01:00.0: Shutdown was called
mlx5_core 0033:01:00.0: health_care:154:(pid 9304): handling bad device here
mlx5_core 0033:01:00.0: mlx5_handle_bad_state:114:(pid 9304): NIC state 1
mlx5_core 0033:01:00.0: mlx5_pci_err_detected was called
mlx5_core 0033:01:00.0: mlx5_enter_error_state:96:(pid 9304): start
mlx5_3:mlx5_ib_event:3061:(pid 9304): warning: event on port 0
mlx5_core 0033:01:00.0: mlx5_enter_error_state:104:(pid 9304): end
Unable to handle kernel paging request for data at address 0x0000003f
Faulting instruction address: 0xc0080000434b8c80
Fixes: 8812c24d28f4 ('net/mlx5: Add fast unload support in shutdown flow')
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Huy Nguyen [Mon, 30 Oct 2017 03:40:56 +0000 (22:40 -0500)]
net/mlx5: Loop over temp list to release delay events
list_splice_init() reinitializes waiting_events_list after splicing it onto
the temp list; therefore we should loop over the temp list to fire the events.
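To illustrate the splice semantics, here is a small userspace sketch with a
hand-rolled circular list. The toy_list helpers are hypothetical and only
mimic the behaviour of the kernel's list_head API as described above: after
the splice, the source list is empty and every entry lives on the
destination.

```c
/* Minimal circular doubly-linked list, shaped like the kernel's list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *h)
{
	h->next = h;
	h->prev = h;
}

static int toy_list_empty(const struct toy_list *h)
{
	return h->next == h;
}

static void toy_list_add_tail(struct toy_list *n, struct toy_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of src to the front of dst, then reinitialize src. */
static void toy_list_splice_init(struct toy_list *src, struct toy_list *dst)
{
	struct toy_list *first, *last, *at;

	if (toy_list_empty(src))
		return;

	first = src->next;
	last = src->prev;
	at = dst->next;

	first->prev = dst;
	dst->next = first;
	last->next = at;
	at->prev = last;

	toy_list_init(src);	/* src is empty from here on */
}

static int toy_list_count(const struct toy_list *h)
{
	int n = 0;

	for (const struct toy_list *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

Iterating over the source list after toy_list_splice_init() visits nothing,
which is why the events have to be fired from the temp list.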
Fixes: 4ca637a20a52 ("net/mlx5: Delay events till mlx5 interface's add complete for pci resume")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Håkon Bugge [Tue, 7 Nov 2017 15:33:34 +0000 (16:33 +0100)]
rds: ib: Fix NULL pointer dereference in debug code
rds_ib_recv_refill() is a function that refills an IB receive
queue. It can be called from both the CQE handler (tasklet) and a
worker thread.
Just after the call to ib_post_recv(), a debug message is printed with
rdsdebug():
ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
(long) ib_sg_dma_address(
ic->i_cm_id->device,
&recv->r_frag->f_sg),
ret);
Now consider an invocation of rds_ib_recv_refill() from the worker
thread, which is preemptible. Further, assume that the worker thread
is preempted between the ib_post_recv() and rdsdebug() statements.
Then, if the preemption is due to a receive CQE event, the
rds_ib_recv_cqe_handler() will be invoked. This function processes
receive completions, including freeing up data structures, such as the
recv->r_frag.
In this scenario, rds_ib_recv_cqe_handler() will process the receive
WR posted above. That implies that recv->r_frag has been freed
before the above rdsdebug() statement has been executed. When it is
later executed, we will have a NULL pointer dereference:
[ 4088.068008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 4088.076754] IP: rds_ib_recv_refill+0x87/0x620 [rds_rdma]
[ 4088.082686] PGD 0 P4D 0
[ 4088.085515] Oops: 0000 [#1] SMP
[ 4088.089015] Modules linked in: rds_rdma(OE) rds(OE) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) mlx4_ib(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) ib_cm(E) iw_cm(E) ib_core(E) binfmt_misc(E) sb_edac(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) crypto_simd(E) iTCO_wdt(E) glue_helper(E) iTCO_vendor_support(E) sg(E) cryptd(E) pcspkr(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) shpchp(E) ioatdma(E) i2c_i801(E) wmi(E) lpc_ich(E) mei_me(E) mei(E) mfd_core(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E)
[ 4088.168486] fb_sys_fops(E) ahci(E) ixgbe(E) libahci(E) ttm(E) mdio(E) ptp(E) pps_core(E) drm(E) sd_mod(E) libata(E) crc32c_intel(E) mlx4_core(E) i2c_core(E) dca(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: rds]
[ 4088.193442] CPU: 20 PID: 1244 Comm: kworker/20:2 Tainted: G OE 4.14.0-rc7.master.20171105.ol7.x86_64 #1
[ 4088.205097] Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
[ 4088.216074] Workqueue: ib_cm cm_work_handler [ib_cm]
[ 4088.221614] task: ffff885fa11d0000 task.stack: ffffc9000e598000
[ 4088.228224] RIP: 0010:rds_ib_recv_refill+0x87/0x620 [rds_rdma]
[ 4088.234736] RSP: 0018:ffffc9000e59bb68 EFLAGS: 00010286
[ 4088.240568] RAX: 0000000000000000 RBX: ffffc9002115d050 RCX: ffffc9002115d050
[ 4088.248535] RDX: ffffffffa0521380 RSI: ffffffffa0522158 RDI: ffffffffa0525580
[ 4088.256498] RBP: ffffc9000e59bbf8 R08: 0000000000000005 R09: 0000000000000000
[ 4088.264465] R10: 0000000000000339 R11: 0000000000000001 R12: 0000000000000000
[ 4088.272433] R13: ffff885f8c9d8000 R14: ffffffff81a0a060 R15: ffff884676268000
[ 4088.280397] FS: 0000000000000000(0000) GS:ffff885fbec80000(0000) knlGS:0000000000000000
[ 4088.289434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4088.295846] CR2: 0000000000000020 CR3: 0000000001e09005 CR4: 00000000001606e0
[ 4088.303816] Call Trace:
[ 4088.306557] rds_ib_cm_connect_complete+0xe0/0x220 [rds_rdma]
[ 4088.312982] ? __dynamic_pr_debug+0x8c/0xb0
[ 4088.317664] ? __queue_work+0x142/0x3c0
[ 4088.321944] rds_rdma_cm_event_handler+0x19e/0x250 [rds_rdma]
[ 4088.328370] cma_ib_handler+0xcd/0x280 [rdma_cm]
[ 4088.333522] cm_process_work+0x25/0x120 [ib_cm]
[ 4088.338580] cm_work_handler+0xd6b/0x17aa [ib_cm]
[ 4088.343832] process_one_work+0x149/0x360
[ 4088.348307] worker_thread+0x4d/0x3e0
[ 4088.352397] kthread+0x109/0x140
[ 4088.355996] ? rescuer_thread+0x380/0x380
[ 4088.360467] ? kthread_park+0x60/0x60
[ 4088.364563] ret_from_fork+0x25/0x30
[ 4088.368548] Code: 48 89 45 90 48 89 45 98 eb 4d 0f 1f 44 00 00 48 8b 43 08 48 89 d9 48 c7 c2 80 13 52 a0 48 c7 c6 58 21 52 a0 48 c7 c7 80 55 52 a0 <4c> 8b 48 20 44 89 64 24 08 48 8b 40 30 49 83 e1 fc 48 89 04 24
[ 4088.389612] RIP: rds_ib_recv_refill+0x87/0x620 [rds_rdma] RSP: ffffc9000e59bb68
[ 4088.397772] CR2: 0000000000000020
[ 4088.401505] ---[ end trace fe922e6ccf004431 ]---
This bug was provoked by compiling rds out-of-tree with
EXTRA_CFLAGS="-DRDS_DEBUG -DDEBUG" and inserting an artificial delay
between the ib_post_recv() and rdsdebug() statements:
/* XXX when can this fail? */
ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+ if (can_wait)
+ usleep_range(1000, 5000);
rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
(long) ib_sg_dma_address(
The fix is simply to move the rdsdebug() statement up before the
ib_post_recv() and remove the printing of ret, which is taken care of
anyway by the non-debug code.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 10 Nov 2017 02:26:51 +0000 (18:26 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"2 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: update TPM driver infrastructure changes
sysctl: add register_sysctl() dummy helper
Jarkko Sakkinen [Thu, 9 Nov 2017 21:38:21 +0000 (13:38 -0800)]
MAINTAINERS: update TPM driver infrastructure changes
[akpm@linux-foundation.org: alpha-sort CREDITS, per Randy]
Link: http://lkml.kernel.org/r/20170915223811.21368-1-jarkko.sakkinen@linux.intel.com
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Marcel Selhorst <tpmdd@selhorst.net>
Cc: Ashley Lai <ashleydlai@gmail.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Håvard Skinnemoen <hskinnemoen@gmail.com>
Cc: Martin Kepplinger <martink@posteo.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Gertjan van Wingerde <gwingerde@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Thu, 9 Nov 2017 21:38:18 +0000 (13:38 -0800)]
sysctl: add register_sysctl() dummy helper
register_sysctl() has been around for five years with commit
fea478d4101a ("sysctl: Add register_sysctl for normal sysctl users") but
now that arm64 started using it, I ran into a compile error:
arch/arm64/kernel/armv8_deprecated.c: In function 'register_insn_emulation_sysctl':
arch/arm64/kernel/armv8_deprecated.c:257:2: error: implicit declaration of function 'register_sysctl'
This adds an inline function like we already have for
register_sysctl_paths() and register_sysctl_table().
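The dummy-helper pattern can be sketched as below. This is an illustration
under assumptions rather than the actual header change: TOY_CONFIG_SYSCTL is
a hypothetical stand-in for the real config symbol, and the two structs are
left opaque so the fragment compiles on its own.

```c
#include <stddef.h>

struct ctl_table;		/* opaque in this sketch */
struct ctl_table_header;

#ifdef TOY_CONFIG_SYSCTL
/* Real implementation lives elsewhere when sysctl support is built in. */
struct ctl_table_header *register_sysctl(const char *path,
					 struct ctl_table *table);
#else
/* Dummy helper: callers still compile and simply get NULL back. */
static inline struct ctl_table_header *register_sysctl(const char *path,
						       struct ctl_table *table)
{
	(void)path;
	(void)table;
	return NULL;
}
#endif
```

With the stub in place, a caller built without sysctl support no longer hits
an implicit declaration and can treat a NULL return as "nothing registered".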
Link: http://lkml.kernel.org/r/20171106133700.558647-1-arnd@arndb.de
Fixes: 38b9aeb32fa7 ("arm64: Port deprecated instruction emulation to new sysctl interface")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Alex Benne <alex.bennee@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 10 Nov 2017 01:43:27 +0000 (17:43 -0800)]
Merge tag 'pci-v4.14-fixes-7' of git://git./linux/kernel/git/helgaas/pci
Pull PCI maintainership updates from Bjorn Helgaas:
"Update MAINTAINERS for HiSilicon, Microsemi Switchtec, and native host
bridge drivers (Gabriele Paoloni, Sebastian Andrzej Siewior).
Note that starting with changes intended for v4.16, Lorenzo Pieralisi
will maintain the drivers/pci/{dwc,endpoint,host} directories. My
intent is to continue to merge those changes via my tree, so this
should be transparent to you"
* tag 'pci-v4.14-fixes-7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Add Lorenzo Pieralisi for PCI host bridge drivers
MAINTAINERS: Remove Gabriele Paoloni as HiSilicon PCI maintainer
MAINTAINERS: Remove Stephen Bates as Microsemi Switchtec maintainer
Linus Torvalds [Fri, 10 Nov 2017 01:41:39 +0000 (17:41 -0800)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"Last ARM fix for 4.14.
This plugs a hole in dump_instr(), which, with certain conditions
satisfied, can dump instructions from kernel space"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8720/1: ensure dump_instr() checks addr_limit
Linus Torvalds [Thu, 9 Nov 2017 19:16:28 +0000 (11:16 -0800)]
Merge tag 'pm-final-4.14' of git://git./linux/kernel/git/rafael/linux-pm
Pull final power management fixes from Rafael Wysocki:
"These fix a regression in the schedutil cpufreq governor introduced by
a recent change and blacklist Dell XPS13 9360 from using the Low Power
S0 Idle _DSM interface which triggers serious problems on one of these
machines.
Specifics:
- Prevent the schedutil cpufreq governor from using the utilization
of a wrong CPU in some cases which started to happen after one of
the recent changes in it (Chris Redpath).
- Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
interface as that causes a serious issue (related to NVMe) to appear
on one of these machines, even though other Dell XPS13 9360s in
somewhat different HW configurations behave correctly (Rafael
Wysocki)"
* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
cpufreq: schedutil: Examine the correct CPU when we update util
Linus Torvalds [Thu, 9 Nov 2017 17:58:11 +0000 (09:58 -0800)]
Merge tag 'sound-4.14' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"The amount of changes isn't quite as small as wished; nevertheless
they are straight fixes that deserve merging to 4.14 final.
Most of fixes are about ALSA core bugs spotted by fuzzer: a follow-up
fix for the previous nested rwsem patch, a fix to avoid the resource
hogs due to too many concurrent ALSA timer invocations, and a fix for
a crash with SYSEX MIDI transfer over OSS sequencer emulation that is
used by none but fuzzer.
The rest are usual HD-audio and USB-audio device-specific quirks,
which are safe to apply"
* tag 'sound-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - fix headset mic problem for Dell machines with alc274
ALSA: seq: Fix OSS sysex delivery in OSS emulation
ALSA: seq: Avoid invalid lockdep class warning
ALSA: timer: Limit max instances per timer
ALSA: usb-audio: support new Amanero Combo384 firmware version
Linus Torvalds [Thu, 9 Nov 2017 17:31:34 +0000 (09:31 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix use-after-free in IPSEC input parsing, destination address
pointer was loaded before pskb_may_pull() which can change the SKB
data pointers. From Florian Westphal.
2) Stack out-of-bounds read in xfrm_state_find(), from Steffen
Klassert.
3) IPVS state of SKB is not properly reset when moving between
namespaces, from Ye Yin.
4) Fix crash in asix driver suspend and resume, from Andrey Konovalov.
5) Don't deliver ipv6 l2tp tunnel packets to ipv4 l2tp tunnels, and
vice versa, from Guillaume Nault.
6) Fix DSACK undo on non-dup ACKs, from Priyaranjan Jha.
7) Fix regression in bond_xmit_hash()'s behavior after the TCP port
selection changes back in 4.2, from Hangbin Liu.
8) Two divide by zero bugs in USB networking drivers when parsing
descriptors, from Bjorn Mork.
9) Fix bonding slaves being stuck in BOND_LINK_FAIL state, from Jay
Vosburgh.
10) Missing skb_reset_mac_header() in qmi_wwan, from Kristian Evensen.
11) Fix the destruction of tc action object races properly, from Cong
Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
cls_u32: use tcf_exts_get_net() before call_rcu()
cls_tcindex: use tcf_exts_get_net() before call_rcu()
cls_rsvp: use tcf_exts_get_net() before call_rcu()
cls_route: use tcf_exts_get_net() before call_rcu()
cls_matchall: use tcf_exts_get_net() before call_rcu()
cls_fw: use tcf_exts_get_net() before call_rcu()
cls_flower: use tcf_exts_get_net() before call_rcu()
cls_flow: use tcf_exts_get_net() before call_rcu()
cls_cgroup: use tcf_exts_get_net() before call_rcu()
cls_bpf: use tcf_exts_get_net() before call_rcu()
cls_basic: use tcf_exts_get_net() before call_rcu()
net_sched: introduce tcf_exts_get_net() and tcf_exts_put_net()
Revert "net_sched: hold netns refcnt for each action"
net: usb: asix: fill null-ptr-deref in asix_suspend
Revert "net: usb: asix: fill null-ptr-deref in asix_suspend"
qmi_wwan: Add missing skb_reset_mac_header-call
bonding: fix slave stuck in BOND_LINK_FAIL state
qrtr: Move to postcore_initcall
net: qmi_wwan: fix divide by 0 on bad descriptors
net: cdc_ether: fix divide by 0 on bad descriptors
...
Kirill A. Shutemov [Tue, 7 Nov 2017 10:38:04 +0000 (13:38 +0300)]
x86/mm: Fix ELF_ET_DYN_BASE for 5-level paging
On machines with 5-level paging we don't want to allocate mapping above
47-bit unless user explicitly asked for it. See b569bab78d8d ("x86/mm:
Prepare to expose larger address space to userspace") for details.
c715b72c1ba4 ("mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")
broke the behaviour. After the commit elf binary and heap got
mapped above 47-bits.
Use DEFAULT_MAP_WINDOW instead of TASK_SIZE to determine ELF_ET_DYN_BASE so
it's forced to be below 47-bits unconditionally.
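A back-of-the-envelope check of the claim above, as a userspace sketch. The
numbers are assumptions for illustration: 4 KiB pages, a 56-bit user address
space with 5-level paging, a 47-bit default window, and a base placed at two
thirds of the window. The toy_* helpers are hypothetical and only reproduce
the arithmetic.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL

/* Size of the user address window for a given number of address bits. */
static uint64_t toy_window(unsigned int bits)
{
	return (1ULL << bits) - TOY_PAGE_SIZE;
}

/* Two thirds of the window: the assumed shape of the ET_DYN base. */
static uint64_t toy_et_dyn_base(uint64_t window)
{
	return window / 3 * 2;
}
```

Under these assumptions, toy_et_dyn_base(toy_window(56)) lands far above
1ULL << 47, while toy_et_dyn_base(toy_window(47)) stays below it, which
matches the motivation for deriving the base from the 47-bit window.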
Fixes: c715b72c1ba4 ("mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com
Andrei Vagin [Wed, 8 Nov 2017 00:22:45 +0000 (16:22 -0800)]
perf trace: Call machine__exit() at exit
Otherwise 'perf trace' leaves a temporary file /tmp/perf-vdso.so-XXXXXX.
$ perf trace -o log true
$ ls -l /tmp/perf-vdso.*
-rw------- 1 root root 8192 Nov 8 03:08 /tmp/perf-vdso.so-5bCpD0
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vasily Averin <vvs@virtuozzo.com>
Link: http://lkml.kernel.org/r/20171108002246.8924-1-avagin@openvz.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Thu, 9 Nov 2017 09:02:10 +0000 (10:02 +0100)]
perf tools: Fix eBPF event specification parsing
Looks like I've reached the new level of stupidity, adding missing braces.
Committer testing:
The following eBPF C filter, which adds a record when it returns true,
i.e. when the tv_nsec variable is > 1000ns, should be built and
installed via sys_bpf(), but fails to do so before this patch:
# cat filter.c
#include <uapi/linux/bpf.h>
#define SEC(NAME) __attribute__((section(NAME), used))
SEC("func=hrtimer_nanosleep rqtp->tv_nsec")
int func(void *ctx, int err, long nsec)
{
return nsec > 1000;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
#
# perf trace -e nanosleep,filter.c usleep 1
invalid or unsupported event: 'filter.c'
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
#
And works again after it is applied: nothing is inserted when the condition
is false (usleep 1), and a record is inserted when it is true (usleep 2):
# perf trace -e *sleep,filter.c usleep 1
0.000 ( 0.066 ms): usleep/23994 nanosleep(rqtp: 0x7ffead94a0d0) = 0
# perf trace -e *sleep,filter.c usleep 2
0.000 ( 0.008 ms): usleep/24378 nanosleep(rqtp: 0x7fffa021ba50) ...
0.008 ( ): perf_bpf_probe:func:(ffffffffb410cb30) tv_nsec=2000)
0.000 ( 0.066 ms): usleep/24378 ... [continued]: nanosleep()) = 0
#
The intent of 9445464bb831 is kept:
# perf stat -e 'cpu/uops_executed.core,krava/' true
event syntax error: '..cuted.core,krava/'
\___ unknown term
valid terms: cmask,pc,event,edge,in_tx,any,ldlat,inv,umask,in_tx_cp,offcore_rsp,config,config1,config2,name,period
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
#
# perf stat -e 'cpu/uops_executed.core,period=1/' true
Performance counter stats for 'true':
808,332 cpu/uops_executed.core,period=1/
0.002997237 seconds time elapsed
#
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 9445464bb831 ("perf tools: Unwind properly location after REJECT")
Link: http://lkml.kernel.org/n/tip-diea0ihbwpxfw6938huv3whj@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Wed, 8 Nov 2017 15:43:09 +0000 (16:43 +0100)]
perf tools: Add "reject" option for parse-events.l
Arnaldo reported broken builds in some distros using a newer flex
release, 2.6.4, found in Alpine Linux 3.6 and Edge, with flex not
spotting the REJECT macro:
CC /tmp/build/perf/util/parse-events-flex.o
util/parse-events.l: In function 'parse_events_lex':
/tmp/build/perf/util/parse-events-flex.c:4734:16: error: \
'reject_used_but_not_detected' undeclared (first use in this function)
It's happening because we put the REJECT under another USER_REJECT macro
in the following commit:
9445464bb831 perf tools: Unwind properly location after REJECT
Fortunately flex provides an option to force it to use REJECT; add it
to parse-events.l.
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 9445464bb831 ("perf tools: Unwind properly location after REJECT")
Link: http://lkml.kernel.org/n/tip-7kdont984mw12ijk7rji6b8p@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ilya Dryomov [Mon, 6 Nov 2017 10:33:36 +0000 (11:33 +0100)]
rbd: use GFP_NOIO for parent stat and data requests
rbd_img_obj_exists_submit() and rbd_img_obj_parent_read_full() are on
the writeback path for cloned images -- we attempt a stat on the parent
object to see if it exists and potentially read it in to call copyup.
GFP_NOIO should be used instead of GFP_KERNEL here.
Cc: stable@vger.kernel.org
Link: http://tracker.ceph.com/issues/22014
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Hui Wang [Thu, 9 Nov 2017 00:48:08 +0000 (08:48 +0800)]
ALSA: hda - fix headset mic problem for Dell machines with alc274
Kailang of Realtek confirmed that pin 0x19 is for the Headset Mic and
pin 0x1a is for the Headphone Mic; he suggested applying
ALC269_FIXUP_DELL1_MIC_NO_PRESENCE to fix this problem. We verified
that applying this fixup fixes the problem.
Cc: <stable@vger.kernel.org>
Cc: Kailang Yang <kailang@realtek.com>
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
David S. Miller [Thu, 9 Nov 2017 01:58:35 +0000 (10:58 +0900)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2017-11-09
1) Fix a use after free due to a reallocated skb head.
From Florian Westphal.
2) Fix sporadic lookup failures on labeled IPSEC.
From Florian Westphal.
3) Fix a stack out of bounds when a socket policy is applied
to an IPv6 socket that sends IPv4 packets.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Thu, 9 Nov 2017 01:17:32 +0000 (11:17 +1000)]
Merge tag 'drm-intel-fixes-2017-11-08' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Fix possible NULL dereference (Chris).
- Avoid misuse of syncobj by rejecting unknown flags (Tvrtko).
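Rejecting unknown flags is usually done by masking the caller's value against the set of known bits. The following is a minimal userspace sketch of that check, not the i915 code: the DEMO_* flag names and demo_validate_flags() are hypothetical and chosen only for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only; not the i915 definitions. */
#define DEMO_FLAG_WAIT   (1u << 0)
#define DEMO_FLAG_SIGNAL (1u << 1)
#define DEMO_KNOWN_FLAGS (DEMO_FLAG_WAIT | DEMO_FLAG_SIGNAL)

/* Return 0 if every set bit is a known flag, -EINVAL otherwise. */
int demo_validate_flags(uint32_t flags)
{
	/* Any bit outside the known mask means the caller passed junk. */
	if (flags & ~DEMO_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}
```

Failing early on unknown bits keeps those bits free for future use, since no existing caller can have come to depend on them being ignored.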
* tag 'drm-intel-fixes-2017-11-08' of git://anongit.freedesktop.org/drm/drm-intel:
drm/i915: Deconstruct struct sgt_dma initialiser
drm/i915: Reject unknown syncobj flags
David S. Miller [Thu, 9 Nov 2017 01:03:10 +0000 (10:03 +0900)]
Merge branch 'net-sched-race-fix'
Cong Wang says:
====================
net_sched: close the race between call_rcu() and cleanup_net()
This patchset tries again to fix the race between call_rcu() and
cleanup_net(). Without holding the netns refcnt, the
tc_action_net_exit() in the netns workqueue could be called before
the filter destroy works in the tc filter workqueue. This patchset
moves the netns refcnt from tc actions to tcf_exts, without
breaking per-netns tc actions.
Patch 1 reverts the previous fix, patch 2 introduces two new
APIs to help address the bug, and the remaining patches switch
to the new APIs. Please see each patch for details.
I was previously unable to reproduce this bug, but after adding
some delay in the filter destroy work I managed to trigger the
crash. With this patchset applied, the crash is no longer
reproducible and the debugging printk's show that the ordering is
as expected.
====================
Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Mon, 6 Nov 2017 21:47:30 +0000 (13:47 -0800)]
cls_u32: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.
Note: on the ->destroy() path we have to respect the return value
of tcf_exts_get_net(); on other paths it should always return
true, so we don't need to check it.
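The ordering described here (take the netns reference before deferring, drop it only after the destroy has run) can be modelled in userspace roughly as follows. This is a single-threaded sketch under stated assumptions, not the kernel code: every demo_* name is hypothetical, a plain int stands in for the kernel's atomic refcount, and destroying immediately when the reference cannot be taken is an assumption about what "respecting the return value" means.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a network namespace with a use count. */
struct demo_net {
	int count;		/* 0 models a namespace already being torn down */
};

/* Hypothetical stand-in for filter extensions that point at a netns. */
struct demo_exts {
	struct demo_net *net;
	bool destroyed;
};

/* Take a reference only if the namespace is still live. */
bool demo_exts_get_net(struct demo_exts *exts)
{
	if (exts->net->count == 0)
		return false;
	exts->net->count++;
	return true;
}

void demo_exts_put_net(struct demo_exts *exts)
{
	assert(exts->net->count > 0);
	exts->net->count--;
}

/* Deferred part: models the callback that runs after the grace period. */
void demo_deferred_destroy(struct demo_exts *exts)
{
	exts->destroyed = true;		/* models tcf_exts_destroy() */
	demo_exts_put_net(exts);	/* release only after destroy is done */
}

/*
 * ->destroy() path: returns true if the caller must schedule
 * demo_deferred_destroy(), false if the filter was destroyed right away.
 */
bool demo_destroy_filter(struct demo_exts *exts)
{
	if (demo_exts_get_net(exts))
		return true;
	/* Assumption: with no reference available, destroy synchronously. */
	exts->destroyed = true;
	return false;
}
```

In this model the namespace count cannot reach zero between the get and the deferred destroy, which is the property the reference is meant to provide.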
Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Mon, 6 Nov 2017 21:47:29 +0000 (13:47 -0800)]
cls_tcindex: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.
Note: on the ->destroy() path we have to respect the return value
of tcf_exts_get_net(); on other paths it should always return
true, so we don't need to check it.
Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>