Eric Dumazet [Mon, 23 Jul 2018 16:28:19 +0000 (09:28 -0700)]
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
[ Upstream commit
3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
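As an illustration, here is a minimal sketch of the kind of guard described above, assuming the tcp_input.c context (out_of_order_queue rb-tree, rb_to_skb); it is not the actual patch and the threshold names and values are assumptions:
/* Hedged sketch: skip copy-collapsing tiny skbs and bail out of the
 * rb-tree walk once too many have been seen, leaving the heavier work
 * to tcp_prune_ofo_queue().  Constants are illustrative only.
 */
#define OFO_TINY_LEN	64	/* payload below this counts as "tiny" */
#define OFO_TINY_LIMIT	128	/* give up after this many tiny skbs */

static void collapse_ofo_sketch(struct sock *sk)
{
	struct rb_node *node = rb_first(&tcp_sk(sk)->out_of_order_queue);
	unsigned int tiny = 0;

	while (node) {
		struct sk_buff *skb = rb_to_skb(node);

		node = rb_next(node);
		if (skb->len < OFO_TINY_LEN) {		/* 1) never collapse tiny skbs */
			if (++tiny > OFO_TINY_LIMIT)	/* 2) pathological flow: give up */
				return;
			continue;
		}
		/* ... collapse contiguous, sufficiently large skbs here ... */
	}
}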
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:18 +0000 (09:28 -0700)]
tcp: avoid collapses in tcp_prune_queue() if possible
[ Upstream commit
f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]
Right after a TCP flow is created, receiving tiny out-of-order
packets always hits the condition:
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers an O(N^2) attack surface to malicious peers.
Better not to attempt anything before full queue capacity is reached,
forcing the attacker to spend lots of resources and allowing us to
detect the abuse more easily.
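A rough sketch of the resulting flow, shown for illustration rather than as the exact diff, and assuming the surrounding tcp_prune_queue() context:
	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
		tcp_clamp_window(sk);

	/* Nothing to gain from expensive collapsing while the receive queue
	 * is still within sk_rcvbuf: force the attacker to fill it first.
	 */
	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
		return 0;

	tcp_collapse_ofo_queue(sk);
	/* ... the rest of the pruning logic is unchanged ... */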
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:17 +0000 (09:28 -0700)]
tcp: free batches of packets in tcp_prune_ofo_queue()
[ Upstream commit
72cd43ba64fc172a443410ce01645895850844c8 ]
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, we cannot revert to the old behavior without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
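A hedged sketch of the batching strategy (the helper name and details are illustrative; the real patch operates inside tcp_prune_ofo_queue()):
static void prune_ofo_batch_sketch(struct sock *sk)
{
	struct rb_node *node = rb_last(&tcp_sk(sk)->out_of_order_queue);
	int goal = sk->sk_rcvbuf >> 3;	/* free ~12.5% of the capacity */

	while (node && goal > 0) {
		struct sk_buff *skb = rb_to_skb(node);

		node = rb_prev(node);
		goal -= skb->truesize;
		rb_erase(&skb->rbnode, &tcp_sk(sk)->out_of_order_queue);
		tcp_drop(sk, skb);	/* drop and free this skb */
	}
	sk_mem_reclaim(sk);
}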
Fixes:
36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Alistair Strachan [Wed, 25 Jul 2018 23:11:09 +0000 (16:11 -0700)]
x86_64_cuttlefish_defconfig: Enable android-verity
Bug:
72722987
Test: Build & boot with x86_64_cuttlefish_defconfig
Change-Id: I961e6aaa944b5ab0c005cb39604a52f8dc98fb06
Signed-off-by: Alistair Strachan <astrachan@google.com>
Alistair Strachan [Wed, 25 Jul 2018 23:11:38 +0000 (16:11 -0700)]
x86_64_cuttlefish_defconfig: enable verity cert
Bug:
72722987
Test: Build, boot and verify in /proc/keys
Change-Id: Ia55b94d56827003a88cb6083a75340ee31347470
Signed-off-by: Alistair Strachan <astrachan@google.com>
Sandeep Patil [Tue, 24 Jul 2018 23:59:40 +0000 (16:59 -0700)]
ANDROID: android-verity: Fix broken parameter handling.
The android-verity documentation states that the target expects
the key, followed by the backing device, on the command line as follows:
"dm=system none ro,0 1 android-verity <public-key-id> <backing-partition>"
However, the code actually expects the backing device as the first
parameter. Fix that.
Bug:
72722987
Change-Id: Ibd56c0220f6003bdfb95aa2d611f787e75a65c97
Signed-off-by: Sandeep Patil <sspatil@google.com>
Sandeep Patil [Mon, 23 Jul 2018 23:31:32 +0000 (16:31 -0700)]
ANDROID: android-verity: Make it work with newer kernels
Fixed bio API calls as they changed from 4.4 to 4.9.
Fixed the driver to use the new verify_signature_one() API.
Removed the dead code resulting from the rebase.
Bug:
72722987
Test: Build and boot hikey with system partition mounted as root using
android-verity
Signed-off-by: Sandeep Patil <sspatil@google.com>
Change-Id: I1e29111d57b62f0451404c08d49145039dd00737
Sandeep Patil [Tue, 24 Jul 2018 16:40:07 +0000 (09:40 -0700)]
ANDROID: android-verity: Add API to verify signature with builtin keys.
The builtin keyring was exported prior to this which allowed
android-verity to simply lookup the key in the builtin keyring and
verify the signature of the verity metadata.
This is now broken as the kernel expects the signature to be
in pkcs#7 format (the same format used for module signing). Obviously, this
doesn't work with the verity metadata, as we just append the raw signature
to the metadata .. sigh.
*This one time*, add an API to accept an arbitrary signature and verify
it with a key from the system's trusted keyring.
Bug:
72722987
Test:
$ adb push verity_fs.img /data/local/tmp/
$ adb root && adb shell
> cd /data/local/tmp
> losetup /dev/block/loop0 verity_fs.img
> dmctl create verity-fs android-verity 0 4200 Android:#
7e4333f9bba00adfe0ede979e28ed1920492b40f 7:0
> mount -t ext4 /dev/block/dm-0 temp/
> cat temp/foo.txt temp/bar.txt
Change-Id: I0c14f3cb2b587b73a4c75907367769688756213e
Signed-off-by: Sandeep Patil <sspatil@google.com>
Sandeep Patil [Wed, 18 Jul 2018 14:16:42 +0000 (07:16 -0700)]
ANDROID: verity: fix android-verity Kconfig dependencies
Bug:
72722987
Test: Android verity now shows up in 'make menuconfig'
Change-Id: I21c2f36c17f45e5eb0daa1257f5817f9d56527e7
Signed-off-by: Sandeep Patil <sspatil@google.com>
Pavankumar Kondeti [Wed, 21 Jun 2017 03:52:45 +0000 (09:22 +0530)]
ANDROID: uid_sys_stats: Replace tasklist lock with RCU in uid_cputime_show
The tasklist lock is acquired in uid_cputime_show for updating the stats
for all tasks in the system. This can potentially disable preemption
for several milliseconds. Replace tasklist_lock with RCU read side
primitives.
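For illustration, the RCU-based walk looks roughly like this (a sketch, not the exact diff):
	struct task_struct *p, *t;

	rcu_read_lock();
	for_each_process_thread(p, t) {
		/* accumulate per-uid utime/stime for thread 't' here */
	}
	rcu_read_unlock();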
Change-Id: Ife69cb577bfdceaae6eb21b9bda09a0fe687e140
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Chris Redpath [Mon, 9 Jul 2018 14:44:00 +0000 (15:44 +0100)]
ANDROID: Add hold functionality to schedtune CPU boost
When tasks come and go from a runqueue quickly, this can lead to boost
being applied and removed quickly which sometimes means we cannot raise
the CPU frequency again when we need to (due to the rate limit on
frequency updates). This has proved to be a particular issue for RT tasks
and alternative methods have been used in the past to work around it.
This is an attempt to solve the issue for all task classes and cpufreq
governors by introducing a generic mechanism in schedtune to retain
the max boost level from task enqueue for a minimum period - defined
here as 50ms. This timeout was determined experimentally and is not
configurable.
A sched_feat guards the application of this to tasks - in the default
configuration, task boosting is only applied to tasks which have RT
policy. Change SCHEDTUNE_BOOST_HOLD_ALL to true to apply it to all
tasks regardless of class.
It works like so:
Every task enqueue (in an allowed class) stores a cpu-local timestamp.
If the task is not a member of an allowed class (all or RT depending
upon feature selection), the timestamp is not updated.
The boost group will stay active regardless of tasks present until
50ms beyond the last timestamp stored. We also store the timestamp
of the active boost group to avoid unnecessarily revisiting the boost
groups when checking CPU boost level.
If the timestamp is more than 50ms in the past when we check boost then
we re-evaluate the boost groups for that CPU, taking into account the
timestamps associated with each group.
Idea based on rt-boost-retention patches from Joel.
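A minimal sketch of the hold mechanism, with hypothetical names (the schedtune code integrates this into its per-CPU boost groups):
#define BOOST_HOLD_NS	(50ULL * NSEC_PER_MSEC)

struct boost_hold {
	u64 last_enqueue_ns;	/* written on every allowed-class enqueue */
	int boost;		/* max boost captured at that enqueue */
};

static int held_boost(const struct boost_hold *bh, u64 now_ns)
{
	/* Keep reporting the captured boost until the hold window expires;
	 * after that the caller re-evaluates the boost groups.
	 */
	if (now_ns < bh->last_enqueue_ns + BOOST_HOLD_NS)
		return bh->boost;
	return 0;
}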
Change-Id: I52cc2d2e82d1c5aa03550378c8836764f41630c1
Suggested-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[forward ported from android-4.9-eas-dev proposal]
Chris Redpath [Mon, 9 Jul 2018 15:54:02 +0000 (16:54 +0100)]
ANDROID: sched/rt: Add schedtune accounting to rt task enqueue/dequeue
rt tasks are currently not eligible for schedtune boosting. Make it so
by adding enqueue/dequeue hooks.
For rt tasks, schedtune only acts as a frequency boosting framework, it
has no impact on placement decisions and the prefer_idle attribute is
not used.
Also prepare schedutil's use of boosted util for rt task boosting.
With this change, schedtune accounting will include rt class tasks;
however, boosting currently only applies to the utilization provided by
fair class tasks. Sum up the tracked CPU utilization applying boost to
the aggregate util instead - this includes RT task util in the boosting
if any tasks are runnable.
Scenario 1, considering one CPU:
1x rt task running, util 250, boost 0
1x cfs task runnable, util 250, boost 50
previous util=250+(50pct_boosted_250) = 887
new util=50_pct_boosted_500 = 762
Scenario 2, considering one CPU:
1x rt task running, util 250, boost 50
1x cfs task runnable, util 250, boost 0
previous util=250+250 = 500
new util=50_pct_boosted_500 = 762
Scenario 3, considering one CPU:
1x rt task running, util 250, boost 50
1x cfs task runnable, util 250, boost 50
previous util=250+(50pct_boosted_250) = 887
new util=50_pct_boosted_500 = 762
Scenario 4:
1x rt task running, util 250, boost 50
previous util=250 = 250
new util=50_pct_boosted_250 = 637
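The figures above follow the usual schedtune-style boost margin; a hedged sketch of the calculation (with SCHED_CAPACITY_SCALE being 1024):
static unsigned long boosted_util_sketch(unsigned long util, int boost_pct)
{
	/* e.g. boosted(500, 50) = 500 + (1024 - 500) / 2 = 762
	 *      boosted(250, 50) = 250 + (1024 - 250) / 2 = 637
	 */
	return util + boost_pct * (SCHED_CAPACITY_SCALE - util) / 100;
}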
Change-Id: Ie287cbd0692468525095b5024db9faac8b2f4878
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Rafael J. Wysocki [Thu, 5 Apr 2018 17:12:43 +0000 (19:12 +0200)]
UPSTREAM: cpuidle: menu: Avoid selecting shallow states with stopped tick
If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU can spend a long time in that
state if the selection is based on an inaccurate prediction of idle
time. That effect turns out to be relevant, so it needs to be
mitigated.
To that end, modify the menu governor to discard the result of the
idle time prediction if the tick is stopped and the predicted idle
time is less than the tick period length, unless the tick timer is
going to expire soon.
Cherry-picked from
87c9fe6ee495f78f36d39cb37f6a714444a093ee
Change-Id: I7f45e76140c7c429049a23ac5d7cf4728ba70542
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 5 Apr 2018 17:12:34 +0000 (19:12 +0200)]
UPSTREAM: cpuidle: menu: Refine idle state selection for running tick
If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick and that means lost energy.
To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (before stopping the tick) in addition to the
estimated one via an extra pointer argument and make menu_select()
use that value to refine the state selection when necessary.
Cherry-picked from
296bb1e51a4838a6488ec5ce676607093482ecbc
Change-Id: I6c3b4227817231f437da0879d1ca401522ef3a70
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Tue, 3 Apr 2018 21:17:11 +0000 (23:17 +0200)]
UPSTREAM: sched: idle: Select idle state before stopping the tick
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_stop_tick() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.
This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.
Namely, tick_nohz_get_sleep_length() can be made to call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event. If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.
If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.
In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_get_next_event().
The minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped. It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.
With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.
Cherry-picked from
554c8aa8ecade210d58a252173bb8f2106552a44
Change-Id: I11bbd473ccb00cf89105f97a3861d56459cc8377
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Tue, 3 Apr 2018 21:17:00 +0000 (23:17 +0200)]
BACKPORT: time: hrtimer: Introduce hrtimer_next_event_without()
The next set of changes will need to compute the time to the next
hrtimer event over all hrtimers except for the scheduler tick one.
To that end introduce a new helper function,
hrtimer_next_event_without(), for computing the time until the next
hrtimer event over all timers except for one and modify the underlying
code in __hrtimer_next_event_base() to prepare it for being called by
that new function.
No intentional code behavior changes.
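For reference, a hedged sketch of how the new helper is meant to be used ('ts' is assumed to be the per-cpu struct tick_sched; see the upstream commit for the authoritative prototype):
	/* u64 hrtimer_next_event_without(const struct hrtimer *exclude); */
	u64 next_hrtimer = hrtimer_next_event_without(&ts->sched_timer);

	/* 'next_hrtimer' is the earliest hrtimer expiry while ignoring the
	 * scheduler tick timer itself, which is what the cpuidle path needs.
	 */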
Cherry-picked from
a59855cd8c613ba4bb95147f6176360d95f75e60
- Fixed conflict with tick_nohz_stopped_cpu not appearing in the
chunk context. Trivial fix.
Change-Id: I0dae9e08e19559efae9697800738c5522ab5933f
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 29 Mar 2018 09:31:43 +0000 (11:31 +0200)]
BACKPORT: time: tick-sched: Split tick_nohz_stop_sched_tick()
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped, split
tick_nohz_stop_sched_tick() into two separate routines, one computing
the time to the next timer event and the other simply stopping the
tick when the time to the next timer event is known.
Prepare these two routines to be called separately, as one of them
will be called by the idle governor in the cpuidle_select() code
path after subsequent changes.
Update the former callers of tick_nohz_stop_sched_tick() to use
the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(),
instead of it and move the updates of the sleep_length field in
struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't
need to be updated anywhere else.
There should be no intentional visible changes in functionality
resulting from this change.
Cherry-picked from
23a8d888107ce4ce444eab2dcebf4cfb3578770b
- Fixed conflict with tick_nohz_stopped_cpu not appearing in the
chunk context. Trivial fix.
Change-Id: I43eb34c4799b4b9f18842e047c3d016394c22743
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 22 Mar 2018 16:50:49 +0000 (17:50 +0100)]
UPSTREAM: cpuidle: Return nohz hint from cpuidle_select()
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.
Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false' through
it if:
(1) the idle exit latency is constrained at 0, or
(2) the selected state is a polling one, or
(3) the expected idle period duration is within the tick period
range.
In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values. To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs. Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.
Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.
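A hedged sketch of the extended callback shape and how a governor can use it (the prediction logic below is a placeholder, not the menu governor's code):
static int select_sketch(struct cpuidle_driver *drv,
			 struct cpuidle_device *dev, bool *stop_tick)
{
	unsigned int predicted_us = 100;	/* placeholder prediction */
	int idx = 0;				/* placeholder state index */

	/* Keep the tick running when the predicted idle period is short. */
	*stop_tick = predicted_us >= TICK_USEC;

	return idx;
}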
Cherry-picked from
45f1ff59e27ca59d33cc1a317e669d90022ccf7d
Change-Id: I3737f2fd00e308d0becba33cc3b1db727241e2a4
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Tue, 20 Mar 2018 09:11:28 +0000 (10:11 +0100)]
UPSTREAM: jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
Since the subsequent changes will need a TICK_USEC definition
analogous to TICK_NSEC, rename the existing TICK_USEC as
USER_TICK_USEC, update its users and redefine TICK_USEC
accordingly.
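Per my reading of the upstream change, the resulting definitions look roughly like this (shown only as a hedged sketch):
#define USER_TICK_USEC	((1000000UL + USER_HZ/2) / USER_HZ)	/* old TICK_USEC */
#define TICK_USEC	((1000000UL + HZ/2) / HZ)		/* new, HZ based */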
Cherry-picked from
efefc97736e6f3261879bc9dddcb161224a455f5
Change-Id: I14c6af0ab80964b1b4ba72ef711946e45dc1f204
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 15 Mar 2018 22:07:41 +0000 (23:07 +0100)]
UPSTREAM: sched: idle: Do not stop the tick before cpuidle_idle_call()
Make cpuidle_idle_call() decide whether or not to stop the tick.
First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.
Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick(). To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.
Cherry-picked from
ed98c34919985a9f87c3edacb9a8d8c283c9e243
Change-Id: Id62c91085ab6640bef9266ac64409e2686279aeb
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 15 Mar 2018 22:05:50 +0000 (23:05 +0100)]
BACKPORT: sched: idle: Do not stop the tick upfront in the idle loop
Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.
Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the nohz code on the duration of the
upcoming idle period. Specifically, if the tick has been stopped and
the idle governor predicts short idle, the situation is bad regardless
of whether or not the prediction is accurate. If it is accurate, the
tick has been stopped unnecessarily which means excessive overhead.
If it is not accurate, the CPU is likely to spend too much time in
the (shallow, because short idle has been predicted) idle state
selected by the governor [1].
As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
enough information on whether or not to stop the tick.
Cherry-picked from
2aaf709a518d26563b80fd7a42379d7aa7ffed4a
- Fixed conflict with tick_nohz_stopped_cpu not appearing in the
chunk context. Trivial fix.
Change-Id: Ie316e210971f28aae80dd0f35c5564cb17a81901
Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rafael J. Wysocki [Thu, 15 Mar 2018 22:03:11 +0000 (23:03 +0100)]
BACKPORT: time: tick-sched: Reorganize idle tick management code
Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.
The idea is to split the nohz idle entry call to decouple the idle
time stats accounting and preparatory work from the actual tick stop
code, in order to later be able to delay the tick stop once we reach
more power-knowledgeable callers.
Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.
Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.
Cherry-picked from
0e7767687fdabfc58d5046e7488632bf2ecd4d0c
- Fixed conflict with hrtimer_next_event_base which does not have
an extra parameter in the backported code
Change-Id: Iae3b22cdc0db13212af07249b085b4b97abf557e
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paven Kondati <pkondeti@qti.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Satya Durga Srinivasu Prabhala [Thu, 19 Jul 2018 18:35:34 +0000 (11:35 -0700)]
ANDROID: sched/fair: fix a warning
The warning below was introduced by commit
2cc3df5e1c7be78
("ANDROID: sched: Update max cpu capacity in case of
max frequency constraints"):
warning: '&&' within '||' [-Wlogical-op-parentheses]
Add parentheses required to fix the warning.
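A generic illustration of the warning and the fix (not the actual expression in fair.c; do_something() stands in for the real statement):
static void example(int a, int b, int c)
{
	if (a && b || c)	/* warning: '&&' within '||' */
		do_something();

	if ((a && b) || c)	/* explicit grouping silences the warning */
		do_something();
}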
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
Satya Durga Srinivasu Prabhala [Wed, 8 Nov 2017 20:13:47 +0000 (12:13 -0800)]
ANDROID: sched/walt: Fix compilation issue for x86_64
The compilation errors below are observed when SCHED_WALT is enabled and
FAIR_GROUP_SCHED is disabled; fix them.
CC kernel/sched/walt.o
kernel/sched/walt.c: In function 'walt_inc_cfs_cumulative_runnable_avg':
kernel/sched/walt.c:157:8: error: 'struct cfs_rq' has no member named 'cumulative_runnable_avg'
cfs_rq->cumulative_runnable_avg += p->ravg.demand;
^
kernel/sched/walt.c: In function 'walt_dec_cfs_cumulative_runnable_avg':
kernel/sched/walt.c:163:8: error: 'struct cfs_rq' has no member named 'cumulative_runnable_avg'
cfs_rq->cumulative_runnable_avg -= p->ravg.demand;
^
make[2]: *** [kernel/sched/walt.o] Error 1
make[1]: *** [kernel/sched] Error 2
make: *** [kernel] Error 2
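One way to resolve such an error, shown purely as a hedged sketch (the actual fix taken in this tree may differ), is to compile the cfs_rq accounting only when the field exists:
static void walt_inc_cfs_sketch(struct cfs_rq *cfs_rq, struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	cfs_rq->cumulative_runnable_avg += p->ravg.demand;
#endif
}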
Change-Id: I52d506aa5494f4c351f19e6bb2f1a75c8333f321
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
(cherry-picked for android-4.14)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Daniel Rosenberg [Mon, 29 May 2017 23:38:16 +0000 (16:38 -0700)]
ANDROID: mnt: Fix next_descendent
next_descendent did not properly handle the case
where the initial mount had no slaves. In this case,
we would look for the next slave, but since we don't
have a master, the check for wrapping around to the
start of the list will always fail. Instead, we check
for this case, and ensure that we end the iteration
when we come back to the root.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug:
62094374
Change-Id: I43dfcee041aa3730cb4b9a1161418974ef84812e
Patrick Bellasi [Fri, 27 Oct 2017 15:12:51 +0000 (16:12 +0100)]
ANDROID: sched/events: Introduce util_est trace events
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I359f7ffbd62e86a16a96d7f02da38e9ff260fd99
Patrick Bellasi [Tue, 15 May 2018 11:43:31 +0000 (12:43 +0100)]
ANDROID: sched/fair: schedtune: update before schedutil
When a task is enqueued, its boosted value must be accounted on that CPU
to better support the selection of the required frequency.
However, schedutil is (implicitly) updated by update_load_avg() which
always happens before schedtune_{en,de}queue_task(), thus potentially
introducing a latency between boost value updates and frequency
selections.
Let's update schedtune at the beginning of enqueue_task_fair(),
which will ensure that all schedutil updates will see the most
updated boost value for a CPU.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I1038f00600dd43ca38b76b2c5681b4f438ae4036
Patrick Bellasi [Tue, 21 Nov 2017 16:31:58 +0000 (16:31 +0000)]
FROMLIST: sched/fair: add support to tune PELT ramp/decay timings
The PELT half-life is the time [ms] required by the PELT signal to build
up a 50% load/utilization, starting from zero. This time is currently
hardcoded to be 32ms, a value which seems to make sense for most of the
workloads.
However, 32ms has been verified to be too long for certain classes of
workloads. For example, in the mobile space many tasks affecting the
user-experience run with a 16ms or 8ms cadence, since they need to match
the common 60Hz or 120Hz refresh rate of the graphics pipeline.
This has contributed so far to the idea that "PELT is too slow" to properly
track the utilization of interactive mobile workloads, especially
compared to alternative load tracking solutions which provide a
better representation of task demand in the range of 10-20ms.
A faster PELT ramp-up time could help speed up the
time required for the signal to stabilize and thus better represent
task demands in the mobile space. As a downside, it also reduces the
decay time, and thus we forget the load/utilization of sleeping tasks
(or idle CPUs) faster.
Fortunately, since the integration of the utilization estimation
support in mainline kernel:
commit
7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")
a fast decay time is no longer an issue for tasks utilization estimation.
Although estimated utilization does not slow down the decay of blocked
utilization on idle CPUs, for mobile workloads this seems not to be a
major concern compared to the benefits in interactivity responsiveness.
Let's add a compile-time option to choose the PELT speed which better
fits a specific system. By default the current 32ms half-life is
used, but we can also compile a kernel to use a faster ramp-up time of
either 16ms or 8ms. These two configurations have been verified to give
PELT a further improvement in performance, compared to other out-of-tree
load tracking solutions, when it comes to tracking interactive workloads,
thus better supporting both task placement and frequency selection.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
[
backport from LKML:
Message-ID: <
20180409165134.707-1-patrick.bellasi@arm.com>
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I50569748918b799ac4bf4e7d2b387253080a0fd2
Patrick Bellasi [Thu, 24 May 2018 14:10:23 +0000 (15:10 +0100)]
BACKPORT: sched/fair: Update util_est before updating schedutil
When a task is enqueued the estimated utilization of a CPU is updated
to better support the selection of the required frequency.
However, schedutil is (implicitly) updated by update_load_avg() which
always happens before util_est_{en,de}queue(), thus potentially
introducing a latency between estimated utilization updates and
frequency selections.
Let's update util_est at the beginning of enqueue_task_fair(),
which will ensure that all schedutil updates will see the most
updated estimated utilization value for a CPU.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Fixes:
7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")
Link: http://lkml.kernel.org/r/20180524141023.13765-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[
backport from upstream:
commit
2539fc82aa9b ("sched/fair: Update util_est before updating schedutil")
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: If84d40045bd283b0bbba2513a0aee0aa8c5058db
Patrick Bellasi [Fri, 9 Mar 2018 09:52:45 +0000 (09:52 +0000)]
BACKPORT: sched/fair: Update util_est only on util_avg updates
The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.
Thus, for really short running tasks, it can happen that their util_avg
value has not been updated since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.
This means that updating util_est at every dequeue potentially introduces
unnecessary overheads, and it's also conceptually wrong if the util_avg
signal has never been updated during a task activation.
Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even-values-only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.
Such a throttling mechanism allows util_est overheads in the wakeup
hot path to be kept under control, thus making it a suitable mechanism
which can also be enabled on high-intensity workload systems.
Thus, this now switches on the utilization estimation scheduler
feature by default.
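A hedged sketch of the LSB encoding (helpers simplified for illustration; the flag name matches mainline):
#define UTIL_AVG_UNCHANGED 0x1

/* At dequeue: skip the refresh if util_avg never changed in between. */
static inline bool util_est_skip_update(unsigned int enqueued)
{
	return enqueued & UTIL_AVG_UNCHANGED;
}

/* From the PELT update path: report that util_avg has just been updated. */
static inline unsigned int util_avg_just_updated(unsigned int enqueued)
{
	return enqueued & ~UTIL_AVG_UNCHANGED;
}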
Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[
backport from upstream:
commit
d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I1309cffc11c1708c1030364facced7b25bcb49d7
Patrick Bellasi [Fri, 9 Mar 2018 09:52:43 +0000 (09:52 +0000)]
BACKPORT: sched/fair: Use util_est in LB and WU paths
When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.
For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.
In order to reduce this kind of race conditions, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in the wakeup path, via cpu_util_wake(), as well as in the load-balance
path, via cpu_util() which is used by update_sg_lb_stats().
The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization (at
previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which, for
example, has just got a big task running after a long sleep period.
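A simplified, hedged sketch of the aggregation on the CPU side (the real code also clamps to the CPU's original capacity rather than SCHED_CAPACITY_SCALE):
static unsigned long cpu_util_est_sketch(struct cfs_rq *cfs_rq)
{
	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);

	util = max_t(unsigned long, util,
		     READ_ONCE(cfs_rq->avg.util_est.enqueued));

	return min_t(unsigned long, util, SCHED_CAPACITY_SCALE);
}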
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[
backport from upstream:
commit
f9be3e5 ("sched/fair: Use util_est in LB and WU paths")
This also provides schedutil integration, since:
sugov_get_util()
boosted_cpu_util()
cpu_util_freq()
cpu_util()
thus not requiring the backport of:
commit
a07630b8b2c1 ("sched/cpufreq/schedutil: Use util_est for OPP selection")
Support for energy_diff is also provided, since:
calc_sg_energy()
find_new_capacity()
group_max_util()
cpu_util_wake()
and:
group_norm_util()
cpu_util_wake()
Where both cpu_util() and cpu_util_wake() already consider the estimated
utilization in case of PELT being in use.
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I2be201cf7bb0b1449b14b4da64844067dbdb5eb4
Patrick Bellasi [Fri, 9 Mar 2018 09:52:42 +0000 (09:52 +0000)]
BACKPORT: sched/fair: Add util_est on top of PELT
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued in a
CPU, this can result in a CPU running a big task as being temporarily
represented as being almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
This allows us to remember how big a task has been reported by PELT in its
previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.
If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
The estimated utilization of cfs_rq is defined only for root ones.
That's because the only sensible consumer of this signal are the
scheduler and schedutil when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
where:
cfs_rq::util_est::enqueued = sum(_task_util_est(task))
for each RUNNABLE task on that root cfs_rq
It's worth noting that the estimated utilization is tracked only for
objects of interests, specifically:
- Tasks: to better support tasks placement decisions
- root cfs_rqs: to better support both tasks placement decisions as
well as frequencies selection
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[
backport from upstream:
commit
7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")
No major changes.
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: Icfa41dd73bd4da674b0044cacb11f320cf39eabf
Patrick Bellasi [Fri, 9 Mar 2018 09:52:43 +0000 (09:52 +0000)]
ANDROID: sched/fair: Cleanup cpu_util{_wake}()
The current implementation of cpu_util and cpu_util_wake makes the
backporting of mainline patches more difficult, and it also has some
features no longer required by the current EAS code, e.g. delta
utilization in __cpu_util().
Let's clean up these functions definitions to:
1. get rid of the no longer required __cpu_util(cpu, delta)
This function is now only called with delta=0 and thus we can refold
its implementation into the original wrapper function cpu_util(cpu)
2. optimize for the WALT path on CONFIG_SCHED_WALT builds
Indeed, we currently execute some unnecessary PELT-related code even
when WALT signals are required.
Let's change this by assuming that on CONFIG_SCHED_WALT build we are
likely using WALT signals. While on !CONFIG_SCHED_WALT we still have
just the PELT signals with a code structure which matches mainline
3. move the definitions from sched/sched.h into sched/fair.c
This is the only module using these functions and it also aligns
better with the mainline location for these functions
4. get rid of the walt_util macro
That macro has a function-like signature but it's modifying a
parameter (passed by value) which makes it a bit confusing.
Moreover, the usage of the min_t() macro to cap signals with
capacity_orig_of() makes the explicit type cast used by that macro to
support both 32 and 64 bit targets no longer required.
5. remove forward declarations
Which are not required once the definition is moved at the top of
fair.c, since they don't have other local dependencies.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I61c9b7b8a0a34b494527c5aa76218c64543c16d2
Dietmar Eggemann [Sat, 26 Sep 2015 17:19:54 +0000 (18:19 +0100)]
ANDROID: sched: Update max cpu capacity in case of max frequency constraints
Wakeup balancing uses cpu capacity awareness and needs to know the
system-wide maximum cpu capacity.
Patch "sched: Store system-wide maximum cpu capacity in root domain"
finds the system-wide maximum cpu capacity during scheduler domain
hierarchy setup. This is sufficient as long as maximum frequency
invariance is not enabled.
If it is enabled, the system-wide maximum cpu capacity can change
between scheduler domain hierarchy setups due to frequency capping.
The cpu capacity is changed in update_cpu_capacity() which is called in
load balance on the lowest scheduler domain hierarchy level. To be able
to know if a change in cpu capacity for a certain cpu also has an effect
on the system-wide maximum cpu capacity it is normally necessary to
iterate over all cpus. This would be way too costly. That's why this
patch follows a different approach.
The unsigned long max_cpu_capacity value in struct root_domain is
replaced with a struct max_cpu_capacity, containing value (the
max_cpu_capacity) and cpu (the cpu index of the cpu providing the
maximum cpu_capacity).
Changes to the system-wide maximum cpu capacity and the cpu index are
made if:
1 System-wide maximum cpu capacity < cpu capacity
2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu
There are no changes to the system-wide maximum cpu capacity in all
other cases.
Atomic read and write access to the pair (max_cpu_capacity.val,
max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock.
The access to max_cpu_capacity.val in task_fits_max() is still performed
without taking the max_cpu_capacity.lock.
The code to set max cpu capacity in build_sched_domains() has been
removed because the whole functionality is now provided by
update_cpu_capacity() instead.
This approach can introduce errors temporarily, e.g. in case the cpu
currently providing the max cpu capacity has its cpu capacity lowered
due to frequency capping and calls update_cpu_capacity() before any cpu
which might provide the max capacity now.
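A hedged sketch of the structure and update rule described above (field names approximate):
struct max_cpu_capacity {
	raw_spinlock_t lock;
	unsigned long val;	/* current system-wide max cpu capacity */
	int cpu;		/* cpu providing that capacity */
};

static void update_max_cap_sketch(struct max_cpu_capacity *mcc,
				  int cpu, unsigned long cap)
{
	raw_spin_lock(&mcc->lock);
	/* 1) a larger capacity always wins; 2) the owning cpu may shrink it */
	if (cap > mcc->val || cpu == mcc->cpu) {
		mcc->val = cap;
		mcc->cpu = cpu;
	}
	raw_spin_unlock(&mcc->lock);
}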
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Change-Id: Idaa7a16723001e222e476de34df332558e48dd13
Dietmar Eggemann [Thu, 10 May 2018 15:58:04 +0000 (16:58 +0100)]
ANDROID: arm: enable max frequency capping
Defines arch_scale_max_freq_capacity() to use the topology driver
scale function.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Change-Id: I79f444399ea3b2948364fde80ccee52a9ece5b9a
Dietmar Eggemann [Thu, 10 May 2018 15:54:16 +0000 (16:54 +0100)]
ANDROID: arm64: enable max frequency capping
Defines arch_scale_max_freq_capacity() to use the topology driver
scale function.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Change-Id: If7565747ec862e42ac55196240522ef8d22ca67d
Dietmar Eggemann [Thu, 10 May 2018 15:52:33 +0000 (16:52 +0100)]
ANDROID: implement max frequency capping
Implements the Max Frequency Capping Engine (MFCE) getter function
topology_get_max_freq_scale() to provide the scheduler with a
maximum frequency scaling correction factor for more accurate cpu
capacity handling by being able to deal with max frequency capping.
This scaling factor describes the influence of running a cpu with a
current maximum frequency (policy) lower than the maximum possible
frequency (cpuinfo).
The factor is:
policy_max_freq(cpu) << SCHED_CAPACITY_SHIFT / cpuinfo_max_freq(cpu)
It also implements the MFCE setter function arch_set_max_freq_scale()
which is called from cpufreq_set_policy().
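A hedged sketch of the factor computation (per-CPU storage and exact naming in the driver may differ):
static unsigned long max_freq_scale_sketch(unsigned long policy_max_freq,
					   unsigned long cpuinfo_max_freq)
{
	/* SCHED_CAPACITY_SCALE (1024) when uncapped, proportionally less
	 * when the policy limits the achievable maximum frequency. */
	return (policy_max_freq << SCHED_CAPACITY_SHIFT) / cpuinfo_max_freq;
}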
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Change-Id: I59e52861ee260755ab0518fe1f7183a2e4e3d0fc
Dietmar Eggemann [Thu, 10 May 2018 14:48:06 +0000 (15:48 +0100)]
ANDROID: sched/fair: add arch scaling function for max frequency capping
To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.
Another subsystem (e.g. cpufreq) or architectural or platform specific
code can overwrite this default implementation, exactly as for frequency
and cpu invariance. It has to be enabled by the arch by defining
arch_scale_max_freq_capacity to the actual implementation.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Change-Id: I770a8b1f4f7340e9e314f71c64a765bf880f4b4d
Dietmar Eggemann [Sun, 25 Mar 2018 12:41:15 +0000 (13:41 +0100)]
ANDROID: trace: Add WALT util signal to trace event sched_load_cfs_rq
WALT only operates on the root rq. WALT is already integrated into the
trace event sched_load_se.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: If9c50c87404bc712ac29e9a9a8b90f7eb1b254e2
Dietmar Eggemann [Sun, 25 Mar 2018 11:11:51 +0000 (12:11 +0100)]
ANDROID: sched, trace: Remove trace event sched_load_avg_cpu
The functionality of sched_load_avg_cpu can be provided by
sched_load_cfs_rq.
The WALT extension will be included into sched_load_cfs_rq by patch
(ANDROID: trace: Add WALT util signal to trace event sched_load_cfs_rq).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I2e3944a6e60a77e3237b7c8cb29cbd840118a6ad
Dietmar Eggemann [Sun, 11 Mar 2018 22:58:21 +0000 (23:58 +0100)]
ANDROID: Rename and move include/linux/sched_energy.h
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I51401b94947c64feb598287a991eadf9de064340
Dietmar Eggemann [Sun, 11 Mar 2018 22:51:35 +0000 (23:51 +0100)]
ANDROID: Adjust juno energy model
The max cap state cpu capacity value (now 446) has to be equal to the
cpu scale (446) value for the little cpus (A53).
While at it, align the cap values between core and cluster energy
costs for the second lowest OPP to 302.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Idffc696c0cac8175bc62e0216394cd96b75fa566
Dietmar Eggemann [Sun, 11 Mar 2018 22:41:54 +0000 (23:41 +0100)]
ANDROID: Check equality of max cap state cap and cpu scale
The max cap state cpu capacity value has to be equal to the cpu scale
value. Print out a warning if this is not the case.
The Energy Model can be adjusted so that this warning disappears.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I97fea5622ea8193154746a0b63aec30069eb9be5
Dietmar Eggemann [Sun, 11 Mar 2018 21:35:08 +0000 (22:35 +0100)]
ANDROID: Move energy model init call into arch_topology driver
Now that we decouple the Energy Model (EM) from the cpu_scale
calculation, initialize the EM after the cpu_scale calculation
has finished.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ia39c969f5c674e6e644951fd8b3e986d16cdab3b
Dietmar Eggemann [Sun, 11 Mar 2018 12:02:20 +0000 (13:02 +0100)]
ANDROID: Streamline sched_domain_energy_f functions
There is no need to print that there is no sched group energy for a
sched domain.
It is e.g. perfectly fine to have an Energy Model on MC and DIE level
and not on SYS level.
The Energy Model consistency check between sched domains is done in
init_sched_groups_energy(). In case a sched domain level (l) has energy
data its child sched domain level (l-1) has to have energy data as well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Iff1ecc86720a310f4b96282c3f41931aee733af2
Dietmar Eggemann [Sun, 11 Mar 2018 11:37:21 +0000 (12:37 +0100)]
ANDROID: Separate cpu_scale and energy model setup
Keep the cpu_scale calculation in the arch_topology driver the way it is
done in mainline. This will require appropriate capacity-dmips-mhz
properties in the dt file.
Do the energy model setup completely separate in kernel/sched/energy.c.
This will make it easier to adapt the non-mainline energy model code.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ic69d9a619c0a57bb54e5aa6f320e95618394b7cb
Ionela Voinescu [Mon, 19 Mar 2018 15:29:51 +0000 (15:29 +0000)]
ANDROID: update_group_capacity for single cpu in cluster
If we're only left with one big CPU in a cluster (either there was only
one to begin with or the others have been hotplugged out), the MC level
seen from the perspective of that CPU, will only contain one group (that
of the CPU) and the LOAD_BALANCE flag will be cleared at that level.
The MC level is kept nonetheless as it will still contain energy
information, as per commit
06654998ed81 ("ANDROID: sched: EAS & 'single
cpu per cluster'/cpu hotplug interoperability").
This will result in update_cpu_capacity never being called for that big
CPU and its capacity will never be updated. Also, its capacity will
never be considered when setting or updating the max_cpu_capacity
structure.
Therefore, call update_group_capacity for SD levels that do not have the
SD_LOAD_BALANCE flag set in order to update the CPU capacity information
of CPUs that are alone in a cluster and the groups that they belong to.
The call to update_cpu_capacity will also result in appropriately
setting the values of the max_cpu_capacity structure.
Fixes:
06654998ed81 ("ANDROID: sched: EAS & 'single cpu per cluster'/cpu
hotplug interoperability")
Change-Id: I6074972dde1fdf586378f8a3534e3d0aa42a809a
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Chris Redpath [Fri, 8 Jun 2018 16:16:50 +0000 (17:16 +0100)]
ANDROID: sched/fair: return idle CPU immediately for prefer_idle
When a CPU is selected for a prefer_idle task through find_best_target
this CPU will become a target CPU on which we perform an energy diff and
select between the previous CPU and the target CPU based on the estimated
energy consumption for both placements.
For prefer_idle tasks we should favour performance over energy savings
and therefore return a found idle CPU immediately.
Change-Id: I2242a51134ac172dd58b3b08375388a7b4b84af0
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Suggested-by: Joel Fernandes <joelaf@google.com>
Ionela Voinescu [Fri, 27 Apr 2018 17:38:00 +0000 (18:38 +0100)]
ANDROID: sched/fair: add idle state filter to prefer_idle case
The CPU selection process for a prefer_idle task either minimizes or
maximizes the CPU capacity for idle CPUs depending on the task being
boosted or not.
Given that we are iterating through all CPUs, additionally filter the
choice by preferring a CPU in a more shallow idle state. This will
provide both a faster wake-up for the task and higher energy efficiency,
by allowing CPUs in deeper idle states to remain idle.
Change-Id: Ic55f727a0c551adc0af8e6ee03de6a41337a571b
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Ionela Voinescu [Mon, 19 Mar 2018 18:03:45 +0000 (18:03 +0000)]
ANDROID: sched/fair: remove order from CPU selection
find_best_target is currently split into code handling latency sensitive
tasks and code handling non-latency sensitive tasks based on the value
of the prefer_idle flag.
Another differentiation is done for boosted tasks, preferring to start
with higher-capacity CPUs when boosted, and with more efficient CPUs when
not boosted. This additional differentiation is obtained by imposing an
order when considering CPUs for selection. This order is determined in
typical big.LITTLE systems by the start point (the CPU with the maximum
or minimum capacity) and by the order of big and little CPU groups
provided in the sched domain hierarchy.
However, it's not guaranteed that the sched domain hierarchy will give
us a sorted list of CPU groups based on their maximum capacities when
dealing with systems with more than 2 capacity groups.
For example, if we consider a system with three groups of CPUs (LITTLEs,
mediums, bigs), the sched domain hierarchy might provide the following
scheduling groups ordering for a prefer_idle-boosted task:
big CPUs -> LITTLE CPUs -> medium CPUs.
If the big CPUs are not idle, but there are a few LITTLEs and mediums
as idle CPUs, by returning the first idle CPU, we will be incorrectly
preferring a lower capacity CPU over a higher capacity CPU.
In order to eliminate this reliance on assuming sched groups are ordered
by capacity, let's:
1. Iterate through all candidate CPUs for all cases.
2. Minimise or maximise the capacity of the considered CPU, depending
on prefer_idle and boost information.
Taking each of the four possible cases in turn, we analyse the
implementation and impact of this solution:
1. prefer_idle and boosted
This type of tasks needs to favour the selection of a reserved idle CPU,
and thus we still start from the biggest CPU in the system, but we
iterate through all CPUs so as to correctly handle the example above by
maximising the capacity of the idle CPU we select. When all CPUs are
active, we already iterate through all CPUs and we're able to maximise
spare capacity or minimise utilisation for the considered target or
backup CPU.
2. prefer_idle and !boosted
For these tasks we prefer the selection of a more energy efficient CPU
and therefore we start from the smallest CPUs in the system, but we
iterate through all the CPUs so as to select the most energy efficient idle
CPU, an implementation which mimics existing behaviour. When all CPUs are
active, we already iterate through all CPUs and we're able to
maximise spare capacity or minimise utilisation for the considered
target or backup CPU.
3. !prefer_idle and boosted and
4. !prefer_idle and !boosted
For these tasks we already iterate through all CPUs and are able to
maximise the energy efficiency of the selected CPU.
Change-Id: I940399e22eff29453cba0e2ec52a03b17eec12ae
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
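As an illustration of the new selection rule (not the patch itself), the scan can be modelled in plain C: boosted tasks maximise the capacity of the chosen idle CPU while non-boosted ones minimise it, with no reliance on sched-group ordering. All names below are illustrative:

/* Sketch: pick an idle CPU by scanning every candidate, independent of
 * sched-group ordering. boosted => maximise capacity_orig,
 * !boosted => minimise capacity_orig. */
struct cpu_view { int is_idle; int capacity_orig; };

static int pick_idle_cpu(const struct cpu_view *cpus, int nr, int boosted)
{
        int best = -1;
        int best_cap = boosted ? -1 : (1 << 30);

        for (int cpu = 0; cpu < nr; cpu++) {
                if (!cpus[cpu].is_idle)
                        continue;
                if ((boosted  && cpus[cpu].capacity_orig > best_cap) ||
                    (!boosted && cpus[cpu].capacity_orig < best_cap)) {
                        best_cap = cpus[cpu].capacity_orig;
                        best = cpu;
                }
        }
        return best;
}

With the three-group example above (bigs busy, some LITTLEs and mediums idle), a boosted prefer_idle task now ends up on an idle medium rather than on the first idle LITTLE the old ordering happened to visit.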
Ionela Voinescu [Fri, 27 Apr 2018 17:19:13 +0000 (18:19 +0100)]
ANDROID: sched/fair: unify spare capacity calculation
Given that we have a few sites where the spare capacity of a CPU is
calculated as the difference between the original capacity of the CPU
and its computed new utilization, let's unify the calculation and use
that value tracked with a local spare_cap variable.
Change-Id: I78daece7543f78d4f74edbee5e9ceb62908af507
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
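For reference, the unified quantity is just the headroom left after placing the task; a minimal sketch with an illustrative helper name:

/* Sketch of the unified calculation kept in the local spare_cap variable:
 * the CPU's original capacity minus its computed new utilization. */
static inline long spare_capacity(long capacity_orig, long new_util)
{
        return capacity_orig - new_util; /* headroom left on the CPU */
}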
Ionela Voinescu [Thu, 15 Mar 2018 12:45:05 +0000 (12:45 +0000)]
ANDROID: sched/fair: prefer energy efficient CPUs for !prefer_idle tasks
For !prefer_idle tasks we want to minimize capacity_orig to bias their
scheduling towards more energy efficient CPUs. This does not happen in
the current code for boosted tasks due to the order of CPUs considered
(from big CPUs to LITTLE CPUs), and to the shallow idle state and
spare capacity maximization filters, which are used to select the best
idle backup CPU and the best active CPU candidates.
Let's fix this by enabling the above filters only when we are within
same capacity CPUs.
Taking each of the two cases in turn:
1. Selection of a backup idle CPU - Non prefer_idle tasks should prefer
more energy efficient CPUs when there are idle CPUs in the system,
independent of the order given by the presence of a boosted margin.
This is the behaviour for !sysctl_sched_cstate_aware, and it should
be the behaviour when sysctl_sched_cstate_aware is set as well,
given that we should prefer a more efficient CPU even if it's in a
deeper idle state.
2. Selection of an active target CPU: There is no reason for boosted
tasks to benefit from a higher chance of being placed on a big CPU,
which is what ordering CPUs from bigs to littles provides.
The other mechanism already in place for boosted tasks (making sure we
select a CPU that fits the task) is enough for the non latency
sensitive case. Also, by choosing a CPU with maximum spare capacity
we also cover the preference towards spreading tasks, rather than
packing them, which improves the chances for tasks to get better
performance due to potential reduced preemption. Therefore, prefer
more energy efficient CPUs and only consider spare capacity for CPUs
with equal capacity_orig.
Change-Id: I3b97010e682674420015e771f0717192444a63a2
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Leo Yan <leo.yan@linaro.org>
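A minimal sketch of the resulting candidate comparison, assuming illustrative field names: lower capacity_orig always wins, and the idle-state and spare-capacity filters only break ties between CPUs of equal capacity:

/* Returns 1 if candidate a should be preferred over candidate b. */
struct candidate {
        int capacity_orig; /* original capacity of the CPU                   */
        int idle_index;    /* shallower is better, only used as a tie-break  */
        int spare_cap;     /* remaining headroom, only used as a tie-break   */
};

static int better_candidate(const struct candidate *a,
                            const struct candidate *b)
{
        if (a->capacity_orig != b->capacity_orig)
                return a->capacity_orig < b->capacity_orig; /* more efficient wins */
        if (a->idle_index != b->idle_index)
                return a->idle_index < b->idle_index;       /* shallower idle wins */
        return a->spare_cap > b->spare_cap;                 /* more headroom wins  */
}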
Pavankumar Kondeti [Thu, 8 Feb 2018 11:13:45 +0000 (16:43 +0530)]
ANDROID: sched/fair: fix CPU selection for non latency sensitive tasks
CPU selection for non latency sensitive tasks targets an active
CPU in the little cluster. The CPU in the shallowest c-state is stored as
a backup. However, if all CPUs in the little cluster are idle, we pick
an active CPU in the BIG cluster as the target CPU. This incorrect
choice of target CPU may not get corrected by
select_energy_cpu_idx(), depending on the energy difference between
the previous CPU and the target CPU.
This can be fixed easily by maintaining the same variable that tracks
the maximum capacity of the traversed CPUs for both idle and active CPUs.
Change-Id: I3efb8bc82ff005383163921ef2bd39fcac4589ad
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Chris Redpath [Tue, 5 Jun 2018 11:21:33 +0000 (12:21 +0100)]
ANDROID: sched/fair: Also do misfit in overloaded groups
If we can classify the group as overloaded, that overrides
any classification as misfit but we may still have misfit
tasks present. Check the rq we're looking at to see if
this is the case.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Removed stray reference to rq_has_misfit]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Change-Id: Ida8eb66aa625e34de3fe2ee1b0dd8a78926273d8
Chris Redpath [Tue, 5 Jun 2018 10:47:57 +0000 (11:47 +0100)]
ANDROID: sched/fair: Don't balance misfits if it would overload local group
When load balancing in a system with misfit tasks present, if we always
pull a misfit task to the local group, this can lead to pulling a running
task from a smaller capacity CPU to a bigger CPU which is busy. In this
situation, the pulled task is likely not to get a chance to run before
an idle balance on another small CPU pulls it back. This penalises the
pulled task as it is stopped for a short amount of time and then likely
relocated to a different CPU (since the original CPU just did a NEWLY_IDLE
balance and reset the periodic interval).
If we only do this unconditionally for NEWLY_IDLE balance, we can be
sure that any tasks and load which are present on the local group are
related to short-running tasks which we are happy to displace for a
longer running task in a system with misfit tasks present.
However, other balance types should only pull a task if we think
that the local group is underutilized - checking the number of tasks
gives us a conservative estimate here since if they were short tasks
we would have been doing NEWLY_IDLE balances instead.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I710add1ab1139482620b6addc8370ad194791beb
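The guard described above can be summarised in a small sketch (types and names are illustrative, not the kernel's sg_lb_stats):

/* Sketch: misfit tasks may always be pulled on a NEWLY_IDLE balance; for
 * other balance types only pull when the local group has fewer running
 * tasks than CPUs, i.e. it is plausibly under-utilised. */
enum balance_kind { BALANCE_NEWLY_IDLE, BALANCE_OTHER };

struct group_view {
        unsigned int sum_nr_running; /* tasks currently running in the group */
        unsigned int group_weight;   /* number of CPUs in the group          */
};

static int may_pull_misfit(enum balance_kind kind,
                           const struct group_view *local)
{
        if (kind == BALANCE_NEWLY_IDLE)
                return 1; /* local CPU is idle; displacing short tasks is fine */

        return local->sum_nr_running < local->group_weight;
}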
Chris Redpath [Fri, 1 Jun 2018 19:34:10 +0000 (20:34 +0100)]
ANDROID: sched/fair: Attempt to improve throughput for asym cap systems
In some systems the capacity and group weights line up to defeat all the
small imbalance correction conditions in fix_small_imbalance, which can
cause bad task placement. Add a new condition if the existing code can't
see anything to fix:
If we have asymmetric capacity, and there are more tasks than CPUs in
the busiest group *and* there are fewer tasks than CPUs in the local group
then we try to pull something. There could be transient small tasks which
prevent this from working, but on the whole it is beneficial for those
systems with inconvenient capacity/cluster size relationships.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icf81cde215c082a61f816534b7990ccb70aee409
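The added condition amounts to roughly the following check, sketched with illustrative fields (the real code lives in fix_small_imbalance and works on its group statistics; the one-task imbalance value here is an assumption of the sketch):

/* Sketch: with asymmetric capacities, if the busiest group is over-subscribed
 * while the local group still has spare CPUs, report an imbalance of one
 * task so that something is pulled even when the usual conditions see
 * nothing to fix. */
struct group_view { unsigned int nr_running; unsigned int nr_cpus; };

static unsigned int asym_extra_imbalance(int asym_capacity,
                                         const struct group_view *busiest,
                                         const struct group_view *local)
{
        if (asym_capacity &&
            busiest->nr_running > busiest->nr_cpus &&
            local->nr_running < local->nr_cpus)
                return 1; /* pull one task */

        return 0;
}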
Chris Redpath [Wed, 30 May 2018 12:16:41 +0000 (13:16 +0100)]
FROMLIST: sched/fair: Don't move tasks to lower capacity cpus unless necessary
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-11-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ib86570abdd453a51be885b086c8d80be2773a6f2
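A minimal sketch of the rule, with illustrative parameters (the real check is made against the candidate source rq during load balance):

/* Sketch: a lower-capacity destination must not steal the only running task
 * of a higher-capacity CPU, since that is guaranteed to slow the task down.
 * With two or more tasks on the source, pulling one is still allowed. */
static int can_pull_from(unsigned long src_capacity,
                         unsigned int src_nr_running,
                         unsigned long dst_capacity)
{
        if (dst_capacity < src_capacity && src_nr_running <= 1)
                return 0; /* leave the lone task where it runs fastest */

        return 1;
}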
Morten Rasmussen [Thu, 28 Jun 2018 16:31:25 +0000 (17:31 +0100)]
FROMLIST: sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domains to take advantage of more caches
and cores on SMT systems. It has recently been changed to be set on all
non-NUMA topology levels. However, spreading across domains with cpu
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity cpus even if high capacity cpus aren't overutilized might
give access to more cache but the cpu will be slower and possibly lead
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-13-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I944a003c7b685132c57a90a8aaf85509196679e6
Morten Rasmussen [Thu, 28 Jun 2018 15:34:35 +0000 (16:34 +0100)]
FROMLIST: sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
When hotplugging cpus out or creating exclusive cpusets, systems which
were asymmetric at boot might become symmetric. In this case leaving the
flag set might lead to suboptimal scheduling decisions.
The arch code providing the flag doesn't have visibility of the cpuset
configuration, so it must either be told by passing a cpumask or by
letting the generic topology code verify if the flag should still be set
when taking the actual sched_domain_span() into account. This patch
implements the latter approach.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-12-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I6a369e474743c96fe0b866ebb084e0a250e8c5d7
Valentin Schneider [Tue, 27 Feb 2018 16:56:41 +0000 (16:56 +0000)]
FROMLIST: sched/fair: Set rq->rd->overload when misfit
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs have lower
performance than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks
_ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-10-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I131f9ad90f7fb53f4946f61f2fb65ab0798f23c5
Valentin Schneider [Tue, 27 Feb 2018 16:20:21 +0000 (16:20 +0000)]
FROMLIST: sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-9-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I14ce007f916bffd6a7938b7d56faf2b384d27887
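The pattern being annotated looks roughly like the sketch below; the macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (the real ones in compiler.h handle more cases), and the struct is illustrative:

/* A volatile access forces the compiler to emit exactly one load or store,
 * preventing torn, repeated or cached accesses to the lockless field. */
#define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

struct root_domain_view { int overload; };

static void update_overload(struct root_domain_view *rd, int overloaded)
{
        /* Read and written locklessly from update_sd_lb_stats()-like paths. */
        if (READ_ONCE(rd->overload) != overloaded)
                WRITE_ONCE(rd->overload, overloaded);
}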
Valentin Schneider [Tue, 27 Feb 2018 15:49:40 +0000 (15:49 +0000)]
FROMLIST: sched: Change root_domain->overload type to int
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures, e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-8-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I24c3ee7fc9a0aa76c3a7a1369714248703e73af9
Valentin Schneider [Fri, 29 Jun 2018 13:50:31 +0000 (14:50 +0100)]
FROMLIST: sched/fair: Change prefer_sibling type to bool
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-7-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I8d3cc862292290c952505ff78e41cfe6b04bf168
Chris Redpath [Tue, 17 Jul 2018 19:35:07 +0000 (20:35 +0100)]
FROMLIST: sched/fair: Consider misfit tasks when load-balancing
On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In these scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-5-git-send-email-morten.rasmussen@arm.com/]
[backported - some parts already in android-4.14]
Signed-off-by: Ioan Budea <ioan.budea@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ib9f9edd31b6c56cfbeb2a2f9d5daaa1b6824375b
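Modification 1 above can be sketched as a predicate over per-group statistics; the struct and fields are illustrative stand-ins for the scheduler's group statistics:

/* Sketch: a group carrying misfit tasks is only chosen as busiest when the
 * destination group both has higher per-cpu capacity and some capacity to
 * spare; otherwise misfit balancing is ignored. */
struct group_view {
        int has_misfit_task;
        unsigned long max_capacity;   /* highest per-cpu capacity in the group */
        unsigned long spare_capacity; /* unused capacity left in the group     */
};

static int misfit_group_is_busiest(const struct group_view *candidate,
                                   const struct group_view *dst)
{
        if (!candidate->has_misfit_task)
                return 0;

        return dst->max_capacity > candidate->max_capacity &&
               dst->spare_capacity > 0;
}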
Morten Rasmussen [Wed, 16 May 2018 12:35:32 +0000 (13:35 +0100)]
FROMLIST: sched: Add sched_group per-cpu max capacity
The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple cpus if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either, but it is a better choice
for some situations as it indicates that the sched_group is definitely
compute capacity constrained, either due to rt/irq pressure on all cpus or
to asymmetric cpu capacities (e.g. big.LITTLE).
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/
1530699470-29808-4-git-send-email-morten.rasmussen@arm.com/]
[backported - some of this is already present in android-4.14]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ic96c914f7f969b8aa353cfa3ad287e0e628b1793
Morten Rasmussen [Fri, 17 Jul 2015 15:45:07 +0000 (16:45 +0100)]
FROMLIST: sched/fair: Add group_misfit_task load-balance type
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity, in addition to the current average
load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on each sched_tick.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[From https://lore.kernel.org/lkml/
1530699470-29808-3-git-send-email-morten.rasmussen@arm.com/]
[backported because some parts are already present in android]
Signed-off-by: Ioan Budea <ioan.budea@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I71bd3a77c7088a102ba183df6ece7943aa7eb0c2
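The per-cpu tracking reduces to a fits-check run when tasks are scheduled and on the tick; the margin below is illustrative, not the kernel's constant, and the names are stand-ins:

/* Sketch: a task "fits" a CPU when its utilization, inflated by a margin,
 * stays below the CPU's capacity; otherwise the rq is flagged as carrying
 * a misfit task for the load balancer to notice. */
#define FITS_MARGIN_PCT 80 /* illustrative headroom requirement */

static int task_fits_cpu(unsigned long task_util, unsigned long cpu_capacity)
{
        return task_util * 100 < cpu_capacity * FITS_MARGIN_PCT;
}

static void update_misfit_flag(int *rq_misfit_task,
                               unsigned long curr_task_util,
                               unsigned long cpu_capacity)
{
        *rq_misfit_task = !task_fits_cpu(curr_task_util, cpu_capacity);
}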
Morten Rasmussen [Thu, 1 Feb 2018 11:44:13 +0000 (11:44 +0000)]
FROMLIST: sched: Add static_key for asymmetric cpu capacity optimizations
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Taken from https://lore.kernel.org/lkml/
1530699470-29808-2-git-send-email-morten.rasmussen@arm.com]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Iced93ffb71bb2c34eee783c585cffe7252137308
Minchan Kim [Mon, 7 May 2018 14:15:37 +0000 (23:15 +0900)]
UPSTREAM: ANDROID: binder: change down_write to down_read
binder_update_page_range needs down_write of mmap_sem because
vm_insert_page needs to change vma->vm_flags to VM_MIXEDMAP unless
it is already set. However, when profiling binder at work, it seems
every binder buffer is mapped in advance by binder_mmap. This means
we could set VM_MIXEDMAP at binder_mmap time, which already holds
mmap_sem as down_write, so binder_update_page_range doesn't need to
hold mmap_sem as down_write. Use the proper API, down_read, instead.
This helps the mmap_sem contention problem as well as fixing the
down_write abuse.
Ganesh Mahendran tested app launching and binder throughput and said
he couldn't find any problem, and I ran a binder latency test per
Greg KH's request (thanks to Martijn for teaching me how to do it);
I could not find any problem either.
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@google.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
720c241924046aff83f5f2323232f34a30a4c281)
Change-Id: I8358ceaaab4030f7122c95308dcad59557cad411
宋金时 [Thu, 10 May 2018 02:05:03 +0000 (02:05 +0000)]
UPSTREAM: ANDROID: binder: correct the cmd print for BINDER_WORK_RETURN_ERROR
When binder_stat_br() is executed, e->cmd has already been modified to
BR_OK instead of the original return error cmd. In fact we want to know
the original return error, such as BR_DEAD_REPLY or BR_FAILED_REPLY,
rather than always BR_OK. To avoid the recorded value of e->cmd always
being BR_OK, we need to assign the value of e->cmd to cmd before
setting e->cmd = BR_OK.
Signed-off-by: songjinshi <songjinshi@xiaomi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
838d5565669aa5bb7deb605684a5970d51d5eaf6)
Change-Id: I425b32c5419a491c6b9ceee7c00dde6513e0421d
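The fix boils down to capturing the original command before it is overwritten; a hedged sketch with illustrative types (the real code manipulates binder's transaction log entry):

struct log_entry { unsigned int cmd; };

/* Save the original return error (e.g. BR_DEAD_REPLY, BR_FAILED_REPLY)
 * before the entry is marked as handled with BR_OK, so the statistics see
 * the real error rather than always BR_OK. */
static unsigned int take_return_error(struct log_entry *e, unsigned int br_ok)
{
        unsigned int cmd = e->cmd; /* keep the original error code */

        e->cmd = br_ok;            /* entry is now considered handled */
        return cmd;                /* ...and this value gets accounted */
}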
Martijn Coenen [Fri, 11 May 2018 08:45:24 +0000 (01:45 -0700)]
UPSTREAM: ANDROID: binder: remove 32-bit binder interface.
New devices launching with Android P need to use the 64-bit
binder interface, even on 32-bit SoCs [0].
This change removes the Kconfig option to select the 32-bit
binder interface. We don't think this will affect existing
userspace for the following reasons:
1) The latest Android common tree is 4.14, so we don't
believe any Android devices are on kernels >4.14.
2) Android devices launch on an LTS release and stick with
it, so we wouldn't expect devices running on <= 4.14 now
to upgrade to 4.17 or later. But even if they did, they'd
rebuild the world (kernel + userspace) anyway.
3) Other userspaces like 'anbox' are already using the
64-bit interface.
Note that this change doesn't remove the 32-bit UAPI
itself; the reason for that is that Android userspace
always uses the latest UAPI headers from upstream, and
userspace retains 32-bit support for devices that are
upgrading. This will be removed as well in 2-3 years,
at which point we can remove the code from the UAPI
as well.
Finally, this change introduces build errors on archs where
64-bit get_user/put_user is not supported, so make binder
unavailable on m68k (which wouldn't want it anyway).
[0]: https://android-review.googlesource.com/c/platform/build/+/595193
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
1190b4e38f97023154e6b3bef61b251aa5f970d0)
Change-Id: I73dadf1d7b45a42bb18be5d5d3f5c090e61866de
Gustavo A. R. Silva [Tue, 23 Jan 2018 18:04:27 +0000 (12:04 -0600)]
UPSTREAM: android: binder: Use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Martijn Coenen <maco@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
197410ad884eb18b31d48e9d8e64cb5a9e326f2f)
Change-Id: I30bed831d6b6ff2e9e3e521ccc5d6836f0b30944
Harsh Shandilya [Fri, 22 Dec 2017 14:07:02 +0000 (19:37 +0530)]
UPSTREAM: android: binder: Use octal permissions
checkpatch warns against the use of symbolic permissions,
this patch migrates all symbolic permissions in the binder
driver to octal permissions.
Test: debugfs nodes created by binder have the same unix
permissions prior to and after this patch was applied.
Signed-off-by: Harsh Shandilya <harsh@prjkt.io>
Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Martijn Coenen <maco@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
21d02ddf716669e182a13b69b4dd928cf8ef5e0f)
Change-Id: I8152fe280ead1d04d89593e813a722f9eb5def27
Elad Wexler [Fri, 29 Dec 2017 09:03:37 +0000 (11:03 +0200)]
UPSTREAM: android: binder: Prefer __func__ to using hardcoded function name
Coding style fixup
Signed-off-by: Elad Wexler <elad.wexler@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
00c41cddebde8d1a635bf81a7b255b7e56fd0d15)
Change-Id: I795e2a9f525c4a8df5cd0a81842a88529ba54f21
Xiongwei Song [Thu, 14 Dec 2017 04:15:42 +0000 (12:15 +0800)]
UPSTREAM: ANDROID: binder: make binder_alloc_new_buf_locked static and indent its arguments
The function binder_alloc_new_buf_locked() is only used in this file, so
make it static. Also clean up sparse warning:
drivers/android/binder_alloc.c:330:23: warning: no previous prototype
for ‘binder_alloc_new_buf_locked’ [-Wmissing-prototypes]
In addition, the line containing the function name exceeds 80 characters
when static is added to this function, hence its arguments are indented anew.
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
3f827245463a57f5ef64a665e1ca64eed0da00a5)
Change-Id: I6b379df815d30f9b3e9f1dd50334375123b25bbc
Tetsuo Handa [Wed, 29 Nov 2017 13:29:47 +0000 (22:29 +0900)]
UPSTREAM: android: binder: Check for errors in binder_alloc_shrinker_init().
Both list_lru_init() and register_shrinker() might return an error.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sherry Yang <sherryy@android.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit
533dfb250d1c8d2bb8c9b65252f7b296b29913d4)
Change-Id: I5325ccaf34a04179ef3dae73dd8f3abfd6e21565
Greg Kroah-Hartman [Tue, 17 Jul 2018 10:29:15 +0000 (12:29 +0200)]
Merge 4.14.56 into android-4.14
Changes in 4.14.56
media: rc: mce_kbd decoder: fix stuck keys
ASoC: mediatek: preallocate pages use platform device
MIPS: Call dump_stack() from show_regs()
MIPS: Use async IPIs for arch_trigger_cpumask_backtrace()
MIPS: Fix ioremap() RAM check
mmc: sdhci-esdhc-imx: allow 1.8V modes without 100/200MHz pinctrl states
mmc: dw_mmc: fix card threshold control configuration
ibmasm: don't write out of bounds in read handler
staging: rtl8723bs: Prevent an underflow in rtw_check_beacon_data().
staging: r8822be: Fix RTL8822be can't find any wireless AP
ata: Fix ZBC_OUT command block check
ata: Fix ZBC_OUT all bit handling
vmw_balloon: fix inflation with batching
ahci: Disable LPM on Lenovo 50 series laptops with a too old BIOS
USB: serial: ch341: fix type promotion bug in ch341_control_in()
USB: serial: cp210x: add another USB ID for Qivicon ZigBee stick
USB: serial: keyspan_pda: fix modem-status error handling
USB: yurex: fix out-of-bounds uaccess in read handler
USB: serial: mos7840: fix status-register error handling
usb: quirks: add delay quirks for Corsair Strafe
xhci: xhci-mem: off by one in xhci_stream_id_to_ring()
devpts: hoist out check for DEVPTS_SUPER_MAGIC
devpts: resolve devpts bind-mounts
Fix up non-directory creation in SGID directories
genirq/affinity: assign vectors to all possible CPUs
scsi: megaraid_sas: use adapter_type for all gen controllers
scsi: megaraid_sas: replace instance->ctrl_context checks with instance->adapter_type
scsi: megaraid_sas: replace is_ventura with adapter_type checks
scsi: megaraid_sas: Create separate functions to allocate ctrl memory
scsi: megaraid_sas: fix selection of reply queue
ALSA: hda/realtek - two more lenovo models need fixup of MIC_LOCATION
ALSA: hda - Handle pm failure during hotplug
mm: do not drop unused pages when userfaultd is running
fs/proc/task_mmu.c: fix Locked field in /proc/pid/smaps*
fs, elf: make sure to page align bss in load_elf_library
mm: do not bug_on on incorrect length in __mm_populate()
tracing: Reorder display of TGID to be after PID
kbuild: delete INSTALL_FW_PATH from kbuild documentation
arm64: neon: Fix function may_use_simd() return error status
tools build: fix # escaping in .cmd files for future Make
IB/hfi1: Fix incorrect mixing of ERR_PTR and NULL return values
i2c: tegra: Fix NACK error handling
iw_cxgb4: correctly enforce the max reg_mr depth
xen: setup pv irq ops vector earlier
nvme-pci: Remap CMB SQ entries on every controller reset
crypto: x86/salsa20 - remove x86 salsa20 implementations
uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn()
netfilter: nf_queue: augment nfqa_cfg_policy
netfilter: x_tables: initialise match/target check parameter struct
loop: add recursion validation to LOOP_CHANGE_FD
PM / hibernate: Fix oops at snapshot_write()
RDMA/ucm: Mark UCM interface as BROKEN
loop: remember whether sysfs_create_group() was done
f2fs: give message and set need_fsck given broken node id
Linux 4.14.56
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Greg Kroah-Hartman [Tue, 17 Jul 2018 09:39:34 +0000 (11:39 +0200)]
Linux 4.14.56
Jaegeuk Kim [Tue, 24 Apr 2018 05:02:31 +0000 (23:02 -0600)]
f2fs: give message and set need_fsck given broken node id
commit
a4f843bd004d775cbb360cd375969b8a479568a9 upstream.
syzbot hit the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
link: https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5436660795834368
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.
F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/node.c:1185!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185
RSP: 0018:ffff8801d960e820 EFLAGS: 00010293
RAX: ffff8801d88205c0 RBX: 0000000000000003 RCX: ffffffff82f6cc06
RDX: 0000000000000000 RSI: ffffffff82f6d5e8 RDI: 0000000000000004
RBP: ffff8801d960ec30 R08: ffff8801d88205c0 R09: ffffed003b5e46c2
R10: 0000000000000003 R11: 0000000000000003 R12: ffff8801a86e00c0
R13: 0000000000000001 R14: ffff8801a86e0530 R15: ffff8801d9745240
FS: 000000000072c880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3d403209b8 CR3: 00000001d8f3f000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
get_node_page fs/f2fs/node.c:1237 [inline]
truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014
remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039
f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547
evict+0x4a6/0x960 fs/inode.c:557
iput_final fs/inode.c:1519 [inline]
iput+0x62d/0xa80 fs/inode.c:1545
f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849
mount_bdev+0x30c/0x3e0 fs/super.c:1164
f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
mount_fs+0xae/0x328 fs/super.c:1267
vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
vfs_kern_mount fs/namespace.c:1027 [inline]
do_new_mount fs/namespace.c:2518 [inline]
do_mount+0x564/0x3070 fs/namespace.c:2848
ksys_mount+0x12d/0x140 fs/namespace.c:3064
__do_sys_mount fs/namespace.c:3078 [inline]
__se_sys_mount fs/namespace.c:3075 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443dea
RSP: 002b:00007ffcc7882368 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443dea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffcc7882370
RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004
R13: 0000000000402ce0 R14: 0000000000000000 R15: 0000000000000000
RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: ffff8801d960e820
---[ end trace 4edbeb71f002bb76 ]---
Reported-and-tested-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tetsuo Handa [Fri, 4 May 2018 16:58:09 +0000 (10:58 -0600)]
loop: remember whether sysfs_create_group() was done
commit
d3349b6b3c373ac1fbfb040b810fcee5e2adc7e0 upstream.
syzbot is hitting WARN() triggered by memory allocation fault
injection [1] because loop module is calling sysfs_remove_group()
when sysfs_create_group() failed.
Fix this by remembering whether sysfs_create_group() succeeded.
[1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739ff29@syzkaller.appspotmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Renamed sysfs_ready -> sysfs_inited.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Leon Romanovsky [Wed, 23 May 2018 05:22:11 +0000 (08:22 +0300)]
RDMA/ucm: Mark UCM interface as BROKEN
commit
7a8690ed6f5346f6738971892205e91d39b6b901 upstream.
In commit
357d23c811a7 ("Remove the obsolete libibcm library")
in rdma-core [1], we removed obsolete library which used the
/dev/infiniband/ucmX interface.
Following multiple syzkaller reports about non-sanitized
user input in the UCMA module, the short audit reveals the same
issues in UCM module too.
It is better to disable this interface in the kernel,
before the syzkaller team invests time and energy in hardening
this unused interface.
[1] https://github.com/linux-rdma/rdma-core/pull/279
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tetsuo Handa [Sat, 26 May 2018 00:59:36 +0000 (09:59 +0900)]
PM / hibernate: Fix oops at snapshot_write()
commit
fc14eebfc20854a38fd9f1d93a42b1783dad4d17 upstream.
syzbot is reporting NULL pointer dereference at snapshot_write() [1].
This is because data->handle is zero-cleared by ioctl(SNAPSHOT_FREE).
Fix this by checking data_of(data->handle) != NULL before using it.
[1] https://syzkaller.appspot.com/bug?id=828a3c71bd344a6de8b6a31233d51a72099f27fd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ae590932da6e45d6564d@syzkaller.appspotmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Theodore Ts'o [Mon, 7 May 2018 15:37:58 +0000 (11:37 -0400)]
loop: add recursion validation to LOOP_CHANGE_FD
commit
d2ac838e4cd7e5e9891ecc094d626734b0245c99 upstream.
Refactor the validation code used in LOOP_SET_FD so it is also used in
LOOP_CHANGE_FD. Otherwise it is possible to construct a set of loop
devices that all refer to each other. This can lead to an infinite
loop in loop_set_fd(), starting with "while (is_loop_device(f)) ..".
Fix this by refactoring out the validation code and using it for
LOOP_CHANGE_FD as well as LOOP_SET_FD.
Reported-by: syzbot+4349872271ece473a7c91190b68b4bac7c5dbc87@syzkaller.appspotmail.com
Reported-by: syzbot+40bd32c4d9a3cc12a339@syzkaller.appspotmail.com
Reported-by: syzbot+769c54e66f994b041be7@syzkaller.appspotmail.com
Reported-by: syzbot+0a89a9ce473936c57065@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Florian Westphal [Thu, 7 Jun 2018 19:34:43 +0000 (21:34 +0200)]
netfilter: x_tables: initialise match/target check parameter struct
commit
c568503ef02030f169c9e19204def610a3510918 upstream.
syzbot reports following splat:
BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450
net/bridge/netfilter/ebt_stp.c:162
ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162
xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506
ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline]
ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline]
The uninitialised access is
xt_mtchk_param->nft_compat
... which should be set to 0.
Fix it by zeroing the struct beforehand, same for tgchk.
ip(6)tables targetinfo uses c99-style initialiser, so no change
needed there.
Reported-by: syzbot+da4494182233c23a5fcf@syzkaller.appspotmail.com
Fixes:
55917a21d0cc0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 13 Jun 2018 16:13:39 +0000 (09:13 -0700)]
netfilter: nf_queue: augment nfqa_cfg_policy
commit
ba062ebb2cd561d404e0fba8ee4b3f5ebce7cbfc upstream.
Three attributes are currently not verified, thus can trigger KMSAN
warnings such as :
BUG: KMSAN: uninit-value in __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
BUG: KMSAN: uninit-value in __fswab32 include/uapi/linux/swab.h:59 [inline]
BUG: KMSAN: uninit-value in nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
CPU: 1 PID: 4521 Comm: syz-executor120 Not tainted 4.17.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:113
kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
__msan_warning_32+0x70/0xc0 mm/kmsan/kmsan_instr.c:620
__arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
__fswab32 include/uapi/linux/swab.h:59 [inline]
nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
nfnetlink_rcv_msg+0xb2e/0xc80 net/netfilter/nfnetlink.c:212
netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
nfnetlink_rcv+0x2fe/0x680 net/netfilter/nfnetlink.c:513
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg net/socket.c:639 [inline]
___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
__sys_sendmsg net/socket.c:2155 [inline]
__do_sys_sendmsg net/socket.c:2164 [inline]
__se_sys_sendmsg net/socket.c:2162 [inline]
__x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43fd59
RSP: 002b:00007ffde0e30d28 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd59
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401680
R13: 0000000000401710 R14: 0000000000000000 R15: 0000000000000000
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
slab_post_alloc_hook mm/slab.h:446 [inline]
slab_alloc_node mm/slub.c:2753 [inline]
__kmalloc_node_track_caller+0xb35/0x11b0 mm/slub.c:4395
__kmalloc_reserve net/core/skbuff.c:138 [inline]
__alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
alloc_skb include/linux/skbuff.h:988 [inline]
netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg net/socket.c:639 [inline]
___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
__sys_sendmsg net/socket.c:2155 [inline]
__do_sys_sendmsg net/socket.c:2164 [inline]
__se_sys_sendmsg net/socket.c:2162 [inline]
__x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes:
fdb694a01f1f ("netfilter: Add fail-open support")
Fixes:
829e17a1a602 ("[NETFILTER]: nfnetlink_queue: allow changing queue length through netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Oleg Nesterov [Fri, 18 May 2018 16:27:39 +0000 (18:27 +0200)]
uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn()
commit
90718e32e1dcc2479acfa208ccfc6442850b594c upstream.
insn_get_length() has the side-effect of processing the entire instruction
but only if it was decoded successfully, otherwise insn_complete() can fail
and in this case we need to just return an error without warning.
Reported-by: syzbot+30d675e3ca03c1c351e7@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: syzkaller-bugs@googlegroups.com
Link: https://lkml.kernel.org/lkml/20180518162739.GA5559@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Biggers [Sat, 26 May 2018 07:08:58 +0000 (00:08 -0700)]
crypto: x86/salsa20 - remove x86 salsa20 implementations
commit
b7b73cd5d74694ed59abcdb4974dacb4ff8b2a2a upstream.
The x86 assembly implementations of Salsa20 use the frame base pointer
register (%ebp or %rbp), which breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.
Recent (v4.10+) kernels will warn about this, e.g.
WARNING: kernel stack regs at 00000000a8291e69 in syzkaller047086:4677 has bad 'bp' value 000000001077994c
[...]
But after looking into it, I believe there's very little reason to still
retain the x86 Salsa20 code. First, these are *not* vectorized
(SSE2/SSSE3/AVX2) implementations, which would be needed to get anywhere
close to the best Salsa20 performance on any remotely modern x86
processor; they're just regular x86 assembly. Second, it's still
unclear that anyone is actually using the kernel's Salsa20 at all,
especially given that now ChaCha20 is supported too, and with much more
efficient SSSE3 and AVX2 implementations. Finally, in benchmarks I did
on both Intel and AMD processors with both gcc 8.1.0 and gcc 4.9.4, the
x86_64 salsa20-asm is actually slightly *slower* than salsa20-generic
(~3% slower on Skylake, ~10% slower on Zen), while the i686 salsa20-asm
is only slightly faster than salsa20-generic (~15% faster on Skylake,
~20% faster on Zen). The gcc version made little difference.
So, the x86_64 salsa20-asm is pretty clearly useless. That leaves just
the i686 salsa20-asm, which based on my tests provides a 15-20% speed
boost. But that's without updating the code to not use %ebp. And given
the maintenance cost, the small speed difference vs. salsa20-generic,
the fact that few people still use i686 kernels, the doubt that anyone
is even using the kernel's Salsa20 at all, and the fact that a SSE2
implementation would almost certainly be much faster on any remotely
modern x86 processor yet no one has cared enough to add one yet, I don't
think it's worthwhile to keep.
Thus, just remove both the x86_64 and i686 salsa20-asm implementations.
Reported-by: syzbot+ffa3a158337bbc01ff09@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Keith Busch [Tue, 13 Feb 2018 12:44:44 +0000 (05:44 -0700)]
nvme-pci: Remap CMB SQ entries on every controller reset
commit
815c6704bf9f1c59f3a6be380a4032b9c57b12f1 upstream.
The controller memory buffer is remapped into a kernel address on each
reset, but the driver was setting the submission queue base address
only on the very first queue creation. The remapped address is likely to
change after a reset, so accessing the old address will hit a kernel bug.
This patch fixes that by setting the queue's CMB base address each time
the queue is created.
Fixes:
f63572dff1421 ("nvme: unmap CMB and remove sysfs file in reset path")
Reported-by: Christian Black <christian.d.black@intel.com>
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Juergen Gross [Thu, 12 Jul 2018 15:40:34 +0000 (17:40 +0200)]
xen: setup pv irq ops vector earlier
commit
0ce0bba4e5e0eb9b753bb821785de5d23c494392 upstream.
Setting pv_irq_ops for Xen PV domains should be done as early as
possible in order to support e.g. very early printk() usage.
The same applies to xen_vcpu_info_reset(0), as it is needed for the
pv irq ops.
Move the call of xen_setup_machphys_mapping() after initializing the
pv functions as it contains a WARN_ON(), too.
Remove the no longer necessary conditional in xen_init_irq_ops()
from PVH V1 times to make clear this is a PV only function.
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Steve Wise [Thu, 21 Jun 2018 14:43:21 +0000 (07:43 -0700)]
iw_cxgb4: correctly enforce the max reg_mr depth
commit
7b72717a20bba8bdd01b14c0460be7d15061cd6b upstream.
The code was mistakenly using the length of the page array memory instead
of the depth of the page array.
This would cause MR creation to fail in some cases.
Fixes:
8376b86de7d3 ("iw_cxgb4: Support the new memory registration API")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jon Hunter [Tue, 3 Jul 2018 08:55:43 +0000 (09:55 +0100)]
i2c: tegra: Fix NACK error handling
commit
54836e2d03e76d80aec3399368ffaf5b7caadd1b upstream.
On Tegra30 Cardhu the PCA9546 I2C mux is not ACK'ing I2C commands on
resume from suspend (which is caused by the reset signal for the I2C
mux not being configured correctly). However, this NACK is causing the
Tegra30 to hang on resuming from suspend which is not expected as we
detect NACKs and handle them. The hang observed appears to occur when
resetting the I2C controller to recover from the NACK.
Commit
77821b4678f9 ("i2c: tegra: proper handling of error cases") added
additional error handling for some error cases including NACK, however,
it appears that this change conflicts with an early fix by commit
f70893d08338 ("i2c: tegra: Add delay before resetting the controller
after NACK"). After commit
77821b4678f9 was made we now disable 'packet
mode' before the delay from commit
f70893d08338 happens. Testing shows
that moving the delay to before disabling 'packet mode' fixes the hang
observed on Tegra30. The delay was added to give the I2C controller
a chance to send a stop condition, and so it makes sense to move this to
before we disable packet mode. Please note that packet mode is always
enabled for Tegra.
Fixes:
77821b4678f9 ("i2c: tegra: proper handling of error cases")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Michael J. Ruhl [Wed, 20 Jun 2018 16:29:08 +0000 (09:29 -0700)]
IB/hfi1: Fix incorrect mixing of ERR_PTR and NULL return values
commit
b697d7d8c741f27b728a878fc55852b06d0f6f5e upstream.
The __get_txreq() function can return a pointer, ERR_PTR(-EBUSY), or NULL.
All of the relevant call sites look for IS_ERR, so the NULL return would
lead to a NULL pointer exception.
Do not use the ERR_PTR mechanism for this function.
Update all call sites to handle the return value correctly.
Clean up error paths to reflect return value.
Fixes:
45842abbb292 ("staging/rdma/hfi1: move txreq header code")
Cc: <stable@vger.kernel.org> # 4.9.x+
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul Menzel [Tue, 5 Jun 2018 17:00:22 +0000 (19:00 +0200)]
tools build: fix # escaping in .cmd files for future Make
commit
9feeb638cde083c737e295c0547f1b4f28e99583 upstream.
In 2016 GNU Make made a backwards incompatible change to the way '#'
characters were handled in Makefiles when used inside functions or
macros:
http://git.savannah.gnu.org/cgit/make.git/commit/?id=c6966b323811c37acedff05b57
Due to this change, when attempting to run `make prepare' I get a
spurious make syntax error:
/home/earnest/linux/tools/objtool/.fixdep.o.cmd:1: *** missing separator. Stop.
When inspecting `.fixdep.o.cmd' it includes two lines which use
unescaped comment characters at the top:
\# cannot find fixdep (/home/earnest/linux/tools/objtool//fixdep)
\# using basic dep data
This is because `tools/build/Build.include' prints these '\#'
characters:
printf '\# cannot find fixdep (%s)\n' $(fixdep) > $(dot-target).cmd; \
printf '\# using basic dep data\n\n' >> $(dot-target).cmd; \
This completes commit
9564a8cf422d ("Kbuild: fix # escaping in .cmd files
for future Make").
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197847
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yandong Zhao [Wed, 11 Jul 2018 11:06:28 +0000 (19:06 +0800)]
arm64: neon: Fix function may_use_simd() return error status
commit
2fd8eb4ad87104c54800ef3cea498c92eb15c78a upstream.
It does not matter if the caller of may_use_simd() migrates to
another cpu after the call, but it is still important that the
kernel_neon_busy percpu instance that is read matches the cpu the
task is running on at the time of the read.
This means that raw_cpu_read() is not sufficient. kernel_neon_busy
may appear true if the caller migrates during the execution of
raw_cpu_read() and the next task to be scheduled in on the initial
cpu calls kernel_neon_begin().
This patch replaces raw_cpu_read() with this_cpu_read() to protect
against this race.
Cc: <stable@vger.kernel.org>
Fixes:
cb84d11e1625 ("arm64: neon: Remove support for nested or hardirq kernel-mode NEON")
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Yandong Zhao <yandong77520@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Randy Dunlap [Wed, 4 Jul 2018 19:59:16 +0000 (12:59 -0700)]
kbuild: delete INSTALL_FW_PATH from kbuild documentation
commit
3f9cdee5929b7d035e86302dcf08fbf3e80b0739 upstream.
Removed Kbuild documentation for INSTALL_FW_PATH.
The kbuild symbol INSTALL_FW_PATH was removed from Kbuild tools in
September 2017 (for 4.14) but the symbol was not deleted from
the kbuild documentation, so do that now.
Fixes:
5620a0d1aacd ("firmware: delete in-kernel firmware")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: stable@vger.kernel.org # 4.14+
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joel Fernandes (Google) [Tue, 26 Jun 2018 00:08:22 +0000 (17:08 -0700)]
tracing: Reorder display of TGID to be after PID
commit
f8494fa3dd10b52eab47a9666a8bc34719a129aa upstream.
Currently ftrace displays data in trace output like so:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID CPU TGID |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 [000] ( 1091) d..2 28.313544: sched_switch:
However Android's trace visualization tools expect a slightly different
format due to an out-of-tree patch patch that was been carried for a
decade, notice that the TGID and CPU fields are reversed:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID TGID CPU |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 ( 1091) [002] d..2 64.965177: sched_switch:
From kernel v4.13 onwards, in which TGID support was introduced, tracing
with systrace on all Android kernels will break (most Android kernels
have been on 4.9 with Android patches, so this issue hasn't been seen
yet).
The chrome browser's tracing tools also embed the systrace viewer which
uses the legacy TGID format and updates to that are known to be
difficult to make.
Considering this, I suggest we make this change to the upstream kernel
and backport it to all Android kernels. I believe this feature is merged
recently enough into the upstream kernel that it shouldn't be a problem.
Also logically, IMO it makes more sense to group the TGID with the
TASK-PID and the CPU after these.
Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
Cc: jreck@google.com
Cc: tkjos@google.com
Cc: stable@vger.kernel.org
Fixes:
441dae8f2f29 ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Michal Hocko [Fri, 13 Jul 2018 23:59:20 +0000 (16:59 -0700)]
mm: do not bug_on on incorrect length in __mm_populate()
commit
bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.
syzbot has noticed that a specially crafted library can easily hit
VM_BUG_ON in __mm_populate
kernel BUG at mm/gup.c:1242!
invalid opcode: 0000 [#1] SMP
CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
RIP: 0010:__mm_populate+0x1e2/0x1f0
Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
Call Trace:
vm_brk_flags+0xc3/0x100
vm_brk+0x1f/0x30
load_elf_library+0x281/0x2e0
__ia32_sys_uselib+0x170/0x1e0
do_fast_syscall_32+0xca/0x420
entry_SYSENTER_compat+0x70/0x7f
The reason is that the length of the new brk is not page aligned when we
try to populate it. There is no reason to bug on that though.
do_brk_flags already aligns the length properly so the mapping is
expanded as it should. All we need is to tell mm_populate about it.
Besides that there is absolutely no reason to bug_on in the first
place. The worst thing that could happen is that the last page wouldn't
get populated and that is far from putting system into an inconsistent
state.
Fix the issue by moving the length sanitization code from do_brk_flags
up to vm_brk_flags. The only other caller of do_brk_flags is brk
syscall entry and it makes sure to provide the proper length, so there
is no need for sanitization and so we can use do_brk_flags without it.
Also remove the bogus BUG_ONs.
[osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Oscar Salvador [Fri, 13 Jul 2018 23:59:13 +0000 (16:59 -0700)]
fs, elf: make sure to page align bss in load_elf_library
commit
24962af7e1041b7e50c1bc71d8d10dc678c556b5 upstream.
The current code does not make sure to page align bss before calling
vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
the requested length not being correctly aligned.
Let us make sure to align it properly.
Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
for libc5.
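For illustration only, a user-space sketch of the alignment the patch
enforces (the real code applies the ELF_PAGEALIGN() macro to
p_filesz/p_memsz plus p_vaddr; the program-header values below are made
up):

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define ELF_PAGEALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
        /* Made-up program header values for illustration. */
        unsigned long p_vaddr  = 0x08048100UL;
        unsigned long p_filesz = 0x1234UL;
        unsigned long p_memsz  = 0x5678UL;

        /* Page-align both extents before asking for the extra anonymous
         * mapping that backs the bss. */
        unsigned long len = ELF_PAGEALIGN(p_filesz + p_vaddr);
        unsigned long bss = ELF_PAGEALIGN(p_memsz + p_vaddr);

        if (bss > len)
                printf("would vm_brk(%#lx, %#lx) with a page-aligned length\n",
                       len, bss - len);
        return 0;
}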
Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vlastimil Babka [Fri, 13 Jul 2018 23:58:56 +0000 (16:58 -0700)]
fs/proc/task_mmu.c: fix Locked field in /proc/pid/smaps*
commit
e70cc2bd579e8a9d6d153762f0fe294d0e652ff0 upstream.
Thomas reports:
"While looking around in /proc on my v4.14.52 system I noticed that all
processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
memory than a regular user can usually lock with mlock().
Commit
493b0e9d945f (in v4.14-rc1) seems to have changed the behavior
of "Locked".
Before that commit the code was like this. Notice the VM_LOCKED check.
(vma->vm_flags & VM_LOCKED) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
After that commit Locked is now the same as Pss:
(unsigned long)(mss->pss >> (10 + PSS_SHIFT)));
This looks like a mistake."
Indeed, the commit has added mss->pss_locked with the correct value that
depends on VM_LOCKED, but forgot to actually use it. Fix it.
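For illustration only, a user-space sketch of the intended behaviour
with a simplified stand-in for mem_size_stats (struct and values are
made up): Locked is printed from the pss_locked accumulator, which is
only updated for VM_LOCKED vmas, rather than from the total Pss:

#include <stdio.h>

#define PSS_SHIFT 12

/* Simplified stand-in for mem_size_stats. */
struct mss_sketch {
        unsigned long long pss;         /* accumulated over all vmas */
        unsigned long long pss_locked;  /* accumulated over VM_LOCKED vmas only */
};

static void show_smap_sketch(const struct mss_sketch *mss)
{
        printf("Pss:    %8llu kB\n", mss->pss >> (10 + PSS_SHIFT));
        /* The fix: report pss_locked here instead of reusing pss. */
        printf("Locked: %8llu kB\n", mss->pss_locked >> (10 + PSS_SHIFT));
}

int main(void)
{
        struct mss_sketch mss = {
                .pss        = 200ULL << (10 + PSS_SHIFT),
                .pss_locked = 16ULL  << (10 + PSS_SHIFT),
        };

        show_smap_sketch(&mss);
        return 0;
}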
Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
Fixes:
493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christian Borntraeger [Fri, 13 Jul 2018 23:58:52 +0000 (16:58 -0700)]
mm: do not drop unused pages when userfaultfd is running
commit
bce73e4842390f7b7309c8e253e139db71288ac3 upstream.
KVM guests on s390 can notify the host of unused pages. This can result
in pte_unused callbacks returning true for KVM guest memory.
If a page is unused (checked with pte_unused) we might drop this page
instead of paging it. This can have side-effects on userfaultfd, when
the page in question was already migrated:
The next access of that page will then trigger a fault and a userfault,
instead of faulting in a new and empty zero page. As QEMU does not
expect a userfault on an already migrated page, this migration will fail.
The most straightforward solution is to ignore the pte_unused hint if a
userfault context is active for this VMA.
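For illustration only, the decision can be sketched in user space with
stubbed predicates standing in for pte_unused() and userfaultfd_armed()
(none of this is the rmap code itself): a page marked unused is only
dropped when no userfaultfd context is armed for the vma:

#include <stdbool.h>
#include <stdio.h>

/* Stubbed stand-ins; not the mm/rmap.c code. */
struct vma_sketch { bool uffd_armed; };
struct pte_sketch { bool unused; };

static bool pte_unused_sketch(const struct pte_sketch *pte)
{
        return pte->unused;     /* the guest told us this page is unused */
}

static bool userfaultfd_armed_sketch(const struct vma_sketch *vma)
{
        return vma->uffd_armed;
}

/* Before the fix: drop whenever the pte is marked unused.
 * After the fix: ignore the hint while userfaultfd is active. */
static bool may_drop_page(const struct pte_sketch *pte,
                          const struct vma_sketch *vma)
{
        return pte_unused_sketch(pte) && !userfaultfd_armed_sketch(vma);
}

int main(void)
{
        struct pte_sketch pte = { .unused = true };
        struct vma_sketch vma = { .uffd_armed = true };

        printf("drop page? %s\n", may_drop_page(&pte, &vma) ? "yes" : "no");
        return 0;
}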
Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chris Wilson [Wed, 27 Jun 2018 06:25:32 +0000 (07:25 +0100)]
ALSA: hda - Handle pm failure during hotplug
commit
aaa23f86001bdb82d2f937c5c7bce0a1e11a6c5b upstream.
Obtaining the runtime pm wakeref can fail, especially in a hotplug
scenario where i915.ko has been unloaded. If we do not catch the
failure, we end up with an unbalanced pm.
v2 additions by tiwai:
hdmi_present_sense() checks the return value, handles only a
negative error case, and bails out only if it's really still suspended.
Also, snd_hda_power_down() is called at the error path so that the
refcount is balanced.
Along with it, the spec->pcm_lock is taken outside
hdmi_present_sense() on the caller side, so that it won't cause a
deadlock at re-entrance via runtime resume.
v3 fix by tiwai:
Missing linux/pm_runtime.h is included.
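For illustration only, a user-space sketch of the refcount balancing
(power_up()/power_down() are stand-ins for the
snd_hda_power_up()/snd_hda_power_down() pair and the failure is
simulated): when taking the power reference fails, the error path drops
the reference again before bailing out, so the count stays balanced:

#include <stdio.h>

static int power_refcount;
static int device_gone = 1;     /* simulate i915.ko having been unloaded */

/* Stand-in for snd_hda_power_up(): takes a reference even when the
 * underlying resume fails. */
static int power_up(void)
{
        power_refcount++;
        return device_gone ? -1 : 0;
}

/* Stand-in for snd_hda_power_down(). */
static void power_down(void)
{
        power_refcount--;
}

static int present_sense_sketch(void)
{
        int err = power_up();

        if (err < 0) {
                /* Error path: drop the reference we just took so the
                 * refcount stays balanced, then give up. */
                power_down();
                return err;
        }

        /* ... the actual presence check would happen here ... */
        power_down();
        return 0;
}

int main(void)
{
        int err = present_sense_sketch();

        printf("sense returned %d, refcount now %d\n", err, power_refcount);
        return 0;
}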
References:
222bde03881c ("ALSA: hda - Fix mutex deadlock at HDMI/DP hotplug")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>