sched/core: Implement new approach to scale select_idle_cpu()
authorPeter Zijlstra <peterz@infradead.org>
Wed, 17 May 2017 10:53:50 +0000 (12:53 +0200)
committerIngo Molnar <mingo@kernel.org>
Thu, 8 Jun 2017 08:25:17 +0000 (10:25 +0200)
Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consequtive in the bitmask.

This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping to scan entirely, we limit
how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel/sched/fair.c
kernel/sched/features.h

index 47a0c552c77bac559f579704af2ae30c60309ffc..396bca9c7996af973e861c1a5cf2029d2948ba2b 100644 (file)
@@ -5794,27 +5794,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
        struct sched_domain *this_sd;
-       u64 avg_cost, avg_idle = this_rq()->avg_idle;
+       u64 avg_cost, avg_idle;
        u64 time, cost;
        s64 delta;
-       int cpu;
+       int cpu, nr = INT_MAX;
 
        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
                return -1;
 
-       avg_cost = this_sd->avg_scan_cost;
-
        /*
         * Due to large variance we need a large fuzz factor; hackbench in
         * particularly is sensitive here.
         */
-       if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost)
+       avg_idle = this_rq()->avg_idle / 512;
+       avg_cost = this_sd->avg_scan_cost + 1;
+
+       if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
                return -1;
 
+       if (sched_feat(SIS_PROP)) {
+               u64 span_avg = sd->span_weight * avg_idle;
+               if (span_avg > 4*avg_cost)
+                       nr = div_u64(span_avg, avg_cost);
+               else
+                       nr = 4;
+       }
+
        time = local_clock();
 
        for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+               if (!--nr)
+                       return -1;
                if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                        continue;
                if (idle_cpu(cpu))
index dc4d1483b038176c20a0eb2d1af3c7cd499c6688..d3fb15555291e0bfb4ce2e837f38cc1676a664d4 100644 (file)
@@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
+SCHED_FEAT(SIS_PROP, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls