semaphore: fix
authorIngo Molnar <mingo@elte.hu>
Thu, 8 May 2008 09:53:48 +0000 (11:53 +0200)
committerIngo Molnar <mingo@elte.hu>
Thu, 8 May 2008 15:00:42 +0000 (17:00 +0200)
Yanmin Zhang reported:

| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
| and Itanium Montecito. Bisect located the patch below:
|
64ac24e738823161693bf791f87adc802cf529ff is first bad commit
| commit 64ac24e738823161693bf791f87adc802cf529ff
| Author: Matthew Wilcox <matthew@wil.cx>
| Date:   Fri Mar 7 21:55:58 2008 -0500
|
|     Generic semaphore implementation
|
| After I manually reverted the patch against 2.6.26-rc1 while fixing
| lots of conflicts/errors, aim7 regression became less than 2%.

i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.

The problem comes from the following code. The new semaphore code does
this on down():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

        list_del(&waiter->list);
        waiter->up = 1;
        wake_up_process(waiter->task);

and where __down() does this in essence:

        list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.

That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.

So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another task gets to lock_kernel() sooner than the "new owner"
scheduled, it will be blocked unnecessarily and for a very long time
when there are 2000 tasks running.

I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.

This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.

Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophy' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!

I believe Yanmin's findings and numbers support this analysis too.

The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".

The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:

  # v2.6.25 vanilla:
  ..................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    56096.4         91      207.5   789.7   0.4675
  2000    55894.4         94      208.2   792.7   0.4658

  # v2.6.26-rc1-166-gc0a1811 vanilla:
  ...................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    33230.6         83      350.3   784.5   0.2769
  2000    31778.1         86      366.3   783.6   0.2648

  # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
  ...............................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55707.1         92      209.0   795.6   0.4642
  2000    55704.4         96      209.0   796.0   0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.

Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.

There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:

   text    data     bss     dec     hex filename
   1241       0       0    1241     4d9 semaphore.o.before
   1207       0       0    1207     4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.

Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/semaphore.c

index 5c2942e768cdd44371ae305d5870abfe7428732c..5e41217239e89f3fa203f9f73e4476e0985ecd79 100644 (file)
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
        unsigned long flags;
 
        spin_lock_irqsave(&sem->lock, flags);
-       if (likely(sem->count > 0))
-               sem->count--;
-       else
+       if (unlikely(!sem->count))
                __down(sem);
+       sem->count--;
        spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
        int result = 0;
 
        spin_lock_irqsave(&sem->lock, flags);
-       if (likely(sem->count > 0))
-               sem->count--;
-       else
+       if (unlikely(!sem->count))
                result = __down_interruptible(sem);
+       if (!result)
+               sem->count--;
        spin_unlock_irqrestore(&sem->lock, flags);
 
        return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
        int result = 0;
 
        spin_lock_irqsave(&sem->lock, flags);
-       if (likely(sem->count > 0))
-               sem->count--;
-       else
+       if (unlikely(!sem->count))
                result = __down_killable(sem);
+       if (!result)
+               sem->count--;
        spin_unlock_irqrestore(&sem->lock, flags);
 
        return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)
        int result = 0;
 
        spin_lock_irqsave(&sem->lock, flags);
-       if (likely(sem->count > 0))
-               sem->count--;
-       else
+       if (unlikely(!sem->count))
                result = __down_timeout(sem, jiffies);
+       if (!result)
+               sem->count--;
        spin_unlock_irqrestore(&sem->lock, flags);
 
        return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
        unsigned long flags;
 
        spin_lock_irqsave(&sem->lock, flags);
-       if (likely(list_empty(&sem->wait_list)))
-               sem->count++;
-       else
+       sem->count++;
+       if (unlikely(!list_empty(&sem->wait_list)))
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
        struct list_head list;
        struct task_struct *task;
-       int up;
 };
 
 /*
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 {
        struct task_struct *task = current;
        struct semaphore_waiter waiter;
+       int ret = 0;
 
-       list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
-       waiter.up = 0;
+       list_add_tail(&waiter.list, &sem->wait_list);
 
        for (;;) {
-               if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-                       goto interrupted;
-               if (state == TASK_KILLABLE && fatal_signal_pending(task))
-                       goto interrupted;
-               if (timeout <= 0)
-                       goto timed_out;
+               if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+                       ret = -EINTR;
+                       break;
+               }
+               if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+                       ret = -EINTR;
+                       break;
+               }
+               if (timeout <= 0) {
+                       ret = -ETIME;
+                       break;
+               }
                __set_task_state(task, state);
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
-               if (waiter.up)
-                       return 0;
+               if (sem->count > 0)
+                       break;
        }
 
- timed_out:
-       list_del(&waiter.list);
-       return -ETIME;
-
- interrupted:
        list_del(&waiter.list);
-       return -EINTR;
+       return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
 {
        struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
                                                struct semaphore_waiter, list);
-       list_del(&waiter->list);
-       waiter->up = 1;
        wake_up_process(waiter->task);
 }