vmscan: kill prev_priority completely
author KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tue, 10 Aug 2010 00:19:27 +0000 (17:19 -0700)
committer Linus Torvalds <torvalds@linux-foundation.org>
Tue, 10 Aug 2010 03:45:00 +0000 (20:45 -0700)
zone->prev_priority has been unused since 2.6.28, so it can be removed
safely. Removing it also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago, I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
have not produced good performance numbers. Thus I am giving up on that
approach.

The rest of this changelog is notes on prev_priority, why it existed in
the first place and why it might no longer be necessary. This information
is based heavily on discussions between Andrew Morton, Rik van Riel and
Kosaki Motohiro, who is heavily quoted.

Historically, prev_priority was important because it determined when the VM
would start unmapping PTE pages, i.e. there were no other balances of note
within the VM, such as Anon vs File or Mapped vs Unmapped. Without
prev_priority, there is a potential risk of unnecessarily increasing minor
faults, as a large amount of read activity on use-once pages could push
mapped pages to the end of the LRU, where they get unmapped.

There is no proof that this is still a problem, but currently it is not
considered to be one. Active files are not deactivated if the active file
list is smaller than the inactive list, which reduces the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them getting pushed out by read
activity.

"Even if it is a problem, prev_priority wouldn't work nowadays. First of
all, current vmscan still has a lot of UP-centric code, which exposes some
weaknesses on machines with dozens of CPUs. I think we need more and more
improvement.

The problem is that current vmscan mixes up per-system pressure, per-zone
pressure and per-task pressure a bit. For example, prev_priority tries to
boost the priority of other concurrent reclaimers. But if another task has
a mempolicy restriction, the boost is unnecessary, and it also causes
wrongly large latency and excessive reclaim. Per-task priority plus the
prev_priority adjustment emulates per-system pressure, but it has two
issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
pressure, not per-system pressure.

As another example, DEF_PRIORITY is currently 12. That means the LRU rotates
about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the OOM killer is
invoked. But if 10,000 threads enter DEF_PRIORITY reclaim at the same time,
the system is under higher memory pressure than at priority==0
(1/4096*10,000 > 2). prev_priority can't solve such multithreaded workload
issues. In other words, the prev_priority concept assumes the system doesn't
have lots of threads."

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/memcontrol.h
include/linux/mmzone.h
mm/memcontrol.c
mm/page_alloc.c
mm/vmscan.c
mm/vmstat.c

index 9411d32840b055cfc5b445a670e43e8a0173f585..9f1afd361583f794e28bcdcd1e6ad20a5288e3af 100644 (file)
@@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-                                                       int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-                                                       int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
index 9ed9c459b14c2ce9331a9c6981b3564f582b3dfd..6e6e62648a4d4a6d792fe207d42105563b112aaa 100644 (file)
@@ -347,21 +347,6 @@ struct zone {
        /* Zone statistics */
        atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
-       /*
-        * prev_priority holds the scanning priority for this zone.  It is
-        * defined as the scanning priority at which we achieved our reclaim
-        * target at the previous try_to_free_pages() or balance_pgdat()
-        * invocation.
-        *
-        * We use prev_priority as a measure of how much stress page reclaim is
-        * under - it drives the swappiness decision: whether to unmap mapped
-        * pages.
-        *
-        * Access to both this field is quite racy even on uniprocessor.  But
-        * it is expected to average out OK.
-        */
-       int prev_priority;
-
        /*
         * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
         * this zone's LRU.  Maintained by the pageout code.
index 20a8193a7af8ed275667e682fd22101a0e40bb93..31abd1c2c0c5b2f3f9f1b5cb197a948cddb76596 100644 (file)
@@ -211,8 +211,6 @@ struct mem_cgroup {
        */
        spinlock_t reclaim_param_lock;
 
-       int     prev_priority;  /* for recording reclaim priority */
-
        /*
         * While reclaiming in a hierarchy, we cache the last child we
         * reclaimed from.
@@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
        return ret;
 }
 
-/*
- * prev_priority control...this will be used in memory reclaim path.
- */
-int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
-{
-       int prev_priority;
-
-       spin_lock(&mem->reclaim_param_lock);
-       prev_priority = mem->prev_priority;
-       spin_unlock(&mem->reclaim_param_lock);
-
-       return prev_priority;
-}
-
-void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-       spin_lock(&mem->reclaim_param_lock);
-       if (priority < mem->prev_priority)
-               mem->prev_priority = priority;
-       spin_unlock(&mem->reclaim_param_lock);
-}
-
-void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
-{
-       spin_lock(&mem->reclaim_param_lock);
-       mem->prev_priority = priority;
-       spin_unlock(&mem->reclaim_param_lock);
-}
-
 static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
 {
        unsigned long active;
index 33c6b4c1277b1c7dae15349c151a75e207dc9cd9..a9649f4b261e6b3c01632939c46a77f19f447de1 100644 (file)
@@ -4100,8 +4100,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
                zone_seqlock_init(zone);
                zone->zone_pgdat = pgdat;
 
-               zone->prev_priority = DEF_PRIORITY;
-
                zone_pcp_init(zone);
                for_each_lru(l) {
                        INIT_LIST_HEAD(&zone->lru[l].list);
index b7a4e6a3cf89acbbc973493acf2498c51fa668a9..594eba8a44c09a3314e7ec408bab3c1f55c8afe5 100644 (file)
@@ -1289,20 +1289,6 @@ done:
        return nr_reclaimed;
 }
 
-/*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-       if (priority < zone->prev_priority)
-               zone->prev_priority = priority;
-}
-
 /*
  * This moves pages from the active list to the inactive list.
  *
@@ -1766,17 +1752,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
                if (scanning_global_lru(sc)) {
                        if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                                continue;
-                       note_zone_scanning_priority(zone, priority);
-
                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                                continue;       /* Let kswapd poll it */
-               } else {
-                       /*
-                        * Ignore cpuset limitation here. We just want to reduce
-                        * # of used pages by us regardless of memory shortage.
-                        */
-                       mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-                                                       priority);
                }
 
                shrink_zone(priority, zone, sc);
@@ -1877,17 +1854,6 @@ out:
        if (priority < 0)
                priority = 0;
 
-       if (scanning_global_lru(sc)) {
-               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-                       if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-                               continue;
-
-                       zone->prev_priority = priority;
-               }
-       } else
-               mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
        delayacct_freepages_end();
        put_mems_allowed();
 
@@ -2053,22 +2019,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
                .order = order,
                .mem_cgroup = NULL,
        };
-       /*
-        * temp_priority is used to remember the scanning priority at which
-        * this zone was successfully refilled to
-        * free_pages == high_wmark_pages(zone).
-        */
-       int temp_priority[MAX_NR_ZONES];
-
 loop_again:
        total_scanned = 0;
        sc.nr_reclaimed = 0;
        sc.may_writepage = !laptop_mode;
        count_vm_event(PAGEOUTRUN);
 
-       for (i = 0; i < pgdat->nr_zones; i++)
-               temp_priority[i] = DEF_PRIORITY;
-
        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
                unsigned long lru_pages = 0;
@@ -2136,9 +2092,7 @@ loop_again:
                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                                continue;
 
-                       temp_priority[i] = priority;
                        sc.nr_scanned = 0;
-                       note_zone_scanning_priority(zone, priority);
 
                        nid = pgdat->node_id;
                        zid = zone_idx(zone);
@@ -2211,16 +2165,6 @@ loop_again:
                        break;
        }
 out:
-       /*
-        * Note within each zone the priority level at which this zone was
-        * brought into a happy state.  So that the next thread which scans this
-        * zone will start out at that priority level.
-        */
-       for (i = 0; i < pgdat->nr_zones; i++) {
-               struct zone *zone = pgdat->node_zones + i;
-
-               zone->prev_priority = temp_priority[i];
-       }
        if (!all_zones_ok) {
                cond_resched();
 
@@ -2639,7 +2583,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
                 */
                priority = ZONE_RECLAIM_PRIORITY;
                do {
-                       note_zone_scanning_priority(zone, priority);
                        shrink_zone(priority, zone, &sc);
                        priority--;
                } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
index 15a14b16e1767635aca2d1eaae9dc1568d8e2e0d..f389168f9a837b9c6be4e1f9bb3d0892396315de 100644 (file)
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
        }
        seq_printf(m,
                   "\n  all_unreclaimable: %u"
-                  "\n  prev_priority:     %i"
                   "\n  start_pfn:         %lu"
                   "\n  inactive_ratio:    %u",
                   zone->all_unreclaimable,
-                  zone->prev_priority,
                   zone->zone_start_pfn,
                   zone->inactive_ratio);
        seq_putc(m, '\n');