HATAYAMA Daisuke [Wed, 3 Jul 2013 22:02:15 +0000 (15:02 -0700)]
vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
Treat memory chunks referenced by PT_LOAD program header entries in
page-size boundary in vmcore_list. Formally, for each range [start,
end], we set up the corresponding vmcore object in vmcore_list to
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
This change affects layout of /proc/vmcore. The gaps generated by the
rearrangement are newly made visible to applications as holes.
Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].
Suppose variable m points at a vmcore object in vmcore_list, and
variable phdr points at the program header of PT_LOAD type the variable
m corresponds to. Then, pictorially:
m->offset +---------------+
| hole |
phdr->p_offset = +---------------+
m->offset + (paddr - start) | |\
| kernel memory | phdr->p_memsz
| |/
+---------------+
| hole |
m->offset + m->size +---------------+
where m->offset and m->offset + m->size are always page-size aligned.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HATAYAMA Daisuke [Wed, 3 Jul 2013 22:02:14 +0000 (15:02 -0700)]
vmcore: allocate buffer for ELF headers on page-size alignment
Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().
Later patch will merge PT_NOTE entries into a single unique one and
decrease the buffer size actually used. Keep original buffer size in
variable elfcorebuf_sz_orig to kfree the buffer later and actually used
buffer size with rounded up to page-size boundary in variable
elfcorebuf_sz separately.
The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.
The merged, removed PT_NOTE entries, i.e. the range [elfcorebuf_sz,
elfcorebuf_sz_orig], is filled with 0.
Use size of the ELF headers as an initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that the
offset includes the holes towards the page boundary.
As a result, both set_vmcore_list_offsets_elf{64,32} have the same
definition. Merge them as set_vmcore_list_offsets.
[akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HATAYAMA Daisuke [Wed, 3 Jul 2013 22:02:13 +0000 (15:02 -0700)]
vmcore: clean up read_vmcore()
Rewrite part of read_vmcore() that reads objects in vmcore_list in the
same way as part reading ELF headers, by which some duplicated and
redundant codes are removed.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 3 Jul 2013 22:02:11 +0000 (15:02 -0700)]
include/linux/mm.h: add PAGE_ALIGNED() helper
To test whether an address is aligned to PAGE_SIZE.
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:02:11 +0000 (15:02 -0700)]
memory_hotplug: use pgdat_resize_lock() in __offline_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
follows:
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. [...]
So actually hold it when we update node_present_pages in __offline_pages().
[akpm@linux-foundation.org: fix build]
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:02:10 +0000 (15:02 -0700)]
memory_hotplug: use pgdat_resize_lock() in online_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
follows:
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. [...]
So actually hold it when we update node_present_pages in online_pages().
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:02:09 +0000 (15:02 -0700)]
mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:02:08 +0000 (15:02 -0700)]
mm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:02:06 +0000 (15:02 -0700)]
fs: nfs: inform the VM about pages being committed or unstable
VM page reclaim uses dirty and writeback page states to determine if
flushers are cleaning pages too slowly and that page reclaim should
stall waiting on flushers to catch up. Page state in NFS is a bit more
complex and a clean page can be unreclaimable due to being unstable
which is effectively "dirty" from the perspective of the VM from reclaim
context. Similarly, if the inode is currently being committed then it's
similar to being under writeback.
This patch adds a is_dirty_writeback() handled for NFS that checks if a
pages backing inode is being committed and should be accounted as
writeback and if a page has private state indicating that it is
effectively dirty.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:02:05 +0000 (15:02 -0700)]
mm: vmscan: take page buffers dirty and locked state into account
Page reclaim keeps track of dirty and under writeback pages and uses it
to determine if wait_iff_congested() should stall or if kswapd should
begin writing back pages. This fails to account for buffer pages that
can be under writeback but not PageWriteback which is the case for
filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
can have all the buffers clean and writepage does no IO so it should not
be accounted as congested.
This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback. An implementation is provided for for buffer_heads is added
and used for block operations and ext3 in ordered mode. By default the
page flags are obeyed.
Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:02:03 +0000 (15:02 -0700)]
mm: vmscan: treat pages marked for immediate reclaim as zone congestion
Currently a zone will only be marked congested if the underlying BDI is
congested but if dirty pages are spread across zones it is possible that
an individual zone is full of dirty pages without being congested. The
impact is that zone gets scanned very quickly potentially reclaiming
really clean pages. This patch treats pages marked for immediate
reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
and stalling in wait_iff_congested.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:02:02 +0000 (15:02 -0700)]
mm: vmscan: move direct reclaim wait_iff_congested into shrink_list
shrink_inactive_list makes decisions on whether to stall based on the
number of dirty pages encountered. The wait_iff_congested() call in
shrink_page_list does no such thing and it's arbitrary.
This patch moves the decision on whether to set ZONE_CONGESTED and the
wait_iff_congested call into shrink_page_list. This keeps all the
decisions on whether to stall or not in the one place.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:02:00 +0000 (15:02 -0700)]
mm: vmscan: set zone flags before blocking
In shrink_page_list a decision may be made to stall and flag a zone as
ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
encountered later then the reclaimer will stall. Set ZONE_WRITEBACK
before potentially going to sleep so it is noticed sooner.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:58 +0000 (15:01 -0700)]
mm: vmscan: stall page reclaim after a list of pages have been processed
Commit "mm: vmscan: Block kswapd if it is encountering pages under
writeback" blocks page reclaim if it encounters pages under writeback
marked for immediate reclaim. It blocks while pages are still isolated
from the LRU which is unnecessary. This patch defers the blocking until
after the isolated pages have been processed and tidies up some of the
comments.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:57 +0000 (15:01 -0700)]
mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems. First and foremost, it's possible for pages
under writeback to be freed which will lead to badness. Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed. In some cases this results in
increased read IO to re-read data from disk. Third, more pages were
being written from kswapd context which can adversly affect IO
performance. Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3). This disconnect confuses the reclaim stalling logic. This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-
20130522 is mmotm as of 22nd May with "Reduce system disruption due to
kswapd" applied on top as per what should be in Andrew's tree
right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-
20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%)
Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%)
Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%)
Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%)
Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%)
Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%)
Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M
1536429.00 ( 0.00%)
1531632.00 ( 0.31%)
1533541.00 ( 0.19%)
Ops minorfaults-715M
1786996.00 ( 0.00%)
1612148.00 ( 9.78%)
1608832.00 ( 9.97%)
Ops minorfaults-2385M
1757952.00 ( 0.00%)
1614874.00 ( 8.14%)
1613541.00 ( 8.21%)
Ops minorfaults-4055M
1774460.00 ( 0.00%)
1633400.00 ( 7.95%)
1630881.00 ( 8.09%)
Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%)
Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%)
Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
23K/sec to just over 4K/second when there is 2385M of IO going
on in the background. With current mmotm, there is no collapse
in performance and with this follow-up series there is little
change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults
11160152 10706748 10622316
Major Faults 46305 755 678
Swap Ins 260249 0 0
Swap Outs 683860 18 18
Direct pages scanned 0 678 2520
Kswapd pages scanned
6046108 8814900 1639279
Kswapd pages reclaimed
1081954 1172267 1094635
Direct pages reclaimed 0 566 2304
Kswapd efficiency 17% 13% 66%
Kswapd velocity 5217.560 7618.953 1414.879
Direct efficiency 100% 83% 91%
Direct velocity 0.000 0.586 2.175
Percentage direct scans 0% 0% 0%
Zone normal velocity 5105.086 6824.681 671.158
Zone dma32 velocity 112.473 794.858 745.896
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim
1929612.000
6861768.000 32821.000
Page writes file
1245752 6861750 32803
Page writes anon 683860 18 18
Page reclaim immediate 7484 40 239
Sector Reads
1130320 93996 86900
Sector Writes
13508052 10823500 11804436
Page rescued immediate 0 0 0
Slabs scanned 33536 27136 18560
Direct inode steals 0 0 0
Kswapd inode steals 8641 1035 0
Kswapd skipped wait 0 0 0
THP fault alloc 8 37 33
THP collapse alloc 508 552 515
THP splits 24 1 1
THP fault fallback 0 0 0
THP collapse fail 0 0 0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
pages swapped were really unused anonymous pages. Related to that,
major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
follow-up patches, the efficiency is now at 66% indicating that far
fewer pages were skipped during scanning due to dirty or writeback
pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
with the follow-up series as kswapd now stalls when the tail of the
LRU queue is full of unqueued dirty pages. The stall gives flushers a
chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
mmtests now reports scanning velocity on a per-zone basis. With mainline,
you can see that the scanning activity is dominated by the Normal
zone with over 45 times more scanning in Normal than the DMA32 zone.
With the series currently in mmotm, the ratio is slightly better but it
is still the case that the bulk of scanning is in the highest zone. With
this follow-up series, the ratio of scanning between the Normal and
DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
number of pages written from kswapd context which is expected to adversly
impact IO performance. With the follow-up patches, far fewer pages are
written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
the follow-up series, there is less slab shrinking activity and no inodes
were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
data being used for the IO is not being aggressively discarded due to
page reclaim skipping over dirty pages and reclaiming clean pages. Note
that the reducion in reads could also be due to inode data not being
re-read from disk after a slab shrink.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 166.99 32.09 33.44
Mean sda-await 853.64 192.76 185.43
Mean sda-r_await 6.31 9.24 5.97
Mean sda-w_await 2992.81 202.65 192.43
Max sda-avgqz 1409.91 718.75 698.98
Max sda-await 6665.74 3538.00 3124.23
Max sda-r_await 58.96 111.95 58.00
Max sda-w_await 28458.94 3977.29 3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
the same with this follow up.
2. Average wait times for writes are reduced and as the IO
is completing faster it at least implies that the gain is because
flushers are writing the files efficiently instead of page reclaim
getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-
20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%)
Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%)
Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%)
Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%)
Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%)
Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%)
Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%)
Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%)
Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%)
Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%)
Ops minorfaults-0M
1445969.00 ( 0.00%)
1520878.00 ( -5.18%)
1454024.00 ( -0.56%)
Ops minorfaults-715M
1557288.00 ( 0.00%)
1528482.00 ( 1.85%)
1535776.00 ( 1.38%)
Ops minorfaults-2385M
1692896.00 ( 0.00%)
1570523.00 ( 7.23%)
1559622.00 ( 7.87%)
Ops minorfaults-4055M
1654985.00 ( 0.00%)
1581456.00 ( 4.44%)
1596713.00 ( 3.52%)
Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%)
Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%)
Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
faster. Note with mmotm, IO completes in a third of the time and faster again
with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
look bad but it does vary a bit as the stalling is not perfect for nfs
or filesystems like ext3 with unusual handling of dirty and writeback
pages
3. There are swapins, particularly with larger amounts of IO indicating
that active pages are being reclaimed. However, the number of much
reduced.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults
36339175 35025445 35219699
Major Faults 310964 27108 51887
Swap Ins
2176399 173069 333316
Swap Outs
3344050 357228 504824
Direct pages scanned 8972 77283 43242
Kswapd pages scanned
20899983 8939566 14772851
Kswapd pages reclaimed
6193156 5172605 5231026
Direct pages reclaimed 8450 73802 39514
Kswapd efficiency 29% 57% 35%
Kswapd velocity 3929.743 1847.499 3058.840
Direct efficiency 94% 95% 91%
Direct velocity 1.687 15.972 8.954
Percentage direct scans 0% 0% 0%
Zone normal velocity 3721.907 939.103 2185.142
Zone dma32 velocity 209.522 924.368 882.651
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim
4082185.000 526319.000 537114.000
Page writes file 738135 169091 32290
Page writes anon
3344050 357228 504824
Page reclaim immediate 9524 170
5595843
Sector Reads
8909900 861192
1483680
Sector Writes
13428980 1488744 2076800
Page rescued immediate 0 0 0
Slabs scanned 38016 31744 28672
Direct inode steals 0 0 0
Kswapd inode steals 424 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 14 15 119
THP collapse alloc 1767 1569 1618
THP splits 30 29 25
THP fault fallback 0 0 0
THP collapse fail 8 5 0
Compaction stalls 17 41 100
Compaction success 7 31 95
Compaction failures 10 10 5
Page migrate success 7083 22157 62217
Page migrate failure 0 0 0
Compaction pages isolated 14847 48758 135830
Compaction migrate scanned 18328 48398 138929
Compaction free scanned
2000255 355827
1720269
Compaction cost 7 24 68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 23.58 0.35 0.44
Mean sda-await 133.47 15.72 15.46
Mean sda-r_await 4.72 4.69 3.95
Mean sda-w_await 507.69 28.40 33.68
Max sda-avgqz 680.60 12.25 23.14
Max sda-await 3958.89 221.83 286.22
Max sda-r_await 63.86 61.23 67.29
Max sda-w_await 11710.38 883.57 1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered. This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO. The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed. Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped. Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages. Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up. The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only. Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:54 +0000 (15:01 -0700)]
mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone()
balance_pgdat() is very long and some of the logic can and should be
internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:53 +0000 (15:01 -0700)]
mm: vmscan: check if kswapd should writepage once per pgdat scan
Currently kswapd checks if it should start writepage as it shrinks each
zone without taking into consideration if the zone is balanced or not.
This is not wrong as such but it does not make much sense either. This
patch checks once per pgdat scan if kswapd should be writing pages.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:51 +0000 (15:01 -0700)]
mm: vmscan: block kswapd if it is encountering pages under writeback
Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress. This made no sense as the failure
to make progress could be completely independent of IO. It was later
replaced by wait_iff_congested() and removed entirely by commit
258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().
This is problematic. If kswapd encounters many pages under writeback
and it continues to scan until it reaches the high watermark then it
will quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.
The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer
was unable to write to the underlying BDI. kswapd bypasses the BDI
congestion as it sets PF_SWAPWRITE but even if this was taken into
account then it would cause direct reclaimers to stall on writeback
which is not desirable.
This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:50 +0000 (15:01 -0700)]
mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority
Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU
and the zone watermark which is no indication as to whether kswapd
should write pages or not.
This patch tracks if an excessive number of unqueued dirty pages are
being encountered at the end of the LRU. If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:48 +0000 (15:01 -0700)]
mm: vmscan: do not allow kswapd to scan at maximum priority
Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0
quite easily if it is encountering a large number of pages it cannot
reclaim such as pages under writeback. When this happens, kswapd
reclaims very aggressively even though there may be no real risk of
allocation failure or OOM.
This patch prevents kswapd reaching priority 0 and trying to reclaim the
world. Direct reclaimers will still reach priority 0 in the event of an
OOM situation.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:47 +0000 (15:01 -0700)]
mm: vmscan: decide whether to compact the pgdat based on reclaim progress
In the past, kswapd makes a decision on whether to compact memory after
the pgdat was considered balanced. This more or less worked but it is
late to make such a decision and does not fit well now that kswapd makes
a decision whether to exit the zone scanning loop depending on reclaim
progress.
This patch will compact a pgdat if at least the requested number of
pages were reclaimed from unbalanced zones for a given priority. If any
zone is currently balanced, kswapd will not call compaction as it is
expected the necessary pages are already available.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:45 +0000 (15:01 -0700)]
mm: vmscan: flatten kswapd priority loop
kswapd stops raising the scanning priority when at least
SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
balanced. It then rechecks if it needs to restart at DEF_PRIORITY and
whether high-order reclaim needs to be reset. This is not wrong per-se
but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
may require several restarts before it has scanned enough pages to meet
the high watermark even at 100% efficiency. This patch irons out the
logic a bit by controlling when priority is raised and removing the
"goto loop_again".
This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not
raise the scanning prioirty higher unless it is failing to reclaim any
pages.
To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:44 +0000 (15:01 -0700)]
mm: vmscan: obey proportional scanning requirements for kswapd
Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.
This patch preserves the proportional scanning and reclaim. It does
mean that kswapd will reclaim more than requested but the number of
pages will be related to the high watermark.
[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
[kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
[hannes@cmpxchg.org: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 Jul 2013 22:01:42 +0000 (15:01 -0700)]
mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
- Drop the slab shrink changes in light of Glaubers series and
discussions highlighted that there were a number of potential
problems with the patch. (mel)
- Rebased to 3.10-rc1
Changelog since V2
- Preserve ratio properly for proportional scanning (kamezawa)
Changelog since V1
- Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
- Reformat comment in shrink_page_list (andi)
- Clarify some comments (dhillf)
- Rework how the proportional scanning is preserved
- Add PageReclaim check before kswapd starts writeback
- Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time. Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts. In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area. One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap. Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed. In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on. There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it. It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
is not necessarily related to dirty pages. It will have kswapd
writeback pages if a number of unqueued dirty pages have been
recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times. It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress. The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M
1529984.00 ( 0.00%)
1530235.00 ( -0.02%)
Ops minorfaults-715M
1794168.00 ( 0.00%)
1613750.00 ( 10.06%)
Ops minorfaults-2385M
1739813.00 ( 0.00%)
1609396.00 ( 7.50%)
Ops minorfaults-4055M
1754460.00 ( 0.00%)
1614810.00 ( 7.96%)
Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background. This drop in performance is part of
what users complain of when they start backups. Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely. With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged. Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted. The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration. So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground. There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
22K/sec to just under 4K/second when there is 2385M of IO going
on in the background. This is one type of performance collapse
users complain about if a large cp or backup starts in the
background
io-duration refers to how long it takes for the background IO to
complete. It's showing that with the patched kernel that the IO
completes faster while not interfering with the memcache
workload
swaptotal is the total amount of swap traffic. With the patched kernel,
the total amount of swapping is much reduced although it is
still not zero.
swapin in this case is an indication as to whether we are swap trashing.
The closer the swapin/swapout ratio is to 1, the worse the
trashing is. Note with the patched kernel that there is no swapin
activity indicating that all the pages swapped were really inactive
unused pages.
minorfaults are just minor faults. An increased number of minor faults
can indicate that page reclaim is unmapping the pages but not
swapping them out before they are faulted back in. With the
patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
number can indicate that a workload is being prematurely
swapped. With the patched kernel, major faults are much reduced. As
there are no swapin's recorded so it's not being swapped. The likely
explanation is that that libraries or configuration files used by
the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Page Ins
1234608 101892
Page Outs
12446272 11810468
Swap Ins 283406 0
Swap Outs 698469 27882
Direct pages scanned 0 136480
Kswapd pages scanned
6266537 5369364
Kswapd pages reclaimed
1088989 930832
Direct pages reclaimed 0 120901
Kswapd efficiency 17% 17%
Kswapd velocity 5398.371 4635.115
Direct efficiency 100% 88%
Direct velocity 0.000 117.817
Percentage direct scans 0% 2%
Page writes by reclaim
1655843 4009929
Page writes file 957374
3982047
Page writes anon 698469 27882
Page reclaim immediate 5245 1745
Page rescued immediate 0 0
Slabs scanned 33664 25216
Direct inode steals 0 0
Kswapd inode steals 19409 778
Kswapd skipped wait 0 0
THP fault alloc 35 30
THP collapse alloc 472 401
THP splits 27 22
THP fault fallback 0 0
THP collapse fail 0 1
Compaction stalls 0 4
Compaction success 0 0
Compaction failures 0 4
Page migrate success 0 0
Page migrate failure 0 0
Compaction pages isolated 0 0
Compaction migrate scanned 0 0
Compaction free scanned 0 0
Compaction cost 0 0
NUMA PTE updates 0 0
NUMA hint faults 0 0
NUMA hint local faults 0 0
NUMA pages migrated 0 0
AutoNUMA cost 0 0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world. ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
23 tclsh-3367
38 memcachetest-13733
49 memcachetest-12443
57 tee-3368
1541 dd-13826
1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage. There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU. This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX. The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed. It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:41 +0000 (15:01 -0700)]
mm/page_alloc: don't re-init pageset in zone_pcp_update()
When memory hotplug is triggered, we call pageset_init() on
per-cpu-pagesets which both contain pages and are in use, causing both the
leakage of those pages and (potentially) bad behaviour if a page is
allocated from a pageset while it is being cleared.
Avoid this by factoring out pageset_set_high_and_batch() (which contains
all needed logic too set a pageset's ->high and ->batch inrespective of
system state) from zone_pageset_init() and using the new
pageset_set_high_and_batch() instead of zone_pageset_init() in
zone_pcp_update().
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:40 +0000 (15:01 -0700)]
mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:39 +0000 (15:01 -0700)]
mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0.
Correct this by calling zone_pageset_init(), which chooses the
appropriate ->batch and ->high calculations.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:38 +0000 (15:01 -0700)]
mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:36 +0000 (15:01 -0700)]
mm/page_alloc: relocate comment to be directly above code it refers to.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:35 +0000 (15:01 -0700)]
mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:34 +0000 (15:01 -0700)]
mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high
Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:33 +0000 (15:01 -0700)]
mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()
zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.
This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).
This does change the behavior of zone_pcp_update() as the percpu pagesets
will not be drained when zone_pcp_update() is called (they will end up
being shrunk, not completely drained, later when a 0-order page is freed
in free_hot_cold_page()).
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:32 +0000 (15:01 -0700)]
mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE
pcp->batch could change at any point, avoid relying on it being a stable
value.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:31 +0000 (15:01 -0700)]
mm/page_alloc: insert memory barriers to allow async update of pcp batch and high
Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.
This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch. It avoids ->batch ever rising above ->high.
Suggested by Gilad Ben-Yossef in these threads:
https://lkml.org/lkml/2013/4/9/23
https://lkml.org/lkml/2013/4/10/49
Also reproduces his proposed comment.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:29 +0000 (15:01 -0700)]
mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transision between old and new
->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.
Add a simple mutex to protect both update loops.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cody P Schafer [Wed, 3 Jul 2013 22:01:28 +0000 (15:01 -0700)]
mm/page_alloc: factor out setting of pcp->high and pcp->batch
"Problems" with the current code:
1: there is a lack of synchronization in setting ->high and ->batch in
percpu_pagelist_fraction_sysctl_handler()
2: stop_machine() in zone_pcp_update() is unnecissary.
3: zone_pcp_update() does not consider the case where
percpu_pagelist_fraction is non-zero
To fix:
1: add memory barriers, a safe ->batch value, an update side mutex when
updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
that expect a stable value.
2: avoid draining pages in zone_pcp_update(), rely upon the memory
barriers added to fix #1
3: factor out quite a few functions, and then call the appropriate one.
Note that it results in a change to the behavior of zone_pcp_update(),
which is used by memory_hotplug. I'm rather certain that I've diserned
(and preserved) the essential behavior (changing ->high and ->batch), and
only eliminated unneeded actions (draining the per cpu pages), but this
may not be the case.
Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to
offline a page, and after the offline has "succeeded". It appears that
the draining was added to zone_pcp_update() to avoid refactoring
setup_pageset() into 2 funtions.
This patch:
Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high based
on ->batch.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Libin [Wed, 3 Jul 2013 22:01:27 +0000 (15:01 -0700)]
uio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.
Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Libin [Wed, 3 Jul 2013 22:01:26 +0000 (15:01 -0700)]
ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.
Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Libin [Wed, 3 Jul 2013 22:01:26 +0000 (15:01 -0700)]
mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.
Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 Jul 2013 22:01:24 +0000 (15:01 -0700)]
mm: remove compressed copy from zram in-memory
Swap subsystem does lazy swap slot free with expecting the page would be
swapped out again so we can avoid unnecessary write.
But the problem in in-memory swap(ex, zram) is that it consumes memory
space until vm_swap_full(ie, used half of all of swap device) condition
meet. It could be bad if we use multiple swap device, small in-memory
swap and big storage swap or in-memory swap alone.
This patch makes swap subsystem free swap slot as soon as swap-read is
completed and make the swapcache page dirty so the page should be
written out the swap device to reclaim it. It means we never lose it.
I tested this patch with kernel compile workload.
1. before
compile time : 9882.42
zram max wasted space by fragmentation:
13471881 byte
memory space consumed by zram:
174227456 byte
the number of slot free notify: 206684
2. after
compile time : 9653.90
zram max wasted space by fragmentation:
11805932 byte
memory space consumed by zram:
154001408 byte
the number of slot free notify: 426972
[akpm@linux-foundation.org: tweak comment text]
[artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()]
[akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 3 Jul 2013 22:01:23 +0000 (15:01 -0700)]
mm, memcg: don't take task_lock in task_in_mem_cgroup
For processes that have detached their mm's, task_in_mem_cgroup()
unnecessarily takes task_lock() when rcu_read_lock() is all that is
necessary to call mem_cgroup_from_task().
While we're here, switch task_in_mem_cgroup() to return bool.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 3 Jul 2013 22:01:22 +0000 (15:01 -0700)]
pagemap: prepare to reuse constant bits with page-shift
In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits
55-60 are about to change in a couple of releases. Next, if a user
issues soft-dirty clear command via the clear_refs file (it was disabled
before v3.9) we assume that he's aware of the new pagemap format, note
that fact and report the bits in pagemap in the new manner.
The "migration strategy" looks like this then:
1. existing users are not affected -- they don't touch soft-dirty feature, thus
see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
this trick with clear-soft-dirty affecting pagemap format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 3 Jul 2013 22:01:20 +0000 (15:01 -0700)]
mm: soft-dirty bits for user memory changes tracking
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
2. Wait some time.
3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast. This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.
Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 3 Jul 2013 22:01:19 +0000 (15:01 -0700)]
pagemap: introduce pagemap_entry_t without pmshift bits
These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry. Moreover, in next patch we will need to report one more bit
in the pagemap, but all bits are already busy on it.
That said, describe the pagemap entry that has 6 more free zero bits.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 3 Jul 2013 22:01:18 +0000 (15:01 -0700)]
clear_refs: introduce private struct for mm_walk
In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry
this info.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 3 Jul 2013 22:01:16 +0000 (15:01 -0700)]
clear_refs: sanitize accepted commands declaration
This is the implementation of the soft-dirty bit concept that should
help keep track of changes in user memory, which in turn is very-very
required by the checkpoint-restore project (http://criu.org).
To create a dump of an application(s) we save all the information about
it to files, and the biggest part of such dump is the contents of tasks'
memory. However, there are usage scenarios where it's not required to
get _all_ the task memory while creating a dump. For example, when
doing periodical dumps, it's only required to take full memory dump only
at the first step and then take incremental changes of memory. Another
example is live migration. We copy all the memory to the destination
node without stopping all tasks, then stop them, check for what pages
has changed, dump it and the rest of the state, then copy it to the
destination node. This decreases freeze time significantly.
That said, some help from kernel to watch how processes modify the
contents of their memory is required.
The proposal is to track changes with the help of new soft-dirty bit
this way:
1. First do "echo 4 > /proc/$pid/clear_refs".
At that point kernel clears the soft dirty _and_ the writable bits from all
ptes of process $pid. From now on every write to any page will result in #pf
and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
the soft dirty flag.
2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
(the 55'th one). If set, the respective pte was written to since last call
to clear refs.
The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by
kmemcheck, the latter one marks kernel pages with it, while the former
bit is put on user pages so they do not conflict to each other.
This patch:
A new clear-refs type will be added in the next patch, so prepare
code for that.
[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 3 Jul 2013 22:01:15 +0000 (15:01 -0700)]
crypto: sanitize argument for format string
The template lookup interface does not provide a way to use format
strings, so make sure that the interface cannot be abused accidentally.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 3 Jul 2013 22:01:14 +0000 (15:01 -0700)]
block: do not pass disk names as format strings
Disk names may contain arbitrary strings, so they must not be
interpreted as format strings. It seems that only md allows arbitrary
strings to be used for disk names, but this could allow for a local
memory corruption from uid 0 into ring 0.
CVE-2013-2851
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonathan Salwan [Wed, 3 Jul 2013 22:01:13 +0000 (15:01 -0700)]
drivers/cdrom/cdrom.c: use kzalloc() for failing hardware
In drivers/cdrom/cdrom.c mmc_ioctl_cdrom_read_data() allocates a memory
area with kmalloc in line 2885.
2885 cgc->buffer = kmalloc(blocksize, GFP_KERNEL);
2886 if (cgc->buffer == NULL)
2887 return -ENOMEM;
In line 2908 we can find the copy_to_user function:
2908 if (!ret && copy_to_user(arg, cgc->buffer, blocksize))
The cgc->buffer is never cleaned and initialized before this function.
If ret = 0 with the previous basic block, it's possible to display some
memory bytes in kernel space from userspace.
When we read a block from the disk it normally fills the ->buffer but if
the drive is malfunctioning there is a chance that it would only be
partially filled. The result is an leak information to userspace.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cong Wang [Wed, 3 Jul 2013 22:01:12 +0000 (15:01 -0700)]
block/compat_ioctl.c: do not leak info to user-space
There is a hole in struct hd_geometry, so we have to zero the struct on
stack before copying it to user-space.
Signed-off-by: Cong Wang <amwang@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Libo Chen [Wed, 3 Jul 2013 22:01:11 +0000 (15:01 -0700)]
drivers/cdrom/gdrom.c: fix device number leak
Without this patch, gdrom_major will leak when gd.cd_info alloc fails.
Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Wed, 3 Jul 2013 22:01:10 +0000 (15:01 -0700)]
ocfs2: fix NULL pointer dereference when traversing o2hb_all_regions
There may exist NULL pointer dereference in config_item_name() when one
volume (say Volume A) unmounts while another (say Volume B) mounting.
Volume A Volume B
already Mounted.
Unmounting, call
o2hb_heartbeat_group_drop_item()
-> config_item_put(item)
set reg(A)->item.ci_name to NULL
in function config_item_cleanup().
begin mounting, call
o2hb_region_pin() and tranverse all
regions. When reading
reg(A)->item.ci_name, it causes
NULL pointer dereference.
call o2hb_region_release() and
del reg(A) from list.
So we should skip accessing regions that is going to release when
tranverse o2hb_all_regions.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce <xuejiufei@huawei.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 3 Jul 2013 22:01:08 +0000 (15:01 -0700)]
ocfs2: adjust switch_case syntax at o2net_state_change()
Adjust switch..case syntax at o2net_state_change to meet the kernel coding
standard.
s/printk/pr_info/.
[akpm@linux-foundation.org: revert pr_foo() change]
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Gurudas Pai <gurudas.pai@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Tao Ma <tm@tao.ma>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 3 Jul 2013 22:01:07 +0000 (15:01 -0700)]
ocfs2: fix a comments typo at o2quo_hb_still_up()
Fix a comment typo in o2quo_hb_still_up()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Gurudas Pai <gurudas.pai@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Tao Ma <tm@tao.ma>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 3 Jul 2013 22:01:06 +0000 (15:01 -0700)]
ocfs2: consolidate o2hb_global_hearbeat_mode_set() naming convention
s/o2hb_global_hearbeat_mode_set/o2hb_global_heartbeat_mode_set/ to make
the signature of those routines in a consistent manner with others for
heartbeating.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Gurudas Pai <gurudas.pai@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
Cc: Tao Ma <tm@tao.ma>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Noboru Iwamatsu [Wed, 3 Jul 2013 22:01:04 +0000 (15:01 -0700)]
ocfs2: submit disk heartbeat bio using WRITE_SYNC
Under heavy I/O load, writing the disk heartbeat can be forced to wait for
minutes, and this causes the node to be fenced.
This patch tries to use WRITE_SYNC in submitting the heartbeat bio, so
that writing the heartbeat will have a priority over other requests.
Signed-off-by: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
Acked-by: Tao Ma <tm@tao.ma>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Wed, 3 Jul 2013 22:01:03 +0000 (15:01 -0700)]
ocfs2: xattr: fix inlined xattr reflink
Inlined xattr shared free space of inode block with inlined data or data
extent record, so the size of the later two should be adjusted when
inlined xattr is enabled. See ocfs2_xattr_ibody_init(). But this isn't
done well when reflink. For inode with inlined data, its max inlined
data size is adjusted in ocfs2_duplicate_inline_data(), no problem. But
for inode with data extent record, its record count isn't adjusted. Fix
it, or data extent record and inlined xattr may overwrite each other,
then cause data corruption or xattr failure.
One panic caused by this bug in our test environment is the following:
kernel BUG at fs/ocfs2/xattr.c:1435!
invalid opcode: 0000 [#1] SMP
Pid: 10871, comm: multi_reflink_t Not tainted 2.6.39-300.17.1.el5uek #1
RIP: ocfs2_xa_offset_pointer+0x17/0x20 [ocfs2]
RSP: e02b:
ffff88007a587948 EFLAGS:
00010283
RAX:
0000000000000000 RBX:
0000000000000010 RCX:
00000000000051e4
RDX:
ffff880057092060 RSI:
0000000000000f80 RDI:
ffff88007a587a68
RBP:
ffff88007a587948 R08:
00000000000062f4 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000010
R13:
ffff88007a587a68 R14:
0000000000000001 R15:
ffff88007a587c68
FS:
00007fccff7f06e0(0000) GS:
ffff88007fc00000(0000) knlGS:
0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00000000015cf000 CR3:
000000007aa76000 CR4:
0000000000000660
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Process multi_reflink_t
Call Trace:
ocfs2_xa_reuse_entry+0x60/0x280 [ocfs2]
ocfs2_xa_prepare_entry+0x17e/0x2a0 [ocfs2]
ocfs2_xa_set+0xcc/0x250 [ocfs2]
ocfs2_xattr_ibody_set+0x98/0x230 [ocfs2]
__ocfs2_xattr_set_handle+0x4f/0x700 [ocfs2]
ocfs2_xattr_set+0x6c6/0x890 [ocfs2]
ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
generic_setxattr+0x70/0x90
__vfs_setxattr_noperm+0x80/0x1a0
vfs_setxattr+0xa9/0xb0
setxattr+0xc3/0x120
sys_fsetxattr+0xa8/0xd0
system_call_fastpath+0x16/0x1b
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 3 Jul 2013 22:01:01 +0000 (15:01 -0700)]
ocfs2: fix readonly issue in ocfs2_unlink()
While deleting a file with ocfs2_unlink(), there is a bug in this
function. This bug will result in filesystem read-only.
After calling ocfs2_orphan_add(), the file which will be deleted is
added into orphan dir. If ocfs2_delete_entry() fails, the file still
exists in the parent dir. And this scenario introduces a conflict of
metadata.
If a file is added into orphan dir, when we put inode of the file with
iput(), the inode i_flags is setted (~OCFS2_VALID_FL) in
ocfs2_remove_inode(), and then write back to disk.
But as previously mentioned, the file still exists in the parent dir.
On other nodes, the file can be still accessed. When first read the
file with ocfs2_read_blocks() from disk, It will check and avalidate
inode using ocfs2_validate_inode_block(). So File system will be
readonly because the inode is invalid. In other words, the inode
i_flags has been set (~OCFS2_VALID_FL).
[akpm@linux-foundation.org: cleanups]
[jeff.liu@oracle.com: s/inode_is_unlinkable/ocfs2_inode_is_unlinkable/]
Signed-off-by: Younger Liu <younger.liu@huawei.com>
Signed-off-by: Jensen <shencanquan@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 3 Jul 2013 22:01:00 +0000 (15:01 -0700)]
ocfs2: remove duplicated mlog_errno() in ocfs2_relink_block_group
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Younger Liu <younger.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 3 Jul 2013 22:00:59 +0000 (15:00 -0700)]
ocfs2: rework transaction rollback in ocfs2_relink_block_group()
In ocfs2_relink_block_group(), we roll back all those changes if notify
intent to modify buffers for metadata update failed even if the relevant
buffer has not yet been modified/got dirty at that point, that are not
quite right because of:
- None buffer has been modified/dirty if failed to call
ocfs2_journal_access_gd() against the previous block group buffer
- Only the previous block group buffer has got dirty if failed to call
ocfs2_journal_access_gd() against the block group buffer
- There is no need to roll back the change for file entry buffer at all
Those problems will not cause anything wrong but unnecessary. This
patch fix them and kill the useless bg_ptr variable as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Younger Liu <younger.liu@huawei.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 3 Jul 2013 22:00:58 +0000 (15:00 -0700)]
ocfs2: need rollback when journal_access failed in ocfs2_orphan_add()
While adding a file into orphan dir in ocfs2_orphan_add(), it calls
__ocfs2_add_entry() before ocfs2_journal_access_di(). If
ocfs2_journal_access_di() failed, the file is added into orphan dir, and
orphan dir dinode updated, but file dinode has not been updated.
Accordingly, the data is not consistent between file dinode and orphan
dir.
So, need to call ocfs2_journal_access_di() before __ocfs2_add_entry(),
and if ocfs2_journal_access_di() failed, orphan_fe and
orphan_dir_inode->i_nlink need rollback.
This bug was added by
3939fda4 ("Ocfs2: Journaling i_flags and
i_orphaned_slot when adding inode to orphan dir.").
Signed-off-by: Younger Liu <younger.liu@huawei.com>
Acked-by: Jeff Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Wed, 3 Jul 2013 22:00:56 +0000 (15:00 -0700)]
ocfs2: dlmlock_master() should return DLM_NORMAL after adding lock to blocked list
dlmlock_master() returns DLM_RECOVERING/DLM_MIGRATING/ DLM_FORWAR after
adding lock to blocked list if lockres has the state
DLM_LOCK_RES_RECOVERING/DLM_LOCK_RES_MIGRATING/ DLM_LOCK_RES_IN_PROGRESS.
so it will retry in dlmlock(). And this may cause dlm_thread fall into an
infinite loop
Thread1 dlm_thread
calls dlm_lock->dlmlock_master,
if lockresA is in state
DLM_LOCK_RES_RECOVERING, calls
__dlm_wait_on_lockres() and waits
until others threads clear this
state;
If cannot grant this lock,
adding lock to blocked list,
and return DLM_RECOVERING;
Grant this lock and move it to
grant list;
After a while, retry and
calls list_add_tail(), adding lock
to blocked list again.
Granted and blocked list of this lockres will become the following
conditions:
lock_res->granted.next = dlm_lock->list_head;
lock_res->blocked.next = dlm_lock->list_head;
dlm_lock->list_head.next = dlm_lock_resource->blocked;
When dlm_thread traverses the granted list, it will fall into an endless
loop, checking dlm_lock.list_head, dlm_lock->list_head.next
(i.e.lock_res->blocked), lock_res->blocked.next(i.e.dlm_lock.list_head
again) .....
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: jensen <shencanquan@huawei.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Wed, 3 Jul 2013 22:00:55 +0000 (15:00 -0700)]
ocfs2: xattr: remove useless free space checking
Free space checking will be done in ocfs2_xattr_ibody_init(). So remove
here.
[akpm@linux-foundation.org: remove unused local]
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 3 Jul 2013 22:00:54 +0000 (15:00 -0700)]
fs/ocfs2/cluster/tcp.c: free sc->sc_page in sc_kref_release()
There is a memory leak in sc_kref_release(). When free struct
o2net_sock_container (sc), we should release sc->sc_page.
Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Goldwyn Rodrigues [Wed, 3 Jul 2013 22:00:53 +0000 (15:00 -0700)]
fs/ocfs2/journal.h: add bits_wanted while calculating credits in ocfs2_calc_extend_credits
While adding extends to a file, the credits are calculated incorrectly
and if the requested clusters is more than one (or more because we used
a conservative limit) then we run out of journal credits and we hit an
assert in journalling code.
The function parameter bits_wanted variable was not used at all.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 3 Jul 2013 22:00:52 +0000 (15:00 -0700)]
ocfs2: fix mutex_unlock and possible memory leak in ocfs2_remove_btree_range
In ocfs2_remove_btree_range, when calling ocfs2_lock_refcount_tree and
ocfs2_prepare_refcount_change_for_del failed, it goes to out and then
tries to call mutex_unlock without mutex_lock before. And when calling
ocfs2_reserve_blocks_for_rec_trunc failed, it should free ref_tree
before return.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Goldwyn Rodrigues [Wed, 3 Jul 2013 22:00:51 +0000 (15:00 -0700)]
ocfs2: remove unecessary variable needs_checkpoint
Code cleanup: needs_checkpoint is assigned to but never used. Delete
the variable.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Wed, 3 Jul 2013 22:00:50 +0000 (15:00 -0700)]
ocfs2: add missing dlm_put() in dlm_begin_reco_handler()
dlm_begin_reco_handler() returns without putting dlm when dlm recovery
state is DLM_RECO_STATE_FINALIZE.
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 3 Jul 2013 22:00:48 +0000 (15:00 -0700)]
ocfs2: should not use le32_add_cpu to set ocfs2_dinode i_flags
If we use le32_add_cpu to set ocfs2_dinode i_flags, it may lead to the
corresponding flag corrupted. So we should change it to bitwise and/or
operation.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: shencanquan <shencanquan@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 3 Jul 2013 22:00:47 +0000 (15:00 -0700)]
fs/ocfs2/dlm/dlmrecovery.c:dlm_request_all_locks(): ret should be int instead of enum
In dlm_request_all_locks, ret is type enum. But o2net_send_message
returns a type int value. Then it will never run into the following
error branch. So we should change the ret type from enum to int.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 3 Jul 2013 22:00:46 +0000 (15:00 -0700)]
fs/ocfs2/dlm/dlmrecovery.c: remove duplicate declarations
Below 3 functions have already been declared in dlmcommon.h, so we have
no need to declare them again in dlmrecovery.c:
dlm_complete_recovery_thread
dlm_launch_recovery_thread
dlm_kick_recovery_thread
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Wed, 3 Jul 2013 22:00:45 +0000 (15:00 -0700)]
configfs: use capped length for ->store_attribute()
The difference between "count" and "len" is that "len" is capped at
4095. Changing it like this makes it match how sysfs_write_file() is
implemented.
This is a static analysis patch. I haven't found any store_attribute()
functions where this change makes a difference.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 3 Jul 2013 22:00:44 +0000 (15:00 -0700)]
sound/soc/codecs/si476x.c: don't use 0bNNN
spacr64 gcc-3.4.5 (at least) spits this back.
Cc: Andrey Smirnov <andrey.smirnov@convergeddevices.net>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bartlomiej Zolnierkiewicz [Wed, 3 Jul 2013 22:00:43 +0000 (15:00 -0700)]
drivers/dma/pl330.c: fix locking in pl330_free_chan_resources()
tasklet_kill() may sleep so call it before taking pch->lock.
Fixes following lockup:
BUG: scheduling while atomic: cat/2383/0x00000002
Modules linked in:
unwind_backtrace+0x0/0xfc
__schedule_bug+0x4c/0x58
__schedule+0x690/0x6e0
sys_sched_yield+0x70/0x78
tasklet_kill+0x34/0x8c
pl330_free_chan_resources+0x24/0x88
dma_chan_put+0x4c/0x50
[...]
BUG: spinlock lockup suspected on CPU#0, swapper/0/0
lock: 0xe52aa04c, .magic:
dead4ead, .owner: cat/2383, .owner_cpu: 1
unwind_backtrace+0x0/0xfc
do_raw_spin_lock+0x194/0x204
_raw_spin_lock_irqsave+0x20/0x28
pl330_tasklet+0x2c/0x5a8
tasklet_action+0xfc/0x114
__do_softirq+0xe4/0x19c
irq_exit+0x98/0x9c
handle_IPI+0x124/0x16c
gic_handle_irq+0x64/0x68
__irq_svc+0x40/0x70
cpuidle_wrap_enter+0x4c/0xa0
cpuidle_enter_state+0x18/0x68
cpuidle_idle_call+0xac/0xe0
cpu_idle+0xac/0xf0
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Jassi Brar <jassisinghbrar@gmail.com>
Cc: Vinod Koul <vinod.koul@linux.intel.com>
Cc: Tomasz Figa <t.figa@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 3 Jul 2013 22:00:42 +0000 (15:00 -0700)]
arch: c6x: mm: include "asm/uaccess.h" to pass compiling
Need include "asm/uaccess.h" to pass compiling.
The related error (with allmodconfig):
arch/c6x/mm/init.c: In function `paging_init':
arch/c6x/mm/init.c:46:2: error: implicit declaration of function `set_fs' [-Werror=implicit-function-declaration]
arch/c6x/mm/init.c:46:9: error: `KERNEL_DS' undeclared (first use in this function)
arch/c6x/mm/init.c:46:9: note: each undeclared identifier is reported only once for each function it appears in
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: <stable@vger.kernel.org> [3.10.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 3 Jul 2013 22:00:41 +0000 (15:00 -0700)]
include/linux/smp.h:on_each_cpu(): switch back to a macro
Commit
f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in !SMP
version of on_each_cpu()") converted on_each_cpu() to a C function.
This required inclusion of irqflags.h, which broke ia64 and mn10300 (at
least) due to header ordering hell.
Switch on_each_cpu() back to a macro to fix this.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org> [3.10.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 3 Jul 2013 17:31:38 +0000 (10:31 -0700)]
Merge tag 'arm64-upstream' of git://git./linux/kernel/git/cmarinas/linux-aarch64
Pull ARM64 updates from Catalin Marinas:
"Main features:
- KVM and Xen ports to AArch64
- Hugetlbfs and transparent huge pages support for arm64
- Applied Micro X-Gene Kconfig entry and dts file
- Cache flushing improvements
For arm64 huge pages support, there are x86 changes moving part of
arch/x86/mm/hugetlbpage.c into mm/hugetlb.c to be re-used by arm64"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (66 commits)
arm64: Add initial DTS for APM X-Gene Storm SOC and APM Mustang board
arm64: Add defines for APM ARMv8 implementation
arm64: Enable APM X-Gene SOC family in the defconfig
arm64: Add Kconfig option for APM X-Gene SOC family
arm64/Makefile: provide vdso_install target
ARM64: mm: THP support.
ARM64: mm: Raise MAX_ORDER for 64KB pages and THP.
ARM64: mm: HugeTLB support.
ARM64: mm: Move PTE_PROT_NONE bit.
ARM64: mm: Make PAGE_NONE pages read only and no-execute.
ARM64: mm: Restore memblock limit when map_mem finished.
mm: thp: Correct the HPAGE_PMD_ORDER check.
x86: mm: Remove general hugetlb code from x86.
mm: hugetlb: Copy general hugetlb code from x86 to mm.
x86: mm: Remove x86 version of huge_pmd_share.
mm: hugetlb: Copy huge_pmd_share from x86 to mm.
arm64: KVM: document kernel object mappings in HYP
arm64: KVM: MAINTAINERS update
arm64: KVM: userspace API documentation
arm64: KVM: enable initialization of a 32bit vcpu
...
Linus Torvalds [Wed, 3 Jul 2013 16:46:29 +0000 (09:46 -0700)]
Merge branch 'for-linus' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM updates from Russell King:
"This contains the usual updates from other people (listed below) and
the usual random muddle of miscellaneous ARM updates which cover some
low priority bug fixes and performance improvements.
I've started to put the pull request wording into the merge commits,
which are:
- NoMMU stuff:
This includes the following series sent earlier to the list:
- nommu-fixes
- R7 Support
- MPU support
I've left out the ARCH_MULTIPLATFORM/!MMU stuff that Arnd and I
were discussing today until we've reached a conclusion/that's had
some more review.
This is rebased (and re-tested) on your devel-stable branch because
otherwise there were going to be conflicts with Uwe's V7M work now
that you've merged that. I've included the fix for limiting MPU to
CPU_V7.
- Huge page support
These changes bring both HugeTLB support and Transparent HugePage
(THP) support to ARM. Only long descriptors (LPAE) are supported
in this series.
The code has been tested on an Arndale board (Exynos 5250).
- LPAE updates
Please pull these miscellaneous LPAE fixes I've been collecting for
a while now for 3.11. They've been tested and reviewed by quite a
few people, and most of the patches are pretty trivial. -- Will Deacon.
- arch_timer cleanups
Please pull these arch_timer cleanups I've been holding onto for a
while. They're the same as my last posting, but have been rebased
to v3.10-rc3.
- mpidr linearisation (multiprocessor id register - identifies which
CPU number we are in the system)
This patch series that implements MPIDR linearization through a
simple hashing algorithm and updates current cpu_{suspend}/{resume}
code to use the newly created hash structures to retrieve context
pointers. It represents a stepping stone for the implementation of
power management code on forthcoming multi-cluster ARM systems.
It has been tested on TC2 (dual cluster A15xA7 system), iMX6q,
OMAP4 and Tegra, with processors hitting low-power states requiring
warm-boot resume through the cpu_resume code path"
* 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (77 commits)
ARM: 7775/1: mm: Remove do_sect_fault from LPAE code
ARM: 7777/1: Avoid extra calls to the C compiler
ARM: 7774/1: Fix dtb dependency to use order-only prerequisites
ARM: 7770/1: remove residual ARMv2 support from decompressor
ARM: 7769/1: Cortex-A15: fix erratum 798181 implementation
ARM: 7768/1: prevent risks of out-of-bound access in ASID allocator
ARM: 7767/1: let the ASID allocator handle suspended animation
ARM: 7766/1: versatile: don't mark pen as __INIT
ARM: 7765/1: perf: Record the user-mode PC in the call chain.
ARM: 7735/2: Preserve the user r/w register TPIDRURW on context switch and fork
ARM: kernel: implement stack pointer save array through MPIDR hashing
ARM: kernel: build MPIDR hash function data structure
ARM: mpu: Ensure that MPU depends on CPU_V7
ARM: mpu: protect the vectors page with an MPU region
ARM: mpu: Allow enabling of the MPU via kconfig
ARM: 7758/1: introduce config HAS_BANDGAP
ARM: 7757/1: mm: don't flush icache in switch_mm with hardware broadcasting
ARM: 7751/1: zImage: don't overwrite ourself with a page table
ARM: 7749/1: spinlock: retry trylock operation if strex fails on free lock
ARM: 7748/1: oabi: handle faults when loading swi instruction from userspace
...
Linus Torvalds [Wed, 3 Jul 2013 16:10:19 +0000 (09:10 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
Al Viro [Wed, 3 Jul 2013 12:19:23 +0000 (16:19 +0400)]
Document ->tmpfile()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 29 Jun 2013 09:23:08 +0000 (13:23 +0400)]
ext4: ->tmpfile() support
very similar to ext3 counterpart...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jie Liu [Tue, 25 Jun 2013 04:02:13 +0000 (12:02 +0800)]
vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().
To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.
Thanks Dave Chinner for this suggestion.
[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Wed, 3 Jul 2013 03:04:25 +0000 (20:04 -0700)]
Merge branch 'for-3.11-cpuset' of git://git./linux/kernel/git/tj/cgroup
Pull cpuset changes from Tejun Heo:
"cpuset has always been rather odd about its configurations - a cgroup
right after creation didn't allow any task executions before
configuration, changing configuration in the parent modifies the
descendants irreversibly and so on. These behaviors are inherently
nasty and almost hostile against sharing the hierarchy with other
controllers making it very difficult to use in unified hierarchy.
Li is currently in the process of updating the behaviors for
__DEVEL__sane_behavior which is the bulk of changes in this pull
request. It isn't complete yet and the behaviors will change further
but all changes are gated behind sane_behavior. In the process, the
rather hairy work-item punting which was used to work around the
limitations of cgroup descendant iterator was simplified."
* 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: rename @cont to @cgrp
cpuset: fix to migrate mm correctly in a corner case
cpuset: allow to move tasks to empty cpusets
cpuset: allow to keep tasks in empty cpusets
cpuset: introduce effective_{cpumask|nodemask}_cpuset()
cpuset: record old_mems_allowed in struct cpuset
cpuset: remove async hotplug propagation work
cpuset: let hotplug propagation work wait for task attaching
cpuset: re-structure update_cpumask() a bit
cpuset: remove cpuset_test_cpumask()
cpuset: remove unnecessary variable in cpuset_attach()
cpuset: cleanup guarantee_online_{cpus|mems}()
cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
Linus Torvalds [Wed, 3 Jul 2013 02:54:47 +0000 (19:54 -0700)]
Merge branch 'for-3.11' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
"This pull request contains the following changes.
- cgroup_subsys_state (css) reference counting has been converted to
percpu-ref. css is what each resource controller embeds into its
own control structure and perform reference count against. It may
be used in hot paths of various subsystems and is similar to module
refcnt in that aspect. For example, block-cgroup's css refcnting
was showing up a lot in Mikulaus's device-mapper scalability work
and this should alleviate it.
- cgroup subtree iterator has been updated so that RCU read lock can
be released after grabbing reference. This allows simplifying its
users which requires blocking which used to build iteration list
under RCU read lock and then traverse it outside. This pull
request contains simplification of cgroup core and device-cgroup.
A separate pull request will update cpuset.
- Fixes for various bugs including corner race conditions and RCU
usage bugs.
- A lot of cleanups and some prepartory work for the planned unified
hierarchy support."
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (48 commits)
cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy
cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options
cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()
cgroup: always use RCU accessors for protected accesses
cgroup: fix RCU accesses around task->cgroups
cgroup: fix RCU accesses to task->cgroups
cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()
cgroup: fix cgroupfs_root early destruction path
cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy
cgroup: implement for_each_[builtin_]subsys()
cgroup: move init_css_set initialization inside cgroup_mutex
cgroup: s/for_each_subsys()/for_each_root_subsys()/
cgroup: clean up find_css_set() and friends
cgroup: remove cgroup->actual_subsys_mask
cgroup: prefix global variables with "cgroup_"
cgroup: convert CFTYPE_* flags to enums
cgroup: rename cont to cgrp
cgroup: clean up cgroup_serial_nr_cursor
cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()
cgroup: make serial_nr_cursor available throughout cgroup.c
...
Linus Torvalds [Wed, 3 Jul 2013 02:53:30 +0000 (19:53 -0700)]
Merge branch 'for-3.11' of git://git./linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
"Surprisingly, Lai and I didn't break too many things implementing
custom pools and stuff last time around and there aren't any follow-up
changes necessary at this point.
The only change in this pull request is Viresh's patches to make some
per-cpu workqueues to behave as unbound workqueues dependent on a boot
param whose default can be configured via a config option. This leads
to higher processing overhead / lower bandwidth as more work items are
bounced across CPUs; however, it can lead to noticeable powersave in
certain configurations - ~10% w/ idlish constant workload on a
big.LITTLE configuration according to Viresh.
This is because per-cpu workqueues interfere with how the scheduler
perceives whether or not each CPU is idle by forcing pinned tasks on
them, which makes the scheduler's power-aware scheduling decisions
less effective.
Its effectiveness is likely less pronounced on homogenous
configurations and this type of optimization can probably be made
automatic; however, the changes are pretty minimal and the affected
workqueues are clearly marked, so it's an easy gain for some
configurations for the time being with pretty unintrusive changes."
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
fbcon: queue work on power efficient wq
block: queue work on power efficient wq
PHYLIB: queue work on system_power_efficient_wq
workqueue: Add system wide power_efficient workqueues
workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
Linus Torvalds [Wed, 3 Jul 2013 02:52:14 +0000 (19:52 -0700)]
Merge branch 'for-3.11' of git://git./linux/kernel/git/tj/percpu
Pull per-cpu changes from Tejun Heo:
"This pull request contains Kent's per-cpu reference counter. It has
gone through several iterations since the last time and the dynamic
allocation is gone.
The usual usage is relatively straight-forward although async kill
confirm interface, which is not used int most cases, is somewhat icky.
There also are some interface concerns - e.g. I'm not sure about
passing in @relesae callback during init as that becomes funny when we
later implement synchronous kill_and_drain - but nothing too serious
and it's quite useable now.
cgroup_subsys_state refcnting has already been converted and we should
convert module refcnt (Kent?)"
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu-refcount: use RCU-sched insted of normal RCU
percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
percpu-refcount: implement percpu_ref_cancel_init()
percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
percpu-refcount: cosmetic updates
percpu-refcount: consistently use plain (non-sched) RCU
percpu-refcount: Don't use silly cmpxchg()
percpu: implement generic percpu refcounting
Linus Torvalds [Tue, 2 Jul 2013 23:33:05 +0000 (16:33 -0700)]
Merge branch 'x86-uv-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 UV update from Ingo Molnar:
"There's a single commit in this tree, which adds support for a new SGI
UV GRU (Global Reference Unit - fast NUMA messaging ASIC) hardware
feature to scale up and beyond: an optional distributed mode that will
allow per-node address mapping of local GRU space, as opposed to
mapping all GRU hardware to the same contiguous high space"
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/UV: Add GRU distributed mode mappings
Linus Torvalds [Tue, 2 Jul 2013 23:31:49 +0000 (16:31 -0700)]
Merge branch 'x86-tracing-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 tracing updates from Ingo Molnar:
"This tree adds IRQ vector tracepoints that are named after the handler
and which output the vector #, based on a zero-overhead approach that
relies on changing the IDT entries, by Seiji Aguchi.
The new tracepoints look like this:
# perf list | grep -i irq_vector
irq_vectors:local_timer_entry [Tracepoint event]
irq_vectors:local_timer_exit [Tracepoint event]
irq_vectors:reschedule_entry [Tracepoint event]
irq_vectors:reschedule_exit [Tracepoint event]
irq_vectors:spurious_apic_entry [Tracepoint event]
irq_vectors:spurious_apic_exit [Tracepoint event]
irq_vectors:error_apic_entry [Tracepoint event]
irq_vectors:error_apic_exit [Tracepoint event]
[...]"
* 'x86-tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tracing: Add config option checking to the definitions of mce handlers
trace,x86: Do not call local_irq_save() in load_current_idt()
trace,x86: Move creation of irq tracepoints from apic.c to irq.c
x86, trace: Add irq vector tracepoints
x86: Rename variables for debugging
x86, trace: Introduce entering/exiting_irq()
tracing: Add DEFINE_EVENT_FN() macro
Linus Torvalds [Tue, 2 Jul 2013 23:30:46 +0000 (16:30 -0700)]
Merge branch 'x86-ras-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 RAS update from Ingo Molnar:
"The changes in this tree are:
- ACPI APEI (ACPI Platform Error Interface) improvements, by Chen
Gong
- misc MCE fixes/cleanups"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Update MCE severity condition check
mce: acpi/apei: Add comments to clarify usage of the various bitfields in the MCA subsystem
ACPI/APEI: Update einj documentation for param1/param2
ACPI/APEI: Add parameter check before error injection
ACPI, APEI, EINJ: Fix error return code in einj_init()
x86, mce: Fix "braodcast" typo
Linus Torvalds [Tue, 2 Jul 2013 23:29:48 +0000 (16:29 -0700)]
Merge branch 'x86-platform-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
"Two changes:
- A Kconfig dependency fix/cleanup
- Introduce the 'make kvmconfig' KVM configuration helper utility
that turns the current .config into a KVM-bootable config. Useful
for debugging specific native kernel configs that have no KVM
config options enabled on VM setups."
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform: Make X86_GOLDFISH depend on X86_EXTENDED_PLATFORM
x86/platform: Add kvmconfig to the phony targets
x86, platform, kvm, kconfig: Turn existing .config's into KVM-capable configs
Linus Torvalds [Tue, 2 Jul 2013 23:29:05 +0000 (16:29 -0700)]
Merge branch 'x86-mm-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
"Misc improvements:
- Fix /proc/mtrr reporting
- Fix ioremap printout
- Remove the unused pvclock fixmap entry on 32-bit
- misc cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioremap: Correct function name output
x86: Fix /proc/mtrr with base/size more than 44bits
ix86: Don't waste fixmap entries
x86/mm: Drop unneeded include <asm/*pgtable, page*_types.h>
x86_64: Correct phys_addr in cleanup_highmap comment
Linus Torvalds [Tue, 2 Jul 2013 23:28:10 +0000 (16:28 -0700)]
Merge branch 'x86-microcode-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 microcode loading update from Ingo Molnar:
"Two main changes that improve microcode loading on AMD CPUs:
- Add support for all-in-one binary microcode files that concatenate
the microcode images of multiple processor families, by Jacob Shin
- Add early microcode loading (embedded in the initrd) support, also
by Jacob Shin"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, amd: Another early loading fixup
x86, microcode, amd: Allow multiple families' bin files appended together
x86, microcode, amd: Make find_ucode_in_initrd() __init
x86, microcode, amd: Fix warnings and errors on with CONFIG_MICROCODE=m
x86, microcode, amd: Early microcode patch loading support for AMD
x86, microcode, amd: Refactor functions to prepare for early loading
x86, microcode: Vendor abstract out save_microcode_in_initrd()
x86, microcode, intel: Correct typo in printk
Linus Torvalds [Tue, 2 Jul 2013 23:26:44 +0000 (16:26 -0700)]
Merge branch 'x86-fpu-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 FPU changes from Ingo Molnar:
"There are two bigger changes in this tree:
- Add an [early-use-]safe static_cpu_has() variant and other
robustness improvements, including the new X86_DEBUG_STATIC_CPU_HAS
configurable debugging facility, motivated by recent obscure FPU
code bugs, by Borislav Petkov
- Reimplement FPU detection code in C and drop the old asm code, by
Peter Anvin."
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: Use static_cpu_has_safe before alternatives
x86: Add a static_cpu_has_safe variant
x86: Sanity-check static_cpu_has usage
x86, cpu: Add a synthetic, always true, cpu feature
x86: Get rid of ->hard_math and all the FPU asm fu
Linus Torvalds [Tue, 2 Jul 2013 23:25:50 +0000 (16:25 -0700)]
Merge branch 'x86-efi-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 EFI changes from Ingo Molnar:
"Two fixes that should in principle increase robustness of our
interaction with the EFI firmware, and a cleanup"
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: retry ExitBootServices() on failure
efi: Convert runtime services function ptrs
UEFI: Don't pass boot services regions to SetVirtualAddressMap()
Linus Torvalds [Tue, 2 Jul 2013 23:25:06 +0000 (16:25 -0700)]
Merge branch 'x86-debug-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 debug update from Ingo Molnar:
"Misc debuggability improvements:
- Optimize the x86 CPU register printout a bit
- Expose the tboot TXT log via debugfs
- Small do_debug() cleanup"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tboot: Provide debugfs interfaces to access TXT log
x86: Remove weird PTR_ERR() in do_debug
x86/debug: Only print out DR registers if they are not power-on defaults
Linus Torvalds [Tue, 2 Jul 2013 23:24:24 +0000 (16:24 -0700)]
Merge branch 'x86-cpu-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 cpu updates from Ingo Molnar:
"Two changes:
- Extend 32-bit double fault debugging aid to 64-bit
- Fix a build warning"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/cacheinfo: Shut up last long-standing warning
x86: Extend #DF debugging aid to 64-bit
Linus Torvalds [Tue, 2 Jul 2013 23:23:50 +0000 (16:23 -0700)]
Merge branch 'x86-cleanups-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc x86 cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, reloc: Use xorl instead of xorq in relocate_kernel_64.S
x86, cleanups: Remove extra tab in __flush_tlb_one()
x86/mce: Remove check for CONFIG_X86_MCE_P4THERMAL
Linus Torvalds [Tue, 2 Jul 2013 23:22:55 +0000 (16:22 -0700)]
Merge branch 'x86-boot-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 boot build fix from Ingo Molnar:
"Small fixlet for the build process"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Close opened file descriptor
Linus Torvalds [Tue, 2 Jul 2013 23:21:45 +0000 (16:21 -0700)]
Merge branch 'x86-asm-for-linus' of git://git./linux/kernel/git/tip/tip
Pull asm/x86 changes from Ingo Molnar:
"Misc changes, with a bigger processor-flags cleanup/reorganization by
Peter Anvin"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, asm, cleanup: Replace open-coded control register values with symbolic
x86, processor-flags: Fix the datatypes and add bit number defines
x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
linux/const.h: Add _BITUL() and _BITULL()
x86/vdso: Convert use of typedef ctl_table to struct ctl_table
x86: __force_order doesn't need to be an actual variable
Linus Torvalds [Tue, 2 Jul 2013 23:19:24 +0000 (16:19 -0700)]
Merge branch 'sched-mm-for-linus' of git://git./linux/kernel/git/tip/tip
Pull voluntary preemption fixes from Ingo Molnar:
"This tree contains a speedup which is achieved through better
might_sleep()/might_fault() preemption point annotations for uaccess
functions, by Michael S Tsirkin:
1. The only reason uaccess routines might sleep is if they fault.
Make this explicit for all architectures.
2. A voluntary preemption point in uaccess functions means compiler
can't inline them efficiently, this breaks assumptions that they
are very fast and small that e.g. net code seems to make. Remove
this preemption point so behaviour matches with what callers
assume.
3. Accesses (e.g through socket ops) to kernel memory with KERNEL_DS
like net/sunrpc does will never sleep. Remove an unconditinal
might_sleep() in the might_fault() inline in kernel.h (used when
PROVE_LOCKING is not set).
4. Accesses with pagefault_disable() return EFAULT but won't cause
caller to sleep. Check for that and thus avoid might_sleep() when
PROVE_LOCKING is set.
These changes offer a nice speedup for CONFIG_PREEMPT_VOLUNTARY=y
kernels, here's a network bandwidth measurement between a virtual
machine and the host:
before:
incoming: 7122.77 Mb/s
outgoing: 8480.37 Mb/s
after:
incoming: 8619.24 Mb/s [ +21.0% ]
outgoing: 9455.42 Mb/s [ +11.5% ]
I kept these changes in a separate tree, separate from scheduler
changes, because it's a mixed MM and scheduler topic"
* 'sched-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mm, sched: Allow uaccess in atomic with pagefault_disable()
mm, sched: Drop voluntary schedule from might_fault()
x86: uaccess s/might_sleep/might_fault/
tile: uaccess s/might_sleep/might_fault/
powerpc: uaccess s/might_sleep/might_fault/
mn10300: uaccess s/might_sleep/might_fault/
microblaze: uaccess s/might_sleep/might_fault/
m32r: uaccess s/might_sleep/might_fault/
frv: uaccess s/might_sleep/might_fault/
arm64: uaccess s/might_sleep/might_fault/
asm-generic: uaccess s/might_sleep/might_fault/