GitHub/MotorolaMobilityLLC/kernel-slsi.git
17 years ago mm: revert KERNEL_DS buffered write optimisation
Nick Piggin [Tue, 16 Oct 2007 08:24:53 +0000 (01:24 -0700)]
mm: revert KERNEL_DS buffered write optimisation

Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: use pagevec to rotate reclaimable page
Hisashi Hifumi [Tue, 16 Oct 2007 08:24:52 +0000 (01:24 -0700)]
mm: use pagevec to rotate reclaimable page

While running some memory intensive load, system response deteriorated just
after swap-out started.

The cause of this problem is that when a PG_reclaim page is moved to the tail
of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock
is acquired on every page writeback.  This degrades system performance and
lengthens interrupt hold-off time once swap-out has started.

The following patch solves this problem.  I use a pagevec when rotating
reclaimable pages to mitigate LRU spin lock contention and reduce interrupt
hold-off time.
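
The batching idea can be illustrated with a minimal user-space sketch.  It is
not the kernel code: the struct, the function names and the batch size of 14
are assumptions made for illustration, and the lock is only simulated by a
counter.

```c
#include <stddef.h>

#define BATCH_SIZE 14            /* assumed batch capacity, for illustration */

struct batch {
	size_t nr;               /* pages currently queued */
	int pages[BATCH_SIZE];   /* stand-in for struct page pointers */
};

static unsigned long lock_acquisitions; /* counts simulated lru_lock takes */

/* Move every queued page to the LRU tail under a single lock hold. */
static void batch_flush(struct batch *b)
{
	if (b->nr == 0)
		return;
	lock_acquisitions++;     /* one simulated lock/unlock pair per batch */
	b->nr = 0;               /* real code would move the pages here */
}

/* Queue one page; take the lock only when the batch is full. */
static void batch_add(struct batch *b, int page)
{
	b->pages[b->nr++] = page;
	if (b->nr == BATCH_SIZE)
		batch_flush(b);
}

/* Rotate nr_pages pages and report how many lock acquisitions it took. */
static unsigned long rotate_pages_batched(size_t nr_pages)
{
	struct batch b = { 0 };
	size_t i;

	lock_acquisitions = 0;
	for (i = 0; i < nr_pages; i++)
		batch_add(&b, (int)i);
	batch_flush(&b);         /* drain the partially filled batch */
	return lock_acquisitions;
}
```

With this sketch, rotating 28 pages costs two simulated lock acquisitions
instead of 28, which is the effect the patch relies on.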

I ran a test that allocates and touches pages in multiple processes while
flood-pinging the test machine, to measure response under memory-intensive
load.

The test result is:

-2.6.23-rc5
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms

-2.6.23-rc5-patched
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

Max round-trip-time was improved.

The test machine has 4 CPUs (3.16GHz, Hyper-Threading enabled), 8GB of memory
and 8GB of swap.

I ran the ping test again to observe any performance deterioration caused by
taking a ref.

-2.6.23-rc6-with-modifiedpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms

The result for my original patch is as follows.

-2.6.23-rc5-with-originalpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

The influence on response was small.

[akpm@linux-foundation.org: fix uninitialised var warning]
[hugh@veritas.com: fix locking]
[randy.dunlap@oracle.com: fix function declaration]
[hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
[hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
[hugh@veritas.com: move_tail_pages into lru_add_drain]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Lee Schermerhorn [Tue, 16 Oct 2007 08:24:51 +0000 (01:24 -0700)]
Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag

Allow an application to query the memories allowed by its context.

Updated numa_memory_policy.txt to mention that applications can use this to
obtain allowed memories for constructing valid policies.

TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g.,  numa_get_mems_allowed() ?

Also, update numa syscall man pages.

Tested with memtoy V>=0.13.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: prevent kswapd from freeing excessive amounts of lowmem
Rik van Riel [Tue, 16 Oct 2007 08:24:50 +0000 (01:24 -0700)]
mm: prevent kswapd from freeing excessive amounts of lowmem

The current VM can get itself into trouble fairly easily on systems with a
small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.

On one side, page_alloc() will allocate down to zone->pages_low, while on
the other side, kswapd() and balance_pgdat() will try to free memory from
every zone, until every zone has more free pages than zone->pages_high.

Highmem can be filled up to zone->pages_low with page tables, ramfs,
vmalloc allocations and other unswappable things quite easily and without
many bad side effects, since we still have a huge ZONE_NORMAL to do future
allocations from.

However, as long as the number of free pages in the highmem zone is below
zone->pages_high, kswapd will continue swapping things out from
ZONE_NORMAL, too!

Sami Farin managed to get his system into a state where kswapd had freed
about 700MB of low memory and was still "going strong".

The attached patch will make kswapd stop paging out data from zones when
there is more than enough memory free.  We do go above zone->pages_high in
order to keep pressure between zones equal in normal circumstances, but the
patch should prevent the kind of excesses that made Sami's computer totally
unusable.
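
The stopping rule described above can be sketched in user space as follows.
This is only an illustration: the struct, the function name and the multiplier
of 8 are assumptions, not values taken from this message.

```c
#include <stdbool.h>

#define EXCESS_FACTOR 8UL   /* hypothetical multiplier, for illustration only */

struct zone_sketch {
	unsigned long free_pages;   /* pages currently free in the zone */
	unsigned long pages_high;   /* the zone's high watermark */
};

/*
 * Return true if the background reclaimer should still scan this zone.
 * A zone whose free pages exceed a multiple of its high watermark is
 * treated as having "more than enough" free memory and is skipped.
 */
static bool zone_wants_reclaim(const struct zone_sketch *z)
{
	return z->free_pages <= EXCESS_FACTOR * z->pages_high;
}
```

Allowing free pages to rise above pages_high before skipping the zone mirrors
the stated goal of keeping pressure between zones roughly equal in normal
circumstances.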

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: no need to cast vmalloc() return value in zone_wait_table_init()
Jesper Juhl [Tue, 16 Oct 2007 08:24:49 +0000 (01:24 -0700)]
mm: no need to cast vmalloc() return value in zone_wait_table_init()

vmalloc() returns a void pointer, so there's no need to cast its
return value in mm/page_alloc.c::zone_wait_table_init().

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago fix the max path calculation in radix-tree.c
Jeff Moyer [Tue, 16 Oct 2007 08:24:49 +0000 (01:24 -0700)]
fix the max path calculation in radix-tree.c

A while back, Nick Piggin introduced a patch to reduce the node memory
usage for small files (commit cfd9b7df4abd3257c9e381b0e445817b26a51c0c):

-#define RADIX_TREE_MAP_SHIFT 6
+#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)

Unfortunately, he didn't take into account the fact that the
calculation of the maximum path was based on an assumption of having
to round up:

#define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2)

So, if CONFIG_BASE_SMALL is set, you will end up with a
RADIX_TREE_MAX_PATH that is one greater than necessary.  The practical
upshot of this is just a bit of wasted memory (one long in the
height_to_maxindex array, an extra pre-allocated radix tree node per
cpu, and extra stack usage in a couple of functions), but it seems
worth getting right.

It's also worth noting that I never build with CONFIG_BASE_SMALL.
What I did to test this was duplicate the code in a small user-space
program and check the results of the calculations for max path and the
contents of the height_to_maxindex array.
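
A user-space check along those lines might look like the sketch below.  It is
not the program used for testing; the function names are hypothetical, and
"needed" is taken to mean one table entry for each height from 0 up to
ceil(index bits / map shift), which is an assumption of this sketch.

```c
#include <limits.h>

/* Number of bits in a radix-tree index (an unsigned long). */
#define INDEX_BITS ((unsigned int)(CHAR_BIT * sizeof(unsigned long)))

/* The formula quoted above: integer division plus 2. */
static unsigned int max_path_old(unsigned int map_shift)
{
	return INDEX_BITS / map_shift + 2;
}

/* Entries needed for heights 0 .. ceil(INDEX_BITS / map_shift). */
static unsigned int max_path_needed(unsigned int map_shift)
{
	return (INDEX_BITS + map_shift - 1) / map_shift + 1;
}
```

With a shift of 6 the two functions agree, because the index width is not a
multiple of 6 and the division really does round down.  With a shift of 4 the
width divides evenly, so the old formula comes out one larger than needed.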

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago fs: fix nobh error handling
Nick Piggin [Tue, 16 Oct 2007 08:24:48 +0000 (01:24 -0700)]
fs: fix nobh error handling

nobh mode error handling is not just pretty slack, it's wrong.

One cannot zero out the whole page to ensure new blocks are zeroed, because
it just brings the whole page "uptodate" with zeroes even if that may not
be the correct uptodate data.  Also, other parts of the page may already
contain dirty data which would get lost by zeroing it out.  Thirdly, the
writeback of zeroes to the new blocks will also erase existing blocks.  All
these conditions are pagecache and/or filesystem corruption.

The problem comes about because we didn't keep track of which buffers
actually are new or old.  However it is not enough just to keep only this
state, because at the point we start dirtying parts of the page (new
blocks, with zeroes), the handling of IO errors becomes impossible without
buffers because the page may only be partially uptodate, in which case the
page flags alone cannot capture the state of the parts of the page.

So allocate all buffers for the page upfront, but leave them unattached so
that they don't pick up any other references and can be freed when we're
done.  If the error path is hit, then zero the new buffers as the regular
buffer path does, then attach the buffers to the page so that it can
actually be written out correctly and be subject to the normal IO error
handling paths.

As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: add end_buffer_read helper function
Dmitry Monakhov [Tue, 16 Oct 2007 08:24:47 +0000 (01:24 -0700)]
mm: add end_buffer_read helper function

Move duplicated code from end_buffer_read_XXX methods to separate helper
function.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Slab allocators: fail if ksize is called with a NULL parameter
Christoph Lameter [Tue, 16 Oct 2007 08:24:46 +0000 (01:24 -0700)]
Slab allocators: fail if ksize is called with a NULL parameter

A NULL pointer means that the object was not allocated.  One cannot
determine the size of an object that has not been allocated.  Currently we
return 0 but we really should BUG() on attempts to determine the size of
something nonexistent.

krealloc() interprets NULL to mean a zero sized object.  Handle that
separately in krealloc().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago calculation of pgoff in do_linear_fault() uses mixed units
Dean Nelson [Tue, 16 Oct 2007 08:24:45 +0000 (01:24 -0700)]
calculation of pgoff in do_linear_fault() uses mixed units

The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
PAGE_CACHE_SIZE.  At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
defined as PAGE_SHIFT, but should that ever change this calculation would
break.
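
The unit issue can be illustrated with a small user-space sketch.  The struct,
the function name and the 4 KiB page size are assumptions made for
illustration; only the shape of the calculation follows the description above.

```c
#define SKETCH_PAGE_SHIFT 12UL                      /* assumed 4 KiB pages */
#define SKETCH_PAGE_MASK  (~((1UL << SKETCH_PAGE_SHIFT) - 1))

struct vma_sketch {
	unsigned long vm_start;   /* first address of the mapping */
	unsigned long vm_pgoff;   /* file offset of vm_start, in PAGE_SIZE units */
};

/*
 * File page offset of a faulting address.  vm_pgoff counts PAGE_SIZE
 * units, so the byte distance from vm_start has to be shifted by the
 * page shift, not by a page-cache shift that might differ one day.
 */
static unsigned long linear_pgoff(const struct vma_sketch *vma,
				  unsigned long address)
{
	return (((address & SKETCH_PAGE_MASK) - vma->vm_start)
		>> SKETCH_PAGE_SHIFT) + vma->vm_pgoff;
}
```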

Signed-off-by: Dean Nelson <dcn@sgi.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago {slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check
Satyam Sharma [Tue, 16 Oct 2007 08:24:44 +0000 (01:24 -0700)]
{slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check

Since kfree(NULL) would normally occur only in error paths and
kfree(ZERO_SIZE_PTR) is uncommon as well, use unlikely() for the
condition check in SLUB's and SLOB's kfree() to optimize for the common
case.  SLAB has this already.
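
A minimal user-space sketch of the pattern, assuming a GCC or Clang compiler
for __builtin_expect.  The SKETCH_ macros and free_sketch() are hypothetical
stand-ins, and the sentinel value 16 is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Branch hint: tell the compiler the condition is expected to be false. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Assumed sentinel for zero-sized allocations (illustrative value). */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* True for NULL and for the zero-size sentinel. */
#define SKETCH_ZERO_OR_NULL_PTR(p) \
	((uintptr_t)(p) <= (uintptr_t)SKETCH_ZERO_SIZE_PTR)

/* Returns 1 if memory was actually released, 0 if there was nothing to do. */
static int free_sketch(void *p)
{
	if (unlikely(SKETCH_ZERO_OR_NULL_PTR(p)))
		return 0;        /* rare path: nothing was allocated */
	free(p);                 /* common path, laid out as the fall-through */
	return 1;
}
```

The hint changes only code layout and static branch prediction, not behaviour.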

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago move mm_struct and vm_area_struct
Martin Schwidefsky [Tue, 16 Oct 2007 08:24:43 +0000 (01:24 -0700)]
move mm_struct and vm_area_struct

Move the definitions of struct mm_struct and struct vm_area_struct to
include/linux/mm_types.h.  This allows more functions in asm/pgtable.h and
friends to be defined with inline assembly instead of macros.  Compile tested
on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

[aurelien@aurel32.net: build fix]
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago radix-tree: use indirect bit
Nick Piggin [Tue, 16 Oct 2007 08:24:42 +0000 (01:24 -0700)]
radix-tree: use indirect bit

Rather than sign direct radix-tree pointers with a special bit, sign the
indirect one that hangs off the root.  This means that, given a lookup_slot
operation, the invalid result will be differentiated from the valid
(previously, valid results could have the bit either set or clear).

This does not affect slot lookups which occur under lock -- they can never
return an invalid result.  This is needed in future for the lockless pagecache.
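
The general technique of signing a pointer with a low-order bit can be
sketched in user space as below.  The helper names are hypothetical, and using
bit 0 is an assumption of this sketch; it relies on the tagged objects being
at least 2-byte aligned.

```c
#include <stdint.h>

#define SKETCH_INDIRECT_BIT ((uintptr_t)1)   /* assumed tag bit */

/* Mark a pointer to an internal node as indirect. */
static void *tag_indirect(void *node)
{
	return (void *)((uintptr_t)node | SKETCH_INDIRECT_BIT);
}

/* Test whether a stored pointer carries the indirect tag. */
static int is_indirect(const void *p)
{
	return ((uintptr_t)p & SKETCH_INDIRECT_BIT) != 0;
}

/* Strip the tag to recover the real node address. */
static void *untag_indirect(void *p)
{
	return (void *)((uintptr_t)p & ~SKETCH_INDIRECT_BIT);
}
```

A lookup that finds a tagged pointer where it expected a data item can tell
that it raced with a tree change, since data items never carry the tag.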

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: clarify __add_to_swap_cache locking
Nick Piggin [Tue, 16 Oct 2007 08:24:42 +0000 (01:24 -0700)]
mm: clarify __add_to_swap_cache locking

__add_to_swap_cache unconditionally sets the page locked, which can be a bit
alarming to the unsuspecting reader: in the code paths where the page is
visible to other CPUs, the page should be (and is) already locked.

Instead, just add a check to ensure the page is locked here, and teach the one
path relying on the old behaviour to call SetPageLocked itself.

[hugh@veritas.com: locking fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: improve find_lock_page
Nick Piggin [Tue, 16 Oct 2007 08:24:41 +0000 (01:24 -0700)]
mm: improve find_lock_page

find_lock_page does not need to recheck ->index because if the page is in the
right mapping then the index must be the same.  Also, tree_lock does not need
to be retaken after the page is locked in order to test that ->mapping has not
changed, because holding the page lock pins its mapping.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago mm: use lockless radix-tree probe
Nick Piggin [Tue, 16 Oct 2007 08:24:40 +0000 (01:24 -0700)]
mm: use lockless radix-tree probe

Probing pages and radix_tree_tagged are lockless operations with the lockless
radix-tree.  Convert these users to RCU locking rather than using tree_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago remove ZERO_PAGE
Nick Piggin [Tue, 16 Oct 2007 08:24:40 +0000 (01:24 -0700)]
remove ZERO_PAGE

The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- measuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
17 years ago SLUB: direct pass through of page size or higher kmalloc requests
Christoph Lameter [Tue, 16 Oct 2007 08:24:38 +0000 (01:24 -0700)]
SLUB: direct pass through of page size or higher kmalloc requests

This gets rid of all kmalloc caches larger than page size.  A kmalloc
request larger than PAGE_SIZE / 2 is going to be passed through to the page
allocator.  This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.

kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.
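
The routing decision can be sketched in user space as follows.  The names, the
4 KiB page size and the half-page threshold are assumptions of this sketch,
which does not allocate anything; it only shows which path a request of a
given size would take and how many pages it would need.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size */

enum alloc_path { PATH_SLAB, PATH_PAGE_ALLOCATOR };

/* Requests above half a page (assumed threshold) bypass the slab caches. */
static enum alloc_path kmalloc_path(size_t size)
{
	return size > SKETCH_PAGE_SIZE / 2 ? PATH_PAGE_ALLOCATOR : PATH_SLAB;
}

/* Smallest order such that (page size << order) covers the request. */
static unsigned int order_for(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

On the free side, the matching check is whether the object lives in a slab
page; if it does not, the pages go straight back to the page allocator.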

Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
  need to pass through the slab allocator to get to the
  page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
  PAGE_SIZEd allocations.
  SLUB must call the page allocator for each alloc anyway, since
  the higher order pages that allowed avoiding the page allocator calls
  are no longer reliably available. So we are basically removing
  useless slab allocator overhead.
- Large kmallocs yield page aligned objects, which is what
  SLAB did. Bad things like using page sized kmalloc allocations to
  stand in for page allocator allocations can be transparently handled and
  are not distinguishable from page allocator uses.
- Checking for too large objects can be removed since
  it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago filemap: convert some unsigned long to pgoff_t
Fengguang Wu [Tue, 16 Oct 2007 08:24:37 +0000 (01:24 -0700)]
filemap: convert some unsigned long to pgoff_t

Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago filemap: trivial code cleanups
Fengguang Wu [Tue, 16 Oct 2007 08:24:37 +0000 (01:24 -0700)]
filemap: trivial code cleanups

- remove unused local next_index in do_generic_mapping_read()
- remove a redundant page_cache_read() declaration

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: remove the limit max_sectors_kb imposed on max_readahead_kb
Fengguang Wu [Tue, 16 Oct 2007 08:24:36 +0000 (01:24 -0700)]
readahead: remove the limit max_sectors_kb imposed on max_readahead_kb

Remove the size limit max_sectors_kb imposed on max_readahead_kb.

The size restriction is unreasonable.  Especially when max_sectors_kb cannot
grow larger than max_hw_sectors_kb, which can be rather small for some disk
drives.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: remove several readahead macros
Fengguang Wu [Tue, 16 Oct 2007 08:24:36 +0000 (01:24 -0700)]
readahead: remove several readahead macros

Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: remove the local copy of ra in do_generic_mapping_read()
Fengguang Wu [Tue, 16 Oct 2007 08:24:35 +0000 (01:24 -0700)]
readahead: remove the local copy of ra in do_generic_mapping_read()

The local copy of ra in do_generic_mapping_read() can now go away.

It predates readahead(req_size), dating from a time when the readahead code
was called on *every* single page.  Hence a local copy had to be made to
reduce the chance of the readahead state being overwritten by a concurrent
reader.  More details in: Linux: Random File I/O Regressions In 2.6
<http://kerneltrap.org/node/3039>

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: basic support of interleaved reads
Fengguang Wu [Tue, 16 Oct 2007 08:24:34 +0000 (01:24 -0700)]
readahead: basic support of interleaved reads

This is a simplified version of the pagecache context based readahead.  It
handles the case of multiple threads reading on the same fd and invalidating
each others' readahead state.  It does the trick by scanning the pagecache and
recovering the current read stream's readahead status.

The algorithm works in an opportunistic way, in that it does not try to detect
interleaved reads _actively_, which requires a probe into the page cache
(which means a little more overhead for random reads).  It only tries to
handle a previously started sequential readahead whose state was overwritten
by another concurrent stream, and it can do this job pretty well.

Negative and positive examples (or what you can expect from it):

1) it cannot detect and serve perfect request-by-request interleaved reads
   right:
time    stream 1    stream 2
0       1
1                   1001
2       2
3                   1002
4       3
5                   1003
6       4
7                   1004
8       5
9                   1005

Here not a single readahead will be carried out.

2) However, if it's two concurrent reads by two threads, the chance of the
   initial sequential readahead being started is high. Once the first
   sequential readahead is started for a stream, this patch will ensure that
   the readahead window continues to ramp up and won't be disturbed by other
   streams.

time    stream 1    stream 2
0       1
1       2
2                   1001
3       3
4                   1002
5                   1003
6       4
7       5
8                   1004
9       6
10                  1005
11      7
12                  1006
13                  1007

Here stream 1 will start a readahead at page 2, and stream 2 will start its
first readahead at page 1003.  From then on the two streams will be served
right.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago radixtree: introduce radix_tree_next_hole()
Fengguang Wu [Tue, 16 Oct 2007 08:24:33 +0000 (01:24 -0700)]
radixtree: introduce radix_tree_next_hole()

Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for
the first hole.  It will be used in interleaved readahead.

The implementation is dumb and obviously correct.  It can help debug (and
document) a possible smarter one in future.
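
A linear scan of this kind can be sketched in user space as below.  A plain
boolean array stands in for the radix tree, and struct index_set, lookup() and
next_hole() are hypothetical names; only the (index, max_scan) parameters
follow the description above.

```c
#include <stdbool.h>

/* Stand-in for the tree: present[i] says whether index i is populated. */
struct index_set {
	const bool *present;
	unsigned long nr;        /* number of entries in present[] */
};

/* Indices beyond the array are treated as empty. */
static bool lookup(const struct index_set *s, unsigned long index)
{
	return index < s->nr && s->present[index];
}

/*
 * Scan forward from index, one slot at a time, for at most max_scan
 * slots.  Returns the first empty index found, or the index just past
 * the scanned range if every scanned slot was populated.
 */
static unsigned long next_hole(const struct index_set *s,
			       unsigned long index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		if (!lookup(s, index))
			break;
		index++;
		if (index == 0)  /* wrapped around the index space */
			break;
	}
	return index;
}
```

One lookup per slot is slow for long runs, but it is easy to verify and can
serve as a reference when testing a faster version.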

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: combine file_ra_state.prev_index/prev_offset into prev_pos
Fengguang Wu [Tue, 16 Oct 2007 08:24:33 +0000 (01:24 -0700)]
readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

Combine the file_ra_state members
unsigned long prev_index
unsigned int prev_offset
into
loff_t prev_pos

It is more consistent and better supports huge files.
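
Packing a page index and an in-page offset into one byte position can be
sketched as follows.  The helper names, the 64-bit typedef and the 4 KiB page
size are assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

typedef int64_t loff_sketch_t;                    /* stand-in for loff_t */

/*
 * Combine index and offset.  The index is widened to 64 bits before
 * the shift, so the result does not overflow where unsigned long is
 * only 32 bits wide.
 */
static loff_sketch_t make_prev_pos(unsigned long index, unsigned int offset)
{
	return ((loff_sketch_t)index << SKETCH_PAGE_SHIFT) | offset;
}

/* Recover the page index from a byte position. */
static unsigned long pos_index(loff_sketch_t pos)
{
	return (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
}

/* Recover the offset within the page. */
static unsigned int pos_offset(loff_sketch_t pos)
{
	return (unsigned int)(pos & (SKETCH_PAGE_SIZE - 1));
}
```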

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: mmap read-around simplification
Fengguang Wu [Tue, 16 Oct 2007 08:24:32 +0000 (01:24 -0700)]
readahead: mmap read-around simplification

Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago readahead: compacting file_ra_state
Fengguang Wu [Tue, 16 Oct 2007 08:24:31 +0000 (01:24 -0700)]
readahead: compacting file_ra_state

Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

This helps reduce memory consumption on 64bit CPU when a lot of files are
opened.
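
The size effect can be checked with a small user-space sketch.  Both structs
are hypothetical and do not reproduce the real file_ra_state layout; they only
show how narrowing size fields changes sizeof on a platform where unsigned
long is wider than unsigned int.

```c
#include <stddef.h>

/* Hypothetical "wide" layout: every field is an unsigned long. */
struct ra_state_wide {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
	unsigned long ra_pages;
};

/* Hypothetical compact layout: the size fields become unsigned int. */
struct ra_state_compact {
	unsigned long start;       /* page index keeps its full width */
	unsigned int size;
	unsigned int async_size;
	unsigned int ra_pages;
};

/* Bytes saved per open file by the compact layout on this platform. */
static size_t bytes_saved_per_file(void)
{
	return sizeof(struct ra_state_wide) - sizeof(struct ra_state_compact);
}
```

On a typical LP64 system the saving is a few bytes per open file, which adds
up when many files are open; on a 32-bit system it is usually zero.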

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Clean up duplicate includes in mm/
Jesper Juhl [Tue, 16 Oct 2007 08:24:30 +0000 (01:24 -0700)]
Clean up duplicate includes in mm/

This patch cleans up duplicate includes in
mm/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Clean up duplicate includes in include/linux/memory_hotplug.h
Jesper Juhl [Tue, 16 Oct 2007 08:24:30 +0000 (01:24 -0700)]
Clean up duplicate includes in include/linux/memory_hotplug.h

This patch cleans up duplicate includes in
include/linux/memory_hotplug.h

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago During VM oom condition, kill all threads in process group
Will Schmidt [Tue, 16 Oct 2007 08:24:18 +0000 (01:24 -0700)]
During VM oom condition, kill all threads in process group

We have had complaints where a threaded application is left in a bad state
after one of its threads is killed when we hit a VM: out_of_memory
condition.

Killing just one of the process threads can leave the application in a bad
state, whereas killing the entire process group would allow for the
application to restart, or be otherwise handled, and makes it very obvious
that something has gone wrong.

This change allows the entire process group to be taken down, rather
than just the one thread.

Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago slub.c:early_kmem_cache_node_alloc() shouldn't be __init
Adrian Bunk [Tue, 16 Oct 2007 08:24:18 +0000 (01:24 -0700)]
slub.c:early_kmem_cache_node_alloc() shouldn't be __init

WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
...

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago ppc64: SPARSEMEM_VMEMMAP support
Andy Whitcroft [Tue, 16 Oct 2007 08:24:17 +0000 (01:24 -0700)]
ppc64: SPARSEMEM_VMEMMAP support

Enable virtual memmap support for SPARSEMEM on PPC64 systems.  Slice a 16th
off the end of the linear mapping space and use that to hold the vmemmap.
It uses the same size mapping as is used in the linear 1:1 kernel mapping.

[pbadari@gmail.com: fix warning]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago SPARC64: SPARSEMEM_VMEMMAP support
David Miller [Tue, 16 Oct 2007 08:24:16 +0000 (01:24 -0700)]
SPARC64: SPARSEMEM_VMEMMAP support

[apw@shadowen.org: style fixups]
[apw@shadowen.org: vmemmap sparc64: convert to new config options]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoIA64: SPARSEMEM_VMEMMAP 16K page size support
Christoph Lameter [Tue, 16 Oct 2007 08:24:15 +0000 (01:24 -0700)]
IA64: SPARSEMEM_VMEMMAP 16K page size support

Equip IA64 sparsemem with a virtual memmap.  This is similar to the existing
CONFIG_VIRTUAL_MEM_MAP functionality for DISCONTIGMEM.  It uses a PAGE_SIZE
mapping.

This is provided as a minimally intrusive solution.  We split the 128TB
VMALLOC area into two 64TB areas and use one for the virtual memmap.

This should replace CONFIG_VIRTUAL_MEM_MAP long term.

[apw@shadowen.org: convert to new helper based initialisation]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agox86_64: SPARSEMEM_VMEMMAP 2M page size support
Christoph Lameter [Tue, 16 Oct 2007 08:24:15 +0000 (01:24 -0700)]
x86_64: SPARSEMEM_VMEMMAP 2M page size support

x86_64 uses 2M page table entries to map its 1-1 kernel space.  We also
implement the virtual memmap using 2M page table entries.  So there is no
additional runtime overhead over FLATMEM; initialisation is slightly more
complex.  As FLATMEM still references memory to obtain the mem_map pointer and
SPARSEMEM_VMEMMAP uses a compile time constant, SPARSEMEM_VMEMMAP should be
superior.

With this SPARSEMEM becomes the most efficient way of handling virt_to_page,
pfn_to_page and friends for UP, SMP and NUMA on x86_64.

[apw@shadowen.org: code resplit, style fixups]
[apw@shadowen.org: vmemmap x86_64: ensure end of section memmap is initialised]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agovmemmap: generify initialisation via helpers
Andy Whitcroft [Tue, 16 Oct 2007 08:24:14 +0000 (01:24 -0700)]
vmemmap: generify initialisation via helpers

Convert the common vmemmap population into initialisation helpers for use by
architecture vmemmap populators.  All architectures implementing the
SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
initialiser, which may make use of the helpers.

This allows us to clean up and remove the initialisation Kconfig entries.
With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
indicate use of that variant.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoGeneric Virtual Memmap support for SPARSEMEM
Christoph Lameter [Tue, 16 Oct 2007 08:24:13 +0000 (01:24 -0700)]
Generic Virtual Memmap support for SPARSEMEM

SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contiguous area and only the active sections are
physically backed.  This allows virt_to_page, page_address and cohorts to become
simple shift/add operations.  No page flag fields, no table lookups, nothing
involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This will make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (e.g.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.
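
The pointer arithmetic described above can be sketched in user space as
follows.  This is only an illustration: the struct page layout, the
SKETCH_NR_PAGES size and the array standing in for vmemmap are assumptions
made for the example (in the kernel, vmemmap is a fixed per-architecture
virtual address), and a cast was added to __page_to_pfn so it yields an
unsigned value.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real layout differs. */
struct page {
    unsigned long flags;
    unsigned long private_data;
};

/*
 * Hypothetical backing store for the sketch. In the kernel, vmemmap is a
 * constant virtual address, not a C array.
 */
#define SKETCH_NR_PAGES 1024
static struct page sketch_memmap[SKETCH_NR_PAGES];
#define vmemmap (sketch_memmap)

/* The two conversions quoted in the commit message. */
#define __pfn_to_page(pfn)  (vmemmap + (pfn))
#define __page_to_pfn(page) ((unsigned long)((page) - vmemmap))

/* Round-trip a page frame number through its struct page pointer. */
static unsigned long roundtrip_pfn(unsigned long pfn)
{
    assert(pfn < SKETCH_NR_PAGES);
    return __page_to_pfn(__pfn_to_page(pfn));
}
```

Both directions are a single add or subtract against a constant base, with
no memory read to find the start of the memmap, which is the difference
from FLATMEM that the text points out.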

This patch has the potential to allow us to make SPARSEMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

[apw@shadowen.org: config cleanups, resplit code etc]
[kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
[apw@shadowen.org: vmemmap: remove excess debugging]
[apw@shadowen.org: simplify initialisation code and reduce duplication]
[apw@shadowen.org: pull out the vmemmap code into its own file]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agosparsemem: record when a section has a valid mem_map
Andy Whitcroft [Tue, 16 Oct 2007 08:24:11 +0000 (01:24 -0700)]
sparsemem: record when a section has a valid mem_map

We have flags to indicate whether a section actually has a valid mem_map
associated with it.  This is never set and we rely solely on the present bit
to indicate a section is valid.  By definition a section is not valid if it
has no mem_map and there is a window during init where the present bit is set
but there is no mem_map, during which pfn_valid() will return true
incorrectly.

Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add new
present_section{,_nr} and pfn_present() interfaces for those users who care to
know that a section is going to be valid.
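
The distinction between "present" and "valid" can be sketched like this.
The struct is a simplified stand-in, and the name SECTION_MARKED_PRESENT
and both bit values are assumptions made for the example; only
SECTION_HAS_MEM_MAP, valid_section and present_section are named in this
message.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits kept in the low bits of section_mem_map (values assumed). */
#define SECTION_MARKED_PRESENT (1UL << 0)
#define SECTION_HAS_MEM_MAP    (1UL << 1)

/* Simplified stand-in for the kernel's struct mem_section. */
struct mem_section {
    unsigned long section_mem_map;
};

/* Present: memory is known to exist, but a mem_map may not be set up yet. */
static int present_section(const struct mem_section *ms)
{
    return ms != NULL && (ms->section_mem_map & SECTION_MARKED_PRESENT) != 0;
}

/* Valid: a mem_map has been attached, so struct page lookups are safe. */
static int valid_section(const struct mem_section *ms)
{
    return ms != NULL && (ms->section_mem_map & SECTION_HAS_MEM_MAP) != 0;
}
```

During the init window described above, a section would report present
but not valid, so a pfn_valid() built on valid_section() no longer returns
true too early.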

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agosparsemem: clean up spelling error in comments
Andy Whitcroft [Tue, 16 Oct 2007 08:24:10 +0000 (01:24 -0700)]
sparsemem: clean up spelling error in comments

SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contiguous area and only the active sections are
physically backed.  This allows virt_to_page, page_address and cohorts to become
simple shift/add operations.  No page flag fields, no table lookups, nothing
involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This will make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (e.g.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSEMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

The current aim is to bring a common virtually mapped mem_map to all
architectures.  This should facilitate the removal of the bespoke
implementations from the architectures.  This also brings performance
improvements for most architectures, making sparsemem vmemmap the more desirable
memory model.  The ultimate aim of this work is to expand sparsemem support to
encompass all the features of the other memory models.  This could allow us to
drop support for and remove the other models in the longer term.

Below are some comparative kernbench numbers for various architectures,
comparing default memory model against SPARSEMEM VMEMMAP.  All but ia64 show
marginal improvement; we expect the ia64 figures to be sorted out when the
larger mapping support returns.

x86-64 non-NUMA
             Base    VMEMAP    % change (-ve good)
User        85.07     84.84    -0.26
System      34.32     33.84    -1.39
Total      119.38    118.68    -0.59

ia64
             Base    VMEMAP    % change (-ve good)
User      1016.41   1016.93    0.05
System      50.83     51.02    0.36
Total     1067.25   1067.95    0.07

x86-64 NUMA
             Base   VMEMAP    % change (-ve good)
User       430.77   431.73     0.22
System      45.39    43.98    -3.11
Total      476.17   475.71    -0.10

ppc64
             Base   VMEMAP    % change (-ve good)
User       488.77   488.35    -0.09
System      56.92    56.37    -0.97
Total      545.69   544.72    -0.18

Below are some AIM benchmarks on IA64 and x86-64 (thanks Bob).  The results seem
pretty much flat, as you would expect.

ia64 results 2 cpu non-numa 4Gb SCSI disk

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun  1 07:17:24 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 98.9 100 58.9 1.3 1.6482
101 5547.1 95 106.0 79.4 0.9154
201 6377.7 95 183.4 158.3 0.5288
301 6932.2 95 252.7 237.3 0.3838
401 7075.8 93 329.8 316.7 0.2941
501 7235.6 94 403.0 396.2 0.2407
600 7387.5 94 472.7 475.0 0.2052

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun  1 09:59:04 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 99.1 100 58.8 1.2 1.6509
101 5480.9 95 107.2 79.2 0.9044
201 6490.3 95 180.2 157.8 0.5382
301 6886.6 94 254.4 236.8 0.3813
401 7078.2 94 329.7 316.0 0.2942
501 7250.3 95 402.2 395.4 0.2412
600 7399.1 94 471.9 473.9 0.2055

open power 710 2 cpu, 4 Gb, SCSI and configured physically

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" extreme May 29 15:42:53 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 25.7 100 226.3 4.3 0.4286
101 1096.0 97 536.4 199.8 0.1809
201 1236.4 96 946.1 389.1 0.1025
301 1280.5 96 1368.0 582.3 0.0709
401 1270.2 95 1837.4 771.0 0.0528
501 1251.4 96 2330.1 955.9 0.0416
601 1252.6 96 2792.4 1139.2 0.0347
701 1245.2 96 3276.5 1334.6 0.0296
918 1229.5 96 4345.4 1728.7 0.0223

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" vmemmap May 30 07:28:26 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 25.6 100 226.9 4.3 0.4275
101 1049.3 97 560.2 198.1 0.1731
201 1199.1 97 975.6 390.7 0.0994
301 1261.7 96 1388.5 591.5 0.0699
401 1256.1 96 1858.1 771.9 0.0522
501 1220.1 96 2389.7 955.3 0.0406
601 1224.6 96 2856.3 1133.4 0.0340
701 1252.0 96 3258.7 1314.1 0.0298
915 1232.8 96 4319.7 1704.0 0.0225

amd64 2 2-core, 4Gb and SATA

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun  2 03:59:48 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 13.0 100 446.4 2.1 0.2173
101 533.4 97 1102.0 110.2 0.0880
201 578.3 97 2022.8 220.8 0.0480
301 583.8 97 3000.6 332.3 0.0323
401 580.5 97 4020.1 442.2 0.0241
501 574.8 98 5072.8 558.8 0.0191
600 566.5 98 6163.8 671.0 0.0157

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun  3 04:19:31 2007

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 13.0 100 447.8 2.0 0.2166
101 536.5 97 1095.6 109.7 0.0885
201 567.7 97 2060.5 219.3 0.0471
301 582.1 96 3009.4 330.2 0.0322
401 578.2 96 4036.4 442.4 0.0240
501 585.1 98 4983.2 555.1 0.0195
600 565.5 98 6175.2 660.6 0.0157

This patch:

Fix some spelling errors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agox86: optimize page faults like all other achitectures and kill notifier cruft
Christoph Hellwig [Tue, 16 Oct 2007 08:24:07 +0000 (01:24 -0700)]
x86: optimize page faults like all other achitectures and kill notifier cruft

x86(-64) are the last architectures still using the page fault notifier
cruft for the kprobes page fault hook.  This patch converts them to the
proper direct calls, and removes the now unused pagefault notifier bits
as well as the cruft in kprobes.c that was related to this mess.

I know Andi didn't really like this, but all other architecture maintainers
agreed the direct calls are much better and, besides the obvious cruft
removal, a common way of dealing with kprobes across architectures is
important as well.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc64]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoConvert cpu_sibling_map to be a per cpu variable
Mike Travis [Tue, 16 Oct 2007 08:24:05 +0000 (01:24 -0700)]
Convert cpu_sibling_map to be a per cpu variable

Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable.  This saves sizeof(cpumask_t) * NR unused cpus.  Access is mostly
from startup and CPU HOTPLUG functions.
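
The size of the saving can be worked through with a small sketch.  The
SKETCH_NR_CPUS value and the sketch_cpumask_t type are assumptions made
for the example, not the kernel's definitions; the point is only the
arithmetic of one mask per configured CPU versus one mask per CPU that
actually gets a per-CPU area.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 256   /* assumed build-time maximum */
#define SKETCH_BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per configured CPU, rounded up to whole longs. */
typedef struct {
    unsigned long bits[(SKETCH_NR_CPUS + SKETCH_BITS_PER_LONG - 1) /
                       SKETCH_BITS_PER_LONG];
} sketch_cpumask_t;

/* Bytes used by a static array with one mask per configured CPU. */
static size_t static_array_bytes(void)
{
    return sizeof(sketch_cpumask_t) * SKETCH_NR_CPUS;
}

/* Bytes used when only CPUs with a per-CPU area carry a mask. */
static size_t per_cpu_bytes(size_t nr_cpus_in_use)
{
    assert(nr_cpus_in_use <= SKETCH_NR_CPUS);
    return sizeof(sketch_cpumask_t) * nr_cpus_in_use;
}

/* Saving: sizeof(mask) multiplied by the number of unused CPUs. */
static size_t bytes_saved(size_t nr_cpus_in_use)
{
    return static_array_bytes() - per_cpu_bytes(nr_cpus_in_use);
}
```

With these assumed numbers, a 4-CPU machine built for 256 CPUs would save
252 masks' worth of memory for this one map.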

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agox86: Convert cpu_core_map to be a per cpu variable
Mike Travis [Tue, 16 Oct 2007 08:24:04 +0000 (01:24 -0700)]
x86: Convert cpu_core_map to be a per cpu variable

This is from an earlier message from 'Christoph Lameter':

    cpu_core_map is currently an array defined using NR_CPUS. This means that
    we overallocate, since we will rarely use the maximum number of configured cpus.

    If we put the cpu_core_map into the per cpu area then it will be allocated
    for each processor as it comes online.

    This means that the core map cannot be accessed until the per cpu area
    has been allocated. Xen does a weird thing here looping over all processors
    and zeroing the masks that are not yet allocated and that will be zeroed
    when they are allocated. I commented the code out.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoAdd support for Wacom WACF007 and WACF008 to serial pnp driver
Maik Broemme [Tue, 16 Oct 2007 08:24:03 +0000 (01:24 -0700)]
Add support for Wacom WACF007 and WACF008 to serial pnp driver

Notebook manufacturers seem to have built a newer Wacom pen-enabled tablet into
recent tablet PCs, which is not recognized by the serial pnp driver.

Attached is a patch which makes the newer Wacom WACF007 and WACF008 tablets
useable with the serial driver.  The device is fully compatible with it.

Signed-off-by: Maik Broemme <mbroemme@plusserver.de>
Cc: Andrey Panin <pazke@orbita1.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoserial_txx9: Use UPF_FIXED_PORT
Atsushi Nemoto [Tue, 16 Oct 2007 08:24:02 +0000 (01:24 -0700)]
serial_txx9: Use UPF_FIXED_PORT

The UPF_FIXED_PORT flag was introduced in 2.6.22 and it can be used
instead of the driver specific verify_port routine.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agowake up from a serial port
Guennadi Liakhovetski [Tue, 16 Oct 2007 08:24:02 +0000 (01:24 -0700)]
wake up from a serial port

Enable wakeup from serial ports, make it run-time configurable over sysfs,
e.g.,

echo enabled > /sys/devices/platform/serial8250.0/tty/ttyS0/power/wakeup

Requires

# CONFIG_SYSFS_DEPRECATED is not set
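
For programs that cannot shell out to "echo", the same setting could be
written from C roughly as below.  This is a sketch only: set_wakeup() is a
hypothetical helper, not part of the patch, and it assumes the attribute
accepts the strings "enabled" and "disabled".

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "enabled" or "disabled" to a power/wakeup attribute file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int set_wakeup(const char *attr_path, int enable)
{
    FILE *f;
    int ok;

    assert(attr_path != NULL);
    f = fopen(attr_path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(enable ? "enabled\n" : "disabled\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would pass the sysfs path from the example above, e.g.
set_wakeup("/sys/devices/platform/serial8250.0/tty/ttyS0/power/wakeup", 1).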

Following suggestions from Alan and Russell, moved the may_wake_up checks
to serial_core.c.  This time actually tested - it does even work.  Could
someone please verify that put_device after device_find_child is
correct?

Also would be nice to test with a Natsemi UART, that can wake up the system,
if such systems exist.

For this you just have to apply the patch below, issue the above "echo"
command to one of your Natsemi ports, suspend and resume your system, and
verify that your Natsemi port still works.  If you are actually capable of
waking up the system from that port, would be nice to test that as well.

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoprovide stubs for enable_irq_wake() and disable_irq_wake()
Guennadi Liakhovetski [Tue, 16 Oct 2007 08:24:01 +0000 (01:24 -0700)]
provide stubs for enable_irq_wake() and disable_irq_wake()

Provide {enable,disable}_irq_wake dummies for undefined
cross-compilers for platforms without CONFIG_GENERIC_IRQ.

Needed by wake-up-from-a-serial-port.patch

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago8250_pci: Autodetect mainpine cards
Alan Cox [Tue, 16 Oct 2007 08:24:00 +0000 (01:24 -0700)]
8250_pci: Autodetect mainpine cards

Add support for a whole range of boards.  Some are partly autodetected but
not fully correctly; others (PCI Express notably) are not detected at all.
Stick all the right entries in.

Thanks to Mainpine for information and testing.

Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoserial_txx9: cleanup includes
Atsushi Nemoto [Tue, 16 Oct 2007 08:23:59 +0000 (01:23 -0700)]
serial_txx9: cleanup includes

Do not include some header files already included by serial_core.h.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agopcmcia: use DMA_MASK_NONE for the default for all pcmcia devices
James Bottomley [Tue, 16 Oct 2007 08:23:58 +0000 (01:23 -0700)]
pcmcia: use DMA_MASK_NONE for the default for all pcmcia devices

Most non cardbus devices can't do dma, so flag them as such in the device
creation routine.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agointroduce DMA_MASK_NONE as a signal for unable to do DMA
James Bottomley [Tue, 16 Oct 2007 08:23:55 +0000 (01:23 -0700)]
introduce DMA_MASK_NONE as a signal for unable to do DMA

Some devices are incapable of DMA and need to be recognised as such.
Introduce a NONE dma mask to facilitate this plus an inline function:
is_device_dma_capable() to check this.
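
A minimal sketch of such a check, written against a stand-in struct so it
compiles outside the kernel.  The DMA_MASK_NONE value of 0 and the exact
body of is_device_dma_capable() are assumptions made for illustration; the
message only states that the mask and the inline function exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MASK_NONE 0x0ULL   /* assumed value: no addressable bits */

/* Minimal stand-in for struct device; only the field used here. */
struct device {
    uint64_t *dma_mask;
};

/* A device can do DMA only if it has a mask and the mask is not NONE. */
static inline int is_device_dma_capable(const struct device *dev)
{
    return dev->dma_mask != NULL && *dev->dma_mask != DMA_MASK_NONE;
}
```

Treating a missing mask pointer the same as DMA_MASK_NONE is a choice made
in this sketch so that callers need only one test.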

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoAdd support for PCMCIA card Sierra WIreless AC850
Eric Leblond [Tue, 16 Oct 2007 08:23:54 +0000 (01:23 -0700)]
Add support for PCMCIA card Sierra WIreless AC850

Add support for Sierra Wireless AC850 which has the same Ids as the
AC710/750 but has a different firmware.

Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agopcmcia: cistpl: use get_unaligned() in CIS parsing
Daniel Ritz [Tue, 16 Oct 2007 08:23:52 +0000 (01:23 -0700)]
pcmcia: cistpl: use get_unaligned() in CIS parsing

Based on a patch by Haavard Skinnemoen posted to linux-pcmcia, but using
static inlines for readability reasons.  This should fix PCMCIA on AVR32.
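
The idea behind the change can be sketched in portable C.  The helper
names below are hypothetical and are not the inlines added by the patch;
the sketch also assumes CIS multi-byte fields are little-endian.  It shows
two ways to read a 16-bit value at an arbitrary byte offset without
dereferencing a misaligned pointer, which is what traps on strict
alignment CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy-based unaligned load in host byte order. */
static inline uint16_t sketch_get_unaligned_u16(const void *p)
{
    uint16_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/* Read a little-endian 16-bit field at any byte offset, on any host. */
static inline uint16_t sketch_cis_read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

The byte-wise variant also removes the need for a separate endianness
conversion, at the cost of being specific to one byte order.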

Signed-off-by: Daniel Ritz <daniel.ritz@gmx.ch>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomove a few definitions to au1000_xxs1500.c
Yoichi Yuasa [Tue, 16 Oct 2007 08:23:51 +0000 (01:23 -0700)]
move a few definitions to au1000_xxs1500.c

Only a few definitions are in xxs1500.h.
They can be moved to au1000_xxs1500.c.

[m.kozlowski@tuxland.pl: fix unbalanced parenthesis]
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agopxa2xx PCMCIA timing issue on iPAQ H5550
Milan Plzik [Tue, 16 Oct 2007 08:23:49 +0000 (01:23 -0700)]
pxa2xx PCMCIA timing issue on iPAQ H5550

Recently I've been trying to get a working PCMCIA interface on the H5000 iPAQ
series, using a dual PCMCIA sleeve.  So far things work correctly, but I had
to make one modification to drivers/pcmcia/pxa2xx_base.c to get the interface
working with an Orinoco Gold PCMCIA card (a wired pcnet_cs ethernet card worked
even without this modification).

The issue has something to do with the assert time on the PCMCIA bus, but I'm
not really sure what -- I found the working value by trial and error.  I'm not
sure how the assert value in pxa2xx_mcxx_asst is calculated (I know, it is a
simple formula, but the reason it is calculated that way is not obvious to
me), nor whether my modification is correct.  It just works with the iPAQ.

Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoUse menuconfig objects: PCMCIA
Jan Engelhardt [Tue, 16 Oct 2007 08:23:48 +0000 (01:23 -0700)]
Use menuconfig objects: PCMCIA

Use menuconfigs instead of menus, so the whole menu can be disabled at once
instead of going through all options.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoAdd assembler equivalents to __init{,date}_refok
Ralf Baechle [Tue, 16 Oct 2007 08:23:47 +0000 (01:23 -0700)]
Add assembler equivalents to __init{,date}_refok

I need __INIT_REFOK to fix a MODPOST warning for a few MIPS configs which
have to call init code from .text very early in the game due to bootloader
issues.  __INITDATA_REFOK is just for consistency.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoslow down printk during boot
Randy Dunlap [Tue, 16 Oct 2007 08:23:46 +0000 (01:23 -0700)]
slow down printk during boot

Optionally add a boot delay after each kernel printk() call, crudely
measured in milliseconds, with a maximum delay of 10 seconds per printk.

Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.):
"lpj=loops_per_jiffy boot_delay=100"
to the kernel command line.

It has been useful in cases like "during boot, my machine just reboots or the
screen goes black".  By slowing down printk (and adding initcall_debug), we can
usually see the last thing that happened before the lights went out, which is
usually a valuable clue.
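
The relation between "lpj=" and "boot_delay=" can be sketched as a
conversion from milliseconds to busy-loop iterations.  The function names
and the handling of out-of-range values are assumptions made for the
example (this sketch simply applies no delay above the 10 second cap); the
patch itself may differ.

```c
#include <assert.h>

#define SKETCH_MAX_BOOT_DELAY_MS (10 * 1000)  /* 10 s cap from the text */

/* Convert loops_per_jiffy to loops per millisecond for a given HZ. */
static unsigned long long loops_per_msec(unsigned long lpj, unsigned int hz)
{
    return (unsigned long long)lpj * hz / 1000;
}

/*
 * Busy-loop iterations to burn after one printk().
 * Delays above the cap are treated as "no delay" in this sketch.
 */
static unsigned long long delay_loops(unsigned long lpj, unsigned int hz,
                                      unsigned int boot_delay_ms)
{
    if (boot_delay_ms > SKETCH_MAX_BOOT_DELAY_MS)
        return 0;
    return loops_per_msec(lpj, hz) * boot_delay_ms;
}
```

This also shows why lpj has to be given on the command line: the delay is
needed before the kernel has calibrated loops_per_jiffy by itself.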

[akpm@linux-foundation.org: not all architectures implement CONFIG_HZ]
[akpm@linux-foundation.org: fix lots of stuff]
[bunk@stusta.de: kernel/printk.c: make 2 variables static]
[heiko.carstens@de.ibm.com: fix slow down printk on boot compile error]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoConsolidate PTRACE_DETACH
Alexey Dobriyan [Tue, 16 Oct 2007 08:23:45 +0000 (01:23 -0700)]
Consolidate PTRACE_DETACH

Identical handlers of PTRACE_DETACH go into ptrace_request().
Not touching compat code.
Not touching archs that don't call ptrace_request.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agodocbook: fix filesystems content
Randy Dunlap [Tue, 16 Oct 2007 00:30:19 +0000 (17:30 -0700)]
docbook: fix filesystems content

Fix filesystems docbook warnings.

Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'name'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'mode'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'parent'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'value'
Warning(linux-2.6.23-git8//include/linux/jbd.h:404): No description found for parameter 'h_lockdep_map'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agodocbook: fix usb content
Randy Dunlap [Tue, 16 Oct 2007 00:30:02 +0000 (17:30 -0700)]
docbook: fix usb content

Fix USB docbook warnings.

Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:487): No description found for parameter 'g'
Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:506): No description found for parameter 'g'

Warning(linux-2.6.23-git8//drivers/usb/core/hub.c:1416): No description found for parameter 'usb_dev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agodocbook: fix libata content
Randy Dunlap [Tue, 16 Oct 2007 00:29:46 +0000 (17:29 -0700)]
docbook: fix libata content

Fix libata docbook warnings.

Warning(linux-2.6.23-git8//drivers/ata/libata-scsi.c:3251): No description found for parameter 'dev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agodocbook: fix kernel-api content
Randy Dunlap [Tue, 16 Oct 2007 00:29:33 +0000 (17:29 -0700)]
docbook: fix kernel-api content

Fix kernel-api docbook warnings.

Warning(linux-2.6.23-git8//drivers/message/fusion/mptscsih.c:2618): No description found for parameter 'sc'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoMerge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
Linus Torvalds [Mon, 15 Oct 2007 23:08:50 +0000 (16:08 -0700)]
Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm

* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (95 commits)
  [ARM] 4578/1: CM-x270: PCMCIA support
  [ARM] 4577/1: ITE 8152 PCI bridge support
  [ARM] 4576/1: CM-X270 machine support
  [ARM] pxa: Avoid pxa_gpio_mode() in gpio_direction_{in,out}put()
  [ARM] pxa: move pxa_set_mode() from pxa2xx_mainstone.c to mainstone.c
  [ARM] pxa: move pxa_set_mode() from pxa2xx_lubbock.c to lubbock.c
  [ARM] pxa: Make cpu_is_pxaXXX dependent on configuration symbols
  [ARM] pxa: PXA3xx base support
  [NET] smc91x: fix PXA DMA support code
  [SERIAL] Fix console initialisation ordering
  [ARM] pxa: tidy up arch/arm/mach-pxa/Makefile
  [ARM] Update arch/arm/Kconfig for drivers/Kconfig changes
  [ARM] 4600/1: fix kernel build failure with build-id-supporting binutils
  [ARM] 4599/1: Preserve ATAG list for use with kexec (2.6.23)
  [ARM] Rename consistent_sync() as dma_cache_maint()
  [ARM] 4572/1: ep93xx: add cirrus logic edb9307 support
  [ARM] 4596/1: S3C2412: Correct IRQs for SDI+CF and add decoding support
  [ARM] 4595/1: ns9xxx: define registers as void __iomem * instead of volatile u32
  [ARM] 4594/1: ns9xxx: use the new gpio functions
  [ARM] 4593/1: ns9xxx: implement generic clockevents
  ...

17 years agoMerge branch 'locks' of git://linux-nfs.org/~bfields/linux
Linus Torvalds [Mon, 15 Oct 2007 23:07:40 +0000 (16:07 -0700)]
Merge branch 'locks' of git://linux-nfs.org/~bfields/linux

* 'locks' of git://linux-nfs.org/~bfields/linux:
  nfsd: remove IS_ISMNDLCK macro
  Rework /proc/locks via seq_files and seq_list helpers
  fs/locks.c: use list_for_each_entry() instead of list_for_each()
  NFS: clean up explicit check for mandatory locks
  AFS: clean up explicit check for mandatory locks
  9PFS: clean up explicit check for mandatory locks
  GFS2: clean up explicit check for mandatory locks
  Cleanup macros for distinguishing mandatory locks
  Documentation: move locks.txt in filesystems/
  locks: add warning about mandatory locking races
  Documentation: move mandatory locking documentation to filesystems/
  locks: Fix potential OOPS in generic_setlease()
  Use list_first_entry in locks_wake_up_blocks
  locks: fix flock_lock_file() comment
  Memory shortage can result in inconsistent flocks state
  locks: kill redundant local variable
  locks: reverse order of posix_locks_conflict() arguments

17 years agoMerge branch 'release' of ssh://master.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
Linus Torvalds [Mon, 15 Oct 2007 22:32:57 +0000 (15:32 -0700)]
Merge branch 'release' of ssh://master.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6

* 'release' of ssh://master.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] build fix for scatterlist

17 years ago[libata] pata_cs5536: new API build fix
Jeff Garzik [Mon, 15 Oct 2007 22:10:12 +0000 (18:10 -0400)]
[libata] pata_cs5536: new API build fix

This driver was using hooks that were very recently removed.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
17 years agoMerge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Linus Torvalds [Mon, 15 Oct 2007 21:06:58 +0000 (14:06 -0700)]
Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
  [IPV6]: Consolidate the ip6_pol_route_(input|output) pair
  [TCP]: Make snd_cwnd_cnt 32-bit
  [TCP]: Update the /proc/net/tcp documentation
  [NETNS]: Don't panic on creating the namespace's loopback
  [NEIGH]: Ensure that pneigh_lookup is protected with RTNL
  [INET]: kmalloc+memset -> kzalloc in frag_alloc_queue
  [ISDN]: Fix compile with CONFIG_ISDN_X25 disabled.
  [IPV6]: Replace sk_buff ** with sk_buff * in input handlers
  [SELINUX]: Update for netfilter ->hook() arg changes.
  [INET]: Consolidate the xxx_put
  [INET]: Small cleanup for xxx_put after evictor consolidation
  [INET]: Consolidate the xxx_evictor
  [INET]: Consolidate the xxx_frag_destroy
  [INET]: Consolidate xxx_the secret_rebuild
  [INET]: Consolidate the xxx_frag_kill
  [INET]: Collect common frag sysctl variables together
  [INET]: Collect frag queues management objects together
  [INET]: Move common fields from frag_queues in one place.
  [TG3]: Fix performance regression on 5705.
  [ISDN]: Remove local copy of device name to make sure renames work.
  ...

17 years agoMap volume and brightness events on thinkpads
Jeremy Katz [Mon, 15 Oct 2007 20:45:10 +0000 (16:45 -0400)]
Map volume and brightness events on thinkpads

There are standard keycodes for brightness and volume; map the events to
emit them so that things work properly.

Signed-off-by: Jeremy Katz <katzj@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago[IA64] build fix for scatterlist
Tony Luck [Mon, 15 Oct 2007 20:49:43 +0000 (13:49 -0700)]
[IA64] build fix for scatterlist

include/scsi/scsi_eh.h:79: error: field `sense_sgl' has incomplete type

x86 resolves this by including scatterlist.h from dma-mapping.h which
seems as good a place as any.

Signed-off-by: Tony Luck <tony.luck@intel.com>
17 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Mon, 15 Oct 2007 20:41:39 +0000 (13:41 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
  Input: use full RCU API
  Input: remove tsdev interface
  Input: add support for Blackfin BF54x Keypad controller
  Input: appletouch - another fix for idle reset logic
  HWMON: hdaps - switch to using input-polldev
  Input: add support for SEGA Dreamcast keyboard
  Input: omap-keyboard - don't pretend we support changing keymap
  Input: lifebook - fix X and Y axis range
  Input: usbtouchscreen - add support for GeneralTouch devices
  Input: fix open count handling in input interfaces
  Input: keyboard - add CapsShift lock
  Input: adbhid - produce all CapsLock key events
  Input: ALPS - add signature for ThinkPad R61
  Input: jornada720_kbd - send MSC_SCAN events
  Input: add support for the HP Jornada 7xx (710/720/728) touchscreen
  Input: add support for HP Jornada 7xx onboard keyboard
  Input: add support for HP Jornada onboard keyboard (HP6XX)
  Input: ucb1400_ts - use schedule_timeout_uninterruptible
  Input: xpad - fix dependancy on LEDS class
  Input: auto-select INPUT for MAC_EMUMOUSEBTN option
  ...

Resolved conflicts manually in drivers/hwmon/applesmc.c: converting from
a class device to a device and converting to use input-polldev created a
few apparently trivial clashes.

17 years agoMerge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik...
Linus Torvalds [Mon, 15 Oct 2007 20:31:14 +0000 (13:31 -0700)]
Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  [libata] pata_pcmcia: Add additional id string (corsair, 1GB)
  libata: prevent devices with blank model names from being DMA blacklisted
  ata_piix: SATA 2port controller port map fix
  pata_cs5536: ATA driver for Geode companion chip
  libata: add ST9160821AS / 3.CCD to NCQ blacklist
  libata: fix revalidation issuing after configuration commands
  [libata] sata_nv: add SW NCQ support for MCP51/MCP55/MCP61
  [libata] pata_sil680: Add MMIO support

17 years agoMerge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik...
Linus Torvalds [Mon, 15 Oct 2007 20:30:35 +0000 (13:30 -0700)]
Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (35 commits)
  xen-netfront: rearrange netfront structure to separate tx and rx
  netdev: convert non-obvious instances to use ARRAY_SIZE()
  ucc_geth: Fix build break introduced by commit 09f75cd7bf13720738e6a196cc0107ce9a5bd5a0
  gianfar: Fix regression caused by new napi interface
  gianfar: Cleanup compile warning caused by 0795af57
  gianfar: Fix compile regression caused by bea3348e
  add new prom.h for AU1x00
  update AU1000 get_ethernet_addr()
  MIPSsim: General cleanup
  Jazzsonic: Fix warning about unused variable.
  Remove msic_dcr_read() in axon_msi.c
  Use dcr_host_t.base in dcr_unmap()
  Add dcr_host_t.base in dcr_read()/dcr_write()
  Use dcr_host_t.base in ibm_emac_mal
  Update ibm_newemac to use dcr_host_t.base
  tehuti: possible leak in bdx_probe
  TC35815: Fix build
  SAA9730: Fix build
  AR7 ethernet
  myri10ge: update driver version to 1.3.2-1.287
  ...

17 years agoxen-netfront: rearrange netfront structure to separate tx and rx
Jeremy Fitzhardinge [Mon, 15 Oct 2007 19:59:53 +0000 (12:59 -0700)]
xen-netfront: rearrange netfront structure to separate tx and rx

Keep tx and rx elements separate on different cachelines to prevent
bouncing.
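
As a rough user-space illustration of the idea (not the netfront
structure itself), the sketch below places the transmit and receive
fields of a hypothetical struct on separate cache lines using C11
alignas. The 64-byte line size and all names are assumptions made for
the example.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE_BYTES 64	/* assumed cache line size */

/* Hypothetical ring state; field names are illustrative only. */
struct ring_state_demo {
	/* transmit side starts on a cache line boundary */
	alignas(CACHELINE_BYTES) unsigned int tx_prod;
	unsigned int tx_cons;

	/* receive side is pushed onto the next cache line */
	alignas(CACHELINE_BYTES) unsigned int rx_prod;
	unsigned int rx_cons;
};

/* Distance in bytes between the first tx field and the first rx field. */
static size_t tx_rx_distance(void)
{
	return offsetof(struct ring_state_demo, rx_prod) -
	       offsetof(struct ring_state_demo, tx_prod);
}
```

With this layout a CPU that only updates the tx fields does not
invalidate the line holding the rx fields on another CPU.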

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years agoAtari keyboard: incorporate additional review comments
Geert Uytterhoeven [Mon, 15 Oct 2007 19:51:10 +0000 (21:51 +0200)]
Atari keyboard: incorporate additional review comments

Atari keyboard: incorporate additional review comments:
  o Kill reference to source file name
  o Return error value from input_register_device() instead of -ENOMEM

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Michael Schmitz <schmitz@biophys.uni-duesseldorf.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago[IPV6]: Consolidate the ip6_pol_route_(input|output) pair
Pavel Emelyanov [Mon, 15 Oct 2007 20:02:51 +0000 (13:02 -0700)]
[IPV6]: Consolidate the ip6_pol_route_(input|output) pair

The only difference between the two functions is the "id" passed to
rt6_select(), so just pass it as an extra argument from
two outer helpers.

This is minus 60 lines of code and 360 bytes of .text
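
The shape of this kind of consolidation can be sketched as follows.
This is only an illustration with hypothetical names and a stub body;
it is not the routing code, and the shared function here simply
returns the value it was handed.

```c
/* Hypothetical flow descriptor with an input and an output interface. */
struct flow_demo {
	int iif;
	int oif;
};

/*
 * Shared body. In real code this would hold the logic that used to be
 * duplicated; here it just reports which interface it was keyed on.
 */
static int pol_route_common(const struct flow_demo *fl, int ifindex)
{
	(void)fl;
	return ifindex;
}

/* The two outer helpers differ only in the extra argument they pass. */
static int pol_route_input(const struct flow_demo *fl)
{
	return pol_route_common(fl, fl->iif);
}

static int pol_route_output(const struct flow_demo *fl)
{
	return pol_route_common(fl, fl->oif);
}
```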

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[TCP]: Make snd_cwnd_cnt 32-bit
Ilpo Järvinen [Mon, 15 Oct 2007 19:59:43 +0000 (12:59 -0700)]
[TCP]: Make snd_cwnd_cnt 32-bit

There is little point in having a 32-bit snd_cwnd if this is not
32-bit as well, as a number of snd_cwnd increment formulas
assume that snd_cwnd_cnt can be at least as large as snd_cwnd.

Whether 32-bit is useful was discussed when e0ef57cc56c3c96
was made:
  http://marc.info/?l=linux-netdev&m=117218144409825&w=2
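
A simplified model of the problem, assuming an increase rule of the
form "grow the window once the counter reaches the window": with a
16-bit counter the threshold is never reached once the window exceeds
65535. The function names and the ACK limit parameter are made up for
the illustration.

```c
#include <stdint.h>

/*
 * Count ACKs until a 16-bit counter reaches cwnd.
 * Returns 0 if that does not happen within `limit` ACKs.
 */
static uint32_t acks_until_increase_u16(uint32_t cwnd, uint32_t limit)
{
	uint16_t cnt = 0;

	for (uint32_t acks = 1; acks <= limit; acks++) {
		cnt++;			/* wraps to 0 after 65535 */
		if (cnt >= cwnd)
			return acks;
	}
	return 0;
}

/* Same model with a 32-bit counter. */
static uint32_t acks_until_increase_u32(uint32_t cwnd, uint32_t limit)
{
	uint32_t cnt = 0;

	for (uint32_t acks = 1; acks <= limit; acks++) {
		cnt++;
		if (cnt >= cwnd)
			return acks;
	}
	return 0;
}
```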

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[TCP]: Update the /proc/net/tcp documentation
Jean Delvare [Mon, 15 Oct 2007 19:58:35 +0000 (12:58 -0700)]
[TCP]: Update the /proc/net/tcp documentation

* Say that this interface is deprecated.
* Update function name references to match the current code.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years agonetdev: convert non-obvious instances to use ARRAY_SIZE()
Alejandro Martinez Ruiz [Mon, 15 Oct 2007 01:37:43 +0000 (03:37 +0200)]
netdev: convert non-obvious instances to use ARRAY_SIZE()

This will convert remaining non-obvious or naive calculations of array
sizes to use ARRAY_SIZE() macro.
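
For illustration, a minimal user-space sketch of the conversion. The
macro below is a simplified stand-in for the kernel one, and the table
is hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel macro. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table of supported link speeds, in Mbit/s. */
static const int link_speeds[] = { 10, 100, 1000 };

/* Open-coded calculation of the kind such a conversion replaces. */
static size_t count_open_coded(void)
{
	return sizeof(link_speeds) / sizeof(int);
}

/* Same result via the macro; stays correct if the element type changes. */
static size_t count_with_macro(void)
{
	return ARRAY_SIZE(link_speeds);
}
```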

Signed-off-by: Alejandro Martinez Ruiz <alex@flawedcode.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago[NETNS]: Don't panic on creating the namespace's loopback
Pavel Emelyanov [Mon, 15 Oct 2007 19:55:33 +0000 (12:55 -0700)]
[NETNS]: Don't panic on creating the namespace's loopback

When the loopback device fails to initialize inside a new
namespace, panic() is called. Do not do this when the namespace
in question is not init_net.

Also clean up the error path a bit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years agoReinstate lost flush_ioremap_region() fix to pxa2xx-flash driver
Linus Torvalds [Mon, 15 Oct 2007 19:55:20 +0000 (12:55 -0700)]
Reinstate lost flush_ioremap_region() fix to pxa2xx-flash driver

Commit 90833fdab89da02fc0276224167f0a42e5176f41 ("[ARM] 4554/1: replace
consistent_sync() with flush_ioremap_region()") introduced a new
"flush_ioremap_region()" function to be used by the MTD mainstone-flash
and lubbock-flash drivers to fix a regression from around 2.6.18.

Those drivers were independently merged into a single driver by Todd
Poynor in commit e644f7d6289456657996df4192de76c5d0a9f9c7 ("[MTD] MAPS:
Merge Lubbock and Mainstone drivers into common PXA2xx driver")

Later, those two commits were merged into the main MTD tree by commit
b160292cc216a50fd0cd386b0bda2cd48352c73b ("Merge Linux 2.6.23") by David
Woodhouse, but in that merge, the fix to use flush_ioremap_region() got
lost (as it was to files that now no longer existed).

This reinstates the fix in the new driver.

Noticed-by: Russell King <rmk@arm.linux.org.uk>
Tested-and-acked-by: Nicolas Pitre <nico@cam.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Todd Poynor <tpoynor@mvista.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago[NEIGH]: Ensure that pneigh_lookup is protected with RTNL
Pavel Emelyanov [Mon, 15 Oct 2007 19:54:15 +0000 (12:54 -0700)]
[NEIGH]: Ensure that pneigh_lookup is protected with RTNL

pneigh_lookup is used to look up proxy entries and to
create them in case the lookup failed.

However, the "creation" code does not perform the re-lookup
after GFP_KERNEL allocation. This is done because the code
is expected to be protected with the RTNL lock, so add the
assertion (mainly to address future questions from new network
developers like me :) ).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: kmalloc+memset -> kzalloc in frag_alloc_queue
Denis V. Lunev [Mon, 15 Oct 2007 19:53:13 +0000 (12:53 -0700)]
[INET]: kmalloc+memset -> kzalloc in frag_alloc_queue

kmalloc + memset -> kzalloc in frag_alloc_queue
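
A user-space analogue of the same cleanup, using malloc()/memset()
versus calloc() in place of the kernel allocators. The struct is
hypothetical and stands in for a frag queue.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical queue object, for illustration only. */
struct queue_demo {
	int len;
	void *fragments;
};

/* Two-step form: allocate, then clear. */
static struct queue_demo *alloc_two_step(void)
{
	struct queue_demo *q = malloc(sizeof(*q));

	if (!q)
		return NULL;
	memset(q, 0, sizeof(*q));
	return q;
}

/* One-step form: the allocator returns zeroed memory. */
static struct queue_demo *alloc_zeroed(void)
{
	return calloc(1, sizeof(struct queue_demo));
}
```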

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[ISDN]: Fix compile with CONFIG_ISDN_X25 disabled.
Denis V. Lunev [Mon, 15 Oct 2007 19:52:20 +0000 (12:52 -0700)]
[ISDN]: Fix compile with CONFIG_ISDN_X25 disabled.

On Mon, Oct 15, 2007 at 06:44:56PM +0400, Denis V. Lunev wrote:
Compilation fix. The problem appears after
7c076d1de869256848dacb8de0050a3a390f95df by Karsten Keil <kkeil@suse.de>

Acked-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[libata] pata_pcmcia: Add additional id string (corsair, 1GB)
Kristoffer Ericson [Mon, 15 Oct 2007 19:51:42 +0000 (15:51 -0400)]
[libata] pata_pcmcia: Add additional id string (corsair, 1GB)

Signed-off-by: Kristoffer Ericson <kristoffer.ericson@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
17 years ago[IPV6]: Replace sk_buff ** with sk_buff * in input handlers
Herbert Xu [Mon, 15 Oct 2007 19:50:28 +0000 (12:50 -0700)]
[IPV6]: Replace sk_buff ** with sk_buff * in input handlers

With all the users of the double pointers removed from the IPv6 input path,
this patch converts all occurrences of sk_buff ** to sk_buff * in IPv6 input
handlers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years agoscsi/gdth: fix crash in gdth_timeout if no gdth controllers found
Linus Torvalds [Mon, 15 Oct 2007 19:46:16 +0000 (12:46 -0700)]
scsi/gdth: fix crash in gdth_timeout if no gdth controllers found

If the gdth module is loaded (or compiled in), the gdth_timeout function
gets started even if no actual gdth controllers are found by the probing.

That ends up not only being unnecessary, but also causes a crash due to
the function blindly just trying to pick the first entry off the
"gdth_instances" list, and accessing it - which obviously doesn't work
if the list is empty!
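
The general guard can be sketched in user space as follows. This uses
a tiny hand-rolled circular list rather than the kernel list helpers,
and every name in it is hypothetical; it only shows why the first
entry must not be taken from an empty list.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of a list head. */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static int list_is_empty(const struct node *head)
{
	return head->next == head;
}

static void list_add_front(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Hypothetical controller; link is the first member so the cast is valid. */
struct controller_demo {
	struct node link;
	int id;
};

/* Return the first controller, or NULL when none were registered. */
static struct controller_demo *first_controller(struct node *head)
{
	if (list_is_empty(head))
		return NULL;	/* head->next would point back at the head itself */
	return (struct controller_demo *)head->next;
}
```

Without the emptiness check, the "first entry" of an empty list is the
list head reinterpreted as a controller, which is the kind of access
described above.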

Noticed by Ingo Molnar.

Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agolibata: prevent devices with blank model names from being DMA blacklisted
Andrew Paprocki [Mon, 15 Oct 2007 19:43:12 +0000 (15:43 -0400)]
libata: prevent devices with blank model names from being DMA blacklisted

The strn_pattern_cmp routine does not handle a blank name parameter
properly. The only patterns which should match a blank name are "*"
and an explicit "". If the function is passed a blank name in current
code, it will always match against the patt parameter. The bug manifests
itself as the device with the empty model name always matching the first
device in the DMA blacklist, forcing it to revert to PIO mode.
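
A sketch of a comparison with the blank-name guard described above. It
is written for illustration under the assumptions stated in the
comments and is not a copy of the libata routine; the function name is
hypothetical.

```c
#include <string.h>

/*
 * Compare name against patt. A wildchar as the last character of patt
 * matches any suffix. Returns 0 on match, non-zero otherwise.
 * Assumes wildchar is a non-zero character.
 */
static int pattern_cmp(const char *patt, const char *name, int wildchar)
{
	const char *p = strchr(patt, wildchar);
	size_t len;

	if (p && p[1] == '\0') {
		/* trailing wildcard: compare only the prefix before it */
		len = (size_t)(p - patt);
	} else {
		len = strlen(name);
		if (len == 0)
			/* a blank name matches only an explicitly empty pattern */
			return *patt ? -1 : 0;
	}
	return strncmp(patt, name, len);
}
```

Without the len == 0 branch, strncmp() would be asked to compare zero
characters and would report a match for every pattern.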

Signed-off-by: Andrew Paprocki <andrew@ishiboo.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
17 years agoata_piix: SATA 2port controller port map fix
Jason Gaston [Thu, 11 Oct 2007 23:05:15 +0000 (16:05 -0700)]
ata_piix: SATA 2port controller port map fix

This patch adds a port map for ICH9 and ICH8 SATA controllers that have only 2 ports available in that mode.

Signed-off-by: Jason Gaston <jason.d.gaston@intel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years agopata_cs5536: ATA driver for Geode companion chip
Martin K. Petersen [Thu, 11 Oct 2007 07:38:19 +0000 (03:38 -0400)]
pata_cs5536: ATA driver for Geode companion chip

This is a driver for the ATA controller on the Geode CS5536 companion
chip.  The PCI device ID for this device was previously claimed by
pata_amd.c but the PIO timings were not correct.  This driver also
works around a bug in some BIOSes that handle unaligned access to the
PCI config registers poorly.  Finally, the driver allows fallback to
using MSR registers for configuration on BIOSes that are truly
broken.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago[SELINUX]: Update for netfilter ->hook() arg changes.
David S. Miller [Mon, 15 Oct 2007 09:58:25 +0000 (02:58 -0700)]
[SELINUX]: Update for netfilter ->hook() arg changes.

They take a "struct sk_buff *" instead of a "struct sk_buff **" now.

Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Consolidate the xxx_put
Pavel Emelyanov [Mon, 15 Oct 2007 09:41:56 +0000 (02:41 -0700)]
[INET]: Consolidate the xxx_put

These ones use the generic data types too, so move
them in one place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Small cleanup for xxx_put after evictor consolidation
Pavel Emelyanov [Mon, 15 Oct 2007 09:41:09 +0000 (02:41 -0700)]
[INET]: Small cleanup for xxx_put after evictor consolidation

After the evictor code is consolidated there is no need to
pass the extra pointer to the xxx_put() functions.

The only place where it made sense was the evictor code itself.

Maybe this change should go with the previous (or with the
next) patch, but I try to keep the patches as short as
possible to simplify the review (they are still large
anyway), so this change goes in a separate patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Consolidate the xxx_evictor
Pavel Emelyanov [Mon, 15 Oct 2007 09:40:06 +0000 (02:40 -0700)]
[INET]: Consolidate the xxx_evictor

The evictors collect some statistics for ipv4 and ipv6,
so make it return the number of evicted queues and account
them all at once in the caller.

The XXX_ADD_STATS_BH() macros are just for this case,
but maybe there are places in code, that can make use of
them as well.
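
The "return a count, account once in the caller" pattern can be
sketched like this. The struct, thresholds and counter are
hypothetical stand-ins, not the inet_frags code.

```c
/* Hypothetical bookkeeping for a set of frag queues. */
struct frags_acct_demo {
	int mem;		/* memory currently used by queues */
	int low_thresh;		/* evict until mem drops to this level */
	int nqueues;		/* number of queues that can be evicted */
	int queue_size;		/* memory freed per evicted queue */
};

/* Generic evictor: frees queues and reports how many were evicted. */
static int frag_evictor(struct frags_acct_demo *f)
{
	int evicted = 0;

	while (f->mem > f->low_thresh && f->nqueues > 0) {
		f->mem -= f->queue_size;
		f->nqueues--;
		evicted++;
	}
	return evicted;
}

/* Protocol-specific caller: updates its own statistic in one step. */
static void ip4_evictor_demo(struct frags_acct_demo *f,
			     unsigned long *reasm_fails)
{
	*reasm_fails += (unsigned long)frag_evictor(f);
}
```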

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Consolidate the xxx_frag_destroy
Pavel Emelyanov [Mon, 15 Oct 2007 09:39:14 +0000 (02:39 -0700)]
[INET]: Consolidate the xxx_frag_destroy

To make it possible we need to know the exact frag queue
size for inet_frags->mem management and two callbacks:

 * to destroy the skb (optional, used in conntracks only)
 * to free the queue itself (mandatory, but later I plan to
   move the allocation and the destruction of frag_queues
   into the common place, so this callback will most likely
   be optional too).
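
One way to picture the interface described above is the user-space
sketch below: a queue size used for memory accounting, an optional
per-fragment callback and a mandatory destructor. All names, fields
and the two sample callbacks are assumptions for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-protocol description of how to tear down a queue. */
struct frags_ops_demo {
	size_t qsize;				/* exact size of one queue object */
	size_t mem;				/* bytes currently accounted */
	void (*skb_free)(void *skb);		/* optional, may be NULL */
	void (*destructor)(void *queue);	/* mandatory, frees the queue */
};

/* Hypothetical fragment: a buffer pointer and its accounted size. */
struct frag_demo {
	void *skb;
	size_t truesize;
};

/* Generic destroy: release fragments, fix accounting, free the queue. */
static void frag_destroy(struct frags_ops_demo *f, void *queue,
			 struct frag_demo *frags, size_t nfrags)
{
	for (size_t i = 0; i < nfrags; i++) {
		if (f->skb_free)
			f->skb_free(frags[i].skb);
		f->mem -= frags[i].truesize;
	}
	f->mem -= f->qsize;
	f->destructor(queue);
}

/* Sample callbacks used only to exercise the sketch. */
static int freed_skbs;

static void count_skb_free(void *skb)
{
	(void)skb;
	freed_skbs++;
}

static void free_queue(void *queue)
{
	free(queue);
}
```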

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Consolidate xxx_the secret_rebuild
Pavel Emelyanov [Mon, 15 Oct 2007 09:38:08 +0000 (02:38 -0700)]
[INET]: Consolidate xxx_the secret_rebuild

This code works with the generic data types as well, so
move this into inet_fragment.c

This move makes it possible to hide the secret_timer
management and the secret_rebuild routine completely in
the inet_fragment.c

Introduce the ->hashfn() callback in inet_frags to get
the hash function for a given inet_frag_queue object.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Consolidate the xxx_frag_kill
Pavel Emelyanov [Mon, 15 Oct 2007 09:37:18 +0000 (02:37 -0700)]
[INET]: Consolidate the xxx_frag_kill

Since all the xxx_frag_kill functions now work
with the generic inet_frag_queue data type, this can
be moved into a common place.

The xxx_unlink() code is moved as well.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Collect common frag sysctl variables together
Pavel Emelyanov [Mon, 15 Oct 2007 09:33:45 +0000 (02:33 -0700)]
[INET]: Collect common frag sysctl variables together

Some sysctl variables are used to tune frag queue management,
and they are the same for all the frag management code. It
will be useful to work with them in a common way in the
future, so move them into one structure.

I don't place them in the existing inet_frags object,
introduced in the previous patch, for two reasons:

 1. to keep them in the __read_mostly section;
 2. not to export the whole inet_frags objects outside.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Collect frag queues management objects together
Pavel Emelyanov [Mon, 15 Oct 2007 09:31:52 +0000 (02:31 -0700)]
[INET]: Collect frag queues management objects together

There are some objects that are common in all the places
which are used to keep track of frag queues, they are:

 * hash table
 * LRU list
 * rw lock
 * rnd number for hash function
 * the number of queues
 * the amount of memory occupied by queues
 * secret timer

Move all this stuff into one structure (struct inet_frags)
to make it possible to use them uniformly in the future. As
with the previous patch, this mostly consists of hunks like

-    write_lock(&ipfrag_lock);
+    write_lock(&ip4_frags.lock);

To address the issue with exporting the number of queues and
the amount of memory occupied by queues outside the .c file
they are declared in, I introduce a couple of helpers.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 years ago[INET]: Move common fields from frag_queues in one place.
Pavel Emelyanov [Mon, 15 Oct 2007 09:24:19 +0000 (02:24 -0700)]
[INET]: Move common fields from frag_queues in one place.

Introduce the struct inet_frag_queue in include/net/inet_frag.h
file and place there all the common fields from three structs:

 * struct ipq in ipv4/ip_fragment.c
 * struct nf_ct_frag6_queue in nf_conntrack_reasm.c
 * struct frag_queue in ipv6/reassembly.c

After this, replace these fields in the appropriate structures with
an instance of this structure and fix the users to use the correct names,
i.e. hunks like

-    atomic_dec(&fq->refcnt);
+    atomic_dec(&fq->q.refcnt);

(these occupy most of the patch)
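
The embedding behind such hunks can be sketched as follows. The
structs are trimmed-down, hypothetical versions with plain int
counters; they only show how a common part named q lets one helper
serve several queue types.

```c
/* Common part shared by all queue types (illustrative fields only). */
struct inet_frag_queue_demo {
	int refcnt;
	int len;
};

/* Protocol-specific queues embed the common part as member q. */
struct ipq_demo {
	struct inet_frag_queue_demo q;
	unsigned int saddr;
};

struct frag_queue6_demo {
	struct inet_frag_queue_demo q;
	unsigned int id;
};

/* One helper now works for every queue type via the embedded member. */
static void frag_queue_put(struct inet_frag_queue_demo *q)
{
	q->refcnt--;
}
```

Callers that previously touched fq->refcnt directly go through
fq->q.refcnt, or pass &fq->q to a shared helper.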

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>