GitHub/mt8127/android_kernel_alcatel_ttab.git
14 years agofs: Add FITRIM ioctl
Lukas Czerner [Thu, 28 Oct 2010 01:30:11 +0000 (21:30 -0400)]
fs: Add FITRIM ioctl

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
start;
len;
minlen;
}

start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
  number of Bytes will be ignored. This will be rounded up to fs
  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: Use return value from sb_issue_discard()
Lukas Czerner [Thu, 28 Oct 2010 01:30:11 +0000 (21:30 -0400)]
ext4: Use return value from sb_issue_discard()

Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: Check return value of sb_getblk() and friends
Namhyung Kim [Thu, 28 Oct 2010 01:30:11 +0000 (21:30 -0400)]
ext4: Check return value of sb_getblk() and friends

Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use bio layer instead of buffer layer in mpage_da_submit_io
Theodore Ts'o [Thu, 28 Oct 2010 01:30:10 +0000 (21:30 -0400)]
ext4: use bio layer instead of buffer layer in mpage_da_submit_io

Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:10 +0000 (21:30 -0400)]
ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()

This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: inline walk_page_buffers() into mpage_da_submit_io
Theodore Ts'o [Thu, 28 Oct 2010 01:30:10 +0000 (21:30 -0400)]
ext4: inline walk_page_buffers() into mpage_da_submit_io

Expand the call:

  if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                        ext4_bh_delay_or_unwritten))
goto redirty_page

into mpage_da_submit_io().

This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: inline ext4_writepage() into mpage_da_submit_io()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:09 +0000 (21:30 -0400)]
ext4: inline ext4_writepage() into mpage_da_submit_io()

As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit.  This makes it clearer what mpage_da_submit needs to do.

Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.

This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: simplify ext4_writepage()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:09 +0000 (21:30 -0400)]
ext4: simplify ext4_writepage()

The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: call mpage_da_submit_io() from mpage_da_map_blocks()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:09 +0000 (21:30 -0400)]
ext4: call mpage_da_submit_io() from mpage_da_map_blocks()

Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().

We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use KMEM_CACHE instead of kmem_cache_create
Theodore Ts'o [Thu, 28 Oct 2010 01:30:09 +0000 (21:30 -0400)]
ext4: use KMEM_CACHE instead of kmem_cache_create

Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache.  This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use search_dirblock() in ext4_dx_find_entry()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:08 +0000 (21:30 -0400)]
ext4: use search_dirblock() in ext4_dx_find_entry()

Use the search_dirblock() in ext4_dx_find_entry().  It makes the code
easier to read, and it takes advantage of common code.  It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
14 years agoext4: avoid uninitialized memory references in ext3_htree_next_block()
Theodore Ts'o [Thu, 28 Oct 2010 01:30:08 +0000 (21:30 -0400)]
ext4: avoid uninitialized memory references in ext3_htree_next_block()

If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code.  The warning was harmless, but it was
useful in pointing out code that was too ugly to live.  This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
14 years agoext4: remove unused ext4_sb_info members
Eric Sandeen [Thu, 28 Oct 2010 01:30:08 +0000 (21:30 -0400)]
ext4: remove unused ext4_sb_info members

Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: queue conversion after adding to inode's completed IO list
Eric Sandeen [Thu, 28 Oct 2010 01:30:07 +0000 (21:30 -0400)]
ext4: queue conversion after adding to inode's completed IO list

By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.

It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.

Thanks to Dave Chinner for pointing out the race.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: don't use ext4_allocation_contexts for tracing
Eric Sandeen [Thu, 28 Oct 2010 01:30:07 +0000 (21:30 -0400)]
ext4: don't use ext4_allocation_contexts for tracing

Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: fix oops in trace_ext4_mb_release_group_pa
Eric Sandeen [Thu, 28 Oct 2010 01:30:07 +0000 (21:30 -0400)]
ext4: fix oops in trace_ext4_mb_release_group_pa

Our QA reported an oops in the ext4_mb_release_group_pa tracing,
and Josef Bacik pointed out that it was because we may have a
non-null but uninitialized ac_inode in the allocation context.

I can reproduce it when running xfstests with ext4 tracepoints on,
on a CONFIG_SLAB_DEBUG kernel.

We call trace_ext4_mb_release_group_pa from 2 places,
ext4_mb_discard_group_preallocations and
ext4_mb_discard_lg_preallocations

In both cases we allocate an ac as a container just for tracing (!)
and never fill in the ac_inode.  There's no reason to be assigning,
testing, or printing it as far as I can see, so just remove it from
the tracepoint.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: fix potential infinite loop in ext4_da_writepages()
Toshiyuki Okajima [Thu, 28 Oct 2010 01:30:07 +0000 (21:30 -0400)]
ext4: fix potential infinite loop in ext4_da_writepages()

On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:

========================================================================
#!/bin/sh

echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================

We can see the following backtrace if we get the kdump when this
hangup occurs:

======================================================================
kthread()
=> bdi_writeback_thread()
   => wb_do_writeback()
      => wb_writeback()
         => writeback_inodes_wb()
            => writeback_sb_inodes()
               => writeback_single_inode()
                  => ext4_da_writepages()  ---+
                                ^ infinite    |
                                |   loop      |
                                +-------------+
======================================================================

The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem
   maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
   - the member, "m_len" of struct mpage_da_data is 1
  mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
  cannot clear "BH_Delay" flag of the buffer_head because the type of
  m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.

  Therefore an infinite loop occurs because ext4_da_writepages()
  cannot write the page (which corresponds to the block) since
  "BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
struct ext4_map_blocks *map)
{
...
int blocks = map->m_len;
...
do {
// cur_logical = 4294967295
// map->m_lblk = 4294967295
// blocks = 1
// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
// (cur_logical >= map->m_lblk + blocks) => true
if (cur_logical >= map->m_lblk + blocks)
break;
----------------------------------------------------------------------

NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: improve llseek error handling for overly large seek offsets
Toshiyuki Okajima [Thu, 28 Oct 2010 01:30:06 +0000 (21:30 -0400)]
ext4: improve llseek error handling for overly large seek offsets

The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.

If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
14 years agoext4: don't update sb journal_devnum when RO dev
Maciej Żenczykowski [Thu, 28 Oct 2010 01:30:06 +0000 (21:30 -0400)]
ext4: don't update sb journal_devnum when RO dev

An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use sb_issue_zeroout in ext4_ext_zeroout
Lukas Czerner [Thu, 28 Oct 2010 01:30:06 +0000 (21:30 -0400)]
ext4: use sb_issue_zeroout in ext4_ext_zeroout

Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use sb_issue_zeroout in setup_new_group_blocks
Lukas Czerner [Thu, 28 Oct 2010 01:30:05 +0000 (21:30 -0400)]
ext4: use sb_issue_zeroout in setup_new_group_blocks

Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: add interface to advertise ext4 features in sysfs
Lukas Czerner [Thu, 28 Oct 2010 01:30:05 +0000 (21:30 -0400)]
ext4: add interface to advertise ext4 features in sysfs

User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: add support for lazy inode table initialization
Lukas Czerner [Thu, 28 Oct 2010 01:30:05 +0000 (21:30 -0400)]
ext4: add support for lazy inode table initialization

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoAdd helper function for blkdev_issue_zeroout (sb_issue_discard)
Lukas Czerner [Thu, 28 Oct 2010 01:30:04 +0000 (21:30 -0400)]
Add helper function for blkdev_issue_zeroout (sb_issue_discard)

This is done the same way as helper sb_issue_discard for
blkdev_issue_discard.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agojbd2: Add sanity check for attempts to start handle during umount
Theodore Ts'o [Thu, 28 Oct 2010 01:30:04 +0000 (21:30 -0400)]
jbd2: Add sanity check for attempts to start handle during umount

An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup.  So add some
checking to make it much more obvious when this happens to and to
determine where the offending code is located.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: fix NULL pointer dereference in print_daily_error_info()
Sergey Senozhatsky [Thu, 28 Oct 2010 01:30:04 +0000 (21:30 -0400)]
ext4: fix NULL pointer dereference in print_daily_error_info()

Fix NULL pointer dereference in print_daily_error_info, when
called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error
reporting timer in ext4_put_super.

Google-Bug-Id: 3017663

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: don't hold spinlock while calling ext4_issue_discard()
Lukas Czerner [Thu, 28 Oct 2010 01:30:04 +0000 (21:30 -0400)]
ext4: don't hold spinlock while calling ext4_issue_discard()

We can't hold the block group spinlock because we ext4_issue_discard()
calls wait and hence can get rescheduled.

Google-Bug-Id: 3017678

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: check for negative error code from sb_issue_discard
Lukas Czerner [Thu, 28 Oct 2010 01:30:03 +0000 (21:30 -0400)]
ext4: check for negative error code from sb_issue_discard

sb_issue_discard() is returning negative error code, so check for
-EOPNOTSUPP.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: don't bump up LONG_MAX nr_to_write by a factor of 8
Eric Sandeen [Thu, 28 Oct 2010 01:30:03 +0000 (21:30 -0400)]
ext4: don't bump up LONG_MAX nr_to_write by a factor of 8

I'm uneasy with lots of stuff going on in ext4_da_writepages(),
but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
making anything better, so avoid the multiplier in that case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: stop looping in ext4_num_dirty_pages when max_pages reached
Eric Sandeen [Thu, 28 Oct 2010 01:30:03 +0000 (21:30 -0400)]
ext4: stop looping in ext4_num_dirty_pages when max_pages reached

Today we simply break out of the inner loop when we have accumulated
max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
in the while (!done) loop, this does potentially a lot of work
with no net effect.

When we have accumulated max_pages, just clean up and return.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: use dedicated slab caches for group_info structures
Curt Wohlgemuth [Thu, 28 Oct 2010 01:29:12 +0000 (21:29 -0400)]
ext4: use dedicated slab caches for group_info structures

ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object.  On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)

This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.

Google-Bug-Id: 2980809

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: avoid null dereference in trace_ext4_mballoc_discard
Wen Congyang [Thu, 28 Oct 2010 01:27:12 +0000 (21:27 -0400)]
ext4: avoid null dereference in trace_ext4_mballoc_discard

ac->inode is set to null in function ext4_mb_release_group_pa(),
and then trace_ext4_mballoc_discard(ac) is called, the kernel
will panic.

BUG: unable to handle kernel NULL pointer dereference at 000000a4
IP: [<f87e1714>] ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
*pdpt = 0000000000abd001 *pde = 0000000000000000
Oops: 0000 [#1] SMP

Pid: 550, comm: flush-8:16 Not tainted 2.6.36-rc1 #1 SE7320EP2/Altos G530
EIP: 0060:[<f87e1714>] EFLAGS: 00010206 CPU: 1
EIP is at ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
EAX: f32ac840 EBX: f3f1cf88 ECX: f32ac840 EDX: 00000000
ESI: f32ac83c EDI: f880b9d8 EBP: 00000000 ESP: f4b77ae4
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process flush-8:16 (pid: 550, ti=f4b76000 task=f613e540 task.ti=f4b76000)
Call Trace:
 [<f87f5ac1>] ? ext4_mb_release_group_pa+0x121/0x150 [ext4]
 [<f87f8356>] ? ext4_mb_discard_group_preallocations+0x336/0x400 [ext4]
 [<f87fb7f1>] ? ext4_mb_new_blocks+0x3d1/0x4f0 [ext4]
 [<c05a6c5b>] ? __make_request+0x10b/0x440
 [<f87f1fb4>] ? ext4_ext_map_blocks+0x1334/0x1980 [ext4]
 [<c04ac78a>] ? rb_reserve_next_event+0xaa/0x3b0
 [<f87d18d6>] ? ext4_map_blocks+0xd6/0x1d0 [ext4]
 [<f87d2da7>] ? mpage_da_map_blocks+0xc7/0x8a0 [ext4]
 [<c04c8a68>] ? find_get_pages_tag+0x38/0x110
 [<c04d23a5>] ? __pagevec_release+0x15/0x20
 [<f87d3ca5>] ? ext4_da_writepages+0x2b5/0x5d0 [ext4]
 [<c04cfbe0>] ? __writepage+0x0/0x30
 [<c04d0e34>] ? do_writepages+0x14/0x30
 [<c0526600>] ? writeback_single_inode+0xa0/0x240
 [<c0526971>] ? writeback_sb_inodes+0xc1/0x180
 [<c0526ab8>] ? writeback_inodes_wb+0x88/0x140
 [<c0526d7b>] ? wb_writeback+0x20b/0x320
 [<c045aca7>] ? lock_timer_base+0x27/0x50
 [<c0526fe0>] ? wb_do_writeback+0x150/0x190
 [<c05270a8>] ? bdi_writeback_thread+0x88/0x1f0
 [<c043b680>] ? complete+0x40/0x60
 [<c0527020>] ? bdi_writeback_thread+0x0/0x1f0
 [<c0469474>] ? kthread+0x74/0x80
 [<c0469400>] ? kthread+0x0/0x80
 [<c040a23e>] ? kernel_thread_helper+0x6/0x10

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agojbd2: Fix I/O hang in jbd2_journal_release_jbd_inode
Brian King [Thu, 28 Oct 2010 01:25:12 +0000 (21:25 -0400)]
jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode

This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.

This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoext4: fix EOFBLOCKS_FL handling
Theodore Ts'o [Thu, 28 Oct 2010 01:23:12 +0000 (21:23 -0400)]
ext4: fix EOFBLOCKS_FL handling

It turns out we have several problems with how EOFBLOCKS_FL is
handled.  First of all, there was a fencepost error where we were not
clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
but rather when we allocate the next block _after_ the uninitalized
block.  Secondly we were not testing to see if we needed to clear the
EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
an uninitialized block (which is the most common case).

Google-Bug-Id: 2928259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
14 years agoLinux 2.6.36-rc6
Linus Torvalds [Wed, 29 Sep 2010 01:01:22 +0000 (18:01 -0700)]
Linux 2.6.36-rc6

14 years agoMN10300: Handle missing sys_cacheflush() when caching disabled
David Howells [Wed, 29 Sep 2010 00:57:02 +0000 (01:57 +0100)]
MN10300: Handle missing sys_cacheflush() when caching disabled

When caching is disabled on the MN10300 arch, the sys_cacheflush()
function is removed by conditional stuff in the makefiles, but is still
referred to by the syscall table.

Provide a null version that just returns 0 when caching is disabled (or
-EINVAL if the arguments are silly).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoalpha: fix compile problem in arch/alpha/kernel/signal.c
Linus Torvalds [Tue, 28 Sep 2010 20:26:57 +0000 (13:26 -0700)]
alpha: fix compile problem in arch/alpha/kernel/signal.c

Tssk.  Apparently Al hadn't checked commit c52c2ddc1dfa ("alpha: switch
osf_sigprocmask() to use of sigprocmask()") at all. It doesn't compile.

Fixed as per suggestions from Michael Cree.

Reported-by: Michael Cree <mcree@orcon.net.nz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoMerge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzi...
Linus Torvalds [Tue, 28 Sep 2010 19:38:52 +0000 (12:38 -0700)]
Merge branch 'upstream-linus' of git://git./linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  ahci: fix module refcount breakage introduced by libahci split

14 years agoahci: fix module refcount breakage introduced by libahci split
Tejun Heo [Tue, 21 Sep 2010 07:25:48 +0000 (09:25 +0200)]
ahci: fix module refcount breakage introduced by libahci split

libata depends on scsi_host_template for module reference counting and
sht's should be owned by each low level driver.  During libahci split,
the sht was left with libahci.ko leaving the actual low level drivers
not reference counted.  This made ahci and ahci_platform always
unloadable even while they're being actively used.

Fix it by defining AHCI_SHT() macro in ahci.h and defining a sht for
each low level ahci driver.

stable: only applicable to 2.6.35.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pedro Francisco <pedrogfrancisco@gmail.com>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@kernel.org
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agoMerge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groec...
Linus Torvalds [Tue, 28 Sep 2010 19:13:13 +0000 (12:13 -0700)]
Merge branch 'hwmon-for-linus' of git://git./linux/kernel/git/groeck/staging

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
  hwmon (coretemp): Fix build breakage if SMP is undefined

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes...
Linus Torvalds [Tue, 28 Sep 2010 19:02:22 +0000 (12:02 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jbarnes/pci-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
  PCI: fix pci_resource_alignment prototype

14 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Linus Torvalds [Tue, 28 Sep 2010 19:01:26 +0000 (12:01 -0700)]
Merge git://git./linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  tcp: Fix >4GB writes on 64-bit.
  net/9p: Mount only matching virtio channels
  de2104x: fix ethtool
  tproxy: check for transparent flag in ip_route_newports
  ipv6: add IPv6 to neighbour table overflow warning
  tcp: fix TSO FACK loss marking in tcp_mark_head_lost
  3c59x: fix regression from patch "Add ethtool WOL support"
  ipv6: add a missing unregister_pernet_subsys call
  s390: use free_netdev(netdev) instead of kfree()
  sgiseeq: use free_netdev(netdev) instead of kfree()
  rionet: use free_netdev(netdev) instead of kfree()
  ibm_newemac: use free_netdev(netdev) instead of kfree()
  smsc911x: Add MODULE_ALIAS()
  net: reset skb queue mapping when rx'ing over tunnel
  br2684: fix scheduling while atomic
  de2104x: fix TP link detection
  de2104x: fix power management
  de2104x: disable autonegotiation on broken hardware
  net: fix a lockdep splat
  e1000e: 82579 do not gate auto config of PHY by hardware during nominal use
  ...

14 years agohwmon (coretemp): Fix build breakage if SMP is undefined
Guenter Roeck [Tue, 28 Sep 2010 01:01:49 +0000 (18:01 -0700)]
hwmon (coretemp): Fix build breakage if SMP is undefined

Commit e40cc4bdfd4b89813f072f72bd9c7055814d3f0f introduced
a build breakage if CONFIG_SMP is undefined. This commit
fixes the problem.

This fix is only a workaround. For a real fix, cpu_sibling_mask() should
be defined in UP include code, eg in linux/smp.h, and asm/smp.h should not be
included directly. This fix is currently not possible because asm/smp.h defines
cpu_sibling_mask() unconditionally and is included directly from many source
files.

Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
14 years agoMerge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux...
Linus Torvalds [Tue, 28 Sep 2010 04:19:27 +0000 (21:19 -0700)]
Merge branch 'x86/urgent' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Avoid 'constant_test_bit()' misoptimization due to cast to non-volatile

14 years agotcp: Fix >4GB writes on 64-bit.
David S. Miller [Tue, 28 Sep 2010 03:24:54 +0000 (20:24 -0700)]
tcp: Fix >4GB writes on 64-bit.

Fixes kernel bugzilla #16603

tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write
zero bytes, for example.

There is also the problem higher up of how verify_iovec() works.  It
wants to prevent the total length from looking like an error return
value.

However it does this using 'int', but syscalls return 'long' (and
thus signed 64-bit on 64-bit machines).  So it could trigger
false-positives on 64-bit as written.  So fix it to use 'long'.

Reported-by: Olaf Bonorden <bono@onlinehome.de>
Reported-by: Daniel Büse <dbuese@gmx.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoFix pktcdvd ioctl dev_minor range check
Dan Rosenberg [Mon, 27 Sep 2010 16:30:28 +0000 (12:30 -0400)]
Fix pktcdvd ioctl dev_minor range check

The PKT_CTRL_CMD_STATUS device ioctl retrieves a pointer to a
pktcdvd_device from the global pkt_devs array.  The index into this
array is provided directly by the user and is a signed integer, so the
comparison to ensure that it falls within the bounds of this array will
fail when provided with a negative index.

This can be used to read arbitrary kernel memory or cause a crash due to
an invalid pointer dereference.  This can be exploited by users with
permission to open /dev/pktcdvd/control (on many distributions, this is
readable by group "cdrom").

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
[ Rather than add a cast, just make the function take the right type -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoMN10300: Default config choice GDBSTUB_TTYSM0 should be GDBSTUB_ON_TTYSM0
David Howells [Mon, 27 Sep 2010 12:12:33 +0000 (13:12 +0100)]
MN10300: Default config choice GDBSTUB_TTYSM0 should be GDBSTUB_ON_TTYSM0

The configuration choice for the port on which the GDB stub listens has
a default of GDBSTUB_TTYSM0, but this should be GDBSTUB_ON_TTYSM0 to
match the option.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agonet/9p: Mount only matching virtio channels
Sven Eckelmann [Mon, 27 Sep 2010 22:54:44 +0000 (15:54 -0700)]
net/9p: Mount only matching virtio channels

p9_virtio_create will only compare the the channel's tag characters
against the device name till the end of the channel's tag but not till
the end of the device name. This means that if a user defines channels
with the tags foo and foobar then he would mount foo when he requested
foonot and may mount foo when he requested foobar.

Thus it is necessary to check both string lengths against each other in
case of a successful partial string match.

Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agode2104x: fix ethtool
Ondrej Zary [Mon, 27 Sep 2010 11:41:45 +0000 (11:41 +0000)]
de2104x: fix ethtool

When the interface is up, using ethtool breaks it because:
a) link is put down but media_timer interval is not shortened to NO_LINK
b) rxtx is stopped but not restarted

Also manual 10baseT-HD (and probably FD too - untested) mode does not work -
the link is forced up, packets are transmitted but nothing is received.
Changing CSR14 value to match documentation (not disabling link check) fixes this.

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoMerge branch 'vhost-net' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
David S. Miller [Mon, 27 Sep 2010 22:04:23 +0000 (15:04 -0700)]
Merge branch 'vhost-net' of git://git./linux/kernel/git/mst/vhost

14 years agotproxy: check for transparent flag in ip_route_newports
Ulrich Weber [Mon, 27 Sep 2010 03:31:00 +0000 (03:31 +0000)]
tproxy: check for transparent flag in ip_route_newports

as done in ip_route_connect()

Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoipv6: add IPv6 to neighbour table overflow warning
Ulrich Weber [Mon, 27 Sep 2010 22:02:18 +0000 (15:02 -0700)]
ipv6: add IPv6 to neighbour table overflow warning

IPv4 and IPv6 have separate neighbour tables, so
the warning messages should be distinguishable.

[ Add a suitable message prefix on the ipv4 side as well -DaveM ]

Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agotcp: fix TSO FACK loss marking in tcp_mark_head_lost
Yuchung Cheng [Fri, 24 Sep 2010 13:22:06 +0000 (13:22 +0000)]
tcp: fix TSO FACK loss marking in tcp_mark_head_lost

When TCP uses FACK algorithm to mark lost packets in
tcp_mark_head_lost(), if the number of packets in the (TSO) skb is
greater than the number of packets that should be marked lost, TCP
incorrectly exits the loop and marks no packets lost in the skb. This
underestimates tp->lost_out and affects the recovery/retransmission.
This patch fargments the skb and marks the correct amount of packets
lost.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland...
Linus Torvalds [Mon, 27 Sep 2010 19:33:54 +0000 (12:33 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/roland/infiniband

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  RDMA/cxgb3: Turn off RX coalescing for iWARP connections

14 years agoMerge master.kernel.org:/home/rmk/linux-2.6-arm
Linus Torvalds [Mon, 27 Sep 2010 19:32:36 +0000 (12:32 -0700)]
Merge master.kernel.org:/home/rmk/linux-2.6-arm

* master.kernel.org:/home/rmk/linux-2.6-arm: (28 commits)
  ARM: 6411/1: vexpress: set RAM latencies to 1 cycle for PL310 on ct-ca9x4 tile
  ARM: 6409/1: davinci: map sram using MT_MEMORY_NONCACHED instead of MT_DEVICE
  ARM: 6408/1: omap: Map only available sram memory
  ARM: 6407/1: mmu: Setup MT_MEMORY and MT_MEMORY_NONCACHED L1 entries
  ARM: pxa: remove pr_<level> uses of KERN_<level>
  ARM: pxa168fb: clear enable bit when not active
  ARM: pxa: fix cpu_is_pxa*() not expanding to zero when not configured
  ARM: pxa168: fix corrected reset vector
  ARM: pxa: Use PIO for PI2C communication on Palm27x
  ARM: pxa: Fix Vpac270 gpio_power for MMC
  ARM: 6401/1: plug a race in the alignment trap handler
  ARM: 6406/1: at91sam9g45: fix i2c bus speed
  leds: leds-ns2: fix locking
  ARM: dove: fix __io() definition to use bus based offset
  dmaengine: fix interrupt clearing for mv_xor
  ARM: kirkwood: Unbreak PCIe I/O port
  ARM: Fix build error when using KCONFIG_CONFIG
  ARM: 6383/1: Implement phys_mem_access_prot() to avoid attributes aliasing
  ARM: 6400/1: at91: fix arch_gettimeoffset fallout
  ARM: 6398/1: add proc info for ARM11MPCore/Cortex-A9 from ARM
  ...

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh...
Linus Torvalds [Mon, 27 Sep 2010 19:32:00 +0000 (12:32 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ericvh/v9fs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  net/9p: fix memory handling/allocation in rdma_request()

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Linus Torvalds [Mon, 27 Sep 2010 19:31:12 +0000 (12:31 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/bp/bp

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  amd64_edac: Fix driver module removal

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris...
Linus Torvalds [Mon, 27 Sep 2010 19:29:39 +0000 (12:29 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  TOMOYO: Don't abuse sys_getpid(), sys_getppid()

14 years agoMerge branch 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ickle...
Linus Torvalds [Mon, 27 Sep 2010 19:28:19 +0000 (12:28 -0700)]
Merge branch 'drm-intel-fixes' of git://git./linux/kernel/git/ickle/drm-intel

* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel:
  drm/i915/sdvo: Handle unsupported GET_SUPPORTED_ENHANCEMENTS gracefully
  drm/i915/sdvo: Cleanup connector on error path
  drm/i915: Fix 945GM regression in e259befd

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
Linus Torvalds [Mon, 27 Sep 2010 19:27:00 +0000 (12:27 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/cjb/mmc

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
  mmc: sdhci-s3c: fix NULL ptr access in sdhci_s3c_remove
  mmc: sdhci-s3c: fix incorrect spinlock usage after merge
  mmc: MAINTAINERS: add myself as MMC maintainer

14 years agoMerge branch 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6
Linus Torvalds [Mon, 27 Sep 2010 19:26:33 +0000 (12:26 -0700)]
Merge branch 'urgent' of git://git./linux/kernel/git/brodo/pcmcia-2.6

* 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6:
  pcmcia: pd6729: Fix error path
  pcmcia: preserve configuration information if request_io fails partly

14 years agoMerge git://git.infradead.org/iommu-2.6
Linus Torvalds [Mon, 27 Sep 2010 19:25:10 +0000 (12:25 -0700)]
Merge git://git.infradead.org/iommu-2.6

* git://git.infradead.org/iommu-2.6:
  intel-iommu: Use symbolic values instead of magic numbers in Lenovo w/a
  intel-iommu: Abort IOMMU setup for igfx if BIOS gave no shadow GTT space

14 years agoMerge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Mon, 27 Sep 2010 19:22:21 +0000 (12:22 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/amd-iommu: Fix rounding-bug in __unmap_single
  x86/amd-iommu: Work around S3 BIOS bug
  x86/amd-iommu: Set iommu configuration flags in enable-loop
  x86, setup: Fix earlyprintk=serial,0x3f8,115200
  x86, setup: Fix earlyprintk=serial,ttyS0,115200

14 years agoMerge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Mon, 27 Sep 2010 19:21:48 +0000 (12:21 -0700)]
Merge branch 'perf-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf, x86: Catch spurious interrupts after disabling counters
  tracing/x86: Don't use mcount in kvmclock.c
  tracing/x86: Don't use mcount in pvclock.c

14 years agomn10300: check __get_user/__put_user results...
Al Viro [Sun, 26 Sep 2010 18:29:12 +0000 (19:29 +0100)]
mn10300: check __get_user/__put_user results...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomn10300: get rid of set_fs(USER_DS) in sigframe setup
Al Viro [Sun, 26 Sep 2010 18:29:02 +0000 (19:29 +0100)]
mn10300: get rid of set_fs(USER_DS) in sigframe setup

It really has no business being there; short of a serious kernel bug
we should already have USER_DS at that point.  It shouldn't have been
done on x86 either...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomn10300: ->restart_block.fn needs to be reset on sigreturn
Al Viro [Sun, 26 Sep 2010 18:28:52 +0000 (19:28 +0100)]
mn10300: ->restart_block.fn needs to be reset on sigreturn

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomn10300: prevent double syscall restarts
Al Viro [Sun, 26 Sep 2010 18:28:42 +0000 (19:28 +0100)]
mn10300: prevent double syscall restarts

set ->orig_d0 to -1, same as what sigreturn does

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomn10300: avoid SIGSEGV delivery loop
Al Viro [Sun, 26 Sep 2010 18:28:32 +0000 (19:28 +0100)]
mn10300: avoid SIGSEGV delivery loop

force_sigsegv() is there for purpose...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoalpha: __get_user/__put_user results need to be checked...
Al Viro [Sun, 26 Sep 2010 18:28:22 +0000 (19:28 +0100)]
alpha: __get_user/__put_user results need to be checked...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoalpha: switch osf_sigprocmask() to use of sigprocmask()
Al Viro [Sun, 26 Sep 2010 18:28:12 +0000 (19:28 +0100)]
alpha: switch osf_sigprocmask() to use of sigprocmask()

get rid of a useless wrapper, while we are at it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years ago3c59x: fix regression from patch "Add ethtool WOL support"
Jan Beulich [Mon, 27 Sep 2010 18:07:00 +0000 (11:07 -0700)]
3c59x: fix regression from patch "Add ethtool WOL support"

This patch (commit 690a1f2002a3091bd18a501f46c9530f10481463) added a
new call site for acpi_set_WOL() without checking that the function is
actually suitable to be called via

 vortex_set_wol+0xcd/0xe0 [3c59x]
 dev_ethtool+0xa5a/0xb70
 dev_ioctl+0x2e0/0x4b0
 T.961+0x49/0x50
 sock_ioctl+0x47/0x290
 do_vfs_ioctl+0x7f/0x340
 sys_ioctl+0x80/0xa0
 system_call_fastpath+0x16/0x1b

i.e. outside of code paths run when the device is not yet enabled or
already disabled. In particular, putting the device into D3hot is a
pretty bad idea when it was already brought up.

Furthermore, all prior callers of the function made sure they're
actually dealing with a PCI device, while the newly added one didn't.

In the same spirit, the .get_wol handler shouldn't indicate support
for WOL for non-PCI devices.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoRDMA/cxgb3: Turn off RX coalescing for iWARP connections
Steve Wise [Sun, 19 Sep 2010 00:38:21 +0000 (19:38 -0500)]
RDMA/cxgb3: Turn off RX coalescing for iWARP connections

The HW by default has RX coalescing on.  For iWARP connections, this
causes a 100ms delay in connection establishement due to the ingress
MPA Start message being stalled in HW.  So explicitly turn RX
coalescing off when setting up iWARP connections.

This was causing very bad performance for NP64 gather operations using
Open MPI, due to the way it sets up connections on larger jobs.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
14 years agoARM: 6411/1: vexpress: set RAM latencies to 1 cycle for PL310 on ct-ca9x4 tile
Will Deacon [Mon, 27 Sep 2010 13:55:15 +0000 (14:55 +0100)]
ARM: 6411/1: vexpress: set RAM latencies to 1 cycle for PL310 on ct-ca9x4 tile

The PL310 on the ct-ca9x4 tile for the Versatile Express does not need
to add additional latency when accessing its cache RAMs. Unfortunately,
the boot monitor sets this up for an 8-cycle delay on reads and writes,
resulting in greatly reduced memory performance when the L2 cache is
enabled.

This patch sets the L2 RAM latencies to the correct value of 1 cycle
on the ct-ca9x4 tile before enabling the L2 cache.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
14 years agonet/9p: fix memory handling/allocation in rdma_request()
Davidlohr Bueso [Mon, 13 Sep 2010 15:53:18 +0000 (15:53 +0000)]
net/9p: fix memory handling/allocation in rdma_request()

Return -ENOMEM when erroring on kmalloc and fix memory leaks when returning on error.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
14 years agoamd64_edac: Fix driver module removal
Borislav Petkov [Sun, 26 Sep 2010 10:42:23 +0000 (12:42 +0200)]
amd64_edac: Fix driver module removal

f4347553b30ec66530bfe63c84530afea3803396 removed the edac polling
mechanism in favor of using a notifier chain for conveying MCE
information to edac. However, the module removal path didn't test
whether the driver had setup the polling function workqueue at all and
the rmmod process was hanging in the kernel at try_to_del_timer_sync()
in the cancel_delayed_work() path, trying to cancel an uninitialized
work struct.

Fix that by adding a balancing check to the workqueue removal path.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
14 years agox86: Avoid 'constant_test_bit()' misoptimization due to cast to non-volatile
Alexander Chumachenko [Thu, 1 Apr 2010 12:34:52 +0000 (15:34 +0300)]
x86: Avoid 'constant_test_bit()' misoptimization due to cast to non-volatile

While debugging bit_spin_lock() hang, it was tracked down to gcc-4.4
misoptimization of non-inlined constant_test_bit() due to non-volatile
addr when 'const volatile unsigned long *addr' cast to 'unsigned long *'
with subsequent unconditional jump to pause (and not to the test) leading
to hang.

Compiling with gcc-4.3 or disabling CONFIG_OPTIMIZE_INLINING yields inlined
constant_test_bit() and correct jump, thus working around the kernel bug.

Other arches than asm-x86 may implement this slightly differently;
2.6.29 mitigates the misoptimization by changing the function prototype
(commit c4295fbb6048d85f0b41c5ced5cbf63f6811c46c) but probably fixing the issue
itself is better.

Signed-off-by: Alexander Chumachenko <ledest@gmail.com>
Signed-off-by: Michael Shigorin <mike@osdn.org.ua>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
14 years agoipv6: add a missing unregister_pernet_subsys call
Neil Horman [Fri, 24 Sep 2010 09:55:52 +0000 (09:55 +0000)]
ipv6: add a missing unregister_pernet_subsys call

Clean up a missing exit path in the ipv6 module init routines.  In
addrconf_init we call ipv6_addr_label_init which calls register_pernet_subsys
for the ipv6_addr_label_ops structure.  But if module loading fails, or if the
ipv6 module is removed, there is no corresponding unregister_pernet_subsys call,
which leaves a now-bogus address on the pernet_list, leading to oopses in
subsequent registrations.  This patch cleans up both the failed load path and
the unload path.  Tested by myself with good results.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
 include/net/addrconf.h |    1 +
 net/ipv6/addrconf.c    |   11 ++++++++---
 net/ipv6/addrlabel.c   |    5 +++++
 3 files changed, 14 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agos390: use free_netdev(netdev) instead of kfree()
Vasiliy Kulikov [Mon, 27 Sep 2010 01:56:06 +0000 (18:56 -0700)]
s390: use free_netdev(netdev) instead of kfree()

Freeing netdev without free_netdev() leads to net, tx leaks.
I might lead to dereferencing freed pointer.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

@@
struct net_device* dev;
@@

-kfree(dev)
+free_netdev(dev)

Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agosgiseeq: use free_netdev(netdev) instead of kfree()
Kulikov Vasiliy [Sat, 25 Sep 2010 23:58:06 +0000 (23:58 +0000)]
sgiseeq: use free_netdev(netdev) instead of kfree()

Freeing netdev without free_netdev() leads to net, tx leaks.
I might lead to dereferencing freed pointer.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

@@
struct net_device* dev;
@@

-kfree(dev)
+free_netdev(dev)

Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agorionet: use free_netdev(netdev) instead of kfree()
Kulikov Vasiliy [Sat, 25 Sep 2010 23:58:03 +0000 (23:58 +0000)]
rionet: use free_netdev(netdev) instead of kfree()

Freeing netdev without free_netdev() leads to net, tx leaks.
I might lead to dereferencing freed pointer.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

@@
struct net_device* dev;
@@

-kfree(dev)
+free_netdev(dev)

Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoibm_newemac: use free_netdev(netdev) instead of kfree()
Kulikov Vasiliy [Sat, 25 Sep 2010 23:58:00 +0000 (23:58 +0000)]
ibm_newemac: use free_netdev(netdev) instead of kfree()

Freeing netdev without free_netdev() leads to net, tx leaks.
I might lead to dereferencing freed pointer.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

@@
struct net_device* dev;
@@

-kfree(dev)
+free_netdev(dev)

Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agosmsc911x: Add MODULE_ALIAS()
Vincent Stehlé [Mon, 27 Sep 2010 01:50:05 +0000 (18:50 -0700)]
smsc911x: Add MODULE_ALIAS()

This enables auto loading for the smsc911x ethernet driver.

Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agonet: reset skb queue mapping when rx'ing over tunnel
Tom Herbert [Thu, 23 Sep 2010 11:19:54 +0000 (11:19 +0000)]
net: reset skb queue mapping when rx'ing over tunnel

Reset queue mapping when an skb is reentering the stack via a tunnel.
On second pass, the queue mapping from the original device is no
longer valid.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agobr2684: fix scheduling while atomic
Karl Hiramoto [Thu, 23 Sep 2010 01:50:54 +0000 (01:50 +0000)]
br2684: fix scheduling while atomic

You can't call atomic_notifier_chain_unregister() while in atomic context.

Fix, call un/register_atmdevice_notifier in module __init and __exit.

Bug report:
http://comments.gmane.org/gmane.linux.network/172603

Reported-by: Mikko Vinni <mmvinni@yahoo.com>
Tested-by: Mikko Vinni <mmvinni@yahoo.com>
Signed-off-by: Karl Hiramoto <karl@hiramoto.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agoTOMOYO: Don't abuse sys_getpid(), sys_getppid()
Ben Hutchings [Sun, 26 Sep 2010 04:55:13 +0000 (05:55 +0100)]
TOMOYO: Don't abuse sys_getpid(), sys_getppid()

System call entry functions sys_*() are never to be called from
general kernel code.  The fact that they aren't declared in header
files should have been a clue.  These functions also don't exist on
Alpha since it has sys_getxpid() instead.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: James Morris <jmorris@namei.org>
14 years agode2104x: fix TP link detection
Ondrej Zary [Sat, 25 Sep 2010 10:39:17 +0000 (10:39 +0000)]
de2104x: fix TP link detection

Compex FreedomLine 32 PnP-PCI2 cards have only TP and BNC connectors but the
SROM contains AUI port too. When TP loses link, the driver switches to
non-existing AUI port (which reports that carrier is always present).

Connecting TP back generates LinkPass interrupt but de_media_interrupt() is
broken - it only updates the link state of currently connected media, ignoring
the fact that LinkPass and LinkFail bits of MacStatus register belong to the
TP port only (the chip documentation says that).

This patch changes de_media_interrupt() to switch media to TP when link goes
up (and media type is not locked) and also to update the link state only when
the TP port is used.

Also the NonselPortActive (and also SelPortActive) bits of SIAStatus register
need to be cleared (by writing 1) after reading or they're useless.

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agode2104x: fix power management
Ondrej Zary [Fri, 24 Sep 2010 23:57:02 +0000 (23:57 +0000)]
de2104x: fix power management

At least my 21041 cards come out of suspend with bus mastering disabled so
they did not work after resume(no data transferred).
After adding pci_set_master(), the driver oopsed immediately on resume -
because de_clean_rings() is called on suspend but de_init_rings() call
was missing in resume.

Also disable link (reset SIA) before sleep (de4x5 does this too).

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
14 years agommc: sdhci-s3c: fix NULL ptr access in sdhci_s3c_remove
Marek Szyprowski [Thu, 23 Sep 2010 14:22:05 +0000 (16:22 +0200)]
mmc: sdhci-s3c: fix NULL ptr access in sdhci_s3c_remove

If not all clocks have been defined in platform data, the driver will
cause a null pointer dereference when it is removed. This patch fixes
this issue.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Ball <cjb@laptop.org>
14 years agommc: sdhci-s3c: fix incorrect spinlock usage after merge
Marek Szyprowski [Mon, 20 Sep 2010 13:03:42 +0000 (15:03 +0200)]
mmc: sdhci-s3c: fix incorrect spinlock usage after merge

In the commit f522886e202a34a2191dd5d471b3c4d46410a9a0 a merge conflict
in the sdhci-s3c driver been fixed. However the fix used incorrect
spinlock operation - it caused a race with sdhci interrupt service. The
correct way to solve it is to use spin_lock_irqsave/irqrestore() calls.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Ball <cjb@laptop.org>
14 years agommc: MAINTAINERS: add myself as MMC maintainer
Chris Ball [Fri, 10 Sep 2010 16:05:24 +0000 (12:05 -0400)]
mmc: MAINTAINERS: add myself as MMC maintainer

Signed-off-by: Chris Ball <cjb@laptop.org>
Cc: Pierre Ossman <pierre-list@ossman.eu>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
14 years agopcmcia: pd6729: Fix error path
Rahul Ruikar [Sun, 26 Sep 2010 11:30:29 +0000 (17:00 +0530)]
pcmcia: pd6729: Fix error path

In error return path
call pci_disable_device() which was enabled earlier.

Signed-off-by: Rahul Ruikar <rahul.ruikar@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
14 years agoalpha: fix usp value in multithreaded coredumps
Al Viro [Sat, 25 Sep 2010 20:07:51 +0000 (21:07 +0100)]
alpha: fix usp value in multithreaded coredumps

rdusp() gives us the right value only for the current thread...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoalpha: fix hae_cache race in RESTORE_ALL
Al Viro [Sat, 25 Sep 2010 20:07:14 +0000 (21:07 +0100)]
alpha: fix hae_cache race in RESTORE_ALL

We want interrupts disabled on all paths leading to RESTORE_ALL;
otherwise, we are risking an IRQ coming between the updates of
alpha_mv->hae_cache and *alpha_mv->hae_register and set_hae()
within the IRQ getting badly confused.

RESTORE_ALL used to play with disabling IRQ itself, but that got
removed back in 2002, without making sure we had them disabled
on all paths.  It's cheaper to make sure we have them disabled than
to revert to original variant...

Remove the detritus left from that commit back in 2002; we used to
need a reload of $0 and $1 since swpipl would change those, but
doing that had become pointless when we stopped doing swpipl in
there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
Linus Torvalds [Sat, 25 Sep 2010 16:51:54 +0000 (09:51 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: prevent merges of discard and write requests

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
Linus Torvalds [Sat, 25 Sep 2010 16:51:31 +0000 (09:51 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: sound/pci/rme9652: prevent reading uninitialized stack memory
  ALSA: hda - Fix auto-parse of SPDIF input of Realtek codecs
  ASoC: Fix multi-componentism
  ASoC: Fix soc-cache buffer overflow bug
  ALSA: oxygen: fix analog capture on Claro halo cards
  ALSA: hda - Add Dell Latitude E6400 model quirk
  ASoC: fix clkdev API usage in sh/migor.c

14 years agoAvoid pgoff overflow in remap_file_pages
Larry Woodman [Fri, 24 Sep 2010 16:04:48 +0000 (12:04 -0400)]
Avoid pgoff overflow in remap_file_pages

Thomas Pollet noticed that the remap_file_pages() system call in
fremap.c has a potential overflow in the first part of the if statement
below, which could cause it to process bogus input parameters.
Specifically the pgoff + size parameters could be wrap thereby
preventing the system call from failing when it should.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoMerge branch 'fix/hda' into for-linus
Takashi Iwai [Sat, 25 Sep 2010 15:57:53 +0000 (17:57 +0200)]
Merge branch 'fix/hda' into for-linus

14 years agoMerge branch 'fix/asoc' into for-linus
Takashi Iwai [Sat, 25 Sep 2010 15:57:49 +0000 (17:57 +0200)]
Merge branch 'fix/asoc' into for-linus

14 years agoALSA: sound/pci/rme9652: prevent reading uninitialized stack memory
Dan Rosenberg [Sat, 25 Sep 2010 15:07:27 +0000 (11:07 -0400)]
ALSA: sound/pci/rme9652: prevent reading uninitialized stack memory

The SNDRV_HDSP_IOCTL_GET_CONFIG_INFO and
SNDRV_HDSP_IOCTL_GET_CONFIG_INFO ioctls in hdspm.c and hdsp.c allow
unprivileged users to read uninitialized kernel stack memory, because
several fields of the hdsp{m}_config_info structs declared on the stack
are not altered or zeroed before being copied back to the user.  This
patch takes care of it.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>