Stefan Behrens [Thu, 16 May 2013 14:48:19 +0000 (14:48 +0000)]
Btrfs: explicitly use global_block_rsv for quota_tree
The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Alexandre Oliva [Wed, 15 May 2013 15:38:55 +0000 (11:38 -0400)]
btrfs: do away with non-whole_page extent I/O
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.
It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Wed, 15 May 2013 07:48:21 +0000 (07:48 +0000)]
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Wed, 15 May 2013 07:48:18 +0000 (07:48 +0000)]
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Wed, 15 May 2013 07:48:17 +0000 (07:48 +0000)]
Btrfs: pause the space balance when remounting to R/O
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Wed, 15 May 2013 07:48:16 +0000 (07:48 +0000)]
Btrfs: fix unprotected root node of the subvolume's inode rb-tree
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Wed, 15 May 2013 07:48:15 +0000 (07:48 +0000)]
Btrfs: fix accessing a freed tree root
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Liu Bo [Tue, 14 May 2013 02:12:15 +0000 (02:12 +0000)]
Btrfs: return errno if possible when we fail to allocate memory
We need to set return value explicitly, otherwise we'll lose the error
value.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Mon, 13 May 2013 13:55:12 +0000 (13:55 +0000)]
Btrfs: update the global reserve if it is empty
Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Mon, 13 May 2013 13:55:11 +0000 (13:55 +0000)]
Btrfs: don't steal the reserved space from the global reserve if their space type is different
If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Mon, 13 May 2013 13:55:10 +0000 (13:55 +0000)]
Btrfs: optimize the error handle of use_block_rsv()
cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Mon, 13 May 2013 13:55:09 +0000 (13:55 +0000)]
Btrfs: don't use global block reservation for inode cache truncation
It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Mon, 13 May 2013 13:55:08 +0000 (13:55 +0000)]
Btrfs: don't abort the current transaction if there is no enough space for inode cache
The filesystem with inode cache was forced to be read-only when we umounted it.
Steps to reproduce:
# mkfs.btrfs -f ${DEV}
# mount -o inode_cache ${DEV} ${MNT}
# dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
# btrfs fi syn ${MNT}
# dd if=${MNT}/file1 of=/dev/null bs=1M
# rm -f ${MNT}/file1
# btrfs fi syn ${MNT}
# umount ${MNT}
It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.
But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Andreas Philipp [Sat, 11 May 2013 11:13:03 +0000 (11:13 +0000)]
Correct allowed raid levels on balance.
Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.
Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Wed, 8 May 2013 08:56:09 +0000 (08:56 +0000)]
Btrfs: fix possible memory leak in replace_path()
In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.
Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Wed, 8 May 2013 08:10:25 +0000 (08:10 +0000)]
Btrfs: fix possible memory leak in the find_parent_nodes()
In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Tue, 7 May 2013 17:28:03 +0000 (17:28 +0000)]
Btrfs: don't allow device replace on RAID5/RAID6
This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.
One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).
0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918 num_stripes = map->num_stripes;
4919 max_errors = nr_parity_stripes(map);
4920
4921 raid_map = kmalloc(sizeof(u64) * num_stripes,
4922 GFP_NOFS);
4923 if (!raid_map) {
4924 ret = -ENOMEM;
4925 goto out;
4926 }
4927
There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 9 May 2013 17:49:30 +0000 (13:49 -0400)]
Btrfs: handle running extent ops with skinny metadata
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata. We also lose
the level of the metadata we are working on. So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 8 May 2013 20:44:57 +0000 (16:44 -0400)]
Btrfs: remove warn on in free space cache writeout
This catches block groups that are too large to properly cache. We deal with
this case fine, so the warning just confuses users. Remove the warning.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 8 May 2013 17:30:11 +0000 (13:30 -0400)]
Btrfs: don't null pointer deref on abort
I'm sorry, theres no excuse for this sort of work. We need to use
root->leafsize since eb may be NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Gabriel de Perthuis [Mon, 6 May 2013 17:40:18 +0000 (17:40 +0000)]
btrfs: don't stop searching after encountering the wrong item
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641
Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Liu Bo [Wed, 1 May 2013 16:23:41 +0000 (16:23 +0000)]
Btrfs: fix off-by-one in fiemap
lock_extent/unlock_extent expect an exclusive end.
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
David Sterba [Tue, 30 Apr 2013 17:29:29 +0000 (17:29 +0000)]
btrfs: annotate quota tree for lockdep
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.
There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Chris Mason [Tue, 7 May 2013 15:00:13 +0000 (11:00 -0400)]
Btrfs: allow superblock mismatch from older mkfs
We've added new checks to make sure the super block crc is correct
during mount. A fresh filesystem from an older mkfs won't have the
crc set. This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
David Sterba [Wed, 6 Mar 2013 14:57:46 +0000 (15:57 +0100)]
btrfs: enhance superblock checks
The superblock checksum is not verified upon mount. <awkward silence>
Add that check and also reorder existing checks to a more logical
order.
Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.
First transaction commit calculates correct superblock checksum and
saves it to disk.
Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
David Sterba [Mon, 29 Apr 2013 13:39:40 +0000 (13:39 +0000)]
btrfs: fix misleading variable name for flags
The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
David Sterba [Mon, 29 Apr 2013 13:38:46 +0000 (13:38 +0000)]
btrfs: use unsigned long type for extent state bits
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Liu Bo [Sat, 27 Apr 2013 02:56:57 +0000 (02:56 +0000)]
Btrfs: improve the loop of scrub_stripe
1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid has been out of the dev extent's range
but we haven't finish scanning all the range within the dev extent
* when all the items has been processed but we haven't finish scanning all the
range within the dev extent
In both cases, we can just finish the loop to save costs.
2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.
3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item. Setting offset=-1 and previous_item() is the correct way.
4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
David Sterba [Fri, 26 Apr 2013 15:20:23 +0000 (15:20 +0000)]
btrfs: read entire device info under lock
There's a theoretical possibility of reading stale (or even more
theoretically, freed) data from DEV_INFO ioctl when the device would
disappear between an early mutex unlock and data being copied from the
device structure.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
David Sterba [Fri, 26 Apr 2013 14:56:29 +0000 (14:56 +0000)]
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
It's unused since
0b32f4bbb423f02ac.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
David Sterba [Fri, 26 Apr 2013 12:56:04 +0000 (12:56 +0000)]
btrfs: handle errors returned from get_tree_block_key
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Eric Sandeen [Thu, 25 Apr 2013 20:41:01 +0000 (20:41 +0000)]
btrfs: make static code static & remove dead code
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Mon, 29 Apr 2013 14:05:57 +0000 (10:05 -0400)]
Btrfs: deal with errors in write_dev_supers
If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk. This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to. Fix this by dealing with
this case and just jacking up the errors count. With this patch we no longer
panic when mounting a restored file system loopback. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 25 Apr 2013 20:23:32 +0000 (16:23 -0400)]
Btrfs: remove almost all of the BUG()'s from tree-log.c
There were a whole bunch and I was doing it for other things. I haven't tested
these error paths but at the very least this is better than panicing. I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with a
ASSERT framework that we can compile out for production users. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 25 Apr 2013 19:55:30 +0000 (15:55 -0400)]
Btrfs: deal with free space cache errors while replaying log
So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed. So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this. With this patch
we just fail to mount instead of panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Thu, 25 Apr 2013 16:04:52 +0000 (16:04 +0000)]
Btrfs: automatic rescan after "quota enable" command
When qgroup tracking is enabled, we do an automatic cycle of the new rescan
mechanism.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Thu, 25 Apr 2013 16:04:51 +0000 (16:04 +0000)]
Btrfs: rescan for qgroups
If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.
A filesystem under rescan can still be umounted. The rescan continues on the
next mount. Status information is provided with a separate ioctl while a
rescan operation is in progress.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Thu, 25 Apr 2013 16:04:50 +0000 (16:04 +0000)]
Btrfs: split btrfs_qgroup_account_ref into four functions
The function is separated into a preparation part and the three accounting
steps mentioned in the qgroups documentation. The goal is to make steps two
and three usable by the rescan functionality. A side effect is that the
function is restructured into readable subunits.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Thu, 25 Apr 2013 10:12:38 +0000 (10:12 +0000)]
Btrfs: allocate new chunks if the space is not enough for global rsv
When running the 208th of xfstests, the fs returned the enospc
error when there was lots of free space in the disk.
By bisect debug, we found it was introduced by commit
96f1bb5777.
This commit makes the space check for the global reservation in
can_overcommit() be inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space is 2 times the size
of the global reservation, or we can't do overcommit. And instead,
we need reclaim some reserved space, and if we still don't have
enough free space, we need allocate a new chunk. But unfortunately,
should_alloc_chunk() just requires that the free space is 1 time
the size of the global reservation, that is we would not try to
allocate a new chunk if the free space size is in the middle of
these two requires, and just return the enospc error. Fix it.
Cc: Jim Schutt <jaschut@sandia.gov>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Wed, 24 Apr 2013 16:57:33 +0000 (16:57 +0000)]
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.
However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.
To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.
The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.
Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Eric Sandeen [Mon, 22 Apr 2013 16:12:31 +0000 (16:12 +0000)]
btrfs: move leak debug code to functions
Clean up the leak debugging in extent_io.c by moving
the debug code into functions. This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.
Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.
Thanks to Dave Sterba for the Kconfig bit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Liu Bo [Mon, 22 Apr 2013 10:53:47 +0000 (10:53 +0000)]
Btrfs: return free space in cow error path
Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.
We don't have to worry about extent maps since they'd be freed in releasepage
path.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Fri, 19 Apr 2013 15:08:05 +0000 (15:08 +0000)]
Btrfs: set UUID in root_item for created trees
It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get an UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUID are never used so far, but
anyway, since it is better to have it uniform for all trees, this
commit adds some lines that generate and write an UUID for newly
created trees.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Fri, 19 Apr 2013 15:08:04 +0000 (15:08 +0000)]
Btrfs: delete unused parameter to btrfs_read_root_item()
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Fri, 19 Apr 2013 01:04:46 +0000 (01:04 +0000)]
Btrfs: fix error handling in btrfs_ioctl_send()
fget() returns NULL if error. So, we should check NULL or not.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Thu, 18 Apr 2013 07:10:44 +0000 (07:10 +0000)]
Btrfs: remove unused variable in __process_changed_new_xattr()
Variable 'p' is not used any more. So, remove it.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 25 Apr 2013 17:44:38 +0000 (13:44 -0400)]
Btrfs: various abort cleanups
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort. This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.
The second half of this are related to the cleanup parts. First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong. The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes. With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 24 Apr 2013 20:41:19 +0000 (16:41 -0400)]
Btrfs: cleanup destroy_marked_extents
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 24 Apr 2013 20:40:05 +0000 (16:40 -0400)]
Btrfs: check return value of commit when recovering log
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out. We need to do this before
we free up the log_tree_root since the caller will handle all of that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 24 Apr 2013 20:38:50 +0000 (16:38 -0400)]
Btrfs: don't panic if we're trying to drop too many refs
This is just obnoxious. Just print a message, abort the transaction, and return
an error. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 24 Apr 2013 20:35:41 +0000 (16:35 -0400)]
Btrfs: cleanup fs roots if we fail to mount
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime. So we need to clean this up
so we don't leave extent buffers hanging around on the cache. With this patch
we no longer leak extent buffers on failure to mount. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 24 Apr 2013 20:32:55 +0000 (16:32 -0400)]
Btrfs: fix extent logging with O_DIRECT into prealloc
This is the same as the fix from commit
Btrfs: fix bad extent logging
but for O_DIRECT. I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent. We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 23 Apr 2013 18:17:42 +0000 (14:17 -0400)]
Btrfs: fix all callers of read_tree_block
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly. You need to check
and make sure the extent_buffer is uptodate before you use it. This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not. With this we no longer
leak EB's when things go horribly wrong. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 23 Apr 2013 16:55:21 +0000 (12:55 -0400)]
Btrfs: only exclude supers in the range of our block group
If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree. This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it. This fixes the problem by only adding excluded
extents if it falls in the block group range we care about. With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 23 Apr 2013 15:30:14 +0000 (11:30 -0400)]
Btrfs: add tree block level sanity check
With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level. So add this
to the sanity checks in the endio handler for tree blocks. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 23 Apr 2013 15:08:33 +0000 (11:08 -0400)]
Btrfs: don't try and free ebs twice in log replay
This work is done by btrfs_free_path() anyway so there's no need for this
duplicate work. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 23 Apr 2013 14:53:18 +0000 (10:53 -0400)]
Btrfs: don't BUG_ON() in btrfs_num_copies
A user sent me a btrfs-image that was panicing because of some corruption. This
is because we pass in a bogus value to btrfs_num_copies, and it panics. Instead
just return 1. We only call btrfs_num_copies to see if there are other copies
to try and read for things, so if we just return 1 it will make the callers exit
out with an appropriate error value. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Sat, 20 Apr 2013 14:18:27 +0000 (10:18 -0400)]
Btrfs: don't call readahead hook until we have read the entire eb
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map. Turns out he is using > PAGESIZE leafsizes. The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus. Fix this
by only calling the readahead hook once all pages have been read in. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Sat, 20 Apr 2013 03:45:33 +0000 (23:45 -0400)]
Btrfs: deal with bad mappings in btrfs_map_block
Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find
a chunk for a particular block we were trying to map. This happened because the
block was bogus. We shouldn't be BUG_ON()'ing in this case, just print a
message and return an error. This came from reada_add_block and it appears to
deal with an error fine so we should be good there. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Fri, 19 Apr 2013 23:49:09 +0000 (19:49 -0400)]
Btrfs: use REQ_META for all metadata IO
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Fri, 19 Apr 2013 18:37:26 +0000 (14:37 -0400)]
Btrfs: fix possible infinite loop in slow caching
So I noticed there is an infinite loop in the slow caching code. If we return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we
don't need_resched(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 17 Apr 2013 16:16:59 +0000 (12:16 -0400)]
Btrfs: fix lockdep warning
The locking order for stuff is
__sb_start_write
ordered_mutex
but with sync() we don't do __sb_start_write for some strange reason, which
means that our iput in wait_ordered_extents could start a transaction which does
the __sb_start_write while we're holding the ordered_mutex. Fix this by using
delayed iput in sync. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Wed, 17 Apr 2013 14:49:51 +0000 (14:49 +0000)]
Btrfs: add all ioctl checks before user change for quota operations
Since all the quota configurations are loaded in memory, and we can
have ioctl checks before operating in the disk. It is safe to do such
things because qgroup_ioctl_lock is held outside.
Without these extra checks firstly, it should be ok to do user change
for quota operations. For example:
if we want to add an existed qgroup, we will do:
->add_qgroup_item()
->add_qgroup_rb()
add_qgroup_item() will return -EEXIST to us, however, qgroups are all
in memory, why not check them in memory firstly.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Wed, 17 Apr 2013 14:00:36 +0000 (14:00 +0000)]
Btrfs: fix missing check about ulist_add() in qgroup.c
ulist_add() may return -ENOMEM, fix missing check about
return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Wed, 17 Apr 2013 09:11:47 +0000 (09:11 +0000)]
Btrfs: clear received_uuid field for new writable snapshots
For created snapshots, the full root_item is copied from the source
root and afterwards selectively modified. The current code forgets
to clear the field received_uuid. The only problem is that it is
confusing when you look at it with 'btrfs subv list', since for
writable snapshots, the contents of the snapshot can be completely
unrelated to the previously received snapshot.
The receiver ignores such snapshots anyway because he also checks
the field stransid in the root_item and that value used to be reset
to zero for all created snapshots.
This commit changes two things:
- clear the received_uuid field for new writable snapshots.
- don't clear the send/receive related information like the stransid
for read-only snapshots (which makes them useable as a parent for
the automatic selection of parents in the receive code).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 17 Apr 2013 14:17:05 +0000 (10:17 -0400)]
Btrfs: don't force pages under writeback to finish when aborting
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right. However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong. We need to
lock the page and wait on page writeback to complete if it is. There is nothing
unsafe about this since we are discarding the transaction anyway. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Tue, 16 Apr 2013 10:22:23 +0000 (10:22 +0000)]
Btrfs: remove unused variable in the iterate_extent_inodes()
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Liu Bo [Tue, 16 Apr 2013 09:20:28 +0000 (09:20 +0000)]
Btrfs: return error when we specify wrong start to defrag
We need such a sanity check for wrong start when we defrag a file, otherwise,
even with a wrong start that's larger than file size, we can end up changing
not only inode's force compress flag but also FS's incompat flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Vincent [Tue, 16 Apr 2013 08:15:25 +0000 (08:15 +0000)]
Btrfs: fix reada debug code compilation
This fixes the following errors:
fs/btrfs/reada.c: In function ‘btrfs_reada_wait’:
fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Tue, 16 Apr 2013 05:19:11 +0000 (05:19 +0000)]
Btrfs: cleanup of function where btrfs_extend_item() is called
Argument 'trans' became unnecessary from setup_inline_extent_backref()
that called btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Tue, 16 Apr 2013 05:18:49 +0000 (05:18 +0000)]
Btrfs: remove unused argument of btrfs_extend_item()
Argument 'trans' is not used in btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Tue, 16 Apr 2013 05:18:22 +0000 (05:18 +0000)]
Btrfs: cleanup of function where fixup_low_keys() is called
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tsutomu Itoh [Tue, 16 Apr 2013 05:18:02 +0000 (05:18 +0000)]
Btrfs: remove unused argument of fixup_low_keys()
Argument 'trans' is not used in fixup_low_keys(). So, remove it.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Mon, 15 Apr 2013 12:56:49 +0000 (12:56 +0000)]
Btrfs: fix confusing edquot happening case
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
dd if=/dev/zero of=/<mnt>/data bs=1M count=10
sync
btrfs quota enable <mnt>
btrfs qgroup create 0/5 <mnt>
btrfs qgroup limit 5M 0/5 <mnt>
rm -f /<mnt>/data
sync
btrfs qgroup show <mnt>
dd if=/dev/zero of=data bs=1M count=1
>From the perspective of users, qgroup's referenced or exclusive
is negative,but user can not continue to write data! a workaround
way is to cast u64 to s64 when doing qgroup reservation.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Mon, 15 Apr 2013 10:26:38 +0000 (10:26 +0000)]
Btrfs: do not continue if out of memory happens
If out of memory happens, we should return -ENOMEM directly to the caller
rather than continue the work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Nathaniel Yazdani [Mon, 15 Apr 2013 00:44:02 +0000 (00:44 +0000)]
btrfs: fix minor typo in comment
In the comment describing the sync_writers field of the btrfs_inode
struct, "fsyncing" was misspelled "fsycing."
Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 14 Apr 2013 14:08:49 +0000 (14:08 +0000)]
Btrfs: cleanup to remove reduplicate code in transaction.c
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Sat, 13 Apr 2013 13:19:55 +0000 (13:19 +0000)]
Btrfs: fix unlock after free on rewinded tree blocks
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Sat, 13 Apr 2013 13:19:54 +0000 (13:19 +0000)]
Btrfs: fix accessing the root pointer in tree mod log functions
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemtion was enabled.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Sat, 13 Apr 2013 13:19:53 +0000 (13:19 +0000)]
Btrfs: fix tree mod log regression on root split operations
Commit
d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.
This commit removes the erroneous logging of removal of all elements.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Thu, 11 Apr 2013 10:30:16 +0000 (10:30 +0000)]
Btrfs: use a lock to protect incompat/compat flag of the super block
The following case will make the incompat/compat flag of the super block
be recovered.
Task1 |Task2
flags = btrfs_super_incompat_flags(); |
|flags = btrfs_super_incompat_flags();
flags |= new_flag1; |
|flags |= new_flag2;
btrfs_set_super_incompat_flags(flags); |
|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.
In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Thu, 11 Apr 2013 10:29:35 +0000 (10:29 +0000)]
Btrfs: fix unblocked autodefraggers when remount
The new mount option is set after parsing the remount arguments,
so it is wrong that checking the autodefrag is close or not at
btrfs_remount_prepare(). Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Fri, 12 Apr 2013 12:12:17 +0000 (12:12 +0000)]
Btrfs: add a rb_tree to improve performance of ulist search
Walking backref tree and btrfs quota rely on ulist very much.
This patch tries to use rb_tree to speed up search time.
The original code always checks whether an element
exists before adding a new element, however it costs O(n).
I try to add a rb_tree in the ulist,this is only used to speed up
search. I also do some measurements with quota enabled.
fsstress -p 4 -n 10000
Without this path:
real 0m51.058s 2m4.745s 1m28.222s 1m5.137s
user 0m0.035s 0m0.041s 0m0.105s 0m0.100s
sys 0m12.009s 0m11.246s 0m10.901s 0m10.999s 0m11.287s
With this path:
real 0m55.295s 0m50.960s 1m2.214s 0m48.273s
user 0m0.053s 0m0.095s 0m0.135s 0m0.107s
sys 0m7.766s 0m6.013s 0m6.319s 0m6.030s 0m6.532s
After applying the patch,the execute time is down by ~42%.(11.287s->6.532s)
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Stefan Behrens [Wed, 10 Apr 2013 17:10:52 +0000 (17:10 +0000)]
Btrfs: allow omitting stream header and end-cmd for btrfs send
Two new flags are added to allow omitting the stream header and the
end command for btrfs send streams. This is used in cases where you
send multiple snapshots back-to-back in one stream.
This used to be encoded like this (with 2 snapshots in this example):
<stream header> + <sequence of commands> + <end cmd> +
<stream header> + <sequence of commands> + <end cmd> + EOF
The new format (if the two new flags are used) is this one:
<stream header> + <sequence of commands> +
<sequence of commands> + <end cmd>
Note that the currently existing receivers treat <end cmd> only as
an indication that a new <stream header> is following. This means,
you can just skip the sequence <end cmd> <stream header> without
loosing compatibility. As long as an EOF is following, the currently
existing receivers handle the new format (if the two new flags are
used) exactly as the old one.
So what is the benefit of this change? The goal is to be able to use
a single stream (one TCP connection) to multiplex a request/response
handshake plus Btrfs send streams, all in the same stream. In this
case you cannot evaluate an EOF condition as an end of the Btrfs send
stream. You need something else, and the <end cmd> is just perfect
for this purpose.
The summary is:
The format change is driven by the need to send several Btrfs send
streams over a single TCP connections, with the ability for a repeated
request/response handshake in the middle. And this format change does
not break any existing tool, it is completely compatible.
You could compare the old behaviour of the Btrfs send stream to the
one of ftp where you need a seperate request/response channel and
newly opened data transfer channels for each file, while the new
behaviour is more like http using a single stream for everything.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Thu, 11 Apr 2013 07:08:55 +0000 (07:08 +0000)]
Btrfs: make __merge_refs() return type be void
__merge_refs() always return 0, it is unnecessary
for the caller to check the return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Wed, 10 Apr 2013 11:22:50 +0000 (11:22 +0000)]
Btrfs: remove some BUG_ONs() when walking backref tree
The only error return value of __add_prelim_ref() is -ENOMEM,
just return errors rather than trigger BUG_ON().
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Mon, 8 Apr 2013 10:56:22 +0000 (10:56 +0000)]
Btrfs: use tree_root to avoid edquot when disabling quota
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10K <mnt>/subv
btrfs quota disable <mnt>/subv
It is wrong for qgroup to reserve when disabling quota,
so just use tree_root to avoid edquot when disabling quota.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:50:20 +0000 (10:50 +0000)]
Btrfs: fix a warning when updating qgroup limit
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup limit 0/1 <mnt>
dmesg
If the relative qgroup dosen't exist, flag 'BTRFS_QGROUP_STATUS_
FLAG_INCONSISTENT' will be set, and print the noise message.
This is wrong, we can just move find_qgroup_rb() before
update_qgroup_limit_item().this dosen't change the logic of the
function. But it can avoid unnecessary noise message and wrong set of flag.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:50:19 +0000 (10:50 +0000)]
Btrfs: fix missing check in the btrfs_qgroup_inherit()
The original code forgot to check 'inherit', we should
gurantee that all the qgroups in the struct 'inherit' exist.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:50:18 +0000 (10:50 +0000)]
Btrfs: fix missing check before creating a qgroup relation
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup assign 0/1 1/1 <mnt>
umount <mnt>
btrfs-debug-tree <disk> | grep QGROUP
If we want to add a qgroup relation, we should gurantee that
'src' and 'dst' exist, otherwise, such qgroup relation should
not be allowed to create.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:50:17 +0000 (10:50 +0000)]
Btrfs: remove some unnecessary spin_lock usages
We use mutex lock to protect all the user change operations.
So when we are calling find_qgroup_rb() to check whether qgroup
exists, we don't have to hold spin_lock.
Besides, when enabling/disabling quota, it must be single thread
when operations come here. spin lock must be firstly used to
clear quota_root when disabling quota, while enabling quota, spin
lock must be used to complete the last assign work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:50:16 +0000 (10:50 +0000)]
Btrfs: introduce a mutex lock for btrfs quota operations
The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
->add_qgroup_info_item()
->add_qgroup_rb()
For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().
What's worse, when we want to add a qgroup relation:
->add_qgroup_relation_item()
->add_qgroup_relations()
We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.
To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Sun, 7 Apr 2013 10:24:57 +0000 (10:24 +0000)]
Btrfs: creating the subvolume qgroup automatically when enabling quota
Creating the subvolume/snapshots(including root subvolume) qgroup
auotomatically when enabling quota.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Zach Brown [Tue, 2 Apr 2013 21:02:16 +0000 (21:02 +0000)]
btrfs: abort unlink trans in missed error case
__btrfs_unlink_inode() aborts its transaction when it sees errors after
it removes the directory item. But it missed the case where
btrfs_del_dir_entries_in_log() returns an error. If this happens then
the unlink appears to fail but the items have been removed without
updating the directory size. The directory then has leaked bytes in
i_size and can never be removed.
Adding the missing transaction abort at least makes this failure
consistent with the other failure cases.
I noticed this while reading the code after someone on irc reported
having a directory with i_size but no entries. I tested it by forcing
btrfs_del_dir_entries_in_log() to return -ENOMEM.
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Eric Sandeen [Thu, 4 Apr 2013 20:45:08 +0000 (20:45 +0000)]
btrfs: ignore device open failures in __btrfs_open_devices
This:
# mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test
would lead to a blkdev open/close mismatch when the mount fails, and
a permanently busy (opened O_EXCL) sdb2:
# wipefs -a /dev/sdb2
wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy
It's because btrfs_open_devices() may open some devices, fail on
the last one, and return that failure stored in "ret." The mount
then fails, but the caller then does not clean up the open devices.
Chris assures me that:
"btrfs_open_devices just means: go off and open every bdev you can from
this uuid. It should return success if we opened any of them at all."
So change the logic to ignore any open failures; just skip processing
of that device. Later on it's decided whether we have enough devices
to continue.
Reported-by: Jan Safranek <jsafrane@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Fri, 5 Apr 2013 07:20:56 +0000 (07:20 +0000)]
Btrfs: improve the performance of the csums lookup
It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.
According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).
# dd if=<mnt>/file of=/dev/null bs=1M count=1024
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Fri, 5 Apr 2013 20:51:15 +0000 (16:51 -0400)]
Btrfs: fix bad extent logging
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 4 Apr 2013 18:31:27 +0000 (14:31 -0400)]
Btrfs: log ram bytes properly
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be. This was
still working out right with compression for some reason but I think we were
getting lucky. It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it. With this patch btrfsck no longer complains
after a log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Thu, 4 Apr 2013 15:55:49 +0000 (11:55 -0400)]
Btrfs: don't wait on ordered extents if we have a trans open
Dave was hitting a lockdep warning because we're now properly taking the ordered
operations mutex in the ordered wait stuff. This is because some cases we will
have a trans handle when we are flushing delalloc space, but we can't wait on
ordered extents because we could potentially deadlock, so fix this by not doing
the wait if we have a trans handle. Thanks
Reported-and-tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 2 Apr 2013 16:40:42 +0000 (12:40 -0400)]
Btrfs: fix error handling in make/read block group
I noticed that we will add a block group to the space info before we add it to
the block group cache rb tree, so we could potentially allocate from the block
group before it's able to be searched for. I don't think this is too much of
a problem, the race window is microscopic, but just in case move the tree
insertion to above the space info linking. This makes it easier to adjust the
error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
error handling setup for these scenarios. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>