GitHub/moto-9609/android_kernel_motorola_exynos9610.git
8 years agodm: add infrastructure for DAX support
Toshi Kani [Wed, 22 Jun 2016 23:54:53 +0000 (17:54 -0600)]
dm: add infrastructure for DAX support

Change mapped device to implement direct_access function,
dm_blk_direct_access(), which calls a target direct_access function.
'struct target_type' is extended to have target direct_access interface.
This function limits direct accessible size to the dm_target's limit
with max_io_len().

Add dm_table_supports_dax() to iterate all targets and associated block
devices to check for DAX support.  To add DAX support to a DM target, the
target only needs to implement the direct_access function.

Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped
device supports DAX and is bio based.  This new type is used to assure
that all target devices have DAX support and remain that way after
QUEUE_FLAG_DAX is set in mapped device.

At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting
DM_TYPE_DAX_BIO_BASED to the type.  Any subsequent table load to the
mapped device must have the same type, or else it fails per the check in
table_load().
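
The two checks described above can be sketched in plain userspace C. This is only a toy model under stated assumptions: the struct, enum and function names prefixed with toy_ are hypothetical stand-ins, not the kernel's definitions, and the real dm_table_supports_dax() walks actual targets and block devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DM target; only the facts needed here. */
struct toy_target {
	bool has_direct_access;   /* target implements direct_access */
	bool devices_support_dax; /* all of its block devices advertise DAX */
};

enum toy_table_type { TOY_TYPE_BIO_BASED, TOY_TYPE_DAX_BIO_BASED };

/* A table supports DAX only if every target and its devices do. */
static bool toy_table_supports_dax(const struct toy_target *t, size_t n)
{
	assert(t != NULL);
	if (n == 0)
		return false;
	for (size_t i = 0; i < n; i++)
		if (!t[i].has_direct_access || !t[i].devices_support_dax)
			return false;
	return true;
}

/* Type chosen for a table at load time. */
static enum toy_table_type toy_table_get_type(const struct toy_target *t,
					      size_t n)
{
	return toy_table_supports_dax(t, n) ? TOY_TYPE_DAX_BIO_BASED
					    : TOY_TYPE_BIO_BASED;
}

/* Reload check: a later table must match the type fixed at first load. */
static bool toy_table_load_allowed(enum toy_table_type md_type,
				   enum toy_table_type new_type)
{
	return md_type == new_type;
}
```

With this model, a table in which a single target lacks direct_access falls back to the plain bio-based type, so reloading it onto a device that was first loaded as DAX is rejected.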

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agoMerge remote-tracking branch 'jens/for-4.8/core' into dm-4.8
Mike Snitzer [Thu, 21 Jul 2016 03:48:25 +0000 (23:48 -0400)]
Merge remote-tracking branch 'jens/for-4.8/core' into dm-4.8

DM's DAX support depends on block core's newly added QUEUE_FLAG_DAX.

8 years agoblock: do not merge requests without consulting with io scheduler
Tahsin Erdogan [Thu, 7 Jul 2016 18:48:22 +0000 (11:48 -0700)]
block: do not merge requests without consulting with io scheduler

Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.

In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weight cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.

Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.

cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e

IO_ROOT=/mnt-cgroup/io

mkdir -p $IO_ROOT

if ! mount | grep -qw $IO_ROOT; then
  mount -t cgroup none -oblkio $IO_ROOT
fi

cd $IO_ROOT

for i in g1 g2 g3; do
  if [ -d $i ]; then
    rmdir $i
  fi
done

mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight

RUNTIME=10

(echo $BASHPID > g1/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=0k &> /dev/null)&

(echo $BASHPID > g2/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=64k &> /dev/null)&

(echo $BASHPID > g3/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=256k &> /dev/null)&

sleep $((RUNTIME+1))

for i in g1 g2 g3; do
  echo ---- $i ----
  cat $i/blkio.time
done

EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686

After applying the patch:

# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: Fix spelling in a source code comment
Bart Van Assche [Tue, 19 Jul 2016 15:18:06 +0000 (08:18 -0700)]
block: Fix spelling in a source code comment

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: expose QUEUE_FLAG_DAX in sysfs
Yigal Korman [Thu, 23 Jun 2016 21:05:51 +0000 (17:05 -0400)]
block: expose QUEUE_FLAG_DAX in sysfs

Provides the ability to identify DAX enabled devices in userspace.
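
A userspace check could look roughly like the sketch below. It assumes the attribute is read from a path such as /sys/block/<disk>/queue/dax and contains 0 or 1; the helper name queue_dax_from_file() is hypothetical and the path is passed in by the caller rather than hard-coded.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs-style attribute file holding "0" or "1".
 * Returns 1 if the device advertises DAX, 0 if it does not,
 * and -1 if the file cannot be read or parsed.
 */
static int queue_dax_from_file(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	if (value < 0)
		return -1;
	return value != 0;
}
```

A caller would pass the queue attribute path of the disk in question and treat -1 as "unknown", for example on kernels that predate this attribute.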

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Toshi Kani [Thu, 23 Jun 2016 21:05:50 +0000 (17:05 -0400)]
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support

Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capability may not be enabled conditionally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support.  This allows the DAX capability to be set based on how
the mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agodm thin: fix a race condition between discarding and provisioning a block
Joe Thornber [Fri, 1 Jul 2016 13:00:02 +0000 (14:00 +0100)]
dm thin: fix a race condition between discarding and provisioning a block

The discard passdown was being issued after the block was unmapped,
which meant the block could be reprovisioned whilst the passdown discard
was still in flight.

We can only identify unshared blocks (those safe to pass a discard down
to) once they're unmapped and their ref count hits zero.  Block ref
counts are now used to guard against concurrent allocation of these
blocks that are being discarded.  So now we unmap the block, issue
passdown discards, and then immediately increment ref counts for regions
that have been discarded via passdown (this is safe because
allocation occurs within the same thread).  We then decrement ref counts
once the passdown discard IO is complete -- signaling these blocks may
now be allocated.

This fixes the potential for corruption that was reported here:
https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html

Reported-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm btree: fix a bug in dm_btree_find_next_single()
Joe Thornber [Fri, 1 Jul 2016 10:09:13 +0000 (11:09 +0100)]
dm btree: fix a bug in dm_btree_find_next_single()

dm_btree_find_next_single() can short-circuit the search for a block
with a return of -ENODATA if all entries are higher than the search key
passed to lower_bound().

This hasn't been a problem because of the way the btree has been used by
DM thinp.  But it must be fixed now in preparation for fixing the race
in DM thinp's handling of simultaneous block discard vs allocation.
Otherwise, once that fix is in place, some of the blocks in a discard
would not be unmapped as expected.
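
The short-circuit can be shown on a flat sorted array instead of a btree node. The sketch below is an assumption-laden simplification: toy_lower_bound() returns the index of the last key less than or equal to the search key, or -1 when every key is higher, and toy_find_next() treats that -1 as "start at the first entry" rather than "no data". The names are hypothetical and the real function walks btree levels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the last key <= search, or -1 if every key is higher. */
static int toy_lower_bound(const uint64_t *keys, size_t n, uint64_t search)
{
	int lo = -1, hi = (int)n;

	while (hi - lo > 1) {
		int mid = lo + (hi - lo) / 2;

		if (keys[mid] <= search)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}

/* Index of the first key >= search, or -1 if there is none. */
static int toy_find_next(const uint64_t *keys, size_t n, uint64_t search)
{
	int i;

	assert(keys != NULL || n == 0);
	i = toy_lower_bound(keys, n, search);
	if (i < 0)
		i = 0; /* all keys are higher: the first entry is the answer */
	else if (keys[i] < search)
		i++;   /* step past the smaller key */
	return (size_t)i < n ? i : -1;
}
```

Returning -1 directly whenever toy_lower_bound() yields -1 would reproduce the bug: a search key below every entry would report no data even though the first entry is the next one.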

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix random optimal_io_size for raid0
Heinz Mauelshagen [Tue, 19 Jul 2016 11:16:24 +0000 (13:16 +0200)]
dm raid: fix random optimal_io_size for raid0

raid_io_hints() was retrieving the number of data stripes used for the
calculation of io_opt from struct r5conf, which is not defined for raid0
mappings.

Base the calculation on the in-core raid_set structure instead.

Also, adjust to use to_bytes() for the sector -> bytes conversion
throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: address checkpatch.pl complaints
Heinz Mauelshagen [Tue, 19 Jul 2016 12:03:51 +0000 (14:03 +0200)]
dm raid: address checkpatch.pl complaints

Use 'unsigned int' where appropriate.
Return negative errors.
Correct an indentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agoBtrfs: fix comparison in __btrfs_map_block()
Vincent Stehlé [Fri, 15 Jul 2016 15:03:21 +0000 (17:03 +0200)]
Btrfs: fix comparison in __btrfs_map_block()

Add missing comparison to op in expression, which was forgotten when doing
the REQ_OP transition.

Fixes: b3d3fa519905 ("btrfs: update __btrfs_map_block for REQ_OP transition")
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agodm: call PR reserve/unreserve on each underlying device
Christoph Hellwig [Fri, 8 Jul 2016 12:23:51 +0000 (21:23 +0900)]
dm: call PR reserve/unreserve on each underlying device

So far we tried to rely on the SCSI 'all target ports' bit to register
all paths, but for many setups this didn't work properly as the different
paths are seen as separate initiators to the target instead of multiple
ports of the same initiator.  Because of that we'll stop setting the
'all target ports' bit in SCSI, and let device mapper handle iterating
over the device for each path and registering them manually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agosd: don't use the ALL_TG_PT bit for reservations
Christoph Hellwig [Fri, 8 Jul 2016 12:23:50 +0000 (21:23 +0900)]
sd: don't use the ALL_TG_PT bit for reservations

These only work if we use the same initiator ID for all paths,
which might not be true if we use different protocols, or even just
different HBAs.

Instead dm-mpath will grow support to register all paths manually
later in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm: fix second blk_delay_queue() parameter to be in msec units not jiffies
Tahsin Erdogan [Fri, 15 Jul 2016 13:27:08 +0000 (06:27 -0700)]
dm: fix second blk_delay_queue() parameter to be in msec units not jiffies

Commit d548b34b062 ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceaccab ("block:
remove per-queue plugging") did.
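
The unit mix-up is easy to see with a little arithmetic. The sketch assumes the old call passed a jiffies expression of the form HZ / 100 (10 ms worth of ticks) to a parameter that is interpreted as milliseconds; the helper names are hypothetical.

```c
#include <assert.h>

/*
 * Delay in milliseconds that is actually requested when HZ / 100,
 * a jiffies count, is handed to a parameter read as msecs.
 */
static unsigned long toy_delay_ms_when_passing_jiffies(unsigned long hz)
{
	assert(hz >= 100);
	return hz / 100;
}

/* After the fix the argument is a plain millisecond count. */
static unsigned long toy_delay_ms_fixed(unsigned long hz)
{
	(void)hz; /* no longer depends on the tick rate */
	return 10;
}
```

Under that assumption the delay is the intended 10 ms only when HZ is 1000; with HZ=250 it shrinks to 2 ms and with HZ=100 to 1 ms, whereas the fixed value stays at 10 ms for every tick rate.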

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: d548b34b062 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms")
Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c
8 years agodm raid: change logical functions to actually return bool
Heinz Mauelshagen [Wed, 6 Jul 2016 16:29:22 +0000 (18:29 +0200)]
dm raid: change logical functions to actually return bool

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: use rdev_for_each in status
Heinz Mauelshagen [Thu, 30 Jun 2016 19:32:20 +0000 (21:32 +0200)]
dm raid: use rdev_for_each in status

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: use rs->raid_disks to avoid memory leaks on free
Heinz Mauelshagen [Thu, 30 Jun 2016 12:37:50 +0000 (14:37 +0200)]
dm raid: use rs->raid_disks to avoid memory leaks on free

Also makes code more consistent throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: support delta_disks for raid1, fix table output
Heinz Mauelshagen [Thu, 30 Jun 2016 11:57:08 +0000 (13:57 +0200)]
dm raid: support delta_disks for raid1, fix table output

Add "delta_disks" constructor argument support to raid1 to allow for
consistent userspace disk addition/removal handling.

Fix raid_status() to report all raid disks with status and table output
on disk adding reshapes, not just the ones listed on the mddev; optimize
its rebuild and writemostly output.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: enhance reshape check and factor out reshape setup
Heinz Mauelshagen [Wed, 29 Jun 2016 16:13:58 +0000 (18:13 +0200)]
dm raid: enhance reshape check and factor out reshape setup

Enhance rs_reshape_requested() check function to be more transparent and
fix its raid10 check.

Streamline the constructor by factoring out reshaping preparation into
function rs_prepare_reshape().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: allow resize during recovery
Heinz Mauelshagen [Mon, 27 Jun 2016 12:44:09 +0000 (14:44 +0200)]
dm raid: allow resize during recovery

Resizing a RAID set during recovery can be allowed, because the MD
resynchronization thread will either stop any ongoing recovery in case
of shrinking below the current recovery position or carry on recovery
to the new size if the set is growing.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix rs_is_recovering() to allow for lvextend
Heinz Mauelshagen [Sat, 25 Jun 2016 00:42:54 +0000 (02:42 +0200)]
dm raid: fix rs_is_recovering() to allow for lvextend

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix rebuild and catch bogus sync/resync flags
Heinz Mauelshagen [Fri, 24 Jun 2016 21:21:37 +0000 (23:21 +0200)]
dm raid: fix rebuild and catch bogus sync/resync flags

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix ctr memory leaks on error paths
Heinz Mauelshagen [Fri, 24 Jun 2016 19:49:26 +0000 (21:49 +0200)]
dm raid: fix ctr memory leaks on error paths

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix typo in write_mostly flag
Heinz Mauelshagen [Fri, 24 Jun 2016 19:32:25 +0000 (21:32 +0200)]
dm raid: fix typo in write_mostly flag

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: also reject size change during recovery
Heinz Mauelshagen [Thu, 23 Jun 2016 23:36:06 +0000 (01:36 +0200)]
dm raid: also reject size change during recovery

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix new superblock/bitmap creation on disk addition
Heinz Mauelshagen [Thu, 23 Jun 2016 23:06:28 +0000 (01:06 +0200)]
dm raid: fix new superblock/bitmap creation on disk addition

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add comments and fix typos
Heinz Mauelshagen [Thu, 23 Jun 2016 23:03:19 +0000 (01:03 +0200)]
dm raid: add comments and fix typos

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix raid10 device size error on out-of-place reshape
Heinz Mauelshagen [Thu, 23 Jun 2016 22:36:08 +0000 (00:36 +0200)]
dm raid: fix raid10 device size error on out-of-place reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: prohibit 'nosync' on new raid6 and reject resize during reshape
Heinz Mauelshagen [Thu, 23 Jun 2016 22:32:58 +0000 (00:32 +0200)]
dm raid: prohibit 'nosync' on new raid6 and reject resize during reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: clarify and fix recovery
Heinz Mauelshagen [Thu, 23 Jun 2016 22:21:09 +0000 (00:21 +0200)]
dm raid: clarify and fix recovery

Add function rs_setup_recovery() to allow for defined setup of RAID set
recovery in the constructor.

Will be called with dev_sectors={0, rdev->sectors, MaxSectors} to
recover a new or enforced sync, grown or not to be synchronized RAID set
respectively.

Prevents recovery on raid0, which doesn't support it.

Enforces recovery on raid6 to ensure properly defined Syndromes
mandatory for that MD personality are being created.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix rs_set_capacity on growing reshape
Heinz Mauelshagen [Thu, 23 Jun 2016 22:10:12 +0000 (00:10 +0200)]
dm raid: fix rs_set_capacity on growing reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: make rs_set_capacity to work on shrinking reshape
Heinz Mauelshagen [Thu, 16 Jun 2016 01:15:49 +0000 (03:15 +0200)]
dm raid: make rs_set_capacity to work on shrinking reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: enhance comments in takeover checks
Heinz Mauelshagen [Wed, 15 Jun 2016 20:29:09 +0000 (22:29 +0200)]
dm raid: enhance comments in takeover checks

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: remove bogus comment and fix comment typos
Heinz Mauelshagen [Wed, 15 Jun 2016 20:27:40 +0000 (22:27 +0200)]
dm raid: remove bogus comment and fix comment typos

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: more restricting data_offset value checks
Heinz Mauelshagen [Wed, 15 Jun 2016 20:27:08 +0000 (22:27 +0200)]
dm raid: more restricting data_offset value checks

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: reject too many write_mostly devices
Heinz Mauelshagen [Wed, 15 Jun 2016 16:50:18 +0000 (18:50 +0200)]
dm raid: reject too many write_mostly devices

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: the sync_page_io() metadata_op argument is bool
Heinz Mauelshagen [Wed, 15 Jun 2016 16:45:56 +0000 (18:45 +0200)]
dm raid: the sync_page_io() metadata_op argument is bool

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: prohibit to pass in both sync and nosync ctr flags
Heinz Mauelshagen [Wed, 15 Jun 2016 16:43:55 +0000 (18:43 +0200)]
dm raid: prohibit to pass in both sync and nosync ctr flags

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: avoid superfluous memory barriers on static metadata
Heinz Mauelshagen [Wed, 15 Jun 2016 16:39:17 +0000 (18:39 +0200)]
dm raid: avoid superfluous memory barriers on static metadata

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agoblock: atari: Return early for unsupported sector size
Gabriel Krisman Bertazi [Tue, 5 Jul 2016 16:38:32 +0000 (13:38 -0300)]
block: atari: Return early for unsupported sector size

For 4K LBA or very large disks, atari_partition can easily get tricked
into thinking it has found an Atari partition table.  Depending on the
data in the disk, it ends up creating partitions with awkward lengths.

We saw logs like this while playing with fio.

[5.625867] nvme2n1: AHDI p2
[5.625872] nvme2n1: p2 size 2910030523 extends beyond EOD, truncated

People have had issues with misinterpreted AHDI partition tables for a long
time, see this BSD thread from 1995, for example.

https://mail-index.netbsd.org/port-atari/1995/11/19/0001.html

Since the Atari partition format, according to the spec, doesn't even support
sector sizes larger than 512 bytes, a quick sanity check is reasonable to
just bail out early, before even attempting to read sector 0.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agodm rq: check kthread_run return for .request_fn request-based DM
Mike Snitzer [Wed, 6 Jul 2016 13:06:37 +0000 (09:06 -0400)]
dm rq: check kthread_run return for .request_fn request-based DM

Check return value of kthread_run() in dm_old_init_request_queue().

Reported-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm verity fec: fix block calculation
Sami Tolvanen [Tue, 21 Jun 2016 18:02:42 +0000 (11:02 -0700)]
dm verity fec: fix block calculation

do_div was replaced with div64_u64 at some point, causing a bug with
block calculation due to incompatible semantics of the two functions.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm ioctl: Simplify parameter buffer management code
Bart Van Assche [Tue, 28 Jun 2016 14:36:46 +0000 (16:36 +0200)]
dm ioctl: Simplify parameter buffer management code

Merge the two DM_PARAMS_[KV]MALLOC flags into a single flag.

Doing so avoids the crashes seen with previous attempts to consolidate
buffer management to use kvfree() without first flagging that memory had
actually been allocated.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm crypt: Fix sparse complaints
Bart Van Assche [Tue, 28 Jun 2016 14:32:32 +0000 (16:32 +0200)]
dm crypt: Fix sparse complaints

Prevent sparse from complaining about assigning a __le64 value to a u64
variable.  Remove the (u64) casts since these are superfluous.  This
patch does not change the behavior of the source code.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agoDoc: block: Fix a typo in queue-sysfs.txt
Masanari Iida [Tue, 28 Jun 2016 20:10:57 +0000 (05:10 +0900)]
Doc: block: Fix a typo in queue-sysfs.txt

This patch fixes a spelling typo found in queue-sysfs.txt.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Charge at least 1 jiffie instead of 1 ns
Jan Kara [Tue, 28 Jun 2016 07:04:02 +0000 (09:04 +0200)]
cfq-iosched: Charge at least 1 jiffie instead of 1 ns

Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
could result in charging just 1 ns to a cgroup submitting IO instead of 1
jiffie we always charged before. It is arguable what is the right amount
to charge but for now let's retain the old behavior of always charging at
least one jiffie.

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Fix regression in bonnie++ rewrite performance
Jan Kara [Tue, 28 Jun 2016 07:04:01 +0000 (09:04 +0200)]
cfq-iosched: Fix regression in bonnie++ rewrite performance

Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
broke the condition for detecting starved sync IO in
cfq_completed_request() because rq->start_time remained in jiffies but
we compared it with nanosecond values. This manifested as a regression
in bonnie++ rewrite performance because we always ended up considering
sync IO starved and thus never increased async IO queue depth.

Since rq->start_time is used in a lot of places, converting it to ns
values would be non-trivial. So just revert the condition in CFQ to use
comparison with jiffies. This will lead to suboptimal results if
cfq_fifo_expire[1] ever comes close to 1 jiffie but so far we are
relatively far from that with the storage used with CFQ (the default
value is 128 ms).

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Convert slice_resid from u64 to s64
Jan Kara [Tue, 28 Jun 2016 07:04:00 +0000 (09:04 +0200)]
cfq-iosched: Convert slice_resid from u64 to s64

slice_resid can be both positive and negative. Commit 9a7f38c42c2b
(cfq-iosched: Convert from jiffies to nanoseconds) converted it from
long to u64. Although this did not introduce any functional regression
(the operations just overflow and the result was fine), it is certainly
wrong and could cause issues in future. So convert the type to more
appropriate s64.
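
Why the unsigned type "worked" yet was wrong can be seen in a short sketch. The helper names are hypothetical; the cast from an out-of-range unsigned value to int64_t relies on the two's complement behaviour that common compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Residual stored in an unsigned type: a deficit wraps to a huge value. */
static uint64_t toy_resid_u64(uint64_t allocated, uint64_t used)
{
	return allocated - used;
}

/* Residual stored in a signed type: a deficit is a negative number. */
static int64_t toy_resid_s64(uint64_t allocated, uint64_t used)
{
	assert(allocated <= INT64_MAX && used <= INT64_MAX);
	return (int64_t)(allocated - used);
}
```

Adding the wrapped unsigned residual to another u64 still gives the right sum because the arithmetic is modulo 2^64, which is why no regression was observed. Any comparison or sign test on the unsigned value, however, sees a very large positive number instead of a small negative one.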

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: Convert fifo_time from ulong to u64
Jan Kara [Tue, 28 Jun 2016 07:03:59 +0000 (09:03 +0200)]
block: Convert fifo_time from ulong to u64

Currently rq->fifo_time is unsigned long but CFQ stores nanosecond
timestamp in it which would overflow on 32-bit archs. Convert it to u64
to avoid the overflow. Since rq->fifo_time shares a union with struct
call_single_data, this does not change the size of struct request in
any way.
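
How quickly a nanosecond timestamp wraps in a 32-bit word can be checked with a few lines. The sketch assumes a 32-bit unsigned long, as on the 32-bit architectures mentioned above, and models it with uint32_t; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds a nanosecond counter of the given width runs before wrapping. */
static uint64_t toy_seconds_until_wrap(unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (UINT64_C(1) << bits) / UINT64_C(1000000000);
}

/* What remains of a nanosecond timestamp stored in a 32-bit field. */
static uint32_t toy_store_ns_32(uint64_t ns)
{
	return (uint32_t)ns; /* keeps only the low 32 bits */
}
```

A 32-bit nanosecond counter wraps after a little more than 4 seconds, so a timestamp of 5 seconds is already stored as a much smaller value.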

We have to slightly fixup block/deadline-iosched.c so that comparison
happens in the right types.

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblktrace: avoid using timespec
Arnd Bergmann [Fri, 17 Jun 2016 14:58:26 +0000 (16:58 +0200)]
blktrace: avoid using timespec

The blktrace code stores the current time in a 32-bit word in its
user interface. This is a bad idea because 32-bit seconds overflow
at some point.

We probably have until 2106 before this one overflows, as it seems
to use an 'unsigned' variable, but we should confirm that user
space treats it the same way.
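
The 2106 figure can be reproduced with rough arithmetic. The sketch below divides the largest representable second count by the average Gregorian year length and adds the result to 1970; it is an approximation meant to show the order of magnitude, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 365.2425 days, the average length of a Gregorian year, in seconds. */
#define TOY_SECONDS_PER_YEAR UINT64_C(31556952)

/* Approximate calendar year in which a seconds-since-1970 counter runs out. */
static unsigned int toy_overflow_year(uint64_t max_seconds)
{
	assert(max_seconds > 0);
	return 1970u + (unsigned int)(max_seconds / TOY_SECONDS_PER_YEAR);
}
```

Treating the field as unsigned gives roughly 136 years of range and therefore 2106, while a signed interpretation of the same 32 bits would run out in 2038, which is why it matters how user space reads the value.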

Aside from this, we want to stop using 'struct timespec' here,
so I'm adding a comment about the overflow and changing the code
to use timespec64 instead to make the loss of range more obvious.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agodm raid: don't use 'const' in function return
Arnd Bergmann [Thu, 16 Jun 2016 09:03:25 +0000 (11:03 +0200)]
dm raid: don't use 'const' in function return

A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:

drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]

This changes the return type to plain 'int'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: fix failed takeover/reshapes by keeping raid set frozen
Heinz Mauelshagen [Tue, 14 Jun 2016 19:23:13 +0000 (15:23 -0400)]
dm raid: fix failed takeover/reshapes by keeping raid set frozen

Superblock updates were bogus, causing some takeovers/reshapes to fail.

Introduce new runtime flag (RT_FLAG_KEEP_RS_FROZEN) to keep a raid set
frozen when a layout change was requested.  Userspace will immediately
reload the table w/o the flags requesting such change once they made it
to the superblocks and any change of recovery/reshape offsets has to be
avoided until after read.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: support to change bitmap region size
Heinz Mauelshagen [Mon, 13 Jun 2016 23:46:03 +0000 (01:46 +0200)]
dm raid: support to change bitmap region size

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: update Documentation about reshaping/takeover/additonal RAID types
Heinz Mauelshagen [Mon, 13 Jun 2016 23:46:01 +0000 (01:46 +0200)]
dm raid: update Documentation about reshaping/takeover/additonal RAID types

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add reshaping support to the target
Heinz Mauelshagen [Mon, 13 Jun 2016 15:55:14 +0000 (17:55 +0200)]
dm raid: add reshaping support to the target

Add bool functions rs_is_recovering() and rs_is_reshaping()
to test for ongoing recovery/reshaping respectively in order
to reject respective requests on ongoing ones.

Remove ctr array size check, because ti->len and array
sectors will differ during disk addition/removal reshape.

Use __is_raid10_near() rather than type string compare.

Introduce rs_check_reshape() and rs_start_reshape(),
use the former in the ctr to reject bogus reshape requests
and the latter in preresume to actually start a reshape.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add prerequisite functions and definitions for reshaping
Heinz Mauelshagen [Mon, 13 Jun 2016 15:55:13 +0000 (17:55 +0200)]
dm raid: add prerequisite functions and definitions for reshaping

Add rs_is_reshapable(), rs_data_stripes(), rs_reshape_requested(),
rs_set_dev_and_array_sectors() and rs_adjust_data_offsets()

Remove superfluous check for reshape message

Correct runtime bit definitions to be incremental

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: inverse check for flags from invalid to valid flags
Heinz Mauelshagen [Thu, 9 Jun 2016 14:42:16 +0000 (16:42 +0200)]
dm raid: inverse check for flags from invalid to valid flags

It is more intuitive to manage each raid level's features in terms of
what is supported rather than what isn't supported.
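
Expressed as bit masks, the "valid flags" approach comes down to one test: no requested flag may fall outside the set a level supports. The flag names and the per-level sets below are hypothetical placeholders, not the dm-raid definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constructor flags. */
#define TOY_CTR_FLAG_SYNC        (1UL << 0)
#define TOY_CTR_FLAG_NOSYNC      (1UL << 1)
#define TOY_CTR_FLAG_DELTA_DISKS (1UL << 2)

/* Hypothetical whitelists for two made-up levels. */
#define TOY_LEVEL_A_VALID_FLAGS (TOY_CTR_FLAG_SYNC | TOY_CTR_FLAG_NOSYNC)
#define TOY_LEVEL_B_VALID_FLAGS (TOY_CTR_FLAG_SYNC | TOY_CTR_FLAG_NOSYNC | \
				 TOY_CTR_FLAG_DELTA_DISKS)

/* True when every requested flag is in the level's supported set. */
static bool toy_flags_valid(unsigned long ctr_flags, unsigned long valid_flags)
{
	return (ctr_flags & ~valid_flags) == 0;
}
```

A new flag is then rejected by default for every level until it is added to that level's whitelist, which is the safer failure mode compared with listing what each level forbids.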

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: various code cleanups
Mike Snitzer [Thu, 2 Jun 2016 19:27:22 +0000 (15:27 -0400)]
dm raid: various code cleanups

Renamed functions and variables with leading single underscore to have a
double underscore.  Renamed some functions to have better names.  Folded
functions that were split out without reason.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: rename functions that alloc and free struct raid_set
Mike Snitzer [Thu, 2 Jun 2016 19:08:09 +0000 (15:08 -0400)]
dm raid: rename functions that alloc and free struct raid_set

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: remove all the bitops wrappers
Mike Snitzer [Thu, 2 Jun 2016 16:27:46 +0000 (12:27 -0400)]
dm raid: remove all the bitops wrappers

Removes obfuscation that is of little value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: rename _in_range to __within_range
Mike Snitzer [Thu, 2 Jun 2016 16:06:54 +0000 (12:06 -0400)]
dm raid: rename _in_range to __within_range

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add missing "dm-raid0" module alias
Mike Snitzer [Thu, 2 Jun 2016 16:02:19 +0000 (12:02 -0400)]
dm raid: add missing "dm-raid0" module alias

Also update module description to "raid0/1/10/4/5/6 target"

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: rename _argname_by_flag to dm_raid_arg_name_by_flag
Mike Snitzer [Thu, 2 Jun 2016 15:58:51 +0000 (11:58 -0400)]
dm raid: rename _argname_by_flag to dm_raid_arg_name_by_flag

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: bump to v1.9.0 and make the extended SB feature flag reflect it
Mike Snitzer [Thu, 2 Jun 2016 15:48:09 +0000 (11:48 -0400)]
dm raid: bump to v1.9.0 and make the extended SB feature flag reflect it

No idea what Heinz was doing with the versioning but upstream commit
4c9971ca6a ("dm raid: make sure no feature flags are set in metadata")
bumped to 1.8.0 already.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: remove ti_error_* wrappers
Mike Snitzer [Tue, 31 May 2016 18:26:52 +0000 (14:26 -0400)]
dm raid: remove ti_error_* wrappers

These ti_error_* wrappers added very little.  No other DM target has
ever gone to such lengths to wrap setting ti->error.

Also fixes some NULL dereferences via rs->ti->error.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: tabify appropriate whitespace
Mike Snitzer [Mon, 30 May 2016 17:03:37 +0000 (13:03 -0400)]
dm raid: tabify appropriate whitespace

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: enhance status interface and fixup takeover/raid0
Heinz Mauelshagen [Thu, 19 May 2016 16:49:34 +0000 (18:49 +0200)]
dm raid: enhance status interface and fixup takeover/raid0

The target's status interface has to provide the new 'data_offset' value
to allow userspace to retrieve the kernel's offset to the data on each
raid device of a raid set.  This is the base for out-of-place reshaping
required to not write over any data during reshaping (e.g. change
raid6_zr -> raid6_nc):

 - add rs_set_cur() to be able to start up existing array in case of no
   takeover; use in ctr on takeover check

 - enhance raid_status()

 - add supporting functions to get resync/reshape progress and raid
   device status chars

 - fixup rebuild table line output race, which fails to emit
   'rebuild N' on fully synced/rebuilt devices because it relies on
   the transient 'In_sync' raid device flag

 - add new status line output for 'data_offset', which'll later be used
   for out-of-place reshaping

 - fixup takeover not working for all levels

 - fixup raid0 message interface oops caused by missing checks
   for the md threads, which don't exist in case of raid0

 - remove ALL_FREEZE_FLAGS not needed for takeover

 - adjust comments

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add raid level takeover support
Heinz Mauelshagen [Thu, 19 May 2016 16:49:33 +0000 (18:49 +0200)]
dm raid: add raid level takeover support

Add raid level takeover support allowing arbitrary takeovers between
raid levels supported by md personalities (i.e. raid0, raid1/10 and
raid4/5/6):

 - add rs_config_{backup|restore} functions to temporarily store
   ctr-requested layout changes and restore them for the takeover
   conversion decision after the superblocks have been loaded and analyzed

 - add members to store layout to 'struct raid_set' (not mandatory
   for takeover but needed for reshape in later patch)

 - add rebuild_disks bitfield to 'struct raid_set' and set bits in ctr
   to use in setting up takeover (base to address a 'rebuild' related
   raid_status() table line bug and needed as well for reshape in future
   patch)

 - add runtime flags and respective manipulation functions to be able to
   control e.g. writing of superblocks to the preresume function on
   takeover and (later) reshape

 - add functions to detect takeover, check it's valid (mandatory here to
   avoid failing on md_run()), setup for it and use in the ctr; those
   will be likely moved out once reshaping gets added to simplify the
   ctr

 - start raid set readonly in ctr and switch to readwrite, optionally
   updating superblocks, in preresume in order to allow suspend to
   quiesce any active table before (which involves superblock updates);
   this ensures the proper sequence of writing the current and any new
   takeover(/reshape) metadata

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: enhance super_sync() to support new superblock members
Heinz Mauelshagen [Thu, 19 May 2016 16:49:32 +0000 (18:49 +0200)]
dm raid: enhance super_sync() to support new superblock members

Extend super_sync() to transfer the newly introduced takeover/reshape
related superblock members:

 - add/move supporting functions

 - add failed devices bitfield transfer functions to retrieve the
   bitfield from superblock format or update it in the superblock

 - add code to transfer all new members

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: add new reshaping/raid10 format table line options to parameter parser
Heinz Mauelshagen [Thu, 19 May 2016 16:49:31 +0000 (18:49 +0200)]
dm raid: add new reshaping/raid10 format table line options to parameter parser

Support the following arguments in the ctr parameter parser:

 - add 'delta_disks', 'data_offset' taking int and sector respectively

 - 'raid10_use_near_sets' bool argument to optionally select
   near sets with supporting raid10 mappings

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: introduce extended superblock and new raid types to support takeover/reshaping
Heinz Mauelshagen [Thu, 19 May 2016 16:49:30 +0000 (18:49 +0200)]
dm raid: introduce extended superblock and new raid types to support takeover/reshaping

Add new members to the dm-raid superblock and new raid types to support
takeover/reshape.

Add all necessary members needed to support takeover and reshape in one
go -- aiming to limit the amount of changes to the superblock layout.

This is a larger patch due to the new superblock members, their related
flags, validation of both and involved API additions/changes:

 - add additional members to keep track of:
   - state about forward/backward reshaping
   - reshape position
   - new level, layout, stripe size and delta disks
   - data offset to current and new data for out-of-place reshapes
   - failed devices bitfield extensions to keep track of max raid devices

 - adjust super_validate() to cope with new superblock members

 - adjust super_init_validation() to cope with new superblock members

 - add definitions for ctr flags supporting delta disks etc.

 - add new raid types (raid6_n_6 etc.)

 - add new raid10 supporting function API (_is_raid10_*())

 - adjust to changed raid10 supporting function API

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agoblock/blk-cgroup.c: Declare local symbols static
Bart Van Assche [Tue, 14 Jun 2016 15:04:32 +0000 (17:04 +0200)]
block/blk-cgroup.c: Declare local symbols static

Detected by sparse.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock/bio-integrity.c: Add #include "blk.h"
Bart Van Assche [Tue, 14 Jun 2016 15:03:45 +0000 (17:03 +0200)]
block/bio-integrity.c: Add #include "blk.h"

This patch avoids that building with W=1 C=2 triggers the following
warning:

block/bio-integrity.c:35:6: warning: symbol 'blk_flush_integrity' was not declared. Should it be static?

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock/partition-generic.c: Remove a set-but-not-used variable
Bart Van Assche [Tue, 14 Jun 2016 15:03:13 +0000 (17:03 +0200)]
block/partition-generic.c: Remove a set-but-not-used variable

A value is assigned to the variable 'info' but that value is never
used. Hence remove the variable 'info'.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agodm raid: use rt_is_raid*() in all appropriate checks
Heinz Mauelshagen [Thu, 19 May 2016 16:49:29 +0000 (18:49 +0200)]
dm raid: use rt_is_raid*() in all appropriate checks

Make use of the raid type rt_is_*() bool functions for simplification
and consistency reasons.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: more use of flag testing wrappers
Heinz Mauelshagen [Thu, 19 May 2016 16:49:28 +0000 (18:49 +0200)]
dm raid: more use of flag testing wrappers

 - add _test_flags() function

 - use it to simplify rs_check_for_invalid_flags()

 - use _test_flag() throughout

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: check constructor arguments for invalid raid level/argument combinations
Heinz Mauelshagen [Thu, 19 May 2016 16:49:27 +0000 (18:49 +0200)]
dm raid: check constructor arguments for invalid raid level/argument combinations

Reject invalid flag combinations to avoid potential data corruption or
failing raid set construction:

 - add definitions for constructor flag combinations and invalid flags
   per level

 - add bool test functions for the various raid types
   (also will be used by future reshaping enhancements)

 - introduce rs_check_for_invalid_flags() and _invalid_flags()
   to perform the validity checks
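
As a rough illustration of such a per-level check (a sketch only: the
flag names, level names and the combinations marked invalid below are
hypothetical, not the definitions dm-raid uses), each level can carry a
mask of constructor flags it rejects:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constructor flags; the real dm-raid names and values differ. */
enum {
	CTR_FLAG_SYNC          = 1u << 0,
	CTR_FLAG_REBUILD       = 1u << 1,
	CTR_FLAG_STRIPE_CACHE  = 1u << 2,
	CTR_FLAG_RAID10_COPIES = 1u << 3,
};

enum raid_level { LEVEL_RAID0, LEVEL_RAID1, LEVEL_RAID10, LEVEL_RAID456 };

/* Flags that make no sense for a given level (illustrative choices only). */
static uint32_t invalid_flags_for(enum raid_level level)
{
	switch (level) {
	case LEVEL_RAID0:
		/* No redundancy: nothing to sync or rebuild, no stripe cache. */
		return CTR_FLAG_SYNC | CTR_FLAG_REBUILD |
		       CTR_FLAG_STRIPE_CACHE | CTR_FLAG_RAID10_COPIES;
	case LEVEL_RAID1:
		return CTR_FLAG_STRIPE_CACHE | CTR_FLAG_RAID10_COPIES;
	case LEVEL_RAID10:
		return CTR_FLAG_STRIPE_CACHE;
	case LEVEL_RAID456:
		return CTR_FLAG_RAID10_COPIES;
	}
	return ~0u; /* unknown level: reject every flag */
}

/* True if any requested flag is invalid for the level. */
static bool has_invalid_flags(enum raid_level level, uint32_t ctr_flags)
{
	return (ctr_flags & invalid_flags_for(level)) != 0;
}
```

Keeping the invalid combinations in one table-like function means the
constructor only needs a single mask test per raid set.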

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: cleanup / provide infrastructure
Heinz Mauelshagen [Thu, 19 May 2016 16:49:26 +0000 (18:49 +0200)]
dm raid: cleanup / provide infrastructure

Provide necessary infrastructure to handle ctr flags and their names
and cleanup setting ti->error:

 - comment constructor flags

 - introduce constructor flag manipulation

 - introduce ti_error_*() functions to simplify
   setting the error message (use in other targets?)

 - introduce array to hold ctr flag <-> flag name mapping

 - introduce argument name by flag functions for that array

 - use those functions throughout the ctr call path

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: use dm_arg_set API in constructor
Heinz Mauelshagen [Thu, 19 May 2016 16:49:25 +0000 (18:49 +0200)]
dm raid: use dm_arg_set API in constructor

- use dm_arg_set API in ctr and its callees parse_raid_params() and dev_parms()

- introduce _in_range() function to check a value is in a [ min, max ] range;
  this is to support more callers in parsing parameters etc. in the future

- correct comment on MAX_RAID_DEVICES
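
A minimal sketch of an inclusive [ min, max ] check of this kind (the
name, argument types and argument order here are assumptions, not the
signature used in dm-raid.c):

```c
#include <stdbool.h>

/* Inclusive range check: true if min <= v <= max. */
static bool in_range(long v, long min, long max)
{
	return v >= min && v <= max;
}
```

A parameter parser would call this once per numeric argument and set
the error message when it returns false.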

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm raid: rename variable 'ret' to 'r' to conform to other dm code
Heinz Mauelshagen [Thu, 19 May 2016 16:49:24 +0000 (18:49 +0200)]
dm raid: rename variable 'ret' to 'r' to conform to other dm code

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm mpath: add optional "queue_mode" feature
Mike Snitzer [Wed, 25 May 2016 01:16:51 +0000 (21:16 -0400)]
dm mpath: add optional "queue_mode" feature

Allow a user to specify an optional feature 'queue_mode <mode>' where
<mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based,
request_fn rq-based, and blk-mq rq-based respectively.

If the queue_mode feature isn't specified the default for the
"multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y
it'll default to mode "mq".
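
The selection rule described above can be sketched as follows (the enum
and function names are hypothetical and only model the behavior; they
are not taken from dm-mpath.c):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; the real code maps modes onto DM_TYPE_* values. */
enum queue_mode {
	QUEUE_MODE_BIO,
	QUEUE_MODE_RQ,
	QUEUE_MODE_MQ,
	QUEUE_MODE_INVALID,
};

/*
 * arg is the <mode> string, or NULL if the feature was not given.
 * use_blk_mq models the dm_mod.use_blk_mq module parameter.
 */
static enum queue_mode pick_queue_mode(const char *arg, bool use_blk_mq)
{
	if (!arg)
		return use_blk_mq ? QUEUE_MODE_MQ : QUEUE_MODE_RQ;
	if (!strcmp(arg, "bio"))
		return QUEUE_MODE_BIO;
	if (!strcmp(arg, "rq"))
		return QUEUE_MODE_RQ;
	if (!strcmp(arg, "mq"))
		return QUEUE_MODE_MQ;
	return QUEUE_MODE_INVALID;
}
```

An explicit mode always wins in this sketch; the module parameter only
matters when the feature is absent.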

This new queue_mode feature introduces the ability for each multipath
device to have its own queue_mode (whereas before this feature all
multipath devices effectively had to have the same queue_mode).

This commit also goes a long way to eliminate the awkward (ab)use of
DM_TYPE_*, the associated filter_md_type() and other relatively fragile
and difficult to maintain code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm mpath: remove bio-based bloat from struct dm_mpath_io
Mike Snitzer [Tue, 24 May 2016 19:48:08 +0000 (15:48 -0400)]
dm mpath: remove bio-based bloat from struct dm_mpath_io

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm mpath: reinstate bio-based support
Mike Snitzer [Thu, 19 May 2016 20:15:14 +0000 (16:15 -0400)]
dm mpath: reinstate bio-based support

Add "multipath-bio" target that offers a bio-based multipath target as
an alternative to the request-based "multipath" target -- but in a
following commit "multipath-bio" will immediately be replaced by a new
"queue_mode" feature for the "multipath" target which will allow
bio-based mode to be selected.

When DM multipath was originally converted from bio-based to
request-based the motivation for the change was better dynamic load
balancing (by leveraging block core's request-based IO schedulers, for
merging and sorting, _before_ DM multipath would make the decision on
where to steer the IO -- based on path load and/or availability).

More background is available in this "Request-based Device-mapper
multipath and Dynamic load balancing" paper:
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf

But we've now come full circle where significantly faster storage
devices no longer need IOs to be made larger to drive optimal IO
performance.  And even if they do there have been changes to the block
and filesystem layers that help ensure upper layers are constructing
larger IOs.  In addition, SCSI's differentiated IO errors will propagate
through to bio-based IO completion hooks -- so that eliminates another
historic justification for request-based DM multipath.  Lastly, the block
layer's immutable biovec changes have made bio cloning cheaper than it
has ever been; whereas request cloning is still relatively expensive
(both on a CPU usage and memory footprint level).

As such, bio-based DM multipath offers the promise of a more efficient
IO path for high IOPs devices that are, or will be, emerging.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agodm: move request-based code out to dm-rq.[hc]
Mike Snitzer [Thu, 12 May 2016 20:28:10 +0000 (16:28 -0400)]
dm: move request-based code out to dm-rq.[hc]

Add some separation between bio-based and request-based DM core code.

'struct mapped_device' and other DM core only structures and functions
have been moved to dm-core.h and all relevant DM core .c files have been
updated to include dm-core.h rather than dm.h

DM targets should _never_ include dm-core.h!

[block core merge conflict resolution from Stephen Rothwell]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
8 years agoblock: bio: kill BIO_MAX_SIZE
Ming Lei [Fri, 10 Jun 2016 03:27:12 +0000 (11:27 +0800)]
block: bio: kill BIO_MAX_SIZE

No one needs this macro now, so remove it. Basically,
only the number of bvecs in a bio matters, not the
number of bytes in it.

The motivation is to support multipage bvecs, where we
only know the maximum count of bvecs supported in the
bio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: temporarily boost queue priority for idle classes
Jens Axboe [Thu, 9 Jun 2016 21:47:29 +0000 (15:47 -0600)]
cfq-iosched: temporarily boost queue priority for idle classes

If we're queuing REQ_PRIO IO and the task is running at an idle IO
class, then temporarily boost the priority. This prevents livelocks
due to priority inversion, when a low priority task is holding file
system resources while attempting to do IO.
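
A simplified model of the boost decision (a sketch under assumptions:
the class names are hypothetical, and lifting to best-effort is chosen
here for illustration rather than taken from the patch):

```c
#include <stdbool.h>

/* Hypothetical model; the names only echo the IO priority class idea. */
enum io_class { IO_CLASS_RT, IO_CLASS_BE, IO_CLASS_IDLE };

/*
 * Return the class a request should be queued at. Priority IO
 * (REQ_PRIO-style) from an idle-class task is lifted so it cannot be
 * starved while the task holds a filesystem resource.
 */
static enum io_class effective_class(enum io_class task_class, bool is_prio_io)
{
	if (is_prio_io && task_class == IO_CLASS_IDLE)
		return IO_CLASS_BE;
	return task_class;
}
```

Ordinary IO from the idle task keeps its class in this model, so the
boost is limited to the requests that can block other tasks.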

An example of that is shown below. An ioniced idle task is holding
the directory mutex, while a normal priority task is trying to do
a directory lookup.

[478381.198925] ------------[ cut here ]------------
[478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
[478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
[478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
[478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
[478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
[478381.206874] Call Trace:
[478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
[478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
[478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
[478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
[478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
[478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
[478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
[478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
[478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
[478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
[478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
[478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
[478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
[478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
[478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
[478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
[478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
[478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
[478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
[478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
[478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
[478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
[478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
[478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
[478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
[478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
[478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
[478381.239126] ------------[ cut here ]------------
[478381.239915] ------------[ cut here ]------------
[478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
[478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
[478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
[478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
[478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
[478381.248252] Call Trace:
[478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
[478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
[478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
[478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
[478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
[478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
[478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
[478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
[478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
[478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
[478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
[478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
[478381.264551] ------------[ cut here ]------------
[478381.265377] ------------[ cut here ]------------

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
8 years agoblock: drbd: avoid to use BIO_MAX_SIZE
Ming Lei [Mon, 30 May 2016 13:34:35 +0000 (21:34 +0800)]
block: drbd: avoid to use BIO_MAX_SIZE

Use BIO_MAX_PAGES instead and we will remove BIO_MAX_SIZE.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: bio: remove BIO_MAX_SECTORS
Ming Lei [Thu, 9 Jun 2016 16:03:28 +0000 (10:03 -0600)]
block: bio: remove BIO_MAX_SECTORS

No one needs this macro, so remove it. The motivation is to support
multipage bvecs, where we only know the maximum count of bvecs supported
in the bio, not its maximum size or sector count.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agofs: xfs: replace BIO_MAX_SECTORS with BIO_MAX_PAGES
Ming Lei [Mon, 30 May 2016 13:34:33 +0000 (21:34 +0800)]
fs: xfs: replace BIO_MAX_SECTORS with BIO_MAX_PAGES

BIO_MAX_PAGES is used as maximum count of bvecs, so
replace BIO_MAX_SECTORS with BIO_MAX_PAGES since
BIO_MAX_SECTORS is to be removed.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoiov_iter: use bvec iterator to implement iterate_bvec()
Ming Lei [Mon, 30 May 2016 13:34:32 +0000 (21:34 +0800)]
iov_iter: use bvec iterator to implement iterate_bvec()

bvec has had a native, mature iterator for a long time, so
there is no need to reinvent the wheel for iterating bvecs
in lib/iov_iter.c.

Two ITER_BVEC test cases are run:
- xfstest(-g auto) on loop dio/aio, no regression found
- swap file works well under extreme stress (stress-ng --all 64 -t
  800 -v); lots of OOMs are triggered, and the whole system still
  survives

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: mark 1st parameter of bvec_iter_advance as const
Ming Lei [Mon, 30 May 2016 13:34:31 +0000 (21:34 +0800)]
block: mark 1st parameter of bvec_iter_advance as const

bvec_iter_advance() only writes to the iterator parameter,
so the base address of the bvec can safely be marked const.

Without this change, the following patch, which implements
iterate_bvec() in lib/iov_iter.c with the bvec iterator,
triggers a compile warning.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: move two bvec structure into bvec.h
Ming Lei [Mon, 30 May 2016 13:34:30 +0000 (21:34 +0800)]
block: move two bvec structure into bvec.h

This patch moves 'struct bio_vec' and 'struct bvec_iter'
into 'include/linux/bvec.h' and always includes this header
from 'include/linux/blk_types.h'.

With this change, neither 'struct bvec_iter' nor the bvec iterator
helpers depend on CONFIG_BLOCK any more, so we can use the
bvec iterator to implement iterate_bvec() in lib/iov_iter.c.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: move bvec iterator into include/linux/bvec.h
Ming Lei [Thu, 9 Jun 2016 16:00:58 +0000 (10:00 -0600)]
block: move bvec iterator into include/linux/bvec.h

The bvec iterator helpers should also be used to implement
iterate_bvec() in lib/iov_iter.c, so move them into
one header that does not depend on CONFIG_BLOCK.
Then we can remove the reinvented wheel in
iterate_bvec().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblk-mq: actually hook up defer list when running requests
Omar Sandoval [Thu, 9 Jun 2016 01:22:20 +0000 (18:22 -0700)]
blk-mq: actually hook up defer list when running requests

If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
over the rest of the loop body. However, dptr is assigned later in the
loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
want it for.
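
The control-flow problem can be modeled in isolation (a hypothetical
sketch, not the blk-mq dispatch code: the function, its arguments and
the int standing in for the defer list are invented for illustration):

```c
#include <stddef.h>

/*
 * Model of a dispatch loop over nr requests that all queue
 * successfully. Returns how many requests were issued with a non-NULL
 * defer list pointer. With fixed == 0 the success path uses 'continue'
 * and never reaches the dptr assignment.
 */
static int dispatch(int nr, int fixed)
{
	int defer_list = 0;	/* stands in for the driver's defer list */
	int *dptr = NULL;	/* first request is issued without a list */
	int with_list = 0;

	for (int i = 0; i < nr; i++) {
		/* Issue request i; pretend ->queue_rq() reported success. */
		if (dptr)
			with_list++;

		if (!fixed)
			continue;	/* bug: skips the assignment below */

		/* After the first request, hand later ones the defer list. */
		if (!dptr && i + 1 < nr)
			dptr = &defer_list;
	}
	return with_list;
}
```

With three requests, the buggy variant never passes the list, while the
fixed variant passes it to the second and third request.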

NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
in-tree driver, but if the code's going to be there, it might as well
work.

Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock: better packing for struct request
Christoph Hellwig [Thu, 9 Jun 2016 14:00:35 +0000 (16:00 +0200)]
block: better packing for struct request

Keep the 32-bit CPU and cmd_type flags together to avoid holes on 64-bit
architectures.
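
The effect of such a reordering can be seen with two toy structures
(hypothetical field names, not the real struct request layout; the
sizes in the comments assume a typical ABI with 8-byte pointers and
4-byte ints):

```c
#include <stddef.h>

struct req_holes {
	void *queue;
	int cpu;		/* 4 bytes, then 4 bytes of padding */
	void *bio;
	unsigned int cmd_type;	/* 4 bytes, then 4 bytes of padding */
	void *private_data;
};

struct req_packed {
	void *queue;
	int cpu;		/* the two 32-bit fields share one 8-byte slot */
	unsigned int cmd_type;
	void *bio;
	void *private_data;
};

/* Bytes saved by placing the two 32-bit members next to each other. */
static size_t bytes_saved(void)
{
	return sizeof(struct req_holes) - sizeof(struct req_packed);
}
```

On 32-bit ABIs, where pointers are also 4 bytes, both layouts are
expected to have the same size.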

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoext4: use bio op helprs in ext4 crypto code
Mike Christie [Wed, 8 Jun 2016 20:49:40 +0000 (15:49 -0500)]
ext4: use bio op helprs in ext4 crypto code

This was missed from my last patchset.

This patch has ext4 crypto code use the bio op helper
to set the operation. The operation (discard, write, writesame,
etc) is now defined separately from the other REQ bits. They
still share the bi_rw field to save space, so we use these
helpers so modules do not have to worry about setting/overwriting
info.
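
A sketch of how an operation and flag bits can share one 32-bit word
behind accessors (the macro names, bit widths and enum values are
assumptions made for illustration, not the kernel's definitions):

```c
#include <stdint.h>

/* Hypothetical layout: the top OP_BITS of the word hold the operation. */
#define OP_BITS   3u
#define OP_SHIFT  (32u - OP_BITS)
#define FLAG_MASK ((UINT32_C(1) << OP_SHIFT) - 1)

enum op { OP_READ = 0, OP_WRITE = 1, OP_DISCARD = 2, OP_WRITE_SAME = 3 };

/* Replace the operation, leaving the flag bits untouched. */
static uint32_t set_op(uint32_t rw, enum op op)
{
	return (rw & FLAG_MASK) | ((uint32_t)op << OP_SHIFT);
}

/* Extract the operation from the shared word. */
static enum op get_op(uint32_t rw)
{
	return (enum op)(rw >> OP_SHIFT);
}
```

Callers that go through accessors like these cannot clobber the flags
when changing the operation, which is the point of using helpers.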

Jens, I am not sure how you handle patches on top of patches
in the next branches. If you merge patches that fix issues
in previous patches in next, then this patch could be part
of

commit 95fe6c1a209ef89d9f94dd04a0ad72be1487d5d5
Author: Mike Christie <mchristi@redhat.com>
Date:   Sun Jun 5 14:31:48 2016 -0500

    block, fs, mm, drivers: use bio set/get op accessors

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Convert to use highres timers
Jan Kara [Wed, 8 Jun 2016 13:11:39 +0000 (15:11 +0200)]
cfq-iosched: Convert to use highres timers

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Expose microsecond interfaces
Jeff Moyer [Wed, 8 Jun 2016 13:11:38 +0000 (15:11 +0200)]
cfq-iosched: Expose microsecond interfaces

Expose interfaces to tune time slices of CFQ IO scheduler in
microseconds.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agocfq-iosched: Convert from jiffies to nanoseconds
Jeff Moyer [Wed, 8 Jun 2016 14:55:34 +0000 (08:55 -0600)]
cfq-iosched: Convert from jiffies to nanoseconds

Convert all time-keeping in CFQ IO scheduler from jiffies to nanoseconds
so that we can later make the intervals more fine-grained than jiffies.
One jiffy is several milliseconds, and even for today's rotating disks
that is a noticeable amount of time, so we leave the disk unnecessarily
idle.
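
For a sense of scale, the length of one jiffy follows directly from the
tick rate (a small sketch; the helper name is made up, and the tick
rates mentioned are common kernel configurations rather than anything
this patch sets):

```c
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/* Length of one jiffy in nanoseconds for a given tick rate (HZ). */
static uint64_t jiffy_ns(unsigned int hz)
{
	return NSEC_PER_SEC / hz;
}
```

At HZ=250 one jiffy is 4,000,000 ns (4 ms), and at HZ=100 it is 10 ms,
so any interval expressed in jiffies cannot be finer than that.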

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agoblock, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
Mike Christie [Sun, 5 Jun 2016 19:32:25 +0000 (14:32 -0500)]
block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>