md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
authorH. Peter Anvin <hpa@zytor.com>
Wed, 12 Jun 2013 14:37:43 +0000 (07:37 -0700)
committerNeilBrown <neilb@suse.de>
Thu, 13 Jun 2013 04:49:54 +0000 (14:49 +1000)
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7 when write_same
support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
drivers/md/raid1.c
drivers/md/raid10.c
drivers/md/raid5.c

index 5208e9d1aff02a19e9c29b8f6eeea12da4304dea..e02ad445090735e770ed85c9401552d09688ccc4 100644 (file)
@@ -2837,8 +2837,8 @@ static int run(struct mddev *mddev)
                return PTR_ERR(conf);
 
        if (mddev->queue)
-               blk_queue_max_write_same_sectors(mddev->queue,
-                                                mddev->chunk_sectors);
+               blk_queue_max_write_same_sectors(mddev->queue, 0);
+
        rdev_for_each(rdev, mddev) {
                if (!mddev->gendisk)
                        continue;
index aa9ed304951ed42674c6490ddc203dc203090bb4..06c2cbe046e2f0459242d37c40f9ed06e4ab7e12 100644 (file)
@@ -3651,8 +3651,7 @@ static int run(struct mddev *mddev)
        if (mddev->queue) {
                blk_queue_max_discard_sectors(mddev->queue,
                                              mddev->chunk_sectors);
-               blk_queue_max_write_same_sectors(mddev->queue,
-                                                mddev->chunk_sectors);
+               blk_queue_max_write_same_sectors(mddev->queue, 0);
                blk_queue_io_min(mddev->queue, chunk_size);
                if (conf->geo.raid_disks % conf->geo.near_copies)
                        blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
index 4a7be455d6d86ceb6bda86a332b81d036db52dee..26ee39936a28ec523a156a1eb73ac1dce64f8df4 100644 (file)
@@ -5465,7 +5465,7 @@ static int run(struct mddev *mddev)
                if (mddev->major_version == 0 &&
                    mddev->minor_version > 90)
                        rdev->recovery_offset = reshape_offset;
-                       
+
                if (rdev->recovery_offset < reshape_offset) {
                        /* We need to check old and new layout */
                        if (!only_parity(rdev->raid_disk,
@@ -5588,6 +5588,8 @@ static int run(struct mddev *mddev)
                 */
                mddev->queue->limits.discard_zeroes_data = 0;
 
+               blk_queue_max_write_same_sectors(mddev->queue, 0);
+
                rdev_for_each(rdev, mddev) {
                        disk_stack_limits(mddev->gendisk, rdev->bdev,
                                          rdev->data_offset << 9);