libceph: don't allow bidirectional swap of pg-upmap-items
authorIlya Dryomov <idryomov@gmail.com>
Mon, 18 Sep 2017 10:21:37 +0000 (12:21 +0200)
committerIlya Dryomov <idryomov@gmail.com>
Tue, 19 Sep 2017 18:34:29 +0000 (20:34 +0200)
This reverts most of commit f53b7665c8ce ("libceph: upmap semantic
changes").

We need to prevent duplicates in the final result.  For example, we
can currently take

  [1,2,3] and apply [(1,2)] and get [2,2,3]

or

  [1,2,3] and apply [(3,2)] and get [1,2,2]

The rest of the system is not prepared to handle duplicates in the
result set like this.

The reverted piece was intended to allow

  [1,2,3] and [(1,2),(2,1)] to get [2,1,3]

to reorder primaries.  First, this bidirectional swap is hard to
implement in a way that also prevents dups.  For example, [1,2,3] and
[(1,4),(2,3),(3,4)] would give [4,3,4] but would we just drop the last
step we'd have [4,3,3] which is also invalid, etc.  Simpler to just not
handle bidirectional swaps.  In practice, they are not needed: if you
just want to choose a different primary then use primary_affinity, or
pg_upmap (not pg_upmap_items).

Cc: stable@vger.kernel.org # 4.13
Link: http://tracker.ceph.com/issues/21410
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
net/ceph/osdmap.c

index f358d0bfa76b35cb978e9e92b57aea38a4e3b391..79d14d70b7ea7e032ecffec839b2f44ee4fe7cdc 100644 (file)
@@ -2445,19 +2445,34 @@ static void apply_upmap(struct ceph_osdmap *osdmap,
 
        pg = lookup_pg_mapping(&osdmap->pg_upmap_items, pgid);
        if (pg) {
-               for (i = 0; i < raw->size; i++) {
-                       for (j = 0; j < pg->pg_upmap_items.len; j++) {
-                               int from = pg->pg_upmap_items.from_to[j][0];
-                               int to = pg->pg_upmap_items.from_to[j][1];
-
-                               if (from == raw->osds[i]) {
-                                       if (!(to != CRUSH_ITEM_NONE &&
-                                             to < osdmap->max_osd &&
-                                             osdmap->osd_weight[to] == 0))
-                                               raw->osds[i] = to;
+               /*
+                * Note: this approach does not allow a bidirectional swap,
+                * e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1].
+                */
+               for (i = 0; i < pg->pg_upmap_items.len; i++) {
+                       int from = pg->pg_upmap_items.from_to[i][0];
+                       int to = pg->pg_upmap_items.from_to[i][1];
+                       int pos = -1;
+                       bool exists = false;
+
+                       /* make sure replacement doesn't already appear */
+                       for (j = 0; j < raw->size; j++) {
+                               int osd = raw->osds[j];
+
+                               if (osd == to) {
+                                       exists = true;
                                        break;
                                }
+                               /* ignore mapping if target is marked out */
+                               if (osd == from && pos < 0 &&
+                                   !(to != CRUSH_ITEM_NONE &&
+                                     to < osdmap->max_osd &&
+                                     osdmap->osd_weight[to] == 0)) {
+                                       pos = j;
+                               }
                        }
+                       if (!exists && pos >= 0)
+                               raw->osds[pos] = to;
                }
        }
 }