论文阅读《CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data》

本文发表在SC’06中，作者提出了一种可扩展的、伪随机的数据分布算法CRUSH，后来成为了存储系统Ceph的核心部分之一。

CRUSH

CRUSH (Controlled Replication Under Scalable Hashing，发音为[krʌʃ]）是一种针对分布式对象存储系统而设计的、伪随机的数据分布算法。

作用：

We have developed CRUSH, a pseudo-random data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.

CRUSH is implemented as a pseudo-random, deterministic function that maps an input value, typically an object or object group identifier, to a list of devices on which to store object replicas.

$$
CRUSH(\mathrm{pgid, Cluster\ Map, Rule}) = (osd_1, osd_2, \cdots, osd_n)
$$

与传统数据分布方法相比，有何不同：

This differs from conventional approaches in that data placement does not rely on any sort of per-file or per-object directory—CRUSH needs only a compact, hierarchical description of the devices comprising the storage cluster and knowledge of the replica placement policy.

优点：

1.It is completely distributed such that any party in a large system can independently calculate the location of any object;

2.What little metadata is required is mostly static, changing only when devices are added or removed.

Cluster Map

集群运行图：

The cluster map is composed of devices and buckets, both of which have numerical identifiers and weight values associated with them.

1.Buckets can contain any number of devices or other buckets, allowing them to form interior nodes in a storage hierarchy in which devices are always at the leaves.

2.Storage devices are assigned weights by the administrator to control the relative amount of data they are responsible for storing.

3.Bucket weights are defined as the sum of the weights of the items they contain.

其定义在crush/crush.h中：

/** 
 * A crush map define a hierarchy of crush_bucket that end with leaves
 * (buckets and leaves are called items) and a set of crush_rule to
 * map an integer to items with the crush_do_rule() function.
 */
struct crush_map {
    struct crush_bucket **buckets;
    struct crush_rule **rules;
    // ...
}

其中，bucket定义在https://github.com/ceph/ceph/blob/v16.2.7/src/crush/crush.h#L229

struct crush_bucket {
    __s32 id;        /* bucket identifier, < 0 and unique within a crush_map */
    __u16 type;      /* > 0 bucket type, defined by the caller */
    __u8 alg;        /* the item selection ::crush_algorithm */
    __u8 hash;       /* which hash function to use, CRUSH_HASH_* */
    __u32 weight;    /* 16.16 fixed point cumulated children weight */
    __u32 size;      /* size of the __items__ array */
    __s32 *items;    /* array of children: < 0 are buckets, >= 0 items */
};

https://github.com/ceph/ceph/blob/v16.2.7/src/crush/crush.h#L91

struct crush_rule {
	__u32 len;
	struct crush_rule_mask mask;
	struct crush_rule_step steps[0];
};
/*
 * The rule mask is used to describe what the rule is intended for.
 * Given a ruleset and size of output set, we search through the
 * rule list for a matching rule_mask.
 */
struct crush_rule_mask {
	__u8 ruleset;
	__u8 type;
	__u8 min_size;
	__u8 max_size;
};
/*
 * CRUSH uses user-defined "rules" to describe how inputs should be
 * mapped to devices.  A rule consists of sequence of steps to perform
 * to generate the set of output devices.
 */
struct crush_rule_step {
	__u32 op;
	__s32 arg1;
	__s32 arg2;
};

数据分布算法

https://github.com/ceph/ceph/blob/v16.2.7/src/crush/mapper.c#L900

/**
 * crush_do_rule - calculate a mapping with the given input and rule
 * @map: the crush_map
 * @ruleno: the rule id
 * @x: hash input
 * @result: pointer to result vector
 * @result_max: maximum result size
 * @weight: weight vector (for map leaves)
 * @weight_max: size of weight vector
 * @cwin: Pointer to at least map->working_size bytes of memory or NULL.
 */
int crush_do_rule(const struct crush_map *map, int ruleno, int x, 
      int *result, int result_max,
		  const __u32 *weight, int weight_max,
		  void *cwin, const struct crush_choose_arg *choose_args)

Bucket Types

https://github.com/ceph/ceph/blob/master/src/crush/mapper.c

Uniform Buckets

Given a CRUSH input value of x and a replica number r, we choose an item from a uniform bucket of size m:
$$
c(r, x) = (hash(x) + rp) \mod m
$$
where p is a randomly (but deterministically) chosen prime number greater than m.

List Buckets

List buckets structure their contents as a linked list, and can contain items with arbitrary weights.

To place a replica, CRUSH begins at the head of the list with the most recently added item and compares its weight to the sum of all remaining items’ weights.

Tree Buckets

Tree buckets are structured as a weighted binary search tree with items at the leaves.

节点标记策略：

The leftmost leaf in the tree is always labeled “1.”
Each time the tree is expanded, the old root becomes the new root’s left child, and the new root node is labeled with the old root’s label shifted one bit to the left (1, 10, 100, etc.).
The labels for the right side of the tree mirror those on the left side except with a “1” prepended to each value.（先在前面补零对齐，然后再添加前置的1）

Straw Buckets

To place a replica, a straw of random length is drawn for each item in the bucket. The item with the longest straw wins.

The length of each straw is initially a value in a fixed range, based on a hash of the CRUSH input x, replica number r, and bucket item i. Each straw length is scaled by a factor $f(w_i)$ based on the item’s weight so that heavily weighted items are more likely to win the draw:
$$
c(r, x) = \max_i \left( f(w_i) hash(x, r, i) \right)
$$