本文发表在SC’06中,作者提出了一种可扩展的、伪随机的数据分布算法CRUSH,后来成为了存储系统Ceph的核心部分之一。
CRUSH
CRUSH (Controlled Replication Under Scalable Hashing,发音为[krʌʃ])是一种针对分布式对象存储系统而设计的、伪随机的数据分布算法。
作用:
We have developed CRUSH, a pseudo-random data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.
CRUSH is implemented as a pseudo-random, deterministic function that maps an input value, typically an object or object group identifier, to a list of devices on which to store object replicas.
$$
CRUSH(\mathrm{pgid, Cluster\ Map, Rule}) = (osd_1, osd_2, \cdots, osd_n)
$$
与传统数据分布方法相比,有何不同:
This differs from conventional approaches in that data placement does not rely on any sort of per-file or per-object directory—CRUSH needs only a compact, hierarchical description of the devices comprising the storage cluster and knowledge of the replica placement policy.
优点:
1.It is completely distributed such that any party in a large system can independently calculate the location of any object;
2.What little metadata is required is mostly static, changing only when devices are added or removed.
Cluster Map
集群运行图:
The cluster map is composed of devices and buckets, both of which have numerical identifiers and weight values associated with them.
1.Buckets can contain any number of devices or other buckets, allowing them to form interior nodes in a storage hierarchy in which devices are always at the leaves.
2.Storage devices are assigned weights by the administrator to control the relative amount of data they are responsible for storing.
3.Bucket weights are defined as the sum of the weights of the items they contain.
其定义在crush/crush.h中:
1 | /** |
其中,bucket定义在https://github.com/ceph/ceph/blob/v16.2.7/src/crush/crush.h#L229
1 | struct crush_bucket { |
https://github.com/ceph/ceph/blob/v16.2.7/src/crush/crush.h#L91
1 | struct crush_rule { |
数据分布算法
https://github.com/ceph/ceph/blob/v16.2.7/src/crush/mapper.c#L900
1 | /** |
Bucket Types
https://github.com/ceph/ceph/blob/master/src/crush/mapper.c
Uniform Buckets
Given a CRUSH input value of x and a replica number r, we choose an item from a uniform bucket of size m:
$$
c(r, x) = (hash(x) + rp) \mod m
$$
where p is a randomly (but deterministically) chosen prime number greater than m.
List Buckets
List buckets structure their contents as a linked list, and can contain items with arbitrary weights.
To place a replica, CRUSH begins at the head of the list with the most recently added item and compares its weight to the sum of all remaining items’ weights.
Tree Buckets
Tree buckets are structured as a weighted binary search tree with items at the leaves.
节点标记策略:
- The leftmost leaf in the tree is always labeled “1.”
- Each time the tree is expanded, the old root becomes the new root’s left child, and the new root node is labeled with the old root’s label shifted one bit to the left (1, 10, 100, etc.).
- The labels for the right side of the tree mirror those on the left side except with a “1” prepended to each value.(先在前面补零对齐,然后再添加前置的1)
Straw Buckets
To place a replica, a straw of random length is drawn for each item in the bucket. The item with the longest straw wins.
The length of each straw is initially a value in a fixed range, based on a hash of the CRUSH input x, replica number r, and bucket item i. Each straw length is scaled by a factor $f(w_i)$ based on the item’s weight so that heavily weighted items are more likely to win the draw:
$$
c(r, x) = \max_i \left( f(w_i) hash(x, r, i) \right)
$$