This paper was published at OSDI’06 and is the original paper on the open-source storage system Ceph.
Basic Concepts
MDS: Metadata Server
OSD: Object Storage Device
Dynamic Metadata Management
Mainstream approaches to metadata management in distributed file systems fall into three categories:
- Static subtree partitioning, e.g. NFS;
- Dynamic subtree partitioning, e.g. Ceph;
- Hash-based mapping, e.g. Lustre.
Directory Locality and Load Balancing
Static subtree partitioning fails to cope with dynamic workloads and data sets, while hashing destroys metadata locality and critical opportunities for efficient metadata prefetching and storage.
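The trade-off in the quote above can be seen in a small sketch. The subtree table, the number of MDS ranks, and the function names below are all hypothetical and chosen only for illustration; they are not taken from Ceph. With a subtree table, every file under one directory resolves to the same MDS, so locality is preserved. With hashing, the owner depends only on the full path, so sibling files typically scatter across ranks.

```python
import hashlib

NUM_MDS = 4  # hypothetical number of metadata servers

# Hypothetical static assignment of top-level subtrees to MDS ranks.
SUBTREE_TO_MDS = {"/home": 0, "/usr": 1, "/var": 2, "/tmp": 3}


def mds_by_subtree(path: str) -> int:
    """Return the MDS that owns the longest subtree prefix of `path`."""
    # Raises ValueError if no subtree matches; acceptable for a sketch.
    best = max(
        (p for p in SUBTREE_TO_MDS if path == p or path.startswith(p + "/")),
        key=len,
    )
    return SUBTREE_TO_MDS[best]


def mds_by_hash(path: str) -> int:
    """Return an MDS chosen purely by hashing the full path."""
    digest = hashlib.sha256(path.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_MDS


files = [f"/home/alice/project/file{i}.txt" for i in range(8)]
subtree_owners = {mds_by_subtree(f) for f in files}  # one owner: locality kept
hash_owners = {mds_by_hash(f) for f in files}        # usually several owners
```

Static partitioning keeps the directory together but cannot react if `/home` becomes much busier than the other subtrees; hashing balances load but gives up the locality that makes prefetching a whole directory cheap.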
Question 2: How are hot spots handled?
The contents of heavily read directories (e.g., many opens) are selectively replicated across multiple nodes to distribute load.
Directories that are particularly large or experiencing a heavy write workload (e.g., many file creations) have their contents hashed by file name across the cluster, achieving a balanced distribution at the expense of directory locality.
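The two hot-spot strategies quoted above can be read as a per-directory policy decision. The sketch below is only an illustration: the `DirStats` fields, the thresholds, and the policy names are assumptions, and the paper does not specify how the real MDS makes this decision.

```python
from dataclasses import dataclass


@dataclass
class DirStats:
    reads: int    # e.g. opens observed in a recent window (hypothetical metric)
    writes: int   # e.g. file creations in the same window
    entries: int  # number of entries in the directory


# Hypothetical thresholds, chosen only for illustration.
READ_HOT = 1_000
WRITE_HOT = 1_000
LARGE_DIR = 100_000


def placement_policy(stats: DirStats) -> str:
    """Choose how a directory's contents are spread over the MDS cluster."""
    if stats.entries >= LARGE_DIR or stats.writes >= WRITE_HOT:
        # Very large or write-heavy: hash entries by file name across the
        # cluster, trading directory locality for balance.
        return "hash"
    if stats.reads >= READ_HOT:
        # Read-heavy: replicate the contents on several nodes.
        return "replicate"
    # Otherwise keep the directory on a single authoritative MDS.
    return "single"
```

For example, `placement_policy(DirStats(reads=5000, writes=10, entries=50))` selects replication, while a directory with 200,000 entries is hashed regardless of its read rate.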
Data Distribution
CRUSH: Controlled Replication Under Scalable Hashing
PG: Placement Group
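To make the two terms concrete, here is a rough sketch of a two-step placement: an object name is hashed to a PG, and the PG is mapped to an ordered list of OSDs. The second step below uses rendezvous (highest-random-weight) hashing as a simple stand-in; it is not the CRUSH algorithm, which additionally takes the cluster hierarchy and placement rules into account. All function names, the object name, and the cluster size are hypothetical.

```python
import hashlib


def _h(*parts) -> int:
    """Stable 64-bit hash of the given parts (helper for this sketch only)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(object_name: str, pg_count: int) -> int:
    """Step 1: hash the object name to a placement group id."""
    return _h(object_name) % pg_count


def pg_to_osds(pg_id: int, osd_ids: list, replicas: int) -> list:
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-PG pseudo-random score."""
    ranked = sorted(osd_ids, key=lambda osd: _h(pg_id, osd), reverse=True)
    return ranked[:replicas]


osds = list(range(6))  # hypothetical cluster of six OSDs
pg = object_to_pg("inode42.chunk0", pg_count=128)
acting = pg_to_osds(pg, osds, replicas=3)  # ordered list; first entry acts as primary
```

Because any party holding the same OSD list computes the same result, no central lookup table is needed. With this stand-in, removing an OSD that is not in a PG's list leaves that PG's placement unchanged.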
Distributed Object Storage
EBOFS:Extent and B-tree based Object File System
Cluster Map: the cluster view
RADOS: Reliable Autonomic Distributed Object Store
Data Replication
RADOS manages its own replication of data using a variant of primary-copy replication.
1. Writes
Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for n-way replication).
Clients send all writes to the first non-failed OSD in an object’s PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs.
After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client.
2. Reads
Reads are directed at the primary.
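The write and read paths quoted above can be sketched with a toy in-memory model. The `OSD` and `PlacementGroup` classes are hypothetical, there is no networking or persistence, and the version counter is kept on the PG object for brevity, whereas the quoted text has the primary assign it.

```python
from dataclasses import dataclass, field


@dataclass
class OSD:
    osd_id: int
    failed: bool = False
    store: dict = field(default_factory=dict)  # object name -> (version, data)

    def apply(self, name: str, version: int, data: str) -> bool:
        """Apply an update locally and return an acknowledgement."""
        self.store[name] = (version, data)
        return True


class PlacementGroup:
    def __init__(self, osds: list):
        self.osds = osds  # ordered list of n OSDs for n-way replication
        self.version = 0  # simplification: PG-wide version counter

    def primary(self) -> OSD:
        """The first non-failed OSD in the ordered list."""
        return next(o for o in self.osds if not o.failed)

    def write(self, name: str, data: str) -> int:
        primary = self.primary()
        self.version += 1
        version = self.version
        # Forward the write to the remaining live replicas first.
        replicas = [o for o in self.osds if o is not primary and not o.failed]
        if not all(r.apply(name, version, data) for r in replicas):
            raise RuntimeError("a replica did not acknowledge the write")
        # Only then does the primary apply locally and acknowledge the client.
        primary.apply(name, version, data)
        return version

    def read(self, name: str):
        """Reads are served by the primary."""
        return self.primary().store[name]
```

Usage: after `pg = PlacementGroup([OSD(0), OSD(1), OSD(2)])` and `pg.write("obj", "hello")`, all three stores hold version 1. If OSD 0 is then flagged as failed, `pg.primary()` returns OSD 1 and later writes and reads go through it.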
Failure Detection
Each OSD monitors the state of the other OSDs with which it shares PGs.
Failures that make an OSD unreachable on the network, however, require active monitoring, which RADOS distributes by having each OSD monitor those peers with which it shares PGs.
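The set of peers an OSD has to monitor follows directly from the PG-to-OSD mapping. A minimal sketch, assuming the mapping is available as a plain dictionary (the `pg_map` layout and the function name are hypothetical):

```python
def monitoring_peers(pg_map: dict) -> dict:
    """For each OSD, collect the other OSDs that appear in any of its PGs."""
    peers = {}
    for osds in pg_map.values():
        for osd in osds:
            peers.setdefault(osd, set()).update(o for o in osds if o != osd)
    return peers


# Hypothetical mapping: PG 0 lives on OSDs 1, 2, 3 and PG 1 on OSDs 2, 4, 5.
pg_map = {0: [1, 2, 3], 1: [2, 4, 5]}
peers = monitoring_peers(pg_map)
# OSD 2 shares PGs with everyone; OSD 1 only watches OSDs 2 and 3.
```

Each OSD watches only a small neighbourhood, so the monitoring load is spread over the cluster instead of being concentrated on one node.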
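Two dimensions of OSD liveness:

| | in | out |
|---|---|---|
| up | reachable; assigned data by CRUSH | reachable; not assigned data by CRUSH |
| down | unreachable; still assigned data by CRUSH | unreachable; not assigned data by CRUSH |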
RADOS considers two dimensions of OSD liveness: whether the OSD is reachable, and whether it is assigned data by CRUSH.
- An unresponsive OSD is initially marked down, and any primary responsibilities (update serialization, replication) temporarily pass to the next OSD in each of its placement groups.
- If the OSD does not quickly recover, it is marked out of the data distribution, and another OSD joins each PG to re-replicate its contents.
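The two transitions above amount to a small state machine over the up/down and in/out flags. The sketch below is an assumption-laden illustration: the grace period, the field names, and the rule that a recovered OSD rejoins the distribution are all made up here, since the quoted text only says the OSD is marked out if it "does not quickly recover".

```python
from dataclasses import dataclass
from typing import Optional

OUT_AFTER_SECONDS = 300.0  # hypothetical grace period before marking out


@dataclass
class OSDState:
    up: bool = True    # reachable on the network
    in_: bool = True   # assigned data by CRUSH ("in" is a Python keyword)
    down_since: Optional[float] = None


def report_unresponsive(state: OSDState, now: float) -> None:
    """First step: mark the OSD down; it stays in the data distribution."""
    if state.up:
        state.up = False
        state.down_since = now


def report_recovered(state: OSDState) -> None:
    """The OSD came back: mark it up and in again (simplified)."""
    state.up = True
    state.in_ = True
    state.down_since = None


def tick(state: OSDState, now: float) -> None:
    """Second step: mark a long-down OSD out so its data is re-replicated."""
    if (not state.up and state.in_ and state.down_since is not None
            and now - state.down_since >= OUT_AFTER_SECONDS):
        state.in_ = False
```

Keeping the two flags separate means a brief outage only hands the primary role to the next OSD in each PG, while the more expensive re-replication starts only once the OSD is marked out.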
References
1. Sage Weil, Ceph: A Scalable, High-Performance Distributed File System, OSDI’06
2. Sage Weil, Ceph: Reliable, Scalable, and High-Performance Distributed Storage, Ph.D. Thesis, 2007
3. Chen Youxu (陈友旭), Optimization of Metadata Management in Distributed File Systems [D], University of Science and Technology of China, 2019