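This paper was published at SC’04. The authors propose a dynamic metadata management approach that later became one of the core components of the Ceph storage system.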
Background
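According to reported statistics, metadata operations account for more than half of all file system operations. They fall into two broad categories:
- operations on inodes, such as open, close, and setattr;
- operations on dentries (directory entries), such as rename and unlink.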
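For large distributed storage systems, separating metadata management from data management increases the parallelism of the system and avoids the I/O bottleneck found in traditional file systems.
A metadata server (MDS) has limited memory and cannot hold all metadata in memory. The conventional approach is therefore to serve read requests from an in-memory cache, while every update must be written to disk.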
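The read-cache-plus-durable-updates arrangement described above can be illustrated with a small write-through cache. This is only a sketch under stated assumptions: the class name `WriteThroughMetadataCache`, the LRU eviction policy, and the plain `dict` standing in for the disk are all hypothetical and are not taken from the paper.

```python
from collections import OrderedDict


class WriteThroughMetadataCache:
    """LRU read cache in front of a persistent store (hypothetical names)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # dict standing in for on-disk metadata
        self.capacity = capacity      # maximum number of cached entries
        self.cache = OrderedDict()    # key order doubles as LRU order

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)   # mark as most recently used
            return self.cache[path]
        value = self.disk[path]            # cache miss: fetch from disk
        self._insert(path, value)
        return value

    def update(self, path, value):
        self.disk[path] = value            # every update reaches disk first
        self._insert(path, value)

    def _insert(self, path, value):
        self.cache[path] = value
        self.cache.move_to_end(path)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used


disk = {"/a": {"size": 1}, "/b": {"size": 2}, "/c": {"size": 3}}
mds_cache = WriteThroughMetadataCache(disk, capacity=2)
mds_cache.read("/a")
mds_cache.read("/b")
mds_cache.update("/c", {"size": 30})   # evicts "/a", the least recently used
```

Because updates always go to `disk` before the cache is touched, an eviction never loses data; it only costs a later read a trip to disk.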
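There are two ways to store metadata:
- Directly-Attached Storage (DAS): metadata is stored directly on the MDS
The drawback of this approach is that when an MDS fails, a large amount of data has to be migrated.
- Shared Metadata Store: metadata is kept on separate storage servers
This approach reduces the amount of data that has to be migrated, and it also allows commodity hardware to be used for storage.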
Metadata Distribution
- Static Subtree Partitioning
Static subtree partitioning typically requires a system administrator to decide how the file system should be distributed and manually assign subtrees of the hierarchy to individual file servers.
This can lead to load imbalance:
If client workload is not evenly distributed across all file data, static partitioning is vulnerable to imbalance as individual servers can be overloaded by “hot spots” of popularity in certain parts of the hierarchy.
- Hash-based Partitioning
Distribute files across servers based on a hash of some unique file identifier, such as an inode number or path name.
Advantage: the metadata load can be balanced across servers
Disadvantages:
1. Hot-spots consisting of individual files can still overwhelm a single responsible MDS.
2. A hashed metadata distribution also makes the process of expanding the MDS cluster to accommodate growth more difficult because the size of the desired output range suddenly changes.
- Dynamic Subtree Partitioning
Central to the dynamic subtree partitioning approach is the treatment of the file system as a hierarchy. The file system is partitioned by delegating authority for subtrees of the hierarchy to different metadata servers.
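Delegating authority for subtrees can be pictured as a longest-prefix lookup: the deepest delegated ancestor of a path decides which MDS is responsible. The sketch below assumes a plain dictionary of delegations and integer MDS ids; the function name `authority_for` and the fallback `root_mds` are hypothetical and not the paper's data structures.

```python
def authority_for(path, delegations, root_mds=0):
    """Return the MDS responsible for `path`.

    `delegations` maps a subtree root (e.g. "/home/alice") to an MDS id.
    The deepest delegated ancestor wins; undelegated paths fall back to
    the MDS that owns the root of the hierarchy.
    """
    parts = [p for p in path.split("/") if p]
    # Walk from the full path up towards "/", returning the first match.
    for depth in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in delegations:
            return delegations[prefix]
    return root_mds


delegations = {"/home": 1, "/home/alice": 2, "/usr": 3}
```

With these delegations, `/home/alice/notes.txt` resolves to MDS 2 and `/home/bob/x` to MDS 1. Making the partition dynamic then amounts to editing the table at run time, for example adding `"/home/bob": 3` to move that subtree to another server.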
Load Balancing
If a directory becomes very large or is accessed very frequently, its contents can be hashed across the cluster:
a single directory becomes extraordinarily large or busy
an individual directory’s contents can be hashed across the cluster, such that the authority for a given directory entry is defined by a hash of the file name and the directory inode number.
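As a rough illustration of hashing one directory across the cluster, the sketch below derives the authority for a directory entry from the file name and the directory inode number. The choice of SHA-256, the key encoding, and the modulo reduction are assumptions made for the example; the paper does not prescribe a particular hash function.

```python
import hashlib


def dentry_authority(dir_ino, name, num_mds):
    """Map one directory entry to an MDS by hashing (dir inode, file name)."""
    key = f"{dir_ino}/{name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it
    # modulo the cluster size.
    return int.from_bytes(digest[:8], "big") % num_mds


def spread(dir_ino, names, num_mds):
    """Count how many entries of one hashed directory land on each MDS."""
    counts = [0] * num_mds
    for name in names:
        counts[dentry_authority(dir_ino, name, num_mds)] += 1
    return counts


counts = spread(dir_ino=4711, names=[f"file{i}" for i in range(1000)], num_mds=4)
```

Here `counts` holds one tally per MDS, so the entries of a single large directory end up shared among all four servers instead of resting on one.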
The decision to hash a directory is dynamic:
we propose that the decision to hash (or unhash) a directory be dynamic: as directories grow or become popular it may become appropriate to hash them, but if they shrink or become less popular they should be consolidated on a single node for more efficient manipulation and storage.
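One way to picture the hash/unhash decision is a threshold rule with hysteresis, so that a directory hovering near the limit does not flip back and forth. Everything concrete in this sketch is an assumption for illustration: the function name `next_state`, the two metrics, and all threshold values are hypothetical, since the quoted text gives no specific numbers.

```python
def next_state(is_hashed, entries, popularity,
               hash_entries=10_000, hash_popularity=500.0,
               unhash_fraction=0.5):
    """Decide whether a directory should be hashed across the cluster.

    All thresholds are illustrative assumptions. A directory is hashed when
    it grows past `hash_entries` or its popularity passes `hash_popularity`.
    It is consolidated again only after both metrics drop below
    `unhash_fraction` of those limits, which avoids flapping.
    """
    if not is_hashed:
        return entries > hash_entries or popularity > hash_popularity
    small = entries < hash_entries * unhash_fraction
    quiet = popularity < hash_popularity * unhash_fraction
    return not (small and quiet)
```

For example, a hashed directory that shrinks to 8,000 entries stays hashed under these assumed limits, and is consolidated onto a single node only once it is both below 5,000 entries and below a popularity of 250.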
The load balancing strategy should aim to even out resource utilization:
A robust load balancing strategy might seek to equalize utilization of all resources across the cluster.
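Traffic Control
The popularity of each metadata item is tracked with an access counter whose value decays over time.
For popular files and directories, the metadata is replicated on multiple nodes to spread the traffic; otherwise, clients contact the responsible MDS directly.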
MDS nodes monitor the popularity of metadata using a simple access counter whose value decays over time, or any other measure or estimate of the extent to which an item appears in client caches (precision isn’t necessary).
All responses sent to clients include current distribution information—that is, which MDS nodes the client should contact in the future—for the metadata requested and their prefix directories, which are then cached on the client.
For unpopular items, the MDS cluster will tell clients to direct future requests only at the authoritative node, while for popular items the client is told the item is replicated on many or all nodes.
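The decaying counter and the resulting client hint can be sketched as follows. The exponential half-life decay, the class name `DecayingCounter`, the function `distribution_hint`, and the threshold of 100 are assumptions chosen for the example; the quoted text only asks for a counter whose value decays over time and notes that precision is not necessary.

```python
import math


class DecayingCounter:
    """Access counter whose value halves every `half_life` seconds."""

    def __init__(self, half_life=10.0):
        self.half_life = half_life
        self.value = 0.0
        self.last = 0.0   # timestamp of the last update

    def _decay(self, now):
        elapsed = now - self.last
        self.value *= math.pow(0.5, elapsed / self.half_life)
        self.last = now

    def hit(self, now):
        self._decay(now)
        self.value += 1.0

    def get(self, now):
        self._decay(now)
        return self.value


def distribution_hint(counter, now, authority, all_mds, threshold=100.0):
    """Tell a client which MDS nodes to contact for this item in the future."""
    if counter.get(now) >= threshold:
        return list(all_mds)   # popular: replicated, any node may be asked
    return [authority]         # unpopular: only the authoritative node


hot = DecayingCounter(half_life=10.0)
for _ in range(200):
    hot.hit(now=0.0)
```

Right after the burst of 200 accesses the hint lists every node, while three half-lives later the counter has decayed to 25 and clients are directed back to the authoritative MDS alone.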
Directory Locality
We exploit workload locality by storing directly related information together whenever possible, and prefetching potentially related information—inodes within the same directory— to more efficiently satisfy requests in typical workloads.
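The effect of keeping a directory's inodes together and prefetching them can be shown with a toy cache that loads all siblings on the first miss. The classes `DirectoryStore` and `PrefetchingCache` are hypothetical stand-ins, and counting one "fetch" per directory is a simplification of real disk I/O.

```python
class DirectoryStore:
    """Stand-in for on-disk storage that keeps a directory's inodes together."""

    def __init__(self, directories):
        self.directories = directories   # {dir_path: {name: inode_dict}}
        self.fetches = 0                 # number of simulated disk reads

    def load_directory(self, dir_path):
        self.fetches += 1
        return dict(self.directories[dir_path])


class PrefetchingCache:
    def __init__(self, store):
        self.store = store
        self.inodes = {}                 # {(dir_path, name): inode_dict}

    def lookup(self, dir_path, name):
        key = (dir_path, name)
        if key not in self.inodes:
            # One read brings in every inode of the directory, so later
            # lookups of sibling files are served from memory.
            for sibling, inode in self.store.load_directory(dir_path).items():
                self.inodes[(dir_path, sibling)] = inode
        return self.inodes[key]


store = DirectoryStore(
    {"/src": {"a.c": {"ino": 10}, "b.c": {"ino": 11}, "c.h": {"ino": 12}}}
)
cache = PrefetchingCache(store)
for name in ("a.c", "b.c", "c.h"):       # e.g. the stat() calls behind `ls -l`
    cache.lookup("/src", name)
```

After the loop, `store.fetches` is 1: the first lookup pays for the whole directory and the remaining two are cache hits, which is the access pattern a directory listing followed by per-file stats tends to produce.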
References
1. Sage Weil et al., Dynamic Metadata Management for Petabyte-scale File Systems, SC’04