HDFS is the industry's default big data storage system and is widely used in the industry's big data clusters. HDFS clusters have high stability and are easy to scale thanks to their simpler architecture, but clusters containing thousands of nodes and holding hundreds of bits (PB) of data are not uncommon. Let’s briefly review the architecture of HDFS, as shown in Figure 1.
▲Figure 1 HDFS architecture
HDFS provides low-latency metadata access to the client by loading all file system metadata into the data node Namenode memory. Since metadata needs to be loaded into memory, the maximum number of files that a HDFS cluster can support is limited by the Java heap memory, and the upper limit is about 400 million to 500 million files. Therefore, HDFS is suitable for clusters with a large number of large files [more than a few hundred megabytes (MB)]. If there are a lot of small files in the cluster, HDFS's metadata access performance will be affected. Although the node size of a cluster can be scaled through various Federation technologies, a single HDFS cluster still cannot solve the limitations of small files well.
Based on these backgrounds, Hadooph community has launched a new distributed storage system, Ozone. Ozone can easily manage small and large files and is a distributed Key-Value object storage system.
Object storage is a data storage, each data unit is stored as a discrete unit (called an object). Objects can be data of any type or size. Semantically, all objects in object storage are stored in a single flat address space without a hierarchy of the file system. In implementation, in order to support multi-user and user isolation and better manage and use objects, object storage will usually divide several levels into the address space of the plane. These levels are determined by the implementation of object storage, each level has specific semantics that users cannot change. The object hierarchy of
Ozone is divided into three levels, from top to bottom, Volume (volume), Bucket (bucket) and Object (object) , as shown in Figure 2.
▲Figure 2 Ozone's object hierarchy
Volume is similar to the concept of user accounts in Amazon S3 and is the user's Home directory. Volume can only be created by system administrators and is a unit of storage management, such as quota management. Ozone recommends that system administrators create separate Volume for each user. Volume is used to store buckets. Currently, a Volume can contain as many buckets as possible.
Bucket is an object container, the concept is similar to the Bucket in S3, or the Container in Azure. A bucket is created under Volume and can only belong to one Volume. After creation, the ownership relationship cannot be changed, and the name of the bucket is not supported. Amazon S3's bucket name is globally unique and the namespace is shared by all AWS accounts. This means that after the bucket is created, no other AWS accounts in any AWS region can use the bucket's name until the bucket is deleted. In Ozone, the bucket name only needs to be made sure it is unique inside this Volume. Different Volume can create buckets with the same name.
Object is stored in a bucket and is a key + value storage. The key is the name of the object and the value is the content of the object. The name of the object must be unique in the bucket to which it belongs. Objects have their own metadata, including the size of the value, creation time, last modification time, number of backups, access control list ACL, etc. There is no limit on the size of the object.
Ozone supports URL access to Ozone objects in a virtual host.It adopts the following format:
[scheme][bucket.volume.server:port]/key
Where, the scheme can be selected: 1) o3fs, accessing Ozone through the RPC protocol. 2) HTTP/HTTPS, accessing the Ozone REST API through the HTTP protocol. When the scheme is omitted, the RPC protocol is used by default. server:port is the address of Ozone Manager. If not specified, the "ozone.om.address" value in the cluster's configuration file ozone-site.xml is used. If there is no definition in the configuration file, "localhost:9862" is used by default.
Ozone technical architecture is divided into three parts: Ozone Manager, unified metadata management; Storage Container Manager, data block allocation and data node management; Datanode, data node, and the final storage location of data, as shown in Figure 3. Compared with HDFS architecture, you can see the original Namenode function, which is now managed by Ozone Manager and Storage Container Manage respectively. The object metadata space and data distribution are managed separately, which is conducive to the independent on-demand expansion of the two, avoiding the pressure of the previous Namenode single node.
▲Figure 3 Ozone technology framework
SCM is similar to Block Manager in HDFS, manages Containers, writes Pipelines and Datanodes, and provides Block and Container operations and information for Ozone Manager. SCM also listens to the heartbeat information sent by Datanode. As the role of Datanode Manager, it ensures and maintains the data redundancy level required for the cluster. There is no communication between SCM and Ozone Client.
Block are data block objects that truly store user data. A record in the Container is the information of a block. Each block has and only one record in the Container. As shown in Figure 4, in Ozone, the data is copied with the granularity of the Container. Currently, two Pipeline methods are supported in SCM, Standalone reading Pipeline consisting of a single Datanode node, and Apache RATIS writing Pipeline consisting of three Datanode nodes. Container has 2 states, OPEN and CLOSED. When a Container is in OPEN state, a new Block can be written into it. When a Container reaches its pre-sized size (default 5GB), it transitions from the OPEN state to the CLOSED state. A Closed Container is not modifiable.
▲Figure 4 Datanode Container Internal Structure
Apache RATIS composed of three Datanode nodes writes Pipeline to ensure that once the data falls into the disk, the latest data can always be read in the future. The data is strongly consistent, and there are 3 backups per data. There are no need to worry about data loss due to a single disk failure, as shown in Figure 5.
▲Figure 5 RATIS writes Pipeline
4, Datanode
Datanode is the data node of Ozone. It uses Container as the basic storage unit to maintain the data mapping relationship within each Container, and regularly sends heartbeat nodes, reporting node information, management of Container information and Pipeline information to the SCM. When a Container size exceeds 90% of the predetermined size or when a write operation fails, Datanode will send the Container Close command to SCM, converting the Container status from OPEN to CLOSED. Or when Pipeline error occurs, send the Pipeline Close command to SCM to switch Pipeline from OPEN to CLOSED.
5, hierarchical management
▲Figure 6 Ozone semantics and corresponding management module
6, object creation
When the Ozone Client (client) needs to create and write a new object, the client needs to directly deal with Ozone Manager and Datanode. The specific process is shown in Figure 7.
▲Figure 7 Create a new Ozone object
4) Ozone Manager will receive the information returned by SCM and return it to the client. After the
5) client obtains the Datanode list information, it establishes communication with the first Datanode (Raft Pipeline Leader) and writes data to the Datanode Container.
6) After the client completes the data writing, it connects to Ozone Manager to confirm that the data has been updated. Ozone Manager updates the object's metadata and records the information of the Container and Block where the object data is located.
At this point, the creation of the new object is completed. After that, other clients can access this object.
7. Object reading
object reading process is relatively simple, similar to HDFS file reading, as shown in Figure 8.
▲Figure 8 Read Ozone object