
2025-10-07

HDFS is the industry's default big data storage system and is widely used in production big data clusters. Thanks to its simple architecture, an HDFS cluster is highly stable and easy to scale, and clusters with thousands of nodes holding hundreds of petabytes (PB) of data are not uncommon. Let's briefly review the architecture of HDFS, as shown in Figure 1.

▲Figure 1 HDFS architecture

HDFS gives clients low-latency metadata access by loading all file system metadata into the memory of the Namenode, its metadata node. Because the metadata must fit in memory, the maximum number of files an HDFS cluster can support is bounded by the Namenode's Java heap, with an upper limit of roughly 400 to 500 million files. HDFS is therefore best suited to clusters dominated by large files (several hundred megabytes (MB) or more). If a cluster holds many small files, HDFS's metadata access performance suffers. Although a cluster can be scaled out through the various Federation technologies, a single HDFS cluster still cannot handle the small-file limitation well.

Against this background, the Hadoop community launched a new distributed storage system, Ozone. Ozone is a distributed Key-Value object store that handles both small and large files with ease.

1 Basic concepts

Object storage is a form of data storage in which each piece of data is stored as a discrete unit called an object. Objects can be data of any type or size. Semantically, all objects in an object store live in a single flat address space, without the hierarchy of a file system. In practice, to support multiple users, user isolation, and easier management of objects, object stores usually divide the flat address space into several levels. These levels are fixed by the object store's implementation, and each level has specific semantics that users cannot change.

The object hierarchy of Ozone is divided into three levels, from top to bottom: Volume (volume), Bucket (bucket) and Object (object), as shown in Figure 2.

▲Figure 2 Ozone's object hierarchy

1. Volume (volume)

A Volume is similar to a user account in Amazon S3 and serves as the user's home directory. Volumes can only be created by system administrators and are the unit of storage management, for example quota management. Ozone recommends that system administrators create a separate Volume for each user. Volumes hold buckets, and a Volume can currently contain any number of buckets.

2. Bucket (bucket)

A Bucket is a container for objects, similar in concept to a bucket in S3 or a container in Azure. A bucket is created under a Volume and belongs to exactly one Volume. Once it is created, this ownership cannot be changed, and renaming the bucket is not supported. In Amazon S3, bucket names are globally unique and the namespace is shared by all AWS accounts, which means that once a bucket is created, no other AWS account in any AWS region can use its name until the bucket is deleted. In Ozone, a bucket name only needs to be unique within its Volume, so different Volumes can contain buckets with the same name.

3. Object (object)

An Object is stored in a bucket as a key + value pair. The key is the name of the object and the value is its content. The name of an object must be unique within the bucket it belongs to. Objects carry their own metadata, including the size of the value, creation time, last modification time, number of replicas, access control list (ACL), and so on. There is no limit on the size of an object.
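The naming rules of the three levels can be illustrated with a minimal Python sketch. The class and method names (`Volume`, `Bucket`, `create_bucket`, `put`, `NameConflict`) are hypothetical and are not Ozone's API. The sketch only models the scoping rules described above: a bucket name must be unique within its Volume, and a key identifies one object within its bucket. Treating a repeated key as an overwrite is an assumption of the sketch.

```python
class NameConflict(Exception):
    """Raised when a name is already taken inside its enclosing scope."""


class Bucket:
    def __init__(self, name):
        self.name = name
        self.objects = {}  # key (object name) -> value (object content)

    def put(self, key, value):
        # A key identifies exactly one object in this bucket.
        # Assumption of this sketch: writing an existing key overwrites it.
        self.objects[key] = value


class Volume:
    def __init__(self, name):
        self.name = name
        self.buckets = {}  # bucket name -> Bucket, scoped to this Volume

    def create_bucket(self, name):
        if name in self.buckets:
            raise NameConflict(f"bucket {name!r} already exists in volume {self.name!r}")
        bucket = Bucket(name)
        self.buckets[name] = bucket
        return bucket


alice = Volume("alice")
bob = Volume("bob")
alice.create_bucket("logs").put("2025/app.log", b"hello")
bob.create_bucket("logs")  # same bucket name in another Volume: allowed
```

Calling `alice.create_bucket("logs")` a second time raises `NameConflict`, while the same name under `bob` is accepted because the two Volumes are separate scopes.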

Ozone supports virtual-host-style URLs for accessing objects, in the following format:

[scheme][bucket.volume.server:port]/key

Here the scheme is one of: 1) o3fs, which accesses Ozone over the RPC protocol; 2) HTTP/HTTPS, which accesses the Ozone REST API over the HTTP protocol. When the scheme is omitted, the RPC protocol is used by default. server:port is the address of Ozone Manager. If it is not specified, the value of "ozone.om.address" in the cluster's configuration file ozone-site.xml is used. If that is not defined either, "localhost:9862" is used by default.
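As an illustration of the fallback rules above, here is a small Python sketch that splits such a URL into its parts. It is not Ozone's own parser. The function name `parse_ozone_url` and the `configured_om` parameter (standing in for the "ozone.om.address" setting) are hypothetical, the "://" separator after the scheme is assumed, and the sketch assumes that bucket and volume names contain no dots so that the first two labels of the host part can be taken as bucket and volume.

```python
DEFAULT_OM = ("localhost", 9862)  # final fallback named in the text


def parse_ozone_url(url, configured_om=None):
    """Split [scheme][bucket.volume.server:port]/key into its parts.

    configured_om is a hypothetical (host, port) tuple standing in for
    the ozone.om.address value from ozone-site.xml.
    """
    fallback = configured_om or DEFAULT_OM

    scheme, sep, rest = url.partition("://")
    if not sep:
        # Scheme omitted: the RPC protocol (o3fs) is the default.
        scheme, rest = "o3fs", url

    authority, _, key = rest.partition("/")
    host_part, _, port = authority.partition(":")
    labels = host_part.split(".")
    if len(labels) < 2 or not all(labels[:2]):
        raise ValueError("expected at least bucket.volume before the key")

    bucket, volume = labels[0], labels[1]
    server = ".".join(labels[2:])
    if server:
        # Assumption: a server given without a port reuses the fallback port.
        om = (server, int(port) if port else fallback[1])
    else:
        om = fallback

    return {"scheme": scheme, "volume": volume, "bucket": bucket, "om": om, "key": key}


full = parse_ozone_url("o3fs://b1.v1.om.example.com:9862/dir/file.txt")
short = parse_ozone_url("b1.v1/file.txt")
```

In `full` the Ozone Manager address comes from the URL itself, while `short` falls back to the default address because neither a server nor a configured value is supplied.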

2 Technical architecture

Ozone's technical architecture is divided into three parts: Ozone Manager, which provides unified metadata management; Storage Container Manager, which handles data block allocation and data node management; and Datanode, the data node where data is finally stored, as shown in Figure 3. Compared with the HDFS architecture, the functions of the original Namenode are now split between Ozone Manager and Storage Container Manager. The object metadata space and the data distribution are managed separately, which lets the two scale independently on demand and avoids the pressure that used to fall on a single Namenode.

▲Figure 3 Ozone technical architecture

Ozone's main modules and their functions are as follows.

1. Ozone Manager (OM)

Ozone Manager manages Ozone's namespace, providing create, update and delete operations for all Volumes (volumes), Buckets (buckets) and Keys (keys). It stores Ozone's metadata, which covers Volumes, Buckets and Keys. Underneath, high availability (HA) of the metadata is implemented with RATIS, an implementation of the RAFT protocol. Ozone Manager communicates only with the Ozone Client and Storage Container Manager, and does not communicate directly with Datanodes. Ozone Manager stores the namespace metadata in RocksDB, which avoids the HDFS requirement of keeping all metadata in memory and the small-file problems that come with it. RocksDB is a local Key-Value storage engine developed by Facebook on the basis of LevelDB. It includes many optimizations for SSDs and provides high-throughput reads and writes.

2. Storage Container Manager (SCM)

SCM is similar to the Block Manager in HDFS. It manages Containers, write Pipelines and Datanodes, and provides Block and Container operations and information to Ozone Manager. SCM also listens for the heartbeats sent by Datanodes and, in its role as Datanode manager, maintains the level of data redundancy that the cluster requires. SCM and the Ozone Client do not communicate with each other.

3. Block, Container and Pipeline

A Block is the data block object that actually stores user data. Each record in a Container describes one block, and each block has exactly one record in its Container. As shown in Figure 4, Ozone replicates data at Container granularity. SCM currently supports two kinds of Pipeline: a Standalone read Pipeline made up of a single Datanode, and an Apache RATIS write Pipeline made up of three Datanodes. A Container has two states, OPEN and CLOSED. While a Container is OPEN, new Blocks can be written to it. When a Container reaches its configured size (5 GB by default), it transitions from OPEN to CLOSED. A CLOSED Container cannot be modified.

▲Figure 4 Datanode Container Internal Structure
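The Container life cycle described above can be sketched as a small state machine. This is a simplified Python model, not Ozone's implementation: the class `Container`, the method `add_block` and the tiny `max_size` used in the usage lines are hypothetical, chosen only to make the OPEN to CLOSED transition easy to follow.

```python
from enum import Enum


class State(Enum):
    OPEN = "OPEN"
    CLOSED = "CLOSED"


class Container:
    def __init__(self, container_id, max_size):
        self.container_id = container_id
        self.max_size = max_size  # 5 GB by default per the text; tiny here
        self.state = State.OPEN
        self.blocks = {}          # block id -> size in bytes, one record per block
        self.used = 0

    def add_block(self, block_id, size):
        if self.state is not State.OPEN:
            raise RuntimeError("a CLOSED container cannot be modified")
        if block_id in self.blocks:
            raise ValueError("each block has exactly one record in its container")
        self.blocks[block_id] = size
        self.used += size
        # Reaching the configured size moves the container to CLOSED.
        if self.used >= self.max_size:
            self.state = State.CLOSED


container = Container(container_id=1, max_size=10)
container.add_block("block-1", 4)   # still OPEN, 4 of 10 bytes used
container.add_block("block-2", 6)   # reaches max_size, becomes CLOSED
```

After the second call the container is CLOSED, so any further `add_block` call raises `RuntimeError`.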

The Apache RATIS write Pipeline, made up of three Datanodes, guarantees that once data has been written to disk, subsequent reads always return the latest data. The data is strongly consistent and every piece of data has three replicas, so there is no need to worry about data loss from a single disk failure, as shown in Figure 5.

▲Figure 5 RATIS write Pipeline
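To give a feel for why a three-node write Pipeline tolerates a single failure, the following Python sketch models the majority rule used by Raft-style replication. It is heavily simplified: terms, leader election and log repair are ignored, and `Replica` and `replicated_write` are hypothetical names, not part of the Apache RATIS API.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        # An unhealthy replica (for example, one with a failed disk) stores nothing.
        if not self.healthy:
            return False
        self.log.append(entry)
        return True


def replicated_write(leader, followers, entry):
    """Return True once a majority of the pipeline has stored the entry."""
    if not leader.append(entry):
        return False  # without a working leader, this sketch accepts no writes
    acks = 1 + sum(follower.append(entry) for follower in followers)
    pipeline_size = 1 + len(followers)
    return acks > pipeline_size // 2


leader = Replica("dn1")
followers = [Replica("dn2"), Replica("dn3", healthy=False)]
committed = replicated_write(leader, followers, b"chunk-0")
```

With one of the three nodes down, two acknowledgements still form a majority, so the write is accepted. If both followers were down, the same call would return `False`.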

4. Datanode

Datanode is Ozone's data node. It uses the Container as its basic storage unit, maintains the data mapping inside each Container, and periodically sends heartbeats to SCM that report node information, Container information and Pipeline information. When a Container's size exceeds 90% of its configured size, or when a write operation fails, the Datanode sends a Container Close command to SCM, which moves the Container from OPEN to CLOSED. Likewise, when a Pipeline error occurs, it sends a Pipeline Close command to SCM to switch the Pipeline from OPEN to CLOSED.
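The close condition can be written down as a short check. The Python sketch below is an illustration only: `should_request_close` and `build_heartbeat` are hypothetical names, and carrying the close requests inside a heartbeat dictionary is an assumption of the sketch rather than Ozone's actual message format. Integer arithmetic is used for the 90% comparison to avoid floating-point rounding.

```python
def should_request_close(used_bytes, max_bytes, write_failed=False):
    """True when a container exceeds 90% of its size or a write has failed."""
    # used / max > 0.9 rewritten with integers: used * 10 > max * 9
    return write_failed or used_bytes * 10 > max_bytes * 9


def build_heartbeat(node_id, containers):
    """Build a report for SCM.

    containers maps container id -> (used_bytes, max_bytes, write_failed).
    The report layout is hypothetical.
    """
    to_close = sorted(
        cid
        for cid, (used, cap, failed) in containers.items()
        if should_request_close(used, cap, failed)
    )
    return {"node": node_id, "containers": sorted(containers), "close_requests": to_close}


report = build_heartbeat(
    "dn1",
    {
        1: (10, 100, True),    # small, but a write failed
        2: (50, 100, False),   # half full and healthy
        3: (95, 100, False),   # above the 90% threshold
    },
)
```

Here `report["close_requests"]` lists containers 1 and 3, while container 2 stays OPEN.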

5. Hierarchical management

Ozone's layered structure allows Ozone Manager, Storage Container Manager and Datanode to be scaled independently on demand. The semantics that Ozone provides are likewise managed layer by layer, as shown in Figure 6.

▲Figure 6 Ozone semantics and corresponding management module

6. Object creation

When an Ozone Client (client) needs to create and write a new object, it talks directly to Ozone Manager and the Datanodes. The process is shown in Figure 7.

▲Figure 7 Create a new Ozone object

1) The Ozone client connects to Ozone Manager and provides information about the object to be created, including its name, data size, number of replicas and other user-defined properties.

2) After receiving the request, Ozone Manager contacts SCM and asks it to find a Container in the OPEN state that can hold the data, and then to allocate enough blocks in that Container.

3) SCM returns to Ozone Manager the Container and Block that the new object's data will be written to, together with the list of the three Datanodes in the Pipeline where that Container resides.

4) Ozone Manager passes the information returned by SCM back to the client.

5) After the client obtains the Datanode list, it connects to the first Datanode (the Raft Pipeline leader) and writes the data into the Container on that Datanode.

6) After the client finishes writing, it connects to Ozone Manager again to confirm that the data has been written. Ozone Manager updates the object's metadata and records the Container and Block where the object's data resides.

At this point, the creation of the new object is completed. After that, other clients can access this object.
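The six steps above can be traced in a compact Python simulation. All class and method names here (`OzoneManager.open_key`, `commit_key`, `StorageContainerManager.allocate_block`, `create_object`) are hypothetical stand-ins chosen for the sketch, not Ozone's real client or server API. Copying the data to the followers inside `create_object` is a stand-in for the replication that the RATIS Pipeline performs.

```python
class Datanode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # (container id, block id) -> bytes

    def write(self, container_id, block_id, data):
        self.blocks[(container_id, block_id)] = data


class StorageContainerManager:
    def __init__(self, pipeline):
        self.pipeline = pipeline   # the three Datanodes of one write Pipeline
        self.open_container = 1    # id of a Container assumed to be OPEN
        self.next_block = 0

    def allocate_block(self, size):
        # Steps 2-3: pick an OPEN Container, reserve a block, return the Pipeline.
        # The size is ignored here; a real allocator would check free space.
        self.next_block += 1
        return {
            "container": self.open_container,
            "block": self.next_block,
            "pipeline": list(self.pipeline),
        }


class OzoneManager:
    def __init__(self, scm):
        self.scm = scm
        self.keys = {}  # "/volume/bucket/key" -> metadata

    def open_key(self, path, size):
        # Steps 1 and 4: accept the request and hand SCM's answer to the client.
        return self.scm.allocate_block(size)

    def commit_key(self, path, size, location):
        # Step 6: record where the object's data lives.
        self.keys[path] = {
            "size": size,
            "container": location["container"],
            "block": location["block"],
        }


def create_object(om, path, data):
    location = om.open_key(path, len(data))
    leader, *followers = location["pipeline"]
    # Step 5: the client writes to the first Datanode (the Pipeline leader).
    leader.write(location["container"], location["block"], data)
    for node in followers:
        # Stand-in for RATIS replication from the leader to its followers.
        node.write(location["container"], location["block"], data)
    om.commit_key(path, len(data), location)
    return location


datanodes = [Datanode("dn1"), Datanode("dn2"), Datanode("dn3")]
om = OzoneManager(StorageContainerManager(datanodes))
location = create_object(om, "/vol1/bucket1/key1", b"hello ozone")
```

Note that the key only becomes visible in `om.keys` after the commit in step 6, which matches the statement that other clients can access the object once creation is complete.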

7. Object reading

The object reading process is relatively simple and is similar to reading a file in HDFS, as shown in Figure 8.

▲Figure 8 Read Ozone object

1) The Ozone Client (client) contacts Ozone Manager and specifies the key of the object to be read (/volume/bucket/key).

2) Ozone Manager looks up the object in its metadata database and returns to the Ozone Client (client) the Container and Block where the object's data resides, including the list of Datanodes that hold the Container.

3) Ozone supports data locality. If the Ozone Client is running on a node in the cluster, Ozone Manager returns the Datanode list sorted by network topology distance. The Ozone Client can then read from the first Datanode in the list (the local node), which is also the node closest to the client, saving network transfer time.
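The locality-aware ordering in step 3 can be sketched as follows. The text does not say how Ozone computes network topology distance, so this Python sketch assumes a Hadoop-style scheme in which each node has a path such as /rack1/dn1 and the distance between two nodes is the number of hops from each to their closest common ancestor. The function names and the rack paths are hypothetical.

```python
def topology_distance(a, b):
    """Hops from a and b up to their closest common ancestor, summed."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def sort_replicas(client, replicas):
    """Order Datanodes so that the one closest to the client comes first."""
    return sorted(replicas, key=lambda node: topology_distance(client, node))


client = "/rack1/dn1"
replicas = ["/rack2/dn3", "/rack1/dn2", "/rack1/dn1"]
ordered = sort_replicas(client, replicas)
```

Under these assumptions the local node has distance 0, a node in the same rack has distance 2, and a node in another rack has distance 4, so the client reads from `ordered[0]`, its own node.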
