
1. Foreword

We are in the era of big data, and enterprise data volumes are growing explosively. How to meet the challenges of massive data storage and processing and build a good data platform is a critical issue for any enterprise. From data warehouses to data lakes, and now to the lakehouse that integrates the two, new methods and technologies for building data platforms keep emerging.

Understanding the evolutionary threads, key problems, and core technical principles behind these methods and technologies helps enterprises build better data platforms. This is why Baidu Smart Cloud is launching this data lake series.

This series of articles will consist of several parts:

This article opens the series by introducing the history and development of data platform technology. The subsequent articles will be divided into two major topics, storage and computing, and will cover the core technical principles and best practices of data platforms, as well as Baidu Smart Cloud's thinking on these issues.

2. The value of data

"Data is the new oil." — Clive Humby, 2006

Clive Humby said "data is the new oil" in 2006, and it quickly became a consensus. His own career is a fitting footnote to the big data era: originally a mathematician, he co-founded a data company with his wife and later set up an investment fund focused on the data field. When he coined the phrase, Humby was busy promoting the company he and his wife had founded to the capital market. Capital markets love simple, powerful one-liners, and his company sold for a good price five years later.

For data owners and practitioners in the data industry, however, this sentence tells only half the truth. Michael Palmer completed it:

"Data is just like crude. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value." — Michael Palmer

In short: data must be refined to release its real value.

For a company, the easiest part of "big data" to understand, and to do, is the "big". Once a company realizes that the data in every part of its business may hide the secrets of revenue and user growth, it tends to accumulate a large amount of raw data. This raw data is the crude oil: precious, but full of noise, impurities, and even errors, with the relationships between different pieces of data far from obvious, and still a long way from the insights we want. To uncover those insights we need to keep "refining": organizing, cleansing, combining, and analyzing the raw data with appropriate methods, removing the dross and keeping the essence, teasing out the threads, revealing the truly valuable parts of the data, and ultimately turning it into a driving force for business growth.

supports this "refining" infrastructure throughout the process is an enterprise's data platform. To data is like a refinery to crude oil.

With the explosive growth of enterprise data volumes and the move of more and more enterprises to the cloud, the data storage and processing challenges facing data platforms keep growing. Which technologies to use to build and iterate such a platform has long been a hot topic in the industry, and new technologies and ideas keep emerging. They can be summarized into two typical routes: the data warehouse and the data lake. In recent years the boundary between the two routes has blurred as they evolve; they are gradually converging and beginning to form the so-called modern data architecture, also known as the data lakehouse.

3. Composition of data platform

Before discussing specific technical issues, let’s take a look at what the industry’s data platform looks like:

Data platform = Storage system + Computing engine + Interface

The functions of these parts can be summarized as follows.

3.1 Data storage

Data storage solves the problem of getting the raw material in, and is characterized by a long time span, scattered sources, and centralized storage.

  • "Long time span" means that the data storage should save all historical data as much as possible. The importance of historical data to enterprises lies in "learning from history" and observing data trends, health and other information from a longer time dimension.
  • "source dispersed" is because the source of data is usually various business systems, which may be data in relational databases such as MySQL and Oracle, or may be logs recorded by business systems. Businesses may also purchase or collect third-party data sets as supplements to internal data. Data platforms need to have the ability to import data from different sources. As for what format to store it after import, different technical solutions have their own requirements.
  • "centralized storage" is to establish a single source of truth. No matter where the source of the data is, after being included in the data platform, the data platform is the only trustworthy source. Here it refers more to logical centralized storage, and there is physical possibility of decentralization. For example, an enterprise adopts a multi-cloud architecture to store data scattered in different cloud vendors, and the data platform blocks the actual location of the data from the data users. Centralized storage also means more refined control, preventing data usage permissions from expanding to unnecessary scope.

3.2 Computing Engine

The goal of the computing engine is to extract useful information from the data storage. Unfortunately, there is no single unified computing engine in the industry today; different solutions are adopted depending on the model, the latency requirements, and the data volume. Typically, frameworks such as TensorFlow, PyTorch, and PaddlePaddle are used for deep learning tasks; offline computing such as data mining uses engines like Hadoop MapReduce and Spark; and business intelligence analysis uses MPP data warehouses such as Apache Doris.

Different computing engines have different requirements for data storage formats:

  • Some computing engines expose relatively open, low-level interfaces and place very loose requirements on data formats. For example, Hadoop MapReduce and Spark can read files in HDFS directly; the engine itself does not care what format the data is in, and the business decides how to interpret and use it. Of course, some formats (such as Apache Parquet) are widely used, and on top of the low-level interface an engine can encapsulate processing logic for specific formats to spare businesses from re-implementing it (a minimal sketch follows below).
  • Other computing engines are relatively closed: they support only a limited set of data formats, or do not expose their internal format at all, and all external data must go through an import step before it can be processed. Apache Doris, for example, decides on its own how its data is stored. The advantage is that storage and computing can be coordinated more tightly and perform better.

Valuable data produced by a computing engine during processing is generally written back into the data storage so that other businesses can use it conveniently.
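
As a small illustration of the open-interface route above, the following sketch (assuming a Spark environment and a hypothetical dataset path) reads Parquet files directly from the storage layer without any import step:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-format-read").getOrCreate()

# Read Parquet files straight from shared storage (path is hypothetical);
# the engine interprets the file format, the business decides how to use the data.
events = spark.read.parquet("hdfs:///data/events/")
events.groupBy("event_type").count().show()
```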

3.3 Interface

The interface determines how users of the data platform use the computing engine. The most popular is the SQL interface; some computing engines also provide programming interfaces at different levels of abstraction. For an enterprise, the fewer kinds of interface it has to provide, the better, and the friendlier the platform is to its users.

4. Two routes to the data platform: data warehouse and data lake

4.1 Data warehouse

The data warehouse appeared much earlier than the data lake. Its initial scenario was business intelligence (BI): put simply, enterprise management wants a dashboard for conveniently viewing all kinds of business data, showing statistics and trends, with the data sourced from ERP, CRM, business databases, and so on. The best way to make this easy is to collect data from the various data sources within the enterprise into a single place, archive it, and maintain the history, so that the relevant queries can all be answered there. This unified place is the data warehouse.

Mainstream data warehouse implementations are based on online analytical processing (OLAP) technology. Before data warehouses appeared, businesses already made wide use of relational databases such as MySQL and Oracle, which are based on online transaction processing (OLTP). Data in an OLTP database has a fixed format and is well organized, and the SQL it supports is easy to use and understand; at the same time, the OLTP database is itself one of the most important data sources for a data warehouse. Building a data warehouse directly on an OLTP database was therefore a natural idea. But it soon became clear that data warehouse workloads have their own characteristics, the OLTP-based approach hit bottlenecks, and OLAP got the opportunity to develop independently:

  • On the one hand, OLTP databases store data row-oriented: the data of a row is stored together, so even if a query needs only a few fields, the whole row has to be read and the required fields extracted. Data warehouse tables usually have many fields, which makes this inefficient. Column-oriented storage keeps different columns or column families separately, so only the required parts need to be read. This greatly reduces the amount of data read and is far friendlier to data warehouse scenarios (a minimal sketch follows this list).
  • On the other hand, traditional OLTP databases rely on scaling up single-machine hardware to improve processing capacity, which has a fairly low ceiling. A data warehouse query reads a very large amount of data, repeatedly applying the same read logic to the same fields, which is well suited to parallel processing within and across machines, using the cluster's scale-out capacity to shorten query time. This is the core idea of the MPP (massively parallel processing) computing engine.
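
A minimal sketch of why column orientation helps, assuming a local Parquet file named orders.parquet with far more columns than the query touches:

```python
import pyarrow.parquet as pq

# Column-oriented read: fetch only the two columns the query needs.
# A row-oriented store would have to scan whole rows and discard the rest.
cols = pq.read_table("orders.parquet", columns=["order_date", "amount"])
print(cols.num_rows, cols.schema)
```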

The characteristics of modern data warehouse architecture are therefore distribution, columnar storage, and MPP computing engines. When a user submits a computing task, the warehouse's MPP engine splits the computation, each node processes its share in parallel, and the aggregated result is returned to the user.

The data warehouse is a typical "Schema-on-Write" system: stored data must be processed into a predefined format, the schema, when it is written. It is as if the warehouse administrator decides the style of packing box in advance, and all goods (data) must be packed neatly into these boxes before entering the warehouse.

The raw data from the data sources usually differs from the defined schema, so imported data goes through an ETL process, short for the three steps of extraction (Extract), transformation (Transform), and loading (Load). The Extract stage reads from the original data source and cleans the data to correct errors and duplicates. The Transform stage then does the necessary processing to convert the data into the specified schema. Finally, the data is loaded into the data warehouse.
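
A rough sketch of this ETL flow under schema-on-write; the file path, table name, and warehouse connection string are all hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; any SQLAlchemy-compatible target would do.
warehouse = create_engine("postgresql://user:pass@warehouse-host/dw")

# Extract: read the raw export from the source system.
raw = pd.read_csv("exports/orders_raw.csv")

# Transform: drop duplicates and broken rows, then conform to the
# predefined warehouse schema (the "packing box" decided in advance).
clean = (
    raw.drop_duplicates(subset="order_id")
       .dropna(subset=["order_id", "amount"])
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
       [["order_id", "order_date", "customer_id", "amount"]]
)

# Load: append the conformed rows into the warehouse fact table.
clean.to_sql("fact_orders", warehouse, if_exists="append", index=False)
```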

4.2 Data Lake

The Baiyun Obo mine in the Inner Mongolia Autonomous Region is the only mine in the world that contains all 17 rare earth elements. For more than 60 years it was mined as an iron mine; later, as the strategic value of rare earths grew and mining technology advanced, it turned out to be China's largest rare earth deposit.

This story is told to illustrate the importance of raw data. Raw data is like the Baiyun Obo mine: besides the iron we have already found, it may also contain abundant rare earths. The "Schema-on-Write" pattern of the data warehouse requires us to know exactly what we are mining before we process the data; when time has passed and only the processed historical data remains in the warehouse, we may not even know which "rare earths" were thrown away.

Retaining more of the raw data and avoiding the loss of important but as-yet-unknown information is the original intent of the data lake concept. The data lake advocates that all data, whether structured data from databases or unstructured data such as videos, images, and logs, be stored in its original format in a unified storage base. Each data source flows, like a river, into this unified "lake" and merges there, and all data consumers are supplied with water from the same lake.

Because clear structural information is lacking, the data lake uses the "Schema-on-Read" mode: the user converts the data into the appropriate structure after reading it and then processes it. Compared with the warehouse's "Schema-on-Write", the processing flow becomes ELT, that is, the Transform stage happens after Load.
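
A minimal sketch of schema-on-read, assuming raw JSON order events have already been loaded as-is into a hypothetical object storage bucket; the schema is supplied only at read time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw files were extracted and loaded untouched (EL); the schema is
# applied only now, when the data is read for an analysis (the T happens here).
schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])
orders = spark.read.schema(schema).json("s3a://data-lake/raw/orders/")
orders.createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, sum(amount) FROM orders GROUP BY customer_id").show()
```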

"Schema-on-Read" has very loose structure and fewer constraints on computing engines. In fact, the industry has developed a variety of computing engines according to different scenarios.

The traditional data lake is roughly equivalent to the big data ecosystem, and it has gone through two main stages: "integrated storage and computing" and "separated storage and computing".

Stage 1: the storage-compute-integrated data lake

In this stage, enterprises built data lakes on the Hadoop ecosystem, using HDFS for data storage and engines such as Hadoop MapReduce and Spark for computing, with compute and storage resources on the same set of machines. Expanding the cluster expanded computing power and capacity at the same time. After cloud computing took off, this architecture was moved from offline IDC machine rooms onto the cloud unchanged.

Stage 2: the storage-compute-separated data lake

After a period of practice, the integrated architecture ran into bottlenecks, mainly in several respects:

  • Compute and storage cannot be scaled independently, yet in practice most users' needs for the two resources do not match, so the integrated architecture inevitably wastes one of them.
  • With the explosive growth of storage capacity and file counts, HDFS's single-point NameNode architecture hit a metadata performance bottleneck. Enterprises alleviated the problem by upgrading NameNode hardware, running multiple HDFS clusters, or adopting HDFS Federation, but none of these solved it fundamentally, and they placed a heavy burden on the platform's operations staff.
  • Storage cost is another pain point of the integrated architecture. HDFS's 3-replica mechanism is ill-suited to colder data and costs at least twice as much as erasure coding. On the cloud there is also replica amplification: cloud disks come with their own replica mechanism, so HDFS built on cloud disks has an even higher effective replica count, possibly up to 9 (see the back-of-envelope sketch after this list).
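
A back-of-envelope comparison of the raw capacity behind 1 PB of logical data under the schemes mentioned above; the RS(8,4) erasure coding parameters are just an illustrative assumption:

```python
logical_pb = 1.0

hdfs_three_replica = logical_pb * 3         # classic HDFS: 3 full copies
erasure_coded = logical_pb * (8 + 4) / 8    # e.g. RS(8,4): ~1.5x overhead
hdfs_on_cloud_disk = logical_pb * 3 * 3     # 3 HDFS replicas x 3-way cloud disks ~ 9 copies

print(hdfs_three_replica, erasure_coded, hdfs_on_cloud_disk)  # 3.0 1.5 9.0
```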

While solving these problems, people turned to the object storage services of cloud vendors. These services provide nearly infinitely scalable, inexpensive, serverless storage in both performance and capacity. Apart from shortcomings in POSIX file system compatibility (atomic rename, reading while writing, and so on), they address the pain points above and are a suitable replacement for HDFS. In fact Ozone, the next-generation system from the HDFS community, also borrows ideas from object storage to solve these problems.

The object-storage-based data lake gave birth to the "storage-compute separation" architecture, whose defining characteristic is that compute resources and storage resources scale independently.

The storage side of this architecture is the object storage service offered by cloud vendors. Compared with self-built HDFS or Ozone, one of the cloud vendors' biggest advantages is scale: they need clusters large enough to hold massive amounts of user data, and the more data, the larger the cluster, the more nodes and devices, and the higher the overall performance it can provide. A single user can thereby "borrow" higher performance than a self-built HDFS of the same nominal size could offer. A large enough storage resource pool is the precondition, and the confidence, for the storage-compute separation architecture to work.

With object storage taking care of scalability, performance, and cost, its serverless form makes it easy for the data lake's computing engines to scale their computing power independently. Compute resources can even be allocated only when a computation is needed and destroyed immediately after it finishes, paying only for what is used, which is optimal in both cost and efficiency. This was impossible before storage-compute separation and cloud computing.

For cloud vendors, this architectural shift suddenly put object storage services in the spotlight, which is both gratifying and a test of their technical strength: every promise in the marketing has to be delivered on, without discount. The main challenges include:

  • Scale. A single customer may hold tens of PB, and many customers share the resource pools, so the accumulated capacity of object storage easily reaches the EB level, with metadata on the order of trillions of objects. Serving EB-level capacity and trillion-level metadata from a single cluster demands an excellent, hard-core architectural design with no scalability weak spot in any part of the system.
  • Stability. Supporting EB-level capacity and trillions of metadata objects means each cluster runs tens of thousands or even hundreds of thousands of machines. On such a large fleet, hardware and software failures are routine. Reducing or even eliminating the impact of these uncontrollable factors and delivering stable latency and throughput with a small long tail is a contest of engineering quality and operational capability.
  • Compatibility. Although using object storage as data lake storage has become the consensus, some software in the big data ecosystem, whether because of historical baggage or because it genuinely cannot be changed, still relies on HDFS-specific capabilities in certain scenarios. For example, Spark relies on rename to commit tasks, exploiting the speed and atomicity of the HDFS rename operation. AWS S3, the originator of object storage, does not support rename; it can only be roughly simulated with "copy + delete", which is slow and not atomic (see the sketch after this list). If a typical cloud vendor's object storage can replace HDFS in 70% of scenarios, the remaining 30% depends on how far the vendor goes in fixing the poorly compatible parts, so that the storage-compute separation architecture can be carried through more thoroughly.
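
A simplified sketch of what "copy + delete" looks like against an S3-style object store (bucket and keys are hypothetical); note that there is no atomicity between the two calls:

```python
import boto3

s3 = boto3.client("s3")

def emulate_rename(bucket: str, src_key: str, dst_key: str) -> None:
    """Object stores have no real rename: copy, then delete.
    A failure between the two calls leaves either two copies or a half-moved
    object, which is why rename-based commit protocols struggle on S3."""
    s3.copy_object(Bucket=bucket,
                   Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)

emulate_rename("my-data-lake", "tmp/part-0000.parquet", "final/part-0000.parquet")
```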

4.3 Data Warehouse VS Data Lake

Using the formula from earlier, the data warehouse and the data lake can be summarized as:

Data Warehouse = Structured Data Storage System + Built-in Computing Engine + SQL Interface

Data Lake = Original Data Storage System + Multiple Computing Engines + Multiple Interfaces including SQL

Data Warehouse and Data Lake are like iOS and Android:

  • The data warehouse is like iOS: a relatively closed system with more constraints on how data flows in and out and how it is used, but with the advantage of simplicity and ease of use. The closed system gives stronger control and makes it easier to optimize storage formats, computing parallelism, and other aspects of performance, so it still dominates query scenarios that demand extreme performance.
  • The data lake is like Android: it emphasizes openness and hands almost every choice to the user. There are many "phone manufacturers" (computing engines) to choose from, but using them well requires a certain level of expertise, and using them badly has side effects; it can easily degenerate into a "Data Swamp".

5. The modern data platform: the lakehouse

5.1 The dilemmas facing the data lake

The data lake hands the decisions of "what data to store and how to use it" back to the user, with very loose constraints. If users do not manage data well when it enters the lake, however, useful and useless, high-quality and low-quality data all get thrown in, and it becomes hard to find the data you need when you want it. Over time the data lake turns into a huge garbage dump, for which the standard name is the "data swamp".

In order to avoid the data lake eventually becoming a data swamp, several important problems need to be solved:

Problem 1: Data quality

Relying solely on "Schema-on-Read" means processing the raw-format data and filtering out useless information at computation time, and this work has to be repeated for every computation, which both slows down the computation and wastes computing power.

A feasible approach is to borrow from data warehouse practice inside the data lake: pre-process the raw data through one or more rounds of ETL into data that is friendlier to the computing engines and of higher quality. The raw data is not deleted, and the ETL outputs are also stored in the lake, which retains the raw data while keeping computation efficient.

Problem 2: Metadata management

Metadata is data that describes data. Its importance lies in answering several "philosophical" questions about the data: Who am I? Where am I? Where do I come from? Format information (such as the field definitions of a table file), location information (such as the path where the data is stored), lineage (such as which upstream data it was derived from), and so on, all rely on metadata to be interpreted.

Building complete metadata for a data lake helps users make better use of the data. Metadata generally falls into two parts, both important. One is a centralized data catalog service, usually with automatic discovery and fuzzy search, used to manage and find out what data exists in the lake. The other is metadata embedded in the data itself, which guarantees that the data can still be interpreted correctly even after it has been moved. A data catalog is like the shelves of a library: by sorting and filing books into categories, a book's location can be found quickly. The metadata embedded in the data is like a book's table of contents: it tells you quickly what the book contains and on which page; and when a book is moved from one shelf to another, its location changes but its table of contents does not.
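
As an example of metadata embedded in the data itself, formats such as Parquet carry their schema inside every file, so the data stays interpretable after it is moved (the file path here is hypothetical):

```python
import pyarrow.parquet as pq

# The schema travels with the file, like a book keeping its own table of
# contents no matter which shelf it sits on.
print(pq.read_schema("lake/events/part-0000.parquet"))
```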

Metadata management also has to solve the problem of data permissions. The storage systems data lakes rely on, whether HDFS or object storage, grant permissions at the granularity of directories and files, which does not match the needs of the services above. For example, the data set of an image recognition AI task contains many small files that should be treated as a whole; there should be no situation where "a user may access some of the files but not others". Conversely, a single file may hold business order data, and sales staff and company executives should see different ranges of it. Both call for finer-grained permission control.

Problem 3: Data versioning

Data rarely enters the lake as a one-off import that is never updated. For example, collecting data from the online user order database into the lake for later analysis requires continuously synchronizing new orders. The simplest way to handle repeated imports is to re-import everything each time, but that is obviously too crude: it increases resource consumption and makes every import slow.

Supporting incremental updates is therefore an important capability of a data lake, and it raises some hard questions: 1) how to serve reads while an update is in progress; 2) how to recover after an update is interrupted; 3) how to identify incomplete updates; 4) how to restore data after it has been corrupted by a bad operation. In databases and data warehouses, the answer to these questions is ACID. The table formats that have emerged in the data lake field in recent years, such as Apache Iceberg, Apache Hudi, and Delta Lake, are dedicated to adding these capabilities on top of object storage and have become an important part of the data lake (see the sketch below).
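
A hedged sketch of how a table format answers these questions, using Delta Lake's merge and time-travel APIs as one concrete example; the paths and column names are hypothetical, and Iceberg and Hudi offer analogous capabilities:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("incremental-ingest")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Incremental update: merge the latest batch of orders into the lake table
# as a single ACID transaction, so concurrent readers never see a half-write.
updates = spark.read.parquet("s3a://data-lake/staging/orders_batch/")
orders = DeltaTable.forPath(spark, "s3a://data-lake/tables/orders")
(orders.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Versioning: read an earlier snapshot, e.g. to recover from a bad write.
previous = (spark.read.format("delta")
            .option("versionAsOf", 3)
            .load("s3a://data-lake/tables/orders"))
```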

Problem 4: Data circulation

Real-world scenarios are complex and varied, with different requirements for latency and accuracy, which is why the industry has developed so many computing engines. If each engine spoke only its own language and recognized only its own storage format, then whenever the same piece of data was processed by different engines it would have to go through Schema-on-Read or ETL again and again, wasting a great deal of resources. That is clearly unreasonable.

The ideal is for everyone to speak a common language and skip translation altogether. As big data developed, a number of widely used data formats (Apache Parquet, Apache ORC, and so on) and table formats (Apache Iceberg, Apache Hudi, Delta Lake, and so on) gradually took shape. Supported by more and more computing engines, these technologies act, in a sense, as the "Mandarin" of the data lake world and ease the data circulation problem.
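
A small illustration of the "common language" idea: one tool writes an open format and a completely different engine reads it with no re-import step (the local file name is hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession

# One engine (pandas) writes the open Parquet format...
pd.DataFrame({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]}).to_parquet("scores.parquet")

# ...and another engine (Spark) reads the same file directly.
spark = SparkSession.builder.appName("data-interchange").getOrCreate()
spark.read.parquet("scores.parquet").show()
```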

5.2 The trend toward lake-warehouse convergence

As both sides iterate, the boundary between the data lake and the data warehouse is becoming increasingly blurred, and a trend toward convergence is emerging:

  • In the course of fighting the data swamp, in order to make a loosely constrained ecosystem usable, industry practice has in fact placed many constraints on how data lakes are used. Interestingly, these constraints closely resemble what data warehouses have done all along: ETL, ACID, permission control, and so on. The data lake is thus taking on some characteristics of a data warehouse.
  • After trying all kinds of non-SQL programming interfaces and interaction modes, the industry found that in many scenarios SQL is still the best choice. Data warehouses, meanwhile, have become more and more open in recent years, with ever better support for the common data formats and table formats of the data lake; besides built-in ETL support, they can also process them directly as external sources. These trends suggest that the data warehouse, as an important computing engine, can grow on top of the data lake.
  • The data warehouse also faces the limits of integrated storage and compute and is likewise iterating toward a storage-compute-separated architecture. Some systems adopt a hot/cold separation design: hot data stays on fast local media while cold data sinks into the data lake, striking a balance between performance and cost. More thoroughly cloud-native warehouse systems keep all their data in the data lake and use local caches to compensate for the lake's speed. This design simplifies the warehouse architecture: the warehouse no longer has to worry about data reliability, and multiple read-only clusters can share the same data. Important technologies and methods from the data warehouse field can also be borrowed by the big data engines running on the lake, and vice versa. For example, engine acceleration techniques that matured in warehouses such as ClickHouse, like vectorization and LLVM JIT compilation, have been borrowed to implement native engines for Spark, which use hardware resources more efficiently and compute faster than the original JVM engine.

Besides data warehousing and big data, enterprises have other important types of computing, most commonly high-performance workloads such as AI and HPC. The data lake's strength is high throughput, with only average metadata performance and latency, whereas high-performance computing has fairly strict requirements on both. Enterprises therefore also maintain one or more high-speed file systems (Lustre, BeeGFS, and so on) outside the data lake for such workloads. In essence, the frameworks used in high-performance computing are also computing engines, and the data they consume and produce is also part of the enterprise's data assets, so how to bring this part of the business into the data lake system is an important question. The answer resembles the warehouse's storage-compute separation, and there are likewise two routes:

  • A hot/cold separation design, in which the high-speed file storage system uses the data lake as its cold data tier.
  • A cloud-native file system designed on top of the data lake. Although this kind of file system offers file system interfaces, it is in essence a cache acceleration system with a "cache layer + data lake" architecture: the cache layer keeps hot data on demand on the compute nodes or on hardware close to them, while the data lake stores all the data and guarantees its reliability. Even if data is evicted from or lost by the cache system, it can be reloaded from the data lake (see the sketch after this list).
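
A toy sketch of the "cache layer + data lake" idea: serve hot objects from a local cache and fall back to the lake on a miss. The bucket name is hypothetical, and real acceleration layers are of course far more sophisticated:

```python
import boto3

s3 = boto3.client("s3")
local_cache: dict[str, bytes] = {}

def read_through(bucket: str, key: str) -> bytes:
    """Return the object from the local cache if it is hot; otherwise reload it
    from the data lake, which always holds the authoritative copy."""
    if key not in local_cache:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        local_cache[key] = body  # eviction policy omitted in this toy sketch
    return local_cache[key]

data = read_through("my-data-lake", "tables/orders/part-0000.parquet")
```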

The term "lakehouse" (lake-warehouse integration) was first proposed by Databricks, and the industry is still divided over it; some competitors go out of their way to avoid the word, and AWS speaks of a "modern data architecture" instead. Whatever the name, the lakehouse represents the next stage of the data lake, and in essence it is the ultimate one-stop data platform for the enterprise.

This data platform is, first of all, an all-in-one storage infrastructure that satisfies all of an enterprise's data storage needs, from low-cost storage to high performance. Secondly, it goes beyond data warehousing and big data, running computing engines for data warehousing, big data, AI, HPC, and more. These different engines consume and produce data structures that each other can understand, so data flows between businesses without barriers.

5.3 The lakehouse architecture

Following the earlier discussion, we can summarize the lakehouse with the data platform formula:

Lakehouse = Object storage with a metadata layer and an acceleration layer + Computing engines for data warehousing, big data, AI, HPC, and other fields + Multiple interfaces including SQL

On the storage side, object storage has become the de facto standard for data lakes, and its ecosystem is far more prosperous than that of other kinds of cloud storage. For the storage problems that object storage does not solve well, it needs to be paired with an appropriate metadata layer and acceleration layer.

  • Against the data swamp problem, the metadata layer establishes the necessary mechanisms for data quality, metadata management, version management, and data circulation, so that every business inside the enterprise can easily use high-quality data.
  • For businesses with higher requirements on metadata performance and latency, such as data warehousing, AI, and HPC, the acceleration layer supplements object storage. It is usually a high-speed file system or cache system deployed close to the compute nodes, and metadata and data flow automatically between the acceleration layer and the data lake. To simplify use, the acceleration layer is also paired with the upper-level job scheduler to make data movement smarter and simpler: for example, the scheduler warms data up in advance, and only once the data is in the cache does it allocate compute resources and start the job, which then enjoys faster access than reading from the data lake directly.

The computing engine part comprises engines for data warehousing, big data, AI, HPC, and more. Data circulation is the most basic problem; beyond that, another important issue is the scheduling and management of the engines themselves. From a resource point of view, these engines mainly consume CPU, GPU, and other compute resources, so there is a basis for resource sharing, and improving overall utilization saves users money. There are two ways to approach this:

  • One approach is to use a cloud vendor's managed or serverless service for a given computing engine instead of building it yourself. These services come with elastic scaling and pay-as-you-go billing, which keeps resource utilization in a reasonable range and sidesteps the resource sharing problem.
  • The other is for the engines that users operate themselves to allocate resources through a unified scheduling and resource management platform. Kubernetes is the most popular choice here; if a particular engine does not yet support running on it, that is only a matter of time, and cloud vendors usually also offer optimized versions or services to choose from.

The interface part ultimately depends on the specific computing engine. SQL is the best choice wherever the workload can be expressed in SQL; other scenarios require users to be familiar with the engine's programming interface.

6. Summary

Exploding enterprise data volumes and increasingly complex business scenarios keep driving the transformation of data platform technology. The two technical routes, the data warehouse and the data lake, have fully demonstrated their respective strengths and weaknesses in past practice. In recent years they have begun to converge, learning from each other's strengths and compensating for each other's weaknesses, iterating toward the so-called lakehouse, or modern data architecture.

The new technologies and methods that keep emerging are the crystallization of the collective wisdom of countless practitioners, and openness is the catalyst that makes all of this possible. The openness of this field shows in many ways:

  • Data is open. Computing engines are becoming more and more open and generally support a set of standard data formats, data circulates more and more easily, and businesses choose the most suitable engine for each computing task on demand.
  • Technology is open. Most of the important technologies in the lakehouse architecture exist as open source projects, and no single company can monopolize the intellectual property. Vendor distributions and the open source versions are interchangeable, and the choice lies with the user. This openness also promotes the fusion of technologies across fields: different fields borrow each other's methods and technologies and shore up each other's weaknesses, achieving a 1 + 1 > 2 effect.
  • Infrastructure is open. In the lakehouse solution, cloud vendors play an important role, providing infrastructure such as object storage and managed big data services. This infrastructure is compatible with industry standards and has open source alternatives, making it easy for customers to build hybrid-cloud and multi-cloud architectures and better exploit the flexibility of the cloud.

Against this open backdrop, everyone in the industry, users and platform builders alike, has their own thoughts and opinions on the data platform. We hope to share some of ours here: partly to offer the industry a few modest insights, and partly so that, looking back someday, these notes will remain as footprints in the snow of our own journey.

The initial scenario was Business Intelligence. Simply put, the management of the enterprise hopes to have a dashboard that is convenient for viewing various types of business data, displaying some statistics and trend data, and the data sources are ERP, CRM, business database, etc. In order to make this requirement easy to use, the best way is to collect the data from various data sources within the enterprise on a single site for archives, and maintain historical data so that relevant query requirements can be solved on this site. This unified site is the data warehouse.

mainstream data warehouse implementation is based on "Online Analytical Processing (OLAP)" technology. Before the birth of data warehouses, the business had widely used relational databases such as MySQL and Orcale, which were based on the "On-Line Transactional Processing (OLTP)" technology. The data in the OLTP database has a fixed format, is well organized, and the supported SQL query language is easy to use and understand. At the same time, it itself is one of the most important data sources in the data warehouse. Therefore, it is a natural idea to build a data warehouse directly using an OLTP database. But soon, everyone discovered that the data warehouse has its own business characteristics. Based on OLTP, it encountered a bottleneck, and OLAP gained the opportunity to develop independently:

  • On the one hand, the data storage method of the OLTP database is row-oriented (row-oriented). The data of a row is stored together. Even if only a few fields are needed when reading, the entire row of data needs to be read out and the required fields are extracted. Data warehouse tables usually have more fields, which leads to low reading efficiency. Column-oriented (column-oriented) data storage method, which stores different columns or column families separately, and only the required parts can be read when reading. This method can effectively reduce the amount of data read and is more friendly to the data warehouse scenario.
  • On the other hand, traditional OLTP databases rely on scale-up configured by stand-alone hardware to improve processing capabilities, with a lower upper limit. The data warehouse scenario reads a very large amount of data in a single query. Repeatedly calling the same reading logic on the same field is very suitable for parallel processing optimization of stand-alone and multiple machines, and to use the cluster's scale-out processing capabilities to shorten the query time. This is the core idea of ​​the MPP (Massively Parallel Processor) computing engine.

Therefore, the characteristics of modern data warehouse architecture are distributed, columnar storage, and MPP computing engines. After the user initiates the calculation task, the MPP computing engine of the data warehouse splits the calculation, each node is responsible for processing part of it, and the calculation is carried out in parallel between nodes, and the final summary results are output to the user.

data warehouse is a typical "Schema-on-Write" pattern, requiring stored data to be processed into a predefined format when written, i.e. schema. This is like the administrator of the data warehouse determines the style of a packaging box in advance. All goods (data) must be packed in packaging boxes and neatly before entering the warehouse. The original data of the

data source is often different from the defined schema, so the imported data needs to go through the ETL process, which is the abbreviation of the three steps of extraction (Extract), transformation (Transform), and loading (Load). The Extract stage reads from the original data source for data cleaning to correct errors and duplications in it. Then enter the Transform stage and do the necessary processing to convert the data into the specified schema. Finally, the data is loaded into the data warehouse.

4.2 Data Lake

1. Write it before We are in an era of big data, with the explosive growth of the data volume of enterprises. How to deal with the challenges of massive data storage and processing and build a good data platform is a very critical issue for an enterprise. From data warehouses, da - DayDayNews

Inner Mongolia Autonomous Region Baiyun Obo Mine, the only mine in the world that contains 17 rare earth elements at the same time. For more than 60 years, this mine has been mined as an iron ore. Later, with the improvement of the strategic value of rare earths and the advancement of mining technology, it has transformed into China's largest rare earth deposit.

tells this story to illustrate the importance of the original data. The original data is like the Baiyun Obo Mine. In addition to the iron that has been discovered, it may also contain rare earths with abundant reserves.The "Schema-on-Write" pattern of data warehouse requires us to know exactly what we are mining before processing the data. When time passes and only the historical data is left in the data warehouse, we may not even know which rare earths have been discarded.

better retain more original data and avoid losing important unknown information. This is the original intention of the data lake concept. Data Lake advocates that all data, whether it is structured data in databases or unstructured data such as videos, pictures, and logs, will be stored in a unified storage base in their original format. Each data source, like a river, gathers into this unified "lake" and integrates it. All data users are supplied with water uniformly by this "lake".

Due to the lack of clear structural information, the data lake uses the "Schema-on-Read" mode, and the user converts the data into the corresponding structure for processing after reading it. Compared with the "Schema-on-Write" of the data warehouse, the processing flow of data becomes ELT, that is, the Transform stage occurs after Load.

"Schema-on-Read" has very loose structure and fewer constraints on computing engines. In fact, the industry has developed a variety of computing engines according to different scenarios.

Traditional data lakes are equivalent to the big data system, and mainly go through two stages: "integrated storage and computing" and "separation of storage and computing":

stage 1: Integrated storage and computing data lakes

In this stage, enterprises develop data lakes based on the Hadoop ecosystem, use HDFS as data storage, and use computing engines such as Hadoop MapReduce and Spark to calculate and store resources on the same batch of machines. The expansion cluster will simultaneously expand the computing power and capacity. After cloud computing developed, this architecture was moved from offline IDC computer rooms to the cloud intact.

Stage 2: Separation and calculation separation data lake

After a period of practice, the integrated computing architecture encountered a bottleneck, which was mainly reflected in several aspects:

  • Computing and storage cannot be expanded separately, and in reality, most users' needs for these two resources do not match, and the integrated computing architecture will inevitably lead to the waste of one of the resources. After the explosive growth of storage capacity and file count of
  • storage capacity and file number, HDFS's NameNode single point architecture encountered a bottleneck in metadata performance. Enterprises alleviated this problem by upgrading NameNode node configuration, multiple HDFS clusters or HDFS Federation, but failed to fundamentally solve this problem, which brought great burden to data platform operation and maintenance personnel.
  • storage cost is also a pain point of the integrated storage and computing architecture. HDFS's 3-replica mechanism is not suitable for storing colder data and is at least twice as expensive than the erasure coding mechanism. There is also a problem of replica amplification on the cloud. The cloud disk provided by cloud manufacturers has a replica mechanism. The actual number of HDFS using cloud disks is higher, possibly up to 9 replicas.

In the process of solving these problems, people noticed the object storage services of cloud vendors. This service provides a nearly infinitely scalable, inexpensive, serverless storage system with performance and capacity. In addition to the shortcomings in POSIX compatibility of some file system interfaces (such as atomic rename, writing and reading, etc.), this service solves the above pain points and is a suitable alternative to HDFS. In fact, the next generation HDFS system, OZone system, also borrows the idea of ​​object storage to solve the above problems.

The data lake based on object storage has given birth to the "storage and computing separation" architecture. The characteristics of storage and computing separation are independent expansion of computing resources and storage resources.

storage separation architecture is an object storage service provided by cloud manufacturers. Compared with self-built HDFS and OZone, one of the biggest advantages of cloud manufacturers comes from scale. Cloud manufacturers need a large enough cluster to store massive user data. The larger the amount of data, the larger the cluster size, the more nodes and devices there are, and the higher the overall performance it can provide.For a single user, it can "borrow" higher performance than self-built HDFS of the same scale. A large enough storage resource pool is the prerequisite and confidence for the separation of storage and computing architecture to work.

On the basis of object storage solving scalability, performance and cost, the serverless product form makes it easy for the computing engine of the data lake to independently scale and scale its computing power. It can even allocate computing resources when computing is needed. The resources will be destroyed immediately after the calculation is completed, and only pay for the resources used, which is optimal in terms of cost and efficiency. This is impossible to achieve in the era before the separation of storage and computing architecture and cloud computing.

For cloud manufacturers, this transformation of architecture has made object storage services the focus of the stage all at once, making cloud manufacturers sweet and testing their technical strength. The awesomeness that has been blown must be fulfilled one by one without any discount. The main challenges here include:

  • scale. A customer has a PB of dozens of PB, and many customers share resource pools. The accumulated capacity of object storage can easily reach EB level, and the corresponding metadata scale reaches trillions. A single cluster serves EB-level capacity and trillion-level metadata, which requires very excellent hard-core architectural design, and there are no shortcomings in scalability in every part of the system.
  • stability. Supports EB-level capacity and trillions of metadata, and the number of machines in each cluster reaches tens of thousands or even hundreds of thousands. With a huge machine base, hardware failures and software failures are commonplace. Reduce or even eliminate the impact of these uncontrollable factors, provide stable delay and throughput levels, low long tails, and compete with high-quality engineering implementation and operation and maintenance capabilities.
  • compatibility. Although object storage as data lake storage has become a consensus, software in the big data system, whether due to historical burdens or indeed cannot be modified, will still rely on some unique capabilities of HDFS in some scenarios. For example, Spark relies on rename to submit tasks, which utilizes the fast execution speed and atomicity guarantee of HDFS rename operations. However, in AWS S3, the originator of object storage, rename is not supported, and can only be roughly simulated through "copy + delete", which has a slow execution speed and no atomicity guarantee. If the general level of object storage for various cloud vendors replaces HDFS in 70% of scenarios, the remaining 30% depends on how the vendor further solves the poor compatibility part, so that the storage and computing separation architecture can be executed more thoroughly.

4.3 Data Warehouse VS Data Lake

Data Warehouse and Data Lake Use the formulas in the previous article to summarize it as:

Data Warehouse = Structured Data Storage System + Built-in Computing Engine + SQL Interface

Data Lake = Original Data Storage System + Multiple Computing Engines + Multiple Interfaces including SQL

Data Warehouse and Data Lake are like iOS and Android:

  • data warehouse is like iOS. It is a relatively closed system with more constraints on data inflow and outflow and usage scenarios, but the advantage lies in the simplicity and ease of use, the closed system has stronger control power, and is easier to optimize performance such as storage formats and computing parallelism. It still plays a dominant role in some query scenarios that require extreme performance.
  • Data Lake is like Android, which emphasizes openness and almost delegates the right to choice to users. There are many mobile phone manufacturers (computing engines) to choose, but using it well requires certain professional capabilities of users. If you use it well, there will be side effects, which can easily lead to "Data Swamp".

5. Modern data platform: Hucang integrated

5.1 Dilemma faced by data lake

The data lake hands the decisions of "what data to store and how to use it" back to the user, with very relaxed constraints. However, if users do not manage data well as it enters the lake, useful and useless, high-quality and low-quality data all get thrown in together, and it becomes hard to find what is needed when the time comes to use it. Over time the data lake turns into a huge garbage dump, for which the standard name is "data swamp".

In order to avoid the data lake eventually becoming a data swamp, several important problems need to be solved:

Problem 1: Data quality

Relying solely on "Schema-on-Read" means processing data in its original format at computation time and filtering out the useless parts on the fly. This work has to be repeated for every computation, which both slows the computation down and wastes compute.

A feasible approach is to borrow data warehouse practice inside the data lake: pre-process the raw data through one or more rounds of ETL, converting it into data that is friendlier to the computing engines and of higher quality. The raw data is not deleted, and the ETL output is also stored in the data lake, so the original data is preserved while computation stays efficient.
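
As an illustration, here is a minimal PySpark sketch of this "ETL inside the lake" idea; the paths and field names are hypothetical. Raw JSON logs stay where they are, while a cleaned, columnar copy is written back into the lake for the engines to consume.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

# Schema-on-read: the raw JSON is parsed only at query time.
raw = spark.read.json("s3a://my-datalake/raw/orders/2023-06-01/")

# One round of ETL: drop malformed rows, normalize types, keep useful columns.
curated = (
    raw.where(F.col("order_id").isNotNull())
       .withColumn("amount", F.col("amount").cast("double"))
       .select("order_id", "user_id", "amount", "created_at")
)

# The curated copy is written back to the lake in a columnar format;
# the raw data is kept untouched.
curated.write.mode("overwrite").parquet(
    "s3a://my-datalake/curated/orders/dt=2023-06-01/"
)
```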

Problem 2: Metadata management

Metadata is data that describes data. Its importance lies in answering, for each piece of data, the classic philosophical questions: "Who am I? Where do I come from? Where am I going?" Format information (such as the field definitions of a table file), location information (such as the path where the data is stored), and data lineage (such as which upstream data it was derived from) all rely on metadata to be interpreted.

Establishing complete metadata for the data lake helps users make better use of the data. Metadata is generally divided into two parts, both important. The first is a centralized data catalog service, which typically offers automatic analysis and fuzzy search and is used to manage and discover what data exists in the lake. The second is the metadata built into the data itself, which ensures the data can still be interpreted correctly even after it is moved. A data catalog is like the shelves of a library: by sorting and archiving books into categories, a book can be located quickly. The metadata embedded in the data is like a book's table of contents: it tells you what the book contains and on which page; when the book moves from one shelf to another, its location changes, but its table of contents does not.
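
For example, the "metadata built into the data" can be read straight from the file itself. The sketch below (the file name is hypothetical) uses pyarrow to inspect the schema and row-group statistics stored in a Parquet footer, which travel with the file no matter where it is moved.

```python
import pyarrow.parquet as pq

# The Parquet footer carries the schema and per-row-group statistics,
# so any engine can interpret the file without consulting an external catalog.
schema = pq.read_schema("orders.parquet")
print(schema)                      # field names and types

meta = pq.read_metadata("orders.parquet")
print(meta.num_rows, meta.num_row_groups)
```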

Metadata management also needs to address data permissions. The storage systems the data lake relies on, whether HDFS or object storage, grant permissions at the level of directories and files, a granularity that does not match the needs of upper-layer services. For example, the dataset of an image recognition AI task consists of many small files that should be treated as a whole; it makes no sense for a user to have access to some of those files but not others. Conversely, a single file may hold business order data in which sales staff and company executives are allowed to see different subsets. Both cases call for finer-grained permission control.

Problem 3: Data versioning

Data rarely enters the lake as a one-off transaction that never changes after import. For example, collecting data from the online user order database into the data lake for later analysis requires continuously synchronizing new orders. The simplest way to handle repeated imports is to re-import everything each time, but this is obviously too crude: it inflates resource consumption and makes every import slow.

Supporting incremental updates is therefore an important data lake capability, and it raises some hard questions: 1) how to serve read requests while an update is in progress; 2) how to recover when an update is interrupted; 3) how to identify incomplete updates; 4) how to restore data after it has been contaminated by a faulty operation. In databases and data warehouses the answer to these questions is ACID. The table formats that have emerged in the data lake field in recent years, such as Apache Iceberg, Apache Hudi, and Delta Lake, are devoted to retrofitting these capabilities onto object storage and have become an important part of the data lake.
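
As a rough sketch of what these table formats enable, the snippet below assumes a Spark session configured with the Delta Lake (or Apache Iceberg) extensions; the table, path, and column names are hypothetical. An incremental batch of orders is applied as a single atomic upsert instead of re-importing everything.

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake (or Iceberg) Spark extensions are on the classpath
# and configured; everything named here is illustrative only.
spark = SparkSession.builder.appName("orders-upsert").getOrCreate()

incremental = spark.read.json("s3a://my-datalake/raw/orders/incremental/")
incremental.createOrReplaceTempView("updates")

# The table format turns this into an atomic, versioned commit: readers see
# either the old snapshot or the new one, never a half-applied update, and an
# earlier snapshot can be restored if a bad batch is merged by mistake.
spark.sql("""
    MERGE INTO lake.orders AS t
    USING updates AS u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```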

Problem 4: Data circulation

Real-world scenarios are complex and varied, with differing real-time and accuracy requirements for data processing, so the industry has developed many computing engines. If each engine speaks its own dialect and recognizes only the storage format it defines, then the same piece of data has to go through Schema-on-Read or ETL again and again as it passes between engines, wasting a great deal of resources. That is clearly unreasonable.

The ideal is for everyone to speak a common language so that no translation is needed. As big data evolved, a set of widely used data formats (Apache Parquet, Apache ORC, etc.) and table formats (Apache Iceberg, Apache Hudi, Delta Lake, etc.) gradually took shape. These technologies are supported by more and more computing engines and, in a sense, serve as the lingua franca of the data lake, easing the data circulation problem.
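
A small illustration of this "common language", with a hypothetical file path: one engine writes Parquet and a completely different one reads it back without any conversion step.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Producer side: write a Parquet file with pyarrow.
table = pa.table({"order_id": [1, 2, 3], "amount": [9.9, 20.0, 3.5]})
pq.write_table(table, "/tmp/orders.parquet")

# Consumer side: a different engine reads the very same file with no
# re-encoding (pandas here; Spark, Trino, Hive, etc. work the same way).
df = pd.read_parquet("/tmp/orders.parquet")
print(df.head())
```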

5.2 The trend toward lake-warehouse convergence

As both sides iterate, the boundary between data lake and data warehouse is becoming increasingly blurred, and a trend of convergence is emerging:

  • In the course of solving the data swamp problem, to make a very loosely constrained ecosystem easier to use well, industry practice has in fact imposed many constraints on how data lakes are used. Interestingly, these constraints closely resemble what data warehouses did all along, such as ETL, ACID, and permission control. The data lake is thus taking on some characteristics of a data warehouse.
  • After trying various non-SQL programming interfaces and interaction styles, the industry found that SQL is still the best choice in many scenarios. Data warehouses have also become more open in recent years, with steadily improving support for the data formats and table formats commonly used in data lakes: besides built-in ETL, they can now process them directly as external sources. These trends suggest that the data warehouse, as an important computing engine, can grow on top of the data lake.
  • The data warehouse also runs into the limitations of coupled storage and compute, and is itself iterating toward a storage-compute separation architecture. Some systems adopt a hot/cold separation design: hot data lives on high-speed local media while cold data sinks into the data lake, striking a balance between performance and cost. More thoroughly cloud-native warehouses keep all data in the data lake and rely on local-node caches to make up for the lake's speed; this simplifies the warehouse architecture, frees it from worrying about data durability, and lets multiple read-only clusters share the same data. Important techniques from the data warehouse world can also be borrowed by the big data engines above the data lake, and vice versa. For example, engine acceleration techniques long matured in warehouses such as ClickHouse, like vectorization and LLVM JIT, have been borrowed to build native engines for Spark; compared with the original JVM engine, a native engine achieves higher hardware utilization and faster computation.

Beyond data warehouses and big data, enterprises run other important types of computation, most commonly high-performance workloads such as AI and HPC. The data lake's strength lies in high throughput, while its metadata performance and latency are only average; high-performance computing, by contrast, has fairly strict requirements on both. Enterprises therefore tend to maintain one or more high-speed file systems (Lustre, BeeGFS, etc.) outside the data lake for these workloads. In essence, the frameworks used in high-performance computing are also computing engines, and the data they consume and produce is part of the enterprise's digital assets, so bringing these workloads into the data lake system is an important question. The answer mirrors the storage-compute separation of data warehouses, with the same two routes:

  • A hot/cold separation design, in which the high-speed file storage system uses the data lake as its cold data tier.
  • A cloud-native file system built on the data lake. Although it exposes file system interfaces, it is really a cache acceleration system with a "cache layer + data lake" architecture: the cache layer keeps hot data on demand on the compute nodes or on hardware close to them, while the data lake stores all data to guarantee durability. If data is evicted from or lost in the cache, it can always be reloaded from the data lake (see the sketch after this list).
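
The "cache layer + data lake" idea reduces to a read-through pattern. The following deliberately simplified sketch (the local path layout and boto3 usage are assumptions, not any particular product's API) has a compute node serve hot data from local disk and fall back to the object store on a miss.

```python
import os
import boto3

s3 = boto3.client("s3")
CACHE_DIR = "/mnt/nvme-cache"   # hypothetical local high-speed media

def read_through(bucket: str, key: str) -> bytes:
    """Serve from the local cache if possible, otherwise load from the lake.

    The object store remains the source of truth: if the cached copy is
    evicted or lost, it can always be reloaded from the data lake.
    """
    local_path = os.path.join(CACHE_DIR, bucket, key)
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            return f.read()

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(body)
    return body
```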

The term "lakehouse" was first proposed by Databricks, and the industry has not fully settled on it; some competitors go out of their way to avoid the word, and AWS prefers "modern data architecture". Whatever the name, the lakehouse represents the next stage of the data lake, and in essence it is the ultimate one-stop data platform for the enterprise.

Such a data platform is first of all an all-in-one storage infrastructure that meets all of an enterprise's data storage needs, covering both low-cost and high-performance requirements. Second, it goes beyond data warehousing and big data, running computing engines of every kind: data warehouse, big data, AI, HPC. These engines consume and produce data structures that each other can understand, so data flows between businesses without barriers.

5.3 Lakehouse architecture

[Figure: lakehouse architecture diagram]

Following the earlier discussion, the lakehouse can be summarized briefly with the data platform formula:

Lakehouse = Object storage equipped with a metadata layer and an acceleration layer + Computing engines across domains such as data warehouse, big data, AI, and HPC + Multiple interfaces including SQL

On the storage side, object storage has become the de facto standard for data lakes, with an ecosystem far more prosperous than any other type of cloud storage product. For the storage problems that object storage alone cannot solve well, it is paired with an appropriate metadata layer and acceleration layer.

  • To counter the data swamp problem, the metadata layer establishes the necessary mechanisms for data quality, metadata management, version management, and data circulation, so that every internal business can easily consume high-quality data.
  • For businesses with higher demands on metadata performance and latency, such as data warehouses, AI, and HPC, the acceleration layer supplements object storage. It is usually a high-speed file system or cache system deployed closer to the compute nodes, and both metadata and data can flow automatically between the acceleration layer and the data lake. To simplify use, the acceleration layer is also paired with the upper-level job scheduling system so that data movement becomes more intelligent: for example, the scheduler warms data up in advance, and only once the data has been loaded into the cache does it allocate compute resources and start the job, which then enjoys faster access than going to the data lake directly (a minimal sketch follows this list).
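
A minimal sketch of that warm-up flow, where every name is hypothetical: the scheduler prefetches the job's inputs into the acceleration layer first, and only then launches the computation against the warm copies.

```python
from concurrent.futures import ThreadPoolExecutor

def warm_up(keys, fetch):
    """Prefetch a job's inputs into the acceleration layer in parallel.

    `fetch` is any cache-populating read (for instance, the read_through
    sketch shown earlier); the scheduler calls this before allocating compute
    resources, so the job starts against warm local copies instead of the
    slower data lake.
    """
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(fetch, keys))

# Hypothetical flow: warm the cache, then submit the job.
# warm_up(job.input_keys, lambda k: read_through("my-datalake", k))
# scheduler.allocate_and_run(job)
```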

The computing engine part covers engines of every kind: data warehouse, big data, AI, HPC. Data circulation is the most basic problem, but another important issue is the scheduling and management of the engines themselves. From a resource perspective, these engines mainly consume CPU, GPU, and other compute resources, which creates a natural basis for resource sharing; raising overall utilization directly saves users money. There are two ways to approach this:

  • One approach is to use a cloud vendor's hosted or serverless service for a given computing engine instead of building it yourself. These services come with built-in elastic scaling and pay-as-you-go pricing, which keeps resource utilization within a suitable range and sidesteps the resource-sharing problem altogether.
  • The other approach is for self-operated computing engines to draw resources from a unified scheduling and resource management platform. Kubernetes is the most popular choice here; if a particular engine does not yet support running on it, that is only a matter of time, and cloud vendors usually offer optimized distributions or services to choose from.

The interface part ultimately depends on the specific computing engine. SQL is the best choice wherever the workload can be expressed in SQL; other scenarios require users to be familiar with the engine's programming interface.

6. Summary

Enterprise data volumes have exploded and business scenarios keep getting more complex, driving continuous change in data platform technology. The two technical routes, data warehouse and data lake, have fully demonstrated their respective strengths and weaknesses in practice; in recent years they have begun to converge, borrowing from each other and iterating toward what is called the lakehouse, or modern data architecture.

The new technologies and methods that keep emerging are the crystallization of the collective wisdom of countless practitioners, and the field's open spirit is the catalyst that makes it all possible. That openness shows up in several ways:

  • Data is open. Computing engines are increasingly open, generally supporting standard data formats; data circulates ever more easily, and each business picks the most suitable engine for its computing tasks on demand.
  • Technology is open. Most of the key technologies in the lakehouse architecture exist as open source projects, so no single company can monopolize the intellectual property. Vendor distributions and open source versions are interchangeable, and the choice rests with the user. This openness has also promoted cross-field integration: different fields borrow each other's methods and shore up each other's weaknesses, achieving a 1 + 1 > 2 effect.
  • Infrastructure is open. Cloud vendors play an important role in lakehouse solutions, providing infrastructure such as object storage and hosted big data services. This infrastructure is compatible with industry standards and has open source alternatives, making it easy for customers to build hybrid-cloud and multi-cloud architectures and better exploit the flexibility of the cloud.

In this open spirit, everyone in the industry, users and platforms alike, has their own thoughts on the data platform, and we want to share some of ours: partly to offer the industry some modest insights, and partly so that, looking back later, they will serve as a record of the traces we left along the way.
