Database vs. data warehouse vs. data platform vs. data middle platform: a 7,000-word explanation of how the data platform evolved

2020/12/22 18:40:48

Many people, both inside and outside the industry, misunderstand the data middle platform and stress only the role of technology. To align everyone's understanding and make the meaning of the data middle platform clearer, this article gives an in-depth introduction to it from the perspective of how data platforms have evolved.

Foreword

In the era of big data, every AI project needs four basic elements: data, algorithms, scenarios, and computing power. None can be left out. Big data problems can no longer be solved by computing power alone. Computing power is only the foundation; it has to be combined with business scenarios and algorithms to create a complete intelligent platform. The data middle platform builds on the basic computing power that cloud computing provides for data intelligence, and combines it with the data asset capabilities and technical capabilities of the big data platform to form a data processing capability framework that empowers the business and lets enterprises operate digitally and intelligently.

Many people, both inside and outside the industry, misunderstand the data middle platform. They stress only the role of technology and how it drives the business. When it comes to putting things into commercial practice, however, technology more often has to follow the business: its development is pulled along by the needs of the business side and by the exploration of data application scenarios. This is why rumors that Alibaba is dismantling its "big middle platform" have spread wildly on Zhihu and Maimai. My personal guess is that the rumors come from not really understanding the essence of the middle platform. Alibaba's original purpose in building the data middle platform was to improve efficiency and solve the problem of business matching, and ultimately to cut costs and raise efficiency. So the "dismantling" is not real; within the "dismantling" there must be "combining". On the one hand, the "dismantling" is a planning and structural upgrade of the corporate strategic layout; if your vantage point is not high enough and your perspective not broad enough, you will only see the surface. On the other hand, the "dismantling" is not done because the organization has grown too large, but because only in this way can efficiency and business matching be decoupled to the greatest benefit.

The significance of the data middle platform is to reduce costs and increase efficiency: it empowers enterprises to accumulate business capabilities, improve business efficiency, and ultimately complete their digital transformation. A previous article on the value and significance of building a data middle platform noted that companies need to build their own unique middle platform capabilities based on their actual conditions.

This is because a data middle platform cannot simply be copied. Viewed through the BCG matrix, together with each company's market resources, market environment, market position and business direction, the strategic goals of almost all companies differ. If someone says they can sell you a middle platform, yet their account of it talks only about technology and products and never about the business, and does not aim to solve efficiency and matching problems in light of the company's business goals, they are not being straight with you. The mission and vision of the data middle platform is to make data a resource like water and electricity: available on demand, agile and self-service, closer to the business and cheaper to use, so that data delivers its maximum value more efficiently and drives business innovation and change.

To further align everyone's understanding and make the meaning of the data middle platform clearer, this article covers the following in order:

The evolution of the data middle platform

The concepts of the data warehouse, data platform and data middle platform

The architecture of the data warehouse, data platform and data middle platform

The differences and connections between the data warehouse, data platform and data middle platform

1

The evolution of the data middle platform

From the perspective of data processing, the road to the data middle platform has four stages: the database stage, the data warehouse stage, the data platform stage, and the data middle platform stage.

Database stage: OLTP (online transaction processing) is the main application of traditional relational databases. It handles basic, day-to-day transactions and records inserts, deletes, updates and queries as they happen, for example bank transactions and e-commerce transactions.

Data warehouse stage: the main application of a data warehouse system is OLAP (online analytical processing), which supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results, for example complex dynamic report analysis and user value analysis.

Data platform stage: the industry has no unified definition of a big data platform. In general, a system can be regarded as a big data platform if it uses distributed real-time or offline computing frameworks such as Hadoop, Spark, Storm or Flink to build computing clusters and run computing tasks on them; interconnects data and supports real-time synchronization of multiple data sets; supports data resource management and integrated control of multi-source heterogeneous data; and provides a complete basic runtime environment for big data analysis along with unified secondary development interfaces. It mainly addresses big data storage and computing, data application management, task monitoring, data asset management, development management, and visual reporting requirements.

Data middle platform stage: a global-level, reusable data asset center and data capability center. It collects, computes, stores and processes massive amounts of data while unifying standards and metric definitions, and provides clean, transparent and intelligent data assets together with efficient, easy-to-use data capabilities that can connect to both OLTP (transaction processing) and OLAP (report analysis) needs. From business architecture design to model design, and from data development to data services, data can be managed and traced and duplicate construction is avoided. The emphasis is on the ability to turn data into business.


The four stages in the evolution of the data middle platform

I previously went through the 0-1-N growth of an e-commerce company, so I will use the e-commerce industry as an example to make the four stages easier to understand.

1. Database stage

Starting an e-commerce business is easy in the early days: the threshold is relatively low and trial and error is cheap. Three to five people form a small team and build a front-end page that can take orders. A few cloud servers and a MySQL database make up a simple OLTP system that users can work with. Its main job is to store data persistently and serve simple queries on product transactions.

Many small e-commerce and mini-program startups probably did this early on, and some even hired an outsourcing team to start testing the market. The reason is simple. From an ROI perspective, the volume of business data in the early stage of a project is small, only at the GB level, and daily orders and traffic are low. A back-end database that queries and displays individual records is enough, and there is no need for techniques such as high concurrency or batch processing. Even for data statistics and analysis, Excel is enough at the beginning.

As users, products and traffic grow, two transitional solutions are available. The first tackles slow queries and insufficient performance by upgrading the single machine and squeezing the most out of it: database optimization (SQL statements, indexes, database and table partitioning, SQL scripts), caching, memory tuning, thread pool tuning, NIO-based communication, blocking queues (program optimization), containers (Docker), SSDs, and a suitable I/O model. The second changes the original setup: add servers and multiple business databases, split the data across databases and tables, and add single-column and composite indexes, so that business transactions stay stable under high concurrency. In this way business figures and metrics can still be supported and queried quickly from the business databases.
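The second option, splitting one large table across several databases and tables, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme any particular company uses: the shard counts, the `order_db_N` / `orders_N` naming pattern, and the `route_order` helper are all hypothetical names introduced here.

```python
import zlib

# Hypothetical layout: 4 databases, each holding 8 order tables.
DB_COUNT = 4
TABLES_PER_DB = 8


def route_order(user_id: str) -> tuple[str, str]:
    """Map a user ID to a (database, table) pair.

    CRC32 is used instead of the built-in hash() because hash() is
    randomized per process for strings, which would break routing
    between application restarts.
    """
    slot = zlib.crc32(user_id.encode("utf-8")) % (DB_COUNT * TABLES_PER_DB)
    db_index, table_index = divmod(slot, TABLES_PER_DB)
    return f"order_db_{db_index}", f"orders_{table_index}"


if __name__ == "__main__":
    for uid in ("u1001", "u1002", "u1003"):
        print(uid, "->", route_order(uid))
```

Routing by user ID keeps all of one user's orders in the same table, so the common "my orders" lookup touches a single shard. The trade-off is that questions spanning all users now require visiting every shard, which is one reason analysis workloads eventually move out of the business databases.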

Eventually, as customers, orders and external traffic keep increasing, the data volume grows from the GB level to the TB level. Ordinary queries put the database under ever greater pressure, and an upgrade becomes the only option. This is where the data warehouse comes in.


2. Data warehouse stage

As the business grows exponentially, data volume grows with it, the company's organizational structure gradually becomes large and complex, and the problems it faces go deeper. The questions senior management cares about evolve beyond simple ones such as "What was GMV yesterday and today?", "What were last week's PV and UV?" and "What are the month-on-month and year-on-year growth rates of a given product category?". Management now wants to use data for refined operations and user value modeling, and to analyze users in a specific scenario through statistics, analysis and mining, for example "the relationship between the clothing purchases of female users aged 18-25 over the past three months and holiday promotions".
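For readers unfamiliar with the two growth rates mentioned above, here is a small worked example. The GMV figures are made up for illustration, and `growth_rate` is a hypothetical helper name; the formula itself is the standard relative change, (current - previous) / previous.

```python
def growth_rate(current: float, previous: float) -> float:
    """Return the relative change from previous to current (0.25 means +25%)."""
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (current - previous) / previous


# Made-up monthly GMV figures, keyed by (year, month).
gmv = {
    (2019, 11): 800_000.0,
    (2020, 10): 1_000_000.0,
    (2020, 11): 1_200_000.0,
}

# Month-on-month compares with the previous month.
month_on_month = growth_rate(gmv[(2020, 11)], gmv[(2020, 10)])
# Year-on-year compares with the same month one year earlier.
year_on_year = growth_rate(gmv[(2020, 11)], gmv[(2019, 11)])

print(f"month-on-month: {month_on_month:.1%}")
print(f"year-on-year: {year_on_year:.1%}")
```

Questions of this kind need only a couple of aggregated numbers per period, which is why they can still be answered from the business database. The segmented question about 18-25 year old users cannot be answered so cheaply.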

When operations staff and senior executives raise such specific questions, they expect data statistics, analysis and mining to play a key role in the company's operational decisions. In practice, the answers are hard to retrieve directly from the business database.

The reason is that a database is transaction-oriented, while a data warehouse is subject-oriented. A database generally stores online transaction data and is designed to capture data; its design avoids redundancy as far as possible and usually follows the normal forms. For example, the data structures in a business database are designed to complete product transactions, not to make querying and analysis convenient. A data warehouse generally stores historical data and is designed for analysis; redundancy is introduced deliberately and the design is denormalized. The two basic elements of a data warehouse are dimension tables and fact tables. (A dimension is an angle from which a question is viewed, such as time, department or person, and the dimension table holds the definitions of these things. The fact table holds the data to be queried together with the IDs of the dimension tables.)
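The relationship between dimension tables and fact tables can be made concrete with a small sketch. It uses Python's standard `sqlite3` module purely as a stand-in for a real warehouse engine; the table names (`dim_user`, `dim_category`, `fact_order`), the columns and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the angles of analysis.
    CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);

    -- The fact table holds the measures plus the dimension IDs.
    CREATE TABLE fact_order (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES dim_user(user_id),
        category_id INTEGER REFERENCES dim_category(category_id),
        order_date TEXT,
        amount REAL
    );

    INSERT INTO dim_user VALUES (1, 'F', 22), (2, 'F', 31), (3, 'M', 24);
    INSERT INTO dim_category VALUES (10, 'clothing'), (20, 'electronics');
    INSERT INTO fact_order VALUES
        (100, 1, 10, '2020-11-11', 200.0),
        (101, 1, 10, '2020-12-01', 50.0),
        (102, 2, 10, '2020-11-11', 300.0),
        (103, 3, 20, '2020-11-11', 900.0);
    """
)

# Subject-oriented question: clothing spend of female users aged 18-25, by day.
rows = conn.execute(
    """
    SELECT f.order_date, SUM(f.amount)
    FROM fact_order AS f
    JOIN dim_user AS u ON u.user_id = f.user_id
    JOIN dim_category AS c ON c.category_id = f.category_id
    WHERE u.gender = 'F' AND u.age BETWEEN 18 AND 25 AND c.name = 'clothing'
    GROUP BY f.order_date
    ORDER BY f.order_date
    """
).fetchall()

print(rows)
```

The query filters on the dimensions and aggregates the fact table, which is the typical shape of an OLAP question. A production warehouse would add a date dimension and pre-aggregated layers, but the division of roles is the same.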

So data warehouses did not emerge to replace databases, but to do data analysis and reporting better; they mainly serve OLAP (online analytical processing) needs.

However, as customers, orders and external traffic continue to grow, the data volume rises from the TB level to the PB level, and the original technical architecture is less and less able to support massive data processing. At this point the data platform is born.

3. Data platform stage:

First, the enterprise has too many business systems and their data is not connected. To analyze data, you first have to find the relevant data in each system, then extract and integrate it, and only then can analysis begin. Manual integration in this process is error-prone and the analysis is not timely, so overall efficiency is low and data migration and synchronization suffer from lag and errors;

Second, the business systems are under heavy load and their architecture is relatively heavy, so running data analysis and computation on them consumes a great deal of resources. Data needs to be extracted and query and analysis tasks run on independent servers to relieve the pressure on the business systems;

Third, there are performance problems. The company's business keeps getting more complex and data volume keeps rising. Historical data piles up without being used, and once the original data system can no longer cope with larger volumes, data processing efficiency drops sharply.

Therefore, a big data platform is built by integrating distributed offline and real-time computing frameworks such as Hadoop, Spark, Storm and Flink into a computing cluster that runs all kinds of computing tasks. The platform interconnects data, supports real-time synchronization of multiple data sets, supports data resource management, and provides integrated management and control of multi-source heterogeneous data. It offers a complete basic runtime environment for big data analysis and unified secondary development interfaces. These capabilities solve the storage and computing problems of big data, improve the efficiency of data analysis, and support applications such as user profiling, recommendation, search and advertising systems.
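The difference between the offline and real-time computing styles that such a platform combines can be illustrated without any cluster. The sketch below is plain Python, not Hadoop, Spark, Storm or Flink code; the event data and the function names are assumptions chosen to show the two processing models side by side.

```python
from collections import Counter
from typing import Iterable, Iterator

# Made-up page-view events: (user_id, page).
EVENTS = [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")]


def batch_page_views(events: Iterable[tuple[str, str]]) -> Counter:
    """Offline style: scan the full data set once and return the totals."""
    return Counter(page for _, page in events)


def streaming_page_views(events: Iterable[tuple[str, str]]) -> Iterator[Counter]:
    """Real-time style: update running totals as each event arrives."""
    totals: Counter = Counter()
    for _, page in events:
        totals[page] += 1
        yield totals.copy()  # snapshot that a dashboard could read immediately


batch_result = batch_page_views(EVENTS)
snapshots = list(streaming_page_views(EVENTS))

print("batch:", dict(batch_result))
print("after first event:", dict(snapshots[0]))
print("after last event:", dict(snapshots[-1]))
```

Both paths reach the same totals once all events have been seen. The batch path answers only after the whole data set is available, while the streaming path has a usable partial answer after every event, at the cost of keeping state between events.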

4. Data middle platform stage

Data volume has grown exponentially, from the PB level to the EB level. To better empower the business, enterprises launch a middle platform strategy: they connect data across business lines, integrate and aggregate it, and use technical means at the bottom layer to unify data storage and computing. At the data service layer, service-oriented Data APIs connect the data platform with the front-end business layer and, combined with algorithms, directly serve the analysis needs and transaction needs of the front-end business. The middle platform processes data and applies logic to it, then feeds the results back to empower the business, truly achieving "all business becomes data, all data becomes business".
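The idea of a service-oriented Data API can be sketched as a single entry point that hides storage details and applies one shared definition per metric. Everything in this sketch is hypothetical: the article does not describe a concrete API, so `query_metric`, the `Metric` class, the metric names and the sample orders are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Made-up order records standing in for unified storage.
ORDERS = [
    {"line": "retail", "amount": 120.0},
    {"line": "retail", "amount": 80.0},
    {"line": "grocery", "amount": 40.0},
]


@dataclass(frozen=True)
class Metric:
    """One shared definition per metric, so every caller computes it the same way."""
    name: str
    compute: Callable[[list], float]


METRICS = {
    "gmv": Metric("gmv", lambda rows: sum(r["amount"] for r in rows)),
    "order_count": Metric("order_count", lambda rows: float(len(rows))),
}


def query_metric(name: str, business_line: Optional[str] = None) -> dict:
    """Hypothetical Data API entry point: look up a metric by name."""
    if name not in METRICS:
        raise KeyError(f"unknown metric: {name}")
    # None means "all business lines"; otherwise keep only the matching rows.
    rows = [r for r in ORDERS if business_line in (None, r["line"])]
    return {
        "metric": name,
        "business_line": business_line,
        "value": METRICS[name].compute(rows),
    }


print(query_metric("gmv"))
print(query_metric("order_count", business_line="retail"))
```

Because front-end systems call the service rather than querying tables themselves, a metric such as GMV is defined once and reused everywhere, which is the point of unifying standards and avoiding duplicate construction.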

2

The concepts of the data warehouse, data platform and data middle platform


Diagram of the data warehouse, data platform and data middle platform

A data warehouse is a strategic collection that provides all types of data support for decision-making. It is a single data store, created for analytical reporting and decision support. For companies that need business intelligence, it provides guidance for business process improvement and for monitoring time, cost, quality and control. It is a relatively specific functional concept: a collection that stores and manages data on one or more subjects. It serves the business mainly through analytical reports.

The data platform emerged on the basis of big data. It is a foundational platform that combines structured and unstructured data and integrates data access, data processing, data storage, query and retrieval, analysis and mining, and application interfaces. It serves the business mainly by providing data directly in the form of data sets.

The data middle platform is a global-level, reusable data asset center and data capability center. It provides clean, transparent and intelligent data assets and efficient, easy-to-use data capabilities, enabling the business to operate digitally. It serves the business mainly by providing data service capabilities.

The advantage of the data warehouse is that it has metadata and organizes data well in tables. Data has to be processed, and the data warehouse uses a layered model; with each layer upward, more of the information in the data is lost.

The advantage of the data platform is that it provides advanced analysis functions and a data resource management center: data interconnection and real-time synchronization of multiple data sets; data resource management with integrated control of multi-source heterogeneous data; and a complete basic runtime environment for big data analysis with unified secondary development interfaces.

The data middle platform has a global metadata management system. Management is also table-based, with granularity down to the field level. Its metadata contains the metadata of each sub-store, organized in the form the middle platform requires, and becomes a data asset management center presented through a data map. Like a network of pipes that manages the distribution and transfer of data, it lets us find the data we need, associate, process and analyze it, and speed up the enterprise's progress from data to business value.

3

The architecture of the data warehouse, data platform and data middle platform

1. Collection layer

The collection layer collects data from the various data sources and stores it in HDFS, the Hadoop distributed file system, performing ETL operations along the way. Flume is generally used to collect logs, and Sqoop to synchronize data from RDBMS and NoSQL stores into HDFS.

Data sources mainly include: log data (server logs, system logs, etc.) + business databases (MySQL, Oracle, etc.) + event tracking data (server-side and mobile tracking data, etc.) + other data (data entered manually in Excel, interface data provided by partners, third-party crawler data, legally purchased third-party data, etc.)

2. Storage and analysis layer

This layer mainly covers offline computing and real-time computing.

Storage system: stores the data from the collection layer on the Hadoop distributed file system.

Messaging system: Kafka is added to prevent data loss.

Offline computing: handles the part that does not need to be real-time; the results are usually stored in Hive.

Real-time computing: Spark Streaming and Storm consume the log data collected in Kafka, and the computed results are saved in Redis.

Machine learning: uses the machine learning algorithms provided by Spark MLlib.

3. Sharing layer

The results of offline and real-time analysis and computation are stored in the data sharing layer, which mainly acts as a center for distributing and dispatching data. This layer is needed because the results produced by Hive, MR, Spark and SparkSQL are stored on HDFS, and businesses and applications cannot practically read data directly from HDFS. Within this layer, Kylin serves as the OLAP engine for multi-dimensional analysis.

4. Data application

Report display + data analysis + ad hoc query + data mining

5. Task scheduling and monitoring

Data platform architecture diagram


1. Collection layer

The collection layer's data is stored on the Hadoop file system.

Structured data: extracted and stored in the HDFS distributed file system in one of two ways. Data that can be serialized is stored directly in HDFS; data that cannot be serialized directly is first organized and placed in a distributed database environment, and is then stored in HDFS after serialization;

Semi-structured and unstructured data: log data of all kinds (usually serialized semi-structured data) is stored directly in HDFS; clickstream and data interface data (usually serialized semi-structured data) is stored directly in HDFS; unstructured data is stored directly in HDFS.

2. Data layer

On the one hand, structured business data and semi-structured data with a definite format are stored in the Hive data warehouse on Hadoop, organized by business. On the other hand, semi-structured data from related businesses is stored directly in the HDFS distributed file system.

3. Computing layer

Offline computing + real-time computing

4. Application layer

Visual data analysis reports + scenario-specific applications such as search, recommendation and advertising

5. Task scheduling and monitoring


Alibaba data middle platform architecture diagram

To ensure fast, efficient and high-quality data access, a unified data quality management platform and a data capability center have been established.

Taking data collection and access as the entry point, internal data (Taobao, Tmall, Hema, etc.) and external data (crawler data, third-party partner data, event tracking data, etc.) are brought in according to a set format.

The data is extracted to the computing platform, and a "data sharing center" is built on the architecture of "business segment + business process + analysis dimension", forming the OneData system.

On top of the data sharing center, tags are extracted from business and natural objects to build a "data only center", forming the OneID system, which connects the consumer data system, the enterprise data system, the content data system and others. After deep processing, the result is clean, transparent and intelligent data that empowers products and business lines. A unified data service middleware, "OneService", provides unified data services, so that "all business becomes data, all data becomes business".
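The article does not say how OneID connects identifiers across systems. One common technique for this kind of identity unification is a union-find (disjoint-set) structure, sketched below as an assumption rather than as Alibaba's actual method. The `IdentityGraph` class and the identifier strings are hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers; each connected group is treated as one entity."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def _find(self, ident: str) -> str:
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[ident] != root:
            next_ident = self._parent[ident]
            self._parent[ident] = root
            ident = next_ident
        return root

    def link(self, a: str, b: str) -> None:
        """Record that two identifiers were observed together."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Keep the smaller string as root so the unified ID is deterministic.
            keep, drop = sorted((root_a, root_b))
            self._parent[drop] = keep

    def one_id(self, ident: str) -> str:
        """Return the unified ID for an identifier."""
        return self._find(ident)


graph = IdentityGraph()
graph.link("device:d-1", "account:a-1")   # a login seen on a device
graph.link("account:a-1", "member:m-1")   # an account bound to a membership
graph.link("device:d-2", "account:a-2")   # an unrelated user

print(graph.one_id("device:d-1") == graph.one_id("member:m-1"))
print(graph.one_id("device:d-1") == graph.one_id("device:d-2"))
```

Once every identifier resolves to one unified ID, tags computed in the consumer, enterprise and content data systems can be joined on that ID. Real systems also have to handle wrong links and shared devices, which this sketch ignores.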

4

The differences and connections between the data warehouse, data platform and data middle platform


1. At the conceptual level

The technical capabilities of the data platform and the data middle platform both developed from the data warehouse, and in terms of data construction theory they belong to the same lineage. All of them deal with massive data, and their service purposes and commercial value are similar. Both the platform and the middle platform can expose their capabilities to the outside through Open API services.

On the one hand, the middle platform is a business application and does not refer to any specific technology. It cannot be used directly by end users and must be integrated with the enterprise's business. On the other hand, the platform has no business characteristics; it mainly gathers and integrates the capabilities of others, and is relatively static. The middle platform, by contrast, keeps changing: it needs to nourish the business in a data-driven way, continuously train and adjust business models, and offer the capabilities of its business service algorithms to other systems and platforms for integration.

2. At the data level

The data warehouse's data comes mainly from RDBMS, and what it stores is mostly structured data. This is not the enterprise's full data, but data integrated and extracted according to business needs. The data platform and the data middle platform aim to take in data from across the whole enterprise, including structured, semi-structured and unstructured data.

3. At the target level

A traditional data warehouse runs on a single machine. Once the data volume grows large, it is limited by single-machine capacity, computing power and performance. It is mainly used for report analysis, a fairly narrow purpose: basic data is extracted, integrated, cleaned and analyzed only for the relevant reports. Adding a new report, for example, means repeating the work from the bottom layer to the top, which is cumbersome;

The data platform was built to solve the data warehouse's inability to handle unstructured data and its long report development cycle, along with its computing and performance problems. After data is collected, integrated and cleaned, the small data sets a business team needs are extracted when required and provided to that team as data sets;

The data middle platform usually cleans the basic data in several respects, then builds a number of object-centered subject domains according to the subject domain concept. The data platform underneath is built on a distributed computing platform and storage platform, whose computing and storage capacity can in theory be expanded without limit. The goal is to integrate the data of the entire enterprise, break down the barriers between data, and eliminate inconsistent data standards and metric definitions.

4. At the application level

Data application scenarios are built on top of the data middle platform. They go beyond developing, analyzing and displaying data reports: data is turned into services and provided to business systems, such as user profiling systems and search, recommendation, advertising and marketing systems.
