Reprinted from ITPUB | Author: Ren Chaoyang
When you shop on a livestream, the system recommends products you are likely to be interested in, in real time. When a news event breaks, the hot-search rankings on Baidu and Weibo update in real time. When you may be the target of online fraud, you immediately receive a warning call... None of these scenarios is unfamiliar, and behind each of them there may be a real-time data warehouse providing support.
In recent years, real-time analysis scenarios have multiplied, and the concept of the real-time data warehouse has become very popular and drawn market attention. IT168 & ITPUB have planned a series of features on real-time data warehouses to discuss new technologies, new trends, and new applications with industry experts. This article is one of them. The interviewee is Brother Data, who runs the public account [Data Society]. He is a big data veteran who focuses on MPP databases, stream computing, data warehouse architecture, and data analysis.
Is the real-time data warehouse a product or a solution?
Everyone is familiar with the data warehouse. In "Building the Data Warehouse", published in 1991, Bill Inmon, the father of the data warehouse, first proposed the concept: a data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions.
As for the currently popular real-time data warehouse, the market has not yet reached a consensus and there is no unified definition. Brother Data believes that real-time data warehouses and traditional data warehouses are both data warehouses; they simply support different business scenarios as the business changes. Although the term "real-time data warehouse" has only recently come into wide use, the idea appeared very early and has gone through several important stages of development.
In the early days, enterprise data volumes were not especially large and the demand for real-time analysis was modest. Statistical analysis could be run directly on business databases, such as relational databases like Oracle, or on MPP databases, and that was enough to meet real-time analysis needs.
In the big data era, data volumes exploded and big data technologies emerged. Enterprises used the Storm stream computing framework to support simple real-time computations and queries such as real-time hot-topic rankings, but Storm could not support complex computations well.
In recent years, unified stream-batch computing engines such as Spark and Flink have appeared, and many database vendors now claim to be building real-time data warehouses.
In the past, because business users had no urgent need for real-time analysis and the technology was limited, companies ran offline batch jobs on Hive and other OLAP databases. Business analysis could only be done on a T+1 basis, meaning the previous day's data was not analyzed and displayed until the next day. This is still the case in many business scenarios today. Driven by real-time business needs, growing volumes of real-time data, and the continuous development of real-time computing technology, real-time stream computing engines such as Storm and Flink have gradually matured. Real-time computing architecture has evolved from the original Lambda architecture, in which stream and batch processing are separate, to the Kappa architecture, in which they are unified, and new architectures continue to emerge.
"In the future, the real-time data warehouse may be more of a solution. Different industries and different business scenarios will make different choices." Brother Data said that offline data warehouses and real-time data warehouses are both data warehouses. Offline analysis generally processes large volumes of data in batches, while real-time analysis generally picks a small amount of data out of a large volume for processing. Today, various database vendors, including some open source OLAP vendors, say that they can build real-time data warehouses, and each has its own advantages in different business scenarios.
The real-time data warehouses currently on the market are more often a combined "data warehouse + stream computing engine" solution than a standalone data warehouse product. For example, Alibaba Cloud offers a Hologres + Flink real-time data warehouse solution; Transwarp (Xinghuan Technology) offers one built on ArgoDB and its real-time stream computing engine Transwarp Slipstream; and Oushu Technology combines OushuDB and Lava into a real-time lakehouse solution. Some database vendors are also reportedly trying to build stream processing into the database itself to provide real-time processing capabilities.
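To make the Lambda and Kappa architectures mentioned above more concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not any vendor's implementation: the `events` list, the `count_by_key`, `lambda_query`, and `kappa_query` functions, and the `batch_cutoff_day` parameter are all hypothetical names invented for this example.

```python
from collections import Counter

# Hypothetical click events: (day, product). Day 1 is already covered by the
# nightly batch job; day 2 has so far only been seen by the streaming path.
events = [(1, "phone"), (1, "laptop"), (1, "phone"), (2, "phone"), (2, "tablet")]

def count_by_key(records):
    """Aggregation logic: number of events per product."""
    return Counter(product for _, product in records)

def lambda_query(events, batch_cutoff_day):
    """Lambda: a batch view (T+1) merged with a speed view at query time."""
    batch_view = count_by_key(e for e in events if e[0] <= batch_cutoff_day)
    speed_view = count_by_key(e for e in events if e[0] > batch_cutoff_day)
    return batch_view + speed_view

def kappa_query(events):
    """Kappa: one streaming path; history is handled by replaying the log."""
    view = Counter()
    for _, product in events:  # the same code serves live data and replays
        view[product] += 1
    return view

lambda_result = lambda_query(events, batch_cutoff_day=1)
kappa_result = kappa_query(events)
print(lambda_result == kappa_result)
```

In this sketch the batch and speed paths share one aggregation function, so the two results agree. In a real Lambda deployment the two paths are often separate code bases, which is one source of the offline/real-time inconsistencies that a unified path is meant to avoid.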
Business needs and technology develop together in a spiral. The development of real-time data warehouses is likewise driven by real-time business needs. So what are the application scenarios of real-time data warehouses today? In which industries are they being adopted fastest?
What are the application scenarios of real-time data warehouses?
Brother Data explained that real-time data warehouses have several typical application scenarios. The first is real-time top-N rankings and hot-word displays, as seen in Baidu hot searches and Weibo hot words. The second is real-time monitoring and alerting, for example in the Internet of Things, and especially in today's popular new energy vehicles, where unstable batteries call for early warnings based on battery usage. The third is real-time recommendation, which is relatively common: for example, recommendations in the currently popular e-commerce livestreams, or the recommendation ads that may appear in WeChat Moments after you click on certain products on a shopping platform. The fourth is financial anti-fraud: the country has vigorously promoted online fraud prevention over the past two years, and real-time anti-fraud warnings at banks are an important application scenario for real-time data warehouses.
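The hot-word ranking scenario can be sketched as a top-N count over a sliding time window. This is a minimal in-memory illustration, not how Baidu or Weibo actually compute their rankings; the `HotWordRanking` class, its methods, and the sample search terms are hypothetical.

```python
from collections import Counter, deque

class HotWordRanking:
    """Top-N search terms over a sliding time window (toy, in-memory)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, word), oldest first
        self.counts = Counter()

    def add(self, timestamp, word):
        """Record one search event and expire events outside the window."""
        self.events.append((timestamp, word))
        self.counts[word] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_word = self.events.popleft()
            self.counts[old_word] -= 1
            if self.counts[old_word] == 0:
                del self.counts[old_word]

    def top(self, n):
        """Return the n most frequent words currently in the window."""
        return self.counts.most_common(n)

# Hypothetical search events: (timestamp in seconds, search term)
ranking = HotWordRanking(window_seconds=60)
for ts, word in [(0, "flood"), (10, "election"), (20, "election"),
                 (30, "concert"), (70, "election"), (75, "election")]:
    ranking.add(ts, word)

top_two = ranking.top(2)
print(top_two)
```

A production system would distribute this state across many machines and usually approximate the counts, but the core idea is the same: the ranking is updated as each event arrives instead of being recomputed the next day.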
Take the popular e-commerce livestream as an example. At this year's Volcano Engine Motive Power Conference, Yang Zhenyuan, vice president of ByteDance, said that Douyin e-commerce has many real-time demand scenarios and a very high frequency of business activities, so data support must be delivered in real time even as demands keep bursting in. The Volcano Engine real-time data warehouse provides Douyin e-commerce with a full set of real-time data capabilities: real-time dashboards, real-time analysis, real-time alerts, and real-time marketing.
"The real-time data warehouse is very popular, but there may not be that many application scenarios." Brother Data believes that real-time data warehouses as a whole are still at an early stage of development. Even some medium and large enterprises do not have many real-time business scenarios. Some companies have no dedicated real-time data warehouse team, or only a very small one, with dozens or even hundreds of people working on the offline data warehouse and only a few on the real-time one. As for small and medium-sized enterprises, their data volumes are not that large, so they can run real-time statistical analysis on relational or MPP databases without complex computation. They may not need real-time computing engines such as Flink, or the real-time computing frameworks promoted by some major vendors.
According to Brother Data, the adoption of real-time data warehouses is also uneven across industries. Overall, real-time data warehouses are developing fastest in the Internet industry, which holds the lead. One reason is technical depth: Internet companies have a large number of relevant technical staff. The other is organizational structure: in traditional industries, technology selection requires approval at every level, whereas Internet companies are flatter and more flexible. However, many Internet companies currently building real-time data warehouses are doing technical pre-research or experimental work, and the results may not be applied to business scenarios immediately.
Another front-runner in real-time data warehouse adoption is the financial industry. Because of regulatory and other requirements, real-time analysis is a hard requirement there, so finance is ahead in applying it to real business scenarios. Another area with strict real-time requirements is real-time data collection from new energy electric vehicles. Beyond the companies' own needs, national regulations also require vehicle data to be monitored in real time.
In most traditional enterprises, the demand for real-time analysis is not yet that obvious. These companies rely mainly on offline data warehouses, much as in traditional BI. They are not even in a hurry to see the previous day's data; they only need to analyze the past year's data to predict next year's trend and support company decisions.
Real-time data warehouse selection and implementation
Brother Data explained that enterprises pay attention to the following factors when selecting a real-time data warehouse. First, data synchronization and real-time write capability, so that source data can be synchronized. Second, support for complex business logic and complex events: Storm, for example, could do real-time analysis in the past but could not support complex computations well, which is why Spark and Flink are now used for real-time processing. Third, "exactly-once" semantics for real-time computation: each record must affect the result exactly once, because a record that is processed more than once is counted repeatedly. Unlike batch computation, real-time computation therefore needs to record state for each operation. Fourth, low operation and maintenance costs. Fifth, stability, since the stability of the business must be ensured.
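The "exactly-once" requirement and the need to record state can be illustrated with a small Python sketch. It swaps in a deliberately simple technique, deduplication by event ID, to show the effect; engines such as Flink achieve exactly-once results through checkpointed state and transactional output instead. The `ExactlyOnceTotal` class, the `naive_total` function, and the sample order events are hypothetical.

```python
def naive_total(deliveries):
    """Adds every delivery; redelivered events are counted twice."""
    return sum(amount for _, amount in deliveries)

class ExactlyOnceTotal:
    """Keeps state (seen event IDs + running total) so a retry is a no-op."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def process(self, event_id, amount):
        if event_id in self.seen_ids:   # duplicate from a retry: skip
            return
        self.seen_ids.add(event_id)
        self.total += amount

# Hypothetical order events (event_id, amount). Event "o2" is delivered twice,
# as can happen when a source retries after a failure.
deliveries = [("o1", 100), ("o2", 250), ("o2", 250), ("o3", 50)]

counter = ExactlyOnceTotal()
for event_id, amount in deliveries:
    counter.process(event_id, amount)

print(naive_total(deliveries), counter.total)
```

The naive sum counts the retried order twice, while the stateful version counts it once. In a real system the recorded state must also survive failures, which is why it is persisted together with the results.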
However, Brother Data has found that many companies currently build their real-time data warehouses in-house on open source components instead of purchasing third-party products or solutions. In-house development lets them respond more flexibly to their own business needs. It does not mean innovating entirely from scratch, though: enterprises learn from other vendors' mature implementations, combine them with their own application scenarios, tailor them to the enterprise, and build a data display platform that fits. Especially in the past two years, under the impact of the epidemic and the external environment, many companies have been cutting costs and improving efficiency, and have become increasingly cautious about IT investments such as R&D.
In addition, how an enterprise implements a real-time data warehouse depends heavily on its existing technology stack. If the enterprise lacks the relevant technical expertise, introducing a new technical system will be costly. For example, Brother Data's company originally used Spark for batch processing, and later also used Spark's unified stream-batch processing for real-time analysis, without introducing a new real-time computing engine such as Flink.
It should be pointed out that although Flink and Spark are both unified stream-batch computing engines, they process real-time data differently. Flink, like Storm before it, is event-driven, processing data around the clock like water running from a faucet; some people compare it to an escalator. Spark uses time-driven "micro-batch" processing, which is more like an elevator: it processes a portion of the data at fixed intervals, so it is suited only to stream processing workloads without strict latency requirements. That said, Spark can reportedly reach sub-second latency, which is enough for many real-time business scenarios.
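The escalator-versus-elevator difference can be sketched in plain Python. This is a conceptual illustration only and does not use the Flink or Spark APIs; the two functions, the `batch_interval` parameter, and the sample events are hypothetical.

```python
def process_per_event(events):
    """Event-driven: emit an updated result as each event arrives."""
    total = 0
    outputs = []
    for arrival_time, value in events:
        total += value
        outputs.append((arrival_time, total))   # result visible immediately
    return outputs

def process_micro_batch(events, batch_interval):
    """Time-driven: collect events per interval, emit once when it closes."""
    total = 0
    outputs = []
    batches = {}
    for arrival_time, value in events:
        # Each event is assigned to the interval that ends after it arrives.
        batch_end = (arrival_time // batch_interval + 1) * batch_interval
        batches.setdefault(batch_end, []).append(value)
    for batch_end in sorted(batches):
        total += sum(batches[batch_end])
        outputs.append((batch_end, total))      # result visible at batch end
    return outputs

# Hypothetical events: (arrival time in ms, value)
events = [(100, 1), (450, 2), (900, 3), (1200, 4)]
per_event = process_per_event(events)
micro_batch = process_micro_batch(events, batch_interval=1000)
print(per_event)
print(micro_batch)
```

Both approaches reach the same final total. The difference is when results become visible: the per-event version updates at 100 ms, 450 ms, and so on, while the micro-batch version only reports at the 1000 ms and 2000 ms boundaries. Shortening the interval reduces that delay, which is how micro-batch engines approach low latency.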
As real-time data generates more and more value, the application of real-time data warehouses will be more extensive and in-depth in the future. Enterprises need to choose appropriate solutions based on their own development needs.
Brother Data believes that real-time data warehouses will develop along the following lines. First, moving to the cloud will be an important trend, and the public cloud may offer more cost advantages. Second, the technology stack will be unified across real-time and offline processing: for example, enterprises that originally used Spark for offline computing may also use Spark for real-time computing in the future. Third, data entry and exit points will be unified to avoid inconsistencies between offline and real-time statistical results.
For real-time data warehouses to be adopted faster, they need not only stronger technical capabilities and better ease of use but also a more complete technical ecosystem. "For a technology to be promoted, applied, and developed, the ecosystem is very important."
