Secrets of a Core Product of Alibaba's Data Middle Platform

2021/10/14 12:33:03

Editor's note: The source code of the single repository behind Quick BI, the core product of Alibaba's data middle platform, has exceeded 1 million lines and is heading toward 10 million. This article shares the thinking behind choosing Monorepo, as well as the development experience and lessons learned in practice. The content is reproduced from "Alibaba F2E".

In recent years, Alibaba's data middle platform products have developed rapidly. The core product, Quick BI, has been the only Chinese BI product selected for the Gartner Magic Quadrant for two consecutive years. The source code of the Quick BI single repository has exceeded 1 million lines, with many people and modules involved in development, and yet, thanks to the principles described below, the product has stayed in a state of rapid development.

First, a few key numbers:

Code: about 820,000 lines of TypeScript and 180,000 lines of Sass/Less/CSS styles (counted with cloc, excluding automatically generated code).

Collaboration: 12,111 code reviews and 53,026 commits.


Many people ask: with so much code, why not split the codebase? Why not hurry up and adopt a micro-frontend or serverless framework? Aren't you worried that it will become unmaintainable and development will slow to a crawl?

The reality is that, from day one, it was anticipated that the codebase would grow this large. Startup time did slip from a few seconds at the beginning to 5-10 minutes, and was then optimized back down to about 5 seconds most recently. Throughout this process, the team came to appreciate the advantages of Monorepo (a single code repository).

This practice illustrates a few points:

A big codebase can be a good thing: the great way is simple. An extremely "simple" architecture makes it easier to support complex and flexible business.

Achieving a simple architecture requires clearer internal conventions, closer collaboration, and more efficient execution.

If a problem can be solved through engineering, don't fall back on development conventions; if it can be solved through conventions, don't rely on individual discretion.

When the project kicked off, the team pooled its collective wisdom and voted on a repository name everyone was happy with. Taking advantage of the merge of the Quick BI and FBI foundations, the project got under way: the foundation code was brought in and straightened out first, and the upper-layer business code was merged in as well.

  commit 769bf68c1740631b39dca6931a19a5e1692be48d
  Date: Tue Apr 30 17:48:52 2019 +0800

      A New Era of BI Begins

Why Monorepo?


Before construction began, the team debated at length between a single repository (Monorepo) and multiple repositories (Polyrepo).

I used to be very fond of Polyrepo: an independent repo and an independent npm package for every component. For example, before 2019 there were 43 editor components just for the form category:

(screenshot: the 43 editor component repositories)

In theory this achieves perfect decoupling and maximal reuse, but in practice:

Every across-the-board upgrade of dependencies like Babel and React was hair-raising, so I built scaffolding for it. All those wheels were reinvented out of necessity; only a little of the actual work got done, but my script-writing skills improved dramatically.

Every time you debug a component, you have to npm link it. Later, with components nested across levels, it could take 3 layers of npm link, and anyone who has done this knows what a terrible experience it is.

Versions are hard to keep aligned. Every release of the main repository turns aligning the versions of all the components into a test of eyesight, and the slightest slip triggers an online fault.

As for the supposed advantage of making reuse easy for others: in the end we were stretched thin just supporting our own business, so how could we dare let others reuse it...

In the end, we merged all of these components into one repository. In fact, companies like Google, Facebook, and Microsoft strongly favor Monorepo internally.

But we are not Monorepo fundamentalists: there is no need to put unrelated products' code together. Within a solid-line team, a single product using Monorepo greatly reduces collaboration costs. Still, at the beginning the team had many questions.

Several core questions about Monorepo

**Won't a single repository be huge?**

How big is 1 million lines of code? Let's guess first: 1 GB? 10 GB? Or more?

First, estimate it with a formula:

repository size = source code size + .git size + resource files (audio, video, images, other files)

1. The size of the source code:

It is generally recommended to keep each line under 120 characters. Taking 100 characters per line, 1 million lines come to:

  100 * 1,000,000 = 100,000,000 bytes, which is about 100 MB!

How big is our repository actually?

Only 85 MB! That works out to about 85 characters per line on average.

2. Next, the size of .git:

.git records the commit history, branches, and tags of all the code. Surely that must be bulky, right?

Git actually does a lot of optimization under the hood: 1. all branches and tags are just references; 2. changes are stored incrementally; 3. objects are zlib-compressed when stored (recurring boilerplate code is stored only once, and the compression ratio for well-standardized code is extremely high).

In our experience, 10,000 commits of history in .git adds only about 1-3 times the size of the source code itself.

3. The size of resource files

Git is heavily optimized for source code, but not for resource files such as video and audio. We recently used BFG to shrink another product's repository from 22 GB to 200 MB, a 99% reduction, while keeping the commit history and branches (BFG rewrites Git commit records, so some commit IDs change).

It was 22 GB because the repository stored videos, released build artifacts, and sourcemap files, none of which belong in a source code repository.

To sum up, a repository with one million lines of code generally comes to between 200 MB and 400 MB. How much would 10 million lines take?

Multiply by ten: between 2 GB and 4 GB. That is nothing compared to node_modules, which easily runs to several GB, and it is easy to manage. As another data point, the Linux kernel has 28 million lines in a Monorepo with thousands of collaborators; it is said that Linus developed Git precisely to manage the Linux source code.

**Isn't startup very slow, say 5 or 10 minutes?**

I have heard teams say that at just over 100,000 lines, startup already takes 10+ minutes. That is the typical "monolith" project that has become hard to maintain, so they rush to split the repo or move to micro-frontends. Sometimes a team of only 3 people ends up with 5 projects, which makes collaboration painful.

We have three approaches:

Split the app into multiple entries by page, so that only one entry needs to be started at a time.

Sort out the dependencies between sub-packages and pursue thorough Tree-Shaking.

Switch from Webpack to Vite for development.

Especially after switching from Webpack to Vite, the project's cold start time went from 2-5 minutes down to about 5 seconds, and hot compilation from 5 seconds to under 1 second (basically within 500 ms on Apple M1 machines).
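As a rough illustration of the multi-entry idea, a minimal Vite configuration might look like the sketch below; the entry names and paths are hypothetical and not taken from the Quick BI codebase.

```ts
// vite.config.ts - a minimal sketch, not the team's actual configuration.
// Each page gets its own HTML entry, so a developer only loads the entry they are working on.
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  build: {
    rollupOptions: {
      input: {
        // Hypothetical page entries; paths are resolved relative to the project root.
        dashboard: 'src/pages/dashboard/index.html',
        dataset: 'src/pages/dataset/index.html',
      },
    },
  },
});
```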

**What about code reuse? Does reusing from a Monorepo mean pulling in everything?**

Traditional software engineering pursues DRY, but more DRY is not always better.

Every line of code written carries a corresponding cost: maintenance. To reduce code, we build reusable modules. But code reuse has a downside: it becomes an obstacle when you want to change things later.

For a long-lived, iteratively developed product like Quick BI, most requirements are extensions of existing features, so writing maintainable code matters most. The team therefore discourages "clever" magic, does not chase code-reuse metrics for their own sake but optimizes for ease of modification, and encourages coding styles that make modules easy to delete when they are eventually retired.

Where reuse genuinely exists, we have split things out. Inside the Monorepo we have carved out multiple packages (see the screenshot further below). For example, other products that need to embed BI building can reuse @alife/bi-designer, and Tree-Shaking keeps the dependencies they pull in to a minimum.
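As a sketch of what such reuse could look like, assuming a React host product and a hypothetical `Dashboard` export from @alife/bi-designer (the export name and props are assumptions, not the package's documented API):

```tsx
// A minimal sketch of embedding the BI builder from another product.
// Only named imports are used, so Tree-Shaking can drop the unused parts of the package.
import * as React from 'react';
import { Dashboard } from '@alife/bi-designer'; // hypothetical export name

export function EmbeddedReport({ reportId }: { reportId: string }) {
  return <Dashboard id={reportId} />;
}
```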

Current development experience

1. Cold start in 5 seconds, hot compilation within 1 second; it used to take 5-10 minutes.

2. A problem that can be fixed by changing one line of code really is fixed by changing one line and publishing once, instead of changing 10+ projects and publishing N times along the dependency chain.

3. Newcomers can set up the environment and start developing within 10 minutes. Previously, with one repo per component, just getting package permissions took a long time.

4. Version misalignment problems are avoided:

For 2C products there is no need for multiple versions and branches, yet keeping many npm dependencies on aligned versions is still not easy.

For 2B products, multiple environments and multiple versions make things far more complex; a Monorepo unifies internal dependency versions through branches.

5. Engineering upgrades only need to happen once. The setup is currently based on the Pri Monorepo solution, which is built on Lerna.

Of course, maintaining the experience described here is not easy, and there are still many problems to solve during development.

The problems that really need solving

Putting the code together is not the end of the story; the hard problems behind it are collaboration, technical solutions, and stability. For example, how do we prevent one person's commit from bringing down the entire product?

**Package dependency management**

Internally we split out multiple sub-packages; each sub-package is a sub-directory and can be published to npm independently, as shown in the figure below:

(figure: the sub-package structure inside the Monorepo)

The core principles of internal package management are:

Dependencies are one-way, from left to right: the right may only reference the left, which avoids circular dependencies.

A written convention alone is not enough, so we developed a plugin that detects violations automatically: if a left-side package depends on a right-side one, the build errors out directly (see the sketch after this list).

Open source npm packages should be introduced cautiously. The maintenance lifetime of most npm packages does not exceed x years; even a once-standard utility library like Moment.js has ended maintenance, and perhaps 20% of npm packages are unmaintained. Yet once your online users hit a problem, you end up digging through the source code yourself and being put on the back foot. So our principle is that introducing an open source npm package requires an offline review by three people.
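To illustrate the one-way dependency rule mentioned above, here is a minimal standalone sketch; the package names, their order, and the directory layout are hypothetical, and the team's actual check is implemented as a build plugin rather than a script like this.

```ts
// scripts/check-deps.ts - a minimal sketch of enforcing left-to-right, one-way dependencies.
import * as fs from 'node:fs';
import * as path from 'node:path';

// Packages listed from "left" (lower level) to "right" (higher level); a package may only
// depend on packages that appear before it in this list. Names here are hypothetical.
const order = ['@alife/bi-sdk', '@alife/bi-charts', '@alife/bi-designer', '@alife/bi-app'];

let failed = false;
for (const [index, name] of order.entries()) {
  const pkgJson = path.join('packages', name.split('/')[1], 'package.json');
  if (!fs.existsSync(pkgJson)) continue;
  const pkg = JSON.parse(fs.readFileSync(pkgJson, 'utf8'));
  const deps = Object.keys({ ...pkg.dependencies, ...pkg.devDependencies });
  for (const dep of deps) {
    // Depending on a package further to the "right" breaks the one-way rule, so report it.
    if (order.indexOf(dep) > index) {
      console.error(`${name} must not depend on ${dep}`);
      failed = true;
    }
  }
}
process.exit(failed ? 1 : 0);
```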

**Code Review Culture**

Code Review can help newcomers grow quickly, and it is also a way to build a team’s technical culture.

The team has enforced 100% CR for the past few years, but that alone is not enough: mechanical execution easily turns CR into a formality, so it needs to be practiced differently in different scenarios.

A Monorepo carries the risk that when something goes wrong, it can go wrong for the whole product.

Currently our Code Review is mainly divided into 3 scenarios:

Online MR Code Review [1 to 1]

Thematic Code Review [3-5 people]

Collective Code Review before a major release [everyone]

From 12,111 code reviews we have learned a lot, mainly:

Review promptly; encourage small-grained MRs rather than waiting until an entire feature is finished.

Code is written for people to read; encourage code that reads like plain vernacular rather than classical Chinese.

Establish best practices (directory structure, naming conventions, data-flow specification). There may be 10 ways to implement a feature, but the team needs to pick one and promote it. If something can be done with simple technology, do not reach for obscure "advanced" techniques.

Emphasize a habit of code cleanliness and pursue a culture of elegant code (are names easy to understand, are comments complete, are there performance hazards, and so on).

**Engineering construction**

First of all, we must thank the Taobao front-end DEF engineering team for their support. With this much code we keep pushing the limits, and DEF keeps being upgraded to support us.

Beyond written documentation, a good convention is one that automated tools can check.

Checkers: ESLint, TS type checking, Prettier.

Syntax checkers are an important way to make conventions stick. ESLint supports incremental checking, so the git commit pre-hooks stay fast after optimization. TS type checking is slower because it does not support this kind of increment, so it has to run in CI/CD instead.
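To make the incremental pre-commit check concrete, here is a minimal sketch that lints only the staged files; the script name and wiring are hypothetical, and off-the-shelf tools such as lint-staged serve the same purpose.

```ts
// scripts/lint-staged.ts - a minimal sketch of an incremental pre-commit check.
import { execSync } from 'node:child_process';

// Only lint the staged TypeScript files so the pre-commit hook stays fast.
const staged = execSync('git diff --cached --name-only --diff-filter=ACM', { encoding: 'utf8' })
  .split('\n')
  .filter((file) => /\.(ts|tsx)$/.test(file));

if (staged.length > 0) {
  // ESLint runs only on the changed files; the full `tsc --noEmit` type check is left to
  // CI/CD because it cannot be limited to an increment in the same way.
  execSync(`npx eslint ${staged.join(' ')}`, { stdio: 'inherit' });
}
```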

Webpack vs Vite

Releases use Webpack; development uses Vite.

The development environment uses Vite for fast debugging, while the production environment still uses Webpack for bundling.

The risk is that the development and production build outputs differ; this is mitigated by regression testing before going online.

**Performance optimization**

For a data product, the performance challenges come from the larger resource bundles after moving to a Monorepo, as well as the large volumes of data involved in rendering calculations.

Performance optimization breaks down into three stages:

Resource loading: refined Tree-Shaking, where the difficulty lies in the refinement. Webpack's own Tree-Shaking is limited (it cannot tree-shake class methods), so the code sometimes has to be adjusted. Lazy-load modules on demand, especially large components such as charts and the SQL editor (a sketch follows this list). Preload interfaces sensibly so the network is never idle.

View rendering: minimize the number of component re-renders, optimize virtual scrolling in table components, and preload and prerender while idle.

Network requests: a local resource caching scheme; on mobile, a PWA caches JS and other resource files, as well as data, locally.
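As an illustration of the lazy-loading point above, here is a minimal sketch assuming a React setup; the component names and paths are hypothetical.

```tsx
// A minimal sketch of lazy-loading heavy modules on demand.
import * as React from 'react';
import { Suspense, lazy } from 'react';

// The chart panel and SQL editor bundles are only fetched when they are actually rendered.
const ChartPanel = lazy(() => import('./ChartPanel'));
const SqlEditor = lazy(() => import('./SqlEditor'));

export function AnalysisPage({ showEditor }: { showEditor: boolean }) {
  return (
    <Suspense fallback={<div>Loading...</div>}>
      <ChartPanel />
      {showEditor && <SqlEditor />}
    </Suspense>
  );
}
```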

There are also performance testing tools to locate bottlenecks. We plan to add a performance gate: if the bundle size grows, a reminder is sent before the code is submitted (a rough sketch is below).
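As a rough sketch of such a size gate, assuming a single baseline file and build output (both hypothetical, as is the threshold):

```ts
// scripts/bundle-gate.ts - a minimal sketch of a bundle-size reminder before submission.
import * as fs from 'node:fs';

const BASELINE_FILE = 'bundle-baseline.json'; // hypothetical recorded baseline
const BUILD_FILE = 'dist/index.js';           // hypothetical build output
const ALLOWED_GROWTH = 0.05;                  // remind the author if the bundle grows by more than 5%

const current = fs.statSync(BUILD_FILE).size;
const baseline: number = JSON.parse(fs.readFileSync(BASELINE_FILE, 'utf8')).size;

if (current > baseline * (1 + ALLOWED_GROWTH)) {
  console.warn(`Bundle grew from ${baseline} to ${current} bytes - please double-check before submitting.`);
  process.exitCode = 1;
}
```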

**Data-driven architecture optimization**

Working on the data middle platform, I believe in the business value of data. Yet for development itself, data is rarely used in depth.

So in S1 we focused on digitizing the development experience, collecting and analyzing everyone's development environment and startup-time data (no other data is collected, to avoid fostering internal competition). This surfaced a lot of interesting things. For example, one colleague's hot compilation took 3 to 5 minutes; he assumed everyone else was just as slow, and it seriously hurt his productivity. Once the report flagged his data as abnormal, the issue was fixed within ten minutes.

Another example: to keep the online build artifacts consistent, we pushed the team to unify the Node.js version. This used to rely on repeated DingTalk reminders, with no way to know how effective they were; with the report, it is clear at a glance.

(screenshot: the Node.js version distribution report)
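As a rough illustration of the kind of data collection described above, here is a minimal sketch of reporting development-environment and startup-time metrics; the reporting endpoint and payload fields are hypothetical.

```ts
// A minimal sketch of reporting local development metrics.
import * as os from 'node:os';

export async function reportDevStartup(startupMs: number): Promise<void> {
  // Only environment info and startup time are collected.
  const payload = {
    user: os.userInfo().username,
    nodeVersion: process.version,
    platform: `${os.platform()} ${os.arch()}`,
    startupMs,
  };
  // Requires Node.js 18+ for the built-in fetch; the endpoint is a placeholder.
  await fetch('https://internal.example.com/dev-metrics', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
}
```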

At present the whole data pipeline is up and running and we have tasted its first benefits; there are many more interesting analyses to do in the future.

Deeper lessons

**The most efficient way is to do it well the first time**

Every line of code leaves a cost behind. In the long run, the most efficient way is to do it well the first time.

Su Shimin (Stephen Schwarzman) said, "Doing big things and doing small things are equally hard, and both consume your time and energy." In that case, you might as well write the code properly the first time. A "TODO" left in the code may stay TO DO forever. Objectively, doing it well the first time is harder: everyone's standard of "good" differs, and behind it lie personal technical ability, the pursuit of quality, and understanding of the business.

**Organization, culture, and technology complement each other**

Technical architecture has a lot to do with organizational structure; what matters more is choosing an architecture that fits the organization.

If an organization is fragmented, using Monorepo carries a high coordination cost. But if the organization is cohesive, Monorepo can greatly improve efficiency.

Engineering and foundation building are a team affair; they are hard for individuals to push through alone.

In the short term you can get by with concentrated pushes, but in the long term you need to build a culture that keeps iterating.

High organizational communication costs should be solved at the organizational level; technology alone has limited leverage there. What technology can do is play to the strengths of tooling and make change happen quickly.

**Simplicity comes not before complexity, but after it**

Given a simple architecture, someone will always find a way to make it complicated. After stepping into the pit, the team resolves to rebuild: success returns it to simplicity, failure means being displaced by a newer, simpler model. Stepping into the pit has value in itself, otherwise newcomers will just step into it again. Being complicated is easy; staying simple requires foresight and restraint. Without having gone through that tempering, someone else's antidote may be poison to you.

Architecture cannot stay immutable either. Our charts started out using D3 and ECharts directly, which was very simple; later so much customization piled up that it gradually became too complex to maintain, so we built our own bi-charts on top of G2 and the architecture became simple again. The development experience may look the same, but the technology behind it has completely changed.

Summary and outlook

A million lines of code is nothing to fear; it is a normal milestone, and development can stay as agile as it was at tens of thousands of lines.

Now Quick BI is heading toward tens of millions of lines and striding toward the goal of world-class BI. The content above is mostly about engineering, whose purpose is to let developers focus more on the business. There are many business challenges not covered here: data analysis inherently deals with massive data, so performance optimization is a long-term practice; gaining insight into rich and varied data has led to deep accumulation in visualization and complex tables, and visualization is not just technology but the business itself; and display across phones, tablets, TVs, and other devices brings cross-device adaptation challenges.

In the future, we also hope to turn data analysis into an engine that can be quickly integrated into office and business processes.

The current development model is not perfect; technical debt inevitably accumulates during iteration, and the essence of architecture optimization is to preserve maintainability and reduce that debt. The team is currently planning to introduce Redux Toolkit, which will be a major upgrade to data access and data flow; we will share more if there is progress. (End)
