Author | Zheng Liyuan
Produced by | CSDN (ID: CSDNnews)
As one of the most talked-about technological breakthroughs of recent years, AI has gradually found its way into every field. On one side there are AI tools that write novels, screenplays, and illustrations; on the other there is GitHub Copilot, an AI code generation tool that helps write code and frees up programmers' hands.
Amid this craze, Huawei recently unveiled its latest code generation model, HUAWEI PanGu-Coder. According to reports, HUAWEI PanGu-Coder was jointly developed by the Speech and Semantics Lab of Huawei's Noah's Ark Laboratory and the Huawei Cloud PaaS Technology Innovation Laboratory. It is not only familiar with common algorithms but also wields various APIs skillfully, and can even solve advanced mathematics problems. On the one-pass code generation rate (PASS@1), HUAWEI PanGu-Coder significantly surpasses models of the same parameter scale. To better serve Chinese-speaking developers, it also performs excellently on Chinese input.
At present, HUAWEI PanGu-Coder is still in internal testing, iterating and evolving, and has not been officially opened to the public. However, through an interview with Wang Qianxiang, director of the Huawei Cloud PaaS Technology Innovation Laboratory, we can get an early glimpse of HUAWEI PanGu-Coder and learn more about it and its story from scratch.
HUAWEI PanGu-Coder Prequel
There is a saying that has circulated in the technology world in recent years: software is devouring the world. The digital transformation it describes means a great deal of coding work, behind which lies the hard work of countless programmers. But as digital transformation becomes ever more widespread, programmers are carrying heavy burdens. Wang Qianxiang pointed out: "Compared with the demand for code across industries, the gap left by programmers' current output is still far too large."
In response to this pain point, AI code generation models are becoming a powerful tool for breaking the deadlock and raising code output.
In fact, Wang Qianxiang explored this technology back when he was at Peking University, and often returned to it after joining Huawei, though he did not invest much in it at the time. He also briefly discussed the topic with Huawei founder Ren Zhengfei in 2018: "I said at the time that AI programming was still far from practical, but Mr. Ren didn't think so."
So when Microsoft released Copilot in June last year, Wang Qianxiang was shocked to find that Codex, the model behind it, had made considerable progress in answering OJ questions (Online Judge, an online programming practice system), with an accuracy rate above 70%. It also firmed up his determination to build a domestic code generation model: "Microsoft's Copilot integrates the capabilities of Codex and demonstrates the power of code generation models well. Huawei has long focused on improving software development capabilities and hopes to use intelligent technology to raise software development productivity, so naturally we needed to carry out R&D in this area."
While Wang Qianxiang's Huawei Cloud PaaS Technology Innovation Laboratory was watching Codex, the Speech and Semantics Lab of Huawei's Noah's Ark Laboratory was also studying the matter. Two things pushed the two sides to decide to step up R&D jointly: first, in September last year the CEO of OpenAI announced that GPT-4 would pay more attention to code; second, in October last year GitHub announced that 30% of its internal teams' new code was completed with the help of Copilot, and that Copilot's user retention rate exceeded 50%.
On this basis, last November the Huawei Cloud PaaS Technology Innovation Laboratory and the Speech and Semantics Lab of Huawei's Noah's Ark Laboratory set up a joint working group to create a domestic code generation model, and construction officially began in December: the HUAWEI PanGu-Coder project was under way.
Considering that the only famous models in this field so far are OpenAI's Codex and DeepMind's AlphaCode, it was clearly not going to be easy to develop a mature AI code generation model. Wang Qianxiang admitted that throughout the development of HUAWEI PanGu-Coder, the difficulties they faced came not only from objective resource constraints but also from subjective doubts.
The first challenge: computing resources
For example, the AlphaCode paper published in February showed that when AlphaCode took part in programming competitions, more than 7,500 TPU cards were invested per question. Training large models demands enormous computing resources; this is an industry consensus, and it is why only a few large enterprises can explore this area. When training HUAWEI PanGu-Coder, the development team coordinated Huawei's internal full-stack AI software and hardware ecosystem to solve the problem of insufficient computing resources.
The second challenge: doubts from all sides
The doubts came from many directions. First, from inside the company: many colleagues, including senior experts, believed programmers were unlikely to accept such code. Second, from outside: shortly after Copilot was launched, Wang Qianxiang contacted several students working at Microsoft in the United States and found that they had not even heard of the technology. This prompted him to think carefully about the application scenarios for AI-generated code.
Training the 300-million-parameter HUAWEI PanGu-Coder model to the optimum
After 8 months and many obstacles overcome, HUAWEI PanGu-Coder finally came out at the end of July this year.
Because the autoregressive Transformer architecture adopted by PanGu-Alpha has strong text generation capabilities, HUAWEI PanGu-Coder uses the same architecture for its code generation task. The architecture is shown in the figure below:
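Autoregressive generation means the model predicts one token at a time, feeding everything generated so far back in as context. The sketch below illustrates that feedback loop with a toy lookup-table "model"; the table contents and function names are purely illustrative, not part of PanGu-Coder.

```python
# Illustrative sketch of autoregressive (left-to-right) decoding, the
# generation scheme used by decoder-only Transformer models.
# `toy_next_token` is a hard-coded stand-in for the real network.

def toy_next_token(context):
    """Stand-in for a language model: maps a context to the next token."""
    table = {
        ("def",): "add",
        ("def", "add"): "(a, b):",
        ("def", "add", "(a, b):"): "return a + b",
        ("def", "add", "(a, b):", "return a + b"): "<eos>",
    }
    return table.get(tuple(context), "<eos>")

def generate(prompt_tokens, next_token_fn, max_len=16, eos="<eos>"):
    """Autoregressive loop: each step feeds all previous tokens back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        nxt = next_token_fn(tokens)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

print(generate(["def"], toy_next_token))
# → ['def', 'add', '(a, b):', 'return a + b']
```

A real model would sample the next token from a learned probability distribution rather than a lookup table, but the generation loop is the same.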
At the same time, HUAWEI PanGu-Coder reuses PanGu-Alpha's Chinese-English multilingual vocabulary, so it supports both Chinese and English input, and its performance in Chinese is also excellent.
"This actually exceeded expectations, because we did not deliberately include Chinese when collecting and processing the training data." The HUAWEI PanGu-Coder team's in-depth analysis of this phenomenon suggests that the pre-trained model's excellent cross-language transfer capability, together with the large total training volume (more than 200 billion tokens), is what enables HUAWEI PanGu-Coder to support Chinese descriptions so well.
At present, the Huawei team is training HUAWEI PanGu-Coder models at multiple scales, including 300 million parameters, 2.6 billion parameters, and even larger. But Wang Qianxiang revealed that at this stage the team is more concerned with how to train the 300-million-parameter model to the optimum.
"At this stage, many models with large parameter counts have not been fully trained, and larger parameter counts also mean higher inference costs and longer response times. So when computing-power budgets are limited there is an optimal size; bigger is not always better."
It turns out this idea was correct. The one-pass generation rate (PASS@1) is the most important capability metric for a code generation model, and on this measure HUAWEI PanGu-Coder, with its data set construction strategy and staged training design, achieves far higher accuracy at the 300-million scale than other public models: the 300-million-parameter HUAWEI PanGu-Coder (PASS@1 = 17.07%) surpasses the results of a Codex model with close to 700 million parameters (PASS@1 = 16.22%) and is basically on par with Google's 1-billion-parameter model.
The HUAWEI PanGu-Coder model has already been integrated into Huawei Cloud's code development assistance tool: in an IDE plug-in, you can generate function-level Python code from a natural-language description, or complete code from its context. Notably, this IDE plug-in built on the HUAWEI PanGu-Coder kernel leaves ample room for pre- and post-processing. To generate code that is as reliable and usable as possible, the plug-in incorporates Huawei's accumulated work on trustworthy code in recent years, plus post-processing, to ensure the quality of the code delivered to programmers.
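For readers unfamiliar with the metric: PASS@1 is the probability that a single generated solution passes all tests for a problem. The article does not say how Huawei computed its numbers; the sketch below shows the standard unbiased pass@k estimator introduced in OpenAI's Codex paper, which is the usual way such figures are reported.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (from OpenAI's Codex paper):
    given n sampled solutions per problem, of which c pass the tests,
    estimate the probability that at least one of k samples passes:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer failing samples than k: some passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is just the pass/fail outcome:
print(pass_at_k(n=1, c=1, k=1))            # 1.0
# With 10 samples of which 3 pass, the pass@1 estimate is 3/10:
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```

Averaging this estimate over all benchmark problems gives the headline PASS@1 percentage.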
Thanks to these measures, HUAWEI PanGu-Coder has performed well in internal testing: it is familiar with common data structures and algorithms, can write SQL query functions, can use machine learning tools to build a text classifier, and can even solve advanced math problems.
The following examples illustrate two of HUAWEI PanGu-Coder's actual performances in internal testing:
Asking HUAWEI PanGu-Coder to write a SQL query statement:
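The internal-test screenshot is not reproduced here. As a hypothetical stand-in (the table, column names, and function name below are invented for illustration), this is the kind of output such a tool produces: a natural-language description followed by a Python function wrapping the SQL query.

```python
import sqlite3

# Prompt (natural language): "Query the names of all employees in the
# 'sales' department, ordered by salary from high to low."
def sales_employees_by_salary(conn):
    cur = conn.execute(
        "SELECT name FROM employees "
        "WHERE department = 'sales' "
        "ORDER BY salary DESC"
    )
    return [row[0] for row in cur.fetchall()]

# Minimal check against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "sales", 90.0), ("Bob", "sales", 120.0), ("Eve", "hr", 100.0)],
)
print(sales_employees_by_salary(conn))  # → ['Bob', 'Ann']
```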
Asking HUAWEI PanGu-Coder to find a derivative:
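The screenshot of this test is likewise not reproduced. As a hypothetical illustration of the kind of math-flavored code a prompt like "differentiate a polynomial" might yield, here is a small pure-Python sketch (the function name and representation are invented, not PanGu-Coder's actual output):

```python
def poly_derivative(coeffs):
    """Differentiate a polynomial given as coefficients [c0, c1, c2, ...]
    representing c0 + c1*x + c2*x^2 + ...; returns the derivative's
    coefficients using the power rule d/dx(c*x^i) = i*c*x^(i-1)."""
    return [i * c for i, c in enumerate(coeffs)][1:]

# d/dx (5 + 2x + x^3) = 2 + 3x^2
print(poly_derivative([5, 2, 0, 1]))  # → [2, 0, 3]
```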
To bring HUAWEI PanGu-Coder further in line with real programming scenarios, rather than the programming-competition scenarios covered in current papers, the development team is still working hard to improve its code generation capabilities, and plans to release an IDE plug-in with code generation to the public in the future.
"Tools are human assistants, not killers"
However, with the emergence of more and more AI code generation models such as Codex and HUAWEI PanGu-Coder, controversy and discussion about them in developer circles has grown increasingly heated. Wang Qianxiang shared his own views on this.
CSDN: From an industry perspective, since the launch of Microsoft's AI programming tool GitHub Copilot, many people have worried about the copyright of generated code. Does HUAWEI PanGu-Coder face this trouble in code generation?
Wang Qianxiang: We have noticed the copyright concerns some individuals and groups have raised about code generation. In my opinion, first, knowledge sharing is an important driver of social progress; second, sharing should respect the original creators. In academic research we need to build on peers' progress and list references in our articles. Open source is knowledge sharing for a new era, and many different open source licenses have been derived from it.
Current AI code generation technology uses machine learning and a large amount of open source code to train a model, and then uses that model to turn a piece of natural language into code. The process is like a programmer who has read a great deal of open source code and built up a certain ability: when encountering a similar problem, he draws on that reading to write similar new code. As long as the generated code is not a simple copy of the original code, I don't think it rises to the level of a copyright issue.
Of course, copyright is not a simple technical question, and consensus is still lacking. New open source licenses keep emerging, which both promotes innovation and protects originality.
CSDN: How do you view the claim that "the popularity of code generation tools will gradually replace human programmers"?
Wang Qianxiang: This kind of remark is similar to the statement that "the popularity of machines will gradually replace workers" that appeared in the 19th century. This worry is unnecessary.
In fact, you can see it in the very name of Microsoft's Copilot: it is the programmer's co-pilot, the programmer's smart assistant. Code generation tools have their applicable scenario, namely repetitive low-level coding. Software development is a creative intellectual activity; isn't it great to let tools do the repetitive labor, saving programmers time to invest in higher-value innovation? Tools are human assistants, not killers.
Of course, at the individual level, programmers need to keep improving their abilities rather than competing with tools over low-level repetitive labor. It is also important to stress that code generation tools' capabilities have boundaries; don't expect too much of them, so as to avoid unrealistic expectations.
CSDN: What impact will the arrival of HUAWEI PanGu-Coder have on developers? What advice would you give developers on using it?
Wang Qianxiang: Newer developers may be affected more, because newcomers are often more willing to engage with new technology. If I were to give developers some advice, I would suggest focusing on strengthening your design capabilities and making more use of tools' implementation capabilities. These design capabilities mainly include:
1) How to describe intent in a way the machine can easily understand;
2) How to accurately define interfaces, especially method-level interfaces;
3) How to provide the best test data for automatically accepting the generated code.
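The three capabilities above can be made concrete in one small example (every name here is hypothetical, chosen only for illustration): a docstring that states intent unambiguously, a precise type-annotated method-level interface, and test data against which generated code can be accepted automatically.

```python
# 1) The docstring states the intent unambiguously.
# 2) The signature pins down a precise method-level interface.
# 3) The assertions supply test data for automatic acceptance.

def moving_average(values: list[float], window: int) -> list[float]:
    """Return the mean of each consecutive `window`-sized slice of `values`.

    The result has len(values) - window + 1 entries; `window` must be
    between 1 and len(values) inclusive.
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Acceptance data: with these in hand, a code generation tool's output
# can be checked automatically rather than reviewed by eye.
assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
assert moving_average([5.0], 1) == [5.0]
```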