Training 32 languages with 8 GPUs in 7 days: ByteDance launches mRASP, a new paradigm for multilingual pre-training

2020/11/23 13:56:04 · Technology

Released by Heart of the Machine

Heart of the Machine Editorial Team

A study by ByteDance published at EMNLP 2020 proposes a new paradigm for multilingual translation pre-training: mRASP.

In 1920, the great philosopher Bertrand Russell toured China. He was accompanied by his interpreter Zhao Yuanren, a linguist at Tsinghua University. Zhao Yuanren was a remarkable linguistic genius: by then he could already speak the dialects of Baoding, Changzhou, Fuzhou, Nanjing, and many other places, as well as English. On the boat accompanying Russell from Shanghai to Changsha, he learned the Changsha dialect from the economist Yang Ruiliu, who was traveling with him. By the time the ship docked in Changsha, Zhao Yuanren could already interpret Russell's lectures and witticisms into the Changsha dialect. Can neural machine translation become the "Zhao Yuanren of machine translation", that is, a unified model with multilingual ability that, upon encountering a new language, reaches fluency with only a small amount of quick adaptation?

[Photo: Zhao Yuanren (second from left in the back row) and Russell (first from right in the front row)]

This article introduces a new paradigm for multilingual translation: multilingual Random Aligned Substitution Pre-training (mRASP) [1]. Its core idea is to build the "Zhao Yuanren model" of machine translation: through pre-training followed by fine-tuning on specific language pairs, leading translation quality can be achieved. A single unified model pre-trained on 32 languages yields significant improvements across 47 translation test sets.

[Figure: mRASP departs from previous translation approaches and establishes a successful pre-train-then-fine-tune path for machine translation]

The pre-training paradigm represented by BERT has swept almost all text understanding tasks and become the cornerstone of many NLP systems. In text generation, however, and especially in machine translation, although many new pre-training algorithms have emerged, the gains so far remain limited, and challenges remain in scenarios with different levels of resources and in extending to many languages. The core question mRASP addresses is: can a single translation model be pre-trained such that, with only a small amount of fine-tuning, it achieves good translation quality on any language pair, for example Chinese to Indonesian?

mRASP is mainly designed for machine translation tasks. It has three application advantages:

It breaks the limitation of resource scenarios: improvements are obtained regardless of how much parallel bilingual data is available. On resource-rich tasks such as the standard English-French benchmark with 40 million parallel sentence pairs, mRASP still brings a significant improvement, reaching 44.3 BLEU; on low-resource languages its performance is even more striking. In extreme cases, only 10,000 training sentences and about 10 minutes of fine-tuning are enough to obtain a decent translation system.

It breaks the limit on the number of languages: for any translation direction, whether Bengali to Gujarati or Hindi to Filipino, as long as the languages are spoken on Earth, mRASP can be fine-tuned directly with promising results.

It has low resource consumption: compared with the "arms race" style of pre-training on hundreds of GPUs, mRASP is far more accessible, requiring only 8 GPUs for a week of training. Simply put, mRASP can be thought of as a lightweight BERT for machine translation: on any machine translation task, in any domain or language, it may bring a pleasant surprise!

The authors state that this technology has already been deployed in the Volcano Translation system developed by ByteDance and has been validated in real business scenarios. They have also released the research data, code, and pre-trained models; see the GitHub link at the end of this article.

Next, we will introduce and analyze mRASP from three aspects: 1) the challenges of machine translation pre-training; 2) the motivation and methods of mRASP; 3) the actual effects and analysis of mRASP.

The challenge of machine translation pre-training

Currently, most AI tasks are based on statistical learning from data, and model performance depends largely on the quality and quantity of that data. Pre-training a model on large amounts of easily obtainable data and then fine-tuning it with a small amount of labeled data for a specific application has become a new success paradigm in NLP. For example, after pre-training on large-scale plain text, BERT [2] achieves good results on 11 natural language understanding tasks with only a small amount of fine-tuning. However, in multilingual machine translation, the pre-train-then-fine-tune paradigm has not yet achieved general success: the training objectives of earlier NLP pre-training methods such as BERT and GPT [5] are too far from the objective of the translation task, so they are not easy to use directly.

mRASP proposes a new idea: use the large amounts of bilingual parallel corpora accumulated across many languages, combine them to jointly train a unified model, and then fine-tune from there, so that the pre-training objective and the fine-tuning objective are as close as possible and the pre-trained model can play its full role.

[Figure: limitations of directly applying previous NLP pre-training methods to machine translation]

The figure above analyzes the limitations of directly applying previous NLP pre-training methods to machine translation. BERT and GPT correspond to pre-training the encoder and decoder parts of the Transformer [6], respectively, while machine translation uses a full sequence-to-sequence model. This mismatch in model structure means that only part of the translation model's parameters can be initialized from pre-training, which makes it hard for pre-training to take effect, so many special tricks are needed to obtain improvements [10].

For sequence models, researchers soon proposed frameworks such as MASS [7] and BART [8] to extend pre-training to sequence generation tasks. They use auto-encoding for self-supervised learning and have achieved notable results on many downstream generation tasks. However, two important problems remain when applying them to machine translation: first, they bring no improvement on resource-rich language pairs (such as English-German and English-French); second, they cannot be extended to multilingual translation tasks. This is largely because auto-encoding is a relatively simple task from which it is hard to learn deep representations, whereas machine translation requires more complex semantic transformations. The mismatch between the pre-training objective and the downstream task makes it difficult for the model to make the best use of the pre-training data.

How to overcome these two problems has become an important challenge for applying pre-trained models in machine translation.

The motivation and method of mRASP

There is a very interesting phenomenon among language learners: after learning three or four languages, they find that learning a new one goes faster. For example, learning German and French separately may take a year each, but learning German first and then French may take only a year and three months in total, and learning Spanish after that may be faster still [3]. The same holds for programming languages: learning C++ may take a year, after which learning Java or Python may take only a month.

A simple explanation is that, in the process of learning multiple languages, humans spontaneously summarize the more abstract commonalities among languages and then concentrate on learning what is specific to the new one. Therefore, to improve one's language learning ability, it often helps to learn more languages and grasp their commonalities more precisely, rather than laboring over a single language. By the same logic, an interesting question for machine translation is whether translation ability can be transferred across languages so that information from different languages can reinforce each other.


Based on this consideration, mRASP is designed as a general pre-trained model that learns the commonalities of transformation between languages and can therefore be transferred more easily to a new translation direction, just as a language learner who has mastered two languages finds the third much easier.

The design of mRASP follows two basic principles: first, the pre-training objective should be essentially the same as machine translation, requiring the model to learn the ability to transform between languages; second, the representations should be as language-agnostic as possible, so that sentences with the same meaning in different languages are close to each other in the hidden space.

[Figure: the mRASP method, which uses a Transformer with language tokens as the translation framework]

mRASP follows the general pre-train-then-fine-tune framework. In the pre-training stage, unlike traditional pre-trained models that pile up huge amounts of unsupervised monolingual data, mRASP takes a different approach: it uses multilingual parallel data as the main target of pre-training, putting parallel data for dozens of languages into the same model for joint training.

The network architecture is a Transformer, with language tokens used to identify the source and target languages. To ensure that sentences and words from different languages can be embedded in the same space, a sentence with the same meaning should ideally be represented by the same vector whether it is written in Chinese or in English. In addition, the method introduces the Random Aligned Substitution (RAS) technique to create richer context.
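To make the setup concrete, here is a minimal sketch, written for illustration only, of how a directed training example could be assembled with language tokens; the token format (`<zh>`, `<en>`) and the helper name are assumptions, not the exact convention of the released mRASP code.

```python
# Illustrative sketch: prepend language tokens so one shared Transformer can
# handle dozens of translation directions in a single joint model.
# The "<zh>"/"<en>" token format is an assumption for this example.
def make_example(src_sentence: str, tgt_sentence: str, src_lang: str, tgt_lang: str):
    src = f"<{src_lang}> {src_sentence}"
    tgt = f"<{tgt_lang}> {tgt_sentence}"
    return src, tgt

src, tgt = make_example("我 爱 北京 天安门", "I love Beijing Tiananmen Square", "zh", "en")
print(src)  # <zh> 我 爱 北京 天安门
print(tgt)  # <en> I love Beijing Tiananmen Square
```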

"Love" in the Chinese sentence "I love Beijing Tiananmen" has a certain probability to be replaced with "aime" (French), and "Beijing" has a certain probability to be replaced with "Pékin" (French), so the original sentence may change Into "My aime Pékin Tiananmen Square". A pair of parallel sentence pairs in the training set can be turned into two pairs (even three pairs, four pairs...):

1. 我爱北京天安门 ==> I love Beijing Tiananmen Square

2. 我 aime Pékin 天安门 ==> I love Beijing Tiananmen Square

By learning a large amount of such parallel corpora, the model naturally picks up the correspondence between synonyms across languages from this artificially constructed "context". In effect, this random substitution based on a parallel dictionary narrows the spatial distribution of synonymous sentences in different languages. In the example above, the word vectors computed for "爱" ("love") and the French "aime" are expected to end up as close as possible.
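The substitution step itself is easy to express in code. Below is a minimal sketch of RAS under the assumption of a tiny hand-written dictionary and a fixed replacement probability; the dictionaries and probabilities used in the actual paper differ.

```python
import random

# Minimal sketch of Random Aligned Substitution (RAS): each source word with an
# entry in a bilingual dictionary is swapped for its aligned foreign synonym
# with some probability. The toy dictionary and probability are illustrative.
DICTIONARY = {"爱": "aime", "北京": "Pékin"}  # source word -> aligned word in another language

def ras(tokens, dictionary, prob=0.3):
    return [
        dictionary[tok] if tok in dictionary and random.random() < prob else tok
        for tok in tokens
    ]

print(ras(["我", "爱", "北京", "天安门"], DICTIONARY))
# e.g. ['我', 'aime', 'Pékin', '天安门']  (substitutions are random)
```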


In the fine-tuning stage, one only needs to initialize the model with the parameters from the pre-training stage and then train it in exactly the same way as a traditional single-direction machine translation model, so using mRASP requires no extra tricks. Please refer to the paper [1] for a detailed description of the method.
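For readers who want a feel for how unremarkable the fine-tuning step is, here is a self-contained toy sketch: initialize from pre-trained parameters, then run ordinary supervised training. It uses PyTorch's generic nn.Transformer with dummy tensors and a placeholder loss purely for illustration; the checkpoint path is hypothetical and this is not the authors' released training code.

```python
import torch
import torch.nn as nn

# Toy illustration of the fine-tuning stage: initialize from pre-trained weights,
# then train exactly like an ordinary single-direction NMT model.
model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2, num_decoder_layers=2)

# In real use one would load the released mRASP checkpoint here (hypothetical path):
# model.load_state_dict(torch.load("mrasp_pretrained.pt"))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # placeholder; a real NMT system uses token-level cross-entropy

src = torch.randn(10, 8, 64)  # (src_len, batch, d_model) dummy source representations
tgt = torch.randn(12, 8, 64)  # (tgt_len, batch, d_model) dummy target representations
for step in range(3):         # a few ordinary training steps on the new language pair
    out = model(src, tgt)
    loss = criterion(out, tgt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```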

The actual effect and analysis of mRASP

mRASP uses parallel corpora covering 32 languages for pre-training, then fine-tunes only on the WMT14 parallel corpus for the English-to-French direction, achieving the best result (44.3 BLEU) without the time-consuming and labor-intensive massive monolingual back-translation. Applied to a new language direction, Dutch (Nl) to Portuguese (Pt), with only 12,000 parallel sentence pairs and about ten minutes of fine-tuning, it yields a usable model (BLEU 10+), whereas training a usable MT model from scratch on the same amount of parallel data is essentially impossible (BLEU close to 0).

To briefly summarize, mRASP has the following advantages:

The model is simple and easy to reproduce.

mRASP pre-training uses only about 110 million parallel sentence pairs in total (since each pair is used in both directions, this amounts to 220 million training samples), and the vocabulary contains only 64k BPE subwords. Compared with other pre-training methods that consume tens of billions of data points with networks dozens of layers deep, mRASP is far less demanding to train: a single 8-GPU machine can complete pre-training on 32 languages in under a week. Pre-trained models covering more languages can, of course, be obtained by simple extension.
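As a small illustration of the data arithmetic (our own sketch, not code from the paper), each parallel pair yields one directed training sample per direction, which is how 110 million pairs become 220 million samples:

```python
# Each parallel pair is used in both translation directions, so 110M pairs
# become 220M directed training samples. Token names are illustrative.
def directed_samples(zh: str, en: str):
    return [
        ("<zh>", zh, "<en>", en),  # zh -> en
        ("<en>", en, "<zh>", zh),  # en -> zh
    ]

samples = directed_samples("我爱北京天安门", "I love Beijing Tiananmen Square")
assert len(samples) == 2
```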

It is extremely versatile.

mRASP brings consistent improvements over single-direction machine translation models trained directly on large, medium, and small training sets, even including English-to-French, the direction with the most parallel data (a gain of 1.1 BLEU). Even for directions such as Dutch to Portuguese, which never appear in the pre-training data, it achieves a significant gain of 10+ BLEU. Some representative experimental results are excerpted below:

1) En-De and En-Fr Benchmark

The figures below compare mRASP plus fine-tuning on English-German (En-De) and English-French (En-Fr) against several other cross-lingual pre-training models plus fine-tuning. mRASP shows a clear advantage: it reaches 30.3 (tokenized BLEU) on the En->De WMT 2016 test set and 44.3 (tokenized BLEU) on the En->Fr WMT 2014 test set. Among the other models, CTNMT uses BERT pre-training; MASS uses large-scale monolingual data; mBERT is a multilingual BERT model; and mBART is a concurrent pre-training method that introduces massive multilingual monolingual data and requires 256 GPUs for 20 days of training.

[Figures: En-De and En-Fr benchmark comparisons]

2) Extension to languages unseen during pre-training

Translation directions whose parallel sentence pairs are not included in the pre-training stage are called "exotic directions". Whether mRASP works well on exotic directions determines whether it has good scalability and generalization ability.

The paper divides exotic directions into four cases:

Exotic Pair: the source and target languages have each been seen separately during pre-training, but the model has never seen them as a bilingual pair;

Exotic Source: the model has only seen the target language during pre-training and has never seen the source language;

Exotic Target: the model has only seen the source language during pre-training and has never seen the target language;

Exotic Full: the model has seen neither the source language nor the target language during pre-training.

Training machine translation under these four unseen conditions is difficult, and the hardest is the last one: it is like asking someone who has only learned Chinese and English to read a few sentences of Latin and Hindi and then translate from Latin into Hindi.


It is worth noting that for French-Chinese (Fr-Zh), both languages appeared separately during pre-training but never as a parallel pair; with only 20K parallel sentences, the model reaches 20+ BLEU.

Meanwhile, for pairs in which neither language appeared during pre-training, such as Dutch to Portuguese (Nl-Pt), only 12,000 parallel sentences and about 10 minutes of training are enough to reach 10+ BLEU.


3) Case analysis

To understand the effect of mRASP more intuitively, the authors also conducted case studies in the paper.

French-Chinese (Fr-Zh)

Exotic Pair, 20k parallel sentence pairs

Direct training reaches only 0.7 BLEU, far below mRASP's 25.8 BLEU.

The Direct system cannot translate at all, while the mRASP system translates very well.


Dutch-Portuguese (Nl-Pt)

Exotic Full, 12,000 parallel sentence pairs

Direct 0 BLEU vs mRASP 14.1 BLEU

The case studies show that the Dutch-Portuguese model obtained with mRASP cannot translate every detail correctly, but it does capture key information from the source text, for example: (1) the date, (2) the meeting record / meeting message, and (3) distribution and sharing.


English-French (En-Fr)

One advantage of the model trained with mRASP over the Direct method is that the Direct system tends to ignore function words (articles, demonstratives, etc.), whereas mRASP keeps articles and demonstratives consistent with the source.


English-Chinese (En-Zh)

[Figure: En-Zh translation example]

4) Effect analysis

As a general pre-trained model, where do mRASP's improvements on the various downstream MT tasks come from?

The authors believe the improvements mainly come from two aspects:

mRASP narrows the gap between the vector representations of synonymous words across different languages;

mRASP narrows the gap between the vector representations of synonymous sentences across different languages.

Bringing word-level and sentence-level representations closer means that, after processing and learning from a large number of parallel sentence pairs during pre-training, mRASP implicitly "masters" a language-independent representation. Because this representation transfers to any language, mRASP can generally improve downstream machine translation tasks.

1) mRASP draws the word-level vector representations of different languages closer together.

The introduction of RAS lets synonyms in different languages share the same context, and in NLP a word's meaning is determined by its context, which further narrows the gap between the representations of synonyms across languages.

[Figures: visualization of cross-lingual word embeddings. Top: w/o RAS; bottom: w/ RAS]

It can be seen that after adding RAS, the embedding distributions of different languages are drawn closer together (the angle between them becomes smaller).

2) mRASP draws the sentence-level vector representations of different languages closer together.

Beyond word-level representations of synonyms, mRASP also narrows the gap between the semantic vector representations of sentences.

The encoder output is used as the sentence's representation in space (the L2-normalized average-pooled encoder output). For each sentence in the TED parallel test set (a filtered 15-way parallel test set with 2,284 entries in total), the nearest sentence by cosine similarity is retrieved and the Top-1 accuracy (sentence retrieval accuracy) is computed.
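This retrieval metric is straightforward to reproduce. Below is a minimal sketch with dummy encoder outputs (the shapes and data are illustrative only): L2-normalize the mean-pooled encoder output, retrieve the nearest candidate by cosine similarity, and compute Top-1 accuracy.

```python
import numpy as np

def sentence_embedding(encoder_output):
    """L2-normalized average-pooled encoder output: (seq_len, d_model) -> (d_model,)."""
    pooled = encoder_output.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def top1_retrieval_accuracy(src_embs, tgt_embs):
    """src_embs[i] and tgt_embs[i] represent parallel sentences; count how often
    the nearest target by cosine similarity is the correct one."""
    sims = src_embs @ tgt_embs.T  # cosine similarity, since rows are L2-normalized
    return float((sims.argmax(axis=1) == np.arange(len(src_embs))).mean())

rng = np.random.default_rng(0)
src = np.stack([sentence_embedding(rng.normal(size=(7, 16))) for _ in range(5)])
tgt = np.stack([sentence_embedding(rng.normal(size=(9, 16))) for _ in range(5)])
print(top1_retrieval_accuracy(src, tgt))  # random embeddings -> roughly chance level
```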


Figure 1: mRASP's retrieval accuracy minus that of mBART [9]. Note that Dutch (Nl) does not appear in the mRASP pre-training data at all, and the accuracy in the other directions greatly exceeds that of mBART.

mRASP's average retrieval accuracy reaches 76%.


Figure 2: mRASP's accuracy minus that of the mRASP variant trained without RAS. RAS brings a clear benefit for the language (Nl) that never appears during pre-training.


Figure 3: After removing the language token at the beginning of the sentence, the accuracy for Nl improves further, but the accuracy for the other languages drops significantly.

This shows that RAS does indeed further narrow the semantic vector representations: sentences with the same meaning obtain close representations after mRASP pre-training.

Summary

Returning to the beginning of this article: the linguistic genius Zhao Yuanren mastered 33 dialects and 7 foreign languages over his lifetime, from Baoding in the north to Fuzhou in the south, from the upper to the lower reaches of the Yangtze River, from Berkeley in the United States to Paris in France; wherever he went, he could speak the local language in the local accent. Building a unified translation model across many languages and domains is one of the ultimate goals of machine translation research. mRASP, echoing the language genius Zhao Yuanren, establishes a successful path from multilingual pre-training to fine-tuned multilingual translation models, and it may well become a new paradigm for machine translation. ByteDance has applied this technology to the Volcano Translation system, which can be tried via the official website linked below.

Github Address: https://github.com/linzehui/mRASP

Paper address: https://arxiv.org/abs/2010.03142

Volcano Translation Experience Official Website: http://translate.volcengine.cn/

References

[1] Lin, Zehui, et al. "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

[2] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT 2019: 4171-4186.

[3] Thomas, Reed, and Callie Mady. "Teaching for transfer: Insights from theory and practices in primary-level French-second-language classrooms." McGill Journal of Education / Revue des sciences de l'éducation de McGill 49.2 (2014): 399-416.

[4] Johnson, Melvin, et al. "Google's multilingual neural machine translation system: Enabling zero-shot translation." Transactions of the Association for Computational Linguistics 5 (2017): 339-351.

[5] Radford, Alec, et al. "Improving language understanding by generative pre-training." 2018.

[6] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

[7] Song, Kaitao, et al. "MASS: Masked Sequence to Sequence Pre-training for Language Generation." ICML. 2019.

[8] Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020: 7871-7880.

[9] Liu, Yinhan, et al. "Multilingual denoising pre-training for neural machine translation." TACL. 2020.

[10] Yang, et al. "Towards Making the Most of BERT in Neural Machine Translation." AAAI. 2020.
