Machine Heart Release
Machine Heart Editorial Department
Alibaba Cloud has officially open-sourced the deep transfer learning framework EasyTransfer. This article introduces the core functions of the framework in detail.
Recently, Alibaba Cloud officially open-sourced EasyTransfer, the industry's first deep transfer learning framework built for natural language processing (NLP) scenarios.
Open source link: https://github.com/alibaba/EasyTransfer
The framework was developed by the Alibaba Cloud Machine Learning PAI team to make the development and deployment of model pre-training and transfer learning for NLP scenarios simpler and more efficient.
Deep transfer learning for NLP is in great demand in real-world applications: new domains keep emerging, and traditional machine learning requires accumulating a large amount of labeled training data for each of them, which consumes considerable manpower and material resources. Deep transfer learning can transfer knowledge learned in a source domain to tasks in new domains, greatly reducing the annotation effort required.
Despite this demand, the open source community has lacked a complete framework for deep transfer learning in NLP, and building one that is simple, easy to use and high-performance is a significant challenge.
First, pre-training plus knowledge transfer is now the mainstream paradigm for NLP applications, and the larger the pre-trained model, the more effective its knowledge representation usually is. Such super-large models pose huge challenges to the framework's distributed architecture: how can a high-performance distributed architecture efficiently support training at this scale?
Second, user application scenarios are highly diverse, and no single transfer learning algorithm fits all of them: how can a complete set of transfer learning tools be provided to improve results on downstream tasks?
Third, the path from algorithm development to production deployment is usually long: how can a simple, easy-to-use, one-stop service from model training to deployment be provided?
Faced with these three challenges, the PAI team launched EasyTransfer, a simple, easy-to-use and high-performance transfer learning framework. The framework supports mainstream transfer learning algorithms as well as automatic mixed precision, compilation optimization and efficient distributed data/model-parallel strategies, making it suitable for industrial-grade distributed application scenarios.
It is worth mentioning that, by combining mixed precision, compilation optimization and distributed strategies, the ALBERT model supported by EasyTransfer trains more than 4 times faster in distributed settings than the community version of ALBERT.
Having been applied in more than 10 business units (BUs) and more than 20 business scenarios within Alibaba, it offers NLP and transfer learning users many conveniences, including an industry-leading high-performance pre-training tool chain and pre-trained ModelZoo, a rich and easy-to-use AppZoo, efficient transfer learning algorithms, and full compatibility with Alibaba PAI ecosystem products, providing a one-stop service from model training to deployment.
Lin Wei, head of the Alibaba Cloud Machine Learning PAI team, said that open-sourcing the EasyTransfer code is intended to share Alibaba's capabilities with more users, lower the barrier to entry for NLP pre-training and knowledge transfer, and deepen collaboration with more partners to build a simple, easy-to-use and high-performance NLP and transfer learning toolkit.
Six highlights of the framework
Simple and high-performance framework: complex underlying implementations are hidden, so users only need to focus on the logical structure of the model, lowering the barrier to entry for NLP and transfer learning; at the same time, the framework supports industrial-grade distributed application scenarios, with an improved distributed optimizer that, combined with automatic mixed precision, compilation optimization and efficient distributed data/model-parallel strategies, makes multi-machine, multi-GPU distributed training more than 4 times faster than the community version;
Language model pre-training tool chain: a complete pre-training tool chain that makes it easy for users to pre-train language models such as T5 and BERT; models produced with this tool chain have achieved good results on the Chinese CLUE leaderboard and the English SuperGLUE leaderboard;
Rich and high-quality pre-trained ModelZoo: supports PAI-ModelZoo with continued pre-training and fine-tuning of mainstream models such as BERT, ALBERT, RoBERTa, XLNet and T5, as well as self-developed multimodal models for the fashion industry such as FashionBERT;
Rich and easy-to-use AppZoo: supports mainstream NLP applications and applications built on self-developed models, such as single-tower models like DAM++ and HCNN, BERT two-tower models with vector recall, and reading-comprehension models such as BERT-HAE;
Automatic knowledge distillation tool: supports distilling knowledge from large teacher models into small student models. It integrates the task-adaptive BERT compression algorithm AdaBERT, which uses neural architecture search to find task-specific architectures for compressing the original BERT model; the compressed model can be as small as 1/17 of the original, with up to a 29x speedup in inference and a loss in model quality of less than 3%;
Compatibility with PAI ecosystem products: the framework is built on PAI-TF, so with simple code or configuration changes users can take advantage of PAI's self-developed, efficient distributed training and compilation optimization; the framework is also fully compatible with PAI ecosystem products, including the PAI web components (PAI Studio), the development platform (PAI DSW) and the PAI serving platform (PAI EAS).
Platform Architecture Overview
EasyTransfer's overall architecture is shown in the figure below; the design minimizes the difficulty of developing deep transfer learning algorithms. The framework abstracts commonly used IO, layers, losses, optimizers and models; users can develop their own models on top of these interfaces or connect directly to the pre-trained model library ModelZoo for rapid modeling. The framework supports five transfer learning (TL) paradigms: model fine-tuning, feature-based TL, instance-based TL, model-based TL and meta learning. It also integrates AppZoo, which supports mainstream NLP applications and makes it easier for users to build common NLP algorithm applications. Finally, the framework is seamlessly compatible with PAI ecosystem products, giving users a one-stop experience from training to deployment.
Detailed explanation of platform functions
The following details the core functions of the EasyTransfer framework.
Simple and easy-to-use API design
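The framework exposes users to abstractions for IO, layers, losses, optimizers and models, so that only the logical structure of the model has to be written. The sketch below is purely illustrative: the class names and the build_logits/build_loss pattern are assumptions made for the sake of the example, not the actual EasyTransfer API, and the training loop is plain tf.keras.

```python
# A minimal, hypothetical sketch of the "user only writes model logic" pattern
# described in this section. Class and method names are illustrative only and
# are NOT the actual EasyTransfer API; the training loop is plain tf.keras.
import tensorflow as tf

class BaseApp:
    """Framework side: owns the generic training loop."""
    def __init__(self, num_labels, lr=2e-5):
        self.num_labels = num_labels
        self.optimizer = tf.keras.optimizers.Adam(lr)

    def build_logits(self, features, training=True):
        raise NotImplementedError

    def build_loss(self, logits, labels):
        raise NotImplementedError

    def fit(self, dataset, epochs=1):
        # Generic loop; the user never writes this part.
        for _ in range(epochs):
            for features, labels in dataset:
                with tf.GradientTape() as tape:
                    logits = self.build_logits(features, training=True)
                    loss = self.build_loss(logits, labels)
                vars_ = self.model.trainable_variables  # subclass provides self.model
                self.optimizer.apply_gradients(zip(tape.gradient(loss, vars_), vars_))

class TextClassifier(BaseApp):
    """User side: only the logical structure of the model is specified."""
    def __init__(self, vocab_size, num_labels):
        super().__init__(num_labels)
        # Stand-in encoder; in practice this would be a pre-trained backbone from ModelZoo.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, 128),
            tf.keras.layers.GlobalAveragePooling1D(),
            tf.keras.layers.Dense(num_labels),
        ])

    def build_logits(self, features, training=True):
        return self.model(features, training=training)

    def build_loss(self, logits, labels):
        return tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))

# Toy usage with random data.
ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((64, 16), maxval=1000, dtype=tf.int32),
     tf.random.uniform((64,), maxval=2, dtype=tf.int32))).batch(8)
app = TextClassifier(vocab_size=1000, num_labels=2)
app.fit(ds, epochs=1)
```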

High-performance distributed framework
The EasyTransfer framework supports industrial-grade distributed application scenarios. It improves the distributed optimizer and, combined with automatic mixed precision, compilation optimization and efficient distributed data/model-parallel strategies, achieves multi-machine, multi-GPU distributed training more than 4 times faster than the community version.
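The distributed optimizer and compilation optimizations here are internal to PAI-TF, but two of the ingredients mentioned, automatic mixed precision and data-parallel training, can be illustrated with standard TensorFlow 2 APIs (a generic sketch, not the PAI-TF implementation):

```python
# Generic illustration of automatic mixed precision + data-parallel training
# in standard TensorFlow 2; this is not the PAI-TF implementation, which adds
# its own distributed optimizer and compilation optimizations.
import tensorflow as tf

# Automatic mixed precision: compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Data parallelism: replicate the model on all visible GPUs and split each batch.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
        # Keep the final softmax in float32 for numerical stability.
        tf.keras.layers.Dense(10, dtype="float32", activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy")

# Toy data; multi-machine training would use a multi-worker strategy instead.
x = tf.random.normal((512, 128))
y = tf.random.uniform((512,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=64, epochs=1)
```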
Rich ModelZoo
The framework provides a set of pre-training tools for users to build customized pre-trained language models, as well as a pre-trained language model library, ModelZoo, that users can call directly. More than 20 pre-trained models are currently supported; among them, PAI-ALBERT-zh, pre-trained on the PAI platform, took first place on the Chinese CLUE leaderboard, and PAI-ALBERT-en-large took second place on the English SuperGLUE leaderboard. A detailed list of pre-trained models is given below:
Performance of the pre-trained models on the CLUE leaderboard:
SuperGLUE:
Rich AppZoo
EasyTransfer encapsulates AppZoo, which is easy to use, flexible and has a low learning cost. With only a few lines of commands, users can run open source and self-developed algorithms at scale and quickly build NLP applications on different scenarios and business data, including text vectorization, matching, classification, reading comprehension and sequence labeling.
Efficient transfer learning algorithms
The EasyTransfer framework supports all mainstream transfer learning paradigms, including model fine-tuning, feature-based TL, instance-based TL, model-based TL and meta learning. More than 10 algorithms have been developed on top of these paradigms and have achieved good results in Alibaba's business practice; all of them will be open-sourced into the EasyTransfer code base. In practice, users can choose a transfer learning paradigm with the help of the figure below and evaluate its effect.
Pre-trained language models
One of the hottest topics in NLP is pre-trained language models such as BERT and ALBERT, which have achieved very good results across major NLP scenarios. To better support users of pre-trained language models, we built a standard paradigm for language model pre-training and the pre-trained model library ModelZoo into the new version of the EasyTransfer framework. To reduce the total number of parameters, the original ALBERT abandons BERT's approach of stacking distinct encoder layers and instead loops over a single shared encoder layer, as shown in the figure below. Since looping over a single layer does not perform very well on downstream tasks, we changed it to loop over a block of two stacked encoder layers. We then re-trained an ALBERT xxlarge model on the English C4 data. During pre-training we use only the MLM loss, combined with Whole Word Masking; based on EasyTransfer's train-on-the-fly capability, we implement dynamic online masking, i.e., the tokens to be masked are generated dynamically each time a sentence is read. The resulting pre-trained model, PAI-ALBERT-en-large, achieved second place on the SuperGLUE leaderboard, the best result among submissions from China, with only 1/10 of the parameters of the first-place Google T5 and a performance gap within 3.5%. In the future we will continue to optimize the model architecture and strive to achieve better results than T5 with 1/5 of its parameters.
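To make the weight-sharing change concrete, the sketch below contrasts the three strategies with toy Transformer encoder layers: BERT stacks distinct layers, ALBERT loops a single shared layer, and the variant described here loops a block of two stacked layers. This is a schematic illustration only, not the actual pre-training code.

```python
# Schematic sketch of the three weight-sharing strategies described above,
# using toy Transformer encoder layers; this is not the actual pre-training code.
import tensorflow as tf

def encoder_layer(dim=128, heads=4):
    """A minimal Transformer encoder layer (self-attention + feed-forward)."""
    inputs = tf.keras.Input(shape=(None, dim))
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=heads, key_dim=dim // heads)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization()(inputs + attn)
    ffn = tf.keras.layers.Dense(dim, activation="gelu")(x)
    x = tf.keras.layers.LayerNormalization()(x + ffn)
    return tf.keras.Model(inputs, x)

def bert_style(x, num_layers=12):
    # BERT: num_layers distinct layers, no parameter sharing.
    for _ in range(num_layers):
        x = encoder_layer()(x)
    return x

def albert_style(x, num_layers=12):
    # ALBERT: a single shared layer applied num_layers times.
    shared = encoder_layer()
    for _ in range(num_layers):
        x = shared(x)
    return x

def two_layer_loop_style(x, num_layers=12):
    # Variant described here: share a block of two stacked layers and
    # loop that block num_layers // 2 times.
    block = [encoder_layer(), encoder_layer()]
    for _ in range(num_layers // 2):
        for layer in block:
            x = layer(x)
    return x

tokens = tf.random.normal((2, 16, 128))     # (batch, seq_len, hidden)
print(two_layer_loop_style(tokens).shape)   # (2, 16, 128)
```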
Multimodal Model FashionBERT
With the development of web technology, the Internet contains a huge amount of multimodal information, including text, images, speech and video. Retrieving important information from this massive multimodal content has long been a research focus in academia. The core of multimodal matching is text-image matching, a fundamental problem with applications in many areas, such as cross-modal information retrieval, image captioning, visual question answering and image knowledge reasoning. However, current academic research focuses on multimodality in the general domain, and there is relatively little multimodal research for e-commerce. Based on this, we collaborated with the Alibaba ICBU team to propose FashionBERT, a multimodal pre-training model that studies pre-training on image and text information in the e-commerce domain and has been successfully applied in several cross-modal retrieval and image-text matching business scenarios. The model architecture is shown in the diagram below. This work proposes an adaptive loss to balance the image-text matching, image-only and text-only losses.
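The sketch below illustrates the general idea of letting the model balance the three losses with learnable weights instead of hand-tuned coefficients; it uses uncertainty-style weighting as a stand-in and is not the exact adaptive-loss formulation from the FashionBERT paper.

```python
# Simplified illustration of balancing the text, image and text-image matching
# losses with learnable weights (uncertainty weighting). This is NOT the exact
# adaptive loss of the FashionBERT paper, just the general idea of letting the
# model adapt the loss weights instead of fixing them by hand.
import tensorflow as tf

# One learnable log-variance per task: masked-text, masked-image, matching.
log_vars = tf.Variable(tf.zeros(3), trainable=True, name="task_log_vars")

def adaptive_multitask_loss(task_losses):
    """Weights each task loss by a learned precision."""
    losses = tf.stack(task_losses)            # (num_tasks,)
    precisions = tf.exp(-log_vars)            # larger precision = larger weight
    # The +log_vars term prevents the trivial solution of driving all weights to zero.
    return tf.reduce_sum(precisions * losses + log_vars)

# Toy usage with dummy per-task losses for the three FashionBERT-style tasks.
mlm_loss, mpm_loss, match_loss = tf.constant(2.1), tf.constant(1.4), tf.constant(0.7)
print(float(adaptive_multitask_loss([mlm_loss, mpm_loss, match_loss])))
```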
Task Adaptive Knowledge Distillation
Pre-trained models extract general knowledge from massive amounts of unsupervised data and improve downstream tasks through knowledge transfer, achieving excellent results in practice. Usually, the larger the pre-trained model, the more effective the learned knowledge representation is for downstream tasks and the larger the gains it brings. However, large models clearly cannot meet the latency requirements of industrial applications, so model compression has to be considered. Together with the Alibaba intelligent computing team, we proposed a new compression method, AdaBERT, which uses differentiable neural architecture search to automatically compress BERT into small, task-adaptive models. In this process, BERT serves as a teacher model from which we extract knowledge useful for the target task; guided by this knowledge, we adaptively search for a network structure suited to the target task and compress it into a small student model. We evaluated AdaBERT on multiple public NLP tasks, and the results show that the compressed models are 12.7 to 29.3 times faster in inference than the original BERT at comparable accuracy, with 11.5 to 17.0 times fewer parameters.
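In spirit, the search optimizes a combination of the task loss, a distillation loss from the teacher's soft labels, and an efficiency penalty over softly-selected candidate operations. The sketch below shows such an objective with a single DARTS-style mixed layer standing in for the full search space; it is a simplified illustration, not the AdaBERT implementation.

```python
# Much simplified sketch of an AdaBERT-style objective: a task loss, a
# distillation loss from the teacher's soft labels, and an efficiency penalty
# over softly-selected candidate operations. The real method searches a full
# cell-based space; here a single mixed layer stands in for it.
import tensorflow as tf

class MixedOp(tf.keras.layers.Layer):
    """Softly mixes candidate ops with learnable architecture weights."""
    def __init__(self, dim=64):
        super().__init__()
        self.ops = [tf.keras.layers.Dense(dim, activation="relu"),   # "heavy" op
                    tf.keras.layers.Dense(dim),                      # "medium" op
                    tf.keras.layers.Lambda(lambda x: x)]             # identity, "cheap" op
        self.op_cost = tf.constant([1.0, 0.6, 0.0])                  # relative cost
        self.alpha = self.add_weight(name="alpha", shape=(3,), initializer="zeros")

    def call(self, x):
        w = tf.nn.softmax(self.alpha)
        out = tf.add_n([w[i] * op(x) for i, op in enumerate(self.ops)])
        efficiency_penalty = tf.reduce_sum(w * self.op_cost)
        return out, efficiency_penalty

def compression_loss(student_logits, teacher_logits, labels, penalty,
                     temperature=2.0, w_task=1.0, w_kd=1.0, w_eff=0.1):
    task = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True))
    kd = tf.reduce_mean(tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(student_logits / temperature)))
    return w_task * task + w_kd * kd + w_eff * penalty

# Toy forward pass.
x = tf.random.normal((8, 64))
labels = tf.random.uniform((8,), maxval=2, dtype=tf.int32)
teacher_logits = tf.random.normal((8, 2))       # stands in for the BERT teacher
hidden, penalty = MixedOp()(x)
student_logits = tf.keras.layers.Dense(2)(hidden)
print(float(compression_loss(student_logits, teacher_logits, labels, penalty)))
```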
Domain Relationship Learning in QA Scenarios
As early as 2017, we experimented with transfer learning in Alibaba's AliMe Q&A scenario, focusing mainly on DNN-based supervised transfer learning. There are two main frameworks for this type of algorithm: fully-shared (FS) and specific-shared (SS). The biggest difference between them is that the former only considers shared representations, while the latter also models domain-specific representations. In general, SS models perform better than FS models, since FS can be regarded as a special case of SS. For SS, the ideal case is that the shared part captures what the two domains have in common while the specific parts capture their individual characteristics; in practice this is often hard to achieve, so we use an adversarial loss and domain correlation to help the model learn these two kinds of features well. Based on this, we proposed a new algorithm, hCNN-DRSS, with the following architecture:
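A minimal sketch of the specific-shared idea with an adversarial domain classifier (gradient reversal on the shared representation) is given below; it only illustrates the FS/SS distinction and the adversarial loss, not the full hCNN-DRSS architecture.

```python
# Minimal sketch of a specific-shared (SS) model with an adversarial domain
# classifier on the shared representation; it illustrates the idea described
# above, not the full hCNN-DRSS architecture.
import tensorflow as tf

@tf.custom_gradient
def gradient_reversal(x):
    def grad(dy):
        return -dy   # reverse gradients so the shared encoder fools the domain classifier
    return tf.identity(x), grad

class SpecificSharedModel(tf.keras.Model):
    def __init__(self, dim=64, num_labels=2, num_domains=2):
        super().__init__()
        self.shared_encoder = tf.keras.layers.Dense(dim, activation="relu")
        # One specific encoder per domain (source / target).
        self.specific_encoders = [tf.keras.layers.Dense(dim, activation="relu")
                                  for _ in range(num_domains)]
        self.task_head = tf.keras.layers.Dense(num_labels)
        self.domain_head = tf.keras.layers.Dense(num_domains)

    def call(self, x, domain_id=0):
        shared = self.shared_encoder(x)
        specific = self.specific_encoders[domain_id](x)
        task_logits = self.task_head(tf.concat([shared, specific], axis=-1))
        # Adversarial branch: the domain classifier sees the shared features only.
        domain_logits = self.domain_head(gradient_reversal(shared))
        return task_logits, domain_logits

model = SpecificSharedModel()
x = tf.random.normal((4, 32))
task_logits, domain_logits = model(x, domain_id=0)
print(task_logits.shape, domain_logits.shape)   # (4, 2) (4, 2)
```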
We applied this algorithm to AliMe's real business and achieved good results in multiple scenarios (AliExpress, Wanxiang, Lazada). The work was also published at WSDM 2018: Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. Jianfei Yu, Minghui Qiu, et al., WSDM 2018.
Reinforced Transfer Learning
The effectiveness of transfer learning depends heavily on the gap between the source domain and the target domain: if the gap is large, the transfer is likely to be ineffective. In the AliMe QA scenario, for example, directly transferring Quora text-matching data brings in many unsuitable samples. We therefore built a general reinforced transfer learning framework based on the actor-critic algorithm, using reinforcement learning (RL) to select samples that help the transfer learning (TL) model achieve better results. The framework consists of three parts: the base QA model, the transfer learning model (TL) and the reinforcement learning model (RL). The RL policy function is responsible for selecting high-quality samples; the TL model trains the QA model on the selected samples and provides feedback to the RL model, which updates its actions based on this feedback. Models trained with this framework achieved very good improvements in matching accuracy for both Spanish and Russian on AliExpress during the Double 11 shopping festival. We also published these results at WSDM 2019. (Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching. Chen Qu, Feng Ji, Minghui Qiu, et al., WSDM 2019.)
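A highly simplified sketch of this select-train-feedback loop is shown below: a policy scores source samples, the task model trains on the selected ones, and the change in target-domain validation accuracy serves as the reward. It uses a plain REINFORCE update on toy data rather than the paper's actor-critic setup.

```python
# Highly simplified sketch of RL-guided source-sample selection for transfer
# learning, with a REINFORCE update instead of the paper's actor-critic setup.
import tensorflow as tf

tf.random.set_seed(0)
# Toy "source domain" data and a small "target domain" validation set.
src_x = tf.random.normal((256, 16))
src_y = tf.cast(src_x[:, 0] > 0, tf.int32)
val_x = tf.random.normal((64, 16))
val_y = tf.cast(val_x[:, 0] > 0, tf.int32)

policy = tf.keras.Sequential([tf.keras.layers.Dense(1)])        # sample-selection scores
task_model = tf.keras.Sequential([tf.keras.layers.Dense(2)])    # stand-in for the QA matching model
policy_opt = tf.keras.optimizers.Adam(1e-2)
task_opt = tf.keras.optimizers.Adam(1e-2)

def val_accuracy():
    preds = tf.argmax(task_model(val_x), axis=-1, output_type=tf.int32)
    return tf.reduce_mean(tf.cast(preds == val_y, tf.float32))

policy(src_x[:1])          # build policy weights
baseline = val_accuracy()  # also builds the task model
for step in range(20):
    # Policy proposes which source samples to keep.
    with tf.GradientTape() as ptape:
        probs = tf.sigmoid(tf.squeeze(policy(src_x), -1))        # P(select each sample)
        mask = tf.cast(tf.random.uniform(probs.shape) < probs, tf.float32)
        log_prob = tf.reduce_sum(mask * tf.math.log(probs + 1e-8)
                                 + (1.0 - mask) * tf.math.log(1.0 - probs + 1e-8))
    # Train the task model on the selected source samples only.
    with tf.GradientTape() as ttape:
        losses = tf.keras.losses.sparse_categorical_crossentropy(
            src_y, task_model(src_x), from_logits=True)
        task_loss = tf.reduce_sum(mask * losses) / (tf.reduce_sum(mask) + 1e-8)
    task_opt.apply_gradients(zip(ttape.gradient(task_loss, task_model.trainable_variables),
                                 task_model.trainable_variables))
    # Feedback: improvement on the target-domain validation set is the reward.
    reward = val_accuracy() - baseline
    grads = ptape.gradient(log_prob, policy.trainable_variables)
    policy_opt.apply_gradients(zip([-reward * g for g in grads],
                                   policy.trainable_variables))

print("target-domain validation accuracy:", float(val_accuracy()))
```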
Meta Fine-tuning
Pre-trained language models have made the two-stage "pre-train then fine-tune" paradigm mainstream. We noticed that in the fine-tuning stage, model parameters are fine-tuned only on a specific domain and dataset, without considering how well the tuned model transfers across domains. The Meta Fine-tuning algorithm draws on ideas from meta learning and aims to learn a cross-domain meta-learner for pre-trained language models, so that the learned meta-learner can be quickly transferred to tasks in a specific domain. The algorithm learns the cross-domain typicality (i.e., transferability) of training samples and adds a domain corruption classifier to the pre-trained language model, so that the model learns more domain-invariant representations.
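A rough sketch of these two ingredients, per-sample typicality weights on the task loss plus an auxiliary domain classifier on the shared representation, is given below; it is a simplified illustration, not the published Meta Fine-tuning implementation.

```python
# Simplified illustration of per-sample typicality weights on the task loss plus
# an auxiliary domain classifier that pushes the encoder toward domain-invariant
# features. A sketch of the idea, not the published Meta Fine-tuning code.
import tensorflow as tf

encoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
task_head = tf.keras.layers.Dense(2)        # e.g. sentiment / NLI label
domain_head = tf.keras.layers.Dense(3)      # e.g. 3 training domains

def meta_finetune_loss(x, labels, domains, typicality, lambda_domain=0.1):
    """typicality: per-sample transferability scores in [0, 1]."""
    h = encoder(x)
    task_losses = tf.keras.losses.sparse_categorical_crossentropy(
        labels, task_head(h), from_logits=True)
    # Typical (highly transferable) samples contribute more to the task loss.
    task_loss = tf.reduce_sum(typicality * task_losses) / tf.reduce_sum(typicality)
    # Auxiliary domain loss on the same representation.
    domain_loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        domains, domain_head(h), from_logits=True))
    return task_loss + lambda_domain * domain_loss

# Toy batch drawn from 3 domains.
x = tf.random.normal((12, 32))
labels = tf.random.uniform((12,), maxval=2, dtype=tf.int32)
domains = tf.random.uniform((12,), maxval=3, dtype=tf.int32)
typicality = tf.random.uniform((12,))       # in practice predicted by the meta-learner
print(float(meta_finetune_loss(x, labels, domains, typicality)))
```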
We applied this fine-tuning algorithm to BERT and conducted experiments on multiple tasks such as natural language inference and sentiment analysis. The results show that Meta Fine-tuning outperforms both BERT's standard fine-tuning and transfer-learning-based fine-tuning on these tasks. We have also published these results at EMNLP 2020. (Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. Chengyu Wang, Minghui Qiu, Jun Huang, et al., EMNLP 2020.)
Meta-Knowledge Distillation
As pre-trained language models such as BERT achieve SOTA results on a wide range of tasks, they have become an important part of the NLP deep transfer learning pipeline. But BERT is not flawless: such models still suffer from two problems, namely too many parameters and slow training/inference, so one direction is to distill BERT's knowledge into a small model. However, most knowledge distillation work focuses on a single domain and ignores how to improve distillation across domains. We propose using meta learning to learn cross-domain transferable knowledge and to distill this transferable knowledge during the distillation stage. This approach significantly improves the effectiveness of the learned student model in the corresponding domains; we have distilled student models on multiple cross-domain tasks that approach the performance of the teacher model. We will organize this work and publish the code and paper in the near future.
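As a sketch of the general idea only (not the forthcoming code), distillation can be instance-weighted by a cross-domain transferability score so that knowledge which transfers across domains is emphasized:

```python
# Sketch of instance-weighted distillation: each sample's distillation loss is
# weighted by a cross-domain transferability score. This illustrates the general
# idea only, not the forthcoming Meta-Knowledge Distillation code.
import tensorflow as tf

def weighted_distillation_loss(student_logits, teacher_logits, labels,
                               transfer_weights, temperature=2.0, alpha=0.5):
    """transfer_weights: per-sample cross-domain transferability scores."""
    soft_t = tf.nn.softmax(teacher_logits / temperature)
    soft_s = tf.nn.softmax(student_logits / temperature)
    kd = tf.keras.losses.kl_divergence(soft_t, soft_s)                 # per-sample
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)                      # per-sample
    per_sample = alpha * kd + (1.0 - alpha) * ce
    return tf.reduce_sum(transfer_weights * per_sample) / tf.reduce_sum(transfer_weights)

# Toy batch.
teacher_logits = tf.random.normal((8, 2))
student_logits = tf.random.normal((8, 2))
labels = tf.random.uniform((8,), maxval=2, dtype=tf.int32)
weights = tf.random.uniform((8,))          # in practice produced by the meta-teacher
print(float(weighted_distillation_loss(student_logits, teacher_logits, labels, weights)))
```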
List of Innovative Papers
The EasyTransfer framework has been deployed in dozens of NLP scenarios within Alibaba Group, including intelligent customer service, search and recommendation, security and risk control, and digital media and entertainment, bringing significant improvements in business results. EasyTransfer currently serves hundreds of millions of calls per day, with an average of more than 50,000 training calls per month. While supporting these businesses, the EasyTransfer team has also accumulated many innovative algorithm solutions, including work on meta learning, multimodal pre-training, reinforced transfer learning and feature-based transfer learning, and has co-published dozens of papers at top conferences; some representative work is listed below. These algorithms will be open-sourced in the EasyTransfer framework for users to use.
[EMNLP 2020] Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. Full Paper.
[SIGIR 2020] FashionBERT: Text and Image Matching for Fashion Domain with Adaptive Loss.
[ACM MM 2020] One-shot Learning for Text Field Labeling in Structure Information Extraction. To appear, Full Oral Paper.
[IJCAI 2020] AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search.
[KDD 2019] A Minimax Game for Instance based Selective Transfer Learning. Oral.
[CIKM 2019] Cross-domain Attention Network with Wasserstein Regularizers for E-commerce Search.
[WWW 2019] Multi-Domain Gated CNN for Review Helpfulness Prediction.
[SIGIR 2019] BERT with History Answer Embedding for Conversational Question Answering.
[WSDM 2019] Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching. Full Paper.
[ACL 2018] Transfer Learning for Context-Aware Question Matching in Information-seeking Conversation Systems in E-commerce.
[SIGIR 2018] Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. Long Paper.
[WSDM 2018] Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. Long Paper.
[CIKM 2017] AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience. Demo Paper, Best Demo Award.
[ICDM 2017] A Short-Term Rainfall Prediction Model using Multi-Task Convolutional Neural Networks. Long Paper.
[ACL 2017] AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine.
[arXiv] KEML: A Knowledge-Enriched Meta-Learning Framework for Lexical Relation Classification.