Much of the recent progress in natural language processing has come from large-scale language models. Each newly released model pushes parameter counts and training-data volumes to new highs, and at the same time sweeps the existing benchmark leaderboards!
For example, in April this year, Google released PaLM, a language model with 540 billion parameters.
Similarly, scale works wonders for vision-language models too: performance can be improved simply by scaling up the model.
Of course, a vision-language model that only handles multiple tasks is still not very general; it also needs to support input and output in multiple languages.
Recently, Google extended PaLM into PaLI (Pathways Language and Image Model), which can understand both multiple languages and images. It supports 100+ languages and performs a wide range of applications spanning vision, language, and multimodal combinations of images and language, such as visual question answering, image captioning, object detection, image classification, OCR, text reasoning, and more.
Paper link: https://arxiv.org/abs/2209.06794
The model is trained on an image-text collection automatically crawled from the public web, covering 109 languages; in this article it is referred to as the WebLI dataset.
The PaLI model pre-trained on WebLI achieves state-of-the-art performance on multiple image-and-language benchmarks, such as COCO-Captions, TextCaps, VQAv2, OK-VQA, and TextVQA, and also surpasses previous models on multilingual visual captioning and visual question-answering benchmarks.
model architecture
One goal of PaLI is to study whether language and vision models show the same relationship between performance and scale, and in particular how well language-image models scale.
So the architecture of the model is deliberately kept simple, mainly to make experiments convenient, and in particular to keep the model reusable and scalable.
The model consists of a Transformer encoder that processes the input text and an autoregressive Transformer decoder that generates the output text. When processing an image, the encoder's input also includes "visual words": the patch embeddings of the image produced by a ViT.
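To make this concrete, here is a minimal, shape-level sketch in JAX of how visual words and text embeddings are combined before the encoder. The `patchify` function and the single linear projection are simplified stand-ins for ViT-e, and the dimensions are toy values rather than PaLI's actual configuration.

```python
import jax
import jax.numpy as jnp

d_model, patch_size = 512, 16   # toy values, not PaLI's real dimensions

def patchify(image):
    """Split an image of shape (H, W, C) into flattened non-overlapping patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

key = jax.random.PRNGKey(0)
image = jax.random.uniform(key, (224, 224, 3))
text_embeddings = jax.random.normal(key, (64, d_model))   # stand-in for mT5 token embeddings

# A single linear projection stands in for the ViT; the real model uses ViT-e.
w_proj = jax.random.normal(key, (patch_size * patch_size * 3, d_model)) * 0.02
visual_tokens = patchify(image) @ w_proj                   # (196, d_model) "visual words"

# Visual words are concatenated with the text embeddings and fed to the Transformer
# encoder; the autoregressive decoder then generates the output text from it.
encoder_input = jnp.concatenate([visual_tokens, text_embeddings], axis=0)
print(encoder_input.shape)                                 # (260, 512)
```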
A key design choice of the PaLI model is reuse: the researchers seed the model with the weights of previously trained single-modality vision and language models (such as mT5-XXL and large ViTs). This reuse not only transfers the capabilities learned in single-modality training, it also saves compute. The visual component of the model uses ViT-e, the largest ViT architecture to date. It has the same structure as the 1.8-billion-parameter ViT-G model and uses the same training recipe; the difference is that it is scaled up to 4 billion parameters.
Although scaling laws have been studied separately in the vision and language domains, scaling behavior in combined vision-language models has received little attention. Scaling up the visual backbone alone can also lead to saturating gains on classification tasks.
The researchers confirmed this further: ViT-e is only slightly better than ViT-G on ImageNet, yet it brings large improvements on PaLI's vision-language tasks. For example, ViT-e scores nearly 3 more CIDEr points than ViT-G on the COCO captioning task. This also hints at the headroom for using even larger ViT backbones in vision-language tasks in the future.
The researchers used an mT5 backbone as the language-modeling component, initializing PaLI's language encoder-decoder with pre-trained mT5-Large (1 billion parameters) and mT5-XXL (13 billion parameters), and then continued mixed training on many language tasks, including pure language-understanding tasks. This helps avoid catastrophic forgetting of mT5's language understanding and generation abilities.
Finally, three PaLI models of different sizes were obtained.
a dataset in 109 languages
Scaling studies in deep learning show that the larger the model, the larger the training dataset it needs.
So, to fully study and unlock the potential of language-image pre-trained models, the researchers crawled a large amount of image and text data from the Internet and built a new dataset, WebLI, which includes 12 billion alt-texts and 10 billion images across 109 languages.
In addition to annotating images with the surrounding web text, the researchers also ran OCR over the images using the Cloud Vision API, yielding 29 billion image-OCR data pairs.
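As a rough illustration of this annotation step, the sketch below shows how one might OCR a single image with the Cloud Vision API's Python client; the exact pipeline used for WebLI is not described in the paper, so treat the `ocr_image` helper and this particular client usage as assumptions for illustration only.

```python
# Assumes the `google-cloud-vision` client library and valid credentials are set up.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_image(image_bytes: bytes) -> str:
    """Return the full text detected in a single image, or an empty string."""
    response = client.text_detection(image=vision.Image(content=image_bytes))
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""

# Pairing each crawled image with its OCR output yields one image-OCR data pair.
```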
Near-duplicate detection was used to deduplicate the crawled images against the training, validation, and test splits of 68 common vision and vision-language datasets, to avoid data leakage into downstream evaluation tasks.
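The paper does not specify which near-duplicate method was used; as one plausible sketch, perceptual hashing (here via the `imagehash` and `Pillow` libraries, with an illustrative distance cutoff) can flag crawled images that are close to any image from a downstream evaluation split.

```python
from PIL import Image
import imagehash

def build_eval_index(eval_image_paths):
    """Perceptually hash every image from the downstream train/val/test splits."""
    return {imagehash.phash(Image.open(p)) for p in eval_image_paths}

def is_near_duplicate(image_path, eval_index, max_distance=4):
    """True if the image is within a small Hamming distance of any evaluation image."""
    h = imagehash.phash(Image.open(image_path))
    return any(h - e <= max_distance for e in eval_index)

# Crawled images for which is_near_duplicate(...) is True would be dropped.
```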
To further improve data quality, the researchers also scored each image and alt-text pair by cross-modal similarity and tuned the threshold so that only 10% of the images were retained. In total, about 1 billion images were used to train PaLI.
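A minimal sketch of this filtering step, assuming hypothetical similarity scores from some contrastive image-text scorer (the article does not prescribe the scorer) and a threshold chosen to retain roughly the top 10% of pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1_000_000)   # placeholder: one cross-modal score per (image, alt-text) pair

keep_fraction = 0.10                  # keep roughly the top 10% of pairs
threshold = np.quantile(scores, 1.0 - keep_fraction)
kept = scores >= threshold

print(f"kept {kept.mean():.1%} of pairs at threshold {threshold:.3f}")
```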
training the big model
Since vision-language tasks are multimodal, the model needs multiple kinds of semantic processing ability, and different tasks have different objectives. For example, some tasks require localizing objects to be solved accurately, while others need more global semantic information.
Similarly, some language tasks call for long answers, while others require concise ones.
To handle all of these divergent objectives, the researchers exploited the richness of the WebLI pre-training data and introduced a mixture of pre-training tasks (Pretraining Task Mixture) to prepare the model for a variety of downstream applications.
To make the model general enough to solve many tasks, the authors cast all tasks into a single common API (input: image + text; output: text), which allows knowledge to be shared between many image and language tasks; this format is also shared with the pre-training setup.
The objectives used for pre-training are cast into the same API as a weighted mixture, with the aim of both preserving the abilities of the reused model components and training the model to perform new tasks.
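The sketch below illustrates the idea of a weighted task mixture cast into a single image-plus-text-to-text interface; the task names, weights, and prompt strings here are invented for illustration and do not match the actual mixture specified in the paper.

```python
import random

# Illustrative tasks and mixing weights only; the real PaLI mixture differs.
TASK_MIXTURE = {
    "text_span_corruption": 0.2,
    "multilingual_captioning": 0.3,
    "ocr_text_generation": 0.2,
    "vqa_and_visual_reasoning": 0.3,
}

def sample_task(rng: random.Random) -> str:
    """Pick the next pre-training task according to the mixture weights."""
    tasks, weights = zip(*TASK_MIXTURE.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

def to_common_api(task: str, image, text: str) -> dict:
    """Cast any task into the shared interface: input = image + prompt text, output = text."""
    prompts = {                                   # made-up prompt templates
        "text_span_corruption": "fill in the masked spans",
        "multilingual_captioning": "generate the alt_text in <lang>",
        "ocr_text_generation": "generate the ocr_text in <lang>",
        "vqa_and_visual_reasoning": "answer the question: <question>",
    }
    return {"image": image, "input": prompts[task], "target": text}

rng = random.Random(0)
print([sample_task(rng) for _ in range(8)])
```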
The model is trained in JAX using the open-source T5X and Flaxformer frameworks, and the ViT-e visual component uses the open-source BigVision framework. The word embeddings from the language component and the patch embeddings produced by the visual component are concatenated and fed together into the multimodal encoder-decoder, which is initialized from the pre-trained mT5-XXL. During PaLI training, the weights of the visual component are frozen, and only the weights of the multimodal encoder-decoder are updated.
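One common way to implement "freeze the visual tower, update only the encoder-decoder" in JAX with optax is to zero out the updates for the frozen parameter subtree, as in the sketch below; the parameter names, shapes, and Adafactor settings are illustrative toys, not PaLI's actual configuration.

```python
import jax
import jax.numpy as jnp
import optax

# Toy parameter tree standing in for PaLI's components (names are illustrative).
params = {
    "vit": {"w": jnp.ones((4, 4))},               # visual component: frozen
    "encoder_decoder": {"w": jnp.ones((4, 4))},   # multimodal encoder-decoder: trained
}

# Zero the updates for the frozen subtree, apply Adafactor everywhere else.
labels = {"vit": "frozen", "encoder_decoder": "trainable"}
tx = optax.multi_transform(
    {"trainable": optax.adafactor(learning_rate=1e-3), "frozen": optax.set_to_zero()},
    labels,
)
opt_state = tx.init(params)

def loss_fn(p):
    # Dummy loss; a real run would compute the text-generation loss here.
    return jnp.sum(p["vit"]["w"] @ p["encoder_decoder"]["w"])

grads = jax.grad(loss_fn)(params)
updates, opt_state = tx.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
# params["vit"] is unchanged; only the encoder-decoder weights moved.
```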
In the experiments, the researchers compared PaLI on common vision-language benchmarks; the PaLI model achieves state-of-the-art results on these tasks, even surpassing the very large models proposed in earlier work.
For example, the 17-billion-parameter PaLI outperforms the 80-billion-parameter Flamingo model on several VQA and image-captioning tasks.
PaLI also maintains good performance on language-only and vision-only tasks, even though these are not its main training objectives.
The article also examines how the image and language components interact as the model is scaled up, and where scaling yields the greatest benefits.
The final conclusion is that jointly scaling both components produces the best performance. In particular, scaling the visual component, which accounts for relatively few parameters, is the most critical; scaling is also important for improving performance on multilingual tasks.
Evaluating PaLI on Crossmodal-3600, a benchmark covering 35 languages, shows that multilingual captioning tasks benefit even more from scaling up the PaLI model.
To avoid unfair bias in large language and image models, it is necessary to be transparent about how the data and models are used, to test the models for fairness, and to carry out responsible data analysis. The article therefore provides both a Data Card and a Model Card.
Reference:
https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html