Vision Transformers are promising


Research related to Vision Transformers has been very popular recently. I came across this article not long ago, and I personally feel its explanation is quite accessible, with many illustrations to help with understanding.

Therefore, I spent a fair amount of time translating it (the article is about 6,700 words). If you find it helpful, please give it a like, share it, and follow. Happy weekend!


Introduction to Transformers

The Transformer is a very powerful deep learning model that has become the standard for many natural language processing tasks and is now poised to completely change the field of computer vision as well.


It all started in 2017, when Google Brain published a paper destined to change everything: Attention Is All You Need [4]. Researchers applied this new architecture to several natural language processing problems, and it soon became clear to what extent it could overcome some of the limitations that plague RNNs, which were typically used for tasks such as translating from one language to another.

Over the years, the Transformer has become a pillar of natural language processing. In 2020, Google Brain asked: would Transformers be equally effective on images? The answer was yes, and the Vision Transformer was born. After some preliminary preprocessing of the images, the researchers were able to reuse the classic Transformer structure and quickly reached the state of the art on many problems in this field.

Excitingly, a few months later, in early 2021, Facebook researchers released a new version of the Transformer, this time aimed specifically at video: the TimeSformer. Even in this case, after some minor structural changes, the architecture quickly became a winner in the video field. In February 2021, Facebook announced that it would combine it with the videos on its social networks to create new models for various purposes.

Why do we need transformers?

But let’s take a step back and explore the motivations that prompted Google researchers to find new alternative architectures to solve natural language processing tasks.


Traditionally, tasks like translation have been done with recurrent neural networks (RNNs). As we all know, recurrent neural networks have many problems, and one of the main ones is their sequential operation. For example, to translate a sentence from English to Italian with this type of network, the first word of the sentence to be translated is passed to the encoder together with an initial state, then the resulting state is passed to a second encoder together with the second word of the sentence, and so on until the last word. The resulting state of the last encoder is then passed to the decoder, which returns as output both the first translated word and a new state; that state is passed to another decoder, and so on.

The problem here is obvious: to complete the next step, I must have the result of the previous one. This is a big flaw, because it does not take advantage of the parallelization capabilities of modern GPUs, so performance suffers. There are other problems as well, such as exploding gradients and the inability to detect dependencies between distant words in the same sentence.

Attention is all you need?

So the question arises: is there a mechanism that allows us to compute in parallel while still extracting the information we need from the sentence? The answer is yes, and this mechanism is attention.


If we had to define attention, setting aside any technical or implementation aspect for a moment, how would we go about it?

Let us take an example sentence and focus on the word "gave". Which words in the sentence should I look at to enrich the meaning of this word? I might ask myself a series of questions. For example: who gave it? In that case I would focus on the word "I". Then I might ask: gave it to whom? and focus my attention on the word "Charlie". Finally I might ask: what was given? and focus on the word "food".

By asking myself these questions, perhaps for every word in the sentence, I might be able to understand its meaning and nuances. The question now is: how do we implement this concept in practice?

To understand how attention is calculated, we can compare it with the world of databases. When we query a database, we submit a query (Q) and search among the available data for one or more keys that satisfy it. The output is the value associated with the key most relevant to the query.
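
As a rough illustration of the analogy (the keys, values and numbers below are purely made up), a hard database lookup would return only the value of the single best-matching key, whereas attention returns a softmax-weighted mix of all the values:

```python
import numpy as np

# A toy "database": one key per row and the value associated with each key.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([0.9, 0.1])

scores = keys @ query                          # similarity of the query to every key

# Hard lookup: keep only the value of the best-matching key.
hard_result = values[np.argmax(scores)]

# Attention-style soft lookup: a weighted mix of all values, weights summing to 1.
weights = np.exp(scores) / np.exp(scores).sum()
soft_result = weights @ values

print(hard_result, soft_result)
```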



The situation with attention is very similar. We first regard the sentence for which attention is to be calculated as a set of vectors, where each word is encoded into a vector through a word-embedding mechanism. We treat these vectors as the keys to search among. The query we search with can be a word from the same sentence (self-attention) or from another sentence. At this point, we need to calculate the similarity between the query and each available key, which is done mathematically with a scaled dot product. This process returns a series of real values that may differ widely from one another, but since we want weights between 0 and 1 that sum to 1, we apply a softmax to the result. Once the weights are obtained, we multiply the vector representing each word by its weight, that is, by its relevance to the query, and finally return the combination of these products as the attention vector.
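
A minimal sketch of this computation in Python, assuming the word vectors are already given (shapes and sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # scaled dot product: query/key similarity
    weights = softmax(scores, axis=-1) # each row of weights sums to 1
    return weights @ V                 # weighted combination of the values

# Self-attention: the sentence's own word vectors act as queries, keys and values.
sentence = np.random.randn(5, 8)       # 5 words, embedding size 8
attended = scaled_dot_product_attention(sentence, sentence, sentence)
print(attended.shape)                  # (5, 8)
```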

To build this mechanism we use linear layers which, starting from the input vectors, generate the keys, queries, and values through matrix multiplications. The combination of keys and queries allows the best matches between the two sets to be found, and the result is then combined with the values to obtain the most relevant combination.


If we want to focus on a single word, this mechanism is sufficient. But what if we want to look at the sentence from several points of view, computing attention several times in parallel? We use so-called multi-headed attention, which has a similar structure; the results of the individual heads are simply combined at the end to return a single summary vector of all the computed attentions.
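
A compact sketch of multi-headed self-attention in PyTorch, including the linear layers mentioned above that produce the queries, keys and values (dimensions are illustrative; a production implementation would typically use torch.nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Linear layers that turn the input vectors into queries, keys and values.
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)      # combines the heads into one summary vector

    def forward(self, x):                   # x: (batch, seq_len, dim)
        b, n, d = x.shape
        qkv = self.to_qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (b, heads, n, head_dim)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = scores.softmax(dim=-1)
        heads = weights @ v                             # one attention result per head
        return self.out(heads.transpose(1, 2).reshape(b, n, d))

x = torch.randn(1, 10, 64)                 # 10 word vectors of size 64
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([1, 10, 64])
```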



Now that we have understood the mechanism on which everything is based and established that it can be computed in parallel, let us analyze the structure of the transformer in which multi-headed attention is embedded, and see what it is composed of.


Considering that ours is still a translation task, let us first focus on the left part of the image, the encoder part, which takes as input the entire sentence to be translated from English to Italian. Here we already see a huge revolution compared with the RNN approach, because the sentence is not processed word by word but submitted in its entirety. Before the attention calculation, the vectors representing the words are combined with a positional encoding mechanism based on sines and cosines, which embeds into each vector information about the position of its word in the sentence. This is very important because we know that in any language the position of a word in a sentence is highly relevant, and this information must not be lost if we want a correct evaluation. All of this is passed to a multi-head attention mechanism, whose result is normalized and passed to a feed-forward layer. The encoding can be repeated N times to obtain more meaningful information.
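
A minimal sketch of the sine- and cosine-based positional encoding, following the formula in the original paper (sequence length and embedding size are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, dim):
    """Sinusoidal positions as in "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))"""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, dim, 2) / dim)          # (dim/2,)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

word_vectors = np.random.randn(12, 64)                   # 12 words, embedding size 64
word_vectors = word_vectors + positional_encoding(12, 64)  # inject position information
```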

But the sentence to be translated is not the only input to the transformer. There is a second block, the decoder, which receives the output of the previous execution of the transformer. For example, if we assume we have already translated the first two words and want to predict the third word of the Italian sentence, we pass the first two translated words to the decoder. Positional encoding and multi-head attention are performed on these words, and the result is combined with the encoder's result. Attention is computed again on this combination, and the result, passed through a linear layer and a softmax, becomes a vector of potential candidates for the next translated word, each with an associated probability. In the next iteration, the decoder receives this new word in addition to the previous ones.
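
A schematic of this iterative decoding, assuming a hypothetical trained transformer object with encode and decode methods (the names here are placeholders, and a real system would typically use beam search rather than this greedy sketch):

```python
import torch

def greedy_translate(transformer, src_tokens, bos_id, eos_id, max_len=50):
    """Encode the source sentence once, then grow the translation one word per
    iteration, feeding the words produced so far back into the decoder."""
    memory = transformer.encode(src_tokens)              # hypothetical encoder call
    translated = [bos_id]                                # start-of-sentence token
    for _ in range(max_len):
        logits = transformer.decode(torch.tensor([translated]), memory)  # hypothetical
        probs = logits[0, -1].softmax(dim=-1)            # probabilities over the vocabulary
        next_id = int(probs.argmax())                    # most likely next translated word
        translated.append(next_id)
        if next_id == eos_id:                            # stop at end-of-sentence
            break
    return translated[1:]
```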

This structure has therefore proven to be very effective and high-performing, because it processes the entire sentence rather than word by word, it retains information about the position of each word in the sentence, and it exploits the attention mechanism's ability to express the content of a sentence effectively.


After all these glowing explanations, you might think that the transformer is perfect, without any flaws. Obviously it is not. One of its strengths is also its weakness: the attention computation!


In order to calculate the attention of each word with respect to all the others, I have to perform N² calculations. Even if parts of this can be parallelized, it is still very expensive. With this complexity, imagine what it means to compute attention several times over a text of a few hundred words.

From the graph you can imagine a matrix that must be filled with the attention value of each word with respect to every other word, which is obviously expensive. It should be pointed out that on the decoder, masked attention is usually computed, which avoids calculating the attention between a query word and all the words that follow it.
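
A small sketch of this masked (causal) attention in Python, reusing the idea from the earlier attention function (shapes are illustrative):

```python
import numpy as np

def masked_self_attention(X):
    """Causal ("masked") self-attention: each word may attend to itself and to
    the words that precede it, but never to the words that follow it."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores[mask] = -np.inf                             # forbid looking at later words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

print(masked_self_attention(np.random.randn(6, 8)).shape)   # (6, 8)
```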


Some people may argue: if many of the benefits of the transformer come from the attention mechanism, do we really need all of the structure described above? After all, didn't the first Google Brain paper of 2017 say that "attention is all you need" [4]? That claim is certainly legitimate, but in March 2021 Google researchers published another paper entitled "Attention is not all you need" [6]. What does that mean? The researchers ran experiments analyzing the behaviour of the self-attention mechanism without any of the other components of the transformer, and found that it converges to a rank-1 matrix at a doubly exponential rate. In other words, the mechanism on its own is practically useless. So why is the transformer so powerful? It is due to a tug-of-war between the self-attention mechanism, which tends to reduce the rank of the matrix, and the other two components of the transformer, the skip connections and the MLP.


The former allow the distribution of paths through the network to diversify, avoiding them all being the same, which greatly reduces the probability of the matrix collapsing to rank 1. Thanks to its nonlinearity, the MLP can then increase the rank of the resulting matrix. By contrast, normalization has been shown to play no role in preventing this behaviour of the self-attention mechanism. So attention is not all you need, but the transformer architecture manages to exploit its strengths to achieve impressive results.
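
The effect is easy to observe numerically. The toy experiment below repeatedly applies a bare self-attention step, with no skip connections and no MLP, and measures how far the result is from the nearest rank-1 matrix; it is only an illustrative sketch of the phenomenon described in [6], not the paper's exact setup:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))                     # 16 tokens of dimension 8

for layer in range(1, 9):
    X = softmax(X @ X.T / np.sqrt(X.shape[1])) @ X   # pure self-attention, nothing else
    s = np.linalg.svd(X, compute_uv=False)
    # Relative distance from the nearest rank-1 matrix (0 means exactly rank 1).
    residual = np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())
    print(f"layer {layer}: distance from rank 1 = {residual:.2e}")
# The printed distance shrinks towards zero layer after layer.
```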

Vision Transformers

By 2020, Google researchers were asking themselves: "If Transformers have proven so effective in natural language processing, how will they handle images?" A bit as in NLP, we start from the concept of attention, but this time applied to images. Let us try to understand it through an example.


Image from "An Image is Worth 16x16 Words" (Dosovitskiy et al.)

If we consider a photo of a dog standing in front of a wall, any of us would say that it is a "picture of a dog" and not a "picture of a wall". That is because we focus on the main, distinguishing subject of the image, and this is exactly what the attention mechanism applied to images does.

Since we understand that the concept of attention can also be extended to images, we only need to find a way to feed an image into a classic transformer.

We know that the transformer takes vectors representing text as input, so how can we convert an image into vectors? Of course, the first solution would be to take all the pixels of the image and flatten them into a single vector. But let us stop and see what happens if we choose this option.


We said before that the computational complexity of attention is O(N²), which means that if we had to compute the attention of each pixel with respect to every other pixel, then even in a low-resolution image of 256x256 pixels the computational cost would be enormous, absolutely impossible to handle with current resources. So this approach is definitely not feasible.
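
To make the order of magnitude concrete, here is a quick back-of-the-envelope count:

```python
# Treating every pixel of a 256x256 image as a token:
n_pixels = 256 * 256                  # 65,536 "words"
print(n_pixels ** 2)                  # 4,294,967,296 query/key pairs per attention layer

# Versus the Vision Transformer approach with 16x16-pixel patches:
n_patches = (256 // 16) ** 2          # 256 tokens
print(n_patches ** 2)                 # 65,536 query/key pairs
```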

The solution is very simple. In the paper "An Image is Worth 16x16 Words" [2], it is proposed to divide the image into patches and then use a linear projection to convert each patch into a vector, mapping the patch into a vector space.


Now we just need to look at the architecture of Vision Transformer.


The image is divided into small patches, which are turned into vectors by a linear projection. These vectors are combined with information about the position of the patches in the image and submitted to a classic transformer. Adding information about the original position of each patch in the image is essential, because this information would otherwise be lost during the linear projection, even though it is important for fully understanding the content of the image. An additional vector is inserted that is independent of the image being analyzed and is used to gather global information about the entire image: in fact, the output corresponding to this vector is the only one considered and passed to the MLP, which returns the predicted class.
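
A minimal sketch of this patch-embedding step in PyTorch, with the extra class vector and the position information described above (sizes follow a common ViT setup but are only illustrative; the real implementation in [2] uses a learned initialization and a full transformer encoder on top):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, flatten and linearly project each patch,
    then prepend a class token and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        n_patches = (img_size // patch_size) ** 2
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                       # x: (batch, 3, H, W)
        b, c, h, w = x.shape
        p = self.patch_size
        # (b, c, h/p, p, w/p, p) -> (b, n_patches, p*p*c)
        patches = x.reshape(b, c, h // p, p, w // p, p)
        patches = patches.permute(0, 2, 4, 3, 5, 1).reshape(b, -1, p * p * c)
        tokens = self.proj(patches)             # linear projection of each patch
        cls = self.cls_token.expand(b, -1, -1)  # extra vector for global image info
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)                # torch.Size([2, 197, 768])
```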

However, in this process a lot of information is lost. In fact, in the conversion from patch to vector, any information about the position of the pixels within the patch is lost. The authors of Transformer-in-Transformer (TnT) [3] pointed out that this is a serious matter, because for a quality prediction we do not want to lose the arrangement of the pixels within the portions of the image being analyzed.

The authors of TnT therefore asked themselves: is it possible to find a better way to submit the vectors to the transformer? Their suggestion is to take each individual patch (p x p) of the image, which is itself an image on 3 RGB channels, and convert it into a c-channel tensor. This tensor is then divided into p' parts, with p' smaller than p.


These parts are then concatenated and linearly projected so as to have the same size as the vector obtained from the linear projection of the original patch, and they are combined with it.
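
A rough sketch of this combination step (a simplification: the real Transformer-in-Transformer [3] also runs an inner transformer over the sub-patch vectors; the sub-patch size and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class TnTStylePatchEmbedding(nn.Module):
    """For each p x p patch: also embed its smaller sub-patches, concatenate and
    project them, and add the result to the ordinary patch vector."""
    def __init__(self, patch=16, sub=4, in_chans=3, dim=384):
        super().__init__()
        n_sub = (patch // sub) ** 2
        self.patch, self.sub = patch, sub
        self.outer_proj = nn.Linear(patch * patch * in_chans, dim)       # classic ViT path
        self.inner_proj = nn.Linear(sub * sub * in_chans, dim // n_sub)  # per sub-patch
        self.fuse = nn.Linear(n_sub * (dim // n_sub), dim)               # concat -> same size

    def forward(self, patches):                # patches: (batch, n_patches, 3, p, p)
        b, n, c, p, _ = patches.shape
        s = self.sub
        outer = self.outer_proj(patches.reshape(b, n, -1))
        subs = patches.reshape(b, n, c, p // s, s, p // s, s)
        subs = subs.permute(0, 1, 3, 5, 4, 6, 2).reshape(b, n, -1, s * s * c)
        inner = self.fuse(self.inner_proj(subs).reshape(b, n, -1))
        return outer + inner                   # pixel arrangement now influences the token

patches = torch.randn(2, 196, 3, 16, 16)
print(TnTStylePatchEmbedding()(patches).shape)   # torch.Size([2, 196, 384])
```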


In this way the input vectors of the transformer are also influenced by the arrangement of the pixels within the patches, and the authors managed to further improve performance on various computer vision tasks.

TimeSformers

Given the great success of transformers in NLP and then, adapted to images, in computer vision, in 2021 Facebook researchers tried to apply this architecture to video.

Intuitively, this is obviously possible, because we all know that a video is just a set of frames one after another, and a frame is just an image.



There is only one small detail that makes videos different from images: you have to consider not only space but also time. In this case, when we compute attention, we cannot treat the frames as isolated images; we need some form of attention that takes into account the changes occurring between consecutive frames, because they are central to evaluating a video.

To solve this problem, the authors propose several new attention mechanisms, from one that focuses only on space, used mainly as a baseline, to mechanisms that are axial, sparse, or that combine space and time in different ways.


Image from "An Image Is Worth 16x16 words" (Dosovitskiy et al)

However, the method that gives the best results is Divided Space-Time Attention. It consists, given a frame at instant t and one of its patches as a query, in computing the spatial attention over the entire frame, and then the temporal attention on the patch in the same position as the query in the previous and the following frames.
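
A simplified sketch of the "divided" idea in Python: attention is computed once within each frame (space) and once across frames at the same patch position (time), instead of over every patch of every frame at once. This only illustrates the factorization and is not the exact TimeSformer block:

```python
import numpy as np

def attention(Q, K, V):
    d = K.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def divided_space_time_attention(x):
    """x: (frames, patches, dim) patch embeddings of a video clip."""
    # Spatial attention: within each frame, every patch attends to that frame's patches.
    x = attention(x, x, x)
    # Temporal attention: for each patch position, attend across the frames.
    x_t = np.transpose(x, (1, 0, 2))                 # (patches, frames, dim)
    x_t = attention(x_t, x_t, x_t)
    return np.transpose(x_t, (1, 0, 2))              # back to (frames, patches, dim)

video = np.random.randn(8, 196, 64)                  # 8 frames of 196 patch vectors, size 64
print(divided_space_time_attention(video).shape)     # (8, 196, 64)
```

The point of the factorization is cost: it requires on the order of frames x patches² plus patches x frames² comparisons instead of (frames x patches)², which is what makes space-time attention affordable.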

But why is this method so effective? The reason is that it learns more independent features than other methods, so it can better understand different categories of videos. We can see this in the visualization below, where each video is represented by a point in space, and its color indicates the category it belongs to.


Image from “An Image Is Worth 16x16 words” (Dosovitskiy et al)

The higher the resolution, the better the accuracy of the model. As for the number of frames, accuracy also increases as their number grows. Interestingly, it was not possible to test with more frames than shown in the figure, so the potential accuracy of this approach could still improve; we have not yet found an upper bound on this improvement.


In Vision Transformers, a larger training dataset usually leads to better accuracy. The authors also checked this on TimeSformers and, as the number of training videos considered increases, so does the accuracy.


Conclusions

What should we expect now? Transformers have just landed in the field of computer vision and seem determined to replace traditional convolutional networks, or at least to carve out an important role for themselves in this field. The scientific community is therefore in ferment, trying to further improve transformers, combine them with various techniques, and apply them to real problems, finally being able to do things that were impossible until recently. Big companies like Facebook and Google are actively developing and applying transformers, and we have probably only scratched the surface.

References and insights

[1] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. "Is Space-Time Attention All You Need for Video Understanding?"

[2] Alexey Dosovitskiy et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

[3] Kai Han et al. "Transformer in Transformer".

[4] Ashish Vaswani et al. "Attention Is All You Need".

[5] Qizhe Xie et al. "Self-training with Noisy Student improves ImageNet classification".

[6] Yihe Dong et al. "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth".

[7] Nicola Messina et al. "Transformer Reasoning Network for Image-Text Matching and Retrieval".

[8] Nicola Messina et al. "Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders".
