In short: the best GPUs overall are the RTX 3080 and the RTX 3090.

Deep learning is an area with very high computing requirements, and your choice of GPU will fundamentally determine your deep learning experience. But if you want to buy a new GPU, what are the main features to look at? GPU memory, CUDA cores, tensor cores? How do you make a cost-effective choice? This article digs into these questions, clears up common misconceptions, gives you an intuitive understanding of how to think about GPUs, and offers advice to help you make the right choice.

A heads-up: this article is about 15,000 words long, so it takes a while to read. You may want to bookmark it and read it at your own pace.

If you don't want to read the whole article, here are the conclusions first.

Best GPUs overall: RTX 3080 and RTX 3090.

GPUs to avoid (as an individual): any Tesla card; any Quadro card; any Founders Edition card; Titan RTX, Titan V, Titan XP.

Cost-effective but expensive: RTX 3080.

Cost-effective and cheaper: RTX 3070, RTX 2060 Super.

I don't have much money: buy used cards: RTX 2070 ($400), RTX 2060 ($300), GTX 1070 ($220), GTX 1070 Ti ($230), GTX 1650 Super ($190), GTX 980 Ti (6GB, $150).

I have almost no money: many startups promote their cloud services; use free cloud credits and switch back and forth between accounts of different companies until you can afford a GPU.

I do Kaggle: RTX 3070.

I am a competitive computer vision, pre-training, or machine translation researcher: 4x RTX 3090. Be sure to wait until good cooling solutions and sufficient power supplies are confirmed (I will update this blog post).

I am an NLP researcher: if you are not working on machine translation, language modeling, or any kind of pre-training, an RTX 3080 is enough and quite cost-effective.

I just got into deep learning and I am serious about it: start with an RTX 3070. If you are still serious after 6 to 9 months, sell your RTX 3070 and buy 4x RTX 3080. Depending on the field you choose next (startup, Kaggle, research, applied deep learning), sell your GPUs and buy something more suitable after about three years (the next-generation RTX 40 series).

I want to try deep learning, but I am not serious about it: the RTX 2060 Super is excellent, but it may require a new power supply. If your motherboard has a PCIe x16 slot and a power supply of about 300W, the GTX 1050 Ti is a great choice because it does not require any other new computer components.

Model-parallel GPU clusters with fewer than 128 GPUs: if you are allowed to buy RTX GPUs for your cluster: 66% 8x RTX 3080 and 33% 8x RTX 3090 (only if efficient cooling is guaranteed). If cooling of the RTX 3090s cannot be solved, buy 33% RTX 6000 GPUs or 8x Tesla A100s instead. If you cannot buy RTX GPUs, I would choose 8x A100 Supermicro nodes or 8x RTX 6000 nodes.

Model-parallel GPU clusters with 128 or more GPUs: consider 8x Tesla A100 setups. If you use more than 512 GPUs, you should consider a DGX A100 SuperPOD system matched to your scale.

Main text

This blog post is designed to give you several levels of understanding of GPUs and the new NVIDIA Ampere series GPUs.

(1) If you are not interested in the details of how the GPU works, what makes the GPU faster, what is unique to the NVIDIA RTX 30 Ampere series GPUs, you can jump to the Performance and Per Dollar Performance Chart and Recommendations section. These are the core and most valuable content of this article.

(2) If you are concerned about specific questions, I answered the most common questions in the last part of this blog post to remove some misunderstandings.

(3) If you want to have a deeper understanding of how GPUs and tensor cores work, it is best to read this article from start to finish. Depending on your knowledge of related topics, you can skip a section or two.

I will add a short summary at the beginning of each major section, hoping it will help you decide whether to read this section.

Overview

The structure of this article is as follows.

First, I'll explain what makes a GPU fast. I will compare CPUs and GPUs, and discuss tensor cores, memory bandwidth, the GPU memory hierarchy, and how they relate to deep learning performance. These explanations may help you get a more intuitive sense of what to look for in a GPU.

Then I will make theoretical estimates of GPU performance and align them with some of NVIDIA's marketing benchmarks to obtain reliable, unbiased performance data. I'll discuss the unique features of the new NVIDIA RTX 30 Ampere series GPUs that are worth considering if you buy a GPU.

Based on this, I make GPU recommendations for 1-2, 4, and 8 GPU setups and for GPU clusters. After that comes the Q&A section, where I answer common questions asked on Twitter; there I also address common misconceptions and various other topics, such as cloud vs. desktop, cooling, AMD vs. NVIDIA, and more.

How does a GPU work?

If you use GPUs frequently, it is useful to understand how they work. This knowledge helps you understand why GPUs are slow in some cases and fast in others, and why you need a GPU in the first place and how other future hardware options might compete. If you just want useful performance numbers and arguments to help you decide which GPU to buy, you can skip this section. The best high-level explanation of how GPUs work is my answer on Quora:

"Why GPUs are suitable for deep learning": https://timdettmers.com/Why-are-GPUs-well-suited-to-deep-learning?top_ans=21379913

This is a high-level explanation that illustrates well why GPUs are better suited to deep learning than CPUs. If we look at the details, we can understand what makes one GPU better than another.

The most important GPU parameters related to processing speed

This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you evaluate current and future GPUs on your own.

Tensor cores

Key points:

  • Tensor cores reduce the cycles needed for the multiply-and-add operations 16-fold: in my example, for a 32×32 matrix, from 128 cycles to 8 cycles.
  • Tensor cores reduce the reliance on repeated shared memory access, thus saving additional cycles for memory access.
  • Tensor cores are so fast that computation is no longer a bottleneck. The only bottleneck is getting data to the tensor cores.

GPUs with tensor cores are now cheap enough that almost everyone can afford one. That is why I only recommend GPUs with tensor cores. Understanding how they work helps to appreciate the importance of these computational units, which are specialized for matrix multiplication. Here I will show a simple example of A*B=C matrix multiplication, where all matrices have a size of 32×32, and look at what the computational pattern looks like with and without tensor cores.

To fully understand this, you have to understand the concept of a cycle. If a processor runs at 1GHz, it can complete 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most operations take longer than one cycle. This creates a pipeline: to start one operation, it needs to wait the number of cycles it takes for the previous operation to finish. This is also called the latency of the operation.

Here are the latencies of some important operations:

  • Global memory access (up to 48GB): ~200 cycles
  • Shared memory access (up to 164KB per streaming multiprocessor): ~20 cycles
  • Fused multiply-add (FFMA): 4 cycles
  • Tensor core matrix multiplication: 1 cycle

In addition, you should know that the smallest unit of threads on a GPU is a pack of 32 threads, called a warp. Warps usually operate in lockstep: threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, which is exactly 32 floats, one float per thread in a warp. On a streaming multiprocessor (SM, the GPU-equivalent of a CPU core), we can have up to 32 warps = 1024 threads. The resources of the SM are divided among all active warps. So sometimes we want to run fewer warps so that each warp gets more registers, shared memory, and tensor core resources.

For the following two examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiplication, we use 8 SMs (about 10% of an RTX 3090) with 8 warps per SM.

Matrix multiplication (without tensor cores)

If we want to do an A*B=C matrix multiplication, where each matrix has a size of 32×32, then we want to load memory we access repeatedly into shared memory, because its latency is about ten times lower (200 cycles vs. 20 cycles). A memory block in shared memory is usually called a memory tile, or just a tile. Two 32×32 floating-point matrices can be loaded into shared memory tiles in parallel using 2×32 warps. We have 8 SMs with 8 warps each, so thanks to parallelization, we only need to perform a single sequential load from global to shared memory, which takes 200 cycles.

To perform the matrix multiplication, we now load a vector of 32 values from shared memory A and shared memory B and perform a fused multiply-add (FFMA), then store the output in register C. We partition the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical; I suggest reading Scott Gray's blog post on matrix multiplication to understand it. This means we have 8 shared memory accesses at 20 cycles each and 8 FFMA operations at 4 cycles each. In total, the cost is:

200 cycles (global memory) + 8*20 cycles (shared memory) + 8*4 cycles (FFMA) = 392 cycles

Let's see how much overhead is required when using tensor cores.

Matrix multiplication (with tensor cores)

With tensor cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get the memory into the tensor cores. Similarly to the above, we need to read from global memory (200 cycles) and store the data in shared memory. To do a 32×32 matrix multiplication, we need 8×8=64 tensor core operations. One SM has 8 tensor cores, so with 8 SMs we have 64 tensor cores in total – exactly the number we need! We can transfer the data from shared memory to the tensor cores with 1 memory transfer (20 cycles) and then perform the 64 tensor core operations in parallel (1 cycle). This means the total cost of matrix multiplication with tensor cores is, in this case:

200 cycles (global memory) + 20 cycles (shared memory) + 1 cycle (tensor cores) = 221 cycles

Thus, tensor cores reduce the cost of matrix multiplication significantly, from 392 cycles to 221 cycles. In this simplified case, tensor cores reduce the cost of both shared memory accesses and FFMA operations.

Both cases, with and without tensor cores, follow roughly the same computational steps; note that this is a very simplified example. In real cases, matrix multiplication involves much larger shared memory tiles and slightly different computational patterns.
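
To make the toy model above concrete, here is a small Python sketch that reproduces the cycle counts from this simplified example. The latency numbers and operation counts are the illustrative values used above, not measured hardware constants.

```python
# Toy cycle model for the simplified 32x32 matrix multiplication example above.
# The latencies are the illustrative values from the text, not measured constants.
GLOBAL_MEM = 200   # cycles per global memory access
SHARED_MEM = 20    # cycles per shared memory access
FFMA = 4           # cycles per fused multiply-add
TENSOR_CORE = 1    # cycles per 4x4 tensor core matrix multiply

def cycles_without_tensor_cores(num_shared_loads=8, num_ffma=8):
    # one global->shared load + 8 shared memory reads + 8 FFMA operations per SM
    return GLOBAL_MEM + num_shared_loads * SHARED_MEM + num_ffma * FFMA

def cycles_with_tensor_cores():
    # one global->shared load + one shared->tensor-core transfer
    # + 64 tensor core ops running in parallel (so they count as 1 cycle)
    return GLOBAL_MEM + SHARED_MEM + TENSOR_CORE

no_tc = cycles_without_tensor_cores()   # 392 cycles
tc = cycles_with_tensor_cores()         # 221 cycles
print(f"without tensor cores: {no_tc} cycles")
print(f"with tensor cores:    {tc} cycles")
print(f"speedup: {no_tc / tc:.2f}x")    # ~1.77x in this toy example
```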

However, I believe this example makes it clear why memory bandwidth is so important for GPUs equipped with tensor cores. Since global memory accounts for most of the cycle cost of matrix multiplication with tensor cores, we would have even faster GPUs if global memory latency could be reduced. We can do this either by increasing the clock frequency of the memory (more cycles per second, but also more heat and power consumption) or by increasing the number of elements that can be transferred at a time (bus width).

Memory bandwidth

From the previous section, we have seen that tensor cores are very fast. In fact, they are so fast that they are idle most of the time, waiting for data to arrive from global memory into shared memory. For example, during BERT Large training, which uses very large matrices – the larger, the better for tensor cores – Tensor Core TFLOPS utilization is about 30%, which means that 70% of the time, the tensor cores sit idle.

This means that when comparing two GPUs with tensor cores, one of the best indicators of each GPU's performance is its memory bandwidth. For example, the A100 GPU has a memory bandwidth of 1555 GB/s, while the V100 has 900 GB/s. So a basic estimate is that the A100 is 1555/900 = 1.73x faster than the V100.

Shared memory / L1 cache size / registers

Since memory transfer to the tensor cores is the limiting factor in performance, we look for other GPU attributes that can speed up memory transfers to the tensor cores. Shared memory, the L1 cache, and the number of registers used are all related to this. Understanding how the memory hierarchy enables faster memory transfers helps to understand how matrix multiplication is performed on a GPU.

To perform matrix multiplication, we exploit the GPU's memory hierarchy, from slow global memory to faster local shared memory to lightning-fast registers. However, the faster the memory, the smaller it is. So we need to partition the matrix into smaller matrices and perform matrix multiplication with these smaller tiles in local shared memory, which is fast and close to the streaming multiprocessor (SM) – the equivalent of a CPU core. With tensor cores, we go a step further: we take each tile and load a part of it into the tensor cores. A matrix memory tile in shared memory is about 10-50x faster than GPU global memory, while the tensor cores' registers are about 200x faster than GPU global memory.
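
The tiling idea itself is easy to see in code. Below is a minimal NumPy sketch of tiled matrix multiplication: the loops over tiles mimic how the work is split into shared-memory blocks, though on a real GPU this partitioning happens across SMs and warps rather than Python loops. The tile size of 32 is just an illustrative choice.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Block-wise A @ B; each (tile x tile) block mimics a shared-memory tile."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # accumulate the output tile from pairs of input tiles;
            # on a GPU, these small tiles would live in shared memory / registers
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```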

Having larger tiles means we can reuse more memory. I discussed this in detail in my TPU vs. GPU blog post. In fact, you can think of a TPU as having a very, very large tile for each tensor core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them somewhat more efficient than GPUs at matrix multiplication.

The tile size is determined by how much memory we have per streaming multiprocessor (SM, the equivalent of a "CPU core" on a GPU). Across architectures, we have the following shared memory sizes:

  • Volta: 96kb shared memory / 32 kb L1
  • Turing: 64kb shared memory / 32 kb L1
  • Ampere: 164kb shared memory / 32 kb L1

We see that Ampere has a much larger shared memory, allowing for larger tile sizes, which reduces the number of global memory accesses. Thus, Ampere can make better use of the overall memory bandwidth of GPU memory. This improves performance by roughly 2-5%. The boost is particularly noticeable for large matrices.

Another advantage of Ampere tensor cores is that they share more data between threads. This reduces register usage. Registers are limited to 64k per streaming multiprocessor (SM), or 255 per thread. Comparing the Volta and Ampere tensor cores, the Ampere tensor core uses 1/3 of the registers, allowing more tensor cores to be active per shared memory tile. In other words, we can feed 3x as many tensor cores with the same number of registers. However, since bandwidth is still the bottleneck, actual TFLOPS improve only slightly over theoretical TFLOPS. The new tensor cores improve performance by roughly 1-3%.

Overall, we can see that the Ampere architecture is optimized to make the available memory bandwidth more effective by using an improved memory hierarchy: from global memory to shared memory tiles to tensor core registers.

Evaluation of Ampere's deep learning performance

Key points:

  • Based on memory bandwidth and the improved memory hierarchy of Ampere GPUs, the theoretical estimate is a speedup of 1.78x to 1.87x.
  • NVIDIA provides accurate benchmark data for the Tesla A100 and V100 GPUs. The data are biased for marketing purposes, but a debiased model of these data can be built.
  • The debiased benchmark data show that the Tesla A100 is 1.70x faster than the V100 for NLP and 1.45x faster for computer vision.

If you want to know more technical details about how I estimate Ampere GPU performance, this section is for you. If you don't care about these technicalities, you can skip it.

Theoretical estimate of Ampere speed

To sum up what we have seen: the difference between two GPU architectures equipped with tensor cores comes down mostly to memory bandwidth. Additional benefits come from more shared memory/L1 cache and better register use in the tensor cores.

If we compare the Tesla A100 GPU bandwidth to the Tesla V100 GPU bandwidth, we get a speedup of 1555/900 = 1.73x. Additionally, I expect a 2-5% speedup from the larger shared memory and a 1-3% speedup from the improved tensor cores. This puts the speedup between 1.78x and 1.87x. With similar reasoning, you can estimate the speedup of other Ampere-series GPUs compared to the Tesla V100.
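
As a sanity check, the arithmetic behind this estimate is simple enough to write down. The sketch below just multiplies the bandwidth ratio by the assumed shared-memory and tensor-core gains quoted above; the percentages are the rough estimates from this article, not measured values.

```python
def ampere_speedup_estimate(bw_new=1555, bw_old=900,
                            shared_mem_gain=(1.02, 1.05),
                            tensor_core_gain=(1.01, 1.03)):
    """Rough A100-vs-V100 speedup: bandwidth ratio times the assumed extra gains."""
    base = bw_new / bw_old  # 1555 / 900 = 1.73x from memory bandwidth alone
    low = base * shared_mem_gain[0] * tensor_core_gain[0]
    high = base * shared_mem_gain[1] * tensor_core_gain[1]
    return low, high

low, high = ampere_speedup_estimate()
print(f"theoretical speedup: {low:.2f}x to {high:.2f}x")  # ~1.78x to ~1.87x
```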

Practical estimate of Ampere speed

Suppose we have an estimate for one GPU of an architecture such as Ampere, Turing, or Volta; it is then easy to extrapolate these results to other GPUs of the same architecture/series. Fortunately, NVIDIA has already benchmarked the A100 against the V100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA used different batch sizes and GPU counts, making the values impossible to compare directly and skewing the results in favor of the A100. So, in a sense, the benchmark numbers are partly honest and partly marketing. In general, you could argue that larger batch sizes are fair, since the A100 has more memory, but to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.

To get an unbiased estimate, we can scale the V100 and A100 results in two ways:

(1) account for the differences in batch size, and (2) account for the differences in using 1 vs. 8 GPUs. Fortunately, we can find estimates for both biases in the data NVIDIA provides.

Doubling the batch size increases throughput by 13.6% (in images per second for CNNs). I benchmarked the same problem for Transformers on my RTX Titan and, surprisingly, found the same result: 13.5% – which seems to be a reliable estimate.

When we parallelize a network across more GPUs, we lose some performance to networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) – another confounding factor. Looking directly at NVIDIA's data, we find that for CNNs, an 8x A100 system has 5% less overhead than an 8x V100 system. This means that if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you 6.67x. For Transformers, the figure is 7%.
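
To illustrate the kind of correction being applied, here is a hedged sketch of how one might debias a vendor-reported number. The 13.6% and 5%/7% figures are the estimates from the text; the exact debiasing in this article comes from NVIDIA's own data, and the 2.1x input below is a purely hypothetical example value.

```python
def debias_speedup(reported_speedup,
                   batch_size_boost=0.136,     # ~13.6% throughput gain from doubling the batch size
                   scaling_advantage=0.05):    # ~5% less multi-GPU overhead on the 8x A100 system (7% for Transformers)
    """
    Rough correction of a vendor-reported A100-vs-V100 speedup.
    Assumes the reported number benefited once from a doubled batch size and
    once from better 8-GPU scaling; both effects are divided out.
    """
    return reported_speedup / (1 + batch_size_boost) / (1 + scaling_advantage)

# Hypothetical example: a marketing number of 2.1x shrinks to ~1.76x after debiasing
print(f"{debias_speedup(2.1):.2f}x")
```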

Using these corrections, we can estimate the speedup for specific deep learning architectures directly from the data NVIDIA provides. Compared with the Tesla V100, the Tesla A100 offers the following speedups:

  • SE-ResNeXt101: 1.43x
  • Masked R-CNN: 1.47x
  • Transformer (12 layers, machine translation, WMT14 en-de): 1.70x

So for computer vision, these values are lower than the theoretical estimate. This may be due to smaller tensor dimensions, the overhead of operations needed to prepare the matrix multiplication (such as img2col or the fast Fourier transform, FFT), or operations that cannot saturate the GPU (the final layers are often relatively small). It could also be an artifact of the specific architectures (grouped convolutions).

For Transformers, the practical estimate is very close to the theoretical estimate. This is probably because the algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.

Possible biases in the estimates

The estimates above compare the A100 and the V100. In the past, NVIDIA quietly degraded performance in the "gaming" RTX GPUs: (1) reduced tensor core utilization, (2) gaming fans for cooling, and (3) disabled peer-to-peer GPU transfers. The RTX 30 series may have unannounced performance degradations compared to the Ampere A100. As news emerges, I will update this blog post.

Other considerations for the Ampere / RTX 30 series

Key points:

  • Ampere allows sparse network training, which can accelerate training by up to 2x.
  • Sparse network training is still rarely used, but it makes Ampere future-proof.
  • Ampere has new low-precision data types, which make using low precision easier, but not necessarily faster than on previous GPUs.
  • The new fan design is excellent if you have a gap between GPUs, but it is unclear whether multiple GPUs with no gap between them will be cooled effectively.
  • The 3-slot design of the RTX 3090 makes 4x GPU builds a problem. Possible solutions are 2-slot variants or the use of PCIe extenders.
  • 4x RTX 3090 builds need more power than any standard power supply unit on the market can provide.

The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as ease-of-use features, since they provide the same performance boost as Turing but without any extra programming effort.

Sparse network training

Ampere allows fine-grained structured sparse matrix multiplication at dense-matrix speeds. How does it work? Take a weight matrix and slice it into pieces of 4 elements. The sparse tensor core feature requires that 2 of every 4 elements are zero. This yields a speedup of up to 2x, since the bandwidth requirements during matrix multiplication are halved.
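
To make the 2:4 pattern concrete, here is a small PyTorch sketch that prunes a weight matrix so that, in every group of 4 consecutive values, only the 2 largest-magnitude values are kept. This only illustrates the sparsity pattern the hardware expects; it does not by itself invoke the sparse tensor cores, which require the appropriate library support.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out 2 of every 4 consecutive values along the last dim,
    keeping the 2 entries with the largest magnitude in each group."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # indices of the two largest-magnitude entries within each group of 4
    topk = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, topk, True)
    return (groups * mask).reshape(rows, cols)

W = torch.randn(8, 16)
W_sparse = prune_2_of_4(W)
# every group of 4 now contains at most 2 non-zero values
assert ((W_sparse.reshape(8, -1, 4) != 0).sum(-1) <= 2).all()
```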

I have worked on sparse network training before. One criticism of my work was: "You reduce the FLOPS the network needs, but this does not yield speedups because GPUs cannot do fast sparse matrix multiplication." Well, with the sparse matrix multiplication feature of tensor cores, my algorithm – or any other sparse training algorithm – now actually provides up to a 2x speedup during real training.

Although this feature is still experimental and training sparse networks is not yet common, having it on your GPU means you are ready for the future of sparse training.

Low-precision calculation

In my previous work, I have shown that new data types can improve the stability of low-precision backpropagation. Currently, the main problem with stable backpropagation in 16-bit floating point (FP16) is that the normal FP16 data type only supports values in the range [-65504, 65504]. If your gradients slip past this range, they blow up into NaN values. To prevent this during FP16 training, we usually perform loss scaling, i.e., multiply the loss by a scaling value before backpropagation to prevent this gradient blow-up.
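
In PyTorch, loss scaling for FP16 training is handled by the automatic mixed precision (AMP) utilities, so you rarely implement it by hand. A minimal sketch (the model and data here are placeholders, not from this article):

```python
import torch

model = torch.nn.Linear(512, 10).cuda()           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()              # handles loss scaling for FP16

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run the forward pass in FP16 where safe
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                 # scale the loss before backprop
    scaler.step(optimizer)                        # unscale gradients, skip the step on overflow
    scaler.update()                               # adjust the scale factor over time
```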

The Brain Float 16 format (BF16) uses more bits for the exponent, so that its range of possible values is the same as FP32: [-3*10^38, 3*10^38]. BF16 has lower precision, but gradient precision is not that important for learning. So what BF16 does is let you stop doing any loss scaling or worrying about gradients blowing up. With BF16, we should therefore see an increase in training stability at a slight loss of precision.

This means: with BF16 precision, training may be more stable than with FP16 precision while providing the same speedups. With TF32 precision, you get stability close to FP32 while getting speedups close to FP16. The good part is that to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 – no code changes required!
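
If the hardware and framework support it, switching to TF32 or BF16 is mostly a configuration change rather than a rewrite. A hedged PyTorch sketch (availability depends on your GPU and PyTorch version; the model is a placeholder):

```python
import torch

# TF32: on Ampere, FP32 matrix multiplications can internally use TF32
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.Linear(512, 10).cuda()           # placeholder model
x = torch.randn(64, 512, device="cuda")

# BF16: same autocast mechanism as FP16, but no loss scaling is needed
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)  # torch.bfloat16
```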

In general, though, these new data types can be seen as lazy data types, in the sense that you could have had all their benefits with the old data types and some additional programming effort (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups; rather, they make low-precision training easier to use.

New fan design / cooling issues

The new fan design of the RTX 30 series features both a blower fan and a push/pull fan. The design is clever and will be very effective if there is a gap between your GPUs. However, it is unclear how the GPUs will behave if you stack them next to each other. The blower fan will be able to exhaust through the bracket, away from the other GPUs, but it is impossible to say how well it works, because the blower fan is designed differently than before. If you want to buy 1 to 2 GPUs in a 4-PCIe-slot setup, you should be fine. However, if you plan to use 3 to 4 RTX 30 GPUs side by side, I would wait for thermal performance reports to find out whether different GPU coolers, PCIe extenders, or other solutions are needed. I will update this blog post then.

To overcome the cooling problem, water cooling provides a solution in any case. Many vendors offer water-cooling blocks for RTX 3080 / RTX 3090 cards, which keep temperatures low even in a 4x GPU setup. Be careful with all-in-one water cooling solutions if you want to run a 4x GPU setup, though, as it is difficult to spread out the radiators in most desktop cases.

Another solution to the cooling problem is to buy PCIe extenders and spread the GPUs out within the case. This works very well: other PhD students at the University of Washington and I have had great success with this setup. It does not look pretty, but it keeps your GPUs cool! It can also help if you do not have enough space to spread the GPUs out. For example, if you can find the space within a desktop case, it might be possible to buy standard 3-slot-width RTX 3090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and the cooling issue for a 4x RTX 3090 setup with one simple solution.

3-slot design and power issues

The RTX 3090 is a 3-slot GPU, so you cannot use NVIDIA's default fan design in a 4x setup. This is justified, since it runs at a 350W TDP and will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult. It is also hard to power a 4x 350W = 1400W system in the 4x RTX 3090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W left for the CPU and motherboard can be too tight. Maximum power is only drawn when the components are fully utilized, and in deep learning, the CPU is usually only under weak load. As such, a 1600W PSU may power a 4x RTX 3080 build well enough, but for a 4x RTX 3090 build, it is better to look for a higher-wattage PSU (+1700W). Currently, there do not seem to be any desktop PSUs above 1600W on the market. Server or cryptocurrency-mining PSUs may solve this problem, but they may have an odd form factor.

GPU Deep Learning Performance

The benchmarks below include not only the Tesla A100 vs. Tesla V100 benchmarks; I also built a model that fits these data, as well as four different benchmarks based on the Titan V, Titan RTX, RTX 2080 Ti, and RTX 2080. [1, 2, 3, 4]

In addition, I scaled to mid-range cards such as the RTX 2070, RTX 2060, or the Quadro RTX 6000 & 8000 by interpolating between these benchmark data points. Usually, within an architecture, GPUs scale linearly with streaming multiprocessor count and bandwidth, and my architecture model is based on that.

I only collected benchmark data for mixed precision FP16 training, because I believe there is no good reason to use FP32 training.

Compared with the RTX 2080 Ti, the RTX 3090 speeds up convolutional networks by 1.57x and Transformers by 1.5x, while the release price is only 15% higher. Thus, the Ampere RTX 30 series offers a very substantial improvement over the Turing RTX 20 series.

GPU deep learning performance per dollar

Which GPU gives you the best bang for the buck? It depends on the cost of the whole system. If you have an expensive system, it makes sense to invest in more expensive GPUs.

Here are three PCIe 3.0 builds that I use as the base cost for 2- and 4-GPU systems, to which I then add the GPU cost. The GPU cost is the average cost of the GPU on Amazon and eBay. For the new Ampere GPUs, I use the release price. Combined with the performance values given above, this yields performance-per-dollar values for these GPU systems. For the 8-GPU system, I use a Supermicro barebone as the baseline cost – this is the industry standard for RTX servers. Note that these bar charts do not account for memory requirements. You should first think about your memory needs and then look for the best option in the chart. Regarding memory, here is a rough guide (a sketch for estimating your own requirements follows the list):

  • Using pre-trained Transformers; training a small Transformer from scratch: >= 11GB
  • Training large Transformers or convolutional networks in research/production: >= 24GB
  • Prototyping neural networks (Transformers or convolutional networks): >= 10GB
  • Kaggle competitions: >= 8GB
  • Applied computer vision: >= 10GB
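
The numbers above are rules of thumb. If you want a rough estimate for your own model, you can start from the parameter count: with Adam and mixed-precision training, a common back-of-the-envelope assumption is on the order of 16-20 bytes per parameter for weights, gradients, and optimizer states, plus whatever the activations need (which depends heavily on batch size and architecture). The sketch below encodes exactly that assumption; treat it as a ballpark, not a guarantee.

```python
def rough_training_memory_gb(num_params: float,
                             bytes_per_param: float = 18,   # assumed: weights + grads + Adam states, mixed precision
                             activation_gb: float = 2.0):   # assumed activation memory; depends on batch size/model
    """Very rough GPU memory estimate for training, in GB."""
    return num_params * bytes_per_param / 1024**3 + activation_gb

# Example: a 350M-parameter Transformer would need very roughly
# 350e6 * 18 bytes ≈ 5.9 GB for parameters/states, plus activations.
print(f"{rough_training_memory_gb(350e6):.1f} GB")
```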

GPU recommendations

The first thing I want to emphasize again: when choosing a GPU, make sure it has enough memory for what you want to do. The steps for choosing a GPU should be:

  1. What do I want to do with the GPU: Kaggle competitions, learning deep learning, hacking on small projects (GANs or large language models?), research in computer vision / natural language processing / other fields, or something else?
  2. How much memory do I need for what I want to do?
  3. Use the cost/performance charts above to find the GPU that meets the memory requirement and is best for you.
  4. Are there other caveats for the GPU I chose? For example, if it is an RTX 3090, will it fit into my computer? Is my power supply unit (PSU) sufficient for the GPU(s) I chose? Will heat dissipation be a problem, or can I somehow cool the GPU effectively?

Some of these details require you to reflect on what you actually want; perhaps also research how much GPU memory other people in your area of interest use. I can give some guidance, but I cannot cover everything.

When do I need >= 11 GB of memory?

I mentioned before that if you use Transformers, you should have at least 11GB of memory, and if you do research on Transformers, >= 24GB is even better. This is because most pre-trained models until now have had quite high memory requirements and were trained on at least RTX 2080 Ti GPUs with 11 GB of memory. With less than 11GB, you may therefore run into situations where certain models are difficult or impossible to run.

Other areas with large memory requirements include medical imaging, some state-of-the-art computer vision models, and anything with very large images (GANs, style transfer).

Generally speaking, if you want to build a model and gain a competitive edge with it, whether in research, industry, or Kaggle competitions, the extra memory may give you an advantage.

When is less than 11 GB of memory okay?

The RTX 3070 and RTX 3080 are mighty cards, but they are a bit short on memory. For many tasks, however, you do not need that much memory.

If you want to learn deep learning, the RTX 3070 is perfect for that. The basic skills of training most architectures can be learned by scaling them down a bit or using slightly smaller input images. If I were learning deep learning again, I would go with one RTX 3070, or even multiple if I had the money to spare.

The RTX 3080 is currently by far the most cost-efficient card and therefore ideal for prototyping. For prototyping, you want the card with the largest memory that is still cheap. Prototyping here means prototyping in any area: research, competitive Kaggle, hacking on ideas/models for a startup, experimenting with research code. For all these applications, the RTX 3080 is the best GPU.

Suppose I were to lead a research lab or startup. I would put 66-80% of my budget into RTX 3080 machines and 20-33% into RTX 3090 machines with robust water cooling. My idea is that the RTX 3080 is much more cost-effective and can be shared for prototyping via a slurm cluster setup. Since prototyping should be done in an agile way, it should use smaller models and datasets, and the RTX 3080 is perfect for that. Once students/colleagues have a good prototype, they can move it to the RTX 3090 machines and scale it up to larger models.

General recommendations

Overall, the RTX 30 series is very powerful, and I highly recommend these GPUs. As mentioned in the previous section, pay attention to memory as well as to power and cooling requirements. If you have a free PCIe slot between your GPUs, cooling will be no problem. Otherwise, with RTX 30 cards, plan for water cooling, PCIe extenders, or effective blower-style cards (data in the coming weeks will show whether NVIDIA's fan design is adequate).

In general, I would recommend the RTX 3090 to anyone who can afford it. It will equip you not just for now but will remain a very effective card for the next 3-7 years. So it is a good investment that will stay strong. HBM memory is unlikely to become cheaper within three years, so the next GPU would only be about 25% better than the RTX 3090. We will probably see cheap HBM memory in 5-7 years, so after that you will definitely want to upgrade.

If you have multiple RTX 3090s, make sure the solution you choose provides effective cooling and sufficient power. I will update this blog post on which setups are appropriate as more data becomes available.

For anyone without strong competitive requirements (research labs, Kaggle competitions, competitive startups), I recommend the RTX 3080. It is a more economical solution and will ensure that most networks train fast enough. If you use the right memory tricks and do not mind some extra programming, there are now enough tricks to fit a 24GB neural network onto a 10GB GPU. So, if you can accept a bit of uncertainty and some extra programming, the RTX 3080 may also be a better choice than the RTX 3090.
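
Two of the most common "memory tricks" referred to here are gradient accumulation (trading batch size per step for more steps) and gradient checkpointing (recomputing activations in the backward pass instead of storing them). A minimal PyTorch sketch of gradient accumulation, with a placeholder model and data:

```python
import torch

model = torch.nn.Linear(512, 10).cuda()           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                                   # simulate a 4x larger batch size

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 512, device="cuda")       # small micro-batch that fits in memory
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()               # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # update once per "virtual" large batch
        optimizer.zero_grad()
```

Gradient checkpointing via `torch.utils.checkpoint` saves further activation memory at the cost of extra compute during the backward pass.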

In general, the RTX 3070 is also a solid card for learning deep learning and prototyping, and it is $200 cheaper than the RTX 3080. If you cannot afford the RTX 3080, the RTX 3070 is the best choice.

If your budget is limited and the RTX 3070 is too expensive, a used RTX 2070 costs about $260 on eBay. It is not yet clear whether an RTX 3060 will be released, but if your budget is limited, it might be worth waiting for. If priced similarly to the RTX 2060 and GTX 1060, it should cost between $250 and $300 and deliver quite strong performance.

GPU cluster recommendation

GPU cluster design depends strongly on its purpose. For a 1024+ GPU system, networking matters most, but if users only use up to 32 GPUs at a time on such a system, investing in a powerful networking infrastructure is a waste. Here I would apply prototype-rollout reasoning similar to the RTX 3080 vs. RTX 3090 comparison.

In general, RTX cards are banned from data centers via the CUDA license agreement. However, universities can often get an exemption from this rule, and it is worth getting in touch with someone from NVIDIA to ask for one. If you are allowed to use RTX cards, I recommend standard Supermicro 8-GPU systems with RTX 3080 or RTX 3090 GPUs (if effective cooling can be guaranteed). A small set of 8x A100 nodes ensures effective "rollout" after prototyping, especially if there is no guarantee that 8x RTX 3090 servers can be cooled sufficiently. In that case, I would recommend the A100 over the RTX 6000 / RTX 8000, because the A100 is quite cost-effective and future-proof.

If you want to train very large networks on a GPU cluster (256+ GPUs), I recommend the NVIDIA DGX SuperPOD system with A100 GPUs. At the 256+ GPU scale, networking becomes crucial. If you want to scale beyond 256 GPUs, you need a highly optimized system; putting together standard solutions will no longer work.

Especially at the 1024+ GPU scale, the only competitive solutions on the market are the Google TPU Pod and the NVIDIA DGX SuperPOD. At that scale, I would prefer the Google TPU Pod, since their custom networking infrastructure seems to outperform the NVIDIA DGX SuperPOD system – although the two are quite close. Compared with TPU systems, GPU systems offer more flexibility for deep learning models and applications, while TPU systems support larger models and provide better scaling. So both systems have their pros and cons.

Do not buy these GPUs

I do not recommend buying multiple RTX Founders Edition cards or RTX Titans unless you have PCIe extenders to solve their cooling problems. They simply overheat and then perform far below the values in the charts above. 4x RTX 2080 Ti Founders Edition GPUs quickly exceed 90°C, at which point their core clock is throttled and they run slower than properly cooled RTX 2070 GPUs.
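
If you are unsure whether your cards are thermal throttling, you can watch temperature and SM clock while training. A small sketch using the `pynvml` bindings (the `nvidia-ml-py` package), assuming it is installed:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):                       # sample for a while during training
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        # a dropping SM clock at ~90C+ indicates thermal throttling
        print(f"GPU {i}: {temp} C, SM clock {clock} MHz")
    time.sleep(5)

pynvml.nvmlShutdown()
```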

I do not recommend buying a Tesla V100 or A100 unless you are forced to (a company policy that bans RTX cards from data centers), or unless you want to train very large networks on a huge GPU cluster – these GPUs are simply not very cost-effective.

If you can afford better cards, do not buy the GTX 16 series. These cards have no tensor cores and therefore deliver relatively poor deep learning performance. I would rather choose a used RTX 2070 / RTX 2060 / RTX 2060 Super over a GTX 16-series card. If you are short on money, however, the GTX 16 series is still an acceptable choice.

Under what circumstances is it best not to buy a new GPU?

If you already have RTX 2080 Tis or better GPUs, upgrading to the RTX 3090 may not make sense. Your GPUs are already great, and the performance gains are negligible compared to the PSU and cooling headaches of the newly launched power-hungry RTX 30 cards – it is just not worth it.

The only reason I would want to upgrade from 4x RTX 2080 Ti to 4x RTX 3090 is if I were working on huge Transformers or other training that is heavily compute-bound. However, if memory is the problem, first consider some memory tricks to fit large models onto your 4x RTX 2080 Tis before upgrading to RTX 3090s.

If you have one or more RTX 2070 GPUs, think twice before upgrading. These are pretty good GPUs. However, if you often find yourself limited by the 8GB of memory, it can make sense to sell them on eBay and get an RTX 3090. This reasoning applies to many other GPUs as well: if memory is tight, an upgrade is right.

Questions & answers & misconceptions

Key points:

  • PCIe 4.0 and PCIe lanes do not matter in 2x GPU setups. They do not matter much in 4x GPU setups either.
  • Cooling of the RTX 3090 and RTX 3080 will be a problem. Use water-cooled cards or PCIe extenders.
  • NVLink is not useful; it is only useful for GPU clusters.
  • You can use different GPU types in one computer (e.g., GTX 1080 + RTX 2080 + RTX 3090), but you cannot parallelize across them efficiently.
  • To train in parallel across more than two machines, you need Infiniband and a 50 Gbit/s+ network.
  • AMD CPUs are cheaper than Intel CPUs; Intel CPUs have almost no advantage.
  • Despite heroic software efforts, AMD GPUs + ROCm will probably not be able to compete with NVIDIA for at least 1-2 years due to the lack of community and of tensor core equivalents.
  • Cloud GPUs are useful if you will use GPUs for less than a year. Beyond that, a desktop is the cheaper solution.

Do I need PCIe 4.0?

Generally speaking, no. PCIe 4.0 is useful if you have a GPU cluster. It is okay to have if you have an 8x GPU machine, but otherwise it does not bring many benefits. It allows better parallelization and slightly faster data transfer, but data transfer is not a bottleneck in any application. In computer vision, data storage can be a bottleneck in the data pipeline, but not the PCIe transfer from CPU to GPU. So for most people there is no real reason to get PCIe 4.0. In a 4-GPU setup, it might improve parallelization by 1-7%.

Do I need 8x/16x PCIe lanes?

Same as with PCIe 4.0: generally speaking, no. PCIe lanes are needed for parallelization and fast data transfer, but this is hardly ever a bottleneck. It is fine to run GPUs on 4x lanes, especially if you only have 2 GPUs. For a 4-GPU setup, I would prefer 8x lanes per GPU, but running them on 4x lanes will probably only reduce performance by about 5-10% when parallelizing across all 4 GPUs.

If each RTX 3090 requires 3 PCIe slots, how do I install 4x RTX 3090?

You need 2-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also think about cooling and a suitable PSU. The easiest solution to manage seems to be 4x RTX 3090 EVGA Hydro Copper cards with a custom water cooling loop. This keeps the cards at low temperatures. EVGA has produced Hydro Copper versions of GPUs for years, and I think you can trust the quality of their water-cooled GPUs. There may be other, cheaper variants, though.

PCIe extenders may also solve both space and cooling issues, but you need to make sure you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!

How do I cool down 4x RTX 3090 or 4x RTX 3080?

Please read the previous section.

Can I use multiple different types of GPUs?

Yes, you can! But you cannot parallelize efficiently across GPUs of different types. I could imagine a 3x RTX 3070 plus 1 RTX 3090 making sense for a prototype-rollout split. On the other hand, parallelizing across 4x RTX 3070 GPUs would be very fast if you can fit your models onto those GPUs. Beyond that, the only reason I can think of for doing this is that you want to keep using old GPUs. That works fine, but parallelization across those GPUs will be inefficient, since the fastest GPU will wait for the slowest GPU to reach a synchronization point (usually the gradient update).

What is NVLink, and is it useful?

Generally speaking, NVLink is not useful. NVLink is a high-speed interconnect between GPUs. It is useful if you have a GPU cluster with 128+ GPUs. Otherwise, it yields almost no benefit over standard PCIe transfers.

I don't have enough money, even for the cheapest GPUs you recommend. What can I do?

The answer is, of course, to buy used GPUs. A used RTX 2070 ($400) or RTX 2060 ($300) are both great. If you cannot afford that, the next best option is a used GTX 1070 ($220) or GTX 1070 Ti ($230). If that is too expensive, you can use a GTX 980 Ti (6GB, $150) or a GTX 1650 Super ($190). If all of that is too expensive, it is best to use free GPU cloud services. These usually come with time limits, after which you have to pay. Rotate between different services and accounts until you can afford your own GPU.

What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?

I built a carbon calculator for academics to calculate their own carbon emissions (emissions from flights to conferences + GPU time). The calculator can also be used to compute pure GPU carbon emissions. You will find that GPUs produce much, much more carbon than international flights. So you should make sure you have a green source of energy if you do not want your carbon footprint to be astronomical. If no electricity provider in your area offers green energy, the best option is to buy carbon offsets. Many people are skeptical about carbon offsets: do they work? Are they scams?

I believe skepticism just hurts in this case, because doing nothing is more harmful than risking the possibility of being scammed. If you worry about scams, just invest in a portfolio of offsets to minimize the risk.

About ten years ago, I worked on a project that generated carbon offsets. UN officials tracked the process; they required clean, digitized data and physical inspections of the project site. The carbon offsets generated in that case were highly reliable. I believe many other projects have similar quality standards.

carbon calculator: https://github.com/TimDettmers/carbonneutral

What is needed to parallel between two machines?

If you want to parallelize across machines, you will need 50 Gbit/s+ network cards to gain speedups. I have a somewhat outdated blog post about this. Today, I would recommend at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s of bandwidth. Two EDR cards with a cable cost about $500 on eBay.

Is the sparse matrix multiplication feature suitable for sparse matrices in general?

It does not seem so. The granularity of the sparsity requires 2 zero-valued elements out of every 4 elements, so the sparse matrix needs to be quite structured. It might be possible to adjust the algorithm slightly, by pooling 4 values into a compressed representation of 2 values, but that also means precise multiplication of arbitrary sparse matrices is not possible on Ampere GPUs.

Do I need an Intel CPU to power a multi-GPU setup?

I do not recommend Intel CPUs unless you use CPUs heavily in Kaggle competitions (lots of linear algebra on the CPU). Even for Kaggle competitions, AMD CPUs are still great. For deep learning, AMD CPUs are cheaper and better than Intel CPUs. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems with Threadrippers at our university and they all work well – no complaints yet. For 8x GPU systems, I would usually go with the CPUs your vendor has experience with. In 8x systems, CPU and PCIe/system reliability matter more than raw performance or cost-effectiveness.

Should I wait for RTX 3090 Ti?

It is unclear whether there will be an RTX 3080 Ti / RTX 3090 Ti / RTX Ampere Titan. The GTX XX90 names were usually reserved for dual-GPU cards, so NVIDIA is diverging from that trend. From the perspective of price and performance, it seems the RTX 3090 is a name that replaces what would have been the RTX 3080 Ti. But all of this is speculation. If you are interested in this question, I suggest following the rumor mill for a month or two, and if you do not see anything, it is unlikely that an RTX 3080 Ti / RTX 3090 Ti / RTX Ampere Titan will appear.

Does the case design matter for cooling?

No. As long as there is a small gap between GPUs, they are usually cooled just fine. Case design may give you 1 to 3 degrees Celsius, while the gap between GPUs gives you 10 to 30 degrees Celsius of improvement. In the end, if there is a gap between GPUs, cooling is fine. If there is no gap, you need the right cooler design (blower fans) or another solution (water cooling, PCIe extenders), but either way, case design and case fans do not matter much.

Can AMD GPUs + ROCm catch up with NVIDIA GPUs + CUDA?

Not within the next 1-2 years. It is a three-way problem: tensor cores, software, and community.

In terms of raw silicon, AMD GPUs are great: excellent FP16 performance, excellent memory bandwidth. However, the lack of tensor cores or an equivalent makes their deep learning performance worse than that of NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors say AMD planned some kind of data center card with a tensor core equivalent for 2020, but no new data has emerged since. Data center cards with tensor core equivalents would also mean that few people could afford such AMD GPUs, which would leave NVIDIA with the competitive advantage.

Suppose AMD introduces tensor-core-like hardware features in the future. Then many people would say: "But there is no software that works for AMD GPUs! How am I supposed to use them?" This is mostly a misconception. AMD's ROCm software has come a long way, and support via PyTorch is excellent. Although I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems that whatever network you choose, it will run fine on an AMD GPU. So AMD has come a long way here, and this issue is more or less solved.

However, even if the software and the lack of tensor cores were solved, AMD would still have a problem: the missing community. If you have a problem with an NVIDIA GPU, you can google it and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have hacks and tricks that make using NVIDIA GPUs a breeze (e.g., Apex). You can find NVIDIA GPU and programming experts around every corner, while I know far fewer AMD GPU experts.

In terms of community, AMD vs. NVIDIA is a bit like Julia vs. Python. Julia has great potential, and many would rightly call it an excellent high-level language for scientific computing. Yet Julia is barely used compared to Python, because the Python community is so strong. NumPy, SciPy, and Pandas are powerful packages with many people behind them. This is very similar to the NVIDIA vs. AMD situation.

So AMD will probably not catch up with NVIDIA until tensor core equivalents are introduced (1/2 to 1 year?) and a strong community builds up around ROCm (2 years?). AMD will always snatch parts of the market share in specific subfields (e.g., cryptocurrency mining, data centers). But in deep learning, NVIDIA will likely keep its monopoly for at least a few more years.

When is it better to use the cloud, and when a dedicated GPU desktop/server?

Rule of thumb: If you want to do deep learning for more than a year, buy a GPU desktop. Otherwise, it is better to use cloud instances.

It is best to run the numbers yourself. For example, if you compare an AWS V100 spot instance with 1x V100 against a desktop with a single RTX 3090 (similar performance), the desktop costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, electricity costs an extra $0.12 per kWh, while the AWS spot instance costs $2.14 per hour.

At a 15% annual utilization rate, the power consumed by a desktop computer is:

(350 W (GPU) + 100 W (CPU)) * 0.15 (utilization) * 24 hours * 365 days = 591 kWh

That is 591 kWh per year, or an additional $71 in electricity.

At 15% utilization (using a cloud instance 15% of the day), the break-even point between a desktop and cloud instances is roughly 300 days ($2,311 vs. $2,270):

$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311

So if you expect to still be running deep learning models after 300 days, it is better to buy a desktop than to use AWS spot instances.

For any cloud service, you can do similar calculations to decide whether to use cloud service or desktop.
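
Here is a small sketch of that calculation so you can plug in your own numbers. The defaults are the figures used above (a $2,200 desktop, 450W of draw, $0.12/kWh electricity, a $2.14/h cloud instance, 15% utilization); treat them as examples, not current prices.

```python
def break_even_days(desktop_cost=2200.0,   # 2-GPU barebone + RTX 3090, as above
                    desktop_watts=450.0,   # 350W GPU + 100W CPU
                    electricity_per_kwh=0.12,
                    cloud_per_hour=2.14,
                    utilization=0.15):
    """Days of usage after which a desktop becomes cheaper than a cloud instance."""
    hours_per_day = 24 * utilization
    desktop_per_day = desktop_watts / 1000 * hours_per_day * electricity_per_kwh
    cloud_per_day = cloud_per_hour * hours_per_day
    return desktop_cost / (cloud_per_day - desktop_per_day)

# ~293 days with these defaults, close to the ~300 days quoted above
print(f"break even after ~{break_even_days():.0f} days")
```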

The following are common utilization rates:

  • PhD student personal desktop: 15%
  • PhD student slurm GPU cluster: 35%
  • Company-wide slurm research cluster: 60%

Generally, utilization rates are lower for professions where thinking about cutting-edge ideas is more important than developing practical products. Some areas have very low utilization rates (interpretability research), while others have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated; most personal systems have a utilization rate between 5-10%. This is why I highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
