Machine learning is on a path to consume all the energy that can be supplied, a model that is costly and unsustainable. That kind of exponential growth cannot continue, so where is all of this heading?
To a large extent, this is happening because the field is new, exciting, and evolving rapidly. The number of data centers used for training and inference is rising dramatically, demanding an exponential increase in the power delivered to them. At the same time, the amount of data that on-device intelligence has to process is growing sharply. Together, these factors are driving a steep rise in power consumption.
That growth is straining energy generation and power delivery technology. At the recent Design Automation Conference, AMD CTO Mark Papermaster presented a slide comparing the energy consumption of ML systems (Figure 1) with the world's energy production.
Figure 1. Energy consumption of ML (machine learning)
Papermaster is not the only one sounding the alarm. "We forget that the driver of innovation for the last 100 years has been efficiency," said Steve Teig, CEO of Perceive. "That is what drove Moore's Law. We are now in an era of inverse efficiency." Aart de Geus, chairman and CEO of Synopsys, pleaded for action on behalf of planet Earth. "Those who have the brains should also have the heart to help," he said.
Why is energy consumption growing so fast? "The computational demands of neural networks are insatiable," said Ian Bratt, fellow and senior director of technology at Arm. "The larger the network, the better the results and the more problems you can solve, and energy usage is proportional to the size of the network. Energy-efficient inference is therefore critical for the adoption of increasingly sophisticated neural networks and enhanced use cases, such as real-time voice and vision applications."
Unfortunately, not everyone cares about efficiency. "When you look at what the hyperscaler companies are trying to do, it's to get better, more accurate voice recognition, speech recognition, recommendation engines," said Tim Vehling, senior vice president of product and business development at Mythic. "It's a monetary thing. The more accuracy they can get, the more clients they can service, the more profitability they can generate. Data-center training and inference of these very large NLP models is where a lot of the energy is consumed, and I don't know if there is any real incentive to optimize power in those applications."
But some people do care. "There is some commercial pressure to reduce the carbon impact of these companies, not directly monetary, but more because consumers will only accept carbon-neutral solutions," said Alexander Wakefield, a scientist at Synopsys. "That pressure is coming from the green-energy side, and if one of these providers can say they are carbon neutral, more people are likely to use them."
But not all of the energy is consumed in the cloud. The growing number of smart edge devices also contributes to the problem. "There are billions of devices that make up the Internet of Things, and in the near future they will use more electricity than the world generates," said Marcie Weinstein, director of strategic and technical marketing at Aspinity. "They consume power to collect, transmit, and do whatever they need to do with all of the data they gather."
Figure 2. Inefficiency of edge processing. Source: Aspinity/IHS/SRC
Reducing Power Consumption
In the past, the tech world relied on semiconductor scaling to improve energy efficiency. "We are approaching the physical limits of our process technology," said Michael Frank, fellow and system architect at Arteris IP. "Transistor widths are down to between 10 and 20 lattice constants of silicon dioxide. We have far more devices, and wires with parasitic capacitance, and a large amount of energy is lost charging and discharging those wires. We cannot reduce the voltage much further before we enter the nonlinear region, where the result of an operation becomes statistical rather than deterministic. The technology does not really give us a good path forward. There is, however, a proof of concept that consumes about 20 watts and does all of these things, including learning. It is called the brain."
So is ML more efficient than the alternatives? "The power consumption of ML has to be considered from the perspective of the system in which it is applied, and the trade-off is between the power profile that comes with including ML and the overall performance gain for the system," said Joe Hupcey, ICVS product manager at Siemens EDA. "In many application areas, the industry has developed efficient ML-specific FPGAs and ASICs to reduce the power consumed by training and inference, and significant investment is being made to continue that trend."
Heat has an impact that may force people to focus more on power. "Some companies are looking at power per square micron because of heat," said Synopsys scientist Godwin Maben. "Everybody is worried about heat. When you stack a lot of gates on top of each other in a small area, power density is high, temperature goes up, and you get close to thermal runaway. Power density now limits performance. As an EDA vendor, we don't focus only on power, because once heat comes into the picture, performance per watt, and then performance per watt per square micron, becomes important."
There are several ways to look at the problem. "I generally like to look at the energy per inference rather than the power," said Russ Klein, director of HLS platforms at Siemens EDA. "Looking at power alone can be a little misleading. For example, a CPU typically consumes less power than a GPU, but the GPU performs inference so much faster that, measured per inference, it uses only a fraction of the energy the CPU requires."
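To make the metric Klein describes concrete, here is a minimal sketch; the power and latency figures are hypothetical placeholders, not measurements from the article:

```python
# Energy per inference = average power draw (W) x time per inference (s).
# All numbers below are illustrative assumptions.

def energy_per_inference(power_watts, latency_seconds):
    """Return joules consumed for a single inference."""
    return power_watts * latency_seconds

cpu = energy_per_inference(power_watts=65, latency_seconds=0.200)   # 13.0 J
gpu = energy_per_inference(power_watts=300, latency_seconds=0.004)  # 1.2 J

print(f"CPU: {cpu:.1f} J/inference, GPU: {gpu:.1f} J/inference")
# Although the GPU draws more power, its much shorter latency means
# each inference costs a fraction of the energy.
```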
Where the most energy is consumed might seem obvious, but the answer is actually somewhat contentious. There are two axes to consider: training versus inference, and edge versus cloud.
Training vs. Inference
Why does training consume so much energy? "When you do multiple iterations over the same data set, it consumes a lot of energy," said Arteris' Frank. "You are doing a gradient-descent type of approximation. The model is basically a hyperdimensional surface, and you are following gradients, defined by derivatives, down through a multidimensional vector space."
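As a rough illustration of the iterative work Frank describes, a minimal gradient-descent loop (here on a toy least-squares problem, not a real neural network) repeatedly revisits the same data and follows the gradient downhill, which is why training cost scales with model size, data size, and the number of passes:

```python
import numpy as np

# Toy least-squares problem: find w minimizing ||X @ w - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(10)
lr = 0.01
for epoch in range(200):                     # every epoch touches all the data again
    grad = 2 * X.T @ (X @ w - y) / len(y)    # derivative of the loss surface
    w -= lr * grad                           # descend along the gradient

print("error:", np.linalg.norm(w - true_w))
```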
The amount of energy it takes to do that is increasing rapidly. "If you look at the energy it took to train a model two years ago, some transformer models were in the 27-kilowatt-hour range," said Synopsys' Maben. "If you look at today's transformers, it is over 500,000 kilowatt-hours. The number of parameters went from roughly 50 million to 200 million. The parameter count grew fourfold, but the energy grew by more than 18,000 times. Ultimately, it comes down to the carbon footprint and how many pounds of CO2 this creates."
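A quick back-of-the-envelope check of the figures Maben quotes (treated here as approximate, exactly as given above) shows how far the energy growth outpaces the parameter growth:

```python
# Figures as quoted above (approximate).
energy_then_kwh, energy_now_kwh = 27, 500_000
params_then, params_now = 50e6, 200e6

print(f"parameter growth: {params_now / params_then:.0f}x")           # ~4x
print(f"energy growth:    {energy_now_kwh / energy_then_kwh:,.0f}x")  # ~18,500x
```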
How does this compare to inference? "Training involves a forward and a backward pass, whereas inference is only a forward pass," said Suhas Mitra, director of product marketing for Tensilica AI products at Cadence. "Hence, the energy for inference is always lower. In addition, the batch sizes used during training can be large, while inference batch sizes are usually smaller."
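A common rule of thumb, assumed here rather than taken from Mitra, is that a training step costs roughly three forward-pass equivalents per sample, because the backward pass is about twice the cost of the forward pass, and training revisits every sample many times. A sketch with placeholder numbers:

```python
# Rough compute comparison; all values are illustrative assumptions.
forward_flops_per_sample = 1e9   # hypothetical model: 1 GFLOP per forward pass
num_samples = 10_000_000
epochs = 10

training_flops = 3 * forward_flops_per_sample * num_samples * epochs
single_inference_flops = forward_flops_per_sample

print(f"training:  {training_flops:.2e} FLOPs")
print(f"inference: {single_inference_flops:.2e} FLOPs per query")
# Training dwarfs a single inference, but a deployed model may serve
# billions of queries, which is where the debate below comes from.
```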
When you try to estimate the total power consumed by each of these two functions, things become contentious. "There is a debate about which consumes more, training or inference," Maben said. "Training a model consumes a huge amount of energy, and the number of days it takes to train on the data is enormous. But does it take more energy than inference? Training is a one-time cost. You spend a lot of time training, and the problem in the training phase is the number of parameters. Some models have 150 billion parameters."
Additionally, training is rarely done only once. "Training is not a one-and-done activity that never comes back," said Mythic's Vehling. "They are constantly retraining and re-optimizing models, so the training is constant. They keep tweaking the models, looking for enhancements, augmenting the data set, so it is more or less an ongoing activity."
Inference, however, may be repeated an enormous number of times. "You train a model once, perhaps for self-driving cars, and now every car uses that model," Maben added. "Now we are talking about inference in roughly 100 million cars. One prediction is that more than 70% to 80% of the energy will be used for inference rather than training."
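Continuing Maben's example with illustrative numbers, where the fleet size comes from the quote but the per-inference energy, workload, lifetime, and training cost (the transformer figure quoted earlier, reused purely as a placeholder) are all assumptions, lifetime inference energy can swamp the one-time training cost:

```python
# Illustrative comparison only; every value below is an assumption.
training_energy_kwh = 500_000            # one-time training cost (placeholder)
cars = 100_000_000                       # fleet size from the example above
inferences_per_car_per_day = 1_000_000   # assumed perception workload
joules_per_inference = 0.01              # assumed efficient edge accelerator
days = 365 * 5                           # assumed vehicle lifetime considered

inference_energy_kwh = (cars * inferences_per_car_per_day * days
                        * joules_per_inference) / 3.6e6   # J -> kWh

print(f"training:  {training_energy_kwh:,.0f} kWh (once)")
print(f"inference: {inference_energy_kwh:,.0f} kWh across the fleet")
```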
There is some data to support this. "In a recent paper from Northeastern University and MIT, it was estimated that inference has a far greater impact on energy consumption than training," said Philip Lewer, senior director of product at Untether AI. "This is because models are built specifically to be deployed for inference, and therefore run far more frequently in inference mode than training mode, essentially train once, run everywhere."
Cloud vs. Edge
There can be many different reasons for moving applications from the cloud to the edge. "The market has decided that certain activities are better pushed to the edge rather than the cloud," said Paul Karazuba, vice president of marketing at Expedera. "I don't think there is a clear line on what is and is not done at the edge, or how those decisions are made. But we see a desire for more AI at the edge, and for more mission-critical applications at the edge, rather than AI being a stamp on the outside of the box. The AI is actually doing something useful in the device, not just being there."
It is not simply a matter of moving cloud models to the edge. "Let's say you have a natural-language, speech-recognition application," Mythic's Vehling said. "You train those models in the cloud, and most of the time you also run them for inference in the cloud. If you look at inference applications that sit at the edge, not in the cloud, the models are trained around those local resources. So they are almost two different problems being solved, one cloud-based and one edge-based, and they are not necessarily related to each other."
Models have to be built with knowledge of where they will eventually run. "You typically find billion-parameter models running in the cloud, but that is just one type of model," Vehling added. "At the other extreme, you have very small wake-word models that take up very few resources, call them tinyML or even lower. Then there is a category of models in the middle, such as visual-analytics models used in camera-based applications. They are much smaller than the models in the cloud, but also much bigger than the very simple wake-word models."
It is not only inference that happens at the edge. We are likely to see more and more training there, too. "Federated learning is one example," said Sharad Chole, chief scientist at Expedera. "One area where it is already being used is autocomplete. Autocomplete may be different for every person, so how do you actually learn it, and how do you customize it, while still protecting user privacy? Those are the challenges."
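A minimal sketch of the idea Chole describes is federated averaging: each device trains on its own local data, which never leaves the device, and only the resulting model parameters are averaged centrally. This is a toy NumPy illustration of that flow, not a production implementation:

```python
import numpy as np

def local_update(weights, local_x, local_y, lr=0.1, steps=5):
    """One client's on-device training on its private data (linear-model sketch)."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * local_x.T @ (local_x @ w - local_y) / len(local_y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Server averages the client models; raw data never leaves the devices."""
    updates = [local_update(global_w, x, y) for x, y in clients]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(1)
true_w = rng.normal(size=4)
clients = []
for _ in range(10):                      # ten devices, each with private samples
    x = rng.normal(size=(50, 4))
    clients.append((x, x @ true_w))

w = np.zeros(4)
for _ in range(20):                      # twenty communication rounds
    w = federated_round(w, clients)
print("error:", np.linalg.norm(w - true_w))
```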
Improving Efficiency
Moving an application from the training system to the edge involves a significant software stack. "Once past the initial training phase, subsequent optimizations deliver significantly lighter models with minimal loss of predictive performance," said Siemens EDA's Hupcey. "Model-reduction techniques are used to lower power consumption during inference. Quantization, weight pruning, and approximation are used extensively before or after the model is deployed. The two most notable cases are TinyML and the lightweight versions of GPT-3."
Klein added: "Dropout and pruning are a good start. Quantization to smaller numerical representations also helps. Done aggressively, these can reduce the size of a network by 99% or more, in many cases with less than a 1% drop in accuracy. Some have also looked at trading off layers against channels in a model to produce smaller networks without compromising accuracy."
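A minimal NumPy sketch of two of the techniques Klein mentions, magnitude pruning (zeroing the smallest weights) and quantizing what remains to 8-bit integers. Real deployments would use a framework's tooling and fine-tune after pruning; this only shows the mechanics, with an arbitrary random weight matrix standing in for a trained layer:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights (here 90% of them)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Uniform symmetric quantization of float weights to int8 plus a scale."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

w = np.random.default_rng(2).normal(size=(256, 256)).astype(np.float32)
pruned = magnitude_prune(w)
q, scale = quantize_int8(pruned)
restored = q.astype(np.float32) * scale   # dequantize to check the error

print("nonzero weights:", np.count_nonzero(pruned), "of", w.size)
print("max quantization error:", np.max(np.abs(restored - pruned)))
```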
These techniques reduce model size and, with it, energy requirements, but further improvements are possible. "Now we are seeing support for mixed precision, where each layer can be quantized to a different precision," said Expedera's Chole. "That could be pushed even further. Perhaps in the future every dimension of the weights could be quantized to a different precision. This push is good, because during training, data scientists become aware of how they can reduce power, and what accuracy trade-offs they are making while doing so."
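Extending the previous sketch to the mixed precision Chole describes, each layer (or, eventually, each dimension) can be given its own bit width. The layer names and bit assignments below are hypothetical, chosen only to show the shape of such a plan:

```python
import numpy as np

def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to a signed grid with `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / levels
    return np.round(weights / scale) * scale   # dequantized, to measure error

rng = np.random.default_rng(3)
layers = {"stem": rng.normal(size=1000),
          "backbone": rng.normal(size=1000),
          "classifier": rng.normal(size=1000)}

# Hypothetical mixed-precision plan: sensitive layers keep more bits.
bit_plan = {"stem": 8, "backbone": 4, "classifier": 8}

for name, w in layers.items():
    err = np.mean((w - quantize_uniform(w, bit_plan[name])) ** 2)
    print(f"{name}: {bit_plan[name]}-bit, mse {err:.2e}")
```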
Conclusion
Models keep getting bigger in the pursuit of greater accuracy, but that trend has to stop, because the power they consume is increasing disproportionately. While the cloud can absorb that cost today because of its business model, the edge cannot. As more companies invest in edge applications, we can expect a sharper focus on energy optimization. Some companies are aiming to cut power consumption by a factor of 100 over the next five years, but even that is not nearly enough to halt the trend.