
Can we make AI less power-hungry? These researchers are working on it.
arstechnica.com
feeding the beast

As demand surges, figuring out the performance of proprietary models is half the battle.

Jacek Krywko | Mar 24, 2025 7:00 am

Credit: Igor Borisenko/Getty Images

At the beginning of November 2024, the US Federal Energy Regulatory Commission (FERC) rejected Amazon's request to buy an additional 180 megawatts of power directly from the Susquehanna nuclear power plant for a data center located nearby. The rejection was based on the argument that buying power directly, instead of getting it through the grid like everyone else, works against the interests of other users.

Demand for power in the US has been flat for nearly 20 years. "But now we're seeing load forecasts shooting up. Depending on [what] numbers you want to accept, they're either skyrocketing or they're just rapidly increasing," said Mark Christie, a FERC commissioner.

Part of the surge in demand comes from data centers, and their increasing thirst for power comes in part from running increasingly sophisticated AI models. As with all world-shaping developments, what set this trend into motion was vision: quite literally.

The AlexNet moment

Back in 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, AI researchers at the University of Toronto, were busy working on a convolutional neural network (CNN) for the ImageNet LSVRC, an image-recognition contest. The contest's rules were fairly simple: a team had to build an AI system that could categorize images sourced from a database comprising over a million labeled pictures.

The task was extremely challenging at the time, so the team figured they needed a really big neural net, way bigger than anything other research teams had attempted. AlexNet, named after the lead researcher, had multiple layers, with over 60 million parameters and 650,000 neurons. The problem with a behemoth like that was how to train it.

What the team had in their lab were a few Nvidia GTX 580s, each with 3GB of memory. As the researchers wrote in their paper, AlexNet was simply too big to fit on any single GPU they had. So they figured out how to split AlexNet's training phase between two GPUs working in parallel: half of the neurons ran on one GPU, and the other half ran on the other GPU.
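That split is what would now be called model parallelism. Below is a minimal sketch of the idea in modern terms, not the original AlexNet code: it assumes PyTorch and a machine with two CUDA devices, and the layer sizes are illustrative.

```python
# Sketch: placing half of a layer's neurons on each of two GPUs, in the spirit
# of AlexNet's two-GPU split (illustrative, not the original implementation).
import torch
import torch.nn as nn

class TwoGPUSplitLayer(nn.Module):
    def __init__(self, in_features=4096, out_features=4096):
        super().__init__()
        half = out_features // 2
        # Half of the output neurons live on GPU 0, the other half on GPU 1.
        self.half_a = nn.Linear(in_features, half).to("cuda:0")
        self.half_b = nn.Linear(in_features, half).to("cuda:1")

    def forward(self, x):
        # Each GPU computes its half of the neurons; the results are gathered
        # back onto one device and concatenated.
        out_a = self.half_a(x.to("cuda:0"))
        out_b = self.half_b(x.to("cuda:1"))
        return torch.cat([out_a, out_b.to("cuda:0")], dim=1)

layer = TwoGPUSplitLayer()
print(layer(torch.randn(32, 4096)).shape)  # torch.Size([32, 4096])
```

Neither GPU ever has to hold the whole layer, which is exactly what let AlexNet outgrow a single 3GB card.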
AlexNet won the 2012 competition by a landslide, but the team accomplished something way more profound. The size of AI models was once and for all decoupled from what was possible to do on a single CPU or GPU. The genie was out of the bottle. (The AlexNet source code was recently made available through the Computer History Museum.)

The balancing act

After AlexNet, using multiple GPUs to train AI became a no-brainer. Increasingly powerful AIs used tens of GPUs, then hundreds, thousands, and more. But it took some time before this trend started making its presence felt on the grid. According to an Electric Power Research Institute (EPRI) report, the power consumption of data centers was relatively flat between 2010 and 2020. That doesn't mean the demand for data center services was flat, but the improvements in data centers' energy efficiency were sufficient to offset the fact that we were using them more.

Two key drivers of that efficiency were the increasing adoption of GPU-based computing and improvements in the energy efficiency of those GPUs. "That was really core to why Nvidia was born. We paired CPUs with accelerators to drive the efficiency onward," said Dion Harris, head of data center product marketing at Nvidia. In the 2010-2020 period, Nvidia data center chips became roughly 15 times more efficient, which was enough to keep data center power consumption steady.

All that changed with the rise of enormous large language transformer models, starting with ChatGPT in 2022. "There was a very big jump when transformers became mainstream," said Mosharaf Chowdhury, a professor at the University of Michigan. (Chowdhury is also at the ML Energy Initiative, a research group focusing on making AI more energy-efficient.)

Nvidia has kept up its efficiency improvements, with a ten-fold boost between 2020 and today. The company also kept improving chips that were already deployed. "A lot of where this efficiency comes from is software optimization. Only last year, we improved the overall performance of Hopper by about 5x," Harris said. Despite these efficiency gains, based on Lawrence Berkeley National Laboratory estimates, the US saw data center power consumption shoot up from around 76 TWh in 2018 to 176 TWh in 2023.

The AI lifecycle

LLMs work with tens of billions of neurons, approaching a number that rivals, and perhaps even surpasses, the count in the human brain. GPT-4 is estimated to work with around 100 billion neurons distributed over 100 layers and over 100 trillion parameters that define the strength of connections among the neurons. These parameters are set during training, when the AI is fed huge amounts of data and learns by adjusting these values. That's followed by the inference phase, where the model gets busy processing queries coming in every day.

The training phase is a gargantuan computational effort: OpenAI supposedly used over 25,000 Nvidia A100 (Ampere) GPUs running on all cylinders for 100 days. The estimated power consumption is 50 GW-hours, which is enough to power a medium-sized town for a year. According to numbers released by Google, training accounts for 40 percent of the total AI model power consumption over its lifecycle. The remaining 60 percent is inference, where power consumption figures are less spectacular but add up over time.

Trimming AI models down

The increasing power consumption has pushed the computer science community to think about how to keep memory and computing requirements down without sacrificing performance too much. "One way to go about it is reducing the amount of computation," said Jae-Won Chung, a researcher at the University of Michigan and a member of the ML Energy Initiative.

One of the first things researchers tried was a technique called pruning, which aims to reduce the number of parameters. Yann LeCun, now the chief AI scientist at Meta, proposed this approach back in 1989, terming it (somewhat menacingly) "optimal brain damage." You take a trained model and remove some of its parameters, usually targeting the ones with a value of zero, which add nothing to the overall performance. "You take a large model and distill it into a smaller model, trying to preserve the quality," Chung explained.

You can also make the remaining parameters leaner with a trick called quantization. Parameters in neural nets are usually represented as single-precision floating-point numbers, occupying 32 bits of computer memory. "But you can change the format of parameters to a smaller one that reduces the amount of needed memory and makes the computation faster," Chung said. Shrinking an individual parameter has a minor effect, but when there are billions of them, it adds up.
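The memory arithmetic is easy to see in a small sketch. The snippet below, illustrative only and not any production quantizer, maps 32-bit weights onto 8-bit integers with a single scale factor, cutting storage by a factor of four at the cost of a small rounding error.

```python
# Sketch: symmetric int8 quantization of a weight matrix (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(scale=0.05, size=(1024, 1024)).astype(np.float32)

# A single scale factor maps the largest absolute weight to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision the smaller format gives up.
reconstructed = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - reconstructed).max()

print(f"fp32 storage: {weights_fp32.nbytes / 1e6:.1f} MB")  # ~4.2 MB
print(f"int8 storage: {weights_int8.nbytes / 1e6:.1f} MB")  # ~1.0 MB, 4x smaller
print(f"max round-trip error: {max_error:.6f}")
```

Scaled up to a model with tens of billions of parameters, that same four-fold reduction can be the difference between a model that fits in a GPU's memory and one that does not.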
It's also possible to do quantization-aware training, which performs quantization at the training stage. According to Nvidia, which implemented quantization-aware training in its AI model optimization toolkit, this should cut memory requirements by 29 to 51 percent.

Pruning and quantization belong to a category of optimization techniques that rely on tweaking the way AI models work internally: how many parameters they use and how memory-intensive their storage is. These techniques are like tuning an engine in a car to make it go faster and use less fuel. But there's another category of techniques that focuses on the processes computers use to run those AI models instead of the models themselves, akin to speeding a car up by timing the traffic lights better.

Finishing first

Apart from optimizing the AI models themselves, we could also optimize the way data centers run them. Splitting the training-phase workload evenly among 25,000 GPUs introduces inefficiencies. "When you split the model into 100,000 GPUs, you end up slicing and dicing it in multiple dimensions, and it is very difficult to make every piece exactly the same size," Chung said.

GPUs that have been given significantly larger workloads have increased power consumption that is not necessarily balanced out by those with smaller loads. Chung figured that if GPUs with smaller workloads ran slower, consuming much less power, they would finish at roughly the same time as GPUs processing larger workloads operating at full speed. The trick was to pace each GPU in such a way that the whole cluster would finish at the same time.

To make that happen, Chung built a software tool called Perseus that identifies the scope of the workload assigned to each GPU in a cluster. Perseus takes the estimated time needed to complete the largest workload on a GPU running at full speed. It then estimates how much computation must be done on each of the remaining GPUs and determines what speed to run them at so they all finish at the same time. "Perseus precisely slows some of the GPUs down, and slowing down means less energy. But the end-to-end speed is the same," Chung said.

The team tested Perseus by training the publicly available GPT-3, as well as other large language models and a computer vision AI. The results were promising. "Perseus could cut up to 30 percent of energy for the whole thing," Chung said. He said the team is talking about deploying Perseus at Meta, "but it takes a long time to deploy something at a large company."
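Perseus itself is considerably more sophisticated, but the core intuition can be sketched in a few lines: if you know each GPU's share of the work, you can pace the lightly loaded ones so everyone finishes together with the straggler instead of racing ahead and idling. The workload numbers and the cubic power-versus-speed model below are illustrative assumptions, not Perseus's actual cost model.

```python
# Sketch of the pacing idea: slow lightly loaded GPUs so the whole group
# finishes together with the most heavily loaded one (illustrative only).

workloads = [100.0, 92.0, 80.0, 60.0]  # hypothetical per-GPU work, arbitrary units
slowest = max(workloads)               # the straggler sets the finish time at full speed

for gpu_id, work in enumerate(workloads):
    # Fraction of full speed this GPU needs so it finishes when the straggler does.
    pace = work / slowest
    # Toy model: dynamic power scales roughly with pace**3 (frequency times voltage
    # squared), so the energy for a fixed amount of work scales with pace**2.
    relative_energy = pace ** 2
    print(f"GPU {gpu_id}: run at {pace:.0%} of full speed, "
          f"~{1 - relative_energy:.0%} less dynamic energy for its share of the work")
```

The end-to-end training time is unchanged because nothing finishes later than the straggler; only the wasted headroom on the other GPUs is trimmed away.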
Are all those optimizations to the models and the way data centers run them enough to keep us in the green? It takes roughly a year or two to plan and build a data center, but it can take longer than that to build a power plant. So are we winning this race or losing? It's a bit hard to say.

Back of the envelope

As the increasing power consumption of data centers became apparent, research groups tried to quantify the problem. A Lawrence Berkeley National Laboratory team estimated that data centers' annual energy draw in 2028 would be between 325 and 580 TWh in the US, which is between 6.7 and 12 percent of total US electricity consumption. The International Energy Agency thinks it will be around 6 percent by 2026. Goldman Sachs Research says 8 percent by 2030, while EPRI claims between 4.6 and 9.1 percent by 2030.

EPRI also warns that the impact will be even worse because data centers tend to be concentrated in locations investors think are advantageous, like Virginia, which already sends 25 percent of its electricity to data centers. In Ireland, data centers are expected to consume one-third of the electricity produced in the entire country in the near future. And that's just the beginning.

Running huge AI models like ChatGPT is one of the most power-intensive things that data centers do, but it accounts for roughly 12 percent of their operations, according to Nvidia. That is expected to change if companies like Google start to weave conversational LLMs into their most popular services. The EPRI report estimates that a single Google search today uses around 0.3 watt-hours of electricity, while a single ChatGPT query bumps that up to 2.9 watt-hours. Based on those values, the report estimates that an AI-powered Google search would require Google to deploy 400,000 new servers that would consume 22.8 TWh per year.
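Those figures lend themselves to exactly the back-of-the-envelope arithmetic the section title promises. The short snippet below shows where the "10x" framing quoted next comes from and what 22.8 TWh a year spread over 400,000 servers would imply per machine; the per-server wattage is a derived implication, not a number published by EPRI.

```python
# Back-of-the-envelope arithmetic on the EPRI figures quoted above.
google_search_wh = 0.3   # Wh per conventional Google search (EPRI estimate)
chatgpt_query_wh = 2.9   # Wh per ChatGPT query (EPRI estimate)
new_servers = 400_000    # servers EPRI says AI-powered search would require
annual_twh = 22.8        # their projected consumption, in TWh per year

# The ratio behind the "10x" claim repeated below.
print(f"ChatGPT / Google ratio: {chatgpt_query_wh / google_search_wh:.1f}x")  # ~9.7x

# What 22.8 TWh/year over 400,000 servers implies per machine (derived, not published).
hours_per_year = 365 * 24
watts_per_server = annual_twh * 1e12 / new_servers / hours_per_year
print(f"Implied average draw per server: {watts_per_server / 1000:.1f} kW")  # ~6.5 kW
```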
"AI searches take 10x the electricity of a non-AI search," Christie, the FERC commissioner, said at a FERC-organized conference. When FERC commissioners are using those numbers, you'd think there would be rock-solid science backing them up. But when Ars asked Chowdhury and Chung about their thoughts on these estimates, they exchanged looks and smiled.

Closed AI problem

Chowdhury and Chung don't think those numbers are particularly credible. They feel we know nothing about what's going on inside commercial AI systems like ChatGPT or Gemini, because OpenAI and Google have never released actual power-consumption figures.

"They didn't publish any real numbers, any academic papers. The only number, 0.3 watt-hours per Google search, appeared in some blog post or other PR-related thingy," Chowdhury said. We don't know how this power consumption was measured, on what hardware, or under what conditions, he said. But at least it came directly from Google.

"When you take that 10x Google versus ChatGPT equation or whatever, one part is half-known, the other part is unknown, and then the division is done by some third party that has no relationship with Google nor with OpenAI," Chowdhury said.

Google's "PR-related thingy" was published back in 2009, while the 2.9-watt-hours-per-ChatGPT-query figure was probably based on a comment about the number of GPUs needed to train GPT-4 made by Jensen Huang, Nvidia's CEO, in 2024. That means the 10x AI-versus-non-AI-search claim was actually based on power consumption achieved on entirely different generations of hardware separated by 15 years. "But the number seemed plausible, so people keep repeating it," Chowdhury said.

All the reports we have today were done by third parties that are not affiliated with the companies building big AIs, and yet they arrive at weirdly specific numbers. "They take numbers that are just estimates, then multiply those by a whole lot of other numbers and get back with statements like 'AI consumes more energy than Britain, or more than Africa,' or something like that. The truth is they don't know that," Chowdhury said.

He argues that better numbers would require benchmarking AI models using a formal testing procedure that could be verified through the peer-review process. As it turns out, the ML Energy Initiative defined just such a testing procedure and ran the benchmarks on any AI models its members could get ahold of. The group then posted the results online on its ML.ENERGY Leaderboard.

AI-efficiency leaderboard

To get good numbers, the first thing the ML Energy Initiative got rid of was the idea of estimating how power-hungry GPU chips are by using their thermal design power (TDP), which is basically their maximum power consumption. Using TDP was a bit like rating a car's efficiency based on how much fuel it burned running at full speed. That's not how people usually drive, and that's not how GPUs work when running AI models. So Chung built ZeusMonitor, an all-in-one solution that measured GPU power consumption on the fly.
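ZeusMonitor is a purpose-built tool, but the underlying idea of measuring actual draw rather than assuming TDP can be sketched with Nvidia's management library: sample the GPU's reported power while the model runs and integrate over time. This is a simplified illustration of the approach, not the ZeusMonitor code, and it assumes the nvidia-ml-py (pynvml) bindings and at least one NVIDIA GPU.

```python
# Sketch: estimating the energy of a block of GPU work by polling its power
# draw with NVML (a simplified illustration, not ZeusMonitor itself).
import threading
import time

import pynvml  # pip install nvidia-ml-py

def measure_energy_joules(workload, device_index=0, interval_s=0.05):
    """Run workload() while polling GPU power; return (energy in joules, result)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    samples = []
    result = {}

    def run():
        result["value"] = workload()

    worker = threading.Thread(target=run)
    start = time.time()
    worker.start()
    while worker.is_alive():
        # NVML reports instantaneous board power in milliwatts.
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
        time.sleep(interval_s)
    worker.join()
    elapsed = time.time() - start
    pynvml.nvmlShutdown()

    avg_power_w = sum(samples) / max(len(samples), 1)
    return avg_power_w * elapsed, result.get("value")  # energy = avg watts x seconds
```

Dividing the result by 3,600 converts joules to watt-hours, which is how the 3,352.92 joules per request measured for Llama 3.1 405B below works out to roughly 0.93 Wh.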
For the tests, his team used setups with Nvidia's A100 and H100 GPUs, the ones most commonly used at data centers today, and measured how much energy they used running various large language models (LLMs), diffusion models that generate pictures or videos based on text input, and many other types of AI systems.

The largest LLM included in the leaderboard was Meta's Llama 3.1 405B, an open-source chat-based AI with 405 billion parameters. It consumed 3,352.92 joules of energy per request running on two H100 GPUs. That's around 0.93 watt-hours, significantly less than the 2.9 watt-hours quoted for ChatGPT queries. The measurements also confirmed the improvements in the energy efficiency of hardware: Mixtral 8x22B was the largest LLM the team managed to run on both the Ampere and Hopper platforms. Running the model on two Ampere GPUs resulted in 0.32 watt-hours per request, compared to just 0.15 watt-hours on one Hopper GPU.

What remains unknown, however, is the performance of proprietary models like GPT-4, Gemini, or Grok. The ML Energy Initiative team says it's very hard for the research community to start coming up with solutions to the energy-efficiency problem when we don't even know what exactly we're facing. We can make estimates, but Chung insists they need to be accompanied by error-bound analysis. We don't have anything like that today.

The most pressing issue, according to Chung and Chowdhury, is the lack of transparency. "Companies like Google or OpenAI have no incentive to talk about power consumption. If anything, releasing actual numbers would harm them," Chowdhury said. "But people should understand what is actually happening, so maybe we should somehow coax them into releasing some of those numbers."

Where rubber meets the road

"Energy efficiency in data centers follows a trend similar to Moore's law, only working at a very large scale, instead of on a single chip," Nvidia's Harris said. The power consumption per rack, a unit used in data centers housing between 10 and 14 Nvidia GPUs, is going up, he said, but the performance per watt is getting better.

"When you consider all the innovations going on in software optimization, cooling systems, MEP (mechanical, electrical, and plumbing), and GPUs themselves, we have a lot of headroom," Harris said. He expects this large-scale variant of Moore's law to keep going for quite some time, even without any radical changes in technology.

There are also more revolutionary technologies looming on the horizon. The idea that drove companies like Nvidia to their current market status was the concept that you could offload certain tasks from the CPU to dedicated, purpose-built hardware. But now, even GPUs will probably use their own accelerators in the future. Neural nets and other parallel-computation tasks could be implemented on photonic chips that use light instead of electrons to process information. Photonic computing devices are orders of magnitude more energy-efficient than the GPUs we have today and can run neural networks literally at the speed of light.

Another innovation to look forward to is 2D semiconductors, which enable building incredibly small transistors and stacking them vertically, vastly improving the computation density possible within a given chip area. "We are looking at a lot of these technologies, trying to assess where we can take them," Harris said. "But where rubber really meets the road is how you deploy them at scale. It's probably a bit early to say where the future bang for buck will be."

The problem is that when we make a resource more efficient, we simply end up using it more. This is the Jevons paradox, known since the beginnings of the industrial age. But will AI energy consumption increase so much that it causes an apocalypse? Chung doesn't think so. According to Chowdhury, if we run out of energy to power our progress, we will simply slow down.

"But people have always been very good at finding the way," Chowdhury added.

Jacek Krywko, Associate Writer

Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.