Breaking Through the AI Bottlenecks
www.informationweek.com
As chief information officers race to adopt and deploy artificial intelligence, they eventually confront an uncomfortable truth: Their IT infrastructure isn't ready for AI. From widespread GPU shortages and latency-prone networks to rapidly spiking energy demands, they run into bottlenecks that undermine performance and boost costs.

An inefficient AI framework can greatly diminish the value of AI, says Sid Nag, vice president of research at Gartner. Teresa Tung, global data capability lead at Accenture, adds that the scarcity of high-end GPUs is an issue, but other factors -- including power, cooling, and data center design and capacity -- also affect results.

The takeaway? Demanding, resource-intensive AI workloads require IT leaders to rethink how they design networks, allocate resources and manage power consumption. Those who ignore these challenges risk falling behind in the AI arms race -- and undercutting business performance.

Breaking Points

The most glaring and widely reported problem is a scarcity of the high-end GPUs required for inferencing and operating AI models. For example, rack-scale systems built around Nvidia's highly coveted Blackwell GPUs, such as the GB200 NVL72, have been nearly impossible to find for months as major companies like Amazon, Google, Meta and Microsoft scoop them up. Even when a business can obtain one, a fully configured NVL72 server costs around $3 million; the smaller NVL36 configuration runs about $1.8 million.

While this affects enterprises directly, the GPU shortage also squeezes major cloud providers like AWS, Google, and Microsoft, which increasingly ration resources and capacity, Nag says. For businesses, the repercussions are palpable: Without the hardware infrastructure required to build AI models, training becomes slow or unfeasible, and data bottlenecks can undermine performance, he notes.

GPU shortages are just one piece of the puzzle, however. As organizations look to plug in AI tools for specialized purposes such as computer vision, robotics, or chatbots, they discover a need for fast, efficient infrastructure optimized for AI, Tung explains.

Network latency can prove particularly challenging, because even small delays in processing AI queries can trip up an initiative. GPU clusters require high-speed interconnects to communicate at full speed, yet many networks still rely on legacy copper, which significantly slows data transfers, according to Terry Thorn, vice president of commercial operations for Ayar Labs, a vendor that specializes in AI-optimized infrastructure.

Still another potential problem is data center space and energy consumption. AI workloads -- particularly those running on high-density GPU clusters -- draw vast amounts of power. As deployments scale, CIOs may scramble to add servers, hardware and advanced technologies like liquid cooling. Inefficient hardware, network infrastructure and AI models exacerbate the problem, Nag says. Making matters worse, upgrading power and cooling infrastructure is complicated and time-consuming; Nag points out that these upgrades may take a year or longer to complete, creating additional short-term bottlenecks.

Scaling Smart

Optimizing AI is inherently complicated because the technology touches areas as diverse as data management, computational resources and user interfaces. Consequently, CIOs must decide how to approach each AI project based on the use case, the AI model and organizational requirements. This includes balancing on-premises GPU clusters, with their different mixes of chips, against cloud-based AI services.

Organizations must consider how, when and where cloud services and specialty AI providers make sense, Tung says. If building a GPU cluster internally is undesirable or out of reach, then it's critical to find a suitable service provider. "You have to understand the vendor's relationships with GPU providers, what types of alternative chips they offer, and what exactly you are gaining access to," she says.

In some cases, AWS, Google, or Microsoft may offer a solution through specific products and services. However, an array of niche and specialty AI service companies also exists, and some consulting firms -- Accenture and Deloitte among them -- have direct partnerships with Nvidia and other GPU vendors. "In some cases, you can get data flowing through these custom models and frameworks," Tung says. "You can lean into these relationships to get the GPUs you need."

For those running GPU clusters, maximizing network performance is paramount. As workloads scale, systems struggle with data transfer limitations, and one of the critical choke points is copper. Ayar Labs, for example, replaces copper interconnects with high-speed optical interconnects that reduce latency, power consumption and heat generation. The result is not only better GPU utilization but also more efficient model processing, particularly for large-scale deployments.

In fact, Ayar Labs claims 10x lower latency and up to 10x more bandwidth than traditional interconnects, along with a 4x to 8x reduction in power. "No longer are chips waiting for data rather than computing," Thorn states. The problem can become particularly severe as organizations adopt complex large language models. "Increasing the size of the pipe boosts utilization and reduces CapEx," he adds.

Still another piece of the puzzle is model efficiency, including distillation. By adapting a model specifically to a laptop or smartphone, for example, it's often possible to use different combinations of GPUs and CPUs. The result can be a model that runs faster, better and cheaper, Tung says.
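To make the device-adaptation idea concrete, here is a minimal sketch of one common model-compression technique, post-training dynamic quantization, written in PyTorch (an assumed framework; the article names none). Quantization is a sibling of the distillation Tung describes rather than the same process, but it serves the same goal: a model light enough for CPU-only hardware. The two-layer network below is a hypothetical stand-in for a real trained model.

    # A minimal sketch: shrink a trained model with post-training dynamic
    # quantization so it runs acceptably on CPU-only hardware like a laptop.
    # The tiny network here is a hypothetical stand-in for a real model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(768, 3072),
        nn.ReLU(),
        nn.Linear(3072, 768),
    )
    model.eval()

    # Store the Linear layers' weights as int8; activations are quantized
    # on the fly at inference time, cutting memory use and CPU latency.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # The quantized model is a drop-in replacement for CPU inference.
    x = torch.randn(1, 768)
    with torch.no_grad():
        print(quantized(x).shape)  # torch.Size([1, 768])

Because int8 weights take roughly a quarter of the space of float32, a conversion like this can be the difference between a model that fits on an end-user device and one that does not.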
Power Plays

Addressing AI's power requirements is also essential. An overarching energy strategy can help avoid short-term performance bottlenecks as well as long-term chokepoints. "Energy consumption is going to be a problem, if it is not already a problem for many companies," Nag says. Without adequate supply, power can become a barrier to success; it can also undermine sustainability and invite accusations of greenwashing. He suggests that CIOs view AI in a broad and holistic way, including identifying ways to reduce reliance on GPUs.

Establishing clear policies and a governance framework around the use of AI can also minimize the risk that non-technical business users misuse tools or inadvertently create bottlenecks. The risk is greater when these users turn to hyperscalers like AWS, Google and Microsoft. Without some guidance and direction, "it can be like walking into a candy store and not knowing what to pick," Nag points out.

In the end, an enterprise AI framework must bridge both strategy and IT infrastructure. The objective, Tung explains, is ensuring your company controls its destiny in an AI-driven world.