The Cost of AI Infrastructure: New Gear for AI Liftoff
www.informationweek.com
Richard Pallardy, Freelance Writer
February 5, 2025 | 14 Min Read
Photo: Tithi Luadthong via Alamy Stock

Optimizing an organization for AI is challenging -- not least because of the difficulty of determining which equipment and services are actually necessary, and of balancing those demands against cost. In a rapidly changing landscape, companies must decide how much they want to depend on AI and make highly consequential decisions in short order.

A 2024 Expereo report found that 69% of businesses plan to adopt AI in some form. According to a 2024 Microsoft report, 41% of leaders surveyed are seeking assistance in improving their AI infrastructure. And two-thirds of executives were dissatisfied with their organizations' progress on AI adoption, according to a BCG survey last year.

Circumstances vary wildly, from actively training AI models to simply deploying them -- or both. Regardless of the use case, a complex array of chips is required: central processing units (CPUs), graphics processing units (GPUs), and potentially data processing units (DPUs) and tensor processing units (TPUs). Enormous amounts of data are required to train and run AI models, and these chips are essential to doing so. Discerning how much compute power a given AI application will require is crucial to deciding how many of these chips are needed -- and where to get them. Solutions must be simultaneously cost-effective and adaptable.

Cloud services are accessible and easily scalable, but costs can add up quickly. Pricing structures are often opaque, and budgets can balloon in short order even with relatively constrained use. And depending on the application, some hardware may be required as well.

On-premises solutions can be eye-wateringly expensive too -- and they come with maintenance and updating costs. Setting up servers in-office or in data centers requires an even more sophisticated understanding of projected computing needs: how much hardware will be required and how much it will cost to run. Still, such setups are also customizable, and users have more direct control.

Then the technicalities come into play: how to store the data used to train and operate AI models, and how to transmit that data at high bandwidth and low latency. Privacy is a concern, too, especially in the development of new AI models, which often use sensitive data.

It is a messy and highly volatile ecosystem, which makes informed decisions on technological investment all the more crucial.

Here, InformationWeek investigates the complexities of establishing an AI-optimized organization, with insights from Rick Bentley, founder of AI surveillance and remote guarding company Cloudastructure and crypto-mining company Hydro Hash; Adnan Masood, chief AI architect for digital solutions company UST; and Lars Nyman, chief marketing officer of cloud computing company CUDO Compute.

All About the Chips

Training and deploying AI programs hinge on CPUs, GPUs, and in some cases TPUs.

CPUs provide basic services -- running operating systems, delivering code, and wrangling data. While newer CPUs are capable of the parallel processing required for AI workloads, they are best at sequential processing. An ecosystem using only CPUs can run very moderate AI workloads -- typically, inference only.

GPUs, of course, are the linchpin of AI technology. They allow multiple streams of data to be processed in parallel -- AI relies on massive amounts of data, and it is crucial that systems can handle these workloads without interruption. Training and running AI models of any significant size -- particularly those using any form of deep learning -- will require GPU power. GPUs may be up to 100 times as efficient as CPUs at certain deep learning tasks.
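The difference is easy to demonstrate firsthand. Below is a minimal sketch -- assuming PyTorch is installed and a CUDA-capable GPU may or may not be present -- that times the same matrix multiply, the core operation of deep learning, on each processor. The sizes and the resulting speedup are illustrative, not a rigorous benchmark.

```python
import time
import torch

def time_matmul(device: str, n: int = 2048, runs: int = 5) -> float:
    """Average seconds per n x n matrix multiply on the given device."""
    x = torch.randn(n, n, device=device)
    y = torch.randn(n, n, device=device)
    torch.matmul(x, y)  # warm-up so one-time initialization isn't timed
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels launch asynchronously
    start = time.perf_counter()
    for _ in range(runs):
        torch.matmul(x, y)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

cpu_s = time_matmul("cpu")
print(f"CPU: {cpu_s * 1000:.1f} ms per multiply")
if torch.cuda.is_available():
    gpu_s = time_matmul("cuda")
    print(f"GPU: {gpu_s * 1000:.1f} ms per multiply ({cpu_s / gpu_s:.0f}x faster)")
```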
Whether they are purchased or rented, GPUs cost a pretty penny. They are also sometimes hard to come by, given the high demand.

"They can crunch data and run training models at hyperspeed. SMEs might go for mid-tier Nvidia GPUs like the A100s, while larger enterprises may dive headfirst into specialized systems like Nvidia DGX SuperPODs," Nyman says. "A single high-performance GPU server can cost $40,000 to $400,000, depending on scale and spec."

Certain specialized tasks may benefit from application-specific integrated circuits (ASICs) such as TPUs, which can accelerate workloads that use neural networks.

Where Does the Data Live?

AI relies on enormous amounts of data -- words, images, recordings. Some of it is structured and some of it is not.

Data can exist either in data lakes -- unstructured pools of raw data that must be processed for use -- or in data warehouses -- structured repositories that AI applications can access more easily. Data processing protocols can help filter the former into the latter. Organizations looking to optimize their operations through AI need to figure out where to store that data securely while still allowing machine learning algorithms to access and utilize it.

Hard disk drives or flash-based solid-state drive arrays may be sufficient for some projects.

"Good old spindle hard drives are delightfully cheap," Bentley says. "They store a lot of data. But they're not that fast compared to the solid-state drives that are out now. It depends on what you're trying to do."

Organizations that rely on larger amounts of data may need non-volatile memory express (NVMe)-based storage arrays. These systems are primed to communicate with CPUs and channel data into the AI program, where it can be analyzed and deployed.

That data needs to be backed up, too.

"AI systems obviously thrive on data, but that data can be fragile," Nyman observes. "At minimum, SMEs need triple-redundancy storage: local drives, cloud backup, and cold storage. Object storage systems like Ceph or S3-compatible services run around $100/TB a month, scaling up fast with your needs."
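Those per-terabyte rates compound quickly once every byte lives in three places. Here is a back-of-the-envelope sketch of the monthly math for a hypothetical 50 TB footprint; only the roughly $100/TB object-storage figure comes from Nyman -- the other tier rates are placeholder assumptions.

```python
# Monthly cost of triple-redundancy storage (local + cloud + cold),
# per the scheme Nyman describes. Rates are $/TB/month.
dataset_tb = 50  # raw data footprint -- hypothetical

tiers = {
    "local NVMe/HDD array": 25,        # assumed, amortized hardware
    "object storage (Ceph/S3)": 100,   # per the article
    "cold storage / archive": 5,       # assumed
}

total = 0.0
for tier, rate in tiers.items():
    cost = dataset_tb * rate
    total += cost
    print(f"{tier:26s} ${cost:>8,.0f}/month")
print(f"{'total (3 copies)':26s} ${total:>8,.0f}/month")
```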
Networking for AI

An efficient network is essential for an effective AI operation. "High-speed networking fools the computer into thinking that it actually has the whole model loaded up," Masood says.

Ethernet and fiber connections are generally considered optimal due to their high bandwidth and low latency. Remote direct memory access (RDMA) over Converged Ethernet protocols are considered superior to standard Ethernet-based networks for their smooth handling of large data transfers. InfiniBand may also be an option for AI applications that require high performance.

Low-latency, high-bandwidth networking gear -- 100 gigabit-per-second (Gbps) switches, fiber cabling, and software-defined networking (SDN) -- keeps data moving fast, a necessity, Nyman says.

Bandwidth for AI must be high. Enormous amounts of data must be transferred at high speed even for relatively constrained AI models. If that data is held up because it simply cannot be transferred in time to complete an operation, the model will not deliver the promised service to the end user.

Latency is a major hang-up. According to findings by Meta, 30% of wasted time in an AI application is due to slow network speeds. Ensuring that no compute node sits idle for any significant amount of time can save enormous amounts of money. An underutilized GPU, for example, represents lost investment as well as ongoing operational cost.

Front-end networks handle the non-AI portion of the compute, as well as the connectivity and management of the AI components themselves. Back-end networks handle the compute involved in training and inference -- the communication between the chips.

Both Ethernet and fiber are viable choices for the front-end network, and Ethernet is increasingly the preferred choice for back-end networks. Infrastructure-as-a-service (IaaS) arrangements may take some of the burden off organizations attempting to navigate the construction of their networks.

"If you have a large data setup, you don't want to run it with Ethernet," Masood cautions, however. "If you're using a protocol like InfiniBand or RDMA, you have to use fiber."

Though superior in some situations, these solutions come at a premium. "The switches, the transceivers, the fiber cables -- they are expensive, and the maintenance cost is very high," he adds.

While some level of onsite technology is likely necessary in some cases, these networking services can be taken offsite, allowing easier management of the complex array of transfers between the site, data centers, and cloud locations. Still, communication between on-premises devices must also be handled rapidly. Private 5G networks may be useful in some cases.

Automation of these processes is key -- it can be facilitated by a network operating system (NOS) that handles the various inputs and outputs and scales as the operation grows. Interoperability matters too, given that many organizations will use a hybrid of cloud, data center, and onsite resources.

DPUs can further streamline network operations by processing data packets, taking some of the workload from CPUs and allowing them to focus on more complex computations.
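The money at stake in a slow link is simple to estimate. Here is a minimal sketch -- every input is an assumption chosen for illustration -- of how long a large training checkpoint sits on the wire at different link speeds, and what the GPUs waiting on it cost in the meantime.

```python
# Idle-GPU cost of moving a checkpoint at various link speeds.
# All figures are hypothetical; real throughput will be below line rate.
checkpoint_gb = 500       # model + optimizer state to transfer (assumed)
gpu_count = 8
gpu_hourly_usd = 4.0      # assumed per-GPU rental/amortized cost

for gbps in (10, 100, 400):
    seconds = checkpoint_gb * 8 / gbps  # gigabytes -> gigabits, ideal case
    idle_cost = gpu_count * gpu_hourly_usd * seconds / 3600
    print(f"{gbps:>3} Gbps: {seconds:7.1f} s on the wire, "
          f"~${idle_cost:.2f} of idle GPU time per transfer")
```

Multiply that per-transfer waste by thousands of checkpoint and shard movements over a training run, and the premium for faster switches starts to pay for itself.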
Where Oh Where Do I Site My Compute?

AI implementation is tricky: everything, it seems, must happen everywhere, all at once. It is thus challenging to strike a balance of onsite technology, data center resources, and cloud services that meets the unique needs of a given application.

"I've seen 30% of people go with the on-prem route and 70% of the people go with the cloud route," Masood says.

Some organizations may be able to get away with using their existing technology, leaning on cloud solutions to keep things running. Implementing a chatbot does not necessarily mean dumping funds into cutting-edge hardware and expensive data center storage. Others, however, may find themselves needing more complex workstations, plus in-house and off-site storage and processing capabilities facilitated by bespoke networks. Training and inference for more complex models require specialized technology fine-tuned to the task at hand -- balancing exigent costs against scalability and privacy as the project progresses.

Onsite Solutions

All organizations will need some level of onsite hardware. Small-scale implementation of AI in cloud-based applications will likely require only minor upgrades, if any.

"The computers that people need to run anything on the cloud are just browsers. It's just a dumb terminal," Bentley says. "So you don't really need anything in the office." Larger projects will likely need more specialized setups.

The gap, however, is closing rapidly. According to Gartner, AI-enabled PCs containing neural processing units (NPUs) will account for 43% of PC purchases in 2025. Canalys expects that share to rise to 60% by 2027. The transition may be accelerated by the end of support for Windows 10 this year. This suggests that as organizations modernize their basic in-office hardware over the next several years, some level of AI capability will almost certainly be embedded. Some hardware companies are aggressively rolling out purpose-built, AI-capable devices as well.

Thus, some of the compute power required for AI will move to the edge by default -- likely reducing reliance on cloud and data centers to an extent, especially for organizations treading lightly in their early AI use. Speeds will likely improve through the simple proximity of the necessary hardware.

Organizations considering more advanced equipment must weigh the compute power they need from their devices against what they can get from cloud or data center services -- and how easily the devices can be upgraded later. It is worth noting, for example, that many laptops are difficult to upgrade because their CPUs and GPUs are soldered to the motherboard.

"The cost for a good workstation with high-end machines is usually between $5,000 and $15,000, depending on your setup," Masood reports. "That's really valuable, because the workload people have is constantly increasing."

Bentley suggests that in some cases a simpler solution is available. "One of the best bangs for the buck as a step up is a gaming PC. It's just an Intel i9. The CPU almost doesn't matter. It has an RTX 4090 graphics card," he says.

Organizations that are going all in will benefit from the increasing sophistication of this type of hardware. But they may also require on-premises servers out of practicality. Siting servers in-house allows easier customization, maintenance, and scaling. Bandwidth requirements and latency may be reduced. And it is also a privacy safeguard: organizations handling high volumes of proprietary data and developing their own algorithms to utilize it need to ensure that it is housed and moved with the greatest of care.

The upfront costs of installation, in addition to maintenance and staffing, present a challenge.

"It's harder to procure hardware," Masood notes. "Unless you are running a very sophisticated shop where you have a lot of data privacy restrictions and other concerns, you probably want to still go with the cloud approach."

"For an SME starting from scratch, you're looking at $500,000 to $1 million for a modest AI-ready setup: a handful of GPU servers, a solid networking backbone, and basic redundancy," Nyman says. "Add more if your ambitions include large-scale training or real-time AI inference."

Building an in-house data center is a heavier lift still. "We're looking at $20 million to $50 million for a mid-sized operation," Nyman estimates. Then there is, of course, the ongoing cost of cooling, electricity, and maintenance. A 1 megawatt (MW) data center -- enough to power about 10 racks of high-end GPUs -- can cost around $1 million annually just to keep the lights on.
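That figure is easy to sanity-check with basic power arithmetic. In the rough sketch below, the electricity rate and the facility's power usage effectiveness (PUE) are assumptions, not article data.

```python
# "Keep the lights on" math for a 1 MW data center.
it_load_mw = 1.0      # IT load, per the article
pue = 1.4             # assumed facility overhead (cooling, power losses)
usd_per_kwh = 0.10    # assumed commercial electricity rate

hours_per_year = 24 * 365
annual_kwh = it_load_mw * 1000 * pue * hours_per_year
print(f"~${annual_kwh * usd_per_kwh:,.0f}/year in electricity alone")
# ~$1.2M/year -- in line with the article's estimate, before staffing
# and hardware refresh are even counted.
```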
But for organizations confident in the profitability of their product, it is likely a worthwhile investment. It may in fact be cheaper than utilizing cloud services in some cases. Further, the cloud is likely to come under increasing strain -- and thus may become less reliable.

Off-Site Solutions

Data center colocation services may suit organizations that wish to maintain some level of control over their equipment but do not wish to maintain it themselves. They can customize their servers just as they might on-premises -- installing exactly the number of GPUs and other components they require to run their programs.

"SMEs may invest in a shared space in a data center -- they will have 100 GPUs, which they're using to handle training or dev-based workloads. That costs around $100,000 to $200,000 upfront," Masood says. "People have been experimenting with it."

They can then pay the data center to maintain the servers -- which, of course, means additional costs. "The tools get increasingly sophisticated the more data you're dealing with, and that gets expensive," Bentley says. "Support plans can be like $50,000 a month for the guy who sold you the storage array to keep it running well for you."

Still, data centers obviate the need to retrofit on-premises facilities with proper connections, cooling infrastructure, and power. At least some maintenance costs are standardized and predictable. Security protocols will also already be in place, reducing separate security costs.

Cloud Solutions

Organizations that prefer minimal hardware infrastructure -- or none at all -- can turn to cloud computing providers such as Amazon, Google, and Microsoft. These services offer flexible, scalable solutions without the complexity of setting up servers and investing in specialized workstations.

Major cloud providers operate on a shared responsibility model. "They provide you the GPU instances, they provide the setup. They provide everything for you," Masood says. "It's easier."

This may be a good option for organizations just beginning to experiment with AI integration, or still deciding how to scale up their existing AI applications without spending more on hardware. A wide variety of advanced resources are available, allowing companies to decide which ones are most useful to them without any overhead beyond the cost of the service and the work itself. Further, these platforms typically offer intuitive interfaces that let beginners play with the technology and learn as they go.

"If companies are using a public cloud provider, they have two options. They can either use managed AI services or they can use the GPU instances the companies provide," Masood says. "When they use the GPU instances which companies provide, that is divided into two different categories: spot instances, which means you buy it on demand right away, and renting them. If you rent over longer periods, of course, the cost is cheaper."
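The trade-off Masood describes can be put in rough numbers. A minimal sketch comparing hypothetical hourly rates -- placeholders, not any provider's actual price list -- for the same eight-GPU fleet:

```python
# Monthly cost of the same GPU fleet under different purchase models.
# Rates are illustrative assumptions in $/GPU-hour.
hours_per_month = 730
gpus = 8

rates = {
    "on-demand": 4.00,
    "spot (interruptible)": 1.60,
    "1-year committed rental": 2.50,
}

for plan, rate in rates.items():
    monthly = gpus * rate * hours_per_month
    print(f"{plan:24s} ${monthly:>8,.0f}/month for {gpus} GPUs")
# Spot capacity is cheapest but can be reclaimed by the provider
# mid-job, so it suits fault-tolerant training rather than
# latency-sensitive inference.
```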
But cloud is not always the most cost-efficient option. "Those bills can get fantastically huge," Bentley says. "They start charging for storing data while it's there. There are companies who exist just to help you understand your bill so you can reduce it."

"They kind of leave you to do the math a lot of the time. I think it's somewhat obfuscated on purpose," he adds. "You still need to have at least one full-time DevOps person whose job it is to run these things well."

In the current environment, organizations are compelled to piece together the solutions that work best for their needs. There are no magic formulas that work for everyone -- it pays to solicit the advice of knowledgeable parties and devise custom setups.

"AI definitely isn't a plug-and-play solution -- yet," Nyman says. "It's more like building a spaceship, where each part is critical and the whole is greater than the sum. Costs can be staggering, but the potential ROI -- process automation, faster insights, and market disruption -- can justify the investment."

Nonetheless, Masood is encouraged. "People used to have this idea that AI was a very capital-intensive business. I think that's unfounded. Models are maturing and things are becoming much more accessible," he says.

About the Author

Richard Pallardy, Freelance Writer

Richard Pallardy is a freelance writer based in Chicago. He has written for such publications as Vice, Discover, Science Magazine, and the Encyclopedia Britannica.