• VALHALLA: v1.0.0 Beta 2 Released!

    VALHALLA's second beta dropped on May 01, 2025. Grab it while it's hot!

    Posted by garsipal on May 6th, 2025

    May 01, 2025 - The second beta is out!

    Change log:

    Reworked the game HUD with animations and improved gameplay feedback
    Game version string update
    - Show game version in the server browser
    - Display warning if client and server versions mismatch before connecting
    - Show client game version in the main menu
    Added disconnection error alert box
    Major Invasion mode refinements for better pacing and balance
    - Fixed monster behaviors
    - Reworked appearance and attacks for certain monsters
    - Fixed hitbox issues
    Revamped offline match setup and voting interface
    New texture manipulation features
    - Transform textures using matrices (`vmatrix` / `texmatrix`)
    - Adjust texture hue (`vhue` / `texhue`)
    - Support for spritesheet animations
    Optional IP geolocation and custom flag support
    - Servers can enable GeoIP
    - Players can customize their profile with country or custom flags
    Smoother gameplay feel with updated animations and effects
    - Enhanced camera shake and visual effects
    - Improved bobbing animations
    - Added natural weapon sway
    New music tracks
    - Win/lose themes
    - New main menu track
    - Additional gameplay tracks
    More community maps
    New announcer lines and voice lines
    Improved tutorial level
    - Added new tasks
    - More environmental storytelling
    - Extra missions
    Interactive ragdolls (yes, really)
    Reworked projectile system
    - Projectiles can now be destroyed in both online and offline modes
    - Server recognizes who destroyed a projectile and credits the kill
    - Code refactored for better maintainability
    Added interface to customize controls
    Improved intermission (end-of-match) experience
    - Shorter delay before exit
    - Voting interface for next map and mode
    - Fixed infinite intermission bug in Invasion mode
    Weapon balancing and feedback improvements
    - Stronger recoil for more powerful weapons
    - New zoom effect
    - Tweaked weapon stats
    Fixed various bot behavior issues
    New 3D and 2D assets
    Tons of fixes: UI, editor, memory management, optimization and crashes
    Codebase refactoring and cleanup behind the scenes
  • Rethinking the Environmental Costs of Training AI — Why We Should Look Beyond Hardware
    Summary of This Study
    Hardware choices – specifically hardware type and its quantity – along with training time, have a significant positive impact on energy, water, and carbon footprints during AI model training, whereas architecture-related factors do not.
    The interaction between hardware quantity and training time slightly slows the growth of energy, water, and carbon consumption (by about 0.00002%).
    Overall energy efficiency during AI model training has improved slightly over the years, around 0.13% per year.
    Longer training time can gradually “drain” the overall energy efficiency by 0.03% per hour.
    Outline
    Introduction
    Research Question 1: Architectural and Hardware Choices vs Resource Consumption
    Research Question 2: Energy Efficiency over Time
    Methods
    Estimation methods
    Analysis methods
    Results
    RQ1:
    Architecture Factors Don’t Hold as Much Predictive Power as Hardware Ones
    Final Model Selection
    Coefficients Interpretation
    RQ2
    Discussion
    1. Introduction
    Ever since the 1940s, when the first digital computers were invented, scientists have dreamed of creating machines as smart as humans, an ambition we now call Artificial Intelligence (AI).
    Fast forward to November 2022: when ChatGPT, an AI model that can listen and answer instantly, was released, it felt like a dream come true.
    Afterward, hundreds of new AI models have rushed into the race (take a look at the timeline here).
    Today, one billion messages are sent through ChatGPT every single day (OpenAI Newsroom, 2024), highlighting how rapidly users have adopted AI.
    Yet, few people stop to ask: What are the environmental costs behind this new convenience?
    Before users can ask AI questions, these models must first be trained.
    Training is the process where models, or algorithms, are fed datasets and try to find the best fit.
    Imagine a simple regression y = ax + b: training means feeding the algorithm x and y values and allowing it to find the best parameters a and b.
    Of course, AI models are typically far more complex than a linear regression.
    They contain enormous numbers of parameters, and so require massive amounts of computation and data.
    Moreover, they need to run on large quantities of specialized hardware that can handle that scale of computation and complexity.
    All of this combined makes AI consume much more energy than traditional software.
    In addition, AI training requires a stable and uninterrupted energy supply, which primarily comes from non-renewable sources such as natural gas or coal, because solar and wind output can fluctuate with the weather (Calvert, 2024).
    Moreover, because of this intense energy use, data centers (the facilities that house and run AI models) heat up rapidly, driving significant carbon emissions and requiring large amounts of water for cooling.
    Therefore, AI models have broad environmental impacts that include not only energy usage but also water consumption and carbon emissions.
    Unfortunately, there is not much official and disclosed data regarding energy, water, and carbon footprints of AI models.
    The public remains largely unaware of these environmental impacts and thus has not created strong pressure or motivation for tech companies to make more systematic changes.
    Furthermore, while some improvements have been made — especially in hardware energy efficiency — there remains little systematic or coordinated effort to effectively reduce the overall environmental impacts of AI.
    Therefore, I am hoping to increase public awareness of these hidden environmental costs and to explore whether recent improvements in energy efficiency are substantial.
    More particularly, I’m seeking to address two research questions in this study:
    RQ1: Is there a significant relationship between AI models’ architectural and hardware choices and their resource consumption during training?
    RQ2: Has AI training become energy-efficient over time?
    2.
    Methods: 
    The paper used a dataset called Notable AI Models from Epoch AI (Epoch AI, 2025), a research institute that investigates the trends of AI development.
    The models included were either historically relevant or represented cutting-edge advances in AI.
    Each model was recorded with key training information such as the number of parameters, dataset size, total compute, hardware type, and hardware quantity, all collected from various sources, including literature reviews, publications, and research papers.
    The dataset also reported the confidence level for these attributes.
    To produce a reliable analysis, I evaluated only models with a confidence rating of “Confident” or “Likely”.
    As noted earlier, there was limited data regarding direct resource consumption.
    Fortunately, the dataset authors have estimated Total Power Draw (in watts, or W) based on several factors, including hardware type, hardware quantity, and data center efficiency rates and overhead.
    It is important to note that power and energy are different: power (W) refers to the amount of electricity used per unit of time, while energy (in kilowatt-hours, or kWh) measures the total cumulative electricity consumed over time.
    Since this study investigated resource consumption and energy efficiency during the training phase of AI models, I constructed and estimated four environmental metrics: total energy used (kWh), total water used (liters, or L), total carbon emissions (kilograms of CO2e, or kgCO2e), and energy efficiency (FLOPS/W, to be explained later).
    a. Estimation methods
    First, this study estimated energy consumption by selecting models with available total power draw (W) and training times (hours).
    Energy was computed as follows:
    \[\text{Energy (kWh)} = \frac{\text{Total Power Draw (W)}}{1000} \times \text{Training Time (h)}\]
    Next, water consumption and carbon emissions were estimated by rearranging the formulas of two standard rates used in data centers: Water Usage Effectiveness (WUE, in L/kWh) and Carbon Intensity (CI, in kgCO2e/kWh):
    \[\text{WUE (L/kWh)} = \frac{\text{Water (L)}}{\text{Energy (kWh)}}\ \Longrightarrow\ \text{Water (L)} = \text{WUE (L/kWh)} \times \text{Energy (kWh)}\]
    This study used the average WUE of 0.36 L/kWh in 2023, reported by Lawrence Berkeley National Laboratory (2024).
    \[\mathrm{CI\ \left( \frac{\mathrm{kgCO_2e}}{\mathrm{kWh}} \right)} = \frac{\mathrm{Carbon\ (kgCO_2e)}}{\mathrm{Energy\ (kWh)}}\ \Longrightarrow\ \mathrm{Carbon\ (kgCO_2e)} = \mathrm{CI\ \left( \frac{\mathrm{kgCO_2e}}{\mathrm{kWh}} \right)} \times \mathrm{Energy\ (kWh)}\]
    This study used an average carbon intensity of 0.548 kg CO₂e/kWh, reported by recent environmental research (Guidi et al, 2024).
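    These conversions are simple enough to script. Below is a minimal R sketch of the energy, water, and carbon estimates, using the fixed WUE and CI rates cited above; the function argument names (power_draw_w, training_time_h) are illustrative, not the actual dataset field names.

    # Estimate energy (kWh), water (L), and carbon (kgCO2e) from power draw and training time
    wue_l_per_kwh <- 0.36    # Water Usage Effectiveness (Lawrence Berkeley National Laboratory, 2024)
    ci_kg_per_kwh <- 0.548   # Carbon Intensity (Guidi et al, 2024)

    estimate_footprint <- function(power_draw_w, training_time_h) {
      energy_kwh <- power_draw_w / 1000 * training_time_h
      data.frame(
        energy_kwh = energy_kwh,
        water_l    = wue_l_per_kwh * energy_kwh,
        carbon_kg  = ci_kg_per_kwh * energy_kwh
      )
    }

    # Example: a hypothetical 10,000 W training run lasting 500 hours
    estimate_footprint(power_draw_w = 1e4, training_time_h = 500)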
    Finally, this study estimated energy efficiency using the FLOPS/W metric.
    A floating-point operation (FLOP) is a basic arithmetic operation (e.g., addition or multiplication) with decimal numbers.
    FLOP per second (FLOPS) measures how many such operations a system can perform each second, and is commonly used to evaluate computing performance.
    FLOPS per Watt (FLOPS/W) measures how much computing performance is achieved per unit of power consumed:
    \[\text{Energy Efficiency (FLOPS/W)} = \frac{\text{Total Compute (FLOP)}}{\text{Training Time (h)} \times 3600 \times \text{Total Power Draw (W)}}\]
    It is important to note that FLOPS/W is typically used to measure hardware-level energy efficiency.
    However, it’s possible that the actual efficiency during AI training may differ from the theoretical efficiency reported for the hardware used.
    I would like to investigate whether any of the training-related factors, beyond hardware alone, may contribute significantly to overall energy efficiency.
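    As a companion to the formula above, here is a small R sketch of the FLOPS/W calculation; again, the argument names are illustrative rather than the dataset's actual column names.

    # Energy efficiency (FLOPS/W) = total compute / (training time in seconds x power draw)
    energy_efficiency_flops_per_w <- function(total_compute_flop, training_time_h, power_draw_w) {
      total_compute_flop / (training_time_h * 3600 * power_draw_w)
    }

    # Example: 1e21 FLOP of training over 500 hours at 10,000 W
    energy_efficiency_flops_per_w(1e21, 500, 1e4)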
    b. Analysis methods
    RQ1: Architectural and Hardware Choices vs Resource Consumption
    Among energy, water, and carbon consumption, I focused on modeling energy consumption, as both water and carbon are derived directly from energy using fixed conversion rates, so all three response variables share identical distributions.
    As a result, I believe we could safely assume that the best-fitting model of energy consumption can be applied to water and carbon.
    While the statistical models were the same, I would still report the results of all three to quantify how many kilowatt-hours of energy, liters of water, and kilograms of carbon are wasted for every unit increase in each significant factor.
    That way, I am hoping to communicate the environmental impacts of AI in more holistic, concrete, and tangible terms.
    Figure 2a. Histogram of Energy Consumption (kWh)
    Figure 2b. Histogram of log of Energy Consumption (kWh)
    Based on Fig. 2a, the histogram of energy showed extreme right skew and the presence of some outliers.
    Therefore, I performed a log transformation on the energy data, aiming to stabilize variance and move the distribution closer to normality (Fig. 2b).
    A Shapiro-Wilk test confirmed the log-transformed energy data is approximately normal (p-value = 0.5).
    Based on this, two types of distributions were considered: the Gaussian (normal) and the Gamma distribution.
    While the Gaussian distribution is appropriate for symmetric, normal data, the Gamma distribution is more suited for positive, skewed data and is commonly used in engineering modeling where small values occur more frequently than larger values.
    For each distribution, the paper compared two approaches for incorporating the log transformation: directly log transforming the response variable versus using a log link function within a generalized linear model (GLM).
    I identified the best combination of distribution and log approach by evaluating their Akaike Information Criterion (AIC), diagnostic plots, along with prediction accuracy.
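    As a rough illustration, the comparison could be set up in R as below; df and the variable names are assumed to match those used in the model output later in the Results, and this is a sketch rather than the exact code used.

    # Candidate specifications: Gamma vs Gaussian, log link vs log-transformed response
    f <- Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware

    gamma_log_link <- glm(f, family = Gamma(link = "log"), data = df)
    gauss_log_link <- glm(f, family = gaussian(link = "log"), data = df)
    gauss_log_resp <- lm(log(Energy_kWh) ~ Training_time_hour + Hardware_quantity +
                           Training_hardware, data = df)

    AIC(gamma_log_link, gauss_log_link)  # log-link GLMs are directly comparable by AIC
    plot(gamma_log_link)                 # diagnostic plots for the chosen specification

    # Prediction accuracy was also checked on the original (kWh) scale:
    head(cbind(observed  = df$Energy_kWh,
               predicted = predict(gamma_log_link, type = "response")))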

    The candidate predictors included Parameters, Training Compute, Dataset Size, Training Time, Hardware Quantity, and Hardware Type.
    Architecture-related variables comprised Parameters, Training Compute, and Dataset Size, while hardware-related variables consisted of Hardware Quantity and Hardware Type.
    Training Time didn’t fall neatly into either category but was included due to its central role in training AI models.
    After fitting all candidate predictors into the selected GLM specification, I tested for multicollinearity to determine whether any variables should be excluded.
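    For the multicollinearity check, a generalized variance inflation factor (GVIF) from the car package is one standard option; the sketch below assumes the same df and column names as before.

    # GVIF for the full Gamma log-link model; values above roughly 5-6 flag collinear predictors
    library(car)

    full_model <- glm(
      Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size +
        Training_time_hour + Hardware_quantity + Training_hardware,
      family = Gamma(link = "log"), data = df
    )

    vif(full_model)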
    Following this, I explored interaction terms, as each resource consumption may not have responded linearly to each independent variable.
    The following interactions were considered based on domain knowledge and various sources (a formula sketch follows the list):
    Model Size and Hardware Type: Different hardware types have different memory designs.
    The larger and more complex the model is, the more memory it requires (Bali, 2025).
    Energy consumption can be different depending on how the hardware handles memory demands.
    Dataset Size and Hardware Type: Similarly, with different memory designs, hardware may access and read data at different rates depending on data size (Krashinsky et al, 2020).
    As dataset size increases, energy consumption can vary depending on how the hardware handles large volumes of data.
    Training Time with Hardware Quantity: Running multiple hardware units at the same time adds extra overhead, like keeping everything in sync (HuggingFace, 2025).
    As training goes on, these coordination costs can grow and put more strain on the system, leading to faster energy drain.
    Training Time with Hardware Type: As training time increases, energy use may vary across hardware types since some hardware types may manage heat better or maintain performance more consistently over time, while others may slow down or consume more energy.
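    A sketch of how these candidate interaction models could be specified in R is shown below; the model numbering follows the comparison reported in the Results, and the formulas are assumptions based on the descriptions above rather than the exact code used.

    fam <- Gamma(link = "log")
    m5 <- glm(Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware + 0,
              family = fam, data = df)                                         # main effects only
    m6 <- glm(Energy_kWh ~ Training_time_hour + Hardware_quantity +
                Training_hardware * Parameters + 0, family = fam, data = df)   # hardware type x model size
    m7 <- glm(Energy_kWh ~ Training_time_hour + Hardware_quantity +
                Training_hardware * Training_dataset_size + 0,
              family = fam, data = df)                                         # hardware type x dataset size
    m8 <- glm(Energy_kWh ~ Training_time_hour * Hardware_quantity + Training_hardware + 0,
              family = fam, data = df)                                         # training time x hardware quantity
    m9 <- glm(Energy_kWh ~ Training_time_hour * Training_hardware + Hardware_quantity + 0,
              family = fam, data = df)                                         # training time x hardware type

    AIC(m5, m6, m7, m8, m9)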
    RQ2: Energy Efficiency over Time
    Figure 2c. Histogram of Energy Efficiency (FLOPS/W)
    Figure 2d. Histogram of log of Energy Efficiency (FLOPS/W)
    The distribution of energy efficiency was highly skewed.
    Even after a log transformation, the distribution remained non-normal and overdispersed.
    To reduce distortion, I removed one extreme outlier with exceptionally high efficiency, as it was not a frontier model and likely less impactful.
    A Gamma GLM was then fitted using Publication Date as the primary predictor.
    If models using the same hardware exhibited wide variation in efficiency, it would suggest that other factors beyond the hardware may contribute to these differences.
    Therefore, architecture and hardware predictors from the first research question would be used to assess which variables significantly influence energy efficiency over time.
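    A minimal sketch of this RQ2 setup in R, assuming an efficiency data frame df_eff with columns Energy_efficiency_FLOPS_W, Publication_date (as a numeric year or date), and the training variables from RQ1 (these names are illustrative):

    # Efficiency trend over time
    eff_time <- glm(Energy_efficiency_FLOPS_W ~ Publication_date,
                    family = Gamma(link = "log"), data = df_eff)
    summary(eff_time)

    # Do training-related factors explain efficiency beyond hardware alone?
    eff_full <- glm(Energy_efficiency_FLOPS_W ~ Publication_date + Training_time_hour +
                      Hardware_quantity + Training_hardware,
                    family = Gamma(link = "log"), data = df_eff)
    summary(eff_full)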
    3. Results
    RQ1: Architectural and Hardware Choices vs Resource Consumption
    I ultimately used a Gamma GLM with a log link to model resource consumption.
    This combination was chosen because it had a lower AIC value (1780.85) than the Gaussian log-link model (2005.83) and produced predictions that matched the raw data more closely than models using a log-transformed response variable.
    Those log-transformed models generated predictions that substantially underestimated the actual data on the original scale (see this article on why log-transforming didn’t work in my case).
    Architecture Factors Don’t Hold as Much Predictive Power as Hardware Ones
    After fitting all candidate explanatory variables to a Gamma log-link GLM, we found that two architecture-related variables — Parameters and Dataset Size — do not exhibit a significant relationship with resource consumption (p > 0.5).
    A multicollinearity test also showed that Dataset Size and Training Compute were highly correlated with other predictors (GVIF > 6).
    Based on this, I hypothesized that all three architecture variables (Parameters, Dataset Size, and Training Compute) may not hold much predictive power.
    I then removed all three variables from the model and an ANOVA test confirmed that simplified models (Models 4 and 5) are not significantly worse than the full model (Model 1), with p > 0.05:
    Model 1: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size + Training_time_hour + Hardware_quantity + Training_hardware + 0
    Model 2: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_time_hour + Hardware_quantity + Training_hardware
    Model 3: Energy_kWh ~ Parameters + Training_dataset_size + Training_time_hour + Hardware_quantity + Training_hardware
    Model 4: Energy_kWh ~ Parameters + Training_time_hour + Hardware_quantity + Training_hardware + 0
    Model 5: Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware + 0

      Resid. Df  Resid. Dev  Df  Deviance  Pr(>Chi)
    1        46      108.28
    2        47      111.95  -1   -3.6700   0.07809 .
    3        47      115.69   0   -3.7471
    4        48      116.09  -1   -0.3952   0.56314
    5        49      116.61  -1   -0.5228   0.50604
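    For reference, a deviance table like the one above can be produced with a chi-squared ANOVA on the nested Gamma GLMs; the object names m1 through m5 are placeholders for the five fitted models.

    anova(m1, m2, m3, m4, m5, test = "Chisq")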
    Moving on with Model 5, I found that Training Time and Hardware Quantity showed significant positive relationships with Energy Consumption (GLM: training time, t = 9.70, p-value < 0.001; hardware quantity, t = 6.89, p-value < 0.001).
    All hardware types were also statistically significant (p-value < 0.001), indicating strong variation in energy use across different types.
    Detailed results are presented below:
    glm(formula = Energy_kWh ~ Training_time_hour + Hardware_quantity +
    Training_hardware + 0, family = Gamma(link = "log"), data = df)
    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    Training_time_hour 1.351e-03 1.393e-04 9.697 5.54e-13 ***
    Hardware_quantity 3.749e-04 5.444e-05 6.886 9.95e-09 ***
    Training_hardwareGoogle TPU v2 7.213e+00 7.614e-01 9.474 1.17e-12 ***
    Training_hardwareGoogle TPU v3 1.060e+01 3.183e-01 33.310 < 2e-16 ***
    Training_hardwareGoogle TPU v4 1.064e+01 4.229e-01 25.155 < 2e-16 ***
    Training_hardwareHuawei Ascend 910 1.021e+01 1.126e+00 9.068 4.67e-12 ***
    Training_hardwareNVIDIA A100 1.083e+01 3.224e-01 33.585 < 2e-16 ***
    Training_hardwareNVIDIA A100 SXM4 40 GB 1.084e+01 5.810e-01 18.655 < 2e-16 ***
    Training_hardwareNVIDIA A100 SXM4 80 GB 1.149e+01 5.754e-01 19.963 < 2e-16 ***
    Training_hardwareNVIDIA GeForce GTX 285 3.065e+00 1.077e+00 2.846 0.00644 **
    Training_hardwareNVIDIA GeForce GTX TITAN X 6.377e+00 7.614e-01 8.375 5.13e-11 ***
    Training_hardwareNVIDIA GTX Titan Black 6.371e+00 1.079e+00 5.905 3.28e-07 ***
    Training_hardwareNVIDIA H100 SXM5 80GB 1.149e+01 6.825e-01 16.830 < 2e-16 ***
    Training_hardwareNVIDIA P100 5.910e+00 7.066e-01 8.365 5.32e-11 ***
    Training_hardwareNVIDIA Quadro P600 5.278e+00 1.081e+00 4.881 1.16e-05 ***
    Training_hardwareNVIDIA Quadro RTX 4000 5.918e+00 1.085e+00 5.455 1.60e-06 ***
    Training_hardwareNVIDIA Quadro RTX 5000 4.932e+00 1.081e+00 4.563 3.40e-05 ***
    Training_hardwareNVIDIA Tesla K80 9.091e+00 7.760e-01 11.716 8.11e-16 ***
    Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.059e+01 6.546e-01 16.173 < 2e-16 ***
    Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.089e+01 1.078e+00 10.099 1.45e-13 ***
    Training_hardwareNVIDIA V100 9.683e+00 4.106e-01 23.584 < 2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for Gamma family taken to be 1.159293)
    Null deviance: 2.7045e+08 on 70 degrees of freedom
    Residual deviance: 1.1661e+02 on 49 degrees of freedom
    AIC: 1781.2
    Number of Fisher Scoring iterations: 25
    Final Model Selection
    To better capture possible non-additive effects, various interaction terms were explored and their models compared by AIC (Table 1):

    Model  Predictors                                                         AIC
    5      Training Time + Hardware Quantity + Hardware Type                  350.78
    6      Training Time + Hardware Quantity + Hardware Type * Parameters     357.97
    7      Training Time + Hardware Quantity + Hardware Type * Dataset Size   335.89
    8      Training Time * Hardware Quantity + Hardware Type                  345.39
    9      Training Time * Hardware Type + Hardware Quantity                  333.03

    Table 1. Summary of different GLM models and their respective AIC scores.
    Although AIC scores did not vary drastically, meaning their model fits are similar, Model 8 was preferred as it was the only one with significant effects in both main terms and interaction.
    Interactions involving Hardware Type were not significant despite some exhibiting better AIC values, likely due to the limited sample size across 18 hardware types.
    In Model 8, both Training Time and Hardware Quantity showed significant positive relationships with energy consumption (GLM: training time, t = 11.09, p < 0.001; hardware quantity, t = 7.32, p < 0.001; Fig. 3a).
    Their interaction term was significantly negative (GLM: t = –4.32, p < 0.001), suggesting that energy consumption grows more slowly when training time increases alongside a higher number of hardware units.
    All hardware types remained significant (p < 0.001).
    Detailed results are as below:
    glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity +
    Training_hardware + 0, family = Gamma(link = "log"), data = df)
    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 ***
    Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 ***
    Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 ***
    Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 ***
    Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 ***
    Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 ***
    Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 ***
    Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 ***
    Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 ***
    Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 **
    Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 ***
    Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 ***
    Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 ***
    Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 ***
    Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 ***
    Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 ***
    Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 ***
    Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 ***
    Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 ***
    Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 ***
    Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 ***
    Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for Gamma family taken to be 1.088522)
    Null deviance: 2.7045e+08 on 70 degrees of freedom
    Residual deviance: 1.0593e+02 on 48 degrees of freedom
    AIC: 1775
    Number of Fisher Scoring iterations: 25
    Figure 3a. Relationship between hardware quantity and log of energy consumption across training time groups.
    Training time was originally a continuous variable; for the sake of visualization, it was divided into three equal-sized levels labeled high, mid, and low.
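    For readers who want to reproduce this grouping, one way to split a continuous training-time variable into three equal-sized levels is dplyr::ntile; the column names here are illustrative.

    library(dplyr)

    df <- df %>%
      mutate(time_group = factor(ntile(Training_time_hour, 3),
                                 labels = c("low", "mid", "high")))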
    Coefficients Interpretation
    To further interpret the coefficients, we can exponentiate each coefficient and subtract one to estimate the percent change in the response variable for each additional unit in the predictor (Popovic, 2022).
    For energy consumption, each additional hour of training would increase energy use by 0.18%, each additional hardware unit would add 0.07%, and their interaction reduced their combined main effects by 0.00002%.
    Similarly, since water and carbon were directly proportional to energy, the percent changes for training time, hardware quantity, and their interaction remained the same (Fig. 3b, Fig. 3c).
    However, since hardware types were categorical variables and functioned as baseline intercepts, their values differed across energy, water, and carbon models to reflect differences in overall scale.
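    In R, this percent-change interpretation is a one-liner on the fitted model (called m8 here as a placeholder for the final interaction model):

    # Exponentiate each coefficient and subtract one (Popovic, 2022), expressed as a percentage
    pct_change <- (exp(coef(m8)) - 1) * 100
    round(pct_change[c("Training_time_hour", "Hardware_quantity",
                       "Training_time_hour:Hardware_quantity")], 5)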
    Figure 3b. Relationship between hardware quantity and log of water consumption across training time groups.
    Figure 3c. Relationship between hardware quantity and log of carbon emissions across training time groups.
    RQ2: Energy Efficiency over Time
    I also used a log-linked Gamma model to examine the relationship between Energy Efficiency and Publication Date, as the Shapiro-Wilk test indicated that the log-transformed data was not normally distributed (p < 0.001).
    There was a positive relationship between Publication Date and Energy Efficiency, with an estimated improvement of 0.13% per year (GLM: t = 8.005, p < 0.001; Fig. 3d).
    Figure 3d. Relationship between publication year and log of energy efficiency (FLOPS/W). Each point represents a model, and the blue line shows a fitted trend using a linear model.
    To further investigate, we examined the trends by individual hardware type and observed noticeable variation in efficiency among AI models using the same hardware (Fig. 3e).
    Among all architecture and hardware choices, Training Time was the only statistically significant factor influencing energy efficiency (GLM: t = 8.581, p < 0.001), with longer training time decreasing energy efficiency by 0.03% per hour.
    Figure 3e. Trends in log of energy efficiency (FLOPS/W) by hardware type over time. Each panel represents a specific hardware model, showing individual data points and fitted linear trends. Only hardware types used in at least three models are included.
    4. Discussion
    This study found that hardware choices (Hardware Type and Hardware Quantity), along with Training Time, have a significant relationship with each type of resource consumption during AI model training, while architecture variables do not.
    I suspect that Training Time may have implicitly captured some of the underlying effects of those architecture-related factors.
    In addition, the interaction between Training Time and Hardware also contributes to the resource usage.
    However, this analysis is constrained by the small dataset (70 valid models) across 18 hardware types, which likely limits the statistical power of hardware-involved interaction terms.
    Further research could explore these interactions with larger and more diverse datasets.
    To illustrate how resource-intensive AI training can be, we use Model 8 to predict the baseline energy consumption for a single hour of training on one NVIDIA A100 chip.
    Here are the predictions for each type of resource under this simple setup (a prediction sketch in R follows the list):
    Energy: The predicted energy use is 29,213 kWh, nearly three times the annual energy consumption of an average U.S. household (10,500 kWh/year) (U.S. Energy Information Administration, 2023), with each extra hour adding 5258 kWh and each extra chip adding 2044 kWh.

    Water: Similarly, the same training session would consume 10,521 liters of water, almost ten times the average U.S. household’s daily water use (300 gallons or 1135 liters/day) (United States Environmental Protection Agency, 2024), with each extra hour adding 1,894 liters and each chip adding 736 liters.

    Carbon: The predicted carbon emission is 16,009 kg, about four times the annual emissions of a U.S. household (4000 kg/year) (University of Michigan, 2024), with each extra hour adding 2881 kg and each extra chip adding 1120 kg.
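    As a sketch of how such baseline figures can be generated, the prediction for one hour on a single NVIDIA A100 can be obtained from the fitted interaction model (again called m8 as a placeholder) and converted with the fixed WUE and CI rates from the Methods:

    new_run <- data.frame(Training_time_hour = 1,
                          Hardware_quantity  = 1,
                          Training_hardware  = "NVIDIA A100")

    energy_kwh <- predict(m8, newdata = new_run, type = "response")
    c(energy_kwh = energy_kwh,
      water_l    = 0.36  * energy_kwh,   # WUE (L/kWh)
      carbon_kg  = 0.548 * energy_kwh)   # CI (kgCO2e/kWh)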
    This study also found that AI models have become more energy-efficient over time, but only slightly, with an estimated improvement of 0.13% per year.
    This suggests that while newer hardware is more efficient, its adoption has not been widespread.
    While the environmental impact of AI may be mitigated over time as hardware has become more efficient, this focus on hardware alone may overlook other contributors to overall energy consumption.
    In this dataset, both Training Compute and Total Power Draw are often estimated values and may include some system-level overhead beyond hardware alone.
    Therefore, the efficiency estimates in this study may reflect not just hardware performance, but potentially other training-related overhead.
    This study observed substantial variation in energy efficiency even among models using the same hardware.
    One key finding is that longer training time can “drain” energy efficiency, reducing it by approximately 0.03% per hour.
    Further studies should explore how training practices, beyond hardware selection, impact the environmental costs of AI development.
    References
    Calvert, B. 2024. AI already uses as much energy as a small country. It’s only the beginning. Vox. https://www.vox.com/climate/2024/3/28/24111721/climate-ai-tech-energy-demand-rising
    OpenAI Newsroom. 2024. Fresh numbers shared by @sama earlier today: 300M weekly active ChatGPT users. 1B user messages sent on ChatGPT every day. 1.3M devs have built on OpenAI in the US. Tweet via X. https://x.com/OpenAINewsroom/status/1864373399218475440
    Epoch AI. 2025. Data on Notable AI Models. Epoch AI. https://epoch.ai/data/notable-ai-models
    Shehabi, A., S.J. Smith, A. Hubbard, A. Newkirk, N. Lei, M.A.B. Siddik, B. Holecek, J. Koomey, E. Masanet, and D. Sartor. 2024. 2024 United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, California. LBNL-2001637.
    Guidi, G., F. Dominici, J. Gilmour, K. Butler, E. Bell, S. Delaney, and F.J. Bargagli-Stoffi. 2024. Environmental Burden of United States Data Centers in the Artificial Intelligence Era. arXiv abs/2411.09786.
    Bali, S. 2025. GPU Memory Essentials for AI Performance. NVIDIA Developer. https://developer.nvidia.com/blog/gpu-memory-essentials-for-ai-performance/
    Krashinsky, R., O. Giroux, S. Jones, N. Stam, and S. Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. NVIDIA Developer. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
    HuggingFace. 2025. Performance Tips for Training on Multiple GPUs. HuggingFace Documentation. https://huggingface.co/docs/transformers/en/perf_train_gpu_many
    Popovic, G. 2022. Interpreting GLMs. Environmental Computing. https://environmentalcomputing.net/statistics/glms/interpret-glm-coeffs/
    U.S. Energy Information Administration. 2023. Use of Energy Explained: Electricity Use in Homes. https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php
    United States Environmental Protection Agency. 2024. How We Use Water. https://www.epa.gov/watersense/how-we-use-water
    Center for Sustainable Systems, University of Michigan. 2024. Carbon Footprint Factsheet. Pub. No. CSS09–05.
    Source: https://towardsdatascience.com/rethinking-environmental-costs-of-training-ai-why-we-should-look-beyond-hardware/
    #rethinking #the #environmental #costs #training #why #should #look #beyond #hardware
    Rethinking the Environmental Costs of Training AI — Why We Should Look Beyond Hardware
    Summary of This Study Hardware choices – specifically hardware type and its quantity – along with training time, have a significant positive impact on energy, water, and carbon footprints during AI model training, whereas architecture-related factors do not. The interaction between hardware quantity and training time slows the growth of energy, water, and carbon consumption slightly by 0.00002%. Overall energy efficiency during AI model training has improved slightly over the years, around 0.13% per year. Longer training time can gradually “drain” the overall energy efficiency by 0.03% per hour. Outline Introduction Research Question 1: Architectural and Hardware Choices vs Resource Consumption Research Question 2: Energy Efficiency over Time Methods Estimation methods Analysis methods Results RQ1: Architecture Factors Don’t Hold Much Predictive Power as Hardware Ones Final Model Selection Coefficients Interpretation RQ2 Discussion 1. Introduction Ever since the 1940s, when the first digital computers were invented, scientists have always dreamed of creating machines as smart as humans, what now became Artificial Intelligence (AI). Fast forward to November 2022, when ChatGPT — an AI model capable of listening and answering instantly — was released, it felt like a dream come true. Afterward, hundreds of new AI models have rushed into the race (take a look at the timeline here). Today, every single day, one billion messages are sent through ChatGPT (OpenAI Newsroom, 2024), highlighting the rapid AI adoption by users. Yet, few people stop to ask: What are the environmental costs behind this new convenience? Before users can ask AI questions, these models must first be trained. Training is the process where models, or algorithms, are fed datasets and try to find the best fit. Imagine a simple regression y = ax + b: training means feeding the algorithm x and y values and allowing it to find the best parameters a and b. Of course, AI models typically would not be as simple as a linear regression. They would contain tons of parameters, thus requiring massive amounts of computation and datasets. Moreover, they would need to run a substantial amount of specialized hardware that can handle that sheer amount of computation and complexity. All of that combined made AI consume much more energy than traditional software. In addition, AI training requires a stable and uninterrupted energy supply, which primarily comes from non-renewable energy sources like natural gas or coal-based, because solar and wind energy can fluctuate based on weather conditions (Calvert, 2024). Moreover, due to the high intensity of energy use, data centers — buildings that store AI models — heat up rapidly, emitting significant carbon footprints and requiring large amounts of water for cooling. Therefore, AI models have broad environmental impacts that include not only energy usage but also water consumption and carbon emissions. Unfortunately, there is not much official and disclosed data regarding energy, water, and carbon footprints of AI models. The public remains largely unaware of these environmental impacts and thus has not created strong pressure or motivations for tech companies to take more systematic changes. Furthermore, while some improvements have been made — especially in hardware energy efficiency — there remains little systematic or coordinated effort to effectively reduce the overall environmental impacts of AI. 
Therefore, I am hoping to increase public awareness of these hidden environmental costs and to explore whether recent improvements in energy efficiency are substantial. More particularly, I’m seeking to address two research questions in this study: RQ1: Is there a significant relationship between AI models’ architectural and hardware choices and their resource consumption during training? RQ2: Has AI training become energy-efficient over time? 2. Methods:  The paper used a dataset called Notable AI Models from Epoch AI (Epoch AI, 2025), a research institute that investigates the trends of AI development. The models included were either historically relevant or represent cutting-edge advances in AI. Each model was recorded with key training information such as the number of parameters, dataset size, total compute, hardware type, and hardware quantity, all collected from various sources, including literature reviews, publications, and research papers. The dataset also reported the confidence level for these attributes. To produce a reliable analysis, I evaluated only models with a confidence rating of “Confident” or “Likely”. As noted earlier, there was limited data regarding direct resource consumption. Fortunately, the dataset authors have estimated Total Power Draw (in watts, or W) based on several factors, including hardware type, hardware quantity, and some other data center efficiency rates and overhead. It is important to note that power and energy are different: power (W) refers to the amount of electricity used per unit of time, while energy (in kilowatt-hours, or kWh) measures the total cumulative electricity consumed over time. Since this study investigated resource consumption and energy efficiency during the training phase of AI models, I constructed and estimated four environmental metrics: total energy used (kWh), total water used (liters, or L), total carbon emissions (kilograms of CO2e, or kgCO2e), and energy efficiency (FLOPS/W, to be explained later). a. Estimation methods First, this study estimated energy consumption by selecting models with available total power draw (W) and training times (hours). Energy was computed as follows: \[\text{Energy (kWh)} = \frac{\text{Total Power Draw (W)}}{1000} \times \text{Training Time (h)}\] Next, water consumption and carbon emissions were estimated by rearranging the formulas of two standard rates used in data centers: Water Usage Effectiveness (WUE, in L/kWh) and Carbon Intensity (CI, in kgCO2e/kWh): \[\text{WUE (L/kWh)} = \frac{\text{Water (L)}}{\text{Energy (kWh)}}\ \Longrightarrow\ \text{Water (L)} = \text{WUE (L/kWh)} \times \text{Energy (kWh)}\] This study used the average WUE of 0.36 L/kWh in 2023, reported by Lawrence Berkeley National Laboratory (2024). \[\mathrm{CI\ \left( \frac{\mathrm{kgCO_2e}}{\mathrm{kWh}} \right)} = \frac{\mathrm{Carbon\ (kgCO_2e)}}{\mathrm{Energy\ (kWh)}}\ \Longrightarrow\ \mathrm{Carbon\ (kgCO_2e)} = \mathrm{CI\ \left( \frac{\mathrm{kgCO_2e}}{\mathrm{kWh}} \right)} \times \mathrm{Energy\ (kWh)}\] This study used an average carbon intensity of 0.548 kg CO₂e/kWh, reported by recent environmental research (Guidi et al, 2024). Finally, this study estimated energy efficiency using the FLOPS/W metric. A floating-point operation (FLOP) is a basic arithmetic operation (e.g., addition or multiplication) with decimal numbers. FLOP per second (FLOPS) measures how many such operations a system can perform each second, and is commonly used to evaluate computing performance. 
FLOPS per Watt (FLOPS/W) measures how much computing performance is achieved per unit of power consumed: \[\text{Energy Efficiency (FLOPS/W)} = \frac{\text{Total Compute (FLOP)}}{\text{Training Time (h)} \times 3600 \times \text{Total Power Draw(W)}}\] It is important to note that FLOPS/W is typically used to measure hardware-level energy efficiency. However, it’s possible that the actual efficiency during AI training may be different from the thereotical efficiency reported for the hardware used. I would like to investigate whether any of the training-related factors, beyond hardware alone, may contribute significantly to overall energy efficiency. b. Analysis methods: RQ1: Architectural and Hardware Choices vs Resource Consumption  Among energy, water, and carbon consumption, I focused on modeling energy consumption, as both water and carbon are derived directly from energy using fixed conversion rates and all three response variables shared identical distributions. As a result, I believe we could safely assume that the best-fitting model of energy consumption can be applied to water and carbon. While the statistical models were the same, I would still report the results of all three to quantify how many kilowatt-hours of energy, liters of water, and kilograms of carbon are wasted for every unit increase in each significant factor. That way, I am hoping to communicate the environmental impacts of AI in a more holistic, concrete, and tangible terms. Figure 2a. Histogram of Energy Consumption (kWh) Figure 2b. Histogram of log of Energy Consumption (kWh) Based on Figure 1, the histogram of energy showed extreme right skew and the presence of some outliers. Therefore, I performed a log transformation on energy data, aiming to stabilize variance and move the distribution closer to normality (Fig. 2). A Shapiro-Wilk test confirmed the log-transformed energy data is approximately normal (p-value = 0.5). Based on this, two types of distributions were considered: the Gaussian (normal) and the Gamma distribution. While the Gaussian distribution is approriate for symmetric and normal data, the Gamma distribution is more suited for positive, skewed data — commonly used in engineering modeling where small values occur more frequently than larger values. For each distribution, the paper compared two approaches for incorporating the log transformation: directly log transforming the response variable versus using a log link function within a generalized linear model (GLM). I identified the best combination of distribution and log approach by evaluating their Akaike Information Criterion (AIC), diagnostic plots, along with prediction accuracy. The candidate predictors included Parameters, Training Compute, Dataset Size, Training Time, Hardware Quantity, and Hardware Type. Architecture-related variables comprised Parameters, Training Compute, and Dataset Size, while hardware-related variables consisted of Hardware Quantity and Hardware Type. Training Time didn’t fall neatly into either category but was included due to its central role in training AI models. After fitting all candidate predictors into the selected GLM specification, I tested for multicollinearity to determine whether any variables should be excluded. Following this, I explored interaction terms, as each resource consumption may not have responded linearly to each independent variable. 
The following interactions were considered based on domain knowledge and various sources: Model Size and Hardware Type: Different hardware types have different memory designs. The larger and more complex the model is, the more memory it requires (Bali, 2025). Energy consumption can be different depending on how the hardware handles memory demands. Dataset Size and Hardware Type: Similarly, with different memory designs, hardware may access and read data at different data size (Krashinsky et al, 2020). As dataset size increases, energy consumption can vary depending on how the hardware handles large volumes of data. Training Time with Hardware Quantity: Running multiple hardware units at the same time adds extra overhead, like keeping everything in sync (HuggingFace, 2025). As training goes on, these coordination costs can grow and put more strain on the system, leading to faster energy drain. Training Time with Hardware Type: As training time increases, energy use may vary across hardware types since some hardware types may manage heat better or maintain performance more consistently over time, while others may slow down or consume more energy. RQ2: Energy Efficiency over Time Figure 2c. Histogram of Energy Efficiency (FLOPS/W) Figure 2d. Histogram of Energy Efficiency (FLOPS/W) The distribution of energy efficiency was highly skewed. Even after a log transformation, the distribution remained non-normal and overdispersed. To reduce distortion, I removed one extreme outlier with exceptionally high efficiency, as it was not a frontier model and likely less impactful. A Gamma GLM was then fitted using Publication Date as the primary predictor. If models using the same hardware exhibited wide variation in efficiency, it would suggest that other factors beyond the hardware may contribute to these differences. Therefore, architecture and hardware predictors from the first research question would be used to assess which variables significantly influence energy efficiency over time. 3. Results RQ1: Architectural and Hardware Choices vs Resource Consumption I ultimately used a Gamma GLM with a log link to model resource consumption. This combination was chosen because it had a lower AIC value (1780.85) than the Gaussian log-link model (2005.83) and produced predictions that matched the raw data more closely than models using a log-transformed response variable. Those log-transformed models generated predictions that substantially underestimated the actual data on the original scale (see this article on why log-transforming didn’t work in my case). Architecture Factors Don’t Hold Much Predictive Power as Hardware Ones After fitting all candidate explanatory variables to a Gamma log-link GLM, we found that two architecture-related variables — Parameters and Dataset Size — do not exhibit a significant relationship with resource consumption (p > 0.5). A multicollinearity test also showed that Dataset Size and Training Compute were highly correlated with other predictors (GVIF > 6). Based on this, I hypothesized that all three architecture variables—Parameters, Dataset Size, and Training Compute) may not hold much predictive power. 
I then removed all three variables from the model and an ANOVA test confirmed that simplified models (Models 4 and 5) are not significantly worse than the full model (Model 1), with p > 0.05: Model 1: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size + Training_time_hour + Hardware_quantity + Training_hardware + 0 Model 2: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_time_hour + Hardware_quantity + Training_hardware Model 3: Energy_kWh ~ Parameters + Training_dataset_size + Training_time_hour + Hardware_quantity + Training_hardware Model 4: Energy_kWh ~ Parameters + Training_time_hour + Hardware_quantity + Training_hardware + 0 Model 5: Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware + 0 Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 46 108.28 2 47 111.95 -1 -3.6700 0.07809 . 3 47 115.69 0 -3.7471 4 48 116.09 -1 -0.3952 0.56314 5 49 116.61 -1 -0.5228 0.50604 Moving on with Model 5, I found that Training Time and Hardware Quantity showed significant positive relationships with Energy Consumption (GLM: training time, t = 9.70, p-value < 0.001; hardware quantity, t = 6.89, p-value < 0.001). All hardware types were also statistically significant (p-value < 0.001), indicating strong variation in energy use across different types. Detailed results are presented below: glm(formula = Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware + 0, family = Gamma(link = "log"), data = df) Coefficients: Estimate Std. Error t value Pr(>|t|) Training_time_hour 1.351e-03 1.393e-04 9.697 5.54e-13 *** Hardware_quantity 3.749e-04 5.444e-05 6.886 9.95e-09 *** Training_hardwareGoogle TPU v2 7.213e+00 7.614e-01 9.474 1.17e-12 *** Training_hardwareGoogle TPU v3 1.060e+01 3.183e-01 33.310 < 2e-16 *** Training_hardwareGoogle TPU v4 1.064e+01 4.229e-01 25.155 < 2e-16 *** Training_hardwareHuawei Ascend 910 1.021e+01 1.126e+00 9.068 4.67e-12 *** Training_hardwareNVIDIA A100 1.083e+01 3.224e-01 33.585 < 2e-16 *** Training_hardwareNVIDIA A100 SXM4 40 GB 1.084e+01 5.810e-01 18.655 < 2e-16 *** Training_hardwareNVIDIA A100 SXM4 80 GB 1.149e+01 5.754e-01 19.963 < 2e-16 *** Training_hardwareNVIDIA GeForce GTX 285 3.065e+00 1.077e+00 2.846 0.00644 ** Training_hardwareNVIDIA GeForce GTX TITAN X 6.377e+00 7.614e-01 8.375 5.13e-11 *** Training_hardwareNVIDIA GTX Titan Black 6.371e+00 1.079e+00 5.905 3.28e-07 *** Training_hardwareNVIDIA H100 SXM5 80GB 1.149e+01 6.825e-01 16.830 < 2e-16 *** Training_hardwareNVIDIA P100 5.910e+00 7.066e-01 8.365 5.32e-11 *** Training_hardwareNVIDIA Quadro P600 5.278e+00 1.081e+00 4.881 1.16e-05 *** Training_hardwareNVIDIA Quadro RTX 4000 5.918e+00 1.085e+00 5.455 1.60e-06 *** Training_hardwareNVIDIA Quadro RTX 5000 4.932e+00 1.081e+00 4.563 3.40e-05 *** Training_hardwareNVIDIA Tesla K80 9.091e+00 7.760e-01 11.716 8.11e-16 *** Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.059e+01 6.546e-01 16.173 < 2e-16 *** Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.089e+01 1.078e+00 10.099 1.45e-13 *** Training_hardwareNVIDIA V100 9.683e+00 4.106e-01 23.584 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.159293) Null deviance: 2.7045e+08 on 70 degrees of freedom Residual deviance: 1.1661e+02 on 49 degrees of freedom AIC: 1781.2 Number of Fisher Scoring iterations: 25 Final Model Selection To better capture possible non-additive effects, various interaction terms were explored and their respective AIC scores (Table 1). 
The table below summarizes the tested models and their respective AIC scores: ModelPredictorsAIC5Training Time + Hardware Quantity + Hardware Type350.786Training Time + Hardware Quantity + Hardware Type * Parameters357.977Training Time + Hardware Quantity + Hardware Type * Dataset Size335.898Training Time * Hardware Quantity + Hardware Type345.399Training Time * Hardware Type + Hardware Quantity333.03Table 1. Summary of different GLM models and their respective AIC scores. Although AIC scores did not vary drastically, meaning their model fits are similar, Model 8 was preferred as it was the only one with significant effects in both main terms and interaction. Interactions involved Hardware Type were not significant despite some exhibiting better AIC, likely due to limited sample size across 18 hardware types. In Model 8, both Training Time and Hardware Quantity showed a significant positive relationship with energy consumption (GLM: t = 11.09, p < 0.001), and between hardware quantity and energy consumption (GLM: training time, t = 11.09, p < 0.001; hardware quantity, t = 7.32, p < 0.001; Fig. 3a). Their interaction term was significantly negative (GLM: t = –4.32, p < 0.001), suggesting that energy consumption grows more slowly when training time increases alongside with a higher number of hardware units. All hardware types remained significant (p < 0.001). Detailed results are as below: glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity + Training_hardware + 0, family = Gamma(link = "log"), data = df) Coefficients: Estimate Std. Error t value Pr(>|t|) Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 *** Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 *** Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 *** Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 *** Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 *** Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 *** Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 *** Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 *** Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 *** Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 ** Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 *** Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 *** Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 *** Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 *** Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 *** Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 *** Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 *** Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 *** Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 *** Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 *** Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 *** Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.088522) Null deviance: 2.7045e+08 on 70 degrees of freedom Residual deviance: 1.0593e+02 on 48 degrees of freedom AIC: 1775 Number of Fisher Scoring iterations: 25 Figure 3a. 
Figure 3a. Relationship between hardware quantity and log of energy consumption across training time groups. Training time was originally a continuous variable; for visualization, it was divided into three equal-sized levels labeled high, mid, and low.

Coefficients Interpretation

To interpret the coefficients further, we can exponentiate each coefficient and subtract one to estimate the percent change in the response variable for each additional unit of the predictor (Popovic, 2022). For energy consumption, each additional hour of training increases energy use by about 0.18%, each additional hardware unit adds about 0.07%, and their interaction reduces the combined main effects by roughly 0.00002%. Since water and carbon were derived directly from energy, the percent changes for training time, hardware quantity, and their interaction are identical for those models (Fig. 3b, Fig. 3c). However, because hardware types are categorical variables that act as baseline intercepts, their values differ across the energy, water, and carbon models to reflect differences in overall scale.

Figure 3b. Relationship between hardware quantity and log of water consumption across training time groups.

Figure 3c. Relationship between hardware quantity and log of carbon emissions across training time groups.

RQ2: Energy Efficiency over Time

I also used a log-linked Gamma model to examine the relationship between Energy Efficiency and Publication Date, as a Shapiro-Wilk test indicated that the log-transformed data was not normally distributed (p < 0.001). There was a positive relationship between Publication Date and Energy Efficiency, with an estimated improvement of 0.13% per year (GLM: t = 8.005, p < 0.001, Fig. 3d).

Figure 3d. Relationship between publication year and log of energy efficiency (FLOPS/W). Each point represents a model, and the blue line shows a fitted trend using a linear model.

To investigate further, we examined the trends by individual hardware type and observed noticeable variation in efficiency among AI models using the same hardware (Fig. 3e). Among all architecture and hardware choices, Training Time was the only statistically significant factor influencing energy efficiency (GLM: t = 8.581, p < 0.001), with each additional hour of training decreasing energy efficiency by about 0.03%.

Figure 3e. Trends in log of energy efficiency (FLOPS/W) by hardware type over time. Each panel represents a specific hardware model, showing individual data points and fitted linear trends. Only hardware types used in at least three models are included.

4. Discussion

This study found that hardware choices — including Hardware Type and Hardware Quantity — along with Training Time, have a significant relationship with each type of resource consumption during AI model training, while architecture variables do not. I suspect that Training Time may have implicitly captured some of the underlying effects of those architecture-related factors. In addition, the interaction between Training Time and Hardware Quantity also contributes to resource usage. However, this analysis is constrained by the small dataset (70 valid models) spread across 18 hardware types, which likely limits the statistical power of hardware-involved interaction terms. Further research could explore these interactions with larger and more diverse datasets.

To illustrate how resource-intensive AI training can be, we use Model 8 to predict the baseline energy consumption for a single hour of training on one NVIDIA A100 chip.
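Both the percent-change interpretation above and this baseline prediction can be reproduced directly from the printed Model 8 estimates. A minimal R sketch using the rounded coefficients shown earlier (so the results differ slightly from the exact figures quoted below):

b_time <- 1.818e-03; b_qty <- 7.373e-04; b_int <- -2.651e-07; b_a100 <- 1.028e+01
(exp(b_time) - 1) * 100   # ~0.18% more energy per extra training hour
(exp(b_qty)  - 1) * 100   # ~0.07% more energy per extra hardware unit
(exp(b_int)  - 1) * 100   # near-zero negative adjustment from the interaction term
exp(b_a100 + b_time * 1 + b_qty * 1 + b_int * 1 * 1)   # ~29,000 kWh for 1 hour on one A100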
Here are the predictions for each type of resource under this simple setup:

Energy: The predicted energy use is 29,213 kWh, nearly three times the annual energy consumption of an average U.S. household (10,500 kWh/year) (U.S. Energy Information Administration, 2023), with each extra hour adding 5,258 kWh and each extra chip adding 2,044 kWh.

Water: Similarly, the same training session would consume 10,521 liters of water, almost ten times the average U.S. household’s daily water use (300 gallons, or 1,135 liters/day) (United States Environmental Protection Agency, 2024), with each extra hour adding 1,894 liters and each extra chip adding 736 liters.

Carbon: The predicted carbon emission is 16,009 kg, about four times the annual emissions of a U.S. household (4,000 kg/year) (University of Michigan, 2024), with each extra hour adding 2,881 kg and each extra chip adding 1,120 kg.

This study also found that AI models have become more energy-efficient over time, but only slightly, with an estimated improvement of 0.13% per year. This suggests that while newer hardware is more efficient, its adoption has not been widespread. While the environmental impact of AI may be mitigated over time as hardware becomes more efficient, a focus on hardware alone may overlook other contributors to overall energy consumption. In this dataset, both Training Compute and Total Power Draw are often estimated values and may include some system-level overhead beyond the hardware itself. Therefore, the efficiency estimates in this study may reflect not just hardware performance but also other training-related overhead.

This study also observed substantial variation in energy efficiency even among models using the same hardware. One key finding is that longer training time can “drain” energy efficiency, reducing it by approximately 0.03% per additional hour. Further studies should explore how training practices, beyond hardware selection, impact the environmental costs of AI development.

References

Calvert, B. 2024. AI already uses as much energy as a small country. It’s only the beginning. Vox. https://www.vox.com/climate/2024/3/28/24111721/climate-ai-tech-energy-demand-rising
OpenAI Newsroom. 2024. Fresh numbers shared by @sama earlier today: 300M weekly active ChatGPT users. 1B user messages sent on ChatGPT every day. 1.3M devs have built on OpenAI in the US. Tweet via X. https://x.com/OpenAINewsroom/status/1864373399218475440
Epoch AI. 2025. Data on Notable AI Models. Epoch AI. https://epoch.ai/data/notable-ai-models
Shehabi, A., S.J. Smith, A. Hubbard, A. Newkirk, N. Lei, M.A.B. Siddik, B. Holecek, J. Koomey, E. Masanet, and D. Sartor. 2024. 2024 United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, California. LBNL-2001637.
Guidi, G., F. Dominici, J. Gilmour, K. Butler, E. Bell, S. Delaney, and F.J. Bargagli-Stoffi. 2024. Environmental Burden of United States Data Centers in the Artificial Intelligence Era. arXiv abs/2411.09786.
Bali, S. 2025. GPU Memory Essentials for AI Performance. NVIDIA Developer. https://developer.nvidia.com/blog/gpu-memory-essentials-for-ai-performance/
Krashinsky, R., O. Giroux, S. Jones, N. Stam, and S. Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. NVIDIA Developer. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
HuggingFace. 2025. Performance Tips for Training on Multiple GPUs. HuggingFace Documentation. https://huggingface.co/docs/transformers/en/perf_train_gpu_many
Popovic, G. 2022. Interpreting GLMs.
Environmental Computing. https://environmentalcomputing.net/statistics/glms/interpret-glm-coeffs/
U.S. Energy Information Administration. 2023. Use of Energy Explained: Electricity Use in Homes. https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php
United States Environmental Protection Agency. 2024. How We Use Water. https://www.epa.gov/watersense/how-we-use-water
Center for Sustainable Systems, University of Michigan. 2024. Carbon Footprint Factsheet. Pub. No. CSS09–05.

The post Rethinking the Environmental Costs of Training AI — Why We Should Look Beyond Hardware appeared first on Towards Data Science.
Source: https://towardsdatascience.com/rethinking-environmental-costs-of-training-ai-why-we-should-look-beyond-hardware/
#rethinking #the #environmental #costs #training #why #should #look #beyond #hardware
  • GPU Architecture & Working intuitively explained


    Author(s): Allohvk

    Originally published on Towards AI.

    GPU Origins
    The image displayed on a computer screen is made up of millions of tiny pixels. In the early days, “graphics controllers” were given instructions by the CPU on how to calculate the individual pixel values so that the appropriate image could be displayed. These were OK for conventional displays, but for a really good gaming experience, images need to be built dozens of times per second. The CPU was not really designed to handle this kind of load.
    The whole process of creating the image could be parallelized big-time simply by (a) dividing the image into smaller blocks, (b) carrying out computations for each block in parallel & (c) grouping them back again. The results of one block don’t influence the results of the other blocks. The CPU’s multi-threading capabilities were not really conceived for such massive parallelization. Enter the GPU! Sony first used the term GPU in 1994, in its PlayStation consoles. The technology was perfected by NVIDIA, which soon became a leader.
    GPUs have numerous computation cores (much more than a CPU) and gaming programmers could write Shaders — programs to run graphics computations on the GPU in a massively parallelized way to create the screen images in super-fast time. The GPU is inspired by the CPU but was specifically designed to enable massive multi-threaded operations on its numerous computation cores seamlessly. Creating threads, switching between threads etc is much faster on a GPU. Some smart developers also realized that these parallel processing capabilities could be used for other computationally intensive tasks as well!

    2005: Steinkraus et al. implement a simple 2-layer Neural Net on a GPU
    2006: Kumar et al. train a CNN model for document processing
    2007: NVIDIA released Compute Unified Device Architecture (CUDA) — a custom language extending C to exploit data parallelism on GPUs. Now developers had much more granular control over GPU computation, not just image rendering.
    2008: A landmark paper by Raina et al. was released. This paper pretty much showed everyone how to train deep layers on a GPU
    2014: NVIDIA released CuDNN — a dedicated CUDA library for Deep Learning. Very soon PyTorch, TensorFlow etc incorporated CuDNN, setting the stage for modern GPU usage for AI!

    A GPU is an ASIC or Application-Specific Integrated Circuit having a processor (hosting numerous computational cores), a memory soldered onto it (we want to avoid going to the CPU RAM for everything), a cooling system (well, they heat up pretty fast) and a BIOS chip (same role as the BIOS on a CPU motherboard — to store settings, run startup diagnostics etc.). This card is then plugged into the motherboard slot using the PCI Express interface. The terms GPU and graphics card are often used interchangeably. Some GPUs, like the one in the Apple M3, do not have a dedicated memory but instead use the system RAM itself, which is possible due to its unique design. Google has the TPU (Tensor Processing Unit) which is its own ASIC. We discuss the GPU memory, the processing cores, the LLM workflows happening inside them & common topologies for clustering.
    Photo by Thomas Foster on Unsplash
    1. GPU Memory module — The VRAM
    Instead of having the GPU talk to the regular RAM, it made sense to create another RAM physically closer to the GPU die so that data retrieval is faster. So a graphics card has a memory called VRAM — Video Random Access Memory, in addition to the computation engines. VRAM is connected to the computation engine cores via a bus called the memory interface.
    1.1 What is DRAM?
    Let us talk first of RAM technology in general. All memory, whether it is the CPU RAM or the GPU VRAM, is mostly based on DRAM technology, which consists of a capacitor and a transistor. The capacitor’s charge represents the data stored. Due to its very nature, this charge gradually leaks. To prevent data loss, a refresh circuit periodically rewrites the data back, restoring its charge. Hence the name — Dynamic RAM, due to these periodic refreshes.
    Most computers use Synchronous DDR5 DRAMs as their CPU RAM. Synchronous because it utilizes the system clock for better performance. In other words, the action (of retrieving & storing data) is operationally coordinated by an external clock signal. Tying the operations to the clock makes it faster. The processor knows the exact timing & number of cycles in which the data will be available from the RAM to the bus & can plan better. We have DDR1 (1st Gen Double Data Rate Synchronous Dynamic RAM, released in 2000) to DDR5, which is the choice of CPU RAM as of today.
    1.2 What is SGRAM?
    Let us now talk about the VRAMs in GPUs. The VRAM is a type of SGRAM — Synchronous Graphics RAM. The current generation of VRAMs being used is GDDR6. Yes, this is 6th generation GDDR, the G standing for “Graphics”. While DDR & GDDR share common origins and the first couple of generations were similar, the branches separated after DDR3. So as of 2025, DDR5 rules in CPU RAM and GDDR6 rules for consumer-grade GPU RAMs.
    Conceptually DDRs and GDDRs are similar but note that DDRs are used by CPUs which need low latency whereas GDDRs are used by GPUs which are OK to compromise latency for extremely high throughput. Crudely, the former has more frequent smaller calculations & the latter deals with much higher volume of data & some delays are forgiven considering the vast volumes of data being processed. Even more crudely, the former is a bullet train with 6–8 coaches while the latter a 3 Kilometre long goods train.
    1.3 GDDR VRAMs explained in detail
    GDDR memory consists of individual chips soldered to the PCB (Printed Circuit Board) very close to the GPU die. The physical proximity improves the speed of data transfer from the VRAM to the GPU processor. There are pins in a GDDR chip which can be thought of as individual wires that connect it to the processor. Bus width is literally the number of such connections. GDDR6 has 32 pins spread across 2 channels, with roughly 16 Gbit/s bandwidth per pin. Bandwidth is the total amount of data being moved &, if you had one single metric at your disposal to take a decision, it would be this. Before we go further, let us try to understand this metric intuitively.
    1.4 Calculating GPU Memory Bandwidth intuitively
    Memory Bandwidth is the max rate at which data can be transferred between the GPU and the VRAM. We discussed that data transmission is synchronized with the clock. The clock cycle is measured in hertz & represents the number of cycles per second. Let us say we have a clock operating at 1000 MHz. This literally means 1 billion clock ticks per second. How long does a tick last? Literally 1/(1 billion) i.e. 1 nano second. Data is sent to and fro every clock cycle. So every nano-second, a bus-full of data is sent from the VRAM to the processor & vice versa.
    How many seats on the bus? Well, we discussed this earlier… This is the memory interface or the bus width… literally the physical count of bits that fit into the bus. A 128-bit bus would ferry 128 bits every nano-second. The D in G’D’DR6 stands for Double. Basically, data is transmitted on both the rising and falling edges of the clock cycle, so 256 bits every nano-second. How many bytes in 1 sec? 256/8 i.e. 32 billion bytes per second or better still 32 GB/s as Giga is the preferred term when measuring data. The capital B denotes bytes whereas the small b denotes bits… a source of confusion.
    A more practical formula is: Bandwidth = Clock × Bus Width × Data Rate, where the Data Rate is the number of data transfers per cycle. GDDR6 is Double Data Rate (as just discussed) and quad pumped, which quadruples the (doubled) speed. So effectively the Data Rate is 8. Sometimes, you may encounter the same information couched in different semantics. E.g., if the frequency of the command clock (CK#) is N, then the write command clock (WK#) is 2N. GDDR6 rates are then QDR (quad data rate) in reference to WK# and ODR (octal data rate) in reference to CK#.
    Some OEMs multiply the clock speed & data rate & call it a clock rate or something. In that case, the bandwidth is simply that number multiplied by the bus width. In general, this raw formula can be used: num_of_transfers per second * num_of_bits per transfer / 8. “Boost clock” mechanism allows the GPU and GDDR memory to operate at even higher speeds than the default clock when conditions allow it. Boost clock metric refers to the max such operating clock speed. A 1750 MHz clock means:

    1.75GHz is the frequency of command clock(CK#).
    The frequency of the write clock (WK#) is 3.5GHz due to the G”D”DR
    The Quad pumping takes it to 3.5*4=14 G bits moved in 1 second from each pin on the bus.
    We could have bus widths of up to 384 bits! So we get a bandwidth of 14*384 Giga bits per second.
    Divide by 8 to get 672 GB/s. GDDR6 bandwidth can go upto 1 TB/s. Wow!
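
    The arithmetic above generalizes into a one-line helper. A quick sketch in R, using the GDDR6 figures assumed in the example (other parts will differ):

    # Peak bandwidth in GB/s = command clock (GHz) x data-rate multiplier x bus width (bits) / 8
    bandwidth_gbs <- function(clock_ghz, data_rate, bus_width_bits) {
      clock_ghz * data_rate * bus_width_bits / 8
    }
    bandwidth_gbs(1.75, 8, 384)   # ~672 GB/s, matching the GDDR6 worked example above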

    1.5 What is HBM VRAM in a GPU?
    When reading or writing data, contention is created when the VRAM has occupied memory channels & is busy receiving or delivering other data. This contention creates latency & this affects bandwidth. Increasing the number of memory channels is a great option. A type of memory called HBM (High-Bandwidth Memory) has lower access latency than GDDR6, since it has 8-memory channels versus 2 channels in GDDR6. HBM also has a wider bus.
    HBM has 1024 pins spread across 8 channels of 128 pins each, with roughly 2 Gbit/s bandwidth per pin. Compare this with (an equivalent) GDDR, which has 32 pins spread across 2 channels with roughly 16 Gbit/s bandwidth per pin. Notice how HBM keeps the Gbit/s per pin much lower than GDDR. This saves power (which is important, as we shall see). In spite of this, it has higher bandwidth than GDDR6 due to the wider bus & higher channel count.
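    The same per-pin arithmetic makes the per-device difference obvious (a rough sketch with the figures quoted above; real parts vary by generation):

    32   * 16 / 8   # one GDDR6 chip: 32 pins at 16 Gbit/s  -> ~64 GB/s
    1024 *  2 / 8   # one HBM stack: 1024 pins at 2 Gbit/s  -> ~256 GB/s; several stacks per GPU multiply this further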
    As we discussed, a pin is literally a wire connecting the VRAM to the processor. Having 1024 wires connected from the processor to the VRAM is not possible on a standard PCB. Therefore, an “interposer” is used as an intermediary to connect the VRAM & the processor. Just like a regular IC, wires (connections) are etched in this silicon “interposer” in the desired quantity. After this, the HBM device(s) & the processor are mounted atop this “interposer”. This slightly twisted workaround is called a 2.5D architecture. Another difference is that while GDDR chips are soldered to the PCB surrounding the GPU die, an HBM structure is a vertical stack of DRAMs, like a high-rise building. The stacked memory dies are linked using microscopic wires with TSVs (Through-Silicon Vias), which are vertical electrical connections giving super-fast connectivity between the DRAMs. There are huge challenges to stacking items vertically, especially around designing heat sinks & managing thermal safety, but somehow HBM manufacturers have made this happen.
    HBM has become a gold standard today for AI data centers. It was introduced to the Market by SK Hynix in 2013. Today, we have the 3rd generation HBM3 and their main client is Nvidia. Due to investments made way back, SK Hynix is leading the pack along with Samsung and a relatively recent entrant named Micron. We hear a lot about chips and TSMC but HBM is a key technology to watch out for in the coming years. We typically have more than one HBM devices inside the GPU die.
    GDDR6 co-exists with HBM3. The markets are complementary. The former addresses PCs & other consumer GPUs whereas the latter addresses data center GPUs. Ultra large scale AI deployments like ChatGPT likely leverage the use of a cluster of NVIDIA GPUs working in tandem. Connecting such GPU’s involves the use of NVIDIA NVLink technology which requires fast GPU memory bandwidth speeds and it’s the reason why HBM is prevalent in such systems. If not for the wide bus width and fast data transfer rates offered by HBM, these kind of clusters would be very difficult to design.
    Besides the VRAM, GPUs also include high-speed memory caches that are even closer to the GPU’s processing cores. There is a physical limit to the sizes of these caches. An L1 cache is usually in KB and an L2 cache is usually a few MB. Different hardware & software strategies exist to keep the most useful, and most reused data present in caches.
    2. Cooling Mechanisms in a GPU
    Higher clock speeds generally result in increased heat generation, necessitating cooling solutions to maintain optimal operating temperatures. Usual cooling methods are:

    Passive Cooling: These do not have any powered moving components. They take advantage of optimized airflow to take heat away.
    Fans are used to dissipate heat by blowing cool air across the heat sinks, which are metal components designed to absorb & disperse heat
    In water cooling, water is circulated through the GPU surface using pipes & a radiator. The hot liquid running through the pipes is in turn cooled down by the radiator fan.
    Hybrid cooling — which uses a combination of the above

    3. GPU Computation cores — Processors
    Let us now talk about the processors on the GPU. Unlike CPUs, which contain only a few cores, the GPU literally has 1000’s of cores & specializes in running tasks in parallel across these cores using SIMD (Single Instruction, Multiple Data) units. Let us stick to NVIDIA terminology. There are multiple processing units called Streaming Multiprocessors (SMs) on an NVIDIA GPU. For e.g., an H100 has up to 144 SMs. What is inside an SM? Well, there are mainly 2 types of execution units — CUDA cores & Tensor cores. There is also a small SRAM memory which is shared between all threads running in that SM. More specifically, every SM has a few hundred KB of memory that is partitioned between L1 cache & Shared Memory usage.
    3.1 CUDA core versus Tensor core in a GPU — The difference
    Tensor cores are a pretty recent innovation (from V100 onwards) and are specifically designed for faster matrix multiplication. Let us discuss CUDA cores first. These are the computation engines for regular math operations. Each CUDA core can execute one operation per clock cycle. But their strength lies in parallel processing. Many CUDA cores working together can accelerate computation by executing processes in parallel.
    Tensor Cores are specialized hardware units designed to accelerate “mixed precision” training. The earliest version allowed 4×4 FP16 matrices to be multiplied & added to an FP32 output matrix. By using lower-precision FP16 inputs in the computations, the calculations are vastly accelerated, & by retaining FP32 outputs for the rest of the procedure, accuracy is not compromised too much. Modern tensor cores use even lower precision formats in DL computations. See this for more details. There may also be specialized units like the transformer engine, designed to accelerate models built with Transformer blocks. A single GPU can be partitioned into multiple fully contained and isolated instances, with their own memory, cache & cores, via MIG or Multi-Instance GPU technology.
    3.2 GPU operations — A FLOP show
    Let us now talk about actual operations. A FLOP (Floating Point Operation) is a single floating-point calculation like an addition. Performance of a GPU is usually measured in TeraFLOP/s. Tera is a trillion, FLOP stands for floating-point operations and the ‘s’ stands for per second.
    Most matrix ops involve a multiply and an add. It makes sense to fuse these ops together to get a Fused Multiply-Add (FMA) op. If we know the FMA speed, we can simply double it to get the FLOP count per clock. To get the peak FLOP/s rate, we multiply this by the clock rate, the number of cores per SM & the number of SMs. Note that we have FP16, FP32, FP64 & Int8 cores with varying speeds. For e.g.:

    Say there are 4 tensor cores in each SM & 114 SMs in an H100
    Say each tensor core delivers 512 FP16 FMA ops per clock. Careful here: read the specs clearly to check whether the FMA-ops-per-clock metric is quoted per SM or per individual core. For e.g., the linked A100 spec quotes it per SM.
    Let the Clock speed = 1620 MHz
    So TFLOP/s = 1620 * (2*512) * 4 * 114= 756 TFLOP/s of performance! 756 Trillion operations per second. Wow! What would Babbage say to that?
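
    The same back-of-envelope can be wrapped up for reuse. A small R sketch with the H100-style numbers assumed in the list above:

    # Peak throughput: clock (GHz) x 2 FLOPs per FMA x FMA ops per clock per core x cores per SM x SMs
    peak_tflops <- function(clock_ghz, fma_per_clock_per_core, cores_per_sm, n_sm) {
      clock_ghz * 2 * fma_per_clock_per_core * cores_per_sm * n_sm / 1000   # GFLOP/s -> TFLOP/s
    }
    peak_tflops(1.62, 512, 4, 114)   # ~756 TFLOP/s, as computed above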

    4. Putting everything together — LLM Operations in a GPU
    Given this immense compute power, we can now make a reasonable guess that LLM inference is memory-IO bound, not compute bound. In other words, it takes more time to load data to the GPU’s compute cores than it does for those cores to perform LLM computations on that data. The processing itself is super-fast & there is enough & more compute power available.
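    A rough back-of-envelope makes the imbalance concrete for the decode step of a hypothetical 7B-parameter model in FP16 at batch size 1, reusing the ~672 GB/s and ~756 TFLOP/s example figures from earlier sections purely for illustration (these are not measurements):

    params    <- 7e9                      # hypothetical 7B-parameter model
    t_memory  <- (params * 2) / 672e9     # stream FP16 weights once per token at ~672 GB/s -> ~21 ms
    t_compute <- (params * 2) / 756e12    # ~2 FLOPs per weight per token at ~756 TFLOP/s   -> ~0.02 ms
    c(memory_ms = t_memory * 1e3, compute_ms = t_compute * 1e3)

    Even with much faster HBM in place of the GDDR6 figure, moving the weights takes orders of magnitude longer than computing on them, which is what “memory-IO bound” means in practice.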

    To start with, the training data needs to be downloaded from a remote source to the CPU memory
    From there, it needs to be transferred to the GPU via the system bus and PCIe bus. The host(CPU)-to-device(GPU) bandwidth is limited by the CPU frequency, PCIe bus, GPU devices & the number of PCIe lanes available.
    Once the data & weights are in the GPU VRAM, they are then ferried across to the SRAM where the processors perform operations on it.
    After the operation the data is moved back to the VRAM & from there it is moved back to the CPU RAM. This is a rather simplistic view. Inside the GPU, the tensors are repeatedly moved back and forth between VRAM & SRAM (the memory allocated to an SM). Can you guess why?

    We saw that SRAM size is in KB, so large matrices are not going to fit in there … which explains why there is a constant movement between the VRAM, which holds all the tensors, and the SRAM, which holds the data on which compute operations are performed. So there is typically a memory-op where tensors are moved from VRAM to SRAM, then a compute-op in SRAM, and a memory-op to move tensors back from SRAM to VRAM. Computations like a matrix multiplication involving 2 large matrices need several such memory + compute ops before the action is completed.
    During the training of GPT-3, the tensor cores on the GPUs used were found to be idle ~50% of the time. So, to extract the best from the infrastructure, data movement needs to be fast enough to ensure the computation cores are kept reasonably occupied. Surely, there is scope for some smart person to come up with shortcuts. Enter Flash attention & other such hacks. But that is a story for another day!
    5. Linking GPUs for LLM training — Topologies
    While LLM inferencing is manageable with a readymade collection of GPUs such as a DGX server (which contains 8 H100s), LLM training needs far more GPUs. Before we discuss how to connect GPUs for larger workloads, it makes sense to see how CPU servers are connected in a datacentre. I am not an expert in this area, so please feel free to point out any incorrect interpretations I may have made from the references I quote.
    5.1 Generic concepts on linking processors
    Each server has a card attached to it called the Network Interface Card (NIC). RDMA technology enables direct memory access to a remote server via the NIC hardware. RoCE (RDMA over Converged Ethernet) protocol uses the RDMA technology & adapts it to Ethernet networks. So now, a server can talk to a remote server over a network. A network switch is a device connecting multiple servers in a network, enabling them to communicate with each other. This is the basic technology. Now let us come to the topology.
    So we assemble all the servers physically in one place and stack them vertically in neat racks. A very basic topology is to connect each server in a rack to a switch that usually sits on Top of the Rack, aptly named the ToR switch. The ToR switches of different racks are connected to a Spine switch. This topology is a basic implementation of the Clos topology — named after Charles Clos, who invented this scheme to originally arrange telephone nodes in a “leaf-n-spine” arrangement. The leaf switches are nothing but the ToR switches in modern data centers.
    Source: Fig 1–1 from https://www.oreilly.com/library/view/bgp-in-the/9781491983416/ch01.html
    Fat tree is a variant of Clos. Like before, we have servers arranged into racks connecting to Top-of-the-Rack (ToR) switches. ToR switches are connected to the aggregation switches to provide connectivity across racks, forming a pod. The pods are interconnected with spine switches, allowing any-to-any communication across servers. To be noted is the fact that there are multiple paths connecting servers. So there is lot of redundancy built-in.
    In a typical App deployment running hundreds of microservices on dozens of servers, it is useful to have such fully connected, high bandwidth networks. You never know who is going to talk to whom so it never hurts to overprovision on bandwidth and connectivity. However, network loads during AI training do not follow these patterns. They are more predictable & this allows us to build optimized, cheaper & less power-hungry networks.
    5.2 Linking GPUs via proprietary technology like NVLink
    We can strap together H100’s by leveraging the proprietary NVLink & NVSwitch technologies. NVLink provides the high-speed connection between individual GPUs, while NVSwitch is a chip that enables multiple GPUs to communicate through NVLink, forming a high-bandwidth network. See this nice article for details.
    NVIDIA’s P100 GPU introduced NVLink1. At that time there was no NVSwitch chip, and the GPUs were connected in a ring-like configuration, which meant there was no direct point-to-point communication between every pair of GPUs. The NVSwitch1 chip was introduced with the V100, followed by the NVSwitch2 chip with the A100 GPU. We are now at the third-generation NVSwitch3, which can support a cluster of up to 256 H100 GPUs. Each H100 GPU in such a cluster is connected to the NVSwitch3 chips through 18 NVLink 4.0 connections. This is how trillion-parameter LLMs are inferenced.
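    As a rough sanity check on why this fabric is so fast, multiply the link count by the per-link bandwidth. The per-link figure in the sketch is the commonly quoted number for NVLink 4.0 and should be treated as an assumption; check NVIDIA's spec sheets for exact values.

        # Back-of-envelope aggregate NVLink bandwidth per H100 in an NVSwitch domain.
        links_per_gpu = 18     # NVLink 4.0 links per H100 (from the text above)
        gb_per_link = 50       # GB/s bidirectional per link (assumed, commonly quoted figure)
        print(links_per_gpu * gb_per_link, "GB/s per GPU")   # -> 900 GB/s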
    5.3 Linking GPUs via RoCE in a rail-optimized topology
    But, as the old slogan goes, ye dil maange more (the heart wants more)… Meta reportedly trains its newer models on a cluster of over 100K H100s. Phew! How do they manage to link it all up? The standard NVLink tricks can only scale to a limited number of GPUs. Beyond that, we have to use the network topologies discussed earlier & fall back on technologies like RoCE, which allows data to be transferred directly from one GPU’s memory to another without involving the CPU.
    So you have 8 GPUs in one DGX server, and several such DGX servers in the data centre. Each GPU is assigned its own NIC (yes!) & connected via RDMA to all other GPUs through a variant of the Clos network called a “rail-optimized network”. The idea is to set up dedicated connections between groups of GPUs with rail switches. If a GPU wants to communicate with a GPU in a different group, it has to go through a spine switch (which takes a little more time). To implement this, each GPU in a DGX server is indexed serially. A rail is the set of GPUs with the same index on different servers, and these are interconnected with a rail switch via RDMA. These rail switches are in turn connected to spine switches, forming an any-to-any GPU network.
    Source: Fig 1 from https://arxiv.org/pdf/2307.12169
    This topology streamlines traffic flow. It is like having dedicated lanes for high-speed vehicles instead of mixing all traffic together. Rail paths are direct connections between the GPUs that share an index, while spine switches serve as the connecting points for differently-indexed GPUs. For example, communication between GPU1 of server 1 and GPU1 of server 2 happens via their dedicated rail switch 1. If GPU1 of server 1 needs to reach GPU5 of another server, it has to go through a spine switch.
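    Here is a minimal, illustrative routing rule for such a fabric (the function and its return strings are invented for illustration; a real network does this in its switching hardware):

        def rail_optimized_route(src_server, src_gpu, dst_server, dst_gpu):
            # Same server      -> stay on NVLink/NVSwitch inside the box
            # Same GPU index   -> use that rail's dedicated switch
            # Different index  -> climb up to a spine switch
            if src_server == dst_server:
                return "NVLink/NVSwitch inside the server"
            if src_gpu == dst_gpu:
                return f"rail switch {src_gpu}"
            return f"rail switch {src_gpu} -> spine switch -> rail switch {dst_gpu}"

        print(rail_optimized_route(1, 1, 2, 1))   # GPU1 <-> GPU1: rail switch 1
        print(rail_optimized_route(1, 1, 3, 5))   # GPU1 <-> GPU5: via a spine switch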
    The workloads are designed to minimize data transfers across rails (since those have to traverse the extra spine switch). The good news is that this can be done neatly for AI training, ensuring that most of the traffic stays within the rails and does not cut across. In fact, a recent paper suggests that you can consider removing the costly spine switches altogether, since inter-rail communication is minimal. Can you guess how?
    5.4 Linking GPUs via RoCE in a rail-only topology
    Well, we have superfast NVLink connectivity between a limited set of GPUs (up to 256). So you create High Bandwidth (HB) domains which use NVLink for communication internally, and you have several such HB domains. We then use the same indexing system and rail connections to interconnect the HB domains. But there are no spine switches! Can you guess how GPU1 of HB domain 1 can talk to GPU5 of another HB domain? Yes! Transfer the data via superfast NVLink to GPU5 of HB domain 1 first, then use GPU5’s dedicated rail to talk to the GPU5 in the other HB domain! This is a rail-only topology, as opposed to the rail-optimized topology.
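    Extending the earlier sketch, a hypothetical rail-only route looks like this (again purely illustrative, with invented names):

        def rail_only_route(src_domain, src_gpu, dst_domain, dst_gpu):
            # With no spine switches, a cross-rail transfer first hops over NVLink
            # inside the source HB domain to the GPU that shares the destination's
            # rail index, then rides that rail switch to the other HB domain.
            if src_domain == dst_domain:
                return "NVLink inside the HB domain"
            hops = []
            if src_gpu != dst_gpu:
                hops.append(f"NVLink hop to GPU{dst_gpu} inside HB domain {src_domain}")
            hops.append(f"rail switch {dst_gpu} to HB domain {dst_domain}")
            return " -> ".join(hops)

        print(rail_only_route(1, 1, 2, 5))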
    Given these topologies, we can now plan the training pipeline to use pipeline parallelism, tensor parallelism and/or data parallelism, but that is a story for another day. See this, this & this for more details. 100K H100s consume a LOT of power. Tech companies are exploring nuclear power options to generate the clean energy needed for long-term sustenance. Otherwise, a 100K GPU cluster may have to be broken down into smaller clusters connected using optical transceivers across the buildings in a campus.
    This (unplanned) article is a prelude to — Optimizing LLM inference: Key Faultlines & workarounds. To deeply understand how we can optimize LLM operations, we need to understand more about the silicon on which they are executed. Though there are lots of manuals/guides on individual aspects like memory, processors, networking etc., I couldn’t find a concise and reader-friendly thread linking these various aspects together & hence took a shot. This is the 9th article in a 15-part series titled My LLM diaries.

    LLM Quantization — From concepts to implementation
    LoRA & its newer variants explained like never before
    In-Context learning: The greatest magic show in the kingdom of LLMs
    RAG in plain English — Summary of 100+ papers
    HNSW — Story of the world’s most popular Vector search algorithm
    VectorDB origins, Vamana & on-disk Vector search algorithms
    Taming LLMs — A study of few popular techniques
    Understanding LLM Agents: Concepts, Patterns & Frameworks
    Anatomy of a GPU — A peek into the hardware fuelling LLM operations
    Optimizing LLM Inference — Key Faultlines & workarounds
    LLM Serving — Architecture considerations
    LLM evaluation & other odds and ends
    Look Ma, LLMs without Prompt Engineering
    LLMs on the laptop — A peek into the Silicon
    Taking a step back — On model sentience, conscientiousness & other philosophical aspects


    Published via Towards AI



    Source: https://towardsai.net/p/machine-learning/gpu-architecture-working-intuitively-explained