Meta Introduces KernelLLM: An 8B LLM that Translates PyTorch Modules into Efficient Triton GPU Kernels

Meta has introduced KernelLLM, an 8-billion-parameter language model fine-tuned from Llama 3.1 Instruct, aimed at automating the translation of PyTorch modules into efficient Triton GPU kernels. The initiative seeks to lower the barrier to GPU programming by simplifying the kernel development process.
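To make the task concrete, here is a minimal sketch of the kind of translation KernelLLM performs: a reference PyTorch module and a hand-written Triton equivalent. The pair below is our own illustration, not model output.

```python
# Illustrative sketch (our own, not KernelLLM output) of the translation
# task: a reference PyTorch module and a hand-written Triton equivalent.
import torch
import triton
import triton.language as tl


class AddModule(torch.nn.Module):
    """Reference PyTorch module: elementwise addition."""
    def forward(self, x, y):
        return x + y


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add_triton(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

KernelLLM's job is to produce the Triton half of such a pair directly from the PyTorch half.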
Technical Overview
KernelLLM is trained on approximately 25,000 paired examples of PyTorch modules and their corresponding Triton kernel implementations. The dataset, known as KernelBook, comprises filtered code from The Stack and synthetically generated samples produced with torch.compile and other prompting techniques.
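The article does not detail the extraction pipeline behind these synthetic samples. One plausible mechanism, sketched below as an assumption rather than a description of Meta's tooling, is to let torch.compile's Inductor backend emit Triton source for a module and capture it from the compilation logs:

```python
# One plausible way to harvest (PyTorch, Triton) pairs via torch.compile;
# the actual KernelBook tooling is not described in this article, so treat
# this as an assumption-laden sketch. Running the script with
# TORCH_LOGS="output_code" makes Inductor print the Triton kernels it
# generates.
import torch


class Model(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0      # a simple, fusible op sequence


model = torch.compile(Model())          # Inductor backend by default
x = torch.randn(1024, device="cuda")
model(x)  # first call triggers compilation; Triton source appears in logs
```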
The model employs a supervised instruction tuning approach, utilizing prompt templates that include format examples during both training and evaluation. Training was conducted over 10 epochs with a batch size of 32, using 16 GPUs for approximately 12 hours (about 192 GPU-hours).
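The exact template is not reproduced in the article; a hypothetical template in the same spirit, used to build supervised (prompt, completion) pairs, might look like:

```python
# Hypothetical instruction-tuning template; the actual KernelLLM prompt
# (and its in-context format examples) is not reproduced in this article,
# so the wording and field names below are illustrative assumptions.
PROMPT_TEMPLATE = """You are given a PyTorch module. Rewrite it as an
equivalent implementation using Triton kernels.

### PyTorch module:
{pytorch_source}

### Triton implementation:
"""


def build_sft_example(pytorch_source: str, triton_source: str) -> dict:
    """Pair a prompt with its target completion for supervised fine-tuning."""
    return {
        "prompt": PROMPT_TEMPLATE.format(pytorch_source=pytorch_source),
        "completion": triton_source,
    }
```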

Performance Evaluation
KernelLLM’s performance was assessed on KernelBench-Triton, a benchmark designed to evaluate the generation of Triton kernels from PyTorch modules. The model achieved a Pass@1 score of 20.2, outperforming much larger models such as GPT-4o (~200B parameters) and DeepSeek V3 (671B parameters), which scored 15 and 16 respectively. With multiple inference attempts, KernelLLM’s Pass@10 and Pass@20 scores reached 51.8 and 57.1 respectively, indicating robust performance when several candidate kernels are sampled.
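For context, Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): from n generations of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k). The KernelBench-Triton harness may differ in detail, but the sketch below shows the standard computation.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for
# reference; the KernelBench-Triton harness may differ in detail.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations sampled, c = generations that pass, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 20 generations per task, 5 of which pass the tests.
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 10))  # ~0.984
```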
Implications for GPU Programming
By automating the generation of Triton kernels from PyTorch modules, KernelLLM has the potential to streamline the development of GPU-accelerated applications. This could be particularly beneficial for developers seeking to optimize performance without delving into the complexities of manual kernel programming.
The model’s ability to produce efficient kernels may also contribute to more accessible and efficient utilization of GPU resources, potentially impacting areas such as deep learning model training and inference.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.
