DeepSeek AI Releases DeepEP: An Open-Source EP Communication Library for MoE Model Training and Inference
www.marktechpost.com
Large language models that use the Mixture-of-Experts (MoE) architecture have enabled significant increases in model capacity without a corresponding rise in computation. However, this approach also introduces challenges, especially when it comes to communication between GPUs. In MoE models, only a subset of experts is active for any given token, so efficiently exchanging data among devices is critical. Traditional methods for all-to-all communication can create bottlenecks that increase latency and underutilize GPU resources. In latency-sensitive settings, such as real-time inference, even small delays can affect overall performance. Moreover, while low-precision operations (such as FP8) help reduce memory usage, they require careful optimization to maintain model quality. These issues underscore the need for a communication library tailored to the specific demands of expert parallelism.

DeepSeek AI has recently introduced DeepEP, a communication library specifically designed for MoE models and expert parallelism (EP). DeepEP addresses the inefficiencies inherent in how tokens are dispatched and aggregated across GPUs. The library provides high-throughput, low-latency all-to-all GPU kernels (commonly referred to as MoE dispatch and combine kernels) that streamline data exchange during both training and inference. Notably, DeepEP supports low-precision operations (including FP8), aligning with techniques detailed in the DeepSeek-V3 paper. This release responds directly to the challenges of scaling MoE architectures in both intranode and internode environments.

Technical Overview and Benefits

DeepEP offers two primary types of kernels designed to meet different operational needs:

Normal Kernels: These kernels are optimized for scenarios that require high throughput, such as during the pre-filling phase of inference or training. They efficiently forward data across GPUs by taking advantage of both NVLink and RDMA networking technologies.
For instance, tests on Hopper GPUs with NVLink have shown throughput around 153 GB/s for intranode communication, while internode tests using CX7 InfiniBand (approximately 50 GB/s bandwidth) achieve stable performance of roughly 43 to 47 GB/s. By maximizing available bandwidth, these kernels reduce communication overhead during token dispatch and result combining.

Low-Latency Kernels: For inference tasks where responsiveness is crucial, DeepEP provides low-latency kernels that rely solely on RDMA. These kernels are tailored to handle small batches, common in real-time applications, with reported latencies as low as 163 microseconds for dispatch operations involving eight experts. The design also incorporates a hook-based communication-computation overlapping technique that allows data transfers to occur concurrently with computation, without consuming GPU streaming multiprocessors (SMs).

DeepEP further offers flexibility through adaptive configurations. Users can adjust parameters such as the number of SMs in use or set environment variables (for example, NVSHMEM_IB_SL) to manage traffic isolation. Adaptive routing, which is currently supported in the low-latency kernels, helps distribute network traffic evenly under heavy loads, thereby improving robustness.

Performance Insights and Practical Outcomes

The performance metrics for DeepEP are noteworthy. In typical tests using normal kernels, intranode communication can achieve throughput up to 153 GB/s, and internode setups maintain around 43 to 47 GB/s over RDMA. Low-latency kernels are particularly effective in production scenarios; for a batch of 128 tokens processed with eight experts, dispatch latency can be as low as 163 microseconds.
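To make the dispatch and combine terminology concrete, here is a minimal pure-Python sketch of the data movement those kernels perform. It is not DeepEP's API and involves no GPUs or networking: the function names (`route`, `dispatch`, `run_experts`, `combine`), the 4-rank layout with 16 experts, the toy expert function, and the reading of "eight experts" as eight experts selected per token are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical sizes chosen for illustration only.
NUM_TOKENS = 128
NUM_RANKS = 4
EXPERTS_PER_RANK = 4
TOP_K = 8  # assumption: each token is routed to 8 experts


def route(num_tokens, num_experts, top_k, seed=0):
    """Pick top_k distinct experts per token with normalized random gate weights."""
    rng = random.Random(seed)
    routing = []
    for _ in range(num_tokens):
        experts = rng.sample(range(num_experts), top_k)
        raw = [rng.random() + 1e-3 for _ in experts]
        total = sum(raw)
        routing.append([(e, w / total) for e, w in zip(experts, raw)])
    return routing


def dispatch(tokens, routing, experts_per_rank):
    """Group one message per (token, expert) pair by the rank that owns the expert."""
    inbox = defaultdict(list)
    for tok_id, choices in enumerate(routing):
        for expert, weight in choices:
            rank = expert // experts_per_rank
            inbox[rank].append((tok_id, expert, weight, tokens[tok_id]))
    return inbox


def run_experts(inbox):
    """Toy expert: scale the token value by (expert_id + 1), then apply the gate weight."""
    outbox = []
    for records in inbox.values():
        for tok_id, expert, weight, value in records:
            outbox.append((tok_id, weight * value * (expert + 1)))
    return outbox


def combine(outbox, num_tokens):
    """Send results back to the token's origin and sum the weighted contributions."""
    out = [0.0] * num_tokens
    for tok_id, contribution in outbox:
        out[tok_id] += contribution
    return out


tokens = [float(i) for i in range(NUM_TOKENS)]
routing = route(NUM_TOKENS, NUM_RANKS * EXPERTS_PER_RANK, TOP_K)
inbox = dispatch(tokens, routing, EXPERTS_PER_RANK)
result = combine(run_experts(inbox), NUM_TOKENS)

# Reference: the same mixture computed without any dispatch/combine step.
reference = [
    sum(w * tokens[t] * (e + 1) for e, w in choices)
    for t, choices in enumerate(routing)
]
matches = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(result, reference)
)
messages = sum(len(records) for records in inbox.values())
print(messages, matches)  # 128 tokens * 8 experts = 1024 messages
```

Each token produces one message per selected expert, so the exchange moves far more data than the batch size alone suggests. That volume is what the all-to-all step has to carry in both directions, which is why its bandwidth and latency matter.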
These throughput and latency figures mean that the overall inference process becomes more efficient, allowing for larger batch sizes and smoother overlap between computation and communication.

In practical terms, these optimizations lead to faster response times in inference decoding and improved throughput in training scenarios. The inclusion of FP8 support not only lowers the memory footprint but also facilitates quicker data transfers, which is essential when deploying models in environments where resources are limited.

Conclusion

DeepEP is a thoughtful contribution to the field of large-scale language model deployment. By addressing key communication bottlenecks in MoE architectures, it enables more efficient training and inference. Its dual-kernel approach, with one set designed for high throughput and another for low latency, offers flexibility for a range of applications. Built with support for low-precision operations and equipped with mechanisms for adaptive configuration, DeepEP gives researchers and developers a practical tool to further optimize expert parallelism.

In summary, DeepSeek AI's release of DeepEP represents a careful, well-engineered solution that balances performance with resource efficiency. Its design helps pave the way for more scalable and responsive AI models, supporting both academic research and real-world applications in a cost-effective manner.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

By Asif Razzaq, CEO of Marktechpost Media Inc.
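As a back-of-the-envelope check on the figures quoted in this article, the sketch below computes how much of the nominal 50 GB/s internode link the reported 43 to 47 GB/s represents, and how FP8 halves the raw payload relative to a 16-bit format. The hidden size of 7168, the helper names, and the idealized transfer-time model (no protocol overhead, no scaling factors, one copy per token, 1 GB taken as 1e9 bytes) are assumptions for illustration, not values from the article.

```python
def link_utilization(achieved_gb_s, peak_gb_s):
    """Fraction of the nominal link bandwidth actually achieved."""
    return achieved_gb_s / peak_gb_s


def payload_bytes(num_tokens, hidden_size, bytes_per_element):
    """Raw activation payload for one copy of each token, excluding metadata."""
    return num_tokens * hidden_size * bytes_per_element


def transfer_time_us(num_bytes, bandwidth_gb_s):
    """Idealized wire time in microseconds, with 1 GB taken as 1e9 bytes."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e6


NUM_TOKENS = 128    # batch size quoted in the article
HIDDEN_SIZE = 7168  # assumed for illustration; not stated in the article
PEAK_GB_S = 50      # approximate CX7 InfiniBand bandwidth quoted in the article

low = link_utilization(43, PEAK_GB_S)
high = link_utilization(47, PEAK_GB_S)
print(f"internode utilization: {low:.0%} to {high:.0%}")

fp8_bytes = payload_bytes(NUM_TOKENS, HIDDEN_SIZE, 1)   # 1 byte per element
bf16_bytes = payload_bytes(NUM_TOKENS, HIDDEN_SIZE, 2)  # 2 bytes per element
print(f"FP8 payload:  {fp8_bytes} bytes")
print(f"BF16 payload: {bf16_bytes} bytes")

# Idealized wire time at the upper end of the reported internode range.
fp8_time = transfer_time_us(fp8_bytes, 47)
bf16_time = transfer_time_us(bf16_bytes, 47)
print(f"ideal transfer time: FP8 {fp8_time:.1f} us, BF16 {bf16_time:.1f} us")
```

Under these assumptions the normal kernels reach roughly 86% to 94% of the nominal link bandwidth, and sending activations in FP8 halves both the bytes on the wire and the idealized transfer time. Real dispatch traffic is larger, since each token is sent to several experts and carries extra metadata, so these numbers are best read as a floor and not as a prediction of measured latency.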