www.marktechpost.com
Real-time speech translation presents a complex challenge, requiring seamless integration of speech recognition, machine translation, and text-to-speech synthesis. Traditional cascaded approaches often introduce compounding errors, fail to retain speaker identity, and suffer from slow processing, making them less suitable for real-time applications like live interpretation. Additionally, existing simultaneous translation models struggle to balance accuracy and latency, relying on complex inference mechanisms that are difficult to scale. A significant barrier remains the lack of large-scale, well-aligned speech datasets, limiting the ability to train models that can generate contextually accurate and natural translations with minimal delay.

Kyutai has developed Hibiki, a 2.7 billion-parameter decoder-only model designed for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation. Operating at a 12.5 Hz frame rate with a 2.2 kbps bitrate, Hibiki currently supports French-to-English translation and is designed to preserve voice characteristics in the translated output. A distilled version, Hibiki-M (1.7B parameters), is optimized for real-time performance on smartphones, making it more accessible for on-device translation.

Technical Approach and Benefits

Hibiki's decoder-only architecture enables simultaneous speech processing using a multistream language model that predicts both text and audio tokens. It employs a neural audio codec (Mimi) to compress audio while maintaining fidelity, ensuring efficient translation generation. A key aspect of its design is contextual alignment, a method that leverages a text translation model's perplexity to determine optimal timing for generating speech, allowing Hibiki to adjust translation delays dynamically while maintaining coherence. Additionally, Hibiki supports batch inference, processing up to 320 sequences in parallel on H100 GPUs, making it viable for large-scale applications.
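The contextual alignment idea described above can be illustrated with a toy sketch. This is not Kyutai's implementation: the function name, the log-probability table, and the monotonicity step are all assumptions made for illustration. The sketch supposes a text translation model has already scored each target word given growing prefixes of the source sentence, and it picks, for each target word, the source word whose arrival produces the largest jump in log-likelihood (equivalently, the largest drop in perplexity).

```python
def contextual_alignment(logprobs):
    """Toy perplexity-based alignment (illustrative assumption, not Hibiki's code).

    logprobs[j][i] is a hypothetical log-probability of target word j given the
    previous target words and the first i source words (i = 0 means no source).
    Returns, for each target word, the 0-based index of the source word whose
    arrival yields the largest gain in log-likelihood.
    """
    alignment = []
    for row in logprobs:
        # Gain obtained when source word i-1 becomes visible.
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        best = max(range(len(gains)), key=gains.__getitem__)
        alignment.append(best)

    # Assumed post-processing: keep the alignment monotonic, so a target word
    # is never scheduled before the words that precede it.
    for j in range(1, len(alignment)):
        alignment[j] = max(alignment[j], alignment[j - 1])
    return alignment


# Hypothetical scores for "je mange une pomme" -> "I eat an apple".
toy_scores = [
    [-5.0, -0.5, -0.4, -0.4, -0.4],  # "I" becomes likely once "je" is seen
    [-6.0, -5.8, -0.7, -0.6, -0.6],  # "eat" needs "mange"
    [-4.0, -3.9, -3.5, -0.3, -0.3],  # "an" needs "une"
    [-7.0, -6.9, -6.8, -6.0, -0.2],  # "apple" needs "pomme"
]
print(contextual_alignment(toy_scores))
```

In this toy case each target word is aligned to the source word at the same position. In a real system the resulting indices would be mapped to timestamps to decide how long the translated speech should wait before each word is produced.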
The model is trained on 7M hours of English audio, 450K hours of French, and 40K hours of synthetic parallel data, contributing to its robustness across varied speech patterns.

Performance and Evaluation

Hibiki has demonstrated strong performance in translation quality and speaker fidelity. It achieves an ASR-BLEU score of 30.5, surpassing existing baselines, including offline models. Human evaluations rate its naturalness at 3.73/5, approaching the 4.12/5 score of professional human interpreters. The model also performs well in speaker similarity, with a 0.52 similarity score compared to 0.43 for Seamless. Compared to Seamless and StreamSpeech, Hibiki consistently delivers higher translation quality and better voice transfer while maintaining competitive latency. The distilled Hibiki-M variant, though slightly lower in speaker similarity, remains effective for real-time on-device use.

Conclusion

Hibiki provides a practical approach to real-time speech translation, integrating contextual alignment, efficient compression, and real-time inference to improve translation quality while preserving natural speech characteristics. By offering an open-source release under a permissive CC-BY license, Hibiki has the potential to contribute significantly to advancements in multilingual communication.

Hibiki 2B for PyTorch (bf16): kyutai/hibiki-2b-pytorch-bf16
Hibiki 1B for PyTorch (bf16): kyutai/hibiki-1b-pytorch-bf16
Hibiki 2B for MLX (bf16): kyutai/hibiki-2b-mlx-bf16
Hibiki 1B for MLX (bf16): kyutai/hibiki-1b-mlx-bf16

Asif Razzaq is the CEO of Marktechpost Media Inc.