ElevenLabs' new speech-to-text model Scribe is here with the highest accuracy rate so far (96.7% for English)
venturebeat.com
ElevenLabs, the highly valued AI voice cloning and generation startup from Palantir alumni, today launched Scribe v1, a new speech-to-text model that reportedly achieves the highest accuracy across multiple languages. Users can try it here.
According to the company's benchmarks, it outperforms Google's Gemini 2.0 Flash, OpenAI's Whisper v3 and Deepgram Nova-3 at accurately converting speech into text, achieving new record-low error rates.
The company claims that Scribe delivers state-of-the-art transcription accuracy in 99 languages, including improved performance in previously underserved languages such as Serbian, Cantonese and Malayalam.
As Flavio Schneider, ElevenLabs' lead researcher, wrote on X, Scribe is the smartest audio understanding model released by ElevenLabs yet.
"Scribe doesn't just transcribe; it understands audio," Schneider continued in a thread. "It can detect non-verbal events (like laughter, sound effects, music and background noise) and analyze long audio contexts for accurate diarization, even in the most challenging environments."
Diarization is the process of separating speakers on a recording by their vocal qualities. ElevenLabs' documentation states that Scribe can distinguish and isolate up to 32 different speakers in the same audio file.
ElevenLabs cautions that Scribe is best used when high-accuracy transcription matters more than real-time transcription. The company also plans to introduce a low-latency version soon, expanding its use to real-time applications.

Lowest word error rates (WER)
Scribe is designed to handle real-world audio challenges with precision. According to benchmark results on FLEURS and Common Voice, it records the lowest word error rates (WER) for many languages, with reported accuracy of 98.7% for Italian and 96.7% for English.
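For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not ElevenLabs' benchmark code, the function name `word_error_rate` is hypothetical, and the lowercasing and whitespace splitting are simplifying assumptions (real benchmarks normally apply more careful text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace as a stand-in for normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

If accuracy is reported as 1 minus WER, which is an assumption about how the headline figures were derived, 96.7% accuracy would correspond to a WER of roughly 3.3%.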
Key features include:
- Speaker diarization to differentiate speakers in multi-speaker recordings.
- Word-level timestamps for detailed transcription accuracy.
- Detection of non-speech events, such as laughter and background noises.
- Structured transcript output for seamless integration via API.

Pricing and availability
Scribe is available now through the ElevenLabs website and API. Pricing is set at $0.40 per hour of input audio, with a 50% discount for the next six weeks. A low-latency version for real-time applications is also in development.

What it means for enterprises
For enterprise decision-makers, Scribe presents a tool for scalable, high-accuracy transcription, making it useful for industries relying on automated documentation, meeting transcription and content accessibility.
The model's ability to handle diverse languages with high precision also benefits multinational businesses, media companies and customer support applications.
Scribe's pricing structure makes it competitive for businesses that require high-volume transcription services, and its API-based integration allows for seamless adoption in enterprise workflows. Additionally, the upcoming low-latency version could position Scribe as a viable option for real-time communication tools.

Coming the same day as rival Hume's text-to-speech model Octave
Timing is everything, and ElevenLabs chose to launch Scribe on the same day that rival Hume AI unveiled Octave, an LLM-powered text-to-speech model that allows users to customize AI-generated voices with adjustable emotions.
Octave is designed for content creation, including audiobooks, podcasts and video game voiceovers. Unlike standard TTS systems, it considers context beyond individual sentences, adjusting tone, rhythm and cadence dynamically to sound more natural.
Hume AI positions Octave as a direct competitor to ElevenLabs' text-to-speech offerings, highlighting that Octave's pricing is about half the cost of ElevenLabs' current AI voice services.
While Scribe and Octave serve different functions, their development reflects the growing competition in AI-driven audio models. ElevenLabs is prioritizing precise, multi-language speech recognition, while Hume AI is advancing expressive AI-generated speech. For enterprises, this means more specialized solutions for both transcription and synthetic voice applications, enabling more efficient content production, customer engagement and accessibility tools.
Scribe is now live, and ElevenLabs is hosting a virtual event next week with the team behind its development. More details, benchmarks and API documentation are available in the official blog post.