
On Thursday, Sarvam AI, a young company working in artificial intelligence, announced the release of Bulbul V3, a text-to-speech system intended to produce voices that are clear, expressive, and usable at scale in Indian languages. The claim was modestly phrased, but the response it drew was anything but quiet.
Approval followed swiftly from several figures in the AI world. The most notable came from Deedy Das, a partner at Menlo Ventures, the investment firm that has supported companies like Anthropic. Das publicly reversed an earlier judgment of Sarvam, saying that he had been “wrong.” He added that the startup now appears to offer the strongest suite of tools for Indic languages across text-to-speech, speech-to-text, and optical character recognition, describing the achievement as “really valuable.”
Inside Bulbul V3’s Development and Evaluation
Writing on X, Sarvam co-founder Pratyush Kumar said that Bulbul V3 represents the fifth release in a planned sequence of fourteen. According to him, “In an independent third-party human listening study, Bulbul V3 delivers the highest listener preference and low error rates across use cases and languages.”
In a follow-up thread, Kumar explained that the evaluation took the form of a blind listening test carried out by the independent research group Josh Talks AI. Participants were asked to judge Bulbul V3 against established alternatives, including different versions of ElevenLabs and Cartesia’s Sonic-3 system.
The study gathered more than 20,000 individual responses. Based on these results, Bulbul V3 ranked highest for 8 kHz audio, a result Kumar described as setting a new standard for speech synthesis in voice-based AI agents.
Performance Gains That Matter Beyond the Lab
In a blog post accompanying the release, Sarvam said that Bulbul V3 marks a clear advance in three areas that determine whether a speech system works outside the lab.
- First is naturalness. According to the company, the model earns strong listener preference at full-band 48 kHz audio and is the most preferred option for 8 kHz telephony audio, ahead of rival systems.
- Second is robustness. Bulbul V3, Sarvam claims, keeps character error rates low even when faced with difficult material, such as mixed languages, numbers, and other irregular inputs.
- Third is stability. In long or high-volume use, the model reportedly produces fewer skipped words and fewer mispronunciations than its competitors.
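For readers unfamiliar with the robustness metric mentioned above, character error rate is commonly computed as the edit distance between a reference text and a transcript of the audio, divided by the length of the reference. The sketch below illustrates that general definition only. Sarvam has not published its exact procedure in the material quoted here, so the function names and the sample strings are assumptions made for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the transcript drops the final digit of a train number.
# Note: Python strings count Unicode code points, so for Indic scripts a real
# evaluation might prefer to compare grapheme clusters instead.
cer = character_error_rate("train 12951", "train 1295")
print(f"CER: {cer:.3f}")
```

A lower value means the spoken output stayed closer to the intended text, which is why inputs such as numbers and mixed-language sentences are a useful stress test.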
The company said the evaluation was designed to mirror both ideal and ordinary conditions. Two settings were used: general full-band audio, reflecting studio-quality output, and 8 kHz telephony-grade audio, reflecting everyday use. For each language, between 50 and 70 annotators took part, generating roughly 2,000 judgments per language. In total, more than 500 annotators contributed to the study.
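To make the idea of "listener preference" concrete, the following is a minimal sketch of how blind-test judgments could be tallied per language. It assumes each judgment simply records which system a listener preferred. The article does not describe the study's actual data format or scoring, so the function name, the system labels, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def preference_shares(judgments):
    """Return, for each language, the fraction of judgments won by each system.

    `judgments` is an iterable of (language, preferred_system) pairs.
    """
    counts = defaultdict(Counter)
    for language, preferred_system in judgments:
        counts[language][preferred_system] += 1

    shares = {}
    for language, tally in counts.items():
        total = sum(tally.values())
        shares[language] = {system: n / total for system, n in tally.items()}
    return shares


# Hypothetical toy data, not results from the study.
sample = [
    ("hi", "system_a"),
    ("hi", "system_a"),
    ("hi", "system_b"),
    ("ta", "system_b"),
]

for language, result in preference_shares(sample).items():
    print(language, result)
```

With roughly 2,000 judgments per language, shares computed this way would rest on far more data than the toy sample, though a real analysis would also need to account for ties and annotator agreement.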
In the same post, Kumar noted that listeners were also asked to flag outright failure cases so that stability could be assessed more directly. On this measure, he said, Bulbul V3 recorded the lowest average error rates.
The evaluation, Kumar added, also examined what he described as the “long tail” of language problems, such as reading out numbers, handling technical material, and correctly rendering named entities. Across these cases and across languages, he said, Bulbul V3 consistently showed the fewest errors.
Expanding Voice Options and Future Language Support
Along with the release of the model, Sarvam introduced a new library of voices. It contains more than thirty studio-grade voices spanning eleven Indian languages, each recorded by trained voice artists rather than generated synthetically. The company says this approach gives the voices greater weight and clarity, and allows them to carry emotion more convincingly, particularly in longer stretches of audio.
Sarvam added that this collection will not remain fixed. Support, it said, is expected to extend to twenty-two Indian languages in the near future. The model also includes a voice-cloning feature, which allows users to create custom voices without losing a natural sound. According to the company, this “enables brand-specific voices, consistent character identities, and personalised experiences at scale.”
Final Words
A domestic AI company has managed to get machines to speak Indian languages more fluently than the large foreign corporations do, and suddenly everybody is a convert. The real test, however, will come not in controlled studies but in the wild: when your grandmother’s telephone banking service stops butchering her name, or when the train station announcer finally gets the name of your hometown right. Bulbul V3 is only the fifth of Sarvam AI’s fourteen planned releases. If the company stays on this course, by the fourteenth release these artificial voices may well be indistinguishable from the prattle of your gossipy neighbor.





