Sruthi.
ASR engine
for production speech.
Two ASR engines, one API. Sruthi-T — a transformer model that's best-in-class on English long-form (13.27% WER, beating Deepgram, Google, Azure, ElevenLabs). Sruthi-S — a Samba state-space model with 3.65% average WER across LibriSpeech, GigaSpeech, and SPGISpeech, beating Whisper-large-v3 (7.44%) and CrisperWhisper (4.69%).
Pick the model that fits the workload.
Same SDK, same streaming protocol, same telephony adapters — two architectures behind it. Choose the transformer for multilingual production today; reach for Samba-ASR when the workload is English-heavy and accuracy is the only thing that matters.
Transformer · multilingual
Best-in-class on English YouTube long-form. Production-ready in English and Hindi today.
- · 13.27% WER on English long-form (best of 6 systems tested)
- · 16.50% WER on Hindi long-form (2nd of 6, behind Deepgram nova-2)
- · Streaming + batch · code-switching native
- · The default engine in Lingo and IRA deployments
Samba · state-space
Mamba-based encoder-decoder. Linear-complexity attention replacement. English research SOTA, multilingual on roadmap.
- · 3.65% average WER (LibriSpeech, GigaSpeech, SPGISpeech)
- · 1.17% on LibriSpeech clean · 1.84% on SPGISpeech
- · Beats Whisper-large-v3 (7.44%) and CrisperWhisper (4.69%)
- · arXiv 2501.02832 — Shakhadri, Kruthika, Angadi (2025)
Best in class on English long-form.
118 English audio samples, drawn from a 345-sample (~10 hours) corpus of publicly available YouTube videos. Six commercial ASR systems compared head-to-head on word error rate (WER) and character error rate (CER).
Full data table with CER →
| Model | WER % | CER % |
|---|---|---|
| SandLogic STT | 13.27 | 11.36 |
| Sarvam Saaras v3 | 13.55 | 13.41 |
| Deepgram nova-3 | 17.53 | 10.16 |
| Microsoft Azure | 21.93 | 8.45 |
| ElevenLabs Scribe v2 | 23.19 | 10.16 |
| Google Chirp 3 | 24.47 | 12.24 |
Source: llms.sandlogic.com/asr-benchmarks · 118 English samples · WER and CER measured on identical reference transcripts.
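The WER figures above are the standard word-level edit-distance metric; CER is the same computation over characters instead of words. As a reference for how the numbers are scored, here is a minimal WER implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Production scoring pipelines also normalize case and punctuation before alignment; per the note above, all six systems here were scored against identical reference transcripts.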
Top-two on Hindi long-form.
Same evaluation methodology — 227 Hindi samples drawn from publicly available YouTube videos. Sruthi-T finishes second of six; among the systems evaluated, only Deepgram nova-2 scores lower.
Full data table with CER →
| Model | WER % | CER % |
|---|---|---|
| Deepgram nova-2 | 13.80 | 7.75 |
| SandLogic STT | 16.50 | 10.95 |
| Sarvam Saaras v3 | 17.52 | 13.51 |
| ElevenLabs Scribe v2 | 19.99 | 10.22 |
| Microsoft Azure | 29.35 | 12.29 |
| Google Chirp 3 | 29.55 | 10.83 |
State-space models beat transformers on average WER.
Sruthi-S is built on the Samba-ASR architecture — a Mamba-based encoder-decoder that swaps quadratic self-attention for selective state-space recurrence. The result: linear computational complexity and lower average WER than the leading transformer ASR systems on standard English benchmarks.
Per-test-set breakdown →
| Test set | Samba-ASR WER % | Whisper-large-v3 WER % | Canary-1B WER % | Note |
|---|---|---|---|---|
| LibriSpeech clean | 1.17 | — | — | frontier |
| LibriSpeech other | 2.48 | — | — | frontier |
| GigaSpeech | 9.12 | — | — | |
| SPGISpeech | 1.84 | — | — | financial domain |
| Average WER | 3.65 | 7.44 | 4.15 | |
Source: arXiv 2501.02832 — SAMBA-ASR: State-of-the-Art Speech Recognition Leveraging Structured State-Space Models · Shakhadri, Kruthika, Angadi (SandLogic, 2025) · Trained on LibriSpeech (460h), GigaSpeech (10,000h), SPGISpeech (5,000h).
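The complexity argument is easy to see in miniature. The toy recurrence below is not the Samba/Mamba architecture — the paper's selective state-space layers use input-dependent, learned matrices over high-dimensional states — but it shows why a state-space scan is linear in sequence length: each step folds one input into a fixed-size state, with no pairwise attention scores.

```python
def ssm_scan(x: list[float], a: float, b: float, c: float) -> list[float]:
    """Toy 1-D state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    One constant-cost state update per step gives O(T) total work, versus
    the O(T^2) pairwise score matrix of transformer self-attention."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys
```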
Six things production ASR needs.
Streaming transcription
Sub-300ms first-token latency. Partial hypotheses surface as the speaker talks — built for live agent assist, IVRs, and real-time captions.
Code-switching native
Handles mid-sentence Hindi-English shifts without resetting the decoder. Trained on real Indian call-center audio, not dubbed corpora.
Long-form robustness
Best-in-class WER on noisy YouTube long-form English (13.27%) — beats Deepgram nova-3, Google Chirp 3, Microsoft Azure, ElevenLabs Scribe v2.
Diarization & overlap
Speaker separation, turn-taking labels, and overlap detection out of the box. No second-pass models, no external diarizer.
Voice biometrics
Speaker verification scoring on the same audio frame as transcription — useful for fraud detection and authentication flows.
On-prem & edge
Same binary runs on Krsna SoC, NVIDIA, AMD, Intel, ARM. Air-gapped deployment supported. No cloud round-trip.
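As a sketch of how a client might consume the streaming feed above: the message schema here ("partial"/"final" events carrying text) is an illustrative assumption, not the documented Sruthi wire format. Partial hypotheses update the live display and may be revised; finals are committed at end-of-utterance.

```python
import json

def apply_event(committed: str, event_json: str) -> tuple[str, str]:
    """Fold one streaming ASR event into the transcript.
    Assumed schema: {"type": "partial" | "final", "text": "..."}.
    Returns (committed_text, display_text)."""
    ev = json.loads(event_json)
    if ev["type"] == "final":
        # End-of-utterance: the hypothesis becomes permanent.
        committed = (committed + " " + ev["text"]).strip()
        return committed, committed
    # Partial: show it live, but don't commit -- the next event may revise it.
    return committed, (committed + " " + ev["text"]).strip()
```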
Production-ready today. Multilingual on roadmap.
Benchmarked production
English · Hindi — head-to-head benchmarks against six commercial systems on long-form audio.
Production rollout
Tamil · Telugu · Marathi · Bengali · Kannada · Malayalam · Punjabi · Gujarati — deployed in customer pilots, public benchmarks coming.
Roadmap
Remaining 14 Indic languages and 40 foreign languages are on the multilingual roadmap. Samba-ASR multilingual extension is research-track.
Hear it on your audio.
Send us a sample call from your stack and we'll return a transcript, diarization, and a head-to-head WER comparison against your incumbent within 48 hours.
No NDA needed for the first sample. We benchmark against your incumbent and report numbers — even if they're not in our favor.
Email us a sample
Drop-in replacement for your existing ASR.
REST + WebSocket
OpenAI-compatible HTTP for batch jobs. Persistent WebSocket for streaming with partial hypotheses and end-of-utterance signals.
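A minimal sketch of what "OpenAI-compatible" means for batch jobs: the path below follows the OpenAI audio API convention (`/v1/audio/transcriptions`), while the base URL, key, and model name are placeholders rather than documented Sruthi values. The request is assembled but not sent, so it can be inspected offline; in a real call, the model name and audio file go in a multipart/form-data body.

```python
import urllib.request

def build_transcription_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Assemble (but don't send) a batch transcription request against an
    OpenAI-compatible HTTP endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Because the shape matches the OpenAI convention, existing OpenAI-style client libraries can usually be pointed at the endpoint by overriding their base URL.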
Telephony adapters
Native connectors for Asterisk, FreeSWITCH, Twilio, Genesys, and Avaya. SIPREC tap supported for call-recording inspection.
Lingo & IRA bundled
Sruthi is the speech layer underneath Lingo and IRA. If you deploy either, the engine ships with them — no separate procurement.