Voice agent infrastructure
Facts from public sources. Notes from production.
A curated directory of voice models, installable skills, and agent infrastructure — structured for builders, not marketing decks.
STT, TTS, and STS models plus audio tools, voice AI, and related models from 27 labs.
159 models in total, including STT (38), TTS (92), STS (16), voice activity detection (VAD) (4), and noise cancellation (4).
Installable agent skills for AI coding tools, focused on voice workflows and integrations.
Top TTS models
Ranked by TTS Arena and Artificial Analysis benchmarks where available, then latency.
| Lab | Type | Benchmark |
|---|---|---|
| Hume | TTS | #5 · Elo 1561 |
| MiniMax | TTS | #7 · Elo 1544 |
| ElevenLabs | TTS | #8 · Elo 1539 |
| MiniMax | TTS | #9 · Elo 1535 |
| ElevenLabs | TTS | #10 · Elo 1531 |
| ElevenLabs | TTS | #11 · Elo 1528 |
| Cartesia | TTS | #13 · Elo 1513 |
| PlayHT | TTS | #23 · Elo 1405 |
| Cartesia | TTS | n/a |
| Rime | TTS | n/a |
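
The ordering rule stated for this table (benchmark rank where available, then latency) can be sketched as a two-level sort key. This is only an illustration: the directory does not say how the two benchmarks are combined, so the single `arena_rank` field, the `latency_ms` field, and all latency numbers below are hypothetical placeholders rather than directory data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSModel:
    lab: str
    arena_rank: Optional[int]    # benchmark rank; None when no benchmark is listed
    latency_ms: Optional[float]  # hypothetical field; the table shows no latency


def sort_key(m: TTSModel):
    # Benchmarked models come first (lower rank is better). Models without
    # a benchmark fall back to latency, with unknown latency sorted last.
    unranked = m.arena_rank is None
    rank = m.arena_rank if m.arena_rank is not None else 0
    latency = m.latency_ms if m.latency_ms is not None else float("inf")
    return (unranked, rank, latency)


# Ranks are taken from the table; latency values are made-up placeholders.
models = [
    TTSModel("Rime", None, 250.0),
    TTSModel("PlayHT", 23, 300.0),
    TTSModel("Cartesia", None, 120.0),
    TTSModel("Hume", 5, 400.0),
]

ordered = sorted(models, key=sort_key)
print([m.lab for m in ordered])
```

Because Python compares tuples element by element, the boolean in the first position keeps every benchmarked model ahead of the unbenchmarked ones regardless of latency.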
Browse by type
Jump to the models directory with a type filter applied.
- STT (38): converts spoken audio into text. Sometimes called "ASR" (automatic speech recognition). Examples: Deepgram, AssemblyAI.
- TTS (92): converts written text into spoken audio. Examples: ElevenLabs, Cartesia.
- STS (16): converts spoken audio directly into spoken audio, skipping the intermediate text step. Examples: OpenAI Realtime, Ultravox.
- VAD (4): software that determines when someone is speaking vs. silent. Critical for knowing when to interrupt or wait.
- Noise cancellation (4)
- Voice isolation (2)
- PII redaction (2): standalone products that detect and mask or tokenize sensitive content (PII) in transcripts, LLM payloads, or audio-adjacent streams. This is not the same as an STT vendor's "PII redaction" checkbox on transcription output.
- Translation (1): converts speech in one language to text or speech in another, often in real time.
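
To make the speaking-vs-silent detection described above (voice activity detection) concrete, here is a minimal energy-threshold sketch. It is not how any listed product works; production detectors are typically model-based and far more robust to noise. The frame size, the threshold, and the synthetic test signal are all assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(samples, frame_size=160, threshold=0.02):
    """Return one boolean per full frame: True when RMS exceeds the threshold.

    frame_size=160 is 10 ms at 16 kHz; both defaults are assumed values.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) > threshold)
    return flags


# Synthetic input: two frames of silence, then two frames of a 440 Hz tone
# sampled at 16 kHz, standing in for speech.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]

flags = detect_speech(silence + tone)
print(flags)
```

An agent would typically smooth these per-frame flags over time (for example, requiring several consecutive silent frames) before deciding that the speaker has finished and it is safe to respond.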