Speech-to-Text

Transcribe speech to text using current AI models from several providers. Sim's Speech-to-Text (STT) tools turn audio and video into accurate, timestamped, and optionally translated transcripts across many languages, with provider-dependent features such as diarization and speaker identification.

Supported Providers & Models:

OpenAI Whisper (OpenAI):
OpenAI’s Whisper is an open-source deep learning model known for its robustness across languages and audio conditions. It is available through the OpenAI API as whisper-1 and handles both transcription and translation. Backed by OpenAI, the company behind ChatGPT, Whisper is widely used in research and as a baseline for comparative evaluation.
Deepgram (Deepgram Inc.):
Based in San Francisco, Deepgram offers scalable, production-grade speech recognition APIs for developers and enterprises. Deepgram’s models include nova-3, nova-2, and whisper-large, offering real-time and batch transcription with industry-leading accuracy, multi-language support, automatic punctuation, intelligent diarization, call analytics, and features for use cases ranging from telephony to media production.
ElevenLabs (ElevenLabs):
A leader in voice AI, ElevenLabs is especially known for premium voice synthesis and recognition. Its STT product delivers high-accuracy transcription across numerous languages, dialects, and accents. Recent ElevenLabs STT models are optimized for clarity and speaker distinction, and suit both creative and accessibility scenarios.
AssemblyAI (AssemblyAI Inc.):
AssemblyAI provides API-driven, highly accurate speech recognition, with features such as auto chaptering, topic detection, summarization, sentiment analysis, and content moderation alongside transcription. Its proprietary models, including Conformer-2, power some of the largest media, call center, and compliance applications in the industry. AssemblyAI is trusted by Fortune 500 companies and leading AI startups globally.
Google Cloud Speech-to-Text (Google Cloud):
Google’s enterprise-grade Speech-to-Text API supports over 125 languages and variants, offering high accuracy and features such as real-time streaming, word-level confidence, speaker diarization, automatic punctuation, custom vocabulary, and domain-specific tuning. Models such as latest_long, video, and domain-optimized models are available, powered by Google’s years of research and deployed for global scalability.
AWS Transcribe (Amazon Web Services):
AWS Transcribe leverages Amazon’s cloud infrastructure to deliver robust speech recognition as an API. It supports multiple languages and features such as speaker identification, custom vocabulary, channel identification (for call center audio), and medical-specific transcription. Popular models include standard and domain-specific variations. AWS Transcribe is ideal for organizations already using Amazon’s cloud.

How to Choose:
Select the provider and model that fit your application: fast, enterprise-ready transcription with extra analytics (Deepgram, AssemblyAI, Google, AWS), high versatility and open-source access (OpenAI Whisper), or advanced speaker and contextual understanding (ElevenLabs). Consider pricing, language coverage, accuracy, and any special features you need, such as summarization, chaptering, or sentiment analysis.
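
As a rough sketch of this selection step, the snippet below filters providers by the optional features each one exposes in the parameter tables on this page. The `FEATURES` mapping and the `providers_supporting` helper are illustrative names, not part of Sim's API, and the mapping covers only the five tools that have a parameter table here.

```python
# Optional, provider-specific parameters transcribed from the tables on this page.
# The mapping and helper names are hypothetical, for illustration only.
FEATURES = {
    "whisper": {"translateToEnglish", "prompt", "temperature", "responseFormat"},
    "deepgram": {"diarization"},
    "elevenlabs": set(),
    "assemblyai": {"diarization", "sentiment", "entityDetection",
                   "piiRedaction", "summarization"},
    "gemini": set(),
}


def providers_supporting(*required: str) -> list[str]:
    """Return providers whose documented parameters include every required feature."""
    needed = set(required)
    return sorted(name for name, feats in FEATURES.items() if needed <= feats)


print(providers_supporting("diarization"))               # ['assemblyai', 'deepgram']
print(providers_supporting("diarization", "sentiment"))  # ['assemblyai']
```

A lookup like this only narrows the field by feature; pricing, accuracy, and language coverage still need to be checked with each provider.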

For more details on capabilities, pricing, feature highlights, and fine-tuning options, refer to each provider’s official documentation.

`stt_whisper`

Input

Parameter	Type	Required	Description
`provider`	string	Yes	STT provider (whisper)
`apiKey`	string	Yes	OpenAI API key
`model`	string	No	Whisper model to use (default: whisper-1)
`audioFile`	file	No	Audio or video file to transcribe (e.g., MP3, WAV, M4A, WEBM)
`audioFileReference`	file	No	Reference to audio/video file from previous blocks
`audioUrl`	string	No	URL to audio or video file
`language`	string	No	Language code (e.g., "en", "es", "fr") or "auto" for auto-detection
`timestamps`	string	No	Timestamp granularity: none, sentence, or word
`translateToEnglish`	boolean	No	Translate audio to English
`prompt`	string	No	Optional text to guide the model's style or continue a previous audio segment. Helps with proper nouns and context.
`temperature`	number	No	Sampling temperature between 0 and 1. Higher values make output more random, lower values more focused and deterministic.
`responseFormat`	string	No	Output format for the transcription (e.g., "json", "text", "srt", "verbose_json", "vtt")
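
To make the constraints in this table concrete, here is a minimal sketch that assembles a Whisper input and checks the two documented value ranges (`temperature` between 0 and 1, `timestamps` one of none, sentence, or word). The function name `build_whisper_params` is hypothetical, and the idea that omitted optional fields fall back to their defaults is an assumption.

```python
ALLOWED_TIMESTAMPS = {"none", "sentence", "word"}  # values listed in the table


def build_whisper_params(api_key, audio_url, *, temperature=None,
                         timestamps=None, **extra):
    """Hypothetical helper: return a Whisper parameter dict, or raise ValueError."""
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if timestamps is not None and timestamps not in ALLOWED_TIMESTAMPS:
        raise ValueError(f"timestamps must be one of {sorted(ALLOWED_TIMESTAMPS)}")

    params = {"provider": "whisper", "apiKey": api_key, "audioUrl": audio_url,
              "temperature": temperature, "timestamps": timestamps, **extra}
    # Drop unset optional fields (assumption: defaults such as model
    # "whisper-1" then apply on the receiving side).
    return {key: value for key, value in params.items() if value is not None}


params = build_whisper_params("YOUR_OPENAI_KEY", "https://example.com/meeting.mp3",
                              temperature=0.2, timestamps="word", language="en")
print(params)
```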

`stt_deepgram`

Input

Parameter	Type	Required	Description
`provider`	string	Yes	STT provider (deepgram)
`apiKey`	string	Yes	Deepgram API key
`model`	string	No	Deepgram model to use (nova-3, nova-2, whisper-large, etc.)
`audioFile`	file	No	Audio or video file to transcribe (e.g., MP3, WAV, M4A, WEBM)
`audioFileReference`	file	No	Reference to audio/video file from previous blocks
`audioUrl`	string	No	URL to audio or video file
`language`	string	No	Language code (e.g., "en", "es", "fr") or "auto" for auto-detection
`timestamps`	string	No	Timestamp granularity: none, sentence, or word
`diarization`	boolean	No	Enable speaker diarization
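
One way to keep a Deepgram call in step with this table is to mirror it as a typed record. The sketch below does that with a dataclass; the `DeepgramInput` class is hypothetical, covers only the URL-based audio input, and the key and URL are placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeepgramInput:
    """Hypothetical mirror of the Deepgram parameter table (URL input only)."""
    apiKey: str
    audioUrl: Optional[str] = None
    model: Optional[str] = None
    language: Optional[str] = None
    timestamps: Optional[str] = None
    diarization: Optional[bool] = None
    provider: str = "deepgram"

    def to_params(self) -> dict:
        # Keep only the fields that were explicitly set.
        return {k: v for k, v in asdict(self).items() if v is not None}


call = DeepgramInput(apiKey="YOUR_DEEPGRAM_KEY",
                     audioUrl="https://example.com/call.wav",
                     model="nova-3", diarization=True)
print(call.to_params())
```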

`stt_elevenlabs`

Input

Parameter	Type	Required	Description
`provider`	string	Yes	STT provider (elevenlabs)
`apiKey`	string	Yes	ElevenLabs API key
`model`	string	No	ElevenLabs model to use (scribe_v1, scribe_v1_experimental)
`audioFile`	file	No	Audio or video file to transcribe (e.g., MP3, WAV, M4A, WEBM)
`audioFileReference`	file	No	Reference to audio/video file from previous blocks
`audioUrl`	string	No	URL to audio or video file
`language`	string	No	Language code (e.g., "en", "es", "fr") or "auto" for auto-detection
`timestamps`	string	No	Timestamp granularity: none, sentence, or word
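
Because this table names only two models, a caller might guard against typos with an allow-list. The following sketch assumes a hypothetical `elevenlabs_params` helper; the choice of `scribe_v1` as its default is an assumption, since the table does not state one.

```python
ELEVENLABS_MODELS = ("scribe_v1", "scribe_v1_experimental")  # from the table above


def elevenlabs_params(api_key: str, audio_url: str, model: str = "scribe_v1") -> dict:
    """Hypothetical helper; the default model is an assumption, not documented."""
    if model not in ELEVENLABS_MODELS:
        raise ValueError(f"unknown ElevenLabs model: {model!r}")
    return {"provider": "elevenlabs", "apiKey": api_key,
            "audioUrl": audio_url, "model": model}


print(elevenlabs_params("YOUR_ELEVENLABS_KEY", "https://example.com/talk.m4a"))
```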

`stt_assemblyai`

Input

Parameter	Type	Required	Description
`provider`	string	Yes	STT provider (assemblyai)
`apiKey`	string	Yes	AssemblyAI API key
`model`	string	No	AssemblyAI model to use (default: best)
`audioFile`	file	No	Audio or video file to transcribe (e.g., MP3, WAV, M4A, WEBM)
`audioFileReference`	file	No	Reference to audio/video file from previous blocks
`audioUrl`	string	No	URL to audio or video file
`language`	string	No	Language code (e.g., "en", "es", "fr") or "auto" for auto-detection
`timestamps`	string	No	Timestamp granularity: none, sentence, or word
`diarization`	boolean	No	Enable speaker diarization
`sentiment`	boolean	No	Enable sentiment analysis
`entityDetection`	boolean	No	Enable entity detection
`piiRedaction`	boolean	No	Enable PII redaction
`summarization`	boolean	No	Enable automatic summarization
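
The five boolean options in this table can be toggled from a list of feature names. The sketch below shows one way to do that; `assemblyai_params` is a hypothetical helper, and the key and URL are placeholders.

```python
# Boolean analysis options listed in the AssemblyAI table.
ASSEMBLYAI_FLAGS = {"diarization", "sentiment", "entityDetection",
                    "piiRedaction", "summarization"}


def assemblyai_params(api_key: str, audio_url: str, enable=()) -> dict:
    """Hypothetical helper that turns feature names into boolean flags."""
    unknown = set(enable) - ASSEMBLYAI_FLAGS
    if unknown:
        raise ValueError(f"unsupported features: {sorted(unknown)}")
    params = {"provider": "assemblyai", "apiKey": api_key, "audioUrl": audio_url}
    params.update({flag: True for flag in enable})
    return params


print(assemblyai_params("YOUR_ASSEMBLYAI_KEY", "https://example.com/interview.mp3",
                        enable=["diarization", "summarization"]))
```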

`stt_gemini`

Input

Parameter	Type	Required	Description
`provider`	string	Yes	STT provider (gemini)
`apiKey`	string	Yes	Google API key
`model`	string	No	Gemini model to use (default: gemini-2.5-flash)
`audioFile`	file	No	Audio or video file to transcribe (e.g., MP3, WAV, M4A, WEBM)
`audioFileReference`	file	No	Reference to audio/video file from previous blocks
`audioUrl`	string	No	URL to audio or video file
`language`	string	No	Language code (e.g., "en", "es", "fr") or "auto" for auto-detection
`timestamps`	string	No	Timestamp granularity: none, sentence, or word
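
All five tables list the same three audio inputs (`audioFile`, `audioFileReference`, `audioUrl`), each marked optional. The sketch below checks that a parameter set names exactly one of them before a call is made. That "exactly one" rule is an assumption, as the tables do not say how the three interact, and `pick_audio_source` is a hypothetical helper.

```python
AUDIO_KEYS = ("audioFile", "audioFileReference", "audioUrl")


def pick_audio_source(params: dict) -> str:
    """Return the name of the single audio input that is set.

    Assumption: exactly one source should be supplied; the tables only
    mark each one as individually optional.
    """
    present = [key for key in AUDIO_KEYS if params.get(key)]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {AUDIO_KEYS}, got {present}")
    return present[0]


gemini_call = {"provider": "gemini", "apiKey": "YOUR_GOOGLE_KEY",
               "audioUrl": "https://example.com/lecture.webm"}
print(pick_audio_source(gemini_call))  # audioUrl
```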

Output

This tool does not produce any outputs.

Tools

`stt_whisper`, `stt_deepgram`, `stt_elevenlabs`, `stt_assemblyai`, `stt_gemini` (one per parameter table above, in that order)