Sim

Speech-to-Text

Convert speech to text using AI

Transcribe speech to text using the latest AI models from world-class providers. Sim's Speech-to-Text (STT) tools turn audio and video into accurate, timestamped, and optionally translated transcripts, with support for a wide range of languages and advanced features such as diarization and speaker identification.

Supported Providers & Models:

  • OpenAI Whisper (OpenAI):
    OpenAI’s Whisper is an open-source deep learning model renowned for its robustness across languages and audio conditions. Through the OpenAI API it is available as the whisper-1 model, which handles both transcription and translation to English and generalizes well to varied audio. Backed by OpenAI—the company known for ChatGPT and leading AI research—Whisper is widely used in research and as a baseline for comparative evaluation.

  • Deepgram (Deepgram Inc.):
    Based in San Francisco, Deepgram offers scalable, production-grade speech recognition APIs for developers and enterprises. Deepgram’s models include nova-3, nova-2, and whisper-large, offering real-time and batch transcription with industry-leading accuracy, multi-language support, automatic punctuation, intelligent diarization, call analytics, and features for use cases ranging from telephony to media production.

  • ElevenLabs (ElevenLabs):
    A leader in voice AI, ElevenLabs is especially known for premium voice synthesis and recognition. Its STT product delivers high-accuracy, natural understanding of numerous languages, dialects, and accents. Recent ElevenLabs STT models are optimized for clarity and speaker distinction, and are suitable for both creative and accessibility scenarios. ElevenLabs is recognized for cutting-edge advancements in AI-powered speech technologies.

  • AssemblyAI (AssemblyAI Inc.):
    AssemblyAI provides API-driven, highly accurate speech recognition, with features such as auto chaptering, topic detection, summarization, sentiment analysis, and content moderation alongside transcription. Its proprietary models, including the acclaimed Conformer-2, power some of the largest media, call center, and compliance applications in the industry. AssemblyAI is trusted by Fortune 500s and leading AI startups globally.

  • Google Cloud Speech-to-Text (Google Cloud):
    Google’s enterprise-grade Speech-to-Text API supports over 125 languages and variants, offering high accuracy and features such as real-time streaming, word-level confidence, speaker diarization, automatic punctuation, custom vocabulary, and domain-specific tuning. Models such as latest_long, video, and domain-optimized models are available, powered by Google’s years of research and deployed for global scalability.

  • AWS Transcribe (Amazon Web Services):
    AWS Transcribe leverages Amazon’s cloud infrastructure to deliver robust speech recognition as an API. It supports multiple languages and features such as speaker identification, custom vocabulary, channel identification (for call center audio), and medical-specific transcription. Popular models include standard and domain-specific variations. AWS Transcribe is ideal for organizations already using Amazon’s cloud.

How to Choose:
Select the provider and model that fits your application—whether you need fast, enterprise-ready transcription with extra analytics (Deepgram, AssemblyAI, Google, AWS), high versatility and open-source access (OpenAI Whisper), or advanced speaker and contextual understanding (ElevenLabs). Consider pricing, language coverage, accuracy, and any special features you might need, such as summarization, chaptering, or sentiment analysis.

For more details on capabilities, pricing, feature highlights, and fine-tuning options, refer to each provider’s official documentation.

Usage Instructions

Transcribe audio and video files to text using leading AI providers. Supports multiple languages, timestamps, and speaker diarization.

Tools

stt_whisper

Transcribe audio to text using OpenAI Whisper

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (whisper) |
| apiKey | string | Yes | OpenAI API key |
| model | string | No | Whisper model to use (default: whisper-1) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
| translateToEnglish | boolean | No | Translate audio to English |
| prompt | string | No | Optional text to guide the model's style or continue a previous audio segment; helps with proper nouns and context |
| temperature | number | No | Sampling temperature between 0 and 1; higher values make output more random, lower values more focused and deterministic |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
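As a sketch of how the stt_whisper inputs could map onto OpenAI's audio API (the `/v1/audio/transcriptions` and `/v1/audio/translations` endpoints and the `verbose_json` response format are OpenAI's; the helper itself is illustrative, not Sim's actual implementation):

```python
# Sketch: map stt_whisper tool parameters onto an OpenAI audio API request.
# verbose_json responses include the text, segments, language, and duration
# fields shown in the Output table above. The helper only builds the request;
# sending it requires a real API key and audio file.
from typing import Optional

OPENAI_STT_URL = "https://api.openai.com/v1/audio/transcriptions"
OPENAI_TRANSLATE_URL = "https://api.openai.com/v1/audio/translations"

def build_whisper_request(api_key: str,
                          model: str = "whisper-1",
                          language: Optional[str] = None,
                          prompt: Optional[str] = None,
                          temperature: Optional[float] = None,
                          translate_to_english: bool = False):
    """Return (url, headers, form_data) for a Whisper transcription call."""
    # translateToEnglish switches to the translations endpoint
    url = OPENAI_TRANSLATE_URL if translate_to_english else OPENAI_STT_URL
    headers = {"Authorization": f"Bearer {api_key}"}
    data = {"model": model, "response_format": "verbose_json"}
    if language and language != "auto":   # omit the field for auto-detection
        data["language"] = language
    if prompt:
        data["prompt"] = prompt
    if temperature is not None:
        data["temperature"] = str(temperature)
    return url, headers, data

# Usage (the audio itself goes in the multipart "file" field):
# requests.post(url, headers=headers, data=data,
#               files={"file": open("meeting.mp3", "rb")})
```

The returned transcript's `segments` list is what the timestamps option exposes; with `language="auto"` the language field is simply omitted so the model detects it.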

stt_deepgram

Transcribe audio to text using Deepgram

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (deepgram) |
| apiKey | string | Yes | Deepgram API key |
| model | string | No | Deepgram model to use (nova-3, nova-2, whisper-large, etc.) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
| diarization | boolean | No | Enable speaker diarization |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments with speaker labels |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
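A comparable sketch for stt_deepgram, using the hosted-audio variant of Deepgram's pre-recorded `/v1/listen` endpoint. The `model`, `language`, `diarize`, `detect_language`, and `punctuate` query parameters are Deepgram's; the helper is illustrative only and builds the request without sending it:

```python
# Sketch: map stt_deepgram tool parameters onto a Deepgram /v1/listen request.
# The audioUrl input corresponds to the JSON {"url": ...} body that Deepgram's
# pre-recorded API accepts for hosted audio.
from typing import Optional

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def build_deepgram_request(api_key: str,
                           audio_url: str,
                           model: str = "nova-2",
                           language: Optional[str] = None,
                           diarize: bool = False):
    """Return (url, headers, query_params, json_body) for a Deepgram call."""
    params = {"model": model, "punctuate": "true"}
    if language and language != "auto":
        params["language"] = language
    else:
        params["detect_language"] = "true"  # "auto": let Deepgram detect it
    if diarize:
        params["diarize"] = "true"          # adds speaker labels to the words
    headers = {"Authorization": f"Token {api_key}",
               "Content-Type": "application/json"}
    body = {"url": audio_url}
    return DEEPGRAM_URL, headers, params, body

# Usage:
# requests.post(url, headers=headers, params=params, json=body)
```

With `diarize` enabled, each word in the response carries a speaker index, which is what populates the speaker labels in the segments output above.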

stt_elevenlabs

Transcribe audio to text using ElevenLabs

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (elevenlabs) |
| apiKey | string | Yes | ElevenLabs API key |
| model | string | No | ElevenLabs model to use (scribe_v1, scribe_v1_experimental) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
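The same pattern for stt_elevenlabs. Note the endpoint path, the `xi-api-key` header, and the `model_id`/`language_code` field names here are assumptions based on ElevenLabs' public speech-to-text API, not Sim internals; treat this as an illustrative sketch:

```python
# Sketch (assumed field names): map stt_elevenlabs tool parameters onto an
# ElevenLabs speech-to-text request. The audio file would be attached as a
# multipart "file" field when the request is actually sent.
from typing import Optional

ELEVENLABS_STT_URL = "https://api.elevenlabs.io/v1/speech-to-text"

def build_elevenlabs_request(api_key: str,
                             model: str = "scribe_v1",
                             language: Optional[str] = None):
    """Return (url, headers, form_data) for an ElevenLabs STT call."""
    headers = {"xi-api-key": api_key}      # ElevenLabs uses a header API key
    data = {"model_id": model}
    if language and language != "auto":    # assumed parameter name
        data["language_code"] = language
    return ELEVENLABS_STT_URL, headers, data

# Usage:
# requests.post(url, headers=headers, data=data,
#               files={"file": open("interview.wav", "rb")})
```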

stt_assemblyai

Transcribe audio to text using AssemblyAI with advanced NLP features

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (assemblyai) |
| apiKey | string | Yes | AssemblyAI API key |
| model | string | No | AssemblyAI model to use (default: best) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
| diarization | boolean | No | Enable speaker diarization |
| sentiment | boolean | No | Enable sentiment analysis |
| entityDetection | boolean | No | Enable entity detection |
| piiRedaction | boolean | No | Enable PII redaction |
| summarization | boolean | No | Enable automatic summarization |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments with speaker labels |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
| sentiment | array | Sentiment analysis results |
| entities | array | Detected entities |
| summary | string | Auto-generated summary |
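AssemblyAI works asynchronously: you submit a job against a hosted audio URL, then poll for the result. The sketch below maps the tool's feature flags onto AssemblyAI's transcript request body (the `speaker_labels`, `sentiment_analysis`, `entity_detection`, `redact_pii`, and `summarization` field names are AssemblyAI's; summarization may additionally require summary settings, and this helper is illustrative, not Sim's implementation):

```python
# Sketch: map stt_assemblyai feature flags onto an AssemblyAI transcript
# request body. Only enabled features are included, matching the tool's
# optional boolean inputs above.
ASSEMBLYAI_URL = "https://api.assemblyai.com/v2/transcript"

def build_assemblyai_request(audio_url: str,
                             diarization: bool = False,
                             sentiment: bool = False,
                             entity_detection: bool = False,
                             pii_redaction: bool = False,
                             summarization: bool = False):
    """Return (url, json_body) for an AssemblyAI transcript job."""
    body = {"audio_url": audio_url}
    if diarization:
        body["speaker_labels"] = True     # per-utterance speaker labels
    if sentiment:
        body["sentiment_analysis"] = True
    if entity_detection:
        body["entity_detection"] = True
    if pii_redaction:
        body["redact_pii"] = True
    if summarization:
        body["summarization"] = True
    return ASSEMBLYAI_URL, body

# Usage (header "authorization: <API key>"), then poll
# GET /v2/transcript/{id} until status is "completed":
# requests.post(url, headers={"authorization": api_key}, json=body)
```

The sentiment, entities, and summary outputs in the table above correspond to the extra response sections these flags unlock.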

stt_gemini

Transcribe audio to text using Google Gemini with multimodal capabilities

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (gemini) |
| apiKey | string | Yes | Google API key |
| model | string | No | Gemini model to use (default: gemini-2.5-flash) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
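Gemini is a multimodal model rather than a dedicated STT service, so transcription is a generate-content call that combines the audio with a text instruction. The sketch below builds such a request against the Gemini REST API's `generateContent` method with inline base64 audio; the exact transcription prompt and the language hint are illustrative choices, not Sim's actual prompt:

```python
# Sketch: map stt_gemini tool parameters onto a Gemini generateContent
# request. The audio is inlined as base64 alongside a transcription
# instruction; the instruction wording here is a hypothetical example.
import base64
from typing import Optional

GEMINI_BASE = "https://generativelanguage.googleapis.com/v1beta/models"

def build_gemini_request(api_key: str,
                         audio_bytes: bytes,
                         mime_type: str = "audio/mp3",
                         model: str = "gemini-2.5-flash",
                         language: Optional[str] = None):
    """Return (url, json_body) for a Gemini transcription call."""
    url = f"{GEMINI_BASE}/{model}:generateContent?key={api_key}"
    instruction = "Transcribe this audio verbatim."
    if language and language != "auto":
        instruction += f" The audio is in '{language}'."  # optional hint
    body = {"contents": [{"parts": [
        {"inline_data": {"mime_type": mime_type,
                         "data": base64.b64encode(audio_bytes).decode()}},
        {"text": instruction},
    ]}]}
    return url, body

# Usage:
# requests.post(url, json=body)
```

Because the transcript comes back as generated text rather than a structured STT payload, fields like segments and confidence have to be requested in the instruction and parsed from the model's response.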

Notes

  • Category: tools
  • Type: stt