Sim

Speech-to-Text

Convert speech to text using AI

Transcribe speech to text using state-of-the-art AI models from leading providers. The Sim Speech-to-Text (STT) tools allow you to convert audio and video files into accurate transcripts, supporting multiple languages, timestamps, and optional translation.

Supported providers:

  • OpenAI Whisper: Advanced open-source STT model from OpenAI. Supports models such as whisper-1 and handles a wide variety of languages and audio formats.
  • Deepgram: Real-time and batch STT API with deep learning models like nova-3, nova-2, and whisper-large. Offers features like diarization, intent recognition, and industry-specific tuning.
  • ElevenLabs: Known for high-quality speech AI, ElevenLabs provides STT models focused on accuracy and natural language understanding for numerous languages and dialects.

Choose the provider and model best suited to your task: Deepgram for fast, production-grade transcription, Whisper for highly accurate multi-language capability, or ElevenLabs for advanced understanding and language coverage.

Usage Instructions

Transcribe audio and video files to text using leading AI providers. All three tools support multiple languages and timestamps; speaker diarization is available with Deepgram, and translation to English with Whisper.

Tools

stt_whisper

Transcribe audio to text using OpenAI Whisper

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (whisper) |
| apiKey | string | Yes | OpenAI API key |
| model | string | No | Whisper model to use (default: whisper-1) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
| translateToEnglish | boolean | No | Translate audio to English |
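
The parameters above can be assembled as a plain dictionary. The following is a minimal sketch, not part of Sim: `build_whisper_input` is a hypothetical helper name, it covers only the audioUrl case, and the documentation does not say how the tool itself is invoked. The parameter names and allowed timestamp values are taken from the table.

```python
from typing import Any

# Timestamp granularities listed in the input table.
TIMESTAMP_OPTIONS = {"none", "sentence", "word"}


def build_whisper_input(
    api_key: str,
    audio_url: str,
    language: str = "auto",
    timestamps: str = "none",
    translate_to_english: bool = False,
    model: str = "whisper-1",
) -> dict[str, Any]:
    """Assemble stt_whisper parameters (hypothetical helper, not a Sim API)."""
    if timestamps not in TIMESTAMP_OPTIONS:
        raise ValueError(f"timestamps must be one of {sorted(TIMESTAMP_OPTIONS)}")
    if not audio_url:
        # This sketch only handles the audioUrl source.
        raise ValueError("audio_url is required in this sketch")
    return {
        "provider": "whisper",
        "apiKey": api_key,
        "model": model,
        "audioUrl": audio_url,
        "language": language,
        "timestamps": timestamps,
        "translateToEnglish": translate_to_english,
    }


params = build_whisper_input(
    api_key="YOUR_OPENAI_API_KEY",  # placeholder, not a real key
    audio_url="https://example.com/meeting.mp3",  # placeholder URL
    timestamps="sentence",
)
```

Validating timestamps before sending the request catches typos such as "words" early instead of relying on the provider to reject them.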

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
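
As an illustration of consuming these fields, the sketch below mirrors the five documented outputs in a dataclass and derives a rough speaking rate from transcript and duration. The class name `SttResult` and the function `words_per_minute` are hypothetical, and the shape of each entry in segments is not documented, so it is left untyped.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SttResult:
    """Mirrors the five documented output fields (hypothetical container)."""

    transcript: str
    language: str
    duration: float  # seconds
    confidence: float
    segments: list[Any] = field(default_factory=list)  # element shape not documented


def words_per_minute(result: SttResult) -> float:
    """Rough speaking rate: whitespace-separated words per minute of audio."""
    if result.duration <= 0:
        return 0.0
    return len(result.transcript.split()) * 60.0 / result.duration


result = SttResult(
    transcript="hello world this is a test",
    language="en",
    duration=12.0,
    confidence=0.9,
)
print(words_per_minute(result))  # 6 words over 12 seconds -> 30.0
```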

stt_deepgram

Transcribe audio to text using Deepgram

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (deepgram) |
| apiKey | string | Yes | Deepgram API key |
| model | string | No | Deepgram model to use (nova-3, nova-2, whisper-large, etc.) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
| diarization | boolean | No | Enable speaker diarization |

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments with speaker labels |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
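
With diarization enabled, a common post-processing step is to merge consecutive segments from the same speaker into conversational turns. The sketch below assumes each segment is a dictionary with "speaker" and "text" keys. That shape is an assumption for illustration only; the output table does not specify the segment fields, so check them against a real response. `merge_speaker_turns` is a hypothetical helper.

```python
from typing import Any


def merge_speaker_turns(segments: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Merge consecutive same-speaker segments into (speaker, text) turns.

    Assumes "speaker" and "text" keys on each segment (not confirmed by the docs).
    """
    turns: list[tuple[str, str]] = []
    for seg in segments:
        speaker = str(seg.get("speaker", "unknown"))
        text = str(seg.get("text", "")).strip()
        if not text:
            continue  # skip empty segments
        if turns and turns[-1][0] == speaker:
            # Same speaker kept talking: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


example = [
    {"speaker": 0, "text": "Hi."},
    {"speaker": 0, "text": "How are you?"},
    {"speaker": 1, "text": "Fine."},
]
for speaker, text in merge_speaker_turns(example):
    print(f"Speaker {speaker}: {text}")
```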

stt_elevenlabs

Transcribe audio to text using ElevenLabs

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | STT provider (elevenlabs) |
| apiKey | string | Yes | ElevenLabs API key |
| model | string | No | ElevenLabs model to use (scribe_v1, scribe_v1_experimental) |
| audioFile | file | No | Audio or video file to transcribe |
| audioFileReference | file | No | Reference to audio/video file from previous blocks |
| audioUrl | string | No | URL to audio or video file |
| language | string | No | Language code (e.g., "en", "es", "fr") or "auto" for auto-detection |
| timestamps | string | No | Timestamp granularity: none, sentence, or word |
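
The three input tables share most parameters but differ in two: translateToEnglish appears only for stt_whisper and diarization only for stt_deepgram. A small check like the one below can flag options a given tool does not list. The parameter sets are copied from the tables; the helper `unsupported_params` is hypothetical, and how Sim treats an unlisted parameter is not documented.

```python
# Parameters common to all three tools, per the input tables.
COMMON_PARAMS = {
    "provider", "apiKey", "model", "audioFile",
    "audioFileReference", "audioUrl", "language", "timestamps",
}

# Tool-specific additions, per the input tables.
TOOL_PARAMS = {
    "stt_whisper": COMMON_PARAMS | {"translateToEnglish"},
    "stt_deepgram": COMMON_PARAMS | {"diarization"},
    "stt_elevenlabs": COMMON_PARAMS,
}


def unsupported_params(tool: str, params: dict) -> list[str]:
    """Return the parameter names that the given tool's table does not list."""
    return sorted(set(params) - TOOL_PARAMS[tool])


print(unsupported_params("stt_elevenlabs", {"apiKey": "k", "diarization": True}))
# ['diarization']
```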

Output

| Parameter | Type | Description |
| --- | --- | --- |
| transcript | string | Full transcribed text |
| segments | array | Timestamped segments |
| language | string | Detected or specified language |
| duration | number | Audio duration in seconds |
| confidence | number | Overall confidence score |
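
Timestamped segments lend themselves to subtitle export. The sketch below converts segments to SubRip (SRT) text, assuming each segment is a dictionary with "start" and "end" times in seconds plus a "text" string. Those field names are assumptions for illustration, since the output table does not define the segment shape. Both function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from segments with assumed "start", "end", "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


print(segments_to_srt([
    {"start": 0.0, "end": 1.5, "text": "Hello"},
    {"start": 1.5, "end": 3.0, "text": "world"},
]))
```

Sentence-level timestamps are usually the better fit for subtitles; word-level segments would produce one cue per word.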

Notes

  • Category: tools
  • Type: stt