Sim

Text-to-Speech

Convert text to speech using AI voices

Sim's Text-to-Speech (TTS) tools convert written text into natural-sounding audio in dozens of languages, with a choice of expressive voices, output formats, and controls such as speed, style, and emotion.

Supported Providers & Models:

  • OpenAI Text-to-Speech (OpenAI):
    OpenAI's TTS API offers ultra-realistic voices through models such as tts-1, tts-1-hd, and gpt-4o-mini-tts. Both male and female voices are available, including alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, and verse. It supports multiple audio formats (mp3, opus, aac, flac, wav, pcm), adjustable speed, and streaming synthesis.

  • Deepgram Aura (Deepgram Inc.):
    Deepgram’s Aura provides expressive English and multilingual AI voices optimized for conversational clarity, low latency, and customization. Available models include aura-asteria-en and aura-luna-en, among others. It supports multiple encoding formats (linear16, mp3, opus, aac, flac) and fine-grained control over speed, sample rate, and style.

  • ElevenLabs Text-to-Speech (ElevenLabs):
    ElevenLabs leads in lifelike, emotionally rich TTS, offering dozens of voices in 29+ languages and the ability to clone custom voices. Models support voice design, speech synthesis, and direct API access, with advanced controls for style, emotion, stability, and similarity. Suitable for audiobooks, content creation, accessibility, and more.

  • Cartesia TTS (Cartesia):
    Cartesia offers high-quality, fast, and secure text-to-speech with a focus on privacy and flexible deployment. It provides instant streaming, real-time synthesis, and supports multiple international voices and accents, accessible through a simple API.

  • Google Cloud Text-to-Speech (Google Cloud):
    Google uses DeepMind WaveNet and Neural2 models to power high-fidelity voices in 50+ languages and variants. Features include voice selection, pitch, speaking rate, volume control, SSML tags, and access to both standard and studio-grade premium voices. Widely used for accessibility, IVR, and media.

  • Microsoft Azure Speech (Microsoft Azure):
    Azure provides over 400 neural voices across 140+ languages and locales, with unique voice customization, style, emotion, role, and real-time controls. Offers SSML support for pronunciation, intonation, and more. Ideal for global, enterprise, or creative TTS needs.

  • PlayHT (PlayHT):
    PlayHT specializes in realistic voice synthesis, voice cloning, and instant streaming playback with 800+ voices in over 100 languages. Features include emotion, pitch and speed controls, multi-voice audio, and custom voice creation via the API or online studio.

How to Choose:
Pick your provider and model by prioritizing languages, supported voice types, desired formats (mp3, wav, etc.), control granularity (speed, emotion, etc.), and specialized features (voice cloning, accent, streaming). For creative, accessibility, or developer use cases, ensure compatibility with your application's requirements and compare costs.

Visit each provider’s official site for up-to-date capabilities, pricing, and documentation details!
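
As one way to make the format criterion concrete, the sketch below filters tools by the audio formats listed in the tool tables on this page. The names `DOCUMENTED_FORMATS` and `providers_supporting` are hypothetical and not part of Sim; the sets only copy the format lists documented here, and tools without a documented format list are left out.

```python
# Format lists copied from the tool parameter tables on this page.
# DOCUMENTED_FORMATS and providers_supporting are hypothetical names.
DOCUMENTED_FORMATS = {
    "tts_openai": {"mp3", "opus", "aac", "flac", "wav", "pcm"},
    "tts_deepgram": {"linear16", "mp3", "opus", "aac", "flac"},
    "tts_google": {"linear16", "mp3", "ogg_opus", "mulaw", "alaw"},
    "tts_playht": {"mp3", "wav", "ogg", "flac", "mulaw"},
}


def providers_supporting(audio_format: str) -> list:
    """Return the tools whose documented formats include audio_format."""
    wanted = audio_format.lower()
    return sorted(
        tool for tool, formats in DOCUMENTED_FORMATS.items() if wanted in formats
    )


print(providers_supporting("flac"))
```

The same pattern could be extended to languages or features such as voice cloning, provided the capability data is taken from each provider's current documentation.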

Usage Instructions

Generate natural-sounding speech from text using state-of-the-art AI voices from OpenAI, Deepgram, ElevenLabs, Cartesia, Google Cloud, Azure, and PlayHT. Supports multiple voices, languages, and audio formats.

Tools

tts_openai

Convert text to speech using OpenAI TTS models

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | OpenAI API key |
| model | string | No | TTS model to use (tts-1, tts-1-hd, or gpt-4o-mini-tts) |
| voice | string | No | Voice to use (alloy, ash, ballad, cedar, coral, echo, marin, sage, shimmer, verse) |
| responseFormat | string | No | Audio format (mp3, opus, aac, flac, wav, pcm) |
| speed | number | No | Speech speed (0.25 to 4.0, default: 1.0) |
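
The constraints in the table above can be checked before the tool is called. The following is a minimal sketch, assuming the input is assembled as a plain dictionary keyed by the parameter names shown; `build_openai_tts_input` is a hypothetical helper, not a Sim function.

```python
from typing import Optional

# Allowed values copied from the parameter table above.
OPENAI_MODELS = {"tts-1", "tts-1-hd", "gpt-4o-mini-tts"}
OPENAI_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_openai_tts_input(
    text: str,
    api_key: str,
    model: Optional[str] = None,
    voice: Optional[str] = None,
    response_format: Optional[str] = None,
    speed: float = 1.0,
) -> dict:
    """Hypothetical helper: validate values and assemble the tts_openai input."""
    if not text:
        raise ValueError("text is required")
    if not api_key:
        raise ValueError("apiKey is required")
    if model is not None and model not in OPENAI_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if response_format is not None and response_format not in OPENAI_FORMATS:
        raise ValueError(f"unknown responseFormat: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    params = {"text": text, "apiKey": api_key, "speed": speed}
    # Optional parameters are only included when the caller sets them.
    if model is not None:
        params["model"] = model
    if voice is not None:
        params["voice"] = voice
    if response_format is not None:
        params["responseFormat"] = response_format
    return params
```

Only `speed` has a documented default, so the sketch leaves the other optional parameters out of the dictionary unless they are given.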

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |
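
Every tool on this page documents the same six output fields, so downstream code can treat the result uniformly. The sketch below assumes the output arrives as a dictionary keyed by the field names above; `TTSResult` and `parse_tts_result` are hypothetical names, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TTSResult:
    audio_url: str
    audio_file: Any  # file object; its internal shape is not documented here
    duration: float  # seconds
    character_count: int
    format: str
    provider: str


def parse_tts_result(raw: dict) -> TTSResult:
    """Map the documented output fields onto a typed record."""
    return TTSResult(
        audio_url=raw["audioUrl"],
        audio_file=raw.get("audioFile"),
        duration=float(raw["duration"]),
        character_count=int(raw["characterCount"]),
        format=raw["format"],
        provider=raw["provider"],
    )


# Made-up sample values, for illustration only.
sample = {
    "audioUrl": "https://example.com/audio/hello.mp3",
    "audioFile": None,
    "duration": 1.8,
    "characterCount": 12,
    "format": "mp3",
    "provider": "openai",
}
result = parse_tts_result(sample)
print(f"{result.provider}: {result.duration:.1f}s of {result.format}")
```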

tts_deepgram

Convert text to speech using Deepgram Aura

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | Deepgram API key |
| model | string | No | Deepgram model/voice (e.g., aura-asteria-en, aura-luna-en) |
| voice | string | No | Voice identifier (alternative to model param) |
| encoding | string | No | Audio encoding (linear16, mp3, opus, aac, flac) |
| sampleRate | number | No | Sample rate (8000, 16000, 24000, 48000) |
| bitRate | number | No | Bit rate for compressed formats |
| container | string | No | Container format (none, wav, ogg) |
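
Because `voice` is described as an alternative to `model`, a caller needs some rule for when both are set. The sketch below assumes `model` takes precedence, which this page does not confirm, and also checks `sampleRate` against the listed values. Both helper names are hypothetical.

```python
from typing import Optional

# Sample rates copied from the parameter table above.
DEEPGRAM_SAMPLE_RATES = {8000, 16000, 24000, 48000}


def resolve_deepgram_model(
    model: Optional[str] = None, voice: Optional[str] = None
) -> Optional[str]:
    """Assumed precedence: use model if given, otherwise fall back to voice."""
    return model or voice


def check_sample_rate(sample_rate: int) -> int:
    """Reject sample rates that are not listed in the table."""
    if sample_rate not in DEEPGRAM_SAMPLE_RATES:
        allowed = sorted(DEEPGRAM_SAMPLE_RATES)
        raise ValueError(f"sampleRate must be one of {allowed}")
    return sample_rate


print(resolve_deepgram_model(voice="aura-luna-en"))
```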

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_elevenlabs

Convert text to speech using ElevenLabs voices

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| voiceId | string | Yes | The ID of the voice to use |
| apiKey | string | Yes | ElevenLabs API key |
| modelId | string | No | Model to use (e.g., eleven_monolingual_v1, eleven_turbo_v2_5, eleven_flash_v2_5) |
| stability | number | No | Voice stability (0.0 to 1.0, default: 0.5) |
| similarityBoost | number | No | Similarity boost (0.0 to 1.0, default: 0.8) |
| style | number | No | Style exaggeration (0.0 to 1.0) |
| useSpeakerBoost | boolean | No | Use speaker boost (default: true) |
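
The numeric voice settings above all share the 0.0 to 1.0 range, so they can be normalized in one place. This sketch applies the documented defaults and clamps out-of-range values rather than rejecting them; that is a design choice of the example, and the tool itself may behave differently. `elevenlabs_voice_settings` is a hypothetical helper.

```python
from typing import Optional


def clamp01(value: float) -> float:
    """Clamp a number into the documented 0.0 to 1.0 range."""
    return max(0.0, min(1.0, value))


def elevenlabs_voice_settings(
    stability: float = 0.5,
    similarity_boost: float = 0.8,
    style: Optional[float] = None,
    use_speaker_boost: bool = True,
) -> dict:
    """Hypothetical helper: apply the table's defaults and ranges."""
    settings = {
        "stability": clamp01(stability),
        "similarityBoost": clamp01(similarity_boost),
        "useSpeakerBoost": use_speaker_boost,
    }
    # style has no documented default, so it is only sent when provided.
    if style is not None:
        settings["style"] = clamp01(style)
    return settings


print(elevenlabs_voice_settings(stability=1.4, style=0.3))
```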

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_cartesia

Convert text to speech using Cartesia Sonic (ultra-low latency)

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | Cartesia API key |
| modelId | string | No | Model ID (sonic-english, sonic-multilingual) |
| voice | string | No | Voice ID or embedding |
| language | string | No | Language code (en, es, fr, de, it, pt, etc.) |
| outputFormat | json | No | Output format configuration (container, encoding, sampleRate) |
| speed | number | No | Speed multiplier |
| emotion | array | No | Emotion tags for Sonic-3 (e.g., ['positivity:high']) |
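
The two structured parameters, `outputFormat` and `emotion`, are easier to read with a concrete shape in front of you. In the sketch below, the three `outputFormat` keys come from the table, but the values are illustrative assumptions, as is passing the configuration as a nested object. `emotion_tags` is a hypothetical helper that produces the `name:level` strings shown in the table's example.

```python
import json


def emotion_tags(levels: dict) -> list:
    """Turn {"positivity": "high"} into ["positivity:high"]."""
    return [f"{name}:{level}" for name, level in levels.items()]


# Keys come from the table; the values are illustrative assumptions.
output_format = {"container": "wav", "encoding": "pcm_s16le", "sampleRate": 44100}

tool_input = {
    "text": "Hello from Cartesia.",
    "apiKey": "YOUR_CARTESIA_API_KEY",  # placeholder, not a real key
    "language": "en",
    "outputFormat": output_format,
    "emotion": emotion_tags({"positivity": "high"}),
}

print(json.dumps(tool_input, indent=2))
```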

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_google

Convert text to speech using Google Cloud Text-to-Speech

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | Google Cloud API key |
| voiceId | string | No | Voice ID (e.g., en-US-Neural2-A, en-US-Wavenet-D) |
| languageCode | string | Yes | Language code (e.g., en-US, es-ES, fr-FR) |
| gender | string | No | Voice gender (MALE, FEMALE, NEUTRAL) |
| audioEncoding | string | No | Audio encoding (LINEAR16, MP3, OGG_OPUS, MULAW, ALAW) |
| speakingRate | number | No | Speaking rate (0.25 to 2.0, default: 1.0) |
| pitch | number | No | Voice pitch (-20.0 to 20.0, default: 0.0) |
| volumeGainDb | number | No | Volume gain in dB (-96.0 to 16.0) |
| sampleRateHertz | number | No | Sample rate in Hz |
| effectsProfileId | array | No | Effects profile (e.g., ['headphone-class-device']) |
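
Since `languageCode` is required and the example voice IDs begin with a language code, the two can be kept consistent by deriving one from the other. The sketch below assumes every voice ID follows the `language-REGION-...` pattern seen in the examples above, which may not hold for all voices. `language_code_from_voice` is a hypothetical helper.

```python
def language_code_from_voice(voice_id: str) -> str:
    """Take the first two hyphen-separated parts, e.g. en-US from en-US-Neural2-A."""
    parts = voice_id.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected voice ID: {voice_id!r}")
    return "-".join(parts[:2])


voice_id = "en-US-Neural2-A"
tool_input = {
    "text": "Hello from Google Cloud.",
    "apiKey": "YOUR_GOOGLE_CLOUD_API_KEY",  # placeholder, not a real key
    "voiceId": voice_id,
    "languageCode": language_code_from_voice(voice_id),
}
print(tool_input["languageCode"])
```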

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_azure

Convert text to speech using Azure Cognitive Services

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | Azure Speech Services API key |
| voiceId | string | No | Voice ID (e.g., en-US-JennyNeural, en-US-GuyNeural) |
| region | string | No | Azure region (e.g., eastus, westus, westeurope) |
| outputFormat | string | No | Output audio format |
| rate | string | No | Speaking rate (e.g., +10%, -20%, 1.5) |
| pitch | string | No | Voice pitch (e.g., +5Hz, -2st, low) |
| style | string | No | Speaking style (e.g., cheerful, sad, angry - neural voices only) |
| styleDegree | number | No | Style intensity (0.01 to 2.0) |
| role | string | No | Role (e.g., Girl, Boy, YoungAdultFemale) |
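
The `rate`, `pitch`, `style`, `styleDegree`, and `role` parameters correspond to attributes in Azure's SSML markup (`prosody` and `mstts:express-as`). The sketch below shows how such a document could be assembled from them. Whether the tool builds its request this way is an assumption, and `build_ssml` is a hypothetical helper.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice_id: str,
    lang: str = "en-US",
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    style: Optional[str] = None,
    style_degree: Optional[float] = None,
    role: Optional[str] = None,
) -> str:
    """Hypothetical helper: wrap text in prosody and express-as elements."""
    body = escape(text)  # escape &, < and > in the spoken text

    prosody = []
    if rate is not None:
        prosody.append("rate=" + quoteattr(rate))
    if pitch is not None:
        prosody.append("pitch=" + quoteattr(pitch))
    if prosody:
        body = "<prosody " + " ".join(prosody) + ">" + body + "</prosody>"

    express = []
    if style is not None:
        express.append("style=" + quoteattr(style))
        if style_degree is not None:
            express.append("styledegree=" + quoteattr(str(style_degree)))
    if role is not None:
        express.append("role=" + quoteattr(role))
    if express:
        attrs = " ".join(express)
        body = "<mstts:express-as " + attrs + ">" + body + "</mstts:express-as>"

    parts = [
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ',
        'xmlns:mstts="https://www.w3.org/2001/mstts" ',
        "xml:lang=" + quoteattr(lang) + ">",
        "<voice name=" + quoteattr(voice_id) + ">",
        body,
        "</voice></speak>",
    ]
    return "".join(parts)


print(build_ssml("Hello!", "en-US-JennyNeural", rate="+10%", style="cheerful"))
```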

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_playht

Convert text to speech using PlayHT (voice cloning)

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text to convert to speech |
| apiKey | string | Yes | PlayHT API key (AUTHORIZATION header) |
| userId | string | Yes | PlayHT user ID (X-USER-ID header) |
| voice | string | No | Voice ID or manifest URL |
| quality | string | No | Quality level (draft, standard, premium) |
| outputFormat | string | No | Output format (mp3, wav, ogg, flac, mulaw) |
| speed | number | No | Speed multiplier (0.5 to 2.0) |
| temperature | number | No | Creativity/randomness (0.0 to 2.0) |
| voiceGuidance | number | No | Voice stability (1.0 to 6.0) |
| textGuidance | number | No | Text adherence (1.0 to 6.0) |
| sampleRate | number | No | Sample rate (8000, 16000, 22050, 24000, 44100, 48000) |
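
PlayHT is the only tool here that needs two credentials and has several numeric ranges with different bounds. The sketch below shows the header mapping named in the table and a range check driven by a lookup table. Both helper names are hypothetical, and passing the key unmodified as the header value is an assumption.

```python
# Ranges copied from the parameter table above.
PLAYHT_RANGES = {
    "speed": (0.5, 2.0),
    "temperature": (0.0, 2.0),
    "voiceGuidance": (1.0, 6.0),
    "textGuidance": (1.0, 6.0),
}


def playht_headers(api_key: str, user_id: str) -> dict:
    """Map the two credentials onto the header names given in the table."""
    return {"AUTHORIZATION": api_key, "X-USER-ID": user_id}


def out_of_range(options: dict) -> list:
    """Return the names of numeric options outside their documented range."""
    return [
        name
        for name, (low, high) in PLAYHT_RANGES.items()
        if name in options and not low <= options[name] <= high
    ]


print(out_of_range({"speed": 1.0, "temperature": 2.5}))
```

Keeping the bounds in one dictionary means a changed range only needs to be updated in a single place.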

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

Notes

  • Category: tools
  • Type: tts