Text-to-speech service using Inworld AI’s TTS APIs
Set your Inworld API key as an environment variable: INWORLD_API_KEY.
- TextFrame - Text content to synthesize into speech
- TTSSpeakFrame - Text that should be spoken immediately
- TTSUpdateSettingsFrame - Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries
- TTSStartedFrame - Signals start of synthesis
- TTSAudioRawFrame - Generated audio data (LINEAR16 PCM, WAV header stripped)
- TTSStoppedFrame - Signals completion of synthesis
- ErrorFrame - API or processing errors

Use inworld-tts-1 for real-time, cost-sensitive use (lowest latency); use inworld-tts-1-max (experimental) when you can trade a bit more latency for richer expressiveness and broader multilingual support.

[happy], [sad], [angry], [surprised], [fearful], [disgusted]

[laughing], [whispering]

[breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
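
As a rough illustration of working with the bracketed tags listed above, the sketch below checks a piece of text for tags that are not in that list before it is sent for synthesis. The group names (`emotion`, `delivery`, `non_verbal`), the helper names, and the choice to put a tag in front of the text are assumptions made for this example; they are not taken from the Inworld or Pipecat APIs.

```python
import re

# Tags copied from the list above. The three group names are labels chosen
# for this sketch only, not official terminology.
MARKUP_TAGS = {
    "emotion": ["happy", "sad", "angry", "surprised", "fearful", "disgusted"],
    "delivery": ["laughing", "whispering"],
    "non_verbal": ["breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"],
}
KNOWN_TAGS = {tag for group in MARKUP_TAGS.values() for tag in group}

# Matches lowercase bracketed tokens such as [happy] or [clear_throat].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags found in `text` that are not in the list above."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


def with_tag(tag: str, text: str) -> str:
    """Prefix `text` with a known tag (hypothetical helper; placement is an assumption)."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown markup tag: {tag!r}")
    return f"[{tag}] {text}"


if __name__ == "__main__":
    print(with_tag("whispering", "Keep your voice down."))
    print(unknown_tags("[happy] Great news! [sigh] Finally. [excited]"))
```

Running a check like this before synthesis can catch misspelled tags early, which may be easier than diagnosing them from the generated audio.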
Mode | Best For | Use Cases |
---|---|---|
Streaming | Real-time applications | Conversational AI, minimal-latency interactions, processing text as it becomes available |
Non-Streaming | Batch processing | Longer content generation, complete audio files, batch scenarios, slightly better quality |
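
To make "processing text as it becomes available" more concrete, here is a minimal sketch of one way to turn incoming text chunks (for example, LLM output) into complete sentences that could be handed to a streaming TTS call one at a time. The function name and the regex-based sentence boundary are assumptions for illustration; this is not how the service itself is documented to aggregate text.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace (a simplification).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello the", "re. How are", " you? I am", " fine."]
    for sentence in sentences_from_chunks(stream):
        print(sentence)
```

In non-streaming mode the same text would simply be joined and sent as one request, trading latency for a single complete audio result.
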
Sample Rate | Quality | Use Case |
---|---|---|
16000 Hz | Basic | Voice calls, simple applications |
24000 Hz | Good | General conversational AI |
48000 Hz | High | Professional applications, music |
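
Because the generated audio is headerless LINEAR16 PCM, the sample rate determines both how many bytes one second of audio occupies and what must be written into a WAV header if the audio is saved to a file. The sketch below uses only Python's standard library; the helper names are hypothetical and mono output is assumed.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # LINEAR16 means 16-bit signed samples


def pcm_bytes(duration_s: float, sample_rate: int, channels: int = 1) -> int:
    """Size in bytes of `duration_s` seconds of LINEAR16 audio (mono assumed by default)."""
    return int(duration_s * sample_rate * channels * BYTES_PER_SAMPLE)


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap headerless LINEAR16 PCM in a WAV container so ordinary players can open it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(BYTES_PER_SAMPLE)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Bytes per second of mono audio at each sample rate from the table.
    for rate in (16000, 24000, 48000):
        print(rate, "Hz:", pcm_bytes(1.0, rate), "bytes per second")

    # One second of silence at 24000 Hz, wrapped for playback.
    silence = b"\x00\x00" * 24000
    with open("silence.wav", "wb") as out:
        out.write(pcm_to_wav(silence, 24000))
```

The sample rate passed to `pcm_to_wav` has to match the rate the audio was generated at; otherwise playback will sound too fast or too slow.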