Text-to-speech service implementation using Coqui’s XTTS streaming server
- TextFrame: Text content to synthesize into speech
- TTSSpeakFrame: Text that should be spoken immediately
- TTSUpdateSettingsFrame: Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame: LLM response boundaries
- TTSStartedFrame: Signals start of synthesis
- TTSAudioRawFrame: Generated audio data (streaming, resampled from 24 kHz)
- TTSStoppedFrame: Signals completion of synthesis
- ErrorFrame: Server connection or processing errors

Language Code | Description | Service Code |
---|---|---|
Language.CS | Czech | cs |
Language.DE | German | de |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RU | Russian | ru |
Language.TR | Turkish | tr |
Language.ZH | Chinese (Simplified) | zh-cn |
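
The mapping in the table above can be mirrored as a plain lookup, which is handy for checking a language before it is sent to the server. This is a minimal sketch, not the service's own code: the names `XTTS_SERVICE_CODES` and `to_xtts_code` are hypothetical, and only the code pairs come from the table.

```python
from typing import Optional

# Language name -> XTTS service code, copied from the table above.
XTTS_SERVICE_CODES = {
    "CS": "cs", "DE": "de", "EN": "en", "ES": "es",
    "FR": "fr", "HI": "hi", "HU": "hu", "IT": "it",
    "JA": "ja", "KO": "ko", "NL": "nl", "PL": "pl",
    "PT": "pt", "RU": "ru", "TR": "tr", "ZH": "zh-cn",
}


def to_xtts_code(language: str) -> Optional[str]:
    """Return the service code for 'Language.XX' or 'XX', or None if unsupported."""
    name = language.strip().upper()
    if name.startswith("LANGUAGE."):
        name = name[len("LANGUAGE."):]
    return XTTS_SERVICE_CODES.get(name)


if __name__ == "__main__":
    print(to_xtts_code("Language.ZH"))
    print(to_xtts_code("Language.SV"))
```

Returning `None` for an unlisted language lets the caller decide whether to fall back to a default or report an error.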
Create an XTTSService and use it in a pipeline:
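
A minimal sketch of what this might look like, under several assumptions: the import paths match recent Pipecat releases (older releases may expose the class elsewhere), the XTTS streaming server is reachable at `http://localhost:8000`, and `"Claribel Dervla"` is a voice available on that server. The helper `build_pipeline` and its `upstream` / `downstream` parameters are hypothetical and stand in for your own processors, such as transport input, an LLM, and transport output.

```python
def build_pipeline(session, upstream, downstream):
    """Place an XTTSService between caller-supplied processors.

    `session` is expected to be an open aiohttp.ClientSession.
    """
    # Imported inside the function so this sketch can be loaded without
    # Pipecat installed. Module paths are assumptions; check your version.
    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.services.xtts.tts import XTTSService
    from pipecat.transcriptions.language import Language

    tts = XTTSService(
        voice_id="Claribel Dervla",        # assumed voice name
        base_url="http://localhost:8000",  # assumed server address
        aiohttp_session=session,
        language=Language.EN,
    )
    # Text frames from `upstream` reach the TTS service; the audio frames
    # it produces continue on to `downstream`.
    return Pipeline([*upstream, tts, *downstream])
```

In a real application the session would typically come from `async with aiohttp.ClientSession() as session:` and the resulting pipeline would be wrapped in a pipeline task and run.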
Send a TTSUpdateSettingsFrame for the XTTSService to update its settings at runtime:
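
As a hedged illustration, a language change could be queued like this. The sketch assumes `task` is a running Pipecat pipeline task that exposes an async `queue_frame` method, and that the frame accepts a `settings` mapping with a `"language"` key. The helper name `switch_to_spanish` is hypothetical, and the import paths may differ between Pipecat versions.

```python
async def switch_to_spanish(task) -> None:
    """Queue a runtime settings update that changes the TTS language."""
    # Imported inside the function so this sketch can be loaded without
    # Pipecat installed. Module paths are assumptions; check your version.
    from pipecat.frames.frames import TTSUpdateSettingsFrame
    from pipecat.transcriptions.language import Language

    # The frame travels through the pipeline like any other frame; the TTS
    # service applies the new settings when the frame reaches it.
    await task.queue_frame(
        TTSUpdateSettingsFrame(settings={"language": Language.ES})
    )
```

Only languages listed in the table above map to a service code, so it is worth validating the target language before queuing the frame.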