Text-to-speech service implementations using Rime AI:

- `RimeTTSService`: WebSocket-based with word-level timing and interruption support
- `RimeHttpTTSService`: HTTP-based for simpler use cases

`RimeTTSService` is recommended for real-time interactive applications.

API key environment variable: `RIME_API_KEY`.
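
As a minimal sketch of the configuration step, the key can be read from the environment before either service is constructed. The helper name `load_rime_api_key` is hypothetical and not part of any library; only the variable name `RIME_API_KEY` comes from the text above.

```python
import os


def load_rime_api_key(env=os.environ):
    """Return the Rime API key from `env`, or raise a clear error.

    `load_rime_api_key` is an illustrative helper, not a library function.
    """
    key = env.get("RIME_API_KEY", "").strip()
    if not key:
        raise RuntimeError("RIME_API_KEY is not set")
    return key
```

Failing early with an explicit message tends to be easier to debug than an authentication error surfacing later, mid-pipeline.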

Input frames:

- `TextFrame` - Text content to synthesize into speech
- `TTSSpeakFrame` - Text that should be spoken immediately
- `TTSUpdateSettingsFrame` - Runtime configuration updates
- `LLMFullResponseStartFrame` / `LLMFullResponseEndFrame` - LLM response boundaries

Output frames:

- `TTSStartedFrame` - Signals start of synthesis
- `TTSAudioRawFrame` - Generated audio data chunks (PCM format)
- `TTSTextFrame` - Word-level timing information (WebSocket service only)
- `TTSStoppedFrame` - Signals completion of synthesis
- `ErrorFrame` - API or processing errors

Feature | RimeTTSService (WebSocket) | RimeHttpTTSService (HTTP) |
---|---|---|
Word Timestamps | ✅ Precise timing | ❌ Not available |
Interruption | ✅ Context tracking | ⚠️ Basic support |
Streaming | ✅ Real-time chunks | ✅ Chunked response |
Inline Speed | ❌ Not supported | ✅ Word-level control |
Arcana Model | ❌ Not supported | ✅ Latest model |
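
The feature comparison above can be encoded as data to pick a service from a set of requirements. This is a sketch only: the feature labels and the `services_supporting` helper are hypothetical names chosen for illustration, and the support sets are transcribed from the table rather than queried from the library.

```python
# Feature support transcribed from the comparison table.
# The string labels are illustrative, not library identifiers.
SERVICE_FEATURES = {
    "RimeTTSService": {"word_timestamps", "context_tracking", "streaming"},
    "RimeHttpTTSService": {"basic_interruption", "streaming", "inline_speed", "arcana"},
}


def services_supporting(required):
    """Return the sorted names of services that support every required feature."""
    required = set(required)
    return sorted(
        name for name, features in SERVICE_FEATURES.items() if required <= features
    )
```

An empty result signals that no single service covers the request, for example word timestamps together with inline speed control.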

Model | Description | Availability |
---|---|---|
mistv2 | Hyper-realistic conversational voices (recommended) | Both services |
mist | Previous generation model | Both services |
arcana | Latest high-quality model | HTTP only |
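
Because `arcana` is listed as HTTP only, a configuration check before constructing a service can catch a mismatch early. The following is a sketch under that assumption; `MODEL_TRANSPORTS` and `check_model` are hypothetical names, and the availability data is copied from the model table.

```python
# Model availability transcribed from the model table.
MODEL_TRANSPORTS = {
    "mistv2": {"websocket", "http"},
    "mist": {"websocket", "http"},
    "arcana": {"http"},
}


def check_model(model, transport):
    """Return `model` if it is available on `transport`, else raise ValueError."""
    allowed = MODEL_TRANSPORTS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model!r}")
    if transport not in allowed:
        raise ValueError(f"model {model!r} is not available over {transport}")
    return model
```

For instance, `check_model("arcana", "websocket")` raises `ValueError`, while `check_model("mistv2", "websocket")` returns the model name unchanged.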

Language Code | Description | Service Code |
---|---|---|
Language.DE | German | ger |
Language.EN | English | eng |
Language.ES | Spanish | spa |
Language.FR | French | fra |
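
The language mapping in the table amounts to a lookup from a two-letter language code to Rime's three-letter service code. The sketch below uses plain strings in place of the `Language` enum so that it runs standalone; `to_rime_language` is a hypothetical helper, and falling back from a regional variant such as `en-US` to its base language is an assumption, not documented behavior.

```python
# Two-letter language codes mapped to Rime service codes, per the table.
RIME_LANGUAGE_CODES = {
    "de": "ger",
    "en": "eng",
    "es": "spa",
    "fr": "fra",
}


def to_rime_language(code):
    """Map a language code such as "en" or "en-US" to a Rime service code.

    Returns None for languages that are not in the table.
    """
    # Assumption: regional variants reduce to their base language.
    base = code.lower().replace("_", "-").split("-")[0]
    return RIME_LANGUAGE_CODES.get(base)
```

Returning `None` for an unlisted language leaves the choice of default to the caller instead of silently substituting English.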

Settings can be updated at runtime by pushing a `TTSUpdateSettingsFrame`.

- Use `RimeTTSService` for interactive applications requiring word timestamps and precise context management.
- `SkipTagsAggregator` is used by default to handle Rime’s `spell()` tags.
- Use `mistv2` for the best balance of quality and performance, or `arcana` for the highest quality (HTTP only).