Overview

SpeechmaticsSTTService enables real-time speech transcription using Speechmatics’ WebSocket API, with support for interim and final results, speaker diarization, and end-of-utterance detection (VAD).

Installation

To use SpeechmaticsSTTService, install the required dependencies:
pip install "pipecat-ai[speechmatics]"
You’ll also need to set up your Speechmatics API key as an environment variable: SPEECHMATICS_API_KEY.
Get your API key from the Speechmatics Portal.
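In code, the key can be read from that environment variable rather than hard-coded. A minimal sketch (the api_key argument is the same one shown in the usage examples below):

import os

from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

# Read the API key from the SPEECHMATICS_API_KEY environment variable
stt = SpeechmaticsSTTService(api_key=os.getenv("SPEECHMATICS_API_KEY"))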

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, 16kHz, mono)

Output

  • InterimTranscriptionFrame - Real-time transcription updates
  • TranscriptionFrame - Final transcription results
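As a rough sketch of how these frames can be consumed downstream, the processor below logs interim and final results as they pass through the pipeline (TranscriptionLogger is a hypothetical name; the frame classes and FrameProcessor API are standard Pipecat):

from pipecat.frames.frames import InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptionLogger(FrameProcessor):
    """Logs transcription frames as they pass through the pipeline."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, InterimTranscriptionFrame):
            print(f"[interim] {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"[final] {frame.user_id}: {frame.text}")
        # Always forward the frame to the next processor
        await self.push_frame(frame, direction)

A processor like this could be placed directly after the STT service in the pipeline examples below.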

Endpoints

Speechmatics STT supports the following endpoints (defaults to EU2):
Region | Environment   | STT Endpoint                   | Access
EU     | EU1           | wss://neu.rt.speechmatics.com/ | Self-Service / Enterprise
EU     | EU2 (Default) | wss://eu2.rt.speechmatics.com/ | Self-Service / Enterprise
US     | US1           | wss://wus.rt.speechmatics.com/ | Enterprise
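To use a non-default region, point the service at one of the endpoints above when constructing it. The base_url keyword in this sketch is an assumption, so check the SpeechmaticsSTTService constructor for the exact argument name:

from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

# NOTE: `base_url` is assumed for illustration; verify the actual
# parameter name against the SpeechmaticsSTTService constructor.
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    base_url="wss://wus.rt.speechmatics.com/",  # US1 (Enterprise) endpoint
)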

Feature Discovery

To check the languages and features supported by Speechmatics STT, you can use the following code:
curl "https://eu2.rt.speechmatics.com/v1/discovery/features"

Language Support

Refer to the Speechmatics docs for more information on supported languages.
Speechmatics STT supports the following languages and regional variants. Set the language using the language parameter when creating the STT object. The exception is English / Mandarin, which has the code cmn_en and must be set using the language_code parameter.
Language Code | Description | Locales
Language.AR   | Arabic      | -
Language.BA   | Bashkir     | -
Language.EU   | Basque      | -
Language.BE   | Belarusian  | -
Language.BG   | Bulgarian   | -
Language.BN   | Bengali     | -
Language.YUE  | Cantonese   | -
Language.CA   | Catalan     | -
Language.HR   | Croatian    | -
Language.CS   | Czech       | -
Language.DA   | Danish      | -
Language.NL   | Dutch       | -
Language.EN   | English     | en-US, en-GB, en-AU
Language.EO   | Esperanto   | -
Language.ET   | Estonian    | -
Language.FA   | Persian     | -
Language.FI   | Finnish     | -
Language.FR   | French      | -
Language.GL   | Galician    | -
Language.DE   | German      | -
Language.EL   | Greek       | -
Language.HE   | Hebrew      | -
Language.HI   | Hindi       | -
Language.HU   | Hungarian   | -
Language.IA   | Interlingua | -
Language.IT   | Italian     | -
Language.ID   | Indonesian  | -
Language.GA   | Irish       | -
Language.JA   | Japanese    | -
Language.KO   | Korean      | -
Language.LV   | Latvian     | -
Language.LT   | Lithuanian  | -
Language.MS   | Malay       | -
Language.MT   | Maltese     | -
Language.CMN  | Mandarin    | cmn-Hans, cmn-Hant
Language.MR   | Marathi     | -
Language.MN   | Mongolian   | -
Language.NO   | Norwegian   | -
Language.PL   | Polish      | -
Language.PT   | Portuguese  | -
Language.RO   | Romanian    | -
Language.RU   | Russian     | -
Language.SK   | Slovakian   | -
Language.SL   | Slovenian   | -
Language.ES   | Spanish     | -
Language.SV   | Swedish     | -
Language.SW   | Swahili     | -
Language.TA   | Tamil       | -
Language.TH   | Thai        | -
Language.TR   | Turkish     | -
Language.UG   | Uyghur      | -
Language.UK   | Ukrainian   | -
Language.UR   | Urdu        | -
Language.VI   | Vietnamese  | -
Language.CY   | Welsh       | -
For bilingual transcription, use the language_code and domain parameters as follows (see the sketch after this table):
Language Code | Description        | Domain Options
cmn_en        | English / Mandarin | -
en_ms         | English / Malay    | -
Language.ES   | English / Spanish  | bilingual-en
en_ta         | English / Tamil    | -
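As a sketch of how these values fit together, bilingual English / Spanish uses the language and domain parameters, while English / Mandarin uses language_code (placing these fields on InputParams is assumed from the descriptions above):

from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# English / Spanish bilingual transcription
stt_es = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language=Language.ES,
        domain="bilingual-en",
    ),
)

# English / Mandarin bilingual transcription is selected via language_code
stt_cmn = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language_code="cmn_en",
    ),
)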

Speaker Diarization

Speechmatics STT supports speaker diarization, which separates out different speakers in the audio. The identity of each speaker is returned in the user_id attribute of the TranscriptionFrame objects. To enable this feature, set enable_diarization to True. If speaker_active_format or speaker_passive_format is provided, the text of each TranscriptionFrame is formatted to that specification; your system context can then be updated to describe the format so the LLM knows which speaker said which words (see the sketch after the attributes table below). The passive format is optional: when the engine has been told to focus on specific speakers, words from other speakers are formatted using speaker_passive_format.
  • speaker_active_format -> the formatter for active speakers
  • speaker_passive_format -> the formatter for passive / background speakers
Examples:
  • <{speaker_id}>{text}</{speaker_id}> -> <S1>Good morning.</S1>
  • @{speaker_id}: {text} -> @S1: Good morning.

Available attributes

Attribute  | Description           | Example
speaker_id | The ID of the speaker | S1
text       | The transcribed text  | Good morning.
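For example, a short note in your system context can tell the LLM how to read the formatted transcripts (the wording is only illustrative and assumes the active format shown above):

# Illustrative system prompt fragment describing the speaker format
system_instruction = (
    "User transcripts are tagged by speaker: text wrapped as <S1>...</S1> "
    "was spoken by speaker S1, <S2>...</S2> by speaker S2, and so on."
)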

Usage Examples

Usage examples are included in the Pipecat repository’s sample projects.

Basic Configuration

Initialize the SpeechmaticsSTTService and use it in a pipeline:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language=Language.FR,
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

With Diarization

This example enables diarization and only forwards transcripts to the LLM when the first speaker (S1) speaks. Words from other speakers are still transcribed, but they are only sent when the first speaker speaks. When the enable_vad option is used, speaker diarization determines when a speaker is speaking; you will need to disable the VAD options within the selected transport object to ensure this works correctly (see 07b-interruptible-speechmatics-vad.py as an example). Initialize the SpeechmaticsSTTService and use it in a pipeline:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language=Language.EN,
        enable_diarization=True,
        enable_vad=True,
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
        speaker_passive_format="<PASSIVE><{speaker_id}>{text}</{speaker_id}></PASSIVE>",
        focus_speakers=["S1"],
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Additional Notes

  • Connection Management: Automatically handles WebSocket connections and reconnections
  • Sample Rate: The default sample rate is 16000 Hz with audio in pcm_s16le format (see the sketch after this list)
  • VAD Integration: Optionally supports Speechmatics’ built-in VAD and end of utterance detection
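If your transport delivers audio at a different rate, the sample rate can typically be set when constructing the service. The sample_rate keyword below is an assumption, so verify it against the SpeechmaticsSTTService constructor:

from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

# NOTE: `sample_rate` is assumed for illustration; check the constructor
# for the supported argument.
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    sample_rate=16000,  # 16 kHz, 16-bit PCM (pcm_s16le)
)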