# Speech to Text

> Learn how to configure speech recognition to convert user audio into text in your Pipecat pipeline

**Speech to Text (STT)** services are responsible for converting user audio into text transcriptions. They receive audio input from users and provide real-time transcriptions that your bot can process and respond to.

## Pipeline Placement

STT processors must be positioned correctly in your pipeline to receive and process audio frames:

```python
pipeline = Pipeline([
    transport.input(),           # Creates InputAudioRawFrames
    stt,                         # Processes audio → creates TranscriptionFrames
    context_aggregator.user(),   # Uses transcriptions for context
    llm,
    tts,
    transport.output(),
])
```

**Placement requirements:**

* **After `transport.input()`**: STT needs `InputAudioRawFrame`s from the transport
* **Before context processing**: Transcriptions must be available for context aggregation
* **Before LLM processing**: Text must be ready for language model input

## STT Service Types

Pipecat provides two types of STT services based on how they process audio:

### 1. STTService (Streaming)

**How it works:**

* Establishes a WebSocket connection to the STT provider
* Continuously streams audio for real-time transcription
* Lower latency due to persistent connection

### 2. SegmentedSTTService (HTTP-based)

**How it works:**

* Uses local VAD (Voice Activity Detection) to chunk speech
* Sends audio segments to STT service as wav files
* Higher latency due to segmentation and HTTP POST requests

STT services are modular and can be swapped out with no additional overhead. You can easily switch between streaming and segmented services based on your needs.
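To make the segmented approach more concrete, here is a minimal sketch of wrapping one chunk of raw 16-bit PCM audio in a WAV container before it would be sent over HTTP. It uses only the Python standard library. The helper name `pcm_to_wav_bytes` is hypothetical, and this is not Pipecat's internal implementation, only an illustration of the "audio segment as a wav file" step described above.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM audio in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at 16 kHz mono stands in for a VAD-delimited segment.
segment = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav_bytes(segment)
```

The resulting bytes could then be used as the body of an HTTP POST request. Building and uploading a file per segment is part of why segmented services add latency compared with a persistent streaming connection.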
## Supported STT Services

Pipecat supports a wide range of STT providers to fit different needs and budgets:

View the complete list of supported speech-to-text providers

Popular options include:

* Fast, accurate streaming STT with excellent real-time performance
* Real-time streaming STT with strong multilingual support and language hints
* AI-powered transcription with speaker diarization and sentiment analysis
* Advanced speech recognition with strong accent and dialect handling
* Low-latency real-time transcription from the Cartesia voice platform
* Real-time streaming STT with a low-latency transcription API

## STT Configuration

### Service-Specific Configuration

Each STT service has its own customization options. Refer to the specific service documentation for details:

Explore configuration options for each supported STT provider

For example, let's look at configuring the **DeepgramSTTService** using the `LiveOptions` class:

```python
import os

from deepgram import LiveOptions

from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.transcriptions.language import Language

# Configure using LiveOptions for full control
live_options = LiveOptions(
    model="nova-2",
    language=Language.EN_US,
    interim_results=True,    # Enable interim transcripts
    punctuate=True,          # Add punctuation
    profanity_filter=True,   # Filter profanity
    vad_events=False,        # Use pipeline VAD instead
)

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=live_options,
)
```

### STTService Base Class Configuration

All STT services inherit from the `STTService` base class. The base class provides shared configuration options with sensible defaults:

```python
stt = YourSTTService(
    # Service-specific options...
    audio_passthrough=True,  # Pass audio frames downstream (recommended)
    sample_rate=16000,       # Audio sample rate (better set in PipelineParams)
)
```

**Key options:**

* **`audio_passthrough=True`**: Allows audio frames to continue downstream to other processors (like audio recording)
* **`sample_rate`**: Audio sampling rate. Best practice is to **set `audio_in_sample_rate` in `PipelineParams` for consistency**

Setting `audio_passthrough=False` will stop audio frames from being passed downstream, which may break audio recording or other audio-dependent processors.

### Pipeline-Level Audio Configuration

Instead of setting sample rates on individual services, configure them pipeline-wide:

```python
worker = PipelineWorker(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # All input processors use this rate
        audio_out_sample_rate=24000,  # All output processors use this rate
    ),
)
```

This ensures all audio processors use consistent sample rates without manual configuration.

Always set audio sample rates in `PipelineParams` to avoid mismatches between different audio processors. This simplifies configuration and ensures consistent audio quality across your pipeline.

## Multilingual Transcription

Many STT services in Pipecat default to `Language.EN` (English). If you need to transcribe speech in other languages or let the model auto-detect the spoken language, you can enable multilingual support.
However, providers implement this differently:

**`language=None`**: Whisper-based services (Groq, OpenAI, local Whisper) and ElevenLabs support automatic language detection when no language is specified:

```python
import os

from pipecat.services.groq.stt import GroqSTTService

stt = GroqSTTService(
    api_key=os.getenv("GROQ_API_KEY"),
    settings=GroqSTTService.Settings(
        language=None,  # Auto-detect language
    ),
)
```

**`language="multi"`**: Deepgram uses a special `"multi"` language code to enable multilingual transcription:

```python
import os

from pipecat.services.deepgram.stt import DeepgramSTTService

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    settings=DeepgramSTTService.Settings(
        language="multi",  # Enable multilingual mode
    ),
)
```

**Language array**: Google Cloud STT accepts a list of languages for multi-language recognition. See the [Google STT docs](/api-reference/server/services/stt/google) for details.

Some services have additional multilingual features. For example, Soniox supports language hints, AssemblyAI offers a dedicated multilingual model, and Speechmatics supports bilingual transcription. See individual service docs for details.
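The three conventions above can be summarized in a small lookup helper. This is only an illustrative sketch: the function `multilingual_language_value` and the provider labels are hypothetical and not part of Pipecat. It returns the value you would pass as the language setting, without constructing any service.

```python
from typing import List, Optional, Union

# Hypothetical labels for services that auto-detect when language is None.
AUTO_DETECT_PROVIDERS = {"groq", "openai", "whisper", "elevenlabs"}


def multilingual_language_value(
    provider: str, languages: Optional[List[str]] = None
) -> Union[None, str, List[str]]:
    """Return the language setting that enables multilingual mode (sketch)."""
    name = provider.lower()
    if name in AUTO_DETECT_PROVIDERS:
        return None  # No language specified: the service auto-detects
    if name == "deepgram":
        return "multi"  # Special multilingual language code
    if name == "google":
        if not languages:
            raise ValueError("Google Cloud STT needs an explicit list of languages")
        return list(languages)
    raise ValueError(f"No multilingual convention recorded for {provider!r}")
```

Keeping this mapping in one place can make it easier to swap providers without forgetting that each one expresses multilingual mode differently.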
## Best Practices

### Enable Interim Results

When available, enable interim transcripts for a better user experience:

```python
stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        interim_results=True,
    ),
)
```

**Benefits:**

* Notifies context aggregation that more text is coming
* Prevents premature LLM completions
* Enables interruption detection
* Improves conversation flow

### Enable Punctuation and Formatting

Use punctuation when available for better LLM comprehension:

```python
stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        punctuate=True,         # Adds punctuation
        profanity_filter=True,  # Optional content filtering
    ),
)
```

**Benefits:**

* Professional-looking transcripts
* Better LLM comprehension
* Eliminates post-processing needs
* Improved context understanding

### Use Local VAD

While many STT services provide Voice Activity Detection, use Pipecat's local Silero VAD for better performance:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer

# Configure in context aggregator
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```

**Advantages:**

* **150-200ms faster** speech detection (no network round trip)
* More responsive conversation flow
* Better interruption handling
* Reduced latency overall

### Tune STT Latency

Each STT service has a measured P99 latency for delivering final transcripts after the user stops speaking. Turn stop strategies use this value to decide how long to wait before ending the user's turn. If you notice the bot responding too early (cutting off the user) or too late (long pauses), tuning this value can help.
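As a rough illustration of what a P99 latency means, the sketch below computes it from a list of your own measurements using the nearest-rank method. The sample values are made up, the helper name `p99_latency` is hypothetical, and the nearest-rank method is an assumption here, not necessarily how Pipecat derives its defaults.

```python
import math
from typing import Sequence


def p99_latency(samples_s: Sequence[float]) -> float:
    """Return the 99th-percentile latency using the nearest-rank method."""
    if not samples_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_s)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Made-up measurements: seconds between end of speech and the final transcript.
measured = [0.21, 0.25, 0.19, 0.33, 0.28, 0.24, 0.95, 0.22, 0.27, 0.30]
print(f"P99 latency: {p99_latency(measured):.2f}s")
```

With few samples the P99 is dominated by the slowest observation, so collect a large number of measurements from your real deployment before overriding a default.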
Learn about TTFS latency, see default values for every STT service, and find out how to measure and override them for your deployment.

## Key Takeaways

* **Pipeline placement matters** - STT must come after transport input, before context processing
* **Service types differ** - streaming services have lower latency than segmented
* **Services are modular** - swap providers with minimal code changes
* **Best practices improve performance** - use interim results, formatting, and local VAD
* **Configuration affects quality** - proper setup significantly impacts transcription accuracy

## What's Next

Now that you understand speech recognition, let's explore how to manage conversation context and memory in your voice AI bot.

Learn how to handle conversation history and context in your pipeline.