Live bilingual EN/ES speech-to-text, built so a Spanish-speaking congregation could follow along with an English sermon in real time — on a laptop, no cloud.
A local church holds English services with a growing number of Spanish-speaking attendees. Hiring a human interpreter every week wasn't practical. Commercial live-translation tools required constant cloud connectivity, a monthly SaaS bill, or both, and they handed sensitive pastoral audio to a third party.
What they needed: real-time EN → ES captions displayed at the back of the sanctuary, running on whatever laptop was already at the sound desk, with no outbound traffic after the weights are downloaded.
The system is a single Python process with four stages wired together by in-memory queues, then fanned out to the display over WebSocket so the projector laptop can be anywhere on the LAN.
┌─────────────┐    ┌─────────────────┐    ┌─────────────────────┐    ┌────────────┐
│  mic input  │───▶│   whisper asr   │───▶│      marian-mt      │───▶│ websocket  │
│ audio · vad │    │  en text · MLX  │    │  en→es · gemma fb   │    │  display   │
└─────────────┘    └─────────────────┘    └─────────────────────┘    └────────────┘
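The queue wiring between stages can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_stage`, `build_pipeline`, `fake_asr`, and `fake_translate` are hypothetical names, and the two placeholder stages stand in for the real ASR and translation calls.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def run_stage(work, inbox, outbox):
    """Pull items from inbox, apply work, push results to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        try:
            outbox.put(work(item))
        except Exception as exc:
            # Drop the failed chunk; live captions should keep flowing.
            print(f"stage {work.__name__} dropped an item: {exc}")


def build_pipeline(stages, maxsize=8):
    """Chain stage functions with bounded queues.

    Returns (first_inbox, last_outbox, threads).
    """
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(stages) + 1)]
    threads = []
    for work, inbox, outbox in zip(stages, queues, queues[1:]):
        t = threading.Thread(target=run_stage, args=(work, inbox, outbox), daemon=True)
        t.start()
        threads.append(t)
    return queues[0], queues[-1], threads


# Placeholder stages (assumptions) standing in for Whisper and the translator.
def fake_asr(chunk):
    return f"text({chunk})"


def fake_translate(text):
    return f"es({text})"


def run_demo(chunks):
    """Feed chunks through the placeholder stages and collect the output in order."""
    inbox, outbox, threads = build_pipeline([fake_asr, fake_translate])

    def feed():
        for chunk in chunks:
            inbox.put(chunk)
        inbox.put(_STOP)

    # Feed from a separate thread so bounded queues cannot deadlock the demo.
    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()
    results = []
    while (item := outbox.get()) is not _STOP:
        results.append(item)
    feeder.join()
    for t in threads:
        t.join()
    return results
```

Bounded queues give natural back-pressure: if translation falls behind, the ASR stage blocks instead of piling up stale captions. One thread per stage also keeps output in the same order as the input.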
Whisper-small runs faster than real time on an M-series Mac. Cloud ASR would have been faster and more accurate, but running without internet was a hard requirement. Accuracy is tuned with domain-specific prompts rather than a larger model.
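As a rough illustration of prompt-based tuning, the sketch below builds a short vocabulary prompt and passes it to Whisper as `initial_prompt`. It assumes the `mlx_whisper` package and the `mlx-community/whisper-small-mlx` weights; the write-up does not name either. The helper names, the glossary wording, and the `max_chars` cap are hypothetical.

```python
def build_initial_prompt(terms, max_chars=600):
    """Join glossary terms into a short priming sentence for Whisper.

    Whisper only conditions on a limited prompt window, so the text is
    capped; max_chars is a rough stand-in for the real token limit.
    """
    prefix = "Sermon vocabulary: "
    kept = []
    length = len(prefix)
    for term in terms:
        extra = len(term) + 2  # term plus ", " separator
        if length + extra > max_chars:
            break
        kept.append(term)
        length += extra
    return prefix + ", ".join(kept) + "."


def transcribe_chunk(audio, prompt, model_repo="mlx-community/whisper-small-mlx"):
    """Transcribe one audio chunk (file path or float32 array) with a domain prompt."""
    # Imported lazily so the prompt helper works on machines without MLX.
    import mlx_whisper

    result = mlx_whisper.transcribe(
        audio,
        path_or_hf_repo=model_repo,      # assumed weights repo, not from the write-up
        initial_prompt=prompt,           # biases decoding toward the listed terms
        language="en",                   # skip language detection on short chunks
        condition_on_previous_text=False,
    )
    return result["text"].strip()
```

A call such as `transcribe_chunk("chunk.wav", build_initial_prompt(["Nazareth", "Galilee"]))` would then nudge the decoder toward those spellings without changing the model.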
Audio is cut into chunks of roughly 2 s at voice-activity boundaries and transcribed one chunk at a time. Waiting for a chunk boundary adds end-to-end latency (~1.2 s median), but it keeps the implementation simple and resilient to dropped chunks.
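The chunking idea can be sketched with a simple energy threshold standing in for the voice-activity detector. The write-up does not say which VAD is used, so treat this as an assumption. The function names and all parameter values other than the ~2 s target (including the `max_ms` cap) are illustrative.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def chunk_on_silence(frames, frame_ms=20, target_ms=2000, max_ms=4000,
                     silence_ms=200, threshold=0.01):
    """Yield lists of frames, cutting at a pause once a chunk reaches target_ms.

    A chunk is also cut at max_ms if no pause shows up, so latency stays bounded.
    """
    chunk = []
    has_speech = False
    silent_run = 0  # consecutive quiet frames at the tail of the chunk
    for frame in frames:
        chunk.append(frame)
        if frame_rms(frame) < threshold:
            silent_run += 1
        else:
            silent_run = 0
            has_speech = True
        duration = len(chunk) * frame_ms
        at_pause = silent_run * frame_ms >= silence_ms
        if (duration >= target_ms and at_pause) or duration >= max_ms:
            if has_speech:  # skip chunks that contain only silence
                yield chunk
            chunk, silent_run, has_speech = [], 0, False
    if has_speech:
        yield chunk  # flush whatever speech is left at end of stream
```

Cutting at pauses avoids splitting words across chunks, and the hard cap is what bounds worst-case latency when a speaker does not pause.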
MarianMT's EN-ES pair is tiny and fast; TranslateGemma produces better prose but is heavier. A runtime flag swaps between them depending on host hardware.
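A backend switch of this kind might look like the sketch below. The `--translator` flag name, the `BACKENDS` registry, and the factory functions are hypothetical. The MarianMT path assumes the Hugging Face `transformers` package and the `Helsinki-NLP/opus-mt-en-es` checkpoint. The TranslateGemma loader is left as a stub because the write-up does not describe it.

```python
import argparse


def make_marian(model_name="Helsinki-NLP/opus-mt-en-es"):
    """Return a translate(text) callable backed by MarianMT."""
    # Imported lazily: transformers is heavy and only needed for this backend.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(text):
        batch = tokenizer([text], return_tensors="pt", padding=True)
        output = model.generate(**batch)
        return tokenizer.decode(output[0], skip_special_tokens=True)

    return translate


def make_gemma():
    """Stub for the heavier backend; its loading code is not in the write-up."""
    raise NotImplementedError("wire in the TranslateGemma loader here")


# Hypothetical registry mapping flag values to backend factories.
BACKENDS = {"marian": make_marian, "gemma": make_gemma}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="EN to ES caption translator")
    parser.add_argument("--translator", choices=sorted(BACKENDS), default="marian",
                        help="translation backend to load at startup")
    return parser.parse_args(argv)


def build_translator(name, backends=BACKENDS):
    """Look up the factory for the chosen backend and build it."""
    return backends[name]()
```

Because every backend returns the same `translate(text)` callable, the rest of the pipeline does not need to know which model is loaded.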
The projector laptop just opens a URL. No screen-sharing, no OBS scene — any browser on the LAN becomes a caption screen. Trivial to restart, trivial to fail over.
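The fan-out to browsers can be sketched as a small hub that pushes each caption to every connected client. This is an assumption-laden illustration: `CaptionHub` and the JSON message shape are hypothetical, and clients are any objects with an async `send` method, which is the interface that connections from the third-party `websockets` package expose.

```python
import asyncio
import json


class CaptionHub:
    """Tracks connected display clients and pushes each caption to all of them."""

    def __init__(self):
        self.clients = set()

    def add(self, client):
        self.clients.add(client)

    def remove(self, client):
        self.clients.discard(client)

    async def broadcast(self, english, spanish):
        """Send one caption pair to every client; drop clients whose send fails."""
        message = json.dumps({"en": english, "es": spanish}, ensure_ascii=False)
        clients = list(self.clients)
        # Send concurrently so a slow or dead display does not stall the others.
        results = await asyncio.gather(
            *(c.send(message) for c in clients), return_exceptions=True
        )
        for client, result in zip(clients, results):
            if isinstance(result, Exception):
                self.remove(client)  # the browser can rejoin by reloading the URL
        return message


# With the `websockets` package, a connection handler could register clients
# roughly like this (not run here):
#
#   async def handler(ws):
#       hub.add(ws)
#       try:
#           await ws.wait_closed()
#       finally:
#           hub.remove(ws)
```

Dropping a client on a failed send is what makes restarts cheap: the server keeps no per-display state, so a reloaded page simply reconnects and picks up at the next caption.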
Runs at weekly services on a single MacBook. Spanish-speaking attendees can follow the sermon in real time without needing a human interpreter or a cloud connection. Median latency is about 1.2 s; worst-case spikes are bounded by the chunk size.
Next up: speaker adaptation (prime Whisper with the pastor's voice), glossary pinning for religious terminology, and a second display path for individual phone screens over the same WebSocket.