$ cat ./projects/stark-translate/INTENT.md

stark-translate

Deployed weekly

Live bilingual EN/ES speech-to-text, built so a Spanish-speaking congregation could follow along with an English sermon in real time — on a laptop, no cloud.

Role: Builder
Stack: Python
Latency: ~1.2 s
Status: Deployed

§01 · problem path
./problem.md why does this exist? who benefits? what would happen if it didn't?

congregation_split_by_language

A local church holds English services with a growing number of Spanish-speaking attendees. Hiring a human interpreter every week wasn't practical. Commercial live-translation tools required constant cloud connectivity, a monthly SaaS bill, or both, and they handed sensitive pastoral audio to a third party.

What they needed: real-time EN → ES captions displayed at the back of the sanctuary, running on whatever laptop was already at the sound desk, with no outbound traffic after the weights are downloaded.

§02 · architecture path
./pipeline.py single process, 4 stages, in-memory queues, ws fan-out.

four_stages_one_pipe

The system is a single Python process with four stages wired together by in-memory queues, then fanned out to the display over WebSocket so the projector laptop can be anywhere on the LAN.

┌─────────────┐    ┌─────────────────┐    ┌─────────────────────┐    ┌────────────┐
│  mic input  │───▶│   whisper asr   │───▶│     marian-mt       │───▶│  websocket │
│ audio · vad │    │  en text · MLX  │    │  en→es · gemma fb   │    │  display   │
└─────────────┘    └─────────────────┘    └─────────────────────┘    └────────────┘

› Python 3.11 · host runtime · single process
› Whisper · ASR · whisper-small · on-device
› MarianMT · NMT · en-es · ~80 MB
› TranslateGemma · NMT fallback · prose-quality, heavier
› WebSocket · fan-out · LAN-only display path
› MLX · Apple silicon backend
› sounddevice · capture · PortAudio bindings (CoreAudio on macOS)
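
As a rough illustration of the queue wiring described above, the sketch below connects two stub stages with `queue.Queue` and one thread per stage. The names `stage`, `run_pipeline`, and `_STOP` are hypothetical, the `asr` and `translate` callables stand in for Whisper and MarianMT, and microphone capture and the WebSocket fan-out are left out. The real `pipeline.py` may be organized differently.

```python
import queue
import threading
from typing import Callable, Iterable, List

_STOP = object()  # sentinel that tells a stage to shut down (hypothetical)


def stage(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
    """Start a thread that applies fn to each item from inbox and forwards the result."""
    def worker() -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                outbox.put(_STOP)  # propagate shutdown to the next stage
                return
            outbox.put(fn(item))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def run_pipeline(chunks: Iterable, asr: Callable, translate: Callable) -> List:
    """Push chunks through asr -> translate and collect the captions in order."""
    audio_q: queue.Queue = queue.Queue()
    text_q: queue.Queue = queue.Queue()
    caption_q: queue.Queue = queue.Queue()
    threads = [stage(asr, audio_q, text_q), stage(translate, text_q, caption_q)]

    for chunk in chunks:  # stands in for the mic/VAD stage
        audio_q.put(chunk)
    audio_q.put(_STOP)

    captions = []  # stands in for the display fan-out
    while (item := caption_q.get()) is not _STOP:
        captions.append(item)
    for thread in threads:
        thread.join()
    return captions
```

With stub callables, `run_pipeline(["a", "b"], asr=str.upper, translate=lambda s: s + "!")` returns the captions in input order, because each stage has a single worker reading from a FIFO queue.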

§03 · decisions path
./CHANGELOG.md picked → rejected. each line is a deliberate trade.

tradeoffs_actually_made

a3f1c92 · +34 −12

On-device, not cloud.

Whisper-small runs faster than real time (>1×) on an M-series Mac. Cloud ASR would have been faster and more accurate, but running with no internet connection was a hard requirement. Accuracy tuning happens via domain-specific prompts rather than a larger model.
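
One way to picture the domain-specific prompting: the openai-whisper package accepts an `initial_prompt` string in `transcribe`, and the sketch below assumes the backend used here takes a similar option. The helper `build_initial_prompt`, its `max_chars` budget, and the example terms are all hypothetical. The project's actual prompts are not shown in this write-up.

```python
from typing import Iterable


def build_initial_prompt(terms: Iterable[str], max_chars: int = 600) -> str:
    """Join unique glossary terms into one prompt string, stopping at max_chars."""
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(term) + (2 if kept else 0)  # 2 accounts for the ", " separator
        if length + extra > max_chars:
            break  # keep the prompt short; Whisper only reads a limited prompt
        seen.add(key)
        kept.append(term)
        length += extra
    return ", ".join(kept)


# Hypothetical sermon vocabulary; the result would be passed as initial_prompt.
prompt = build_initial_prompt(["Nehemiah", "nehemiah", " Ephesians ", ""])
```

Here `prompt` ends up as `"Nehemiah, Ephesians"`, which nudges the decoder toward spellings it might otherwise miss.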

b1ee047 · +78 −03

Chunked streaming, not true continuous.

Audio is cut at voice-activity boundaries into chunks of about 2 s, and each chunk is transcribed on its own. This adds end-to-end latency (~1.2 s median) but keeps the implementation simple and resilient to dropped chunks.
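
A minimal sketch of this kind of chunking, assuming a simple energy threshold as the voice-activity test (the write-up does not say which VAD is used). The function names and the `frame_s`, `silence_s`, and `threshold` values are hypothetical; only the ~2 s cap comes from the text above.

```python
from typing import Iterable, Iterator, List

Frame = List[float]


def rms(frame: Frame) -> float:
    """Root-mean-square level of one frame of samples."""
    return (sum(x * x for x in frame) / len(frame)) ** 0.5 if frame else 0.0


def chunk_on_silence(
    frames: Iterable[Frame],
    frame_s: float = 0.1,      # duration of one frame (assumed)
    max_chunk_s: float = 2.0,  # hard cap on chunk length
    silence_s: float = 0.3,    # pause length that closes a chunk (assumed)
    threshold: float = 0.01,   # RMS level treated as speech (assumed)
) -> Iterator[List[Frame]]:
    """Yield lists of frames, cut at a pause or when the cap is reached."""
    max_frames = max(1, round(max_chunk_s / frame_s))
    silence_frames = max(1, round(silence_s / frame_s))
    chunk: List[Frame] = []
    quiet = 0  # consecutive silent frames inside the current chunk
    for frame in frames:
        if rms(frame) >= threshold:
            quiet = 0
        elif chunk:
            quiet += 1
        else:
            continue  # ignore silence before any speech has started
        chunk.append(frame)
        if len(chunk) >= max_frames or quiet >= silence_frames:
            yield chunk
            chunk, quiet = [], 0
    if chunk:
        yield chunk  # flush whatever is left when the stream ends
```

Because a chunk is always closed once it reaches `max_chunk_s`, the wait before a chunk reaches the recognizer is bounded even if the speaker never pauses.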

4d8c3a1 · +22 −41

MarianMT for EN→ES, with TranslateGemma as a quality fallback.

MarianMT's EN-ES pair is tiny and fast; TranslateGemma produces better prose but is heavier. A runtime flag swaps between them depending on host hardware.
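
The runtime switch could look something like the sketch below. The flag name `--mt-backend`, the loader functions, and the registry are hypothetical, and the loaders return stubs rather than real models so the example stays self-contained.

```python
import argparse
from typing import Callable, Dict, List, Optional

Translator = Callable[[str], str]


def load_marian() -> Translator:
    # Stub: a real loader would wrap the MarianMT en-es model here.
    return lambda text: f"[marian] {text}"


def load_gemma() -> Translator:
    # Stub: a real loader would wrap the heavier TranslateGemma model here.
    return lambda text: f"[gemma] {text}"


# Map flag values to loaders so only the chosen model is ever loaded.
BACKENDS: Dict[str, Callable[[], Translator]] = {
    "marian": load_marian,
    "gemma": load_gemma,
}


def build_translator(argv: Optional[List[str]] = None) -> Translator:
    """Parse the (hypothetical) --mt-backend flag and load that backend."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mt-backend", choices=sorted(BACKENDS), default="marian")
    args = parser.parse_args(argv)
    return BACKENDS[args.mt_backend]()
```

Calling `build_translator(["--mt-backend", "gemma"])` returns the heavier backend, while an empty argument list falls back to the small one. Keeping loaders in a registry means the unused model never occupies memory on a constrained host.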

9f02e44 · +12 −08

WebSocket display, not a rendered overlay.

The projector laptop just opens a URL. No screen-sharing, no OBS scene — any browser on the LAN becomes a caption screen. Trivial to restart, trivial to fail over.
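
The fan-out idea can be sketched without committing to a particular WebSocket library: each connected browser gets its own queue, and every new caption is copied into all of them. `CaptionHub` and its methods are hypothetical names, and the WebSocket handler that would drain each queue and send to the browser is omitted.

```python
import asyncio
from typing import Set


class CaptionHub:
    """Copies each caption to every connected display (transport not included)."""

    def __init__(self) -> None:
        self._clients: Set[asyncio.Queue] = set()

    def connect(self) -> asyncio.Queue:
        """Register a display and return the queue its handler should read from."""
        q: asyncio.Queue = asyncio.Queue()
        self._clients.add(q)
        return q

    def disconnect(self, q: asyncio.Queue) -> None:
        """Stop delivering captions to a display that has gone away."""
        self._clients.discard(q)

    def publish(self, caption: str) -> None:
        """Deliver one caption to every currently connected display."""
        for q in self._clients:
            q.put_nowait(caption)


async def demo() -> list:
    hub = CaptionHub()
    projector, phone = hub.connect(), hub.connect()
    hub.publish("hola")
    hub.disconnect(phone)  # a display dropping out does not affect the others
    hub.publish("mundo")
    return [projector.get_nowait(), projector.get_nowait(), phone.qsize()]
```

Running `asyncio.run(demo())` shows the projector receiving both captions while the disconnected phone holds only the first. A restarted display simply calls `connect()` again and picks up from the next caption.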

§04 · outcome path
./STATE.md running. weekly. quiet sundays.

what_it_does_now

Runs weekly at services on a single MacBook. Spanish-speaking attendees can follow the sermon in real time without needing a human interpreter or a cloud connection. Median latency is about 1.2 s; worst-case spikes are bounded by the chunk size.

Next up: speaker adaptation (prime Whisper with the pastor's voice), glossary pinning for religious terminology, and a second display path for individual phone screens over the same WebSocket.
