Jan 14, 2026

Senior Telecom / Voice AI Engineer – SIP, RTP, Whisper, LLM (Outbound Calling System)

Job / Advertisement Description

We’re building a real-time AI voice calling engine for outbound retail calls. Think less “chatbot” and more telecom-grade voice system: live calls, streaming audio, instant hangups, low latency, and AI that actually sounds human.

This is a backend / systems role only. No frontend, no dashboards, no UI work. If you enjoy working close to the metal (SIP, RTP, audio streams, GPUs, latency tuning), this project will be fun. If your background is mostly frontend or API glue, this probably isn’t the right fit.

What You’ll Be Building

A fully automated outbound calling system that:

- Places outbound SIP calls
- Detects humans vs voicemail in real time
- Streams audio both ways
- Runs STT → LLM → TTS live during the call
- Hangs up fast on bad leads
- Triggers bookings automatically via webhooks

Inbound calling may come later, but outbound is the focus.

Core Skills We’re Looking For

Telecom & Call Control (Main Focus)

You should be comfortable with:

- SIP (INVITE, ACK, BYE, CANCEL, re-INVITE)
- RTP audio streams
- VoIP fundamentals
- Answering Machine Detection (AMD)
- Call state handling and lifecycle control

You’ll be responsible for:

- An outbound SIP dialer
- Human vs voicemail detection
- Fast hang-up logic (within ~3 seconds)
- Call duration limits
- Custom call control logic

Tools we expect you to know (or similar):

- Asterisk or FreeSWITCH
- SIP carriers like Telnyx (preferred) or Twilio

This is not just a Twilio + webhook project.

Real-Time Audio Pipeline

Latency matters a lot here.
Target: end-to-end latency under 700 ms

You’ll work with:

- RTP ↔ PCM conversion
- Streaming audio chunking
- Low-latency buffering
- Barge-in / interruption handling
- Silence detection and trimming

Speech-to-Text (STT)

- Self-hosted Whisper or faster-whisper
- Streaming transcription (not batch)
- GPU-optimized inference (INT8 / FP16)
- Partial transcripts and early cutoffs

LLM / Conversation Logic

- Local LLM inference preferred (Llama 3 8B quantized or Mixtral)
- Optional GPT or Grok if cost allows
- Stateless conversation flow
- Short, controlled responses
- Hard stop logic (“not interested” → hang up)
- Booking trigger logic

Text-to-Speech (TTS)

We currently use ElevenLabs, but want to move off recurring TTS costs.

You’ll:

- Clone an ElevenLabs voice locally
- Train a TTS model using 30–60 minutes of clean audio
- Stream TTS in chunks
- Normalize audio

Preferred tools:

- XTTS
- Piper (fine-tuned)
- Phrase caching for speed

Booking & Automation

- Webhook-based booking flow
- Google Sheets integration
- Calendar availability checks
- Slot validation and confirmation

Email or call transfer is optional.

Scaling & Reliability

You should be comfortable with:

- Linux
- Docker
- GPU deployment
- Call queues
- Crash recovery
- Rate limiting
- Logging and metrics

Target scale: 50–100 concurrent calls

Cost Awareness (Important)

This system needs to run very efficiently.

- Target usage: ~5,000 minutes/day
- Target cost goal: ~$0.008 per minute (all-in)

This includes telephony, STT, TTS, LLM, and compute. You don’t need to promise this on day one, but you should understand how to architect toward it.
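
To make the cost target concrete, here is a rough sketch of the arithmetic behind it. The usage and rate figures are the targets stated above; the per-component split in `example_split` is a made-up placeholder for illustration only, not real pricing, and the function names are ours.

```python
# Targets taken from the posting.
MINUTES_PER_DAY = 5_000
TARGET_COST_PER_MINUTE = 0.008  # USD, all-in


def daily_budget(minutes=MINUTES_PER_DAY, rate=TARGET_COST_PER_MINUTE):
    """All-in daily spend (USD) implied by the target rate."""
    return round(minutes * rate, 2)


def fits_target(component_costs, rate=TARGET_COST_PER_MINUTE):
    """True if the per-minute component costs sum to no more than the target."""
    # Small tolerance so float rounding does not flip a borderline result.
    return sum(component_costs.values()) <= rate + 1e-12


# HYPOTHETICAL per-minute split (USD). Placeholder values, not real prices.
example_split = {
    "telephony": 0.004,
    "stt": 0.001,
    "tts": 0.001,
    "llm": 0.001,
    "compute": 0.0005,
}

if __name__ == "__main__":
    per_day = daily_budget()
    print(f"Daily budget:  ${per_day:.2f}")
    print(f"30-day budget: ${per_day * 30:.2f}")
    print(f"Example split fits target: {fits_target(example_split)}")
```

At the stated targets, that works out to about $40 per day, or roughly $1,200 over 30 days, to cover every component of the system.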

Engagement Details

- Long-term collaboration preferred
- Milestones are fine
- Budget is flexible for the right engineer
- We care more about correctness, latency, and stability than rushing

This Role Is a Great Fit If You…

- Have real SIP / RTP / VoIP experience
- Have worked with real-time audio or voice systems
- Enjoy performance tuning and system design
- Like building things that actually run in production

Probably Not a Fit If You…

- Focus mostly on frontend work
- Only do API-level integrations
- Have never worked with live audio or telecom

How to Apply

Please include:

- Your experience with SIP / RTP / VoIP
- A real-time voice or audio system you’ve worked on
- Experience with Whisper or streaming STT
- How you think about latency and cost optimization
- Tools you like for telephony and audio

If this sounds interesting and matches your background, we’d love to talk.