We’re building a real-time AI voice calling engine for outbound retail calls.
Think less “chatbot” and more telecom-grade voice system:
live calls, streaming audio, instant hangups, low latency, and AI that actually sounds human.
This is a backend / systems role only.
No frontend, no dashboards, no UI work.
If you enjoy working close to the metal — SIP, RTP, audio streams, GPUs, latency tuning — this project will be fun.
If your background is mostly frontend or API glue, this probably isn’t the right fit.
What You’ll Be Building
A fully automated outbound calling system that:
Places outbound SIP calls
Detects humans vs voicemail in real time
Streams audio both ways
Runs STT → LLM → TTS live during the call
Hangs up fast on bad leads
Triggers bookings automatically via webhooks
Inbound calling may come later, but outbound is the focus.
Core Skills We’re Looking For
Telecom & Call Control (Main Focus)
You should be comfortable with:
SIP (INVITE, ACK, BYE, CANCEL, re-INVITE)
RTP audio streams
VoIP fundamentals
Answering Machine Detection (AMD)
Call state handling and lifecycle control
You’ll be responsible for:
An outbound SIP dialer
Human vs voicemail detection
Fast hang-up logic (within ~3 seconds)
Call duration limits
Custom call control logic
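To give a feel for the call control logic we mean, here is a minimal sketch of a per-call decision function. It is only one possible reading of the requirements above: the names (CallPolicy, next_action), the 180-second cap, and the choice to drop calls that have no AMD verdict by the deadline are all assumptions for illustration, not our actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    HANG_UP = "hang_up"


@dataclass
class CallPolicy:
    # Both limits are illustrative assumptions, not fixed requirements.
    amd_deadline_s: float = 3.0     # decide human vs voicemail within ~3 s
    max_duration_s: float = 180.0   # hypothetical hard cap on call length


def next_action(policy: CallPolicy, elapsed_s: float,
                amd_result: Optional[str]) -> Action:
    """Decide what to do with an answered call.

    amd_result is "human", "machine", or None while detection is running.
    """
    if amd_result == "machine":
        return Action.HANG_UP
    if amd_result is None and elapsed_s >= policy.amd_deadline_s:
        # No verdict by the deadline: drop the call (an assumed policy).
        return Action.HANG_UP
    if elapsed_s >= policy.max_duration_s:
        return Action.HANG_UP
    return Action.CONTINUE
```

In a real dialer this decision would be driven by Asterisk or FreeSWITCH events rather than polled, but the rules themselves stay this small.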
Tools we expect you to know (or similar):
Asterisk or FreeSWITCH
SIP carriers like Telnyx (preferred) or Twilio
This is not just a Twilio + webhook project.
Real-Time Audio Pipeline
Latency matters a lot here.
Target: end-to-end latency under 700ms
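One way to reason about that target is as a budget split across pipeline stages. The sketch below is purely illustrative: the stage names and per-stage numbers are placeholder assumptions, not measurements from our system.

```python
# Hypothetical per-stage budget in milliseconds (placeholders, not measurements).
BUDGET_MS = {
    "rtp_jitter_buffer": 60,
    "stt_partial": 200,
    "llm_first_token": 200,
    "tts_first_chunk": 150,
    "network_and_playout": 60,
}
TARGET_MS = 700


def headroom_ms(budget: dict, target: int = TARGET_MS) -> int:
    """Return how many milliseconds remain after summing every stage."""
    return target - sum(budget.values())


def over_budget(budget: dict, target: int = TARGET_MS) -> bool:
    """True when the stages together exceed the end-to-end target."""
    return headroom_ms(budget, target) < 0
```

With these placeholder numbers the stages sum to 670 ms, which leaves 30 ms of headroom. The point is that every stage has to stream, because no single stage can afford to wait for a full utterance.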
You’ll work with:
RTP ↔ PCM conversion
Streaming audio chunking
Low-latency buffering
Barge-in / interruption handling
Silence detection and trimming
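As a rough example of the RTP to PCM and silence detection steps, the sketch below decodes a G.711 mu-law payload to 16-bit linear samples and applies a simple RMS threshold. It assumes the carrier leg uses PCMU at 8 kHz with 20 ms packets (160 payload bytes); RTP header parsing is omitted, and the threshold of 500 is an arbitrary placeholder that would need tuning on real calls.

```python
import math
from typing import List, Sequence

BIAS = 0x84  # G.711 mu-law bias (132)


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are stored inverted
    t = ((u & 0x0F) << 3) + BIAS      # mantissa plus bias
    t <<= (u & 0x70) >> 4             # scale by the segment (exponent)
    return BIAS - t if u & 0x80 else t - BIAS


# Precompute all 256 values once; per-packet decoding is then a table lookup.
ULAW_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def decode_ulaw(payload: bytes) -> List[int]:
    """Decode an RTP payload (header already stripped) to PCM samples."""
    return [ULAW_TABLE[b] for b in payload]


def is_silence(samples: Sequence[int], rms_threshold: float = 500.0) -> bool:
    """Treat a frame as silence when its RMS energy is below the threshold."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < rms_threshold
```

A production pipeline would more likely use a proper voice activity detector than a fixed energy threshold, but the frame-by-frame shape is the same.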
Speech-to-Text (STT)
Self-hosted Whisper or faster-whisper
Streaming transcription (not batch)
GPU-optimized inference (INT8 / FP16)
Partial transcripts and early cutoffs
LLM / Conversation Logic
Local LLM inference preferred (quantized Llama 3 8B or Mixtral)
Optional GPT or Grok if cost allows
Stateless conversation flow
Short, controlled responses
Hard stop logic (“not interested” → hang up)
Booking trigger logic
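The hard stop and booking triggers can be illustrated with a deterministic check that runs on each transcript before the LLM is called, so a hang-up never waits on inference. This is a minimal sketch under that assumption; the phrase lists and the classify_turn name are hypothetical.

```python
import re

# Hypothetical phrase lists; real ones would be built from call transcripts.
HARD_STOP_PATTERNS = [
    r"\bnot interested\b",
    r"\bstop calling\b",
    r"\bremove me\b",
    r"\bdo not call\b",
]
BOOKING_PATTERNS = [
    r"\bbook (me|it|that)\b",
    r"\bschedule\b",
]


def classify_turn(transcript: str) -> str:
    """Return "hang_up", "book", or "continue" for one caller utterance."""
    text = transcript.lower()
    # Hard stops are checked first so they always win over booking phrases.
    if any(re.search(p, text) for p in HARD_STOP_PATTERNS):
        return "hang_up"
    if any(re.search(p, text) for p in BOOKING_PATTERNS):
        return "book"
    return "continue"
```

Because this works on partial transcripts too, a phrase like "not interested" can end the call before the caller finishes the sentence.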
Text-to-Speech (TTS)
We currently use ElevenLabs, but want to move off recurring TTS costs.
You’ll:
Clone an ElevenLabs voice locally
Train a TTS model using 30–60 minutes of clean audio
Stream TTS in chunks
Normalize audio
Preferred tools:
XTTS
Piper (fine-tuned)
Phrase caching for speed
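Phrase caching here means synthesizing common lines (greetings, confirmations) once and replaying the stored audio. A minimal sketch, assuming a synthesize callable that wraps whichever TTS engine is chosen; the PhraseCache class and its key normalization are hypothetical.

```python
from typing import Callable, Dict


class PhraseCache:
    """Cache synthesized audio by normalized phrase text."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        # synthesize is a stand-in for the real TTS call (XTTS, Piper, ...).
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Collapse case and whitespace so trivial variants share one entry.
        return " ".join(text.lower().split())

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._audio:
            self._audio[key] = self._synthesize(text)
        return self._audio[key]
```

Usage would look like cache = PhraseCache(my_tts_function) followed by cache.get("Hi, is this a good time?"). Only the first call for a phrase pays the synthesis cost.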
Booking & Automation
Webhook-based booking flow
Google Sheets integration
Calendar availability checks
Slot validation and confirmation
Email or call transfer is optional.
Scaling & Reliability
You should be comfortable with:
Linux
Docker
GPU deployment
Call queues
Crash recovery
Rate limiting
Logging and metrics
Target scale: 50–100 concurrent calls
Cost Awareness (Important)
This system needs to run very efficiently.
Target usage: ~5,000 minutes/day
Target cost: ~$0.008 per minute (all-in)
This includes telephony, STT, TTS, LLM, and compute.
You don’t need to promise this on day one — but you should understand how to architect toward it.
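To make those targets concrete, the arithmetic works out as follows. The 30-day month and the 8-hour calling window are assumptions added for illustration; the minutes and per-minute cost come from the targets above.

```python
from decimal import Decimal

MINUTES_PER_DAY = 5_000
TARGET_COST_PER_MINUTE = Decimal("0.008")  # USD, all-in

# Total daily spend allowed across telephony, STT, TTS, LLM, and compute.
daily_budget = MINUTES_PER_DAY * TARGET_COST_PER_MINUTE   # 40 USD per day
monthly_budget = daily_budget * 30                        # assumes a 30-day month


def average_concurrency(minutes_per_day: int, calling_window_hours: float) -> float:
    """Average simultaneous calls if the minutes are spread evenly over the window."""
    return minutes_per_day / (calling_window_hours * 60)
```

That is about $40 per day, or roughly $1,200 over 30 days, for everything combined. If the calls were spread evenly over an assumed 8-hour window, average_concurrency(5_000, 8) gives roughly 10 simultaneous calls on average, so the 50–100 concurrent call target describes peak load rather than steady state.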
Engagement Details
Long-term collaboration preferred
Milestones are fine
Budget is flexible for the right engineer
We care more about correctness, latency, and stability than rushing
This Role Is a Great Fit If You…
Have real SIP / RTP / VoIP experience
Have worked with real-time audio or voice systems
Enjoy performance tuning and system design
Like building things that actually run in production
Probably Not a Fit If You…
Focus mostly on frontend work
Only do API-level integrations
Have never worked with live audio or telecom
How to Apply
Please include:
Your experience with SIP / RTP / VoIP
A real-time voice or audio system you’ve worked on
Experience with Whisper or streaming STT
How you think about latency and cost optimization
Tools you like for telephony and audio
If this sounds interesting and matches your background, we’d love to talk.