Self-Hosting a Voice AI Stack on NVIDIA DGX Spark
How I replaced per-minute voice APIs with a self-hosted stack on dual NVIDIA DGX Sparks — faster-whisper, vLLM + Qwen3/GLM-4.5, Kokoro, LiveKit — and flattened the cost curve.

Over the past few months at Vital4U, I led the migration away from per-minute third-party voice APIs to a fully self-hosted AI voice platform running on two NVIDIA DGX Sparks. The result: predictable costs, lower end-to-end latency, and far more control over the call experience.
Why self-host at all?
Every minute of voice AI billed through a hosted provider is a minute that eats into margin. For a growing call volume, that bill scales linearly — and we were looking at six figures a year if nothing changed. Owning the stack on dedicated GPUs flips that curve flat after the capital spend.
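The break-even logic is simple arithmetic. A minimal sketch, with every number below being a hypothetical illustration (not Vital4U's actual rates, volume, or hardware cost):

```python
# Break-even sketch with HYPOTHETICAL numbers -- the per-minute rate, hardware
# cost, and call volume are illustrative placeholders, not real figures.
def breakeven_months(hardware_cost: float, per_minute_rate: float,
                     minutes_per_month: float, hosting_per_month: float) -> float:
    """Months until owning the hardware beats paying per minute."""
    monthly_savings = per_minute_rate * minutes_per_month - hosting_per_month
    if monthly_savings <= 0:
        raise ValueError("self-hosting never breaks even at this volume")
    return hardware_cost / monthly_savings

# e.g. $8,000 capital spend, $0.05/min hosted, 100k call-minutes/month,
# $300/month in power and overhead
months = breakeven_months(8_000, 0.05, 100_000, 300)
```

Past the break-even point, every additional minute is effectively free, which is what flattens the curve.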
The stack
- STT: `faster-whisper` for low-latency transcription on GPU.
- LLM: `vLLM` serving `Qwen3` and `GLM-4.5`, picked per-intent for cost/quality tradeoffs.
- TTS: `Kokoro` for fast, natural speech synthesis.
- Orchestration: self-hosted `LiveKit` (Go / Node.js / Rust) handling WebRTC sessions.
- Frontend: Svelte 5 dashboard for live monitoring and per-agent configuration.
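The components above chain into a per-turn pipeline: caller audio goes through STT, the transcript through the LLM, and the reply through TTS. A minimal sketch of that turn loop, where `transcribe`, `generate`, and `synthesize` are hypothetical stand-ins for the real faster-whisper, vLLM, and Kokoro clients:

```python
import asyncio

# Sketch of one conversational turn: STT -> LLM -> TTS. The three stage
# functions are HYPOTHETICAL stubs; in the real stack they call into
# faster-whisper, a vLLM endpoint, and Kokoro, and each stage streams
# into the next rather than awaiting a complete result.

async def transcribe(audio: bytes) -> str:
    return "where is my order"               # stub for faster-whisper

async def generate(prompt: str) -> str:
    return f"Let me look that up: {prompt}"  # stub for vLLM (Qwen3 / GLM-4.5)

async def synthesize(text: str) -> bytes:
    return text.encode()                     # stub for Kokoro

async def handle_turn(audio: bytes) -> bytes:
    """One caller turn, run stage by stage."""
    text = await transcribe(audio)
    reply = await generate(text)
    return await synthesize(reply)

speech = asyncio.run(handle_turn(b"\x00\x01"))
```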
Call routing as a finite state machine
Rather than one giant prompt, calls move through a node-graph FSM covering intent detection, order collection, account lookup, returns, and escalation. Each node emits a structured exit token, which keeps routing deterministic and debuggable.
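The routing idea can be sketched as a static transition table keyed on (node, exit token). The node names and tokens below are illustrative, not the production graph:

```python
# Node-graph FSM sketch: each node emits a structured exit token, and a
# static table maps (current node, token) -> next node. Because the table
# is data, routing is deterministic, inspectable, and editable without
# touching prompt text. Node/token names here are ILLUSTRATIVE.

TRANSITIONS = {
    ("intent", "order_status"):      "account_lookup",
    ("intent", "return_request"):    "returns",
    ("account_lookup", "found"):     "order_collection",
    ("account_lookup", "not_found"): "escalation",
}

def route(node: str, exit_token: str) -> str:
    """Deterministic routing; any unmapped (node, token) pair escalates."""
    return TRANSITIONS.get((node, exit_token), "escalation")
```

Falling back to escalation on an unknown token means a misbehaving LLM output degrades to a human handoff instead of an undefined state.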
RAG per project
Each customer gets an isolated vector DB knowledge base. A LangChain ingestion pipeline handles multi-format input (PDFs, docs, sheets, webpages) so their agent can speak to their product catalog, policies, and SOPs — not a generic one.
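The isolation guarantee comes from namespacing: a query can only ever search its own project's collection. A minimal sketch of that shape, with keyword overlap standing in for embedding similarity (swap in a real embedder and vector DB in practice):

```python
from collections import defaultdict

# Per-project isolation sketch: each customer gets its own collection, so
# retrieval for one project can never surface another project's documents.
# Keyword overlap is a TOY stand-in for embedding similarity.

collections: dict[str, list[str]] = defaultdict(list)

def ingest(project: str, chunks: list[str]) -> None:
    """Append pre-chunked text into the project's isolated collection."""
    collections[project].extend(chunks)

def retrieve(project: str, query: str, k: int = 1) -> list[str]:
    """Return the top-k chunks from THIS project only."""
    words = set(query.lower().split())
    scored = sorted(collections[project],
                    key=lambda c: len(words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

ingest("acme",   ["Returns accepted within 30 days.", "Shipping is free over $50."])
ingest("globex", ["Returns require a receipt."])
```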
What it took
Beyond the model serving, most of the work was pipeline engineering: keeping first-token latency under the threshold where a caller feels friction, handling barge-in cleanly, and building the FSM in a way that's editable without a redeploy. The Svelte 5 dashboard became the control tower where non-engineers can wire new flows.
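One pattern that makes barge-in clean is running TTS playback as a cancellable task: when voice activity detection flags caller speech, the playback task is cancelled mid-utterance. A minimal asyncio sketch of that idea, with illustrative names and timings:

```python
import asyncio

# Barge-in sketch: playback runs as a cancellable asyncio task; when the
# caller starts talking, we cancel it so the agent stops speaking
# immediately. Names and timings here are ILLUSTRATIVE.

async def play_tts(text: str, played: list[str]) -> None:
    for word in text.split():
        played.append(word)
        await asyncio.sleep(0.05)   # stand-in for streaming one audio frame

async def call_loop() -> list[str]:
    played: list[str] = []
    playback = asyncio.create_task(
        play_tts("your order shipped on Tuesday", played))
    await asyncio.sleep(0.12)       # caller interrupts partway through
    playback.cancel()               # barge-in: stop speaking now
    try:
        await playback
    except asyncio.CancelledError:
        pass                        # expected: playback was cut short
    return played

played = asyncio.run(call_loop())
```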
This project reinforced something I keep coming back to: when your unit economics depend on inference, owning the inference is the leverage point.
