Roberto Chavez Jr


Engineering · Apr 17, 2026

Self-Hosting a Voice AI Stack on NVIDIA DGX Spark

How I replaced per-minute voice APIs with a self-hosted stack on dual NVIDIA DGX Sparks — faster-whisper, vLLM + Qwen3/GLM-4.5, Kokoro, LiveKit — and flattened the cost curve.

AI · LLM · Voice · Self-Hosting

Over the past few months at Vital4U, I led the migration away from per-minute third-party voice APIs to a fully self-hosted AI voice platform running on two NVIDIA DGX Sparks. The result: predictable costs, lower end-to-end latency, and far more control over the call experience.

Why self-host at all?

Every minute of voice AI billed through a hosted provider is a minute that eats into margin. As call volume grows, that bill scales linearly — and we were looking at six figures a year if nothing changed. Owning the stack on dedicated GPUs flattens that curve after the capital spend.
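The break-even math is simple enough to sketch. All numbers below are hypothetical placeholders for illustration — not Vital4U's actual API rates or hardware costs:

```python
def breakeven_minutes(api_rate_per_min: float, capex: float,
                      selfhost_opex_per_min: float) -> float:
    """Call minutes at which self-hosting beats the per-minute API.

    Hosted cost:      api_rate * m       (scales linearly with volume)
    Self-hosted cost: capex + opex * m   (near-flat after the capital spend)
    """
    return capex / (api_rate_per_min - selfhost_opex_per_min)

# Hypothetical: $0.10/min hosted vs. $8,000 of hardware + $0.01/min to run it
m = breakeven_minutes(0.10, 8_000, 0.01)   # ≈ 88,889 minutes to break even
```

Past that crossover, every additional call minute is nearly free instead of metered.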

The stack

  • STT: faster-whisper for low-latency transcription on GPU.
  • LLM: vLLM serving Qwen3 and GLM-4.5, picked per-intent for cost/quality tradeoffs.
  • TTS: Kokoro for fast, natural speech synthesis.
  • Orchestration: self-hosted LiveKit (Go / Node.js / Rust) handling WebRTC sessions.
  • Frontend: Svelte 5 dashboard for live monitoring and per-agent configuration.
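The pipeline's overall shape matters as much as the individual components: each stage streams into the next so TTS can start speaking before the LLM finishes a full response. Here is a minimal sketch of that shape — the stage bodies are stubs standing in for faster-whisper, the vLLM endpoint, and Kokoro, not their real APIs:

```python
from typing import Iterator

def stt(audio_frames: Iterator[bytes]) -> Iterator[str]:
    for frame in audio_frames:      # stub: real stack runs faster-whisper on GPU
        yield frame.decode()

def llm(transcript: Iterator[str]) -> Iterator[str]:
    for utterance in transcript:    # stub: real stack calls vLLM (Qwen3 / GLM-4.5)
        yield f"reply to: {utterance}"

def tts(replies: Iterator[str]) -> Iterator[bytes]:
    for text in replies:            # stub: real stack synthesizes with Kokoro
        yield text.encode()

def voice_pipeline(audio_frames: Iterator[bytes]) -> Iterator[bytes]:
    # Generators chain lazily, so audio flows out as soon as each stage
    # produces its first chunk instead of waiting for the full turn.
    return tts(llm(stt(audio_frames)))
```

Because every stage is a generator, swapping a model only means replacing one stage body; the streaming contract between stages stays the same.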

Call routing as a finite state machine

Rather than one giant prompt, calls move through a node-graph FSM covering intent detection, order collection, account lookup, returns, and escalation. Each node emits a structured exit token, which keeps routing deterministic and debuggable.
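In code, that routing reduces to a lookup table keyed on (current node, exit token). The node names and tokens below are illustrative, not the production graph:

```python
# Each FSM node emits a structured exit token; routing is a pure lookup,
# so transitions are deterministic and trivially debuggable.
ROUTES = {
    ("intent_detection", "ORDER"):    "order_collection",
    ("intent_detection", "RETURN"):   "returns",
    ("intent_detection", "ACCOUNT"):  "account_lookup",
    ("order_collection", "COMPLETE"): "account_lookup",
    ("order_collection", "STUCK"):    "escalation",
}

def next_node(current: str, exit_token: str) -> str:
    # Any unmapped (node, token) pair escalates to a human instead of
    # letting the model improvise a transition.
    return ROUTES.get((current, exit_token), "escalation")
```

The key property: the LLM only ever chooses an exit token from a fixed set inside one node; it never free-forms the route itself.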

RAG per project

Each customer gets an isolated vector DB knowledge base. A LangChain ingestion pipeline handles multi-format input (PDFs, docs, sheets, webpages) so their agent can speak to their product catalog, policies, and SOPs — not a generic one.

What it took

Beyond the model serving, most of the work was pipeline engineering: keeping first-token latency under the threshold where a caller feels friction, handling barge-in cleanly, and building the FSM in a way that's editable without a redeploy. The Svelte 5 dashboard became the control tower where non-engineers can wire new flows.
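Barge-in is a good example of that pipeline work. The essential move is treating agent playback as a cancellable task: the moment the caller starts speaking, playback is cancelled mid-stream. A minimal asyncio sketch, with chunk sizes and timings purely illustrative:

```python
import asyncio

async def play_tts(chunks: list[str], spoken: list[str]) -> None:
    for chunk in chunks:
        spoken.append(chunk)        # stand-in for streaming an audio chunk out
        await asyncio.sleep(0.05)   # stand-in for that chunk's playback time

async def handle_turn() -> list[str]:
    spoken: list[str] = []
    playback = asyncio.create_task(play_tts(["a", "b", "c", "d"], spoken))
    await asyncio.sleep(0.12)       # caller barges in mid-playback
    playback.cancel()               # stop talking immediately
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return spoken

spoken = asyncio.run(handle_turn())  # typically cut short before all chunks play
```

In the real stack the cancel signal comes from the STT side detecting caller speech, but the control flow is the same: playback must always be interruptible, or the agent talks over people.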

This project reinforced something I keep coming back to: when your unit economics depend on inference, owning the inference is the leverage point.