
🎬 Valet Video
Real-time webcam streaming with AI object detection, tracking, and voice interaction
Home Lab Setup · GPU: RTX 5090 32GB · Latency Target: <100ms detection
The Problem
I had separate vision and voice services (Valet Visual and Valet Voice), but neither handled real-time video:
- No live video - Valet Visual processes static images
- No streaming - Can't analyze a webcam feed in real-time
- No voice control - Detection prompts had to be typed
- No object tracking - Each frame was independent
I wanted to point a webcam at something and say "track the red car" while seeing live bounding boxes.
The Solution
Valet Video combines computer vision and conversational voice AI in a single WebRTC-based service. Stream your webcam to the server, get live detection overlays, and use voice commands to control what gets tracked.
Browser (WebRTC) ←→ Valet Video ←→ GPU
       │                 │
Canvas Overlay      Vision + Voice
(detection viz)     (SAM3, Moshi)
Open http://localhost:9600/client in your browser, grant webcam access, and start streaming.
Key Features
- 🎥 WebRTC Streaming - Browser-based webcam streaming with sub-100ms target latency
- 🔍 Real-Time Detection - GroundingDINO for text-prompted detection, YOLO11 for fast detection
- 📍 Object Tracking - SAM3 video predictor for persistent tracking across frames
- 🎙️ Voice Control - Speak commands like "find the person in blue" or "track all cars"
- ⚙️ Quality Profiles - Multiple resolution/FPS profiles for performance tuning
- 🧠 Dynamic VRAM - Automatic model loading/unloading based on demand
Architecture
flowchart TB
    subgraph Browser["Browser Client"]
        WC[Webcam Stream]
        MIC[Microphone]
        OV[Canvas Overlay]
        SPK[Audio Output]
    end
    subgraph Service["Valet Video Service"]
        WR[WebRTC Gateway]
        VP[Vision Pipeline]
        AP[Voice Pipeline]
        CO[Coordinator]
        VM[VRAM Manager]
    end
    subgraph Models["AI Models"]
        GD[GroundingDINO]
        SAM[SAM3 Tracker]
        YOLO[YOLO11]
        MOSHI[Moshi 7B]
    end
    WC -->|Video Track| WR
    MIC -->|Audio Track| WR
    WR --> CO
    CO --> VP
    CO --> AP
    VP --> GD & SAM & YOLO
    AP --> MOSHI
    CO -->|DataChannel| OV
    AP -->|Audio Track| SPK
    VM -.->|Manages| Models
The service receives video and audio via WebRTC, processes frames through the vision pipeline, handles voice through Moshi, and sends detection results back to the browser via DataChannel for real-time overlay rendering.
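For a concrete picture of the gateway side, here's a minimal aiortc sketch: accept the browser's offer, pull frames off the incoming video track, and push results back over a DataChannel. The `pipeline.process()` call and the channel name are placeholders, not the service's real API.

```python
# Minimal WebRTC gateway sketch using aiortc. The `pipeline` object and
# its process() method are hypothetical stand-ins for the vision coordinator.
import asyncio
import json

from aiortc import RTCPeerConnection, RTCSessionDescription

async def handle_offer(offer_sdp: str, pipeline) -> str:
    """Accept a browser offer, consume its video track, stream results back."""
    pc = RTCPeerConnection()
    channel = pc.createDataChannel("detections")  # results flow back over this

    async def consume_video(track):
        while True:
            frame = await track.recv()              # av.VideoFrame from the browser
            img = frame.to_ndarray(format="bgr24")  # numpy array for the models
            results = pipeline.process(img)         # hypothetical vision call
            if channel.readyState == "open":
                channel.send(json.dumps(results))

    @pc.on("track")
    def on_track(track):
        if track.kind == "video":
            asyncio.ensure_future(consume_video(track))

    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    await pc.setLocalDescription(await pc.createAnswer())
    return pc.localDescription.sdp
```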
Vision Pipeline
The vision pipeline chains detection, segmentation, and tracking:
| Model | Purpose | VRAM |
|---|---|---|
| GroundingDINO | Text-prompted detection ("find the red car") | 2.5 GB |
| YOLO11 | Fast general detection (80 classes) | 1.5 GB |
| SAM3 | Video segmentation and tracking | 4.0 GB |
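The handoff between the models follows a detect-then-track loop. A rough sketch of the idea, with hypothetical `detector`/`tracker` wrappers standing in for the real model interfaces:

```python
# Sketch of the detect-then-track handoff. GroundingDINO proposes boxes
# for the current text prompt; SAM3's video predictor is seeded with them
# and carries objects frame to frame. The wrapper objects are hypothetical.

def process_frame(frame, prompt, detector, tracker, state):
    if state.needs_detection():                 # new prompt, or tracks lost
        boxes = detector.detect(frame, prompt)  # text-prompted detection
        tracker.reset(frame, boxes)             # seed SAM3 with fresh boxes
    return tracker.update(frame)                # masks + boxes for this frame
```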
Processing Modes
Different modes balance quality vs. speed:
| Mode | Detection Rate | Frame Skip | Use Case |
|---|---|---|---|
| realtime | Every frame | None | Maximum accuracy |
| balanced | Every 3rd frame | None | Default - good balance |
| efficient | Every 6th frame | Alternate | Lower VRAM |
| minimal | Every 10th frame | 2 of 3 | Testing |
When detection is skipped, the pipeline interpolates from the last result and updates tracking status.
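A sketch of that scheduling logic, assuming each detection carries the `bbox` and `velocity` fields from the DataChannel schema shown later; the `interpolated` status value is illustrative:

```python
# Detect-every-Nth-frame schedule with interpolation on skipped frames.
# Assumes detect() returns a dict shaped like the DataChannel message
# (a "detections" list with bbox + velocity per object).

DETECT_INTERVAL = 3  # "balanced" mode: full detection every 3rd frame

def step(frame_idx, frame, detect, last):
    if last is None or frame_idx % DETECT_INTERVAL == 0:
        return detect(frame)  # full detection pass on this frame
    # Skipped frame: advance each box along its estimated velocity instead.
    for det in last["detections"]:
        vx, vy = det["velocity"]
        x1, y1, x2, y2 = det["bbox"]
        det["bbox"] = [x1 + vx, y1 + vy, x2 + vx, y2 + vy]
        det["tracking_status"] = "interpolated"
    return last
```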
Voice Pipeline
The voice pipeline uses Moshi 7B for real-time conversational voice AI:
flowchart LR
    A[Audio In] --> M[Moshi 7B INT8]
    M --> T[Transcription]
    M --> R[Response Audio]
    T --> P[Command Parser]
    P --> V[Vision Pipeline]
    R --> S[Speaker Output]
Natural language commands:
- "Find the red car"
- "Track all people"
- "What do you see?"
- "Stop tracking"
- "Focus on the dog"
The command parser routes voice commands to the vision pipeline, updating detection prompts in real-time.
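A deliberately simplified version of that routing, with a hypothetical `pipeline` interface; the real grammar is richer than one regex:

```python
# Tiny command parser sketch: route a Moshi transcript to the vision
# pipeline. The pipeline methods here are hypothetical.
import re

def route_command(text: str, pipeline):
    text = text.lower().strip()
    if match := re.match(r"(?:find|track|focus on)\s+(?:the\s+|all\s+)?(.+)", text):
        pipeline.set_prompt(match.group(1))  # e.g. "red car", "people"
    elif "stop tracking" in text:
        pipeline.clear_prompt()
    elif "what do you see" in text:
        return pipeline.describe_scene()     # spoken back to the user
```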
Fallback Mode
If Moshi isn't available, the service falls back to:
- STT: Whisper Large V3 (3 GB)
- TTS: Kokoro (1 GB)
Quality Profiles
Pre-configured profiles for different use cases:
| Profile | Resolution | FPS | VRAM | Use Case |
|---|---|---|---|---|
| 4k-high | 3840×2160 | 30 | ~22 GB | Maximum quality |
| 1080p-high | 1920×1080 | 30 | ~18 GB | Recommended |
| 1080p-low | 1920×1080 | 15 | ~16 GB | Balanced |
| 720p-high | 1280×720 | 30 | ~17 GB | Fast processing |
| 720p-low | 1280×720 | 15 | ~15 GB | Low latency |
| 480p | 854×480 | 30 | ~14 GB | Legacy/testing |
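Expressed as plain data, those profiles are easy to carry around in config. A sketch with illustrative field names, values taken from the table above:

```python
# Quality profiles as plain data, mirroring the table. Field names are
# illustrative; the service's actual schema may differ.
QUALITY_PROFILES = {
    "4k-high":    {"resolution": (3840, 2160), "fps": 30, "vram_gb": 22},
    "1080p-high": {"resolution": (1920, 1080), "fps": 30, "vram_gb": 18},
    "1080p-low":  {"resolution": (1920, 1080), "fps": 15, "vram_gb": 16},
    "720p-high":  {"resolution": (1280, 720),  "fps": 30, "vram_gb": 17},
    "720p-low":   {"resolution": (1280, 720),  "fps": 15, "vram_gb": 15},
    "480p":       {"resolution": (854, 480),   "fps": 30, "vram_gb": 14},
}
```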
Service Profiles
Combinations optimized for specific scenarios:
| Profile | Models | VRAM | Max Resolution |
|---|---|---|---|
| full | GroundingDINO + SAM3 + Moshi INT8 | 18 GB | 1080p |
| vision-only | GroundingDINO + SAM3 | 8 GB | 4K |
| voice-only | Moshi FP16 | 18 GB | N/A |
| fast-detect | YOLO11 + Moshi INT8 | 12 GB | 4K |
| minimal | YOLO11 only | 3 GB | 4K |
Browser Client
The demo client provides a complete interface:

┌─────────────────────────────────────────────────────────────┐
│ [Status: Connected - Streaming]                             │
│ ┌───────────────────────────────────────────────────────┐   │
│ │                                                       │   │
│ │   Webcam Feed with Detection Overlay                  │   │
│ │                                                       │   │
│ │   [person: 94%]┌──────────────────────┐               │   │
│ │                │                      │               │   │
│ │                │   Absolute Legend!   │               │   │
│ │                │                      │               │   │
│ │                └──────────────────────┘               │   │
│ │                                                       │   │
│ │  "Track the person" → "I see one person in center"    │   │
│ └───────────────────────────────────────────────────────┘   │
│                                                             │
│ [Controls]           [Detected]           [Performance]     │
│ Profile: 1080p       1: person 94%        FPS: 28           │
│ Mirror: ON           2: laptop 87%        Latency: 45ms     │
│ Boxes: ON                                 VRAM: 18.2 GB     │
│ Masks: OFF                                Objects: 2        │
└─────────────────────────────────────────────────────────────┘
Features:
- Mirror/Flip - Selfie mode or standard view
- Visualization toggles - Bounding boxes, segmentation masks, motion trails
- Real-time stats - FPS, latency, VRAM usage
- Detected object list - Label, confidence, tracking status
API Endpoints
REST API
GET /health # Service health
GET /health/vram # VRAM status and allocation
GET /profiles # Available quality profiles
POST /sessions # Create streaming session
GET /sessions/{id} # Get session details
DELETE /sessions/{id} # End session
GET /models # Loaded models status
POST /models/{name}/load # Load a model
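Creating a session from Python might look like the following; the request and response fields are assumptions based on the profile tables above, not a documented schema:

```python
# Hypothetical session-creation call against the REST API.
import requests

resp = requests.post(
    "http://localhost:9600/sessions",
    json={"quality_profile": "1080p-high", "mode": "balanced"},
    timeout=10,
)
resp.raise_for_status()
session_id = resp.json()["id"]  # assumed field; feeds /ws/signaling/{session_id}
```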
WebSocket
WS /ws/signaling/{session_id} # WebRTC signaling
WS /ws/stream/{session_id} # Frame streaming (fallback)
DataChannel Messages
Detection results streamed in real-time:
{
  "type": "detections",
  "frame_id": 12345,
  "timestamp": 1704931200.456,
  "processing_time_ms": 45,
  "detections": [
    {
      "id": 1,
      "label": "red car",
      "confidence": 0.94,
      "bbox": [120, 340, 380, 520],
      "tracking_status": "stable",
      "velocity": [2.5, 0.1],
      "color": "#FF4444"
    }
  ]
}
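Assembling that payload server-side could look like this sketch; the JSON keys match the schema above, while the attributes on the detection objects are made up:

```python
# Build the DataChannel message from tracker output. The `d.*` attributes
# are hypothetical; only the JSON keys come from the schema above.
import json
import time

def detection_message(frame_id: int, detections, started_at: float) -> str:
    return json.dumps({
        "type": "detections",
        "frame_id": frame_id,
        "timestamp": time.time(),
        "processing_time_ms": int((time.time() - started_at) * 1000),
        "detections": [
            {
                "id": d.track_id,
                "label": d.label,
                "confidence": round(float(d.score), 2),
                "bbox": [int(v) for v in d.bbox],  # [x1, y1, x2, y2] pixels
                "tracking_status": d.status,       # e.g. "stable"
                "velocity": list(d.velocity),      # px/frame drift estimate
                "color": d.color,                  # overlay color hint
            }
            for d in detections
        ],
    })
```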
VRAM Management
Dynamic VRAM allocation prevents conflicts:
vram:
  limit_gb: 28      # Reserve 4 GB headroom
  warning_gb: 24
  allocation:
    vision: 8       # GroundingDINO + SAM3
    voice: 10       # Moshi INT8
    buffers: 4      # Frame/audio buffers
    system: 6       # Headroom
The VRAM manager:
- Tracks all model allocations
- Prevents over-allocation
- Estimates frame buffer requirements based on resolution
- Reports utilization via /health/vram
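Stripped to its core, the budget-enforcement part is a small bookkeeping class. A minimal sketch of that idea (the real manager also estimates frame buffers and drives model load/unload):

```python
# Allocation bookkeeping sketch: track named reservations against a fixed
# VRAM budget and refuse anything that would overshoot it.
class VRAMManager:
    def __init__(self, limit_gb: float = 28.0):
        self.limit_gb = limit_gb
        self.allocations: dict[str, float] = {}

    def used_gb(self) -> float:
        return sum(self.allocations.values())

    def allocate(self, name: str, gb: float) -> None:
        if self.used_gb() + gb > self.limit_gb:
            raise MemoryError(f"{name} needs {gb} GB; only "
                              f"{self.limit_gb - self.used_gb():.1f} GB free")
        self.allocations[name] = gb

    def release(self, name: str) -> None:
        self.allocations.pop(name, None)
```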
Frame Buffer Estimates
| Resolution | VRAM (6 buffers) |
|---|---|
| 4K | ~3.2 GB |
| 1080p | ~0.8 GB |
| 720p | ~0.4 GB |
Running It
Prerequisites
- NVIDIA GPU with 24+ GB VRAM (RTX 4090/5090)
- Docker with NVIDIA Container Toolkit
- Modern browser with WebRTC support
Quick Start
cd valet-video
docker compose up -d
# Check status
curl http://localhost:9600/health
# Open browser client
open http://localhost:9600/client
HTTPS for Remote Access
Webcam access requires a secure context. For remote access:
# Option 1: SSH tunnel
ssh -L 9600:localhost:9600 your-server
# Then open http://localhost:9600/client
# Option 2: Enable HTTPS
./scripts/generate-ssl-cert.sh
# Set SSL_KEYFILE and SSL_CERTFILE env vars
Tech Stack
| Component | Technology | Why |
|---|---|---|
| API | FastAPI + Uvicorn | Async, WebSocket support |
| WebRTC | aiortc | Python WebRTC implementation |
| Detection | GroundingDINO, YOLO11 | Text-prompted + fast detection |
| Tracking | SAM3 | Video object tracking |
| Voice | Moshi 7B | Real-time conversational voice AI |
| Fallback | Whisper + Kokoro | STT + TTS when Moshi unavailable |
| Container | Docker + CUDA | GPU-accelerated deployment |
My Home Lab Setup
Running on my home dev environment:
| Component | Spec |
|---|---|
| GPU | RTX 5090 32GB |
| Profile | 1080p-high (18 GB VRAM) |
| Models | GroundingDINO + SAM3 + Moshi INT8 |
| Port | 9600 |
I access it via SSH tunnel from my laptop. Detection latency is typically 45-60ms, with a stable 28 FPS at 1080p.
Related Services
Valet Video integrates components from:
- Valet Visual - Vision models (GroundingDINO, SAM)
- Valet Voice - Voice models (Moshi, Whisper)
- Valet Model Runtime - Model orchestration
What I Learned
Building real-time video AI taught me:
- WebRTC is complex - ICE negotiation, STUN/TURN, DataChannels - lots of moving parts
- Frame rate vs. detection rate - Processing every frame isn't necessary; interpolation works well
- VRAM is precious - At 1080p with full stack, 18GB of 32GB is used
- Voice adds magic - Natural language commands make the system feel alive
- Browser security matters - HTTPS or localhost required for webcam access
What's Next
- Multi-session support (currently single session)
- Mobile browser testing
- TURN server for NAT traversal
- Recording and playback
- Segment anything in video (click-to-track)