
Vacuum - Web Scraping & Knowledge Ingestion for LLMs

Language Seed · January 13, 2026 · 5 min read


🌀 Vacuum

LLM-powered web content extraction and knowledge ingestion

Home Lab Project · 4 Search Engines · Plugin Architecture


The Problem

I needed to feed my AI agents with fresh, structured knowledge from the web:

  • Raw HTML is useless - LLMs need clean, extracted content
  • Web scraping is brittle - Selectors break when sites change
  • Search is fragmented - Different APIs, different formats
  • No memory - Every scrape starts from scratch

Traditional scrapers return HTML soup. I wanted a system that could understand pages, extract the relevant parts, and store them for retrieval.


The Solution

Vacuum uses LLMs to intelligently extract content from web pages. Instead of CSS selectors, you describe what you want in natural language, and the LLM figures out how to get it.

# Traditional scraping (fragile): breaks whenever the site's markup changes
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html fetched separately
title = soup.select_one("h1.article-title").text
content = soup.select_one("div.content-body").text

# Vacuum (intelligent): describe what you want in natural language
result = await vacuum.browse(
    url="https://example.com/article",
    prompt="Extract the article title, author, date, and main content"
)
# Returns structured JSON with exactly what you asked for
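
For a call like the one above, the response might look something like this (the fields follow the prompt, so the exact shape varies):

# Illustrative response -- the schema adapts to whatever the prompt asks for
{
    "title": "Understanding asyncio",
    "author": "Jane Doe",
    "date": "2026-01-10",
    "content": "The main content of the article..."
}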

Key Features

  • 🧠 LLM-Powered Extraction - Uses ScrapeGraphAI with Valet Runtime
  • 🔍 Multi-Engine Search - Tavily, Serper, Brave, DuckDuckGo in one API
  • 📦 Smart Caching - TTL strategies by content type (news vs docs)
  • 🎯 Reranking - Reorder results by relevance using mxbai-rerank
  • 🖼️ Visual Analysis - Extract content from images via Valet Visual
  • 🔌 Plugin System - Add custom content sources
  • 💾 Vector Storage - ChromaDB for semantic search

Architecture

flowchart TB
    subgraph Input["Input Sources"]
        U[URLs]
        Q[Search Query]
        D[Documents]
    end
    
    subgraph Vacuum["Vacuum Service"]
        direction TB
        API[FastAPI Gateway]
        
        subgraph Core["Core Operations"]
            B[Browse<br/>Single URL]
            S[Scrape<br/>Depth Crawl]
            F[Fetch<br/>Multi-source]
        end
        
        subgraph Services["Services"]
            WS[Web Search<br/>4 engines]
            RR[Reranker<br/>mxbai-rerank]
            EM[Embeddings<br/>mxbai-embed]
        end
    end
    
    subgraph Backend["Backend"]
        VR[Valet Runtime<br/>LLM]
        VV[Valet Visual<br/>Vision]
        CH[ChromaDB<br/>Vectors]
        RD[Redis<br/>Cache]
    end
    
    U & Q & D --> API
    API --> B & S & F
    API --> WS
    B & S & F --> VR
    B & S & F --> RD
    WS --> RR
    RR --> EM --> CH
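
The tail of that pipeline (rerank, embed, store) lands in ChromaDB. The storage step, simplified, looks roughly like this with the chromadb client (collection name and metadata fields are illustrative; Vacuum plugs in mxbai-embed rather than the default embedder):

# Simplified embed-and-store step with the stock chromadb client.
# Collection name and metadata fields are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./vacuum_db")
collection = client.get_or_create_collection("scraped_pages")

# Store extracted content for later semantic retrieval
collection.add(
    ids=["asyncio-docs-1"],
    documents=["asyncio is a library to write concurrent code..."],
    metadatas=[{"url": "https://docs.python.org/3/library/asyncio.html"}],
)

# Query side: semantic search over everything ingested so far
hits = collection.query(query_texts=["python async concurrency"], n_results=5)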

Core Operations

Browse - Single URL Extraction

Extract structured content from a single page:

curl -X POST http://localhost:8500/browse \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.python.org/3/library/asyncio.html",
    "prompt": "Extract the main concepts and code examples"
  }'
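
From Python, it's the same HTTP POST; a minimal client sketch with httpx (the payload mirrors the curl example, the response handling is illustrative):

import asyncio
import httpx

async def browse(url: str, prompt: str) -> dict:
    # POST to the /browse endpoint and return the structured JSON result
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            "http://localhost:8500/browse",
            json={"url": url, "prompt": prompt},
        )
        resp.raise_for_status()
        return resp.json()

result = asyncio.run(browse(
    "https://docs.python.org/3/library/asyncio.html",
    "Extract the main concepts and code examples",
))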

Scrape - Depth Crawling

Crawl a site to a specified depth:

curl -X POST http://localhost:8500/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://fastapi.tiangolo.com/",
    "prompt": "Extract all tutorial content",
    "depth": 2
  }'
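
Conceptually, depth crawling is a breadth-first walk over links, with every visited page going through the same LLM extraction as browse. A rough sketch of the idea, not the actual crawler (extract_content and extract_links are stand-ins for the real extraction and link-parsing steps):

from collections import deque

async def crawl(start_url: str, prompt: str, depth: int) -> list[dict]:
    seen = {start_url}
    results = []
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        results.append(await extract_content(url, prompt))  # LLM extraction per page
        if level < depth:
            for link in await extract_links(url):  # outbound links on the page
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return results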

Fetch - Multi-Source Retrieval

Gather content from multiple URLs in parallel:

curl -X POST http://localhost:8500/fetch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://a.com", "https://b.com", "https://c.com"],
    "prompt": "Summarize the main points"
  }'
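
The fan-out maps naturally onto asyncio.gather. A sketch of the pattern, reusing the browse() helper from the /browse example (the real /fetch endpoint does this server-side, then aggregates per the prompt):

import asyncio

async def fetch_many(urls: list[str], prompt: str) -> list[dict]:
    # Fire all requests concurrently and collect results in order
    return await asyncio.gather(*(browse(url, prompt) for url in urls))

results = asyncio.run(fetch_many(
    ["https://a.com", "https://b.com", "https://c.com"],
    "Summarize the main points",
))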

Web Search Integration

Vacuum provides a unified search API across multiple engines:

flowchart LR
    subgraph Query["Search Query"]
        Q[Query Text]
    end
    
    subgraph Engines["Search Engines"]
        T[Tavily<br/>AI-optimized]
        S[Serper<br/>Google results]
        B[Brave<br/>Privacy-focused]
        D[DuckDuckGo<br/>Free fallback]
    end
    
    subgraph Output["Unified Response"]
        R[Standardized Results]
        A[Direct Answer]
        F[Follow-up Questions]
    end
    
    Q --> T & S & B & D
    T & S & B & D --> R
    T --> A & F

Engine       Best For                                  API Key Required
Tavily       AI applications, returns clean content    Yes
Serper       Google search results                     Yes
Brave        Privacy, no tracking                      Yes
DuckDuckGo   Free fallback, no key needed              No

The system uses auto mode by default, selecting the best available engine based on your API keys.
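
Auto mode boils down to walking a preference order and taking the first engine with a configured key, with DuckDuckGo as the keyless fallback. Simplified (the order and env var names here are illustrative):

import os

# Preference order: first engine with a configured API key wins
ENGINE_KEYS = {
    "tavily": "TAVILY_API_KEY",
    "serper": "SERPER_API_KEY",
    "brave": "BRAVE_API_KEY",
}

def pick_engine() -> str:
    for engine, env_var in ENGINE_KEYS.items():
        if os.environ.get(env_var):
            return engine
    return "duckduckgo"  # free fallback, no key required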


Caching Strategy

Different content types have different freshness requirements:

Content Type    TTL          Rationale
Documentation   24 hours     Changes infrequently
News            30 minutes   Goes stale quickly
Social          5 minutes    Real-time content
Default         1 hour       Balances freshness and performance

Cache entries are stored in Redis with automatic expiration:

flowchart LR
    R[Request] --> C{Cache Hit?}
    C -->|Yes| CR[Return Cached]
    C -->|No| F[Fetch + Extract]
    F --> S[Store with TTL]
    S --> CR2[Return Fresh]
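
In code, the cache-aside flow is short. A simplified sketch with redis-py (the key format and content-type classification are illustrative, and extract_content stands in for the real fetch-and-extract step):

import json
import redis

TTLS = {"docs": 86400, "news": 1800, "social": 300, "default": 3600}
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

async def cached_extract(url: str, prompt: str, content_type: str = "default") -> dict:
    key = f"vacuum:{content_type}:{url}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: return stored extraction
    result = await extract_content(url, prompt)  # cache miss: fetch + extract
    r.setex(key, TTLS.get(content_type, TTLS["default"]), json.dumps(result))
    return result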

Plugin System

Extend Vacuum with custom content sources:

Creating a Plugin

from vacuum.plugins.base import BasePlugin, PluginMetadata

class MyPlugin(BasePlugin):
    @classmethod
    def metadata(cls) -> PluginMetadata:
        return PluginMetadata(
            name="my-plugin",
            version="1.0.0",
            description="Custom content source"
        )
    
    async def process(self, request):
        # Your extraction logic here
        return {"content": "..."}

Plugins are auto-discovered from the plugins directory and registered at startup.
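
The discovery side is standard pkgutil fare: import every module in the plugins package, then register each BasePlugin subclass it defines. Simplified:

import importlib
import pkgutil

import vacuum.plugins as plugins_pkg
from vacuum.plugins.base import BasePlugin

def discover_plugins() -> dict[str, type]:
    # Importing each module is enough to make its plugin classes visible
    for info in pkgutil.iter_modules(plugins_pkg.__path__):
        importlib.import_module(f"{plugins_pkg.__name__}.{info.name}")
    # Register every BasePlugin subclass under its declared name
    return {cls.metadata().name: cls for cls in BasePlugin.__subclasses__()}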


Tech Stack

Component    Technology      Why
API          FastAPI         Async, fast, OpenAPI docs
Extraction   ScrapeGraphAI   LLM-powered scraping
LLM          Valet Runtime   Unified model access
Vision       Valet Visual    Image extraction
Cache        Redis           Fast, TTL support
Vectors      ChromaDB        Semantic search
Reranking    mxbai-rerank    Result relevance

What I Learned

  1. LLMs beat selectors - Natural language extraction is more resilient than CSS selectors
  2. Caching is essential - Web scraping without caching wastes compute and hits rate limits
  3. Multi-engine search pays off - No single search API is perfect; having fallbacks is crucial
  4. Plugin architecture matters - Adding a new source should be a drop-in plugin, not a change to core code

What's Next

  • YouTube transcript extraction
  • PDF ingestion pipeline
  • Scheduled refresh for important sources
  • Better deduplication in ChromaDB