Advanced Architectures for Vertical AI Agents

Lesson 20: Vector Databases & Indexing - Production Vector Storage for RAG

Feb 20, 2026

[A] Today’s Build

We’re building a production-ready vector database system that replaces L19’s in-memory dictionary with ChromaDB:

Local ChromaDB instance with persistent storage and collection management
Gemini AI embedding pipeline that converts documents to 768-dimensional vectors
Similarity search engine with metadata filtering and ranking capabilities
Real-time dashboard showing indexed documents, search results, and performance metrics
Multi-collection architecture supporting different embedding strategies

From L19: We extend the naive RAG’s document chunker and retrieval interface, replacing the simple dict with a professional vector store that handles millions of embeddings.

Enables L21: Our ChromaDB foundation supports advanced embeddings (Sentence Transformers, custom models) and semantic chunking strategies we’ll implement next.

[B] Architecture Context

90-Lesson Path Position: Module 4 (RAG Systems, Lessons 19-27) - we’re transitioning from basic RAG patterns to production-grade vector infrastructure.

Integration with L19: We reuse the document chunking logic but upgrade the storage layer from {"doc_id": {"text": ..., "embedding": ...}} to ChromaDB’s optimized index structures with HNSW algorithms and metadata filtering.

Vector database architecture showing ChromaDB integration with Gemini embeddings

Module Objectives: By L27, we’ll have complete enterprise RAG with query routing, hybrid search, and guardrails - L20 provides the scalable vector backbone.

[C] Core Concepts

Vector databases aren’t just “storage with similarity” - they’re specialized data structures solving high-dimensional nearest neighbor search in sub-linear time. At Netflix, we indexed 50M+ embeddings for content recommendations; ChromaDB’s HNSW (Hierarchical Navigable Small World) graphs made 10ms searches possible where brute-force would take 30+ seconds.

Key insight:

Embeddings alone don’t make retrieval intelligent. The magic happens in the metadata-vector coupling - storing chunk positions, timestamps, source URLs alongside vectors enables filtered search (”find similar legal docs from Q3 2024”) that naive cosine similarity can’t achieve.

VAIA Relevance:

When your agent retrieves context for responses, it’s not just finding “similar text” - it’s navigating a semantic graph where distances encode meaning, and metadata filters encode business logic. Poor indexing = slow responses. Poor metadata = irrelevant context = hallucinations.

Workflow: Document → Chunk → Embed (Gemini API) → Index (ChromaDB collection) → Query → k-NN Search → Ranked Results

State Changes: Documents transition through RECEIVED → CHUNKED → EMBEDDED → INDEXED states, with each step validated before progression.

[D] VAIA Integration

Production Architecture Fit: Vector databases sit between your agent’s LLM and your document corpus, acting as the “long-term memory” layer. Google’s customer support agents use similar patterns - 100M+ support articles indexed with metadata filters for product, language, and version.

Enterprise Deployment Patterns:

Multi-tenancy: Separate ChromaDB collections per customer/project with isolation
Hybrid search: Combine vector similarity with keyword BM25 for precision
Cache warming: Pre-compute embeddings during ingestion, not query time
Incremental updates: Add/remove documents without full reindex

Real-World Example: Stripe’s documentation assistant uses Pinecone (similar to ChromaDB) with 50K+ API docs. Metadata filters ensure users see docs for their SDK version, avoiding version mismatch errors that generic RAG produces.

[E] Implementation

GitHub Link

https://github.com/sysdr/vertical-ai-agents/tree/main/lesson20

Component Architecture:

VectorStore: ChromaDB client with collection management, embedding integration
EmbeddingService: Gemini API wrapper that batches requests, handles rate limits
IndexingPipeline: Coordinates chunking → embedding → storage with retry logic
SearchEngine: Query processing, metadata filtering, result ranking
Dashboard: Real-time visualization of indexed docs, search performance, collection stats

Control Flow:

User uploads document → Backend chunks text → Sends batches to Gemini → Stores vectors+metadata in ChromaDB → Dashboard polls for index stats

Document indexing and search workflow from upload to ranked results

Data Flow:

Text chunks flow through embedding pipeline (768-dim vectors), join with metadata (source, timestamp, chunk_id), persist to ChromaDB’s HNSW index

Document state machine tracking progression through indexing pipeline

[F] Coding Highlights

Batch Embedding for Rate Limit Efficiency:

async def embed_batch(self, texts: list[str]) -> list[list[float]]:
    # Gemini allows 100 texts/request - use it
    embeddings = []
    for batch in chunked(texts, 100):
        result = await genai_model.embed_content(
            content=batch,
            task_type="retrieval_document"
        )
        embeddings.extend(result['embedding'])
    return embeddings

Metadata-Filtered Search:

results = collection.query(
    query_embeddings=query_vector,
    n_results=10,
    where={"source": "legal", "year": {"$gte": 2024}},
    include=["documents", "metadatas", "distances"]
)

Production Considerations:

Connection pooling: Reuse ChromaDB client across requests
Embedding caching: Hash text → check cache before API call
Error boundaries: Failed embeddings don’t crash entire batch
Monitoring: Track embedding latency, search p95, index size

[G] Validation

Verification Methods:

Indexing Health: Upload 100 docs → Verify 100 collections entries with correct metadata
Search Accuracy: Query “machine learning” → Top-5 results should contain ML-related chunks
Performance: 1000-doc index → Search completes <100ms (p95)
Metadata Filtering: Search with where={"category": "finance"} → All results match filter

Success Criteria:

ChromaDB collection persists after restart (disk persistence works)
Identical queries return same results (deterministic ranking)
Dashboard shows real-time index stats (WebSocket updates)

Benchmarks:

Embedding: 50 docs/sec with Gemini API (rate limits apply)
Indexing: 1000 chunks/sec into ChromaDB
Search: <50ms for 10K-doc collections, <200ms for 1M-doc

[H] Assignment

Extend the System:

Implement semantic deduplication - when indexing, detect near-duplicate chunks (cosine similarity >0.95) and merge them, preserving metadata from both sources. This prepares for L21’s chunking strategies where overlapping chunks are common.

Add a collection comparison dashboard - index the same documents with different embedding models (you’ll use Gemini text-embedding-004 only for now, but design the interface for L21’s model experiments).

Build Toward L21: Create a metadata field chunking_strategy that tags chunks as “character”, “sentence”, or “semantic” - we’ll populate this when implementing advanced chunking next lesson.

[I] Solution Hints

Deduplication: After embedding, before collection.add(), query ChromaDB with the new embedding → if distance <0.05 (similarity >0.95), update existing entry’s metadata array instead of creating new entry.

Collection comparison: Create separate ChromaDB collections (default, experimental) → index same docs in both → dashboard displays side-by-side search results for quality comparison.

Metadata tagging: Add {"chunking_strategy": "character", "chunk_method_version": "1.0"} to every chunk’s metadata - this becomes crucial for A/B testing chunking approaches in L21.

[J] Looking Ahead

L21 Bridge: Our ChromaDB setup handles any embedding model - next lesson, we’ll swap Gemini’s default model for Sentence Transformers (all-MiniLM-L6-v2, all-mpnet-base-v2) and compare retrieval quality. The chunking_strategy metadata enables semantic chunking experiments.

Module Progress: We’ve moved from naive RAG (L19) to production vector storage (L20) - next steps are advanced embeddings (L21), query optimization (L22), and hybrid search (L23), culminating in enterprise RAG with guardrails by L27.

The vector database is your VAIA’s memory backbone - invest in proper indexing now, or pay with slow responses and poor retrieval later.

Hands On AI Agent Mastery Course

Discussion about this post

Ready for more?