TechLead
Lesson 13 of 24
5 min read
AI Agents & RAG

ChromaDB Tutorial

Build local-first RAG applications with ChromaDB, the open-source embedding database

Getting Started with ChromaDB

ChromaDB is an open-source embedding database built for developer productivity. It runs locally with zero configuration, making it perfect for development, prototyping, and applications where data privacy is paramount. ChromaDB can also run as a server for team collaboration and production deployments.

ChromaDB Features

  • Zero Config: Works out of the box with pip install or npm install — no server needed
  • Built-in Embeddings: Auto-embeds text using default models (no OpenAI key required for basics)
  • Metadata Filtering: Rich query filters on document metadata
  • Persistent Storage: Save collections to disk and reload them
  • Multi-Modal: Supports text and image embeddings
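ChromaDB's embedding functions are just callables: anything that maps a list of texts to a list of equal-length vectors can back a collection. A toy, hypothetical stand-in (the real default is a small sentence-transformer model served via ONNX) illustrates the shape of that interface:

```python
class ToyEmbeddingFunction:
    """Toy stand-in for an embedding function: hashes words into buckets.
    Real embedding functions return dense vectors from a trained model."""

    def __call__(self, input: list[str]) -> list[list[float]]:
        dim = 8  # real models use hundreds of dimensions
        vectors = []
        for text in input:
            vec = [0.0] * dim
            for word in text.lower().split():
                vec[hash(word) % dim] += 1.0  # count words per bucket
            vectors.append(vec)
        return vectors

embedder = ToyEmbeddingFunction()
vectors = embedder(["hello world", "hello chroma"])
print(len(vectors), len(vectors[0]))  # 2 8
```

The contract is the part that matters: same number of output vectors as input texts, all with the same dimensionality.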

Installation

# Python
pip install chromadb

# TypeScript / JavaScript
npm install chromadb

Basic Usage

# ChromaDB basics in Python
import chromadb

# In-memory client (data lost when process ends)
client = chromadb.Client()

# Persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./my_chroma_db")

# Create a collection
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"},  # Distance metric
)

# Add documents (ChromaDB auto-embeds with default model)
collection.add(
    ids=["id1", "id2", "id3", "id4", "id5"],
    documents=[
        "Python is a versatile programming language used in AI and web development.",
        "JavaScript powers interactive web applications and runs in browsers.",
        "Docker containers package applications with their dependencies.",
        "PostgreSQL is a powerful open-source relational database.",
        "React is a JavaScript library for building user interfaces.",
    ],
    metadatas=[
        {"topic": "python", "category": "language", "difficulty": "beginner"},
        {"topic": "javascript", "category": "language", "difficulty": "beginner"},
        {"topic": "docker", "category": "devops", "difficulty": "intermediate"},
        {"topic": "postgresql", "category": "database", "difficulty": "intermediate"},
        {"topic": "react", "category": "frontend", "difficulty": "intermediate"},
    ],
)

# Simple query
results = collection.query(
    query_texts=["web development frameworks"],
    n_results=3,
)

print("Documents:", results["documents"])
print("Distances:", results["distances"])
print("Metadatas:", results["metadatas"])

# Query with metadata filter
results = collection.query(
    query_texts=["programming tools"],
    n_results=3,
    where={"category": "language"},  # Only search in 'language' category
)

# Complex filters
results = collection.query(
    query_texts=["backend development"],
    n_results=5,
    where={
        "$and": [
            {"difficulty": {"$ne": "beginner"}},
            {"category": {"$in": ["database", "devops"]}},
        ]
    },
)
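One line worth unpacking is metadata={"hnsw:space": "cosine"}: it tells the HNSW index which distance function to use, and that is what you get back in results["distances"]. A minimal pure-Python sketch of cosine distance (no ChromaDB required):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity.
    0.0 means same direction; 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Cosine ignores vector magnitude, which is usually what you want for text embeddings; the other supported spaces are "l2" (the default) and "ip" (inner product).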

TypeScript Usage

// ChromaDB in TypeScript
import { ChromaClient, OpenAIEmbeddingFunction } from "chromadb";

const client = new ChromaClient(); // JS client talks to a Chroma server (http://localhost:8000 by default)

// Use OpenAI embeddings for better quality
const embedder = new OpenAIEmbeddingFunction({
  openai_api_key: process.env.OPENAI_API_KEY!,
  openai_model: "text-embedding-3-small",
});

// Create collection with custom embedder
const collection = await client.getOrCreateCollection({
  name: "knowledge_base",
  embeddingFunction: embedder,
  metadata: { "hnsw:space": "cosine" },
});

// Add documents
await collection.add({
  ids: ["doc1", "doc2", "doc3"],
  documents: [
    "Our API supports REST and GraphQL endpoints.",
    "Authentication uses JWT tokens with 1-hour expiry.",
    "Rate limiting is set to 1000 requests per minute.",
  ],
  metadatas: [
    { section: "api", version: "v2" },
    { section: "auth", version: "v2" },
    { section: "limits", version: "v2" },
  ],
});

// Semantic search
const results = await collection.query({
  queryTexts: ["How do I authenticate API requests?"],
  nResults: 3,
});

console.log(results.documents);

// Update existing documents
await collection.update({
  ids: ["doc3"],
  documents: ["Rate limiting is set to 2000 requests per minute for premium users."],
  metadatas: [{ section: "limits", version: "v3" }],
});

// Delete documents
await collection.delete({
  where: { version: "v1" },
});

Building a Complete RAG App with ChromaDB

# Full RAG application with ChromaDB
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import anthropic
import os
from pathlib import Path

# Setup
embedding_fn = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=embedding_fn,
)

anthropic_client = anthropic.Anthropic()

# Ingest documents from a directory
def ingest_directory(directory: str) -> int:
    """Ingest all text files from a directory."""
    docs, ids, metadatas = [], [], []

    for filepath in Path(directory).glob("**/*.txt"):
        text = filepath.read_text()
        # Simple fixed-size chunking: 1000-char windows, 200-char overlap
        chunks = [text[i:i+1000] for i in range(0, len(text), 800)]

        for j, chunk in enumerate(chunks):
            doc_id = f"{filepath.stem}_chunk_{j}"
            docs.append(chunk)
            ids.append(doc_id)
            metadatas.append({
                "source": filepath.name,
                "chunk_index": j,
                "total_chunks": len(chunks),
            })

    if docs:
        # Add in one call; for very large corpora, split into batches,
        # since ChromaDB caps the number of records per add() call
        collection.add(ids=ids, documents=docs, metadatas=metadatas)

    return len(docs)

# RAG query function
def ask(question: str, n_results: int = 5) -> str:
    """Ask a question using RAG."""
    # Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )

    if not results["documents"][0]:
        return "No relevant documents found."

    # Build context
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_parts.append(f"[{meta['source']}] {doc}")

    context = "\n\n".join(context_parts)

    # Generate answer
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=f"""Answer based on the provided context. Cite sources in brackets.
If the context doesn't cover the question, say so.

Context:
{context}""",
        messages=[{"role": "user", "content": question}],
    )

    return response.content[0].text

# Usage
count = ingest_directory("./knowledge_base")
print(f"Ingested {count} chunks")

answer = ask("What is the refund policy?")
print(answer)
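The list comprehension in ingest_directory produces 1000-character windows that step forward 800 characters, so consecutive chunks share a 200-character overlap and sentences at a boundary are less likely to be cut in half. The same idea as a standalone helper (chunk_text is illustrative, not part of ChromaDB):

```python
def chunk_text(text: str, size: int = 1000, step: int = 800) -> list[str]:
    """Fixed-size chunks with (size - step) characters of overlap."""
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2000)
print(len(chunks))       # 3 chunks: [0:1000], [800:1800], [1600:2000]
print(len(chunks[-1]))   # 400 — the final chunk is shorter
```

Character-based chunking is crude (it can still split mid-word); splitting on paragraph or sentence boundaries usually retrieves better, at the cost of more code.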

ChromaDB vs Pinecone

Feature  | ChromaDB                 | Pinecone
Setup    | pip install, zero config | Account + API key
Hosting  | Local + self-host        | Managed cloud only
Scale    | Millions of vectors      | Billions of vectors
Cost     | Free (open source)       | Free tier + paid
Best for | Development, small apps  | Production, large scale

ChromaDB Best Practices

  • Use PersistentClient: Always use persistent storage in development to avoid re-embedding on restart
  • Custom embeddings: The default model is small — use OpenAI or sentence-transformers for better quality
  • Collection naming: Use descriptive names and separate collections by document type or tenant
  • Metadata is key: Store source, timestamps, and categories for effective filtering
  • Batch operations: Add documents in reasonable batches (1000-5000) for optimal performance
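The batching advice above is easy to sketch: slice your ids, documents, and metadatas into fixed-size groups and call collection.add once per group. A minimal, ChromaDB-free helper (batched is a hypothetical name, not a library function):

```python
def batched(items: list, batch_size: int = 1000):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

ids = [f"doc_{i}" for i in range(2500)]
print([len(b) for b in batched(ids)])  # [1000, 1000, 500]
```

Each slice of ids would then go to collection.add alongside the matching slices of documents and metadatas, keeping every call comfortably under ChromaDB's per-call record limit.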

Summary

ChromaDB is the fastest way to go from zero to a working RAG application. Its zero-configuration setup, built-in embedding support, and persistent storage make it ideal for development and small-to-medium production deployments. Start with ChromaDB to validate your RAG approach, then migrate to a managed solution like Pinecone if you need to scale beyond millions of vectors.
