What Are Embeddings?
Embeddings are numerical vector representations of text that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search — finding relevant documents based on meaning rather than exact keyword matches. Embeddings are the engine that powers RAG retrieval.
When you embed a question like "How do I reset my password?" and a document chunk containing "To change your account password, go to Settings...", the vectors will be close together in the embedding space despite sharing few exact words. This semantic understanding is what makes RAG far more powerful than traditional keyword search.
Key Embedding Concepts
- Dimensions: The number of values in the vector (typically 256 to 3072). Higher dimensions can capture more nuance but increase storage and search cost
- Similarity Metrics: Cosine similarity, dot product, or Euclidean distance, used to compare vectors
- Context Window: The maximum number of tokens the model can embed at once (512 to 8192)
- Normalization: Most models L2-normalize vectors to unit length, so cosine similarity reduces to a simple dot product (see the sketch below)
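To make the normalization point concrete, here is a minimal standalone TypeScript sketch (no external libraries; the helper names are illustrative) showing that once vectors are scaled to unit length, their dot product equals their cosine similarity:

// Scale a vector to unit (L2) length
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map(x => x / norm);
}

// Dot product of two equal-length vectors
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

const a = normalize([3, 4]);
const b = normalize([4, 3]);
console.log(dot(a, b)); // 0.96, identical to the cosine similarity of the original vectors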
Popular Embedding Models (2025-2026)
Model Comparison
MTEB is the Massive Text Embedding Benchmark; higher scores are better, but treat them as a rough general-purpose signal rather than a guarantee for your domain.
| Model | Dimensions | Max Tokens | Cost | MTEB Score |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | $0.13/M tokens | 64.6 |
| text-embedding-3-small | 1536 | 8191 | $0.02/M tokens | 62.3 |
| Cohere embed-v4 | 1024 | 512 | $0.10/M tokens | 66.3 |
| Voyage AI voyage-3 | 1024 | 32000 | $0.06/M tokens | 67.1 |
| all-MiniLM-L6-v2 | 384 | 512 | Free (local) | 56.3 |
| BGE-large-en-v1.5 | 1024 | 512 | Free (local) | 64.2 |
Using OpenAI Embeddings
// OpenAI embeddings in TypeScript
import OpenAI from "openai";
const openai = new OpenAI();
// Single text embedding
async function embedText(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
dimensions: 1536, // Can reduce dimensions for cost savings
});
return response.data[0].embedding;
}
// Batch embedding (more efficient)
async function embedBatch(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return response.data
.sort((a, b) => a.index - b.index)
.map(d => d.embedding);
}
// Cosine similarity between two vectors
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Example: find similarity between texts
const query = await embedText("How do I deploy to production?");
const doc1 = await embedText("Production deployment guide using Docker and Kubernetes");
const doc2 = await embedText("Making chocolate chip cookies at home");
console.log("Relevant doc:", cosineSimilarity(query, doc1)); // ~0.82
console.log("Irrelevant doc:", cosineSimilarity(query, doc2)); // ~0.15
Using Open-Source Embeddings Locally
# Local embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
# Load model (downloads on first use, ~90MB for MiniLM)
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single embedding
text = "How do I reset my password?"
embedding = model.encode(text)
print(f"Shape: {embedding.shape}") # (384,)
# Batch embedding
texts = [
"Password reset instructions",
"Account security settings",
"Chocolate cake recipe",
]
embeddings = model.encode(texts)
# Compute similarities
query_embedding = model.encode("How do I change my password?")
similarities = np.dot(embeddings, query_embedding) / (
np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
)
for text, sim in zip(texts, similarities):
print(f"{sim:.3f} - {text}")
# For better quality, use a larger model
model_large = SentenceTransformer("BAAI/bge-large-en-v1.5")
# For retrieval, BGE models expect queries (not documents) to carry an instruction prefix
query_emb = model_large.encode("Represent this sentence for searching relevant passages: How to deploy?")
# Cohere embeddings (a hosted API, shown here for contrast with local models)
import cohere
co = cohere.Client()  # expects your Cohere API key via the environment or Client(api_key=...)
response = co.embed(
texts=["Hello world", "How are you?"],
model="embed-english-v3.0",
input_type="search_document", # or "search_query" for queries
)
print(f"Dimensions: {len(response.embeddings[0])}") # 1024
Choosing the Right Model
Decision Guide
| Scenario | Recommended Model | Reason |
|---|---|---|
| General RAG, budget-friendly | text-embedding-3-small | Low cost, good quality, easy API |
| Highest quality, cost not an issue | Voyage AI voyage-3 | Top MTEB scores, long context |
| Offline / privacy sensitive | BGE-large-en-v1.5 | Local, no data leaves your infra |
| Prototype / development | all-MiniLM-L6-v2 | Free, fast, good enough for testing |
| Multilingual | Cohere embed-multilingual | 100+ languages supported |
Embedding Best Practices
- Match query and document models: Always use the same embedding model for both queries and documents
- Batch your embeddings: API calls are more efficient in batches of 100-2000 texts
- Cache embeddings: Store and reuse vectors instead of re-embedding unchanged documents (see the caching sketch after this list)
- Normalize vectors: Ensure all vectors are L2-normalized for consistent cosine similarity
- Benchmark on your data: MTEB scores are general — test on your specific domain for best results
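As a concrete example of the caching advice above, here is a minimal TypeScript sketch that keys vectors by a hash of the chunk text. The embedCached helper is hypothetical, the in-memory Map is an illustrative stand-in for whatever persistent store you use, and embedText is the function defined earlier:

// Cache embeddings keyed by a content hash so unchanged text is never re-embedded
import { createHash } from "node:crypto";

const cache = new Map<string, number[]>(); // swap for a persistent store in production

async function embedCached(text: string): Promise<number[]> {
  const key = createHash("sha256").update(text).digest("hex");
  const hit = cache.get(key);
  if (hit) return hit; // unchanged text: reuse the stored vector
  const vector = await embedText(text);
  cache.set(key, vector);
  return vector;
}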
Summary
The embedding model you choose directly impacts RAG retrieval quality. OpenAI's text-embedding-3-small offers the best balance of cost and quality for most applications. For maximum quality, consider Voyage AI or Cohere. For privacy or offline use, open-source models like BGE-large work well. Always benchmark on your specific data and use case before committing to a model.