Vector Databases and Semantic Search: A Practical Guide
A hands-on guide to vector databases and semantic search, covering embeddings, indexing algorithms, database selection, and building production search systems.
Traditional databases are built for exact matches. You ask for all customers in "New York" and get back every record where the city field equals "New York." But what if a user searches for "affordable apartments near Central Park"? No field contains that exact string, and keyword matching will miss listings described as "budget-friendly studios in Midtown" even though they are semantically relevant. Semantic search understands meaning, not just keywords, and vector databases are the infrastructure that makes it possible at scale.
The rise of large language models and embedding models has made vector databases one of the fastest-growing database categories. They power retrieval-augmented generation (RAG) systems, recommendation engines, image search, anomaly detection, and more. This guide covers how they work and how to build production systems with them.
How Vector Embeddings Work
An embedding is a fixed-length numerical representation of a piece of data, typically a list of 384 to 1536 floating-point numbers, that captures its semantic meaning. Text with similar meaning produces embeddings that are close together in this high-dimensional space, even if the words used are completely different.
Embedding models are neural networks trained to map inputs into a vector space where semantic similarity corresponds to geometric proximity. For text, models like OpenAI's text-embedding-3-small, Cohere's embed-v3, or open-source options like all-MiniLM-L6-v2 from Sentence Transformers are commonly used.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "How to reset my password",
    "I forgot my login credentials",
    "What are your business hours",
    "When is the store open",
    "Return policy for damaged items",
]
embeddings = model.encode(texts)
# Compute pairwise cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(embeddings)
# "reset my password" and "forgot login credentials" -> ~0.82 similarity
# "reset my password" and "business hours" -> ~0.15 similarity

The key insight is that you generate embeddings once for your corpus (documents, products, images) and store them. At query time, you embed the user's query with the same model and find the stored embeddings that are closest to it.
Vector Indexing Algorithms
Brute-force comparison of a query vector against millions of stored vectors is too slow for production use. Vector databases use approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed improvements.
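To see what ANN indexes are avoiding, here is the exact brute-force baseline as a minimal NumPy sketch, with random data standing in for real embeddings. With normalized vectors, cosine similarity reduces to a dot product, so every query is a full matrix-vector multiply over the corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize once at index time

query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine similarity: O(N * d) work for every single query
scores = corpus @ query
top_k = np.argsort(-scores)[:5]  # indices of the 5 most similar vectors
```

At ten thousand vectors this is fast enough; at hundreds of millions, that per-query linear scan is exactly the cost ANN indexes exist to avoid.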
The main indexing approaches are:
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node is a vector and edges connect nearby vectors. Search starts at the top layer (coarse navigation) and descends through layers (fine-grained search). HNSW offers excellent query performance and is the default in most vector databases. The tradeoff is higher memory usage since the graph structure must be held in memory alongside the vectors.
IVF (Inverted File Index) partitions the vector space into clusters using k-means. At query time, it identifies the closest clusters and searches only within those partitions. IVF uses less memory than HNSW but typically has lower recall at the same query speed.
Product Quantization (PQ) compresses vectors by splitting them into sub-vectors and quantizing each independently. This dramatically reduces memory usage, enabling billion-scale datasets on commodity hardware, but at the cost of some accuracy.
In practice, many systems combine these approaches. IVF-PQ uses inverted file indexing with product-quantized vectors for memory-efficient billion-scale search. HNSW with scalar quantization provides a good balance of speed, accuracy, and memory for datasets up to hundreds of millions of vectors.
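The IVF idea described above can be sketched in a few lines, with scikit-learn's k-means standing in for the index's coarse quantizer. This is a toy illustration, not a production index; real libraries such as FAISS add quantization, SIMD kernels, and tuned data structures on top of the same structure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)

# Build step: partition the space into nlist clusters (the "inverted lists")
nlist = 50
kmeans = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(vectors)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(nlist)}

def ivf_search(query, nprobe=5, k=3):
    # Rank clusters by centroid distance and scan only the nprobe closest,
    # trading recall (vectors in unprobed clusters are missed) for speed
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_dists)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

hits = ivf_search(rng.normal(size=64).astype(np.float32))
```

Raising nprobe recovers recall at the cost of scanning more lists; that is the central speed/accuracy knob production IVF indexes expose.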
Choosing a Vector Database
The vector database landscape has matured significantly. Here is a practical comparison of the major options:
| Database | Deployment | Best For | Key Strength |
|---|---|---|---|
| Pinecone | Managed cloud | Teams wanting zero ops | Fully managed, scales automatically |
| Weaviate | Self-hosted or cloud | Multimodal search | Built-in vectorization modules |
| Qdrant | Self-hosted or cloud | Performance-sensitive apps | Rust-based, fast filtering |
| Milvus | Self-hosted or cloud | Large-scale deployments | Billion-scale support |
| ChromaDB | Embedded | Prototyping and small apps | Simple API, easy to start |
| pgvector | PostgreSQL extension | Existing Postgres users | No new infrastructure needed |
For teams already running PostgreSQL, pgvector is often the pragmatic starting point. It avoids introducing a new database into your stack and handles datasets up to a few million vectors well. As scale and performance requirements grow, migrating to a purpose-built vector database becomes worthwhile.
Building a Semantic Search System
Here is a complete implementation of a semantic search system using Qdrant:
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter,
    FieldCondition, MatchValue,
)
from sentence_transformers import SentenceTransformer
import uuid
# Initialize
client = QdrantClient(host="localhost", port=6333)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Create collection
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=384,  # Dimension of all-MiniLM-L6-v2
        distance=Distance.COSINE,
    ),
)
# Index documents
documents = [
    {
        "text": "To reset your password, go to Settings > Security > Change Password.",
        "category": "account",
        "product": "web_app",
    },
    {
        "text": "Refunds are processed within 5-7 business days after approval.",
        "category": "billing",
        "product": "all",
    },
    # ... hundreds or thousands more documents
]
points = []
for doc in documents:
    embedding = encoder.encode(doc["text"]).tolist()
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload=doc,
    ))
client.upsert(collection_name="knowledge_base", points=points)
# Search with optional metadata filtering
def search(query: str, category: str | None = None, top_k: int = 5):
    query_vector = encoder.encode(query).tolist()
    search_filter = None
    if category:
        search_filter = Filter(
            must=[
                FieldCondition(
                    key="category",
                    match=MatchValue(value=category),
                )
            ]
        )
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_vector,
        query_filter=search_filter,
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score}
        for hit in results
    ]
# Example: semantic match despite different wording
results = search("I can't log in to my account")
# Returns: "To reset your password, go to Settings > Security..."

Optimizing for Production Quality
Raw vector similarity is a starting point, not a final answer. Production search systems layer additional techniques to improve relevance:
Hybrid search combines dense vector similarity with sparse keyword matching (BM25). This catches cases where exact terminology matters (product SKUs, error codes, proper nouns) that embedding models may not distinguish well.
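One common way to merge the keyword and vector result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. A small sketch with hypothetical document IDs:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from a BM25 pass and a dense-vector pass
bm25_results = ["doc_7", "doc_2", "doc_9", "doc_4"]
dense_results = ["doc_2", "doc_4", "doc_7", "doc_1"]
fused = rrf_fuse([bm25_results, dense_results])
# -> ['doc_2', 'doc_7', 'doc_4', 'doc_9', 'doc_1']
```

Documents that rank well in both lists float to the top; the constant k damps the influence any single list's top hit can have.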
Re-ranking applies a more expensive cross-encoder model to the top candidates retrieved by the vector search. Cross-encoders process the query and document together, capturing interactions that bi-encoder embeddings miss, at the cost of higher latency.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def search_with_reranking(query, top_k=5, rerank_top=20):
    # Stage 1: Fast retrieval of candidates
    candidates = search(query, top_k=rerank_top)
    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, c["text"]] for c in candidates]
    rerank_scores = reranker.predict(pairs)
    for i, score in enumerate(rerank_scores):
        candidates[i]["rerank_score"] = float(score)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]

Metadata filtering narrows the search space before vector comparison. If you know the user is asking about billing, filter to billing documents first, then run similarity search within that subset. This improves both relevance and performance.
Chunking strategy determines how you split documents into searchable units. Chunks that are too large dilute the semantic signal; chunks that are too small lose context. A sliding window approach with overlap, typically 200-500 tokens with 50-token overlap, works well for most text corpora.
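A sliding-window chunker along those lines is only a few lines of Python. In this sketch, whitespace-split words stand in for tokenizer tokens; swap in your embedding model's actual tokenizer for accurate token budgets:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    tokens = text.split()  # stand-in for a real tokenizer
    chunks = []
    step = chunk_size - overlap  # each window starts `step` tokens after the last
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_text(doc)  # 3 chunks of 300, 300, and 200 words
```

Because consecutive windows share `overlap` tokens, a sentence that straddles a boundary still appears whole in at least one chunk, which is the point of the overlap.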