Vector Databases and Semantic Search: A Practical Guide
A hands-on guide to vector databases and semantic search, covering embeddings, indexing algorithms, database selection, and building production search systems.
Traditional databases are built for exact matches. You ask for all customers in "New York" and get back every record where the city field equals "New York." But what if a user searches for "affordable apartments near Central Park"? No field contains that exact string, and keyword matching will miss listings described as "budget-friendly studios in Midtown" even though they are semantically relevant. Semantic search understands meaning, not just keywords, and vector databases are the infrastructure that makes it possible at scale.
The rise of large language models and embedding models has made vector databases one of the fastest-growing database categories. They power retrieval-augmented generation (RAG) systems, recommendation engines, image search, anomaly detection, and more. This guide covers how they work and how to build production systems with them.
How Vector Embeddings Work
An embedding is a fixed-length numerical representation of a piece of data, typically a list of 384 to 1536 floating-point numbers, that captures its semantic meaning. Text with similar meaning produces embeddings that are close together in this high-dimensional space, even if the words used are completely different.
Embedding models are neural networks trained to map inputs into a vector space where semantic similarity corresponds to geometric proximity. For text, models like OpenAI's text-embedding-3-small, Cohere's embed-v3, or open-source options like all-MiniLM-L6-v2 from Sentence Transformers are commonly used.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "How to reset my password",
    "I forgot my login credentials",
    "What are your business hours",
    "When is the store open",
    "Return policy for damaged items",
]
embeddings = model.encode(texts)
# Compute pairwise cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(embeddings)
# "reset my password" and "forgot login credentials" -> ~0.82 similarity
# "reset my password" and "business hours" -> ~0.15 similarity

The key insight is that you generate embeddings once for your corpus (documents, products, images) and store them. At query time, you embed the user's query with the same model and find the stored embeddings that are closest to it.
Vector Indexing Algorithms
Brute-force comparison of a query vector against millions of stored vectors is too slow for production use. Vector databases use approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed improvements.
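To see what ANN indexes are avoiding, here is the exact brute-force baseline as a minimal NumPy sketch, with random data standing in for real embeddings. With normalized vectors, cosine similarity reduces to a dot product, so every query is a full matrix-vector multiply over the corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize once at index time

query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine similarity: O(N * d) work for every single query
scores = corpus @ query
top_k = np.argsort(-scores)[:5]  # indices of the 5 most similar vectors
```

At ten thousand vectors this is fast enough; at hundreds of millions, that per-query linear scan is exactly the cost ANN indexes exist to avoid.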
The main indexing approaches are:
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node is a vector and edges connect nearby vectors. Search starts at the top layer (coarse navigation) and descends through layers (fine-grained search). HNSW offers excellent query performance and is the default in most vector databases. The tradeoff is higher memory usage since the graph structure must be held in memory alongside the vectors.
IVF (Inverted File Index) partitions the vector space into clusters using k-means. At query time, it identifies the closest clusters and searches only within those partitions. IVF uses less memory than HNSW but typically has lower recall at the same query speed.
Product Quantization (PQ) compresses vectors by splitting them into sub-vectors and quantizing each independently. This dramatically reduces memory usage, enabling billion-scale datasets on commodity hardware, but at the cost of some accuracy.
In practice, many systems combine these approaches. IVF-PQ uses inverted file indexing with product-quantized vectors for memory-efficient billion-scale search. HNSW with scalar quantization provides a good balance of speed, accuracy, and memory for datasets up to hundreds of millions of vectors.
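The IVF idea described above can be sketched in a few lines, with scikit-learn's k-means standing in for the index's coarse quantizer. This is a toy illustration, not a production index; real libraries such as FAISS add quantization, SIMD kernels, and tuned data structures on top of the same structure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)

# Build step: partition the space into nlist clusters (the "inverted lists")
nlist = 50
kmeans = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(vectors)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(nlist)}

def ivf_search(query, nprobe=5, k=3):
    # Rank clusters by centroid distance and scan only the nprobe closest,
    # trading recall (vectors in unprobed clusters are missed) for speed
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_dists)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

hits = ivf_search(rng.normal(size=64).astype(np.float32))
```

Raising nprobe recovers recall at the cost of scanning more lists; that is the central speed/accuracy knob production IVF indexes expose.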
Choosing a Vector Database
The vector database landscape has matured significantly. Here is a practical comparison of the major options:
| Database | Deployment | Best For | Key Strength |
|---|---|---|---|
| Pinecone | Managed cloud | Teams wanting zero ops | Fully managed, scales automatically |
| Weaviate | Self-hosted or cloud | Multimodal search | Built-in vectorization modules |
| Qdrant | Self-hosted or cloud | Performance-sensitive apps | Rust-based, fast filtering |
| Milvus | Self-hosted or cloud | Large-scale deployments | Billion-scale support |
| ChromaDB | Embedded | Prototyping and small apps | Simple API, easy to start |
| pgvector | PostgreSQL extension | Existing Postgres users | No new infrastructure needed |
For teams already running PostgreSQL, pgvector is often the pragmatic starting point. It avoids introducing a new database into your stack and handles datasets up to a few million vectors well. As scale and performance requirements grow, migrating to a purpose-built vector database becomes worthwhile.
Building a Semantic Search System
Here is a complete implementation of a semantic search system using Qdrant:
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter,
    FieldCondition, MatchValue,
)
from sentence_transformers import SentenceTransformer
import uuid
# Initialize
client = QdrantClient(host="localhost", port=6333)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Create collection
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=384,  # Dimension of all-MiniLM-L6-v2
        distance=Distance.COSINE,
    ),
)
# Index documents
documents = [
    {
        "text": "To reset your password, go to Settings > Security > Change Password.",
        "category": "account",
        "product": "web_app",
    },
    {
        "text": "Refunds are processed within 5-7 business days after approval.",
        "category": "billing",
        "product": "all",
    },
    # ... hundreds or thousands more documents
]
points = []
for doc in documents:
    embedding = encoder.encode(doc["text"]).tolist()
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload=doc,
    ))
client.upsert(collection_name="knowledge_base", points=points)
# Search with optional metadata filtering
def search(query: str, category: str | None = None, top_k: int = 5):
    query_vector = encoder.encode(query).tolist()
    search_filter = None
    if category:
        search_filter = Filter(
            must=[
                FieldCondition(
                    key="category",
                    match=MatchValue(value=category),
                )
            ]
        )
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_vector,
        query_filter=search_filter,
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score}
        for hit in results
    ]
# Example: semantic match despite different wording
results = search("I can't log in to my account")
# Returns: "To reset your password, go to Settings > Security..."

Optimizing for Production Quality
Raw vector similarity is a starting point, not a final answer. Production search systems layer additional techniques to improve relevance:
Hybrid search combines dense vector similarity with sparse keyword matching (BM25). This catches cases where exact terminology matters (product SKUs, error codes, proper nouns) that embedding models may not distinguish well.
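One common way to merge the keyword and vector result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. A small sketch with hypothetical document IDs:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from a BM25 pass and a dense-vector pass
bm25_results = ["doc_7", "doc_2", "doc_9", "doc_4"]
dense_results = ["doc_2", "doc_4", "doc_7", "doc_1"]
fused = rrf_fuse([bm25_results, dense_results])
# -> ['doc_2', 'doc_7', 'doc_4', 'doc_9', 'doc_1']
```

Documents that rank well in both lists float to the top; the constant k damps the influence any single list's top hit can have.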
Re-ranking applies a more expensive cross-encoder model to the top candidates retrieved by the vector search. Cross-encoders process the query and document together, capturing interactions that bi-encoder embeddings miss, at the cost of higher latency.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def search_with_reranking(query, top_k=5, rerank_top=20):
    # Stage 1: Fast retrieval of candidates
    candidates = search(query, top_k=rerank_top)
    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, c["text"]] for c in candidates]
    rerank_scores = reranker.predict(pairs)
    for i, score in enumerate(rerank_scores):
        candidates[i]["rerank_score"] = float(score)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]

Metadata filtering narrows the search space before vector comparison. If you know the user is asking about billing, filter to billing documents first, then run similarity search within that subset. This improves both relevance and performance.
Chunking strategy determines how you split documents into searchable units. Chunks that are too large dilute the semantic signal; chunks that are too small lose context. A sliding window approach with overlap, typically 200-500 tokens with 50-token overlap, works well for most text corpora.
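A sliding-window chunker along those lines is only a few lines of Python. In this sketch, whitespace-split words stand in for tokenizer tokens; swap in your embedding model's actual tokenizer for accurate token budgets:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    tokens = text.split()  # stand-in for a real tokenizer
    chunks = []
    step = chunk_size - overlap  # each window starts `step` tokens after the last
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_text(doc)  # 3 chunks of 300, 300, and 200 words
```

Because consecutive windows share `overlap` tokens, a sentence that straddles a boundary still appears whole in at least one chunk, which is the point of the overlap.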