RAG Beyond Basics: Reranking, Hybrid Search & pgvector in Production


Every RAG tutorial online looks the same: embed your documents, store vectors, retrieve by cosine similarity, pass to LLM, done. It works well enough in demos. In production with real users and real data, it falls apart in ways that are frustrating to debug and embarrassing to explain to clients.

This is what we've learned building RAG systems that actually hold up under production load. No toy examples.

Why Naive RAG Fails

The fundamental assumption of naive RAG is that cosine similarity between query and document embeddings is a reliable proxy for relevance. It isn't. Not reliably.

Three common failure modes:

Semantic drift under domain shift. General-purpose embeddings (OpenAI text-embedding-3-small, for example) encode semantic relationships based on general web text. If your documents are highly domain-specific — legal contracts, medical records, financial reports — the embedding model's representation of domain-specific terms may not accurately reflect their relevance relationships in your context.

Popularity bias in embedding space. Common terms cluster together in embedding space. Queries using common vocabulary will retrieve documents that happen to contain those common terms, not necessarily the most relevant documents. A query about "contract termination clauses" might retrieve documents that frequently use the word "contract" rather than documents specifically about termination clauses.

Top-K retrieval truncation. You retrieve the top 5 or 10 documents. The actual most relevant document might be at position 8 in a semantic search but position 1 if you also considered keyword matching. With naive RAG, you never find it.

Reranking: The Missing Layer

Reranking is the single highest-impact improvement you can make to a RAG pipeline. The idea: retrieve a larger candidate set (top 20-50) using fast approximate vector search, then apply a more accurate but slower cross-encoder model to reorder them and take the actual top K.

Why does this work? Bi-encoder models (used for initial retrieval) encode query and document independently and compare their representations. Cross-encoder models (used for reranking) see the query and document together, which gives them far more context for relevance judgment. They're slower — you can't run them over your entire corpus — but run on a small candidate set, they're fast enough.

Cohere's reranking API makes this straightforward:

import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank_results(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        model="rerank-multilingual-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )
    return [
        {
            "document": result.document.text,
            "relevance_score": result.relevance_score,
            "index": result.index,
        }
        for result in response.results
    ]

# Usage in pipeline
initial_results = vector_store.search(query, limit=30)  # Retrieve 30
docs = [r.content for r in initial_results]
reranked = rerank_results(query, docs, top_n=5)  # Keep top 5

In our production systems we consistently see 15-30% improvement in answer quality after adding reranking, measured by user satisfaction and factual accuracy on internal benchmarks. The latency cost is real (typically 200-500ms added) but acceptable for most business applications.

Use rerank-multilingual-v3.0 for anything touching European languages. It handles Polish, German, Ukrainian, French, and others without the quality degradation you see with English-only models.

Hybrid Search: Best of Both Worlds

Hybrid search combines semantic vector search with traditional keyword (BM25/full-text) search. The intuition: sometimes what matters is not semantic similarity but exact term matching. Model names, product codes, regulatory references, proper nouns — these need keyword matching to be found reliably.

The basic approach: run both searches, merge and deduplicate results, apply a fusion function.

import psycopg2
from pgvector.psycopg2 import register_vector

def hybrid_search(
    query: str,
    query_embedding: list[float],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    limit: int = 20,
) -> list[dict]:
    conn = get_connection()  # psycopg2 connection factory, defined elsewhere
    register_vector(conn)    # register pgvector type adapters on this connection

    # Reciprocal Rank Fusion
    sql = """
    WITH semantic AS (
        SELECT
            id,
            content,
            1 - (embedding <=> %s::vector) AS semantic_score,
            ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS sem_rank
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    ),
    keyword AS (
        SELECT
            id,
            content,
            ts_rank_cd(to_tsvector('english', content), query) AS keyword_score,
            ROW_NUMBER() OVER (
                ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
            ) AS kw_rank
        FROM documents,
             plainto_tsquery('english', %s) query
        WHERE to_tsvector('english', content) @@ query
        ORDER BY kw_rank
        LIMIT %s
    ),
    fused AS (
        SELECT
            COALESCE(s.id, k.id) AS id,
            COALESCE(s.content, k.content) AS content,
            COALESCE(1.0 / (60 + s.sem_rank), 0) * %s +
            COALESCE(1.0 / (60 + k.kw_rank), 0) * %s AS score
        FROM semantic s
        FULL OUTER JOIN keyword k ON s.id = k.id
    )
    SELECT id, content, score
    FROM fused
    ORDER BY score DESC
    LIMIT %s;
    """

    params = (
        query_embedding, query_embedding, query_embedding, limit * 2,  # semantic
        query, limit * 2,  # keyword
        semantic_weight, keyword_weight,  # fusion weights
        limit,
    )

    with conn.cursor() as cursor:
        cursor.execute(sql, params)
        rows = cursor.fetchall()

    return [{"id": r[0], "content": r[1], "score": r[2]} for r in rows]

The 60 in the RRF formula is a constant that controls rank sensitivity. Higher values make the fusion less sensitive to rank differences — use 60 as a starting point and tune from there.
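
The same weighted RRF can be sanity-checked outside the database when the two result lists come from separate systems. A minimal client-side sketch (the function name is illustrative, not a library API; weights and the `k = 60` constant mirror the SQL above):

```python
def rrf_fuse(
    semantic_ids: list[str],
    keyword_ids: list[str],
    k: int = 60,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[tuple[str, float]]:
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) per document; scores sum across lists."""
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + semantic_weight / (k + rank)
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + keyword_weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that a document appearing in both lists gets both contributions, which is exactly why hybrid search surfaces results that neither search alone would rank highly.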

pgvector vs Qdrant: The Honest Trade-off

Both are legitimate choices. The decision isn't technical religiosity — it's about your existing infrastructure and operational complexity tolerance.

pgvector makes sense when:

- You're already on PostgreSQL and want to minimize operational complexity
- Your dataset fits comfortably in a single database instance (hundreds of millions of vectors or fewer)
- You need ACID transactions that combine vector search with relational data
- Your team knows SQL and doesn't want to learn another query language

Qdrant makes sense when:

- You're running at very large scale (billions of vectors)
- You need named vectors and multi-vector search out of the box
- You want dedicated vector DB performance without tuning PostgreSQL
- You need built-in payload filtering that's more expressive than SQL WHERE clauses

For the B2B projects we typically work on — CRMs, knowledge bases, document Q&A — pgvector handles the scale comfortably and dramatically reduces infrastructure complexity. We run Qdrant when clients have specific performance requirements or are already operating at scale.

pgvector optimization tips that matter:

-- Use HNSW index for production (better query performance than IVFFlat)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Set ef_search at query time based on accuracy/speed trade-off
SET hnsw.ef_search = 100;  -- Higher = more accurate, slower

-- Partial indexes for filtered search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WHERE active = true AND language = 'en';

-- Check index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, 1 - (embedding <=> '[...]'::vector) AS score
FROM documents
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

Chunking: Semantic Over Fixed-Size

Fixed-size chunking (split every N tokens) is simple and consistently mediocre. The problem: it splits on arbitrary boundaries that often destroy semantic coherence. A paragraph about a regulatory requirement gets split into three chunks, each of which retrieves poorly because it lacks context.

Semantic chunking groups text by conceptual boundaries: paragraph breaks, section headers, topic shifts. More expensive to compute, significantly better retrieval results.

A practical hybrid approach:

import re

def semantic_chunk(text: str, max_chunk_size: int = 512) -> list[str]:
    """
    Split text by paragraph boundaries, merging short paragraphs
    and splitting long ones at sentence boundaries.
    """
    def split_long(para: str) -> list[str]:
        # Break an over-long paragraph at sentence boundaries.
        sentences = re.split(r'(?<=[.!?])\s+', para)
        pieces, piece, size = [], [], 0
        for sent in sentences:
            n = len(sent.split())
            if size + n > max_chunk_size and piece:
                pieces.append(' '.join(piece))
                piece, size = [], 0
            piece.append(sent)
            size += n
        if piece:
            pieces.append(' '.join(piece))
        return pieces

    paragraphs = []
    for p in text.split('\n\n'):
        p = p.strip()
        if not p:
            continue
        # Paragraphs over the limit get pre-split; the rest pass through.
        paragraphs.extend(split_long(p) if len(p.split()) > max_chunk_size else [p])

    chunks, current_chunk, current_size = [], [], 0
    for para in paragraphs:
        para_size = len(para.split())
        if current_size + para_size > max_chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk, current_size = [para], para_size
        else:
            current_chunk.append(para)
            current_size += para_size

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

Always store chunk metadata: source document ID, chunk position, section header if available, page number for PDFs. Without metadata, retrieval results are uninterpretable and citation is impossible.

Embedding Model Selection for Multilingual

If you're building for European markets, English-only embedding models are a genuine quality problem. text-embedding-3-small is trained primarily on English text. Cross-lingual transfer exists but degrades for morphologically complex languages like Polish or Ukrainian.

Cohere embed-multilingual-v3.0 is what we use as the default for any system handling multiple European languages. It explicitly supports 100+ languages with meaningful quality in Polish, German, Ukrainian, French, and others. The quality difference on non-English queries is significant enough to matter in production.

For monolingual English systems, text-embedding-3-large with reduced dimensions (3072 → 256 or 512 using Matryoshka representation learning) is competitive on cost while maintaining quality.
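
The reduction itself is simple if your provider returns full-width vectors: keep a prefix and re-normalize so cosine similarity still behaves (the OpenAI embeddings API can also do this server-side via its `dimensions` request parameter). A minimal sketch:

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Matryoshka-style reduction: keep the first `dims` components,
    then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```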

Log your retrieval scores. Seriously. If you're not storing the embedding distances and reranker scores for every query, you have no signal for debugging retrieval quality. We log all of this to PostgreSQL and run weekly analysis on queries with low confidence scores.
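
What a log row might look like before it hits PostgreSQL — a sketch, with column names that are assumptions rather than our actual schema:

```python
import json
import time

def retrieval_log_record(query: str, results: list[dict]) -> dict:
    """One log row per query: raw scores kept as JSON so low-confidence
    queries can later be found with a simple SQL filter on top_score."""
    scores = [r.get("relevance_score", r.get("score", 0.0)) for r in results]
    return {
        "ts": time.time(),
        "query": query,
        "top_score": max(scores, default=0.0),
        "scores_json": json.dumps(scores),
    }
```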

Putting It Together

A production RAG pipeline looks like this:

  1. Ingestion: semantic chunking → embed with Cohere multilingual → store in pgvector with metadata
  2. Retrieval: hybrid search (semantic + keyword) → candidate set of 20-30
  3. Reranking: Cohere rerank → top 5 results
  4. Generation: LLM with retrieved context → response
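
The four stages above wire together in a few lines. A sketch with each stage injected as a callable, so any one can be swapped or tuned without touching the others (the function name and signatures are illustrative):

```python
from typing import Callable

def answer_query(
    query: str,
    embed: Callable[[str], list[float]],
    search: Callable[[str, list[float]], list[dict]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
    top_k: int = 5,
) -> str:
    """Run the pipeline: embed -> hybrid search -> rerank -> generate."""
    query_embedding = embed(query)
    candidates = search(query, query_embedding)   # candidate set of 20-30
    docs = [c["content"] for c in candidates]
    top_docs = rerank(query, docs)[:top_k]        # cross-encoder reorder, keep top K
    return generate(query, top_docs)              # LLM with retrieved context
```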

Each step has observability (latency, scores, chunk IDs). Each step is tunable without touching the others.

The gap between naive RAG and this pipeline is not marginal. For real-world document Q&A with domain-specific content, the difference in user-reported accuracy is often the difference between a system users trust and one they ignore after a week.


We build production RAG systems for B2B clients. If you're evaluating RAG architecture for your use case, talk to us.
