Reranking
Definition
Reranking is a second-stage retrieval step that reorders an initial set of candidate documents by deeper relevance to the query. After a fast first-stage retriever returns many candidates, a more powerful (often cross-encoder) model scores each query-document pair, surfacing the best passages to feed a language model for grounded, accurate answers.
How it works
Retrieval is usually a two-stage funnel. The first stage uses fast methods — BM25, vector search, or a hybrid of both — to pull back a broad candidate set, often dozens or hundreds of documents. This stage optimizes for recall: don't miss anything relevant.
The second stage is reranking. A more expensive model looks at the query and each candidate together and scores their true relevance. Cross-encoders are common here: instead of comparing two precomputed vectors, they read the query and document jointly, which captures nuance a single embedding distance can miss. The candidates are then reordered by this sharper score.
Because reranking only runs on the shortlist from stage one, it can afford to be slow and precise. Only the top few reranked passages get placed into the model's context window for answer generation.
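The two-stage funnel can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: first_stage is a crude term-overlap retriever standing in for BM25 or vector search, and toy_pair_scorer is a hypothetical stand-in for a cross-encoder that sees the query and document together and rewards matching word order. All function names here are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def first_stage(query: str, corpus: List[str], k: int) -> List[str]:
    """Cheap, recall-oriented stage: rank by number of shared terms."""
    q_terms = set(tokenize(query))
    scored = [(len(q_terms & set(tokenize(doc))), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def toy_pair_scorer(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder (assumption, not a real model).

    It looks at the query and document jointly and rewards query word
    pairs that appear adjacent and in the same order in the document.
    """
    q, d = tokenize(query), tokenize(doc)
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    overlap = len(set(q) & set(d)) / max(len(set(q)), 1)
    phrase = len(q_bigrams & d_bigrams) / max(len(q_bigrams), 1)
    return overlap + 2.0 * phrase


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[float, str]]:
    """Expensive, precision-oriented stage: score every query-document pair."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


corpus = [
    "reranking with an encoder is a cross between speed and cost",
    "a cross encoder reranking model reads both texts",
    "bm25 is a keyword ranking function",
    "vector search compares encoder outputs",
]
query = "cross encoder reranking"
candidates = first_stage(query, corpus, k=3)        # broad shortlist
top_passages = rerank(query, candidates, toy_pair_scorer, top_n=2)
for score, passage in top_passages:
    print(f"{score:.2f}  {passage}")
```

In this toy corpus the first stage cannot tell the first two documents apart, since both share all three query terms. The pair scorer, which reads query and document together, moves the document containing the exact phrase to the top.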
Why it matters for AI search
What an AI engine cites is typically drawn from the top reranked passages. First-stage retrieval is good at finding plausibly relevant documents but imperfect at ordering them; reranking fixes the ordering, pushing the genuinely best evidence to the top and demoting near-misses.
Better-ordered passages mean more accurate, better-grounded answers and more relevant citations. For content owners, this raises the bar: appearing in the candidate set is not enough — content must be the most directly relevant, specific answer to the query to survive reranking and reach the context window where citations are decided.
Frequently asked questions
What does a reranker do?
A reranker takes the candidate documents returned by first-stage retrieval and reorders them by deeper, query-specific relevance, usually with a more powerful model that scores each query-document pair so the best passages rise to the top.
How is reranking different from initial retrieval?
Initial retrieval is fast and optimizes for recall across a large corpus. Reranking is slower and more precise, running only on the small candidate set from the first stage to sharpen the ordering before passages reach the LLM.
What is a cross-encoder reranker?
A cross-encoder reads the query and a candidate document together in one pass rather than comparing separate precomputed vectors. This joint view captures nuance that single-vector similarity can miss, producing more accurate relevance scores.
Why does reranking matter for getting cited?
AI engines typically cite from the top reranked passages. Being in the candidate set is not enough; content must be the most directly relevant answer to survive reranking and land in the context window where citation decisions happen.
Hybrid search
Hybrid search combines keyword (lexical) retrieval and vector (semantic) retrieval so an AI system matches both exact terms and underlying meaning. By blending methods like BM25 with embedding similarity, it improves recall and precision over either approach alone, producing better candidate passages for grounding and citation in AI answers.
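One common way to blend a lexical ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already produced an ordered list of document IDs; the function name and the example IDs are hypothetical, and k=60 is a commonly used default rather than a requirement. Other blending schemes, such as weighted score sums, are equally valid.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ordering.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


lexical_hits = ["a", "b", "c"]    # e.g. from BM25 (illustrative IDs)
semantic_hits = ["b", "d", "a"]   # e.g. from vector search (illustrative IDs)
print(reciprocal_rank_fusion([lexical_hits, semantic_hits]))
```

Because RRF uses only rank positions, it sidesteps the problem that BM25 scores and embedding similarities live on different scales.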
Vector search
Vector search is a retrieval method that finds information by comparing numerical meaning representations called embeddings, rather than matching exact keywords. Queries and documents are converted to vectors, and the system returns items whose vectors are closest in space — surfacing semantically relevant results even when the wording differs.
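As a rough illustration of "closest in space", the sketch below ranks a handful of hand-written vectors by cosine similarity to a query vector. The two-dimensional vectors and document IDs are made up for the example; real systems use embeddings with hundreds or thousands of dimensions and an approximate nearest-neighbor index instead of a full scan.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[str]:
    """Brute-force search: score every stored vector, keep the k most similar."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy index with made-up 2-D vectors (assumption for illustration).
index = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
print(nearest([1.0, 0.1], index, k=2))
```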
BM25
BM25 (Okapi BM25) is a classic keyword-based ranking algorithm that scores how well a document matches a query's terms. It weighs term frequency, rarity, and document length to rank results. Despite being decades old, BM25 remains a core candidate generator in modern AI retrieval pipelines, often paired with vector search.
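The interplay of term frequency, rarity, and document length is easier to see in code. The following is a minimal sketch of one widely used BM25 variant, assuming pre-tokenized documents, the common default parameters k1=1.5 and b=0.75, and the "plus one inside the log" IDF form that keeps scores non-negative. Production search engines differ in details, so treat this as illustrative.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term (drives rarity / IDF).
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["bm25", "ranks", "documents"],
    ["vector", "search", "ranks", "meaning"],
    ["cooking", "pasta"],
]
print(bm25_scores(["bm25", "ranks"], docs))
```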
Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) is an AI architecture that gives a large language model access to external documents at query time: it retrieves relevant passages from a vector database or search index and inserts them into the model's context before the model generates a response. RAG is the foundation of modern AI search and a widely used technique for reducing hallucination.
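The "insert passages into the context" step can be sketched as simple prompt assembly. The function name, prompt wording, and passage limit below are all assumptions made for illustration; the retrieval step and the call to a language model are intentionally left out.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Place the top retrieved passages ahead of the question.

    `passages` is assumed to be already ordered best-first, for example
    by a reranker; only the first `max_passages` are included.
    """
    selected = passages[:max_passages]
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return (
        "Answer the question using only the numbered passages. "
        "Cite passages by number.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("What is BM25?", ["BM25 is a ranking function.", "Pasta is food."], 1))
```

Numbering the passages gives the model something concrete to cite, which is where passage ordering starts to influence which sources appear in the answer.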
Embeddings
Embeddings are numerical vector representations of text, images, or other data that capture semantic meaning. By mapping content into a high-dimensional space where similar items sit close together, embeddings let AI systems compare meaning mathematically, powering similarity search, retrieval, clustering, and recommendation.
Citation probability
Citation probability is the likelihood that an AI system will cite a specific URL when generating a response to a target prompt. Unlike share of model, which measures brand visibility across a prompt set, citation probability is a per-URL metric — it tells you how strong an individual page is at earning citations.
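One simple way to estimate this metric, assuming you can run the same target prompt several times and record which URLs each response cites, is the fraction of runs that cite the page. The function name, data layout, and example URLs below are hypothetical, and this empirical estimate is only one possible approach.

```python
from typing import List


def citation_probability(target_url: str, runs: List[List[str]]) -> float:
    """Fraction of recorded responses that cite `target_url`.

    `runs` holds one list of cited URLs per response to the same prompt.
    """
    if not runs:
        raise ValueError("need at least one recorded response")
    hits = sum(1 for cited in runs if target_url in cited)
    return hits / len(runs)


# Illustrative data: four responses to the same prompt (made-up URLs).
runs = [
    ["https://example.com/a", "https://example.com/b"],
    ["https://example.com/b"],
    ["https://example.com/a"],
    [],
]
print(citation_probability("https://example.com/a", runs))
```

With only a handful of runs the estimate is noisy, so more samples give a more stable figure.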