Context window
Definition
A context window is the maximum amount of text, measured in tokens, that a language model can consider in a single interaction, including the prompt, retrieved documents, conversation history, and the model's own output. Frontier models in early 2026 reach context windows of roughly a million tokens, enough to take in long documents and support richer grounding.
How it works
Everything a model sees in one interaction must fit in its context window: the system prompt, the user's question, any retrieved passages, tool outputs, prior conversation turns, and the response being generated. The window is measured in tokens, chunks of text that average roughly three-quarters of a word each in English. When the total exceeds the window, older content must be truncated, summarized, or dropped.
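As a rough illustration of the budgeting described above, the sketch below keeps the newest conversation turns that fit and drops the oldest first. The names `estimate_tokens` and `trim_history` are hypothetical, and the token count uses the three-quarters-of-a-word rule of thumb rather than a real tokenizer, so actual counts will differ by model.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate only: one token is about 3/4 of a word, so tokens ~ words * 4/3.

    A real system would call the model's own tokenizer instead.
    """
    return math.ceil(len(text.split()) * 4 / 3)


def trim_history(system_prompt: str, turns: list[str],
                 window_tokens: int, reserved_for_output: int) -> list[str]:
    """Return the newest turns that fit in the window, in their original order.

    The system prompt and the space reserved for the response are charged
    against the window first; whatever is left is spent on history.
    """
    budget = window_tokens - reserved_for_output - estimate_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break                         # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()                        # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["one two three", "four five six", "seven eight nine"]
    print(trim_history("be brief", history, window_tokens=20, reserved_for_output=8))
```

Dropping whole turns is only one of the options mentioned above; a system could instead summarize the older turns and keep the summary.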
Context windows have grown rapidly. Early large models handled a few thousand tokens; frontier models in early 2026 reach around a million tokens, enough for entire books or large codebases at once. Bigger windows reduce the need to chop inputs into pieces and let a model reason over more material in a single pass.
Larger is not automatically better. Long contexts cost more, add latency, and can suffer from uneven attention, where details buried in the middle get less weight than information at the start or end.
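One mitigation sometimes used for uneven attention is to arrange passages so the strongest ones sit at the start and end of the context and the weakest fall in the middle. The sketch below assumes the input is already sorted from most to least relevant; the function name `order_for_long_context` is hypothetical, and whether this ordering helps depends on the model.

```python
def order_for_long_context(ranked: list[str]) -> list[str]:
    """Place the best-ranked items at the edges and the weakest in the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    Even positions (0, 2, 4, ...) fill the front; odd positions fill the
    back in reverse, so rank 1 lands first and rank 2 lands last.
    """
    front, back = [], []
    for i, passage in enumerate(ranked):
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


if __name__ == "__main__":
    print(order_for_long_context(["a", "b", "c", "d", "e"]))
```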
Why it matters for AI search
The context window is the workspace where grounding happens. In retrieval-augmented generation, retrieved passages must be placed into the window for the model to use them, so context size caps how much evidence can inform a single answer. Larger windows let AI engines bring in more sources, longer pages, and richer history before responding.
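To make the cap on evidence concrete, here is a minimal sketch of packing ranked passages into a fixed token budget. The `Passage` type, the `pack_passages` function, and the word-based token estimate are all assumptions for illustration; real engines use their own tokenizers and selection logic.

```python
import math
from typing import NamedTuple


class Passage(NamedTuple):
    source: str    # hypothetical identifier, e.g. a page title
    text: str
    score: float   # relevance score from an upstream retriever


def estimate_tokens(text: str) -> int:
    """Rough estimate only: tokens ~ words * 4/3 (not a real tokenizer)."""
    return math.ceil(len(text.split()) * 4 / 3)


def pack_passages(passages: list[Passage], budget_tokens: int) -> list[Passage]:
    """Greedily take the highest-scoring passages that still fit the budget.

    A passage too large for the remaining space is skipped, and smaller,
    lower-scoring passages may still be added after it.
    """
    chosen = []
    remaining = budget_tokens
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        cost = estimate_tokens(p.text)
        if cost <= remaining:
            chosen.append(p)
            remaining -= cost
    return chosen


if __name__ == "__main__":
    candidates = [
        Passage("A", "alpha beta gamma", 0.9),
        Passage("B", "one two three four five six", 0.8),
        Passage("C", "delta epsilon zeta", 0.5),
    ]
    print([p.source for p in pack_passages(candidates, budget_tokens=9)])
```

In this example the long passage B is skipped even though it outscores C, which shows why a concise passage can win a place in the window over a longer one.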
For content owners, this interacts with how content is selected and chunked. Even with big windows, systems still retrieve and rank passages, so what matters is being the clearly relevant, well-structured passage that earns a place in the context window. Concise, self-contained sections are easier to retrieve, fit cleanly into the window, and are more likely to survive to the point where citations are made.
Frequently asked questions
What is a context window?
It is the maximum number of tokens a model can process in one interaction, covering the prompt, retrieved documents, conversation history, and the generated output combined. Exceeding it forces older content to be truncated or summarized.
How large are context windows in 2026?
Frontier models reach roughly a million tokens, enough to hold entire books or large codebases in a single interaction. Smaller and older models have considerably tighter limits.
Is a bigger context window always better?
Not always. Larger windows cost more and add latency, and models can attend unevenly to very long inputs, underweighting details buried in the middle. Relevant, well-organized context still matters more than sheer length.
How does the context window affect AI citations?
Retrieved passages must fit in the window to influence an answer, so it caps how much evidence informs a response. Concise, self-contained, well-structured content is easier to retrieve and fit, improving its chance of being grounded and cited.
Tokens
Tokens are the fundamental units of text that language models process. A tokenizer splits text into tokens, which can be subwords, whole words, or characters, and the model generates its output one token at a time. Token counts determine API pricing, how much fits in a context window, and the practical capacity of any AI interaction.
Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) is an AI architecture that gives a large language model access to external documents at query time: it retrieves relevant passages from a vector database or search index and inserts them into the model's context before the model generates a response. RAG underpins modern AI search and is one of the most widely used techniques for reducing hallucination.
Context engineering
Context engineering is the discipline of assembling the right information, instructions, tools, and memory into a language model's context window so it produces accurate, grounded outputs. It broadens prompt engineering beyond wording to the whole question of what gets retrieved, included, ordered, and excluded at inference time.
AI grounding
AI grounding is the practice of anchoring an LLM's response in retrieved, citable sources at inference time, instead of letting the model rely solely on its training memory. Grounding is what separates a hallucination-prone chatbot from a search-grade AI assistant like Perplexity, Google AI Overviews, Microsoft Copilot (formerly Bing Chat), or retrieval-augmented ChatGPT.
Large language model (LLM)
A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on the transformer architecture and containing billions of parameters, LLMs predict the next token in a sequence, which enables them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.
Reranking
Reranking is a second-stage retrieval step that reorders an initial set of candidate documents by deeper relevance to the query. After a fast first-stage retriever returns many candidates, a more powerful (often cross-encoder) model scores each query-document pair, surfacing the best passages to feed a language model for grounded, accurate answers.
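The two-stage flow described above can be sketched as follows. Both scorers here are toy stand-ins chosen so the example runs without external libraries: `first_stage` counts shared words in place of a real first-stage retriever, and `toy_pair_score` takes the place of a cross-encoder. All function names are hypothetical.

```python
from typing import Callable


def first_stage(query: str, docs: list[str], k: int) -> list[str]:
    """Cheap, recall-oriented retrieval: rank by number of shared words.

    Stand-in for a fast retriever such as keyword or vector search.
    """
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in docs]
    scored = [s for s in scored if s[0] > 0]           # drop non-matching docs
    scored.sort(key=lambda s: s[0], reverse=True)      # stable: ties keep input order
    return [d for _, d in scored[:k]]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: share of the doc's words found in the query."""
    q = set(query.lower().split())
    words = doc.lower().split()
    return sum(w in q for w in words) / len(words) if words else 0.0


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int) -> list[str]:
    """Second stage: score every (query, document) pair and keep the best."""
    ordered = sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)
    return ordered[:top_n]


if __name__ == "__main__":
    docs = [
        "the context window is the maximum amount of text a model can "
        "consider and window size varies",
        "context window size limits",
        "bananas are yellow",
    ]
    query = "context window size"
    candidates = first_stage(query, docs, k=2)
    print(rerank(query, candidates, toy_pair_score, top_n=1))
```

The first stage cannot separate the two matching documents, since both share three words with the query; the second-stage scorer prefers the shorter, more focused one.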