
AI inference

Definition

AI inference is the runtime step where a trained AI model takes a prompt and produces an output — the tokens you see streaming back from ChatGPT, Claude, Gemini, or Perplexity. Inference is what costs money in production: every prompt and every generated token consumes GPU time, and the economics of any AI product live in this loop.

How it works

An AI inference call has three stages:

  • Prefill: the model ingests the prompt and any retrieved context, computing a key-value cache. Cost scales with input token count.

  • Decode: the model generates output tokens one at a time, each conditioned on everything that came before. Latency scales with output token count and model size.

  • Stream or batch: tokens stream back over Server-Sent Events for chat experiences, or arrive as a single JSON response for batch use cases (a minimal streaming sketch follows this list).
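
As a concrete illustration, here is a minimal streaming call sketched with the OpenAI Python SDK; the model name and prompt are placeholder choices, and other providers expose equivalent streaming interfaces:

```python
# Minimal streaming inference call (OpenAI Python SDK).
# Assumes OPENAI_API_KEY is set; model name is illustrative.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model choice
    messages=[{"role": "user", "content": "Explain prefill vs decode in one paragraph."}],
    max_tokens=200,        # bound the decode stage
    stream=True,           # tokens arrive as they are decoded
)

# Prefill happens server-side before the first chunk arrives;
# each subsequent chunk carries freshly decoded output tokens.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

The wait before the first chunk is the prefill stage; the pace of chunks after that is decode throughput.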

Frontier inference also includes optional steps: retrieval (for grounded answers), tool calls (for agents), and structured-output validation. Each adds latency but improves reliability.
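
Putting the optional steps together, a grounded pipeline has roughly this shape; the three helpers below are stubs standing in for a real retriever, a real model call, and a real validator:

```python
# Shape of a grounded inference pipeline. The three helpers are stubs
# standing in for a real retriever, model call, and validator.
def retrieve(query: str) -> list[str]:
    # Real version: query a vector database or search index.
    return [f"stub passage about {query}"]

def generate(query: str, passages: list[str]) -> str:
    # Real version: prefill prompt + passages, then decode output tokens.
    return f"Answer to {query!r}, citing {len(passages)} source(s)."

def validate(draft: str) -> str:
    # Real version: schema checks, citation checks, tool-call argument checks.
    assert "citing" in draft, "re-roll: answer lacks citations"
    return draft

def answer(query: str) -> str:
    passages = retrieve(query)          # optional: grounding
    draft = generate(query, passages)   # core inference (prefill + decode)
    return validate(draft)              # optional: fail loudly, not silently

print(answer("AI inference"))
```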

Inference vs training

Training builds the model. Inference uses it.

Training is a one-time (or periodic) operation that consumes massive compute — thousands of GPUs running for weeks. Inference happens every time a user sends a prompt and consumes a small but non-trivial slice of GPU time per call.

The economic asymmetry matters: a frontier model might cost tens of millions of dollars to train but pennies per inference call. Volume tips the balance — at scale, inference dominates the AI compute bill.
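
To make the asymmetry concrete, here is a back-of-the-envelope per-call calculation; the token prices are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-the-envelope inference cost per call.
# Prices are illustrative assumptions (USD per million tokens).
INPUT_PRICE_PER_M = 3.00    # assumed input token price
OUTPUT_PRICE_PER_M = 15.00  # assumed output token price

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single inference call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A typical chat turn: 2,000 prompt tokens, 500 generated tokens.
print(f"${call_cost(2_000, 500):.4f} per call")  # $0.0135, about a penny
# At 10 million calls per month, that's ~$135,000/month: volume dominates.
```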

  • <1s: first-token latency target for customer-facing inference (Indexly best practice)

  • 5–10×: cost reduction from prompt caching on repeated-context workloads (Indexly engineering)

  • Pennies: typical per-call cost for frontier model inference at production scale (Indexly observation)

Why it matters

Inference is where AI products live or die. Latency, cost, and reliability are all decided in the inference loop:

  • Latency determines whether a chat experience feels magical or sluggish. Sub-second first-token latency is the modern bar.

  • Cost scales with both input and output token volume. Caching, batch APIs, and routing simple requests to cheaper models are the main levers.

  • Reliability depends on grounding, structured-output validation, and retry logic. Hallucination at inference time is the most common failure mode (one validation pattern is sketched after this list).
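
As one way to implement that reliability check, here is a sketch pairing schema validation with a simple re-roll, using pydantic; generate_answer is a hypothetical stand-in (stubbed below) for whatever inference call your stack makes:

```python
# Validate model output against a schema before showing it; re-roll on failure.
# generate_answer is a hypothetical stand-in for your actual inference call.
import json
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    summary: str
    citations: list[str]   # URLs the answer claims to be grounded in

def generate_answer(prompt: str) -> str:
    # Stub: a real implementation would call the model with a JSON schema.
    return json.dumps({"summary": "stub", "citations": ["https://example.com"]})

def safe_answer(prompt: str, max_attempts: int = 2) -> Answer | None:
    for _ in range(max_attempts):
        try:
            answer = Answer.model_validate_json(generate_answer(prompt))
        except ValidationError:
            continue                     # malformed output: re-roll
        if answer.citations:             # require at least one citation
            return answer
    return None                          # fail closed rather than show junk

print(safe_answer("What changed in the refund policy?"))
```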

For brands optimizing for AI search, inference is also where citation decisions get made. Grounded inference pulls from the index at runtime — meaning a page added yesterday can be cited today, without waiting for the next training run.

How to optimize inference

Five practices for production inference:

  1. Cache prompts. Anthropic's prompt caching, OpenAI's cached inputs, and similar features cut cost 5–10× on repeated context (see the first sketch after this list).

  2. Stream responses. First-token latency drops dramatically when you stream — even if total generation time is unchanged, the user perceives the product as faster.

  3. Route by complexity. Send simple classification to Haiku / Flash / GPT-4o-mini and hard reasoning to Sonnet / Opus / GPT-4o. Quality improves and cost drops simultaneously (see the second sketch after this list).

  4. Bound output length. Set max_tokens aggressively and use structured outputs to keep generations terse. Most products waste tokens on filler the user never reads.

  5. Validate before showing. Citation validation, schema validation, and tool-call argument validation catch the rare hallucinations that slip through. The cost of a re-roll is small compared to the cost of a wrong answer.
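
A minimal sketch of practice 1, using the Anthropic Python SDK's prompt caching; the model name, system text, and question are illustrative:

```python
# Prompt caching with the Anthropic Python SDK: mark the large, repeated
# context block as cacheable so subsequent calls reuse the prefill work.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

LONG_CONTEXT = "...tens of thousands of tokens of docs, policies, examples..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model choice
    max_tokens=300,                     # bound output length (practice 4)
    system=[
        {
            "type": "text",
            "text": LONG_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Summarize the refund policy."}],
)
print(response.content[0].text)
```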
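
And a sketch of practice 3: a toy complexity router. The keyword heuristic and the model names are placeholder assumptions; production routers are usually tuned per product:

```python
# Toy request router: cheap model for simple requests, stronger model
# for requests that look like multi-step reasoning. Heuristic is illustrative.
CHEAP_MODEL = "gpt-4o-mini"   # placeholder for Haiku / Flash / mini-tier
STRONG_MODEL = "gpt-4o"       # placeholder for Sonnet / Opus / frontier-tier

REASONING_HINTS = ("why", "compare", "plan", "step by step", "trade-off")

def pick_model(prompt: str) -> str:
    looks_hard = len(prompt) > 500 or any(h in prompt.lower() for h in REASONING_HINTS)
    return STRONG_MODEL if looks_hard else CHEAP_MODEL

assert pick_model("Classify this ticket: 'refund not received'") == CHEAP_MODEL
assert pick_model("Compare these three architectures and plan a migration") == STRONG_MODEL
```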

Frequently asked questions

What's the difference between training and inference?

Training builds the model — a one-time operation that consumes massive compute. Inference uses the trained model to generate outputs, happening every time a user sends a prompt. Training cost is huge but one-time; inference cost is small per call but scales with volume.

Why is inference latency so important?

Users perceive sub-second latency as instant; multi-second waits feel sluggish even if the answer is perfect. Streaming, prompt caching, and routing simple queries to faster models are the main levers for keeping latency low without sacrificing quality.

How much does AI inference cost?

Pricing varies by model tier: small models run cents per million tokens, while frontier models typically cost a few dollars per million input tokens, with output tokens priced several times higher. Naïve production usage can run real bills at scale, but caching and routing typically bring per-user cost back to fractions of a cent per interaction.

Does inference happen on the model provider's servers?

Usually: hosted APIs (OpenAI, Anthropic, Gemini, Perplexity) run inference on the provider's servers. Open-source models can instead run on self-hosted infrastructure (vLLM, Ollama) or managed platforms (AWS Bedrock, Azure AI). Self-hosting trades managed-service convenience for control and, sometimes, lower cost.

How does inference relate to AI search visibility?

Grounded inference is what produces the cited answer a buyer reads in Perplexity, Gemini, or AI Overviews. The retrieval step inside inference picks which pages get cited. Optimizing for AI visibility is really optimizing for the retrieval stage of inference.

Related terms

AI API

An AI API is a programmatic interface that lets developers send prompts to a large language model and receive generated responses — typically over HTTP with JSON payloads. The major AI APIs in 2026 are the OpenAI API (GPT-4o, GPT-4.1), Anthropic API (Claude 3.5 / 4 Sonnet, Claude Opus), Google Gemini API, xAI Grok API, and the Perplexity API.

AI agent

An AI agent is a software system that uses a large language model (typically GPT-4o, Claude 3.5 / 4 Sonnet, Gemini 2.5, or open-source equivalents) to plan, decide, and act over multiple steps to complete a goal — calling tools, retrieving data, and producing outputs without step-by-step human supervision. Agents are the working surface of agentic AI in 2026.

AI grounding

AI grounding is the practice of anchoring an LLM's response in retrieved, citable sources at inference time — instead of letting the model rely solely on its training memory. Grounding is what separates a hallucination-prone chatbot from a search-grade AI assistant like Perplexity, Google AI Overviews, Bing Chat, or retrieval-augmented ChatGPT.

Retrieval-augmented generation (RAG)

Retrieval-augmented generation (RAG) is an AI architecture that gives a large language model real-time access to external documents at query time — retrieving relevant passages from a vector database or search index and inserting them into the model's context before it generates a response. RAG is the foundation of modern AI search and the most effective technique for reducing hallucination.

AI training data

AI training data is the corpus of text, code, images, and other content used to train large language models. Frontier models like GPT-4o, Claude 4 Sonnet, Gemini 2.5, and Llama 4 are trained on trillions of tokens drawn from web crawls, books, code repositories, and licensed datasets — the composition of which shapes what the model knows, who it cites, and how it represents brands.

AI models for deep research

AI models for deep research are the long-running, agentic modes shipped by major AI providers — ChatGPT Deep Research, Perplexity Deep Research, Gemini Deep Research, and Claude's research mode — that take a single complex prompt, autonomously plan and run dozens of web searches, read source pages end-to-end, and synthesize a multi-page report with full citations. They are the most agentic search experience exposed to consumers in 2026.