AI & LLMs · Updated May 6, 2026

Test-time compute

Definition

Test-time compute is the practice of allocating extra computation during inference — when a model is answering — so it can effectively think longer before responding. Instead of relying only on a model's size, systems spend more compute per query through longer reasoning, multiple sampled attempts, or search over candidate answers. This improves reasoning quality on hard problems and underpins modern reasoning models.

How it works

Test-time compute scales the work a model does per query rather than the size of the model. Several techniques fall under this umbrella:

Longer reasoning — letting the model generate an extended chain-of-thought trace before answering, giving it more steps to work through a problem.
Sampling and selection — producing several candidate answers and choosing the best, for example by majority vote or a scoring step.
Search — exploring multiple reasoning paths and pursuing the most promising, akin to searching a tree of possibilities.
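
As a rough illustration of the sampling-and-selection technique listed above, the sketch below draws several candidate answers and picks one either by majority vote or with a scoring step. The function sample_answer is a hypothetical stand-in for a stochastic model call, and its answer distribution is invented for the example; nothing here reflects a specific vendor's API.

```python
import random
from collections import Counter
from typing import Callable


def sample_answer(question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one stochastic model call.

    Assumption: the "model" returns the right answer ("42") most of the
    time and a wrong one otherwise. A real system would call an LLM here.
    """
    return rng.choices(["42", "41", "24"], weights=[0.6, 0.25, 0.15])[0]


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: list[str], score: Callable[[str], float]) -> str:
    """Return the answer that a scoring function rates highest."""
    return max(answers, key=score)


rng = random.Random(0)
question = "What is 6 * 7?"
answers = [sample_answer(question, rng) for _ in range(9)]

# Selection by agreement: no scorer needed, only the samples themselves.
print("majority vote:", majority_vote(answers))

# Selection by scoring: a toy checker that verifies the arithmetic directly.
print("best of n:", best_of_n(answers, score=lambda a: float(int(a) == 6 * 7)))
```

Majority vote needs nothing beyond the samples, which suits tasks with short, comparable answers. A scoring step can help when answers are hard to compare directly, but it is only as good as the scorer.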

The shared idea is that hard problems benefit from deliberation. By spending more computation at inference, a model of fixed size can reach answers it would miss in a single quick pass. The amount of compute can often be dialed up or down per request to trade accuracy against latency and cost.
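
To make the point about a single quick pass concrete, here is a minimal sketch of search on a toy problem, assuming a made-up task (reach a target number from a start number using two moves) and a simple distance heuristic. The names MOVES and beam_search are illustrative only. With a beam width of 1 the search commits to the locally best step and misses the solution; widening the beam spends more compute and finds it.

```python
from typing import Callable, Optional

# Hypothetical toy task: reach a target number using these two moves.
MOVES: dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def beam_search(
    start: int, target: int, beam_width: int, max_depth: int
) -> Optional[list[str]]:
    """Explore move sequences, keeping only the beam_width most promising.

    Returns the list of moves that reaches target, or None if the search
    runs out of depth. beam_width acts as the compute dial.
    """
    # Each beam entry is (current value, moves taken so far).
    beam: list[tuple[int, list[str]]] = [(start, [])]
    for _ in range(max_depth):
        candidates: list[tuple[int, list[str]]] = []
        for value, path in beam:
            for name, move in MOVES.items():
                new_value = move(value)
                new_path = path + [name]
                if new_value == target:
                    return new_path
                candidates.append((new_value, new_path))
        # Heuristic: prefer states numerically closest to the target.
        candidates.sort(key=lambda c: abs(target - c[0]))
        beam = candidates[:beam_width]
    return None


# Width 1 (greedy) follows 1 -> 4 -> 8 -> 16 -> 19 and never hits 14.
print(beam_search(start=1, target=14, beam_width=1, max_depth=4))  # None

# Width 2 keeps a second path alive and finds 1 -> 4 -> 7 -> 14.
print(beam_search(start=1, target=14, beam_width=2, max_depth=4))  # ['+3', '+3', '*2']
```

In a real system the states would be partial reasoning traces and the heuristic would be a learned or model-based scorer, which is a much harder design problem than this toy suggests.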

Why it matters

Test-time compute changed how AI systems improve. For years, gains came mainly from scaling model size and training data. Inference-time scaling opened a second axis: a fixed model can often deliver substantially better results on hard tasks by spending more computation per query, without any retraining.

This is the engine behind reasoning models, which are trained to use extended thinking automatically. It also gives developers a tunable dial — spend more compute on a difficult query, less on an easy one — so quality and cost can be balanced per task rather than fixed by the model alone.
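
One way to picture the tunable dial is a small routing function that maps an estimated difficulty to a compute budget. The sketch below is an assumption-laden illustration: the ComputeBudget fields, the thresholds, the token limits, and the idea of a difficulty score between 0 and 1 are all hypothetical and do not correspond to any particular provider's settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeBudget:
    """Hypothetical per-request budget: how many samples, how much reasoning."""
    num_samples: int
    max_reasoning_tokens: int


# Illustrative tiers only; real values would be tuned per task and model.
BUDGETS = {
    "low": ComputeBudget(num_samples=1, max_reasoning_tokens=0),
    "medium": ComputeBudget(num_samples=1, max_reasoning_tokens=2000),
    "high": ComputeBudget(num_samples=8, max_reasoning_tokens=8000),
}


def choose_budget(difficulty: float) -> ComputeBudget:
    """Map an estimated difficulty in [0, 1] to a budget tier."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    if difficulty < 0.3:
        return BUDGETS["low"]
    if difficulty < 0.7:
        return BUDGETS["medium"]
    return BUDGETS["high"]


def worst_case_tokens(budget: ComputeBudget, answer_tokens: int = 200) -> int:
    """Upper bound on generated tokens: every sample uses its full allowance."""
    return budget.num_samples * (budget.max_reasoning_tokens + answer_tokens)


for difficulty in (0.1, 0.5, 0.9):
    budget = choose_budget(difficulty)
    print(difficulty, budget, worst_case_tokens(budget))
```

The hard part in practice is estimating difficulty before answering; this sketch simply takes that estimate as an input.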

The limits are practical. More test-time compute means higher latency and cost, with diminishing returns beyond a point. The skill lies in applying it where deliberation pays off and using fast, single-pass generation everywhere else.
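
The diminishing returns can be illustrated with a simplified calculation. Assume a question with two possible answers, where each sample is independently correct with probability p, and the final answer is chosen by majority vote over an odd number of samples n. These assumptions are idealized (real samples from one model are correlated, so actual gains are usually smaller), but they show the shape of the curve: cost grows linearly with n while each additional pair of samples adds less accuracy.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent samples is correct.

    Assumes a two-answer question, per-sample accuracy p, and odd n so
    that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    needed = n // 2 + 1  # smallest number of correct samples that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1)
    )


previous = None
for n in (1, 3, 5, 7):
    accuracy = majority_accuracy(0.6, n)
    gain = "" if previous is None else f" (gain {accuracy - previous:+.4f})"
    print(f"n={n}: accuracy {accuracy:.4f}{gain}")
    previous = accuracy
```

With p = 0.6, going from 1 to 3 samples lifts accuracy from 0.6 to about 0.648, while going from 5 to 7 adds less than three percentage points at the same extra cost.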

Frequently asked questions

What is test-time compute?

It is computation spent during inference — when a model answers — to let it think longer or try multiple approaches before responding. Techniques include extended reasoning traces, sampling several answers and selecting the best, and searching over reasoning paths.

How does test-time compute differ from training compute?

Training compute builds the model once, up front. Test-time compute is spent every time the model answers a query. Increasing test-time compute can improve a fixed model's accuracy on hard problems without retraining it.

How is test-time compute related to reasoning models?

Reasoning models are trained to use test-time compute automatically, generating extended internal reasoning before answering. The extra inference-time computation is what gives them their accuracy gains on complex, multi-step problems.

What are the downsides of more test-time compute?

It increases latency and cost per query and shows diminishing returns past a point. The practical approach is to apply more compute only to hard problems that benefit from deliberation and use fast single-pass generation for routine tasks.

Reasoning models

Reasoning models are language models trained to solve complex problems by thinking step by step before answering, spending extra computation at inference to work through a problem rather than responding immediately. Examples include OpenAI's o-series, DeepSeek-R1, and the reasoning modes of Gemini and Claude. The approach trades latency and cost for stronger performance on math, coding, science, and multi-step planning.

Chain of thought (CoT)

Chain of thought is a prompting technique that improves a model's reasoning by encouraging it to work through a problem step by step before giving a final answer. Making intermediate reasoning explicit helps models handle multi-step math, logic, and planning tasks more reliably. Once a hand-written prompting trick, chain-of-thought reasoning is now built directly into reasoning models that think before they respond.

AI inference

AI inference is the runtime step where a trained AI model takes a prompt and produces an output — the tokens you see streaming back from ChatGPT, Claude, Gemini, or Perplexity. Inference is what costs money in production: every prompt and every generated token consumes GPU time, and the economics of any AI product live in this loop.

Large language model (LLM)

A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on the transformer architecture and containing billions of parameters, LLMs are trained to predict the next token in a sequence, which enables them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.

Machine learning

Machine learning is the subset of AI in which systems learn patterns from data to make predictions or decisions, rather than following explicitly programmed rules. By training on examples, models improve at tasks like ranking, classification, recommendation, and language understanding. It is the foundation beneath modern AI, including the large language models that power AI search.