AI & LLMs
Updated May 6, 2026

Large language model (LLM)

Definition

A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on the transformer architecture and typically containing billions of parameters, LLMs predict the next token in a sequence, enabling them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.

How it works

An LLM learns from enormous text corpora during a pretraining phase, adjusting billions of parameters to predict the next token given the tokens before it. This simple objective, repeated across trillions of tokens, produces a model with broad knowledge of language, facts, and reasoning patterns.
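As a rough illustration of the next-token objective, the sketch below fits a count-based bigram model to a tiny, made-up corpus and measures the average negative log-probability it assigns to each actual next token. This is a deliberately simplified stand-in: a real LLM conditions on the whole preceding context and learns its parameters by gradient descent, and the corpus and function names here are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in followers.items()}

def average_loss(counts, tokens):
    """Average negative log-probability of each actual next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else math.inf)
    return sum(losses) / len(losses)

# Toy corpus (an assumption for the example), split on whitespace.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")   # "cat" follows "the" twice, "mat" once
loss = average_loss(model, corpus)       # lower means better next-token prediction
```

Pretraining pushes this kind of loss down across the whole corpus, which is what forces the model to absorb patterns in language.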

Most LLMs are built on the transformer architecture, which uses attention mechanisms to weigh how much each token in the context should influence the next prediction. After pretraining, models are typically refined with supervised fine-tuning and reinforcement learning from human feedback to make them helpful, harmless, and aligned with instructions.
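To make the idea of attention weights concrete, here is a minimal sketch of scaled dot-product attention for a single query vector, written in plain Python. It leaves out the learned projections, multiple heads, and masking that production transformers use, and the vectors are invented for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Three context tokens with made-up 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
```

Tokens whose keys point in a similar direction to the query (the first and third here) receive larger weights, so their values contribute more to the output.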

At inference time, the model takes a prompt and generates a response one token at a time. Techniques like retrieval-augmented generation, tool calling, and grounding extend an LLM beyond its static training knowledge.
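The token-by-token loop can be sketched as follows. The lookup table is a hypothetical stand-in for a trained model and only looks at the last token, whereas a real LLM conditions on the full context; the loop also picks the single most probable token each step (greedy decoding), which is just one of several decoding strategies.

```python
# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    """Append one token at a time until an end marker or the length limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = TOY_MODEL[tokens[-1]]
        next_token = max(probs, key=probs.get)  # greedy: take the most probable token
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

result = generate(["<start>"])
```

Each generated token is fed back in as input for the next step, which is why longer responses take proportionally more inference work.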

Why it matters

LLMs are the engine behind the AI products that now mediate how people find information. Chat assistants, AI Overviews, and answer engines all rely on LLMs to interpret a query and synthesize a response, often replacing the traditional list of ranked links.

For brands and publishers, this shift means visibility increasingly depends on whether an LLM surfaces and cites your content inside a generated answer. Understanding how LLMs retrieve, weigh, and summarize sources is now central to search strategy.

Frequently asked questions

What is the difference between an LLM and a chatbot?

An LLM is the underlying model that understands and generates language. A chatbot is an application built on top of an LLM, adding a conversational interface, memory, safety guardrails, and often retrieval or tools. ChatGPT is a chatbot; GPT is the LLM powering it.
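The split between model and application can be sketched in a few lines. Here `toy_llm` is a hypothetical placeholder for a real model call, and the `Chatbot` class, its keyword filter, and its history format are assumptions for the example rather than how any particular product works.

```python
def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: text in, text out, no memory of its own."""
    return f"(reply based on {len(prompt.splitlines())} lines of context)"

class Chatbot:
    """Application layer: adds conversation memory and a simple guardrail around the model."""

    def __init__(self, llm, blocked_words=("password",)):
        self.llm = llm
        self.history = []
        self.blocked_words = blocked_words

    def send(self, user_message: str) -> str:
        # Guardrail: refuse before the message ever reaches the model.
        if any(word in user_message.lower() for word in self.blocked_words):
            return "Sorry, I can't help with that."
        self.history.append(f"User: {user_message}")
        # Memory: the model sees the whole conversation so far on every turn.
        reply = self.llm("\n".join(self.history))
        self.history.append(f"Assistant: {reply}")
        return reply

bot = Chatbot(toy_llm)
first = bot.send("What is an LLM?")
second = bot.send("And a chatbot?")
```

The model function is stateless; everything that makes the exchange feel like a conversation lives in the wrapper.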

How big is a large language model?

Size is usually measured in parameters. Publicly documented LLMs range from a few billion to hundreds of billions of parameters, and many frontier developers do not disclose counts at all. Parameter count alone does not determine quality: training data, architecture, and fine-tuning all shape real-world capability.
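One practical consequence of parameter count is the memory needed just to hold the weights, which is roughly the number of parameters times the bytes used per parameter. The sketch below works through that arithmetic; the model sizes are illustrative rather than references to specific models, and the estimate ignores activations, caches, and other runtime overhead.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Illustrative sizes: 7 billion, 70 billion, and 400 billion parameters.
for params in (7e9, 70e9, 400e9):
    sixteen_bit = weight_memory_gb(params, 2)    # 16-bit weights: 2 bytes each
    four_bit = weight_memory_gb(params, 0.5)     # 4-bit quantized weights: half a byte each
    print(f"{params / 1e9:.0f}B parameters: {sixteen_bit:.1f} GB at 16-bit, {four_bit:.1f} GB at 4-bit")
```

A 70-billion-parameter model stored at 16 bits per weight needs about 140 GB for its weights, which is why larger models are often split across several accelerators or quantized.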

Do LLMs actually understand language?

LLMs model statistical patterns in language extremely well, which lets them produce fluent, often accurate responses. Whether this constitutes genuine understanding is debated. Practically, they can fail on reasoning or facts outside their training data, which is why grounding and verification matter.

Why do LLMs hallucinate?

Because an LLM generates each token from a probability distribution over what is likely to come next, rather than retrieving verified facts, it can produce confident but incorrect statements. Grounding the model in retrieved sources and validating outputs are the main techniques used to reduce hallucination.
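A small sketch shows how a plausible but wrong token can be emitted. The scores below are invented for the example (they are not taken from any real model) and represent candidate completions of "The Eiffel Tower opened in", where 1889 is the correct year. The function names and the temperature parameter follow common usage but are assumptions here.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng):
    """Draw one token at random according to the model's probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["1889", "1887", "1901"]
logits = [2.0, 1.5, 0.5]  # hypothetical scores; the correct year is only somewhat ahead

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(tokens, logits, 1.0, rng) for _ in range(1000)]
share_wrong = 1 - draws.count("1889") / len(draws)

sharp_probs = softmax_with_temperature(logits, temperature=0.1)
```

With these made-up scores the correct token has a probability of only about 0.55, so a wrong year is sampled a large share of the time, stated just as fluently as the right one. Lowering the temperature concentrates probability on the top token, but that only helps when the top token is actually correct.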

Foundation models

Foundation models are large-scale AI models trained on broad, diverse data that serve as a general-purpose base adapted for many downstream applications. Rather than building a model per task, organizations fine-tune or prompt a single foundation model for translation, summarization, coding, search, and more. Large language models and multimodal models are common examples.

Transformer architecture

The transformer is the neural-network architecture behind modern large language models. Introduced in 2017, it uses self-attention to weigh how strongly each token relates to every other token in the context, letting models capture long-range meaning and process sequences in parallel. This design made today's LLMs and multimodal models possible.

Tokens

Tokens are the fundamental units of text that language models process. A tokenizer splits text into tokens, which can be subwords, whole words, or characters, and the model reads and generates one token at a time. Token counts determine API pricing, how much fits in a context window, and the practical capacity of any AI interaction.
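As a minimal sketch of subword tokenization, the function below splits text by greedily matching the longest entry in a small vocabulary and falls back to single characters when nothing matches. The vocabulary is hypothetical, and production tokenizers typically learn theirs from data with algorithms such as byte-pair encoding, so real token boundaries will differ.

```python
# Hypothetical subword vocabulary, invented for the example.
VOCAB = {"token", "iz", "ation", "un", "believ", "able", "s", " "}

def tokenize(text, vocab):
    """Greedy longest-match split; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

pieces = tokenize("tokenization", VOCAB)
fallback = tokenize("xyz", VOCAB)
```

One word can become several tokens, which is why token counts, not word counts, are what matter for pricing and context-window limits.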

AI inference

AI inference is the runtime step where a trained AI model takes a prompt and produces an output — the tokens you see streaming back from ChatGPT, Claude, Gemini, or Perplexity. Inference is what costs money in production: every prompt and every generated token consumes GPU time, and the economics of any AI product live in this loop.
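The cost arithmetic behind that loop can be sketched as below. The per-million-token prices and traffic figures are hypothetical placeholders, not any provider's actual rates, and pricing input and output tokens separately is an assumption about how the bill is structured.

```python
def request_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost of one request in dollars, given per-million-token prices."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000

# Hypothetical figures: a 1,500-token prompt, a 500-token answer,
# $2.00 per million input tokens and $8.00 per million output tokens.
cost = request_cost(1_500, 500, 2.00, 8.00)

# Scale to a hypothetical 100,000 requests per month.
monthly_cost = cost * 100_000
```

Under these assumed numbers a single request costs under a cent, but at production volume the same request pattern adds up to hundreds of dollars a month, so prompt length and response length become budget decisions.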

Retrieval-augmented generation (RAG)

Retrieval-augmented generation (RAG) is an AI architecture that gives a large language model access to external documents at query time: it retrieves relevant passages from a vector database or search index and inserts them into the model's context before the model generates a response. RAG underpins much of modern AI search and is one of the most widely used techniques for reducing hallucination.
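The retrieve-then-insert flow can be sketched as follows. To stay self-contained, this example ranks passages by word-overlap cosine similarity instead of using learned embeddings and a vector database, and it stops at building the prompt rather than calling a model. The documents, function names, and prompt wording are assumptions for illustration.

```python
import math
from collections import Counter

# A tiny stand-in for an external document store.
DOCUMENTS = [
    "The transformer architecture was introduced in 2017 and relies on self-attention.",
    "A tokenizer splits text into tokens such as subwords, words, or characters.",
    "Inference is the runtime step where a trained model produces an output.",
]

def vectorize(text):
    """Bag-of-words vector: lowercase word counts with basic punctuation stripped."""
    return Counter(word.strip(".,?!").lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, vectorize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Insert the retrieved passages into the context ahead of the question."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

query = "When was the transformer architecture introduced?"
passages = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, passages)
```

The model then answers from the passage placed in its context rather than from memory alone, which is what lets the response cite a source.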

AI hallucination

AI hallucination is when a large language model generates content that sounds plausible and confident but is factually wrong, fabricated, or unverifiable — invented citations, made-up statistics, or fictional events presented with the same fluency as accurate information. Hallucination is a structural feature of how LLMs work, not a bug that can be fully eliminated.