Transformer architecture
Definition
The transformer is the neural-network architecture behind modern large language models. Introduced in 2017, it uses self-attention to weigh how strongly each token relates to every other token in the context, letting models capture long-range meaning and process sequences in parallel. This design made today's LLMs and multimodal models possible.
How it works
The transformer's core innovation is the self-attention mechanism. For each token in a sequence, attention computes how relevant every other token is, then blends their information accordingly. This lets the model understand that a pronoun refers to a noun many words earlier, or that a word's meaning depends on distant context.
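The blending step described above can be sketched in a few lines. This is a minimal, single-head illustration of scaled dot-product self-attention, not a production implementation: the projection matrices are random rather than learned, the sizes (4 tokens, 8 dimensions) are arbitrary, and plain Python lists are used instead of an array library so the example has no dependencies.

```python
import math
import random

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), using lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """x holds one vector per token; w_q, w_k, w_v are projection matrices."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j]: how relevant token j is to token i.
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    # Each output row is a weighted blend of all value vectors.
    return matmul(weights, v), weights

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x = random_matrix(4, 8)  # 4 tokens, 8 dimensions each (arbitrary sizes)
w_q, w_k, w_v = (random_matrix(8, 8) for _ in range(3))  # random, not trained
output, weights = self_attention(x, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # one weight per (token, token) pair
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors for all tokens in the sequence.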
Unlike earlier recurrent designs that processed text one step at a time, transformers process all tokens in a sequence in parallel. This parallelism is what made it practical to train on the enormous datasets that give large language models their breadth.
A transformer stacks many layers, each with attention and feed-forward components, and uses positional information so the model knows token order. The same architecture, scaled up and adapted, underlies text, image, and multimodal models alike.
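One way to supply positional information is the fixed sinusoidal encoding from the original 2017 transformer paper, sketched below. This is only one scheme, and the text above does not say which one any given model uses; many models use learned or rotary position schemes instead. The function names are illustrative.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: sin on even, cos on odd.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    # Position vectors are added element-wise to the token embeddings.
    pe = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = sinusoidal_positions(seq_len=3, d_model=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Because every position gets a distinct vector, two occurrences of the same token at different positions reach the attention layers as different inputs.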
Why it matters
The transformer is the foundation of the current AI era. Nearly every frontier large language model and most multimodal models are transformer-based, so understanding the architecture explains both their strengths and their limits.
Attention is also why context matters so much in practice. Because the model attends across the whole context window, what you include in a prompt, and how much fits in that window, directly shapes the output. In standard self-attention every token is compared with every other token, so compute cost grows roughly quadratically with context length, which is why context windows and efficiency remain active areas of development.
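A rough way to see how cost grows with context length is to count attention scores. The sketch below assumes standard dense self-attention, where each token is scored against every token in the sequence; it counts scores for a single head in a single layer and ignores everything else a real model computes, so treat the numbers as an illustration of the trend rather than a cost model.

```python
def attention_score_count(seq_len):
    # One score per (query token, key token) pair, for one head in one layer.
    return seq_len * seq_len

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):>15,} scores")
```

Under this assumption, doubling the context length quadruples the number of scores.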
Frequently asked questions
What is self-attention in a transformer?
Self-attention is the mechanism that lets each token in a sequence weigh its relationship to every other token, then combine their information. It is how transformers capture context and long-range dependencies that earlier architectures struggled with.
Why did transformers replace earlier neural networks?
Earlier recurrent networks processed text sequentially, which was slow and struggled with long-range context. Transformers process sequences in parallel and model relationships across an entire context at once, making large-scale training and better long-range understanding possible.
Are all large language models transformers?
The vast majority are. The transformer has been the dominant architecture for LLMs since shortly after its introduction in 2017. Researchers explore alternatives for efficiency, but transformer-based models remain the standard for frontier systems.
How does the transformer relate to the context window?
The context window is the span of tokens the model's attention can operate over at once. Because the cost of standard self-attention grows roughly quadratically with sequence length, the context window is bounded, which is why managing what fits inside it is an important practical concern.
Large language model (LLM)
A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on transformer architecture and containing billions of parameters, LLMs predict the next token in a sequence, enabling them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.
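Next-token prediction can be illustrated with a toy generation loop. Everything here is a stand-in: the five-word vocabulary and the fixed logits table are hypothetical, and unlike a real LLM this toy looks only at the last token rather than the whole context. The loop uses greedy decoding (always pick the most probable token), which is one of several possible decoding strategies.

```python
import math

VOCAB = ["the", "cat", "sat", "down", "<end>"]

# Hypothetical stand-in for a trained model: fixed scores (logits) over VOCAB,
# keyed by the last token only. A real LLM conditions on the full context.
TOY_LOGITS = {
    "the":  [0.1, 3.0, 0.2, 0.1, 0.0],
    "cat":  [0.1, 0.1, 2.5, 0.3, 0.2],
    "sat":  [0.2, 0.1, 0.1, 2.0, 0.5],
    "down": [0.1, 0.1, 0.1, 0.1, 3.0],
}

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        # Greedy decoding: take the most probable next token.
        next_tok = VOCAB[probs.index(max(probs))]
        if next_tok == "<end>":
            break
        tokens.append(next_tok)  # the new token becomes part of the context
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

The key point is the loop: each generated token is appended to the sequence and influences the prediction that follows.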
Foundation models
Foundation models are large-scale AI models trained on broad, diverse data that serve as a general-purpose base adapted for many downstream applications. Rather than building a model per task, organizations fine-tune or prompt a single foundation model for translation, summarization, coding, search, and more. Large language models and multimodal models are common examples.
Tokens
Tokens are the fundamental units of text that language models process. A tokenizer splits text into tokens, which can be subwords, whole words, or characters; the model reads its input as a sequence of tokens and generates its output one token at a time. Token counts typically determine API pricing, how much fits in a context window, and the practical capacity of any AI interaction.
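To make the link between tokens and cost concrete, here is a small sketch. The tokenizer is a deliberately simple stand-in that splits on words and punctuation; real tokenizers use learned subword vocabularies and will produce different counts. The price used in the example is a made-up figure, not any provider's actual rate.

```python
import re

def toy_tokenize(text):
    # Stand-in tokenizer: each word and each punctuation mark is one token.
    # Real tokenizers use learned subword vocabularies instead.
    return re.findall(r"\w+|[^\w\s]", text)

def estimate_cost(num_tokens, price_per_million_tokens):
    # Pricing is assumed to be quoted per million tokens (hypothetical rate).
    return num_tokens / 1_000_000 * price_per_million_tokens

tokens = toy_tokenize("Transformers process tokens, not characters.")
print(tokens)
print(len(tokens))                    # 7
print(estimate_cost(500_000, 2.0))    # 1.0 at a hypothetical rate of 2.0 per million
```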
Context window
A context window is the maximum amount of text, measured in tokens, that a language model can consider in a single interaction — including the prompt, retrieved documents, conversation history, and the model's own output. Frontier models in early 2026 reach context windows of roughly a million tokens, enabling long documents and rich grounding.
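Since the prompt, history, and output all share one window, applications often have to decide what to drop. The sketch below shows one simple policy, keeping the most recent conversation messages that fit; it is an illustrative assumption, not a description of how any particular product manages context. The function name, the message format, and the token counts are all hypothetical.

```python
def trim_history(messages, prompt_tokens, reserved_output_tokens, context_window):
    """Keep the most recent messages that fit in the remaining token budget.

    messages: list of (text, token_count) pairs, oldest first.
    Returns the kept messages, oldest first.
    """
    # The prompt and the space reserved for the model's reply come out first.
    budget = context_window - prompt_tokens - reserved_output_tokens
    kept = []
    for text, count in reversed(messages):  # walk from newest to oldest
        if count > budget:
            break
        kept.append((text, count))
        budget -= count
    kept.reverse()
    return kept

history = [("first question", 50), ("first answer", 30), ("follow-up", 40)]
print(trim_history(history, prompt_tokens=20, reserved_output_tokens=10,
                   context_window=100))
# [('first answer', 30), ('follow-up', 40)]
```

With a 100-token window, 30 tokens go to the prompt and reserved output, leaving 70 for history, so the oldest message is dropped.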
Embeddings
Embeddings are numerical vector representations of text, images, or other data that capture semantic meaning. By mapping content into a high-dimensional space where similar items sit close together, embeddings let AI systems compare meaning mathematically, powering similarity search, retrieval, clustering, and recommendation.
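Comparing meaning mathematically usually comes down to a distance or similarity measure between vectors. The sketch below uses cosine similarity, a common choice, on three hand-written 3-dimensional vectors. These vectors are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "cat":     [0.9, 0.1, 0.0],
    "kitten":  [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}

def most_similar(query, embeddings):
    # Rank every other item by similarity to the query and return the best.
    candidates = (name for name in embeddings if name != query)
    return max(candidates,
               key=lambda name: cosine_similarity(embeddings[query], embeddings[name]))

print(most_similar("cat", EMBEDDINGS))  # kitten
```

Similarity search and retrieval systems apply the same idea at scale, usually with an index that avoids comparing the query against every stored vector.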
Multimodal AI
Multimodal AI refers to models that process and understand multiple types of input, such as text, images, audio, and video, within a single system. Instead of handling one modality at a time, a multimodal model can read a chart, describe a photo, transcribe speech, and reason across them together, enabling richer interactions and search experiences.