Tokens
Definition
Tokens are the fundamental units of text that language models process. A tokenizer splits text into tokens, which can be subwords, whole words, or characters, and the model reads and generates one token at a time. Token counts determine API pricing and how much text fits in a context window, so they set the practical cost and capacity of any interaction with a model.
How it works
Before a model can process text, a tokenizer breaks it into tokens. Most modern tokenizers use subword units, so a common word may be a single token while a rare or long word splits into several. As a rough guide for English, a token averages around four characters, and a page of text is on the order of several hundred tokens.
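The four-characters-per-token rule of thumb can be turned into a quick estimator. This is a minimal sketch, not a real tokenizer: the function name estimate_tokens and the constant CHARS_PER_TOKEN are illustrative, and the ratio only holds roughly for English text.

```python
import math

# Assumption: about 4 characters per token, the rough English average.
# Real tokenizers will give different counts, especially for code or
# non-English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based only on character count."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


sample = "Tokens are the fundamental units of text that language models process."
print(estimate_tokens(sample))
```

For billing or hard limits, the count reported by the provider's own tokenizer is the one that matters; an estimate like this is only useful for quick sizing.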
The model converts each token into a numeric representation, processes the sequence, and generates output one token at a time, with each new token conditioned on those before it. This is why models stream their responses word by word, or more precisely token by token.
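The generation loop can be illustrated with a toy stand-in for the model. In this sketch the lookup table NEXT_TOKEN is hypothetical and looks only at the most recent token, whereas a real model conditions on the whole sequence; the point is only the shape of the loop, which emits one token per step.

```python
# Hypothetical toy "model": maps the most recent token to the next one.
# A real language model scores every token in its vocabulary using the
# entire preceding sequence.
NEXT_TOKEN = {
    "<start>": "tok",
    "tok": "ens",
    "ens": " are",
    " are": " units",
    " units": "<end>",
}


def generate(prompt_tokens, max_new_tokens=10):
    """Yield new tokens one at a time until <end> or the token limit."""
    sequence = list(prompt_tokens)
    if not sequence:
        return  # nothing to condition on
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(sequence[-1], "<end>")
        if next_token == "<end>":
            break
        sequence.append(next_token)  # the new token becomes part of the context
        yield next_token


# Streaming: each token is printed as soon as it is produced.
for token in generate(["<start>"]):
    print(token, end="", flush=True)
print()  # prints "tokens are units" in total
```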
Token counts apply to both input and output. The combined total must fit within the model's context window, and providers usually bill separately for input and output tokens.
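The constraint that input and output share one window reduces to simple arithmetic. The sketch below assumes you already know the input token count and the model's window size; the function names and the 8,000-token window in the usage lines are illustrative, not values from any particular model.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the input plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def max_output_budget(input_tokens: int, context_window: int) -> int:
    """Tokens left for output after the input is counted (never negative)."""
    return max(0, context_window - input_tokens)


# Illustrative numbers only.
print(fits_context(7_000, 1_000, 8_000))
print(max_output_budget(7_500, 8_000))
```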
Why it matters
Tokens are the unit of both cost and capacity in AI systems. API pricing is quoted per token, so understanding token counts is essential for estimating and controlling spend, especially at scale. Verbose prompts and long outputs translate directly into higher bills.
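A spend estimate follows directly from token counts and per-token rates. This sketch assumes separate input and output prices quoted per million tokens; the prices in the usage line are placeholders, not any provider's actual rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated cost of one request, in the same currency as the prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices for illustration only.
cost = estimate_cost(2_000, 500, input_price_per_million=3.0, output_price_per_million=15.0)
print(f"{cost:.4f}")
```

Multiplying the per-request figure by expected request volume gives a first-order monthly estimate, which is where verbose prompts and long outputs become visible.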
Tokens also bound what a model can consider at once. The context window is measured in tokens, so long documents, conversation history, and retrieved context all compete for the same budget. Efficient prompting and retrieval are largely exercises in spending that token budget wisely.
Frequently asked questions
How many tokens are in a word?
It varies by language and tokenizer. For English, a token averages about three-quarters of a word, or roughly four characters, so a typical word comes to about 1.3 tokens. Common words are often a single token, while rare or long words split into several.
Why are tokens used instead of words or characters?
Subword tokens balance vocabulary size and flexibility. They let a model represent any word, including new or rare ones, by combining pieces, while keeping common words compact. This is more efficient than per-character processing and more flexible than fixed word lists.
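The idea of building any word from known pieces can be shown with a greedy longest-match splitter. This is a simplified sketch, not the algorithm any specific tokenizer uses: the vocabulary VOCAB is hypothetical and tiny, and production tokenizers learn their vocabularies from data.

```python
# Hypothetical subword vocabulary for illustration.
VOCAB = {"token", "ization", "ize", "un", "the", "s"}


def tokenize_word(word: str, vocab=VOCAB) -> list:
    """Split a word into the longest known pieces, left to right.

    Characters not covered by any vocabulary entry become
    single-character tokens, so every word can be represented.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fallback: one character
            i += 1
    return pieces


print(tokenize_word("the"))           # ['the']
print(tokenize_word("tokenization"))  # ['token', 'ization']
print(tokenize_word("untokenize"))    # ['un', 'token', 'ize']
print(tokenize_word("xyz"))           # ['x', 'y', 'z']
```

The common word stays a single token, the longer words split into reusable pieces, and the unknown string still gets a representation through the character fallback.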
Do input and output tokens cost the same?
Usually not. Most providers price input and output tokens separately, and output tokens are often more expensive because each one is generated sequentially, which makes generation the more compute-intensive step. Both count toward the context window and your total bill.
How do tokens relate to the context window?
The context window is the maximum number of tokens a model can hold in a single request, covering the prompt, any retrieved content, and the generated output. When a request would exceed that limit, earlier content must be trimmed or summarized to fit.
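Trimming earlier content can be sketched as keeping the most recent messages that fit a token budget. This example assumes a rough four-characters-per-token estimate in place of a real tokenizer; trim_history and its parameters are illustrative names, and a real application might summarize dropped messages instead of discarding them.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption)."""
    return math.ceil(len(text) / 4)


def trim_history(messages, max_tokens, count_tokens=estimate_tokens):
    """Keep the newest messages whose combined token estimate fits max_tokens."""
    kept = []
    total = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(trim_history(history, max_tokens=20)))  # 2
```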
Large language model (LLM)
A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on transformer architecture and containing billions of parameters, LLMs predict the next token in a sequence, enabling them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.
Context window
A context window is the maximum amount of text, measured in tokens, that a language model can consider in a single interaction — including the prompt, retrieved documents, conversation history, and the model's own output. Frontier models in early 2026 reach context windows of roughly a million tokens, enabling long documents and rich grounding.
Transformer architecture
The transformer is the neural-network architecture behind modern large language models. Introduced in 2017, it uses self-attention to weigh how strongly each token relates to every other token in the context, letting models capture long-range meaning and process sequences in parallel. This design made today's LLMs and multimodal models possible.
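The weighing step can be sketched in a few lines. This is a heavily simplified illustration of scaled dot-product self-attention: it uses the token vectors directly as queries, keys, and values, whereas a real transformer applies learned projections and uses multiple attention heads. The input vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """For each token vector, return a weighted mix of all token vectors.

    Simplification: queries, keys, and values are the input vectors
    themselves (no learned projection matrices).
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this token against every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Made-up 2-dimensional vectors for three tokens.
for row in self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    print([round(x, 3) for x in row])
```

Each output row depends on every input vector, and the rows can be computed independently of one another, which is what allows the parallel processing described above.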
AI inference
AI inference is the runtime step where a trained AI model takes a prompt and produces an output — the tokens you see streaming back from ChatGPT, Claude, Gemini, or Perplexity. Inference is what costs money in production: every prompt and every generated token consumes GPU time, and the economics of any AI product live in this loop.
Embeddings
Embeddings are numerical vector representations of text, images, or other data that capture semantic meaning. By mapping content into a high-dimensional space where similar items sit close together, embeddings let AI systems compare meaning mathematically, powering similarity search, retrieval, clustering, and recommendation.
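Cosine similarity is one common way to measure how close two embeddings are. The sketch below uses made-up three-dimensional vectors for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors, not output from any real embedding model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```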
Natural language processing (NLP)
Natural language processing is the AI discipline that enables computers to understand, interpret, and generate human language. It spans tasks such as translation, summarization, sentiment analysis, entity recognition, and question answering. Once driven by hand-built rules and statistical models, NLP is now dominated by large language models built on the transformer architecture.