AI & LLMs | Updated May 6, 2026

Small language models (SLMs)

Definition

Small language models are compact AI models, typically ranging from about one to ten billion parameters, designed for on-device deployment, low latency, and cost efficiency while retaining useful capability. By trading some breadth for a smaller footprint, SLMs run on phones, laptops, and edge hardware, enabling private, fast, and inexpensive language tasks.

How it works

Small language models share the same transformer foundations as large models but have far fewer parameters. Three techniques let SLMs deliver strong performance on focused tasks despite their compact size: knowledge distillation, in which a smaller model learns from a larger one's outputs; careful data curation; and quantization, which stores weights at lower numeric precision.
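To make distillation concrete, here is a minimal sketch of one common form of the training signal: the student is penalized for diverging from the teacher's temperature-softened output distribution. The function names, the temperature of 2.0, and the example logits are illustrative assumptions, not a description of how any particular SLM was trained.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three candidate tokens.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two disagree, which is the gradient signal a real training loop would minimize.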

Because they fit in limited memory, SLMs can run directly on a phone, laptop, or edge device without sending data to the cloud. Local execution removes the network round trip, so response time depends only on the device's own compute, and it keeps prompts and outputs on the device.
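A rough way to check whether a model fits on a device is to multiply its parameter count by the bytes stored per weight. The sketch below does that arithmetic for weights only; it deliberately ignores activations, the KV cache, and runtime overhead, so treat the results as a lower bound. The function name and the chosen sizes are illustrative assumptions.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Compare a few sizes in the one-to-ten-billion range at common precisions.
for params in (1e9, 3e9, 10e9):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params / 1e9:.0f}B params at {bits}-bit: {gb:.1f} GB")
```

Under these assumptions a 3-billion-parameter model needs about 6 GB at 16-bit precision but about 1.5 GB at 4-bit, which is why quantization matters so much for phones and laptops.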

SLMs excel at well-scoped tasks like classification, extraction, summarization, and routing, and are often paired with retrieval so a small model can answer accurately from supplied context rather than relying on broad memorized knowledge.
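The retrieval pairing can be sketched in a few lines: pick the passage most relevant to the question, then hand it to the model as context. This toy version ranks passages by shared words, a crude stand-in for the embedding search a real system would more likely use. The documents, function names, and prompt wording are all hypothetical, and the final call to a model is left out.

```python
import re

def tokenize(text):
    """Lowercased word set; a simple proxy for semantic similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, k=1):
    """Return the k passages sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=1):
    """Assemble a prompt that tells the model to answer from supplied context."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical knowledge base.
docs = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
prompt = build_prompt("How long does shipping take?", docs)
print(prompt)
```

Because the answer is present in the prompt, the small model only has to read and restate it rather than recall it from its parameters.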

Why it matters

Not every task needs a frontier model. Routing simple, high-volume work to a small model cuts cost per request, and running that model on-device also removes the latency of a network round trip, which is why SLMs are central to production AI economics.
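The routing idea can be illustrated with a small cost comparison. In this sketch the per-token prices, the task labels, and the workload are made-up numbers chosen only to show the arithmetic; real prices and traffic mixes vary widely.

```python
# Task types treated as well-scoped enough for a small model (an assumption).
SMALL_MODEL_TASKS = {"classification", "extraction", "summarization", "routing"}

# Hypothetical prices in USD per million tokens, for illustration only.
PRICE_PER_MILLION = {"small": 0.10, "large": 5.00}

def choose_model(task_type: str) -> str:
    """Send well-scoped tasks to the small model, everything else to the large one."""
    return "small" if task_type in SMALL_MODEL_TASKS else "large"

def estimated_cost(requests) -> float:
    """requests: iterable of (task_type, tokens). Returns the total in USD."""
    return sum(
        tokens * PRICE_PER_MILLION[choose_model(task)] / 1e6
        for task, tokens in requests
    )

# Hypothetical daily workload: (task type, total tokens).
workload = [
    ("classification", 900_000),
    ("extraction", 600_000),
    ("open_ended_reasoning", 500_000),
]

routed = estimated_cost(workload)
all_large = sum(tokens for _, tokens in workload) * PRICE_PER_MILLION["large"] / 1e6
print(f"routed: ${routed:.2f}  all-large: ${all_large:.2f}")
```

In practice the routing rule is often a classifier or a confidence threshold rather than a fixed set of labels, but the economics follow the same pattern: the more of the volume the small model can absorb, the lower the total bill.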

On-device SLMs also unlock privacy-sensitive use cases. When the model runs locally, sensitive data never leaves the device, enabling AI features in regulated industries and in regions with strict data rules. This blend of speed, cost, and privacy makes SLMs a fast-growing complement to large models.

Frequently asked questions

How small is a small language model?

There is no strict boundary, but SLMs are generally in the range of about one to ten billion parameters, small enough to run on consumer hardware. Some specialized models are even smaller while remaining useful for narrow tasks.

When should I use an SLM instead of a large model?

Choose an SLM for high-volume, well-defined tasks where latency, cost, or privacy matter, such as classification, extraction, or routing. Reserve large models for open-ended reasoning, complex generation, and tasks requiring broad world knowledge.

Are small language models less accurate?

On broad, open-ended tasks they often trail large models. But on focused tasks, especially when grounded with retrieval, a well-tuned SLM can match larger models while being faster and cheaper. Fit to the task matters more than raw size.

Can small language models run offline?

Yes. A key advantage of SLMs is that they fit on phones, laptops, and edge devices, so they can run entirely offline. This enables private, low-latency AI without a cloud connection.

Large language model (LLM)

A large language model is an AI system trained on vast amounts of text to understand and generate human language. Built on transformer architecture and containing billions of parameters, LLMs predict the next token in a sequence, enabling them to answer questions, write, summarize, and reason. They power modern chat assistants, AI search, and autonomous agents.

Foundation models

Foundation models are large-scale AI models trained on broad, diverse data that serve as a general-purpose base adapted for many downstream applications. Rather than building a model per task, organizations fine-tune or prompt a single foundation model for translation, summarization, coding, search, and more. Large language models and multimodal models are common examples.

Open source LLMs

Open source LLMs are large language models whose weights are publicly available for download, allowing anyone to self-host, fine-tune, and inspect them. Families such as Llama, Mistral, Qwen, and DeepSeek give organizations control over deployment, customization, and data privacy, in contrast to closed models accessible only through a provider's API.

AI inference

AI inference is the runtime step where a trained AI model takes a prompt and produces an output: the tokens you see streaming back from ChatGPT, Claude, Gemini, or Perplexity. Inference is the recurring cost in production: every prompt and every generated token consumes GPU time, and the economics of any AI product live in this loop.

Apple Intelligence

Apple Intelligence is Apple's personal AI system, built into iPhone, iPad, and Mac, that blends on-device processing, Private Cloud Compute, and deep app integration to power writing tools, summaries, a more capable Siri, and image features. It emphasizes personal context and privacy, with optional handoff to external models for broader world knowledge.

AI fine-tuning

AI fine-tuning is the process of taking a pre-trained model and training it further on a smaller, specialized dataset so it adapts to a specific task, domain, tone, or format. It adjusts the model's existing weights rather than training from scratch, producing outputs that better match a brand's requirements or a narrow use case at lower cost than full training.