Foundation & Models¶
Modern AI is built on foundation models -- large neural networks trained on massive datasets that can be adapted to a wide range of tasks. This page breaks down how these models work, the different types available, and the key concepts you need to understand before building AI-powered applications.
What Is a Large Language Model (LLM)?¶
A Large Language Model is a type of AI that has been trained on vast amounts of text data to understand and generate human language. At its core, an LLM predicts the next most likely token (word or sub-word) given everything that came before it.
Think of it like a very sophisticated autocomplete: you give it a sentence, and it figures out what should come next -- except it can do this across paragraphs, pages, and even entire documents.
Key Insight
LLMs do not "understand" language the way humans do. They learn statistical patterns in text. Their ability to generate coherent, useful responses comes from the sheer scale of data and parameters, not from genuine comprehension.
How It Works (Simplified)¶
- Training: The model reads billions of text samples and learns patterns -- grammar, facts, reasoning styles, code syntax, and more.
- Prompt: You provide an input (the prompt).
- Inference: The model generates output by predicting tokens one at a time, each informed by the tokens before it.
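The prediction loop above can be sketched in a few lines of Python. The "model" here is a hypothetical lookup table standing in for a real transformer forward pass:

```python
import random

# Hypothetical toy "model": a lookup of next-token probabilities.
# A real LLM computes these probabilities with a transformer forward pass.
TOY_MODEL = {
    ("The",): {"cat": 0.6, "dog": 0.4},
    ("The", "cat"): {"sat": 1.0},
    ("The", "dog"): {"ran": 1.0},
    ("The", "cat", "sat"): {"<eos>": 1.0},
    ("The", "dog", "ran"): {"<eos>": 1.0},
}

def predict_next_token(context):
    """Sample one next token from the toy distribution."""
    probs = TOY_MODEL[tuple(context)]
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights)[0]

def generate(prompt_tokens, max_tokens=10):
    """Autoregressive generation: predict one token at a time,
    feeding each prediction back in as context for the next."""
    out = list(prompt_tokens)
    for _ in range(max_tokens):
        nxt = predict_next_token(out)
        if nxt == "<eos>":   # stop token ends generation
            break
        out.append(nxt)
    return out

print(generate(["The"]))
```

The essential point is the feedback loop: each generated token becomes part of the context used to predict the next one.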
The Transformer Architecture¶
Nearly all modern language models are built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need".
The key innovation is the self-attention mechanism, which allows the model to weigh the importance of every word in a sentence relative to every other word -- regardless of distance. This solved a major limitation of earlier architectures (RNNs, LSTMs) that struggled with long-range dependencies.
- Self-attention
- A mechanism that lets each token in a sequence look at all other tokens to determine context. For example, in "The cat sat on the mat because it was tired," attention helps the model understand that "it" refers to "the cat."
- Parameters
- The internal weights of the model that are adjusted during training. More parameters generally means more capacity to learn patterns. GPT-4 is estimated to have over a trillion parameters.
- Pre-training
- The initial phase where the model learns general language patterns from a large corpus. This produces a foundation model that can then be fine-tuned for specific tasks.
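The self-attention idea can be sketched as scaled dot-product attention over toy embeddings. This is a deliberate simplification: real transformers add learned query/key/value projections, multiple heads, and positional information, all omitted here:

```python
import math

def softmax(xs):
    """Convert raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Minimal self-attention: each token's output is a weighted
    average of all token embeddings, weighted by scaled dot-product
    similarity. Q/K/V projections and multi-head structure omitted."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:
        # Similarity of this token to every token (including itself)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)   # attention weights sum to 1
        # Weighted average of all embeddings
        out = [sum(w * v[i] for w, v in zip(weights, embeddings))
               for i in range(d)]
        outputs.append(out)
    return outputs

# Three toy 2-d "token embeddings"; the first two are similar,
# so they attend strongly to each other.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(self_attention(tokens))
```

Because every token scores every other token directly, distance in the sequence does not matter, which is exactly what RNNs and LSTMs struggled with.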
Tokens and Context Windows¶
Tokens¶
Models do not process raw text. Instead, text is broken into tokens -- small units that might be whole words, parts of words, or punctuation.
| Text | Approximate Tokens |
|---|---|
| "Hello" | 1 token |
| "ChatGPT is great" | 4 tokens |
| "Artificial intelligence" | 2-3 tokens |
| 1,000 words of English | ~1,330 tokens |
Why Tokens Matter
You are billed per token (input + output) when using commercial APIs. Understanding tokenization helps you estimate costs and optimize prompts.
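A rough cost estimate can be sketched with the common heuristic that one token is about four characters of English. The per-token prices below are placeholders for illustration, not real rates; real tokenizers (e.g. tiktoken for OpenAI models) give exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English per token.
    Use the model's actual tokenizer for exact counts."""
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, expected_output_tokens: int,
                  usd_per_1k_input: float = 0.0025,   # placeholder rate
                  usd_per_1k_output: float = 0.01):   # placeholder rate
    """Estimate API cost: input and output tokens are billed separately,
    and output tokens typically cost more."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens / 1000 * usd_per_1k_input
            + expected_output_tokens / 1000 * usd_per_1k_output)

prompt = "Summarize the attached report in three bullet points."
print(estimate_tokens(prompt))
print(round(estimate_cost(prompt, expected_output_tokens=200), 6))
```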
Context Window¶
The context window is the maximum number of tokens a model can process in a single request (input + output combined). It defines how much information the model can "see" at once.
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 4 Sonnet | 200K tokens |
| Gemini 2.5 Pro | 1M tokens |
| Llama 3.1 405B | 128K tokens |
| Phi-4 | 16K tokens |
A larger context window allows the model to work with longer documents, but also increases cost and latency.
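Before sending a long document, it is worth checking that the prompt plus the expected response fits the window. A sketch, using window sizes from the table above and the rough four-characters-per-token heuristic (model names here are illustrative keys, not official API identifiers):

```python
CONTEXT_WINDOWS = {          # tokens, from the table above
    "gpt-4o": 128_000,
    "claude-4-sonnet": 200_000,
    "gemini-2.5-pro": 1_000_000,
    "phi-4": 16_000,
}

def fits_in_context(model: str, document: str, max_output_tokens: int) -> bool:
    """Input and output share one context window, so reserve room
    for the response when checking a document's size."""
    approx_tokens = len(document) // 4   # rough heuristic
    return approx_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

doc = "word " * 50_000                   # ~250k characters, ~62k tokens
print(fits_in_context("phi-4", doc, 1_000))            # → False
print(fits_in_context("gemini-2.5-pro", doc, 1_000))   # → True
```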
Types of AI Models¶
Comparison Table¶
| Feature | Foundation Model | LLM | SLM | VLM |
|---|---|---|---|---|
| What it is | Base model trained on broad data | Large text-focused model | Small, efficient text model | Model that handles text + images |
| Parameters | Varies (billions+) | 70B - 1T+ | 1B - 14B | Varies |
| Typical use | General-purpose base | Complex reasoning, generation | On-device, low-latency tasks | Image understanding, visual Q&A |
| Examples | GPT-4, Claude, Gemini | GPT-4o, Claude 4 Sonnet | Phi-4, Gemma 3 | GPT-4o (vision), Gemini, Llama 3.2-Vision |
| Cost | High | High | Low to moderate | Moderate to high |
| Deployment | Cloud | Cloud | Edge or cloud | Cloud |
Foundation Models¶
A foundation model is any large-scale model trained on broad, diverse data that can be adapted (via prompting, fine-tuning, or RAG) to many downstream tasks. GPT-4, Claude, and Gemini are all foundation models.
Large Language Models (LLMs)¶
LLMs are foundation models specifically focused on text. They excel at complex reasoning, long-form generation, summarization, translation, and code. Their strength is versatility, but they require significant compute resources.
Small Language Models (SLMs)¶
SLMs trade some capability for efficiency. Models like Microsoft's Phi-4 (14B parameters) or Google's Gemma can run on consumer hardware or at the edge. They are ideal when:
- Latency must be very low
- Cost per query must be minimal
- The task is well-defined and does not require broad world knowledge
- Data privacy requires on-premise or on-device deployment
Vision Language Models (VLMs)¶
VLMs extend language models with the ability to process images alongside text. You can ask them to describe a photo, extract data from a chart, or answer questions about a diagram. Examples include GPT-4o with vision, Gemini, and Llama 3.2-Vision.
How Inference Works¶
When you send a prompt to an AI model, here is what happens behind the scenes:
```mermaid
graph LR
    A["User Prompt"] --> B["Tokenizer"]
    B --> C["Model\n(Transformer)"]
    C --> D["Token\nPrediction"]
    D --> E["Detokenizer"]
    E --> F["Generated\nResponse"]
    style A fill:#057398,stroke:#004987,color:#fff
    style B fill:#00A0DF,stroke:#004987,color:#fff
    style C fill:#632C4F,stroke:#632C4F,color:#fff
    style D fill:#853175,stroke:#632C4F,color:#fff
    style E fill:#00A0DF,stroke:#004987,color:#fff
    style F fill:#259638,stroke:#259638,color:#fff
```
1. Tokenization: Your text prompt is split into tokens using the model's tokenizer.
2. Encoding: Tokens are converted into numerical representations (embeddings).
3. Processing: The transformer passes these embeddings through many layers of self-attention and feed-forward networks.
4. Prediction: The model outputs a probability distribution over its vocabulary for the next token.
5. Decoding: The next token is selected from that distribution (using strategies such as temperature or top-p sampling) and appended to the output.
6. Repetition: Steps 3-5 repeat until the model generates a stop token or reaches the maximum output length.
7. Detokenization: The output tokens are converted back into human-readable text.
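The steps above can be condensed into a toy decoding loop. The `toy_logits` function is a hypothetical stand-in for the transformer forward pass, and greedy selection (always taking the highest-scoring token) replaces sampling for simplicity:

```python
VOCAB = ["<stop>", "the", "cat", "sat"]

def toy_logits(token_ids):
    """Fake 'model': scores each vocab entry given the context.
    A real model runs embeddings through transformer layers."""
    # Deterministic toy rule: emit 'the' -> 'cat' -> 'sat' -> <stop>.
    order = [1, 2, 3, 0]
    nxt = order[min(len(token_ids), 3)]
    return [10.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def greedy_generate(max_tokens=8):
    output_ids = []
    for _ in range(max_tokens):
        logits = toy_logits(output_ids)                 # processing + prediction
        next_id = max(range(len(VOCAB)),
                      key=lambda i: logits[i])          # greedy decoding
        if next_id == 0:                                # stop token
            break
        output_ids.append(next_id)
    return " ".join(VOCAB[i] for i in output_ids)       # detokenization

print(greedy_generate())   # → "the cat sat"
```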
Key Inference Parameters¶
- Temperature
- Controls randomness. Lower values (0.0-0.3) produce focused, near-deterministic output. Higher values (0.7-1.0) increase variation.
- Top-p (nucleus sampling)
- Limits the model to considering only the most probable tokens whose cumulative probability reaches a threshold p. A top-p of 0.9 means the model considers the smallest set of tokens that together have a 90% probability.
- Max tokens
- The maximum number of tokens the model will generate in its response.
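Temperature and top-p can both be sketched as transformations of the model's next-token distribution. The logits below are made-up illustration values:

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Lower temperature
    sharpens the distribution; higher temperature flattens it."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(tokens, probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, then renormalize the survivors."""
    ranked = sorted(zip(tokens, probs), key=lambda t: -t[1])
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return [(tok, pr / total) for tok, pr in kept]

logits = {"cat": 2.0, "dog": 1.0, "car": -1.0}   # made-up scores
probs = apply_temperature(list(logits.values()), temperature=0.7)
filtered = top_p_filter(list(logits), probs, p=0.9)
print(filtered)   # low-probability tokens dropped, rest renormalized
```

In a real decoder the final token is then sampled from the filtered, renormalized distribution.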
Temperature Is Not Creativity
Setting temperature to 1.0 does not make the model "more creative" in a meaningful sense. It increases randomness, which can lead to incoherent or off-topic output. For most production use cases, keep temperature between 0.0 and 0.5.
Choosing the Right Model¶
There is no single "best" model. The right choice depends on your requirements:
| Requirement | Recommended Approach |
|---|---|
| Complex reasoning, broad knowledge | LLM (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro) |
| Low latency, cost-sensitive | SLM (Phi-4, Gemma) |
| Image + text understanding | VLM (GPT-4o vision, Gemini) |
| On-device or edge deployment | SLM with quantization |
| Domain-specific accuracy | Fine-tuned LLM or SLM |
| Long document processing | Model with large context window |
Start Simple
Begin with a hosted LLM and well-crafted prompts. Only move to fine-tuning or smaller models when you have a clear need for cost reduction, latency improvement, or domain specialization.