Fine-Tuning & Training¶
Prompting and RAG handle most AI use cases. But when you need a model to adopt a specific behavior, tone, or deep domain expertise that cannot be achieved through instructions alone, fine-tuning is the next step. This page explains when and how to customize AI models for your specific needs.
When to Fine-Tune: The Decision Flow¶
Before investing in fine-tuning, make sure simpler approaches will not work. Here is a decision framework:
graph TD
A["You have an\nAI use case"] --> B{"Can prompt\nengineering\nsolve it?"}
B -->|"Yes"| C["Use Prompting\n(lowest cost)"]
B -->|"No"| D{"Do you need\nreal-time or\nchanging data?"}
D -->|"Yes"| E["Use RAG\n(moderate cost)"]
D -->|"No"| F{"Do you need the\nmodel to behave\ndifferently?"}
F -->|"Yes"| G["Fine-Tune\n(higher cost)"]
F -->|"No"| H["Combine\nRAG + Prompting"]
style A fill:#057398,stroke:#004987,color:#fff
style B fill:#00A0DF,stroke:#004987,color:#fff
style C fill:#259638,stroke:#259638,color:#fff
style D fill:#00A0DF,stroke:#004987,color:#fff
style E fill:#259638,stroke:#259638,color:#fff
style F fill:#632C4F,stroke:#632C4F,color:#fff
style G fill:#853175,stroke:#632C4F,color:#fff
style H fill:#259638,stroke:#259638,color:#fff
The 80/20 Rule
In practice, 80% of AI use cases can be solved with good prompt engineering and RAG. Fine-tuning is for the remaining 20% where you need the model to fundamentally change how it responds.
What Is Fine-Tuning?¶
Fine-tuning is the process of taking a pre-trained model and further training it on a smaller, task-specific dataset. The model adjusts its internal weights to become better at the specific task while retaining its general capabilities.
Analogy¶
Think of a pre-trained model as a university graduate with broad knowledge. Fine-tuning is like that graduate doing a specialized internship -- they apply their general knowledge to a specific domain and get better at it.
What Fine-Tuning Changes¶
- Output style and tone: Write in your brand voice, match a specific format.
- Domain behavior: Respond like a medical professional, legal expert, or financial analyst.
- Task specialization: Get consistently better at a narrow task (classification, extraction, scoring).
- Reduced prompting: The model "just knows" things that previously required long prompts.
What Fine-Tuning Does NOT Do¶
- It does not give the model access to new data at inference time (that is what RAG does).
- It does not guarantee elimination of hallucinations.
- It does not change the model's architecture or fundamental capabilities.
Supervised Fine-Tuning (SFT)¶
Supervised Fine-Tuning is the most common approach. You provide a dataset of input-output pairs (examples of what the model should produce) and train it to replicate those patterns.
Training Data Format¶
Most fine-tuning APIs expect data in a conversational format:
{
"messages": [
{"role": "system", "content": "You are a medical coding assistant."},
{"role": "user", "content": "Patient presents with acute bronchitis."},
{"role": "assistant", "content": "ICD-10 Code: J20.9 - Acute bronchitis, unspecified"}
]
}
How Much Data Do You Need?¶
| Goal | Minimum Examples | Recommended |
|---|---|---|
| Style/tone adjustment | 50-100 | 200-500 |
| Task specialization | 100-500 | 500-2,000 |
| Deep domain expertise | 500-1,000 | 2,000-10,000 |
Quality Over Quantity
50 high-quality, carefully curated examples will outperform 5,000 noisy, inconsistent ones. Invest in data quality. Every example should be one you would be proud to show as a correct response.
LoRA and Efficient Fine-Tuning¶
Training all parameters of a large model is expensive and slow. LoRA (Low-Rank Adaptation) and related techniques make fine-tuning practical by training only a small fraction of the model's parameters.
How LoRA Works¶
Instead of updating all model weights during training, LoRA:
- Freezes the original model weights.
- Injects small, trainable matrices (adapters) into specific layers.
- Trains only these adapters, which are typically less than 1% of the total parameters.
Benefits¶
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Parameters trained | All (billions) | ~0.1-1% |
| GPU memory | Very high (multi-GPU) | Moderate (single GPU possible) |
| Training time | Hours to days | Minutes to hours |
| Storage | Full model copy per task | Small adapter file per task |
| Quality | Highest | Near-equivalent for most tasks |
QLoRA¶
QLoRA combines LoRA with quantization -- reducing the precision of the frozen model weights (e.g., from 16-bit to 4-bit). This further reduces memory requirements, making it possible to fine-tune large models on consumer-grade GPUs.
LoRA for Experimentation
LoRA's low cost and fast iteration make it ideal for experimentation. You can quickly test whether fine-tuning will help your use case before committing to a full training run.
Transfer Learning¶
Transfer learning is the broader concept that fine-tuning is built on. The idea is simple: knowledge learned from one task can be transferred to help with a related task.
Why It Works¶
A model pre-trained on billions of words of text has already learned:
- Grammar and sentence structure
- Common sense reasoning
- World knowledge
- Logic and pattern recognition
When you fine-tune, you are not teaching the model language from scratch. You are redirecting its existing capabilities toward your specific domain.
The Transfer Learning Pipeline¶
graph LR
A["Large Dataset\n(general text)"] --> B["Pre-Training"]
B --> C["Foundation\nModel"]
C --> D["Domain Dataset\n(your data)"]
D --> E["Fine-Tuning"]
E --> F["Specialized\nModel"]
style A fill:#057398,stroke:#004987,color:#fff
style B fill:#00A0DF,stroke:#004987,color:#fff
style C fill:#632C4F,stroke:#632C4F,color:#fff
style D fill:#853175,stroke:#632C4F,color:#fff
style E fill:#9E57A2,stroke:#632C4F,color:#fff
style F fill:#259638,stroke:#259638,color:#fff
RLHF: Reinforcement Learning from Human Feedback¶
RLHF is the technique that transformed raw language models into the helpful, harmless assistants we use today. It aligns model behavior with human preferences.
How RLHF Works (Simplified)¶
- Supervised Fine-Tuning: Start with a base model and fine-tune it on curated examples of good responses.
- Reward Model Training: Have humans rank multiple model outputs for the same prompt. Use these rankings to train a separate "reward model" that predicts which responses humans prefer.
- Reinforcement Learning: Use the reward model to further train the language model. The language model learns to generate responses that score high on the reward model.
Why RLHF Matters¶
- It is the reason models refuse harmful requests, stay on topic, and try to be helpful.
- It is how model providers align general-purpose models with safety guidelines.
- Without RLHF, models would simply predict the most statistically likely text, which often includes toxic or unhelpful content.
Alternatives to RLHF¶
- DPO (Direct Preference Optimization)
- A simpler alternative to RLHF that skips the reward model step. It directly uses human preference data to adjust model weights. Faster and more stable than RLHF.
- RLAIF (RL from AI Feedback)
- Uses an AI model (instead of humans) to provide feedback. Scales better but may inherit biases from the feedback model.
RLHF Is Typically Done by Model Providers
Unless you are building a foundation model from scratch, you will likely not implement RLHF yourself. It is included here for understanding because it fundamentally shapes how all modern AI models behave.
Cost and Effort Considerations¶
Fine-tuning is not free. Here is an honest look at the investment required:
Cost Factors¶
| Factor | Description |
|---|---|
| Compute | GPU hours for training. LoRA reduces this significantly. |
| Data preparation | Curating, cleaning, and formatting training data is the most time-consuming step. |
| Evaluation | You need a systematic way to measure whether fine-tuning improved the model. |
| Iteration | Fine-tuning rarely works perfectly on the first try. Budget for multiple rounds. |
| Hosting | Fine-tuned models may need dedicated endpoints, increasing serving costs. |
| Maintenance | As your domain changes, training data and models need updating. |
Comparison of Approaches¶
| Approach | Upfront Cost | Ongoing Cost | Time to Deploy | Flexibility |
|---|---|---|---|---|
| Prompt engineering | Very low | Per-token API costs | Hours | High (change prompts anytime) |
| RAG | Moderate (indexing pipeline) | Per-query + API costs | Days to weeks | High (update data anytime) |
| LoRA fine-tuning | Moderate (GPU + data prep) | Hosting + API costs | Weeks | Medium (retrain to update) |
| Full fine-tuning | High (multi-GPU + data prep) | Hosting + API costs | Weeks to months | Low (expensive to update) |
Do Not Fine-Tune Prematurely
Fine-tuning should be your last resort, not your first approach. Exhaust prompt engineering and RAG first. If you still cannot achieve the quality you need, then consider fine-tuning -- starting with LoRA.