# Infrastructure & Operations
Building an AI model is only half the challenge. Running it reliably, efficiently, and cost-effectively in production is the other half. This page covers MLOps (the operational practices for machine learning), model optimization, and the infrastructure decisions that determine whether your AI system scales or stalls.
## MLOps: DevOps for Machine Learning
MLOps (Machine Learning Operations) applies the principles of DevOps -- automation, monitoring, version control, CI/CD -- to the machine learning lifecycle. It bridges the gap between data science experiments and production systems.
### MLOps vs DevOps
| Aspect | DevOps | MLOps |
|---|---|---|
| What is versioned | Code | Code + data + models + experiments |
| What is tested | Application behavior | Model accuracy + application behavior |
| What is deployed | Application artifacts | Model artifacts + serving infrastructure |
| What is monitored | Uptime, latency, errors | Uptime, latency, errors + model accuracy + data drift |
| What triggers redeployment | Code changes | Code changes + data changes + model degradation |
| Pipeline | Build, test, deploy | Ingest, train, evaluate, deploy, monitor |
!!! note "MLOps Maturity"

    Most teams start at "manual everything" (level 0) and gradually automate. You do not need a fully automated MLOps pipeline on day one. Start with version control for data and models, then add automation incrementally.
### The MLOps Lifecycle
```mermaid
graph LR
A["Data\nCollection"] --> B["Data\nPreparation"]
B --> C["Model\nTraining"]
C --> D["Model\nEvaluation"]
D --> E["Model\nRegistry"]
E --> F["Deployment"]
F --> G["Monitoring"]
G -->|"Drift\nDetected"| A
style A fill:#057398,stroke:#004987,color:#fff
style B fill:#00A0DF,stroke:#004987,color:#fff
style C fill:#632C4F,stroke:#632C4F,color:#fff
style D fill:#853175,stroke:#632C4F,color:#fff
style E fill:#9E57A2,stroke:#632C4F,color:#fff
style F fill:#004987,stroke:#004987,color:#fff
style G fill:#259638,stroke:#259638,color:#fff
```
Key stages:
- Data Collection: Gather raw data from sources -- databases, APIs, logs, user interactions.
- Data Preparation: Clean, transform, and feature-engineer the data. Track lineage so you know where every data point came from.
- Model Training: Train (or fine-tune) the model. Log hyperparameters, metrics, and artifacts.
- Model Evaluation: Compare the new model against baselines using defined metrics. Automated evaluation gates prevent bad models from reaching production.
- Model Registry: Store versioned models with metadata (who trained it, on what data, with what performance).
- Deployment: Serve the model via an API endpoint, batch pipeline, or edge device.
- Monitoring: Track model performance, data quality, and operational health in production. When quality degrades, trigger the cycle again.
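The training, evaluation, and registry stages map naturally onto an experiment tracker. Below is a minimal sketch using MLflow; the synthetic dataset, the `BASELINE_ACCURACY` gate, and the model name `churn-model` are illustrative assumptions, not part of any specific stack described on this page.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

BASELINE_ACCURACY = 0.80  # evaluation gate: models below this are not registered

# Synthetic stand-in for the prepared, lineage-tracked dataset
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

with mlflow.start_run():
    # Training: log hyperparameters alongside the run
    params = {"n_estimators": 100, "max_depth": 8}
    mlflow.log_params(params)
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Evaluation: compare against the defined baseline
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    mlflow.log_metric("accuracy", accuracy)

    # Registry: only versioned, gate-passing models reach production
    if accuracy >= BASELINE_ACCURACY:
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name="churn-model")
```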
## Model Drift and Monitoring
Model drift is the gradual degradation of model performance over time. A model that was accurate at launch may become unreliable as the real world changes around it.
### Types of Drift
- **Data drift**: The distribution of input data changes. For example, a customer sentiment model trained on pre-pandemic reviews may perform poorly on post-pandemic data because the language and topics shifted.
- **Concept drift**: The relationship between inputs and outputs changes. For example, a fraud detection model may degrade as fraudsters develop new techniques that look nothing like historical fraud patterns.
- **Feature drift**: The data pipeline changes, causing features to be computed differently or become unavailable. For example, a feature that previously held "days since last purchase" is now always zero due to a data pipeline bug.
### Monitoring Strategy
| What to Monitor | How | Alert Threshold |
|---|---|---|
| Prediction accuracy | Compare predictions to ground truth (when available) | Accuracy drops below defined baseline |
| Input data distribution | Statistical tests (KS test, PSI) comparing current vs training data | Distribution shift exceeds threshold |
| Output distribution | Track the distribution of model predictions over time | Sudden changes in prediction patterns |
| Latency | Measure end-to-end response time | P95 latency exceeds SLA |
| Error rates | Track failed predictions, timeouts, and exceptions | Error rate exceeds baseline |
| Token usage | Monitor tokens consumed per request | Unexpected spikes in consumption |
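As a concrete sketch of the input-distribution row above, the following compares a feature's training-time distribution against production samples using a two-sample Kolmogorov-Smirnov test (via SciPy) plus a Population Stability Index. The PSI threshold of 0.2 is a common rule of thumb, not a standard; tune thresholds to your own baselines.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training and production samples."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) for empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

training = np.random.normal(0.0, 1.0, 5000)    # feature at training time
production = np.random.normal(0.4, 1.2, 5000)  # same feature in production

stat, p_value = ks_2samp(training, production)
if p_value < 0.01 or psi(training, production) > 0.2:
    print("Data drift detected: trigger the retraining cycle")
```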
!!! warning "Monitoring Is Not Optional"

    In production AI systems, monitoring is as critical as the model itself. Without it, you will not know your model is degrading until users complain -- or worse, until bad decisions are already made.
## Quantization and Model Optimization
Quantization reduces the precision of a model's numerical weights -- for example, from 32-bit floating point (FP32) to 8-bit integers (INT8) or even 4-bit. This dramatically reduces model size, memory usage, and inference latency, often with minimal impact on quality.
### How Quantization Works
| Precision | Bits per Weight | Relative Size | Typical Quality Impact |
|---|---|---|---|
| FP32 | 32 | 1x (baseline) | None (full precision) |
| FP16 / BF16 | 16 | 0.5x | Negligible |
| INT8 | 8 | 0.25x | Minimal for most tasks |
| INT4 | 4 | 0.125x | Noticeable on complex reasoning |
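To make the size arithmetic concrete, here is a minimal symmetric INT8 quantization of a weight tensor in NumPy. Production quantizers work per-channel or per-group with calibration data; this sketch only shows the core scale-round-clip idea and the resulting 4x size reduction.

```python
import numpy as np

weights = np.random.randn(4096).astype(np.float32)  # FP32: 4 bytes per weight

scale = np.abs(weights).max() / 127.0               # map [-max, max] -> [-127, 127]
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # INT8: 1 byte
dequantized = q.astype(np.float32) * scale

print(f"size: {weights.nbytes} B -> {q.nbytes} B (0.25x)")
print(f"mean absolute rounding error: {np.abs(weights - dequantized).mean():.5f}")
```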
### Quantization Methods
- **Post-training quantization (PTQ)**: Applied after training is complete. No additional training data is needed. Fast and easy but may lose more quality than training-aware methods.
- **Quantization-aware training (QAT)**: Simulates quantization during training, allowing the model to adapt its weights to lower precision. Produces better quality but requires training infrastructure.
- **GPTQ / AWQ / GGUF**: Specialized quantization formats for LLMs. GPTQ and AWQ are GPU-focused, while GGUF (used by llama.cpp) is optimized for CPU and edge inference.
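As one concrete PTQ route, PyTorch's dynamic quantization converts `Linear` layers to INT8 after training with no calibration data. The toy model below is a stand-in; for LLMs you would typically reach for GPTQ, AWQ, or GGUF tooling instead.

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights become INT8
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # inference works as before, with INT8 weights
```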
!!! tip "Start with INT8"

    For most deployment scenarios, INT8 quantization provides an excellent balance of size reduction and quality preservation. Only go to INT4 if you have strict hardware constraints and can tolerate some quality loss.
## Edge AI and On-Device Inference
Edge AI runs models directly on local devices -- laptops, phones, IoT devices, on-premise servers -- rather than sending data to the cloud. This is increasingly practical with small language models and quantization.
### When to Use Edge AI
| Scenario | Why Edge Makes Sense |
|---|---|
| Data privacy | Sensitive data never leaves the device or local network |
| Low latency | No network round-trip to a cloud API |
| Offline operation | Works without internet connectivity |
| Cost at scale | No per-query API costs for high-volume use cases |
| Regulatory compliance | Data residency requirements mandate local processing |
### Edge AI Technologies
| Technology | Description |
|---|---|
| ONNX Runtime | Cross-platform inference engine, supports quantized models |
| llama.cpp | C++ inference for LLMs, runs on CPU, supports GGUF quantization |
| TensorFlow Lite | Google's on-device ML framework |
| Apple Core ML | On-device inference optimized for Apple hardware |
| Windows ML | ML inference on Windows devices using DirectML |
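A minimal sketch of on-device inference with ONNX Runtime follows; `model.onnx`, the CPU provider choice, and the input shape are placeholders for whatever quantized model you export.

```python
import numpy as np
import onnxruntime as ort

# Load an exported (possibly quantized) model; runs entirely on local hardware
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image tensor

outputs = session.run(None, {input_name: batch})  # no network round-trip
print(outputs[0].shape)
```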
### Trade-Offs
Edge AI is not free. You gain privacy, latency, and cost benefits, but you trade model capability. A 3B-parameter quantized model running on a laptop will not match the quality of GPT-4o running in the cloud. Choose edge deployment when the trade-off makes sense for your use case.
## Cost Management in AI
AI infrastructure costs can scale quickly if not managed carefully. Here are the main cost drivers and how to control them:
### Cost Drivers
| Cost Driver | Description | How to Optimize |
|---|---|---|
| API tokens | Pay-per-token for hosted model APIs | Optimize prompts, cache responses, use smaller models for simple tasks |
| GPU compute | Training and inference on GPU instances | Use spot instances, right-size GPU SKUs, quantize models |
| Storage | Vector databases, model artifacts, training data | Compress embeddings, archive old model versions, use tiered storage |
| Data processing | ETL pipelines, embedding generation, indexing | Batch operations, incremental updates instead of full re-indexing |
| Monitoring | Logging, tracing, evaluation | Sample traces rather than logging everything, set retention policies |
### Cost Optimization Strategies
**Prompt and token efficiency**

- Remove unnecessary context from prompts
- Use shorter system prompts
- Cache common responses
- Batch similar requests

**Model selection**

- Use SLMs for simple tasks (classification, extraction)
- Use LLMs only for complex reasoning
- Route requests to the cheapest capable model (see the sketch after this list)
- Consider open-source models for high-volume workloads

**Infrastructure**

- Use auto-scaling to match demand
- Deploy in regions with lower compute costs
- Use spot/preemptible instances for training
- Quantize models to reduce serving costs
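A minimal sketch of that routing idea, with hypothetical model tiers and illustrative per-token prices; substitute your provider's actual catalog and a real capability check.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative placeholder prices

SMALL = ModelTier("small-model", 0.0002)
LARGE = ModelTier("large-model", 0.0050)

# Task types the small model is assumed capable of handling
SIMPLE_TASKS = {"classification", "extraction", "formatting"}

def route(task_type: str) -> ModelTier:
    """Pick the cheapest model believed capable of the task."""
    return SMALL if task_type in SIMPLE_TASKS else LARGE

print(route("extraction").name)            # -> small-model
print(route("multi-step-reasoning").name)  # -> large-model
```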
!!! tip "Track Cost Per Request"

    Establish a metric for cost per request or cost per user interaction. This helps you make informed decisions about model selection, prompt design, and infrastructure choices. Without this metric, costs tend to grow unnoticed.
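One way to wire this up, assuming simple per-token API pricing (the prices below are placeholders): emit one cost number per request and log it alongside latency and error metrics so budget drift is visible in your dashboards.

```python
INPUT_PRICE_PER_1K = 0.0025   # USD per 1K input tokens (illustrative)
OUTPUT_PRICE_PER_1K = 0.0100  # USD per 1K output tokens (illustrative)

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of a single API-backed request."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Log this per request, next to latency and error rate
print(f"${cost_per_request(input_tokens=1200, output_tokens=350):.4f}")
```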
## Putting It All Together
A production AI system brings together all of these concerns:
| Layer | Concerns | Key Decisions |
|---|---|---|
| Model | Selection, fine-tuning, quantization | Which model? Cloud or edge? What precision? |
| Data | Ingestion, embedding, indexing, freshness | How often to re-index? What chunking strategy? |
| Application | Orchestration, guardrails, caching | What framework? What safety checks? |
| Infrastructure | Compute, storage, networking | GPU SKUs, auto-scaling, regions |
| Operations | Monitoring, alerting, incident response | What to monitor? What are the SLAs? |
| Cost | Budgeting, optimization, chargeback | Cost per request? Budget alerts? |
Each layer has its own best practices, but they are deeply interconnected. A change in model selection (e.g., switching from GPT-4o to Phi-4) ripples through infrastructure (less GPU needed), cost (lower per-request), and application (may need prompt adjustments).