Retrieval & Data¶

AI models are powerful, but they have a fundamental limitation: they only know what they were trained on. Retrieval-Augmented Generation (RAG) bridges this gap by connecting models to your real-time, enterprise data -- without retraining. This page covers how RAG works, the data infrastructure that powers it, and when to use which approach.

This page is an overview

For deep dives into each topic — embeddings, chunking strategies, vector database comparisons, GraphRAG, and evaluation — see the RAG & Knowledge Systems section. Each sub-page includes production guidance, model comparisons, and decision frameworks not covered here.

What Is RAG?¶

Retrieval-Augmented Generation (RAG) is a pattern where, instead of relying solely on a model's built-in knowledge, you retrieve relevant information from an external source and include it in the prompt before generating a response.

This solves several problems:

Stale knowledge: Models have a training cutoff date. RAG gives them access to current data.
Hallucinations: By grounding responses in retrieved facts, RAG significantly reduces fabrication.
Domain specificity: Your internal documents, policies, and data can be surfaced without fine-tuning.

How RAG Works¶

graph TD
    A["User Question"] --> B["Generate\nEmbedding"]
    B --> C["Vector\nDatabase"]
    C -->|"Similarity\nSearch"| D["Relevant\nChunks"]
    D --> E["Build Prompt\n(Question + Context)"]
    E --> F["LLM"]
    F --> G["Grounded\nResponse"]

    style A fill:#057398,stroke:#004987,color:#fff
    style B fill:#00A0DF,stroke:#004987,color:#fff
    style C fill:#632C4F,stroke:#632C4F,color:#fff
    style D fill:#853175,stroke:#632C4F,color:#fff
    style E fill:#9E57A2,stroke:#632C4F,color:#fff
    style F fill:#004987,stroke:#004987,color:#fff
    style G fill:#259638,stroke:#259638,color:#fff

Step by step:

User asks a question -- for example, "What is our company's remote work policy?"
The question is converted to an embedding -- a numerical representation that captures its meaning.
A similarity search runs against the vector database -- finding document chunks whose embeddings are closest to the question's embedding.
The most relevant chunks are retrieved -- typically the top 3-10 results.
A prompt is assembled -- combining the user's question with the retrieved context.
The LLM generates a response -- grounded in the retrieved information rather than its general training data.

RAG Is Not Search

RAG goes beyond traditional keyword search. It uses semantic similarity -- meaning it can find relevant content even when the exact words do not match. Asking "Can I work from home?" will match a document titled "Remote Work Policy" even though the words are different.

Embeddings¶

An embedding is a dense numerical vector (a list of numbers) that represents the meaning of a piece of text. Similar meanings produce vectors that are close together in high-dimensional space.

Deep dive available

For embedding model comparisons, dimension tradeoffs, input types, and production guidance, see Embeddings.

Text	Embedding (simplified)
"dog"	[0.12, 0.85, 0.33, ...]
"puppy"	[0.13, 0.84, 0.31, ...]
"automobile"	[0.91, 0.05, 0.72, ...]

Notice that "dog" and "puppy" have very similar vectors, while "automobile" is far away. This is the core principle behind semantic search.

Embedding Models¶

Embedding models are specialized models designed to convert text into vectors. They are different from generative models -- they do not produce text, only vectors.

Model	Dimensions	Provider
text-embedding-3-large	3,072	OpenAI
text-embedding-3-small	1,536	OpenAI
Cohere Embed v3	1,024	Cohere
BGE-large-en-v1.5	1,024	BAAI (open source)

Dimensions Matter

Higher dimensions capture more nuance but require more storage and compute. For most enterprise use cases, 1,024-1,536 dimensions offer a good balance.

Vector Databases¶

A vector database is a specialized data store optimized for storing, indexing, and querying embedding vectors at scale. Traditional databases use exact-match queries; vector databases use approximate nearest neighbor (ANN) algorithms to find the most similar vectors efficiently.

How Vector Search Works¶

Indexing: When you ingest a document, each chunk is embedded and stored as a vector.
Querying: When a user asks a question, the question is embedded and the database finds the closest stored vectors.
Ranking: Results are ranked by similarity score (typically cosine similarity or dot product).

Popular Vector Databases¶

Database	Type	Key Strengths
Azure AI Search	Managed service	Hybrid search (vector + keyword), integrated with Azure ecosystem
Pinecone	Managed service	Simple API, serverless option, fast at scale
Weaviate	Open source / managed	GraphQL API, multi-modal support
Qdrant	Open source / managed	Rust-based performance, filtering
Chroma	Open source	Lightweight, great for prototyping
pgvector	PostgreSQL extension	Use your existing Postgres infrastructure

Hybrid Search

The best results often come from hybrid search -- combining vector similarity with traditional keyword matching. Azure AI Search supports this natively with its hybrid search capability.

Chunking Strategies¶

Before you can embed documents, you need to break them into chunks -- smaller pieces that fit within embedding model limits and provide focused, retrievable units of information.

Chunking strategy directly impacts retrieval quality. Too large and chunks contain mixed topics. Too small and chunks lack context.

Deep dive available

For all eight chunking strategies including parent-child, late chunking, and agentic chunking — with a decision flowchart — see Chunking Strategies.

Common Strategies¶

Strategy	Description	Best For
Fixed-size	Split every N characters/tokens with overlap	Simple documents, quick setup
Sentence-based	Split on sentence boundaries	Narratives, articles
Paragraph-based	Split on paragraph breaks	Well-structured documents
Semantic	Use an embedding model to detect topic shifts	Complex documents with varied content
Recursive	Try paragraph, then sentence, then character splits	General-purpose fallback
Document-aware	Respect headings, sections, tables	Technical docs, reports with structure

Overlap Is Important

Always include overlap between chunks (typically 10-20% of chunk size). Without overlap, important information that spans a chunk boundary can be lost.

Chunk Size Guidelines¶

Use Case	Recommended Chunk Size
FAQ / short-answer retrieval	200-500 tokens
Document summarization	500-1,000 tokens
Technical documentation	300-800 tokens
Legal / regulatory text	500-1,000 tokens

Knowledge Graphs and GraphRAG¶

Traditional RAG retrieves isolated chunks. GraphRAG adds a layer of structure by building a knowledge graph from your documents -- capturing entities, relationships, and themes.

Deep dive available

For Microsoft GraphRAG's full architecture, local vs global query modes, cost tradeoffs, and implementation options, see GraphRAG.

Why GraphRAG?¶

Multi-hop reasoning: Answer questions that require connecting information across multiple documents.
Thematic understanding: Identify high-level themes and summaries across a corpus.
Better context: Provide the LLM with structured relationships, not just text snippets.

How GraphRAG Works¶

graph TD
    A["Documents"] --> B["Entity &\nRelationship\nExtraction"]
    B --> C["Knowledge\nGraph"]
    C --> D["Community\nDetection"]
    D --> E["Summarization"]

    F["User Query"] --> G["Graph\nTraversal"]
    C --> G
    G --> H["Structured\nContext"]
    H --> I["LLM"]
    I --> J["Rich\nResponse"]

    style A fill:#057398,stroke:#004987,color:#fff
    style B fill:#00A0DF,stroke:#004987,color:#fff
    style C fill:#632C4F,stroke:#632C4F,color:#fff
    style D fill:#853175,stroke:#632C4F,color:#fff
    style E fill:#9E57A2,stroke:#632C4F,color:#fff
    style F fill:#057398,stroke:#004987,color:#fff
    style G fill:#00A0DF,stroke:#004987,color:#fff
    style H fill:#57C0E8,stroke:#004987,color:#000
    style I fill:#004987,stroke:#004987,color:#fff
    style J fill:#259638,stroke:#259638,color:#fff

GraphRAG vs Standard RAG

Use standard RAG when questions are about specific facts in specific documents. Use GraphRAG when questions require synthesis across many documents or understanding of relationships between entities.

RAG vs Fine-Tuning: When to Use Which¶

This is one of the most common decisions in AI application design. Here is a clear comparison:

Factor	RAG	Fine-Tuning
Best for	Grounding in specific, changing data	Teaching the model new behaviors or styles
Data freshness	Real-time (data can be updated anytime)	Static (requires retraining to update)
Setup effort	Moderate (indexing pipeline)	High (training pipeline, GPU resources)
Cost	Per-query retrieval cost + LLM cost	Upfront training cost + inference cost
Hallucination control	Strong (responses grounded in retrieved data)	Moderate (model may still hallucinate)
Customization depth	Surface-level (provides context)	Deep (changes model behavior)
When data changes	Re-index documents	Retrain the model
Example use case	"Answer questions about our HR policies"	"Write emails in our brand voice"

Combine Them

RAG and fine-tuning are not mutually exclusive. A fine-tuned model that also uses RAG can provide domain-specific behavior and up-to-date, grounded responses. Many production systems use both.

Decision Guide¶

Use RAG WhenUse Fine-Tuning WhenUse Both When

Your data changes frequently
You need citations and traceability
You want to avoid retraining costs
Accuracy on specific documents is critical
You need to keep data private (data stays in your infrastructure)

You need the model to adopt a specific tone or style
The task requires specialized domain knowledge baked into the model
You want to reduce prompt size (the model "just knows" things)
Latency is critical and you cannot afford retrieval overhead

You need domain-specific behavior AND current data
You want a fine-tuned model that can also reference live documents
You are building a production system that requires both accuracy and style

References¶

Next Steps¶

For deeper RAG coverage: RAG & Knowledge Systems
For agent-based retrieval: Agentic AI
To compare RAG and fine-tuning approaches: Fine-Tuning & Training