Building a Comprehensive RAG System: A Deep Dive Into Knowledge Architecture
A comprehensive guide to building a Retrieval-Augmented Generation (RAG) system that combines vector databases, document processing, and large language models to create intelligent knowledge retrieval applications.
TL;DR: This guide walks you through building a production-ready RAG system using FastAPI, ChromaDB, MinIO, and OpenAI. Learn document chunking, vector embeddings, hybrid search, and real-world deployment strategies.
Introduction
As a .NET developer watching the AI landscape evolve, I found myself both excited and skeptical. When tools like Claude.ai and ChatGPT started offering out-of-the-box RAG solutions, I wanted to build my own system with full control over the implementation.
My goals were clear:
- Control exactly which documents were accessible
- Run everything on our own infrastructure for data privacy
- Customize the retrieval and generation processes
- Learn the underlying technology by building it
Despite being primarily a .NET developer, I chose Python for this project—its AI and machine learning ecosystem made it the practical choice. The learning curve proved steep, yet diving into a new tech stack delivered immense value.
Over several weekends, I built a prototype RAG system that ingests project documentation and answers specific questions about architecture and implementation details.
This article shares everything I learned building a comprehensive RAG system—architecture, technical components, challenges, and best practices.
What is RAG? (A Developer-Friendly Explanation)
Retrieval-Augmented Generation (RAG) combines a smart search engine with an AI writer to deliver accurate, document-grounded answers:
| Step | What Happens |
|---|---|
| 1. Retrieval | When you ask a question, the system searches through your documents to find the most relevant information |
| 2. Augmentation | It takes those search results and prepares them as context |
| 3. Generation | It passes that context to an LLM (like GPT-4) which generates a helpful response |
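Concretely, the whole loop fits in a few lines. Here is a minimal sketch, where `vector_store` stands in for any LangChain-style vector store and `llm` for a chat model; it is an illustration of the three steps, not the project's actual code:

```python
def answer(question: str) -> str:
    # 1. Retrieval: find the chunks most similar to the question
    chunks = vector_store.similarity_search(question, k=4)

    # 2. Augmentation: assemble the retrieved text into a context block
    context = "\n\n".join(chunk.page_content for chunk in chunks)

    # 3. Generation: ask the LLM to answer using only that context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content
```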
Why RAG Matters
RAG grounds AI responses in your actual documents, which sharply reduces hallucinations. The model can reference information it was never trained on because you supply the relevant context at query time.
Analogy: Imagine asking a friend a question about a movie they watched long ago. Without RAG, they might try to recall from memory and make up details they’ve forgotten. With RAG, they’d quickly look up the movie details online before answering, ensuring accuracy.
RAG System Architecture Visualization
To better understand how the different components of a RAG system interact, let’s look at a visual representation of the architecture:
```mermaid
flowchart TD
    User([User]) <--> |Questions & Answers| UI[Next.js Frontend]
    UI <--> |API Calls| BE[FastAPI Backend]

    subgraph "Document Processing Pipeline"
        direction LR
        UP[Document Upload] --> Parser[Document Parser]
        Parser --> Chunker[Text Chunker]
        Chunker --> Embedder[Embedding Generator]
        Embedder --> Indexer[Vector Indexer]
    end

    subgraph "Storage Layer"
        direction LR
        DB[(PostgreSQL)] --- MetaStore[Metadata Store]
        MinIO[(MinIO)] --- DocStore[Document Store]
        VDB[(ChromaDB)] --- VecStore[Vector Store]
    end

    subgraph "Retrieval-Generation Pipeline"
        direction LR
        QP[Query Processor] --> VS[Vector Search]
        VS --> HybridRanker[Hybrid Ranking]
        HybridRanker --> ContextBuilder[Context Builder]
        ContextBuilder --> LLM[OpenAI LLM]
    end

    BE --> UP
    Indexer --> VDB
    Parser --> MinIO
    BE --> MetaStore
    BE --> QP
    VS --> VDB
    LLM --> UI

    classDef pipeline fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1;
    classDef storage fill:#FFF3E0,stroke:#E65100,stroke-width:2px,color:#BF360C;
    classDef external fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20;
    classDef subgraphStyle fill:#FAFAFA,stroke:#616161,stroke-width:1px;

    class User,UI external
    class DB,MinIO,VDB storage
    class UP,Parser,Chunker,Embedder,Indexer,QP,VS,HybridRanker,ContextBuilder,LLM pipeline
```
The diagram above illustrates the complete flow of our RAG system:
- User Interaction Layer: Users interact with the Next.js frontend, submitting questions and receiving answers.
- Document Processing Pipeline:
  - Documents are uploaded through the UI
  - The parser extracts text based on document type (PDF, DOCX, images, etc.)
  - The chunker breaks documents into manageable pieces
  - The embedding generator creates vector representations
  - The indexer stores these vectors in ChromaDB
- Storage Layer:
  - PostgreSQL stores metadata about documents and relationships
  - MinIO stores the original document files
  - ChromaDB stores and indexes vector embeddings
- Retrieval-Generation Pipeline:
  - Query processor handles and optimizes user questions
  - Vector search finds relevant document chunks
  - Hybrid ranking combines vector and keyword search results
  - Context builder assembles coherent context for the LLM
  - OpenAI's LLM generates the final response
This architecture ensures efficient document processing, accurate retrieval, and high-quality responses based on your own data.
How RAG Works: The Building Blocks
The following components form the foundation of every RAG system:
Document Chunking: Making Content Digestible
Chunking resembles cutting a long novel into individual chapters. Large documents (like a 100-page PDF) need to be broken into smaller pieces with specific characteristics:
| Requirement | Why It Matters |
|---|---|
| Small enough | To be processed efficiently by embedding models |
| Large enough | To maintain meaningful context |
| Properly split | At logical boundaries (like paragraphs) |
Why is chunking important? Imagine searching through a library. It’s much easier to find information if you can look at individual chapters rather than entire books at once.
In my implementation, I use a technique called recursive character splitting:
```python
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Each chunk is about 1000 characters
    chunk_overlap=200,     # Overlap between chunks to maintain context
    separators=["\n\n", "\n", " ", ""]  # Try to split at paragraph breaks first
)
```
The overlap is crucial — it’s like having the last few sentences of the previous chapter at the beginning of the next one, helping maintain continuity between chunks.
Vector Embeddings: Teaching Computers to Understand Meaning
Vector embeddings convert text into numerical representations (vectors) that capture semantic meaning. Two phrases with similar meanings produce similar vector patterns, regardless of the exact words used.
Example: The phrases “I love programming” and “I enjoy coding” would have similar vector representations because they have similar meanings, even though they share few words.
How Embeddings Work
When we process text like “The quick brown fox jumps over the lazy dog,” we get a list of numbers (a vector) that represents its meaning in hundreds or thousands of dimensions.
In simpler terms:
- Each word or phrase gets converted into a list of numbers
- Similar meanings result in similar number patterns
- We can measure how similar two texts are by comparing their number patterns
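Under the hood, "comparing number patterns" usually means cosine similarity. A toy illustration with NumPy (not part of the project code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means very similar meaning, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embeddings of semantically similar sentences land close together, so
# cosine_similarity(embed("I love programming"), embed("I enjoy coding"))
# scores much higher than either does against an unrelated sentence.
```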
My implementation uses OpenAI’s text-embedding-3-large model:
```python
# A simplified view of how embeddings work
text = "What is the authentication strategy for our mobile app?"
embedding = openai_embedding_model.embed_query(text)
# embedding is now a vector like [0.023, -0.041, 0.067, ...] with 3072 dimensions
```
Vector Databases: Finding Needles in Numeric Haystacks
Once document chunks are converted to vectors, a specialized database stores and searches them. Traditional databases excel at exact matches (like “show me all customers named ‘Smith’”). Vector databases excel at finding semantically similar items (like “show me text related to this question”).
In my system, I use ChromaDB:
```python
# Store document chunks; the collection embeds them and keeps the vectors
collection.add_documents([
    document_chunk_1,  # a chunk of text with its metadata
    document_chunk_2,
])

# Later, search for chunks whose embeddings are closest to the question
query_embedding = openai_embedding_model.embed_query("What's our authentication strategy?")
similar_chunks = collection.similarity_search_by_vector(query_embedding)
```
The vector database uses sophisticated algorithms to quickly find the most similar vectors without checking every single one — essential when you have thousands or millions of chunks.
How RAG Search Works: A Step-by-Step Explanation
When you ask RAG a question, here’s what happens behind the scenes:
```mermaid
flowchart LR
    A[User Question] --> B[Query Processing]
    B --> C[Similarity Search]
    C --> D[Context Building]
    D --> E[Response Generation]
    E --> F[Answer with Sources]

    classDef default fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1;
```
| Step | What Happens |
|---|---|
| 1. Query Processing | Your question is converted into a vector embedding; key terms may be extracted for hybrid search |
| 2. Similarity Search | The vector database finds chunks with the most similar embeddings to your question |
| 3. Context Building | Retrieved chunks are assembled into a coherent context with source information |
| 4. Response Generation | The context and your question are sent to an LLM which generates a grounded response |
Example: If you ask “What authentication strategy did we choose for mobile?”, the system finds chunks mentioning authentication, mobile apps, and security protocols — then assembles them into context for the LLM to answer: “According to the Mobile Architecture Document, we chose OAuth 2.0 with PKCE for mobile authentication because…”
System Architecture
The stack: FastAPI backend, Next.js/React frontend. Here’s how it breaks down:
Backend Architecture
The backend is built using FastAPI, a modern Python web framework, and is organized into several modules:
| Module | Responsibility |
|---|---|
| Document Processing | Handles file uploads, parsing, chunking, and embedding generation |
| Vector Database | ChromaDB integration for storing and retrieving embeddings |
| Document Storage | MinIO for files, PostgreSQL for metadata |
| Retrieval System | Hybrid search, filtering, ranking, and context preparation |
| Generation System | OpenAI integration for response generation |
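To show how these modules hang together, here is a stripped-down sketch of a query endpoint; the route path and the `retrieval_service`/`generation_service` objects are illustrative placeholders, not the project's exact API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    collections: list[str] = ["documents"]

@app.post("/api/query")
async def query(req: QueryRequest):
    # Retrieval System: find and rank relevant chunks, then build the context
    retrieval = retrieval_service.retrieve_for_rag(req.question, req.collections)
    # Generation System: pass the context to the LLM and return a grounded answer
    answer = generation_service.generate(req.question, retrieval["context"])
    return {"answer": answer, "sources": [d.metadata for d in retrieval["documents"]]}
```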
Frontend Architecture
The frontend is built with Next.js and React, offering a responsive and interactive user interface:
| Feature | Description |
|---|---|
| Chat Interface | Real-time chat with message history and response metrics |
| Document Management | File upload, web scraping, document listing and deletion |
| Data Sources | Configuration for different data source types |
Core Concepts
Quick reference for the technical pieces that matter:
Document Chunking
Why chunk? Three reasons:
| Reason | Explanation |
|---|---|
| Context Windows | Most LLMs have a limited context window. Chunking ensures documents can be processed regardless of size |
| Granular Retrieval | Smaller chunks enable more precise retrieval of just the relevant sections |
| Efficient Embedding | Creating embeddings for smaller chunks is more efficient and produces better results |
In the project, chunking is handled by the RecursiveCharacterTextSplitter class:
```python
# From backend/app/services/parsers/base.py
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=settings.CHUNK_SIZE,
    chunk_overlap=settings.CHUNK_OVERLAP,
    separators=["\n\n", "\n", " ", ""]
)
```
The chunking process involves:
- Setting a maximum chunk size (e.g., 1000 characters)
- Defining an overlap between chunks (e.g., 200 characters) to preserve context
- Using intelligent separators (like paragraph breaks) to avoid splitting mid-sentence
Pro Tip: The overlap is particularly important because it helps maintain context between chunks and ensures that information split across chunk boundaries is still retrievable.
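You can see the overlap for yourself with a tiny standalone experiment (this assumes the langchain-text-splitters package; the sizes are shrunk so the effect is obvious):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,      # tiny chunks so the overlap is easy to see
    chunk_overlap=30,
    separators=["\n\n", "\n", " ", ""]
)

text = ("RAG systems retrieve relevant chunks before generation. "
        "Chunk overlap repeats the tail of one chunk at the head of the next, "
        "so a sentence split across a boundary still appears intact somewhere.")

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"chunk {i}: {chunk!r}")
# Adjacent chunks share roughly 30 characters, preserving context across the boundary.
```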
Vector Embeddings
Embeddings turn text into numbers that capture meaning. Here’s how the project handles them:
```python
# From backend/app/services/vector_store.py
def _get_embeddings(self):
    """Initialize embedding model based on configuration"""
    if settings.OPENAI_API_KEY:
        logger.info("Using OpenAI embeddings")
        return OpenAIEmbeddings(
            openai_api_key=settings.OPENAI_API_KEY
        )
    else:
        logger.info(f"Using local HuggingFace embeddings: {settings.EMBEDDING_MODEL}")
        return HuggingFaceEmbeddings(
            model_name=settings.EMBEDDING_MODEL
        )
```
Key Aspects of Embeddings
| Aspect | Description |
|---|---|
| Semantic Understanding | Captures meaning, not just keywords. “climate change” and “global warming” have similar embeddings |
| Dimensionality | Modern embeddings typically have 768-3072 dimensions. This project uses 3072 dimensions |
| Vector Operations | Mathematical operations like cosine similarity can measure text similarity |
| Language Agnostic | Same model works across multiple languages |
Vector Databases
Regular databases find exact matches. Vector databases find similar things. I use ChromaDB:
```python
# From backend/app/services/vector_store.py
def get_collection(self, collection_name: str):
    """Get or create a collection in the vector store"""
    if collection_name not in self.collections:
        client = self._get_chroma_client()

        # Check if collection exists already
        existing_collections = client.list_collections()
        collection_exists = any(c.name == collection_name for c in existing_collections)

        if not collection_exists:
            logger.info(f"Creating new collection: {collection_name}")
            client.create_collection(name=collection_name)

        # Initialize LangChain Chroma wrapper with our collection
        self.collections[collection_name] = Chroma(
            client=client,
            collection_name=collection_name,
            embedding_function=self.embeddings
        )

    # Return the cached wrapper so callers can search and add documents
    return self.collections[collection_name]
```
Why Vector Databases Are Essential
| Feature | Benefit |
|---|---|
| ANN Search | Quickly find similar vectors without checking every single one |
| Scalability | Handle millions of vectors efficiently |
| Metadata Filtering | Filter by metadata (e.g., only search PDFs) |
| Collections | Organize vectors into logical groups |
ChromaDB Features: Persistent storage, multiple embedding model support, REST API for remote access, and document metadata storage alongside vectors.
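As a quick sketch of metadata filtering through the LangChain Chroma wrapper (assuming chunks were stored with a `source_type` metadata field; the field name is illustrative, not necessarily the project's):

```python
# Only search chunks that came from PDF files, within the "documents" collection
results = vector_store.get_collection("documents").similarity_search(
    "What's our authentication strategy?",
    k=5,
    filter={"source_type": "pdf"},  # metadata filter pushed down to ChromaDB
)
for doc in results:
    print(doc.metadata.get("title"), "->", doc.page_content[:80])
```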
Object Storage with MinIO
MinIO stores the original files. S3-compatible, self-hosted, works great with Docker:
```python
# From backend/app/services/object_storage.py
def upload_file(
    self,
    file_data: BinaryIO,
    object_name: str,
    bucket_name: str = "documents",
    content_type: Optional[str] = None
) -> str:
    """Upload a file to object storage"""
    client = self._get_client()

    # Get file size
    file_data.seek(0, os.SEEK_END)
    file_size = file_data.tell()
    file_data.seek(0)

    client.put_object(
        bucket_name=bucket_name,
        object_name=object_name,
        data=file_data,
        length=file_size,
        content_type=content_type
    )

    return f"{bucket_name}/{object_name}"
```
Why MinIO?
- Scales well with large file volumes
- S3-compatible API—swap to AWS later if needed
- Keeps the database lean—PostgreSQL holds metadata, MinIO holds actual files
- Docker-friendly deployment
- Free to self-host
The system organizes files into buckets like “documents,” “images,” and “raw” to keep different types of content separate.
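Bucket setup with the MinIO Python client is only a few lines; a minimal sketch with placeholder endpoint and credentials for a local Docker setup:

```python
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",   # placeholder credentials for a local Docker instance
    secret_key="minioadmin",
    secure=False,
)

# Create the buckets the system expects, if they don't exist yet
for bucket in ("documents", "images", "raw"):
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
```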
PostgreSQL for Metadata
PostgreSQL tracks everything about the documents—titles, processing status, chunk references:
```python
# From backend/app/models/document.py
class Document(Base):
    """Model for storing document metadata"""
    __tablename__ = "documents"

    id = Column(String, primary_key=True, index=True, default=lambda: str(uuid.uuid4()))
    filename = Column(String, index=True)
    title = Column(String, index=True)
    description = Column(Text, nullable=True)
    mime_type = Column(String)
    source_type = Column(String)   # file, database, website, etc.
    source_path = Column(String)   # Original path or URL
    storage_path = Column(String)  # Path in MinIO
    file_size = Column(Integer, nullable=True)
    page_count = Column(Integer, nullable=True)

    # Metadata
    doc_metadata = Column(JSON, nullable=True)

    # Processing status
    is_processed = Column(Boolean, default=False)
    is_indexed = Column(Boolean, default=False)
    processing_error = Column(Text, nullable=True)

    # Timestamps
    created_at = Column(DateTime(timezone=True), server_default=func.now())
    updated_at = Column(DateTime(timezone=True), onupdate=func.now())

    # Relationships
    chunks = relationship("DocumentChunk", back_populates="document", cascade="all, delete-orphan")
```
What PostgreSQL handles:
- Document metadata: titles, descriptions, MIME types, processing status
- Relationships: linking documents to their chunks
- Query logs: what users ask, how the system performs
- Transactions: keeps things consistent when processing fails midway
- JSON columns: flexible metadata without schema changes
Tables in the schema:
- `documents`: file metadata
- `document_chunks`: individual chunks with vector IDs
- `data_sources`: external connection configs
- `query_logs`: user queries and response times
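The `document_chunks` table pairs each chunk with the ID of its vector in ChromaDB. A sketch of what that model could look like; the exact columns are my assumption, not the project's schema:

```python
class DocumentChunk(Base):
    """Model linking a chunk of text to its vector in ChromaDB"""
    __tablename__ = "document_chunks"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    document_id = Column(String, ForeignKey("documents.id"), index=True)
    chunk_index = Column(Integer)           # position of the chunk within the document
    content = Column(Text)                  # the chunk text itself
    vector_id = Column(String, index=True)  # ID of the embedding stored in ChromaDB

    document = relationship("Document", back_populates="chunks")
```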
My Implementation Journey: Challenges and Solutions
I ran into three major problems while building this. Here’s what broke and how I fixed it:
Challenge 1: Document Parsing Complexity
Different document formats presented unique challenges. PDFs with complex layouts would often result in garbled text, while scanned documents were essentially images requiring OCR.
Solution: I implemented a specialized parser factory that selects the appropriate parser based on MIME type:
```python
def get_parser(self, mime_type: str):
    """Get the appropriate parser for the given MIME type"""
    # For application/pdf and similar document types
    if mime_type in ['application/pdf', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']:
        return document_parser
    # For web pages (checked before the generic text/ prefix so it isn't shadowed)
    elif mime_type == 'text/html':
        return web_parser
    # For other text-based documents
    elif mime_type.startswith('text/'):
        return text_parser
    # For images
    elif mime_type.startswith('image/'):
        return image_parser
```
For PDFs, I used PyMuPDF which handles complex layouts better than alternatives. For images, I integrated Tesseract OCR with preprocessing steps to improve recognition quality.
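As a rough sketch of those two extraction paths (not the project's parser code, just the libraries in their simplest form):

```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pdf_text(path: str) -> str:
    """Extract text from a PDF page by page with PyMuPDF."""
    with fitz.open(path) as doc:
        return "\n\n".join(page.get_text() for page in doc)

def extract_image_text(path: str) -> str:
    """OCR a scanned page or image with Tesseract."""
    image = Image.open(path).convert("L")  # grayscale preprocessing helps recognition
    return pytesseract.image_to_string(image)
```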
Challenge 2: Semantic Search Quality
Initial semantic search results proved inconsistent. The system occasionally returned irrelevant chunks while better matches sat unused in the database.
Solution: I implemented a hybrid retrieval system combining vector search with keyword matching:
```python
def _filter_relevant_documents(self, query: str, documents: List[Document], scores: List[float]):
    """Filter documents based on relevance to the query"""
    # Extract key terms from the query
    query_terms = set(self._extract_key_terms(query.lower()))

    # Process each document
    relevant_docs = []
    for doc, score in zip(documents, scores):
        # Vector similarity score (ChromaDB returns a distance, so lower is better)
        relevance_score = 1.0 - score

        # Term matching score
        content = doc.page_content.lower()
        term_matches = sum(1 for term in query_terms if term in content)
        term_match_ratio = term_matches / len(query_terms) if query_terms else 0

        # Combined score with appropriate weights
        combined_score = (0.7 * relevance_score) + (0.3 * term_match_ratio)

        # Filter based on threshold, keeping the score for ranking
        if combined_score >= 0.5:
            doc.metadata["relevance_score"] = combined_score
            relevant_docs.append(doc)

    # Sort by relevance score
    return sorted(relevant_docs, key=lambda x: x.metadata.get("relevance_score", 0), reverse=True)
```
This approach dramatically improved results—documents now match semantically AND contain the actual keywords users type.
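The `_extract_key_terms` helper isn't shown in the excerpt above. A hypothetical version, written here as a standalone function, just tokenizes, lowercases, and drops stop words:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "what", "how", "for", "our", "of", "to", "in"}

def extract_key_terms(text: str) -> list[str]:
    """Hypothetical key-term extraction: tokenize and drop stop words and short tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
```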
Challenge 3: Performance at Scale
Once I hit a few thousand documents, everything slowed to a crawl. Indexing took forever and queries lagged noticeably.
Solution: Three fixes:
- Batch processing for document ingestion:
```python
def add_documents(self, documents: List[Document], collection_name: str) -> List[str]:
    """Add documents to the vector store in batches"""
    # Get appropriate collection
    collection = self.get_collection(collection_name)

    # Process in batches of 100
    batch_size = 100
    vector_ids = []

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_ids = collection.add_documents(batch)
        vector_ids.extend(batch_ids)

    return vector_ids
```
- Asynchronous processing using background workers for document ingestion, which allowed users to continue working while documents were being processed:
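For the simple case, FastAPI's built-in BackgroundTasks is enough. A sketch of how the upload endpoint can hand work off (the route and the `save_upload`/`process_document` helpers are placeholders, not the project's exact functions):

```python
from fastapi import BackgroundTasks, UploadFile

@app.post("/api/documents")
async def upload_document(file: UploadFile, background_tasks: BackgroundTasks):
    document_id = await save_upload(file)  # store the raw file and metadata first
    background_tasks.add_task(process_document, document_id)  # parse, chunk, embed, index later
    return {"id": document_id, "status": "processing"}
```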
- Query caching to store results of common queries:
```python
async def get_cached_response(self, query: str, collection_names: List[str]):
    """Get cached response for a query if available"""
    cache_key = f"{query}_{'-'.join(sorted(collection_names))}"
    cached = await self.redis.get(cache_key)
    if cached:
        return json.loads(cached)
    return None
```
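The write side is symmetric: after generating a response, store it with a TTL. A sketch assuming redis.asyncio and a one-hour expiry:

```python
async def cache_response(self, query: str, collection_names: List[str], response: dict):
    """Cache a generated response so repeated queries skip retrieval and generation"""
    cache_key = f"{query}_{'-'.join(sorted(collection_names))}"
    await self.redis.set(cache_key, json.dumps(response), ex=3600)  # expire after one hour
```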
Result: 70% faster indexing, 85% faster queries for repeated questions.
Deep Dive into the Retrieval Process
Retrieval makes or breaks a RAG system. Get this wrong and your LLM hallucinates. Here’s my implementation:
Step 1: Query Processing
When a user sends a question, the system first processes it to optimize retrieval:
```python
def process_query(self, query: str) -> Dict:
    """Process and optimize the query for retrieval"""
    # Remove stop words and normalize
    processed_query = self._normalize_text(query)

    # Extract key entities using spaCy
    key_entities = self._extract_entities(query)

    # Generate query variations (optional)
    variations = self._generate_query_variations(query)

    return {
        "original": query,
        "processed": processed_query,
        "entities": key_entities,
        "variations": variations
    }
```
Step 2: Multi-Collection Search
Most RAG tutorials search one collection. I needed to search across different document types at once:
```python
def retrieve_for_rag(self, query: str, collection_names: List[str], filter_criteria: Dict = None, top_k: int = 5):
    # Process the query
    processed_query = self.process_query(query)

    # Search each collection
    all_documents = []
    all_scores = []

    for collection_name in collection_names:
        try:
            # Get documents with scores
            docs_with_scores = vector_store.search_with_score(
                query=processed_query["original"],
                collection_name=collection_name,
                filter=filter_criteria,
                k=top_k
            )

            if docs_with_scores:
                # Add documents and their relevance scores
                for doc, score in docs_with_scores:
                    all_documents.append(doc)
                    all_scores.append(score)
        except Exception as e:
            logger.error(f"Error searching collection {collection_name}: {str(e)}")

    # Filter and rank documents
    relevant_docs = self._filter_relevant_documents(query, all_documents, all_scores)

    # Generate context from relevant documents
    context = self._generate_context(relevant_docs)

    return {
        "documents": relevant_docs,
        "context": context
    }
```
Step 3: Context Generation
Last step: turn retrieved chunks into something the LLM can use. I group by source so the model knows where each piece came from:
```python
def _generate_context(self, documents: List[Document]) -> str:
    """Generate context text from documents"""
    # Group documents by source
    source_to_docs = {}
    for doc in documents:
        source_key = self._get_source_identifier(doc)
        if source_key not in source_to_docs:
            source_to_docs[source_key] = []
        source_to_docs[source_key].append(doc)

    # Build context with source blocks
    context = ""
    for source_key, docs in source_to_docs.items():
        # Sort documents by relevance and position
        docs = sorted(docs, key=lambda x: (x.metadata.get("relevance_score", 0), x.metadata.get("chunk", 0)), reverse=True)

        # Add source header
        context += f"[Source: {source_key}]\n"

        # Add document contents with proper formatting
        for doc in docs:
            content = doc.page_content.strip()
            context += f"{content}\n\n"

    return context.strip()
```
Grouping by source helps the LLM cite properly. Answers come back like “According to [Mobile Architecture Doc]…” instead of vague claims.
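The system prompt then leans on those [Source: ...] headers. A sketch of how the generation call can be framed; the prompt wording is mine, and `openai_client`, `context`, and `question` are placeholders rather than the project's exact code:

```python
SYSTEM_PROMPT = (
    "You are a technical assistant. Answer ONLY from the provided context. "
    "Cite the source blocks you used, e.g. 'According to [Source: mobile-architecture.md] ...'. "
    "If the context does not contain the answer, say so instead of guessing."
)

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0,  # keep answers close to the retrieved text
)
answer = response.choices[0].message.content
```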
Real-World Use Cases
Here’s how I’ve actually used this system:
Technical Documentation
I loaded 5,000+ pages of API docs, architecture decisions, and legacy code documentation into the system. Now when a new developer asks “How does our auth system handle token expiration?”—they get an answer with the exact source file, not a vague guess.
Onboarding dropped from weeks to days. Nobody digs through Confluence anymore.
What I Learned the Hard Way
Garbage in, garbage out: If your document parsing sucks, your answers will too. I spent more time on chunking logic than I expected—and it paid off.
Vector search alone fails sometimes: Hybrid retrieval (vectors + keywords) catches queries that pure semantic search misses. Worth the extra complexity.
Nobody uses ugly tools: I almost shipped without a decent UI. Bad move. People won’t adopt something that feels clunky, no matter how good the backend is.
Don’t boil the ocean: Start with 100 documents and one use case. Get that working perfectly before scaling up. I learned this after wasting two weeks on edge cases that didn’t matter.
Prompts matter more than you think: The difference between a mediocre answer and a great one often comes down to how you phrase the system prompt. Experiment relentlessly.
What’s Next
I can finally ask questions about my own docs and get real answers. That alone was worth the weekend hours.
Next on my list:
- Images and audio: PDFs with diagrams are still a pain. Working on multi-modal support.
- Custom embeddings: The generic OpenAI model works, but domain-specific embeddings should do better for technical content.
- Agents: Hooking this up to agents that can take actions based on retrieved info.
- Team features: Multiple users contributing to and querying the same knowledge base.
If you build something similar, I’d love to hear about it. Drop a comment or open an issue on the repo.
Get Involved
- Code: Grab the repo and try it with your own docs.
- Follow along: GitHub profile for updates.
- PRs welcome: Check the issues tab if you want to contribute.
Full source code on GitHub, MIT licensed.
