
Production Readiness: Auth, RBAC, and Deployment

Nitin Kumar Singh
Building Multi-Agent AI Systems - This article is part of a series.
Part 7: This Article

Agents access real data and take real actions. A chatbot that browses a catalog is harmless. An agent that cancels orders, issues refunds, and queries inventory across warehouses is not. Without proper auth, any user could view any order or access admin tools. And none of the security work matters if a new developer cannot clone the repo and run the system.

This article covers both production readiness concerns together: securing the agent platform (JWT authentication, RBAC, user-scoped data isolation), then packaging it for deployment (Docker Compose architecture, one-command startup, environment configuration).


Part A: Authentication, RBAC, and Security
#

JWT Authentication
#

ECommerce Agents uses a self-contained JWT scheme with two token types: access tokens (60-minute expiry) and refresh tokens (7-day expiry). No external identity provider – bcrypt for password hashing and PyJWT for token management.

# Python — Microsoft Agent Framework (Python SDK)
# agents/shared/jwt_utils.py
from datetime import datetime, timedelta, timezone

import bcrypt
import jwt

ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 60
REFRESH_TOKEN_EXPIRE_DAYS = 7

def hash_password(password: str) -> str:
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()


def verify_password(password: str, password_hash: str) -> bool:
    return bcrypt.checkpw(password.encode(), password_hash.encode())


def create_access_token(email: str, role: str, user_id: str) -> str:
    expire = datetime.now(timezone.utc) + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    payload = {
        "sub": email,
        "role": role,
        "user_id": user_id,
        "exp": expire,
        "type": "access",
    }
    return jwt.encode(payload, settings.JWT_SECRET, algorithm=ALGORITHM)


def decode_token(token: str) -> dict:
    """Raises jwt.InvalidTokenError on failure."""
    return jwt.decode(token, settings.JWT_SECRET, algorithms=[ALGORITHM])

A few design decisions worth calling out:

The type field distinguishes access tokens from refresh tokens. Without this, a refresh token could be used as an access token – same secret, same algorithm, same decode function. The middleware explicitly rejects tokens where type != "access".
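The type check is easy to demonstrate without the project's code. The sketch below uses a stdlib HMAC-signed token as a stand-in for PyJWT (which additionally handles headers and `exp` validation) — everything here is illustrative, not the project's actual `jwt_utils` implementation:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # stand-in for settings.JWT_SECRET


def sign(payload: dict) -> str:
    # Encode the payload and append an HMAC signature (illustrative only --
    # the real project uses PyJWT, which also enforces expiry).
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_access(token: str) -> dict:
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload.get("type") != "access":  # the check the middleware performs
        raise ValueError("wrong token type")
    return payload
```

A refresh token signed with the same secret and algorithm passes the signature check but fails the `type` check — which is exactly why the field exists.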

Role is embedded in the token. Role (customer, seller, admin) is baked into the JWT at login. Authorization decisions downstream don’t require a database round-trip. The tradeoff: role changes don’t take effect until the current access token expires (at most 60 minutes).

The login flow:

# Python — Microsoft Agent Framework (Python SDK)
@router.post("/api/auth/login", response_model=AuthResponse)
async def login(body: LoginRequest) -> AuthResponse:
    row = await pool.fetchrow(
        "SELECT id, email, password_hash, role, is_active FROM users WHERE email = $1",
        body.email,
    )

    if not row:
        raise HTTPException(status_code=401, detail="Invalid email or password")
    if not row["is_active"]:
        raise HTTPException(status_code=403, detail="Account is deactivated")
    if not verify_password(body.password, row["password_hash"]):
        raise HTTPException(status_code=401, detail="Invalid email or password")

    access_token = create_access_token(row["email"], row["role"], str(row["id"]))
    refresh_token = create_refresh_token(row["email"])
    # ... return AuthResponse

Same “Invalid email or password” message for both missing user and wrong password – prevents email enumeration.

ECommerce Agents login page
The ECommerce Agents login page – email and password authentication with JWT tokens issued on successful login.

Role-Based Access Control
#

ECommerce Agents defines four roles:

| Role | What They Access |
| --- | --- |
| customer | Their own orders, products, reviews, promotions, and loyalty status |
| power_user | Same as customer, plus early access to promotions and higher API rate limits |
| seller | Their own products, inventory levels, order fulfillment, and reviews on their products |
| admin | Everything – all users, all data, platform-wide metrics, agent management |

The role is enforced at two levels. At the API level, FastAPI dependencies gate entire endpoints. At the agent level, role-specific prompt instructions change what the agent does with the same underlying tools.

Admin dashboard with usage statistics and platform metrics
The admin dashboard – only accessible to users with the admin role.

User-Scoped Data with ContextVars
#

This is the most important security pattern in the entire system. Every tool that touches user data reads the current user’s identity from a ContextVar – not from a function parameter, and critically, not from the LLM’s output.

# Python — Microsoft Agent Framework (Python SDK)
# agents/shared/context.py

from contextvars import ContextVar

current_user_email: ContextVar[str] = ContextVar("current_user_email", default="")
current_user_role: ContextVar[str] = ContextVar("current_user_role", default="")
current_session_id: ContextVar[str] = ContextVar("current_session_id", default="")

The auth middleware sets these values when a request arrives. Then every tool reads them implicitly:

# Python — Microsoft Agent Framework (Python SDK)
@tool(name="get_user_orders", description="List orders for the current user.")
async def get_user_orders(
    status: Annotated[str | None, Field(description="Filter by order status")] = None,
    limit: Annotated[int, Field(description="Max number of orders to return")] = 10,
) -> list[dict]:
    email = current_user_email.get()
    if not email:
        return [{"error": "No user context available"}]

    conditions = ["u.email = $1"]
    args: list = [email]
    # ... build query with user email as the mandatory first filter

The critical line is conditions = ["u.email = $1"]. The user’s email is always the first filter in the WHERE clause. The LLM never decides which user’s data to fetch – that is hardcoded from the authenticated context.

Why ContextVars instead of function parameters? Two reasons. First, MAF’s @tool decorator exposes function parameters to the LLM as tool arguments. If user_email were a parameter, the model could hallucinate a different user’s email and pass it in. ContextVars are invisible to the LLM – they exist outside the tool’s schema entirely. Second, ContextVars are natively async-safe in Python. Each asyncio task gets its own copy of the context, so concurrent requests never leak user identity between each other.
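The async-safety claim is easy to verify with a self-contained sketch — no framework code involved, just `asyncio` and `contextvars` from the standard library:

```python
import asyncio
from contextvars import ContextVar

current_user_email: ContextVar[str] = ContextVar("current_user_email", default="")


async def handle_request(email: str) -> str:
    # Each asyncio task runs in its own copy of the context, so setting
    # the var here cannot leak into a concurrently running task.
    current_user_email.set(email)
    await asyncio.sleep(0.01)  # yield so the two tasks interleave
    return current_user_email.get()


async def main() -> list[str]:
    return await asyncio.gather(
        handle_request("alice@example.com"),
        handle_request("bob@example.com"),
    )

results = asyncio.run(main())
```

Both handlers run concurrently and interleave at the `await`, yet each reads back the email it set — the isolation a thread-local or module-level global would not give you under asyncio.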

Two customer sessions showing different orders for different users
User-scoped data isolation – two customers asking the same question see completely different orders.

Inter-Agent Authentication
#

When the orchestrator calls a specialist via A2A, it uses a two-mode middleware:

# Python — Microsoft Agent Framework (Python SDK)
# agents/shared/auth.py

class AgentAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        if request.url.path in {"/health", "/.well-known/agent-card.json"}:
            return await call_next(request)

        # Mode 1: Inter-agent trust (shared secret from orchestrator)
        agent_secret = request.headers.get("x-agent-secret")
        if agent_secret:
            if agent_secret != settings.AGENT_SHARED_SECRET:
                return JSONResponse({"error": "Invalid agent secret"}, status_code=401)

            current_user_email.set(request.headers.get("x-user-email", "system"))
            current_user_role.set(request.headers.get("x-user-role", "system"))
            return await call_next(request)

        # Mode 2: User JWT (for direct API calls)
        auth_header = request.headers.get("authorization", "")
        token = auth_header.removeprefix("Bearer ")
        try:
            payload = decode_token(token)
        except jwt.InvalidTokenError:
            # an unhandled exception here would surface as a 500, not a 401
            return JSONResponse({"error": "Invalid or expired token"}, status_code=401)
        current_user_email.set(payload.get("sub", ""))
        current_user_role.set(payload.get("role", "customer"))
        return await call_next(request)

Specialist agents don’t need to know whether they were called by the orchestrator or directly by a user with a JWT. Either way, ContextVars get set with the authenticated identity, and every downstream tool calls current_user_email.get().
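On the caller side, the orchestrator only needs to attach three headers to its A2A request. The helper below is hypothetical — the header names mirror what `AgentAuthMiddleware` reads, but how the orchestrator actually constructs its outbound requests may differ:

```python
def build_a2a_headers(shared_secret: str, user_email: str, user_role: str) -> dict:
    # Hypothetical caller-side helper: forwards the shared secret plus the
    # already-authenticated user identity to a specialist agent.
    return {
        "x-agent-secret": shared_secret,
        "x-user-email": user_email,
        "x-user-role": user_role,
    }
```

These headers would be attached to whatever HTTP client the orchestrator uses, so the specialist's middleware takes Mode 1 and sets its ContextVars from the forwarded identity.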

Role-Aware Prompts
#

The orchestrator loads different system prompts based on the authenticated user’s role:

# agents/config/prompts/orchestrator.yaml
system_prompt:
  base: |
    You are the Customer Support orchestrator for this e-commerce platform.
    # ... base instructions for all roles

  role_instructions:
    customer: |
      This user is a customer. Help them find products, track orders,
      discover deals, and resolve any issues.
    seller: |
      This user is a seller on the platform. They may ask about their
      own products, inventory levels, and customer reviews on their products.
    admin: |
      This user is an admin with full access to all data and agents.
      They can query any user's data and view platform-wide metrics.

Same agent, same tools, different behavior – driven entirely by the authenticated role flowing into the prompt.
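A sketch of how the base prompt and role instructions might be combined at load time. The `PROMPTS` dict stands in for the parsed YAML, and `build_system_prompt` is a hypothetical helper, not the project's loader:

```python
PROMPTS = {
    "base": "You are the Customer Support orchestrator for this e-commerce platform.",
    "role_instructions": {
        "customer": "This user is a customer. Help them find products and track orders.",
        "seller": "This user is a seller. They may ask about their own products and inventory.",
        "admin": "This user is an admin with full access to all data and agents.",
    },
}


def build_system_prompt(role: str) -> str:
    # Fall back to customer instructions for unknown roles -- the
    # least-privileged default is the safe one.
    base = PROMPTS["base"]
    role_part = PROMPTS["role_instructions"].get(
        role, PROMPTS["role_instructions"]["customer"]
    )
    return f"{base}\n\n{role_part}"
```

Defaulting unknown roles to the customer instructions keeps a misconfigured role from accidentally receiving admin-level guidance.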

Seller dashboard showing the seller’s own products and inventory
The seller dashboard – a seller sees their own products and inventory. Same underlying tools, different data based on authenticated role.

API Endpoint RBAC
#

FastAPI dependencies gate entire endpoints by role:

# Python — Microsoft Agent Framework (Python SDK)
async def require_auth(request: Request) -> dict:
    token = request.headers.get("authorization", "").removeprefix("Bearer ")
    try:
        payload = decode_token(token)
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    if payload.get("type") != "access":
        raise HTTPException(status_code=401, detail="Invalid token type")
    current_user_email.set(payload.get("sub", ""))
    current_user_role.set(payload.get("role", "customer"))
    return payload


async def require_admin(user: dict = Depends(require_auth)) -> dict:
    if user.get("role") != "admin":
        raise HTTPException(status_code=403, detail="Admin access required")
    return user


async def require_seller(user: dict = Depends(require_auth)) -> dict:
    if user.get("role") not in ("seller", "admin"):
        raise HTTPException(status_code=403, detail="Seller access required")
    return user

# Usage:
@router.get("/api/admin/usage")
async def get_usage_stats(admin: dict = Depends(require_admin)):
    ...

@router.get("/api/seller/products")
async def list_seller_products(user: dict = Depends(require_seller)):
    ...

Authentication always runs first (401 = “who are you?”). Role check runs after (403 = “are you allowed?”). Clear signals to the frontend for handling each case.

Security Checklist
#

  • Parameterized SQL everywhere. Every query uses asyncpg’s $1, $2 syntax. No string interpolation of user input. SQL injection eliminated at the driver level.
  • User-scoped queries as default. Every tool that touches user data includes u.email = $1. There is no “get all orders” tool exposed to the LLM.
  • Token type enforcement. The type field prevents cross-use of access and refresh tokens.
  • Constant-time password comparison. bcrypt’s checkpw is inherently timing-safe.
  • Account deactivation. Login checks is_active before issuing tokens.
  • Error message uniformity. “Invalid email or password” for all failed logins – no email enumeration.
  • Shared secret rotation. AGENT_SHARED_SECRET is externalized to environment variables. Rotation requires restarting agents, no code changes.

Part B: Docker Compose Deployment
#

Architecture
#

ECommerce Agents runs 11 services organized into four groups using Docker Compose profiles:

graph TD
  subgraph Infrastructure
    DB[(PostgreSQL + pgvector)]
    Redis[(Redis 7)]
    Aspire[Aspire Dashboard]
  end
  subgraph "Seed (runs once)"
    Seeder[seeder]
  end
  subgraph "Agents (profile: agents)"
    Orch[orchestrator :8080]
    PD[product-discovery :8081]
    OM[order-management :8082]
    PP[pricing-promotions :8083]
    RS[review-sentiment :8084]
    IF[inventory-fulfillment :8085]
  end
  subgraph "Frontend (profile: frontend)"
    FE[Next.js :3000]
  end
  Seeder -->|service_healthy| DB
  Orch -->|service_healthy| DB
  Orch -->|service_healthy| Redis
  PD -->|service_healthy| DB
  OM -->|service_healthy| DB
  FE --> Orch
  style DB fill:#1e6091,color:#fff
  style Redis fill:#1e6091,color:#fff
  style Aspire fill:#5b8c5a,color:#fff
  style Orch fill:#d4740e,color:#fff
  style FE fill:#6b7280,color:#fff

Profiles let you start subsets of the stack independently – docker compose up -d db redis aspire for just infrastructure, or docker compose --profile agents up -d for agents without the frontend.

YAML Anchors for Shared Configuration
#

Every agent needs the same 20+ environment variables. Rather than duplicating them, the orchestrator defines the canonical set and specialists merge it:

orchestrator:
  environment: &agent-env
    DATABASE_URL: postgresql://ecommerce:ecommerce_secret@db:5432/ecommerce_agents
    REDIS_URL: redis://redis:6379
    LLM_PROVIDER: ${LLM_PROVIDER:-openai}
    OPENAI_API_KEY: ${OPENAI_API_KEY:-}
    LLM_MODEL: ${LLM_MODEL:-gpt-4.1}
    JWT_SECRET: ${JWT_SECRET:-change-me-in-production}
    AGENT_SHARED_SECRET: ${AGENT_SHARED_SECRET:-agent-internal-shared-secret}
    OTEL_ENABLED: "true"
    OTEL_EXPORTER_OTLP_ENDPOINT: http://aspire:18889
    OTEL_SERVICE_NAME: ecommerce.orchestrator
    AGENT_REGISTRY: >-
      {
        "product-discovery": "http://product-discovery:8081",
        "order-management": "http://order-management:8082",
        ...
      }

product-discovery:
  environment:
    <<: *agent-env
    OTEL_SERVICE_NAME: ecommerce.product-discovery

The &agent-env anchor defines the base. <<: *agent-env merges it into each specialist. Only OTEL_SERVICE_NAME is overridden per agent so traces are properly attributed in the Aspire Dashboard.

docker compose ps output showing all services running and healthy
All 11 services running and healthy.

Multi-Target Dockerfile
#

One Dockerfile builds all 6 agents using AGENT_NAME and AGENT_PORT build arguments:

FROM python:3.12-slim AS base

RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc libpq-dev curl && \
    rm -rf /var/lib/apt/lists/*

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Create non-root user
RUN groupadd -r agent && useradd -r -g agent -d /app -s /sbin/nologin agent

WORKDIR /app

# Install Python dependencies (cached -- same for all agents)
COPY pyproject.toml .
RUN uv sync --no-dev --no-install-project

# Copy shared library
COPY shared/ shared/
COPY config/ config/

# Agent-specific copy
ARG AGENT_NAME=orchestrator
ARG AGENT_PORT=8080

COPY ${AGENT_NAME}/ ${AGENT_NAME}/

RUN chown -R agent:agent /app

ENV AGENT_NAME=${AGENT_NAME}
ENV AGENT_PORT=${AGENT_PORT}
ENV PYTHONPATH=/app
ENV UV_CACHE_DIR=/app/.cache/uv

USER agent
EXPOSE ${AGENT_PORT}

HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost:${AGENT_PORT}/health || exit 1

CMD uv run --no-project uvicorn ${AGENT_NAME}.main:app --host 0.0.0.0 --port ${AGENT_PORT}

Every agent shares the same Python dependencies and shared/ library. The uv sync layer is cached once and reused for all 6 builds. Only the final COPY ${AGENT_NAME}/ layer differs.

In docker-compose.yml:

product-discovery:
  build:
    context: ./agents
    args: { AGENT_NAME: product_discovery, AGENT_PORT: 8081 }

The --start-period=30s on the health check is important. Agents need time to initialize: uv resolves the virtual environment, Python imports the module, the database pool is created, and OpenTelemetry starts. Without a start period, Docker would mark containers unhealthy before they boot.

The dev.sh Script
#

One command gets everything running:

./scripts/dev.sh              # Full rebuild and start everything
./scripts/dev.sh --clean      # Nuke volumes, rebuild from scratch
./scripts/dev.sh --seed-only  # Re-run seeder against existing DB
./scripts/dev.sh --infra-only # Start db + redis + aspire only

The script follows a strict sequence with health checks at each gate:

  1. Prerequisite check (Docker and Docker Compose v2)
  2. Environment check (create .env from .env.example if missing)
  3. Build all agent images
  4. Start infrastructure (db, redis, aspire)
  5. Wait for health (pg_isready, redis-cli ping)
  6. Run seeder (one-shot container that exits after seeding)
  7. Start agents – poll each agent’s /health endpoint
  8. Start frontend – poll http://localhost:3000
  9. Print service URL summary
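The health-check gates in steps 5, 7, and 8 all follow the same poll-until-healthy pattern. Sketched in Python rather than bash (the callable and timeouts are illustrative, not lifted from dev.sh):

```python
import time


def wait_for_healthy(check, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll `check` (any zero-arg callable returning bool) until it succeeds
    or the timeout elapses -- the gate dev.sh applies at each startup step."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except Exception:
            pass  # service not up yet; treat connection errors as "not healthy"
        time.sleep(interval)
    return False
```

In practice the callable would hit an agent's `/health` endpoint (or run `pg_isready` / `redis-cli ping`); swallowing exceptions matters because a service that has not bound its port yet fails with a connection error, not a clean `False`.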

Terminal showing dev.sh startup completing with all service URLs
The dev.sh script completing successfully – all services healthy, all URLs printed.

Environment Configuration
#

Copy .env.example to .env and configure your LLM provider:

OpenAI (default for local development):

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
LLM_MODEL=gpt-4.1

Azure OpenAI:

LLM_PROVIDER=azure
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-key
AZURE_OPENAI_DEPLOYMENT=gpt-4.1
AZURE_OPENAI_API_VERSION=2025-03-01-preview

Both providers use the same ChatClient interface. Switching requires changing one variable – no code changes.

Generate proper secrets:

python -c "import secrets; print(secrets.token_hex(32))"

Port Map
#

| Service | Host Port | Purpose |
| --- | --- | --- |
| PostgreSQL | 5432 | Database (pgvector) |
| Redis | 6379 | Cache |
| Aspire Dashboard UI | 18888 | Telemetry visualization |
| Orchestrator | 8080 | Customer Support agent (API gateway) |
| Product Discovery | 8081 | Product search and recommendations |
| Order Management | 8082 | Order CRUD and tracking |
| Pricing & Promotions | 8083 | Pricing rules and coupon engine |
| Review & Sentiment | 8084 | Review analysis and sentiment |
| Inventory & Fulfillment | 8085 | Stock management and shipping |
| Frontend (Next.js) | 3000 | Web UI |

ECommerce Agents frontend running at localhost:3000
The ECommerce Agents frontend at localhost:3000.

Database Seeding
#

The seeder runs as a one-shot container and populates:

  • 20 users with hashed passwords (bcrypt) and assigned roles
  • 50 products across categories (electronics, clothing, home, sports)
  • 200 orders with line items, shipping addresses, and status history
  • 500 reviews with star ratings and text content

To re-seed:

./scripts/dev.sh --seed-only

The seed script is idempotent – it truncates existing data before inserting.
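The truncate-before-insert step might look like the sketch below. The table list and its ordering are assumptions (child tables first so foreign keys stay satisfied), not the project's actual schema:

```python
# Hypothetical table list -- child tables before their parents.
SEED_TABLES = ["reviews", "order_items", "orders", "products", "users"]


def truncate_statements(tables: list[str]) -> list[str]:
    # RESTART IDENTITY resets serial sequences so re-seeded IDs start at 1;
    # CASCADE clears any dependent rows the ordering missed.
    return [f"TRUNCATE TABLE {t} RESTART IDENTITY CASCADE;" for t in tables]
```

Running these statements before the inserts is what makes `--seed-only` safe to run any number of times.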

Products page showing the seeded catalog with products across categories
The product catalog populated by the seeder – 50 products across four categories.

Troubleshooting
#

Port already in use. Find the process with lsof -i :5432 and stop it, or change the host port in docker-compose.yml.

Missing API key. Open .env and replace the placeholder with your actual key. If using Azure OpenAI, set LLM_PROVIDER=azure and fill in Azure-specific variables.

Database connection refused. Run docker compose ps to check container status. If db shows as unhealthy, check docker compose logs db. Common cause: the init.sql script has a syntax error.

Health check timeout. Check docker compose logs orchestrator for the root cause. Common: LLM credentials wrong (agent validates at startup), missing environment variable (Pydantic Settings logs exactly which field), or memory pressure (docker stats to check).

uv cache permission denied. Rebuild with ./scripts/dev.sh --clean.


What’s Next
#

The platform is running, secured, and observable. Advanced capabilities are next:

In Part 8: Agent Memory, we give agents the ability to remember user preferences and past interactions across sessions – so every conversation feels personal rather than starting from scratch.

The complete source code is available at github.com/nitin27may/e-commerce-agents.
