
MAF v1 — Deployment with Docker and Compose

Nitin Kumar Singh
I build enterprise AI solutions and cloud-native systems. I write about architecture patterns, AI agents, Azure, and modern development practices — with full source code.
MAF v1: Python and .NET - This article is part of a series.
Part 25: This Article

Series note — Final appendix to MAF v1: Python and .NET. Sits after Ch24 — Prompt engineering. Supersedes the deployment half of the original Python-only Part 7 (Production Readiness: Auth, RBAC, and Deployment); the auth half of that article is now Ch20c — Production hardening. This chapter covers the deployment substrate only: one Dockerfile that builds every agent, one Compose file that runs the cluster, one bootstrap script that brings it up. Now with the .NET twin alongside Python.

Capstone code — This chapter doesn’t ship a standalone tutorial folder. The runnable deployment substrate lives at the repo root: agents/python/Dockerfile (multi-target Python build), docker-compose.yml (Python stack), docker-compose.dotnet.yml (.NET overlay), and scripts/dev.sh (one-shot bootstrap).

Why this chapter

The capstone has six agents (Python and .NET twins each), one MCP server, one frontend, Postgres + pgvector, Redis, Aspire Dashboard. That’s eleven services even before you count what’s needed for a developer to run a single specialist locally for an afternoon. Without packaging discipline, “clone and run” turns into a half-day of pip install errors, missing environment variables, and Postgres connection strings that drift between READMEs.

The deployment substrate is what makes the capstone runnable in one command. This chapter walks the three pieces — a multi-target Dockerfile that shares the dependency layer across all agents, a Compose file that wires the cluster with health gates and YAML anchors, and a dev.sh script that brings everything up in dependency order with retries.

It is not a guide to Kubernetes, AKS, or production cloud deployment. That’s a different conversation (the capstone roadmap flags it as a Phase 11+ companion series). What you have here is the desktop-developer baseline that everything else builds on.

Prerequisites

  • Docker 24+ and Docker Compose v2 (the docker compose command, not the legacy docker-compose binary).
  • Bash. The dev.sh script uses set -euo pipefail, trap, and command -v — works on macOS, Linux, and WSL2 out of the box.
  • For the Python stack: uv (the script builds the agent images; you don’t install Python locally).
  • For the .NET stack: nothing extra — the .NET SDK lives inside the build image.
  • 4 GB of free RAM for the cluster (Postgres + Redis + Aspire + 6 Python agents + frontend; the .NET twin replaces the agents but uses similar memory).

What you’ll learn

  • The multi-target Dockerfile pattern: one file, ARG AGENT_NAME per build, shared dependency cache.
  • The .NET equivalent — a multi-stage dotnet publish onto a slim ASP.NET runtime image — and where the language idioms force the file to look different.
  • The Compose YAML-anchor trick that keeps 6 agents from each duplicating 20 environment variables.
  • The health-gate cascade in dev.sh that turns “wait for Postgres” from a sleep 30 into something that survives a slow disk.
  • The --profile knobs that let you boot infra-only, agents-only, or a single agent for an afternoon of focused debugging.

The Compose layout

A snippet, not the whole file:

# docker-compose.yml
services:

  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB:       ecommerce_agents
      POSTGRES_USER:     ecommerce
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}        # required, no default
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./db/init:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ecommerce -d ecommerce_agents"]
      interval: 5s
      timeout: 3s
      retries: 20
    ports: ["5432:5432"]

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    ports: ["6379:6379"]

  aspire:
    image: mcr.microsoft.com/dotnet/aspire-dashboard:latest
    environment:
      DOTNET_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS: "true"
    ports:
      - "18888:18888"   # web UI
      - "18889:18889"   # OTLP receiver

  orchestrator:
    build:
      context: ./agents/python
      args: { AGENT_NAME: orchestrator, AGENT_PORT: "8080" }
    environment: &agent-env
      DATABASE_URL: postgresql://ecommerce:${POSTGRES_PASSWORD}@db:5432/ecommerce_agents
      REDIS_URL:    redis://redis:6379
      LLM_PROVIDER: ${LLM_PROVIDER:-openai}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
      AZURE_OPENAI_ENDPOINT:  ${AZURE_OPENAI_ENDPOINT:-}
      AZURE_OPENAI_KEY:       ${AZURE_OPENAI_KEY:-}
      AZURE_OPENAI_DEPLOYMENT: ${AZURE_OPENAI_DEPLOYMENT:-}
      JWT_SECRET: ${JWT_SECRET}
      AGENT_SHARED_SECRET: ${AGENT_SHARED_SECRET}
      OTEL_ENABLED: "true"
      OTEL_EXPORTER_OTLP_ENDPOINT: http://aspire:18889
      OTEL_SERVICE_NAME: ecommerce.orchestrator
      AGENT_REGISTRY: >-
        {"product-discovery":"http://product-discovery:8081",
         "order-management":"http://order-management:8082"}
    depends_on:
      db:    { condition: service_healthy }
      redis: { condition: service_healthy }
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    ports: ["8080:8080"]

  product-discovery:
    build:
      context: ./agents/python
      args: { AGENT_NAME: product_discovery, AGENT_PORT: "8081" }
    environment:
      <<: *agent-env
      OTEL_SERVICE_NAME: ecommerce.product-discovery
    depends_on:
      db: { condition: service_healthy }
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8081/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    ports: ["8081:8081"]

  # ...repeat for order-management, pricing-promotions, review-sentiment,
  # inventory-fulfillment — each merges <<: *agent-env and overrides
  # OTEL_SERVICE_NAME

  frontend:
    build: ./web
    environment:
      NEXT_PUBLIC_API_URL: http://localhost:8080
    depends_on:
      orchestrator: { condition: service_healthy }
    ports: ["3000:3000"]

volumes:
  pgdata:

Three things worth pulling out:

1. The &agent-env anchor. The orchestrator block defines the canonical environment set. Every specialist merges it with <<: *agent-env and overrides exactly one key (OTEL_SERVICE_NAME). Adding a new env var means one edit, not six.

2. condition: service_healthy everywhere. Compose’s default depends_on only waits for start, not for ready. Postgres takes about 4 seconds to actually accept connections; without service_healthy the orchestrator boots in 2, can’t connect, and crashes. With service_healthy, Compose holds the dependent until the upstream’s healthcheck passes.

3. start_period: 30s on the agent healthcheck. The agent boot path is uv resolve → import → init DB pool → init OTel → uvicorn ready. Three of those steps do non-trivial I/O. A 30-second start period prevents Compose from killing the container before it has a chance to become healthy. On a fresh CI runner with cold caches, bump this to 60s.

The ${POSTGRES_PASSWORD} and ${JWT_SECRET} references have no default, and dev.sh refuses to start when they’re missing (write them as ${POSTGRES_PASSWORD:?} if you want Compose itself to fail fast rather than warn and substitute an empty string). That’s intentional. Defaults like change-me-in-production always end up running in production.
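
Before the first boot, it’s worth rendering the merged file. docker compose config expands the anchors and substitutes the ${VAR} references, so you can confirm every specialist actually inherited the full agent-env block:

# See the fully merged file (anchors expanded, ${VAR} substituted)
docker compose config

# Or limit the output to a single service
docker compose config orchestrator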

Multi-target Dockerfile (Python)

One file. ARG AGENT_NAME selects which subdirectory gets baked in. The uv sync layer is shared across every agent build.

# agents/python/Dockerfile
FROM python:3.12-slim AS base

RUN apt-get update \
 && apt-get install -y --no-install-recommends gcc libpq-dev curl \
 && rm -rf /var/lib/apt/lists/*

# Bring in `uv` from its distroless image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

RUN groupadd -r agent \
 && useradd  -r -g agent -d /app -s /sbin/nologin agent

WORKDIR /app

# Layer 1 — dependencies (cached across every agent build)
COPY pyproject.toml uv.lock ./
RUN uv sync --no-dev --no-install-project

# Layer 2 — shared library (also cached across agents)
COPY shared/ shared/
COPY config/ config/

# Layer 3 — agent-specific code (only this layer differs per agent)
ARG AGENT_NAME=orchestrator
ARG AGENT_PORT=8080
COPY ${AGENT_NAME}/ ${AGENT_NAME}/

RUN chown -R agent:agent /app

ENV AGENT_NAME=${AGENT_NAME} \
    AGENT_PORT=${AGENT_PORT} \
    PYTHONPATH=/app \
    UV_CACHE_DIR=/app/.cache/uv

USER agent
EXPOSE ${AGENT_PORT}

HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:${AGENT_PORT}/health || exit 1

CMD uv run --no-project uvicorn ${AGENT_NAME}.main:app \
      --host 0.0.0.0 --port ${AGENT_PORT}

Layer 1 (uv sync) takes ~30 seconds on a fresh build but is identical for every agent — Docker reuses the cached layer. Layer 2 (shared/ + config/) is also identical. Only Layer 3 (COPY ${AGENT_NAME}/) differs per agent, and it’s tiny. Building all six agents takes roughly the same time as building one.
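
If you want to watch that cache behaviour outside Compose, the per-agent build is just the two build args. The image tags below are illustrative; Compose normally derives the names itself:

# First build pays for the dependency layer (~30 s)
docker build \
  --build-arg AGENT_NAME=orchestrator --build-arg AGENT_PORT=8080 \
  -t ecommerce/orchestrator:py \
  ./agents/python

# Second agent reuses Layers 1–2 from cache; only the small COPY layer rebuilds
docker build \
  --build-arg AGENT_NAME=product_discovery --build-arg AGENT_PORT=8081 \
  -t ecommerce/product-discovery:py \
  ./agents/python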

The non-root agent user is non-negotiable. A compromise of one agent should not give the attacker root in the container; from there to a host escape is a smaller hop than people think. The --start-period=30s mirrors the value in Compose, and it has to live in both places: the Compose healthcheck block replaces the image’s HEALTHCHECK entirely, so the Dockerfile value only applies to plain docker run. Keep the two in sync.
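
For completeness, this is roughly what running one of those images outside Compose looks like, i.e. the case where the Dockerfile HEALTHCHECK is the one that counts. The tag matches the sketch above, and the environment values are placeholders, not the full set an agent needs:

# Sketch only: plain `docker run`, so the Dockerfile healthcheck
# (including --start-period) is what applies here
docker run -d --name orchestrator -p 8080:8080 \
  -e DATABASE_URL="postgresql://ecommerce:${POSTGRES_PASSWORD}@host.docker.internal:5432/ecommerce_agents" \
  -e REDIS_URL="redis://host.docker.internal:6379" \
  -e JWT_SECRET="${JWT_SECRET}" \
  -e AGENT_SHARED_SECRET="${AGENT_SHARED_SECRET}" \
  ecommerce/orchestrator:py
# host.docker.internal assumes Docker Desktop; on Linux add
#   --add-host=host.docker.internal:host-gateway

# Health flips from "starting" to "healthy" once /health answers
docker inspect -f '{{.State.Health.Status}}' orchestrator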

Multi-stage Dockerfile (.NET)

The .NET version uses a build stage and a runtime stage. The runtime image carries only the published output, not the SDK.

# agents/dotnet/Dockerfile
ARG AGENT_NAME=Orchestrator
ARG AGENT_PORT=8080

FROM mcr.microsoft.com/dotnet/sdk:9.0 AS build

WORKDIR /src

# Layer 1 — restore (cached across every agent)
COPY ECommerceAgents.sln ./
COPY src/Shared/Shared.csproj          src/Shared/
COPY src/Orchestrator/Orchestrator.csproj         src/Orchestrator/
COPY src/ProductDiscovery/ProductDiscovery.csproj src/ProductDiscovery/
# ... one COPY per agent project

RUN dotnet restore

# Layer 2 — copy source and publish the selected agent
COPY src/ src/
ARG AGENT_NAME
RUN dotnet publish "src/${AGENT_NAME}/${AGENT_NAME}.csproj" \
      -c Release \
      -o /app/publish \
      /p:PublishSingleFile=false

# Runtime stage
FROM mcr.microsoft.com/dotnet/aspnet:9.0 AS runtime

RUN apt-get update && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/* \
 && groupadd -r agent && useradd -r -g agent -d /app agent

WORKDIR /app
COPY --from=build /app/publish .

ARG AGENT_NAME
ARG AGENT_PORT
ENV AGENT_NAME=${AGENT_NAME} \
    AGENT_PORT=${AGENT_PORT} \
    ASPNETCORE_URLS=http://+:${AGENT_PORT}

USER agent
EXPOSE ${AGENT_PORT}

HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:${AGENT_PORT}/health || exit 1

ENTRYPOINT ["sh", "-c", "dotnet ${AGENT_NAME}.dll"]

Two structural differences from the Python version, both forced by language idioms:

  • Two stages, not one. The .NET SDK image is ~700 MB; the ASP.NET runtime image is ~200 MB. Copying only /app/publish from the build stage cuts the final image to ~250 MB — a comfortable size for fast pulls. Python’s uv sync --no-dev already gives us a small image; no second stage needed.
  • Project files COPYed individually before source. Restoring against .csproj and .sln only — without the .cs source — means a source change doesn’t bust the restore cache. Python’s pyproject.toml and uv.lock play the same role; the layer pattern is identical even though the file shape differs.

The same Compose file points at either Dockerfile via the build.context field — ./agents/python for the Python stack, ./agents/dotnet for the .NET stack. The capstone ships a docker-compose.dotnet.yml overlay that swaps every agent’s build context.
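
The overlay itself is small. A sketch of its shape (assumed from the Compose file above; the file in the repo is authoritative):

# docker-compose.dotnet.yml (sketch of the overlay shape)
services:
  orchestrator:
    build:
      context: ./agents/dotnet
      args: { AGENT_NAME: Orchestrator, AGENT_PORT: "8080" }

  product-discovery:
    build:
      context: ./agents/dotnet
      args: { AGENT_NAME: ProductDiscovery, AGENT_PORT: "8081" }

  # ...one stanza per agent; environment, healthchecks, and depends_on are
  # inherited from docker-compose.yml through the -f merge

Running it is the same pair of -f flags dev.sh --dotnet assembles for you: docker compose -f docker-compose.yml -f docker-compose.dotnet.yml up -d.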

Topology

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor': '#2563eb','primaryTextColor': '#ffffff','primaryBorderColor': '#1e40af', 'lineColor': '#64748b','secondaryColor': '#f59e0b','tertiaryColor': '#10b981', 'background': 'transparent'}}}%%
flowchart LR
    classDef infra fill:#64748b,stroke:#334155,color:#ffffff
    classDef core fill:#2563eb,stroke:#1e40af,color:#ffffff
    classDef extern fill:#f59e0b,stroke:#b45309,color:#000000
    classDef obs fill:#10b981,stroke:#047857,color:#ffffff

    subgraph data["Shared infra"]
        pg["Postgres 16<br/>+ pgvector<br/>:5432"]
        rd["Redis 7<br/>:6379"]
    end
    subgraph telemetry["Observability"]
        asp["Aspire Dashboard<br/>:18888 / OTLP :18889"]
    end
    subgraph agents["Agents"]
        orch["orchestrator<br/>:8080"]
        pd["product-discovery<br/>:8081"]
        om["order-management<br/>:8082"]
        pp["pricing-promotions<br/>:8083"]
        rs["review-sentiment<br/>:8084"]
        inv["inventory-fulfillment<br/>:8085"]
    end
    fe["Next.js frontend<br/>:3000"]
    llm["OpenAI / Azure OpenAI"]

    fe --> orch
    orch --> pd
    orch --> om
    orch --> pp
    orch --> rs
    orch --> inv
    orch --> llm
    pd --> llm
    om --> llm
    orch --> pg
    pd --> pg
    om --> pg
    orch --> rd
    orch -. OTel .-> asp
    pd -. OTel .-> asp

    class pg,rd infra
    class orch,pd,om,pp,rs,inv core
    class llm extern
    class asp obs
    class fe core

Eleven services, three groups. Frontend talks only to the orchestrator; the orchestrator talks to specialists, Postgres, Redis, and the LLM provider; specialists talk to Postgres and the LLM. OTel spans flow from every service to Aspire.

The dev.sh bootstrap script

One command:

./scripts/dev.sh                 # Full rebuild, start everything
./scripts/dev.sh --clean         # Nuke volumes, rebuild from scratch
./scripts/dev.sh --infra-only    # Just db + redis + aspire
./scripts/dev.sh --agent orchestrator   # Just one agent (for focused debugging)
./scripts/dev.sh --dotnet        # Use the .NET twin (docker-compose.dotnet.yml)

The script (lightly edited for length):

#!/usr/bin/env bash
set -euo pipefail

CLEAN=0; INFRA_ONLY=0; STACK="python"; SINGLE_AGENT=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --clean)      CLEAN=1; shift ;;
    --infra-only) INFRA_ONLY=1; shift ;;
    --dotnet)     STACK="dotnet"; shift ;;
    --agent)      SINGLE_AGENT="$2"; shift 2 ;;
    *) echo "Unknown arg: $1" >&2; exit 2 ;;
  esac
done

# 1. Prereqs
command -v docker >/dev/null || { echo "Docker required"; exit 1; }
docker compose version >/dev/null 2>&1 || { echo "Compose v2 required"; exit 1; }

# 2. Env
[[ -f .env ]] || { cp .env.example .env; echo "Created .env from example. Fill it in and re-run."; exit 1; }
# shellcheck disable=SC1091
set -a; source .env; set +a
: "${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}"
: "${JWT_SECRET:?JWT_SECRET must be set in .env}"

COMPOSE="docker compose"
[[ "$STACK" == "dotnet" ]] && COMPOSE="docker compose -f docker-compose.yml -f docker-compose.dotnet.yml"

# 3. Optional clean
if (( CLEAN )); then
  $COMPOSE down -v --remove-orphans
fi

# 4. Build (cached layers make this fast on re-runs)
$COMPOSE build

# 5. Infra first
$COMPOSE up -d db redis aspire

# 6. Wait for infra health (poll, don't sleep)
wait_healthy() {
  local svc="$1" tries=0
  until [[ "$($COMPOSE ps -q "$svc" | xargs docker inspect -f '{{.State.Health.Status}}')" == "healthy" ]]; do
    ((tries++ < 60)) || { echo "Timeout waiting for $svc"; $COMPOSE logs --tail=50 "$svc"; exit 1; }
    sleep 1
  done
  echo "  $svc — healthy"
}
echo "Waiting for infrastructure..."
wait_healthy db
wait_healthy redis

# 7. Seed (one-shot)
$COMPOSE run --rm seeder

(( INFRA_ONLY )) && { echo "Infra ready. Skipping agents."; exit 0; }

# 8. Agents
if [[ -n "$SINGLE_AGENT" ]]; then
  $COMPOSE up -d "$SINGLE_AGENT"
  wait_healthy "$SINGLE_AGENT"
else
  $COMPOSE up -d orchestrator product-discovery order-management \
                 pricing-promotions review-sentiment inventory-fulfillment
  for svc in orchestrator product-discovery order-management \
             pricing-promotions review-sentiment inventory-fulfillment; do
    wait_healthy "$svc"
  done
fi

# 9. Frontend
$COMPOSE up -d frontend

# 10. Summary
cat <<EOF

   Cluster up.

   Frontend:    http://localhost:3000
   Orchestrator API:  http://localhost:8080
   Aspire Dashboard:  http://localhost:18888
   Postgres:    localhost:5432  (db: ecommerce_agents)
   Redis:       localhost:6379

   Tail an agent:  docker compose logs -f orchestrator
   Stop:           docker compose down              # keeps volumes
   Reset:          ./scripts/dev.sh --clean         # wipes volumes, rebuilds from scratch
EOF

The poll-don’t-sleep pattern in wait_healthy is what makes this script survive cold CI runners. The original capstone shipped a 30-second sleep — that was just enough to mask Postgres flakiness on a fast disk and totally inadequate when the disk was slow. Polling every 1 second up to 60 attempts gives a hard 60-second budget but exits as soon as the service is genuinely up.

The : "${POSTGRES_PASSWORD:?...}" form is a Bash idiom that exits with a clear error when the variable is missing or empty. It’s the dev.sh equivalent of Compose refusing to start without ${POSTGRES_PASSWORD} defaulted.

.env template

# .env.example — copy to .env and fill in.
# dev.sh will refuse to start until POSTGRES_PASSWORD and JWT_SECRET are set.

# Database
POSTGRES_PASSWORD=        # generate: openssl rand -hex 24

# Auth (Ch20c hardening covers rotation; these are local bootstrap secrets)
JWT_SECRET=               # generate: openssl rand -hex 32
AGENT_SHARED_SECRET=      # generate: openssl rand -hex 32

# LLM provider — pick ONE
LLM_PROVIDER=openai

# OpenAI
OPENAI_API_KEY=

# Azure OpenAI (uncomment to use; set LLM_PROVIDER=azure)
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_KEY=
# AZURE_OPENAI_DEPLOYMENT=gpt-4
# AZURE_OPENAI_API_VERSION=2025-03-01-preview

The capstone’s agents read the LLM keys via the same chat_client_factory() they use for everything else. Switching providers is the LLM_PROVIDER flag plus the relevant block of variables — no code edits.

Profile-based subsets

Compose profiles let a single file describe several launch shapes:

services:
  pgadmin:
    image: dpage/pgadmin4
    profiles: ["tools"]
    # ...

  redis-commander:
    image: rediscommander/redis-commander
    profiles: ["tools"]
    # ...

Default up ignores profiled services. docker compose --profile tools up -d brings them in. The capstone uses this for pgadmin, redis-commander, and a loadtest runner — all useful, none of them needed in everyday development.
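
A quick way to see what a profile adds without starting anything:

docker compose config --services                   # the default stack
docker compose --profile tools config --services   # plus the "tools" services
docker compose --profile tools up -d               # bring them in for a session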

Gotchas

  • Compose v2 vs v1. The docker compose plugin (v2) and the legacy docker-compose Python binary are not the same. The script tests for v2 explicitly. v1’s depends_on.condition: service_healthy works but a few less-common features have drifted; don’t mix them.
  • pg_isready lies. It returns success as soon as Postgres accepts TCP, which can be ~200ms before it accepts queries. That’s why the db healthcheck above passes -U ecommerce -d ecommerce_agents rather than calling it bare: the probe targets the actual application database and catches that gap.
  • start_period in Dockerfile vs Compose. When a service defines its own healthcheck block in Compose, that block replaces the image’s HEALTHCHECK wholesale, --start-period included, so the Dockerfile value never reaches condition: service_healthy. Set it in both: the Dockerfile value is what docker run honours, the Compose value is what docker compose up honours. They should match.
  • Volume permissions. The non-root agent user owns /app. If you bind-mount a host directory over /app/.cache/uv, Docker creates it root-owned and uv fails to write. Either chown the host directory before mounting, or use a named volume.
  • OTel exporter unreachable on first boot. If Aspire isn’t up yet, the agent’s OTel exporter logs warnings but doesn’t crash. The Aspire depends_on is omitted from agents because aspire itself doesn’t have a healthcheck — the dependency is graceful by design. If you want strict ordering, add a healthcheck to Aspire (curl on :18888).
  • Disk space. docker compose down -v (or dev.sh --clean) is what actually frees the named volume. docker compose down keeps pgdata intact — useful when you want to restart without re-seeding, dangerous when you’ve forgotten a migration.

What changes for the capstone

The capstone today ships everything described above (agents/python/Dockerfile, docker-compose.yml, scripts/dev.sh). The Phase 11 entry in the refactor plan tracks small modernisations that came out of writing this chapter:

  • scripts/dev.sh — replace the legacy sleep 30 with the wait_healthy poll loop above.
  • .env.example — promote POSTGRES_PASSWORD and JWT_SECRET from optional with defaults to required (matching the Compose ${VAR} no-default form).
  • agents/dotnet/Dockerfile — split into build / runtime stages as shown above; the original was a single-stage SDK image.
  • docker-compose.dotnet.yml — overlay file that flips every agent’s build.context to ./agents/dotnet.

No agent code changes. The deployment substrate lives entirely in Compose and shell.

What’s next

That closes the deployment-substrate gap the original Part 7 left open. The companion series on AKS / Managed Identity / Key Vault / private endpoints — flagged in the capstone roadmap — picks up where this chapter ends. For the framework series itself, this is the last appendix: Ch21 — Putting it all together is the chapter to come back to once you’ve used the deployment substrate to run something real.
