Series note — Final appendix to MAF v1: Python and .NET. Sits after Ch24 — Prompt engineering. Supersedes the deployment half of the original Python-only Part 7 — Production Readiness: Auth, RBAC, and Deployment — the auth half of that article is now Ch20c — Production hardening. This chapter covers the deployment substrate only: one Dockerfile that builds every agent, one Compose file that runs the cluster, one bootstrap script that brings it up. Now with the .NET twin alongside Python.
Capstone code — This chapter doesn’t ship a standalone tutorial folder. The runnable deployment substrate lives at the repo root:
`agents/python/Dockerfile` (multi-target Python build), `docker-compose.yml` (Python stack), `docker-compose.dotnet.yml` (.NET overlay), and `scripts/dev.sh` (one-shot bootstrap).
Why this chapter
The capstone has six agents (Python and .NET twins each), one MCP server, one frontend, Postgres + pgvector, Redis, Aspire Dashboard. That’s eleven services even before you count what’s needed for a developer to run a single specialist locally for an afternoon. Without packaging discipline, “clone and run” turns into a half-day of pip install errors, missing environment variables, and Postgres connection strings that drift between READMEs.
The deployment substrate is what makes the capstone runnable in one command. This chapter walks the three pieces — a multi-target Dockerfile that shares the dependency layer across all agents, a Compose file that wires the cluster with health gates and YAML anchors, and a dev.sh script that brings everything up in dependency order with retries.
It is not a guide to Kubernetes, AKS, or production cloud deployment. That’s a different conversation (the capstone roadmap flags it as a Phase 11+ companion series). What you have here is the desktop-developer baseline that everything else builds on.
Prerequisites

- Docker 24+ and Docker Compose v2 (the `docker compose` command, not the legacy `docker-compose` binary).
- Bash. The `dev.sh` script uses `set -euo pipefail`, `trap`, and `command -v` — works on macOS, Linux, and WSL2 out of the box.
- For the Python stack: `uv` (the script builds the agent images; you don't install Python locally).
- For the .NET stack: nothing extra — the .NET SDK lives inside the build image.
- 4 GB of free RAM for the cluster (Postgres + Redis + Aspire + 6 Python agents + frontend; the .NET twin replaces the agents but uses similar memory).
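The tool checks can be collapsed into a tiny guard function, in the spirit of the `command -v` probes `dev.sh` runs (the `require` helper name is my own, not from the repo):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail fast with a readable message when a required tool is missing.
require() {
  command -v "$1" >/dev/null 2>&1 \
    || { echo "Missing prerequisite: $1" >&2; return 1; }
}

require bash
require sh
echo "prerequisites ok"
```

Calling `require docker` and `require uv` at the top of a bootstrap script turns a cryptic mid-run failure into a one-line message before anything builds.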
What you’ll learn

- The multi-target Dockerfile pattern: one file, `ARG AGENT_NAME` per build, shared dependency cache.
- The .NET equivalent — multi-stage `dotnet publish` onto a slim ASP.NET runtime image — and where the language idioms force the file to look different.
- The Compose YAML-anchor trick that keeps six agents from each duplicating 20 environment variables.
- The health-gate cascade in `dev.sh` that turns “wait for Postgres” from a `sleep 30` into something that survives a slow disk.
- The `--profile` knobs that let you boot infra-only, agents-only, or a single agent for an afternoon of focused debugging.
The Compose layout

A snippet, not the whole file:
# docker-compose.yml
services:
db:
image: pgvector/pgvector:pg16
environment:
POSTGRES_DB: ecommerce_agents
POSTGRES_USER: ecommerce
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} # required, no default
volumes:
- pgdata:/var/lib/postgresql/data
- ./db/init:/docker-entrypoint-initdb.d:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ecommerce -d ecommerce_agents"]
interval: 5s
timeout: 3s
retries: 20
ports: ["5432:5432"]
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 10
ports: ["6379:6379"]
aspire:
image: mcr.microsoft.com/dotnet/aspire-dashboard:latest
environment:
DOTNET_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS: "true"
ports:
- "18888:18888" # web UI
- "18889:18889" # OTLP receiver
orchestrator:
build:
context: ./agents/python
args: { AGENT_NAME: orchestrator, AGENT_PORT: "8080" }
environment: &agent-env
DATABASE_URL: postgresql://ecommerce:${POSTGRES_PASSWORD}@db:5432/ecommerce_agents
REDIS_URL: redis://redis:6379
LLM_PROVIDER: ${LLM_PROVIDER:-openai}
OPENAI_API_KEY: ${OPENAI_API_KEY:-}
AZURE_OPENAI_ENDPOINT: ${AZURE_OPENAI_ENDPOINT:-}
AZURE_OPENAI_KEY: ${AZURE_OPENAI_KEY:-}
AZURE_OPENAI_DEPLOYMENT: ${AZURE_OPENAI_DEPLOYMENT:-}
JWT_SECRET: ${JWT_SECRET}
AGENT_SHARED_SECRET: ${AGENT_SHARED_SECRET}
OTEL_ENABLED: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: http://aspire:18889
OTEL_SERVICE_NAME: ecommerce.orchestrator
AGENT_REGISTRY: >-
{"product-discovery":"http://product-discovery:8081",
"order-management":"http://order-management:8082"}
depends_on:
db: { condition: service_healthy }
redis: { condition: service_healthy }
healthcheck:
test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
ports: ["8080:8080"]
product-discovery:
build:
context: ./agents/python
args: { AGENT_NAME: product_discovery, AGENT_PORT: "8081" }
environment:
<<: *agent-env
OTEL_SERVICE_NAME: ecommerce.product-discovery
depends_on:
db: { condition: service_healthy }
healthcheck:
test: ["CMD", "curl", "-fsS", "http://localhost:8081/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
ports: ["8081:8081"]
# ...repeat for order-management, pricing-promotions, review-sentiment,
# inventory-fulfillment — each merges <<: *agent-env and overrides
# OTEL_SERVICE_NAME
frontend:
build: ./web
environment:
NEXT_PUBLIC_API_URL: http://localhost:8080
depends_on:
orchestrator: { condition: service_healthy }
ports: ["3000:3000"]
volumes:
  pgdata:

Three things worth pulling out:
1. The `&agent-env` anchor. The orchestrator block defines the canonical environment set. Every specialist merges it with `<<: *agent-env` and overrides exactly one key (`OTEL_SERVICE_NAME`). Adding a new env var means one edit, not six.
2. `condition: service_healthy` everywhere. Compose’s default `depends_on` only waits for the container to start, not to be ready. Without `service_healthy`, Postgres takes 4 seconds to actually accept connections; the orchestrator boots in 2 seconds, can’t connect, and crashes. With `service_healthy`, Compose holds the dependent back until the upstream’s healthcheck passes.
3. `start_period: 30s` on the agent healthcheck. The agent boot path is uv resolve → import → init DB pool → init OTel → uvicorn ready. Three of those steps do non-trivial I/O. A 30-second start period prevents Compose from killing the container before it has a chance to become healthy. On a fresh CI runner with cold caches, bump this to 60s.
The `${POSTGRES_PASSWORD}` and `${JWT_SECRET}` references have no default — Compose will refuse to start if they’re missing. That’s intentional. A default like `change-me-in-production` always ends up running in production.
Multi-target Dockerfile (Python)
One file. `ARG AGENT_NAME` selects which subdirectory gets baked in. The `uv sync` layer is shared across every agent build.
# agents/python/Dockerfile
FROM python:3.12-slim AS base
RUN apt-get update \
&& apt-get install -y --no-install-recommends gcc libpq-dev curl \
&& rm -rf /var/lib/apt/lists/*
# Bring in `uv` from its distroless image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN groupadd -r agent \
&& useradd -r -g agent -d /app -s /sbin/nologin agent
WORKDIR /app
# Layer 1 — dependencies (cached across every agent build)
COPY pyproject.toml uv.lock ./
RUN uv sync --no-dev --no-install-project
# Layer 2 — shared library (also cached across agents)
COPY shared/ shared/
COPY config/ config/
# Layer 3 — agent-specific code (only this layer differs per agent)
ARG AGENT_NAME=orchestrator
ARG AGENT_PORT=8080
COPY ${AGENT_NAME}/ ${AGENT_NAME}/
RUN chown -R agent:agent /app
ENV AGENT_NAME=${AGENT_NAME} \
AGENT_PORT=${AGENT_PORT} \
PYTHONPATH=/app \
UV_CACHE_DIR=/app/.cache/uv
USER agent
EXPOSE ${AGENT_PORT}
HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
CMD curl -fsS http://localhost:${AGENT_PORT}/health || exit 1
CMD uv run --no-project uvicorn ${AGENT_NAME}.main:app \
    --host 0.0.0.0 --port ${AGENT_PORT}

Layer 1 (`uv sync`) takes ~30 seconds on a fresh build but is identical for every agent — Docker reuses the cached layer. Layer 2 (`shared/` + `config/`) is also identical. Only Layer 3 (`COPY ${AGENT_NAME}/`) differs per agent, and it’s tiny. Building all six agents takes roughly the same time as building one.
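Because only the build args differ, scripting all six builds is a short loop. A dry-run sketch — it echoes the commands instead of invoking Docker; the `ecommerce/` tag prefix and the underscore names for the four agents not shown in the Compose snippet are my assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# agent_name:port pairs, mirroring the ARG AGENT_NAME / AGENT_PORT pairs
# the Compose file passes to agents/python/Dockerfile.
AGENTS="orchestrator:8080 product_discovery:8081 order_management:8082
pricing_promotions:8083 review_sentiment:8084 inventory_fulfillment:8085"

build_cmds() {
  local pair name port
  for pair in $AGENTS; do
    name="${pair%%:*}"
    port="${pair##*:}"
    # Layers 1-2 are cache hits after the first build; only the
    # COPY ${AGENT_NAME}/ layer is rebuilt per agent.
    echo "docker build ./agents/python" \
         "--build-arg AGENT_NAME=${name}" \
         "--build-arg AGENT_PORT=${port}" \
         "-t ecommerce/${name}:dev"
  done
}

build_cmds
```

Pipe the output through `sh` (or drop the `echo`) on a machine with Docker installed to run the real builds.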
The non-root agent user is non-negotiable. A compromise of one agent should not give the attacker root in the container; from there to a host escape is a smaller hop than people think. The --start-period=30s is the same one we set in Compose — it has to be set in both places because Compose doesn’t honour the Dockerfile value for condition: service_healthy.
Multi-stage Dockerfile (.NET)
The .NET version uses a build stage and a runtime stage. The runtime image carries only the published output, not the SDK.
# agents/dotnet/Dockerfile
ARG AGENT_NAME=Orchestrator
ARG AGENT_PORT=8080
FROM mcr.microsoft.com/dotnet/sdk:9.0 AS build
WORKDIR /src
# Layer 1 — restore (cached across every agent)
COPY ECommerceAgents.sln ./
COPY src/Shared/Shared.csproj src/Shared/
COPY src/Orchestrator/Orchestrator.csproj src/Orchestrator/
COPY src/ProductDiscovery/ProductDiscovery.csproj src/ProductDiscovery/
# ... one COPY per agent project
RUN dotnet restore
# Layer 2 — copy source and publish the selected agent
COPY src/ src/
ARG AGENT_NAME
RUN dotnet publish "src/${AGENT_NAME}/${AGENT_NAME}.csproj" \
-c Release \
-o /app/publish \
/p:PublishSingleFile=false
# Runtime stage
FROM mcr.microsoft.com/dotnet/aspnet:9.0 AS runtime
RUN apt-get update && apt-get install -y --no-install-recommends curl \
&& rm -rf /var/lib/apt/lists/* \
&& groupadd -r agent && useradd -r -g agent -d /app agent
WORKDIR /app
COPY --from=build /app/publish .
ARG AGENT_NAME
ARG AGENT_PORT
ENV AGENT_NAME=${AGENT_NAME} \
AGENT_PORT=${AGENT_PORT} \
ASPNETCORE_URLS=http://+:${AGENT_PORT}
USER agent
EXPOSE ${AGENT_PORT}
HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
CMD curl -fsS http://localhost:${AGENT_PORT}/health || exit 1
ENTRYPOINT ["sh", "-c", "dotnet ${AGENT_NAME}.dll"]

Two structural differences from the Python version, both forced by language idioms:
- Two stages, not one. The .NET SDK image is ~700 MB; the ASP.NET runtime image is ~200 MB. Copying only `/app/publish` from the build stage cuts the final image to ~250 MB — a comfortable size for fast pulls. Python’s `uv sync --no-dev` already gives us a small image; no second stage needed.
- Project files COPYed individually before source. Restoring against `.csproj` and `.sln` only — without the `.cs` source — means a source change doesn’t bust the restore cache. Python’s `pyproject.toml` plays the same role; the layer pattern is identical even though the file shape differs.
The same Compose file points at either Dockerfile via the `build.context` field — `./agents/python` for the Python stack, `./agents/dotnet` for the .NET stack. The capstone ships a `docker-compose.dotnet.yml` overlay that swaps every agent’s build context.
Topology

flowchart TD
  subgraph infra["Infrastructure"]
    pg["Postgres + pgvector :5432"]
    rd["Redis 7 :6379"]
  end
  subgraph telemetry["Observability"]
    asp["Aspire Dashboard :18888 / OTLP :18889"]
  end
  subgraph agents["Agents"]
    orch["orchestrator :8080"]
    pd["product-discovery :8081"]
    om["order-management :8082"]
    pp["pricing-promotions :8083"]
    rs["review-sentiment :8084"]
    inv["inventory-fulfillment :8085"]
  end
  fe["Next.js frontend :3000"]
  llm["OpenAI / Azure OpenAI"]
  fe --> orch
  orch --> pd
  orch --> om
  orch --> pp
  orch --> rs
  orch --> inv
  orch --> llm
  pd --> llm
  om --> llm
  orch --> pg
  pd --> pg
  om --> pg
  orch --> rd
  orch -. OTel .-> asp
  pd -. OTel .-> asp
  class pg,rd infra
  class orch,pd,om,pp,rs,inv core
  class llm extern
  class asp obs
  class fe core
Eleven services, three groups. Frontend talks only to the orchestrator; the orchestrator talks to specialists, Postgres, Redis, and the LLM provider; specialists talk to Postgres and the LLM. OTel spans flow from every service to Aspire.
The `dev.sh` bootstrap script
One command:
./scripts/dev.sh # Full rebuild, start everything
./scripts/dev.sh --clean # Nuke volumes, rebuild from scratch
./scripts/dev.sh --infra-only # Just db + redis + aspire
./scripts/dev.sh --agent orchestrator # Just one agent (for focused debugging)
./scripts/dev.sh --dotnet              # Use the .NET twin (docker-compose.dotnet.yml)

The script (lightly edited for length):
#!/usr/bin/env bash
set -euo pipefail
CLEAN=0; INFRA_ONLY=0; STACK="python"; SINGLE_AGENT=""
while [[ $# -gt 0 ]]; do
case "$1" in
--clean) CLEAN=1; shift ;;
--infra-only) INFRA_ONLY=1; shift ;;
--dotnet) STACK="dotnet"; shift ;;
--agent) SINGLE_AGENT="$2"; shift 2 ;;
*) echo "Unknown arg: $1" >&2; exit 2 ;;
esac
done
# 1. Prereqs
command -v docker >/dev/null || { echo "Docker required"; exit 1; }
docker compose version >/dev/null 2>&1 || { echo "Compose v2 required"; exit 1; }
# 2. Env
[[ -f .env ]] || { cp .env.example .env; echo "Created .env from example. Fill it in and re-run."; exit 1; }
# shellcheck disable=SC1091
set -a; source .env; set +a
: "${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}"
: "${JWT_SECRET:?JWT_SECRET must be set in .env}"
COMPOSE="docker compose"
[[ "$STACK" == "dotnet" ]] && COMPOSE="docker compose -f docker-compose.yml -f docker-compose.dotnet.yml"
# 3. Optional clean
if (( CLEAN )); then
$COMPOSE down -v --remove-orphans
fi
# 4. Build (cached layers make this fast on re-runs)
$COMPOSE build
# 5. Infra first
$COMPOSE up -d db redis aspire
# 6. Wait for infra health (poll, don't sleep)
wait_healthy() {
local svc="$1" tries=0
until [[ "$($COMPOSE ps -q "$svc" | xargs docker inspect -f '{{.State.Health.Status}}')" == "healthy" ]]; do
((tries++ < 60)) || { echo "Timeout waiting for $svc"; $COMPOSE logs --tail=50 "$svc"; exit 1; }
sleep 1
done
echo " $svc — healthy"
}
echo "Waiting for infrastructure..."
wait_healthy db
wait_healthy redis
# 7. Seed (one-shot)
$COMPOSE run --rm seeder
(( INFRA_ONLY )) && { echo "Infra ready. Skipping agents."; exit 0; }
# 8. Agents
if [[ -n "$SINGLE_AGENT" ]]; then
$COMPOSE up -d "$SINGLE_AGENT"
wait_healthy "$SINGLE_AGENT"
else
$COMPOSE up -d orchestrator product-discovery order-management \
pricing-promotions review-sentiment inventory-fulfillment
for svc in orchestrator product-discovery order-management \
pricing-promotions review-sentiment inventory-fulfillment; do
wait_healthy "$svc"
done
fi
# 9. Frontend
$COMPOSE up -d frontend
# 10. Summary
cat <<EOF
Cluster up.
Frontend: http://localhost:3000
Orchestrator API: http://localhost:8080
Aspire Dashboard: http://localhost:18888
Postgres: localhost:5432 (db: ecommerce_agents)
Redis: localhost:6379
Tail an agent: docker compose logs -f orchestrator
Stop: ./scripts/dev.sh --clean # also wipes volumes
EOF

The poll-don’t-sleep pattern in `wait_healthy` is what makes this script survive cold CI runners. The original capstone shipped a 30-second sleep — just long enough to mask Postgres startup flakiness on a fast disk, and inadequate the moment the disk was slow. Polling every second for up to 60 attempts gives a hard 60-second budget but exits as soon as the service is genuinely up.
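The same shape works for any readiness probe, not just `docker inspect`. A generic sketch — the `wait_until` name and the marker-file probe are illustrative, not from the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll a probe command once per second until it succeeds or the
# budget (in attempts) runs out -- the generic wait_healthy shape.
wait_until() {
  local budget="$1"; shift
  local tries=0
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$budget" ]; then
      echo "Timeout after ${budget} attempts: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Demo: the probe fails at first, then flips to ready ~2s in.
marker=$(mktemp -u)
( sleep 2; touch "$marker" ) &
wait_until 10 test -f "$marker"
echo "probe succeeded"
rm -f "$marker"
```

Swapping the probe for `docker inspect -f '{{.State.Health.Status}}' …` (plus a string compare) recovers the `wait_healthy` in the script above.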
The `: "${POSTGRES_PASSWORD:?...}"` form is a Bash idiom that exits with a clear error when the variable is missing or empty. It’s the `dev.sh` equivalent of Compose refusing to start when `${POSTGRES_PASSWORD}` has no default and no value.
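The difference between the guard form and the default form is easy to see in isolation (a standalone sketch; `DEMO_SECRET` is a throwaway name):

```shell
#!/usr/bin/env bash
set -euo pipefail

unset DEMO_SECRET LLM_PROVIDER 2>/dev/null || true

# ${VAR:?msg} -- abort with msg when VAR is unset or empty (the dev.sh guard).
# ${VAR:-def} -- substitute def when VAR is unset or empty (Compose optionals).

# Run the guard in a subshell so its failure doesn't kill this script.
if ( : "${DEMO_SECRET:?DEMO_SECRET must be set in .env}" ) 2>/dev/null; then
  echo "guard passed"
else
  echo "guard tripped"        # this branch runs: DEMO_SECRET is unset
fi

DEMO_SECRET=s3cret
( : "${DEMO_SECRET:?}" ) && echo "guard passed"

echo "provider: ${LLM_PROVIDER:-openai}"   # default kicks in: prints openai
```

The `:` no-op exists purely to force the expansion; the side effect (exit plus message) is the point.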
.env template
# .env.example — copy to .env and fill in.
# dev.sh will refuse to start until POSTGRES_PASSWORD and JWT_SECRET are set.
# Database
POSTGRES_PASSWORD= # generate: openssl rand -hex 24
# Auth (Ch20c hardening covers rotation; keep these as bootstrap defaults)
JWT_SECRET= # generate: openssl rand -hex 32
AGENT_SHARED_SECRET= # generate: openssl rand -hex 32
# LLM provider — pick ONE
LLM_PROVIDER=openai
# OpenAI
OPENAI_API_KEY=
# Azure OpenAI (uncomment to use; set LLM_PROVIDER=azure)
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_KEY=
# AZURE_OPENAI_DEPLOYMENT=gpt-4
# AZURE_OPENAI_API_VERSION=2025-03-01-preview

The capstone’s auth code reads the LLM keys via the same `chat_client_factory()` it uses for everything else. Switching providers is the `LLM_PROVIDER` flag plus the relevant block of variables — no code edits.
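The generator commands in the template’s comments are easy to sanity-check: `openssl rand -hex N` emits 2·N hex characters, so the guards below assert the expected lengths (the temp-file fragment is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# openssl rand -hex N prints 2*N hex characters.
pg_pass=$(openssl rand -hex 24)      # 48 chars
jwt_secret=$(openssl rand -hex 32)   # 64 chars
agent_secret=$(openssl rand -hex 32)

[ "${#pg_pass}" -eq 48 ]
[ "${#jwt_secret}" -eq 64 ]

# Stamp them into a throwaway .env fragment.
fragment=$(mktemp)
cat > "$fragment" <<EOF
POSTGRES_PASSWORD=${pg_pass}
JWT_SECRET=${jwt_secret}
AGENT_SHARED_SECRET=${agent_secret}
EOF
echo "wrote $(grep -c '=' "$fragment") secrets to a fragment"
rm -f "$fragment"
```

Appending the fragment to `.env` (instead of the throwaway path) is the one-liner version of “fill it in and re-run.”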
Profile-based subsets
Compose profiles let a single file describe several launch shapes:
services:
pgadmin:
image: dpage/pgadmin4
profiles: ["tools"]
# ...
redis-commander:
image: rediscommander/redis-commander
profiles: ["tools"]
  # ...

A default `up` ignores profiled services; `docker compose --profile tools up -d` brings them in. The capstone uses this for pgadmin, redis-commander, and a loadtest runner — all useful, none of them needed in everyday development.
Gotchas

- Compose v2 vs v1. The `docker compose` plugin (v2) and the legacy `docker-compose` Python binary are not the same. The script tests for v2 explicitly. v1’s `depends_on.condition: service_healthy` works, but a few less-common features have drifted; don’t mix them.
- `pg_isready` lies. It returns success as soon as Postgres accepts TCP, which can be ~200 ms before it accepts queries. The `db` healthcheck above uses `pg_isready -U ecommerce -d ecommerce_agents` rather than the parameter-less form, which connects to the actual DB and catches that gap.
- `start_period` in Dockerfile vs Compose. Compose ignores the Dockerfile’s `HEALTHCHECK --start-period` for the purposes of `condition: service_healthy`. Set it in both — the Dockerfile value is what `docker run` honours, the Compose value is what `docker compose up` honours. They should match.
- Volume permissions. The non-root `agent` user owns `/app`. If you bind-mount a host directory over `/app/.cache/uv`, Docker creates it root-owned and uv fails to write. Either chown the host directory before mounting, or use a named volume.
- OTel exporter unreachable on first boot. If Aspire isn’t up yet, the agent’s OTel exporter logs warnings but doesn’t crash. The Aspire `depends_on` is omitted from agents because `aspire` itself doesn’t have a healthcheck — the dependency is graceful by design. If you want strict ordering, add a healthcheck to Aspire (curl on `:18888`).
- Disk space. `docker compose down -v` (or `dev.sh --clean`) is the only way to actually free the named volume. `docker compose down` keeps `pgdata` intact — useful when you want to restart without re-seeding, dangerous when you forgot a migration.
What changes for the capstone
The capstone today ships everything described above (`agents/python/Dockerfile`, `docker-compose.yml`, `scripts/dev.sh`). The Phase 11 entry in the refactor plan tracks small modernisations that came out of writing this chapter:
- `scripts/dev.sh` — replace the legacy `sleep 30` with the `wait_healthy` poll loop above.
- `.env.example` — promote `POSTGRES_PASSWORD` and `JWT_SECRET` from optional-with-defaults to required (matching the Compose `${VAR}` no-default form).
- `agents/dotnet/Dockerfile` — split into build / runtime stages as shown above; the original was a single-stage SDK image.
- `docker-compose.dotnet.yml` — overlay file that flips every agent’s `build.context` to `./agents/dotnet`.
No agent code changes. The deployment substrate lives entirely in Compose and shell.
What’s next
That closes the deployment-substrate gap the original Part 7 left open. The companion series on AKS / Managed Identity / Key Vault / private endpoints — flagged in the capstone roadmap — picks up where this chapter ends. For the framework series itself, this is the last appendix: Ch21 — Putting it all together is the chapter to come back to once you’ve used the deployment substrate to run something real.

