Series note — Companion chapter sitting between Ch20b — DevUI and Ch21 — Putting it all together. This is HTTP-layer hardening — not framework API surface — but it’s the gap most readers hit when they take the capstone past localhost. The original series introduced JWT auth in Part 7 — Production Readiness. That post covered the happy path. This one covers what production demanded six months in: password reset, refresh-token rotation with reuse detection, and graceful secret rotation.
Repo — Runnable code: tutorials/20c-production-hardening.
Why this chapter#
JWT auth and RBAC, as covered in the original series, get you a working login flow. They don’t get you through a real ops cycle. Three concrete moments will eventually force the issue:
- A user forgets their password.
- A refresh token leaks (a stolen device, a logged URL, a backup snapshot).
- You need to roll the JWT signing key, or the x-agent-secret shared between agents, without a deploy window.
Each of these has a textbook answer; each of those answers has a footgun. This chapter walks the patterns the capstone settled on, in Python first, with the .NET twin called out where the idioms actually diverge.
Prerequisites#
- Familiarity with Part 7 — Production Readiness: Auth, RBAC, and Deployment — this picks up where that left off.
- Python 3.12+ with fastapi, pyjwt, bcrypt, asyncpg (or the .NET equivalents).
- Postgres 16 (the capstone’s database).
- An SMTP relay or transactional-email provider for password reset. The code below assumes a send_email(to, subject, body) helper; wire whatever you use.
Part A: Password reset#
Threat model first#
A password-reset endpoint is a credential issuer. A leak in this flow is worse than a leak in login — the attacker writes the new credential. Three rules cover most of what can go wrong:
- The token in the email is a one-shot bearer credential. Treat it like a password. Short TTL. Single-use. Never logged.
- Store the hash, not the token. A database snapshot must not let an attacker complete pending resets.
- Don’t leak account existence. POST /auth/forgot-password returns the same shape whether the email is registered or not. Otherwise the endpoint becomes an account-enumeration oracle.
Schema#
CREATE TABLE password_reset_tokens (
token_hash TEXT PRIMARY KEY, -- sha256(plaintext token)
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
expires_at TIMESTAMPTZ NOT NULL,
consumed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
requesting_ip INET,
user_agent TEXT
);
CREATE INDEX idx_prt_user_unconsumed
ON password_reset_tokens (user_id)
WHERE consumed_at IS NULL;
The partial index on unconsumed rows lets us quickly invalidate prior pending tokens for the same user when a new one is issued (one outstanding token per user; see below).
Issuing a token#
import asyncio
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

# db, settings, and send_email are assumed to be provided elsewhere in the app.
RESET_TOKEN_TTL = timedelta(minutes=15)


def _hash_token(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()


async def request_password_reset(email: str, request_ip: str, user_agent: str) -> None:
    """Generate and email a reset link. Never raises for unknown emails."""
    user = await db.fetch_one("SELECT id FROM users WHERE email = $1", email)
    if user is None:
        # Timing burn: a fixed delay that roughly matches the success path
        await asyncio.sleep(0.05)
        return
    # Invalidate any outstanding pending token for this user
    await db.execute(
        "UPDATE password_reset_tokens SET consumed_at = now() "
        "WHERE user_id = $1 AND consumed_at IS NULL",
        user["id"],
    )
    plaintext = secrets.token_urlsafe(32)  # 256 bits
    await db.execute(
        "INSERT INTO password_reset_tokens "
        "(token_hash, user_id, expires_at, requesting_ip, user_agent) "
        "VALUES ($1, $2, $3, $4, $5)",
        _hash_token(plaintext),
        user["id"],
        datetime.now(timezone.utc) + RESET_TOKEN_TTL,
        request_ip,
        user_agent,
    )
    reset_url = f"{settings.frontend_base_url}/reset-password?token={plaintext}"
    await send_email(
        to=email,
        subject="Reset your ECommerce Agents password",
        body=f"Click to reset (expires in 15 minutes): {reset_url}",
    )
Three details that matter:
- secrets.token_urlsafe(32) generates 256 bits of entropy, URL-safe. random.choice and uuid4 are both wrong here: random is not cryptographic, and uuid4 only carries 122 bits.
- Plaintext goes only to the email. The DB never sees it. If the DB leaks, the attacker has hashes, which they can’t use to construct the URL.
- Old pending tokens are invalidated when a new one is issued. Otherwise an attacker who triggered an earlier reset (and intercepted that email) can race the legitimate user.
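The first two points can be checked with a minimal, standard-library-only sketch. The helper names make_reset_token and hash_token are illustrative and not taken from the capstone.

```python
import hashlib
import secrets
import uuid


def hash_token(token: str) -> str:
    """SHA-256 hex digest; the only value that should reach the database."""
    return hashlib.sha256(token.encode()).hexdigest()


def make_reset_token() -> tuple[str, str]:
    """Return (plaintext, token_hash). Hypothetical helper for illustration."""
    plaintext = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
    return plaintext, hash_token(plaintext)


plaintext, stored = make_reset_token()
# 32 bytes base64url-encode to 43 characters once padding is stripped;
# a SHA-256 hex digest is always 64 characters.
print(len(plaintext), len(stored))  # 43 64

# A version-4 UUID fixes 6 of its 128 bits (version and variant),
# which leaves 122 random bits.
u = uuid.uuid4()
print(u.version, 128 - 6)  # 4 122
```

The plaintext goes into the email link; only the second value of the tuple is written to password_reset_tokens.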
Consuming the token#
async def reset_password(token: str, new_password: str) -> bool:
    token_hash = _hash_token(token)
    async with db.transaction():
        row = await db.fetch_one(
            "SELECT user_id, expires_at, consumed_at "
            "FROM password_reset_tokens WHERE token_hash = $1 FOR UPDATE",
            token_hash,
        )
        if row is None:
            return False
        if row["consumed_at"] is not None:
            return False
        if row["expires_at"] < datetime.now(timezone.utc):
            return False
        await db.execute(
            "UPDATE password_reset_tokens SET consumed_at = now() "
            "WHERE token_hash = $1",
            token_hash,
        )
        await db.execute(
            "UPDATE users SET password_hash = $1 WHERE id = $2",
            hash_password(new_password),
            row["user_id"],
        )
        # Critical: revoke all existing refresh tokens for this user
        await db.execute(
            "UPDATE refresh_tokens SET revoked_at = now() "
            "WHERE user_id = $1 AND revoked_at IS NULL",
            row["user_id"],
        )
    return True
The FOR UPDATE row lock on the reset-token row is what makes single-use atomic across concurrent requests. The transaction also covers the users update and the refresh-token revocation: if anything fails, nothing changes.
The refresh-token revocation is the easy-to-forget step. Without it, a password reset doesn’t sign anyone out — an attacker who already had a refresh token (the reason for the reset) keeps minting access tokens. This is the single most common production bug in this flow.
Endpoint#
from fastapi import HTTPException, Request, Response

# router and the two request models are assumed to be defined elsewhere.


@router.post("/auth/forgot-password", status_code=204)
async def forgot_password(req: ForgotPasswordRequest, request: Request):
    await request_password_reset(
        email=req.email,
        request_ip=request.client.host,
        user_agent=request.headers.get("user-agent", ""),
    )
    return Response(status_code=204)


@router.post("/auth/reset-password")
async def reset_password_endpoint(req: ResetPasswordRequest):
    ok = await reset_password(req.token, req.new_password)
    if not ok:
        # Same response on expired/consumed/unknown: no oracle
        raise HTTPException(400, detail="Invalid or expired reset token")
    return {"ok": True}
forgot-password always returns 204. The endpoint is unconditionally rate-limited (10 per IP per hour and 3 per email per day are reasonable starting points; the capstone uses Redis for both counters). Rate limits exist to throttle the email send-budget, not to prove anything to an attacker.
.NET note#
ASP.NET’s IDataProtector plus a small ResetTokenStore table covers the same ground. The DataProtectionTokenProvider<TUser> that ships with Microsoft.AspNetCore.Identity is useful only if you’re already using Identity; if not, the secrets.token_urlsafe(32) → SHA256 pattern translates one-for-one to RandomNumberGenerator.GetBytes(32) + Base64Url + SHA256.HashData. Don’t reach for Guid.NewGuid() here.
Part B: Refresh-token rotation with reuse detection#
The original series ships short-lived access tokens (60 min) and long-lived refresh tokens (7 days), both as JWTs. That’s enough to log in once and stay logged in. It’s not enough to detect a stolen refresh.
The hardening: every refresh issues a new refresh token and invalidates the previous one. If a refresh token is presented after it has been used, the entire token family is revoked and the user is forced to re-authenticate.
This catches the canonical attack: an attacker steals a refresh token from a backup, the legitimate user later refreshes (rotating it), and then the attacker tries to refresh with the stale value. The replay attempt triggers a family-wide revocation — the legitimate user is signed out, but so is the attacker, and the security team gets a paged incident.
Token family lineage#
The family_id ties RT0, RT1, RT2 together. A reuse attempt on any rotated token revokes every descendant in the family.
Schema#
CREATE TABLE refresh_tokens (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
family_id UUID NOT NULL, -- shared across the lineage
parent_id UUID REFERENCES refresh_tokens(id),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
token_hash TEXT NOT NULL UNIQUE, -- sha256 of the JWT
issued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL,
rotated_at TIMESTAMPTZ, -- set when used to mint a child
revoked_at TIMESTAMPTZ -- set when family is killed
);
CREATE INDEX idx_rt_family ON refresh_tokens (family_id);
CREATE INDEX idx_rt_user_active
ON refresh_tokens (user_id)
WHERE revoked_at IS NULL;
token_hash (not the raw JWT) is what we look up. JWTs are bearer credentials, so we should not store them at rest in plaintext.
Refresh handler#
import uuid

import jwt
from jwt import InvalidTokenError  # assumed here to be PyJWT's base error


async def refresh(presented_token: str) -> tuple[str, str]:
    """Returns (new_access_token, new_refresh_token) or raises."""
    payload = jwt.decode(presented_token, settings.jwt_public_key, algorithms=["RS256"])
    if payload.get("type") != "refresh":
        raise InvalidTokenError("Wrong token type")
    token_hash = _hash_token(presented_token)
    reuse_detected = False
    async with db.transaction():
        row = await db.fetch_one(
            "SELECT id, family_id, user_id, rotated_at, revoked_at, expires_at "
            "FROM refresh_tokens WHERE token_hash = $1 FOR UPDATE",
            token_hash,
        )
        if row is None:
            raise InvalidTokenError("Unknown token")
        if row["revoked_at"] is not None:
            raise InvalidTokenError("Token revoked")
        if row["expires_at"] < datetime.now(timezone.utc):
            raise InvalidTokenError("Token expired")
        if row["rotated_at"] is not None:
            # Reuse detection: this token has already been rotated
            await db.execute(
                "UPDATE refresh_tokens SET revoked_at = now() "
                "WHERE family_id = $1 AND revoked_at IS NULL",
                row["family_id"],
            )
            reuse_detected = True
        else:
            # Mint child
            new_jwt = _issue_refresh_jwt(row["user_id"])
            new_id = uuid.uuid4()
            await db.execute(
                "INSERT INTO refresh_tokens "
                "(id, family_id, parent_id, user_id, token_hash, expires_at) "
                "VALUES ($1, $2, $3, $4, $5, $6)",
                new_id, row["family_id"], row["id"], row["user_id"],
                _hash_token(new_jwt),
                datetime.now(timezone.utc) + REFRESH_TTL,
            )
            await db.execute(
                "UPDATE refresh_tokens SET rotated_at = now() WHERE id = $1",
                row["id"],
            )
    if reuse_detected:
        # Raise only after the transaction has committed; raising inside the
        # block would roll the family revocation back.
        log.warning("refresh_reuse_detected", user_id=str(row["user_id"]),
                    family_id=str(row["family_id"]))
        raise InvalidTokenError("Token reuse detected; family revoked")
    return _issue_access_jwt(row["user_id"]), new_jwt
The whole sequence sits inside one transaction with a row lock on the presented token. That guarantees:
- Two concurrent refreshes with the same token can’t both succeed. One will block on FOR UPDATE, then see rotated_at set, then trigger family revocation. That is the right behaviour: simultaneous use of the same refresh is exactly the attack signature.
- The “mint child + mark parent rotated” pair is atomic. A crash mid-flight leaves the parent unmodified.
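The replay scenario can be walked through with a toy in-memory model of the token family. This is a sketch for intuition only: FamilyStore, Row, and ReuseDetected are hypothetical names, the tokens are opaque random strings instead of JWTs, and there is no hashing, expiry, or locking.

```python
import secrets
from dataclasses import dataclass


class ReuseDetected(Exception):
    """Raised when an already-rotated token is presented again."""


@dataclass
class Row:
    family_id: str
    rotated: bool = False
    revoked: bool = False


class FamilyStore:
    """Toy model of the refresh_tokens table, keyed by opaque token."""

    def __init__(self) -> None:
        self.rows: dict[str, Row] = {}

    def login(self) -> str:
        token = secrets.token_urlsafe(16)
        self.rows[token] = Row(family_id=secrets.token_hex(8))  # new family
        return token

    def refresh(self, presented: str) -> str:
        row = self.rows.get(presented)
        if row is None or row.revoked:
            raise ValueError("invalid token")
        if row.rotated:
            # Already used once: revoke every token in the lineage.
            for other in self.rows.values():
                if other.family_id == row.family_id:
                    other.revoked = True
            raise ReuseDetected(row.family_id)
        row.rotated = True
        child = secrets.token_urlsafe(16)
        self.rows[child] = Row(family_id=row.family_id)  # same family
        return child


store = FamilyStore()
rt0 = store.login()         # attacker copies rt0, e.g. from a backup
rt1 = store.refresh(rt0)    # legitimate user rotates rt0 -> rt1
try:
    store.refresh(rt0)      # attacker replays the stale rt0
    outcome = "accepted"
except ReuseDetected:
    outcome = "family revoked"
print(outcome, store.rows[rt1].revoked)  # family revoked True
```

After the replay, rt1 is revoked as well, so the legitimate user has to log in again; that is the trade-off the pattern accepts.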
Logout#
async def logout(refresh_jwt: str) -> None:
    token_hash = _hash_token(refresh_jwt)
    await db.execute(
        "UPDATE refresh_tokens SET revoked_at = now() "
        "WHERE family_id = (SELECT family_id FROM refresh_tokens WHERE token_hash = $1) "
        "AND revoked_at IS NULL",
        token_hash,
    )
Logout revokes the entire family, not just the presented token. This matters because it makes “Sign out everywhere” trivial: the same query, run for every active family belonging to the user.
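The difference between a single logout and “Sign out everywhere” can be tried against an in-memory SQLite table. SQLite is used here only so the sketch runs with the standard library; the table is a cut-down version of refresh_tokens, the hashes are placeholder strings, and sign_out_everywhere is a hypothetical helper that collapses “every active family for the user” into one statement filtered by user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refresh_tokens ("
    " token_hash TEXT PRIMARY KEY, family_id TEXT, user_id TEXT, revoked_at TEXT)"
)
conn.executemany(
    "INSERT INTO refresh_tokens VALUES (?, ?, ?, NULL)",
    [
        ("h-phone-0", "fam-phone", "u1"),
        ("h-phone-1", "fam-phone", "u1"),
        ("h-laptop-0", "fam-laptop", "u1"),
        ("h-other", "fam-other", "u2"),
    ],
)


def logout(token_hash: str) -> None:
    """Revoke the whole family that the presented token belongs to."""
    conn.execute(
        "UPDATE refresh_tokens SET revoked_at = datetime('now') "
        "WHERE family_id = (SELECT family_id FROM refresh_tokens WHERE token_hash = ?) "
        "AND revoked_at IS NULL",
        (token_hash,),
    )


def sign_out_everywhere(user_id: str) -> None:
    """Revoke every active family for the user in one statement."""
    conn.execute(
        "UPDATE refresh_tokens SET revoked_at = datetime('now') "
        "WHERE user_id = ? AND revoked_at IS NULL",
        (user_id,),
    )


def active(user_id: str) -> int:
    return conn.execute(
        "SELECT count(*) FROM refresh_tokens WHERE user_id = ? AND revoked_at IS NULL",
        (user_id,),
    ).fetchone()[0]


logout("h-phone-1")
after_logout = active("u1")       # only the laptop family remains
sign_out_everywhere("u1")
after_everywhere = active("u1")
print(after_logout, after_everywhere, active("u2"))  # 1 0 1
```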
What this doesn’t catch#
Refresh-token rotation catches replay of stolen refresh tokens. It does not catch:
- Theft of the access token before it expires. Access tokens are stateless; the only mitigation is short TTL. 60 minutes is the upper bound; 15 is more honest.
- Compromise of the JWT signing key. That’s Part C.
- A logged-in attacker abusing the legitimate session. Audit your tool calls (Ch06 middleware), don’t try to fix this in auth.
.NET note#
The pattern is identical with Microsoft.IdentityModel.Tokens for JWT validation and EF Core for the table. The one trap: JwtBearerOptions.SaveToken = true will stash the access JWT on HttpContext; that’s harmless but unnecessary if you’re already extracting it explicitly.
Part C: Graceful secret rotation#
Two kinds of secrets need rotation in the capstone:
- JWT signing keys: the asymmetric key pair used to sign access and refresh tokens. RS256 means there’s a public key (used by every agent for verification) and a private key (used only by the orchestrator’s auth service to sign).
- Inter-agent shared secrets: the x-agent-secret header that a specialist accepts as proof the request came from an authorized orchestrator.
Rotation that requires a deploy window or simultaneous restart is the same as no rotation — it never happens. The pattern below makes both kinds rotatable on a live system.
The two-key window#
The trick is the same in both cases: maintain a list of valid keys, sign with the newest, accept any. A new key is “promoted in” hours or days before the old one is “demoted out.” During the overlap, both work.
# settings.py
JWT_KEYS: dict[str, RSAKey] = {
    "2026-04-15": load_rsa_key("keys/jwt-2026-04-15.pem"),  # current signer
    "2026-03-01": load_rsa_key("keys/jwt-2026-03-01.pem"),  # accepted, retiring
}
JWT_CURRENT_KID = "2026-04-15"
Signing (always uses current)#
def issue_access_jwt(user_id: UUID) -> str:
    return jwt.encode(
        {
            "sub": str(user_id),
            "type": "access",
            "exp": datetime.now(timezone.utc) + ACCESS_TTL,
        },
        settings.JWT_KEYS[settings.JWT_CURRENT_KID].private,
        algorithm="RS256",
        headers={"kid": settings.JWT_CURRENT_KID},
    )
The kid (key ID) header is what makes the verifier’s job trivial.
Verifying (looks up by kid, rejects unknown)#
def decode_access_jwt(token: str) -> dict:
    header = jwt.get_unverified_header(token)
    kid = header.get("kid")
    if kid not in settings.JWT_KEYS:
        raise InvalidTokenError(f"Unknown key id: {kid}")
    return jwt.decode(
        token,
        settings.JWT_KEYS[kid].public,
        algorithms=["RS256"],
    )
Rotation lifecycle#
[Diagram: key rotation lifecycle. T0+0d: add Key B (accepts only). T0+1d: promote Key B (signs) and demote Key A (accepts only). T0+8d (> access TTL): remove Key A.]
Day 0: add Key B as accepted-only. Day 1: switch the signer to Key B; Key A keeps verifying tokens already in flight. Day 8 (or whenever every Key A access token has expired): remove Key A from the keyring.
The “Day 8” gap matters: it must be longer than the lifetime of the longest-lived token signed with Key A. If only access tokens had to survive, a 60-minute TTL means 24 hours would be plenty. Refresh tokens are JWTs too and live 7 days, so you wait a week before pulling the old key (or you accept that a small number of users will get one forced re-login).
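The arithmetic behind the removal date is small enough to write down. This sketch assumes the TTLs quoted in this chapter; CLOCK_SLACK, the example date, and the function name earliest_removal are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

ACCESS_TTL = timedelta(minutes=60)
REFRESH_TTL = timedelta(days=7)
CLOCK_SLACK = timedelta(hours=1)  # assumed safety margin, not from the capstone


def earliest_removal(demoted_at: datetime, *ttls: timedelta) -> datetime:
    """The old key may go once the longest-lived token it signed has expired."""
    return demoted_at + max(ttls) + CLOCK_SLACK


# "Day 1" of the timeline: Key A stops signing at this instant (example date).
demoted = datetime(2026, 4, 16, tzinfo=timezone.utc)

print(earliest_removal(demoted, ACCESS_TTL))               # 2026-04-16 02:00:00+00:00
print(earliest_removal(demoted, ACCESS_TTL, REFRESH_TTL))  # 2026-04-23 01:00:00+00:00
```

With refresh tokens in the picture, the result lands seven days after demotion, which matches the Day 1 to Day 8 spacing in the timeline.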
Operationalizing#
The JWT_KEYS dict above is loaded statically at startup. In production you want it to come from a key vault or a config service that supports atomic updates:
- Azure Key Vault: store keys as secrets keyed by kid. Cache them locally with a 60-second TTL; the cache miss is acceptable because rotation is a planned event.
- Filesystem with a watcher: inotify/FileSystemWatcher on the keys directory; the agent reloads JWT_KEYS when a .pem is added or removed. Cheaper than a vault for self-hosted deployments.
- A /admin/reload-keys endpoint: gated behind admin RBAC, useful as a manual fallback when the watcher misses an event.
Whichever way, the application code stays the same: it reads from settings.JWT_KEYS and trusts that someone (orchestration, watcher, manual call) keeps the dict current.
x-agent-secret rotation#
The agent-to-agent shared secret is even simpler — there’s no kid because the header value is the secret. Solution: accept a list of secrets.
# settings.py
AGENT_SECRETS: set[str] = {
    "current_value_set_at_2026_04_15",   # promoted today
    "previous_value_set_at_2026_03_01",  # being retired
}
AGENT_SECRET_CURRENT = "current_value_set_at_2026_04_15"

# agent auth helpers
import hmac


def verify_agent_secret(presented: str) -> bool:
    # Constant-time compare against any allowed value
    for allowed in settings.AGENT_SECRETS:
        if hmac.compare_digest(presented, allowed):
            return True
    return False


def outgoing_agent_secret() -> str:
    return settings.AGENT_SECRET_CURRENT
The orchestrator always sends the current secret; specialists accept either. The rotation script:
- Generate new secret. Add it to AGENT_SECRETS on every specialist (rolling restart or live config reload). Don’t change AGENT_SECRET_CURRENT yet.
- Once every specialist accepts the new secret, switch AGENT_SECRET_CURRENT on the orchestrator. The orchestrator now sends the new value.
- After the access-token TTL, remove the old secret from AGENT_SECRETS on the specialists.
hmac.compare_digest instead of == matters: the latter short-circuits on the first different byte, which leaks length and content via timing.
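A slightly more defensive variant of the check is sketched here. It is not the capstone's function: it compares bytes instead of str, because hmac.compare_digest raises TypeError when given a str containing non-ASCII characters, and it visits every allowed value instead of returning at the first match. The ALLOWED constant reuses the placeholder values from the settings example.

```python
import hmac

ALLOWED = frozenset({
    "current_value_set_at_2026_04_15",
    "previous_value_set_at_2026_03_01",
})


def verify_agent_secret(presented: str, allowed: frozenset[str] = ALLOWED) -> bool:
    """Compare against every allowed value without stopping at the first hit."""
    presented_bytes = presented.encode("utf-8")
    matched = False
    for candidate in allowed:
        # Bytes on both sides, so a non-ASCII header value cannot raise.
        if hmac.compare_digest(presented_bytes, candidate.encode("utf-8")):
            matched = True
    return matched


print(verify_agent_secret("previous_value_set_at_2026_03_01"))  # True
print(verify_agent_secret("sécret"))                            # False
```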
.NET note#
IConfiguration reload-on-change combined with IOptionsMonitor<JwtKeyringOptions> does the equivalent without a watcher. For Key Vault, the Azure.Extensions.AspNetCore.Configuration.Secrets package wires it up directly. The constant-time compare is CryptographicOperations.FixedTimeEquals — never string.Equals.
Putting it together#
Three checklists, copy-paste:
On every login flow:
- Access token is RS256 with a kid header.
- Refresh token is also a JWT with kid, hashed in the DB, and tied to a family_id.
- Logout revokes the entire family.
On every refresh:
- Verify with whatever key the kid selects.
- Look up the token by sha256 hash, in a transaction, with FOR UPDATE.
- If rotated_at is set on the presented token, revoke the entire family.
On every secret-rotation event:
- Add the new key to the verifier set first.
- Promote the signer once every verifier accepts the new key.
- Remove the old key only after the longest live token has expired.
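The three rotation steps can be rehearsed end to end with a dependency-free sketch. To keep it runnable on the standard library alone, HMAC-SHA256 with a shared secret stands in for RS256, so this is not the capstone's keyring; the Keyring class, key names, and messages are hypothetical.

```python
import hashlib
import hmac


class Keyring:
    """kid -> secret. HMAC-SHA256 stands in for RS256 in this sketch."""

    def __init__(self, keys: dict[str, bytes], current_kid: str) -> None:
        self.keys = dict(keys)
        self.current_kid = current_kid

    def sign(self, message: bytes) -> tuple[str, str]:
        """Always sign with the current key; return (kid, signature)."""
        key = self.keys[self.current_kid]
        return self.current_kid, hmac.new(key, message, hashlib.sha256).hexdigest()

    def verify(self, kid: str, message: bytes, signature: str) -> bool:
        """Accept any key still in the ring; reject unknown kids outright."""
        key = self.keys.get(kid)
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)


ring = Keyring({"A": b"key-a"}, current_kid="A")
kid_a, sig_a = ring.sign(b"token-1")

ring.keys["B"] = b"key-b"   # 1. add the new key to the verifier set first
ring.current_kid = "B"      # 2. promote the signer
kid_b, sig_b = ring.sign(b"token-2")
both_ok = ring.verify(kid_a, b"token-1", sig_a) and ring.verify(kid_b, b"token-2", sig_b)

del ring.keys["A"]          # 3. remove once every Key A token has expired
old_ok = ring.verify(kid_a, b"token-1", sig_a)
print(both_ok, old_ok)  # True False
```

During the overlap both tokens verify; after step 3 the token signed with Key A is rejected because its kid is no longer known.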
What changes for the capstone#
The capstone today (agents/python/shared/auth.py) ships the original Part 7 design — single JWT signing key, simple refresh, no reset flow. Phase 8 of the refactor plan tracks the migration:
- agents/python/shared/auth/keyring.py: JWT_KEYS map, decode_access_jwt, outgoing_agent_secret.
- migrations/0042_password_reset_and_refresh_families.sql: the two new tables and the partial indexes.
- agents/python/orchestrator/auth_routes.py: /auth/forgot-password, /auth/reset-password, refactored /auth/refresh.
None of this changes the agent code. Specialists keep using verify_agent_secret; tools keep reading current_user_email. The hardening lives entirely in the auth boundary.
What’s next#
That closes the production-hardening loop the original Part 7 left open. Ch21 — Putting it all together is the final chapter: a guided tour of the live e-commerce stack with file:line citations for every concept covered across the series.

