Series note — Part of MAF v1: Python and .NET. The frontend-side of streaming is covered in the Python-only Part 6 — Frontend: Rich Cards and Streaming Responses. This chapter is the backend side — the agent producing the stream, and the session that makes follow-up turns actually work.
Repo — Runnable code for this chapter: tutorials/03-streaming-and-multiturn. Clone, `cd` in, follow along.
Why this chapter#
Your Chapter 01 agent answers one question and exits. Every real chat product needs two things on top of that, and they sit on opposite axes:
- Streaming — tokens appear on screen as the LLM emits them. This is a UX latency concern. A non-streaming run waits for the full response before flushing anything; on gpt-4.1 that’s routinely 2–8 seconds of silence for a long answer, long enough that users start hammering the send button thinking it broke. Streaming flushes the first token in ~300ms — same total time, but the user sees motion immediately.
- Multi-turn — the LLM sees the previous turns when the user asks a follow-up. This is a state concern. Without a session, “How old is it?” has nothing to resolve “it” against. With one, the agent sees `[turn1_user, turn1_assistant, turn2_user]` and answers correctly.
These are orthogonal concepts — streaming is about when bytes leave the model, sessions are about what bytes went in. You can stream without sessions (stateless autocomplete), and you can have sessions without streaming (batch chatbot). Every interactive chat UI wants both, so this chapter teaches them together and the diagram makes their independence visible.
By the end you will have a REPL in each language that streams answers token-by-token and remembers what you said two turns ago.
Prerequisites#
- Completed Chapter 02 — Adding Tools.
- `.env` at the repo root with either `OPENAI_API_KEY` or the Azure OpenAI trio.
- Read-first (optional): Agents — Running Agents, Agents — Conversations, and the Get Started — Multi-turn quickstart.
The concept#
A streaming run swaps agent.run(q) / agent.RunAsync(q) — which returns a single AgentResponse once the LLM is done — for an async iterator of AgentResponseUpdate values. Each update is a delta: usually a text fragment, sometimes tool-call metadata, sometimes nothing (empty text, metadata-only). Concatenating the text / Text property on every update reconstructs the full answer.
An AgentSession is MAF’s handle to conversation state. In both languages it’s the object you thread through successive run calls to make them a single conversation rather than N isolated ones. What it actually stores depends on the provider:
- In-process (default): the full list of messages in memory. Throwing the session away resets context.
- Service-backed (OpenAI Assistants, Foundry): a `service_session_id` / server-side thread id. The messages live on the server; the client holds a reference.
- Persistent (Redis, Postgres — see Ch04): a session id keyed to rows in your store, rehydrated on every request.
For this chapter, default in-process is fine. Chapter 04 swaps in a durable backend.
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant Session as Session<br/>(history store)
    participant LLM as LLM<br/>(OpenAI / Azure)
    rect rgb(239, 246, 255)
        note over User,LLM: Turn 1 — "What is Python?"
        User->>Agent: run(q1, stream=True, session)
        Agent->>Session: read history (empty)
        Agent->>LLM: system + [q1]
        LLM-->>Agent: AgentResponseUpdate chunk 1
        Agent-->>User: chunk 1 (first token on screen)
        LLM-->>Agent: AgentResponseUpdate chunk 2..N
        Agent-->>User: chunks stream in
        Agent->>Session: append(q1, a1)
    end
    rect rgb(240, 253, 244)
        note over User,LLM: Turn 2 — "How old is it?" (same session)
        User->>Agent: run(q2, stream=True, session)
        Agent->>Session: read history (q1, a1)
        Agent->>LLM: system + [q1, a1, q2]
        LLM-->>Agent: chunk 1..N (resolves "it" = Python)
        Agent-->>User: "1991" streams in
        Agent->>Session: append(q2, a2)
    end
```
Read top-to-bottom. The blue band is turn 1, the green band is turn 2. Streaming is what the -->> arrows from LLM back to User do (many small deltas, not one big reply). Sessions are what the vertical Session participant does (read history in, append transcript at the end). They don’t depend on each other — either can be removed and the other still works — but combining them is what makes a chat UI feel like a conversation.
Why streaming matters#
Two numbers. First-token latency on gpt-4.1 is routinely 300–600ms; full-answer latency for a 200-token response is 3–8 seconds. Without streaming the user waits the full 3–8s staring at a spinner. With streaming they see the first word at 300–600ms and the rest drips in. The total time is the same; the perceived time is roughly “first token” because humans read while content arrives. In user-study terms: non-streamed responses under 10s routinely get “it broke, I pressed send again” behaviour; streamed responses of the same duration do not.
Streaming is also a useful operational signal. If the first token never arrives within 5s, the request is probably stuck upstream — kill it rather than wait for the full timeout. The capstone’s chat_stream route does exactly this with a wall-clock deadline, a per-response byte ceiling, and a client-disconnect probe (agents/python/tests/test_stream_backpressure.py:109-124). Three guards, because an abandoned generator left running costs money on every token.
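Those three guards generalise beyond the capstone. Here is a minimal sketch with no MAF dependency — `guarded_stream`, the injected `is_disconnected` callable, and the default limits are all illustrative names and numbers, not the capstone’s actual API:

```python
import time
from collections.abc import AsyncIterator, Awaitable, Callable

async def guarded_stream(
    chunks: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
    deadline_s: float = 30.0,
    max_bytes: int = 64_000,
) -> AsyncIterator[str]:
    start = time.monotonic()
    sent = 0
    async for chunk in chunks:
        if time.monotonic() - start > deadline_s:
            break  # wall-clock deadline: stop paying for a stuck upstream
        if await is_disconnected():
            break  # client went away: don't leave the generator running
        sent += len(chunk.encode())
        if sent > max_bytes:
            break  # byte ceiling: cap runaway responses
        yield chunk
```

Each guard is a `break`, not an exception — the stream ends early and cleanup happens in one place, which is exactly why abandoned generators stop costing tokens.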
There are costs too. Streamed responses are harder to post-process (you either buffer before rendering, losing the latency win, or commit to incremental parsing). Tool calls that appear mid-stream need special handling — the delta carries structured data you can’t just concatenate. And some content safety filters only run at response boundaries, not per-chunk, which can surface content briefly before the filter catches up. None of these are showstoppers; all of them are things you’ll trip on if streaming is the only mode you know.
What AgentSession actually stores#
In Python, AgentSession is a lightweight value object:
```python
@dataclass
class AgentSession:
    session_id: str | None = None
    service_session_id: str | None = None  # set when a provider stores the thread
    state: dict[str, Any] = field(default_factory=dict)
```

When you call `agent.run(..., session=...)`, MAF pulls prior messages out of the session’s associated ChatHistoryProvider (in-memory by default), appends the new user message, hands everything to the chat client, then appends the assistant response back. Next call sees all of it. Swap the provider for Redis and nothing else changes — `agent.run(...)` stays identical.
In .NET the shape is the same — AgentSession holds a SessionId, an optional ServiceSessionId, and a StateBag for your own scratch data. await using var session = ... releases any server-side resources when the block exits.
A practical note: the session does not carry the system prompt. Instructions live on the agent; the session only accumulates user/assistant/tool messages. Changing Instructions between turns takes effect immediately because MAF re-injects them on every call. That’s a feature — you can run the same session against a second agent with different instructions and it will see the same conversation — but it also means you can’t “stamp” the prompt at session-creation time and expect it to persist. Always set instructions on the agent.
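A toy model makes the rule concrete. These are plain dataclasses, not MAF’s real classes — the point is only that the system prompt is assembled from the agent at call time, so two agents sharing one session see the same history under different instructions:

```python
from dataclasses import dataclass, field

@dataclass
class ToySession:
    messages: list = field(default_factory=list)  # user/assistant turns only

@dataclass
class ToyAgent:
    instructions: str
    def build_payload(self, session: ToySession, user_msg: str) -> list:
        # The system prompt comes from the agent on every call;
        # the session contributes only the accumulated conversation.
        return ([{"role": "system", "content": self.instructions}]
                + session.messages
                + [{"role": "user", "content": user_msg}])

session = ToySession(messages=[
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "A language."},
])
terse = ToyAgent("Answer in one word.")
verbose = ToyAgent("Answer in detail.")
# Same history, different system prompt — the session never stored one.
assert (terse.build_payload(session, "How old is it?")[1:]
        == verbose.build_payload(session, "How old is it?")[1:])
```

Change `instructions` between turns and the next `build_payload` reflects it immediately — the same behaviour the paragraph above describes for MAF.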
Jargon recap#
- `AgentResponseUpdate` — one delta emitted by a streaming run. `text` / `Text` holds the fragment; some updates carry only tool-call metadata and have empty text.
- SSE (Server-Sent Events) — the HTTP transport the frontend uses to consume the stream (`text/event-stream`, one `data: ...\n\n` frame per chunk). Orthogonal to MAF — the agent yields updates; the transport serialises them. See `web/src/lib/api.ts:192-260` for the consumer side.
- `AgentSession` — the handle that carries conversation state across `run` calls. Value object holding a session id, optional service-side id, and a `state` dict.
- Streaming vs non-streaming run — same agent, same chat client, same prompt. The only difference is whether you call `agent.run(q)` (one `AgentResponse`) or `agent.run(q, stream=True)` / `RunStreamingAsync(q)` (async iterator of updates).
Full definitions in the jargon glossary.
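The SSE serialisation mentioned in the glossary is small enough to show whole. A sketch — not the capstone’s actual serialiser — of one `data:` line per delta, blank-line terminated, JSON-wrapped so a newline inside a chunk can’t break the framing:

```python
import json

def to_sse_frame(chunk: str) -> str:
    # One SSE frame per delta: "data: <json payload>\n\n".
    # JSON-encoding escapes embedded newlines, which would otherwise
    # terminate the frame early.
    return f"data: {json.dumps({'text': chunk})}\n\n"
```

A streaming route then just yields `to_sse_frame(update.text)` for every non-empty update and sets `Content-Type: text/event-stream`.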
Python#
Full source: python/main.py. The streaming helper is four lines of real logic:
```python
# python/main.py (excerpt)
from agent_framework import Agent, AgentSession
from agent_framework.openai import OpenAIChatClient, OpenAIChatCompletionClient

INSTRUCTIONS = "You are a concise assistant. Keep answers to one short paragraph."

async def stream_answer(agent: Agent, question: str, session: AgentSession) -> list[str]:
    chunks: list[str] = []
    async for update in agent.run(question, stream=True, session=session):
        if update.text:
            chunks.append(update.text)
            print(update.text, end="", flush=True)
    print()
    return chunks

async def chat(agent: Agent, questions: list[str]) -> list[list[str]]:
    session = agent.create_session()  # one session, many turns
    all_chunks: list[list[str]] = []
    for q in questions:
        print(f"\nQ: {q}\nA: ", end="", flush=True)
        all_chunks.append(await stream_answer(agent, q, session))
    return all_chunks
```

Three things worth staring at:

- `agent.run(..., stream=True, session=session)` is a single overloaded call — the same method as Chapter 01, two extra kwargs. Python returns an async iterator when `stream=True` is set; the type system hides the switch.
- `if update.text:` skips empty updates. Tool-call-only and metadata-only updates carry `text=""`; printing them breaks the visual stream.
- `session = agent.create_session()` is called once, outside the loop. Creating a fresh session per turn is how you accidentally get back to single-turn behaviour — the Gotchas section has a story about that.
Run it:
```shell
cd tutorials/03-streaming-and-multiturn/python
uv sync
uv run python main.py "What is Python in one line?" "How old is it? Answer with a year only."
#
# Q: What is Python in one line?
# A: Python is a high-level, interpreted programming language known for its readability...
#
# Q: How old is it? Answer with a year only.
# A: 1991
```

The second turn never says “Python” — the agent resolves “it” because both turns share the session.

Interactive REPL mode (no args):

```shell
uv run python main.py
# Multi-turn chat (empty line to quit).
# Q: _
```

.NET#
Full source: dotnet/Program.cs. Shape-for-shape identical to Python:
```csharp
// dotnet/Program.cs (excerpt)
using Microsoft.Agents.AI;

public const string Instructions =
    "You are a concise assistant. Keep answers to one short paragraph.";

public static async Task<List<string>> StreamAnswer(AIAgent agent, string question, AgentSession thread)
{
    var chunks = new List<string>();
    await foreach (var update in agent.RunStreamingAsync(question, thread))
    {
        if (!string.IsNullOrEmpty(update.Text))
        {
            chunks.Add(update.Text);
            Console.Write(update.Text);
        }
    }
    Console.WriteLine();
    return chunks;
}

public static async Task<List<List<string>>> Chat(AIAgent agent, IReadOnlyList<string> questions)
{
    var thread = await agent.CreateSessionAsync(); // one session, many turns
    var allChunks = new List<List<string>>();
    foreach (var q in questions)
    {
        Console.WriteLine($"\nQ: {q}");
        Console.Write("A: ");
        allChunks.Add(await StreamAnswer(agent, q, thread));
    }
    return allChunks;
}
```

Two differences from Python:

- `RunStreamingAsync` is a separate method, not a flag on `RunAsync`. Each returns a different static type (`Task<AgentResponse>` vs `IAsyncEnumerable<AgentResponseUpdate>`), which is how the C# type system expresses the mode switch.
- `CreateSessionAsync` is awaited. For in-process providers the work is synchronous; the async signature exists because service-backed providers (Assistants API, Foundry) round-trip to the server to allocate a thread id.
Run it:
```shell
cd tutorials/03-streaming-and-multiturn/dotnet
dotnet run -- "What is Python in one line?" "How old is it? Answer with a year only."
```

Background Responses — the third mode#
Streaming fixes latency for short-to-medium answers. For genuinely long work — a workflow that takes 30 seconds to a few minutes — you don’t want a connection open that whole time. Modern OpenAI / Azure OpenAI deployments expose a background responses mode: kick off the run, get back a handle, poll (or webhook) until it completes.
The shape in MAF terms:
- Start the run with the backgrounded flag. The call returns immediately with a response id.
- Store the id somewhere correlated to the user’s request (DB, Redis).
- Poll a `GetResponse(id)` endpoint or subscribe to a webhook for the completion event. Each poll returns the current state (`queued`, `in_progress`, `completed`, `failed`).
- On completion, pull the final `AgentResponse` and deliver it to the user out-of-band (email, push notification, a “new result ready” badge).
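The polling steps above reduce to a small loop. This is a sketch under assumptions — `get_response` stands in for whatever `GetResponse(id)` client call your deployment exposes; only the state names come from the list above:

```python
import time

def wait_for_completion(get_response, response_id: str,
                        poll_s: float = 2.0, timeout_s: float = 300.0) -> dict:
    """Poll until the background run reaches a terminal state, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = get_response(response_id)  # placeholder for the real client call
        if resp["status"] in ("completed", "failed"):
            return resp
        time.sleep(poll_s)  # still queued / in_progress — wait and re-poll
    raise TimeoutError(f"response {response_id} not done after {timeout_s}s")
```

In production you would run this in a worker (Celery, a hosted function) keyed by the stored response id, not in the request handler — the whole point is that no HTTP connection stays open.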
This is a good fit for research agents, deep-analysis workflows, batch report generation, and anything with human-review loops. It’s a bad fit for an interactive chat UI — users won’t wait a page reload for their next line. Default to streaming; reach for background responses only when the interaction is explicitly async.
The mental model is a three-mode axis of blocking time:
| Mode | Connection open for | Use when |
|---|---|---|
| Non-streaming | Full response (3–8s typical) | CLI / batch / low-latency short answers |
| Streaming | Full response, but bytes flow (same wall-time) | Any interactive UI |
| Background | None after kickoff; poll separately | Minute-plus runs; async notification channels |
See the Agents — Background Responses page for the current API surface. At the time of writing (MAF v1.0), background responses are a Responses-API-only feature, so you need a deployment that exposes /responses (not just /chat/completions).
We won’t use it again in this chapter — noted here so you know the third mode exists. The capstone may adopt it for the inventory-rebuild workflow in a later iteration.
Side-by-side differences#
| Aspect | Python | .NET |
|---|---|---|
| Stream entry point | agent.run(..., stream=True) — same method, bool kwarg | agent.RunStreamingAsync(...) — separate method |
| Non-stream return | AgentResponse | AgentResponse |
| Stream return | AsyncIterable[AgentResponseUpdate] | IAsyncEnumerable<AgentResponseUpdate> |
| Chunk accessor | update.text | update.Text |
| Iteration | async for update in ... | await foreach (var update in ...) |
| Session creation | agent.create_session() (sync) | await agent.CreateSessionAsync() |
| Session type | AgentSession | AgentSession |
| Session disposal | GC when no reference held | await using var session = ... to release deterministically |
| Cancel mid-stream | asyncio task cancellation | CancellationToken on RunStreamingAsync |
The .NET await on session creation reflects a real difference: for service-backed providers (Assistants API, Foundry) the call reaches the server to allocate a thread. The Python API papers over this by exposing only the synchronous local case up front — service-backed sessions in Python come from get_history_provider() configured in your agent factory.
Gotchas#
- Session reset bug. `session = agent.create_session()` inside the turn loop creates a fresh session every turn — you get single-turn behaviour back. Always hoist it outside the loop. This hit me first time writing the capstone’s chat route.
- Empty updates are real. `update.text` / `update.Text` can be empty when the update carries only tool-call metadata, usage info, or a finish reason. Guarding with `if update.text:` is not defensive coding — it’s required.
- Don’t append `\n` per chunk. Each chunk is a fragment (often mid-word). Concatenate, don’t linebreak. The one time you print a newline is after the stream ends.
- Stream handle is hot. The async iterator starts producing as soon as you call `run(stream=True)` / `RunStreamingAsync`. Abandoning it without iterating wastes tokens. In the frontend we wrap it in an `AbortController` (`web/src/app/(app)/chat/page.tsx:196-199`) and abort on unmount, send, or navigation.
- In-process sessions die with the process. The default in-memory provider means restarting your agent resets every conversation. Ch04 swaps in durable storage (Redis / Postgres).
- Responses API vs Chat Completions. Sessions work on both, but with different semantics. Chat Completions sessions are always client-side (MAF replays history on every call). Responses API sessions can be server-side (a `service_session_id` / thread id handles replay for you). Pick one and stay consistent per agent.
- Client disconnects mid-stream. The caller hitting Stop, or their laptop closing the lid, leaves your generator running. The capstone’s `chat_stream` route polls `request.is_disconnected()` each iteration and breaks out — see `agents/python/tests/test_stream_backpressure.py:94-106` for the contract.
Tests#
```shell
# Python — 4 tests: streaming yields multiple chunks, multi-turn reuses
# session (second turn sees longer history), chunks reassemble to full
# text, and a real-LLM integration test that proves "1991" comes back
# on the follow-up turn.
cd tutorials/03-streaming-and-multiturn/python
uv sync
uv run pytest -v

# .NET — 3 integration tests: streaming yields >1 chunk on a long answer,
# multi-turn preserves context ("1991" in the follow-up), and two separate
# sessions do NOT share context.
cd tutorials/03-streaming-and-multiturn/dotnet
dotnet test tests/Streaming.Tests.csproj
```

The Python suite runs three fast unit tests against a `StreamingCannedClient` (a `BaseChatClient` subclass that yields a canned response in three chunks) plus one real-LLM integration test that skips cleanly when credentials are absent. .NET tests are all integration because faking MAF’s streaming pipeline convincingly requires reimplementing half of `IChatClient`. Both sides assert the “1991” answer on the second turn — the clearest proof that session state round-trips through the LLM.
Three tests are worth flagging as reusable patterns:
- Streaming yields multiple chunks. Python’s `test_stream_yields_multiple_chunks` asserts `len(chunks) >= 2`. This is the bare minimum “streaming happened” assertion — any single-chunk response is indistinguishable from non-streaming.
- Session reuse advances history. Python’s `test_multiturn_reuses_session` checks `client.conversation_lengths[1] > client.conversation_lengths[0]`. The fake client records how many messages it saw per call; a longer second call proves the session accumulated. This is the mechanical proof independent of LLM behaviour.
- Separate sessions do NOT share context. .NET’s `Separate_Sessions_Do_Not_Share_Context` creates two sessions, teaches the first a word, asks the second what the word was, and asserts “none” in the reply. This is the negative assertion — sessions don’t leak across instances.
Together those three cover the orthogonality claim from the concept section: streaming is per-call, sessions are per-conversation, and either can be swapped independently.
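The fake-client pattern behind the second test is worth keeping as a template. A minimal sketch — this `StreamingCannedClient` is a toy modelled on the one described above, not the repo’s actual class:

```python
class StreamingCannedClient:
    """Yields a canned reply in fixed chunks and records, per call,
    how many messages it was given — enough to prove session growth
    without touching a real LLM."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.conversation_lengths = []  # one entry per call

    async def stream(self, messages):
        self.conversation_lengths.append(len(messages))
        for chunk in self.chunks:
            yield chunk
```

A test then drives two turns through a shared session and asserts `conversation_lengths[1] > conversation_lengths[0]` — deterministic, fast, and independent of model behaviour.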
How this shows up in the capstone#
- Python orchestrator streaming. `agents/python/shared/agent_host.py:66-84` is this chapter’s `stream_answer` function with two extra steps: it converts the incoming A2A history payload into `Message` objects, then delegates to `agent.run(messages, stream=True)`. The `/api/chat/stream` route wraps the generator in the SSE transport plus the three backpressure guards (disconnect, timeout, max-bytes) tested in `agents/python/tests/test_stream_backpressure.py:1-75`.
- Legacy custom loop. The older non-MAF path at `agents/python/shared/agent_host.py:203-249` runs when `MAF_NATIVE_EXECUTION=false`. It exists only as a rollback switch — same observable output, different innards (OpenAI chat-completions directly, tool-calling loop by hand). Every specialist agent runs the MAF-native path in production.
- Session reconstruction. `agents/python/shared/session.py:224-229` (`session_from_id`) is how the orchestrator rebinds an in-flight conversation to its persistent session id — the same pattern Ch04 turns into a full durable store. History injection into the first LLM call happens in `_history_as_maf_messages` at `agents/python/shared/agent_host.py:44-63`.
- Frontend SSE consumer. `web/src/lib/api.ts:192-260` reads the stream frame-by-frame and invokes `onChunk(text)` per delta. The chat page at `web/src/app/(app)/chat/page.tsx:196-199` owns a per-message `AbortController` so the user hitting Stop or navigating away terminates the stream cleanly.
- .NET session providers. `agents/dotnet/src/ECommerceAgents.Shared/Sessions/SessionProviderFactory.cs` keys off `MAF_SESSION_BACKEND` and returns a matching `ISessionHistoryProvider` (`memory`, `file`, `postgres`). Mirrors the Python `get_history_provider()` factory. The concrete providers live next to the factory in the same folder.
Further reading & links#
This chapter
- Canonical article: nitinksingh.com/posts/maf-v1-03-streaming-and-multiturn/
- Source on GitHub: tutorials/03-streaming-and-multiturn
- Previous: Chapter 02 — Adding Tools · Next: Chapter 04 — Sessions and Memory
Microsoft Agent Framework docs
- Agents — Running Agents (streaming, cancellation, options)
- Agents — Conversations (sessions, history providers)
- Get Started — Multi-turn
- Agents — Background Responses (the async/polling third mode)
Where it lives in the capstone
- Python streaming: `agents/python/shared/agent_host.py:66-84` · legacy loop: `agents/python/shared/agent_host.py:203-249`
- Python sessions: `agents/python/shared/session.py:224-229`
- Python backpressure tests: `agents/python/tests/test_stream_backpressure.py:1-75`
- Frontend SSE: `web/src/lib/api.ts:192-260` · `web/src/app/(app)/chat/page.tsx:196-199`
- .NET sessions: `agents/dotnet/src/ECommerceAgents.Shared/Sessions/`
Series shared resources
What’s next#
Chapter 04 — Sessions and Memory keeps the AgentSession from this chapter but makes it survive a process restart — Redis and Postgres ChatHistoryProvider implementations, plus the trade-offs between them.