
MAF v1 — Streaming and Multi-turn (Python + .NET)

Nitin Kumar Singh
I build enterprise AI solutions and cloud-native systems. I write about architecture patterns, AI agents, Azure, and modern development practices — with full source code.

Series note — Part of MAF v1: Python and .NET. The frontend side of streaming is covered in the Python-only Part 6 — Frontend: Rich Cards and Streaming Responses. This chapter is the backend side — the agent producing the stream, and the session that makes follow-up turns actually work.

Repo — Runnable code for this chapter: tutorials/03-streaming-and-multiturn. Clone, cd in, follow along.

Why this chapter

Your Chapter 01 agent answers one question and exits. Every real chat product needs two things on top of that, and they sit on opposite axes:

  1. Streaming — tokens appear on screen as the LLM emits them. This is a UX latency concern. A non-streaming run waits for the full response before flushing anything; on gpt-4.1 that’s routinely 2–8 seconds of silence for a long answer, long enough that users start hammering the send button thinking it broke. Streaming flushes the first token in ~300ms — same total time, but the user sees motion immediately.
  2. Multi-turn — the LLM sees the previous turns when the user asks a follow-up. This is a state concern. Without a session, “How old is it?” has nothing to resolve “it” against. With one, the agent sees [turn1_user, turn1_assistant, turn2_user] and answers correctly.

These are orthogonal concepts — streaming is about when bytes leave the model, sessions are about what bytes went in. You can stream without sessions (stateless autocomplete), and you can have sessions without streaming (batch chatbot). Every interactive chat UI wants both, so this chapter teaches them together and the diagram makes their independence visible.

By the end you will have a REPL in each language that streams answers token-by-token and remembers what you said two turns ago.

Prerequisites

The concept

A streaming run swaps agent.run(q) / agent.RunAsync(q) — which returns a single AgentResponse once the LLM is done — for an async iterator of AgentResponseUpdate values. Each update is a delta: usually a text fragment, sometimes tool-call metadata, sometimes nothing (empty text, metadata-only). Concatenating the text / Text property on every update reconstructs the full answer.
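
A compressed Python sketch of the switch (the .NET spelling follows later in this chapter); assume this runs inside an async function:

# sketch: same agent, same prompt, only the mode differs
response = await agent.run(q)                  # one AgentResponse when the LLM finishes
stream = agent.run(q, stream=True)             # async iterator of AgentResponseUpdate
text = "".join([u.text async for u in stream if u.text])   # deltas reassemble the answer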

An AgentSession is MAF’s handle to conversation state. In both languages it’s the object you thread through successive run calls to make them a single conversation rather than N isolated ones. What it actually stores depends on the provider:

  • In-process (default): the full list of messages in memory. Throwing the session away resets context.
  • Service-backed (OpenAI Assistants, Foundry): a service_session_id / server-side thread id. The messages live on the server; the client holds a reference.
  • Persistent (Redis, Postgres — see Ch04): a session id keyed to rows in your store, rehydrated on every request.

For this chapter, default in-process is fine. Chapter 04 swaps in a durable backend.
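
A hypothetical sketch of what that swap looks like; the factory name is borrowed from the capstone’s get_history_provider(), and the constructor kwarg is an assumption (exact wiring varies by MAF version):

# sketch: assumed wiring, not verbatim MAF API
provider = get_history_provider(backend="memory")   # "redis" / "postgres" in Ch04
agent = Agent(client=client, instructions=INSTRUCTIONS, history_provider=provider)
# every agent.run(..., session=session) call stays identical regardless of backend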

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor': '#2563eb','primaryTextColor': '#ffffff','primaryBorderColor': '#1e40af', 'lineColor': '#64748b','secondaryColor': '#f59e0b','tertiaryColor': '#10b981', 'background': 'transparent'}}}%%
sequenceDiagram
    autonumber
    participant User as User
    participant Agent as Agent
    participant Session as AgentSession<br/>(history store)
    participant LLM as LLM<br/>(OpenAI / Azure)
    rect rgb(239, 246, 255)
        note over User,LLM: Turn 1 — "What is Python?"
        User->>Agent: run(q1, stream=True, session)
        Agent->>Session: read history (empty)
        Agent->>LLM: system + [q1]
        LLM-->>Agent: AgentResponseUpdate chunk 1
        Agent-->>User: chunk 1 (first token on screen)
        LLM-->>Agent: AgentResponseUpdate chunk 2..N
        Agent-->>User: chunks stream in
        Agent->>Session: append(q1, a1)
    end
    rect rgb(240, 253, 244)
        note over User,LLM: Turn 2 — "How old is it?" (same session)
        User->>Agent: run(q2, stream=True, session)
        Agent->>Session: read history (q1, a1)
        Agent->>LLM: system + [q1, a1, q2]
        LLM-->>Agent: chunk 1..N (resolves "it" = Python)
        Agent-->>User: "1991" streams in
        Agent->>Session: append(q2, a2)
    end

Read top-to-bottom. The blue band is turn 1, the green band is turn 2. Streaming is what the -->> arrows from LLM back to User do (many small deltas, not one big reply). Sessions are what the vertical Session participant does (read history in, append transcript at the end). They don’t depend on each other — either can be removed and the other still works — but combining them is what makes a chat UI feel like a conversation.

Why streaming matters

Two numbers. First-token latency on gpt-4.1 is routinely 300–600ms; full-answer latency for a 200-token response is 3–8 seconds. Without streaming the user waits the full 3–8s staring at a spinner. With streaming they see the first word at 300–600ms and the rest drips in. The total time is the same; the perceived time is roughly “first token” because humans read while content arrives. In user-study terms: non-streamed responses under 10s routinely get “it broke, I pressed send again” behaviour; streamed responses of the same duration do not.

Streaming is also a useful liveness signal. If the first token never arrives within 5s, the request is probably stuck upstream — kill it rather than wait for the full timeout. The capstone’s chat_stream route does exactly this with a wall-clock deadline, a per-response byte ceiling, and a client-disconnect probe (agents/python/tests/test_stream_backpressure.py:109-124). Three guards, because an abandoned generator left running costs money on every token.
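
A minimal sketch of the first guard, a first-token deadline, assuming this chapter’s agent.run(stream=True) surface (the helper name and the 5s default are illustrative):

# sketch: abort if the stream produces nothing within the deadline
import asyncio

async def stream_with_deadline(agent, question, session, first_token_s: float = 5.0):
    stream = agent.run(question, stream=True, session=session).__aiter__()
    try:
        first = await asyncio.wait_for(anext(stream), timeout=first_token_s)
    except (TimeoutError, asyncio.TimeoutError):
        raise RuntimeError("no first token within deadline; request stuck upstream")
    yield first                        # first delta made it, relay the rest as-is
    async for update in stream:
        yield update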

There are costs too. Streamed responses are harder to post-process (you either buffer before rendering, losing the latency win, or commit to incremental parsing). Tool calls that appear mid-stream need special handling — the delta carries structured data you can’t just concatenate. And some content safety filters only run at response boundaries, not per-chunk, which can surface content briefly before the filter catches up. None of these are showstoppers; all of them are things you’ll trip on if streaming is the only mode you know.

What AgentSession actually stores

In Python, AgentSession is a lightweight value object:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentSession:
    session_id: str | None = None
    service_session_id: str | None = None   # set when a provider stores the thread
    state: dict[str, Any] = field(default_factory=dict)

When you call agent.run(..., session=...), MAF pulls prior messages out of the session’s associated ChatHistoryProvider (in-memory by default), appends the new user message, hands everything to the chat client, then appends the assistant response back. Next call sees all of it. Swap the provider for Redis and nothing else changes — agent.run(...) stays identical.
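
In sketch form, with invented names (message helpers inlined as dicts; MAF’s actual internals differ), one session-backed run does conceptually this:

# sketch: the read-history / call / append cycle described above
async def run_with_session(chat_client, instructions, user_msg, provider, session_id):
    history = await provider.get_messages(session_id)          # prior turns
    messages = [{"role": "system", "content": instructions},
                *history,
                {"role": "user", "content": user_msg}]
    reply = await chat_client.complete(messages)               # full LLM round-trip
    await provider.append(session_id,
                          [{"role": "user", "content": user_msg}, reply])
    return reply                                               # next call sees it all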

In .NET the shape is the same — AgentSession holds a SessionId, an optional ServiceSessionId, and a StateBag for your own scratch data. await using var session = ... releases any server-side resources when the block exits.

A practical note: the session does not carry the system prompt. Instructions live on the agent; the session only accumulates user/assistant/tool messages. Changing Instructions between turns takes effect immediately because MAF re-injects them on every call. That’s a feature — you can run the same session against a second agent with different instructions and it will see the same conversation — but it also means you can’t “stamp” the prompt at session-creation time and expect it to persist. Always set instructions on the agent.
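
A sketch of that property using this chapter’s Python surface (the Agent constructor kwargs are assumptions):

# sketch: one session, two differently-prompted agents
pirate = Agent(client=client, instructions="Answer like a pirate.")
formal = Agent(client=client, instructions="Answer in formal English.")
session = pirate.create_session()
await pirate.run("What is Python?", session=session)
await formal.run("Summarise your last answer.", session=session)  # same history, new prompt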

Jargon recap

  • AgentResponseUpdate — one delta emitted by a streaming run. text / Text holds the fragment; some updates carry only tool-call metadata and have empty text.
  • SSE (Server-Sent Events) — the HTTP transport the frontend uses to consume the stream (text/event-stream, one data: ...\n\n frame per chunk). Orthogonal to MAF — the agent yields updates; the transport serialises them. See web/src/lib/api.ts:192-260 for the consumer side.
  • AgentSession — the handle that carries conversation state across run calls. Value object holding a session id, optional service-side id, and a state dict.
  • Streaming vs non-streaming run — same agent, same chat client, same prompt. The only difference is whether you call agent.run(q) (one AgentResponse) or agent.run(q, stream=True) / RunStreamingAsync(q) (async iterator of updates).

Full definitions in the jargon glossary.

Python

Full source: python/main.py. The streaming helper is four lines of real logic:

# python/main.py (excerpt)
from agent_framework import Agent, AgentSession
from agent_framework.openai import OpenAIChatClient

INSTRUCTIONS = "You are a concise assistant. Keep answers to one short paragraph."

async def stream_answer(agent: Agent, question: str, session: AgentSession) -> list[str]:
    chunks: list[str] = []
    async for update in agent.run(question, stream=True, session=session):
        if update.text:
            chunks.append(update.text)
            print(update.text, end="", flush=True)
    print()
    return chunks

async def chat(agent: Agent, questions: list[str]) -> list[list[str]]:
    session = agent.create_session()                 # one session, many turns
    all_chunks: list[list[str]] = []
    for q in questions:
        print(f"\nQ: {q}\nA: ", end="", flush=True)
        all_chunks.append(await stream_answer(agent, q, session))
    return all_chunks

Three things worth staring at:

  • agent.run(..., stream=True, session=session) is a single overloaded call — the same method as Chapter 01, two extra kwargs. Python returns an async iterator when stream=True is set; the type system hides the switch.
  • if update.text: skips empty updates. Tool-call-only and metadata-only updates carry text=""; printing them breaks the visual stream.
  • session = agent.create_session() is called once, outside the loop. Creating a fresh session per turn is how you accidentally get back to single-turn behaviour — the Gotchas section has a story about that.

Run it:

cd tutorials/03-streaming-and-multiturn/python
uv sync
uv run python main.py "What is Python in one line?" "How old is it? Answer with a year only."
#
# Q: What is Python in one line?
# A: Python is a high-level, interpreted programming language known for its readability...
#
# Q: How old is it? Answer with a year only.
# A: 1991

The second question never mentions “Python” — the agent resolves “it” because both turns share the session.

Interactive REPL mode (no args):

uv run python main.py
# Multi-turn chat (empty line to quit).
# Q: _

.NET

Full source: dotnet/Program.cs. Shape-for-shape identical to Python:

// dotnet/Program.cs (excerpt)
using Microsoft.Agents.AI;

public const string Instructions =
    "You are a concise assistant. Keep answers to one short paragraph.";

public static async Task<List<string>> StreamAnswer(AIAgent agent, string question, AgentSession thread)
{
    var chunks = new List<string>();
    await foreach (var update in agent.RunStreamingAsync(question, thread))
    {
        if (!string.IsNullOrEmpty(update.Text))
        {
            chunks.Add(update.Text);
            Console.Write(update.Text);
        }
    }
    Console.WriteLine();
    return chunks;
}

public static async Task<List<List<string>>> Chat(AIAgent agent, IReadOnlyList<string> questions)
{
    var thread = await agent.CreateSessionAsync();       // one session, many turns
    var allChunks = new List<List<string>>();
    foreach (var q in questions)
    {
        Console.WriteLine($"\nQ: {q}");
        Console.Write("A: ");
        allChunks.Add(await StreamAnswer(agent, q, thread));
    }
    return allChunks;
}

Two differences from Python:

  • RunStreamingAsync is a separate method, not a flag on RunAsync. Each returns a different static type (Task<AgentResponse> vs IAsyncEnumerable<AgentResponseUpdate>), which is how the C# type system expresses the mode switch.
  • CreateSessionAsync is awaited. For in-process providers the work is synchronous; the async signature exists because service-backed providers (Assistants API, Foundry) round-trip to the server to allocate a thread id.

Run it:

cd tutorials/03-streaming-and-multiturn/dotnet
dotnet run -- "What is Python in one line?" "How old is it? Answer with a year only."

Background Responses — the third mode

Streaming fixes latency for short-to-medium answers. For genuinely long work — a workflow that takes 30 seconds to a few minutes — you don’t want a connection open that whole time. Modern OpenAI / Azure OpenAI deployments expose a background responses mode: kick off the run, get back a handle, poll (or webhook) until it completes.

The shape in MAF terms:

  1. Start the run with the backgrounded flag. The call returns immediately with a response id.
  2. Store the id somewhere correlated to the user’s request (DB, Redis).
  3. Poll a GetResponse(id) endpoint or subscribe to a webhook for the completion event. Each poll returns the current state (queued, in_progress, completed, failed).
  4. On completion, pull the final AgentResponse and deliver it to the user out-of-band (email, push notification, a “new result ready” badge).
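
A sketch of steps 1 and 3 against the OpenAI Responses API directly; background=True and responses.retrieve are the OpenAI Python SDK’s surface, and MAF’s wrapper may differ:

# sketch: background kickoff + poll, OpenAI SDK names
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def run_in_background(prompt: str) -> str:
    resp = await client.responses.create(model="gpt-4.1", input=prompt, background=True)
    # step 2 happens here: persist resp.id keyed to the user's request
    while resp.status in ("queued", "in_progress"):
        await asyncio.sleep(2)
        resp = await client.responses.retrieve(resp.id)
    if resp.status != "completed":
        raise RuntimeError(f"background run ended as {resp.status}")
    return resp.output_text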

This is a good fit for research agents, deep-analysis workflows, batch report generation, and anything with human-review loops. It’s a bad fit for an interactive chat UI — users won’t wait a page reload for their next line. Default to streaming; reach for background responses only when the interaction is explicitly async.

The mental model is a three-mode axis of blocking time:

| Mode | Connection open for | Use when |
| --- | --- | --- |
| Non-streaming | Full response (3–8s typical) | CLI / batch / low-latency short answers |
| Streaming | Full response, but bytes flow (same wall-time) | Any interactive UI |
| Background | None after kickoff; poll separately | Minute-plus runs; async notification channels |

See the Agents — Background Responses page for the current API surface. At the time of writing (MAF v1.0), background responses are a Responses-API-only feature, so you need a deployment that exposes /responses (not just /chat/completions).

We won’t use it again in this chapter — noted here so you know the third mode exists. The capstone may adopt it for the inventory-rebuild workflow in a later iteration.

Side-by-side differences

| Aspect | Python | .NET |
| --- | --- | --- |
| Stream entry point | agent.run(..., stream=True) — same method, bool kwarg | agent.RunStreamingAsync(...) — separate method |
| Non-stream return | AgentResponse | AgentResponse |
| Stream return | AsyncIterable[AgentResponseUpdate] | IAsyncEnumerable<AgentResponseUpdate> |
| Chunk accessor | update.text | update.Text |
| Iteration | async for update in ... | await foreach (var update in ...) |
| Session creation | agent.create_session() (sync) | await agent.CreateSessionAsync() |
| Session type | AgentSession | AgentSession |
| Session disposal | GC when no reference held | await using var session = ... to release deterministically |
| Cancel mid-stream | asyncio task cancellation | CancellationToken on RunStreamingAsync |

The .NET await on session creation reflects a real difference: for service-backed providers (Assistants API, Foundry) the call reaches the server to allocate a thread. The Python API papers over this by exposing only the synchronous local case up front — service-backed sessions in Python come from get_history_provider() configured in your agent factory.
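
For the “cancel mid-stream” row, the Python side is ordinary asyncio task cancellation; a sketch reusing this chapter’s stream_answer:

# sketch: wrap the consumer in a Task; cancel() closes the generator
import asyncio

task = asyncio.create_task(stream_answer(agent, question, session))
# elsewhere, when the user hits Stop:
task.cancel()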

Gotchas
#

  • Session reset bug. session = agent.create_session() inside the turn loop creates a fresh session every turn — you get single-turn behaviour back. Always hoist it outside the loop. This hit me the first time writing the capstone’s chat route.
  • Empty updates are real. update.text / update.Text can be empty when the update carries only tool-call metadata, usage info, or a finish reason. Guarding with if update.text: is not defensive coding — it’s required.
  • Don’t append \n per chunk. Each chunk is a fragment (often mid-word). Concatenate, don’t linebreak. The one time you print a newline is after the stream ends.
  • Stream handle is hot. The async iterator starts producing as soon as you call run(stream=True) / RunStreamingAsync. Abandoning it without iterating wastes tokens. In the frontend we wrap it in an AbortController (web/src/app/(app)/chat/page.tsx:196-199) and abort on unmount, send, or navigation.
  • In-process sessions die with the process. The default in-memory provider means restarting your agent resets every conversation. Ch04 swaps in durable storage (Redis / Postgres).
  • Responses API vs Chat Completions. Sessions work on both, but with different semantics. Chat Completions sessions are always client-side (MAF replays history on every call). Responses API sessions can be server-side (a service_session_id / thread id handles replay for you). Pick one and stay consistent per agent.
  • Client disconnects mid-stream. The caller hitting Stop, or their laptop closing the lid, leaves your generator running. The capstone’s chat_stream route polls request.is_disconnected() each iteration and breaks out — see agents/python/tests/test_stream_backpressure.py:94-106 for the contract. A sketch of the probe follows this list.
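
A sketch of that probe in a FastAPI/Starlette route (the route shape is assumed; the capstone’s real route also layers on the timeout and max-bytes guards):

# sketch: break out of the generator when the client goes away
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    async def gen():
        # agent, question, session assumed in scope (see the Python section)
        async for update in agent.run(question, stream=True, session=session):
            if await request.is_disconnected():
                break                              # stop paying for abandoned tokens
            if update.text:
                yield f"data: {update.text}\n\n"   # one SSE frame per chunk
    return StreamingResponse(gen(), media_type="text/event-stream")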

Tests

# Python — 4 tests: streaming yields multiple chunks, multi-turn reuses
# session (second turn sees longer history), chunks reassemble to full
# text, and a real-LLM integration test that proves "1991" comes back
# on the follow-up turn.
cd tutorials/03-streaming-and-multiturn/python
uv sync
uv run pytest -v

# .NET — 3 integration tests: streaming yields >1 chunk on a long answer,
# multi-turn preserves context ("1991" in the follow-up), and two separate
# sessions do NOT share context.
cd tutorials/03-streaming-and-multiturn/dotnet
dotnet test tests/Streaming.Tests.csproj

The Python suite runs three fast unit tests against a StreamingCannedClient (a BaseChatClient subclass that yields a canned response in three chunks) plus one real-LLM integration test that skips cleanly when credentials are absent. .NET tests are all integration because faking MAF’s streaming pipeline convincingly requires reimplementing half of IChatClient. Both sides assert the “1991” answer on the second turn — the clearest proof that session state round-trips through the LLM.

Three tests are worth flagging as reusable patterns:

  1. Streaming yields multiple chunks. Python’s test_stream_yields_multiple_chunks asserts len(chunks) >= 2. This is the bare minimum “streaming happened” assertion — any single-chunk response is indistinguishable from non-streaming.
  2. Session reuse advances history. Python’s test_multiturn_reuses_session checks client.conversation_lengths[1] > client.conversation_lengths[0]. The fake client records how many messages it saw per call; a longer second call proves the session accumulated. This is the mechanical proof independent of LLM behaviour (sketched after this list).
  3. Separate sessions do NOT share context. .NET’s Separate_Sessions_Do_Not_Share_Context creates two sessions, teaches the first a word, asks the second what the word was, and asserts “none” in the reply. This is the negative assertion — sessions don’t leak across instances.
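
Pattern 2 in sketch form; conversation_lengths is the fake client’s bookkeeping as described below, and the fixture names are illustrative:

# sketch: mechanical session-accumulation proof, no real LLM needed
async def test_multiturn_reuses_session(agent, client):
    session = agent.create_session()
    await agent.run("first question", session=session)
    await agent.run("second question", session=session)
    # the second call must have been handed a longer conversation
    assert client.conversation_lengths[1] > client.conversation_lengths[0]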

Together those three cover the orthogonality claim from the concept section: streaming is per-call, sessions are per-conversation, and either can be swapped independently.

How this shows up in the capstone

  • Python orchestrator streaming. agents/python/shared/agent_host.py:66-84 is this chapter’s stream_answer function with two extra steps: it converts the incoming A2A history payload into Message objects, then delegates to agent.run(messages, stream=True). The /api/chat/stream route wraps the generator in the SSE transport plus the three backpressure guards (disconnect, timeout, max-bytes) tested in agents/python/tests/test_stream_backpressure.py:1-75.
  • Legacy custom loop. The older non-MAF path at agents/python/shared/agent_host.py:203-249 runs when MAF_NATIVE_EXECUTION=false. It exists only as a rollback switch — same observable output, different innards (OpenAI chat-completions directly, tool-calling loop by hand). Every specialist agent runs the MAF-native path in production.
  • Session reconstruction. agents/python/shared/session.py:224-229 (session_from_id) is how the orchestrator rebinds an in-flight conversation to its persistent session id — the same pattern Ch04 turns into a full durable store. History injection into the first LLM call happens in _history_as_maf_messages at agents/python/shared/agent_host.py:44-63.
  • Frontend SSE consumer. web/src/lib/api.ts:192-260 reads the stream frame-by-frame and invokes onChunk(text) per delta. The chat page at web/src/app/(app)/chat/page.tsx:196-199 owns a per-message AbortController so the user hitting Stop or navigating away terminates the stream cleanly.
  • .NET session providers. agents/dotnet/src/ECommerceAgents.Shared/Sessions/SessionProviderFactory.cs keys off MAF_SESSION_BACKEND and returns a matching ISessionHistoryProvider (memory, file, postgres). Mirrors the Python get_history_provider() factory. The concrete providers live next to the factory in the same folder.

Further reading & links

This chapter

Microsoft Agent Framework docs

Where it lives in the capstone

  • Python streaming: agents/python/shared/agent_host.py:66-84 · legacy loop: agents/python/shared/agent_host.py:203-249
  • Python sessions: agents/python/shared/session.py:224-229
  • Python backpressure tests: agents/python/tests/test_stream_backpressure.py:1-75
  • Frontend SSE: web/src/lib/api.ts:192-260 · web/src/app/(app)/chat/page.tsx:196-199
  • .NET sessions: agents/dotnet/src/ECommerceAgents.Shared/Sessions/

Series shared resources

What’s next

Chapter 04 — Sessions and Memory keeps the AgentSession from this chapter but makes it survive a process restart — Redis and Postgres ChatHistoryProvider implementations, plus the trade-offs between them.

