feat: drive interactive skills via an LLM responder (#303) by adamdougal · Pull Request #304 · microsoft/waza

adamdougal · 2026-05-29T15:53:42Z

Summary

Adds a responder — an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under inputs.responder.

Related issue

Closes #303

Agent handoff

Scope: New responder feature end-to-end — config + validation, classifier, orchestration loop, outcome surfacing through the web API and dashboard, JSON schema, and docs.
Key files changed: internal/models/testcase.go & outcome.go (config + ResponderInfo), internal/responder/responder.go (classifier with persistent session + teardown), internal/orchestration/runner.go (executeResponderLoop, injectable classifier factory), internal/execution/copilot.go (DeleteSession), internal/webapi/{types,store}.go, web/src/components/RunDetail.tsx & api/client.ts (responder badge), schemas/task.schema.json, README + site/ docs.
Important decisions: inputs.responder is a sibling of follow_up_prompts and mutually exclusive with it; responder runs in a separate, non-ephemeral Copilot session with explicit Close() teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists; model is optional and defaults to the eval's config.model; each task builds its own classifier (concurrency-safe).
Follow-ups or known gaps: The completed outcome value is effectively unreachable in practice (a self-initiated stop returns stopped); left as-is and considered acceptable.
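
To make the config decisions above concrete, here is a minimal sketch of load-time validation and model defaulting. It assumes simplified, hypothetical types (`taskInputs`, a cut-down `ResponderConfig`) and an illustrative error message; the real structs in internal/models/testcase.go carry more fields and may validate differently.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponderConfig is a cut-down, hypothetical version of the inputs.responder block.
type ResponderConfig struct {
	Model        string // optional; empty means "use the eval's config.model"
	MaxFollowups int    // cap on responder-driven turns
}

// taskInputs is a hypothetical stand-in for the task's inputs section.
type taskInputs struct {
	FollowUpPrompts []string
	Responder       *ResponderConfig
}

// Validate rejects tasks that configure both follow_up_prompts and responder.
func (in taskInputs) Validate() error {
	if in.Responder != nil && len(in.FollowUpPrompts) > 0 {
		return errors.New("inputs.responder and inputs.follow_up_prompts are mutually exclusive")
	}
	return nil
}

// EffectiveModel falls back to the eval-level model when the responder sets none.
func (c ResponderConfig) EffectiveModel(evalModel string) string {
	if c.Model != "" {
		return c.Model
	}
	return evalModel
}

func main() {
	both := taskInputs{
		FollowUpPrompts: []string{"yes, continue"},
		Responder:       &ResponderConfig{MaxFollowups: 3},
	}
	fmt.Println("both set:", both.Validate())

	only := taskInputs{Responder: &ResponderConfig{MaxFollowups: 3}}
	fmt.Println("responder only:", only.Validate())
	fmt.Println("model:", only.Responder.EffectiveModel("eval-model"))
}
```

Failing at load time rather than in the runner means a misconfigured task is rejected before any agent session is started.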

Type of change

Validation

go test ./...
make lint or golangci-lint run
Docs site checked, if docs changed
Web/dashboard checks run, if web/ changed
Manual validation completed: run-detail Playwright e2e (chromium) 5/5 passing
Not applicable; reason:

Documentation

README updated, if user-facing behavior changed
site/ docs updated, if CLI, YAML, dashboard, or validator behavior changed
Examples updated, if relevant
Not applicable

Risk and rollback

Risk level: Low
Rollback plan: Feature is fully additive and gated on the new optional inputs.responder field — tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.

Notes for reviewers

The responder's session lifecycle is the area most worth a close look: Classify lazily creates a persistent session on the first call and resumes it thereafter, and executeResponderLoop defers Close() with a detached 30s context so teardown still runs on cancellation. CopilotEngine.DeleteSession removes the session from both e.sessions and e.usageCollectors (the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation between responder and follow_up_prompts, and that the orchestration branch gives Responder precedence over FollowUps.
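
The detached-teardown pattern described above could look roughly like the sketch below. It assumes Go 1.21+ for `context.WithoutCancel`; the PR may derive its cleanup context differently (for example from `context.Background()`), and `fakeClassifier`, `runLoop`, and `teardownSurvivesCancel` are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeClassifier is a hypothetical stand-in for the responder classifier.
type fakeClassifier struct {
	closedWithLiveCtx bool
}

// Close records whether teardown was handed a context that is still usable.
func (f *fakeClassifier) Close(ctx context.Context) error {
	f.closedWithLiveCtx = ctx.Err() == nil
	return ctx.Err()
}

// runLoop stands in for the responder loop. Teardown is deferred on a context
// that ignores the caller's cancellation but is still bounded by a timeout.
func runLoop(ctx context.Context, c *fakeClassifier) {
	defer func() {
		cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 30*time.Second)
		defer cancel()
		_ = c.Close(cleanupCtx)
	}()
	<-ctx.Done() // placeholder for the follow-up loop being cancelled
}

// teardownSurvivesCancel runs the loop under an already-cancelled context and
// reports whether Close still received a live context.
func teardownSurvivesCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	c := &fakeClassifier{}
	runLoop(ctx, c)
	return c.closedWithLiveCtx
}

func main() {
	fmt.Println("teardown saw a live context:", teardownSurvivesCancel())
}
```

Passing the already-cancelled `ctx` straight to `Close` would make any context-aware session deletion fail immediately, which is the situation the detached context avoids.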

Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
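
A rough sketch of the persistent-session behavior this commit describes, using a hypothetical in-memory `fakeEngine` rather than the real Copilot engine: the first `Classify` call creates a session, later calls resume it, the session id is captured before the error check, and `Close` deletes the session explicitly.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeEngine is a hypothetical in-memory engine that tracks live sessions.
type fakeEngine struct {
	created int
	live    map[string]bool
}

// Execute starts a new session when sessionID is empty and resumes it otherwise.
// It always returns the session id it used, even alongside an error.
func (e *fakeEngine) Execute(sessionID, prompt string) (string, error) {
	if sessionID == "" {
		e.created++
		sessionID = fmt.Sprintf("session-%d", e.created)
		e.live[sessionID] = true
		return sessionID, nil
	}
	if !e.live[sessionID] {
		return sessionID, errors.New("session not found")
	}
	return sessionID, nil
}

// DeleteSession removes a session so it no longer counts as live.
func (e *fakeEngine) DeleteSession(id string) { delete(e.live, id) }

// classifier lazily creates one session and reuses it across turns.
type classifier struct {
	engine    *fakeEngine
	sessionID string
}

// Classify sends one question through the (possibly new) session.
func (c *classifier) Classify(question string) error {
	id, err := c.engine.Execute(c.sessionID, question)
	// Capture the id before checking err so a failed turn still remembers the session.
	if id != "" {
		c.sessionID = id
	}
	return err
}

// Close tears the session down explicitly; it is safe to call when no session exists.
func (c *classifier) Close() {
	if c.sessionID == "" {
		return
	}
	c.engine.DeleteSession(c.sessionID)
	c.sessionID = ""
}

func main() {
	e := &fakeEngine{live: map[string]bool{}}
	c := &classifier{engine: e}
	for _, q := range []string{"Which region?", "Which tier?", "Confirm?"} {
		if err := c.Classify(q); err != nil {
			fmt.Println("classify failed:", err)
		}
	}
	fmt.Println("sessions created:", e.created, "live before close:", len(e.live))
	c.Close()
	fmt.Println("live after close:", len(e.live))
}
```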

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
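
The leak fix can be illustrated with a simplified, hypothetical engine that keeps two maps, each behind its own mutex. The field and mutex names below are assumptions; only `DeleteSession`, `sessions`, and `usageCollectors` are named in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// usageCollector is a placeholder for per-session usage accounting.
type usageCollector struct{ tokens int }

// engine is a simplified, hypothetical stand-in for CopilotEngine.
type engine struct {
	sessionsMu sync.Mutex
	sessions   map[string]struct{}

	collectorsMu    sync.Mutex
	usageCollectors map[string]*usageCollector
}

func newEngine() *engine {
	return &engine{
		sessions:        make(map[string]struct{}),
		usageCollectors: make(map[string]*usageCollector),
	}
}

// register mimics a non-ephemeral session landing in both maps.
func (e *engine) register(id string) {
	e.sessionsMu.Lock()
	e.sessions[id] = struct{}{}
	e.sessionsMu.Unlock()

	e.collectorsMu.Lock()
	e.usageCollectors[id] = &usageCollector{}
	e.collectorsMu.Unlock()
}

// DeleteSession removes the session from both maps, each under its own mutex,
// so no usage collector is left behind.
func (e *engine) DeleteSession(id string) {
	e.sessionsMu.Lock()
	delete(e.sessions, id)
	e.sessionsMu.Unlock()

	e.collectorsMu.Lock()
	delete(e.usageCollectors, id)
	e.collectorsMu.Unlock()
}

// counts reports how many entries remain in each map.
func (e *engine) counts() (sessions, collectors int) {
	e.sessionsMu.Lock()
	sessions = len(e.sessions)
	e.sessionsMu.Unlock()

	e.collectorsMu.Lock()
	collectors = len(e.usageCollectors)
	e.collectorsMu.Unlock()
	return sessions, collectors
}

func main() {
	e := newEngine()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			e.register(id)
			e.DeleteSession(id)
		}(fmt.Sprintf("task-%d", i))
	}
	wg.Wait()
	s, c := e.counts()
	fmt.Println("sessions:", s, "collectors:", c)
}
```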

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.

Changes:

New internal/responder package implementing a Classifier that drives a persistent surrogate-user session and emits reply/stop/abstain decisions via structured tool calls.
Runner integration (executeResponderLoop) that drives the agent loop, merges responses, and records a ResponderInfo summary with outcomes completed/stopped/abstained/cap_exhausted/error.
Config/schema/validation, API/dashboard surfacing, docs, and tests for the new inputs.responder field.

Show a summary per file

| File | Description |
| --- | --- |
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds `executeResponderLoop`/`sendResponderReply` and `newClassifier` hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds `ResponderConfig` on `TaskStimulus` + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds `ResponderInfo` and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for `Responder`. |
| internal/execution/copilot.go | New `DeleteSession` for explicit teardown. |
| internal/webapi/types.go | Adds `ResponderInfoResponse`. |
| internal/webapi/store.go | Maps `run.Responder` to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for `inputs.responder`. |
| web/src/api/client.ts | TypeScript `ResponderInfo` type. |
| web/src/components/RunDetail.tsx | `ResponderBadge` for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |

Copilot's findings

Files reviewed: 20/21 changed files
Comments generated: 3

spboyer · 2026-06-01T17:31:38Z

+	}
+
+	if lastWasReply {
+		info.Outcome = models.ResponderOutcomeCapExhausted
+		slog.WarnContext(ctx, "responder budget exhausted while agent still asking questions",
+			"test", tc.DisplayName, "max_followups", cfg.MaxFollowups)
+	}
+	return info


Good catch — validated and applied in f539f0b. Dropped lastWasReply and the misleading ResponderOutcomeCompleted seed; the post-loop branch now unconditionally records cap_exhausted since every other exit returns early. Also removed the now-unused ResponderOutcomeCompleted constant.
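
For readers following the control flow, here is a minimal sketch of a loop in which every non-reply decision returns early, so falling out of the loop can only mean the budget ran out. The outcome strings match those listed in the review summary; the function and type names (`runResponderLoop`, `classifyFunc`) are hypothetical, and the sketch assumes the follow-up cap is at least 1.

```go
package main

import (
	"errors"
	"fmt"
)

type DecisionKind int

const (
	DecisionReply DecisionKind = iota
	DecisionStop
	DecisionAbstain
)

type Decision struct {
	Kind   DecisionKind
	Answer string
}

const (
	OutcomeStopped      = "stopped"
	OutcomeAbstained    = "abstained"
	OutcomeCapExhausted = "cap_exhausted"
	OutcomeError        = "error"
)

// classifyFunc stands in for the responder's Classify call.
type classifyFunc func(question string) (Decision, error)

// runResponderLoop returns the outcome and the number of replies sent.
// sendReply forwards an answer to the agent and returns its next question.
func runResponderLoop(classify classifyFunc, sendReply func(answer string) string, question string, maxFollowups int) (string, int) {
	replies := 0
	for i := 0; i < maxFollowups; i++ {
		d, err := classify(question)
		if err != nil {
			return OutcomeError, replies
		}
		switch d.Kind {
		case DecisionStop:
			return OutcomeStopped, replies
		case DecisionAbstain:
			return OutcomeAbstained, replies
		}
		question = sendReply(d.Answer)
		replies++
	}
	// Every other exit returned early, so reaching this point (with a cap of at
	// least 1) means the last decision was a reply and the budget is spent.
	return OutcomeCapExhausted, replies
}

func main() {
	agent := func(string) string { return "One more question?" }

	alwaysReply := func(string) (Decision, error) {
		return Decision{Kind: DecisionReply, Answer: "yes"}, nil
	}
	fmt.Println(runResponderLoop(alwaysReply, agent, "Which region?", 2))

	failing := func(string) (Decision, error) {
		return Decision{}, errors.New("responder unavailable")
	}
	fmt.Println(runResponderLoop(failing, agent, "Which region?", 2))
}
```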

spboyer · 2026-06-01T17:31:40Z

+			Handler: func(inv copilot.ToolInvocation) (copilot.ToolResult, error) {
+				var args struct {
+					Answer string `mapstructure:"answer"`
+				}
+				_ = mapstructure.Decode(inv.Arguments, &args)
+				d.decision = Decision{Kind: DecisionReply, Answer: args.Answer}
+				d.set = true
+				return copilot.ToolResult{}, nil
+			},


Validated and fixed in f539f0b. Each handler now returns the decode error and stores it on the recorder; Classify surfaces it as responder tool call invalid: ... instead of fabricating an empty reply/abstain. Added a regression test (TestClassifyMalformedArgsIsError).
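
The shape of that fix, sketched with the standard library only: a handler that returns its decode error and stores it on the recorder, so the caller reports an invalid tool call instead of an empty answer. `decodeAnswer` is a hypothetical stand-in for `mapstructure.Decode` and is stricter than it (a missing key is treated as an error here); the "no decision" message is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// recorder captures the reply decision or the first error seen while decoding it.
type recorder struct {
	answer string
	set    bool
	err    error
}

// decodeAnswer is a hypothetical stdlib stand-in for mapstructure.Decode.
func decodeAnswer(args map[string]any) (string, error) {
	raw, ok := args["answer"]
	if !ok {
		return "", errors.New(`missing "answer"`)
	}
	s, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf(`"answer" must be a string, got %T`, raw)
	}
	return s, nil
}

// handleReply returns the decode error and stores it, rather than discarding it.
func (r *recorder) handleReply(args map[string]any) error {
	answer, err := decodeAnswer(args)
	if err != nil {
		r.err = fmt.Errorf("responder tool call invalid: %w", err)
		return r.err
	}
	r.answer, r.set = answer, true
	return nil
}

// result surfaces a stored error before any decision is trusted.
func (r *recorder) result() (string, error) {
	if r.err != nil {
		return "", r.err
	}
	if !r.set {
		return "", errors.New("responder made no decision")
	}
	return r.answer, nil
}

func main() {
	bad := &recorder{}
	_ = bad.handleReply(map[string]any{"answer": 42})
	_, err := bad.result()
	fmt.Println("malformed:", err)

	good := &recorder{}
	_ = good.handleReply(map[string]any{"answer": "eastus"})
	answer, err := good.result()
	fmt.Println("valid:", answer, err)
}
```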

spboyer · 2026-06-01T17:31:42Z

+	set      bool
+}
+
+func (d *decisionRecorder) tools() []copilot.Tool {


Validated and fixed in f539f0b. The recorder now refuses a second decision call via guardDuplicate and records the conflict on d.err; Classify surfaces it rather than letting handler order pick a winner. Added regression tests (TestDecisionToolsRejectDuplicateCall, TestClassifyDuplicateDecisionIsError).
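
A small sketch of the duplicate guard, assuming a simplified recorder that tracks decisions by name. `guardDuplicate` is the name used in the reply above; its signature, the `record` helper, and the error text are hypothetical.

```go
package main

import "fmt"

// decisionRecorder keeps the first decision and remembers any conflict.
type decisionRecorder struct {
	kind string
	set  bool
	err  error
}

// guardDuplicate refuses a second decision in the same turn and records the conflict.
func (d *decisionRecorder) guardDuplicate(kind string) error {
	if d.set {
		d.err = fmt.Errorf("duplicate decision: %q called after %q", kind, d.kind)
		return d.err
	}
	return nil
}

// record stores a decision unless one was already made.
func (d *decisionRecorder) record(kind string) error {
	if err := d.guardDuplicate(kind); err != nil {
		return err
	}
	d.kind, d.set = kind, true
	return nil
}

func main() {
	d := &decisionRecorder{}
	fmt.Println("first call:", d.record("reply"))
	fmt.Println("second call:", d.record("stop"))
	fmt.Println("kept decision:", d.kind, "conflict recorded:", d.err != nil)
}
```

Because the conflict is stored on the recorder, the caller can fail the turn even if the tool runtime ignores the handler's returned error.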

codecov-commenter · 2026-05-29T15:57:13Z

Codecov Report

❌ Patch coverage is 71.96262% with 60 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@23e9dba). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/orchestration/runner.go | 55.95% | 27 Missing and 10 partials ⚠️ |
| internal/responder/responder.go | 89.21% | 7 Missing and 4 partials ⚠️ |
| internal/execution/copilot.go | 0.00% | 10 Missing ⚠️ |
| internal/models/testcase.go | 84.61% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #304   +/-   ##
=======================================
  Coverage        ?   75.30%           
=======================================
  Files           ?      160           
  Lines           ?    18859           
  Branches        ?        0           
=======================================
  Hits            ?    14202           
  Misses          ?     3640           
  Partials        ?     1017

| Flag | Coverage Δ |
| --- | --- |
| go-implementation | `75.30% <71.96%> (?)` |


Addresses three review comments on PR microsoft#304:

* Reject duplicate decision tool calls in the same turn instead of letting handler order silently pick the winner. The recorder now returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so malformed arguments become a 'responder tool call invalid' error rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial ResponderOutcomeCompleted seed in the responder loop. The loop can only exit normally after a reply, so the post-loop branch unconditionally records cap_exhausted. Removed the now-unused ResponderOutcomeCompleted constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adamdougal and others added 16 commits May 29, 2026 14:57

feat: add inputs.responder config model microsoft#303

25964ea

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: validate inputs.responder fields and mutual exclusivity microso…

de29bda

…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add responder decision types and tools microsoft#303

c32374c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add responder Classifier with persistent session microsoft#303

6732d92

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add ResponderInfo to RunResult microsoft#303

5740353

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

refactor: add injectable responder classifier factory to runner micro…

0670c60

…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: drive interactive skills via responder loop microsoft#303

8c7bedc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add inputs.responder to task JSON schema microsoft#303

4cf3968

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

docs: document inputs.responder for interactive skills microsoft#303

1355263

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: surface responder outcome in dashboard microsoft#303

5b5b0c8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chore: remove implementation plan

59c3d51

adamdougal requested a review from spboyer as a code owner May 29, 2026 15:53

Copilot AI review requested due to automatic review settings May 29, 2026 15:53

github-actions Bot enabled auto-merge (squash) May 29, 2026 15:54

Copilot AI reviewed May 29, 2026

adamdougal wants to merge 17 commits into microsoft:main from adamdougal:feat/responder

5 participants
