
feat: drive interactive skills via an LLM responder (#303) #304

Open
adamdougal wants to merge 17 commits into microsoft:main from adamdougal:feat/responder

@adamdougal
Summary

Adds a responder — an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under inputs.responder.
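The reply/stop/abstain flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Classifier` and `Agent` interfaces, `runLoop`, `ErrAbstained`, and the two fakes are hypothetical names, and outcomes are plain strings for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// DecisionKind is the responder's verdict for one agent question.
type DecisionKind int

const (
	DecisionReply DecisionKind = iota
	DecisionStop
	DecisionAbstain
)

// Decision carries the verdict and, for replies, the surrogate user's answer.
type Decision struct {
	Kind   DecisionKind
	Answer string
}

// Classifier and Agent are hypothetical interfaces for this sketch.
type Classifier interface {
	Classify(question string) (Decision, error)
}

type Agent interface {
	// Send delivers a user message and returns the agent's next message.
	Send(message string) (string, error)
}

// ErrAbstained marks a question the responder could not answer from its brief.
var ErrAbstained = errors.New("responder abstained")

// runLoop drives at most maxFollowups exchanges and reports the outcome.
func runLoop(agent Agent, c Classifier, firstQuestion string, maxFollowups int) (string, error) {
	question := firstQuestion
	for i := 0; i < maxFollowups; i++ {
		d, err := c.Classify(question)
		if err != nil {
			return "error", err
		}
		switch d.Kind {
		case DecisionStop:
			return "stopped", nil
		case DecisionAbstain:
			return "abstained", ErrAbstained
		}
		// DecisionReply: forward the answer and read the agent's next message.
		question, err = agent.Send(d.Answer)
		if err != nil {
			return "error", err
		}
	}
	// Every other exit returned early, so the follow-up budget ran out.
	return "cap_exhausted", nil
}

// scripted replays a fixed list of decisions.
type scripted struct {
	decisions []Decision
	next      int
}

func (s *scripted) Classify(string) (Decision, error) {
	d := s.decisions[s.next]
	s.next++
	return d, nil
}

// echoAgent always answers with another question.
type echoAgent struct{}

func (echoAgent) Send(msg string) (string, error) {
	return "follow-up to: " + msg, nil
}

func main() {
	c := &scripted{decisions: []Decision{
		{Kind: DecisionReply, Answer: "use Go"},
		{Kind: DecisionStop},
	}}
	outcome, err := runLoop(echoAgent{}, c, "Which language?", 5)
	fmt.Println(outcome, err)
}
```

In this sketch an abstain is returned as an error so the caller can mark the task errored, while a stop and a spent budget both return normally and leave grading to the caller.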

Related issue

Closes #303

Agent handoff

  • Scope: New responder feature end-to-end — config + validation, classifier, orchestration loop, outcome surfacing through the web API and dashboard, JSON schema, and docs.
  • Key files changed: internal/models/testcase.go & outcome.go (config + ResponderInfo), internal/responder/responder.go (classifier with persistent session + teardown), internal/orchestration/runner.go (executeResponderLoop, injectable classifier factory), internal/execution/copilot.go (DeleteSession), internal/webapi/{types,store}.go, web/src/components/RunDetail.tsx & api/client.ts (responder badge), schemas/task.schema.json, README + site/ docs.
  • Important decisions: inputs.responder is a sibling of follow_up_prompts and mutually exclusive with it; responder runs in a separate, non-ephemeral Copilot session with explicit Close() teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists; model is optional and defaults to the eval's config.model; each task builds its own classifier (concurrency-safe).
  • Follow-ups or known gaps: The completed outcome value is effectively unreachable in practice (a self-initiated stop returns stopped); left as-is and considered acceptable.
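For readers skimming the decisions above, the mutual-exclusivity rule and the model default might look something like the sketch below. The struct shapes are simplified assumptions: only `MaxFollowups` and the `inputs.responder` / `follow_up_prompts` names appear in this PR, while `TaskInputs`, `Validate`, `resolveModel`, and the error value are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponderConfig is a simplified stand-in for the per-task responder config.
type ResponderConfig struct {
	Model        string // optional; empty means "use the eval's model"
	MaxFollowups int
}

// TaskInputs is a hypothetical, trimmed-down view of a task's inputs.
type TaskInputs struct {
	FollowUpPrompts []string
	Responder       *ResponderConfig
}

// ErrResponderWithFollowUps is returned when both mechanisms are configured.
var ErrResponderWithFollowUps = errors.New(
	"inputs.responder and inputs.follow_up_prompts are mutually exclusive")

// Validate rejects tasks that configure both scripted follow-ups and a responder.
func (in TaskInputs) Validate() error {
	if in.Responder != nil && len(in.FollowUpPrompts) > 0 {
		return ErrResponderWithFollowUps
	}
	return nil
}

// resolveModel falls back to the eval-level model when the responder sets none.
func resolveModel(cfg ResponderConfig, evalModel string) string {
	if cfg.Model != "" {
		return cfg.Model
	}
	return evalModel
}

func main() {
	task := TaskInputs{
		FollowUpPrompts: []string{"Use PostgreSQL."},
		Responder:       &ResponderConfig{MaxFollowups: 3},
	}
	fmt.Println(task.Validate())
	fmt.Println(resolveModel(ResponderConfig{}, "eval-default-model"))
}
```

Failing at load time, rather than silently preferring one field, keeps a misconfigured task from producing results that look valid.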

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor or maintenance
  • CI/CD or release change

Validation

  • go test ./...
  • make lint or golangci-lint run
  • Docs site checked, if docs changed
  • Web/dashboard checks run, if web/ changed
  • Manual validation completed: run-detail Playwright e2e (chromium) 5/5 passing
  • Not applicable; reason:

Documentation

  • README updated, if user-facing behavior changed
  • site/ docs updated, if CLI, YAML, dashboard, or validator behavior changed
  • Examples updated, if relevant
  • Not applicable

Risk and rollback

  • Risk level: Low
  • Rollback plan: Feature is fully additive and gated on the new optional inputs.responder field — tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.

Notes for reviewers

The responder's session lifecycle is the area most worth a close look: Classify lazily creates a persistent session on the first call and resumes it thereafter, and executeResponderLoop defers Close() with a detached 30s context so teardown still runs on cancellation. CopilotEngine.DeleteSession removes the session from both e.sessions and e.usageCollectors (the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation between responder and follow_up_prompts, and that the orchestration branch gives Responder precedence over FollowUps.

adamdougal and others added 16 commits May 29, 2026 14:57
 Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#303

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
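The leak fix in the commit above amounts to deleting from both registries, each under its own lock. A simplified sketch follows; the `engine` struct, its field types, and `register` are assumptions and do not reflect the real CopilotEngine layout.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a simplified stand-in for CopilotEngine.
type engine struct {
	sessionsMu sync.Mutex
	sessions   map[string]struct{}

	collectorsMu    sync.Mutex
	usageCollectors map[string]int
}

func newEngine() *engine {
	return &engine{
		sessions:        make(map[string]struct{}),
		usageCollectors: make(map[string]int),
	}
}

// register mimics a non-ephemeral session landing in both maps.
func (e *engine) register(id string) {
	e.sessionsMu.Lock()
	e.sessions[id] = struct{}{}
	e.sessionsMu.Unlock()

	e.collectorsMu.Lock()
	e.usageCollectors[id] = 0
	e.collectorsMu.Unlock()
}

// DeleteSession removes the session and its usage collector.
// Skipping the second delete is what orphaned one collector per task.
func (e *engine) DeleteSession(id string) {
	e.sessionsMu.Lock()
	delete(e.sessions, id)
	e.sessionsMu.Unlock()

	e.collectorsMu.Lock()
	delete(e.usageCollectors, id)
	e.collectorsMu.Unlock()
}

func main() {
	e := newEngine()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("responder-%d", i)
			e.register(id)
			e.DeleteSession(id)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(e.sessions), len(e.usageCollectors))
}
```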
@adamdougal adamdougal requested a review from spboyer as a code owner May 29, 2026 15:53
Copilot AI review requested due to automatic review settings May 29, 2026 15:53
@github-actions github-actions Bot enabled auto-merge (squash) May 29, 2026 15:54
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.

Changes:

  • New internal/responder package implementing a Classifier that drives a persistent surrogate-user session and emits reply/stop/abstain decisions via structured tool calls.
  • Runner integration (executeResponderLoop) that drives the agent loop, merges responses, and records a ResponderInfo summary with outcomes completed/stopped/abstained/cap_exhausted/error.
  • Config/schema/validation, API/dashboard surfacing, docs, and tests for the new inputs.responder field.
| File | Description |
| --- | --- |
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds executeResponderLoop/sendResponderReply and newClassifier hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds ResponderConfig on TaskStimulus + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds ResponderInfo and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for Responder. |
| internal/execution/copilot.go | New DeleteSession for explicit teardown. |
| internal/webapi/types.go | Adds ResponderInfoResponse. |
| internal/webapi/store.go | Maps run.Responder to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for inputs.responder. |
| web/src/api/client.ts | TypeScript ResponderInfo type. |
| web/src/components/RunDetail.tsx | ResponderBadge for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |

Copilot's findings

  • Files reviewed: 20/21 changed files
  • Comments generated: 3

Comment on lines +1351 to +1358
	}

	if lastWasReply {
		info.Outcome = models.ResponderOutcomeCapExhausted
		slog.WarnContext(ctx, "responder budget exhausted while agent still asking questions",
			"test", tc.DisplayName, "max_followups", cfg.MaxFollowups)
	}
	return info
Member

Good catch — validated and applied in f539f0b. Dropped lastWasReply and the misleading ResponderOutcomeCompleted seed; the post-loop branch now unconditionally records cap_exhausted since every other exit returns early. Also removed the now-unused ResponderOutcomeCompleted constant.

Comment on lines +76 to +84
Handler: func(inv copilot.ToolInvocation) (copilot.ToolResult, error) {
	var args struct {
		Answer string `mapstructure:"answer"`
	}
	_ = mapstructure.Decode(inv.Arguments, &args)
	d.decision = Decision{Kind: DecisionReply, Answer: args.Answer}
	d.set = true
	return copilot.ToolResult{}, nil
},
Member

Validated and fixed in f539f0b. Each handler now returns the decode error and stores it on the recorder; Classify surfaces it as responder tool call invalid: ... instead of fabricating an empty reply/abstain. Added a regression test (TestClassifyMalformedArgsIsError).
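The shape of that fix can be sketched as below. To stay dependency-free, `encoding/json` stands in for `mapstructure` here, and `recorder`, `handleReply`, and `classify` are hypothetical names rather than the ones in the PR; only the `responder tool call invalid` error prefix comes from the comment above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// recorder keeps the decision, or the reason no valid decision exists.
type recorder struct {
	answer string
	set    bool
	err    error
}

// handleReply decodes the tool arguments and records either the decision
// or the decode error, instead of discarding the error.
func (r *recorder) handleReply(rawArgs []byte) error {
	var args struct {
		Answer string `json:"answer"`
	}
	if err := json.Unmarshal(rawArgs, &args); err != nil {
		r.err = fmt.Errorf("decode reply arguments: %w", err)
		return r.err
	}
	r.answer = args.Answer
	r.set = true
	return nil
}

// classify surfaces a recorded decode failure rather than an empty reply.
func classify(rawArgs []byte) (string, error) {
	r := &recorder{}
	_ = r.handleReply(rawArgs) // the error is also stored on the recorder
	if r.err != nil {
		return "", fmt.Errorf("responder tool call invalid: %w", r.err)
	}
	if !r.set {
		return "", errors.New("responder made no decision")
	}
	return r.answer, nil
}

func main() {
	fmt.Println(classify([]byte(`{"answer":"yes, use TLS"}`)))
	fmt.Println(classify([]byte(`{"answer": 42}`)))
}
```

Storing the error on the recorder as well as returning it means the caller still sees the failure even if the tool runtime swallows the handler's return value.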

	set bool
}

func (d *decisionRecorder) tools() []copilot.Tool {
Member

Validated and fixed in f539f0b. The recorder now refuses a second decision call via guardDuplicate and records the conflict on d.err; Classify surfaces it rather than letting handler order pick a winner. Added regression tests (TestDecisionToolsRejectDuplicateCall, TestClassifyDuplicateDecisionIsError).
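A duplicate guard of this kind could look roughly like the following. Only the `guardDuplicate` name comes from the comment above; the `recorder` type, its fields, `record`, `result`, and the error values are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

var errDuplicate = errors.New("responder made more than one decision in a turn")

// recorder holds at most one decision per turn.
type recorder struct {
	kind string
	set  bool
	err  error
}

// guardDuplicate records and returns an error if a decision already exists.
func (r *recorder) guardDuplicate() error {
	if r.set {
		r.err = errDuplicate
		return r.err
	}
	return nil
}

// record is what each decision tool handler would call.
func (r *recorder) record(kind string) error {
	if err := r.guardDuplicate(); err != nil {
		return err
	}
	r.kind = kind
	r.set = true
	return nil
}

// result surfaces a conflict instead of letting call order pick a winner.
func (r *recorder) result() (string, error) {
	if r.err != nil {
		return "", r.err
	}
	if !r.set {
		return "", errors.New("responder made no decision")
	}
	return r.kind, nil
}

func main() {
	r := &recorder{}
	fmt.Println(r.record("reply"))
	fmt.Println(r.record("stop"))
	fmt.Println(r.result())
}
```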

@codecov-commenter
codecov-commenter commented May 29, 2026

Codecov Report

❌ Patch coverage is 71.96262% with 60 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@23e9dba). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/orchestration/runner.go | 55.95% | 27 Missing and 10 partials ⚠️ |
| internal/responder/responder.go | 89.21% | 7 Missing and 4 partials ⚠️ |
| internal/execution/copilot.go | 0.00% | 10 Missing ⚠️ |
| internal/models/testcase.go | 84.61% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #304   +/-   ##
=======================================
  Coverage        ?   75.30%           
=======================================
  Files           ?      160           
  Lines           ?    18859           
  Branches        ?        0           
=======================================
  Hits            ?    14202           
  Misses          ?     3640           
  Partials        ?     1017           
Flag Coverage Δ
go-implementation 75.30% <71.96%> (?)

Addresses three review comments on PR microsoft#304:

* Reject duplicate decision tool calls in the same turn instead of
  letting handler order silently pick the winner. The recorder now
  returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so
  malformed arguments become a 'responder tool call invalid' error
  rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial
  ResponderOutcomeCompleted seed in the responder loop. The loop can
  only exit normally after a reply, so the post-loop branch
  unconditionally records cap_exhausted. Removed the now-unused
  ResponderOutcomeCompleted constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Support for driving interactive skills via a responder LLM

5 participants