runtime: catastrophic memory explosion triggers macOS jetsam kill in long-running TUI sessions #2978

@aheritier

Description

Summary

Long-running docker-agent run TUI sessions experience sudden catastrophic memory explosions — growing from a stable ~168 MB baseline to 26+ GB in under 2 minutes — which trigger macOS jetsam SIGKILL. Observed across 7+ separate sessions over two weeks, consistently on macOS Apple Silicon.

This is not a gradual memory leak. The process runs stably for hours, then a specific operation triggers an unbounded allocation cascade.

Environment

  • OS: macOS 26.5.1 (25F80), Darwin arm64 (Mac16,6 / Apple M4 Max)
  • Config: multi-agent config with 9+ MCP servers, 2 LSP toolsets, 15+ sub-agents
  • Session DB: ~/.cagent/session.db — 687 MB, 1168 sessions, 51649 items, 531 MB of message JSON

Confirmed evidence

1. macOS Jetsam report (JetsamEvent-2026-06-02-174418.ips)

"largestProcess": "docker-agent"

Process entry (PID 66283) at time of kill:

rpages:       26,275,011  (~400 GB virtual footprint)
mem_regions:  23,326
fds:          400
cpuTime:      1270.6s
lifetimeMax:  26,275,011  ← peak == current, still climbing at kill
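
For reference, the ~400 GB figure follows from the page count if one assumes the 16 KiB page size used on Apple Silicon. The sketch below only redoes that arithmetic; `pageSize` and `pagesToBytes` are illustrative names, not anything from the jetsam report or from docker-agent.

```go
package main

import "fmt"

// pageSize is an assumption: 16 KiB, the page size on Apple Silicon Macs.
const pageSize = 16 * 1024

// pagesToBytes converts a page count into bytes.
func pagesToBytes(pages uint64) uint64 {
	return pages * pageSize
}

func main() {
	const rpages = 26_275_011 // page count from the jetsam process entry
	b := pagesToBytes(rpages)
	// Prints: 430489780224 bytes = 400.9 GiB
	fmt.Printf("%d bytes = %.1f GiB\n", b, float64(b)/(1<<30))
}
```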

At kill time the system had only 534 MB of free pages left, and 70+ system daemons were jetsammed in the same cascade.

2. RSS timeseries (30-second samples, PID 48218)

Process ran flat at 168 MB for 8+ hours, then:

Time             RSS              CPU%
01:19 → 09:04    168 MB (flat)    4–9%
09:23            172 MB           16%
09:26            182 MB           16%
09:28:03         413 MB           126%
09:28:33         1,478 MB         72%
09:29:03         1,738 MB         144%
09:29:33         26,150 MB        102%
09:30:03         DEAD (jetsam)

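RSS drifted from the 168 MB baseline to 182 MB by 09:26, then went from 413 MB to 26 GB in the 90 seconds between 09:28:03 and 09:29:33, with roughly 24 GB of that growth in the final 30-second interval. CPU above 100% (126–144%) means more than one core was busy, which is consistent with multiple goroutines (or the garbage collector) working concurrently.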
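
To make the shape of the ramp explicit, the per-interval growth rate can be computed from the last four samples. This is only arithmetic on the numbers reported here; the `sample` type and `growthRates` helper are hypothetical names for illustration.

```go
package main

import "fmt"

// sample is one RSS reading; Sec is seconds elapsed since 09:28:03.
type sample struct {
	Sec   int
	RSSMB float64
}

// growthRates returns the RSS growth in MB/s between consecutive samples.
func growthRates(s []sample) []float64 {
	rates := make([]float64, 0, len(s))
	for i := 1; i < len(s); i++ {
		dt := float64(s[i].Sec - s[i-1].Sec)
		rates = append(rates, (s[i].RSSMB-s[i-1].RSSMB)/dt)
	}
	return rates
}

func main() {
	// The 09:28:03, 09:28:33, 09:29:03 and 09:29:33 samples from the timeseries.
	samples := []sample{{0, 413}, {30, 1478}, {60, 1738}, {90, 26150}}
	for i, r := range growthRates(samples) {
		fmt.Printf("interval %d: %.1f MB/s\n", i+1, r)
	}
}
```

The last interval works out to about 814 MB/s averaged over 30 seconds, so the real peak allocation rate was probably higher than a 30-second sampler can show.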

3. Trigger pattern

Immediately before the explosion the agent ran a large filesystem search (Search Files Content across ~, 48-second runtime, 2008 matches across 1191 files).

Hypothesis: a large tool result triggers a cascade of in-memory copies — tool output buffer → message list append → session_items WAL write → context window serialisation for Anthropic API — each step holding its own copy, with no back-pressure or size cap.

Steps to reproduce (approximate)

  1. Run a long-lived multi-agent session (several hours, many tool calls, no max_history_items cap)
  2. Trigger a large filesystem search (e.g. Search Files Content across ~ or a large repo with thousands of matches)
  3. Monitor RSS: while sleep 30; do ps -o rss= -p $PID; done
  4. Observe RSS explosion when the large tool result is processed and serialised into context
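
Since most of the growth happens inside a single 30-second window, a shorter sampling interval may help when reproducing. Below is a minimal Go sketch of the same `ps`-based monitor with timestamps; the 5-second interval and the names `parseRSSKB` and `readRSSKB` are assumptions for illustration, and `ps -o rss=` is taken to report resident size in KiB.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// parseRSSKB parses the output of `ps -o rss= -p PID` (resident size in KiB).
func parseRSSKB(out string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(out), 10, 64)
}

// readRSSKB shells out to ps for a single reading.
func readRSSKB(pid string) (int64, error) {
	out, err := exec.Command("ps", "-o", "rss=", "-p", pid).Output()
	if err != nil {
		return 0, err
	}
	return parseRSSKB(string(out))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: rsswatch PID")
		return
	}
	for {
		kb, err := readRSSKB(os.Args[1])
		if err != nil {
			// ps exits non-zero once the process is gone (e.g. after jetsam).
			fmt.Fprintln(os.Stderr, "process gone or ps failed:", err)
			return
		}
		fmt.Printf("%s %d MB\n", time.Now().Format("15:04:05"), kb/1024)
		time.Sleep(5 * time.Second) // assumed interval, shorter than the 30 s used above
	}
}
```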

Expected behaviour

Tool results exceeding a size threshold should be truncated or streamed rather than fully buffered. The session serialisation path should not hold multiple full copies of large payloads simultaneously.
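
One possible shape for the truncation side of this, as a sketch only: an `io.Writer` that retains a bounded prefix of the tool output and counts what it drops. `cappedBuffer` is a hypothetical type, not existing docker-agent code, and the right limit and truncation message are open questions.

```go
package main

import (
	"bytes"
	"fmt"
)

// cappedBuffer keeps at most limit bytes and counts the rest as dropped.
type cappedBuffer struct {
	buf     bytes.Buffer
	limit   int
	dropped int64
}

// Write stores as much of p as still fits. It always reports len(p) so that a
// producer copying into it (for example a command's stdout) does not fail
// with a short-write error once the cap is reached.
func (c *cappedBuffer) Write(p []byte) (int, error) {
	room := c.limit - c.buf.Len()
	if room < 0 {
		room = 0
	}
	keep := len(p)
	if keep > room {
		keep = room
	}
	c.buf.Write(p[:keep])
	c.dropped += int64(len(p) - keep)
	return len(p), nil
}

// String returns the retained output, plus a truncation notice if needed.
func (c *cappedBuffer) String() string {
	if c.dropped == 0 {
		return c.buf.String()
	}
	return fmt.Sprintf("%s\n[truncated: %d bytes dropped]", c.buf.String(), c.dropped)
}

func main() {
	out := &cappedBuffer{limit: 16}
	fmt.Fprint(out, "0123456789")
	fmt.Fprint(out, "abcdefghij")
	fmt.Println(out.String())
}
```

With a cap like this in front of the message list, every later stage (persistence, context construction) would copy at most `limit` bytes per tool result instead of the full payload.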

Suggested investigation points

  • pkg/tools/builtin/shell/shell.go — tool output accumulation (no size cap on stdout buffer)
  • pkg/runtime/toolexec/dispatcher.go — tool result handling before entering message list
  • Session store serialisation of session_items to sqlite
  • pkg/runtime/streaming.go — context window construction for next API call
  • Message history path: debug logs show message_count=163 being sent whole with no max_history_items cap set
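
To narrow down which of these paths is allocating, a heap profile captured during the ramp would help. A minimal sketch of a watchdog that writes one profile when the Go heap crosses a threshold, using only `runtime.ReadMemStats` and `pprof.WriteHeapProfile`; `watchHeap`, `overThreshold`, the threshold and the output path are assumptions, and nothing here is wired into docker-agent.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
	"time"
)

// overThreshold reports whether the live heap exceeds limitBytes.
func overThreshold(heapAlloc, limitBytes uint64) bool {
	return heapAlloc > limitBytes
}

// watchHeap polls the Go heap and writes a single heap profile to path the
// first time HeapAlloc crosses limitBytes, then returns.
func watchHeap(limitBytes uint64, every time.Duration, path string) error {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m) // briefly stops the world, so keep the interval coarse
		if overThreshold(m.HeapAlloc, limitBytes) {
			f, err := os.Create(path)
			if err != nil {
				return err
			}
			defer f.Close()
			return pprof.WriteHeapProfile(f)
		}
		time.Sleep(every)
	}
}

func main() {
	// A limit of 0 fires immediately, which makes the sketch easy to try;
	// a real run would use something like 1 << 30 (1 GiB).
	path := filepath.Join(os.TempDir(), "heap.pprof")
	if err := watchHeap(0, time.Second, path); err != nil {
		fmt.Fprintln(os.Stderr, "heap watchdog:", err)
		return
	}
	fmt.Println("heap profile written to", path)
}
```

The resulting file can be inspected with `go tool pprof`. Two caveats: the profile reflects the heap as of a recent garbage collection, and memory allocated outside the Go heap (for example by a cgo-based sqlite driver, if one is in use) would not appear in it.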

Additional context

  • Multiple simultaneous TUI sessions amplify the issue (3 running concurrently in some cases)
  • The ~/.cagent/session.db in-memory representation during serialisation is significantly larger than its 687 MB on-disk size
  • Identical explosion profile observed in a second jetsam event (2026-06-02): same largestProcess verdict, same sudden RSS profile after hours of stability

Metadata

Labels

  • area/agent: For work that has to do with the general agent loop/agentic features of the app
  • area/sessions: For features/issues/fixes related to session lifecycle (resume, persistence, export)
  • area/tools: For features/issues/fixes related to the usage of built-in and MCP tools
  • area/tui: For features/issues/fixes related to the TUI
  • automated: Issues created by cagent
  • status/needs-triage: For issues that need to be triaged
