From b106eed92dc41c803d9be34d5fcdf63e96f51411 Mon Sep 17 00:00:00 2001
From: Rolando Santamaria Maso <kyberneees@gmail.com>
Date: Sun, 7 Jun 2026 15:55:40 +0200
Subject: [PATCH 1/2] feat(vision): add vision tool using MiniCPM-V 4.6 (1.3B)
 via llama-mtmd-cli
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a `vision` built-in tool that analyses images and videos locally using
MiniCPM-V 4.6 — a 1.3B multimodal model running via llama.cpp's
llama-mtmd-cli, with no cloud API required.

## What's new

**Tool (`vision`)**
- Accepts images (JPEG, PNG, GIF, WebP, BMP) and videos (MP4, MOV, AVI,
  MKV, WebM); files ending in .m4v, .flv, or .wmv are also routed to the
  video path
- Videos: ffprobe reads the duration, ffmpeg extracts N evenly spaced
  frames (N set by `video_frames`, default 8), and all frames go to the
  model in a single multi-image call
- Security: O_NOFOLLOW open (rejects a path whose final component is a
  symlink), danger.CheckOperation classification, all model output wrapped
  in wrapUntrusted() with a provenance tag
- Setup instructions in the error paths for a missing binary on PATH, a
  missing model, and a missing mmproj; a missing ffmpeg or ffprobe is
  reported as a required dependency
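
The frame sampling described in the video bullet can be sketched in
isolation. This is a minimal illustration of the interval arithmetic in
extractVideoFrames, not the shipped code: `frameTimestamps` is a
hypothetical helper name, and unlike the tool (which falls back to a 60 s
duration when ffprobe reports none) it simply returns nil for bad input.

```go
package main

import "fmt"

// frameTimestamps returns n timestamps (in seconds) spaced evenly across a
// clip of the given duration. Dividing by n+1 keeps every sample away from
// the very first and very last instant of the video.
// Hypothetical helper: the name and the nil-on-bad-input behaviour are
// assumptions made for this sketch.
func frameTimestamps(duration float64, n int) []float64 {
	if n <= 0 || duration <= 0 {
		return nil
	}
	interval := duration / float64(n+1)
	ts := make([]float64, 0, n)
	for i := 1; i <= n; i++ {
		ts = append(ts, interval*float64(i))
	}
	return ts
}

func main() {
	// A 10-second clip sampled at 4 frames.
	for _, ts := range frameTimestamps(10, 4) {
		fmt.Printf("%.3f\n", ts) // formatted the way the tool passes -ss to ffmpeg
	}
}
```

For a 10-second clip and 4 frames the samples land at 2, 4, 6, and 8
seconds, so no frame is taken at 0 s or at the final instant.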

**Docker (`docker/Dockerfile`)**
- New `minicpm` multi-stage build: downloads pre-built llama-mtmd-cli
  (llama.cpp b9549) for amd64/arm64 from the official GitHub release, then
  fetches MiniCPM-V-4_6-Q4_K_M.gguf (529 MB) and mmproj-model-f16.gguf
  (1.1 GB) from HuggingFace into /usr/local/share/minicpm-v/models/
- Overridable via --build-arg MINICPM_QUANT=Q8_0 and LLAMA_VERSION
- Runtime stage copies binary + models; no new runtime deps (libstdc++6
  already present for whisper)

**Config (`internal/config/loader.go`)**
- New VisionConfig struct: ModelsDir, BinaryPath, VideoFrames
- Wired into FileConfig, ResolvedConfig, resolveVision(), mergeFile()
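
A rough sketch of the zero-value backfill for the frame count, assuming
only what this message states (three fields, default of 8 frames).
`visionSettings` and `withDefaults` are hypothetical stand-ins, not the
real VisionConfig or resolveVision(), whose exact shape lives in
internal/config/loader.go.

```go
package main

import "fmt"

// visionSettings is a hypothetical stand-in for the VisionConfig struct;
// the field names follow the commit message, everything else is assumed.
type visionSettings struct {
	ModelsDir   string // directory expected to hold model.gguf and mmproj.gguf
	BinaryPath  string // explicit path to llama-mtmd-cli; empty means search PATH
	VideoFrames int    // frames sampled per video; zero or negative means unset
}

// defaultVideoFrames matches the default stated for `video_frames`.
const defaultVideoFrames = 8

// withDefaults backfills an unset frame count and leaves the paths alone,
// since empty paths are resolved later at call time.
func withDefaults(c visionSettings) visionSettings {
	if c.VideoFrames <= 0 {
		c.VideoFrames = defaultVideoFrames
	}
	return c
}

func main() {
	fmt.Println(withDefaults(visionSettings{}).VideoFrames)
	fmt.Println(withDefaults(visionSettings{VideoFrames: 4}).VideoFrames)
}
```

Taking and returning the struct by value keeps the caller's copy
unchanged, which makes the default and custom cases easy to test
separately.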

**Tests**
- 13 tests in cmd/odek/vision_tool_test.go: empty path, invalid JSON, file
  not found, symlink rejected, missing binary, missing model, missing mmproj,
  mock happy-path image (4 extensions), custom prompt, mock happy-path video
  (with mock ffprobe+ffmpeg via PATH override), missing ffmpeg fallback,
  schema shape
- 3 tests in internal/config/vision_test.go: resolveVision defaults,
  zero-frames backfill, custom values round-trip

**Docs**
- docs/CHEATSHEET.md: new Image & Video Understanding section with config
  snippet and field reference
- docs/SECURITY.md: vision added to untrusted-content table, always-untrusted
  list, and skills provenance gate paragraph
- docs/CONFIG.md + docs/TELEGRAM.md: smart-previews bullet updated
- docker/README.md: new Image & video understanding (out of the box) section
- README.md: vision added to external-content ingestion list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 README.md                            |   2 +-
 cmd/odek/injection_hardening_test.go |   2 +-
 cmd/odek/main.go                     |   7 +-
 cmd/odek/main_test.go                |   2 +-
 cmd/odek/mcp.go                      |   2 +-
 cmd/odek/repl.go                     |   2 +-
 cmd/odek/schedule.go                 |   2 +-
 cmd/odek/serve.go                    |   2 +-
 cmd/odek/subagent.go                 |   2 +-
 cmd/odek/subagent_contract_test.go   |   6 +-
 cmd/odek/telegram.go                 |   2 +-
 cmd/odek/vision_tool.go              | 320 +++++++++++++++++++++++++
 cmd/odek/vision_tool_test.go         | 338 +++++++++++++++++++++++++++
 docker/Dockerfile                    |  43 ++++
 docker/README.md                     |  19 ++
 docs/CHEATSHEET.md                   |  19 ++
 docs/CONFIG.md                       |   2 +-
 docs/SECURITY.md                     |   5 +-
 docs/TELEGRAM.md                     |   2 +-
 internal/config/loader.go            |  39 ++++
 internal/config/vision_test.go       |  40 ++++
 21 files changed, 839 insertions(+), 19 deletions(-)
 create mode 100644 cmd/odek/vision_tool.go
 create mode 100644 cmd/odek/vision_tool_test.go
 create mode 100644 internal/config/vision_test.go

diff --git a/README.md b/README.md
index ca74f31..b93a577 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ odek is not a framework. It's a **runtime** — the smallest possible surface ar
 Every session can run in an isolated Docker container: no network, no host mounts beyond the working directory, zero capabilities, destroyed on exit. `odek serve` enables the sandbox **by default**; `odek run` keeps it opt-in but warns when running unsandboxed. `--ctx` files are auto-injected into the container at `/workspace/`. Full security model in [docs/SANDBOXING.md](docs/SANDBOXING.md).
 
 ### 🛡️ Prompt-Injection-Aware
-External content the agent ingests (`browser`, `read_file`, `shell`, `search_files`, `multi_grep`, `transcribe`, `session_search`, MCP tools) is wrapped in per-call nonce'd `<untrusted_content>` boundaries so the model can distinguish data from instructions. Redirect hops are re-classified (`browser`/`http_batch`), MCP tool descriptions are scanned for injection at registration, and the MCP error channel is wrapped too. The danger classifier resists 8 known shell-evasion tricks (`$()`, backticks, `$IFS`, `command`/`exec`, `\rm`, basenamed absolute paths). Approvers engage friction mode after 3 same-class approvals in 60 s. Memory episodes from tainted sessions are stored but never auto-replayed. Skill auto-save tracks provenance and pins untrusted suggestions for explicit `odek skill promote`. `odek audit <session-id>` surfaces every ingest + per-turn divergence heuristic. Full threat model in [docs/SECURITY.md](docs/SECURITY.md).
+External content the agent ingests (`browser`, `read_file`, `shell`, `search_files`, `multi_grep`, `transcribe`, `vision`, `session_search`, MCP tools) is wrapped in per-call nonce'd `<untrusted_content>` boundaries so the model can distinguish data from instructions. Redirect hops are re-classified (`browser`/`http_batch`), MCP tool descriptions are scanned for injection at registration, and the MCP error channel is wrapped too. The danger classifier resists 8 known shell-evasion tricks (`$()`, backticks, `$IFS`, `command`/`exec`, `\rm`, basenamed absolute paths). Approvers engage friction mode after 3 same-class approvals in 60 s. Memory episodes from tainted sessions are stored but never auto-replayed. Skill auto-save tracks provenance and pins untrusted suggestions for explicit `odek skill promote`. `odek audit <session-id>` surfaces every ingest + per-turn divergence heuristic. Full threat model in [docs/SECURITY.md](docs/SECURITY.md).
 
 ### 🧩 Sub-Agent Delegation
 Parallel OS-process sub-agents via `delegate_tasks`. True isolation — each sub-agent is a fresh `odek subagent` process with its own config, tools, and termination timeout. Up to 8 concurrent workers. [docs/SUBAGENTS.md](docs/SUBAGENTS.md)
diff --git a/cmd/odek/injection_hardening_test.go b/cmd/odek/injection_hardening_test.go
index b5c459e..d19a220 100644
--- a/cmd/odek/injection_hardening_test.go
+++ b/cmd/odek/injection_hardening_test.go
@@ -244,7 +244,7 @@ func TestBuiltinTools_SessionSearchWrappedAsUntrusted(t *testing.T) {
 	store, cleanup := seedSessionStore(t)
 	defer cleanup()
 
-	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 4, "", config.TranscriptionConfig{}, store)
+	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 4, "", config.TranscriptionConfig{}, config.VisionConfig{}, store)
 
 	var ss odek.Tool
 	for _, tool := range tools {
diff --git a/cmd/odek/main.go b/cmd/odek/main.go
index 85c335a..4830d09 100644
--- a/cmd/odek/main.go
+++ b/cmd/odek/main.go
@@ -779,7 +779,7 @@ func run(args []string) error {
 
 	// Sandbox setup
 	var sandboxCleanup func() error
-	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, nil)
+	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, resolved.Vision, nil)
 
 	// MCP server tools
 	var mcpCleanup func()
@@ -1054,7 +1054,7 @@ func setupSandbox(tools []odek.Tool, cfg sandboxConfig) (containerName string, c
 	return containerName, cleanup, nil
 }
 
-func builtinTools(dc danger.DangerousConfig, sm *skills.SkillManager, approver danger.Approver, maxConcurrency int, apiKey string, tc config.TranscriptionConfig, store *session.Store) []odek.Tool {
+func builtinTools(dc danger.DangerousConfig, sm *skills.SkillManager, approver danger.Approver, maxConcurrency int, apiKey string, tc config.TranscriptionConfig, vc config.VisionConfig, store *session.Store) []odek.Tool {
 	tools := []odek.Tool{
 		&shellTool{
 			dangerousConfig: dc,
@@ -1089,6 +1089,7 @@ func builtinTools(dc danger.DangerousConfig, sm *skills.SkillManager, approver d
 		&trTool{dangerousConfig: dc},
 		&wordCountTool{dangerousConfig: dc},
 		newTranscribeTool(dc, tc),
+		newVisionTool(dc, vc),
 		// session_search returns content from arbitrary past sessions —
 		// including sessions that ingested untrusted content. That path
 		// otherwise bypasses the memory taint gate and the audit log, so
@@ -1598,7 +1599,7 @@ func continueCmd(args []string) error {
 			"./.odek/skills",
 		)
 	}
-	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, store)
+	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, resolved.Vision, store)
 	var sandboxCleanup func() error
 
 	// MCP server tools
diff --git a/cmd/odek/main_test.go b/cmd/odek/main_test.go
index 67b8d62..619d412 100644
--- a/cmd/odek/main_test.go
+++ b/cmd/odek/main_test.go
@@ -203,7 +203,7 @@ func TestRun_NoAPIKey(t *testing.T) {
 }
 
 func TestBuiltinTools(t *testing.T) {
-	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, nil)
+	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 	if len(tools) == 0 {
 		t.Fatal("builtinTools() returned empty slice")
 	}
diff --git a/cmd/odek/mcp.go b/cmd/odek/mcp.go
index 993f260..55f604f 100644
--- a/cmd/odek/mcp.go
+++ b/cmd/odek/mcp.go
@@ -73,7 +73,7 @@ Flags:
 	}
 
 	// Build tools
-	toolSet := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, nil)
+	toolSet := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 
 	// MCP server tools — connect and discover before sandbox
 	var mcpCleanup func()
diff --git a/cmd/odek/repl.go b/cmd/odek/repl.go
index 4ae42d5..bd2fcae 100644
--- a/cmd/odek/repl.go
+++ b/cmd/odek/repl.go
@@ -77,7 +77,7 @@ func replCmd(args []string) error {
 			"./.odek/skills",
 		)
 	}
-	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, nil)
+	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 	var sandboxCleanup func() error
 
 	// MCP server tools
diff --git a/cmd/odek/schedule.go b/cmd/odek/schedule.go
index 0e21111..2858370 100644
--- a/cmd/odek/schedule.go
+++ b/cmd/odek/schedule.go
@@ -570,7 +570,7 @@ func runTaskHeadless(ctx context.Context, resolved config.ResolvedConfig, system
 		resolved.Dangerous.NonInteractive = &deny
 	}
 
-	tools := builtinTools(resolved.Dangerous, nil, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, nil)
+	tools := builtinTools(resolved.Dangerous, nil, nil, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, resolved.Vision, nil)
 	tools = append(tools, mcpTools...)
 
 	// Capture cumulative token usage from the final iteration so the Runner
diff --git a/cmd/odek/serve.go b/cmd/odek/serve.go
index 676bde2..b37a06d 100644
--- a/cmd/odek/serve.go
+++ b/cmd/odek/serve.go
@@ -267,7 +267,7 @@ func newServeAgent(resolved config.ResolvedConfig, system string, sendFn func(v
 	approver := newWSApprover(sendFn)
 	resolved.Dangerous.Approver = approver
 
-	tools := builtinTools(resolved.Dangerous, sm, approver, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, nil)
+	tools := builtinTools(resolved.Dangerous, sm, approver, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 
 	// Find the delegateTasksTool to wire up sub-agent log streaming
 	var subagentTool *delegateTasksTool
diff --git a/cmd/odek/subagent.go b/cmd/odek/subagent.go
index dea620f..b4f8275 100644
--- a/cmd/odek/subagent.go
+++ b/cmd/odek/subagent.go
@@ -291,7 +291,7 @@ func subagentCmd(args []string) error {
 			"./.odek/skills",
 		)
 	}
-	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, nil)
+	tools := builtinTools(resolved.Dangerous, sm, nil, resolved.MaxConcurrency, resolved.APIKey, config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 	var sandboxCleanup func() error
 
 	// MCP server tools
diff --git a/cmd/odek/subagent_contract_test.go b/cmd/odek/subagent_contract_test.go
index f5e24b0..ffb9221 100644
--- a/cmd/odek/subagent_contract_test.go
+++ b/cmd/odek/subagent_contract_test.go
@@ -320,7 +320,7 @@ func TestSubagent_ExitCodeThree(t *testing.T) {
 // ── 4. delegate_tasks Tool Schema ───────────────────────────────────
 
 func TestDelegateTasksTool_Exists(t *testing.T) {
-	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, nil)
+	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 	if len(tools) == 0 {
 		t.Fatal("builtinTools() returned empty slice")
 	}
@@ -338,7 +338,7 @@ func TestDelegateTasksTool_Exists(t *testing.T) {
 }
 
 func TestDelegateTasksTool_HasSchema(t *testing.T) {
-	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, nil)
+	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 
 	var tool odek.Tool
 	for _, t2 := range tools {
@@ -432,7 +432,7 @@ func TestDelegateTasksTool_HasSchema(t *testing.T) {
 }
 
 func TestDelegateTasksTool_Description(t *testing.T) {
-	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, nil)
+	tools := builtinTools(danger.DangerousConfig{}, nil, nil, 3, "", config.TranscriptionConfig{}, config.VisionConfig{}, nil)
 
 	var tool odek.Tool
 	for _, t2 := range tools {
diff --git a/cmd/odek/telegram.go b/cmd/odek/telegram.go
index a7dd0fd..d88bfac 100644
--- a/cmd/odek/telegram.go
+++ b/cmd/odek/telegram.go
@@ -1078,7 +1078,7 @@ func handleChatMessage(
 	}
 
 	// Build the agent with Telegram approver.
-	tools := builtinTools(resolved.Dangerous, nil, approver, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, sessionManager.Store)
+	tools := builtinTools(resolved.Dangerous, nil, approver, resolved.MaxConcurrency, resolved.APIKey, resolved.Transcription, resolved.Vision, sessionManager.Store)
 
 	modelLabel := odek.ProfileLabel(resolved.Model)
 	if modelLabel == "" {
diff --git a/cmd/odek/vision_tool.go b/cmd/odek/vision_tool.go
new file mode 100644
index 0000000..d9f57dd
--- /dev/null
+++ b/cmd/odek/vision_tool.go
@@ -0,0 +1,320 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strconv"
+	"strings"
+	"syscall"
+
+	"github.com/BackendStack21/odek"
+	"github.com/BackendStack21/odek/internal/config"
+	"github.com/BackendStack21/odek/internal/danger"
+)
+
+var videoExts = map[string]bool{
+	".mp4": true, ".mov": true, ".avi": true, ".mkv": true,
+	".webm": true, ".m4v": true, ".flv": true, ".wmv": true,
+}
+
+// llamaMtmdBinary locates the llama-mtmd-cli binary.
+// Priority: cfg.BinaryPath > PATH search.
+func llamaMtmdBinary(cfg config.VisionConfig) (string, error) {
+	if cfg.BinaryPath != "" {
+		if _, err := os.Stat(cfg.BinaryPath); err == nil {
+			return cfg.BinaryPath, nil
+		}
+		return "", fmt.Errorf("llama-mtmd-cli not found at configured path %q", cfg.BinaryPath)
+	}
+	if path, err := exec.LookPath("llama-mtmd-cli"); err == nil {
+		return path, nil
+	}
+	return "", fmt.Errorf(`llama-mtmd-cli not found on PATH.
+
+The vision tool requires llama.cpp's multimodal CLI (build b9549+).
+
+To install manually:
+  git clone --depth 1 --branch b9549 https://github.com/ggerganov/llama.cpp
+  cd llama.cpp
+  cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_CURL=OFF
+  cmake --build build -j$(nproc) --target llama-mtmd-cli
+  install build/bin/llama-mtmd-cli /usr/local/bin/
+
+Or set binary_path in the vision config.`)
+}
+
+// visionModelPaths resolves the model.gguf and mmproj.gguf paths.
+// Priority: cfg.ModelsDir > Docker image path > ~/.odek/minicpm-v/models.
+func visionModelPaths(cfg config.VisionConfig) (modelPath, mmprojPath string, err error) {
+	dir := cfg.ModelsDir
+	if dir == "" {
+		// Docker image baked path (see docker/Dockerfile minicpm stage)
+		const dockerPath = "/usr/local/share/minicpm-v/models"
+		if _, statErr := os.Stat(filepath.Join(dockerPath, "model.gguf")); statErr == nil {
+			dir = dockerPath
+		} else {
+			home, homeErr := os.UserHomeDir()
+			if homeErr != nil {
+				return "", "", fmt.Errorf("cannot determine home directory: %v", homeErr)
+			}
+			dir = filepath.Join(home, ".odek", "minicpm-v", "models")
+		}
+	}
+
+	mp := filepath.Join(dir, "model.gguf")
+	mmp := filepath.Join(dir, "mmproj.gguf")
+
+	if _, err := os.Stat(mp); err != nil {
+		return "", "", fmt.Errorf(`MiniCPM-V model not found at %q.
+
+Download and install:
+  mkdir -p %s
+  cd %s
+  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/MiniCPM-V-4_6-Q4_K_M.gguf"
+  mv MiniCPM-V-4_6-Q4_K_M.gguf model.gguf
+  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/mmproj-model-f16.gguf"
+  mv mmproj-model-f16.gguf mmproj.gguf
+
+Or set models_dir in the vision config.`, mp, dir, dir)
+	}
+	if _, err := os.Stat(mmp); err != nil {
+		return "", "", fmt.Errorf("MiniCPM-V projector not found at %q — download mmproj-model-f16.gguf to %s and rename to mmproj.gguf", mmp, dir)
+	}
+	return mp, mmp, nil
+}
+
+// extractVideoFrames samples n evenly-spaced frames from videoPath into a
+// temporary directory. Returns paths to the JPEG frame files; caller must
+// remove the directory (filepath.Dir of the first path).
+func extractVideoFrames(videoPath string, n int) ([]string, error) {
+	if _, err := exec.LookPath("ffmpeg"); err != nil {
+		return nil, fmt.Errorf("ffmpeg not found — required for video frame extraction")
+	}
+	if _, err := exec.LookPath("ffprobe"); err != nil {
+		return nil, fmt.Errorf("ffprobe not found — required to read video duration")
+	}
+
+	// Get duration with ffprobe
+	out, err := exec.Command("ffprobe",
+		"-v", "error",
+		"-show_entries", "format=duration",
+		"-of", "csv=p=0",
+		videoPath,
+	).Output()
+	if err != nil {
+		return nil, fmt.Errorf("ffprobe failed: %v", err)
+	}
+	var duration float64
+	fmt.Sscanf(strings.TrimSpace(string(out)), "%f", &duration)
+	if duration <= 0 {
+		duration = 60
+	}
+
+	tmpDir, err := os.MkdirTemp("", "odek-vision-*")
+	if err != nil {
+		return nil, fmt.Errorf("cannot create temp dir: %v", err)
+	}
+
+	// Extract frames at evenly-spaced timestamps, avoiding the very start/end
+	interval := duration / float64(n+1)
+	var frames []string
+	for i := 1; i <= n; i++ {
+		ts := interval * float64(i)
+		out := filepath.Join(tmpDir, fmt.Sprintf("frame_%02d.jpg", i))
+		cmd := exec.Command("ffmpeg",
+			"-ss", fmt.Sprintf("%.3f", ts),
+			"-i", videoPath,
+			"-frames:v", "1",
+			"-q:v", "2",
+			"-y",
+			out,
+		)
+		if cmd.Run() == nil {
+			frames = append(frames, out)
+		}
+	}
+
+	if len(frames) == 0 {
+		os.RemoveAll(tmpDir)
+		return nil, fmt.Errorf("no frames could be extracted from %q", videoPath)
+	}
+	return frames, nil
+}
+
+// runLlamaMtmd calls llama-mtmd-cli in single-turn mode with one or more images
+// and returns the trimmed stdout response.
+func runLlamaMtmd(binary, modelPath, mmprojPath, prompt string, imagePaths []string) (string, error) {
+	args := []string{
+		"-m", modelPath,
+		"--mmproj", mmprojPath,
+		"-c", "4096",
+		"--temp", "0.7",
+		"--top-p", "0.8",
+		"--top-k", "100",
+		"--repeat-penalty", "1.05",
+		"-n", strconv.Itoa(1024),
+		"-p", prompt,
+	}
+	for _, img := range imagePaths {
+		args = append(args, "--image", img)
+	}
+
+	cmd := exec.Command(binary, args...)
+	output, err := cmd.Output()
+	if err != nil {
+		if exitErr, ok := err.(*exec.ExitError); ok {
+			return "", fmt.Errorf("llama-mtmd-cli failed (exit %d): %s",
+				exitErr.ExitCode(), strings.TrimSpace(string(exitErr.Stderr)))
+		}
+		return "", fmt.Errorf("llama-mtmd-cli failed: %v", err)
+	}
+	return strings.TrimSpace(string(output)), nil
+}
+
+// ═════════════════════════════════════════════════════════════════════════
+// vision Tool
+// ═════════════════════════════════════════════════════════════════════════
+
+type visionTool struct {
+	dangerousConfig danger.DangerousConfig
+	visionCfg       config.VisionConfig
+}
+
+func newVisionTool(dc danger.DangerousConfig, vc config.VisionConfig) *visionTool {
+	return &visionTool{dangerousConfig: dc, visionCfg: vc}
+}
+
+func (t *visionTool) Name() string { return "vision" }
+func (t *visionTool) Description() string {
+	return `Analyze an image or video file using MiniCPM-V 4.6, a local 1.3B multimodal model (llama-mtmd-cli). Images are described directly; videos are sampled into evenly-spaced frames and analyzed together. Supports JPEG, PNG, GIF, WebP, BMP for images and MP4, MOV, AVI, MKV, WebM for video. Requires llama-mtmd-cli and MiniCPM-V 4.6 model files (bundled in the Docker image).`
+}
+
+type visionArgs struct {
+	Path   string `json:"path"`
+	Prompt string `json:"prompt,omitempty"`
+}
+
+type visionResult struct {
+	Description string `json:"description"`
+	Model       string `json:"model"`
+	Type        string `json:"type"` // "image" or "video"
+	Frames      int    `json:"frames,omitempty"`
+	Error       string `json:"error,omitempty"`
+}
+
+func (t *visionTool) Schema() any {
+	return map[string]any{
+		"type": "object",
+		"properties": map[string]any{
+			"path": map[string]any{
+				"type":        "string",
+				"description": "Path to an image (JPEG, PNG, GIF, WebP, BMP) or video file (MP4, MOV, AVI, MKV, WebM).",
+			},
+			"prompt": map[string]any{
+				"type":        "string",
+				"description": `Instruction or question for the model. Default: "Describe this in detail."`,
+			},
+		},
+		"required": []string{"path"},
+	}
+}
+
+func (t *visionTool) Call(argsJSON string) (result string, err error) {
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("vision: panic: %v", r)
+			result = `{"error":"internal error"}`
+		}
+	}()
+
+	var args visionArgs
+	if err := json.Unmarshal([]byte(argsJSON), &args); err != nil {
+		return jsonError("invalid arguments: " + err.Error())
+	}
+	if args.Path == "" {
+		return jsonError("path is required")
+	}
+	prompt := args.Prompt
+	if prompt == "" {
+		prompt = "Describe this in detail."
+	}
+
+	// Security: classify the file path
+	if err := t.dangerousConfig.CheckOperation(danger.ToolOperation{
+		Name: "vision", Resource: args.Path, Risk: danger.ClassifyPath(args.Path),
+	}, nil); err != nil {
+		return jsonError(err.Error())
+	}
+
+	// Check the file can be opened; O_NOFOLLOW rejects a symlink in the final path component
+	f, err := os.OpenFile(args.Path, os.O_RDONLY|syscall.O_NOFOLLOW, 0)
+	if err != nil {
+		return jsonResult(visionResult{
+			Error: fmt.Sprintf("cannot open file %q: %v", args.Path, err),
+		})
+	}
+	f.Close()
+
+	binary, err := llamaMtmdBinary(t.visionCfg)
+	if err != nil {
+		return jsonResult(visionResult{Error: err.Error()})
+	}
+	modelPath, mmprojPath, err := visionModelPaths(t.visionCfg)
+	if err != nil {
+		return jsonResult(visionResult{Error: err.Error()})
+	}
+
+	ext := strings.ToLower(filepath.Ext(args.Path))
+	source := "vision:" + args.Path
+
+	if videoExts[ext] {
+		return t.analyzeVideo(binary, modelPath, mmprojPath, args.Path, prompt, source)
+	}
+	return t.analyzeImage(binary, modelPath, mmprojPath, args.Path, prompt, source)
+}
+
+func (t *visionTool) analyzeImage(binary, modelPath, mmprojPath, imgPath, prompt, source string) (string, error) {
+	desc, err := runLlamaMtmd(binary, modelPath, mmprojPath, prompt, []string{imgPath})
+	if err != nil {
+		return jsonResult(visionResult{Error: err.Error()})
+	}
+	return jsonResult(visionResult{
+		Description: wrapUntrusted(source, desc),
+		Model:       "minicpm-v-4.6",
+		Type:        "image",
+	})
+}
+
+func (t *visionTool) analyzeVideo(binary, modelPath, mmprojPath, videoPath, prompt, source string) (string, error) {
+	n := t.visionCfg.VideoFrames
+	if n <= 0 {
+		n = 8
+	}
+
+	frames, err := extractVideoFrames(videoPath, n)
+	if err != nil {
+		return jsonResult(visionResult{Error: err.Error()})
+	}
+	defer os.RemoveAll(filepath.Dir(frames[0]))
+
+	videoPrompt := fmt.Sprintf(
+		"These are %d frames sampled evenly from a video. %s",
+		len(frames), prompt,
+	)
+	desc, err := runLlamaMtmd(binary, modelPath, mmprojPath, videoPrompt, frames)
+	if err != nil {
+		return jsonResult(visionResult{Error: err.Error()})
+	}
+	return jsonResult(visionResult{
+		Description: wrapUntrusted(source, desc),
+		Model:       "minicpm-v-4.6",
+		Type:        "video",
+		Frames:      len(frames),
+	})
+}
+
+// Ensure visionTool implements odek.Tool
+var _ odek.Tool = (*visionTool)(nil)
diff --git a/cmd/odek/vision_tool_test.go b/cmd/odek/vision_tool_test.go
new file mode 100644
index 0000000..024f5d8
--- /dev/null
+++ b/cmd/odek/vision_tool_test.go
@@ -0,0 +1,338 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"github.com/BackendStack21/odek/internal/config"
+	"github.com/BackendStack21/odek/internal/danger"
+)
+
+// ── Helpers ──────────────────────────────────────────────────────────────
+
+// createMockLlamaMtmd creates a shell script that mimics llama-mtmd-cli
+// in single-turn mode: it ignores all flags and writes a fixed description
+// to stdout. This lets tests exercise the full tool.Call() path without a
+// real model or GPU.
+func createMockLlamaMtmd(t *testing.T) string {
+	t.Helper()
+	dir := t.TempDir()
+	script := `#!/bin/sh
+echo 'A vivid test scene with colorful objects arranged neatly. The foreground shows a simple geometric shape and the background is uniformly lit.'
+`
+	path := filepath.Join(dir, "llama-mtmd-cli")
+	if err := os.WriteFile(path, []byte(script), 0755); err != nil {
+		t.Fatalf("createMockLlamaMtmd: %v", err)
+	}
+	return path
+}
+
+// createMockFftools creates mock ffprobe and ffmpeg binaries in a temp dir
+// and returns the dir. Prepend it to PATH so extractVideoFrames uses them.
+//
+//	ffprobe: outputs a fixed duration (10.0 seconds)
+//	ffmpeg:  creates a minimal JPEG stub at the last argument path
+func createMockFftools(t *testing.T) string {
+	t.Helper()
+	dir := t.TempDir()
+
+	ffprobe := `#!/bin/sh
+echo '10.000000'
+`
+	// ffmpeg mock: the output path is always the last argument.
+	// "shift $(($# - 1))" leaves $1 as the last original arg.
+	ffmpeg := `#!/bin/sh
+shift $(($# - 1))
+mkdir -p "$(dirname "$1")"
+printf '\xff\xd8\xff\xe0' > "$1"
+`
+	for name, body := range map[string]string{"ffprobe": ffprobe, "ffmpeg": ffmpeg} {
+		p := filepath.Join(dir, name)
+		if err := os.WriteFile(p, []byte(body), 0755); err != nil {
+			t.Fatalf("createMockFftools: %v", err)
+		}
+	}
+	return dir
+}
+
+// fakeModelsDir creates a directory with stub model.gguf and mmproj.gguf files.
+func fakeModelsDir(t *testing.T) string {
+	t.Helper()
+	dir := t.TempDir()
+	os.WriteFile(filepath.Join(dir, "model.gguf"), []byte("fake model"), 0644)
+	os.WriteFile(filepath.Join(dir, "mmproj.gguf"), []byte("fake mmproj"), 0644)
+	return dir
+}
+
+// fakeImageFile writes a minimal file with a supported image extension.
+func fakeImageFile(t *testing.T, ext string) string {
+	t.Helper()
+	path := filepath.Join(t.TempDir(), "test"+ext)
+	os.WriteFile(path, []byte("fake image data"), 0644)
+	return path
+}
+
+// fakeVideoFile writes a minimal file with a supported video extension.
+func fakeVideoFile(t *testing.T) string {
+	t.Helper()
+	path := filepath.Join(t.TempDir(), "test.mp4")
+	os.WriteFile(path, []byte("fake video data"), 0644)
+	return path
+}
+
+// decodeVisionResult is a convenience wrapper for parsing tool output.
+func decodeVisionResult(t *testing.T, raw string) visionResult {
+	t.Helper()
+	var r visionResult
+	if err := json.Unmarshal([]byte(raw), &r); err != nil {
+		t.Fatalf("decodeVisionResult: unmarshal failed: %v\nraw: %s", err, raw)
+	}
+	return r
+}
+
+// ── Unit Tests ────────────────────────────────────────────────────────────
+
+func TestVision_EmptyPath(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{})
+	result, err := tool.Call(`{"path":""}`)
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	var r struct{ Error string `json:"error"` }
+	json.Unmarshal([]byte(result), &r)
+	if !strings.Contains(r.Error, "required") {
+		t.Errorf("expected 'required' in error, got: %s", r.Error)
+	}
+}
+
+func TestVision_InvalidJSON(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{})
+	result, err := tool.Call(`{bad json}`)
+	if err != nil {
+		return // error return is also acceptable
+	}
+	var r struct{ Error string `json:"error"` }
+	json.Unmarshal([]byte(result), &r)
+	if !strings.Contains(r.Error, "invalid") {
+		t.Errorf("expected 'invalid' in error, got: %s", r.Error)
+	}
+}
+
+func TestVision_FileNotFound(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{})
+	result, err := tool.Call(`{"path":"/nonexistent/image.jpg"}`)
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if !strings.Contains(r.Error, "cannot open") {
+		t.Errorf("expected 'cannot open' in error, got: %s", r.Error)
+	}
+}
+
+func TestVision_SymlinkRejected(t *testing.T) {
+	dir := t.TempDir()
+	target := filepath.Join(dir, "real.png")
+	os.WriteFile(target, []byte("data"), 0644)
+	link := filepath.Join(dir, "link.jpg")
+	os.Symlink(target, link)
+
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{})
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, link))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if r.Error == "" {
+		t.Error("expected an error for symlink path, got none")
+	}
+}
+
+func TestVision_MissingBinary(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath: "/nonexistent/llama-mtmd-cli",
+	})
+	path := fakeImageFile(t, ".jpg")
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, path))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if !strings.Contains(r.Error, "llama") && !strings.Contains(r.Error, "not found") {
+		t.Errorf("expected error mentioning 'llama' or 'not found', got: %s", r.Error)
+	}
+}
+
+func TestVision_MissingModel(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath: createMockLlamaMtmd(t),
+		ModelsDir:  t.TempDir(), // empty — no model.gguf
+	})
+	path := fakeImageFile(t, ".jpg")
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, path))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if !strings.Contains(r.Error, "model") && !strings.Contains(r.Error, "not found") {
+		t.Errorf("expected error about missing model, got: %s", r.Error)
+	}
+}
+
+func TestVision_MissingMmproj(t *testing.T) {
+	dir := t.TempDir()
+	os.WriteFile(filepath.Join(dir, "model.gguf"), []byte("fake"), 0644)
+	// mmproj.gguf intentionally absent
+
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath: createMockLlamaMtmd(t),
+		ModelsDir:  dir,
+	})
+	path := fakeImageFile(t, ".jpg")
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, path))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if !strings.Contains(r.Error, "mmproj") && !strings.Contains(r.Error, "projector") {
+		t.Errorf("expected error about missing mmproj, got: %s", r.Error)
+	}
+}
+
+// ── Mock Happy-Path Tests ─────────────────────────────────────────────────
+
+// TestVision_MockHappyPath_Image exercises the full image analysis flow
+// with a mock llama-mtmd-cli binary — no GPU or real model required.
+func TestVision_MockHappyPath_Image(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath: createMockLlamaMtmd(t),
+		ModelsDir:  fakeModelsDir(t),
+	})
+
+	for _, ext := range []string{".jpg", ".jpeg", ".png", ".webp"} {
+		t.Run("ext="+ext, func(t *testing.T) {
+			path := fakeImageFile(t, ext)
+			result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, path))
+			if err != nil {
+				t.Fatalf("unexpected error: %v", err)
+			}
+			r := decodeVisionResult(t, result)
+			if r.Error != "" {
+				t.Fatalf("expected success, got error: %s", r.Error)
+			}
+			if r.Type != "image" {
+				t.Errorf("type = %q, want 'image'", r.Type)
+			}
+			if r.Model != "minicpm-v-4.6" {
+				t.Errorf("model = %q, want 'minicpm-v-4.6'", r.Model)
+			}
+			if r.Description == "" {
+				t.Error("description is empty")
+			}
+			if !strings.Contains(r.Description, "test scene") {
+				t.Errorf("description = %q, expected mock output containing 'test scene'", r.Description)
+			}
+		})
+	}
+}
+
+// TestVision_MockHappyPath_CustomPrompt verifies that a custom prompt is
+// accepted (the mock binary ignores it, but the tool must not error).
+func TestVision_MockHappyPath_CustomPrompt(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath: createMockLlamaMtmd(t),
+		ModelsDir:  fakeModelsDir(t),
+	})
+	path := fakeImageFile(t, ".png")
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s","prompt":"What text is visible?"}`, path))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if r.Error != "" {
+		t.Fatalf("expected success, got error: %s", r.Error)
+	}
+	if r.Type != "image" {
+		t.Errorf("type = %q, want 'image'", r.Type)
+	}
+}
+
+// TestVision_MockHappyPath_Video exercises the full video analysis flow
+// with mock ffprobe, ffmpeg, and llama-mtmd-cli. It verifies frame extraction
+// and the multi-image prompt path end-to-end.
+func TestVision_MockHappyPath_Video(t *testing.T) {
+	ffDir := createMockFftools(t)
+	// Prepend mock tool dir to PATH so exec.LookPath finds them first.
+	orig := os.Getenv("PATH")
+	t.Setenv("PATH", ffDir+string(os.PathListSeparator)+orig)
+
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath:  createMockLlamaMtmd(t),
+		ModelsDir:   fakeModelsDir(t),
+		VideoFrames: 3,
+	})
+
+	videoPath := fakeVideoFile(t)
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, videoPath))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if r.Error != "" {
+		t.Fatalf("expected success, got error: %s", r.Error)
+	}
+	if r.Type != "video" {
+		t.Errorf("type = %q, want 'video'", r.Type)
+	}
+	if r.Model != "minicpm-v-4.6" {
+		t.Errorf("model = %q, want 'minicpm-v-4.6'", r.Model)
+	}
+	if r.Frames == 0 {
+		t.Error("frames count is 0, expected > 0")
+	}
+	if r.Description == "" {
+		t.Error("description is empty")
+	}
+}
+
+// TestVision_VideoFallsBackOnMissingFfmpeg verifies that a missing ffmpeg
+// produces a clear, actionable error rather than a panic.
+func TestVision_VideoFallsBackOnMissingFfmpeg(t *testing.T) {
+	// Point PATH at an empty dir so neither ffmpeg nor ffprobe is found.
+	t.Setenv("PATH", t.TempDir())
+
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{
+		BinaryPath:  createMockLlamaMtmd(t),
+		ModelsDir:   fakeModelsDir(t),
+		VideoFrames: 4,
+	})
+	videoPath := fakeVideoFile(t)
+	result, err := tool.Call(fmt.Sprintf(`{"path":"%s"}`, videoPath))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	r := decodeVisionResult(t, result)
+	if !strings.Contains(r.Error, "ffmpeg") && !strings.Contains(r.Error, "ffprobe") {
+		t.Errorf("expected error mentioning ffmpeg/ffprobe, got: %s", r.Error)
+	}
+}
+
+// TestVision_SchemaShape verifies the JSON schema contains the required keys.
+func TestVision_SchemaShape(t *testing.T) {
+	tool := newVisionTool(danger.DangerousConfig{}, config.VisionConfig{})
+	schema := tool.Schema()
+	b, err := json.Marshal(schema)
+	if err != nil {
+		t.Fatalf("marshal schema: %v", err)
+	}
+	s := string(b)
+	for _, want := range []string{`"path"`, `"prompt"`, `"required"`} {
+		if !strings.Contains(s, want) {
+			t.Errorf("schema missing %q; schema: %s", want, s)
+		}
+	}
+}
diff --git a/docker/Dockerfile b/docker/Dockerfile
index 4de3ace..11e2e31 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -54,6 +54,44 @@ RUN mkdir -p /models \
  && curl -fsSL -o "/models/ggml-${WHISPER_MODEL}.bin" \
       "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-${WHISPER_MODEL}.bin"
 
+# ---- minicpm-v stage ----
+# Downloads a pre-built llama-mtmd-cli binary and fetches MiniCPM-V 4.6 model
+# files (main GGUF + vision projector) so the `vision` tool works out of the
+# box with zero host setup. Pre-built binaries from the official llama.cpp
+# release avoid a multi-minute C++ compile inside Docker.
+#
+# Size added to the runtime image: ~530 MB model + ~1.1 GB mmproj ≈ 1.6 GB.
+# To use a different quantization level: --build-arg MINICPM_QUANT=Q8_0
+# Available quants: Q4_0 (501 MB) | Q4_K_M (529 MB, default) | Q8_0 (812 MB)
+# LLAMA_VERSION pins the llama.cpp release — bump it deliberately.
+FROM debian:bookworm-slim AS minicpm
+ARG MINICPM_QUANT=Q4_K_M
+ARG LLAMA_VERSION=b9549
+RUN apt-get update && apt-get install -y --no-install-recommends \
+      curl ca-certificates \
+ && rm -rf /var/lib/apt/lists/*
+# Map dpkg architecture to the llama.cpp release asset name suffix
+# (amd64 → x64, arm64 → arm64) and extract only llama-mtmd-cli.
+RUN set -ex \
+ && LLAMA_ARCH=$(dpkg --print-architecture | sed 's/amd64/x64/') \
+ && curl -fsSL \
+      "https://github.com/ggerganov/llama.cpp/releases/download/${LLAMA_VERSION}/llama-${LLAMA_VERSION}-bin-ubuntu-${LLAMA_ARCH}.tar.gz" \
+      -o /tmp/llama.tar.gz \
+ && mkdir -p /tmp/llama-bin \
+ && tar -xzf /tmp/llama.tar.gz -C /tmp/llama-bin \
+ && find /tmp/llama-bin -name "llama-mtmd-cli" -exec install -m 755 {} /usr/local/bin/llama-mtmd-cli \; \
+ && test -x /usr/local/bin/llama-mtmd-cli && rm -rf /tmp/llama.tar.gz /tmp/llama-bin
+# Fetch the model GGUF and vision projector into a fixed image path (NOT under
+# ~/.odek — bind-mount profiles would hide files baked there). The runtime
+# config points vision.models_dir at this path, or the tool auto-detects it.
+RUN mkdir -p /usr/local/share/minicpm-v/models \
+ && curl -fsSL \
+      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/MiniCPM-V-4_6-${MINICPM_QUANT}.gguf" \
+      -o /usr/local/share/minicpm-v/models/model.gguf \
+ && curl -fsSL \
+      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/mmproj-model-f16.gguf" \
+      -o /usr/local/share/minicpm-v/models/mmproj.gguf
+
 # ---- runtime stage ----
 FROM debian:bookworm-slim
 # Tooling the agent commonly needs inside the sandbox container.
@@ -82,6 +120,11 @@ RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
 COPY --from=whisper /whisper/build/bin/whisper-cli /usr/local/bin/whisper-cli
 COPY --from=whisper /models/ /usr/local/share/whisper/models/
 
+# Bundle llama-mtmd-cli + MiniCPM-V 4.6 from the minicpm stage so `vision`
+# works with zero setup. The tool auto-detects /usr/local/share/minicpm-v/models.
+COPY --from=minicpm /usr/local/bin/llama-mtmd-cli /usr/local/bin/llama-mtmd-cli
+COPY --from=minicpm /usr/local/share/minicpm-v/models/ /usr/local/share/minicpm-v/models/
+
 # ── Adding extra dependencies the agent can use ──────────────────────────
 # The agent runs shell commands INSIDE this image, so any runtime or CLI it
 # should call must be installed here. Trim or extend to taste, then rebuild
diff --git a/docker/README.md b/docker/README.md
index e6d367f..a4922d9 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -150,6 +150,25 @@ auto-transcription work with zero setup. No host install, no first-run download.
   `--build-arg WHISPER_MODEL=base` (or `small` / `medium`) and bump the
   `model` field in the config to match.
 
+## Image & video understanding (out of the box)
+
+The image **bundles `llama-mtmd-cli` (llama.cpp b9549) and MiniCPM-V 4.6**
+(1.3B multimodal model) so the `vision` tool works with zero setup — no cloud
+API, no host install, no first-run download.
+
+- The model GGUF (`Q4_K_M`, ~529 MB) and vision projector (`mmproj`, ~1.1 GB)
+  ship at `/usr/local/share/minicpm-v/models/`. They live outside `~/.odek` so
+  Telegram bind-mounts cannot shadow them.
+- Send the agent an image path → `vision` describes it locally using the
+  bundled 1.3B model. Video files (MP4, MOV, AVI, MKV, WebM) are sampled into
+  frames via `ffmpeg` and analysed together in one multi-image call.
+- Want a higher-quality quantization? Rebuild with
+  `--build-arg MINICPM_QUANT=Q8_0` (812 MB model, better accuracy at the cost
+  of ~300 MB extra image size). Available quants: `Q4_0` (501 MB), `Q4_K_M`
+  (529 MB, default), `Q8_0` (812 MB).
+- To point at models installed on the host instead, set `vision.models_dir` in
+  config to the directory containing `model.gguf` and `mmproj.gguf`.
+
 ## Verify the profiles differ
 
 - **Restricted**: ask it to `rm -rf` everything in `/workspace` → denied, never runs.
diff --git a/docs/CHEATSHEET.md b/docs/CHEATSHEET.md
index 3bcc6dd..752a4bb 100644
--- a/docs/CHEATSHEET.md
+++ b/docs/CHEATSHEET.md
@@ -85,6 +85,25 @@ Priority: `~/.odek/config.json` ← `./odek.json` ← `ODEK_*` env ← CLI flags
 
 Settings: `model` (tiny/base/small/medium), `language` (ISO code, empty=auto), `auto_transcribe` (Telegram voice → text), `models_dir` (model directory), `binary_path` (whisper binary path).
 
+### Image & Video Understanding
+- **`vision`** tool uses local MiniCPM-V 4.6 (1.3B) via `llama-mtmd-cli` — no cloud APIs
+- Accepts images (JPEG, PNG, GIF, WebP, BMP) and videos (MP4, MOV, AVI, MKV, WebM)
+- Videos are sampled into evenly spaced frames with ffmpeg; all frames are analysed in one call
+- Model files: `model.gguf` (~529 MB, Q4\_K\_M) + `mmproj.gguf` (~1.1 GB) — bundled in the Docker image at `/usr/local/share/minicpm-v/models/`
+- Configure via `vision` section in config:
+
+```json
+{
+  "vision": {
+    "models_dir": "~/.odek/minicpm-v/models",
+    "binary_path": "/usr/local/bin/llama-mtmd-cli",
+    "video_frames": 8
+  }
+}
+```
+
+Settings: `models_dir` (dir with `model.gguf` + `mmproj.gguf`), `binary_path` (llama-mtmd-cli path), `video_frames` (frames to sample from video, default 8).
+
 ## Memory System Architecture
 
 ### Three Tiers
diff --git a/docs/CONFIG.md b/docs/CONFIG.md
index 8887f3f..03f973e 100644
--- a/docs/CONFIG.md
+++ b/docs/CONFIG.md
@@ -434,7 +434,7 @@ The progress system is an evolving single message that gets edited in-place (sim
 ```
 
 Key behaviors:
-- **Smart previews** — instead of showing raw JSON args, the system extracts meaningful context: filename for file tools, the command text for shell, URL for browser, query text for memory/search tools, audio filename for transcribe
+- **Smart previews** — instead of showing raw JSON args, the system extracts meaningful context: filename for file tools, the command text for shell, URL for browser, query text for memory/search tools, audio filename for transcribe, file path for vision
 - **Edit throttling** — edits are rate-limited to one every 1.5 seconds to avoid hitting Telegram's flood control limits. Rapid tool chains don't produce 429 errors
 - **Tool dedup** — when the same tool runs consecutively (common with parallel batch tools like `batch_read`), identical lines are collapsed into a `(×N)` counter instead of repeating N times
 - **Flood control fallback** — if an edit message fails with "flood" or "retry after", the system automatically switches to sending new messages instead of editing. This prevents the bot from becoming unresponsive under heavy load
diff --git a/docs/SECURITY.md b/docs/SECURITY.md
index 755ccf3..9c66f83 100644
--- a/docs/SECURITY.md
+++ b/docs/SECURITY.md
@@ -58,6 +58,7 @@ Tools that wrap:
 | `search_files`, `multi_grep` | `<path>:<line>` per match |
 | `shell` | `$ <command>` |
 | `transcribe` | `transcribe:<audio path>` (full transcript + each segment) |
+| `vision` | `vision:<file path>` (full description) |
 | `session_search` | `session_search` (whole result — past sessions may be tainted) |
 | any MCP tool | `mcp:<server>:<tool>` |
 
@@ -102,7 +103,7 @@ Both:
 
 Taint is decided per tool call by `memory.ToolCallTaints` (the single source of truth, shared with skills):
 
-- **Always untrusted:** `browser`, `http_batch`, `transcribe` (network / opaque-audio content), `session_search` (recall of prior-session transcripts, which may carry earlier-injected text), and any MCP tool (`server__tool`).
+- **Always untrusted:** `browser`, `http_batch`, `transcribe` (network / opaque-audio content), `vision` (opaque-image/video content), `session_search` (recall of prior-session transcripts, which may carry earlier-injected text), and any MCP tool (`server__tool`).
 - **Path-reading tools** (`read_file`, `search_files`, `multi_grep`, `batch_read`, `json_query`, `head_tail`, `count_lines`, `checksum`, `word_count`, `sort`, `tr`, `diff`, `file_info`, `glob`, `tree`, `base64`) taint when **any** of their path arguments resolves **outside the workspace trust zone** — the workspace dir, the sandbox `/workspace` mount, or `~/.odek`. Reads confined to the workspace stay trusted, so ordinary coding sessions remain recallable; reads of anything else (system/credential paths, home files, sibling repos) taint. The check is a workspace-containment allowlist rather than a sensitive-path denylist, and it resolves symlinks (so e.g. `/etc` → `/private/etc` on macOS cannot disguise an escape). A malformed argument string is treated conservatively as untrusted. When adding a new file-reading tool, add it to `PathReadingTools`.
 
 **Auto-extracted durable facts are opt-in and trusted-only.** At session end odek
@@ -140,7 +141,7 @@ Promotion is **CLI-only and human-gated** — it is deliberately *not* exposed a
 
 ### 6. Skill provenance gate
 
-`internal/skills` carries the same provenance model and shares the exact taint decision (`memory.ToolCallTaints`). Skills auto-saved from sessions that crossed the trust boundary — `browser` / `http_batch` / `transcribe` / any MCP tool, or a `read_file` / `search_files` / `multi_grep` of a **sensitive** path — are tagged with `Provenance.Untrusted=true` and `NeedsReview=true`. The skill loader pins those skills to the Lazy set regardless of their `auto_load` flag.
+`internal/skills` carries the same provenance model and shares the exact taint decision (`memory.ToolCallTaints`). Skills auto-saved from sessions that crossed the trust boundary — `browser` / `http_batch` / `transcribe` / `vision` / any MCP tool, or a `read_file` / `search_files` / `multi_grep` of a **sensitive** path — are tagged with `Provenance.Untrusted=true` and `NeedsReview=true`. The skill loader pins those skills to the Lazy set regardless of their `auto_load` flag.
 
 After reviewing the skill body, promote it:
 
diff --git a/docs/TELEGRAM.md b/docs/TELEGRAM.md
index f497dc2..eb30fdd 100644
--- a/docs/TELEGRAM.md
+++ b/docs/TELEGRAM.md
@@ -328,7 +328,7 @@ Tool progress shows what the agent is doing in real time. Controlled by the `too
 **Key features:**
 - **Reasoning-first progress** — the first sentence of the LLM's internal reasoning (under 20 words) appears at the top of the progress bubble, followed by individual tool previews. The LLM is prompted to make this sentence user-facing, specific, and engaging
 - **Language matching** — the bot always replies in the same language the user writes in, including the thinking message and progress indicator
-- **Smart previews** — extracts meaningful context: filename for file tools, command for shell, URL for browser, query for memory, filename for transcribe
+- **Smart previews** — extracts meaningful context: filename for file tools, command for shell, URL for browser, query for memory, filename for transcribe, file path for vision
 - **Edit throttling** — 1.5s minimum between edits prevents Telegram flood control (429 errors)
 - **Tool dedup** — if the same tool runs N times in a row (common with parallel batches), shows `📝 read_file: "main.go" (×5)` instead of 5 identical lines
 - **Flood fallback** — if an edit fails with "flood" or "retry after", automatically switches to sending new messages
diff --git a/internal/config/loader.go b/internal/config/loader.go
index 77ccd84..0948914 100644
--- a/internal/config/loader.go
+++ b/internal/config/loader.go
@@ -97,6 +97,20 @@ type TranscriptionConfig struct {
 	BinaryPath     string `json:"binary_path,omitempty"`
 }
 
+// VisionConfig controls the vision tool (MiniCPM-V 4.6 via llama-mtmd-cli).
+// Populated from the "vision" section of odek.json or ~/.odek/config.json.
+type VisionConfig struct {
+	// ModelsDir is the directory containing model.gguf and mmproj.gguf.
+	// Default: /usr/local/share/minicpm-v/models (Docker image path), with
+	// fallback to ~/.odek/minicpm-v/models for out-of-container installs.
+	ModelsDir string `json:"models_dir,omitempty"`
+	// BinaryPath overrides PATH lookup for the llama-mtmd-cli binary.
+	BinaryPath string `json:"binary_path,omitempty"`
+	// VideoFrames is the number of frames to sample evenly from a video file.
+	// Default: 8.
+	VideoFrames int `json:"video_frames,omitempty"`
+}
+
 // FileConfig is the JSON schema used by ~/.odek/config.json and ./odek.json.
 // Pointer booleans distinguish "explicitly set to false" from "not set".
 type FileConfig struct {
@@ -164,6 +178,9 @@ type FileConfig struct {
 	// Transcription configures local audio transcription (whisper.cpp).
 	Transcription *TranscriptionConfig `json:"transcription,omitempty"`
 
+	// Vision configures local image/video understanding (MiniCPM-V 4.6 via llama-mtmd-cli).
+	Vision *VisionConfig `json:"vision,omitempty"`
+
 	// Schedules configures the native in-process task scheduler.
 	Schedules *SchedulesConfig `json:"schedules,omitempty"`
 
@@ -272,6 +289,10 @@ type ResolvedConfig struct {
 	// Default: auto_transcribe=true, model="tiny", language="", no binary_path.
 	Transcription TranscriptionConfig
 
+	// Vision is the resolved vision config.
+	// Default: VideoFrames=8, ModelsDir="" (auto-detect), BinaryPath="" (PATH lookup).
+	Vision VisionConfig
+
 	// Schedules is the resolved scheduler config.
 	// Default: enabled=true, max_concurrent=2, timezone="UTC", catchup=false.
 	Schedules ScheduleConfig
@@ -662,6 +683,7 @@ func LoadConfig(cli CLIFlags) ResolvedConfig {
 		MCPServers:      cfg.MCPServers,
 		Telegram:        resolveTelegram(cfg.Telegram),
 		Transcription:   resolveTranscription(cfg.Transcription),
+		Vision:          resolveVision(cfg.Vision),
 		Schedules:       resolveSchedules(cfg.Schedules),
 		InteractionMode: ifZero(cfg.InteractionMode, "engaging"),
 		ToolProgress:    ifZero(cfg.ToolProgress, "all"),
@@ -935,6 +957,20 @@ func resolveTranscription(cfg *TranscriptionConfig) TranscriptionConfig {
 	}
 }
 
+// resolveVision returns the resolved vision config.
+// A nil file config yields the defaults, and a zero VideoFrames is filled
+// with 8. The file config is copied before defaults are applied, so the
+// caller's VisionConfig is never modified.
+func resolveVision(cfg *VisionConfig) VisionConfig {
+	var out VisionConfig
+	if cfg != nil {
+		out = *cfg
+	}
+	if out.VideoFrames == 0 {
+		out.VideoFrames = 8
+	}
+	return out
+}
+
 // SchedulesConfig is the file-level scheduler configuration. Tri-state fields
 // use pointers so "unset" is distinguishable from an explicit false.
 type SchedulesConfig struct {
@@ -1088,6 +1124,9 @@ func overlayFile(base, override FileConfig) FileConfig {
 	if override.Transcription != nil {
 		base.Transcription = override.Transcription
 	}
+	if override.Vision != nil {
+		base.Vision = override.Vision
+	}
 	if override.Schedules != nil {
 		base.Schedules = override.Schedules
 	}
diff --git a/internal/config/vision_test.go b/internal/config/vision_test.go
new file mode 100644
index 0000000..41dd3e7
--- /dev/null
+++ b/internal/config/vision_test.go
@@ -0,0 +1,40 @@
+package config
+
+import "testing"
+
+func TestResolveVision_Defaults(t *testing.T) {
+	v := resolveVision(nil)
+	if v.VideoFrames != 8 {
+		t.Errorf("VideoFrames = %d, want 8", v.VideoFrames)
+	}
+	if v.ModelsDir != "" {
+		t.Errorf("ModelsDir = %q, want empty", v.ModelsDir)
+	}
+	if v.BinaryPath != "" {
+		t.Errorf("BinaryPath = %q, want empty", v.BinaryPath)
+	}
+}
+
+func TestResolveVision_ZeroFramesFilled(t *testing.T) {
+	v := resolveVision(&VisionConfig{VideoFrames: 0})
+	if v.VideoFrames != 8 {
+		t.Errorf("VideoFrames = %d, want 8 (zero filled with default)", v.VideoFrames)
+	}
+}
+
+func TestResolveVision_CustomValues(t *testing.T) {
+	v := resolveVision(&VisionConfig{
+		ModelsDir:   "/custom/models",
+		BinaryPath:  "/usr/local/bin/llama-mtmd-cli",
+		VideoFrames: 16,
+	})
+	if v.ModelsDir != "/custom/models" {
+		t.Errorf("ModelsDir = %q, want '/custom/models'", v.ModelsDir)
+	}
+	if v.BinaryPath != "/usr/local/bin/llama-mtmd-cli" {
+		t.Errorf("BinaryPath = %q, want '/usr/local/bin/llama-mtmd-cli'", v.BinaryPath)
+	}
+	if v.VideoFrames != 16 {
+		t.Errorf("VideoFrames = %d, want 16", v.VideoFrames)
+	}
+}

From be67d47b11e913adb8bfd18145e885bf0b387d20 Mon Sep 17 00:00:00 2001
From: Rolando Santamaria Maso <kyberneees@gmail.com>
Date: Sun, 7 Jun 2026 16:02:32 +0200
Subject: [PATCH 2/2] fix(vision): pin MiniCPM-V HuggingFace downloads to a
 specific revision hash
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

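Replace `resolve/main/` with `resolve/<sha>/` (78e02f0) in the Dockerfile and
in the setup instructions printed by the vision tool, so Docker builds and
manual installs stay reproducible: a future model update on the main branch
cannot silently change the files baked into the image.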

vprotocol auto-repair: finding D001 (Axis 2.6 Dependency Integrity).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 cmd/odek/vision_tool.go | 4 ++--
 docker/Dockerfile       | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/cmd/odek/vision_tool.go b/cmd/odek/vision_tool.go
index d9f57dd..20b1ab3 100644
--- a/cmd/odek/vision_tool.go
+++ b/cmd/odek/vision_tool.go
@@ -73,9 +73,9 @@ func visionModelPaths(cfg config.VisionConfig) (modelPath, mmprojPath string, er
 Download and install:
   mkdir -p %s
   cd %s
-  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/MiniCPM-V-4_6-Q4_K_M.gguf"
+  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/78e02f066e9819a60573b78a4275df8a0c27f698/MiniCPM-V-4_6-Q4_K_M.gguf"
   mv MiniCPM-V-4_6-Q4_K_M.gguf model.gguf
-  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/mmproj-model-f16.gguf"
+  curl -LO "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/78e02f066e9819a60573b78a4275df8a0c27f698/mmproj-model-f16.gguf"
   mv mmproj-model-f16.gguf mmproj.gguf
 
 Or set models_dir in the vision config.`, mp, dir, dir)
diff --git a/docker/Dockerfile b/docker/Dockerfile
index 11e2e31..64545b7 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -86,10 +86,10 @@ RUN set -ex \
 # config points vision.models_dir at this path, or the tool auto-detects it.
 RUN mkdir -p /usr/local/share/minicpm-v/models \
  && curl -fsSL \
-      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/MiniCPM-V-4_6-${MINICPM_QUANT}.gguf" \
+      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/78e02f066e9819a60573b78a4275df8a0c27f698/MiniCPM-V-4_6-${MINICPM_QUANT}.gguf" \
       -o /usr/local/share/minicpm-v/models/model.gguf \
  && curl -fsSL \
-      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/main/mmproj-model-f16.gguf" \
+      "https://huggingface.co/openbmb/MiniCPM-V-4_6-gguf/resolve/78e02f066e9819a60573b78a4275df8a0c27f698/mmproj-model-f16.gguf" \
       -o /usr/local/share/minicpm-v/models/mmproj.gguf
 
 # ---- runtime stage ----