Analysis — context-mode
ANALYSIS: context-mode
Section titled “ANALYSIS: context-mode”Summary
Section titled “Summary”context-mode is a two-mechanism MCP server for Claude Code (and 11 other platforms) that prevents raw tool output from entering the context window. The headline “98% token reduction” is real but applies specifically to the ctx_execute_file path (summarization mode). The ctx_index/ctx_search path achieves 44–93% savings — intentionally lower, because it returns exact content verbatim. Both mechanisms are architecturally sound and verified from source. The benchmark harness exists and is runnable (npx tsx tests/benchmark.ts), though reproduction requires dev dependencies.
What it does (verified from source)
Section titled “What it does (verified from source)”Mechanism 1: Sandbox executor
Section titled “Mechanism 1: Sandbox executor”The critical path for ctx_execute / ctx_execute_file / ctx_batch_execute:
- Agent calls MCP tool with
{ language, code }. Executorclass (src/executor.ts) writes code to a temp file in an isolated temp directory (OS_TMPDIR, bypassing theTMPDIRenv var which may point to the project root).- Subprocess is spawned via Node.js
child_process.spawnwith a hardened env: a 40+ entry denylist strips vars that could corrupt stdout or inject code (BASH_ENV,NODE_OPTIONS,PYTHONSTARTUP,RUBYOPT,ERL_AFLAGS,GIT_SSH_COMMAND,LD_PRELOAD,DYLD_INSERT_LIBRARIES, and others — see source for full list). - Only
stdoutis captured and returned to the MCP caller.stderrand exit code are available in the result struct but are not forwarded to the LLM unless the agent explicitly requests them. - Temp dir is cleaned up after execution.
Supported runtimes: JavaScript (Bun preferred, Node.js fallback), TypeScript, Python, Shell, Ruby, Go, Rust, PHP, Perl, R, Elixir (11 total).
ctx_execute_file wraps a file’s content into the script before execution — the agent writes analysis code, not a data read.
ctx_batch_execute combines multiple commands and BM25 search queries in a single MCP round-trip, eliminating per-tool latency for research workflows.
Smart truncation (v0.3+): when output exceeds the configured limit, the executor preserves the first 60% and last 40% of lines at line boundaries, with a labeled gap: "[47 lines / 3.2KB truncated — showing first 12 + last 8 lines]". This preserves error messages that appear at the end of output (e.g. stack traces, test failures).
Mechanism 2: Session tracker + FTS5 knowledge base
Section titled “Mechanism 2: Session tracker + FTS5 knowledge base”A SQLite database (better-sqlite3) backs three independent subsystems:
- Event log:
session_events,session_meta,session_resumetables record every tool call, file edit, git op, error, and user decision via hook callbacks. - Knowledge base:
chunksandchunks_trigramFTS5 virtual tables powerctx_index/ctx_search. The agent indexes fetched content and retrieves exact matching chunks on demand. - Compaction recovery: the
PreCompacthook fires before Claude Code’s built-in summarization, indexes the current session into FTS5, and injects only BM25-relevant events back into context — recovering session continuity without a full context dump.
Hook registration (Claude Code)
Section titled “Hook registration (Claude Code)”Hooks are registered in ~/.claude/settings.json. The required hooks are PreToolUse (intercepts: Bash, WebFetch, Read, Grep, Agent, Task, and the ctx_* tools themselves) and SessionStart. Optional: PostToolUse, PreCompact, UserPromptSubmit. Hook commands use absolute Node.js paths to avoid PATH issues across nvm/Homebrew/Volta environments.
Benchmark claims — verified vs as-reported
Section titled “Benchmark claims — verified vs as-reported”What the harness tests
Section titled “What the harness tests”The benchmark suite (tests/benchmark.ts, tests/context-comparison.ts, tests/ecosystem-benchmark.ts) tests two tool categories against a fixture corpus.
ctx_execute_file (summarization path)
Section titled “ctx_execute_file (summarization path)”| Scenario | Raw | Context | Savings | Status |
|---|---|---|---|---|
| Playwright snapshot | 56.2 KB | 299 B | 99% | as reported |
| GitHub Issues (20) | 58.9 KB | 1.1 KB | 98% | as reported |
| Access log (500 reqs) | 45.1 KB | 155 B | 100% | as reported |
| CSV analytics (500 rows) | 85.5 KB | 222 B | 100% | as reported |
| Git log (153 commits) | 11.6 KB | 107 B | 99% | as reported |
| Test output (30 suites) | 6.0 KB | 337 B | 95% | as reported |
Subtotal (13 scenarios): 316 KB → 5.5 KB (98% savings) (as reported)
These numbers are plausible given the mechanism: an LLM-written Python/JS script summarizes the data into 1–3 lines. The compression ratio is bounded by how much information fits in a console.log() call, not by the input size.
ctx_index + ctx_search (retrieval path)
Section titled “ctx_index + ctx_search (retrieval path)”| Scenario | Raw | Context (3 queries) | Savings | Status |
|---|---|---|---|---|
| Supabase Edge Functions | 3.9 KB | 2,246 B | 44% | as reported |
| React useEffect docs | 5.9 KB | 1,494 B | 75% | as reported |
| Next.js App Router | 6.5 KB | 3,311 B | 50% | as reported |
| Tailwind CSS docs | 4.0 KB | 620 B | 85% | as reported |
| Skill prompt (main) | 4.4 KB | 932 B | 79% | as reported |
| Skill refs (4 files) | 33.2 KB | 2,412 B | 93% | as reported |
Subtotal (6 scenarios): 60.3 KB → 11.0 KB (82% savings) (as reported)
The lower savings vs ctx_execute_file is by design and architecturally correct: BM25 search returns exact chunks, not summaries. A useEffect cleanup example comes back as the full code block, not "cleanup pattern in useEffect".
Session-level aggregate (as reported)
Section titled “Session-level aggregate (as reported)”Full debugging session scenario across 6 tool calls: 177 KB raw → 10.2 KB in context (94% reduction, ~45,300 → ~2,600 tokens).
Reproduction attempt
Section titled “Reproduction attempt”Status: partially verified — context savings confirmed; session-level aggregate unverified (fixture-based).
Run at pinned commit 601aaf1, macOS Darwin 25.4.0, Node.js v24.14.1, Bun 1.3.11, Python 3.14.3.
Dev dependencies installed via npm install in the submodule before running.
Context savings (verified):
| Scenario | Raw | Output | Savings |
|---|---|---|---|
| API Response (200 users) | ~49 KB | 22 B | 100% |
| Build Output (500 lines) | ~24 KB | 37 B | 100% |
| Log File (1000 entries) | ~78 KB | 67 B | 100% |
| npm ls output | ~39 KB | 25 B | 100% |
These match the reported range. The simulated workloads use pre-written summarization scripts — real-world savings depend on the agent writing effective scripts.
Cold start latency (verified — not disclosed in README):
| Runtime | Avg | Min | P95 |
|---|---|---|---|
| JavaScript (Bun) | 2851 ms | 2315 ms | 3650 ms |
| TypeScript (Bun) | 3418 ms | 3062 ms | 3854 ms |
| Python | 1958 ms | 1558 ms | 2679 ms |
| Shell | 2553 ms | 2402 ms | 3122 ms |
| Perl | 1182 ms | 289 ms | 3671 ms |
Each ctx_execute_file call adds 1–4 seconds of wall-clock latency (subprocess spawn + runtime startup). Not mentioned in the README or BENCHMARK.md. For sessions with many tool calls this compounds — 20 JS calls ≈ 57 seconds of pure overhead.
Concurrent execution scales linearly (10 tasks × 1605ms/task = 16053ms total), suggesting no I/O bottleneck beyond subprocess startup.
Architectural assessment
Section titled “Architectural assessment”What’s genuinely novel
Section titled “What’s genuinely novel”-
MCP-layer interception vs output filtering: RTK and similar tools filter output after it reaches the LLM context. context-mode prevents it from arriving at all — the raw output stays inside the subprocess. This is a fundamentally different threat model: the LLM never sees data it doesn’t need, so there is no risk of the model fixating on irrelevant output.
-
Two-speed retrieval in one tool: The
ctx_execute_file/ctx_index+ctx_searchdistinction maps cleanly onto “compute-and-discard” (logs, CSVs, snapshots) vs “retrieve-exact” (docs, code examples). Forcing the agent to choose the right tool makes the retrieval strategy explicit and auditable. -
PreCompact hook for session continuity: Rather than relying on Claude Code’s built-in summarizer (which loses structured context), the PreCompact hook snapshots the event log to FTS5 and reinjects only relevant events. This is the mechanism behind the “30 min → 3 hours” session duration claim — it avoids the context cliff that terminates most long sessions.
-
Security-conscious sandbox: The env var denylist (
BASH_ENV,NODE_OPTIONS,ERL_AFLAGS,LD_PRELOAD,GIT_SSH_COMMAND, etc.) is unusually thorough. Most comparable tools useexecwith the inherited environment. This matters for shared or CI environments where env vars could be attacker-controlled.
Gaps and risks
Section titled “Gaps and risks”- Benchmark methodology is agent-authored: the
ctx_execute_filesavings depend entirely on the agent writing a useful summarization script. If the agent writes a verbose script, savings drop. The harness fixtures are curated — real-world savings will vary. ctx_index+ctx_searchsavings vary 44–93% depending on query quality and corpus structure. The tool decision matrix (BENCHMARK.md) is the right guide; the headline “98%” figure does not apply here.--continueflag is footgun: without it, previous session state is silently deleted on restart. There is no warning in the default startup path.- ELv2 license: source-available but not OSI open source. Cannot be used as the basis of a competing hosted service. Free for individual/internal use — check with legal before vendoring in a product.
- Platform hook compliance varies: Zed and Antigravity enforce routing via instruction files only (~60% compliance as reported). The context-saving guarantee is not reliable on those platforms.
better-sqlite3native module: requires platform-specific rebuild after Node.js upgrades. Alpine Linux needs build tools for older versions. Silent breakage is possible.
Recommendation
Section titled “Recommendation”Adopt for Claude Code use. Already deployed in this session. The mechanism is sound, the architecture is auditable, and the ctx_execute_file path delivers the claimed savings on the scenarios tested. The ctx_index+ctx_search path is the better choice when exact content is needed (code examples, API refs, skill prompts).
Do not use the “98%” figure as a blanket claim — it applies to ctx_execute_file summarization scenarios. Quote per-mechanism numbers when comparing against other tools.
Priority for follow-up: run the full benchmark harness against a real debugging session to get verified session-level numbers (the 94% session aggregate is self-reported against a curated fixture set).
Comparison hooks (for ANALYSIS.md matrix)
Section titled “Comparison hooks (for ANALYSIS.md matrix)”| Dimension | context-mode |
|---|---|
| Approach | MCP-layer output interception + FTS5 knowledge base |
| Compression (summarization) | 95–100% (as reported) |
| Compression (retrieval) | 44–93% (as reported) |
| Token budget model | Implicit: agent chooses tool; no hard budget |
| Injection strategy | On-demand (agent calls ctx_search) + compaction recovery |
| Eviction | PreCompact hook → BM25 reinject of relevant events only |
| Benchmark harness | Yes — tests/benchmark.ts; fixture-based |
| License | ELv2 |
| Maturity | v1.0.75; 12 platforms; actively maintained |