# Context Management — Cross-Tool Synthesis

Research synthesis across all analyzed tools and papers. Updated as individual ANALYSIS-*.md files are added and promoted.
## Comparison matrix

Populated as analyses are added.
| Tool / Paper | Approach | Compression | Token budget model | Benchmarks | Notes | Overlap & recommendation |
|---|---|---|---|---|---|---|
| context-mode | MCP-layer output interception + FTS5 knowledge base | 95–100% (summarization, verified); 44–93% (retrieval, as reported) | Implicit: agent selects tool | Partially verified; cold start 1–4s/call undisclosed | PreCompact hook extends session ~30 min → ~3 hr (as reported); ELv2 license | No direct peer for MCP output sandboxing. Overlaps n2-arachne on budget enforcement — prefer this (better license, two-speed retrieval). Pair with codebase-memory-mcp for structural navigation. |
| codebase-memory-mcp | AST-to-SQLite knowledge graph; structural graph queries replace file reads | ~90–99% vs grep (directional; 5 live queries ~1,095 tokens verified) | None — result set size is the bound | No runnable harness; live queries verified | Dynamic language edges heuristic; no auth on MCP/UI; MIT | Overlaps code-review-graph, codegraph, jcodemunch-mcp (AST-graph family). Prefer code-review-graph for breadth (22 tools, community detection). Use this for pure-SQLite graph with no Python/Docker dependency. |
| code-review-graph | Tree-sitter AST → SQLite; blast-radius + community detection + hybrid search; 22 tools | 8.2× average (as reported, range 0.7×–16.4×); 49× “daily tasks” unverified | None — result set size | evaluate/ runner exists; not reproduced; MRR 0.35 (stated, low) | 7,624 stars; Python 3.10+; active community; MIT | Best of the AST-graph family for breadth. Overlaps codebase-memory-mcp, codegraph, jcodemunch-mcp. Prefer over codegraph (no README integrity issue) and jcodemunch-mcp (MIT vs non-OSI, richer toolset). |
| codegraph | Tree-sitter AST → SQLite; single codegraph_explore blast-radius tool | 94% fewer tool calls / 77% faster (as reported, own eval runner — unverified) | None — traversal result set varies | evaluation/runner.ts exists; not reproduced; 8.2× table is CRG’s data | WASM bundled; zero native deps; README integrity issue; 412 stars; MIT | Overlaps code-review-graph and jcodemunch-mcp. Prefer code-review-graph unless zero-dep WASM bundle is a hard requirement. README integrity issue (benchmark table copied from CRG) warrants caution. |
| graphify | Prompt-orchestrated multi-modal knowledge graph (skill.md drives Python CLI); Tree-sitter AST + LLM semantic extraction; Leiden community detection | 71.5× token reduction (as reported; single curated 52-file corpus; extreme baseline) | None — graph query cost vs raw file reads | No standalone harness; computed inline during /graphify runs | Multi-modal: code + PDF + image + video; persistent graph.json; 7-tool MCP server mode; 3.7k+ stars; MIT | Overlaps code-review-graph and codebase-memory-mcp for code graph building. Unique for mixed-media corpora (code + PDFs + images). Prefer graph tools for pure-code use cases; prefer graphify only if multi-modal ingestion is required. |
| Understand-Anything | Multi-agent LLM pipeline → structural + domain graph dashboard | N/A — developer comprehension focus; no token reduction claim | None | None documented | 8,081 stars; TypeScript/Node.js; MIT | No overlap with token-reduction tools — orthogonal value proposition. Choose only if the goal is domain mapping and comprehension, not context compression. |
| git-semantic-bun | Local vector index over git commit messages | N/A — retrieval, not summarization | None | gsb benchmark requires user-provided queries; no published figures | No MCP; pre-stable; 3 stars; MIT | No overlap — semantic git-history search is unique in this survey. Use only if querying commit history by meaning is the specific requirement. |
| qmd | 8-step hybrid query: BM25 probe → LLM query expansion → vec search → RRF fusion → chunk selection → reranking → score blend → dedup | N/A — retrieval, not output compression | Implicit: caller sets result limit | Full qmd bench harness; no published results; vitest eval suite (6 docs, 24 queries) | 20.3k stars; custom 1.7B query expansion model (no training artifacts); no HTTP auth; MIT | Overlaps jdocmunch-mcp for markdown section retrieval. Prefer this for dynamic query workloads — most sophisticated pipeline in the survey. Prefer jdocmunch-mcp only for O(1) access to known-section structured docs (and only if non-commercial license is acceptable). |
| caveman | Claude Code skill enforcing caveman-speak output + compress sub-tool for explicit compression | ~75% output-token reduction; ~45% input-token reduction (as reported; updated from 65% triage) | Implicit: agent output style + budget | Offline evals/measure.py against committed snapshot (10 prompts, reproducible); benchmarks/run.py requires API key | 6 intensity levels incl. Wenyan variants; auto-clarity escape for security warnings | No direct peer — output-style compression is unique in this survey. Use when output verbosity is the token budget bottleneck, not input context size. Complementary (not competing) with all input-side tools. |
| n2-arachne | MCP server assembling token-budgeted payloads with fixed % allocations (10% structural / 30% dep / 40% semantic / 20% recent) | Budget enforced (chars/3.5 heuristic — not a real tokenizer) | Explicit: fixed % allocations across 4 layers | None — test script is a placeholder; CHANGELOG references missing harness file | Non-commercial-only license (NOASSERTION SPDX); 19.9× speedup headline describes non-hot-path function | Overlaps context-mode on budget enforcement. Prefer context-mode (ELv2 vs non-commercial, verified savings). Use n2-arachne only if the fixed-% allocation model is specifically required and non-commercial terms are acceptable. |
| jdocmunch-mcp | Section-level markdown indexing; O(1) byte-offset retrieval | 110× byte reduction on structured docs (as reported; bytes not tokens; savings accounting flaw) | None — returns matched sections | No harness; 3 narrative case studies generated by Claude | Opt-out telemetry to j.gravelle.us; v1.7.1; non-commercial dual license ($79–$1,999 tiers) | Overlaps qmd for markdown retrieval. Prefer qmd for general workloads (MIT, richer pipeline). Use this for O(1) access to stable, known-section documents only; non-commercial license and telemetry are material risks. |
| serena | LSP-backed symbol-path retrieval + editing; two backends (LSP + JetBrains); progressive fallback on oversized results | Not quantified — qualitative only | Implicit: _limit_length + shortened_result_factories progressive fallback | None (analytics.py tracks usage; no baseline comparison) | ~30 tools; 55 LSP language servers; novel fallback mechanism; flat markdown memory (no TTL/search) | No direct peer for LSP-backed symbol editing. Orthogonal to graph tools (editing precision vs graph traversal). Prefer for multi-language refactoring workflows where symbol-level accuracy matters; no overlap with context-mode or rtk. |
| jcodemunch-mcp | Tree-sitter AST → SQLite WAL; exact byte-span retrieval; 9 MCP tools | 95% token reduction (as reported; 3 small repos; range 79.7–99.8%) | None — result set size | benchmarks/harness/run_benchmark.py runnable; not independently reproduced | Non-OSI license (paid commercial use); optional AI summarization sends code to external APIs | Overlaps code-review-graph and codegraph (AST-graph family). Prefer code-review-graph (MIT, 22 tools, broader benchmark). Use this only if byte-span precision is required and a paid non-OSI license is acceptable. |
| rtk | Claude Code hook-based CLI proxy; two-track filter pipeline (69 Rust handlers + 58 TOML filters) | 60–90% on dev commands (as reported; chars/4 heuristic) | None — passthrough proxy | scripts/benchmark.sh runnable; live fixtures; 80% improvement CI gate | v0.35.0; Apache-2.0; TOML filter correctness enforced at compile time | No direct peer — transparent CLI-proxy hook architecture is unique. Use if goal is passive token reduction on Claude Code dev commands without changing agent behavior. Complementary to all MCP-layer tools. |
| socraticode | Qdrant-backed hybrid search (dense + BM25 via Qdrant RRF); AST-aware chunking (18+ languages); polyglot dependency graph | 61.5% (as reported; bytes not tokens; single live session; no harness) | None | No harness; 3 narrative case studies | Docker required; Qdrant v1.15.2+ required; MIT | Overlaps codebase-memory-mcp and code-review-graph for code search. Prefer code-review-graph (no Docker/Qdrant dep, stronger benchmarks). Use socraticode only if Qdrant is already in the infrastructure stack and RRF hybrid search is needed. |
| sdl-mcp | LadybugDB knowledge graph; Symbol Cards (~100 tokens/symbol, LLM summary, ETag re-fetch); Iris Gate Ladder 4-rung escalation; Delta Packs blast-radius; SCIP compiler-grade edges; 38 tool surfaces | 81% reduction in tools/list overhead (gateway mode, as reported); no end-to-end session figure | Implicit: Iris Gate Ladder prompts cheapest-first retrieval; max-cards on slices | No harness; no end-to-end benchmark; all figures author-run | LLM summary cost at index time undisclosed; LadybugDB opaque (no schema/SQL/Cypher); source-available license; 12 languages (Rust indexer); 125 stars | Watch. Iris Gate Ladder + ETag conditional re-fetch are architecturally novel — the most disciplined context-escalation model in this survey. But no end-to-end token savings figure exists, LadybugDB is opaque, and LLM index-time costs are undisclosed. Overlaps codebase-memory-mcp and code-review-graph for graph-based code intelligence; prefer those (MIT, transparent storage, broader languages) until SDL-MCP provides a reproducible end-to-end benchmark and documents summary costs. |
| osgrep | npm CLI; LanceDB vector store; Granite 30M dense + mxbai ColBERT 17M (int8) late-interaction reranking; tree-sitter AST chunking; FTS + vector → RRF → two-stage ColBERT pipeline | ~20% cost reduction / ~30% speedup (as reported; 10-query, single-codebase CSV; no answer quality assessment) | None — result set size is the bound | 10-query CSV (opencode corpus, cost-only); internal MRR harness (eval.ts, 70+ cases, self-referential); not reproduced | MCP server is a non-functional stub as of commit 9f2faf7; last push 2026-01-17; 1,128 stars; Apache-2.0 | Overlaps socraticode and codebase-memory-mcp for semantic code search. Unique: richest hybrid retrieval pipeline in the survey (dense + FTS + two-stage ColBERT) in a zero-external-service npm package. MCP stub rules out MCP-framework integration currently. Prefer socraticode if Qdrant is already in the stack and RRF hybrid search is sufficient; prefer codebase-memory-mcp if graph queries are needed. Use osgrep if the call-stack trace + skeleton compression commands are the target use case or if a pure-npm zero-dep installation is required. |
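Several rows above describe token-budget models that rely on character-count heuristics rather than real tokenizers (n2-arachne's chars/3.5, rtk's chars/4). The sketch below is a minimal, illustrative version of the fixed-percentage allocation model the matrix attributes to n2-arachne (10% structural / 30% dependency / 40% semantic / 20% recent); all function and variable names are hypothetical, and the chars/3.5 estimate is the same rough heuristic the matrix flags as "not a real tokenizer".

```python
# Illustrative sketch of a fixed-percentage token-budget allocator in the
# style the matrix attributes to n2-arachne. Names are hypothetical; the
# chars/3.5 estimate is a heuristic, NOT a real tokenizer.

CHARS_PER_TOKEN = 3.5

LAYER_SHARES = {
    "structural": 0.10,
    "dependency": 0.30,
    "semantic": 0.40,
    "recent": 0.20,
}

def estimate_tokens(text: str) -> int:
    """Rough token count via the chars/3.5 heuristic."""
    return int(len(text) / CHARS_PER_TOKEN)

def assemble_payload(layers: dict[str, list[str]], budget_tokens: int) -> str:
    """Greedily fill each layer up to its fixed share of the budget.

    The fixed-% model means unused budget in one layer is NOT
    borrowed by another -- each layer stops at its own ceiling.
    """
    parts: list[str] = []
    for name, share in LAYER_SHARES.items():
        layer_budget = int(budget_tokens * share)
        used = 0
        for chunk in layers.get(name, []):
            cost = estimate_tokens(chunk)
            if used + cost > layer_budget:
                break  # this layer is full; no cross-layer borrowing
            parts.append(chunk)
            used += cost
    return "\n".join(parts)
```

The key behavioral consequence, visible in the sketch: a large semantic chunk that overflows its 40% share is dropped entirely even if the structural layer left most of its 10% unused, which is the trade-off a fixed-allocation model accepts in exchange for predictability.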
## Key themes

Populated as analysis matures.
## Recommended reading order

- context-mode — read first; establishes the MCP-layer interception pattern and the two-speed retrieval distinction (summarization vs searchable index) that all subsequent tool comparisons should reference.
- codebase-memory-mcp — read second; graph queries are complementary to context-mode (structural navigation vs output sandboxing); together they cover the two main sources of context bloat in coding sessions.
- code-review-graph — canonical AST-graph tool; community detection, wiki generation, and blast-radius analysis; use as the benchmark baseline for all other graph-based tools.
- codegraph — read alongside code-review-graph; architecturally similar but WASM-bundled and single-tool; important README integrity caveat (benchmark table copied from code-review-graph).
- graphify — read next in the graph family; prompt-orchestrated multi-modal variant (LLM drives Python CLI); 71.5× headline figure is on a curated 52-file corpus with an extreme baseline — understand the methodology before comparing to peers.
- sdl-mcp — read after graphify; caps the graph-intelligence family with the most disciplined context-escalation model (Iris Gate Ladder + ETag conditional re-fetch + Delta Packs). Watch verdict: novel architecture, no end-to-end benchmark, opaque proprietary storage.
- serena — LSP-backed approach is orthogonal to AST-graph tools; the progressive fallback on oversized results is a novel pattern worth understanding for any tool that returns variable-size context.
- jcodemunch-mcp — same AST-graph family as code-review-graph; narrowest benchmark (3 repos, range 79.7–99.8%); non-OSI license is a material risk.
- rtk — representative of the CLI-proxy category; simplest architecture in the survey; important caveat that all figures use a chars/4 heuristic, not a real tokenizer.
- n2-arachne — read for the fixed-percentage budget allocation model; chars/3.5 heuristic and non-commercial license are the two primary risks.
- jdocmunch-mcp — read for the O(1) byte-offset retrieval pattern; note the savings accounting flaw (savings are computed over all indexed sections, not only the sections actually returned) and the opt-out telemetry before adopting.
- socraticode — Qdrant-backed hybrid search (dense + BM25 via RRF); Docker required; the 61.5% figure is bytes not tokens from a single session — an important methodological caveat shared with jdocmunch-mcp.
- osgrep — read alongside socraticode; richest hybrid retrieval pipeline in the survey (FTS + dense + two-stage ColBERT reranking) with zero external service dependencies; key caveat: MCP server is a non-functional stub as of last commit, and the 10-query benchmark covers cost only with no answer quality assessment.
- qmd — most sophisticated retrieval pipeline in this survey (8 steps: BM25 probe → LLM query expansion → vec → RRF → rerank); relevant primarily when the agent’s knowledge base is markdown, not code.
- caveman — output-style compression is a different category from all others; useful when output verbosity rather than input retrieval is the token budget bottleneck.
- Understand-Anything — read if developer comprehension (domain mapping, not token reduction) is the target; different value proposition from every other tool in this list.
- git-semantic-bun — borderline scope; read only if semantic retrieval from git commit history is specifically needed.
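Three of the retrieval tools above (qmd, socraticode, osgrep) merge lexical (BM25/FTS) and vector rankings with Reciprocal Rank Fusion. The sketch below shows the standard RRF formula as a point of reference when reading those analyses; `k=60` is the conventional constant from the original RRF formulation, and the document does not state which value these tools actually use.

```python
# Minimal Reciprocal Rank Fusion (RRF): each ranked list contributes
# 1/(k + rank) per document, and documents are re-sorted by the summed
# score. k=60 is the conventional default; the tools surveyed above may
# use a different constant.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score ranks first.
    return sorted(scores, key=scores.get, reverse=True)
```

The property worth noticing when comparing the three pipelines: RRF needs only ranks, not scores, so it fuses BM25 and cosine-similarity lists without any score normalization — which is exactly why it appears as the glue step in otherwise very different hybrid-search architectures.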