# Context Management — Cross-Tool Synthesis

Research synthesis across all analyzed tools and papers. Updated as individual ANALYSIS-*.md files are added and promoted.
## Comparison matrix

Populated as analyses are added.
| Tool / Paper | Approach | Compression | Token budget model | Benchmarks | Notes | Overlap & recommendation |
|---|---|---|---|---|---|---|
| context-mode | MCP-layer output interception + FTS5 knowledge base | 95–100% (summarization, verified); 44–93% (retrieval, as reported) | Implicit: agent selects tool | Partially verified; cold start 1–4s/call undisclosed | PreCompact hook extends session ~30 min → ~3 hr (as reported); ELv2 license | No direct peer for MCP output sandboxing. Overlaps n2-arachne on budget enforcement — prefer this (better license, two-speed retrieval). Pair with codebase-memory-mcp for structural navigation. |
| codebase-memory-mcp | AST-to-SQLite knowledge graph; structural graph queries replace file reads | ~90–99% vs grep (directional; 5 live queries ~1,095 tokens verified) | None — result set size is the bound | No runnable harness; live queries verified | Dynamic language edges heuristic; no auth on MCP/UI; MIT | Overlaps code-review-graph, codegraph, jcodemunch-mcp (AST-graph family). Prefer code-review-graph for breadth (22 tools, community detection). Use this for pure-SQLite graph with no Python/Docker dependency. |
| code-review-graph | Tree-sitter AST → SQLite; blast-radius + community detection + hybrid search; 22 tools | 8.2× average (as reported, range 0.7×–16.4×); 49× “daily tasks” unverified | None — result set size | evaluate/ runner exists; not reproduced; MRR 0.35 (stated, low) | 7,624 stars; Python 3.10+; active community; MIT | Best of the AST-graph family for breadth. Overlaps codebase-memory-mcp, codegraph, jcodemunch-mcp. Prefer over codegraph (no README integrity issue) and jcodemunch-mcp (MIT vs non-OSI, richer toolset). |
| codegraph | Tree-sitter AST → SQLite; single codegraph_explore blast-radius tool | 94% fewer tool calls / 77% faster (as reported, own eval runner — unverified) | None — traversal result set varies | evaluation/runner.ts exists; not reproduced; 8.2× table is CRG’s data | WASM bundled; zero native deps; README integrity issue; 412 stars; MIT | Overlaps code-review-graph and jcodemunch-mcp. Prefer code-review-graph unless zero-dep WASM bundle is a hard requirement. README integrity issue (benchmark table copied from CRG) warrants caution. |
| graphify | Prompt-orchestrated multi-modal knowledge graph (skill.md drives Python CLI); Tree-sitter AST + LLM semantic extraction; Leiden community detection | 71.5× token reduction (as reported; single curated 52-file corpus; extreme baseline) | None — graph query cost vs raw file reads | No standalone harness; computed inline during /graphify runs | Multi-modal: code + PDF + image + video; persistent graph.json; 7-tool MCP server mode; 3.7k+ stars; MIT | Overlaps code-review-graph and codebase-memory-mcp for code graph building. Unique for mixed-media corpora (code + PDFs + images). Prefer graph tools for pure-code use cases; prefer graphify only if multi-modal ingestion is required. |
| Understand-Anything | Multi-agent LLM pipeline → structural + domain graph dashboard | N/A — developer comprehension focus; no token reduction claim | None | None documented | 8,081 stars; TypeScript/Node.js; MIT | No overlap with token-reduction tools — orthogonal value proposition. Choose only if the goal is domain mapping and comprehension, not context compression. |
| git-semantic-bun | Local vector index over git commit messages | N/A — retrieval, not summarization | None | gsb benchmark requires user-provided queries; no published figures | No MCP; pre-stable; 3 stars; MIT | No overlap — semantic git-history search is unique in this survey. Use only if querying commit history by meaning is the specific requirement. |
| qmd | 8-step hybrid query: BM25 probe → LLM query expansion → vec search → RRF fusion → chunk selection → reranking → score blend → dedup | N/A — retrieval, not output compression | Implicit: caller sets result limit | Full qmd bench harness; no published results; vitest eval suite (6 docs, 24 queries) | 20.3k stars; custom 1.7B query expansion model (no training artifacts); no HTTP auth; MIT | Overlaps jdocmunch-mcp for markdown section retrieval. Prefer this for dynamic query workloads — most sophisticated pipeline in the survey. Prefer jdocmunch-mcp only for O(1) access to known-section structured docs (and only if non-commercial license is acceptable). |
| caveman | Claude Code skill enforcing caveman-speak output + compress sub-tool for explicit compression | ~75% output-token reduction; ~45% input-token reduction (as reported; updated from 65% triage) | Implicit: agent output style + budget | Offline evals/measure.py against committed snapshot (10 prompts, reproducible); benchmarks/run.py requires API key | 6 intensity levels incl. Wenyan variants; auto-clarity escape for security warnings | No direct peer — output-style compression is unique in this survey. Use when output verbosity is the token budget bottleneck, not input context size. Complementary (not competing) with all input-side tools. |
| n2-arachne | MCP server assembling token-budgeted payloads with fixed % allocations (10% structural / 30% dep / 40% semantic / 20% recent) | Budget enforced (chars/3.5 heuristic — not a real tokenizer) | Explicit: fixed % allocations across 4 layers | None — test script is a placeholder; CHANGELOG references missing harness file | Non-commercial-only license (NOASSERTION SPDX); 19.9× speedup headline describes non-hot-path function | Overlaps context-mode on budget enforcement. Prefer context-mode (ELv2 vs non-commercial, verified savings). Use n2-arachne only if the fixed-% allocation model is specifically required and non-commercial terms are acceptable. |
| jdocmunch-mcp | Section-level markdown indexing; O(1) byte-offset retrieval | 110× byte reduction on structured docs (as reported; bytes not tokens; savings accounting flaw) | None — returns matched sections | No harness; 3 narrative case studies generated by Claude | Opt-out telemetry to j.gravelle.us; v1.7.1; non-commercial dual license ($79–$1,999 tiers) | Overlaps qmd for markdown retrieval. Prefer qmd for general workloads (MIT, richer pipeline). Use this for O(1) access to stable, known-section documents only; non-commercial license and telemetry are material risks. |
| serena | LSP-backed symbol-path retrieval + editing; two backends (LSP + JetBrains); progressive fallback on oversized results | Not quantified — qualitative only | Implicit: _limit_length + shortened_result_factories progressive fallback | None (analytics.py tracks usage; no baseline comparison) | ~30 tools; 55 LSP language servers; novel fallback mechanism; flat markdown memory (no TTL/search) | No direct peer for LSP-backed symbol editing. Orthogonal to graph tools (editing precision vs graph traversal). Prefer for multi-language refactoring workflows where symbol-level accuracy matters; no overlap with context-mode or rtk. |
| jcodemunch-mcp | Tree-sitter AST → SQLite WAL; exact byte-span retrieval; 9 MCP tools | 95% token reduction (as reported; 3 small repos; range 79.7–99.8%) | None — result set size | benchmarks/harness/run_benchmark.py runnable; not independently reproduced | Non-OSI license (paid commercial use); optional AI summarization sends code to external APIs | Overlaps code-review-graph and codegraph (AST-graph family). Prefer code-review-graph (MIT, 22 tools, broader benchmark). Use this only if byte-span precision is required and a paid non-OSI license is acceptable. |
| rtk | Claude Code hook-based CLI proxy; two-track filter pipeline (69 Rust handlers + 58 TOML filters) | 60–90% on dev commands (as reported; chars/4 heuristic) | None — passthrough proxy | scripts/benchmark.sh runnable; live fixtures; 80% improvement CI gate | v0.35.0; Apache-2.0; TOML filter correctness enforced at compile time | No direct peer — transparent CLI-proxy hook architecture is unique. Use if goal is passive token reduction on Claude Code dev commands without changing agent behavior. Complementary to all MCP-layer tools. |
| socraticode | Qdrant-backed hybrid search (dense + BM25 via Qdrant RRF); AST-aware chunking (18+ languages); polyglot dependency graph | 61.5% (as reported; bytes not tokens; single live session; no harness) | None | No harness; 3 narrative case studies | Docker required; Qdrant v1.15.2+ required; MIT | Overlaps codebase-memory-mcp and code-review-graph for code search. Prefer code-review-graph (no Docker/Qdrant dep, stronger benchmarks). Use socraticode only if Qdrant is already in the infrastructure stack and RRF hybrid search is needed. |
| sdl-mcp | LadybugDB knowledge graph; Symbol Cards (~100 tokens/symbol, LLM summary, ETag re-fetch); Iris Gate Ladder 4-rung escalation; Delta Packs blast-radius; SCIP compiler-grade edges; 38 tool surfaces | 81% reduction in tools/list overhead (gateway mode, as reported); no end-to-end session figure | Implicit: Iris Gate Ladder prompts cheapest-first retrieval; max-cards on slices | No harness; no end-to-end benchmark; all figures author-run | LLM summary cost at index time undisclosed; LadybugDB opaque (no schema/SQL/Cypher); source-available license; 12 languages (Rust indexer); 125 stars | Watch. Iris Gate Ladder + ETag conditional re-fetch are architecturally novel — the most disciplined context-escalation model in this survey. But no end-to-end token savings figure exists, LadybugDB is opaque, and LLM index-time costs are undisclosed. Overlaps codebase-memory-mcp and code-review-graph for graph-based code intelligence; prefer those (MIT, transparent storage, broader languages) until SDL-MCP provides a reproducible end-to-end benchmark and documents summary costs. |
| osgrep | npm CLI; LanceDB vector store; Granite 30M dense + mxbai ColBERT 17M (int8) late-interaction reranking; tree-sitter AST chunking; FTS + vector → RRF → two-stage ColBERT pipeline | ~20% cost reduction / ~30% speedup (as reported; 10-query, single-codebase CSV; no answer quality assessment) | None — result set size is the bound | 10-query CSV (opencode corpus, cost-only); internal MRR harness (eval.ts, 70+ cases, self-referential); not reproduced | MCP server is a non-functional stub as of commit 9f2faf7; last push 2026-01-17; 1,128 stars; Apache-2.0 | Overlaps socraticode and codebase-memory-mcp for semantic code search. Unique: richest hybrid retrieval pipeline in the survey (dense + FTS + two-stage ColBERT) in a zero-external-service npm package. MCP stub rules out MCP-framework integration currently. Prefer socraticode if Qdrant is already in the stack and RRF hybrid search is sufficient; prefer codebase-memory-mcp if graph queries are needed. Use osgrep if the call-stack trace + skeleton compression commands are the target use case or if a pure-npm zero-dep installation is required. |
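Several rows above describe token-budget models that rely on character-count heuristics rather than real tokenizers (n2-arachne's chars/3.5, rtk's chars/4). The sketch below is a minimal, illustrative version of the fixed-percentage allocation model the matrix attributes to n2-arachne (10% structural / 30% dependency / 40% semantic / 20% recent); all function and variable names are hypothetical, and the chars/3.5 estimate is the same rough heuristic the matrix flags as "not a real tokenizer".

```python
# Illustrative sketch of a fixed-percentage token-budget allocator in the
# style the matrix attributes to n2-arachne. Names are hypothetical; the
# chars/3.5 estimate is a heuristic, NOT a real tokenizer.

CHARS_PER_TOKEN = 3.5

LAYER_SHARES = {
    "structural": 0.10,
    "dependency": 0.30,
    "semantic": 0.40,
    "recent": 0.20,
}

def estimate_tokens(text: str) -> int:
    """Rough token count via the chars/3.5 heuristic."""
    return int(len(text) / CHARS_PER_TOKEN)

def assemble_payload(layers: dict[str, list[str]], budget_tokens: int) -> str:
    """Greedily fill each layer up to its fixed share of the budget.

    The fixed-% model means unused budget in one layer is NOT
    borrowed by another -- each layer stops at its own ceiling.
    """
    parts: list[str] = []
    for name, share in LAYER_SHARES.items():
        layer_budget = int(budget_tokens * share)
        used = 0
        for chunk in layers.get(name, []):
            cost = estimate_tokens(chunk)
            if used + cost > layer_budget:
                break  # this layer is full; no cross-layer borrowing
            parts.append(chunk)
            used += cost
    return "\n".join(parts)
```

The key behavioral consequence, visible in the sketch: a large semantic chunk that overflows its 40% share is dropped entirely even if the structural layer left most of its 10% unused, which is the trade-off a fixed-allocation model accepts in exchange for predictability.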
## Key themes

Populated as analysis matures.
## Recommended reading order

- context-mode — read first; establishes the MCP-layer interception pattern and the two-speed retrieval distinction (summarization vs searchable index) that all subsequent tool comparisons should reference.
- codebase-memory-mcp — read second; graph queries are complementary to context-mode (structural navigation vs output sandboxing); together they cover the two main sources of context bloat in coding sessions.
- code-review-graph — canonical AST-graph tool; community detection, wiki generation, and blast-radius analysis; use as the benchmark baseline for all other graph-based tools.
- codegraph — read alongside code-review-graph; architecturally similar but WASM-bundled and single-tool; important README integrity caveat (benchmark table copied from code-review-graph).
- graphify — read next in the graph family; prompt-orchestrated multi-modal variant (LLM drives Python CLI); 71.5× headline figure is on a curated 52-file corpus with an extreme baseline — understand the methodology before comparing to peers.
- sdl-mcp — read after graphify; caps the graph-intelligence family with the most disciplined context-escalation model (Iris Gate Ladder + ETag conditional re-fetch + Delta Packs). Watch verdict: novel architecture, no end-to-end benchmark, opaque proprietary storage.
- serena — LSP-backed approach is orthogonal to AST-graph tools; the progressive fallback on oversized results is a novel pattern worth understanding for any tool that returns variable-size context.
- jcodemunch-mcp — same AST-graph family as code-review-graph; narrowest benchmark (3 repos, range 79.7–99.8%); non-OSI license is a material risk.
- rtk — representative of the CLI-proxy category; simplest architecture in the survey; important caveat that all figures use a chars/4 heuristic, not a real tokenizer.
- n2-arachne — read for the fixed-percentage budget allocation model; chars/3.5 heuristic and non-commercial license are the two primary risks.
- jdocmunch-mcp — read for the O(1) byte-offset retrieval pattern; note the savings accounting flaw (savings are computed over all indexed sections, not only the sections actually returned) and the opt-out telemetry before adopting.
- socraticode — Qdrant-backed hybrid search (dense + BM25 via RRF); Docker required; the 61.5% figure is bytes not tokens from a single session — an important methodological caveat shared with jdocmunch-mcp.
- osgrep — read alongside socraticode; richest hybrid retrieval pipeline in the survey (FTS + dense + two-stage ColBERT reranking) with zero external service dependencies; key caveat: MCP server is a non-functional stub as of last commit, and the 10-query benchmark covers cost only with no answer quality assessment.
- qmd — most sophisticated retrieval pipeline in this survey (8 steps: BM25 probe → LLM query expansion → vec → RRF → rerank); relevant primarily when the agent’s knowledge base is markdown, not code.
- caveman — output-style compression is a different category from all others; useful when output verbosity rather than input retrieval is the token budget bottleneck.
- Understand-Anything — read if developer comprehension (domain mapping, not token reduction) is the target; different value proposition from every other tool in this list.
- git-semantic-bun — borderline scope; read only if semantic retrieval from git commit history is specifically needed.
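Three of the retrieval tools above (qmd, socraticode, osgrep) merge lexical (BM25/FTS) and vector rankings with Reciprocal Rank Fusion. The sketch below shows the standard RRF formula as a point of reference when reading those analyses; `k=60` is the conventional constant from the original RRF formulation, and the document does not state which value these tools actually use.

```python
# Minimal Reciprocal Rank Fusion (RRF): each ranked list contributes
# 1/(k + rank) per document, and documents are re-sorted by the summed
# score. k=60 is the conventional default; the tools surveyed above may
# use a different constant.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score ranks first.
    return sorted(scores, key=scores.get, reverse=True)
```

The property worth noticing when comparing the three pipelines: RRF needs only ranks, not scores, so it fuses BM25 and cosine-similarity lists without any score normalization — which is exactly why it appears as the glue step in otherwise very different hybrid-search architectures.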