Analysis — jdocmunch-mcp
ANALYSIS: jdocmunch-mcp
Source review
Source vendored at tools/jgravelle-jdocmunch-mcp/. All claims below marked (verified) were
confirmed by reading the referenced source files directly.
Architecture
Six-stage indexing pipeline wired across three packages:
- parser/: one module per format plus shared utilities (sections.py, hierarchy.py, markdown_parser.py, eight format-specific parsers)
- storage/doc_store.py: DocIndex + DocStore; all persistent state
- tools/: 16 thin handler modules imported by server.py
Entry point is server.py::main(), which starts an mcp.server.Server and registers all
tools. The CLI is a second entry point via cli/ (init.py and hooks.py).
Critical path (get_section)
1. server.py dispatches to tools/get_section.py
2. get_section calls DocStore.get_section_content(owner, name, section_id)
3. doc_store.py::get_section_content calls index.get_section(section_id), an O(1) dict lookup via DocIndex._section_index, which is built in the dataclass's `__post_init__` method (verified from source)
4. Opens the cached raw file, seeks to byte_start, reads byte_end - byte_start bytes (verified from source: f.seek(byte_start); raw = f.read(byte_end - byte_start))
5. Returns UTF-8 decoded string
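The last two steps of the critical path can be sketched as follows. This is a minimal illustration, not the project's code: read_section_bytes and the temporary file are hypothetical, and only the seek/read pattern is taken from the source review.

```python
import tempfile
from pathlib import Path


def read_section_bytes(raw_path: Path, byte_start: int, byte_end: int) -> str:
    """Read one section from a cached raw file using stored byte offsets.

    Hypothetical helper that mirrors the seek/read pattern quoted above.
    """
    with open(raw_path, "rb") as f:  # binary mode, so offsets are byte offsets
        f.seek(byte_start)
        raw = f.read(byte_end - byte_start)
    return raw.decode("utf-8")


# Demo: a throwaway file stands in for the cached raw document.
doc = "# Intro\nhéllo\n# Usage\nrun it\n"
data = doc.encode("utf-8")
start = data.index(b"# Usage")  # byte offset, not character offset

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "doc.md"
    path.write_bytes(data)
    section = read_section_bytes(path, start, len(data))

print(repr(section))
```

Opening the file in binary mode matters: the stored offsets count UTF-8 bytes, so a text-mode seek on a file containing multi-byte characters would land in the wrong place.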
Key data structures
- Section dataclass (parser/sections.py): id, repo, doc_path, title, content, level, parent_id, children, byte_start, byte_end, summary, tags, references, content_hash, embedding
- DocIndex dataclass: sections list + _section_index dict (built at load time); file_hashes dict for SHA-256 drift detection; index_version field
- DocStore: JSON index at ~/.doc-index/{owner}/{name}.json; raw content files under ~/.doc-index/{owner}/{name}/; module-level _INDEX_CACHE dict keyed by (str(index_path), mtime_ns)
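To make the DocIndex lookup structure concrete, here is a reduced sketch. SectionStub and IndexStub are hypothetical stand-ins with only a few of the fields listed above; the one detail carried over from the source review is that the id-to-section dict is built once in `__post_init__`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SectionStub:
    """Reduced stand-in for Section; the real dataclass has more fields."""
    id: str
    title: str
    byte_start: int
    byte_end: int


@dataclass
class IndexStub:
    """Stand-in for DocIndex: a section list plus an id -> section dict."""
    sections: list[SectionStub]
    _section_index: dict[str, SectionStub] = field(
        default_factory=dict, init=False, repr=False
    )

    def __post_init__(self) -> None:
        # Build the lookup table once, when the index object is created.
        self._section_index = {s.id: s for s in self.sections}

    def get_section(self, section_id: str) -> Optional[SectionStub]:
        return self._section_index.get(section_id)  # dict lookup, O(1) on average


# Illustrative IDs only; they follow the ID shape described in this review.
index = IndexStub([
    SectionStub("repo::guide.md::intro#1", "Intro", 0, 40),
    SectionStub("repo::guide.md::intro/setup#2", "Setup", 40, 95),
])
hit = index.get_section("repo::guide.md::intro/setup#2")
print(hit.title if hit else None)
```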
Summary
jdocmunch-mcp is a Python MCP server that indexes documentation into heading-delimited sections
and exposes 16 tools for structured retrieval (the prior triage reported 13; the vendored source
registers 16, including get_backlinks, get_stale_pages, and get_wiki_stats added since
that triage). The core idea is a two-phase workflow: agents first call search_sections or
get_toc to receive summaries only, then call get_section or get_sections to fetch exact
content via stored byte offsets — loading no more than the section requested. This discipline
is the entire basis of the token-saving claim.
The architecture is well-executed and verifiably coherent from source. The savings figures in
TOKEN_SAVINGS.md (97–98%) are self-reported estimates, not reproducible measurements; actual
token reductions depend heavily on corpus structure and query specificity. The earlier triage
described benchmarks/ as narrated case studies with no executable harness — this is now
partially contradicted by source: benchmarks/wiki/run_benchmark.py is a real, runnable
Python harness with tiktoken-based token counting. The tool is genuinely useful for structured
documentation corpora; its value is lower for free-form prose or small doc sets. The dual
license creates a meaningful barrier for commercial teams.
What it does (verified from source)
Core mechanism
Indexing follows a six-stage pipeline (verified from parser/, storage/, tools/):
1. Discovery: GitHub API walk (tools/index_repo.py) or local directory traversal (tools/index_local.py). GitHub traversal is paginated via httpx.
2. Security filtering: security.py applies path-traversal checks, .gitignore-pattern exclusion via pathspec, binary detection, and a secret-file heuristic that silently skips files matching secret*.md or containing credential-like patterns.
3. Preprocessing: preprocess_content() in the parser package dispatch module converts non-Markdown formats to Markdown before parsing. Covered conversions: .mdx via strip_mdx() (verified: 12 compiled regex patterns removing YAML frontmatter, JSX tags, import/export, mermaid fence blocks), .ipynb via convert_notebook(), .html via convert_html(), .xml/.svg/.xhtml via convert_xml(), OpenAPI YAML/JSON via convert_openapi(), plain JSON/JSONC via convert_json(), Godot scene files via convert_godot(). All outputs are fed to the unified Markdown parser.
4. Section splitting: parse_markdown() in parser/markdown_parser.py walks the file line by line, tracking UTF-8 byte offsets per line (len(line.encode("utf-8"))) (verified from source). Both ATX headings (# H1) and setext headings (underline with ===/---) are handled. Content before the first heading becomes a level-0 root section. Each section records byte_start and byte_end into the raw file.

   Section IDs use a hierarchical slug: {repo}::{doc_path}::{ancestor_chain/leaf}#{level} (verified from sections.py::make_section_id and make_hierarchical_slug). The ancestor chain prevents same-named headings under different parents from colliding or renumbering when the document is edited.
5. Hierarchy wiring: parser/hierarchy.py::wire_hierarchy() uses a level-stack to assign parent_id and children to each section in a second linear pass.
6. Storage: DocStore in storage/doc_store.py writes two artefacts atomically (temp file + rename) (verified from source: tmp_path.replace(index_path)): a JSON index at ~/.doc-index/{owner}/{name}.json containing section metadata (titles, summaries, byte offsets, hashes, optional embeddings) but not full content, and raw content files under ~/.doc-index/{owner}/{name}/. A module-level _INDEX_CACHE dict keyed by (str(index_path), mtime_ns) prevents redundant json.load() calls within a server process (verified from source).
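The section-splitting stage of the pipeline can be illustrated with a simplified splitter. This is a sketch under stated assumptions, not parse_markdown() itself: split_sections is a hypothetical name, it handles ATX headings only, and it ignores setext headings and fenced code blocks. What it does share with the description above is the per-line len(line.encode("utf-8")) offset tracking and the level-0 root section.

```python
import re

ATX_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_sections(text: str) -> list[dict]:
    """Split Markdown on ATX headings, recording UTF-8 byte offsets."""
    sections: list[dict] = []
    current = {"title": "", "level": 0, "byte_start": 0}  # level-0 root
    offset = 0

    def close(section: dict, end: int) -> None:
        section["byte_end"] = end
        # Drop an empty root section (document starts with a heading).
        if section["level"] > 0 or end > section["byte_start"]:
            sections.append(section)

    for line in text.splitlines(keepends=True):
        match = ATX_HEADING.match(line.rstrip("\r\n"))
        if match:
            close(current, offset)  # previous section ends where this heading starts
            current = {
                "title": match.group(2),
                "level": len(match.group(1)),
                "byte_start": offset,
            }
        offset += len(line.encode("utf-8"))  # bytes, not characters
    close(current, offset)
    return sections


text = "preamble\n# Café\nbody\n## Sub\nmore\n"
secs = split_sections(text)
data = text.encode("utf-8")
for s in secs:
    print(s["level"], repr(s["title"]), s["byte_start"], s["byte_end"])
```

Because the offsets are byte positions, slicing the encoded document with data[byte_start:byte_end] recovers each section exactly, including the non-ASCII heading.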
Retrieval uses an O(1) lookup (verified from source): DocStore.get_section_content() looks up the
section in DocIndex._section_index (a plain Python dict built at `__post_init__`
time), then opens the cached raw file, seeks to byte_start, and reads
byte_end - byte_start bytes. The read cost scales with the section size rather than the file size, and no re-parsing is required.
Drift detection uses SHA-256 hashes per file (verified from source: hashlib.sha256).
DocStore.detect_changes() compares the stored hash map against current file hashes and
returns (changed, new, deleted) lists, enabling incremental re-index rather than full
rebuilds.
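The hash comparison behind drift detection can be sketched as below. diff_hashes is a hypothetical standalone function, not DocStore.detect_changes() itself; it only assumes that both sides are path-to-SHA-256 maps, as described above.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def diff_hashes(stored: dict[str, str], current: dict[str, str]):
    """Compare path -> hash maps and return (changed, new, deleted) lists."""
    changed = sorted(p for p in current if p in stored and stored[p] != current[p])
    new = sorted(p for p in current if p not in stored)
    deleted = sorted(p for p in stored if p not in current)
    return changed, new, deleted


# Hypothetical file contents before and after an edit session.
stored = {"a.md": file_hash(b"one"), "b.md": file_hash(b"two")}
current = {"a.md": file_hash(b"one!"), "c.md": file_hash(b"three")}

changed, new, deleted = diff_hashes(stored, current)
print(changed, new, deleted)
```

Only the paths in changed and new need re-parsing, and the paths in deleted can be dropped from the index, which is what makes incremental re-indexing cheaper than a full rebuild.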
Search has two modes controlled by the presence of stored embeddings (verified from
storage/doc_store.py::DocIndex.search()):
- Lexical (default): weighted keyword scoring across title (exact: +20, substring: +10, word overlap: +5 per word), summary (substring: +8, word: +2 per word), tags (+3), content (capped at +5). Results with a score of zero are excluded, and content is stripped from results.
- Semantic (optional): cosine-similarity over stored embedding vectors using a pure-Python dot-product implementation (no numpy) (verified from embeddings/provider.py::cosine_similarity). Falls back to lexical if no embeddings are present or query embedding fails. Three providers: Gemini text-embedding-004 (768 dims), OpenAI text-embedding-3-small (1536 dims), and sentence-transformers/all-MiniLM-L6-v2 (384 dims, fully offline). Auto-detection priority (verified from source): explicit JDOCMUNCH_EMBEDDING_PROVIDER override, then GOOGLE_API_KEY, then OPENAI_API_KEY, then installed sentence-transformers.
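Both modes can be illustrated in a few lines. The sketch below is not the project's implementation: the cosine function is the standard formula in pure Python, and lexical_score covers only the title and summary tiers with the weights listed above. How the tiers combine (exact and substring treated as mutually exclusive, word overlap added on top) is an assumption of this sketch, and the tag and content tiers are omitted.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Pure-Python cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexical_score(query: str, title: str, summary: str) -> int:
    """Title and summary tiers only (assumed combination rules)."""
    q = query.lower().strip()
    if not q:
        return 0
    words = q.split()
    title_l, summary_l = title.lower(), summary.lower()
    score = 0
    if title_l == q:            # exact title match
        score += 20
    elif q in title_l:          # query is a substring of the title
        score += 10
    score += 5 * sum(1 for w in words if w in title_l.split())
    if q in summary_l:          # query is a substring of the summary
        score += 8
    score += 2 * sum(1 for w in words if w in summary_l.split())
    return score


score = lexical_score("install guide", "Install Guide", "How to install the server")
similarity = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(score, similarity)
```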
Summarization operates in three tiers: heading text (free, always available), AI batch
(Claude Haiku or Gemini Flash, in batches of 8 sections), then "Section: {title}" fallback.
Additional summarizer providers added since prior triage: MiniMax and ZhipuAI (GLM), selected
via JDOCMUNCH_SUMMARIZER_PROVIDER env var (verified from source CLAUDE.md).
Token savings accounting in storage/token_tracker.py (verified from source): each
search_sections response computes raw_bytes as the summed content length of all sections in
the matched documents, subtracts response_bytes, and divides the difference by 4 (the
bytes-per-token heuristic). If tiktoken (cl100k_base) is installed, it is used instead for an
accurate count. Savings accumulate in
~/.doc-index/_savings.json. A background httpx thread posts an anonymised
{"delta": <int>, "anon_id": <uuid4>} payload to
https://j.gravelle.us/APIs/savings/post.php unless JDOCMUNCH_SHARE_SAVINGS=0 is set
(verified from source — telemetry is opt-out, not opt-in).
Savings accounting flaw (verified from source): in tools/search_sections.py, raw_bytes
is computed as the sum of content bytes for every section belonging to any document that
appeared in the result set — not the bytes the agent would have actually consumed. A query
returning one result from a 100-section document reports savings as if the agent would have
read all 100 sections. This inflates reported savings figures substantially for large corpora.
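A worked example shows the size of the effect. All numbers below are hypothetical, and estimate_tokens_saved is an illustrative helper: it applies the (raw_bytes - response_bytes) / 4 heuristic described above, with flooring and clamping at zero added as assumptions of this sketch.

```python
BYTES_PER_TOKEN = 4  # bytes-per-token heuristic described above


def estimate_tokens_saved(raw_bytes: int, response_bytes: int) -> int:
    """(raw_bytes - response_bytes) / 4, floored and clamped at zero."""
    return max(0, (raw_bytes - response_bytes) // BYTES_PER_TOKEN)


# Hypothetical corpus: one document with 100 sections of 1,000 bytes each.
section_bytes = 1_000
response_bytes = 1_000  # the single section actually returned

# Baseline as described above: every section of the matched document.
reported = estimate_tokens_saved(100 * section_bytes, response_bytes)

# Alternative baseline (assumption): the agent would have read 5 sections.
realistic = estimate_tokens_saved(5 * section_bytes, response_bytes)

print(reported, realistic)  # prints: 24750 1000
```

Under these made-up numbers the reported figure is roughly 25 times the alternative estimate, which is why the choice of baseline dominates the headline savings.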
Claude Code integration hooks (new since prior triage — verified from cli/hooks.py):
three Claude Code hooks are now shipped:
- PreToolUse: intercepts Read calls on doc files above 2 KB and prints a stderr hint directing Claude to use search_sections + get_section instead (never hard-blocks, to preserve the Edit workflow requiring a prior Read)
- PostToolUse: auto-reindexes a doc file after Edit/Write by spawning jdocmunch-mcp index-file <path> as a fire-and-forget background process
- PreCompact: emits a session snapshot of indexed repos as a systemMessage so doc orientation survives context compaction
Interface / API
16 MCP tools registered on the server (verified from server.py::list_tools(); prior triage
reported 13 — three tools were added: get_backlinks, get_stale_pages, get_wiki_stats):
| Tool | Registered name | Purpose |
|---|---|---|
| index_local | index_local | Index a local folder |
| index_repo | doc_index_repo | Index a GitHub repository |
| list_repos | doc_list_repos | List indexed documentation sets |
| get_toc | get_toc | Flat section list, summaries only |
| get_toc_tree | get_toc_tree | Nested section tree per document |
| get_document_outline | get_document_outline | Section hierarchy for one document |
| search_sections | search_sections | Weighted keyword or semantic search, summaries only |
| get_section | get_section | Full content of one section by ID (byte-range read) |
| get_sections | get_sections | Batch content retrieval |
| get_section_context | get_section_context | Section + ancestor headings + child summaries |
| delete_index | delete_index | Remove a doc index |
| get_broken_links | get_broken_links | Detect internal links that no longer resolve |
| get_doc_coverage | get_doc_coverage | Which symbol IDs have matching doc sections |
| get_backlinks | get_backlinks | Sections that link to a given section (new) |
| get_stale_pages | get_stale_pages | Identify pages not updated recently (new) |
| get_wiki_stats | get_wiki_stats | Aggregate statistics for an indexed wiki (new) |
Every response includes a _meta envelope: latency_ms, sections_returned, tokens_saved,
total_tokens_saved, cost_avoided, total_cost_avoided.
get_doc_coverage accepts a list of jcodemunch-format symbol IDs
({repo}::{filepath}::{name}#{type}) and matches them against section titles. It degrades
gracefully when IDs are passed as plain strings, but the integration is not documented beyond
a single comment in the source.
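Since the integration is barely documented, a sketch of what such matching could look like may help. Everything here is hypothetical: symbol_name and coverage are illustrative names, the example IDs are invented, and the case-insensitive substring match against titles is an assumption rather than the tool's actual rule. Only the ID shape ({repo}::{filepath}::{name}#{type}) and the plain-string fallback come from the description above.

```python
def symbol_name(symbol_id: str) -> str:
    """Extract {name} from '{repo}::{filepath}::{name}#{type}'.

    Falls back to the input itself when it does not have that shape,
    so plain strings are treated as bare names.
    """
    head, sep, _kind = symbol_id.rpartition("#")
    if not sep:
        return symbol_id
    parts = head.split("::")
    return parts[-1] if len(parts) == 3 else symbol_id


def coverage(symbol_ids: list[str], section_titles: list[str]) -> dict[str, bool]:
    """Map each symbol ID to whether any section title mentions its name."""
    titles = [t.lower() for t in section_titles]
    return {
        sid: any(symbol_name(sid).lower() in t for t in titles)
        for sid in symbol_ids
    }


# Invented IDs and titles, for illustration only.
titles = ["Using connect()", "Configuration"]
ids = [
    "acme/widgets::src/client.py::connect#function",
    "configuration",  # plain string, no ID structure
    "acme/widgets::src/client.py::Retry#class",
]
result = coverage(ids, titles)
print(result)
```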
Dependencies
Core (always installed, verified from pyproject.toml): mcp>=1.10.0,<2.0.0,
httpx>=0.27.0, pathspec>=0.12.0, pyyaml>=6.0. Python 3.10+ required.
Optional extras (verified from pyproject.toml): anthropic>=0.40.0 ([anthropic]),
google-generativeai>=0.8.0 ([gemini] or [semantic]), openai>=1.0.0 ([openai],
[minimax], [zhipu]). sentence-transformers is not listed as a pyproject extra
(verified — absent from pyproject.toml) and must be installed manually; auto-detected at
runtime if importable. No database; all state is JSON files under ~/.doc-index/.
Scope / limitations
- Savings are highest for well-structured, heading-rich documentation. Plain-text and prose-heavy files fall back to paragraph-block splitting, producing coarser sections.
- Semantic search is functional but underdocumented. sentence-transformers is not listed as a pyproject extra and has no install instructions in the README.
- The tokens_saved figure in _meta counts all section bytes in matched documents, not bytes an agent would have actually read. This inflates reported savings for documents with many sections where only one is relevant.
- Storage is local-only (~/.doc-index/). No multi-user or remote-index support.
- Telemetry is opt-out. Savings deltas are posted to j.gravelle.us by default. Not prominently documented.
- No background index watcher or TTL-based auto-refresh. Drift detection requires an explicit re-index call.
Benchmark claims — verified vs as-reported
| Claim | Source | Status |
|---|---|---|
| Find specific topic: ~12,000 raw tokens vs ~400 jDocMunch, ~97% savings | TOKEN_SAVINGS.md | as reported — no methodology disclosed |
| Browse doc structure: ~40,000 vs ~800, ~98% savings | TOKEN_SAVINGS.md | as reported — no methodology disclosed |
| Read one section: ~12,000 vs ~300, ~97.5% savings | TOKEN_SAVINGS.md | as reported — no methodology disclosed |
| Explore a doc set: ~100,000 vs ~2,000, ~98% savings | TOKEN_SAVINGS.md | as reported — no methodology disclosed |
| Kubernetes: 500 files indexed in 3,352 ms, 4,355 sections | benchmarks/ | as reported — developer-authored case study |
| Kubernetes: 5 parallel queries at 83–100 ms each | benchmarks/ | as reported — developer-authored case study |
| Kubernetes: credential plugin section 863 B from 95,051-byte file (110x reduction) | benchmarks/ | as reported — byte figures consistent with architecture |
| SciPy: 430 files / 10,402 sections indexed in 2,247 ms | benchmarks/ | as reported |
| LangChain MDX: 490 MDX files parsed, section count +754% after strip_mdx() | benchmarks/ | as reported |
| Byte-offset O(1) retrieval via seek() + read() | storage/doc_store.py | verified |
| O(1) section dict lookup via _section_index built at load time | storage/doc_store.py | verified |
| In-memory index cache keyed by (path, mtime_ns) | storage/doc_store.py | verified |
| Atomic writes via temp-file rename | storage/doc_store.py | verified |
| SHA-256 per-file drift detection | storage/doc_store.py | verified |
| Opt-out telemetry to j.gravelle.us | storage/token_tracker.py | verified |
| Savings flaw: counts all sections in matched docs, not returned sections | storage/doc_store.py, tools/search_sections.py | verified |
| 16 tools registered (not 13 as in prior triage) | server.py | verified |
| sentence-transformers not a pyproject.toml extra | pyproject.toml | verified |
| INDEX_VERSION = 2 in source (prior analysis said INDEX_VERSION = 2; CLAUDE.md inside tool says 1 — stale docs) | storage/doc_store.py | verified |
Prior triage claim partially contradicted: the earlier triage stated “no script, fixture
dataset, or assertion harness that an external party can execute.” Source review found
benchmarks/wiki/run_benchmark.py — a real Python harness that accepts a cloned wiki path
and query list, tokenizes with tiktoken cl100k_base, parses sections offline, and scores
results. It is reproducible by any external party with pip install tiktoken and a cloned
wiki. The three older narrated benchmark files remain developer-authored case studies without
executables, but the tool set now includes one genuine reproducible harness.
Architectural assessment
What’s genuinely novel
The section-first two-phase workflow occupies a distinct position between full-file reads (high recall, expensive) and embedding-based chunking (requires infrastructure, loses structural context). Heading boundaries as the chunking unit, combined with byte-offset storage for O(1) retrieval, and a TOC-first navigation layer before content commitment, is a coherent and practical design.
The hierarchical slug scheme is a sound engineering choice. Using the ancestor heading chain as the slug prefix makes IDs stable under sibling insertions: a new same-named heading elsewhere in the document does not renumber IDs in other branches. This matters for agents caching section IDs across turns.
Incremental indexing by SHA-256 hash diff is well-implemented. Rebuilding only changed files is important for large documentation repos that update frequently.
The multi-format preprocessing pipeline (convert-to-Markdown, parse once) avoids maintaining separate retrieval paths per format. The MDX pre-processor with 12 compiled regex patterns covering Mintlify/JSX constructs is a genuine addition over generic Markdown tools.
The multi-provider embedding layer with pure-Python cosine similarity (no numpy) enables offline semantic search without a mandatory heavy dependency chain.
Gaps and risks
Savings accounting is misleading. The tokens_saved figure counts all section bytes in
matched documents. A query returning one result from a 100-section document reports ~99%
savings even if the agent would only have opened the relevant section. This inflates headline
numbers and makes comparison with other tools unreliable.
Benchmark harness is limited. The older narrated case studies remain developer-authored
with no assertion mechanism. benchmarks/wiki/run_benchmark.py is a real harness with
tiktoken-based token counting, but it uses an offline approximation of jDocMunch’s section
parser (a simplified in-script heading splitter, not the actual parse_markdown() path) and
hardcodes a SEARCH_META_TOKENS = 190 estimate rather than measuring live MCP responses. It
exercises the concept but does not run the production code path. No CI coverage for benchmark
regression exists.
Telemetry is opt-out. Savings deltas and an anonymous UUID are posted to a third-party URL by default. Non-starter for air-gapped or privacy-sensitive deployments. Not disclosed prominently in the README.
Commercial license barrier. Any for-profit organization using jdocmunch-mcp for internal developer tooling requires a paid commercial license. Studio tier ($349) covers five developers; Platform ($1,999) covers org-wide internal deployment. This eliminates the tool as a candidate for open-source agent frameworks.
Semantic search is an undocumented optional feature. The sentence-transformers provider is
functional but not listed in pyproject.toml extras and has no README install instructions.
Teams relying on lexical fallback may get degraded recall on paraphrase-heavy queries without
understanding why.
Index version migration is silent (verified). INDEX_VERSION = 2 is hardcoded in source
(storage/doc_store.py line 15). A version mismatch silently drops the index (load_index
returns None) and triggers a full re-index with no user notification. The embedded CLAUDE.md
inside the tool still states INDEX_VERSION=1 — stale documentation. Large corpora would
silently re-index on upgrade.
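The silent-drop behaviour can be sketched as follows. This is an assumed reconstruction from the description above, not the project's load_index: the function body, the demo file, and the use of a top-level "index_version" key in the JSON are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

INDEX_VERSION = 2  # value reported from storage/doc_store.py


def load_index(index_path: Path) -> Optional[dict]:
    """Return the parsed index, or None if it is missing or from another version."""
    if not index_path.exists():
        return None
    data = json.loads(index_path.read_text(encoding="utf-8"))
    if data.get("index_version") != INDEX_VERSION:
        # No warning is emitted here: the caller just sees "no index"
        # and falls through to a full re-index.
        return None
    return data


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.json"
    old.write_text(json.dumps({"index_version": 1, "sections": []}), encoding="utf-8")
    new = Path(tmp) / "new.json"
    new.write_text(json.dumps({"index_version": 2, "sections": []}), encoding="utf-8")

    stale_result = load_index(old)      # version mismatch
    fresh_result = load_index(new)      # current version
    missing_result = load_index(Path(tmp) / "absent.json")

print(stale_result, fresh_result, missing_result)
```

From the caller's side, a stale index and a missing index are indistinguishable in this sketch, which is the reason an upgrade can trigger an unannounced full re-index.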
Recommendation
Adopt for non-commercial or individually licensed teams managing large, well-structured documentation corpora (API references, framework docs, Kubernetes-scale Markdown trees) where agents need section-level navigation without loading entire files. The byte-offset retrieval and structural search are well-implemented and the two-phase workflow discipline is sound.
Do not adopt if:
- The deployment is within a for-profit organization and the commercial license cost is prohibitive.
- Telemetry cannot be disabled (verify JDOCMUNCH_SHARE_SAVINGS=0 is enforced).
- Corpora are predominantly prose-heavy free-form text (savings will be modest).
- Reproducible benchmark evidence is required before adoption.
For open-source or non-commercial agent pipelines, jdocmunch-mcp is the most complete structural documentation retrieval MCP server available as of this writing. Treat published savings figures as illustrative estimates, not measured performance guarantees.
Comparison hooks (for ANALYSIS.md matrix)
| Dimension | jdocmunch-mcp |
|---|---|
| Approach | Structural section indexing via heading boundaries; byte-offset O(1) retrieval |
| Compression | Summaries-only search results; full content only on explicit get_section call |
| Token budget model | Per-call savings estimate ((raw doc bytes minus response bytes) / 4); opt-out telemetry |
| Injection strategy | Agent-driven two-phase: search or TOC first, targeted retrieval second |
| Eviction | No eviction; persistent JSON index; incremental update via hash-based drift detection |
| Benchmark harness | One reproducible harness (benchmarks/wiki/run_benchmark.py, tiktoken-based); three older narrated case studies with no assertions; harness uses offline parser approximation, not production code path |
| License | Dual: free non-commercial; paid commercial ($79 Builder / $349 Studio / $1,999 Platform) |
| Maturity | v1.8.0; 13 tools in prior triage (16 verified from source); test suite present; active development |