jdocmunch-mcp — Benchmark Reproduction
Source: https://github.com/jgravelle/jdocmunch-mcp (v1.8.0)
Date: 2026-04-13
Outcome: partially reproducible — one executable harness found at
benchmarks/wiki/run_benchmark.py; three older narrated case studies remain non-executable
Harness location
Executable harness (new in v1.8.0)
- `benchmarks/wiki/run_benchmark.py` (12.2 KB)
- `benchmarks/wiki/results_jcodemunch_wiki.json`
- `benchmarks/wiki/results_jcodemunch_wiki.md`

`run_benchmark.py` is a self-contained Python script that:
- Accepts a path to a cloned GitHub wiki (`git clone <repo>.wiki.git`)
- Scans all `.md` files and tokenizes them with tiktoken `cl100k_base`
- Parses the wiki into heading-delimited sections using an offline approximation of jDocMunch’s section parser (a simplified in-script splitter, not the production `parse_markdown()` code path)
- For each query, finds the best-matching section by keyword overlap
- Reports two baselines: full-wiki concatenation and single-file (conservative)
- Adds a hardcoded `SEARCH_META_TOKENS = 190` overhead to simulate the `search_sections` JSON response
Caveats:
- The section parser in the harness is a stripped-down heading splitter that does not replicate the production `make_hierarchical_slug`, byte-offset tracking, or `wire_hierarchy()` logic. Section boundaries may differ from what the server would produce.
- `SEARCH_META_TOKENS = 190` is an estimate measured from real responses by the developer; it is not computed from live MCP calls in the script.
- The scoring function used to select the best section (`find_best_section`) is a simple word-overlap counter, not the weighted multi-field scoring in `DocIndex._lexical_search`.
- The script runs entirely offline and does not require a running MCP server.
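To make the caveats concrete, the two simplified steps (heading-delimited splitting and word-overlap selection) can be sketched as follows. This is an illustrative approximation written for this page, not the code in `run_benchmark.py`; the names `split_sections`, `overlap_score`, and `best_section` are hypothetical, and the sketch ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown_text):
    """Split Markdown into (title, body) pairs at each ATX heading."""
    sections = []
    title, body = "(preamble)", []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far and start a new one.
            sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2), []
        else:
            body.append(line)
    sections.append((title, "\n".join(body).strip()))
    # Drop the preamble when nothing precedes the first heading.
    return [s for s in sections if s != ("(preamble)", "")]


def overlap_score(query, title, body):
    """Count distinct query words that also occur in the section text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    section_words = set(re.findall(r"\w+", f"{title} {body}".lower()))
    return len(query_words & section_words)


def best_section(query, sections):
    """Return the section with the highest overlap (first one wins ties)."""
    return max(sections, key=lambda s: overlap_score(query, s[0], s[1]))


doc = (
    "# Install\nRun pip install.\n\n"
    "## Search scoring\nScores use keyword overlap.\n\n"
    "## Benchmarks\nToken counts per query.\n"
)
sections = split_sections(doc)
winner = best_section("search scoring", sections)
```

A counter like this has no field weighting and no notion of heading hierarchy, which is one reason the harness can pick a different section than the server would.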
Narrated case studies (non-executable)
- `benchmarks/jDocMunch_Benchmark_Kubernetes.md`
- `benchmarks/jDocMunch_Benchmark_LangChain_MDX.md`
- `benchmarks/jDocMunch_Benchmark_SciPy.md`
- `benchmarks/jDocMunch_Benchmark_Wiki.md`

All four were authored by the developer (Claude Sonnet 4.6 on Windows). None contain scripts, fixture datasets, or assertion logic.
Reproducing the wiki harness
Prerequisites

```bash
pip install tiktoken
git clone https://github.com/jgravelle/jcodemunch-mcp.wiki.git /tmp/jcodemunch-wiki
python tools/jgravelle-jdocmunch-mcp/benchmarks/wiki/run_benchmark.py /tmp/jcodemunch-wiki
```

With custom queries and output files:

```bash
python tools/jgravelle-jdocmunch-mcp/benchmarks/wiki/run_benchmark.py /tmp/jcodemunch-wiki \
  --queries "cross repo dependency" "benchmark token" "search scoring" \
  --out /tmp/results.md --json /tmp/results.json
```

Expected output shape
Markdown tables with:
- Corpus summary (file count, bytes, tokens)
- Full-wiki baseline: baseline tokens vs. jDocMunch tokens per query, savings %, ratio
- Single-file baseline (conservative): assumes the agent already knows which file to open
- Per-query detail: target file, matched section, section bytes/tokens, jDocMunch total tokens
The developer-run results at benchmarks/wiki/results_jcodemunch_wiki.md used the
jcodemunch-mcp wiki (7 content pages, 68 sections, 29 KB, 7,449 baseline tokens) and
reported 82–97% reduction per query vs. full-wiki baseline.
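The savings percentage and ratio columns follow from simple arithmetic. The sketch below assumes that the per-query jDocMunch total is the matched section's token count plus the fixed `SEARCH_META_TOKENS` overhead, which is how the harness description reads; the function name `savings` and the 300-token section are hypothetical, chosen only to show the calculation.

```python
SEARCH_META_TOKENS = 190  # overhead constant quoted from the harness


def savings(baseline_tokens, section_tokens, meta_tokens=SEARCH_META_TOKENS):
    """Return (total tokens, percent saved, baseline-to-total ratio)."""
    total = section_tokens + meta_tokens
    pct = 100.0 * (1 - total / baseline_tokens)
    ratio = baseline_tokens / total
    return total, pct, ratio


# 7,449 is the reported full-wiki baseline; 300 is a made-up section size.
total, pct, ratio = savings(baseline_tokens=7449, section_tokens=300)
print(f"{total} tokens, {pct:.1f}% saved, {ratio:.1f}x")
```

Read in reverse, a reported 82–97% reduction against a 7,449-token baseline corresponds to per-query totals of roughly 220 to 1,340 tokens, overhead included.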
What the narrated case studies report (as reported)
Kubernetes corpus
- Corpus: `kubernetes/website` docs directory, 1,569 `.md` files (16 MB), 500 indexed.
- Sections extracted: 4,355.
- Index time: 3,352 ms.
- Five parallel queries at 83–100 ms each.
- Batch precision retrieval (5 sections in one call): 754 ms.
- Tokens saved across 5 queries: ~34,222 (as reported by server `_meta`).
- Largest single-file reduction: 95,051-byte `authentication.md` → 863 bytes fetched (110x).
SciPy corpus
- Corpus: `scipy/doc`, 430 files (24 MD + 406 RST), 3.4 MB.
- Sections extracted: 10,402.
- Index time: 2,247 ms.
- 12 domain queries at 129–153 ms each.
LangChain MDX corpus
- Corpus: 500 LangChain/LangGraph/LangSmith `.mdx` files.
- Before MDX support: 200 files indexed, 699 sections (490 MDX files inaccessible).
- After MDX support: 500 files indexed, 5,973 sections (+754%).
- Index time: 5,204 ms (vs ~800 ms for `.md`-only run).
Savings formula (verified from source)
The `tokens_saved` value in each `_meta` response is computed in `storage/token_tracker.py::estimate_savings()`:

```python
def estimate_savings(raw_bytes: int, response_bytes: int) -> int:
    return max(0, (raw_bytes - response_bytes) // _BYTES_PER_TOKEN)
```

Where `raw_bytes`, as computed in `tools/search_sections.py`, is the sum of all section content bytes in every document that contributed a result, not just the bytes of sections actually returned:

```python
matched_doc_paths = {r.get("doc_path") for r in results}
raw_bytes = sum(
    len(s.get("content", "").encode("utf-8"))
    for s in index.sections
    if s.get("doc_path") in matched_doc_paths
)
```

`_BYTES_PER_TOKEN = 4`. tiktoken (`cl100k_base`) is used instead if installed.
This formula systematically overstates savings for documents with many sections where only one or a few are relevant to the query.
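A small worked example shows the size of the effect. The sketch below reuses the byte-based formula quoted above with `_BYTES_PER_TOKEN = 4`; the document shape (20 sections of 2,000 bytes) and the assumption that an agent without the tool would have read three sections are hypothetical numbers chosen for illustration, not measurements.

```python
_BYTES_PER_TOKEN = 4  # fallback value quoted above


def estimate_savings(raw_bytes, response_bytes):
    """Byte-based savings estimate, mirroring the quoted formula."""
    return max(0, (raw_bytes - response_bytes) // _BYTES_PER_TOKEN)


# Hypothetical document: 20 sections of 2,000 bytes, one section returned.
section_sizes = [2000] * 20
response_bytes = 2000

# Server-style accounting: every section of the matched document counts.
reported = estimate_savings(sum(section_sizes), response_bytes)

# Alternative accounting (an assumption): the agent would otherwise
# have read only the three sections relevant to the query.
relevant_bytes = 3 * 2000
alternative = estimate_savings(relevant_bytes, response_bytes)

print(reported, alternative)  # 9500 1000
```

Under these assumptions the reported figure is 9.5 times the alternative one, and the gap widens as documents gain sections that the query never touches.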
Live MCP reproduction (narrated case studies)
No automated comparison against the reported figures from the three older case studies is possible. The closest approximation:
1. Install jdocmunch-mcp:
   ```bash
   pip install jdocmunch-mcp
   ```
2. Index a public documentation corpus (Kubernetes docs example):
   ```python
   # Start MCP server, then via MCP client:
   # index_repo(repo="kubernetes/website", subdir="content/en/docs", max_files=500)
   ```
3. Run the five Kubernetes queries manually via `search_sections`, and record `_meta.tokens_saved` and `_meta.latency_ms`.
4. Run batch precision retrieval via `get_sections` with the section IDs from step 3.
No fixture dataset or expected-output assertions exist for these older benchmarks.
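Because the steps above are manual, the recorded values need to be tallied by hand before they can be set against the reported figures (83–100 ms per query, ~34,222 tokens saved in total). One way to do that is sketched below. The `summarize` helper and the record layout are hypothetical, and the three sample rows are placeholders, not measurements; only the field names `tokens_saved` and `latency_ms` come from the `_meta` block described above.

```python
from statistics import median


def summarize(runs):
    """Aggregate manually recorded _meta values from search_sections calls."""
    saved = [r["tokens_saved"] for r in runs]
    latency = [r["latency_ms"] for r in runs]
    return {
        "queries": len(runs),
        "tokens_saved_total": sum(saved),
        "latency_ms_min": min(latency),
        "latency_ms_median": median(latency),
        "latency_ms_max": max(latency),
    }


# Placeholder rows: replace with the values copied from each response.
runs = [
    {"query": "placeholder query 1", "tokens_saved": 1200, "latency_ms": 90},
    {"query": "placeholder query 2", "tokens_saved": 800, "latency_ms": 85},
    {"query": "placeholder query 3", "tokens_saved": 1500, "latency_ms": 100},
]
summary = summarize(runs)
print(summary)
```

Latency depends on hardware and index state, so a comparison of this kind is indicative at best.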
Assessment
The wiki harness (`run_benchmark.py`) is reproducible by any external party with `pip install tiktoken` and a cloned wiki. It is the first independently runnable benchmark in this tool.
However, its offline section parser is a simplified approximation of the production code path,
and its search scoring differs from the server’s weighted multi-field algorithm. Results will
be directionally correct but not identical to what a live MCP session would produce.
The three older case studies and the TOKEN_SAVINGS.md figures (97–98%) remain unverifiable without the original corpora, exact query set, and server state used in measurement. The server’s savings accounting inflates numbers relative to what an agent would actually consume. Treat those figures as illustrative upper bounds, not reproducible measurements.