Benchmark repro guide — lum1104-understand-anything
Benchmark Repro Guide: lum1104-understand-anything
Section titled “Benchmark Repro Guide: lum1104-understand-anything”This document records the state of the benchmark harness for Understand-Anything as found in the vendored source at tools/lum1104-understand-anything/.
Harness status
Section titled “Harness status”No benchmark harness exists in the vendored source.
The understand-anything-plugin/src/tests/ directory contains four unit test files (context-builder.test.ts, diff-analyzer.test.ts, explain-builder.test.ts, onboard-builder.test.ts) that exercise skill-builder utility functions with synthetic fixtures. These are not benchmarks — they validate output structure and do not measure token consumption, latency, or pipeline throughput.
The scripts/generate-large-graph.mjs script generates a synthetic 3,000-node graph for dashboard rendering load tests. The project’s own CLAUDE.md explicitly marks this as “not part of the production pipeline.”
No script in the repository measures or compares token consumption, indexing time, or comprehension outcomes.
Claimed benchmark figures (as reported)
Section titled “Claimed benchmark figures (as reported)”No quantitative benchmark figures are claimed anywhere in the project. The README, CONTRIBUTING.md, and all skill definition files are explicit that the tool’s value proposition is developer comprehension and onboarding, not token reduction. The following qualitative properties are claimed:
| Claim | Source |
|---|---|
| Incremental updates (only changed files re-analyzed) | README.md |
| Parallel file analysis (up to 5 concurrent agents, 20–30 files/batch) | Skill SKILL.md |
| Multi-platform: Claude Code, Cursor, Codex, Gemini CLI, VS Code + Copilot, OpenCode, Antigravity, Pi Agent, OpenClaw (10 platforms) | README.md compatibility table |
| 8,081 GitHub stars | README.md / GitHub |
The incremental update mechanism is verified from source: staleness.ts implements getChangedFiles() via git diff <lastCommitHash>..HEAD --name-only and per-file content fingerprinting (fingerprints.json). No latency or throughput figures accompany this claim.
Why there is no benchmark
Section titled “Why there is no benchmark”The tool is a developer comprehension dashboard. Its outputs — interactive knowledge graph, guided tours, domain ontology, persona-adaptive UI — are qualitative. There is no scalar metric (token reduction, time saved, error rate) that the project attempts to quantify, and no gold-standard corpus against which comprehension quality could be measured.
The only numeric constraint in the pipeline is the >100 files gate in the SKILL.md, which prompts user confirmation before proceeding — an implicit acknowledgment that large codebases incur significant LLM API cost.
What a benchmark would require
Section titled “What a benchmark would require”To assess the tool’s stated goal of reducing developer onboarding time, a benchmark would need:
- A study cohort of developers working on an unfamiliar codebase, split into control (unassisted) and treatment (Understand-Anything dashboard) groups.
- Task completion time and error rate for a set of navigation/comprehension tasks.
- LLM API cost per full analysis run, measured at batch granularity.
- Incremental rebuild cost after a representative set of commits.
None of these exist. The tool has no equivalent to a token-savings harness.
Running the unit tests
Section titled “Running the unit tests”cd tools/lum1104-understand-anything/understand-anything-pluginpnpm installpnpm --filter "@understand-anything/skill" run testExpected output: vitest run of 4 test files (context-builder, diff-analyzer, explain-builder, onboard-builder). All tests operate on synthetic in-memory fixtures. No LSP, no LLM API calls, no graph build.
Running the dashboard rendering test
Section titled “Running the dashboard rendering test”cd tools/lum1104-understand-anythingnode scripts/generate-large-graph.mjs# Generates a 3,000-node synthetic knowledge-graph.json# Open understand-anything-plugin/packages/dashboard with a local dev server to inspect renderingThis does not exercise the analysis pipeline.