LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- Introduces LongBench, the first bilingual (EN+ZH) multi-task benchmark for long-context understanding.
- Covers 21 datasets across 6 task categories: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion.
- Average length: 6,711 words (English), 13,386 characters (Chinese).
- Evaluates 8 LLMs and finds that the commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models but still struggles on longer contexts.
- Finds that scaled position embeddings and fine-tuning on longer sequences substantially improve long-context performance.
- Finds that retrieval-based context compression helps models with weak long-context ability, although they still lag behind models with strong long-context understanding.
- Published at ACL 2024; dataset at https://github.com/THUDM/LongBench.
What’s novel / different
At submission time (Aug 2023), no existing benchmark covered long-context understanding in a bilingual, multi-task, standardised way. Needle-in-a-haystack (NIAH) tests were synthetic, and existing QA and summarization benchmarks used short contexts. LongBench was the first to unify 21 datasets, most built from real documents, into a single evaluation suite with automated scoring and bilingual coverage, making it a de facto standard for the 2023–2024 era.
Mechanism overview
Problem / motivation
LLMs had recently extended context windows (GPT-3.5-Turbo-16k, Claude 100K) but no rigorous benchmark existed to measure whether longer contexts actually improved understanding on real downstream tasks. Researchers had no agreed-upon way to compare long-context capabilities.
Core approach
- 6 categories: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion.
- 21 datasets: sourced from existing NLP benchmarks and real-world documents, reformatted into a unified schema.
- Bilingual: separate English and Chinese datasets (not parallel translations of each other); Chinese length is measured in characters rather than words.
- Automatic evaluation: standardised metrics for all datasets (F1, ROUGE, and accuracy, plus edit similarity for the code tasks).
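To make the automatic-evaluation point concrete, here is a minimal sketch of token-overlap F1, the kind of metric typically used for QA scoring. It is not the benchmark's own scoring code: the function names are hypothetical, and scoring Chinese text character by character is a simplifying assumption made for this example.

```python
from collections import Counter


def tokenize(text: str, lang: str) -> list[str]:
    """Split text into scoring units.

    Assumption for this sketch: English is split on whitespace,
    Chinese is scored per non-space character.
    """
    if lang == "zh":
        return [ch for ch in text if not ch.isspace()]
    return text.lower().split()


def f1_score(prediction: str, reference: str, lang: str = "en") -> float:
    """Token-overlap F1 between a model answer and one reference answer."""
    pred = tokenize(prediction, lang)
    ref = tokenize(reference, lang)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

When an example has several reference answers, a common convention is to take the maximum F1 over the references.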
Key design decisions
- Unified format: all datasets are normalized to the same input/output schema to enable cross-dataset comparison.
- Real documents: most datasets draw on actual books, reports, and codebases rather than synthetic haystacks.
- Average, not max: the average length (about 6.7K words for English) is moderate by 2024 standards, so the benchmark tests mid-range context rather than extreme limits.
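The unified-format and length-reporting decisions above can be sketched as a single record type plus a language-aware length function. The class and field names here are hypothetical and chosen for illustration; they are not claimed to match the released dataset's schema.

```python
from dataclasses import dataclass, field


@dataclass
class LongContextExample:
    """Hypothetical unified record; field names are illustrative only."""
    dataset: str
    language: str  # "en" or "zh"
    context: str
    question: str
    answers: list[str] = field(default_factory=list)


def context_length(example: LongContextExample) -> int:
    """Length in words for English, in non-space characters for Chinese."""
    if example.language == "zh":
        return sum(1 for ch in example.context if not ch.isspace())
    return len(example.context.split())


def average_length(examples: list[LongContextExample], language: str) -> float:
    """Mean context length over the examples in one language."""
    lengths = [context_length(e) for e in examples if e.language == language]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Keeping the length unit tied to the language field means English and Chinese averages are reported separately and are never mixed into one number.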
Evaluation (as reported)
| Finding | Detail |
|---|---|
| Best model | GPT-3.5-Turbo-16k outperforms all open-source models (as reported) |
| Open-source best | Models with scaled position embedding + long-sequence fine-tuning perform best (as reported) |
| Retrieval | Context compression via retrieval helps models with weak long-context ability, but they still trail models with strong long-context understanding (as reported) |
| Overall | All tested models struggle as context length increases; no model maintains near-full performance across all tasks (as reported) |
See the paper's tables for specific per-model, per-dataset numbers.
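As an illustration of the retrieval row, the sketch below compresses a long context by splitting it into fixed-size chunks and keeping the chunks that best match the query. It uses plain word overlap as a stand-in scorer; the paper's actual retrievers, chunk sizes, and top-k settings are not reproduced here, and all names and defaults are assumptions.

```python
def split_chunks(text: str, chunk_words: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def compress_context(context: str, query: str,
                     chunk_words: int = 200, top_k: int = 3) -> str:
    """Keep the top_k chunks with the highest word overlap with the query.

    Lexical overlap is a placeholder for a real retriever
    (for example, an embedding model or BM25).
    """
    query_terms = set(query.lower().split())
    chunks = split_chunks(context, chunk_words)
    # Rank chunk indices by how many distinct query terms each chunk contains.
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: len(query_terms & set(chunks[i].lower().split())),
        reverse=True,
    )
    # Restore document order so the compressed context still reads in sequence.
    keep = sorted(ranked[:top_k])
    return "\n".join(chunks[i] for i in keep)
```

The compressed string is then passed to the model in place of the full context, trading recall of distant details for a shorter prompt.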
Implementation details worth stealing
- Unified schema: normalising heterogeneous datasets into one format greatly reduces evaluation boilerplate; the same pattern is reusable for custom eval suites.
- Bilingual split: evaluating separately in EN and ZH exposes language-specific degradation; useful for multilingual model comparisons.
- 6-category taxonomy: the category breakdown (single-doc, multi-doc, summarization, few-shot, synthetic, code) is a reusable mental model for long-context task design.
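A minimal sketch of how the bilingual split and the category taxonomy combine at reporting time: scores are averaged per (language, category) pair instead of being pooled. The tuple layout and function name are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_scores(
    results: list[tuple[str, str, float]],
) -> dict[tuple[str, str], float]:
    """Average scores per (language, category) bucket.

    results holds (language, category, score) triples, one per dataset.
    """
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for language, category, score in results:
        buckets[(language, category)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

Reporting one number per bucket makes it visible when a model holds up on English single-doc QA but degrades on Chinese summarization, which a single pooled average would hide.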
Open questions / risks / missing details
- Average length is short: the 6.7K-word English average is now well within standard model context windows (as of 2025), so the benchmark may not challenge frontier models.
- Superseded by LongBench v2: v2 (arXiv:2412.15204) introduces harder reasoning tasks with contexts of 8K to 2M words; evaluate both together for historical comparison.
- Metric reliability: automatic metrics such as ROUGE are notoriously noisy for summarization, and the benchmark includes no model-based evaluation.
- Synthetic tasks: synthetic passage-retrieval and counting tasks sit alongside the real tasks, and some analyses do not clearly separate them.
- 8 models only: the original evaluation is narrow by 2025 standards; the community has since published larger model comparisons using this benchmark.