LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

  • Introduces LongBench, the first bilingual (EN+ZH) multi-task benchmark for long-context understanding.
  • Covers 21 datasets across 6 task categories: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion.
  • Average length: 6,711 words (English), 13,386 characters (Chinese).
  • Evaluates 8 LLMs and finds that the commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models but still struggles on longer contexts.
  • Finds that scaled position embedding and longer-sequence fine-tuning substantially improve long-context performance.
  • Finds that retrieval-based compression improves performance on relevant tasks.
  • Published at ACL 2024; dataset at https://github.com/THUDM/LongBench.
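
The length figures above use different units per language: words for English, characters for Chinese. A minimal sketch of how such lengths could be measured is shown here. The whitespace-based word split, the choice to skip whitespace when counting Chinese characters, and the `context` / `language` field names are assumptions for illustration, not the benchmark's confirmed procedure.

```python
from typing import Dict, List


def context_length(text: str, language: str) -> int:
    """Return length in words for English and in characters for Chinese.

    Assumption: English words are whitespace-separated tokens, and Chinese
    length counts every non-whitespace character.
    """
    if language == "zh":
        return sum(1 for ch in text if not ch.isspace())
    return len(text.split())


def average_length(samples: List[Dict[str, str]]) -> float:
    """Average length over samples that carry 'context' and 'language' keys
    (hypothetical field names)."""
    lengths = [context_length(s["context"], s["language"]) for s in samples]
    return sum(lengths) / len(lengths)
```

Because the two units are not comparable, averages are best reported per language rather than pooled.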

At submission time (Aug 2023), no existing benchmark covered long-context understanding in a bilingual, multi-task, standardised way. Needle-in-a-haystack (NIAH) style tests were synthetic, and existing QA and summarization benchmarks used short contexts. LongBench was the first to unify 21 datasets into a single evaluation suite with automated scoring and bilingual coverage, which made it a de facto standard for the 2023–2024 era.

LLMs had recently extended their context windows (GPT-3.5-Turbo-16k, Claude 100K), but no rigorous benchmark existed to measure whether longer contexts actually improved understanding on real downstream tasks. Researchers had no agreed-upon way to compare long-context capabilities.

  • 6 categories: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion.
  • 21 datasets: sourced from existing NLP benchmarks and real-world documents, reformatted into a unified schema.
  • Bilingual: separate English and Chinese subsets (not translations of each other); Chinese length is measured in characters rather than words.
  • Automatic evaluation: standardised metrics (F1, ROUGE-L, accuracy, and edit similarity for code) for all datasets.
  • Unified format: all datasets normalized to the same input/output schema to enable cross-dataset comparison.
  • Real documents: most datasets use actual books, reports, and codebases rather than synthetic haystacks.
  • Average not max: the average length (6.7K words for English) is moderate by 2024 standards, so the benchmark tests mid-range context rather than extreme limits.
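
The unified-format and automatic-evaluation points above can be sketched as a single record type plus a per-dataset metric lookup. This is an illustrative sketch, not the benchmark's actual code: the `Example` fields, the dataset names `qa_demo` and `retrieval_demo`, and the whitespace tokenisation in `token_f1` are all assumptions (Chinese text would need a proper word segmenter).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One record in a hypothetical unified schema shared by all datasets."""
    dataset: str
    language: str
    context: str
    question: str
    answers: List[str]


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over whitespace tokens (assumed tokenisation)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


# Hypothetical dataset names mapped to their metric.
METRICS: Dict[str, Callable[[str, str], float]] = {
    "qa_demo": token_f1,
    "retrieval_demo": exact_match,
}


def score(example: Example, prediction: str) -> float:
    """Score a prediction against the best-matching reference answer."""
    metric = METRICS[example.dataset]
    return max(metric(prediction, answer) for answer in example.answers)
```

With every dataset reduced to the same record shape, adding a dataset means adding one entry to the metric table rather than a new evaluation script.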

Finding | Detail
Best model | GPT-3.5-Turbo-16k outperforms all open-source models (as reported)
Open-source best | Models with scaled position embedding + long-sequence fine-tuning perform best (as reported)
Retrieval | Context compression via retrieval improves performance on applicable tasks (as reported)
Overall | All tested models struggle as context length increases; no model maintains near-full performance across all tasks (as reported)

Specific per-model, per-dataset numbers are in the paper's tables.
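
The retrieval finding refers to compressing a long context by keeping only the chunks most relevant to the query. A minimal sketch of that idea follows. The word-overlap scorer is a simple stand-in for a real retriever, and the function names, the default `chunk_size` and `top_k`, and the choice to keep the selected chunks in document order are illustrative assumptions rather than the paper's settings.

```python
from collections import Counter
from typing import List


def split_chunks(context: str, chunk_size: int) -> List[str]:
    """Split the context into consecutive chunks of `chunk_size` words."""
    words = context.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def overlap_score(query: str, chunk: str) -> int:
    """Count the words shared by query and chunk (stand-in for a retriever)."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum((q & c).values())


def compress_context(context: str, query: str,
                     chunk_size: int = 200, top_k: int = 3) -> str:
    """Keep the `top_k` highest-scoring chunks, in their original order."""
    chunks = split_chunks(context, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    keep = sorted(ranked[:top_k])  # restore document order
    return "\n".join(chunks[i] for i in keep)
```

The compressed string is then passed to the model in place of the full context, which only helps on tasks where the answer is localised to a few passages.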

  • Unified schema: normalising heterogeneous datasets into one format dramatically reduces evaluation boilerplate — reuse this pattern for custom eval suites.
  • Bilingual split: evaluating separately in EN and ZH exposes language-specific degradation; useful for multilingual model comparisons.
  • 6-category taxonomy: the category breakdown (single-doc, multi-doc, summarization, few-shot, synthetic, code) is a reusable mental model for long-context task design.
  • Average length is short: the 6.7K-word English average is now well within standard model context windows (as of 2025); the benchmark may not challenge frontier models.
  • Superseded by LongBench v2: v2 (arXiv:2412.15204) introduces harder reasoning tasks with contexts of 8K–2M words; evaluate both together for historical comparison.
  • Metric reliability: automatic metrics (F1, ROUGE) are notoriously noisy for summarization, and the benchmark includes no model-based evaluation.
  • Synthetic tasks: includes synthetic position-retrieval tasks alongside real tasks — not clearly separated in some analyses.
  • 8 models only: original evaluation is narrow by 2025 standards; community has since published larger model comparisons using this benchmark.
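
The bilingual split and the category taxonomy in the list above translate into a simple reporting pattern: group per-example scores by language, by category, or by length bucket instead of reporting one pooled number. The sketch below assumes each result is a dict with `language`, `category`, `length`, and `score` keys; those field names and the bucket edges are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence


def mean_by(results: List[dict],
            key: Callable[[dict], Hashable]) -> Dict[Hashable, float]:
    """Average the 'score' field within each group defined by `key`."""
    groups = defaultdict(list)
    for result in results:
        groups[key(result)].append(result["score"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


def length_bucket(length: int, edges: Sequence[int] = (4000, 8000)) -> str:
    """Label a length by the first edge it falls below (assumed bucket edges)."""
    for edge in edges:
        if length < edge:
            return f"<{edge}"
    return f">={edges[-1]}"
```

For example, `mean_by(results, lambda r: r["language"])` gives one average per language, and `mean_by(results, lambda r: length_bucket(r["length"]))` shows how scores change as contexts grow.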