ACLSum: A High-Quality Dataset for Scientific Paper Summarization

By: Prof. Dr. Kai Eckert | Tue, 25 Jun 2024

New expert-crafted dataset enables aspect-based summarization of scholarly publications at a quality level that web-crawled resources have not matched

Our team is delighted to announce the publication of “ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications” at NAACL 2024, authored by Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, and Simone Paolo Ponzetto.

Addressing a Critical Gap

While extensive efforts have been directed toward developing summarization datasets, most existing resources have been semi-automatically generated through web data crawling. This has resulted in subpar quality for training and evaluating summarization systems—a compromise largely driven by the substantial costs of generating ground-truth summaries, particularly for specialized domains like scientific literature.

Expert-Crafted Quality

ACLSum departs from this practice: it is a novel summarization dataset carefully crafted and evaluated by domain experts. Unlike previous datasets, ACLSum supports multi-aspect summarization of scientific papers, covering three crucial dimensions of each paper: its challenges, approaches, and outcomes.

Dual Format Support

A unique feature of ACLSum is its support for both extractive and abstractive summarization approaches:

  • Extractive summarization: The dataset includes sentence-level annotations for each aspect that serve as gold labels, identifying the key sentences in the source documents
  • Abstractive summarization: Expert-written summaries for each aspect serve as reference texts for evaluating generated summaries

This dual format enables researchers to explore and compare different summarization strategies within the scholarly domain.
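To make the dual format concrete, here is a minimal sketch of what one annotated document might look like and how the extractive gold labels can be used. The field names and example texts are illustrative assumptions, not the dataset's actual schema:

```python
# A toy record with ACLSum-style dual annotations (illustrative schema):
# sentences, per-aspect gold extraction labels, and per-aspect
# expert-written abstractive summaries.
record = {
    "sentences": [
        "Existing datasets are noisy.",          # index 0
        "We annotate 250 ACL papers manually.",  # index 1
        "Our models outperform baselines.",      # index 2
    ],
    "extractive": {  # indices of gold sentences for each aspect
        "challenge": [0],
        "approach": [1],
        "outcome": [2],
    },
    "abstractive": {  # expert-written reference summaries per aspect
        "challenge": "Web-crawled summarization data is low quality.",
        "approach": "250 ACL papers are annotated by experts.",
        "outcome": "Models trained on ACLSum beat baselines.",
    },
}

def extractive_summary(rec, aspect):
    """Concatenate the gold sentences annotated for the given aspect."""
    return " ".join(rec["sentences"][i] for i in rec["extractive"][aspect])

print(extractive_summary(record, "approach"))
# -> We annotate 250 ACL papers manually.
```

An extractive system is scored against the labeled sentence indices, while an abstractive system is scored against the expert-written reference text for the same aspect.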

Comprehensive Evaluation

The research team conducted extensive experiments evaluating:

  • Dataset quality through expert validation
  • Performance of pretrained language models
  • Capabilities of state-of-the-art large language models (LLMs)
  • Effectiveness of extract-then-abstract versus end-to-end approaches

The results indicate the general superiority of end-to-end aspect-based summarization, while also revealing notable limitations of LLM-based extraction approaches.
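The two compared strategies can be sketched as follows. The function bodies are toy stand-ins (keyword matching and truncation in place of trained models); only the control flow reflects the comparison described above:

```python
# Sketch of the two strategies compared in the paper: a two-stage
# extract-then-abstract pipeline versus a single end-to-end model.
# All model components are trivial placeholders.

def extract(sentences, aspect):
    # Stage 1 (extract): pick aspect-relevant sentences.
    # Placeholder: naive keyword match instead of a trained extractor.
    keywords = {"challenge": "problem", "approach": "propose", "outcome": "result"}
    return [s for s in sentences if keywords[aspect] in s.lower()]

def abstract(text):
    # Stage 2 (abstract) / end-to-end summarizer.
    # Placeholder: keep the first sentence instead of a seq2seq model.
    return text.split(".")[0] + "."

def extract_then_abstract(sentences, aspect):
    # Two-stage: the summarizer only sees the extracted sentences.
    return abstract(" ".join(extract(sentences, aspect)))

def end_to_end(sentences, aspect):
    # End-to-end: the full document goes straight into the summarizer.
    return abstract(" ".join(sentences))

doc = [
    "Noisy data is a long-standing problem.",
    "We propose an expert-annotated dataset.",
    "Results show consistent gains.",
]
print(extract_then_abstract(doc, "approach"))  # -> We propose an expert-annotated dataset.
print(end_to_end(doc, "approach"))             # -> Noisy data is a long-standing problem.
```

The toy outputs illustrate the trade-off: the two-stage pipeline depends entirely on extraction quality, which is where the paper observes LLMs struggling, whereas the end-to-end model conditions on the whole document.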

Impact on Research

With 250 carefully annotated documents—more than twice the size of comparable datasets—ACLSum provides researchers with a substantial, high-quality resource for advancing scientific paper summarization. The dataset is openly available on GitHub, promoting reproducible research and continued innovation in this important area.

Citation: Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, Simone Paolo Ponzetto (2024): ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6660–6675.