Two Huge Title and Keyword Generation Corpora of Research Articles
Main Authors: | , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 11-02-2020 |
Subjects: | |
Summary: | Recent developments in sequence-to-sequence learning with neural networks
have considerably improved the quality of automatically generated text
summaries and document keywords, creating a need for even larger training
corpora. Metadata of research articles are usually easy to find online and can
be used to perform research on various tasks. In this paper, we introduce two
huge datasets for text summarization (OAGSX) and keyword generation (OAGKX)
research, containing 34 million and 23 million records, respectively. The data
were retrieved from the Open Academic Graph which is a network of research
profiles and publications. We carefully processed each record and also tried
several extractive and abstractive methods for both tasks to create performance
baselines for other researchers. We further illustrate the performance of those
methods by previewing their outputs. In the near future, we would like to apply
topic modeling to the two datasets to derive subsets of research articles from more
specific disciplines. |
---|---|
DOI: | 10.48550/arxiv.2002.04689 |