The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Format: Journal Article
Language: English
Published: 07-03-2023
Online Access: https://doi.org/10.48550/arxiv.2303.03915
Summary: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
DOI: 10.48550/arxiv.2303.03915
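
Note: The released ROOTS subsets are distributed through the Hugging Face Hub. The following is a minimal sketch of streaming one subset with the Hugging Face `datasets` library; the repository id ("bigscience-data/roots_en_wikipedia") and the "text" field name are assumptions for illustration and should be verified against the actual release under the bigscience-data organization.

# Minimal sketch: stream a ROOTS subset so the 1.6TB-scale corpus
# is never downloaded in full. Repo id and field name are assumed.
from datasets import load_dataset

subset = load_dataset(
    "bigscience-data/roots_en_wikipedia",  # assumed repo id, for illustration
    split="train",
    streaming=True,  # returns an IterableDataset; records arrive lazily
)

# Inspect the first few documents.
for i, record in enumerate(subset):
    print(record["text"][:200])  # assumes a "text" field, common in HF corpora
    if i == 2:
        break

Streaming mode avoids materializing the dataset on disk, which matters at this scale; for full-corpus processing, the per-language subsets can be iterated the same way.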