Search Results - "Suarez, Pedro Ortiz"
1
Automatic extraction of materials and properties from superconductors scientific literature
Published in Science and technology of advanced materials. Methods (31-12-2023) “…The automatic extraction of materials and related properties from the scientific literature is gaining attention in data-driven materials science (Materials…”
Journal Article
2
Semi-automatic staging area for high-quality structured data extraction from scientific literature
Published in Science and technology of advanced materials. Methods (31-12-2023) “…We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from literature,…”
Journal Article
3
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Published in Transactions of the Association for Computational Linguistics (31-01-2022) “…With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large,…”
Journal Article
4
Semi-automatic staging area for high-quality structured data extraction from scientific literature
Published 16-11-2023 “…We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from literature,…”
Journal Article
5
Automatic extraction of materials and properties from superconductors scientific literature
Published 23-11-2022 “…STAM:M, 2023, VOL. 3, NO. 1, 2153633 The automatic extraction of materials and related properties from the scientific literature is gaining attention in…”
Journal Article
6
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Published 21-02-2022 “…Transactions of the Association for Computational Linguistics (2022) 10: 50-72 With the success of large-scale pre-training and multilingual modeling in…”
Journal Article
7
Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
Published 20-12-2022 “…As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for…”
Journal Article
8
Molyé: A Corpus-based Approach to Language Contact in Colonial France
Published 08-08-2024 “…Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the…”
Journal Article
9
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Published 17-01-2022 “…The need for raw large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods…”
Journal Article
10
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Published 12-06-2024 “…Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et…”
Journal Article
11
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
Published 18-06-2020 “…Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020, Online We use the multilingual OSCAR corpus, extracted from…”
Journal Article
12
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
Published 18-02-2022 “…Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources…”
Journal Article
13
Data Processing for the OpenGPT-X Model Family
Published 11-10-2024 “…This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating…”
Journal Article
14
CamemBERT: a Tasty French Language Model
Published 21-05-2020 “…Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020, Online Pretrained language models are now ubiquitous in…”
Journal Article
15
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Published 30-09-2024 “…We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a…”
Journal Article
16
Tokenizer Choice For LLM Training: Negligible or Crucial?
Published 12-10-2023 “…The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures…”
Journal Article
17
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Published 24-01-2022 “…In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large…”
Journal Article
18
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Published 07-03-2023 “…As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The…”
Journal Article
19
Establishing a New State-of-the-Art for French Named Entity Recognition
Published 27-05-2020 “…LREC 2020 - 12th Language Resources and Evaluation Conference, May 2020, Marseille, France The French TreeBank developed at the University Paris 7 is the main…”
Journal Article