Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models
Main Authors: | , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 09-06-2024 |
Subjects: | |
Summary: | Large language models (LLMs) have gained tremendous popularity in domains
such as e-commerce, finance, healthcare, and education. Fine-tuning is a common
approach to customize an LLM on a domain-specific dataset for a desired
downstream task. In this paper, we present a resource for fine-tuning
LLMs developed for the Spanish language to perform tasks such as
classification, masked language modeling, and clustering. Our resource
is a collection of handwritten notary records from the seventeenth century
obtained from the National Archives of Argentina. This collection contains a
combination of original images and transcribed text (and metadata) of 160+
pages that were handwritten by two notaries, namely, Estenban Agreda de Vergara
and Nicolas de Valdivia y Brisuela nearly 400 years ago. Through empirical
evaluation, we demonstrate that our collection can be used to fine-tune Spanish
LLMs for tasks such as classification and masked language modeling, and that
the fine-tuned models can outperform pre-trained Spanish models and
ChatGPT-3.5/ChatGPT-4o. Our collection is intended as a resource for historical
text analysis and is publicly available on GitHub. |
---|---|
DOI: | 10.48550/arxiv.2406.05812 |