The first annotated corpus of historical Basque

Bibliographic Details
Published in: Digital Scholarship in the Humanities Vol. 37; no. 2; pp. 391-404
Main Authors: Estarrona, Ainara, Etxeberria, Izaskun, Soraluze, Ander, Etxepare, Ricardo, Padilla-Moyano, Manuel
Format: Journal Article
Language: English
Published: Oxford University Press 25-05-2022
Abstract This article presents the construction of a morphosyntactically annotated diachronic corpus of Basque, together with the first results obtained in processing historical varieties of the language with computational techniques. The corpus contains around one million words, spanning the 15th to the mid-18th century and encompassing the most significant written production in all historical dialects. Morphosyntactic tagging allows for systematic searches at different levels of complexity; in addition, a rich set of metadata enables searches based on sociohistorical criteria. This is not only the first tagged corpus of historical Basque but also a means of improving language processing tools by analyzing historical varieties that are more or less distant from the present-day standard language. Moreover, the project aims to set a model for further work on historical corpora of Basque and to inform similar projects on other languages.
BackLink https://hal.science/hal-03505658
ContentType Journal Article
Copyright Distributed under a Creative Commons Attribution 4.0 International License
DOI 10.1093/llc/fqab066
Discipline Languages & Literatures
EISSN 2055-768X
EndPage 404
ExternalDocumentID oai_HAL_hal_03505658v1
10_1093_llc_fqab066
ISSN 2055-7671
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 2
Keywords natural language processing (NLP)
history of Basque
historical corpora
diachronic syntax
digital humanities
Language English
License Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
ORCID 0000-0003-1948-8248
OpenAccessLink https://hal.science/hal-03505658
PageCount 14
PublicationCentury 2000
PublicationDate 2022-05-25
PublicationTitle Digital Scholarship in the Humanities
PublicationYear 2022
Publisher Oxford University Press
StartPage 391
SubjectTerms Humanities and Social Sciences
Title The first annotated corpus of historical Basque
URI https://hal.science/hal-03505658
Volume 37