The first annotated corpus of historical Basque
Published in: | Digital Scholarship in the Humanities Vol. 37; no. 2; pp. 391 - 404 |
---|---|
Main Authors: | Ainara Estarrona, Izaskun Etxeberria, Ander Soraluze, Ricardo Etxepare, Manuel Padilla-Moyano |
Format: | Journal Article |
Language: | English |
Published: | Oxford University Press, 25-05-2022 |
Abstract | This article presents the construction of a morphosyntactically annotated diachronic corpus of Basque, together with the first results obtained in processing historical varieties of the language with computational techniques. The corpus contains around one million words, spanning the 15th to the mid-18th century and covering the most significant written production in all historical dialects. Morphosyntactic tagging allows systematic searches at different levels of complexity, and a rich set of metadata also enables searches based on sociohistorical criteria. This is not only the first tagged corpus of historical Basque but also a means of improving language processing tools by analyzing historical varieties that lie at varying distances from the present-day standard language. The project also aims to set a model for further work on historical corpora of Basque and to inform similar projects on other languages. |
---|---|
Author | Estarrona, Ainara; Etxeberria, Izaskun; Soraluze, Ander; Etxepare, Ricardo; Padilla-Moyano, Manuel |
DOI | 10.1093/llc/fqab066 |
Discipline | Languages & Literatures |
EISSN | 2055-768X |
ISSN | 2055-7671 |
Keywords | natural language processing (NLP); history of Basque; historical corpora; diachronic syntax; digital humanities |
License | Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
ORCID | 0000-0003-1948-8248 |
OpenAccessLink | https://hal.science/hal-03505658 |
PageCount | 14 |
PublicationDate | 2022-05-25 |
SubjectTerms | Humanities and Social Sciences |