SHARING HIGH-QUALITY LANGUAGE RESOURCES IN THE LEGAL DOMAIN TO DEVELOP NEURAL MACHINE TRANSLATION FOR UNDER-RESOURCED EUROPEAN LANGUAGES
Published in: | Revista de llengua i dret no. 78; pp. 9 - 34 |
---|---|
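Main Authors: | Bago, Petra; Castilho, Sheila; Celeste, Edoardo; Dunne, Jane; Gaspari, Federico; Gíslason, Níels Rúnar; Kåsen, Andre; Klubička, Filip; Kristmannsson, Gauti; McHugh, Helen; Moran, Róisín; Loinsigh, Órla Ní; Olsen, Jon Arild; Parra Escartín, Carla; Ramesh, Akshai; Resende, Natalia; Sheridan, Páraic; Way, Andy |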
Format: | Journal Article |
Language: | English |
Published: | Barcelona: Escola d'Administració Pública de Catalunya, 01-12-2022 |
Abstract | This article reports some of the main achievements of the European Union-funded PRINCIPLE project in collecting high-quality language resources (LRs) in the legal domain for four under-resourced European languages: Croatian, Irish, Norwegian, and Icelandic. After illustrating the significance of this work for developing translation technologies in the context of the European Union and the European Economic Area, the article outlines the main steps of data collection, curation, and sharing of the LRs gathered with the support of public and private data contributors. This is followed by a description of the development pipeline and key features of the state-of-the-art, bespoke neural machine translation (MT) engines for the legal domain that were built using this data. The MT systems were evaluated with a combination of automatic and human methods to validate the quality of the LRs collected in the project, and the high-quality LRs were subsequently shared with the wider community via the ELRC-SHARE repository. The main challenges encountered in this work are discussed, emphasising the importance and the key benefits of sharing high-quality digital LRs. |
---|---|
Author | Moran, Róisín; Resende, Natalia; Celeste, Edoardo; Parra Escartín, Carla; Loinsigh, Órla Ní; Kåsen, Andre; Bago, Petra; Gíslason, Níels Rúnar; Klubička, Filip; Kristmannsson, Gauti; Ramesh, Akshai; Olsen, Jon Arild; Sheridan, Páraic; McHugh, Helen; Way, Andy; Castilho, Sheila; Dunne, Jane; Gaspari, Federico |
Author_xml | 1. Bago, Petra; 2. Castilho, Sheila; 3. Celeste, Edoardo; 4. Dunne, Jane; 5. Gaspari, Federico; 6. Gíslason, Níels Rúnar; 7. Kåsen, Andre; 8. Klubička, Filip; 9. Kristmannsson, Gauti; 10. McHugh, Helen; 11. Moran, Róisín; 12. Loinsigh, Órla Ní; 13. Olsen, Jon Arild; 14. Parra Escartín, Carla; 15. Ramesh, Akshai; 16. Resende, Natalia; 17. Sheridan, Páraic; 18. Way, Andy |
ContentType | Journal Article |
Copyright | Copyright Escola d'Administracio Publica de Catalunya Dec 2022 |
DOI | 10.2436/rld.i78.2022.3741 |
DatabaseName | Linguistics and Language Behavior Abstracts (LLBA) DOAJ Directory of Open Access Journals |
Discipline | Languages & Literatures |
EISSN | 2013-1453 |
EndPage | 34 |
ISSN | 0212-5056 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Issue | 78 |
Language | English |
OpenAccessLink | https://doaj.org/article/ac3aa99d42264ece93bc3f64e7ce2957 |
PageCount | 26 |
PublicationDate | 2022-12-01 |
PublicationPlace | Barcelona |
PublicationTitle | Revista de llengua i dret |
PublicationYear | 2022 |
Publisher | Escola d'Administració Pública de Catalunya |
StartPage | 9 |
SubjectTerms | Data collection; European languages; Evaluation; Icelandic language; Language resources; Legal translation; Low-resource languages; Machine translation; Neural machine translation (MT); Norwegian language; Serbo-Croatian language |
Title | SHARING HIGH-QUALITY LANGUAGE RESOURCES IN THE LEGAL DOMAIN TO DEVELOP NEURAL MACHINE TRANSLATION FOR UNDER-RESOURCED EUROPEAN LANGUAGES |
URI | https://www.proquest.com/docview/2758389696 https://doaj.org/article/ac3aa99d42264ece93bc3f64e7ce2957 |