SHARING HIGH-QUALITY LANGUAGE RESOURCES IN THE LEGAL DOMAIN TO DEVELOP NEURAL MACHINE TRANSLATION FOR UNDER-RESOURCED EUROPEAN LANGUAGES

Bibliographic Details
Published in: Revista de llengua i dret, no. 78, pp. 9-34
Main Authors: Bago, Petra, Castilho, Sheila, Celeste, Edoardo, Dunne, Jane, Gaspari, Federico, Gíslason, Níels Rúnar, Kåsen, Andre, Klubička, Filip, Kristmannsson, Gauti, McHugh, Helen, Moran, Róisín, Loinsigh, Órla Ní, Olsen, Jon Arild, Parra Escartín, Carla, Ramesh, Akshai, Resende, Natalia, Sheridan, Páraic, Way, Andy
Format: Journal Article
Language: English
Published: Barcelona: Escola d'Administració Pública de Catalunya, 01-12-2022
Abstract: This article reports some of the main achievements of the European Union-funded PRINCIPLE project in collecting high-quality language resources (LRs) in the legal domain for four under-resourced European languages: Croatian, Irish, Norwegian, and Icelandic. After illustrating the significance of this work for developing translation technologies in the context of the European Union and the European Economic Area, the article outlines the main steps of data collection, curation, and sharing of the LRs gathered with the support of public and private data contributors. This is followed by a description of the development pipeline and key features of the state-of-the-art, bespoke neural machine translation (MT) engines for the legal domain that were built using this data. The MT systems were evaluated with a combination of automatic and human methods to validate the quality of the LRs collected in the project, and the high-quality LRs were subsequently shared with the wider community via the ELRC-SHARE repository. The main challenges encountered in this work are discussed, emphasising the importance and the key benefits of sharing high-quality digital LRs.
Copyright: Escola d'Administració Pública de Catalunya, Dec 2022
DOI: 10.2436/rld.i78.2022.3741
Discipline: Languages & Literatures
EISSN: 0212-5056, 2013-1453
ISSN: 0212-5056
Open access: yes
Peer reviewed: no
Scholarly: yes
Open access link: https://doaj.org/article/ac3aa99d42264ece93bc3f64e7ce2957
Subject terms: Data collection; European languages; evaluation; Icelandic language; language resources; legal translation; low-resource languages; Machine translation; neural machine translation (mt); Norwegian language; Serbo-Croatian language
URI https://www.proquest.com/docview/2758389696
https://doaj.org/article/ac3aa99d42264ece93bc3f64e7ce2957