Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Bibliographic Details
Published in: Journal of Data Mining and Digital Humanities Vol. 2021; no. Digital humanities in...
Main Authors: Camps, Jean-Baptiste, Gabay, Simon, Fièvre, Paul, Clérice, Thibault, Cafiero, Florian
Format: Journal Article
Language: English
Published: INRIA 14-02-2021
Nicolas Turenne
Description
Summary: This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves to be robust in out-of-domain tests, i.e. up to 20th-century novels.
ISSN:2416-5999
DOI:10.46298/jdmdh.6485