Diachronic Document Dataset for Semantic Layout Analysis
Main Authors: | , , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 15-11-2024 |
Subjects: | |
Summary: | We present a novel, open-access dataset designed for semantic layout
analysis, built to support document recreation workflows through mapping with
the Text Encoding Initiative (TEI) standard. This dataset includes 7,254
annotated pages spanning a large temporal range (1600-2024) of digitised and
born-digital materials across diverse document types (magazines, papers from
sciences and humanities, PhD theses, monographs, plays, administrative reports,
etc.) sorted into modular subsets. By incorporating content from different
periods and genres, it addresses varying layout complexities and historical
changes in document structure. The modular design allows domain-specific
configurations. We evaluate object detection models on this dataset, examining
the impact of input size and subset-based training. Results show that a
1280-pixel input size for YOLO is optimal and that training on subsets
generally benefits from incorporating them into a generic model rather than
fine-tuning pre-trained weights. |
---|---|
DOI: | 10.48550/arxiv.2411.10068 |