Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process
Main Authors: | , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 15-11-2024 |
Subjects: | |
Summary: | The digitization of natural history collections over the past three decades
has unlocked a treasure trove of specimen imagery and metadata. There is great
interest in making this data more useful by labeling it with additional trait
data, and modern deep learning techniques using convolutional neural networks
(CNNs) and similar networks show particular promise for reducing the amount of
manual labeling required from human experts, making the process much faster and
less expensive. However, in most cases the accuracy of these approaches,
typically in the range of 80-85%, is too low for the automatic labels to be
used reliably. In this paper, we present and validate an approach that can
greatly improve this accuracy, essentially by examining the network's
confidence in each generated label and using a user-defined threshold to reject
labels whose confidence falls below a chosen level. We demonstrate that a naive
model that produced 86% initial accuracy can reach over 95% accuracy (rejecting
about 40% of the labels) or over 99% accuracy (rejecting about 65%) when higher
confidence thresholds are selected. This gives flexibility to adapt existing
models to the statistical requirements of various types of research and has the
potential to move these automatic labeling approaches from being unusably
inaccurate to being an invaluable new tool. After validating the approach in a
number of ways, we annotate the reproductive state of a large dataset of over
600,000 herbarium specimens. The analysis of the results points to
under-investigated correlations and generally aligns with known trends. By
sharing this new dataset alongside this work, we aim to let ecologists gather
insights for their own research questions, at their chosen point on the
accuracy/coverage trade-off. |
DOI: | 10.48550/arxiv.2411.10074 |
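
The confidence-based rejection described in the summary can be illustrated with a minimal sketch. It assumes that "confidence" means the highest class probability output by the classifier; the summary does not say how the paper measures confidence, so this is an illustration of the general idea rather than the authors' method. The function names and the toy probabilities below are hypothetical and chosen only for the example.

```python
from typing import List, Optional, Sequence, Tuple


def apply_threshold(
    probs: Sequence[Sequence[float]], threshold: float
) -> List[Optional[int]]:
    """Return the most probable class per sample, or None when rejected.

    Assumption: confidence is the top class probability of each row.
    """
    labels: List[Optional[int]] = []
    for row in probs:
        top = max(range(len(row)), key=lambda i: row[i])
        # Keep the label only if the network is confident enough.
        labels.append(top if row[top] >= threshold else None)
    return labels


def coverage_and_accuracy(
    predicted: Sequence[Optional[int]], truth: Sequence[int]
) -> Tuple[float, Optional[float]]:
    """Coverage is the fraction of labels kept; accuracy is over kept labels."""
    kept = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    coverage = len(kept) / len(truth) if truth else 0.0
    if not kept:
        return coverage, None  # nothing accepted, accuracy undefined
    accuracy = sum(p == t for p, t in kept) / len(kept)
    return coverage, accuracy


# Toy, invented data: five samples, two classes (not taken from the paper).
probs = [[0.95, 0.05], [0.60, 0.40], [0.20, 0.80], [0.55, 0.45], [0.10, 0.90]]
truth = [0, 1, 1, 0, 1]

for threshold in (0.0, 0.75):
    cov, acc = coverage_and_accuracy(apply_threshold(probs, threshold), truth)
    print(f"threshold={threshold}: coverage={cov}, accuracy={acc}")
```

Raising the threshold in this toy example discards the low-confidence samples, one of which was mislabeled, so accuracy on the remaining labels rises while coverage falls. This is the same accuracy/coverage trade-off that the summary describes.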