Labeling Poststorm Coastal Imagery for Machine Learning: Measurement of Interrater Agreement

Bibliographic Details
Published in: Earth and Space Science (Hoboken, N.J.), Vol. 8, No. 9
Main Authors: Goldstein, Evan B., Buscombe, Daniel, Lazarus, Eli D., Mohanty, Somya D., Rafique, Shah Nafis, Anarde, Katherine A., Ashton, Andrew D., Beuzen, Tomas, Castagno, Katherine A., Cohn, Nicholas, Conlin, Matthew P., Ellenson, Ashley, Gillen, Megan, Hovenga, Paige A., Over, Jin‐Si R., Palermo, Rose V., Ratliff, Katherine M., Reeves, Ian R. B., Sanborn, Lily H., Straub, Jessamin A., Taylor, Luke A., Wallace, Elizabeth J., Warrick, Jonathan, Wernette, Phillipe, Williams, Hannah E.
Format: Journal Article
Language: English
Published: Hoboken: John Wiley & Sons, Inc.; American Geophysical Union (AGU), 01-09-2021
Description
Summary:
Abstract: Classifying images using supervised machine learning (ML) relies on labeled training data—classes or text descriptions, for example, associated with each image. Data‐driven models are only as good as the data used for training, and this points to the importance of high‐quality labeled data for developing a ML model that has predictive skill. Labeling data is typically a time‐consuming, manual process. Here, we investigate the process of labeling data, with a specific focus on coastal aerial imagery captured in the wake of hurricanes that affected the Atlantic and Gulf Coasts of the United States. The imagery data set is a rich observational record of storm impacts and coastal change, but the imagery requires labeling to render that information accessible. We created an online interface that served labelers a stream of images and a fixed set of questions. A total of 1,600 images were labeled by at least two and as many as seven coastal scientists. We used the resulting data set to investigate interrater agreement: the extent to which labelers labeled each image similarly. Interrater agreement scores, assessed with percent agreement and Krippendorff's alpha, are higher when the questions posed to labelers are relatively simple, when the labelers are provided with a user manual, and when images are smaller. Experiments in interrater agreement point toward the benefit of multiple labelers for understanding the uncertainty in labeling data for machine learning research.
Plain Language Summary: After hurricanes and storms, pictures taken from a plane can be used to observe how the coast was impacted. A single flight might take thousands of pictures. If a computer could automatically analyze the pictures, then a person would not need to look at them one‐by‐one. To teach a computer to analyze images, we need many pictures and many labels that describe what is visible in each picture. But where do we get those labels? Typically, a coastal scientist labels the pictures by sorting them into folders or typing codes into a spreadsheet. But does every coastal scientist label pictures the same way? Some labeling questions are easy to answer, and scientists mostly agree (“Is this image all water?”). Other labeling questions are harder to answer and cause disagreement (“Was there damage to buildings?”). This paper is about how well scientists agree when labeling the same pictures, and how we can improve agreement among scientists. We try some experiments and offer a few ideas on how to improve agreement. We suggest writing very clear questions, using smaller images, and having a comprehensive manual. It turns out that having a manual with examples—and reading the manual!—really helps.
Key Points:
- We measure agreement among coastal scientists labeling the same sets of poststorm images
- Coastal scientists agree more when rating landforms, less when labeling inferred processes
- Iterating on questions, providing documentation, and using smaller image sizes all increase agreement
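The abstract reports agreement using percent agreement and Krippendorff's alpha. As a reference point only, the sketch below shows one common way to compute both measures for nominal (categorical) labels when each image is labeled by a varying number of raters; it is not the authors' code, and the rater names, label values, and toy data are hypothetical.

```python
# A minimal sketch (not the authors' code) of the two interrater agreement
# measures named in the abstract: simple percent agreement and Krippendorff's
# alpha with the nominal difference function. None marks an image a rater skipped.
from collections import Counter


def percent_agreement(ratings):
    """Fraction of images, labeled by two or more raters, with unanimous labels."""
    agree = rated = 0
    for unit in zip(*ratings):                      # one tuple of labels per image
        labels = [v for v in unit if v is not None]
        if len(labels) < 2:
            continue                                # nothing to compare
        rated += 1
        agree += int(len(set(labels)) == 1)
    return agree / rated


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels, via the coincidence matrix."""
    coincidence = Counter()                         # o[c, k]: within-image label pairings
    for unit in zip(*ratings):
        labels = [v for v in unit if v is not None]
        m = len(labels)
        if m < 2:
            continue                                # unpairable image
        for i, a in enumerate(labels):
            for j, b in enumerate(labels):
                if i != j:
                    coincidence[a, b] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _b), count in coincidence.items():
        marginals[a] += count
    n = sum(marginals.values())                     # number of pairable labels
    observed = sum(c for (a, b), c in coincidence.items() if a != b)
    expected = sum(marginals[a] * marginals[b]
                   for a in marginals for b in marginals if a != b)
    return 1.0 - (n - 1) * observed / expected


# Hypothetical example: three raters answer a yes/no question ("w" / "n")
# for five images; rater_2 skipped the last image.
rater_1 = ["w", "w", "n", "n", "w"]
rater_2 = ["w", "w", "n", "w", None]
rater_3 = ["w", "n", "n", "n", "w"]
ratings = [rater_1, rater_2, rater_3]
print(percent_agreement(ratings))                   # 0.6: three of five images are unanimous
print(krippendorff_alpha_nominal(ratings))          # ~0.46 here; 1.0 is perfect agreement
```

Krippendorff's alpha is a natural companion to raw percent agreement in this setting because it corrects for chance agreement and tolerates images labeled by different numbers of raters, which matches the paper's design of two to seven labelers per image.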
ISSN: 2333-5084
DOI: 10.1029/2021EA001896