Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code

Bibliographic Details
Published in: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 304-316
Main Authors: Yang, Chenyang; Zhou, Shurui; Guo, Jin L.C.; Kastner, Christian
Format: Conference Proceeding
Language: English
Published: IEEE 01-11-2021
Description
Summary: Data scientists reportedly spend a significant amount of their time in their daily routines on data wrangling, i.e. cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, it is easy to introduce subtle bugs when reusing and adopting existing code, which results in reduced model quality. To support data scientists with data wrangling, we present a technique to generate documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide on-demand documentation for many cells in popular notebooks and find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
ISSN: 2643-1572
DOI: 10.1109/ASE51524.2021.9678520