The misuse of the NASA Metrics Data Program data sets for automated software defect prediction

Bibliographic Details
Published in:	15th Annual Conference on Evaluation & Assessment in Software Engineering (EASE 2011) pp. 96 - 103
Main Authors:	Gray, D; Bowes, D; Davey, N; Sun, Yi; Christianson, B
Format:	Conference Proceeding
Language:	English
Published:	Stevenage IET 2011 The Institution of Engineering & Technology
Subjects:	Data handling techniques; Knowledge engineering techniques; Software engineering techniques; data cleansing process; fault tolerant computing; automated software defect prediction; data mining; NASA metrics data program data set

Description
Summary:	Background: The NASA Metrics Data Program data sets have been heavily used in software defect prediction experiments. Aim: To demonstrate and explain why these data sets require significant pre-processing in order to be suitable for defect prediction. Method: A meticulously documented data cleansing process involving all 13 of the original NASA data sets. Results: After our novel data cleansing process, each of the data sets had between 6 and 90 percent fewer recorded values than it had originally. Conclusions: One: Researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Two: Defect prediction data sets could benefit from lower level code metrics in addition to those more commonly used, as these will help to distinguish modules, reducing the likelihood of repeated data points. Three: The bulk of defect prediction experiments based on the NASA Metrics Data Program data sets may have led to erroneous findings. This is mainly due to repeated data points potentially causing substantial amounts of training and testing data to be identical.
ISBN:	9781849195096 1849195099
DOI:	10.1049/ic.2011.0012