Faking feature importance: A cautionary tale on the use of differentially-private synthetic data
Main Authors: | , , , , , , , , , , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 02-03-2022 |
Summary: | Synthetic datasets are often presented as a silver-bullet solution to the
problem of privacy-preserving data publishing. However, for many applications,
synthetic data has been shown to have limited utility when used to train
predictive models. One promising potential application of these data is in the
exploratory phase of the machine learning workflow, which involves
understanding, engineering and selecting features. This phase often involves
considerable time, and depends on the availability of data. There would be
substantial value in synthetic data that permitted these steps to be carried
out while, for example, data access was being negotiated, or with fewer
information governance restrictions. This paper presents an empirical analysis
of the agreement between the feature importance obtained from raw and from
synthetic data, on a range of artificially generated and real-world datasets
(where feature importance represents how useful each feature is when predicting
the outcome). We employ two differentially-private methods to produce
synthetic data, and apply various utility measures to quantify the agreement in
feature importance as this varies with the level of privacy. Our results
indicate that synthetic data can sometimes preserve several representations of
the ranking of feature importance in simple settings, but their performance is
not consistent and depends upon a number of factors. Particular caution should
be exercised in more nuanced real-world settings, where synthetic data can lead
to differences in ranked feature importance that could alter key modelling
decisions. This work has important implications for developing synthetic
versions of highly sensitive data sets in fields such as finance and
healthcare. |
---|---|
DOI: | 10.48550/arxiv.2203.01363 |
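
The summary describes measuring agreement between feature importance rankings obtained from raw and from synthetic data. The record does not say which utility measures the paper uses, so the following is only a minimal sketch under assumptions: it compares two importance vectors using Spearman rank correlation and top-k overlap, two common rank-agreement measures chosen here for illustration. The function names and the importance scores are hypothetical and are not taken from the paper.

```python
def ranks(scores):
    """Return 1-based ranks (1 = most important), averaging tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either vector is entirely tied.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(a, b, k):
    """Fraction of the k highest-scoring features shared by both vectors."""
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / k


# Hypothetical importance scores for five features (illustrative only).
raw_importance = [0.40, 0.30, 0.15, 0.10, 0.05]
synthetic_importance = [0.35, 0.10, 0.30, 0.15, 0.10]

print("Spearman agreement:", spearman(raw_importance, synthetic_importance))
print("Top-2 overlap:", top_k_overlap(raw_importance, synthetic_importance, 2))
```

In a study like the one summarised above, such measures would presumably be recomputed at each privacy level to see how agreement changes; a low top-k overlap is the kind of disagreement that could change which features are selected for modelling.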