Efficient k-anonymous microaggregation of multivariate numerical data via principal component analysis

Bibliographic Details
Published in:	Information sciences Vol. 503; pp. 417 - 443
Main Authors:	Monedero, David Rebollo; Mezher, Ahmad Mohamad; Colomé, Xavier Casanova; Forné, Jordi; Soriano, Miguel
Format:	Journal Article Publication
Language:	English
Published:	Elsevier Inc 01-11-2019
Subjects:	Bases de dades; Big data; Ciències de la informació; Dades massives; Data privacy; Information science; Informàtica; K-anonymity; Large-scale datasets; Microaggregation; Principal component analysis; Sistemes d'informació; Statistical disclosure control; Àrees temàtiques de la UPC

Description
Summary:	•The primary goal of this work is to reduce the running time of k-anonymous microaggregation algorithms operating on datasets with a large number of numerical demographic attributes acting as quasi-identifiers. Principal component analysis (PCA), an algebraic-statistical procedure that constructs an orthogonal projection onto a lower-dimensional subspace, permits the effective reduction of the number of attributes of the original dataset. The optimality principles of multivariate PCA strive to preserve Euclidean distances between the projected data points.
•The compressed data is fed to the microaggregation algorithm, but the k-anonymous microcells or groups obtained are applied directly to the original data. The distance-preservation properties of multivariate PCA help construct a micropartition of the set of respondents similar to that obtained when the original data is microaggregated in the conventional fashion, but in fewer dimensions.
•Compared with the traditional procedure on the original data, this yields significant time gains (≈ 14–31%) with very little impact on information utility (< 2%, with respect to the total variance).
•Additional variants of the above method are devised and analyzed with extensive experimentation on standardized datasets, in terms of running time and information loss, pushing the already substantial speed-up even further (≈ 48–64%), with mild distortion impact (< 3%, with respect to the total variance).
k-Anonymous microaggregation is a widespread technique for protecting the privacy of respondents beyond the mere suppression of their identifiers, in applications where preserving the utility of the information disclosed is critical. Unfortunately, microaggregation methods with high data utility may impose stringent computational demands when dealing with datasets containing a large number of records and attributes. This work proposes and analyzes various anonymization methods which draw upon the algebraic-statistical technique of principal component analysis (PCA) in order to effectively reduce the number of attributes processed, that is, the dimension of the multivariate microaggregation problem at hand. By preserving to a high degree the energy of the numerical dataset and carefully choosing the number of dominant components to process, we manage to achieve remarkable reductions in running time and memory usage with negligible impact on information utility. Our methods are readily applicable to high-utility SDC of large-scale datasets with numerical demographic attributes. © 2019 The Authors. Preprint submitted to Elsevier, Inc.
ISSN:	0020-0255 1872-6291
DOI:	10.1016/j.ins.2019.07.042
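
The pipeline described in the summary (project with PCA, microaggregate the reduced data, then apply the resulting cells to the original data) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the synthetic data, and the simple farthest-point grouping heuristic are all assumptions made for illustration, and the abstract does not say which microaggregation algorithm the paper uses. Distortion is computed here as the sum of squared errors divided by the total sum of squares, which is one common reading of "with respect to the total variance".

```python
import numpy as np


def pca_project(X, n_components):
    """Project mean-centered data onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def microaggregate_cells(Z, k):
    """Toy k-anonymous partition (hypothetical stand-in, assumes len(Z) >= k).

    Repeatedly picks the unassigned point farthest from the centroid of the
    unassigned points and groups it with its k - 1 nearest unassigned
    neighbors. The final cell absorbs the leftovers, so every cell holds
    at least k records.
    """
    n = Z.shape[0]
    labels = np.full(n, -1)
    remaining = np.arange(n)
    cell = 0
    while remaining.size >= 2 * k:
        pts = Z[remaining]
        centroid = pts.mean(axis=0)
        far = np.argmax(((pts - centroid) ** 2).sum(axis=1))
        dist = ((pts - pts[far]) ** 2).sum(axis=1)
        nearest = np.argsort(dist)[:k]  # includes the far point itself
        labels[remaining[nearest]] = cell
        remaining = np.delete(remaining, nearest)
        cell += 1
    labels[remaining] = cell  # between k and 2k - 1 records are left here
    return labels


def aggregate(X, labels):
    """Replace each record of the ORIGINAL data with its cell centroid."""
    X_anon = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        mask = labels == c
        X_anon[mask] = X[mask].mean(axis=0)
    return X_anon


def relative_distortion(X, X_anon):
    """Sum of squared errors relative to the total sum of squares."""
    sse = ((X - X_anon) ** 2).sum()
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    return sse / sst


# Synthetic correlated data (an assumption): 500 records, 10 attributes
# driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

k = 5
Z = pca_project(X, n_components=3)       # reduced data, 3 attributes
labels = microaggregate_cells(Z, k)      # cells are found in the reduced space
X_anon = aggregate(X, labels)            # cells are applied to the original data

print("smallest cell size:", np.bincount(labels).min())
print("relative distortion:", relative_distortion(X, X_anon))
```

In this sketch the distance computations inside the grouping loop run on 3 columns instead of 10, which is where the time saving would come from. Actual gains and distortion depend on the dataset, the algorithm, and the number of components retained.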