TAPER: a two-step approach for all-strong-pairs correlation query in large databases

Bibliographic Details
Published in:	IEEE Transactions on Knowledge and Data Engineering, Vol. 18, no. 4, pp. 493-508
Main Authors:	Xiong, H.; Shekhar, S.; Tan, P.-N.; Kumar, V.
Format:	Journal Article
Language:	English
Published:	New York, NY: IEEE; IEEE Computer Society; The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-04-2006
Subjects:	Algorithms | Applied sciences | Association analysis | Computation | Computational efficiency | Computer science; control theory; systems | Correlation | Correlation coefficients | Cost control | Costs | Data analysis | Data mining | Data processing. List processing. Character string processing | Distributed computing | Exact sciences and technology | Information systems. Data bases | Marketing and sales | Mathematical analysis | Mathematical models | Matrices | Memory organisation. Data processing | Pearson's correlation coefficient | Public healthcare | Query processing | Software | statistical computing | Studies | Transaction databases | Upper bound | Upper bounds | Correlation coefficient | Statistical analysis | Probabilistic approach | Database query | Algorithmics | Very large databases | Markets | Information retrieval | Information extraction | Transaction processing | Modeling | Variable section | Filter | Database | Taper

Description
Summary:	Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs whose correlation is above θ. When the numbers of items and transactions are large, however, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. To this end, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties that allow many item pairs to be pruned without even computing their upper bounds. A two-step all-strong-pairs correlation query (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with Zipf-like or linear rank-support distributions. Experimental results on synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and the uncentered Pearson's correlation coefficient.
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2006.1599388
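
Illustration:	The filter-and-refine idea in the summary can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The record does not state the bound itself, so the form used here is an assumption: for two items with supports supp(A) >= supp(B), the bound is taken as sqrt(supp(B)/supp(A)) * sqrt((1 - supp(A))/(1 - supp(B))), which depends on single-item supports only. The function names, the use of `>=` for "above the threshold", and the skipping of items with support 0 or 1 are likewise illustrative choices.

```python
from math import sqrt


def phi(supp_a, supp_b, supp_ab):
    """Pearson's correlation (phi coefficient) of two binary items, from supports."""
    denom = sqrt(supp_a * (1 - supp_a) * supp_b * (1 - supp_b))
    return (supp_ab - supp_a * supp_b) / denom


def phi_upper_bound(supp_a, supp_b):
    """Assumed upper bound of phi; needs only the two single-item supports."""
    hi, lo = max(supp_a, supp_b), min(supp_a, supp_b)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))


def strong_pairs(transactions, theta):
    """Return {(a, b): phi} for pairs with phi >= theta (filter, then refine)."""
    n = len(transactions)
    all_items = {item for t in transactions for item in t}
    supp = {i: sum(i in t for t in transactions) / n for i in all_items}
    # phi is undefined for an item that occurs in every transaction: skip it.
    items = sorted((i for i in all_items if supp[i] < 1),
                   key=lambda i: (-supp[i], i))  # support descending
    result = {}
    for x, a in enumerate(items):
        for b in items[x + 1:]:
            # Filter step: the bound only shrinks as supp[b] decreases,
            # so the first failure prunes every remaining partner of a.
            if phi_upper_bound(supp[a], supp[b]) < theta:
                break
            # Refine step: compute the exact correlation for survivors.
            supp_ab = sum(a in t and b in t for t in transactions) / n
            value = phi(supp[a], supp[b], supp_ab)
            if value >= theta:
                result[(a, b)] = value
    return result
```

Calling `strong_pairs(list_of_item_sets, 0.6)` on a list of Python sets returns the surviving pairs with their exact correlations. The early `break` is a simplified stand-in for the monotone pruning the summary mentions; only the refine step has to scan the transactions for pair supports.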