FPO tree and DP3 algorithm for distributed parallel Frequent Itemsets Mining

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 140, Article 112874
Main Authors: Huynh, Van Quoc Phuong; Küng, Josef
Format: Journal Article
Language:English
Published: New York: Elsevier Ltd, 01-02-2020
Description
Summary:
•DP3, a high-performance distributed parallel algorithm for Frequent Itemsets Mining.
•FPO tree with optimal compactness for light transfers and efficient aggregations.
•Shared-memory parallelism to exploit the multi-core CPU resources of distributed nodes.
•Memory scalability and load balance for both shared-memory and distributed parallelism.
•DP3 far outperforms well-known and recent high-performance algorithms.

Frequent Itemsets Mining is a fundamental mining model in Data Mining. It supports a vast range of application fields and can serve as a key calculation phase in many other mining models, such as Association Rules, Correlations, and Classifications. Many distributed parallel algorithms have been introduced to cope with the very large-scale datasets of Big Data. However, the problems of running time and memory scalability still lack adequate solutions for very large and hard-to-mine datasets. In this paper, we propose a distributed parallel algorithm named DP3 (Distributed PrePostPlus), which parallelizes the state-of-the-art algorithm PrePost+ and operates in a Master-Slaves model. Slave machines mine local frequent itemsets and send them, together with their support counts, to the Master for aggregation. When tremendous numbers of itemsets are transferred between the Slaves and the Master, the computational load at the Master would be extremely heavy without the support of our complete FPO tree (Frequent Patterns Organization), which provides optimal compactness for light data transfers and highly efficient aggregations with pruning ability. The processing phases of the Slaves and the Master are designed for memory scalability and for shared-memory parallelism in a Work-Pool model, so as to utilize the computational power of multi-core CPUs. We conducted experiments on both synthetic and real datasets, and the empirical results show that our algorithm far outperforms the well-known PFP and three other recent high-performance algorithms: Dist-Eclat, BigFIM, and MapFIM.
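The Master-side aggregation described in the abstract can be pictured with a small sketch: each Slave reports its locally frequent itemsets with support counts, and the Master merges them into a prefix tree in which shared prefixes are stored once and accumulated supports are pruned against the global threshold. The Python sketch below is only an illustration of that idea under simplifying assumptions; the names (PrefixNode, aggregate, global_frequent) are hypothetical and do not reproduce the paper's actual FPO tree or the DP3 protocol.

class PrefixNode:
    """One node of the aggregation prefix tree; children are keyed by item."""
    __slots__ = ("children", "support")

    def __init__(self):
        self.children = {}
        self.support = 0

def aggregate(root, itemset, support):
    """Insert one locally frequent itemset (as reported by a Slave) into the
    tree; shared prefixes are stored only once, and support counts from
    different Slaves accumulate on the same node."""
    node = root
    for item in sorted(itemset):  # canonical item order makes prefixes shareable
        node = node.children.setdefault(item, PrefixNode())
    node.support += support

def global_frequent(node, min_support, prefix=()):
    """Traverse the merged tree and emit itemsets whose aggregated support
    reaches the global threshold (the Master's pruning step)."""
    results = []
    if prefix and node.support >= min_support:
        results.append((prefix, node.support))
    for item, child in node.children.items():
        results.extend(global_frequent(child, min_support, prefix + (item,)))
    return results

# Two Slaves report overlapping local results; the Master merges and prunes.
root = PrefixNode()
for itemset, support in [(("a", "b"), 3), (("a", "b", "c"), 2)]:  # Slave 1
    aggregate(root, itemset, support)
for itemset, support in [(("a",), 5), (("a", "b"), 4)]:           # Slave 2
    aggregate(root, itemset, support)

print(global_frequent(root, min_support=5))
# -> [(('a',), 5), (('a', 'b'), 7)]

In the paper's setting the tree additionally serves compact serialization for light Slave-to-Master transfers; the point of this sketch is only that prefix sharing keeps the merged structure compact and lets the Master prune infrequent branches in a single traversal.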
ISSN: 0957-4174
EISSN: 1873-6793
DOI: 10.1016/j.eswa.2019.112874