Big Data: from collection to visualization

Bibliographic Details
Published in:	Machine learning Vol. 106; no. 6; pp. 837 - 862
Main Authors:	Ghesmoune, Mohammed; Azzag, Hanene; Benbernou, Salima; Lebbah, Mustapha; Duong, Tarn; Ouziri, Mourad
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01-06-2017; Springer Nature B.V.
Subjects:	Artificial Intelligence; Big Data; Clustering; Clusters; Computer Science; Control; Data acquisition; Data collection; Data integration; Data management; Data transmission; Mechatronics; Multisensor fusion; Natural Language Processing (NLP); Robotics; Simulation and Modeling; Visualization; Topological structure; GNG; Visualization; RDF; Semantic Data fusion; Big data; Map-Reduce; Spark; Data stream clustering; Entity resolution; Micro-Batch streaming

Description
Summary:	Organisations increasingly rely on Big Data to discover correlations and patterns that would previously have remained hidden, and to use this new information to improve the quality of their business activities. In this paper we present a ‘story’ of Big Data from initial data collection to final visualization, passing through data fusion, analysis, and clustering. To this end, we present a complete workflow covering (a) how to represent the heterogeneous collected data using the high-performance RDF language, how to fuse the Big Data in RDF by resolving the issue of entity disambiguation, and how to query those data to provide more relevant and complete knowledge; and (b) since the data arrive as data streams, batchStream, a micro-batching version of the growing neural gas approach that clusters data streams with a single pass over the data. The batchStream algorithm allows us to discover clusters of arbitrary shape without any assumption about the number of clusters. This Big Data workflow is implemented on the Spark platform, and we demonstrate it on synthetic and real data.
ISSN:	0885-6125; 1573-0565
DOI:	10.1007/s10994-016-5622-4
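
The summary describes clustering a data stream in micro-batches with a growing neural gas (GNG), using a single pass over the data. The sketch below illustrates that general idea only. It is not the paper's batchStream algorithm and does not use Spark. It is a simplified, textbook-style GNG in plain Python in which each point is seen once, the model persists across micro-batches, and one node is inserted after each micro-batch. The class and function names, the parameter values, the insertion schedule, and the choice never to delete isolated nodes are all assumptions made for illustration.

```python
import math
import random


class GrowingNeuralGas:
    """Simplified GNG (illustrative; isolated nodes are never deleted)."""

    def __init__(self, first, second, eps_b=0.2, eps_n=0.006,
                 max_age=50, alpha=0.5, decay=0.995):
        self.nodes = [list(first), list(second)]  # prototype vectors
        self.error = [0.0, 0.0]                   # accumulated error per node
        self.edges = {}                           # (i, j) with i < j -> age
        self.eps_b, self.eps_n = eps_b, eps_n
        self.max_age, self.alpha, self.decay = max_age, alpha, decay

    def _neighbours(self, i):
        return [b if a == i else a for (a, b) in self.edges if i in (a, b)]

    def _move(self, i, x, rate):
        self.nodes[i] = [w + rate * (v - w) for w, v in zip(self.nodes[i], x)]

    def _update(self, x):
        # Rank nodes by distance to x and keep the two nearest.
        order = sorted(range(len(self.nodes)),
                       key=lambda k: math.dist(self.nodes[k], x))
        s1, s2 = order[0], order[1]
        self.error[s1] += math.dist(self.nodes[s1], x) ** 2
        # Age the winner's edges, then pull winner and neighbours toward x.
        for e in self.edges:
            if s1 in e:
                self.edges[e] += 1
        self._move(s1, x, self.eps_b)
        for n in self._neighbours(s1):
            self._move(n, x, self.eps_n)
        # Connect (or refresh) the two nearest nodes and drop stale edges.
        self.edges[(min(s1, s2), max(s1, s2))] = 0
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        self.error = [err * self.decay for err in self.error]

    def _grow(self):
        # Insert a node halfway between the highest-error node q and its
        # highest-error neighbour f.
        q = max(range(len(self.nodes)), key=lambda k: self.error[k])
        nbrs = self._neighbours(q)
        if not nbrs:
            return
        f = max(nbrs, key=lambda k: self.error[k])
        r = len(self.nodes)
        self.nodes.append([(a + b) / 2
                           for a, b in zip(self.nodes[q], self.nodes[f])])
        self.error[q] *= self.alpha
        self.error[f] *= self.alpha
        self.error.append(self.error[q])
        del self.edges[(min(q, f), max(q, f))]
        self.edges[(q, r)] = 0  # r is the largest index, so keys stay ordered
        self.edges[(f, r)] = 0

    def fit_batch(self, batch):
        for x in batch:
            self._update(x)
        self._grow()  # assumption: one insertion per micro-batch


def micro_batches(stream, size):
    """Group an iterable into consecutive lists of at most `size` items."""
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def cluster_stream(stream, batch_size):
    """Single pass: the first two points seed the model, the rest update it."""
    stream = iter(stream)
    model = GrowingNeuralGas(next(stream), next(stream))
    for batch in micro_batches(stream, batch_size):
        model.fit_batch(batch)
    return model
```

Calling `cluster_stream(points, batch_size=50)` on any iterable of equal-length numeric vectors returns a model whose `nodes` are the learned prototypes and whose `edges` give the topology. Connected components of that graph can be read as clusters, which is how a GNG can follow arbitrary shapes without fixing the number of clusters in advance.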