Efficient density and cluster based incremental outlier detection in data streams

Bibliographic Details
Published in: Information Sciences, Vol. 607, pp. 901-920
Main Authors: Degirmenci, Ali, Karal, Omer
Format: Journal Article
Language: English
Published: Elsevier Inc 01-08-2022
Description
Summary:
• A new incremental clustering and density-based outlier detection method is proposed that performs clustering and outlier detection simultaneously.
• To the best of our knowledge, this is the first study to combine the concepts of incremental DBSCAN (iDBSCAN) and iLOF to detect outliers in streaming data.
• To minimize the negative effects of parameter selection, iLDCBOF automatically adjusts its own hyperparameters for different real-time applications.
• To detect outliers in data streams and prevent them from being clustered, a newly developed core kNN (CkNN) concept is introduced.
• The incremental Mahalanobis metric is used in all distance computations to reduce the impact of the data dimensions in both iLOF and iDBSCAN.

In this paper, a novel, parameter-free, incremental local density and cluster-based outlier factor (iLDCBOF) method is presented that unifies incremental versions of the local outlier factor (LOF) and density-based spatial clustering of applications with noise (DBSCAN) to detect outliers efficiently in data streams. iLDCBOF has several advantages over previously reported iLOF-based studies: (1) it is based on a newly developed core k-nearest neighbor (CkNN) concept that detects outliers in data streams reliably and scalably and prevents outliers from being clustered; (2) it uses a newly developed algorithm that automatically adjusts the parameter k (the number of neighbors) for different real-time applications; and (3) it uses the Mahalanobis distance metric, so its performance is not affected even for large amounts of data. The iLDCBOF method is well suited to different data stream applications because it requires no distribution assumptions, it is parameter-free (its parameters are determined automatically), and it is easy to implement. ROC-AUC and statistical test results from extensive experiments on 16 real-world datasets showed that the iLDCBOF method significantly outperformed the benchmark methods.
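The summary mentions an incremental Mahalanobis metric but this record does not say how it is computed. As a rough illustration only, the sketch below keeps a running mean and covariance with a Welford-style update and uses them to compute a Mahalanobis distance. The names `RunningCovariance`, `solve`, and `mahalanobis` are hypothetical, and the update rule is an assumption, not the authors' algorithm.

```python
import math


class RunningCovariance:
    """Welford-style running mean and covariance for a stream of vectors.

    Assumption: this stands in for whatever incremental update the paper uses.
    """

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        # Running sum of deviation outer products; covariance = m2 / (n - 1).
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.n += 1
        before = [xi - mi for xi, mi in zip(x, self.mean)]  # x - old mean
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, before)]
        after = [xi - mi for xi, mi in zip(x, self.mean)]   # x - new mean
        for i, di in enumerate(before):
            for j, dj in enumerate(after):
                self.m2[i][j] += di * dj

    def covariance(self):
        if self.n < 2:
            raise ValueError("need at least two points")
        return [[v / (self.n - 1) for v in row] for row in self.m2]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis(x, y, cov):
    """Return sqrt((x - y)^T cov^{-1} (x - y)) without forming the inverse."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


# Example: feed a small stream, then measure a distance under the running covariance.
stats = RunningCovariance(dim=2)
for point in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    stats.update(point)
print(mahalanobis((0.0, 0.0), (2.0, 2.0), stats.covariance()))
```

Because the covariance rescales each direction by its spread, the distances fed into a neighbor search depend less on the units of individual features. Solving a linear system at each call keeps the sketch short. A streaming implementation would more likely maintain the inverse covariance incrementally.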
ISSN: 0020-0255, 1872-6291
DOI: 10.1016/j.ins.2022.06.013