Benchmarking Feature Extraction Techniques for Textual Data Stream Classification
Published in: | 2023 International Joint Conference on Neural Networks (IJCNN), pp. 1-8 |
---|---|
Main Authors: | , , , , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 18-06-2023 |
Summary: | Feature extraction concerns transforming unstructured or semi-structured data into structured data that can be used as input for classification and sentiment analysis algorithms, among other applications. This task becomes even more challenging and relevant when textual data becomes available over time as a continuous data stream, since the lexicon and semantics can be ever-evolving. Data streams are, by definition, potentially infinite sequences of data whose characteristics may be ephemeral; when the behavior of the data changes, a phenomenon named concept drift arises. Textual data streams are specialized data streams in which texts arrive over time from a continual source, such as social media, raising challenges for which feature extractors are of great help. In this paper, we benchmark different feature extraction algorithms, namely Hashing Trick, Word2Vec, BERT, and Incremental Word-Vectors, for textual data stream classification, considering different stream lengths. The evaluation was performed on a binary and a multiclass classification task over two different datasets. Results show that pre-trained models, such as BERT, achieve promising results, while Hashing Trick also performs competitively. We also observe that incremental methods such as Word2Vec and Incremental Word-Vectors are the best prepared for changing scenarios, yet they are much more computationally intensive than the former methods when applied to larger streams. |
ISSN: | 2161-4407 |
DOI: | 10.1109/IJCNN54540.2023.10191369 |
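
The abstract highlights the Hashing Trick as a competitive, low-cost extractor for streams, since it needs no stored vocabulary. As a rough illustration (not the paper's implementation), the sketch below pairs scikit-learn's stateless HashingVectorizer with an incremental SGDClassifier in a test-then-train loop over mini-batches; the example batches and the n_features value are hypothetical stand-ins for an arriving textual stream.

```python
# Illustrative sketch (not the paper's code): hashing-trick features feeding
# an incremental classifier on a simulated textual data stream.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless: tokens are mapped to a fixed number of
# feature indices via a hash function, so no vocabulary must be stored or
# updated as the stream (and its lexicon) evolves.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# SGDClassifier supports partial_fit, i.e., learning one mini-batch at a time.
classifier = SGDClassifier(loss="log_loss")
classes = [0, 1]  # binary task, mirroring one of the benchmarked settings

# Hypothetical mini-batches standing in for chunks of an arriving stream.
stream = [
    (["great movie, loved it", "terrible plot"], [1, 0]),
    (["what a masterpiece", "waste of time"], [1, 0]),
]

for texts, labels in stream:
    X = vectorizer.transform(texts)  # test-then-train: evaluate first...
    if hasattr(classifier, "coef_"):  # skip scoring before the first update
        print("batch accuracy:", classifier.score(X, labels))
    classifier.partial_fit(X, labels, classes=classes)  # ...then learn
```

Because the vectorizer never grows a vocabulary, memory stays constant over arbitrarily long streams; the trade-off is possible hash collisions, which a larger n_features mitigates.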