A semantic-based model with a hybrid feature engineering process for accurate spam detection

Bibliographic Details
Published in: Journal of Electrical Systems and Information Technology Vol. 11; no. 1; pp. 26 - 16
Main Authors: Mohammed, Chira N., Ahmed, Ayah M.
Format: Journal Article
Language: English
Published: Berlin/Heidelberg Springer Berlin Heidelberg 01-12-2024
Springer Nature B.V
SpringerOpen
Description
Summary: Detecting spam emails is essential to maintaining the security and integrity of email communication. Existing research has made significant progress in developing effective spam detection models, but challenges remain in improving classification performance and adaptability to evolving spamming techniques. In this study, we propose a novel spam detection model with a comprehensive feature engineering approach that combines term frequency-inverse document frequency (TF-IDF) vectorizer features with word embedding features to optimize the feature space. Our contribution lies in integrating semantic-based word embeddings, which leverage pre-existing knowledge to capture the semantic meaning of words and enhance the representation of email texts. To identify the most suitable word embedding technique for our model, we evaluated GloVe, Word2Vec, and FastText. GloVe was selected for its better performance, which is attributed to its pre-training on a large and diverse text corpus. The model was also evaluated without word embeddings; that variant was less effective than our word embedding-based model. Additionally, we used a support vector machine as the classifier and applied hyperparameter tuning to identify the model's most effective parameter values. The proposed model was tested on two datasets. The experimental results showed that our model outperformed the other models discussed in the literature, achieving an accuracy of 99.5% on the SpamAssassin dataset and 99.28% on the Enron-Spam dataset.
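The hybrid feature space described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how word vectors are pooled or how the two feature sets are joined, so mean pooling and simple concatenation are assumptions here. The tiny `TOY_EMBEDDINGS` table is a hypothetical stand-in for pre-trained GloVe vectors, the IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, and all function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional vectors standing in for pre-trained GloVe
# embeddings; real GloVe vectors have 50 to 300 dimensions.
TOY_EMBEDDINGS = {
    "free": [0.9, 0.1],
    "winner": [0.8, 0.2],
    "prize": [0.85, 0.15],
    "meeting": [0.1, 0.9],
    "report": [0.2, 0.8],
}
EMBEDDING_DIM = 2


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(documents):
    """Return the sorted vocabulary and a smoothed IDF weight per term."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf):
    """L2-normalised TF-IDF vector over the fitted vocabulary."""
    counts = Counter(tokenize(text))
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


def mean_embedding(text):
    """Average the embeddings of known tokens (assumed pooling strategy)."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokenize(text) if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hybrid_features(text, vocab, idf):
    """Concatenate lexical (TF-IDF) and semantic (embedding) features."""
    return tfidf_vector(text, vocab, idf) + mean_embedding(text)


emails = ["free prize winner", "meeting report attached", "free meeting"]
vocab, idf = fit_idf(emails)
features = [hybrid_features(email, vocab, idf) for email in emails]
```

Each row of `features` holds one TF-IDF weight per vocabulary term followed by the pooled embedding. In a setup like the one the summary describes, such rows would be passed to a support vector machine whose hyperparameters are tuned on held-out data; that stage is omitted here.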
ISSN: 2314-7172
DOI: 10.1186/s43067-024-00151-3