A semantic-based model with a hybrid feature engineering process for accurate spam detection
Published in: | Journal of Electrical Systems and Information Technology Vol. 11; no. 1; pp. 26 - 16 |
---|---|
Main Authors: | , |
Format: | Journal Article |
Language: | English |
Published: | Berlin/Heidelberg: Springer Berlin Heidelberg, 01-12-2024; Springer Nature B.V; SpringerOpen |
Subjects: | |
Summary: | Detecting spam emails is essential to maintaining the security and integrity of email communication. Existing research has made significant progress in developing effective spam detection models, but challenges remain in improving classification performance and adaptability to evolving spamming techniques. In this study, we propose a novel spam detection model with a comprehensive feature engineering approach that combines term frequency-inverse document frequency (TF-IDF) features with word embedding features to optimize the feature space. Our contribution lies in integrating semantic-based word embeddings, which leverage pre-existing knowledge to capture the semantic meaning of words and enhance the representation of email texts. To identify the most suitable word embedding technique for our model, we evaluated GloVe, Word2Vec, and FastText. GloVe was selected for its better performance, which is attributed to its pre-training on a large and diverse text corpus. The model was also evaluated without word embeddings, and that variant was less effective than the word embedding-based model. We used a support vector machine as the classifier and applied hyperparameter tuning to identify the model's most effective parameter values. The proposed model was tested on two datasets. The experimental results showed that it outperformed the other models discussed in the literature, achieving an accuracy of 99.5% on the SpamAssassin dataset and 99.28% on the Enron-Spam dataset. |
---|---|
ISSN: | 2314-7172 |
DOI: | 10.1186/s43067-024-00151-3 |
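
The hybrid feature idea described in the summary (TF-IDF features combined with word embedding features) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the two feature sets are combined by simple concatenation, that an email's embedding is the mean of its word vectors, and that IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1. The `TOY_EMBEDDINGS` table and all function names are hypothetical; real GloVe vectors would be loaded from a pre-trained file.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional vectors standing in for pre-trained GloVe
# embeddings; the values are made up for illustration only.
TOY_EMBEDDINGS = {
    "free": [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.9, 0.3],
    "tomorrow": [0.1, 0.8, 0.4],
}
EMBED_DIM = 3


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf_vector(text, idf, vocab):
    """Raw term count times IDF for every vocabulary term, in vocab order."""
    counts = Counter(tokenize(text))
    return [counts[term] * idf[term] for term in vocab]


def mean_embedding(text):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [TOY_EMBEDDINGS[t] for t in tokenize(text) if t in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * EMBED_DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def hybrid_features(text, idf, vocab):
    """Concatenate the TF-IDF vector with the mean word embedding."""
    return tfidf_vector(text, idf, vocab) + mean_embedding(text)


emails = ["free money free", "meeting tomorrow", "free meeting"]
idf = fit_idf(emails)
vocab = sorted(idf)
features = [hybrid_features(email, idf, vocab) for email in emails]
```

Each row of `features` has one entry per vocabulary term followed by `EMBED_DIM` embedding entries. In a fuller pipeline these rows could be passed to a support vector machine classifier with a hyperparameter search, for example scikit-learn's `SVC` wrapped in `GridSearchCV`; the summary does not say which library or search strategy the authors used.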