Feature Engineering and Machine Learning Model Comparison for Malicious Activity Detection in the DNS-Over-HTTPS Protocol

Bibliographic Details
Published in:	IEEE Access, Vol. 9, pp. 129902-129916
Main Authors:	Behnke, Matthew; Briner, Nathan; Cullen, Drake; Schwerdtfeger, Katelynn; Warren, Jackson; Basnet, Ram; Doleck, Tenzin
Format:	Journal Article
Language:	English
Published:	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021
Subjects:	Browsers; Chi-square test; Chi-squared; Classifiers; Decision tree; DNS; DoH; Domain names; Feature extraction; Hypertext; IP networks; LGBM; Machine learning; Pearson correlation; Privacy; Protocols; Random forest; Security; Sequential forward selection; Servers; Traffic models; XGBM

Description
Summary:	The Domain Name System (DNS) is among the most ubiquitous and important protocols for network communication; however, security concerns regarding DNS have been on the rise and demand for encrypted traffic has followed suit. Using a publicly available dataset, this work compares 10 different machine learning classifiers using stratified 10-fold cross-validation. The classifiers are used to determine the most effective and efficient way of detecting malicious DNS over Hypertext Transfer Protocol Secure (HTTPS) traffic, dubbed DoH traffic. Model performance is evaluated on Non-DoH vs. DoH traffic, then tested on benign vs. malicious DoH traffic. Additionally, this paper seeks to build upon existing research by removing noise and introducing feature selection methods and feature explainability to produce a better model for real-world deployment. After eliminating five overfitting features, our findings indicate that light gradient boosting machine (LGBM) yielded the highest accuracy to training time ratio while approaching 0% error using 20 top features.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2021.3113294