CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different cybersecurity fields, including topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human e...

Full description

Saved in:
Bibliographic Details
Published in:2024 IEEE International Conference on Cyber Security and Resilience (CSR) pp. 296 - 302
Main Authors: Tihanyi, Norbert, Ferrag, Mohamed Amine, Jain, Ridhi, Bisztray, Tamas, Debbah, Merouane
Format: Conference Proceeding
Language:English
Published: IEEE 02-09-2024
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different cybersecurity fields, including topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. The research community needs a diverse, accurate, and up-to-date dataset to test the general knowledge of LLMs in cybersecurity. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions, respectively. By utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance and to filter out any questions unrelated to cybersecurity. We have evaluated and compared 25 state-of-the-art LLM models on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2 or Gemma-7b. The CyberMetric dataset is publicly available for the research community and can be downloaded from the projects' website: https://github.com/CyberMetric.
DOI:10.1109/CSR61664.2024.10679494