CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
Published in: 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 296-302
Main Authors: Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain (Technology Innovation Institute (TII), Abu Dhabi, United Arab Emirates); Tamas Bisztray (University of Oslo, Oslo, Norway); Merouane Debbah (Khalifa University, Abu Dhabi, United Arab Emirates)
Format: Conference Proceeding
Language: English
Published: IEEE, 02-09-2024
DOI: 10.1109/CSR61664.2024.10679494
EISBN: 9798350375367
Subjects: Accuracy; Benchmark testing; Computer security; NIST Standards; Problem-solving; Reverse engineering; Risk management
Online Access: Full text at https://ieeexplore.ieee.org/document/10679494; open-access preprint at https://arxiv.org/pdf/2402.07688
Abstract: Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different cybersecurity fields, including topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. The research community needs a diverse, accurate, and up-to-date dataset to test the general cybersecurity knowledge of LLMs. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2,000, and 10,000 questions, respectively. Using GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents in the cybersecurity domain, including NIST standards, research papers, publicly accessible books, RFCs, and other publications, and generated questions from them, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance and to filter out any questions unrelated to cybersecurity. We evaluated and compared 25 state-of-the-art LLMs on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we had 30 human participants solve CyberMetric-80 in a closed-book scenario; the results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2, or Gemma-7b. The CyberMetric dataset is publicly available for the research community and can be downloaded from the project's website: https://github.com/CyberMetric.
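The abstract describes a closed-book multiple-choice evaluation, so a minimal scoring sketch may help illustrate the setup. The JSON layout below (a `questions` list with `answers` keyed A-D and a `solution` letter), the filename, and the `ask_model` callable are all assumptions for illustration, not the paper's actual tooling; consult the project's repository for the real schema.

```python
import json

def load_questions(path):
    """Load a CyberMetric-style JSON file.

    Assumed (hypothetical) layout:
    {"questions": [{"question": "...",
                    "answers": {"A": "...", "B": "...", "C": "...", "D": "..."},
                    "solution": "B"}, ...]}
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)["questions"]

def format_prompt(item):
    """Render one question as a closed-book multiple-choice prompt."""
    options = "\n".join(f"{k}) {v}" for k, v in sorted(item["answers"].items()))
    return (f"{item['question']}\n{options}\n"
            "Answer with a single letter (A, B, C, or D).")

def score(questions, ask_model):
    """Return the accuracy of `ask_model`, a callable: prompt -> reply string."""
    correct = 0
    for item in questions:
        reply = ask_model(format_prompt(item)).strip().upper()
        # Treat the first A-D letter in the reply as the chosen option.
        choice = next((ch for ch in reply if ch in "ABCD"), None)
        correct += (choice == item["solution"].strip().upper())
    return correct / len(questions)

if __name__ == "__main__":
    qs = load_questions("CyberMetric-80.json")  # hypothetical filename
    # Stand-in "model" that always answers A; replace with a real LLM call.
    print(f"Accuracy: {score(qs, lambda prompt: 'A'):.1%}")
```

Swapping the placeholder lambda for a real model client reproduces the closed-book setting described in the abstract, where the model sees only the question and its four options.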