CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Bibliographic Details
Published in: 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 296-302
Main Authors: Tihanyi, Norbert, Ferrag, Mohamed Amine, Jain, Ridhi, Bisztray, Tamas, Debbah, Merouane
Format: Conference Proceeding
Language: English
Published: IEEE 02-09-2024
Subjects:
Online Access: Get full text
Abstract Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different cybersecurity fields, including topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. The research community needs a diverse, accurate, and up-to-date dataset to test the general knowledge of LLMs in cybersecurity. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions, respectively. Using GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents in the cybersecurity domain, including NIST standards, research papers, publicly accessible books, RFCs, and other publications, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We evaluated and compared 25 state-of-the-art LLMs on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2, or Gemma-7b. The CyberMetric dataset is publicly available for the research community and can be downloaded from the project's website: https://github.com/CyberMetric.
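The abstract describes a standard multiple-choice evaluation: each question has four options and one correct letter, and model accuracy is the fraction answered correctly. The Python sketch below illustrates such an evaluation loop. Note that the JSON schema (a top-level "questions" list whose entries carry an "answers" map keyed A-D and a "solution" letter), the file name, and the ask_model stub are illustrative assumptions, not the project's documented interface; consult the repository for the authoritative format.

import json
from typing import Callable

# Minimal sketch of a CyberMetric-style multiple-choice evaluation loop.
# The schema used here is an assumption for illustration only.

def load_questions(path: str) -> list[dict]:
    """Load a multiple-choice dataset assumed to hold a top-level "questions" list."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["questions"]

def evaluate(questions: list[dict],
             answer_fn: Callable[[str, dict], str]) -> float:
    """Return the accuracy of answer_fn, which maps (question, options) to a letter."""
    correct = 0
    for q in questions:
        prediction = answer_fn(q["question"], q["answers"])
        if prediction.strip().upper() == q["solution"].strip().upper():
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    # ask_model is a hypothetical stand-in for a call to any LLM API.
    def ask_model(question: str, options: dict) -> str:
        prompt = question + "\n" + "\n".join(f"{k}) {v}" for k, v in options.items())
        _ = prompt  # in practice: send the prompt to an LLM and parse the returned letter
        return "A"  # placeholder answer

    qs = load_questions("CyberMetric-80.json")  # hypothetical file name
    print(f"Accuracy: {evaluate(qs, ask_model):.1%}")

In practice ask_model would wrap a real model API call and parse the chosen option from the response; the paper's human comparison corresponds to answering the same 80 items closed-book.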
Author Bisztray, Tamas
Jain, Ridhi
Ferrag, Mohamed Amine
Debbah, Merouane
Tihanyi, Norbert
Author_xml – sequence: 1
  givenname: Norbert
  surname: Tihanyi
  fullname: Tihanyi, Norbert
  organization: Technology Innovation Institute (TII), Abu Dhabi, United Arab Emirates
– sequence: 2
  givenname: Mohamed Amine
  surname: Ferrag
  fullname: Ferrag, Mohamed Amine
  organization: Technology Innovation Institute (TII), Abu Dhabi, United Arab Emirates
– sequence: 3
  givenname: Ridhi
  surname: Jain
  fullname: Jain, Ridhi
  organization: Technology Innovation Institute (TII), Abu Dhabi, United Arab Emirates
– sequence: 4
  givenname: Tamas
  surname: Bisztray
  fullname: Bisztray, Tamas
  organization: University of Oslo, Oslo, Norway
– sequence: 5
  givenname: Merouane
  surname: Debbah
  fullname: Debbah, Merouane
  organization: Khalifa University, Abu Dhabi, United Arab Emirates
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/CSR61664.2024.10679494
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library Online
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library Online
  url: http://ieeexplore.ieee.org/Xplore/DynWel.jsp
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350375367
EndPage 302
ExternalDocumentID 10679494
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i1054-8588cd40d8e2ebe6369bba31ea88958ec62ede417a2aa6da4c22df4964b433843
IEDL.DBID RIE
IngestDate Wed Oct 02 05:56:43 EDT 2024
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i1054-8588cd40d8e2ebe6369bba31ea88958ec62ede417a2aa6da4c22df4964b433843
OpenAccessLink https://arxiv.org/pdf/2402.07688
PageCount 7
ParticipantIDs ieee_primary_10679494
PublicationCentury 2000
PublicationDate 2024-Sept.-2
PublicationDateYYYYMMDD 2024-09-02
PublicationDate_xml – month: 09
  year: 2024
  text: 2024-Sept.-2
  day: 02
PublicationDecade 2020
PublicationTitle 2024 IEEE International Conference on Cyber Security and Resilience (CSR)
PublicationTitleAbbrev CSR
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.9341061
SourceID ieee
SourceType Publisher
StartPage 296
SubjectTerms Accuracy
Benchmark testing
Computer security
NIST Standards
Problem-solving
Reverse engineering
Risk management
Title CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
URI https://ieeexplore.ieee.org/document/10679494
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE