WCCI 2018 COMPETITION ON OPEN SOURCE INTELLIGENCE DISCOVERY FOR CYBERSECURITY THREAT AWARENESS

Introduction

This competition is organized within the scope of the European Union H2020 project DiSIEM – Diversity Enhancements for Security Information and Event Management.

About the DiSIEM project

The DiSIEM project objective is to improve Security Information and Event Management (SIEM) systems’ capabilities using diversity-related technology. The project’s expected contributions are to:

  1. Integrate diverse Open Source INTelligence (OSINT) to identify relationships, trends and anomalies, helping organizations react to new vulnerabilities affecting an IT infrastructure, or even predict emerging threats against the infrastructure monitored by the SIEM, thus improving their cybersecurity threat awareness capabilities.
  2. Develop novel probabilistic security models and risk-based metrics to help security analysts decide which infrastructure configurations offer better security guarantees and increase the capacity of Security Operation Centers (SOCs) to communicate the status of the organization to C-level managers.
  3. Design novel visualization methods to present the diverse live and archival data sets, to better support the decision-making process by enabling the extraction of high-level security insight from the data which will be used by the security analysts working with SOCs that operate the SIEM.
  4. Integrate diverse, redundant and enhanced monitoring capabilities to the SIEM ecosystem.
  5. Add support for long term archival of events in public cloud storage services.

Cybersecurity Threat Awareness

The main goal of cybersecurity threat awareness tools is to provide security analysts with timely information about security threats to the IT infrastructures under their responsibility. This translates into two important objectives:

  • Maximize the amount of relevant information presented to the analyst;
  • Minimize the amount of irrelevant information presented to the analyst.

For this purpose, OSINT is collected about the security of a specific IT infrastructure, and each piece of information is then classified as relevant or not to the security of that infrastructure.

As an example, in DiSIEM we are collecting tweets concerning the security of various case-study IT infrastructures. These tweets have to be classified as relevant or not to fulfill the two objectives above. A tweet is relevant if it mentions a threat to an element of an IT infrastructure (e.g., a vulnerability or an exploit) or a security measure to protect that element (e.g., an update or a software patch).

Why Twitter?

Although there are many sources of OSINT, including security-related ones, Twitter was chosen for two main reasons. First, Twitter is well recognized as an important hub for short, almost real-time notices about events on many subjects. These include cybersecurity-related events, as demonstrated by the highly active accounts of security feeds and researchers who tweet security-related news. Second, since a tweet is limited to 280 characters (typically 40–60 words), these messages are simple to process automatically.

More importantly, our work also demonstrates that it is possible to obtain valuable security-related information from Twitter before it becomes available on established databases as confirmed threats (e.g., the National Vulnerability Database, ExploitDB). As an example, consider some results related to the various NSA tools leaked in August 2016. From those, two threats named EGREGIOUSBLUNDER and ESCALATEPLOWMAN were tweeted 9 (see publication) and 6 (see publication) days ahead of official confirmation on NVD.

Problem Statement

A key part of any software component designed to provide end-to-end OSINT-based cybersecurity threat awareness is a binary classifier that takes as input one piece of information, e.g., a tweet, and assigns it to one of two classes: relevant (1) or irrelevant (-1).

This competition consists of using previously labeled tweet data sets concerning three case studies to design binary classification models for those case studies. Participants will therefore develop models that take tweets as input and produce the corresponding classification for each tweet: -1 (irrelevant) or 1 (relevant).
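
To make the required interface concrete, here is a deliberately naive baseline sketch: a keyword-matching classifier whose keyword list is entirely hypothetical (not part of the competition materials). It only illustrates the expected input/output contract, tweet text in, label 1 or -1 out; a competitive entry would use a trained model instead.

```python
# Hypothetical keyword list -- for interface illustration only, not a real baseline.
RELEVANT_KEYWORDS = {"vulnerability", "exploit", "cve", "patch", "update", "0day"}

def classify(tweet: str) -> int:
    """Return 1 (relevant) if the tweet contains a security keyword, else -1 (irrelevant)."""
    words = set(tweet.lower().split())
    return 1 if words & RELEVANT_KEYWORDS else -1
```

For example, `classify("Critical vulnerability found in OpenSSL")` returns 1, while a tweet with none of the listed words returns -1.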

Datasets

For both the classifier design stage (before the deadline for submission of results) and the classifier evaluation stage (after the deadline), we provide three data sets, each corresponding to a different case study (case studies A, B, and C) and its IT infrastructure. The sets consist of tweets that have been manually labeled as relevant or not to the security of the IT infrastructures represented in the case studies.

For the design stage we call these sets Design Set A, B and C (DSA, DSB and DSC). The tweets in DSA, DSB and DSC have been collected from a set of Twitter accounts designated Account Set 1 (AS1).

In the evaluation stage the classifiers proposed by the competition participants will be tested using Evaluation Set A, B and C (ESA, ESB and ESC). The tweets in ESA, ESB and ESC have been collected from a set of Twitter accounts designated Account Set 2 (AS2).

All tweets in ESA, ESB and ESC were posted on Twitter after the tweets in DSA, DSB and DSC. AS2 includes all accounts from AS1 and extends it with additional accounts, for a total of 223 tweeters. Hence, the evaluation procedure tests the classifiers' generalization performance not only on future unseen data, but also on data from additional Twitter accounts.

The data sets will be released as line-based text files, with each line containing a hyperlink, an integer that uniquely identifies one tweeter, and one label. The hyperlink references the tweet, the integer identifies the account that posted the tweet, and the label provides its class. A simple program written in Java or Python will be released to output files with the actual tweet text, the corresponding labels and the integer referencing the tweeter account.
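
A minimal sketch of parsing one such line, assuming whitespace-separated fields (the actual separator will be fixed by the released files and the accompanying program):

```python
def parse_dataset_line(line: str):
    """Split one data set line into (hyperlink, tweeter_id, label).

    Assumes whitespace-separated fields: hyperlink, account integer, class label.
    """
    hyperlink, tweeter, label = line.split()
    return hyperlink, int(tweeter), int(label)
```

For instance, `parse_dataset_line("https://twitter.com/u/status/123 7 -1")` yields the hyperlink string, account id 7 and label -1.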

As our researchers are continuously labeling more tweets, the data sets will be released upon acceptance of the competition. Currently the data sets altogether include several thousand tweets.

Evaluation Metrics

The classifiers will be evaluated by metrics that reflect the two objectives stated above:

  • Maximize the amount of relevant information presented to the analyst. This means presenting the highest possible fraction of tweets that were correctly classified as relevant, which corresponds to maximizing the True Positive Rate (TPR) or sensitivity;
  • Minimize the amount of irrelevant information presented to the analyst. This means presenting the smallest possible fraction of tweets that were wrongly classified as relevant, which corresponds to maximizing the True Negative Rate (TNR) or specificity.

Denoting the number of relevant tweets correctly classified as relevant (True Positives) by \(TP\), the number of relevant tweets incorrectly classified as irrelevant (False Negatives) by \(FN\), the number of irrelevant tweets correctly classified as irrelevant (True Negatives) by \(TN\), and the number of irrelevant tweets incorrectly classified as relevant (False Positives) by \(FP\), the metrics are given by:

\(TPR = \frac{TP}{TP+FN}\)
\(TNR = \frac{TN}{TN+FP}\)
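
The two metrics above can be computed directly from the counts, as in this short sketch (labels assumed to take the values 1 and -1, as in this competition):

```python
def tpr_tnr(y_true, y_pred):
    """Compute (TPR, TNR) from parallel lists of true and predicted labels in {1, -1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))    # relevant, classified relevant
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))   # relevant, classified irrelevant
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))  # irrelevant, classified irrelevant
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))   # irrelevant, classified relevant
    return tp / (tp + fn), tn / (tn + fp)
```

For example, with true labels [1, 1, -1, -1] and predictions [1, -1, -1, 1], the function returns (0.5, 0.5).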

Ranking the results

Using the evaluation data sets, participants will be ranked in each case study according to the Euclidean distance of their classifier's \(\left(TPR, TNR\right)\) pair to the ideal performance pair, \(\left(1.0, 1.0\right)\), from smallest to largest. Then, points from 1 to the number of participants (\(n\)) are awarded according to the ranking. Finally, the participant with the smallest total number of points across all case studies wins the competition.
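
The per-case-study ranking can be sketched as follows; the team names and (TPR, TNR) values are purely hypothetical:

```python
import math

def distance_to_ideal(tpr, tnr):
    """Euclidean distance from a (TPR, TNR) pair to the ideal pair (1.0, 1.0)."""
    return math.hypot(1.0 - tpr, 1.0 - tnr)

# Hypothetical results for one case study: smaller distance ranks first.
results = {"team1": (0.90, 0.85), "team2": (0.70, 0.95), "team3": (0.60, 0.60)}
ranking = sorted(results, key=lambda team: distance_to_ideal(*results[team]))
```

With these example numbers, team1 (distance ≈ 0.18) ranks ahead of team2 (≈ 0.30) and team3 (≈ 0.57), so it would receive 1 point for this case study.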

Competition Rules

The participants must comply with the following rules:

  1. Participants will use only the data sets provided (DSA, DSB and DSC) to train and design their classifier(s). Two options are available:
    • train a single classifier for the three case studies, therefore using the data sets altogether;
    • train one classifier per case study, therefore using the data sets individually. In this case evaluation metrics will be computed considering the aggregated results.
  2. The integer number that uniquely identifies the Twitter account that posted a tweet cannot be used as classifier input.
  3. Participants must use only freely available tools/frameworks to train and design their classification models.
  4. All the results have to be reproducible by using solely code provided by the participants and the data sets provided by the competition.

Once the results are published on this web page, the participants will receive the evaluation data sets ESA, ESB and ESC, allowing them to verify the published results. At the same time, the competition organizers will verify that the participants' submissions comply with the competition rules. Only after these verifications will the results be considered final.

Submissions

The following guidelines should be followed when submitting the results:

  1. As soon as participants decide to enter the competition, we kindly ask that they send an email to the organizers stating their willingness to participate;
  2. Before the participation archive submission deadline participants will deliver an archive (e.g., a ZIP file) with the following elements:
    • A short report naming the team and the authors, a contact person and corresponding email address, a description of the software tools/platforms employed to train and design the classifier(s), a description of the methodologies employed, and instructions on how to reproduce the design of the classifier model(s) submitted;
    • A computer program (e.g., in Java, Python, R, …) for each classifier, able to take an input text file with one tweet per line and produce an output text file with the corresponding classification of each tweet. Classifier parameters may be stored in files accompanying the programs;
    • A source code package that takes the design data sets as input and reproduces the design and training of the classifier(s) submitted. Instructions on how to execute the code will be given in the report mentioned above. Of course, authorship of the code, procedures and methodology remains with the participating authors.

Important dates

The provisional deadline for the submission of the participation archive is 8 May 2018. This date may be changed if requested by the WCCI 2018 organizing committee.

The preliminary results will be published by 15 May 2018.

The validation of the results and of the submissions will be concluded before 8 June 2018.

Organizers

Pedro M. Ferreira, Alysson Bessani, Fernando Alves
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal

Email addresses:
PMF: pmf (at) ciencias (dot) ulisboa (dot) pt
AB: anbessani (at) ciencias (dot) ulisboa (dot) pt
FA: falves (at) lasige (dot) di (dot) fc (dot) ul (dot) pt