• Corpus ID: 238634360

Datasets are not Enough: Challenges in Labeling Network Traffic

Jorge Guerra, Carlos Adrián Catania, Eduardo Veas
In contrast to previous surveys, the present work does not focus on reviewing the datasets used in the network security field. Many of the available public labeled datasets represent network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network…

Active learning approach to label network traffic datasets
A novel active learning strategy that builds a random forest model from connections the user has previously labeled and gives the user an estimated probability for each remaining unlabeled connection, helping with the traffic annotation task.
Automatically building datasets of labeled IP traffic traces: A self-training approach
A self-training system is presented that builds a dataset of labeled network traffic from raw tcpdump traces with no prior knowledge of the data; intrusion detection systems trained on such a dataset are shown to perform as well as the same systems trained on correctly hand-labeled data.
Towards Provable Network Traffic Measurement and Analysis via Semi-Labeled Trace Datasets
This paper outlines the basic ideas of the methodology, from unit trace collection and semi-labeled dataset creation to its use in research evaluation, and argues that these challenges can be addressed with semi-labeled datasets composed of real-world network traffic and annotated units containing only interest-related packet traces.
A Session Based Approach for Aggregating Network Traffic Data -- The SANTA Dataset
This paper compares and contrasts the most widely used network security datasets, evaluating their efficacy in providing a benchmark for intrusion and anomaly detection systems and proposes the Session Aggregation for Network Traffic Analysis (SANTA) dataset.
An autonomous labeling approach to support vector machines algorithms for network traffic anomaly detection
Experiments show that the proposed approach for autonomous labeling of normal traffic not only outperforms existing SVM alternatives but also, under some attack distributions, obtains improvements over SNORT itself.
Toward developing a systematic approach to generate benchmark datasets for intrusion detection
The intent for this dataset is to assist various researchers in acquiring datasets of this kind for testing, evaluation, and comparison purposes, through sharing the generated datasets and profiles.
A methodology for conducting efficient sanitization of HTTP training datasets
This work proposes a sanitization approach for obtaining datasets from HTTP traces suited for training, testing, or validating anomaly-based attack detectors, and applies it to a trace that includes 45 million requests received by the library web server of the University of Seville.
A Study on Labeling Network Hostile Behavior with Intelligent Interactive Tools
An interactive intelligent system to support the task of identifying hostile behaviors in network logs is described, and the results indicate that the behavior recommendation significantly improves the quality of labels.
Towards Generating Real-life Datasets for Network Intrusion Detection
This paper establishes the importance of an intrusion dataset in the development and validation process of detection mechanisms, identifies a set of requirements for effective dataset generation, and discusses several attack scenarios and their incorporation in generating datasets.
Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization
A reliable, publicly available dataset is produced that contains benign traffic and seven common attack network flows and meets real-world criteria; the paper also evaluates the performance of a comprehensive set of network traffic features and machine learning algorithms to indicate the best set of features for detecting certain attack categories.