A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

Robert J. Joyce, Edward Raff, Charles K. Nicholas
Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security
In some problem spaces, the high cost of obtaining ground truth labels necessitates use of lower quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference… 

MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

The MOTIF dataset contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date. It also provides aliases of the different names used to describe the same malware family, allowing the accuracy of existing tools to be measured for the first time when names are obtained from differing sources.

Firenze: Model Evaluation Using Weak Signals

This paper introduces Firenze, a novel framework for comparative evaluation of ML models' performance using domain expertise encoded into scalable functions called markers, and shows that markers computed and combined over select subsets of samples, called regions of interest, can provide a strong estimate of the models' real-world performance.

Logical Assessment Formula and its Principles for Evaluations without Accurate Ground-Truth Labels

This work proposes a logical assessment formula (LAF) and reveals its principles for evaluations with inaccurate ground-truth labels (IAGTLs), showing that LAF can be applied to evaluations with IAGTLs from the logical perspective on an easier task, but cannot confidently act like the usual strategies for evaluations with accurate ground-truth labels (AGTLs).

The Cross-evaluation of Machine Learning-based Network Intrusion Detection Systems

The first framework, XeNIDS, for reliable cross-evaluations based on Network Flows is proposed, demonstrating the concealed potential, but also the risks, of cross-evaluations of ML-NIDS.



Are Labels Always Necessary for Classifier Accuracy Evaluation?

  • Weijian Deng, Liang Zheng
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
This work constructs a meta-dataset: a dataset comprised of datasets generated from the original images via various transformations such as rotation, background substitution, and foreground scaling, and reports a reasonable and promising prediction of the model accuracy.

Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels

This work adapts a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend into a fully unsupervised technique for aggregating the results of multiple anti-virus vendors' detectors.

On Challenges in Evaluating Malware Clustering

This work reports the results of an attempt to confirm the conjecture that the method of selecting ground-truth data in prior evaluations biases their results toward high accuracy, and investigates possible reasons why this may be the case.

VAMO: towards a fully automated malware clustering validity analysis

Through an extensive evaluation in a controlled setting and a real-world application, it is shown that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.

AVclass: A Tool for Massive Malware Labeling

This work describes AVclass, an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family name for each sample. It implements novel automatic techniques to address three key challenges: normalization, removal of generic tokens, and alias detection.

Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines

This work takes a data-driven approach to categorize, reason about, and validate common labeling methods used by researchers, and empirically shows that certain engines fail to perform in-depth analysis on submitted files and can easily produce false positives.

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

This work creates a unified reimplementation and evaluation platform of various widely-used SSL techniques and finds that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples.

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

Deep Ground Truth Analysis of Current Android Malware

This work uses existing anti-virus scan results and automation techniques in categorizing a large Android malware dataset into 135 varieties which belong to 71 malware families, and presents detailed documentation of the process used in creating the dataset, including the guidelines for the manual analysis.

Towards a Methodical Evaluation of Antivirus Scans and Labels - "If You're Not Confused, You're Not Paying Attention"

In recent years, researchers have relied heavily on labels provided by antivirus companies in establishing ground truth for applications and algorithms of malware detection, classification, and