A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

@inproceedings{Joyce2021AFF,
  title={A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels},
  author={Robert J. Joyce and Edward Raff and Charles K. Nicholas},
  booktitle={Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security},
  year={2021}
}
In some problem spaces, the high cost of obtaining ground truth labels necessitates the use of lower-quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference…
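The metrics the AGTR framework bounds are standard clustering evaluation measures. As a point of reference only (this is not the AGTR bound computation itself, just the conventional label-based evaluation it is meant to supplement), a minimal sketch of pairwise precision and recall of a predicted clustering against a possibly noisy reference labeling might look like:

```python
from itertools import combinations

def pairwise_metrics(pred, ref):
    """Pairwise precision/recall of a predicted clustering vs. reference labels.

    pred, ref: dicts mapping sample id -> cluster/family label.
    """
    ids = sorted(pred)
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_pred = pred[a] == pred[b]
        same_ref = ref[a] == ref[b]
        if same_pred and same_ref:
            tp += 1          # correctly co-clustered pair
        elif same_pred:
            fp += 1          # co-clustered, but reference disagrees
        elif same_ref:
            fn += 1          # reference pair split by the clustering
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

If the reference labels are themselves inaccurate, both numbers inherit that bias, which is precisely the problem the paper's AGTR bounds address.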


MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

TLDR
The MOTIF dataset contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, and provides aliases of the different names used to describe the same malware family, enabling, for the first time, evaluation of the accuracy of existing tools when names are obtained from differing sources.

Firenze: Model Evaluation Using Weak Signals

TLDR
This paper introduces Firenze, a novel framework for comparative evaluation of ML models' performance using domain expertise encoded into scalable functions called markers, and shows that markers computed and combined over select subsets of samples called regions of interest can provide a strong estimate of their real-world performance.

Logical Assessment Formula and its Principles for Evaluations without Accurate Ground-Truth Labels

TLDR
The proposed logical assessment formula (LAF) and its principles for evaluations with inaccurate approximate ground-truth labels (IAGTLs) are presented, showing that LAF can be applied to evaluations with IAGTLs from a logical perspective on an easier task, but cannot act as confidently as the usual strategies for evaluation with accurate AGTLs.

The Cross-evaluation of Machine Learning-based Network Intrusion Detection Systems

TLDR
The first framework, XeNIDS, for reliable cross-evaluations based on network flows is proposed, demonstrating the concealed potential, but also the risks, of cross-evaluations of ML-NIDS.

References

Showing 1–10 of 85 references

On Challenges in Evaluating Malware Clustering

TLDR
The results of an attempt to confirm the conjecture that the method of selecting ground-truth data in prior evaluations biases their results toward high accuracy are reported, and possible reasons why this may be the case are investigated.

VAMO: towards a fully automated malware clustering validity analysis

TLDR
Through an extensive evaluation in a controlled setting and a real-world application, it is shown that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.

AVclass: A Tool for Massive Malware Labeling

TLDR
AVclass is described, an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family name for each sample, and implements novel automatic techniques to address three key challenges: normalization, removal of generic tokens, and alias detection.
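As a rough illustration of the kind of pipeline this summary describes (a toy sketch, not AVclass's actual implementation; the generic-token list and alias table below are invented for the example), AV labels can be tokenized, filtered, alias-mapped, and resolved by plurality vote:

```python
from collections import Counter

# Hypothetical stop list and alias table, invented for this example.
GENERIC = {"trojan", "malware", "generic", "win32", "variant", "agent"}
ALIASES = {"zeus": "zbot"}

def family_from_labels(av_labels):
    """Toy plurality vote over normalized, filtered AV-label tokens."""
    votes = Counter()
    for label in av_labels:
        normalized = label.lower().replace("/", ".").replace(":", ".")
        for tok in normalized.split("."):
            tok = ALIASES.get(tok, tok)      # map known aliases to one name
            if tok and tok not in GENERIC:   # drop generic tokens
                votes[tok] += 1
    return votes.most_common(1)[0][0] if votes else None
```

The real tool learns its generic tokens and aliases automatically from data rather than using hand-written tables like these.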

Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines

TLDR
A data-driven approach to categorize, reason, and validate common labeling methods used by researchers, and empirically show certain engines fail to perform in-depth analysis on submitted files and can easily produce false positives.

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

TLDR
This work creates a unified reimplementation and evaluation platform for various widely-used SSL techniques and finds that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples.

Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints

TLDR
This work improves the true positive rate (TPR) at an actual realized FPR of 1e-5 from an expected 0.69 for previous methods to 0.80 on the best performing model class on the Sophos industry scale dataset.
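The TPR-at-fixed-FPR metric this summary refers to can be computed directly from raw detector scores. A minimal sketch (assuming higher scores indicate malware; the threshold is chosen from the negative-score distribution so that roughly the target fraction of benign samples exceeds it):

```python
import numpy as np

def tpr_at_fpr(scores_neg, scores_pos, target_fpr):
    """True positive rate at a threshold calibrated to a target FPR."""
    # Threshold such that at most ~target_fpr of negatives score above it.
    thresh = np.quantile(scores_neg, 1 - target_fpr)
    return float(np.mean(np.asarray(scores_pos) > thresh))
```

At extreme operating points like FPR = 1e-5, the quantile estimate needs on the order of hundreds of thousands of benign samples to be meaningful, which is why the summary distinguishes the *expected* FPR from the *actually realized* one.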

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

TLDR
The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

Deep Ground Truth Analysis of Current Android Malware

TLDR
This work uses existing anti-virus scan results and automation techniques in categorizing a large Android malware dataset into 135 varieties which belong to 71 malware families, and presents detailed documentation of the process used in creating the dataset, including the guidelines for the manual analysis.

Towards a Methodical Evaluation of Antivirus Scans and Labels - "If You're Not Confused, You're Not Paying Attention"

In recent years, researchers have relied heavily on labels provided by antivirus companies in establishing ground truth for applications and algorithms of malware detection, classification, and
...