Learn More
Recent work has successfully constructed adversarial " evading " instances for dif-ferentiable prediction models. However generating adversarial instances for tree ensembles, a piecewise constant class of models, has remained an open problem. In this paper, we construct both exact and approximate evasion algorithms for tree ensembles: for a given instance x(More)
In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language and tokenization independent metric which we call <i>content complexity</i>, providing a normalized answer to the informal question "how much(More)
Active learning is an area of machine learning examining strategies for allocation of finite resources, particularly human labeling efforts and to an extent feature extraction, in situations where available data exceeds available resources. In this open problem paper, we motivate the necessity of active learning in the security domain, identify problems(More)
In this position paper, we argue that to be of practical interest, a machine-learning based security system must engage with the human operators beyond feature engineering and instance labeling to address the challenge of drift in adversarial environments. We propose that designers of such systems broaden the classification goal into an <i>explanatory</i>(More)
We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive data sets, and augment it with a heuristic procedure to avoid(More)
The malware detection arms race involves constant change: mal-ware changes to evade detection and labels change as detection mechanisms react. Recognizing that malware changes over time, prior work has enforced temporally consistent samples by requiring that training binaries predate evaluation binaries. We present temporally consistent labels, requiring(More)
We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system's ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions(More)
We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization(More)
—In this effort, we consider the verification of properties in C (subset) programs. That is, we prove the validity of a pre/postcondition pair for a program, or demonstrate invalidity via an error trace. This is undecidable in general, and modern static analysis techniques struggle to reason about non-linear programs and programs with loops. To that end, we(More)
  • 1