In this position paper, we argue that to be of practical interest, a machine-learning based security system must engage with the human operators beyond feature engineering and instance labeling to address the challenge of drift in adversarial environments. We propose that designers of such systems broaden the classification goal into an <i>explanatory</i>…
We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization…
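To make the idea of a hidden ground truth concrete, here is a minimal sketch of Expectation Maximization over binary detector votes. It is not the model from the paper: it assumes each detector is described only by a true-positive and a false-positive rate, and the name `em_aggregate`, the vote layout, and the iteration count are all hypothetical choices made for illustration.

```python
import math

def em_aggregate(votes, iterations=50):
    """Estimate P(malicious) per binary from 0/1 detector votes.

    votes[i][j] is 1 if (hypothetical) detector j flagged binary i.
    Assumes a simple two-class model with one TPR and one FPR per detector.
    """
    n_items = len(votes)
    n_det = len(votes[0])
    eps = 1e-6

    def clamp(p):
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(p, eps), 1 - eps)

    # Initialise posteriors with the fraction of detectors flagging each binary.
    post = [sum(row) / n_det for row in votes]
    tpr, fpr = [], []
    for _ in range(iterations):
        # M-step: re-estimate the class prior and per-detector rates.
        total_pos = sum(post)
        total_neg = n_items - total_pos
        prior = clamp(total_pos / n_items)
        tpr, fpr = [], []
        for j in range(n_det):
            tp = sum(p * row[j] for p, row in zip(post, votes))
            fp = sum((1 - p) * row[j] for p, row in zip(post, votes))
            tpr.append(clamp(tp / max(total_pos, eps)))
            fpr.append(clamp(fp / max(total_neg, eps)))
        # E-step: recompute the posterior of the hidden label for each binary.
        new_post = []
        for row in votes:
            log_pos = math.log(prior)
            log_neg = math.log(1 - prior)
            for j, v in enumerate(row):
                log_pos += math.log(tpr[j] if v else 1 - tpr[j])
                log_neg += math.log(fpr[j] if v else 1 - fpr[j])
            m = max(log_pos, log_neg)
            pp, pn = math.exp(log_pos - m), math.exp(log_neg - m)
            new_post.append(pp / (pp + pn))
        post = new_post
    return post, tpr, fpr
```

Calling `em_aggregate([[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1]])` returns one posterior per binary along with the estimated per-detector rates. Initialising from the vote fraction is one simple way to keep the "malicious" class from swapping roles with the "benign" class during training.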
Active learning is an area of machine learning examining strategies for allocation of finite resources, particularly human labeling efforts and to an extent feature extraction, in situations where available data exceeds available resources. In this open problem paper, we motivate the necessity of active learning in the security domain, identify problems…
In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language and tokenization independent metric which we call <i>content complexity</i>, providing a normalized answer to the informal question "how much…
Recent work has successfully constructed adversarial "evading" instances for differentiable prediction models. However, generating adversarial instances for tree ensembles, a piecewise constant class of models, has remained an open problem. In this paper, we construct both exact and approximate evasion algorithms for tree ensembles: for a given instance x…
We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive data sets, and augment it with a heuristic procedure to avoid…
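A convex polytope is an intersection of half-spaces, so a separator of this kind can be evaluated by taking the maximum over a set of linear scores. The sketch below shows only that decision rule, not the training procedure from the paper. The `(weights, bias)` face representation, the function names, and the `"enclosed"`/`"outside"` labels are assumptions made for illustration.

```python
def polytope_score(faces, x):
    """Largest signed linear score of x over all faces.

    faces is a list of (weights, bias) pairs, one hyperplane per face.
    x lies inside the polytope exactly when every score is <= 0,
    which is the same as the maximum score being <= 0.
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
        for weights, bias in faces
    )

def predict(faces, x):
    """Label x by whether it falls inside the polytope."""
    return "enclosed" if polytope_score(faces, x) <= 0 else "outside"

# Hypothetical example: the square |x| <= 1, |y| <= 1 written as four faces.
SQUARE = [
    ((1.0, 0.0), -1.0),   # x <= 1
    ((-1.0, 0.0), -1.0),  # x >= -1
    ((0.0, 1.0), -1.0),   # y <= 1
    ((0.0, -1.0), -1.0),  # y >= -1
]
```

With these faces, `predict(SQUARE, (0.0, 0.0))` falls inside the square and `predict(SQUARE, (2.0, 0.0))` falls outside it. The maximum of linear functions is convex, which is why the region where it is non-positive is a convex set.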
Miscreants register thousands of new domains every day to launch Internet-scale attacks, such as spam, phishing, and drive-by downloads. Quickly and accurately determining a domain's reputation (association with malicious activity) provides a powerful tool for mitigating threats and protecting users. Yet, existing domain reputation systems work by observing…
We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system's ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions…
• Randomly generate an input that follows the general expected input data template. For example, the JPEG format defines a number of data segments, each of which starts with the byte 0xFF followed by the segment type identifier byte. The generator would independently generate all the fields and construct a globally valid JPEG file by taking into account the…
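As a rough illustration of template-based generation, the sketch below emits a byte stream framed like a JPEG: a start-of-image marker, a few randomly filled application segments, and an end-of-image marker. It is only an assumption-laden sketch. The function names are hypothetical, only APPn segments (markers 0xE0 to 0xEF) are generated, and the output has no frame or scan data, so it is correctly framed but not a decodable image.

```python
import random
import struct

SOI = b"\xFF\xD8"  # start-of-image marker
EOI = b"\xFF\xD9"  # end-of-image marker
APP_MARKERS = list(range(0xE0, 0xF0))  # APP0..APP15 segment type bytes

def random_segment(marker, rng, max_payload=32):
    """Build one segment: 0xFF, type byte, 2-byte length, random payload."""
    size = rng.randrange(max_payload + 1)
    payload = bytes(rng.randrange(256) for _ in range(size))
    # The big-endian length field counts its own two bytes plus the payload.
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload

def generate_jpeg_like(seed=None, n_segments=4):
    """Return SOI, n_segments random APPn segments, then EOI."""
    rng = random.Random(seed)
    out = bytearray(SOI)
    for _ in range(n_segments):
        out += random_segment(rng.choice(APP_MARKERS), rng)
    out += EOI
    return bytes(out)
```

Each field is drawn independently, and the length field is then computed from the payload so that the segments stay consistent with one another. A generator aiming at fully valid files would apply the same pattern to the remaining segment types.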