We propose a probabilistic model for behavior-based malware detection that jointly models sequential data and class labels. Given labeled sequences (harmless/malicious), our goal is to reveal behavior patterns and exploit them to predict class labels of unknown sequences. The proposed model is a novel extension of supervised latent Dirichlet allocation with …
To develop a robust classification algorithm in the adversarial setting, it is important to understand the adversary’s strategy. We address the problem of label-flip attacks, in which an adversary contaminates the training set by flipping labels. By analyzing the objective of the adversary, we formulate an optimization framework for finding the label flips …
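As a rough illustration of what contaminating a training set by flipping labels means, the sketch below flips a fixed number of binary labels chosen uniformly at random. This is a naive baseline, not the optimization framework the abstract refers to, whose details are truncated here. The function name `flip_labels`, the `budget` parameter, and the +1/-1 label encoding are assumptions made for this example.

```python
import random


def flip_labels(labels, budget, seed=0):
    """Return a poisoned copy of binary labels (+1/-1) and the flipped indices.

    Assumption: the attacker flips `budget` labels picked uniformly at random.
    An optimized attack would choose the indices to maximize the damage instead.
    """
    if not 0 <= budget <= len(labels):
        raise ValueError("budget must be between 0 and len(labels)")
    rng = random.Random(seed)
    flipped_idx = rng.sample(range(len(labels)), budget)
    poisoned = list(labels)  # leave the clean labels untouched
    for i in flipped_idx:
        poisoned[i] = -poisoned[i]
    return poisoned, sorted(flipped_idx)


clean = [1, 1, -1, -1, 1, -1]
poisoned, idx = flip_labels(clean, budget=2)
```

Training the same classifier on `clean` and on `poisoned` and comparing test error is one simple way to gauge how sensitive a learner is to this kind of contamination.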
The explosive growth of malware poses a continuing threat to networks and operating systems. Signature-based methods are widely used for detecting malware. Unfortunately, they cannot identify malware variants on the fly. On the other hand, behavior-based methods can effectively characterize the behavior of malware. However, it is time-consuming to train and …
Collapsed Gibbs sampling is a frequently applied method to approximate intractable integrals in probabilistic generative models such as latent Dirichlet allocation. This sampling method, however, has the crucial drawback of high computational complexity, which limits its applicability to large data sets. We propose a novel dynamic sampling strategy to …
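For context, the textbook collapsed Gibbs sampler for latent Dirichlet allocation can be sketched as follows. This is the standard baseline, not the dynamic sampling strategy proposed in the abstract, which is truncated here. The function name, the hyperparameter defaults, and the toy corpus are assumptions made for this example. Words are encoded as integer ids.

```python
import random


def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iters=50, seed=0):
    """Standard collapsed Gibbs sampling for LDA over integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic

    # Random initialization of the topic assignment of every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized full conditional over topics for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = collapsed_gibbs_lda(
    docs, n_topics=2, vocab_size=4)
```

Each sweep evaluates the conditional for every topic at every token, so the cost per iteration grows with the number of tokens times the number of topics. That is the kind of cost that becomes prohibitive on large corpora.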
Machine learning algorithms are increasingly being applied in security-related tasks such as spam and malware detection, although their security properties against deliberate attacks have not yet been widely understood. Intelligent and adaptive attackers may indeed exploit specific vulnerabilities exposed by machine learning techniques to violate system …
The Internet has emerged as a powerful technology for collecting labeled data from a large number of users around the world at very low cost. Consequently, each instance is often associated with a handful of labels, precluding any assessment of an individual user’s quality. We present a probabilistic model for regression when there are multiple yet some …
Enterprises have steadily accumulated both structured and unstructured data as computing resources improve. However, previous research on enterprise data mining often treats these two kinds of data independently and overlooks their mutual benefits. We explore an approach for incorporating a common type of structured data (i.e., an organigram) into a generative topic model. …
A significant problem with Gaussian processes (GPs) is their unfavorable scaling with large amounts of data. To overcome this issue, we present a novel GP approximation scheme for online regression. Our model is based on a combination of multiple GPs with random hyperparameters. The model is trained by incrementally allocating new examples to a selected subset of …
This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a specialized machine translation system intended to translate subtitles automatically between German and Chinese. Since the common length-based alignment algorithm performs poorly on short spoken sentences, …
Sequence prediction is a key task in machine learning and data mining. It involves predicting the next symbol in a sequence given its previous symbols. Our motivating application is predicting the execution path of a process on an operating system in real time. In this case, each symbol in the sequence represents a system call accompanied by arguments and …
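The task of predicting the next symbol from the previous ones can be illustrated with a first-order (bigram) frequency model. This is a deliberately simple baseline and not the method of the paper, whose description is truncated here. The system-call traces and the function names below are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count how often each symbol follows each preceding symbol."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if `prev` is unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical system-call traces, one list per process.
traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
]
model = train_bigram(traces)
print(predict_next(model, "open"))  # "read" follows "open" twice, "write" once
```

A model that conditions on longer histories, or on call arguments as well as call names, would follow the same train-then-look-up pattern with a richer context key.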