Overview. Feature selection is a basic step in the construction of a vector space or bag-of-words model [BB99]. In particular, when the processing task is to partition a given document collection into clusters of similar documents, a choice of good features, along with good clustering algorithms, is of paramount importance. This chapter suggests two techniques…
Distributed Intrusion Detection Systems (DIDS) offer an alternative to centralized intrusion detection. Current research indicates that a distributed intrusion detection paradigm may afford greater coverage, consequently providing an increase in security. In some cases, DIDS offer an alternative to centralized analysis, consequently improving scalability.…
We describe an implementation of, and experiments with, a low-distortion randomized projection algorithm [LINI94] that can reduce the number of dimensions in the data by a considerable amount. The performance of the randomized algorithm is compared with that of a popular technique, Principal Component Analysis (PCA). The experiments show that the randomized…
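As an illustration of the general idea of randomized projection, the following is a minimal sketch, not the algorithm from the abstract above. It assumes a dense Gaussian projection matrix, which is one common construction; the abstract does not say which construction the authors use. The names `random_projection_matrix`, `project`, and `euclidean`, and the dimensions `d` and `k`, are hypothetical choices for this example.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Return a d x k matrix with i.i.d. Gaussian entries of variance 1/k.

    Assumption: a dense Gaussian matrix; other constructions exist.
    """
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(x, R):
    """Map a length-d vector x to k dimensions by computing x times R."""
    k = len(R[0])
    return [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(k)]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


rng = random.Random(0)           # fixed seed so the run is repeatable
d, k = 1000, 200                 # original and reduced dimensionality
R = random_projection_matrix(d, k, rng)

u = [rng.random() for _ in range(d)]
v = [rng.random() for _ in range(d)]

# Ratio of the projected distance to the original distance.
ratio = euclidean(project(u, R), project(v, R)) / euclidean(u, v)
print(f"distance ratio after projection: {ratio:.3f}")
```

With the entries scaled by 1/sqrt(k), the squared distance is preserved in expectation, so the printed ratio should typically be close to 1. How close depends on `k`.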
We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, malicious PDFs, polymorphic malware, ransomware, and exploit kits. We will conclude with our view of important research questions in the field.…
Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams have previously been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work…
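To make the notion of byte n-gram features concrete, here is a minimal sketch that counts overlapping n-grams in a byte string. It is an illustration only and is not taken from the work described above; the function name `byte_ngrams` and the sample input are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    # If data is shorter than n, the range is empty and the Counter is empty.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


sample = b"abcabc"               # stand-in for the raw bytes of a file
counts = byte_ngrams(sample, 2)
print(counts.most_common())
```

In practice the counts for each file would be turned into a feature vector over some chosen n-gram vocabulary; how that vocabulary is selected is outside the scope of this sketch.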
We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, polymorphic malware, malicious PDFs, and exploit kits. We will conclude with our view of important research questions in the field.
In the past few years, the explosive growth of the Internet has allowed the construction of "virtual" systems containing hundreds or thousands of individual, relatively inexpensive computers. The agent paradigm is well-suited for this environment because it is based on distributed autonomous computation. Although the definition of a software agent v…
Identifying parallel corpora can be an important step in a variety of tasks related to information retrieval. However, at present, identifying parallel corpora requires human experts to examine the texts and evaluate their respective contents. We assume in this research that texts which are translations of each other have similarities in their semantic…