Projecting "Better Than Randomly": How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections

  • Michael Thomas Wojnowicz, Di Zhang, Glenn Chisholm, Xuan Zhao, Matt Wolff
  • Published 1 October 2016
  • Computer Science, Mathematics
  • 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction, owing to the computational complexity of principal component analysis. However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP in dimensionality reduction for supervised learning. In Experiment 1…
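The comparison in the abstract can be sketched in a few lines of NumPy; the toy data, target dimension, and oversampling amount below are illustrative assumptions, with the RPCA step following the standard randomized range-finder recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))  # toy "large" dataset
k = 20                            # target dimensionality

# Random projection: multiply by a Gaussian matrix, scaled so that
# pairwise distances are approximately preserved (Johnson-Lindenstrauss).
R = rng.normal(size=(500, k)) / np.sqrt(k)
X_rp = X @ R

# Randomized PCA: sketch the range of the centered data, then do
# exact PCA inside that small sketch.
Xc = X - X.mean(axis=0)
Omega = rng.normal(size=(500, k + 10))   # oversampled test matrix
Q, _ = np.linalg.qr(Xc @ Omega)          # orthonormal basis for the range
B = Q.T @ Xc                             # small (k+10) x 500 matrix
_, _, Vt = np.linalg.svd(B, full_matrices=False)
X_rpca = Xc @ Vt[:k].T                   # scores on approximate top-k PCs

print(X_rp.shape, X_rpca.shape)          # both reduced to 1000 x 20
```

Both reductions cost far less than a full SVD of X; the difference the paper studies is how useful the resulting 20 coordinates are for downstream supervised learning.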
“Influence sketching”: Finding influential samples in large-scale regressions
A new scalable version of Cook's distance, a classical statistical technique for identifying samples that unusually strongly impact the fit of a regression model (and its downstream predictions), is developed, and a new algorithm called “influence sketching” is introduced, which can reliably discover influential samples.
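The classical Cook's distance that influence sketching scales up can be illustrated directly; the toy regression and planted outlier below are illustrative assumptions, and the dense hat-matrix computation is exactly what a sketched version avoids at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[0] += 10.0  # plant one influential outlier

# Classical Cook's distance: D_i = r_i^2 h_ii / (p * s^2 * (1 - h_ii)^2)
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (dense; infeasible at scale)
h = np.diag(H)                         # leverages
r = y - H @ y                          # residuals
s2 = (r @ r) / (n - p)                 # residual variance estimate
cooks = r**2 * h / (p * s2 * (1 - h)**2)

print(int(np.argmax(cooks)))           # the planted outlier dominates
```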
An Introduction to Johnson-Lindenstrauss Transforms
Johnson–Lindenstrauss Transforms are powerful tools for reducing the dimensionality of data while preserving key characteristics of that data, and they have found use in many fields, from machine…
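A common form of the Johnson–Lindenstrauss bound gives the target dimension needed to preserve pairwise distances within a factor of 1 ± eps; the helper name jl_min_dim is ours, but the formula matches the bound used (up to rounding) by, e.g., scikit-learn's johnson_lindenstrauss_min_dim:

```python
import numpy as np

def jl_min_dim(n_samples, eps):
    # k >= 4 ln(n) / (eps^2/2 - eps^3/3) guarantees all pairwise distances
    # among n points are preserved within (1 ± eps) with high probability,
    # independently of the original dimensionality.
    denom = eps**2 / 2 - eps**3 / 3
    return int(np.ceil(4 * np.log(n_samples) / denom))

print(jl_min_dim(1_000_000, 0.1))
```

Note that the required dimension depends only logarithmically on the number of points, which is what makes random projections viable for very large datasets.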
Wavelet decomposition of software entropy reveals symptoms of malicious code
A method for automatically quantifying the extent to which patterned variations in a file's entropy signal make it "suspicious" is developed, which can be useful for machine learning models for detecting malware based on extracting millions of features from executable files.
Spotlight: Malware Lead Generation at Scale
Spotlight, a large-scale malware lead-generation framework, is presented and it is shown that it can produce top-priority clusters with over 99% purity (i.e., homogeneity), which is higher than simpler approaches and prior work.
Speeded Up Visual Tracker with Adaptive Template Updating Method
This paper uses dense SIFT features to describe an object's appearance and randomized principal component analysis (RPCA) to reduce the original feature space dimensionality, in a speeded-up visual tracker that is capable not only of long-term tracking but also of online tasks.
SUSPEND: Determining software suspiciousness by non-stationary time series modeling of entropy signals
SUSPEND (SUSPicious ENtropy signal Detector) is an expert system that evaluates the suspiciousness of an executable file's entropy signal in order to subserve malware classification; it boosts the predictive performance of traditional entropy analysis from 77.02% to 96.62%.
A Survey of Machine Learning Methods and Challenges for Windows Malware Classification
This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary background into the challenges in this uniquely complicated space. Expand


Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification
Two different dimensionality reduction methods, principal component analysis (PCA) and random projection (RP), are investigated for this purpose and compared w.r.t. the performance of the resulting nearest neighbor classifier on five image data sets and five microarray data sets.
Experiments with random projections for machine learning
It is found that the random projection approach predictively underperforms PCA, but its computational advantages may make it attractive for certain applications.
Very sparse random projections
This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space: it multiplies A by a random matrix R ∈ ℝ^{D×k}, reducing the D dimensions down to just k to speed up the computation.
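The very sparse construction can be sketched as follows; the sizes below are illustrative assumptions, with entries drawn from {+√s, 0, −√s} with probabilities {1/(2s), 1 − 1/s, 1/(2s)}, and s = √D being one of the sparsity choices the paper analyzes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 10_000, 100
s = int(np.sqrt(D))  # sparsity parameter; only ~1/s of entries are nonzero

# Draw entries +sqrt(s), 0, -sqrt(s) with probabilities 1/(2s), 1-1/s, 1/(2s).
vals = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                  size=(D, k),
                  p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
R = vals / np.sqrt(k)  # scale so squared lengths are preserved in expectation

x = rng.normal(size=(5, D))
y = x @ R  # reduced to 5 x 100; ~99% of the multiplies hit zeros
print(y.shape)
```

With s = √D, roughly a √D-fold speedup over a dense Gaussian projection is possible, since only about 1/s of the matrix entries are nonzero.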
Randomized Algorithms for Matrices and Data
This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.
Large-scale malware classification using random projections and neural networks
This work uses random projections to further reduce the dimensionality of the original input space and trains several very large-scale neural network systems with over 2.6 million labeled samples, thereby achieving classification results with a two-class error rate of 0.49% for a single neural network and 0.42% for an ensemble of neural networks.
An Algorithm for the Principal Component Analysis of Large Data Sets
This work adapts one of these randomized methods for principal component analysis (PCA) for use with data sets that are too large to be stored in random-access memory (RAM), and reports on the performance of the algorithm.
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation, and presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions.
Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach
Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data.
An algorithmic theory of learning: Robust concepts and random projection
This work provides a novel algorithmic analysis via a model of robust concept learning (closely related to “margin classifiers”), and shows that a relatively small number of examples are sufficient to learn rich concept classes.
Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes
This paper considers 8 different optimization formulations for computing a single sparse loading vector and shows that the AM method is nontrivially equivalent to GPower (Journee et al.; JMLR 11:517--553, 2010) for all the authors' formulations.