George Ostrouchov

Commonly used dependence measures, such as linear correlation, the cross-correlogram, or Kendall's tau, cannot capture the complete dependence structure in data unless that structure is restricted to linear, periodic, or monotonic forms. Mutual information (MI) has been frequently utilized for capturing the complete dependence structure, including nonlinear …
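As a minimal, self-contained sketch of the contrast this abstract draws (a generic plug-in MI estimator, not necessarily the estimator used in the paper): for a nonlinear, non-monotonic relationship, Pearson correlation is near zero while a simple histogram-based MI estimate is clearly positive.

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Histogram-based plug-in estimate of I(X;Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x**2 + 0.05 * rng.normal(size=x.size)     # nonlinear, non-monotonic

print(np.corrcoef(x, y)[0, 1])    # ~0: linear correlation misses it
print(mutual_information(x, y))   # clearly > 0: MI detects the dependence
```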
We describe a new method for computing a global principal component analysis (PCA) for the purpose of dimension reduction in data distributed across several locations. We assume that a virtual n × p (items × features) data matrix is distributed by blocks of rows (items), where n > p and the distribution among s locations is determined by a given …
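One common way to realize a global PCA from row-distributed blocks, shown below as a generic sketch rather than the authors' exact algorithm: because n > p, each location need only send its p × p cross-product matrix, column sums, and row count, from which the global covariance and its leading eigenvectors follow.

```python
import numpy as np

def local_summaries(X_block):
    # Each site sends only O(p^2) numbers instead of its O(n_i * p) rows.
    return X_block.T @ X_block, X_block.sum(axis=0), X_block.shape[0]

def global_pca(summaries, k):
    S = sum(s[0] for s in summaries)          # sum of block cross-products
    m = sum(s[1] for s in summaries)          # global column sums
    n = sum(s[2] for s in summaries)          # total row count
    cov = (S - np.outer(m, m) / n) / (n - 1)  # global sample covariance
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]                     # top-k principal directions

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(400, 5)) for _ in range(3)]   # 3 "locations"
components = global_pca([local_summaries(b) for b in blocks], k=2)
print(components.shape)   # (5, 2)
```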
This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of data points …
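To illustrate the cost argument: instead of shipping all n points (O(nd) numbers), each site can cluster locally and transmit compact per-cluster descriptors. The descriptor contents below (count, centroid, radius) are illustrative assumptions for this sketch, not RACHET's actual descriptor encoding.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def local_descriptors(X, n_clusters):
    """Summarize one site's data as per-cluster descriptors, not raw points."""
    labels = fcluster(linkage(X, method="ward"), n_clusters, criterion="maxclust")
    out = []
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        radius = np.linalg.norm(pts - centroid, axis=1).max()
        out.append((len(pts), centroid, radius))   # O(k*d) per site, not O(n*d)
    return out

rng = np.random.default_rng(2)
site_data = rng.normal(size=(1_000, 8))
descriptors = local_descriptors(site_data, n_clusters=10)
# Only these 10 small tuples cross the network; a central agglomeration step
# would then merge descriptor sets arriving from all sites.
```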
We address the difficulty involved in obtaining meaningful measurements of I/O performance in HPC applications, as well as the further challenge of understanding the causes of I/O bottlenecks in these applications. The need for I/O optimization is critical given the difficulty in scaling I/O to ever increasing numbers of processing cores. To address this …
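One concrete pitfall behind "meaningful measurements": buffered writes return before data reaches storage, so naive timing reports page-cache speed rather than I/O speed. The toy probe below forces an fsync before stopping the clock; it is a generic illustration, not a tool from the paper.

```python
import os
import time

def timed_write(path, payload, sync=True):
    """Wall-clock one write; fsync so the OS page cache doesn't hide the cost."""
    t0 = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, payload)
        if sync:
            os.fsync(fd)          # force data out to the storage system
    finally:
        os.close(fd)
    return time.perf_counter() - t0

data = os.urandom(16 * 1024 * 1024)                 # 16 MiB test payload
elapsed = timed_write("/tmp/io_probe.bin", data)
print(f"{len(data) / elapsed / 2**20:.1f} MiB/s")   # compare sync=True vs False
```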
FastMap is a dimension reduction technique that operates on distances between objects. Although only distances are used, implicitly the technique assumes that the objects are points in a p-dimensional Euclidean space. It selects a sequence of k ≤ p orthogonal axes defined by distant pairs of points (called pivots) and computes the projection of the …
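The sketch below follows the standard FastMap construction: pick a far-apart pivot pair by a farthest-point heuristic, project every object onto the pivot line by the cosine law, then recurse on residual distances in the orthogonal hyperplane.

```python
import numpy as np

def fastmap(D, k, rng=np.random.default_rng(0)):
    """Embed n objects into k dimensions using only the distance matrix D."""
    n = D.shape[0]
    X = np.zeros((n, k))
    D2 = D.astype(float) ** 2                 # work with squared distances
    for dim in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest point twice.
        a = int(rng.integers(n))
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        if D2[a, b] == 0:
            break                             # all remaining distances are zero
        dab = np.sqrt(D2[a, b])
        # Cosine-law projection onto the line through pivots a and b.
        X[:, dim] = (D2[a] + D2[a, b] - D2[b]) / (2 * dab)
        # Residual squared distances in the orthogonal hyperplane.
        diff = X[:, dim][:, None] - X[:, dim][None, :]
        D2 = np.maximum(D2 - diff**2, 0)
    return X

pts = np.random.default_rng(1).normal(size=(100, 10))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(fastmap(D, k=3).shape)   # (100, 3)
```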
System- and application-level failures could be characterized by analyzing relevant log files. The resulting data might then be used in numerous studies of, and future developments for, mission-critical, large-scale computational architectures, including fields such as failure prediction, reliability modeling, performance modeling, and power awareness. …
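As a toy first step of such log characterization, one might tally failure events per node. The syslog-style line format and severity fields below are assumptions for illustration; real HPC logs differ widely by system.

```python
import re
from collections import Counter

# Hypothetical syslog-style format: timestamp, node, severity, message.
LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<node>\S+) (?P<sev>FATAL|ERROR|WARN) (?P<msg>.*)$")

def failure_counts(lines):
    """Tally FATAL/ERROR events per node as a first characterization step."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m["sev"] in ("FATAL", "ERROR"):
            counts[m["node"]] += 1
    return counts

sample = [
    "2024-01-01 00:00:01 node042 ERROR link retrain",
    "2024-01-01 00:00:07 node042 FATAL kernel panic",
    "2024-01-01 00:00:09 node007 WARN ecc corrected",
]
print(failure_counts(sample))   # Counter({'node042': 2})
```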
It is well known that information retrieval, clustering and visualization can often be improved by reducing the dimensionality of high dimensional data. Classical techniques offer optimality but are much too slow for extremely large databases. The problem becomes harder yet when data are distributed across geographically dispersed machines. To address this …
Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the …
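A generic sketch of the sample-then-assign idea (plain uniform sampling here, not necessarily the paper's sampling scheme): run k-means on a small random sample to find centers cheaply, then make a single nearest-center pass over the full dataset.

```python
import numpy as np

def kmeans(X, k, iters=50, rng=np.random.default_rng(0)):
    """Plain Lloyd's algorithm on a (small) array X."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]   # keep empty clusters
                            for j in range(k)])
    return centers

def sampled_kmeans(X, k, sample_frac=0.05, rng=np.random.default_rng(0)):
    """Cluster a 5% sample, then assign every point to its nearest center."""
    idx = rng.choice(len(X), int(sample_frac * len(X)), replace=False)
    centers = kmeans(X[idx], k, rng=rng)      # expensive step runs on the sample only
    return np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)

X = np.random.default_rng(1).normal(size=(50_000, 4))
labels = sampled_kmeans(X, k=8)
print(np.bincount(labels))
```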
Systemic pathways-oriented approaches to analysis of metabolic networks are effective for small networks but are computationally infeasible for genome scale networks. Current computational approaches to this analysis are based on the mathematical principles of convex analysis. The enumeration of a complete set of “systemically independent” metabolic …
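The linear-algebra starting point of such analyses: steady-state fluxes v must satisfy S v = 0 for the stoichiometric matrix S, and the convex part then restricts irreversible fluxes to be nonnegative. Below is only a toy null-space computation on an assumed two-metabolite network; the full enumeration of systemically independent pathways is far more involved.

```python
import numpy as np
from scipy.linalg import null_space

# Toy stoichiometric matrix S (metabolites x reactions). Steady-state fluxes v
# satisfy S @ v = 0; pathway analysis searches this space under the extra
# convex constraint that irreversible fluxes be nonnegative.
S = np.array([
    [1, -1,  0,  0],    # metabolite A: produced by r1, consumed by r2
    [0,  1, -1, -1],    # metabolite B: produced by r2, consumed by r3 and r4
])

N = null_space(S)                 # basis of the steady-state flux space
print(N.shape)                    # (4, 2): two degrees of freedom
print(np.allclose(S @ N, 0))      # True: every column is a valid flux mode
```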